What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

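Comparing Rust PDF Libraries

This page compares PDF Oxide with the most commonly used Rust PDF crates: lopdf, printpdf, pdf-rs, and pdf_extract. Each targets a different level of abstraction and a different set of use cases.

Overview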

	PDF Oxide	lopdf	printpdf	pdf-rs	pdf_extract
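API level	High-level	Low-level	Mid-level (creation)	Low-level (reading)	Mid-level (reading)
Read PDFs	Yes	Yes	No	Yes	Yes
Write PDFs	Yes	Yes	Yes	No	No
Text extraction	Yes (high-level)	Manual	No	Manual	Yes (basic)
Image extraction	Yes (high-level)	Manual	No	Manual	No
Form fields	Read + write	Manual	No	Read-only	No
PDF creation	Yes	Yes	Yes	No	No
Markdown/HTML input	Yes	No	No	No	No
Edit existing PDFs	Yes	Yes (low-level)	No	No	No
Annotations	Read + write	Manual	No	Read-only	No
Encryption	Read + write	No	No	No	No
PDF/A validation	Yes	No	No	No	No
Rendering	Yes (tiny-skia)	No	No	Partial	No
Python bindings	Yes	No	No	No	No
License	MIT	MIT	MIT	MIT	Apache-2.0

All of these libraries are permissively licensed. The differences lie in feature scope and level of abstraction.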

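Performance Comparison

Full-Corpus Benchmark (3,830 PDFs)

Tested on the full corpus of 3,830 PDFs, drawn from three independent, publicly available test suites: PDF specification conformance (veraPDF, 2,907 files), real-world browser rendering edge cases (Mozilla pdf.js, 897 files), and security/robustness stress tests containing malformed structures and fuzzer-generated corruption (DARPA SafeDocs, 26 files). See the full corpus details.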

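Library	Mean	p99	Pass rate	Text extraction	Notes
PDF Oxide	0.8ms	9ms	100%	Built-in, production-grade	Unicode, CJK, reading order
oxidize_pdf	13.5ms	11ms	99.1%	Basic	48-second worst-case outlier
unpdf	2.8ms	10ms	95.1%	Basic	185 failures on the full corpus
pdf_extract	4.08ms	37ms	91.5%	Basic	Misses complex layouts
lopdf	0.3ms	2ms	80.2%	No built-in extraction	Fails on 20% of PDFs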

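lopdf is faster on the PDFs it can parse, but it fails on 20% of the corpus and provides no text extraction. You would have to build font decoding, CMap parsing, and spacing analysis yourself.

pdf_extract offers basic text extraction, but its pass rate is 91.5% and it struggles with complex layouts, CJK text, and tagged PDFs. oxidize_pdf is reasonably reliable (99.1%) but is 17× slower than pdf_oxide in mean extraction time, with a worst-case outlier of 48 seconds. That outlier also explains why its mean (13.5ms) is higher than its p99 (11ms): a single 48-second run adds roughly 12.5ms to a mean taken over 3,830 files. unpdf runs on the full corpus but fails on 185 PDFs.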
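A mean that sits above the p99 can look like a typo, so here is a small sketch of how it happens. The data is synthetic and is an assumption made purely for illustration (3,829 runs at exactly 1ms plus one 48-second run); it is not the benchmark's actual measurements, and the function names are hypothetical.

```rust
/// Arithmetic mean of latencies in milliseconds.
/// Assumes `samples` is non-empty.
fn mean_ms(samples: &[f64]) -> f64 {
    samples.iter().sum::<f64>() / samples.len() as f64
}

/// Nearest-rank percentile: the smallest sample such that at least
/// `pct` percent of all samples are less than or equal to it.
/// Assumes `samples` is non-empty and contains no NaN values.
fn percentile_ms(samples: &[f64], pct: f64) -> f64 {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).expect("latencies must not be NaN"));
    let rank = ((pct / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Synthetic data, NOT the benchmark's measurements:
/// 3,829 runs at 1 ms plus a single 48-second outlier.
fn synthetic_latencies() -> Vec<f64> {
    let mut samples = vec![1.0; 3_829];
    samples.push(48_000.0);
    samples
}
```

On this synthetic set, `mean_ms` lands a little above 13.5ms while `percentile_ms(&samples, 99.0)` stays at 1ms, because one extreme value shifts the mean but sits beyond the 99th percentile. When a library shows this pattern, the p99 says more about typical behaviour than the mean does.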

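Among the crates benchmarked here, PDF Oxide is the only one that combines a 100% pass rate with production-grade text extraction.

API Design Comparison

PDF Oxide: High-Level, Task-Oriented

PDF Oxide provides purpose-built methods for common tasks. You work with text, images, and form fields rather than PDF objects and dictionaries.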

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("report.pdf")?;

// Text extraction -- one call
let text = doc.extract_text(0)?;
println!("{}", text);

// Styled spans with font metadata
let spans = doc.extract_spans(0)?;
for span in &spans {
    println!("'{}' font={} size={:.1}pt", span.text, span.font_name, span.font_size);
}

// Image extraction
let images = doc.extract_images(0)?;
for img in &images {
    println!("{}x{} {:?}", img.width, img.height, img.format);
}

// Form fields
let fields = doc.extract_form_fields()?;
for field in &fields {
    println!("{}: {:?}", field.name, field.value);
}

Creating PDFs is just as straightforward:

use pdf_oxide::api::Pdf;

// From Markdown
let pdf = Pdf::from_markdown("# Report\n\n| A | B |\n|---|---|\n| 1 | 2 |")?;
pdf.save("report.pdf")?;

// From HTML
let pdf = Pdf::from_html("<h1>Report</h1><p>Content here.</p>")?;
pdf.save("report.pdf")?;

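lopdf: Low-Level Object Manipulation

lopdf gives you direct access to PDF objects, streams, and cross-reference tables. To use it effectively, you must understand the PDF specification. It has no built-in text extraction: you walk the dictionaries and decode the streams yourself.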

use lopdf::Document;

let doc = Document::load("report.pdf")?;

// Get page dictionary
let page_id = doc.page_iter().next().unwrap();
let page = doc.get_dictionary(page_id)?;

// Get content stream -- manual work
let contents = page.get(b"Contents")?; // lopdf dictionary keys are byte strings
let stream = doc.get_object(contents.as_reference()?)?;

// To extract text you must:
// 1. Parse the content stream operators
// 2. Resolve font references from /Resources
// 3. Decode CMap/ToUnicode mappings
// 4. Apply text matrix transformations
// 5. Handle encoding differences
//
// lopdf does not provide any of this -- it is raw object access
println!("Page has {} objects", doc.objects.len());

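lopdf is the right tool when you need to manipulate PDF structure directly: merging documents, rewriting object streams, or building a specialized PDF processor.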
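To give a sense of what step 1 in the list above ("parse the content stream operators") involves, here is a minimal sketch in plain Rust with no crate dependencies. It assumes you already have the decompressed content stream bytes, and it only collects literal strings shown with the `Tj` operator. It ignores `TJ` arrays, the `'` and `"` operators, hex strings, and octal escapes, and the function names are hypothetical rather than part of lopdf or PDF Oxide.

```rust
/// Reads a PDF literal string starting just after its opening '('.
/// Returns the unescaped bytes and the index just past the closing ')'.
/// Handles nested parentheses and simple backslash escapes only.
fn read_literal(content: &[u8], mut i: usize) -> (Vec<u8>, usize) {
    let mut buf = Vec::new();
    let mut depth = 1;
    while i < content.len() {
        match content[i] {
            b'\\' if i + 1 < content.len() => {
                // Map the common escapes; any other escaped byte is kept as-is.
                buf.push(match content[i + 1] {
                    b'n' => b'\n',
                    b'r' => b'\r',
                    b't' => b'\t',
                    other => other,
                });
                i += 2;
            }
            b'(' => {
                depth += 1;
                buf.push(b'(');
                i += 1;
            }
            b')' => {
                depth -= 1;
                i += 1;
                if depth == 0 {
                    break;
                }
                buf.push(b')');
            }
            other => {
                buf.push(other);
                i += 1;
            }
        }
    }
    (buf, i)
}

/// Collects the raw bytes of every literal string that is immediately
/// followed by the `Tj` (show text) operator.
fn tj_strings(content: &[u8]) -> Vec<Vec<u8>> {
    let mut shown = Vec::new();
    let mut last_string: Option<Vec<u8>> = None;
    let mut i = 0;
    while i < content.len() {
        match content[i] {
            b'(' => {
                let (bytes, next) = read_literal(content, i + 1);
                last_string = Some(bytes);
                i = next;
            }
            b if b.is_ascii_whitespace() => i += 1,
            _ => {
                // Read one token up to whitespace or the start of a string.
                let start = i;
                while i < content.len()
                    && !content[i].is_ascii_whitespace()
                    && content[i] != b'('
                {
                    i += 1;
                }
                if &content[start..i] == b"Tj" {
                    if let Some(bytes) = last_string.take() {
                        shown.push(bytes);
                    }
                } else {
                    // Any other token means the pending string was not a Tj operand.
                    last_string = None;
                }
            }
        }
    }
    shown
}
```

The bytes this returns are character codes, not Unicode text. Turning them into readable text still requires steps 2 through 5: resolving the font, applying its encoding or ToUnicode CMap, and tracking the text matrix.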

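printpdf: PDF Creation Only

printpdf is a creation-only library. It cannot read or parse existing PDFs. It provides a typed API for building PDF documents from scratch with text, images, and vector graphics.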

use printpdf::*;

let (doc, page1, layer1) = PdfDocument::new(
    "Report", Mm(210.0), Mm(297.0), "Layer 1"
);

let current_layer = doc.get_page(page1).get_layer(layer1);

// Add text -- requires manual font loading
let font = doc.add_builtin_font(BuiltinFont::Helvetica)?;
current_layer.use_text("Hello World", 24.0, Mm(10.0), Mm(280.0), &font);

// Save
doc.save(&mut std::io::BufWriter::new(
    std::fs::File::create("output.pdf")?,
))?;

// Cannot read existing PDFs
// Cannot extract text, images, or form fields

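printpdf is the right tool when you only need to generate new PDFs and want a clean, focused creation API.

pdf-rs: Low-Level PDF Reading

pdf-rs parses PDF structure into Rust types but offers very little high-level functionality. You get typed access to PDF objects, but you still handle text decoding, font parsing, and content stream parsing yourself.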

use pdf::file::FileOptions;

let file = FileOptions::cached().open("report.pdf")?;

// Access page objects
let page = file.get_page(0)?;
let media_box = page.media_box()?;
println!("Page size: {:?}", media_box);

// Content stream access -- low-level
if let Some(ref contents) = page.contents {
    // Returns raw operations -- you must interpret them
    // No built-in text assembly, font decoding, or layout analysis
}

// Cannot write or modify PDFs

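pdf-rs is the right tool when you need a type-safe PDF parser for analysis, validation, or building a custom renderer.

Feature Comparison by Task

Text Extraction

Library	Built-in	Quality	Effort required
PDF Oxide	Yes	Production-grade (Unicode, CJK, reading order)	One method call
pdf_extract	Yes	Basic (misses complex layouts)	One method call
lopdf	No	N/A	Hundreds of lines of custom code
printpdf	No	N/A	Not possible (write-only)
pdf-rs	No	N/A	Substantial custom code

PDF Oxide handles CMap/ToUnicode decoding, spacing based on font metrics, structure-tree reading order, and ligature reconstruction. Implementing the equivalent on top of lopdf or pdf-rs takes thousands of lines of code and deep knowledge of the PDF specification.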
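To illustrate why this is real work, the sketch below shows a heavily simplified version of three of those steps: mapping character codes through a ToUnicode-style table, expanding a ligature glyph into several characters, and inserting a space when the gap between glyphs is wide. The `Glyph` type, the `assemble_line` function, and the 0.25 × font size gap threshold are all assumptions chosen for illustration; this is not how PDF Oxide is implemented.

```rust
use std::collections::HashMap;

/// One positioned glyph, as a content-stream interpreter might produce it.
/// Hypothetical type for illustration only.
struct Glyph {
    code: u16,  // character code from the content stream
    x: f32,     // left edge, in text space units
    width: f32, // advance width taken from the font metrics
}

/// Assembles one line of text from glyphs already sorted left to right.
/// `to_unicode` plays the role of a parsed ToUnicode CMap: a ligature
/// glyph maps to several characters (for example "fi").
fn assemble_line(
    glyphs: &[Glyph],
    to_unicode: &HashMap<u16, &str>,
    font_size: f32,
) -> String {
    // Assumed heuristic: a gap wider than a quarter of the font size is a space.
    let gap_threshold = 0.25 * font_size;
    let mut line = String::new();
    let mut prev_end: Option<f32> = None;

    for glyph in glyphs {
        if let Some(end) = prev_end {
            if glyph.x - end > gap_threshold {
                line.push(' ');
            }
        }
        match to_unicode.get(&glyph.code) {
            Some(text) => line.push_str(text),
            // Unmapped codes become U+FFFD (replacement character).
            None => line.push('\u{FFFD}'),
        }
        prev_end = Some(glyph.x + glyph.width);
    }
    line
}
```

A production extractor also has to cope with rotated and vertical text, multiple fonts per line, missing ToUnicode tables, and reading order across columns, which is where most of the code ends up.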

PDF Creation

Library	Approach	Markdown/HTML input	Tables	Barcodes
PDF Oxide	High-level + low-level	Yes	Yes	Yes
lopdf	Raw object construction	No	No	No
printpdf	Typed layer API	No	No	No
pdf-rs	N/A (read-only)	N/A	N/A	N/A

Encryption

Library	Read encrypted	Write encrypted	Algorithms
PDF Oxide	Yes	Yes	RC4-40, RC4-128, AES-128, AES-256
lopdf	No	No	–
printpdf	No	No	–
pdf-rs	Partial	No	RC4 only

Compliance

Library	PDF/A	PDF/X	PDF/UA
PDF Oxide	Validation + conversion	Validation	Validation
lopdf	No	No	No
printpdf	Partial (PDF/A-1b output)	No	No
pdf-rs	No	No	No

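Dependency Footprint

Library	Dependencies	Compile time	Binary size
PDF Oxide	~40 (core)	~30s	~4 MB
lopdf	~15	~10s	~1 MB
printpdf	~20	~15s	~2 MB
pdf-rs	~25	~20s	~2 MB

PDF Oxide has more dependencies because it includes font parsing, image decoding, content stream interpretation, and encryption, features that the other libraries leave to the user or omit entirely. With all optional features enabled (rendering, barcodes, office), the count rises to about 100.

Using Multiple Libraries Together

Because they are all permissively licensed, you can combine them in the same project: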

[dependencies]
pdf_oxide = "0.3"
lopdf = "0.32"        # Optional: raw object access for edge cases

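Common patterns:

PDF Oxide + lopdf: use PDF Oxide for extraction and creation, and fall back to lopdf for edge cases that need raw object manipulation.
PDF Oxide + printpdf: use PDF Oxide for reading and printpdf for specialized creation workflows.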
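One way to structure the fallback pattern is to hide each library behind a small trait and try the backends in order. The sketch below is an assumption about how you might organize your own code: `TextBackend`, `extract_with_fallback`, and `StubBackend` are hypothetical names, and the stub stands in for real wrappers around pdf_oxide or lopdf so the example stays free of crate dependencies.

```rust
/// Hypothetical abstraction over "something that can extract page text".
/// In a real project, one implementation would wrap each PDF crate.
trait TextBackend {
    fn name(&self) -> &str;
    fn extract_text(&self, path: &str, page: usize) -> Result<String, String>;
}

/// Tries each backend in order and returns the name of the first one that
/// succeeds together with its text. If every backend fails, returns all
/// error messages so the caller can log them.
fn extract_with_fallback(
    backends: &[&dyn TextBackend],
    path: &str,
    page: usize,
) -> Result<(String, String), Vec<String>> {
    let mut errors = Vec::new();
    for backend in backends {
        match backend.extract_text(path, page) {
            Ok(text) => return Ok((backend.name().to_string(), text)),
            Err(e) => errors.push(format!("{}: {}", backend.name(), e)),
        }
    }
    Err(errors)
}

/// Stand-in backend with a fixed outcome, used only to demonstrate the
/// control flow without depending on any PDF crate.
struct StubBackend {
    name: &'static str,
    outcome: Result<String, String>,
}

impl TextBackend for StubBackend {
    fn name(&self) -> &str {
        self.name
    }

    fn extract_text(&self, _path: &str, _page: usize) -> Result<String, String> {
        self.outcome.clone()
    }
}
```

Returning the backend name alongside the text makes it easy to track how often the fallback path is actually taken, which tells you whether the second dependency is earning its place.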

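Use Case Matrix

"I need to extract text from PDFs"

Crate	Good fit?	Notes
PDF Oxide	Yes	Best extraction quality, 100% pass rate, reading order, font metadata
pdf_extract	Partial	Basic extraction, 91.5% pass rate
lopdf	No	No text extraction
printpdf	No	Cannot read PDFs
pdf-rs	Partial	Basic parsing, no high-level text extraction

"I need to create PDFs"

Crate	Good fit?	Notes
PDF Oxide	Yes	High-level (Markdown/HTML) and low-level APIs
lopdf	Partial	Low-level object construction
printpdf	Yes	Clean creation API, no reading
pdf-rs	No	Read-only

"I need to edit existing PDFs"

Crate	Good fit?	Notes
PDF Oxide	Yes	DOM-like editing, annotations, forms
lopdf	Partial	Low-level object manipulation
printpdf	No	Cannot read PDFs
pdf-rs	No	Read-only

"I need the full lifecycle (extract + create + edit)"

Crate	Good fit?	Notes
PDF Oxide	Yes	The only crate that covers all three
lopdf + printpdf	Partial	Two crates, no text extraction
pdf-rs + printpdf	Partial	Two crates, no editing

When to Use Each

Choose PDF Oxide if you need more than one PDF capability (extract + create, or extract + edit) and want a single, well-tested dependency with a 100% pass rate on the benchmark corpus.

Choose lopdf if you need low-level PDF structure manipulation and are comfortable working directly with the PDF specification. It suits merging, splitting, and batch PDF processing.

Choose printpdf if you only create PDFs and never need to read them. It offers the cleanest API for report and document generation.

Choose pdf-rs if you need a spec-compliant parser for PDF analysis or are building your own rendering pipeline.

Choose pdf_extract if you need basic text extraction and do not require high reliability or support for complex layouts.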