What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

PDF Oxide vs lopdf

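lopdf is a low-level Rust crate that manipulates PDF objects directly, with no built-in text extraction or rendering. PDF Oxide is a high-level, batteries-included library that covers extraction, generation, and editing in one place. The two target entirely different use cases.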

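Key differences

Abstraction level. lopdf exposes raw PDF objects: dictionaries, streams, and cross-reference tables. There is no text extraction, no font decoding, and no image export. PDF Oxide exposes task-oriented methods: extract_text(), extract_images(), to_markdown().

Reliability. lopdf fails to parse 20% of the 3,830-PDF test corpus. Of the PDFs it does parse, 57% produce empty output because lopdf has no text extraction: you get the objects but no text. PDF Oxide has a 100% pass rate.

Speed on parseable PDFs. lopdf's raw object parsing is faster: 0.3ms mean vs 0.8ms for PDF Oxide. But lopdf does not do the text extraction work; you have to build font decoding, CMap parsing, spacing analysis, and reading order yourself.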

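Quick comparison

Feature	PDF Oxide	lopdf
API level	High-level	Low-level
Text extraction	Built in (production grade)	None
Pass rate (3,830 PDFs)	100%	80.2%
Mean parse time	0.8ms	0.3ms
Image extraction	Built in	Manual (raw streams)
Form fields	Read/write	Manual (raw dictionaries)
PDF creation	Yes (from Markdown/HTML)	Yes (raw objects)
Markdown/HTML output	Yes	No
Encryption	Read/write	No
Rendering	Yes	No
PDF/A validation	Yes	No
License	MIT	MIT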

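What lopdf cannot do

lopdf gives you access to PDF objects, but text extraction requires interpreting those objects according to the PDF specification. These are the parts you would have to build yourself:

Content stream parsing: parse the PostScript-style operators (Tj, TJ, Tm, Tf, and so on)
Font resolution: look up /Font resources and resolve indirect references
CMap/ToUnicode decoding: convert glyph IDs to Unicode characters
Font-metric spacing: compute character widths from the font descriptor
Text matrix transforms: apply the Tm, Td, and T* operators to position text
Reading order: determine the correct order in multi-column layouts
Ligature reconstruction: handle the fi, fl, and ffi ligatures
CJK encodings: decode Chinese, Japanese, and Korean text encodings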
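To give a sense of scale for one of the smaller items in that list, ligature reconstruction can be sketched as below. This is a minimal illustration, not code from PDF Oxide or lopdf: expand_ligatures is a hypothetical helper, and it assumes the decoded text already contains the Unicode presentation forms U+FB00 to U+FB04, which is not guaranteed in real files.

```rust
/// Hypothetical helper: expands Latin ligature code points into plain letters.
/// Assumes earlier decoding produced Unicode presentation forms; fonts that map
/// ligature glyphs to other code points are not handled.
fn expand_ligatures(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match c {
            '\u{FB00}' => out.push_str("ff"),
            '\u{FB01}' => out.push_str("fi"),
            '\u{FB02}' => out.push_str("fl"),
            '\u{FB03}' => out.push_str("ffi"),
            '\u{FB04}' => out.push_str("ffl"),
            other => out.push(other),
        }
    }
    out
}
```

Without a step like this, a search for "file" would miss text that was typeset with the single-glyph fi ligature.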

Doing all of this takes thousands of lines of code and a deep understanding of ISO 32000. PDF Oxide handles all of it internally.
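As a rough illustration of the first step only (content stream parsing), the sketch below pulls out the literal strings shown by Tj from an uncompressed content stream. It uses only the standard library. The names shown_strings and read_literal are hypothetical, and the sketch deliberately ignores fonts, encodings, TJ arrays, hex strings, octal escapes, and positioning, which is where most of the real work lies.

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Hypothetical helper: returns the literal-string operand of every `Tj`.
fn shown_strings(content: &str) -> Vec<String> {
    let mut shown = Vec::new();
    let mut pending: Option<String> = None; // most recent string operand
    let mut chars = content.chars().peekable();

    while let Some(c) = chars.next() {
        if c.is_whitespace() {
            continue;
        }
        if c == '(' {
            pending = Some(read_literal(&mut chars));
            continue;
        }
        // Any other token: a number, a /Name, or an operator.
        let mut token = String::from(c);
        while let Some(&next) = chars.peek() {
            if next.is_whitespace() || next == '(' {
                break;
            }
            token.push(next);
            chars.next();
        }
        // take() clears the pending operand whatever the token was.
        let operand = pending.take();
        if token == "Tj" {
            if let Some(text) = operand {
                shown.push(text);
            }
        }
    }
    shown
}

/// Reads a PDF literal string after its opening '(' has been consumed,
/// honouring nested parentheses and the simple backslash escapes.
fn read_literal(chars: &mut Peekable<Chars<'_>>) -> String {
    let mut text = String::new();
    let mut depth = 1;
    while let Some(c) = chars.next() {
        match c {
            '\\' => match chars.next() {
                Some('n') => text.push('\n'),
                Some('r') => text.push('\r'),
                Some('t') => text.push('\t'),
                Some(other) => text.push(other),
                None => break,
            },
            '(' => {
                depth += 1;
                text.push(c);
            }
            ')' => {
                depth -= 1;
                if depth == 0 {
                    break;
                }
                text.push(c);
            }
            _ => text.push(c),
        }
    }
    text
}
```

For the stream BT /F1 12 Tf 72 720 Td (Hello World) Tj ET, shown_strings returns a single entry, Hello World. The bytes inside the parentheses are only readable here because the example uses a standard encoding; with subset fonts or CID fonts they are glyph codes that still need the decoding steps listed above.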

Side-by-side code comparison

Text extraction

PDF Oxide:

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("report.pdf")?;
let text = doc.extract_text(0)?;
println!("{}", text);

lopdf:

use lopdf::Document;

let doc = Document::load("report.pdf")?;

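// lopdf does not provide text extraction.
// You only get access to PDF objects:
let page_id = doc.page_iter().next().unwrap();
let page = doc.get_dictionary(page_id)?;
// Dictionary keys are byte strings; this assumes /Contents is a single
// indirect reference rather than an array of streams.
let contents = page.get(b"Contents")?;
let stream = doc.get_object(contents.as_reference()?)?;

// To get actual text, you must:
// 1. Parse the content stream operators
// 2. Resolve font references from /Resources
// 3. Decode CMap/ToUnicode mappings
// 4. Apply text matrix transforms
// 5. Handle encoding differences
// ... (hundreds to thousands of lines of code)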

PDF creation

PDF Oxide:

use pdf_oxide::api::Pdf;

let pdf = Pdf::from_markdown("# Report\n\n| Q1 | Q2 |\n|---|---|\n| $1M | $2M |")?;
pdf.save("report.pdf")?;

lopdf:

use lopdf::{Document, Object, Stream, dictionary};

let mut doc = Document::with_version("1.5");

// Reserve the page tree's ID up front so the page can point back to it
let pages_id = doc.new_object_id();

// Create the font dictionary
let font_id = doc.add_object(dictionary! {
    "Type" => "Font",
    "Subtype" => "Type1",
    "BaseFont" => "Helvetica",
});

// Create the resources
let resources_id = doc.add_object(dictionary! {
    "Font" => dictionary! { "F1" => font_id },
});

// Create the content stream (raw PDF text operators)
let content = Stream::new(
    dictionary! {},
    b"BT /F1 12 Tf 72 720 Td (Hello World) Tj ET".to_vec(),
);
let content_id = doc.add_object(content);

// Create the page; /Parent is required on every page object
let page_id = doc.add_object(dictionary! {
    "Type" => "Page",
    "Parent" => pages_id,
    "MediaBox" => vec![0.into(), 0.into(), 612.into(), 792.into()],
    "Contents" => content_id,
    "Resources" => resources_id,
});

// Assemble the page tree under the reserved ID
doc.objects.insert(pages_id, Object::Dictionary(dictionary! {
    "Type" => "Pages",
    "Kids" => vec![page_id.into()],
    "Count" => 1,
}));

// Create the catalog and register it as the document root in the trailer
let catalog_id = doc.add_object(dictionary! {
    "Type" => "Catalog",
    "Pages" => pages_id,
});
doc.trailer.set("Root", catalog_id);

doc.save("report.pdf")?;

Encrypted PDFs

PDF Oxide:

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open_with_password("encrypted.pdf", "secret")?;
let text = doc.extract_text(0)?;
println!("{}", text);

lopdf:

// lopdf does not support encrypted PDFs.
// Loading an encrypted PDF fails or yields undecrypted streams.

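Reliability comparison

Metric	PDF Oxide	lopdf
PDFs parsed successfully	3,830 / 3,830 (100%)	3,071 / 3,830 (80.2%)
PDFs with text output	3,830 / 3,830	~1,320 / 3,830 (estimate)
Encrypted PDF support	Yes	No
Malformed PDF recovery	Yes	No

lopdf's 80.2% pass rate means that roughly 1 in 5 PDFs fails. The failures occur on encrypted documents, PDFs with non-standard xref tables, and documents that use cross-reference streams. PDF Oxide handles all of these cases through lenient parsing and fallback strategies.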
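The percentages quoted for lopdf can be cross-checked against the raw counts with a few lines of arithmetic. The sketch below is only a sanity check on the article's own figures; it assumes the 3,830-file corpus size stated earlier, the 3,071 parsed files, and the estimate of about 1,320 files with text output. The function and constant names are made up for the example.

```rust
/// Share of `part` in `whole`, as a percentage.
fn percent(part: u32, whole: u32) -> f64 {
    100.0 * f64::from(part) / f64::from(whole)
}

// Figures quoted in this comparison for lopdf.
const CORPUS: u32 = 3_830;
const PARSED: u32 = 3_071;
const WITH_TEXT: u32 = 1_320; // the article's own estimate

/// Fraction of the corpus that lopdf parses at all.
fn parse_rate() -> f64 {
    percent(PARSED, CORPUS)
}

/// Fraction of the parsed files that still yield no text.
fn empty_share_of_parsed() -> f64 {
    percent(PARSED - WITH_TEXT, PARSED)
}

/// Fraction of the whole corpus that ends up with usable text.
fn text_share_of_corpus() -> f64 {
    percent(WITH_TEXT, CORPUS)
}
```

With these inputs the parse rate rounds to 80.2%, the empty share of parsed files to 57%, and the share of the whole corpus that yields text comes out at roughly a third.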

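When to use each library

Choose PDF Oxide when:

You need text extraction, image extraction, or any content-level operation
You need a single crate that handles reading, writing, and creation
You need to reliably handle every kind of PDF (encrypted, malformed, complex documents)
You need Markdown/HTML output, rendering, or OCR
You need compliance validation (PDF/A, PDF/X, PDF/UA)

Choose lopdf when:

You need direct access to PDF objects for custom processing
You are building specialized PDF tools that work at the object level
You need to merge documents by manipulating the object tree directly
Your PDFs are simple and well-formed (unencrypted, standard xref tables)

Using both together:

Use PDF Oxide for high-level operations and lopdf for the edge cases that need raw object access: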

[dependencies]
pdf_oxide = "0.3"
lopdf = "0.32"
