What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

PDF Oxide vs pypdf

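PDF Oxide is 15× faster than pypdf, has a higher pass rate, and ships with built-in rendering and Markdown/HTML conversion. If you need more than basic PDF manipulation, PDF Oxide does in one library what takes several packages with pypdf.

Why consider replacing pypdf with PDF Oxide

Speed. pypdf is implemented in pure Python. PDF Oxide uses a Rust core compiled via PyO3 that runs directly inside the Python process. Mean text extraction time is 0.8ms versus 12.1ms, a 15× difference.

Reliability. PDF Oxide has a 100% pass rate on the 3,830 test PDFs. pypdf's pass rate is 98.4%, with 61 failures on valid PDFs.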

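Features. pypdf is primarily a PDF manipulation library (merge, split, rotate, encrypt) that also offers text extraction. For rendering, Markdown output, or form creation, you need additional packages. PDF Oxide covers all of these in a single install.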

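Quick comparison

Feature	PDF Oxide	pypdf
Mean extraction time	0.8ms	12.1ms
Pass rate (3,830 PDFs)	100%	98.4%
License	MIT	BSD-3
Language	Rust + PyO3	Pure Python
Text extraction	Yes	Yes
Character positions	Yes	Partial
Image extraction	Yes	Yes
Markdown output	Yes	No
HTML output	Yes	No
PDF creation	Yes (Markdown/HTML/images)	Limited (merge only)
Form fields	Read + write	Read + write
Encryption	Read + write	Read + write
Rendering	Yes	No
OCR	Built-in	No
Search	Regex + spatial search	No
Install size	About 5 MB	About 1 MB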

Side-by-side code comparison

Text extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)

pypdf:

from pypdf import PdfReader

reader = PdfReader("report.pdf")
text = reader.pages[0].extract_text()
print(text)

Extracting all pages

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("book.pdf")
for i in range(doc.page_count()):
    text = doc.extract_text(i)
    print(f"--- Page {i + 1} ---")
    print(text)

pypdf:

from pypdf import PdfReader

reader = PdfReader("book.pdf")
for page in reader.pages:
    text = page.extract_text()
    print(text)
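
As a rough sketch, the page loop above can be folded into one helper that depends only on the `page_count()` and `extract_text(i)` calls shown in the PDF Oxide example. The names `PageTextSource`, `extract_all_pages`, and `FakeDocument` are hypothetical and exist only for illustration; the stand-in class lets the snippet run without a PDF file or either library installed.

```python
from typing import List, Protocol


class PageTextSource(Protocol):
    """Minimal interface assumed from the PDF Oxide example above."""

    def page_count(self) -> int: ...

    def extract_text(self, page_index: int) -> str: ...


def extract_all_pages(doc: PageTextSource) -> str:
    """Join the text of every page, with a separator line before each page."""
    parts: List[str] = []
    for i in range(doc.page_count()):
        parts.append(f"--- Page {i + 1} ---\n{doc.extract_text(i)}")
    return "\n".join(parts)


class FakeDocument:
    """Hypothetical stand-in so the sketch runs without a real PDF."""

    def __init__(self, pages: List[str]) -> None:
        self._pages = pages

    def page_count(self) -> int:
        return len(self._pages)

    def extract_text(self, page_index: int) -> str:
        return self._pages[page_index]


combined = extract_all_pages(FakeDocument(["first page", "second page"]))
print(combined)
```

In real use, a `PdfDocument` opened as in the example above would presumably be passed in place of `FakeDocument`, provided it exposes those two methods.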

Image extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

pypdf:

from pypdf import PdfReader

reader = PdfReader("report.pdf")
page = reader.pages[0]
for i, image in enumerate(page.images):
    with open(f"image_{i}.{image.name.split('.')[-1]}", "wb") as f:
        f.write(image.data)

Encrypted PDFs

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("encrypted.pdf", password="secret")
text = doc.extract_text(0)

pypdf:

from pypdf import PdfReader

reader = PdfReader("encrypted.pdf")
reader.decrypt("secret")
text = reader.pages[0].extract_text()

Markdown conversion

PDF Oxide (built-in):

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)

pypdf:

# pypdf has no Markdown conversion.
# You would need a separate tool chain.
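
Once a page is available as Markdown, a common next step in LLM and RAG pipelines is to split it into sections by heading. The following is a minimal sketch using only the standard library. It assumes the Markdown uses ATX-style `#` headings, which this page does not confirm for PDF Oxide's output, and the function name `split_by_heading` is hypothetical.

```python
import re
from typing import Dict, List, Optional

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown: str) -> List[Dict[str, Optional[str]]]:
    """Split a Markdown string into chunks, one per heading."""
    chunks: List[Dict[str, Optional[str]]] = []
    heading: Optional[str] = None
    body: List[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        # Skip the empty preamble that exists before the first heading.
        if heading is not None or text:
            chunks.append({"heading": heading, "text": text})

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            heading = match.group(2).strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks


sample = "# Title\n\nIntro text.\n\n## Methods\n\nDetails here.\n"
for chunk in split_by_heading(sample):
    print(chunk["heading"], "->", chunk["text"])
```

The input would be the string returned by a call such as `doc.to_markdown(0, detect_headings=True)` from the example above.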

Benchmark details

Metric	PDF Oxide	pypdf
Mean extraction time	0.8ms	12.1ms
p99 extraction time	9ms	97ms
Pass rate (valid PDFs)	100% (3,823/3,823)	98.4% (3,762/3,823)
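
The headline figures can be cross-checked against the raw numbers in the table above. This short calculation uses only values quoted on this page; the variable names are illustrative.

```python
# Figures copied from the benchmark table above.
mean_ms = {"pdf_oxide": 0.8, "pypdf": 12.1}
passed = {"pdf_oxide": 3823, "pypdf": 3762}
valid_pdfs = 3823

# Ratio of mean extraction times.
speedup = mean_ms["pypdf"] / mean_ms["pdf_oxide"]

# Failures and pass rate for pypdf on the valid PDFs.
pypdf_failures = valid_pdfs - passed["pypdf"]
pypdf_pass_rate = 100 * passed["pypdf"] / valid_pdfs

print(f"speedup: {speedup:.1f}x")                  # speedup: 15.1x
print(f"pypdf failures: {pypdf_failures}")         # pypdf failures: 61
print(f"pypdf pass rate: {pypdf_pass_rate:.1f}%")  # pypdf pass rate: 98.4%
```

The results line up with the 15× speedup, the 61 failures, and the 98.4% pass rate quoted elsewhere on this page.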

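Because pypdf is pure Python, every operation runs in the interpreter. PDF Oxide's Rust core handles parsing, font decoding, and text assembly natively, and only the final result crosses the Python boundary.

See the full benchmark methodology for corpus details.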
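
To reproduce this kind of measurement on your own files, a timing harness along the following lines may be a reasonable starting point. It is a sketch under stated assumptions, not the methodology behind the published numbers: the `benchmark` function is hypothetical, it times one call per file with `time.perf_counter`, counts any exception as a failure, and uses a nearest-rank p99.

```python
import math
import statistics
import time
from typing import Callable, Dict, Iterable, Optional


def benchmark(
    extract: Callable[[str], object], paths: Iterable[str]
) -> Dict[str, Optional[float]]:
    """Time one extraction call per file; report mean, p99 and pass rate."""
    timings_ms = []
    attempted = 0
    for path in paths:
        attempted += 1
        start = time.perf_counter()
        try:
            extract(path)
        except Exception:
            # Any exception counts as a failed file and is left out of the timings.
            continue
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    if not timings_ms:
        return {
            "mean_ms": None,
            "p99_ms": None,
            "pass_rate": 0.0 if attempted else None,
        }

    ordered = sorted(timings_ms)
    p99_index = math.ceil(0.99 * len(ordered)) - 1  # nearest-rank percentile
    return {
        "mean_ms": statistics.fmean(ordered),
        "p99_ms": ordered[p99_index],
        "pass_rate": 100.0 * len(ordered) / attempted,
    }


# Hypothetical usage with the calls shown earlier on this page:
#   from pdf_oxide import PdfDocument
#   benchmark(lambda p: PdfDocument(p).extract_text(0), paths)
#   from pypdf import PdfReader
#   benchmark(lambda p: PdfReader(p).pages[0].extract_text(), paths)
```

Passing the extractor as a callable keeps the harness independent of either library, so both can be measured with identical timing code.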

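Feature gaps

pypdf excels at PDF manipulation: merging, splitting, rotating, and encrypting. But it lacks:

Feature	PDF Oxide	pypdf
Markdown conversion	`doc.to_markdown(0)`	Not available
HTML conversion	`doc.to_html(0)`	Not available
PDF creation from content	`Pdf.from_markdown()`, `Pdf.from_html()`	Not available
Rendering to images	Yes	Not available
OCR for scanned PDFs	Built-in PaddleOCR	Not available
Text search	`doc.search("query")`	Not available
Character-level bounding boxes	`doc.extract_chars(0)`	Partial
PDF/A validation	Yes	Not available

If your workflow is purely merge/split/rotate, pypdf's lightweight pure Python approach is a reasonable choice. But as soon as text extraction quality, creation, or conversion is involved, PDF Oxide is the more complete option.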

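When to stick with pypdf

You need a pure Python dependency with no compiled extensions
Your use case is strictly merge/split/rotate/encrypt, with no text extraction
You need pypdf-specific PDF manipulation methods for compatibility with legacy integrations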
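
Where a project has to run both in environments that allow compiled extensions and in ones that do not, one option is to choose the backend at runtime. This is only a sketch: `pick_pdf_backend` is a hypothetical helper, and it checks whether a module can be imported, not whether it works on the current platform.

```python
import importlib.util
from typing import Optional, Sequence


def pick_pdf_backend(
    preferred: Sequence[str] = ("pdf_oxide", "pypdf"),
) -> Optional[str]:
    """Return the first importable module name from `preferred`, else None."""
    for name in preferred:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


backend = pick_pdf_backend()
if backend is None:
    print("No PDF backend installed")
else:
    print(f"Using {backend}")
```

Listing `pdf_oxide` first prefers the compiled library when it is installed and falls back to pypdf otherwise.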
