What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.
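A common pattern is to try the embedded text layer first and run OCR only on pages that look scanned. The sketch below assumes that `extract_text_ocr()` takes a page index the same way `extract_text()` does; the answer above names the method but not its signature. `text_with_ocr_fallback` and `min_chars` are hypothetical names introduced here for illustration.

```python
def text_with_ocr_fallback(doc, page_index, min_chars=20):
    """Return the page's text layer, or OCR output if the layer is nearly empty.

    `doc` is any object exposing extract_text(page_index) and
    extract_text_ocr(page_index). The OCR call signature is an assumption;
    check the PDF Oxide documentation before relying on it.
    """
    text = doc.extract_text(page_index)
    # A scanned page usually carries little or no embedded text.
    if len(text.strip()) >= min_chars:
        return text
    return doc.extract_text_ocr(page_index)
```

With the real library, `doc` would be a `pdf_oxide.PdfDocument`. The `min_chars` threshold is a heuristic and may need tuning for documents with very short pages.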

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting, which suits LLM and RAG pipelines. No separate package is needed, unlike PyMuPDF, which requires pymupdf4llm (69× slower).
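In a RAG pipeline, Markdown output is typically split into chunks at heading boundaries before embedding. The sketch below is a minimal, library-independent illustration: it takes a Markdown string however it was produced and is not part of PDF Oxide's API. `split_markdown_by_heading` is a hypothetical helper name.

```python
import re

# ATX heading: one to six '#' characters followed by whitespace and a title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def split_markdown_by_heading(markdown):
    """Split Markdown into (heading, body) chunks at ATX headings.

    Text before the first heading is returned with heading None.
    Headings inside fenced code blocks are not special-cased.
    """
    chunks = []
    heading, body = None, []

    def flush():
        # Emit a chunk only if it has a heading or some non-blank body text.
        if heading is not None or any(line.strip() for line in body):
            chunks.append((heading, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks


sample = "# Report\n\nQuarterly results are in.\n\n## Revenue\n\nDetails follow."
for title, text in split_markdown_by_heading(sample):
    print(title, "->", text)
```

Splitting on headings keeps each chunk topically coherent; long sections may still need a secondary split by length.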

PDF Oxide vs pypdfium2

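PDF Oxide and pypdfium2 are both fast, natively compiled Python PDF libraries. pypdfium2 wraps Google's PDFium engine; PDF Oxide is built on a Rust core. The key difference is scope: pypdfium2 is primarily a reader and renderer, while PDF Oxide covers the full PDF lifecycle.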

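Key Differences

Speed. Both are fast, but PDF Oxide is faster: 0.8ms mean versus 4.1ms (a 5.1× difference). Both are far faster than pure-Python libraries.

Features. pypdfium2 is read-only with rendering. PDF Oxide adds creation, editing, form writing, encryption, Markdown/HTML output, and OCR.

Reliability. PDF Oxide passes 100% of valid PDFs. pypdfium2 passes 99.2%, with 31 failures.

License. Both are permissively licensed. PDF Oxide is MIT; pypdfium2 is Apache-2.0. Neither raises AGPL concerns.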

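Quick Comparison

Feature	PDF Oxide	pypdfium2
Mean extraction time	0.8ms	4.1ms
Pass rate (3,830 PDFs)	100%	99.2%
License	MIT	Apache-2.0
Language	Rust + PyO3	C (PDFium)
Text extraction	Yes	Yes
Character positions	Yes	Yes
Image extraction	Yes	Yes
Markdown output	Yes	No
HTML output	Yes	No
PDF creation	Yes	No
PDF editing	Yes	No
Form fields	Read + write	Read only
Encryption	Read + write	Read only
Rendering	Yes	Yes
OCR	Built-in	No
Search	Regex + spatial	Yes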

Code Comparison

Text Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)

pypdfium2:

import pypdfium2 as pdfium

pdf = pdfium.PdfDocument("report.pdf")
page = pdf[0]
textpage = page.get_textpage()
text = textpage.get_text_range()
print(text)

Image Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

pypdfium2:

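import pypdfium2 as pdfium
import pypdfium2.raw as pdfium_c  # raw PDFium constants (pypdfium2 v4 and later)

pdf = pdfium.PdfDocument("report.pdf")
page = pdf[0]
for i, obj in enumerate(page.get_objects()):
    # Keep only image objects; text and path objects are skipped.
    if obj.type == pdfium_c.FPDF_PAGEOBJ_IMAGE:
        bitmap = obj.get_bitmap()
        bitmap.to_pil().save(f"image_{i}.png")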

PDF Creation

PDF Oxide:

from pdf_oxide import Pdf

pdf = Pdf.from_markdown("# Report\n\nQuarterly results are in.")
pdf.save("report.pdf")

pypdfium2:

# pypdfium2 cannot create PDFs.
# It is a read-only library with rendering capabilities.

Rendering

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
image = doc.render_page(0, dpi=150)
image.save("page.png")

pypdfium2:

import pypdfium2 as pdfium

pdf = pdfium.PdfDocument("report.pdf")
page = pdf[0]
bitmap = page.render(scale=150/72)
bitmap.to_pil().save("page.png")
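The `scale=150/72` argument in the pypdfium2 example follows from PDF page sizes being expressed in points, with 72 points per inch, so a scale of dpi/72 gives the requested resolution. The sketch below works through that arithmetic; the helper names are illustrative and not part of either library, and a renderer's exact pixel rounding may differ.

```python
POINTS_PER_INCH = 72  # PDF default user space unit is 1/72 inch

def dpi_to_scale(dpi):
    """Convert a target DPI to a render scale factor."""
    return dpi / POINTS_PER_INCH

def rendered_size(width_pt, height_pt, dpi):
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    # Multiply before dividing so exact cases stay exact in floating point.
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px

# US Letter is 612 x 792 points (8.5 x 11 inches).
print(rendered_size(612, 792, 150))  # (1275, 1650)
```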

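Benchmark Details

Metric	PDF Oxide	pypdfium2
Mean extraction time	0.8ms	4.1ms
p99 extraction time	9ms	42ms
Pass rate (valid PDFs)	100% (3,823/3,823)	99.2% (3,792/3,823)

Both libraries use native code (Rust and C, respectively), but PDF Oxide's text extraction pipeline is optimized specifically for this task: single-pass extraction with preallocated buffers and a cached page tree.

For corpus details, see the full benchmark methodology.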
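To run a similar comparison on your own files, per-document timings can be collected and summarized as mean, p99, and pass rate. The harness below is a minimal sketch, not the benchmark's actual methodology: `benchmark`, `summarize`, and the nearest-rank definition of p99 are assumptions made here, and `extract` stands for any callable that takes a path and returns text.

```python
import math
import time

def summarize(timings_ms, failures):
    """Mean, nearest-rank p99, and pass rate for one library's run."""
    total = len(timings_ms) + failures
    if not timings_ms:
        return {"mean_ms": None, "p99_ms": None,
                "pass_rate": 0.0 if total else None}
    ordered = sorted(timings_ms)
    # Nearest-rank percentile: smallest sample with >= 99% of samples at or below it.
    rank = max(1, math.ceil(0.99 * len(ordered)))
    return {
        "mean_ms": sum(ordered) / len(ordered),
        "p99_ms": ordered[rank - 1],
        "pass_rate": len(ordered) / total,
    }

def benchmark(extract, paths):
    """Time extract(path) for each path; any exception counts as a failure."""
    timings_ms, failures = [], 0
    for path in paths:
        start = time.perf_counter()
        try:
            extract(path)
        except Exception:
            failures += 1
            continue
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return summarize(timings_ms, failures)
```

For example, `benchmark(lambda p: PdfDocument(p).extract_text(0), paths)` would time first-page extraction with PDF Oxide. As a check on the table's arithmetic, 3,792 passes out of 3,823 valid PDFs is 99.19%, which rounds to the reported 99.2%.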

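Feature Completeness

The biggest difference between these libraries is scope. pypdfium2 is a reader with rendering; PDF Oxide covers the full PDF lifecycle:

Capability	PDF Oxide	pypdfium2
Read and extract	Yes	Yes
Render pages	Yes	Yes
Create PDFs	Yes (Markdown, HTML, images)	No
Edit existing PDFs	Yes (text, images, annotations)	No
Fill form fields	Yes	No
Write encryption	Yes (AES-256)	No
Markdown/HTML output	Yes	No
OCR for scanned pages	Yes (PaddleOCR via ONNX)	No
PDF/A validation	Yes	No

If you only need to read and render PDFs, pypdfium2 is a solid choice. If you need any write capability (creation, editing, form filling, or encryption), PDF Oxide covers it in a single library.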

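pypdfium2 License (Apache-2.0)

pypdfium2 is licensed under Apache-2.0, which permits commercial use. However, it wraps Google's PDFium (Chromium's PDF engine), which has its own BSD-style license. Both are permissive.

Key considerations:

Apache-2.0: permissive, allows commercial use, requires attribution
PDFium dependency: the binary bundles Chromium's PDFium engine (about 15 MB)
Google's release cycle: pypdfium2 depends on the Chromium project's PDFium releases
No Python API stability guarantee: the API closely follows PDFium's C API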

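PDF Oxide is MIT licensed, which has simpler terms than Apache-2.0 (no NOTICE file or change-marking requirements). The MIT copyright and permission notice must still be included with copies of the software.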

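When to Use Which

Choose PDF Oxide when:

You need features beyond reading and rendering (creation, editing, forms, encryption)
You want Markdown or HTML conversion
You want built-in OCR for scanned documents
You need the highest reliability (100% vs 99.2%)
Speed matters and a 5× difference is meaningful at scale

Choose pypdfium2 when:

You only need to read and render PDFs
You prefer PDFium-specific rendering output
You want a smaller dependency footprint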
