What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

PDF Oxide vs pypdfium2

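PDF Oxide and pypdfium2 are both natively compiled, high-performance Python PDF libraries. pypdfium2 wraps Google's PDFium engine; PDF Oxide is built on a Rust core. The main difference is feature scope: pypdfium2 is primarily a reader and renderer, while PDF Oxide covers the full PDF lifecycle, including creation, extraction, OCR, forms, encryption, and compliance.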

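Key differences

Speed. Both are fast. PDF Oxide is faster on text extraction: 0.8ms mean vs 4.1ms, a gap of about 5.1×. Both are much faster than pure-Python libraries.

Features. pypdfium2 is read-only plus rendering. PDF Oxide adds creation, editing, form writing, encryption, Markdown/HTML output, and OCR.

Reliability. PDF Oxide passes 100% of valid PDFs. pypdfium2 passes 99.2%, with 31 failures.

License. Both are permissively licensed. PDF Oxide uses MIT; pypdfium2 uses Apache-2.0. Neither raises AGPL concerns.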
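
The speed and reliability figures follow from simple arithmetic on the numbers published on this page (mean timings, and 3,792 of 3,823 valid PDFs passing for pypdfium2). The sketch below recomputes them; the helper names are illustrative and not part of either library.

```python
def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def pass_rate(passed: int, total: int) -> float:
    """Share of files processed without error, as a percentage."""
    return 100.0 * passed / total


# Figures quoted on this page
ratio = speedup(baseline_ms=4.1, candidate_ms=0.8)
rate = pass_rate(passed=3_792, total=3_823)
failures = 3_823 - 3_792

print(f"speedup: {ratio:.1f}x")
print(f"pypdfium2 pass rate: {rate:.1f}% ({failures} failures)")
```

Running the same two helpers against your own corpus is a quick way to check whether the gap holds for your documents.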

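Quick comparison

	PDF Oxide	pypdfium2
Mean extraction time	0.8ms	4.1ms
Pass rate (3,830 PDFs)	100%	99.2%
License	MIT	Apache-2.0
Implementation language	Rust + PyO3	C (PDFium)
Text extraction	Yes	Yes
Character positions	Yes	Yes
Image extraction	Yes	Yes
Markdown output	Yes	No
HTML output	Yes	No
PDF creation	Yes	No
PDF editing	Yes	No
Form fields	Read/write	Read-only
Encryption	Read/write	Read-only
Rendering	Yes	Yes
OCR	Built-in	No
Search	Regex + spatial search	Yes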

Side-by-side code comparison

Text extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)

pypdfium2:

import pypdfium2 as pdfium

pdf = pdfium.PdfDocument("report.pdf")
page = pdf[0]
textpage = page.get_textpage()
text = textpage.get_text_range()
print(text)

Image extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

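pypdfium2:

import pypdfium2 as pdfium
import pypdfium2.raw as pdfium_c  # raw PDFium constants live in pypdfium2.raw (v4+)

pdf = pdfium.PdfDocument("report.pdf")
page = pdf[0]
for i, obj in enumerate(page.get_objects()):
    # Keep only image objects; skip text, paths, and other page objects
    if obj.type == pdfium_c.FPDF_PAGEOBJ_IMAGE:
        bitmap = obj.get_bitmap()
        bitmap.to_pil().save(f"image_{i}.png")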

PDF creation

PDF Oxide:

from pdf_oxide import Pdf

pdf = Pdf.from_markdown("# Report\n\nQuarterly results are in.")
pdf.save("report.pdf")

pypdfium2:

# pypdfium2 cannot create PDFs.
# It is a read-only library with rendering capability only.

Rendering

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
image = doc.render_page(0, dpi=150)
image.save("page.png")

pypdfium2:

import pypdfium2 as pdfium

pdf = pdfium.PdfDocument("report.pdf")
page = pdf[0]
bitmap = page.render(scale=150/72)
bitmap.to_pil().save("page.png")
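
The two rendering calls express resolution differently: the PDF Oxide example passes a DPI value, while pypdfium2 takes a scale factor relative to the PDF default of 72 units per inch, hence `scale=150/72`. A minimal sketch of the conversion follows. The helper names are illustrative rather than part of either library, and the US Letter page size is only an example input.

```python
POINTS_PER_INCH = 72  # PDF user space defaults to 1/72 inch per unit


def dpi_to_scale(dpi: float) -> float:
    """Convert a target DPI into a scale factor relative to 72 DPI."""
    return dpi / POINTS_PER_INCH


def rendered_size(width_pt: float, height_pt: float, dpi: float) -> tuple[int, int]:
    """Approximate pixel size of a page rendered at the given DPI."""
    scale = dpi_to_scale(dpi)
    return round(width_pt * scale), round(height_pt * scale)


# US Letter is 612 x 792 points (8.5 x 11 inches)
print(dpi_to_scale(150))
print(rendered_size(612, 792, dpi=150))
```

The exact rounding of the bitmap dimensions is up to the renderer, so treat the computed size as an estimate.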

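Benchmark details

Metric	PDF Oxide	pypdfium2
Mean extraction time	0.8ms	4.1ms
p99 extraction time	9ms	42ms
Pass rate (valid PDFs)	100% (3,823/3,823)	99.2% (3,792/3,823)

Both libraries run native code (Rust and C respectively), but PDF Oxide's text extraction pipeline is optimized specifically for this task: single-pass extraction with preallocated buffers and a cached page tree.

See the full benchmark methodology for corpus details.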
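
To make the three reported metrics concrete, here is a generic sketch of how mean time, p99 time, and pass rate can be measured for any extraction function. The methodology itself is not reproduced on this page, so this is not the harness behind the published numbers; `benchmark` and `fake_extract` are hypothetical names, and the nearest-rank p99 is one common definition among several.

```python
import math
import statistics
import time
from typing import Callable, Iterable


def benchmark(extract: Callable[[str], object], paths: Iterable[str]) -> dict:
    """Time extract(path) for each path; a raised exception counts as a failure."""
    timings_ms = []
    failures = 0
    for path in paths:
        start = time.perf_counter()
        try:
            extract(path)
        except Exception:
            failures += 1
            continue
        timings_ms.append((time.perf_counter() - start) * 1000)

    total = len(timings_ms) + failures
    ranked = sorted(timings_ms)
    # Nearest-rank p99: smallest timing with at least 99% of samples at or below it
    p99 = ranked[math.ceil(0.99 * len(ranked)) - 1] if ranked else float("nan")
    return {
        "mean_ms": statistics.fmean(timings_ms) if timings_ms else float("nan"),
        "p99_ms": p99,
        "pass_rate": 100.0 * len(timings_ms) / total if total else float("nan"),
    }


def fake_extract(path: str) -> str:
    # Stand-in for a real extractor; fails on one input to exercise the failure path
    if path.endswith("bad.pdf"):
        raise ValueError("unreadable")
    return "text"


result = benchmark(fake_extract, ["a.pdf", "b.pdf", "c.pdf", "bad.pdf"])
print(result)
```

With a real library, `extract` would wrap a call such as the text extraction examples earlier on this page, and `paths` would be the files in your own corpus.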

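Feature completeness

The biggest difference between the two libraries is feature scope. pypdfium2 is a reader with rendering; PDF Oxide covers the full PDF lifecycle:

Capability	PDF Oxide	pypdfium2
Read and extract	Yes	Yes
Render pages	Yes	Yes
Create PDFs	Yes (Markdown, HTML, images)	No
Edit existing PDFs	Yes (text, images, annotations)	No
Fill form fields	Yes	No
Write encryption	Yes (AES-256)	No
Markdown/HTML output	Yes	No
OCR for scanned pages	Yes (PaddleOCR via ONNX)	No
PDF/A validation	Yes	No

If you only need to read and render PDFs, pypdfium2 is a solid choice. If you need any write capability (creation, editing, form filling, or encryption), PDF Oxide covers all of it in one library.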
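
The capability table can also be read as a simple lookup: list what a project needs, then see which library covers all of it. The sketch below encodes the table as sets. The capability keys are labels invented for this example, not identifiers from either library.

```python
# Capability sets transcribed from the comparison table on this page
CAPABILITIES = {
    "PDF Oxide": {
        "read", "render", "create", "edit", "fill_forms",
        "encrypt", "markdown_html", "ocr", "pdfa_validation",
    },
    "pypdfium2": {"read", "render"},
}


def libraries_supporting(required: set[str]) -> list[str]:
    """Return the libraries whose capability set covers every requirement."""
    return sorted(name for name, caps in CAPABILITIES.items() if required <= caps)


print(libraries_supporting({"read", "render"}))
print(libraries_supporting({"read", "ocr"}))
```

A read-and-render workload matches both libraries, while adding any write or OCR requirement narrows the result to PDF Oxide, which mirrors the recommendation above.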

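pypdfium2 license (Apache-2.0)

pypdfium2 is licensed under Apache-2.0, which permits commercial use. It wraps Google's PDFium (the Chromium PDF engine), which carries its own BSD-style license. Both are permissive.

Points to consider:

Apache-2.0: permissive, allows commercial use, requires attribution
PDFium dependency: the binary bundles Chromium's PDFium engine (~15 MB)
Google's release cycle: pypdfium2 depends on PDFium releases from the Chromium project
No Python API stability guarantee: the API closely tracks PDFium's C API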

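PDF Oxide is MIT licensed. MIT is shorter and simpler than Apache-2.0, though it still requires the copyright and permission notice to be kept with copies of the software.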

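When to use each library

Choose PDF Oxide when:

You need features beyond reading and rendering (creation, editing, forms, encryption)
You need Markdown or HTML conversion
You need built-in OCR for scanned documents
You need the highest reliability (100% vs 99.2%)
Speed is critical and a 5× gap matters at scale

Choose pypdfium2 when:

You only need to read and render PDFs
You prefer PDFium-specific rendering output
You need a smaller dependency footprint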