What is the fastest Python PDF library?

In its published benchmark, PDF Oxide is the fastest Python PDF library tested, with a mean text extraction time of 0.8ms: 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). The benchmark covers 3,830 real-world PDFs, on which PDF Oxide has a 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting, which suits LLM and RAG pipelines. No separate package is needed, unlike PyMuPDF, which requires pymupdf4llm (69× slower).

PDF Oxide vs PyMuPDF

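PDF Oxide is a faster, MIT-licensed alternative to PyMuPDF. If you are evaluating PyMuPDF for a commercial project, or want to replace it because of its AGPL license, this page covers the key differences.

Why developers migrate from PyMuPDF

License. PyMuPDF wraps MuPDF, which is licensed under AGPL-3.0. If you distribute software that includes PyMuPDF (including SaaS, web apps, and Docker containers), you must open-source your code under the AGPL or buy a commercial license from Artifex. PDF Oxide is MIT licensed and carries no such restrictions.

Speed. PDF Oxide extracts text in 0.8ms on average, versus 4.6ms for PyMuPDF: 5.8× faster across 3,830 PDFs.

Reliability. On the same corpus where PyMuPDF has a 99.3% pass rate (27 failures on valid PDFs), PDF Oxide achieves a 100% pass rate.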
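
The speed and reliability figures can be cross-checked with a few lines of arithmetic. This is a minimal sketch that uses only numbers reported on this page (the two mean times, and the 3,796 of 3,823 valid PDFs that PyMuPDF passes according to the benchmark table). The variable names are illustrative and nothing here is measured.

```python
# Figures copied from this page; this script measures nothing itself.
oxide_mean_ms = 0.8
pymupdf_mean_ms = 4.6
valid_pdfs = 3823       # valid PDFs in the corpus, per the benchmark table
pymupdf_passed = 3796   # valid PDFs that PyMuPDF processed successfully

speedup = pymupdf_mean_ms / oxide_mean_ms
failures = valid_pdfs - pymupdf_passed
pass_rate = 100 * pymupdf_passed / valid_pdfs

print(f"speedup: {speedup:.2f}x")
print(f"PyMuPDF failures: {failures}")
print(f"PyMuPDF pass rate: {pass_rate:.1f}%")
```

The ratio works out to about 5.75, which the page rounds to 5.8×, and the failure count and pass rate match the 27 failures and 99.3% quoted above.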

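Quick comparison

	PDF Oxide	PyMuPDF
License	MIT	AGPL-3.0
Mean extraction time	0.8ms	4.6ms
Pass rate (3,830 PDFs)	100%	99.3%
Text extraction	Yes	Yes
Character positions	Yes	Yes
Image extraction	Yes	Yes
Form fields	Read + write	Read + write
PDF generation	Yes (Markdown/HTML)	Yes
Markdown output	Yes	No
HTML output	Yes	No
Encryption	Read + write	Read + write
Rendering	Yes	Yes
OCR	Built-in (PaddleOCR)	Tesseract
Install size	About 5 MB	About 20 MB
Python versions	3.8–3.14	3.8–3.12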

Side-by-side code comparison

Text extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)

PyMuPDF:

import fitz

doc = fitz.open("report.pdf")
page = doc[0]
text = page.get_text()
print(text)

Markdown conversion

PDF Oxide (built-in):

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)

PyMuPDF:

# PyMuPDF has no built-in Markdown conversion.
# Use pymupdf4llm (separate package, 69× slower than PDF Oxide):
import pymupdf4llm

md = pymupdf4llm.to_markdown("paper.pdf")

Image extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

PyMuPDF:

import fitz

doc = fitz.open("report.pdf")
page = doc[0]
for i, img in enumerate(page.get_images()):
    xref = img[0]
    base_image = doc.extract_image(xref)
    with open(f"image_{i}.{base_image['ext']}", "wb") as f:
        f.write(base_image["image"])

Generating a PDF from Markdown

PDF Oxide:

from pdf_oxide import Pdf

pdf = Pdf.from_markdown("# Invoice\n\n| Item | Price |\n|------|-------|\n| Widget | $9.99 |")
pdf.save("invoice.pdf")

PyMuPDF:

import fitz

# PyMuPDF cannot create PDFs from Markdown.
# You must manually place text on pages:
doc = fitz.open()
page = doc.new_page()
page.insert_text(fitz.Point(72, 72), "Invoice", fontsize=24)
doc.save("invoice.pdf")

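Benchmark details

Benchmarked on 3,830 PDFs from three independent public test suites (veraPDF, Mozilla pdf.js, DARPA SafeDocs).

Metric	PDF Oxide	PyMuPDF
Mean extraction time	0.8ms	4.6ms
p99 extraction time	9ms	28ms
Pass rate (valid PDFs)	100% (3,823/3,823)	99.3% (3,796/3,823)
Text quality agreement	99.5%	Baseline

For corpus details and reproduction steps, see the full benchmark methodology.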
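
To get comparable numbers on your own files, a small timing harness is enough. The sketch below is not the harness behind the published results. It assumes that any exception counts as a failure, that p99 is computed with the nearest-rank method, and that timing covers a single call per file. `benchmark` and its return keys are hypothetical names.

```python
import math
import statistics
import time
from typing import Callable, Dict, Iterable, Optional


def benchmark(extract: Callable[[str], str],
              paths: Iterable[str]) -> Dict[str, Optional[float]]:
    """Time `extract` on each path; report mean and p99 in ms, plus pass rate."""
    timings_ms = []
    attempted = 0
    for path in paths:
        attempted += 1
        start = time.perf_counter()
        try:
            extract(path)
        except Exception:
            # Assumption: any exception counts as a failure for that file.
            continue
        timings_ms.append((time.perf_counter() - start) * 1000)

    if not timings_ms:
        return {"mean_ms": None, "p99_ms": None, "pass_rate": 0.0}

    ordered = sorted(timings_ms)
    # Nearest-rank percentile: the smallest value with at least 99% of samples at or below it.
    p99 = ordered[math.ceil(0.99 * len(ordered)) - 1]
    return {
        "mean_ms": statistics.mean(ordered),
        "p99_ms": p99,
        "pass_rate": len(ordered) / attempted,
    }


# Usage with the calls shown on this page (requires pdf_oxide to be installed):
# from pdf_oxide import PdfDocument
# stats = benchmark(lambda p: PdfDocument(p).extract_text(0), pdf_paths)
```

Only successful extractions contribute to the timing statistics, so a library that fails quickly does not look faster. This page does not say whether the published benchmark timed one page or whole documents, so adjust the `extract` callable to match what you want to measure.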

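The AGPL license: what it means for you

PyMuPDF wraps MuPDF, which is licensed under AGPL-3.0. This affects you when:

You distribute software that uses PyMuPDF (binaries, Docker images, Electron apps)
You run a SaaS in which PyMuPDF processes users' PDFs on your servers
You embed PyMuPDF in a product, even as a microservice behind an API

In all of these cases, the AGPL requires you to release your entire application's source code under AGPL-3.0, or to buy a commercial license from Artifex.

PDF Oxide is MIT licensed. You can use it in any project (commercial, proprietary, SaaS, or open source) with no copyleft obligations.

Use case	PDF Oxide (MIT)	PyMuPDF (AGPL)
Commercial product	Yes	License required
Closed-source SaaS	Yes	License required
Internal tools	Yes	Yes
Open-source project	Yes	Yes (if AGPL-compatible)
Docker distribution	Yes	License required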

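PyMuPDF commercial license pricing

Artifex (the company behind MuPDF and PyMuPDF) does not publish pricing for commercial licenses. According to industry reports:

Contact required: you must request a quote from the Artifex sales team
Per-application licensing: pricing varies by deployment type and scale
Annual fees: commercial licenses are typically renewed every year
No free tier: the AGPL has no "community edition" or "startup" exception

For teams evaluating PyMuPDF for commercial use, license fees are an ongoing operating expense on top of development time.

PDF Oxide is MIT licensed: free forever for all uses. No sales calls, no license audits, no compliance risk. Use it in SaaS, distribute it in Docker containers, or embed it in commercial products, with no restrictions.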

Migration guide

API mapping

Task	PyMuPDF	PDF Oxide
Open a PDF	`fitz.open("f.pdf")`	`PdfDocument("f.pdf")`
Page count	`doc.page_count`	`doc.page_count()`
Extract text	`doc[0].get_text()`	`doc.extract_text(0)`
Character data	`doc[0].get_text("dict")`	`doc.extract_chars(0)`
Extract images	`doc[0].get_images()` + `doc.extract_image(xref)`	`doc.extract_images(0)`
Search text	`doc[0].search_for("query")`	`doc.search_page(0, "query")`
Encrypted PDF	`doc.authenticate("pw")`	`PdfDocument("f.pdf", password="pw")`
Convert to Markdown	pymupdf4llm (separate package)	`doc.to_markdown(0)`
Create from text	Manual `insert_text()`	`Pdf.from_markdown("# Title")`
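
PyMuPDF code often loops over page objects, whereas the PDF Oxide calls in the table take a page index. A whole-document equivalent might look like the sketch below. It assumes zero-based page indexes and an integer return from `page_count()`, as the table and the examples on this page suggest. `PageTextSource` and `extract_all_text` are hypothetical names, not part of either library.

```python
from typing import Protocol


class PageTextSource(Protocol):
    """Minimal shape the helper relies on, taken from the mapping table."""

    def page_count(self) -> int: ...

    def extract_text(self, page_index: int) -> str: ...


def extract_all_text(doc: PageTextSource, separator: str = "\n\n") -> str:
    """Join the text of every page, in page order."""
    return separator.join(
        doc.extract_text(index) for index in range(doc.page_count())
    )


# Usage, assuming pdf_oxide is installed:
# from pdf_oxide import PdfDocument
# text = extract_all_text(PdfDocument("report.pdf"))
```

Typing the argument as a protocol keeps the helper testable with a stub object and avoids importing the library at module load.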

Step by step

Install: pip install pdf_oxide
Replace imports: import fitz → from pdf_oxide import PdfDocument
Replace open calls: fitz.open(path) → PdfDocument(path)
Replace extraction calls: page.get_text() → doc.extract_text(page_index)
Replace image handling: multi-step xref lookup → doc.extract_images(page_index)
Update password handling: use PdfDocument(path, password="pw"), or call doc.authenticate("pw") after opening
Test: run your pipeline against your existing test files
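
Before editing anything, it can help to list where PyMuPDF is actually used. The sketch below scans Python source text for the PyMuPDF calls named in the API mapping. It is a plain text match, so it can report false positives (for example, an unrelated `.get_text(` method). `PYMUPDF_PATTERNS` and `find_call_sites` are hypothetical names.

```python
import re
from typing import List, Tuple

# Patterns derived from the PyMuPDF column of the API mapping on this page.
PYMUPDF_PATTERNS = {
    "import": re.compile(r"^\s*(?:import\s+fitz|from\s+fitz\s+import)\b"),
    "open": re.compile(r"\bfitz\.open\("),
    "get_text": re.compile(r"\.get_text\("),
    "get_images": re.compile(r"\.get_images\("),
    "extract_image": re.compile(r"\.extract_image\("),
    "search_for": re.compile(r"\.search_for\("),
    "authenticate": re.compile(r"\.authenticate\("),
}


def find_call_sites(source: str) -> List[Tuple[int, str]]:
    """Return (line_number, label) for each line that matches a pattern."""
    hits = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for label, pattern in PYMUPDF_PATTERNS.items():
            if pattern.search(line):
                hits.append((line_number, label))
    return hits


# Usage: scan one file and print what needs attention.
# with open("pipeline.py", encoding="utf-8") as handle:
#     for line_number, label in find_call_sites(handle.read()):
#         print(f"pipeline.py:{line_number}: {label}")
```

Each hit maps to one row of the API mapping, which gives a rough checklist for the replacement steps.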

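When to stay with PyMuPDF

You already hold a commercial MuPDF license and depend on MuPDF-specific rendering features
You need SVG export (PDF Oxide does not support SVG output)
Your project is already AGPL-licensed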
