What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Comparing Python PDF Libraries

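This page compares PDF Oxide with PyMuPDF (fitz), pypdfium2, pypdf, pdfplumber, pdfminer, and others. It covers performance, feature coverage, licensing, and API differences to help you choose the right Python PDF library for text extraction.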

Overview

	PDF Oxide	PyMuPDF	pypdfium2	pypdf	pdfplumber	pdfminer
Mean extraction time	0.8ms	4.6ms	4.1ms	12.1ms	23.2ms	16.8ms
Pass rate (3,823 valid PDFs in the 3,830-file corpus)	100%	99.3%	99.2%	98.4%	98.8%	98.8%
License	MIT	AGPL-3.0	Apache-2.0	BSD-3	MIT	MIT
Implementation language	Rust + PyO3	C (MuPDF)	C (PDFium)	Pure Python	Pure Python	Pure Python
Text extraction	Yes	Yes	Yes	Yes	Yes	Yes
Character positions	Yes	Yes	Yes	Partial	Yes	Yes
Image extraction	Yes	Yes	Yes	Yes	No	No
Form fields	Read + write	Read + write	Read only	Read + write	Read only	No
PDF creation	Yes	Yes	No	Limited	No	No
PDF editing	Yes	Yes	No	Yes	No	No
Markdown output	Yes	No	No	No	No	No
HTML output	Yes	No	No	No	No	No
Encryption	Read + write	Read + write	Read only	Read + write	No	No
PDF/A validation	Yes	No	No	No	No	No
Rendering	Yes	Yes	Yes	No	No	No
Search	Regex + spatial search	Yes	Yes	No	No	No
Python versions	3.8–3.14	3.8–3.12	3.8+	3.6+	3.8+	3.6+
Install size	~5 MB wheel	~20 MB wheel	~3 MB wheel	~1 MB	~1 MB	~1 MB

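Performance Comparison

Mean text extraction time per PDF, benchmarked on the full 3,830-PDF corpus: three independent, publicly available test suites that together cover every PDF specification version (1.0–2.0), encrypted files, malformed documents, CJK encodings, complex layouts, and security edge cases. For what each suite tests and why the results are reproducible, see the full corpus details.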

Library	Mean	Relative to PDF Oxide	p99	Pass rate
PDF Oxide	0.8ms	1×	9ms	100%
PyMuPDF	4.6ms	5.8×	28ms	99.3%
pypdfium2	4.1ms	5.1×	42ms	99.2%
pymupdf4llm	55.5ms	69×	280ms	99.1%
pdftext	7.3ms	9.1×	82ms	99.0%
pdfminer	16.8ms	21×	124ms	98.8%
pdfplumber	23.2ms	29×	189ms	98.8%
markitdown	108.8ms	136×	378ms	98.6%
pypdf	12.1ms	15.1×	97ms	98.4%

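PDF Oxide gets its speed from a native Rust core compiled into a Python extension module through PyO3. There is no subprocess overhead and no C library bridge: the Rust code runs directly inside the Python process.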
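To make the Mean, p99, and Pass rate columns concrete, here is a minimal sketch of how such per-file numbers could be collected. It is not the harness behind the published figures. The `benchmark` function, its return keys, and the nearest-rank percentile choice are all assumptions made for illustration.

```python
import math
import statistics
import time
from typing import Callable, Iterable


def benchmark(extract: Callable[[str], object], paths: Iterable[str]) -> dict:
    """Time one extraction call per file and summarize the run (hypothetical helper)."""
    timings_ms = []
    attempted = 0
    for path in paths:
        attempted += 1
        start = time.perf_counter()
        try:
            extract(path)
        except Exception:
            # Any exception counts as a failed file and is left out of the timing stats.
            continue
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    if not timings_ms:
        return {"mean_ms": None, "p99_ms": None, "pass_rate": 0.0}

    ordered = sorted(timings_ms)
    # Nearest-rank percentile: smallest sample with at least 99% of samples at or below it.
    p99 = ordered[math.ceil(0.99 * len(ordered)) - 1]
    return {
        "mean_ms": statistics.fmean(ordered),
        "p99_ms": p99,
        "pass_rate": len(ordered) / attempted,
    }
```

Any library can be plugged in by wrapping its extraction call in a one-argument function, for example `lambda p: PdfDocument(p).extract_text(0)` for the PDF Oxide snippet shown on this page. Timing only successful files is a design choice of this sketch; a different harness might also time failures.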

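Reliability

PDF Oxide processed all 3,823 of the 3,823 valid PDFs without a single failure, a 100% pass rate. The 7 remaining files in the 3,830-file corpus are intentionally broken test fixtures (missing PDF header, fuzz-corrupted catalog, invalid xref streams) and are excluded from the pass-rate denominator.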

Library	Valid PDFs passed	Pass rate
PDF Oxide	3,823 / 3,823	100%
PyMuPDF	3,796 / 3,823	99.3%
pypdfium2	3,792 / 3,823	99.2%
pymupdf4llm	3,787 / 3,823	99.1%
pdftext	3,784 / 3,823	99.0%
pdfminer	3,777 / 3,823	98.8%
pdfplumber	3,777 / 3,823	98.8%
markitdown	3,771 / 3,823	98.6%
pypdf	3,762 / 3,823	98.4%
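The percentages in this table follow directly from the counts: each pass rate is the number of valid PDFs passed divided by 3,823, rounded to one decimal place. The short check below recomputes them from the table. The names `PASSED`, `VALID_PDFS`, and `pass_rate` are illustrative only.

```python
# Counts copied from the reliability table above.
VALID_PDFS = 3823
PASSED = {
    "PDF Oxide": 3823,
    "PyMuPDF": 3796,
    "pypdfium2": 3792,
    "pymupdf4llm": 3787,
    "pdftext": 3784,
    "pdfminer": 3777,
    "pdfplumber": 3777,
    "markitdown": 3771,
    "pypdf": 3762,
}


def pass_rate(passed: int, total: int = VALID_PDFS) -> float:
    """Return the pass rate as a percentage rounded to one decimal place."""
    return round(100.0 * passed / total, 1)


for library, passed in PASSED.items():
    print(f"{library}: {passed} / {VALID_PDFS} = {pass_rate(passed)}%")
```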

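Text Quality

Across the full corpus, PDF Oxide reaches 99.5% text agreement with PyMuPDF and pypdfium2. Quality is measured by comparing the extracted text output character by character. The remaining 0.5% of differences come from whitespace normalization and ligature handling, where PDF Oxide produces cleaner output.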
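The page does not spell out the exact agreement formula, so the following is only one plausible way to score character-level agreement between two extractions. It assumes the similarity ratio from Python's standard `difflib.SequenceMatcher`. The `agreement` and `normalize_whitespace` helpers are hypothetical names.

```python
from difflib import SequenceMatcher


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into a single space and trim the ends."""
    return " ".join(text.split())


def agreement(a: str, b: str, normalize: bool = False) -> float:
    """Character-level similarity in [0, 1] between two extracted texts."""
    if normalize:
        a, b = normalize_whitespace(a), normalize_whitespace(b)
    if not a and not b:
        return 1.0  # two empty extractions agree trivially
    # ratio() is 2 * matched_chars / (len(a) + len(b)); autojunk is disabled
    # so long texts are not affected by the popular-character heuristic.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


print(agreement("abcd", "abxd"))                  # one differing character
print(agreement("a  b\n", "a b", normalize=True)) # whitespace-only difference
```

Comparing with and without `normalize=True` is a quick way to tell whether a disagreement is only whitespace, which the paragraph above names as one of the two main sources of difference.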

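License Comparison

Library	License	Commercial use	Copyleft
PDF Oxide	MIT	Unrestricted	No
pypdfium2	Apache-2.0	Unrestricted	No
PyMuPDF	AGPL-3.0	Requires a paid commercial license	Yes
pypdf	BSD-3	Unrestricted	No
pdfplumber	MIT	Unrestricted	No
pdfminer	MIT	Unrestricted	No
pdftext	GPL-3.0	Requires open-sourcing	Yes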

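PyMuPDF builds on MuPDF under the AGPL-3.0 license. If you distribute software that uses PyMuPDF, your software must also be released under AGPL-3.0, or you must buy a commercial license from Artifex. This applies to SaaS products, web applications, and any distributed binary.

PDF Oxide is MIT licensed. You can use it in proprietary products, SaaS platforms, or closed-source applications with no copyleft obligations; the only requirement is to keep the MIT copyright and license notice.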

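Use case	PDF Oxide (MIT)	PyMuPDF (AGPL)	pypdfium2 (Apache)	pypdf (BSD)	pdfplumber (MIT)	pdfminer (MIT)
Commercial product	Yes	License required	Yes	Yes	Yes	Yes
Closed source	Yes	No (unless you buy a license)	Yes	Yes	Yes	Yes
SaaS/cloud	Yes	License required	Yes	Yes	Yes	Yes
Internal tools	Yes	Yes	Yes	Yes	Yes	Yes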

API Comparison

Text Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)

PyMuPDF:

import fitz

doc = fitz.open("report.pdf")
page = doc[0]
text = page.get_text()
print(text)

pypdf:

from pypdf import PdfReader

reader = PdfReader("report.pdf")
page = reader.pages[0]
text = page.extract_text()
print(text)

pdfplumber:

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
    print(text)

pdfminer:

from pdfminer.high_level import extract_text

text = extract_text("report.pdf", page_numbers=[0])
print(text)

Character-Level Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
chars = doc.extract_chars(0)
for ch in chars:
    print(f"'{ch.char}' at ({ch.bbox[0]:.1f}, {ch.bbox[1]:.1f}) "
          f"size={ch.font_size:.1f}")

PyMuPDF:

import fitz

doc = fitz.open("report.pdf")
page = doc[0]
blocks = page.get_text("dict")["blocks"]
for block in blocks:
    if "lines" in block:
        for line in block["lines"]:
            for span in line["spans"]:
                print(f"'{span['text']}' size={span['size']:.1f}")

pdfplumber:

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    for char in page.chars:
        print(f"'{char['text']}' at ({char['x0']:.1f}, {char['top']:.1f}) "
              f"size={char['size']:.1f}")

pdfminer:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar

for page_layout in extract_pages("report.pdf"):
    for element in page_layout:
        if hasattr(element, '__iter__'):
            for text_line in element:
                if hasattr(text_line, '__iter__'):
                    for char in text_line:
                        if isinstance(char, LTChar):
                            print(f"'{char.get_text()}' at ({char.x0:.1f}, {char.y0:.1f}) "
                                  f"size={char.size:.1f}")

Image Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

PyMuPDF:

import fitz

doc = fitz.open("report.pdf")
page = doc[0]
for i, img in enumerate(page.get_images()):
    xref = img[0]
    base_image = doc.extract_image(xref)
    with open(f"image_{i}.{base_image['ext']}", "wb") as f:
        f.write(base_image["image"])

pypdf:

from pypdf import PdfReader

reader = PdfReader("report.pdf")
page = reader.pages[0]
for i, image in enumerate(page.images):
    with open(f"image_{i}.{image.name.split('.')[-1]}", "wb") as f:
        f.write(image.data)

PDF Creation

PDF Oxide:

from pdf_oxide import Pdf

pdf = Pdf.from_markdown("# Hello World\n\nThis is a PDF.")
pdf.save("output.pdf")

# Also supports HTML
pdf = Pdf.from_html("<h1>Hello</h1><p>World</p>")
pdf.save("output.pdf")

PyMuPDF:

import fitz

doc = fitz.open()
page = doc.new_page()
text_point = fitz.Point(72, 72)
page.insert_text(text_point, "Hello World", fontsize=24)
doc.save("output.pdf")

pypdf:

# pypdf can merge/modify PDFs but cannot create from scratch with text content.
# Use reportlab or fpdf2 for creation, then merge with pypdf.

Encrypted PDFs

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("encrypted.pdf", password="password")
text = doc.extract_text(0)

PyMuPDF:

import fitz

doc = fitz.open("encrypted.pdf")
doc.authenticate("password")
page = doc[0]
text = page.get_text()

pypdf:

from pypdf import PdfReader

reader = PdfReader("encrypted.pdf")
reader.decrypt("password")
text = reader.pages[0].extract_text()

Markdown and HTML Output

PDF Oxide (exclusive feature):

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")

# Convert to Markdown with heading detection
md = doc.to_markdown(0, detect_headings=True)
print(md)

# Convert to HTML
html = doc.to_html(0)
print(html)

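According to this comparison, none of the other libraries covered here offers built-in Markdown or HTML conversion; PyMuPDF relies on the separate pymupdf4llm package for Markdown.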

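Library Profiles

PDF Oxide

Strengths:

Fastest text extraction in the benchmark thanks to its Rust core: 5.8× faster than PyMuPDF
100% pass rate on the 3,823 valid PDFs in the 3,830-file corpus, the highest reliability of all tested libraries
One unified API for extraction, creation, and editing in a single library
Built-in Markdown and HTML export with heading detection
MIT license with no copyleft restrictions
Native compliance validation (PDF/A, PDF/UA, PDF/X)
Prebuilt wheels for all major platforms and Python 3.8–3.14
No system dependencies: the wheel includes everything

Limitations:

Newer library with a smaller community
Table extraction is basic compared with pdfplumber's algorithms
Rendering engine is less mature than MuPDF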

PyMuPDF (fitz)

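Strengths:

Mature and battle-tested (built on MuPDF, in continuous development since 2005)
Excellent rendering quality on complex PDFs
Built-in OCR integration (Tesseract)
Feature-rich: SVG export, page manipulation, table detection

Limitations:

AGPL-3.0 license requires you to open-source your application or buy a commercial license
Large wheel (about 20 MB) because it bundles MuPDF
No built-in Markdown export
No compliance validation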

pypdfium2

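Strengths:

Fast (built on Google's PDFium engine)
Apache-2.0 license, permissive for commercial use
Good rendering quality

Limitations:

Limited text extraction API compared with PDF Oxide or PyMuPDF
Cannot create or edit PDFs
Form fields are read-only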

pypdf

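Strengths:

Pure Python: installs in any environment with no compiled dependencies
Lightweight and well maintained
Well suited to PDF manipulation (merge, split, rotate, encrypt)
Large community and thorough documentation

Limitations:

Text extraction is 15× slower than PDF Oxide
Weaker text extraction quality on complex layouts
No rendering, no Markdown/HTML export, no table extraction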

pdfplumber

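Strengths:

Best table extraction of any Python PDF library
Excellent character-level positioning data
Visual debugging tools (annotated page images)
MIT license

Limitations:

Pure Python: 29× slower than PDF Oxide
Read-only: cannot create or edit PDFs
No encryption or rendering support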

pdfminer

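Strengths:

Detailed character and layout analysis
Good CJK text support
The foundation for pdfplumber and other tools
MIT license

Limitations:

21× slower than PDF Oxide (pure Python, not optimized)
Read-only: cannot create or edit PDFs
Verbose API for common tasks
Less actively maintained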

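When to Use Which

Use case	Recommended library
Fast text extraction	PDF Oxide
Commercial / proprietary products	PDF Oxide, pypdfium2, pypdf, pdfplumber, or pdfminer
PyMuPDF replacement (MIT license)	PDF Oxide
Creating PDFs from Markdown/HTML	PDF Oxide
Compliance validation (PDF/A, PDF/X)	PDF Oxide
Extracting tables from invoices	pdfplumber
Visually debugging extraction	pdfplumber
Existing investment in MuPDF	PyMuPDF (if AGPL is acceptable)
Minimal dependencies	pypdf (pure Python)
Detailed layout analysis	pdfminer
OCR on scanned documents	PyMuPDF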

Installation

# PDF Oxide
pip install pdf_oxide

# PyMuPDF
pip install pymupdf

# pypdfium2
pip install pypdfium2

# pypdf
pip install pypdf

# pdfplumber
pip install pdfplumber

# pdfminer
pip install pdfminer.six

PDF Oxide ships prebuilt wheels for Linux (x86_64, aarch64), macOS (x86_64, arm64), and Windows (x86_64). No compiler or system libraries are required.
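If you want to check a target machine against the platform list above before deploying, a small standard-library sketch such as the following may help. The `has_prebuilt_wheel` helper and the `SUPPORTED` table are hypothetical and simply encode the platforms and the Python 3.8–3.14 range stated on this page. The check does not query PyPI and does not distinguish Linux libc variants.

```python
import platform
import sys

# Platform/architecture pairs listed on this page (assumed complete for this sketch).
SUPPORTED = {
    "linux": {"x86_64", "aarch64"},
    "darwin": {"x86_64", "arm64"},
    "windows": {"x86_64"},
}
# platform.machine() reports "AMD64" on 64-bit Windows; map it to the wheel naming.
ARCH_ALIASES = {"amd64": "x86_64"}


def has_prebuilt_wheel(system: str, machine: str, python_version: tuple) -> bool:
    """Return True if the page's wheel list covers this OS, CPU, and Python version."""
    arch = ARCH_ALIASES.get(machine.lower(), machine.lower())
    in_python_range = (3, 8) <= tuple(python_version[:2]) <= (3, 14)
    return in_python_range and arch in SUPPORTED.get(system.lower(), set())


if __name__ == "__main__":
    ok = has_prebuilt_wheel(platform.system(), platform.machine(), sys.version_info[:2])
    print("prebuilt wheel expected" if ok else "no prebuilt wheel listed for this platform")
```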
