What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

PDF Oxide vs pdfplumber

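PDF Oxide is 29× faster than pdfplumber at text extraction and offers a broader feature set. pdfplumber has the more mature table extraction algorithm. This page helps you choose the right tool for your use case.

Key Differences

Speed. pdfplumber is pure Python (built on pdfminer). PDF Oxide's Rust core extracts text in 0.8ms on average, versus 23.2ms for pdfplumber, a 29× difference.

Reliability. PDF Oxide has a 100% pass rate on the 3,830 test PDFs. pdfplumber has a 98.8% pass rate, with 46 failures on valid PDFs.

Tables. pdfplumber has the strongest table extraction of any Python PDF library. PDF Oxide's table detection is usable but less mature for complex multi-row, multi-column layouts with merged cells.

Scope. pdfplumber is read-only. PDF Oxide adds creation, editing, encryption, rendering, and Markdown/HTML output.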

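Quick Comparison

Feature	PDF Oxide	pdfplumber
Mean extraction time	0.8ms	23.2ms
Pass rate (3,830 PDFs)	100%	98.8%
License	MIT	MIT
Language	Rust + PyO3	Pure Python
Text extraction	Yes	Yes
Character positions	Yes	Yes
Table extraction	Basic	Advanced
Image extraction	Yes	No
Visual debugging	No	Yes
Markdown output	Yes	No
HTML output	Yes	No
PDF creation	Yes	No
PDF editing	Yes	No
Encryption	Read + write	No
Rendering	Yes	No
Form fields	Read + write	Read only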

Side-by-Side Code

Text Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)

pdfplumber:

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
    print(text)

Character-Level Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
chars = doc.extract_chars(0)
for ch in chars[:10]:
    print(f"'{ch.char}' at ({ch.x:.1f}, {ch.y:.1f}) size={ch.font_size:.1f}")

pdfplumber:

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    for char in page.chars[:10]:
        print(f"'{char['text']}' at ({char['x0']:.1f}, {char['top']:.1f}) "
              f"size={char['size']:.1f}")

Table Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")
md = doc.to_markdown(0, detect_headings=True)
# Tables are converted to Markdown table syntax
print(md)

pdfplumber:

import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()
    for table in tables:
        for row in table:
            print(row)

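pdfplumber's extract_tables() returns structured row/column data and supports configurable line detection. For complex tables with merged cells, spanning headers, or borderless layouts, pdfplumber's algorithm is more robust.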
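To make "configurable line detection" concrete, here is a minimal sketch that passes pdfplumber's table_settings to extract_tables() with the text-based strategies, which is a common starting point for borderless tables. The helper names extract_borderless_tables and rows_to_records are hypothetical, the settings are an assumption that usually needs tuning per document, and the sketch also assumes the first row of each table is a header.

```python
from typing import Dict, List, Optional

Table = List[List[Optional[str]]]


def extract_borderless_tables(path: str, page_index: int = 0) -> List[Table]:
    """Extract tables from one page using text alignment instead of ruled lines."""
    import pdfplumber  # imported lazily so the pure helper below works without it

    # Assumption: "text" strategies infer cell edges from word alignment,
    # which tends to suit tables that have no drawn borders.
    settings = {"vertical_strategy": "text", "horizontal_strategy": "text"}
    with pdfplumber.open(path) as pdf:
        return pdf.pages[page_index].extract_tables(table_settings=settings)


def rows_to_records(table: Table) -> List[Dict[str, str]]:
    """Turn a header row plus body rows into dicts; empty (None) cells become ''."""
    header, *body = table
    keys = [(cell or "").strip() for cell in header]
    return [
        dict(zip(keys, [(cell or "").strip() for cell in row]))
        for row in body
    ]


if __name__ == "__main__":
    # Hand-written sample in the shape extract_tables() returns for one table.
    sample = [["Item", "Qty"], ["Widget", "2"], ["Gadget", None]]
    for record in rows_to_records(sample):
        print(record)
```

Keeping the row-to-record step separate from the extraction call means it can be checked on hand-written rows before it is pointed at real PDFs.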

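Benchmark Details

Metric	PDF Oxide	pdfplumber
Mean extraction time	0.8ms	23.2ms
p99 extraction time	9ms	189ms
Pass rate (valid PDFs)	100% (3,823/3,823)	98.8% (3,777/3,823)

The 29× speed gap comes from pdfplumber's pure-Python architecture. pdfplumber uses pdfminer as its parsing foundation and layers its own spatial analysis on top, both written in Python. PDF Oxide does all parsing, font decoding, and text assembly in compiled Rust.

For details on the corpus, see the full benchmark methodology.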
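For readers who want to compute comparable figures on their own files, the sketch below shows one way to derive a mean, a p99, and a pass rate from per-file timings. It is not the project's benchmark harness: the function names (benchmark, percentile, summarize) are hypothetical, the nearest-rank percentile is an assumed definition, and whether failed files count toward timing is a choice made here for illustration.

```python
import math
import statistics
import time
from typing import Callable, Dict, Iterable, List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(timings_ms: List[float], total: int) -> Dict[str, float]:
    """Summarize timings of successful extractions out of `total` attempted files."""
    return {
        "mean_ms": statistics.fmean(timings_ms),
        "p99_ms": percentile(timings_ms, 99),
        "pass_rate": len(timings_ms) / total,
    }


def benchmark(extract: Callable[[str], object], paths: Iterable[str]) -> Dict[str, float]:
    """Time extract(path) for each file; any exception counts as a failure."""
    timings_ms: List[float] = []
    total = 0
    for path in paths:
        total += 1
        start = time.perf_counter()
        try:
            extract(path)
        except Exception:
            continue  # failed files lower the pass rate and are not timed
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return summarize(timings_ms, total)
```

To compare the two libraries, pass a callable for each, for example `lambda p: PdfDocument(p).extract_text(0)` for PDF Oxide and a small function wrapping `pdfplumber.open(p)` for pdfplumber, over the same list of paths.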

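When to Use Which

Choose PDF Oxide when:

Speed matters. At large scale, 29× faster turns hours into minutes: at the mean extraction times above, one million PDFs take about 13 minutes instead of about 6.4 hours.
You need more than extraction. Creation, editing, encryption, rendering, or Markdown output.
You want the highest reliability. 100% pass rate vs 98.8%.
You need image extraction. pdfplumber does not extract images.
Batch pipelines. At 0.8ms per PDF, 3,830 PDFs take about 3.1 seconds.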
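The batch figures can be checked with simple arithmetic. The sketch below multiplies the mean extraction times quoted on this page by a corpus size; the function name batch_seconds is hypothetical, and the estimate assumes a single process and ignores file I/O, startup cost, and the slow tail, so treat it as a rough lower bound rather than a prediction.

```python
def batch_seconds(num_pdfs: int, mean_ms: float) -> float:
    """Estimated wall-clock seconds to extract num_pdfs at a given mean time per PDF."""
    return num_pdfs * mean_ms / 1000.0


# Mean extraction times quoted in the benchmark on this page.
MEAN_MS = {"PDF Oxide": 0.8, "pdfplumber": 23.2}

if __name__ == "__main__":
    for num_pdfs in (3_830, 1_000_000):
        for name, mean_ms in MEAN_MS.items():
            seconds = batch_seconds(num_pdfs, mean_ms)
            print(f"{name:>10} x {num_pdfs:>9,} PDFs: "
                  f"{seconds:>9,.1f} s ({seconds / 60:,.1f} min)")
```

For the 3,830-file corpus this gives about 3.1 seconds versus about 88.9 seconds. At one million files the gap is about 13 minutes versus about 6.4 hours.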

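Choose pdfplumber when:

Complex table extraction is your primary use. pdfplumber's table algorithm handles merged cells, borderless tables, and spanning headers better.
You need visual debugging. pdfplumber can render annotated page images showing detected lines, characters, and table boundaries.
You prefer pure Python. No compiled dependencies; it installs in any environment.

Use both:

For pipelines that need both fast text extraction and complex table parsing, use PDF Oxide for text and pdfplumber for tables: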

from pdf_oxide import PdfDocument
import pdfplumber

# Fast text extraction with PDF Oxide
doc = PdfDocument("report.pdf")
text = doc.extract_text(0)

# Complex table extraction with pdfplumber
with pdfplumber.open("report.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()
