What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

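Extraction profiles: tune space detection by document type

Different PDFs hide their spaces in different ways. arXiv papers use tight, justified multi-column layouts; IRS forms rely on strict cell alignment; GDPR policy documents are dense justified paragraphs with minimal kerning. A tj_offset_threshold that suits one of these will insert a pile of stray spaces in another.

ExtractionProfile provides nine pre-tuned parameter sets, each matching a real-world document category. Pass a profile to extract_text() or extract_words(), and PDF Oxide applies the appropriate word-margin ratio, TJ offset threshold, and adaptive-threshold switch for that document style.

Binding support. Extraction profiles are currently exposed in Python (pdf_oxide.ExtractionProfile) and Rust (pdf_oxide::config::ExtractionProfile). The Node, WASM, Go, and C# bindings use the CONSERVATIVE defaults internally; to apply another profile from those runtimes, call the Rust CLI (pdf-oxide extract --profile academic doc.pdf) or bridge through Python or Rust.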

Quick example

Python

from pdf_oxide import PdfDocument, ExtractionProfile

doc = PdfDocument("paper.pdf")

# Academic paper: tight kerning, citation detection enabled
text = doc.extract_text(0, profile=ExtractionProfile.academic())
print(text)

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::config::ExtractionProfile;

let mut doc = PdfDocument::open("paper.pdf")?;
let text = doc.extract_text_with_profile(0, ExtractionProfile::ACADEMIC)?;
println!("{}", text);

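Available profiles

Profile	Best for	TJ threshold	Word-margin ratio	Adaptive
`conservative()`	Default: general text, minimal false space insertion	−120	0.10	Off
`aggressive()`	PDFs that suppress spaces; fixes glued words	−80	0.20	Off
`balanced()`	Mixed content	−100	0.15	Off
`academic()`	arXiv papers, conference proceedings, technical reports	−105	0.12	On + citation / email detection
`policy()`	Legal, GDPR, government regulations	−110	0.18	On
`form()`	IRS forms, applications, questionnaires	−120	0.08	Off
`government()`	Government reports with tables	−105	0.14	Off
`scanned_ocr()`	OCR output with noisy coordinates	Auto	Auto	On
`adaptive()`	Auto-tuned by the extractor from font statistics	Auto	Auto	On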
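
To make the TJ threshold column concrete, here is a minimal sketch of threshold-based space insertion. In a PDF `TJ` array, the numbers between strings shift the next glyph in thousandths of a text-space unit, and a negative number widens the gap. The sketch assumes a space is emitted whenever an offset is at or below the threshold; `join_tj` is a hypothetical helper, not part of PDF Oxide, and the library's real heuristics may differ.

```python
def join_tj(items, threshold):
    """Join the strings of a TJ array, inserting a space at wide gaps.

    Assumption for illustration: an offset at or below `threshold`
    (that is, more negative) counts as a word break.
    """
    out = []
    for item in items:
        if isinstance(item, str):
            out.append(item)
        elif item <= threshold:
            out.append(" ")
    return "".join(out)


# Hypothetical TJ array: -20 is kerning, -150 a real word gap, -90 ambiguous.
tj = ["Hel", -20, "lo", -150, "wor", -90, "ld"]

print(join_tj(tj, -120))  # Hello world
print(join_tj(tj, -80))   # Hello wor ld
```

The same array yields a stray space under the −80 setting and none under −120, which is why a single threshold rarely suits every document class.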

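When to use each profile

Academic / conference papers: `academic()`

Tight kerning, two-column layout, inline citations. The default settings often insert a spurious space inside ligatures (fi, fl) or miss the space between words when kerning is aggressive.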

doc = PdfDocument("neurips-paper.pdf")
text = doc.extract_text(0, profile=ExtractionProfile.academic())

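The academic profile enables both adaptive thresholds and citation / email detection, so inline citations such as [1,2,3] and email addresses such as author@lab.edu survive intact.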
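
One way to check that citations and email addresses came through unbroken is to scan the extracted text with a pair of regular expressions. This is a hedged sketch: `find_intact_tokens` and both patterns are illustrative assumptions, not PDF Oxide's detection logic, and the sample strings stand in for real `extract_text()` output.

```python
import re

# Illustrative patterns only: numeric citations such as [7] or [1,2,3],
# and simple email addresses.
CITATION_RE = re.compile(r"\[\d+(?:,\d+)*\]")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_intact_tokens(text):
    """Return the citations and emails that appear without stray spaces."""
    return {
        "citations": CITATION_RE.findall(text),
        "emails": EMAIL_RE.findall(text),
    }


# Stand-in strings; in practice `text` would come from extract_text().
clean = "As shown in [1,2,3] and [7], contact author@lab.edu."
broken = "As shown in [1 ,2,3], contact author @lab.edu."

print(find_intact_tokens(clean))
print(find_intact_tokens(broken))
```

An empty result on a page known to contain citations suggests that the chosen profile is splitting them.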

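IRS forms and applications: `form()`

Form PDFs care more about column alignment than word boundaries. The form() profile uses a very tight word-margin ratio (0.08) so that strictly aligned field labels do not run together with their values.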

doc = PdfDocument("w2.pdf")
text = doc.extract_text(0, profile=ExtractionProfile.form())

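GDPR / policy / regulations: `policy()`

Justified paragraphs insert whitespace of varying width that defeats the default thresholds. policy() pairs a looser word-margin (0.18) with adaptive thresholds to read dense legal text correctly.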

doc = PdfDocument("gdpr.pdf")
text = doc.extract_text(0, profile=ExtractionProfile.policy())

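Scanned OCR output: `scanned_ocr()`

Pages that have been through OCR (Tesseract, PaddleOCR, Azure) carry noisy character coordinates and have lost their kerning hints. scanned_ocr() compensates with adaptive thresholds that re-read font statistics on every page.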

doc = PdfDocument("scanned.pdf")
text = doc.extract_text(0, profile=ExtractionProfile.scanned_ocr())

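Let the library decide: `adaptive()`

If you don't know the document category in advance, adaptive() samples font statistics in a first pass and fixes the thresholds before the actual extraction. It is slightly slower than a fixed profile but more forgiving on mixed corpora.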

from pathlib import Path

for pdf_path in Path("mixed_corpus/").glob("*.pdf"):
    doc = PdfDocument(str(pdf_path))
    text = doc.extract_text(0, profile=ExtractionProfile.adaptive())
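
The first-pass idea can be illustrated with a toy threshold estimator. This is a sketch under stated assumptions, not PDF Oxide's algorithm: `estimate_tj_threshold` is a hypothetical helper that takes the negative TJ offsets seen on a page, looks for the widest jump between sorted values (kerning nudges versus word gaps), and places the threshold in the middle of that jump.

```python
def estimate_tj_threshold(offsets, default=-120.0):
    """Guess a TJ space threshold from the offsets observed on one page."""
    gaps = sorted(o for o in offsets if o < 0)  # most negative first
    widest, split = 0.0, None
    # Walk neighbouring pairs and remember the midpoint of the widest jump.
    for a, b in zip(gaps, gaps[1:]):
        if b - a > widest:
            widest, split = b - a, (a + b) / 2
    # Fall back to a fixed default when there is no clear separation.
    return default if split is None else split


# Hypothetical sample: small kerning nudges plus three real word gaps.
page_offsets = [-10, -15, -20, -12, -250, -260, -18, -240]
print(estimate_tj_threshold(page_offsets))  # -130.0
```

A fixed profile skips this sampling pass entirely, which is consistent with adaptive() being slightly slower.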

Profile fields

Every profile exposes its tuning parameters so you can read or clone them:

Python

from pdf_oxide import ExtractionProfile

p = ExtractionProfile.academic()
print(p.name)                # "Academic"
print(p.word_margin_ratio)   # 0.12
print(p.tj_offset_threshold) # -105.0

# List all presets
for profile in ExtractionProfile.all_profiles():
    print(profile.name, profile.word_margin_ratio)

Rust

use pdf_oxide::config::ExtractionProfile;

let p = ExtractionProfile::ACADEMIC;
println!("{} margin={} tj={}",
    p.name, p.word_margin_ratio, p.tj_offset_threshold);

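Choosing a profile in a production pipeline

If you ingest a mixed corpus (academic papers, IRS forms, and HTML exports scraped from the web side by side), make adaptive() the default. It adds a few percent of overhead per page but eliminates the worst failure modes (glued words, missing spaces across columns).

If the corpus is homogeneous (say, a Title IX intake pipeline, a contract review tool, or an arXiv crawler), pick the matching profile explicitly: you get the best extraction quality and skip adaptive()'s per-page sampling overhead.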
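
This routing rule can be sketched as a small lookup. The category labels and the `profile_name_for` helper are hypothetical; only the profile names come from the list of presets. The returned name can then be resolved on the Python side, for example with `getattr(ExtractionProfile, name)()`.

```python
# Hypothetical corpus labels mapped to the documented preset names.
PROFILE_BY_CATEGORY = {
    "arxiv": "academic",
    "conference": "academic",
    "contract": "policy",
    "regulation": "policy",
    "irs_form": "form",
    "government_report": "government",
    "scan": "scanned_ocr",
}


def profile_name_for(category=None):
    """Return a preset name, falling back to adaptive for unknown input."""
    if category is None:
        return "adaptive"
    return PROFILE_BY_CATEGORY.get(category.lower(), "adaptive")


print(profile_name_for("arXiv"))     # academic
print(profile_name_for("web_dump"))  # adaptive
print(profile_name_for())            # adaptive
```

Keeping the mapping in one table makes it easy to audit which corpora pay the adaptive sampling cost.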
