What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

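Migrating from pdfplumber to PDF Oxide

A complete guide to switching from pdfplumber to PDF Oxide, covering every API you use today and its replacement.

Why migrate from pdfplumber?

Four reasons to migrate:

29× faster: PDF Oxide averages 0.8ms per page; pdfplumber averages 23.2ms. A 100-page document drops from 2.3 seconds to 80ms.
Encrypted PDF support: pdfplumber cannot open encrypted PDFs at all. PDF Oxide handles every encryption method transparently, including AES-256.
Image extraction: pdfplumber has no image extraction. PDF Oxide extracts embedded images in a single call.
Markdown output: pdfplumber returns tables as Python lists that need manual formatting. PDF Oxide outputs Markdown that preserves table structure and is ready for LLM use.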
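
The 100-page figures follow directly from the per-page means. A minimal sketch of that arithmetic, assuming extraction cost scales linearly with page count (the constant and function names are illustrative, not part of either library):

```python
# Per-page mean extraction times quoted above, in milliseconds.
PDF_OXIDE_MS_PER_PAGE = 0.8
PDFPLUMBER_MS_PER_PAGE = 23.2


def total_ms(ms_per_page: float, pages: int) -> float:
    """Estimated extraction time for a document, assuming linear scaling."""
    return ms_per_page * pages


speedup = PDFPLUMBER_MS_PER_PAGE / PDF_OXIDE_MS_PER_PAGE
print(f"speedup: {speedup:.0f}x")
print(f"pdfplumber, 100 pages: {total_ms(PDFPLUMBER_MS_PER_PAGE, 100) / 1000:.1f} s")
print(f"PDF Oxide, 100 pages: {total_ms(PDF_OXIDE_MS_PER_PAGE, 100):.0f} ms")
```

Real documents vary page to page, so treat these totals as estimates rather than guarantees.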

Step 1: Install

pip install pdf_oxide
pip uninstall pdfplumber  # optional

Step 2: Replace the import

# Before
import pdfplumber

# After
from pdf_oxide import PdfDocument

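Step 3: API mapping table

| Task | pdfplumber | PDF Oxide |
| --- | --- | --- |
| Open a PDF | `pdfplumber.open("file.pdf")` | `PdfDocument("file.pdf")` |
| Page count | `len(pdf.pages)` | `doc.page_count()` |
| Extract text | `pdf.pages[0].extract_text()` | `doc.extract_text(0)` |
| Character positions | `pdf.pages[0].chars` | `doc.extract_chars(0)` |
| Extract tables | `pdf.pages[0].extract_tables()` | `doc.to_markdown(0)` |
| Form fields | Not supported (read-only) | `doc.get_form_fields()` |
| Encrypted PDFs | Not supported | `PdfDocument("file.pdf", password="pw")` |
| Extract images | Not supported | `doc.extract_image_bytes(0)` |
| Convert to Markdown | Not supported | `doc.to_markdown(0)` |
| Rendering | Not supported | `doc.render_page(0)` |
| OCR | Not supported | `doc.extract_text_ocr(0)` |
| Create a PDF | Not supported | `Pdf.from_markdown("# Title")` |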

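Step 4: Common pattern changes

Text extraction

pdfplumber code typically opens the file in a context manager. PDF Oxide does not need one:

# pdfplumber: opened with a context manager
import pdfplumber
with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

# PDF Oxide: no context manager needed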
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
for i in range(doc.page_count()):
    text = doc.extract_text(i)
    print(text)
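
If existing code iterates over `pdf.pages`, a small helper can keep the call sites similar after the switch. The sketch below is hypothetical glue code, not part of PDF Oxide; it relies only on the `page_count()` and `extract_text(i)` methods shown above, and the names `TextSource`, `iter_page_text`, and `extract_all_text` are invented for illustration.

```python
from typing import Iterator, Protocol


class TextSource(Protocol):
    """The two document methods used in the loop above."""

    def page_count(self) -> int: ...
    def extract_text(self, page_index: int) -> str: ...


def iter_page_text(doc: TextSource) -> Iterator[tuple[int, str]]:
    """Yield (page_index, text) pairs in page order."""
    for i in range(doc.page_count()):
        yield i, doc.extract_text(i)


def extract_all_text(doc: TextSource, separator: str = "\n\n") -> str:
    """Join the text of every page into a single string."""
    return separator.join(text for _, text in iter_page_text(doc))
```

With the library installed, usage would look like `extract_all_text(PdfDocument("report.pdf"))`.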

Table extraction

pdfplumber returns tables as nested Python lists. PDF Oxide outputs Markdown:

# pdfplumber: returns a list of lists
import pdfplumber
with pdfplumber.open("report.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()
    for table in tables:
        for row in table:
            print(row)

# PDF Oxide: structured Markdown output
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
md = doc.to_markdown(0)
print(md)  # tables are rendered as Markdown tables
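
Downstream code that still expects pdfplumber-style nested lists can recover them from the Markdown string. The sketch below assumes the tables are emitted as pipe-delimited Markdown tables with a `|---|` separator row; that output format is an assumption, not something confirmed above. `markdown_tables_to_rows` is a hypothetical helper, and it does not handle escaped pipes inside cells.

```python
import re

# Matches one cell of a header separator row, such as --- or :-: or ---:
_SEPARATOR_CELL = re.compile(r"^:?-+:?$")


def _split_row(line: str) -> list[str]:
    """Split one pipe-delimited row into stripped cell strings."""
    s = line.strip()
    if s.startswith("|"):
        s = s[1:]
    if s.endswith("|"):
        s = s[:-1]
    return [cell.strip() for cell in s.split("|")]


def markdown_tables_to_rows(markdown: str) -> list[list[list[str]]]:
    """Return every pipe table in `markdown` as a list of rows of cells."""
    tables: list[list[list[str]]] = []
    current: list[list[str]] = []
    for line in markdown.splitlines():
        if line.strip().startswith("|"):
            cells = _split_row(line)
            # Skip the separator row that sits under the header.
            if not all(_SEPARATOR_CELL.match(c) for c in cells):
                current.append(cells)
        elif current:
            # A non-table line ends the table being collected.
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables
```

Every cell comes back as a string, including empty ones, so code that checks for `None` cells needs a small adjustment.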

Character-level extraction

# pdfplumber
import pdfplumber
with pdfplumber.open("report.pdf") as pdf:
    chars = pdf.pages[0].chars
    for c in chars:
        print(f"{c['text']} at ({c['x0']}, {c['top']})")

# PDF Oxide
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
chars = doc.extract_chars(0)
for c in chars:
    print(f"{c.char} at ({c.x}, {c.y})")
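
Code written against pdfplumber indexes characters as dicts (`c['text']`), while the PDF Oxide loop above uses attributes. A thin adapter, sketched here under the assumption that each character object exposes exactly the `.char`, `.x`, and `.y` attributes shown above, lets dict-based code keep working. The function name and dict keys are hypothetical. The keys are deliberately `x` and `y` rather than pdfplumber's `x0` and `top`, because the coordinate origin PDF Oxide uses is not stated here.

```python
from typing import Any, Iterable


def to_char_dicts(chars: Iterable[Any]) -> list[dict[str, Any]]:
    """Convert objects exposing .char, .x and .y into plain dicts."""
    return [{"text": c.char, "x": c.x, "y": c.y} for c in chars]
```

Check a few known characters against the pdfplumber output before relying on the positions, since the two libraries may measure `y` from opposite edges of the page.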

Encrypted PDFs (new capability)

pdfplumber cannot open encrypted PDFs. PDF Oxide handles them transparently:

from pdf_oxide import PdfDocument

# Works with every encryption method, including AES-256
doc = PdfDocument("encrypted.pdf", password="password")
text = doc.extract_text(0)
print(text)

Image extraction (new capability)

pdfplumber has no image extraction. PDF Oxide makes it simple:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])
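
For multi-page documents, the files need names that do not collide across pages. The sketch below is a hypothetical helper that assumes each image is a dict with `format` and `data` keys, as in the loop above; the filename scheme and the `bin` fallback extension are illustrative choices.

```python
from pathlib import Path
from typing import Any, Iterable


def save_images(images: Iterable[dict[str, Any]], out_dir: str, page_index: int) -> list[Path]:
    """Write each image's bytes to out_dir and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for n, img in enumerate(images):
        # Keep only alphanumerics so an unexpected format string cannot alter the path.
        ext = "".join(ch for ch in str(img["format"]).lower() if ch.isalnum()) or "bin"
        path = out / f"page{page_index:04d}_image{n:03d}.{ext}"
        path.write_bytes(img["data"])
        written.append(path)
    return written
```

A typical call would be `save_images(doc.extract_image_bytes(i), "images", i)` inside a loop over `range(doc.page_count())`.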

OCR for scanned documents (new capability)

pdfplumber cannot process scanned PDFs. PDF Oxide has OCR built in:

from pdf_oxide import PdfDocument

doc = PdfDocument("scanned.pdf")
text = doc.extract_text_ocr(0)
print(text)
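
Documents that mix digital and scanned pages usually do not need OCR on every page. One common approach, sketched here as a hypothetical helper, is to try regular extraction first and fall back to OCR only when a page yields almost no text. It uses only the `extract_text` and `extract_text_ocr` calls shown above; the `min_chars` threshold is an arbitrary assumption to tune for your documents.

```python
from typing import Any


def text_with_ocr_fallback(doc: Any, page_index: int, min_chars: int = 20) -> tuple[str, bool]:
    """Return (text, used_ocr) for one page.

    OCR runs only when regular extraction returns fewer than
    `min_chars` non-whitespace-trimmed characters.
    """
    text = doc.extract_text(page_index) or ""
    if len(text.strip()) >= min_chars:
        return text, False
    return doc.extract_text_ocr(page_index), True
```

The returned flag makes it easy to log which pages went through OCR during a migration test run.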

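Key differences

No context manager needed: pdfplumber uses `with pdfplumber.open(...) as pdf:`, while PDF Oxide needs no context manager.
Encrypted PDFs: pdfplumber cannot open encrypted files at all. PDF Oxide handles encryption transparently.
Tables: pdfplumber returns Python lists. PDF Oxide outputs tables as Markdown or HTML. For complex tables that need visual debugging, you can use pdfplumber alongside PDF Oxide.

Step 5: Test the migration

Run your existing test files through both libraries and compare the outputs: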

from pdf_oxide import PdfDocument

doc = PdfDocument("your-test-file.pdf")

# Verify text extraction
text = doc.extract_text(0)
print(text[:500])

# Verify the page count
print(f"Pages: {doc.page_count()}")

# Verify form fields (if applicable)
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name}: {f.value}")
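
To compare the two libraries rather than inspect one of them, it helps to score each page's text against the pdfplumber output. The sketch below uses the standard-library `difflib` after collapsing whitespace, since the two extractors are unlikely to agree on line breaks and spacing. The function names and the 0.95 threshold are assumptions for illustration, not part of either library.

```python
import difflib


def normalize(text: str) -> str:
    """Collapse all runs of whitespace so layout differences are ignored."""
    return " ".join(text.split())


def text_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two whitespace-normalized texts."""
    matcher = difflib.SequenceMatcher(None, normalize(a), normalize(b), autojunk=False)
    return matcher.ratio()


def pages_below_threshold(
    old_pages: list[str], new_pages: list[str], threshold: float = 0.95
) -> list[tuple[int, float]]:
    """Return (page_index, score) for every page scoring below `threshold`."""
    if len(old_pages) != len(new_pages):
        raise ValueError("page counts differ between the two extractions")
    flagged = []
    for i, (old, new) in enumerate(zip(old_pages, new_pages)):
        score = text_similarity(old, new)
        if score < threshold:
            flagged.append((i, score))
    return flagged
```

Here `old_pages` would be the per-page strings collected from the pdfplumber loop and `new_pages` the ones from `doc.extract_text(i)`. Pages that are flagged are worth reviewing by hand.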