What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

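PDF Oxide quick start (Julia)

PDF Oxide is the fastest PDF library for Julia: text extraction averages 0.8ms, with a 100% pass rate on 3,830 PDFs. The PdfOxide.jl package wraps the Rust core directly through the C ABI, so you get native speed together with an idiomatic Julia API. Page indexes start at 0, unlike Julia's 1-based arrays.

Installation

Add the package with Pkg from the Julia REPL: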

using Pkg
Pkg.add("PdfOxide")

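The native library (libpdf_oxide) is loaded at runtime. If it is not on the system loader path, you can point PdfOxide.jl to it. The package checks these locations in order: the PDF_OXIDE_LIB_PATH environment variable (full path to the file), the PDF_OXIDE_LIB_DIR environment variable (a directory), and finally the local target/release build directory.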

export PDF_OXIDE_LIB_DIR=/path/to/pdf_oxide/target/release
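
To see which locations that lookup order implies on a given machine, the minimal sketch below lists the candidates in the documented order. The helper candidate_lib_paths is hypothetical and not part of PdfOxide.jl, and the file name (libpdf_oxide plus the platform extension) is an assumption; the package's real resolution logic may differ.

```julia
using Libdl  # standard library; Libdl.dlext is "so", "dylib", or "dll"

# Hypothetical helper (not part of PdfOxide.jl): list candidate library
# locations in the documented order. The file name is an assumption.
function candidate_lib_paths(env = ENV; build_dir = joinpath("target", "release"))
    name = "libpdf_oxide." * Libdl.dlext
    paths = String[]
    # 1. Full path to the library file
    haskey(env, "PDF_OXIDE_LIB_PATH") && push!(paths, env["PDF_OXIDE_LIB_PATH"])
    # 2. Directory that contains the library
    haskey(env, "PDF_OXIDE_LIB_DIR") && push!(paths, joinpath(env["PDF_OXIDE_LIB_DIR"], name))
    # 3. Local build output
    push!(paths, joinpath(build_dir, name))
    return paths
end

# Report which candidates exist on this machine.
for p in candidate_lib_paths()
    println(isfile(p) ? "found:   " : "missing: ", p)
end
```

Passing a Dict instead of ENV makes it easy to try out different settings without changing the process environment.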

Quick start

Open a PDF and extract the text of the first page. extract_text takes a 0-based page index.

using PdfOxide

doc = open_document("report.pdf")

println("pages:   ", page_count(doc))
v = version(doc)
println("version: ", v.major, ".", v.minor)

# Plain text from the first page (0-based index)
println(extract_text(doc, 0))

You can also build a document in memory and open it from bytes, which is handy for tests and pipelines that never touch the disk:

using PdfOxide

pdf = from_markdown("# Hello pdf_oxide\n\nThis is the **Julia** binding.\n")
doc = open_from_bytes(to_bytes(pdf))

println("pages: ", page_count(doc))
println(extract_text(doc, 0))
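
Because page indexes are 0-based while Julia ranges usually start at 1, a loop over every page is easy to get off by one. The sketch below shows one way to build the index range; page_indices and collect_pages are hypothetical helpers, not part of PdfOxide.jl, and the commented usage relies only on the calls shown above.

```julia
# Valid 0-based page indexes for a document with n pages.
page_indices(n::Integer) = 0:(n - 1)

# Apply `extract` (any function of a 0-based page index) to every page.
collect_pages(extract, n::Integer) = [extract(i) for i in page_indices(n)]

# Intended use with the calls shown above (requires PdfOxide and a real file):
#   doc   = open_document("report.pdf")
#   texts = collect_pages(i -> extract_text(doc, i), page_count(doc))
```

For an empty document the range 0:-1 is empty, so the loop body never runs.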

Document inspection

Before extracting anything, a few cheap calls tell you what you are dealing with:

using PdfOxide

doc = open_document("report.pdf")

@show page_count(doc)        # number of pages
@show version(doc).major     # major component of the PDF spec version
@show is_encrypted(doc)      # true if the file is password-protected

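Markdown and HTML conversion

Convert a single page or the whole document at once. Markdown output preserves headings, lists, and emphasis; the variants with the _all suffix concatenate every page.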

using PdfOxide

doc = open_document("paper.pdf")

# One page (0-based)
md = to_markdown(doc, 0)
println(md)

# Whole document
println(to_markdown_all(doc))

# HTML for a single page
html = to_html(doc, 0)
println(html)

# Plain text without any markup
println(to_plain_text(doc, 0))

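Word-level extraction

extract_words returns a vector of Word values, each carrying its text, bounding box, font size, and a bold flag. The bounding box is a Bbox with width, height, and position fields.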

using PdfOxide

doc = open_document("paper.pdf")
words = extract_words(doc, 0)

for w in first(words, 10)
    println(rpad(w.text, 20),
            " size=", w.font_size,
            " bold=", w.bold,
            " width=", round(w.bbox.width; digits = 1))
end
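
The per-word fields lend themselves to simple filters, for example picking out bold words as rough heading candidates. The following is a sketch only: bold_texts and max_font_size are hypothetical helpers, they assume the bold field is a Bool, and the sample data is a stand-in for what extract_words(doc, 0) would return.

```julia
# Texts of the words flagged bold. Works on any collection whose elements
# have `text` and `bold` fields, as the Word values above do.
bold_texts(words) = [w.text for w in words if w.bold]

# Largest font size on the page, or `nothing` for an empty page.
max_font_size(words) = isempty(words) ? nothing : maximum(w.font_size for w in words)

# Stand-in data so the sketch runs without a PDF; real input would be
# extract_words(doc, 0).
sample = [(text = "Results", font_size = 18.0, bold = true),
          (text = "were",    font_size = 11.0, bold = false)]
println(bold_texts(sample))
println(max_font_size(sample))
```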

For a line-oriented layout, extract_text_lines returns TextLine values, each with its text, a word_count, and a bbox:

using PdfOxide

doc = open_document("paper.pdf")
lines = extract_text_lines(doc, 0)

for line in lines
    println(line.word_count, " words: ", line.text)
end
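
Line values can be aggregated just as easily. As a small sketch, the hypothetical helpers below (not part of PdfOxide.jl) total the word counts and rebuild the page text with one line per row; they depend only on the text and word_count fields described above, and the sample data stands in for extract_text_lines(doc, 0).

```julia
# Total number of words across all lines (0 for an empty page).
total_words(lines) = sum((l.word_count for l in lines); init = 0)

# Page text rebuilt from its lines, one line per row.
page_text(lines) = join((l.text for l in lines), "\n")

# Stand-in data; real input would be extract_text_lines(doc, 0).
sample_lines = [(text = "Quarterly results", word_count = 2),
                (text = "Revenue grew",      word_count = 2)]
println(total_words(sample_lines), " words")
println(page_text(sample_lines))
```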

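Search

Search a single page or the whole document. The last argument is the case-sensitivity flag (false means case-insensitive). Each hit reports its text, the page it was found on, and a bbox.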

using PdfOxide

doc = open_document("manual.pdf")

# Search one page (case-insensitive)
hits = search(doc, 0, "configuration", false)
for h in hits
    println("page ", h.page, ": ", h.text)
end

# Search every page
all_hits = search_all(doc, "configuration", false)
println(length(all_hits), " total matches")
for h in all_hits
    println("page ", h.page, " at (",
            round(h.bbox.x; digits = 0), ", ",
            round(h.bbox.y; digits = 0), ")")
end
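
When a document-wide search returns many hits, a per-page tally is often more useful than the raw list. The sketch below assumes only that each hit has an integer page field, as described above; hits_per_page is a hypothetical helper, and the sample data stands in for the result of search_all.

```julia
# Count hits per page. Works on any collection whose elements have a `page` field.
function hits_per_page(hits)
    counts = Dict{Int,Int}()
    for h in hits
        counts[h.page] = get(counts, h.page, 0) + 1
    end
    return counts
end

# Stand-in data; real input would be search_all(doc, "configuration", false).
sample_hits = [(page = 0, text = "configuration"),
               (page = 2, text = "Configuration"),
               (page = 2, text = "configuration")]

counts = hits_per_page(sample_hits)
for page in sort(collect(keys(counts)))
    println("page ", page, ": ", counts[page], " match(es)")
end
```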

Creating PDFs

The from_* factory functions build a Pdf from Markdown, HTML, or plain text. Call to_bytes to get the raw bytes, or save to write straight to a file.

using PdfOxide

# From Markdown
pdf = from_markdown("# Invoice\n\nAmount due: **\$42**\n")
save(pdf, "invoice.pdf")

# From HTML
html_pdf = from_html("<h1>Report</h1><p>Quarterly results.</p>")
save(html_pdf, "report.pdf")

# From plain text — grab the bytes instead of writing a file
text_pdf = from_text("Plain text body.")
bytes = to_bytes(text_pdf)
println("generated ", length(bytes), " bytes")
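
When the bytes are passed on rather than saved, a quick sanity check can catch an empty or truncated buffer early. PDF files begin with the marker %PDF-, so the sketch below tests for that prefix. looks_like_pdf is a hypothetical helper, not part of PdfOxide.jl, and it assumes to_bytes returns a byte vector such as Vector{UInt8}.

```julia
# True if the buffer starts with the PDF header marker "%PDF-".
function looks_like_pdf(bytes::AbstractVector{UInt8})
    magic = codeunits("%PDF-")
    return length(bytes) >= length(magic) && bytes[1:length(magic)] == magic
end

# Stand-in buffers; real input would be to_bytes(pdf).
println(looks_like_pdf(Vector{UInt8}("%PDF-1.7\n")))
println(looks_like_pdf(Vector{UInt8}("<html>")))
```

This checks only the header, not whether the rest of the file is well-formed.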

Error handling

Failed operations throw a PdfOxideError. Wrap calls that touch untrusted input in try/catch:

using PdfOxide

try
    doc = open_document("missing.pdf")
    println(extract_text(doc, 0))
catch e
    e isa PdfOxideError || rethrow()
    println("PDF error: ", e)
end
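
If the same pattern recurs, it can be folded into a small wrapper that returns nothing for the expected error type and rethrows everything else. This is a sketch under assumptions: try_or_nothing is a hypothetical helper, not part of PdfOxide.jl, and using it with PdfOxideError assumes that type is a subtype of Exception. The runnable demonstration uses the built-in ArgumentError instead.

```julia
# Run `f`; return its value, or `nothing` if it throws an exception of type E.
# Any other exception is rethrown unchanged.
function try_or_nothing(f, ::Type{E}) where {E<:Exception}
    try
        return f()
    catch err
        err isa E || rethrow()
        return nothing
    end
end

# Demonstration with a built-in error type:
println(try_or_nothing(() -> parse(Int, "42"), ArgumentError))
println(try_or_nothing(() -> parse(Int, "abc"), ArgumentError))

# Intended use with PdfOxide (requires the package):
#   text = try_or_nothing(PdfOxideError) do
#       extract_text(open_document("missing.pdf"), 0)
#   end
```

Returning nothing keeps the call site short, at the cost of discarding the error message; log err inside the catch block if that detail matters.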

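Next steps

Rust quick start: the native core that PDF Oxide is built on
Python quick start: using PDF Oxide from Python
Text extraction: detailed extraction options and practical tips
PDF creation: advanced creation with metadata and styling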