What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

PDF Oxide Quick Start (Zig)

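PDF Oxide is the fastest PDF library with built-in text extraction: 0.8ms mean time and a 100% pass rate across 3,830 PDFs. The Zig binding uses @cImport to provide an idiomatic Zig interface on top of the pdf_oxide C ABI, with no intermediate layer and first-class C interop. Handles are structs with a deinit method, and returned C strings and buffers are copied into the caller-supplied allocator.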
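The copy-into-allocator convention can be sketched with the standard library alone. The helper name dupeCString is hypothetical and is not part of pdf_oxide; the binding's real internals are not shown on this page, so treat this as an illustration of the general pattern rather than its implementation.

```zig
const std = @import("std");

// Hypothetical helper, not part of pdf_oxide: copy a NUL-terminated C string
// into memory owned by `allocator`, so the caller frees it with `allocator.free`.
fn dupeCString(allocator: std.mem.Allocator, c_str: [*:0]const u8) ![]u8 {
    // std.mem.span scans for the terminating NUL and yields a slice.
    const borrowed = std.mem.span(c_str);
    return allocator.dupe(u8, borrowed);
}

pub fn main() !void {
    const a = std.heap.page_allocator;

    // Stand-in for a string handed back across the C ABI.
    const from_c: [*:0]const u8 = "hello from C";

    const owned = try dupeCString(a, from_c);
    defer a.free(owned);

    std.debug.print("{s} ({d} bytes)\n", .{ owned, owned.len });
}
```

Once the bytes are copied, the Zig side no longer depends on the lifetime of the C buffer, which is why the documented results can be released with a plain a.free.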

Pin Zig to 0.15.1. Pre-1.0 Zig and its C-import API change between releases, so CI uses the same version.

Installation

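The binding links against the cdylib built with the default feature set, not the Python wheel. Build the native library first, then point Zig at the headers and the cdylib: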

# 1. build the native library (shipped binding feature set)
cargo build --release --lib \
  --features ocr,rendering,signatures,barcodes,tsa-client,system-fonts

# 2. test + run the example
cd zig
LD_LIBRARY_PATH="$PWD/../target/release" \
  zig build test \
    -DPDF_OXIDE_INCLUDE_DIR="$PWD/../include" \
    -DPDF_OXIDE_LIB_DIR="$PWD/../target/release"

LD_LIBRARY_PATH="$PWD/../target/release" \
  zig build example \
    -DPDF_OXIDE_INCLUDE_DIR="$PWD/../include" \
    -DPDF_OXIDE_LIB_DIR="$PWD/../target/release"

In your own code, import the module and start using it:

const pdf_oxide = @import("pdf_oxide");

Opening a PDF

Open a file with Document.open (or Document.openFromBytes for in-memory data) and inspect its metadata. Every handle owns C resources, so pair it with defer doc.deinit().

const std = @import("std");
const pdf_oxide = @import("pdf_oxide");

pub fn main() !void {
    // No allocator is declared here: Zig rejects unused locals, and none of
    // the calls below take one.
    var doc = try pdf_oxide.Document.open("research-paper.pdf");
    defer doc.deinit();

    std.debug.print("pages:   {d}\n", .{try doc.pageCount()});
    const v = doc.version();
    std.debug.print("version: {d}.{d}\n", .{ v.major, v.minor });
    std.debug.print("encrypted: {}\n", .{doc.isEncrypted()});
}

Text Extraction

extractText returns the text of a single page (zero-indexed). The result is owned by the allocator you pass in, so free it when you are done.

const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("report.pdf");
defer doc.deinit();

const text = try doc.extractText(a, 0);
defer a.free(text);
std.debug.print("{s}\n", .{text});

The whole-document variant extracts every page in one call:

const all_text = try doc.toPlainTextAll(a);
defer a.free(all_text);
std.debug.print("{s}\n", .{all_text});
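When you process pages one at a time instead, an arena allocator can replace the per-result free calls. The sketch below is standard-library only: fakeExtractText is a made-up stand-in for doc.extractText so that the example runs without the native library, and totalBytes is likewise a hypothetical helper.

```zig
const std = @import("std");

// Stand-in for doc.extractText: returns allocator-owned text for one page.
fn fakeExtractText(allocator: std.mem.Allocator, page: usize) ![]u8 {
    return std.fmt.allocPrint(allocator, "text of page {d}", .{page});
}

// Sums the extracted byte counts over `page_count` pages, using one arena so
// that no per-page free is needed.
fn totalBytes(backing: std.mem.Allocator, page_count: usize) !usize {
    var arena = std.heap.ArenaAllocator.init(backing);
    defer arena.deinit(); // releases every page's text at once
    const a = arena.allocator();

    var total: usize = 0;
    for (0..page_count) |page| {
        const text = try fakeExtractText(a, page);
        total += text.len;
    }
    return total;
}

pub fn main() !void {
    const total = try totalBytes(std.heap.page_allocator, 3);
    std.debug.print("extracted {d} bytes\n", .{total});
}
```

The trade-off is peak memory: the arena holds every page's text until deinit, so for very large documents freeing each page as you go may be the better choice.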

Markdown and HTML Conversion

Convert a single page or the whole document to Markdown or HTML. Each method returns an allocator-owned slice.

const md = try doc.toMarkdown(a, 0);
defer a.free(md);
std.debug.print("{s}\n", .{md});

const md_all = try doc.toMarkdownAll(a);
defer a.free(md_all);

const html = try doc.toHtml(a, 0);
defer a.free(html);

Word-Level Extraction

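extractWords returns a slice of Word structs, each carrying the text, bounding box, font, and bold flag. Release the whole slice with the matching freeWords helper, which frees each word's strings as well as the underlying slice.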

const words = try doc.extractWords(a, 0);
defer pdf_oxide.Document.freeWords(a, words);

for (words) |w| {
    std.debug.print("'{s}' at ({d:.1}, {d:.1}) font={s} size={d:.1} bold={}\n", .{
        w.text, w.bbox.x, w.bbox.y, w.fontName, w.fontSize, w.bold,
    });
}

Word fields:

Field	Type	Description
`text`	`[]u8`	Word text (owned by the allocator)
`bbox`	`Bbox`	`{ x, y, width, height }`, in points
`fontName`	`[]u8`	PostScript font name
`fontSize`	`f32`	Font size, in points
`bold`	`bool`	Whether the text run is bold

The same pattern extracts characters and text lines:

const chars = try doc.extractChars(a, 0);
defer pdf_oxide.Document.freeChars(a, chars);

const lines = try doc.extractTextLines(a, 0);
defer pdf_oxide.Document.freeTextLines(a, lines);

Search

search looks within a single page; searchAll scans every page. Both take a NUL-terminated search term and a case_sensitive flag, and return a slice of SearchResult.

const hits = try doc.searchAll(a, "configuration", false);
defer pdf_oxide.Document.freeSearchResults(a, hits);

for (hits) |hit| {
    std.debug.print("page {d}: '{s}' at ({d:.0}, {d:.0})\n", .{
        hit.page, hit.text, hit.bbox.x, hit.bbox.y,
    });
}

To limit the search to one page, use search and pass the page index:

const page_hits = try doc.search(a, 0, "Alpha", false);
defer pdf_oxide.Document.freeSearchResults(a, page_hits);

Creating PDFs

The Pdf type builds documents from Markdown, HTML, or plain text. toBytes serializes to memory; save writes to disk.

const a = std.heap.page_allocator;

var pdf = try pdf_oxide.Pdf.fromMarkdown("# Hello\n\nThis is a **Zig** PDF.\n");
defer pdf.deinit();

// Serialize to memory...
const bytes = try pdf.toBytes(a);
defer a.free(bytes);

// ...or write straight to disk.
try pdf.save("output.pdf");

You can feed a freshly built PDF straight back into the extractor for a full round trip:

var pdf = try pdf_oxide.Pdf.fromHtml("<h1>Invoice</h1><p>Amount: $42</p>");
defer pdf.deinit();

const bytes = try pdf.toBytes(a);
defer a.free(bytes);

var doc = try pdf_oxide.Document.openFromBytes(bytes);
defer doc.deinit();

const text = try doc.extractText(a, 0);
defer a.free(text);
std.debug.print("{s}\n", .{text});

Error Handling

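Fallible calls return Error!T, where Error is error{ PdfOxide, OutOfMemory }. Because Zig error values cannot carry extra data, the underlying C-ABI error code is exposed through lastErrorCode(); read it immediately after catching error.PdfOxide.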

const text = doc.extractText(a, 99) catch |err| switch (err) {
    error.PdfOxide => {
        std.debug.print("pdf_oxide error code: {d}\n", .{pdf_oxide.lastErrorCode()});
        return;
    },
    error.OutOfMemory => return err,
};
defer a.free(text);
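To see why the code must be read right away, here is a standard-library-only sketch of the "error value plus side-channel code" pattern. Everything in it is hypothetical: the threadlocal storage, the pageLength function, and the code value 3 are invented for illustration and say nothing about how pdf_oxide actually stores or numbers its error codes.

```zig
const std = @import("std");

const Error = error{ PdfOxide, OutOfMemory };

// Hypothetical side channel standing in for the binding's stored error code.
threadlocal var last_code: i32 = 0;

fn lastErrorCode() i32 {
    return last_code;
}

// Stand-in for a fallible binding call: fails for out-of-range pages.
fn pageLength(page: usize) Error!usize {
    const lengths = [_]usize{ 120, 340 };
    if (page >= lengths.len) {
        last_code = 3; // made-up code for illustration
        return error.PdfOxide;
    }
    last_code = 0; // a later successful call overwrites the stored code
    return lengths[page];
}

pub fn main() void {
    const len = pageLength(99) catch |err| switch (err) {
        // Read the code right away, before another call overwrites it.
        error.PdfOxide => {
            std.debug.print("failed with code {d}\n", .{lastErrorCode()});
            return;
        },
        error.OutOfMemory => {
            std.debug.print("out of memory\n", .{});
            return;
        },
    };
    std.debug.print("page length: {d}\n", .{len});
}
```

In this sketch any later call resets the stored code, so logging or retrying before reading it would lose the original failure reason.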

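Next Steps

Rust Quick Start: the native core that PDF Oxide is built on
Python Quick Start: using PDF Oxide from Python
Text Extraction: detailed extraction options and recipes
Creating PDFs: advanced creation with metadata and encryption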