What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (Objective-C)

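PDF Oxide provides idiomatic Objective-C bindings on top of its Rust core: 0.8ms mean text extraction and a 100% pass rate on 3,830 PDFs. The NSObject wrappers (POXDocument, POXPdf) own the underlying handles and release them automatically under ARC; returned strings come back as NSString; and any C-ABI error code surfaces as an NSError in the POXErrorDomain domain. New in v0.3.69.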

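Installation

The Objective-C binding links against the cdylib built with the default features and is compiled with clang under ARC. Build the native library first, then run make build against it:

# 1. Build the native library (the feature set the binding ships with)
cargo build --release --lib --features ocr,rendering,signatures,barcodes,tsa-client,system-fonts

# 2. Build the Objective-C binding (clang, ARC)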
cd objc
make build PDF_OXIDE_LIB_DIR="$PWD/../target/release"
DYLD_LIBRARY_PATH="$PWD/../target/release" ./basic_extraction

Import the single public header in your source files:

#import "POXPdfOxide.h"

Quick Start

Open a PDF, inspect its metadata, and extract text from the first page. Every fallible call takes a trailing NSError** parameter.

#import "POXPdfOxide.h"

NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"research-paper.pdf" error:&err];
if (!doc) {
    NSLog(@"open failed: %@", err.localizedDescription);
    return;
}

NSInteger pages = [doc pageCountError:&err];
POXVersion ver  = [doc version];
NSLog(@"pages: %ld  version: %d.%d", (long)pages, ver.major, ver.minor);

NSString *text = [doc extractText:0 error:&err];
NSLog(@"%@", text);

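You can also open a document from in-memory bytes, which is useful when the PDF arrives over the network or comes from a database, and you can open password-protected files:

POXDocument *doc = [POXDocument openFromBytes:pdfData error:&err];

// Encrypted document, supplying the password up front:
POXDocument *enc = [POXDocument openWithPassword:@"confidential.pdf"
                                       password:@"secret"
                                          error:&err];

// Or authenticate after opening: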
BOOL ok = [doc authenticate:@"secret" error:&err];

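Text Extraction

Plain text is the fastest path. Extract a single page by its zero-based page index, or extract the whole document in one call.

// Single page
NSString *text = [doc extractText:0 error:&err];

// Whole document, merged into one string
NSString *all = [doc toPlainTextAllWithError:&err];

// Page by page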
NSInteger count = [doc pageCountError:&err];
for (NSInteger i = 0; i < count; i++) {
    NSLog(@"--- page %ld ---\n%@", (long)i, [doc extractText:i error:&err]);
}

Words and Text Lines

extractWords: and extractTextLines: return arrays of element objects that carry bounding boxes and font metadata, all expressed in PDF user-space points.

NSArray<POXWord *> *words = [doc extractWords:0 error:&err];
for (POXWord *w in words) {
    POXBbox box = w.bbox;
    NSLog(@"'%@' at (%.1f, %.1f) %.1fx%.1f  font=%@ size=%.1f  bold=%d",
          w.text, box.x, box.y, box.width, box.height,
          w.fontName, w.fontSize, w.bold);
}

NSArray<POXTextLine *> *lines = [doc extractTextLines:0 error:&err];
for (POXTextLine *line in lines) {
    NSLog(@"%@  (%ld words)", line.text, (long)line.wordCount);
}

POXChar (returned by extractChars:) exposes the same structure at character granularity: character, bbox, fontName, and fontSize.
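As a rough illustration of character-level access, the sketch below tallies how many characters on the first page use each font. It assumes the selector is spelled extractChars:error: and returns NSArray<POXChar *> *, mirroring extractWords:error:. The text above only names extractChars:, so check POXPdfOxide.h for the exact signature. The file name is a placeholder.

```objc
#import <Foundation/Foundation.h>
#import "POXPdfOxide.h"

int main(void) {
    @autoreleasepool {
        NSError *err = nil;
        POXDocument *doc = [POXDocument openPath:@"research-paper.pdf" error:&err];
        if (!doc) {
            NSLog(@"open failed: %@", err.localizedDescription);
            return 1;
        }

        // Assumption: extractChars:error: mirrors extractWords:error:.
        NSArray<POXChar *> *chars = [doc extractChars:0 error:&err];
        if (!chars) {
            NSLog(@"extractChars failed: %@", err.localizedDescription);
            return 1;
        }

        // Count characters per font name; a missing name is grouped separately.
        NSCountedSet<NSString *> *fonts = [NSCountedSet set];
        for (POXChar *c in chars) {
            [fonts addObject:(c.fontName ?: @"(unknown)")];
        }
        for (NSString *name in fonts) {
            NSLog(@"%@: %lu chars", name,
                  (unsigned long)[fonts countForObject:name]);
        }
    }
    return 0;
}
```

A tally like this can help spot which fonts act as body text and which as headings before deciding how to post-process a page.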

Markdown and HTML

Convert a single page, or the whole document, to Markdown or HTML.

// Single page
NSString *md   = [doc toMarkdown:0 error:&err];
NSString *html = [doc toHtml:0 error:&err];

// Whole document
NSString *mdAll   = [doc toMarkdownAllWithError:&err];
NSString *htmlAll = [doc toHtmlAllWithError:&err];
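A common next step is writing the converted Markdown to disk. The sketch below combines toMarkdownAllWithError: from the snippet above with Foundation's writeToFile:atomically:encoding:error:. Both file names are placeholders chosen for illustration.

```objc
#import <Foundation/Foundation.h>
#import "POXPdfOxide.h"

int main(void) {
    @autoreleasepool {
        NSError *err = nil;
        POXDocument *doc = [POXDocument openPath:@"research-paper.pdf" error:&err];
        if (!doc) {
            NSLog(@"open failed: %@", err.localizedDescription);
            return 1;
        }

        // Convert every page to one Markdown string.
        NSString *md = [doc toMarkdownAllWithError:&err];
        if (!md) {
            NSLog(@"conversion failed: %@", err.localizedDescription);
            return 1;
        }

        // Foundation call: write the string as UTF-8, atomically.
        BOOL ok = [md writeToFile:@"research-paper.md"
                       atomically:YES
                         encoding:NSUTF8StringEncoding
                            error:&err];
        if (!ok) {
            NSLog(@"write failed: %@", err.localizedDescription);
            return 1;
        }
    }
    return 0;
}
```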

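Search

Search a single page with search:term:caseSensitive:error:, or the whole document with searchAll:caseSensitive:error:. Both return an array of POXSearchResult objects carrying the matched text, the page index, and the bounding box.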

NSArray<POXSearchResult *> *hits =
    [doc searchAll:@"configuration" caseSensitive:NO error:&err];

for (POXSearchResult *r in hits) {
    POXBbox b = r.bbox;
    NSLog(@"page %ld: '%@' at (%.0f, %.0f)", (long)r.page, r.text, b.x, b.y);
}

// Single-page variant:
NSArray<POXSearchResult *> *pageHits =
    [doc search:0 term:@"configuration" caseSensitive:NO error:&err];
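Search results can be aggregated with ordinary Foundation collections. The following sketch counts hits per page index using only the page property shown above. The file name and search term are placeholders, and treating a nil return as failure follows the error convention this guide describes.

```objc
#import <Foundation/Foundation.h>
#import "POXPdfOxide.h"

int main(void) {
    @autoreleasepool {
        NSError *err = nil;
        POXDocument *doc = [POXDocument openPath:@"research-paper.pdf" error:&err];
        if (!doc) {
            NSLog(@"open failed: %@", err.localizedDescription);
            return 1;
        }

        NSArray<POXSearchResult *> *hits =
            [doc searchAll:@"configuration" caseSensitive:NO error:&err];
        if (!hits) {
            NSLog(@"search failed: %@", err.localizedDescription);
            return 1;
        }

        // Map page index -> number of hits on that page.
        NSMutableDictionary<NSNumber *, NSNumber *> *perPage =
            [NSMutableDictionary dictionary];
        for (POXSearchResult *r in hits) {
            NSNumber *key = @(r.page);
            // A missing key yields nil, whose integerValue is 0.
            perPage[key] = @(perPage[key].integerValue + 1);
        }

        // Print the counts in ascending page order.
        NSArray<NSNumber *> *pages =
            [perPage.allKeys sortedArrayUsingSelector:@selector(compare:)];
        for (NSNumber *p in pages) {
            NSLog(@"page %@: %@ hit(s)", p, perPage[p]);
        }
    }
    return 0;
}
```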

Creating PDFs

The POXPdf builder generates PDFs from Markdown, HTML, or plain text. You can save the result to a path or get the bytes back as NSData.

POXPdf *pdf = [POXPdf fromMarkdown:@"# Hello World\n\nThis is a PDF.\n"
                            error:&err];
[pdf saveToPath:@"output.pdf" error:&err];

// Or keep the bytes in memory
NSData *bytes = [pdf toBytesWithError:&err];

// Constructors for HTML and plain text are also available
POXPdf *invoice = [POXPdf fromHtml:@"<h1>Invoice</h1><p>Amount: $42</p>"
                            error:&err];
POXPdf *notes   = [POXPdf fromText:@"Plain text content." error:&err];

Feed the builder's output straight into a document for extraction to complete a round trip:

POXPdf *pdf      = [POXPdf fromMarkdown:@"# Report\n\nBody text.\n" error:&err];
POXDocument *doc = [POXDocument openFromBytes:[pdf toBytesWithError:&err]
                                        error:&err];
NSLog(@"%@", [doc extractText:0 error:&err]);

Error Handling

Every fallible method writes to the trailing NSError** and returns nil or a sentinel value on failure. Errors belong to the POXErrorDomain domain.

NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"document.pdf" error:&err];
if (!doc) {
    if ([err.domain isEqualToString:POXErrorDomain]) {
        NSLog(@"PDF error: %@", err.localizedDescription);
    }
    return;
}

NSString *text = [doc extractText:0 error:&err];
if (!text) {
    NSLog(@"extract failed: %@", err.localizedDescription);
}

Handles are released automatically under ARC, but you can also release the underlying handle early with -close, which is idempotent:

[doc close];
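Releasing early matters most when many files are processed in one run. The sketch below is one possible batch loop, assuming PDF paths are passed as command-line arguments. It uses only openPath:error:, extractText:error:, and -close from this guide; the per-iteration @autoreleasepool is a general Cocoa practice rather than something PDF Oxide requires.

```objc
#import <Foundation/Foundation.h>
#import "POXPdfOxide.h"

int main(int argc, const char *argv[]) {
    for (int i = 1; i < argc; i++) {
        // Drain temporaries after each file so memory stays flat.
        @autoreleasepool {
            NSString *path = [NSString stringWithUTF8String:argv[i]];
            if (!path) {
                NSLog(@"argument %d is not valid UTF-8, skipping", i);
                continue;
            }

            NSError *err = nil;
            POXDocument *doc = [POXDocument openPath:path error:&err];
            if (!doc) {
                NSLog(@"%@: open failed: %@", path, err.localizedDescription);
                continue;
            }

            NSString *text = [doc extractText:0 error:&err];
            if (text) {
                // length counts UTF-16 code units, not user-visible characters.
                NSLog(@"%@: %lu UTF-16 units on the first page",
                      path, (unsigned long)text.length);
            } else {
                NSLog(@"%@: extract failed: %@", path, err.localizedDescription);
            }

            // Release the native handle now instead of waiting for ARC.
            [doc close];
        }
    }
    return 0;
}
```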

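Next Steps

Getting Started with Rust: use PDF Oxide from Rust
Getting Started with Python: use PDF Oxide from Python
Text Extraction: detailed extraction options and practical examples
PDF Creation: advanced creation with the builder API
Editing: modify existing PDFs, annotations, and form fields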