What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (Swift)

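PDF Oxide is the fastest PDF library for built-in text extraction: a 0.8ms mean and a 100% pass rate on 3,830 PDFs. The Swift binding was added in v0.3.69 and wraps the Rust core through the C ABI: handles are owned by classes (and released in deinit), C buffers are copied into Swift String/[UInt8] values, and error codes are thrown as PdfOxideError.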

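Installation

The binding links against the cdylib built with the default feature set. Build the native library first, then point SwiftPM at the corresponding headers and library files: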

# 1. Build the native library (with the feature set the binding needs)
cargo build --release --lib --features ocr,rendering,signatures,barcodes,tsa-client,system-fonts

# 2. Run the tests and the example (Package.swift reads PDF_OXIDE_INCLUDE_DIR / _LIB_DIR)
cd swift
export PDF_OXIDE_INCLUDE_DIR="$PWD/../include"
export PDF_OXIDE_LIB_DIR="$PWD/../target/release"
DYLD_LIBRARY_PATH="$PDF_OXIDE_LIB_DIR" swift test
DYLD_LIBRARY_PATH="$PDF_OXIDE_LIB_DIR" swift run basic_extraction

Quick Start

Build a PDF from Markdown, open it from the resulting bytes, then extract the text. The whole round trip needs no external test files:

import PdfOxide

let pdf = try Pdf.fromMarkdown("# Hello pdf_oxide\n\nThis is a **Swift** binding.\n")
let doc = try Document.openFromBytes(try pdf.toBytes())

print("pages:   \(try doc.pageCount())")
print("version: \(try doc.version())")
print(try doc.extractText(0))

To open a file from disk, use Document.open(_:):

import PdfOxide

let doc = try Document.open("research-paper.pdf")
print("Pages:   \(try doc.pageCount())")
print("Version: \(try doc.version())")        // e.g. 1.7

Text Extraction

extractText(_:) returns the text of a single page (zero-indexed). Iterate over pageCount() to read the whole document:

import PdfOxide

let doc = try Document.open("book.pdf")
for i in 0..<(try doc.pageCount()) {
    print("--- Page \(i + 1) ---")
    print(try doc.extractText(i))
}
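
If you want the per-page loop above to produce one combined string with explicit page markers, a small helper along these lines may be useful. This is only a sketch: joinPages and its parameters are hypothetical names, and the extractor is passed in as a closure so the example runs without the library.

```swift
// Hypothetical helper, not part of PdfOxide: joins per-page text into one
// string with a "--- Page N ---" marker before each page.
func joinPages(pageCount: Int,
               extract: (Int) throws -> String) rethrows -> String {
    var parts: [String] = []
    for i in 0..<pageCount {
        // Pages are zero-indexed for the extractor but labeled from 1.
        let text = try extract(i)
        parts.append("--- Page \(i + 1) ---\n" + text)
    }
    return parts.joined(separator: "\n\n")
}

// Usage with a fake extractor standing in for doc.extractText(_:).
let combined = joinPages(pageCount: 2) { "text of page \($0)" }
print(combined)
```

With a real document, the closure would presumably be { try doc.extractText($0) } and the call would need try, assuming pageCount() returns an Int as the loop above suggests.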

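toPlainText(_:) returns flattened text with layout information removed, while the *All() family of methods extracts every page in one call:

let doc = try Document.open("report.pdf")
let plain = try doc.toPlainText(0)            // single page, no layout
let everything = try doc.toPlainTextAll()     // all pages concatenated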

Words and Characters

extractWords(_:) returns [Word], where each word carries a bounding box and font metadata. extractChars(_:) returns [Char], which gives per-character positioning:

import PdfOxide

let doc = try Document.open("paper.pdf")

let words = try doc.extractWords(0)
for word in words.prefix(10) {
    print("'\(word.text)' at (\(word.bbox.x), \(word.bbox.y)) "
        + "font=\(word.fontName) size=\(word.fontSize) bold=\(word.bold)")
}

let chars = try doc.extractChars(0)
for ch in chars.prefix(10) {
    let scalar = Unicode.Scalar(ch.character).map(String.init) ?? "?"
    print("'\(scalar)' size=\(ch.fontSize) font=\(ch.fontName)")
}

Word fields: text (String), bbox (Bbox), fontName (String), fontSize (Double), bold (Bool). Char fields: character (a UInt32 code point), bbox, fontName, fontSize. Bbox exposes x, y, width, and height, all of type Double.
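
Because character is a raw UInt32 code point rather than a Swift Character, rebuilding a string from a run of characters takes one conversion step. The sketch below shows one way to do it in plain Swift. The decode function is a hypothetical helper, not part of PdfOxide, and substituting U+FFFD for invalid values is a choice made for this example.

```swift
// Hypothetical helper: turns UInt32 code points (such as the values in
// Char.character) into a String.
func decode(_ codePoints: [UInt32]) -> String {
    let replacement: Unicode.Scalar = "\u{FFFD}"
    var scalars = String.UnicodeScalarView()
    for cp in codePoints {
        // Unicode.Scalar(_: UInt32) is failable: it returns nil for
        // surrogates and for values above U+10FFFF.
        scalars.append(Unicode.Scalar(cp) ?? replacement)
    }
    return String(scalars)
}

print(decode([0x48, 0x69]))   // two valid code points
print(decode([0xD800]))       // a lone surrogate becomes U+FFFD
```

With the binding, the input would presumably be chars.map { $0.character }.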

You can also extract text line by line with extractTextLines(_:), which returns [TextLine] (text, bbox, wordCount):

let lines = try doc.extractTextLines(0)
for line in lines {
    print("\(line.wordCount) words: \(line.text)")
}

Markdown and HTML Conversion

Convert a single page or the whole document to Markdown or HTML:

import PdfOxide

let doc = try Document.open("paper.pdf")

let md = try doc.toMarkdown(0)        // one page to Markdown
let mdAll = try doc.toMarkdownAll()   // whole document to Markdown
let html = try doc.toHtml(0)          // one page to HTML
let htmlAll = try doc.toHtmlAll()     // whole document to HTML

print(mdAll)

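Search

search(_:_:_:) searches a single page and takes the page index as its first argument; searchAll(_:_:) searches the whole document. Both take a search term and a caseSensitive flag and return [SearchResult] (text, page, bbox):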

import PdfOxide

let doc = try Document.open("manual.pdf")

// Search a single page (page 0, case-insensitive)
let hits = try doc.search(0, "configuration", false)
for hit in hits {
    print("page \(hit.page): '\(hit.text)' at (\(hit.bbox.x), \(hit.bbox.y))")
}

// Search the whole document
let allHits = try doc.searchAll("configuration", false)
print("\(allHits.count) total matches")
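
Document-wide results are often easier to read once they are summarized per page. The following sketch groups hits by page number using only the standard library. Hit is a hypothetical stand-in that mirrors two of the SearchResult fields listed above, and treating page as an Int is an assumption.

```swift
// Hypothetical stand-in for SearchResult, so the sketch runs without the library.
struct Hit {
    let text: String
    let page: Int   // assumed to be an Int
}

// Counts matches per page and returns the pages in ascending order.
func countByPage(_ hits: [Hit]) -> [(page: Int, count: Int)] {
    let grouped = Dictionary(grouping: hits, by: { $0.page })
    return grouped
        .map { (page: $0.key, count: $0.value.count) }
        .sorted { $0.page < $1.page }
}

let summary = countByPage([
    Hit(text: "configuration", page: 2),
    Hit(text: "Configuration", page: 0),
    Hit(text: "configuration", page: 2),
])
for entry in summary {
    print("page \(entry.page): \(entry.count) match(es)")
}
```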

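Creating PDFs

The Pdf type provides a set of factory methods that build a document from a source format. Save the result to disk with save(_:), or get the raw bytes with toBytes():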

import PdfOxide

try Pdf.fromMarkdown("# Hello World\n\nThis is a PDF.").save("output.pdf")
try Pdf.fromHtml("<h1>Invoice</h1><p>Amount: $42</p>").save("invoice.pdf")
try Pdf.fromText("Plain text content.").save("notes.pdf")

let bytes = try Pdf.fromMarkdown("# In-memory\n\nbody\n").toBytes()
print("produced \(bytes.count) bytes")
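
When you pass the bytes on (to an upload or a cache, for example), a cheap sanity check is to confirm that they begin with the standard PDF header, "%PDF-". The sketch below does this in plain Swift. looksLikePdf is a hypothetical helper, not part of PdfOxide, and it assumes the bytes arrive as [UInt8].

```swift
// Hypothetical helper: checks for the "%PDF-" header that a PDF file starts with.
// This is only a quick sanity check, not a validation of the document.
func looksLikePdf(_ bytes: [UInt8]) -> Bool {
    let magic = Array("%PDF-".utf8)
    return bytes.starts(with: magic)
}

// Usage with hand-made byte arrays, so the sketch runs without the library.
print(looksLikePdf(Array("%PDF-1.7\n".utf8)))
print(looksLikePdf(Array("<html>".utf8)))
```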

Error Handling

Every call that can fail throws a PdfOxideError, which carries the name of the failed operation and the underlying C-ABI error code:

import PdfOxide

do {
    let doc = try Document.open("document.pdf")
    print(try doc.extractText(0))
} catch let error as PdfOxideError {
    print("PDF error: \(error)")   // e.g. "PdfOxideError: open failed (error code 1)"
}
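
In batch jobs it can be more convenient to log a failure and move on than to wrap every call in its own do/catch. The sketch below shows one way to do that with a generic wrapper. attempt and DemoError are hypothetical names for this example, and the wrapper catches any Error, not only PdfOxideError.

```swift
// Hypothetical stand-in error, so the sketch runs without the library.
struct DemoError: Error {}

// Runs `body`; on failure, logs the error with a label and returns nil.
// Similar in effect to `try?`, but the error is printed instead of dropped.
func attempt<T>(_ label: String, _ body: () throws -> T) -> T? {
    do {
        return try body()
    } catch {
        print("\(label) failed: \(error)")
        return nil
    }
}

let ok: Int? = attempt("count pages") { 3 }
let failed: Int? = attempt("open file") { throw DemoError() }
```

With the binding, a call might look like attempt("open") { try Document.open("document.pdf") }.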

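Next Steps

Rust Quick Start: use PDF Oxide from Rust
Python Quick Start: use PDF Oxide from Python
Text Extraction: detailed extraction options and practical examples
Creating PDFs: advanced creation with metadata and encryption
Editing: modify existing PDFs, annotations, and form fields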