What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

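PDF Oxide (Rust) Quick Start

PDF Oxide is the fastest Rust PDF crate, with built-in text extraction: 0.8 ms mean, and a 100% pass rate on 3,830 PDFs. A single library covers extraction, creation, and editing.

Installation

Add pdf_oxide to your Cargo.toml: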

[dependencies]
pdf_oxide = "0.3"

Feature flags

Enable features as needed. The entries below are alternatives; keep a single pdf_oxide line in your Cargo.toml:

# Default: text extraction, creation, editing
pdf_oxide = "0.3"

# Render pages to images
pdf_oxide = { version = "0.3", features = ["rendering"] }

# Barcode generation
pdf_oxide = { version = "0.3", features = ["barcodes"] }

# Digital signatures
pdf_oxide = { version = "0.3", features = ["signatures"] }

# Office document conversion (DOCX, XLSX, PPTX)
pdf_oxide = { version = "0.3", features = ["office"] }

# Everything
pdf_oxide = { version = "0.3", features = ["full"] }

Opening a PDF

Load a file with PdfDocument::open() and inspect its metadata.

use pdf_oxide::PdfDocument;

let doc = PdfDocument::open("research-paper.pdf")?;
println!("Pages: {}", doc.page_count());
println!("PDF version: {}", doc.version());

Text extraction

Plain text

use pdf_oxide::PdfDocument;

let doc = PdfDocument::open("report.pdf")?;
let text = doc.extract_text(0)?;
println!("{text}");

Text spans

extract_spans() returns a Vec<TextSpan>. Each span is a run of uniformly styled text and carries its font metadata.

use pdf_oxide::PdfDocument;

let doc = PdfDocument::open("paper.pdf")?;
let spans = doc.extract_spans(0)?;

for span in &spans {
    println!("'{}' at ({:.1}, {:.1}) font={} size={:.1}",
        span.text, span.x, span.y, span.font_name, span.font_size);
}

TextSpan fields:

Field	Type	Description
`text`	`String`	Text content
`x`	`f64`	Horizontal position (points)
`y`	`f64`	Vertical position (points)
`font_name`	`String`	PostScript font name
`font_size`	`f64`	Font size (points)
`bbox`	`Rect`	Bounding box
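
Since every span carries its position, spans can be regrouped into visual lines. The following is a minimal sketch, not part of pdf_oxide: `Span` is a hypothetical stand-in that mirrors only the `text`, `x`, and `y` fields so the example compiles without the crate, and `group_into_lines` and its tolerance parameter are illustrative names. It also assumes a bottom-left origin, so a larger `y` means higher on the page.

```rust
/// Stand-in for the documented TextSpan fields (hypothetical; not the crate's type).
#[derive(Debug, Clone)]
struct Span {
    text: String,
    x: f64,
    y: f64,
}

/// Groups spans whose y differs from the line's first span by at most `tol`
/// points, then joins each line's spans left to right with single spaces.
fn group_into_lines(mut spans: Vec<Span>, tol: f64) -> Vec<String> {
    // Top of page first (assumes a bottom-left origin), then left to right.
    spans.sort_by(|a, b| b.y.total_cmp(&a.y).then(a.x.total_cmp(&b.x)));

    let mut lines: Vec<(f64, Vec<Span>)> = Vec::new();
    for s in spans {
        // Start a new line when there is none yet or the span is too far away.
        let start_new = match lines.last() {
            Some((line_y, _)) => (*line_y - s.y).abs() > tol,
            None => true,
        };
        if start_new {
            lines.push((s.y, vec![s]));
        } else {
            lines.last_mut().unwrap().1.push(s);
        }
    }

    lines
        .into_iter()
        .map(|(_, mut line)| {
            // Spans on one line may have slightly different y, so re-sort by x.
            line.sort_by(|a, b| a.x.total_cmp(&b.x));
            line.iter().map(|s| s.text.as_str()).collect::<Vec<_>>().join(" ")
        })
        .collect()
}
```

For example, spans "World" at (120, 700.2), "Hello" at (72, 700), and "Next" at (72, 686) with a tolerance of 1.0 yield the lines "Hello World" and "Next".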

Character-level extraction

extract_chars() returns a Vec<TextChar> with the exact position of every character.

use pdf_oxide::PdfDocument;

let doc = PdfDocument::open("paper.pdf")?;
let chars = doc.extract_chars(0)?;

for ch in chars.iter().take(10) {
    println!("'{}' at ({:.1}, {:.1}) size={:.1} font={}",
        ch.char, ch.x, ch.y, ch.font_size, ch.font_name);
}

TextChar fields:

Field	Type	Description
`char`	`char`	Unicode character
`x`	`f64`	Horizontal position (points)
`y`	`f64`	Vertical position (points)
`font_size`	`f64`	Font size (points)
`font_name`	`String`	PostScript font name
`bbox`	`Rect`	Bounding box

Markdown conversion

Convert a page to Markdown, with configurable options.

use pdf_oxide::PdfDocument;
use pdf_oxide::converters::ConversionOptions;

let doc = PdfDocument::open("paper.pdf")?;
let options = ConversionOptions { detect_headings: true, ..Default::default() };
let md = doc.to_markdown(0, &options)?;
println!("{md}");

HTML conversion

use pdf_oxide::PdfDocument;

let doc = PdfDocument::open("paper.pdf")?;
let html = doc.to_html(0)?;
println!("{html}");

Image extraction

extract_images() returns the metadata and raw data of every image on a page, including images in the content stream and images inside nested Form XObjects.

use pdf_oxide::PdfDocument;

let doc = PdfDocument::open("brochure.pdf")?;
let images = doc.extract_images(0)?;

for (i, img) in images.iter().enumerate() {
    println!("Image {i}: {}x{} {} {}bpc ({} bytes)",
        img.width, img.height, img.color_space,
        img.bits_per_component, img.data.len());
}
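
The reported dimensions make it possible to sanity-check the size of an uncompressed sample buffer. This is a rough sketch under stated assumptions: `expected_raw_len` is a hypothetical helper, the component count (3 for RGB, 1 for grayscale) has to be derived from the color space by the caller, and the check only applies if `data` holds decoded samples rather than an encoded stream such as JPEG. In the PDF image model, each row of samples is padded to a whole number of bytes.

```rust
/// Expected byte length of an uncompressed image buffer (hypothetical helper).
/// `components` is the number of color channels per pixel, e.g. 3 for RGB.
fn expected_raw_len(width: u32, height: u32, components: u32, bits_per_component: u32) -> usize {
    let bits_per_row = width as usize * components as usize * bits_per_component as usize;
    // Each row starts on a byte boundary, so round the row up to whole bytes.
    let bytes_per_row = (bits_per_row + 7) / 8;
    bytes_per_row * height as usize
}
```

A 640 x 480 RGB image at 8 bits per component comes to 921,600 bytes, while a 10 x 2 one-bit mask needs 2 bytes per row and 4 bytes in total.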

Use extract_images_to_files() to write images directly to disk:

let doc = PdfDocument::open("brochure.pdf")?;
let paths = doc.extract_images_to_files(0, "output_dir")?;
for path in &paths {
    println!("Saved: {}", path.display());
}

Creating PDFs

Factory methods

The Pdf type provides convenient high-level factory methods.

use pdf_oxide::api::Pdf;

let mut pdf = Pdf::from_markdown("# Hello World\n\nThis is a PDF.")?;
pdf.save("output.pdf")?;

let mut pdf = Pdf::from_html("<h1>Invoice</h1><p>Amount: $42</p>")?;
pdf.save("invoice.pdf")?;

let mut pdf = Pdf::from_text("Plain text content.")?;
pdf.save("notes.pdf")?;

let mut pdf = Pdf::from_image("scan.jpg")?;
pdf.save("scan.pdf")?;

PdfBuilder fluent API

For full control over metadata, page size, and margins:

use pdf_oxide::api::PdfBuilder;
use pdf_oxide::writer::PageSize;

let mut pdf = PdfBuilder::new()
    .title("Annual Report")
    .author("Acme Corp")
    .page_size(PageSize::A4)
    .margins(72.0, 72.0, 72.0, 72.0)
    .font_size(11.0)
    .from_markdown("# Annual Report\n\n...")?;

pdf.save("annual-report.pdf")?;
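
PDF lengths are expressed in points, with 72 points to the inch, so 72.0-point margins are one inch on each side. As a worked example of the arithmetic, the helpers below convert A4 (210 x 297 mm) to points and compute the area left for content. The helper names are illustrative and not part of pdf_oxide, and it is assumed, not confirmed by the source, that margins() takes values in points.

```rust
const PT_PER_INCH: f64 = 72.0;
const MM_PER_INCH: f64 = 25.4;

/// Converts millimetres to PDF points.
fn mm_to_pt(mm: f64) -> f64 {
    mm / MM_PER_INCH * PT_PER_INCH
}

/// Width and height left for content after applying the same margin on all four sides.
fn content_area(page_w: f64, page_h: f64, margin: f64) -> (f64, f64) {
    (page_w - 2.0 * margin, page_h - 2.0 * margin)
}
```

Here `mm_to_pt(210.0)` is about 595.28 and `mm_to_pt(297.0)` about 841.89, so an A4 page with one-inch margins leaves roughly 451.28 x 697.89 points for content.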

DocumentBuilder low-level API

Place text, shapes, and images at exact coordinates:

use pdf_oxide::writer::DocumentBuilder;

let mut builder = DocumentBuilder::new();
builder.add_page(612.0, 792.0)
    .text("Hello, world!", 72.0, 720.0, 12.0)
    .rect(100.0, 600.0, 200.0, 50.0)
    .image_at("logo.png", 400.0, 700.0, 100.0, 50.0)?;

builder.save("custom.pdf")?;
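
The page above is 612 x 792 points, which is US Letter (8.5 x 11 inches at 72 points per inch). PDF's default coordinate system puts the origin at the bottom-left corner. Assuming DocumentBuilder follows that convention, which the source does not state, text at y = 720 sits 72 points (one inch) below the top edge. The hypothetical helpers below convert from layout coordinates measured down from the top:

```rust
/// Converts a distance measured down from the top edge into a bottom-left-origin y.
fn y_from_top(page_height: f64, distance_from_top: f64) -> f64 {
    page_height - distance_from_top
}

/// Bottom-left-origin y of the lower edge of a box whose upper edge is
/// `top` points below the top of the page.
fn box_y_from_top(page_height: f64, top: f64, box_height: f64) -> f64 {
    page_height - top - box_height
}
```

With a page height of 792, `y_from_top(792.0, 72.0)` gives 720, and a 50-point-tall box whose upper edge is 142 points from the top has its lower edge at y = 600.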

Search

Search the whole document for text, or use finer-grained options.

use pdf_oxide::api::Pdf;

let pdf = Pdf::open("manual.pdf")?;

// Simple search across all pages
let results = pdf.search("configuration")?;
for r in &results {
    println!("Page {}: '{}' at ({:.0}, {:.0})", r.page, r.text, r.x, r.y);
}

use pdf_oxide::api::{Pdf, SearchOptions};

let pdf = Pdf::open("manual.pdf")?;

let opts = SearchOptions {
    case_sensitive: false,
    whole_word: true,
    max_results: Some(50),
    ..Default::default()
};
let results = pdf.search_with_options("configuration", &opts)?;
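
To make the three options concrete, here is one plausible reading of them, sketched over a plain string with the standard library only. This is not how pdf_oxide implements search: `find_whole_words`, its parameters, and the definition of a word character are assumptions made for illustration.

```rust
/// Word characters for boundary checks (an assumption: letters, digits, underscore).
fn is_word_char(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Returns the byte offsets of whole-word matches of `query` in `text`,
/// stopping after `max_results` matches when a limit is given.
fn find_whole_words(
    text: &str,
    query: &str,
    case_sensitive: bool,
    max_results: Option<usize>,
) -> Vec<usize> {
    // ASCII-only case folding keeps byte offsets identical to the original text.
    let (t, q) = if case_sensitive {
        (text.to_string(), query.to_string())
    } else {
        (text.to_ascii_lowercase(), query.to_ascii_lowercase())
    };
    if q.is_empty() {
        return Vec::new();
    }

    t.match_indices(q.as_str())
        .filter(|(start, m)| {
            // A whole-word match has no word character directly before or after it.
            let before = t[..*start].chars().next_back();
            let after = t[*start + m.len()..].chars().next();
            !before.map_or(false, is_word_char) && !after.map_or(false, is_word_char)
        })
        .map(|(start, _)| start)
        .take(max_results.unwrap_or(usize::MAX))
        .collect()
}
```

On the text "Configuration: see configuration, reconfiguration, and CONFIGURATION.", a case-insensitive search for "configuration" returns offsets 0, 19, and 55; the occurrence inside "reconfiguration" is rejected by the whole-word check.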

Editing

DocumentEditor

Open an existing PDF and perform structural edits such as page rotation and form-field operations.

use pdf_oxide::api::Pdf;

let mut pdf = Pdf::open_editor("form-template.pdf")?;

// Rotate a page
pdf.rotate_page(0, 90)?;

// Add form fields
pdf.add_text_field("name", [100.0, 700.0, 300.0, 720.0])?;
pdf.add_checkbox("agree", [100.0, 650.0, 120.0, 670.0], false)?;

pdf.save("modified.pdf")?;

DOM-like page editing

Iterate over page elements and modify text in place.

use pdf_oxide::api::Pdf;

let mut pdf = Pdf::open("document.pdf")?;
let mut page = pdf.page(0)?;

// Find text elements
for t in page.find_text_containing("Draft") {
    println!("Found '{}' at {:?}", t.text(), t.bbox());
}

// Replace text
let matches = page.find_text_containing("Draft");
for t in &matches {
    page.set_text(t.id(), "Final")?;
}

pdf.save_page(page)?;
pdf.save("updated.pdf")?;

Error handling

Every fallible operation returns Result<T, PdfError>. The PdfError enum covers the main failure cases.

use pdf_oxide::PdfDocument;
use pdf_oxide::PdfError;

fn extract(path: &str) -> Result<String, PdfError> {
    let doc = PdfDocument::open(path)?;
    doc.extract_text(0)
}

match extract("file.pdf") {
    Ok(text) => println!("{text}"),
    Err(PdfError::Io(e)) => eprintln!("I/O error: {e}"),
    Err(PdfError::Parse(msg)) => eprintln!("Parse error: {msg}"),
    Err(PdfError::Password) => eprintln!("Password required"),
    Err(PdfError::PageOutOfRange { index, count }) => {
        eprintln!("Page {index} does not exist ({count} pages total)");
    }
    Err(e) => eprintln!("Error: {e}"),
}

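PdfError variants:

Variant	Description
`Io`	File system or I/O failure
`Parse`	Corrupt PDF structure
`Password`	The document is encrypted and no password was supplied
`PageOutOfRange`	The requested page index exceeds the page count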
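
An error enum shaped like this composes well with the `?` operator. The sketch below uses a stand-in: `DemoError` is hypothetical and only mirrors the four documented variants (the payload types are assumptions), and `check_page` and `read_pdf_bytes` are illustrative helpers, so the example compiles without the crate.

```rust
use std::fmt;
use std::io;

/// Stand-in mirroring the documented variants (hypothetical; not the crate's type).
#[derive(Debug)]
enum DemoError {
    Io(io::Error),
    Parse(String),
    Password, // listed for completeness; not produced in this sketch
    PageOutOfRange { index: usize, count: usize },
}

impl fmt::Display for DemoError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DemoError::Io(e) => write!(f, "I/O error: {e}"),
            DemoError::Parse(msg) => write!(f, "parse error: {msg}"),
            DemoError::Password => write!(f, "password required"),
            DemoError::PageOutOfRange { index, count } => {
                write!(f, "page {index} does not exist ({count} pages total)")
            }
        }
    }
}

impl std::error::Error for DemoError {}

/// Lets `?` turn an io::Error into DemoError::Io automatically.
impl From<io::Error> for DemoError {
    fn from(e: io::Error) -> Self {
        DemoError::Io(e)
    }
}

/// Validates a zero-based page index against the page count.
fn check_page(index: usize, count: usize) -> Result<usize, DemoError> {
    if index < count {
        Ok(index)
    } else {
        Err(DemoError::PageOutOfRange { index, count })
    }
}

/// Reads a file and checks for the "%PDF-" marker that starts a PDF file.
fn read_pdf_bytes(path: &str) -> Result<Vec<u8>, DemoError> {
    let bytes = std::fs::read(path)?; // io::Error becomes DemoError::Io here
    if !bytes.starts_with(b"%PDF-") {
        return Err(DemoError::Parse("missing %PDF- header".to_string()));
    }
    Ok(bytes)
}
```

Keeping structured data in the variant, as `PageOutOfRange` does with `index` and `count`, lets callers both match on the failure and print a precise message.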

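Next steps

Python Quick Start – Use PDF Oxide from Python
Text Extraction – Detailed extraction options and examples
Creating PDFs – Advanced creation, encryption, and metadata with PdfBuilder
Editing – Modify existing PDFs, annotations, and form fields
API Reference – Complete API documentation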