What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (Rust)

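PDF Oxide is the fastest PDF crate for Rust with built-in text extraction. It averages 0.8 ms per extraction and achieved a 100% success rate across 3,830 PDFs. Extraction, creation, and editing are all handled by a single library.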

Installation

Add pdf_oxide to Cargo.toml.

[dependencies]
pdf_oxide = "0.3"

Feature flags

Enable only the features you need.

# Default -- text extraction, creation, editing
pdf_oxide = "0.3"

# Render pages to images
pdf_oxide = { version = "0.3", features = ["rendering"] }

# Barcode generation
pdf_oxide = { version = "0.3", features = ["barcodes"] }

# Digital signatures
pdf_oxide = { version = "0.3", features = ["signatures"] }

# Office document conversion (DOCX, XLSX, PPTX)
pdf_oxide = { version = "0.3", features = ["office"] }

# Everything
pdf_oxide = { version = "0.3", features = ["full"] }

Opening a PDF

Load a file with PdfDocument::open() and inspect its metadata.

use pdf_oxide::PdfDocument;

let doc = PdfDocument::open("research-paper.pdf")?;
println!("Pages: {}", doc.page_count());
println!("PDF version: {}", doc.version());

Text extraction

Plain text

use pdf_oxide::PdfDocument;

let doc = PdfDocument::open("report.pdf")?;
let text = doc.extract_text(0)?;
println!("{text}");
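The snippet above reads a single page. A whole document is usually processed by looping over page indices, and the following is a minimal sketch of that loop written against a closure rather than the crate itself, so it runs on its own. The helper name `extract_all` and the `usize` page count are assumptions made for illustration, not part of PDF Oxide.

```rust
/// Collects the text of every page by calling `extract` once per page index.
/// `extract` is a placeholder for any per-page extractor (hypothetical helper,
/// not a PDF Oxide API); the first error stops the iteration and is returned.
fn extract_all<E, F>(page_count: usize, extract: F) -> Result<Vec<String>, E>
where
    F: FnMut(usize) -> Result<String, E>,
{
    // Collecting into `Result<Vec<_>, _>` short-circuits on the first `Err`.
    (0..page_count).map(extract).collect()
}

fn main() {
    // A fake three-page document standing in for a real PDF.
    let pages = ["first page", "second page", "third page"];
    let texts: Result<Vec<String>, String> =
        extract_all(pages.len(), |i| Ok(pages[i].to_string()));

    match texts {
        Ok(all) => println!("{}", all.join("\n\n")),
        Err(e) => eprintln!("extraction failed: {e}"),
    }
}
```

With the calls shown in this guide, the closure would presumably be `|i| doc.extract_text(i)` and the count `doc.page_count()`, assuming the latter returns a `usize`.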

Text spans

extract_spans() returns a Vec<TextSpan>, one entry per run of identically styled text, with font information attached.

use pdf_oxide::PdfDocument;

let doc = PdfDocument::open("paper.pdf")?;
let spans = doc.extract_spans(0)?;

for span in &spans {
    println!("'{}' at ({:.1}, {:.1}) font={} size={:.1}",
        span.text, span.x, span.y, span.font_name, span.font_size);
}

TextSpan fields:

Field	Type	Description
`text`	`String`	Text content
`x`	`f64`	Horizontal position in points
`y`	`f64`	Vertical position in points
`font_name`	`String`	PostScript font name
`font_size`	`f64`	Font size in points
`bbox`	`Rect`	Bounding rectangle
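Span positions make it possible to rebuild reading order. The sketch below groups spans into lines by their `y` value and orders each line by `x`. It uses a local `Span` struct that mirrors only the `text`, `x`, and `y` fields from the table, so it does not depend on the crate. It also assumes a bottom-left origin (larger `y` is higher on the page), which is the PDF default but is not confirmed here for PDF Oxide.

```rust
/// Local stand-in mirroring the `text`, `x`, and `y` fields listed above.
#[derive(Debug)]
struct Span {
    text: String,
    x: f64,
    y: f64,
}

/// Groups spans whose `y` lies within `tolerance` points of the first span
/// of a line, then joins each line left to right.
/// Assumes larger `y` means higher on the page.
fn group_into_lines(mut spans: Vec<Span>, tolerance: f64) -> Vec<String> {
    // Top to bottom first, then left to right.
    spans.sort_by(|a, b| b.y.total_cmp(&a.y).then(a.x.total_cmp(&b.x)));

    let mut lines: Vec<(f64, Vec<Span>)> = Vec::new();
    for span in spans {
        let same_line = lines
            .last()
            .map_or(false, |(y, _)| (y - span.y).abs() <= tolerance);
        if same_line {
            lines.last_mut().unwrap().1.push(span);
        } else {
            lines.push((span.y, vec![span]));
        }
    }

    lines
        .into_iter()
        .map(|(_, mut line)| {
            // Spans on one line can differ slightly in y, so re-sort by x.
            line.sort_by(|a, b| a.x.total_cmp(&b.x));
            line.iter().map(|s| s.text.as_str()).collect::<Vec<_>>().join(" ")
        })
        .collect()
}

fn main() {
    let spans = vec![
        Span { text: "world".into(), x: 120.0, y: 700.2 },
        Span { text: "Second".into(), x: 72.0, y: 680.0 },
        Span { text: "Hello".into(), x: 72.0, y: 700.0 },
    ];
    for line in group_into_lines(spans, 2.0) {
        println!("{line}");
    }
}
```

The tolerance is a heuristic; a value around a fraction of the font size tends to absorb baseline jitter without merging adjacent lines.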

Character-level extraction

extract_chars() returns a Vec<TextChar> with the exact position of each character.

use pdf_oxide::PdfDocument;

let doc = PdfDocument::open("paper.pdf")?;
let chars = doc.extract_chars(0)?;

for ch in chars.iter().take(10) {
    println!("'{}' at ({:.1}, {:.1}) size={:.1} font={}",
        ch.char, ch.x, ch.y, ch.font_size, ch.font_name);
}

TextChar fields:

Field	Type	Description
`char`	`char`	Unicode character
`x`	`f64`	Horizontal position in points
`y`	`f64`	Vertical position in points
`font_size`	`f64`	Font size in points
`font_name`	`String`	PostScript font name
`bbox`	`Rect`	Bounding rectangle
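One use of per-character font data is estimating the body text size of a page, which can then serve as a baseline for spotting headings. The following is a rough sketch of that idea using a local `Ch` struct with just a character and a font size. The struct and the function `dominant_font_size` are illustrative names, not PDF Oxide types.

```rust
use std::collections::BTreeMap;

/// Local stand-in carrying only the character and its font size.
struct Ch {
    ch: char,
    font_size: f64,
}

/// Returns the font size (rounded to 0.1 pt) used by the most
/// non-whitespace characters, or `None` if there are none.
fn dominant_font_size(chars: &[Ch]) -> Option<f64> {
    let mut counts: BTreeMap<i64, usize> = BTreeMap::new();
    for c in chars.iter().filter(|c| !c.ch.is_whitespace()) {
        // Bucket sizes in tenths of a point so 11.0 and 11.02 count together.
        let key = (c.font_size * 10.0).round() as i64;
        *counts.entry(key).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .max_by_key(|&(_, n)| n)
        .map(|(key, _)| key as f64 / 10.0)
}

fn main() {
    let mut chars: Vec<Ch> = "Title".chars().map(|ch| Ch { ch, font_size: 18.0 }).collect();
    chars.extend("body text here".chars().map(|ch| Ch { ch, font_size: 11.0 }));

    match dominant_font_size(&chars) {
        Some(size) => println!("body size is about {size} pt"),
        None => println!("no text on page"),
    }
}
```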

Markdown conversion

Convert a page to Markdown. Conversion options can be supplied as well.

use pdf_oxide::PdfDocument;
use pdf_oxide::converters::ConversionOptions;

let doc = PdfDocument::open("paper.pdf")?;
let options = ConversionOptions { detect_headings: true, ..Default::default() };
let md = doc.to_markdown(0, &options)?;
println!("{md}");

HTML conversion

use pdf_oxide::PdfDocument;

let doc = PdfDocument::open("paper.pdf")?;
let html = doc.to_html(0)?;
println!("{html}");

Image extraction

extract_images() returns the metadata and raw data of every image on a page. Images embedded in the content stream and images inside nested Form XObjects are included.

use pdf_oxide::PdfDocument;

let doc = PdfDocument::open("brochure.pdf")?;
let images = doc.extract_images(0)?;

for (i, img) in images.iter().enumerate() {
    println!("Image {i}: {}x{} {} {}bpc ({} bytes)",
        img.width, img.height, img.color_space,
        img.bits_per_component, img.data.len());
}
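The width, height, and bits-per-component values printed above are enough to estimate how large an uncompressed image should be, which is a handy sanity check on `data.len()`. The sketch below applies the PDF rule that each row of samples is padded to a whole byte. The helper names are illustrative, and the mapping covers only the three standard device color spaces; whether `img.data` holds decoded samples or still-encoded bytes (for example JPEG) is not stated in this guide, so a mismatch does not necessarily signal an error.

```rust
/// Number of color components for the standard PDF device color spaces.
/// Other color spaces (Indexed, ICCBased, ...) are not handled in this sketch.
fn components_for(color_space: &str) -> Option<u32> {
    match color_space {
        "DeviceGray" => Some(1),
        "DeviceRGB" => Some(3),
        "DeviceCMYK" => Some(4),
        _ => None,
    }
}

/// Size in bytes of uncompressed sample data, with each row padded
/// up to a byte boundary.
fn raw_image_size(width: u32, height: u32, components: u32, bits_per_component: u32) -> u64 {
    let bits_per_row = width as u64 * components as u64 * bits_per_component as u64;
    let bytes_per_row = (bits_per_row + 7) / 8;
    bytes_per_row * height as u64
}

fn main() {
    let components = components_for("DeviceRGB").expect("known color space");
    println!("640x480 RGB at 8 bpc: {} bytes", raw_image_size(640, 480, components, 8));
    println!("10x2 gray at 1 bpc: {} bytes", raw_image_size(10, 2, 1, 1));
}
```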

extract_images_to_files() writes the images straight to disk.

let doc = PdfDocument::open("brochure.pdf")?;
let paths = doc.extract_images_to_files(0, "output_dir")?;
for path in &paths {
    println!("Saved: {}", path.display());
}

PDF creation

Factory methods

The Pdf type provides high-level factory methods.

use pdf_oxide::api::Pdf;

let mut pdf = Pdf::from_markdown("# Hello World\n\nThis is a PDF.")?;
pdf.save("output.pdf")?;

let mut pdf = Pdf::from_html("<h1>Invoice</h1><p>Amount: $42</p>")?;
pdf.save("invoice.pdf")?;

let mut pdf = Pdf::from_text("Plain text content.")?;
pdf.save("notes.pdf")?;

let mut pdf = Pdf::from_image("scan.jpg")?;
pdf.save("scan.pdf")?;

Fluent API with PdfBuilder

For fine-grained control over metadata, page size, and margins:

use pdf_oxide::api::PdfBuilder;
use pdf_oxide::writer::PageSize;

let mut pdf = PdfBuilder::new()
    .title("Annual Report")
    .author("Acme Corp")
    .page_size(PageSize::A4)
    .margins(72.0, 72.0, 72.0, 72.0)
    .font_size(11.0)
    .from_markdown("# Annual Report\n\n...")?;

pdf.save("annual-report.pdf")?;

Low-level API with DocumentBuilder

Place text, shapes, and images with pixel-perfect precision.

use pdf_oxide::writer::DocumentBuilder;

let mut builder = DocumentBuilder::new();
builder.add_page(612.0, 792.0)
    .text("Hello, world!", 72.0, 720.0, 12.0)
    .rect(100.0, 600.0, 200.0, 50.0)
    .image_at("logo.png", 400.0, 700.0, 100.0, 50.0)?;

builder.save("custom.pdf")?;
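The numbers in these builder calls are PDF points, where 72 points make one inch. That means a 612.0 by 792.0 page is US Letter (8.5 by 11 inches) and a margin of 72.0 is one inch. A small conversion sketch, with helper names chosen here for illustration:

```rust
const POINTS_PER_INCH: f64 = 72.0;
const MM_PER_INCH: f64 = 25.4;

/// Converts inches to PDF points.
fn inches_to_points(inches: f64) -> f64 {
    inches * POINTS_PER_INCH
}

/// Converts millimetres to PDF points.
fn mm_to_points(mm: f64) -> f64 {
    mm / MM_PER_INCH * POINTS_PER_INCH
}

fn main() {
    // US Letter: prints "612 x 792 pt".
    println!("{} x {} pt", inches_to_points(8.5), inches_to_points(11.0));
    // A4 (210 mm x 297 mm): prints "595.3 x 841.9 pt".
    println!("{:.1} x {:.1} pt", mm_to_points(210.0), mm_to_points(297.0));
}
```

Keeping layout constants in inches or millimetres and converting at the call site tends to make coordinates easier to review than raw point values.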

Search

Search the entire document, or fine-tune the search with options.

use pdf_oxide::api::Pdf;

let pdf = Pdf::open("manual.pdf")?;

// Simple search across all pages
let results = pdf.search("configuration")?;
for r in &results {
    println!("Page {}: '{}' at ({:.0}, {:.0})", r.page, r.text, r.x, r.y);
}

use pdf_oxide::api::{Pdf, SearchOptions};

let pdf = Pdf::open("manual.pdf")?;

let opts = SearchOptions {
    case_sensitive: false,
    whole_word: true,
    max_results: Some(50),
    ..Default::default()
};
let results = pdf.search_with_options("configuration", &opts)?;
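To make the three option fields concrete, here is a self-contained sketch of what case-insensitive, whole-word, capped matching could look like over a plain string. It is not PDF Oxide's implementation: the `Options` struct and `find_matches` function are stand-ins defined here, case folding is ASCII-only, and "whole word" is taken to mean that the neighbouring characters are not alphanumeric.

```rust
/// Stand-in with the same field names as the options used above.
#[derive(Default)]
struct Options {
    case_sensitive: bool,
    whole_word: bool,
    max_results: Option<usize>,
}

/// Returns the byte offsets of non-overlapping matches of `needle`.
fn find_matches(haystack: &str, needle: &str, opts: &Options) -> Vec<usize> {
    if needle.is_empty() {
        return Vec::new();
    }
    // ASCII lowercasing keeps byte offsets identical to the original text.
    let (hay, pat) = if opts.case_sensitive {
        (haystack.to_string(), needle.to_string())
    } else {
        (haystack.to_ascii_lowercase(), needle.to_ascii_lowercase())
    };
    let limit = opts.max_results.unwrap_or(usize::MAX);

    let mut hits = Vec::new();
    for (start, _) in hay.match_indices(pat.as_str()) {
        if hits.len() >= limit {
            break;
        }
        let end = start + pat.len();
        let before_ok = hay[..start].chars().next_back().map_or(true, |c| !c.is_alphanumeric());
        let after_ok = hay[end..].chars().next().map_or(true, |c| !c.is_alphanumeric());
        if !opts.whole_word || (before_ok && after_ok) {
            hits.push(start);
        }
    }
    hits
}

fn main() {
    let text = "Configuration of reconfiguration: configuration.";
    let opts = Options { whole_word: true, ..Default::default() };
    println!("{:?}", find_matches(text, "configuration", &opts));
}
```

In this sketch the embedded match inside "reconfiguration" is skipped when `whole_word` is set, which is the behaviour the field name suggests.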

Editing

DocumentEditor

Open an existing PDF and make structural changes such as rotating pages or working with form fields.

use pdf_oxide::api::Pdf;

let mut pdf = Pdf::open_editor("form-template.pdf")?;

// Rotate a page
pdf.rotate_page(0, 90)?;

// Add form fields
pdf.add_text_field("name", [100.0, 700.0, 300.0, 720.0])?;
pdf.add_checkbox("agree", [100.0, 650.0, 120.0, 670.0], false)?;

pdf.save("modified.pdf")?;

DOM-like page editing

Walk the page elements and rewrite text in place.

use pdf_oxide::api::Pdf;

let mut pdf = Pdf::open("document.pdf")?;
let mut page = pdf.page(0)?;

// Find text elements
for t in page.find_text_containing("Draft") {
    println!("Found '{}' at {:?}", t.text(), t.bbox());
}

// Replace text
let matches = page.find_text_containing("Draft");
for t in &matches {
    page.set_text(t.id(), "Final")?;
}

pdf.save_page(page)?;
pdf.save("updated.pdf")?;

Error handling

Every operation that can fail returns Result<T, PdfError>. The PdfError enum covers the main failure modes.

use pdf_oxide::PdfDocument;
use pdf_oxide::PdfError;

fn extract(path: &str) -> Result<String, PdfError> {
    let doc = PdfDocument::open(path)?;
    doc.extract_text(0)
}

match extract("file.pdf") {
    Ok(text) => println!("{text}"),
    Err(PdfError::Io(e)) => eprintln!("I/O error: {e}"),
    Err(PdfError::Parse(msg)) => eprintln!("Parse error: {msg}"),
    Err(PdfError::Password) => eprintln!("Password required"),
    Err(PdfError::PageOutOfRange { index, count }) => {
        eprintln!("Page {index} does not exist ({count} pages total)");
    }
    Err(e) => eprintln!("Error: {e}"),
}

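PdfError variants:

Variant	Description
`Io`	File system or I/O failure
`Parse`	Corrupt PDF structure
`Password`	The document is encrypted and no password was supplied
`PageOutOfRange`	The requested page index is beyond the document's page count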
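For readers new to this error style, the sketch below shows how an enum shaped like the table above typically plugs into `?`, `Display`, and `From<io::Error>`. `DemoError` and `check_page` are local stand-ins written for this example. The payload types (a `String` message, `usize` indices) and the message wording are assumptions and may differ from the real `PdfError`.

```rust
use std::fmt;
use std::io;

/// Local stand-in with the four variants listed above; payload types are assumed.
#[derive(Debug)]
enum DemoError {
    Io(io::Error),
    Parse(String),
    Password,
    PageOutOfRange { index: usize, count: usize },
}

impl fmt::Display for DemoError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DemoError::Io(e) => write!(f, "I/O error: {e}"),
            DemoError::Parse(msg) => write!(f, "parse error: {msg}"),
            DemoError::Password => write!(f, "password required"),
            DemoError::PageOutOfRange { index, count } => {
                write!(f, "page {index} out of range ({count} pages)")
            }
        }
    }
}

impl std::error::Error for DemoError {}

/// Lets `?` convert `io::Error` into `DemoError` automatically.
impl From<io::Error> for DemoError {
    fn from(e: io::Error) -> Self {
        DemoError::Io(e)
    }
}

/// Validates a zero-based page index against the page count.
fn check_page(index: usize, count: usize) -> Result<usize, DemoError> {
    if index < count {
        Ok(index)
    } else {
        Err(DemoError::PageOutOfRange { index, count })
    }
}

fn main() {
    let errors = vec![
        DemoError::from(io::Error::new(io::ErrorKind::NotFound, "file.pdf not found")),
        DemoError::Parse("missing xref table".to_string()),
        DemoError::Password,
        check_page(5, 3).unwrap_err(),
    ];
    for err in &errors {
        eprintln!("{err}");
    }
}
```

Matching on specific variants first and ending with a catch-all arm, as the guide's own example does, keeps the code compiling if the library adds variants later.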

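Next steps

Getting Started with Python – Use PDF Oxide from Python
Text Extraction – Extraction options and recipes in detail
PDF Creation – Advanced creation with PdfBuilder, encryption, and metadata
Editing – Modifying existing PDFs, annotations, and form fields
API Reference – Complete API documentation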