What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

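PDF Oxide vs. lopdf

lopdf is a low-level Rust crate for manipulating PDF objects directly. PDF Oxide, by contrast, is a high-level library with built-in text extraction, creation, and editing. The two target fundamentally different use cases.

Key differences

Abstraction level. lopdf exposes raw PDF objects: dictionaries, streams, and cross-reference tables. It offers no text extraction, font decoding, or image export. PDF Oxide provides purpose-built methods: extract_text(), extract_images(), and to_markdown().

Reliability. lopdf fails to parse 20% of the 3,830-PDF test corpus. Of the PDFs it does parse, 57% yield empty output because lopdf has no text extraction: you get the objects but not the text. PDF Oxide passes 100%.

Speed on parseable PDFs. lopdf is faster at raw object parsing: a mean of 0.3ms versus 0.8ms for PDF Oxide. But lopdf performs no text extraction at all, so font decoding, CMap resolution, character spacing analysis, and reading-order detection must be built yourself.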

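Quick comparison

	PDF Oxide	lopdf
API level	High-level	Low-level
Text extraction	Built-in (production quality)	None
Pass rate (3,830 PDFs)	100%	80.2%
Mean parse time	0.8ms	0.3ms
Image extraction	Built-in	Manual (raw streams)
Form fields	Read + write	Manual (raw dictionaries)
PDF creation	Yes (Markdown/HTML)	Yes (raw objects)
Markdown/HTML output	Yes	No
Encryption	Read + write	No
Rendering	Yes	No
PDF/A validation	Yes	No
License	MIT	MIT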

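What lopdf cannot do

lopdf gives you access to PDF objects, but extracting text requires interpreting those objects according to the PDF specification. You would have to build all of the following yourself:

Content stream parsing: parse the PostScript-like operators (Tj, TJ, Tm, Tf, and so on)
Font resolution: look up /Font resources and resolve indirect references
CMap/ToUnicode decoding: convert glyph IDs to Unicode characters
Spacing from font metrics: compute character widths from the font descriptor
Text matrix transformations: apply the Tm, Td, and T* operators to position text
Reading order: determine the correct order in multi-column layouts
Ligature reconstruction: handle the fi, fl, and ffi ligatures
CJK encodings: decode Chinese, Japanese, and Korean text encodings

This takes thousands of lines of code and deep knowledge of ISO 32000. PDF Oxide handles all of it internally.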
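
To give a sense of the first item alone, here is a minimal sketch of pulling the string operands of Tj and TJ out of an already-decompressed content stream. The helper names shown_strings, read_literal, and is_delimiter are hypothetical and belong to neither crate. The sketch assumes literal strings only: hex strings, octal escapes, and font decoding are left out, so what it returns are still raw character codes rendered lossily as text, not decoded Unicode.

```rust
/// Returns one entry per `Tj`/`TJ` operator, holding the concatenated
/// literal-string operands that preceded it. Hypothetical helper.
fn shown_strings(content: &[u8]) -> Vec<String> {
    let mut shown = Vec::new();
    let mut pending: Vec<String> = Vec::new(); // string operands awaiting an operator
    let mut i = 0;
    while i < content.len() {
        let c = content[i];
        if c == b'(' {
            let (text, next) = read_literal(content, i + 1);
            pending.push(text);
            i = next;
        } else if c.is_ascii_whitespace() || c == b'[' || c == b']' {
            i += 1; // separators and TJ array brackets carry no text
        } else {
            let start = i;
            while i < content.len() && !is_delimiter(content[i]) {
                i += 1;
            }
            let token = &content[start..i];
            // Names (/F1) and numbers are operands; anything else is an operator.
            let is_operand = token[0] == b'/'
                || std::str::from_utf8(token).map_or(false, |t| t.parse::<f64>().is_ok());
            if !is_operand {
                if token == b"Tj" || token == b"TJ" {
                    shown.push(pending.concat());
                }
                pending.clear(); // every operator consumes its operands
            }
        }
    }
    shown
}

fn is_delimiter(b: u8) -> bool {
    b.is_ascii_whitespace() || matches!(b, b'(' | b'[' | b']')
}

/// Reads a literal string body starting just after its opening `(`.
/// Returns the unescaped text and the index just past the closing `)`.
fn read_literal(content: &[u8], mut i: usize) -> (String, usize) {
    let mut bytes = Vec::new();
    let mut depth = 1; // balanced parentheses may nest unescaped
    while i < content.len() {
        let b = content[i];
        i += 1;
        match b {
            b'\\' if i < content.len() => {
                let escaped = content[i];
                i += 1;
                bytes.push(match escaped {
                    b'n' => b'\n',
                    b'r' => b'\r',
                    b't' => b'\t',
                    other => other, // covers \( \) and \\
                });
            }
            b'(' => {
                depth += 1;
                bytes.push(b);
            }
            b')' => {
                depth -= 1;
                if depth == 0 {
                    break;
                }
                bytes.push(b);
            }
            _ => bytes.push(b),
        }
    }
    (String::from_utf8_lossy(&bytes).into_owned(), i)
}
```

Calling shown_strings on the stream BT /F1 12 Tf 72 720 Td (Hello World) Tj ET yields a single entry, Hello World. Every other item in the list above still has to be layered on top of this before the result is usable text.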

Side-by-side code comparison

Text extraction

PDF Oxide:

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("report.pdf")?;
let text = doc.extract_text(0)?;
println!("{}", text);

lopdf:

use lopdf::Document;

let doc = Document::load("report.pdf")?;

// lopdf does not provide text extraction.
// You get access to PDF objects only:
let page_id = doc.page_iter().next().unwrap();
let page = doc.get_dictionary(page_id)?;
let contents = page.get(b"Contents")?; // lopdf dictionary keys are byte strings
let stream = doc.get_object(contents.as_reference()?)?;

// To get actual text, you must:
// 1. Parse content stream operators
// 2. Resolve font references from /Resources
// 3. Decode CMap/ToUnicode mappings
// 4. Apply text matrix transformations
// 5. Handle encoding differences
// ... (hundreds to thousands of lines of code)
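
Step 3 in the comments above can be sketched in isolation. The following is a simplified illustration, not code from either crate: parse_bfchar, utf16be_hex, and decode_codes are hypothetical helpers, and the sketch assumes 2-byte character codes, one bfchar entry per line, and no bfrange sections. A real ToUnicode CMap needs a proper tokenizer and range support.

```rust
use std::collections::HashMap;

/// Parses `beginbfchar ... endbfchar` sections into a code -> text table.
/// Assumes one `<src> <dst>` entry per line; `bfrange` is ignored.
fn parse_bfchar(cmap: &str) -> HashMap<u16, String> {
    let mut table = HashMap::new();
    let mut in_section = false;
    for line in cmap.lines() {
        let line = line.trim();
        if line.ends_with("beginbfchar") {
            in_section = true;
            continue;
        }
        if line == "endbfchar" {
            in_section = false;
            continue;
        }
        if !in_section {
            continue;
        }
        // An entry looks like: <0003> <0048>
        let hex: Vec<&str> = line
            .split(|c| c == '<' || c == '>')
            .map(str::trim)
            .filter(|s| !s.is_empty())
            .collect();
        if let [src, dst] = hex.as_slice() {
            if let (Ok(code), Some(text)) = (u16::from_str_radix(src, 16), utf16be_hex(dst)) {
                table.insert(code, text);
            }
        }
    }
    table
}

/// Decodes a hex string of UTF-16BE code units, e.g. "00660069" -> "fi".
fn utf16be_hex(hex: &str) -> Option<String> {
    if hex.len() % 4 != 0 {
        return None;
    }
    let units: Option<Vec<u16>> = (0..hex.len())
        .step_by(4)
        .map(|i| u16::from_str_radix(hex.get(i..i + 4)?, 16).ok())
        .collect();
    String::from_utf16(&units?).ok()
}

/// Maps big-endian 2-byte codes to text; unmapped codes become U+FFFD.
fn decode_codes(codes: &[u8], table: &HashMap<u16, String>) -> String {
    codes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .map(|code| table.get(&code).map(String::as_str).unwrap_or("\u{FFFD}"))
        .collect()
}
```

Because a destination may hold several UTF-16 units, a single code can expand to more than one character, which is how a ligature glyph mapped to 00660069 comes back as the two letters f and i.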

PDF creation

PDF Oxide:

use pdf_oxide::api::Pdf;

let pdf = Pdf::from_markdown("# Report\n\n| Q1 | Q2 |\n|---|---|\n| $1M | $2M |")?;
pdf.save("report.pdf")?;

lopdf:

use lopdf::{Document, Object, Stream, dictionary};

let mut doc = Document::with_version("1.5");

// Create font dictionary
let font_id = doc.add_object(dictionary! {
    "Type" => "Font",
    "Subtype" => "Type1",
    "BaseFont" => "Helvetica",
});

// Create resources
let resources_id = doc.add_object(dictionary! {
    "Font" => dictionary! { "F1" => font_id },
});

// Create content stream (raw PDF text operators, which are PostScript-like)
let content = Stream::new(
    dictionary! {},
    b"BT /F1 12 Tf 72 720 Td (Hello World) Tj ET".to_vec(),
);
let content_id = doc.add_object(content);

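// Reserve the page tree ID first so the page can name its parent
let pages_id = doc.new_object_id();

// Create page
let page_id = doc.add_object(dictionary! {
    "Type" => "Page",
    "Parent" => pages_id,
    "MediaBox" => vec![0.into(), 0.into(), 612.into(), 792.into()],
    "Contents" => content_id,
    "Resources" => resources_id,
});

// Wire up page tree under the reserved ID
doc.objects.insert(pages_id, Object::Dictionary(dictionary! {
    "Type" => "Pages",
    "Kids" => vec![page_id.into()],
    "Count" => 1,
}));

// Create the catalog and register it as the document root in the trailer
let catalog_id = doc.add_object(dictionary! {
    "Type" => "Catalog",
    "Pages" => pages_id,
});
doc.trailer.set("Root", catalog_id);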

doc.save("report.pdf")?;
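
Writing content streams by hand also means escaping text yourself: a stray parenthesis or backslash inside ( ... ) corrupts the stream. A minimal sketch, assuming ASCII text and a font resource named like the /F1 entry in the example above. escape_pdf_literal and text_content are hypothetical helpers, not lopdf API.

```rust
/// Escapes text for use inside a PDF literal string `( ... )`.
/// Escaping every parenthesis is always safe, even when they are balanced.
fn escape_pdf_literal(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        match ch {
            '\\' | '(' | ')' => {
                out.push('\\');
                out.push(ch);
            }
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            _ => out.push(ch),
        }
    }
    out
}

/// Builds the bytes of a one-line text content stream.
/// `font` is the resource name without the leading slash, e.g. "F1".
fn text_content(font: &str, size: u32, x: u32, y: u32, text: &str) -> Vec<u8> {
    format!(
        "BT /{} {} Tf {} {} Td ({}) Tj ET",
        font,
        size,
        x,
        y,
        escape_pdf_literal(text)
    )
    .into_bytes()
}
```

The returned bytes can be passed as the second argument of Stream::new in place of the hard-coded byte string. Non-ASCII text would additionally need a font and encoding that can represent it, which this sketch does not attempt.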

Encrypted PDFs

PDF Oxide:

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open_with_password("encrypted.pdf", "secret")?;
let text = doc.extract_text(0)?;
println!("{}", text);

lopdf:

// lopdf does not support encrypted PDFs.
// Loading an encrypted PDF will fail or produce undecrypted streams.

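Reliability comparison

Metric	PDF Oxide	lopdf
PDFs parsed successfully	3,823 / 3,823 (100%)	3,071 / 3,823 (80.2%)
PDFs yielding text output	3,823 / 3,823	approx. 1,320 / 3,823 (estimated)
Encrypted PDF support	Yes	No
Recovery from malformed PDFs	Yes	No

lopdf's 80.2% pass rate means it fails on roughly one PDF in five. Failures occur on encrypted documents, PDFs with non-standard xref tables, and documents that use cross-reference streams. PDF Oxide handles all of these through lenient parsing and fallback strategies.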
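
As an illustration of what a fallback strategy can look like, one widely used recovery technique for a damaged xref table is to ignore it and scan the file for object headers of the form "N G obj". The sketch below shows that general idea only. It is not taken from PDF Oxide's source and may differ from what the library actually does. scan_object_offsets and parse_obj_header are hypothetical names, and the sketch assumes each header starts at the beginning of a line and that objects are not packed inside object streams.

```rust
use std::collections::BTreeMap;

/// Rebuilds an (object number, generation) -> byte offset index by
/// scanning every line for an `N G obj` header.
fn scan_object_offsets(data: &[u8]) -> BTreeMap<(u32, u16), usize> {
    let mut offsets = BTreeMap::new();
    let mut start = 0;
    while start < data.len() {
        // PDF lines may end in CR, LF, or CRLF; treat each byte as a break.
        let end = data[start..]
            .iter()
            .position(|&b| b == b'\n' || b == b'\r')
            .map_or(data.len(), |p| start + p);
        if let Some(id) = parse_obj_header(&data[start..end]) {
            // A later definition overwrites an earlier one, which matches
            // how incremental updates supersede older objects.
            offsets.insert(id, start);
        }
        start = end + 1;
    }
    offsets
}

/// Recognizes a line beginning with `<number> <generation> obj`.
fn parse_obj_header(line: &[u8]) -> Option<(u32, u16)> {
    if !line.first()?.is_ascii_digit() {
        return None;
    }
    // Binary stream data that is not valid UTF-8 is skipped here.
    let text = std::str::from_utf8(line).ok()?;
    let mut parts = text.split_ascii_whitespace();
    let number: u32 = parts.next()?.parse().ok()?;
    let generation: u16 = parts.next()?.parse().ok()?;
    if parts.next()? == "obj" {
        Some((number, generation))
    } else {
        None
    }
}
```

A production parser would go further, for example by locating the trailer dictionary and decoding cross-reference streams, but even this scan is enough to find the objects in many files whose xref offsets are simply wrong.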

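When to use each

Choose PDF Oxide when:

You need text extraction, image extraction, or other content-level operations
You want reading, writing, and creation in a single crate
You need to process every PDF reliably (encrypted, malformed, or complex)
You need Markdown/HTML output, rendering, or OCR
You need compliance validation (PDF/A, PDF/X, PDF/UA)

Choose lopdf when:

You need direct access to PDF objects for custom processing
You are building specialized PDF tools that work at the object level
You need to merge documents by manipulating the object tree directly
Your PDFs are simple and well-formed (no encryption, standard xref tables)

Use both together:

Use PDF Oxide for high-level operations and lopdf for edge cases that need raw object access: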

[dependencies]
pdf_oxide = "0.3"
lopdf = "0.32"
