What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with a mean text extraction time of 0.8ms: 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). It was benchmarked on 3,830 real-world PDFs with a 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting, which suits LLM and RAG pipelines. No separate package is needed, unlike PyMuPDF, which requires pymupdf4llm (69× slower).

PDF Oxide vs lopdf

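lopdf is a low-level Rust crate for manipulating PDF objects directly; it has no built-in text extraction or rendering. PDF Oxide is a high-level library that provides extraction, creation, and editing out of the box. The two target fundamentally different use cases.

Key differences

Abstraction level. lopdf exposes raw PDF objects: dictionaries, streams, and cross-reference tables. It has no text extraction, font decoding, or image export. PDF Oxide provides task-oriented methods: extract_text(), extract_images(), to_markdown().

Reliability. lopdf fails to parse 20% of the 3,830-PDF test corpus. Of the PDFs it does parse, 57% produce empty output. This is because lopdf has no text extraction: you get the objects, but not the text. PDF Oxide passes 100%.

Speed on parseable PDFs. lopdf is fast at raw object parsing: 0.3ms mean vs 0.8ms for PDF Oxide. However, lopdf does no text extraction work at all, so the two figures do not measure the same task. With lopdf you have to build font decoding, CMap resolution, spacing analysis, and reading order yourself.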
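To give a sense of what "spacing analysis" involves, here is a minimal sketch of one common heuristic: treating a large adjustment inside a `TJ` array as a word gap. The `TjItem` type, the `join_tj` function, and the threshold value are assumptions made for this illustration; they are not part of lopdf or PDF Oxide, and production extractors typically also consult font metrics.

```rust
/// One element of a `TJ` array: a string to show, or a numeric adjustment
/// expressed in thousandths of a text-space unit.
enum TjItem<'a> {
    Text(&'a str),
    Adjust(f64),
}

/// Joins `TJ` items into a string. In horizontal writing a negative adjustment
/// moves the next glyph to the right, so an adjustment below `-gap_threshold`
/// is treated as a word gap and becomes a single space.
/// `gap_threshold` is a hypothetical tuning parameter, not a value from the PDF spec.
fn join_tj(items: &[TjItem<'_>], gap_threshold: f64) -> String {
    let mut out = String::new();
    for item in items {
        match item {
            TjItem::Text(s) => out.push_str(s),
            TjItem::Adjust(a) => {
                let gap = -*a;
                // Avoid leading or doubled spaces.
                if gap > gap_threshold && !out.is_empty() && !out.ends_with(' ') {
                    out.push(' ');
                }
            }
        }
    }
    out
}
```

With a threshold of 200, the array `[(Hello) -300 (World)]` joins to `Hello World`, while a small kerning adjustment such as `[(W) 80 (orld)]` stays `World`.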

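Comparison overview

	PDF Oxide	lopdf
API level	High-level	Low-level
Text extraction	Built in (production grade)	None
Pass rate (3,830 PDFs)	100%	80.2%
Mean parse time	0.8ms	0.3ms
Image extraction	Built in	Manual (raw streams)
Form fields	Read and write	Manual (raw dictionaries)
PDF creation	Yes (Markdown/HTML)	Yes (raw objects)
Markdown/HTML output	Yes	No
Encryption	Read and write	No
Rendering	Yes	No
PDF/A validation	Yes	No
License	MIT	MIT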

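What lopdf cannot do

lopdf gives you access to PDF objects, but extracting text requires interpreting those objects according to the PDF specification. These are the pieces you would have to build yourself:

Content stream parsing: parsing the PostScript-like operators (Tj, TJ, Tm, Tf, and so on)
Font resolution: looking up /Font resources and resolving indirect references
CMap/ToUnicode decoding: mapping character codes to Unicode characters
Font-metric spacing: computing character widths from the font descriptors
Text matrix transforms: positioning text by applying the Tm, Td, and T* operators
Reading order: determining the correct order in multi-column layouts
Ligature reconstruction: handling the fi, fl, and ffi ligatures
CJK encodings: decoding Chinese, Japanese, and Korean text encodings

This takes thousands of lines of code and deep knowledge of ISO 32000. PDF Oxide handles all of it internally.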
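As a rough illustration of the first item only, the sketch below scans an already-decompressed content stream and collects the literal strings passed to `Tj` and `TJ`. It is a simplified example under stated assumptions, not how PDF Oxide is implemented: `shown_strings` and `is_token_end` are hypothetical names, and hex strings, octal escapes, the `'` and `"` operators, inline images, and all font decoding are ignored.

```rust
/// Collects the raw bytes of literal string operands consumed by the
/// text-showing operators `Tj` and `TJ`. The bytes are still in the font's
/// own encoding; turning them into Unicode is a separate step.
fn shown_strings(content: &[u8]) -> Vec<Vec<u8>> {
    let mut shown: Vec<Vec<u8>> = Vec::new();
    // String operands seen since the last operator.
    let mut pending: Vec<Vec<u8>> = Vec::new();
    let mut i = 0;
    while i < content.len() {
        match content[i] {
            b'(' => {
                // Literal string: balanced parentheses with backslash escapes.
                let mut depth = 1;
                let mut s = Vec::new();
                i += 1;
                while i < content.len() && depth > 0 {
                    match content[i] {
                        b'\\' if i + 1 < content.len() => {
                            // Simplification: keep the escaped byte verbatim.
                            i += 1;
                            s.push(content[i]);
                        }
                        b'(' => { depth += 1; s.push(b'('); }
                        b')' => { depth -= 1; if depth > 0 { s.push(b')'); } }
                        other => s.push(other),
                    }
                    i += 1;
                }
                pending.push(s);
            }
            b'/' => {
                // Name object such as /F1: skip it so it is not read as an operator.
                i += 1;
                while i < content.len() && !is_token_end(content[i]) { i += 1; }
            }
            b if b.is_ascii_alphabetic() => {
                // Operator token: runs until whitespace or a delimiter.
                let start = i;
                while i < content.len() && !is_token_end(content[i]) { i += 1; }
                let op = &content[start..i];
                if op == b"Tj" || op == b"TJ" {
                    shown.append(&mut pending);
                } else {
                    // Any other operator consumes its operands without showing text.
                    pending.clear();
                }
            }
            _ => i += 1, // numbers, whitespace, array brackets
        }
    }
    shown
}

/// Returns true for bytes that end a name or operator token.
fn is_token_end(b: u8) -> bool {
    b.is_ascii_whitespace() || b"()[]<>/".contains(&b)
}
```

Calling `shown_strings(b"BT /F1 12 Tf 72 720 Td (Hello World) Tj ET")` returns one entry holding the bytes of `Hello World`. Every other item in the list above still has to be layered on top of this.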

Code comparison

Text extraction

PDF Oxide:

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("report.pdf")?;
let text = doc.extract_text(0)?;
println!("{}", text);

lopdf:

use lopdf::Document;

let doc = Document::load("report.pdf")?;

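// lopdf has no text extraction.
// It only gives access to PDF objects:
let page_id = doc.page_iter().next().unwrap();
let page = doc.get_dictionary(page_id)?;
// Dictionary keys are byte strings in lopdf
let contents = page.get(b"Contents")?;
let stream = doc.get_object(contents.as_reference()?)?;

// To get actual text you still need to:
// 1. Parse the content stream operators
// 2. Resolve font references from /Resources
// 3. Decode the CMap/ToUnicode mappings
// 4. Apply the text matrix transforms
// 5. Handle encoding differences
// ... (hundreds to thousands of lines of code)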
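Step 3 in the comments above can be sketched in isolation. The example below reads the `bfchar` entries of a ToUnicode CMap and decodes two-byte character codes with them. It is a minimal sketch under stated assumptions: `parse_bfchar` and `decode` are hypothetical helpers, `bfrange` sections and multi-code-point targets (such as ligatures) are skipped, and codes are assumed to be exactly two bytes wide.

```rust
use std::collections::HashMap;

/// Parses `<src> <dst>` pairs found between `beginbfchar` and `endbfchar`.
/// Only single code point targets are kept; anything else is ignored.
fn parse_bfchar(cmap: &str) -> HashMap<u16, char> {
    let mut map = HashMap::new();
    let mut in_block = false;
    for line in cmap.lines() {
        let line = line.trim();
        if line.ends_with("beginbfchar") {
            in_block = true;
            continue;
        }
        if line == "endbfchar" {
            in_block = false;
            continue;
        }
        if !in_block {
            continue;
        }
        // Strip the angle brackets from each hex token.
        let hex: Vec<&str> = line
            .split_whitespace()
            .map(|t| t.trim_matches(|c: char| c == '<' || c == '>'))
            .collect();
        if let [src, dst] = hex.as_slice() {
            if let (Ok(code), Ok(cp)) =
                (u16::from_str_radix(src, 16), u32::from_str_radix(dst, 16))
            {
                if let Some(ch) = char::from_u32(cp) {
                    map.insert(code, ch);
                }
            }
        }
    }
    map
}

/// Decodes big-endian two-byte character codes, substituting U+FFFD for
/// codes that have no mapping. A trailing odd byte is dropped.
fn decode(bytes: &[u8], map: &HashMap<u16, char>) -> String {
    bytes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .map(|code| map.get(&code).copied().unwrap_or('\u{FFFD}'))
        .collect()
}
```

Given a CMap that maps `<0001>` to `<0048>` and `<0002>` to `<0069>`, decoding the bytes `00 01 00 02` yields `Hi`.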

PDF creation

PDF Oxide:

use pdf_oxide::api::Pdf;

let pdf = Pdf::from_markdown("# Report\n\n| Q1 | Q2 |\n|---|---|\n| $1M | $2M |")?;
pdf.save("report.pdf")?;

lopdf:

use lopdf::{Document, Object, Stream, dictionary};

let mut doc = Document::with_version("1.5");

// Reserve the page tree's ID up front so the page can reference it as /Parent
let pages_id = doc.new_object_id();

// Create the font dictionary
let font_id = doc.add_object(dictionary! {
    "Type" => "Font",
    "Subtype" => "Type1",
    "BaseFont" => "Helvetica",
});

// Create the resources
let resources_id = doc.add_object(dictionary! {
    "Font" => dictionary! { "F1" => font_id },
});

// Create the content stream (raw PDF content stream operators)
let content = Stream::new(
    dictionary! {},
    b"BT /F1 12 Tf 72 720 Td (Hello World) Tj ET".to_vec(),
);
let content_id = doc.add_object(content);

// Create the page
let page_id = doc.add_object(dictionary! {
    "Type" => "Page",
    "Parent" => pages_id,
    "MediaBox" => vec![0.into(), 0.into(), 612.into(), 792.into()],
    "Contents" => content_id,
    "Resources" => resources_id,
});

// Wire up the page tree under the reserved ID
doc.objects.insert(pages_id, Object::Dictionary(dictionary! {
    "Type" => "Pages",
    "Kids" => vec![page_id.into()],
    "Count" => 1,
}));

// Create the catalog and register it as the document root in the trailer
let catalog_id = doc.add_object(dictionary! {
    "Type" => "Catalog",
    "Pages" => pages_id,
});
doc.trailer.set("Root", catalog_id);

doc.save("report.pdf")?;

Encrypted PDFs

PDF Oxide:

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open_with_password("encrypted.pdf", "secret")?;
let text = doc.extract_text(0)?;
println!("{}", text);

lopdf:

// lopdf does not support encrypted PDFs.
// Loading one either fails or returns streams that are still encrypted.

Reliability comparison

Metric	PDF Oxide	lopdf
Successfully parsed PDFs	3,830 / 3,830 (100%)	3,071 / 3,830 (80.2%)
PDFs with text output	3,830 / 3,830	~1,320 / 3,830 (estimated)
Encrypted PDF support	Yes	No
Malformed PDF recovery	Yes	No
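The lopdf figures in the table can be cross-checked against the percentages quoted in the text. The short sketch below assumes the 3,830-PDF corpus size and the 57% empty-output share stated earlier; the function names are illustrative only.

```rust
/// Pass rate as a percentage of the whole corpus.
fn pass_rate_percent(parsed: u32, total: u32) -> f64 {
    f64::from(parsed) / f64::from(total) * 100.0
}

/// Estimated number of parsed PDFs that yield non-empty text, given the
/// fraction of parsed PDFs whose output is empty.
fn estimated_with_text(parsed: u32, empty_fraction: f64) -> f64 {
    f64::from(parsed) * (1.0 - empty_fraction)
}
```

`pass_rate_percent(3071, 3830)` rounds to 80.2, and `estimated_with_text(3071, 0.57)` comes to roughly 1,320, which is where the estimated row comes from.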

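lopdf's 80.2% pass rate means that roughly one PDF in five fails. The failures occur on encrypted documents, PDFs with non-standard xref tables, and documents that use cross-reference streams. PDF Oxide handles all of these through lenient parsing and fallback strategies.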
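One fallback that lenient PDF parsers commonly use when the xref table is unusable is to rebuild it by scanning the file for object headers. The sketch below shows that idea only; the source does not say whether PDF Oxide does exactly this, and `scan_object_offsets` and `parse_obj_header` are hypothetical names. It assumes each `N G obj` header starts at the beginning of a line and does not look inside object streams.

```rust
use std::collections::BTreeMap;

/// Rebuilds an (object number, generation) -> byte offset table by scanning
/// each line of the file for an `N G obj` header.
fn scan_object_offsets(data: &[u8]) -> BTreeMap<(u32, u16), usize> {
    let mut table = BTreeMap::new();
    let mut start = 0;
    while start < data.len() {
        // Find the end of the current line (terminated by CR or LF).
        let end = data[start..]
            .iter()
            .position(|&b| b == b'\n' || b == b'\r')
            .map_or(data.len(), |p| start + p);
        if let Some(id) = parse_obj_header(&data[start..end]) {
            // A later definition replaces an earlier one, matching how
            // incremental updates append newer versions of an object.
            table.insert(id, start);
        }
        start = end + 1;
    }
    table
}

/// Parses a line that begins with `<number> <generation> obj`.
/// Lines containing non-UTF-8 bytes (binary stream data) are rejected.
fn parse_obj_header(line: &[u8]) -> Option<(u32, u16)> {
    let text = std::str::from_utf8(line).ok()?;
    let mut parts = text.split_whitespace();
    let number = parts.next()?.parse().ok()?;
    let generation = parts.next()?.parse().ok()?;
    (parts.next()? == "obj").then_some((number, generation))
}
```

A real recovery pass would also need to locate the trailer or catalog and guard against header-like text inside stream data, which this sketch does not attempt.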

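When to use which

Choose PDF Oxide when:

You need content-level operations such as text extraction or image extraction
You want reading, writing, and creation in a single crate
You need to handle every PDF reliably (encrypted, malformed, or complex)
You need Markdown/HTML output, rendering, or OCR
You need compliance validation (PDF/A, PDF/X, PDF/UA)

Choose lopdf when:

You need direct access to PDF objects for custom processing
You are building specialized PDF tooling that works at the object level
You need to merge documents by manipulating the object tree directly
Your PDFs are simple and well-formed (no encryption, standard xref tables)

Using both together:

You can use PDF Oxide for high-level operations and lopdf for the edge cases that need raw object access: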

[dependencies]
pdf_oxide = "0.3"
lopdf = "0.32"
