What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

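Extraction Profiles: Tuning Space Detection per Document Type

PDFs differ widely in how they encode spacing. arXiv papers are set in tightly packed, justified columns; IRS forms assume strict cell alignment; GDPR policies flow dense paragraphs that are justified with minimal kerning. Tune a single tj_offset_threshold for one of these and spurious spaces creep into the others.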
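To make the trade-off concrete, here is a toy model of how a TJ offset threshold can decide where word breaks go. It is not PDF Oxide's implementation: the function name `join_tj_array` and the "at or below the threshold" rule are assumptions for illustration. The only fact it relies on is from the PDF specification: numbers in a TJ array are position adjustments in thousandths of a text-space unit, and a negative number widens the gap before the next string.

```python
from typing import Sequence, Union

TJItem = Union[str, float]  # a TJ array mixes strings and numeric adjustments


def join_tj_array(items: Sequence[TJItem], tj_offset_threshold: float) -> str:
    """Toy model (assumption): insert a space wherever a numeric adjustment
    is at or below the threshold; treat every other adjustment as kerning."""
    out = []
    for item in items:
        if isinstance(item, str):
            out.append(item)
        elif item <= tj_offset_threshold:
            # A large negative adjustment is a wide gap: count it as a word break.
            out.append(" ")
        # Smaller adjustments are ignored as kerning.
    return "".join(out)


# Tightly kerned text with one real word gap (-150).
tight = ["Hel", -20, "lo", -150, "world"]
# Loosely set text whose internal gap (-90) is not a word break.
loose = ["in", -90, "put"]

print(join_tj_array(tight, -120))  # Hello world
print(join_tj_array(loose, -120))  # input
print(join_tj_array(loose, -80))   # in put
```

A threshold of -80 that would rescue under-spaced text splits "input" in the loosely set sample, which is the kind of conflict a per-document-class preset is meant to avoid.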

ExtractionProfile provides nine pre-tuned parameter sets that map directly onto real document classes. Pass a profile to extract_text() or extract_words() and PDF Oxide applies the word-margin ratio, TJ offset threshold, and adaptive-threshold on/off setting that suit that document style.

Binding support. Extraction profiles are currently exposed in Python (pdf_oxide.ExtractionProfile) and Rust (pdf_oxide::config::ExtractionProfile). The Node, WASM, Go, and C# bindings use the CONSERVATIVE default internally. To apply a different profile from those runtimes, call the Rust CLI (pdf-oxide extract --profile academic doc.pdf) or bridge through Python / Rust.
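The CLI route can be wrapped in a few lines. The sketch below builds the command shown above and runs it with `subprocess`. The helper names are hypothetical, and it assumes the `pdf-oxide` binary is on PATH and writes the extracted text to standard output. Only `--profile academic` appears in the text above; other profile names on the command line are an assumption.

```python
import subprocess
from typing import List


def build_extract_command(pdf_path: str, profile: str = "academic") -> List[str]:
    """Return the argv for `pdf-oxide extract --profile <profile> <pdf_path>`."""
    return ["pdf-oxide", "extract", "--profile", profile, pdf_path]


def extract_with_cli(pdf_path: str, profile: str = "academic") -> str:
    """Run the CLI and return its stdout (assumes the text is printed there)."""
    result = subprocess.run(
        build_extract_command(pdf_path, profile),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


print(build_extract_command("doc.pdf"))
# ['pdf-oxide', 'extract', '--profile', 'academic', 'doc.pdf']
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.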

Quick example

Python

from pdf_oxide import PdfDocument, ExtractionProfile

doc = PdfDocument("paper.pdf")

# Academic paper: tight typesetting, citation detection on
text = doc.extract_text(0, profile=ExtractionProfile.academic())
print(text)

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::config::ExtractionProfile;

let mut doc = PdfDocument::open("paper.pdf")?;
let text = doc.extract_text_with_profile(0, ExtractionProfile::ACADEMIC)?;
println!("{}", text);

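Available profiles

Profile	Intended use	TJ threshold	Word-margin ratio	Adaptive
`conservative()`	Default: general documents, fewest spurious spaces	−120	0.10	off
`aggressive()`	PDFs with suppressed spacing; repairs joined words	−80	0.20	off
`balanced()`	Mixed content	−100	0.15	off
`academic()`	arXiv papers, conference proceedings, technical reports	−105	0.12	on + citation / email detection
`policy()`	Legislation, GDPR, administrative regulations	−110	0.18	on
`form()`	IRS forms, applications, questionnaires	−120	0.08	off
`government()`	Government documents containing tables	−105	0.14	off
`scanned_ocr()`	OCR output with noisy coordinates	auto	auto	on
`adaptive()`	Extractor tunes itself from font statistics	auto	auto	on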

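Which profile helps where

Academic papers and conference proceedings: `academic()`

Tight typesetting, two-column layouts, embedded citations. With the default settings, extra spaces can appear inside ligatures (fi, ff), and spaces between words can go missing where kerning is heavy.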

doc = PdfDocument("neurips-paper.pdf")
text = doc.extract_text(0, profile=ExtractionProfile.academic())

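In addition to adaptive thresholds, the academic profile enables citation and email detection, so inline references such as [1,2,3] and email addresses such as author@lab.edu survive intact.

IRS forms and applications: `form()`

Form PDFs prioritize column alignment over word boundaries. The form() profile uses a very tight word-margin ratio (0.08) so that rigidly aligned labels are not joined to their values.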

doc = PdfDocument("w2.pdf")
text = doc.extract_text(0, profile=ExtractionProfile.form())
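The effect of a tight word-margin ratio can be pictured with a toy model. The sketch below assumes a word break is inserted when the horizontal gap between two glyphs exceeds `word_margin_ratio * font_size`. That rule, the `Glyph` tuple, and the `join_glyphs` helper are illustrative assumptions, not PDF Oxide's documented algorithm.

```python
from typing import List, Tuple

Glyph = Tuple[str, float, float]  # (character, x_start, x_end) in points


def join_glyphs(glyphs: List[Glyph], font_size: float, word_margin_ratio: float) -> str:
    """Toy model (assumption): emit a space when the gap to the previous
    glyph is wider than word_margin_ratio * font_size."""
    out = []
    prev_end = None
    for ch, x_start, x_end in glyphs:
        if prev_end is not None and (x_start - prev_end) > word_margin_ratio * font_size:
            out.append(" ")
        out.append(ch)
        prev_end = x_end
    return "".join(out)


# 10 pt text: the label "ID" followed by the value "42" after a 1 pt gap.
cell = [("I", 0.0, 3.0), ("D", 3.0, 10.0), ("4", 11.0, 16.0), ("2", 16.0, 21.0)]

print(join_glyphs(cell, 10.0, 0.08))  # ID 42
print(join_glyphs(cell, 10.0, 0.18))  # ID42
```

Under this model the 0.08 ratio treats the 1 pt gap as a word break and keeps label and value apart, while the wider 0.18 ratio merges them.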

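GDPR, policy, and regulatory documents: `policy()`

Justified paragraphs insert variable-width whitespace, so the default thresholds break down. policy() combines a wider word-margin ratio (0.18) with adaptive thresholds to read dense legal text correctly.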

doc = PdfDocument("gdpr.pdf")
text = doc.extract_text(0, profile=ExtractionProfile.policy())

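Scanned OCR output: `scanned_ocr()`

Pages produced by OCR (Tesseract, PaddleOCR, Azure) carry noise in their character positions and have lost their kerning information. scanned_ocr() compensates with adaptive thresholds that re-read the font statistics on every page.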

doc = PdfDocument("scanned.pdf")
text = doc.extract_text(0, profile=ExtractionProfile.scanned_ocr())

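Let the library decide: `adaptive()`

When the document class is not known in advance, adaptive() samples font statistics in a first pass and settles the thresholds before extraction. It is slightly slower than a fixed profile but robust on mixed corpora.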

from pathlib import Path

for pdf_path in Path("mixed_corpus/").glob("*.pdf"):
    doc = PdfDocument(str(pdf_path))
    text = doc.extract_text(0, profile=ExtractionProfile.adaptive())

Profile fields

Each profile exposes its tuning parameters, which can be inspected or cloned.

Python

from pdf_oxide import ExtractionProfile

p = ExtractionProfile.academic()
print(p.name)                # "Academic"
print(p.word_margin_ratio)   # 0.12
print(p.tj_offset_threshold) # -105.0

# List every preset
for profile in ExtractionProfile.all_profiles():
    print(profile.name, profile.word_margin_ratio)

Rust

use pdf_oxide::config::ExtractionProfile;

let p = ExtractionProfile::ACADEMIC;
println!("{} margin={} tj={}",
    p.name, p.word_margin_ratio, p.tj_offset_threshold);

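Choosing a profile in production pipelines

If your corpus mixes academic papers, IRS forms, and web-scraped HTML exports, defaulting to adaptive() is the safe choice. It adds a few percent of overhead per page but avoids the worst cases, such as joined words and missing spaces between columns.

For a homogeneous corpus (a Title IX intake pipeline, a contract review tool, an arXiv crawler, and so on), pick the matching profile explicitly. That gives the best extraction quality and avoids the per-page sampling cost of adaptive().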
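The selection rule above can be sketched as a small lookup with adaptive as the fallback. The document-class labels and the helper `choose_profile_name` are hypothetical; the profile names are the ones listed in the table of available profiles.

```python
from typing import Optional

# Hypothetical corpus labels mapped to profile constructor names.
PROFILE_BY_CLASS = {
    "academic": "academic",
    "form": "form",
    "policy": "policy",
    "government": "government",
    "scanned": "scanned_ocr",
}


def choose_profile_name(doc_class: Optional[str]) -> str:
    """Return the preset for a known document class, else 'adaptive'."""
    if doc_class is None:
        return "adaptive"
    return PROFILE_BY_CLASS.get(doc_class.lower(), "adaptive")


print(choose_profile_name("form"))        # form
print(choose_profile_name("newsletter"))  # adaptive
print(choose_profile_name(None))          # adaptive
```

With pdf_oxide installed, the returned name can be turned into a profile via `getattr(ExtractionProfile, choose_profile_name(label))()` and passed to extract_text() as in the examples above.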
