What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting, which suits LLM and RAG pipelines. No separate package is needed, unlike PyMuPDF, which relies on the separate pymupdf4llm package (69× slower).

PDF Oxide vs pypdfium2

PDF Oxide and pypdfium2 are both fast, natively compiled PDF libraries for Python. pypdfium2 wraps Google's PDFium engine, while PDF Oxide is built on a Rust core. The biggest difference is scope: pypdfium2 is primarily a reader and renderer, whereas PDF Oxide covers the full PDF lifecycle.

Key differences

Speed. Both are fast. PDF Oxide is faster, with a mean extraction time of 0.8ms against 4.1ms (a 5.1× difference). Both are far faster than pure-Python libraries.
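To see what the per-document gap means for a batch job, the mean times quoted above can be scaled up to a corpus. The following is a minimal sketch, not a benchmark: the workload size of one million documents is a hypothetical figure, and the estimate assumes single-threaded processing where every document costs the mean.

```python
# Mean per-document extraction times quoted above, in milliseconds.
MEAN_MS = {"PDF Oxide": 0.8, "pypdfium2": 4.1}


def total_seconds(mean_ms: float, documents: int) -> float:
    """Estimate single-threaded wall time for `documents` extractions."""
    return mean_ms * documents / 1000.0


def speedup(slow_ms: float, fast_ms: float) -> float:
    """Ratio of two mean latencies."""
    return slow_ms / fast_ms


if __name__ == "__main__":
    docs = 1_000_000  # hypothetical workload size, not from the benchmark
    for name, ms in MEAN_MS.items():
        print(f"{name}: {total_seconds(ms, docs):,.0f} s")
    print(f"ratio: {speedup(MEAN_MS['pypdfium2'], MEAN_MS['PDF Oxide']):.1f}x")
```

Under these assumptions the estimate is roughly 800 seconds against 4,100 seconds. Real workloads will differ, because mean latency hides the long tail and ignores I/O and parallelism.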

Features. pypdfium2 is read-only with rendering. PDF Oxide adds creation, editing, form writing, encryption, Markdown/HTML output, and OCR.

Reliability. PDF Oxide processes 100% of valid PDFs. pypdfium2 reaches 99.2%, with 31 failures.

Licensing. Both are permissively licensed: PDF Oxide under MIT, pypdfium2 under Apache-2.0. Neither raises AGPL concerns.

Quick comparison

	PDF Oxide	pypdfium2
Mean extraction time	0.8ms	4.1ms
Pass rate (3,830 PDFs)	100%	99.2%
License	MIT	Apache-2.0
Language	Rust + PyO3	C++ (PDFium)
Text extraction	Yes	Yes
Character positions	Yes	Yes
Image extraction	Yes	Yes
Markdown output	Yes	No
HTML output	Yes	No
PDF creation	Yes	No
PDF editing	Yes	No
Form fields	Read + write	Read only
Encryption	Read + write	Read only
Rendering	Yes	Yes
OCR	Built-in	No
Search	Regex + spatial	Yes

Code comparison

Text extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)

pypdfium2:

import pypdfium2 as pdfium

pdf = pdfium.PdfDocument("report.pdf")
page = pdf[0]
textpage = page.get_textpage()
text = textpage.get_text_range()
print(text)

Image extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

pypdfium2:

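import pypdfium2 as pdfium
import pypdfium2.raw as pdfium_c  # raw PDFium constants (pypdfium2 v4+)

pdf = pdfium.PdfDocument("report.pdf")
page = pdf[0]
# Walk the page objects and keep only the images.
for i, obj in enumerate(page.get_objects()):
    if obj.type == pdfium_c.FPDF_PAGEOBJ_IMAGE:
        bitmap = obj.get_bitmap()
        bitmap.to_pil().save(f"image_{i}.png")  # to_pil() requires Pillow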

PDF creation

PDF Oxide:

from pdf_oxide import Pdf

pdf = Pdf.from_markdown("# Report\n\nQuarterly results are in.")
pdf.save("report.pdf")

pypdfium2:

# pypdfium2 cannot create PDFs.
# It is a read-only library with rendering capabilities.

Rendering

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
image = doc.render_page(0, dpi=150)
image.save("page.png")

pypdfium2:

import pypdfium2 as pdfium

pdf = pdfium.PdfDocument("report.pdf")
page = pdf[0]
bitmap = page.render(scale=150/72)
bitmap.to_pil().save("page.png")
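The `scale=150/72` argument in the pypdfium2 example comes from the PDF default user-space unit of 1/72 inch, so a scale of `dpi / 72` yields the requested resolution. The helper names below are hypothetical and the sketch uses only the standard library. It shows the conversion and the approximate pixel size of a rendered page. Renderers may round differently, so actual bitmap dimensions can differ by a pixel.

```python
POINTS_PER_INCH = 72  # PDF user space: 1 unit = 1/72 inch by default


def dpi_to_scale(dpi: float) -> float:
    """Convert a target DPI into a render scale factor."""
    return dpi / POINTS_PER_INCH


def rendered_size(width_pt: float, height_pt: float, dpi: float) -> tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at `dpi`."""
    # Multiply before dividing to keep the arithmetic exact for common sizes.
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


if __name__ == "__main__":
    # US Letter is 612 x 792 points (8.5 x 11 inches).
    print(dpi_to_scale(150))
    print(rendered_size(612, 792, 150))
```

For a US Letter page at 150 DPI this gives 1275 by 1650 pixels, which is a useful sanity check on the output of either library.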

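Benchmark details

Metric	PDF Oxide	pypdfium2
Mean extraction time	0.8ms	4.1ms
p99 extraction time	9ms	42ms
Pass rate (valid PDFs)	100% (3,823/3,823)	99.2% (3,792/3,823)

Both libraries run native code (Rust and C++ respectively), but PDF Oxide's text extraction pipeline is optimized specifically for this task: it performs single-pass extraction with preallocated buffers and a cached page tree.

See the full benchmark methodology for corpus details.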
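For readers who want to reproduce this kind of measurement on their own files, here is a minimal timing harness. It is a sketch and not the methodology behind the numbers above: `benchmark` and `percentile` are hypothetical helpers, the extractor is any callable that takes a file path, and p99 is computed with the simple nearest-rank rule.

```python
import math
import statistics
import time
from typing import Callable, Iterable


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def benchmark(extract: Callable[[str], object], paths: Iterable[str]) -> dict:
    """Time `extract` on each path; an exception counts as a failure."""
    timings_ms: list[float] = []
    total = 0
    failures = 0
    for path in paths:
        total += 1
        start = time.perf_counter()
        try:
            extract(path)
        except Exception:
            failures += 1
            continue
        timings_ms.append((time.perf_counter() - start) * 1000)
    nan = float("nan")
    return {
        "mean_ms": statistics.fmean(timings_ms) if timings_ms else nan,
        "p99_ms": percentile(timings_ms, 99) if timings_ms else nan,
        "pass_rate": (total - failures) / total if total else nan,
    }
```

To compare the two libraries, pass a small wrapper around each text extraction snippet shown earlier, for example `lambda p: PdfDocument(p).extract_text(0)`, and run both over the same list of files. The pass rate is successes divided by attempts, which is how 3,792 out of 3,823 becomes 99.2%.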

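Feature completeness

The biggest difference between these libraries is scope. pypdfium2 is a reader with rendering; PDF Oxide covers the full PDF lifecycle.

Feature	PDF Oxide	pypdfium2
Read and extract	Yes	Yes
Render pages	Yes	Yes
Create PDFs	Yes (Markdown, HTML, images)	No
Edit existing PDFs	Yes (text, images, annotations)	No
Fill form fields	Yes	No
Write encryption	Yes (AES-256)	No
Markdown/HTML output	Yes	No
OCR for scanned pages	Yes (PaddleOCR via ONNX)	No
PDF/A validation	Yes	No

If you only need to read and render PDFs, pypdfium2 is a solid choice. If you need any write capability (creation, editing, form filling, encryption), PDF Oxide covers it in a single library.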

pypdfium2 licensing (Apache-2.0)

pypdfium2 is licensed under Apache-2.0, which permits commercial use. It wraps Google's PDFium (the PDF engine in Chromium), and PDFium carries its own BSD-style license. Both are permissive.

Key considerations:

Apache-2.0: permissive and commercial-friendly, but requires attribution
PDFium dependency: the binary bundles Chromium's PDFium engine (about 15 MB)
Google's release cycle: pypdfium2 depends on PDFium releases from the Chromium project
No Python API stability guarantee: the API closely follows PDFium's C API

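PDF Oxide is MIT licensed. MIT is shorter and simpler than Apache-2.0: it still requires keeping the copyright and permission notice with copies of the software, but it has no NOTICE-file or change-marking obligations.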

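When to use each

Choose PDF Oxide if:

You need more than reading and rendering (creation, editing, forms, encryption)
You need conversion to Markdown or HTML
You need built-in OCR for scanned documents
You need the highest reliability (100% vs 99.2%)
Speed matters, and a 5× difference is meaningful at scale

Choose pypdfium2 if:

You only need to read and render PDFs
You prefer PDFium's specific rendering output
You want a smaller dependency footprint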
