What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

PDF Oxide vs pypdfium2

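PDF Oxide and pypdfium2 are both fast, natively compiled Python PDF libraries. pypdfium2 wraps Google's PDFium engine, while PDF Oxide is built on a Rust core. The biggest difference is feature scope: pypdfium2 is primarily a reader and renderer, whereas PDF Oxide covers the whole PDF lifecycle, from creation and extraction to OCR, forms, encryption, and compliance.

Key differences

Speed. Both are fast, but PDF Oxide is about 5.1× faster on average: 0.8ms vs 4.1ms mean extraction time. Both are dramatically faster than pure-Python libraries.

Features. pypdfium2 is read-only with rendering. PDF Oxide also supports creation, editing, form writing, encryption, Markdown/HTML output, and OCR.

Reliability. PDF Oxide passes 100% of valid PDFs. pypdfium2 passes 99.2%, with 31 failures.

License. Both are permissively licensed: PDF Oxide is MIT and pypdfium2 is Apache-2.0. Neither raises AGPL concerns.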

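Comparison overview

	PDF Oxide	pypdfium2
Mean extraction time	0.8ms	4.1ms
Pass rate (3,830 PDFs)	100%	99.2%
License	MIT	Apache-2.0
Language	Rust + PyO3	C++ (PDFium)
Text extraction	Yes	Yes
Character positions	Yes	Yes
Image extraction	Yes	Yes
Markdown output	Yes	No
HTML output	Yes	No
PDF creation	Yes	No
PDF editing	Yes	No
Form fields	Read/write	Read only
Encryption	Read/write	Read only
Rendering	Yes	Yes
OCR	Built-in	No
Search	Regex + spatial search	Yes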

Code comparison

Text extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)

pypdfium2:

import pypdfium2 as pdfium

pdf = pdfium.PdfDocument("report.pdf")
page = pdf[0]
textpage = page.get_textpage()
text = textpage.get_text_range()
print(text)
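
Both snippets above read only the first page. As a minimal sketch of whole-document extraction with pypdfium2, the loop below relies on the document object supporting len() and integer indexing, as pdfium.PdfDocument does. The helper name extract_all_text is made up for illustration and is not part of either library.

```python
def extract_all_text(pdf):
    """Return the text of every page, in page order.

    `pdf` is expected to be an opened pypdfium2.PdfDocument
    (any object supporting len() and integer indexing works).
    """
    pages = []
    for index in range(len(pdf)):
        page = pdf[index]
        textpage = page.get_textpage()
        pages.append(textpage.get_text_range())
    return pages


# Usage (requires pypdfium2 and a real file):
#   import pypdfium2 as pdfium
#   texts = extract_all_text(pdfium.PdfDocument("report.pdf"))
#   print("\n\n".join(texts))
```

Returning one string per page keeps page boundaries available to the caller, which is often useful when chunking text for search or RAG pipelines.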

Image extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

pypdfium2:

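import pypdfium2 as pdfium
# In pypdfium2 v4 and later, raw PDFium constants live in pypdfium2.raw
import pypdfium2.raw as pdfium_c

pdf = pdfium.PdfDocument("report.pdf")
page = pdf[0]
for i, obj in enumerate(page.get_objects()):
    # Keep only image objects; text and path objects are skipped
    if obj.type == pdfium_c.FPDF_PAGEOBJ_IMAGE:
        bitmap = obj.get_bitmap()
        bitmap.to_pil().save(f"image_{i}.png")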

PDF creation

PDF Oxide:

from pdf_oxide import Pdf

pdf = Pdf.from_markdown("# Report\n\nQuarterly results are in.")
pdf.save("report.pdf")

pypdfium2:

# pypdfium2 cannot create PDFs.
# It is a read-only library with rendering support.

Rendering

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
image = doc.render_page(0, dpi=150)
image.save("page.png")

pypdfium2:

import pypdfium2 as pdfium

pdf = pdfium.PdfDocument("report.pdf")
page = pdf[0]
bitmap = page.render(scale=150/72)
bitmap.to_pil().save("page.png")
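
The scale=150/72 argument in the pypdfium2 snippet follows from PDF's default unit of 1/72 inch: a scale of 1.0 renders one pixel per point, which is 72 DPI. The sketch below spells out that conversion. dpi_to_scale and rendered_size are illustrative helper names, not library functions, and rounding the pixel size up is an assumption (a renderer may round differently).

```python
import math

POINTS_PER_INCH = 72  # PDF default user-space unit is 1/72 inch


def dpi_to_scale(dpi: float) -> float:
    """Convert a target DPI into a scale factor relative to 72 DPI."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return dpi / POINTS_PER_INCH


def rendered_size(width_pt: float, height_pt: float, dpi: float) -> tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    scale = dpi_to_scale(dpi)
    return math.ceil(width_pt * scale), math.ceil(height_pt * scale)


# US Letter (612 x 792 pt) at 144 DPI is exactly twice the point size.
print(dpi_to_scale(144))             # 2.0
print(rendered_size(612, 792, 144))  # (1224, 1584)
```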

Benchmark details

Metric	PDF Oxide	pypdfium2
Mean extraction time	0.8ms	4.1ms
p99 extraction time	9ms	42ms
Pass rate (valid PDFs)	100% (3,823/3,823)	99.2% (3,792/3,823)
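
The headline ratios can be re-derived from the table. A quick check, using only the figures quoted above:

```python
# Figures copied from the benchmark table above
oxide_mean_ms = 0.8
pdfium_mean_ms = 4.1
valid_pdfs = 3823
pdfium_passed = 3792

speedup = pdfium_mean_ms / oxide_mean_ms       # ratio of mean extraction times
pass_rate = 100 * pdfium_passed / valid_pdfs   # percentage of valid PDFs passed
failures = valid_pdfs - pdfium_passed          # valid PDFs that failed

print(f"speedup: {speedup:.1f}x")                 # speedup: 5.1x
print(f"pypdfium2 pass rate: {pass_rate:.1f}%")   # pypdfium2 pass rate: 99.2%
print(f"pypdfium2 failures: {failures}")          # pypdfium2 failures: 31
```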

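Both libraries run native code (Rust and C++ respectively), but PDF Oxide's text extraction pipeline is optimized specifically for this task: it uses single-pass extraction, preallocated buffers, and a cached page tree.

For full benchmark details, see the performance benchmarks page.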
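
The benchmark harness itself is not shown here. As a rough sketch of how mean and p99 per-document timings like those above could be collected, the function below times any extraction callable over a list of inputs. The name time_extractor, the nearest-rank definition of p99, and the choice to count any exception as a failure are all assumptions, not the published methodology.

```python
import math
import statistics
import time
from typing import Callable, Iterable


def time_extractor(extract: Callable[[str], object], paths: Iterable[str]) -> dict:
    """Time extract(path) for each path; report mean/p99 in ms and pass rate."""
    timings_ms = []
    failures = 0
    total = 0
    for path in paths:
        total += 1
        start = time.perf_counter()
        try:
            extract(path)
        except Exception:
            failures += 1  # any exception counts as a failed document
            continue
        timings_ms.append((time.perf_counter() - start) * 1000)

    if not timings_ms:
        raise ValueError("no document was processed successfully")

    ordered = sorted(timings_ms)
    # Nearest-rank percentile: smallest sample with >= 99% of samples at or below it
    p99_index = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return {
        "mean_ms": statistics.fmean(ordered),
        "p99_ms": ordered[p99_index],
        "pass_rate": 100 * (total - failures) / total,
    }
```

For example, passing lambda p: PdfDocument(p).extract_text(0) as the callable would time the first-page extraction shown in the code comparison. Failed documents are excluded from the timing statistics but still count against the pass rate.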

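Feature coverage

The biggest difference between the two libraries is feature scope. pypdfium2 is a reader with rendering, while PDF Oxide covers the whole PDF lifecycle:

Feature	PDF Oxide	pypdfium2
Reading and extraction	Yes	Yes
Page rendering	Yes	Yes
PDF creation	Yes (Markdown, HTML, images)	No
Editing existing PDFs	Yes (text, images, annotations)	No
Form field filling	Yes	No
Writing encrypted PDFs	Yes (AES-256)	No
Markdown/HTML output	Yes	No
OCR for scanned pages	Yes (PaddleOCR via ONNX)	No
PDF/A validation	Yes	No

If you only need to read and render PDFs, pypdfium2 is a solid choice. If you need any write capability (creation, editing, form filling, or encryption), PDF Oxide is a one-stop solution.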
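
One way to apply the table is to encode it as data and ask which library covers a given set of requirements. The sketch below transcribes the table above. The feature keys and the libraries_covering helper are illustrative names, not part of either library.

```python
# Feature support transcribed from the table above (True = supported)
FEATURES = {
    "read_extract":    {"PDF Oxide": True, "pypdfium2": True},
    "render":          {"PDF Oxide": True, "pypdfium2": True},
    "create":          {"PDF Oxide": True, "pypdfium2": False},
    "edit":            {"PDF Oxide": True, "pypdfium2": False},
    "fill_forms":      {"PDF Oxide": True, "pypdfium2": False},
    "write_encrypted": {"PDF Oxide": True, "pypdfium2": False},
    "markdown_html":   {"PDF Oxide": True, "pypdfium2": False},
    "ocr":             {"PDF Oxide": True, "pypdfium2": False},
    "pdfa_validation": {"PDF Oxide": True, "pypdfium2": False},
}

LIBRARIES = ["PDF Oxide", "pypdfium2"]


def libraries_covering(required):
    """Return the libraries that support every feature in `required`."""
    unknown = set(required) - set(FEATURES)
    if unknown:
        raise KeyError(f"unknown features: {sorted(unknown)}")
    return [lib for lib in LIBRARIES if all(FEATURES[f][lib] for f in required)]


print(libraries_covering(["read_extract", "render"]))  # ['PDF Oxide', 'pypdfium2']
print(libraries_covering(["render", "ocr"]))           # ['PDF Oxide']
```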

pypdfium2 licensing (Apache-2.0)

pypdfium2 is Apache-2.0 licensed and can be used commercially. Note that it wraps Google's PDFium (the PDF engine in Chromium), which carries its own BSD-style license. Both are permissive.

Key points:

Apache-2.0: permissive, commercial use allowed, attribution required
PDFium dependency: the binary bundles Chromium's PDFium engine (~15 MB)
Google's release cycle: pypdfium2 depends on PDFium releases from the Chromium project
No Python API stability guarantee: the API closely follows PDFium's C API

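PDF Oxide is MIT licensed, which is shorter and carries fewer conditions than Apache-2.0: there is no NOTICE file to propagate and no requirement to mark modified files. As with Apache-2.0, the MIT copyright and permission notice must still accompany copies of the software.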

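Choosing between them

Choose PDF Oxide when:

You need more than reading and rendering (creation, editing, forms, encryption)
You need conversion to Markdown or HTML
You need built-in OCR for scanned documents
You need the highest reliability (100% vs 99.2%)
Speed matters and a 5× difference is significant at scale

Choose pypdfium2 when:

You only need to read and render PDFs
You prefer PDFium's rendering output
You want a smaller dependency footprint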
