What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

PDF Oxide vs pypdf

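PDF Oxide is 15× faster than pypdf, has a higher pass rate, and ships with rendering and Markdown/HTML conversion built in. If you need more than basic PDF manipulation, PDF Oxide does in a single library what pypdf needs several combined packages to do.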

Why consider PDF Oxide over pypdf

Speed. pypdf is a pure-Python implementation. PDF Oxide uses a compiled Rust core exposed through PyO3 and runs directly inside the Python process. Mean text extraction time is 0.8ms versus 12.1ms, a 15× difference.

Reliability. PDF Oxide passes 100% of the 3,830 test PDFs. pypdf's pass rate is 98.4%, which corresponds to 61 failures among the 3,823 valid PDFs in that corpus.
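The pass-rate figures can be reproduced in spirit with a few lines of Python. The sketch below is not the published benchmark harness: the `pass_rate` helper is hypothetical, and it assumes a file "passes" when the extractor returns without raising, which may differ from the criteria the benchmark actually uses. The last lines only re-check the arithmetic behind the quoted pypdf numbers.

```python
from __future__ import annotations

from typing import Callable, Iterable


def pass_rate(paths: Iterable[str], extract: Callable[[str], object]) -> tuple[int, int, float]:
    """Return (passed, total, percent) for an extractor run over paths.

    Assumption: a file passes when extract(path) returns without raising.
    """
    passed = 0
    total = 0
    for path in paths:
        total += 1
        try:
            extract(path)
        except Exception:
            # Any exception from the extractor counts as a failure for this file.
            continue
        passed += 1
    percent = 100.0 * passed / total if total else 0.0
    return passed, total, percent


# Arithmetic check of the quoted pypdf figures: 3,762 of 3,823 valid PDFs.
failures = 3823 - 3762
percent = round(100.0 * 3762 / 3823, 1)
print(failures, percent)  # prints: 61 98.4
```

To try it on a real corpus, pass a list of file paths and a callable such as `lambda p: PdfReader(p).pages[0].extract_text()` for pypdf.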

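Features. pypdf is primarily a PDF manipulation library (merge, split, rotate, encrypt) with basic text extraction. Rendering, Markdown output, and form creation require additional packages. PDF Oxide covers all of these in a single install.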

Quick comparison

Feature	PDF Oxide	pypdf
Mean extraction time	0.8ms	12.1ms
Pass rate (3,830 PDFs)	100%	98.4%
License	MIT	BSD-3-Clause
Language	Rust + PyO3	Pure Python
Text extraction	Yes	Yes
Character positions	Yes	Partial
Image extraction	Yes	Yes
Markdown output	Yes	No
HTML output	Yes	No
PDF creation	Yes (Markdown/HTML/images)	Limited (merge only)
Form fields	Read + write	Read + write
Encryption	Read + write	Read + write
Rendering	Yes	No
OCR	Built in	No
Search	Regex + spatial search	No
Install size	~5 MB	~1 MB

Code comparison

Text extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)

pypdf:

from pypdf import PdfReader

reader = PdfReader("report.pdf")
text = reader.pages[0].extract_text()
print(text)

Extracting all pages

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("book.pdf")
for i in range(doc.page_count()):
    text = doc.extract_text(i)
    print(f"--- Page {i + 1} ---")
    print(text)

pypdf:

from pypdf import PdfReader

reader = PdfReader("book.pdf")
for page in reader.pages:
    text = page.extract_text()
    print(text)

Image extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

pypdf:

from pypdf import PdfReader

reader = PdfReader("report.pdf")
page = reader.pages[0]
for i, image in enumerate(page.images):
    with open(f"image_{i}.{image.name.split('.')[-1]}", "wb") as f:
        f.write(image.data)

Encrypted PDFs

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("encrypted.pdf", password="secret")
text = doc.extract_text(0)

pypdf:

from pypdf import PdfReader

reader = PdfReader("encrypted.pdf")
reader.decrypt("secret")
text = reader.pages[0].extract_text()

Markdown conversion

PDF Oxide (built in):

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)

pypdf:

# pypdf has no Markdown conversion.
# You would need a separate tool chain.

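Benchmark details

Metric	PDF Oxide	pypdf
Mean extraction time	0.8ms	12.1ms
p99 extraction time	9ms	97ms
Pass rate (valid PDFs)	100% (3,823/3,823)	98.4% (3,762/3,823)

Because pypdf is pure Python, all of its processing runs inside the interpreter. PDF Oxide's Rust core handles parsing, font decoding, and text assembly natively, and only the final result crosses the Python boundary.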
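For readers who want to measure mean and p99 extraction time on their own files, a minimal timing sketch follows. It is an illustration, not the methodology behind the table above: the helper names `time_extractor` and `summarize` are hypothetical, each file is timed once with wall-clock `time.perf_counter`, and p99 uses the nearest-rank method, all of which are assumptions.

```python
from __future__ import annotations

import statistics
import time
from typing import Callable, Sequence


def time_extractor(paths: Sequence[str], extract: Callable[[str], object]) -> list[float]:
    """Return one wall-clock duration in milliseconds per path."""
    durations_ms = []
    for path in paths:
        start = time.perf_counter()
        extract(path)
        durations_ms.append((time.perf_counter() - start) * 1000.0)
    return durations_ms


def summarize(durations_ms: Sequence[float]) -> dict[str, float]:
    """Return the mean and the nearest-rank 99th percentile of the samples."""
    if not durations_ms:
        raise ValueError("need at least one timing sample")
    ordered = sorted(durations_ms)
    # Nearest-rank p99: smallest rank r with r >= 0.99 * n, in integer arithmetic.
    rank = (99 * len(ordered) + 99) // 100
    return {
        "mean_ms": statistics.fmean(ordered),
        "p99_ms": ordered[rank - 1],
    }
```

Calling `summarize(time_extractor(paths, extract))` once per library, with `extract` wrapping the single-page snippets shown earlier, gives numbers that can be compared side by side. A serious benchmark would also warm up, repeat runs, and isolate file I/O.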

See the full benchmark methodology for details on the corpus.

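Feature gap

pypdf excels at PDF manipulation: merge, split, rotate, encrypt. However, it lacks or only partially supports the following features.

Feature	PDF Oxide	pypdf
Markdown conversion	`doc.to_markdown(0)`	Not available
HTML conversion	`doc.to_html(0)`	Not available
PDF creation from content	`Pdf.from_markdown()`, `Pdf.from_html()`	Not available
Rendering to images	Yes	Not available
OCR for scanned PDFs	PaddleOCR built in	Not available
Text search	`doc.search("query")`	Not available
Character-level bounding boxes	`doc.extract_chars(0)`	Partial
PDF/A validation	Yes	Not available

If your workflow is purely merge/split/rotate, pypdf's lightweight pure-Python approach is a reasonable choice. When text extraction quality, PDF creation, or conversion is involved, PDF Oxide is the more complete option.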

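When to stick with pypdf

You need a pure-Python dependency with no compiled extensions at all
Your use case is strictly merge/split/rotate/encrypt, with no text extraction
You need pypdf-specific PDF manipulation methods for a legacy integration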
