What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting, which makes it well suited to LLM and RAG pipelines. No separate package is needed, unlike PyMuPDF, which requires the separate pymupdf4llm package (69× slower).

Migrating from PyMuPDF (fitz) to PDF Oxide

A complete guide to switching from PyMuPDF to PDF Oxide. It covers every API you use today and how to replace it.

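Why migrate from PyMuPDF

There are four reasons to migrate.

5.8× faster: PDF Oxide averages 0.8 ms per page, versus 4.6 ms for PyMuPDF. The absolute gap grows with scale: a 1,000-page batch finishes in under 1 second rather than about 5 seconds.
MIT licensed: PyMuPDF is AGPL, so you must either open-source all code that integrates with it or buy a commercial license. PDF Oxide is MIT and can be used anywhere without restriction.
100% reliability: PDF Oxide passes 100% of the PDF test suite. PyMuPDF fails on 0.7% of files (a 99.3% success rate), which works out to broken output on roughly 1 file in 140.
Rich built-in features: Markdown conversion, HTML output, OCR, XFA form support, and PDF rendering all ship in the box. PyMuPDF needs a separate package (pymupdf4llm) or an external tool (Tesseract) for equivalent features.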
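
The speed and reliability figures above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes per-page latency scales linearly with batch size; `batch_seconds` is a hypothetical helper, not part of either library.

```python
# Per-page mean latencies quoted in the text, in milliseconds.
PDF_OXIDE_MS = 0.8
PYMUPDF_MS = 4.6


def batch_seconds(pages: int, ms_per_page: float) -> float:
    """Estimated batch time, assuming latency scales linearly (an assumption)."""
    return pages * ms_per_page / 1000.0


speedup = PYMUPDF_MS / PDF_OXIDE_MS            # ratio behind the "5.8x" claim
oxide_s = batch_seconds(1000, PDF_OXIDE_MS)    # 1,000-page batch on PDF Oxide
pymupdf_s = batch_seconds(1000, PYMUPDF_MS)    # same batch on PyMuPDF
failure_one_in = 100 / 0.7                     # 0.7% failures -> one file in N

print(f"speedup: {speedup:.2f}x")
print(f"1,000 pages: {oxide_s:.1f} s vs {pymupdf_s:.1f} s")
print(f"about 1 failure in {failure_one_in:.0f} files")
```

The ratio comes out at 5.75, which the text rounds to 5.8×, and a 0.7% failure rate corresponds to roughly one file in 143.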

Step 1: Install

pip install pdf_oxide
pip uninstall pymupdf  # optional: remove it once everything works

Step 2: Swap the imports

# Before
import fitz

# After
from pdf_oxide import PdfDocument

If you were using pymupdf4llm for Markdown conversion, you can drop that dependency entirely. PDF Oxide handles it natively.

Step 3: API mapping table

Operation	PyMuPDF	PDF Oxide
Open a PDF	`fitz.open("file.pdf")`	`PdfDocument("file.pdf")`
Page count	`doc.page_count`	`doc.page_count()`
Text extraction	`doc[0].get_text()`	`doc.extract_text(0)`
Character positions	`doc[0].get_text("dict")`	`doc.extract_chars(0)`
Image extraction	`doc[0].get_images()` + `doc.extract_image(xref)`	`doc.extract_images(0)`
Text search	`doc[0].search_for("query")`	`doc.search_page(0, "query")`
Form fields	`doc[0].widgets()` or `doc.get_form_fields()`	`doc.get_form_fields()`
Encrypted PDF	`doc.authenticate("pw")`	`PdfDocument("f.pdf", password="pw")`
Markdown conversion	`pymupdf4llm.to_markdown("file.pdf")` (separate package)	`doc.to_markdown(0)` (built in)
HTML conversion	Not supported	`doc.to_html(0)`
PDF creation	Manual `insert_text()` calls	`Pdf.from_markdown("# Title")`
Render to image	`doc[0].get_pixmap()`	`doc.render_page(0)`
XFA forms	Not supported	`doc.has_xfa()`
OCR	Requires Tesseract	PaddleOCR built in

Step 4: Rewriting common patterns

Text extraction loop

# PyMuPDF
import fitz
doc = fitz.open("report.pdf")
for page in doc:
    text = page.get_text()
    print(text)

# PDF Oxide
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
for i in range(doc.page_count()):
    text = doc.extract_text(i)
    print(text)
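
PyMuPDF code often iterates `for page in doc`, while the PDF Oxide version above works by page index. One way to keep call sites tidy during a migration is a small generator. This is a sketch only: `iter_page_text` is a hypothetical helper that relies on nothing beyond `page_count()` and `extract_text(i)` as shown above, and `FakeDoc` is a stand-in so the example runs without a PDF file.

```python
from typing import Iterator, List, Protocol, Tuple


class TextDoc(Protocol):
    """The two methods this sketch needs from a document object."""

    def page_count(self) -> int: ...
    def extract_text(self, page_index: int) -> str: ...


def iter_page_text(doc: TextDoc) -> Iterator[Tuple[int, str]]:
    """Yield (page_index, text) for every page, in order."""
    for i in range(doc.page_count()):
        yield i, doc.extract_text(i)


class FakeDoc:
    """Stand-in for a real document; holds page text in a list."""

    def __init__(self, pages: List[str]) -> None:
        self._pages = list(pages)

    def page_count(self) -> int:
        return len(self._pages)

    def extract_text(self, page_index: int) -> str:
        return self._pages[page_index]


pages = list(iter_page_text(FakeDoc(["first page", "second page"])))
for index, text in pages:
    print(index, text)
```

In real use, the `FakeDoc(...)` argument would be replaced by `PdfDocument("report.pdf")`.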

Image extraction

PyMuPDF requires a multi-step xref lookup. PDF Oxide does it in a single call.

# PyMuPDF: multi-step xref lookup
import fitz
doc = fitz.open("report.pdf")
page = doc[0]
for img in page.get_images():
    xref = img[0]
    base = doc.extract_image(xref)
    # Include the xref in the filename so images do not overwrite each other
    with open(f"img_{xref}.{base['ext']}", "wb") as f:
        f.write(base["image"])

# PDF Oxide: one step
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
for i, img in enumerate(doc.extract_image_bytes(0)):
    with open(f"img_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])
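
When migrating a batch job, it can help to wrap the PDF Oxide loop in a function that writes to a chosen directory and reports what it saved. The sketch below assumes, as in the snippet above, that `extract_image_bytes(page)` returns items with `"format"` and `"data"` keys. `save_page_images` and `FakeDoc` are hypothetical names used for illustration, and `FakeDoc` stands in for a real document so the example runs on its own.

```python
import tempfile
from pathlib import Path
from typing import List


def save_page_images(doc, page_index: int, out_dir: str) -> List[Path]:
    """Write every image on one page to out_dir and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for i, img in enumerate(doc.extract_image_bytes(page_index)):
        # Page and image indices keep filenames unique across pages.
        path = out / f"page{page_index}_img{i}.{img['format']}"
        path.write_bytes(img["data"])
        written.append(path)
    return written


class FakeDoc:
    """Stand-in that mimics the assumed return shape of extract_image_bytes."""

    def extract_image_bytes(self, page_index: int):
        return [
            {"format": "png", "data": b"\x89PNG"},
            {"format": "jpg", "data": b"\xff\xd8"},
        ]


with tempfile.TemporaryDirectory() as tmp:
    paths = save_page_images(FakeDoc(), 0, tmp)
    names = [p.name for p in paths]
    sizes = [p.stat().st_size for p in paths]

print(names)
```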

Encrypted PDFs

PyMuPDF requires two steps: open, then authenticate. PDF Oxide supports both a password= constructor argument and doc.authenticate() after opening.

# PyMuPDF
import fitz
doc = fitz.open("encrypted.pdf")
doc.authenticate("password")
text = doc[0].get_text()

# PDF Oxide: one step with password=
from pdf_oxide import PdfDocument
doc = PdfDocument("encrypted.pdf", password="password")
text = doc.extract_text(0)

Markdown conversion

PyMuPDF needs the separate pymupdf4llm package. PDF Oxide provides Markdown output built in.

# PyMuPDF: requires a separate package
import pymupdf4llm
md = pymupdf4llm.to_markdown("report.pdf")

# PDF Oxide: built in
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
md = doc.to_markdown(0)
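
Note that `pymupdf4llm.to_markdown("report.pdf")` converts the whole file by default, whereas the PDF Oxide call above takes a page index. If `to_markdown(i)` returns Markdown for a single page, as that argument suggests (an assumption here), a whole-document equivalent could be sketched as follows. `document_to_markdown` and `FakeDoc` are hypothetical names, not part of PDF Oxide.

```python
from typing import List


def document_to_markdown(doc, separator: str = "\n\n") -> str:
    """Join per-page Markdown into one string, skipping empty pages."""
    parts = (doc.to_markdown(i).strip() for i in range(doc.page_count()))
    return separator.join(part for part in parts if part)


class FakeDoc:
    """Stand-in document whose pages are already Markdown strings."""

    def __init__(self, pages: List[str]) -> None:
        self._pages = list(pages)

    def page_count(self) -> int:
        return len(self._pages)

    def to_markdown(self, page_index: int) -> str:
        return self._pages[page_index]


md = document_to_markdown(FakeDoc(["# Title\n\nIntro\n", "   ", "## Next\n"]))
print(md)
```

Joining pages with a blank line keeps headings on separate blocks. Whether that matches the page separators pymupdf4llm emits is worth checking against your existing output.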

Page rendering

# PyMuPDF
import fitz
doc = fitz.open("report.pdf")
pix = doc[0].get_pixmap()
pix.save("page.png")

# PDF Oxide
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
png_bytes = doc.render_page(0, dpi=150)
with open("page.png", "wb") as f:
    f.write(png_bytes)

Step 5: Test the migration

Run your existing test files through both libraries and compare the output.

from pdf_oxide import PdfDocument

doc = PdfDocument("your-test-file.pdf")

# Check text extraction
text = doc.extract_text(0)
print(text[:500])

# Check the page count
print(f"Pages: {doc.page_count()}")

# Check form fields (if applicable)
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name}: {f.value}")
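
Text from two different extractors rarely matches byte for byte, since whitespace and line breaks tend to differ. One possible way to compare outputs is to normalize whitespace first and then compute a similarity score with the standard library's `difflib`. This is a sketch: `normalize` and `similarity` are hypothetical helpers, and the sample strings stand in for the results of `doc[0].get_text()` and `doc.extract_text(0)`.

```python
import difflib
import re


def normalize(text: str) -> str:
    """Collapse every run of whitespace to one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def similarity(old_text: str, new_text: str) -> float:
    """Return a 0.0 to 1.0 similarity ratio between two normalized strings."""
    matcher = difflib.SequenceMatcher(None, normalize(old_text), normalize(new_text))
    return matcher.ratio()


# Sample strings standing in for the output of each library.
old = "Quarterly  report\nRevenue: 42\n"
new = "Quarterly report\n\nRevenue: 42"

score = similarity(old, new)
print(f"similarity: {score:.3f}")
```

A threshold below 1.0 is a judgment call for your corpus. Pages that score low are the ones worth inspecting by hand.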