What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with a mean text extraction time of 0.8 ms: 5.8× faster than PyMuPDF (4.6 ms) and 15× faster than pypdf (12.1 ms). The benchmark covered 3,830 real-world PDFs with a 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

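Migrating from pdfplumber to PDF Oxide

A complete guide to switching from pdfplumber to PDF Oxide. It covers every API you use today and how to replace it.

Why switch from pdfplumber?

There are four reasons to migrate:

29× faster: PDF Oxide averages 0.8 ms per page, versus 23.2 ms for pdfplumber. A 100-page document is processed in 80 ms instead of 2.3 seconds.
Encrypted PDF support: pdfplumber cannot open encrypted PDFs at all. PDF Oxide transparently handles every encryption scheme, including AES-256.
Image extraction: pdfplumber has no image extraction feature. PDF Oxide extracts embedded images in a single call.
Markdown output: pdfplumber returns tables as Python lists that need manual formatting. PDF Oxide outputs structured Markdown that preserves tables, which is well suited to feeding LLMs.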
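
The document-level figures above follow directly from the per-page means. The sketch below reproduces that arithmetic, assuming extraction time scales linearly with page count; the helper name `document_time_ms` is illustrative and not part of either library.

```python
# Per-page mean extraction times quoted in this guide (milliseconds).
PDF_OXIDE_MS_PER_PAGE = 0.8
PDFPLUMBER_MS_PER_PAGE = 23.2


def document_time_ms(ms_per_page: float, pages: int) -> float:
    """Estimate total extraction time, assuming cost grows linearly with pages."""
    return ms_per_page * pages


pages = 100
oxide_ms = document_time_ms(PDF_OXIDE_MS_PER_PAGE, pages)
plumber_ms = document_time_ms(PDFPLUMBER_MS_PER_PAGE, pages)
speedup = PDFPLUMBER_MS_PER_PAGE / PDF_OXIDE_MS_PER_PAGE

print(f"PDF Oxide:  {oxide_ms:.0f} ms")
print(f"pdfplumber: {plumber_ms / 1000:.1f} s")
print(f"Speedup:    {speedup:.0f}x")
```

Real documents vary in page complexity, so treat these as rough estimates rather than guarantees.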

Step 1: Install

pip install pdf_oxide
pip uninstall pdfplumber  # optional
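
After installing, it can help to confirm which of the two packages the current environment can actually see. This is a minimal sketch using only the standard library; `is_installed` is an illustrative helper, and the import name `pdf_oxide` is taken from the import shown in Step 2.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


for name in ("pdf_oxide", "pdfplumber"):
    status = "installed" if is_installed(name) else "not installed"
    print(f"{name}: {status}")
```

Keeping both packages installed during the migration makes it possible to compare their output side by side before removing pdfplumber.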

Step 2: Replace the imports

# Before
import pdfplumber

# After
from pdf_oxide import PdfDocument

Step 3: API mapping

Task	pdfplumber	PDF Oxide
Open a PDF	`pdfplumber.open("file.pdf")`	`PdfDocument("file.pdf")`
Page count	`len(pdf.pages)`	`doc.page_count()`
Text extraction	`pdf.pages[0].extract_text()`	`doc.extract_text(0)`
Character positions	`pdf.pages[0].chars`	`doc.extract_chars(0)`
Table extraction	`pdf.pages[0].extract_tables()`	`doc.to_markdown(0)`
Form fields	Not supported (read-only)	`doc.get_form_fields()`
Encrypted PDFs	Not supported	`PdfDocument("file.pdf", password="pw")`
Image extraction	Not supported	`doc.extract_image_bytes(0)`
Markdown conversion	Not supported	`doc.to_markdown(0)`
Rendering	Not supported	`doc.render_page(0)`
OCR	Not supported	`doc.extract_text_ocr(0)`
PDF creation	Not supported	`Pdf.from_markdown("# Title")`

Step 4: Update common patterns

Text extraction

pdfplumber requires a context manager. PDF Oxide does not:

# pdfplumber: context manager required
import pdfplumber
with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

# PDF Oxide: no context manager needed
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
for i in range(doc.page_count()):
    text = doc.extract_text(i)
    print(text)
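
Code written around iterating over `pdf.pages` can keep its loop shape with a small generator. The sketch below relies only on the two calls shown in this guide, `page_count()` and `extract_text(i)`; the `TextDocument` protocol and the `iter_page_texts` helper are hypothetical names introduced for illustration and are not part of PDF Oxide.

```python
from typing import Iterator, Protocol, Tuple


class TextDocument(Protocol):
    """Minimal interface the helper relies on (the two calls used in this guide)."""

    def page_count(self) -> int: ...

    def extract_text(self, page_index: int) -> str: ...


def iter_page_texts(doc: TextDocument) -> Iterator[Tuple[int, str]]:
    """Yield (page_index, text) pairs, mirroring a loop over pdfplumber's pdf.pages."""
    for index in range(doc.page_count()):
        yield index, doc.extract_text(index)


# Usage, assuming pdf_oxide is installed and report.pdf exists:
#   from pdf_oxide import PdfDocument
#   for index, text in iter_page_texts(PdfDocument("report.pdf")):
#       print(index, len(text))
```

Because the helper is typed against a protocol rather than a concrete class, it can be unit-tested with a stub object before any real PDFs are involved.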

Table extraction

pdfplumber returns tables as nested Python lists. PDF Oxide outputs Markdown:

# pdfplumber: returns a list of lists
import pdfplumber
with pdfplumber.open("report.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()
    for table in tables:
        for row in table:
            print(row)

# PDF Oxide: structured Markdown output
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
md = doc.to_markdown(0)
print(md)  # tables are rendered as Markdown tables
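
If downstream code still expects rows as lists, Markdown tables can be parsed back into that shape. The sketch below assumes the Markdown uses standard pipe tables with leading and trailing `|` characters; this guide does not specify the exact table format PDF Oxide emits, so check real output first. `markdown_tables_to_rows` is a hypothetical helper, not a PDF Oxide function.

```python
from typing import List


def _is_separator(cells: List[str]) -> bool:
    """True for a header separator row such as | --- | :---: |."""
    return all(cell and set(cell) <= set("-: ") and "-" in cell for cell in cells)


def markdown_tables_to_rows(markdown: str) -> List[List[List[str]]]:
    """Parse pipe tables in a Markdown string into tables of rows of cell strings."""
    tables: List[List[List[str]]] = []
    current: List[List[str]] = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("|") and stripped.endswith("|") and len(stripped) > 1:
            cells = [cell.strip() for cell in stripped[1:-1].split("|")]
            if not _is_separator(cells):
                current.append(cells)
        elif current:
            # Any non-table line closes the table being collected.
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables
```

The result has the same nesting as pdfplumber's `extract_tables()` (tables, then rows, then cells), except that every cell is a string. Cells containing escaped pipes and data rows made up only of dashes are not handled by this sketch.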

Character-level extraction

# pdfplumber
import pdfplumber
with pdfplumber.open("report.pdf") as pdf:
    chars = pdf.pages[0].chars
    for c in chars:
        print(f"{c['text']} at ({c['x0']}, {c['top']})")

# PDF Oxide
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
chars = doc.extract_chars(0)
for c in chars:
    print(f"{c.char} at ({c.x}, {c.y})")

Encrypted PDFs (new capability)

pdfplumber cannot open encrypted PDFs. PDF Oxide handles them transparently:

from pdf_oxide import PdfDocument

# Supports every encryption scheme, including AES-256
doc = PdfDocument("encrypted.pdf", password="password")
text = doc.extract_text(0)
print(text)

Image extraction (new capability)

pdfplumber has no image extraction. With PDF Oxide it is straightforward:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

OCR for scanned documents (new capability)

pdfplumber cannot process scanned PDFs. PDF Oxide has OCR built in:

from pdf_oxide import PdfDocument

doc = PdfDocument("scanned.pdf")
text = doc.extract_text_ocr(0)
print(text)

Key differences

No context manager: pdfplumber uses `with pdfplumber.open(...) as pdf:`, whereas PDF Oxide needs no context manager.
Encrypted PDFs: pdfplumber cannot open encrypted PDFs. PDF Oxide handles encryption transparently.
Tables: pdfplumber returns Python lists. PDF Oxide outputs tables as Markdown or HTML. For complex tables that need visual debugging, you can keep using pdfplumber alongside PDF Oxide.

Step 5: Test the migration

Run your existing test files through both libraries and compare the output:

from pdf_oxide import PdfDocument

doc = PdfDocument("your-test-file.pdf")

# Verify text extraction
text = doc.extract_text(0)
print(text[:500])

# Verify the page count
print(f"Pages: {doc.page_count()}")

# Verify form fields (if applicable)
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name}: {f.value}")
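
To compare the two libraries' output rather than inspect it by eye, the extracted strings can be diffed with the standard library. This is a minimal sketch: `normalize` and `compare_extractions` are illustrative helpers, and collapsing whitespace before diffing is an assumption about which differences matter for your documents.

```python
import difflib
from typing import List, Tuple


def normalize(text: str) -> List[str]:
    """Collapse runs of whitespace and drop empty lines so layout noise is ignored."""
    lines = (" ".join(line.split()) for line in text.splitlines())
    return [line for line in lines if line]


def compare_extractions(old_text: str, new_text: str) -> Tuple[float, List[str]]:
    """Return a similarity ratio in [0, 1] and a unified diff of normalized lines."""
    old_lines, new_lines = normalize(old_text), normalize(new_text)
    ratio = difflib.SequenceMatcher(None, old_lines, new_lines).ratio()
    diff = list(
        difflib.unified_diff(
            old_lines, new_lines, fromfile="pdfplumber", tofile="pdf_oxide", lineterm=""
        )
    )
    return ratio, diff
```

Pass the text from `page.extract_text()` as `old_text` and the text from `doc.extract_text(i)` as `new_text`, page by page. A ratio below 1.0 is not necessarily a regression, since the two libraries may order or space text differently; the diff shows where to look.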