What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

PDF Oxide vs pdfplumber

PDF Oxide extracts text 29× faster than pdfplumber while offering a broader feature set. pdfplumber, on the other hand, has a more mature table extraction algorithm. This page helps you choose the right tool for your use case.

Key differences

Speed. pdfplumber is pure Python (built on pdfminer). PDF Oxide's Rust core extracts text in 0.8ms on average, 29× faster than pdfplumber's 23.2ms.

Reliability. PDF Oxide passes 100% of the 3,830 test PDFs. pdfplumber passes 98.8%, failing on 46 valid PDFs.

Tables. pdfplumber has the best table extraction of any Python PDF library. PDF Oxide's table detection is usable, but it is not yet mature for complex multi-row, multi-column layouts with merged cells.

Scope. pdfplumber is read-only. PDF Oxide adds creation, editing, encryption, rendering, and Markdown/HTML output.

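Quick comparison

Feature	PDF Oxide	pdfplumber
Mean extraction time	0.8ms	23.2ms
Pass rate (3,830 PDFs)	100%	98.8%
License	MIT	MIT
Language	Rust + PyO3	Pure Python
Text extraction	Yes	Yes
Character positions	Yes	Yes
Table extraction	Basic	Advanced
Image extraction	Yes	No
Visual debugging	No	Yes
Markdown output	Yes	No
HTML output	Yes	No
PDF creation	Yes	No
PDF editing	Yes	No
Encryption	Read + write	No
Rendering	Yes	No
Form fields	Read + write	Read-only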

Side-by-side code

Text extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)

pdfplumber:

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
    print(text)

Character-level extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
chars = doc.extract_chars(0)
for ch in chars[:10]:
    print(f"'{ch.char}' at ({ch.x:.1f}, {ch.y:.1f}) size={ch.font_size:.1f}")

pdfplumber:

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    for char in page.chars[:10]:
        print(f"'{char['text']}' at ({char['x0']:.1f}, {char['top']:.1f}) "
              f"size={char['size']:.1f}")
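
Character positions are what make layout-aware processing possible. As an illustration, the sketch below groups characters into text lines by vertical position. It assumes pdfplumber-style dicts with `text`, `x0`, and `top` keys, as in the loop above. `chars_to_lines` and the 3-point tolerance are hypothetical choices, not part of either library.

```python
from typing import Dict, List


def chars_to_lines(chars: List[Dict], y_tolerance: float = 3.0) -> List[str]:
    """Group character dicts into text lines by their `top` coordinate."""
    lines: List[List[Dict]] = []
    # Visit characters top-to-bottom, then left-to-right.
    for ch in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        # Same line if this character is vertically close to the line's first one.
        if lines and abs(ch["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    # Within each line, order characters by horizontal position.
    return [
        "".join(c["text"] for c in sorted(line, key=lambda c: c["x0"]))
        for line in lines
    ]


sample_chars = [
    {"text": "b", "x0": 20.0, "top": 100.5},
    {"text": "a", "x0": 10.0, "top": 100.0},
    {"text": "c", "x0": 10.0, "top": 120.0},
]
print(chars_to_lines(sample_chars))
```

With pdfplumber, the input would be `page.chars`. Real documents often need a tolerance derived from the font size rather than a fixed value.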

Table extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")
md = doc.to_markdown(0, detect_headings=True)
# Tables are converted to Markdown table syntax
print(md)

pdfplumber:

import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()
    for table in tables:
        for row in table:
            print(row)

pdfplumber's extract_tables() returns structured row and column data, with configurable line detection. For complex tables with merged cells, spanning headers, or borderless layouts, pdfplumber's algorithm is more robust.
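
The line detection behind `extract_tables()` is controlled through its `table_settings` argument. A hedged sketch of two common configurations follows. The strategy names come from pdfplumber's documentation; the tolerance values, the constant names, and the `extract_tables_with` helper are illustrative assumptions to tune per document.

```python
# For tables drawn without ruling lines: infer row and column boundaries
# from the alignment of words instead of from drawn lines.
BORDERLESS_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
    "snap_tolerance": 3,          # merge nearly aligned edges (in points)
    "intersection_tolerance": 3,  # how close edges must be to count as crossing
}

# For tables with a fully drawn grid: rely on the ruling lines.
RULED_SETTINGS = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
}


def extract_tables_with(path: str, settings: dict, page_index: int = 0) -> list:
    """Return the tables pdfplumber finds on one page using `settings`."""
    # Imported here so the settings above can be inspected without pdfplumber.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return pdf.pages[page_index].extract_tables(table_settings=settings)
```

A call such as `extract_tables_with("invoice.pdf", BORDERLESS_SETTINGS)` returns a list of tables, each a list of rows whose cells are strings or `None`.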

Benchmark details

Metric	PDF Oxide	pdfplumber
Mean extraction time	0.8ms	23.2ms
p99 extraction time	9ms	189ms
Pass rate (valid PDFs)	100% (3,823/3,823)	98.8% (3,777/3,823)

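The 29× speed difference comes from pdfplumber's pure Python architecture. pdfplumber builds on pdfminer for parsing and adds its own spatial analysis layer on top, both of which are written in Python. PDF Oxide handles all parsing, font decoding, and text assembly in compiled Rust.

See the full benchmark methodology for corpus details.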
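
To check figures like the mean, p99, and pass rate on your own files, a small timing harness is enough. The sketch below is not the harness behind the published numbers. `benchmark` and `fake_extract` are hypothetical names, and the nearest-rank p99 is one reasonable choice among several.

```python
import math
import statistics
import time
from typing import Callable, Iterable


def benchmark(extract: Callable[[str], object], paths: Iterable[str]) -> dict:
    """Time `extract(path)` once per file and summarize the results.

    A file counts as a failure if `extract` raises; failures are excluded
    from the timing statistics.
    """
    timings_ms = []
    failures = 0
    for path in paths:
        start = time.perf_counter()
        try:
            extract(path)
        except Exception:
            failures += 1
            continue
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    total = len(timings_ms) + failures
    ordered = sorted(timings_ms)
    nan = float("nan")
    # Nearest-rank p99: smallest sample with at least 99% of samples at or below it.
    p99 = ordered[max(0, math.ceil(0.99 * len(ordered)) - 1)] if ordered else nan
    return {
        "mean_ms": statistics.fmean(ordered) if ordered else nan,
        "p99_ms": p99,
        "pass_rate": (total - failures) / total if total else nan,
        "failures": failures,
    }


def fake_extract(path: str) -> str:
    """Stand-in extractor for demonstration; fails on one specific name."""
    if path == "broken.pdf":
        raise ValueError("cannot parse")
    return "text"


result = benchmark(fake_extract, ["a.pdf", "b.pdf", "broken.pdf", "c.pdf"])
print(result["pass_rate"], result["failures"])
```

To time a real library, pass a callable such as `lambda path: PdfDocument(path).extract_text(0)`, using the `PdfDocument` calls shown in the examples above. Run each library on the same files and the same machine.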

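When to use each

Choose PDF Oxide when:

Speed matters. When you process thousands of PDFs, a 29× speedup adds up: at the mean timings, 10,000 PDFs take about 8 seconds instead of almost 4 minutes.
You need more than extraction. Creation, editing, encryption, rendering, or Markdown output.
You want maximum reliability. 100% pass rate versus 98.8%.
You need image extraction. pdfplumber does not extract images.
Batch processing pipelines. At 0.8ms per PDF, 3,830 PDFs take about 3.1 seconds.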
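
The batch figure above follows directly from the mean timings. A quick sketch of the arithmetic, using only numbers quoted on this page. These are means for sequential processing, so real runs will vary with file size and hardware.

```python
# Mean per-PDF extraction times from the benchmark table (milliseconds).
MEAN_MS = {"PDF Oxide": 0.8, "pdfplumber": 23.2}
CORPUS_SIZE = 3830  # number of PDFs in the benchmark corpus


def batch_seconds(mean_ms: float, n_pdfs: int) -> float:
    """Estimated wall-clock time for a sequential batch, in seconds."""
    return mean_ms * n_pdfs / 1000.0


estimates = {name: batch_seconds(ms, CORPUS_SIZE) for name, ms in MEAN_MS.items()}
speedup = MEAN_MS["pdfplumber"] / MEAN_MS["PDF Oxide"]

for name, seconds in estimates.items():
    print(f"{name}: {seconds:.1f} s for {CORPUS_SIZE} PDFs")
print(f"speedup: {speedup:.0f}x")
```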

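Choose pdfplumber when:

Complex table extraction is your primary use case. pdfplumber's table algorithm handles merged cells, borderless tables, and spanning headers better.
You need visual debugging. pdfplumber can render annotated page images showing detected lines, characters, and table boundaries.
You prefer pure Python. No compiled dependencies, so it installs anywhere.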

Use both:

For pipelines that need both fast text extraction and complex table parsing, use PDF Oxide for text and pdfplumber for tables:

from pdf_oxide import PdfDocument
import pdfplumber

# Fast text extraction with PDF Oxide
doc = PdfDocument("report.pdf")
text = doc.extract_text(0)

# Complex table extraction with pdfplumber
with pdfplumber.open("report.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()
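
If the tables from pdfplumber need to sit alongside the text in one document, one option is to render them as Markdown. A minimal sketch follows. `table_to_markdown` is a hypothetical helper, not part of either library, and it assumes the first row of each table is the header.

```python
from typing import List, Optional

Row = List[Optional[str]]


def table_to_markdown(table: List[Row]) -> str:
    """Render one extracted table (first row = header) as a Markdown table."""
    if not table:
        return ""

    def clean(cell: Optional[str]) -> str:
        # Empty cells arrive as None; newlines and pipes would break a row.
        return (cell or "").replace("\n", " ").replace("|", "\\|").strip()

    # Pad short rows so every row has the same number of columns.
    width = max(len(row) for row in table)
    rows = [[clean(c) for c in row] + [""] * (width - len(row)) for row in table]

    header, body = rows[0], rows[1:]
    lines = ["| " + " | ".join(header) + " |", "|" + " --- |" * width]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


sample = [["Item", "Qty"], ["Widget", "2"], ["Gadget\n(large)", None]]
print(table_to_markdown(sample))
```

In the pipeline above, this would be applied to each entry of `tables` and the results appended to `text`.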
