What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

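PDF Oxide MCP Server: PDF Extraction for AI Assistants

pdf-oxide-mcp is a Model Context Protocol server that lets AI assistants extract content from PDFs. All processing runs locally, and your files never leave your machine.

Install crgx (first time only)

crgx is an npx-style runner for Rust binaries. It downloads pdf_oxide_mcp automatically on first run, so you do not need to install the MCP server manually.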

Linux / macOS

curl -fsSL crgx.dev/install.sh | sh

Windows (PowerShell)

irm crgx.dev/install.ps1 | iex

Configuration

Once crgx is installed, add the configuration below to your AI tool. crgx downloads and updates pdf_oxide_mcp automatically.

Claude Desktop

Add the following to ~/.config/claude/claude_desktop_config.json (Linux) or ~/Library/Application Support/Claude/claude_desktop_config.json (macOS):

{
  "mcpServers": {
    "pdf-oxide": {
      "command": "crgx",
      "args": ["pdf_oxide_mcp@latest"]
    }
  }
}
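
If the config file already lists other MCP servers, pasting the snippet over it would drop them. The sketch below shows one way to merge the entry in place. The helper name add_mcp_server is hypothetical and is not part of PDF Oxide or crgx; it assumes the file is plain JSON with a top-level mcpServers object, as in the snippet above.

```python
import json
from pathlib import Path


def add_mcp_server(config_path: Path, name: str, command: str, args: list[str]) -> dict:
    """Merge one MCP server entry into a JSON config file (hypothetical helper).

    Existing keys and other server entries are preserved; an entry with the
    same name is overwritten.
    """
    config: dict = {}
    if config_path.exists():
        text = config_path.read_text(encoding="utf-8")
        if text.strip():
            config = json.loads(text)

    # Create the "mcpServers" object if the file does not have one yet.
    servers = config.setdefault("mcpServers", {})
    servers[name] = {"command": command, "args": args}

    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For the Linux path above, a call might look like add_mcp_server(Path.home() / ".config/claude/claude_desktop_config.json", "pdf-oxide", "crgx", ["pdf_oxide_mcp@latest"]). Back up the file first, since the helper rewrites it.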

Claude Code

Add the following to .claude/settings.json in your project:

{
  "mcpServers": {
    "pdf-oxide": {
      "command": "crgx",
      "args": ["pdf_oxide_mcp@latest"]
    }
  }
}

Cursor

Add the following to Cursor's MCP settings:

{
  "mcpServers": {
    "pdf-oxide": {
      "command": "crgx",
      "args": ["pdf_oxide_mcp@latest"]
    }
  }
}

Other installation methods

If you prefer not to use crgx, you can install pdf_oxide_mcp directly.

Homebrew (macOS / Linux)

brew install yfedoseev/tap/pdf-oxide    # includes pdf-oxide-mcp

Cargo

cargo install pdf_oxide_mcp

Then point the configuration directly at the installed binary:

{
  "mcpServers": {
    "pdf-oxide": {
      "command": "pdf-oxide-mcp"
    }
  }
}

Available tools

`extract`

Extracts text, Markdown, or HTML from a PDF file.

Parameter	Type	Required	Description
`file_path`	string	Yes	Path to the PDF file
`output_path`	string	Yes	Path where the extracted output is written
`format`	string	No	`"text"` (default), `"markdown"`, or `"html"`
`pages`	string	No	Page range, e.g. `"1-3,7,10-12"`
`password`	string	No	Password for an encrypted PDF
`images`	boolean	No	Write images to files alongside the output file
`embed_images`	boolean	No	Embed images as base64 in markdown/html (default: true)
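
To make the pages syntax concrete, here is a minimal sketch of how a string such as "1-3,7,10-12" could be expanded into page numbers. It assumes pages are 1-based and ranges are inclusive, which the table does not state, and the function parse_page_range is hypothetical; it does not claim to mirror the server's own parser.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a spec like "1-3,7,10-12" into a sorted list of page numbers.

    Assumption: pages are 1-based and ranges include both endpoints.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1-3,,7"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)


print(parse_page_range("1-3,7,10-12"))  # [1, 2, 3, 7, 10, 11, 12]
```

Checking the string on the client side like this can catch a reversed range such as "8-4" before a request is ever sent.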

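How it works

The MCP server communicates using JSON-RPC 2.0 over stdio. When an AI assistant wants to read a PDF, it sends a tools/call request and receives the result of the extraction in the response.

All processing happens locally and uses the same Rust extraction engine as the library and the CLI. No data is sent to external services.

Example prompts for your assistant

Once the MCP server is connected, the assistant calls extract on its own. Examples of prompts that work well:

"Convert report.pdf to Markdown and write it to report.md."
"Convert pages 4-8 of contract.pdf to HTML with embedded images and save it to contract.html."
"bank-statement.pdf is password protected (pw: hunter2). Extract only the transaction table as text."

Under the hood, the assistant sends a JSON-RPC call like the following: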
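
As a rough illustration of the stdio transport, the sketch below builds a tools/call request and frames it the way MCP's stdio transport expects: one JSON object per line, with no embedded newlines. The helper names are hypothetical. A real client would also perform the MCP initialize handshake before calling any tool, which is omitted here.

```python
import json
from itertools import count

# Monotonic request ids, so responses can be matched to requests.
_request_ids = count(1)


def make_tool_call(name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request that invokes an MCP tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def encode_message(message: dict) -> bytes:
    """Serialize a message as a single newline-terminated line of UTF-8 JSON.

    json.dumps escapes newlines inside strings, so the only raw newline
    in the output is the terminator.
    """
    return (json.dumps(message, separators=(",", ":")) + "\n").encode("utf-8")


def decode_message(line: bytes) -> dict:
    """Parse one line read from the server's stdout."""
    return json.loads(line.decode("utf-8"))


request = make_tool_call(
    "extract",
    {"file_path": "/path/report.pdf", "output_path": "/path/report.md", "format": "markdown"},
)
wire = encode_message(request)  # bytes a client would write to the server's stdin
```

A client would write wire to the stdin of the server process and read replies line by line from its stdout, passing each line to decode_message.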

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "extract",
    "arguments": {
      "file_path": "/path/report.pdf",
      "output_path": "/path/report.md",
      "format": "markdown",
      "pages": "4-8",
      "images": true,
      "embed_images": true
    }
  }
}

The server writes the result to output_path and returns a short confirmation message. The assistant then reads the file back and brings it into its context.
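
The read-back step could be sketched as follows. The response shape (a result with a content list of text items and an optional isError flag, or a JSON-RPC error object) follows the general MCP tools/call convention; the exact wording of the confirmation message is not documented here, so the code only surfaces it on failure. The function read_extraction is hypothetical.

```python
from pathlib import Path


def read_extraction(response: dict, output_path: Path) -> str:
    """Check a tools/call response, then load the file the server wrote."""
    # Protocol-level failure: the request itself was rejected.
    if "error" in response:
        err = response["error"]
        raise RuntimeError(f"JSON-RPC error {err.get('code')}: {err.get('message')}")

    result = response.get("result", {})
    messages = [
        item.get("text", "")
        for item in result.get("content", [])
        if item.get("type") == "text"
    ]

    # Tool-level failure: the call ran but the tool reported an error.
    if result.get("isError"):
        raise RuntimeError("extract failed: " + " ".join(messages))

    # Success: the extracted content lives in the output file, not in the reply.
    return output_path.read_text(encoding="utf-8")
```

Keeping the extracted text out of the reply means a large PDF does not inflate the JSON-RPC message; the caller decides how much of the file to load.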