What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library in the project's benchmark, with a mean text extraction time of 0.8ms: 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). The benchmark covers 3,830 real-world PDFs, on which PDF Oxide had a 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting, which suits LLM and RAG pipelines. No separate package is needed, unlike PyMuPDF, which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide MCP Server

pdf-oxide-mcp is a Model Context Protocol server that lets AI assistants extract content from PDFs. It runs locally — no files leave your machine.

Install crgx (one-time)

crgx is an npx-like runner for Rust binaries — it auto-downloads pdf_oxide_mcp on first run. No manual MCP install needed.

Linux / macOS

curl -fsSL crgx.dev/install.sh | sh

Windows (PowerShell)

irm crgx.dev/install.ps1 | iex

Configuration

After installing crgx, add the config below to your AI tool. That’s it — crgx handles downloading and updating pdf_oxide_mcp automatically.

Claude Desktop

Add to ~/.config/claude/claude_desktop_config.json (Linux) or ~/Library/Application Support/Claude/claude_desktop_config.json (macOS):

{
  "mcpServers": {
    "pdf-oxide": {
      "command": "crgx",
      "args": ["pdf_oxide_mcp@latest"]
    }
  }
}
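If the config file already lists other MCP servers, the entry has to be merged in rather than pasted over the file. The following is a minimal sketch of that merge using only the Python standard library. The helper name add_pdf_oxide_server is hypothetical and is not part of PDF Oxide or crgx; the entry it writes is copied from the config above.

```python
import json
from pathlib import Path


def add_pdf_oxide_server(config_path: Path) -> dict:
    """Merge the pdf-oxide entry into an MCP config file, keeping other servers.

    Hypothetical helper for illustration; not shipped with PDF Oxide or crgx.
    """
    config = {}
    if config_path.exists():
        raw = config_path.read_text(encoding="utf-8")
        if raw.strip():
            config = json.loads(raw)

    # Same entry as in the config shown above.
    servers = config.setdefault("mcpServers", {})
    servers["pdf-oxide"] = {"command": "crgx", "args": ["pdf_oxide_mcp@latest"]}

    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Calling add_pdf_oxide_server(Path.home() / ".config/claude/claude_desktop_config.json") would apply it to the Linux path given above. Restart the AI tool afterwards so it rereads the file.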

Claude Code

Add to your project’s .claude/settings.json:

{
  "mcpServers": {
    "pdf-oxide": {
      "command": "crgx",
      "args": ["pdf_oxide_mcp@latest"]
    }
  }
}

Cursor

Add to Cursor MCP settings:

{
  "mcpServers": {
    "pdf-oxide": {
      "command": "crgx",
      "args": ["pdf_oxide_mcp@latest"]
    }
  }
}

Alternative Installation

If you prefer not to use crgx, you can install pdf_oxide_mcp directly:

Homebrew (macOS / Linux)

brew install yfedoseev/tap/pdf-oxide    # includes pdf-oxide-mcp

Cargo

cargo install pdf_oxide_mcp

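Then reference the installed binary in your config. The example below assumes pdf-oxide-mcp is on your PATH; if it is not, use the full path to the binary as the command: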

{
  "mcpServers": {
    "pdf-oxide": {
      "command": "pdf-oxide-mcp"
    }
  }
}

Available Tools

`extract`

Extract text, markdown, or HTML from a PDF file.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `file_path` | string | Yes | Path to the PDF file |
| `output_path` | string | Yes | Path to write extracted content |
| `format` | string | No | `"text"` (default), `"markdown"`, or `"html"` |
| `pages` | string | No | Page range, e.g. `"1-3,7,10-12"` |
| `password` | string | No | Password for encrypted PDFs |
| `images` | boolean | No | Extract images to files alongside output |
| `embed_images` | boolean | No | Embed images as base64 in markdown/html (default: true) |
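A client that builds these arguments itself may want to check them before sending. The sketch below is a client-side illustration only: parse_page_range and build_extract_arguments are hypothetical helper names, and the assumption that pages are numbered from 1 and that ranges are inclusive is inferred from the example `"1-3,7,10-12"`, not confirmed by the server's documentation.

```python
from typing import Optional

ALLOWED_FORMATS = {"text", "markdown", "html"}


def parse_page_range(spec: str) -> list[int]:
    """Expand a spec like "1-3,7,10-12" into a sorted list of page numbers.

    Assumes 1-based, inclusive ranges (an assumption, see the lead-in).
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)  # raises ValueError on empty or non-numeric parts
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


def build_extract_arguments(
    file_path: str,
    output_path: str,
    output_format: str = "text",
    pages: Optional[str] = None,
    password: Optional[str] = None,
    images: Optional[bool] = None,
    embed_images: Optional[bool] = None,
) -> dict:
    """Build the arguments object for the extract tool, omitting unset options."""
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {sorted(ALLOWED_FORMATS)}")
    if pages is not None:
        parse_page_range(pages)  # validate only; the server receives the string

    arguments = {"file_path": file_path, "output_path": output_path, "format": output_format}
    optional = {"pages": pages, "password": password, "images": images, "embed_images": embed_images}
    arguments.update({key: value for key, value in optional.items() if value is not None})
    return arguments
```

Leaving unset options out of the dictionary lets the server apply its own defaults, such as embed_images being true.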

How It Works

The MCP server communicates over stdio using JSON-RPC 2.0. When an AI assistant needs to read a PDF, it sends a tools/call request; the server writes the extracted content to output_path and returns a short confirmation.

All processing happens locally using the same Rust extraction engine as the library and CLI — no data is sent to external services.
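To make the stdio transport concrete, here is a rough sketch of the messages a client would write to the server's stdin. It follows the general MCP convention of one JSON message per line, preceded by an initialize handshake. The protocol version string, the client name, and the helper names are illustrative assumptions; whether pdf-oxide-mcp requires exactly this handshake is not stated here.

```python
import json


def make_request(request_id: int, method: str, params: dict) -> dict:
    """Build a JSON-RPC 2.0 request (expects a response with the same id)."""
    return {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}


def make_notification(method: str) -> dict:
    """Build a JSON-RPC 2.0 notification (no id, so no response is expected)."""
    return {"jsonrpc": "2.0", "method": method}


def frame(message: dict) -> bytes:
    """Serialize one message as a single newline-terminated line of JSON."""
    # json.dumps escapes newlines inside strings, so the line stays intact.
    return (json.dumps(message, separators=(",", ":")) + "\n").encode("utf-8")


def session_messages(arguments: dict) -> list[dict]:
    """Messages for a minimal session: handshake, then one extract call."""
    return [
        make_request(0, "initialize", {
            "protocolVersion": "2024-11-05",  # assumed version string
            "capabilities": {},
            "clientInfo": {"name": "example-client", "version": "0.1.0"},  # hypothetical
        }),
        make_notification("notifications/initialized"),
        make_request(1, "tools/call", {"name": "extract", "arguments": arguments}),
    ]
```

A real client would start the server as a subprocess (for example with the crgx command from the configuration above), write each framed message to its stdin, and read one line from its stdout per request. In practice the AI tool does all of this for you.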

Prompts You Can Give the Assistant

Once the MCP server is wired up, the assistant can call extract on its own. Prompts that work well:

“Pull the markdown of report.pdf into report.md.”
“Extract pages 4–8 of contract.pdf as HTML with images embedded, save to contract.html.”
“bank-statement.pdf is password-protected (pw: hunter2) — extract just the transactions table to text.”

Under the hood the assistant issues a JSON-RPC call like:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "extract",
    "arguments": {
      "file_path": "/path/report.pdf",
      "output_path": "/path/report.md",
      "format": "markdown",
      "pages": "4-8",
      "images": true,
      "embed_images": true
    }
  }
}

The server writes the result to output_path and returns a short confirmation — the assistant can then read that file back into its context.
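The read-back step might look like the sketch below. It assumes the response follows the usual MCP tool-result shape, with a `content` list of text items and an optional `isError` flag. The exact confirmation text the server returns is not documented here, so the code does not depend on it. The function name read_extract_result is hypothetical.

```python
import json
from pathlib import Path


def read_extract_result(response_line: str, output_path: str) -> str:
    """Check a tools/call response and return the content written to output_path."""
    response = json.loads(response_line)

    # Protocol-level failure: JSON-RPC error object.
    if "error" in response:
        error = response["error"]
        raise RuntimeError(f"JSON-RPC error {error.get('code')}: {error.get('message')}")

    # Tool-level failure: MCP results may set isError and explain in text items.
    result = response.get("result", {})
    if result.get("isError"):
        texts = [item.get("text", "") for item in result.get("content", [])
                 if item.get("type") == "text"]
        raise RuntimeError("extract failed: " + " ".join(texts))

    # Success: the extracted content is in the file, not in the response.
    return Path(output_path).read_text(encoding="utf-8")
```

Keeping the extracted text out of the response means a large PDF does not flood the assistant's context; the assistant can read only as much of the output file as it needs.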