What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with a mean text extraction time of 0.8ms: 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with a 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

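PDF Oxide MCP Server: PDF Extraction for AI Assistants

pdf-oxide-mcp is a Model Context Protocol server that lets AI assistants read PDF content. All processing happens locally, and your files never leave your machine.

Install crgx (one time only)

crgx is an npx-like runner for Rust binaries. It downloads pdf_oxide_mcp automatically on first run, so you do not need to install the MCP server by hand.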

Linux / macOS

curl -fsSL crgx.dev/install.sh | sh

Windows (PowerShell)

irm crgx.dev/install.ps1 | iex

Configuration

Once crgx is installed, add the configuration below to the AI tool you use. crgx handles downloading and upgrading pdf_oxide_mcp automatically.

Claude Desktop

Add to ~/.config/claude/claude_desktop_config.json (Linux) or ~/Library/Application Support/Claude/claude_desktop_config.json (macOS):

{
  "mcpServers": {
    "pdf-oxide": {
      "command": "crgx",
      "args": ["pdf_oxide_mcp@latest"]
    }
  }
}
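If the config file already lists other MCP servers, pasting the block above over it would drop them. The following is a minimal Python sketch of merging the entry instead, assuming the file is plain JSON with a top-level mcpServers object. The helper name add_mcp_server is hypothetical and is not part of PDF Oxide or crgx.

```python
import json
from pathlib import Path


def add_mcp_server(config_path, name, command, args=None):
    """Add or update one entry under "mcpServers", leaving other keys untouched.

    Hypothetical helper for illustration; assumes the config is plain JSON.
    """
    path = Path(config_path)
    # Start from the existing config if there is one, otherwise from an empty object.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}

    entry = {"command": command}
    if args:
        entry["args"] = list(args)

    # Only the named server entry is replaced; sibling servers are preserved.
    config.setdefault("mcpServers", {})[name] = entry

    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, add_mcp_server(Path.home() / ".config/claude/claude_desktop_config.json", "pdf-oxide", "crgx", ["pdf_oxide_mcp@latest"]) would write the same entry shown above while keeping any servers already configured.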

Claude Code

Add to your project's .claude/settings.json:

{
  "mcpServers": {
    "pdf-oxide": {
      "command": "crgx",
      "args": ["pdf_oxide_mcp@latest"]
    }
  }
}

Cursor

Add to Cursor's MCP settings:

{
  "mcpServers": {
    "pdf-oxide": {
      "command": "crgx",
      "args": ["pdf_oxide_mcp@latest"]
    }
  }
}

Other installation methods

If you would rather not use crgx, you can install pdf_oxide_mcp directly:

Homebrew (macOS / Linux)

brew install yfedoseev/tap/pdf-oxide    # includes pdf-oxide-mcp

Cargo

cargo install pdf_oxide_mcp

Then point the configuration directly at the binary:

{
  "mcpServers": {
    "pdf-oxide": {
      "command": "pdf-oxide-mcp"
    }
  }
}

Available tools

`extract`

Extracts text, Markdown, or HTML from a PDF file.

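Parameter	Type	Required	Description
`file_path`	string	Yes	Path to the PDF file
`output_path`	string	Yes	Path where the extracted output is written
`format`	string	No	`"text"` (default), `"markdown"`, or `"html"`
`pages`	string	No	Page range, for example `"1-3,7,10-12"`
`password`	string	No	Password for an encrypted PDF
`images`	boolean	No	Save images as separate files next to the output
`embed_images`	boolean	No	Embed images as base64 in markdown/html output (default: true)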
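To make the `pages` syntax concrete, here is a small Python sketch of how a string such as "1-3,7,10-12" could expand into individual page numbers. It assumes 1-based page numbers and inclusive ranges, which the table does not state explicitly, and expand_pages is a hypothetical helper rather than the server's actual parser.

```python
def expand_pages(spec):
    """Expand a page-range string into a list of page numbers.

    Assumption (not confirmed by the docs): pages are 1-based and
    ranges such as "10-12" include both endpoints.
    """
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages


print(expand_pages("1-3,7,10-12"))  # [1, 2, 3, 7, 10, 11, 12]
```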

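How it works

The MCP server communicates over stdio using JSON-RPC 2.0. When the assistant needs to read a PDF, it sends a tools/call request; the server runs the extraction, writes the output to output_path, and replies with a short confirmation.

All processing happens locally, using the same Rust extraction engine as the library and the CLI; no data is sent to any external service.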
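As a rough illustration of the client side, the Python sketch below builds one tools/call request and serializes it as a single line of JSON, which is how MCP's stdio transport delimits messages. The function name tools_call is hypothetical, and a real client would also complete the MCP initialize handshake before sending any tool calls.

```python
import json
from itertools import count

# Each JSON-RPC request needs an id so the reply can be matched to it.
_request_ids = count(1)


def tools_call(name, arguments):
    """Serialize one JSON-RPC 2.0 tools/call request as a single line.

    Hypothetical helper: it only builds the message and does not
    start or talk to the server process.
    """
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    # json.dumps without indent emits no raw newlines, so appending one
    # newline marks the end of the message on the stdio stream.
    return json.dumps(request, ensure_ascii=False) + "\n"


line = tools_call("extract", {
    "file_path": "/path/report.pdf",
    "output_path": "/path/report.md",
    "format": "markdown",
})
```

The resulting line would be written to the server's standard input, and the reply read back as one line from its standard output.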

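Prompts you can give your assistant

Once the MCP server is connected, the assistant calls extract on its own. A few prompts that work well:

“Write report.pdf as Markdown to report.md.”
“Export pages 4–8 of contract.pdf as HTML with the images embedded, and save it to contract.html.”
“bank-statement.pdf is password-protected (pw: hunter2). Extract only the transactions table as plain text.”

Under the hood, the assistant issues a JSON-RPC call like this: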

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "extract",
    "arguments": {
      "file_path": "/path/report.pdf",
      "output_path": "/path/report.md",
      "format": "markdown",
      "pages": "4-8",
      "images": true,
      "embed_images": true
    }
  }
}

The server writes the result to output_path and returns a short confirmation message; the assistant then reads that file back into its own context.
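The last step can be sketched in Python as follows. The response shape assumed here (a result object with a content list of text items and an optional isError flag) follows the general MCP convention for tool results; the exact wording of the confirmation message is not documented above, and read_extracted is a hypothetical helper.

```python
from pathlib import Path


def read_extracted(response, output_path):
    """Check a tools/call reply, then load the file the server wrote.

    Assumes the generic MCP tool-result shape; returns a tuple of
    (confirmation text, extracted file contents).
    """
    # Protocol-level failure: JSON-RPC reports it under "error".
    if "error" in response:
        raise RuntimeError(response["error"].get("message", "JSON-RPC error"))

    result = response.get("result", {})
    note = "\n".join(
        item.get("text", "")
        for item in result.get("content", [])
        if item.get("type") == "text"
    )
    # Tool-level failure: the call succeeded but the tool flagged an error.
    if result.get("isError"):
        raise RuntimeError(note or "tool reported an error")

    # The extracted content lives in the output file, not in the reply.
    return note, Path(output_path).read_text(encoding="utf-8")
```

Keeping the reply small and reading the file separately means a client can decide how much of a large extraction to load into context.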