What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

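Async PDF processing

PDF Oxide provides first-class async APIs in Python, C#, and Node.js, so extraction never blocks your event loop or HTTP request handlers. The same PdfDocument methods are exposed as async wrappers that dispatch each call to a background thread or thread pool.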

Python — AsyncPdfDocument

Every method on PdfDocument has an awaitable counterpart on AsyncPdfDocument. Each call wraps the synchronous method in asyncio.to_thread().

Python

import asyncio
from pdf_oxide import AsyncPdfDocument

async def extract(path):
    doc = await AsyncPdfDocument.open(path)
    text = await doc.extract_text(0)
    return text

# Run the coroutine and show the extracted text instead of discarding it.
print(asyncio.run(extract("report.pdf")))
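The wrapping pattern named in this section can be pictured with a small stand-in. The sketch below is not pdf_oxide's implementation: `SyncDocument` and `AsyncDocument` are hypothetical names, used only to show how `asyncio.to_thread()` turns a blocking method into an awaitable one.

```python
import asyncio


class SyncDocument:
    """Hypothetical stand-in for a blocking document class (not pdf_oxide)."""

    def __init__(self, pages):
        self._pages = pages

    def extract_text(self, index):
        # Pretend this is a blocking, CPU-heavy extraction call.
        return self._pages[index]


class AsyncDocument:
    """Exposes the methods of the wrapped object as awaitables."""

    def __init__(self, inner):
        self._inner = inner

    def __getattr__(self, name):
        # Only reached for names not found on AsyncDocument itself.
        method = getattr(self._inner, name)

        async def call(*args, **kwargs):
            # Run the blocking method on a worker thread of the default executor.
            return await asyncio.to_thread(method, *args, **kwargs)

        return call


async def demo():
    doc = AsyncDocument(SyncDocument(["first page", "second page"]))
    return await doc.extract_text(1)


result = asyncio.run(demo())
print(result)
```

Because the blocking work happens on a worker thread, the event loop stays free to serve other tasks while the call is in flight.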

Parallel fan-out across pages

Python

import asyncio
from pdf_oxide import AsyncPdfDocument

async def extract_all(path):
    doc = await AsyncPdfDocument.open(path)
    page_count = await doc.page_count()
    pages = await asyncio.gather(*[doc.extract_text(i) for i in range(page_count)])
    return pages
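For documents with many pages, it may be worth capping how many extractions are in flight at once. The following is a minimal sketch under stated assumptions: `extract_page` is a hypothetical stand-in for `await doc.extract_text(i)`, and the limit of 3 is arbitrary. It uses only `asyncio.Semaphore` and `asyncio.gather` from the standard library.

```python
import asyncio


async def extract_page(index):
    # Hypothetical stand-in for `await doc.extract_text(index)`.
    await asyncio.sleep(0)
    return f"text of page {index}"


async def extract_bounded(page_count, limit=4):
    semaphore = asyncio.Semaphore(limit)

    async def worker(index):
        async with semaphore:  # at most `limit` extractions run at once
            return await extract_page(index)

    # gather() returns results in argument order, so pages stay in order.
    return await asyncio.gather(*(worker(i) for i in range(page_count)))


pages = asyncio.run(extract_bounded(10, limit=3))
print(len(pages))
```

The results come back in page order regardless of which worker finishes first, so the caller can index the list by page number.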

Creation

AsyncPdf mirrors the Pdf class for creation workflows:

Python

import asyncio
from pdf_oxide import AsyncPdf

async def main():
    pdf = await AsyncPdf.from_markdown("# Hello")
    await pdf.save_async("out.pdf")

asyncio.run(main())

Office conversion

AsyncOfficeConverter handles DOCX / XLSX / PPTX → PDF conversion asynchronously.

Python

import asyncio
from pdf_oxide import AsyncOfficeConverter

async def main():
    converter = AsyncOfficeConverter()
    return await converter.docx_to_pdf_bytes("input.docx")

pdf_bytes = asyncio.run(main())

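Free-threaded Python (cp314t)

The pdf_oxide extension module declares gil_used = false, so it is safe to use under cp314t (the free-threaded build of Python 3.14). Multiple threads can call PdfDocument methods in parallel without being serialized by the GIL.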

Python

from concurrent.futures import ThreadPoolExecutor
from pdf_oxide import PdfDocument

doc = PdfDocument("large.pdf")

with ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(doc.extract_text, range(doc.page_count())))
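On a free-threaded build, CPython re-enables the GIL at runtime if an imported extension module does not declare that it can run without it, so it can be useful to confirm what the interpreter is actually doing. The sketch below uses only the standard library; `free_threading_status` is a hypothetical helper name, and in real use you would call it after `import pdf_oxide`.

```python
import sys
import sysconfig


def free_threading_status():
    """Return (build_is_free_threaded, gil_currently_enabled)."""
    # Py_GIL_DISABLED is 1 on free-threaded builds, and 0 or None otherwise.
    build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on Python 3.13+; older interpreters always run with the GIL.
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = True if check is None else bool(check())
    return build, gil_enabled


build, gil_enabled = free_threading_status()
print(f"free-threaded build: {build}, GIL enabled: {gil_enabled}")
```

If the build is free-threaded but the GIL is reported as enabled, some imported extension has probably forced it back on, and the thread pool above will not run extractions in parallel.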

C# — `async Task<T>` and `CancellationToken`

Every extraction and save method has an *Async variant that returns Task<T> and accepts an optional CancellationToken.

using PdfOxide.Core;

using var doc = PdfDocument.Open("report.pdf");

string text = await doc.ExtractTextAsync(0);

using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
var tasks = Enumerable.Range(0, doc.PageCount)
    .Select(i => doc.ExtractTextAsync(i, cts.Token));
string[] pages = await Task.WhenAll(tasks);

Editor

using var editor = DocumentEditor.Open("form.pdf");
editor.SetFormFieldValue("name", "Jane Doe");
await editor.SaveAsync("filled.pdf");

Node.js — `*Async` methods

Every synchronous method has an async counterpart: extractText → extractTextAsync, save → saveAsync, and so on. Async calls run on the libuv thread pool.

Node.js

const { PdfDocument } = require("pdf-oxide");

async function extractAll(path) {
  const doc = new PdfDocument(path);
  try {
    const pageCount = doc.getPageCount();
    const pages = await Promise.all(
      Array.from({ length: pageCount }, (_, i) => doc.extractTextAsync(i))
    );
    return pages;
  } finally {
    doc.close();
  }
}

HTTP handler example

Node.js

import express from "express";
import { PdfDocument } from "pdf-oxide";

const app = express();

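app.post("/extract", express.raw({ type: "application/pdf", limit: "50mb" }), async (req, res, next) => {
  let doc;
  try {
    doc = PdfDocument.openFromBytes(req.body);
    const text = await doc.extractTextAsync(0);
    res.json({ text });
  } catch (err) {
    // Forward failures to Express; version 4 does not catch rejected promises from async handlers.
    next(err);
  } finally {
    // Close the document only if it was opened successfully.
    doc?.close();
  }
});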

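Concurrency model

Language	Mechanism
Python	`asyncio.to_thread` dispatches the synchronous call to the default executor
Python (cp314t)	True thread parallelism; the GIL is optional
C#	Task-based asynchronous pattern, dispatched to the ThreadPool
Node.js	libuv worker threads (`Napi::AsyncWorker`)
Go	No async needed; goroutines call the synchronous methods directly
Rust	Not provided; use `tokio::task::spawn_blocking` or an executor of your choice

For sharing a single PdfDocument across multiple threads, see the concurrency guide.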
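One consequence of thread-based dispatch in Python is that a timeout abandons the await without stopping the worker thread. The sketch below illustrates this with the standard library only: `slow_extract` is a hypothetical stand-in for a blocking extraction call, and the 0.05 s timeout is arbitrary.

```python
import asyncio
import time

finished = []  # records whether the worker thread ran to completion


def slow_extract():
    # Hypothetical stand-in for a blocking extraction call.
    time.sleep(0.2)
    finished.append(True)
    return "page text"


async def extract_with_timeout(timeout):
    try:
        return await asyncio.wait_for(asyncio.to_thread(slow_extract), timeout)
    except asyncio.TimeoutError:
        # The await is abandoned, but the worker thread keeps running.
        return None


result = asyncio.run(extract_with_timeout(0.05))
print(result, finished)
```

`asyncio.run()` waits for the default executor to shut down before returning, so the abandoned call still completes in the background. A timeout therefore bounds how long the caller waits, not how much work is done.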
