What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

PDF Oxide 上手指南（PHP）

PDF Oxide 是文本提取速度最快的 PHP PDF 库 —— 平均耗时 0.8ms，在 3,830 个 PDF 上通过率 100%。一个库即可完成 PDF 的提取、转换和创建，底层与 Python、Node、Go、C#、Ruby 和 Java 绑定共用同一套 Rust 内核。

安装

composer require oxide/pdf-oxide

Composer 的安装后钩子会根据你的平台（linux-x86_64、linux-aarch64、darwin-x86_64、darwin-arm64、windows-x64）将对应的预编译原生库下载到 vendor/oxide/pdf-oxide/lib/。

环境要求： PHP 8.2+（8.2、8.3、8.4、8.5），并启用 ext-ffi。可用 php -m | grep -i ffi 确认。部分托管主机会禁用 ext-ffi；如遇此情况，请改用 php:8.3-cli 之类的 Docker 镜像。

打开 PDF

使用 PdfDocument::open() 加载文件并查看其元数据。

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('research-paper.pdf');
echo $doc->pageCount(), " pages\n";

$version = $doc->version();   // ['major' => int, 'minor' => int]
printf("PDF version: %d.%d\n", $version['major'], $version['minor']);

$doc->close();   // or rely on __destruct()

文本提取

单页

按从零开始的索引提取任意一页的纯文本。

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('report.pdf');
echo $doc->extractText(0);
$doc->close();

全部页面

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('book.pdf');
for ($i = 0; $i < $doc->pageCount(); $i++) {
    echo "--- Page " . ($i + 1) . " ---\n";
    echo $doc->extractText($i), "\n";
}
$doc->close();

一次性提取

extractTextOnce() 是一个静态辅助方法，一次调用即可完成打开文件、提取第 0 页和关闭文件。

use PdfOxide\PdfDocument;

echo PdfDocument::extractTextOnce('report.pdf');

自动路由提取

当存在原生文本时，extractTextAuto() 会返回该文本，并在必要时平滑回退到任何可恢复的文本 —— 它在回退路径上绝不会抛出异常。

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('mixed.pdf');
echo $doc->extractTextAuto(0);
$doc->close();

页面 API

pages() 返回一个 PdfPage 视图数组，pagesIter() 则会连同索引一起惰性地逐个生成它们。每个页面都会将提取操作委托回文档。

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('paper.pdf');

foreach ($doc->pagesIter() as $index => $page) {
    echo "Page {$index}:\n";
    echo $page->text(), "\n";
}

// Or grab a single page directly:
$page = $doc->page(0);
echo $page->toMarkdown();

$doc->close();

PdfPage 的方法：index()、parent()、text()、textAuto()、toMarkdown()、toHtml()。

Markdown 与 HTML 转换

将单页或整个文档转换为 Markdown，或将某一页渲染为 HTML。

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('paper.pdf');

echo $doc->toMarkdown(0);     // one page (defaults to page 0)
echo $doc->toMarkdownAll();   // entire document
echo $doc->toHtml(0);         // one page as HTML

$doc->close();

静态类 MarkdownConverter 提供相同的转换功能，无需在文档调用处携带页面索引。

use PdfOxide\PdfDocument;
use PdfOxide\MarkdownConverter;

$doc = PdfDocument::open('paper.pdf');

echo MarkdownConverter::toMarkdown($doc, 0);
echo MarkdownConverter::toMarkdownAll($doc);
echo MarkdownConverter::toHtml($doc, 0);
echo MarkdownConverter::toPlainText($doc, 0);

$doc->close();

结构化提取

extractStructured() 以关联数组的形式返回某一页的版面感知视图 —— 包含各区域的类型、文本、边界框和列索引。

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('paper.pdf');
$structured = $doc->extractStructured(0);

printf("Page %d: %.0f x %.0f\n",
    $structured['page_index'],
    $structured['page_width'],
    $structured['page_height']);

foreach ($structured['regions'] as $region) {
    echo "[{$region['kind']}] {$region['text']}\n";
}

$doc->close();

从字节流打开

从内存中的字符串打开 PDF —— 在从 S3、HTTP 或数据库获取数据时非常有用。

use PdfOxide\PdfDocument;

$bytes = file_get_contents('report.pdf');
$doc = PdfDocument::openBytes($bytes);
echo $doc->extractText(0);
$doc->close();

检查文档特性

在处理文档之前先对其进行探测。

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('form.pdf');

var_dump($doc->hasStructureTree());   // tagged PDF?
var_dump($doc->hasFormFields());      // AcroForm fields?
var_dump($doc->hasSignatures());      // digital signatures?

$doc->close();

创建 PDF

Pdf 类提供了一系列工厂方法，可从 Markdown、HTML 或纯文本构建 PDF。

use PdfOxide\Pdf;

$pdf = Pdf::fromMarkdown("# Invoice\n\n**Total:** \$42.00\n");
$pdf->saveTo('invoice.pdf');           // write to a path
$pdf->close();

$pdf = Pdf::fromHtml('<h1>Report</h1><p>Quarterly figures.</p>');
$bytes = $pdf->save();                 // or get the raw bytes
file_put_contents('report.pdf', $bytes);
$pdf->close();

$pdf = Pdf::fromText("Plain text document.\n\nSecond paragraph.");
$pdf->saveTo('notes.pdf');
$pdf->close();

后续步骤

Python 上手指南 —— 在 Python 中使用 PDF Oxide
Rust 上手指南 —— 在 Rust 中使用 PDF Oxide
文本提取 —— 详细的提取选项与实用技巧
创建 PDF —— 带元数据和样式的高级创建
编辑 —— 修改现有 PDF、注释与表单字段