What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

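PHP API Reference

PDF Oxide provides pure-PHP bindings built on PHP's built-in FFI extension. The PHP code is only a thin wrapper around the libpdf_oxide cdylib, the same underlying library shared by the Python, Node, Go, C#, Ruby, and Java bindings.

composer require oxide/pdf-oxide

All classes live in the PdfOxide\ namespace. PHP 8.2+ with ext-ffi enabled is required. Composer's post-install hook downloads the matching prebuilt native library; set PDF_OXIDE_CDYLIB_PATH to point to a custom build, or set PDF_OXIDE_SKIP_DOWNLOAD=1 to skip the download.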
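The runtime requirements above can be checked before the library is first used. The following is a minimal sketch in plain PHP; the function name `pdfOxidePreflight` and its messages are hypothetical and not part of PDF Oxide. It only inspects the PHP version, whether ext-ffi is loaded, and whether an optional PDF_OXIDE_CDYLIB_PATH override points to an existing file. It does not inspect the `ffi.enable` ini setting.

```php
<?php
declare(strict_types=1);

/**
 * Collects human-readable problems with the current PHP runtime.
 * The thresholds mirror the stated requirements: PHP 8.2+ and ext-ffi.
 * Hypothetical helper: the name and messages are illustrative only.
 */
function pdfOxidePreflight(int $phpVersionId, bool $ffiLoaded, string|false $cdylibPath): array
{
    $problems = [];
    if ($phpVersionId < 80200) {
        $problems[] = 'PHP 8.2 or newer is required';
    }
    if (!$ffiLoaded) {
        $problems[] = 'ext-ffi is not loaded';
    }
    // Only validate the override when it is actually set.
    if ($cdylibPath !== false && $cdylibPath !== '' && !is_file($cdylibPath)) {
        $problems[] = "PDF_OXIDE_CDYLIB_PATH does not point to a file: {$cdylibPath}";
    }
    return $problems;
}

// Report any problems found in the current process.
$problems = pdfOxidePreflight(
    PHP_VERSION_ID,
    extension_loaded('ffi'),
    getenv('PDF_OXIDE_CDYLIB_PATH'),
);
foreach ($problems as $problem) {
    fwrite(STDERR, $problem . PHP_EOL);
}
```

Passing the environment values in as arguments keeps the check easy to exercise in isolation, for example from a deployment smoke test.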

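For the Rust API, see the Rust API Reference. For the Python API, see the Python API Reference. For the JavaScript API, see the Node.js API Reference or the WASM API Reference.

The binding exposes 10 classes:

Class	Purpose
`PdfOxide\PdfDocument`	Open, read, and extract from PDFs, and iterate page by page
`PdfOxide\Pdf`	Create PDFs (Markdown to PDF, HTML to PDF), query the version, prefetch models
`PdfOxide\PdfPage`	Lightweight per-page view
`PdfOxide\MarkdownConverter`	Static Markdown / HTML / plain-text conversion
`PdfOxide\AutoExtractor`	Extraction with typed reasons, plus page classification
`PdfOxide\AutoExtractResult`	Read-only result value object returned by `AutoExtractor`
`PdfOxide\DocumentEditor`	Edit, scrub metadata, redact destructively, save
`PdfOxide\PdfSigner`	PAdES B-B / B-T / B-LT / B-LTA signing
`PdfOxide\PdfValidator`	PDF/A, PDF/UA, and PDF/X compliance checks
`PdfOxide\PdfPolicy`	Set-once, process-wide cryptographic governance policy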

PdfDocument

The core class for opening, reading, and extracting content from PDF files.

use PdfOxide\PdfDocument;

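Constructors / Factory Methods

PdfDocument::open(string $path): self

Opens a PDF from a file path.

PdfDocument::openBytes(string $bytes): self

Opens a PDF from an in-memory byte string (for example, data downloaded from S3 or received over HTTP).

PdfDocument::extractTextOnce(string $path): string

Opens the file, extracts all text, and releases the handle in a single call.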

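Document Info

$doc->pageCount(): int

Returns the number of pages in the document.

$doc->version(): array

Returns the PDF version as a [major, minor] array of integers.

$doc->hasStructureTree(): bool

Checks whether the document is a Tagged PDF with a structure tree.

$doc->hasFormFields(): bool

Checks whether the document contains AcroForm form fields.

$doc->hasSignatures(): bool

Checks whether the document contains digital signatures.

$doc->getSourcePath(): ?string

Returns the file path the document was opened from, or null for documents loaded from bytes.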

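Text Extraction

$doc->extractText(int $pageIndex): string

Extracts plain text from a single page, identified by its zero-based index.

$doc->extractStructured(int $page): array

Extracts structured page content (spans, characters, and dimensions) as an associative array.

$doc->extractTextAuto(int $pageIndex): string

Extracts text from a page, automatically choosing between native text and OCR.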

Conversion

$doc->toMarkdown(int $pageIndex = 0): string

Converts a single page to Markdown.

$doc->toMarkdownAll(): string

Converts the entire document to Markdown.

$doc->toHtml(int $pageIndex = 0): string

Converts a single page to HTML.

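Page Access

$doc->page(int $index): PdfPage

Returns a lightweight PdfPage view for the given zero-based index.

$doc->pages(): array

Returns all pages as an array of PdfPage objects.

$doc->pagesIter(): \Generator

Iterates over pages lazily, yielding one PdfPage at a time.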

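Lifecycle

$doc->isOpen(): bool

Returns whether the native handle is still open.

$doc->close(): void

Releases the native handle. __destruct() also calls it automatically.

$doc->getHandle(): CData

Returns the raw FFI handle (for advanced / interop scenarios).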
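As an illustration of lazy page access combined with explicit lifecycle handling, here is a minimal sketch. It relies only on methods documented on this page (`PdfDocument::open()`, `pagesIter()`, `close()`, and the `index()` and `text()` methods of `PdfPage`); the helper names `pageTextLengths` and `measureFile` are hypothetical.

```php
<?php
declare(strict_types=1);

use PdfOxide\PdfDocument;

/**
 * Maps each page index to the byte length of its plain text.
 * Accepts any iterable of objects exposing index() and text(),
 * which is the PdfPage surface documented for this binding.
 */
function pageTextLengths(iterable $pages): array
{
    $lengths = [];
    foreach ($pages as $page) {
        $lengths[$page->index()] = strlen($page->text());
    }
    return $lengths;
}

/** Opens a file, measures every page lazily, and always releases the handle. */
function measureFile(string $path): array
{
    $doc = PdfDocument::open($path);
    try {
        // pagesIter() yields one page at a time instead of building an array.
        return pageTextLengths($doc->pagesIter());
    } finally {
        $doc->close();
    }
}
```

The `finally` block releases the native handle even when extraction throws, rather than waiting for `__destruct()` to run.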

Pdf

Creates new PDFs from source content and provides access to library-level helpers.

use PdfOxide\Pdf;

Factory Methods

Pdf::fromMarkdown(string $markdown): self

Creates a PDF from Markdown content.

Pdf::fromHtml(string $html): self

Creates a PDF from HTML content.

Pdf::fromText(string $text): self

Creates a PDF from plain text.

Saving

$pdf->save(): string

Renders the PDF and returns its bytes as a string.

$pdf->saveTo(string $path): void

Renders the PDF and writes it to a file path.

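Library Helpers

Pdf::version(): string

Returns the version string of the underlying native library.

Pdf::prefetchAvailable(): bool

Returns whether the current build supports OCR model prefetching.

Pdf::prefetchModels(array $languages): string

Prefetches OCR models for the given language codes; returns a status string.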

Pdf::VERSION  // string constant, e.g. "0.3.69"

Lifecycle

$pdf->isOpen(): bool
$pdf->close(): void
$pdf->getHandle(): CData

Inspect, release, or access the raw native handle.

PdfPage

A lightweight per-page view returned by PdfDocument::page(), pages(), or pagesIter().

use PdfOxide\PdfPage;

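Methods

$page->parent(): PdfDocument

Returns the owning PdfDocument.

$page->index(): int

Returns the zero-based page index.

$page->text(): string

Extracts plain text from this page.

$page->textAuto(): string

Extracts text from this page, automatically choosing between native text and OCR.

$page->toMarkdown(): string

Converts this page to Markdown.

$page->toHtml(): string

Converts this page to HTML.

$page->__toString(): string

Returns the page's plain text when the object is used in a string context.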

MarkdownConverter

Static helpers for converting an open PdfDocument to Markdown, HTML, or plain text.

use PdfOxide\MarkdownConverter;

Methods

MarkdownConverter::toMarkdown(PdfDocument $doc, int $pageIndex): string

Converts a single page to Markdown.

MarkdownConverter::toMarkdownAll(PdfDocument $doc): string

Converts the entire document to Markdown.

MarkdownConverter::toHtml(PdfDocument $doc, int $pageIndex): string

Converts a single page to HTML.

MarkdownConverter::toPlainText(PdfDocument $doc, int $pageIndex): string

Converts a single page to plain text.

AutoExtractor

Extraction with typed reasons, supporting automatic selection between native text and OCR as well as per-page document classification.

use PdfOxide\AutoExtractor;

Constants

Constant	Value	Meaning
`AutoExtractor::MODE_AUTO`	`0`	Automatically choose between native text and OCR
`AutoExtractor::MODE_TEXT_ONLY`	`1`	Native text only; never use OCR
`AutoExtractor::MODE_FORCE_OCR`	`2`	Always run OCR

Factory Methods

AutoExtractor::of(PdfDocument $doc, int $mode = self::MODE_AUTO): self

Creates an extractor over a document with an explicit mode.

AutoExtractor::fast(PdfDocument $doc): self

Creates an extractor tuned for speed.

AutoExtractor::balanced(PdfDocument $doc): self

Creates an extractor that balances speed and fidelity.

AutoExtractor::highFidelity(PdfDocument $doc): self

Creates an extractor tuned for maximum fidelity.

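Extraction

$ex->extractText(): string

Extracts text from the entire document.

$ex->extractTextForPage(int $pageIndex): string

Extracts text from a single page.

$ex->extractAutoPage(int $pageIndex): AutoExtractResult

Extracts a single page, returning a typed AutoExtractResult with a reason and a confidence score.

$ex->extractAutoDocument(): AutoExtractResult

Extracts the entire document, returning a typed AutoExtractResult.

$ex->extractPageJson(int $pageIndex): string

Extracts a single page as a JSON string.

$ex->extractDocumentJson(): string

Extracts the entire document as a JSON string.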

Classification

$ex->classifyPageKind(int $pageIndex): string

Classifies a single page (for example, text_layer, scanned, image_text, mixed, or empty).

$ex->classifyDocumentKinds(): array

Classifies every page, returning an array of page-kind strings.
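One way to use the classification output is to pick out the pages that probably need OCR before running a heavier extraction. The sketch below is plain PHP over the array of kind strings; the function name `ocrCandidatePages` is hypothetical, it assumes the array is in page order, and treating `mixed` pages as OCR candidates is a policy choice of this example rather than documented library behavior.

```php
<?php
declare(strict_types=1);

/**
 * Returns the zero-based indices of pages whose kind suggests OCR may help.
 * $kinds is expected to hold one kind string per page, in page order
 * (an assumption about classifyDocumentKinds() made by this sketch).
 */
function ocrCandidatePages(array $kinds): array
{
    // Which kinds count as OCR candidates is this example's own policy.
    $needsOcr = ['scanned', 'image_text', 'mixed'];
    $indices = [];
    foreach (array_values($kinds) as $index => $kind) {
        if (in_array($kind, $needsOcr, true)) {
            $indices[] = $index;
        }
    }
    return $indices;
}
```

With an extractor `$ex`, the result of `$ex->classifyDocumentKinds()` could be passed straight to this helper.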

Accessors

$ex->document(): PdfDocument

Returns the underlying PdfDocument.

$ex->mode(): int

Returns the extraction mode constant currently in effect.

AutoExtractResult

A read-only value object returned by AutoExtractor::extractAutoPage() and extractAutoDocument().

use PdfOxide\AutoExtractResult;

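Properties

Property	Type	Description
`text`	`string`	The extracted text
`reason`	`string`	One of the `REASON_*` constants
`confidence`	`float`	Confidence in the range `[0.0, 1.0]`
`ocrUsed`	`bool`	Whether OCR was run for this result
`regions`	`array`	Detailed per-region information
`pagesNeedingOcr`	`array`	Indices of pages that still need OCR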

Constants

Reason constant	Value
`REASON_OK`	`ok`
`REASON_NATIVE_TEXT_HIGH_CONFIDENCE`	`native_text_high_confidence`
`REASON_NO_TEXT_LAYER_PRESENT`	`no_text_layer_present`
`REASON_OCR_REQUESTED_BUT_UNAVAILABLE`	`ocr_requested_but_unavailable`
`REASON_OCR_LOW_CONFIDENCE_FALLBACK`	`ocr_low_confidence_fallback`
`REASON_IMAGE_TABLE_RECONSTRUCTED`	`image_table_reconstructed`
`REASON_EMPTY`	`empty`

Kind constant	Value
`KIND_TEXT_LAYER`	`text_layer`
`KIND_SCANNED`	`scanned`
`KIND_IMAGE_TEXT`	`image_text`
`KIND_MIXED`	`mixed`
`KIND_EMPTY`	`empty`

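Methods

$result->isOk(): bool

Returns whether extraction succeeded without degradation.

$result->isOcrFallback(): bool

Returns whether the OCR-unavailable or low-confidence fallback path was triggered.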
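Callers that need finer control than `isOk()` and `isOcrFallback()` can branch on the `reason` string directly. The following sketch maps each documented reason value to a follow-up action. The function name `triageReason`, the action labels, and the grouping of reasons are all assumptions of this example; the library does not prescribe them.

```php
<?php
declare(strict_types=1);

/**
 * Maps an AutoExtractResult reason string to a follow-up action label.
 * The reason values come from the REASON_* table; the grouping into
 * accept / review / needs_ocr / skip is illustrative policy only.
 */
function triageReason(string $reason): string
{
    return match ($reason) {
        'ok', 'native_text_high_confidence', 'image_table_reconstructed' => 'accept',
        'ocr_requested_but_unavailable', 'ocr_low_confidence_fallback' => 'review',
        'no_text_layer_present' => 'needs_ocr',
        'empty' => 'skip',
        // Unrecognized values are surfaced instead of silently accepted.
        default => 'unknown',
    };
}
```

Given a result object, `triageReason($result->reason)` returns the label, and the `default` arm keeps the code safe if new reason values appear in a later release.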

DocumentEditor

Edits a PDF in place: scrub metadata, apply destructive redactions, set the producer, and save.

use PdfOxide\DocumentEditor;

Constructor

DocumentEditor::open(string $path): self

Opens a PDF for editing.

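Redaction

$editor->addRedaction(int $pageIndex, float $x1, float $y1, float $x2, float $y2): self

Queues a rectangular redaction region (coordinates are in PDF points). Chainable: returns $this.

$editor->redactionCount(int $pageIndex): int

Returns the number of pending redactions on a page.

$editor->applyRedactionsDestructive(bool $scrubMetadata = true): int

Permanently applies all queued redactions (fails safe on error). Returns the number applied; optionally scrubs metadata.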
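Because redaction coordinates are given in PDF points, regions measured in other units need converting first (1 point is 1/72 inch). The sketch below converts millimetres to points and then queues, applies, and saves a single redaction using only the `DocumentEditor` methods documented here. The helper names `mmToPoints` and `redactRegionMm` are hypothetical, and the example makes no assumption about where the page origin lies.

```php
<?php
declare(strict_types=1);

use PdfOxide\DocumentEditor;

const POINTS_PER_INCH = 72.0;
const MM_PER_INCH = 25.4;

/** Converts a length in millimetres to PDF points (1 pt = 1/72 inch). */
function mmToPoints(float $mm): float
{
    return $mm * POINTS_PER_INCH / MM_PER_INCH;
}

/**
 * Queues one redaction given as [x1, y1, x2, y2] in millimetres,
 * applies it destructively, saves to $out, and returns the number applied.
 */
function redactRegionMm(string $in, string $out, int $pageIndex, array $rectMm): int
{
    [$x1, $y1, $x2, $y2] = array_map('mmToPoints', $rectMm);
    $editor = DocumentEditor::open($in);
    try {
        $editor->addRedaction($pageIndex, $x1, $y1, $x2, $y2);
        $applied = $editor->applyRedactionsDestructive();
        $editor->saveTo($out);
        return $applied;
    } finally {
        // Release the native handle even if applying or saving fails.
        $editor->close();
    }
}
```

Writing to a separate output path keeps the original file available in case the redacted region was chosen incorrectly.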

Metadata

$editor->scrubMetadata(): self

Removes document metadata. Chainable: returns $this.

$editor->getProducer(): string

Returns the document's Producer string.

$editor->setProducer(string $producer): self

Sets the document's Producer string. Chainable: returns $this.

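Document Info

$editor->version(): array

Returns the PDF version as a [major, minor] array.

$editor->pageCount(): int

Returns the number of pages.

$editor->isModified(): bool

Returns whether the document has unsaved modifications.

$editor->sourcePath(): string

Returns the source file path.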

Saving

$editor->saveTo(string $path): void

Writes the edited PDF to a file path.

$editor->save(): string

Renders the edited PDF and returns its bytes as a string.

Lifecycle

$editor->isOpen(): bool
$editor->close(): void
$editor->getHandle(): CData

Inspect, release, or access the raw native handle.

PdfSigner

PAdES digital signatures at the B-B, B-T, B-LT, and B-LTA conformance levels.

use PdfOxide\PdfSigner;

Constants

Constant	Value	Level
`PdfSigner::LEVEL_B_B`	`0`	PAdES B-B (baseline)
`PdfSigner::LEVEL_B_T`	`1`	PAdES B-T (with timestamp)
`PdfSigner::LEVEL_B_LT`	`2`	PAdES B-LT (long-term)
`PdfSigner::LEVEL_B_LTA`	`3`	PAdES B-LTA (long-term with archive timestamp)

Constructors / Factory Methods

PdfSigner::fromPkcs12(string $keystorePath, string $password): self

Creates a signer from a PKCS#12 (.p12 / .pfx) keystore.

Signing

$signer->sign(
    string $pdfBytes,
    string|int $level = self::LEVEL_B_B,
    ?string $tsaUrl = null,
    ?string $reason = null,
    ?string $location = null,
): string

Signs the PDF bytes and returns the signed PDF bytes. Levels above B-B require a tsaUrl. $level accepts either a LEVEL_B_* ordinal or a short label ('b', 't', 'lt', 'lta').
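Since `$level` may be an ordinal or a short label, and levels above B-B need a TSA URL, it can help to validate both before calling `sign()`. The sketch below is plain PHP and does not call the library. The helper names `padesOrdinal` and `assertTsaProvided` are hypothetical, and lowercasing the label is a choice made by this example, not documented library behavior.

```php
<?php
declare(strict_types=1);

/**
 * Normalizes a PAdES level given as an ordinal (0-3) or a short label
 * ('b', 't', 'lt', 'lta') to its ordinal, following the LEVEL_B_* values.
 */
function padesOrdinal(string|int $level): int
{
    $labels = ['b' => 0, 't' => 1, 'lt' => 2, 'lta' => 3];
    if (is_int($level)) {
        if ($level < 0 || $level > 3) {
            throw new InvalidArgumentException("Unknown PAdES level: {$level}");
        }
        return $level;
    }
    // Case-insensitive matching is this sketch's own convenience.
    $key = strtolower($level);
    if (!array_key_exists($key, $labels)) {
        throw new InvalidArgumentException("Unknown PAdES level: {$level}");
    }
    return $labels[$key];
}

/** Fails early when a level above B-B is requested without a TSA URL. */
function assertTsaProvided(string|int $level, ?string $tsaUrl): void
{
    if (padesOrdinal($level) > 0 && ($tsaUrl === null || $tsaUrl === '')) {
        throw new InvalidArgumentException('Levels above B-B require a tsaUrl');
    }
}
```

Calling `assertTsaProvided($level, $tsaUrl)` just before `$signer->sign(...)` turns a misconfiguration into an immediate, descriptive error on the PHP side.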

PdfSigner::signWithHandle(
    string $pdfBytes,
    CData $certificateHandle,
    string|int $level,
    ?string $tsaUrl = null,
    ?string $reason = null,
    ?string $location = null,
): string

Static convenience method: signs without constructing a managed signer instance. The caller retains ownership of $certificateHandle.

Verification

PdfSigner::verify(string $pdfBytes): bool

Returns whether the PDF bytes carry a signature dictionary and a byte range.

Lifecycle

$signer->isOpen(): bool
$signer->close(): void

Inspect or release the signing credentials.

PdfValidator

Static PDF/A, PDF/UA, and PDF/X compliance checks against an open PdfDocument.

use PdfOxide\PdfValidator;

Constants

PDF/A constant	Value	PDF/UA constant	Value
`PDFA_1B`	`0`	`PDFUA_1`	`1`
`PDFA_1A`	`1`	`PDFUA_2`	`2`
`PDFA_2B`	`2`
`PDFA_2A`	`3`
`PDFA_2U`	`4`
`PDFA_3B`	`5`
`PDFA_3A`	`6`
`PDFA_3U`	`7`

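Methods

PdfValidator::isPdfA(PdfDocument $doc, int $level = self::PDFA_1B): bool

Returns whether the document conforms to the given PDF/A level.

PdfValidator::isPdfUa(PdfDocument $doc, int $level = self::PDFUA_1): bool

Returns whether the document conforms to the given PDF/UA level.

PdfValidator::isPdfX(PdfDocument $doc): bool

Not yet implemented: pdf_oxide does not currently expose a public PDF/X validator in its C ABI. Calling this method always throws BadMethodCallException.

PdfValidator::validatePdfA(PdfDocument $doc, int $level = self::PDFA_1B): array

Runs full PDF/A validation and returns a structured result array (a compliance flag plus any violations).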

PdfPolicy

A set-once, process-wide cryptographic governance policy. It must be configured before any PDF is opened.

use PdfOxide\PdfPolicy;

Constants

Constant	Value
`PdfPolicy::COMPAT`	`compat`
`PdfPolicy::STRICT`	`strict`
`PdfPolicy::FIPS_STRICT`	`fips_strict`

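Methods

PdfPolicy::current(): string

Returns the policy mode currently in effect.

PdfPolicy::set(string $mode): void

Sets the process-wide policy (one of the policy constants). It can be set only once per process.

PdfPolicy::fipsAvailable(): bool

Returns whether a FIPS-validated cryptographic provider is available.

PdfPolicy::activeProvider(): string

Returns the name of the cryptographic provider currently in effect.

PdfPolicy::compat(): string
PdfPolicy::strict(): string
PdfPolicy::fipsStrict(): string

Convenience accessors that return the corresponding policy mode string.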

Error Handling

All PDF-related errors extend PdfOxide\Exceptions\PdfException. Specialized subclasses let you catch narrower categories of failure.

use PdfOxide\PdfDocument;
use PdfOxide\Exceptions\PdfException;

try {
    $doc = PdfDocument::open('file.pdf');
    echo $doc->extractText(0);
} catch (PdfException $e) {
    error_log("PDF error: {$e->getMessage()}");
}

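Exception (under `PdfOxide\Exceptions\`)	Cause
`PdfException`	Base class for all PDF errors
`ParseException`	Malformed or unparseable PDF
`IoException`	File read or write failure
`NotFoundException`	Missing file, page, or object
`EncryptionException`	Encryption / password failure
`ValidationException`	Invalid argument or input
`SignatureException`	Signing or signature verification failure
`RedactionException`	Redaction failure
`SearchException`	Search failure
`OptimizationException`	Optimization failure
`ComplianceException`	PDF/A or PDF/UA compliance failure
`AccessibilityException`	Accessibility / tagging failure
`UnsupportedException`	Unsupported feature or operation
`InvalidStateException`	Operation on a closed or invalid handle
`InternalError`	Internal native error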
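To act on specific failure categories, the narrower exception classes can be caught ahead of the `PdfException` base class, since PHP runs the first matching `catch` block. The sketch below uses only class and method names listed on this page; the function name `firstPageTextOrLabel` and the returned labels are hypothetical.

```php
<?php
declare(strict_types=1);

use PdfOxide\Exceptions\EncryptionException;
use PdfOxide\Exceptions\NotFoundException;
use PdfOxide\Exceptions\ParseException;
use PdfOxide\Exceptions\PdfException;
use PdfOxide\PdfDocument;

/**
 * Extracts the first page's text, or returns a short failure label.
 * Narrow exception classes are listed before the PdfException base class
 * so that the base class only handles the remaining cases.
 */
function firstPageTextOrLabel(string $path): string
{
    try {
        $doc = PdfDocument::open($path);
        try {
            return $doc->extractText(0);
        } finally {
            // Release the native handle whether or not extraction succeeded.
            $doc->close();
        }
    } catch (NotFoundException) {
        return '[missing file or page]';
    } catch (EncryptionException) {
        return '[encrypted or wrong password]';
    } catch (ParseException) {
        return '[malformed PDF]';
    } catch (PdfException $e) {
        return '[other PDF error: ' . $e->getMessage() . ']';
    }
}
```

Which categories deserve their own branch depends on the application; a batch pipeline might, for instance, retry on `IoException` but skip files that raise `ParseException`.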

Complete Example

use PdfOxide\PdfDocument;
use PdfOxide\Pdf;
use PdfOxide\AutoExtractor;
use PdfOxide\DocumentEditor;
use PdfOxide\PdfSigner;
use PdfOxide\PdfValidator;

// --- Extraction ---
$doc = PdfDocument::open('input.pdf');
echo $doc->pageCount(), " pages\n";

for ($i = 0; $i < $doc->pageCount(); $i++) {
    echo "Page {$i}: ", strlen($doc->extractText($i)), " chars\n";
}
echo $doc->toMarkdownAll();

// --- Auto-extraction with typed reasons ---
$ex = AutoExtractor::of($doc);
$result = $ex->extractAutoPage(0);
if (!$result->isOk()) {
    error_log("degraded extraction: {$result->reason}");
}
$doc->close();

// --- Creation ---
$pdf = Pdf::fromMarkdown("# Invoice\n\n**Total:** \$42.00\n");
$pdf->saveTo('invoice.pdf');
$pdf->close();

// --- Destructive redaction ---
$editor = DocumentEditor::open('in.pdf');
$editor->addRedaction(0, 100.0, 700.0, 300.0, 720.0);
$editor->applyRedactionsDestructive();
$editor->saveTo('redacted.pdf');
$editor->close();

// --- PAdES B-T signature ---
$signer = PdfSigner::fromPkcs12('certs/sign.p12', 'p12-password');
$signed = $signer->sign(
    pdfBytes: file_get_contents('contract.pdf'),
    level:    PdfSigner::LEVEL_B_T,
    tsaUrl:   'https://freetsa.org/tsr',
    reason:   'Final contract',
);
file_put_contents('signed.pdf', $signed);
$signer->close();

// --- Compliance check ---
$doc = PdfDocument::open('archive.pdf');
var_dump(PdfValidator::isPdfA($doc, PdfValidator::PDFA_2B));
$doc->close();

Other Language Bindings

PDF Oxide provides native bindings for all major ecosystems: Rust, Python, Node.js, WASM, C#, Golang, Java, Ruby, C++, Swift, Kotlin, Dart, R, Julia, Zig, Scala, Clojure, Objective-C, Elixir.

Next Steps

Types and Enums: all shared types and enums
Page API Reference: consistent per-page iteration across bindings
PHP Getting Started: tutorial