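Ruby API Reference

PDF Oxide provides native Ruby bindings (gem pdf_oxide) built with FFI on top of the cdylib C ABI. The gem bundles a precompiled native library and mirrors the 9-class structure of the Java bindings under the PdfOxide namespace.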

gem install pdf_oxide

require 'pdf_oxide'

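For the Rust API, see the Rust API Reference. For the Python API, see the Python API Reference. For type details, see Types and Enums.

Every handle-holding object (PdfDocument, Pdf, DocumentEditor) owns native memory and must be closed. The idiomatic approach is the block form, which closes automatically; #close is idempotent.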
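The block form and the idempotent #close follow a common Ruby resource pattern. The sketch below illustrates that pattern with a hypothetical stand-in class, FakeHandle, so that it runs without the gem. It is not PDF Oxide's implementation, only an illustration of the behavior described above.

```ruby
# Hypothetical stand-in for a handle-owning object (not part of pdf_oxide).
class FakeHandle
  attr_reader :release_count

  def initialize
    @closed = false
    @release_count = 0
  end

  # Block form: yield the object, then close it even if the block raises.
  # Without a block, return the object and leave closing to the caller.
  def self.open
    handle = new
    return handle unless block_given?

    begin
      yield handle
    ensure
      handle.close
    end
  end

  # Idempotent: the underlying resource is released at most once.
  def close
    return nil if @closed

    @closed = true
    @release_count += 1
    nil
  end

  def closed?
    @closed
  end
end

# Block form closes automatically.
FakeHandle.open { |handle| puts "inside block, closed? #{handle.closed?}" }

# Non-block form needs an explicit close; calling it twice is harmless.
manual = FakeHandle.open
manual.close
manual.close
```

With the real classes, the same shape applies: prefer the block form, and when a handle must outlive a block, pair it with an ensure clause that calls close.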

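PdfOxide (module)

Top-level convenience entry points and process-global switches.

PdfOxide.open(source, password: nil) { |doc| ... } -> PdfDocument

Opens a PDF for reading. Delegates to PdfDocument.open. Accepts a file path or raw PDF bytes; the block form closes automatically.

PdfOxide.version -> String

Returns the library version string (e.g. "0.3.69").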

PdfOxide.set_max_ops_per_stream(limit) -> Integer

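Sets the process-global cap on content-stream operators. A negative limit restores the default (1,000,000); any non-negative value becomes an explicit cap. Returns the previous cap.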
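Because the method returns the previous cap, a caller can tighten the limit for one untrusted file and restore it afterwards. A minimal sketch follows; with_ops_limit is a hypothetical helper (not part of the gem), and its limiter argument is anything that responds to set_max_ops_per_stream, such as the PdfOxide module.

```ruby
# Hypothetical helper: apply a temporary operator cap, then restore the old one.
# `limiter` is expected to respond to set_max_ops_per_stream(limit) and to
# return the previous cap, as documented for PdfOxide.set_max_ops_per_stream.
def with_ops_limit(limiter, limit)
  previous = limiter.set_max_ops_per_stream(limit)
  begin
    yield
  ensure
    # Restore whatever was in effect before, even if the block raised.
    limiter.set_max_ops_per_stream(previous)
  end
end

# Intended usage with the gem loaded (not executed here):
#   with_ops_limit(PdfOxide, 50_000) do
#     PdfOxide.open('untrusted.pdf') { |doc| puts doc.extract_text(0) }
#   end
```

Since the switch is process-global, this sketch is only safe when no other thread is parsing PDFs at the same time.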

PdfOxide.set_preserve_unmapped_glyphs(preserve) -> Integer

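Toggles the process-global U+FFFD (unmapped glyph) preservation flag used by text extraction. A truthy/nonzero value preserves them; a falsy/0 value filters them out (the default). Returns the previous value (0 or 1).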

PdfDocument

The primary read-only entry point for PDFs: extraction, search, conversion, rendering, and page access.

doc = PdfOxide::PdfDocument.open('invoice.pdf')

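Constructors and class methods

PdfOxide::PdfDocument.open(source, password: nil) { |doc| ... } -> PdfDocument

Opens a PDF from a filesystem path or raw PDF bytes (binary input is auto-detected via the %PDF- magic number). The block form closes automatically; the non-block form returns the document. May raise FileNotFoundError, ParseError, or EncryptedError.

PdfOxide::PdfDocument.new(source, password: nil) -> PdfDocument

Constructs directly without a block. Prefer .open.

PdfOxide::PdfDocument.extract_text(source, page: 0) -> String

One-shot helper: opens, extracts the text of one page, then closes.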

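Document information

doc.page_count -> Integer

Number of pages in the document.

doc.pdf_version -> String

PDF version string (e.g. "1.7"), or "unknown" if unavailable.

doc.encrypted? -> Boolean

Whether the PDF carries an encryption dictionary.

doc.path -> String

Absolute path the document was opened from (<in-memory> for documents opened from bytes).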

Authentication

doc.authenticate(password) -> Boolean

Authenticates against the document's encryption. Returns true on success or for unencrypted documents.

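Text extraction

doc.extract_text(page_index) -> String

Extracts plain text from a single zero-based page (returns an empty string for pages with no text layer).

doc.extract_structured(page) -> Hash

Extracts a page as a structured representation, returning a Hash with page_index, page_width, page_height, and regions (each with kind, text, bbox, spans, column_index).

doc.extract_text_auto(page_index) -> String

Auto-routed extraction: uses native text where it exists, applies OCR to scanned regions when the ocr feature is available, and falls back gracefully to native extraction (never raises "OCR unavailable").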

Conversion

doc.to_markdown(page_index = nil) -> String

Converts a single page to Markdown, or the whole document when page_index is nil.

doc.to_html(page_index = nil) -> String

Converts a single page to HTML, or the whole document when page_index is nil.

Search

doc.search(query, case_sensitive: false, regex: false) -> Array<Hash>

Searches the document. Each match is { page:, text:, bbox: { x:, y:, width:, height: } }. Pass regex: true to interpret query as a regular expression (raises UnsupportedFeatureError if the build does not support regex search).
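Since each match is a plain Hash, ordinary Enumerable methods are enough to summarize results. The sketch below counts hits per page; hits_per_page is a hypothetical helper, and the sample data is made up in the documented shape so that the snippet runs without the gem.

```ruby
# Hypothetical helper: count search hits per page, ordered by page index.
# `matches` is an Array of Hashes shaped like the documented search results:
#   { page:, text:, bbox: { x:, y:, width:, height: } }
def hits_per_page(matches)
  matches
    .group_by { |match| match[:page] }
    .transform_values(&:length)
    .sort
    .to_h
end

# Sample data in the documented shape (values are made up for illustration).
sample = [
  { page: 2, text: 'config', bbox: { x: 72.0, y: 700.0, width: 40.0, height: 12.0 } },
  { page: 0, text: 'config', bbox: { x: 72.0, y: 650.0, width: 40.0, height: 12.0 } },
  { page: 2, text: 'Config', bbox: { x: 90.0, y: 500.0, width: 42.0, height: 12.0 } }
]

hits_per_page(sample).each do |page, count|
  puts "page #{page + 1}: #{count} match(es)"
end
```

With a real document, the input would be the return value of doc.search(query).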

Forms

doc.form_fields -> Array<Hash>

Returns AcroForm fields as { name:, value:, type:, page: } Hashes. Returns [] when the build lacks the form-extraction accessor.

Rendering

doc.render(page_index, dpi: 150) -> String

Renders a single page to PNG bytes (BINARY) at the given DPI.

doc.render_with_layers(page_index, dpi: 150, format: 0,
                       background: [1.0, 1.0, 1.0, 1.0], transparent: false,
                       render_annotations: true, jpeg_quality: 90,
                       excluded_layers: []) -> String

Renders a page using the full RenderOptions surface, with optional content group (OCG) layer filtering. format: 0 = PNG, 1 = JPEG; excluded_layers lists the OCG /Name values to suppress. Returns the encoded image bytes (BINARY).
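Because the render methods return raw image bytes in either PNG or JPEG form, it can be handy to choose a file extension from the data itself. The sketch below checks the standard PNG and JPEG magic numbers; image_extension is a hypothetical helper, not part of the gem.

```ruby
# Standard file signatures for the two formats the renderer can emit.
PNG_MAGIC  = "\x89PNG\r\n\x1A\n".b.freeze
JPEG_MAGIC = "\xFF\xD8\xFF".b.freeze

# Hypothetical helper: pick a file extension from the leading magic bytes.
# Returns nil when the data is neither PNG nor JPEG.
def image_extension(bytes)
  data = bytes.b # compare as raw bytes regardless of the string's encoding
  return '.png' if data.start_with?(PNG_MAGIC)
  return '.jpg' if data.start_with?(JPEG_MAGIC)

  nil
end

# Intended usage with the gem loaded (not executed here):
#   bytes = doc.render_with_layers(0, format: 1)
#   File.binwrite("page1#{image_extension(bytes)}", bytes)
```

File.binwrite is used in the usage comment because the returned String is BINARY-encoded; a text-mode write could alter the bytes on some platforms.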

Page access

doc.page(index) -> PdfPage

A lightweight PdfPage view of the page at index.

doc.pages -> Array<PdfPage>

Every page in the document (eagerly evaluated).

Auto extraction

doc.auto_extractor -> AutoExtractor

The AutoExtractor configured for this document (memoized).

Lifecycle

doc.close -> nil

Releases the native handle. Idempotent.

doc.open? -> Boolean
doc.closed? -> Boolean

Whether the document is still open / has been closed.

PdfPage

A lightweight single-page view borrowed from a PdfDocument. It holds no native handle of its own. Construct it via PdfDocument#page or #pages.

page = doc.page(0)

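Attributes

page.parent -> PdfDocument
page.index -> Integer

The owning document and the zero-based page index.

Geometry

page.width -> Float
page.height -> Float

Width and height of the page in PDF user-space units.

page.media_box -> Hash
page.crop_box -> Hash

{ x:, y:, width:, height: } for the media box / crop box (the crop box falls back to the media box).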
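User-space units are easier to reason about once converted to physical sizes. The sketch below assumes the default PDF user unit of 1/72 inch (a page that overrides it would need a different factor); box_size_mm is a hypothetical helper, and the sample box is a made-up US Letter page.

```ruby
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4

# Hypothetical helper: convert a box Hash ({ x:, y:, width:, height: }) from
# default user-space units (assumed to be 1/72 inch) to millimetres,
# rounded to one decimal place. Returns [width_mm, height_mm].
def box_size_mm(box)
  [box[:width], box[:height]].map do |points|
    (points / POINTS_PER_INCH * MM_PER_INCH).round(1)
  end
end

# A made-up box in the documented shape (US Letter is 612 x 792 units).
letter = { x: 0.0, y: 0.0, width: 612.0, height: 792.0 }
width_mm, height_mm = box_size_mm(letter)
puts "#{width_mm} x #{height_mm} mm"
```

With a real page, the argument would be page.media_box or page.crop_box.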

page.rotation -> Integer

Page rotation in degrees.

Text

page.text -> String

Extracts the page's text (equivalent to parent.extract_text(index)).

page.to_s -> String
page.inspect -> String

Short inspection label (#<PdfOxide::PdfPage index=N>).

Pdf

Creates and saves PDFs: Markdown/HTML/text/image sources, byte export, and bookmark split planning.

pdf = PdfOxide::Pdf.from_markdown("# Title\n\nBody")

Factory methods

PdfOxide::Pdf.from_markdown(markdown) { |pdf| ... } -> Pdf

Builds a PDF from Markdown.

PdfOxide::Pdf.from_html(html) { |pdf| ... } -> Pdf

Builds a PDF from HTML (CSS is supported through the html_css pipeline).

PdfOxide::Pdf.from_text(text) { |pdf| ... } -> Pdf

Builds a PDF from plain text.

PdfOxide::Pdf.from_images(images) { |pdf| ... } -> Pdf

Builds a PDF from an array of JPEG/PNG byte blobs (the format is auto-detected from magic numbers).

PdfOxide::Pdf.create_empty { |pdf| ... } -> Pdf

Creates a blank single-page PDF.

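Static helpers

PdfOxide::Pdf.version -> String

The library version.

PdfOxide::Pdf.prefetch_models(languages) -> String

Prefetches OCR models for the given BCP-47/ISO language tags. Returns the cache directory path (empty in builds without OCR).

PdfOxide::Pdf.prefetch_available? -> Boolean

Whether this build supports OCR model provisioning.

PdfOxide::Pdf.plan_split_by_bookmarks_count(source_pdf, level) -> Integer

Counts the bookmark split segments that splitting source_pdf (raw bytes) at level (1 = top level, 0 = all) would produce, without generating any output.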

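Instance methods

pdf.to_bytes -> String

Returns the PDF as BINARY-encoded bytes.

pdf.save(path) -> String

Writes the PDF to path. Returns the absolute path written.

pdf.close -> nil
pdf.closed? -> Boolean

Releases the native handle (idempotent) / whether it has been closed.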

DocumentEditor

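Write-side editor: destructive redaction, metadata scrubbing, form filling, and incremental saves. Every redaction operation fails closed (a nonzero return raises an exception).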

PdfOxide::DocumentEditor.open('source.pdf') do |ed|
  ed.add_redaction(page: 0, rect: [100, 200, 300, 250])
  ed.apply_redactions!
  ed.save_to('redacted.pdf')
end

Constructors

PdfOxide::DocumentEditor.open(source) { |ed| ... } -> DocumentEditor

Opens an editor on a PDF on disk or in in-memory bytes. The block form closes automatically.

PdfOxide::DocumentEditor.new(source) -> DocumentEditor

Constructs directly without a block.

Redaction

ed.add_redaction(page:, rect:, color: [0.0, 0.0, 0.0]) -> self

Queues a redaction rectangle (rect = [x1, y1, x2, y2] in PDF user space; color = [r, g, b]). Nothing is applied until apply_redactions! is called.
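Search results describe a box as origin plus size, while add_redaction takes two corners, so a small conversion is needed to redact search hits. The sketch below is a hypothetical helper, bbox_to_rect, and it assumes that search bboxes and redaction rects share the same PDF user-space coordinates, which this reference does not state explicitly.

```ruby
# Hypothetical helper: turn a search bbox ({ x:, y:, width:, height: }) into a
# redaction rect ([x1, y1, x2, y2]), optionally grown by `padding` on each side.
# Assumption: both use the same PDF user-space coordinates.
def bbox_to_rect(bbox, padding: 0.0)
  x1 = bbox[:x] - padding
  y1 = bbox[:y] - padding
  x2 = bbox[:x] + bbox[:width] + padding
  y2 = bbox[:y] + bbox[:height] + padding
  [x1, y1, x2, y2]
end

# Intended usage with the gem loaded (not executed here), where `matches`
# holds the Array returned earlier by doc.search:
#   PdfOxide::DocumentEditor.open('source.pdf') do |ed|
#     matches.each do |m|
#       ed.add_redaction(page: m[:page], rect: bbox_to_rect(m[:bbox], padding: 1.0))
#     end
#     ed.apply_redactions!
#     ed.save_to('redacted.pdf')
#   end
```

A small padding can help cover glyph edges that extend slightly past the reported box; whether it is needed depends on the document.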

ed.redaction_count(page) -> Integer

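Total number of redactions queued for the page.

ed.apply_redactions!(scrub_metadata: false, fill_color: [0.0, 0.0, 0.0]) -> self

Destructively applies all queued redactions, optionally scrubbing /Info, XMP, and JS.

ed.scrub_metadata -> self

Strips metadata without processing any redaction regions.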

Forms

ed.set_form_field(name, value) -> self

Sets an AcroForm field by its fully qualified dot-separated name. A Boolean value targets checkboxes/radio buttons; otherwise the text value is set.

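Saving and lifecycle

ed.save_to(path) -> String

Saves the edited PDF. Returns the absolute path written.

ed.to_bytes -> String

Returns the edited PDF as BINARY-encoded bytes.

ed.close -> nil
ed.closed? -> Boolean

Releases the native handle (idempotent) / whether it has been closed.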

AutoExtractor

Auto extraction with typed reasons (text vs. OCR routing, with graceful native fallback). Constructed from a PdfDocument.

ax = PdfOxide::AutoExtractor.new(doc)
result = ax.extract_page(0)
warn "degraded: #{result[:reason]}" unless ax.ok?(result[:reason])

Constructor and attributes

PdfOxide::AutoExtractor.new(document) -> AutoExtractor

Wraps a PdfDocument for auto extraction.

ax.document -> PdfDocument

The wrapped document.

Classification

ax.classify_page(page_index) -> Hash

Cheap per-page classifier (no OCR/rasterization involved). Returns { reason:, kind:, confidence:, classification: }.

ax.classify_document -> Hash

Whole-document classifier; returns the decoded JSON envelope.

Extraction

ax.extract_text(page_index) -> Hash

Extracts a page's text through the auto router; returns { text:, reason:, kind:, confidence:, classification: }.

ax.extract_page(page_index, options: nil) -> Hash

Rich per-page extraction that returns the full PageExtraction envelope (text + per-region bbox + reason + confidence) merged into a single Hash.

Predicates

ax.ok?(reason) -> Boolean

Whether reason indicates a clean extraction.

ax.ocr_fallback?(reason) -> Boolean

Whether the graceful fallback path for unavailable OCR was taken.

PdfOxide::AutoExtractor.prefetch_available? -> Boolean

Whether this build supports OCR provisioning.

Constants

AutoExtractor::REASONS: a frozen array of typed reason symbols (:ok, :native_text_high_confidence, :no_text_layer_present, :ocr_requested_but_unavailable, and so on). AutoExtractor::PAGE_KINDS: page kind symbols (:text_layer, :scanned, :image_text, :mixed, :empty).
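Typed reasons make it straightforward to audit how a whole document was extracted. The sketch below tallies the reason reported for each page; reason_tally is a hypothetical helper, and its extractor argument is anything that responds to extract_text(page_index) and returns a Hash with a :reason key, as documented for AutoExtractor.

```ruby
# Hypothetical helper: tally the typed reason reported for each page.
# `extractor` is expected to respond to extract_text(page_index) and return a
# Hash containing :reason, as documented for AutoExtractor#extract_text.
def reason_tally(extractor, page_count)
  (0...page_count).each_with_object(Hash.new(0)) do |index, tally|
    tally[extractor.extract_text(index)[:reason]] += 1
  end
end

# Intended usage with the gem loaded (not executed here):
#   PdfOxide::PdfDocument.open('scan.pdf') do |doc|
#     reason_tally(doc.auto_extractor, doc.page_count).each do |reason, count|
#       puts "#{reason}: #{count} page(s)"
#     end
#   end
```

A tally dominated by :ocr_requested_but_unavailable would suggest that the build lacks OCR support, which prefetch_available? can confirm.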

MarkdownConverter

A stateless module that converts a PdfDocument to Markdown or HTML.

PdfOxide::MarkdownConverter.to_markdown(doc, page_index = nil) -> String

Converts a page (or the whole document when page_index is nil) to Markdown.

PdfOxide::MarkdownConverter.to_html(doc, page_index = nil) -> String

Converts a page (or the whole document) to HTML.

PdfPolicy

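Process-global cryptographic governance policy with set-once semantics. Call .set before any other PDF Oxide operation.

PdfOxide::PdfPolicy.current -> Symbol

The current process policy mode (:compat, :strict, or :fips_strict).

PdfOxide::PdfPolicy.set(mode) -> Symbol

Sets the process-global policy mode. Raises if the policy has already been set or the build does not support it.

PdfOxide::PdfPolicy.compat -> Symbol
PdfOxide::PdfPolicy.strict -> Symbol
PdfOxide::PdfPolicy.fips_strict -> Symbol

Preset mode symbols: accept all algorithms / reject legacy algorithms / FIPS 140-3 only.

PdfPolicy::MODES: a frozen map from mode symbols to cdylib ordinals.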

PdfSigner

Signer for PAdES B-B / B-T / B-LT / B-LTA digital signatures. Signing is a security operation: every nonzero return fails closed.

PdfOxide::PdfSigner.new(certificate_handle) -> PdfSigner

Constructs a signer from an opaque PKCS#12/PEM credential handle.

signer.sign(pdf, level:, tsa_url: nil, reason: nil, location: nil) -> String

Signs raw PDF bytes at the requested PAdES level (:b, :t, :lt, :lta). Levels >= :t require tsa_url. Returns the signed PDF as BINARY-encoded bytes.
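The tsa_url requirement can be checked before any native call is made. The sketch below is a hypothetical pre-flight check, check_sign_args!, which assumes the levels are ordered :b < :t < :lt < :lta as listed above. How the gem itself reports a missing tsa_url is not specified in this reference.

```ruby
# Levels in ascending order, as listed in the reference text.
PADES_LEVEL_ORDER = %i[b t lt lta].freeze

# Hypothetical pre-flight check: raise early when a level at or above :t is
# requested without a timestamp authority URL. Returns true when the
# combination is acceptable.
def check_sign_args!(level:, tsa_url: nil)
  rank = PADES_LEVEL_ORDER.index(level)
  raise ArgumentError, "unknown PAdES level: #{level.inspect}" if rank.nil?

  needs_tsa = rank >= PADES_LEVEL_ORDER.index(:t)
  if needs_tsa && (tsa_url.nil? || tsa_url.empty?)
    raise ArgumentError, "level #{level.inspect} requires tsa_url"
  end

  true
end

# Intended usage with the gem loaded (not executed here):
#   check_sign_args!(level: :t, tsa_url: tsa_url)
#   signed = signer.sign(pdf_bytes, level: :t, tsa_url: tsa_url)
```

Checking in Ruby first keeps the failure close to the caller's mistake instead of surfacing it as a signing error.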

PdfOxide::PdfSigner.sign(pdf:, certificate_handle:, level:, tsa_url: nil, reason: nil, location: nil) -> String

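Static convenience method: signs without constructing a signer instance.

PdfOxide::PdfSigner.pades_level(signature_handle) -> Integer

The PAdES level ordinal of an existing signature handle.

PdfOxide::PdfSigner.document_has_timestamp?(document_handle) -> Boolean

Whether the document carries a document-level /DocTimeStamp.

PdfSigner::LEVELS: a frozen map from level symbols to codes. PdfSigner::PadesSignOptions: a packed FFI::Struct mirroring the C PadesSignOptionsC layout.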

PdfValidator

Stateless PDF/A and PDF/UA compliance validation.

PdfOxide::PdfValidator.pdf_a?(doc, level: :a1b) -> Boolean

Whether the document conforms to PDF/A at level (:a1b, :a1a, :a2b, :a2a, :a2u, :a3b, :a3a, :a3u).

PdfOxide::PdfValidator.pdf_ua?(doc, level: :ua1) -> Boolean

Whether the document conforms to PDF/UA at level (:ua1 or :ua2).

PdfOxide::PdfValidator.validate_pdf_a(doc, level: :a1b) -> Hash

Simplified PDF/A result: { compliant:, violations: }.

PdfOxide::PdfValidator.validate_pdf_ua(doc, level: :ua1) -> Hash

Simplified PDF/UA result: { compliant:, violations: }.

PdfValidator::PDF_A_LEVELS and PdfValidator::PDF_UA_LEVELS: frozen maps from levels to ordinals.

Error handling

All PDF Oxide exceptions derive from PdfOxide::Error. Native error codes map one-to-one to the subclasses below.

begin
  doc = PdfOxide::PdfDocument.open('file.pdf')
  text = doc.extract_text(0)
rescue PdfOxide::FileNotFoundError
  warn 'file not found'
rescue PdfOxide::ParseError => e
  warn "malformed PDF: #{e.message}"
rescue PdfOxide::Error => e
  warn "PDF error: #{e.message}"
ensure
  doc&.close
end

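Exception	Cause
`Error`	Base class for all PDF Oxide errors
`UnsupportedPlatformError`	The bundled cdylib does not support the host platform
`ArgumentError`	An argument failed validation before the native call
`IoError`	Filesystem / I/O failure
`FileNotFoundError`	Missing file (specialization of `IoError`)
`ParseError`	Corrupt header, corrupt xref, extraction failure
`StateError`	Operations called in the wrong order
`InvalidStateError`	Operation on a closed handle (specialization of `StateError`)
`EncryptedError`	Encryption / wrong-password failure
`PermissionError`	Encrypted PDF lacks extraction/signing permission
`UnsupportedFeatureError`	The feature is not compiled into this cdylib build
`SignatureError`	PAdES signing / verification failure
`RedactionError`	Destructive redaction failure (fails closed)
`ComplianceError`	PDF/A or PDF/UA validation failure
`SearchError`	Native text search failure
`InternalError`	Generic native-side failure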

Complete example

require 'pdf_oxide'

# --- Extraction ---
PdfOxide::PdfDocument.open('input.pdf') do |doc|
  puts "Pages: #{doc.page_count}"
  doc.page_count.times do |i|
    puts "Page #{i + 1}: #{doc.extract_text(i).length} characters"
  end

  # Search
  doc.search('configuration', case_sensitive: false).each do |m|
    puts "Page #{m[:page] + 1}: '#{m[:text]}' at (#{m[:bbox][:x]}, #{m[:bbox][:y]})"
  end

  # Render page 1 to PNG
  File.binwrite('page1.png', doc.render(0, dpi: 150))
end

# --- Creation ---
PdfOxide::Pdf.from_markdown("# Report\n\nGenerated by PDF Oxide.") do |pdf|
  pdf.save('report.pdf')
end

# --- Redaction ---
PdfOxide::DocumentEditor.open('source.pdf') do |ed|
  ed.add_redaction(page: 0, rect: [100, 200, 300, 250])
  ed.apply_redactions!(scrub_metadata: true)
  ed.save_to('redacted.pdf')
end

# --- Validation ---
PdfOxide::PdfDocument.open('archive.pdf') do |doc|
  puts "PDF/A-1b compliant: #{PdfOxide::PdfValidator.pdf_a?(doc, level: :a1b)}"
end

Other Language Bindings

PDF Oxide provides native bindings for all major ecosystems: Rust, Python, Node.js, WASM, C#, Golang, Java, PHP, C++, Swift, Kotlin, Dart, R, Julia, Zig, Scala, Clojure, Objective-C, Elixir.

Next steps

Types and Enums: all shared types and enums
Page API Reference: consistent per-page iteration across bindings
Ruby Quick Start: tutorial