What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

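Getting Started with PDF Oxide (Ruby)

PDF Oxide is the fastest Ruby PDF library, with a mean text extraction time of 0.8ms and a 100% pass rate on 3,830 PDFs. A single gem handles PDF extraction, search, conversion, creation, and redaction, built on the same Rust core that powers the Python, Java, Node, Go, C#, and PHP bindings.

Installation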

gem install pdf_oxide

Or add it to your Gemfile:

gem 'pdf_oxide', '~> 0.3'

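A precompiled libpdf_oxide native library ships with the platform-tagged gems, so no compiler or system-level installation is needed. Precompiled gems cover Ruby 3.1–3.4 on x86_64-linux and aarch64-linux, macOS on Intel and Apple Silicon, and Windows (x64-mingw-ucrt).

Opening a PDF

Load a file with PdfDocument.open. The block form closes the document automatically when the block returns; #close is also available and is safe to call more than once.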

require 'pdf_oxide'

PdfOxide::PdfDocument.open('research-paper.pdf') do |doc|
  puts "Pages: #{doc.page_count}"
  puts "PDF version: #{doc.pdf_version}"
  puts "Encrypted: #{doc.encrypted?}"
end

For encrypted documents, pass password:

PdfOxide::PdfDocument.open('confidential.pdf', password: 'secret') do |doc|
  puts doc.extract_text(0)
end

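You can also open a document from in-memory bytes, which is convenient when streaming from S3, HTTP, or a database. PdfDocument.open recognizes raw PDF bytes automatically by their %PDF- magic header: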

bytes = File.binread('report.pdf')
PdfOxide::PdfDocument.open(bytes) do |doc|
  puts doc.extract_text(0)
end
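
As a rough illustration of header-based detection, the sketch below checks whether a string begins with the %PDF- magic bytes. The helper name pdf_bytes? is hypothetical and is not part of the PDF Oxide API; the library's own detection logic may differ.

```ruby
# Hypothetical helper (not part of PDF Oxide): returns true when the
# argument looks like raw PDF data rather than, say, a file path.
PDF_MAGIC = '%PDF-'

def pdf_bytes?(data)
  return false unless data.is_a?(String)

  # Compare only the first five bytes so large payloads are not copied.
  data.byteslice(0, PDF_MAGIC.bytesize) == PDF_MAGIC
end
```

A check like this can be useful before handing data fetched over HTTP to PdfDocument.open, so that an HTML error page is rejected early.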

Text Extraction

Single page

Extract plain text from any page using its zero-based page index.

PdfOxide::PdfDocument.open('report.pdf') do |doc|
  text = doc.extract_text(0)
  puts text
end

All pages

PdfOxide::PdfDocument.open('book.pdf') do |doc|
  doc.page_count.times do |i|
    puts "--- Page #{i + 1} ---"
    puts doc.extract_text(i)
  end
end
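
When the text is needed as a single string rather than printed page by page, the same loop can be wrapped in a small helper. This is a minimal sketch: full_text is a hypothetical name, and it relies only on the #page_count and #extract_text calls shown above.

```ruby
# Joins the text of every page into one string, with a header per page.
# `doc` is anything that responds to #page_count and #extract_text(index),
# such as the document yielded by PdfOxide::PdfDocument.open.
def full_text(doc, separator: "\n\n")
  doc.page_count.times.map do |i|
    "--- Page #{i + 1} ---\n#{doc.extract_text(i)}"
  end.join(separator)
end
```

Inside an open block this would be called as full_text(doc).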

One-shot helper

When you only need the text of a single page, PdfDocument.extract_text opens, extracts, and closes in one call:

text = PdfOxide::PdfDocument.extract_text('report.pdf', page: 0)
puts text

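Auto-routed extraction

extract_text_auto uses the auto-router introduced in v0.3.51 to choose native text extraction or OCR for each page. In builds without the ocr feature, it falls back gracefully to the native text layer and never raises an "OCR unavailable" style error.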

PdfOxide::PdfDocument.open('scan.pdf') do |doc|
  puts doc.extract_text_auto(0)
end

For a typed reason describing the extraction quality, use AutoExtractor:

PdfOxide::PdfDocument.open('scan.pdf') do |doc|
  ax     = doc.auto_extractor
  result = ax.extract_page(0)
  puts result[:text]
  warn "degraded: #{result[:reason]}" unless ax.ok?(result[:reason])
end
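
To survey a whole document, the per-page check can be repeated for every page. The sketch below assumes one AutoExtractor can be reused across pages, which the text above does not confirm; degraded_pages is a hypothetical helper name, and the set of possible reason values is not documented here.

```ruby
# Returns a Hash mapping page index => reason for every page whose
# extraction the AutoExtractor does not report as ok.
# Assumption: a single extractor may be reused for all pages.
def degraded_pages(doc)
  ax = doc.auto_extractor
  (0...doc.page_count).each_with_object({}) do |i, out|
    result = ax.extract_page(i)
    out[i] = result[:reason] unless ax.ok?(result[:reason])
  end
end
```

An empty Hash means every page was extracted without a reported degradation.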

Working with Pages

PdfDocument#page returns a lightweight PdfPage view borrowed from the document itself. #pages returns one such view for every page.

PdfOxide::PdfDocument.open('paper.pdf') do |doc|
  page = doc.page(0)
  puts "Index: #{page.index}"
  puts page.text   # same as doc.extract_text(0)

  doc.pages.each do |p|
    puts "Page #{p.index}: #{p.text.length} chars"
  end
end

Markdown and HTML Conversion

Convert a single page (pass a page index) or the whole document (omit the index) to Markdown or HTML.

PdfOxide::PdfDocument.open('paper.pdf') do |doc|
  puts doc.to_markdown(0)   # first page to Markdown
  puts doc.to_html(0)       # first page to HTML
  puts doc.to_markdown      # entire document to Markdown
end
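
For LLM or RAG pipelines, the Markdown output is often split into sections before indexing. The following is a minimal sketch in plain Ruby that splits on ATX headings; split_markdown_sections is a hypothetical helper, not part of PDF Oxide, and it does not account for '#' lines inside fenced code blocks.

```ruby
# Splits Markdown into sections, starting a new one at each ATX heading
# (a line beginning with one to six '#' characters followed by a space).
# Returns an array of { heading:, body: } hashes; any text before the
# first heading is returned with a nil heading.
def split_markdown_sections(markdown)
  sections = []
  current = { heading: nil, lines: [] }
  blank = ->(s) { s[:heading].nil? && s[:lines].empty? }

  markdown.each_line(chomp: true) do |line|
    if line.match?(/\A[#]{1,6} /)
      sections << current unless blank.call(current)
      current = { heading: line.sub(/\A[#]+ +/, '').strip, lines: [] }
    else
      current[:lines] << line
    end
  end
  sections << current unless blank.call(current)

  sections.map { |s| { heading: s[:heading], body: s[:lines].join("\n").strip } }
end
```

The input would typically be the string returned by doc.to_markdown.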

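Structured Extraction

extract_structured returns the parsed page layout as a Hash, including the page dimensions and typed regions with text, bounding boxes, and column indices.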

PdfOxide::PdfDocument.open('paper.pdf') do |doc|
  page = doc.extract_structured(0)
  puts "Size: #{page['page_width']} x #{page['page_height']}"
  page['regions'].each do |region|
    puts "#{region['kind']}: #{region['text']}"
  end
end
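
Because the result is an ordinary Hash, it can be post-processed with standard Ruby. The sketch below groups region text by kind. The 'regions', 'kind', and 'text' keys come from the example above; the sample kind values ('heading', 'paragraph') and the helper name texts_by_kind are assumptions for illustration.

```ruby
# Groups the text of each region under its 'kind' value.
def texts_by_kind(page)
  page.fetch('regions', [])
      .group_by { |region| region['kind'] }
      .transform_values { |regions| regions.map { |region| region['text'] } }
end

# Hand-written sample in the shape described above (values are made up).
SAMPLE_PAGE = {
  'page_width' => 612.0,
  'page_height' => 792.0,
  'regions' => [
    { 'kind' => 'heading', 'text' => 'Introduction' },
    { 'kind' => 'paragraph', 'text' => 'First paragraph.' },
    { 'kind' => 'paragraph', 'text' => 'Second paragraph.' }
  ]
}.freeze

grouped = texts_by_kind(SAMPLE_PAGE)
puts grouped['paragraph'].length
```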

Search

search scans the entire document and returns an array of match Hashes, each containing :page, :text, and a :bbox Hash made up of :x, :y, :width, and :height.

PdfOxide::PdfDocument.open('manual.pdf') do |doc|
  matches = doc.search('configuration', case_sensitive: false)
  matches.each do |m|
    bbox = m[:bbox]
    puts "Page #{m[:page]}: '#{m[:text]}' at (#{bbox[:x].round}, #{bbox[:y].round})"
  end
end
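
The match array lends itself to simple summaries. As a small sketch, the hypothetical helper below counts hits per page using only the :page key described above; it works on any array of Hashes with that key.

```ruby
# Counts search hits per page and returns a Hash ordered by page index.
def match_counts_by_page(matches)
  matches.map { |m| m[:page] }.tally.sort.to_h
end
```

For example, match_counts_by_page(doc.search('configuration')) would show which pages mention the term most often.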

Rendering

Render a page to PNG bytes at a given DPI:

PdfOxide::PdfDocument.open('poster.pdf') do |doc|
  png = doc.render(0, dpi: 150)
  File.binwrite('page-0.png', png)
end
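
To estimate the output size for a given DPI, recall that PDF page dimensions are expressed in points, with 72 points to the inch by default. The sketch below does that arithmetic. pixel_size is a hypothetical helper, and the exact rounding the renderer applies is an assumption.

```ruby
POINTS_PER_INCH = 72.0

# Approximate pixel dimensions of a page rendered at the given DPI.
# width_pt and height_pt are the page size in PDF points.
def pixel_size(width_pt, height_pt, dpi:)
  scale = dpi / POINTS_PER_INCH
  [(width_pt * scale).round, (height_pt * scale).round]
end

# US Letter (612 x 792 pt) at 150 DPI.
p pixel_size(612, 792, dpi: 150) # => [1275, 1650]
```

Doubling the DPI doubles each dimension, so the pixel count (and roughly the memory needed) grows fourfold.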

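Creating PDFs

The Pdf class creates PDFs from Markdown, HTML, or plain text. Each instance holds a native handle; use the block form (which closes automatically) or call #close yourself.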

PdfOxide::Pdf.from_markdown("# Hello World\n\nThis is a PDF.") do |pdf|
  pdf.save('output.pdf')
end

PdfOxide::Pdf.from_html('<h1>Invoice</h1><p>Amount due: $42.00</p>') do |pdf|
  pdf.save('invoice.pdf')
end

PdfOxide::Pdf.from_text("Plain text document.\n\nSecond paragraph.") do |pdf|
  pdf.save('notes.pdf')
end

Use #to_bytes to get the bytes directly without saving to disk:

pdf = PdfOxide::Pdf.from_markdown('# Report')
pdf_bytes = pdf.to_bytes
pdf.close # release the native handle when not using the block form
# upload pdf_bytes, attach to an email, etc.

Redaction

DocumentEditor opens an existing PDF for destructive redaction. apply_redactions! permanently removes the covered content and can scrub document metadata in the same pass.

PdfOxide::DocumentEditor.open('source.pdf') do |ed|
  ed.add_redaction(page: 0, rect: [100, 200, 300, 250])
  ed.apply_redactions!(scrub_metadata: true)
  ed.save_to('redacted.pdf')
end

Error Handling

For PDF-specific failures, PDF Oxide raises typed subclasses of PdfOxide::Error.

begin
  PdfOxide::PdfDocument.open('document.pdf') do |doc|
    puts doc.extract_text(0)
  end
rescue PdfOxide::FileNotFoundError
  warn 'File not found'
rescue PdfOxide::EncryptedError
  warn 'Wrong or missing password'
rescue PdfOxide::ParseError => e
  warn "Malformed PDF: #{e.message}"
rescue PdfOxide::Error => e
  warn "PDF error: #{e.message}"
end

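Next Steps

Python Getting Started: using PDF Oxide from Python
Rust Getting Started: using PDF Oxide from Rust
Text Extraction: detailed extraction options and practical examples
Creating PDFs: advanced creation, encryption, and metadata
Editing: modifying existing PDFs, annotations, and form fields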