What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

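Getting Started with PDF Oxide (Ruby)

PDF Oxide is the fastest PDF library for Ruby: text extraction averages 0.8 ms, with a 100% success rate across 3,830 PDFs. A single gem handles extraction, search, conversion, creation, and redaction. It is built on the same Rust core that powers the Python, Java, Node, Go, C#, and PHP bindings.

Installation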

gem install pdf_oxide

Or add it to your Gemfile:

gem 'pdf_oxide', '~> 0.3'
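
To see which releases the '~> 0.3' constraint above admits, the following sketch asks RubyGems itself through Gem::Requirement. The helper name allowed_by_gemfile? and the constant are made up for this illustration, and the version strings in the usage note are examples, not a list of actual pdf_oxide releases.

```ruby
require 'rubygems'

# '~> 0.3' is the pessimistic operator: at least 0.3, but below 1.0.
GEMFILE_REQUIREMENT = Gem::Requirement.new('~> 0.3')

# Hypothetical helper: true when the given version string satisfies
# the Gemfile constraint shown above.
def allowed_by_gemfile?(version)
  GEMFILE_REQUIREMENT.satisfied_by?(Gem::Version.new(version))
end
```

With this constraint, allowed_by_gemfile?('0.3.51') and allowed_by_gemfile?('0.9.0') are true, while '0.2.9' and '1.0.0' are rejected. Pin more tightly (for example '~> 0.3.51') if you want patch-level updates only.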

The prebuilt libpdf_oxide native library ships inside the platform-tagged gems, so no compiler or system-wide installation is needed. Prebuilt gems cover Ruby 3.1 through 3.4 on x86_64-linux, aarch64-linux, macOS (Intel and Apple Silicon), and Windows (x64-mingw-ucrt).

Opening a PDF

Use PdfDocument.open to load a file. In block form, the document is closed automatically when the block exits. #close is also available and is idempotent.

require 'pdf_oxide'

PdfOxide::PdfDocument.open('research-paper.pdf') do |doc|
  puts "Pages: #{doc.page_count}"
  puts "PDF version: #{doc.pdf_version}"
  puts "Encrypted: #{doc.encrypted?}"
end

For encrypted documents, pass password:

PdfOxide::PdfDocument.open('confidential.pdf', password: 'secret') do |doc|
  puts doc.extract_text(0)
end

You can also open a document from in-memory bytes, which is useful when streaming from S3, HTTP, or a database. PdfDocument.open auto-detects raw PDF bytes from the %PDF- magic header:

bytes = File.binread('report.pdf')
PdfOxide::PdfDocument.open(bytes) do |doc|
  puts doc.extract_text(0)
end
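
If the bytes come from an untrusted source, it can help to check the magic header yourself before handing them to the library. The sketch below is a stand-in for that check, not the library's own detection logic, which is not shown here. The names PDF_MAGIC and pdf_bytes? are hypothetical.

```ruby
# The header that a PDF file starts with, as a binary string.
PDF_MAGIC = '%PDF-'.b

# Hypothetical helper: true when `data` is a String whose first bytes
# are the PDF magic header. File paths and nil return false.
def pdf_bytes?(data)
  data.is_a?(String) && data.b.start_with?(PDF_MAGIC)
end
```

A downloaded payload can then be rejected early, for example when an HTTP endpoint returned an HTML error page instead of a PDF, without relying on how the library reports that case.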

Text Extraction

Single page

Extract plain text from any page by its 0-based index.

PdfOxide::PdfDocument.open('report.pdf') do |doc|
  text = doc.extract_text(0)
  puts text
end

All pages

PdfOxide::PdfDocument.open('book.pdf') do |doc|
  doc.page_count.times do |i|
    puts "--- Page #{i + 1} ---"
    puts doc.extract_text(i)
  end
end
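
When the goal is one string for the whole document rather than printed output, the same loop can be wrapped in a small helper. This is a minimal sketch: full_text is a hypothetical name, the form-feed separator is an arbitrary choice, and the helper relies only on the #page_count and #extract_text calls used above.

```ruby
# Hypothetical helper: joins the text of every page into one string.
# `doc` is any object that responds to #page_count and
# #extract_text(index), such as the document yielded by
# PdfOxide::PdfDocument.open.
def full_text(doc, separator: "\f")
  Array.new(doc.page_count) { |i| doc.extract_text(i) }.join(separator)
end

# Usage, assuming the gem is installed:
#   PdfOxide::PdfDocument.open('book.pdf') do |doc|
#     File.write('book.txt', full_text(doc))
#   end
```

Because the helper is duck-typed, it can be exercised in tests with a simple stub object instead of a real PDF.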

One-shot helper

When you only need the text of one page, PdfDocument.extract_text opens, extracts, and closes in a single call:

text = PdfOxide::PdfDocument.extract_text('report.pdf', page: 0)
puts text

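Auto-routed extraction

extract_text_auto uses the auto-router introduced in v0.3.51 to choose between native text and OCR for each page. When the library is built without the ocr feature, it falls back gracefully to the native text layer and does not raise errors such as "OCR unavailable".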

PdfOxide::PdfDocument.open('scan.pdf') do |doc|
  puts doc.extract_text_auto(0)
end

To get a typed reason describing the extraction quality, use AutoExtractor:

PdfOxide::PdfDocument.open('scan.pdf') do |doc|
  ax     = doc.auto_extractor
  result = ax.extract_page(0)
  puts result[:text]
  warn "degraded: #{result[:reason]}" unless ax.ok?(result[:reason])
end
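
For a whole document, the per-page reasons can be collected into a report. The sketch below uses only the calls shown above (auto_extractor, extract_page, ok?, and page_count). The helper name degraded_pages is hypothetical, and the concrete reason values the library returns are not listed here, so the helper treats them as opaque.

```ruby
# Hypothetical helper: returns { page_index => reason } for every page
# whose reason is not accepted by the extractor's #ok? check.
def degraded_pages(doc)
  ax = doc.auto_extractor
  doc.page_count.times.each_with_object({}) do |i, report|
    reason = ax.extract_page(i)[:reason]
    report[i] = reason unless ax.ok?(reason)
  end
end

# Usage, assuming the gem is installed:
#   PdfOxide::PdfDocument.open('scan.pdf') do |doc|
#     degraded_pages(doc).each { |i, reason| warn "page #{i}: #{reason}" }
#   end
```

An empty Hash means every page passed the check.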

Working with Pages

PdfDocument#page returns a lightweight PdfPage view borrowed from the document. #pages returns a view for every page.

PdfOxide::PdfDocument.open('paper.pdf') do |doc|
  page = doc.page(0)
  puts "Index: #{page.index}"
  puts page.text   # same as doc.extract_text(0)

  doc.pages.each do |p|
    puts "Page #{p.index}: #{p.text.length} chars"
  end
end

Converting to Markdown and HTML

Convert a single page (pass an index) or the whole document (omit the index) to Markdown or HTML.

PdfOxide::PdfDocument.open('paper.pdf') do |doc|
  puts doc.to_markdown(0)   # first page as Markdown
  puts doc.to_html(0)       # first page as HTML
  puts doc.to_markdown      # whole document as Markdown
end

Structured Extraction

extract_structured returns the parsed page layout as a Hash: the page dimensions plus typed regions, each with its text, bounding box, and column index.

PdfOxide::PdfDocument.open('paper.pdf') do |doc|
  page = doc.extract_structured(0)
  puts "Size: #{page['page_width']} x #{page['page_height']}"
  page['regions'].each do |region|
    puts "#{region['kind']}: #{region['text']}"
  end
end
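
Since the result is a plain Hash, ordinary Enumerable methods are enough to reshape it. The sketch below groups region text by kind using only the 'regions', 'kind', and 'text' keys shown above. The helper name text_by_kind is hypothetical, and the kind values in the usage note are invented sample data, not a list of kinds the library emits.

```ruby
# Hypothetical helper: maps each region kind to the list of texts
# found in regions of that kind, preserving page order.
def text_by_kind(page)
  page['regions']
    .group_by { |region| region['kind'] }
    .transform_values { |regions| regions.map { |region| region['text'] } }
end
```

For a page Hash whose regions are a 'heading' region with text 'Intro' followed by two 'paragraph' regions, the result is { 'heading' => ['Intro'], 'paragraph' => [first_text, second_text] }.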

Search

search scans the whole document and returns an array of Hashes, one per match. Each Hash contains :page, :text, and a :bbox Hash with :x, :y, :width, and :height.

PdfOxide::PdfDocument.open('manual.pdf') do |doc|
  matches = doc.search('configuration', case_sensitive: false)
  matches.each do |m|
    bbox = m[:bbox]
    puts "Page #{m[:page]}: '#{m[:text]}' at (#{bbox[:x].round}, #{bbox[:y].round})"
  end
end
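
The match Hashes are easy to post-process. The following sketch, which assumes only the :page and :bbox keys described above, counts hits per page and turns a bbox into corner coordinates. Both helper names are hypothetical, and no assumption is made about where the coordinate origin lies.

```ruby
# Hypothetical helper: { page => number_of_matches }.
def matches_per_page(matches)
  matches.map { |m| m[:page] }.tally
end

# Hypothetical helper: converts an x/y/width/height bbox into
# [x, y, x + width, y + height].
def bbox_corners(bbox)
  [bbox[:x], bbox[:y], bbox[:x] + bbox[:width], bbox[:y] + bbox[:height]]
end

# Usage, assuming the gem is installed:
#   PdfOxide::PdfDocument.open('manual.pdf') do |doc|
#     p matches_per_page(doc.search('configuration', case_sensitive: false))
#   end
```

The corner form is convenient when a downstream API expects two points instead of an origin plus size.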

Rendering

Render a page to PNG bytes at a given DPI:

PdfOxide::PdfDocument.open('poster.pdf') do |doc|
  png = doc.render(0, dpi: 150)
  File.binwrite('page-0.png', png)
end
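
To estimate how large the rendered image will be, recall that PDF page sizes are measured in points of 1/72 inch by default. The sketch below applies that conversion. The helper name pixel_size is hypothetical, and the exact rounding the renderer applies is an assumption, so treat the result as an estimate rather than a guarantee.

```ruby
# Default PDF user-space units per inch.
POINTS_PER_INCH = 72.0

# Hypothetical helper: converts a page size in points to an estimated
# [width, height] in pixels at the given DPI.
def pixel_size(width_pt, height_pt, dpi:)
  scale = dpi / POINTS_PER_INCH
  [(width_pt * scale).round, (height_pt * scale).round]
end
```

By this arithmetic, a US Letter page (612 x 792 points) at 150 DPI comes out at about 1275 x 1650 pixels, and doubling the DPI quadruples the pixel count.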

Creating PDFs

The Pdf class creates PDFs from Markdown, HTML, or plain text. Each instance holds a native handle, so either use the block form (which closes automatically) or call #close yourself.

PdfOxide::Pdf.from_markdown("# Hello World\n\nThis is a PDF.") do |pdf|
  pdf.save('output.pdf')
end

PdfOxide::Pdf.from_html('<h1>Invoice</h1><p>Amount due: $42.00</p>') do |pdf|
  pdf.save('invoice.pdf')
end

PdfOxide::Pdf.from_text("Plain text document.\n\nSecond paragraph.") do |pdf|
  pdf.save('notes.pdf')
end

To get the bytes directly without saving to disk, use #to_bytes:

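pdf = PdfOxide::Pdf.from_markdown('# Report')
pdf_bytes = pdf.to_bytes
pdf.close   # release the native handle, since no block form is used here
# upload pdf_bytes, attach it to an email, and so on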

Redaction

DocumentEditor opens an existing PDF for destructive redaction. apply_redactions! permanently removes the covered content and can also scrub the document metadata in the same pass.

PdfOxide::DocumentEditor.open('source.pdf') do |ed|
  ed.add_redaction(page: 0, rect: [100, 200, 300, 250])
  ed.apply_redactions!(scrub_metadata: true)
  ed.save_to('redacted.pdf')
end

Error Handling

PDF Oxide raises typed subclasses of PdfOxide::Error for PDF-specific failures.

begin
  PdfOxide::PdfDocument.open('document.pdf') do |doc|
    puts doc.extract_text(0)
  end
rescue PdfOxide::FileNotFoundError
  warn 'File not found'
rescue PdfOxide::EncryptedError
  warn 'Wrong or missing password'
rescue PdfOxide::ParseError => e
  warn "Malformed PDF: #{e.message}"
rescue PdfOxide::Error => e
  warn "PDF error: #{e.message}"
end
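
The order of the rescue clauses above matters: Ruby tries them top to bottom, so the base class must come last or it would swallow the specific errors. The sketch below demonstrates this with a stand-in hierarchy. DemoPdf and classify are hypothetical names defined only for this illustration; the real classes come from the gem.

```ruby
# Stand-in hierarchy mirroring the shape described above: specific
# errors inherit from one base error class.
module DemoPdf
  class Error < StandardError; end
  class ParseError < Error; end
  class EncryptedError < Error; end
end

# Runs the block and maps the outcome to a symbol. Specific classes
# are rescued before the base class.
def classify
  yield
  :ok
rescue DemoPdf::EncryptedError
  :needs_password
rescue DemoPdf::ParseError
  :malformed
rescue DemoPdf::Error
  :other_pdf_error
end
```

Errors outside the hierarchy, such as ArgumentError, are deliberately not rescued and propagate to the caller.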

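Next Steps

Getting Started with Python – Use PDF Oxide from Python
Getting Started with Rust – Use PDF Oxide from Rust
Text Extraction – Extraction options and recipes in detail
PDF Creation – Advanced creation, encryption, and metadata
Editing – Modifying existing PDFs, annotations, and form fields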