What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Ruby API Reference

PDF Oxide provides native Ruby bindings (the pdf_oxide gem) built over FFI on top of the cdylib C ABI. The gem ships a prebuilt native library and mirrors the same nine-class layout as the Java bindings under the PdfOxide namespace.

gem install pdf_oxide

require 'pdf_oxide'

For the Rust API, see the Rust API Reference. For the Python API, see the Python API Reference. For type details, see Types and Enums.

Every handle-holding object (PdfDocument, Pdf, DocumentEditor) owns native memory and must be closed. The most idiomatic Ruby approach is the block form, which closes the object automatically. #close is idempotent.
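When the non-block form is used, the handle has to be closed explicitly. The following is a minimal sketch of that pattern rather than anything shipped with the gem: use_and_close is a hypothetical helper, and it assumes only that the object responds to #close, as the handle-holding classes are documented to do.

```ruby
# Hypothetical helper (not part of pdf_oxide): yields an already-opened
# handle and closes it afterwards, even if the block raises.
def use_and_close(handle)
  yield handle
ensure
  # #close is documented as idempotent, so closing twice is harmless.
  handle&.close
end

# Usage with the gem installed (file name is an assumption):
#   doc = PdfOxide::PdfDocument.open('invoice.pdf')
#   pages = use_and_close(doc) { |d| d.page_count }
```

The block's value is returned to the caller, so the helper can wrap any short read operation.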

PdfOxide (module)

Top-level convenience entry points and process-wide toggles.

PdfOxide.open(source, password: nil) { |doc| ... } -> PdfDocument

Opens a PDF for reading. Delegates to PdfDocument.open. Accepts a file path or raw PDF bytes. The block form closes the document automatically.

PdfOxide.version -> String

Returns the library version string (for example, "0.3.69").

PdfOxide.set_max_ops_per_stream(limit) -> Integer

Sets the process-wide cap on content-stream operators. A negative limit restores the default (1,000,000); a value of 0 or greater becomes the explicit cap. Returns the previous cap.
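Because the setter returns the previous cap, a temporary override can be scoped to a block. This is a sketch only: with_max_ops is a hypothetical helper, and the api parameter exists so the pattern can be exercised without the native library.

```ruby
# Hypothetical helper (not part of pdf_oxide): applies a temporary operator
# cap for the duration of the block, then restores the previous one.
# `api` defaults to the PdfOxide module; any object exposing the same
# set_max_ops_per_stream method works.
def with_max_ops(limit, api = PdfOxide)
  previous = api.set_max_ops_per_stream(limit) # returns the prior cap
  yield
ensure
  # Restore only if the first call succeeded and gave us a value.
  api.set_max_ops_per_stream(previous) unless previous.nil?
end

# Usage with the gem installed (file name is an assumption):
#   with_max_ops(50_000) do
#     PdfOxide.open('untrusted.pdf') { |doc| doc.extract_text(0) }
#   end
```

The cap is process-wide, so this pattern is not safe when other threads parse PDFs at the same time.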

PdfOxide.set_preserve_unmapped_glyphs(preserve) -> Integer

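Toggles the process-wide flag that preserves U+FFFD (unmapped glyphs) during text extraction. A truthy or non-zero value preserves them; a falsy or 0 value filters them out (the default). Returns the previous value (0 or 1).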

PdfDocument

The primary read-only entry point to a PDF. Handles extraction, search, conversion, rendering, and page access.

doc = PdfOxide::PdfDocument.open('invoice.pdf')

Constructors and class methods

PdfOxide::PdfDocument.open(source, password: nil) { |doc| ... } -> PdfDocument

Opens a PDF from a filesystem path or from raw PDF bytes (auto-detected from the %PDF- magic in binary input). The block form closes the document automatically; the non-block form returns the document. Raises FileNotFoundError, ParseError, or EncryptedError.

PdfOxide::PdfDocument.new(source, password: nil) -> PdfDocument

Constructs the document directly, without a block. Prefer .open.

PdfOxide::PdfDocument.extract_text(source, page: 0) -> String

One-shot helper: opens the document, extracts the text of a single page, and closes it.

Document information

doc.page_count -> Integer

The number of pages in the document.

doc.pdf_version -> String

The PDF version string (for example, "1.7"), or "unknown" if it cannot be determined.

doc.encrypted? -> Boolean

Whether the PDF has an encryption dictionary.

doc.path -> String

The absolute path the document was opened from (<in-memory> for documents opened from bytes).

Authentication

doc.authenticate(password) -> Boolean

Authenticates against the document's encryption. Returns true on success, or if the document is not encrypted.

Text extraction

doc.extract_text(page_index) -> String

Extracts plain text from a single page, addressed by zero-based index (the result is empty for pages without a text layer).

doc.extract_structured(page) -> Hash

Extracts a structured representation of the page as a Hash. It contains page_index, page_width, page_height, and regions (each element has kind, text, bbox, spans, and column_index).

doc.extract_text_auto(page_index) -> String

Auto-routed extraction: uses native text where it exists, and uses OCR on scanned regions when the ocr feature is available. It has a graceful native fallback and never raises "OCR unavailable".

Conversion

doc.to_markdown(page_index = nil) -> String

Converts one page to Markdown. When page_index is nil, converts the whole document.

doc.to_html(page_index = nil) -> String

Converts one page to HTML. When page_index is nil, converts the whole document.

Search

doc.search(query, case_sensitive: false, regex: false) -> Array<Hash>

Searches the document. Each match is { page:, text:, bbox: { x:, y:, width:, height: } }. Pass regex: true to interpret query as a regular expression (raises UnsupportedFeatureError if the build does not include regex search).

Forms

doc.form_fields -> Array<Hash>

Returns AcroForm fields as Hashes of { name:, value:, type:, page: }. Returns [] if the build does not include the form-extraction accessors.

Rendering

doc.render(page_index, dpi: 150) -> String

Renders a single page to PNG bytes (BINARY) at the given DPI.

doc.render_with_layers(page_index, dpi: 150, format: 0,
                       background: [1.0, 1.0, 1.0, 1.0], transparent: false,
                       render_annotations: true, jpeg_quality: 90,
                       excluded_layers: []) -> String

Renders a page with the full set of RenderOptions plus Optional Content Group (OCG) layer filtering. format: 0 = PNG, 1 = JPEG. excluded_layers lists the /Name of each OCG to suppress. Returns the encoded image bytes (BINARY).

Page access

doc.page(index) -> PdfPage

A lightweight PdfPage view of the page at index.

doc.pages -> Array<PdfPage>

All pages in the document (eagerly evaluated).

Auto extraction

doc.auto_extractor -> AutoExtractor

An AutoExtractor configured for this document (memoized).

Lifecycle

doc.close -> nil

Releases the native handle. Idempotent.

doc.open? -> Boolean
doc.closed? -> Boolean

Whether the document is still open, or has been closed.

PdfPage

A lightweight per-page view borrowed from a PdfDocument. It holds no native handle of its own. Construct it via PdfDocument#page or #pages.

page = doc.page(0)

Attributes

page.parent -> PdfDocument
page.index -> Integer

The owning document and the zero-based page index.

Geometry

page.width -> Float
page.height -> Float

The page width and height in PDF user-space units.

page.media_box -> Hash
page.crop_box -> Hash

The media box and crop box, each as { x:, y:, width:, height: } (the crop box falls back to the media box).

page.rotation -> Integer

The page rotation in degrees.

Text

page.text -> String

Extracts this page's text (equivalent to parent.extract_text(index)).

page.to_s -> String
page.inspect -> String

A short inspection label (#<PdfOxide::PdfPage index=N>).

Pdf

Creates and saves PDFs. Handles generation from Markdown, HTML, text, and image sources, byte export, and bookmark-split planning.

pdf = PdfOxide::Pdf.from_markdown("# Title\n\nBody")

Factory methods

PdfOxide::Pdf.from_markdown(markdown) { |pdf| ... } -> Pdf

Generates a PDF from Markdown.

PdfOxide::Pdf.from_html(html) { |pdf| ... } -> Pdf

Generates a PDF from HTML (CSS is applied through the html_css pipeline).

PdfOxide::Pdf.from_text(text) { |pdf| ... } -> Pdf

Generates a PDF from plain text.

PdfOxide::Pdf.from_images(images) { |pdf| ... } -> Pdf

Generates a PDF from an array of JPEG/PNG byte blobs (the format is auto-detected from the magic bytes).

PdfOxide::Pdf.create_empty { |pdf| ... } -> Pdf

Creates a blank single-page PDF.

Static helpers

PdfOxide::Pdf.version -> String

The library version.

PdfOxide::Pdf.prefetch_models(languages) -> String

Prefetches OCR models for the given BCP-47/ISO language tags. Returns the cache directory path (empty on builds without OCR).

PdfOxide::Pdf.prefetch_available? -> Boolean

Whether the build supports OCR model provisioning.

PdfOxide::Pdf.plan_split_by_bookmarks_count(source_pdf, level) -> Integer

Counts the bookmark-split segments that splitting source_pdf (raw bytes) at level (1 = top level, 0 = all) would produce, without generating any output.

Instance methods

pdf.to_bytes -> String

Returns the PDF as BINARY-encoded bytes.

pdf.save(path) -> String

Writes the PDF to path. Returns the absolute path written.

pdf.close -> nil
pdf.closed? -> Boolean

Releases the native handle (idempotent), and reports whether it has been closed.

DocumentEditor

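The write-side editor. Handles destructive redaction, metadata scrubbing, form filling, and incremental saves. All redaction operations are fail-closed (any non-zero return value raises an exception).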

PdfOxide::DocumentEditor.open('source.pdf') do |ed|
  ed.add_redaction(page: 0, rect: [100, 200, 300, 250])
  ed.apply_redactions!
  ed.save_to('redacted.pdf')
end

Constructors

PdfOxide::DocumentEditor.open(source) { |ed| ... } -> DocumentEditor

Opens an editor on a PDF on disk or on in-memory bytes. The block form closes the editor automatically.

PdfOxide::DocumentEditor.new(source) -> DocumentEditor

Constructs the editor directly, without a block.

Redaction

ed.add_redaction(page:, rect:, color: [0.0, 0.0, 0.0]) -> self

Queues a redaction rectangle (rect = [x1, y1, x2, y2] in PDF user space, color = [r, g, b]). Nothing is applied until apply_redactions! is called.

ed.redaction_count(page) -> Integer

The total number of redactions queued for that page.

ed.apply_redactions!(scrub_metadata: false, fill_color: [0.0, 0.0, 0.0]) -> self

Destructively applies all queued redactions, optionally scrubbing /Info, XMP, and JS.
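A common workflow is to redact whatever a search finds. Search matches report a bbox as { x:, y:, width:, height: }, while add_redaction takes [x1, y1, x2, y2], so a conversion is needed. The sketch below assumes both use the same user-space coordinates with (x, y) as the minimum corner, which this reference does not state; bbox_to_rect and queue_redactions are hypothetical helpers.

```ruby
# Hypothetical helper (not part of pdf_oxide): converts a search-style bbox
# Hash into the [x1, y1, x2, y2] form that add_redaction expects.
# Assumes (x, y) is the minimum corner in the same coordinate space.
def bbox_to_rect(bbox)
  [bbox[:x], bbox[:y], bbox[:x] + bbox[:width], bbox[:y] + bbox[:height]]
end

# Hypothetical helper: queues one redaction per match and returns how many
# were queued. Nothing is applied until apply_redactions! is called.
def queue_redactions(editor, matches)
  matches.each do |m|
    editor.add_redaction(page: m[:page], rect: bbox_to_rect(m[:bbox]))
  end
  matches.size
end

# Usage with the gem installed, where `matches` came from doc.search:
#   PdfOxide::DocumentEditor.open('source.pdf') do |ed|
#     queue_redactions(ed, matches)
#     ed.apply_redactions!
#     ed.save_to('redacted.pdf')
#   end
```

It is worth rendering a redacted page and checking the result visually before relying on the coordinate assumption.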

ed.scrub_metadata -> self

Removes metadata without any redaction regions.

Forms

ed.set_form_field(name, value) -> self

Sets an AcroForm field by its dot-separated full name. A Boolean value targets a checkbox or radio button; any other value sets a text value.

Saving and lifecycle

ed.save_to(path) -> String

Saves the edited PDF. Returns the absolute path written.

ed.to_bytes -> String

Returns the edited PDF as BINARY-encoded bytes.

ed.close -> nil
ed.closed? -> Boolean

Releases the native handle (idempotent), and reports whether it has been closed.

AutoExtractor

Auto extraction with a typed reason (routes between native text and OCR, with a graceful native fallback). Construct it from a PdfDocument.

ax = PdfOxide::AutoExtractor.new(doc)
result = ax.extract_page(0)
warn "degraded: #{result[:reason]}" unless ax.ok?(result[:reason])

Constructor and attributes

PdfOxide::AutoExtractor.new(document) -> AutoExtractor

Wraps a PdfDocument for auto extraction.

ax.document -> PdfDocument

The wrapped document.

Classification

ax.classify_page(page_index) -> Hash

A low-cost per-page classifier (performs no OCR or rasterization). Returns { reason:, kind:, confidence:, classification: }.

ax.classify_document -> Hash

A whole-document classifier. Returns the decoded JSON envelope.

Extraction

ax.extract_text(page_index) -> Hash

Extracts a page's text through the auto router. Returns { text:, reason:, kind:, confidence:, classification: }.

ax.extract_page(page_index, options: nil) -> Hash

Rich per-page extraction. Returns the full PageExtraction envelope (text, per-region bbox, reason, and confidence) merged into a Hash.

Predicates

ax.ok?(reason) -> Boolean

Whether reason denotes a clean extraction.

ax.ocr_fallback?(reason) -> Boolean

Whether the graceful fallback path for unavailable OCR was taken.
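The reason and the predicates together make it possible to audit a whole document for pages that were not extracted cleanly. The helper below is a hypothetical sketch that relies only on the documented document, extract_text, and ok? methods.

```ruby
# Hypothetical helper (not part of pdf_oxide): runs auto extraction over
# every page and returns [page_index, reason] pairs for the pages whose
# reason is not reported as clean by ok?.
def degraded_pages(ax)
  ax.document.page_count.times.filter_map do |i|
    reason = ax.extract_text(i)[:reason]
    [i, reason] unless ax.ok?(reason)
  end
end

# Usage with the gem installed (file name is an assumption):
#   PdfOxide::PdfDocument.open('scan.pdf') do |doc|
#     degraded_pages(doc.auto_extractor).each do |index, reason|
#       warn "page #{index + 1}: #{reason}"
#     end
#   end
```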

PdfOxide::AutoExtractor.prefetch_available? -> Boolean

Whether the build supports OCR provisioning.

Constants

AutoExtractor::REASONS: a frozen array of typed reason symbols (:ok, :native_text_high_confidence, :no_text_layer_present, :ocr_requested_but_unavailable, and so on). AutoExtractor::PAGE_KINDS: the page-kind symbols (:text_layer, :scanned, :image_text, :mixed, :empty).

MarkdownConverter

A stateless module that converts a PdfDocument to Markdown or HTML.

PdfOxide::MarkdownConverter.to_markdown(doc, page_index = nil) -> String

Converts a page (or the whole document when page_index is nil) to Markdown.

PdfOxide::MarkdownConverter.to_html(doc, page_index = nil) -> String

Converts a page (or the whole document when page_index is nil) to HTML.

PdfPolicy

A process-wide cryptographic governance policy with set-once semantics. Call .set before any other PDF Oxide operation.

PdfOxide::PdfPolicy.current -> Symbol

The current process policy mode (:compat, :strict, or :fips_strict).

PdfOxide::PdfPolicy.set(mode) -> Symbol

Sets the process-wide policy mode. Raises an exception if the policy has already been set or the build does not support it.

PdfOxide::PdfPolicy.compat -> Symbol
PdfOxide::PdfPolicy.strict -> Symbol
PdfOxide::PdfPolicy.fips_strict -> Symbol

The preset mode symbols, in order: accept all algorithms, reject legacy algorithms, FIPS 140-3 only.

PdfPolicy::MODES: a frozen mapping from mode symbols to cdylib ordinals.

PdfSigner

A signer for PAdES B-B / B-T / B-LT / B-LTA digital signatures. Signing is a security operation, so every non-zero return value is handled fail-closed.

PdfOxide::PdfSigner.new(certificate_handle) -> PdfSigner

Constructs a signer from an opaque PKCS#12/PEM credential handle.

signer.sign(pdf, level:, tsa_url: nil, reason: nil, location: nil) -> String

Signs raw PDF bytes at the requested PAdES level (:b, :t, :lt, :lta). Levels >= :t require tsa_url. Returns the signed PDF as BINARY-encoded bytes.
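The tsa_url requirement can be checked up front, before any native call is made. The sketch below encodes only the rule stated above; check_sign_arguments! and both constants are hypothetical names, and the gem may perform its own validation.

```ruby
# PAdES level symbols listed above, in ascending order.
PADES_LEVELS = %i[b t lt lta].freeze
# Levels at or above :t, which the text above says require tsa_url.
LEVELS_REQUIRING_TSA = %i[t lt lta].freeze

# Hypothetical pre-flight check (not part of pdf_oxide): raises ArgumentError
# for an unknown level or a missing timestamp authority URL.
def check_sign_arguments!(level:, tsa_url: nil)
  unless PADES_LEVELS.include?(level)
    raise ArgumentError, "unknown PAdES level: #{level.inspect}"
  end
  if LEVELS_REQUIRING_TSA.include?(level) && (tsa_url.nil? || tsa_url.empty?)
    raise ArgumentError, "level #{level.inspect} requires tsa_url"
  end
  true
end

# Usage before signing (the URL is a placeholder):
#   check_sign_arguments!(level: :t, tsa_url: 'https://tsa.example/')
#   signed = signer.sign(pdf_bytes, level: :t, tsa_url: 'https://tsa.example/')
```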

PdfOxide::PdfSigner.sign(pdf:, certificate_handle:, level:, tsa_url: nil, reason: nil, location: nil) -> String

Static convenience method: signs without constructing a signer instance.

PdfOxide::PdfSigner.pades_level(signature_handle) -> Integer

The PAdES level ordinal of an existing signature handle.

PdfOxide::PdfSigner.document_has_timestamp?(document_handle) -> Boolean

Whether the document has a document-scoped /DocTimeStamp.

PdfSigner::LEVELS: a frozen mapping from level symbols to codes. PdfSigner::PadesSignOptions: a packed FFI::Struct mirroring the C PadesSignOptionsC layout.

PdfValidator

Stateless PDF/A and PDF/UA conformance validation.

PdfOxide::PdfValidator.pdf_a?(doc, level: :a1b) -> Boolean

Whether the document conforms to PDF/A at level (:a1b, :a1a, :a2b, :a2a, :a2u, :a3b, :a3a, :a3u).

PdfOxide::PdfValidator.pdf_ua?(doc, level: :ua1) -> Boolean

Whether the document conforms to PDF/UA at level (:ua1 or :ua2).

PdfOxide::PdfValidator.validate_pdf_a(doc, level: :a1b) -> Hash

A simplified PDF/A result: { compliant:, violations: }.

PdfOxide::PdfValidator.validate_pdf_ua(doc, level: :ua1) -> Hash

A simplified PDF/UA result: { compliant:, violations: }.

PdfValidator::PDF_A_LEVELS and PdfValidator::PDF_UA_LEVELS: frozen mappings from levels to ordinals.
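Since each check takes a single level, finding every level a document satisfies is a filter over the candidate list. The helper below is a hypothetical sketch; the validator parameter defaults to PdfOxide::PdfValidator and exists so the logic can be exercised without the native library.

```ruby
# Hypothetical helper (not part of pdf_oxide): returns the subset of `levels`
# for which the documented pdf_a?(doc, level:) predicate is true.
def conforming_pdf_a_levels(doc, levels, validator = PdfOxide::PdfValidator)
  levels.select { |level| validator.pdf_a?(doc, level: level) }
end

# Usage with the gem installed (file name is an assumption):
#   PdfOxide::PdfDocument.open('archive.pdf') do |doc|
#     puts conforming_pdf_a_levels(doc, %i[a1b a2b a3b]).inspect
#   end
```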

Error handling

Every PDF Oxide exception inherits from PdfOxide::Error. Native error codes map one-to-one to the subclasses below.

begin
  doc = PdfOxide::PdfDocument.open('file.pdf')
  text = doc.extract_text(0)
rescue PdfOxide::FileNotFoundError
  warn 'file not found'
rescue PdfOxide::ParseError => e
  warn "malformed PDF: #{e.message}"
rescue PdfOxide::Error => e
  warn "PDF error: #{e.message}"
ensure
  doc&.close
end

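Exception	Cause
`Error`	Base class for all PDF Oxide errors
`UnsupportedPlatformError`	The host platform is not supported by the bundled cdylib
`ArgumentError`	Argument validation failed before the native call
`IoError`	Filesystem or I/O failure
`FileNotFoundError`	The file does not exist (specializes `IoError`)
`ParseError`	Invalid header, corrupt xref, or extraction failure
`StateError`	Operations called in the wrong order
`InvalidStateError`	Operation on a handle that is already closed (specializes `StateError`)
`EncryptedError`	Encryption or wrong-password failure
`PermissionError`	Encrypted PDF that lacks extraction or signing permission
`UnsupportedFeatureError`	Feature not compiled into this cdylib build
`SignatureError`	PAdES signing or verification failure
`RedactionError`	Destructive redaction failure (fail-closed)
`ComplianceError`	PDF/A or PDF/UA validation failure
`SearchError`	Native text search failure
`InternalError`	Generic native-side failure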

Complete example

require 'pdf_oxide'

# --- Extraction ---
PdfOxide::PdfDocument.open('input.pdf') do |doc|
  puts "Pages: #{doc.page_count}"
  doc.page_count.times do |i|
    puts "Page #{i + 1}: #{doc.extract_text(i).length} characters"
  end

  # Search
  doc.search('configuration', case_sensitive: false).each do |m|
    puts "Page #{m[:page] + 1}: '#{m[:text]}' at (#{m[:bbox][:x]}, #{m[:bbox][:y]})"
  end

  # Render page 1 to PNG
  File.binwrite('page1.png', doc.render(0, dpi: 150))
end

# --- Creation ---
PdfOxide::Pdf.from_markdown("# Report\n\nGenerated by PDF Oxide.") do |pdf|
  pdf.save('report.pdf')
end

# --- Redaction ---
PdfOxide::DocumentEditor.open('source.pdf') do |ed|
  ed.add_redaction(page: 0, rect: [100, 200, 300, 250])
  ed.apply_redactions!(scrub_metadata: true)
  ed.save_to('redacted.pdf')
end

# --- Validation ---
PdfOxide::PdfDocument.open('archive.pdf') do |doc|
  puts "PDF/A-1b compliant: #{PdfOxide::PdfValidator.pdf_a?(doc, level: :a1b)}"
end

Bindings for other languages

PDF Oxide provides native bindings for every major ecosystem: Rust, Python, Node.js, WASM, C#, Golang, Java, PHP, C++, Swift, Kotlin, Dart, R, Julia, Zig, Scala, Clojure, Objective-C, Elixir.

Next steps

Types and Enums: all shared types and enums
Page API Reference: consistent per-page iteration across bindings
Getting Started with Ruby: tutorial