Ruby API Reference

PDF Oxide ships native Ruby bindings (gem pdf_oxide) built on FFI over the cdylib C ABI. The gem bundles a prebuilt native library and mirrors the 9-class shape of the Java binding under the PdfOxide namespace.

gem install pdf_oxide

require 'pdf_oxide'

For the Rust API, see the Rust API Reference. For the Python API, see the Python API Reference. For type details, see Types & Enums.

All handle-owning objects (PdfDocument, Pdf, DocumentEditor) own native memory and must be closed. The idiomatic pattern is the block form, which closes automatically; #close is idempotent.
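For objects built with .new rather than the block form, the same close guarantee can be had with an ensure clause. The sketch below is a minimal illustration; with_closing is a hypothetical helper name, not part of the gem, and it works on any object that responds to #close.

```ruby
# Hypothetical helper (not part of pdf_oxide): yield a handle-owning object
# and close it afterwards, even if the block raises.
def with_closing(handle)
  yield handle
ensure
  # Calling #close again is harmless because #close is documented as idempotent.
  handle.close
end

# Intended use with the gem (assumes pdf_oxide is installed and required):
#   with_closing(PdfOxide::PdfDocument.new('invoice.pdf')) do |doc|
#     puts doc.page_count
#   end
```

The method returns the block's value, since an ensure clause does not replace the result of the method body.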

PdfOxide (module)

Top-level convenience entry points and process-global toggles.

PdfOxide.open(source, password: nil) { |doc| ... } -> PdfDocument

Open a PDF for reading. Delegates to PdfDocument.open. Accepts a file path or raw PDF bytes; block form auto-closes.

PdfOxide.version -> String

Return the library version string (e.g. "0.3.69").

PdfOxide.set_max_ops_per_stream(limit) -> Integer

Set the process-global content-stream operator cap. A negative limit restores the default (1,000,000); any non-negative value becomes the explicit cap. Returns the previous cap.
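Because the setter returns the previous cap, a temporary override can be scoped to a block. This is a sketch only: with_max_ops_per_stream is a hypothetical helper, and the api: keyword exists purely so a stand-in object can be supplied in place of the PdfOxide module.

```ruby
# Hypothetical helper (not part of pdf_oxide): run a block under a temporary
# content-stream operator cap, then put the previous cap back.
# The default for `api` is only evaluated when the argument is omitted.
def with_max_ops_per_stream(limit, api: PdfOxide)
  previous = api.set_max_ops_per_stream(limit) # returns the cap it replaced
  begin
    yield
  ensure
    api.set_max_ops_per_stream(previous)
  end
end

# Intended use with the gem (assumes pdf_oxide is installed and required):
#   with_max_ops_per_stream(50_000) { doc.extract_text(0) }
```

The cap is process-global, so other threads see the temporary value while the block runs.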

PdfOxide.set_preserve_unmapped_glyphs(preserve) -> Integer

Toggle the process-global U+FFFD (unmapped-glyph) preservation flag used by text extraction. Truthy/non-zero preserves; falsey/0 filters (the default). Returns the previous value (0 or 1).

PdfDocument

The primary read-only entry point to a PDF: extraction, search, conversion, rendering, and page access.

doc = PdfOxide::PdfDocument.open('invoice.pdf')

Constructor & Class Methods

PdfOxide::PdfDocument.open(source, password: nil) { |doc| ... } -> PdfDocument

Open a PDF from a filesystem path or raw PDF bytes (auto-detected via the %PDF- magic on binary input). Block form auto-closes; non-block form returns the document. Raises FileNotFoundError, ParseError, or EncryptedError.

PdfOxide::PdfDocument.new(source, password: nil) -> PdfDocument

Construct directly without a block. Prefer .open.

PdfOxide::PdfDocument.extract_text(source, page: 0) -> String

One-shot helper: open, extract a single page’s text, and close.

Document Info

doc.page_count -> Integer

Number of pages in the document.

doc.pdf_version -> String

PDF version string (e.g. "1.7"), or "unknown" if unavailable.

doc.encrypted? -> Boolean

Whether the PDF carries an encryption dictionary.

doc.path -> String

Absolute path the document was opened from (or <in-memory> for byte-opened docs).

Authentication

doc.authenticate(password) -> Boolean

Authenticate against this document’s encryption. Returns true on success or for unencrypted docs.

Text Extraction

doc.extract_text(page_index) -> String

Extract plain text from a single 0-based page (empty for pages with no text layer).
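Whole-document text can be assembled from the per-page call. The helpers below are hypothetical (not part of the gem) and rely only on #page_count and #extract_text as documented here.

```ruby
# Hypothetical helpers (not part of pdf_oxide). `doc` is any object that
# responds to #page_count and #extract_text(page_index), such as a PdfDocument.

# Concatenate the text of every page, in page order.
def document_text(doc, separator: "\n\n")
  Array.new(doc.page_count) { |i| doc.extract_text(i) }.join(separator)
end

# 0-based indices of pages whose extracted text is blank. Such pages usually
# have no text layer, which makes them candidates for OCR.
def pages_without_text(doc)
  (0...doc.page_count).select { |i| doc.extract_text(i).strip.empty? }
end
```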

doc.extract_structured(page) -> Hash

Extract a structured representation of a page as a Hash with page_index, page_width, page_height, and regions (each with kind, text, bbox, spans, column_index).
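One way to consume the regions is to read the page column by column. This sketch assumes the Hash uses symbol keys and that column_index is an Integer (or nil for unassigned regions); neither detail is stated above, and text_by_column is a hypothetical helper.

```ruby
# Hypothetical helper (not part of pdf_oxide). Assumes symbol keys and an
# Integer (or nil) :column_index on each region.
# Returns one string per column, ordered left to right by column_index,
# keeping regions in their original order within each column.
def text_by_column(structured)
  structured[:regions]
    .group_by { |region| region[:column_index].to_i } # nil is treated as column 0
    .sort_by { |column, _regions| column }
    .map { |_column, regions| regions.map { |r| r[:text] }.join("\n") }
end
```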

doc.extract_text_auto(page_index) -> String

Auto-routed extraction: native text where present, OCR for scanned regions when the ocr feature is available, with graceful native fallback (never raises “OCR unavailable”).

Conversion

doc.to_markdown(page_index = nil) -> String

Convert one page to Markdown, or the whole document when page_index is nil.

doc.to_html(page_index = nil) -> String

Convert one page to HTML, or the whole document when page_index is nil.

Search

doc.search(query, case_sensitive: false, regex: false) -> Array<Hash>

Search the document. Each match is { page:, text:, bbox: { x:, y:, width:, height: } }. Pass regex: true to interpret query as a regex (raises UnsupportedFeatureError if the build lacks regex search).
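The match hashes are plain Ruby data, so they can be regrouped with Enumerable. A small sketch using the documented match shape; hits_by_page is a hypothetical helper name.

```ruby
# Hypothetical helper (not part of pdf_oxide). `matches` is an array of
# { page:, text:, bbox: { ... } } hashes as returned by doc.search.
# Returns { page_index => [matched text, ...] }.
def hits_by_page(matches)
  matches
    .group_by { |match| match[:page] }
    .transform_values { |hits| hits.map { |hit| hit[:text] } }
end

# Intended use with the gem (assumes pdf_oxide is installed and required):
#   hits_by_page(doc.search('configuration')).each do |page, texts|
#     puts "Page #{page + 1}: #{texts.length} hit(s)"
#   end
```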

Forms

doc.form_fields -> Array<Hash>

AcroForm fields as { name:, value:, type:, page: } hashes. Returns [] when the build lacks the form-extract accessor.
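The field hashes convert readily into a name-to-value lookup. The helpers below are hypothetical and assume an unfilled field reports nil or an empty string as its value, which is not stated above.

```ruby
# Hypothetical helpers (not part of pdf_oxide). `fields` is an array of
# { name:, value:, type:, page: } hashes as returned by doc.form_fields.

# Map each field name to its current value.
def form_values(fields)
  fields.to_h { |field| [field[:name], field[:value]] }
end

# Names of fields with no value. Assumes "no value" means nil or ''.
def unfilled_field_names(fields)
  fields.select { |field| field[:value].to_s.empty? }.map { |field| field[:name] }
end
```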

Rendering

doc.render(page_index, dpi: 150) -> String

Render a single page to PNG bytes (BINARY) at the supplied DPI.
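Since the return value is binary, it should be written with File.binwrite rather than File.write. The sketch below renders every page to a numbered file; render_all_pages and the page-NNN.png naming are illustrative choices, not part of the gem.

```ruby
# Hypothetical helper (not part of pdf_oxide). `doc` responds to #page_count
# and #render(page_index, dpi:). Writes one PNG per page into `out_dir`
# (which must already exist) and returns the paths written, in page order.
def render_all_pages(doc, out_dir, dpi: 150)
  Array.new(doc.page_count) do |i|
    path = File.join(out_dir, format('page-%03d.png', i + 1))
    File.binwrite(path, doc.render(i, dpi: dpi)) # binary-safe write
    path
  end
end
```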

doc.render_with_layers(page_index, dpi: 150, format: 0,
                       background: [1.0, 1.0, 1.0, 1.0], transparent: false,
                       render_annotations: true, jpeg_quality: 90,
                       excluded_layers: []) -> String

Render a page with the full RenderOptions surface plus Optional-Content-Group (OCG) layer filtering. format: 0 = PNG, 1 = JPEG; excluded_layers lists OCG /Names to suppress. Returns encoded image bytes (BINARY).

Page Access

doc.page(index) -> PdfPage

A lightweight PdfPage view of the page at index.

doc.pages -> Array<PdfPage>

Every page in the document (eager).

Auto-Extraction

doc.auto_extractor -> AutoExtractor

The configured AutoExtractor for this document (memoized).

Lifecycle

doc.close -> nil

Free the native handle. Idempotent.

doc.open? -> Boolean
doc.closed? -> Boolean

Whether the document is still open / has been closed.

PdfPage

A lightweight per-page view borrowed from a PdfDocument. Holds no native handle of its own. Construct via PdfDocument#page or #pages.

page = doc.page(0)

Attributes

page.parent -> PdfDocument
page.index -> Integer

The owning document and the 0-based page index.

Geometry

page.width -> Float
page.height -> Float

Page width and height in PDF user-space units.

page.media_box -> Hash
page.crop_box -> Hash

{ x:, y:, width:, height: } for the media box / crop box (crop box falls back to the media box).
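Box dimensions are in user-space units, which can be converted to physical sizes. The sketch below assumes the PDF default of 1/72 inch per unit; a page that sets /UserUnit would need a different factor, and box_size_mm is a hypothetical helper.

```ruby
# Hypothetical helper (not part of pdf_oxide). Assumes the default PDF user
# space, where one unit is 1/72 inch. `box` is a { x:, y:, width:, height: }
# hash as returned by page.media_box or page.crop_box.
def box_size_mm(box)
  mm_per_unit = 25.4 / 72.0
  { width: box[:width] * mm_per_unit, height: box[:height] * mm_per_unit }
end
```

Under that assumption, a 612 x 792 media box is US Letter, about 215.9 mm x 279.4 mm.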

page.rotation -> Integer

Page rotation in degrees.

Text

page.text -> String

Extract this page’s text (equivalent to parent.extract_text(index)).

page.to_s -> String
page.inspect -> String

Short inspection label (#<PdfOxide::PdfPage index=N>).

Pdf

Create and save PDFs: Markdown/HTML/text/image sources, byte export, and bookmark-split planning.

pdf = PdfOxide::Pdf.from_markdown("# Title\n\nBody")

Factory Methods

PdfOxide::Pdf.from_markdown(markdown) { |pdf| ... } -> Pdf

Build a PDF from Markdown.

PdfOxide::Pdf.from_html(html) { |pdf| ... } -> Pdf

Build a PDF from HTML (CSS honored via the html_css pipeline).

PdfOxide::Pdf.from_text(text) { |pdf| ... } -> Pdf

Build a PDF from plain text.

PdfOxide::Pdf.from_images(images) { |pdf| ... } -> Pdf

Build a PDF from an array of JPEG/PNG byte blobs (format auto-detected from magic bytes).

PdfOxide::Pdf.create_empty { |pdf| ... } -> Pdf

Create a blank single-page PDF.

Static Helpers

PdfOxide::Pdf.version -> String

The library version.

PdfOxide::Pdf.prefetch_models(languages) -> String

Prefetch OCR models for the given BCP-47/ISO language tags. Returns the cache directory path (empty on no-OCR builds).

PdfOxide::Pdf.prefetch_available? -> Boolean

Whether the build supports OCR model provisioning.

PdfOxide::Pdf.plan_split_by_bookmarks_count(source_pdf, level) -> Integer

Count the bookmark-split segments that would result from splitting source_pdf (raw bytes) at level (1 = top-level, 0 = all), without producing output.

Instance Methods

pdf.to_bytes -> String

The PDF as BINARY-encoded bytes.

pdf.save(path) -> String

Write the PDF to path. Returns the absolute path written.

pdf.close -> nil
pdf.closed? -> Boolean

Free the native handle (idempotent) / whether it has been closed.

DocumentEditor

Write-side editor: destructive redaction, metadata scrubbing, form-fill, and incremental save. Every redaction operation fails closed (non-zero return raises).

PdfOxide::DocumentEditor.open('source.pdf') do |ed|
  ed.add_redaction(page: 0, rect: [100, 200, 300, 250])
  ed.apply_redactions!
  ed.save_to('redacted.pdf')
end

Constructor

PdfOxide::DocumentEditor.open(source) { |ed| ... } -> DocumentEditor

Open an editor over a PDF on disk or in-memory bytes. Block form auto-closes.

PdfOxide::DocumentEditor.new(source) -> DocumentEditor

Construct directly without a block.

Redaction

ed.add_redaction(page:, rect:, color: [0.0, 0.0, 0.0]) -> self

Queue a redaction rectangle (rect = [x1, y1, x2, y2] in PDF user-space; color = [r, g, b]). Not applied until apply_redactions!.
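Search results report boxes as x/y/width/height, while add_redaction takes two corners, so a conversion is needed to redact search hits. This sketch assumes both use the same coordinate space and origin, which is not confirmed above; bbox_to_rect and queue_redactions are hypothetical helpers.

```ruby
# Hypothetical helpers (not part of pdf_oxide). Assumes search bboxes and
# redaction rects share one coordinate space.

# Convert { x:, y:, width:, height: } into [x1, y1, x2, y2], optionally
# growing the rectangle by `padding` units on every side.
def bbox_to_rect(bbox, padding: 0.0)
  [
    bbox[:x] - padding,
    bbox[:y] - padding,
    bbox[:x] + bbox[:width] + padding,
    bbox[:y] + bbox[:height] + padding
  ]
end

# Queue one redaction per search match and return how many were queued.
# `editor` responds to #add_redaction(page:, rect:).
def queue_redactions(editor, matches)
  matches.each do |match|
    editor.add_redaction(page: match[:page], rect: bbox_to_rect(match[:bbox]))
  end
  matches.length
end
```

Nothing is removed from the file until apply_redactions! is called on the editor.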

ed.redaction_count(page) -> Integer

Total redactions queued for the page.

ed.apply_redactions!(scrub_metadata: false, fill_color: [0.0, 0.0, 0.0]) -> self

Apply all queued redactions destructively. With scrub_metadata: true, also scrub /Info, XMP, and JavaScript.

ed.scrub_metadata -> self

Strip metadata without redaction regions.

Forms

ed.set_form_field(name, value) -> self

Set an AcroForm field by its dot-separated full name. A Boolean value targets a checkbox or radio button; any other value is set as text.
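Form data often arrives as a nested Hash, which can be flattened into dot-separated full names before filling. A minimal sketch; flatten_field_names and fill_form are hypothetical helpers, and the nested layout is only an example input.

```ruby
# Hypothetical helpers (not part of pdf_oxide).

# Flatten { applicant: { name: 'Ada' } } into { 'applicant.name' => 'Ada' }.
def flatten_field_names(values, prefix = nil)
  values.each_with_object({}) do |(key, value), flat|
    name = [prefix, key].compact.join('.')
    if value.is_a?(Hash)
      flat.merge!(flatten_field_names(value, name))
    else
      flat[name] = value
    end
  end
end

# Set every flattened field on `editor`, which responds to
# #set_form_field(name, value). Returns the editor.
def fill_form(editor, values)
  flatten_field_names(values).each { |name, value| editor.set_form_field(name, value) }
  editor
end
```

Boolean leaves are passed through unchanged, so they reach set_form_field as checkbox or radio values.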

Save & Lifecycle

ed.save_to(path) -> String

Save the edited PDF. Returns the absolute path written.

ed.to_bytes -> String

The edited PDF as BINARY-encoded bytes.

ed.close -> nil
ed.closed? -> Boolean

Free the native handle (idempotent) / whether it has been closed.

AutoExtractor

Auto-extraction with typed reasons (text-vs-OCR routing with graceful native fallback). Construct from a PdfDocument.

ax = PdfOxide::AutoExtractor.new(doc)
result = ax.extract_page(0)
warn "degraded: #{result[:reason]}" unless ax.ok?(result[:reason])

Constructor & Attribute

PdfOxide::AutoExtractor.new(document) -> AutoExtractor

Wrap a PdfDocument for auto-extraction.

ax.document -> PdfDocument

The wrapped document.

Classification

ax.classify_page(page_index) -> Hash

Cheap per-page classifier (no OCR/rasterization). Returns { reason:, kind:, confidence:, classification: }.

ax.classify_document -> Hash

Whole-document classifier; returns the decoded JSON envelope.

Extraction

ax.extract_text(page_index) -> Hash

Extract a page’s text via the auto-router; returns { text:, reason:, kind:, confidence:, classification: }.

ax.extract_page(page_index, options: nil) -> Hash

Rich per-page extraction returning the full PageExtraction envelope (text + per-region bbox + reason + confidence) merged into a Hash.

Predicates

ax.ok?(reason) -> Boolean

Whether reason represents a clean extract.

ax.ocr_fallback?(reason) -> Boolean

Whether the OCR-unavailable graceful-fallback path engaged.
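The typed reasons make it straightforward to audit a whole document for pages that did not extract cleanly. The sketch below uses only #document, #extract_text, and #ok? as documented here; degraded_pages is a hypothetical helper.

```ruby
# Hypothetical helper (not part of pdf_oxide). `ax` is an AutoExtractor or
# any object with the same #document, #extract_text, and #ok? methods.
# Returns { page_index => reason } for every page whose reason is not ok.
def degraded_pages(ax)
  (0...ax.document.page_count).each_with_object({}) do |index, degraded|
    reason = ax.extract_text(index)[:reason]
    degraded[index] = reason unless ax.ok?(reason)
  end
end
```

An empty Hash means every page reported a clean extract.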

PdfOxide::AutoExtractor.prefetch_available? -> Boolean

Whether the build supports OCR provisioning.

Constants

AutoExtractor::REASONS — frozen array of typed reason symbols (:ok, :native_text_high_confidence, :no_text_layer_present, :ocr_requested_but_unavailable, etc.). AutoExtractor::PAGE_KINDS — page-kind symbols (:text_layer, :scanned, :image_text, :mixed, :empty).

MarkdownConverter

Stateless module converting a PdfDocument to Markdown or HTML.

PdfOxide::MarkdownConverter.to_markdown(doc, page_index = nil) -> String

Convert a page (or the whole document when page_index is nil) to Markdown.

PdfOxide::MarkdownConverter.to_html(doc, page_index = nil) -> String

Convert a page (or the whole document) to HTML.

PdfPolicy

Process-global crypto-governance policy with set-once semantics. Call .set before any other PDF Oxide operation.

PdfOxide::PdfPolicy.current -> Symbol

The current process policy mode (:compat, :strict, or :fips_strict).

PdfOxide::PdfPolicy.set(mode) -> Symbol

Set the process-global policy mode. Raises if already set or unsupported by the build.

PdfOxide::PdfPolicy.compat -> Symbol
PdfOxide::PdfPolicy.strict -> Symbol
PdfOxide::PdfPolicy.fips_strict -> Symbol

Preset mode symbols: accept all algorithms / reject legacy algorithms / FIPS 140-3 only.

PdfPolicy::MODES — frozen mapping of mode symbols to cdylib ordinals.

PdfSigner

PAdES B-B / B-T / B-LT / B-LTA digital-signature signer. Signing is a security operation: every non-zero return fails closed.

PdfOxide::PdfSigner.new(certificate_handle) -> PdfSigner

Construct a signer from an opaque PKCS#12/PEM credentials handle.

signer.sign(pdf, level:, tsa_url: nil, reason: nil, location: nil) -> String

Sign raw PDF bytes at the requested PAdES level (:b, :t, :lt, :lta). A tsa_url is required for :t and every level above it (:lt, :lta). Returns BINARY-encoded signed PDF bytes.
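The level and tsa_url rule can be checked in Ruby before any native call is made. This is a sketch under stated assumptions: check_pades_args! is a hypothetical helper, the ordering :b, :t, :lt, :lta is taken from the order in which the levels are listed, and it raises Ruby's built-in ArgumentError rather than a PdfOxide error class.

```ruby
# Hypothetical pre-flight check (not part of pdf_oxide). Raises Ruby's
# built-in ArgumentError; returns true when the arguments are consistent.
def check_pades_args!(level:, tsa_url: nil)
  order = %i[b t lt lta] # assumed ascending order of PAdES levels
  rank = order.index(level)
  raise ArgumentError, "unknown PAdES level: #{level.inspect}" if rank.nil?

  needs_tsa = rank >= order.index(:t)
  if needs_tsa && tsa_url.to_s.empty?
    raise ArgumentError, "tsa_url is required for level #{level.inspect}"
  end

  true
end
```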

PdfOxide::PdfSigner.sign(pdf:, certificate_handle:, level:, tsa_url: nil, reason: nil, location: nil) -> String

Static convenience: sign without constructing a signer instance.

PdfOxide::PdfSigner.pades_level(signature_handle) -> Integer

The PAdES level ordinal of an existing signature handle.

PdfOxide::PdfSigner.document_has_timestamp?(document_handle) -> Boolean

Whether the document carries a document-scoped /DocTimeStamp.

PdfSigner::LEVELS — frozen mapping of level symbols to codes. PdfSigner::PadesSignOptions — packed FFI::Struct mirroring the C PadesSignOptionsC layout.

PdfValidator

Stateless PDF/A and PDF/UA compliance validation.

PdfOxide::PdfValidator.pdf_a?(doc, level: :a1b) -> Boolean

Whether the document is PDF/A compliant for level (:a1b, :a1a, :a2b, :a2a, :a2u, :a3b, :a3a, :a3u).

PdfOxide::PdfValidator.pdf_ua?(doc, level: :ua1) -> Boolean

Whether the document is PDF/UA compliant for level (:ua1 or :ua2).

PdfOxide::PdfValidator.validate_pdf_a(doc, level: :a1b) -> Hash

Simplified PDF/A result: { compliant:, violations: }.

PdfOxide::PdfValidator.validate_pdf_ua(doc, level: :ua1) -> Hash

Simplified PDF/UA result: { compliant:, violations: }.
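The simplified result Hash can be turned into a short report. The element type of violations is not specified above, so this sketch only calls to_s on each entry; compliance_summary is a hypothetical helper.

```ruby
# Hypothetical helper (not part of pdf_oxide). `result` is a
# { compliant:, violations: } hash from validate_pdf_a or validate_pdf_ua.
# Each violation is rendered with to_s because its type is not documented.
def compliance_summary(result, label: 'PDF/A')
  return "#{label}: compliant" if result[:compliant]

  lines = result[:violations].map { |violation| "  - #{violation}" }
  (["#{label}: #{lines.length} violation(s)"] + lines).join("\n")
end
```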

PdfValidator::PDF_A_LEVELS and PdfValidator::PDF_UA_LEVELS — frozen level-to-ordinal mappings.

Error Handling

All PDF Oxide exceptions derive from PdfOxide::Error. Native error codes map 1-to-1 to the subclasses below.

begin
  doc = PdfOxide::PdfDocument.open('file.pdf')
  text = doc.extract_text(0)
rescue PdfOxide::FileNotFoundError
  warn 'file not found'
rescue PdfOxide::ParseError => e
  warn "malformed PDF: #{e.message}"
rescue PdfOxide::Error => e
  warn "PDF error: #{e.message}"
ensure
  doc&.close
end

Exception	Cause
`Error`	Base class for all PDF Oxide errors
`UnsupportedPlatformError`	Host platform not supported by the bundled cdylib
`ArgumentError`	Argument failed validation before the native call
`IoError`	Filesystem / I/O failure
`FileNotFoundError`	Missing file (specialises `IoError`)
`ParseError`	Malformed header, corrupt xref, extraction failure
`StateError`	Wrong operation order
`InvalidStateError`	Operation on an already-closed handle (specialises `StateError`)
`EncryptedError`	Encryption / wrong-password failure
`PermissionError`	Encrypted PDF lacking extract/sign permission
`UnsupportedFeatureError`	Feature not compiled into this cdylib build
`SignatureError`	PAdES signing / verifying failure
`RedactionError`	Destructive-redaction failure (fails closed)
`ComplianceError`	PDF/A · PDF/UA validation failure
`SearchError`	Native text-search failure
`InternalError`	Generic native-side failure
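Because some classes specialise others, the order of rescue or when clauses matters: Ruby takes the first clause that matches, so a subclass such as FileNotFoundError has to be listed before IoError. A sketch of that ordering; failure_category and its category symbols are hypothetical, and the error classes are only looked up when the method is called.

```ruby
# Hypothetical helper (not part of pdf_oxide): map an exception to a coarse
# category. Clauses are tested top to bottom, so subclasses come first.
def failure_category(error)
  case error
  when PdfOxide::FileNotFoundError then :missing_file # must precede IoError
  when PdfOxide::IoError           then :io
  when PdfOxide::EncryptedError, PdfOxide::PermissionError then :access
  when PdfOxide::Error             then :pdf          # any other library error
  else :other
  end
end

# Intended use with the gem (assumes pdf_oxide is installed and required):
#   begin
#     PdfOxide::PdfDocument.open('file.pdf') { |doc| puts doc.page_count }
#   rescue StandardError => e
#     warn "failed (#{failure_category(e)}): #{e.message}"
#   end
```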

Complete Example

require 'pdf_oxide'

# --- Extraction ---
PdfOxide::PdfDocument.open('input.pdf') do |doc|
  puts "Pages: #{doc.page_count}"
  doc.page_count.times do |i|
    puts "Page #{i + 1}: #{doc.extract_text(i).length} characters"
  end

  # Search
  doc.search('configuration', case_sensitive: false).each do |m|
    puts "Page #{m[:page] + 1}: '#{m[:text]}' at (#{m[:bbox][:x]}, #{m[:bbox][:y]})"
  end

  # Render page 1 to PNG
  File.binwrite('page1.png', doc.render(0, dpi: 150))
end

# --- Creation ---
PdfOxide::Pdf.from_markdown("# Report\n\nGenerated by PDF Oxide.") do |pdf|
  pdf.save('report.pdf')
end

# --- Redaction ---
PdfOxide::DocumentEditor.open('source.pdf') do |ed|
  ed.add_redaction(page: 0, rect: [100, 200, 300, 250])
  ed.apply_redactions!(scrub_metadata: true)
  ed.save_to('redacted.pdf')
end

# --- Validation ---
PdfOxide::PdfDocument.open('archive.pdf') do |doc|
  puts "PDF/A-1b compliant: #{PdfOxide::PdfValidator.pdf_a?(doc, level: :a1b)}"
end

Other Language Bindings

PDF Oxide ships native bindings for Rust, Python, Node.js, WASM, C#, Go, Java, PHP, C++, Swift, Kotlin, Dart, R, Julia, Zig, Scala, Clojure, Objective-C, and Elixir.

Next Steps

Types & Enums — all shared types and enums
Page API Reference — consistent per-page iteration across bindings
Getting Started with Ruby — tutorial