Ruby API Reference

PDF Oxide ships native Ruby bindings (gem pdf_oxide) built on FFI over the cdylib C ABI. The gem bundles a prebuilt native library and mirrors the 9-class shape of the Java binding under the PdfOxide namespace.

gem install pdf_oxide
require 'pdf_oxide'

For the Rust API, see the Rust API Reference. For the Python API, see the Python API Reference. For type details, see Types & Enums.

Handle-owning objects (PdfDocument, Pdf, DocumentEditor) hold native memory and must be closed. The idiomatic pattern is the block form, which closes the handle automatically; #close is idempotent, so calling it more than once is harmless.
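
When the non-block form is unavoidable, an ensure clause can give a similar guarantee. The following is a minimal sketch that relies only on the #close and #closed? methods documented on this page; with_handle is a hypothetical helper, not part of the gem.

```ruby
# Hypothetical helper (not part of pdf_oxide): yields any handle-owning
# object and closes it afterwards, even if the block raises.
# Assumes the handle responds to #close and #closed? as documented.
def with_handle(handle)
  yield handle
ensure
  handle.close if handle && !handle.closed?
end
```

A call might look like with_handle(PdfOxide::PdfDocument.new('invoice.pdf')) { |doc| doc.page_count }, which returns the block's value after the handle has been closed.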


PdfOxide (module)

Top-level convenience entry points and process-global toggles.

PdfOxide.open(source, password: nil) { |doc| ... } -> PdfDocument

Open a PDF for reading. Delegates to PdfDocument.open. Accepts a file path or raw PDF bytes; block form auto-closes.

PdfOxide.version -> String

Return the library version string (e.g. "0.3.69").

PdfOxide.set_max_ops_per_stream(limit) -> Integer

Set the process-global content-stream operator cap. A negative limit restores the default (1,000,000); any non-negative value becomes the explicit cap. Returns the previous cap.
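
Because the setter returns the previous cap, a temporary override can be undone by passing that value back. A minimal sketch, assuming the behaviour described above; with_max_ops is a hypothetical helper and api stands for any object exposing set_max_ops_per_stream (normally the PdfOxide module).

```ruby
# Hypothetical helper: change the operator cap for the duration of a block,
# then put the previous cap back. The cap is process-global, so this does
# not isolate concurrent threads from one another.
def with_max_ops(api, limit)
  previous = api.set_max_ops_per_stream(limit)
  yield
ensure
  # `previous` is nil only if the setter itself raised.
  api.set_max_ops_per_stream(previous) if previous
end
```

A call such as with_max_ops(PdfOxide, 50_000) { ... } would run the block under the tighter cap and restore the earlier value afterwards.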

PdfOxide.set_preserve_unmapped_glyphs(preserve) -> Integer

Toggle the process-global U+FFFD (unmapped-glyph) preservation flag used by text extraction. Truthy/non-zero preserves; falsey/0 filters (the default). Returns the previous value (0 or 1).
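
With preservation enabled, the number of U+FFFD characters in extracted text gives a rough signal of how many glyphs had no Unicode mapping. A small sketch in plain Ruby; unmapped_glyph_count is a hypothetical helper and assumes the text is a UTF-8 string.

```ruby
# Hypothetical helper: counts U+FFFD replacement characters in text that
# has already been extracted. Assumes a UTF-8 encoded String.
def unmapped_glyph_count(text)
  text.count("\uFFFD")
end
```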


PdfDocument

The primary read-only entry point to a PDF: extraction, search, conversion, rendering, and page access.

doc = PdfOxide::PdfDocument.open('invoice.pdf')

Constructor & Class Methods

PdfOxide::PdfDocument.open(source, password: nil) { |doc| ... } -> PdfDocument

Open a PDF from a filesystem path or raw PDF bytes (auto-detected via the %PDF- magic on binary input). Block form auto-closes; non-block form returns the document. Raises FileNotFoundError, ParseError, or EncryptedError.
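
The magic-based detection mentioned above can be illustrated in plain Ruby. This is only a sketch of the idea, not the gem's actual logic; looks_like_pdf_bytes? and PDF_MAGIC are hypothetical names.

```ruby
# Illustrative only: one way a caller could tell raw PDF bytes from a path.
# The gem's real detection code is not shown in this reference.
PDF_MAGIC = '%PDF-'.b.freeze

def looks_like_pdf_bytes?(source)
  # Compare as binary so the check does not depend on the string's encoding.
  source.is_a?(String) && source.b.start_with?(PDF_MAGIC)
end
```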

PdfOxide::PdfDocument.new(source, password: nil) -> PdfDocument

Construct directly without a block. Prefer .open.

PdfOxide::PdfDocument.extract_text(source, page: 0) -> String

One-shot helper: open, extract a single page’s text, and close.

Document Info

doc.page_count -> Integer

Number of pages in the document.

doc.pdf_version -> String

PDF version string (e.g. "1.7"), or "unknown" if unavailable.

doc.encrypted? -> Boolean

Whether the PDF carries an encryption dictionary.

doc.path -> String

Absolute path the document was opened from (or <in-memory> for byte-opened docs).

Authentication

doc.authenticate(password) -> Boolean

Authenticate against this document’s encryption. Returns true on success or for unencrypted docs.

Text Extraction

doc.extract_text(page_index) -> String

Extract plain text from a single 0-based page (empty for pages with no text layer).
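
Whole-document text can be assembled by combining page_count with per-page extraction. A minimal sketch, assuming only the two methods documented above; full_text is a hypothetical helper and the form-feed separator is an arbitrary choice.

```ruby
# Hypothetical helper: extracts every page and joins the results with a
# form feed so page boundaries remain visible in the combined string.
# `doc` is any object responding to #page_count and #extract_text(index).
def full_text(doc)
  (0...doc.page_count).map { |i| doc.extract_text(i) }.join("\f")
end
```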

doc.extract_structured(page) -> Hash

Extract a structured representation of a page as a Hash with page_index, page_width, page_height, and regions (each with kind, text, bbox, spans, column_index).

doc.extract_text_auto(page_index) -> String

Auto-routed extraction: native text where present, OCR for scanned regions when the ocr feature is available, with graceful native fallback (never raises “OCR unavailable”).

Conversion

doc.to_markdown(page_index = nil) -> String

Convert one page to Markdown, or the whole document when page_index is nil.

doc.to_html(page_index = nil) -> String

Convert one page to HTML, or the whole document when page_index is nil.

Search

doc.search(query, case_sensitive: false, regex: false) -> Array<Hash>

Search the document. Each match is { page:, text:, bbox: { x:, y:, width:, height: } }. Pass regex: true to interpret query as a regex (raises UnsupportedFeatureError if the build lacks regex search).
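
Since every match carries its page index, results are easy to summarise per page. A minimal sketch, assuming the symbol-keyed match hashes described above; match_counts_by_page is a hypothetical helper.

```ruby
# Hypothetical helper: returns { page_index => number_of_matches }.
# `doc` is any object whose #search returns hashes with a :page key.
def match_counts_by_page(doc, query)
  doc.search(query, case_sensitive: false)
     .group_by { |match| match[:page] }
     .transform_values(&:size)
end
```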

Forms

doc.form_fields -> Array<Hash>

AcroForm fields as { name:, value:, type:, page: } hashes. Returns [] when the build lacks the form-extract accessor.

Rendering

doc.render(page_index, dpi: 150) -> String

Render a single page to PNG bytes (BINARY) at the supplied DPI.

doc.render_with_layers(page_index, dpi: 150, format: 0,
                       background: [1.0, 1.0, 1.0, 1.0], transparent: false,
                       render_annotations: true, jpeg_quality: 90,
                       excluded_layers: []) -> String

Render a page with the full RenderOptions surface plus Optional-Content-Group (OCG) layer filtering. format: 0 = PNG, 1 = JPEG; excluded_layers lists OCG /Names to suppress. Returns encoded image bytes (BINARY).

Page Access

doc.page(index) -> PdfPage

A lightweight PdfPage view of the page at index.

doc.pages -> Array<PdfPage>

Every page in the document (eager).

Auto-Extraction

doc.auto_extractor -> AutoExtractor

The configured AutoExtractor for this document (memoized).

Lifecycle

doc.close -> nil

Free the native handle. Idempotent.

doc.open? -> Boolean
doc.closed? -> Boolean

Whether the document is still open / has been closed.


PdfPage

A lightweight per-page view borrowed from a PdfDocument. Holds no native handle of its own. Construct via PdfDocument#page or #pages.

page = doc.page(0)

Attributes

page.parent -> PdfDocument
page.index -> Integer

The owning document and the 0-based page index.

Geometry

page.width -> Float
page.height -> Float

Page width and height in PDF user-space units.
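
The default PDF user-space unit is 1/72 inch, so an approximate raster size at a given DPI can be computed from the page dimensions. A sketch under that assumption (no /UserUnit scaling, no rotation); pixel_size is a hypothetical helper, and the renderer's own rounding may differ.

```ruby
# Default PDF user space: 72 units per inch.
POINTS_PER_INCH = 72.0

# Hypothetical helper: estimates the pixel dimensions of a page rendered
# at `dpi`, rounding each side to the nearest pixel.
def pixel_size(width_pt, height_pt, dpi: 150)
  scale = dpi / POINTS_PER_INCH
  [(width_pt * scale).round, (height_pt * scale).round]
end
```

For a US Letter page (612 x 792 units) at 150 DPI this yields [1275, 1650].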

page.media_box -> Hash
page.crop_box -> Hash

{ x:, y:, width:, height: } for the media box / crop box (crop box falls back to the media box).

page.rotation -> Integer

Page rotation in degrees.

Text

page.text -> String

Extract this page’s text (equivalent to parent.extract_text(index)).

page.to_s -> String
page.inspect -> String

Short inspection label (#<PdfOxide::PdfPage index=N>).


Pdf

Create and save PDFs: Markdown/HTML/text/image sources, byte export, and bookmark-split planning.

pdf = PdfOxide::Pdf.from_markdown("# Title\n\nBody")

Factory Methods

PdfOxide::Pdf.from_markdown(markdown) { |pdf| ... } -> Pdf

Build a PDF from Markdown.

PdfOxide::Pdf.from_html(html) { |pdf| ... } -> Pdf

Build a PDF from HTML (CSS honored via the html_css pipeline).

PdfOxide::Pdf.from_text(text) { |pdf| ... } -> Pdf

Build a PDF from plain text.

PdfOxide::Pdf.from_images(images) { |pdf| ... } -> Pdf

Build a PDF from an array of JPEG/PNG byte blobs (format auto-detected from magic bytes).
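
Magic-byte detection of this kind can be sketched in plain Ruby using the standard PNG and JPEG signatures. This illustrates the idea only and is not the gem's code; image_kind and both constants are hypothetical names.

```ruby
# Standard file signatures: PNG starts with 89 50 4E 47 0D 0A 1A 0A,
# JPEG starts with FF D8 FF.
PNG_MAGIC  = "\x89PNG\r\n\x1A\n".b.freeze
JPEG_MAGIC = "\xFF\xD8\xFF".b.freeze

# Hypothetical helper: returns :png, :jpeg, or nil for anything else.
def image_kind(bytes)
  data = bytes.b
  return :png  if data.start_with?(PNG_MAGIC)
  return :jpeg if data.start_with?(JPEG_MAGIC)

  nil
end
```

A caller could use this to reject unsupported blobs before handing the array to from_images.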

PdfOxide::Pdf.create_empty { |pdf| ... } -> Pdf

Create a blank single-page PDF.

Static Helpers

PdfOxide::Pdf.version -> String

The library version.

PdfOxide::Pdf.prefetch_models(languages) -> String

Prefetch OCR models for the given BCP-47/ISO language tags. Returns the cache directory path (empty on no-OCR builds).

PdfOxide::Pdf.prefetch_available? -> Boolean

Whether the build supports OCR model provisioning.

PdfOxide::Pdf.plan_split_by_bookmarks_count(source_pdf, level) -> Integer

Count the bookmark-split segments that would result from splitting source_pdf (raw bytes) at level (1 = top-level, 0 = all), without producing output.

Instance Methods

pdf.to_bytes -> String

The PDF as BINARY-encoded bytes.

pdf.save(path) -> String

Write the PDF to path. Returns the absolute path written.

pdf.close -> nil
pdf.closed? -> Boolean

Free the native handle (idempotent) / whether it has been closed.


DocumentEditor

Write-side editor: destructive redaction, metadata scrubbing, form-fill, and incremental save. Every redaction operation fails closed (non-zero return raises).

PdfOxide::DocumentEditor.open('source.pdf') do |ed|
  ed.add_redaction(page: 0, rect: [100, 200, 300, 250])
  ed.apply_redactions!
  ed.save_to('redacted.pdf')
end

Constructor

PdfOxide::DocumentEditor.open(source) { |ed| ... } -> DocumentEditor

Open an editor over a PDF on disk or in-memory bytes. Block form auto-closes.

PdfOxide::DocumentEditor.new(source) -> DocumentEditor

Construct directly without a block.

Redaction

ed.add_redaction(page:, rect:, color: [0.0, 0.0, 0.0]) -> self

Queue a redaction rectangle (rect = [x1, y1, x2, y2] in PDF user-space; color = [r, g, b]). Not applied until apply_redactions!.
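
Search results and page boxes use { x:, y:, width:, height: }, while rect expects two corners. A minimal conversion sketch, assuming that x and y name the corner with the smallest coordinates; bbox_to_rect and its padding option are hypothetical.

```ruby
# Hypothetical helper: converts an { x:, y:, width:, height: } box into
# [x1, y1, x2, y2], optionally growing it by `padding` on every side.
# Assumes (x, y) is the corner with the smallest coordinates.
def bbox_to_rect(bbox, padding: 0.0)
  x1 = bbox[:x] - padding
  y1 = bbox[:y] - padding
  x2 = bbox[:x] + bbox[:width] + padding
  y2 = bbox[:y] + bbox[:height] + padding
  [x1, y1, x2, y2]
end
```

Each search match could then be queued with ed.add_redaction(page: m[:page], rect: bbox_to_rect(m[:bbox])).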

ed.redaction_count(page) -> Integer

Total redactions queued for the page.

ed.apply_redactions!(scrub_metadata: false, fill_color: [0.0, 0.0, 0.0]) -> self

Apply all queued redactions destructively, optionally scrubbing /Info, XMP, and JS.

ed.scrub_metadata -> self

Strip metadata without redaction regions.

Forms

ed.set_form_field(name, value) -> self

Set an AcroForm field by dot-separated full name. A Boolean value targets a checkbox/radio; otherwise sets a text value.
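
Several fields can be filled from a single Hash by calling set_form_field once per entry. A minimal sketch, assuming the signature documented above; fill_form is a hypothetical helper and the field names in the usage note are made up.

```ruby
# Hypothetical helper: applies every name/value pair to the editor and
# returns the editor. Boolean values are passed through unchanged so they
# target checkbox/radio fields as described above.
def fill_form(editor, values)
  values.each { |name, value| editor.set_form_field(name.to_s, value) }
  editor
end
```

For example, fill_form(ed, 'applicant.name' => 'Ada', 'applicant.subscribe' => true) would set one text field and one checkbox.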

Save & Lifecycle

ed.save_to(path) -> String

Save the edited PDF. Returns the absolute path written.

ed.to_bytes -> String

The edited PDF as BINARY-encoded bytes.

ed.close -> nil
ed.closed? -> Boolean

Free the native handle (idempotent) / whether it has been closed.


AutoExtractor

Auto-extraction with typed reasons (text-vs-OCR routing with graceful native fallback). Construct from a PdfDocument.

ax = PdfOxide::AutoExtractor.new(doc)
result = ax.extract_page(0)
warn "degraded: #{result[:reason]}" unless ax.ok?(result[:reason])

Constructor & Attribute

PdfOxide::AutoExtractor.new(document) -> AutoExtractor

Wrap a PdfDocument for auto-extraction.

ax.document -> PdfDocument

The wrapped document.

Classification

ax.classify_page(page_index) -> Hash

Cheap per-page classifier (no OCR/rasterization). Returns { reason:, kind:, confidence:, classification: }.

ax.classify_document -> Hash

Whole-document classifier; returns the decoded JSON envelope.

Extraction

ax.extract_text(page_index) -> Hash

Extract a page’s text via the auto-router; returns { text:, reason:, kind:, confidence:, classification: }.

ax.extract_page(page_index, options: nil) -> Hash

Rich per-page extraction returning the full PageExtraction envelope (text + per-region bbox + reason + confidence) merged into a Hash.

Predicates

ax.ok?(reason) -> Boolean

Whether reason represents a clean extract.

ax.ocr_fallback?(reason) -> Boolean

Whether the OCR-unavailable graceful-fallback path engaged.

PdfOxide::AutoExtractor.prefetch_available? -> Boolean

Whether the build supports OCR provisioning.
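
The typed reasons make it straightforward to flag pages whose extraction was not clean. A minimal sketch, assuming only extract_text and ok? as documented above; degraded_pages is a hypothetical helper.

```ruby
# Hypothetical helper: returns { page_index => reason } for every page
# whose reason is not accepted by ax.ok?.
def degraded_pages(ax, page_count)
  (0...page_count).each_with_object({}) do |index, out|
    reason = ax.extract_text(index)[:reason]
    out[index] = reason unless ax.ok?(reason)
  end
end
```

A call might look like degraded_pages(doc.auto_extractor, doc.page_count).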

Constants

AutoExtractor::REASONS — frozen array of typed reason symbols (:ok, :native_text_high_confidence, :no_text_layer_present, :ocr_requested_but_unavailable, etc.). AutoExtractor::PAGE_KINDS — page-kind symbols (:text_layer, :scanned, :image_text, :mixed, :empty).


MarkdownConverter

Stateless module converting a PdfDocument to Markdown or HTML.

PdfOxide::MarkdownConverter.to_markdown(doc, page_index = nil) -> String

Convert a page (or the whole document when page_index is nil) to Markdown.

PdfOxide::MarkdownConverter.to_html(doc, page_index = nil) -> String

Convert a page (or the whole document) to HTML.


PdfPolicy

Process-global crypto-governance policy with set-once semantics. Call .set before any other PDF Oxide operation.

PdfOxide::PdfPolicy.current -> Symbol

The current process policy mode (:compat, :strict, or :fips_strict).

PdfOxide::PdfPolicy.set(mode) -> Symbol

Set the process-global policy mode. Raises if already set or unsupported by the build.

PdfOxide::PdfPolicy.compat -> Symbol
PdfOxide::PdfPolicy.strict -> Symbol
PdfOxide::PdfPolicy.fips_strict -> Symbol

Preset mode symbols: accept all algorithms / reject legacy algorithms / FIPS 140-3 only.

PdfPolicy::MODES — frozen mapping of mode symbols to cdylib ordinals.
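
Set-once semantics mean a second call to .set raises, which can be awkward when several initialisers request the same mode. One possible wrapper is sketched below; apply_policy_once is a hypothetical helper, and it assumes the gem's errors descend from StandardError.

```ruby
# Hypothetical helper: sets the policy, tolerating a repeated call only
# when the active mode already equals the requested one.
# `policy` is any object exposing .set and .current as documented.
def apply_policy_once(policy, mode)
  policy.set(mode)
rescue StandardError
  # Re-raise the original error unless the requested mode is already active.
  raise unless policy.current == mode

  mode
end
```

With this wrapper, apply_policy_once(PdfOxide::PdfPolicy, :strict) could be called from more than one initialiser, while a conflicting request would still raise.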


PdfSigner

PAdES B-B / B-T / B-LT / B-LTA digital-signature signer. Signing is a security operation: every non-zero return fails closed.

PdfOxide::PdfSigner.new(certificate_handle) -> PdfSigner

Construct a signer from an opaque PKCS#12/PEM credentials handle.

signer.sign(pdf, level:, tsa_url: nil, reason: nil, location: nil) -> String

Sign raw PDF bytes at the requested PAdES level (:b, :t, :lt, :lta). A tsa_url is required for levels >= :t. Returns BINARY-encoded signed PDF bytes.
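
The tsa_url requirement can be checked before any native call is made. A minimal sketch, assuming the level ordering :b < :t < :lt < :lta implied above; PADES_ORDER, tsa_required?, and check_sign_args! are hypothetical names, not part of the gem.

```ruby
# Assumed ordering of PAdES levels, lowest first.
PADES_ORDER = %i[b t lt lta].freeze

# Hypothetical helper: true when the level is :t or higher.
def tsa_required?(level)
  index = PADES_ORDER.index(level)
  raise ArgumentError, "unknown PAdES level: #{level.inspect}" if index.nil?

  index >= PADES_ORDER.index(:t)
end

# Hypothetical helper: raises early when a timestamped level lacks a TSA URL.
def check_sign_args!(level:, tsa_url: nil)
  if tsa_required?(level) && (tsa_url.nil? || tsa_url.empty?)
    raise ArgumentError, "tsa_url is required for level #{level.inspect}"
  end

  true
end
```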

PdfOxide::PdfSigner.sign(pdf:, certificate_handle:, level:, tsa_url: nil, reason: nil, location: nil) -> String

Static convenience: sign without constructing a signer instance.

PdfOxide::PdfSigner.pades_level(signature_handle) -> Integer

The PAdES level ordinal of an existing signature handle.

PdfOxide::PdfSigner.document_has_timestamp?(document_handle) -> Boolean

Whether the document carries a document-scoped /DocTimeStamp.

PdfSigner::LEVELS — frozen mapping of level symbols to codes. PdfSigner::PadesSignOptions — packed FFI::Struct mirroring the C PadesSignOptionsC layout.


PdfValidator

Stateless PDF/A and PDF/UA compliance validation.

PdfOxide::PdfValidator.pdf_a?(doc, level: :a1b) -> Boolean

Whether the document is PDF/A compliant for level (:a1b, :a1a, :a2b, :a2a, :a2u, :a3b, :a3a, :a3u).

PdfOxide::PdfValidator.pdf_ua?(doc, level: :ua1) -> Boolean

Whether the document is PDF/UA compliant for level (:ua1 or :ua2).

PdfOxide::PdfValidator.validate_pdf_a(doc, level: :a1b) -> Hash

Simplified PDF/A result: { compliant:, violations: }.

PdfOxide::PdfValidator.validate_pdf_ua(doc, level: :ua1) -> Hash

Simplified PDF/UA result: { compliant:, violations: }.
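
The simplified result hashes can be turned into a one-line report. A minimal sketch, assuming violations is an Array (or nil); compliance_summary is a hypothetical helper.

```ruby
# Hypothetical helper: formats a { compliant:, violations: } result.
# Array() tolerates a nil :violations value.
def compliance_summary(result, label)
  return "#{label}: compliant" if result[:compliant]

  violations = Array(result[:violations])
  "#{label}: #{violations.size} violation(s)"
end
```

For example, compliance_summary(PdfOxide::PdfValidator.validate_pdf_a(doc, level: :a2b), 'PDF/A-2b') would produce a short status line for logging.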

PdfValidator::PDF_A_LEVELS and PdfValidator::PDF_UA_LEVELS — frozen level-to-ordinal mappings.


Error Handling

All PDF Oxide exceptions derive from PdfOxide::Error. Native error codes map 1-to-1 to the subclasses below.

begin
  doc = PdfOxide::PdfDocument.open('file.pdf')
  text = doc.extract_text(0)
rescue PdfOxide::FileNotFoundError
  warn 'file not found'
rescue PdfOxide::ParseError => e
  warn "malformed PDF: #{e.message}"
rescue PdfOxide::Error => e
  warn "PDF error: #{e.message}"
ensure
  doc&.close
end

Exception                 Cause
Error                     Base class for all PDF Oxide errors
UnsupportedPlatformError  Host platform not supported by the bundled cdylib
ArgumentError             Argument failed validation before the native call
IoError                   Filesystem / I/O failure
FileNotFoundError         Missing file (specialises IoError)
ParseError                Malformed header, corrupt xref, extraction failure
StateError                Wrong operation order
InvalidStateError         Operation on an already-closed handle (specialises StateError)
EncryptedError            Encryption / wrong-password failure
PermissionError           Encrypted PDF lacking extract/sign permission
UnsupportedFeatureError   Feature not compiled into this cdylib build
SignatureError            PAdES signing / verifying failure
RedactionError            Destructive-redaction failure (fails closed)
ComplianceError           PDF/A · PDF/UA validation failure
SearchError               Native text-search failure
InternalError             Generic native-side failure

Complete Example

require 'pdf_oxide'

# --- Extraction ---
PdfOxide::PdfDocument.open('input.pdf') do |doc|
  puts "Pages: #{doc.page_count}"
  doc.page_count.times do |i|
    puts "Page #{i + 1}: #{doc.extract_text(i).length} characters"
  end

  # Search
  doc.search('configuration', case_sensitive: false).each do |m|
    puts "Page #{m[:page] + 1}: '#{m[:text]}' at (#{m[:bbox][:x]}, #{m[:bbox][:y]})"
  end

  # Render page 1 to PNG
  File.binwrite('page1.png', doc.render(0, dpi: 150))
end

# --- Creation ---
PdfOxide::Pdf.from_markdown("# Report\n\nGenerated by PDF Oxide.") do |pdf|
  pdf.save('report.pdf')
end

# --- Redaction ---
PdfOxide::DocumentEditor.open('source.pdf') do |ed|
  ed.add_redaction(page: 0, rect: [100, 200, 300, 250])
  ed.apply_redactions!(scrub_metadata: true)
  ed.save_to('redacted.pdf')
end

# --- Validation ---
PdfOxide::PdfDocument.open('archive.pdf') do |doc|
  puts "PDF/A-1b compliant: #{PdfOxide::PdfValidator.pdf_a?(doc, level: :a1b)}"
end

Other Language Bindings

PDF Oxide also ships native bindings for Rust, Python, Node.js, WASM, C#, Go, Java, PHP, C++, Swift, Kotlin, Dart, R, Julia, Zig, Scala, Clojure, Objective-C, and Elixir.
