Ruby API Reference
PDF Oxide ships native Ruby bindings (gem pdf_oxide) built on FFI over the cdylib C ABI. The gem bundles a prebuilt native library and mirrors the 9-class shape of the Java binding under the PdfOxide namespace.
gem install pdf_oxide
require 'pdf_oxide'
For the Rust API, see the Rust API Reference. For the Python API, see the Python API Reference. For type details, see Types & Enums.
All handle-owning objects (PdfDocument, Pdf, DocumentEditor) own native memory and must be closed. The idiomatic pattern is the block form, which closes automatically; #close is idempotent.
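To make the ownership rule concrete, here is a minimal sketch of the open/yield/ensure-close shape with an idempotent close. `FakeHandle` is a hypothetical stand-in written for this illustration and is not part of the gem; the real classes may be implemented differently.

```ruby
# Hypothetical stand-in for a handle-owning object; NOT part of pdf_oxide.
class FakeHandle
  attr_reader :releases

  def initialize
    @closed = false
    @releases = 0
  end

  # Block form: yield the handle, then always close it, even on error.
  def self.open
    handle = new
    return handle unless block_given?

    begin
      yield handle
    ensure
      handle.close
    end
  end

  # Idempotent: only the first call releases anything.
  def close
    return nil if @closed

    @closed = true
    @releases += 1
    nil
  end

  def closed?
    @closed
  end
end

leaked = nil
FakeHandle.open { |h| leaked = h } # closed automatically on block exit
leaked.close                       # second close is a harmless no-op
```

If you use the non-block form, the same guarantee needs an explicit `ensure` around your own code.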
PdfOxide (module)
Top-level convenience entry points and process-global toggles.
PdfOxide.open(source, password: nil) { |doc| ... } -> PdfDocument
Open a PDF for reading. Delegates to PdfDocument.open. Accepts a file path or raw PDF bytes; block form auto-closes.
PdfOxide.version -> String
Return the library version string (e.g. "0.3.69").
PdfOxide.set_max_ops_per_stream(limit) -> Integer
Set the process-global content-stream operator cap. A negative limit restores the default (1,000,000); any non-negative value becomes the explicit cap. Returns the previous cap.
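Because the setter returns the previous cap, a temporary override can be scoped to a block. The helper below is a sketch, not part of the gem: `with_ops_cap` and its `api:` parameter are names invented here, and `api` only needs to honor the `set_max_ops_per_stream(limit)` contract described above.

```ruby
# Temporarily apply an operator cap, then put the previous cap back.
# `api` defaults to the PdfOxide module (resolved only when the method is
# called without api:); any object with the same contract also works.
def with_ops_cap(limit, api: PdfOxide)
  previous = api.set_max_ops_per_stream(limit)
  begin
    yield
  ensure
    # Restore whatever cap was in force before this block ran.
    api.set_max_ops_per_stream(previous)
  end
end

# Usage (requires the gem):
#   with_ops_cap(50_000) { PdfOxide.open('untrusted.pdf') { |d| d.extract_text(0) } }
```

The cap is process-global, so other threads see the temporary value while the block runs.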
PdfOxide.set_preserve_unmapped_glyphs(preserve) -> Integer
Toggle the process-global U+FFFD (unmapped-glyph) preservation flag used by text extraction. Truthy/non-zero preserves; falsey/0 filters (the default). Returns the previous value (0 or 1).
PdfDocument
The primary read-only entry point to a PDF: extraction, search, conversion, rendering, and page access.
doc = PdfOxide::PdfDocument.open('invoice.pdf')
Constructor & Class Methods
PdfOxide::PdfDocument.open(source, password: nil) { |doc| ... } -> PdfDocument
Open a PDF from a filesystem path or raw PDF bytes (auto-detected via the %PDF- magic on binary input). Block form auto-closes; non-block form returns the document. Raises FileNotFoundError, ParseError, or EncryptedError.
PdfOxide::PdfDocument.new(source, password: nil) -> PdfDocument
Construct directly without a block. Prefer .open.
PdfOxide::PdfDocument.extract_text(source, page: 0) -> String
One-shot helper: open, extract a single page’s text, and close.
Document Info
doc.page_count -> Integer
Number of pages in the document.
doc.pdf_version -> String
PDF version string (e.g. "1.7"), or "unknown" if unavailable.
doc.encrypted? -> Boolean
Whether the PDF carries an encryption dictionary.
doc.path -> String
Absolute path the document was opened from (or <in-memory> for byte-opened docs).
Authentication
doc.authenticate(password) -> Boolean
Authenticate against this document’s encryption. Returns true on success or for unencrypted docs.
Text Extraction
doc.extract_text(page_index) -> String
Extract plain text from a single 0-based page (empty for pages with no text layer).
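Whole-document text can be assembled from the per-page call. The helper below is a sketch with a hypothetical name (`full_text`); it relies only on `page_count` and `extract_text(page_index)` as documented here, and the form-feed separator is an arbitrary choice.

```ruby
# Concatenate every page's text, separated by a form feed by default.
# Works on any object exposing page_count and extract_text(index).
def full_text(doc, separator: "\f")
  (0...doc.page_count).map { |i| doc.extract_text(i) }.join(separator)
end

# Usage (requires the gem):
#   PdfOxide::PdfDocument.open('input.pdf') { |doc| puts full_text(doc) }
```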
doc.extract_structured(page) -> Hash
Extract a structured representation of a page as a Hash with page_index, page_width, page_height, and regions (each with kind, text, bbox, spans, column_index).
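As an illustration of consuming that Hash, the sketch below joins region text column by column. It assumes symbol keys (`:regions`, `:column_index`, `:text`), which this page does not state explicitly; `text_by_column` is a hypothetical helper name.

```ruby
# Join region text per column, in ascending column order.
# ASSUMPTION: the Hash uses symbol keys; adjust if your build returns strings.
def text_by_column(structured)
  structured.fetch(:regions, [])
            .group_by { |region| region[:column_index] }
            .sort_by { |column, _| column.to_i }
            .map { |_, regions| regions.map { |r| r[:text] }.join("\n") }
end

# Usage (requires the gem):
#   columns = text_by_column(doc.extract_structured(0))
```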
doc.extract_text_auto(page_index) -> String
Auto-routed extraction: native text where present, OCR for scanned regions when the ocr feature is available, with graceful native fallback (never raises “OCR unavailable”).
Conversion
doc.to_markdown(page_index = nil) -> String
Convert one page to Markdown, or the whole document when page_index is nil.
doc.to_html(page_index = nil) -> String
Convert one page to HTML, or the whole document when page_index is nil.
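A per-page export can be built on the page-indexed form. The following sketch writes one Markdown file per page; the helper name and the `page-NNN.md` naming scheme are choices made for this example, not gem behavior.

```ruby
require 'fileutils'

# Write each page's Markdown to out_dir/page-001.md, page-002.md, ...
# Returns the list of paths written.
def export_markdown_pages(doc, out_dir)
  FileUtils.mkdir_p(out_dir)
  (0...doc.page_count).map do |i|
    path = File.join(out_dir, format('page-%03d.md', i + 1))
    File.write(path, doc.to_markdown(i))
    path
  end
end

# Usage (requires the gem):
#   PdfOxide::PdfDocument.open('input.pdf') { |doc| export_markdown_pages(doc, 'out') }
```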
Search
doc.search(query, case_sensitive: false, regex: false) -> Array<Hash>
Search the document. Each match is { page:, text:, bbox: { x:, y:, width:, height: } }. Pass regex: true to interpret query as a regex (raises UnsupportedFeatureError if the build lacks regex search).
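Since each match carries its page index, hits are easy to summarize per page. This is a small sketch using only the match shape shown above; `match_counts_by_page` is a hypothetical helper name.

```ruby
# Count search hits per 0-based page index.
# Keyword options (case_sensitive:, regex:) are passed through unchanged.
def match_counts_by_page(doc, query, **opts)
  doc.search(query, **opts)
     .group_by { |match| match[:page] }
     .transform_values(&:size)
end

# Usage (requires the gem):
#   match_counts_by_page(doc, 'invoice', case_sensitive: true)
```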
Forms
doc.form_fields -> Array<Hash>
AcroForm fields as { name:, value:, type:, page: } hashes. Returns [] when the build lacks the form-extract accessor.
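For quick lookups it can help to flatten the field list into a name-to-value Hash. The sketch below assumes field names are unique within a document (later duplicates would overwrite earlier ones); `form_values` is a hypothetical helper name.

```ruby
# Build { "field.name" => value } from the form_fields array.
# Returns {} when there are no fields (or the accessor is unavailable).
def form_values(doc)
  doc.form_fields.to_h { |field| [field[:name], field[:value]] }
end

# Usage (requires the gem):
#   form_values(doc)['customer.email']
```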
Rendering
doc.render(page_index, dpi: 150) -> String
Render a single page to PNG bytes (BINARY) at the supplied DPI.
doc.render_with_layers(page_index, dpi: 150, format: 0,
                       background: [1.0, 1.0, 1.0, 1.0], transparent: false,
                       render_annotations: true, jpeg_quality: 90,
                       excluded_layers: []) -> String
Render a page with the full RenderOptions surface plus Optional-Content-Group (OCG) layer filtering. format: 0 = PNG, 1 = JPEG; excluded_layers lists OCG /Names to suppress. Returns encoded image bytes (BINARY).
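The numeric format argument is easy to misread at call sites, so a thin wrapper with named formats may help. In this sketch, `RENDER_FORMAT` and `render_hiding_layers` are local, hypothetical names (not exported by the gem); only the documented keyword arguments are forwarded.

```ruby
# Named stand-ins for the documented format ordinals (0 = PNG, 1 = JPEG).
# Local to this sketch; NOT constants provided by pdf_oxide.
RENDER_FORMAT = { png: 0, jpeg: 1 }.freeze

# Render one page while suppressing the given OCG layer names.
def render_hiding_layers(doc, page_index, hidden_layers, format: :jpeg, dpi: 150)
  doc.render_with_layers(
    page_index,
    dpi: dpi,
    format: RENDER_FORMAT.fetch(format), # raises KeyError on unknown formats
    excluded_layers: hidden_layers
  )
end

# Usage (requires the gem; 'Watermark' is a made-up layer name):
#   File.binwrite('p1.jpg', render_hiding_layers(doc, 0, ['Watermark']))
```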
Page Access
doc.page(index) -> PdfPage
A lightweight PdfPage view of the page at index.
doc.pages -> Array<PdfPage>
Every page in the document (eager).
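Because `pages` is eager, a lazy alternative built on `page(index)` may be preferable for large documents when only the first few pages are needed. The following is a sketch; `lazy_pages` is a hypothetical helper, not a gem method.

```ruby
# Lazily yield PdfPage views one at a time instead of building the full array.
# Relies only on doc.page_count and doc.page(index).
def lazy_pages(doc)
  Enumerator.new do |yielder|
    doc.page_count.times { |i| yielder << doc.page(i) }
  end.lazy
end

# Usage (requires the gem):
#   lazy_pages(doc).first(3).each { |page| puts page.text }
```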
Auto-Extraction
doc.auto_extractor -> AutoExtractor
The configured AutoExtractor for this document (memoized).
Lifecycle
doc.close -> nil
Free the native handle. Idempotent.
doc.open? -> Boolean
doc.closed? -> Boolean
Whether the document is still open / has been closed.
PdfPage
A lightweight per-page view borrowed from a PdfDocument. Holds no native handle of its own. Construct via PdfDocument#page or #pages.
page = doc.page(0)
Attributes
page.parent -> PdfDocument
page.index -> Integer
The owning document and the 0-based page index.
Geometry
page.width -> Float
page.height -> Float
Page width and height in PDF user-space units.
page.media_box -> Hash
page.crop_box -> Hash
{ x:, y:, width:, height: } for the media box / crop box (crop box falls back to the media box).
page.rotation -> Integer
Page rotation in degrees.
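To relate these values to physical paper sizes, user-space units can be converted to millimetres. The sketch below assumes the default PDF user-space unit of 1/72 inch (no UserUnit override) and assumes `width`/`height` are reported before rotation is applied; neither point is confirmed by this page. `page_size_mm` is a hypothetical helper name.

```ruby
POINTS_PER_INCH = 72.0 # default PDF user-space unit: 1/72 inch
MM_PER_INCH = 25.4

# Displayed page size in millimetres as [width, height], rounded to 0.1 mm.
# ASSUMPTION: width/height are unrotated, so 90/270 degree pages are swapped.
def page_size_mm(page)
  w = page.width * MM_PER_INCH / POINTS_PER_INCH
  h = page.height * MM_PER_INCH / POINTS_PER_INCH
  w, h = h, w if [90, 270].include?(page.rotation % 360)
  [w.round(1), h.round(1)]
end

# Usage (requires the gem):
#   page_size_mm(doc.page(0))
```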
Text
page.text -> String
Extract this page’s text (equivalent to parent.extract_text(index)).
page.to_s -> String
page.inspect -> String
Short inspection label (#<PdfOxide::PdfPage index=N>).
Pdf
Create and save PDFs: Markdown/HTML/text/image sources, byte export, and bookmark-split planning.
pdf = PdfOxide::Pdf.from_markdown("# Title\n\nBody")
Factory Methods
PdfOxide::Pdf.from_markdown(markdown) { |pdf| ... } -> Pdf
Build a PDF from Markdown.
PdfOxide::Pdf.from_html(html) { |pdf| ... } -> Pdf
Build a PDF from HTML (CSS honored via the html_css pipeline).
PdfOxide::Pdf.from_text(text) { |pdf| ... } -> Pdf
Build a PDF from plain text.
PdfOxide::Pdf.from_images(images) { |pdf| ... } -> Pdf
Build a PDF from an array of JPEG/PNG byte blobs (format auto-detected from magic bytes).
PdfOxide::Pdf.create_empty { |pdf| ... } -> Pdf
Create a blank single-page PDF.
Static Helpers
PdfOxide::Pdf.version -> String
The library version.
PdfOxide::Pdf.prefetch_models(languages) -> String
Prefetch OCR models for the given BCP-47/ISO language tags. Returns the cache directory path (empty on no-OCR builds).
PdfOxide::Pdf.prefetch_available? -> Boolean
Whether the build supports OCR model provisioning.
PdfOxide::Pdf.plan_split_by_bookmarks_count(source_pdf, level) -> Integer
Count the bookmark-split segments that would result from splitting source_pdf (raw bytes) at level (1 = top-level, 0 = all), without producing output.
Instance Methods
pdf.to_bytes -> String
The PDF as BINARY-encoded bytes.
pdf.save(path) -> String
Write the PDF to path. Returns the absolute path written.
pdf.close -> nil
pdf.closed? -> Boolean
Free the native handle (idempotent) / whether it has been closed.
DocumentEditor
Write-side editor: destructive redaction, metadata scrubbing, form-fill, and incremental save. Every redaction operation fails closed (non-zero return raises).
PdfOxide::DocumentEditor.open('source.pdf') do |ed|
  ed.add_redaction(page: 0, rect: [100, 200, 300, 250])
  ed.apply_redactions!
  ed.save_to('redacted.pdf')
end
Constructor
PdfOxide::DocumentEditor.open(source) { |ed| ... } -> DocumentEditor
Open an editor over a PDF on disk or in-memory bytes. Block form auto-closes.
PdfOxide::DocumentEditor.new(source) -> DocumentEditor
Construct directly without a block.
Redaction
ed.add_redaction(page:, rect:, color: [0.0, 0.0, 0.0]) -> self
Queue a redaction rectangle (rect = [x1, y1, x2, y2] in PDF user-space; color = [r, g, b]). Not applied until apply_redactions!.
ed.redaction_count(page) -> Integer
Total redactions queued for the page.
ed.apply_redactions!(scrub_metadata: false, fill_color: [0.0, 0.0, 0.0]) -> self
Apply all queued redactions destructively, optionally scrubbing /Info, XMP, and JS.
ed.scrub_metadata -> self
Strip metadata without redaction regions.
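Search results report boxes as `{ x:, y:, width:, height: }`, while redactions take `[x1, y1, x2, y2]`, so redacting search hits needs a small conversion. The sketch below assumes both use the same user-space origin and orientation, which this page does not confirm; verify on a test document before relying on it. Both helper names are hypothetical.

```ruby
# Convert a search bbox ({ x:, y:, width:, height: }) to [x1, y1, x2, y2],
# optionally growing it by `padding` units on every side.
def bbox_to_rect(bbox, padding: 0.0)
  [
    bbox[:x] - padding,
    bbox[:y] - padding,
    bbox[:x] + bbox[:width] + padding,
    bbox[:y] + bbox[:height] + padding
  ]
end

# Queue one redaction per search hit; returns the number queued.
# Nothing is removed until the caller invokes apply_redactions!.
def queue_redactions_for(doc, editor, query, padding: 0.0)
  matches = doc.search(query)
  matches.each do |m|
    editor.add_redaction(page: m[:page], rect: bbox_to_rect(m[:bbox], padding: padding))
  end
  matches.size
end
```

The reader document and the editor are separate handles over the same source, so each still needs its own close (or block form).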
Forms
ed.set_form_field(name, value) -> self
Set an AcroForm field by dot-separated full name. A Boolean value targets a checkbox/radio; otherwise sets a text value.
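Filling several fields at once is a simple loop over a Hash. In this sketch, non-Boolean values are converted with `to_s` before the call; that coercion is an assumption of the example (the page only distinguishes Boolean from text values), and `fill_form` is a hypothetical helper name.

```ruby
# Set many fields from a { name => value } Hash; returns the editor.
# Booleans are passed through (checkbox/radio); everything else is sent as text.
def fill_form(editor, values)
  values.each do |name, value|
    boolean = value == true || value == false
    editor.set_form_field(name.to_s, boolean ? value : value.to_s)
  end
  editor
end

# Usage (requires the gem; field names are made up):
#   PdfOxide::DocumentEditor.open('form.pdf') do |ed|
#     fill_form(ed, 'applicant.name' => 'Ada', 'applicant.age' => 36, 'agree' => true)
#     ed.save_to('filled.pdf')
#   end
```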
Save & Lifecycle
ed.save_to(path) -> String
Save the edited PDF. Returns the absolute path written.
ed.to_bytes -> String
The edited PDF as BINARY-encoded bytes.
ed.close -> nil
ed.closed? -> Boolean
Free the native handle (idempotent) / whether it has been closed.
AutoExtractor
Auto-extraction with typed reasons (text-vs-OCR routing with graceful native fallback). Construct from a PdfDocument.
ax = PdfOxide::AutoExtractor.new(doc)
result = ax.extract_page(0)
warn "degraded: #{result[:reason]}" unless ax.ok?(result[:reason])
Constructor & Attribute
PdfOxide::AutoExtractor.new(document) -> AutoExtractor
Wrap a PdfDocument for auto-extraction.
ax.document -> PdfDocument
The wrapped document.
Classification
ax.classify_page(page_index) -> Hash
Cheap per-page classifier (no OCR/rasterization). Returns { reason:, kind:, confidence:, classification: }.
ax.classify_document -> Hash
Whole-document classifier; returns the decoded JSON envelope.
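Because classification is cheap, it can be used to triage pages before any extraction. The sketch below assumes `classify_page` reports `:kind` as one of the PAGE_KINDS symbols, and treating `:scanned`, `:image_text`, and `:mixed` as "likely needs OCR" is a judgment made for this example. The constant and helper names are hypothetical.

```ruby
# Page kinds this sketch treats as likely lacking a usable native text layer.
# ASSUMPTION: :kind is one of the PAGE_KINDS symbols listed under Constants.
OCR_LIKELY_KINDS = %i[scanned image_text mixed].freeze

# Return the 0-based indexes of pages that probably need OCR.
def pages_likely_needing_ocr(ax, page_count)
  (0...page_count).select do |i|
    OCR_LIKELY_KINDS.include?(ax.classify_page(i)[:kind])
  end
end

# Usage (requires the gem):
#   pages_likely_needing_ocr(doc.auto_extractor, doc.page_count)
```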
Extraction
ax.extract_text(page_index) -> Hash
Extract a page’s text via the auto-router; returns { text:, reason:, kind:, confidence:, classification: }.
ax.extract_page(page_index, options: nil) -> Hash
Rich per-page extraction returning the full PageExtraction envelope (text + per-region bbox + reason + confidence) merged into a Hash.
Predicates
ax.ok?(reason) -> Boolean
Whether reason represents a clean extract.
ax.ocr_fallback?(reason) -> Boolean
Whether the OCR-unavailable graceful-fallback path engaged.
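These predicates make it straightforward to collect the pages whose extraction was not clean. The following is a sketch using only `extract_text` and `ok?` as documented above; `degraded_pages` is a hypothetical helper name.

```ruby
# List pages whose auto-extraction reason is not a clean extract.
# Returns an array of { page:, reason: } hashes (empty when all pages are ok).
def degraded_pages(ax, page_count)
  (0...page_count).each_with_object([]) do |i, out|
    reason = ax.extract_text(i)[:reason]
    out << { page: i, reason: reason } unless ax.ok?(reason)
  end
end

# Usage (requires the gem):
#   degraded_pages(doc.auto_extractor, doc.page_count).each do |d|
#     warn "page #{d[:page] + 1}: #{d[:reason]}"
#   end
```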
PdfOxide::AutoExtractor.prefetch_available? -> Boolean
Whether the build supports OCR provisioning.
Constants
AutoExtractor::REASONS is a frozen array of typed reason symbols (:ok, :native_text_high_confidence, :no_text_layer_present, :ocr_requested_but_unavailable, etc.).
AutoExtractor::PAGE_KINDS lists the page-kind symbols (:text_layer, :scanned, :image_text, :mixed, :empty).
MarkdownConverter
Stateless module converting a PdfDocument to Markdown or HTML.
PdfOxide::MarkdownConverter.to_markdown(doc, page_index = nil) -> String
Convert a page (or the whole document when page_index is nil) to Markdown.
PdfOxide::MarkdownConverter.to_html(doc, page_index = nil) -> String
Convert a page (or the whole document) to HTML.
PdfPolicy
Process-global crypto-governance policy with set-once semantics. Call .set before any other PDF Oxide operation.
PdfOxide::PdfPolicy.current -> Symbol
The current process policy mode (:compat, :strict, or :fips_strict).
PdfOxide::PdfPolicy.set(mode) -> Symbol
Set the process-global policy mode. Raises if already set or unsupported by the build.
PdfOxide::PdfPolicy.compat -> Symbol
PdfOxide::PdfPolicy.strict -> Symbol
PdfOxide::PdfPolicy.fips_strict -> Symbol
Preset mode symbols: accept all algorithms / reject legacy algorithms / FIPS 140-3 only.
PdfPolicy::MODES — frozen mapping of mode symbols to cdylib ordinals.
PdfSigner
PAdES B-B / B-T / B-LT / B-LTA digital-signature signer. Signing is a security operation: every non-zero return fails closed.
PdfOxide::PdfSigner.new(certificate_handle) -> PdfSigner
Construct a signer from an opaque PKCS#12/PEM credentials handle.
signer.sign(pdf, level:, tsa_url: nil, reason: nil, location: nil) -> String
Sign raw PDF bytes at the requested PAdES level (:b, :t, :lt, :lta). A tsa_url is required for :t and every higher level (:lt, :lta). Returns BINARY-encoded signed PDF bytes.
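The timestamp requirement can be checked before touching the native signer. This sketch validates the level/tsa_url pairing up front; `PADES_ORDER` and `check_sign_options!` are names invented here, and the check raises Ruby's core ArgumentError rather than any gem-specific class.

```ruby
# PAdES levels in ascending order, as listed for the level: argument.
PADES_ORDER = %i[b t lt lta].freeze

# Validate level/tsa_url before signing; returns the options on success.
def check_sign_options!(level:, tsa_url: nil)
  rank = PADES_ORDER.index(level)
  raise ArgumentError, "unknown PAdES level: #{level.inspect}" if rank.nil?

  # Every level from :t upwards needs a timestamp authority.
  if rank >= PADES_ORDER.index(:t) && (tsa_url.nil? || tsa_url.empty?)
    raise ArgumentError, "level #{level.inspect} requires a tsa_url"
  end

  { level: level, tsa_url: tsa_url }
end

# Usage (requires the gem; the URL is a placeholder):
#   opts = check_sign_options!(level: :t, tsa_url: 'https://tsa.example.com')
#   signed = signer.sign(pdf_bytes, **opts)
```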
PdfOxide::PdfSigner.sign(pdf:, certificate_handle:, level:, tsa_url: nil, reason: nil, location: nil) -> String
Static convenience: sign without constructing a signer instance.
PdfOxide::PdfSigner.pades_level(signature_handle) -> Integer
The PAdES level ordinal of an existing signature handle.
PdfOxide::PdfSigner.document_has_timestamp?(document_handle) -> Boolean
Whether the document carries a document-scoped /DocTimeStamp.
PdfSigner::LEVELS is a frozen mapping of level symbols to codes.
PdfSigner::PadesSignOptions is a packed FFI::Struct mirroring the C PadesSignOptionsC layout.
PdfValidator
Stateless PDF/A and PDF/UA compliance validation.
PdfOxide::PdfValidator.pdf_a?(doc, level: :a1b) -> Boolean
Whether the document is PDF/A compliant for level (:a1b, :a1a, :a2b, :a2a, :a2u, :a3b, :a3a, :a3u).
PdfOxide::PdfValidator.pdf_ua?(doc, level: :ua1) -> Boolean
Whether the document is PDF/UA compliant for level (:ua1 or :ua2).
PdfOxide::PdfValidator.validate_pdf_a(doc, level: :a1b) -> Hash
Simplified PDF/A result: { compliant:, violations: }.
PdfOxide::PdfValidator.validate_pdf_ua(doc, level: :ua1) -> Hash
Simplified PDF/UA result: { compliant:, violations: }.
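The simplified result Hash lends itself to a short textual report. This page does not specify the element type of `violations`, so the sketch below simply interpolates each entry (calling `to_s`); `pdf_a_report` and its `validator` parameter are hypothetical, with `validator` expected to be PdfOxide::PdfValidator or anything with the same `validate_pdf_a` contract.

```ruby
# Build a human-readable PDF/A report from the { compliant:, violations: } Hash.
# ASSUMPTION: each violation has a meaningful to_s representation.
def pdf_a_report(validator, doc, level: :a1b)
  result = validator.validate_pdf_a(doc, level: level)
  return "PDF/A #{level}: compliant" if result[:compliant]

  lines = result[:violations].map { |violation| "  - #{violation}" }
  (["PDF/A #{level}: #{lines.size} violation(s)"] + lines).join("\n")
end

# Usage (requires the gem):
#   PdfOxide::PdfDocument.open('archive.pdf') do |doc|
#     puts pdf_a_report(PdfOxide::PdfValidator, doc, level: :a2b)
#   end
```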
PdfValidator::PDF_A_LEVELS and PdfValidator::PDF_UA_LEVELS — frozen level-to-ordinal mappings.
Error Handling
All PDF Oxide exceptions derive from PdfOxide::Error. Native error codes map 1-to-1 to the subclasses below.
begin
  doc = PdfOxide::PdfDocument.open('file.pdf')
  text = doc.extract_text(0)
rescue PdfOxide::FileNotFoundError
  warn 'file not found'
rescue PdfOxide::ParseError => e
  warn "malformed PDF: #{e.message}"
rescue PdfOxide::Error => e
  warn "PDF error: #{e.message}"
ensure
  doc&.close
end
| Exception | Cause |
|---|---|
| Error | Base class for all PDF Oxide errors |
| UnsupportedPlatformError | Host platform not supported by the bundled cdylib |
| ArgumentError | Argument failed validation before the native call |
| IoError | Filesystem / I/O failure |
| FileNotFoundError | Missing file (specialises IoError) |
| ParseError | Malformed header, corrupt xref, extraction failure |
| StateError | Wrong operation order |
| InvalidStateError | Operation on an already-closed handle (specialises StateError) |
| EncryptedError | Encryption / wrong-password failure |
| PermissionError | Encrypted PDF lacking extract/sign permission |
| UnsupportedFeatureError | Feature not compiled into this cdylib build |
| SignatureError | PAdES signing / verifying failure |
| RedactionError | Destructive-redaction failure (fails closed) |
| ComplianceError | PDF/A · PDF/UA validation failure |
| SearchError | Native text-search failure |
| InternalError | Generic native-side failure |
Complete Example
require 'pdf_oxide'

# --- Extraction ---
PdfOxide::PdfDocument.open('input.pdf') do |doc|
  puts "Pages: #{doc.page_count}"
  doc.page_count.times do |i|
    puts "Page #{i + 1}: #{doc.extract_text(i).length} characters"
  end

  # Search
  doc.search('configuration', case_sensitive: false).each do |m|
    puts "Page #{m[:page] + 1}: '#{m[:text]}' at (#{m[:bbox][:x]}, #{m[:bbox][:y]})"
  end

  # Render page 1 to PNG
  File.binwrite('page1.png', doc.render(0, dpi: 150))
end

# --- Creation ---
PdfOxide::Pdf.from_markdown("# Report\n\nGenerated by PDF Oxide.") do |pdf|
  pdf.save('report.pdf')
end

# --- Redaction ---
PdfOxide::DocumentEditor.open('source.pdf') do |ed|
  ed.add_redaction(page: 0, rect: [100, 200, 300, 250])
  ed.apply_redactions!(scrub_metadata: true)
  ed.save_to('redacted.pdf')
end

# --- Validation ---
PdfOxide::PdfDocument.open('archive.pdf') do |doc|
  puts "PDF/A-1b compliant: #{PdfOxide::PdfValidator.pdf_a?(doc, level: :a1b)}"
end
Other Language Bindings
PDF Oxide ships native bindings for every major ecosystem: Rust, Python, Node.js, WASM, C#, Go, Java, PHP, C++, Swift, Kotlin, Dart, R, Julia, Zig, Scala, Clojure, Objective-C, and Elixir.
Next Steps
- Types & Enums — all shared types and enums
- Page API Reference — consistent per-page iteration across bindings
- Getting Started with Ruby — tutorial