Getting Started with PDF Oxide (Ruby)
PDF Oxide is the fastest Ruby PDF library: 0.8 ms mean text extraction and a 100% pass rate on 3,830 PDFs. One gem handles extracting, searching, converting, creating, and redacting PDFs, built on the same Rust core that powers the Python, Java, Node, Go, C#, and PHP bindings.
Installation
gem install pdf_oxide
Or add it to your Gemfile:
gem 'pdf_oxide', '~> 0.3'
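If it is unclear which releases the ~> 0.3 constraint admits, RubyGems can answer directly. The sketch below uses only the Gem::Requirement and Gem::Version classes that ship with RubyGems. The helper name is made up for this example, and the version strings other than 0.3.51 are chosen only to probe the boundaries; they are not a list of actual PDF Oxide releases.

```ruby
# RubyGems is loaded by default; the require is here for clarity.
require 'rubygems'

# The constraint from the Gemfile line above.
PDF_OXIDE_REQUIREMENT = Gem::Requirement.new('~> 0.3')

# Hypothetical helper: true when the version string satisfies the constraint.
def pdf_oxide_version_allowed?(version)
  PDF_OXIDE_REQUIREMENT.satisfied_by?(Gem::Version.new(version))
end

# Illustrative version numbers only.
%w[0.2.9 0.3.0 0.3.51 0.9.0 1.0.0].each do |v|
  puts "#{v}: #{pdf_oxide_version_allowed?(v)}"
end
```

A two-part pessimistic constraint such as ~> 0.3 is equivalent to >= 0.3 and < 1.0, so it admits any 0.x release from 0.3 upward. Staying within the 0.3 series would take a three-part constraint such as ~> 0.3.0.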
The prebuilt libpdf_oxide native library ships inside the platform-tagged gem — no compiler or system-wide install needed. Prebuilt gems cover Ruby 3.1–3.4 on x86_64-linux, aarch64-linux, Intel and Apple Silicon macOS, and Windows (x64-mingw-ucrt).
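Before installing, it can help to confirm that the local interpreter falls inside the range listed above. This is a minimal sketch using only RubyGems classes; the constant and method names are hypothetical, and it checks the Ruby series only, printing the platform string so you can compare it against the list yourself.

```ruby
require 'rubygems'

# Ruby series the text above lists as covered by prebuilt gems (3.1 to 3.4).
MIN_PREBUILT_RUBY = Gem::Version.new('3.1')
FIRST_UNLISTED_RUBY = Gem::Version.new('3.5')

# Hypothetical helper: true when the version string is in the 3.1-3.4 series.
def prebuilt_ruby_series?(version = RUBY_VERSION)
  v = Gem::Version.new(version)
  v >= MIN_PREBUILT_RUBY && v < FIRST_UNLISTED_RUBY
end

puts "Ruby #{RUBY_VERSION} on #{Gem::Platform.local}"
puts "Ruby in the 3.1-3.4 range: #{prebuilt_ruby_series?}"
```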
Opening a PDF
Use PdfDocument.open to load a file. The block form auto-closes the document when the block returns; #close is also available and is idempotent.
require 'pdf_oxide'
PdfOxide::PdfDocument.open('research-paper.pdf') do |doc|
  puts "Pages: #{doc.page_count}"
  puts "PDF version: #{doc.pdf_version}"
  puts "Encrypted: #{doc.encrypted?}"
end
For encrypted documents, pass the password: keyword argument:
PdfOxide::PdfDocument.open('confidential.pdf', password: 'secret') do |doc|
  puts doc.extract_text(0)
end
You can also open from in-memory bytes — handy when streaming from S3, HTTP, or a database. PdfDocument.open auto-detects raw PDF bytes via the %PDF- magic header:
bytes = File.binread('report.pdf')
PdfOxide::PdfDocument.open(bytes) do |doc|
  puts doc.extract_text(0)
end
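The magic-header check described above is easy to reproduce in plain Ruby, which can be useful for validating an upload before handing it to PdfDocument.open. The following is only an illustration of the idea, not PDF Oxide's actual detection code; the helper name pdf_bytes? is hypothetical.

```ruby
# The five bytes that open a PDF file, as a binary string.
PDF_MAGIC = '%PDF-'.b.freeze

# Hypothetical helper: true when the input is a String that starts with
# the PDF magic header. The comparison is done on raw bytes.
def pdf_bytes?(input)
  input.is_a?(String) && input.b.start_with?(PDF_MAGIC)
end

puts pdf_bytes?("%PDF-1.7\n...") # raw PDF bytes => true
puts pdf_bytes?('report.pdf')    # a file path   => false
```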
Text Extraction
Single Page
Extract plain text from any page by its zero-based index.
PdfOxide::PdfDocument.open('report.pdf') do |doc|
  text = doc.extract_text(0)
  puts text
end
All Pages
PdfOxide::PdfDocument.open('book.pdf') do |doc|
  doc.page_count.times do |i|
    puts "--- Page #{i + 1} ---"
    puts doc.extract_text(i)
  end
end
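If you want the whole document as one string rather than printed output, the loop above can be wrapped in a small helper. This sketch relies only on the two calls already shown, page_count and extract_text(index). The helper name all_pages_text and the FakeDoc stand-in are hypothetical; FakeDoc exists only so the example runs without a PDF on disk.

```ruby
# Joins every page's text under 1-based "--- Page N ---" headers.
# `doc` is any object that responds to #page_count and #extract_text(index),
# as PdfDocument does in the examples above.
def all_pages_text(doc)
  Array.new(doc.page_count) do |i|
    "--- Page #{i + 1} ---\n#{doc.extract_text(i)}"
  end.join("\n")
end

# Stand-in document for demonstration only; not part of PDF Oxide.
FakeDoc = Struct.new(:texts) do
  def page_count
    texts.length
  end

  def extract_text(index)
    texts.fetch(index)
  end
end

puts all_pages_text(FakeDoc.new(['First page.', 'Second page.']))
```

With a real file, the same helper would be called inside the PdfDocument.open block, passing doc in place of the stand-in.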
One-Shot Helper
When you only need one page’s text, PdfDocument.extract_text opens, extracts, and closes in a single call:
text = PdfOxide::PdfDocument.extract_text('report.pdf', page: 0)
puts text
Auto-Routed Extraction
extract_text_auto uses the v0.3.51 auto-router to pick native text or OCR per page. On a build without the ocr feature it gracefully falls back to the native text layer — it never raises an “OCR unavailable” error.
PdfOxide::PdfDocument.open('scan.pdf') do |doc|
  puts doc.extract_text_auto(0)
end
For a typed reason describing extraction quality, use the AutoExtractor:
PdfOxide::PdfDocument.open('scan.pdf') do |doc|
  ax = doc.auto_extractor
  result = ax.extract_page(0)
  puts result[:text]
  warn "degraded: #{result[:reason]}" unless ax.ok?(result[:reason])
end
Working with Pages
PdfDocument#page returns a lightweight PdfPage view that borrows from the document. #pages returns one for every page.
PdfOxide::PdfDocument.open('paper.pdf') do |doc|
  page = doc.page(0)
  puts "Index: #{page.index}"
  puts page.text # same as doc.extract_text(0)
  doc.pages.each do |p|
    puts "Page #{p.index}: #{p.text.length} chars"
  end
end
Markdown & HTML Conversion
Convert a single page (pass its index) or the whole document (omit the index) to Markdown or HTML.
PdfOxide::PdfDocument.open('paper.pdf') do |doc|
  puts doc.to_markdown(0) # first page to Markdown
  puts doc.to_html(0)     # first page to HTML
  puts doc.to_markdown    # entire document to Markdown
end
Structured Extraction
extract_structured returns the parsed page layout as a Hash — page dimensions plus typed regions with text, bounding boxes, and column indices.
PdfOxide::PdfDocument.open('paper.pdf') do |doc|
  page = doc.extract_structured(0)
  puts "Size: #{page['page_width']} x #{page['page_height']}"
  page['regions'].each do |region|
    puts "#{region['kind']}: #{region['text']}"
  end
end
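Because the result is an ordinary Hash, the usual Enumerable tools apply to it. The sketch below groups region text by kind using hand-written sample data in the shape shown above. Only the keys page_width, page_height, regions, kind, and text come from the example; the values, including the kind names heading and paragraph, are invented for illustration, and the bounding-box and column keys are omitted because their names are not shown here.

```ruby
# Sample data shaped like the extract_structured result above.
# All values are made up for this example.
SAMPLE_PAGE = {
  'page_width' => 612.0,
  'page_height' => 792.0,
  'regions' => [
    { 'kind' => 'heading', 'text' => 'Introduction' },
    { 'kind' => 'paragraph', 'text' => 'First sentence.' },
    { 'kind' => 'paragraph', 'text' => 'Second sentence.' }
  ]
}.freeze

# Hypothetical helper: groups region texts by kind, { kind => [text, ...] }.
def texts_by_kind(page)
  page['regions']
    .group_by { |region| region['kind'] }
    .transform_values { |regions| regions.map { |r| r['text'] } }
end

texts_by_kind(SAMPLE_PAGE).each do |kind, texts|
  puts "#{kind} (#{texts.length}): #{texts.join(' ')}"
end
```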
Search
search scans the whole document and returns an array of match hashes, each with :page, :text, and a :bbox hash of :x, :y, :width, :height.
PdfOxide::PdfDocument.open('manual.pdf') do |doc|
  matches = doc.search('configuration', case_sensitive: false)
  matches.each do |m|
    bbox = m[:bbox]
    puts "Page #{m[:page]}: '#{m[:text]}' at (#{bbox[:x].round}, #{bbox[:y].round})"
  end
end
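Since each match is a plain Hash, summarising results needs nothing beyond core Ruby. As a minimal sketch, the helper below counts matches per page. The sample array is hand-written in the shape described above (keys :page, :text, and :bbox); its values are invented, and the helper name is hypothetical.

```ruby
# Sample matches shaped like the search results above; values are made up.
SAMPLE_MATCHES = [
  { page: 0, text: 'configuration', bbox: { x: 72.0, y: 700.0, width: 80.0, height: 12.0 } },
  { page: 0, text: 'Configuration', bbox: { x: 72.0, y: 640.0, width: 82.0, height: 12.0 } },
  { page: 3, text: 'configuration', bbox: { x: 90.0, y: 512.0, width: 80.0, height: 12.0 } }
].freeze

# Hypothetical helper: counts matches per page, { page => count }.
def match_counts_by_page(matches)
  matches.map { |m| m[:page] }.tally
end

match_counts_by_page(SAMPLE_MATCHES).each do |page, count|
  puts "Page #{page}: #{count} match(es)"
end
```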
Rendering
Render a page to PNG bytes at a given DPI:
PdfOxide::PdfDocument.open('poster.pdf') do |doc|
  png = doc.render(0, dpi: 150)
  File.binwrite('page-0.png', png)
end
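To choose a DPI, it helps to estimate the output size first. PDF page dimensions are measured in points, with 72 points to the inch by default, so pixels are roughly points times DPI divided by 72. The sketch below works through that arithmetic and also checks for the standard 8-byte PNG signature. Both helper names are hypothetical, and the size estimate assumes the renderer scales the page linearly; the library's exact rounding may differ.

```ruby
POINTS_PER_INCH = 72.0
PNG_SIGNATURE = "\x89PNG\r\n\x1A\n".b.freeze

# Hypothetical helper: estimated pixel size of a page given in PDF points.
def pixel_size(width_pt, height_pt, dpi)
  [(width_pt * dpi / POINTS_PER_INCH).round,
   (height_pt * dpi / POINTS_PER_INCH).round]
end

# Hypothetical helper: true when the bytes begin with the PNG signature.
def png?(bytes)
  bytes.b.start_with?(PNG_SIGNATURE)
end

# US Letter is 612 x 792 points.
width, height = pixel_size(612, 792, 150)
puts "#{width} x #{height}" # 1275 x 1650
```

The png? check can be applied to the bytes returned by render before writing them to disk.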
PDF Creation
The Pdf class creates PDFs from Markdown, HTML, or plain text. Instances own a native handle; use the block form (auto-closes) or call #close yourself.
PdfOxide::Pdf.from_markdown("# Hello World\n\nThis is a PDF.") do |pdf|
  pdf.save('output.pdf')
end
PdfOxide::Pdf.from_html('<h1>Invoice</h1><p>Amount due: $42.00</p>') do |pdf|
  pdf.save('invoice.pdf')
end
PdfOxide::Pdf.from_text("Plain text document.\n\nSecond paragraph.") do |pdf|
  pdf.save('notes.pdf')
end
To get the bytes instead of saving to disk, call #to_bytes, then close the instance to release its native handle:
pdf = PdfOxide::Pdf.from_markdown('# Report')
pdf_bytes = pdf.to_bytes
pdf.close
# upload pdf_bytes, attach to an email, etc.
Redaction
DocumentEditor opens an existing PDF for destructive redaction. apply_redactions! permanently removes the covered content and can scrub document metadata in the same pass.
PdfOxide::DocumentEditor.open('source.pdf') do |ed|
  ed.add_redaction(page: 0, rect: [100, 200, 300, 250])
  ed.apply_redactions!(scrub_metadata: true)
  ed.save_to('redacted.pdf')
end
Error Handling
PDF Oxide raises typed subclasses of PdfOxide::Error for PDF-specific failures.
begin
  PdfOxide::PdfDocument.open('document.pdf') do |doc|
    puts doc.extract_text(0)
  end
rescue PdfOxide::FileNotFoundError
  warn 'File not found'
rescue PdfOxide::EncryptedError
  warn 'Wrong or missing password'
rescue PdfOxide::ParseError => e
  warn "Malformed PDF: #{e.message}"
rescue PdfOxide::Error => e
  warn "PDF error: #{e.message}"
end
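The order of the rescue clauses above matters: Ruby tries them top to bottom, so the base class has to come last or it will catch everything first. The sketch below demonstrates this with stand-in classes in a hypothetical Demo module that mirror the "typed subclass of a base error" shape; they are not PDF Oxide's own classes.

```ruby
# Stand-in hierarchy for illustration only.
module Demo
  class Error < StandardError; end
  class ParseError < Error; end
end

# Specific class first: each error reaches its own clause.
def specific_first(error)
  raise error
rescue Demo::ParseError
  :parse_error
rescue Demo::Error
  :generic_error
end

# Base class first: the ParseError clause is never reached.
def generic_first(error)
  raise error
rescue Demo::Error
  :generic_error
rescue Demo::ParseError
  :parse_error
end

puts specific_first(Demo::ParseError.new('bad xref')) # parse_error
puts generic_first(Demo::ParseError.new('bad xref'))  # generic_error
```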
Next Steps
- Python Getting Started – using PDF Oxide from Python
- Rust Getting Started – using PDF Oxide from Rust
- Text Extraction – detailed extraction options and recipes
- PDF Creation – advanced creation, encryption, and metadata
- Editing – modifying existing PDFs, annotations, and form fields