What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with a 0.8ms mean text extraction time: 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs, it achieved a 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (Ruby)

PDF Oxide is the fastest Ruby PDF library — 0.8ms mean text extraction, 100% pass rate on 3,830 PDFs. One gem for extracting, searching, converting, creating, and redacting PDFs, built on the same Rust core that powers the Python, Java, Node, Go, C#, and PHP bindings.

Installation

gem install pdf_oxide

Or add it to your Gemfile:

gem 'pdf_oxide', '~> 0.3'

The prebuilt libpdf_oxide native library ships inside the platform-tagged gem — no compiler or system-wide install needed. Prebuilt gems cover Ruby 3.1–3.4 on x86_64-linux, aarch64-linux, Intel and Apple Silicon macOS, and Windows (x64-mingw-ucrt).
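
To see which releases the `~> 0.3` constraint in the Gemfile line above would admit, RubyGems' own `Gem::Requirement` can evaluate it. This is a small sketch using only standard RubyGems classes. The helper name `allowed_by?` is made up for illustration, and the version strings other than 0.3.51 are illustrative rather than a list of actual releases.

```ruby
require 'rubygems'

# True when `version` satisfies the RubyGems `constraint` string.
# (`allowed_by?` is a hypothetical helper, not part of pdf_oxide.)
def allowed_by?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

# Illustrative version strings; not a list of real pdf_oxide releases.
%w[0.2.9 0.3.0 0.3.51 0.9.0 1.0.0].each do |v|
  verdict = allowed_by?('~> 0.3', v) ? 'allowed' : 'rejected'
  puts format('%-7s %s', v, verdict)
end
```

The pessimistic operator `~> 0.3` is equivalent to `>= 0.3, < 1.0`, so any 0.x release from 0.3 onward is accepted. Use `~> 0.3.51` instead if you want to stay within the 0.3 series.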

Opening a PDF

Use PdfDocument.open to load a file. The block form auto-closes the document when the block returns; #close is also available and is idempotent.

require 'pdf_oxide'

PdfOxide::PdfDocument.open('research-paper.pdf') do |doc|
  puts "Pages: #{doc.page_count}"
  puts "PDF version: #{doc.pdf_version}"
  puts "Encrypted: #{doc.encrypted?}"
end
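
The block form described above follows a common Ruby idiom: yield the resource, then close it in an `ensure` clause so cleanup happens even if the block raises. The sketch below shows that idiom with a hypothetical `DemoHandle` class. It is not PDF Oxide's actual implementation, only an illustration of what "auto-closes" and "idempotent #close" typically mean.

```ruby
# Hypothetical stand-in for a class that owns a native handle.
class DemoHandle
  # With a block: yield the handle and always close it afterwards.
  # Without a block: return the open handle; the caller must close it.
  def self.open(name)
    handle = new(name)
    return handle unless block_given?

    begin
      yield handle
    ensure
      handle.close # runs on normal return and when the block raises
    end
  end

  def initialize(name)
    @name = name
    @closed = false
  end

  def closed?
    @closed
  end

  # Idempotent: calling close a second time is a no-op.
  def close
    return if @closed

    @closed = true
  end
end

seen = nil
DemoHandle.open('demo') { |h| seen = h }
puts seen.closed?
seen.close # safe: the handle is already closed
```

Because cleanup lives in `ensure`, an exception inside the block still releases the handle before it propagates.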

For encrypted documents, pass the password: keyword argument:

PdfOxide::PdfDocument.open('confidential.pdf', password: 'secret') do |doc|
  puts doc.extract_text(0)
end

You can also open from in-memory bytes — handy when streaming from S3, HTTP, or a database. PdfDocument.open auto-detects raw PDF bytes via the %PDF- magic header:

bytes = File.binread('report.pdf')
PdfOxide::PdfDocument.open(bytes) do |doc|
  puts doc.extract_text(0)
end
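
Magic-header detection of this kind can be sketched in a few lines. The helper below is hypothetical and is not how PDF Oxide necessarily implements it. It only illustrates the idea of distinguishing a path string from raw PDF bytes by checking for the `%PDF-` prefix.

```ruby
# Sketch of magic-header sniffing; not PDF Oxide's actual implementation.
PDF_MAGIC = '%PDF-'.b.freeze

# True when `input` is a String whose first bytes are the PDF header.
def looks_like_pdf_bytes?(input)
  input.is_a?(String) && input.b.start_with?(PDF_MAGIC)
end

puts looks_like_pdf_bytes?("%PDF-1.7\n".b) # raw bytes
puts looks_like_pdf_bytes?('report.pdf')   # a file path
```

A check like this is useful in your own code before handing data to a parser, for example to reject an HTML error page that an HTTP client returned in place of the expected PDF.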

Text Extraction

Single Page

Extract plain text from any page by its zero-based index.

PdfOxide::PdfDocument.open('report.pdf') do |doc|
  text = doc.extract_text(0)
  puts text
end

All Pages

PdfOxide::PdfDocument.open('book.pdf') do |doc|
  doc.page_count.times do |i|
    puts "--- Page #{i + 1} ---"
    puts doc.extract_text(i)
  end
end

One-Shot Helper

When you only need one page’s text, PdfDocument.extract_text opens, extracts, and closes in a single call:

text = PdfOxide::PdfDocument.extract_text('report.pdf', page: 0)
puts text

Auto-Routed Extraction

extract_text_auto uses the v0.3.51 auto-router to pick native text or OCR per page. On a build without the ocr feature it gracefully falls back to the native text layer — it never raises an “OCR unavailable” error.

PdfOxide::PdfDocument.open('scan.pdf') do |doc|
  puts doc.extract_text_auto(0)
end

For a typed reason describing extraction quality, use the AutoExtractor:

PdfOxide::PdfDocument.open('scan.pdf') do |doc|
  ax     = doc.auto_extractor
  result = ax.extract_page(0)
  puts result[:text]
  warn "degraded: #{result[:reason]}" unless ax.ok?(result[:reason])
end
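
The per-page pattern above extends naturally to a whole document. The following sketch collects every page's text and records which pages came back degraded. It relies only on the methods shown above (`page_count`, `auto_extractor`, `extract_page`, `ok?`). The helper name `extract_with_reasons`, the `FakeDoc` and `FakeExtractor` test doubles, and the reason values `:ok` and `:low_confidence` are all hypothetical, included so the example runs without a PDF.

```ruby
# Works with any object that responds to #page_count and #auto_extractor
# the way the PdfDocument example above does.
def extract_with_reasons(doc)
  ax = doc.auto_extractor
  texts = []
  degraded = {}
  doc.page_count.times do |i|
    result = ax.extract_page(i)
    texts << result[:text]
    degraded[i] = result[:reason] unless ax.ok?(result[:reason])
  end
  [texts, degraded]
end

# Test doubles; the reason values are hypothetical placeholders,
# not the library's real reason types.
class FakeExtractor
  def initialize(pages)
    @pages = pages
  end

  def extract_page(index)
    @pages.fetch(index)
  end

  def ok?(reason)
    reason == :ok
  end
end

FakeDoc = Struct.new(:pages) do
  def page_count
    pages.size
  end

  def auto_extractor
    FakeExtractor.new(pages)
  end
end

doc = FakeDoc.new([
  { text: 'Native text', reason: :ok },
  { text: 'Blurry scan', reason: :low_confidence }
])
texts, degraded = extract_with_reasons(doc)
puts "#{texts.length} pages, #{degraded.size} degraded"
```

With the real library you would pass the `doc` yielded by `PdfOxide::PdfDocument.open` instead of the fake one.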

Working with Pages

PdfDocument#page returns a lightweight PdfPage view that borrows from the document. #pages returns one for every page.

PdfOxide::PdfDocument.open('paper.pdf') do |doc|
  page = doc.page(0)
  puts "Index: #{page.index}"
  puts page.text   # same as doc.extract_text(0)

  doc.pages.each do |p|
    puts "Page #{p.index}: #{p.text.length} chars"
  end
end

Markdown & HTML Conversion

Convert a single page (pass its index) or the whole document (omit the index) to Markdown or HTML.

PdfOxide::PdfDocument.open('paper.pdf') do |doc|
  puts doc.to_markdown(0)   # first page to Markdown
  puts doc.to_html(0)       # first page to HTML
  puts doc.to_markdown      # entire document to Markdown
end
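
Markdown output is often split into sections before indexing or feeding it to a language model. The sketch below is plain Ruby that operates on any Markdown string, such as one returned by `to_markdown`. The helper `split_markdown_sections` is hypothetical and deliberately naive: it splits on ATX headings only and does not special-case `#` lines inside fenced code blocks.

```ruby
# Split a Markdown string into sections at ATX headings (#, ##, ...).
# Text before the first heading becomes a section with a nil heading.
def split_markdown_sections(markdown)
  sections = []
  markdown.each_line(chomp: true) do |line|
    if line.match?(/\A[#]{1,6}\s/)
      sections << { heading: line.sub(/\A[#]+\s+/, ''), body: +'' }
    else
      sections << { heading: nil, body: +'' } if sections.empty?
      sections.last[:body] << line << "\n"
    end
  end
  sections.each { |s| s[:body].strip! }
end

# Hypothetical sample input standing in for converter output.
sample = "Intro\n\n# Title\n\nBody text.\n\n## Sub\n\n- a\n- b\n"
split_markdown_sections(sample).each do |s|
  puts "#{s[:heading].inspect}: #{s[:body].length} chars"
end
```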

Structured Extraction

extract_structured returns the parsed page layout as a Hash — page dimensions plus typed regions with text, bounding boxes, and column indices.

PdfOxide::PdfDocument.open('paper.pdf') do |doc|
  page = doc.extract_structured(0)
  puts "Size: #{page['page_width']} x #{page['page_height']}"
  page['regions'].each do |region|
    puts "#{region['kind']}: #{region['text']}"
  end
end
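
Once you have the structured Hash, ordinary Ruby collection methods are enough to reshape it. The sketch below groups region text by `kind`. The sample data is hypothetical: it uses the keys shown above, but the kind values `'heading'` and `'paragraph'` and the page dimensions are illustrative assumptions, as is the helper name `texts_by_kind`.

```ruby
# Hypothetical sample shaped like the Hash described above; the kind
# values and dimensions are illustrative, not taken from the library.
SAMPLE_PAGE = {
  'page_width' => 612.0,
  'page_height' => 792.0,
  'regions' => [
    { 'kind' => 'heading', 'text' => 'Introduction' },
    { 'kind' => 'paragraph', 'text' => 'First paragraph.' },
    { 'kind' => 'paragraph', 'text' => 'Second paragraph.' }
  ]
}.freeze

# Map each region kind to the list of texts found with that kind.
def texts_by_kind(page)
  page.fetch('regions', [])
      .group_by { |region| region['kind'] }
      .transform_values { |regions| regions.map { |r| r['text'] } }
end

texts_by_kind(SAMPLE_PAGE).each do |kind, texts|
  puts "#{kind}: #{texts.length} region(s)"
end
```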

Search

search scans the whole document and returns an array of match hashes, each with :page, :text, and a :bbox hash of :x, :y, :width, :height.

PdfOxide::PdfDocument.open('manual.pdf') do |doc|
  matches = doc.search('configuration', case_sensitive: false)
  matches.each do |m|
    bbox = m[:bbox]
    puts "Page #{m[:page]}: '#{m[:text]}' at (#{bbox[:x].round}, #{bbox[:y].round})"
  end
end
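
Because matches are plain hashes, summarising them needs no library support. This sketch counts hits per page and converts a `:bbox` hash into corner coordinates. The sample matches are hypothetical, and the helper names `hits_per_page` and `bbox_corners` are made up for illustration. The corner arithmetic assumes only that `:width` and `:height` extend from `:x` and `:y`, and says nothing about where the page origin lies.

```ruby
# Count how many matches fall on each page.
def hits_per_page(matches)
  matches.group_by { |m| m[:page] }.transform_values(&:size)
end

# Convert an {x, y, width, height} hash into [x0, y0, x1, y1].
def bbox_corners(bbox)
  [bbox[:x], bbox[:y], bbox[:x] + bbox[:width], bbox[:y] + bbox[:height]]
end

# Hypothetical matches shaped like the hashes described above.
sample_matches = [
  { page: 0, text: 'configuration', bbox: { x: 72.0, y: 700.0, width: 80.0, height: 12.0 } },
  { page: 2, text: 'Configuration', bbox: { x: 90.0, y: 510.5, width: 82.0, height: 12.0 } },
  { page: 2, text: 'configuration', bbox: { x: 72.0, y: 300.0, width: 80.0, height: 12.0 } }
]

hits_per_page(sample_matches).each { |page, n| puts "Page #{page}: #{n} hit(s)" }
puts bbox_corners(sample_matches.first[:bbox]).inspect
```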

Rendering

Render a page to PNG bytes at a given DPI:

PdfOxide::PdfDocument.open('poster.pdf') do |doc|
  png = doc.render(0, dpi: 150)
  File.binwrite('page-0.png', png)
end
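
When choosing a DPI it helps to estimate the resulting image size. PDF user space is measured in points, 72 to the inch, so pixels are roughly `points * dpi / 72`. The sketch below computes that estimate and also checks the standard 8-byte PNG signature on a byte string. Both helpers are hypothetical and independent of PDF Oxide, and whether the renderer rounds dimensions exactly this way is an assumption.

```ruby
POINTS_PER_INCH = 72.0
PNG_SIGNATURE = "\x89PNG\r\n\x1A\n".b.freeze

# Estimated pixel dimensions for a page of the given size in points.
def pixel_size(width_pt, height_pt, dpi)
  [(width_pt * dpi / POINTS_PER_INCH).round,
   (height_pt * dpi / POINTS_PER_INCH).round]
end

# True when `bytes` starts with the 8-byte PNG file signature.
def png_signature?(bytes)
  bytes.b.start_with?(PNG_SIGNATURE)
end

# US Letter is 612 x 792 points.
width, height = pixel_size(612, 792, 150)
puts "#{width} x #{height} px at 150 DPI"
```

A signature check like `png_signature?(png)` is a cheap sanity test before writing rendered bytes to disk or uploading them.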

PDF Creation

The Pdf class creates PDFs from Markdown, HTML, or plain text. Instances own a native handle; use the block form (auto-closes) or call #close yourself.

PdfOxide::Pdf.from_markdown("# Hello World\n\nThis is a PDF.") do |pdf|
  pdf.save('output.pdf')
end

PdfOxide::Pdf.from_html('<h1>Invoice</h1><p>Amount due: $42.00</p>') do |pdf|
  pdf.save('invoice.pdf')
end

PdfOxide::Pdf.from_text("Plain text document.\n\nSecond paragraph.") do |pdf|
  pdf.save('notes.pdf')
end

Grab the bytes instead of saving to disk with #to_bytes:

pdf = PdfOxide::Pdf.from_markdown('# Report')
pdf_bytes = pdf.to_bytes
pdf.close   # release the native handle once the bytes are copied out
# upload pdf_bytes, attach to an email, etc.

Redaction

DocumentEditor opens an existing PDF for destructive redaction. apply_redactions! permanently removes the covered content and can scrub document metadata in the same pass.

PdfOxide::DocumentEditor.open('source.pdf') do |ed|
  ed.add_redaction(page: 0, rect: [100, 200, 300, 250])
  ed.apply_redactions!(scrub_metadata: true)
  ed.save_to('redacted.pdf')
end

Error Handling

PDF Oxide raises typed subclasses of PdfOxide::Error for PDF-specific failures.

begin
  PdfOxide::PdfDocument.open('document.pdf') do |doc|
    puts doc.extract_text(0)
  end
rescue PdfOxide::FileNotFoundError
  warn 'File not found'
rescue PdfOxide::EncryptedError
  warn 'Wrong or missing password'
rescue PdfOxide::ParseError => e
  warn "Malformed PDF: #{e.message}"
rescue PdfOxide::Error => e
  warn "PDF error: #{e.message}"
end
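
The order of the `rescue` clauses above matters: Ruby tries them top to bottom and takes the first one whose class matches, so the base class must come last or it would shadow the specific subclasses. The sketch below demonstrates this with a hypothetical `DemoPdf` hierarchy that mirrors the shape of typed subclasses under one base error. It does not use or redefine the real `PdfOxide` classes.

```ruby
# Hypothetical stand-ins mirroring a typed error hierarchy.
module DemoPdf
  class Error < StandardError; end
  class ParseError < Error; end
  class EncryptedError < Error; end
end

# Raise the given error and report which rescue clause handled it.
def classify(error)
  raise error
rescue DemoPdf::EncryptedError
  :needs_password
rescue DemoPdf::ParseError
  :malformed
rescue DemoPdf::Error
  :other_pdf_error # catch-all for any other subclass of the base error
end

puts classify(DemoPdf::ParseError.new('bad xref'))
puts classify(DemoPdf::EncryptedError.new('locked'))
puts classify(DemoPdf::Error.new('generic'))
```

Errors outside the hierarchy, such as `ArgumentError`, are not caught by any of these clauses and propagate to the caller.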

Next Steps

Python Getting Started – using PDF Oxide from Python
Rust Getting Started – using PDF Oxide from Rust
Text Extraction – detailed extraction options and recipes
PDF Creation – advanced creation, encryption, and metadata
Editing – modifying existing PDFs, annotations, and form fields