What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (Ruby)

PDF Oxide is the fastest Ruby PDF library: 0.8 ms mean text extraction and a 100% pass rate across 3,830 PDFs. It is a single gem for extracting, searching, converting, creating, and redacting PDFs, built on the same Rust core that powers the bindings for Python, Java, Node, Go, C#, and PHP.

Installation

gem install pdf_oxide

Or add it to your Gemfile:

gem 'pdf_oxide', '~> 0.3'
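The `~> 0.3` constraint is RubyGems' pessimistic operator: it accepts any version from 0.3 up to, but not including, 1.0. The following sketch checks this with the `Gem::Requirement` class that ships with RubyGems. The version numbers in the list are illustrative only and are not a list of actual PDF Oxide releases.

```ruby
# Check which versions the Gemfile constraint '~> 0.3' accepts.
# Gem::Requirement and Gem::Version are part of RubyGems.
require 'rubygems'

CONSTRAINT = Gem::Requirement.new('~> 0.3')

# Returns true if the given version string satisfies the constraint.
def accepted?(version)
  CONSTRAINT.satisfied_by?(Gem::Version.new(version))
end

# Illustrative version numbers, not a release list.
%w[0.2.9 0.3.0 0.3.51 0.9.0 1.0.0].each do |v|
  puts "#{v}: #{accepted?(v) ? 'accepted' : 'rejected'}"
end
```

If you need to stay within one minor series instead, a three-part constraint such as `~> 0.3.0` would limit updates to 0.3.x.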

The precompiled native library libpdf_oxide ships inside the platform-specific gem, so no compiler or system-wide installation is needed. Precompiled gems cover Ruby 3.1–3.4 on x86_64-linux, aarch64-linux, Intel and Apple Silicon macOS, and Windows (x64-mingw-ucrt).

Opening a PDF

Use PdfDocument.open to load a file. The block form closes the document automatically when the block returns; #close is also available and is idempotent.

require 'pdf_oxide'

PdfOxide::PdfDocument.open('research-paper.pdf') do |doc|
  puts "Pages: #{doc.page_count}"
  puts "PDF version: #{doc.pdf_version}"
  puts "Encrypted: #{doc.encrypted?}"
end
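For readers new to the idiom, here is a minimal sketch of how a block-form open with an idempotent close is conventionally written in Ruby (the same pattern as `File.open`). `DemoHandle` is a hypothetical stand-in class. Whether PDF Oxide is implemented this way internally, and whether it also closes when the block raises, are assumptions.

```ruby
# Hypothetical stand-in for a class that owns a native handle.
class DemoHandle
  attr_reader :release_count

  def initialize
    @closed = false
    @release_count = 0
  end

  # Block form: yield the instance, then close it even if the block raises.
  # Without a block, the caller is responsible for calling #close.
  def self.open
    handle = new
    return handle unless block_given?

    begin
      yield handle
    ensure
      handle.close
    end
  end

  # Idempotent: only the first call releases the underlying resource.
  def close
    return if @closed

    @closed = true
    @release_count += 1
  end

  def closed?
    @closed
  end
end

captured = nil
begin
  DemoHandle.open do |h|
    captured = h
    raise 'boom'
  end
rescue RuntimeError
  # The ensure clause has already closed the handle at this point.
end

captured.close # a second close is a harmless no-op
puts "closed: #{captured.closed?}, releases: #{captured.release_count}"
```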

For encrypted documents, pass the password: keyword:

PdfOxide::PdfDocument.open('confidential.pdf', password: 'secret') do |doc|
  puts doc.extract_text(0)
end

You can also open from in-memory bytes, which is handy when streaming from S3, HTTP, or a database. PdfDocument.open detects raw PDF bytes automatically from the %PDF- magic header:

bytes = File.binread('report.pdf')
PdfOxide::PdfDocument.open(bytes) do |doc|
  puts doc.extract_text(0)
end
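The magic-header check can be illustrated in plain Ruby. This is only a sketch of the idea, not the library's actual detection code: `looks_like_pdf_bytes?` is a hypothetical helper that compares the first five bytes of a string against `%PDF-`.

```ruby
# The five-byte signature at the start of a PDF file, as a binary string.
PDF_MAGIC = '%PDF-'.b

# Hypothetical helper: true if the input is a String whose bytes begin
# with the PDF magic header, false for anything else (such as a file path).
def looks_like_pdf_bytes?(input)
  input.is_a?(String) && input.b.start_with?(PDF_MAGIC)
end

puts looks_like_pdf_bytes?("%PDF-1.7\n".b) # raw bytes
puts looks_like_pdf_bytes?('report.pdf')   # a path, not bytes
```

Comparing in binary encoding avoids encoding errors when the rest of the buffer contains bytes that are not valid UTF-8.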

Text Extraction

Single Page

Extract plain text from any page by its zero-based index.

PdfOxide::PdfDocument.open('report.pdf') do |doc|
  text = doc.extract_text(0)
  puts text
end

All Pages

PdfOxide::PdfDocument.open('book.pdf') do |doc|
  doc.page_count.times do |i|
    puts "--- Page #{i + 1} ---"
    puts doc.extract_text(i)
  end
end

One-Shot Helper

If you only need the text of a single page, PdfDocument.extract_text opens, extracts, and closes in a single call:

text = PdfOxide::PdfDocument.extract_text('report.pdf', page: 0)
puts text

Auto-Routed Extraction

extract_text_auto uses the auto-router from v0.3.51 to choose between native text and OCR for each page. In a build without the ocr feature, it falls back gracefully to the native text layer; it never raises an “OCR unavailable” error.

PdfOxide::PdfDocument.open('scan.pdf') do |doc|
  puts doc.extract_text_auto(0)
end

For a typed reason describing the extraction quality, use the AutoExtractor:

PdfOxide::PdfDocument.open('scan.pdf') do |doc|
  ax     = doc.auto_extractor
  result = ax.extract_page(0)
  puts result[:text]
  warn "degraded: #{result[:reason]}" unless ax.ok?(result[:reason])
end

Working with Pages

PdfDocument#page returns a lightweight PdfPage view borrowed from the document. #pages returns one such view for every page.

PdfOxide::PdfDocument.open('paper.pdf') do |doc|
  page = doc.page(0)
  puts "Index: #{page.index}"
  puts page.text   # same as doc.extract_text(0)

  doc.pages.each do |p|
    puts "Page #{p.index}: #{p.text.length} chars"
  end
end

Markdown & HTML Conversion

Convert a single page (pass its index) or the entire document (omit the index) to Markdown or HTML.

PdfOxide::PdfDocument.open('paper.pdf') do |doc|
  puts doc.to_markdown(0)   # first page to Markdown
  puts doc.to_html(0)       # first page to HTML
  puts doc.to_markdown      # entire document to Markdown
end

Structured Extraction

extract_structured returns the parsed page layout as a Hash: page dimensions plus typed regions with text, bounding boxes, and column indices.

PdfOxide::PdfDocument.open('paper.pdf') do |doc|
  page = doc.extract_structured(0)
  puts "Size: #{page['page_width']} x #{page['page_height']}"
  page['regions'].each do |region|
    puts "#{region['kind']}: #{region['text']}"
  end
end

Search

search scans the entire document and returns an Array of match hashes, each with :page, :text, and a :bbox hash containing :x, :y, :width, and :height.

PdfOxide::PdfDocument.open('manual.pdf') do |doc|
  matches = doc.search('configuration', case_sensitive: false)
  matches.each do |m|
    bbox = m[:bbox]
    puts "Page #{m[:page]}: '#{m[:text]}' at (#{bbox[:x].round}, #{bbox[:y].round})"
  end
end
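Because the matches are plain hashes, they can be post-processed with ordinary Enumerable methods. The sketch below groups matches by page. It runs on hand-written sample data shaped like the hashes described above; the values are made up for illustration, and `group_by_page` is a hypothetical helper, not part of the library.

```ruby
# Hypothetical helper: group match hashes by their :page key and
# return [page, matches] pairs ordered by page index.
def group_by_page(matches)
  matches.group_by { |m| m[:page] }.sort_by { |page, _hits| page }
end

# Illustrative sample data, not real search output.
sample_matches = [
  { page: 2, text: 'Configuration', bbox: { x: 72.0, y: 540.0, width: 88.5, height: 12.0 } },
  { page: 0, text: 'configuration', bbox: { x: 72.0, y: 700.5, width: 80.2, height: 12.0 } },
  { page: 0, text: 'configuration', bbox: { x: 310.4, y: 420.0, width: 80.2, height: 12.0 } }
]

group_by_page(sample_matches).each do |page, hits|
  puts "Page #{page}: #{hits.length} match(es)"
end
```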

Rendering

Render a page to PNG bytes at a given DPI resolution:

PdfOxide::PdfDocument.open('poster.pdf') do |doc|
  png = doc.render(0, dpi: 150)
  File.binwrite('page-0.png', png)
end
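To estimate the size of the output image, recall that PDF page dimensions are measured in points, with 72 points per inch by default. The sketch below applies that arithmetic; `pixel_size` is a hypothetical helper, and the exact rounding used by the renderer is an assumption.

```ruby
# Default PDF user-space unit: 1 point = 1/72 inch.
POINTS_PER_INCH = 72.0

# Hypothetical helper: convert a page size in points to pixels at a given DPI.
def pixel_size(width_pt, height_pt, dpi)
  [
    (width_pt * dpi / POINTS_PER_INCH).round,
    (height_pt * dpi / POINTS_PER_INCH).round
  ]
end

# US Letter is 612 x 792 points (8.5 x 11 inches).
width_px, height_px = pixel_size(612, 792, 150)
puts "#{width_px} x #{height_px}" # prints "1275 x 1650"
```

Doubling the DPI quadruples the pixel count, so memory use and PNG size grow quickly at higher resolutions.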

PDF Creation

The Pdf class creates PDFs from Markdown, HTML, or plain text. Instances own a native handle; use the block form (which closes automatically) or call #close yourself.

PdfOxide::Pdf.from_markdown("# Hello World\n\nThis is a PDF.") do |pdf|
  pdf.save('output.pdf')
end

PdfOxide::Pdf.from_html('<h1>Invoice</h1><p>Amount due: $42.00</p>') do |pdf|
  pdf.save('invoice.pdf')
end

PdfOxide::Pdf.from_text("Plain text document.\n\nSecond paragraph.") do |pdf|
  pdf.save('notes.pdf')
end

Use #to_bytes to get the bytes instead of saving to disk:

pdf_bytes = PdfOxide::Pdf.from_markdown('# Report').to_bytes
# upload pdf_bytes, attach to an email, etc.

Redaction

DocumentEditor opens an existing PDF for destructive redaction. apply_redactions! permanently removes the covered content and can scrub the document metadata in the same pass.

PdfOxide::DocumentEditor.open('source.pdf') do |ed|
  ed.add_redaction(page: 0, rect: [100, 200, 300, 250])
  ed.apply_redactions!(scrub_metadata: true)
  ed.save_to('redacted.pdf')
end

Error Handling

PDF Oxide raises typed subclasses of PdfOxide::Error for PDF-specific errors.

begin
  PdfOxide::PdfDocument.open('document.pdf') do |doc|
    puts doc.extract_text(0)
  end
rescue PdfOxide::FileNotFoundError
  warn 'File not found'
rescue PdfOxide::EncryptedError
  warn 'Wrong or missing password'
rescue PdfOxide::ParseError => e
  warn "Malformed PDF: #{e.message}"
rescue PdfOxide::Error => e
  warn "PDF error: #{e.message}"
end
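The order of the rescue clauses matters: Ruby tries them top to bottom and takes the first one that matches, so the base class has to come last. The sketch below demonstrates this with a stand-in hierarchy. The `DemoPdf` module and its classes are hypothetical and only mirror the "typed subclasses of a base error" shape described above.

```ruby
# Stand-in error hierarchy (hypothetical; not the library's classes).
module DemoPdf
  class Error < StandardError; end
  class ParseError < Error; end
end

# Specific class first: the ParseError clause handles the error.
def classify_specific_first
  raise DemoPdf::ParseError, 'bad xref table'
rescue DemoPdf::ParseError
  :parse_error
rescue DemoPdf::Error
  :generic_error
end

# Base class first: it matches every subclass, so the second
# clause can never be reached.
def classify_base_first
  raise DemoPdf::ParseError, 'bad xref table'
rescue DemoPdf::Error
  :generic_error
rescue DemoPdf::ParseError
  :parse_error
end

puts classify_specific_first.inspect
puts classify_base_first.inspect
```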

Next Steps

Getting Started with Python – use PDF Oxide from Python
Getting Started with Rust – use PDF Oxide from Rust
Text Extraction – detailed extraction options and recipes
PDF Creation – advanced creation, encryption, and metadata
Editing – modify existing PDFs, annotations, and form fields