What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (Elixir)

PDF Oxide is the fastest way to read and write PDFs from Elixir: 0.8 ms mean text extraction and a 100% pass rate across 3,830 PDFs. It is a NIF over the same Rust core and runs CPU-bound work on dirty CPU schedulers (ERL_NIF_DIRTY_JOB_CPU_BOUND), so the regular BEAM schedulers are never blocked.

Document and Pdf handles are NIF resources that are released by the garbage collector. Fallible functions return {:ok, value} or {:error, code}, and page indices are 0-based.

Installation

Add pdf_oxide to the dependencies in your mix.exs:

def deps do
  [
    {:pdf_oxide, "~> 0.3"}
  ]
end

Then fetch and compile the dependencies; the NIF is built via elixir_make against the native cdylib:

mix deps.get
mix compile

Quick Start

Create a PDF from Markdown, serialize it to bytes, then reopen it and extract the text again.

{:ok, pdf}   = PdfOxide.from_markdown("# Hello pdf_oxide\n\nThis is an **Elixir** binding.\n")
{:ok, bytes} = PdfOxide.to_bytes(pdf)
{:ok, doc}   = PdfOxide.open_from_bytes(bytes)

{:ok, pages} = PdfOxide.page_count(doc)
IO.puts("pages: #{pages}")

%{major: maj, minor: min} = PdfOxide.version(doc)
IO.puts("version: #{maj}.#{min}")

{:ok, text} = PdfOxide.extract_text(doc, 0)
IO.puts(text)

Opening a PDF

Open a PDF from a file path or directly from bytes in memory (useful when streaming from S3, HTTP, or a database):

# From a path
{:ok, doc} = PdfOxide.open("report.pdf")

# From bytes already in memory
{:ok, doc} = PdfOxide.open_from_bytes(pdf_bytes)

# Encrypted documents
{:ok, doc} = PdfOxide.open_with_password("confidential.pdf", "secret")

# Inspect
{:ok, count} = PdfOxide.page_count(doc)
encrypted? = PdfOxide.encrypted?(doc)

Close a document explicitly when you are done (close/1 is idempotent), or leave the release to the GC:

:ok = PdfOxide.close(doc)
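
Explicit closing pairs naturally with a scoped helper that guarantees the handle is released even when the work in between raises. The following is a minimal sketch, not part of the library: the module name PdfScope and the function with_document/2 are hypothetical, and it relies only on open/1 and close/1 as shown above.

```elixir
defmodule PdfScope do
  @moduledoc "Hypothetical helper: open a document, run a function, always close."

  def with_document(path, fun) when is_function(fun, 1) do
    case PdfOxide.open(path) do
      {:ok, doc} ->
        try do
          fun.(doc)
        after
          # close/1 is described as idempotent, so an extra close
          # performed inside fun would not cause a problem here.
          PdfOxide.close(doc)
        end

      {:error, _code} = error ->
        # Nothing was opened, so there is nothing to close.
        error
    end
  end
end
```

A call such as PdfScope.with_document("report.pdf", &PdfOxide.page_count/1) would then return whatever page_count/1 returns, or the {:error, code} tuple from open/1.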

Text Extraction

Extract plain text from a single page by its zero-based index, or fetch the entire document at once:

{:ok, doc} = PdfOxide.open("book.pdf")

# A single page
{:ok, text} = PdfOxide.extract_text(doc, 0)

# Plain text, one page
{:ok, pt} = PdfOxide.to_plain_text(doc, 0)

# Every page, concatenated
{:ok, all} = PdfOxide.to_plain_text_all(doc)
IO.puts(all)
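
When pages need to be handled one at a time rather than concatenated, the zero-based indices can be derived from page_count/1. The sketch below is one way to do that, assuming Elixir 1.12 or later for the stepped range syntax; the module PageIter and its functions are hypothetical names, and only page_count/1 and extract_text/2 come from the library as shown above.

```elixir
defmodule PageIter do
  @moduledoc "Hypothetical helper for walking pages by zero-based index."

  # Zero-based indices for a document with `count` pages.
  def page_indices(count) when is_integer(count) and count >= 0 do
    # The explicit //1 step keeps the range empty when count is 0.
    # Without it, 0..-1 can be read as a descending range over 0 and -1.
    0..(count - 1)//1
  end

  # Extracts each page separately and stops at the first failing page.
  def extract_pages(doc) do
    with {:ok, count} <- PdfOxide.page_count(doc) do
      result =
        Enum.reduce_while(page_indices(count), {:ok, []}, fn index, {:ok, acc} ->
          case PdfOxide.extract_text(doc, index) do
            {:ok, text} -> {:cont, {:ok, [text | acc]}}
            {:error, code} -> {:halt, {:error, {index, code}}}
          end
        end)

      case result do
        # Texts were accumulated in reverse order, so restore page order.
        {:ok, texts} -> {:ok, Enum.reverse(texts)}
        {:error, _} = error -> error
      end
    end
  end
end
```

On failure the error tuple carries the index of the offending page alongside the code, which makes it easier to tell which page could not be read.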

Markdown & HTML Conversion

Convert a page, or the entire document, to Markdown or HTML:

{:ok, doc} = PdfOxide.open("paper.pdf")

{:ok, md}    = PdfOxide.to_markdown(doc, 0)
{:ok, mdall} = PdfOxide.to_markdown_all(doc)

{:ok, html}    = PdfOxide.to_html(doc, 0)
{:ok, htmlall} = PdfOxide.to_html_all(doc)
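
Because the NIF does its CPU-bound work on dirty schedulers, converting a batch of files can be fanned out across processes with Task.async_stream/3 from the standard library. This is a rough sketch under stated assumptions: the module BatchConvert is hypothetical, each task opens its own document, and whether that concurrency level suits a given workload is untested here.

```elixir
defmodule BatchConvert do
  @moduledoc "Hypothetical helper: convert several PDFs to Markdown concurrently."

  # Converts one file; the handle is left for the GC to release.
  def to_markdown(path) do
    with {:ok, doc} <- PdfOxide.open(path) do
      PdfOxide.to_markdown_all(doc)
    end
  end

  # Returns a map of path => {:ok, markdown} | {:error, code}.
  # `convert` is injectable so the fan-out logic can be exercised in isolation.
  def run(paths, convert \\ &to_markdown/1) do
    paths
    # Results come back in input order by default (ordered: true).
    |> Task.async_stream(convert, ordered: true, timeout: :infinity)
    |> Enum.zip(paths)
    |> Map.new(fn {{:ok, result}, path} -> {path, result} end)
  end
end
```

Calling BatchConvert.run(["a.pdf", "b.pdf"]) would yield one entry per path, so a failure on one file does not hide the results of the others.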

Words & Lines

extract_words/2 returns structured PdfOxide.Word structs with a bounding box and a bold flag; extract_text_lines/2 groups them into lines.

{:ok, doc} = PdfOxide.open("paper.pdf")

{:ok, words} = PdfOxide.extract_words(doc, 0)

for w <- Enum.take(words, 10) do
  %PdfOxide.Bbox{x: x, y: y, width: width} = w.bbox
  IO.puts("#{w.text} at (#{x}, #{y}) w=#{width} bold=#{w.bold}")
end

{:ok, lines} = PdfOxide.extract_text_lines(doc, 0)

for line <- lines do
  IO.puts("#{line.word_count} words: #{line.text}")
end
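
The bold flag can be used for simple post-processing, for example collecting emphasized words as rough heading candidates. The sketch below assumes the flag is a boolean and that each word exposes text and bold fields as in the loop above; the module WordFilter is a hypothetical name.

```elixir
defmodule WordFilter do
  @moduledoc "Hypothetical helper: pick out the bold words from a word list."

  # Joins the text of all bold words with single spaces, in document order.
  def bold_text(words) do
    words
    |> Enum.filter(& &1.bold)
    |> Enum.map_join(" ", & &1.text)
  end
end
```

Passing the words list returned by extract_words/2 to WordFilter.bold_text/1 gives a single string of the bold words on that page.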

Search

Search a single page or the entire document. The final argument is case_sensitive (the fourth argument of search/4, the third of search_all/3). Each result contains text, page, and a PdfOxide.Bbox.

{:ok, doc} = PdfOxide.open("manual.pdf")

# One page (page index 0), case-insensitive
{:ok, results} = PdfOxide.search(doc, 0, "configuration", false)

for r <- results do
  %PdfOxide.Bbox{x: x, y: y} = r.bbox
  IO.puts("page #{r.page}: '#{r.text}' at (#{x}, #{y})")
end

# All pages
{:ok, all} = PdfOxide.search_all(doc, "configuration", false)
IO.puts("#{length(all)} matches")
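
Since every result carries its page, the output of search_all/3 can be summarized per page with standard Enum functions. A small sketch, assuming only the page field described above; SearchSummary is a hypothetical module name.

```elixir
defmodule SearchSummary do
  @moduledoc "Hypothetical helper: count search hits per page."

  # Returns a list of {page, hit_count} tuples sorted by page index.
  def hits_per_page(results) do
    results
    |> Enum.frequencies_by(& &1.page)
    |> Enum.sort()
  end
end
```

For example, SearchSummary.hits_per_page(all) turns the flat match list into a compact overview of where the term occurs most often.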

PDF Creation

The builder factory functions return a Pdf handle that you serialize with to_bytes/1 or write directly to disk with save/2:

{:ok, pdf} = PdfOxide.from_markdown("# Hello World\n\nThis is a PDF.")
:ok = PdfOxide.save(pdf, "output.pdf")

{:ok, pdf} = PdfOxide.from_html("<h1>Invoice</h1><p>Amount: $42</p>")
{:ok, bytes} = PdfOxide.to_bytes(pdf)

{:ok, pdf} = PdfOxide.from_text("Plain text content.")
:ok = PdfOxide.save(pdf, "notes.pdf")

Rendering Pages as Images

With the rendering feature, you rasterize a page to a PdfOxide.RenderedImage and save it as a PNG:

{:ok, doc} = PdfOxide.open("paper.pdf")

{:ok, img} = PdfOxide.render_page(doc, 0)
IO.puts("#{img.width}x#{img.height}, #{byte_size(img.data)} bytes")
:ok = PdfOxide.save(img, "page0.png")

# Zoom factor, or a fixed-size thumbnail
{:ok, zoomed} = PdfOxide.render_page_zoom(doc, 0, 2.0)
{:ok, thumb}  = PdfOxide.render_page_thumbnail(doc, 0, 128)

Error Handling

Fallible functions return a tagged tuple; use pattern matching for clean control flow:

case PdfOxide.open("/nonexistent/nope.pdf") do
  {:ok, doc} ->
    {:ok, text} = PdfOxide.extract_text(doc, 0)
    IO.puts(text)

  {:error, code} ->
    IO.puts("could not open PDF: #{inspect(code)}")
end
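
When several fallible calls are chained, a with expression keeps the happy path flat and lets the first {:error, code} fall through. The sketch below is illustrative only: the module FirstPage and the :empty_document atom are hypothetical, and the library calls are the ones shown earlier on this page.

```elixir
defmodule FirstPage do
  @moduledoc "Hypothetical helper: read the text of the first page of a PDF."

  def read(path) do
    with {:ok, doc} <- PdfOxide.open(path),
         # Guard against documents without pages before asking for index 0.
         {:ok, count} when count > 0 <- PdfOxide.page_count(doc),
         {:ok, text} <- PdfOxide.extract_text(doc, 0) do
      {:ok, text}
    else
      # :empty_document is an illustrative atom, not a library error code.
      {:ok, 0} -> {:error, :empty_document}
      {:error, code} -> {:error, code}
    end
  end
end
```

A caller then has a single result to match on: {:ok, text} on success, or {:error, reason} for any failure along the way.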

Next Steps

Getting Started with Rust: using PDF Oxide from Rust
Getting Started with Python: using PDF Oxide from Python
Text Extraction: detailed extraction options and recipes
PDF Creation: advanced creation with metadata and encryption
Editing: modifying existing PDFs, annotations, and form fields