What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting started with PDF Oxide (R)

PDF Oxide ships idiomatic R bindings for fast extraction of PDF text, Markdown, and HTML (0.8 ms mean text extraction, 100% pass rate on 3,830 PDFs), backed by the same Rust core as every other binding. The R package wraps the pdf_oxide C ABI through R's .Call interface. Document handles are R external pointers that the garbage collector releases, and page indices are 0-based, matching the underlying engine.

Installation

The R package links against the cdylib built with the default feature set. Build the native library first, then install the package, pointing it at the header and the cdylib:

# 1. build the native library (shipped binding feature set)
cargo build --release --lib \
  --features ocr,rendering,signatures,barcodes,tsa-client,system-fonts

# 2. install the R package
PDF_OXIDE_INCLUDE_DIR="$PWD/include" PDF_OXIDE_LIB_DIR="$PWD/target/release" \
  R CMD INSTALL r/

Make the cdylib discoverable by the dynamic linker at run time (shown here with LD_LIBRARY_PATH, as used on Linux):

LD_LIBRARY_PATH="$PWD/target/release" Rscript your_script.R

Opening a PDF

Open a file with pdf_open(), then inspect its metadata. pdf_version() returns a named list with major and minor.

library(pdfoxide)

doc <- pdf_open("research-paper.pdf")

pdf_page_count(doc)               # number of pages
v <- pdf_version(doc)
cat("PDF version:", paste(v$major, v$minor, sep = "."), "\n")
pdf_is_encrypted(doc)             # logical

Text extraction

Extract the text of a single page, in reading order, with pdf_extract_text(). The page index is 0-based.

library(pdfoxide)

doc <- pdf_open("report.pdf")
text <- pdf_extract_text(doc, 0)  # 0-based page index
cat(text)

Use pdf_page_count() to loop over every page:

doc <- pdf_open("book.pdf")
for (page in seq_len(pdf_page_count(doc)) - 1L) {   # 0-based indices
  cat("--- Page", page + 1L, "---\n")
  cat(pdf_extract_text(doc, page), "\n")
}
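
The loop above can also be wrapped in a small helper that returns one string per page. The following is a minimal sketch, not part of the package: extract_all_text() and its n_pages and extract arguments are hypothetical names introduced here so the index arithmetic can be tried without a real document. By default it would fall back to pdf_page_count() and pdf_extract_text() as shown above.

```r
# Hypothetical helper (not part of pdfoxide): collect the text of every page.
# `n_pages` and `extract` default to the package functions used above, but can
# be swapped for stand-ins, as done in the mock call below.
extract_all_text <- function(doc,
                             n_pages = pdf_page_count(doc),
                             extract = pdf_extract_text) {
  # seq_len(n) - 1L yields 0-based indices and is empty when n is 0,
  # unlike 0:(n - 1), which would yield c(0, -1) for an empty document.
  pages <- seq_len(n_pages) - 1L
  vapply(pages, function(p) extract(doc, p), character(1))
}

# Mock run: a stand-in extractor instead of a real PDF handle.
fake_extract <- function(doc, page) sprintf("page %d", page)
out <- extract_all_text(NULL, n_pages = 3L, extract = fake_extract)
print(out)
```

With a real handle the call would simply be extract_all_text(doc), assuming the functions behave as described in this section.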

Markdown and HTML

Convert a single page to Markdown or HTML, or convert the whole document at once.

library(pdfoxide)

doc <- pdf_open("paper.pdf")

md  <- pdf_to_markdown(doc, 0)    # one page as Markdown
html <- pdf_to_html(doc, 0)       # one page as HTML

all_md   <- pdf_to_markdown_all(doc)    # whole document
all_text <- pdf_to_plain_text_all(doc)  # whole document, plain text

cat(all_md)

Words, characters, and lines

Element extraction returns lists of records with positioned bounding boxes. Each bbox is a named list with x, y, width, and height.

library(pdfoxide)

doc <- pdf_open("paper.pdf")

# Positioned words — each has $text, $bbox, $font_name, $font_size, $bold
words <- pdf_extract_words(doc, 0)
for (w in head(words, 10)) {
  cat(sprintf("'%s' at (%.1f, %.1f) font=%s bold=%s\n",
              w$text, w$bbox$x, w$bbox$y, w$font_name, w$bold))
}

# Reading-order lines — each has $text, $bbox, $word_count
lines <- pdf_extract_text_lines(doc, 0)
for (ln in head(lines, 5)) {
  cat(sprintf("[%d words] %s\n", ln$word_count, ln$text))
}

# Positioned characters — $character is the Unicode codepoint (integer)
chars <- pdf_extract_chars(doc, 0)
for (ch in head(chars, 10)) {
  cat(sprintf("'%s' at (%.1f, %.1f) size=%.1f\n",
              intToUtf8(ch$character), ch$bbox$x, ch$bbox$y, ch$font_size))
}
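
Because the records are plain R lists, they can be flattened into a data frame for filtering or sorting. The sketch below assumes the word fields listed above ($text, $bbox, $font_name, $font_size, $bold). words_to_df() is a hypothetical helper, and the mock records stand in for the result of pdf_extract_words(doc, 0).

```r
# Hypothetical helper (not part of pdfoxide): flatten word records into a
# data frame with one row per word and one column per field.
words_to_df <- function(words) {
  num <- function(f) vapply(words, function(w) as.numeric(f(w)), numeric(1))
  data.frame(
    text      = vapply(words, function(w) w$text, character(1)),
    x         = num(function(w) w$bbox$x),
    y         = num(function(w) w$bbox$y),
    width     = num(function(w) w$bbox$width),
    height    = num(function(w) w$bbox$height),
    font_name = vapply(words, function(w) w$font_name, character(1)),
    font_size = num(function(w) w$font_size),
    bold      = vapply(words, function(w) isTRUE(w$bold), logical(1)),
    stringsAsFactors = FALSE
  )
}

# Mock records with the assumed shape; the values are made up for illustration.
mock_words <- list(
  list(text = "Hello", bbox = list(x = 72, y = 700, width = 30, height = 12),
       font_name = "Helvetica", font_size = 12, bold = TRUE),
  list(text = "world", bbox = list(x = 106, y = 700, width = 28, height = 12),
       font_name = "Helvetica", font_size = 12, bold = FALSE)
)

words_df <- words_to_df(mock_words)
print(words_df[words_df$bold, c("text", "x", "y")])  # bold words only
```

A data frame makes it easy to pick out headings, for example by selecting rows with a large font_size or bold set to TRUE.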

Tables

pdf_extract_tables() returns the detected tables. Each table record contains row_count, col_count, has_header, and cells, a character matrix indexed 1-based as tbl$cells[row, col].

library(pdfoxide)

doc <- pdf_open("statement.pdf")
tables <- pdf_extract_tables(doc, 0)

for (tbl in tables) {
  cat(sprintf("Table: %d rows x %d cols (header=%s)\n",
              tbl$row_count, tbl$col_count, tbl$has_header))
  for (r in seq_len(tbl$row_count)) {
    cat(paste(tbl$cells[r, ], collapse = " | "), "\n")
  }
}
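
Since cells is an ordinary character matrix, a table record can be turned into a data frame with base R. The sketch below assumes the record shape described above. table_to_df() is a hypothetical helper, and the mock record stands in for one element of pdf_extract_tables(doc, 0).

```r
# Hypothetical helper (not part of pdfoxide): convert one table record to a
# data frame, using the first row as column names when has_header is TRUE.
table_to_df <- function(tbl) {
  cells <- tbl$cells
  if (isTRUE(tbl$has_header) && nrow(cells) >= 1L) {
    header <- cells[1, ]
    body <- cells[-1, , drop = FALSE]   # drop = FALSE keeps a matrix for 1 row
  } else {
    header <- paste0("V", seq_len(ncol(cells)))
    body <- cells
  }
  df <- as.data.frame(body, stringsAsFactors = FALSE)
  names(df) <- make.unique(header)      # guard against repeated header text
  df
}

# Mock record with the assumed shape; the values are made up for illustration.
mock_tbl <- list(
  row_count = 3L, col_count = 2L, has_header = TRUE,
  cells = matrix(c("Item",   "Amount",
                   "Coffee", "3.50",
                   "Bagel",  "2.25"),
                 nrow = 3, byrow = TRUE)
)

tbl_df <- table_to_df(mock_tbl)
print(tbl_df)
```

All columns stay character vectors; converting numeric columns (for example with as.numeric(tbl_df$Amount)) is left to the caller, since cell contents are not guaranteed to parse.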

Search

Search a single page with pdf_search() or the whole document with pdf_search_all(). Both accept an optional case_sensitive flag (default FALSE) and return records with text, page, and bbox.

library(pdfoxide)

doc <- pdf_open("manual.pdf")

# Whole document
hits <- pdf_search_all(doc, "configuration")
for (h in hits) {
  cat(sprintf("Page %d: '%s' at (%.0f, %.0f)\n",
              h$page, h$text, h$bbox$x, h$bbox$y))
}

# Single page, case-sensitive
page_hits <- pdf_search(doc, 0, "Configuration", case_sensitive = TRUE)
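
To summarise where a term occurs, the hit records can be tallied by page. This is a minimal sketch assuming each hit carries a $page field as described above. hits_per_page() is a hypothetical helper, the mock hits stand in for the result of pdf_search_all(), and page numbers are reported exactly as the records give them.

```r
# Hypothetical helper (not part of pdfoxide): count search hits per page.
# Returns a named integer vector whose names are the page values.
hits_per_page <- function(hits) {
  pages <- vapply(hits, function(h) as.integer(h$page), integer(1))
  counts <- table(pages)
  setNames(as.integer(counts), names(counts))
}

# Mock hits with the assumed shape; the values are made up for illustration.
mock_hits <- list(
  list(text = "configuration", page = 0L, bbox = list(x = 72,  y = 640)),
  list(text = "configuration", page = 0L, bbox = list(x = 72,  y = 410)),
  list(text = "Configuration", page = 2L, bbox = list(x = 120, y = 700))
)

hit_counts <- hits_per_page(mock_hits)
print(hit_counts)
```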

Opening from bytes

Open a PDF held in memory with pdf_open_from_bytes(), which takes a raw vector. This is handy when reading from S3, HTTP, or a database.

library(pdfoxide)

bytes <- readBin("report.pdf", "raw", file.info("report.pdf")$size)
doc <- pdf_open_from_bytes(bytes)
cat(pdf_extract_text(doc, 0))

Password-protected PDFs

Open an encrypted document with pdf_open_with_password(), or call pdf_authenticate() after opening (it returns TRUE on success and FALSE if the password is wrong).

library(pdfoxide)

doc <- pdf_open_with_password("confidential.pdf", "secret")
cat(pdf_extract_text(doc, 0))

Creating PDFs

The builder functions create a pdfoxide_pdf from Markdown, HTML, or plain text. Save it to a path with pdf_save(), or serialize it to a raw vector with pdf_to_bytes() (which can be reopened with pdf_open_from_bytes()).

library(pdfoxide)

pdf <- pdf_from_markdown("# Hello World\n\nThis is a PDF.\n")
pdf_save(pdf, "output.pdf")

pdf_from_html("<h1>Invoice</h1><p>Amount due: $42.00</p>") |>
  pdf_save("invoice.pdf")

pdf_from_text("Plain text document.\n\nSecond paragraph.") |>
  pdf_save("notes.pdf")

# Round-trip through bytes
bytes <- pdf_to_bytes(pdf_from_markdown("# In memory\n\nbody\n"))
doc <- pdf_open_from_bytes(bytes)
cat(pdf_extract_text(doc, 0))

Next steps

Getting started with Python – using PDF Oxide from Python
Getting started with Rust – the underlying Rust crate
Text extraction – detailed extraction options and recipes
PDF creation – advanced creation with builders, encryption, and metadata