What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (Swift)

PDF Oxide is the fastest PDF library with built-in text extraction: 0.8 ms mean, 100% pass rate on 3,830 PDFs. The Swift binding, new in v0.3.69, wraps the Rust core through a C ABI: handles are owned by classes (released in deinit), C buffers are copied into Swift String/[UInt8] values, and error codes are thrown as PdfOxideError.
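
The ownership rules above (a class owns the handle, C buffers are copied and then freed, error codes become thrown errors) can be sketched in plain Swift. This is only an illustration of the pattern, not the binding's actual source: the "C" functions are simulated in Swift so the snippet runs on its own, and every name in it (demoOpen, demoFree, demoText, DemoError, DemoDocument) is hypothetical.

```swift
// Stand-ins for a C ABI so the sketch runs without any native library.
// None of these names are part of pdf_oxide.

struct DemoError: Error, Equatable {
    let operation: String
    let code: Int32
}

// Pretend C call: returns a heap handle, or nil plus a nonzero error code.
func demoOpen(succeed: Bool, errorCode: inout Int32) -> UnsafeMutablePointer<Int32>? {
    guard succeed else {
        errorCode = 1
        return nil
    }
    errorCode = 0
    let handle = UnsafeMutablePointer<Int32>.allocate(capacity: 1)
    handle.initialize(to: 42)
    return handle
}

// Pretend C call: releases a handle returned by demoOpen.
func demoFree(_ handle: UnsafeMutablePointer<Int32>) {
    handle.deinitialize(count: 1)
    handle.deallocate()
}

// Pretend C call: returns a NUL-terminated buffer that the caller must free.
func demoText(_ handle: UnsafeMutablePointer<Int32>) -> UnsafeMutablePointer<CChar> {
    let bytes = Array("value=\(handle.pointee)".utf8CString) // includes the trailing NUL
    let buffer = UnsafeMutablePointer<CChar>.allocate(capacity: bytes.count)
    buffer.initialize(from: bytes, count: bytes.count)
    return buffer
}

// The class owns the handle for its whole lifetime and frees it in deinit.
final class DemoDocument {
    private let handle: UnsafeMutablePointer<Int32>

    init(succeed: Bool = true) throws {
        var code: Int32 = 0
        guard let opened = demoOpen(succeed: succeed, errorCode: &code) else {
            // Error code from the C side becomes a thrown Swift error.
            throw DemoError(operation: "open", code: code)
        }
        handle = opened
    }

    deinit { demoFree(handle) }

    func text() -> String {
        let buffer = demoText(handle)
        defer { buffer.deallocate() }      // free the C buffer after copying
        return String(cString: buffer)     // copy into a Swift-owned String
    }
}

do {
    let doc = try DemoDocument()
    print(doc.text())                       // prints "value=42"
    _ = try DemoDocument(succeed: false)    // throws DemoError
} catch {
    print("failed: \(error)")
}
```

Copying the buffer before freeing it means no Swift value ever points into memory owned by the native side, which is why callers of a binding built this way never free anything themselves.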

Installation

The binding links the cdylib built with the standard feature set. Build the native library, then point SwiftPM at the headers and the library:

# 1. build the native library (shipped binding feature set)
cargo build --release --lib --features ocr,rendering,signatures,barcodes,tsa-client,system-fonts

# 2. test + run the example (Package.swift reads PDF_OXIDE_INCLUDE_DIR and PDF_OXIDE_LIB_DIR)
# DYLD_LIBRARY_PATH is the macOS loader variable; on Linux use LD_LIBRARY_PATH instead
cd swift
export PDF_OXIDE_INCLUDE_DIR="$PWD/../include"
export PDF_OXIDE_LIB_DIR="$PWD/../target/release"
DYLD_LIBRARY_PATH="$PDF_OXIDE_LIB_DIR" swift test
DYLD_LIBRARY_PATH="$PDF_OXIDE_LIB_DIR" swift run basic_extraction

Quick Start

Create a PDF from Markdown, open it from the generated bytes, and extract its text. The whole round trip runs without any external file:

import PdfOxide

let pdf = try Pdf.fromMarkdown("# Hello pdf_oxide\n\nThis is a **Swift** binding.\n")
let doc = try Document.openFromBytes(try pdf.toBytes())

print("pages:   \(try doc.pageCount())")
print("version: \(try doc.version())")
print(try doc.extractText(0))

To open a file from disk, use Document.open(_:):

import PdfOxide

let doc = try Document.open("research-paper.pdf")
print("Pages:   \(try doc.pageCount())")
print("Version: \(try doc.version())")        // e.g. 1.7

Text Extraction

extractText(_:) returns the text of a single page, addressed by its zero-based index. Iterate over pageCount() to read the whole document:

import PdfOxide

let doc = try Document.open("book.pdf")
for i in 0..<(try doc.pageCount()) {
    print("--- Page \(i + 1) ---")
    print(try doc.extractText(i))
}

toPlainText(_:) returns a simplified, layout-free variant, and the *All() methods extract every page at once:

let doc = try Document.open("report.pdf")
let plain = try doc.toPlainText(0)            // single page, no layout
let everything = try doc.toPlainTextAll()     // all pages concatenated

Words & Characters

extractWords(_:) returns a [Word] with a bounding box and font metadata for each word. extractChars(_:) returns a [Char] with per-character positioning:

import PdfOxide

let doc = try Document.open("paper.pdf")

let words = try doc.extractWords(0)
for word in words.prefix(10) {
    print("'\(word.text)' at (\(word.bbox.x), \(word.bbox.y)) "
        + "font=\(word.fontName) size=\(word.fontSize) bold=\(word.bold)")
}

let chars = try doc.extractChars(0)
for ch in chars.prefix(10) {
    let scalar = Unicode.Scalar(ch.character).map(String.init) ?? "?"
    print("'\(scalar)' size=\(ch.fontSize) font=\(ch.fontName)")
}

Word fields: text (String), bbox (Bbox), fontName (String), fontSize (Double), bold (Bool). Char fields: character (UInt32 code point), bbox, fontName, fontSize. A Bbox exposes x, y, width, and height as Double.
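
As a rough illustration of what the per-word fields allow, the sketch below picks out bold, large words as heading candidates and computes the box that encloses a set of words. DemoBbox and DemoWord are local stand-ins that only mirror the fields listed above so the snippet runs without the library; the helper names and the size threshold are hypothetical, and the same logic would need to be pointed at the values returned by extractWords(_:).

```swift
// Local stand-ins mirroring the documented fields; not the binding's own types.
struct DemoBbox: Equatable {
    var x: Double
    var y: Double
    var width: Double
    var height: Double
}

struct DemoWord {
    var text: String
    var bbox: DemoBbox
    var fontName: String
    var fontSize: Double
    var bold: Bool
}

// Rough heading heuristic: bold words at or above a minimum font size.
func headingCandidates(_ words: [DemoWord], minSize: Double) -> [String] {
    words.filter { $0.bold && $0.fontSize >= minSize }.map { $0.text }
}

// Smallest box that contains every word; nil when there are no words.
func unionBox(_ words: [DemoWord]) -> DemoBbox? {
    guard let first = words.first else { return nil }
    var minX = first.bbox.x
    var minY = first.bbox.y
    var maxX = first.bbox.x + first.bbox.width
    var maxY = first.bbox.y + first.bbox.height
    for word in words.dropFirst() {
        minX = min(minX, word.bbox.x)
        minY = min(minY, word.bbox.y)
        maxX = max(maxX, word.bbox.x + word.bbox.width)
        maxY = max(maxY, word.bbox.y + word.bbox.height)
    }
    return DemoBbox(x: minX, y: minY, width: maxX - minX, height: maxY - minY)
}

let sampleWords = [
    DemoWord(text: "Results", bbox: DemoBbox(x: 10, y: 20, width: 30, height: 10),
             fontName: "Helvetica-Bold", fontSize: 18, bold: true),
    DemoWord(text: "follow", bbox: DemoBbox(x: 50, y: 20, width: 40, height: 12),
             fontName: "Helvetica", fontSize: 11, bold: false),
]

print(headingCandidates(sampleWords, minSize: 14))   // ["Results"]
if let box = unionBox(sampleWords) {
    print("union width: \(box.width), height: \(box.height)")
}
```

The union does not depend on whether y grows upward or downward, so it works regardless of the coordinate convention the page uses.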

You can also read text line by line with extractTextLines(_:), which returns a [TextLine] (text, bbox, wordCount):

let lines = try doc.extractTextLines(0)
for line in lines {
    print("\(line.wordCount) words: \(line.text)")
}

Markdown & HTML Conversion

Convert a single page or the whole document to Markdown or HTML:

import PdfOxide

let doc = try Document.open("paper.pdf")

let md = try doc.toMarkdown(0)        // one page to Markdown
let mdAll = try doc.toMarkdownAll()   // whole document to Markdown
let html = try doc.toHtml(0)          // one page to HTML
let htmlAll = try doc.toHtmlAll()     // whole document to HTML

print(mdAll)

Search

search(_:_:_:) searches a single page and takes the zero-based page index as its first argument; searchAll(_:_:) searches the whole document. Both take a search term and a caseSensitive flag and return a [SearchResult] (text, page, bbox):

import PdfOxide

let doc = try Document.open("manual.pdf")

// Search a single page (page 0, case-insensitive)
let hits = try doc.search(0, "configuration", false)
for hit in hits {
    print("page \(hit.page): '\(hit.text)' at (\(hit.bbox.x), \(hit.bbox.y))")
}

// Search the whole document
let allHits = try doc.searchAll("configuration", false)
print("\(allHits.count) total matches")
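
Because every result carries its page, the output of a whole-document search can be summarised per page. The sketch below shows one way to do that with Dictionary(grouping:by:). DemoHit is a local stand-in with only the text and page fields so the snippet runs without the library; treating page as Int is an assumption, and hitsPerPage is a hypothetical helper.

```swift
// Local stand-in for a search result; the page type (Int) is an assumption.
struct DemoHit {
    let text: String
    let page: Int
}

// Counts matches per page and returns them in ascending page order.
func hitsPerPage(_ hits: [DemoHit]) -> [(page: Int, count: Int)] {
    Dictionary(grouping: hits, by: { $0.page })
        .map { (page: $0.key, count: $0.value.count) }
        .sorted { $0.page < $1.page }
}

let sampleHits = [
    DemoHit(text: "configuration", page: 0),
    DemoHit(text: "Configuration", page: 2),
    DemoHit(text: "configuration", page: 0),
]

for entry in hitsPerPage(sampleHits) {
    print("page \(entry.page): \(entry.count) matches")
}
```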

PDF Creation

The Pdf type provides factory methods that build a document from a source format. Save it to disk with save(_:) or get the raw bytes with toBytes():

import PdfOxide

try Pdf.fromMarkdown("# Hello World\n\nThis is a PDF.").save("output.pdf")
try Pdf.fromHtml("<h1>Invoice</h1><p>Amount: $42</p>").save("invoice.pdf")
try Pdf.fromText("Plain text content.").save("notes.pdf")

let bytes = try Pdf.fromMarkdown("# In-memory\n\nbody\n").toBytes()
print("produced \(bytes.count) bytes")
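
When the bytes are passed on to other code (an upload, a cache, a test fixture), a cheap sanity check is to confirm that they start with the PDF header marker %PDF-. The helper below is a hypothetical addition, not part of the binding, and assumes the bytes arrive as [UInt8]:

```swift
// Hypothetical helper: true when the buffer begins with the "%PDF-" header marker.
func looksLikePdf(_ bytes: [UInt8]) -> Bool {
    bytes.starts(with: "%PDF-".utf8)
}

print(looksLikePdf(Array("%PDF-1.7\n".utf8)))   // true
print(looksLikePdf(Array("<html>".utf8)))       // false
```

This only checks the header; it says nothing about whether the rest of the file is well-formed.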

Error Handling

Every fallible call throws PdfOxideError, which carries the name of the failed operation and the underlying C ABI error code:

import PdfOxide

do {
    let doc = try Document.open("document.pdf")
    print(try doc.extractText(0))
} catch let error as PdfOxideError {
    print("PDF error: \(error)")   // e.g. "PdfOxideError: open failed (error code 1)"
}
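
For batch jobs it is often preferable to keep going when a single page fails instead of aborting the whole document. The sketch below is one possible shape for that: a hypothetical helper, extractAllPages, that takes the per-page extractor as a closure, so it runs here with a fake extractor and no library. Wiring it to the binding would look like the commented line, assuming pageCount() returns an Int.

```swift
// Hypothetical helper: collects per-page text and records the pages that failed.
func extractAllPages(
    pageCount: Int,
    extract: (Int) throws -> String
) -> (pages: [String], failed: [Int]) {
    var pages: [String] = []
    var failed: [Int] = []
    for index in 0..<max(pageCount, 0) {
        do {
            pages.append(try extract(index))
        } catch {
            failed.append(index)   // remember the failure
            pages.append("")       // keep page positions aligned
        }
    }
    return (pages, failed)
}

// Fake extractor standing in for a real per-page call; page 1 always fails.
struct FakePageError: Error {}

let result = extractAllPages(pageCount: 3) { index in
    if index == 1 { throw FakePageError() }
    return "page \(index)"
}
print(result.pages)    // ["page 0", "", "page 2"]
print(result.failed)   // [1]

// With the binding (assumption: pageCount() returns Int):
// let result = extractAllPages(pageCount: try doc.pageCount()) { try doc.extractText($0) }
```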

Next Steps

Getting Started with Rust – use PDF Oxide from Rust
Getting Started with Python – use PDF Oxide from Python
Text Extraction – detailed extraction options and recipes
PDF Creation – advanced creation with metadata and encryption
Editing – edit existing PDFs, annotations, and form fields