What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (C++)

PDF Oxide ships idiomatic, header-only C++17 bindings for its Rust core: 0.8 ms mean text extraction and a 100% pass rate on 3,830 PDFs. Handles are move-only RAII wrappers, native strings and buffers are copied into std::string / std::vector<std::uint8_t> for you, and C ABI error codes are thrown as pdf_oxide::Error. New in v0.3.69.
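
The move-only RAII pattern mentioned above can be illustrated with a small sketch. This is not the library's implementation: FakeNative, fake_open, fake_free, live_handles, and Handle are hypothetical stand-ins for a C-ABI handle and its wrapper, used only to show why copying is disabled and how ownership moves.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for a native C-ABI handle (not the real pdf_oxide API).
struct FakeNative { int id; };

// Counts currently open native handles so the demo can observe releases.
inline int& live_handles() { static int n = 0; return n; }

inline FakeNative* fake_open(int id) { ++live_handles(); return new FakeNative{id}; }
inline void fake_free(FakeNative* h) { if (h) { --live_handles(); delete h; } }

// Move-only RAII wrapper: exactly one owner, released in the destructor.
class Handle {
public:
    explicit Handle(int id) : ptr_(fake_open(id)) {}
    ~Handle() { fake_free(ptr_); }

    Handle(const Handle&) = delete;             // copying would double-free
    Handle& operator=(const Handle&) = delete;

    Handle(Handle&& other) noexcept : ptr_(std::exchange(other.ptr_, nullptr)) {}
    Handle& operator=(Handle&& other) noexcept {
        if (this != &other) {
            fake_free(ptr_);                    // release what we owned before
            ptr_ = std::exchange(other.ptr_, nullptr);
        }
        return *this;
    }

    bool valid() const noexcept { return ptr_ != nullptr; }
    int id() const { assert(ptr_ != nullptr); return ptr_->id; }

private:
    FakeNative* ptr_;
};
```

After `Handle b(std::move(a));` the source `a` no longer owns anything, and the native resource is freed exactly once when `b` goes out of scope.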

Installation

The bindings consist of a single header (cpp/include/pdf_oxide/pdf_oxide.hpp) that wraps the native cdylib. Build the library once from the repo root and point CMake at it:

# 1. build the native library (shipped binding feature set)
cargo build --release --lib \
  --features ocr,rendering,signatures,barcodes,tsa-client,system-fonts

# 2. configure + build with the header-only wrapper
cmake -S cpp -B cpp/build -DCMAKE_BUILD_TYPE=Release \
  -DPDF_OXIDE_LIB_DIR="$PWD/target/release"
cmake --build cpp/build -j

Then include the header in your own translation units:

#include <pdf_oxide/pdf_oxide.hpp>

The C header declares a global Pdf type, so do not write using namespace pdf_oxide;. Qualify names (pdf_oxide::Pdf, pdf_oxide::Document) or bring them in with targeted using declarations.
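
The name clash described above can be reproduced without the library. In this sketch the global struct Pdf and the namespace demo_oxide are hypothetical stand-ins for the C header's type and the wrapper namespace; the two helper functions show the two safe spellings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Stand-in for an opaque global type declared by a C header (hypothetical).
struct Pdf;

// Stand-in for the C++ wrapper namespace (hypothetical; not the real pdf_oxide).
namespace demo_oxide {
class Pdf {
public:
    explicit Pdf(std::string body) : body_(std::move(body)) {}
    std::size_t size() const { return body_.size(); }
private:
    std::string body_;
};
}  // namespace demo_oxide

// A file-scope `using namespace demo_oxide;` would make an unqualified `Pdf`
// ambiguous between ::Pdf and demo_oxide::Pdf, so the sketch avoids it.

// Option 1: qualify the name.
inline std::size_t size_qualified(const std::string& body) {
    demo_oxide::Pdf pdf(body);
    return pdf.size();
}

// Option 2: a block-scope using-declaration hides ::Pdf inside this function only.
inline std::size_t size_with_using_declaration(const std::string& body) {
    using demo_oxide::Pdf;
    Pdf pdf(body);
    return pdf.size();
}
```

Both functions construct the same wrapper type; only the spelling differs.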

Quick Start

Open a PDF and extract text in reading order from one page. Every fallible call throws pdf_oxide::Error, so wrap your work in a try/catch.

#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>

int main() {
    try {
        auto doc = pdf_oxide::Document::open("research-paper.pdf");

        std::cout << "pages: " << doc.page_count() << "\n";

        pdf_oxide::Version v = doc.version();
        std::cout << "version: " << static_cast<int>(v.major) << "."
                  << static_cast<int>(v.minor) << "\n";

        std::string text = doc.extract_text(0);   // 0-based page index
        std::cout << text << "\n";
        return 0;
    } catch (const pdf_oxide::Error& e) {
        std::cerr << "error: " << e.what() << "\n";
        return 1;
    }
}

To open a PDF that is already in memory, use Document::open_from_bytes:

std::vector<std::uint8_t> bytes = load_pdf_bytes();   // from S3, HTTP, a DB…
auto doc = pdf_oxide::Document::open_from_bytes(bytes);
std::string text = doc.extract_text(0);
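
The snippet above leaves load_pdf_bytes() undefined. One possible way to produce such a buffer from a local file is sketched below. read_file_bytes and looks_like_pdf are hypothetical helper names, not part of pdf_oxide, and the header check relies only on the fact that PDF files normally begin with the bytes "%PDF-".

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of pdf_oxide): read a whole file into memory.
inline std::vector<std::uint8_t> read_file_bytes(const std::string& path) {
    assert(!path.empty());
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    return std::vector<std::uint8_t>((std::istreambuf_iterator<char>(in)),
                                     std::istreambuf_iterator<char>());
}

// Cheap sanity check before handing the buffer to a parser.
inline bool looks_like_pdf(const std::vector<std::uint8_t>& bytes) {
    static const char magic[] = {'%', 'P', 'D', 'F', '-'};
    if (bytes.size() < sizeof(magic)) return false;
    return std::equal(std::begin(magic), std::end(magic), bytes.begin(),
                      [](char m, std::uint8_t b) {
                          return static_cast<std::uint8_t>(m) == b;
                      });
}
```

A buffer that passes looks_like_pdf could then be handed to Document::open_from_bytes as in the snippet above.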

Markdown and HTML Conversion

Convert a single page, or the entire document, to Markdown or HTML.

auto doc = pdf_oxide::Document::open("paper.pdf");

std::string page_md = doc.to_markdown(0);   // one page
std::string all_md   = doc.to_markdown_all(); // every page

std::string page_html = doc.to_html(0);
std::string all_html  = doc.to_html_all();

std::cout << all_md << "\n";

Word-Level Extraction

extract_words(page_index) returns a std::vector<pdf_oxide::Word> with the text, bounding box, and font metadata for each word on the page.

auto doc   = pdf_oxide::Document::open("paper.pdf");
auto words = doc.extract_words(0);

for (const auto& w : words) {
    std::cout << "'" << w.text << "'"
              << " at (" << w.bbox.x << ", " << w.bbox.y << ")"
              << " size=" << w.font_size
              << " font=" << w.font_name
              << (w.bold ? " [bold]" : "") << "\n";
}

Fields of pdf_oxide::Word:

Field	Type	Description
`text`	`std::string`	The word's text
`bbox`	`Bbox`	Bounding box (`x`, `y`, `width`, `height`)
`font_name`	`std::string`	PostScript font name
`font_size`	`float`	Font size in points
`bold`	`bool`	Whether the text run is bold

Character-level and line-level extraction follow the same pattern: extract_chars(0) returns Char records (Unicode code point + bbox), and extract_text_lines(0) returns TextLine records (text, bbox, word_count).
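
As an illustration of what the per-word fields make possible, the sketch below joins consecutive bold words into phrases, which can serve as rough heading candidates. The Bbox and Word structs are local mirrors of the fields listed above so the example compiles on its own; the float members of Bbox and the helper bold_phrases are assumptions, not part of the documented API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Local mirrors of the documented fields (assumed layout, for the sketch only).
struct Bbox { float x, y, width, height; };
struct Word {
    std::string text;
    Bbox bbox;
    std::string font_name;
    float font_size;
    bool bold;
};

// Join runs of consecutive bold words into phrases.
// Templated so any record type with `text` and `bold` members works.
template <class WordT>
std::vector<std::string> bold_phrases(const std::vector<WordT>& words) {
    std::vector<std::string> phrases;
    bool in_run = false;
    for (const auto& w : words) {
        if (!w.bold) { in_run = false; continue; }
        if (in_run) {
            assert(!phrases.empty());   // a run always has a phrase started
            phrases.back() += " " + w.text;
        } else {
            phrases.push_back(w.text);
            in_run = true;
        }
    }
    return phrases;
}
```

Because the function is a template, it should also accept the vector returned by extract_words, provided the real Word type exposes text and bold as in the table.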

Search

Search a single page with search(page_index, term, case_sensitive) or the entire document with search_all(term, case_sensitive). Both return a std::vector<pdf_oxide::SearchResult>.

auto doc = pdf_oxide::Document::open("manual.pdf");

// One page
auto hits = doc.search(0, "configuration", /*case_sensitive=*/false);

// Every page
auto all_hits = doc.search_all("configuration", /*case_sensitive=*/false);
for (const auto& r : all_hits) {
    std::cout << "page " << r.page << ": '" << r.text << "'"
              << " at (" << r.bbox.x << ", " << r.bbox.y << ")\n";
}
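
A common follow-up to a document-wide search is a per-page tally of hits. The sketch below assumes only the page, text, and bbox members used in the loop above; the local SearchResult and Bbox structs and the helper hits_per_page are hypothetical and exist so the example compiles without the library.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Local mirrors of the members used above (assumed types, for the sketch only).
struct Bbox { float x, y, width, height; };
struct SearchResult {
    int page;
    std::string text;
    Bbox bbox;
};

// Count how many hits fall on each page (0-based page index -> hit count).
template <class ResultT>
std::map<long, std::size_t> hits_per_page(const std::vector<ResultT>& results) {
    std::map<long, std::size_t> counts;
    for (const auto& r : results) {
        assert(r.page >= 0);                    // page indices are 0-based
        ++counts[static_cast<long>(r.page)];
    }
    return counts;
}
```

Pages without hits do not appear in the map, so its size is the number of pages that contain the term.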

Creating a PDF

The pdf_oxide::Pdf builder creates documents from Markdown, HTML, or plain text. Serialize with to_bytes() or write straight to disk with save().

// From Markdown
auto pdf = pdf_oxide::Pdf::from_markdown("# Hello World\n\nThis is a PDF.\n");
pdf.save("output.pdf");

// From HTML
auto invoice = pdf_oxide::Pdf::from_html("<h1>Invoice</h1><p>Amount: $42</p>");
invoice.save("invoice.pdf");

// From plain text, or grab the bytes for in-memory use
auto notes = pdf_oxide::Pdf::from_text("Plain text body.");
std::vector<std::uint8_t> bytes = notes.to_bytes();

Feed a freshly created PDF straight back into a Document:

auto pdf  = pdf_oxide::Pdf::from_markdown("# Title\n\nbody\n");
auto doc  = pdf_oxide::Document::open_from_bytes(pdf.to_bytes());
std::cout << doc.to_markdown_all() << "\n";

Error Handling

Every fallible operation throws pdf_oxide::Error, which carries the native error message (what()) and the raw C ABI error code (code()). Handles can also be closed explicitly, and closing is idempotent: doc.close() releases the native handle early, and any use after closing throws an exception.

#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>

int main() {
    try {
        auto doc = pdf_oxide::Document::open("missing.pdf");
        std::cout << doc.extract_text(0) << "\n";
        doc.close();   // optional — happens automatically at scope exit
    } catch (const pdf_oxide::Error& e) {
        std::cerr << "pdf error (" << e.code() << "): " << e.what() << "\n";
        return 1;
    }
}
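
To make the error model concrete, here is a minimal sketch of how a C-style status code can be turned into an exception that exposes both what() and code(), together with a handle whose close() is idempotent. DemoError, check, and DemoDoc are hypothetical names, and the convention that 0 means success is an assumption; none of this is the real pdf_oxide implementation.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical error type mirroring what() + code(); not pdf_oxide::Error itself.
class DemoError : public std::runtime_error {
public:
    DemoError(int code, const std::string& msg)
        : std::runtime_error(msg), code_(code) {
        assert(code != 0);   // 0 is assumed to mean success, never an error
    }
    int code() const noexcept { return code_; }
private:
    int code_;
};

// Translate a C-style status code into an exception (0 = success, assumed).
inline void check(int status, const char* message) {
    if (status != 0) throw DemoError(status, message);
}

// Handle with an idempotent close(); any use after closing throws.
class DemoDoc {
public:
    DemoDoc() : open_(true) {}
    void close() noexcept { open_ = false; }   // safe to call repeatedly
    int page_count() const {
        if (!open_) throw DemoError(-1, "document is closed");
        return 3;                              // fixed value for the demo
    }
private:
    bool open_;
};
```

Keeping the raw code alongside the message lets callers branch on the numeric value while still logging the human-readable text.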

Next Steps

Getting Started with Rust – using PDF Oxide from Rust
Getting Started with Python – using PDF Oxide from Python
Text Extraction – detailed extraction options and recipes
PDF Creation – advanced creation with metadata and encryption
Editing – modifying existing PDFs, annotations, and form fields