What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with a mean text extraction time of 0.8ms: 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). These figures come from a benchmark of 3,830 real-world PDFs, on which PDF Oxide had a 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation is needed: run pip install pdf_oxide and call extract_text_ocr(). The PP-OCRv3, PP-OCRv4, and PP-OCRv5 models are supported.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting, which suits LLM and RAG pipelines. No separate package is needed, unlike PyMuPDF, which requires the separate pymupdf4llm package (69× slower).

Getting Started with PDF Oxide (WASM)

PDF Oxide compiles to WebAssembly, so it runs in the browser, in Deno, in Bun, and in edge runtimes such as Cloudflare Workers or Vercel Edge. The same Rust core that powers the Python, Rust, Node.js, Go, and C# bindings runs directly in any JavaScript environment, at near-native speed.

Using Node.js? On the server, prefer the native N-API addon pdf-oxide: it is faster and additionally supports OCR, rendering, and signatures. The WASM build described on this page is the right choice for browsers and edge runtimes, where native addons cannot be loaded.

Installation

npm install pdf-oxide-wasm

import { WasmPdfDocument, WasmPdf } from "pdf-oxide-wasm";

Quick Start

Node.js

import { readFileSync } from "fs";
import { WasmPdfDocument } from "pdf-oxide-wasm";

const bytes = new Uint8Array(readFileSync("document.pdf"));
const doc = new WasmPdfDocument(bytes);

console.log(`Pages: ${doc.pageCount()}`);
console.log(doc.extractText(0));

doc.free();

Browser

<script type="module">
import init, { WasmPdfDocument } from "pdf-oxide-wasm";

await init();

const response = await fetch("document.pdf");
const bytes = new Uint8Array(await response.arrayBuffer());
const doc = new WasmPdfDocument(bytes);

console.log(`Pages: ${doc.pageCount()}`);
console.log(doc.extractText(0));
doc.free();
</script>

Browser with File Upload

<input type="file" id="pdfInput" accept=".pdf" />
<pre id="output"></pre>

<script type="module">
import init, { WasmPdfDocument } from "pdf-oxide-wasm";
await init();

document.getElementById("pdfInput").addEventListener("change", async (e) => {
  const file = e.target.files[0];
  const bytes = new Uint8Array(await file.arrayBuffer());
  const doc = new WasmPdfDocument(bytes);

  let result = `Pages: ${doc.pageCount()}\n\n`;
  for (let i = 0; i < doc.pageCount(); i++) {
    result += `--- Page ${i + 1} ---\n`;
    result += doc.extractText(i) + "\n\n";
  }

  document.getElementById("output").textContent = result;
  doc.free();
});
</script>

Text Extraction

Single Page

const doc = new WasmPdfDocument(bytes);
const text = doc.extractText(0);

All Pages

const allText = doc.extractAllText();
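
When the text of each page is needed separately, for example to keep page numbers next to the text, the per-page loop from the file-upload example can be factored into a helper. The following is a minimal sketch that relies only on pageCount() and extractText(), both shown above. The name extractPages and the shape of the returned objects are illustrative assumptions, not part of the pdf-oxide-wasm API.

```javascript
// Hypothetical helper: collects the text of every page into an array.
// `doc` is expected to behave like a WasmPdfDocument, i.e. to expose
// pageCount() and extractText(pageIndex) with zero-based page indices.
function extractPages(doc) {
  const pages = [];
  const count = doc.pageCount(); // read once instead of on every iteration
  for (let i = 0; i < count; i++) {
    // Store a one-based page number for display, next to the extracted text.
    pages.push({ page: i + 1, text: doc.extractText(i) });
  }
  return pages;
}

// Usage, assuming `doc` was created as in the Quick Start:
// for (const { page, text } of extractPages(doc)) {
//   console.log(`--- Page ${page} ---\n${text}`);
// }
```

The helper does not free the document; the caller still owns it and should call doc.free() when done.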

Structured Extraction

Character and span data, including positions and font metadata:

// Character-level data
const chars = doc.extractChars(0);
for (const c of chars) {
  console.log(`'${c.char}' at (${c.bbox.x}, ${c.bbox.y}) font=${c.fontName}`);
}

// Span-level data
const spans = doc.extractSpans(0);
for (const span of spans) {
  console.log(`"${span.text}" size=${span.fontSize}`);
}
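
Span data can be post-processed with plain JavaScript. As one possible use, the sketch below groups span text by font size, largest first, which is a rough way to spot heading candidates. It assumes only the text and fontSize properties used in the loop above. The function name groupTextByFontSize is hypothetical.

```javascript
// Hypothetical helper: groups span texts by font size, largest size first.
// `spans` is expected to be an array of objects with `text` and `fontSize`,
// like the values logged in the span-level example.
function groupTextByFontSize(spans) {
  const groups = new Map();
  for (const span of spans) {
    const texts = groups.get(span.fontSize) ?? [];
    texts.push(span.text);
    groups.set(span.fontSize, texts);
  }
  // Return [fontSize, texts] pairs sorted by descending font size.
  return [...groups.entries()].sort((a, b) => b[0] - a[0]);
}

// Usage, assuming `doc` is an open document:
// const [largest] = groupTextByFontSize(doc.extractSpans(0));
// if (largest) console.log(`Largest text (${largest[0]}pt):`, largest[1]);
```

Font size alone is a weak signal, so this is best treated as a starting point rather than a replacement for the built-in heading detection.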

Markdown Conversion

const markdown = doc.toMarkdown(0);

// With options
const md = doc.toMarkdown(0, true, true); // detect_headings, include_images

// All pages
const allMarkdown = doc.toMarkdownAll();
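
Markdown output is often split into sections before further processing, for instance before indexing it in a retrieval pipeline. The sketch below is one simple way to do that. It does not call pdf-oxide-wasm at all: it takes a Markdown string, such as one returned by toMarkdown(), and splits it at ATX headings (lines starting with one to six # characters). The function name and the returned object shape are assumptions made for illustration.

```javascript
// Hypothetical helper: splits a Markdown string into { heading, body } sections.
// Text before the first heading is returned with heading === null.
// Limitation: a line starting with "#" inside a fenced code block is also
// treated as a heading.
function splitMarkdownByHeading(markdown) {
  const headingPattern = /^#{1,6}\s+/;
  const sections = [];
  let current = { heading: null, lines: [] };

  const flush = () => {
    const body = current.lines.join("\n").trim();
    // Skip the implicit first section if it has neither heading nor text.
    if (current.heading !== null || body !== "") {
      sections.push({ heading: current.heading, body });
    }
  };

  for (const line of markdown.split("\n")) {
    if (headingPattern.test(line)) {
      flush();
      current = { heading: line.replace(headingPattern, "").trim(), lines: [] };
    } else {
      current.lines.push(line);
    }
  }
  flush();
  return sections;
}

// Usage, assuming `doc` is an open document:
// const sections = splitMarkdownByHeading(doc.toMarkdownAll());
```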

HTML Conversion

const html = doc.toHtml(0);

// All pages
const allHtml = doc.toHtmlAll();

PDF Creation

Create new PDFs from Markdown, HTML, or plain text with WasmPdf:

import { WasmPdf } from "pdf-oxide-wasm";

// From Markdown
const pdf = WasmPdf.fromMarkdown("# Hello World\n\nThis is a PDF.");
const pdfBytes = pdf.toBytes(); // Uint8Array

// From HTML
const invoice = WasmPdf.fromHtml("<h1>Invoice</h1><p>Amount: $42</p>");

// From plain text
const notes = WasmPdf.fromText("Plain text content.");

// Write to a file (Node.js)
import { writeFileSync } from "fs";
writeFileSync("output.pdf", pdf.toBytes());

Form Fields

const fields = doc.getFormFields();
for (const f of fields) {
  console.log(`${f.name} (${f.fieldType}) = ${f.value}`);
}

// Export form data
const fdfBytes = doc.exportFormData();        // FDF format
const xfdfBytes = doc.exportFormData("xfdf"); // XFDF format
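
For application code it is often handier to look up a field by name than to iterate over the list. The sketch below turns the array returned by getFormFields() into a plain object. It assumes only the name and value properties used in the loop above; formFieldsToObject is a hypothetical name, not a library function.

```javascript
// Hypothetical helper: maps form fields to a { fieldName: value } object.
// `fields` is expected to be an array of objects with `name` and `value`.
// If two fields share a name, the later one wins.
function formFieldsToObject(fields) {
  return Object.fromEntries(fields.map((f) => [f.name, f.value]));
}

// Usage, assuming `doc` is an open document with form fields:
// const values = formFieldsToObject(doc.getFormFields());
// console.log(JSON.stringify(values, null, 2));
```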

Search

// Search all pages
const results = doc.search("configuration", true); // case_insensitive
for (const r of results) {
  console.log(`Found "${r.text}" on page ${r.page}`);
}

// Search a single page
const pageResults = doc.searchPage(0, "configuration", true);
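
Search results can be summarized with ordinary JavaScript. As a small sketch, the function below counts matches per page. It assumes that each result carries a page property, as in the log statement above; the name countMatchesByPage is illustrative.

```javascript
// Hypothetical helper: counts search hits per page.
// `results` is expected to be an array of objects with a `page` property.
function countMatchesByPage(results) {
  const counts = new Map();
  for (const r of results) {
    counts.set(r.page, (counts.get(r.page) ?? 0) + 1);
  }
  return counts; // Map of page -> number of matches
}

// Usage, assuming `doc` is an open document:
// const counts = countMatchesByPage(doc.search("configuration", true));
// for (const [page, n] of counts) console.log(`page ${page}: ${n} match(es)`);
```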

Loading from Byte Arrays

The WasmPdfDocument constructor already accepts a Uint8Array, so a separate from_bytes method is not needed:

// Works directly: WasmPdfDocument accepts bytes
const doc = new WasmPdfDocument(uint8Array);

Encrypted PDFs

const doc = new WasmPdfDocument(encryptedBytes);
const success = doc.authenticate("password");
if (success) {
  console.log(doc.extractText(0));
}
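
The example above silently does nothing when the password is wrong, and it never frees the document. One way to make both cases explicit is sketched below. It assumes, as the example suggests, that authenticate() returns a boolean. openEncrypted is a hypothetical helper, and DocClass stands for WasmPdfDocument or any class with the same constructor, authenticate(), and free() methods.

```javascript
// Hypothetical helper: opens an encrypted PDF or throws.
// On failure the document is freed so no WASM memory is leaked.
function openEncrypted(DocClass, bytes, password) {
  const doc = new DocClass(bytes);
  let ok = false;
  try {
    ok = doc.authenticate(password);
  } finally {
    // Runs for a wrong password and also if authenticate() itself throws.
    if (!ok) doc.free();
  }
  if (!ok) {
    throw new Error("Authentication failed: the password was not accepted");
  }
  return doc; // the caller is responsible for calling doc.free()
}

// Usage, assuming WasmPdfDocument was imported from "pdf-oxide-wasm":
// const doc = openEncrypted(WasmPdfDocument, encryptedBytes, "password");
// try { console.log(doc.extractText(0)); } finally { doc.free(); }
```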

Editing

const doc = new WasmPdfDocument(bytes);

// Metadata
doc.setTitle("Updated Title");
doc.setAuthor("Jane Doe");

// Rotate a page
doc.rotatePage(0, 90);

// Save with changes
const edited = doc.save();

// Save encrypted
const encrypted = doc.saveEncryptedToBytes(
  "user-password",
  "owner-password",
  true,   // allow_print
  true,   // allow_copy
  false,  // allow_modify
  true    // allow_annotate
);
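
Four positional booleans are easy to mix up. A thin wrapper with named options can make call sites easier to read. The following is a sketch under the assumption that saveEncryptedToBytes() takes its arguments in the order shown above. The wrapper name saveEncrypted, the option names, and the choice to default every permission to false are illustrative and not part of the library.

```javascript
// Hypothetical wrapper: named options instead of positional booleans.
// Permissions default to false here so that each one must be granted explicitly.
function saveEncrypted(doc, options) {
  const {
    userPassword,
    ownerPassword,
    allowPrint = false,
    allowCopy = false,
    allowModify = false,
    allowAnnotate = false,
  } = options;
  if (typeof userPassword !== "string" || typeof ownerPassword !== "string") {
    throw new TypeError("userPassword and ownerPassword must be strings");
  }
  // Argument order follows the saveEncryptedToBytes() call shown above.
  return doc.saveEncryptedToBytes(
    userPassword,
    ownerPassword,
    allowPrint,
    allowCopy,
    allowModify,
    allowAnnotate
  );
}

// Usage, equivalent to the call above:
// const encrypted = saveEncrypted(doc, {
//   userPassword: "user-password",
//   ownerPassword: "owner-password",
//   allowPrint: true,
//   allowCopy: true,
//   allowAnnotate: true,
// });
```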

Memory Management

WASM objects hold Rust memory that must be freed explicitly:

const doc = new WasmPdfDocument(bytes);
try {
  const text = doc.extractText(0);
} finally {
  doc.free();
}
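
The try/finally pattern can be wrapped in a small helper so that callers cannot forget the free() call. The following is a minimal sketch and not part of the pdf-oxide-wasm API: withDocument is a hypothetical name, and DocClass stands for WasmPdfDocument or any class with the same constructor and free() method.

```javascript
// Hypothetical helper: runs `work` with a fresh document and always frees it,
// whether `work` returns normally or throws.
function withDocument(DocClass, bytes, work) {
  const doc = new DocClass(bytes);
  try {
    return work(doc);
  } finally {
    doc.free();
  }
}

// Usage, assuming WasmPdfDocument was imported from "pdf-oxide-wasm":
// const text = withDocument(WasmPdfDocument, bytes, (doc) => doc.extractText(0));
```

Because free() runs as soon as the callback returns, the callback should be synchronous and should return plain values such as strings or arrays, never the document itself.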

Feature Availability

Some features require native dependencies and are not available in the WebAssembly build:

Feature	WASM	Notes
Text extraction	Yes	Fully supported
PDF creation	Yes	Markdown, HTML, text
PDF editing	Yes	Fully supported
Encryption	Yes	AES-256
OCR	No	Requires native ONNX Runtime
Digital signatures	No	Requires native crypto libraries
Page rendering	No	Requires native tiny-skia

For OCR or rendering, use the Python or Rust bindings.

Next Steps

Python Quick Start: use PDF Oxide from Python
Rust Quick Start: use PDF Oxide from Rust
JavaScript API Reference: complete WASM API documentation
Text Extraction: detailed extraction options
PDF Creation: advanced creation