What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

XFA Forms: Detect and Read XML Form Data in Python, Rust, Node.js, Go, and C#

Detect and analyze XFA forms:

from pdf_oxide import PdfDocument

doc = PdfDocument("government-form.pdf")
xfa = doc.analyze_xfa()
if xfa:
    print(f"XFA form with {len(xfa.fields)} fields")
    for field in xfa.fields:
        print(f"  {field.name}: {field.field_type}")

XFA (XML Forms Architecture) is a legacy form format used by many government agencies, financial institutions, and enterprise systems. Most Python PDF libraries cannot process XFA forms at all. PDF Oxide can detect them, analyze them, and extract data from them.

What is XFA?

XFA forms use XML-based templates embedded in a PDF instead of standard AcroForm fields. They were developed by Adobe and are common in:

Government forms: tax offices, immigration authorities, state documents
Financial forms: loan applications, insurance claims
Enterprise forms: HR onboarding, procurement, compliance

XFA was deprecated in PDF 2.0 (ISO 32000-2:2020), but millions of existing XFA documents remain in circulation.
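
To make the XML-based storage concrete, here is a minimal sketch in plain Python (standard library only, no PDF Oxide calls) that flattens an XFA data packet into dotted paths. The sample XML and its element names (form1, applicant, amount) are invented for illustration, and flatten_xfa_data is a hypothetical helper rather than part of any library. Real forms usually nest deeper and may repeat element names, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Illustrative data packet; the element names are made up for this sketch.
SAMPLE_DATASETS = """\
<xfa:datasets xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
  <xfa:data>
    <form1>
      <applicant>
        <name>Jane Doe</name>
        <country>DE</country>
      </applicant>
      <amount>1200</amount>
    </form1>
  </xfa:data>
</xfa:datasets>
"""

XFA_DATA_NS = "{http://www.xfa.org/schema/xfa-data/1.0/}"


def flatten_xfa_data(xml_text):
    """Return {dotted.path: text} for every leaf element under <xfa:data>."""
    root = ET.fromstring(xml_text)
    data = root.find(f"{XFA_DATA_NS}data")
    if data is None:
        return {}
    values = {}

    def walk(node, path):
        children = list(node)
        if not children:
            # Leaf element: record its text (repeated names would overwrite).
            values[".".join(path)] = (node.text or "").strip()
            return
        for child in children:
            walk(child, path + [child.tag])

    for top in data:
        walk(top, [top.tag])
    return values


print(flatten_xfa_data(SAMPLE_DATASETS))
```

The point of the sketch is only that XFA values live in an XML tree rather than in per-field PDF objects, which is why tools that look only at AcroForm widgets miss them.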

XFA vs. AcroForm

Feature	AcroForm	XFA
Format	PDF objects	XML templates
Supported by	All PDF libraries	Few PDF libraries
Dynamic layouts	No	Yes
PDF 2.0 status	Supported	Deprecated
Typical source	Most form builders	Adobe LiveCycle, Adobe Designer

Why PyMuPDF and pypdf cannot process XFA forms

Anyone who has tried reading XFA forms with the common Python PDF libraries knows the problem: empty results, no error, no warning. The reason is simple: PyMuPDF, pypdf, pdfplumber, and pdfminer do not support XFA.

PyMuPDF (fitz): silently returns empty results

PyMuPDF's doc.get_form_fields() and .widgets() methods read AcroForm fields only. If a PDF uses pure XFA forms (typical for tax, immigration, and government documents), PyMuPDF silently returns empty results:

# PyMuPDF: silently misses XFA data
import fitz
doc = fitz.open("government-form.pdf")
fields = list(doc[0].widgets())  # widgets() is a generator; [] for pure-XFA forms
form_data = doc.get_form_fields()  # Returns {} for pure-XFA forms

If the XFA form includes an AcroForm fallback layer, PyMuPDF may return some of the fields, but the actual XFA data (dynamic layouts, calculated values, nested subforms) stays invisible.

pypdf: also returns empty results for XFA forms

Reading form fields with pypdf hits the same limitation. The library reads AcroForm fields only and has no notion of XFA:

# pypdf: cannot read XFA content
from pypdf import PdfReader
reader = PdfReader("government-form.pdf")
fields = reader.get_form_text_fields()  # Returns {} for pure-XFA forms

pdfplumber and pdfminer: no XFA support at all

pdfplumber and pdfminer make no attempt to read fields from XFA forms. Neither offers an API for XFA detection or extraction.

PDF Oxide: reads XFA natively

PDF Oxide parses the XFA XML templates directly and extracts all fields, values, and the form structure:

# PDF Oxide: reads XFA natively
from pdf_oxide import PdfDocument
doc = PdfDocument("government-form.pdf")
xfa = doc.analyze_xfa()
if xfa:  # falsy when the PDF has no XFA content, as in the other examples
    print(f"{len(xfa.fields)} fields found")  # All XFA fields extracted

This works for government forms, tax documents, insurance applications, and any other XFA-based PDF, including forms without an AcroForm fallback layer.

Installation

pip install pdf_oxide

Detecting XFA forms

Check whether a PDF contains XFA content:

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("form.pdf")
xfa = doc.analyze_xfa()

if xfa:
    print("This PDF uses XFA forms")
    print(f"  Fields: {len(xfa.fields)}")
    print(f"  Has template: {xfa.has_template}")
    print(f"  Has datasets: {xfa.has_datasets}")
else:
    print("Standard AcroForm (or no forms)")

WASM

In WASM, you can detect XFA forms and fall back to reading the AcroForm fields:

import { WasmPdfDocument } from "pdf-oxide-wasm";

const doc = new WasmPdfDocument(bytes);
if (doc.hasXfa()) {
  console.log("This PDF uses XFA forms");
  // Read any AcroForm fallback fields
  const fields = doc.getFormFields();
  console.log(`AcroForm fallback fields: ${fields.length}`);
}
doc.free();

C++

#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>

auto doc = pdf_oxide::Document::open("government-form.pdf");
if (doc.has_xfa()) {
    std::cout << "This PDF uses XFA forms\n";
    // Read any AcroForm fallback fields
    auto fields = doc.get_form_fields();
    std::cout << "AcroForm fallback fields: " << fields.size() << "\n";
}

Swift

import PdfOxide

let doc = try Document.open("government-form.pdf")
if try doc.hasXfa() {
    print("This PDF uses XFA forms")
    // Read any AcroForm fallback fields
    let fields = try doc.formFields()
    print("AcroForm fallback fields: \(fields.count)")
}

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('government-form.pdf');
if (doc.hasXfa()) {
  print('This PDF uses XFA forms');
  // Read any AcroForm fallback fields
  final fields = doc.getFormFields();
  print('AcroForm fallback fields: ${fields.length}');
}
doc.close();

R

library(pdfoxide)

doc <- pdf_open("government-form.pdf")
if (pdf_has_xfa(doc)) {
  cat("This PDF uses XFA forms\n")
  # Read any AcroForm fallback fields
  fields <- pdf_get_form_fields(doc)
  cat("AcroForm fallback fields:", length(fields), "\n")
}

Julia

using PdfOxide

doc = open_document("government-form.pdf")
if has_xfa(doc)
    println("This PDF uses XFA forms")
    # Read any AcroForm fallback fields
    fields = get_form_fields(doc)
    println("AcroForm fallback fields: ", length(fields))
end

Zig

const pdf_oxide = @import("pdf_oxide");

var doc = try pdf_oxide.Document.open("government-form.pdf");
defer doc.deinit();
if (doc.hasXfa()) {
    std.debug.print("This PDF uses XFA forms\n", .{});
    // Read any AcroForm fallback fields
    var fields = try doc.formFields();
    defer fields.deinit();
    std.debug.print("AcroForm fallback fields: {d}\n", .{try fields.count()});
}

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"government-form.pdf" error:&err];
if ([doc hasXfa]) {
    NSLog(@"This PDF uses XFA forms");
    // Read any AcroForm fallback fields
    NSArray<POXFormField*> *fields = [doc formFieldsWithError:&err];
    NSLog(@"AcroForm fallback fields: %lu", (unsigned long)fields.count);
}

Elixir

{:ok, doc} = PdfOxide.open("government-form.pdf")

if PdfOxide.has_xfa?(doc) do
  IO.puts("This PDF uses XFA forms")
  # Read any AcroForm fallback fields
  {:ok, fields} = PdfOxide.form_fields(doc)
  IO.puts("AcroForm fallback fields: #{length(fields)}")
end

Analyzing XFA fields

Get details for each field in the XFA form:

from pdf_oxide import PdfDocument

doc = PdfDocument("tax-form.pdf")
xfa = doc.analyze_xfa()

if xfa:
    for field in xfa.fields:
        print(f"Name: {field.name}")
        print(f"  Type: {field.field_type}")
        print(f"  Value: {field.value}")
        print()

Reading XFA data

Extract current field values from the XFA datasets:

from pdf_oxide import PdfDocument

doc = PdfDocument("filled-xfa.pdf")
xfa = doc.analyze_xfa()

if xfa and xfa.has_datasets:
    data = {}
    for field in xfa.fields:
        if field.value:
            data[field.name] = field.value
    print(data)
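
If the collected values need to leave the Python process, one option is to serialize them as JSON. The sketch below is a hedged illustration, not PDF Oxide API: StubField is a stand-in that mirrors only the name, field_type, and value attributes used in the examples on this page, and fields_to_json is a hypothetical helper. It treats None and the empty string as "no value" explicitly instead of relying on truthiness.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class StubField:
    """Stand-in for an XFA field; mirrors only the attributes used above."""
    name: str
    field_type: str
    value: Optional[str] = None


def fields_to_json(fields: Iterable, include_empty: bool = False) -> str:
    """Serialize field name/value pairs to a JSON object string."""
    data = {}
    for field in fields:
        # Skip unset or blank values unless the caller asks to keep them.
        if not include_empty and field.value in (None, ""):
            continue
        data[field.name] = field.value
    # sort_keys keeps the output stable across runs, which helps when diffing.
    return json.dumps(data, ensure_ascii=False, sort_keys=True)


sample = [
    StubField("name", "text", "Jane"),
    StubField("notes", "text", ""),
    StubField("signature", "signature", None),
]
print(fields_to_json(sample))
print(fields_to_json(sample, include_empty=True))
```

With a real document, the assumption is that xfa.fields can be passed in place of sample, since the helper only reads the name and value attributes.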

Batch processing XFA forms

Scan a directory and determine which PDFs use XFA:

from pdf_oxide import PdfDocument, PdfError
from pathlib import Path

pdf_dir = Path("government-forms/")
xfa_files = []
acroform_files = []

for pdf_path in pdf_dir.glob("*.pdf"):
    try:
        doc = PdfDocument(str(pdf_path))
        xfa = doc.analyze_xfa()
        if xfa:
            xfa_files.append(pdf_path.name)
        else:
            acroform_files.append(pdf_path.name)
    except PdfError as e:
        print(f"Error: {pdf_path.name}: {e}")

print(f"XFA forms: {len(xfa_files)}")
print(f"Standard forms: {len(acroform_files)}")
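
For larger directories, a persistent report can be more useful than two counters. The following is a minimal sketch under stated assumptions: write_xfa_report and the uses_xfa callback are hypothetical names introduced here, and the detector is passed in as a function so the sketch runs without any PDF library installed.

```python
import csv
from pathlib import Path
from typing import Callable


def write_xfa_report(pdf_dir: Path, report_path: Path,
                     uses_xfa: Callable[[Path], bool]) -> int:
    """Write one CSV row per PDF: file name, form kind, and any error text."""
    rows = 0
    with report_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file", "kind", "error"])
        # sorted() makes the row order deterministic across platforms.
        for pdf_path in sorted(pdf_dir.glob("*.pdf")):
            try:
                kind = "xfa" if uses_xfa(pdf_path) else "other"
                error = ""
            except Exception as exc:  # broad on purpose: keep the batch running
                kind, error = "failed", str(exc)
            writer.writerow([pdf_path.name, kind, error])
            rows += 1
    return rows
```

Based on the Python examples on this page, the detector could presumably be wired up as lambda p: bool(PdfDocument(str(p)).analyze_xfa()). That wiring is an assumption and has not been verified here.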

Rust API

use pdf_oxide::PdfDocument;
use pdf_oxide::xfa::analyze_xfa_document;

let mut doc = PdfDocument::open("xfa-form.pdf")?;
let analysis = analyze_xfa_document(&mut doc)?;

println!("XFA form detected: {} fields", analysis.fields.len());
for field in &analysis.fields {
    println!("  {} ({:?}): {:?}", field.name, field.field_type, field.value);
}

Node.js / TypeScript

The Node.js binding provides XFA detection and, when the optional Node-side manager is installed, a higher-level XfaManager for field-based operations. For simple routing logic, a single call is enough:

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("government-form.pdf");
if (doc.hasXFA()) {
  console.log("XFA form — route to specialized handler");
  // AcroForm fallback fields (if any) via doc.getFormFields()
  const fallback = doc.getFormFields();
  console.log(`AcroForm fallback fields: ${fallback.length}`);
} else {
  console.log("Standard AcroForm or no forms");
}
doc.close();

import { PdfDocument } from "pdf-oxide";

const doc = new PdfDocument("government-form.pdf");
if (doc.hasXFA()) {
  const fallback = doc.getFormFields();
  console.log(`XFA detected; ${fallback.length} AcroForm fallback fields`);
}
doc.close();

Go

The Go binding supports XFA detection. Use it to flag XFA documents in pipelines and hand them off to a Python or Rust step for full field extraction:

package main

import (
    "fmt"
    "log"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("government-form.pdf")
    if err != nil { log.Fatal(err) }
    defer doc.Close()

    if doc.HasXfa() {
        fmt.Println("XFA form detected — route to Python/Rust extractor")
    } else {
        fmt.Println("Standard AcroForm or no forms")
    }
}

C#

using PdfOxide;

using var doc = PdfDocument.Open("government-form.pdf");
if (doc.HasXfa)
{
    Console.WriteLine("XFA form detected — route to specialized extractor");
}
else
{
    Console.WriteLine("Standard AcroForm or no forms");
}

Note on binding coverage. XFA detection (hasXFA / HasXfa) is available in all five bindings. Full XFA field enumeration and value extraction (names, types, values, dataset XML) is currently available only in Python and Rust; the Node.js, Go, and C# bindings offer detection and reading of the AcroForm fallback. If you need XFA field values from Go or C#, add an intermediate Python or Rust step.
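
One way to build such an intermediate Python step is a small filter that reads PDF paths line by line and writes one JSON object per path, which a Go or C# caller can parse from the subprocess output. This is a sketch under assumptions: run_handoff and analyze_with_pdf_oxide are hypothetical names, and the PDF Oxide wiring is inferred from the Python examples on this page rather than from a documented contract.

```python
import json
from typing import Callable, Iterable, TextIO


def run_handoff(paths: Iterable[str], analyze: Callable[[str], dict],
                out: TextIO) -> None:
    """Emit one JSON line per input path so the caller can parse the stream."""
    for raw in paths:
        path = raw.strip()
        if not path:
            continue  # ignore blank lines in the input
        try:
            record = {"path": path, "ok": True, "fields": analyze(path)}
        except Exception as exc:  # report the failure instead of aborting
            record = {"path": path, "ok": False, "error": str(exc)}
        out.write(json.dumps(record, ensure_ascii=False) + "\n")


def analyze_with_pdf_oxide(path: str) -> dict:
    """Assumed wiring, based on the Python examples on this page."""
    from pdf_oxide import PdfDocument  # lazy import: the sketch loads without it
    xfa = PdfDocument(path).analyze_xfa()
    return {f.name: f.value for f in xfa.fields} if xfa else {}
```

As a command-line step, the entry point would call run_handoff(sys.stdin, analyze_with_pdf_oxide, sys.stdout). Emitting an error record per file, rather than exiting, lets the calling pipeline decide how to treat individual failures.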

Why XFA matters

Most Python PDF libraries silently ignore XFA content: extract_text() and the form-field APIs see only the AcroForm fallback layer, if one exists. Many pure-XFA forms have no AcroForm fallback layer at all and are therefore completely invisible to other tools:

PyMuPDF (pymupdf) and XFA: get_form_fields() and .widgets() return empty results for pure-XFA PDFs. PyMuPDF does not support XFA and has no plans to add support.
pypdf and XFA: get_form_text_fields() cannot read XFA content. Only AcroForm fallback fields are visible, if any are present.
pdfplumber: no XFA support. Form extraction is limited to AcroForm fields.
pdfminer: no XFA support. It can neither detect nor extract XFA form data.

PDF Oxide is the only Python PDF library that reads XFA XML templates directly and provides access to form structure and data that PyMuPDF, pypdf, pdfplumber, and pdfminer cannot see.