What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

XFA Forms: Detect and Read XML Form Data in Python, Rust, Node.js, Go, and C#

Detect and analyze XFA forms:

from pdf_oxide import PdfDocument

doc = PdfDocument("government-form.pdf")
xfa = doc.analyze_xfa()
if xfa:
    print(f"XFA form with {len(xfa.fields)} fields")
    for field in xfa.fields:
        print(f"  {field.name}: {field.field_type}")

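XFA (XML Forms Architecture) is a legacy form format widely used by government agencies, financial institutions, and enterprise systems. Most Python PDF libraries cannot handle XFA forms at all. PDF Oxide can detect them, analyze them, and extract their data.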

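What is XFA?

XFA forms use XML templates embedded inside the PDF instead of standard AcroForm fields. The format was introduced to PDF by Adobe and is common in the following settings:

Government forms: IRS, immigration, and state agency documents
Financial forms: loan applications, insurance claims
Enterprise forms: HR onboarding, procurement, compliance

XFA is deprecated in PDF 2.0 (ISO 32000-2:2020), but millions of XFA documents are still in circulation.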
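
To make the "XML embedded in a PDF" idea concrete, here is a minimal sketch that flattens an XFA datasets packet using only the Python standard library. The sample XML, its element names, and the `flatten_xfa_data` helper are made up for illustration; real forms define their own structure, and this is not how PDF Oxide parses XFA internally.

```python
import xml.etree.ElementTree as ET

# Made-up sample of an XFA "datasets" packet (element names are hypothetical).
SAMPLE_DATASETS = """\
<xfa:datasets xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
  <xfa:data>
    <form1>
      <applicant>
        <name>Jane Doe</name>
        <ssn>000-00-0000</ssn>
      </applicant>
      <loanAmount>25000</loanAmount>
    </form1>
  </xfa:data>
</xfa:datasets>
"""

XFA_DATA_NS = "{http://www.xfa.org/schema/xfa-data/1.0/}"


def flatten_xfa_data(xml_text: str) -> dict[str, str]:
    """Flatten leaf elements under <xfa:data> into dotted-path keys."""
    root = ET.fromstring(xml_text)
    data = root.find(f"{XFA_DATA_NS}data")
    if data is None:
        return {}
    values: dict[str, str] = {}

    def walk(element: ET.Element, path: str) -> None:
        children = list(element)
        if not children:
            # Leaf element: record its text (repeated paths overwrite earlier ones).
            values[path] = (element.text or "").strip()
            return
        for child in children:
            walk(child, f"{path}.{child.tag}")

    for top in data:
        walk(top, top.tag)
    return values


print(flatten_xfa_data(SAMPLE_DATASETS))
```

Nested subforms show up as dotted paths such as `form1.applicant.name`, which is the kind of structure a flat AcroForm field list cannot express directly.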

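XFA vs. AcroForm

Feature	AcroForm	XFA
Format	PDF objects	XML templates
Supported by	All PDF libraries	Very few PDF libraries
Dynamic layout	No	Yes
PDF 2.0 status	Supported	Deprecated
Typical source	Most form creation tools	Adobe LiveCycle, Adobe Designer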

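Why PyMuPDF and pypdf cannot handle XFA forms

If you have ever tried to read an XFA form with a popular Python PDF library, you probably got empty results with no error or warning. The reason is that PyMuPDF, pypdf, pdfplumber, and pdfminer do not support XFA.

PyMuPDF (fitz): silently returns empty results

PyMuPDF's doc.get_form_fields() and the page-level .widgets() only read AcroForm fields. When a PDF uses an XFA-only form (common in IRS, immigration, and state agency documents), PyMuPDF silently returns empty results with no warning: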

# PyMuPDF — silently misses XFA data
import fitz
doc = fitz.open("government-form.pdf")
fields = doc[0].widgets()  # Returns [] on XFA-only forms
form_data = doc.get_form_fields()  # Returns {} on XFA-only forms

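If an XFA form includes an AcroForm fallback layer, PyMuPDF may return some of the fields, but the actual XFA data (dynamic layout, computed values, nested subforms) stays invisible.

pypdf: also returns empty results on XFA forms

pypdf's form field reading has the same limitation. It can only access AcroForm fields and has no XFA support: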

# pypdf — cannot read XFA content
from pypdf import PdfReader
reader = PdfReader("government-form.pdf")
fields = reader.get_form_text_fields()  # Returns {} on XFA-only forms

pdfplumber and pdfminer: no XFA support at all

pdfplumber and pdfminer make no attempt to read fields from XFA forms, and they expose no API for XFA detection or extraction.
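
Because these libraries fail silently, a pipeline that relies on them may want a coarse pre-check that at least raises a warning. The sketch below scans the raw file bytes for the `/XFA` key of the AcroForm dictionary. Both helper names are hypothetical, and the check is only a heuristic: it can miss forms whose catalog sits in a compressed object stream, and it can match unrelated names that start with `/XFA`.

```python
import warnings
from pathlib import Path


def looks_like_xfa(pdf_bytes: bytes) -> bool:
    """Coarse check: does the raw file contain an /XFA name token?

    A False result means "unknown", not "no XFA", because the AcroForm
    dictionary may be stored inside a compressed object stream.
    """
    return b"/XFA" in pdf_bytes


def warn_if_xfa(path: str) -> bool:
    """Warn before handing a probable XFA form to an AcroForm-only reader."""
    found = looks_like_xfa(Path(path).read_bytes())
    if found:
        warnings.warn(
            f"{path} appears to contain XFA data; "
            "AcroForm-only readers may return empty results"
        )
    return found
```

Calling `warn_if_xfa("government-form.pdf")` before an AcroForm-only extraction step turns a silent empty result into a visible warning, although a parser-based check is more reliable.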

PDF Oxide: reads XFA natively

PDF Oxide parses the XFA XML template directly and extracts all fields, values, and form structure:

# PDF Oxide — reads XFA natively
from pdf_oxide import PdfDocument
doc = PdfDocument("government-form.pdf")
xfa = doc.analyze_xfa()
print(f"{len(xfa.fields)} fields found")  # All XFA fields extracted

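This works for government forms, IRS documents, insurance applications, and any other XFA-based PDF, including forms without an AcroForm fallback layer.

Installation

pip install pdf_oxide

Detecting XFA forms

Check whether a PDF contains XFA content: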

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("form.pdf")
xfa = doc.analyze_xfa()

if xfa:
    print("This PDF uses XFA forms")
    print(f"  Fields: {len(xfa.fields)}")
    print(f"  Has template: {xfa.has_template}")
    print(f"  Has datasets: {xfa.has_datasets}")
else:
    print("Standard AcroForm (or no forms)")

WASM

In WASM, you can detect XFA forms and fall back to reading AcroForm fields:

import { WasmPdfDocument } from "pdf-oxide-wasm";

const doc = new WasmPdfDocument(bytes);
if (doc.hasXfa()) {
  console.log("This PDF uses XFA forms");
  // Read any AcroForm fallback fields
  const fields = doc.getFormFields();
  console.log(`AcroForm fallback fields: ${fields.length}`);
}
doc.free();

C++

#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>

auto doc = pdf_oxide::Document::open("government-form.pdf");
if (doc.has_xfa()) {
    std::cout << "This PDF uses XFA forms\n";
    // Read any AcroForm fallback fields
    auto fields = doc.get_form_fields();
    std::cout << "AcroForm fallback fields: " << fields.size() << "\n";
}

Swift

import PdfOxide

let doc = try Document.open("government-form.pdf")
if try doc.hasXfa() {
    print("This PDF uses XFA forms")
    // Read any AcroForm fallback fields
    let fields = try doc.formFields()
    print("AcroForm fallback fields: \(fields.count)")
}

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('government-form.pdf');
if (doc.hasXfa()) {
  print('This PDF uses XFA forms');
  // Read any AcroForm fallback fields
  final fields = doc.getFormFields();
  print('AcroForm fallback fields: ${fields.length}');
}
doc.close();

R

library(pdfoxide)

doc <- pdf_open("government-form.pdf")
if (pdf_has_xfa(doc)) {
  cat("This PDF uses XFA forms\n")
  # Read any AcroForm fallback fields
  fields <- pdf_get_form_fields(doc)
  cat("AcroForm fallback fields:", length(fields), "\n")
}

Julia

using PdfOxide

doc = open_document("government-form.pdf")
if has_xfa(doc)
    println("This PDF uses XFA forms")
    # Read any AcroForm fallback fields
    fields = get_form_fields(doc)
    println("AcroForm fallback fields: ", length(fields))
end

Zig

const std = @import("std");
const pdf_oxide = @import("pdf_oxide");

var doc = try pdf_oxide.Document.open("government-form.pdf");
defer doc.deinit();
if (doc.hasXfa()) {
    std.debug.print("This PDF uses XFA forms\n", .{});
    // Read any AcroForm fallback fields
    var fields = try doc.formFields();
    defer fields.deinit();
    std.debug.print("AcroForm fallback fields: {d}\n", .{try fields.count()});
}

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"government-form.pdf" error:&err];
if ([doc hasXfa]) {
    NSLog(@"This PDF uses XFA forms");
    // Read any AcroForm fallback fields
    NSArray<POXFormField*> *fields = [doc formFieldsWithError:&err];
    NSLog(@"AcroForm fallback fields: %lu", (unsigned long)fields.count);
}

Elixir

{:ok, doc} = PdfOxide.open("government-form.pdf")

if PdfOxide.has_xfa?(doc) do
  IO.puts("This PDF uses XFA forms")
  # Read any AcroForm fallback fields
  {:ok, fields} = PdfOxide.form_fields(doc)
  IO.puts("AcroForm fallback fields: #{length(fields)}")
end

Analyzing XFA fields

Get detailed information about every field in an XFA form:

from pdf_oxide import PdfDocument

doc = PdfDocument("tax-form.pdf")
xfa = doc.analyze_xfa()

if xfa:
    for field in xfa.fields:
        print(f"Name: {field.name}")
        print(f"  Type: {field.field_type}")
        print(f"  Value: {field.value}")
        print()
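
For large forms, a per-type summary is often easier to scan than a full dump. The sketch below counts fields by type. It relies only on the `field_type` attribute used above; the sample objects are stand-ins built with `SimpleNamespace`, and the type names `"text"` and `"checkbox"` are assumptions rather than values confirmed for PDF Oxide.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_field_types(fields) -> Counter:
    """Count fields per type; str() avoids depending on how field_type is represented."""
    return Counter(str(field.field_type) for field in fields)


# Stand-in objects mimicking the name / field_type / value attributes (hypothetical values).
sample_fields = [
    SimpleNamespace(name="applicant.name", field_type="text", value="Jane Doe"),
    SimpleNamespace(name="applicant.ssn", field_type="text", value=""),
    SimpleNamespace(name="consent", field_type="checkbox", value="1"),
]

for field_type, count in summarize_field_types(sample_fields).most_common():
    print(f"{field_type}: {count}")
```

With a real document, `xfa.fields` from `doc.analyze_xfa()` would presumably take the place of `sample_fields`.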

Reading XFA data

Extract the current field values from the XFA datasets:

from pdf_oxide import PdfDocument

doc = PdfDocument("filled-xfa.pdf")
xfa = doc.analyze_xfa()

if xfa and xfa.has_datasets:
    data = {}
    for field in xfa.fields:
        if field.value:
            data[field.name] = field.value
    print(data)

Batch processing XFA forms

Scan a directory and identify which PDFs use XFA:

from pdf_oxide import PdfDocument, PdfError
from pathlib import Path

pdf_dir = Path("government-forms/")
xfa_files = []
acroform_files = []

for pdf_path in pdf_dir.glob("*.pdf"):
    try:
        doc = PdfDocument(str(pdf_path))
        xfa = doc.analyze_xfa()
        if xfa:
            xfa_files.append(pdf_path.name)
        else:
            acroform_files.append(pdf_path.name)
    except PdfError as e:
        print(f"Error: {pdf_path.name}: {e}")

print(f"XFA forms: {len(xfa_files)}")
print(f"Standard forms: {len(acroform_files)}")

Rust API

use pdf_oxide::PdfDocument;
use pdf_oxide::xfa::analyze_xfa_document;

let mut doc = PdfDocument::open("xfa-form.pdf")?;
let analysis = analyze_xfa_document(&mut doc)?;

println!("XFA form detected: {} fields", analysis.fields.len());
for field in &analysis.fields {
    println!("  {} ({:?}): {:?}", field.name, field.field_type, field.value);
}

Node.js / TypeScript

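The Node.js binding provides XFA detection, plus a high-level XfaManager for field-level operations when the optional Node-side manager is installed. For simple routing logic, detection is a single call: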

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("government-form.pdf");
if (doc.hasXFA()) {
  console.log("XFA form — route to specialized handler");
  // AcroForm fallback fields (if any) via doc.getFormFields()
  const fallback = doc.getFormFields();
  console.log(`AcroForm fallback fields: ${fallback.length}`);
} else {
  console.log("Standard AcroForm or no forms");
}
doc.close();

import { PdfDocument } from "pdf-oxide";

const doc = new PdfDocument("government-form.pdf");
if (doc.hasXFA()) {
  const fallback = doc.getFormFields();
  console.log(`XFA detected; ${fallback.length} AcroForm fallback fields`);
}
doc.close();

Go

The Go binding supports XFA detection. You can flag XFA documents in a data pipeline, then route those PDFs to a Python or Rust step for full field extraction:

package main

import (
    "fmt"
    "log"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("government-form.pdf")
    if err != nil { log.Fatal(err) }
    defer doc.Close()

    if doc.HasXfa() {
        fmt.Println("XFA form detected — route to Python/Rust extractor")
    } else {
        fmt.Println("Standard AcroForm or no forms")
    }
}

C#

using PdfOxide;

using var doc = PdfDocument.Open("government-form.pdf");
if (doc.HasXfa)
{
    Console.WriteLine("XFA form detected — route to specialized extractor");
}
else
{
    Console.WriteLine("Standard AcroForm or no forms");
}

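Binding coverage. XFA detection (hasXFA / HasXfa) is available in all five bindings. XFA field enumeration and value extraction (names, types, values, datasets XML) are currently exposed only in Python and Rust; the Node.js, Go, and C# bindings support detection and reading the AcroForm fallback. If you need XFA field values in Go or C#, bridge through a Python or Rust step.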
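
One way to build such a bridge is a small Python script that the Go or C# process runs as a subprocess and that prints the fields as JSON on stdout. The sketch below uses only the PdfDocument, analyze_xfa(), and field attributes shown earlier on this page. The script name, the JSON shape, and the `fields_to_json` helper are assumptions, not part of PDF Oxide.

```python
import json
import sys


def fields_to_json(fields) -> str:
    """Serialize field objects exposing name / field_type / value to a JSON array."""
    records = [
        {"name": f.name, "type": str(f.field_type), "value": f.value}
        for f in fields
    ]
    # default=str keeps non-JSON-native values from raising during serialization.
    return json.dumps(records, ensure_ascii=False, default=str)


def main(argv: list[str]) -> int:
    """Hypothetical entry point: xfa_bridge.py FORM.pdf"""
    if len(argv) != 2:
        print("usage: xfa_bridge.py FORM.pdf", file=sys.stderr)
        return 2
    # Imported lazily so fields_to_json can be used without the package installed.
    from pdf_oxide import PdfDocument

    doc = PdfDocument(argv[1])
    xfa = doc.analyze_xfa()
    print(fields_to_json(xfa.fields if xfa else []))
    return 0


# When saved as a script, finish with: raise SystemExit(main(sys.argv))
```

The calling service can then parse stdout with its own JSON library, so the Go or C# side only needs detection plus a subprocess call.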

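Why XFA matters

Most Python PDF libraries silently ignore XFA content: extract_text() and the form field APIs only see the AcroForm fallback layer, if one exists. Many XFA-only forms have no AcroForm fallback, which makes them completely invisible to other tools:

PyMuPDF (pymupdf) XFA forms: get_form_fields() and .widgets() return nothing on XFA-only PDFs. PyMuPDF has no XFA support and no plans to add it.
pypdf XFA support: pypdf's get_form_text_fields() cannot read XFA content. Only AcroForm fallback fields are visible, if they exist.
pdfplumber: no XFA support. Form extraction is limited to AcroForm fields.
pdfminer: no XFA support. It cannot detect or extract XFA form data.

PDF Oxide is the only Python PDF library that reads the XFA XML template directly, giving you access to form structure and data that PyMuPDF, pypdf, pdfplumber, and pdfminer cannot see.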