XFA Forms — Detect and Read XML Form Data in Python, Rust, Node.js, Go & C#

Detect and analyze XFA forms:

from pdf_oxide import PdfDocument

doc = PdfDocument("government-form.pdf")
xfa = doc.analyze_xfa()
if xfa:
    print(f"XFA form with {len(xfa.fields)} fields")
    for field in xfa.fields:
        print(f"  {field.name}: {field.field_type}")

XFA (XML Forms Architecture) is a legacy form format used by many government agencies, financial institutions, and enterprise systems. Most Python PDF libraries cannot handle XFA forms at all. PDF Oxide can detect, analyze, and extract data from them.

What Is XFA?

XFA forms use XML-based templates embedded inside a PDF instead of standard AcroForm fields. They were created by Adobe and are common in:

  • Government forms — IRS, immigration, state agency documents
  • Financial forms — loan applications, insurance claims
  • Enterprise forms — HR onboarding, procurement, compliance

XFA was deprecated in PDF 2.0 (ISO 32000-2:2020), but millions of existing XFA documents remain in circulation.
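
To make the "XML inside a PDF" idea concrete, here is a minimal sketch of what an XFA datasets packet can look like and how its leaf values could be flattened with Python's standard library. The sample XML, the form1 element names, and the flatten_xfa_data helper are illustrative assumptions rather than output from a real form or part of PDF Oxide; only the xfa-data namespace URI comes from the XFA specification.

```python
import xml.etree.ElementTree as ET

# Hand-written sample of an XFA datasets packet (illustrative, not from a real form).
SAMPLE_DATASETS = """\
<xfa:datasets xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
  <xfa:data>
    <form1>
      <applicant>
        <firstName>Ada</firstName>
        <lastName>Lovelace</lastName>
      </applicant>
      <income>52000</income>
    </form1>
  </xfa:data>
</xfa:datasets>
"""

XFA_DATA_NS = "{http://www.xfa.org/schema/xfa-data/1.0/}"


def flatten_xfa_data(xml_text: str) -> dict[str, str]:
    """Map dotted element paths to text for every leaf under <xfa:data>."""
    root = ET.fromstring(xml_text)
    data = root.find(f"{XFA_DATA_NS}data")
    result: dict[str, str] = {}
    if data is None:
        return result

    def walk(node, path):
        children = list(node)
        if not children:
            # Leaf element: record its text under the accumulated path.
            result[path] = (node.text or "").strip()
            return
        for child in children:
            walk(child, f"{path}.{child.tag}")

    for top in data:
        walk(top, top.tag)
    return result


print(flatten_xfa_data(SAMPLE_DATASETS))
```

A real form also carries a template packet that describes layout and field types; this sketch only looks at the data side.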

XFA vs AcroForm

Feature         | AcroForm           | XFA
----------------|--------------------|--------------------------------
Format          | PDF objects        | XML templates
Supported by    | All PDF libraries  | Few PDF libraries
Dynamic layouts | No                 | Yes
PDF 2.0 status  | Supported          | Deprecated
Typical source  | Most form creators | Adobe LiveCycle, Adobe Designer

Why PyMuPDF and pypdf Cannot Handle XFA Forms

If you have tried reading XFA forms with popular Python PDF libraries, you have likely seen empty results with no error or warning. This is because PyMuPDF, pypdf, pdfplumber, and pdfminer have no XFA support.

PyMuPDF (fitz) — silently returns empty

PyMuPDF’s doc.get_form_fields() and page .widgets() only read AcroForm fields. When a PDF uses XFA-only forms (common with IRS, immigration, and state agency documents), PyMuPDF returns empty results without any warning:

# PyMuPDF — silently misses XFA data
import fitz
doc = fitz.open("government-form.pdf")
fields = list(doc[0].widgets())  # widgets() is a generator; [] on XFA-only forms
form_data = doc.get_form_fields()  # Returns {} on XFA-only forms

If the XFA form includes an AcroForm fallback layer, PyMuPDF may return a partial subset of fields — but the actual XFA data (dynamic layouts, calculated values, nested subforms) is invisible.

pypdf — also returns empty on XFA forms

pypdf’s form field reading hits the same limitation. get_form_text_fields() only sees AcroForm fields and does not parse XFA fields or their values:

# pypdf — cannot read XFA content
from pypdf import PdfReader
reader = PdfReader("government-form.pdf")
fields = reader.get_form_text_fields()  # Returns {} on XFA-only forms

pdfplumber and pdfminer — no XFA support at all

pdfplumber and pdfminer do not attempt to read form fields from XFA forms. They have no API for XFA detection or extraction.

PDF Oxide — reads XFA natively

PDF Oxide parses the XFA XML templates directly, extracting all fields, values, and form structure:

# PDF Oxide — reads XFA natively
from pdf_oxide import PdfDocument
doc = PdfDocument("government-form.pdf")
xfa = doc.analyze_xfa()
if xfa:  # analyze_xfa() returns a falsy result when the PDF has no XFA
    print(f"{len(xfa.fields)} fields found")  # All XFA fields extracted

This works on government forms, IRS documents, insurance applications, and any other XFA-based PDF — including forms with no AcroForm fallback layer.

Installation

pip install pdf_oxide

Detecting XFA Forms

Check if a PDF contains XFA content:

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("form.pdf")
xfa = doc.analyze_xfa()

if xfa:
    print("This PDF uses XFA forms")
    print(f"  Fields: {len(xfa.fields)}")
    print(f"  Has template: {xfa.has_template}")
    print(f"  Has datasets: {xfa.has_datasets}")
else:
    print("Standard AcroForm (or no forms)")
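
The same check can be wrapped in a small routing helper. This is a sketch only: classify_form and its labels are hypothetical names, and it relies solely on the has_template and has_datasets attributes shown in the example above. SimpleNamespace objects stand in for real analysis results so the snippet runs without a PDF on disk.

```python
from types import SimpleNamespace


def classify_form(xfa) -> str:
    """Return a routing label for the result of doc.analyze_xfa()."""
    if not xfa:
        return "no-xfa"               # AcroForm-only PDF, or no form at all
    if getattr(xfa, "has_datasets", False):
        return "xfa-with-datasets"    # a datasets packet is present
    if getattr(xfa, "has_template", False):
        return "xfa-template-only"    # template without a datasets packet
    return "xfa-unknown"


# Stand-ins for analysis results (assumed shape, see lead-in).
with_data = SimpleNamespace(fields=[], has_template=True, has_datasets=True)
blank = SimpleNamespace(fields=[], has_template=True, has_datasets=False)

print(classify_form(None), classify_form(with_data), classify_form(blank))
```

With a real document, pass the value returned by doc.analyze_xfa() straight into classify_form.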

WASM

In WASM, you can detect XFA forms and fall back to reading AcroForm fields:

import { WasmPdfDocument } from "pdf-oxide-wasm";

const doc = new WasmPdfDocument(bytes);
if (doc.hasXfa()) {
  console.log("This PDF uses XFA forms");
  // Read any AcroForm fallback fields
  const fields = doc.getFormFields();
  console.log(`AcroForm fallback fields: ${fields.length}`);
}
doc.free();

Analyzing XFA Fields

Get details about every field in the XFA form:

from pdf_oxide import PdfDocument

doc = PdfDocument("tax-form.pdf")
xfa = doc.analyze_xfa()

if xfa:
    for field in xfa.fields:
        print(f"Name: {field.name}")
        print(f"  Type: {field.field_type}")
        print(f"  Value: {field.value}")
        print()
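
For larger forms, a per-type summary is often easier to scan than one block per field. The sketch below groups field names by type using only the name and field_type attributes from the loop above. group_by_type is a hypothetical helper, and the type labels in the sample data are placeholders, not values PDF Oxide is known to return.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_type(fields):
    """Return {field_type: [field names]}, keeping fields in document order."""
    groups = defaultdict(list)
    for field in fields:
        # str() copes with field_type being an enum-like object or a plain string.
        groups[str(field.field_type)].append(field.name)
    return dict(groups)


# Stand-in fields; the type labels are placeholders (see lead-in).
sample = [
    SimpleNamespace(name="firstName", field_type="text", value="Ada"),
    SimpleNamespace(name="citizen", field_type="checkbox", value="1"),
    SimpleNamespace(name="lastName", field_type="text", value=None),
]

for field_type, names in group_by_type(sample).items():
    print(f"{field_type}: {', '.join(names)}")
```

With a real form, call group_by_type(xfa.fields) after the if xfa: check.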

Reading XFA Data

Extract current field values from the XFA datasets:

from pdf_oxide import PdfDocument

doc = PdfDocument("filled-xfa.pdf")
xfa = doc.analyze_xfa()

if xfa and xfa.has_datasets:
    data = {}
    for field in xfa.fields:
        if field.value:
            data[field.name] = field.value
    print(data)
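
The loop above keeps one value per name and skips anything falsy, which also drops empty strings. If a form reports the same field name more than once (XFA allows repeated subform rows; whether PDF Oxide surfaces them under identical names is an assumption here), a plain dict would keep only the last value. The sketch below is one possible variation, with collect_values as a hypothetical helper and stand-in field objects.

```python
from types import SimpleNamespace


def collect_values(fields):
    """Return {name: [values...]}, keeping repeats and empty strings."""
    values = {}
    for field in fields:
        if field.value is None:   # skip only fields with no value at all
            continue
        values.setdefault(field.name, []).append(field.value)
    return values


# Stand-in fields mimicking two repeated rows, an empty field, and an unset one.
rows = [
    SimpleNamespace(name="item", value="Laptop"),
    SimpleNamespace(name="item", value="Monitor"),
    SimpleNamespace(name="notes", value=""),
    SimpleNamespace(name="signature", value=None),
]

print(collect_values(rows))
```

Whether an empty string should count as "filled in" depends on the workflow; the is None test simply makes that choice explicit.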

Batch Processing XFA Forms

Scan a directory to identify which PDFs use XFA:

from pdf_oxide import PdfDocument, PdfError
from pathlib import Path

pdf_dir = Path("government-forms/")
xfa_files = []
acroform_files = []

for pdf_path in pdf_dir.glob("*.pdf"):
    try:
        doc = PdfDocument(str(pdf_path))
        xfa = doc.analyze_xfa()
        if xfa:
            xfa_files.append(pdf_path.name)
        else:
            acroform_files.append(pdf_path.name)
    except PdfError as e:
        print(f"Error: {pdf_path.name}: {e}")

print(f"XFA forms: {len(xfa_files)}")
print(f"AcroForm or no forms: {len(acroform_files)}")
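
When the scan feeds another tool, a CSV report may be more useful than two counters. The following sketch assumes the scan has been collected as (filename, analysis-or-None) pairs; write_report and the column names are hypothetical, and stand-in objects replace real analysis results so it runs on its own.

```python
import csv
import io
from types import SimpleNamespace


def write_report(results, out):
    """Write one CSV row per PDF from (filename, xfa_or_None) pairs."""
    writer = csv.writer(out)
    writer.writerow(["file", "form_kind", "field_count"])
    for name, xfa in results:
        if xfa:
            writer.writerow([name, "xfa", len(xfa.fields)])
        else:
            # Falsy result: standard AcroForm, or no form at all.
            writer.writerow([name, "acroform-or-none", 0])


# Stand-in scan results (assumed shape, see lead-in).
scan = [
    ("permit.pdf", SimpleNamespace(fields=["a", "b"])),
    ("flyer.pdf", None),
]

buffer = io.StringIO()
write_report(scan, buffer)
print(buffer.getvalue())
```

In the batch loop above, append (pdf_path.name, xfa) to a list instead of sorting names into two lists, then pass an open file to write_report.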

Rust API

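use pdf_oxide::PdfDocument;
use pdf_oxide::xfa::analyze_xfa_document;

// The `?` operator needs an enclosing function that returns a Result.
// Assumes the library's error type implements std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut doc = PdfDocument::open("xfa-form.pdf")?;
    let analysis = analyze_xfa_document(&mut doc)?;

    println!("XFA form detected: {} fields", analysis.fields.len());
    for field in &analysis.fields {
        println!("  {} ({:?}): {:?}", field.name, field.field_type, field.value);
    }
    Ok(())
}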

Node.js / TypeScript

The Node binding exposes XFA detection and a higher-level XfaManager for field-level operations when the optional Node-side manager is installed. For simple routing logic, detection is a single call:

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("government-form.pdf");
if (doc.hasXFA()) {
  console.log("XFA form — route to specialized handler");
  // AcroForm fallback fields (if any) via doc.getFormFields()
  const fallback = doc.getFormFields();
  console.log(`AcroForm fallback fields: ${fallback.length}`);
} else {
  console.log("Standard AcroForm or no forms");
}
doc.close();

The same check in TypeScript, using ES module imports:

import { PdfDocument } from "pdf-oxide";

const doc = new PdfDocument("government-form.pdf");
if (doc.hasXFA()) {
  const fallback = doc.getFormFields();
  console.log(`XFA detected; ${fallback.length} AcroForm fallback fields`);
}
doc.close();

Go

The Go binding exposes XFA detection. Use it to flag XFA documents in pipelines, then route those PDFs to a Python or Rust step for full field extraction:

package main

import (
    "fmt"
    "log"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("government-form.pdf")
    if err != nil { log.Fatal(err) }
    defer doc.Close()

    if doc.HasXfa() {
        fmt.Println("XFA form detected — route to Python/Rust extractor")
    } else {
        fmt.Println("Standard AcroForm or no forms")
    }
}

C#

using PdfOxide;

using var doc = PdfDocument.Open("government-form.pdf");
if (doc.HasXfa)
{
    Console.WriteLine("XFA form detected — route to specialized extractor");
}
else
{
    Console.WriteLine("Standard AcroForm or no forms");
}

Binding coverage note. XFA detection (hasXFA / HasXfa) is available in all five bindings. Full XFA field enumeration and value extraction (names, types, values, datasets XML) is currently exposed in Python and Rust only; the Node, Go, and C# bindings surface detection plus AcroForm-fallback reading. For workflows that need to read XFA field values from Go or C#, bridge through a Python or Rust step.
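
One way to build such a bridge is a small Python script that prints the fields as JSON on standard output, which a Go or C# process can run and parse. This is a sketch under assumptions: fields_to_json and dump_xfa are hypothetical names, the JSON shape is arbitrary, and only analyze_xfa(), fields, name, field_type, and value come from the examples on this page.

```python
import json
from types import SimpleNamespace


def fields_to_json(fields) -> str:
    """Serialize field-like objects to a JSON array of name/type/value records."""
    payload = [
        {"name": f.name, "type": str(f.field_type), "value": f.value}
        for f in fields
    ]
    # default=str keeps the dump from failing on non-JSON value types.
    return json.dumps(payload, ensure_ascii=False, default=str)


def dump_xfa(path: str) -> str:
    """Analyze one PDF and return its XFA fields as JSON ("[]" if none)."""
    # Imported here so fields_to_json stays usable without pdf_oxide installed.
    from pdf_oxide import PdfDocument

    xfa = PdfDocument(path).analyze_xfa()
    return fields_to_json(xfa.fields if xfa else [])


# Stand-in field to show the output shape without opening a PDF.
demo = [SimpleNamespace(name="firstName", field_type="text", value="Ada")]
print(fields_to_json(demo))
```

A calling service would run the script with the PDF path, print dump_xfa(path), and decode the resulting array on its side.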

Why XFA Matters

Most Python PDF libraries silently ignore XFA content — extract_text() and form field APIs only see the AcroForm fallback layer (if one exists). Many XFA-only forms have no AcroForm fallback, making them invisible to other tools:

  • PyMuPDF (pymupdf) XFA forms — get_form_fields() and .widgets() return empty on XFA-only PDFs. PyMuPDF has no XFA support and no plans to add it.
  • pypdf XFA support — pypdf’s get_form_text_fields() cannot read XFA content. Only AcroForm fallback fields are visible, if they exist at all.
  • pdfplumber — no XFA support. Form extraction is limited to AcroForm fields.
  • pdfminer — no XFA support. Cannot detect or extract XFA form data.

PDF Oxide is the only Python PDF library that reads XFA XML templates directly, giving you access to form structure and data that PyMuPDF, pypdf, pdfplumber, and pdfminer cannot see.