XFA Forms: Python / Rust / Node.js / Go / C#
Detect and analyze XFA forms with the only library that reads the XML templates directly instead of relying on an AcroForm fallback:
from pdf_oxide import PdfDocument
doc = PdfDocument("government-form.pdf")
xfa = doc.analyze_xfa()
if xfa:
    print(f"XFA form with {len(xfa.fields)} fields")
    for field in xfa.fields:
        print(f" {field.name}: {field.field_type}")
XFA (XML Forms Architecture) is a legacy form format used by many government agencies, financial institutions, and enterprise systems. Most Python PDF libraries cannot handle XFA forms at all. PDF Oxide can detect, analyze, and extract data from them.
What is XFA?
XFA forms use XML-based templates embedded inside a PDF instead of standard AcroForm fields. The format originated at JetForm, was later taken over and developed by Adobe, and is common in:
- Government forms — IRS, immigration, state agency documents
- Financial forms — loan applications, insurance claims
- Enterprise forms — HR onboarding, procurement, compliance
XFA is deprecated in PDF 2.0 (ISO 32000-2:2020), but millions of existing XFA documents remain in circulation.
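Inside the PDF file, XFA content hangs off the interactive form dictionary under an `/XFA` key. As a rough illustration of that layout, the sketch below scans raw bytes for the key using only the standard library. The function name `looks_like_xfa` is hypothetical, and the check is a heuristic: it can miss files whose form dictionary sits inside a compressed object stream, so it is not a substitute for a real parser.

```python
import re

# Match the /XFA name only when followed by whitespace or '[',
# so longer names such as /XFAData are not counted.
_XFA_KEY = re.compile(rb"/XFA[\s\[]")


def looks_like_xfa(pdf_bytes: bytes) -> bool:
    """Heuristic: True if an uncompressed /XFA key appears in the raw bytes."""
    return _XFA_KEY.search(pdf_bytes) is not None


# Tiny hand-written fragments standing in for real file contents.
with_xfa = b"<< /AcroForm << /Fields [] /XFA [(template) 5 0 R] >> >>"
without_xfa = b"<< /AcroForm << /Fields [] >> >>"

print(looks_like_xfa(with_xfa))
print(looks_like_xfa(without_xfa))
```

A pre-filter like this may be handy for triaging large archives cheaply, with a proper analysis pass on anything it flags.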
XFA vs AcroForm
| Feature | AcroForm | XFA |
|---|---|---|
| Format | PDF objects | XML templates |
| Supported by | All PDF libraries | Few PDF libraries |
| Dynamic layouts | No | Yes |
| PDF 2.0 status | Supported | Deprecated |
| Typical source | Most form creators | Adobe LiveCycle, Adobe Designer |
Why PyMuPDF and pypdf cannot handle XFA forms
If you have tried to read an XFA form with a popular Python PDF library, you have likely seen empty results with no error or warning. This is because PyMuPDF, pypdf, pdfplumber, and pdfminer have no XFA support.
PyMuPDF (fitz) — silently returns empty
PyMuPDF’s doc.get_form_fields() and page .widgets() only read AcroForm fields. When a PDF uses XFA-only forms (common with IRS, immigration, and state agency documents), PyMuPDF returns empty results without any warning:
# PyMuPDF — silently misses XFA data
import fitz
doc = fitz.open("government-form.pdf")
fields = list(doc[0].widgets())  # returns [] on XFA-only forms
form_data = doc.get_form_fields()  # returns {} on XFA-only forms
If the XFA form includes an AcroForm fallback layer, PyMuPDF may return a partial subset of fields — but the actual XFA data (dynamic layouts, calculated values, nested subforms) is invisible.
pypdf — also returns empty on XFA forms
pypdf’s form field reading hits the same limitation. It can only access AcroForm fields and has no XFA support:
# pypdf — cannot read XFA content
from pypdf import PdfReader
reader = PdfReader("government-form.pdf")
fields = reader.get_form_text_fields()  # returns {} on XFA-only forms
pdfplumber and pdfminer — no XFA support at all
pdfplumber and pdfminer do not attempt to read form fields from XFA forms. They have no API for XFA detection or extraction.
PDF Oxide — reads XFA natively
PDF Oxide parses the XFA XML templates directly, extracting all fields, values, and form structure:
# PDF Oxide — reads XFA natively
from pdf_oxide import PdfDocument
doc = PdfDocument("government-form.pdf")
xfa = doc.analyze_xfa()
print(f"{len(xfa.fields)} fields found") # All XFA fields extracted
This works on government forms, IRS documents, insurance applications, and any other XFA-based PDF — including forms with no AcroForm fallback layer.
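To make "reading the XML template" more concrete, here is a sketch that walks a hand-written, heavily simplified template packet with the standard library. The sample XML is illustrative rather than taken from a real form, `list_template_fields` is a hypothetical helper, and none of this is meant to describe how PDF Oxide is implemented internally.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up template packet: fields nested inside subforms.
SAMPLE_TEMPLATE = """\
<template xmlns="http://www.xfa.org/schema/xfa-template/3.3/">
  <subform name="form1">
    <field name="FullName"><ui><textEdit/></ui></field>
    <subform name="Address">
      <field name="City"><ui><textEdit/></ui></field>
      <field name="Agree"><ui><checkButton/></ui></field>
    </subform>
  </subform>
</template>
"""


def _local(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def list_template_fields(xml_text: str) -> list:
    """Return (dotted path, widget type) for every <field>, including nested ones."""
    found = []

    def walk(node, path):
        for child in node:
            kind = _local(child.tag)
            if kind == "subform":
                walk(child, path + [child.get("name", "")])
            elif kind == "field":
                ui = next((c for c in child if _local(c.tag) == "ui"), None)
                widget = _local(ui[0].tag) if ui is not None and len(ui) else "unknown"
                found.append((".".join(path + [child.get("name", "")]), widget))

    walk(ET.fromstring(xml_text), [])
    return found


for path, widget in list_template_fields(SAMPLE_TEMPLATE):
    print(f"{path}: {widget}")
```

The nested `Address` subform is the kind of structure that an AcroForm-only reader has no way to see.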
Installation
pip install pdf_oxide
Detecting XFA forms
Check whether a PDF contains XFA content:
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("form.pdf")
xfa = doc.analyze_xfa()
if xfa:
    print("This PDF uses XFA forms")
    print(f" Fields: {len(xfa.fields)}")
    print(f" Has template: {xfa.has_template}")
    print(f" Has datasets: {xfa.has_datasets}")
else:
    print("Standard AcroForm (or no forms)")
WASM
In WASM, you can detect XFA forms and fall back to reading AcroForm fields:
import { WasmPdfDocument } from "pdf-oxide-wasm";
const doc = new WasmPdfDocument(bytes);
if (doc.hasXfa()) {
  console.log("This PDF uses XFA forms");
  // Read any AcroForm fallback fields
  const fields = doc.getFormFields();
  console.log(`AcroForm fallback fields: ${fields.length}`);
}
doc.free();
Analyzing XFA fields
Get details on every field in an XFA form:
from pdf_oxide import PdfDocument
doc = PdfDocument("tax-form.pdf")
xfa = doc.analyze_xfa()
if xfa:
    for field in xfa.fields:
        print(f"Name: {field.name}")
        print(f" Type: {field.field_type}")
        print(f" Value: {field.value}")
        print()
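On large forms, a tally by field type can be easier to scan than a full listing. The helper below is a small sketch that assumes only that each field object exposes the `field_type` attribute used above. `count_field_types` is a hypothetical name, and the stand-in objects and type strings are invented for illustration, since the exact values `pdf_oxide` reports are not shown on this page.

```python
from collections import Counter
from types import SimpleNamespace


def count_field_types(fields):
    """Tally fields by the string form of their field_type attribute."""
    return Counter(str(field.field_type) for field in fields)


# Stand-in objects; with pdf_oxide you would pass xfa.fields instead.
demo_fields = [
    SimpleNamespace(name="FullName", field_type="text"),
    SimpleNamespace(name="City", field_type="text"),
    SimpleNamespace(name="Agree", field_type="checkbox"),
]

for field_type, count in count_field_types(demo_fields).most_common():
    print(f"{field_type}: {count}")
```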
Reading XFA data
Extract the current field values from the XFA datasets:
from pdf_oxide import PdfDocument
doc = PdfDocument("filled-xfa.pdf")
xfa = doc.analyze_xfa()
if xfa and xfa.has_datasets:
    data = {}
    for field in xfa.fields:
        if field.value:
            data[field.name] = field.value
    print(data)
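For context on what the datasets packet holds, the sketch below flattens a hand-written sample into a dictionary of dotted paths using only the standard library. The sample XML and the `flatten_datasets` helper are illustrative assumptions, not PDF Oxide output or API. Empty elements are skipped, mirroring the `if field.value` check above.

```python
import xml.etree.ElementTree as ET

# Made-up datasets packet: user-entered values live under <xfa:data>.
SAMPLE_DATASETS = """\
<xfa:datasets xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
  <xfa:data>
    <form1>
      <FullName>Ada Lovelace</FullName>
      <Address>
        <City>London</City>
        <Zip/>
      </Address>
    </form1>
  </xfa:data>
</xfa:datasets>
"""

XFA_DATA = "{http://www.xfa.org/schema/xfa-data/1.0/}data"


def flatten_datasets(xml_text: str) -> dict:
    """Map dotted element paths to non-empty leaf text under <xfa:data>."""
    values = {}
    data = ET.fromstring(xml_text).find(XFA_DATA)
    if data is None:
        return values

    def walk(node, path):
        children = list(node)
        if not children:
            text = (node.text or "").strip()
            if text:  # skip empty fields
                values[".".join(path)] = text
            return
        for child in children:
            walk(child, path + [child.tag])

    for top in data:
        walk(top, [top.tag])
    return values


print(flatten_datasets(SAMPLE_DATASETS))
```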
Batch processing XFA forms
Scan a directory to identify which PDFs use XFA:
from pdf_oxide import PdfDocument, PdfError
from pathlib import Path
pdf_dir = Path("government-forms/")
xfa_files = []
acroform_files = []
for pdf_path in pdf_dir.glob("*.pdf"):
    try:
        doc = PdfDocument(str(pdf_path))
        xfa = doc.analyze_xfa()
        if xfa:
            xfa_files.append(pdf_path.name)
        else:
            acroform_files.append(pdf_path.name)
    except PdfError as e:
        print(f"Error: {pdf_path.name}: {e}")
print(f"XFA forms: {len(xfa_files)}")
print(f"Standard forms: {len(acroform_files)}")
Rust API
use pdf_oxide::PdfDocument;
use pdf_oxide::xfa::analyze_xfa_document;
let mut doc = PdfDocument::open("xfa-form.pdf")?;
let analysis = analyze_xfa_document(&mut doc)?;
println!("XFA form detected: {} fields", analysis.fields.len());
for field in &analysis.fields {
    println!(" {} ({:?}): {:?}", field.name, field.field_type, field.value);
}
Why XFA matters
Most Python PDF libraries silently ignore XFA content: extract_text() and form field APIs only see the AcroForm fallback layer (if one exists). Many XFA-only forms have no AcroForm fallback, making them invisible to other tools:
- PyMuPDF (pymupdf) XFA forms: get_form_fields() and .widgets() return empty on XFA-only PDFs. PyMuPDF has no XFA support and no plans to add it.
- pypdf XFA support: pypdf's get_form_text_fields() cannot read XFA content. Only AcroForm fallback fields are visible, if they exist at all.
- pdfplumber: no XFA support. Form extraction is limited to AcroForm fields.
- pdfminer: no XFA support. Cannot detect or extract XFA form data.
PDF Oxide is the only Python PDF library that reads the XFA XML templates directly, giving you access to form structure and data that PyMuPDF, pypdf, pdfplumber, and pdfminer cannot see.
Related pages
- Form Data Extraction: AcroForm extraction API
- Fill PDF Forms: form filling guide
- Form Field Editing: advanced form operations