What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

XFA Forms: Detect and Read XML Form Data in Python, Rust, Node.js, Go, and C#

Detect and analyze XFA forms:

from pdf_oxide import PdfDocument

doc = PdfDocument("government-form.pdf")
xfa = doc.analyze_xfa()
if xfa:
    print(f"XFA form with {len(xfa.fields)} fields")
    for field in xfa.fields:
        print(f"  {field.name}: {field.field_type}")

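XFA (XML Forms Architecture) is a legacy form format used by many government agencies, financial institutions, and enterprise systems. Most Python PDF libraries cannot process XFA forms at all. PDF Oxide can detect XFA forms, analyze them, and extract their data.

What is XFA?

XFA forms use an XML-based template embedded inside the PDF instead of standard AcroForm fields. Adobe created the format, and it is common in:

Government forms: IRS, immigration, and state agency documents
Financial forms: loan applications, insurance claims
Enterprise forms: HR onboarding, procurement, compliance

XFA is deprecated in PDF 2.0 (ISO 32000-2:2020), but millions of existing XFA documents remain in circulation.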
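To make "XML-based" concrete, here is a minimal sketch that flattens a small XFA-style datasets packet into a dictionary using only the Python standard library. The sample XML is hand-written for illustration and does not come from a real form or from PDF Oxide. The helper name flatten_leaves and the dotted-path key format are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

# Hand-written sample of an XFA datasets packet (illustrative only).
SAMPLE_DATASETS = """\
<xfa:datasets xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
  <xfa:data>
    <form1>
      <applicant>
        <name>Jane Doe</name>
        <dob>1990-04-12</dob>
      </applicant>
      <loanAmount>25000</loanAmount>
    </form1>
  </xfa:data>
</xfa:datasets>
"""

XFA_DATA_NS = {"xfa": "http://www.xfa.org/schema/xfa-data/1.0/"}


def flatten_leaves(xml_text: str) -> dict[str, str]:
    """Map dotted element paths to text for every leaf element under xfa:data."""
    root = ET.fromstring(xml_text)
    data = root.find("xfa:data", XFA_DATA_NS)
    result: dict[str, str] = {}
    if data is None:
        return result  # no data packet present

    def walk(node: ET.Element, prefix: str) -> None:
        path = f"{prefix}.{node.tag}" if prefix else node.tag
        children = list(node)
        if not children:
            # Leaf element: record its text (empty string if it has none).
            result[path] = (node.text or "").strip()
            return
        for child in children:
            walk(child, path)

    for top in data:
        walk(top, "")
    return result


values = flatten_leaves(SAMPLE_DATASETS)
print(values)
```

Repeated element names, such as rows of a repeating subform, would overwrite each other in this simple mapping. A production reader would need to keep them apart.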

XFA vs. AcroForm

Feature	AcroForm	XFA
Format	PDF objects	XML template
Library support	All PDF libraries	Very few PDF libraries
Dynamic layout	No	Yes
PDF 2.0 status	Supported	Deprecated
Typical authoring tools	Most form builders	Adobe LiveCycle, Adobe Designer

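Why PyMuPDF and pypdf cannot handle XFA forms

When you try to read an XFA form with a typical Python PDF library, you often get empty results with no error or warning, because PyMuPDF, pypdf, pdfplumber, and pdfminer do not support XFA.

PyMuPDF (fitz): silently returns empty results

PyMuPDF's doc.get_form_fields() and the page-level .widgets() read only AcroForm fields. When a PDF uses an XFA-only form (common in IRS, immigration, and state agency documents), PyMuPDF returns empty results without any warning: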

# PyMuPDF — silently misses XFA data
import fitz
doc = fitz.open("government-form.pdf")
fields = list(doc[0].widgets())  # widgets() is a generator; the list is empty on XFA-only forms
form_data = doc.get_form_fields()  # Returns {} on XFA-only forms

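When an XFA form includes an AcroForm fallback layer, PyMuPDF may return some fields, but the actual XFA data (dynamic layout, calculated values, nested subforms) stays invisible.

pypdf: also returns empty values on XFA forms

pypdf's form-field reading hits the same limit. Its form-field methods reach AcroForm fields only and do not read XFA data: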

# pypdf — cannot read XFA content
from pypdf import PdfReader
reader = PdfReader("government-form.pdf")
fields = reader.get_form_text_fields()  # Returns {} on XFA-only forms

pdfplumber and pdfminer: no XFA support at all

pdfplumber and pdfminer do not attempt to read form fields from XFA forms. Neither exposes an API for XFA detection or extraction.
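Because these libraries fail silently, it can help to flag likely XFA files before trusting an empty result. XFA content is referenced from the /XFA entry of the PDF's AcroForm dictionary, so a raw byte scan for that name is a cheap first check. The sketch below is a heuristic only and uses just the standard library; the function name may_contain_xfa is hypothetical. It misses files whose AcroForm dictionary sits inside a compressed object stream, and it can fire on a stray /XFA token elsewhere in the file.

```python
import re

# Matches the /XFA name followed by whitespace or an array opener, as in
# "/XFA 12 0 R" or "/XFA [ ... ]". Longer names such as /XFAX do not match.
_XFA_KEY = re.compile(rb"/XFA[\s\[]")


def may_contain_xfa(pdf_bytes: bytes) -> bool:
    """Return True if the raw bytes contain an uncompressed /XFA key."""
    return _XFA_KEY.search(pdf_bytes) is not None


# In-memory fragments standing in for real files.
xfa_like = b"<< /Type /Catalog /AcroForm << /Fields [] /XFA 12 0 R >> >>"
acroform_only = b"<< /Type /Catalog /AcroForm << /Fields [4 0 R] >> >>"

print(may_contain_xfa(xfa_like))
print(may_contain_xfa(acroform_only))
```

For a file on disk, pass the result of pathlib.Path(name).read_bytes(). A negative answer is not proof that the file has no XFA content, so treat it as a hint and confirm with a parser that understands XFA.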

PDF Oxide: reads XFA natively

PDF Oxide parses the XFA XML template directly and extracts every field, value, and the form structure:

# PDF Oxide — reads XFA natively
from pdf_oxide import PdfDocument
doc = PdfDocument("government-form.pdf")
xfa = doc.analyze_xfa()
if xfa:  # analyze_xfa() returns a falsy value when the PDF has no XFA content
    print(f"{len(xfa.fields)} fields found")  # All XFA fields extracted

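This works on any XFA-based PDF, including government forms, IRS documents, insurance applications, and other forms that have no AcroForm fallback layer.

Installation

pip install pdf_oxide

Detecting XFA forms

Check whether a PDF contains XFA content: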

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("form.pdf")
xfa = doc.analyze_xfa()

if xfa:
    print("This PDF uses XFA forms")
    print(f"  Fields: {len(xfa.fields)}")
    print(f"  Has template: {xfa.has_template}")
    print(f"  Has datasets: {xfa.has_datasets}")
else:
    print("Standard AcroForm (or no forms)")

WASM

In WASM you can detect XFA forms and fall back to reading AcroForm fields:

import { WasmPdfDocument } from "pdf-oxide-wasm";

const doc = new WasmPdfDocument(bytes);
if (doc.hasXfa()) {
  console.log("This PDF uses XFA forms");
  // Read any AcroForm fallback fields
  const fields = doc.getFormFields();
  console.log(`AcroForm fallback fields: ${fields.length}`);
}
doc.free();

C++

#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>

auto doc = pdf_oxide::Document::open("government-form.pdf");
if (doc.has_xfa()) {
    std::cout << "This PDF uses XFA forms\n";
    // Read any AcroForm fallback fields
    auto fields = doc.get_form_fields();
    std::cout << "AcroForm fallback fields: " << fields.size() << "\n";
}

Swift

import PdfOxide

let doc = try Document.open("government-form.pdf")
if try doc.hasXfa() {
    print("This PDF uses XFA forms")
    // Read any AcroForm fallback fields
    let fields = try doc.formFields()
    print("AcroForm fallback fields: \(fields.count)")
}

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('government-form.pdf');
if (doc.hasXfa()) {
  print('This PDF uses XFA forms');
  // Read any AcroForm fallback fields
  final fields = doc.getFormFields();
  print('AcroForm fallback fields: ${fields.length}');
}
doc.close();

R

library(pdfoxide)

doc <- pdf_open("government-form.pdf")
if (pdf_has_xfa(doc)) {
  cat("This PDF uses XFA forms\n")
  # Read any AcroForm fallback fields
  fields <- pdf_get_form_fields(doc)
  cat("AcroForm fallback fields:", length(fields), "\n")
}

Julia

using PdfOxide

doc = open_document("government-form.pdf")
if has_xfa(doc)
    println("This PDF uses XFA forms")
    # Read any AcroForm fallback fields
    fields = get_form_fields(doc)
    println("AcroForm fallback fields: ", length(fields))
end

Zig

const pdf_oxide = @import("pdf_oxide");

var doc = try pdf_oxide.Document.open("government-form.pdf");
defer doc.deinit();
if (doc.hasXfa()) {
    std.debug.print("This PDF uses XFA forms\n", .{});
    // Read any AcroForm fallback fields
    var fields = try doc.formFields();
    defer fields.deinit();
    std.debug.print("AcroForm fallback fields: {d}\n", .{try fields.count()});
}

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"government-form.pdf" error:&err];
if ([doc hasXfa]) {
    NSLog(@"This PDF uses XFA forms");
    // Read any AcroForm fallback fields
    NSArray<POXFormField*> *fields = [doc formFieldsWithError:&err];
    NSLog(@"AcroForm fallback fields: %lu", (unsigned long)fields.count);
}

Elixir

{:ok, doc} = PdfOxide.open("government-form.pdf")

if PdfOxide.has_xfa?(doc) do
  IO.puts("This PDF uses XFA forms")
  # Read any AcroForm fallback fields
  {:ok, fields} = PdfOxide.form_fields(doc)
  IO.puts("AcroForm fallback fields: #{length(fields)}")
end

Analyzing XFA fields

Get the details of every field in an XFA form:

from pdf_oxide import PdfDocument

doc = PdfDocument("tax-form.pdf")
xfa = doc.analyze_xfa()

if xfa:
    for field in xfa.fields:
        print(f"Name: {field.name}")
        print(f"  Type: {field.field_type}")
        print(f"  Value: {field.value}")
        print()

Reading XFA data

Extract the current field values from the XFA datasets:

from pdf_oxide import PdfDocument

doc = PdfDocument("filled-xfa.pdf")
xfa = doc.analyze_xfa()

if xfa and xfa.has_datasets:
    data = {}
    for field in xfa.fields:
        if field.value:
            data[field.name] = field.value
    print(data)
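
The loop above keeps one value per field name, so if a form repeats a name (for example across rows of a repeating subform), later values overwrite earlier ones. The sketch below groups values into lists instead. It runs on stand-in objects that mimic the .name and .value attributes used above. Whether pdf_oxide reports repeated names this way is an assumption, and the helper name collect_values is hypothetical.

```python
from collections import defaultdict
from types import SimpleNamespace


def collect_values(fields):
    """Group non-empty values by field name, keeping repeats in order."""
    grouped = defaultdict(list)
    for field in fields:
        # Skip fields with no value; keep everything else, including "0".
        if field.value not in (None, ""):
            grouped[field.name].append(field.value)
    return dict(grouped)


# Stand-in objects, not real pdf_oxide field instances.
sample = [
    SimpleNamespace(name="item", value="Laptop"),
    SimpleNamespace(name="item", value="Monitor"),
    SimpleNamespace(name="notes", value=""),
    SimpleNamespace(name="total", value="1450.00"),
]

print(collect_values(sample))
```

With real data, the same function could be called as collect_values(xfa.fields).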

Batch processing XFA forms

Scan a directory to find out which PDFs use XFA:

from pdf_oxide import PdfDocument, PdfError
from pathlib import Path

pdf_dir = Path("government-forms/")
xfa_files = []
acroform_files = []

for pdf_path in pdf_dir.glob("*.pdf"):
    try:
        doc = PdfDocument(str(pdf_path))
        xfa = doc.analyze_xfa()
        if xfa:
            xfa_files.append(pdf_path.name)
        else:
            acroform_files.append(pdf_path.name)
    except PdfError as e:
        print(f"Error: {pdf_path.name}: {e}")

print(f"XFA forms: {len(xfa_files)}")
print(f"Standard forms: {len(acroform_files)}")
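
The same scan can be written so that the classification logic is testable without any PDFs on disk. In this sketch the analyzer is passed in as a callable. The names classify_pdfs and fake_analyze, and the three bucket names, are hypothetical and not part of PDF Oxide.

```python
from pathlib import Path
from typing import Callable, Iterable


def classify_pdfs(
    paths: Iterable[Path], analyze: Callable[[Path], object]
) -> dict[str, list[str]]:
    """Sort file names into 'xfa', 'other', and 'failed' buckets."""
    buckets: dict[str, list[str]] = {"xfa": [], "other": [], "failed": []}
    for path in paths:
        try:
            result = analyze(path)
        except Exception:
            # A real pipeline would catch the library's own error type here.
            buckets["failed"].append(path.name)
            continue
        # A truthy result means XFA content was found.
        buckets["xfa" if result else "other"].append(path.name)
    return buckets


def fake_analyze(path: Path):
    """Stand-in analyzer that decides from the file name alone."""
    if "broken" in path.name:
        raise ValueError("unreadable file")
    return {"fields": 3} if "xfa" in path.name else None


report = classify_pdfs(
    [Path("w4-xfa.pdf"), Path("plain.pdf"), Path("broken.pdf")], fake_analyze
)
print(report)
```

In a real run, the analyzer could be a small function that wraps PdfDocument(str(path)).analyze_xfa(), as in the loop above.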

Rust API

use pdf_oxide::PdfDocument;
use pdf_oxide::xfa::analyze_xfa_document;

let mut doc = PdfDocument::open("xfa-form.pdf")?;
let analysis = analyze_xfa_document(&mut doc)?;

println!("XFA form detected: {} fields", analysis.fields.len());
for field in &analysis.fields {
    println!("  {} ({:?}): {:?}", field.name, field.field_type, field.value);
}

Node.js / TypeScript

The Node.js bindings provide XFA detection and, when the optional Node-side manager is installed, a higher-level XfaManager for field-level operations. For simple routing logic, a single detection call is enough:

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("government-form.pdf");
if (doc.hasXFA()) {
  console.log("XFA form — route to specialized handler");
  // AcroForm fallback fields (if any) via doc.getFormFields()
  const fallback = doc.getFormFields();
  console.log(`AcroForm fallback fields: ${fallback.length}`);
} else {
  console.log("Standard AcroForm or no forms");
}
doc.close();

import { PdfDocument } from "pdf-oxide";

const doc = new PdfDocument("government-form.pdf");
if (doc.hasXFA()) {
  const fallback = doc.getFormFields();
  console.log(`XFA detected; ${fallback.length} AcroForm fallback fields`);
}
doc.close();

Go

The Go bindings support XFA detection. Use it to flag XFA documents in a pipeline and route those PDFs to a Python or Rust stage for full field extraction:

package main

import (
    "fmt"
    "log"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("government-form.pdf")
    if err != nil { log.Fatal(err) }
    defer doc.Close()

    if doc.HasXfa() {
        fmt.Println("XFA form detected — route to Python/Rust extractor")
    } else {
        fmt.Println("Standard AcroForm or no forms")
    }
}

C#

using PdfOxide;

using var doc = PdfDocument.Open("government-form.pdf");
if (doc.HasXfa)
{
    Console.WriteLine("XFA form detected — route to specialized extractor");
}
else
{
    Console.WriteLine("Standard AcroForm or no forms");
}

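Note on binding coverage. XFA detection (hasXFA / HasXfa) is available in all five bindings. XFA field enumeration and value extraction (name, type, value, datasets XML) are currently available only in Python and Rust. The Node.js, Go, and C# bindings support detection and AcroForm fallback reading. Workflows that need XFA field values from Go or C# should route through a Python or Rust stage.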
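
One way to hand results from a Python stage back to a Go or C# caller is to serialize the fields as JSON. The sketch below shows only the serialization step, on stand-in objects that mimic the .name, .field_type, and .value attributes from the Python examples. The function name fields_to_json and the payload shape are assumptions of this sketch, not part of PDF Oxide.

```python
import json
from types import SimpleNamespace


def fields_to_json(fields) -> str:
    """Serialize field-like objects to a JSON document for another process."""
    records = [
        {"name": f.name, "type": str(f.field_type), "value": f.value}
        for f in fields
    ]
    return json.dumps(
        {"field_count": len(records), "fields": records}, ensure_ascii=False
    )


# Stand-in objects, not real pdf_oxide field instances.
sample = [
    SimpleNamespace(name="ssn", field_type="text", value="000-00-0000"),
    SimpleNamespace(name="agree", field_type="checkbox", value=None),
]

payload = fields_to_json(sample)
print(payload)
```

The calling service can run this stage as a subprocess and parse its standard output. A missing value becomes JSON null, which keeps "empty" distinct from an empty string.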

Why XFA matters

Most Python PDF libraries silently ignore XFA content. extract_text() and the form-field APIs see only the AcroForm fallback layer, when one exists. Many XFA-only forms have no AcroForm fallback and are completely invisible to other tools:

PyMuPDF (pymupdf) XFA forms: get_form_fields() and .widgets() return empty values on XFA-only PDFs. PyMuPDF has no XFA support and no plans to add it.
pypdf XFA support: pypdf's get_form_text_fields() cannot read XFA content. Only AcroForm fallback fields are visible, if they exist.
pdfplumber: no XFA support. Form extraction is limited to AcroForm fields.
pdfminer: no XFA support. It cannot detect or extract XFA form data.

PDF Oxide is the only Python PDF library that reads the XFA XML template directly, giving access to form structure and data that PyMuPDF, pypdf, pdfplumber, and pdfminer cannot see.