What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Form Data Extraction

PDF Oxide extracts interactive form fields (AcroForm) from PDF documents, including text fields, checkboxes, radio buttons, choice fields, and signatures. Extracted form data can be exported to FDF or XFDF for interchange. XFA (XML Forms Architecture) forms can also be analyzed and converted.

Quick Example

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("form.pdf")
fields = doc.get_form_fields()
for field in fields:
    print(f"{field.name} ({field.field_type}): {field.value}")

Node.js

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("form.pdf");
const fields = doc.getFormFields();
for (const field of fields) {
  console.log(`${field.name} (${field.fieldType}): ${field.value}`);
}
doc.close();

Go

import pdfoxide "github.com/yfedoseev/pdf_oxide/go"

doc, _ := pdfoxide.Open("form.pdf")
defer doc.Close()
fields, _ := doc.FormFields()
for _, field := range fields {
    fmt.Printf("%s (%s): %s\n", field.Name, field.FieldType, field.Value)
}

C#

using PdfOxide.Core;

using var doc = PdfDocument.Open("form.pdf");
var fields = doc.GetFormFields();
foreach (var field in fields)
{
    Console.WriteLine($"{field.Name} ({field.FieldType}): {field.Value}");
}

WASM

const doc = new WasmPdfDocument(bytes);
const fields = doc.getFormFields();
for (const field of fields) {
    console.log(`${field.name} (${field.fieldType}): ${field.value}`);
}

Rust

use pdf_oxide::extractors::FormExtractor;
use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("form.pdf")?;
let fields = FormExtractor::extract_fields(&mut doc)?;
for field in &fields {
    println!("{} ({:?}): {:?}", field.full_name, field.field_type, field.value);
}

Migrating from PyMuPDF get_form_fields()

If you are migrating from PyMuPDF, the API is similar, but PDF Oxide returns richer data and handles XFA forms:

PyMuPDF:

import fitz

doc = fitz.open("form.pdf")
# Returns dict of {field_name: field_value} — loses type info
fields = doc.get_form_fields()

# Or iterate widgets for more detail
for page in doc:
    for widget in page.widgets():
        print(widget.field_name, widget.field_value)

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("form.pdf")
# Returns structured objects with name, value, type, options, rect
fields = doc.get_form_fields()
for field in fields:
    print(f"{field.name} ({field.field_type}): {field.value}")

# Also handles XFA forms that PyMuPDF cannot read
xfa = doc.has_xfa()

Key differences:

PDF Oxide returns structured field objects (not just a dictionary)
Includes the field type, bounding box, and options for choice fields
Supports XFA forms; PyMuPDF's get_form_fields() returns an empty result for XFA-only PDFs
Exports to FDF/XFDF for form data interchange

For the full migration guide covering PyMuPDF, pypdf, pdfplumber, and pdfminer, see Migrating to PDF Oxide.
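
During a migration, existing code may still expect a plain `{name: value}` dictionary. A small adapter can bridge the two shapes. The sketch below is only an illustration: `fields_to_dict` is a hypothetical helper, and `StubField` merely stands in for the objects returned by `doc.get_form_fields()` (it uses only the `name`, `field_type`, and `value` attributes shown in the examples) so the snippet runs without a PDF file.

```python
from dataclasses import dataclass
from typing import Any, Iterable


@dataclass
class StubField:
    """Stand-in for a PDF Oxide field object (hypothetical, for illustration)."""
    name: str
    field_type: str
    value: Any


def fields_to_dict(fields: Iterable[Any]) -> dict:
    """Collapse structured field objects into a {name: value} mapping.

    Type information is dropped, and a later field with a duplicate
    name overwrites an earlier one.
    """
    return {field.name: field.value for field in fields}


sample = [
    StubField("full_name", "text", "Jane Doe"),
    StubField("agree_to_terms", "checkbox", True),
]
legacy = fields_to_dict(sample)
print(legacy)
```

With a real document, the same helper would presumably be called as `fields_to_dict(doc.get_form_fields())`.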

Reading Form Fields

Getting All Fields

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("tax-form.pdf")
fields = doc.get_form_fields()

for field in fields:
    print(f"Name: {field.name}")
    print(f"  Type: {field.field_type}")
    print(f"  Value: {field.value}")
    print(f"  Required: {field.is_required}")
    print(f"  Read-only: {field.is_readonly}")
    if field.max_length:
        print(f"  Max length: {field.max_length}")

Node.js

const doc = new PdfDocument("tax-form.pdf");
const fields = doc.getFormFields();

for (const field of fields) {
  console.log(`Name: ${field.name}`);
  console.log(`  Type: ${field.fieldType}`);
  console.log(`  Value: ${field.value}`);
}
doc.close();

Go

doc, _ := pdfoxide.Open("tax-form.pdf")
defer doc.Close()
fields, _ := doc.FormFields()

for _, field := range fields {
    fmt.Printf("Name: %s\n", field.Name)
    fmt.Printf("  Type: %s\n", field.FieldType)
    fmt.Printf("  Value: %s\n", field.Value)
}

C#

using var doc = PdfDocument.Open("tax-form.pdf");
var fields = doc.GetFormFields();

foreach (var field in fields)
{
    Console.WriteLine($"Name: {field.Name}");
    Console.WriteLine($"  Type: {field.FieldType}");
    Console.WriteLine($"  Value: {field.Value}");
}

WASM

const doc = new WasmPdfDocument(bytes);
const fields = doc.getFormFields();

for (const field of fields) {
    console.log(`Name: ${field.name}`);
    console.log(`  Type: ${field.fieldType}`);
    console.log(`  Value: ${field.value}`);
    console.log(`  Flags: ${field.flags}`);
}

Rust

use pdf_oxide::extractors::{FormExtractor, FieldType};
use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("tax-form.pdf")?;
let fields = FormExtractor::extract_fields(&mut doc)?;

for field in &fields {
    let type_str = match &field.field_type {
        FieldType::Button => "Button",
        FieldType::Text => "Text",
        FieldType::Choice => "Choice",
        FieldType::Signature => "Signature",
        FieldType::Unknown(s) => s.as_str(),
    };

    println!("[{}] {} = {:?}", type_str, field.full_name, field.value);

    if let Some(tooltip) = &field.tooltip {
        println!("  Tooltip: {}", tooltip);
    }
    if let Some(bounds) = &field.bounds {
        println!("  Bounds: [{:.1}, {:.1}, {:.1}, {:.1}]",
            bounds[0], bounds[1], bounds[2], bounds[3]);
    }
}

Getting a Specific Field Value

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("form.pdf")

name = doc.get_form_field_value("employee_name")
ssn = doc.get_form_field_value("ssn")
agreed = doc.get_form_field_value("agree_to_terms")

print(f"Name: {name}")       # "John Doe"
print(f"SSN: {ssn}")         # "123-45-6789"
print(f"Agreed: {agreed}")   # True

WASM

const doc = new WasmPdfDocument(bytes);

const name = doc.getFormFieldValue("employee_name");
const ssn = doc.getFormFieldValue("ssn");
const agreed = doc.getFormFieldValue("agree_to_terms");

console.log(`Name: ${name}`);     // "John Doe"
console.log(`SSN: ${ssn}`);       // "123-45-6789"
console.log(`Agreed: ${agreed}`); // true

Rust

use pdf_oxide::editor::{DocumentEditor, EditableDocument};

let mut editor = DocumentEditor::open("form.pdf")?;

if let Some(value) = editor.get_form_field_value("employee_name")? {
    println!("Name: {:?}", value);
}

Filling Forms

Setting Field Values

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("form.pdf")

# Set text fields
doc.set_form_field_value("full_name", "Jane Doe")
doc.set_form_field_value("email", "jane@example.com")

# Set checkboxes
doc.set_form_field_value("agree_to_terms", True)

# Save the filled form
doc.save("filled_form.pdf")

WASM

const doc = new WasmPdfDocument(bytes);

// Set text fields
doc.setFormFieldValue("full_name", "Jane Doe");
doc.setFormFieldValue("email", "jane@example.com");

// Set checkboxes
doc.setFormFieldValue("agree_to_terms", true);

// Save the filled form
const filledBytes = doc.save();

Rust

use pdf_oxide::editor::{DocumentEditor, EditableDocument, FormFieldValue};

let mut editor = DocumentEditor::open("form.pdf")?;

// Set text fields
editor.set_form_field_value("full_name", FormFieldValue::Text("Jane Doe".into()))?;
editor.set_form_field_value("email", FormFieldValue::Text("jane@example.com".into()))?;

// Set checkboxes
editor.set_form_field_value("agree_to_terms", FormFieldValue::Boolean(true))?;

// Set choice fields
editor.set_form_field_value("state", FormFieldValue::Choice("California".into()))?;

editor.save("filled_form.pdf")?;
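
When many values come from a dictionary (for example, a database row), it can help to check each name against the document's fields before writing. The following is a minimal sketch under stated assumptions: `fill_form` is a hypothetical helper built only on the `get_form_fields()` and `set_form_field_value()` calls shown above, and `_FakeDoc` / `_FakeField` are test doubles that let the snippet run without PDF Oxide installed.

```python
from typing import Any, Dict, List


def fill_form(doc: Any, values: Dict[str, Any]) -> List[str]:
    """Set each value on its matching field; return the names that were skipped.

    A name is skipped when no field with that name exists or when the
    field reports itself as read-only.
    """
    fields = {f.name: f for f in doc.get_form_fields()}
    skipped = []
    for name, value in values.items():
        field = fields.get(name)
        if field is None or getattr(field, "is_readonly", False):
            skipped.append(name)
            continue
        doc.set_form_field_value(name, value)
    return skipped


class _FakeField:
    """Test double exposing only the attributes fill_form reads."""
    def __init__(self, name, is_readonly=False):
        self.name = name
        self.is_readonly = is_readonly


class _FakeDoc:
    """Test double that records writes instead of editing a PDF."""
    def __init__(self):
        self._fields = [_FakeField("full_name"), _FakeField("ssn", is_readonly=True)]
        self.written = {}

    def get_form_fields(self):
        return self._fields

    def set_form_field_value(self, name, value):
        self.written[name] = value


doc = _FakeDoc()
skipped = fill_form(doc, {"full_name": "Jane Doe", "ssn": "000", "nickname": "JD"})
print(skipped)
```

Returning the skipped names, rather than raising on the first mismatch, lets the caller decide whether a missing or locked field is an error.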

Exporting Form Data

Export form field data in FDF or XFDF format for interchange with other applications.

FDF Export

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("form.pdf")
doc.export_form_data("form_data.fdf")

WASM

const doc = new WasmPdfDocument(bytes);
const fdfBytes = doc.exportFormData("fdf");
// fdfBytes is a Uint8Array

Rust

use pdf_oxide::extractors::FormExtractor;
use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("form.pdf")?;
let fields = FormExtractor::extract_fields(&mut doc)?;
let fdf_bytes = FormExtractor::export_fdf(&mut doc, fields)?;
std::fs::write("form_data.fdf", &fdf_bytes)?;

XFDF Export

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("form.pdf")
doc.export_form_data("form_data.xfdf", format="xfdf")

WASM

const doc = new WasmPdfDocument(bytes);
const xfdfBytes = doc.exportFormData("xfdf");

Rust

use pdf_oxide::extractors::FormExtractor;
use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("form.pdf")?;
let fields = FormExtractor::extract_fields(&mut doc)?;
let xfdf = FormExtractor::export_xfdf(&mut doc, fields)?;
std::fs::write("form_data.xfdf", &xfdf)?;
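
XFDF is the XML representation of form data: a root `xfdf` element in the `http://ns.adobe.com/xfdf/` namespace containing `fields`, with one `field` per entry and its `value` as a child element. To show roughly what such a file looks like, here is a standard-library sketch that builds a minimal XFDF document from a dictionary. `build_xfdf` is a hypothetical helper; the exact output of PDF Oxide's exporter (attributes, ordering, nested fields) may differ.

```python
import xml.etree.ElementTree as ET

XFDF_NS = "http://ns.adobe.com/xfdf/"


def build_xfdf(values: dict) -> str:
    """Serialize {field_name: value} into a minimal XFDF document string."""
    # Emit the XFDF namespace as the default namespace (no prefix).
    ET.register_namespace("", XFDF_NS)
    root = ET.Element(f"{{{XFDF_NS}}}xfdf")
    fields = ET.SubElement(root, f"{{{XFDF_NS}}}fields")
    for name, value in values.items():
        field = ET.SubElement(fields, f"{{{XFDF_NS}}}field", {"name": name})
        ET.SubElement(field, f"{{{XFDF_NS}}}value").text = str(value)
    return ET.tostring(root, encoding="unicode")


xfdf_text = build_xfdf({"full_name": "Jane Doe", "state": "California"})
print(xfdf_text)
```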

Form Fields in Markdown/HTML

Form field values are included in Markdown and HTML conversion by default. Use include_form_fields to control this.

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("form.pdf")

# Include form field values (default)
md = doc.to_markdown(0, include_form_fields=True)

# Exclude form fields
md = doc.to_markdown(0, include_form_fields=False)

WASM

const doc = new WasmPdfDocument(bytes);

// Include form fields (default: true)
const md = doc.toMarkdown(0, true, true, true);

// Exclude form fields (4th parameter)
const md2 = doc.toMarkdown(0, true, true, false);

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::converters::ConversionOptions;

let doc = PdfDocument::open("form.pdf")?;
let options = ConversionOptions {
    include_form_fields: true,
    ..Default::default()
};
let md = doc.to_markdown(0, &options)?;

Flattening Forms

Flatten form fields into the page content so that they become non-editable. This is useful for producing finalized PDFs.

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("form.pdf")

# Flatten all form fields
doc.flatten_forms()
doc.save("flattened.pdf")

# Or flatten a single page
doc2 = PdfDocument("form.pdf")
doc2.flatten_forms_on_page(0)
doc2.save("flattened_page0.pdf")

WASM

const doc = new WasmPdfDocument(bytes);

// Flatten all form fields
doc.flattenForms();
const flattened = doc.save();

// Or flatten a single page
const doc2 = new WasmPdfDocument(bytes);
doc2.flattenFormsOnPage(0);
const flattened2 = doc2.save();

Rust

use pdf_oxide::Pdf;

let mut pdf = Pdf::open("form.pdf")?;

// Mark a specific page for flattening
pdf.flatten_page_annotations(0);
pdf.save("flattened.pdf")?;

// Or flatten all pages
let mut pdf2 = Pdf::open("form.pdf")?;
pdf2.flatten_all_annotations();
pdf2.save("flattened_all.pdf")?;

XFA Forms

Analyze the content of XFA (XML Forms Architecture) forms. XFA forms use XML-based templates instead of AcroForm fields and are common in government and enterprise forms.

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("xfa-form.pdf")
if doc.has_xfa():
    print("This document contains an XFA form")
    fields = doc.get_form_fields()  # Extracts AcroForm fallback fields
    for field in fields:
        print(f"  {field.name}: {field.value}")

Node.js

const doc = new PdfDocument("xfa-form.pdf");
if (doc.hasXFA()) {
  console.log("This document contains an XFA form");
  const fields = doc.getFormFields();
  for (const field of fields) {
    console.log(`  ${field.name}: ${field.value}`);
  }
}
doc.close();

Go

doc, _ := pdfoxide.Open("xfa-form.pdf")
defer doc.Close()
if doc.HasXfa() {
    fmt.Println("This document contains an XFA form")
    fields, _ := doc.FormFields()
    for _, field := range fields {
        fmt.Printf("  %s: %s\n", field.Name, field.Value)
    }
}

C#

using var doc = PdfDocument.Open("xfa-form.pdf");
if (doc.HasXfa)
{
    Console.WriteLine("This document contains an XFA form");
    var fields = doc.GetFormFields();
    foreach (var field in fields)
    {
        Console.WriteLine($"  {field.Name}: {field.Value}");
    }
}

WASM

const doc = new WasmPdfDocument(bytes);
if (doc.hasXfa()) {
    console.log("This document contains an XFA form");
    const fields = doc.getFormFields(); // AcroForm fallback fields
    for (const field of fields) {
        console.log(`  ${field.name}: ${field.value}`);
    }
}

Rust

use pdf_oxide::xfa::analyze_xfa_document;
use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("xfa-form.pdf")?;
let analysis = analyze_xfa_document(&mut doc)?;
println!("XFA form detected: {} fields", analysis.fields.len());
for field in &analysis.fields {
    println!("  {} ({:?})", field.name, field.field_type);
}
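
Because XFA data is XML, the user-entered values in an XFA form typically live in a `datasets` packet under an `xfa:data` element. The sketch below illustrates that structure with the standard library only. It does not use PDF Oxide and makes no claim about how `analyze_xfa_document` works internally; `xfa_leaf_values` and the sample packet are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

XFA_DATA_NS = "http://www.xfa.org/schema/xfa-data/1.0/"

# A hand-written example of an XFA datasets packet (illustrative only).
SAMPLE_DATASETS = """\
<xfa:datasets xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
  <xfa:data>
    <form1>
      <EmployeeName>John Doe</EmployeeName>
      <Address>
        <City>Kyiv</City>
      </Address>
    </form1>
  </xfa:data>
</xfa:datasets>
"""


def xfa_leaf_values(datasets_xml: str) -> dict:
    """Map dotted element paths to text for every leaf under <xfa:data>."""
    root = ET.fromstring(datasets_xml)
    data = root.find(f"{{{XFA_DATA_NS}}}data")
    result = {}

    def walk(node, path):
        children = list(node)
        if not children:
            # Leaf element: record its trimmed text under the dotted path.
            result[".".join(path)] = (node.text or "").strip()
            return
        for child in children:
            walk(child, path + [child.tag])

    if data is not None:
        for top in data:
            walk(top, [top.tag])
    return result


values = xfa_leaf_values(SAMPLE_DATASETS)
print(values)
```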

API Reference

Python API

Method	Description
`doc.get_form_fields()`	Get all form fields as `FormField` objects
`doc.get_form_field_value(name)`	Get the value of a specific field by name
`doc.set_form_field_value(name, value)`	Set the value of a form field
`doc.export_form_data(path, format="fdf")`	Export form data to an FDF or XFDF file
`doc.has_xfa()`	Check whether the document contains an XFA form
`doc.flatten_forms()`	Flatten all form fields into the page content
`doc.flatten_forms_on_page(page)`	Flatten form fields on a specific page

Python FormField Properties

Property	Type	Description
`name`	`str`	Field name
`field_type`	`str`	Field type (text, checkbox, radio, choice, signature)
`value`	`str | bool | None`	Current field value
`is_required`	`bool`	Whether the field is required
`is_readonly`	`bool`	Whether the field is read-only
`max_length`	`int | None`	Maximum length for text fields

JavaScript API

Method	Description
`doc.getFormFields()`	Get all form fields
`doc.getFormFieldValue(name)`	Get the value of a specific field by name
`doc.setFormFieldValue(name, value)`	Set the value of a form field
`doc.exportFormData(format?)`	Export as FDF (default) or XFDF; returns a `Uint8Array`
`doc.hasXfa()`	Check whether the document contains an XFA form
`doc.flattenForms()`	Flatten all form fields into the page content
`doc.flattenFormsOnPage(pageIndex)`	Flatten form fields on a specific page

JavaScript FormField Properties

Property	Type	Description
`name`	`string`	Field name
`fieldType`	`string`	Field type
`value`	`string | boolean | null`	Current value
`flags`	`number`	Field flags

Rust API

Function	Description
`FormExtractor::extract_fields(doc)`	Extract all form fields from the AcroForm dictionary
`FormExtractor::export_fdf(doc, fields)`	Export as FDF bytes
`FormExtractor::export_xfdf(doc, fields)`	Export as an XFDF string
`analyze_xfa_document(doc)`	Analyze the structure of an XFA form
`editor.get_form_fields()`	Get fields through DocumentEditor
`editor.get_form_field_value(name)`	Get a field value by name
`editor.set_form_field_value(name, value)`	Set a field value

FormField Fields (Rust)

Field	Type	Description
`name`	`String`	Field name from the `/T` key
`full_name`	`String`	Fully qualified name (dot-separated)
`field_type`	`FieldType`	Button, Text, Choice, Signature, Unknown
`value`	`FieldValue`	Current field value
`tooltip`	`Option<String>`	Tooltip from the `/TU` key
`bounds`	`Option<[f64; 4]>`	Bounding box `[x1, y1, x2, y2]`
`flags`	`Option<u32>`	Field flags (ReadOnly, Required, NoExport)
`default_value`	`Option<FieldValue>`	Default value from the `/DV` key
`max_length`	`Option<u32>`	Maximum length for text fields
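
The fully qualified name in the table above is dot-separated. In the PDF field hierarchy, such a name is formed by joining the partial name (`/T`) of every ancestor, from the root down to the field itself, with periods. A minimal sketch of that rule follows, using plain dictionaries as stand-ins for field dictionaries. The `full_field_name` helper and the sample hierarchy are illustrative assumptions, not PDF Oxide's implementation.

```python
from typing import Optional


def full_field_name(field: dict) -> str:
    """Join partial names (/T) from the root ancestor down to this field."""
    parts = []
    node: Optional[dict] = field
    while node is not None:
        partial = node.get("T")
        # Nodes without a /T entry contribute nothing to the name.
        if partial is not None:
            parts.append(partial)
        node = node.get("Parent")
    return ".".join(reversed(parts))


# Hypothetical three-level hierarchy: applicant -> address -> city
applicant = {"T": "applicant"}
address = {"T": "address", "Parent": applicant}
city = {"T": "city", "Parent": address}

print(full_field_name(city))
```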

FieldType Variants

Variant	Description
`Button`	Checkbox, radio button, or push button (`/Btn`)
`Text`	Single-line or multi-line text field (`/Tx`)
`Choice`	List box or combo box (`/Ch`)
`Signature`	Digital signature field (`/Sig`)
`Unknown(String)`	Unrecognized field type

FieldValue Variants

Variant	Description
`Text(String)`	Text string value
`Boolean(bool)`	Boolean value (checkboxes)
`Name(String)`	Name value (radio buttons, choice fields)
`Array(Vec<String>)`	Multiple values (multi-select lists)
`None`	No value present

Advanced: Validating Required Fields

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("form.pdf")
fields = doc.get_form_fields()

missing = [f for f in fields if f.is_required and not f.value]
if missing:
    print("Missing required fields:")
    for f in missing:
        print(f"  - {f.name}")

Rust

use pdf_oxide::extractors::{FormExtractor, FieldValue};
use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("form.pdf")?;
let fields = FormExtractor::extract_fields(&mut doc)?;

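let required_empty: Vec<_> = fields.iter()
    .filter(|f| {
        // Bit 0x02 of the field flags marks the field as Required.
        let required = f.flags.map_or(false, |flags| flags & 0x02 != 0);
        // Treat a missing value or an empty string as unfilled.
        let empty = match &f.value {
            FieldValue::None => true,
            FieldValue::Text(s) => s.is_empty(),
            _ => false,
        };
        required && empty
    })
    .collect();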

if !required_empty.is_empty() {
    println!("Missing required fields:");
    for f in &required_empty {
        println!("  - {}", f.full_name);
    }
}
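
The `0x02` mask in the Rust example is the Required bit of the field flags (`/Ff`). The PDF specification also places ReadOnly at bit 1 (`0x01`) and NoExport at bit 3 (`0x04`). As a sketch of how a raw flags integer, such as the `flags` value the bindings expose, could be decoded into names (`FieldFlags` and `describe_flags` are hypothetical and not part of PDF Oxide):

```python
from enum import IntFlag


class FieldFlags(IntFlag):
    """Common /Ff bits shared by all field types."""
    READ_ONLY = 0x01
    REQUIRED = 0x02
    NO_EXPORT = 0x04


def describe_flags(raw: int) -> list:
    """Return the names of the common flags that are set in `raw`."""
    return [flag.name for flag in FieldFlags if raw & flag]


print(describe_flags(0x03))
print(describe_flags(0x04))
```

Higher bits carry type-specific meaning (for example, multiline or password text fields), so this sketch deliberately ignores anything beyond the three common flags.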

Related Pages

Filling PDF Forms: a step-by-step guide to form filling
Annotation Extraction: accessing annotations alongside form fields
Text Extraction: extracting text content from pages
Metadata and XMP: reading document-level properties