What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

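PDF/UA Accessibility Compliance

PDF/UA (ISO 14289) specifies the requirements for universally accessible PDF documents. PDF Oxide validates the structure tree, heading order, alternative text, table headers, language declaration, and more.

Binding support. PDF/UA validation is available in Python (doc.validate_pdf_ua()), Rust (validate_pdf_ua and the PdfUaValidator builder), and Go (doc.ValidatePdfUa()). WASM, where available, exposes a basic check through validatePdfUa that returns only pass/fail. The public C# wrapper does not provide it yet; use the Rust CLI (pdf-oxide validate --pdfua doc.pdf) or call it through one of the supported bindings.

Supported Levels

Level	Standard	Description
PDF/UA-1	ISO 14289-1:2014	Baseline accessibility requirements
PDF/UA-2	ISO 14289-2:2024	Enhanced requirements, aligned with WCAG 2.1

Quick Validation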

from pdf_oxide import PdfDocument

doc = PdfDocument("document.pdf")
result = doc.validate_pdf_ua()
print(f"Valid: {result.valid}")
for error in result.errors:
    print(f"  {error}")

use pdf_oxide::PdfDocument;
use pdf_oxide::compliance::{validate_pdf_ua, PdfUaLevel};

let mut doc = PdfDocument::open("accessible.pdf")?;
let result = validate_pdf_ua(&mut doc, PdfUaLevel::UA1)?;

if result.has_errors() {
    println!("Not PDF/UA-1 compliant:");
    for error in &result.errors {
        println!("  [{}] {} (clause {})",
            error.code, error.message,
            error.clause.as_deref().unwrap_or("n/a"));
    }
} else {
    println!("Document is PDF/UA-1 compliant");
}

doc, err := pdfoxide.Open("accessible.pdf")
if err != nil { log.Fatal(err) }
defer doc.Close()

valid, errs, err := doc.ValidatePdfUa()
if err != nil { log.Fatal(err) }

if valid {
    fmt.Println("Document is PDF/UA-1 compliant")
} else {
    fmt.Println("Not PDF/UA-1 compliant:")
    for _, e := range errs {
        fmt.Printf("  %s\n", e)
    }
}
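
In a CI job, the Python result can be reduced to a process exit code. The following is a minimal sketch that assumes only the `valid` and `errors` attributes used in the Python example above. `exit_code_for` is a hypothetical helper, not part of pdf_oxide, and the `SimpleNamespace` object stands in for a real validation result.

```python
from types import SimpleNamespace


def exit_code_for(result, label="document.pdf"):
    """Print a short summary and return 0 if valid, 1 otherwise.

    `result` is any object exposing `valid` and `errors`, the two
    attributes used in the quick-validation example (assumed shape).
    """
    if result.valid:
        print(f"{label}: PDF/UA check passed")
        return 0
    errors = list(result.errors)
    print(f"{label}: {len(errors)} PDF/UA error(s)")
    for error in errors:
        print(f"  {error}")
    return 1


# Stand-in for the object returned by doc.validate_pdf_ua().
fake = SimpleNamespace(valid=False, errors=["Missing /Lang entry"])
code = exit_code_for(fake)
```

Passing the returned value to `sys.exit()` would let a pipeline step fail whenever a document does not validate.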

Validator API

The PdfUaValidator builder lets you configure individual checks:

use pdf_oxide::PdfDocument;
use pdf_oxide::compliance::{PdfUaValidator, PdfUaLevel};

let mut doc = PdfDocument::open("report.pdf")?;

let result = PdfUaValidator::new()
    .check_heading_sequence(true)
    .check_color_contrast(true)
    .allow_custom_types(vec!["Caption".into(), "Aside".into()])
    .validate(&mut doc, PdfUaLevel::UA1)?;

println!("Errors: {}", result.errors.len());
println!("Warnings: {}", result.warnings.len());
println!("Structure elements checked: {}",
    result.stats.structure_elements_checked);

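Configuration Options

Method	Default	Description
`check_heading_sequence(bool)`	`true`	Verifies that H1-H6 headings do not skip levels
`check_color_contrast(bool)`	`true`	Flags potential contrast problems
`allow_custom_types(Vec<String>)`	`[]`	Allows non-standard structure types without emitting warnings

Structure Tree Inspection

Before running validation, you can inspect the document's structure tree and mark information: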

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("tagged.pdf")?;

// Check if the document claims to be tagged
let mark_info = doc.mark_info()?;
println!("Marked: {}", mark_info.marked);
println!("Suspects: {}", mark_info.suspects);

// Access the structure tree
if let Some(tree) = doc.structure_tree()? {
    println!("Root tag: {}", tree.root_type);
    println!("Children: {}", tree.children.len());
}

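The mark_info() method returns:

Field	Type	Description
`marked`	`bool`	Whether the document declares itself as tagged
`suspects`	`bool`	Whether tag assignments may be incorrect
`user_properties`	`bool`	Whether user properties are present

When suspects is true, PDF Oxide automatically falls back to geometry-based ordering for text extraction instead of relying on a structure tree that may be unreliable.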
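
To illustrate what a geometry-based fallback means, here is a minimal sketch. It is not PDF Oxide's actual algorithm: the `Span` type, the `reading_order` function, and the `line_tolerance` parameter are hypothetical, and the sort assumes a simple single-column page with the origin at the bottom-left.

```python
from dataclasses import dataclass


@dataclass
class Span:
    text: str
    x: float  # left edge in PDF user space
    y: float  # baseline in PDF user space (origin at bottom-left)


def reading_order(spans, suspects, line_tolerance=2.0):
    """Return spans in their given order, or in geometric order if suspects."""
    if not suspects:
        # Trust the order supplied by the structure tree.
        return list(spans)
    # Bucket baselines into lines, then sort top-to-bottom and left-to-right.
    return sorted(spans, key=lambda s: (-round(s.y / line_tolerance), s.x))


spans = [Span("world", 60, 700), Span("Hello", 10, 700.5), Span("footer", 10, 50)]
ordered = [s.text for s in reading_order(spans, suspects=True)]
```

Multi-column layouts need more than a single sort key, which is why this is only a sketch of the idea.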

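What Gets Validated

The validator covers the following PDF/UA requirements:

Document Level

Check	Clause	Description
Language	7.2	A `/Lang` entry is present in the catalog
Title	7.1	The document title is set and displayed in the title bar
Tagged	7.1	The MarkInfo dictionary declares `Marked = true`
XMP metadata	7.1	`pdfuaid:part` is declared in the XMP stream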
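
The XMP check can be pictured as reading the `pdfuaid:part` value from the metadata packet. The sketch below uses only the Python standard library and is independent of pdf_oxide. `pdfua_part` and `SAMPLE` are hypothetical names, and the sketch assumes the XMP packet has already been extracted as a string.

```python
import xml.etree.ElementTree as ET

PDFUAID_NS = "http://www.aiim.org/pdfua/ns/id/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def pdfua_part(xmp_packet):
    """Return the declared PDF/UA part as an int, or None if absent."""
    root = ET.fromstring(xmp_packet)
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        # XMP allows the property as an attribute or as a child element.
        value = desc.get(f"{{{PDFUAID_NS}}}part")
        if value is None:
            child = desc.find(f"{{{PDFUAID_NS}}}part")
            value = child.text if child is not None else None
        if value is not None and value.strip().isdigit():
            return int(value.strip())
    return None


SAMPLE = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:pdfuaid="http://www.aiim.org/pdfua/ns/id/">
      <pdfuaid:part>1</pdfuaid:part>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""

part = pdfua_part(SAMPLE)
```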

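Structure

Check	Clause	Description
Structure tree	7.1	A complete structure tree rooted at StructTreeRoot
Role mapping	7.5	Non-standard types are mapped to standard structure elements
Heading hierarchy	7.4.2	Headings (H1-H6) do not skip levels
Artifact marking	7.3	Decorative content is marked as an artifact
Reading order	7.2	The structure tree defines a sensible reading order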
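
The heading hierarchy rule is easy to state in code. The sketch below is an illustration of the rule only, not the library's implementation. `find_heading_skips` is a hypothetical helper that takes structure tags in document order, and it treats a document that opens with anything deeper than H1 as a skip, which is an assumption of this sketch.

```python
def find_heading_skips(tags):
    """Return (index, previous_level, level) for each skipped heading level."""
    skips = []
    previous = 0  # no heading seen yet
    for index, tag in enumerate(tags):
        # Only H1..H6 count; other tags such as P or Figure are ignored.
        if len(tag) == 2 and tag[0] == "H" and tag[1] in "123456":
            level = int(tag[1])
            if level > previous + 1:
                skips.append((index, previous, level))
            previous = level
    return skips


skips = find_heading_skips(["H1", "P", "H3", "H2", "H3"])
```

Moving back up the hierarchy (H3 followed by H2) is allowed; only jumps of more than one level downward are reported.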

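Content

Check	Clause	Description
Image alternative text	7.3	Figure elements carry `/Alt` or `/ActualText`
Table headers	7.5	`TH` elements are present in table structures
Form labels	7.6.2	Form fields have an associated label or tooltip text
Link text	7.18	Link annotations have descriptive content
List structure	7.4.3	Lists use the `L`, `LI`, `Lbl`, `LBody` structure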
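
As an illustration of the image alternative text check, the sketch below walks a structure tree and reports Figure elements that have neither alternative nor actual text. The nested-dict tree shape, the `alt` and `actual_text` keys, and `figures_missing_alt` are all hypothetical stand-ins chosen for this example; they do not reflect pdf_oxide's data model.

```python
def figures_missing_alt(node, path="0"):
    """Return dotted index paths of Figure nodes without a text alternative."""
    missing = []
    if node["type"] == "Figure" and not (node.get("alt") or node.get("actual_text")):
        missing.append(path)
    # Recurse into children, extending the path with each child's index.
    for index, child in enumerate(node.get("children", [])):
        missing.extend(figures_missing_alt(child, f"{path}.{index}"))
    return missing


tree = {
    "type": "Document",
    "children": [
        {"type": "H1", "children": []},
        {"type": "Figure", "alt": "Sales chart"},
        {"type": "Sect", "children": [
            {"type": "Figure"},
            {"type": "Figure", "actual_text": "x"},
        ]},
    ],
}
missing = figures_missing_alt(tree)
```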

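Fonts and Text

Check	Clause	Description
Unicode mapping	7.21.3	All text has a Unicode representation
Font embedding	7.21.4	Fonts are embedded or are standard Base14 fonts
ActualText	7.21.5	Ligatures and special glyphs have `/ActualText`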

UaValidationResult

pub struct UaValidationResult {
    pub level: PdfUaLevel,
    pub errors: Vec<UaComplianceError>,
    pub warnings: Vec<ComplianceWarning>,
    pub stats: UaValidationStats,
}

UaComplianceError

Each error optionally carries WCAG alignment information:

pub struct UaComplianceError {
    pub code: UaErrorCode,
    pub message: String,
    pub location: Option<String>,
    pub wcag_ref: Option<String>,
    pub clause: Option<String>,
}

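The wcag_ref field maps a PDF/UA violation to the corresponding WCAG success criterion (for example, "1.1.1" for Non-text Content and "1.3.1" for Info and Relationships).

UaErrorCode Categories

The UaErrorCode enum includes the following error categories:

MissingLanguage – no /Lang entry in the document catalog
MissingStructureTree – the document is not tagged
MissingAltText – a Figure element lacks alternative text
HeadingSkipped – a heading level was skipped (for example, H1 to H3)
MissingTableHeaders – a table lacks TH elements
FormFieldNoLabel – a form field has no associated label
InvalidRoleMapping – a non-standard type is not mapped to a standard element
ArtifactNotMarked – decorative content is not marked as an artifact
MissingUnicode – text has no Unicode mapping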
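
Because wcag_ref is optional, a report that groups errors by success criterion needs a bucket for errors without a mapping. The following sketch shows one way to do that. The `UaError` dataclass is a hypothetical Python stand-in that mirrors three fields of the Rust struct above, and `group_by_wcag` is not a pdf_oxide function.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class UaError:
    code: str
    message: str
    wcag_ref: Optional[str] = None  # None when no WCAG criterion applies


def group_by_wcag(errors):
    """Group errors by WCAG criterion; unmapped errors go under 'unmapped'."""
    groups = defaultdict(list)
    for error in errors:
        groups[error.wcag_ref or "unmapped"].append(error)
    return dict(groups)


errors = [
    UaError("MissingAltText", "Figure has no /Alt", "1.1.1"),
    UaError("MissingTableHeaders", "Table has no TH", "1.3.1"),
    UaError("MissingAltText", "Figure has no /Alt", "1.1.1"),
    UaError("MissingUnicode", "Glyph has no Unicode mapping"),
]
groups = group_by_wcag(errors)
```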

Practical Example: Accessibility Report

Generate a human-readable accessibility report from the validation result:

use pdf_oxide::PdfDocument;
use pdf_oxide::compliance::{validate_pdf_ua, PdfUaLevel};

let mut doc = PdfDocument::open("document.pdf")?;
let result = validate_pdf_ua(&mut doc, PdfUaLevel::UA1)?;

println!("=== PDF/UA Accessibility Report ===");
println!("Level: PDF/UA-{}", result.level.xmp_part());
println!("Status: {}", if result.has_errors() { "FAIL" } else { "PASS" });
println!();

if result.has_errors() {
    println!("Errors ({}):", result.errors.len());
    for (i, error) in result.errors.iter().enumerate() {
        print!("  {}. [{}] {}", i + 1, error.code, error.message);
        if let Some(ref wcag) = error.wcag_ref {
            print!("  (WCAG {})", wcag);
        }
        println!();
    }
}

if result.has_warnings() {
    println!("\nWarnings ({}):", result.warnings.len());
    for warning in &result.warnings {
        println!("  - [{}] {}", warning.code, warning.message);
    }
}

println!("\nStats:");
println!("  Structure elements checked: {}",
    result.stats.structure_elements_checked);

Next Steps

PDF/A Validation – archival compliance
PDF/X Print Production – print production compliance
API Reference – the complete Rust API