What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Annotation Extraction

PDF Oxide provides access to all annotation types defined in the PDF specification (ISO 32000-1:2008, Section 12.5), including text notes, hyperlinks, highlights, stamps, ink annotations, and more. The document outline (bookmarks) is also accessible for building navigation structures.

Use get_annotations() on PdfDocument for raw annotation data, or the PdfPage DOM API for a unified AnnotationWrapper interface that supports both reading and writing.

Quick Example

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("annotated.pdf")
page = doc.page(0)
for annot in page.annotations():
    print(f"{annot.subtype}: {annot.contents}")

Node.js

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("annotated.pdf");
const annotations = doc.getPageAnnotations(0);
for (const annot of annotations) {
  console.log(`${annot.subtype}: ${annot.contents}`);
}
doc.close();

Go

import (
    "fmt"

    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

doc, _ := pdfoxide.Open("annotated.pdf")
defer doc.Close()
annotations, _ := doc.Annotations(0)
for _, annot := range annotations {
    fmt.Printf("%s: %s\n", annot.Subtype, annot.Content)
}

WASM

const doc = new WasmPdfDocument(bytes);
const annotations = doc.getAnnotations(0);
for (const annot of annotations) {
    console.log(`${annot.subtype}: ${annot.contents}`);
}

Rust

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("annotated.pdf")?;
let annotations = doc.get_annotations(0)?;
for annot in &annotations {
    println!("{:?}: {:?}", annot.subtype_enum, annot.contents);
}

API Reference

`get_annotations(page_index) -> Vec<Annotation>`

Extract raw annotations from a specific page. Returns all annotation types present on the page.

Parameter	Type	Description
`page_index`	`usize`	Zero-based page index

Returns: A vector of Annotation objects.

Annotation Fields

Field	Type	Description
`annotation_type`	`String`	Always `"Annot"`
`subtype`	`Option<String>`	Raw subtype string (e.g., `"Text"`, `"Highlight"`)
`subtype_enum`	`AnnotationSubtype`	Parsed subtype enum
`contents`	`Option<String>`	Text contents of the annotation
`rect`	`Option<[f64; 4]>`	Bounding rectangle [x1, y1, x2, y2]
`author`	`Option<String>`	Author/creator (`/T` entry)
`creation_date`	`Option<String>`	Creation date
`modification_date`	`Option<String>`	Last modification date
`subject`	`Option<String>`	Subject of the annotation
`destination`	`Option<LinkDestination>`	Link destination (for Link annotations)
`action`	`Option<LinkAction>`	Link action (for Link annotations)
`color`	`Option<Vec<f64>>`	Annotation color components
`flags`	`Option<AnnotationFlags>`	Annotation flags (invisible, hidden, print, etc.)
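The `flags` field corresponds to the integer `/F` entry of the annotation dictionary. To illustrate what those bits mean, the sketch below decodes a raw flag value using the bit positions from ISO 32000-1, Table 165. The `AnnotFlag` enum and `describe_flags` helper are hypothetical names for this example and are not part of the PDF Oxide API; how `AnnotationFlags` is exposed in each binding may differ.

```python
from enum import IntFlag


class AnnotFlag(IntFlag):
    """Annotation flag bits from ISO 32000-1, Table 165 (hypothetical helper, not a library type)."""
    INVISIBLE = 1 << 0
    HIDDEN = 1 << 1
    PRINT = 1 << 2
    NO_ZOOM = 1 << 3
    NO_ROTATE = 1 << 4
    NO_VIEW = 1 << 5
    READ_ONLY = 1 << 6
    LOCKED = 1 << 7
    TOGGLE_NO_VIEW = 1 << 8
    LOCKED_CONTENTS = 1 << 9


def describe_flags(raw: int) -> list[str]:
    """Return the names of all flag bits set in a raw /F value, in bit order."""
    return [flag.name for flag in AnnotFlag if raw & flag]


# /F 4: the annotation is printed and carries no other restrictions.
print(describe_flags(4))
# /F 68 = 64 + 4: printed and read-only.
print(describe_flags(68))
```

A value of 0 yields an empty list, which is also a reasonable way to treat a missing `flags` field.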

AnnotationSubtype Variants

Variant	Description
`Text`	Sticky note annotation
`Link`	Hyperlink annotation
`FreeText`	Text box annotation
`Line`	Line shape annotation
`Square`	Rectangle shape annotation
`Circle`	Ellipse shape annotation
`Polygon`	Polygon shape annotation
`PolyLine`	Polyline shape annotation
`Highlight`	Text highlight markup
`Underline`	Text underline markup
`Squiggly`	Squiggly underline markup
`StrikeOut`	Strikethrough markup
`Stamp`	Rubber stamp annotation
`Ink`	Freehand drawing annotation
`Popup`	Pop-up note associated with another annotation
`FileAttachment`	Embedded file annotation
`Sound`	Sound annotation
`Movie`	Movie annotation
`Screen`	Screen annotation
`Widget`	Form field widget
`PrinterMark`	Printer’s mark annotation
`TrapNet`	Trap network annotation
`Watermark`	Watermark annotation
`ThreeDimensional`	3D annotation
`Redact`	Redaction annotation
`Caret`	Caret annotation (insertion point)
`RichMedia`	Rich media annotation
`Unknown`	Unrecognized annotation type
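Several of these variants form informal families; for instance `Highlight`, `Underline`, `Squiggly`, and `StrikeOut` are the four text markup subtypes. The sketch below counts annotations per subtype and totals the text markup ones. It works on plain subtype strings, as the Python and Node.js examples on this page do; the sample list is made up, and mapping a missing subtype to `"Unknown"` is a choice made for this example rather than documented library behavior.

```python
from collections import Counter

# The four text markup subtypes from the variant table above.
TEXT_MARKUP = {"Highlight", "Underline", "Squiggly", "StrikeOut"}


def summarize_subtypes(subtypes):
    """Count annotations per subtype and report how many are text markup."""
    # Treat a missing subtype as "Unknown" (an assumption of this sketch).
    counts = Counter(s if s is not None else "Unknown" for s in subtypes)
    markup_total = sum(n for name, n in counts.items() if name in TEXT_MARKUP)
    return counts, markup_total


# Hypothetical subtype strings, standing in for `annot.subtype` values from one page.
sample = ["Highlight", "Text", "Highlight", "Link", "StrikeOut", None]
counts, markup_total = summarize_subtypes(sample)
print(counts.most_common())
print(f"{markup_total} text markup annotations")
```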

`get_outline() -> Option<Vec<OutlineItem>>`

Get the document outline (bookmarks) if present. Returns a hierarchical tree of outline items that can be used for document navigation.

Returns:

Some(Vec<OutlineItem>) – Bookmarks found and parsed
None – No bookmarks in the document

OutlineItem Fields

Field	Type	Description
`title`	`String`	Bookmark title text
`dest`	`Option<Destination>`	Navigation destination
`children`	`Vec<OutlineItem>`	Nested child bookmarks

Destination Variants

Variant	Description
`PageIndex(usize)`	Direct page reference (0-based index)
`Named(String)`	Named destination identifier

Rust

let mut doc = PdfDocument::open("book.pdf")?;

if let Some(outline) = doc.get_outline()? {
    for item in &outline {
        println!("  {}", item.title);
        for child in &item.children {
            println!("    {}", child.title);
        }
    }
} else {
    println!("No bookmarks found.");
}
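The Rust snippet above prints only two levels of nesting. Because `children` can nest to any depth, a recursive walk is usually more convenient. The following is a minimal sketch in Python using a stand-in `OutlineItem` dataclass that mirrors the documented `title` and `children` fields; the dataclass and the `flatten_outline` helper are illustrative and are not the library's Python types.

```python
from dataclasses import dataclass, field


@dataclass
class OutlineItem:
    """Stand-in mirroring the documented `title` and `children` fields."""
    title: str
    children: list["OutlineItem"] = field(default_factory=list)


def flatten_outline(items, depth=0):
    """Yield (depth, title) pairs in document order, descending into all levels."""
    for item in items:
        yield depth, item.title
        yield from flatten_outline(item.children, depth + 1)


# Hypothetical outline with three levels of nesting.
outline = [
    OutlineItem("Part I", [
        OutlineItem("Chapter 1", [OutlineItem("Section 1.1")]),
        OutlineItem("Chapter 2"),
    ]),
    OutlineItem("Part II"),
]

for depth, title in flatten_outline(outline):
    print("  " * depth + title)
```

Yielding pairs instead of printing directly keeps the traversal reusable, for example to build a navigation sidebar or a flat search index.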

PdfPage Annotation API (DOM)

The PdfPage object from the DocumentEditor provides a higher-level AnnotationWrapper interface that supports both reading existing annotations and adding new ones.

`page.annotations() -> &[AnnotationWrapper]`

Get all annotations on the page as wrapped objects.

`page.find_annotations_by_type(subtype) -> Vec<&AnnotationWrapper>`

Find annotations of a specific type.

`page.add_annotation(annotation)`

Add a new annotation to the page.

`page.remove_annotation(index) -> Option<AnnotationWrapper>`

Remove an annotation by index.

`page.find_annotations_in_region(rect) -> Vec<&AnnotationWrapper>`

Find annotations whose bounding boxes intersect a given region.
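To make "intersect" concrete, here is a minimal sketch of an axis-aligned rectangle overlap test on `[x1, y1, x2, y2]` coordinates in PDF points. The `normalize` and `rects_intersect` functions are hypothetical helpers, not the library's implementation. This page does not state whether rectangles that merely touch along an edge count as intersecting; this sketch treats them as non-intersecting.

```python
def normalize(rect):
    """Return (left, bottom, right, top) regardless of the corner order given."""
    x1, y1, x2, y2 = rect
    return min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2)


def rects_intersect(a, b):
    """True if two [x1, y1, x2, y2] rectangles overlap with non-zero area."""
    a_left, a_bottom, a_right, a_top = normalize(a)
    b_left, b_bottom, b_right, b_top = normalize(b)
    return (
        a_left < b_right and b_left < a_right
        and a_bottom < b_top and b_bottom < a_top
    )


region = [0.0, 700.0, 612.0, 792.0]      # top band of a US Letter page, in points
note = [500.0, 720.0, 520.0, 740.0]      # hypothetical annotation rect inside the band
footer_link = [72.0, 30.0, 200.0, 45.0]  # hypothetical annotation rect near the bottom

print(rects_intersect(region, note))         # True
print(rects_intersect(region, footer_link))  # False
```

Normalizing first matters because a PDF rectangle may list its corners in either order.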

AnnotationWrapper Methods

Method	Returns	Description
`id()`	`AnnotationId`	Unique session ID
`subtype()`	`AnnotationSubtype`	Annotation type
`rect()`	`Rect`	Bounding rectangle
`contents()`	`Option<&str>`	Text contents
`color()`	`Option<(f32, f32, f32)>`	RGB color (0.0–1.0)
`is_modified()`	`bool`	Whether annotation has been changed
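Since `color()` reports RGB components in the 0.0 to 1.0 range, a common follow-up step is converting them to an 8-bit hex string for display in a UI or report. The helper below is a small sketch of that conversion; `rgb_to_hex` is a hypothetical name and is not provided by PDF Oxide. It assumes the color arrives as a three-element tuple or `None`.

```python
def rgb_to_hex(color):
    """Convert an (r, g, b) tuple of 0.0 to 1.0 floats to '#rrggbb', or None if absent."""
    if color is None:
        return None
    # Clamp each channel to [0.0, 1.0] before scaling to 0..255.
    channels = [round(max(0.0, min(1.0, c)) * 255) for c in color]
    return "#{:02x}{:02x}{:02x}".format(*channels)


print(rgb_to_hex((1.0, 1.0, 0.0)))  # '#ffff00', a typical highlighter yellow
print(rgb_to_hex(None))             # None
```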

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("annotated.pdf")
page = doc.page(0)

# List all annotations
for annot in page.annotations():
    print(f"[{annot.subtype}] {annot.contents} at {annot.rect}")

# Find highlights
highlights = [a for a in page.annotations() if a.subtype == "Highlight"]
print(f"Found {len(highlights)} highlights")

Node.js

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("annotated.pdf");
const annotations = doc.getPageAnnotations(0);

// List all annotations
for (const annot of annotations) {
  console.log(`[${annot.subtype}] ${annot.contents}`);
}

// Find highlights
const highlights = annotations.filter(a => a.subtype === "Highlight");
console.log(`Found ${highlights.length} highlights`);
doc.close();

Go

doc, _ := pdfoxide.Open("annotated.pdf")
defer doc.Close()
annotations, _ := doc.Annotations(0)

// List all annotations
for _, annot := range annotations {
    fmt.Printf("[%s] %s\n", annot.Subtype, annot.Content)
}

// Find highlights
highlights := 0
for _, a := range annotations {
    if a.Subtype == "Highlight" {
        highlights++
    }
}
fmt.Printf("Found %d highlights\n", highlights)

WASM

const doc = new WasmPdfDocument(bytes);
const annotations = doc.getAnnotations(0);

// List all annotations
for (const annot of annotations) {
    console.log(`[${annot.subtype}] ${annot.contents}`);
}

// Find highlights
const highlights = annotations.filter(a => a.subtype === "Highlight");
console.log(`Found ${highlights.length} highlights`);

Rust

use pdf_oxide::editor::{DocumentEditor, EditableDocument};
use pdf_oxide::annotation_types::AnnotationSubtype;

let mut editor = DocumentEditor::open("annotated.pdf")?;
let page = editor.get_page(0)?;

// Find all highlight annotations
let highlights = page.find_annotations_by_type(AnnotationSubtype::Highlight);
for h in &highlights {
    println!("Highlight at {:?}: {:?}", h.rect(), h.contents());
}

Advanced Examples

Build a table of contents from bookmarks

use pdf_oxide::PdfDocument;
use pdf_oxide::outline::Destination;

let mut doc = PdfDocument::open("book.pdf")?;

fn print_toc(items: &[pdf_oxide::outline::OutlineItem], depth: usize) {
    for item in items {
        let indent = "  ".repeat(depth);
        let page = match &item.dest {
            Some(Destination::PageIndex(p)) => format!("page {}", p + 1),
            Some(Destination::Named(n)) => format!("dest '{}'", n),
            None => "no dest".to_string(),
        };
        println!("{}{} ({})", indent, item.title, page);
        print_toc(&item.children, depth + 1);
    }
}

if let Some(outline) = doc.get_outline()? {
    println!("Table of Contents:");
    print_toc(&outline, 0);
}

Extract all comments (Text annotations)

use pdf_oxide::PdfDocument;
use pdf_oxide::annotation_types::AnnotationSubtype;

let mut doc = PdfDocument::open("reviewed.pdf")?;
let page_count = doc.page_count()?;

for page_idx in 0..page_count {
    let annotations = doc.get_annotations(page_idx)?;
    let comments: Vec<_> = annotations.iter()
        .filter(|a| a.subtype_enum == AnnotationSubtype::Text)
        .collect();

    if !comments.is_empty() {
        println!("Page {}:", page_idx + 1);
        for c in &comments {
            let author = c.author.as_deref().unwrap_or("Unknown");
            let text = c.contents.as_deref().unwrap_or("");
            println!("  [{}] {}", author, text);
        }
    }
}
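When reviewing comments it often helps to sort them by `creation_date`. This page does not say whether that field holds the raw PDF date string or an already normalized value. Assuming it is the raw form defined in ISO 32000-1, Section 7.9.4 (for example `D:20240315143000+02'00'`), the sketch below converts it to a Python `datetime`. `parse_pdf_date` is a hypothetical helper, and it handles only the common `Z` and `+HH'mm'` offset forms.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYY[MM[DD[HH[mm[SS]]]]] followed by an optional Z or +HH'mm' / -HH'mm' offset.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(value):
    """Parse a raw PDF date string; return None if it does not match or is out of range."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None
    year, month, day, hour, minute, second, zulu, sign, tz_h, tz_m = match.groups()
    tz = None
    if zulu:
        tz = timezone.utc
    elif sign:
        offset = timedelta(hours=int(tz_h), minutes=int(tz_m or 0))
        tz = timezone(offset if sign == "+" else -offset)
    try:
        # Omitted fields default to the start of the period (month 1, day 1, 00:00:00).
        return datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0), tzinfo=tz,
        )
    except ValueError:
        return None


print(parse_pdf_date("D:20240315143000+02'00'"))  # 2024-03-15 14:30:00+02:00
```

Dates without an offset come back as naive `datetime` objects, so mixing them with offset-aware ones in a single sort needs an explicit policy.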

Extract all hyperlinks

use pdf_oxide::PdfDocument;
use pdf_oxide::annotation_types::AnnotationSubtype;

let mut doc = PdfDocument::open("report.pdf")?;
let annotations = doc.get_annotations(0)?;

let links: Vec<_> = annotations.iter()
    .filter(|a| a.subtype_enum == AnnotationSubtype::Link)
    .collect();

for link in &links {
    if let Some(ref action) = link.action {
        println!("Link: {:?}", action);
    }
    if let Some(ref dest) = link.destination {
        println!("Internal link: {:?}", dest);
    }
}

Related Pages

Form Data Extraction – Extract form fields (Widget annotations)
Text Extraction – Extract text content from pages
Metadata & XMP – Read document properties and bookmarks
