What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (Objective-C)

PDF Oxide ships idiomatic Objective-C bindings over its Rust core — 0.8ms mean text extraction, 100% pass rate on 3,830 PDFs. NSObject wrappers (POXDocument, POXPdf) own the native handles and free them under ARC, returned strings come back as NSString, and any C-ABI error code surfaces as an NSError in the POXErrorDomain. New in v0.3.69.

Installation

The Objective-C binding links the default-feature cdylib and builds with clang under ARC. Build the native library first, then run make build in the objc directory, pointing PDF_OXIDE_LIB_DIR at the directory that contains it:

# 1. build the native library (shipped binding feature set)
cargo build --release --lib --features ocr,rendering,signatures,barcodes,tsa-client,system-fonts

# 2. build the Objective-C binding (clang, ARC)
cd objc
make build PDF_OXIDE_LIB_DIR="$PWD/../target/release"

# 3. run the basic_extraction example against the freshly built library
DYLD_LIBRARY_PATH="$PWD/../target/release" ./basic_extraction

Import the single public header in your sources:

#import "POXPdfOxide.h"

Quick Start

Open a PDF, inspect its metadata, and extract text from the first page. Every fallible call takes a trailing NSError**.

#import "POXPdfOxide.h"

NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"research-paper.pdf" error:&err];
if (!doc) {
    NSLog(@"open failed: %@", err.localizedDescription);
    return;
}

NSInteger pages = [doc pageCountError:&err];
POXVersion ver  = [doc version];
NSLog(@"pages: %ld  version: %d.%d", (long)pages, ver.major, ver.minor);

NSString *text = [doc extractText:0 error:&err];
NSLog(@"%@", text);

You can also open from in-memory bytes — useful when the PDF arrives over the network or from a database — and open password-protected files:

// pdfData is an NSData that already holds the PDF bytes
POXDocument *doc = [POXDocument openFromBytes:pdfData error:&err];

// Encrypted document, password supplied up front:
POXDocument *enc = [POXDocument openWithPassword:@"confidential.pdf"
                                       password:@"secret"
                                          error:&err];

// Or authenticate after opening:
BOOL ok = [doc authenticate:@"secret" error:&err];
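
The pdfData value above can be any NSData that holds a PDF. As a minimal sketch, the helper below reads a file into memory with Foundation's dataWithContentsOfFile:options:error: and passes the bytes to openFromBytes:error:. The helper name OpenPdfFromData is illustrative and not part of the binding.

```objc
#import <Foundation/Foundation.h>
#import "POXPdfOxide.h"

// Hypothetical helper (not part of the binding): read a PDF into memory,
// then open it from the bytes.
static POXDocument *OpenPdfFromData(NSString *path, NSError **error) {
    NSData *pdfData = [NSData dataWithContentsOfFile:path options:0 error:error];
    if (!pdfData) {
        return nil;  // Foundation has already filled *error (e.g. file not found)
    }
    return [POXDocument openFromBytes:pdfData error:error];
}
```

Both failure paths report through the same NSError**, so the caller only needs one nil check; the error's domain tells you whether Foundation or PDF Oxide rejected the input.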

Text Extraction

Plain text is the fast path. Extract a single page by its zero-based index, or pull the whole document at once.

// One page
NSString *text = [doc extractText:0 error:&err];

// Whole document, joined
NSString *all = [doc toPlainTextAllWithError:&err];

// Page by page
NSInteger count = [doc pageCountError:&err];
for (NSInteger i = 0; i < count; i++) {
    NSLog(@"--- page %ld ---\n%@", (long)i, [doc extractText:i error:&err]);
}

Words and Lines

extractWords:error: and extractTextLines:error: return arrays of element objects with bounding boxes and font metadata, all in PDF user-space points.

NSArray<POXWord *> *words = [doc extractWords:0 error:&err];
for (POXWord *w in words) {
    POXBbox box = w.bbox;
    NSLog(@"'%@' at (%.1f, %.1f) %.1fx%.1f  font=%@ size=%.1f  bold=%d",
          w.text, box.x, box.y, box.width, box.height,
          w.fontName, w.fontSize, w.bold);
}

NSArray<POXTextLine *> *lines = [doc extractTextLines:0 error:&err];
for (POXTextLine *line in lines) {
    NSLog(@"%@  (%ld words)", line.text, (long)line.wordCount);
}

POXChar (from extractChars:) exposes the same shape at character granularity — character, bbox, fontName, and fontSize.
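
As a rough sketch of character-level access, the helper below counts how many characters on a page use each font. It assumes extractChars:error: takes a page index and a trailing NSError** like extractWords:error: does, and that it returns an NSArray of POXChar; check the header before relying on that. FontUsageOnPage is a hypothetical name.

```objc
#import <Foundation/Foundation.h>
#import "POXPdfOxide.h"

// Hypothetical helper: tally characters per font name on one page.
// ASSUMPTION: extractChars:error: mirrors extractWords:error:
// (zero-based page index, trailing NSError **, NSArray<POXChar *> result).
static NSCountedSet<NSString *> *FontUsageOnPage(POXDocument *doc,
                                                 NSInteger page,
                                                 NSError **error) {
    NSArray<POXChar *> *chars = [doc extractChars:page error:error];
    if (!chars) {
        return nil;
    }
    NSCountedSet<NSString *> *fonts = [NSCountedSet set];
    for (POXChar *c in chars) {
        if (c.fontName) {  // skip characters without font metadata
            [fonts addObject:c.fontName];
        }
    }
    return fonts;
}
```

Iterate the returned set and call countForObject: on each name to see which fonts dominate the page.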

Markdown and HTML

Convert a page — or the entire document — to Markdown or HTML.

// Single page
NSString *md   = [doc toMarkdown:0 error:&err];
NSString *html = [doc toHtml:0 error:&err];

// Whole document
NSString *mdAll   = [doc toMarkdownAllWithError:&err];
NSString *htmlAll = [doc toHtmlAllWithError:&err];
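
The converted string is an ordinary NSString, so writing it to disk is plain Foundation. A minimal sketch, with ExportMarkdown as a hypothetical helper name and the output path supplied by the caller:

```objc
#import <Foundation/Foundation.h>
#import "POXPdfOxide.h"

// Hypothetical helper: convert the whole document to Markdown and save it as UTF-8.
static BOOL ExportMarkdown(POXDocument *doc, NSString *outPath, NSError **error) {
    NSString *md = [doc toMarkdownAllWithError:error];
    if (!md) {
        return NO;  // conversion failed; *error comes from PDF Oxide
    }
    // Atomic write: the file appears only once it has been written completely.
    return [md writeToFile:outPath
                atomically:YES
                  encoding:NSUTF8StringEncoding
                     error:error];
}
```

Swapping toMarkdownAllWithError: for toHtmlAllWithError: gives the HTML equivalent.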

Search

Search a single page with search:term:caseSensitive:error:, or the whole document with searchAll:caseSensitive:error:. Both return arrays of POXSearchResult carrying the matched text, page index, and bounding box.

NSArray<POXSearchResult *> *hits =
    [doc searchAll:@"configuration" caseSensitive:NO error:&err];

for (POXSearchResult *r in hits) {
    POXBbox b = r.bbox;
    NSLog(@"page %ld: '%@' at (%.0f, %.0f)", (long)r.page, r.text, b.x, b.y);
}

// Single-page variant:
NSArray<POXSearchResult *> *pageHits =
    [doc search:0 term:@"configuration" caseSensitive:NO error:&err];
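
Because each result carries its page index, document-wide hits are easy to summarise. The sketch below counts matches per page; HitCountsByPage is a hypothetical helper, and it relies only on the page property shown above.

```objc
#import <Foundation/Foundation.h>
#import "POXPdfOxide.h"

// Hypothetical helper: map page index -> number of search hits on that page.
static NSDictionary<NSNumber *, NSNumber *> *HitCountsByPage(
        NSArray<POXSearchResult *> *hits) {
    NSMutableDictionary<NSNumber *, NSNumber *> *counts =
        [NSMutableDictionary dictionary];
    for (POXSearchResult *r in hits) {
        NSNumber *page = @(r.page);
        // A missing key yields nil, and integerValue on nil is 0.
        counts[page] = @(counts[page].integerValue + 1);
    }
    return counts;
}
```

Pass it the array returned by searchAll:caseSensitive:error: to find the pages where a term is concentrated.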

Creating PDFs

The POXPdf builder produces PDFs from Markdown, HTML, or plain text. Save to a path or get the bytes back as NSData.

POXPdf *pdf = [POXPdf fromMarkdown:@"# Hello World\n\nThis is a PDF.\n"
                            error:&err];
[pdf saveToPath:@"output.pdf" error:&err];

// Or keep the bytes in memory
NSData *bytes = [pdf toBytesWithError:&err];

// HTML and plain text constructors exist too
POXPdf *invoice = [POXPdf fromHtml:@"<h1>Invoice</h1><p>Amount: $42</p>"
                            error:&err];
POXPdf *notes   = [POXPdf fromText:@"Plain text content." error:&err];

Round-trip from a builder straight into a document for extraction:

POXPdf *pdf      = [POXPdf fromMarkdown:@"# Report\n\nBody text.\n" error:&err];
POXDocument *doc = [POXDocument openFromBytes:[pdf toBytesWithError:&err]
                                        error:&err];
NSLog(@"%@", [doc extractText:0 error:&err]);

Error Handling

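Every fallible method writes into the trailing NSError** and returns nil (for object results) or a sentinel value (for scalar results) on failure, so test the return value first and read the error only when it signals failure. Errors land in the POXErrorDomain.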

NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"document.pdf" error:&err];
if (!doc) {
    if ([err.domain isEqualToString:POXErrorDomain]) {
        NSLog(@"PDF error: %@", err.localizedDescription);
    }
    return;
}

NSString *text = [doc extractText:0 error:&err];
if (!text) {
    NSLog(@"extract failed: %@", err.localizedDescription);
}

Handles free themselves under ARC, but you can release the native handle eagerly with -close (idempotent):

[doc close];
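
Eager closing matters most in batch jobs, where many documents are opened in one loop. The following is a sketch rather than a prescribed pattern: ExtractFirstPages is a hypothetical helper that opens each path, extracts the first page, and closes the handle before moving on.

```objc
#import <Foundation/Foundation.h>
#import "POXPdfOxide.h"

// Hypothetical helper: returns how many files yielded text from page 0.
static NSUInteger ExtractFirstPages(NSArray<NSString *> *paths) {
    NSUInteger succeeded = 0;
    for (NSString *path in paths) {
        // Drain temporaries (strings, errors) once per file.
        @autoreleasepool {
            NSError *err = nil;
            POXDocument *doc = [POXDocument openPath:path error:&err];
            if (!doc) {
                NSLog(@"skipping %@: %@", path, err.localizedDescription);
                continue;
            }
            NSString *text = [doc extractText:0 error:&err];
            if (text) {
                succeeded++;
            } else {
                NSLog(@"extract failed for %@: %@", path, err.localizedDescription);
            }
            [doc close];  // release the native handle now instead of waiting for ARC
        }
    }
    return succeeded;
}
```

The per-iteration @autoreleasepool keeps peak memory flat across large batches, and -close keeps the number of live native handles at one.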

Next Steps

Rust Getting Started — using PDF Oxide from Rust
Python Getting Started — using PDF Oxide from Python
Text Extraction — detailed extraction options and recipes
PDF Creation — advanced creation with the builder API
Editing — modifying existing PDFs, annotations, and form fields