What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (PHP)

PDF Oxide is the fastest PHP PDF library for text extraction — 0.8ms mean, 100% pass rate on 3,830 PDFs. One library for extracting, converting, and creating PDFs, built on the same Rust core used by the Python, Node, Go, C#, Ruby, and Java bindings.

Installation

composer require oxide/pdf-oxide

Composer’s post-install hook downloads the matching prebuilt native library into vendor/oxide/pdf-oxide/lib/ for your platform (linux-x86_64, linux-aarch64, darwin-x86_64, darwin-arm64, windows-x64).
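To see which of the platform tags above applies to a given machine, a small lookup can help when debugging a missing native library. This is only a sketch: pdfOxidePlatformTag() is a hypothetical helper, not part of the package, and the mapping from PHP_OS_FAMILY and php_uname('m') values to tags is an assumption about typical systems, not a description of how the post-install hook actually detects the platform.

```php
<?php
/**
 * Map an OS family and machine name to one of the platform tags listed above.
 * Returns null for combinations that are not in that list.
 * Hypothetical helper: the real installer may detect platforms differently.
 */
function pdfOxidePlatformTag(string $osFamily, string $machine): ?string
{
    // Normalise common spellings of the two supported architectures.
    $machine = strtolower($machine);
    $arch = match (true) {
        in_array($machine, ['x86_64', 'amd64'], true) => 'x86_64',
        in_array($machine, ['aarch64', 'arm64'], true) => 'arm64',
        default => null,
    };
    if ($arch === null) {
        return null;
    }

    return match ($osFamily) {
        'Linux' => $arch === 'arm64' ? 'linux-aarch64' : 'linux-x86_64',
        'Darwin' => $arch === 'arm64' ? 'darwin-arm64' : 'darwin-x86_64',
        'Windows' => $arch === 'x86_64' ? 'windows-x64' : null,
        default => null,
    };
}

// Report the tag for the machine running this script.
echo pdfOxidePlatformTag(PHP_OS_FAMILY, php_uname('m')) ?? 'unsupported', "\n";
```

If the script prints "unsupported", no prebuilt library in the list above matches the machine.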

Requirements: PHP 8.2+ (8.2, 8.3, 8.4, 8.5) with ext-ffi enabled. Confirm with php -m | grep -i ffi. Some managed hosts disable ext-ffi; if so, use a Docker image such as php:8.3-cli.
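The same requirements can be checked from inside PHP, which is handy on hosts where shell access is limited. The sketch below uses only built-in functions (version_compare() and extension_loaded()); pdfOxideEnvironmentProblems() is a hypothetical helper name, not something the package ships.

```php
<?php
/**
 * Collect human-readable problems with a PHP runtime.
 * Returns an empty array when the documented requirements are met.
 * Hypothetical helper, not part of the package.
 */
function pdfOxideEnvironmentProblems(string $phpVersion, bool $ffiLoaded): array
{
    $problems = [];
    if (version_compare($phpVersion, '8.2.0', '<')) {
        $problems[] = "PHP 8.2+ is required, found {$phpVersion}";
    }
    if (!$ffiLoaded) {
        $problems[] = 'ext-ffi is not loaded';
    }
    return $problems;
}

// Check the runtime that is executing this script.
$problems = pdfOxideEnvironmentProblems(PHP_VERSION, extension_loaded('ffi'));
echo $problems === [] ? "Environment looks OK\n" : implode("\n", $problems) . "\n";
```

Taking the version and the FFI flag as parameters keeps the check easy to exercise without changing the host configuration.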

Opening a PDF

Use PdfDocument::open() to load a file and inspect its metadata.

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('research-paper.pdf');
echo $doc->pageCount(), " pages\n";

$version = $doc->version();   // ['major' => int, 'minor' => int]
printf("PDF version: %d.%d\n", $version['major'], $version['minor']);

$doc->close();   // or rely on __destruct()
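When code between open() and close() can throw, a try/finally wrapper makes the close explicit instead of relying on __destruct(). The following is a minimal sketch of that pattern. withClosing() and FakeDocument are hypothetical names introduced for illustration; the stand-in class only exists so the example runs without the library installed.

```php
<?php
/**
 * Run $work with a freshly opened resource and always call close() on it,
 * even when $work throws. Hypothetical helper, not part of the package.
 */
function withClosing(callable $open, callable $work): mixed
{
    $resource = $open();
    try {
        return $work($resource);
    } finally {
        $resource->close();
    }
}

// Stand-in used only to demonstrate the helper without the library installed.
final class FakeDocument
{
    public bool $closed = false;

    public function pageCount(): int
    {
        return 3;
    }

    public function close(): void
    {
        $this->closed = true;
    }
}

$fake = new FakeDocument();
$count = withClosing(fn () => $fake, fn ($doc) => $doc->pageCount());
echo $count, " pages, closed: ", var_export($fake->closed, true), "\n";
```

With the library installed, the first argument would be something like fn () => PdfDocument::open('research-paper.pdf'), assuming close() is safe to call exactly once on the returned document.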

Text Extraction

Single Page

Extract plain text from any page by its zero-based index.

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('report.pdf');
echo $doc->extractText(0);
$doc->close();

All Pages

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('book.pdf');
for ($i = 0, $n = $doc->pageCount(); $i < $n; $i++) {
    echo "--- Page " . ($i + 1) . " ---\n";
    echo $doc->extractText($i), "\n";
}
$doc->close();

One-Shot Extraction

extractTextOnce() is a static helper that opens, extracts page 0, and closes in a single call.

use PdfOxide\PdfDocument;

echo PdfDocument::extractTextOnce('report.pdf');

Auto-Routed Extraction

extractTextAuto() returns a page's native text when it has any. Otherwise it falls back to whatever text it can recover, and the fallback path never throws.

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('mixed.pdf');
echo $doc->extractTextAuto(0);
$doc->close();

Page API

pages() returns an array of PdfPage views, and pagesIter() yields them lazily with their index. Each page delegates extraction back to the document.

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('paper.pdf');

foreach ($doc->pagesIter() as $index => $page) {
    echo "Page {$index}:\n";
    echo $page->text(), "\n";
}

// Or grab a single page directly:
$page = $doc->page(0);
echo $page->toMarkdown();

$doc->close();

PdfPage methods: index(), parent(), text(), textAuto(), toMarkdown(), toHtml().

Markdown & HTML Conversion

Convert a single page or the whole document to Markdown, or render a page to HTML.

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('paper.pdf');

echo $doc->toMarkdown(0);     // one page (defaults to page 0)
echo $doc->toMarkdownAll();   // entire document
echo $doc->toHtml(0);         // one page as HTML

$doc->close();

MarkdownConverter exposes the same conversions as static methods that take the document as their first argument, along with toPlainText().

use PdfOxide\PdfDocument;
use PdfOxide\MarkdownConverter;

$doc = PdfDocument::open('paper.pdf');

echo MarkdownConverter::toMarkdown($doc, 0);
echo MarkdownConverter::toMarkdownAll($doc);
echo MarkdownConverter::toHtml($doc, 0);
echo MarkdownConverter::toPlainText($doc, 0);

$doc->close();

Structured Extraction

extractStructured() returns a layout-aware view of a page as an associative array — regions with their kind, text, bounding box, and column index.

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('paper.pdf');
$structured = $doc->extractStructured(0);

printf("Page %d: %.0f x %.0f\n",
    $structured['page_index'],
    $structured['page_width'],
    $structured['page_height']);

foreach ($structured['regions'] as $region) {
    echo "[{$region['kind']}] {$region['text']}\n";
}

$doc->close();
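Once the regions are available as plain arrays, ordinary array code is enough to post-process them. The sketch below groups region text by kind and relies only on the 'kind' and 'text' keys used in the example above. groupRegionTextByKind() is a hypothetical helper, and the kind names in the sample data are illustrative placeholders, not a documented list of values.

```php
<?php
/**
 * Group the text of each region under its 'kind'.
 * Hypothetical helper operating on the 'regions' array shape shown above.
 *
 * @param array<int, array{kind: string, text: string}> $regions
 * @return array<string, string[]>
 */
function groupRegionTextByKind(array $regions): array
{
    $grouped = [];
    foreach ($regions as $region) {
        $grouped[$region['kind']][] = $region['text'];
    }
    return $grouped;
}

// Hand-written sample; the kind names are illustrative, not a documented list.
$sample = [
    ['kind' => 'heading', 'text' => 'Introduction'],
    ['kind' => 'paragraph', 'text' => 'First paragraph.'],
    ['kind' => 'paragraph', 'text' => 'Second paragraph.'],
];

foreach (groupRegionTextByKind($sample) as $kind => $texts) {
    echo $kind, ': ', count($texts), "\n";
}
```

In real use the input would be $structured['regions'] from extractStructured().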

Opening from Bytes

Open a PDF from an in-memory string — useful when fetching from S3, HTTP, or a database.

use PdfOxide\PdfDocument;

$bytes = file_get_contents('report.pdf');
$doc = PdfDocument::openBytes($bytes);
echo $doc->extractText(0);
$doc->close();
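Bytes fetched over HTTP or from object storage are sometimes an error page and not a PDF, so a cheap header check before calling openBytes() can give a clearer error message. The sketch below looks for the "%PDF-" marker within the first 1024 bytes, a tolerance many PDF readers apply. Whether PDF Oxide applies the same tolerance is not stated here, so treat this as a pre-filter only. looksLikePdf() is a hypothetical helper.

```php
<?php
/**
 * Cheap sanity check: does the payload carry a "%PDF-" marker near the start?
 * This does not validate the file; it only filters obvious non-PDF payloads.
 * Hypothetical helper, not part of the package.
 */
function looksLikePdf(string $bytes): bool
{
    return str_contains(substr($bytes, 0, 1024), '%PDF-');
}

var_dump(looksLikePdf("%PDF-1.7\n%%EOF"));
var_dump(looksLikePdf('<html>Not found</html>'));
```

A caller would typically check looksLikePdf($bytes) first and only pass the payload to PdfDocument::openBytes() when it returns true.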

Inspecting Document Features

Probe a document before processing it.

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('form.pdf');

var_dump($doc->hasStructureTree());   // tagged PDF?
var_dump($doc->hasFormFields());      // AcroForm fields?
var_dump($doc->hasSignatures());      // digital signatures?

$doc->close();

PDF Creation

The Pdf class provides factory methods to build PDFs from Markdown, HTML, or plain text.

use PdfOxide\Pdf;

$pdf = Pdf::fromMarkdown("# Invoice\n\n**Total:** \$42.00\n");
$pdf->saveTo('invoice.pdf');           // write to a path
$pdf->close();

$pdf = Pdf::fromHtml('<h1>Report</h1><p>Quarterly figures.</p>');
$bytes = $pdf->save();                 // or get the raw bytes
file_put_contents('report.pdf', $bytes);
$pdf->close();

$pdf = Pdf::fromText("Plain text document.\n\nSecond paragraph.");
$pdf->saveTo('notes.pdf');
$pdf->close();
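In practice the Markdown passed to Pdf::fromMarkdown() is usually assembled from application data. The sketch below builds an invoice string from a list of line items. invoiceMarkdown() and the item names are hypothetical, and how list items or bold text are rendered in the resulting PDF is not covered here.

```php
<?php
/**
 * Build a small Markdown invoice from line items.
 * Hypothetical helper; the output is plain Markdown text.
 *
 * @param array<string, float> $items description => amount
 */
function invoiceMarkdown(array $items): string
{
    $lines = ['# Invoice', ''];
    foreach ($items as $description => $amount) {
        $lines[] = sprintf('- %s: $%s', $description, number_format($amount, 2));
    }
    $lines[] = '';
    $lines[] = sprintf('**Total:** $%s', number_format(array_sum($items), 2));
    return implode("\n", $lines) . "\n";
}

echo invoiceMarkdown(['Consulting' => 30.00, 'Hosting' => 12.00]);
```

The returned string can then be handed to Pdf::fromMarkdown() and saved with saveTo() as in the examples above.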

Next Steps

Python Getting Started – using PDF Oxide from Python
Rust Getting Started – using PDF Oxide from Rust
Text Extraction – detailed extraction options and recipes
PDF Creation – advanced creation with metadata and styling
Editing – modifying existing PDFs, annotations, and form fields