Skip to content

Getting Started with PDF Oxide (PHP)

PDF Oxide is the fastest PHP PDF library for text extraction — 0.8ms mean, 100% pass rate on 3,830 PDFs. One library for extracting, converting, and creating PDFs, built on the same Rust core used by the Python, Node, Go, C#, Ruby, and Java bindings.

Installation

composer require oxide/pdf-oxide

Composer’s post-install hook downloads the matching prebuilt native library into vendor/oxide/pdf-oxide/lib/ for your platform (linux-x86_64, linux-aarch64, darwin-x86_64, darwin-arm64, windows-x64).

Requirements: PHP 8.2+ (8.2, 8.3, 8.4, 8.5) with ext-ffi enabled. Confirm with php -m | grep -i ffi. Some managed hosts disable ext-ffi; if so, use a Docker image such as php:8.3-cli.

Opening a PDF

Use PdfDocument::open() to load a file and inspect its metadata.

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('research-paper.pdf');
echo $doc->pageCount(), " pages\n";

$version = $doc->version();   // ['major' => int, 'minor' => int]
printf("PDF version: %d.%d\n", $version['major'], $version['minor']);

$doc->close();   // or rely on __destruct()

Text Extraction

Single Page

Extract plain text from any page by its zero-based index.

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('report.pdf');
echo $doc->extractText(0);
$doc->close();

All Pages

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('book.pdf');
for ($i = 0; $i < $doc->pageCount(); $i++) {
    echo "--- Page " . ($i + 1) . " ---\n";
    echo $doc->extractText($i), "\n";
}
$doc->close();

One-Shot Extraction

extractTextOnce() is a static helper that opens, extracts page 0, and closes in a single call.

use PdfOxide\PdfDocument;

echo PdfDocument::extractTextOnce('report.pdf');

Auto-Routed Extraction

extractTextAuto() returns native text when present and gracefully falls back to whatever text is recoverable — it never throws on the fallback path.

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('mixed.pdf');
echo $doc->extractTextAuto(0);
$doc->close();

Page API

pages() returns an array of PdfPage views, and pagesIter() yields them lazily with their index. Each page delegates extraction back to the document.

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('paper.pdf');

foreach ($doc->pagesIter() as $index => $page) {
    echo "Page {$index}:\n";
    echo $page->text(), "\n";
}

// Or grab a single page directly:
$page = $doc->page(0);
echo $page->toMarkdown();

$doc->close();

PdfPage methods: index(), parent(), text(), textAuto(), toMarkdown(), toHtml().

Markdown & HTML Conversion

Convert a single page or the whole document to Markdown, or render a page to HTML.

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('paper.pdf');

echo $doc->toMarkdown(0);     // one page (defaults to page 0)
echo $doc->toMarkdownAll();   // entire document
echo $doc->toHtml(0);         // one page as HTML

$doc->close();

The static MarkdownConverter exposes the same conversions without holding a page index in the document call site.

use PdfOxide\PdfDocument;
use PdfOxide\MarkdownConverter;

$doc = PdfDocument::open('paper.pdf');

echo MarkdownConverter::toMarkdown($doc, 0);
echo MarkdownConverter::toMarkdownAll($doc);
echo MarkdownConverter::toHtml($doc, 0);
echo MarkdownConverter::toPlainText($doc, 0);

$doc->close();

Structured Extraction

extractStructured() returns a layout-aware view of a page as an associative array — regions with their kind, text, bounding box, and column index.

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('paper.pdf');
$structured = $doc->extractStructured(0);

printf("Page %d: %.0f x %.0f\n",
    $structured['page_index'],
    $structured['page_width'],
    $structured['page_height']);

foreach ($structured['regions'] as $region) {
    echo "[{$region['kind']}] {$region['text']}\n";
}

$doc->close();

Opening from Bytes

Open a PDF from an in-memory string — useful when fetching from S3, HTTP, or a database.

use PdfOxide\PdfDocument;

$bytes = file_get_contents('report.pdf');
$doc = PdfDocument::openBytes($bytes);
echo $doc->extractText(0);
$doc->close();

Inspecting Document Features

Probe a document before processing it.

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('form.pdf');

var_dump($doc->hasStructureTree());   // tagged PDF?
var_dump($doc->hasFormFields());      // AcroForm fields?
var_dump($doc->hasSignatures());      // digital signatures?

$doc->close();

PDF Creation

The Pdf class provides factory methods to build PDFs from Markdown, HTML, or plain text.

use PdfOxide\Pdf;

$pdf = Pdf::fromMarkdown("# Invoice\n\n**Total:** \$42.00\n");
$pdf->saveTo('invoice.pdf');           // write to a path
$pdf->close();

$pdf = Pdf::fromHtml('<h1>Report</h1><p>Quarterly figures.</p>');
$bytes = $pdf->save();                 // or get the raw bytes
file_put_contents('report.pdf', $bytes);
$pdf->close();

$pdf = Pdf::fromText("Plain text document.\n\nSecond paragraph.");
$pdf->saveTo('notes.pdf');
$pdf->close();

Next Steps