What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

PDF Oxide 시작하기 (PHP)

PDF Oxide는 텍스트 추출에서 가장 빠른 PHP PDF 라이브러리입니다 — 평균 0.8ms, 3,830개 PDF에서 100% 통과율. PDF를 추출하고, 변환하고, 생성하는 작업을 하나의 라이브러리로 처리하며, Python, Node, Go, C#, Ruby, Java 바인딩에 쓰이는 것과 동일한 Rust 코어 위에서 동작합니다.

설치

composer require oxide/pdf-oxide

Composer의 post-install 훅이 사용 중인 플랫폼(linux-x86_64, linux-aarch64, darwin-x86_64, darwin-arm64, windows-x64)에 맞는 사전 빌드된 네이티브 라이브러리를 vendor/oxide/pdf-oxide/lib/에 내려받습니다.

요구 사항: ext-ffi가 활성화된 PHP 8.2 이상(8.2, 8.3, 8.4, 8.5). php -m | grep -i ffi로 확인하세요. 일부 관리형 호스팅은 ext-ffi를 비활성화하는데, 그런 경우 php:8.3-cli 같은 Docker 이미지를 사용하세요.

PDF 열기

PdfDocument::open()으로 파일을 불러오고 메타데이터를 확인합니다.

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('research-paper.pdf');
echo $doc->pageCount(), " pages\n";

$version = $doc->version();   // ['major' => int, 'minor' => int]
printf("PDF version: %d.%d\n", $version['major'], $version['minor']);

$doc->close();   // or rely on __destruct()

텍스트 추출

단일 페이지

0부터 시작하는 인덱스로 원하는 페이지의 일반 텍스트를 추출합니다.

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('report.pdf');
echo $doc->extractText(0);
$doc->close();

전체 페이지

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('book.pdf');
for ($i = 0; $i < $doc->pageCount(); $i++) {
    echo "--- Page " . ($i + 1) . " ---\n";
    echo $doc->extractText($i), "\n";
}
$doc->close();

원샷 추출

extractTextOnce()는 파일을 열고, 0번 페이지를 추출한 뒤, 닫는 작업을 한 번의 호출로 처리하는 정적 헬퍼입니다.

use PdfOxide\PdfDocument;

echo PdfDocument::extractTextOnce('report.pdf');

자동 경로 선택 추출

extractTextAuto()는 네이티브 텍스트가 있으면 이를 반환하고, 없으면 복구 가능한 텍스트로 매끄럽게 폴백합니다 — 폴백 경로에서는 예외를 던지지 않습니다.

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('mixed.pdf');
echo $doc->extractTextAuto(0);
$doc->close();

Page API

pages()는 PdfPage 뷰의 배열을 반환하고, pagesIter()는 인덱스와 함께 페이지를 지연 평가로 하나씩 내어줍니다. 각 페이지는 추출 작업을 문서로 위임합니다.

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('paper.pdf');

foreach ($doc->pagesIter() as $index => $page) {
    echo "Page {$index}:\n";
    echo $page->text(), "\n";
}

// Or grab a single page directly:
$page = $doc->page(0);
echo $page->toMarkdown();

$doc->close();

PdfPage 메서드: index(), parent(), text(), textAuto(), toMarkdown(), toHtml().

Markdown 및 HTML 변환

단일 페이지나 문서 전체를 Markdown으로 변환하거나, 페이지를 HTML로 렌더링합니다.

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('paper.pdf');

echo $doc->toMarkdown(0);     // one page (defaults to page 0)
echo $doc->toMarkdownAll();   // entire document
echo $doc->toHtml(0);         // one page as HTML

$doc->close();

정적 MarkdownConverter는 문서 호출 지점에 페이지 인덱스를 들고 있지 않으면서도 동일한 변환을 제공합니다.

use PdfOxide\PdfDocument;
use PdfOxide\MarkdownConverter;

$doc = PdfDocument::open('paper.pdf');

echo MarkdownConverter::toMarkdown($doc, 0);
echo MarkdownConverter::toMarkdownAll($doc);
echo MarkdownConverter::toHtml($doc, 0);
echo MarkdownConverter::toPlainText($doc, 0);

$doc->close();

구조화 추출

extractStructured()는 페이지를 레이아웃을 인식하는 형태로, 즉 종류·텍스트·경계 상자·열 인덱스를 가진 영역들의 연관 배열로 반환합니다.

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('paper.pdf');
$structured = $doc->extractStructured(0);

printf("Page %d: %.0f x %.0f\n",
    $structured['page_index'],
    $structured['page_width'],
    $structured['page_height']);

foreach ($structured['regions'] as $region) {
    echo "[{$region['kind']}] {$region['text']}\n";
}

$doc->close();

바이트에서 열기

메모리에 올라간 문자열로부터 PDF를 엽니다 — S3, HTTP, 데이터베이스에서 가져올 때 유용합니다.

use PdfOxide\PdfDocument;

$bytes = file_get_contents('report.pdf');
$doc = PdfDocument::openBytes($bytes);
echo $doc->extractText(0);
$doc->close();

문서 특성 확인

문서를 처리하기 전에 그 특성을 살펴봅니다.

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('form.pdf');

var_dump($doc->hasStructureTree());   // tagged PDF?
var_dump($doc->hasFormFields());      // AcroForm fields?
var_dump($doc->hasSignatures());      // digital signatures?

$doc->close();

PDF 생성

Pdf 클래스는 Markdown, HTML, 일반 텍스트로부터 PDF를 만드는 팩토리 메서드를 제공합니다.

use PdfOxide\Pdf;

$pdf = Pdf::fromMarkdown("# Invoice\n\n**Total:** \$42.00\n");
$pdf->saveTo('invoice.pdf');           // write to a path
$pdf->close();

$pdf = Pdf::fromHtml('<h1>Report</h1><p>Quarterly figures.</p>');
$bytes = $pdf->save();                 // or get the raw bytes
file_put_contents('report.pdf', $bytes);
$pdf->close();

$pdf = Pdf::fromText("Plain text document.\n\nSecond paragraph.");
$pdf->saveTo('notes.pdf');
$pdf->close();

다음 단계

Python 시작하기 – Python에서 PDF Oxide 사용하기
Rust 시작하기 – Rust에서 PDF Oxide 사용하기
텍스트 추출 – 자세한 추출 옵션과 활용법
PDF 생성 – 메타데이터와 스타일을 곁들인 고급 생성
편집 – 기존 PDF 수정, 주석, 폼 필드