What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting started with PDF Oxide (Zig)

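PDF Oxide is the fastest PDF library with built-in text extraction, recording a mean of 0.8ms across 3,830 PDFs with a 100% pass rate. The Zig binding is idiomatic Zig layered over the pdf_oxide C ABI through @cImport, giving first-class C interop with no shim. Handles are structs with a deinit method, and returned C strings and buffers are copied into the allocator the caller supplies.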
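The copy-into-the-caller's-allocator convention can be sketched with the standard library alone. This is a minimal illustration, not the binding's actual source: dupeCString is a hypothetical helper name, and the string literal stands in for a pointer that would normally come back from the C ABI.

```zig
const std = @import("std");

// Hypothetical helper (not part of pdf_oxide): copy a NUL-terminated C string
// into memory owned by `allocator`. The caller is responsible for freeing it.
fn dupeCString(allocator: std.mem.Allocator, c_str: [*:0]const u8) ![]u8 {
    // std.mem.span measures up to the NUL and returns a borrowed view, no copy yet.
    const borrowed = std.mem.span(c_str);
    // dupe allocates a new slice and copies the bytes into it.
    return allocator.dupe(u8, borrowed);
}

pub fn main() !void {
    const a = std.heap.page_allocator;

    // Stand-in for a pointer returned by a C function.
    const from_c: [*:0]const u8 = "hello from C";

    const owned = try dupeCString(a, from_c);
    defer a.free(owned);

    std.debug.print("{s}\n", .{owned});
}
```

Once the copy exists, the C-side buffer can be released straight away, which is why the Zig caller only ever deals with slices it can free through its own allocator.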

The binding is pinned to Zig 0.15.1. Zig is still pre-1.0, so the build system and the C-import API change from release to release; CI uses the same version.

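Installation

The binding links against the cdylib built with the default feature set, not the Python wheel. Build the native library first, then point Zig at the header and the cdylib:

# 1. Build the native library (the feature set the binding covers)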
cargo build --release --lib \
  --features ocr,rendering,signatures,barcodes,tsa-client,system-fonts

# 2. Run the tests and the example
cd zig
LD_LIBRARY_PATH="$PWD/../target/release" \
  zig build test \
    -DPDF_OXIDE_INCLUDE_DIR="$PWD/../include" \
    -DPDF_OXIDE_LIB_DIR="$PWD/../target/release"

LD_LIBRARY_PATH="$PWD/../target/release" \
  zig build example \
    -DPDF_OXIDE_INCLUDE_DIR="$PWD/../include" \
    -DPDF_OXIDE_LIB_DIR="$PWD/../target/release"

In your own code, import the module and it is ready to use:

const pdf_oxide = @import("pdf_oxide");

Opening a PDF

Open a file with Document.open (or Document.openFromBytes for data already in memory) and inspect its metadata. Every handle owns C resources, so pair it with defer doc.deinit().

const std = @import("std");
const pdf_oxide = @import("pdf_oxide");

pub fn main() !void {
    var doc = try pdf_oxide.Document.open("research-paper.pdf");
    defer doc.deinit();

    std.debug.print("pages:   {d}\n", .{try doc.pageCount()});
    const v = doc.version();
    std.debug.print("version: {d}.{d}\n", .{ v.major, v.minor });
    std.debug.print("encrypted: {}\n", .{doc.isEncrypted()});
}

Text extraction

extractText returns the text of a single page (zero-based index). The result is owned by the allocator you pass in, so free it when you are done.

const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("report.pdf");
defer doc.deinit();

const text = try doc.extractText(a, 0);
defer a.free(text);
std.debug.print("{s}\n", .{text});

The whole-document variant extracts every page in one call:

const all_text = try doc.toPlainTextAll(a);
defer a.free(all_text);
std.debug.print("{s}\n", .{all_text});

Markdown and HTML conversion

You can convert a single page or the whole document to Markdown or HTML. Each function returns an allocator-owned slice.

const md = try doc.toMarkdown(a, 0);
defer a.free(md);
std.debug.print("{s}\n", .{md});

const md_all = try doc.toMarkdownAll(a);
defer a.free(md_all);

const html = try doc.toHtml(a, 0);
defer a.free(html);
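Because every conversion returns its own allocator-owned slice, one option when making several calls in a row is to pass an arena allocator and release everything at once. The sketch below uses only the standard library; render is a hypothetical stand-in for calls such as toMarkdown or toHtml, and whether an arena suits a given workload is an assumption to check against your memory budget.

```zig
const std = @import("std");

// Hypothetical stand-in for a binding call that returns an allocator-owned slice.
fn render(allocator: std.mem.Allocator, label: []const u8) ![]u8 {
    return std.fmt.allocPrint(allocator, "<{s}>", .{label});
}

pub fn main() !void {
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    // A single deinit releases every slice allocated through the arena.
    defer arena.deinit();
    const a = arena.allocator();

    const md = try render(a, "markdown");
    const html = try render(a, "html");

    // No per-result free is needed; both slices live until arena.deinit().
    std.debug.print("{s} {s}\n", .{ md, html });
}
```

The trade-off is that nothing is released until the arena is torn down, so per-call free remains the better fit for long-running loops over large documents.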

Word-level extraction

extractWords returns a slice of Word structs carrying the text, bounding box, font, and bold flag. Free the whole slice with the matching freeWords helper, which releases each word's strings along with the backing slice.

const words = try doc.extractWords(a, 0);
defer pdf_oxide.Document.freeWords(a, words);

for (words) |w| {
    std.debug.print("'{s}' at ({d:.1}, {d:.1}) font={s} size={d:.1} bold={}\n", .{
        w.text, w.bbox.x, w.bbox.y, w.fontName, w.fontSize, w.bold,
    });
}

Word fields:

Field	Type	Description
`text`	`[]u8`	Word text (allocator-owned)
`bbox`	`Bbox`	`{ x, y, width, height }` in points
`fontName`	`[]u8`	PostScript font name
`fontSize`	`f32`	Font size in points
`bold`	`bool`	Whether the run is bold

The same pattern gives you characters and lines:

const chars = try doc.extractChars(a, 0);
defer pdf_oxide.Document.freeChars(a, chars);

const lines = try doc.extractTextLines(a, 0);
defer pdf_oxide.Document.freeTextLines(a, lines);
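As an illustration of what the Word fields make possible, the sketch below counts words that look like heading text (bold and at or above a size threshold). The Bbox and Word types here are local stand-ins that mirror the documented fields so the example compiles on its own; the f32 coordinates and the countHeadingLikeWords helper are assumptions, not part of the binding.

```zig
const std = @import("std");

// Local stand-ins mirroring the documented Word fields. Real code would use the
// binding's own types; []const u8 is used here so string literals can be stored.
const Bbox = struct { x: f32, y: f32, width: f32, height: f32 };
const Word = struct {
    text: []const u8,
    bbox: Bbox,
    fontName: []const u8,
    fontSize: f32,
    bold: bool,
};

// Hypothetical helper: count words that are bold and at least `min_size` points.
fn countHeadingLikeWords(words: []const Word, min_size: f32) usize {
    var count: usize = 0;
    for (words) |w| {
        if (w.bold and w.fontSize >= min_size) count += 1;
    }
    return count;
}

pub fn main() void {
    // Hand-written sample data standing in for the result of extractWords.
    const words = [_]Word{
        .{ .text = "Summary", .bbox = .{ .x = 72, .y = 700, .width = 80, .height = 16 }, .fontName = "Helvetica-Bold", .fontSize = 16, .bold = true },
        .{ .text = "details", .bbox = .{ .x = 72, .y = 680, .width = 40, .height = 10 }, .fontName = "Helvetica", .fontSize = 10, .bold = false },
    };
    std.debug.print("heading-like words: {d}\n", .{countHeadingLikeWords(&words, 14)});
}
```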

Search

search looks within a single page, while searchAll scans every page. Both take a NUL-terminated query and a case_sensitive flag, and return a slice of SearchResult.

const hits = try doc.searchAll(a, "configuration", false);
defer pdf_oxide.Document.freeSearchResults(a, hits);

for (hits) |hit| {
    std.debug.print("page {d}: '{s}' at ({d:.0}, {d:.0})\n", .{
        hit.page, hit.text, hit.bbox.x, hit.bbox.y,
    });
}

To limit the search to a single page, call search with a page index:

const page_hits = try doc.search(a, 0, "Alpha", false);
defer pdf_oxide.Document.freeSearchResults(a, page_hits);

Creating PDFs

The Pdf type builds documents from Markdown, HTML, or plain text. toBytes serializes to memory, and save writes to disk.

const a = std.heap.page_allocator;

var pdf = try pdf_oxide.Pdf.fromMarkdown("# Hello\n\nThis is a **Zig** PDF.\n");
defer pdf.deinit();

// Serialize to memory...
const bytes = try pdf.toBytes(a);
defer a.free(bytes);

// ...or write straight to disk.
try pdf.save("output.pdf");

You can also round-trip a freshly created PDF by feeding it straight back through the extractor:

var pdf = try pdf_oxide.Pdf.fromHtml("<h1>Invoice</h1><p>Amount: $42</p>");
defer pdf.deinit();

const bytes = try pdf.toBytes(a);
defer a.free(bytes);

var doc = try pdf_oxide.Document.openFromBytes(bytes);
defer doc.deinit();

const text = try doc.extractText(a, 0);
defer a.free(text);
std.debug.print("{s}\n", .{text});

Error handling

Fallible calls return Error!T, where Error is error{ PdfOxide, OutOfMemory }. Zig error values cannot carry a payload, so the underlying C-ABI code is exposed through lastErrorCode(). Read it immediately after catching error.PdfOxide.

const text = doc.extractText(a, 99) catch |err| switch (err) {
    error.PdfOxide => {
        std.debug.print("pdf_oxide error code: {d}\n", .{pdf_oxide.lastErrorCode()});
        return;
    },
    error.OutOfMemory => return err,
};
defer a.free(text);

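Next steps

Getting started with Rust: the native core that PDF Oxide is built on
Getting started with Python: using PDF Oxide from Python
Text extraction: detailed extraction options and recipes
Creating PDFs: advanced creation, including metadata and encryption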