What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (Swift)

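PDF Oxide is the fastest PDF library with built-in text extraction: 0.8ms mean, with a 100% pass rate on 3,830 PDFs. The Swift binding, new in v0.3.69, wraps the Rust core through the C ABI. Handles are owned by classes (and released in deinit), C buffers are copied into Swift String/[UInt8] values, and error codes are thrown as PdfOxideError.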

Installation

The binding links against the default-feature cdylib. Build the native library first, then point SwiftPM at the header and the library:

# 1. build the native library (shipped binding feature set)
cargo build --release --lib --features ocr,rendering,signatures,barcodes,tsa-client,system-fonts

# 2. test + run the example (Package.swift reads PDF_OXIDE_INCLUDE_DIR / _LIB_DIR)
cd swift
export PDF_OXIDE_INCLUDE_DIR="$PWD/../include"
export PDF_OXIDE_LIB_DIR="$PWD/../target/release"
DYLD_LIBRARY_PATH="$PDF_OXIDE_LIB_DIR" swift test
DYLD_LIBRARY_PATH="$PDF_OXIDE_LIB_DIR" swift run basic_extraction

Quick Start

Build a PDF from Markdown, open it from the generated bytes, and extract its text. The full round trip runs without any external fixtures:

import PdfOxide

let pdf = try Pdf.fromMarkdown("# Hello pdf_oxide\n\nThis is a **Swift** binding.\n")
let doc = try Document.openFromBytes(try pdf.toBytes())

print("pages:   \(try doc.pageCount())")
print("version: \(try doc.version())")
print(try doc.extractText(0))

To open a file on disk, use Document.open(_:):

import PdfOxide

let doc = try Document.open("research-paper.pdf")
print("Pages:   \(try doc.pageCount())")
print("Version: \(try doc.version())")        // e.g. 1.7

Text Extraction

extractText(_:) returns the text of a single page; page indices start at 0. To read the whole document, loop over pageCount():

import PdfOxide

let doc = try Document.open("book.pdf")
for i in 0..<(try doc.pageCount()) {
    print("--- Page \(i + 1) ---")
    print(try doc.extractText(i))
}

toPlainText(_:) returns a flattened form with the layout stripped, and the *All() methods extract every page in one call:

let doc = try Document.open("report.pdf")
let plain = try doc.toPlainText(0)            // single page, no layout
let everything = try doc.toPlainTextAll()     // all pages concatenated

Words and Characters

extractWords(_:) returns [Word], where each word carries its bounding box and font metadata. extractChars(_:) returns [Char] with per-character position information:

import PdfOxide

let doc = try Document.open("paper.pdf")

let words = try doc.extractWords(0)
for word in words.prefix(10) {
    print("'\(word.text)' at (\(word.bbox.x), \(word.bbox.y)) "
        + "font=\(word.fontName) size=\(word.fontSize) bold=\(word.bold)")
}

let chars = try doc.extractChars(0)
for ch in chars.prefix(10) {
    let scalar = Unicode.Scalar(ch.character).map(String.init) ?? "?"
    print("'\(scalar)' size=\(ch.fontSize) font=\(ch.fontName)")
}

Word fields: text (String), bbox (Bbox), fontName (String), fontSize (Double), bold (Bool). Char fields: character (UInt32 code point), bbox, fontName, fontSize. Bbox exposes x, y, width, and height as Double.
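
Because each word carries its font size, one simple use of this metadata is to flag likely heading words. The following is a minimal sketch rather than part of the library: WordInfo is a hypothetical stand-in that mirrors two of the Word fields listed above, and the 1.3 ratio is an arbitrary assumption to tune per document.

```swift
/// Hypothetical stand-in mirroring the documented `text` and `fontSize`
/// fields of Word, so the helper can be tried without the native library.
struct WordInfo {
    let text: String
    let fontSize: Double
}

/// Returns the words whose font size is at least `ratio` times the median
/// font size of the input (the upper median when the count is even).
func headingCandidates(_ words: [WordInfo], ratio: Double = 1.3) -> [WordInfo] {
    guard !words.isEmpty else { return [] }
    let sizes = words.map(\.fontSize).sorted()
    let median = sizes[sizes.count / 2]
    return words.filter { $0.fontSize >= median * ratio }
}
```

With the binding installed, the input could be built from the fields shown above, for example by mapping the result of doc.extractWords(0) to WordInfo(text: $0.text, fontSize: $0.fontSize).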

Line-level text is available through extractTextLines(_:), which returns [TextLine] (text, bbox, wordCount):

let lines = try doc.extractTextLines(0)
for line in lines {
    print("\(line.wordCount) words: \(line.text)")
}

Markdown and HTML Conversion

Convert a single page or the whole document to Markdown or HTML:

import PdfOxide

let doc = try Document.open("paper.pdf")

let md = try doc.toMarkdown(0)        // one page to Markdown
let mdAll = try doc.toMarkdownAll()   // whole document to Markdown
let html = try doc.toHtml(0)          // one page to HTML
let htmlAll = try doc.toHtmlAll()     // whole document to HTML

print(mdAll)
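
For LLM or RAG pipelines, the Markdown output is often split into sections before indexing. The helper below is a rough sketch of one way to do that in plain Swift. It assumes the converter emits ATX-style headings (lines starting with one to six # characters followed by a space), which is not confirmed here. The names isATXHeading and chunksByHeading are illustrative and not part of PdfOxide.

```swift
/// True when the line looks like an ATX heading: 1 to 6 '#' characters
/// followed by a space or the end of the line.
func isATXHeading(_ line: Substring) -> Bool {
    let hashes = line.prefix(while: { $0 == "#" }).count
    guard (1...6).contains(hashes) else { return false }
    let rest = line.dropFirst(hashes)
    return rest.isEmpty || rest.first == " "
}

/// Splits Markdown into chunks, starting a new chunk at every heading line.
/// Chunks that contain only whitespace are dropped.
func chunksByHeading(_ markdown: String) -> [String] {
    var chunks: [String] = []
    var current: [Substring] = []

    // Emit the lines collected so far as one chunk, then start a fresh one.
    func flush() {
        let text = current.joined(separator: "\n")
        if text.contains(where: { !$0.isWhitespace }) { chunks.append(text) }
        current.removeAll()
    }

    for line in markdown.split(separator: "\n", omittingEmptySubsequences: false) {
        if isATXHeading(line) { flush() }
        current.append(line)
    }
    flush()
    return chunks
}
```

The result of doc.toMarkdownAll() can be passed straight in. Note that this simple line scan would also split on # lines that appear inside fenced code blocks, so treat it as a starting point.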

Search

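search(_:_:_:) searches a single page, and searchAll(_:_:) searches the whole document. Both take the search term and a caseSensitive flag (search(_:_:_:) also takes the page index as its first argument) and return [SearchResult] (text, page, bbox):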

import PdfOxide

let doc = try Document.open("manual.pdf")

// Search a single page (page 0, case-insensitive)
let hits = try doc.search(0, "configuration", false)
for hit in hits {
    print("page \(hit.page): '\(hit.text)' at (\(hit.bbox.x), \(hit.bbox.y))")
}

// Search the whole document
let allHits = try doc.searchAll("configuration", false)
print("\(allHits.count) total matches")
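
Since every SearchResult carries its page, a per-page tally of matches is a short step from the calls above. This sketch is generic over the page type because the exact integer type of the page field is not stated here. countsByPage is an illustrative name, not a library function.

```swift
/// Counts how many times each page value occurs and returns the tallies
/// sorted by page in ascending order.
func countsByPage<Page: Hashable & Comparable>(_ pages: [Page]) -> [(page: Page, count: Int)] {
    var counts: [Page: Int] = [:]
    for page in pages {
        counts[page, default: 0] += 1
    }
    return counts
        .sorted { $0.key < $1.key }
        .map { (page: $0.key, count: $0.value) }
}
```

It would be fed with countsByPage(allHits.map(\.page)), assuming page is an integer type or any other Hashable and Comparable value.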

Creating PDFs

The Pdf type provides factory methods that build a document from a source format. Save it to disk with save(_:), or get the raw bytes with toBytes():

import PdfOxide

try Pdf.fromMarkdown("# Hello World\n\nThis is a PDF.").save("output.pdf")
try Pdf.fromHtml("<h1>Invoice</h1><p>Amount: $42</p>").save("invoice.pdf")
try Pdf.fromText("Plain text content.").save("notes.pdf")

let bytes = try Pdf.fromMarkdown("# In-memory\n\nbody\n").toBytes()
print("produced \(bytes.count) bytes")

Error Handling

Every call that can fail throws PdfOxideError, which carries the name of the failed operation and the underlying C-ABI error code:

import PdfOxide

do {
    let doc = try Document.open("document.pdf")
    print(try doc.extractText(0))
} catch let error as PdfOxideError {
    print("PDF error: \(error)")   // e.g. "PdfOxideError: open failed (error code 1)"
}
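
When processing many files, it is usually preferable for one bad PDF not to abort the whole run. The following sketch shows one possible pattern for that: it takes the extraction step as a closure, so it stays independent of the library and can be tried with any throwing function. extractEach is a hypothetical helper name.

```swift
/// Runs `extract` on every path, collecting results and errors separately
/// so that a failure on one file does not stop the others.
func extractEach(
    _ paths: [String],
    using extract: (String) throws -> String
) -> (texts: [String: String], failures: [String: Error]) {
    var texts: [String: String] = [:]
    var failures: [String: Error] = [:]
    for path in paths {
        do {
            texts[path] = try extract(path)
        } catch {
            // Keep the error for later reporting and move on to the next file.
            failures[path] = error
        }
    }
    return (texts, failures)
}
```

With the binding, the closure could be { try Document.open($0).extractText(0) }, and the entries in failures would then be the PdfOxideError values described above.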

Next Steps

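Getting Started with Rust – Using PDF Oxide from Rust
Getting Started with Python – Using PDF Oxide from Python
Text Extraction – Detailed extraction options and recipes
PDF Creation – Advanced creation, including metadata and encryption
Editing – Modifying existing PDFs, annotations, and form fields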