What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

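Getting Started with PDF Oxide (Rust)

PDF Oxide is the fastest Rust PDF crate with built-in text extraction. It averages 0.8ms per text extraction and achieved a 100% success rate on 3,830 PDFs. Extraction, creation, and editing are all handled by a single library.

Installation

Add pdf_oxide to your Cargo.toml.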

[dependencies]
pdf_oxide = "0.3"

Feature Flags

Enable only the features you need.

# Default -- text extraction, creation, editing
pdf_oxide = "0.3"

# Render pages to images
pdf_oxide = { version = "0.3", features = ["rendering"] }

# Barcode generation
pdf_oxide = { version = "0.3", features = ["barcodes"] }

# Digital signatures
pdf_oxide = { version = "0.3", features = ["signatures"] }

# Office document conversion (DOCX, XLSX, PPTX)
pdf_oxide = { version = "0.3", features = ["office"] }

# Everything
pdf_oxide = { version = "0.3", features = ["full"] }

Opening a PDF

Load a file with PdfDocument::open() and inspect its metadata.

use pdf_oxide::PdfDocument;

let doc = PdfDocument::open("research-paper.pdf")?;
println!("Pages: {}", doc.page_count());
println!("PDF version: {}", doc.version());

Text Extraction

Plain Text

use pdf_oxide::PdfDocument;

let doc = PdfDocument::open("report.pdf")?;
let text = doc.extract_text(0)?;
println!("{text}");

Text Spans

extract_spans() returns a Vec<TextSpan>. Each span covers one run of identically styled text and carries its font information.

use pdf_oxide::PdfDocument;

let doc = PdfDocument::open("paper.pdf")?;
let spans = doc.extract_spans(0)?;

for span in &spans {
    println!("'{}' at ({:.1}, {:.1}) font={} size={:.1}",
        span.text, span.x, span.y, span.font_name, span.font_size);
}

TextSpan fields:

Field	Type	Description
`text`	`String`	Text content
`x`	`f64`	Horizontal coordinate, in points
`y`	`f64`	Vertical coordinate, in points
`font_name`	`String`	PostScript font name
`font_size`	`f64`	Font size, in points
`bbox`	`Rect`	Bounding box
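
Because every span carries its own coordinates, a common next step is to group spans into visual lines. The following is a minimal sketch under stated assumptions, not the crate's own logic: `Span` is a local stand-in that mirrors only the `text`, `x`, and `y` fields from the table above, `group_into_lines` and its `y_tolerance` parameter are hypothetical names, and the y-axis is assumed to grow upward, as in default PDF user space.

```rust
/// Local stand-in mirroring a few of the documented `TextSpan` fields.
/// This is NOT the crate's type; it only keeps the sketch self-contained.
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub text: String,
    pub x: f64,
    pub y: f64,
}

/// Groups spans whose `y` lies within `y_tolerance` points of the first span
/// of a line, orders lines top to bottom (assuming y grows upward), and joins
/// the spans of each line left to right with single spaces.
pub fn group_into_lines(spans: &[Span], y_tolerance: f64) -> Vec<String> {
    let mut sorted: Vec<&Span> = spans.iter().collect();
    // Highest y first (top of the page), then left to right.
    sorted.sort_by(|a, b| b.y.total_cmp(&a.y).then(a.x.total_cmp(&b.x)));

    let mut lines: Vec<(f64, Vec<&Span>)> = Vec::new();
    for span in sorted {
        let starts_new_line = match lines.last() {
            Some((line_y, _)) => (*line_y - span.y).abs() > y_tolerance,
            None => true,
        };
        if starts_new_line {
            lines.push((span.y, vec![span]));
        } else if let Some((_, members)) = lines.last_mut() {
            members.push(span);
        }
    }

    lines
        .into_iter()
        .map(|(_, mut members)| {
            // Spans on one line may differ slightly in y, so re-sort by x.
            members.sort_by(|a, b| a.x.total_cmp(&b.x));
            members
                .iter()
                .map(|s| s.text.as_str())
                .collect::<Vec<_>>()
                .join(" ")
        })
        .collect()
}
```

A tolerance of a point or two is a reasonable starting value for body text; superscripts and multi-column layouts would need a more careful heuristic.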

Character-Level Extraction

extract_chars() returns a Vec<TextChar> with the exact position of each character.

use pdf_oxide::PdfDocument;

let doc = PdfDocument::open("paper.pdf")?;
let chars = doc.extract_chars(0)?;

for ch in chars.iter().take(10) {
    println!("'{}' at ({:.1}, {:.1}) size={:.1} font={}",
        ch.char, ch.x, ch.y, ch.font_size, ch.font_name);
}

TextChar fields:

Field	Type	Description
`char`	`char`	Unicode character
`x`	`f64`	Horizontal coordinate, in points
`y`	`f64`	Vertical coordinate, in points
`font_size`	`f64`	Font size, in points
`font_name`	`String`	PostScript font name
`bbox`	`Rect`	Bounding box
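
Per-character font data makes it straightforward to find the body font of a page, which can then serve as a baseline for spotting headings or footnotes. The sketch below is illustrative only: `Glyph` is a local stand-in for the documented `TextChar` fields (its `ch` field corresponds to `char` in the table), and `dominant_font` is a hypothetical helper, not part of the crate.

```rust
use std::collections::HashMap;

/// Local stand-in mirroring a few of the documented `TextChar` fields.
/// This is NOT the crate's type.
#[derive(Debug, Clone)]
pub struct Glyph {
    pub ch: char,
    pub font_name: String,
    pub font_size: f64,
}

/// Returns the (font name, font size) pair used by the most non-whitespace
/// characters, or `None` if there are none. Sizes are bucketed to tenths of
/// a point so that they can be used as hash-map keys.
pub fn dominant_font(glyphs: &[Glyph]) -> Option<(String, f64)> {
    let mut counts: HashMap<(&str, i64), usize> = HashMap::new();
    for g in glyphs.iter().filter(|g| !g.ch.is_whitespace()) {
        let key = (g.font_name.as_str(), (g.font_size * 10.0).round() as i64);
        *counts.entry(key).or_insert(0) += 1;
    }

    counts
        .into_iter()
        // Highest count wins; ties resolve to the smallest key so the
        // result does not depend on hash-map iteration order.
        .max_by(|a, b| a.1.cmp(&b.1).then_with(|| b.0.cmp(&a.0)))
        .map(|((name, tenths), _)| (name.to_string(), tenths as f64 / 10.0))
}
```

Characters set noticeably larger than the dominant size are candidates for headings, although that threshold is a judgment call per document.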

Markdown Conversion

Convert a page to Markdown with the options you specify.

use pdf_oxide::PdfDocument;
use pdf_oxide::converters::ConversionOptions;

let doc = PdfDocument::open("paper.pdf")?;
let options = ConversionOptions { detect_headings: true, ..Default::default() };
let md = doc.to_markdown(0, &options)?;
println!("{md}");
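
For LLM or RAG pipelines, Markdown output is often split into sections before indexing. The sketch below shows one way to do that on any Markdown string. It assumes the converter emits ATX-style headings (`#` through `######`), which the source does not state, and `split_by_headings` is a hypothetical helper rather than a crate function. It does not special-case fenced code blocks.

```rust
/// Returns the heading text if `line` is an ATX heading ("# Title").
fn heading_title(line: &str) -> Option<&str> {
    let hashes = line.chars().take_while(|c| *c == '#').count();
    if (1..=6).contains(&hashes) {
        // '#' is one byte, so the char count is also a valid byte offset.
        line[hashes..].strip_prefix(' ').map(str::trim)
    } else {
        None
    }
}

/// Splits Markdown into (heading, body) sections. Text that appears before
/// the first heading is returned under an empty heading.
pub fn split_by_headings(md: &str) -> Vec<(String, String)> {
    let mut sections: Vec<(String, String)> = Vec::new();
    for line in md.lines() {
        if let Some(title) = heading_title(line) {
            sections.push((title.to_string(), String::new()));
        } else if let Some((_, body)) = sections.last_mut() {
            // Skip blank lines at the very start of a section body.
            if !(body.is_empty() && line.trim().is_empty()) {
                body.push_str(line);
                body.push('\n');
            }
        } else if !line.trim().is_empty() {
            sections.push((String::new(), format!("{line}\n")));
        }
    }
    // Drop trailing whitespace from every body.
    for (_, body) in &mut sections {
        let trimmed_len = body.trim_end().len();
        body.truncate(trimmed_len);
    }
    sections
}
```

Each returned pair can be embedded or indexed as one chunk, with the heading kept as metadata.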

HTML Conversion

use pdf_oxide::PdfDocument;

let doc = PdfDocument::open("paper.pdf")?;
let html = doc.to_html(0)?;
println!("{html}");

Image Extraction

extract_images() returns the metadata and raw data for every image on a page, including images in the content stream and in nested Form XObjects.

use pdf_oxide::PdfDocument;

let doc = PdfDocument::open("brochure.pdf")?;
let images = doc.extract_images(0)?;

for (i, img) in images.iter().enumerate() {
    println!("Image {i}: {}x{} {} {}bpc ({} bytes)",
        img.width, img.height, img.color_space,
        img.bits_per_component, img.data.len());
}
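
The width, height, color space, and bits-per-component values printed above are enough to estimate how large an uncompressed sample buffer should be, which is a handy sanity check. The sketch below is an assumption-laden illustration: `ColorSpace` is a local stand-in (the type of the crate's `color_space` field is not shown in this guide), `expected_raw_len` is a hypothetical helper, and whether `img.data` holds decoded samples or still-encoded bytes (for example JPEG) is not stated here. The row padding follows the PDF convention that each row of samples starts on a byte boundary.

```rust
/// Local stand-in for a device color space. NOT the crate's type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColorSpace {
    Gray,
    Rgb,
    Cmyk,
}

impl ColorSpace {
    /// Number of color components per pixel.
    pub fn components(self) -> usize {
        match self {
            ColorSpace::Gray => 1,
            ColorSpace::Rgb => 3,
            ColorSpace::Cmyk => 4,
        }
    }
}

/// Expected size in bytes of uncompressed image samples, with every row
/// padded up to a whole number of bytes.
pub fn expected_raw_len(
    width: usize,
    height: usize,
    space: ColorSpace,
    bits_per_component: usize,
) -> usize {
    let bits_per_row = width * space.components() * bits_per_component;
    let bytes_per_row = (bits_per_row + 7) / 8; // round up to full bytes
    bytes_per_row * height
}
```

If the reported byte count is far smaller than this estimate, the data is most likely still compressed.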

Use extract_images_to_files() to save the images directly to disk.

let doc = PdfDocument::open("brochure.pdf")?;
let paths = doc.extract_images_to_files(0, "output_dir")?;
for path in &paths {
    println!("Saved: {}", path.display());
}

Creating PDFs

Factory Methods

The Pdf type provides high-level factory methods.

use pdf_oxide::api::Pdf;

let mut pdf = Pdf::from_markdown("# Hello World\n\nThis is a PDF.")?;
pdf.save("output.pdf")?;

let mut pdf = Pdf::from_html("<h1>Invoice</h1><p>Amount: $42</p>")?;
pdf.save("invoice.pdf")?;

let mut pdf = Pdf::from_text("Plain text content.")?;
pdf.save("notes.pdf")?;

let mut pdf = Pdf::from_image("scan.jpg")?;
pdf.save("scan.pdf")?;

PdfBuilder Fluent API

Use this when you need control over metadata, page size, and margins.

use pdf_oxide::api::PdfBuilder;
use pdf_oxide::writer::PageSize;

let mut pdf = PdfBuilder::new()
    .title("Annual Report")
    .author("Acme Corp")
    .page_size(PageSize::A4)
    .margins(72.0, 72.0, 72.0, 72.0)
    .font_size(11.0)
    .from_markdown("# Annual Report\n\n...")?;

pdf.save("annual-report.pdf")?;

DocumentBuilder Low-Level API

Place text, shapes, and images at exact coordinates.

use pdf_oxide::writer::DocumentBuilder;

let mut builder = DocumentBuilder::new();
builder.add_page(612.0, 792.0)
    .text("Hello, world!", 72.0, 720.0, 12.0)
    .rect(100.0, 600.0, 200.0, 50.0)
    .image_at("logo.png", 400.0, 700.0, 100.0, 50.0)?;

builder.save("custom.pdf")?;
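
The numbers in the example above are PDF points: 72 points make one inch, so 612 × 792 is US Letter (8.5 × 11 in). The helpers below are a small sketch for computing such coordinates. The function names are hypothetical, and `y_from_top` assumes the builder uses the default PDF origin at the bottom-left corner, which this guide does not state explicitly.

```rust
/// PDF user space is measured in points; one inch is 72 points.
pub const POINTS_PER_INCH: f64 = 72.0;

/// Converts inches to points.
pub fn inches_to_points(inches: f64) -> f64 {
    inches * POINTS_PER_INCH
}

/// Converts millimetres to points (25.4 mm per inch).
pub fn mm_to_points(mm: f64) -> f64 {
    mm / 25.4 * POINTS_PER_INCH
}

/// Converts a distance measured down from the top edge of the page into a
/// y coordinate measured up from the bottom edge (assumed origin).
pub fn y_from_top(page_height: f64, distance_from_top: f64) -> f64 {
    page_height - distance_from_top
}
```

Under that assumption, the `720.0` used for the text above corresponds to one inch below the top edge of a 792-point page.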

Search

Search for text across the whole document, or fine-tune the search with options.

use pdf_oxide::api::Pdf;

let pdf = Pdf::open("manual.pdf")?;

// Simple search across all pages
let results = pdf.search("configuration")?;
for r in &results {
    println!("Page {}: '{}' at ({:.0}, {:.0})", r.page, r.text, r.x, r.y);
}

use pdf_oxide::api::{Pdf, SearchOptions};

let pdf = Pdf::open("manual.pdf")?;

let opts = SearchOptions {
    case_sensitive: false,
    whole_word: true,
    max_results: Some(50),
    ..Default::default()
};
let results = pdf.search_with_options("configuration", &opts)?;
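
To make the three options concrete, here is a plain-string sketch of what case-insensitive, whole-word, result-capped matching typically means. It is not the crate's implementation: `MatchOptions` is a local stand-in that borrows the field names from the example above, `find_matches` is hypothetical, and case folding is ASCII-only to keep byte offsets stable.

```rust
/// Local stand-in with the same field names as the options shown above.
/// This is NOT the crate's `SearchOptions`.
pub struct MatchOptions {
    pub case_sensitive: bool,
    pub whole_word: bool,
    pub max_results: Option<usize>,
}

/// Returns the byte offsets of matches of `needle` in `haystack`.
pub fn find_matches(haystack: &str, needle: &str, opts: &MatchOptions) -> Vec<usize> {
    if needle.is_empty() {
        return Vec::new();
    }
    // ASCII lowercasing never changes byte lengths, so offsets stay valid.
    let (hay, pat) = if opts.case_sensitive {
        (haystack.to_string(), needle.to_string())
    } else {
        (haystack.to_ascii_lowercase(), needle.to_ascii_lowercase())
    };
    let limit = opts.max_results.unwrap_or(usize::MAX);
    let is_word = |c: char| c.is_alphanumeric() || c == '_';

    let mut hits = Vec::new();
    for (start, matched) in hay.match_indices(pat.as_str()) {
        if hits.len() >= limit {
            break;
        }
        let end = start + matched.len();
        // A whole-word match has no word character directly before or after.
        let before_ok = hay[..start].chars().next_back().map_or(true, |c| !is_word(c));
        let after_ok = hay[end..].chars().next().map_or(true, |c| !is_word(c));
        if !opts.whole_word || (before_ok && after_ok) {
            hits.push(start);
        }
    }
    hits
}
```

With `whole_word` enabled, "configuration" matches as a standalone word but not inside "reconfiguration".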

Editing

DocumentEditor

Open an existing PDF to make structural edits such as rotating pages or manipulating form fields.

use pdf_oxide::api::Pdf;

let mut pdf = Pdf::open_editor("form-template.pdf")?;

// Rotate a page
pdf.rotate_page(0, 90)?;

// Add form fields
pdf.add_text_field("name", [100.0, 700.0, 300.0, 720.0])?;
pdf.add_checkbox("agree", [100.0, 650.0, 120.0, 670.0], false)?;

pdf.save("modified.pdf")?;

DOM-Style Page Editing

Navigate the elements of a page and modify text in place.

use pdf_oxide::api::Pdf;

let mut pdf = Pdf::open("document.pdf")?;
let mut page = pdf.page(0)?;

// Find text elements
for t in page.find_text_containing("Draft") {
    println!("Found '{}' at {:?}", t.text(), t.bbox());
}

// Replace text
let matches = page.find_text_containing("Draft");
for t in &matches {
    page.set_text(t.id(), "Final")?;
}

pdf.save_page(page)?;
pdf.save("updated.pdf")?;

Error Handling

Every fallible operation returns Result<T, PdfError>. The PdfError enum covers the main failure modes.

use pdf_oxide::PdfDocument;
use pdf_oxide::PdfError;

fn extract(path: &str) -> Result<String, PdfError> {
    let doc = PdfDocument::open(path)?;
    doc.extract_text(0)
}

match extract("file.pdf") {
    Ok(text) => println!("{text}"),
    Err(PdfError::Io(e)) => eprintln!("I/O error: {e}"),
    Err(PdfError::Parse(msg)) => eprintln!("Parse error: {msg}"),
    Err(PdfError::Password) => eprintln!("Password required"),
    Err(PdfError::PageOutOfRange { index, count }) => {
        eprintln!("Page {index} does not exist ({count} pages total)");
    }
    Err(e) => eprintln!("Error: {e}"),
}

PdfError variants:

Variant	Description
`Io`	File system or I/O failure
`Parse`	Invalid PDF structure
`Password`	Encrypted document opened without a password
`PageOutOfRange`	Requested page number exceeds the total page count
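
The snippets in this guide use `?`, which only compiles inside a function that returns a `Result` and which relies on `From` conversions between error types. The sketch below shows that mechanism with a local stand-in enum. `DemoPdfError`, `header_version`, and `version_of_file` are all hypothetical and mirror only two of the variants listed above; the actual definition of `PdfError` may differ. The only PDF fact used is that files begin with a `%PDF-x.y` header.

```rust
use std::fmt;
use std::io;

/// Local stand-in for two of the documented variants. NOT the crate's enum.
#[derive(Debug)]
pub enum DemoPdfError {
    Io(io::Error),
    Parse(String),
}

impl fmt::Display for DemoPdfError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DemoPdfError::Io(e) => write!(f, "I/O error: {e}"),
            DemoPdfError::Parse(msg) => write!(f, "parse error: {msg}"),
        }
    }
}

impl std::error::Error for DemoPdfError {}

/// This conversion is what lets `?` turn an `io::Error` into `DemoPdfError`.
impl From<io::Error> for DemoPdfError {
    fn from(e: io::Error) -> Self {
        DemoPdfError::Io(e)
    }
}

/// Reads the version out of a `%PDF-x.y` header.
pub fn header_version(bytes: &[u8]) -> Result<String, DemoPdfError> {
    const MAGIC: &[u8] = b"%PDF-";
    if !bytes.starts_with(MAGIC) {
        return Err(DemoPdfError::Parse("missing %PDF- header".to_string()));
    }
    let rest = &bytes[MAGIC.len()..];
    // The version runs until the first byte that is not a digit or a dot.
    let end = rest
        .iter()
        .position(|b| !(b.is_ascii_digit() || *b == b'.'))
        .unwrap_or(rest.len());
    if end == 0 {
        return Err(DemoPdfError::Parse("missing version number".to_string()));
    }
    Ok(String::from_utf8_lossy(&rest[..end]).into_owned())
}

/// Reads a file and returns its header version, propagating both I/O and
/// parse failures through `?`.
pub fn version_of_file(path: &str) -> Result<String, DemoPdfError> {
    let bytes = std::fs::read(path)?; // io::Error becomes DemoPdfError::Io
    header_version(&bytes)
}
```

Callers can then match on the variant, as in the example above, or bubble the error further up with another `?`.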

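Next Steps

Getting Started with Python – Using PDF Oxide from Python
Text Extraction – Extraction options and recipes in detail
PDF Creation – Advanced creation with PdfBuilder, encryption, and metadata
Editing – Modifying existing PDFs, annotations, and form fields
API Reference – Full API documentation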