What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

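PDF Oxide vs lopdf

lopdf is a low-level Rust crate for manipulating PDF objects directly. PDF Oxide is a high-level library that provides text extraction, creation, and editing out of the box. The two libraries target fundamentally different use cases.

Key differences

Abstraction level. lopdf exposes raw PDF objects: dictionaries, streams, and cross-reference tables. It has no text extraction, no font decoding, and no image export. PDF Oxide provides purpose-built methods: extract_text(), extract_images(), to_markdown().

Reliability. lopdf fails to parse 20% of the 3,830-PDF test corpus. Of the PDFs it does parse, 57% produce empty output because lopdf has no text extraction: you get objects, but not text. PDF Oxide passes 100%.

Speed on parseable PDFs. lopdf is faster at raw object parsing: 0.3ms mean versus 0.8ms for PDF Oxide. But lopdf does none of the text extraction work; font decoding, CMap resolution, spacing analysis, and reading order are left for you to implement.

Quick comparison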

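Feature	PDF Oxide	lopdf
API level	High-level	Low-level
Text extraction	Built-in (production-grade)	None
Pass rate (3,830 PDFs)	100%	80.2%
Mean parse time	0.8ms	0.3ms
Image extraction	Built-in	Manual (raw streams)
Form fields	Read + write	Manual (raw dictionaries)
PDF creation	Supported (Markdown/HTML)	Supported (raw objects)
Markdown/HTML output	Supported	Not supported
Encryption	Read + write	Not supported
Rendering	Supported	Not supported
PDF/A validation	Supported	Not supported
License	MIT	MIT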

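What lopdf cannot do

lopdf gives you access to PDF objects, but text extraction requires interpreting those objects according to the PDF specification. Here is what you would need to implement yourself:

Content stream parsing: parse the PostScript-like operators (Tj, TJ, Tm, Tf, and so on)
Font resolution: look up /Font resources and resolve indirect references
CMap/ToUnicode decoding: convert glyph IDs to Unicode characters
Font-metric-based spacing: compute character widths from font descriptors
Text matrix transformations: apply the Tm, Td, and T* operators to position text
Reading order: determine the correct order in multi-column layouts
Ligature reconstruction: handle the fi, fl, and ffi ligatures
CJK encodings: decode Chinese, Japanese, and Korean text encodings

This takes thousands of lines of code and deep knowledge of ISO 32000. PDF Oxide handles all of it internally.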
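
As a small illustration of just one item on that list, ligature reconstruction, here is a minimal sketch using only the Rust standard library. The function name expand_ligatures is hypothetical and is not part of PDF Oxide or lopdf. It assumes the text has already been decoded to Unicode and covers only the five Latin ligatures in the range U+FB00 to U+FB04.

```rust
/// Expand Latin presentation-form ligatures into their plain letters.
/// Hypothetical helper for illustration; assumes `text` is already Unicode.
fn expand_ligatures(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        match ch {
            '\u{FB00}' => out.push_str("ff"),
            '\u{FB01}' => out.push_str("fi"),
            '\u{FB02}' => out.push_str("fl"),
            '\u{FB03}' => out.push_str("ffi"),
            '\u{FB04}' => out.push_str("ffl"),
            other => out.push(other), // everything else passes through unchanged
        }
    }
    out
}
```

Calling expand_ligatures on a string that contains the single-character "ffi" ligature returns the same string with the three separate letters, which keeps search and indexing working. Real PDFs often encode ligatures as font-specific glyphs without a Unicode mapping, so a full implementation would need more than this lookup.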

Side-by-side code

Text extraction

PDF Oxide:

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("report.pdf")?;
let text = doc.extract_text(0)?;
println!("{}", text);

lopdf:

use lopdf::Document;

let doc = Document::load("report.pdf")?;

// lopdf does not provide text extraction.
// You get access to PDF objects only:
let page_id = doc.page_iter().next().unwrap();
let page = doc.get_dictionary(page_id)?;
let contents = page.get(b"Contents")?; // dictionary keys are byte strings
let stream = doc.get_object(contents.as_reference()?)?;

// To get actual text, you must:
// 1. Parse content stream operators
// 2. Resolve font references from /Resources
// 3. Decode CMap/ToUnicode mappings
// 4. Apply text matrix transformations
// 5. Handle encoding differences
// ... (hundreds to thousands of lines of code)
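
To give a sense of step 1 alone, the following is a minimal sketch of pulling the operands of the Tj operator out of an already-decompressed content stream. It uses only the standard library, and the function name show_text_operands is hypothetical rather than an API of either crate. It is deliberately incomplete: it handles only literal strings in parentheses, skips hex strings and TJ arrays, and returns raw character codes, which still need the font's encoding or ToUnicode map before they become readable text.

```rust
/// Collect the operands of `Tj` (show text) from a decoded content stream.
/// Hypothetical helper: literal `( ... )` strings only, no hex strings or `TJ`.
fn show_text_operands(content: &[u8]) -> Vec<Vec<u8>> {
    let mut shown = Vec::new();
    let mut pending: Option<Vec<u8>> = None; // most recent string operand
    let mut i = 0;
    while i < content.len() {
        match content[i] {
            b'(' => {
                // Read a literal string, tracking nested parentheses.
                let mut depth = 1;
                let mut s = Vec::new();
                i += 1;
                while i < content.len() && depth > 0 {
                    match content[i] {
                        b'\\' if i + 1 < content.len() => {
                            // Backslash escape: take the next byte.
                            i += 1;
                            s.push(match content[i] {
                                b'n' => b'\n',
                                b'r' => b'\r',
                                b't' => b'\t',
                                other => other, // covers \( \) and \\
                            });
                        }
                        b'(' => { depth += 1; s.push(b'('); }
                        b')' => { depth -= 1; if depth > 0 { s.push(b')'); } }
                        other => s.push(other),
                    }
                    i += 1;
                }
                pending = Some(s);
            }
            c if c.is_ascii_whitespace() => i += 1,
            _ => {
                // Read one token: an operator, number, or name.
                let start = i;
                while i < content.len()
                    && !content[i].is_ascii_whitespace()
                    && content[i] != b'('
                {
                    i += 1;
                }
                if &content[start..i] == &b"Tj"[..] {
                    shown.extend(pending.take()); // string directly before Tj
                } else {
                    pending = None; // the string belonged to something else
                }
            }
        }
    }
    shown
}
```

Run on the stream used in the PDF creation example further down, BT /F1 12 Tf 72 720 Td (Hello World) Tj ET, this returns a single entry holding the bytes of Hello World. Everything else in the numbered list (font lookup, CMap decoding, positioning) still has to happen before those bytes are trustworthy text.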

PDF creation

PDF Oxide:

use pdf_oxide::api::Pdf;

let pdf = Pdf::from_markdown("# Report\n\n| Q1 | Q2 |\n|---|---|\n| $1M | $2M |")?;
pdf.save("report.pdf")?;

lopdf:

use lopdf::{Document, Object, Stream, dictionary};

let mut doc = Document::with_version("1.5");

// Create font dictionary
let font_id = doc.add_object(dictionary! {
    "Type" => "Font",
    "Subtype" => "Type1",
    "BaseFont" => "Helvetica",
});

// Create resources
let resources_id = doc.add_object(dictionary! {
    "Font" => dictionary! { "F1" => font_id },
});

// Create content stream (raw PDF content operators)
let content = Stream::new(
    dictionary! {},
    b"BT /F1 12 Tf 72 720 Td (Hello World) Tj ET".to_vec(),
);
let content_id = doc.add_object(content);

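// Reserve the page tree's ID first so the page can point back to it
let pages_id = doc.new_object_id();

// Create page
let page_id = doc.add_object(dictionary! {
    "Type" => "Page",
    "Parent" => pages_id,
    "MediaBox" => vec![0.into(), 0.into(), 612.into(), 792.into()],
    "Contents" => content_id,
    "Resources" => resources_id,
});

// Wire up page tree
doc.objects.insert(pages_id, Object::Dictionary(dictionary! {
    "Type" => "Pages",
    "Kids" => vec![page_id.into()],
    "Count" => 1,
}));

// Register the catalog as the document root in the trailer
let catalog_id = doc.add_object(dictionary! {
    "Type" => "Catalog",
    "Pages" => pages_id,
});
doc.trailer.set("Root", catalog_id);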

doc.save("report.pdf")?;

Encrypted PDFs

PDF Oxide:

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open_with_password("encrypted.pdf", "secret")?;
let text = doc.extract_text(0)?;
println!("{}", text);

lopdf:

// lopdf does not support encrypted PDFs.
// Loading an encrypted PDF will fail or produce undecrypted streams.

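Reliability comparison

Metric	PDF Oxide	lopdf
PDFs parsed successfully	3,823 / 3,823 (100%)	3,071 / 3,823 (80.2%)
PDFs with text output	3,823 / 3,823	about 1,320 / 3,823 (estimate)
Encrypted PDF support	Supported	Not supported
Corrupted PDF recovery	Supported	Not supported

lopdf's 80.2% pass rate means it fails on roughly 1 in 5 PDFs. Failures occur on encrypted documents, PDFs with non-standard xref tables, and documents that use cross-reference streams. PDF Oxide handles all of these cases with lenient parsing and fallback strategies.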
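
The estimated count of lopdf files with text output appears to follow from two figures on this page: 3,071 parsed files, of which 57% yield empty output. A minimal sketch of that arithmetic, assuming that is how the estimate was derived (the function name estimate_non_empty is hypothetical):

```rust
/// Estimate how many parsed PDFs yield non-empty output, given the share
/// of parsed files whose output is empty. Hypothetical helper for illustration.
fn estimate_non_empty(parsed: u32, empty_share: f64) -> u32 {
    (f64::from(parsed) * (1.0 - empty_share)).round() as u32
}
```

With parsed = 3071 and empty_share = 0.57, the result is 3,071 × 0.43 ≈ 1,321, in line with the table's figure of about 1,320.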

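When to use which

Choose PDF Oxide when:

You need text extraction, image extraction, or other content-level operations
You want reading, writing, and creation in a single crate
You need to handle any PDF reliably (encrypted, corrupted, or complex)
You need Markdown/HTML output, rendering, or OCR
You need compliance validation (PDF/A, PDF/X, PDF/UA)

Choose lopdf when:

You need direct access to PDF objects for custom processing
You are building specialized PDF tooling that works at the object level
You need to merge documents by manipulating the object tree directly
Your PDFs are simple and well-formed (no encryption, standard xref tables)

Using both together:

Use PDF Oxide for high-level work and lopdf for edge cases that need raw object access: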

[dependencies]
pdf_oxide = "0.3"
lopdf = "0.32"
