What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Extraction profiles: tuning whitespace detection by document type

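Every PDF hides its whitespace differently. arXiv papers are typeset in dense, justified, multi-column layouts; IRS forms depend strictly on cell alignment; GDPR policy documents flow dense justified paragraphs with minimal kerning. A single tj_offset_threshold tuned for one of these inserts spurious spaces in the others.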
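To make the role of a TJ offset threshold concrete, here is a minimal sketch of the general technique, not PDF Oxide's actual implementation. It assumes a TJ array is represented as a Python list of strings and numeric offsets, and that an offset at or below the threshold marks a word gap. The function name `join_tj` is hypothetical.

```python
def join_tj(items, tj_offset_threshold):
    """Join a TJ-style array into text.

    Strings are glyph runs; numbers are position adjustments. A number at
    or below the (negative) threshold is treated as a word gap.
    """
    out = []
    for item in items:
        if isinstance(item, str):
            out.append(item)
        elif item <= tj_offset_threshold and out and not out[-1].endswith(" "):
            out.append(" ")
    return "".join(out)


tj = ["Hello", -110, "world", -20, "!"]
print(join_tj(tj, -120))  # Helloworld!
print(join_tj(tj, -80))   # Hello world!
```

The same array yields a different result under each threshold, which is why a value tuned for one document class can misplace spaces in another.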

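ExtractionProfile provides nine pre-tuned parameter sets that map cleanly onto real document classes. Pass a profile to extract_text() or extract_words(), and PDF Oxide applies the word margin ratio, TJ offset threshold, and adaptive-threshold setting (on or off) suited to that document style.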

Binding support. Extraction profiles are currently exposed in Python (pdf_oxide.ExtractionProfile) and Rust (pdf_oxide::config::ExtractionProfile). The Node, WASM, Go, and C# bindings use the CONSERVATIVE defaults internally. To apply a different profile from those runtimes, call the Rust CLI (pdf-oxide extract --profile academic doc.pdf) or route the document through a Python or Rust stage.
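Shelling out to the CLI might look like the sketch below. It is written in Python only to keep this page's examples in one language; the same argument list applies from Node or Go. The command line is the one quoted above. That the CLI writes the extracted text to stdout and exits non-zero on failure is an assumption, and both helper names are hypothetical.

```python
import subprocess


def build_extract_command(pdf_path, profile="academic"):
    """Build the argv for the CLI call quoted in the text above."""
    return ["pdf-oxide", "extract", "--profile", profile, pdf_path]


def extract_via_cli(pdf_path, profile="academic"):
    """Run the CLI and return whatever it writes to stdout.

    Assumption: extracted text goes to stdout and a non-zero exit code
    signals failure. Neither is confirmed by this page.
    """
    result = subprocess.run(
        build_extract_command(pdf_path, profile),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.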

Quick example

Python

from pdf_oxide import PdfDocument, ExtractionProfile

doc = PdfDocument("paper.pdf")

# Academic paper: tight letter spacing, citation detection enabled
text = doc.extract_text(0, profile=ExtractionProfile.academic())
print(text)

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::config::ExtractionProfile;

let mut doc = PdfDocument::open("paper.pdf")?;
let text = doc.extract_text_with_profile(0, ExtractionProfile::ACADEMIC)?;
println!("{}", text);

Available profiles

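Profile	Recommended for	TJ threshold	Word margin ratio	Adaptive
`conservative()`	Default: plain text, fewest spurious spaces	−120	0.10	off
`aggressive()`	PDFs with suppressed spacing; splits run-together words	−80	0.20	off
`balanced()`	Mixed content	−100	0.15	off
`academic()`	arXiv papers, conference proceedings, technical reports	−105	0.12	on + citation / email detection
`policy()`	Legal, GDPR, government regulations	−110	0.18	on
`form()`	IRS forms, applications, questionnaires	−120	0.08	off
`government()`	Government reports containing tables	−105	0.14	off
`scanned_ocr()`	OCR output with noisy coordinates	auto	auto	on
`adaptive()`	Extractor tunes itself from font statistics	auto	auto	on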

What each profile does

Academic papers / conference proceedings: `academic()`

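These documents combine dense typesetting, two-column layouts, and in-sentence citations. With the default settings, extra spaces tend to appear inside ligatures (fi, ff), and spaces tend to go missing between heavily kerned words.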

doc = PdfDocument("neurips-paper.pdf")
text = doc.extract_text(0, profile=ExtractionProfile.academic())

The academic profile enables citation / email detection alongside adaptive thresholds, so inline references such as [1,2,3] and email addresses such as author@lab.edu stay intact.
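One way to check that citations and email addresses survived extraction is a small post-extraction scan. The sketch below is a standalone check using regular expressions. It does not reflect how PDF Oxide detects these tokens internally, and both patterns and the function name `find_protected_tokens` are assumptions chosen for illustration.

```python
import re

# Bracketed numeric citations such as [1], [1,2,3] or [4-6].
CITATION = re.compile(r"\[\d+(?:\s*[,\-]\s*\d+)*\]")
# A deliberately simple email pattern; not a full RFC 5322 matcher.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def find_protected_tokens(text):
    """Return the citations and emails found intact in extracted text."""
    return {
        "citations": CITATION.findall(text),
        "emails": EMAIL.findall(text),
    }


sample = "As shown in [1,2,3], contact author@lab.edu."
print(find_protected_tokens(sample))
```

If a profile splits a token, for example "author@ lab.edu", the token drops out of the result, which makes regressions easy to spot in a test suite.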

IRS forms, applications: `form()`

Form PDFs prioritize column alignment over word boundaries. The form() profile uses a very narrow word margin ratio (0.08) so that tightly aligned field labels do not merge with their values.

doc = PdfDocument("w2.pdf")
text = doc.extract_text(0, profile=ExtractionProfile.form())
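The effect of the word margin ratio can be sketched as follows. This is a simplified model, not PDF Oxide's code: it assumes a word break is inserted whenever the horizontal gap between two glyph runs exceeds the ratio multiplied by the font size. The function name `split_words` and the coordinates are hypothetical.

```python
def split_words(runs, font_size, word_margin_ratio):
    """Group glyph runs into words.

    runs: list of (text, x_start, x_end) tuples sorted left to right.
    A new word starts when the gap exceeds word_margin_ratio * font_size.
    """
    words, current, prev_end = [], "", None
    for text, x_start, x_end in runs:
        if prev_end is not None and (x_start - prev_end) > word_margin_ratio * font_size:
            words.append(current)
            current = ""
        current += text
        prev_end = x_end
    if current:
        words.append(current)
    return words


# A label and its value separated by a 1.0 pt gap, set in 10 pt type.
runs = [("Name:", 0.0, 25.0), ("Ann", 26.0, 41.0)]
print(split_words(runs, 10.0, 0.08))  # ['Name:', 'Ann']
print(split_words(runs, 10.0, 0.18))  # ['Name:Ann']
```

Under this model the narrow 0.08 ratio keeps the label and value apart, while a wider ratio would fuse them.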

GDPR / policy / regulations: `policy()`

Justified paragraphs insert variable-width spaces, which defeats the default thresholds. policy() combines a more generous word margin (0.18) with adaptive thresholds to read dense legal text accurately.

doc = PdfDocument("gdpr.pdf")
text = doc.extract_text(0, profile=ExtractionProfile.policy())

Scanned OCR output: `scanned_ocr()`

When a page was produced by OCR (Tesseract, PaddleOCR, Azure), character coordinates are noisy and kerning hints are lost. scanned_ocr() compensates with adaptive thresholds that re-read font statistics on every page.

doc = PdfDocument("scanned.pdf")
text = doc.extract_text(0, profile=ExtractionProfile.scanned_ocr())

Let the library decide: `adaptive()`

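If the document class is not known in advance, adaptive() samples font statistics in a first pass and settles the thresholds before extraction. It is slightly slower than the fixed profiles but forgiving on mixed corpora.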

from pathlib import Path

# Extract page 0 of every PDF in the folder with the adaptive profile
for pdf_path in Path("mixed_corpus/").glob("*.pdf"):
    doc = PdfDocument(str(pdf_path))
    text = doc.extract_text(0, profile=ExtractionProfile.adaptive())
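How a threshold might be derived from page statistics can be illustrated with a simple heuristic. The sketch below is not the algorithm PDF Oxide uses, which this page does not describe. It assumes inter-glyph gaps fall into two clusters (within words and between words) and places the threshold in the middle of the largest jump between sorted gap values. The function name and the sample gaps are hypothetical.

```python
def adaptive_gap_threshold(gaps):
    """Pick a word-gap threshold from observed inter-glyph gaps.

    Sorts the gaps, finds the largest jump between neighbours, and
    returns the midpoint of that jump.
    """
    if len(gaps) < 2:
        raise ValueError("need at least two gaps to estimate a threshold")
    ordered = sorted(gaps)
    split = max(range(1, len(ordered)), key=lambda i: ordered[i] - ordered[i - 1])
    return (ordered[split - 1] + ordered[split]) / 2


# Gaps in thousandths of an em: small ones inside words, large ones between.
gaps = [10, 20, 15, 200, 220, 12, 190]
print(adaptive_gap_threshold(gaps))  # 105.0
```

Because the threshold comes from the page itself, the same code adapts to tight academic typesetting and to loosely spaced OCR output without a preset.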

Profile fields

Each profile exposes its tuning parameters, so you can read or copy the values.

Python

from pdf_oxide import ExtractionProfile

p = ExtractionProfile.academic()
print(p.name)                # "Academic"
print(p.word_margin_ratio)   # 0.12
print(p.tj_offset_threshold) # -105.0

# List every preset
for profile in ExtractionProfile.all_profiles():
    print(profile.name, profile.word_margin_ratio)

Rust

use pdf_oxide::config::ExtractionProfile;

let p = ExtractionProfile::ACADEMIC;
println!("{} margin={} tj={}",
    p.name, p.word_margin_ratio, p.tj_offset_threshold);

Choosing a profile in production pipelines

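If you ingest a mixed corpus (academic papers, IRS forms, web-crawled HTML exports, and so on), adaptive() is the safe default. It adds a few percent of overhead per page but reduces the worst failures: words running together and spaces dropped between columns.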

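For a homogeneous corpus (a Title IX intake pipeline, a contract review tool, an arXiv crawler, and so on), specify the matching profile explicitly. You get the best extraction quality and avoid the per-page sampling cost of adaptive().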
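The selection rule above can be expressed as a small routing table. This is a minimal sketch: the source tags ("arxiv", "irs", "contracts") and the function name are hypothetical, while the profile names come from the table of available profiles.

```python
# Hypothetical source tags mapped to preset names from the profile table.
PROFILE_BY_SOURCE = {
    "arxiv": "academic",
    "irs": "form",
    "contracts": "policy",
}


def choose_profile_name(source_tag):
    """Return the preset for a known source, or "adaptive" otherwise."""
    return PROFILE_BY_SOURCE.get(source_tag, "adaptive")


print(choose_profile_name("arxiv"))      # academic
print(choose_profile_name("web-crawl"))  # adaptive
```

With the Python binding, the returned name could then be resolved with getattr(ExtractionProfile, name)(), since the examples on this page call each preset as a method on ExtractionProfile.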
