What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with a mean text extraction time of 0.8 ms: 5.8× faster than PyMuPDF (4.6 ms) and 15× faster than pypdf (12.1 ms). It was benchmarked on 3,830 real-world PDFs with a 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

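Getting Started with PDF Oxide (Julia)

PDF Oxide is the fastest PDF library for Julia: 0.8 ms mean text extraction and a 100% pass rate on 3,830 PDFs. The PdfOxide.jl package wraps the Rust core directly over the C ABI, so you get an idiomatic Julia API and native speed at the same time. Page indices are 0-based (unlike Julia's own 1-based arrays).

Installation

Add the package with Julia's package manager: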

using Pkg
Pkg.add("PdfOxide")

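The native library (libpdf_oxide) is loaded at runtime. If it is not on the system loader path, point to it through one of the locations PdfOxide.jl checks in order: the PDF_OXIDE_LIB_PATH environment variable (full path to the file), the PDF_OXIDE_LIB_DIR environment variable (a directory), and then the local target/release build directory.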

export PDF_OXIDE_LIB_DIR=/path/to/pdf_oxide/target/release
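
The lookup order described above can be sketched in plain Julia. This is only an illustration, not PdfOxide.jl's actual loader: the helper name resolve_lib_path, the library file name, and the final fallback to the system loader are all assumptions.

```julia
using Libdl

# Hypothetical helper that mirrors the documented lookup order.
# It is an illustration, not PdfOxide.jl's actual loader code.
function resolve_lib_path(env = ENV; build_dir = joinpath("target", "release"))
    # Assumed file name; Windows builds usually drop the "lib" prefix.
    libfile = "libpdf_oxide." * Libdl.dlext

    # 1. PDF_OXIDE_LIB_PATH: full path to the library file
    path = get(env, "PDF_OXIDE_LIB_PATH", "")
    isempty(path) || return path

    # 2. PDF_OXIDE_LIB_DIR: directory that contains the library
    dir = get(env, "PDF_OXIDE_LIB_DIR", "")
    isempty(dir) || return joinpath(dir, libfile)

    # 3. Local target/release build directory, if the file is there
    candidate = joinpath(build_dir, libfile)
    isfile(candidate) && return candidate

    # 4. Otherwise leave the lookup to the system loader (assumed fallback)
    return "libpdf_oxide"
end

println(resolve_lib_path(Dict("PDF_OXIDE_LIB_DIR" => "/opt/pdf_oxide")))
```

Passing a Dict instead of ENV makes the order easy to check without touching the real environment.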

Quick Start

Open a PDF and extract text from the first page. extract_text takes a 0-based page index.

using PdfOxide

doc = open_document("report.pdf")

println("pages:   ", page_count(doc))
v = version(doc)
println("version: ", v.major, ".", v.minor)

# Plain text of the first page (0-based index)
println(extract_text(doc, 0))

You can also build a document in memory and open it from bytes, which is useful for tests and pipelines that never touch the disk:

using PdfOxide

pdf = from_markdown("# Hello pdf_oxide\n\nThis is the **Julia** binding.\n")
doc = open_from_bytes(to_bytes(pdf))

println("pages: ", page_count(doc))
println(extract_text(doc, 0))
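
Because page indices start at 0, a loop over every page runs from 0 to page_count(doc) - 1 rather than from 1 to page_count(doc). The sketch below uses only the calls shown above and assumes that page_count returns an integer and that extract_text returns a string.

```julia
using PdfOxide

# Build a small document in memory so the example needs no file on disk.
pdf = from_markdown("# Page loop\n\nSome body text.\n")
doc = open_from_bytes(to_bytes(pdf))

# Valid page indices run from 0 to page_count(doc) - 1.
texts = String[]
for i in 0:(page_count(doc) - 1)
    push!(texts, extract_text(doc, i))
end

println("collected text from ", length(texts), " page(s)")
```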

Inspecting a Document

Before extracting anything, a few lightweight calls tell you what kind of document you are dealing with:

using PdfOxide

doc = open_document("report.pdf")

@show page_count(doc)        # number of pages
@show version(doc).major     # major version of the PDF specification
@show is_encrypted(doc)      # true if the file is password-protected

Markdown and HTML Conversion

You can convert a single page or the whole document at once. Markdown output preserves headings, lists, and emphasis, and the _all variants concatenate every page.

using PdfOxide

doc = open_document("paper.pdf")

# One page (0-based)
md = to_markdown(doc, 0)
println(md)

# Whole document
println(to_markdown_all(doc))

# HTML for a single page
html = to_html(doc, 0)
println(html)

# Plain text with no markup at all
println(to_plain_text(doc, 0))

Word-Level Extraction

extract_words returns a vector of Word values, each holding the text, bounding box, font size, and a bold flag. The bounding box is a Bbox with width, height, and position fields.

using PdfOxide

doc = open_document("paper.pdf")
words = extract_words(doc, 0)

for w in first(words, 10)
    println(rpad(w.text, 20),
            " size=", w.font_size,
            " bold=", w.bold,
            " width=", round(w.bbox.width; digits = 1))
end
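
The per-word fields lend themselves to simple filtering. As a rough sketch, the hypothetical helper below keeps bold words at or above a font size threshold, which can serve as a crude heading detector. It assumes bold is a Bool and font_size is numeric, the 14.0 threshold is arbitrary, and NamedTuples stand in for real Word values so the example runs without a PDF.

```julia
# Hypothetical helper: works on any values that expose the `text`,
# `font_size`, and `bold` fields described above.
function heading_candidates(words; min_size = 14.0)
    return [w.text for w in words if w.bold && w.font_size >= min_size]
end

# Stand-in data so the sketch runs without a PDF; real input would be
# extract_words(doc, 0).
sample = [
    (text = "Introduction", font_size = 18.0, bold = true),
    (text = "This",         font_size = 11.0, bold = false),
    (text = "Note",         font_size = 11.0, bold = true),
]

println(heading_candidates(sample))
```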

For line-level layout, extract_text_lines returns TextLine values, each with text, word_count, and bbox:

using PdfOxide

doc = open_document("paper.pdf")
lines = extract_text_lines(doc, 0)

for line in lines
    println(line.word_count, " words: ", line.text)
end

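Search

You can search a single page or the whole document. The final argument is the case-sensitivity flag (pass false for a case-insensitive search). Each result reports its text, the page it was found on, and its bbox.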

using PdfOxide

doc = open_document("manual.pdf")

# Search one page (case-insensitive)
hits = search(doc, 0, "configuration", false)
for h in hits
    println("page ", h.page, ": ", h.text)
end

# Search every page
all_hits = search_all(doc, "configuration", false)
println(length(all_hits), " total matches")
for h in all_hits
    println("page ", h.page, " at (",
            round(h.bbox.x; digits = 0), ", ",
            round(h.bbox.y; digits = 0), ")")
end
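
Since every result carries its page, the matches from a whole-document search can be summarized per page. The helper below is a hypothetical sketch that assumes page is an integer. NamedTuples stand in for real search results so the example runs on its own.

```julia
# Hypothetical helper: counts matches per page for any values with a `page` field.
function hits_per_page(hits)
    counts = Dict{Int,Int}()
    for h in hits
        counts[h.page] = get(counts, h.page, 0) + 1
    end
    return counts
end

# Stand-in results; real input would be search_all(doc, "configuration", false).
sample_hits = [
    (page = 0, text = "configuration"),
    (page = 2, text = "Configuration"),
    (page = 2, text = "configuration"),
]

counts = hits_per_page(sample_hits)
for page in sort(collect(keys(counts)))
    println("page ", page, ": ", counts[page], " match(es)")
end
```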

Creating PDFs

The from_* factory functions build a Pdf from Markdown, HTML, or plain text. Call to_bytes to get the raw bytes, or save to write directly to a file.

using PdfOxide

# From Markdown
pdf = from_markdown("# Invoice\n\nAmount due: **\$42**\n")
save(pdf, "invoice.pdf")

# From HTML
html_pdf = from_html("<h1>Report</h1><p>Quarterly results.</p>")
save(html_pdf, "report.pdf")

# From plain text: get the bytes instead of writing a file
text_pdf = from_text("Plain text body.")
bytes = to_bytes(text_pdf)
println("generated ", length(bytes), " bytes")

Error Handling

A failed operation throws a PdfOxideError. Wrap calls that handle untrusted input in try/catch:

using PdfOxide

try
    doc = open_document("missing.pdf")
    println(extract_text(doc, 0))
catch e
    e isa PdfOxideError || rethrow()
    println("PDF error: ", e)
end
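
When many files are processed in a batch, the same pattern can be folded into a small wrapper. The function try_extract_text below is a hypothetical helper, not part of PdfOxide.jl. It assumes, as the example above suggests, that opening a missing or unreadable file raises PdfOxideError.

```julia
using PdfOxide

# Hypothetical helper: returns the page text, or `nothing` when PdfOxide
# reports an error. Any other exception is rethrown unchanged.
function try_extract_text(path::AbstractString, page::Integer = 0)
    try
        doc = open_document(path)
        return extract_text(doc, page)
    catch e
        e isa PdfOxideError || rethrow()
        return nothing
    end
end

text = try_extract_text("missing.pdf")
println(text === nothing ? "could not read PDF" : text)
```

Returning nothing keeps the caller's loop simple, at the cost of discarding the error details. Log e inside the catch block if those details matter.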

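Next Steps

Getting Started with Rust: the native core that PDF Oxide is built on
Getting Started with Python: using PDF Oxide from Python
Text Extraction: detailed extraction options and usage examples
PDF Creation: advanced creation with metadata and styling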