What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

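Getting Started with PDF Oxide (Elixir)

PDF Oxide is the fastest way to read and write PDFs from Elixir: 0.8ms mean text extraction and a 100% pass rate across 3,830 PDFs. It is a NIF that wraps the same Rust core as the other language bindings, and it runs CPU-intensive work on the dirty CPU schedulers (ERL_NIF_DIRTY_JOB_CPU_BOUND), so it never blocks the regular BEAM schedulers.

Document and Pdf handles are NIF resources that the garbage collector releases. Functions that can fail return {:ok, value} or {:error, code}, and page indexes are zero-based.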

Installation

Add pdf_oxide to your dependencies in mix.exs:

def deps do
  [
    {:pdf_oxide, "~> 0.3"}
  ]
end

Then fetch and compile. The NIF is built as a native cdylib through elixir_make:

mix deps.get
mix compile

Quick Start

Create a PDF from Markdown, serialize it to bytes, then reopen it and extract the text.

{:ok, pdf}   = PdfOxide.from_markdown("# Hello pdf_oxide\n\nThis is an **Elixir** binding.\n")
{:ok, bytes} = PdfOxide.to_bytes(pdf)
{:ok, doc}   = PdfOxide.open_from_bytes(bytes)

{:ok, pages} = PdfOxide.page_count(doc)
IO.puts("pages: #{pages}")

%{major: maj, minor: min} = PdfOxide.version(doc)
IO.puts("version: #{maj}.#{min}")

{:ok, text} = PdfOxide.extract_text(doc, 0)
IO.puts(text)

Opening a PDF

Open a document from a file path, or directly from bytes already in memory (useful when streaming from S3, HTTP, or a database):

# Open from a path
{:ok, doc} = PdfOxide.open("report.pdf")

# Open from bytes already in memory
{:ok, doc} = PdfOxide.open_from_bytes(pdf_bytes)

# Encrypted document
{:ok, doc} = PdfOxide.open_with_password("confidential.pdf", "secret")

# Inspect
{:ok, count} = PdfOxide.page_count(doc)
encrypted? = PdfOxide.encrypted?(doc)

When you are done, close the document explicitly (close/1 is idempotent) or let the GC reclaim it:

:ok = PdfOxide.close(doc)

Text Extraction

Extract plain text from a single page by its zero-based index, or fetch the whole document at once:

{:ok, doc} = PdfOxide.open("book.pdf")

# Single page
{:ok, text} = PdfOxide.extract_text(doc, 0)

# Plain text, one page
{:ok, pt} = PdfOxide.to_plain_text(doc, 0)

# All pages, concatenated
{:ok, all} = PdfOxide.to_plain_text_all(doc)
IO.puts(all)
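
When the text is needed page by page rather than as one concatenated string, the zero-based indexes can be walked explicitly. The following is a minimal sketch and not part of pdf_oxide: PageTexts and its api parameter are hypothetical names, and the sketch assumes only the page_count/1 and extract_text/2 calls shown above. The module providing those calls is passed in as an argument so the helper can also be exercised with a stub.

```elixir
defmodule PageTexts do
  @moduledoc "Hypothetical helper: collects the text of every page, one entry per page."

  # `api` is any module exposing page_count/1 and extract_text/2 with the
  # {:ok, value} | {:error, code} convention; it defaults to PdfOxide.
  def all(doc, api \\ PdfOxide) do
    with {:ok, count} <- api.page_count(doc) do
      # The explicit //1 step keeps the range empty when count is 0.
      collect(api, doc, Enum.to_list(0..(count - 1)//1), [])
    end
  end

  # All indexes visited: return the texts in page order.
  defp collect(_api, _doc, [], acc), do: {:ok, Enum.reverse(acc)}

  # Stop at the first failing page and return its error tuple unchanged.
  defp collect(api, doc, [index | rest], acc) do
    case api.extract_text(doc, index) do
      {:ok, text} -> collect(api, doc, rest, [text | acc])
      {:error, _code} = error -> error
    end
  end
end
```

With a real document this would be called as PageTexts.all(doc). Returning the first error tuple as-is keeps the helper consistent with the tagged-tuple convention used by the rest of the API.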

Markdown & HTML Conversion

Convert a single page, or the whole document, to Markdown or HTML:

{:ok, doc} = PdfOxide.open("paper.pdf")

{:ok, md}    = PdfOxide.to_markdown(doc, 0)
{:ok, mdall} = PdfOxide.to_markdown_all(doc)

{:ok, html}    = PdfOxide.to_html(doc, 0)
{:ok, htmlall} = PdfOxide.to_html_all(doc)

Words & Lines

extract_words/2 returns structured PdfOxide.Word structs that carry a bounding box and a bold flag, and extract_text_lines/2 groups them into lines.

{:ok, doc} = PdfOxide.open("paper.pdf")

{:ok, words} = PdfOxide.extract_words(doc, 0)

for w <- Enum.take(words, 10) do
  %PdfOxide.Bbox{x: x, y: y, width: width} = w.bbox
  IO.puts("#{w.text} at (#{x}, #{y}) w=#{width} bold=#{w.bold}")
end

{:ok, lines} = PdfOxide.extract_text_lines(doc, 0)

for line <- lines do
  IO.puts("#{line.word_count} words: #{line.text}")
end
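
One use of the bold flag is to pull out emphasized words, for example as heading candidates. The sketch below is illustrative only: BoldWords is a hypothetical helper, and it assumes the bold field is the boolean true for bold words. It reads only the text and bold fields used in the snippet above, so it works on plain maps as well as on PdfOxide.Word structs.

```elixir
defmodule BoldWords do
  # Hypothetical helper: returns the text of every word whose bold flag is true.
  # The pattern in the generator silently skips entries that do not match,
  # so non-bold words are filtered out without an explicit condition.
  def texts(words) do
    for %{text: text, bold: true} <- words, do: text
  end
end
```

With the words list from above, BoldWords.texts(words) would return the bold words in reading order.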

Search

Search a single page or the whole document. The last argument is case_sensitive (the fourth for search/4, the third for search_all/3). Each result carries text, page, and a PdfOxide.Bbox.

{:ok, doc} = PdfOxide.open("manual.pdf")

# One page (page index 0), case-insensitive
{:ok, results} = PdfOxide.search(doc, 0, "configuration", false)

for r <- results do
  %PdfOxide.Bbox{x: x, y: y} = r.bbox
  IO.puts("page #{r.page}: '#{r.text}' at (#{x}, #{y})")
end

# All pages
{:ok, all} = PdfOxide.search_all(doc, "configuration", false)
IO.puts("#{length(all)} matches")
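
Because every result carries its page index, document-wide matches can be summarized per page. This is a small sketch rather than a library feature: SearchSummary is a hypothetical name, and the helper relies only on the page field described above.

```elixir
defmodule SearchSummary do
  # Hypothetical helper: counts matches per page index.
  # Returns a list of {page, count} tuples sorted by page.
  def matches_per_page(results) do
    results
    |> Enum.frequencies_by(& &1.page)
    |> Enum.sort()
  end
end
```

Applied to the search_all/3 results above, SearchSummary.matches_per_page(all) gives one tuple per page that contains at least one match.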

Creating PDFs

The builder factory functions return a Pdf handle, which you can serialize with to_bytes/1 or write straight to disk with save/2:

{:ok, pdf} = PdfOxide.from_markdown("# Hello World\n\nThis is a PDF.")
:ok = PdfOxide.save(pdf, "output.pdf")

{:ok, pdf} = PdfOxide.from_html("<h1>Invoice</h1><p>Amount: $42</p>")
{:ok, bytes} = PdfOxide.to_bytes(pdf)

{:ok, pdf} = PdfOxide.from_text("Plain text content.")
:ok = PdfOxide.save(pdf, "notes.pdf")

Rendering Pages to Images

With the rendering feature, you can rasterize a page into a PdfOxide.RenderedImage and save it as a PNG:

{:ok, doc} = PdfOxide.open("paper.pdf")

{:ok, img} = PdfOxide.render_page(doc, 0)
IO.puts("#{img.width}x#{img.height}, #{byte_size(img.data)} bytes")
:ok = PdfOxide.save(img, "page0.png")

# Zoom factor, or a fixed-size thumbnail
{:ok, zoomed} = PdfOxide.render_page_zoom(doc, 0, 2.0)
{:ok, thumb}  = PdfOxide.render_page_thumbnail(doc, 0, 128)

Error Handling

Functions that can fail return tagged tuples, so you can control flow cleanly with pattern matching:

case PdfOxide.open("/nonexistent/nope.pdf") do
  {:ok, doc} ->
    {:ok, text} = PdfOxide.extract_text(doc, 0)
    IO.puts(text)

  {:error, code} ->
    IO.puts("could not open PDF: #{inspect(code)}")
end
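
When several fallible calls are chained, Elixir's with expression lets the first {:error, code} fall through without nested case blocks. The following is a minimal sketch under stated assumptions: FirstPage and its api parameter are hypothetical names, and only the open/1 and extract_text/2 calls shown above are used. The module is injected so the control flow can be tried against a stub.

```elixir
defmodule FirstPage do
  # Hypothetical helper: opens a PDF and returns the text of page 0.
  # `api` defaults to PdfOxide; any module with open/1 and extract_text/2
  # that follows the {:ok, value} | {:error, code} convention will do.
  def text(path, api \\ PdfOxide) do
    with {:ok, doc} <- api.open(path),
         {:ok, text} <- api.extract_text(doc, 0) do
      {:ok, text}
    end
  end
end
```

Any clause that does not match {:ok, _} is returned unchanged, so the caller sees the original error tuple whether opening or extraction failed.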

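Next Steps

Getting Started with Rust: using PDF Oxide from Rust
Getting Started with Python: using PDF Oxide from Python
Text Extraction: detailed extraction options and recipes
PDF Creation: advanced creation with metadata and encryption
Editing: modifying existing PDFs, annotations, and form fields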