What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (Ruby)

PDF Oxide is the fastest Ruby PDF library: 0.8ms mean text extraction and a 100% pass rate on 3,830 PDFs. A single gem handles extracting, searching, converting, creating, and redacting PDFs, and it is built on the same Rust core that powers the Python, Java, Node, Go, C#, and PHP bindings.

Installation

gem install pdf_oxide

Or add it to your Gemfile:

gem 'pdf_oxide', '~> 0.3'

The prebuilt libpdf_oxide native library ships inside the platform-tagged gems, so no compiler or system-wide install is needed. Prebuilt gems support Ruby 3.1–3.4 on x86_64-linux, aarch64-linux, Intel and Apple Silicon macOS, and Windows (x64-mingw-ucrt).
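To see what the Gemfile constraint and the supported Ruby range admit, RubyGems' own version classes can evaluate them. This is a minimal sketch: the helper name supported_ruby? and both constants are hypothetical and not part of pdf_oxide, and the Ruby range is simply transcribed from the paragraph above.

```ruby
require 'rubygems'

# '~> 0.3' is the pessimistic constraint from the Gemfile line:
# it admits any version >= 0.3 and < 1.0.
GEM_CONSTRAINT = Gem::Requirement.new('~> 0.3')

# Ruby 3.1 through 3.4, as listed for the prebuilt gems.
RUBY_RANGE = Gem::Requirement.new('>= 3.1', '< 3.5')

# Hypothetical helper: is this Ruby version inside the prebuilt-gem range?
def supported_ruby?(version = RUBY_VERSION)
  RUBY_RANGE.satisfied_by?(Gem::Version.new(version))
end

puts "0.3.51 allowed by '~> 0.3': #{GEM_CONSTRAINT.satisfied_by?(Gem::Version.new('0.3.51'))}"
puts "Running Ruby #{RUBY_VERSION}, in prebuilt range: #{supported_ruby?}"
```

A Ruby version outside the range does not necessarily mean the gem cannot be installed; it only means the prebuilt binaries described here are not documented for it.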

Opening a PDF

Load a file with PdfDocument.open. With the block form, the document is closed automatically when the block ends. #close is also available and is idempotent.

require 'pdf_oxide'

PdfOxide::PdfDocument.open('research-paper.pdf') do |doc|
  puts "Pages: #{doc.page_count}"
  puts "PDF version: #{doc.pdf_version}"
  puts "Encrypted: #{doc.encrypted?}"
end

For encrypted documents, pass password:.

PdfOxide::PdfDocument.open('confidential.pdf', password: 'secret') do |doc|
  puts doc.extract_text(0)
end

You can also open a document directly from in-memory bytes, which is useful when streaming from S3, HTTP, or a database. PdfDocument.open auto-detects raw PDF bytes by the %PDF- magic header.

bytes = File.binread('report.pdf')
PdfOxide::PdfDocument.open(bytes) do |doc|
  puts doc.extract_text(0)
end
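The magic-header check can be pictured as a prefix test on the input string. The sketch below is not the library's implementation; pdf_bytes? and PDF_MAGIC are hypothetical names used only to illustrate how a path and raw bytes can be told apart.

```ruby
# The five bytes that begin a PDF file, as a binary string.
PDF_MAGIC = '%PDF-'.b

# Hypothetical helper, not part of pdf_oxide: true when the argument looks
# like raw PDF bytes rather than a filesystem path.
def pdf_bytes?(input)
  input.is_a?(String) && input.b.start_with?(PDF_MAGIC)
end

puts pdf_bytes?("%PDF-1.7\n".b)  # a buffer that starts like a PDF
puts pdf_bytes?('report.pdf')    # a path
```

A check like this is handy before handing an HTTP response body to PdfDocument.open, since an error page or a truncated download will not start with the header.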

Text extraction

Single page

Extract plain text from any page using a zero-based index.

PdfOxide::PdfDocument.open('report.pdf') do |doc|
  text = doc.extract_text(0)
  puts text
end

All pages

PdfOxide::PdfDocument.open('book.pdf') do |doc|
  doc.page_count.times do |i|
    puts "--- Page #{i + 1} ---"
    puts doc.extract_text(i)
  end
end
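When the text is needed as one string rather than printed page by page, the same loop can collect into an array and join. This sketch assumes only the two methods used above, page_count and extract_text(index); all_text and FakeDoc are hypothetical names, and FakeDoc is a stand-in so the example runs without a PDF on disk.

```ruby
# Collect the text of every page and join it with a separator.
# Works with any object that responds to page_count and extract_text(index).
def all_text(doc, separator: "\n\n")
  Array.new(doc.page_count) { |i| doc.extract_text(i) }.join(separator)
end

# Stand-in document (hypothetical) that mimics the two methods used above.
FakeDoc = Struct.new(:pages) do
  def page_count
    pages.length
  end

  def extract_text(index)
    pages.fetch(index)
  end
end

puts all_text(FakeDoc.new(['first page', 'second page']))
```

Inside a PdfOxide::PdfDocument.open block, the doc yielded to the block would be passed in place of FakeDoc.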

One-shot helper

If you only need the text of one page, PdfDocument.extract_text handles open, extract, and close in a single call.

text = PdfOxide::PdfDocument.extract_text('report.pdf', page: 0)
puts text

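Auto-routed extraction

extract_text_auto uses the auto router introduced in v0.3.51 to choose between native text and OCR for each page. Even when the library is built without the ocr feature, it falls back gracefully to the native text layer, so no "OCR unavailable" error is raised.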

PdfOxide::PdfDocument.open('scan.pdf') do |doc|
  puts doc.extract_text_auto(0)
end

If you need a typed reason describing the extraction quality, use AutoExtractor.

PdfOxide::PdfDocument.open('scan.pdf') do |doc|
  ax     = doc.auto_extractor
  result = ax.extract_page(0)
  puts result[:text]
  warn "degraded: #{result[:reason]}" unless ax.ok?(result[:reason])
end

Working with pages

PdfDocument#page returns a lightweight PdfPage view that borrows the document. #pages returns one such view for every page.

PdfOxide::PdfDocument.open('paper.pdf') do |doc|
  page = doc.page(0)
  puts "Index: #{page.index}"
  puts page.text   # same as doc.extract_text(0)

  doc.pages.each do |p|
    puts "Page #{p.index}: #{p.text.length} chars"
  end
end

Markdown and HTML conversion

Convert a single page (pass an index) or the entire document (omit the index) to Markdown or HTML.

PdfOxide::PdfDocument.open('paper.pdf') do |doc|
  puts doc.to_markdown(0)   # first page to Markdown
  puts doc.to_html(0)       # first page to HTML
  puts doc.to_markdown      # entire document to Markdown
end

Structured extraction

extract_structured returns the parsed page layout as a Hash: the page size together with typed regions, each carrying text, a bounding box, and a column index.

PdfOxide::PdfDocument.open('paper.pdf') do |doc|
  page = doc.extract_structured(0)
  puts "Size: #{page['page_width']} x #{page['page_height']}"
  page['regions'].each do |region|
    puts "#{region['kind']}: #{region['text']}"
  end
end
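Because the result is a plain Hash, ordinary Enumerable methods are enough to reshape it. The sketch below groups region text by kind. SAMPLE_PAGE is hand-written in the shape used above, the kind values 'heading' and 'paragraph' are assumptions made for illustration, and texts_by_kind is a hypothetical helper.

```ruby
# Hand-written sample in the shape shown above. The 'kind' values are
# illustrative assumptions; real output may use other names and more keys.
SAMPLE_PAGE = {
  'page_width' => 612.0,
  'page_height' => 792.0,
  'regions' => [
    { 'kind' => 'heading',   'text' => 'Introduction' },
    { 'kind' => 'paragraph', 'text' => 'PDF is a page description format.' },
    { 'kind' => 'paragraph', 'text' => 'Layout analysis recovers structure.' }
  ]
}.freeze

# Hypothetical helper: map each region kind to the list of its texts.
def texts_by_kind(page)
  page.fetch('regions')
      .group_by { |region| region['kind'] }
      .transform_values { |regions| regions.map { |r| r['text'] } }
end

texts_by_kind(SAMPLE_PAGE).each do |kind, texts|
  puts "#{kind}: #{texts.length} region(s)"
end
```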

Search

search scans the entire document and returns an array of match hashes. Each hash has :page, :text, and a :bbox hash made up of :x, :y, :width, and :height.

PdfOxide::PdfDocument.open('manual.pdf') do |doc|
  matches = doc.search('configuration', case_sensitive: false)
  matches.each do |m|
    bbox = m[:bbox]
    puts "Page #{m[:page]}: '#{m[:text]}' at (#{bbox[:x].round}, #{bbox[:y].round})"
  end
end
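A common follow-up is counting hits per page. The sketch below works on an array in the shape described above; SAMPLE_MATCHES is hand-written data rather than real search output, and hits_per_page is a hypothetical helper.

```ruby
# Hand-written matches in the documented shape (:page, :text, :bbox).
SAMPLE_MATCHES = [
  { page: 0, text: 'configuration', bbox: { x: 72.0,  y: 700.5, width: 80.2, height: 12.0 } },
  { page: 0, text: 'Configuration', bbox: { x: 72.0,  y: 512.0, width: 82.9, height: 12.0 } },
  { page: 3, text: 'configuration', bbox: { x: 140.3, y: 233.1, width: 80.2, height: 12.0 } }
].freeze

# Hypothetical helper: number of matches found on each page.
def hits_per_page(matches)
  matches.map { |m| m[:page] }.tally
end

hits_per_page(SAMPLE_MATCHES).each do |page, count|
  puts "Page #{page}: #{count} match(es)"
end
```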

Rendering

Render a page to PNG bytes at a specified DPI.

PdfOxide::PdfDocument.open('poster.pdf') do |doc|
  png = doc.render(0, dpi: 150)
  File.binwrite('page-0.png', png)
end
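The DPI setting determines the output size: PDF page dimensions are measured in points, and the default unit is 1/72 inch, so each point maps to dpi / 72 pixels. The helper below estimates the resulting image size. pixel_size is a hypothetical name, and the rounding is an assumption, so the renderer's actual output may differ by a pixel and will differ for pages that use rotation or a custom user unit.

```ruby
# Default PDF user space: 72 points per inch.
POINTS_PER_INCH = 72.0

# Hypothetical helper: estimated pixel dimensions for a page measured in points.
def pixel_size(width_pt, height_pt, dpi)
  scale = dpi / POINTS_PER_INCH
  [(width_pt * scale).round, (height_pt * scale).round]
end

# US Letter is 612 x 792 points.
width, height = pixel_size(612, 792, 150)
puts "#{width} x #{height} px"
```

Doubling the DPI roughly quadruples the pixel count, which is worth keeping in mind when rendering many pages into memory.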

Creating PDFs

The Pdf class creates PDFs from Markdown, HTML, or plain text. Each instance owns a native handle, so use the block form (which closes it automatically) or call #close yourself.

PdfOxide::Pdf.from_markdown("# Hello World\n\nThis is a PDF.") do |pdf|
  pdf.save('output.pdf')
end

PdfOxide::Pdf.from_html('<h1>Invoice</h1><p>Amount due: $42.00</p>') do |pdf|
  pdf.save('invoice.pdf')
end

PdfOxide::Pdf.from_text("Plain text document.\n\nSecond paragraph.") do |pdf|
  pdf.save('notes.pdf')
end

Instead of saving to disk, you can get the bytes directly with #to_bytes.

pdf_bytes = PdfOxide::Pdf.from_markdown('# Report').to_bytes
# upload pdf_bytes, attach to an email, etc.

Redaction

DocumentEditor opens an existing PDF and performs destructive redaction. apply_redactions! permanently removes the covered content and can also scrub document metadata in the same operation.

PdfOxide::DocumentEditor.open('source.pdf') do |ed|
  ed.add_redaction(page: 0, rect: [100, 200, 300, 250])
  ed.apply_redactions!(scrub_metadata: true)
  ed.save_to('redacted.pdf')
end

Error handling

PDF Oxide raises typed subclasses of PdfOxide::Error for PDF-specific failures.

begin
  PdfOxide::PdfDocument.open('document.pdf') do |doc|
    puts doc.extract_text(0)
  end
rescue PdfOxide::FileNotFoundError
  warn 'File not found'
rescue PdfOxide::EncryptedError
  warn 'Wrong or missing password'
rescue PdfOxide::ParseError => e
  warn "Malformed PDF: #{e.message}"
rescue PdfOxide::Error => e
  warn "PDF error: #{e.message}"
end

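Next steps

Getting Started with Python – Using PDF Oxide from Python
Getting Started with Rust – Using PDF Oxide from Rust
Text Extraction – Detailed extraction options and recipes
PDF Creation – Advanced creation, encryption, and metadata
Editing – Modifying existing PDFs, annotations, and form fields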