What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Migrating from pdfplumber to PDF Oxide

A complete guide to switching from pdfplumber to PDF Oxide. It covers every API you currently use and its replacement.

Why switch from pdfplumber?

There are four reasons to migrate:

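29× faster: PDF Oxide averages 0.8 ms per page, while pdfplumber averages 23.2 ms. A 100-page document is processed in 80 ms instead of 2.3 seconds.
Encrypted PDF support: pdfplumber cannot open encrypted PDFs at all. PDF Oxide transparently handles every encryption scheme, including AES-256.
Image extraction: pdfplumber has no image extraction feature. PDF Oxide extracts embedded images with a single call.
Markdown output: pdfplumber returns tables as Python lists that need manual formatting. PDF Oxide outputs Markdown that preserves table structure, ready to pass to an LLM.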
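
The 100-page figures in the first reason follow from the per-page means quoted above. A quick check, assuming extraction time scales linearly with page count (an assumption made here for illustration; real documents vary page to page):

```python
# Per-page mean extraction times quoted in this guide, in milliseconds.
PDF_OXIDE_MS_PER_PAGE = 0.8
PDFPLUMBER_MS_PER_PAGE = 23.2


def estimated_total_ms(ms_per_page: float, pages: int) -> float:
    """Estimate total extraction time, assuming cost is linear in page count."""
    return ms_per_page * pages


pages = 100
oxide_ms = estimated_total_ms(PDF_OXIDE_MS_PER_PAGE, pages)
plumber_ms = estimated_total_ms(PDFPLUMBER_MS_PER_PAGE, pages)
speedup = PDFPLUMBER_MS_PER_PAGE / PDF_OXIDE_MS_PER_PAGE

print(f"PDF Oxide:  {oxide_ms:.0f} ms")
print(f"pdfplumber: {plumber_ms / 1000:.2f} s")
print(f"Speedup:    {speedup:.0f}x")
```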

Step 1: Install

pip install pdf_oxide
pip uninstall pdfplumber  # optional

Step 2: Change the import

# Before
import pdfplumber

# After
from pdf_oxide import PdfDocument

Step 3: API mapping table

Task	pdfplumber	PDF Oxide
Open a PDF	`pdfplumber.open("file.pdf")`	`PdfDocument("file.pdf")`
Page count	`len(pdf.pages)`	`doc.page_count()`
Text extraction	`pdf.pages[0].extract_text()`	`doc.extract_text(0)`
Character positions	`pdf.pages[0].chars`	`doc.extract_chars(0)`
Table extraction	`pdf.pages[0].extract_tables()`	`doc.to_markdown(0)`
Form fields	Not supported (read-only)	`doc.get_form_fields()`
Encrypted PDFs	Not supported	`PdfDocument("file.pdf", password="pw")`
Image extraction	Not supported	`doc.extract_image_bytes(0)`
Markdown conversion	Not supported	`doc.to_markdown(0)`
Rendering	Not supported	`doc.render_page(0)`
OCR	Not supported	`doc.extract_text_ocr(0)`
PDF creation	Not supported	`Pdf.from_markdown("# Title")`

Step 4: Convert common patterns

Text extraction

pdfplumber is typically opened in a context manager. PDF Oxide does not need one:

# pdfplumber: opened in a context manager
import pdfplumber
with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

# PDF Oxide: no context manager needed
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
for i in range(doc.page_count()):
    text = doc.extract_text(i)
    print(text)
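
If existing code iterates over `pdf.pages`, a small generator can keep that shape after the switch. The sketch below relies only on the `page_count()` and `extract_text(i)` calls shown in this guide. `iter_page_texts`, `TextSource`, and `FakeDocument` are hypothetical names introduced for illustration, and `FakeDocument` stands in for `PdfDocument` so the example runs without a PDF file:

```python
from typing import Iterator, Protocol


class TextSource(Protocol):
    """Structural type covering the two document calls used in this guide."""

    def page_count(self) -> int: ...
    def extract_text(self, page_index: int) -> str: ...


def iter_page_texts(doc: TextSource) -> Iterator[str]:
    """Yield the text of each page in order, like looping over pdf.pages."""
    for i in range(doc.page_count()):
        yield doc.extract_text(i)


class FakeDocument:
    """Stand-in for PdfDocument (assumption: same two method names)."""

    def __init__(self, pages: list[str]) -> None:
        self._pages = pages

    def page_count(self) -> int:
        return len(self._pages)

    def extract_text(self, page_index: int) -> str:
        return self._pages[page_index]


# With the real library this would be iter_page_texts(PdfDocument("report.pdf")).
full_text = "\n".join(iter_page_texts(FakeDocument(["first page", "second page"])))
print(full_text)
```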

Table extraction

pdfplumber returns tables as nested Python lists. PDF Oxide outputs Markdown:

# pdfplumber: returns a list of lists
import pdfplumber
with pdfplumber.open("report.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()
    for table in tables:
        for row in table:
            print(row)

# PDF Oxide: structured Markdown output
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
md = doc.to_markdown(0)
print(md)  # tables are rendered as Markdown tables
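
Code that consumed pdfplumber's list-of-lists may still need rows rather than Markdown text. One possible bridge is to parse the pipe tables back out of the Markdown string. This is a minimal sketch that assumes the output uses pipe-delimited tables with a dashed separator line under the header, which this guide does not spell out. `markdown_tables_to_rows` is a hypothetical helper, not part of PDF Oxide, and it does not handle escaped pipes inside cells:

```python
import re

# Matches a separator cell such as ---, :---, ---:, or :---:
_SEPARATOR_CELL = re.compile(r"^:?-+:?$")


def _split_row(line: str) -> list[str]:
    """Split one pipe-table line into stripped cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def _is_separator(cells: list[str]) -> bool:
    """True if every cell looks like a header separator (dashes, optional colons)."""
    return bool(cells) and all(_SEPARATOR_CELL.match(c) for c in cells)


def markdown_tables_to_rows(md: str) -> list[list[list[str]]]:
    """Return every pipe table found in md as a list of rows of cell strings."""
    tables: list[list[list[str]]] = []
    current: list[list[str]] = []
    for line in md.splitlines():
        if line.strip().startswith("|"):
            cells = _split_row(line)
            # Only the line directly under the header is treated as a separator.
            if len(current) == 1 and _is_separator(cells):
                continue
            current.append(cells)
        elif current:
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables


sample = """# Report

| Item | Qty |
| --- | --- |
| Apples | 3 |
| Pears | 5 |

Closing paragraph.
"""
tables = markdown_tables_to_rows(sample)
for row in tables[0]:
    print(row)
```

As with pdfplumber's `extract_tables()`, the header row comes back as the first row of each table.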

Character-level extraction

# pdfplumber
import pdfplumber
with pdfplumber.open("report.pdf") as pdf:
    chars = pdf.pages[0].chars
    for c in chars:
        print(f"{c['text']} at ({c['x0']}, {c['top']})")

# PDF Oxide
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
chars = doc.extract_chars(0)
for c in chars:
    print(f"{c.char} at ({c.x}, {c.y})")

Encrypted PDFs (new capability)

pdfplumber cannot open encrypted PDFs. PDF Oxide handles them transparently:

from pdf_oxide import PdfDocument

# Supports every encryption scheme, including AES-256
doc = PdfDocument("encrypted.pdf", password="password")
text = doc.extract_text(0)
print(text)

Image extraction (new capability)

pdfplumber has no image extraction feature. With PDF Oxide it is straightforward:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

OCR for scanned documents (new capability)

pdfplumber cannot process scanned PDFs. PDF Oxide has OCR built in:

from pdf_oxide import PdfDocument

doc = PdfDocument("scanned.pdf")
text = doc.extract_text_ocr(0)
print(text)

Key differences

No context manager needed: pdfplumber uses with pdfplumber.open(...) as pdf:. PDF Oxide does not need a context manager.
Encrypted PDFs: pdfplumber cannot open them at all. PDF Oxide handles encryption transparently.
Tables: pdfplumber returns Python lists. PDF Oxide outputs tables as Markdown or HTML. For complex tables that need visual debugging, you can use pdfplumber alongside PDF Oxide.
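
If you would rather keep existing `with` blocks intact during the migration, the standard library's `contextlib.nullcontext` can wrap any object so it works in a `with` statement. This is only a sketch: `FakeDocument` is a hypothetical stand-in for `PdfDocument`, and whether `PdfDocument` supports `with` directly is not covered in this guide.

```python
from contextlib import nullcontext


class FakeDocument:
    """Stand-in for PdfDocument so the sketch runs without a PDF file."""

    def page_count(self) -> int:
        return 3


# With the real library this would be: nullcontext(PdfDocument("report.pdf"))
with nullcontext(FakeDocument()) as doc:
    pages = doc.page_count()

print(pages)
```

`nullcontext` performs no cleanup on exit; it only preserves the block structure of the old code.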

Step 5: Test the migration

Run your existing test files through both libraries and compare the output:

from pdf_oxide import PdfDocument

doc = PdfDocument("your-test-file.pdf")

# Check text extraction
text = doc.extract_text(0)
print(text[:500])

# Check the page count
print(f"Pages: {doc.page_count()}")

# Check form fields (if applicable)
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name}: {f.value}")
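
To compare the two libraries' output rather than inspect it by eye, one option is the standard library's `difflib`. The sketch below works on plain strings, so it makes no assumptions about either library's API. `normalize`, `similarity`, and `changed_lines` are hypothetical helper names, and the sample strings stand in for real extraction results:

```python
import difflib


def normalize(text: str) -> str:
    """Collapse runs of whitespace so layout differences don't dominate."""
    return " ".join(text.split())


def similarity(old_text: str, new_text: str) -> float:
    """Return a 0..1 similarity ratio between two extraction outputs."""
    matcher = difflib.SequenceMatcher(None, normalize(old_text), normalize(new_text))
    return matcher.ratio()


def changed_lines(old_text: str, new_text: str) -> list[str]:
    """Return added/removed lines (prefixed with + or -) after normalizing whitespace."""
    old_lines = [normalize(line) for line in old_text.splitlines()]
    new_lines = [normalize(line) for line in new_text.splitlines()]
    diff = list(
        difflib.unified_diff(
            old_lines, new_lines, fromfile="pdfplumber", tofile="pdf_oxide", lineterm=""
        )
    )
    # The first two entries are the ---/+++ file headers; skip them.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


# Placeholder strings; in practice these come from each library's extract_text call.
old_text = "Quarterly  Report\nRevenue: 10\nCosts: 4\n"
new_text = "Quarterly Report\nRevenue: 10\nCosts: 5\n"

score = similarity(old_text, new_text)
diff = changed_lines(old_text, new_text)
print(f"similarity: {score:.3f}")
for line in diff:
    print(line)
```

A ratio close to 1.0 with an empty diff suggests the two outputs agree apart from whitespace. What threshold counts as acceptable depends on your documents.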