What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with a mean text extraction time of 0.8 ms: 5.8× faster than PyMuPDF (4.6 ms) and 15× faster than pypdf (12.1 ms). The benchmark covered 3,830 real-world PDFs, and PDF Oxide passed 100% of them.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Migrating from pypdf to PDF Oxide

This is a complete guide to switching from pypdf to PDF Oxide. It covers each pypdf API you currently use and its PDF Oxide replacement.

Why switch from pypdf?

There are four reasons to migrate:

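15× faster: PDF Oxide averages 0.8 ms per page, versus 12.1 ms for pypdf. Processing a 500-page document drops from about 6 seconds to 0.4 seconds.
100% reliability: PDF Oxide passes 100% of the PDF test suite. pypdf fails on 1.6% of files (a 98.4% pass rate), so roughly 1 in 60 documents produces an output error.
Built-in Markdown and HTML: pypdf can only extract plain text. PDF Oxide converts to Markdown and HTML while preserving tables and structure, which is essential for LLM/RAG pipelines.
Built-in OCR and rendering: pypdf has no OCR or page rendering. PDF Oxide ships with PaddleOCR for scanned documents and can render pages to images without external dependencies.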
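
The figures in the list above can be re-derived with a few lines of arithmetic. The sketch below measures nothing: it only recomputes the numbers quoted in this guide (0.8 ms and 12.1 ms per page, a 1.6% failure rate), and the variable names are illustrative.

```python
# Per-page means quoted in this guide, in milliseconds.
PDF_OXIDE_MS = 0.8
PYPDF_MS = 12.1
PAGES = 500

# Ratio of the two per-page means.
speedup = PYPDF_MS / PDF_OXIDE_MS

# Total time for a 500-page document, converted to seconds.
pypdf_seconds = PAGES * PYPDF_MS / 1000
oxide_seconds = PAGES * PDF_OXIDE_MS / 1000

# A 1.6% failure rate means one failing file per 1 / 0.016 documents.
docs_per_failure = 1 / 0.016

print(f"speedup: {speedup:.1f}x")
print(f"pypdf: {pypdf_seconds:.2f} s, PDF Oxide: {oxide_seconds:.2f} s")
print(f"one failure per {docs_per_failure:.1f} documents")
```

The results (about 15.1×, 6.05 s versus 0.40 s, and one failure per 62.5 documents) match the rounded values in the list.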

Step 1: Install

pip install pdf_oxide
pip uninstall pypdf  # optional

Step 2: Change imports

# Before
from pypdf import PdfReader, PdfWriter, PdfMerger

# After
from pdf_oxide import PdfDocument, Pdf

Step 3: API mapping table

Task	pypdf	PDF Oxide
Open a PDF	`PdfReader("file.pdf")`	`PdfDocument("file.pdf")`
Page count	`len(reader.pages)`	`doc.page_count()`
Extract text	`reader.pages[0].extract_text()`	`doc.extract_text(0)`
Extract images	`reader.pages[0].images`	`doc.extract_image_bytes(0)`
Form fields	`reader.get_fields()`	`doc.get_form_fields()`
Metadata	`reader.metadata`	`doc.metadata()`
Encrypted PDF	`reader.decrypt("pw")`	`PdfDocument("file.pdf", password="pw")`
Merge PDFs	`PdfMerger()` + `.append()`	`doc.merge_from("doc2.pdf")`
Split pages	`PdfWriter()` + `.add_page()`	`doc.extract_pages([0,1,2,3,4], "out.pdf")`
Convert to Markdown	Not supported	`doc.to_markdown(0)`
Render	Not supported	`doc.render_page(0)`
OCR	Not supported	`doc.extract_text_ocr(0)`
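
For a large codebase, it may help to migrate call sites gradually rather than all at once. The sketch below is a hypothetical adapter, not part of either library: `ReaderShim` and `PageView` are made-up names, and the only assumption about the wrapped object is that it offers `page_count()` and `extract_text(i)` as listed in the table. `FakeDoc` is a stand-in so the example runs without a PDF.

```python
class PageView:
    """One page, exposing the pypdf-style page.extract_text() call shape."""

    def __init__(self, doc, index):
        self._doc = doc
        self._index = index

    def extract_text(self):
        # Delegate to the document-level call from the mapping table.
        return self._doc.extract_text(self._index)


class ReaderShim:
    """Hypothetical adapter so code written against reader.pages keeps working."""

    def __init__(self, doc):
        self._doc = doc

    @property
    def pages(self):
        return [PageView(self._doc, i) for i in range(self._doc.page_count())]


class FakeDoc:
    """Stand-in used only to exercise the shim without a real PDF."""

    def __init__(self, texts):
        self._texts = texts

    def page_count(self):
        return len(self._texts)

    def extract_text(self, index):
        return self._texts[index]


reader = ReaderShim(FakeDoc(["first page", "second page"]))
print(len(reader.pages))
print(reader.pages[1].extract_text())
```

In real use, the shim would presumably wrap `PdfDocument("file.pdf")` instead of `FakeDoc`. Once every call site has been converted, the adapter can be deleted.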

Step 4: Convert common patterns

Text extraction

# pypdf
from pypdf import PdfReader
reader = PdfReader("report.pdf")
for page in reader.pages:
    print(page.extract_text())

# PDF Oxide
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
for i in range(doc.page_count()):
    print(doc.extract_text(i))
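
To check the speed difference on your own documents rather than relying on published numbers, a small timing harness is enough. The sketch below is library-agnostic: `mean_page_ms` is a hypothetical helper that times any callable taking a page index. A fake extractor is used here so the example runs without a PDF.

```python
import statistics
import time


def mean_page_ms(extract_page, page_count):
    """Call extract_page(i) for each page and return the mean time in ms.

    page_count must be at least 1, because statistics.mean rejects empty input.
    """
    timings = []
    for i in range(page_count):
        start = time.perf_counter()
        extract_page(i)
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.mean(timings)


# Stand-in extractor so the sketch runs without a real document.
fake_pages = ["alpha", "beta", "gamma"]
elapsed = mean_page_ms(lambda i: fake_pages[i].upper(), len(fake_pages))
print(f"{elapsed:.4f} ms per page")
```

With the objects from the example above, the calls would presumably be `mean_page_ms(doc.extract_text, doc.page_count())` for PDF Oxide and `mean_page_ms(lambda i: reader.pages[i].extract_text(), len(reader.pages))` for pypdf.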

Merging PDFs

pypdf requires you to create a merger object and append files one at a time. PDF Oxide merges with a single call:

# pypdf (PdfMerger was removed in pypdf 5.0; newer releases merge with PdfWriter.append())
from pypdf import PdfMerger
merger = PdfMerger()
merger.append("doc1.pdf")
merger.append("doc2.pdf")
merger.write("merged.pdf")

# PDF Oxide
from pdf_oxide import PdfDocument
doc = PdfDocument("doc1.pdf")
doc.merge_from("doc2.pdf")
doc.save("merged.pdf")

Splitting/extracting pages

# pypdf: manual page-by-page copy
from pypdf import PdfReader, PdfWriter
reader = PdfReader("report.pdf")
writer = PdfWriter()
for page in reader.pages[0:5]:
    writer.add_page(page)
writer.write("first_5_pages.pdf")

# PDF Oxide: a single call
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
doc.extract_pages([0, 1, 2, 3, 4], "first_5_pages.pdf")
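
When a document has to be split into several fixed-size parts, the index lists can be generated instead of typed out. The helper below is a hypothetical sketch in plain Python: `page_chunks` is not a PDF Oxide function, it only builds the lists of zero-based page indices.

```python
def page_chunks(page_count, chunk_size):
    """Return lists of zero-based page indices, chunk_size pages per list."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        list(range(start, min(start + chunk_size, page_count)))
        for start in range(0, page_count, chunk_size)
    ]


# A 12-page document split into parts of at most 5 pages.
chunks = page_chunks(12, 5)
print(chunks)  # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11]]
```

Each list could then be passed to `doc.extract_pages(indices, f"part_{n}.pdf")`, assuming `extract_pages` accepts any list of page indices in the same way as the `[0, 1, 2, 3, 4]` example above.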

Encrypted PDFs

# pypdf: two steps, open then decrypt
from pypdf import PdfReader
reader = PdfReader("encrypted.pdf")
reader.decrypt("password")
text = reader.pages[0].extract_text()

# PDF Oxide: pass the password to the constructor
from pdf_oxide import PdfDocument
doc = PdfDocument("encrypted.pdf", password="password")
text = doc.extract_text(0)

Markdown conversion (new capability)

pypdf does not support Markdown. With PDF Oxide, the output can be fed straight into an LLM pipeline:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
for i in range(doc.page_count()):
    md = doc.to_markdown(i)
    print(md)
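
For a RAG pipeline, per-page Markdown usually has to be joined and then split into chunks that fit an embedding model's input limit. The sketch below shows one simple way to do that. `collect_markdown` and `chunk_paragraphs` are hypothetical helpers, the 1000-character default is an arbitrary assumption, and `FakeDoc` stands in for a document object that offers `page_count()` and `to_markdown(i)`.

```python
def collect_markdown(doc):
    """Concatenate per-page Markdown, separated by blank lines."""
    pages = [doc.to_markdown(i) for i in range(doc.page_count())]
    return "\n\n".join(p.strip() for p in pages if p.strip())


def chunk_paragraphs(markdown, max_chars=1000):
    """Greedily pack blank-line-separated paragraphs into chunks of at most
    max_chars characters; an oversized paragraph becomes its own chunk."""
    chunks, current = [], ""
    for para in markdown.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


class FakeDoc:
    """Stand-in used only to run the sketch without a real PDF."""

    def __init__(self, pages):
        self._pages = pages

    def page_count(self):
        return len(self._pages)

    def to_markdown(self, index):
        return self._pages[index]


doc = FakeDoc(["# Title\n\nIntro paragraph.", "## Section\n\nBody text."])
markdown = collect_markdown(doc)
chunks = chunk_paragraphs(markdown, max_chars=30)
print(chunks)
```

Splitting on blank lines keeps headings and paragraphs intact, but it can separate a heading from the body that follows it. A production pipeline may prefer to split on heading boundaries instead.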

Image extraction

# pypdf
from pypdf import PdfReader
reader = PdfReader("report.pdf")
for image in reader.pages[0].images:
    with open(image.name, "wb") as f:
        f.write(image.data)

# PDF Oxide
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

Step 5: Test the migration

Run your existing test files through both libraries and compare the output:

from pdf_oxide import PdfDocument

doc = PdfDocument("your-test-file.pdf")

# Check text extraction
text = doc.extract_text(0)
print(text[:500])

# Check page count
print(f"Pages: {doc.page_count()}")

# Check form fields (if applicable)
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name}: {f.value}")
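
Comparing the two libraries' output by eye does not scale past a few files. The sketch below is one possible way to automate the comparison with the standard library's `difflib`. The helper names and the 0.95 threshold are assumptions, and the page texts are hard-coded so the example runs without either library installed.

```python
import difflib


def normalize(text):
    """Collapse whitespace so layout differences do not count as mismatches."""
    return " ".join(text.split())


def similarity(old_text, new_text):
    """Return a 0..1 ratio of how closely two extractions match."""
    matcher = difflib.SequenceMatcher(None, normalize(old_text), normalize(new_text))
    return matcher.ratio()


def compare_pages(old_pages, new_pages, threshold=0.95):
    """Return (page_index, ratio) for pages whose similarity is below threshold."""
    if len(old_pages) != len(new_pages):
        raise ValueError("page counts differ between the two extractions")
    flagged = []
    for i, (old, new) in enumerate(zip(old_pages, new_pages)):
        ratio = similarity(old, new)
        if ratio < threshold:
            flagged.append((i, ratio))
    return flagged


# Hard-coded stand-ins for the text each library returned per page.
old_pages = ["Quarterly report\n2024", "Revenue rose 4%."]
new_pages = ["Quarterly  report 2024", "Completely different text"]
flagged = compare_pages(old_pages, new_pages)
print(flagged)
```

Here the first page differs only in whitespace and passes, while the second page is flagged for manual review. Some differences between extractors are expected (reading order, ligatures, hyphenation), so the threshold is something to tune per corpus rather than a fixed rule.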