What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Migrating from PyMuPDF (fitz) to PDF Oxide

A complete guide to switching from PyMuPDF to PDF Oxide. It covers the PyMuPDF APIs you are likely using today and the PDF Oxide call that replaces each one.

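Why move away from PyMuPDF

There are four clear reasons to migrate.

5.8× faster: PDF Oxide averages 0.8 ms per page versus 4.6 ms for PyMuPDF. The gap compounds at scale: a 1,000-page batch finishes in under 1 second instead of roughly 5.
MIT licensed: PyMuPDF is AGPL-3.0, so you must either release the source of software that incorporates it under AGPL terms or buy a commercial license. PDF Oxide is MIT licensed and can be used anywhere without those restrictions.
100% pass rate: PDF Oxide passes 100% of the PDF test suite. PyMuPDF fails on 0.7% of files (a 99.3% pass rate), so roughly 1 file in 140 produces incorrect output.
Rich built-in features: Markdown conversion, HTML output, OCR, XFA form support, and PDF rendering are all included. Getting comparable features with PyMuPDF requires a separate package (pymupdf4llm) or an external tool (Tesseract).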
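
The batch and failure figures above follow directly from the quoted means. As a quick sanity check, the arithmetic can be sketched as below. The helper names `batch_seconds` and `one_in_n` are made up for this illustration, and the estimate assumes sequential processing with no I/O or startup overhead.

```python
def batch_seconds(pages: int, mean_ms_per_page: float) -> float:
    """Estimated wall time in seconds for a sequential batch."""
    return pages * mean_ms_per_page / 1000.0


def one_in_n(failure_rate_percent: float) -> float:
    """Convert a failure percentage into a '1 in N files' figure."""
    return 100.0 / failure_rate_percent


# Figures quoted in this guide: 0.8 ms vs 4.6 ms per page, 0.7% failures.
oxide = batch_seconds(1000, 0.8)
pymupdf = batch_seconds(1000, 4.6)
print(f"PDF Oxide: {oxide:.1f} s, PyMuPDF: {pymupdf:.1f} s")
print(f"Speedup: {pymupdf / oxide:.2f}x")
print(f"0.7% failures is about 1 in {one_in_n(0.7):.0f} files")
```

The unrounded ratio is 5.75, which is where the rounded 5.8× figure comes from.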

Step 1: Install

pip install pdf_oxide
pip uninstall pymupdf  # optional: remove it once you are ready

Step 2: Replace the imports

# Before
import fitz

# After
from pdf_oxide import PdfDocument

If you were using pymupdf4llm for Markdown conversion, you can drop that dependency entirely. PDF Oxide supports Markdown natively.

Step 3: API mapping

| Task | PyMuPDF | PDF Oxide |
|---|---|---|
| Open a PDF | `fitz.open("file.pdf")` | `PdfDocument("file.pdf")` |
| Page count | `doc.page_count` | `doc.page_count()` |
| Extract text | `doc[0].get_text()` | `doc.extract_text(0)` |
| Character positions | `doc[0].get_text("dict")` | `doc.extract_chars(0)` |
| Extract images | `doc[0].get_images()` + `doc.extract_image(xref)` | `doc.extract_images(0)` |
| Search text | `doc[0].search_for("query")` | `doc.search_page(0, "query")` |
| Form fields | `doc[0].widgets()` or `doc.get_form_fields()` | `doc.get_form_fields()` |
| Encrypted PDFs | `doc.authenticate("pw")` | `PdfDocument("f.pdf", password="pw")` |
| Markdown conversion | `pymupdf4llm.to_markdown("file.pdf")` (separate package) | `doc.to_markdown(0)` (built in) |
| HTML conversion | Not supported | `doc.to_html(0)` |
| Create a PDF | Manual, via `insert_text()` | `Pdf.from_markdown("# Title")` |
| Render to image | `doc[0].get_pixmap()` | `doc.render_page(0)` |
| XFA forms | Not supported | `doc.has_xfa()` |
| OCR | Requires Tesseract | Built-in PaddleOCR |

Step 4: Update common patterns

Text extraction loop

# PyMuPDF
import fitz
doc = fitz.open("report.pdf")
for page in doc:
    text = page.get_text()
    print(text)

# PDF Oxide
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
for i in range(doc.page_count()):
    text = doc.extract_text(i)
    print(text)
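
PyMuPDF documents are iterable page by page, while the PDF Oxide loop above is index based. If much of your code relies on `for page in doc`, a small generator can keep call sites similar. This is only a sketch: `iter_page_text` and the `TextSource` protocol are hypothetical names, and the helper relies solely on the `page_count()` and `extract_text(i)` calls shown in this guide.

```python
from typing import Iterator, Protocol, Tuple


class TextSource(Protocol):
    """Anything exposing the two calls used in the loop above."""

    def page_count(self) -> int: ...

    def extract_text(self, page_index: int) -> str: ...


def iter_page_text(doc: TextSource) -> Iterator[Tuple[int, str]]:
    """Yield (page_index, text) pairs, one page at a time."""
    for i in range(doc.page_count()):
        yield i, doc.extract_text(i)


# Intended usage (assumes pdf_oxide is installed):
#   from pdf_oxide import PdfDocument
#   for i, text in iter_page_text(PdfDocument("report.pdf")):
#       print(i, text[:80])
```

Because the helper is duck typed, it can be unit tested with a stub object and no PDF files.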

Image extraction

PyMuPDF requires a multi-step xref lookup. PDF Oxide does it in a single call.

# PyMuPDF: multi-step xref lookup
import fitz
doc = fitz.open("report.pdf")
page = doc[0]
for img in page.get_images():
    xref = img[0]
    base = doc.extract_image(xref)
    # Name each file by xref so later images do not overwrite earlier ones
    with open(f"img_{xref}.{base['ext']}", "wb") as f:
        f.write(base["image"])

# PDF Oxide: one step
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
for i, img in enumerate(doc.extract_image_bytes(0)):
    with open(f"img_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])
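
When extracting images from many pages, file names need to stay unique across pages as well as within a page. The sketch below wraps the write loop in a helper. `save_images` is a hypothetical name, and it assumes each image is a mapping with `"format"` and `"data"` keys, as in the PDF Oxide example above.

```python
from pathlib import Path
from typing import Iterable, List, Mapping


def save_images(images: Iterable[Mapping], out_dir: str, page_index: int) -> List[Path]:
    """Write each image to out_dir with a name unique per page and position."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for n, img in enumerate(images):
        # Assumed keys: "format" (file extension) and "data" (raw bytes)
        path = out / f"page{page_index}_img{n}.{img['format']}"
        path.write_bytes(img["data"])
        written.append(path)
    return written


# Intended usage (assumes pdf_oxide is installed):
#   doc = PdfDocument("report.pdf")
#   for i in range(doc.page_count()):
#       save_images(doc.extract_image_bytes(i), "images", i)
```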

Encrypted PDFs

PyMuPDF uses a two-step "open, then authenticate" pattern. PDF Oxide supports both passing password= to the constructor and calling doc.authenticate() after opening.

# PyMuPDF
import fitz
doc = fitz.open("encrypted.pdf")
doc.authenticate("password")
text = doc[0].get_text()

# PDF Oxide: password= in one step
from pdf_oxide import PdfDocument
doc = PdfDocument("encrypted.pdf", password="password")
text = doc.extract_text(0)

Markdown conversion

PyMuPDF requires the separate pymupdf4llm package. PDF Oxide has Markdown conversion built in.

# PyMuPDF: separate package required
import pymupdf4llm
md = pymupdf4llm.to_markdown("report.pdf")

# PDF Oxide: built in
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
md = doc.to_markdown(0)
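
Note that the pymupdf4llm call takes a file path, while the PDF Oxide call above takes an index. If that index selects a single page, as it does for `extract_text(0)`, a whole-document conversion means joining the per-page output. A minimal sketch under that assumption follows. `document_to_markdown` is a hypothetical helper, not a PDF Oxide API, and it uses only `page_count()` and `to_markdown(i)`.

```python
def document_to_markdown(doc, separator: str = "\n\n") -> str:
    """Join per-page Markdown into one string, skipping empty pages."""
    pages = (doc.to_markdown(i) for i in range(doc.page_count()))
    return separator.join(p.strip() for p in pages if p.strip())


# Intended usage (assumes pdf_oxide is installed):
#   from pdf_oxide import PdfDocument
#   md = document_to_markdown(PdfDocument("report.pdf"))
```

Joining with a blank line keeps page boundaries from merging two paragraphs. Headings or tables that span a page break may still need manual review.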

Page rendering

# PyMuPDF
import fitz
doc = fitz.open("report.pdf")
pix = doc[0].get_pixmap()
pix.save("page.png")

# PDF Oxide
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
png_bytes = doc.render_page(0, dpi=150)
with open("page.png", "wb") as f:
    f.write(png_bytes)
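
To render every page rather than only the first, the same call can be looped. The sketch below assumes `render_page(i, dpi=...)` returns PNG bytes, as the example above suggests. `render_all_pages` is a hypothetical helper name.

```python
from pathlib import Path
from typing import List


def render_all_pages(doc, out_dir: str, dpi: int = 150) -> List[Path]:
    """Render each page to out_dir/page_NNNN.png and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    for i in range(doc.page_count()):
        # File names are 1-based and zero padded so they sort naturally
        path = out / f"page_{i + 1:04d}.png"
        path.write_bytes(doc.render_page(i, dpi=dpi))
        paths.append(path)
    return paths


# Intended usage (assumes pdf_oxide is installed):
#   render_all_pages(PdfDocument("report.pdf"), "pages", dpi=150)
```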

Step 5: Test the migration

Run your existing test files through both libraries and compare the output.

from pdf_oxide import PdfDocument

doc = PdfDocument("your-test-file.pdf")

# Check text extraction
text = doc.extract_text(0)
print(text[:500])

# Check the page count
print(f"Pages: {doc.page_count()}")

# Check form fields (if applicable)
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name}: {f.value}")
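
The check above exercises PDF Oxide only. To compare it against PyMuPDF, you need a measure that tolerates harmless whitespace differences between the two extractors. One possible approach, sketched here with the standard library's `difflib`, is a normalized similarity ratio. The helper names `normalize` and `similarity` are illustrative, and no particular pass threshold is implied.

```python
import difflib
import re


def normalize(text: str) -> str:
    """Collapse all runs of whitespace so layout differences are ignored."""
    return re.sub(r"\s+", " ", text).strip()


def similarity(a: str, b: str) -> float:
    """Return a 0.0 to 1.0 similarity ratio between two extracted texts."""
    matcher = difflib.SequenceMatcher(None, normalize(a), normalize(b), autojunk=False)
    return matcher.ratio()


# Intended usage (assumes both libraries are installed):
#   old = fitz.open("your-test-file.pdf")[0].get_text()
#   new = PdfDocument("your-test-file.pdf").extract_text(0)
#   print(f"Page 0 similarity: {similarity(old, new):.3f}")
```

Pages that score well below the rest of the batch are the ones worth inspecting by hand, since reading order and ligature handling can differ between libraries.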