What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation is needed: run `pip install pdf_oxide` and call `extract_text_ocr()`. Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).
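For RAG pipelines, Markdown output is usually split into heading-delimited chunks before indexing. The following is a minimal sketch of that step, working on any Markdown string. `split_by_headings` is a hypothetical helper written for illustration and is not part of PDF Oxide. It assumes ATX-style headings (`#`, `##`, ...) and does not special-case `#` lines inside code fences.

```python
import re


def split_by_headings(markdown: str) -> list[tuple[str, str]]:
    """Split Markdown into (heading, body) chunks at ATX headings.

    Text that appears before the first heading is returned under an
    empty heading string.
    """
    chunks: list[tuple[str, str]] = []
    heading = ""
    body: list[str] = []
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*\S)\s*$", line)
        if match:
            # Close the previous chunk if it has a heading or any text.
            if heading or any(s.strip() for s in body):
                chunks.append((heading, "\n".join(body).strip()))
            heading, body = match.group(2), []
        else:
            body.append(line)
    # Flush the final chunk.
    if heading or any(s.strip() for s in body):
        chunks.append((heading, "\n".join(body).strip()))
    return chunks
```

Each chunk keeps its heading next to its text, which gives a retriever some context for the passage it returns.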

PDF Oxide vs pypdf

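PDF Oxide is 15× faster than pypdf, has a higher pass rate, and includes rendering and Markdown/HTML conversion out of the box. If you need more than basic PDF manipulation, PDF Oxide handles in a single library what pypdf needs several packages to do.

Why consider PDF Oxide instead of pypdf

Speed. pypdf is a pure-Python implementation. PDF Oxide uses a compiled Rust core, bound to Python through PyO3, that runs directly inside the Python process. Mean text extraction time is 0.8ms versus 12.1ms, a 15× difference.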

Reliability. PDF Oxide passes 100% of the 3,830 test PDFs. pypdf's pass rate is 98.4%, with 61 failures on valid PDFs.
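A pass rate of this kind can be measured with a small harness. The sketch below is a hypothetical illustration, not the benchmark's actual code. `pass_rate` and the `extract` callable are names invented here, and a file is assumed to "pass" when extraction returns without raising an exception.

```python
from typing import Callable, Iterable


def pass_rate(
    paths: Iterable[str], extract: Callable[[str], object]
) -> tuple[int, int, float]:
    """Return (passed, total, percent) for an extractor run over a corpus.

    Assumption: a file passes when extract(path) returns without raising.
    """
    passed = 0
    total = 0
    for path in paths:
        total += 1
        try:
            extract(path)
        except Exception:
            # Any exception counts as a failure for that file.
            continue
        passed += 1
    percent = 100.0 * passed / total if total else 0.0
    return passed, total, percent
```

With the counts reported in the benchmark details, 3,762 passes out of 3,823 valid PDFs works out to 98.4%, which leaves 61 failures.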

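Features. pypdf is a PDF manipulation library (merge, split, rotate, encrypt) with basic text extraction. Rendering, Markdown/HTML output, and OCR require additional packages. PDF Oxide provides all of these in a single install.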

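Quick comparison

Feature	PDF Oxide	pypdf
Mean extraction time	0.8ms	12.1ms
Pass rate (3,830 PDFs)	100%	98.4%
License	MIT	BSD-3-Clause
Language	Rust + PyO3	Pure Python
Text extraction	Yes	Yes
Character positions	Yes	Partial
Image extraction	Yes	Yes
Markdown output	Yes	No
HTML output	Yes	No
PDF creation	Yes (Markdown/HTML/images)	Limited (merge only)
Form fields	Read + write	Read + write
Encryption	Read + write	Read + write
Rendering	Yes	No
OCR	Built-in	No
Search	Regex + spatial search	No
Install size	~5 MB	~1 MB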

Side-by-side code

Text extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)

pypdf:

from pypdf import PdfReader

reader = PdfReader("report.pdf")
text = reader.pages[0].extract_text()
print(text)

Extract all pages

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("book.pdf")
for i in range(doc.page_count()):
    text = doc.extract_text(i)
    print(f"--- Page {i + 1} ---")
    print(text)

pypdf:

from pypdf import PdfReader

reader = PdfReader("book.pdf")
for page in reader.pages:
    text = page.extract_text()
    print(text)

Image extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

pypdf:

from pypdf import PdfReader

reader = PdfReader("report.pdf")
page = reader.pages[0]
for i, image in enumerate(page.images):
    with open(f"image_{i}.{image.name.split('.')[-1]}", "wb") as f:
        f.write(image.data)

Encrypted PDFs

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("encrypted.pdf", password="secret")
text = doc.extract_text(0)

pypdf:

from pypdf import PdfReader

reader = PdfReader("encrypted.pdf")
reader.decrypt("secret")
text = reader.pages[0].extract_text()

Markdown conversion

PDF Oxide (built-in):

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)

pypdf:

# pypdf has no Markdown conversion.
# You would need a separate tool chain.

Benchmark details

Metric	PDF Oxide	pypdf
Mean extraction time	0.8ms	12.1ms
p99 extraction time	9ms	97ms
Pass rate (valid PDFs)	100% (3,823/3,823)	98.4% (3,762/3,823)

In pypdf's pure-Python implementation, every operation runs in the interpreter. PDF Oxide's Rust core handles parsing, font decoding, and text assembly natively, and only the final result crosses the Python boundary.
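Mean and p99 figures like those in the table can be reproduced on your own files with a simple timing loop. This is a sketch under stated assumptions, not the published benchmark harness. `time_calls` and `summarize` are hypothetical names, each file is timed once with `time.perf_counter`, and p99 uses the nearest-rank definition, which may differ from the one the benchmark used.

```python
import statistics
import time
from typing import Callable, Sequence


def time_calls(fn: Callable[[str], object], paths: Sequence[str]) -> list[float]:
    """Call fn once per path and return the latencies in milliseconds."""
    latencies_ms = []
    for path in paths:
        start = time.perf_counter()
        fn(path)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def summarize(latencies_ms: Sequence[float]) -> dict[str, float]:
    """Return the mean and the nearest-rank 99th percentile."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    n = len(ordered)
    # Nearest rank: ceil(0.99 * n), computed with integers to avoid
    # floating-point rounding at the boundary.
    rank = (99 * n + 99) // 100
    return {"mean_ms": statistics.fmean(ordered), "p99_ms": ordered[rank - 1]}
```

To compare the two libraries, pass a one-argument wrapper around each library's extraction call as `fn` and run both over the same list of files.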

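For corpus details, see the full benchmark methodology.

Feature gaps

pypdf excels at PDF manipulation: merging, splitting, rotating, and encrypting. It lacks, or only partially covers, the following features.

Feature	PDF Oxide	pypdf
Markdown conversion	`doc.to_markdown(0)`	Not available
HTML conversion	`doc.to_html(0)`	Not available
PDF creation from content	`Pdf.from_markdown()`, `Pdf.from_html()`	Not available
Render to image	Yes	Not available
OCR for scanned PDFs	Built-in PaddleOCR	Not available
Text search	`doc.search("query")`	Not available
Per-character bounding boxes	`doc.extract_chars(0)`	Partial
PDF/A validation	Yes	Not available

If your workflow is purely merge/split/rotate, pypdf's lightweight pure-Python approach is a reasonable choice. If text extraction quality, PDF creation, or conversion matters, PDF Oxide is the more complete option.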

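When to stick with pypdf

You need a pure-Python dependency with no compiled extensions
Your use case is strictly limited to merging, splitting, rotating, or encrypting, with no text extraction
You need pypdf-specific PDF manipulation methods for a legacy integration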
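Where some environments cannot install compiled extensions, one option is to pick the backend at runtime. The sketch below is illustrative only. `pick_backend` and `extract_first_page` are hypothetical helpers, and the PDF Oxide calls mirror the usage shown in the examples on this page rather than a verified API reference.

```python
import importlib.util
from typing import Optional


def pick_backend(
    preferred: tuple[str, ...] = ("pdf_oxide", "pypdf"),
) -> Optional[str]:
    """Return the first importable module name in preferred, or None."""
    for name in preferred:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


def extract_first_page(path: str) -> str:
    """Extract text from the first page with whichever backend is installed."""
    backend = pick_backend()
    if backend == "pdf_oxide":
        # Call pattern taken from the PDF Oxide examples on this page.
        from pdf_oxide import PdfDocument
        return PdfDocument(path).extract_text(0)
    if backend == "pypdf":
        from pypdf import PdfReader
        return PdfReader(path).pages[0].extract_text()
    raise RuntimeError("neither pdf_oxide nor pypdf is installed")
```

The two libraries will not necessarily return identical text for the same page, so a fallback like this suits best-effort extraction rather than byte-for-byte reproducibility.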
