What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Clojure API Reference

PDF Oxide provides idiomatic Clojure bindings as a thin wrapper over the fyi.oxide:pdf-oxide Java bindings, which own the single JNI native bridge (the pdf_oxide_jni crate). The wrapper adds no native code of its own: it calls the Java classes directly through interop and returns Clojure-friendly values (a java.util.List becomes a vector, and a java.util.Optional becomes its value or nil). The handle types (Pdf, PdfDocument, DocumentEditor, AutoExtractor) are AutoCloseable, so use with-open for deterministic cleanup.

;; deps.edn
{:deps {fyi.oxide/pdf-oxide-clojure {:mvn/version "0.3.69"}}}

;; Leiningen
[fyi.oxide/pdf-oxide-clojure "0.3.69"]

The JNI native library (libpdf_oxide_jni) is not bundled. Either place it on java.library.path so that System.loadLibrary("pdf_oxide_jni") can load it, or point the Java NativeLoader at it with -Dfyi.oxide.pdf.lib.path=<path>.
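One way to pass that system property during development is a deps.edn alias. The fragment below is only a sketch: the alias name :native and the path /opt/pdf-oxide/native are hypothetical placeholders, and whether the property should name the containing directory or the library file itself is not stated here.

```clojure
;; deps.edn (sketch; the alias name and the path are placeholders)
{:deps    {fyi.oxide/pdf-oxide-clojure {:mvn/version "0.3.69"}}
 :aliases {:native
           {:jvm-opts ["-Dfyi.oxide.pdf.lib.path=/opt/pdf-oxide/native"]}}}
```

Starting a REPL with clj -A:native would then launch the JVM with the property set.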

All functions live in the pdf-oxide.core namespace:

(require '[pdf-oxide.core :as pdf])

For other languages, see the Java API Reference, Python API Reference, Rust API Reference, and Types & Enums.

Pdf: Creation

Functions that build a new in-memory Pdf from source content and serialize it to a byte array. The returned Pdf is AutoCloseable.

Create

(from-markdown ^Pdf [^String markdown])

Creates a Pdf from a Markdown string.

(from-html ^Pdf [^String html])

Creates a Pdf from an HTML string.

Save

(save ^bytes [^Pdf pdf])

Serializes the built Pdf to a byte array (the raw PDF bytes).

(with-open [p (pdf/from-markdown "# Hello\n\nbody\n")]
  (pdf/save p))                 ; => byte[]

PdfDocument: Open, Extract, and Render

The primary read handle for an existing PDF. Open it from a byte array or a filesystem path, then extract text, convert to Markdown/HTML, render pages, search, and read metadata and form fields. AutoCloseable.

Open

(open ^PdfDocument [source])
(open ^PdfDocument [source ^String password])

Opens a document from a byte array or a filesystem path string. The two-argument form takes a password for encrypted PDFs.

(authenticate [^PdfDocument doc ^String password])

Authenticates an encrypted document after it has been opened; returns a boolean.
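The two ways of unlocking an encrypted file can be sketched as follows. The helper name open-unlocked, the file name secret.pdf, and the password are hypothetical. The sketch also assumes that the one-argument open succeeds on an encrypted file and that authenticate returns false for a wrong password; neither behavior is spelled out in this reference.

```clojure
(require '[pdf-oxide.core :as pdf])

(defn open-unlocked
  "Opens `source` (byte array or path string) and returns the document only
  if `password` authenticates it. Otherwise closes the handle and returns nil.
  Hypothetical helper; assumes one-argument `open` works on encrypted files."
  [source password]
  (let [doc (pdf/open source)]
    (if (pdf/authenticate doc password)
      doc
      (do (pdf/close doc) nil))))

(comment
  ;; Password supplied up front ("secret.pdf" and "hunter2" are placeholders).
  (with-open [d (pdf/open "secret.pdf" "hunter2")]
    (pdf/page-count d))

  ;; Password supplied after opening.
  (when-let [d (open-unlocked "secret.pdf" "hunter2")]
    (try
      (pdf/page-count d)
      (finally (pdf/close d)))))
```

Returning nil on failure keeps the caller from holding a handle it cannot read.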

Document Queries

(page-count [^PdfDocument doc])

Returns the number of pages in the document.

(producer [^PdfDocument doc])

Returns the /Producer metadata string, or nil if absent.

(creator [^PdfDocument doc])

Returns the /Creator metadata string, or nil if absent.

Text Extraction

(extract-text [^PdfDocument doc page])

Extracts plain text from a single page, addressed by zero-based index.

(extract-structured [^PdfDocument doc page])

Extracts structured text (spans/blocks with position information) from a single page.
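Because extract-text works one page at a time, collecting the text of a whole document means walking the page indices. A minimal sketch, assuming extract-text returns a string; page-texts is a hypothetical helper name, not part of the library.

```clojure
(require '[pdf-oxide.core :as pdf])

(defn page-texts
  "Returns a vector holding the plain text of every page of `doc`, in order."
  [doc]
  ;; mapv is eager, so all pages are read before the handle is closed.
  (mapv #(pdf/extract-text doc %) (range (pdf/page-count doc))))

;; Build a small PDF in memory so the sketch needs no input file.
(def texts
  (with-open [p (pdf/from-markdown "# Hello\n\nbody\n")
              d (pdf/open (pdf/save p))]
    (page-texts d)))
```

Using the eager mapv rather than the lazy map matters here: a lazy sequence realized after with-open exits would touch a closed document.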

Conversion

(to-markdown [^PdfDocument doc])
(to-markdown [^PdfDocument doc page])

Converts the whole document or a single page to Markdown.

(to-html [^PdfDocument doc])
(to-html [^PdfDocument doc page])

Converts the whole document or a single page to HTML.

Rendering

(render ^bytes [^PdfDocument doc page])
(render ^bytes [^PdfDocument doc page dpi])

Renders a page to PNG image bytes, optionally at a given DPI.
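Since render returns a byte array, it has to be written to disk as raw bytes. clojure.core/spit is not suitable because it writes through a character Writer and calls str on its content. The sketch below uses clojure.java.io/copy instead; write-bytes! is a hypothetical helper name and page0.png is a placeholder file name.

```clojure
(require '[clojure.java.io :as io]
         '[pdf-oxide.core :as pdf])

(defn write-bytes!
  "Writes the byte array `data` to `path` unchanged and returns the File."
  [^bytes data path]
  (let [f (io/file path)]
    (io/copy data f)   ; byte[] -> File copies the raw bytes
    f))

;; Render page 0 of an in-memory PDF at 150 DPI and save it.
(with-open [p (pdf/from-markdown "# Hello\n\nbody\n")
            d (pdf/open (pdf/save p))]
  (write-bytes! (pdf/render d 0 150) "page0.png"))
```

The same helper works for the byte arrays returned by save and editor-save.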

Search

(search [^PdfDocument doc ^String query])

Searches the document for text; returns a vector of SearchMatch results.

Forms

(form-fields [^PdfDocument doc])

Returns a vector of the document's AcroForm form fields.

Page Access

(page ^PdfPage [^PdfDocument doc idx])

Gets a PdfPage handle for the page at a zero-based index.

(pages [^PdfDocument doc])

Returns a vector of PdfPage handles for every page in the document.

PdfPage: Page Element Extraction

A page handle obtained from (pdf/page doc idx) or (pdf/pages doc). Each extraction function converts its Java List result to a Clojure vector.

Elements

(words [^PdfPage page])

Returns a vector of the page's word elements.

(lines [^PdfPage page])

Returns a vector of the page's line elements.

(chars [^PdfPage page])

Returns a vector of the page's per-character glyphs. (pdf/chars deliberately shadows clojure.core/chars.)

(tables [^PdfPage page])

Returns a vector of the tables detected on the page.

(images [^PdfPage page])

Returns a vector of the page's image elements.

(annotations [^PdfPage page])

Returns a vector of the page's annotations.

Page Text

(page-text [^PdfPage page])
(page-text [^PdfPage page region])

Returns the page's plain text, optionally restricted to a BBox region.

(import '[fyi.oxide.pdf.geometry BBox])

(with-open [p (pdf/from-markdown "# Hello\n\nbody\n")
            d (pdf/open (pdf/save p))]
  (let [pg (pdf/page d 0)]
    [(mapv #(.text %) (pdf/words pg))                          ; word strings
     (pdf/page-text pg (BBox. 0.0 0.0 1000.0 1000.0))]))       ; region text
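Because pages returns a plain vector of handles, per-page summaries need only ordinary sequence functions. A small sketch that counts the word elements on each page; words-per-page is a hypothetical helper name.

```clojure
(require '[pdf-oxide.core :as pdf])

(defn words-per-page
  "Returns a vector with the number of word elements on each page of `doc`."
  [doc]
  (mapv #(count (pdf/words %)) (pdf/pages doc)))

;; Count words per page for a small in-memory PDF.
(def counts
  (with-open [p (pdf/from-markdown "# Hello\n\nbody\n")
              d (pdf/open (pdf/save p))]
    (words-per-page d)))
```

The same shape works with lines, chars, tables, images, or annotations in place of words.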

DocumentEditor: Editing and Redaction

A mutable editing handle, opened independently of PdfDocument. It supports metadata scrubbing and destructive redaction, and serializes the result to bytes. AutoCloseable.

(editor ^DocumentEditor [source])

Opens a DocumentEditor from a byte array or a filesystem path string.

(scrub-metadata [^DocumentEditor ed])

Removes document metadata (Info dictionary / XMP) in place.

(add-redaction [^DocumentEditor ed page region])

Marks a rectangular BBox region on a zero-based page for redaction.

(apply-redactions [^DocumentEditor ed])

Destructively applies all pending redactions, removing the underlying content.

(editor-save ^bytes [^DocumentEditor ed])

Serializes the edited document to a byte array.

(with-open [ed (pdf/editor pdf-bytes)]
  (pdf/scrub-metadata ed)
  (pdf/add-redaction ed 0 (BBox. 10.0 10.0 50.0 20.0))
  (pdf/apply-redactions ed)
  (pdf/editor-save ed))

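AutoExtractor: Automatic Extraction

A convenience extractor that automatically picks an extraction strategy for a PdfDocument.

(auto-extractor ^AutoExtractor [^PdfDocument doc])

Creates an AutoExtractor for the given document.

(auto-text [^AutoExtractor ax])

Extracts text from the whole document using the automatically selected strategy.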

(with-open [d (pdf/open pdf-bytes)]
  (pdf/auto-text (pdf/auto-extractor d)))
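AutoExtractor is listed among the AutoCloseable handle types, so it can be bound in the same with-open as its document. A sketch, assuming that closing the extractor leaves the document handle usable until with-open closes it in turn; auto-text-of is a hypothetical helper name.

```clojure
(require '[pdf-oxide.core :as pdf])

(defn auto-text-of
  "Opens `source` (byte array or path string), runs the auto extractor, and
  returns its result. with-open closes the extractor first, then the document."
  [source]
  (with-open [d  (pdf/open source)
              ax (pdf/auto-extractor d)]
    (pdf/auto-text ax)))

;; Run it against a small in-memory PDF.
(def text
  (with-open [p (pdf/from-markdown "# Hello\n\nbody\n")]
    (auto-text-of (pdf/save p))))
```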

Lifecycle

Handle types are AutoCloseable; prefer with-open for deterministic cleanup. The following functions are escape hatches for cases where with-open is not used.

(close [resource])

Closes any handle (Pdf, PdfDocument, PdfPage, DocumentEditor, AutoExtractor).

(open? [resource])

Returns whether the handle is still open.

(let [d (pdf/open pdf-bytes)]
  (pdf/open? d)        ; => true
  (pdf/close d)
  (pdf/open? d))       ; => false
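When a handle is managed by hand, wrapping the work in try/finally keeps the close from being skipped if an exception is thrown. A minimal sketch of that pattern; call-with-doc is a hypothetical helper name, not part of the library.

```clojure
(require '[pdf-oxide.core :as pdf])

(defn call-with-doc
  "Opens `source`, calls `(f doc)`, and closes the handle even when `f` throws."
  [source f]
  (let [doc (pdf/open source)]
    (try
      (f doc)
      (finally
        ;; Skip the close if `f` already closed the handle itself.
        (when (pdf/open? doc)
          (pdf/close doc))))))

;; Usage: count the pages of a small in-memory PDF.
(def n
  (with-open [p (pdf/from-markdown "# Hello\n\nbody\n")]
    (call-with-doc (pdf/save p) pdf/page-count)))
```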

Complete Example

(require '[clojure.java.io :as io]
         '[pdf-oxide.core :as pdf])
(import '[fyi.oxide.pdf.geometry BBox])

;; --- Creation ---
(def pdf-bytes
  (with-open [p (pdf/from-markdown "# Report\n\nGenerated by PDF Oxide.\n")]
    (pdf/save p)))

;; --- Extraction ---
(with-open [d (pdf/open pdf-bytes)]
  (println "Pages:" (pdf/page-count d))
  (println (pdf/extract-text d 0))
  (println (pdf/to-markdown d))
  (println (pdf/to-html d 0))

  ;; Page elements (List -> vector)
  (let [pg (pdf/page d 0)]
    (println "Words:" (count (pdf/words pg)))
    (doseq [w (pdf/words pg)] (print (.text w) "")))

  ;; Search
  (doseq [m (pdf/search d "Report")]
    (println "Match:" (.text m)))

  ;; Metadata (Optional -> nil)
  (println "Producer:" (or (pdf/producer d) "(none)"))

  ;; Render: copy the raw PNG bytes (spit would write the array's string form)
  (io/copy (pdf/render d 0 150) (io/file "page0.png")))

;; --- Editing + Redaction ---
(with-open [ed (pdf/editor pdf-bytes)]
  (pdf/scrub-metadata ed)
  (pdf/add-redaction ed 0 (BBox. 10.0 10.0 50.0 20.0))
  (pdf/apply-redactions ed)
  (io/copy (pdf/editor-save ed) (io/file "redacted.pdf")))

Other Language Bindings

PDF Oxide provides native bindings for every major ecosystem: Rust, Python, Node.js, WASM, C#, Golang, Java, PHP, Ruby, C++, Swift, Kotlin, Dart, R, Julia, Zig, Scala, Objective-C, Elixir

Next Steps

Types & Enums: all shared types and enums
Page API Reference: consistent page-level iteration across bindings
Getting Started with Clojure: tutorial