What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (Java)

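PDF Oxide is the fastest Java PDF library for text extraction: 0.8ms mean, with a 100% pass rate on 3,830 real-world PDFs. The same Rust core is also available for Python, Go, JS, and C#. The Java binding is a thin JNI layer that requires JDK 11 LTS at minimum, and the same JAR interoperates with Kotlin at no extra cost.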

Installation

The JAR bundles native libraries for Linux (x86_64/aarch64), macOS (x86_64/aarch64), and Windows (x86_64). No compiler or extra setup is needed; the matching library is extracted automatically on the first call.
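To make the automatic selection concrete, here is a minimal sketch of how a JAR with bundled native libraries might map the JVM's os.name and os.arch properties to one of the platforms listed above. NativePlatform and the classifier strings are hypothetical names chosen for illustration; this is not PDF Oxide's actual loader.

```java
import java.util.Locale;

/** Hypothetical helper (not part of PDF Oxide): picks a bundled native-library classifier. */
final class NativePlatform {
    private NativePlatform() {}

    /** Normalizes os.name / os.arch style values into a classifier such as "linux-x86_64". */
    static String classifier(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);

        String osPart;
        if (os.contains("linux")) {
            osPart = "linux";
        } else if (os.contains("mac") || os.contains("darwin")) {
            // Checked before "win" because "darwin" also contains "win".
            osPart = "macos";
        } else if (os.contains("win")) {
            osPart = "windows";
        } else {
            throw new UnsupportedOperationException("Unsupported OS: " + osName);
        }

        String archPart;
        if (arch.equals("amd64") || arch.equals("x86_64")) {
            archPart = "x86_64";
        } else if (arch.equals("aarch64") || arch.equals("arm64")) {
            archPart = "aarch64";
        } else {
            throw new UnsupportedOperationException("Unsupported architecture: " + osArch);
        }

        // The platform list above names Windows for x86_64 only.
        if (osPart.equals("windows") && !archPart.equals("x86_64")) {
            throw new UnsupportedOperationException("Unsupported platform: windows-" + archPart);
        }
        return osPart + "-" + archPart;
    }

    /** Classifier for the JVM this code is running on. */
    static String current() {
        return classifier(System.getProperty("os.name"), System.getProperty("os.arch"));
    }
}
```

Calling NativePlatform.current() early in your own startup code is one way to fail fast on a platform outside the supported list, before the first PDF is opened.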

Maven

<dependency>
  <groupId>fyi.oxide</groupId>
  <artifactId>pdf-oxide</artifactId>
  <version>0.3.69</version>
</dependency>

Gradle

// Kotlin DSL
implementation("fyi.oxide:pdf-oxide:0.3.69")

// Groovy
implementation 'fyi.oxide:pdf-oxide:0.3.69'

Quick Start

Open a PDF and extract its text. PdfDocument is AutoCloseable, so try-with-resources releases the native handle deterministically.

import fyi.oxide.pdf.PdfDocument;
import java.nio.file.Path;

try (PdfDocument doc = PdfDocument.open(Path.of("report.pdf"))) {
    System.out.println("Pages: " + doc.pageCount());
    System.out.println(doc.extractText(0)); // zero-based page index
}
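The deterministic release mentioned above is a property of try-with-resources itself, and it can be seen without the library. The sketch below uses a stand-in NativeHandle class (hypothetical, not a PDF Oxide type) that records its lifecycle in a log, so the order of open, use, and close is visible even when the body throws.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: shows when close() runs relative to the try body and the catch block. */
final class CloseOrderDemo {
    private CloseOrderDemo() {}

    /** Stand-in for an object that owns a native handle. */
    static final class NativeHandle implements AutoCloseable {
        private final List<String> log;
        private boolean closed;

        NativeHandle(List<String> log) {
            this.log = log;
            log.add("open");
        }

        String read() {
            if (closed) {
                throw new IllegalStateException("handle already closed");
            }
            log.add("read");
            return "text";
        }

        @Override
        public void close() {
            if (!closed) { // idempotent: a second close() is a no-op
                closed = true;
                log.add("close");
            }
        }
    }

    /** Runs one open/read cycle and returns the recorded lifecycle events. */
    static List<String> run(boolean failInBody) {
        List<String> log = new ArrayList<>();
        try (NativeHandle handle = new NativeHandle(log)) {
            handle.read();
            if (failInBody) {
                throw new IllegalStateException("simulated failure");
            }
        } catch (IllegalStateException e) {
            // The resource has already been closed by the time this block runs.
            log.add("caught");
        }
        return log;
    }
}
```

Here run(false) yields [open, read, close], and run(true) yields [open, read, close, caught]: the handle is released before the exception reaches the catch block.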

You can open a PDF from a path string, a Path, a raw byte[], or an InputStream.

import fyi.oxide.pdf.PdfDocument;

byte[] pdfBytes = downloadFromS3(); // placeholder for your own code that returns the PDF bytes
try (PdfDocument doc = PdfDocument.open(pdfBytes)) {
    String text = doc.extractText(0);
}

Text Extraction

Iterate over every page using zero-based indexes.

import fyi.oxide.pdf.PdfDocument;
import java.nio.file.Path;

try (PdfDocument doc = PdfDocument.open(Path.of("book.pdf"))) {
    for (int i = 0; i < doc.pageCount(); i++) {
        System.out.println("--- Page " + (i + 1) + " ---");
        System.out.println(doc.extractText(i));
    }
}

Word-Level Extraction

PdfPage exposes structured geometry. words() returns a list of TextWord, each carrying its text, bounding box, and OCR confidence.

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.PdfPage;
import fyi.oxide.pdf.text.TextWord;
import fyi.oxide.pdf.geometry.BBox;
import java.nio.file.Path;

try (PdfDocument doc = PdfDocument.open(Path.of("paper.pdf"))) {
    PdfPage page = doc.page(0);
    for (TextWord word : page.words()) {
        BBox b = word.bbox();
        System.out.printf("'%s' at (%.1f, %.1f) conf=%.2f%n",
            word.text(), b.x0(), b.y0(), word.confidence());
    }
}

Alongside lines(), chars(), tables(), images(), and annotations(), PdfPage also provides width(), height(), and text(BBox region) for extracting from a sub-region.
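As a rough picture of what region-based extraction involves, the sketch below keeps only the words whose boxes fall entirely inside a rectangle. Box and Word are stand-in types written for this example, and the x1/y1 corner fields are assumptions; the library's own text(BBox region) may use different containment rules, such as partial overlap.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: filters words by bounding-box containment. */
final class RegionFilterDemo {
    private RegionFilterDemo() {}

    /** Stand-in axis-aligned box with corners (x0, y0) and (x1, y1), where x0 <= x1 and y0 <= y1. */
    static final class Box {
        final double x0, y0, x1, y1;

        Box(double x0, double y0, double x1, double y1) {
            this.x0 = x0;
            this.y0 = y0;
            this.x1 = x1;
            this.y1 = y1;
        }

        /** True when the other box lies fully inside this one (edges may touch). */
        boolean contains(Box other) {
            return other.x0 >= x0 && other.y0 >= y0
                && other.x1 <= x1 && other.y1 <= y1;
        }
    }

    /** Stand-in for a positioned word. */
    static final class Word {
        final String text;
        final Box box;

        Word(String text, Box box) {
            this.text = text;
            this.box = box;
        }
    }

    /** Returns the text of every word fully inside the region, in input order. */
    static List<String> textInside(List<Word> words, Box region) {
        List<String> result = new ArrayList<>();
        for (Word word : words) {
            if (region.contains(word.box)) {
                result.add(word.text);
            }
        }
        return result;
    }
}
```

Full containment is the strictest choice; a word that straddles the region's edge is dropped rather than cut, which is usually what you want when cropping a header, footer, or table cell.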

Markdown Conversion

Use the MarkdownConverter helper (or the convenience method doc.toMarkdown(...)) to convert a single page or the whole document to Markdown.

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.MarkdownConverter;
import java.nio.file.Files;
import java.nio.file.Path;

try (PdfDocument doc = PdfDocument.open(Path.of("report.pdf"))) {
    String md = MarkdownConverter.toMarkdown(doc); // whole document
    Files.writeString(Path.of("report.md"), md);

    String pageMd = doc.toMarkdown(0); // single page
    String pageHtml = doc.toHtml(0);   // or HTML
}

Search

search() scans the whole document and returns a list of SearchMatch, each carrying the page index, bounding box, and matched text.

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.search.SearchMatch;
import fyi.oxide.pdf.geometry.BBox;
import java.nio.file.Path;

try (PdfDocument doc = PdfDocument.open(Path.of("manual.pdf"))) {
    for (SearchMatch m : doc.search("configuration")) {
        BBox b = m.bbox();
        // pageIndex() is assumed to be zero-based like extractText(); add 1 for display.
        System.out.printf("Page %d: '%s' at (%.0f, %.0f)%n",
            m.pageIndex() + 1, m.text(), b.x0(), b.y0());
    }
}

PDF Creation

The Pdf type creates PDFs from Markdown, HTML, and images. It is AutoCloseable but has no Cleaner backstop, so always close it explicitly or use try-with-resources.

import fyi.oxide.pdf.Pdf;
import java.nio.file.Path;

try (Pdf pdf = Pdf.fromMarkdown("# Hello\n\nThis is a PDF.")) {
    pdf.saveTo(Path.of("out.pdf"));
}

try (Pdf pdf = Pdf.fromHtml("<h1>Invoice</h1><p>Amount: $42</p>")) {
    byte[] bytes = pdf.save(); // serialize to memory instead of disk
}
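For readers unfamiliar with the term, a Cleaner backstop is a safety net built on java.lang.ref.Cleaner that releases a resource once its owner becomes unreachable, in case close() was never called. The sketch below shows the general pattern with hypothetical class names; it does not reflect PDF Oxide internals. Because the Pdf type described above has no such net, a forgotten close() there is not covered by any fallback.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicBoolean;

/** Illustration only: the general shape of a Cleaner-backed AutoCloseable. */
final class CleanerBackstopDemo {
    private CleanerBackstopDemo() {}

    private static final Cleaner CLEANER = Cleaner.create();

    /**
     * The cleanup action must not hold a reference to the owning object,
     * otherwise the owner could never become unreachable.
     */
    private static final class ReleaseAction implements Runnable {
        private final AtomicBoolean released;

        ReleaseAction(AtomicBoolean released) {
            this.released = released;
        }

        @Override
        public void run() {
            released.set(true); // a real binding would free the native handle here
        }
    }

    /** Resource that is released on close(), or by the Cleaner if close() is forgotten. */
    static final class GuardedResource implements AutoCloseable {
        private final Cleaner.Cleanable cleanable;

        GuardedResource(AtomicBoolean released) {
            this.cleanable = CLEANER.register(this, new ReleaseAction(released));
        }

        @Override
        public void close() {
            cleanable.clean(); // runs the action at most once, even if called repeatedly
        }
    }

    /** Demonstrates the explicit path: the resource is released as soon as the try block exits. */
    static boolean releasedAfterClose() {
        AtomicBoolean released = new AtomicBoolean(false);
        try (GuardedResource resource = new GuardedResource(released)) {
            if (released.get()) {
                throw new IllegalStateException("released while still in use");
            }
        }
        return released.get();
    }
}
```

Even with a backstop, the garbage-collector path runs at an unspecified time, so explicit closing remains the reliable option in both designs.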

Password-Protected PDFs

Pass the password to open(), or catch PdfEncryptedException and then call authenticate().

import fyi.oxide.pdf.PdfDocument;
import java.nio.file.Path;

try (PdfDocument doc = PdfDocument.open(Path.of("confidential.pdf"), "secret")) {
    System.out.println(doc.extractText(0));
}

Error Handling

PdfException is an unchecked exception that extends RuntimeException. It offers typed subclasses as well as a kind() enum that you can branch on with switch.

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.exception.PdfEncryptedException;
import fyi.oxide.pdf.exception.PdfException;
import java.nio.file.Path;

try (PdfDocument doc = PdfDocument.open(Path.of("document.pdf"))) {
    String text = doc.extractText(0);
} catch (PdfEncryptedException e) {
    System.err.println("Password required");
} catch (PdfException e) {
    switch (e.kind()) {
        case PARSE -> System.err.println("Malformed PDF");
        case IO    -> System.err.println("I/O error");
        default    -> System.err.println("PDF error: " + e.getMessage());
    }
}
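The example above relies on two things: the more specific catch clause must come first, and the kind() enum covers the cases that have no dedicated subclass. The sketch below reproduces that pattern with stand-in types (DemoPdfException, DemoEncryptedException, and the Kind values are hypothetical) so it can be run without the library. It uses a classic switch statement, which also compiles on JDK 11; the arrow form shown above needs JDK 14 or later.

```java
/** Illustration only: typed subclasses plus a kind() enum on an unchecked exception. */
final class ErrorKindDemo {
    private ErrorKindDemo() {}

    /** Hypothetical error categories for this sketch. */
    enum Kind { PARSE, IO, ENCRYPTED, OTHER }

    /** Stand-in base exception: unchecked, with a machine-readable kind. */
    static class DemoPdfException extends RuntimeException {
        private final Kind kind;

        DemoPdfException(Kind kind, String message) {
            super(message);
            this.kind = kind;
        }

        Kind kind() {
            return kind;
        }
    }

    /** Stand-in typed subclass for the most common special case. */
    static final class DemoEncryptedException extends DemoPdfException {
        DemoEncryptedException() {
            super(Kind.ENCRYPTED, "password required");
        }
    }

    /** Runs the action and maps any failure to a short user-facing message. */
    static String handle(Runnable action) {
        try {
            action.run();
            return "ok";
        } catch (DemoEncryptedException e) {
            // Must precede the base-class clause, or the compiler rejects it as unreachable.
            return "Password required";
        } catch (DemoPdfException e) {
            switch (e.kind()) {
                case PARSE:
                    return "Malformed PDF";
                case IO:
                    return "I/O error";
                default:
                    return "PDF error: " + e.getMessage();
            }
        }
    }
}
```

Keeping a default branch means that error kinds added in a later release fall through to a generic message instead of being silently ignored.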

Kotlin

The same JAR works unchanged from Kotlin, and record accessors are exposed as properties.

import fyi.oxide.pdf.PdfDocument
import java.nio.file.Path

PdfDocument.open(Path.of("report.pdf")).use { doc ->
    println("Pages: ${doc.pageCount()}")
    println(doc.extractText(0))
}

Next Steps

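Getting Started with Python – Using PDF Oxide from Python
Getting Started with Rust – Using PDF Oxide from Rust
Text Extraction – Detailed extraction options and usage
PDF Creation – Advanced creation, encryption, and metadata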