What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

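Getting started with PDF Oxide (Kotlin)

PDF Oxide is the fastest PDF library on the JVM with text extraction built in: 0.8ms mean, with a 100% pass rate on 3,830 PDFs. The Kotlin binding is an idiomatic, Android-ready facade layered on top of the Java binding. It adds use { } to closeable handles and turns Java's Optional<T> return values into nullable T?. One library handles extracting, creating, and editing PDFs. It is MIT licensed and built on a Rust core.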

Installation

Add the Kotlin binding to build.gradle.kts. The Java binding, which handles the JNI native bridge, is pulled in transitively.

dependencies {
    implementation("fyi.oxide:pdf-oxide-kotlin:0.3.69")
}

Requirements: JDK 17 or later. On Android, bundle the native libpdf_oxide_jni.so under jniLibs/<abi>/. On a desktop JVM the loader finds it automatically (override with -Dfyi.oxide.pdf.lib.path=<path> if needed).
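As a rough sketch of how these requirements might be wired into a Gradle build, the fragment below pins the JDK toolchain and passes the loader override to test JVMs. The path value is a placeholder, and whether the property expects a directory or a file is not stated here, so treat that line as an assumption to verify.

```kotlin
// build.gradle.kts (sketch, assumes the Kotlin JVM Gradle plugin is applied)
kotlin {
    jvmToolchain(17) // matches the stated JDK 17+ requirement
}

tasks.withType<Test>().configureEach {
    // Placeholder path; only needed if automatic discovery does not find the library.
    systemProperty("fyi.oxide.pdf.lib.path", "/path/to/native")
}
```

The same systemProperty call can be applied to JavaExec tasks if the override is needed when running the application rather than the tests.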

Quick start

Build a PDF from Markdown, open it, and read the text back. The Pdf and PdfDocument handles are AutoCloseable, so wrap them in use { }.

import fyi.oxide.pdf.Pdf
import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.producerOrNull

Pdf.fromMarkdown("# Hello pdf_oxide\n\nThis is a **Kotlin** binding.\n").use { pdf ->
    PdfDocument.open(pdf.save()).use { doc ->
        println("pages:    ${doc.pageCount()}")
        println("producer: ${doc.producerOrNull() ?: "(none)"}")
        println(doc.extractText(0))
    }
}

Pdf.fromMarkdown(String) returns a closeable Pdf builder, and pdf.save() serializes it to a ByteArray. PdfDocument.open(ByteArray) opens those bytes for reading.
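Because both ends of this round trip are plain ByteArray values, persisting a PDF is ordinary file I/O. The sketch below uses only java.nio.file from the JDK. The helper name writeAndReload is made up for illustration and is not part of PDF Oxide.

```kotlin
import java.nio.file.Files
import java.nio.file.Path

// Write a byte array to disk and read it back unchanged.
// In real use, `bytes` would be the result of pdf.save(), and the returned
// array could be passed to PdfDocument.open(ByteArray).
fun writeAndReload(bytes: ByteArray, target: Path): ByteArray {
    Files.write(target, bytes)
    return Files.readAllBytes(target)
}
```

Calling Files.readAllBytes(Path.of("input.pdf")) on its own is also one way to obtain the pdfBytes value for a file that already exists on disk.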

Opening a PDF

Open an existing document from bytes and inspect its metadata. producerOrNull() and creatorOrNull() are Kotlin nullable views of the Java Optional getters.

import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.producerOrNull
import fyi.oxide.pdf.creatorOrNull

PdfDocument.open(pdfBytes).use { doc ->
    println("open:     ${doc.isOpen}")
    println("pages:    ${doc.pageCount()}")
    println("producer: ${doc.producerOrNull() ?: "(none)"}")
    println("creator:  ${doc.creatorOrNull() ?: "(none)"}")
}
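To illustrate the idea behind the OrNull extensions, here is a minimal sketch of how an Optional-returning getter can be exposed as a nullable value. JavaStyleInfo is a hypothetical stand-in, not the real PdfDocument, and the actual extension functions in the binding may be implemented differently.

```kotlin
import java.util.Optional

// Hypothetical stand-in for a Java-style class with an Optional getter.
// It only mimics the shape of such an API.
class JavaStyleInfo(private val producer: String?) {
    fun producer(): Optional<String> = Optional.ofNullable(producer)
}

// An extension in the style of producerOrNull(): unwrap Optional<T> into T?.
fun JavaStyleInfo.producerOrNull(): String? = producer().orElse(null)
```

With this in place, JavaStyleInfo(null).producerOrNull() ?: "(none)" evaluates to "(none)", which is the same Elvis-operator pattern used in the examples above.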

Text extraction

Extract plain text from any page by its 0-based index, or iterate over all pages.

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(pdfBytes).use { doc ->
    // Single page
    println(doc.extractText(0))

    // All pages
    for (i in 0 until doc.pageCount()) {
        println("--- Page ${i + 1} ---")
        println(doc.extractText(i))
    }
}
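If the whole document is needed as a single string, the per-page loop can be folded into a small helper. The sketch below takes the extractor as a function parameter so it runs without a PDF. joinPages is an illustrative name, not a library function.

```kotlin
// Join per-page text into one string, with a header line before each page.
// `extract` is any function from a 0-based page index to that page's text.
fun joinPages(pageCount: Int, extract: (Int) -> String): String =
    (0 until pageCount).joinToString(separator = "\n") { i ->
        "--- Page ${i + 1} ---\n${extract(i)}"
    }
```

With an open document this could be called as joinPages(doc.pageCount()) { doc.extractText(it) }, assuming pageCount() returns an Int as the loop above suggests.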

Page elements

doc.page(i) returns a PdfPage that exposes structured geometry such as words, lines, characters, tables, images, and annotations. Each word carries its text together with its bounding box.

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(pdfBytes).use { doc ->
    val page = doc.page(0)
    println("size: ${page.width()} x ${page.height()}")

    page.words().take(8).forEach { word ->
        println("${word.text()} @ ${word.bbox()}")
    }

    println("lines:       ${page.lines().size}")
    println("chars:       ${page.chars().size}")
    println("tables:      ${page.tables().size}")
    println("images:      ${page.images().size}")
    println("annotations: ${page.annotations().size}")
}

A word's bbox() is a BBox with helpers such as width() and height().
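One thing box dimensions make possible is a crude layout heuristic, for example flagging unusually tall words as likely heading text. The sketch below works on a hypothetical SizedWord stand-in rather than the library's own word type, and the Double type is an assumption, since the numeric type returned by width() and height() is not stated here.

```kotlin
// Hypothetical stand-in for a word plus the size of its bounding box.
data class SizedWord(val text: String, val width: Double, val height: Double)

// Keep words whose box is at least `factor` times taller than the median height.
// For an even number of words, the upper of the two middle values is used.
fun tallWords(words: List<SizedWord>, factor: Double = 1.5): List<SizedWord> {
    if (words.isEmpty()) return emptyList()
    val sortedHeights = words.map { it.height }.sorted()
    val median = sortedHeights[sortedHeights.size / 2]
    return words.filter { it.height >= median * factor }
}
```

Real words could be mapped into this shape with something like SizedWord(w.text(), w.bbox().width().toDouble(), w.bbox().height().toDouble()), again assuming the helpers return a numeric type.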

Markdown and HTML conversion

Convert the whole document to Markdown, or render pages as HTML.

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(pdfBytes).use { doc ->
    val markdown = doc.toMarkdown()  // all pages
    println(markdown)

    val html = doc.toHtml()
    println(html)
}
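For LLM or RAG pipelines, the Markdown output is often split into sections before indexing. The sketch below is one simple way to do that in plain Kotlin: it starts a new chunk at each ATX heading. splitByHeadings is an illustrative helper, not part of PDF Oxide, and it does not special-case lines that start with # inside fenced code blocks.

```kotlin
// Matches ATX headings such as "# Title" or "### Section" at the start of a line.
private val headingPattern = Regex("^#{1,6} ")

// Split Markdown into chunks, starting a new chunk at every heading line.
// Text before the first heading becomes its own chunk.
fun splitByHeadings(markdown: String): List<String> {
    val chunks = mutableListOf<StringBuilder>()
    for (line in markdown.lines()) {
        val isHeading = headingPattern.containsMatchIn(line)
        if (isHeading || chunks.isEmpty()) chunks.add(StringBuilder())
        chunks.last().appendLine(line)
    }
    return chunks.map { it.toString().trim() }.filter { it.isNotEmpty() }
}
```

A typical call would be splitByHeadings(doc.toMarkdown()), with each returned chunk containing one heading and the text that follows it.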

Search

Search for text across the document. Each match exposes its text through text().

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(pdfBytes).use { doc ->
    val matches = doc.search("configuration")
    matches.forEach { m ->
        println("match: ${m.text()}")
    }
}
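When several terms need checking, the search call can be wrapped in a small tally. The sketch below takes the search step as a function parameter so it runs on its own. countHits is a made-up helper name, not a library function.

```kotlin
// Count how many matches each term produces.
// `search` is any function that returns the matched texts for a query.
fun countHits(terms: List<String>, search: (String) -> List<String>): Map<String, Int> =
    terms.associateWith { term -> search(term).size }
```

With an open document, the search parameter could be supplied as { q -> doc.search(q).map { it.text() } }, using only the calls shown above.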

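Automatic extraction

AutoExtractor runs the full extraction pipeline in a single call and returns an AutoResult holding the text along with optional Markdown/HTML renderings. The markdownOrNull() / htmlOrNull() extension functions turn the Java Optional return values into nullable values.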

import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.AutoExtractor
import fyi.oxide.pdf.markdownOrNull
import fyi.oxide.pdf.htmlOrNull

PdfDocument.open(pdfBytes).use { doc ->
    val result = AutoExtractor.of(doc).extractDocument()
    println(result.text())
    result.markdownOrNull()?.let { println(it) }
    result.htmlOrNull()?.let { println(it) }
}
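Nullable renderings compose well with the Elvis operator when a caller wants exactly one output. The sketch below picks the richest rendering available. bestRendering is an illustrative helper, and the preference order (Markdown, then HTML, then plain text) is only one reasonable choice.

```kotlin
// Pick the first rendering that is present: Markdown, then HTML, then plain text.
fun bestRendering(markdown: String?, html: String?, text: String): String =
    markdown ?: html ?: text
```

With the result above, this could be called as bestRendering(result.markdownOrNull(), result.htmlOrNull(), result.text()).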

Editing

DocumentEditor opens a PDF for structural edits, such as scrubbing metadata before sharing, and then serializes the result back to bytes.

import fyi.oxide.pdf.DocumentEditor

DocumentEditor.open(pdfBytes).use { editor ->
    editor.scrubMetadata()
    val cleaned: ByteArray = editor.save()
    println("cleaned: ${cleaned.size} bytes")
}

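Next steps

Getting started with Java – the JVM binding that the Kotlin facade wraps
Getting started with Python – using PDF Oxide from Python
Text extraction – detailed extraction options and recipes
PDF creation – advanced creation with builders, encryption, and metadata
Editing – modifying existing PDFs, annotations, and form fields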