What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (Kotlin)

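PDF Oxide is the fastest PDF library for the JVM, with text extraction built in: 0.8ms mean extraction time and a 100% pass rate across 3,830 PDFs. The Kotlin binding is an idiomatic, Android-ready facade over the Java binding. It adds use { } to closeable handles and converts Java Optional<T> return values to nullable T?. Extraction, creation, and editing are all handled by this one library. It is MIT licensed and built on a Rust core.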

Installation

Add the Kotlin binding to build.gradle.kts. The Java binding, which provides the native JNI bridge, is pulled in transitively.

dependencies {
    implementation("fyi.oxide:pdf-oxide-kotlin:0.3.69")
}

Requirements: JDK 17 or later. On Android, bundle the native libpdf_oxide_jni.so under jniLibs/<abi>/. On a desktop JVM the loader finds it automatically (override with -Dfyi.oxide.pdf.lib.path=<path> if needed).
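
If the loader cannot find the native library on a desktop JVM, the override property can be set from the build script instead of on every command line. The following is a minimal sketch for build.gradle.kts. The path is a placeholder, and whether the property expects a directory or the library file itself is not stated on this page, so treat that as something to verify.

```kotlin
// build.gradle.kts
// Pass the override to the JVMs that Gradle forks for JavaExec tasks and tests.
// "/opt/pdf_oxide/native" is a hypothetical location; substitute your own.
val pdfOxideLibPath = "/opt/pdf_oxide/native"

tasks.withType<JavaExec>().configureEach {
    systemProperty("fyi.oxide.pdf.lib.path", pdfOxideLibPath)
}

tasks.withType<Test>().configureEach {
    systemProperty("fyi.oxide.pdf.lib.path", pdfOxideLibPath)
}
```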

Quick Start

Generate a PDF from Markdown, open it, and read the text back. Pdf and PdfDocument handles are AutoCloseable, so wrap them in use { }.

import fyi.oxide.pdf.Pdf
import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.producerOrNull

Pdf.fromMarkdown("# Hello pdf_oxide\n\nThis is a **Kotlin** binding.\n").use { pdf ->
    PdfDocument.open(pdf.save()).use { doc ->
        println("pages:    ${doc.pageCount()}")
        println("producer: ${doc.producerOrNull() ?: "(none)"}")
        println(doc.extractText(0))
    }
}

Pdf.fromMarkdown(String) returns a closeable Pdf builder, and pdf.save() serializes it to a ByteArray. PdfDocument.open(ByteArray) opens those bytes for reading.

Opening a PDF

Open an existing document from bytes and inspect its metadata. producerOrNull() and creatorOrNull() expose the Java Optional getters as Kotlin nullable views.

import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.producerOrNull
import fyi.oxide.pdf.creatorOrNull

PdfDocument.open(pdfBytes).use { doc ->
    println("open:     ${doc.isOpen}")
    println("pages:    ${doc.pageCount()}")
    println("producer: ${doc.producerOrNull() ?: "(none)"}")
    println("creator:  ${doc.creatorOrNull() ?: "(none)"}")
}
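
The examples on this page take a pdfBytes value without showing where it comes from. One possible way to obtain it is to read a file with the JDK, as sketched below. loadPdfBytes is a hypothetical helper, not part of PDF Oxide. Its %PDF- check relies on the standard PDF file header and only guards against obviously wrong input.

```kotlin
import java.nio.file.Files
import java.nio.file.Path

// Hypothetical helper (not part of PDF Oxide): read a file and make sure
// it at least starts with the standard "%PDF-" header.
fun loadPdfBytes(path: Path): ByteArray {
    val bytes = Files.readAllBytes(path)
    val magic = "%PDF-".toByteArray(Charsets.US_ASCII)
    require(bytes.size >= magic.size && magic.indices.all { bytes[it] == magic[it] }) {
        "$path does not look like a PDF (missing %PDF- header)"
    }
    return bytes
}

fun main() {
    // Demo with a throwaway file that contains only a header line.
    val tmp = Files.createTempFile("sample", ".pdf")
    Files.write(tmp, "%PDF-1.7\n".toByteArray(Charsets.US_ASCII))
    println("read ${loadPdfBytes(tmp).size} bytes")
    Files.delete(tmp)
}
```

The returned array can be passed directly to PdfDocument.open(pdfBytes).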

Text Extraction

Extract plain text from any page by its 0-based index. You can also loop over all pages.

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(pdfBytes).use { doc ->
    // Single page
    println(doc.extractText(0))

    // All pages
    for (i in 0 until doc.pageCount()) {
        println("--- Page ${i + 1} ---")
        println(doc.extractText(i))
    }
}
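
When the whole document is needed as one string, the per-page loop above can be folded into a small helper. The sketch below is deliberately library-agnostic: extractAllText is a hypothetical function that takes the page count and a page-reading lambda, so it can be exercised without a PDF at hand. It assumes pageCount() returns an Int.

```kotlin
// Hypothetical helper: concatenate the text of every page, with a
// separator between pages. `readPage` receives a 0-based page index.
fun extractAllText(
    pageCount: Int,
    separator: String = "\n\n",
    readPage: (Int) -> String,
): String = (0 until pageCount).joinToString(separator) { readPage(it) }

fun main() {
    // Stand-in for a real document: three fake pages.
    val fakePages = listOf("first", "second", "third")
    println(extractAllText(fakePages.size) { fakePages[it] })
}
```

With a real document, the call inside the use { } block would look like extractAllText(doc.pageCount()) { doc.extractText(it) }.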

Page Elements

doc.page(i) returns a PdfPage, which exposes structured geometry: words, lines, characters, tables, images, and annotations. Each word carries its text and bounding box.

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(pdfBytes).use { doc ->
    val page = doc.page(0)
    println("size: ${page.width()} x ${page.height()}")

    page.words().take(8).forEach { word ->
        println("${word.text()} @ ${word.bbox()}")
    }

    println("lines:       ${page.lines().size}")
    println("chars:       ${page.chars().size}")
    println("tables:      ${page.tables().size}")
    println("images:      ${page.images().size}")
    println("annotations: ${page.annotations().size}")
}

A word's bbox() is a BBox with helpers such as width() and height().
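
Because each word carries a bounding box, simple layout heuristics become possible. As one illustration, the sketch below flags words whose box is noticeably taller than the median word height, a rough proxy for heading text. headingCandidates and the 1.3 threshold are assumptions made for this example, not part of PDF Oxide. The function works on plain (text, height) pairs so it does not depend on the exact numeric type that height() returns.

```kotlin
// Hypothetical heuristic: words taller than `factor` times the median
// height are treated as heading candidates.
fun headingCandidates(
    words: List<Pair<String, Double>>, // (text, bounding-box height)
    factor: Double = 1.3,
): List<String> {
    if (words.isEmpty()) return emptyList()
    val heights = words.map { it.second }.sorted()
    val median = heights[heights.size / 2] // upper median for even sizes
    return words.filter { it.second > median * factor }.map { it.first }
}

fun main() {
    val sample = listOf("Report" to 24.0, "This" to 11.0, "is" to 11.0, "body" to 11.0)
    println(headingCandidates(sample))
}
```

On a real page the input could be built with page.words().map { it.text() to it.bbox().height().toDouble() }, assuming height() returns a numeric type.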

Converting to Markdown and HTML

Convert a whole document to Markdown, or render pages to HTML.

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(pdfBytes).use { doc ->
    val markdown = doc.toMarkdown()  // all pages
    println(markdown)

    val html = doc.toHtml()
    println(html)
}
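
Markdown output suits LLM and RAG pipelines partly because headings give natural chunk boundaries. The sketch below splits a Markdown string into one chunk per heading. splitByHeadings is a hypothetical helper that treats any line starting with one to six # characters followed by a space as a heading. It does not account for # lines inside fenced code blocks.

```kotlin
// Hypothetical helper: split Markdown into chunks, starting a new chunk
// at every ATX heading line ("# ", "## ", ... up to six #).
fun splitByHeadings(markdown: String): List<String> {
    val heading = Regex("^#{1,6} ")
    val chunks = mutableListOf<StringBuilder>()
    for (line in markdown.lines()) {
        // Start a new chunk on a heading, or for text before the first heading.
        if (heading.containsMatchIn(line) || chunks.isEmpty()) {
            chunks.add(StringBuilder())
        }
        chunks.last().appendLine(line)
    }
    return chunks.map { it.toString().trim() }.filter { it.isNotEmpty() }
}

fun main() {
    val md = "# Title\n\nIntro text.\n\n## Section\n\nMore text.\n"
    splitByHeadings(md).forEach { println("[chunk]\n$it\n") }
}
```

In a pipeline, the input would be the string returned by doc.toMarkdown().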

Search

Search the whole document for text. Each match exposes its text via text().

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(pdfBytes).use { doc ->
    val matches = doc.search("configuration")
    matches.forEach { m ->
        println("match: ${m.text()}")
    }
}
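
Search results are often summarized rather than printed one by one. The sketch below tallies matched strings case-insensitively. tallyMatches is a hypothetical helper that operates on plain strings. Whether search() itself is case-sensitive is not stated on this page, so the lowercase normalization is only a precaution.

```kotlin
// Hypothetical helper: count how often each matched string occurs,
// ignoring case differences.
fun tallyMatches(texts: List<String>): Map<String, Int> =
    texts.groupingBy { it.lowercase() }.eachCount()

fun main() {
    val sample = listOf("Configuration", "configuration", "CONFIGURATION", "config")
    tallyMatches(sample).forEach { (text, count) -> println("$text: $count") }
}
```

With a real document the input could be doc.search("configuration").map { it.text() }, assuming search() returns an iterable collection of matches.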

Automatic Extraction

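AutoExtractor runs the full extraction pipeline in a single call and returns an AutoResult containing the text plus optional Markdown and HTML renderings. The markdownOrNull() and htmlOrNull() extension functions convert the Java Optional return values to nullable values.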

import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.AutoExtractor
import fyi.oxide.pdf.markdownOrNull
import fyi.oxide.pdf.htmlOrNull

PdfDocument.open(pdfBytes).use { doc ->
    val result = AutoExtractor.of(doc).extractDocument()
    println(result.text())
    result.markdownOrNull()?.let { println(it) }
    result.htmlOrNull()?.let { println(it) }
}
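
The ...OrNull() extensions follow a common Kotlin pattern for wrapping Java APIs that return Optional. The sketch below shows that pattern on a stand-in class. JavaStyleResult and its markdown() method are hypothetical and only mimic the shape of an Optional-returning getter. How PDF Oxide implements its own extensions is not shown on this page.

```kotlin
import java.util.Optional

// Stand-in for a Java class with an Optional-returning getter (hypothetical).
class JavaStyleResult(private val md: String?) {
    fun markdown(): Optional<String> = Optional.ofNullable(md)
}

// Extension that exposes the Optional as a Kotlin nullable value.
fun JavaStyleResult.markdownOrNull(): String? = markdown().orElse(null)

fun main() {
    // Nullable values compose with ?., ?: and let, unlike Optional.
    println(JavaStyleResult("# Hi").markdownOrNull() ?: "(none)")
    println(JavaStyleResult(null).markdownOrNull() ?: "(none)")
}
```

Exposing T? instead of Optional<T> lets callers use the safe-call and Elvis operators directly, as the examples on this page do.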

Editing

DocumentEditor opens a PDF for structural edits (for example, stripping metadata before sharing) and serializes the result back to bytes.

import fyi.oxide.pdf.DocumentEditor

DocumentEditor.open(pdfBytes).use { editor ->
    editor.scrubMetadata()
    val cleaned: ByteArray = editor.save()
    println("cleaned: ${cleaned.size} bytes")
}
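
Since the editor works on byte arrays, reading from and writing to disk is left to the caller. The sketch below wraps that file handling around an arbitrary byte transform. rewriteFile is a hypothetical helper, not part of PDF Oxide, and the demo uses a trivial transform so it runs without the library.

```kotlin
import java.nio.file.Files
import java.nio.file.Path

// Hypothetical helper: read `input`, apply `transform` to its bytes, and
// write the result to `output`. Returns the number of bytes written.
fun rewriteFile(input: Path, output: Path, transform: (ByteArray) -> ByteArray): Int {
    val result = transform(Files.readAllBytes(input))
    Files.write(output, result)
    return result.size
}

fun main() {
    // Demo with temporary files and a transform that reverses the bytes.
    val src = Files.createTempFile("in", ".bin")
    val dst = Files.createTempFile("out", ".bin")
    Files.write(src, byteArrayOf(1, 2, 3))
    println("wrote ${rewriteFile(src, dst) { it.reversedArray() }} bytes")
    Files.delete(src)
    Files.delete(dst)
}
```

To scrub metadata, the transform would be { bytes -> DocumentEditor.open(bytes).use { editor -> editor.scrubMetadata(); editor.save() } }.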

Next Steps

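Getting Started with Java – the JVM binding that the Kotlin facade wraps
Getting Started with Python – use PDF Oxide from Python
Text Extraction – extraction options and recipes in detail
Creating PDFs – advanced creation with builders, encryption, and metadata
Editing – modifying existing PDFs, annotations, and form fields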