What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

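Getting Started with PDF Oxide (Kotlin)

PDF Oxide is the fastest PDF library on the JVM, with built-in text extraction: 0.8ms mean, and a 100% pass rate on 3,830 PDFs. The Kotlin binding is a thin, idiomatic, Android-friendly wrapper over the Java binding: it adds use { } for closeable handles and converts Java Optional<T> return values to nullable T?. A single library covers PDF extraction, creation, and editing. MIT licensed, built on a Rust core.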
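The two conveniences mentioned above, nullable unwrapping of Optional and use { } on closeable handles, can be illustrated without the library at all. The sketch below uses hypothetical stand-in classes (JavaStyleInfo and Handle are not PDF Oxide types, and this is not the binding's actual source); it only shows the general pattern such a wrapper might follow.

```kotlin
import java.util.Optional

// Hypothetical stand-in for a Java-style class with an Optional getter.
// NOT part of PDF Oxide; it only illustrates the pattern.
class JavaStyleInfo(private val value: String?) {
    fun producer(): Optional<String> = Optional.ofNullable(value)
}

// Kotlin extension that unwraps the Optional into a nullable String?.
fun JavaStyleInfo.producerOrNull(): String? = producer().orElse(null)

// Hypothetical closeable handle: use { } calls close() even if the block throws.
class Handle : AutoCloseable {
    var closed = false
        private set

    override fun close() {
        closed = true
    }
}
```

With a nullable return value, callers can use the elvis operator (info.producerOrNull() ?: "(none)") instead of Optional's isPresent / get.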

Installation

Add the Kotlin binding to build.gradle.kts. It transitively pulls in the Java binding, which carries the JNI native bridge:

dependencies {
    implementation("fyi.oxide:pdf-oxide-kotlin:0.3.69")
}

Requirements: JDK 17+. On Android, package the native library libpdf_oxide_jni.so under jniLibs/<abi>/. On a desktop JVM, the loader finds it automatically (if needed, override the path with -Dfyi.oxide.pdf.lib.path=<path>).
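When the desktop override is needed, the system property has to reach the JVM that Gradle forks, not the Gradle daemon itself. A minimal sketch for build.gradle.kts follows, assuming the application is started through a JavaExec task (such as run) and tested through standard Test tasks. "<path>" is the same placeholder as above and must be replaced with the real location.

```kotlin
// build.gradle.kts (sketch): forward the native-library override to forked JVMs.
// "<path>" is a placeholder, not a real location.
tasks.withType<JavaExec>().configureEach {
    systemProperty("fyi.oxide.pdf.lib.path", "<path>")
}

tasks.withType<Test>().configureEach {
    systemProperty("fyi.oxide.pdf.lib.path", "<path>")
}
```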

Quick Start

Build a PDF from Markdown, open it, and read the text back. Both the Pdf and PdfDocument handles implement AutoCloseable, so wrap them in use { }:

import fyi.oxide.pdf.Pdf
import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.producerOrNull

Pdf.fromMarkdown("# Hello pdf_oxide\n\nThis is a **Kotlin** binding.\n").use { pdf ->
    PdfDocument.open(pdf.save()).use { doc ->
        println("pages:    ${doc.pageCount()}")
        println("producer: ${doc.producerOrNull() ?: "(none)"}")
        println(doc.extractText(0))
    }
}

Pdf.fromMarkdown(String) returns a closeable Pdf builder; pdf.save() serializes it to a ByteArray. PdfDocument.open(ByteArray) then opens those bytes for reading.

Opening a PDF

Open an existing document from bytes and inspect its metadata. producerOrNull() and creatorOrNull() are Kotlin nullable wrappers around the Java Optional getters:

import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.producerOrNull
import fyi.oxide.pdf.creatorOrNull

PdfDocument.open(pdfBytes).use { doc ->
    println("open:     ${doc.isOpen}")
    println("pages:    ${doc.pageCount()}")
    println("producer: ${doc.producerOrNull() ?: "(none)"}")
    println("creator:  ${doc.creatorOrNull() ?: "(none)"}")
}
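The snippet above takes pdfBytes as given. One way to obtain it is to read a file from disk; the sketch below does that with the JDK only. loadPdfBytes is a hypothetical helper name, and the header check is a strict assumption added here: it rejects files that do not begin with the "%PDF-" marker, even though some readers tolerate leading bytes before it.

```kotlin
import java.nio.file.Files
import java.nio.file.Path

// Hypothetical helper: reads a whole PDF file into memory so the result
// can be passed to PdfDocument.open(ByteArray).
// Throws IllegalArgumentException when the file lacks the "%PDF-" header.
fun loadPdfBytes(path: Path): ByteArray {
    val bytes = Files.readAllBytes(path)
    val header = "%PDF-".toByteArray(Charsets.US_ASCII)
    require(bytes.size >= header.size && bytes.copyOfRange(0, header.size).contentEquals(header)) {
        "$path does not start with the %PDF- header"
    }
    return bytes
}
```

A call such as val pdfBytes = loadPdfBytes(Path.of("report.pdf")) would then feed the examples in this guide; the file name is illustrative.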

Text Extraction

Extract plain text from any page by its zero-based page index, or iterate over every page:

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(pdfBytes).use { doc ->
    // a single page
    println(doc.extractText(0))

    // every page
    for (i in 0 until doc.pageCount()) {
        println("--- Page ${i + 1} ---")
        println(doc.extractText(i))
    }
}
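When the text is needed as a single string rather than printed page by page, the same loop can be folded into a small helper. This is a sketch, not part of the binding: joinPages is a hypothetical name, and it takes the page extractor as a lambda so it stays independent of the library.

```kotlin
// Hypothetical helper: joins per-page text into one string with a header per page.
// `extractPage` maps a zero-based page index to that page's text.
fun joinPages(pageCount: Int, extractPage: (Int) -> String): String =
    (0 until pageCount).joinToString(separator = "\n") { i ->
        "--- Page ${i + 1} ---\n${extractPage(i)}"
    }
```

Inside a use { } block it could be called as joinPages(doc.pageCount()) { doc.extractText(it) }.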

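Page Elements

doc.page(i) returns a PdfPage that exposes structured geometry: words, lines, characters, tables, images, and annotations. Each word carries its text and a bounding box: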

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(pdfBytes).use { doc ->
    val page = doc.page(0)
    println("size: ${page.width()} x ${page.height()}")

    page.words().take(8).forEach { word ->
        println("${word.text()} @ ${word.bbox()}")
    }

    println("lines:       ${page.lines().size}")
    println("chars:       ${page.chars().size}")
    println("tables:      ${page.tables().size}")
    println("images:      ${page.images().size}")
    println("annotations: ${page.annotations().size}")
}

A word's bbox() is a BBox, which provides helper methods such as width() and height().
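To make the geometry concrete, here is a sketch of what a bounding box with width() and height() helpers typically looks like, together with a function that merges several word boxes into one enclosing box. Box, its corner fields, and unionOf are hypothetical stand-ins for illustration; they are not PDF Oxide's BBox API, whose accessors may differ.

```kotlin
// Hypothetical stand-in for a bounding box; field names are assumptions,
// not PDF Oxide's BBox API.
data class Box(val x0: Double, val y0: Double, val x1: Double, val y1: Double) {
    fun width(): Double = x1 - x0
    fun height(): Double = y1 - y0
}

// Smallest box that covers every box in the list, or null for an empty list.
fun unionOf(boxes: List<Box>): Box? {
    if (boxes.isEmpty()) return null
    return Box(
        boxes.minOf { it.x0 }, boxes.minOf { it.y0 },
        boxes.maxOf { it.x1 }, boxes.maxOf { it.y1 },
    )
}
```

A union like this is one way to derive the extent of a phrase from the boxes of its individual words.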

Markdown and HTML Conversion

Convert the whole document to Markdown, or render a page as HTML:

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(pdfBytes).use { doc ->
    val markdown = doc.toMarkdown()  // all pages
    println(markdown)

    val html = doc.toHtml()
    println(html)
}

Search

Search for text across the whole document. Each match exposes its text through text():

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(pdfBytes).use { doc ->
    val matches = doc.search("configuration")
    matches.forEach { m ->
        println("match: ${m.text()}")
    }
}

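Auto Extraction

AutoExtractor runs the full extraction pipeline in a single call and returns an AutoResult containing the text plus optional Markdown/HTML renderings. The markdownOrNull() / htmlOrNull() extension functions convert the Java Optional return values to nullable values: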

import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.AutoExtractor
import fyi.oxide.pdf.markdownOrNull
import fyi.oxide.pdf.htmlOrNull

PdfDocument.open(pdfBytes).use { doc ->
    val result = AutoExtractor.of(doc).extractDocument()
    println(result.text())
    result.markdownOrNull()?.let { println(it) }
    result.htmlOrNull()?.let { println(it) }
}
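Because the Markdown and HTML renderings are nullable, a caller often wants one best-available string. The sketch below chains the elvis operator for that; preferredRendering is a hypothetical helper, and treating blank strings as missing is an assumption made here rather than documented library behavior.

```kotlin
// Hypothetical helper: picks the richest available rendering,
// Markdown first, then HTML, then plain text.
// Blank strings are treated as missing (an assumption of this sketch).
fun preferredRendering(markdown: String?, html: String?, text: String): String =
    markdown?.takeIf { it.isNotBlank() }
        ?: html?.takeIf { it.isNotBlank() }
        ?: text
```

With the result from the example above, it could be called as preferredRendering(result.markdownOrNull(), result.htmlOrNull(), result.text()).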

Editing

DocumentEditor opens a PDF for structural edits, such as scrubbing metadata before sharing, and then serializes the result back to bytes:

import fyi.oxide.pdf.DocumentEditor

DocumentEditor.open(pdfBytes).use { editor ->
    editor.scrubMetadata()
    val cleaned: ByteArray = editor.save()
    println("cleaned: ${cleaned.size} bytes")
}
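The cleaned bytes still have to be written somewhere. A minimal sketch using only the JDK follows: it writes to a temporary file in the same directory and then moves it over the target, so a failed write does not leave a half-written PDF in place. writeReplacing is a hypothetical helper name, and it assumes the target has a parent directory that already exists.

```kotlin
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.StandardCopyOption

// Hypothetical helper: writes bytes to a sibling temp file, then moves it over
// `target`, replacing any existing file. Assumes the parent directory exists.
fun writeReplacing(target: Path, bytes: ByteArray) {
    val dir = target.toAbsolutePath().parent
    val tmp = Files.createTempFile(dir, "pdf-", ".tmp")
    try {
        Files.write(tmp, bytes)
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING)
    } finally {
        // No-op after a successful move; cleans up the temp file on failure.
        Files.deleteIfExists(tmp)
    }
}
```

After editor.save(), a call such as writeReplacing(Path.of("cleaned.pdf"), cleaned) would persist the result; the file name is illustrative.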

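Next Steps

Java Getting Started: the JVM binding that the Kotlin wrapper is built on
Python Getting Started: using PDF Oxide from Python
Text Extraction: detailed extraction options and practical examples
PDF Creation: advanced creation with builders, encryption, and metadata
Editing: modifying existing PDFs, annotations, and form fields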