What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Clojure API Reference

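PDF Oxide provides idiomatic Clojure bindings: a thin wrapper over the fyi.oxide:pdf-oxide Java binding, which manages the single JNI native bridge (the pdf_oxide_jni crate). The wrapper adds no native code of its own. It calls the Java classes directly through interop and returns Clojure-friendly values (a java.util.List becomes a vector; a java.util.Optional becomes its value or nil). The handle types (Pdf, PdfDocument, DocumentEditor, AutoExtractor) all implement AutoCloseable, so use with-open for deterministic resource cleanup.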
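The value conversions described above can be pictured with a small interop sketch. This is not the wrapper's actual source: the helper names list->vec and optional->value are hypothetical, and the example uses only plain JDK classes to show the general pattern of turning a java.util.List into a vector and a java.util.Optional into a value or nil.

```clojure
(import '[java.util ArrayList List Optional])

(defn list->vec
  "Copies a java.util.List into an immutable Clojure vector."
  [^List xs]
  (vec xs))

(defn optional->value
  "Unwraps a java.util.Optional, yielding nil when it is empty."
  [^Optional o]
  (.orElse o nil))

;; Usage with plain JDK values (no PDF Oxide classes involved):
(list->vec (doto (ArrayList.) (.add "a") (.add "b"))) ; => ["a" "b"]
(optional->value (Optional/empty))                    ; => nil
(optional->value (Optional/of "pdf-oxide"))           ; => "pdf-oxide"
```

Returning nil for an empty Optional lets callers use ordinary Clojure idioms such as or, when-let, and some-> on the results.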

;; deps.edn
{:deps {fyi.oxide/pdf-oxide-clojure {:mvn/version "0.3.69"}}}

;; Leiningen
[fyi.oxide/pdf-oxide-clojure "0.3.69"]

The JNI native library (libpdf_oxide_jni) is not bundled with the artifact. Either place it on java.library.path so that System.loadLibrary("pdf_oxide_jni") can load it, or point the Java-side NativeLoader at the library with -Dfyi.oxide.pdf.lib.path=<path>.
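When loading fails, it can help to check whether the library is actually visible on java.library.path. The following is a minimal diagnostic sketch, not part of PDF Oxide: find-native-lib is a hypothetical helper that relies only on standard JDK calls (System/mapLibraryName gives the platform-specific file name that System.loadLibrary would look for).

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn find-native-lib
  "Returns the first file on java.library.path whose name matches the
  platform-specific name of the pdf_oxide_jni library, or nil if none exists."
  []
  (let [file-name (System/mapLibraryName "pdf_oxide_jni")
        separator (java.util.regex.Pattern/quote java.io.File/pathSeparator)
        dirs      (str/split (System/getProperty "java.library.path" "")
                             (re-pattern separator))]
    (->> dirs
         (remove str/blank?)                    ; skip empty path entries
         (map #(io/file % file-name))           ; candidate file in each directory
         (filter #(.isFile ^java.io.File %))    ; keep only files that exist
         first)))

;; Prints the resolved file, or a hint when nothing was found.
(println (or (find-native-lib)
             "pdf_oxide_jni not found on java.library.path"))
```

If nothing is found, the -Dfyi.oxide.pdf.lib.path option described above is the alternative route.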

All functions live in the pdf-oxide.core namespace:

(require '[pdf-oxide.core :as pdf])

For other languages, see the Java API Reference, Python API Reference, Rust API Reference, and Types & Enums.

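Pdf: Creation

Functions for building a new in-memory Pdf from source content, and for serializing it to a byte array. The returned Pdf implements AutoCloseable.

Create

(from-markdown ^Pdf [^String markdown])

Creates a Pdf from a Markdown string.

(from-html ^Pdf [^String html])

Creates a Pdf from an HTML string.

Save

(save ^bytes [^Pdf pdf])

Serializes the built Pdf to a byte array (the raw PDF bytes).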

(with-open [p (pdf/from-markdown "# Hello\n\nbody\n")]
  (pdf/save p))                 ; => byte[]
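Because save returns a byte array, writing the result to disk needs a binary-safe path: clojure.core/spit goes through a character writer and would write the array's string form rather than its bytes. A minimal sketch follows, in which write-bytes! is a hypothetical helper and the file name hello.pdf is only an illustration.

```clojure
(require '[clojure.java.io :as io])

(defn write-bytes!
  "Writes a byte array to path verbatim and returns path."
  [path ^bytes data]
  (with-open [out (io/output-stream path)]
    (.write ^java.io.OutputStream out data))
  path)

(comment
  ;; Assumes pdf-oxide.core is on the classpath and the native library loads.
  (require '[pdf-oxide.core :as pdf])
  (with-open [p (pdf/from-markdown "# Hello\n\nbody\n")]
    (write-bytes! "hello.pdf" (pdf/save p))))
```

The same helper applies to any other function in this reference that returns bytes.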

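PdfDocument: Open, Extract, and Render

The main handle for reading existing PDFs. Open one from a byte array or a filesystem path, then extract text, convert to Markdown/HTML, render pages, search content, and read metadata and form fields. Implements AutoCloseable.

Open

(open ^PdfDocument [source])
(open ^PdfDocument [source ^String password])

Opens a document from a byte array or a filesystem path string. The two-argument form supplies a password for an encrypted PDF.

(authenticate [^PdfDocument doc ^String password])

Authenticates an encrypted document after it has been opened; returns a boolean.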
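Since authenticate returns a boolean, it composes with ordinary predicates. The sketch below shows one way to try several candidate passwords and stop at the first one accepted. first-accepted is a hypothetical helper, the file name and passwords are placeholders, and the usage assumes (as the description above suggests) that an encrypted file can be opened first and authenticated afterwards.

```clojure
(defn first-accepted
  "Returns the first candidate for which accepted? is truthy, or nil.
  Stops at the first success, so later candidates are never tried."
  [accepted? candidates]
  (some #(when (accepted? %) %) candidates))

(comment
  ;; Assumes pdf-oxide.core is on the classpath and the native library loads.
  (require '[pdf-oxide.core :as pdf])
  (with-open [d (pdf/open "locked.pdf")]
    (first-accepted #(pdf/authenticate d %) ["draft-2023" "draft-2024"])))
```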

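Document queries

(page-count [^PdfDocument doc])

Returns the number of pages in the document.

(producer [^PdfDocument doc])

Returns the /Producer metadata string, or nil if absent.

(creator [^PdfDocument doc])

Returns the /Creator metadata string, or nil if absent.

Text extraction

(extract-text [^PdfDocument doc page])

Extracts plain text from a single page, identified by its zero-based index.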
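Because extract-text works on one zero-based page at a time, collecting text for a whole document means iterating over the indices 0 to page-count minus 1. A minimal sketch, where pages-text is a hypothetical helper and report.pdf a placeholder file name:

```clojure
(defn pages-text
  "Returns a vector with one entry per page, calling page->text with each
  zero-based index from 0 to n-1."
  [n page->text]
  (mapv page->text (range n)))

(comment
  ;; Assumes pdf-oxide.core is on the classpath and the native library loads.
  (require '[pdf-oxide.core :as pdf])
  (with-open [d (pdf/open "report.pdf")]
    (pages-text (pdf/page-count d) #(pdf/extract-text d %))))
```

mapv is used instead of map so that every page is read while the document is still open.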

(extract-structured [^PdfDocument doc page])

Extracts structured text (text blocks and spans with position information) for a single page.

Conversion

(to-markdown [^PdfDocument doc])
(to-markdown [^PdfDocument doc page])

Converts the whole document or a single page to Markdown.

(to-html [^PdfDocument doc])
(to-html [^PdfDocument doc page])

Converts the whole document or a single page to HTML.

Rendering

(render ^bytes [^PdfDocument doc page])
(render ^bytes [^PdfDocument doc page dpi])

Renders a page to PNG image bytes, optionally at a given DPI.
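To choose a DPI, it helps to estimate the resulting image size. PDF page dimensions are normally expressed in points of 1/72 inch, so a renderer typically scales by dpi/72. The sketch below is plain arithmetic under that assumption; expected-pixels is a hypothetical helper, and the exact rounding used by render is not specified here.

```clojure
(defn expected-pixels
  "Approximate pixel size of a page measured in PDF points (1 pt = 1/72 inch)
  when rasterized at the given DPI."
  [width-pt height-pt dpi]
  (let [scale (/ dpi 72.0)]
    [(Math/round (double (* width-pt scale)))
     (Math/round (double (* height-pt scale)))]))

(expected-pixels 612 792 150) ; US Letter at 150 DPI => [1275 1650]
(expected-pixels 612 792 72)  ; one pixel per point   => [612 792]
```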

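Search

(search [^PdfDocument doc ^String query])

Searches the document for text; returns a vector of SearchMatch results.

Forms

(form-fields [^PdfDocument doc])

Returns a vector of the document's AcroForm form fields.

Page access

(page ^PdfPage [^PdfDocument doc idx])

Gets the PdfPage handle for the page at a zero-based index.

(pages [^PdfDocument doc])

Returns a vector of PdfPage handles for every page in the document.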

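PdfPage: Page Element Extraction

A page handle obtained via (pdf/page doc idx) or (pdf/pages doc). Each extraction function converts the Java List result to a Clojure vector.

Elements

(words [^PdfPage page])

Returns a vector of the word elements on the page.

(lines [^PdfPage page])

Returns a vector of the line elements on the page.

(chars [^PdfPage page])

Returns a vector of the per-character glyphs on the page. (pdf/chars intentionally shadows clojure.core/chars.)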
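The shadowing mentioned above follows a standard Clojure pattern: a namespace excludes the core var and defines its own. The sketch below illustrates that pattern with a hypothetical namespace and a stand-in chars function; it is not PDF Oxide's source.

```clojure
(ns demo.page-elements
  (:refer-clojure :exclude [chars]))   ; do not refer clojure.core/chars here

(defn chars
  "Hypothetical stand-in: returns each character of s as a one-character string."
  [s]
  (mapv str s))

(chars "ab")                                    ; => ["a" "b"]
(count (clojure.core/chars (char-array "ab")))  ; core fn still reachable => 2
```

Callers who require the namespace with an alias, as in (require '[pdf-oxide.core :as pdf]), are unaffected. Only referring the name unqualified (for example with :refer :all) would make it clash with clojure.core/chars in the calling namespace.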

(tables [^PdfPage page])

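Returns a vector of the tables detected on the page.

(images [^PdfPage page])

Returns a vector of the image elements on the page.

(annotations [^PdfPage page])

Returns a vector of the annotations on the page.

Page text

(page-text [^PdfPage page])
(page-text [^PdfPage page region])

Returns the page's plain text, optionally restricted to a BBox region.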

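;; pdf-bytes is a byte[] holding a PDF; BBox is fyi.oxide.pdf.geometry.BBox
(with-open [d (pdf/open pdf-bytes)]
  (let [pg (pdf/page d 0)]
    {:words  (mapv #(.text %) (pdf/words pg))                     ; word strings, realized eagerly
     :region (pdf/page-text pg (BBox. 0.0 0.0 1000.0 1000.0))}))  ; region text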
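One general Clojure pitfall applies when mapping over elements inside with-open: map is lazy, so its work may run only after the handle has been closed. Whether PDF Oxide's element objects remain usable after their document closes is not stated here, so the sketch below assumes the conservative case and uses a hypothetical FakeHandle in place of a real native handle.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in for a native handle: reads fail once it is closed.
(defrecord FakeHandle [open]
  java.lang.AutoCloseable
  (close [_] (reset! open false)))

(defn read-word
  "Upper-cases w if the handle is still open; throws otherwise."
  [handle w]
  (if @(:open handle)
    (str/upper-case w)
    (throw (IllegalStateException. "handle is closed"))))

(defn eager-words [ws]
  (with-open [h (->FakeHandle (atom true))]
    (mapv #(read-word h %) ws)))   ; realized before with-open closes h

(defn lazy-words [ws]
  (with-open [h (->FakeHandle (atom true))]
    (map #(read-word h %) ws)))    ; realized by the caller, after h is closed

(eager-words ["a" "b"])            ; => ["A" "B"]
;; (doall (lazy-words ["a"]))      ; throws IllegalStateException: handle is closed
```

Using mapv, doall, or into inside the with-open body avoids the problem.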

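DocumentEditor: Editing and Redaction

A mutable editing handle, opened independently of PdfDocument. It supports scrubbing metadata and irreversible redaction, after which the result is serialized to bytes. Implements AutoCloseable.

(editor ^DocumentEditor [source])

Opens a DocumentEditor from a byte array or a filesystem path string.

(scrub-metadata [^DocumentEditor ed])

Removes document metadata (Info dictionary / XMP) in place.

(add-redaction [^DocumentEditor ed page region])

Marks a rectangular BBox region on a zero-based page for redaction.

(apply-redactions [^DocumentEditor ed])

Irreversibly applies all pending redactions, removing the underlying content.

(editor-save ^bytes [^DocumentEditor ed])

Serializes the edited document to a byte array.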

(with-open [ed (pdf/editor pdf-bytes)]
  (pdf/scrub-metadata ed)
  (pdf/add-redaction ed 0 (BBox. 10.0 10.0 50.0 20.0))
  (pdf/apply-redactions ed)
  (pdf/editor-save ed))

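AutoExtractor: Automatic Extraction

A convenience extractor that automatically selects an extraction strategy for a PdfDocument.

(auto-extractor ^AutoExtractor [^PdfDocument doc])

Creates an AutoExtractor for the given document.

(auto-text [^AutoExtractor ax])

Extracts text from the whole document using the automatically selected strategy.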

(with-open [d  (pdf/open pdf-bytes)
            ax (pdf/auto-extractor d)]   ; AutoExtractor is AutoCloseable too
  (pdf/auto-text ax))

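Lifecycle

The handle types all implement AutoCloseable; prefer with-open for deterministic resource cleanup. The functions below are an escape hatch for code that cannot use with-open.

(close [resource])

Closes any handle (Pdf, PdfDocument, PdfPage, DocumentEditor, AutoExtractor).

(open? [resource])

Returns whether the handle is still open.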

(let [d (pdf/open pdf-bytes)]
  (pdf/open? d)        ; => true
  (pdf/close d)
  (pdf/open? d))       ; => false
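When with-open does not fit (for example, when the handle is opened by one function and consumed by another), a try/finally block gives the same cleanup guarantee. The sketch below is one way to package that; call-with-handle is a hypothetical helper, and it calls .close through AutoCloseable, which the handle types are documented to implement (pdf/close would serve the same purpose).

```clojure
(defn call-with-handle
  "Opens a handle with open-fn, passes it to f, and always closes it,
  even when f throws. Returns the result of f."
  [open-fn f]
  (let [h (open-fn)]
    (try
      (f h)
      (finally
        (.close ^java.lang.AutoCloseable h)))))

(comment
  ;; Assumes pdf-oxide.core is on the classpath and the native library loads.
  (require '[pdf-oxide.core :as pdf])
  (call-with-handle #(pdf/open "report.pdf") pdf/page-count))
```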

Complete Example

(require '[pdf-oxide.core :as pdf])
(import '[fyi.oxide.pdf.geometry BBox])

;; --- Creation + Extraction ---
(with-open [p (pdf/from-markdown "# Report\n\nGenerated by PDF Oxide.\n")
            d (pdf/open (pdf/save p))]
  (println "Pages:" (pdf/page-count d))
  (println (pdf/extract-text d 0))
  (println (pdf/to-markdown d))
  (println (pdf/to-html d 0))

  ;; Page elements (List -> vector)
  (let [pg (pdf/page d 0)]
    (println "Words:" (count (pdf/words pg)))
    (doseq [w (pdf/words pg)] (print (.text w) "")))

  ;; Search
  (doseq [m (pdf/search d "Report")]
    (println "Match:" (.text m)))

  ;; Metadata (Optional -> nil)
  (println "Producer:" (or (pdf/producer d) "(none)"))

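  ;; Render (spit would write the array's string form, so write the raw bytes)
  (with-open [out (java.io.FileOutputStream. "page0.png")]
    (.write out ^bytes (pdf/render d 0 150))))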

;; --- Editing + Redaction ---
(with-open [ed (pdf/editor pdf-bytes)]
  (pdf/scrub-metadata ed)
  (pdf/add-redaction ed 0 (BBox. 10.0 10.0 50.0 20.0))
  (pdf/apply-redactions ed)
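  ;; Write raw bytes; spit would write the array's string form instead
  (with-open [out (java.io.FileOutputStream. "redacted.pdf")]
    (.write out ^bytes (pdf/editor-save ed))))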

Other Language Bindings

PDF Oxide provides native bindings for all major ecosystems: Rust, Python, Node.js, WASM, C#, Golang, Java, PHP, Ruby, C++, Swift, Kotlin, Dart, R, Julia, Zig, Scala, Objective-C, Elixir.

Next Steps

Types & Enums: all shared types and enums
Page API Reference: consistent page-by-page iteration across bindings
Clojure Quick Start: tutorial