What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

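PDF Oxide Quick Start (Elixir)

PDF Oxide is the fastest way to read and write PDFs in Elixir: text extraction averages 0.8ms, with a 100% pass rate on 3,830 PDFs. It is a NIF built on the same Rust core, and it runs CPU-bound work on the dirty CPU schedulers (ERL_NIF_DIRTY_JOB_CPU_BOUND), so it never blocks the normal BEAM schedulers.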

Document and Pdf handles are NIF resources that the GC releases. Fallible functions return {:ok, value} or {:error, code}, and page indices start at 0.
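
Because the CPU-bound work runs on dirty schedulers, extraction can be fanned out across ordinary BEAM processes. The following is a minimal sketch, not part of pdf_oxide: BatchExtract and first_pages/1 are hypothetical names, and it assumes each task opens its own document so that no handle is shared between processes. It uses only PdfOxide.open/1 and PdfOxide.extract_text/2 from this guide, plus Task.async_stream/3 from the standard library.

```elixir
defmodule BatchExtract do
  # Hypothetical helper, not part of pdf_oxide: extracts the text of page 0
  # from every path concurrently. Each task opens its own document.
  def first_pages(paths) do
    paths
    |> Task.async_stream(&first_page/1,
      max_concurrency: System.schedulers_online(),
      timeout: :infinity
    )
    # With an infinite timeout every stream element is {:ok, task_result}.
    |> Enum.map(fn {:ok, result} -> result end)
  end

  # Returns {path, {:ok, text}} or {path, {:error, code}}.
  defp first_page(path) do
    result =
      case PdfOxide.open(path) do
        {:ok, doc} -> PdfOxide.extract_text(doc, 0)
        {:error, _code} = error -> error
      end

    {path, result}
  end
end
```

Calling BatchExtract.first_pages(["a.pdf", "b.pdf"]) returns one {path, result} tuple per input, in input order, so a failure on one file does not abort the others.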

Installation

Add pdf_oxide to your dependencies in mix.exs:

def deps do
  [
    {:pdf_oxide, "~> 0.3"}
  ]
end

Then fetch and compile. The NIF is built against the native cdylib via elixir_make:

mix deps.get
mix compile

Quick Start

Build a PDF from Markdown, serialize it to bytes, then reopen it and extract the text back out.

{:ok, pdf}   = PdfOxide.from_markdown("# Hello pdf_oxide\n\nThis is an **Elixir** binding.\n")
{:ok, bytes} = PdfOxide.to_bytes(pdf)
{:ok, doc}   = PdfOxide.open_from_bytes(bytes)

{:ok, pages} = PdfOxide.page_count(doc)
IO.puts("pages: #{pages}")

%{major: maj, minor: min} = PdfOxide.version(doc)
IO.puts("version: #{maj}.#{min}")

{:ok, text} = PdfOxide.extract_text(doc, 0)
IO.puts(text)

Opening PDFs

Open a document from a file path, or directly from bytes already in memory (useful when streaming from S3, HTTP, or a database):

# Open from a path
{:ok, doc} = PdfOxide.open("report.pdf")

# Open from bytes already in memory
{:ok, doc} = PdfOxide.open_from_bytes(pdf_bytes)

# Encrypted document
{:ok, doc} = PdfOxide.open_with_password("confidential.pdf", "secret")

# Inspect
{:ok, count} = PdfOxide.page_count(doc)
encrypted? = PdfOxide.encrypted?(doc)

When you are done, you can close the document explicitly (close/1 is idempotent) or leave it to the GC:

:ok = PdfOxide.close(doc)
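
When a handle should be released at a known point rather than whenever the GC runs, the open and close calls can be paired in one place. The sketch below is illustrative only: ScopedPdf and with_document/2 are hypothetical names, not part of pdf_oxide. It relies on PdfOxide.open/1 and PdfOxide.close/1 as shown above.

```elixir
defmodule ScopedPdf do
  # Hypothetical helper: opens `path`, passes the document to `fun`, and
  # closes the document afterwards, even if `fun` raises.
  def with_document(path, fun) when is_function(fun, 1) do
    case PdfOxide.open(path) do
      {:ok, doc} ->
        try do
          fun.(doc)
        after
          PdfOxide.close(doc)
        end

      {:error, _code} = error ->
        error
    end
  end
end
```

For example, ScopedPdf.with_document("report.pdf", &PdfOxide.page_count/1) returns whatever page_count/1 returns, or the {:error, code} tuple from open/1 if the file cannot be opened.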

Text Extraction

Extract the plain text of a single page by its zero-based index, or pull the whole document at once:

{:ok, doc} = PdfOxide.open("book.pdf")

# Single page
{:ok, text} = PdfOxide.extract_text(doc, 0)

# Plain text, single page
{:ok, pt} = PdfOxide.to_plain_text(doc, 0)

# All pages concatenated
{:ok, all} = PdfOxide.to_plain_text_all(doc)
IO.puts(all)
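
If you need the text page by page rather than as one concatenated string, the zero-based indices can be walked from 0 up to page_count - 1. This is a minimal sketch under that assumption: PageText and all_pages/1 are hypothetical names, and only PdfOxide.page_count/1 and PdfOxide.extract_text/2 from this guide are called.

```elixir
defmodule PageText do
  # Hypothetical helper: returns {:ok, [page_0_text, page_1_text, ...]},
  # or the first {:error, code} encountered.
  def all_pages(doc) do
    with {:ok, count} <- PdfOxide.page_count(doc) do
      collect(doc, 0, count, [])
    end
  end

  # Stop once the index reaches the page count; valid indices are 0..count-1.
  defp collect(_doc, index, count, acc) when index >= count do
    {:ok, Enum.reverse(acc)}
  end

  defp collect(doc, index, count, acc) do
    case PdfOxide.extract_text(doc, index) do
      {:ok, text} -> collect(doc, index + 1, count, [text | acc])
      {:error, _code} = error -> error
    end
  end
end
```

Explicit recursion is used instead of a range so that a document reporting zero pages yields {:ok, []} without touching any page index.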

Markdown and HTML Conversion

Convert a single page, or the whole document, to Markdown or HTML:

{:ok, doc} = PdfOxide.open("paper.pdf")

{:ok, md}    = PdfOxide.to_markdown(doc, 0)
{:ok, mdall} = PdfOxide.to_markdown_all(doc)

{:ok, html}    = PdfOxide.to_html(doc, 0)
{:ok, htmlall} = PdfOxide.to_html_all(doc)

Words and Lines

extract_words/2 returns structured PdfOxide.Word structs, each with a bounding box and a bold flag; extract_text_lines/2 groups them into lines.

{:ok, doc} = PdfOxide.open("paper.pdf")

{:ok, words} = PdfOxide.extract_words(doc, 0)

for w <- Enum.take(words, 10) do
  %PdfOxide.Bbox{x: x, y: y, width: width} = w.bbox
  IO.puts("#{w.text} at (#{x}, #{y}) w=#{width} bold=#{w.bold}")
end

{:ok, lines} = PdfOxide.extract_text_lines(doc, 0)

for line <- lines do
  IO.puts("#{line.word_count} words: #{line.text}")
end

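Search

Search within a single page or across the whole document. The final argument is case_sensitive (the fourth argument of search/4, the third of search_all/3). Each result carries text, page, and a PdfOxide.Bbox.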

{:ok, doc} = PdfOxide.open("manual.pdf")

# Single page (page index 0), case-insensitive
{:ok, results} = PdfOxide.search(doc, 0, "configuration", false)

for r <- results do
  %PdfOxide.Bbox{x: x, y: y} = r.bbox
  IO.puts("page #{r.page}: '#{r.text}' at (#{x}, #{y})")
end

# All pages
{:ok, all} = PdfOxide.search_all(doc, "configuration", false)
IO.puts("#{length(all)} matches")
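
Since every result carries its page, document-wide results can be summarized per page with the standard library alone. A small sketch, assuming only the page field described above; MatchSummary and per_page/1 are hypothetical names, not part of pdf_oxide.

```elixir
defmodule MatchSummary do
  # Hypothetical helper: maps each page index to its number of matches.
  # Only the `page` field of each result is read.
  def per_page(results) do
    Enum.frequencies_by(results, fn result -> result.page end)
  end
end
```

Passing the list returned by PdfOxide.search_all/3 to MatchSummary.per_page/1 gives a map from page index to match count. Pages without matches are simply absent from the map.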

Creating PDFs

The builder factory functions return a Pdf handle that you can serialize with to_bytes/1 or write straight to disk with save/2:

{:ok, pdf} = PdfOxide.from_markdown("# Hello World\n\nThis is a PDF.")
:ok = PdfOxide.save(pdf, "output.pdf")

{:ok, pdf} = PdfOxide.from_html("<h1>Invoice</h1><p>Amount: $42</p>")
{:ok, bytes} = PdfOxide.to_bytes(pdf)

{:ok, pdf} = PdfOxide.from_text("Plain text content.")
:ok = PdfOxide.save(pdf, "notes.pdf")

Rendering Pages to Images

With the rendering feature enabled, you can rasterize a page into a PdfOxide.RenderedImage and save it as a PNG:

{:ok, doc} = PdfOxide.open("paper.pdf")

{:ok, img} = PdfOxide.render_page(doc, 0)
IO.puts("#{img.width}x#{img.height}, #{byte_size(img.data)} bytes")
:ok = PdfOxide.save(img, "page0.png")

# Zoom factor, or a fixed-size thumbnail
{:ok, zoomed} = PdfOxide.render_page_zoom(doc, 0, 2.0)
{:ok, thumb}  = PdfOxide.render_page_thumbnail(doc, 0, 128)

Error Handling

Fallible functions return a tagged tuple, so pattern matching gives you clear control flow:

case PdfOxide.open("/nonexistent/nope.pdf") do
  {:ok, doc} ->
    {:ok, text} = PdfOxide.extract_text(doc, 0)
    IO.puts(text)

  {:error, code} ->
    IO.puts("could not open PDF: #{inspect(code)}")
end
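
When several fallible calls are chained, Elixir's with form lets the first {:error, code} fall through without nested case expressions. The following is a sketch rather than a prescribed pattern: FirstPage, text/1, and the :empty_document code are hypothetical and not defined by pdf_oxide. The PdfOxide calls are the ones shown elsewhere in this guide.

```elixir
defmodule FirstPage do
  # Hypothetical helper: returns {:ok, text} for page 0, or the first
  # {:error, code} produced by any step in the chain.
  def text(path) do
    with {:ok, doc} <- PdfOxide.open(path),
         {:ok, count} <- PdfOxide.page_count(doc),
         :ok <- ensure_pages(count),
         {:ok, text} <- PdfOxide.extract_text(doc, 0) do
      {:ok, text}
    end
  end

  # :empty_document is an illustrative code, not one defined by pdf_oxide.
  defp ensure_pages(0), do: {:error, :empty_document}
  defp ensure_pages(_count), do: :ok
end
```

The caller then needs a single case on the result of FirstPage.text/1, with one {:error, code} branch covering every step.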

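Next Steps

Rust Quick Start: using PDF Oxide from Rust
Python Quick Start: using PDF Oxide from Python
Text Extraction: detailed extraction options and examples
Creating PDFs: advanced creation, including metadata and encryption
Editing: modifying existing PDFs, annotations, and form fields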