What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

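Getting Started with PDF Oxide (R)

PDF Oxide provides idiomatic native R bindings for fast extraction of text, Markdown, and HTML from PDFs (0.8ms mean text extraction time, 100% pass rate on 3,830 PDFs), built on the same Rust core as every other binding. The R package wraps the pdf_oxide C ABI through R's .Call interface. Document handles are R external pointers released by the garbage collector, and page indices are 0-based, matching the underlying engine.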

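Installation

The R package links against the cdylib built with the default binding features. Build the native library first, then install the package, pointing it at the header files and the cdylib: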

# 1. build the native library (shipped binding feature set)
cargo build --release --lib \
  --features ocr,rendering,signatures,barcodes,tsa-client,system-fonts

# 2. install the R package
PDF_OXIDE_INCLUDE_DIR="$PWD/include" PDF_OXIDE_LIB_DIR="$PWD/target/release" \
  R CMD INSTALL r/

At run time, the dynamic linker must be able to find the cdylib:

LD_LIBRARY_PATH="$PWD/target/release" Rscript your_script.R

Opening a PDF

Open a file with pdf_open(), then inspect its metadata. pdf_version() returns a named list with major and minor elements.

library(pdfoxide)

doc <- pdf_open("research-paper.pdf")

pdf_page_count(doc)               # number of pages
v <- pdf_version(doc)
cat("PDF version:", paste(v$major, v$minor, sep = "."), "\n")
pdf_is_encrypted(doc)             # logical

Text Extraction

Use pdf_extract_text() to extract the text of a single page in reading order. The page index is 0-based.

library(pdfoxide)

doc <- pdf_open("report.pdf")
text <- pdf_extract_text(doc, 0)  # 0-based page index
cat(text)

Use pdf_page_count() to iterate over every page:

doc <- pdf_open("book.pdf")
for (page in seq_len(pdf_page_count(doc)) - 1L) {   # 0-based indices
  cat("--- Page", page + 1L, "---\n")
  cat(pdf_extract_text(doc, page), "\n")
}
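
When adapting the loop above, it may help to keep the seq_len(n) - 1L idiom, because it stays correct for a document with zero pages, whereas 0:(n - 1) would count backwards. A minimal base-R sketch of the difference (the helper name page_indices is hypothetical and not part of pdfoxide):

```r
# Hypothetical helper: 0-based page indices for a document with n pages.
page_indices <- function(n) seq_len(n) - 1L

idx_three <- page_indices(3L)   # indices for a three-page document
idx_empty <- page_indices(0L)   # empty vector, so a for loop body never runs

# The colon operator counts down instead of producing an empty range.
bad_empty <- 0:(0L - 1L)

print(idx_three)
print(length(idx_empty))
print(bad_empty)
```

Looping over page_indices(pdf_page_count(doc)) would then behave the same as the loop shown above.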

Markdown and HTML

You can convert a single page to Markdown or HTML, or convert the whole document in one call.

library(pdfoxide)

doc <- pdf_open("paper.pdf")

md  <- pdf_to_markdown(doc, 0)    # one page as Markdown
html <- pdf_to_html(doc, 0)       # one page as HTML

all_md   <- pdf_to_markdown_all(doc)    # whole document
all_text <- pdf_to_plain_text_all(doc)  # whole document, plain text

cat(all_md)

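Words, Characters, and Lines

Element extraction returns a list of records, each with a positioned bounding box. Every bbox is a named list with x, y, width, and height.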

library(pdfoxide)

doc <- pdf_open("paper.pdf")

# Positioned words — each has $text, $bbox, $font_name, $font_size, $bold
words <- pdf_extract_words(doc, 0)
for (w in head(words, 10)) {
  cat(sprintf("'%s' at (%.1f, %.1f) font=%s bold=%s\n",
              w$text, w$bbox$x, w$bbox$y, w$font_name, w$bold))
}

# Reading-order lines — each has $text, $bbox, $word_count
lines <- pdf_extract_text_lines(doc, 0)
for (ln in head(lines, 5)) {
  cat(sprintf("[%d words] %s\n", ln$word_count, ln$text))
}

# Positioned characters — $character is the Unicode codepoint (integer)
chars <- pdf_extract_chars(doc, 0)
for (ch in head(chars, 10)) {
  cat(sprintf("'%s' at (%.1f, %.1f) size=%.1f\n",
              intToUtf8(ch$character), ch$bbox$x, ch$bbox$y, ch$font_size))
}
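
Lists of records like these are often easier to filter and sort once flattened into a data frame. The sketch below uses only base R and runs on mock records that imitate the word fields described above. The helper name words_to_df and the sample values are assumptions for illustration, not part of pdfoxide.

```r
# Mock records shaped like the documented word fields (values are made up).
words <- list(
  list(text = "Hello", bbox = list(x = 72, y = 700, width = 30, height = 12),
       font_name = "Helvetica", font_size = 12, bold = FALSE),
  list(text = "World", bbox = list(x = 110, y = 700, width = 32, height = 12),
       font_name = "Helvetica-Bold", font_size = 12, bold = TRUE)
)

# Hypothetical helper: flatten a list of word records into one row per word.
words_to_df <- function(words) {
  data.frame(
    text      = vapply(words, function(w) w$text, character(1)),
    x         = vapply(words, function(w) w$bbox$x, numeric(1)),
    y         = vapply(words, function(w) w$bbox$y, numeric(1)),
    width     = vapply(words, function(w) w$bbox$width, numeric(1)),
    height    = vapply(words, function(w) w$bbox$height, numeric(1)),
    font_name = vapply(words, function(w) w$font_name, character(1)),
    font_size = vapply(words, function(w) w$font_size, numeric(1)),
    bold      = vapply(words, function(w) w$bold, logical(1)),
    stringsAsFactors = FALSE
  )
}

df <- words_to_df(words)
bold_words <- df$text[df$bold]   # keep only the bold words
print(df)
```

With the package installed, the same helper could presumably be applied to the result of pdf_extract_words(doc, 0), provided the records carry the fields listed above.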

Tables

pdf_extract_tables() returns the detected tables. Each table record carries row_count, col_count, has_header, and cells, a 1-indexed character matrix accessed as tbl$cells[row, col].

library(pdfoxide)

doc <- pdf_open("statement.pdf")
tables <- pdf_extract_tables(doc, 0)

for (tbl in tables) {
  cat(sprintf("Table: %d rows x %d cols (header=%s)\n",
              tbl$row_count, tbl$col_count, tbl$has_header))
  for (r in seq_len(tbl$row_count)) {
    cat(paste(tbl$cells[r, ], collapse = " | "), "\n")
  }
}
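
Because cells is a plain character matrix, it can be turned into a data frame with base R. The sketch below promotes the first row to column names when has_header is TRUE. It runs on a mock table record, and the helper name table_to_df and the sample data are assumptions for illustration.

```r
# Mock table record with the documented fields (contents are made up).
tbl <- list(
  row_count = 3L, col_count = 2L, has_header = TRUE,
  cells = matrix(c("Item",   "Amount",
                   "Coffee", "3.50",
                   "Tea",    "2.75"),
                 nrow = 3, byrow = TRUE)
)

# Hypothetical helper: convert a table record to a data frame.
table_to_df <- function(tbl) {
  cells <- tbl$cells
  if (isTRUE(tbl$has_header) && nrow(cells) > 1L) {
    # Drop the header row from the body and use it for the column names.
    out <- as.data.frame(cells[-1, , drop = FALSE], stringsAsFactors = FALSE)
    names(out) <- make.unique(cells[1, ])
  } else {
    out <- as.data.frame(cells, stringsAsFactors = FALSE)
  }
  out
}

tbl_df <- table_to_df(tbl)
print(tbl_df)
```

All columns remain character vectors, so numeric columns would still need an explicit as.numeric() conversion.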

Search

Search a single page with pdf_search(), or the whole document with pdf_search_all(). Both accept an optional case_sensitive argument (default FALSE) and return records with text, page, and bbox.

library(pdfoxide)

doc <- pdf_open("manual.pdf")

# Whole document
hits <- pdf_search_all(doc, "configuration")
for (h in hits) {
  cat(sprintf("Page %d: '%s' at (%.0f, %.0f)\n",
              h$page, h$text, h$bbox$x, h$bbox$y))
}

# Single page, case-sensitive
page_hits <- pdf_search(doc, 0, "Configuration", case_sensitive = TRUE)
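
A common follow-up is to count how many hits fall on each page. The base-R sketch below does this on mock hit records with the text, page, and bbox fields named above. The helper name hits_per_page and the page values are assumptions, and the source does not say whether the page field is 0-based or 1-based.

```r
# Mock hit records (values are made up for illustration).
hits <- list(
  list(text = "configuration", page = 0L,
       bbox = list(x = 72, y = 640, width = 80, height = 11)),
  list(text = "configuration", page = 2L,
       bbox = list(x = 90, y = 512, width = 80, height = 11)),
  list(text = "Configuration", page = 2L,
       bbox = list(x = 72, y = 300, width = 82, height = 11))
)

# Hypothetical helper: named integer vector of hit counts keyed by page.
hits_per_page <- function(hits) {
  pages <- vapply(hits, function(h) as.integer(h$page), integer(1))
  vapply(split(pages, pages), length, integer(1))
}

counts <- hits_per_page(hits)
print(counts)
```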

Opening from Bytes

Use pdf_open_from_bytes() to open a PDF held in memory, which is especially handy when the data comes from S3, HTTP, or a database. It takes a raw vector.

library(pdfoxide)

bytes <- readBin("report.pdf", "raw", file.info("report.pdf")$size)
doc <- pdf_open_from_bytes(bytes)
cat(pdf_extract_text(doc, 0))
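
When the bytes come from a network response or a database column, a quick sanity check before opening can catch error pages or truncated payloads early. PDF files start with the marker %PDF-, so the base-R sketch below tests for it. The helper name looks_like_pdf is hypothetical, and the check is deliberately strict: it would reject a file with stray bytes before the header, even though some readers tolerate that.

```r
# Hypothetical helper: TRUE if a raw vector begins with the "%PDF-" marker.
looks_like_pdf <- function(bytes) {
  is.raw(bytes) &&
    length(bytes) >= 5L &&
    identical(bytes[1:5], charToRaw("%PDF-"))
}

ok_pdf   <- looks_like_pdf(charToRaw("%PDF-1.7\n"))        # PDF header
ok_html  <- looks_like_pdf(charToRaw("<html>Not found"))   # e.g. an HTTP error page
ok_short <- looks_like_pdf(raw(0))                          # empty payload

print(c(ok_pdf, ok_html, ok_short))
```

A caller could run this check on the raw vector and only then pass it to pdf_open_from_bytes().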

Password-Protected PDFs

Open an encrypted document with pdf_open_with_password(), or call pdf_authenticate() after opening. It returns TRUE on success and FALSE when the password is wrong.

library(pdfoxide)

doc <- pdf_open_with_password("confidential.pdf", "secret")
cat(pdf_extract_text(doc, 0))

Creating PDFs

The builder functions create a pdfoxide_pdf from Markdown, HTML, or plain text. Save it to a path with pdf_save(), or serialize it to a raw vector with pdf_to_bytes(), which can later be reopened with pdf_open_from_bytes().

library(pdfoxide)

pdf <- pdf_from_markdown("# Hello World\n\nThis is a PDF.\n")
pdf_save(pdf, "output.pdf")

pdf_from_html("<h1>Invoice</h1><p>Amount due: $42.00</p>") |>
  pdf_save("invoice.pdf")

pdf_from_text("Plain text document.\n\nSecond paragraph.") |>
  pdf_save("notes.pdf")

# Round-trip through bytes
bytes <- pdf_to_bytes(pdf_from_markdown("# In memory\n\nbody\n"))
doc <- pdf_open_from_bytes(bytes)
cat(pdf_extract_text(doc, 0))
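
In practice the Markdown handed to pdf_from_markdown() is usually assembled from data rather than typed as a literal. A minimal base-R sketch of building such a string follows. The helper name md_report and the sample content are assumptions for illustration.

```r
# Hypothetical helper: build a Markdown document from a title and bullet items.
md_report <- function(title, items) {
  bullets <- if (length(items)) paste0("- ", items, collapse = "\n") else ""
  paste0("# ", title, "\n\n", bullets, "\n")
}

md <- md_report("Weekly Summary", c("3 invoices parsed", "1 file skipped"))
cat(md)

# With pdfoxide installed, the string could then be passed to the builder
# functions shown above, for example:
#   pdf_save(pdf_from_markdown(md), "summary.pdf")
```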

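Next Steps

Python Getting Started: using PDF Oxide from Python
Rust Getting Started: the underlying Rust crate
Text Extraction: detailed extraction options and practical examples
PDF Creation: advanced creation with builders, encryption, and metadata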