What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

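Text Search

PDF Oxide provides full-text search across PDF documents, with support for regular expressions, case-insensitive matching, whole-word matching, and a bounding box for every match. Each result includes the page number, the matched text, and the exact coordinates of the match, which makes search-and-highlight workflows straightforward.

For custom multi-page queries, use TextSearcher::search(); for common cases, use the convenience methods on Pdf (search(), search_page(), highlight_matches()).

Quick Example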

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
results = doc.search("conclusion", case_insensitive=True)
for r in results:
    print(f"Page {r['page']}: '{r['text']}' at ({r['x']:.1f}, {r['y']:.1f})")

Node.js

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("report.pdf");
const results = doc.searchAll("conclusion", { caseSensitive: false });
for (const r of results) {
  console.log(`Page ${r.page}: '${r.text}' at (${r.x.toFixed(1)}, ${r.y.toFixed(1)})`);
}
doc.close();

Go

import (
    "fmt"

    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
results, _ := doc.SearchAll("conclusion", false)
for _, r := range results {
    fmt.Printf("Page %d: '%s' at (%.1f, %.1f)\n", r.Page, r.Text, r.X, r.Y)
}

C#

using PdfOxide.Core;

using var doc = PdfDocument.Open("report.pdf");
var results = doc.SearchAll("conclusion");
foreach (var r in results)
{
    Console.WriteLine($"Page {r.Page}: '{r.Text}' at ({r.X:F1}, {r.Y:F1})");
}

WASM

const doc = new WasmPdfDocument(bytes);
const results = doc.search("conclusion");
for (const r of results) {
    console.log(`Page ${r.page}: '${r.text}' at (${r.x.toFixed(1)}, ${r.y.toFixed(1)})`);
}

Rust

use pdf_oxide::api::Pdf;

let mut pdf = Pdf::open("report.pdf")?;
let results = pdf.search("conclusion")?;
for r in &results {
    println!("Page {}: '{}' at ({:.1}, {:.1})", r.page, r.text, r.bbox.x, r.bbox.y);
}

Java

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.search.SearchMatch;
import java.nio.file.Path;
import java.util.List;

try (PdfDocument doc = PdfDocument.open(Path.of("report.pdf"))) {
    List<SearchMatch> results = doc.search("conclusion", true, false, 0);
    for (SearchMatch m : results) {
        System.out.printf("Page %d: '%s' at (%.1f, %.1f)%n",
            m.pageIndex(), m.text(), m.bbox().x0(), m.bbox().y0());
    }
}

Kotlin

import fyi.oxide.pdf.PdfDocument
import java.nio.file.Path

PdfDocument.open(Path.of("report.pdf")).use { doc ->
    val results = doc.search("conclusion", true, false, 0)
    for (m in results) {
        println("Page ${m.pageIndex()}: '${m.text()}' at (${m.bbox().x0()}, ${m.bbox().y0()})")
    }
}

Scala

import fyi.oxide.pdf.{PdfDocument, searchSeq}
import scala.util.Using

Using.resource(PdfDocument.open("report.pdf")) { doc =>
  val results = doc.searchSeq("conclusion")
  for (m <- results)
    println(f"Page ${m.pageIndex}: '${m.text}' at (${m.bbox.x0}%.1f, ${m.bbox.y0}%.1f)")
}

Clojure

(require '[pdf-oxide.core :as pdf])

(with-open [doc (pdf/open "report.pdf")]
  (doseq [m (pdf/search doc "conclusion")]
    (printf "Page %d: '%s' at (%.1f, %.1f)%n"
            (.pageIndex m) (.text m) (.x0 (.bbox m)) (.y0 (.bbox m)))))

Ruby

require 'pdf_oxide'

PdfOxide::PdfDocument.open('report.pdf') do |doc|
  doc.search('conclusion', case_sensitive: false).each do |r|
    bbox = r[:bbox]
    printf("Page %d: '%s' at (%.1f, %.1f)\n", r[:page], r[:text], bbox[:x], bbox[:y])
  end
end

C++

#include <pdf_oxide/pdf_oxide.hpp>
#include <cstdio>

auto doc = pdf_oxide::Document::open("report.pdf");
auto results = doc.search_all("conclusion", /*case_sensitive=*/false);
for (const auto& r : results) {
    std::printf("Page %d: '%s' at (%.1f, %.1f)\n",
                r.page, r.text.c_str(), r.bbox.x, r.bbox.y);
}

Swift

import PdfOxide

let doc = try Document.open("report.pdf")
let results = try doc.searchAll("conclusion", false)
for r in results {
    print("Page \(r.page): '\(r.text)' at (\(r.bbox.x), \(r.bbox.y))")
}

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('report.pdf');
final results = doc.searchAll('conclusion', false);
for (final r in results) {
  print("Page ${r.page}: '${r.text}' at (${r.bbox.x}, ${r.bbox.y})");
}
doc.close();

R

library(pdfoxide)

doc <- pdf_open("report.pdf")
results <- pdf_search_all(doc, "conclusion", case_sensitive = FALSE)
for (r in results) {
  cat(sprintf("Page %d: '%s' at (%.1f, %.1f)\n",
              r$page, r$text, r$bbox$x, r$bbox$y))
}

Julia

using PdfOxide

doc = open_document("report.pdf")
results = search_all(doc, "conclusion", false)
for r in results
    println("Page $(r.page): '$(r.text)' at ($(r.bbox.x), $(r.bbox.y))")
end

Zig

const std = @import("std");
const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("report.pdf");
const results = try doc.searchAll(a, "conclusion", false);
defer doc.freeSearchResults(a, results);
for (results) |r| {
    std.debug.print("Page {d}: '{s}' at ({d:.1}, {d:.1})\n", .{ r.page, r.text, r.bbox.x, r.bbox.y });
}

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
NSArray<POXSearchResult*> *results = [doc searchAll:@"conclusion" caseSensitive:NO error:&err];
for (POXSearchResult *r in results) {
    NSLog(@"Page %ld: '%@' at (%.1f, %.1f)", (long)r.page, r.text, r.bbox.x, r.bbox.y);
}

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")
{:ok, results} = PdfOxide.search_all(doc, "conclusion", false)

for r <- results do
  IO.puts("Page #{r.page}: '#{r.text}' at (#{r.bbox.x}, #{r.bbox.y})")
end

API Reference

`TextSearcher::search(doc, pattern, options) -> Vec<SearchResult>`

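Searches for text across multiple pages of a PDF document. Unless literal mode is enabled, pattern is compiled as a regular expression.

Parameter	Type	Description
`doc`	`&mut PdfDocument`	The PDF document to search
`pattern`	`&str`	Regular expression pattern (plain text when `literal` is set)
`options`	`&SearchOptions`	Search configuration

Returns: a vector of SearchResult objects sorted by page number and position.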

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::search::{TextSearcher, SearchOptions};

let mut doc = PdfDocument::open("report.pdf")?;

let options = SearchOptions::new()
    .with_case_insensitive(true)
    .with_max_results(50);

let results = TextSearcher::search(&mut doc, "error|warning", &options)?;
for r in &results {
    println!("Page {}: '{}'", r.page, r.text);
}

`TextSearcher::search_page(doc, page, regex, options) -> Vec<SearchResult>`

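Searches for text on a single page using a precompiled regular expression.

Parameter	Type	Description
`doc`	`&mut PdfDocument`	The PDF document
`page`	`usize`	Zero-based page index
`regex`	`&Regex`	Precompiled regular expression pattern
`options`	`&SearchOptions`	Search configuration

Returns: a vector of SearchResult objects for the given page.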

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::search::{TextSearcher, SearchOptions};
use regex::Regex;

let mut doc = PdfDocument::open("report.pdf")?;
let regex = Regex::new(r"\d{4}-\d{2}-\d{2}")?; // Date pattern
let options = SearchOptions::default();

let results = TextSearcher::search_page(&mut doc, 0, &regex, &options)?;
for r in &results {
    println!("Date found: '{}' at ({:.1}, {:.1})", r.text, r.bbox.x, r.bbox.y);
}

SearchOptions

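Configuration for text search behavior, with a builder pattern for ease of use.

Field	Type	Default	Description
`case_insensitive`	`bool`	`false`	Ignore case when matching
`literal`	`bool`	`false`	Treat the pattern as plain text (regex metacharacters are escaped)
`whole_word`	`bool`	`false`	Match whole words only (wraps the pattern in `\b...\b`)
`max_results`	`usize`	`0`	Maximum number of results to return (0 = unlimited)
`page_range`	`Option<(usize, usize)>`	`None`	Page range to search (start and end pages inclusive)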
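
To make the option semantics concrete, here is a minimal sketch in plain Python that mimics them with the standard `re` module over a list of page strings. It does not use PDF Oxide; `build_pattern` and `search_pages` are hypothetical helper names, and wrapping the pattern in a non-capturing group before adding `\b` is an assumption about how whole-word mode could be implemented, not a description of the library's internals.

```python
import re

def build_pattern(pattern, case_insensitive=False, literal=False, whole_word=False):
    """Compile a pattern following the option descriptions (illustrative only)."""
    if literal:
        pattern = re.escape(pattern)      # metacharacters become plain text
    if whole_word:
        pattern = rf"\b(?:{pattern})\b"   # word boundary on both sides
    flags = re.IGNORECASE if case_insensitive else 0
    return re.compile(pattern, flags)

def search_pages(pages, pattern, max_results=0, page_range=None, **opts):
    """Search page strings; page_range is inclusive, max_results=0 means unlimited."""
    regex = build_pattern(pattern, **opts)
    first, last = page_range if page_range is not None else (0, len(pages) - 1)
    hits = []
    for page in range(first, min(last, len(pages) - 1) + 1):
        for m in regex.finditer(pages[page]):
            hits.append({"page": page, "text": m.group(0)})
            if max_results and len(hits) >= max_results:
                return hits               # stop as soon as the cap is reached
    return hits

pages = ["Cost: $5.00 total", "the cat concatenates", "A CAT sat"]
print(search_pages(pages, "$5.00", literal=True))
print(search_pages(pages, "cat", whole_word=True, case_insensitive=True))
print(search_pages(pages, "cat", page_range=(1, 1)))
```

Without `literal=True`, the `$` and `.` in `$5.00` would be read as regex metacharacters, which is why literal mode matters for prices, file names, and similar strings.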

Builder methods

let options = SearchOptions::new()
    .with_case_insensitive(true)
    .with_literal(true)
    .with_whole_word(true)
    .with_max_results(100)
    .with_page_range(0, 9);

Convenience constructors

// Quick case-insensitive search
let options = SearchOptions::case_insensitive();

SearchResult

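A single search match with position information.

Field	Type	Description
`page`	`usize`	Page number (0-based)
`text`	`String`	The matched text
`bbox`	`Rect`	Combined bounding box of the match
`start_index`	`usize`	Start index in the page's extracted text
`end_index`	`usize`	End index in the page's extracted text
`span_boxes`	`Vec<Rect>`	Individual bounding box for each span in the match (especially useful for multi-line matches)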
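
The relationship between `bbox` and `span_boxes` can be illustrated with a small Python sketch. It assumes a rectangle is described by `x`, `y`, `width`, and `height` (the field names that appear in the examples on this page) and that the combined box is the smallest rectangle enclosing every span. The `Rect` class and `union` helper are stand-ins written for this illustration, not part of the library.

```python
from dataclasses import dataclass

@dataclass
class Rect:
    """Stand-in rectangle: (x, y) is the lower-left corner, in PDF points."""
    x: float
    y: float
    width: float
    height: float

def union(boxes):
    """Smallest rectangle that encloses every box in `boxes`."""
    x0 = min(b.x for b in boxes)
    y0 = min(b.y for b in boxes)
    x1 = max(b.x + b.width for b in boxes)
    y1 = max(b.y + b.height for b in boxes)
    return Rect(x0, y0, x1 - x0, y1 - y0)

# A match that wraps: end of one line, start of the next line below it
span_boxes = [Rect(400.0, 700.0, 100.0, 12.0), Rect(72.0, 686.0, 60.0, 12.0)]
bbox = union(span_boxes)
print(bbox)
```

For a wrapped match like this one, the combined box spans almost the full text width across both lines, so drawing one rectangle per entry in `span_boxes` gives a tighter highlight than drawing `bbox`.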

Python: in the Python API, search results are returned as dictionaries:

{
    "page": 0,
    "text": "conclusion",
    "x": 72.0,
    "y": 650.5,
    "width": 85.3,
    "height": 12.0,
}
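
Match coordinates are in PDF points with the origin at the bottom-left of the page, whereas most image and screen toolkits put the origin at the top-left. The sketch below converts a result dictionary into top-left-origin corners. It assumes `x` and `y` give the lower-left corner of the box and that the caller knows the page height; `to_top_left` is a hypothetical helper, and 792 points (US Letter) is only an example value.

```python
def to_top_left(match, page_height):
    """Return (left, top, right, bottom) with the origin at the page's top-left."""
    left = match["x"]
    right = match["x"] + match["width"]
    # Flip the vertical axis: the box's upper edge becomes its distance from the top
    top = page_height - (match["y"] + match["height"])
    bottom = page_height - match["y"]
    return left, top, right, bottom

match = {"page": 0, "text": "conclusion", "x": 72.0, "y": 650.5,
         "width": 85.3, "height": 12.0}
left, top, right, bottom = to_top_left(match, page_height=792.0)
print(left, top, right, bottom)
```

The flipped rectangle can then be scaled by the rendering DPI divided by 72 to draw it on a rasterized page.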

Pdf convenience methods

The high-level Pdf API provides shortcut methods for common search operations.

`search(pattern) -> Vec<SearchResult>`

Searches the whole document with default options.

let mut pdf = Pdf::open("report.pdf")?;
let results = pdf.search("important")?;

`search_with_options(pattern, options) -> Vec<SearchResult>`

Searches with custom options.

let options = SearchOptions::case_insensitive()
    .with_whole_word(true)
    .with_page_range(0, 5);
let results = pdf.search_with_options("abstract", options)?;

`search_page(page, pattern) -> Vec<SearchResult>`

Searches a single page with default options.

let results = pdf.search_page(0, r"\d+\.\d+")?; // Find decimal numbers

`highlight_matches(results, color) -> Result<()>`

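Creates highlight annotations for search results. Each result gets a yellow (or custom-colored) highlight annotation on its page.

Parameter	Type	Description
`results`	`&[SearchResult]`	The search results to highlight
`color`	`[f32; 3]`	RGB color (each component in the range 0.0–1.0)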

let mut pdf = Pdf::open("report.pdf")?;
let results = pdf.search("important")?;
pdf.highlight_matches(&results, [1.0, 1.0, 0.0])?; // Yellow
pdf.save("highlighted.pdf")?;

Python Search API

The Python PdfDocument class exposes search directly.

`doc.search(pattern, ...) -> list[dict]`

doc.search(
    pattern: str,
    case_insensitive: bool = False,
    literal: bool = False,
    whole_word: bool = False,
    max_results: int = 0,
) -> list[dict]

`doc.search_page(page, pattern, ...) -> list[dict]`

doc.search_page(
    page: int,
    pattern: str,
    case_insensitive: bool = False,
    literal: bool = False,
    whole_word: bool = False,
    max_results: int = 0,
) -> list[dict]
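
Because both methods return a flat `list[dict]`, any post-processing is ordinary Python. As a sketch, the hypothetical helper below groups matches by page. It runs on hand-written sample dictionaries shaped like the results documented on this page rather than on a real document.

```python
from collections import defaultdict

def group_by_page(results):
    """Group result dicts by their 0-based page index, in page order."""
    pages = defaultdict(list)
    for r in results:
        pages[r["page"]].append(r)
    return dict(sorted(pages.items()))

# Sample data in the documented result shape (not output from a real PDF)
results = [
    {"page": 2, "text": "error", "x": 72.0, "y": 500.0, "width": 30.0, "height": 12.0},
    {"page": 0, "text": "Error", "x": 72.0, "y": 650.5, "width": 31.5, "height": 12.0},
    {"page": 2, "text": "error", "x": 90.0, "y": 310.0, "width": 30.0, "height": 12.0},
]

grouped = group_by_page(results)
for page, matches in grouped.items():
    print(f"Page {page + 1}: {len(matches)} match(es)")
```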

JavaScript Search API

The WasmPdfDocument class exposes the same search functionality.

`doc.search(pattern, ...) -> Array`

doc.search(pattern, caseInsensitive?, literal?, wholeWord?, maxResults?) -> Array

`doc.searchPage(pageIndex, pattern, ...) -> Array`

doc.searchPage(pageIndex, pattern, caseInsensitive?, literal?, wholeWord?, maxResults?) -> Array

Example:

const doc = new WasmPdfDocument(bytes);

// Search all pages, case-insensitive
const results = doc.search("error|warning", true);
for (const r of results) {
  console.log(`Page ${r.page}: '${r.text}'`);
}

// Search a single page with whole-word matching
const pageResults = doc.searchPage(0, "abstract", true, false, true);
doc.free();

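How do I serialize search results to JSON?

Some language bindings provide a one-shot serializer that converts a page's list of search results into a JSON array in a single FFI crossing: Rust serializes the whole list and the binding decodes it, with no field-by-field marshalling. The SearchPage methods in Go and C# also decode their results this way internally.

The C ABI signature is the authoritative definition: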

char *pdf_oxide_search_results_to_json(
    const FfiSearchResults *results,
    int32_t *error_code);

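It takes the opaque result handle returned by pdf_document_search_page(...) and returns a malloc-allocated UTF-8 JSON string (free it with pdf_free_string). Each element contains the match's page, text, and bounding box (x, y, width, height).

Swift: the wrapper combines the search and the serialization into a single call, searchResultsToJson(_:_:caseSensitive:):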

import PdfOxide

let doc = try PdfDocument(path: "report.pdf")

// Search page 0 for "conclusion" and get the matches as a JSON string
let json = try doc.searchResultsToJson(0, "conclusion", caseSensitive: false)
print(json)
// [{"page":0,"text":"conclusion","x":72.0,"y":650.5,"width":85.3,"height":12.0}, ...]

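Go / C#. These bindings call pdf_oxide_search_results_to_json internally and return decoded native records directly, so there is no need to call the serializer yourself. Use doc.SearchPage(...) (Go: doc.SearchPage(page, text, caseSensitive); C#: doc.SearchPage(pageIndex, text, caseSensitive)) to get strongly typed results. If you need JSON, serialize the returned records with the language's standard JSON library (encoding/json / System.Text.Json).

Python / Rust. Python's doc.search(...) / doc.search_page(...) already return native list[dict] records that can be serialized directly with json.dumps; Rust returns a typed Vec<SearchResult> that can be serialized with serde_json. Neither needs the C-ABI serializer.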
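
For the Python case, a minimal sketch of the `json.dumps` route looks like this. The `results` list is hand-written sample data in the dictionary shape documented above, standing in for the return value of `doc.search(...)`.

```python
import json

# Stand-in for the list[dict] returned by doc.search(...)
results = [
    {"page": 0, "text": "conclusion", "x": 72.0, "y": 650.5, "width": 85.3, "height": 12.0},
    {"page": 3, "text": "Conclusion", "x": 72.0, "y": 701.2, "width": 88.1, "height": 14.0},
]

# Compact separators mirror the dense one-line form shown in the Swift example;
# ensure_ascii=False keeps non-ASCII matched text readable in the output
payload = json.dumps(results, ensure_ascii=False, separators=(",", ":"))
print(payload)

# Round-trip to confirm nothing is lost
decoded = json.loads(payload)
```

Since every value is a plain `int`, `float`, or `str`, no custom encoder is needed.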

Advanced Examples

Search and highlight with a custom color

use pdf_oxide::api::Pdf;
use pdf_oxide::search::SearchOptions;

let mut pdf = Pdf::open("contract.pdf")?;

// Find all dollar amounts
let options = SearchOptions::new()
    .with_literal(false); // regex mode
let results = pdf.search_with_options(r"\$[\d,]+\.?\d*", options)?;

println!("Found {} dollar amounts", results.len());
for r in &results {
    println!("  Page {}: {}", r.page + 1, r.text);
}

// Highlight them in green
pdf.highlight_matches(&results, [0.6, 1.0, 0.6])?;
pdf.save("highlighted_amounts.pdf")?;

Limit search to a page range

from pdf_oxide import PdfDocument

doc = PdfDocument("book.pdf")

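# doc.search() has no page-range argument, so search the first 10 pages
# one page at a time (assumes book.pdf has at least 10 pages)
results = []
for page in range(10):
    results += doc.search_page(
        page,
        "introduction",
        case_insensitive=True,
        whole_word=True,
    )

# Report at most 5 matches
for r in results[:5]:
    print(f"Found on page {r['page'] + 1}")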

Build a search index across multiple PDFs

use pdf_oxide::PdfDocument;
use pdf_oxide::search::{TextSearcher, SearchOptions};
use std::collections::HashMap;

let files = vec!["paper_a.pdf", "paper_b.pdf", "paper_c.pdf"];
let query = "machine learning";
let options = SearchOptions::case_insensitive();

let mut index: HashMap<String, Vec<(usize, String)>> = HashMap::new();

for file in &files {
    let mut doc = PdfDocument::open(file)?;
    let results = TextSearcher::search(&mut doc, query, &options)?;

    for r in results {
        index.entry(file.to_string())
            .or_default()
            .push((r.page, r.text));
    }
}

for (file, matches) in &index {
    println!("{}: {} matches", file, matches.len());
    for (page, text) in matches {
        println!("  Page {}: '{}'", page + 1, text);
    }
}

Extract context around a match

use pdf_oxide::PdfDocument;
use pdf_oxide::search::{TextSearcher, SearchOptions};

let mut doc = PdfDocument::open("report.pdf")?;
let options = SearchOptions::new().with_case_insensitive(true);
let results = TextSearcher::search(&mut doc, "error", &options)?;

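for r in &results {
    // Extract full page text for context
    let page_text = doc.extract_text(r.page)?;

    // Show about 50 bytes before and after the match (assuming the indices are
    // byte offsets), widened to UTF-8 character boundaries so slicing cannot panic
    let mut start = r.start_index.saturating_sub(50);
    while !page_text.is_char_boundary(start) {
        start -= 1;
    }
    let mut end = (r.end_index + 50).min(page_text.len());
    while !page_text.is_char_boundary(end) {
        end += 1;
    }
    let context = &page_text[start..end];

    println!("Page {} match: ...{}...", r.page + 1, context.trim());
}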

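FAQ

How do I get search results as JSON? In Swift, call doc.searchResultsToJson(page, term, caseSensitive:) to run a page search and return the matches as a JSON array in one step. In Python and Rust, search returns native records (list[dict] / Vec<SearchResult>), which you serialize with json.dumps / serde_json respectively. Go and C# return strongly typed records, which you serialize with encoding/json / System.Text.Json.

What does each JSON match contain? The match's page (0-based), the matched text, and the combined bounding box: x, y, width, height (in PDF points, with the origin at the bottom-left).

Does search default to regex or plain text? Patterns are compiled as regular expressions by default. With literal mode enabled (with_literal(true) / literal=True), regex metacharacters are escaped and the pattern is matched verbatim.

Does search support case-insensitive and whole-word matching? Yes. Set case_insensitive and whole_word in SearchOptions (Rust), pass them as keyword arguments in Python, or pass them as options in the other bindings.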
