What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

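Text Search

PDF Oxide provides full-text search over PDF documents with regular expression support, case-insensitive matching, a whole-word mode, and a bounding box for every match. Each result carries the page number, the matched text, and the exact coordinates of the match, which makes search-and-highlight workflows straightforward to build.

Use TextSearcher::search() for queries with custom options across multiple pages, and the convenience methods on Pdf (search(), search_page(), highlight_matches()) for common use cases.

Quick Example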

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
results = doc.search("conclusion", case_insensitive=True)
for r in results:
    print(f"Page {r['page']}: '{r['text']}' at ({r['x']:.1f}, {r['y']:.1f})")

Node.js

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("report.pdf");
const results = doc.searchAll("conclusion", { caseSensitive: false });
for (const r of results) {
  console.log(`Page ${r.page}: '${r.text}' at (${r.x.toFixed(1)}, ${r.y.toFixed(1)})`);
}
doc.close();

Go

import (
    "fmt"

    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
results, _ := doc.SearchAll("conclusion", false)
for _, r := range results {
    fmt.Printf("Page %d: '%s' at (%.1f, %.1f)\n", r.Page, r.Text, r.X, r.Y)
}

C#

using PdfOxide.Core;

using var doc = PdfDocument.Open("report.pdf");
var results = doc.SearchAll("conclusion");
foreach (var r in results)
{
    Console.WriteLine($"Page {r.Page}: '{r.Text}' at ({r.X:F1}, {r.Y:F1})");
}

WASM

const doc = new WasmPdfDocument(bytes);
const results = doc.search("conclusion");
for (const r of results) {
    console.log(`Page ${r.page}: '${r.text}' at (${r.x.toFixed(1)}, ${r.y.toFixed(1)})`);
}

Rust

use pdf_oxide::api::Pdf;

let mut pdf = Pdf::open("report.pdf")?;
let results = pdf.search("conclusion")?;
for r in &results {
    println!("Page {}: '{}' at ({:.1}, {:.1})", r.page, r.text, r.bbox.x, r.bbox.y);
}

Java

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.search.SearchMatch;
import java.nio.file.Path;
import java.util.List;

try (PdfDocument doc = PdfDocument.open(Path.of("report.pdf"))) {
    List<SearchMatch> results = doc.search("conclusion", true, false, 0);
    for (SearchMatch m : results) {
        System.out.printf("Page %d: '%s' at (%.1f, %.1f)%n",
            m.pageIndex(), m.text(), m.bbox().x0(), m.bbox().y0());
    }
}

Kotlin

import fyi.oxide.pdf.PdfDocument
import java.nio.file.Path

PdfDocument.open(Path.of("report.pdf")).use { doc ->
    val results = doc.search("conclusion", true, false, 0)
    for (m in results) {
        println("Page ${m.pageIndex()}: '${m.text()}' at (${m.bbox().x0()}, ${m.bbox().y0()})")
    }
}

Scala

import fyi.oxide.pdf.{PdfDocument, searchSeq}
import scala.util.Using

Using.resource(PdfDocument.open("report.pdf")) { doc =>
  val results = doc.searchSeq("conclusion")
  for (m <- results)
    println(f"Page ${m.pageIndex}: '${m.text}' at (${m.bbox.x0}%.1f, ${m.bbox.y0}%.1f)")
}

Clojure

(require '[pdf-oxide.core :as pdf])

(with-open [doc (pdf/open "report.pdf")]
  (doseq [m (pdf/search doc "conclusion")]
    (printf "Page %d: '%s' at (%.1f, %.1f)%n"
            (.pageIndex m) (.text m) (.x0 (.bbox m)) (.y0 (.bbox m)))))

Ruby

require 'pdf_oxide'

PdfOxide::PdfDocument.open('report.pdf') do |doc|
  doc.search('conclusion', case_sensitive: false).each do |r|
    bbox = r[:bbox]
    printf("Page %d: '%s' at (%.1f, %.1f)\n", r[:page], r[:text], bbox[:x], bbox[:y])
  end
end

C++

#include <pdf_oxide/pdf_oxide.hpp>
#include <cstdio>

auto doc = pdf_oxide::Document::open("report.pdf");
auto results = doc.search_all("conclusion", /*case_sensitive=*/false);
for (const auto& r : results) {
    std::printf("Page %d: '%s' at (%.1f, %.1f)\n",
                r.page, r.text.c_str(), r.bbox.x, r.bbox.y);
}

Swift

import PdfOxide

let doc = try Document.open("report.pdf")
let results = try doc.searchAll("conclusion", false)
for r in results {
    print("Page \(r.page): '\(r.text)' at (\(r.bbox.x), \(r.bbox.y))")
}

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('report.pdf');
final results = doc.searchAll('conclusion', false);
for (final r in results) {
  print("Page ${r.page}: '${r.text}' at (${r.bbox.x}, ${r.bbox.y})");
}
doc.close();

R

library(pdfoxide)

doc <- pdf_open("report.pdf")
results <- pdf_search_all(doc, "conclusion", case_sensitive = FALSE)
for (r in results) {
  cat(sprintf("Page %d: '%s' at (%.1f, %.1f)\n",
              r$page, r$text, r$bbox$x, r$bbox$y))
}

Julia

using PdfOxide

doc = open_document("report.pdf")
results = search_all(doc, "conclusion", false)
for r in results
    println("Page $(r.page): '$(r.text)' at ($(r.bbox.x), $(r.bbox.y))")
end

Zig

const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("report.pdf");
const results = try doc.searchAll(a, "conclusion", false);
defer doc.freeSearchResults(a, results);
for (results) |r| {
    std.debug.print("Page {d}: '{s}' at ({d:.1}, {d:.1})\n", .{ r.page, r.text, r.bbox.x, r.bbox.y });
}

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
NSArray<POXSearchResult*> *results = [doc searchAll:@"conclusion" caseSensitive:NO error:&err];
for (POXSearchResult *r in results) {
    NSLog(@"Page %ld: '%@' at (%.1f, %.1f)", (long)r.page, r.text, r.bbox.x, r.bbox.y);
}

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")
{:ok, results} = PdfOxide.search_all(doc, "conclusion", false)

for r <- results do
  IO.puts("Page #{r.page}: '#{r.text}' at (#{r.bbox.x}, #{r.bbox.y})")
end

API Reference

`TextSearcher::search(doc, pattern, options) -> Vec<SearchResult>`

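Searches for text across multiple pages of a PDF document. The pattern is compiled as a regular expression unless literal mode is enabled.

Parameter	Type	Description
`doc`	`&mut PdfDocument`	The PDF document to search
`pattern`	`&str`	Regular expression pattern (literal text when `literal` is set)
`options`	`&SearchOptions`	Search configuration

Returns: a vector of SearchResult objects ordered by page and position.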

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::search::{TextSearcher, SearchOptions};

let mut doc = PdfDocument::open("report.pdf")?;

let options = SearchOptions::new()
    .with_case_insensitive(true)
    .with_max_results(50);

let results = TextSearcher::search(&mut doc, "error|warning", &options)?;
for r in &results {
    println!("Page {}: '{}'", r.page, r.text);
}

`TextSearcher::search_page(doc, page, regex, options) -> Vec<SearchResult>`

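Searches a specific page using a precompiled regular expression.

Parameter	Type	Description
`doc`	`&mut PdfDocument`	The PDF document
`page`	`usize`	Zero-based page index
`regex`	`&Regex`	Precompiled regular expression pattern
`options`	`&SearchOptions`	Search configuration

Returns: a vector of SearchResult objects for the specified page.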

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::search::{TextSearcher, SearchOptions};
use regex::Regex;

let mut doc = PdfDocument::open("report.pdf")?;
let regex = Regex::new(r"\d{4}-\d{2}-\d{2}")?; // Date pattern
let options = SearchOptions::default();

let results = TextSearcher::search_page(&mut doc, 0, &regex, &options)?;
for r in &results {
    println!("Date found: '{}' at ({:.1}, {:.1})", r.text, r.bbox.x, r.bbox.y);
}

SearchOptions

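Configuration for text search behavior. Options can be assembled with the builder pattern.

Field	Type	Default	Description
`case_insensitive`	`bool`	`false`	Ignore case when matching
`literal`	`bool`	`false`	Treat the pattern as literal text (regex metacharacters are escaped)
`whole_word`	`bool`	`false`	Match whole words only (wraps the pattern in `\b...\b`)
`max_results`	`usize`	`0`	Maximum number of results to return (0 = unlimited)
`page_range`	`Option<(usize, usize)>`	`None`	Page range to search (start and end both inclusive)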
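
The way these options combine can be illustrated outside the library. The following is a minimal sketch using Python's standard `re` module, not PDF Oxide's implementation: `build_pattern` and `find_all` are hypothetical helper names, and Python's regex dialect differs in places from the Rust regex engine that the Rust API uses. It only shows how literal escaping, `\b...\b` wrapping, case folding, and a result cap might interact.

```python
import re

def build_pattern(pattern, case_insensitive=False, literal=False, whole_word=False):
    """Hypothetical helper: approximate the documented options with Python's re."""
    # literal: escape regex metacharacters so the pattern matches verbatim
    source = re.escape(pattern) if literal else pattern
    # whole_word: wrap in word boundaries; the non-capturing group is an
    # assumption that keeps alternations such as "a|b" inside the boundaries
    if whole_word:
        source = rf"\b(?:{source})\b"
    flags = re.IGNORECASE if case_insensitive else 0
    return re.compile(source, flags)

def find_all(text, pattern, max_results=0, **options):
    """Return (start, end, matched_text) tuples; max_results=0 means unlimited."""
    regex = build_pattern(pattern, **options)
    matches = [(m.start(), m.end(), m.group()) for m in regex.finditer(text)]
    return matches[:max_results] if max_results else matches

sample = "Total: $5.00 and $5x00. Error, error, errors."
print(find_all(sample, "$5.00", literal=True))
print(find_all(sample, "error", case_insensitive=True, whole_word=True))
```

Without `literal=True`, the `$` and `.` in `$5.00` would be interpreted as regex syntax, which is why literal mode matters for inputs such as prices or file names.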

Builder methods

let options = SearchOptions::new()
    .with_case_insensitive(true)
    .with_literal(true)
    .with_whole_word(true)
    .with_max_results(100)
    .with_page_range(0, 9);

Convenience constructor

// Quick case-insensitive search
let options = SearchOptions::case_insensitive();

SearchResult

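A single search match with position information.

Field	Type	Description
`page`	`usize`	Page index (zero-based)
`text`	`String`	The matched text
`bbox`	`Rect`	Bounding box of the whole match
`start_index`	`usize`	Start index within the page's extracted text
`end_index`	`usize`	End index within the page's extracted text
`span_boxes`	`Vec<Rect>`	Individual bounding boxes for each span in the match (useful for multi-line matches)

Python: in the Python API, search results are returned as dictionaries: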

{
    "page": 0,
    "text": "conclusion",
    "x": 72.0,
    "y": 650.5,
    "width": 85.3,
    "height": 12.0,
}
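
Because the Python results are plain dictionaries, they can be post-processed with ordinary Python. The sketch below groups matches by page and converts the `x`/`y`/`width`/`height` fields into corner coordinates. The helper names are hypothetical, the sample list is made-up data in the documented shape rather than real library output, and the code assumes that `(x, y)` is the lower-left corner of the box.

```python
from collections import defaultdict

def to_corners(match):
    """Return (x0, y0, x1, y1), assuming (x, y) is the box's lower-left corner."""
    x0, y0 = match["x"], match["y"]
    return (x0, y0, x0 + match["width"], y0 + match["height"])

def group_by_page(matches):
    """Group match dictionaries by their zero-based page index."""
    pages = defaultdict(list)
    for match in matches:
        pages[match["page"]].append(match)
    return dict(pages)

# Hypothetical sample data in the documented shape (not real library output)
sample = [
    {"page": 0, "text": "conclusion", "x": 72.0, "y": 650.5, "width": 85.25, "height": 12.0},
    {"page": 2, "text": "Conclusion", "x": 72.0, "y": 700.0, "width": 90.0, "height": 14.0},
    {"page": 2, "text": "conclusion", "x": 100.0, "y": 320.0, "width": 85.25, "height": 12.0},
]

for page, hits in sorted(group_by_page(sample).items()):
    print(f"page {page + 1}: {len(hits)} match(es)")
```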

Pdf Convenience Methods

The high-level Pdf API provides shortcut methods for common search operations.

`search(pattern) -> Vec<SearchResult>`

Searches the entire document with default options.

let mut pdf = Pdf::open("report.pdf")?;
let results = pdf.search("important")?;

`search_with_options(pattern, options) -> Vec<SearchResult>`

Searches with custom options.

let options = SearchOptions::case_insensitive()
    .with_whole_word(true)
    .with_page_range(0, 5);
let results = pdf.search_with_options("abstract", options)?;

`search_page(page, pattern) -> Vec<SearchResult>`

Searches a single page with default options.

let results = pdf.search_page(0, r"\d+\.\d+")?; // Find decimal numbers

`highlight_matches(results, color) -> Result<()>`

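Creates highlight annotations for search results. Each result gets a highlight annotation on its page, in yellow or in a custom color.

Parameter	Type	Description
`results`	`&[SearchResult]`	The search results to highlight
`color`	`[f32; 3]`	RGB color (each component 0.0 to 1.0)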

let mut pdf = Pdf::open("report.pdf")?;
let results = pdf.search("important")?;
pdf.highlight_matches(&results, [1.0, 1.0, 0.0])?; // Yellow
pdf.save("highlighted.pdf")?;
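
Highlight colors are expressed as three components in the 0.0 to 1.0 range, whereas design tools usually give colors as hex strings. The following sketch shows one way to convert between the two. `hex_to_unit_rgb` is a hypothetical helper, not part of PDF Oxide, and the resulting triple would still have to be passed to the highlight call in whichever binding you use.

```python
def hex_to_unit_rgb(hex_color):
    """Convert '#RRGGBB' (or 'RRGGBB') to an (r, g, b) tuple of floats in 0.0-1.0."""
    value = hex_color.lstrip("#")
    if len(value) != 6:
        raise ValueError(f"expected a 6-digit hex color, got {hex_color!r}")
    # Each pair of hex digits is one 8-bit channel; divide by 255 to normalize
    return tuple(int(value[i:i + 2], 16) / 255 for i in (0, 2, 4))

print(hex_to_unit_rgb("#FFFF00"))
```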

Python Search API

The Python PdfDocument class exposes search directly.

`doc.search(pattern, ...) -> list[dict]`

doc.search(
    pattern: str,
    case_insensitive: bool = False,
    literal: bool = False,
    whole_word: bool = False,
    max_results: int = 0,
) -> list[dict]

`doc.search_page(page, pattern, ...) -> list[dict]`

doc.search_page(
    page: int,
    pattern: str,
    case_insensitive: bool = False,
    literal: bool = False,
    whole_word: bool = False,
    max_results: int = 0,
) -> list[dict]

JavaScript Search API

The WasmPdfDocument class exposes the same search functionality.

`doc.search(pattern, ...) -> Array`

doc.search(pattern, caseInsensitive?, literal?, wholeWord?, maxResults?) -> Array

`doc.searchPage(pageIndex, pattern, ...) -> Array`

doc.searchPage(pageIndex, pattern, caseInsensitive?, literal?, wholeWord?, maxResults?) -> Array

Example:

const doc = new WasmPdfDocument(bytes);

// Search all pages, case-insensitive
const results = doc.search("error|warning", true);
for (const r of results) {
  console.log(`Page ${r.page}: '${r.text}'`);
}

// Search a single page with whole-word matching
const pageResults = doc.searchPage(0, "abstract", true, false, true);
doc.free();

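How do I serialize search results to JSON?

Several bindings provide a serializer that converts a page's list of search results to JSON in a single pass across the FFI boundary. Rust serializes the whole list and the binding decodes it, so fields do not have to be exchanged match by match. The SearchPage methods in Go and C# use this mechanism internally.

The C ABI signature is the canonical definition: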

char *pdf_oxide_search_results_to_json(
    const FfiSearchResults *results,
    int32_t *error_code);

It takes the opaque result handle returned by pdf_document_search_page(...) and returns a malloc-allocated UTF-8 JSON string (free it with pdf_free_string). Each element contains the match's page, text, and bounding box (x, y, width, height).

Swift. The wrapper combines search and serialization into a single call, searchResultsToJson(_:_:caseSensitive:):

import PdfOxide

let doc = try PdfDocument(path: "report.pdf")

// Search page 0 for "conclusion" and get the matches as a JSON string
let json = try doc.searchResultsToJson(0, "conclusion", caseSensitive: false)
print(json)
// [{"page":0,"text":"conclusion","x":72.0,"y":650.5,"width":85.3,"height":12.0}, ...]

Go / C#. These bindings call pdf_oxide_search_results_to_json internally and return decoded native records, so you do not need to call the serializer directly. Use doc.SearchPage(...) (Go: doc.SearchPage(page, text, caseSensitive); C#: doc.SearchPage(pageIndex, text, caseSensitive)), which returns strongly typed results. If you need JSON, serialize them with the standard JSON library (encoding/json / System.Text.Json).

Python / Rust. Python's doc.search(...) and doc.search_page(...) return native list[dict] records that can be serialized directly with json.dumps. Rust returns a typed Vec<SearchResult> that can be serialized with serde_json. Neither needs the C-ABI serializer.
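
For the Python case, the serialization step might look like the sketch below. It uses only the standard `json` module. `matches_to_json` is a hypothetical helper name, and the `matches` list is made-up data in the documented dictionary shape, standing in for what a real search call would return.

```python
import json

def matches_to_json(matches, pretty=False):
    """Serialize a list of match dictionaries to a JSON array string."""
    # ensure_ascii=False keeps non-ASCII matched text readable in the output
    return json.dumps(matches, ensure_ascii=False, indent=2 if pretty else None)

# Hypothetical sample data in the documented shape (not real library output)
matches = [
    {"page": 0, "text": "conclusion", "x": 72.0, "y": 650.5, "width": 85.3, "height": 12.0},
    {"page": 3, "text": "Conclusion", "x": 72.0, "y": 120.0, "width": 88.1, "height": 12.0},
]

payload = matches_to_json(matches)
restored = json.loads(payload)  # round-trips back to equal dictionaries
print(payload)
```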

Advanced Examples

Highlight search results with a custom color

use pdf_oxide::api::Pdf;
use pdf_oxide::search::SearchOptions;

let mut pdf = Pdf::open("contract.pdf")?;

// Find all dollar amounts
let options = SearchOptions::new()
    .with_literal(false); // regex mode
let results = pdf.search_with_options(r"\$[\d,]+\.?\d*", options)?;

println!("Found {} dollar amounts", results.len());
for r in &results {
    println!("  Page {}: {}", r.page + 1, r.text);
}

// Highlight them in green
pdf.highlight_matches(&results, [0.6, 1.0, 0.6])?;
pdf.save("highlighted_amounts.pdf")?;

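Restrict a search to a page range

from pdf_oxide import PdfDocument

doc = PdfDocument("book.pdf")

# search() as documented above takes no page-range argument, so search the
# whole document and keep up to 5 matches from the first 10 pages (indices 0-9)
results = doc.search(
    "introduction",
    case_insensitive=True,
    whole_word=True,
)
results = [r for r in results if r["page"] < 10][:5]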

for r in results:
    print(f"Found on page {r['page'] + 1}")

Build a search index across multiple PDFs

use pdf_oxide::PdfDocument;
use pdf_oxide::search::{TextSearcher, SearchOptions};
use std::collections::HashMap;

let files = vec!["paper_a.pdf", "paper_b.pdf", "paper_c.pdf"];
let query = "machine learning";
let options = SearchOptions::case_insensitive();

let mut index: HashMap<String, Vec<(usize, String)>> = HashMap::new();

for file in &files {
    let mut doc = PdfDocument::open(file)?;
    let results = TextSearcher::search(&mut doc, query, &options)?;

    for r in results {
        index.entry(file.to_string())
            .or_default()
            .push((r.page, r.text));
    }
}

for (file, matches) in &index {
    println!("{}: {} matches", file, matches.len());
    for (page, text) in matches {
        println!("  Page {}: '{}'", page + 1, text);
    }
}

Extract context around a match

use pdf_oxide::PdfDocument;
use pdf_oxide::search::{TextSearcher, SearchOptions};

let mut doc = PdfDocument::open("report.pdf")?;
let options = SearchOptions::new().with_case_insensitive(true);
let results = TextSearcher::search(&mut doc, "error", &options)?;

for r in &results {
    // Extract full page text for context
    let page_text = doc.extract_text(r.page)?;

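    // Show about 50 bytes before and after the match. Assuming start_index and
    // end_index are byte offsets, widen the window to the nearest char
    // boundaries so the slice cannot panic on multi-byte text.
    let mut start = r.start_index.saturating_sub(50);
    while !page_text.is_char_boundary(start) {
        start -= 1;
    }
    let mut end = (r.end_index + 50).min(page_text.len());
    while !page_text.is_char_boundary(end) {
        end += 1;
    }
    let context = &page_text[start..end];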

    println!("Page {} match: ...{}...", r.page + 1, context.trim());
}
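
The same context-window idea can be sketched in Python on text that has already been extracted. The documented Python result dictionaries do not list start or end indices, so this sketch re-runs the pattern over the page text with the standard `re` module instead of using library offsets. `context_snippets` is a hypothetical helper, and the page text here is a made-up string.

```python
import re

def context_snippets(page_text, pattern, radius=50, flags=0):
    """Return each match with up to `radius` characters of context on both sides."""
    snippets = []
    for m in re.finditer(pattern, page_text, flags):
        # Clamp the window to the bounds of the text
        start = max(0, m.start() - radius)
        end = min(len(page_text), m.end() + radius)
        snippets.append(page_text[start:end].strip())
    return snippets

# Hypothetical page text standing in for extracted PDF text
page_text = "The job finished. One error was logged at 10:42, then the run resumed."
for snippet in context_snippets(page_text, r"error", radius=12, flags=re.IGNORECASE):
    print(f"...{snippet}...")
```

Python strings are indexed by code point, so the character-boundary adjustment needed when slicing Rust strings by byte offset does not apply here.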

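Frequently Asked Questions

How do I get search results as JSON? In Swift, call doc.searchResultsToJson(page, term, caseSensitive:), which runs the page search and returns a JSON array of matches in one step. In Python and Rust, search returns native records (list[dict] / Vec<SearchResult>) that you serialize with json.dumps / serde_json. Go and C# return typed records that you serialize with encoding/json / System.Text.Json.

What does each match in the JSON contain? The match's page (zero-based), the matched text, and the bounding box (x, y, width, height, in PDF points with a bottom-left origin).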
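
Viewers and image libraries usually place the origin at the top left and work in pixels, so a bounding box in PDF points often needs converting before it can be drawn. The sketch below shows the arithmetic. `to_pixel_rect` is a hypothetical helper, the page height must come from elsewhere (it is not part of the match record described here), and the code assumes `(x, y)` is the lower-left corner of the box. One PDF point is 1/72 inch, so the scale factor is `dpi / 72`.

```python
def to_pixel_rect(match, page_height_pt, dpi=96):
    """Convert a bottom-left-origin box in points to (left, top, width, height) in pixels."""
    scale = dpi / 72.0  # 72 PDF points per inch
    left = match["x"] * scale
    # The top edge of the box sits at y + height; flip it against the page height
    top = (page_height_pt - (match["y"] + match["height"])) * scale
    return (left, top, match["width"] * scale, match["height"] * scale)

# Hypothetical match on a US Letter page (792 points tall), rendered at 144 DPI
match = {"page": 0, "text": "conclusion", "x": 72.0, "y": 648.0, "width": 144.0, "height": 12.0}
print(to_pixel_rect(match, page_height_pt=792.0, dpi=144))
```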

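Is search regex or literal by default? Patterns are compiled as regular expressions by default. Enabling literal mode (with_literal(true) / literal=True) escapes regex metacharacters so that the pattern matches the text verbatim.

Are case-insensitive search and whole-word matching supported? Yes. In Rust, set case_insensitive and whole_word on SearchOptions. In Python, pass them as keyword arguments. In the other bindings, specify them as options.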
