What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Text Search

PDF Oxide provides full-text search across PDF documents with regex support, case-insensitive matching, whole-word mode, and per-match bounding boxes. Search results include page number, matched text, and precise coordinates for each match, making it straightforward to build search-and-highlight workflows.

Use TextSearcher::search() for multi-page queries with custom options, or the Pdf convenience methods (search(), search_page(), highlight_matches()) for common use cases.

Quick Example

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
results = doc.search("conclusion", case_insensitive=True)
for r in results:
    print(f"Page {r['page']}: '{r['text']}' at ({r['x']:.1f}, {r['y']:.1f})")

Node.js

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("report.pdf");
const results = doc.searchAll("conclusion", { caseSensitive: false });
for (const r of results) {
  console.log(`Page ${r.page}: '${r.text}' at (${r.x.toFixed(1)}, ${r.y.toFixed(1)})`);
}
doc.close();

Go

import (
    "fmt"

    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
results, _ := doc.SearchAll("conclusion", false)
for _, r := range results {
    fmt.Printf("Page %d: '%s' at (%.1f, %.1f)\n", r.Page, r.Text, r.X, r.Y)
}

C#

using PdfOxide.Core;

using var doc = PdfDocument.Open("report.pdf");
var results = doc.SearchAll("conclusion");
foreach (var r in results)
{
    Console.WriteLine($"Page {r.Page}: '{r.Text}' at ({r.X:F1}, {r.Y:F1})");
}

WASM

const doc = new WasmPdfDocument(bytes);
const results = doc.search("conclusion");
for (const r of results) {
    console.log(`Page ${r.page}: '${r.text}' at (${r.x.toFixed(1)}, ${r.y.toFixed(1)})`);
}

Rust

use pdf_oxide::api::Pdf;

let mut pdf = Pdf::open("report.pdf")?;
let results = pdf.search("conclusion")?;
for r in &results {
    println!("Page {}: '{}' at ({:.1}, {:.1})", r.page, r.text, r.bbox.x, r.bbox.y);
}

API Reference

`TextSearcher::search(doc, pattern, options) -> Vec<SearchResult>`

Search for text across multiple pages of a PDF document. The pattern is compiled as a regex unless literal mode is enabled.

Parameter	Type	Description
`doc`	`&mut PdfDocument`	The PDF document to search
`pattern`	`&str`	Regex pattern (or literal text if `literal` is set)
`options`	`&SearchOptions`	Search configuration

Returns: A vector of SearchResult objects, ordered by page and position.

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::search::{TextSearcher, SearchOptions};

let mut doc = PdfDocument::open("report.pdf")?;

let options = SearchOptions::new()
    .with_case_insensitive(true)
    .with_max_results(50);

let results = TextSearcher::search(&mut doc, "error|warning", &options)?;
for r in &results {
    println!("Page {}: '{}'", r.page, r.text);
}

`TextSearcher::search_page(doc, page, regex, options) -> Vec<SearchResult>`

Search for text on a specific page using a pre-compiled regex.

Parameter	Type	Description
`doc`	`&mut PdfDocument`	The PDF document
`page`	`usize`	Zero-based page index
`regex`	`&Regex`	Pre-compiled regex pattern
`options`	`&SearchOptions`	Search configuration

Returns: A vector of SearchResult objects for the specified page.

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::search::{TextSearcher, SearchOptions};
use regex::Regex;

let mut doc = PdfDocument::open("report.pdf")?;
let regex = Regex::new(r"\d{4}-\d{2}-\d{2}")?; // Date pattern
let options = SearchOptions::default();

let results = TextSearcher::search_page(&mut doc, 0, &regex, &options)?;
for r in &results {
    println!("Date found: '{}' at ({:.1}, {:.1})", r.text, r.bbox.x, r.bbox.y);
}

SearchOptions

Configuration for text search behavior. Uses a builder pattern for ergonomic construction.

Field	Type	Default	Description
`case_insensitive`	`bool`	`false`	Ignore case when matching
`literal`	`bool`	`false`	Treat pattern as literal text (escape regex chars)
`whole_word`	`bool`	`false`	Match whole words only (wraps pattern in `\b...\b`)
`max_results`	`usize`	`0`	Maximum results to return (0 = unlimited)
`page_range`	`Option<(usize, usize)>`	`None`	Page range to search (inclusive start, inclusive end)
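
To make the interaction between these flags concrete, here is a minimal sketch of how such options could be folded into a single regex. It uses Python's standard `re` module purely for illustration; `build_pattern` and `find_matches` are hypothetical helpers, not part of PDF Oxide, and the library's Rust implementation may differ in detail (the non-capturing group around the pattern, for instance, is an assumption).

```python
import re


def build_pattern(pattern, case_insensitive=False, literal=False, whole_word=False):
    """Hypothetical helper: fold the search flags into one compiled regex."""
    if literal:
        # Literal mode: escape regex metacharacters so they match themselves.
        pattern = re.escape(pattern)
    if whole_word:
        # Whole-word mode: wrap the pattern in word boundaries. The group keeps
        # alternations such as "a|b" inside the boundaries (an assumption).
        pattern = rf"\b(?:{pattern})\b"
    flags = re.IGNORECASE if case_insensitive else 0
    return re.compile(pattern, flags)


def find_matches(text, regex, max_results=0):
    """Hypothetical helper: return matched strings; max_results=0 means unlimited."""
    matches = [m.group(0) for m in regex.finditer(text)]
    return matches if max_results == 0 else matches[:max_results]


text = "Total: $5.00 (see Errors, error log, terror)"

# Whole-word matching skips "Errors" and "terror" and keeps the standalone word.
print(find_matches(text, build_pattern("error", case_insensitive=True, whole_word=True)))

# Literal mode lets "$" and "." match themselves instead of acting as regex syntax.
print(find_matches(text, build_pattern("$5.00", literal=True)))
```

The order matters in this sketch: escaping happens before the word boundaries are added, so the `\b` anchors are never escaped themselves.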

Builder Methods

let options = SearchOptions::new()
    .with_case_insensitive(true)
    .with_literal(true)
    .with_whole_word(true)
    .with_max_results(100)
    .with_page_range(0, 9);

Convenience Constructor

// Quick case-insensitive search
let options = SearchOptions::case_insensitive();

SearchResult

A single search match with position information.

Field	Type	Description
`page`	`usize`	Page number (0-indexed)
`text`	`String`	The matched text
`bbox`	`Rect`	Combined bounding box of the match
`start_index`	`usize`	Start index in the page’s extracted text
`end_index`	`usize`	End index in the page’s extracted text
`span_boxes`	`Vec<Rect>`	Individual bounding boxes for each span in the match (useful for multi-line matches)
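
The relationship between `bbox` and `span_boxes` can be pictured as a rectangle union. The sketch below is an illustration only: it models each rectangle as an `(x, y, width, height)` tuple, and `union_box` is a hypothetical helper. Whether the library computes `bbox` exactly this way is an assumption.

```python
def union_box(span_boxes):
    """Smallest (x, y, width, height) rectangle enclosing every span box."""
    left = min(x for x, _, _, _ in span_boxes)
    bottom = min(y for _, y, _, _ in span_boxes)
    right = max(x + w for x, _, w, _ in span_boxes)
    top = max(y + h for _, y, _, h in span_boxes)
    return (left, bottom, right - left, top - bottom)


# A match that wraps across two lines: the end of one line, the start of the next.
spans = [(400.0, 650.0, 100.0, 12.0), (72.0, 636.0, 60.0, 12.0)]
print(union_box(spans))  # (72.0, 636.0, 428.0, 26.0)
```

For a multi-line match like this one, the combined box covers far more area than the text itself, which is why per-span boxes are the better input for precise highlighting.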

Python: In the Python API, search results are returned as dictionaries:

{
    "page": 0,
    "text": "conclusion",
    "x": 72.0,
    "y": 650.5,
    "width": 85.3,
    "height": 12.0,
}
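
Many drawing and annotation tools expect corner coordinates rather than an origin plus size. As a small sketch, the hypothetical helper `to_corners` below converts a result dictionary of the shape shown above. It assumes `x` and `y` mark the corner from which `width` and `height` extend, which the dictionary layout suggests but does not state.

```python
def to_corners(match):
    """Convert a result dict to (x0, y0, x1, y1) corner coordinates."""
    x0, y0 = match["x"], match["y"]
    return (x0, y0, x0 + match["width"], y0 + match["height"])


match = {
    "page": 0,
    "text": "conclusion",
    "x": 72.0,
    "y": 650.5,
    "width": 85.3,
    "height": 12.0,
}
print(to_corners(match))
```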

Pdf Convenience Methods

The high-level Pdf API provides shortcut methods for common search operations.

`search(pattern) -> Vec<SearchResult>`

Search the entire document with default options.

let mut pdf = Pdf::open("report.pdf")?;
let results = pdf.search("important")?;

`search_with_options(pattern, options) -> Vec<SearchResult>`

Search with custom options.

let options = SearchOptions::case_insensitive()
    .with_whole_word(true)
    .with_page_range(0, 5);
let results = pdf.search_with_options("abstract", options)?;

`search_page(page, pattern) -> Vec<SearchResult>`

Search a single page with default options.

let results = pdf.search_page(0, r"\d+\.\d+")?; // Find decimal numbers

`highlight_matches(results, color) -> Result<()>`

Create highlight annotations for search results. Each result gets a highlight annotation on its page in the given color (yellow in the example below).

Parameter	Type	Description
`results`	`&[SearchResult]`	Search results to highlight
`color`	`[f32; 3]`	RGB color (0.0–1.0 per component)

let mut pdf = Pdf::open("report.pdf")?;
let results = pdf.search("important")?;
pdf.highlight_matches(&results, [1.0, 1.0, 0.0])?; // Yellow
pdf.save("highlighted.pdf")?;
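
Colors are often specified as hex strings, while `highlight_matches` takes three components in the 0.0–1.0 range. The following sketch shows the conversion; `hex_to_rgb` is a hypothetical helper written in Python for illustration and is not part of PDF Oxide.

```python
def hex_to_rgb(hex_color):
    """Convert '#RRGGBB' to three floats in the 0.0-1.0 range."""
    value = hex_color.lstrip("#")
    if len(value) != 6:
        raise ValueError(f"expected #RRGGBB, got {hex_color!r}")
    # Each pair of hex digits is one 8-bit channel; divide by 255 to normalize.
    return [int(value[i:i + 2], 16) / 255 for i in (0, 2, 4)]


print(hex_to_rgb("#FFFF00"))  # [1.0, 1.0, 0.0], the yellow used above
```

The three resulting values correspond, in order, to the red, green, and blue entries of the `color` argument.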

Python Search API

The Python PdfDocument class exposes search directly.

`doc.search(pattern, ...) -> list[dict]`

doc.search(
    pattern: str,
    case_insensitive: bool = False,
    literal: bool = False,
    whole_word: bool = False,
    max_results: int = 0,
) -> list[dict]

`doc.search_page(page, pattern, ...) -> list[dict]`

doc.search_page(
    page: int,
    pattern: str,
    case_insensitive: bool = False,
    literal: bool = False,
    whole_word: bool = False,
    max_results: int = 0,
) -> list[dict]
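
Because both methods return plain dictionaries, post-processing needs nothing beyond the standard library. As a minimal sketch, the hypothetical helper `group_by_page` below collects matched strings per page and converts the zero-based page index to a 1-based page number for display. The sample list stands in for the return value of `doc.search(...)`.

```python
from collections import defaultdict


def group_by_page(results):
    """Map 1-based page numbers to the matched strings found on that page."""
    pages = defaultdict(list)
    for r in results:
        pages[r["page"] + 1].append(r["text"])
    return dict(pages)


# Sample data in the documented result shape; in real use this would be
# the list returned by doc.search("error", case_insensitive=True).
sample = [
    {"page": 0, "text": "Error", "x": 72.0, "y": 700.0, "width": 30.0, "height": 12.0},
    {"page": 0, "text": "error", "x": 90.0, "y": 650.0, "width": 28.0, "height": 12.0},
    {"page": 2, "text": "ERROR", "x": 72.0, "y": 500.0, "width": 36.0, "height": 12.0},
]

for page, texts in group_by_page(sample).items():
    print(f"Page {page}: {len(texts)} match(es): {texts}")
```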

JavaScript Search API

The WasmPdfDocument class exposes the same search functionality.

`doc.search(pattern, ...) -> Array`

doc.search(pattern, caseInsensitive?, literal?, wholeWord?, maxResults?) -> Array

`doc.searchPage(pageIndex, pattern, ...) -> Array`

doc.searchPage(pageIndex, pattern, caseInsensitive?, literal?, wholeWord?, maxResults?) -> Array

Example:

const doc = new WasmPdfDocument(bytes);

// Search all pages, case-insensitive
const results = doc.search("error|warning", true);
for (const r of results) {
  console.log(`Page ${r.page}: '${r.text}'`);
}

// Search a single page with whole-word matching
const pageResults = doc.searchPage(0, "abstract", true, false, true);
doc.free();

Advanced Examples

Search and highlight with custom color

use pdf_oxide::api::Pdf;
use pdf_oxide::search::SearchOptions;

let mut pdf = Pdf::open("contract.pdf")?;

// Find all dollar amounts
let options = SearchOptions::new()
    .with_literal(false); // regex mode
let results = pdf.search_with_options(r"\$[\d,]+\.?\d*", options)?;

println!("Found {} dollar amounts", results.len());
for r in &results {
    println!("  Page {}: {}", r.page + 1, r.text);
}

// Highlight them in green
pdf.highlight_matches(&results, [0.6, 1.0, 0.6])?;
pdf.save("highlighted_amounts.pdf")?;

Search with page range restriction

from pdf_oxide import PdfDocument

doc = PdfDocument("book.pdf")

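# doc.search() as documented above takes no page-range argument, so search
# the whole document and keep only matches from the first 10 pages (indices 0-9).
results = doc.search(
    "introduction",
    case_insensitive=True,
    whole_word=True,
)

# Keep at most 5 matches from the first 10 pages
first_ten = [r for r in results if r["page"] < 10][:5]

for r in first_ten:
    print(f"Found on page {r['page'] + 1}")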

Build a search index across multiple PDFs

use pdf_oxide::PdfDocument;
use pdf_oxide::search::{TextSearcher, SearchOptions};
use std::collections::HashMap;

let files = vec!["paper_a.pdf", "paper_b.pdf", "paper_c.pdf"];
let query = "machine learning";
let options = SearchOptions::case_insensitive();

let mut index: HashMap<String, Vec<(usize, String)>> = HashMap::new();

for file in &files {
    let mut doc = PdfDocument::open(file)?;
    let results = TextSearcher::search(&mut doc, query, &options)?;

    for r in results {
        index.entry(file.to_string())
            .or_default()
            .push((r.page, r.text));
    }
}

for (file, matches) in &index {
    println!("{}: {} matches", file, matches.len());
    for (page, text) in matches {
        println!("  Page {}: '{}'", page + 1, text);
    }
}

Extract context around matches

use pdf_oxide::PdfDocument;
use pdf_oxide::search::{TextSearcher, SearchOptions};

let mut doc = PdfDocument::open("report.pdf")?;
let options = SearchOptions::new().with_case_insensitive(true);
let results = TextSearcher::search(&mut doc, "error", &options)?;

for r in &results {
    // Extract full page text for context
    let page_text = doc.extract_text(r.page)?;

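    // Show about 50 bytes of context on either side of the match.
    // This assumes start_index/end_index are byte offsets into page_text.
    let mut start = r.start_index.saturating_sub(50);
    let mut end = (r.end_index + 50).min(page_text.len());
    // Snap to char boundaries so slicing cannot panic on multi-byte text
    while !page_text.is_char_boundary(start) { start -= 1; }
    while !page_text.is_char_boundary(end) { end += 1; }
    let context = &page_text[start..end];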

    println!("Page {} match: ...{}...", r.page + 1, context.trim());
}

Related Pages

Text Extraction – The text extraction that search operates on
Annotation Extraction – Annotations created by highlight_matches
Markdown Conversion – Convert search result context to Markdown
