Text Search

PDF Oxide provides full-text search across PDF documents with regex support, case-insensitive matching, whole-word mode, and per-match bounding boxes. Search results include page number, matched text, and precise coordinates for each match, making it straightforward to build search-and-highlight workflows.

Use TextSearcher::search() for multi-page queries with custom options, or the Pdf convenience methods (search(), search_page(), highlight_matches()) for common use cases.

Quick Example

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
results = doc.search("conclusion", case_insensitive=True)
for r in results:
    print(f"Page {r['page']}: '{r['text']}' at ({r['x']:.1f}, {r['y']:.1f})")

Node.js

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("report.pdf");
const results = doc.searchAll("conclusion", { caseSensitive: false });
for (const r of results) {
  console.log(`Page ${r.page}: '${r.text}' at (${r.x.toFixed(1)}, ${r.y.toFixed(1)})`);
}
doc.close();

Go

import pdfoxide "github.com/yfedoseev/pdf_oxide/go"

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
results, _ := doc.SearchAll("conclusion", false)
for _, r := range results {
    fmt.Printf("Page %d: '%s' at (%.1f, %.1f)\n", r.Page, r.Text, r.X, r.Y)
}

C#

using PdfOxide.Core;

using var doc = PdfDocument.Open("report.pdf");
var results = doc.SearchAll("conclusion");
foreach (var r in results)
{
    Console.WriteLine($"Page {r.Page}: '{r.Text}' at ({r.X:F1}, {r.Y:F1})");
}

WASM

const doc = new WasmPdfDocument(bytes);
const results = doc.search("conclusion");
for (const r of results) {
    console.log(`Page ${r.page}: '${r.text}' at (${r.x.toFixed(1)}, ${r.y.toFixed(1)})`);
}

Rust

use pdf_oxide::api::Pdf;

let mut pdf = Pdf::open("report.pdf")?;
let results = pdf.search("conclusion")?;
for r in &results {
    println!("Page {}: '{}' at ({:.1}, {:.1})", r.page, r.text, r.bbox.x, r.bbox.y);
}

API Reference

TextSearcher::search(doc, pattern, options) -> Vec<SearchResult>

Search for text across multiple pages of a PDF document. The pattern is compiled as a regex unless literal mode is enabled.

| Parameter | Type | Description |
| --- | --- | --- |
| doc | &mut PdfDocument | The PDF document to search |
| pattern | &str | Regex pattern (or literal text if literal is set) |
| options | &SearchOptions | Search configuration |

Returns: A vector of SearchResult objects, ordered by page and position. The examples on this page apply ? to the call, so the vector arrives wrapped in a Result.

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::search::{TextSearcher, SearchOptions};

let mut doc = PdfDocument::open("report.pdf")?;

let options = SearchOptions::new()
    .with_case_insensitive(true)
    .with_max_results(50);

let results = TextSearcher::search(&mut doc, "error|warning", &options)?;
for r in &results {
    println!("Page {}: '{}'", r.page, r.text);
}

TextSearcher::search_page(doc, page, regex, options) -> Vec<SearchResult>

Search for text on a specific page using a pre-compiled regex.

| Parameter | Type | Description |
| --- | --- | --- |
| doc | &mut PdfDocument | The PDF document |
| page | usize | Zero-based page index |
| regex | &Regex | Pre-compiled regex pattern |
| options | &SearchOptions | Search configuration |

Returns: A vector of SearchResult objects for the specified page.

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::search::{TextSearcher, SearchOptions};
use regex::Regex;

let mut doc = PdfDocument::open("report.pdf")?;
let regex = Regex::new(r"\d{4}-\d{2}-\d{2}")?; // Date pattern
let options = SearchOptions::default();

let results = TextSearcher::search_page(&mut doc, 0, &regex, &options)?;
for r in &results {
    println!("Date found: '{}' at ({:.1}, {:.1})", r.text, r.bbox.x, r.bbox.y);
}

SearchOptions

Configuration for text search behavior. Uses a builder pattern for ergonomic construction.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| case_insensitive | bool | false | Ignore case when matching |
| literal | bool | false | Treat pattern as literal text (escape regex chars) |
| whole_word | bool | false | Match whole words only (wraps pattern in \b...\b) |
| max_results | usize | 0 | Maximum results to return (0 = unlimited) |
| page_range | Option<(usize, usize)> | None | Page range to search (inclusive start, inclusive end) |

Builder Methods

let options = SearchOptions::new()
    .with_case_insensitive(true)
    .with_literal(true)
    .with_whole_word(true)
    .with_max_results(100)
    .with_page_range(0, 9);

Convenience Constructor

// Quick case-insensitive search
let options = SearchOptions::case_insensitive();
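
To make the option semantics in the table concrete, here is a minimal sketch that models them with Python's standard re module on a plain string. It is not PDF Oxide's implementation: build_pattern and find_all are hypothetical names, and the non-capturing group around the pattern is an assumption added so that alternations such as a|b stay inside the word boundaries.

```python
import re

def build_pattern(pattern, case_insensitive=False, literal=False, whole_word=False):
    """Model of the documented option semantics (illustrative only)."""
    if literal:
        pattern = re.escape(pattern)        # regex metacharacters become plain text
    if whole_word:
        pattern = rf"\b(?:{pattern})\b"     # wrap in word boundaries
    flags = re.IGNORECASE if case_insensitive else 0
    return re.compile(pattern, flags)

def find_all(text, regex, max_results=0):
    """Return matched strings; max_results=0 means unlimited, as in the table."""
    matches = [m.group(0) for m in regex.finditer(text)]
    return matches if max_results == 0 else matches[:max_results]

text = "Cat, concatenate, cat. Price: 3.50 or 3x50"

# whole_word skips the "cat" inside "concatenate"
print(find_all(text, build_pattern("cat", case_insensitive=True, whole_word=True)))
# literal mode treats "." as a dot; regex mode treats it as "any character"
print(find_all(text, build_pattern("3.50", literal=True)))
print(find_all(text, build_pattern("3.50")))
# max_results caps the list
print(find_all(text, build_pattern("cat", case_insensitive=True), max_results=2))
```

The literal and regex calls on "3.50" show why literal mode matters when the search text comes from user input.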

SearchResult

A single search match with position information.

| Field | Type | Description |
| --- | --- | --- |
| page | usize | Page number (0-indexed) |
| text | String | The matched text |
| bbox | Rect | Combined bounding box of the match |
| start_index | usize | Start index in the page’s extracted text |
| end_index | usize | End index in the page’s extracted text |
| span_boxes | Vec<Rect> | Individual bounding boxes for each span in the match (useful for multi-line matches) |
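
The relationship between bbox and span_boxes can be pictured as a rectangle union. The sketch below assumes that the combined box is the smallest rectangle covering every span box and that a rectangle is (x, y, width, height) with width and height extending in the positive direction. Neither assumption is confirmed by this page, and union_rect is a hypothetical helper.

```python
def union_rect(boxes):
    """Smallest (x, y, width, height) rectangle covering every box in `boxes`."""
    if not boxes:
        raise ValueError("need at least one box")
    x0 = min(x for x, y, w, h in boxes)
    y0 = min(y for x, y, w, h in boxes)
    x1 = max(x + w for x, y, w, h in boxes)
    y1 = max(y + h for x, y, w, h in boxes)
    return (x0, y0, x1 - x0, y1 - y0)

# Hypothetical match that starts at the end of one line and wraps to the next
span_boxes = [
    (400.0, 700.0, 120.0, 12.0),  # tail of the first line
    (72.0, 686.0, 60.0, 12.0),    # head of the second line
]
combined = union_rect(span_boxes)
print(combined)
```

Under these assumptions the combined box spans almost the full text width, which is why per-span boxes are the better choice when drawing highlights for wrapped matches.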

Python: In the Python API, search results are returned as dictionaries:

{
    "page": 0,
    "text": "conclusion",
    "x": 72.0,
    "y": 650.5,
    "width": 85.3,
    "height": 12.0,
}
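
Because each Python result is a plain dictionary, ordinary Python tools are enough to post-process them. The following sketch groups results by page and converts a result to corner coordinates. The sample data is made up, group_by_page and to_corners are hypothetical helpers, and the corner conversion assumes (x, y) is the rectangle origin with width and height extending positively.

```python
from collections import defaultdict

def group_by_page(results):
    """Map zero-based page index -> list of result dicts on that page."""
    pages = defaultdict(list)
    for r in results:
        pages[r["page"]].append(r)
    return dict(pages)

def to_corners(r):
    """Convert an x/y/width/height result to (x0, y0, x1, y1)."""
    return (r["x"], r["y"], r["x"] + r["width"], r["y"] + r["height"])

# Made-up results shaped like the dictionary shown above
results = [
    {"page": 0, "text": "conclusion", "x": 72.0, "y": 650.5, "width": 85.5, "height": 12.0},
    {"page": 2, "text": "Conclusion", "x": 72.0, "y": 700.0, "width": 90.0, "height": 14.0},
    {"page": 0, "text": "conclusion", "x": 100.0, "y": 320.0, "width": 85.5, "height": 12.0},
]

grouped = group_by_page(results)
for page in sorted(grouped):
    # Add 1 when showing page numbers to readers, since indexes are zero-based
    print(f"Page {page + 1}: {len(grouped[page])} match(es)")
print(to_corners(results[0]))
```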

Pdf Convenience Methods

The high-level Pdf API provides shortcut methods for common search operations.

search(pattern) -> Vec<SearchResult>

Search the entire document with default options.

let mut pdf = Pdf::open("report.pdf")?;
let results = pdf.search("important")?;

search_with_options(pattern, options) -> Vec<SearchResult>

Search with custom options.

let options = SearchOptions::case_insensitive()
    .with_whole_word(true)
    .with_page_range(0, 5);
let results = pdf.search_with_options("abstract", options)?;

search_page(page, pattern) -> Vec<SearchResult>

Search a single page with default options.

let results = pdf.search_page(0, r"\d+\.\d+")?; // Find decimal numbers

highlight_matches(results, color) -> Result<()>

Create highlight annotations for search results. Each result gets a highlight annotation on its page in the color you pass (for example, yellow).

| Parameter | Type | Description |
| --- | --- | --- |
| results | &[SearchResult] | Search results to highlight |
| color | [f32; 3] | RGB color (0.0–1.0 per component) |

let mut pdf = Pdf::open("report.pdf")?;
let results = pdf.search("important")?;
pdf.highlight_matches(&results, [1.0, 1.0, 0.0])?; // Yellow
pdf.save("highlighted.pdf")?;
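
Colors are often specified as hex strings or 0–255 triples, while highlight_matches expects components in the 0.0–1.0 range. A small conversion sketch follows. The helper names are hypothetical and not part of PDF Oxide; they only produce the three numbers you would pass as the color.

```python
def rgb255_to_unit(r, g, b):
    """Convert 0-255 RGB components to the 0.0-1.0 range."""
    for c in (r, g, b):
        if not 0 <= c <= 255:
            raise ValueError(f"component out of range: {c}")
    return [c / 255 for c in (r, g, b)]

def hex_to_unit(hex_color):
    """Convert '#RRGGBB' (leading '#' optional) to 0.0-1.0 components."""
    h = hex_color.lstrip("#")
    if len(h) != 6:
        raise ValueError(f"expected 6 hex digits, got {hex_color!r}")
    return rgb255_to_unit(int(h[0:2], 16), int(h[2:4], 16), int(h[4:6], 16))

yellow = hex_to_unit("#FFFF00")
light_green = hex_to_unit("#99FF99")
print(yellow, light_green)
```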

Python Search API

The Python PdfDocument class exposes search directly.

doc.search(pattern, ...) -> list[dict]

doc.search(
    pattern: str,
    case_insensitive: bool = False,
    literal: bool = False,
    whole_word: bool = False,
    max_results: int = 0,
) -> list[dict]

doc.search_page(page, pattern, ...) -> list[dict]

doc.search_page(
    page: int,
    pattern: str,
    case_insensitive: bool = False,
    literal: bool = False,
    whole_word: bool = False,
    max_results: int = 0,
) -> list[dict]
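
The Python signatures above take no page-range argument, so one way to restrict a search to a range of pages is to loop over search_page. This sketch assumes page indexes are zero-based, as in the result dictionaries. search_pages is a hypothetical helper, and StubDocument is only a stand-in so the example runs without a PDF file. With a real PdfDocument, behavior for out-of-range page indexes is not covered here.

```python
def search_pages(doc, pattern, first_page, last_page, **options):
    """Run doc.search_page() over an inclusive, zero-based page range."""
    results = []
    for page in range(first_page, last_page + 1):
        results.extend(doc.search_page(page, pattern, **options))
    return results

class StubDocument:
    """Stand-in for PdfDocument: substring search over in-memory page strings."""
    def __init__(self, pages):
        self.pages = pages

    def search_page(self, page, pattern, case_insensitive=False,
                    literal=False, whole_word=False, max_results=0):
        text = self.pages[page]
        if case_insensitive:
            text, pattern = text.lower(), pattern.lower()
        return [{"page": page, "text": pattern}] if pattern in text else []

doc = StubDocument(["Introduction", "Methods", "introduction recap"])
hits = search_pages(doc, "introduction", 0, 1, case_insensitive=True)
print([h["page"] for h in hits])  # page 2 is outside the range
```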

JavaScript Search API

The WasmPdfDocument class exposes the same search functionality.

doc.search(pattern, ...) -> Array

doc.search(pattern, caseInsensitive?, literal?, wholeWord?, maxResults?) -> Array

doc.searchPage(pageIndex, pattern, ...) -> Array

doc.searchPage(pageIndex, pattern, caseInsensitive?, literal?, wholeWord?, maxResults?) -> Array

Example:

const doc = new WasmPdfDocument(bytes);

// Search all pages, case-insensitive
const results = doc.search("error|warning", true);
for (const r of results) {
  console.log(`Page ${r.page}: '${r.text}'`);
}

// Search a single page with whole-word matching
const pageResults = doc.searchPage(0, "abstract", true, false, true);
doc.free();

Advanced Examples

Search and highlight with custom color

use pdf_oxide::api::Pdf;
use pdf_oxide::search::SearchOptions;

let mut pdf = Pdf::open("contract.pdf")?;

// Find all dollar amounts
let options = SearchOptions::new()
    .with_literal(false); // regex mode (already the default; shown for clarity)
let results = pdf.search_with_options(r"\$[\d,]+\.?\d*", options)?;

println!("Found {} dollar amounts", results.len());
for r in &results {
    println!("  Page {}: {}", r.page + 1, r.text);
}

// Highlight them in green
pdf.highlight_matches(&results, [0.6, 1.0, 0.6])?;
pdf.save("highlighted_amounts.pdf")?;
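
Before running a pattern over a whole document, it can help to try it on a sample string. The sketch below exercises the dollar-amount pattern with Python's re module, on the assumption that this particular pattern behaves the same in Python's re and the Rust regex engine. The sample text and the tighter alternative pattern are illustrative, not taken from PDF Oxide.

```python
import re

# Pattern from the example above
DOLLARS = re.compile(r"\$[\d,]+\.?\d*")
# Alternative that only accepts a "." when digits follow it
DOLLARS_STRICT = re.compile(r"\$[\d,]+(?:\.\d+)?")

sample = "Fee: $1,250.00 due now; late charge $35. Balance: $ 10 (see note)."

loose = DOLLARS.findall(sample)
strict = DOLLARS_STRICT.findall(sample)
print(loose)
print(strict)
```

On this sample the original pattern also captures the sentence-ending period after $35, while the stricter variant does not. Neither matches "$ 10", because a digit or comma must follow the dollar sign directly.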

Search with page range restriction

from pdf_oxide import PdfDocument

doc = PdfDocument("book.pdf")

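# The Python search() signature has no page-range argument, so search the
# whole document and keep only matches from the first 10 pages
results = doc.search(
    "introduction",
    case_insensitive=True,
    whole_word=True,
)
results = [r for r in results if r["page"] < 10][:5]  # at most 5 matches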

for r in results:
    print(f"Found on page {r['page'] + 1}")

Build a search index across multiple PDFs

use pdf_oxide::PdfDocument;
use pdf_oxide::search::{TextSearcher, SearchOptions};
use std::collections::HashMap;

let files = vec!["paper_a.pdf", "paper_b.pdf", "paper_c.pdf"];
let query = "machine learning";
let options = SearchOptions::case_insensitive();

let mut index: HashMap<String, Vec<(usize, String)>> = HashMap::new();

for file in &files {
    let mut doc = PdfDocument::open(file)?;
    let results = TextSearcher::search(&mut doc, query, &options)?;

    for r in results {
        index.entry(file.to_string())
            .or_default()
            .push((r.page, r.text));
    }
}

for (file, matches) in &index {
    println!("{}: {} matches", file, matches.len());
    for (page, text) in matches {
        println!("  Page {}: '{}'", page + 1, text);
    }
}

Extract context around matches

use pdf_oxide::PdfDocument;
use pdf_oxide::search::{TextSearcher, SearchOptions};

let mut doc = PdfDocument::open("report.pdf")?;
let options = SearchOptions::new().with_case_insensitive(true);
let results = TextSearcher::search(&mut doc, "error", &options)?;

for r in &results {
    // Extract full page text for context
    let page_text = doc.extract_text(r.page)?;

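    // Show roughly 50 bytes before and after the match. This assumes the
    // indices are byte offsets; snap both ends to UTF-8 char boundaries so
    // the slice cannot panic on multi-byte text.
    let mut start = r.start_index.saturating_sub(50).min(page_text.len());
    while !page_text.is_char_boundary(start) { start -= 1; }
    let mut end = (r.end_index + 50).min(page_text.len());
    while !page_text.is_char_boundary(end) { end += 1; }
    let context = &page_text[start..end];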

    println!("Page {} match: ...{}...", r.page + 1, context.trim());
}
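
The same windowing idea can be sketched in Python on a plain string. The Python result dictionary shown earlier does not list start or end indexes, so this example locates the match with str.index. context_window is a hypothetical helper, and the sample text is made up.

```python
def context_window(page_text, start, end, radius=50):
    """Return the match plus up to `radius` characters on each side."""
    lo = max(0, start - radius)
    hi = min(len(page_text), end + radius)
    prefix = "..." if lo > 0 else ""               # text was cut on the left
    suffix = "..." if hi < len(page_text) else ""  # text was cut on the right
    return prefix + page_text[lo:hi].strip() + suffix

page_text = "The build finished with one error in module A."
start = page_text.index("error")
end = start + len("error")

snippet = context_window(page_text, start, end, radius=10)
print(snippet)
```

Python strings are indexed by code point, so slicing cannot split a character here. The Rust version has to respect UTF-8 boundaries if its indexes are byte offsets.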