What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Text Search

PDF Oxide provides full-text search across PDF documents with regex support, case-insensitive matching, whole-word mode, and per-match bounding boxes. Search results include page number, matched text, and precise coordinates for each match, making it straightforward to build search-and-highlight workflows.

Use TextSearcher::search() for multi-page queries with custom options, or the Pdf convenience methods (search(), search_page(), highlight_matches()) for common use cases.

Quick Example

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
results = doc.search("conclusion", case_insensitive=True)
for r in results:
    print(f"Page {r['page']}: '{r['text']}' at ({r['x']:.1f}, {r['y']:.1f})")

Node.js

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("report.pdf");
const results = doc.searchAll("conclusion", { caseSensitive: false });
for (const r of results) {
  console.log(`Page ${r.page}: '${r.text}' at (${r.x.toFixed(1)}, ${r.y.toFixed(1)})`);
}
doc.close();

Go

import (
    "fmt"

    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
results, _ := doc.SearchAll("conclusion", false)
for _, r := range results {
    fmt.Printf("Page %d: '%s' at (%.1f, %.1f)\n", r.Page, r.Text, r.X, r.Y)
}

C#

using System;
using PdfOxide.Core;

using var doc = PdfDocument.Open("report.pdf");
var results = doc.SearchAll("conclusion");
foreach (var r in results)
{
    Console.WriteLine($"Page {r.Page}: '{r.Text}' at ({r.X:F1}, {r.Y:F1})");
}

WASM

const doc = new WasmPdfDocument(bytes);
const results = doc.search("conclusion");
for (const r of results) {
    console.log(`Page ${r.page}: '${r.text}' at (${r.x.toFixed(1)}, ${r.y.toFixed(1)})`);
}

Rust

use pdf_oxide::api::Pdf;

let mut pdf = Pdf::open("report.pdf")?;
let results = pdf.search("conclusion")?;
for r in &results {
    println!("Page {}: '{}' at ({:.1}, {:.1})", r.page, r.text, r.bbox.x, r.bbox.y);
}

Java

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.search.SearchMatch;
import java.nio.file.Path;
import java.util.List;

try (PdfDocument doc = PdfDocument.open(Path.of("report.pdf"))) {
    List<SearchMatch> results = doc.search("conclusion", true, false, 0);
    for (SearchMatch m : results) {
        System.out.printf("Page %d: '%s' at (%.1f, %.1f)%n",
            m.pageIndex(), m.text(), m.bbox().x0(), m.bbox().y0());
    }
}

Kotlin

import fyi.oxide.pdf.PdfDocument
import java.nio.file.Path

PdfDocument.open(Path.of("report.pdf")).use { doc ->
    val results = doc.search("conclusion", true, false, 0)
    for (m in results) {
        println("Page ${m.pageIndex()}: '${m.text()}' at (${m.bbox().x0()}, ${m.bbox().y0()})")
    }
}

Scala

import fyi.oxide.pdf.{PdfDocument, searchSeq}
import scala.util.Using

Using.resource(PdfDocument.open("report.pdf")) { doc =>
  val results = doc.searchSeq("conclusion")
  for (m <- results)
    println(f"Page ${m.pageIndex}: '${m.text}' at (${m.bbox.x0}%.1f, ${m.bbox.y0}%.1f)")
}

Clojure

(require '[pdf-oxide.core :as pdf])

(with-open [doc (pdf/open "report.pdf")]
  (doseq [m (pdf/search doc "conclusion")]
    (printf "Page %d: '%s' at (%.1f, %.1f)%n"
            (.pageIndex m) (.text m) (.x0 (.bbox m)) (.y0 (.bbox m)))))

Ruby

require 'pdf_oxide'

PdfOxide::PdfDocument.open('report.pdf') do |doc|
  doc.search('conclusion', case_sensitive: false).each do |r|
    bbox = r[:bbox]
    printf("Page %d: '%s' at (%.1f, %.1f)\n", r[:page], r[:text], bbox[:x], bbox[:y])
  end
end

C++

#include <pdf_oxide/pdf_oxide.hpp>
#include <cstdio>

auto doc = pdf_oxide::Document::open("report.pdf");
auto results = doc.search_all("conclusion", /*case_sensitive=*/false);
for (const auto& r : results) {
    std::printf("Page %d: '%s' at (%.1f, %.1f)\n",
                r.page, r.text.c_str(), r.bbox.x, r.bbox.y);
}

Swift

import PdfOxide

let doc = try Document.open("report.pdf")
let results = try doc.searchAll("conclusion", false)
for r in results {
    print("Page \(r.page): '\(r.text)' at (\(r.bbox.x), \(r.bbox.y))")
}

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('report.pdf');
final results = doc.searchAll('conclusion', false);
for (final r in results) {
  print("Page ${r.page}: '${r.text}' at (${r.bbox.x}, ${r.bbox.y})");
}
doc.close();

R

library(pdfoxide)

doc <- pdf_open("report.pdf")
results <- pdf_search_all(doc, "conclusion", case_sensitive = FALSE)
for (r in results) {
  cat(sprintf("Page %d: '%s' at (%.1f, %.1f)\n",
              r$page, r$text, r$bbox$x, r$bbox$y))
}

Julia

using PdfOxide

doc = open_document("report.pdf")
results = search_all(doc, "conclusion", false)
for r in results
    println("Page $(r.page): '$(r.text)' at ($(r.bbox.x), $(r.bbox.y))")
end

Zig

const std = @import("std");
const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("report.pdf");
const results = try doc.searchAll(a, "conclusion", false);
defer doc.freeSearchResults(a, results);
for (results) |r| {
    std.debug.print("Page {d}: '{s}' at ({d:.1}, {d:.1})\n", .{ r.page, r.text, r.bbox.x, r.bbox.y });
}

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
NSArray<POXSearchResult*> *results = [doc searchAll:@"conclusion" caseSensitive:NO error:&err];
for (POXSearchResult *r in results) {
    NSLog(@"Page %ld: '%@' at (%.1f, %.1f)", (long)r.page, r.text, r.bbox.x, r.bbox.y);
}

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")
{:ok, results} = PdfOxide.search_all(doc, "conclusion", false)

for r <- results do
  IO.puts("Page #{r.page}: '#{r.text}' at (#{r.bbox.x}, #{r.bbox.y})")
end

API Reference

`TextSearcher::search(doc, pattern, options) -> Vec<SearchResult>`

Search for text across multiple pages of a PDF document. The pattern is compiled as a regex unless literal mode is enabled.

Parameter	Type	Description
`doc`	`&mut PdfDocument`	The PDF document to search
`pattern`	`&str`	Regex pattern (or literal text if `literal` is set)
`options`	`&SearchOptions`	Search configuration

Returns: A vector of SearchResult objects, ordered by page and position.

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::search::{TextSearcher, SearchOptions};

let mut doc = PdfDocument::open("report.pdf")?;

let options = SearchOptions::new()
    .with_case_insensitive(true)
    .with_max_results(50);

let results = TextSearcher::search(&mut doc, "error|warning", &options)?;
for r in &results {
    println!("Page {}: '{}'", r.page, r.text);
}

`TextSearcher::search_page(doc, page, regex, options) -> Vec<SearchResult>`

Search for text on a specific page using a pre-compiled regex.

Parameter	Type	Description
`doc`	`&mut PdfDocument`	The PDF document
`page`	`usize`	Zero-based page index
`regex`	`&Regex`	Pre-compiled regex pattern
`options`	`&SearchOptions`	Search configuration

Returns: A vector of SearchResult objects for the specified page.

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::search::{TextSearcher, SearchOptions};
use regex::Regex;

let mut doc = PdfDocument::open("report.pdf")?;
let regex = Regex::new(r"\d{4}-\d{2}-\d{2}")?; // Date pattern
let options = SearchOptions::default();

let results = TextSearcher::search_page(&mut doc, 0, &regex, &options)?;
for r in &results {
    println!("Date found: '{}' at ({:.1}, {:.1})", r.text, r.bbox.x, r.bbox.y);
}

SearchOptions

Configuration for text search behavior. Uses a builder pattern for ergonomic construction.

Field	Type	Default	Description
`case_insensitive`	`bool`	`false`	Ignore case when matching
`literal`	`bool`	`false`	Treat pattern as literal text (escape regex chars)
`whole_word`	`bool`	`false`	Match whole words only (wraps pattern in `\b...\b`)
`max_results`	`usize`	`0`	Maximum results to return (0 = unlimited)
`page_range`	`Option<(usize, usize)>`	`None`	Page range to search (inclusive start, inclusive end)
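
The way these flags interact can be modelled in a few lines. The sketch below uses Python's standard `re` module purely to illustrate the semantics listed in the table; it is not PDF Oxide's implementation, the helper names `compile_search_pattern` and `find_matches` are hypothetical, and the library's Rust regex engine differs from `re` in some syntax details.

```python
import re


def compile_search_pattern(pattern, case_insensitive=False, literal=False, whole_word=False):
    """Model how the SearchOptions flags could combine into one regex (illustrative only)."""
    if literal:
        # Literal mode: escape regex metacharacters so the text matches verbatim.
        pattern = re.escape(pattern)
    if whole_word:
        # Whole-word mode: wrap the pattern in word boundaries.
        # The non-capturing group keeps alternations such as "a|b" intact (an assumption).
        pattern = rf"\b(?:{pattern})\b"
    flags = re.IGNORECASE if case_insensitive else 0
    return re.compile(pattern, flags)


def find_matches(text, pattern, max_results=0, **flags):
    """Return matched substrings; max_results=0 means unlimited, as in the table above."""
    regex = compile_search_pattern(pattern, **flags)
    matches = [m.group(0) for m in regex.finditer(text)]
    return matches[:max_results] if max_results > 0 else matches
```

Applying the literal escape before the word-boundary wrap matters: escaping afterwards would also escape the `\b` anchors and turn them into literal text.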

Builder Methods

let options = SearchOptions::new()
    .with_case_insensitive(true)
    .with_literal(true)
    .with_whole_word(true)
    .with_max_results(100)
    .with_page_range(0, 9);

Convenience Constructor

// Quick case-insensitive search
let options = SearchOptions::case_insensitive();

SearchResult

A single search match with position information.

Field	Type	Description
`page`	`usize`	Page number (0-indexed)
`text`	`String`	The matched text
`bbox`	`Rect`	Combined bounding box of the match
`start_index`	`usize`	Start index in the page’s extracted text
`end_index`	`usize`	End index in the page’s extracted text
`span_boxes`	`Vec<Rect>`	Individual bounding boxes for each span in the match (useful for multi-line matches)

Python: In the Python API, search results are returned as dictionaries:

{
    "page": 0,
    "text": "conclusion",
    "x": 72.0,
    "y": 650.5,
    "width": 85.3,
    "height": 12.0,
}
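
Because each match is a plain dictionary, downstream code can work with it without any PDF Oxide types. The sketch below groups matches by page and derives the corner coordinates of each box. It assumes `(x, y)` is the lower-left corner of the box in PDF points, which this page does not state explicitly; `match_rect` and `group_by_page` are hypothetical helper names, and the sample values are illustrative.

```python
from collections import defaultdict


def match_rect(match):
    """Return (x0, y0, x1, y1) for one result dict.

    Assumes (x, y) is the lower-left corner, so adding width/height
    gives the upper-right corner.
    """
    x0, y0 = match["x"], match["y"]
    return (x0, y0, x0 + match["width"], y0 + match["height"])


def group_by_page(results):
    """Map each 0-indexed page number to the list of rectangles found on it."""
    pages = defaultdict(list)
    for match in results:
        pages[match["page"]].append(match_rect(match))
    return dict(pages)


# Sample data in the shape shown above (values are illustrative).
sample = [
    {"page": 0, "text": "conclusion", "x": 72.0, "y": 650.5, "width": 85.3, "height": 12.0},
    {"page": 2, "text": "Conclusion", "x": 72.0, "y": 100.0, "width": 90.0, "height": 12.0},
]
rects = group_by_page(sample)
```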

Pdf Convenience Methods

The high-level Pdf API provides shortcut methods for common search operations.

`search(pattern) -> Vec<SearchResult>`

Search the entire document with default options.

let mut pdf = Pdf::open("report.pdf")?;
let results = pdf.search("important")?;

`search_with_options(pattern, options) -> Vec<SearchResult>`

Search with custom options.

let options = SearchOptions::case_insensitive()
    .with_whole_word(true)
    .with_page_range(0, 5);
let results = pdf.search_with_options("abstract", options)?;

`search_page(page, pattern) -> Vec<SearchResult>`

Search a single page with default options.

let results = pdf.search_page(0, r"\d+\.\d+")?; // Find decimal numbers

`highlight_matches(results, color) -> Result<()>`

Create highlight annotations for search results. Each result gets a highlight annotation on its page in the color you pass, for example yellow ([1.0, 1.0, 0.0]).

Parameter	Type	Description
`results`	`&[SearchResult]`	Search results to highlight
`color`	`[f32; 3]`	RGB color (0.0–1.0 per component)

let mut pdf = Pdf::open("report.pdf")?;
let results = pdf.search("important")?;
pdf.highlight_matches(&results, [1.0, 1.0, 0.0])?; // Yellow
pdf.save("highlighted.pdf")?;
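
`highlight_matches` expects each color component as a float between 0.0 and 1.0, while design tools usually give colors as 8-bit hex values. The conversion is a division by 255. A minimal sketch follows, shown in Python for brevity; `hex_to_unit_rgb` is a hypothetical helper and not part of PDF Oxide.

```python
def hex_to_unit_rgb(hex_color):
    """Convert '#RRGGBB' to three floats in the 0.0-1.0 range."""
    value = hex_color.lstrip("#")
    if len(value) != 6:
        raise ValueError(f"expected #RRGGBB, got {hex_color!r}")
    # Each pair of hex digits is one 8-bit channel; divide by 255 to normalize.
    return tuple(int(value[i:i + 2], 16) / 255 for i in (0, 2, 4))


yellow = hex_to_unit_rgb("#FFFF00")  # the yellow used in the example above
```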

Python Search API

The Python PdfDocument class exposes search directly.

`doc.search(pattern, ...) -> list[dict]`

doc.search(
    pattern: str,
    case_insensitive: bool = False,
    literal: bool = False,
    whole_word: bool = False,
    max_results: int = 0,
) -> list[dict]

`doc.search_page(page, pattern, ...) -> list[dict]`

doc.search_page(
    page: int,
    pattern: str,
    case_insensitive: bool = False,
    literal: bool = False,
    whole_word: bool = False,
    max_results: int = 0,
) -> list[dict]

JavaScript Search API

The WasmPdfDocument class exposes the same search functionality.

`doc.search(pattern, ...) -> Array`

doc.search(pattern, caseInsensitive?, literal?, wholeWord?, maxResults?) -> Array

`doc.searchPage(pageIndex, pattern, ...) -> Array`

doc.searchPage(pageIndex, pattern, caseInsensitive?, literal?, wholeWord?, maxResults?) -> Array

Example:

const doc = new WasmPdfDocument(bytes);

// Search all pages, case-insensitive
const results = doc.search("error|warning", true);
for (const r of results) {
  console.log(`Page ${r.page}: '${r.text}'`);
}

// Search a single page with whole-word matching
const pageResults = doc.searchPage(0, "abstract", true, false, true);
doc.free();

How do I serialize search results to JSON?

Several bindings expose a one-shot serializer that turns a page’s search-result list into a JSON array in a single FFI crossing: Rust serializes the whole list and the binding decodes it, instead of pulling each field across the boundary one match at a time. The Go and C# SearchPage methods use the same path internally to decode their results.

The C ABI signature is authoritative:

char *pdf_oxide_search_results_to_json(
    const FfiSearchResults *results,
    int32_t *error_code);

It takes the opaque results handle returned by pdf_document_search_page(...) and returns a malloc’d UTF-8 JSON string (free it with pdf_free_string). Each element carries the match’s page, text, and bounding box (x, y, width, height).
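
On the consuming side, the JSON array decodes into simple records. The sketch below parses a payload in the shape just described; the `Match` dataclass and `parse_matches` are hypothetical names, and the sample string is illustrative rather than real library output.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    page: int
    text: str
    x: float
    y: float
    width: float
    height: float


def parse_matches(payload):
    """Decode the JSON array into Match records, ignoring any unknown keys."""
    fields = Match.__dataclass_fields__
    return [Match(**{k: v for k, v in item.items() if k in fields})
            for item in json.loads(payload)]


payload = '[{"page":0,"text":"conclusion","x":72.0,"y":650.5,"width":85.3,"height":12.0}]'
matches = parse_matches(payload)
```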

Swift. The wrapper bundles the search and the serialization into one call, searchResultsToJson(_:_:caseSensitive:):

import PdfOxide

let doc = try PdfDocument(path: "report.pdf")

// Search page 0 for "conclusion" and get the matches as a JSON string
let json = try doc.searchResultsToJson(0, "conclusion", caseSensitive: false)
print(json)
// [{"page":0,"text":"conclusion","x":72.0,"y":650.5,"width":85.3,"height":12.0}, ...]

Go / C#. These bindings call pdf_oxide_search_results_to_json under the hood and hand you already-decoded native records, so you don’t invoke the serializer directly. Use doc.SearchPage(...) (Go: doc.SearchPage(page, text, caseSensitive); C#: doc.SearchPage(pageIndex, text, caseSensitive)) and you get strongly-typed results back. To obtain JSON in those languages, serialize the returned records with your standard JSON library (encoding/json / System.Text.Json).

Python / Rust. The Python doc.search(...) / doc.search_page(...) methods already return native list[dict] records (JSON-serialize them directly with json.dumps), and Rust returns typed Vec<SearchResult> you can serialize with serde_json. Neither needs the C-ABI serializer.
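
For Python, that amounts to one `json.dumps` call. The sketch below uses a hand-written list in the documented result shape in place of a real `doc.search(...)` return value, so that it runs without a PDF.

```python
import json

# Stand-in for the list returned by doc.search(...); values are illustrative.
results = [
    {"page": 0, "text": "conclusion", "x": 72.0, "y": 650.5, "width": 85.3, "height": 12.0},
]

# The records contain only ints, floats, and strings, so no custom encoder is needed.
payload = json.dumps(results)

# Round-trip to confirm nothing is lost in serialization.
restored = json.loads(payload)
```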

Advanced Examples

Search and highlight with custom color

use pdf_oxide::api::Pdf;
use pdf_oxide::search::SearchOptions;

let mut pdf = Pdf::open("contract.pdf")?;

// Find all dollar amounts
let options = SearchOptions::new()
    .with_literal(false); // regex mode
let results = pdf.search_with_options(r"\$[\d,]+\.?\d*", options)?;

println!("Found {} dollar amounts", results.len());
for r in &results {
    println!("  Page {}: {}", r.page + 1, r.text);
}

// Highlight them in green
pdf.highlight_matches(&results, [0.6, 1.0, 0.6])?;
pdf.save("highlighted_amounts.pdf")?;

Search with page range restriction

from pdf_oxide import PdfDocument

doc = PdfDocument("book.pdf")

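# doc.search() takes no page-range argument, so search only the first
# 10 pages by calling search_page() once per page (assumes the document
# has at least 10 pages).
results = []
for page in range(10):
    results.extend(
        doc.search_page(page, "introduction", case_insensitive=True, whole_word=True)
    )
results = results[:5]  # keep at most 5 matches, as max_results=5 would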

for r in results:
    print(f"Found on page {r['page'] + 1}")

Build a search index across multiple PDFs

use pdf_oxide::PdfDocument;
use pdf_oxide::search::{TextSearcher, SearchOptions};
use std::collections::HashMap;

let files = vec!["paper_a.pdf", "paper_b.pdf", "paper_c.pdf"];
let query = "machine learning";
let options = SearchOptions::case_insensitive();

let mut index: HashMap<String, Vec<(usize, String)>> = HashMap::new();

for file in &files {
    let mut doc = PdfDocument::open(file)?;
    let results = TextSearcher::search(&mut doc, query, &options)?;

    for r in results {
        index.entry(file.to_string())
            .or_default()
            .push((r.page, r.text));
    }
}

for (file, matches) in &index {
    println!("{}: {} matches", file, matches.len());
    for (page, text) in matches {
        println!("  Page {}: '{}'", page + 1, text);
    }
}

Extract context around matches

use pdf_oxide::PdfDocument;
use pdf_oxide::search::{TextSearcher, SearchOptions};

let mut doc = PdfDocument::open("report.pdf")?;
let options = SearchOptions::new().with_case_insensitive(true);
let results = TextSearcher::search(&mut doc, "error", &options)?;

for r in &results {
    // Extract full page text for context
    let page_text = doc.extract_text(r.page)?;

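    // Show roughly 50 bytes before and after the match.
    // Assumes start_index/end_index are byte offsets into page_text;
    // the window is widened to the nearest UTF-8 character boundaries
    // so that slicing cannot panic on multi-byte characters.
    let mut start = r.start_index.saturating_sub(50);
    while !page_text.is_char_boundary(start) {
        start -= 1;
    }
    let mut end = (r.end_index + 50).min(page_text.len());
    while !page_text.is_char_boundary(end) {
        end += 1;
    }
    let context = &page_text[start..end];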

    println!("Page {} match: ...{}...", r.page + 1, context.trim());
}

FAQ

How do I get search results as JSON? In Swift, call doc.searchResultsToJson(page, term, caseSensitive:), which runs the page search and returns a JSON array of matches in one call. In Python and Rust, search returns native records (list[dict] / Vec<SearchResult>) you serialize with json.dumps / serde_json. Go and C# return typed records that you serialize with encoding/json / System.Text.Json.

What does each JSON match contain? The match page (0-indexed), the matched text, and the combined bounding box: x, y, width, height (PDF points, bottom-left origin).

Is search regex or literal by default? Patterns are compiled as a regex unless you enable literal mode (with_literal(true) / literal=True), which escapes regex metacharacters and matches the text verbatim.

Does search support case-insensitive and whole-word matching? Yes — set case_insensitive and whole_word in SearchOptions (Rust) or as keyword arguments (Python) / options (other bindings).

Related Pages

Text Extraction – The text extraction that search operates on
Scoped Extraction – Region-scoped extraction and structured-region JSON
Annotation Extraction – Annotations created by highlight_matches
Markdown Conversion – Convert search results context to Markdown
