What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

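Text Search

PDF Oxide provides full-text search across PDF documents with regex support, case-insensitive matching, whole-word mode, and a bounding box for each match. Search results include the page number, the matched text, and the exact coordinates of each match, which makes search-and-highlight workflows easy to build.

Use TextSearcher::search() for custom queries that span multiple pages, and the Pdf convenience methods (search(), search_page(), highlight_matches()) for common use cases.

Quick Example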

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
results = doc.search("conclusion", case_insensitive=True)
for r in results:
    print(f"Page {r['page']}: '{r['text']}' at ({r['x']:.1f}, {r['y']:.1f})")

Node.js

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("report.pdf");
const results = doc.searchAll("conclusion", { caseSensitive: false });
for (const r of results) {
  console.log(`Page ${r.page}: '${r.text}' at (${r.x.toFixed(1)}, ${r.y.toFixed(1)})`);
}
doc.close();

Go

import (
    "fmt"

    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
results, _ := doc.SearchAll("conclusion", false)
for _, r := range results {
    fmt.Printf("Page %d: '%s' at (%.1f, %.1f)\n", r.Page, r.Text, r.X, r.Y)
}

C#

using PdfOxide.Core;

using var doc = PdfDocument.Open("report.pdf");
var results = doc.SearchAll("conclusion");
foreach (var r in results)
{
    Console.WriteLine($"Page {r.Page}: '{r.Text}' at ({r.X:F1}, {r.Y:F1})");
}

WASM

const doc = new WasmPdfDocument(bytes);
const results = doc.search("conclusion");
for (const r of results) {
    console.log(`Page ${r.page}: '${r.text}' at (${r.x.toFixed(1)}, ${r.y.toFixed(1)})`);
}

Rust

use pdf_oxide::api::Pdf;

let mut pdf = Pdf::open("report.pdf")?;
let results = pdf.search("conclusion")?;
for r in &results {
    println!("Page {}: '{}' at ({:.1}, {:.1})", r.page, r.text, r.bbox.x, r.bbox.y);
}

Java

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.search.SearchMatch;
import java.nio.file.Path;
import java.util.List;

try (PdfDocument doc = PdfDocument.open(Path.of("report.pdf"))) {
    List<SearchMatch> results = doc.search("conclusion", true, false, 0);
    for (SearchMatch m : results) {
        System.out.printf("Page %d: '%s' at (%.1f, %.1f)%n",
            m.pageIndex(), m.text(), m.bbox().x0(), m.bbox().y0());
    }
}

Kotlin

import fyi.oxide.pdf.PdfDocument
import java.nio.file.Path

PdfDocument.open(Path.of("report.pdf")).use { doc ->
    val results = doc.search("conclusion", true, false, 0)
    for (m in results) {
        println("Page ${m.pageIndex()}: '${m.text()}' at (${m.bbox().x0()}, ${m.bbox().y0()})")
    }
}

Scala

import fyi.oxide.pdf.{PdfDocument, searchSeq}
import scala.util.Using

Using.resource(PdfDocument.open("report.pdf")) { doc =>
  val results = doc.searchSeq("conclusion")
  for (m <- results)
    println(f"Page ${m.pageIndex}: '${m.text}' at (${m.bbox.x0}%.1f, ${m.bbox.y0}%.1f)")
}

Clojure

(require '[pdf-oxide.core :as pdf])

(with-open [doc (pdf/open "report.pdf")]
  (doseq [m (pdf/search doc "conclusion")]
    (printf "Page %d: '%s' at (%.1f, %.1f)%n"
            (.pageIndex m) (.text m) (.x0 (.bbox m)) (.y0 (.bbox m)))))

Ruby

require 'pdf_oxide'

PdfOxide::PdfDocument.open('report.pdf') do |doc|
  doc.search('conclusion', case_sensitive: false).each do |r|
    bbox = r[:bbox]
    printf("Page %d: '%s' at (%.1f, %.1f)\n", r[:page], r[:text], bbox[:x], bbox[:y])
  end
end

C++

#include <pdf_oxide/pdf_oxide.hpp>
#include <cstdio>

auto doc = pdf_oxide::Document::open("report.pdf");
auto results = doc.search_all("conclusion", /*case_sensitive=*/false);
for (const auto& r : results) {
    std::printf("Page %d: '%s' at (%.1f, %.1f)\n",
                r.page, r.text.c_str(), r.bbox.x, r.bbox.y);
}

Swift

import PdfOxide

let doc = try Document.open("report.pdf")
let results = try doc.searchAll("conclusion", false)
for r in results {
    print("Page \(r.page): '\(r.text)' at (\(r.bbox.x), \(r.bbox.y))")
}

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('report.pdf');
final results = doc.searchAll('conclusion', false);
for (final r in results) {
  print("Page ${r.page}: '${r.text}' at (${r.bbox.x}, ${r.bbox.y})");
}
doc.close();

R

library(pdfoxide)

doc <- pdf_open("report.pdf")
results <- pdf_search_all(doc, "conclusion", case_sensitive = FALSE)
for (r in results) {
  cat(sprintf("Page %d: '%s' at (%.1f, %.1f)\n",
              r$page, r$text, r$bbox$x, r$bbox$y))
}

Julia

using PdfOxide

doc = open_document("report.pdf")
results = search_all(doc, "conclusion", false)
for r in results
    println("Page $(r.page): '$(r.text)' at ($(r.bbox.x), $(r.bbox.y))")
end

Zig

const std = @import("std");
const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("report.pdf");
const results = try doc.searchAll(a, "conclusion", false);
defer doc.freeSearchResults(a, results);
for (results) |r| {
    std.debug.print("Page {d}: '{s}' at ({d:.1}, {d:.1})\n", .{ r.page, r.text, r.bbox.x, r.bbox.y });
}

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
NSArray<POXSearchResult*> *results = [doc searchAll:@"conclusion" caseSensitive:NO error:&err];
for (POXSearchResult *r in results) {
    NSLog(@"Page %ld: '%@' at (%.1f, %.1f)", (long)r.page, r.text, r.bbox.x, r.bbox.y);
}

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")
{:ok, results} = PdfOxide.search_all(doc, "conclusion", false)

for r <- results do
  IO.puts("Page #{r.page}: '#{r.text}' at (#{r.bbox.x}, #{r.bbox.y})")
end

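API Reference

`TextSearcher::search(doc, pattern, options) -> Vec<SearchResult>`

Searches for text across the pages of a PDF document. Unless literal mode is enabled, the pattern is compiled as a regular expression.

Parameter	Type	Description
`doc`	`&mut PdfDocument`	The PDF document to search
`pattern`	`&str`	Regex pattern (literal text when `literal` is set)
`options`	`&SearchOptions`	Search configuration

Returns: a vector of SearchResult objects sorted by page and then by position.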

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::search::{TextSearcher, SearchOptions};

let mut doc = PdfDocument::open("report.pdf")?;

let options = SearchOptions::new()
    .with_case_insensitive(true)
    .with_max_results(50);

let results = TextSearcher::search(&mut doc, "error|warning", &options)?;
for r in &results {
    println!("Page {}: '{}'", r.page, r.text);
}

`TextSearcher::search_page(doc, page, regex, options) -> Vec<SearchResult>`

Searches a specific page using a precompiled regex.

Parameter	Type	Description
`doc`	`&mut PdfDocument`	The PDF document
`page`	`usize`	0-based page index
`regex`	`&Regex`	Precompiled regex pattern
`options`	`&SearchOptions`	Search configuration

Returns: a vector of SearchResult objects for the given page.

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::search::{TextSearcher, SearchOptions};
use regex::Regex;

let mut doc = PdfDocument::open("report.pdf")?;
let regex = Regex::new(r"\d{4}-\d{2}-\d{2}")?; // Date pattern
let options = SearchOptions::default();

let results = TextSearcher::search_page(&mut doc, 0, &regex, &options)?;
for r in &results {
    println!("Date found: '{}' at ({:.1}, {:.1})", r.text, r.bbox.x, r.bbox.y);
}

SearchOptions

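Configures text search behavior. The builder pattern makes it convenient to set up.

Field	Type	Default	Description
`case_insensitive`	`bool`	`false`	Ignore case when matching
`literal`	`bool`	`false`	Treat the pattern as literal text (regex metacharacters are escaped)
`whole_word`	`bool`	`false`	Match whole words only (wraps the pattern in `\b...\b`)
`max_results`	`usize`	`0`	Maximum number of results to return (0 = unlimited)
`page_range`	`Option<(usize, usize)>`	`None`	Page range to search (start and end both inclusive)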
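
To make the interaction of these flags concrete, the sketch below rebuilds the described behavior with Python's standard `re` module: `literal` escapes metacharacters, `whole_word` wraps the pattern in `\b...\b`, and a `max_results` of 0 means no limit. The helper names `build_pattern` and `find_matches` are hypothetical, and the non-capturing group around the pattern is an assumption of this sketch. It illustrates the semantics in the table and is not PDF Oxide's implementation.

```python
import re


def build_pattern(pattern, case_insensitive=False, literal=False, whole_word=False):
    """Compile a pattern the way the option table describes (illustrative only)."""
    source = re.escape(pattern) if literal else pattern
    if whole_word:
        # Group first so alternations such as "a|b" are wrapped as a whole.
        source = r"\b(?:" + source + r")\b"
    flags = re.IGNORECASE if case_insensitive else 0
    return re.compile(source, flags)


def find_matches(text, pattern, max_results=0, **options):
    """Return matched substrings; max_results=0 means no limit."""
    found = [m.group(0) for m in build_pattern(pattern, **options).finditer(text)]
    return found if max_results == 0 else found[:max_results]


sample = "Cost: 3.50 USD (was 3x50). The cat sat; concatenate. CAT!"
# Regex mode: "." matches any character, so "3x50" is also found.
print(find_matches(sample, "3.50"))
# Literal mode: only the exact text "3.50" is found.
print(find_matches(sample, "3.50", literal=True))
# Whole-word, case-insensitive: "concatenate" is skipped.
print(find_matches(sample, "cat", whole_word=True, case_insensitive=True))
```

Escaping happens before the word-boundary wrap, so a literal query that contains metacharacters can still be combined with whole-word matching.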

Builder methods

let options = SearchOptions::new()
    .with_case_insensitive(true)
    .with_literal(true)
    .with_whole_word(true)
    .with_max_results(100)
    .with_page_range(0, 9);

Convenience constructors

// Quick case-insensitive search
let options = SearchOptions::case_insensitive();

SearchResult

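A single search match, including position information.

Field	Type	Description
`page`	`usize`	Page number (0-based)
`text`	`String`	Matched text
`bbox`	`Rect`	Unified bounding box of the match
`start_index`	`usize`	Start index within the page's extracted text
`end_index`	`usize`	End index within the page's extracted text
`span_boxes`	`Vec<Rect>`	Individual bounding box of each span in the match (useful for matches that wrap across lines)

Python: In the Python API, search results are returned as dictionaries: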

{
    "page": 0,
    "text": "conclusion",
    "x": 72.0,
    "y": 650.5,
    "width": 85.3,
    "height": 12.0,
}
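
Because each result is a plain dictionary, post-processing needs no special types. The sketch below groups matches by page and derives corner coordinates from `x`, `y`, `width`, and `height`. It runs on hand-written sample records shaped like the dictionary above; `to_corners` and `group_by_page` are hypothetical helper names, and the code assumes that `(x, y)` is the corner from which `width` and `height` extend.

```python
from collections import defaultdict


def to_corners(result):
    """Convert x/y/width/height into (x0, y0, x1, y1) corner coordinates."""
    x0, y0 = result["x"], result["y"]
    return (x0, y0, x0 + result["width"], y0 + result["height"])


def group_by_page(results):
    """Map each 0-based page index to the list of matches found on it."""
    pages = defaultdict(list)
    for r in results:
        pages[r["page"]].append(r)
    return dict(pages)


# Hand-written sample records shaped like the dictionary shown above.
results = [
    {"page": 0, "text": "conclusion", "x": 72.0, "y": 650.5, "width": 85.3, "height": 12.0},
    {"page": 2, "text": "Conclusion", "x": 72.0, "y": 700.0, "width": 90.0, "height": 14.0},
    {"page": 2, "text": "conclusion", "x": 100.0, "y": 300.0, "width": 85.0, "height": 12.0},
]

for page, matches in sorted(group_by_page(results).items()):
    print(f"Page {page + 1}: {len(matches)} match(es)")
print(to_corners(results[1]))
```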

Pdf Convenience Methods

The high-level Pdf API provides shortcut methods for common search tasks.

`search(pattern) -> Vec<SearchResult>`

Searches the entire document with default options.

let mut pdf = Pdf::open("report.pdf")?;
let results = pdf.search("important")?;

`search_with_options(pattern, options) -> Vec<SearchResult>`

Searches with custom options.

let options = SearchOptions::case_insensitive()
    .with_whole_word(true)
    .with_page_range(0, 5);
let results = pdf.search_with_options("abstract", options)?;

`search_page(page, pattern) -> Vec<SearchResult>`

Searches a single page with default options.

let results = pdf.search_page(0, r"\d+\.\d+")?; // Find decimal numbers

`highlight_matches(results, color) -> Result<()>`

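Creates highlight annotations for search results. Each result receives a highlight annotation in yellow (or a custom color) on its page.

Parameter	Type	Description
`results`	`&[SearchResult]`	Search results to highlight
`color`	`[f32; 3]`	RGB color (each component 0.0–1.0)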

let mut pdf = Pdf::open("report.pdf")?;
let results = pdf.search("important")?;
pdf.highlight_matches(&results, [1.0, 1.0, 0.0])?; // Yellow
pdf.save("highlighted.pdf")?;
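
Colors are often specified as 8-bit hex strings, while `color` expects components in the 0.0–1.0 range. The conversion is a division by 255, sketched here in Python with a hypothetical helper `hex_to_unit_rgb` that is not part of PDF Oxide.

```python
def hex_to_unit_rgb(hex_color):
    """Convert '#RRGGBB' into three floats in the 0.0-1.0 range."""
    value = hex_color.lstrip("#")
    if len(value) != 6:
        raise ValueError(f"expected 6 hex digits, got {hex_color!r}")
    return [int(value[i:i + 2], 16) / 255 for i in (0, 2, 4)]


print(hex_to_unit_rgb("#FFFF00"))                          # yellow
print([round(c, 2) for c in hex_to_unit_rgb("#99FF99")])   # light green
```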

Python Search API

The Python PdfDocument class exposes search directly.

`doc.search(pattern, ...) -> list[dict]`

doc.search(
    pattern: str,
    case_insensitive: bool = False,
    literal: bool = False,
    whole_word: bool = False,
    max_results: int = 0,
) -> list[dict]

`doc.search_page(page, pattern, ...) -> list[dict]`

doc.search_page(
    page: int,
    pattern: str,
    case_insensitive: bool = False,
    literal: bool = False,
    whole_word: bool = False,
    max_results: int = 0,
) -> list[dict]

JavaScript Search API

The WasmPdfDocument class provides the same search functionality.

`doc.search(pattern, ...) -> Array`

doc.search(pattern, caseInsensitive?, literal?, wholeWord?, maxResults?) -> Array

`doc.searchPage(pageIndex, pattern, ...) -> Array`

doc.searchPage(pageIndex, pattern, caseInsensitive?, literal?, wholeWord?, maxResults?) -> Array

Example:

const doc = new WasmPdfDocument(bytes);

// Search all pages, case-insensitive
const results = doc.search("error|warning", true);
for (const r of results) {
  console.log(`Page ${r.page}: '${r.text}'`);
}

// Search a single page with whole-word matching
const pageResults = doc.searchPage(0, "abstract", true, false, true);
doc.free();

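How do I serialize search results to JSON?

Some bindings provide a bulk serializer that converts a page's list of search results into a JSON array while crossing the FFI boundary only once. Rust serializes the whole list and the binding decodes it, so individual fields never have to be passed across the boundary match by match. The Go and C# SearchPage methods use this approach internally.

The C ABI signature is the authoritative definition: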

char *pdf_oxide_search_results_to_json(
    const FfiSearchResults *results,
    int32_t *error_code);

It takes the opaque result handle returned by pdf_document_search_page(...) and returns a malloc-allocated UTF-8 JSON string (free it with pdf_free_string). Each element contains the match's page, text, and bounding box (x, y, width, height).

Swift. The wrapper bundles search and serialization into a single searchResultsToJson(_:_:caseSensitive:) call:

import PdfOxide

let doc = try PdfDocument(path: "report.pdf")

// Search page 0 for "conclusion" and get the matches as a JSON string
let json = try doc.searchResultsToJson(0, "conclusion", caseSensitive: false)
print(json)
// [{"page":0,"text":"conclusion","x":72.0,"y":650.5,"width":85.3,"height":12.0}, ...]

Go / C#. These bindings call pdf_oxide_search_results_to_json internally and return native records that are already decoded, so you do not need to call the serializer yourself. Use doc.SearchPage(...) (Go: doc.SearchPage(page, text, caseSensitive); C#: doc.SearchPage(pageIndex, text, caseSensitive)) to get strongly typed results. If you need JSON in those languages, serialize the returned records with the standard JSON library (encoding/json / System.Text.Json).

Python / Rust. Python's doc.search(...) / doc.search_page(...) already return native list[dict] records, which you can serialize directly with json.dumps, and Rust returns a typed Vec<SearchResult> that can be serialized with serde_json. Neither case needs the C-ABI serializer.
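
As a minimal sketch of the Python route, the snippet below round-trips records through `json.dumps` and `json.loads` and decodes them into a typed record. The sample data is hand-written in the shape the Python API is documented to return, and the `Match` dataclass is a hypothetical convenience type, not something the library provides.

```python
import json
from dataclasses import dataclass


@dataclass
class Match:
    page: int
    text: str
    x: float
    y: float
    width: float
    height: float


# Sample records shaped like the list[dict] that doc.search(...) is documented to return.
records = [
    {"page": 0, "text": "conclusion", "x": 72.0, "y": 650.5, "width": 85.3, "height": 12.0},
]

payload = json.dumps(records)  # serialize for storage or transport
matches = [Match(**item) for item in json.loads(payload)]  # decode into typed records

print(payload)
print(matches[0])
```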

Advanced Examples

Search and highlight with a custom color

use pdf_oxide::api::Pdf;
use pdf_oxide::search::SearchOptions;

let mut pdf = Pdf::open("contract.pdf")?;

// Find all dollar amounts
let options = SearchOptions::new()
    .with_literal(false); // regex mode
let results = pdf.search_with_options(r"\$[\d,]+\.?\d*", options)?;

println!("Found {} dollar amounts", results.len());
for r in &results {
    println!("  Page {}: {}", r.page + 1, r.text);
}

// Highlight them in green
pdf.highlight_matches(&results, [0.6, 1.0, 0.6])?;
pdf.save("highlighted_amounts.pdf")?;

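Search within a limited page range

from pdf_oxide import PdfDocument

doc = PdfDocument("book.pdf")

# The search() signature above has no page-range parameter, so search the
# whole document and keep only matches on the first 10 pages (indices 0-9).
results = [
    r
    for r in doc.search("introduction", case_insensitive=True, whole_word=True)
    if r["page"] < 10
][:5]  # keep at most 5 matches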

for r in results:
    print(f"Found on page {r['page'] + 1}")

Build a search index across multiple PDFs

use pdf_oxide::PdfDocument;
use pdf_oxide::search::{TextSearcher, SearchOptions};
use std::collections::HashMap;

let files = vec!["paper_a.pdf", "paper_b.pdf", "paper_c.pdf"];
let query = "machine learning";
let options = SearchOptions::case_insensitive();

let mut index: HashMap<String, Vec<(usize, String)>> = HashMap::new();

for file in &files {
    let mut doc = PdfDocument::open(file)?;
    let results = TextSearcher::search(&mut doc, query, &options)?;

    for r in results {
        index.entry(file.to_string())
            .or_default()
            .push((r.page, r.text));
    }
}

for (file, matches) in &index {
    println!("{}: {} matches", file, matches.len());
    for (page, text) in matches {
        println!("  Page {}: '{}'", page + 1, text);
    }
}

Extract context around a match

use pdf_oxide::PdfDocument;
use pdf_oxide::search::{TextSearcher, SearchOptions};

let mut doc = PdfDocument::open("report.pdf")?;
let options = SearchOptions::new().with_case_insensitive(true);
let results = TextSearcher::search(&mut doc, "error", &options)?;

for r in &results {
    // Extract full page text for context
    let page_text = doc.extract_text(r.page)?;

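    // Show roughly 50 bytes before and after the match. The indices are
    // treated as byte offsets here, so snap both ends to UTF-8 character
    // boundaries; slicing a &str inside a multi-byte character would panic.
    let mut start = r.start_index.saturating_sub(50);
    while !page_text.is_char_boundary(start) {
        start -= 1;
    }
    let mut end = (r.end_index + 50).min(page_text.len());
    while !page_text.is_char_boundary(end) {
        end += 1;
    }
    let context = &page_text[start..end];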

    println!("Page {} match: ...{}...", r.page + 1, context.trim());
}

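Frequently Asked Questions

How do I get search results as JSON? In Swift, call doc.searchResultsToJson(page, term, caseSensitive:), which runs the page search and returns a JSON array of matches in one step. In Python and Rust, search returns native records (list[dict] / Vec<SearchResult>), so serialize them with json.dumps / serde_json. Go and C# return strongly typed records that you serialize with encoding/json / System.Text.Json.

What does each JSON match contain? The match's page (0-based), the matched text, and the unified bounding box: x, y, width, and height (in PDF points, with the origin at the bottom left).

Is search regex or literal by default? Patterns are compiled as regular expressions by default. Enabling literal mode (with_literal(true) / literal=True) escapes regex metacharacters so the text is matched verbatim.

Does it support case-insensitive search and whole-word matching? Yes. Set case_insensitive and whole_word on SearchOptions (Rust), pass them as keyword arguments in Python, or pass them as options in the other bindings.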
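
The bounding-box answer above matters when drawing matches on a rendered page image, since most image APIs put the origin at the top left and work in pixels. The sketch below shows one way to do that conversion. The helper `to_pixel_rect` is hypothetical, the page height must be supplied by the caller, and the code assumes that `(x, y)` is the lower-left corner of the box and that the page is unrotated with no crop-box offset.

```python
def to_pixel_rect(match, page_height_pt, dpi=144):
    """Map a bottom-left-origin box in points to a top-left-origin box in pixels."""
    scale = dpi / 72.0  # 72 PDF points per inch
    left = match["x"] * scale
    # Flip the y axis: the box's top edge sits at y + height in PDF space.
    top = (page_height_pt - (match["y"] + match["height"])) * scale
    return (left, top, match["width"] * scale, match["height"] * scale)


match = {"page": 0, "text": "conclusion", "x": 72.0, "y": 650.0, "width": 85.0, "height": 12.0}
# A US Letter page is 792 pt tall.
print(to_pixel_rect(match, page_height_pt=792.0))
```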
