What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Scoped Extraction — Pull Content from a Specific Region

When you’re processing invoices, bank statements, tax forms, or any templated layout, you usually know where the fields live. Instead of extracting the whole page and searching for the value, point PDF Oxide at the exact rectangle and get back just what’s there.

The fluent within(page, rect) API returns a scoped region you can chain extraction methods on: extract_text(), extract_words(), extract_chars(), extract_tables().

Binding coverage. within(page, rect) is available in Python, Rust, and WASM. Go and C# expose lower-level helpers instead: Go wraps ExtractTextInRect, ExtractWordsInRect, and ExtractImagesInRect, while C# currently wraps only ExtractTextInRect (see the Go / C# helper table near the end of this page). The full in-rect family (text, words, lines, tables, images) is exposed end-to-end in Rust, the C ABI, and the Swift wrapper; see “What rect-scoped extraction variants exist?” for which binding has which.

Quick Example

rect is (x, y, width, height) in PDF points, where (x, y) is the rectangle’s lower-left corner and the origin is at the bottom-left of the page. Letter-size pages are 612 × 792 points.
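Some of the bindings shown on this page (Java, Kotlin, Scala, Clojure) take a corner-form box (x0, y0, x1, y1) rather than (x, y, width, height). The conversion is plain arithmetic. The sketch below uses two helper names that are illustrative only and are not part of PDF Oxide:

```python
def rect_to_corners(rect):
    """(x, y, width, height) -> (x0, y0, x1, y1)."""
    x, y, w, h = rect
    return (x, y, x + w, y + h)


def corners_to_rect(box):
    """(x0, y0, x1, y1) -> (x, y, width, height)."""
    x0, y0, x1, y1 = box
    return (x0, y0, x1 - x0, y1 - y0)


header = (0, 700, 612, 92)          # top 92 points of a Letter page
print(rect_to_corners(header))      # (0, 700, 612, 792)
```

Keeping one form as the source of truth and converting at the call site avoids mixing the two conventions inside a template.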

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")

# Top 92 points of page 0 — typical header band
header = doc.within(0, (0, 700, 612, 92)).extract_text()
print(header)

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::geometry::Rect;

let mut doc = PdfDocument::open("invoice.pdf")?;
let header = doc.within(0, Rect::new(0.0, 700.0, 612.0, 92.0)).extract_text()?;
println!("{}", header);

JavaScript (WASM)

import { WasmPdfDocument } from "pdf-oxide-wasm";

const doc = new WasmPdfDocument(bytes);
const headerRegion = doc.within(0, [0, 700, 612, 92]);
console.log(headerRegion.extractText());
doc.free();

Go (lower-level helper, same effect)

package main

import (
    "fmt"
    "log"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("invoice.pdf")
    if err != nil { log.Fatal(err) }
    defer doc.Close()

    // ExtractTextInRect(pageIndex, x, y, width, height)
    header, err := doc.ExtractTextInRect(0, 0, 700, 612, 92)
    if err != nil { log.Fatal(err) }
    fmt.Println(header)
}

C# (lower-level helper)

using PdfOxide;

using var doc = PdfDocument.Open("invoice.pdf");
string header = doc.ExtractTextInRect(0, 0, 700, 612, 92);
Console.WriteLine(header);

Java (page.text(region); BBox is corner-form (x0, y0, x1, y1))

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.geometry.BBox;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("invoice.pdf"))) {
    // Top 92 points of page 0 → corners (0, 700) … (612, 792)
    String header = doc.page(0).text(new BBox(0, 700, 612, 792));
    System.out.println(header);
}

Kotlin

import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.geometry.BBox

PdfDocument.open(java.nio.file.Path.of("invoice.pdf")).use { doc ->
    val header = doc.page(0).text(BBox(0.0, 700.0, 612.0, 792.0))
    println(header)
}

Scala

import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.geometry.BBox
import scala.util.Using

Using.resource(PdfDocument.open("invoice.pdf")) { doc =>
  val header = doc.page(0).text(BBox(0, 700, 612, 792))
  println(header)
}

Clojure

(require '[pdf-oxide.core :as pdf])
(import '[fyi.oxide.pdf.geometry BBox])

(with-open [doc (pdf/open "invoice.pdf")]
  ;; Top 92 points of page 0 → corners (0 700) … (612 792)
  (println (pdf/page-text (pdf/page doc 0) (BBox. 0 700 612 792))))

C++

#include <iostream>
#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("invoice.pdf");
// extract_text_in_rect(page, x, y, w, h)
auto header = doc.extract_text_in_rect(0, 0, 700, 612, 92);
std::cout << header << "\n";

Swift

import PdfOxide

let doc = try Document.open("invoice.pdf")
let header = try doc.extractTextInRect(0, x: 0, y: 700, w: 612, h: 92)
print(header)

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('invoice.pdf');
final header = doc.extractTextInRect(0, 0, 700, 612, 92);
print(header);
doc.close();

R

library(pdfoxide)

doc <- pdf_open("invoice.pdf")
# pdf_extract_text_in_rect(doc, page, x, y, width, height)
header <- pdf_extract_text_in_rect(doc, 0, 0, 700, 612, 92)
cat(header)

Julia

using PdfOxide

doc = open_document("invoice.pdf")
header = extract_text_in_rect(doc, 0, 0, 700, 612, 92)
println(header)

Zig

const std = @import("std");
const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("invoice.pdf");
const header = try doc.extractTextInRect(a, 0, 0, 700, 612, 92);  // free header
std.debug.print("{s}\n", .{header});

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"invoice.pdf" error:&err];
NSString *header = [doc extractTextInRect:0 x:0 y:700 w:612 h:92 error:&err];
NSLog(@"%@", header);

Elixir

{:ok, doc} = PdfOxide.open("invoice.pdf")
# extract_text_in_rect(doc, page, x, y, w, h)
{:ok, header} = PdfOxide.extract_text_in_rect(doc, 0, 0, 700, 612, 92)
IO.puts(header)

Chained extraction from a region

The within() fluent form in Python / Rust / WASM lets you call any extraction method on the same scoped region without re-specifying the rect:

Python

doc = PdfDocument("invoice.pdf")
region = doc.within(0, (400, 100, 200, 200))   # bottom-right 200×200 box

total_text = region.extract_text()              # plain text
words      = region.extract_words()             # word-level records
chars      = region.extract_chars()             # character-level records

Rust

let region = doc.within(0, Rect::new(400.0, 100.0, 200.0, 200.0));
let text  = region.extract_text()?;
let words = region.extract_words()?;

C++ (no fluent chain — call each in-rect helper against the same rect)

// bottom-right 200×200 box: x=400, y=100, w=200, h=200
auto text  = doc.extract_text_in_rect(0, 400, 100, 200, 200);
auto words = doc.extract_words_in_rect(0, 400, 100, 200, 200);
auto lines = doc.extract_lines_in_rect(0, 400, 100, 200, 200);

Swift

let text  = try doc.extractTextInRect(0, x: 400, y: 100, w: 200, h: 200)
let words = try doc.extractWordsInRect(0, x: 400, y: 100, w: 200, h: 200)

Dart

final text  = doc.extractTextInRect(0, 400, 100, 200, 200);
final words = doc.extractWordsInRect(0, 400, 100, 200, 200);

R

text  <- pdf_extract_text_in_rect(doc, 0, 400, 100, 200, 200)
words <- pdf_extract_words_in_rect(doc, 0, 400, 100, 200, 200)

Julia

text  = extract_text_in_rect(doc, 0, 400, 100, 200, 200)
words = extract_words_in_rect(doc, 0, 400, 100, 200, 200)

Zig

const text  = try doc.extractTextInRect(a, 0, 400, 100, 200, 200);
const words = try doc.extractWordsInRect(a, 0, 400, 100, 200, 200);  // freeWords

Objective-C

NSString *text = [doc extractTextInRect:0 x:400 y:100 w:200 h:200 error:&err];
NSArray<POXWord*> *words = [doc extractWordsInRect:0 x:400 y:100 w:200 h:200 error:&err];

Elixir

{:ok, text}  = PdfOxide.extract_text_in_rect(doc, 0, 400, 100, 200, 200)
{:ok, words} = PdfOxide.extract_words_in_rect(doc, 0, 400, 100, 200, 200)

Common use cases

Invoice field extraction

An invoice usually has the vendor address, invoice number, and line-item table in fixed zones. Define the rects once per template:

from pdf_oxide import PdfDocument

TEMPLATES = {
    "acme_v1": {
        "invoice_no":  (450, 720,  120,  20),
        "issue_date":  (450, 700,  120,  20),
        "vendor_name": ( 50, 740,  300,  40),
        "total":       (450, 100,  120,  24),
    },
}

def parse_invoice(path, template):
    doc = PdfDocument(path)
    out = {}
    for field, rect in template.items():
        out[field] = doc.within(0, rect).extract_text().strip()
    return out

print(parse_invoice("invoice-2025-04.pdf", TEMPLATES["acme_v1"]))
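A template rect that is mistyped or copied from a different page size may return an empty string without any obvious error. One way to catch this early is to check every rect against the page bounds before extracting. This is a minimal sketch: out_of_bounds and LETTER are hypothetical names, and it assumes the page's MediaBox starts at (0, 0).

```python
LETTER = (612, 792)  # page width and height in points


def out_of_bounds(template, page_size=LETTER):
    """Return the names of fields whose rect leaves the page or has no area."""
    page_w, page_h = page_size
    bad = []
    for field, (x, y, w, h) in template.items():
        inside = x >= 0 and y >= 0 and x + w <= page_w and y + h <= page_h
        if w <= 0 or h <= 0 or not inside:
            bad.append(field)
    return bad


template = {
    "invoice_no": (450, 720, 120, 20),
    "total": (450, 100, 120, 24),
    "stamp": (560, 760, 100, 40),   # deliberately runs off the right edge
}
print(out_of_bounds(template))      # ['stamp']
```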

Bank statement line items

Most statements have a narrow “transactions” band. Crop to that band and call extract_words() to get each word in reading order with its bbox:

doc = PdfDocument("statement.pdf")
for page in range(doc.page_count()):
    txn_region = doc.within(page, (36, 72, 540, 650))   # skip header + footer
    for w in txn_region.extract_words():
        print(f"page {page}: {w.text} at ({w.x0:.0f},{w.y0:.0f})")
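Word-level output usually needs one more step before it looks like statement rows: words that share roughly the same vertical position belong to the same transaction. The sketch below groups words into rows using a tolerance on y0. The Word dataclass is a stand-in with only the text, x0, and y0 attributes used in the loop above, and the tolerance value is an assumption you would tune per template.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Stand-in for a word record; only text, x0, and y0 are used."""
    text: str
    x0: float
    y0: float


def group_into_rows(words, y_tolerance=3.0):
    """Cluster words whose y0 lies within y_tolerance of the row's first word."""
    rows = []  # each entry is [anchor_y, [words]]
    # Higher y is nearer the top of the page, so sort descending.
    for word in sorted(words, key=lambda w: -w.y0):
        if rows and abs(rows[-1][0] - word.y0) <= y_tolerance:
            rows[-1][1].append(word)
        else:
            rows.append([word.y0, [word]])
    # Within a row, read left to right.
    return [" ".join(w.text for w in sorted(ws, key=lambda w: w.x0)) for _, ws in rows]


words = [
    Word("42.10", 480, 600.5), Word("Coffee", 120, 600), Word("04/02", 36, 601),
    Word("04/03", 36, 586), Word("Rent", 120, 586), Word("1200.00", 470, 585.5),
]
for row in group_into_rows(words):
    print(row)
```

With the sample data this prints one line per transaction, top row first.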

Header / footer stripping

If you’re indexing the body content only, crop away the top and bottom of every page:

Rust

let mut doc = PdfDocument::open("book.pdf")?;
for i in 0..doc.page_count()? {
    let body = doc.within(i, Rect::new(0.0, 100.0, 612.0, 600.0))
                  .extract_text()?;
    // index `body` …
}
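The body rect in the example above follows from the page size and the two bands being removed. A small helper makes that arithmetic explicit and reusable for other page sizes. body_rect is a hypothetical name, and the sketch assumes the page's lower-left corner is at (0, 0).

```python
def body_rect(page_width, page_height, top_margin, bottom_margin):
    """Return (x, y, width, height) for the page minus top and bottom bands."""
    height = page_height - top_margin - bottom_margin
    if height <= 0:
        raise ValueError("margins leave no body area")
    # Bottom-left origin: the body starts just above the footer band.
    return (0, bottom_margin, page_width, height)


# Letter page, 92 pt header band, 100 pt footer band
print(body_rect(612, 792, top_margin=92, bottom_margin=100))   # (0, 100, 612, 600)
```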

Table region detection

When you already know a page contains a table and where, scope to the table rect and let extract_tables() focus only on that region:

Python

tables = doc.within(0, (50, 200, 500, 400)).extract_tables()
for t in tables:
    for row in t["rows"]:
        print([c["text"] for c in row["cells"]])

What rect-scoped extraction variants exist?

Beyond extract_text(), extract_words(), and extract_chars(), two more rect-scoped variants give you geometry-aware results from a single rectangle: lines in rect and tables in rect. Both filter a full-page extraction down to the regions whose bounding box intersects your rectangle, so the coordinates and reading order you get back are the same as a full-page call — just cropped.
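To make the intersection rule concrete, here is a minimal sketch of the two common box tests. It assumes both boxes are (x, y, width, height) tuples and treats boxes that only touch along an edge as not intersecting. How PDF Oxide handles that edge case, and the exact shape of its bbox records, are not stated here, so treat the function names and conventions as illustrative.

```python
def intersects(bbox, rect):
    """True when two (x, y, width, height) boxes share any area."""
    ax, ay, aw, ah = bbox
    bx, by, bw, bh = rect
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def contained(bbox, rect):
    """True when bbox lies entirely inside rect."""
    ax, ay, aw, ah = bbox
    bx, by, bw, bh = rect
    return ax >= bx and ay >= by and ax + aw <= bx + bw and ay + ah <= by + bh


rect = (50, 200, 500, 400)
straddling = (40, 580, 200, 40)   # crosses the top-left corner of rect
print(intersects(straddling, rect), contained(straddling, rect))   # True False
```

An intersection filter keeps a record like straddling, while a containment filter drops it. That difference matters when a rect is drawn tightly around a table or line.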

Extract text lines in a region (`extract_lines_in_rect`)

Returns the line-level records (each with its text, bounding box, and word count) whose bounding box intersects the rectangle. Use it when you need whole lines in reading order rather than individual words, for example address blocks, multi-line totals, or a single statement row.

The C ABI signature is authoritative:

FfiTextLineList *pdf_document_extract_lines_in_rect(
    PdfDocument *handle,
    int32_t page_index,
    float x, float y, float w, float h,
    int32_t *error_code);

Rust — extract_lines_in_rect(page_index, region) -> Result<Vec<PathContent>> on PdfDocument:

use pdf_oxide::PdfDocument;
use pdf_oxide::geometry::Rect;

let doc = PdfDocument::open("statement.pdf")?;

// Transactions band: skip the header (top 92pt) and footer (bottom 72pt)
let region = Rect::new(36.0, 72.0, 540.0, 628.0);
let lines = doc.extract_lines_in_rect(0, region)?;
for line in &lines {
    println!("{:?}", line.bbox);
}

Python — the fluent region exposes lines via extract_text_lines():

from pdf_oxide import PdfDocument

doc = PdfDocument("statement.pdf")

# Same band as the Rust example above
region = doc.within(0, (36, 72, 540, 628))
for line in region.extract_text_lines():
    print(line.text, line.bbox)

Swift — extractLinesInRect(_:x:y:w:h:) returns [TextLine]:

import PdfOxide

let doc = try PdfDocument(path: "statement.pdf")
let lines = try doc.extractLinesInRect(0, x: 36, y: 72, w: 540, h: 628)
for line in lines {
    print(line.text, line.bbox, line.wordCount)
}

C++ — extract_lines_in_rect(page, x, y, w, h) returns std::vector<TextLine>:

auto lines = doc.extract_lines_in_rect(0, 36, 72, 540, 628);
for (const auto& line : lines) {
    std::cout << line.text << "\n";
}

Dart — extractLinesInRect(page, x, y, w, h) returns List<TextLine>:

final lines = doc.extractLinesInRect(0, 36, 72, 540, 628);
for (final line in lines) {
    print('${line.text} ${line.bbox}');
}

R — pdf_extract_lines_in_rect(doc, page, x, y, width, height):

lines <- pdf_extract_lines_in_rect(doc, 0, 36, 72, 540, 628)

Julia — extract_lines_in_rect(doc, page, x, y, w, h):

lines = extract_lines_in_rect(doc, 0, 36, 72, 540, 628)
for line in lines
    println(line.text, " ", line.bbox)
end

Zig — extractLinesInRect(allocator, page, x, y, w, h):

const lines = try doc.extractLinesInRect(a, 0, 36, 72, 540, 628);  // freeTextLines

Objective-C — extractLinesInRect:x:y:w:h: returns NSArray<POXTextLine*>:

NSArray<POXTextLine*> *lines = [doc extractLinesInRect:0 x:36 y:72 w:540 h:628 error:&err];

Elixir — extract_lines_in_rect(doc, page, x, y, w, h):

{:ok, lines} = PdfOxide.extract_lines_in_rect(doc, 0, 36, 72, 540, 628)

Go / C#. The extract_lines_in_rect C entry point exists, but the Go and C# wrappers do not yet surface it. In those languages, extract lines for the whole page and filter on the returned bounding boxes, or use ExtractWordsInRect (Go) and group words into lines yourself.

Extract tables in a region (`extract_tables_in_rect`)

Scopes table detection to a single rectangle — only tables whose bounding box intersects the rect come back. This is the geometry-aware counterpart to the fluent within(...).extract_tables() shown above.

C ABI signature:

FfiTableList *pdf_document_extract_tables_in_rect(
    PdfDocument *handle,
    int32_t page_index,
    float x, float y, float w, float h,
    int32_t *error_code);

Rust — extract_tables_in_rect(page_index, region) -> Result<Vec<Table>> (a ..._with_config variant takes a custom TableDetectionConfig):

use pdf_oxide::PdfDocument;
use pdf_oxide::geometry::Rect;

let doc = PdfDocument::open("invoice.pdf")?;
let region = Rect::new(50.0, 200.0, 500.0, 400.0);
let tables = doc.extract_tables_in_rect(0, region)?;
for table in &tables {
    println!("{} rows × {} cols", table.rows.len(), table.col_count);
}

Python — via the fluent region:

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")
tables = doc.within(0, (50, 200, 500, 400)).extract_tables()
for t in tables:
    for row in t["rows"]:
        print([c["text"] for c in row["cells"]])

Swift — extractTablesInRect(_:x:y:w:h:) returns [Table]:

let tables = try doc.extractTablesInRect(0, x: 50, y: 200, w: 500, h: 400)
for table in tables {
    print("\(table.rowCount) rows, header: \(table.hasHeader)")
}

C++ — extract_tables_in_rect(page, x, y, w, h) returns std::vector<Table>:

auto tables = doc.extract_tables_in_rect(0, 50, 200, 500, 400);
for (const auto& table : tables) {
    std::cout << table.rows.size() << " rows\n";
}

Dart — extractTablesInRect(page, x, y, w, h) returns List<Table>:

final tables = doc.extractTablesInRect(0, 50, 200, 500, 400);
for (final table in tables) {
    print('${table.rows.length} rows');
}

R — pdf_extract_tables_in_rect(doc, page, x, y, width, height):

tables <- pdf_extract_tables_in_rect(doc, 0, 50, 200, 500, 400)

Julia — extract_tables_in_rect(doc, page, x, y, w, h):

tables = extract_tables_in_rect(doc, 0, 50, 200, 500, 400)

Zig — extractTablesInRect(allocator, page, x, y, w, h):

const tables = try doc.extractTablesInRect(a, 0, 50, 200, 500, 400);

Objective-C — extractTablesInRect:x:y:w:h: returns NSArray<POXTable*>:

NSArray<POXTable*> *tables = [doc extractTablesInRect:0 x:50 y:200 w:500 h:400 error:&err];

Elixir — extract_tables_in_rect(doc, page, x, y, w, h):

{:ok, tables} = PdfOxide.extract_tables_in_rect(doc, 0, 50, 200, 500, 400)

Go / C#. As with lines, the extract_tables_in_rect C entry point exists but is not yet wrapped in Go or C#. Call ExtractTables(page) for the whole page and keep the tables whose bbox falls inside your rect.

How do I auto-extract a page without choosing text vs OCR?

When you don’t know whether a page is born-digital text, a scan, or a mix, extract_page_auto does the routing for you. It runs the AutoExtractor, which routes each region to native text or OCR and falls back gracefully to the native text layer, so it never surfaces an opaque OCR error. The result is a JSON PageExtraction with a page kind, the assembled reading-ordered text, a confidence, a typed reason, an ocr_used flag, and a regions[] array. Every region carries a bbox, kind, text, confidence, source, and reason. The bbox and reason are present even when a region’s text is empty, so reading order is never silently corrupted.

It is {}-tolerant: pass an empty/null options JSON for the defaults, or supply an AutoExtractOptions object. The recognized fields (serialized snake_case) are:

| Field | Type | Default | Meaning |
| --- | --- | --- | --- |
| `mode` | `"text_only"` \| `"auto"` \| `"force_ocr"` | `"auto"` | Text-vs-OCR routing strategy |
| `reconstruct_image_tables` | bool | `true` | Rebuild image-only tables via the spatial detector over OCR spans |
| `emit_placeholders` | bool | `true` | Emit positioned Figure/Table placeholders in the text flow |
| `ocr_languages` | string[] | `[]` | OCR language hints (e.g. `["english","chinese"]`) |
| `min_text_confidence` | float \| null | `null` | Auto-decision confidence threshold |
| `table_confidence` | float \| null | `null` | Image-table reconstruction threshold |
| `force_ocr_pages` | int[] | `[]` | 0-based page indices to force OCR on |
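Because the options travel as a JSON string, a misspelled field name is easy to miss. The sketch below builds the string from keyword arguments and rejects anything not listed in the table. build_auto_options is a hypothetical helper, not a PDF Oxide function, and whether the library itself rejects or ignores unknown fields is not stated here.

```python
import json

ALLOWED_MODES = {"text_only", "auto", "force_ocr"}
KNOWN_FIELDS = {
    "mode", "reconstruct_image_tables", "emit_placeholders", "ocr_languages",
    "min_text_confidence", "table_confidence", "force_ocr_pages",
}


def build_auto_options(**fields):
    """Serialize options to JSON, rejecting names or modes not in the table."""
    unknown = set(fields) - KNOWN_FIELDS
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    if "mode" in fields and fields["mode"] not in ALLOWED_MODES:
        raise ValueError(f"unknown mode: {fields['mode']!r}")
    return json.dumps(fields, sort_keys=True)


opts = build_auto_options(mode="force_ocr", ocr_languages=["english"], force_ocr_pages=[0, 2])
print(opts)
```

The resulting string can be passed wherever a binding accepts an options JSON argument.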

OCR feature gate. OCR only actually runs when the library is built with the ocr feature; otherwise extract_page_auto falls back to the native text layer (never an error). The auto entry point is exposed in Python, Go, C#, Swift, WASM, and the C ABI. In Rust it is the library-level AutoExtractor API rather than a one-line PdfDocument method — see below.

Python — extract_page_auto(page, options_json=None) -> str (JSON):

import json
from pdf_oxide import PdfDocument

doc = PdfDocument("mixed-scan.pdf")

# Defaults (balanced preset)
page = json.loads(doc.extract_page_auto(0))
print(page["kind"], page["confidence"], page["ocr_used"])
for region in page["regions"]:
    print(region["kind"], region["bbox"], region["reason"])

# With options
opts = json.dumps({"mode": "auto", "reconstruct_image_tables": True,
                   "ocr_languages": ["english"]})
page = json.loads(doc.extract_page_auto(0, opts))
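Once the envelope is parsed, a common follow-up is to flag regions that deserve a second look: those with empty text or low confidence. The sketch below runs on a hand-written sample so it works without a PDF. The field names follow the description above, but every value (the kind and source strings, the bbox shape, the reasons) is invented for illustration, and needs_review is a hypothetical helper.

```python
import json

# Hand-written sample; field names follow the description, values are invented.
raw = json.dumps({
    "kind": "mixed", "text": "Invoice 1042\nTotal 99.00", "confidence": 0.91,
    "reason": "sample", "ocr_used": True,
    "regions": [
        {"bbox": [36, 700, 540, 60], "kind": "text", "text": "Invoice 1042",
         "confidence": 0.99, "source": "native", "reason": "sample"},
        {"bbox": [36, 300, 540, 200], "kind": "table", "text": "",
         "confidence": 0.42, "source": "ocr", "reason": "sample"},
        {"bbox": [36, 100, 540, 40], "kind": "text", "text": "Total 99.00",
         "confidence": 0.88, "source": "ocr", "reason": "sample"},
    ],
})


def needs_review(page, min_confidence=0.6):
    """Return regions that are empty or below the confidence floor."""
    return [r for r in page["regions"]
            if not r["text"].strip() or r["confidence"] < min_confidence]


page = json.loads(raw)
for region in needs_review(page):
    print(region["kind"], region["bbox"], region["confidence"])
```

With a real document, replace raw with the string returned by doc.extract_page_auto(0).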

Go — ExtractPageAuto(pageIndex, opts ...AutoOption) (string, error) (returns JSON; configure via functional options):

package main

import (
    "encoding/json"
    "fmt"
    "log"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("mixed-scan.pdf")
    if err != nil { log.Fatal(err) }
    defer doc.Close()

    raw, err := doc.ExtractPageAuto(0)
    if err != nil { log.Fatal(err) }

    var page map[string]any
    if err := json.Unmarshal([]byte(raw), &page); err != nil { log.Fatal(err) }
    fmt.Println(page["kind"], page["confidence"], page["ocr_used"])
}

C# — ExtractPageAuto(int pageIndex, string? optionsJson = null) -> string (JSON):

using System.Text.Json;
using PdfOxide.Core;

using var doc = PdfDocument.Open("mixed-scan.pdf");

// Defaults
string json = doc.ExtractPageAuto(0);
using var page = JsonDocument.Parse(json);
Console.WriteLine(page.RootElement.GetProperty("kind"));

// With options
string opts = """{"mode":"auto","ocr_languages":["english"]}""";
string json2 = doc.ExtractPageAuto(0, opts);

Swift — extractPageAuto(_:optionsJson:) -> String (defaults to "{}"):

let json = try doc.extractPageAuto(0, optionsJson: "{}")

JavaScript (WASM) — extractPageAuto(pageIndex, optionsJson?):

import { WasmPdfDocument } from "pdf-oxide-wasm";

const doc = new WasmPdfDocument(bytes);
const page = JSON.parse(doc.extractPageAuto(0));
console.log(page.kind, page.confidence, page.ocr_used);
doc.free();

Rust — the auto path is the AutoExtractor library API. Build AutoExtractOptions (presets fast(), balanced(), high_fidelity(), or the fluent builder) and call extract_page, which returns a typed PageExtraction (no JSON round-trip):

use pdf_oxide::PdfDocument;
use pdf_oxide::extractors::auto::{AutoExtractor, AutoExtractOptions, ExtractMode};

let doc = PdfDocument::open("mixed-scan.pdf")?;

// Default (balanced) preset
let page = AutoExtractor::new().extract_page(&doc, 0)?;
println!("{:?} conf={} ocr={}", page.kind, page.confidence, page.ocr_used);

// Custom options via the builder
let opts = AutoExtractOptions::builder()
    .mode(ExtractMode::Auto)
    .reconstruct_image_tables(true)
    .ocr_languages(["english"])
    .build();
let page = AutoExtractor::with(opts).extract_page(&doc, 0)?;
for region in &page.regions {
    println!("{:?} {:?} {:?}", region.kind, region.bbox, region.reason);
}

C++ — extract_page_auto(page, options_json = "") returns the JSON envelope:

#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("mixed-scan.pdf");
auto json = doc.extract_page_auto(0);                                    // defaults
auto json2 = doc.extract_page_auto(0, R"({"mode":"auto","ocr_languages":["english"]})");

Dart — extractPageAuto(page, [optionsJson]) returns the JSON envelope:

import 'dart:convert';
import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('mixed-scan.pdf');
final page = jsonDecode(doc.extractPageAuto(0));
print('${page["kind"]} ${page["confidence"]} ${page["ocr_used"]}');
doc.close();

R — pdf_extract_page_auto(doc, page, options_json = NULL) returns JSON:

library(jsonlite)

doc  <- pdf_open("mixed-scan.pdf")
page <- fromJSON(pdf_extract_page_auto(doc, 0))
cat(page$kind, page$confidence, page$ocr_used, "\n")

Julia — extract_page_auto(doc, page, options = "{}") returns JSON:

using PdfOxide, JSON

doc  = open_document("mixed-scan.pdf")
page = JSON.parse(extract_page_auto(doc, 0))
println(page["kind"], " ", page["confidence"], " ", page["ocr_used"])

Zig — extractPageAuto(allocator, page, options_json) returns JSON bytes:

const json = try doc.extractPageAuto(a, 0, null);  // free json

Objective-C — extractPageAuto:optionsJson:error: returns the JSON envelope:

NSString *json = [doc extractPageAuto:0 optionsJson:@"{}" error:&err];

Elixir — extract_page_auto(doc, page, options_json \\ "") returns JSON:

{:ok, json} = PdfOxide.extract_page_auto(doc, 0)
page = Jason.decode!(json)
IO.inspect({page["kind"], page["confidence"], page["ocr_used"]})

Java — the auto path is the AutoExtractor API (extractPage → typed result; extractTextForPage for plain text):

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.AutoExtractor;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("mixed-scan.pdf"))) {
    AutoExtractor ax = AutoExtractor.of(doc);             // or .fast/.balanced/.highFidelity
    String text = ax.extractTextForPage(0);               // graceful native/OCR routing
    System.out.println(text);
}

Kotlin

import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.AutoExtractor

PdfDocument.open(java.nio.file.Path.of("mixed-scan.pdf")).use { doc ->
    val ax = AutoExtractor.of(doc)
    println(ax.extractTextForPage(0))
}

Scala

import fyi.oxide.pdf.{PdfDocument, AutoExtractor}
import scala.util.Using

Using.resource(PdfDocument.open("mixed-scan.pdf")) { doc =>
  val ax = AutoExtractor.of(doc)
  println(ax.extractTextForPage(0))
}

PHP — the rich JSON envelope is reachable via AutoExtractor::extractPageJson:

use PdfOxide\PdfDocument;
use PdfOxide\AutoExtractor;

$doc = PdfDocument::open('mixed-scan.pdf');
$ax  = AutoExtractor::balanced($doc);
$page = json_decode($ax->extractPageJson(0), true);
echo $page['kind'], ' ', $page['confidence'], ' ', $page['ocr_used'];

Ruby — auto_extractor.extract_page(page) returns the parsed envelope merged into a Hash:

require 'pdf_oxide'

PdfOxide::PdfDocument.open('mixed-scan.pdf') do |doc|
  result = doc.auto_extractor.extract_page(0)
  cls = result[:classification]            # full PageExtraction JSON as a Hash
  puts [cls['kind'], cls['confidence'], cls['ocr_used']].join(' ')
end

How do I get structured typed regions as JSON?

For a whole-page structured view — headings, body blocks, headers/footers, page numbers, and column order — use the structured extraction entry point. It returns a StructuredPage: page_index, page_width, page_height, and a regions[] array where each region carries a kind (semantic role), text, bbox, spans, and a column_index (for multi-column reading order). Region kinds include body blocks, structural headings (H1–H6), marginal labels, running headers/footers, page numbers, and artifacts.

Most bindings return this as a JSON string (the C ABI serializes it once and bindings deserialize into native types); Rust returns the typed StructuredPage directly.

C ABI signature:

char *pdf_document_extract_structured_to_json(
    PdfDocument *handle,
    int32_t page_index,
    int32_t *error_code);

Python — extract_structured(page) -> str (JSON; deserialize with json.loads):

import json
from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
page = json.loads(doc.extract_structured(0))

print(page["page_width"], page["page_height"])
for region in page["regions"]:
    print(region["kind"], region["column_index"], region["text"][:60])
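A typical use of column_index is to reassemble multi-column text one column at a time while dropping marginal regions such as page numbers. The sketch below works on an invented sample: the kind strings are placeholders (the real values may differ), spans are omitted for brevity, and it assumes every region has an integer column_index. text_by_column is a hypothetical helper.

```python
import json
from collections import defaultdict

# Invented sample; real "kind" strings may differ from these placeholders.
raw = json.dumps({
    "page_index": 0, "page_width": 612, "page_height": 792,
    "regions": [
        {"kind": "heading", "text": "Results", "bbox": [36, 700, 250, 20], "column_index": 0},
        {"kind": "body", "text": "Left column text.", "bbox": [36, 400, 250, 280], "column_index": 0},
        {"kind": "body", "text": "Right column text.", "bbox": [320, 400, 250, 300], "column_index": 1},
        {"kind": "page_number", "text": "7", "bbox": [300, 30, 12, 10], "column_index": 0},
    ],
})


def text_by_column(page, skip_kinds=("page_number",)):
    """Join region text per column, keeping the order regions arrive in."""
    columns = defaultdict(list)
    for region in page["regions"]:
        if region["kind"] not in skip_kinds:
            columns[region["column_index"]].append(region["text"])
    return {col: "\n".join(parts) for col, parts in sorted(columns.items())}


print(text_by_column(json.loads(raw)))
```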

Go — ExtractStructured(page) (string, error):

raw, err := doc.ExtractStructured(0)
if err != nil { log.Fatal(err) }

var page map[string]any
if err := json.Unmarshal([]byte(raw), &page); err != nil { log.Fatal(err) }
for _, r := range page["regions"].([]any) {
    region := r.(map[string]any)
    fmt.Println(region["kind"], region["text"])
}

C# — ExtractStructured(int page) -> string:

using System.Text.Json;

string json = doc.ExtractStructured(0);
using var page = JsonDocument.Parse(json);
foreach (var region in page.RootElement.GetProperty("regions").EnumerateArray())
{
    Console.WriteLine(region.GetProperty("kind"));
}

Swift — extractStructuredJson(_:) -> String:

let json = try doc.extractStructuredJson(0)

JavaScript (WASM) — extractStructured(pageIndex) (returns a JSON string with camelCase keys):

const page = JSON.parse(doc.extractStructured(0));
for (const region of page.regions) {
    console.log(region.kind, region.columnIndex);
}

Rust — extract_structured(page_index) -> Result<StructuredPage> returns typed regions directly (no JSON round-trip). An extract_structured_with_column_mode variant lets you force ColumnMode::Two/Single for stubborn layouts:

use pdf_oxide::PdfDocument;

let doc = PdfDocument::open("report.pdf")?;
let page = doc.extract_structured(0)?;
for region in &page.regions {
    println!("{:?} col={:?}: {}", region.kind, region.column_index, region.text);
}

C++ — extract_structured_json(page) returns the JSON string:

auto json = doc.extract_structured_json(0);

Dart — extractStructuredJson(page) returns the JSON string:

import 'dart:convert';

final page = jsonDecode(doc.extractStructuredJson(0));
for (final region in page['regions']) {
    print('${region["kind"]} ${region["column_index"]}');
}

R — pdf_extract_structured_json(doc, page) returns JSON:

library(jsonlite)

page <- fromJSON(pdf_extract_structured_json(doc, 0))
print(page$page_width)

Julia — extract_structured_json(doc, page) returns JSON:

using JSON
page = JSON.parse(extract_structured_json(doc, 0))
for region in page["regions"]
    println(region["kind"], " ", region["column_index"])
end

Zig — extractStructuredJson(allocator, page) returns JSON bytes:

const json = try doc.extractStructuredJson(a, 0);  // free json

Objective-C — extractStructuredJson:error: returns the JSON string:

NSString *json = [doc extractStructuredJson:0 error:&err];

Elixir — extract_structured_json(doc, page) returns JSON:

{:ok, json} = PdfOxide.extract_structured_json(doc, 0)
page = Jason.decode!(json)

Java — extractStructured(page) returns the JSON string:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

String json = doc.extractStructured(0);
JsonNode page = new ObjectMapper().readTree(json);
for (JsonNode region : page.get("regions")) {
    System.out.println(region.get("kind").asText());
}

Kotlin

val json = doc.extractStructured(0)   // JSON string; parse with your library of choice

Scala

val json = doc.extractStructured(0)   // JSON string

Clojure — (pdf/extract-structured doc page) returns the JSON string:

(require '[clojure.data.json :as json])

(with-open [doc (pdf/open "report.pdf")]
  (let [page (json/read-str (pdf/extract-structured doc 0))]
    (doseq [region (get page "regions")]
      (println (get region "kind") (get region "column_index")))))

Ruby — extract_structured(page) returns the parsed StructuredPage Hash:

PdfOxide::PdfDocument.open('report.pdf') do |doc|
  page = doc.extract_structured(0)
  page['regions'].each { |r| puts "#{r['kind']} #{r['column_index']}" }
end

PHP — extractStructured($page) returns the deserialized associative array:

$doc = PdfOxide\PdfDocument::open('report.pdf');
$page = $doc->extractStructured(0);
foreach ($page['regions'] as $region) {
    echo $region['kind'], ' ', $region['column_index'], "\n";
}

Coordinate reference

PDF uses a bottom-left origin, measured in points (1 pt = 1/72 inch). A Letter-size page is (0, 0, 612, 792). To target the top 1-inch band, write:

(x, y, w, h) = (0, 792 - 72, 612, 72)
             = (0, 720,      612, 72)

If you’re coming from an image-coordinate world (top-left origin), flip y: a box whose top edge sits y_top points below the top of the page has PDF y = page_height - y_top - height.
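The flip can be sketched in a few lines. Boxes measured on a rendered image are usually in pixels, so this example also scales to points first (72 points per inch). Both function names are illustrative and not part of PDF Oxide.

```python
def pixels_to_points(rect, dpi):
    """Scale a pixel-space (x, y, width, height) box to points."""
    return tuple(v * 72 / dpi for v in rect)


def top_left_to_pdf(rect, page_height):
    """Convert (x, y_top, width, height) with a top-left origin to a PDF rect."""
    x, y_top, w, h = rect
    return (x, page_height - y_top - h, w, h)


# Top 1-inch band selected on a 144 DPI rendering of a Letter page
band_px = (0, 0, 1224, 144)
band_pt = pixels_to_points(band_px, dpi=144)
print(top_left_to_pdf(band_pt, page_height=792))   # (0.0, 720.0, 612.0, 72.0)
```

The result matches the (0, 720, 612, 72) band computed by hand above.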

To get the actual MediaBox of a page before calculating:

Python

doc = PdfDocument("doc.pdf")
mb = doc.page_media_box(0)       # (llx, lly, urx, ury)

Rust

// `editor` is a document editor opened on the same file, as in the C++ and Swift examples below
let mb = editor.get_page_media_box(0)?;   // [f32; 4]

Java — page.mediaBox() returns a BBox (x0, y0, x1, y1):

import fyi.oxide.pdf.geometry.BBox;

BBox mb = doc.page(0).mediaBox();         // (x0, y0, x1, y1) in PDF user space
double w = mb.width(), h = mb.height();   // 612 × 792 for US Letter

Kotlin

val mb = doc.page(0).mediaBox()           // BBox(x0, y0, x1, y1)

Scala

val mb = doc.page(0).mediaBox             // BBox(x0, y0, x1, y1)

C++ — via the editor: get_page_media_box(page):

auto editor = pdf_oxide::DocumentEditor::open("doc.pdf");
auto mb = editor.get_page_media_box(0);   // Bbox{x, y, width, height}

Swift

let editor = try DocumentEditor.open("doc.pdf")
let mb = try editor.getPageMediaBox(0)    // Bbox(x, y, width, height)

Dart

final editor = DocumentEditor.open('doc.pdf');
final mb = editor.getPageMediaBox(0);     // Bbox(x, y, width, height)

R

editor <- pdf_editor_open("doc.pdf")
mb <- pdf_editor_get_page_media_box(editor, 0)   # list(x=, y=, width=, height=)

Julia

editor = open_editor("doc.pdf")
mb = get_page_media_box(editor, 0)        # Bbox

Zig

var editor = try pdf_oxide.DocumentEditor.openEditor("doc.pdf");
const mb = try editor.getPageMediaBox(0);  // Bbox{ x, y, width, height }

Objective-C

POXDocumentEditor *editor = [POXDocumentEditor openEditor:@"doc.pdf" error:&err];
POXBbox mb = [editor pageMediaBox:0 error:&err];   // {x, y, width, height}

Elixir

{:ok, editor} = PdfOxide.open_editor("doc.pdf")
{:ok, mb} = PdfOxide.get_page_media_box(editor, 0)   # %Bbox{}

Go / C# — in-rect helpers

Go and C# don’t yet expose the fluent within() chain. They expose lower-level helpers instead, and C# currently wraps only the text variant:

| Method | Go | C# |
| --- | --- | --- |
| Text in rect | `doc.ExtractTextInRect(page, x, y, w, h)` | `doc.ExtractTextInRect(page, x, y, w, h)` |
| Words in rect | `doc.ExtractWordsInRect(page, x, y, w, h)` | (not yet wrapped) |
| Images in rect | `doc.ExtractImagesInRect(page, x, y, w, h)` | (not yet wrapped) |

For patterns that need multiple extraction types against the same rect in Go or C#, keep the rect in variables and call the helpers sequentially. The fluent surface will follow once the editor API stabilises.

FAQ

What’s the difference between extract_words() and extract_lines_in_rect() in a region? extract_words() returns one record per word; extract_lines_in_rect() returns one record per line (text, bbox, and word count) for the lines whose bbox intersects the rectangle. Use lines when you want whole rows in reading order — address blocks, statement rows, multi-line totals — without re-grouping words yourself.

Does extract_page_auto always run OCR? No. It routes per region. With the default "auto" mode it only escalates to OCR where the native text layer is missing or suspect, and OCR only actually runs when the library is built with the ocr feature. Without that feature it falls back to the native text layer and never raises an opaque OCR error.

Which bindings expose the lines-in-rect and tables-in-rect variants? Rust, the C ABI, and Swift expose extract_lines_in_rect / extract_tables_in_rect directly, and the per-language examples above show equivalent wrappers for C++, Dart, R, Julia, Zig, Objective-C, and Elixir. Python reaches the same results through the fluent region (within(...).extract_text_lines() and within(...).extract_tables()). Go and C# don’t yet wrap the in-rect lines/tables entry points; extract for the full page and filter on the returned bounding boxes.

How fast is scoped extraction? Scoping doesn’t add measurable overhead over full-page extraction — PDF Oxide extracts at a 0.8ms mean (100% pass rate on the benchmark corpus), and an in-rect call filters that same result by bounding box.

Related Pages

Text Extraction — full-page extraction
Extract Tables from PDF — structured tables
Text Search — search results, and search_results_to_json serialization
Extraction Profiles — tune per-document extraction
Page API Reference — iterate + scope from a Page object (page.region(rect))
