What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Region-based extraction: getting content from a specific region

When you process invoices, bank statements, tax forms, or any other templated document, you usually know in advance where the fields are. Instead of extracting the whole page and searching for the value you need, give PDF Oxide the exact rectangle and get back only what is inside it.

The fluent API within(page, rect) returns a region scoped to that rectangle, on which you can chain extraction methods: extract_text(), extract_words(), extract_chars(), extract_tables().

Binding coverage. within(page, rect) is available in Python, Rust, and WASM. Go and C# expose equivalent low-level helpers instead (ExtractTextInRect in both; ExtractWordsInRect and ExtractImagesInRect in Go only so far); see the Go / C# section below. The full in-rect family (text, words, lines, tables, images) is implemented end to end in Rust, the C ABI, and the Swift wrapper; for which bindings support what, see the section on rect-scoped extraction variants.

Quick example

rect is (x, y, width, height) in PDF points, with the origin at the bottom-left corner of the page. A Letter page measures 612 × 792 points.

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")

# Top 92 points of page 0 — typical header band
header = doc.within(0, (0, 700, 612, 92)).extract_text()
print(header)

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::geometry::Rect;

let mut doc = PdfDocument::open("invoice.pdf")?;
let header = doc.within(0, Rect::new(0.0, 700.0, 612.0, 92.0)).extract_text()?;
println!("{}", header);

JavaScript (WASM)

import { WasmPdfDocument } from "pdf-oxide-wasm";

const doc = new WasmPdfDocument(bytes);
const headerRegion = doc.within(0, [0, 700, 612, 92]);
console.log(headerRegion.extractText());
doc.free();

Go (low-level helper, same effect)

package main

import (
    "fmt"
    "log"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("invoice.pdf")
    if err != nil { log.Fatal(err) }
    defer doc.Close()

    // ExtractTextInRect(pageIndex, x, y, width, height)
    header, err := doc.ExtractTextInRect(0, 0, 700, 612, 92)
    if err != nil { log.Fatal(err) }
    fmt.Println(header)
}

C# (low-level helper)

using PdfOxide;

using var doc = PdfDocument.Open("invoice.pdf");
string header = doc.ExtractTextInRect(0, 0, 700, 612, 92);
Console.WriteLine(header);

Java (page.text(region); BBox takes corners (x0, y0, x1, y1), not width and height)

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.geometry.BBox;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("invoice.pdf"))) {
    // Top 92 points of page 0 → corners (0, 700) … (612, 792)
    String header = doc.page(0).text(new BBox(0, 700, 612, 792));
    System.out.println(header);
}

Kotlin

import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.geometry.BBox

PdfDocument.open(java.nio.file.Path.of("invoice.pdf")).use { doc ->
    val header = doc.page(0).text(BBox(0.0, 700.0, 612.0, 792.0))
    println(header)
}

Scala

import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.geometry.BBox
import scala.util.Using

Using.resource(PdfDocument.open("invoice.pdf")) { doc =>
  val header = doc.page(0).text(BBox(0, 700, 612, 792))
  println(header)
}

Clojure

(require '[pdf-oxide.core :as pdf])
(import '[fyi.oxide.pdf.geometry BBox])

(with-open [doc (pdf/open "invoice.pdf")]
  ;; Top 92 points of page 0 → corners (0 700) … (612 792)
  (println (pdf/page-text (pdf/page doc 0) (BBox. 0 700 612 792))))

C++

#include <iostream>
#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("invoice.pdf");
// extract_text_in_rect(page, x, y, w, h)
auto header = doc.extract_text_in_rect(0, 0, 700, 612, 92);
std::cout << header << "\n";

Swift

import PdfOxide

let doc = try Document.open("invoice.pdf")
let header = try doc.extractTextInRect(0, x: 0, y: 700, w: 612, h: 92)
print(header)

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('invoice.pdf');
final header = doc.extractTextInRect(0, 0, 700, 612, 92);
print(header);
doc.close();

R

library(pdfoxide)

doc <- pdf_open("invoice.pdf")
# pdf_extract_text_in_rect(doc, page, x, y, width, height)
header <- pdf_extract_text_in_rect(doc, 0, 0, 700, 612, 92)
cat(header)

Julia

using PdfOxide

doc = open_document("invoice.pdf")
header = extract_text_in_rect(doc, 0, 0, 700, 612, 92)
println(header)

Zig

const std = @import("std");
const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("invoice.pdf");
const header = try doc.extractTextInRect(a, 0, 0, 700, 612, 92);  // free header
std.debug.print("{s}\n", .{header});

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"invoice.pdf" error:&err];
NSString *header = [doc extractTextInRect:0 x:0 y:700 w:612 h:92 error:&err];
NSLog(@"%@", header);

Elixir

{:ok, doc} = PdfOxide.open("invoice.pdf")
# extract_text_in_rect(doc, page, x, y, w, h)
{:ok, header} = PdfOxide.extract_text_in_rect(doc, 0, 0, 700, 612, 92)
IO.puts(header)

Chained extraction from a region

The fluent within() form in Python / Rust / WASM lets you call any extraction method on the same region without specifying the rectangle again:

Python

doc = PdfDocument("invoice.pdf")
region = doc.within(0, (400, 100, 200, 200))   # bottom-right 200×200 box

total_text = region.extract_text()              # plain text
words      = region.extract_words()             # word-level records
chars      = region.extract_chars()             # character-level records

Rust

let region = doc.within(0, Rect::new(400.0, 100.0, 200.0, 200.0));
let text  = region.extract_text()?;
let words = region.extract_words()?;

C++ (no fluent chain; call each in-rect function separately with the same rectangle)

// bottom-right 200×200 box: x=400, y=100, w=200, h=200
auto text  = doc.extract_text_in_rect(0, 400, 100, 200, 200);
auto words = doc.extract_words_in_rect(0, 400, 100, 200, 200);
auto lines = doc.extract_lines_in_rect(0, 400, 100, 200, 200);

Swift

let text  = try doc.extractTextInRect(0, x: 400, y: 100, w: 200, h: 200)
let words = try doc.extractWordsInRect(0, x: 400, y: 100, w: 200, h: 200)

Dart

final text  = doc.extractTextInRect(0, 400, 100, 200, 200);
final words = doc.extractWordsInRect(0, 400, 100, 200, 200);

R

text  <- pdf_extract_text_in_rect(doc, 0, 400, 100, 200, 200)
words <- pdf_extract_words_in_rect(doc, 0, 400, 100, 200, 200)

Julia

text  = extract_text_in_rect(doc, 0, 400, 100, 200, 200)
words = extract_words_in_rect(doc, 0, 400, 100, 200, 200)

Zig

const text  = try doc.extractTextInRect(a, 0, 400, 100, 200, 200);
const words = try doc.extractWordsInRect(a, 0, 400, 100, 200, 200);  // freeWords

Objective-C

NSString *text = [doc extractTextInRect:0 x:400 y:100 w:200 h:200 error:&err];
NSArray<POXWord*> *words = [doc extractWordsInRect:0 x:400 y:100 w:200 h:200 error:&err];

Elixir

{:ok, text}  = PdfOxide.extract_text_in_rect(doc, 0, 400, 100, 200, 200)
{:ok, words} = PdfOxide.extract_words_in_rect(doc, 0, 400, 100, 200, 200)

Common use cases

Extracting invoice fields

An invoice usually has the vendor address, the invoice number, and the line-item table in fixed zones. Define the rectangles once per template:

from pdf_oxide import PdfDocument

TEMPLATES = {
    "acme_v1": {
        "invoice_no":  (450, 720,  120,  20),
        "issue_date":  (450, 700,  120,  20),
        "vendor_name": ( 50, 740,  300,  40),
        "total":       (450, 100,  120,  24),
    },
}

def parse_invoice(path, template):
    doc = PdfDocument(path)
    out = {}
    for field, rect in template.items():
        out[field] = doc.within(0, rect).extract_text().strip()
    return out

print(parse_invoice("invoice-2025-04.pdf", TEMPLATES["acme_v1"]))
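Because a template is only a mapping from field names to (x, y, width, height) tuples, a mistyped rectangle that extends past the page edge is easy to miss. A minimal sketch of a sanity check, assuming a Letter-sized page by default; validate_template is a hypothetical helper written for this illustration, not part of pdf_oxide:

```python
def validate_template(template, page_width=612.0, page_height=792.0):
    """Return the field names whose rectangle is empty or leaves the page.

    Hypothetical helper (not a pdf_oxide API). Rectangles are
    (x, y, width, height) in PDF points with a bottom-left origin.
    """
    bad = []
    for field, (x, y, w, h) in template.items():
        inside = x >= 0 and y >= 0 and x + w <= page_width and y + h <= page_height
        if w <= 0 or h <= 0 or not inside:
            bad.append(field)
    return bad


acme_v1 = {
    "invoice_no": (450, 720, 120, 20),
    "total": (450, 100, 120, 24),
    "typo": (450, 780, 120, 24),  # top edge lands at y=804, above a 792 pt page
}
print(validate_template(acme_v1))
```

Running a check like this once when a template is defined is cheaper than debugging empty fields later. For non-Letter documents, pass the real page size instead of the defaults.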

Transaction rows in a bank statement

Most statements have a distinct “transactions” band. Crop to it and call extract_words(): you get every word in that band, in reading order, with its bounding box:

doc = PdfDocument("statement.pdf")
for page in range(doc.page_count()):
    txn_region = doc.within(page, (36, 72, 540, 650))   # skip header + footer
    for w in txn_region.extract_words():
        print(f"page {page}: {w.text} at ({w.x0:.0f},{w.y0:.0f})")

Stripping headers and footers

If you only need to index the body content, crop the top and bottom of every page:

Rust

let mut doc = PdfDocument::open("book.pdf")?;
for i in 0..doc.page_count()? {
    let body = doc.within(i, Rect::new(0.0, 100.0, 612.0, 600.0))
                  .extract_text()?;
    // index `body` …
}
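The body rectangle in the Rust loop is hard-coded for a Letter page. As a rough sketch of the arithmetic behind it, the hypothetical helper below (not a pdf_oxide function) derives the rectangle from the page size and the two margins, so the same logic can serve other page sizes:

```python
def body_rect(page_width, page_height, top_margin, bottom_margin):
    """Rectangle (x, y, width, height) covering the page minus header and footer bands.

    Hypothetical helper for illustration; all values are in PDF points.
    """
    height = page_height - top_margin - bottom_margin
    if height <= 0:
        raise ValueError("margins leave no body area")
    # The origin is bottom-left, so the body starts right above the footer band.
    return (0.0, float(bottom_margin), float(page_width), float(height))


# A 92 pt header and a 100 pt footer on a Letter page give the
# same numbers as the Rust example: (0, 100, 612, 600).
print(body_rect(612, 792, top_margin=92, bottom_margin=100))
```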

Scoping table detection to a region

If you know a page contains a table and where it is, scope the region to the table's rectangle and let extract_tables() focus on that area only:

Python

tables = doc.within(0, (50, 200, 500, 400)).extract_tables()
for t in tables:
    for row in t["rows"]:
        print([c["text"] for c in row["cells"]])

What rect-scoped extraction variants exist? {#what-rect-scoped-extraction-variants-exist}

Besides extract_text(), extract_words(), and extract_chars(), there are two more rect-scoped variants that return geometry-aware results from a single rectangle: lines in rect and tables in rect. Both filter the full-page extraction down to the items whose bounding box intersects the given rectangle, so coordinates and reading order are the same as in a full-page call, just clipped.
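The intersection filter described here can be sketched in a few lines. This is an illustration of the idea only, not the library's implementation: the record layout (a dict with a "bbox" of (x, y, width, height)) is an assumption, and how PDF Oxide treats boxes that merely touch the edge is not stated in this guide.

```python
def rects_intersect(a, b):
    """True when two (x, y, width, height) rectangles overlap with non-zero area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def clip_to_rect(items, rect):
    """Keep the items whose bounding box intersects rect, preserving their order."""
    return [item for item in items if rects_intersect(item["bbox"], rect)]


# Hand-written stand-ins for full-page line records (assumed layout).
page_lines = [
    {"text": "Header", "bbox": (36, 740, 200, 12)},
    {"text": "Row 1", "bbox": (36, 600, 540, 12)},
    {"text": "Footer", "bbox": (36, 30, 200, 12)},
]
band = (36, 72, 540, 628)  # y from 72 to 700
print([line["text"] for line in clip_to_rect(page_lines, band)])
```

Because the filter keeps the original order and never rewrites coordinates, results stay comparable with a full-page call.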

Extracting text lines in a region (`extract_lines_in_rect`)

Returns line-level records (each with text, a bounding box, and a word count) that overlap the rectangle. Use it when you need whole lines in reading order rather than individual words, for example address blocks, multi-line totals, or a single statement row.

Authoritative C ABI signature:

FfiTextLineList *pdf_document_extract_lines_in_rect(
    PdfDocument *handle,
    int32_t page_index,
    float x, float y, float w, float h,
    int32_t *error_code);

Rust: extract_lines_in_rect(page_index, region) -> Result<Vec<PathContent>> on PdfDocument:

use pdf_oxide::PdfDocument;
use pdf_oxide::geometry::Rect;

let doc = PdfDocument::open("statement.pdf")?;

// Transactions band: skip the header (top 92pt) and footer (bottom 72pt)
let region = Rect::new(36.0, 72.0, 540.0, 628.0);
let lines = doc.extract_lines_in_rect(0, region)?;
for line in &lines {
    println!("{:?}", line.bbox);
}

Python: the fluent region exposes lines through extract_text_lines():

from pdf_oxide import PdfDocument

doc = PdfDocument("statement.pdf")

# Same band as the Rust example above
region = doc.within(0, (36, 72, 540, 628))
for line in region.extract_text_lines():
    print(line.text, line.bbox)

Swift: extractLinesInRect(_:x:y:w:h:) returns [TextLine]:

import PdfOxide

let doc = try PdfDocument(path: "statement.pdf")
let lines = try doc.extractLinesInRect(0, x: 36, y: 72, w: 540, h: 628)
for line in lines {
    print(line.text, line.bbox, line.wordCount)
}

C++: extract_lines_in_rect(page, x, y, w, h) returns std::vector<TextLine>:

auto lines = doc.extract_lines_in_rect(0, 36, 72, 540, 628);
for (const auto& line : lines) {
    std::cout << line.text << "\n";
}

Dart: extractLinesInRect(page, x, y, w, h) returns List<TextLine>:

final lines = doc.extractLinesInRect(0, 36, 72, 540, 628);
for (final line in lines) {
    print('${line.text} ${line.bbox}');
}

R — pdf_extract_lines_in_rect(doc, page, x, y, width, height):

lines <- pdf_extract_lines_in_rect(doc, 0, 36, 72, 540, 628)

Julia — extract_lines_in_rect(doc, page, x, y, w, h):

lines = extract_lines_in_rect(doc, 0, 36, 72, 540, 628)
for line in lines
    println(line.text, " ", line.bbox)
end

Zig — extractLinesInRect(allocator, page, x, y, w, h):

const lines = try doc.extractLinesInRect(a, 0, 36, 72, 540, 628);  // freeTextLines

Objective-C: extractLinesInRect:x:y:w:h: returns NSArray<POXTextLine*>:

NSArray<POXTextLine*> *lines = [doc extractLinesInRect:0 x:36 y:72 w:540 h:628 error:&err];

Elixir — extract_lines_in_rect(doc, page, x, y, w, h):

{:ok, lines} = PdfOxide.extract_lines_in_rect(doc, 0, 36, 72, 540, 628)

Go / C#. The C entry point extract_lines_in_rect exists, but the Go and C# wrappers do not expose it yet. In those languages, extract lines for the whole page and filter by the returned bounding boxes, or use ExtractWordsInRect (Go) and group the words into lines yourself.
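Grouping words into lines yourself mostly comes down to clustering by vertical position. The sketch below shows one simple way to do it, written in Python for consistency with the other examples; the word layout (dicts with "text", "x0", "y0") and the 2 pt tolerance are assumptions for illustration, and real documents with superscripts or skewed scans may need something more robust.

```python
def group_words_into_lines(words, y_tolerance=2.0):
    """Group word records into text lines by vertical proximity.

    words: iterable of dicts with "text", "x0" and "y0" keys (assumed layout).
    Returns one string per line, top of the page first.
    """
    lines = []  # each entry: {"y": y of the line's first word, "words": [...]}
    # PDF y grows upward, so sorting by descending y walks the page top to bottom.
    for word in sorted(words, key=lambda w: -w["y0"]):
        if lines and abs(lines[-1]["y"] - word["y0"]) <= y_tolerance:
            lines[-1]["words"].append(word)
        else:
            lines.append({"y": word["y0"], "words": [word]})
    # Within a line, read left to right.
    return [
        " ".join(w["text"] for w in sorted(line["words"], key=lambda w: w["x0"]))
        for line in lines
    ]


sample_words = [
    {"text": "42.00", "x0": 500, "y0": 600.0},
    {"text": "Coffee", "x0": 120, "y0": 600.4},
    {"text": "2025-04-01", "x0": 36, "y0": 599.8},
    {"text": "Rent", "x0": 120, "y0": 580.0},
    {"text": "2025-04-02", "x0": 36, "y0": 580.1},
    {"text": "900.00", "x0": 500, "y0": 580.0},
]
for row in group_words_into_lines(sample_words):
    print(row)
```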

Extracting tables in a region (`extract_tables_in_rect`)

Restricts table detection to a single rectangle: only tables whose bounding box intersects it are returned. This is the geometry-aware counterpart of the fluent within(...).extract_tables() form shown above.

C ABI signature:

FfiTableList *pdf_document_extract_tables_in_rect(
    PdfDocument *handle,
    int32_t page_index,
    float x, float y, float w, float h,
    int32_t *error_code);

Rust: extract_tables_in_rect(page_index, region) -> Result<Vec<Table>> (the ..._with_config variant accepts a custom TableDetectionConfig):

use pdf_oxide::PdfDocument;
use pdf_oxide::geometry::Rect;

let doc = PdfDocument::open("invoice.pdf")?;
let region = Rect::new(50.0, 200.0, 500.0, 400.0);
let tables = doc.extract_tables_in_rect(0, region)?;
for table in &tables {
    println!("{} rows × {} cols", table.rows.len(), table.col_count);
}

Python, through the fluent region:

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")
tables = doc.within(0, (50, 200, 500, 400)).extract_tables()
for t in tables:
    for row in t["rows"]:
        print([c["text"] for c in row["cells"]])

Swift: extractTablesInRect(_:x:y:w:h:) returns [Table]:

let tables = try doc.extractTablesInRect(0, x: 50, y: 200, w: 500, h: 400)
for table in tables {
    print("\(table.rowCount) rows, header: \(table.hasHeader)")
}

C++: extract_tables_in_rect(page, x, y, w, h) returns std::vector<Table>:

auto tables = doc.extract_tables_in_rect(0, 50, 200, 500, 400);
for (const auto& table : tables) {
    std::cout << table.rows.size() << " rows\n";
}

Dart: extractTablesInRect(page, x, y, w, h) returns List<Table>:

final tables = doc.extractTablesInRect(0, 50, 200, 500, 400);
for (final table in tables) {
    print('${table.rows.length} rows');
}

R — pdf_extract_tables_in_rect(doc, page, x, y, width, height):

tables <- pdf_extract_tables_in_rect(doc, 0, 50, 200, 500, 400)

Julia — extract_tables_in_rect(doc, page, x, y, w, h):

tables = extract_tables_in_rect(doc, 0, 50, 200, 500, 400)

Zig — extractTablesInRect(allocator, page, x, y, w, h):

const tables = try doc.extractTablesInRect(a, 0, 50, 200, 500, 400);

Objective-C: extractTablesInRect:x:y:w:h: returns NSArray<POXTable*>:

NSArray<POXTable*> *tables = [doc extractTablesInRect:0 x:50 y:200 w:500 h:400 error:&err];

Elixir — extract_tables_in_rect(doc, page, x, y, w, h):

{:ok, tables} = PdfOxide.extract_tables_in_rect(doc, 0, 50, 200, 500, 400)

Go / C#. As with lines, the C entry point extract_tables_in_rect exists but is not yet wrapped in Go or C#. Call ExtractTables(page) for the whole page and keep the tables whose bounding box falls within your rectangle.

How do I extract a page automatically without choosing between text and OCR?

When you do not know whether a page is born-digital text, a scan, or a mix, extract_page_auto does the routing for you. It runs AutoExtractor, which routes each region to native text or OCR with a graceful native fallback (no opaque OCR errors), and returns a PageExtraction JSON: the page kind, the assembled text in reading order, confidence, a typed reason, an ocr_used flag, and a regions[] array in which each region carries bbox, kind, text, confidence, source, and reason (bbox and reason are present even when a region's text is empty, so reading order is never silently broken).

Tolerant of {}: pass an empty / null options JSON to get the defaults, or supply an AutoExtractOptions object. Recognized fields (serialized in snake_case):

Field	Type	Default	Meaning
`mode`	`"text_only"` | `"auto"` | `"force_ocr"`	`"auto"`	Text vs OCR routing strategy
`reconstruct_image_tables`	bool	`true`	Reconstruct image-only tables with the spatial detector over OCR spans
`emit_placeholders`	bool	`true`	Insert positioned Figure/Table placeholders into the text stream
`ocr_languages`	string[]	`[]`	OCR language hints (for example `["english","chinese"]`)
`min_text_confidence`	float | null	`null`	Confidence threshold for the automatic decision
`table_confidence`	float | null	`null`	Threshold for reconstructing image tables
`force_ocr_pages`	int[]	`[]`	Zero-based page indices on which to force OCR
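Since the options travel as a JSON string, a misspelled key may be hard to notice. One way to guard against that is to build the string through a small checker; the sketch below is a hypothetical helper (build_auto_options is not part of pdf_oxide) whose field names and mode values are copied from the table above. It emits only the fields you override and assumes the library fills in defaults for the rest.

```python
import json

# Field names and defaults copied from the options table above.
AUTO_DEFAULTS = {
    "mode": "auto",
    "reconstruct_image_tables": True,
    "emit_placeholders": True,
    "ocr_languages": [],
    "min_text_confidence": None,
    "table_confidence": None,
    "force_ocr_pages": [],
}
VALID_MODES = {"text_only", "auto", "force_ocr"}


def build_auto_options(**overrides):
    """Return an options JSON string, rejecting unknown keys and modes."""
    unknown = set(overrides) - set(AUTO_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown option(s): {sorted(unknown)}")
    if overrides.get("mode", "auto") not in VALID_MODES:
        raise ValueError(f"invalid mode: {overrides['mode']!r}")
    return json.dumps(overrides, sort_keys=True)


opts = build_auto_options(mode="force_ocr", ocr_languages=["english"])
print(opts)
```

The resulting string can be passed wherever an options JSON is accepted, for example as the second argument of extract_page_auto in Python.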

OCR feature gate. OCR actually runs only if the library was built with the ocr feature; otherwise extract_page_auto falls back to the native text layer (without errors). The automatic entry point is available in Python, Go, C#, Swift, WASM, and the C ABI. In Rust it is the AutoExtractor library API rather than a one-line PdfDocument method; see below.

Python — extract_page_auto(page, options_json=None) -> str (JSON):

import json
from pdf_oxide import PdfDocument

doc = PdfDocument("mixed-scan.pdf")

# Defaults (balanced preset)
page = json.loads(doc.extract_page_auto(0))
print(page["kind"], page["confidence"], page["ocr_used"])
for region in page["regions"]:
    print(region["kind"], region["bbox"], region["reason"])

# With options
opts = json.dumps({"mode": "auto", "reconstruct_image_tables": True,
                   "ocr_languages": ["english"]})
page = json.loads(doc.extract_page_auto(0, opts))
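Once the envelope is parsed, the per-region confidence makes it easy to flag areas for manual review. A small sketch under stated assumptions: the sample below is a hand-written stand-in for json.loads(doc.extract_page_auto(0)), its kind, source, and reason values are placeholders rather than the library's real vocabulary, and confidence is assumed to be numeric for every region.

```python
import json


def regions_below(page, threshold):
    """Return (kind, bbox, confidence) for regions under the confidence threshold."""
    return [
        (r["kind"], r["bbox"], r["confidence"])
        for r in page["regions"]
        if r["confidence"] < threshold
    ]


# Illustrative envelope only; field names follow the description above,
# the values are invented for this example.
sample = json.loads("""
{
  "kind": "mixed", "text": "...", "confidence": 0.81, "ocr_used": true,
  "regions": [
    {"kind": "text",  "bbox": [36, 600, 540, 100], "text": "Invoice",
     "confidence": 0.97, "source": "native", "reason": "text_layer"},
    {"kind": "table", "bbox": [50, 200, 500, 400], "text": "",
     "confidence": 0.55, "source": "ocr", "reason": "image_only"}
  ]
}
""")
for kind, bbox, conf in regions_below(sample, 0.8):
    print(kind, bbox, conf)
```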

Go: ExtractPageAuto(pageIndex, opts ...AutoOption) (string, error) (returns JSON; configured through functional options):

package main

import (
    "encoding/json"
    "fmt"
    "log"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("mixed-scan.pdf")
    if err != nil { log.Fatal(err) }
    defer doc.Close()

    raw, err := doc.ExtractPageAuto(0)
    if err != nil { log.Fatal(err) }

    var page map[string]any
    if err := json.Unmarshal([]byte(raw), &page); err != nil { log.Fatal(err) }
    fmt.Println(page["kind"], page["confidence"], page["ocr_used"])
}

C# — ExtractPageAuto(int pageIndex, string? optionsJson = null) -> string (JSON):

using System.Text.Json;
using PdfOxide.Core;

using var doc = PdfDocument.Open("mixed-scan.pdf");

// Defaults
string json = doc.ExtractPageAuto(0);
using var page = JsonDocument.Parse(json);
Console.WriteLine(page.RootElement.GetProperty("kind"));

// With options
string opts = """{"mode":"auto","ocr_languages":["english"]}""";
string json2 = doc.ExtractPageAuto(0, opts);

Swift: extractPageAuto(_:optionsJson:) -> String (optionsJson defaults to "{}"):

let json = try doc.extractPageAuto(0, optionsJson: "{}")

JavaScript (WASM) — extractPageAuto(pageIndex, optionsJson?):

import { WasmPdfDocument } from "pdf-oxide-wasm";

const doc = new WasmPdfDocument(bytes);
const page = JSON.parse(doc.extractPageAuto(0));
console.log(page.kind, page.confidence, page.ocr_used);
doc.free();

Rust: the automatic path is the AutoExtractor library API. Build an AutoExtractOptions (the fast(), balanced(), or high_fidelity() presets, or the fluent builder) and call extract_page; you get back a typed PageExtraction (no JSON round trip):

use pdf_oxide::PdfDocument;
use pdf_oxide::extractors::auto::{AutoExtractor, AutoExtractOptions, ExtractMode};

let doc = PdfDocument::open("mixed-scan.pdf")?;

// Default (balanced) preset
let page = AutoExtractor::new().extract_page(&doc, 0)?;
println!("{:?} conf={} ocr={}", page.kind, page.confidence, page.ocr_used);

// Custom options via the builder
let opts = AutoExtractOptions::builder()
    .mode(ExtractMode::Auto)
    .reconstruct_image_tables(true)
    .ocr_languages(["english"])
    .build();
let page = AutoExtractor::with(opts).extract_page(&doc, 0)?;
for region in &page.regions {
    println!("{:?} {:?} {:?}", region.kind, region.bbox, region.reason);
}

C++: extract_page_auto(page, options_json = "") returns the JSON envelope:

#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("mixed-scan.pdf");
auto json = doc.extract_page_auto(0);                                    // defaults
auto json2 = doc.extract_page_auto(0, R"({"mode":"auto","ocr_languages":["english"]})");

Dart: extractPageAuto(page, [optionsJson]) returns the JSON envelope:

import 'dart:convert';
import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('mixed-scan.pdf');
final page = jsonDecode(doc.extractPageAuto(0));
print('${page["kind"]} ${page["confidence"]} ${page["ocr_used"]}');
doc.close();

R: pdf_extract_page_auto(doc, page, options_json = NULL) returns JSON:

library(jsonlite)

doc  <- pdf_open("mixed-scan.pdf")
page <- fromJSON(pdf_extract_page_auto(doc, 0))
cat(page$kind, page$confidence, page$ocr_used, "\n")

Julia: extract_page_auto(doc, page, options = "{}") returns JSON:

using PdfOxide, JSON

doc  = open_document("mixed-scan.pdf")
page = JSON.parse(extract_page_auto(doc, 0))
println(page["kind"], " ", page["confidence"], " ", page["ocr_used"])

Zig: extractPageAuto(allocator, page, options_json) returns JSON bytes:

const json = try doc.extractPageAuto(a, 0, null);  // free json

Objective-C: extractPageAuto:optionsJson:error: returns the JSON envelope:

NSString *json = [doc extractPageAuto:0 optionsJson:@"{}" error:&err];

Elixir: extract_page_auto(doc, page, options_json \\ "") returns JSON:

{:ok, json} = PdfOxide.extract_page_auto(doc, 0)
page = Jason.decode!(json)
IO.inspect({page["kind"], page["confidence"], page["ocr_used"]})

Java: the automatic path is the AutoExtractor API (extractPage → typed result; extractTextForPage for plain text):

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.AutoExtractor;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("mixed-scan.pdf"))) {
    AutoExtractor ax = AutoExtractor.of(doc);             // or .fast/.balanced/.highFidelity
    String text = ax.extractTextForPage(0);               // graceful native/OCR routing
    System.out.println(text);
}

Kotlin

import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.AutoExtractor

PdfDocument.open(java.nio.file.Path.of("mixed-scan.pdf")).use { doc ->
    val ax = AutoExtractor.of(doc)
    println(ax.extractTextForPage(0))
}

Scala

import fyi.oxide.pdf.{PdfDocument, AutoExtractor}
import scala.util.Using

Using.resource(PdfDocument.open("mixed-scan.pdf")) { doc =>
  val ax = AutoExtractor.of(doc)
  println(ax.extractTextForPage(0))
}

PHP: the extended JSON envelope is available through AutoExtractor::extractPageJson:

use PdfOxide\PdfDocument;
use PdfOxide\AutoExtractor;

$doc = PdfDocument::open('mixed-scan.pdf');
$ax  = AutoExtractor::balanced($doc);
$page = json_decode($ax->extractPageJson(0), true);
echo $page['kind'], ' ', $page['confidence'], ' ', $page['ocr_used'];

Ruby: auto_extractor.extract_page(page) returns the parsed envelope as a Hash:

require 'pdf_oxide'

PdfOxide::PdfDocument.open('mixed-scan.pdf') do |doc|
  result = doc.auto_extractor.extract_page(0)
  cls = result[:classification]            # full PageExtraction JSON as a Hash
  puts [cls['kind'], cls['confidence'], cls['ocr_used']].join(' ')
end

How do I get structured, typed regions as JSON?

For a full-page structured view (headings, body blocks, headers/footers, page numbers, column order), use the structured extraction entry point. It returns a StructuredPage: page_index, page_width, page_height, and a regions[] array in which each region carries kind (its semantic role), text, bbox, spans, and column_index (for reading order in multi-column layouts). Region kind values include body text blocks, structural headings (H1–H6), marginal notes, running headers/footers, page numbers, and artifacts.

Most bindings return this as a JSON string (the C ABI serializes once, and the bindings deserialize into native types); Rust returns the typed StructuredPage directly.

C ABI signature:

char *pdf_document_extract_structured_to_json(
    PdfDocument *handle,
    int32_t page_index,
    int32_t *error_code);

Python: extract_structured(page) -> str (JSON; deserialize with json.loads):

import json
from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
page = json.loads(doc.extract_structured(0))

print(page["page_width"], page["page_height"])
for region in page["regions"]:
    print(region["kind"], region["column_index"], region["text"][:60])
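For multi-column pages, column_index can be used to reassemble each column's text separately. The following is a minimal sketch rather than a documented recipe: the sample dict stands in for json.loads(doc.extract_structured(0)), its kind values are placeholders, and treating a null column_index as a full-width region is an assumption.

```python
from collections import defaultdict


def text_by_column(page):
    """Join region text per column_index, keeping the order regions arrive in."""
    columns = defaultdict(list)
    for region in page["regions"]:
        columns[region.get("column_index")].append(region["text"])
    return {col: "\n".join(parts) for col, parts in columns.items()}


# Illustrative stand-in for a parsed StructuredPage; values are invented.
sample_page = {
    "page_index": 0, "page_width": 612, "page_height": 792,
    "regions": [
        {"kind": "heading", "column_index": None, "text": "Annual Report"},
        {"kind": "body", "column_index": 0, "text": "Left one"},
        {"kind": "body", "column_index": 1, "text": "Right one"},
        {"kind": "body", "column_index": 0, "text": "Left two"},
    ],
}
for column, text in text_by_column(sample_page).items():
    print(column, repr(text))
```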

Go — ExtractStructured(page) (string, error):

raw, err := doc.ExtractStructured(0)
if err != nil { log.Fatal(err) }

var page map[string]any
if err := json.Unmarshal([]byte(raw), &page); err != nil { log.Fatal(err) }
for _, r := range page["regions"].([]any) {
    region := r.(map[string]any)
    fmt.Println(region["kind"], region["text"])
}

C# — ExtractStructured(int page) -> string:

using System.Text.Json;

string json = doc.ExtractStructured(0);
using var page = JsonDocument.Parse(json);
foreach (var region in page.RootElement.GetProperty("regions").EnumerateArray())
{
    Console.WriteLine(region.GetProperty("kind"));
}

Swift — extractStructuredJson(_:) -> String:

let json = try doc.extractStructuredJson(0)

JavaScript (WASM): extractStructured(pageIndex) (returns a JSON string with camelCase keys):

const page = JSON.parse(doc.extractStructured(0));
for (const region of page.regions) {
    console.log(region.kind, region.columnIndex);
}

Rust: extract_structured(page_index) -> Result<StructuredPage> returns typed regions directly (no JSON round trip). The extract_structured_with_column_mode variant lets you force ColumnMode::Two/Single for difficult layouts:

use pdf_oxide::PdfDocument;

let doc = PdfDocument::open("report.pdf")?;
let page = doc.extract_structured(0)?;
for region in &page.regions {
    println!("{:?} col={:?}: {}", region.kind, region.column_index, region.text);
}

C++: extract_structured_json(page) returns a JSON string:

auto json = doc.extract_structured_json(0);

Dart: extractStructuredJson(page) returns a JSON string:

import 'dart:convert';

final page = jsonDecode(doc.extractStructuredJson(0));
for (final region in page['regions']) {
    print('${region["kind"]} ${region["column_index"]}');
}

R: pdf_extract_structured_json(doc, page) returns JSON:

library(jsonlite)

page <- fromJSON(pdf_extract_structured_json(doc, 0))
print(page$page_width)

Julia: extract_structured_json(doc, page) returns JSON:

using JSON
page = JSON.parse(extract_structured_json(doc, 0))
for region in page["regions"]
    println(region["kind"], " ", region["column_index"])
end

Zig: extractStructuredJson(allocator, page) returns JSON bytes:

const json = try doc.extractStructuredJson(a, 0);  // free json

Objective-C: extractStructuredJson:error: returns a JSON string:

NSString *json = [doc extractStructuredJson:0 error:&err];

Elixir: extract_structured_json(doc, page) returns JSON:

{:ok, json} = PdfOxide.extract_structured_json(doc, 0)
page = Jason.decode!(json)

Java: extractStructured(page) returns a JSON string:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

String json = doc.extractStructured(0);
JsonNode page = new ObjectMapper().readTree(json);
for (JsonNode region : page.get("regions")) {
    System.out.println(region.get("kind").asText());
}

Kotlin

val json = doc.extractStructured(0)   // JSON string; parse with your library of choice

Scala

val json = doc.extractStructured(0)   // JSON string

Clojure: (pdf/extract-structured doc page) returns a JSON string:

(require '[clojure.data.json :as json])

(with-open [doc (pdf/open "report.pdf")]
  (let [page (json/read-str (pdf/extract-structured doc 0))]
    (doseq [region (get page "regions")]
      (println (get region "kind") (get region "column_index")))))

Ruby: extract_structured(page) returns the parsed StructuredPage as a Hash:

PdfOxide::PdfDocument.open('report.pdf') do |doc|
  page = doc.extract_structured(0)
  page['regions'].each { |r| puts "#{r['kind']} #{r['column_index']}" }
end

PHP: extractStructured($page) returns a deserialized associative array:

$doc = PdfOxide\PdfDocument::open('report.pdf');
$page = $doc->extractStructured(0);
foreach ($page['regions'] as $region) {
    echo $region['kind'], ' ', $region['column_index'], "\n";
}

Coordinate reference

PDF uses a bottom-left origin measured in points (1 pt = 1/72 inch). A Letter page's MediaBox is (0, 0, 612, 792). To select the top 1-inch band, write:

(x, y, w, h) = (0, 792 - 72, 612, 72)
             = (0, 720,      612, 72)

If you are used to image coordinates (origin at the top-left corner), flip y accordingly: y = page_height - top - height.
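The flip from top-left to bottom-left coordinates can be sketched as a one-line conversion. from_top_left is a hypothetical helper, not a pdf_oxide function, and it assumes the input box is already expressed in points; pixel coordinates from a rendered image would first need scaling by 72 / dpi.

```python
def from_top_left(x, top, width, height, page_height=792.0):
    """Convert a top-left-origin box to a PDF (x, y, width, height) rect."""
    # In PDF space, y is the bottom edge of the box measured up from the page bottom.
    y = page_height - top - height
    return (x, y, width, height)


# The top 1-inch band of a Letter page, matching the worked numbers above.
print(from_top_left(0, 0, 612, 72))
```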

To get the page's actual MediaBox before doing the math:

Python

doc = PdfDocument("doc.pdf")
mb = doc.page_media_box(0)       # (llx, lly, urx, ury)
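With the MediaBox in hand, the band arithmetic no longer has to assume a Letter page. A short sketch, assuming the (llx, lly, urx, ury) tuple order noted in the comment above and assuming region rectangles use the same user-space coordinates; top_band is a hypothetical helper:

```python
def top_band(media_box, band_height):
    """Rect (x, y, width, height) for the top band of a page.

    media_box is (llx, lly, urx, ury) in points.
    """
    llx, lly, urx, ury = media_box
    return (llx, ury - band_height, urx - llx, band_height)


print(top_band((0, 0, 612, 792), 72))  # Letter
print(top_band((0, 0, 595, 842), 72))  # roughly A4
```

The result can be passed straight to within(page, rect) in place of a hard-coded tuple.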

Rust

let mb = editor.get_page_media_box(0)?;   // [f32; 4]

Java: page.mediaBox() returns a BBox (x0, y0, x1, y1):

import fyi.oxide.pdf.geometry.BBox;

BBox mb = doc.page(0).mediaBox();         // (x0, y0, x1, y1) in PDF user space
double w = mb.width(), h = mb.height();   // 612 × 792 for US Letter

Kotlin

val mb = doc.page(0).mediaBox()           // BBox(x0, y0, x1, y1)

Scala

val mb = doc.page(0).mediaBox             // BBox(x0, y0, x1, y1)

C++, through the editor: get_page_media_box(page):

auto editor = pdf_oxide::DocumentEditor::open("doc.pdf");
auto mb = editor.get_page_media_box(0);   // Bbox{x, y, width, height}

Swift

let editor = try DocumentEditor.open("doc.pdf")
let mb = try editor.getPageMediaBox(0)    // Bbox(x, y, width, height)

Dart

final editor = DocumentEditor.open('doc.pdf');
final mb = editor.getPageMediaBox(0);     // Bbox(x, y, width, height)

R

editor <- pdf_editor_open("doc.pdf")
mb <- pdf_editor_get_page_media_box(editor, 0)   # list(x=, y=, width=, height=)

Julia

editor = open_editor("doc.pdf")
mb = get_page_media_box(editor, 0)        # Bbox

Zig

var editor = try pdf_oxide.DocumentEditor.openEditor("doc.pdf");
const mb = try editor.getPageMediaBox(0);  // Bbox{ x, y, width, height }

Objective-C

POXDocumentEditor *editor = [POXDocumentEditor openEditor:@"doc.pdf" error:&err];
POXBbox mb = [editor pageMediaBox:0 error:&err];   // {x, y, width, height}

Elixir

{:ok, editor} = PdfOxide.open_editor("doc.pdf")
{:ok, mb} = PdfOxide.get_page_media_box(editor, 0)   # %Bbox{}

Go / C#: in-rect helpers

Go and C# do not expose the fluent within() chain yet, but the underlying low-level methods are the same:

Method	Go	C#
Text in rect	`doc.ExtractTextInRect(page, x, y, w, h)`	`doc.ExtractTextInRect(page, x, y, w, h)`
Words in rect	`doc.ExtractWordsInRect(page, x, y, w, h)`	(not yet wrapped)
Images in rect	`doc.ExtractImagesInRect(page, x, y, w, h)`	(not yet wrapped)

For templates that need several kinds of extraction over the same rectangle in Go or C#, keep the rectangle in variables and call the helpers one after another. A fluent interface will follow once the editor API stabilizes.

Frequently asked questions

What is the difference between extract_words() and extract_lines_in_rect() in a region? extract_words() returns one record per word; extract_lines_in_rect() returns one record per line (text, bounding box, and word count) for lines whose rectangle intersects the given one. Use lines when you need whole rows in reading order (address blocks, statement rows, multi-line totals) without regrouping words by hand.

Does extract_page_auto always run OCR? No. Routing happens per region. In the default "auto" mode, OCR is used only where the native text layer is missing or doubtful, and OCR actually runs only if the library was built with the ocr feature. Without that feature, extraction falls back to the native text layer with no opaque OCR errors.

Which bindings support the lines-in-rect and tables-in-rect variants? Rust, the C ABI, and Swift expose extract_lines_in_rect / extract_tables_in_rect directly. Python gets the same results through the fluent region (within(...).extract_text_lines() and within(...).extract_tables()). Go and C# do not wrap the in-rect entry points for lines/tables yet; extract for the whole page and filter by the returned bounding boxes.

How fast is region-based extraction? Scoping to a region adds no measurable overhead over full-page extraction: PDF Oxide extracts in 0.8 ms on average (100% pass rate on the test corpus), and an in-rect call simply filters the same result by bounding box.

Related pages

Text extraction: full-page extraction
Extracting tables from PDF: structured tables
Text search: search results and search_results_to_json serialization
Extraction profiles: per-document extraction tuning
Page API reference: iteration and scoping from the Page object (page.region(rect))
