What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

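Scoped Extraction: Pull Content from a Specific Region

When you process invoices, bank statements, tax forms, or other template-based layouts, you often know in advance where each field sits. Instead of extracting the whole page and searching for the value, give PDF Oxide the exact rectangle and extract only that part.

The fluent within(page, rect) API returns a scoped region on which you can chain extraction methods: extract_text(), extract_words(), extract_chars(), and extract_tables().

Binding coverage. within(page, rect) is available in Python, Rust, and WASM. Go provides equivalent low-level helpers (ExtractTextInRect, ExtractWordsInRect, ExtractImagesInRect), and C# currently wraps ExtractTextInRect; see the Go / C# section below. The full in-rect family (text, words, lines, tables, images) is available end to end in Rust, the C ABI, and the Swift wrapper. For per-binding coverage, see In-rect extraction variants.

Quick example

rect is (x, y, width, height) in PDF points, with the origin at the bottom-left corner of the page. A Letter-size page is 612 × 792 points.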
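
Most of the examples on this page pass (x, y, width, height), while the Java, Kotlin, Scala, and Clojure examples pass a corner-form box (x0, y0, x1, y1). The following is a minimal sketch of converting between the two forms. The helper names rect_to_corners and corners_to_rect are illustrative and are not part of the PDF Oxide API.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]     # (x, y, width, height)
Corners = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def rect_to_corners(rect: Rect) -> Corners:
    """Convert (x, y, width, height) to corner form (x0, y0, x1, y1)."""
    x, y, w, h = rect
    return (x, y, x + w, y + h)


def corners_to_rect(box: Corners) -> Rect:
    """Convert corner form back to (x, y, width, height).

    The corners are normalised first, so width and height are never negative.
    """
    x0, y0, x1, y1 = box
    left, bottom = min(x0, x1), min(y0, y1)
    return (left, bottom, abs(x1 - x0), abs(y1 - y0))


header = (0, 700, 612, 92)       # top 92 points of a Letter page
print(rect_to_corners(header))   # (0, 700, 612, 792)
```

The header band used throughout this page, (0, 700, 612, 92), therefore corresponds to the corner box (0, 700, 612, 792).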

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")

# Top 92 points of page 0 — typical header band
header = doc.within(0, (0, 700, 612, 92)).extract_text()
print(header)

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::geometry::Rect;

let mut doc = PdfDocument::open("invoice.pdf")?;
let header = doc.within(0, Rect::new(0.0, 700.0, 612.0, 92.0)).extract_text()?;
println!("{}", header);

JavaScript (WASM)

import { WasmPdfDocument } from "pdf-oxide-wasm";

const doc = new WasmPdfDocument(bytes);
const headerRegion = doc.within(0, [0, 700, 612, 92]);
console.log(headerRegion.extractText());
doc.free();

Go (low-level helper, same effect)

package main

import (
    "fmt"
    "log"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("invoice.pdf")
    if err != nil { log.Fatal(err) }
    defer doc.Close()

    // ExtractTextInRect(pageIndex, x, y, width, height)
    header, err := doc.ExtractTextInRect(0, 0, 700, 612, 92)
    if err != nil { log.Fatal(err) }
    fmt.Println(header)
}

C# (low-level helper)

using PdfOxide;

using var doc = PdfDocument.Open("invoice.pdf");
string header = doc.ExtractTextInRect(0, 0, 700, 612, 92);
Console.WriteLine(header);

Java (page.text(region); BBox uses corner form (x0, y0, x1, y1))

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.geometry.BBox;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("invoice.pdf"))) {
    // Top 92 points of page 0 → corners (0, 700) … (612, 792)
    String header = doc.page(0).text(new BBox(0, 700, 612, 792));
    System.out.println(header);
}

Kotlin

import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.geometry.BBox

PdfDocument.open(java.nio.file.Path.of("invoice.pdf")).use { doc ->
    val header = doc.page(0).text(BBox(0.0, 700.0, 612.0, 792.0))
    println(header)
}

Scala

import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.geometry.BBox
import scala.util.Using

Using.resource(PdfDocument.open("invoice.pdf")) { doc =>
  val header = doc.page(0).text(BBox(0, 700, 612, 792))
  println(header)
}

Clojure

(require '[pdf-oxide.core :as pdf])
(import '[fyi.oxide.pdf.geometry BBox])

(with-open [doc (pdf/open "invoice.pdf")]
  ;; Top 92 points of page 0 → corners (0 700) … (612 792)
  (println (pdf/page-text (pdf/page doc 0) (BBox. 0 700 612 792))))

C++

#include <iostream>
#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("invoice.pdf");
// extract_text_in_rect(page, x, y, w, h)
auto header = doc.extract_text_in_rect(0, 0, 700, 612, 92);
std::cout << header << "\n";

Swift

import PdfOxide

let doc = try Document.open("invoice.pdf")
let header = try doc.extractTextInRect(0, x: 0, y: 700, w: 612, h: 92)
print(header)

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('invoice.pdf');
final header = doc.extractTextInRect(0, 0, 700, 612, 92);
print(header);
doc.close();

R

library(pdfoxide)

doc <- pdf_open("invoice.pdf")
# pdf_extract_text_in_rect(doc, page, x, y, width, height)
header <- pdf_extract_text_in_rect(doc, 0, 0, 700, 612, 92)
cat(header)

Julia

using PdfOxide

doc = open_document("invoice.pdf")
header = extract_text_in_rect(doc, 0, 0, 700, 612, 92)
println(header)

Zig

const std = @import("std");
const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("invoice.pdf");
const header = try doc.extractTextInRect(a, 0, 0, 700, 612, 92);  // free header
std.debug.print("{s}\n", .{header});

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"invoice.pdf" error:&err];
NSString *header = [doc extractTextInRect:0 x:0 y:700 w:612 h:92 error:&err];
NSLog(@"%@", header);

Elixir

{:ok, doc} = PdfOxide.open("invoice.pdf")
# extract_text_in_rect(doc, page, x, y, w, h)
{:ok, header} = PdfOxide.extract_text_in_rect(doc, 0, 0, 700, 612, 92)
IO.puts(header)

Chaining extraction on a region

With the fluent within() form in Python, Rust, and WASM, you can call any extraction method on the same scoped region without specifying the rectangle again:

Python

doc = PdfDocument("invoice.pdf")
region = doc.within(0, (400, 100, 200, 200))   # bottom-right 200×200 box

total_text = region.extract_text()              # plain text
words      = region.extract_words()             # word-level records
chars      = region.extract_chars()             # character-level records

Rust

let region = doc.within(0, Rect::new(400.0, 100.0, 200.0, 200.0));
let text  = region.extract_text()?;
let words = region.extract_words()?;

C++ (no fluent chain; call each in-rect helper separately on the same rectangle)

// bottom-right 200×200 box: x=400, y=100, w=200, h=200
auto text  = doc.extract_text_in_rect(0, 400, 100, 200, 200);
auto words = doc.extract_words_in_rect(0, 400, 100, 200, 200);
auto lines = doc.extract_lines_in_rect(0, 400, 100, 200, 200);

Swift

let text  = try doc.extractTextInRect(0, x: 400, y: 100, w: 200, h: 200)
let words = try doc.extractWordsInRect(0, x: 400, y: 100, w: 200, h: 200)

Dart

final text  = doc.extractTextInRect(0, 400, 100, 200, 200);
final words = doc.extractWordsInRect(0, 400, 100, 200, 200);

R

text  <- pdf_extract_text_in_rect(doc, 0, 400, 100, 200, 200)
words <- pdf_extract_words_in_rect(doc, 0, 400, 100, 200, 200)

Julia

text  = extract_text_in_rect(doc, 0, 400, 100, 200, 200)
words = extract_words_in_rect(doc, 0, 400, 100, 200, 200)

Zig

const text  = try doc.extractTextInRect(a, 0, 400, 100, 200, 200);
const words = try doc.extractWordsInRect(a, 0, 400, 100, 200, 200);  // freeWords

Objective-C

NSString *text = [doc extractTextInRect:0 x:400 y:100 w:200 h:200 error:&err];
NSArray<POXWord*> *words = [doc extractWordsInRect:0 x:400 y:100 w:200 h:200 error:&err];

Elixir

{:ok, text}  = PdfOxide.extract_text_in_rect(doc, 0, 400, 100, 200, 200)
{:ok, words} = PdfOxide.extract_words_in_rect(doc, 0, 400, 100, 200, 200)

Common use cases

Invoice field extraction

Invoices usually place the vendor address, invoice number, and line-item table at fixed positions. Define the rectangles once per template:

from pdf_oxide import PdfDocument

TEMPLATES = {
    "acme_v1": {
        "invoice_no":  (450, 720,  120,  20),
        "issue_date":  (450, 700,  120,  20),
        "vendor_name": ( 50, 740,  300,  40),
        "total":       (450, 100,  120,  24),
    },
}

def parse_invoice(path, template):
    doc = PdfDocument(path)
    out = {}
    for field, rect in template.items():
        out[field] = doc.within(0, rect).extract_text().strip()
    return out

print(parse_invoice("invoice-2025-04.pdf", TEMPLATES["acme_v1"]))
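
A template rectangle that runs off the page is easy to miss because the mistake only shows up as an empty or clipped field. The sketch below checks every rectangle in a template against the page size before any extraction runs. out_of_bounds_fields is a hypothetical helper, not a PDF Oxide function, and the default page size assumes US Letter.

```python
def out_of_bounds_fields(template, page_width=612.0, page_height=792.0):
    """Return the names of fields whose (x, y, w, h) rectangle leaves the page."""
    bad = []
    for field, (x, y, w, h) in template.items():
        inside = (
            w > 0 and h > 0
            and x >= 0 and y >= 0
            and x + w <= page_width
            and y + h <= page_height
        )
        if not inside:
            bad.append(field)
    return bad


# Same rectangles as the acme_v1 template above.
acme_v1 = {
    "invoice_no":  (450, 720, 120, 20),
    "issue_date":  (450, 700, 120, 20),
    "vendor_name": (50, 740, 300, 40),
    "total":       (450, 100, 120, 24),
}

print(out_of_bounds_fields(acme_v1))   # []
```

For pages that are not Letter size, pass the width and height taken from the page's MediaBox instead of the defaults.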

Bank statement transactions

Most statements have a narrow “transactions” band. Crop to that band and call extract_words() to get each word in reading order together with its bounding box:

doc = PdfDocument("statement.pdf")
for page in range(doc.page_count()):
    txn_region = doc.within(page, (36, 72, 540, 650))   # skip header + footer
    for w in txn_region.extract_words():
        print(f"page {page}: {w.text} at ({w.x0:.0f},{w.y0:.0f})")

Header / footer removal

If you index only the body content, crop away the top and bottom of each page:

Rust

let mut doc = PdfDocument::open("book.pdf")?;
for i in 0..doc.page_count()? {
    let body = doc.within(i, Rect::new(0.0, 100.0, 612.0, 600.0))
                  .extract_text()?;
    // index `body` …
}
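
The body rectangle above is hard-coded for a Letter page. The sketch below derives it from a MediaBox and two margins instead. body_rect is an illustrative helper rather than a library call, and it assumes the MediaBox arrives as (llx, lly, urx, ury).

```python
def body_rect(media_box, top_margin, bottom_margin):
    """Return (x, y, w, h) for the page body, given a (llx, lly, urx, ury) MediaBox."""
    llx, lly, urx, ury = media_box
    height = (ury - lly) - top_margin - bottom_margin
    if height <= 0:
        raise ValueError("margins leave no body area")
    # The body starts just above the footer band and spans the full page width.
    return (llx, lly + bottom_margin, urx - llx, height)


# Letter page, 92 pt header band and 100 pt footer band.
print(body_rect((0, 0, 612, 792), top_margin=92, bottom_margin=100))
# (0, 100, 612, 600)
```

With a 92 pt header and a 100 pt footer on a Letter page, this reproduces the (0, 100, 612, 600) rectangle used in the Rust loop.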

Table region detection

If the page contains a table and you know where it is, narrow the scope to the table's rectangle so that extract_tables() looks only at that area:

Python

tables = doc.within(0, (50, 200, 500, 400)).extract_tables()
for t in tables:
    for row in t["rows"]:
        print([c["text"] for c in row["cells"]])

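What rect-scoped extraction variants exist? {#what-rect-scoped-extraction-variants-exist}

Beyond extract_text(), extract_words(), and extract_chars(), two more rect-scoped variants return geometry-aware results for a single rectangle: lines in a rect and tables in a rect. Both filter the full-page extraction down to the items whose bounding boxes intersect the rectangle you specify, so the returned coordinates and reading order match the full-page call; the result is simply cropped.

Extracting text lines in a region (`extract_lines_in_rect`)

Returns line-level records (text, bounding box, and word count) that fall within the rectangle. Use it when you need whole lines in reading order rather than individual words, for example an address block, a multi-line total, or a single row of a statement.

The C ABI signature is the reference: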

FfiTextLineList *pdf_document_extract_lines_in_rect(
    PdfDocument *handle,
    int32_t page_index,
    float x, float y, float w, float h,
    int32_t *error_code);

Rust: extract_lines_in_rect(page_index, region) -> Result<Vec<PathContent>> on PdfDocument:

use pdf_oxide::PdfDocument;
use pdf_oxide::geometry::Rect;

let doc = PdfDocument::open("statement.pdf")?;

// Transactions band: skip the header (top 92pt) and footer (bottom 72pt)
let region = Rect::new(36.0, 72.0, 540.0, 628.0);
let lines = doc.extract_lines_in_rect(0, region)?;
for line in &lines {
    println!("{:?}", line.bbox);
}

Python: the fluent region provides lines through extract_text_lines():

from pdf_oxide import PdfDocument

doc = PdfDocument("statement.pdf")

# Same band as the Rust example above
region = doc.within(0, (36, 72, 540, 628))
for line in region.extract_text_lines():
    print(line.text, line.bbox)

Swift: extractLinesInRect(_:x:y:w:h:) returns [TextLine]:

import PdfOxide

let doc = try PdfDocument(path: "statement.pdf")
let lines = try doc.extractLinesInRect(0, x: 36, y: 72, w: 540, h: 628)
for line in lines {
    print(line.text, line.bbox, line.wordCount)
}

C++: extract_lines_in_rect(page, x, y, w, h) returns std::vector<TextLine>:

auto lines = doc.extract_lines_in_rect(0, 36, 72, 540, 628);
for (const auto& line : lines) {
    std::cout << line.text << "\n";
}

Dart: extractLinesInRect(page, x, y, w, h) returns List<TextLine>:

final lines = doc.extractLinesInRect(0, 36, 72, 540, 628);
for (final line in lines) {
    print('${line.text} ${line.bbox}');
}

R — pdf_extract_lines_in_rect(doc, page, x, y, width, height):

lines <- pdf_extract_lines_in_rect(doc, 0, 36, 72, 540, 628)

Julia — extract_lines_in_rect(doc, page, x, y, w, h):

lines = extract_lines_in_rect(doc, 0, 36, 72, 540, 628)
for line in lines
    println(line.text, " ", line.bbox)
end

Zig — extractLinesInRect(allocator, page, x, y, w, h):

const lines = try doc.extractLinesInRect(a, 0, 36, 72, 540, 628);  // freeTextLines

Objective-C: extractLinesInRect:x:y:w:h: returns NSArray<POXTextLine*>:

NSArray<POXTextLine*> *lines = [doc extractLinesInRect:0 x:36 y:72 w:540 h:628 error:&err];

Elixir — extract_lines_in_rect(doc, page, x, y, w, h):

{:ok, lines} = PdfOxide.extract_lines_in_rect(doc, 0, 36, 72, 540, 628)

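Go / C#. The extract_lines_in_rect C entry point exists, but the Go and C# wrappers do not expose it yet. In these languages, extract the lines for the whole page and filter by the returned bounding boxes, or use ExtractWordsInRect (Go) and group the words into lines yourself.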
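
Grouping words into lines yourself comes down to clustering by vertical position. The sketch below shows the idea in Python so it can be ported to Go or C#. The Word class is a stand-in with assumed fields (text, x0, y0, x1, y1), and the 3 pt tolerance is an arbitrary starting value, not a library default.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Stand-in for a word record; the field names are assumptions."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def group_into_lines(words: List[Word], y_tolerance: float = 3.0) -> List[str]:
    """Cluster words whose baselines are within y_tolerance points of each other."""
    lines: List[List[Word]] = []
    # PDF y grows upward, so sorting by descending y0 visits the top of the page first.
    for word in sorted(words, key=lambda w: -w.y0):
        if lines and abs(lines[-1][0].y0 - word.y0) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within a line, order words left to right.
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x0))
            for line in lines]


words = [
    Word("42.00", 500, 600.0, 540, 612.0),
    Word("Coffee", 36, 600.5, 80, 612.5),
    Word("Rent", 36, 580.0, 60, 592.0),
    Word("900.00", 500, 580.0, 545, 592.0),
]
print(group_into_lines(words))   # ['Coffee 42.00', 'Rent 900.00']
```

Each line is compared against its first word, so slightly sloped scans may need a larger tolerance.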

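Extracting tables in a region (`extract_tables_in_rect`)

Scopes table detection to a single rectangle: only tables whose bounding boxes intersect that rectangle are returned. It is the geometry-aware counterpart of the fluent within(...).extract_tables() shown above.

C ABI signature: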

FfiTableList *pdf_document_extract_tables_in_rect(
    PdfDocument *handle,
    int32_t page_index,
    float x, float y, float w, float h,
    int32_t *error_code);

Rust: extract_tables_in_rect(page_index, region) -> Result<Vec<Table>> (the ..._with_config variant takes a custom TableDetectionConfig):

use pdf_oxide::PdfDocument;
use pdf_oxide::geometry::Rect;

let doc = PdfDocument::open("invoice.pdf")?;
let region = Rect::new(50.0, 200.0, 500.0, 400.0);
let tables = doc.extract_tables_in_rect(0, region)?;
for table in &tables {
    println!("{} rows × {} cols", table.rows.len(), table.col_count);
}

Python: through the fluent region:

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")
tables = doc.within(0, (50, 200, 500, 400)).extract_tables()
for t in tables:
    for row in t["rows"]:
        print([c["text"] for c in row["cells"]])

Swift: extractTablesInRect(_:x:y:w:h:) returns [Table]:

let tables = try doc.extractTablesInRect(0, x: 50, y: 200, w: 500, h: 400)
for table in tables {
    print("\(table.rowCount) rows, header: \(table.hasHeader)")
}

C++: extract_tables_in_rect(page, x, y, w, h) returns std::vector<Table>:

auto tables = doc.extract_tables_in_rect(0, 50, 200, 500, 400);
for (const auto& table : tables) {
    std::cout << table.rows.size() << " rows\n";
}

Dart: extractTablesInRect(page, x, y, w, h) returns List<Table>:

final tables = doc.extractTablesInRect(0, 50, 200, 500, 400);
for (final table in tables) {
    print('${table.rows.length} rows');
}

R — pdf_extract_tables_in_rect(doc, page, x, y, width, height):

tables <- pdf_extract_tables_in_rect(doc, 0, 50, 200, 500, 400)

Julia — extract_tables_in_rect(doc, page, x, y, w, h):

tables = extract_tables_in_rect(doc, 0, 50, 200, 500, 400)

Zig — extractTablesInRect(allocator, page, x, y, w, h):

const tables = try doc.extractTablesInRect(a, 0, 50, 200, 500, 400);

Objective-C: extractTablesInRect:x:y:w:h: returns NSArray<POXTable*>:

NSArray<POXTable*> *tables = [doc extractTablesInRect:0 x:50 y:200 w:500 h:400 error:&err];

Elixir — extract_tables_in_rect(doc, page, x, y, w, h):

{:ok, tables} = PdfOxide.extract_tables_in_rect(doc, 0, 50, 200, 500, 400)

Go / C#. As with lines, the extract_tables_in_rect C entry point exists but is not yet wrapped in Go or C#. Call ExtractTables(page) on the whole page, then keep only the tables whose bounding boxes fall inside the rectangle you specified.
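
The filtering step is plain rectangle arithmetic. The sketch below shows both tests, "intersects" and "fully inside", in Python so it can be ported to Go or C#. It assumes each item carries a bbox as (x, y, w, h); the real record shape in those bindings may differ.

```python
def intersects(a, b):
    """True when (x, y, w, h) rectangles a and b overlap with non-zero area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def contains(outer, inner):
    """True when inner lies entirely within outer."""
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return ix >= ox and iy >= oy and ix + iw <= ox + ow and iy + ih <= oy + oh


def filter_by_rect(items, rect, fully_inside=False):
    """Keep items whose 'bbox' intersects rect, or lies inside it if fully_inside."""
    test = contains if fully_inside else intersects
    return [item for item in items if test(rect, item["bbox"])]


# Hypothetical full-page results; only the bbox matters for filtering.
tables = [
    {"id": "a", "bbox": (60, 250, 400, 200)},   # inside the region
    {"id": "b", "bbox": (60, 650, 400, 100)},   # above the region
    {"id": "c", "bbox": (40, 550, 300, 100)},   # straddles the region edge
]
region = (50, 200, 500, 400)

print([t["id"] for t in filter_by_rect(tables, region)])                     # ['a', 'c']
print([t["id"] for t in filter_by_rect(tables, region, fully_inside=True)])  # ['a']
```

Use the intersecting test to match the behaviour described for extract_tables_in_rect, and the stricter containment test when partially visible tables should be dropped.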

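How do I extract a page automatically without choosing between text and OCR?

When you don't know whether a page is digital text, a scan, or a mix, extract_page_auto routes it automatically. It runs the AutoExtractor, which routes each region to native text or OCR and falls back gracefully to native extraction rather than raising an opaque OCR error, and returns a JSON PageExtraction: the page kind, the text assembled in reading order, a confidence, a typed reason, an ocr_used flag, and a regions[] array in which every region carries bbox, kind, text, confidence, source, and reason. A region keeps its bbox and reason even when its text is empty, so reading order is never silently broken.

{} is accepted: pass empty or null options JSON to use the defaults, or supply an AutoExtractOptions object. The recognized fields (serialized in snake_case) are: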

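| Field | Type | Default | Meaning |
|---|---|---|---|
| `mode` | `"text_only"` \| `"auto"` \| `"force_ocr"` | `"auto"` | Text vs. OCR routing strategy |
| `reconstruct_image_tables` | bool | `true` | Reconstruct image-only tables with the spatial detector over OCR spans |
| `emit_placeholders` | bool | `true` | Insert positioned Figure/Table placeholders into the text flow |
| `ocr_languages` | string[] | `[]` | OCR language hints (e.g. `["english","chinese"]`) |
| `min_text_confidence` | float \| null | `null` | Confidence threshold for the auto decision |
| `table_confidence` | float \| null | `null` | Threshold for image-table reconstruction |
| `force_ocr_pages` | int[] | `[]` | 0-based page indexes on which to force OCR |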
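
Because the options travel as a JSON string, a typo in mode or a negative page index is only noticed at call time. The sketch below builds the string from the fields listed above and rejects obviously invalid values first. build_auto_options is a hypothetical convenience wrapper, not part of PDF Oxide, and it covers only a subset of the fields.

```python
import json

VALID_MODES = {"text_only", "auto", "force_ocr"}


def build_auto_options(mode="auto", ocr_languages=(), force_ocr_pages=(),
                       min_text_confidence=None):
    """Return an options JSON string using the snake_case field names."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if any(page < 0 for page in force_ocr_pages):
        raise ValueError("force_ocr_pages must be 0-based page indexes")
    options = {
        "mode": mode,
        "ocr_languages": list(ocr_languages),
        "force_ocr_pages": list(force_ocr_pages),
    }
    # Leave the threshold out entirely so the library default (null) applies.
    if min_text_confidence is not None:
        options["min_text_confidence"] = float(min_text_confidence)
    return json.dumps(options)


opts = build_auto_options(mode="force_ocr", ocr_languages=["english"])
print(opts)
```

The resulting string can be passed wherever a binding accepts options JSON, for example as the second argument of extract_page_auto in Python.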

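OCR feature gate. OCR actually runs only when the library is built with the ocr feature. Otherwise extract_page_auto falls back to the native text layer without raising an error. The auto entry point is available in Python, Go, C#, Swift, WASM, and the C ABI. In Rust it is the library-level AutoExtractor API rather than a single PdfDocument method; see below.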

Python — extract_page_auto(page, options_json=None) -> str (JSON):

import json
from pdf_oxide import PdfDocument

doc = PdfDocument("mixed-scan.pdf")

# Defaults (balanced preset)
page = json.loads(doc.extract_page_auto(0))
print(page["kind"], page["confidence"], page["ocr_used"])
for region in page["regions"]:
    print(region["kind"], region["bbox"], region["reason"])

# With options
opts = json.dumps({"mode": "auto", "reconstruct_image_tables": True,
                   "ocr_languages": ["english"]})
page = json.loads(doc.extract_page_auto(0, opts))
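
Since every region keeps its bbox and reason even when its text is empty, the envelope can be scanned for areas that deserve a second look. The following is a small sketch of that idea. The sample payload is hypothetical: the field names follow the description above, but the values (including the kind, source, and reason strings and the bbox layout) are made up for illustration.

```python
import json


def regions_needing_review(page, min_confidence=0.8):
    """Return (bbox, reason) pairs for regions that are empty or low-confidence."""
    flagged = []
    for region in page.get("regions", []):
        confidence = region.get("confidence")
        empty = not (region.get("text") or "").strip()
        unsure = confidence is None or confidence < min_confidence
        if empty or unsure:
            flagged.append((region.get("bbox"), region.get("reason")))
    return flagged


# Hypothetical payload; the real strings and bbox layout may differ.
SAMPLE = json.loads("""
{
  "kind": "mixed", "confidence": 0.9, "ocr_used": true,
  "regions": [
    {"bbox": [36, 700, 540, 60], "kind": "text", "text": "Invoice 1042",
     "confidence": 0.99, "source": "native", "reason": "native_text"},
    {"bbox": [36, 300, 540, 380], "kind": "image", "text": "",
     "confidence": 0.4, "source": "ocr", "reason": "ocr_empty"}
  ]
}
""")

for bbox, reason in regions_needing_review(SAMPLE):
    print(bbox, reason)
```

In a real pipeline, replace SAMPLE with json.loads(doc.extract_page_auto(page_index)).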

Go: ExtractPageAuto(pageIndex, opts ...AutoOption) (string, error) (returns JSON; configured with functional options):

package main

import (
    "encoding/json"
    "fmt"
    "log"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("mixed-scan.pdf")
    if err != nil { log.Fatal(err) }
    defer doc.Close()

    raw, err := doc.ExtractPageAuto(0)
    if err != nil { log.Fatal(err) }

    var page map[string]any
    if err := json.Unmarshal([]byte(raw), &page); err != nil { log.Fatal(err) }
    fmt.Println(page["kind"], page["confidence"], page["ocr_used"])
}

C# — ExtractPageAuto(int pageIndex, string? optionsJson = null) -> string (JSON):

using System.Text.Json;
using PdfOxide.Core;

using var doc = PdfDocument.Open("mixed-scan.pdf");

// Defaults
string json = doc.ExtractPageAuto(0);
using var page = JsonDocument.Parse(json);
Console.WriteLine(page.RootElement.GetProperty("kind"));

// With options
string opts = """{"mode":"auto","ocr_languages":["english"]}""";
string json2 = doc.ExtractPageAuto(0, opts);

Swift: extractPageAuto(_:optionsJson:) -> String (default "{}"):

let json = try doc.extractPageAuto(0, optionsJson: "{}")

JavaScript (WASM) — extractPageAuto(pageIndex, optionsJson?):

import { WasmPdfDocument } from "pdf-oxide-wasm";

const doc = new WasmPdfDocument(bytes);
const page = JSON.parse(doc.extractPageAuto(0));
console.log(page.kind, page.confidence, page.ocr_used);
doc.free();

Rust: the auto path is the AutoExtractor library API. Configure AutoExtractOptions (the fast(), balanced(), or high_fidelity() presets, or the fluent builder) and call extract_page to get a typed PageExtraction back (no JSON round trip):

use pdf_oxide::PdfDocument;
use pdf_oxide::extractors::auto::{AutoExtractor, AutoExtractOptions, ExtractMode};

let doc = PdfDocument::open("mixed-scan.pdf")?;

// Default (balanced) preset
let page = AutoExtractor::new().extract_page(&doc, 0)?;
println!("{:?} conf={} ocr={}", page.kind, page.confidence, page.ocr_used);

// Custom options via the builder
let opts = AutoExtractOptions::builder()
    .mode(ExtractMode::Auto)
    .reconstruct_image_tables(true)
    .ocr_languages(["english"])
    .build();
let page = AutoExtractor::with(opts).extract_page(&doc, 0)?;
for region in &page.regions {
    println!("{:?} {:?} {:?}", region.kind, region.bbox, region.reason);
}

C++: extract_page_auto(page, options_json = "") returns the JSON envelope:

#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("mixed-scan.pdf");
auto json = doc.extract_page_auto(0);                                    // defaults
auto json2 = doc.extract_page_auto(0, R"({"mode":"auto","ocr_languages":["english"]})");

Dart: extractPageAuto(page, [optionsJson]) returns the JSON envelope:

import 'dart:convert';
import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('mixed-scan.pdf');
final page = jsonDecode(doc.extractPageAuto(0));
print('${page["kind"]} ${page["confidence"]} ${page["ocr_used"]}');
doc.close();

R: pdf_extract_page_auto(doc, page, options_json = NULL) returns JSON:

library(jsonlite)

doc  <- pdf_open("mixed-scan.pdf")
page <- fromJSON(pdf_extract_page_auto(doc, 0))
cat(page$kind, page$confidence, page$ocr_used, "\n")

Julia: extract_page_auto(doc, page, options = "{}") returns JSON:

using PdfOxide, JSON

doc  = open_document("mixed-scan.pdf")
page = JSON.parse(extract_page_auto(doc, 0))
println(page["kind"], " ", page["confidence"], " ", page["ocr_used"])

Zig: extractPageAuto(allocator, page, options_json) returns the JSON bytes:

const json = try doc.extractPageAuto(a, 0, null);  // free json

Objective-C: extractPageAuto:optionsJson:error: returns the JSON envelope:

NSString *json = [doc extractPageAuto:0 optionsJson:@"{}" error:&err];

Elixir: extract_page_auto(doc, page, options_json \\ "") returns JSON:

{:ok, json} = PdfOxide.extract_page_auto(doc, 0)
page = Jason.decode!(json)
IO.inspect({page["kind"], page["confidence"], page["ocr_used"]})

Java: the auto path is the AutoExtractor API (extractPage returns a typed result; use extractTextForPage for plain text):

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.AutoExtractor;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("mixed-scan.pdf"))) {
    AutoExtractor ax = AutoExtractor.of(doc);             // or .fast/.balanced/.highFidelity
    String text = ax.extractTextForPage(0);               // graceful native/OCR routing
    System.out.println(text);
}

Kotlin

import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.AutoExtractor

PdfDocument.open(java.nio.file.Path.of("mixed-scan.pdf")).use { doc ->
    val ax = AutoExtractor.of(doc)
    println(ax.extractTextForPage(0))
}

Scala

import fyi.oxide.pdf.{PdfDocument, AutoExtractor}
import scala.util.Using

Using.resource(PdfDocument.open("mixed-scan.pdf")) { doc =>
  val ax = AutoExtractor.of(doc)
  println(ax.extractTextForPage(0))
}

PHP: the rich JSON envelope is available through AutoExtractor::extractPageJson:

use PdfOxide\PdfDocument;
use PdfOxide\AutoExtractor;

$doc = PdfDocument::open('mixed-scan.pdf');
$ax  = AutoExtractor::balanced($doc);
$page = json_decode($ax->extractPageJson(0), true);
echo $page['kind'], ' ', $page['confidence'], ' ', $page['ocr_used'];

Ruby: auto_extractor.extract_page(page) returns a Hash with the parsed envelope merged in:

require 'pdf_oxide'

PdfOxide::PdfDocument.open('mixed-scan.pdf') do |doc|
  result = doc.auto_extractor.extract_page(0)
  cls = result[:classification]            # full PageExtraction JSON as a Hash
  puts [cls['kind'], cls['confidence'], cls['ocr_used']].join(' ')
end

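How do I get structured, typed regions as JSON?

To get a structured view of the whole page (headings, body blocks, headers and footers, page numbers, column order), use the structured extraction entry point. It returns a StructuredPage: page_index, page_width, page_height, and a regions[] array in which each region carries kind (its semantic role), text, bbox, spans, and column_index (multi-column reading order). Region kinds include body blocks, structural headings (H1–H6), margin labels, running headers and footers, page numbers, and artifacts.

Most bindings return this as a JSON string (the C ABI serializes once and the binding deserializes into native types). Rust returns the typed StructuredPage directly.

C ABI signature: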

char *pdf_document_extract_structured_to_json(
    PdfDocument *handle,
    int32_t page_index,
    int32_t *error_code);

Python: extract_structured(page) -> str (JSON; deserialize with json.loads):

import json
from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
page = json.loads(doc.extract_structured(0))

print(page["page_width"], page["page_height"])
for region in page["regions"]:
    print(region["kind"], region["column_index"], region["text"][:60])
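
On multi-column pages, column_index makes it possible to reassemble each column separately. The sketch below groups region text by column while keeping the order of the regions[] array. The sample page is hypothetical: the key names follow the description above, but the kind strings and the use of None for full-width regions are assumptions.

```python
from collections import defaultdict


def text_by_column(page):
    """Group non-empty region text by column_index, preserving array order."""
    columns = defaultdict(list)
    for region in page.get("regions", []):
        text = (region.get("text") or "").strip()
        if text:
            columns[region.get("column_index")].append(text)
    return dict(columns)


# Hypothetical StructuredPage; values are made up for illustration.
SAMPLE_PAGE = {
    "page_index": 0, "page_width": 612, "page_height": 792,
    "regions": [
        {"kind": "heading", "text": "Results", "column_index": None},
        {"kind": "body", "text": "Left column, first block.", "column_index": 0},
        {"kind": "body", "text": "Right column, first block.", "column_index": 1},
        {"kind": "body", "text": "Left column, second block.", "column_index": 0},
    ],
}

for column, blocks in text_by_column(SAMPLE_PAGE).items():
    print(column, blocks)
```

In real use, pass json.loads(doc.extract_structured(page_index)) instead of SAMPLE_PAGE.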

Go — ExtractStructured(page) (string, error):

raw, err := doc.ExtractStructured(0)
if err != nil { log.Fatal(err) }

var page map[string]any
if err := json.Unmarshal([]byte(raw), &page); err != nil { log.Fatal(err) }
for _, r := range page["regions"].([]any) {
    region := r.(map[string]any)
    fmt.Println(region["kind"], region["text"])
}

C# — ExtractStructured(int page) -> string:

using System.Text.Json;

string json = doc.ExtractStructured(0);
using var page = JsonDocument.Parse(json);
foreach (var region in page.RootElement.GetProperty("regions").EnumerateArray())
{
    Console.WriteLine(region.GetProperty("kind"));
}

Swift — extractStructuredJson(_:) -> String:

let json = try doc.extractStructuredJson(0)

JavaScript (WASM): extractStructured(pageIndex) (returns a JSON string with camelCase keys):

const page = JSON.parse(doc.extractStructured(0));
for (const region of page.regions) {
    console.log(region.kind, region.columnIndex);
}

Rust: extract_structured(page_index) -> Result<StructuredPage> returns typed regions directly (no JSON round trip). The extract_structured_with_column_mode variant lets you force ColumnMode::Two or ColumnMode::Single on tricky layouts:

use pdf_oxide::PdfDocument;

let doc = PdfDocument::open("report.pdf")?;
let page = doc.extract_structured(0)?;
for region in &page.regions {
    println!("{:?} col={:?}: {}", region.kind, region.column_index, region.text);
}

C++: extract_structured_json(page) returns a JSON string:

auto json = doc.extract_structured_json(0);

Dart: extractStructuredJson(page) returns a JSON string:

import 'dart:convert';

final page = jsonDecode(doc.extractStructuredJson(0));
for (final region in page['regions']) {
    print('${region["kind"]} ${region["column_index"]}');
}

R: pdf_extract_structured_json(doc, page) returns JSON:

library(jsonlite)

page <- fromJSON(pdf_extract_structured_json(doc, 0))
print(page$page_width)

Julia: extract_structured_json(doc, page) returns JSON:

using JSON
page = JSON.parse(extract_structured_json(doc, 0))
for region in page["regions"]
    println(region["kind"], " ", region["column_index"])
end

Zig: extractStructuredJson(allocator, page) returns the JSON bytes:

const json = try doc.extractStructuredJson(a, 0);  // free json

Objective-C: extractStructuredJson:error: returns a JSON string:

NSString *json = [doc extractStructuredJson:0 error:&err];

Elixir: extract_structured_json(doc, page) returns JSON:

{:ok, json} = PdfOxide.extract_structured_json(doc, 0)
page = Jason.decode!(json)

Java: extractStructured(page) returns a JSON string:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

String json = doc.extractStructured(0);
JsonNode page = new ObjectMapper().readTree(json);
for (JsonNode region : page.get("regions")) {
    System.out.println(region.get("kind").asText());
}

Kotlin

val json = doc.extractStructured(0)   // JSON string; parse with your library of choice

Scala

val json = doc.extractStructured(0)   // JSON string

Clojure: (pdf/extract-structured doc page) returns a JSON string:

(require '[clojure.data.json :as json])

(with-open [doc (pdf/open "report.pdf")]
  (let [page (json/read-str (pdf/extract-structured doc 0))]
    (doseq [region (get page "regions")]
      (println (get region "kind") (get region "column_index")))))

Ruby: extract_structured(page) returns the parsed StructuredPage as a Hash:

PdfOxide::PdfDocument.open('report.pdf') do |doc|
  page = doc.extract_structured(0)
  page['regions'].each { |r| puts "#{r['kind']} #{r['column_index']}" }
end

PHP: extractStructured($page) returns a deserialized associative array:

$doc = PdfOxide\PdfDocument::open('report.pdf');
$page = $doc->extractStructured(0);
foreach ($page['regions'] as $region) {
    echo $region['kind'], ' ', $region['column_index'], "\n";
}

Coordinate notes

PDF uses a bottom-left origin and measures in points (1 pt = 1/72 inch). A Letter-size page is (0, 0, 612, 792). To select the top one-inch band:

(x, y, w, h) = (0, 792 - 72, 612, 72)
             = (0, 720,      612, 72)

If your coordinates come from an image coordinate system (top-left origin), flip y accordingly.
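
The flip can be sketched as a small helper. image_rect_to_pdf is an illustrative name, not a library function, and the optional scale factor (points per pixel, for example 72 / dpi for a rendered page image) is an assumption about where the image coordinates came from.

```python
def image_rect_to_pdf(rect, page_height, scale=1.0):
    """Convert a top-left-origin (x, y_top, w, h) rect to a PDF (x, y, w, h) rect.

    scale converts image units to PDF points, e.g. 72 / dpi for a page render.
    """
    x, y_top, w, h = (value * scale for value in rect)
    # In PDF space y is the bottom edge of the box, measured from the page bottom.
    return (x, page_height - y_top - h, w, h)


# Top one-inch band of a Letter page, given in image coordinates at 72 dpi.
print(image_rect_to_pdf((0, 0, 612, 72), page_height=792))
# (0.0, 720.0, 612.0, 72.0)
```

The result matches the (0, 720, 612, 72) band computed above. For a 144 dpi render, pass scale=0.5 and the pixel rectangle (0, 0, 1224, 144) to get the same band.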

To get the page's actual MediaBox before doing the arithmetic:

Python

doc = PdfDocument("doc.pdf")
mb = doc.page_media_box(0)       # (llx, lly, urx, ury)

Rust

let mb = editor.get_page_media_box(0)?;   // [f32; 4]

Java: page.mediaBox() returns a BBox(x0, y0, x1, y1):

import fyi.oxide.pdf.geometry.BBox;

BBox mb = doc.page(0).mediaBox();         // (x0, y0, x1, y1) in PDF user space
double w = mb.width(), h = mb.height();   // 612 × 792 for US Letter

Kotlin

val mb = doc.page(0).mediaBox()           // BBox(x0, y0, x1, y1)

Scala

val mb = doc.page(0).mediaBox             // BBox(x0, y0, x1, y1)

C++: through the editor, get_page_media_box(page):

auto editor = pdf_oxide::DocumentEditor::open("doc.pdf");
auto mb = editor.get_page_media_box(0);   // Bbox{x, y, width, height}

Swift

let editor = try DocumentEditor.open("doc.pdf")
let mb = try editor.getPageMediaBox(0)    // Bbox(x, y, width, height)

Dart

final editor = DocumentEditor.open('doc.pdf');
final mb = editor.getPageMediaBox(0);     // Bbox(x, y, width, height)

R

editor <- pdf_editor_open("doc.pdf")
mb <- pdf_editor_get_page_media_box(editor, 0)   # list(x=, y=, width=, height=)

Julia

editor = open_editor("doc.pdf")
mb = get_page_media_box(editor, 0)        # Bbox

Zig

var editor = try pdf_oxide.DocumentEditor.openEditor("doc.pdf");
const mb = try editor.getPageMediaBox(0);  // Bbox{ x, y, width, height }

Objective-C

POXDocumentEditor *editor = [POXDocumentEditor openEditor:@"doc.pdf" error:&err];
POXBbox mb = [editor pageMediaBox:0 error:&err];   // {x, y, width, height}

Elixir

{:ok, editor} = PdfOxide.open_editor("doc.pdf")
{:ok, mb} = PdfOxide.get_page_media_box(editor, 0)   # %Bbox{}

Go / C#: in-rect helpers

Go and C# do not yet provide the fluent within() chain, but the underlying low-level methods are the same:

| Method | Go | C# |
|---|---|---|
| Text in rect | `doc.ExtractTextInRect(page, x, y, w, h)` | `doc.ExtractTextInRect(page, x, y, w, h)` |
| Words in rect | `doc.ExtractWordsInRect(page, x, y, w, h)` | (not yet wrapped) |
| Images in rect | `doc.ExtractImagesInRect(page, x, y, w, h)` | (not yet wrapped) |

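For patterns in Go or C# that need several extraction types over the same rectangle, keep the rectangle in a variable and call the helpers one after another. A fluent interface will be added once the editor API stabilizes.

Frequently asked questions

What is the difference between extract_words() and extract_lines_in_rect() on a region? extract_words() returns one record per word. extract_lines_in_rect() returns one record (text, bounding box, word count) per line whose bounding box intersects the rectangle. Use line extraction when you need whole rows in reading order without grouping words yourself, such as address blocks, statement rows, or multi-line totals.

Does extract_page_auto always run OCR? No. It routes region by region. In the default "auto" mode it escalates to OCR only when the native text layer is missing or suspect, and OCR actually runs only if the library was built with the ocr feature. Without that feature it falls back to the native text layer and raises no opaque OCR error.

Which bindings provide the lines-in-rect and tables-in-rect variants? Rust, the C ABI, and Swift provide extract_lines_in_rect / extract_tables_in_rect directly. Python gets the same result through the fluent region (within(...).extract_text_lines() and within(...).extract_tables()). Go and C# do not yet wrap the in-rect line/table entry points; extract the whole page and filter by the returned bounding boxes.

How fast is scoped extraction? Scoping adds no measurable overhead over full-page extraction: PDF Oxide extracts in 0.8 ms on average on the benchmark corpus (100% pass rate), and the in-rect call only filters that result by bounding box.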
