What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

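Text Extraction

PDF Oxide provides text extraction at several levels of granularity: full-page text, styled spans with font metadata, and individual characters with precise positions. Use extract_text() for quick content retrieval, extract_spans() when you need font and position data, and extract_chars() for character-level analysis such as custom layout engines or OCR post-processing.

For tagged PDFs, text extraction automatically follows the document's structure tree to produce the correct reading order. For untagged PDFs, it falls back to line-break detection based on page content order. This includes a single-column guard that keeps body text from being fragmented in RFC-style and paper-style documents.

Reading Order Support

The reading-order pipeline produces accurate results across a range of scripts and layouts:

Latin: default left-to-right, top-to-bottom order with column detection.
Arabic: pre-shaped span reversal (Pass 0) places characters in logical reading order instead of visual order.
CJK: the spatial table detector preserves rowspan label columns, and 3 pt Y-axis quantization keeps table content from interleaving with body text.
Rotated / dvips-generated PDFs: median-based outlier rejection in column detection handles degenerate CTM coordinates.
Multi-column academic papers: the XYCut single-column guard fixes fragmentation, and row-aware span sorting handles table content embedded in body text.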
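
The 3 pt Y-axis quantization mentioned for CJK tables can be pictured as snapping each span's vertical position to a fixed grid before sorting, so that spans on the same visual row stay together even when their baselines differ slightly. The following is a minimal sketch of that idea, not PDF Oxide's implementation. The `Span` tuple, `quantize_y`, and `row_sorted` are hypothetical names, and the sample coordinates are made up.

```python
import math
from typing import NamedTuple

class Span(NamedTuple):
    text: str
    x: float  # left edge, in points
    y: float  # vertical position, in points (smaller sorts first here)

def quantize_y(y: float, step: float = 3.0) -> int:
    """Return the index of the `step`-point bucket that contains y."""
    return math.floor(y / step)

def row_sorted(spans: list[Span], step: float = 3.0) -> list[Span]:
    """Sort by quantized row first, then left to right within the row."""
    return sorted(spans, key=lambda s: (quantize_y(s.y, step), s.x))

# Two table rows whose cells sit on slightly different baselines.
spans = [
    Span("B1", 120.0, 100.4),
    Span("A1", 40.0, 101.1),
    Span("A2", 40.0, 115.0),
    Span("B2", 120.0, 114.2),
]

naive = [s.text for s in sorted(spans, key=lambda s: (s.y, s.x))]
ordered = [s.text for s in row_sorted(spans)]
print(naive)    # raw Y sort puts B1 before A1
print(ordered)  # quantized sort restores left-to-right order per row
```

A fixed grid has a known weakness: two values that straddle a bucket boundary still land in different rows, which is one reason a real pipeline would combine this with other signals.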

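Word and Line Segmentation

extract_words() and extract_text_lines() accept optional keyword arguments that tune the word-break and line-break thresholds:

Parameter	Default	Description
`word_gap_threshold`	adaptive	Minimum horizontal gap between adjacent characters, in points, that is treated as a word break
`line_gap_threshold`	adaptive	Minimum vertical gap between baselines that is treated as a line break
`profile`	`"auto"`	One of `"auto"`, `"dense"`, `"standard"`, `"sparse"`; selects a preset suited to different layouts

Adaptive parameters are derived from the page's font metrics. Use page_layout_params() to inspect the computed values, and ExtractionProfile to build a custom profile.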
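
To make the role of word_gap_threshold concrete, here is a minimal sketch of gap-based word segmentation. It illustrates the concept under stated assumptions and is not the library's algorithm: the `Glyph` tuple and `split_words` helper are hypothetical, the glyphs are assumed to arrive left to right on a single line, and the threshold is a fixed number instead of an adaptive one.

```python
from typing import NamedTuple

class Glyph(NamedTuple):
    char: str
    x: float      # left edge, in points
    width: float  # advance width, in points

def split_words(glyphs: list[Glyph], word_gap_threshold: float) -> list[str]:
    """Start a new word whenever the gap to the previous glyph reaches the threshold."""
    words: list[str] = []
    current = ""
    prev_right = None
    for g in glyphs:
        if prev_right is not None and g.x - prev_right >= word_gap_threshold:
            words.append(current)
            current = ""
        current += g.char
        prev_right = g.x + g.width
    if current:
        words.append(current)
    return words

# "PDF" followed by "ok", separated by a gap of about 3 pt.
line = [
    Glyph("P", 10.0, 6.0), Glyph("D", 16.2, 6.0), Glyph("F", 22.4, 6.0),
    Glyph("o", 31.4, 5.0), Glyph("k", 36.5, 5.0),
]

tight = split_words(line, word_gap_threshold=2.5)
loose = split_words(line, word_gap_threshold=5.0)
print(tight)
print(loose)
```

With the smaller threshold the 3 pt gap counts as a word break; with the larger one it does not, and the glyphs merge into a single token. This is the trade-off the presets are meant to manage for dense versus sparse layouts.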

Python-only tuning: word_gap_threshold, line_gap_threshold, profile, and page_layout_params() are exposed by the Python bindings. The Node.js, JavaScript, Go, C#, and WASM bindings provide extractWords(pageIndex) / extractTextLines(pageIndex), which take no keyword arguments and use the adaptive defaults. If you need tuning from one of those languages, use the Rust API shown below.

Python

from pdf_oxide import PdfDocument, ExtractionProfile

doc = PdfDocument("receipt.pdf")

params = doc.page_layout_params(0)
print(params.word_gap_threshold, params.line_gap_threshold)

words = doc.extract_words(0, word_gap_threshold=2.5, profile="dense")
lines = doc.extract_text_lines(0, profile=ExtractionProfile.DENSE)

Node.js

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("receipt.pdf");
const words = doc.extractWords(0);      // adaptive defaults
const lines = doc.extractTextLines(0);
doc.close();

JavaScript

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("receipt.pdf");
const words = doc.extractWords(0);
const lines = doc.extractTextLines(0);
doc.close();

TypeScript

import { PdfDocument } from "pdf-oxide";

const doc: PdfDocument = new PdfDocument("receipt.pdf");
const words = doc.extractWords(0);
const lines = doc.extractTextLines(0);
doc.close();

Rust

use pdf_oxide::{PdfDocument, ExtractionProfile};

let mut doc = PdfDocument::open("receipt.pdf")?;

let params = doc.page_layout_params(0)?;
println!("{} {}", params.word_gap_threshold, params.line_gap_threshold);

let words = doc.extract_words_with_config(0, /* word_gap_threshold */ Some(2.5), ExtractionProfile::Dense)?;
let lines = doc.extract_text_lines_with_profile(0, ExtractionProfile::Dense)?;

Go

words, _ := doc.ExtractWords(0)     // adaptive defaults
lines, _ := doc.ExtractTextLines(0)

C#

var words = doc.ExtractWords(0);     // adaptive defaults
var lines = doc.ExtractTextLines(0);

WASM

const doc = new WasmPdfDocument(bytes);
const words = doc.extractWords(0);
const lines = doc.extractTextLines(0);

Java

try (PdfDocument doc = PdfDocument.open(Path.of("receipt.pdf"))) {
    List<TextWord> words = doc.page(0).words();      // adaptive defaults
    List<TextLine> lines = doc.page(0).lines();
}

C++

auto doc = pdf_oxide::Document::open("receipt.pdf");
auto words = doc.extract_words(0);   // adaptive defaults
auto lines = doc.extract_text_lines(0);

Swift

let doc = try Document.open("receipt.pdf")
let words = try doc.extractWords(0)   // adaptive defaults
let lines = try doc.extractTextLines(0)

Kotlin

PdfDocument.open(Path.of("receipt.pdf")).use { doc ->
    val words = doc.page(0).words()   // adaptive defaults
    val lines = doc.page(0).lines()
}

Dart

final doc = PdfDocument.open('receipt.pdf');
final words = doc.extractWords(0);   // adaptive defaults
final lines = doc.extractTextLines(0);

R

doc <- pdf_open("receipt.pdf")
words <- pdf_extract_words(doc, 0)        # adaptive defaults
lines <- pdf_extract_text_lines(doc, 0)

Julia

doc = open_document("receipt.pdf")
words = extract_words(doc, 0)    # adaptive defaults
lines = extract_text_lines(doc, 0)

Zig

var doc = try pdf_oxide.Document.open("receipt.pdf");
const words = try doc.extractWords(a, 0);    // adaptive defaults
const lines = try doc.extractTextLines(a, 0);

Scala

Using.resource(PdfDocument.open("receipt.pdf")) { doc =>
  val words = doc.page(0).wordsSeq   // adaptive defaults
  val lines = doc.page(0).linesSeq
}

Clojure

(with-open [doc (pdf/open "receipt.pdf")]
  (pdf/words (pdf/page doc 0))   ; adaptive defaults
  (pdf/lines (pdf/page doc 0)))

Objective-C

POXDocument *doc = [POXDocument openPath:@"receipt.pdf" error:&err];
NSArray<POXWord*> *words = [doc extractWords:0 error:&err];        // adaptive defaults
NSArray<POXTextLine*> *lines = [doc extractTextLines:0 error:&err];

Elixir

{:ok, doc} = PdfOxide.open("receipt.pdf")
{:ok, words} = PdfOxide.extract_words(doc, 0)    # adaptive defaults
{:ok, lines} = PdfOxide.extract_text_lines(doc, 0)

Quick Example

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)

Node.js

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("report.pdf");
const text = doc.extractText(0);
console.log(text);

Go

import pdfoxide "github.com/yfedoseev/pdf_oxide/go"

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
text, _ := doc.ExtractText(0)
fmt.Println(text)

C#

using PdfOxide.Core;

using var doc = PdfDocument.Open("report.pdf");
string text = doc.ExtractText(0);
Console.WriteLine(text);

WASM

const doc = new WasmPdfDocument(bytes);
const text = doc.extractText(0);
console.log(text);

Rust

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("report.pdf")?;
let text = doc.extract_text(0)?;
println!("{}", text);

Java

import fyi.oxide.pdf.*;
import java.nio.file.Path;

try (PdfDocument doc = PdfDocument.open(Path.of("report.pdf"))) {
    String text = doc.extractText(0);
    System.out.println(text);
}

PHP

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('report.pdf');
$text = $doc->extractText(0);
echo $text;
$doc->close();

Ruby

require 'pdf_oxide'

PdfOxide::PdfDocument.open('report.pdf') do |doc|
  text = doc.extract_text(0)
  puts text
end

C++

#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("report.pdf");
auto text = doc.extract_text(0);
std::cout << text << "\n";

Swift

import PdfOxide

let doc = try Document.open("report.pdf")
let text = try doc.extractText(0)
print(text)

Kotlin

import fyi.oxide.pdf.*

PdfDocument.open(java.nio.file.Path.of("report.pdf")).use { doc ->
    val text = doc.extractText(0)
    println(text)
}

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('report.pdf');
final text = doc.extractText(0);
print(text);

R

library(pdfoxide)

doc <- pdf_open("report.pdf")
text <- pdf_extract_text(doc, 0)
cat(text)

Julia

using PdfOxide

doc = open_document("report.pdf")
text = extract_text(doc, 0)
println(text)

Zig

const pdf_oxide = @import("pdf_oxide");

var doc = try pdf_oxide.Document.open("report.pdf");
const text = try doc.extractText(a, 0);
std.debug.print("{s}\n", .{text});

Scala

import fyi.oxide.pdf.PdfDocument
import scala.util.Using

Using.resource(PdfDocument.open("report.pdf")) { doc =>
  val text = doc.extractText(0)
  println(text)
}

Clojure

(require '[pdf-oxide.core :as pdf])

(with-open [doc (pdf/open "report.pdf")]
  (println (pdf/extract-text doc 0)))

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
NSString *text = [doc extractText:0 error:&err];
NSLog(@"%@", text);

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")
{:ok, text} = PdfOxide.extract_text(doc, 0)
IO.puts(text)

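API Reference

`extract_text(page_index) -> str`

Extracts all text on a page as a single string. It automatically detects tagged PDFs and, where available, uses the structure tree to determine reading order. Line breaks and spaces are inserted based on the vertical and horizontal gaps between spans.

Parameter	Type	Description
`page_index`	`int` / `usize`	Zero-based page index

Returns: the full text content of the page.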

Python

doc = PdfDocument("report.pdf")
for i in range(doc.page_count()):
    text = doc.extract_text(i)
    print(f"--- Page {i + 1} ---")
    print(text)

Node.js

const doc = new PdfDocument("report.pdf");
for (let i = 0; i < doc.getPageCount(); i++) {
    const text = doc.extractText(i);
    console.log(`--- Page ${i + 1} ---`);
    console.log(text);
}

Go

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
count, _ := doc.PageCount()
for i := 0; i < count; i++ {
    text, _ := doc.ExtractText(i)
    fmt.Printf("--- Page %d ---\n", i+1)
    fmt.Println(text)
}

C#

using var doc = PdfDocument.Open("report.pdf");
for (int i = 0; i < doc.PageCount; i++)
{
    string text = doc.ExtractText(i);
    Console.WriteLine($"--- Page {i + 1} ---");
    Console.WriteLine(text);
}

WASM

const doc = new WasmPdfDocument(bytes);
for (let i = 0; i < doc.pageCount(); i++) {
    const text = doc.extractText(i);
    console.log(`--- Page ${i + 1} ---`);
    console.log(text);
}

Rust

let mut doc = PdfDocument::open("report.pdf")?;
let page_count = doc.page_count()?;
for i in 0..page_count {
    let text = doc.extract_text(i)?;
    println!("--- Page {} ---", i + 1);
    println!("{}", text);
}

Java

try (PdfDocument doc = PdfDocument.open(Path.of("report.pdf"))) {
    int n = doc.pageCount();
    for (int i = 0; i < n; i++) {
        System.out.println("--- Page " + (i + 1) + " ---");
        System.out.println(doc.extractText(i));
    }
}

PHP

$doc = PdfDocument::open('report.pdf');
$n = $doc->pageCount();
for ($i = 0; $i < $n; $i++) {
    echo "--- Page " . ($i + 1) . " ---\n";
    echo $doc->extractText($i);
}
$doc->close();

Ruby

PdfOxide::PdfDocument.open('report.pdf') do |doc|
  (0...doc.page_count).each do |i|
    puts "--- Page #{i + 1} ---"
    puts doc.extract_text(i)
  end
end

C++

auto doc = pdf_oxide::Document::open("report.pdf");
int n = doc.page_count();
for (int i = 0; i < n; i++) {
    std::cout << "--- Page " << (i + 1) << " ---\n";
    std::cout << doc.extract_text(i) << "\n";
}

Swift

let doc = try Document.open("report.pdf")
let n = try doc.pageCount()
for i in 0..<n {
    print("--- Page \(i + 1) ---")
    print(try doc.extractText(i))
}

Kotlin

PdfDocument.open(java.nio.file.Path.of("report.pdf")).use { doc ->
    for (i in 0 until doc.pageCount()) {
        println("--- Page ${i + 1} ---")
        println(doc.extractText(i))
    }
}

Dart

final doc = PdfDocument.open('report.pdf');
for (var i = 0; i < doc.pageCount; i++) {
    print('--- Page ${i + 1} ---');
    print(doc.extractText(i));
}

R

doc <- pdf_open("report.pdf")
for (i in seq_len(pdf_page_count(doc)) - 1) {
    cat(sprintf("--- Page %d ---\n", i + 1))
    cat(pdf_extract_text(doc, i))
}

Julia

doc = open_document("report.pdf")
for i in 0:(page_count(doc) - 1)
    println("--- Page $(i + 1) ---")
    println(extract_text(doc, i))
end

Zig

var doc = try pdf_oxide.Document.open("report.pdf");
const n = try doc.pageCount();
var i: usize = 0;
while (i < n) : (i += 1) {
    std.debug.print("--- Page {d} ---\n", .{i + 1});
    const text = try doc.extractText(a, i);
    std.debug.print("{s}\n", .{text});
}

Scala

Using.resource(PdfDocument.open("report.pdf")) { doc =>
  for (i <- 0 until doc.pageCount()) {
    println(s"--- Page ${i + 1} ---")
    println(doc.extractText(i))
  }
}

Clojure

(with-open [doc (pdf/open "report.pdf")]
  (doseq [i (range (pdf/page-count doc))]
    (println (str "--- Page " (inc i) " ---"))
    (println (pdf/extract-text doc i))))

Objective-C

POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
NSInteger n = [doc pageCountError:&err];
for (NSInteger i = 0; i < n; i++) {
    NSLog(@"--- Page %ld ---", (long)(i + 1));
    NSLog(@"%@", [doc extractText:i error:&err]);
}

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")
{:ok, n} = PdfOxide.page_count(doc)
Enum.each(0..(n - 1), fn i ->
  IO.puts("--- Page #{i + 1} ---")
  {:ok, text} = PdfOxide.extract_text(doc, i)
  IO.puts(text)
end)

`extract_spans(page_index) -> list[TextSpan]`

Extracts text as spans. A span is a contiguous run of text that shares the same font and style. Each span carries its text content, bounding box, font name, font size, weight, italic flag, and color. This is the recommended approach for most extraction tasks that need layout or font information.

Parameter	Type	Description
`page_index`	`int` / `usize`	Zero-based page index

Returns: a list/vector of TextSpan objects.

TextSpan fields

Field	Type	Description
`text`	`str`	Text content of the span
`bbox`	`Rect`	Bounding box (x, y, width, height)
`font_name`	`str`	Font name/family (e.g. "Helvetica", "TimesNewRoman")
`font_size`	`f32`	Font size in points
`font_weight`	`FontWeight`	Weight: Normal, Bold, Light, SemiBold, etc.
`is_italic`	`bool`	Whether the span is italic
`color`	`Color`	RGB color (r, g, b), values in the range 0.0–1.0
`mcid`	`Option<u32>`	Marked-content ID for tagged PDFs
`sequence`	`usize`	Extraction order (tie-breaker when sorting by Y coordinate)
`is_monospace`	`bool`	Whether the font is monospaced (Courier, Consolas, etc.)
`char_widths`	`list[float]`	Per-glyph advance widths for accurate bounding boxes
`char_spacing`	`f32`	Character spacing (Tc parameter)
`word_spacing`	`f32`	Word spacing (Tw parameter)
`horizontal_scaling`	`f32`	Horizontal scaling percentage (Tz, default 100.0)

Rust

use pdf_oxide::layout::FontWeight;

let mut doc = PdfDocument::open("paper.pdf")?;
let spans = doc.extract_spans(0)?;

for span in &spans {
    println!(
        "'{}' at ({:.1}, {:.1}) font={} size={:.1}pt bold={} italic={}",
        span.text,
        span.bbox.x, span.bbox.y,
        span.font_name,
        span.font_size,
        span.font_weight == FontWeight::Bold,
        span.is_italic,
    );
}
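
Because each span's bbox is expressed as (x, y, width, height), a common follow-up step is combining several span boxes into one enclosing box, for example to get the extent of a whole line. The sketch below shows that arithmetic only. The `Rect` tuple here is a stand-in defined for the example, not the binding's own type, and `union` is a hypothetical helper.

```python
from typing import NamedTuple

class Rect(NamedTuple):
    x: float
    y: float
    width: float
    height: float

def union(rects: list[Rect]) -> Rect:
    """Smallest rectangle that encloses every rectangle in `rects`."""
    x_min = min(r.x for r in rects)
    y_min = min(r.y for r in rects)
    x_max = max(r.x + r.width for r in rects)
    y_max = max(r.y + r.height for r in rects)
    return Rect(x_min, y_min, x_max - x_min, y_max - y_min)

# Two neighboring spans on the same line, the second one slightly taller.
line_box = union([
    Rect(72.0, 100.0, 40.0, 12.0),
    Rect(115.0, 98.0, 60.0, 14.0),
])
print(line_box)
```

The computation works on extents only, so it does not depend on whether the Y axis points up or down.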

`extract_spans_with_config(page_index, config) -> Vec<TextSpan>`

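Extracts spans with a custom span-merging configuration. Use it when the default merging behavior produces incorrect word boundaries for your document.

Parameter	Type	Description
`page_index`	`usize`	Zero-based page index
`config`	`SpanMergingConfig`	Configuration that controls the extraction parameters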

Rust

use pdf_oxide::extractors::SpanMergingConfig;

let mut doc = PdfDocument::open("report.pdf")?;
let config = SpanMergingConfig::adaptive();
let spans = doc.extract_spans_with_config(0, config)?;

`extract_chars(page_index) -> list[TextChar]`

Extracts individual characters with precise bounding boxes, font metadata, and transform properties. This is a low-level API; for most use cases, prefer extract_text() or extract_spans(). Because character extraction skips text grouping and merging, it is 30–50% faster than span extraction.

Parameter	Type	Description
`page_index`	`int` / `usize`	Zero-based page index

Returns: a list/vector of TextChar objects.

TextChar fields

Field	Type	Description
`char`	`char`	The character
`bbox`	`Rect`	Bounding box (x, y, width, height)
`font_name`	`str`	Font name/family
`font_size`	`f32`	Font size in points
`font_weight`	`FontWeight`	Weight (Normal, Bold, etc.)
`is_italic`	`bool`	Italic flag
`color`	`Color`	RGB color (0.0–1.0 per channel)
`mcid`	`Option<u32>`	Marked-content ID
`origin_x`	`f32`	X coordinate of the baseline origin
`origin_y`	`f32`	Y coordinate of the baseline origin
`rotation_degrees`	`f32`	Text rotation angle (0–360, clockwise)
`advance_width`	`f32`	Horizontal distance to the next character position
`matrix`	`[f32; 6]`	Full transformation matrix [a, b, c, d, e, f]
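
The matrix field uses the standard six-element PDF transformation matrix [a, b, c, d, e, f], where (a, b) is the image of the unit x vector, (c, d) is the image of the unit y vector, and (e, f) is the translation. As a worked example of how rotation and scale relate to those entries, the sketch below recovers them with basic trigonometry. The `decompose` helper is hypothetical. It reports the counterclockwise angle in PDF user space, whereas rotation_degrees is documented as clockwise, so the mapping between the two is an assumption to verify against real output.

```python
import math

def decompose(matrix):
    """Split [a, b, c, d, e, f] into (angle_ccw_degrees, scale_x, scale_y, translation)."""
    a, b, c, d, e, f = matrix
    angle = math.degrees(math.atan2(b, a)) % 360.0  # direction of the transformed x axis
    scale_x = math.hypot(a, b)                      # length of the transformed x unit vector
    scale_y = math.hypot(c, d)                      # length of the transformed y unit vector
    return angle, scale_x, scale_y, (e, f)

# 12 pt text rotated 90 degrees counterclockwise and placed at (72, 700):
# [s*cos, s*sin, -s*sin, s*cos, e, f] with s = 12 and an angle of 90 degrees.
angle, scale_x, scale_y, origin = decompose([0.0, 12.0, -12.0, 0.0, 72.0, 700.0])
print(angle, scale_x, scale_y, origin)
```

For unrotated text the matrix reduces to [s, 0, 0, s, e, f], and the same function returns an angle of 0 with both scales equal to s.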

Python

doc = PdfDocument("report.pdf")
chars = doc.extract_chars(0)

for ch in chars:
    print(f"'{ch.char}' at ({ch.bbox[0]:.1f}, {ch.bbox[1]:.1f}) "
          f"font={ch.font_name} size={ch.font_size:.1f}")

Go

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
chars, _ := doc.ExtractChars(0)

for _, ch := range chars {
    fmt.Printf("'%c' at (%.1f, %.1f) font=%s size=%.1f\n",
        ch.Char, ch.X, ch.Y, ch.FontName, ch.FontSize)
}

C#

using var doc = PdfDocument.Open("report.pdf");
var chars = doc.ExtractChars(0);

foreach (var ch in chars)
{
    Console.WriteLine($"'{ch.Char}' at ({ch.X:F1}, {ch.Y:F1}) {ch.W:F1}x{ch.H:F1}");
}

WASM

const doc = new WasmPdfDocument(bytes);
const chars = doc.extractChars(0);

for (const ch of chars) {
    console.log(`'${ch.char}' at (${ch.bbox[0].toFixed(1)}, ${ch.bbox[1].toFixed(1)}) font=${ch.fontName} size=${ch.fontSize.toFixed(1)}`);
}

Rust

let mut doc = PdfDocument::open("report.pdf")?;
let chars = doc.extract_chars(0)?;

for ch in &chars {
    println!(
        "'{}' origin=({:.1}, {:.1}) rotation={:.0} advance={:.1}",
        ch.char, ch.origin_x, ch.origin_y,
        ch.rotation_degrees, ch.advance_width,
    );
}

C++

auto doc = pdf_oxide::Document::open("report.pdf");
auto chars = doc.extract_chars(0);

for (const auto& ch : chars) {
    std::printf("U+%04X at (%.1f, %.1f) font=%s size=%.1f\n",
        ch.character, ch.bbox.x, ch.bbox.y,
        ch.font_name.c_str(), ch.font_size);
}

Swift

let doc = try Document.open("report.pdf")
let chars = try doc.extractChars(0)

for ch in chars {
    let scalar = String(UnicodeScalar(ch.character)!)
    print("'\(scalar)' at (\(ch.bbox.x), \(ch.bbox.y)) font=\(ch.fontName) size=\(ch.fontSize)")
}

Dart

final doc = PdfDocument.open('report.pdf');
final chars = doc.extractChars(0);

for (final ch in chars) {
    final glyph = String.fromCharCode(ch.character);
    print("'$glyph' at (${ch.bbox.x}, ${ch.bbox.y}) font=${ch.fontName} size=${ch.fontSize}");
}

R

doc <- pdf_open("report.pdf")
chars <- pdf_extract_chars(doc, 0)

for (ch in chars) {
    cat(sprintf("U+%04X at (%.1f, %.1f) font=%s size=%.1f\n",
        ch$character, ch$bbox$x, ch$bbox$y, ch$font_name, ch$font_size))
}

Julia

doc = open_document("report.pdf")
chars = extract_chars(doc, 0)

for ch in chars
    glyph = Char(ch.character)
    println("'$glyph' at ($(ch.bbox.x), $(ch.bbox.y)) font=$(ch.font_name) size=$(ch.font_size)")
end

Zig

var doc = try pdf_oxide.Document.open("report.pdf");
const chars = try doc.extractChars(a, 0);

for (chars) |ch| {
    std.debug.print("U+{X:0>4} at ({d:.1}, {d:.1}) font={s} size={d:.1}\n",
        .{ ch.character, ch.bbox.x, ch.bbox.y, ch.fontName, ch.fontSize });
}

Objective-C

POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
NSArray<POXChar*> *chars = [doc extractChars:0 error:&err];

for (POXChar *ch in chars) {
    NSLog(@"U+%04X at (%.1f, %.1f) font=%@ size=%.1f",
        ch.character, ch.bbox.x, ch.bbox.y, ch.fontName, ch.fontSize);
}

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")
{:ok, chars} = PdfOxide.extract_chars(doc, 0)

Enum.each(chars, fn ch ->
  glyph = <<ch.character::utf8>>
  IO.puts("'#{glyph}' at (#{ch.bbox.x}, #{ch.bbox.y}) font=#{ch.font_name} size=#{ch.font_size}")
end)

`extract_page_text(page_index) -> PageText`

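Retrieves spans, characters, and page dimensions in a single extraction pass. Because the page content stream is parsed only once, this is more efficient than calling extract_spans() and extract_chars() separately.

Parameter	Type	Description
`page_index`	`int` / `usize`	Zero-based page index

Returns: a PageText object (Python dict / JS object) with the fields spans, chars, page_width, page_height, and text.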

Python

doc = PdfDocument("report.pdf")
result = doc.extract_page_text(0)
# result is a dict with: spans, chars, page_width, page_height, text

for span in result["spans"]:
    print(f"'{span.text}' font={span.font_name} size={span.font_size}")

WASM

const result = doc.extractPageText(0);
// result has: spans, chars, pageWidth, pageHeight, text

for (const span of result.spans) {
    console.log(`'${span.text}' font=${span.fontName} size=${span.fontSize}`);
}

Rust

let mut doc = PdfDocument::open("report.pdf")?;
let result = doc.extract_page_text(0)?;
println!("Page is {}x{} pt", result.page_width, result.page_height);
for span in &result.spans {
    println!("'{}' font={} size={:.1}", span.text, span.font_name, span.font_size);
}

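Column-Aware Reading Order

For multi-column PDFs (papers, newspapers, and similar layouts), column-aware reading order reads each column independently instead of reading across columns: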

Python

# Default: top-to-bottom (reads across columns)
spans = doc.extract_spans(0)

# Column-aware: reads each column separately
spans = doc.extract_spans(0, reading_order="column_aware")

WASM

const spans = doc.extractSpans(0, undefined, "column_aware");

Rust

use pdf_oxide::extractors::ReadingOrder;

let spans = doc.extract_spans_with_reading_order(0, ReadingOrder::ColumnAware)?;
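
To show what the two orderings mean in practice, the sketch below sorts the same mock spans both ways. It is a deliberately simplified illustration, not the library's column detector: the `Span` tuple, the `column_aware` helper, and the `min_gutter` parameter are hypothetical, columns are found only from gaps between left edges, and Y is assumed to grow downward.

```python
from typing import NamedTuple

class Span(NamedTuple):
    text: str
    x: float  # left edge, in points
    y: float  # distance from the top of the page, in points

def column_aware(spans: list[Span], min_gutter: float = 50.0) -> list[Span]:
    """Group spans into columns by left edge, then read each column top to bottom."""
    columns: list[list[Span]] = []
    for s in sorted(spans, key=lambda s: s.x):
        # A jump in left edge of at least `min_gutter` starts a new column.
        if columns and s.x - columns[-1][-1].x < min_gutter:
            columns[-1].append(s)
        else:
            columns.append([s])
    return [s for col in columns for s in sorted(col, key=lambda s: s.y)]

spans = [
    Span("L1", 72.0, 100.0), Span("R1", 320.0, 100.0),
    Span("L2", 72.0, 114.0), Span("R2", 320.0, 114.0),
]

top_to_bottom = [s.text for s in sorted(spans, key=lambda s: (s.y, s.x))]
by_column = [s.text for s in column_aware(spans)]
print(top_to_bottom)  # reads across the gutter
print(by_column)      # finishes the left column before the right one
```

Real pages need more than left edges (indented lines, full-width headings, and tables all break this heuristic), which is why the reading-order mode is best left to the extractor.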

`to_plain_text(page_index, options) -> str`

Converts a single page to plain text. It accepts conversion options for API consistency, but most of them mainly affect Markdown/HTML output.

Parameter	Type	Default	Description
`page_index`	`int` / `usize`	–	Zero-based page index
`preserve_layout`	`bool`	`false`	Preserve the visual layout
`detect_headings`	`bool`	`true`	Detect headings
`include_images`	`bool`	`true`	Include images
`image_output_dir`	`str` / `None`	`None`	Output directory for images

Python

doc = PdfDocument("paper.pdf")
text = doc.to_plain_text(0)

Node.js

const doc = new PdfDocument("paper.pdf");
const text = doc.toPlainText(0);

Go

doc, _ := pdfoxide.Open("paper.pdf")
defer doc.Close()
text, _ := doc.ToPlainText(0)

C#

using var doc = PdfDocument.Open("paper.pdf");
string text = doc.ToPlainText(0);

WASM

const doc = new WasmPdfDocument(bytes);
const text = doc.extractText(0);

Rust

use pdf_oxide::converters::ConversionOptions;

let mut doc = PdfDocument::open("paper.pdf")?;
let options = ConversionOptions::default();
let text = doc.to_plain_text(0, &options)?;

C++

auto doc = pdf_oxide::Document::open("paper.pdf");
auto text = doc.to_plain_text(0);

Swift

let doc = try Document.open("paper.pdf")
let text = try doc.toPlainText(0)

Dart

final doc = PdfDocument.open('paper.pdf');
final text = doc.toPlainText(0);

R

doc <- pdf_open("paper.pdf")
text <- pdf_to_plain_text(doc, 0)

Julia

doc = open_document("paper.pdf")
text = to_plain_text(doc, 0)

Zig

var doc = try pdf_oxide.Document.open("paper.pdf");
const text = try doc.toPlainText(a, 0);

Objective-C

POXDocument *doc = [POXDocument openPath:@"paper.pdf" error:&err];
NSString *text = [doc toPlainText:0 error:&err];

Elixir

{:ok, doc} = PdfOxide.open("paper.pdf")
{:ok, text} = PdfOxide.to_plain_text(doc, 0)

`extract_hierarchical_content(page_index) -> Option<StructureElement>`

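Extracts page content as a hierarchical structure tree. Returns None for untagged PDFs. For tagged PDFs, it returns a StructureElement tree that represents the document's logical structure (headings, paragraphs, tables, figures).

Parameter	Type	Description
`page_index`	`int` / `usize`	Zero-based page index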

Rust

let mut doc = PdfDocument::open("tagged-report.pdf")?;
if let Some(root) = doc.extract_hierarchical_content(0)? {
    println!("Structure type: {:?}", root.structure_type);
    for child in &root.children {
        println!("  Child: {:?}", child.structure_type);
    }
}
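
The Rust snippet above prints only the root and its direct children. Walking the whole tree is a plain recursive traversal, sketched here in Python on mock data. The `StructureElement` dataclass below is a stand-in that borrows the `structure_type` and `children` field names from the Rust example. It is not a type exported by the Python binding, and `outline` is a hypothetical helper.

```python
from dataclasses import dataclass, field

@dataclass
class StructureElement:
    structure_type: str
    children: list["StructureElement"] = field(default_factory=list)

def outline(node: StructureElement, depth: int = 0) -> list[str]:
    """Return one indented line per element, in depth-first document order."""
    lines = ["  " * depth + node.structure_type]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines

# A mock tagged page: a heading, a paragraph, and a one-cell table.
root = StructureElement("Document", [
    StructureElement("H1"),
    StructureElement("P"),
    StructureElement("Table", [
        StructureElement("TR", [StructureElement("TD")]),
    ]),
])

tree = outline(root)
print("\n".join(tree))
```

Depth-first order matches the logical reading order that a tagged PDF's structure tree encodes, so the same walk can drive an outline view or a structure-aware export.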

Advanced Examples

Building a word frequency table from extracted text

from collections import Counter
from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
words = Counter()

for page in range(doc.page_count()):
    text = doc.extract_text(page)
    for word in text.split():
        token = word.lower().strip(".,;:!?\"'()[]")
        if token:  # skip tokens that were pure punctuation
            words[token] += 1

for word, count in words.most_common(20):
    print(f"{word:20s} {count}")

Detecting bold headings with span metadata

use pdf_oxide::PdfDocument;
use pdf_oxide::layout::FontWeight;

let mut doc = PdfDocument::open("paper.pdf")?;
let spans = doc.extract_spans(0)?;

let headings: Vec<_> = spans.iter()
    .filter(|s| s.font_weight == FontWeight::Bold && s.font_size > 14.0)
    .collect();

for h in headings {
    println!("Heading: '{}' ({}pt)", h.text, h.font_size);
}

Exporting per-character data to CSV

import csv
from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
chars = doc.extract_chars(0)

with open("characters.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["char", "x", "y", "width", "height", "font", "size"])
    for ch in chars:
        writer.writerow([
            ch.char, ch.bbox[0], ch.bbox[1],
            ch.bbox[2], ch.bbox[3],
            ch.font_name, ch.font_size,
        ])

Vector Path Extraction

extract_paths() returns vector path data (lines, curves, rectangles) from a page. It is useful for detecting table borders, separator lines, and graphic elements.

doc = PdfDocument("report.pdf")
paths = doc.extract_paths(0)
for path in paths:
    for op in path["operations"]:
        print(f"{op['type']}: {op.get('x', '')}, {op.get('y', '')}")
        # types: move_to, line_to, curve_to, rectangle, close_path
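
As an illustration of the table-border use case, the sketch below scans path operations for straight horizontal and vertical segments, which are the usual building blocks of ruled tables. It runs on mock data shaped like the dictionaries in the example above. The `straight_segments` helper and its `tolerance` parameter are hypothetical, and it assumes that move_to and line_to operations always carry `x` and `y` keys.

```python
def straight_segments(paths, tolerance=0.5):
    """Return (orientation, start, end) for each axis-aligned line segment."""
    segments = []
    for path in paths:
        current = None  # current point, known only after move_to / line_to
        for op in path["operations"]:
            if op["type"] == "move_to":
                current = (op["x"], op["y"])
            elif op["type"] == "line_to" and current is not None:
                end = (op["x"], op["y"])
                if abs(end[1] - current[1]) <= tolerance:
                    segments.append(("horizontal", current, end))
                elif abs(end[0] - current[0]) <= tolerance:
                    segments.append(("vertical", current, end))
                current = end
            else:
                # Curves, rectangles, and close_path are ignored in this sketch,
                # so the current point is treated as unknown afterwards.
                current = None
    return segments

mock_paths = [
    {"operations": [
        {"type": "move_to", "x": 72.0, "y": 500.0},
        {"type": "line_to", "x": 540.0, "y": 500.0},
    ]},
    {"operations": [
        {"type": "move_to", "x": 72.0, "y": 500.0},
        {"type": "line_to", "x": 72.0, "y": 300.0},
        {"type": "line_to", "x": 200.0, "y": 250.0},  # diagonal, skipped
    ]},
]

rules = straight_segments(mock_paths)
for orientation, start, end in rules:
    print(orientation, start, end)
```

Grouping the resulting horizontal and vertical rules by shared coordinates is a typical next step toward recovering a table grid.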
