Text Extraction

PDF Oxide provides multiple levels of text extraction: full-page text, styled spans with font metadata, and individual characters with precise positioning. Use extract_text() for quick content retrieval, extract_spans() when you need font and position data, and extract_chars() for per-character analysis such as custom layout engines or OCR post-processing.

For Tagged PDFs, text extraction automatically follows the document’s structure tree for correct reading order. For untagged PDFs, extraction uses page content order with intelligent line-break detection — including a single-column guard that prevents fragmentation of body text on RFC-style and thesis-style documents.

Reading-order support

The reading-order pipeline produces correct output across scripts and layouts:

  • Latin — default left-to-right, top-to-bottom with column detection.
  • Arabic — pre-shaped span reversal (Pass 0) puts characters in logical reading order instead of visual order.
  • CJK — rowspan-label columns are preserved through the spatial table detector; 3pt Y-band quantisation keeps tabular content from interleaving.
  • Rotated / dvips-generated PDFs — median-based outlier rejection in column detection handles degenerate CTM coordinates.
  • Multi-column academic papers — XYCut single-column guard fixes fragmentation; row-aware span sorting handles tabular content inside text bodies.
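
To make the Y-band idea from the CJK bullet concrete, here is a minimal sketch of quantised row sorting. It is an illustration only, not PDF Oxide's implementation: the `Span` tuple and `order_by_y_band` are hypothetical names, and the 3 pt band is taken from the list above.

```python
from typing import NamedTuple


class Span(NamedTuple):
    # Hypothetical stand-in for a text span; not the library's TextSpan type.
    text: str
    x: float  # left edge in points
    y: float  # baseline in points (PDF user space: larger y is higher on the page)


def order_by_y_band(spans, band=3.0):
    """Sort top-to-bottom by quantised Y band, then left-to-right within a band."""
    def key(span):
        band_index = round(span.y / band)  # baselines within roughly one band share an index
        return (-band_index, span.x)
    return sorted(spans, key=key)


# Two table rows whose baselines jitter by less than a point.
cells = [
    Span("Qty", 200.0, 700.4),
    Span("Item", 50.0, 699.6),
    Span("2", 200.0, 685.2),
    Span("Widget", 50.0, 684.9),
]
ordered = [s.text for s in order_by_y_band(cells)]
print(ordered)  # ['Item', 'Qty', 'Widget', '2']
```

Sorting on the raw `y` values instead would place "Qty" before "Item" and "2" before "Widget", because the right-hand cells sit a fraction of a point higher than their row neighbours.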

Word and line segmentation

extract_words() and extract_text_lines() accept optional kwargs that tune the word- and line-break thresholds:

| Parameter | Default | Description |
|---|---|---|
| word_gap_threshold | adaptive | Minimum horizontal gap (in points) between adjacent characters to count as a word break |
| line_gap_threshold | adaptive | Minimum vertical gap between baselines to count as a line break |
| profile | "auto" | One of "auto", "dense", "standard", "sparse"; picks a preset tuned for different layouts |

Adaptive parameters are derived from the page’s font metrics; use page_layout_params() to inspect the computed values, and ExtractionProfile to build a custom profile.

Tuning availability: word_gap_threshold, line_gap_threshold, profile, and page_layout_params() are exposed as kwargs on the Python binding only. The Node.js, JavaScript, Go, C#, and WASM bindings expose extractWords(pageIndex) / extractTextLines(pageIndex) (ExtractWords / ExtractTextLines in Go and C#) with the adaptive defaults and take no tuning arguments. If you need tuning outside Python, use the Rust API shown below.

Python

from pdf_oxide import PdfDocument, ExtractionProfile

doc = PdfDocument("receipt.pdf")

params = doc.page_layout_params(0)
print(params.word_gap_threshold, params.line_gap_threshold)

words = doc.extract_words(0, word_gap_threshold=2.5, profile="dense")
lines = doc.extract_text_lines(0, profile=ExtractionProfile.DENSE)

Node.js

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("receipt.pdf");
const words = doc.extractWords(0);      // adaptive defaults
const lines = doc.extractTextLines(0);
doc.close();

JavaScript

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("receipt.pdf");
const words = doc.extractWords(0);
const lines = doc.extractTextLines(0);
doc.close();

TypeScript

import { PdfDocument } from "pdf-oxide";

const doc: PdfDocument = new PdfDocument("receipt.pdf");
const words = doc.extractWords(0);
const lines = doc.extractTextLines(0);
doc.close();

Rust

use pdf_oxide::{PdfDocument, ExtractionProfile};

let mut doc = PdfDocument::open("receipt.pdf")?;

let params = doc.page_layout_params(0)?;
println!("{} {}", params.word_gap_threshold, params.line_gap_threshold);

let words = doc.extract_words_with_config(0, /* word_gap_threshold */ Some(2.5), ExtractionProfile::Dense)?;
let lines = doc.extract_text_lines_with_profile(0, ExtractionProfile::Dense)?;

Go

words, _ := doc.ExtractWords(0)     // adaptive defaults
lines, _ := doc.ExtractTextLines(0)

C#

var words = doc.ExtractWords(0);     // adaptive defaults
var lines = doc.ExtractTextLines(0);

WASM

const doc = new WasmPdfDocument(bytes);
const words = doc.extractWords(0);
const lines = doc.extractTextLines(0);
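
To show what a word-gap threshold does in principle, here is a minimal sketch that groups characters on one line into words. It does not call PDF Oxide: `Char` and `group_words` are hypothetical names, and the library's adaptive logic is likely more involved.

```python
from typing import NamedTuple


class Char(NamedTuple):
    # Hypothetical per-character record; not the library's TextChar type.
    char: str
    x: float      # left edge in points
    width: float  # advance width in points


def group_words(chars, word_gap_threshold=2.5):
    """Start a new word whenever the gap to the previous glyph reaches the threshold."""
    words, current, prev_right = [], "", None
    for ch in sorted(chars, key=lambda c: c.x):
        if prev_right is not None and ch.x - prev_right >= word_gap_threshold:
            words.append(current)
            current = ""
        current += ch.char
        prev_right = ch.x + ch.width
    if current:
        words.append(current)
    return words


line = [
    Char("T", 10.0, 6.0), Char("o", 16.2, 5.0), Char("p", 21.4, 5.0),
    Char("u", 30.0, 5.0), Char("p", 35.1, 5.0),
]
tight = group_words(line)        # 3.6 pt gap before "u" exceeds 2.5 pt
loose = group_words(line, 10.0)  # no gap reaches 10 pt
print(tight, loose)  # ['Top', 'up'] ['Topup']
```

A smaller threshold splits more aggressively, which presumably suits dense layouts; a larger one keeps loosely tracked text together.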

Quick Example

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)

Node.js

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("report.pdf");
const text = doc.extractText(0);
console.log(text);

Go

import pdfoxide "github.com/yfedoseev/pdf_oxide/go"

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
text, _ := doc.ExtractText(0)
fmt.Println(text)

C#

using PdfOxide.Core;

using var doc = PdfDocument.Open("report.pdf");
string text = doc.ExtractText(0);
Console.WriteLine(text);

WASM

const doc = new WasmPdfDocument(bytes);
const text = doc.extractText(0);
console.log(text);

Rust

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("report.pdf")?;
let text = doc.extract_text(0)?;
println!("{}", text);

API Reference

extract_text(page_index) -> str

Extract all text from a page as a single string. Automatically detects Tagged PDFs and uses the structure tree for reading order when available. Inserts line breaks and spaces based on vertical and horizontal gaps between spans.

| Parameter | Type | Description |
|---|---|---|
| page_index | int / usize | Zero-based page index |

Returns: The full text content of the page.

Python

doc = PdfDocument("report.pdf")
for i in range(doc.page_count()):
    text = doc.extract_text(i)
    print(f"--- Page {i + 1} ---")
    print(text)

Node.js

const doc = new PdfDocument("report.pdf");
for (let i = 0; i < doc.getPageCount(); i++) {
    const text = doc.extractText(i);
    console.log(`--- Page ${i + 1} ---`);
    console.log(text);
}

Go

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
count, _ := doc.PageCount()
for i := 0; i < count; i++ {
    text, _ := doc.ExtractText(i)
    fmt.Printf("--- Page %d ---\n", i+1)
    fmt.Println(text)
}

C#

using var doc = PdfDocument.Open("report.pdf");
for (int i = 0; i < doc.PageCount; i++)
{
    string text = doc.ExtractText(i);
    Console.WriteLine($"--- Page {i + 1} ---");
    Console.WriteLine(text);
}

WASM

const doc = new WasmPdfDocument(bytes);
for (let i = 0; i < doc.pageCount(); i++) {
    const text = doc.extractText(i);
    console.log(`--- Page ${i + 1} ---`);
    console.log(text);
}

Rust

let mut doc = PdfDocument::open("report.pdf")?;
let page_count = doc.page_count()?;
for i in 0..page_count {
    let text = doc.extract_text(i)?;
    println!("--- Page {} ---", i + 1);
    println!("{}", text);
}
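
The gap-based insertion of spaces and line breaks described for extract_text() can be sketched as follows. This is an assumption-laden illustration, not the library's algorithm: the tuple layout, `assemble_text`, and both threshold values are hypothetical.

```python
def assemble_text(spans, line_gap=5.0, space_gap=1.5):
    """Join spans already in reading order.

    Each span is (text, x, y, width) in points. A baseline change of at least
    line_gap starts a new line; otherwise a horizontal gap of at least
    space_gap inserts a single space.
    """
    parts = []
    prev = None
    for text, x, y, width in spans:
        if prev is not None:
            _, prev_x, prev_y, prev_width = prev
            if abs(prev_y - y) >= line_gap:
                parts.append("\n")
            elif x - (prev_x + prev_width) >= space_gap:
                parts.append(" ")
        parts.append(text)
        prev = (text, x, y, width)
    return "".join(parts)


spans = [
    ("Total", 50.0, 700.0, 28.0),
    ("due:", 81.0, 700.0, 22.0),   # 3 pt after "Total" ends, same baseline
    ("$42.00", 50.0, 686.0, 35.0), # 14 pt lower: new line
]
page_text = assemble_text(spans)
print(page_text)
```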

extract_spans(page_index) -> list[TextSpan]

Extract text as spans – contiguous runs of text with the same font and style. Each span includes the text content, bounding box, font name, font size, weight, italic flag, and color. This is the recommended approach for most extraction tasks that need layout or font information.

| Parameter | Type | Description |
|---|---|---|
| page_index | int / usize | Zero-based page index |

Returns: A list/vector of TextSpan objects.

TextSpan Fields

| Field | Type | Description |
|---|---|---|
| text | str | The text content of the span |
| bbox | Rect | Bounding box (x, y, width, height) |
| font_name | str | Font name/family (e.g., “Helvetica”, “TimesNewRoman”) |
| font_size | f32 | Font size in points |
| font_weight | FontWeight | Weight: Normal, Bold, Light, SemiBold, etc. |
| is_italic | bool | Whether the span is italic |
| color | Color | RGB color (r, g, b) with values 0.0–1.0 |
| mcid | Option<u32> | Marked Content ID for Tagged PDFs |
| sequence | usize | Extraction order (tie-breaker for Y-coordinate sorting) |
| is_monospace | bool | Whether the font is fixed-width (Courier, Consolas, etc.) |
| char_widths | list[float] | Per-glyph advance widths for accurate bounding boxes |
| char_spacing | f32 | Character spacing (Tc parameter) |
| word_spacing | f32 | Word spacing (Tw parameter) |
| horizontal_scaling | f32 | Horizontal scaling percentage (Tz, default 100.0) |

Rust

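use pdf_oxide::PdfDocument;
use pdf_oxide::layout::FontWeight; // needed for the FontWeight::Bold comparison below

let mut doc = PdfDocument::open("paper.pdf")?;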
let spans = doc.extract_spans(0)?;

for span in &spans {
    println!(
        "'{}' at ({:.1}, {:.1}) font={} size={:.1}pt bold={} italic={}",
        span.text,
        span.bbox.x, span.bbox.y,
        span.font_name,
        span.font_size,
        span.font_weight == FontWeight::Bold,
        span.is_italic,
    );
}
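
One common use of span font metadata is to estimate the body font size and flag spans that are noticeably larger. The sketch below runs on hypothetical `SpanRecord` objects that mirror a few TextSpan field names; the 1.2 ratio is an arbitrary assumption, and nothing here calls PDF Oxide.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class SpanRecord:
    # Hypothetical stand-in mirroring a few TextSpan fields.
    text: str
    font_name: str
    font_size: float


def body_font_size(spans):
    """Return the font size that covers the most characters on the page."""
    weights = Counter()
    for span in spans:
        weights[round(span.font_size, 1)] += len(span.text)
    return weights.most_common(1)[0][0]


spans = [
    SpanRecord("1 Introduction", "Helvetica-Bold", 16.0),
    SpanRecord("PDF text is stored as positioned glyph runs.", "Times-Roman", 10.0),
    SpanRecord("Each run carries its own font and size.", "Times-Roman", 10.0),
]
body = body_font_size(spans)
larger = [s.text for s in spans if s.font_size > body * 1.2]
print(body, larger)  # 10.0 ['1 Introduction']
```

Weighting by character count rather than span count keeps a page with many short labels from skewing the estimate.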

extract_spans_with_config(page_index, config) -> Vec<TextSpan>

Extract spans with custom span-merging configuration. Use this when the default merging behavior produces incorrect word boundaries for your document.

| Parameter | Type | Description |
|---|---|---|
| page_index | usize | Zero-based page index |
| config | SpanMergingConfig | Configuration controlling extraction parameters |

Rust

use pdf_oxide::extractors::SpanMergingConfig;

let mut doc = PdfDocument::open("report.pdf")?;
let config = SpanMergingConfig::adaptive();
let spans = doc.extract_spans_with_config(0, config)?;

extract_chars(page_index) -> list[TextChar]

Extract individual characters with precise bounding boxes, font metadata, and transformation properties. This is a low-level API – prefer extract_text() or extract_spans() for most use cases. Character extraction is 30–50% faster than span extraction because it skips text grouping and merging.

| Parameter | Type | Description |
|---|---|---|
| page_index | int / usize | Zero-based page index |

Returns: A list/vector of TextChar objects.

TextChar Fields

| Field | Type | Description |
|---|---|---|
| char | char | The character |
| bbox | Rect | Bounding box (x, y, width, height) |
| font_name | str | Font name/family |
| font_size | f32 | Font size in points |
| font_weight | FontWeight | Weight (Normal, Bold, etc.) |
| is_italic | bool | Italic flag |
| color | Color | RGB color (0.0–1.0 per component) |
| mcid | Option<u32> | Marked Content ID |
| origin_x | f32 | Baseline origin X coordinate |
| origin_y | f32 | Baseline origin Y coordinate |
| rotation_degrees | f32 | Text rotation angle (0–360, clockwise) |
| advance_width | f32 | Horizontal distance to next character position |
| matrix | [f32; 6] | Full transformation matrix [a, b, c, d, e, f] |

Python

doc = PdfDocument("report.pdf")
chars = doc.extract_chars(0)

for ch in chars:
    print(f"'{ch.char}' at ({ch.bbox[0]:.1f}, {ch.bbox[1]:.1f}) "
          f"font={ch.font_name} size={ch.font_size:.1f}")

<!-- Node.js: extractChars not yet in binding (js/src/index.ts) -->

Go

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
chars, _ := doc.ExtractChars(0)

for _, ch := range chars {
    fmt.Printf("'%c' at (%.1f, %.1f) font=%s size=%.1f\n",
        ch.Char, ch.X, ch.Y, ch.FontName, ch.FontSize)
}

C#

using var doc = PdfDocument.Open("report.pdf");
var chars = doc.ExtractChars(0);

foreach (var ch in chars)
{
    Console.WriteLine($"'{ch.Char}' at ({ch.X:F1}, {ch.Y:F1}) {ch.W:F1}x{ch.H:F1}");
}

WASM

const doc = new WasmPdfDocument(bytes);
const chars = doc.extractChars(0);

for (const ch of chars) {
    console.log(`'${ch.char}' at (${ch.bbox[0].toFixed(1)}, ${ch.bbox[1].toFixed(1)}) font=${ch.fontName} size=${ch.fontSize.toFixed(1)}`);
}

Rust

let mut doc = PdfDocument::open("report.pdf")?;
let chars = doc.extract_chars(0)?;

for ch in &chars {
    println!(
        "'{}' origin=({:.1}, {:.1}) rotation={:.0} advance={:.1}",
        ch.char, ch.origin_x, ch.origin_y,
        ch.rotation_degrees, ch.advance_width,
    );
}
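
The six-element matrix relates to rotation and scale through standard affine geometry. The sketch below derives both from [a, b, c, d, e, f] with plain math. The function names are hypothetical, and whether `clockwise_degrees` matches the library's rotation_degrees convention is an assumption to verify against real output.

```python
import math


def rotation_ccw_degrees(matrix):
    """Counter-clockwise rotation in PDF user space, normalised to [0, 360)."""
    a, b, _c, _d, _e, _f = matrix
    return math.degrees(math.atan2(b, a)) % 360.0


def clockwise_degrees(matrix):
    """Same angle measured clockwise (assumed to correspond to rotation_degrees)."""
    return (360.0 - rotation_ccw_degrees(matrix)) % 360.0


def scale_factor(matrix):
    """Length of the transformed x unit vector, i.e. the horizontal scale."""
    a, b, *_ = matrix
    return math.hypot(a, b)


upright = [12.0, 0.0, 0.0, 12.0, 72.0, 700.0]        # 12 pt text, no rotation
quarter_turn = [0.0, 12.0, -12.0, 0.0, 72.0, 700.0]  # rotated 90 degrees counter-clockwise

print(rotation_ccw_degrees(upright), scale_factor(upright))
print(rotation_ccw_degrees(quarter_turn), clockwise_degrees(quarter_turn))
```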

extract_page_text(page_index) -> PageText

Get spans, characters, and page dimensions from a single extraction pass. More efficient than calling extract_spans() + extract_chars() separately because it parses the page content stream only once.

| Parameter | Type | Description |
|---|---|---|
| page_index | int / usize | Zero-based page index |

Returns: A PageText object (Python dict / JS object) with fields: spans, chars, page_width, page_height, text.

Python

doc = PdfDocument("report.pdf")
result = doc.extract_page_text(0)
# result is a dict with: spans, chars, page_width, page_height, text

for span in result["spans"]:
    print(f"'{span.text}' font={span.font_name} size={span.font_size}")

<!-- Node.js: extractPageText not yet in binding (js/src/index.ts) --> <!-- Go: ExtractPageText not yet in binding (go/pdf_oxide.go) --> <!-- C#: ExtractPageText not yet in binding (csharp/PdfOxide/Core/PdfDocument.cs) -->

WASM

const result = doc.extractPageText(0);
// result has: spans, chars, pageWidth, pageHeight, text

for (const span of result.spans) {
    console.log(`'${span.text}' font=${span.fontName} size=${span.fontSize}`);
}

Rust

let mut doc = PdfDocument::open("report.pdf")?;
let result = doc.extract_page_text(0)?;
println!("Page is {}x{} pt", result.page_width, result.page_height);
for span in &result.spans {
    println!("'{}' font={} size={:.1}", span.text, span.font_name, span.font_size);
}
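
Having page_height alongside the spans makes it straightforward to convert boxes into a top-left-origin coordinate system, for example to overlay them on a rendered image. The sketch below assumes that bbox is (x, y, width, height) with y as the bottom edge in PDF user space (origin at the bottom-left); that assumption, and the `to_top_left` helper name, are not confirmed by the library.

```python
def to_top_left(bbox, page_height):
    """Flip a bottom-left-origin box into top-left-origin coordinates."""
    x, y, width, height = bbox
    return (x, page_height - (y + height), width, height)


def to_fractions(bbox, page_width, page_height):
    """Express a box as fractions of the page size (resolution independent)."""
    x, y, width, height = bbox
    return (x / page_width, y / page_height, width / page_width, height / page_height)


letter_height = 792.0  # US Letter height in points
flipped = to_top_left((72.0, 700.0, 100.0, 12.0), letter_height)
print(flipped)  # (72.0, 80.0, 100.0, 12.0)
```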

Column-Aware Reading Order

For multi-column PDFs (research papers, newspapers), use column-aware reading order to read each column separately instead of reading across columns:

Python

# Default: top-to-bottom (reads across columns)
spans = doc.extract_spans(0)

# Column-aware: reads each column separately
spans = doc.extract_spans(0, reading_order="column_aware")

<!-- Node.js: extractSpans not yet in binding (js/src/index.ts) --> <!-- Go: ExtractSpans not yet in binding (go/pdf_oxide.go) --> <!-- C#: ExtractSpans not yet in binding (csharp/PdfOxide/Core/PdfDocument.cs) -->

WASM

const spans = doc.extractSpans(0, undefined, "column_aware");

Rust

use pdf_oxide::extractors::ReadingOrder;

let spans = doc.extract_spans_with_reading_order(0, ReadingOrder::ColumnAware)?;
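
The difference between the two orderings can be illustrated with a toy two-column page. This sketch is not the library's column detector: it assumes the gutter position is already known, and both function names are hypothetical.

```python
def top_to_bottom_order(spans):
    """Naive order: highest baseline first, then left-to-right (reads across columns)."""
    return sorted(spans, key=lambda s: (-s[2], s[1]))


def column_aware_order(spans, gutter_x):
    """Read the whole left column top-down, then the whole right column."""
    left = [s for s in spans if s[1] < gutter_x]
    right = [s for s in spans if s[1] >= gutter_x]
    top_down = lambda s: -s[2]
    return sorted(left, key=top_down) + sorted(right, key=top_down)


# (text, x, y) in points; two columns with a gutter near x = 300.
spans = [
    ("L1", 50.0, 700.0), ("R1", 320.0, 700.0),
    ("L2", 50.0, 686.0), ("R2", 320.0, 686.0),
]
naive = [s[0] for s in top_to_bottom_order(spans)]
by_column = [s[0] for s in column_aware_order(spans, gutter_x=300.0)]
print(naive)      # ['L1', 'R1', 'L2', 'R2']
print(by_column)  # ['L1', 'L2', 'R1', 'R2']
```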

to_plain_text(page_index, options) -> str

Convert a single page to plain text. Accepts conversion options for API consistency, although most options apply primarily to Markdown/HTML output.

| Parameter | Type | Default | Description |
|---|---|---|---|
| page_index | int / usize | | Zero-based page index |
| preserve_layout | bool | false | Preserve visual layout |
| detect_headings | bool | true | Detect headings |
| include_images | bool | true | Include images |
| image_output_dir | str / None | None | Image output directory |

Python

doc = PdfDocument("paper.pdf")
text = doc.to_plain_text(0)

Node.js

const doc = new PdfDocument("paper.pdf");
const text = doc.toPlainText(0);

Go

doc, _ := pdfoxide.Open("paper.pdf")
defer doc.Close()
text, _ := doc.ToPlainText(0)

C#

using var doc = PdfDocument.Open("paper.pdf");
string text = doc.ToPlainText(0);

WASM

const doc = new WasmPdfDocument(bytes);
const text = doc.extractText(0);

Rust

use pdf_oxide::converters::ConversionOptions;

let mut doc = PdfDocument::open("paper.pdf")?;
let options = ConversionOptions::default();
let text = doc.to_plain_text(0, &options)?;

extract_hierarchical_content(page_index) -> Option<StructureElement>

Extract page content as a hierarchical structure tree. Returns None for untagged PDFs. For Tagged PDFs, returns a StructureElement tree that represents the document’s logical structure (headings, paragraphs, tables, figures).

| Parameter | Type | Description |
|---|---|---|
| page_index | int / usize | Zero-based page index |

Rust

let mut doc = PdfDocument::open("tagged-report.pdf")?;
if let Some(root) = doc.extract_hierarchical_content(0)? {
    println!("Structure type: {:?}", root.structure_type);
    for child in &root.children {
        println!("  Child: {:?}", child.structure_type);
    }
}
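
The Rust example above prints only the first level of children. Walking the full tree is a simple recursion, sketched here on a hypothetical `Node` class that mirrors the structure_type and children fields; it does not use the library's StructureElement type.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical stand-in with the two fields used in the Rust example.
    structure_type: str
    children: list = field(default_factory=list)


def outline(node, depth=0):
    """Return one indented line per element, depth-first."""
    lines = ["  " * depth + node.structure_type]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


root = Node("Document", [
    Node("H1"),
    Node("P"),
    Node("Table", [Node("TR", [Node("TD"), Node("TD")])]),
])
tree_lines = outline(root)
print("\n".join(tree_lines))
```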

Advanced Examples

Build a word-frequency table from page text

from collections import Counter
from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
words = Counter()

for page in range(doc.page_count()):
    text = doc.extract_text(page)
    for word in text.split():
        token = word.lower().strip(".,;:!?\"'()[]")
        if token:  # skip tokens that were pure punctuation
            words[token] += 1

for word, count in words.most_common(20):
    print(f"{word:20s} {count}")

Detect bold headings using span metadata

use pdf_oxide::PdfDocument;
use pdf_oxide::layout::FontWeight;

let mut doc = PdfDocument::open("paper.pdf")?;
let spans = doc.extract_spans(0)?;

let headings: Vec<_> = spans.iter()
    .filter(|s| s.font_weight == FontWeight::Bold && s.font_size > 14.0)
    .collect();

for h in headings {
    println!("Heading: '{}' ({}pt)", h.text, h.font_size);
}

Export per-character data to CSV

import csv
from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
chars = doc.extract_chars(0)

with open("characters.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["char", "x", "y", "width", "height", "font", "size"])
    for ch in chars:
        writer.writerow([
            ch.char, ch.bbox[0], ch.bbox[1],
            ch.bbox[2], ch.bbox[3],
            ch.font_name, ch.font_size,
        ])

Extract Vector Paths

extract_paths() returns vector path data (lines, curves, rectangles) from a page. Useful for detecting table borders, separators, and graphical elements.

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
paths = doc.extract_paths(0)
for path in paths:
    for op in path["operations"]:
        print(f"{op['type']}: {op.get('x', '')}, {op.get('y', '')}")
        # types: move_to, line_to, curve_to, rectangle, close_path
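
As a rough illustration of border detection, the sketch below collects horizontal and vertical line segments from a path's operations. It runs on hand-written sample data and assumes that move_to and line_to operations carry numeric "x" and "y" keys, as the loop above suggests; `axis_aligned_segments` and the tolerance value are hypothetical.

```python
def axis_aligned_segments(path, tolerance=0.5):
    """Split straight segments into horizontal and vertical lists; ignore the rest."""
    horizontal, vertical = [], []
    current = None  # current pen position, or None when unknown
    for op in path["operations"]:
        if op["type"] == "move_to":
            current = (op["x"], op["y"])
        elif op["type"] == "line_to" and current is not None:
            end = (op["x"], op["y"])
            if abs(end[1] - current[1]) <= tolerance:
                horizontal.append((current, end))
            elif abs(end[0] - current[0]) <= tolerance:
                vertical.append((current, end))
            current = end
        else:
            current = None  # curves, rectangles, close_path: not handled in this sketch
    return horizontal, vertical


sample_path = {"operations": [
    {"type": "move_to", "x": 50.0, "y": 700.0},
    {"type": "line_to", "x": 300.0, "y": 700.0},  # horizontal rule
    {"type": "line_to", "x": 300.0, "y": 650.0},  # vertical rule
    {"type": "line_to", "x": 320.0, "y": 630.0},  # diagonal, ignored
]}
h_lines, v_lines = axis_aligned_segments(sample_path)
print(len(h_lines), len(v_lines))  # 1 1
```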