Text Extraction
PDF Oxide provides multiple levels of text extraction: full-page text, styled spans with font metadata, and individual characters with precise positioning. Use extract_text() for quick content retrieval, extract_spans() when you need font and position data, and extract_chars() for per-character analysis such as custom layout engines or OCR post-processing.
For Tagged PDFs, text extraction automatically follows the document’s structure tree for correct reading order. For untagged PDFs, extraction uses page content order with intelligent line-break detection — including a single-column guard that prevents fragmentation of body text on RFC-style and thesis-style documents.
Reading-order support
The reading-order pipeline produces correct output across scripts and layouts:
- Latin — default left-to-right, top-to-bottom with column detection.
- Arabic — pre-shaped span reversal (Pass 0) puts characters in logical reading order instead of visual order.
- CJK — rowspan-label columns are preserved through the spatial table detector; 3pt Y-band quantisation keeps tabular content from interleaving.
- Rotated / dvips-generated PDFs — median-based outlier rejection in column detection handles degenerate CTM coordinates.
- Multi-column academic papers — XYCut single-column guard fixes fragmentation; row-aware span sorting handles tabular content inside text bodies.
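The Y-band quantisation mentioned for tabular content can be pictured with a small sketch. This is not PDF Oxide's implementation: `Span`, `row_aware_sort`, and `band_height` are hypothetical names, and the data assumes y grows downward from the top of the page. The idea is that baselines differing by less than the band height are treated as one row, so cells sort left to right within a row instead of interleaving.

```python
from dataclasses import dataclass


@dataclass
class Span:
    text: str
    x: float  # left edge in points
    y: float  # baseline in points (assumed to grow downward)


def row_aware_sort(spans, band_height=3.0):
    """Sort spans row by row: snap y to a band, then order by x inside the band."""
    def key(span):
        band = int(span.y // band_height)  # spans in the same band share a row
        return (band, span.x)
    return sorted(spans, key=key)


# Two table rows whose baselines jitter by less than a point.
cells = [
    Span("9.99", 200.0, 99.6),
    Span("Item", 50.0, 100.4),
    Span("Total", 50.0, 112.3),
    Span("19.98", 200.0, 111.5),
]
ordered = [s.text for s in row_aware_sort(cells)]
```

A plain sort on y would place "9.99" before "Item" here. Fixed bands can still split a row that straddles a band boundary, which is one reason a production detector needs more than this.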
Word and line segmentation
extract_words() and extract_text_lines() accept optional kwargs that tune the word- and line-break thresholds:
| Parameter | Default | Description |
|---|---|---|
| word_gap_threshold | adaptive | Minimum horizontal gap (in points) between adjacent characters to count as a word break |
| line_gap_threshold | adaptive | Minimum vertical gap between baselines to count as a line break |
| profile | "auto" | One of "auto", "dense", "standard", "sparse"; picks a preset tuned for different layouts |
Adaptive parameters are derived from the page’s font metrics; use page_layout_params() to inspect the computed values, and ExtractionProfile to build a custom profile.
Python-only tuning:
word_gap_threshold, line_gap_threshold, profile, and page_layout_params() are exposed on the Python binding. Node.js, JavaScript, Go, C#, and WASM bindings expose extractWords(pageIndex) / extractTextLines(pageIndex) using the adaptive defaults without kwargs. For tuning from those languages, use the Rust API below.
Python
from pdf_oxide import PdfDocument, ExtractionProfile
doc = PdfDocument("receipt.pdf")
params = doc.page_layout_params(0)
print(params.word_gap_threshold, params.line_gap_threshold)
words = doc.extract_words(0, word_gap_threshold=2.5, profile="dense")
lines = doc.extract_text_lines(0, profile=ExtractionProfile.DENSE)
Node.js
const { PdfDocument } = require("pdf-oxide");
const doc = new PdfDocument("receipt.pdf");
const words = doc.extractWords(0); // adaptive defaults
const lines = doc.extractTextLines(0);
doc.close();
JavaScript
const { PdfDocument } = require("pdf-oxide");
const doc = new PdfDocument("receipt.pdf");
const words = doc.extractWords(0);
const lines = doc.extractTextLines(0);
doc.close();
TypeScript
import { PdfDocument } from "pdf-oxide";
const doc: PdfDocument = new PdfDocument("receipt.pdf");
const words = doc.extractWords(0);
const lines = doc.extractTextLines(0);
doc.close();
Rust
use pdf_oxide::{PdfDocument, ExtractionProfile};
let mut doc = PdfDocument::open("receipt.pdf")?;
let params = doc.page_layout_params(0)?;
println!("{} {}", params.word_gap_threshold, params.line_gap_threshold);
let words = doc.extract_words_with_config(0, /* word_gap_threshold */ Some(2.5), ExtractionProfile::Dense)?;
let lines = doc.extract_text_lines_with_profile(0, ExtractionProfile::Dense)?;
Go
words, _ := doc.ExtractWords(0) // adaptive defaults
lines, _ := doc.ExtractTextLines(0)
C#
var words = doc.ExtractWords(0); // adaptive defaults
var lines = doc.ExtractTextLines(0);
WASM
const doc = new WasmPdfDocument(bytes);
const words = doc.extractWords(0);
const lines = doc.extractTextLines(0);
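To make the effect of word_gap_threshold concrete, here is a minimal sketch of gap-based word splitting. It is an illustration only, not the library's algorithm: `Glyph` and `group_words` are hypothetical, and the choice of "gap greater than or equal to the threshold" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge in points
    width: float  # advance width in points


def group_words(glyphs, word_gap_threshold):
    """Split a left-to-right run of glyphs wherever the gap reaches the threshold."""
    words, current, prev_right = [], "", None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev_right is not None and g.x - prev_right >= word_gap_threshold:
            words.append(current)
            current = ""
        current += g.char
        prev_right = g.x + g.width
    if current:
        words.append(current)
    return words


# The gap between "o" and "p" is 2.8 pt; the other gaps are well under 1 pt.
line = [
    Glyph("T", 10.0, 6.0),
    Glyph("o", 16.2, 5.0),
    Glyph("p", 24.0, 5.0),
    Glyph("s", 29.1, 4.0),
]
tight = group_words(line, 2.5)
loose = group_words(line, 5.0)
```

With a 2.5 pt threshold the run splits into two words; with 5.0 pt it stays together, which is the kind of difference a "dense" versus "sparse" preset would be expected to produce.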
Quick Example
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)
Node.js
const { PdfDocument } = require("pdf-oxide");
const doc = new PdfDocument("report.pdf");
const text = doc.extractText(0);
console.log(text);
Go
import pdfoxide "github.com/yfedoseev/pdf_oxide/go"
doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
text, _ := doc.ExtractText(0)
fmt.Println(text)
C#
using PdfOxide.Core;
using var doc = PdfDocument.Open("report.pdf");
string text = doc.ExtractText(0);
Console.WriteLine(text);
WASM
const doc = new WasmPdfDocument(bytes);
const text = doc.extractText(0);
console.log(text);
Rust
use pdf_oxide::PdfDocument;
let mut doc = PdfDocument::open("report.pdf")?;
let text = doc.extract_text(0)?;
println!("{}", text);
API Reference
extract_text(page_index) -> str
Extract all text from a page as a single string. Automatically detects Tagged PDFs and uses the structure tree for reading order when available. Inserts line breaks and spaces based on vertical and horizontal gaps between spans.
| Parameter | Type | Description |
|---|---|---|
| page_index | int / usize | Zero-based page index |
Returns: The full text content of the page.
Python
doc = PdfDocument("report.pdf")
for i in range(doc.page_count()):
    text = doc.extract_text(i)
    print(f"--- Page {i + 1} ---")
    print(text)
Node.js
const doc = new PdfDocument("report.pdf");
for (let i = 0; i < doc.getPageCount(); i++) {
    const text = doc.extractText(i);
    console.log(`--- Page ${i + 1} ---`);
    console.log(text);
}
Go
doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
count, _ := doc.PageCount()
for i := 0; i < count; i++ {
    text, _ := doc.ExtractText(i)
    fmt.Printf("--- Page %d ---\n", i+1)
    fmt.Println(text)
}
C#
using var doc = PdfDocument.Open("report.pdf");
for (int i = 0; i < doc.PageCount; i++)
{
    string text = doc.ExtractText(i);
    Console.WriteLine($"--- Page {i + 1} ---");
    Console.WriteLine(text);
}
WASM
const doc = new WasmPdfDocument(bytes);
for (let i = 0; i < doc.pageCount(); i++) {
    const text = doc.extractText(i);
    console.log(`--- Page ${i + 1} ---`);
    console.log(text);
}
Rust
let mut doc = PdfDocument::open("report.pdf")?;
let page_count = doc.page_count()?;
for i in 0..page_count {
    let text = doc.extract_text(i)?;
    println!("--- Page {} ---", i + 1);
    println!("{}", text);
}
extract_spans(page_index) -> list[TextSpan]
Extract text as spans – contiguous runs of text with the same font and style. Each span includes the text content, bounding box, font name, font size, weight, italic flag, and color. This is the recommended approach for most extraction tasks that need layout or font information.
| Parameter | Type | Description |
|---|---|---|
| page_index | int / usize | Zero-based page index |
Returns: A list/vector of TextSpan objects.
TextSpan Fields
| Field | Type | Description |
|---|---|---|
| text | str | The text content of the span |
| bbox | Rect | Bounding box (x, y, width, height) |
| font_name | str | Font name/family (e.g., “Helvetica”, “TimesNewRoman”) |
| font_size | f32 | Font size in points |
| font_weight | FontWeight | Weight: Normal, Bold, Light, SemiBold, etc. |
| is_italic | bool | Whether the span is italic |
| color | Color | RGB color (r, g, b) with values 0.0–1.0 |
| mcid | Option<u32> | Marked Content ID for Tagged PDFs |
| sequence | usize | Extraction order (tie-breaker for Y-coordinate sorting) |
| is_monospace | bool | Whether the font is fixed-width (Courier, Consolas, etc.) |
| char_widths | list[float] | Per-glyph advance widths for accurate bounding boxes |
| char_spacing | f32 | Character spacing (Tc parameter) |
| word_spacing | f32 | Word spacing (Tw parameter) |
| horizontal_scaling | f32 | Horizontal scaling percentage (Tz, default 100.0) |
Rust
use pdf_oxide::layout::FontWeight;
let mut doc = PdfDocument::open("paper.pdf")?;
let spans = doc.extract_spans(0)?;
for span in &spans {
    println!(
        "'{}' at ({:.1}, {:.1}) font={} size={:.1}pt bold={} italic={}",
        span.text,
        span.bbox.x, span.bbox.y,
        span.font_name,
        span.font_size,
        span.font_weight == FontWeight::Bold,
        span.is_italic,
    );
}
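Span metadata is often used to tell body text from headings. The sketch below estimates the body font size as the size that covers the most characters, then lists spans set larger than that. `SpanInfo` is a hypothetical stand-in that mirrors only the text, font_name, and font_size fields from the table above; with the real binding you would pass the objects returned by extract_spans() instead.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class SpanInfo:
    text: str
    font_name: str
    font_size: float


def body_font_size(spans):
    """Return the font size that covers the most characters."""
    weight = Counter()
    for s in spans:
        weight[round(s.font_size, 1)] += len(s.text)
    return weight.most_common(1)[0][0]


page = [
    SpanInfo("Results", "Helvetica-Bold", 16.0),
    SpanInfo("The extraction pipeline was run on every page.", "Times-Roman", 10.0),
    SpanInfo("Figure 1", "Times-Italic", 9.0),
]
body = body_font_size(page)
larger = [s.text for s in page if s.font_size > body]
```

Weighting by character count rather than span count keeps a page with many short captions from skewing the estimate.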
extract_spans_with_config(page_index, config) -> Vec<TextSpan>
Extract spans with custom span-merging configuration. Use this when the default merging behavior produces incorrect word boundaries for your document.
| Parameter | Type | Description |
|---|---|---|
| page_index | usize | Zero-based page index |
| config | SpanMergingConfig | Configuration controlling extraction parameters |
Rust
use pdf_oxide::extractors::SpanMergingConfig;
let mut doc = PdfDocument::open("report.pdf")?;
let config = SpanMergingConfig::adaptive();
let spans = doc.extract_spans_with_config(0, config)?;
extract_chars(page_index) -> list[TextChar]
Extract individual characters with precise bounding boxes, font metadata, and transformation properties. This is a low-level API – prefer extract_text() or extract_spans() for most use cases. Character extraction is 30–50% faster than span extraction because it skips text grouping and merging.
| Parameter | Type | Description |
|---|---|---|
| page_index | int / usize | Zero-based page index |
Returns: A list/vector of TextChar objects.
TextChar Fields
| Field | Type | Description |
|---|---|---|
| char | char | The character |
| bbox | Rect | Bounding box (x, y, width, height) |
| font_name | str | Font name/family |
| font_size | f32 | Font size in points |
| font_weight | FontWeight | Weight (Normal, Bold, etc.) |
| is_italic | bool | Italic flag |
| color | Color | RGB color (0.0–1.0 per component) |
| mcid | Option<u32> | Marked Content ID |
| origin_x | f32 | Baseline origin X coordinate |
| origin_y | f32 | Baseline origin Y coordinate |
| rotation_degrees | f32 | Text rotation angle (0–360, clockwise) |
| advance_width | f32 | Horizontal distance to next character position |
| matrix | [f32; 6] | Full transformation matrix [a, b, c, d, e, f] |
Python
doc = PdfDocument("report.pdf")
chars = doc.extract_chars(0)
for ch in chars:
    print(f"'{ch.char}' at ({ch.bbox[0]:.1f}, {ch.bbox[1]:.1f}) "
          f"font={ch.font_name} size={ch.font_size:.1f}")
Node.js: extractChars is not yet available in the binding (js/src/index.ts).
Go
doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
chars, _ := doc.ExtractChars(0)
for _, ch := range chars {
    fmt.Printf("'%c' at (%.1f, %.1f) font=%s size=%.1f\n",
        ch.Char, ch.X, ch.Y, ch.FontName, ch.FontSize)
}
C#
using var doc = PdfDocument.Open("report.pdf");
var chars = doc.ExtractChars(0);
foreach (var ch in chars)
{
    Console.WriteLine($"'{ch.Char}' at ({ch.X:F1}, {ch.Y:F1}) {ch.W:F1}x{ch.H:F1}");
}
WASM
const doc = new WasmPdfDocument(bytes);
const chars = doc.extractChars(0);
for (const ch of chars) {
    console.log(`'${ch.char}' at (${ch.bbox[0].toFixed(1)}, ${ch.bbox[1].toFixed(1)}) font=${ch.fontName} size=${ch.fontSize.toFixed(1)}`);
}
Rust
let mut doc = PdfDocument::open("report.pdf")?;
let chars = doc.extract_chars(0)?;
for ch in &chars {
    println!(
        "'{}' origin=({:.1}, {:.1}) rotation={:.0} advance={:.1}",
        ch.char, ch.origin_x, ch.origin_y,
        ch.rotation_degrees, ch.advance_width,
    );
}
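The matrix field carries the same rotation that rotation_degrees reports. As a rough sketch, the angle of a [a, b, c, d, e, f] matrix can be recovered from its first two entries. `rotation_from_matrix` is a hypothetical helper, and it returns the counter-clockwise angle in a y-up coordinate system, which is the usual mathematical convention. The field table documents rotation_degrees as clockwise, so the library's value may have the opposite sense; check it against real output before relying on this.

```python
import math


def rotation_from_matrix(matrix):
    """Return the rotation encoded in [a, b, c, d, e, f], in degrees within [0, 360)."""
    a, b = matrix[0], matrix[1]
    # For a pure rotation, a = cos(theta) and b = sin(theta).
    return math.degrees(math.atan2(b, a)) % 360.0


upright = [1.0, 0.0, 0.0, 1.0, 72.0, 700.0]
quarter_turn = [0.0, 1.0, -1.0, 0.0, 72.0, 700.0]

upright_angle = rotation_from_matrix(upright)
quarter_angle = rotation_from_matrix(quarter_turn)
```

Because atan2 only looks at the direction of (a, b), uniform scaling such as a font size folded into the matrix does not change the result.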
extract_page_text(page_index) -> PageText
Get spans, characters, and page dimensions from a single extraction pass. More efficient than calling extract_spans() + extract_chars() separately because it parses the page content stream only once.
| Parameter | Type | Description |
|---|---|---|
| page_index | int / usize | Zero-based page index |
Returns: A PageText object (Python dict / JS object) with fields: spans, chars, page_width, page_height, text.
Python
doc = PdfDocument("report.pdf")
result = doc.extract_page_text(0)
# result is a dict with: spans, chars, page_width, page_height, text
for span in result["spans"]:
    print(f"'{span.text}' font={span.font_name} size={span.font_size}")
extractPageText is not yet available in the Node.js (js/src/index.ts), Go (go/pdf_oxide.go), or C# (csharp/PdfOxide/Core/PdfDocument.cs) bindings.
WASM
const result = doc.extractPageText(0);
// result has: spans, chars, pageWidth, pageHeight, text
for (const span of result.spans) {
    console.log(`'${span.text}' font=${span.fontName} size=${span.fontSize}`);
}
Rust
let mut doc = PdfDocument::open("report.pdf")?;
let result = doc.extract_page_text(0)?;
println!("Page is {}x{} pt", result.page_width, result.page_height);
for span in &result.spans {
    println!("'{}' font={} size={:.1}", span.text, span.font_name, span.font_size);
}
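Having the page dimensions alongside the spans makes it easy to express positions relative to the page, for example when comparing layouts across pages of different sizes. The helper below is a hypothetical sketch that assumes a bounding box arrives as an (x, y, width, height) tuple in points, as described in the field tables.

```python
def normalise_bbox(bbox, page_width, page_height):
    """Scale an (x, y, width, height) box in points to fractions of the page size."""
    x, y, w, h = bbox
    return (x / page_width, y / page_height, w / page_width, h / page_height)


# A box on a US Letter page (612 x 792 pt).
box = normalise_bbox((153.0, 198.0, 306.0, 396.0), 612.0, 792.0)
```

With the Python binding, page_width and page_height would come from the dict returned by extract_page_text().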
Column-Aware Reading Order
For multi-column PDFs (research papers, newspapers), use column-aware reading order to read each column separately instead of reading across columns:
Python
# Default: top-to-bottom (reads across columns)
spans = doc.extract_spans(0)
# Column-aware: reads each column separately
spans = doc.extract_spans(0, reading_order="column_aware")
extractSpans is not yet available in the Node.js (js/src/index.ts), Go (go/pdf_oxide.go), or C# (csharp/PdfOxide/Core/PdfDocument.cs) bindings.
WASM
const spans = doc.extractSpans(0, undefined, "column_aware");
Rust
use pdf_oxide::extractors::ReadingOrder;
let spans = doc.extract_spans_with_reading_order(0, ReadingOrder::ColumnAware)?;
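The difference between the two orders is easiest to see on a toy page. The sketch below is deliberately simplified and is not the library's column detector: `Block` and `column_aware_order` are hypothetical, the gutter position is supplied by the caller rather than detected, and y is assumed to grow downward.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x: float  # left edge in points
    y: float  # top edge in points (assumed to grow downward)


def top_to_bottom_order(blocks):
    """Read strictly by vertical position, which crosses the gutter."""
    return sorted(blocks, key=lambda b: (b.y, b.x))


def column_aware_order(blocks, gutter_x):
    """Read the whole left column first, then the whole right column."""
    left = sorted((b for b in blocks if b.x < gutter_x), key=lambda b: b.y)
    right = sorted((b for b in blocks if b.x >= gutter_x), key=lambda b: b.y)
    return left + right


page = [
    Block("L1", 50.0, 100.0),
    Block("R1", 320.0, 100.0),
    Block("L2", 50.0, 120.0),
    Block("R2", 320.0, 120.0),
]
across = [b.text for b in top_to_bottom_order(page)]
by_column = [b.text for b in column_aware_order(page, gutter_x=300.0)]
```

Top-to-bottom order alternates between the columns on every line, while the column-aware order finishes the left column before starting the right one.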
to_plain_text(page_index, options) -> str
Convert a single page to plain text. Accepts conversion options for API consistency, although most options apply primarily to Markdown/HTML output.
| Parameter | Type | Default | Description |
|---|---|---|---|
| page_index | int / usize | – | Zero-based page index |
| preserve_layout | bool | false | Preserve visual layout |
| detect_headings | bool | true | Detect headings |
| include_images | bool | true | Include images |
| image_output_dir | str / None | None | Image output directory |
Python
doc = PdfDocument("paper.pdf")
text = doc.to_plain_text(0)
Node.js
const doc = new PdfDocument("paper.pdf");
const text = doc.toPlainText(0);
Go
doc, _ := pdfoxide.Open("paper.pdf")
defer doc.Close()
text, _ := doc.ToPlainText(0)
C#
using var doc = PdfDocument.Open("paper.pdf");
string text = doc.ToPlainText(0);
WASM
const doc = new WasmPdfDocument(bytes);
const text = doc.extractText(0);
Rust
use pdf_oxide::converters::ConversionOptions;
let mut doc = PdfDocument::open("paper.pdf")?;
let options = ConversionOptions::default();
let text = doc.to_plain_text(0, &options)?;
extract_hierarchical_content(page_index) -> Option<StructureElement>
Extract page content as a hierarchical structure tree. Returns None for untagged PDFs. For Tagged PDFs, returns a StructureElement tree that represents the document’s logical structure (headings, paragraphs, tables, figures).
| Parameter | Type | Description |
|---|---|---|
| page_index | int / usize | Zero-based page index |
Rust
let mut doc = PdfDocument::open("tagged-report.pdf")?;
if let Some(root) = doc.extract_hierarchical_content(0)? {
    println!("Structure type: {:?}", root.structure_type);
    for child in &root.children {
        println!("  Child: {:?}", child.structure_type);
    }
}
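The Rust example prints only the first level of children. Walking the whole tree is a short recursion, sketched here in Python on a stand-in structure. `Node` is hypothetical and models only the structure_type and children fields used above; this page does not show a Python binding for extract_hierarchical_content, so treat the sketch as an illustration of the traversal rather than of the binding.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    structure_type: str
    children: list = field(default_factory=list)


def outline(node, depth=0):
    """Return one indented line per element, visiting the tree depth-first."""
    lines = ["  " * depth + node.structure_type]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


root = Node("Document", [Node("H1"), Node("P"), Node("Table", [Node("TR")])])
tree_lines = outline(root)
```

Depth-first order matches the logical reading order of a structure tree, since each element's children are visited before its next sibling.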
Advanced Examples
Build a word-frequency table from page text
from collections import Counter
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
words = Counter()
for page in range(doc.page_count()):
    text = doc.extract_text(page)
    for word in text.split():
        words[word.lower().strip(".,;:!?\"'()[]")] += 1
for word, count in words.most_common(20):
    print(f"{word:20s} {count}")
Detect bold headings using span metadata
use pdf_oxide::PdfDocument;
use pdf_oxide::layout::FontWeight;
let mut doc = PdfDocument::open("paper.pdf")?;
let spans = doc.extract_spans(0)?;
let headings: Vec<_> = spans.iter()
    .filter(|s| s.font_weight == FontWeight::Bold && s.font_size > 14.0)
    .collect();
for h in headings {
    println!("Heading: '{}' ({}pt)", h.text, h.font_size);
}
Export per-character data to CSV
import csv
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
chars = doc.extract_chars(0)
with open("characters.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["char", "x", "y", "width", "height", "font", "size"])
    for ch in chars:
        writer.writerow([
            ch.char, ch.bbox[0], ch.bbox[1],
            ch.bbox[2], ch.bbox[3],
            ch.font_name, ch.font_size,
        ])
Extract Vector Paths
extract_paths() returns vector path data (lines, curves, rectangles) from a page. Useful for detecting table borders, separators, and graphical elements.
doc = PdfDocument("report.pdf")
paths = doc.extract_paths(0)
for path in paths:
    for op in path["operations"]:
        print(f"{op['type']}: {op.get('x', '')}, {op.get('y', '')}")
# types: move_to, line_to, curve_to, rectangle, close_path
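For table-border detection, a common first step is to pull out the horizontal line segments. The sketch below works on sample data shaped like the dicts in the example above. It assumes that move_to and line_to operations both carry "x" and "y" keys, which this page implies but does not state, and `horizontal_segments` and `tolerance` are hypothetical names.

```python
def horizontal_segments(paths, tolerance=0.5):
    """Collect (x_start, x_end, y) for line_to segments that stay at the same y."""
    segments = []
    for path in paths:
        current = None  # last pen position as (x, y)
        for op in path["operations"]:
            if op["type"] == "move_to":
                current = (op["x"], op["y"])
            elif op["type"] == "line_to":
                if current is not None and abs(op["y"] - current[1]) <= tolerance:
                    x0, x1 = sorted((current[0], op["x"]))
                    segments.append((x0, x1, current[1]))
                current = (op["x"], op["y"])
    return segments


# One path: a horizontal rule followed by a vertical stroke.
sample = [
    {
        "operations": [
            {"type": "move_to", "x": 50.0, "y": 700.0},
            {"type": "line_to", "x": 550.0, "y": 700.0},
            {"type": "line_to", "x": 550.0, "y": 650.0},
        ]
    }
]
rules = horizontal_segments(sample)
```

Rectangle and curve operations are ignored here; a fuller detector would also treat thin rectangles as rules, since many PDF producers draw borders that way.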
Related Pages
- Markdown Conversion – Convert text to structured Markdown
- HTML Conversion – Convert text to HTML with formatting
- Text Search – Search extracted text with regex
- Metadata & XMP – Read document-level metadata