Extraction Profiles — Tune Space Detection per Document Type

Different PDFs bury their spaces differently. An arXiv paper uses tight, justified columns. An IRS form uses rigid cell alignment. A GDPR policy runs dense, justified paragraphs with minimal kerning. A single tj_offset_threshold that works for one of these will insert garbage spaces into another.

ExtractionProfile ships nine pre-tuned parameter sets that map cleanly onto real document classes. Pass the profile to extract_text() / extract_words() and PDF Oxide applies the right word-margin ratio, TJ offset threshold, and adaptive-threshold toggle for that document style.

Binding coverage. Extraction profiles are currently exposed in Python (pdf_oxide.ExtractionProfile) and Rust (pdf_oxide::config::ExtractionProfile). The Node, WASM, Go, and C# bindings use the CONSERVATIVE default internally; to apply a different profile from those runtimes, invoke the Rust CLI (pdf-oxide extract --profile academic doc.pdf) or bridge through a Python / Rust step.
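
From a runtime without profile support, shelling out to the CLI is the simplest bridge. The sketch below shows the pattern in Python for brevity; the same argument list applies from Node, Go, or C#. It assumes the pdf-oxide binary is on PATH and writes the extracted text to stdout, neither of which is confirmed here, and build_extract_command / extract_via_cli are hypothetical helper names.

```python
import subprocess
from typing import List


def build_extract_command(pdf_path: str, profile: str = "academic") -> List[str]:
    """Build the argv for the CLI call quoted in the note above."""
    return ["pdf-oxide", "extract", "--profile", profile, pdf_path]


def extract_via_cli(pdf_path: str, profile: str = "academic") -> str:
    """Run the CLI and return its stdout.

    Assumption: the CLI prints the extracted text to stdout.
    """
    result = subprocess.run(
        build_extract_command(pdf_path, profile),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Show the command without running it.
print(build_extract_command("doc.pdf"))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.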

Quick Example

Python

from pdf_oxide import PdfDocument, ExtractionProfile

doc = PdfDocument("paper.pdf")

# Academic papers: tight spacing, citation detection on
text = doc.extract_text(0, profile=ExtractionProfile.academic())
print(text)

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::config::ExtractionProfile;

let mut doc = PdfDocument::open("paper.pdf")?;
let text = doc.extract_text_with_profile(0, ExtractionProfile::ACADEMIC)?;
println!("{}", text);

Available Profiles

| Profile | Best for | TJ threshold | Word-margin ratio | Adaptive |
| --- | --- | --- | --- | --- |
| conservative() | Default: general text, minimal false spaces | −120 | 0.10 | off |
| aggressive() | PDFs that suppress spaces; fixes merged words | −80 | 0.20 | off |
| balanced() | Mixed content | −100 | 0.15 | off |
| academic() | arXiv papers, conference proceedings, tech reports | −105 | 0.12 | on + citation / email detection |
| policy() | Legal, GDPR, government regulations | −110 | 0.18 | on |
| form() | IRS forms, applications, questionnaires | −120 | 0.08 | off |
| government() | Mixed government reports with tables | −105 | 0.14 | off |
| scanned_ocr() | OCR output where coordinates are noisy | depends | depends | on |
| adaptive() | Let the extractor auto-tune from font statistics | depends | depends | on |
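
The TJ threshold values are easier to read with the PDF operator in mind. In a TJ array, the numbers between strings shift the next glyph in thousandths of a text-space unit, and negative values widen the gap. The sketch below is a simplified model, not PDF Oxide's actual implementation: it assumes a space is inserted whenever an offset is at or below the threshold, and join_tj is a hypothetical helper.

```python
from typing import Sequence, Union

TJItem = Union[str, float]


def join_tj(items: Sequence[TJItem], tj_offset_threshold: float) -> str:
    """Join a TJ array, adding a space where an offset is at or below the threshold.

    Simplified model (assumption): real extractors also weigh font size,
    character spacing, and glyph widths.
    """
    out = []
    for item in items:
        if isinstance(item, str):
            out.append(item)
        elif item <= tj_offset_threshold and out and not out[-1].endswith(" "):
            # A large negative offset pushes the next glyph right: likely a word gap.
            out.append(" ")
    return "".join(out)


# -30 is ordinary kerning; -100 is an ambiguous gap between two words.
tj_array = ["extrac", -30, "tion", -100, "profile"]

conservative = join_tj(tj_array, -120)  # conservative() threshold
aggressive = join_tj(tj_array, -80)     # aggressive() threshold

print(repr(conservative))
print(repr(aggressive))
```

Under this model the −100 offset falls between the two thresholds, so only the aggressive setting splits the words. The same offset inside a tightly kerned word would produce a false space, which is the trade-off the presets balance.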

When Each Profile Helps

Academic / conference papers — academic()

Tight typesetting, two-column layouts, embedded citations. Default settings often over-insert spaces inside ligatures (fi, ff) or under-insert between words where kerning is aggressive.

doc = PdfDocument("neurips-paper.pdf")
text = doc.extract_text(0, profile=ExtractionProfile.academic())

The academic profile turns on adaptive thresholding plus citation / email detection so inline [1,2,3] refs and author@lab.edu emails survive cleanly.
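
A quick way to confirm this on your own corpus is to search the extracted text for intact tokens. The patterns below are illustrative assumptions for a post-extraction spot check, not the detection rules the academic profile uses internally, and find_inline_tokens is a hypothetical helper.

```python
import re
from typing import Dict, List

# Illustrative patterns (assumptions): numeric bracket citations and simple emails.
CITATION = re.compile(r"\[\d+(?:,\s?\d+)*\]")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def find_inline_tokens(text: str) -> Dict[str, List[str]]:
    """Return the citations and emails that appear unbroken in the text."""
    return {
        "citations": CITATION.findall(text),
        "emails": EMAIL.findall(text),
    }


clean = "As shown in [1,2,3], contact author@lab.edu for data."
broken = "As shown in [1 ,2, 3], contact author @lab.edu for data."

print(find_inline_tokens(clean))
print(find_inline_tokens(broken))
```

If the counts drop sharply when you switch profiles on the same page, stray spaces are probably being inserted inside these tokens.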

IRS forms, applications — form()

Form PDFs care about column alignment more than about word boundaries. The form() profile uses a very tight word-margin ratio (0.08) so that rigidly aligned field labels don’t collapse into their values.

doc = PdfDocument("w2.pdf")
text = doc.extract_text(0, profile=ExtractionProfile.form())
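
As a rough illustration of why a smaller ratio keeps labels and values apart, the sketch below assumes the word-margin ratio is multiplied by the font size to get the minimum horizontal gap that counts as a word boundary. That rule and the needs_space helper are assumptions for illustration; the library's exact definition may differ.

```python
def needs_space(gap: float, font_size: float, word_margin_ratio: float) -> bool:
    """Return True when a horizontal gap is wide enough to count as a word break.

    Assumed rule: gap > word_margin_ratio * font_size.
    """
    return gap > word_margin_ratio * font_size


# A 1.0 pt gap between a field label and its value, set in 10 pt type.
gap, font_size = 1.0, 10.0

form_ratio = 0.08    # form() value from the profile table
policy_ratio = 0.18  # policy() value from the profile table

print(needs_space(gap, font_size, form_ratio))    # gap exceeds the 0.8 pt cutoff
print(needs_space(gap, font_size, policy_ratio))  # gap is below the 1.8 pt cutoff
```

Under this assumed rule, the same narrow gap separates a label from its value with form() but is absorbed as justification slack with policy().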

GDPR / policy / regulation — policy()

Justified paragraphs insert variable whitespace that breaks the default threshold. policy() uses a more generous word margin (0.18) plus adaptive thresholding to correctly read dense legal prose.

doc = PdfDocument("gdpr.pdf")
text = doc.extract_text(0, profile=ExtractionProfile.policy())

Scanned OCR output — scanned_ocr()

When the page was OCR’d (Tesseract, PaddleOCR, Azure), character positions are noisy and kerning hints are absent. scanned_ocr() compensates with adaptive thresholding that re-reads font statistics per page.

doc = PdfDocument("scanned.pdf")
text = doc.extract_text(0, profile=ExtractionProfile.scanned_ocr())

Let the library pick — adaptive()

If you don’t know the document class ahead of time, adaptive() samples font statistics on the first pass and picks thresholds before extracting. It is slightly slower than a fixed profile but more forgiving across mixed corpora.

from pathlib import Path
from pdf_oxide import PdfDocument, ExtractionProfile

for pdf_path in Path("mixed_corpus/").glob("*.pdf"):
    doc = PdfDocument(str(pdf_path))
    text = doc.extract_text(0, profile=ExtractionProfile.adaptive())
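
How adaptive() derives its thresholds is not spelled out here. As one plausible illustration of auto-tuning from page statistics, the sketch below splits observed glyph gaps at the largest jump between sorted values, treating the smaller cluster as kerning and the larger as word gaps. pick_gap_threshold is a hypothetical helper, not part of pdf_oxide, and the gap values are made up for the example.

```python
from typing import Sequence


def pick_gap_threshold(gaps: Sequence[float]) -> float:
    """Pick a word-gap threshold at the midpoint of the largest jump in sorted gaps."""
    ordered = sorted(gaps)
    if len(ordered) < 2:
        raise ValueError("need at least two gaps to estimate a threshold")
    # Find the adjacent pair with the widest separation.
    best_i = max(range(len(ordered) - 1), key=lambda i: ordered[i + 1] - ordered[i])
    return (ordered[best_i] + ordered[best_i + 1]) / 2


# Gaps in points: small values are kerning, large values are word breaks.
gaps = [0.1, 0.2, 0.15, 2.4, 0.05, 2.6, 0.2, 2.5]
threshold = pick_gap_threshold(gaps)

word_breaks = [g for g in gaps if g > threshold]
print(threshold, word_breaks)
```

A statistic like this has to be recomputed for each page or document, which is consistent with the extra cost of adaptive() compared with a fixed preset.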

Profile fields

Every profile exposes its tuning knobs so you can read or clone them:

Python

from pdf_oxide import ExtractionProfile

p = ExtractionProfile.academic()
print(p.name)                # "Academic"
print(p.word_margin_ratio)   # 0.12
print(p.tj_offset_threshold) # -105.0

# Inspect every preset
for profile in ExtractionProfile.all_profiles():
    print(profile.name, profile.word_margin_ratio)

Rust

use pdf_oxide::config::ExtractionProfile;

let p = ExtractionProfile::ACADEMIC;
println!("{} margin={} tj={}",
    p.name, p.word_margin_ratio, p.tj_offset_threshold);

Choosing a profile in production pipelines

If you ingest a mixed corpus (academic papers alongside IRS forms alongside web-scraped HTML exports), pick adaptive() as your default. It costs a few percent extra per page but eliminates the worst failures, such as merged words and missing spaces across columns.

If your corpus is homogeneous — you run a Title IX intake pipeline, a contract-review tool, or an arXiv crawler — pick the matching profile explicitly: you’ll get the best extraction quality and you’ll avoid the per-page sampling cost of adaptive().
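
One way to combine both recommendations is to route by document class when a label is available and fall back to adaptive() otherwise. The sketch below is a minimal illustration: the class labels and the profile_name_for helper are hypothetical, and only the profile names come from the table above.

```python
from typing import Optional

# Profile factory names come from the profile table; the class labels are hypothetical.
PROFILE_BY_CLASS = {
    "arxiv_paper": "academic",
    "irs_form": "form",
    "privacy_policy": "policy",
    "ocr_scan": "scanned_ocr",
}


def profile_name_for(doc_class: Optional[str]) -> str:
    """Return the ExtractionProfile factory name for a document class.

    Unknown or missing classes fall back to adaptive().
    """
    if doc_class is None:
        return "adaptive"
    return PROFILE_BY_CLASS.get(doc_class, "adaptive")


print(profile_name_for("irs_form"))
print(profile_name_for("web_export"))

# With pdf_oxide installed, the name can be resolved to a profile object:
#   profile = getattr(ExtractionProfile, profile_name_for(label))()
#   text = doc.extract_text(0, profile=profile)
```

Keeping the mapping as data makes it easy to add a class once a dedicated profile proves better than the adaptive fallback.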