# Extraction Profiles — Tune Space Detection per Document Type
Different PDFs bury their spaces differently. An arXiv paper uses tight, justified columns. An IRS form uses rigid cell alignment. A GDPR policy runs dense, justified paragraphs with minimal kerning. A single `tj_offset_threshold` that works for one of these will insert garbage spaces into another.

`ExtractionProfile` ships nine pre-tuned parameter sets that map cleanly onto real document classes. Pass the profile to `extract_text()` / `extract_words()` and PDF Oxide applies the right word-margin ratio, TJ offset threshold, and adaptive-threshold toggle for that document style.
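The trade-off can be pictured with the thresholds from the profile table below. A toy illustration of the `tj_offset_threshold` idea (not PDF Oxide's internal code): a TJ adjustment more negative than the profile's threshold is treated as a word break, so the conservative and aggressive presets disagree on the very same offset.

```python
def is_word_break(tj_offset: float, threshold: float) -> bool:
    """Treat a TJ adjustment more negative than the threshold as a space.

    Toy sketch of the tj_offset_threshold knob, not PDF Oxide internals.
    """
    return tj_offset < threshold

# The same -100 offset on the same glyph run:
print(is_word_break(-100.0, -120.0))  # conservative (-120): False, no space
print(is_word_break(-100.0, -80.0))   # aggressive (-80): True, space inserted
```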
**Binding coverage.** Extraction profiles are currently exposed in Python (`pdf_oxide.ExtractionProfile`) and Rust (`pdf_oxide::config::ExtractionProfile`). The Node, WASM, Go, and C# bindings use the `CONSERVATIVE` default internally; to apply a different profile from those runtimes, invoke the Rust CLI (`pdf-oxide extract --profile academic doc.pdf`) or bridge through a Python / Rust step.
## Quick Example

Python:

```python
from pdf_oxide import PdfDocument, ExtractionProfile

doc = PdfDocument("paper.pdf")

# Academic papers: tight spacing, citation detection on
text = doc.extract_text(0, profile=ExtractionProfile.academic())
print(text)
```
Rust:

```rust
use pdf_oxide::PdfDocument;
use pdf_oxide::config::ExtractionProfile;

let mut doc = PdfDocument::open("paper.pdf")?;
let text = doc.extract_text_with_profile(0, ExtractionProfile::ACADEMIC)?;
println!("{}", text);
```
## Available Profiles

| Profile | Best for | TJ threshold | Word-margin ratio | Adaptive |
|---|---|---|---|---|
| `conservative()` | Default — general text, minimal false spaces | −120 | 0.10 | off |
| `aggressive()` | PDFs that suppress spaces; fixes merged words | −80 | 0.20 | off |
| `balanced()` | Mixed content | −100 | 0.15 | off |
| `academic()` | arXiv papers, conference proceedings, tech reports | −105 | 0.12 | on + citation / email detection |
| `policy()` | Legal, GDPR, government regulations | −110 | 0.18 | on |
| `form()` | IRS forms, applications, questionnaires | −120 | 0.08 | off |
| `government()` | Mixed government reports with tables | −105 | 0.14 | off |
| `scanned_ocr()` | OCR output where coordinates are noisy | depends | depends | on |
| `adaptive()` | Let the extractor auto-tune from font statistics | depends | depends | on |
## When Each Profile Helps

### Academic / conference papers — `academic()`

Tight typesetting, two-column layouts, embedded citations. Default settings often over-insert spaces inside ligatures (fi, ff) or under-insert between words where kerning is aggressive.

```python
doc = PdfDocument("neurips-paper.pdf")
text = doc.extract_text(0, profile=ExtractionProfile.academic())
```

The academic profile turns on adaptive thresholding plus citation / email detection so inline [1,2,3] refs and author@lab.edu emails survive cleanly.
### IRS forms, applications — `form()`

Form PDFs care about column alignment more than word boundaries. The `form()` profile uses a very tight word-margin ratio (0.08) so rigidly aligned field labels don't collapse into their values.

```python
doc = PdfDocument("w2.pdf")
text = doc.extract_text(0, profile=ExtractionProfile.form())
```
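The word-margin ratio can be pictured as a gap test against font size (a toy sketch of the concept, not PDF Oxide internals): a horizontal gap wider than ratio × font size becomes a space, so the tight 0.08 of `form()` inserts breaks at narrow gaps that the looser `policy()` ratio would glue together.

```python
def gap_is_space(gap_pts: float, font_size_pts: float, word_margin_ratio: float) -> bool:
    """A horizontal gap wider than ratio * font size is treated as a word break.

    Toy illustration of the word-margin knob, not pdf_oxide's implementation.
    """
    return gap_pts > word_margin_ratio * font_size_pts

# A 1.2 pt gap at 10 pt type:
print(gap_is_space(1.2, 10.0, 0.08))  # form (0.08): True  -> break kept
print(gap_is_space(1.2, 10.0, 0.18))  # policy (0.18): False -> no break
```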
### GDPR / policy / regulation — `policy()`

Justified paragraphs insert variable whitespace that breaks the default threshold. `policy()` uses a more generous word margin (0.18) plus adaptive thresholding to read dense legal prose correctly.

```python
doc = PdfDocument("gdpr.pdf")
text = doc.extract_text(0, profile=ExtractionProfile.policy())
```
### Scanned OCR output — `scanned_ocr()`

When the page was OCR'd (Tesseract, PaddleOCR, Azure), character positions are noisy and kerning hints are absent. `scanned_ocr()` compensates with adaptive thresholding that re-reads font statistics per page.

```python
doc = PdfDocument("scanned.pdf")
text = doc.extract_text(0, profile=ExtractionProfile.scanned_ocr())
```
### Let the library pick — `adaptive()`

If you don't know the document class ahead of time, `adaptive()` samples font statistics on a first pass and picks thresholds before extracting. Slightly slower than a fixed profile but forgiving across mixed corpora.

```python
from pathlib import Path

for pdf_path in Path("mixed_corpus/").glob("*.pdf"):
    doc = PdfDocument(str(pdf_path))
    text = doc.extract_text(0, profile=ExtractionProfile.adaptive())
```
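How auto-tuning from font statistics might work can be sketched in a few lines (a hypothetical illustration of the idea, not PDF Oxide's algorithm): sort the observed inter-glyph gaps and place the threshold in the middle of the widest jump, so everything above it counts as a word break.

```python
def auto_gap_threshold(gaps: list[float]) -> float:
    """Pick a word-break threshold at the widest jump in the sorted gaps.

    Hypothetical sketch of adaptive tuning; not pdf_oxide's implementation.
    """
    ordered = sorted(gaps)
    # Find the largest jump between consecutive gap values...
    jumps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    _, i = max(jumps)
    # ...and place the threshold in the middle of that jump.
    return (ordered[i] + ordered[i + 1]) / 2

# Intra-word gaps cluster near 0.3 pt, inter-word gaps near 2.5 pt:
gaps = [0.2, 0.3, 0.3, 0.4, 2.4, 2.5, 2.6]
print(round(auto_gap_threshold(gaps), 2))  # 1.4
```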
## Profile fields

Every profile exposes its tuning knobs so you can read or clone them:

Python:

```python
from pdf_oxide import ExtractionProfile

p = ExtractionProfile.academic()
print(p.name)                 # "Academic"
print(p.word_margin_ratio)    # 0.12
print(p.tj_offset_threshold)  # -105.0

# Inspect every preset
for profile in ExtractionProfile.all_profiles():
    print(profile.name, profile.word_margin_ratio)
```
Rust:

```rust
use pdf_oxide::config::ExtractionProfile;

let p = ExtractionProfile::ACADEMIC;
println!("{} margin={} tj={}",
    p.name, p.word_margin_ratio, p.tj_offset_threshold);
```
## Choosing a profile in production pipelines

If you ingest a mixed corpus — academic papers alongside IRS forms alongside web-scraped HTML exports — pick `adaptive()` as your default. It costs a few percent extra per page but eliminates the worst failures (merged words, missing spaces across columns).
If your corpus is homogeneous — you run a Title IX intake pipeline, a contract-review tool, or an arXiv crawler — pick the matching profile explicitly: you’ll get the best extraction quality and you’ll avoid the per-page sampling cost of adaptive().
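For the homogeneous case, the routing logic can stay very small. A hypothetical dispatch sketch (the source labels and mapping are made up for illustration):

```python
# Hypothetical mapping from known ingest sources to preset names.
PROFILE_BY_SOURCE = {
    "arxiv": "academic",
    "irs": "form",
    "regulations": "policy",
    "scans": "scanned_ocr",
}

def pick_profile(source: str) -> str:
    """Route known sources to a fixed preset; fall back to adaptive()."""
    return PROFILE_BY_SOURCE.get(source, "adaptive")

print(pick_profile("arxiv"))    # academic
print(pick_profile("unknown"))  # adaptive
```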
## Related Pages
- Text Extraction — full extraction API
- Reading Order (XY-cut) — column-aware reading order
- OCR Scanned PDFs — when a profile isn’t enough
- Python API Reference