Reading Order & XY-Cut — Extract Multi-Column PDFs in Natural Order
Multi-column PDFs — academic papers, textbooks, magazine articles, policy briefs — trip up most extraction tools. A naïve top-to-bottom read pulls a word from column 1, then a word from column 2, then back to column 1, producing garbled output like accompaally ("accompa" from column 1 joined to "ally" from column 2).
PDF Oxide uses an XY-cut algorithm to detect columns and produce natural reading order automatically. Since v0.3.34 it also guards against sparse-layout false positives (copyright pages, title pages) and correctly handles mixed layouts where a table sits inside body text.
Quick Example
Extraction is column-aware by default — no flag needed:
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("academic-paper.pdf")
text = doc.extract_text(0)
# Columns are read top-to-bottom within each column, not interleaved.
Rust
use pdf_oxide::PdfDocument;
let mut doc = PdfDocument::open("academic-paper.pdf")?;
let text = doc.extract_text(0)?;
JavaScript / TypeScript (Node)
const { PdfDocument } = require("pdf-oxide");
const doc = new PdfDocument("academic-paper.pdf");
const text = doc.extractText(0);
doc.close();
JavaScript (WASM)
import { WasmPdfDocument } from "pdf-oxide-wasm";
const doc = new WasmPdfDocument(bytes);
console.log(doc.extractText(0));
doc.free();
Go
doc, _ := pdfoxide.Open("academic-paper.pdf")
defer doc.Close()
text, _ := doc.ExtractText(0)
fmt.Println(text)
C#
using PdfOxide;
using var doc = PdfDocument.Open("academic-paper.pdf");
Console.WriteLine(doc.ExtractText(0));
What XY-cut Does
The XY-cut algorithm recursively splits a page into rectangular regions by alternating vertical and horizontal cuts along whitespace gutters:
- Project all characters onto the X axis. If a tall, wide vertical gap shows up (the column gutter), split the page into two regions at that X coordinate.
- Within each region, project onto the Y axis and split on horizontal gutters (paragraph breaks, section boundaries).
- Recurse until each leaf region has no strong gutter — these are the atomic blocks.
- Serialize blocks in recursion order: the left region before the right after a vertical cut, and the top region before the bottom after a horizontal cut.
This matches how a human reads: column 1 top to bottom, then column 2 top to bottom, then any full-width footer.
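The steps above can be sketched in a few lines of plain Python. This is a minimal illustration of a recursive XY-cut, not PDF Oxide's internal implementation: the `Box` type, the `find_gutter` helper, and the `min_x_gap` / `min_y_gap` thresholds are all assumptions made for the example, and the sketch uses a top-origin Y axis (y grows downward) for readability.

```python
from dataclasses import dataclass

@dataclass
class Box:
    text: str
    x0: float
    y0: float  # top edge; y grows downward in this sketch
    x1: float
    y1: float  # bottom edge

def find_gutter(intervals, min_gap):
    """Midpoint of the widest empty gap (>= min_gap) in a 1-D projection, else None."""
    intervals = sorted(intervals)
    reach = intervals[0][1]  # furthest extent covered so far
    best_gap, best_mid = 0.0, None
    for lo, hi in intervals[1:]:
        gap = lo - reach
        if gap >= min_gap and gap > best_gap:
            best_gap, best_mid = gap, (reach + lo) / 2
        reach = max(reach, hi)
    return best_mid

def xy_cut(boxes, min_x_gap=20.0, min_y_gap=8.0):
    """Return leaf blocks (lists of boxes) in reading order."""
    if not boxes:
        return []
    # Vertical cut: a gutter in the X projection separates left from right.
    cut = find_gutter([(b.x0, b.x1) for b in boxes], min_x_gap)
    if cut is not None:
        left = [b for b in boxes if b.x1 <= cut]
        right = [b for b in boxes if b.x0 >= cut]
        return xy_cut(left, min_x_gap, min_y_gap) + xy_cut(right, min_x_gap, min_y_gap)
    # Horizontal cut: a gutter in the Y projection separates top from bottom.
    cut = find_gutter([(b.y0, b.y1) for b in boxes], min_y_gap)
    if cut is not None:
        top = [b for b in boxes if b.y1 <= cut]
        bottom = [b for b in boxes if b.y0 >= cut]
        return xy_cut(top, min_x_gap, min_y_gap) + xy_cut(bottom, min_x_gap, min_y_gap)
    # No strong gutter left: this is an atomic block.
    return [sorted(boxes, key=lambda b: (b.y0, b.x0))]

# Two columns of two lines each, plus a full-width footer, in scrambled order.
page = [
    Box("R1", 130, 0, 230, 10), Box("L1", 0, 0, 100, 10),
    Box("footer", 0, 40, 230, 50),
    Box("R2", 130, 12, 230, 22), Box("L2", 0, 12, 100, 22),
]
order = [b.text for block in xy_cut(page) for b in block]
print(order)  # ['L1', 'L2', 'R1', 'R2', 'footer']
```

The full-width footer bridges the column gutter, so the first successful cut is horizontal (body above, footer below). Only then does the vertical gutter become visible inside the body region.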
When XY-cut Activates
XY-cut runs automatically when extract_text detects a multi-column layout. It is skipped for:
- Single-column pages (no vertical gutter is found, so the default row-aware sort is used)
- Sparse pages with fewer than ~10 text spans per apparent column — these are typically title or copyright pages where two X-center peaks are an artefact rather than real columns (fixed in v0.3.34)
No configuration is needed for the common case. To bypass column-aware ordering and apply your own sort, see “Opt-out or Custom Ordering” below.
What v0.3.34 Fixed
Interleaved multi-column output on untagged PDFs
On untagged multi-column PDFs (academic textbooks, genetics references), extract_text previously applied XY-cut inside extract_spans() and then re-sorted the result with a row-aware sort in extract_text_with_options, undoing the column structure. Result: garbled fragments like accompaally.
Fix: the row-aware re-sort is now skipped on pages that are genuinely multi-column. Verified clean on Hartwell Genetics, Murphy ML, and Kandel Neural Science textbooks.
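To see why the second sort was harmful, consider this small illustration. It is not the library's code: the span tuples and the `row_aware_sort` helper are made up for the example, and y grows downward here.

```python
# (text, x, y) spans already in column order; y grows downward in this sketch.
column_order = [
    ("accompa", 50, 100), ("nied", 50, 112),      # column 1
    ("ally", 320, 100), ("speaking", 320, 112),   # column 2
]

def row_aware_sort(spans):
    """Sort by row first, then by X: fine for one column, wrong for two."""
    return sorted(spans, key=lambda s: (s[2], s[1]))

def texts(spans):
    return [s[0] for s in spans]

resorted = row_aware_sort(column_order)
print("".join(texts(resorted)[:2]))  # accompaally
```

Both columns share the same baselines, so sorting by row pairs the first line of column 1 with the first line of column 2.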
Table-within-text pages
Mixed-layout pages (a table embedded in running body text) could trick the column detector because tab-expanded table rows filled the column gutter. Fix:
- Wide spans (>55 % of region width) are excluded from the projection density — tab-padded rows no longer mask the gutter.
- Single-character spans (table cell values like G, T) are excluded from the projection so they don’t scatter across the gutter.
- Coverage uses a character-count estimate rather than raw bbox width, so tab-padded rows no longer masquerade as dense body text.
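The first two filters might look roughly like the sketch below. The `projection_spans` function and its `(text, x0, x1)` tuple layout are hypothetical. Only the 55 % threshold comes from the text above, and the character-count coverage estimate is not modelled.

```python
def projection_spans(spans, region_x0, region_x1, max_width_ratio=0.55):
    """Keep only spans that should contribute to the X projection."""
    region_width = region_x1 - region_x0
    kept = []
    for text, x0, x1 in spans:
        if (x1 - x0) > max_width_ratio * region_width:
            continue  # wide span, e.g. a tab-padded table row bridging the gutter
        if len(text.strip()) <= 1:
            continue  # single-character span, e.g. a table cell value
        kept.append((text, x0, x1))
    return kept

spans = [
    ("Gene\tAllele\tFreq", 40, 560),   # table row spanning the whole region
    ("G", 300, 306),                   # lone cell value sitting in the gutter
    ("running body text", 40, 280),    # left column
    ("second column text", 320, 560),  # right column
]
kept = projection_spans(spans, region_x0=40, region_x1=560)
print([t for t, _, _ in kept])  # ['running body text', 'second column text']
```

With the table row and the lone cell value removed, the band between x=280 and x=320 is empty again and can be detected as a gutter.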
Sparse-layout false positives
Copyright pages, title pages, and colophons can produce two X-center peaks with only 7–10 spans per “column”. These are no longer treated as multi-column, preventing XY-cut from splitting sentences whose halves sit at different X positions on the same line.
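A guard of this kind can be approximated as a per-column span count. The function name, the `split_x` parameter, and the exact threshold of 10 are assumptions for illustration. The source only says "fewer than ~10 text spans per apparent column".

```python
def looks_multi_column(x_centers, split_x, min_spans_per_column=10):
    """Accept a two-column split only if both sides hold enough spans."""
    left = sum(1 for x in x_centers if x < split_x)
    right = len(x_centers) - left
    return min(left, right) >= min_spans_per_column

title_page = [150.0] * 7 + [450.0] * 8   # two X-center peaks, few spans each
body_page = [150.0] * 40 + [450.0] * 40  # two dense columns

print(looks_multi_column(title_page, split_x=300.0))  # False
print(looks_multi_column(body_page, split_x=300.0))   # True
```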
Structured Access per Column
Going lower-level than extract_text, you can pull words or character-level data with the same column ordering applied:
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("paper.pdf")
for w in doc.extract_words(0):
    print(f"{w.text} ({w.x0:.0f},{w.y0:.0f})")
Rust
let mut doc = PdfDocument::open("paper.pdf")?;
for w in doc.extract_words(0)? {
    println!("{} ({:.0},{:.0})", w.text, w.x0, w.y0);
}
Go
doc, _ := pdfoxide.Open("paper.pdf")
defer doc.Close()
words, _ := doc.ExtractWords(0)
for _, w := range words {
    fmt.Printf("%s (%.0f,%.0f)\n", w.Text, w.X0, w.Y0)
}
C#
using var doc = PdfDocument.Open("paper.pdf");
// Node/C# return rows of (text, x, y, w, h):
var lines = doc.ExtractTextLines(0);
foreach (var (text, x, y, w, h) in lines)
    Console.WriteLine($"{text} ({x:F0},{y:F0})");
Each word / line carries its bounding box so you can group by column and re-order yourself if you need a custom policy (e.g. read the right column first for Arabic layouts).
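As a sketch of such a custom policy, the snippet below partitions words around a known gutter position and emits the right column first. The `Word` tuple stands in for the records returned by `extract_words` (only `text`, `x0`, and `x1` are used), and `gutter_x` is a hypothetical value you would supply or estimate yourself.

```python
from collections import namedtuple

# Stand-in for the library's word records; field names follow the examples above.
Word = namedtuple("Word", "text x0 y0 x1 y1")

def right_column_first(words, gutter_x):
    """Stable partition by X-center: right-column words, then left-column words."""
    right = [w for w in words if (w.x0 + w.x1) / 2 >= gutter_x]
    left = [w for w in words if (w.x0 + w.x1) / 2 < gutter_x]
    return right + left  # order inside each column is preserved

words = [
    Word("L1", 50, 700, 250, 712), Word("L2", 50, 686, 250, 698),
    Word("R1", 320, 700, 520, 712), Word("R2", 320, 686, 520, 698),
]
print([w.text for w in right_column_first(words, gutter_x=285)])
# ['R1', 'R2', 'L1', 'L2']
```

Because the partition is stable, whatever top-to-bottom order the words already had inside each column is kept.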
Detecting Multi-Column Pages Manually
If you want to branch on whether a page is multi-column before extracting:
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("mixed.pdf")
for i in range(doc.page_count()):
    words = doc.extract_words(i)
    # Coarse heuristic: bucket word X-centers into 50-unit bins. Wide single-column
    # lines can also fill several bins, so treat a hit as a hint, not a verdict.
    x_centers = {round((w.x0 + w.x1) / 2 / 50) * 50 for w in words}
    if len(x_centers) >= 2:
        print(f"Page {i}: likely multi-column ({len(x_centers)} X-centers)")
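Counting X-center buckets is a coarse signal, because long single-column lines also spread word centers across several buckets. A stricter manual check is to look for an empty vertical band near the middle of the page. The sketch below works on plain `(x0, x1)` ranges. `has_column_gutter`, `min_gap`, and `margin_ratio` are illustrative names and values, not part of the library.

```python
def has_column_gutter(x_ranges, min_gap=15.0, margin_ratio=0.25):
    """True if an empty X band of at least min_gap lies in the central part of the text area."""
    if not x_ranges:
        return False
    ranges = sorted(x_ranges)
    left = ranges[0][0]
    right = max(hi for _, hi in ranges)
    width = right - left
    reach = ranges[0][1]  # furthest X covered so far
    for lo, hi in ranges[1:]:
        if lo - reach >= min_gap:
            mid = (reach + lo) / 2
            # Ignore gaps hugging the margins; a column gutter sits near the middle.
            if left + margin_ratio * width <= mid <= right - margin_ratio * width:
                return True
        reach = max(reach, hi)
    return False

single_column = [(72, 200), (203, 390), (393, 540), (72, 310), (313, 540)]
two_columns = [(72, 180), (183, 290), (320, 430), (433, 540)]

print(has_column_gutter(single_column))  # False
print(has_column_gutter(two_columns))    # True
```

With real data you would pass something like `[(w.x0, w.x1) for w in doc.extract_words(i)]`.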
For production use, prefer extract_text and let the library’s XY-cut + sparse-layout guard make the call.
Opt-out or Custom Ordering
If you want raw, position-ordered spans (e.g. for a custom layout engine), use extract_chars or extract_words — these return records with bounding boxes, and you can apply your own sort:
Python
chars = doc.extract_chars(0)
# Top-to-bottom, then left-to-right — ignores columns
chars_sorted = sorted(chars, key=lambda c: (-c.y, c.x))
Rust
let mut chars = doc.extract_chars(0)?;
chars.sort_by(|a, b| {
    // total_cmp avoids the panic that partial_cmp().unwrap() hits on NaN (assumes float fields)
    b.y.total_cmp(&a.y).then(a.x.total_cmp(&b.x))
});
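Sorting on exact Y values can scramble a line when baselines differ by a fraction of a unit. One possible refinement is to group characters into lines with a tolerance before sorting by X. The sketch below uses plain `(text, x, y)` tuples with PDF-style Y (larger means higher on the page). `sort_rows` and `line_tol` are assumptions for illustration.

```python
def sort_rows(chars, line_tol=2.0):
    """Top-to-bottom by line, left-to-right within a line, tolerating small Y jitter."""
    ordered = sorted(chars, key=lambda c: -c[2])  # highest Y (top of page) first
    lines, current = [], []
    for c in ordered:
        # Start a new line once Y drifts too far from the current line's first char.
        if current and abs(current[0][2] - c[2]) > line_tol:
            lines.append(current)
            current = []
        current.append(c)
    if current:
        lines.append(current)
    return [c for line in lines for c in sorted(line, key=lambda c: c[1])]

chars = [
    ("i", 16, 700.4), ("H", 10, 700.0), ("!", 20, 699.8),  # first line, jittery Y
    ("k", 16, 688.3), ("o", 10, 688.0),                    # second line
]
naive = sorted(chars, key=lambda c: (-c[2], c[1]))
print("".join(c[0] for c in naive))             # iH!ko
print("".join(c[0] for c in sort_rows(chars)))  # Hi!ok
```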
Related Pages
- Text Extraction — full extraction API
- Extraction Profiles — tune space detection per document type
- Extract Tables from PDF — structured table output
- Changelog — v0.3.34 multi-column and mixed-layout fixes