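Reading Order and XY-cut: Extracting Multi-Column PDFs in Natural Order
Multi-column PDFs such as academic papers, textbooks, magazine articles, and policy reports defeat most extraction tools. A naive top-to-bottom scan interleaves words from the first and second columns, producing garbled output such as accompaally (the "accompa" from column one fused with the "ally" from column two).
PDF Oxide uses the XY-cut algorithm to detect columns automatically and emit text in natural reading order. Since v0.3.34 it also guards against false positives on sparse layouts (copyright pages, title pages) and correctly handles mixed layouts where tables are embedded in body text.
Quick example
Extraction is column-aware by default; no extra configuration is needed: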
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("academic-paper.pdf")
text = doc.extract_text(0)
# Columns are read top-to-bottom within each column, not interleaved.
Rust
use pdf_oxide::PdfDocument;
let mut doc = PdfDocument::open("academic-paper.pdf")?;
let text = doc.extract_text(0)?;
JavaScript / TypeScript (Node)
const { PdfDocument } = require("pdf-oxide");
const doc = new PdfDocument("academic-paper.pdf");
const text = doc.extractText(0);
doc.close();
JavaScript (WASM)
import { WasmPdfDocument } from "pdf-oxide-wasm";
const doc = new WasmPdfDocument(bytes);
console.log(doc.extractText(0));
doc.free();
Go
doc, _ := pdfoxide.Open("academic-paper.pdf")
defer doc.Close()
text, _ := doc.ExtractText(0)
fmt.Println(text)
C#
using PdfOxide;
using var doc = PdfDocument.Open("academic-paper.pdf");
Console.WriteLine(doc.ExtractText(0));
Java
import fyi.oxide.pdf.PdfDocument;
import java.nio.file.Path;
try (PdfDocument doc = PdfDocument.open(Path.of("academic-paper.pdf"))) {
String text = doc.extractText(0);
}
Kotlin
import fyi.oxide.pdf.PdfDocument
import java.nio.file.Path
PdfDocument.open(Path.of("academic-paper.pdf")).use { doc ->
val text = doc.extractText(0)
}
Scala
import fyi.oxide.pdf.PdfDocument
import scala.util.Using
Using.resource(PdfDocument.open("academic-paper.pdf")) { doc =>
val text = doc.extractText(0)
}
Clojure
(require '[pdf-oxide.core :as pdf])
(with-open [doc (pdf/open "academic-paper.pdf")]
(pdf/extract-text doc 0))
Ruby
require 'pdf_oxide'
PdfOxide::PdfDocument.open('academic-paper.pdf') do |doc|
text = doc.extract_text(0)
end
PHP
use PdfOxide\PdfDocument;
$doc = PdfDocument::open('academic-paper.pdf');
$text = $doc->extractText(0);
$doc->close();
C++
#include <pdf_oxide/pdf_oxide.hpp>
auto doc = pdf_oxide::Document::open("academic-paper.pdf");
auto text = doc.extract_text(0);
Swift
import PdfOxide
let doc = try Document.open("academic-paper.pdf")
let text = try doc.extractText(0)
Dart
import 'package:pdf_oxide/pdf_oxide.dart';
final doc = PdfDocument.open('academic-paper.pdf');
final text = doc.extractText(0);
doc.close();
R
library(pdfoxide)
doc <- pdf_open("academic-paper.pdf")
text <- pdf_extract_text(doc, 0)
Julia
using PdfOxide
doc = open_document("academic-paper.pdf")
text = extract_text(doc, 0)
Zig
const std = @import("std");
const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;
var doc = try pdf_oxide.Document.open("academic-paper.pdf");
const text = try doc.extractText(a, 0);
Objective-C
#import "POXPdfOxide.h"
NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"academic-paper.pdf" error:&err];
NSString *text = [doc extractText:0 error:&err];
Elixir
{:ok, doc} = PdfOxide.open("academic-paper.pdf")
{:ok, text} = PdfOxide.extract_text(doc, 0)
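How XY-cut works
The XY-cut algorithm recursively partitions the page into rectangular regions by alternating vertical and horizontal cuts along whitespace gaps (such as column gutters):
- Project all characters onto the X axis. If a tall, wide vertical band of whitespace (a column gutter) appears, split the page into two regions at that X coordinate.
- Within each region, project onto the Y axis and cut along horizontal gaps (paragraph spacing, section boundaries).
- Recurse until no leaf region has a significant gap left; these leaves are the minimal text blocks.
- Serialize the text blocks in cut order: top to bottom within a region, left to right across columns.
This matches how people read: the first column from top to bottom, then the second column, and finally any footer that spans both columns.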
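The steps above can be sketched in a few lines of Python. This is a minimal illustration of recursive XY-cut on plain bounding boxes, not PDF Oxide's implementation: the Box type, the helper names, and the gap thresholds (15 and 8 points) are assumptions made for the example, and coordinates are assumed to follow the PDF convention in which y grows upward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical text block; PDF-style coordinates (y grows upward)."""
    text: str
    x0: float
    y0: float  # bottom edge
    x1: float
    y1: float  # top edge


def widest_gap(intervals, min_gap):
    """Midpoint of the widest empty gap (>= min_gap) between intervals, or None."""
    best_width, best_mid = min_gap, None
    reach = None  # furthest right/top edge covered so far
    for lo, hi in sorted(intervals):
        if reach is not None and lo - reach >= best_width:
            best_width, best_mid = lo - reach, (lo + reach) / 2
        reach = hi if reach is None else max(reach, hi)
    return best_mid


def xy_cut(boxes, min_x_gap=15.0, min_y_gap=8.0):
    """Return boxes in reading order using recursive XY-cut."""
    if len(boxes) <= 1:
        return list(boxes)
    # 1. Project onto X: a wide empty band is a column gutter.
    x = widest_gap([(b.x0, b.x1) for b in boxes], min_x_gap)
    if x is not None:
        left = [b for b in boxes if b.x1 <= x]
        right = [b for b in boxes if b.x0 >= x]
        return xy_cut(left, min_x_gap, min_y_gap) + xy_cut(right, min_x_gap, min_y_gap)
    # 2. Project onto Y: a tall empty band separates stacked regions.
    y = widest_gap([(b.y0, b.y1) for b in boxes], min_y_gap)
    if y is not None:
        upper = [b for b in boxes if b.y0 >= y]
        lower = [b for b in boxes if b.y1 <= y]
        return xy_cut(upper, min_x_gap, min_y_gap) + xy_cut(lower, min_x_gap, min_y_gap)
    # 3. Leaf block: no significant gap left, so read it line by line.
    return sorted(boxes, key=lambda b: (-b.y1, b.x0))


# Two columns of two lines each, plus a footer spanning both columns.
page = [
    Box("R2", 320, 686, 550, 698),
    Box("L1", 50, 700, 280, 712),
    Box("Footer", 50, 40, 550, 52),
    Box("R1", 320, 700, 550, 712),
    Box("L2", 50, 686, 280, 698),
]
print([b.text for b in xy_cut(page)])
# → ['L1', 'L2', 'R1', 'R2', 'Footer']
```

The footer blocks the vertical cut at the top level, so the first successful cut is horizontal (body above, footer below); the gutter is only found once the body is considered on its own. A plain sort by (-y1, x0) on the same input would yield L1, R1, L2, R2, Footer, which is the interleaving described at the top of this page.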
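When XY-cut is triggered
XY-cut is enabled automatically when extract_text detects a multi-column layout. It is not triggered for:
- Single-column pages (no vertical gutter is found, so the default line-aware ordering is used)
- Sparse pages with fewer than about 10 text spans per "column". These are usually title or copyright pages, where two X-center peaks are noise rather than real columns (fixed in v0.3.34)
Typical use needs no configuration. To force a particular mode, see "Opting out and custom ordering" below.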
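The two conditions above amount to a small decision rule. The sketch below is only an illustration of that rule: the function name choose_ordering, its arguments, and the exact cutoff of 10 are assumptions (the text says "about 10"), and the library's internal check may differ.

```python
def choose_ordering(span_boxes, gutter_x, min_spans_per_column=10):
    """Pick an ordering mode for one page.

    span_boxes: list of (x0, x1) horizontal extents of text spans.
    gutter_x:   x coordinate of a candidate column gutter, or None if no
                vertical gap was found.
    Returns "xy_cut" or "line_aware" (hypothetical mode names).
    """
    if gutter_x is None:
        return "line_aware"  # single-column page
    left = sum(1 for _, x1 in span_boxes if x1 <= gutter_x)
    right = sum(1 for x0, _ in span_boxes if x0 >= gutter_x)
    if min(left, right) < min_spans_per_column:
        return "line_aware"  # sparse page: the two "columns" are likely noise
    return "xy_cut"
```

For example, a title page with 8 spans on each side of a gap would fall back to line-aware ordering, while a body page with 30 spans per side would use XY-cut.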
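What v0.3.34 fixed
Interleaved multi-column output in untagged PDFs
In untagged multi-column PDFs (academic textbooks, genetics references, and the like), extract_text previously ran XY-cut inside extract_spans() and then re-sorted the result with line-aware ordering in extract_text_with_options. That second pass destroyed the column structure and produced garbled fragments such as accompaally.
Fix: line-aware re-sorting is skipped for pages that are genuinely multi-column. Verified against the Hartwell Genetics, Murphy ML, and Kandel Neural Science textbooks.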
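Pages with tables embedded in body text
On pages that mix body text with embedded tables, tab-expanded table rows filled the column gutter and misled the column detector. The fix:
- Wide spans (wider than 55% of the region width) are excluded from the projection density, so tab-padded rows no longer mask the gutter.
- Single-character spans (such as the table cell values G and T) are excluded from the projection, so they no longer scatter across the gutter.
- Coverage is now estimated from character counts rather than raw bounding-box width, so tab-padded rows are no longer mistaken for dense body text.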
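The first two rules can be pictured as a filter applied before the X projection is computed. The sketch below is an assumption-laden illustration, not the library's code: the function name and the (text, x0, x1) tuple layout are invented for the example, and only the 55% threshold comes from the text above. The character-count coverage estimate is not shown.

```python
def spans_for_projection(spans, region_x0, region_x1, max_width_ratio=0.55):
    """Keep only the spans that should contribute to gutter detection.

    spans: list of (text, x0, x1) tuples inside the region.
    """
    region_width = region_x1 - region_x0
    kept = []
    for text, x0, x1 in spans:
        if x1 - x0 > max_width_ratio * region_width:
            continue  # wide span, e.g. a tab-padded table row crossing the gutter
        if len(text.strip()) <= 1:
            continue  # single-character span, e.g. a table cell value
        kept.append((text, x0, x1))
    return kept


spans = [
    ("Body text on the left", 50, 280),
    ("A\tT\tG\tC", 50, 500),            # table row spanning the gutter
    ("G", 295, 301),                      # lone cell value inside the gutter
    ("Body text on the right", 320, 550),
]
print([t for t, _, _ in spans_for_projection(spans, 50, 550)])
# → ['Body text on the left', 'Body text on the right']
```

With the table row and the lone cell value removed, the band between x = 280 and x = 320 is empty again, so a gutter search over the remaining spans can succeed.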
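False positives on sparse layouts
Copyright, title, and colophon pages can produce two X-center peaks while having only 7–10 spans per "column". Such pages are no longer treated as multi-column, which stops XY-cut from wrongly splitting sentences that sit on the same line at different X positions.
Accessing structured data by column
For lower-level work than extract_text allows, you can fetch word-level or character-level data while preserving column order: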
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("paper.pdf")
for w in doc.extract_words(0):
print(f"{w.text} ({w.x0:.0f},{w.y0:.0f})")
Rust
let mut doc = PdfDocument::open("paper.pdf")?;
for w in doc.extract_words(0)? {
println!("{} ({:.0},{:.0})", w.text, w.x0, w.y0);
}
Go
doc, _ := pdfoxide.Open("paper.pdf")
defer doc.Close()
words, _ := doc.ExtractWords(0)
for _, w := range words {
fmt.Printf("%s (%.0f,%.0f)\n", w.Text, w.X0, w.Y0)
}
C#
using var doc = PdfDocument.Open("paper.pdf");
// Node/C# return rows of (text, x, y, w, h):
var lines = doc.ExtractTextLines(0);
foreach (var (text, x, y, w, h) in lines)
Console.WriteLine($"{text} ({x:F0},{y:F0})");
Java
try (PdfDocument doc = PdfDocument.open(Path.of("paper.pdf"))) {
for (TextWord w : doc.page(0).words()) {
System.out.printf("%s (%.0f,%.0f)%n", w.text(), w.bbox().x0(), w.bbox().y0());
}
}
Kotlin
PdfDocument.open(Path.of("paper.pdf")).use { doc ->
for (w in doc.page(0).words()) {
println("${w.text()} (${w.bbox().x0()},${w.bbox().y0()})")
}
}
Scala
Using.resource(PdfDocument.open("paper.pdf")) { doc =>
doc.page(0).wordsSeq.foreach { w =>
println(f"${w.text} (${w.bbox.x0}%.0f,${w.bbox.y0}%.0f)")
}
}
Clojure
(with-open [doc (pdf/open "paper.pdf")]
(doseq [w (pdf/words (pdf/page doc 0))]
(printf "%s (%.0f,%.0f)%n" (.text w) (.. w bbox x0) (.. w bbox y0))))
C++
auto doc = pdf_oxide::Document::open("paper.pdf");
for (const auto& w : doc.extract_words(0)) {
std::printf("%s (%.0f,%.0f)\n", w.text.c_str(), w.bbox.x, w.bbox.y);
}
Swift
let doc = try Document.open("paper.pdf")
for w in try doc.extractWords(0) {
print("\(w.text) (\(w.bbox.x),\(w.bbox.y))")
}
Dart
final doc = PdfDocument.open('paper.pdf');
for (final w in doc.extractWords(0)) {
print('${w.text} (${w.bbox.x},${w.bbox.y})');
}
doc.close();
R
doc <- pdf_open("paper.pdf")
words <- pdf_extract_words(doc, 0)
for (w in words) {
cat(sprintf("%s (%.0f,%.0f)\n", w$text, w$bbox$x, w$bbox$y))
}
Julia
doc = open_document("paper.pdf")
for w in extract_words(doc, 0)
println("$(w.text) ($(w.bbox.x),$(w.bbox.y))")
end
Zig
var doc = try pdf_oxide.Document.open("paper.pdf");
const words = try doc.extractWords(a, 0);
defer pdf_oxide.Document.freeWords(a, words);
for (words) |w| {
std.debug.print("{s} ({d:.0},{d:.0})\n", .{ w.text, w.bbox.x, w.bbox.y });
}
Objective-C
POXDocument *doc = [POXDocument openPath:@"paper.pdf" error:&err];
for (POXWord *w in [doc extractWords:0 error:&err]) {
NSLog(@"%@ (%.0f,%.0f)", w.text, w.bbox.x, w.bbox.y);
}
Elixir
{:ok, doc} = PdfOxide.open("paper.pdf")
{:ok, words} = PdfOxide.extract_words(doc, 0)
Enum.each(words, fn w ->
IO.puts("#{w.text} (#{w.bbox.x},#{w.bbox.y})")
end)
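Every word and line carries bounding-box coordinates, so you can group by column and apply your own ordering strategy (for example, reading the right column first in an Arabic layout).
Detecting multi-column pages manually
To check whether a page is multi-column before extracting: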
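As a rough sketch of such a custom strategy, the function below splits words at a known gutter and emits the right column first. It works on plain (text, x0, y0, x1, y1) tuples rather than the library's word objects, so the tuple layout, the function name, and the assumption that y grows upward are all illustrative.

```python
def order_columns(words, gutter_x, right_first=False):
    """Return word texts column by column.

    words:    list of (text, x0, y0, x1, y1), y growing upward.
    gutter_x: x coordinate separating the two columns.
    """
    def center_x(w):
        return (w[1] + w[3]) / 2

    left = [w for w in words if center_x(w) < gutter_x]
    right = [w for w in words if center_x(w) >= gutter_x]

    def reading_key(w):
        return (-w[2], w[1])  # top to bottom, then left to right

    columns = [right, left] if right_first else [left, right]
    return [w[0] for col in columns for w in sorted(col, key=reading_key)]


words = [
    ("a", 50, 700, 100, 712), ("b", 320, 700, 370, 712),
    ("c", 50, 686, 100, 698), ("d", 320, 686, 370, 698),
]
print(order_columns(words, 300))                    # → ['a', 'c', 'b', 'd']
print(order_columns(words, 300, right_first=True))  # → ['b', 'd', 'a', 'c']
```

The sketch only swaps column priority; within each column it still sorts left to right, which a real right-to-left pipeline would probably want to reverse as well.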
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("mixed.pdf")
for i in range(doc.page_count()):
words = doc.extract_words(i)
# Heuristic: distinct X-center clusters
x_centers = {round((w.x0 + w.x1) / 2 / 50) * 50 for w in words}
if len(x_centers) >= 2:
print(f"Page {i}: likely multi-column ({len(x_centers)} X-centers)")
Java
try (PdfDocument doc = PdfDocument.open(Path.of("mixed.pdf"))) {
for (int i = 0; i < doc.pageCount(); i++) {
Set<Long> xCenters = new HashSet<>();
for (TextWord w : doc.page(i).words()) {
double cx = w.bbox().x0() + w.bbox().width() / 2;
xCenters.add(Math.round(cx / 50) * 50L);
}
if (xCenters.size() >= 2)
System.out.printf("Page %d: likely multi-column (%d X-centers)%n", i, xCenters.size());
}
}
Kotlin
PdfDocument.open(Path.of("mixed.pdf")).use { doc ->
for (i in 0 until doc.pageCount()) {
val xCenters = doc.page(i).words().map {
(Math.round((it.bbox().x0() + it.bbox().width() / 2) / 50) * 50)
}.toSet()
if (xCenters.size >= 2)
println("Page $i: likely multi-column (${xCenters.size} X-centers)")
}
}
Scala
Using.resource(PdfDocument.open("mixed.pdf")) { doc =>
for (i <- 0 until doc.pageCount()) {
val xCenters = doc.page(i).wordsSeq.map { w =>
math.round((w.bbox.x0 + w.bbox.width / 2) / 50) * 50
}.toSet
if (xCenters.size >= 2)
println(s"Page $i: likely multi-column (${xCenters.size} X-centers)")
}
}
Clojure
(with-open [doc (pdf/open "mixed.pdf")]
(doseq [i (range (pdf/page-count doc))]
(let [xs (set (map #(* 50 (Math/round (/ (+ (.. % bbox x0) (/ (.. % bbox width) 2)) 50.0)))
(pdf/words (pdf/page doc i))))]
(when (>= (count xs) 2)
(printf "Page %d: likely multi-column (%d X-centers)%n" i (count xs))))))
C++
auto doc = pdf_oxide::Document::open("mixed.pdf");
for (int i = 0; i < doc.page_count(); ++i) {
std::set<long> x_centers;
for (const auto& w : doc.extract_words(i))
x_centers.insert(std::lround((w.bbox.x + w.bbox.width / 2) / 50) * 50);
if (x_centers.size() >= 2)
std::printf("Page %d: likely multi-column (%zu X-centers)\n", i, x_centers.size());
}
Swift
let doc = try Document.open("mixed.pdf")
for i in 0..<(try doc.pageCount()) {
let xCenters = Set(try doc.extractWords(i).map {
(($0.bbox.x + $0.bbox.width / 2) / 50).rounded() * 50
})
if xCenters.count >= 2 {
print("Page \(i): likely multi-column (\(xCenters.count) X-centers)")
}
}
Dart
final doc = PdfDocument.open('mixed.pdf');
for (var i = 0; i < doc.pageCount; i++) {
final xCenters = doc.extractWords(i)
.map((w) => ((w.bbox.x + w.bbox.width / 2) / 50).round() * 50)
.toSet();
if (xCenters.length >= 2) {
print('Page $i: likely multi-column (${xCenters.length} X-centers)');
}
}
doc.close();
R
doc <- pdf_open("mixed.pdf")
for (i in 0:(pdf_page_count(doc) - 1)) {
words <- pdf_extract_words(doc, i)
x_centers <- unique(sapply(words, function(w)
round((w$bbox$x + w$bbox$width / 2) / 50) * 50))
if (length(x_centers) >= 2)
cat(sprintf("Page %d: likely multi-column (%d X-centers)\n", i, length(x_centers)))
}
Julia
doc = open_document("mixed.pdf")
for i in 0:(page_count(doc) - 1)
x_centers = Set(round(Int, (w.bbox.x + w.bbox.width / 2) / 50) * 50
for w in extract_words(doc, i))
if length(x_centers) >= 2
println("Page $i: likely multi-column ($(length(x_centers)) X-centers)")
end
end
Zig
var doc = try pdf_oxide.Document.open("mixed.pdf");
const n = try doc.pageCount();
var i: i32 = 0;
while (i < n) : (i += 1) {
const words = try doc.extractWords(a, i);
defer pdf_oxide.Document.freeWords(a, words);
var centers = std.AutoHashMap(i64, void).init(a);
defer centers.deinit();
for (words) |w| {
const c: i64 = @intFromFloat(@round((w.bbox.x + w.bbox.width / 2) / 50) * 50);
try centers.put(c, {});
}
if (centers.count() >= 2)
std.debug.print("Page {d}: likely multi-column ({d} X-centers)\n", .{ i, centers.count() });
}
Objective-C
POXDocument *doc = [POXDocument openPath:@"mixed.pdf" error:&err];
for (NSInteger i = 0; i < [doc pageCountError:&err]; i++) {
NSMutableSet<NSNumber*> *xCenters = [NSMutableSet set];
for (POXWord *w in [doc extractWords:i error:&err]) {
long c = lround((w.bbox.x + w.bbox.width / 2) / 50) * 50;
[xCenters addObject:@(c)];
}
if (xCenters.count >= 2)
NSLog(@"Page %ld: likely multi-column (%lu X-centers)", (long)i, (unsigned long)xCenters.count);
}
Elixir
{:ok, doc} = PdfOxide.open("mixed.pdf")
{:ok, n} = PdfOxide.page_count(doc)
for i <- 0..(n - 1) do
{:ok, words} = PdfOxide.extract_words(doc, i)
x_centers = words
|> Enum.map(fn w -> round((w.bbox.x + w.bbox.width / 2) / 50) * 50 end)
|> Enum.uniq()
if length(x_centers) >= 2 do
IO.puts("Page #{i}: likely multi-column (#{length(x_centers)} X-centers)")
end
end
In production, prefer calling extract_text directly and let the library's built-in XY-cut and sparse-layout guard handle this automatically.
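The bucket count in the examples above is deliberately coarse: words spread along a single full-width line also land in several 50-point buckets, so most text pages will pass the >= 2 test. If a manual check is still wanted, a somewhat stricter sketch is to look for a vertical line in the middle of the page that no word crosses. The function below is an assumption for illustration only; its name, the middle-third search window, and the 5-point step are not part of the library.

```python
def find_gutter(word_boxes, page_width, step=5.0):
    """Return an x coordinate in the middle third of the page that no word
    crosses and that has words on both sides, or None.

    word_boxes: list of (x0, x1) horizontal extents of words.
    """
    x = page_width / 3
    while x <= 2 * page_width / 3:
        crossed = any(x0 < x < x1 for x0, x1 in word_boxes)
        if not crossed:
            has_left = any(x1 <= x for _, x1 in word_boxes)
            has_right = any(x0 >= x for x0, _ in word_boxes)
            if has_left and has_right:
                return x
        x += step
    return None


two_column = [(50, 280)] * 12 + [(320, 550)] * 12
single_column = [(72, 540)] * 30
print(find_gutter(two_column, 612) is not None)  # → True
print(find_gutter(single_column, 612))           # → None
```

Sparse pages such as title pages can still pass this check, which is the case the library's per-column span count guards against.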
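Opting out and custom ordering
If you need the raw position-ordered spans (for a custom layout engine, say), use extract_chars or extract_words. Both return records with bounding boxes, to which you can apply your own ordering: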
Python
chars = doc.extract_chars(0)
# Top-to-bottom, then left-to-right — ignores columns
chars_sorted = sorted(chars, key=lambda c: (-c.y, c.x))
Rust
let mut chars = doc.extract_chars(0)?;
chars.sort_by(|a, b| b.y.partial_cmp(&a.y).unwrap()
.then(a.x.partial_cmp(&b.x).unwrap()));
Java
List<TextChar> chars = new ArrayList<>(doc.page(0).chars());
// Top-to-bottom, then left-to-right — ignores columns
chars.sort(Comparator
.comparingDouble((TextChar c) -> c.bbox().y0()).reversed()
.thenComparingDouble(c -> c.bbox().x0()));
Kotlin
val chars = doc.page(0).chars()
.sortedWith(compareByDescending<TextChar> { it.bbox().y0() }
.thenBy { it.bbox().x0() })
Scala
val chars = doc.page(0).charsSeq
.sortBy(c => (-c.bbox.y0, c.bbox.x0))
Clojure
(def chars
(sort-by (juxt #(- (.. % bbox y0)) #(.. % bbox x0))
(pdf/chars (pdf/page doc 0))))
C++
auto chars = doc.extract_chars(0);
// Top-to-bottom, then left-to-right — ignores columns
std::sort(chars.begin(), chars.end(), [](const auto& a, const auto& b) {
return a.bbox.y != b.bbox.y ? a.bbox.y > b.bbox.y : a.bbox.x < b.bbox.x;
});
Swift
let chars = try doc.extractChars(0).sorted {
$0.bbox.y != $1.bbox.y ? $0.bbox.y > $1.bbox.y : $0.bbox.x < $1.bbox.x
}
Dart
final chars = doc.extractChars(0)
..sort((a, b) => a.bbox.y != b.bbox.y
? b.bbox.y.compareTo(a.bbox.y)
: a.bbox.x.compareTo(b.bbox.x));
R
chars <- pdf_extract_chars(doc, 0)
# Top-to-bottom, then left-to-right — ignores columns
chars <- chars[order(-sapply(chars, function(c) c$bbox$y),
sapply(chars, function(c) c$bbox$x))]
Julia
chars = extract_chars(doc, 0)
# Top-to-bottom, then left-to-right — ignores columns
sort!(chars, by = c -> (-c.bbox.y, c.bbox.x))
Zig
const chars = try doc.extractChars(a, 0);
defer pdf_oxide.Document.freeChars(a, chars);
std.mem.sort(pdf_oxide.Char, chars, {}, struct {
fn lt(_: void, x: pdf_oxide.Char, y: pdf_oxide.Char) bool {
return if (x.bbox.y != y.bbox.y) x.bbox.y > y.bbox.y else x.bbox.x < y.bbox.x;
}
}.lt);
Objective-C
NSArray<POXChar*> *chars = [doc extractChars:0 error:&err];
// Top-to-bottom, then left-to-right — ignores columns
chars = [chars sortedArrayUsingComparator:^NSComparisonResult(POXChar *a, POXChar *b) {
if (a.bbox.y != b.bbox.y) return a.bbox.y > b.bbox.y ? NSOrderedAscending : NSOrderedDescending;
return a.bbox.x < b.bbox.x ? NSOrderedAscending : NSOrderedDescending;
}];
Elixir
{:ok, chars} = PdfOxide.extract_chars(doc, 0)
# Top-to-bottom, then left-to-right — ignores columns
chars = Enum.sort_by(chars, fn c -> {-c.bbox.y, c.bbox.x} end)
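Related pages
- Text extraction: the full extraction API
- Extraction profiles: tune space detection per document type
- Extracting tables from PDFs: structured table output
- Changelog: v0.3.34 multi-column and mixed-layout fixes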