在 Python 中从 PDF 提取表格
从 PDF 文档中提取表格是文档处理流水线中最常见的任务之一。无论是从年度报告中抽取财务数据、抓取产品目录,还是将结构化数据输入 LLM,可靠的表格提取都不可或缺。本指南涵盖在 Python 中从 PDF 提取表格所需了解的一切内容——从简洁的一行代码到处理跨页表格的生产级工作流。
检测引擎
PDF Oxide 采用通用的边 → 对齐/合并 → 交叉点 → 单元格 → 分组表格检测流水线——与 Tabula、pdfplumber 和 PyMuPDF 相同的方案,以纯 Rust 实现。
检测能力:
- 基于交叉点 — 找到水平×垂直线的交叉,从四角矩形构建单元格,通过 union-find 分组为表格。
- 扩展网格 — 当水平线和垂直线位于页面的不同区域时,从所有坐标的笛卡尔积构建虚拟网格。
- 列感知文本检测 — 通过 X 轴投影直方图切分双栏布局,然后对每列分别运行纯文本表格检测。
- 水平线界定的文本表格 — 检测仅有水平线而无垂直线的表格(学术论文中常见)。
- 混合行检测 — 当仅存在垂直边框时,从文本 Y 坐标推断行边界(发票行项目)。
- 虚线/虚点线重构 — 将短线段合并为连续边。
- 章节分隔符拆分 — 在全宽水平分隔线处拆分多节表单。
- 边覆盖过滤 — 移除不参与任何潜在网格的孤立边。
配置
TableDetectionConfig 提供可调参数:
| 字段 | 默认值 | 说明 |
|---|---|---|
horizontal_strategy |
"lines_strict" |
"lines_strict"、"lines"、"text" 或 "explicit" |
vertical_strategy |
"lines_strict" |
同上 |
v_split_gap |
20.0 pt |
触发拆分为独立表格的垂直线间距(v0.3.20 之前硬编码为 4pt) |
snap_tolerance |
3.0 pt |
边对齐合并容差 |
text_tolerance |
3.0 pt |
文本行合并容差 |
行为变更
从 v0.3.20 起,Python extract_tables() 的默认策略变为 Both(同时通过线条和文本检测)。依赖旧版纯文本检测默认值的页面应显式传入 horizontal_strategy="text" 和 vertical_strategy="text"。
Python 绑定现已正确从 table_settings 字典读取 vertical_strategy——此前该参数被静默忽略。
渲染
提取的表格以空格填充的列对齐方式输出(替代旧版的 ASCII 方框绘制字符)。货币和数字列自动右对齐。表单编号前缀("1 Apr 11" → "Apr 11")和装饰性破折号/下划线单元格("------")在渲染时会被去除。
使用 Markdown 转换从 PDF 提取表格数据:
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("invoice.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)
# Output includes tables in GFM format:
# | Item | Qty | Price |
# |------|-----|-------|
# | Widget | 10 | $9.99 |
WASM
import { WasmPdfDocument } from "pdf-oxide-wasm";
const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdown(0);
console.log(md);
// Output includes tables in GFM format:
// | Item | Qty | Price |
// |------|-----|-------|
// | Widget | 10 | $9.99 |
doc.free();
Rust
use pdf_oxide::PdfDocument;
let mut doc = PdfDocument::open("invoice.pdf")?;
let md = doc.to_markdown(0, true)?;
println!("{}", md);
Go
package main
import (
"fmt"
"log"
pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)
func main() {
doc, err := pdfoxide.Open("invoice.pdf")
if err != nil { log.Fatal(err) }
defer doc.Close()
md, err := doc.ToMarkdown(0)
if err != nil { log.Fatal(err) }
fmt.Println(md)
}
C#
using PdfOxide;
using var doc = PdfDocument.Open("invoice.pdf");
Console.WriteLine(doc.ToMarkdown(0));
Java
import fyi.oxide.pdf.PdfDocument;
try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("invoice.pdf"))) {
System.out.println(doc.toMarkdown(0));
}
PHP
use PdfOxide\PdfDocument;
$doc = PdfDocument::open('invoice.pdf');
echo $doc->toMarkdown(0);
$doc->close();
Ruby
require 'pdf_oxide'
PdfOxide::PdfDocument.open('invoice.pdf') do |doc|
puts doc.to_markdown(0)
end
C++
#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>
auto doc = pdf_oxide::Document::open("invoice.pdf");
std::cout << doc.to_markdown(0) << '\n';
Swift
import PdfOxide
let doc = try Document.open("invoice.pdf")
print(try doc.toMarkdown(0))
Kotlin
import fyi.oxide.pdf.PdfDocument
PdfDocument.open(java.nio.file.Path.of("invoice.pdf")).use { doc ->
println(doc.toMarkdown(0))
}
Dart
import 'package:pdf_oxide/pdf_oxide.dart';
final doc = PdfDocument.open('invoice.pdf');
print(doc.toMarkdown(0));
doc.close();
R
library(pdfoxide)
doc <- pdf_open("invoice.pdf")
cat(pdf_to_markdown(doc, 0))
Julia
using PdfOxide
doc = open_document("invoice.pdf")
println(to_markdown(doc, 0))
Zig
const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;
var doc = try pdf_oxide.Document.open("invoice.pdf");
const md = try doc.toMarkdown(a, 0);
defer a.free(md);
std.debug.print("{s}\n", .{md});
Scala
import fyi.oxide.pdf.PdfDocument
import scala.util.Using
Using.resource(PdfDocument.open("invoice.pdf")) { doc =>
println(doc.toMarkdown(0))
}
Clojure
(require '[pdf-oxide.core :as pdf])
(with-open [d (pdf/open "invoice.pdf")]
(println (pdf/to-markdown d 0)))
Objective-C
#import "POXPdfOxide.h"
NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"invoice.pdf" error:&err];
NSLog(@"%@", [doc toMarkdown:0 error:&err]);
Elixir
{:ok, doc} = PdfOxide.open("invoice.pdf")
{:ok, md} = PdfOxide.to_markdown(doc, 0)
IO.puts(md)
PDF Oxide 通过对对齐文本块进行空间分析来检测表格布局,并输出 GitHub Flavored Markdown 表格。
为什么从 PDF 提取表格如此困难
如果你曾尝试过从 PDF 中复制表格并粘贴到电子表格,就会知道结果通常一团糟。这不是 PDF 查看器的问题,而是 PDF 格式本身的根本性局限。
PDF 没有"表格"的概念。 与使用 <table>、<tr>、<td> 标签定义表格结构的 HTML 不同,PDF 文件只存储绘图指令:在坐标 (x, y) 处放置这个字形,从点 A 到点 B 画一条线。没有任何语义层告诉你"这些字符属于第 3 行第 2 列的单元格"。所有表格提取库都必须通过分析页面上文本和线条的空间位置来重建这种结构。
这种重建之所以困难,原因有以下几点:
-
有线条与无线条表格。 当表格有可见网格线时,提取工具可以将这些线条用作单元格边界。无边框表格——在财务报表、政府报告和学术论文中很常见——完全没有线条。库必须纯粹从文本块之间的空白间隔推断列边界,当列宽度不一或数字值右对齐时,这很容易出错。
-
合并单元格和跨列标题。 跨越三列的标题单元格看起来就像一个宽文本块。没有网格线来划定边界,解析器无法可靠地知道标题覆盖了哪些列。一些库处理得很好,许多则会悄悄产生乱码输出。
-
多行单元格内容。 当单元格包含换行的段落文本时,简单的基于行的解析会将每个换行视为独立的行。要正确地将这些行归入同一个单元格,需要理解每行的垂直范围。
-
跨页表格。 大型表格通常跨越两页或更多页。标题行可能在每页重复出现,也可能不重复,页面页脚、水印或页码可能出现在表格行之间。将这些片段拼接成一个完整的表格,需要具备页面感知逻辑。
-
旋转文本与非标准布局。 某些 PDF 对列标题使用旋转文本,或将表格置于多栏页面布局中。这些边缘情况会打破大多数解析器对从左到右、从上到下阅读顺序的假设。
理解这些挑战有助于你为特定文档选择合适的工具。对于简单的对齐表格——大多数发票、订单确认和简单报告——像 PDF Oxide 这样的快速空间分析方法效果很好。对于具有复杂合并、无边框布局或异常格式的文档,你可能需要一个具有更复杂启发算法的库。
表格提取:PDF Oxide vs 其他库
在 Python 中选择 PDF 表格提取库,取决于你的文档类型、性能要求以及所需的输出格式。以下是主要选项的比较:
| 库 | 表格检测 | 有边框表格 | 无边框表格 | 输出格式 | 速度 |
|---|---|---|---|---|---|
| PDF Oxide | 内置 | 是 | 基础 | Markdown/HTML | 0.8ms |
| pdfplumber | 内置 | 是 | 高级 | Python 列表 | 23.2ms |
| Camelot | 内置 | 是 | 是 (lattice/stream) | DataFrames | ~50ms+ |
| PyMuPDF | 基础 (v1.23+) | 是 | 有限 | DataFrames | 4.6ms |
| pypdf | 无 | 无 | 无 | N/A | N/A |
| tabula-py | 内置 | 是 | 是 | DataFrames | ~100ms+ (Java) |
PDF Oxide 是速度最快的选择,遥遥领先。它通过对齐文本块的空间分析来检测表格,并输出整洁的 GitHub Flavored Markdown 表格。平均提取时间 0.8ms,比 pdfplumber 快 29 倍,比 tabula-py 快超过 100 倍。它能很好地处理有边框表格和简单的对齐无边框表格。对于需要 Markdown 输出的 LLM 流水线,它是天然之选。
pdfplumber 拥有最成熟的无边框表格检测。其 find_tables() 方法使用可配置策略,根据文本对齐检测行和列,处理合并单元格和多行单元格内容的能力优于大多数替代方案。代价是速度:每页 23.2ms 对于批量处理来说明显偏慢。
Camelot 提供两种检测模式——lattice(有边框表格)和 stream(无边框表格)。它直接生成 pandas DataFrame,对数据分析工作流很方便。但它依赖 Ghostscript 和 OpenCV,安装较重,且在纯 Python 选项中速度最慢。
PyMuPDF (fitz) 在 1.23 版本中添加了基础表格提取功能。速度快(4.6ms),适合简单有边框表格,但无边框表格支持不如 pdfplumber 或 Camelot。
pypdf 没有表格检测能力。它只提取原始文本,你需要自行编写解析逻辑来重建表格结构。
tabula-py 是基于 Java 的 Tabula 库的 Python 封装。对有边框和无边框表格都有良好检测,但需要 Java 运行时,且由于 JVM 启动开销,速度最慢。更适合一次性提取任务,而非高吞吐量流水线。
对于大多数生产场景,推荐的方案是以 PDF Oxide 作为主要提取工具以兼顾速度和简洁,对于需要高级启发算法的复杂表格布局文档,回退到 pdfplumber。
安装
pip install pdf_oxide
基础表格提取
以 Markdown 表格输出
最简单的方法——将页面转换为 Markdown,其中会以 GFM 语法包含表格:
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
for i in range(doc.page_count()):
md = doc.to_markdown(i, detect_headings=True)
if "|" in md: # Page contains a table
print(f"--- Page {i + 1} ---")
print(md)
WASM
const doc = new WasmPdfDocument(bytes);
for (let i = 0; i < doc.pageCount(); i++) {
const md = doc.toMarkdown(i);
if (md.includes("|")) { // Page contains a table
console.log(`--- Page ${i + 1} ---`);
console.log(md);
}
}
doc.free();
Rust
let mut doc = PdfDocument::open("report.pdf")?;
for i in 0..doc.page_count()? {
let md = doc.to_markdown(i, true)?;
if md.contains("|") {
println!("--- Page {} ---", i + 1);
println!("{}", md);
}
}
Go
doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
n, _ := doc.PageCount()
for i := 0; i < n; i++ {
md, _ := doc.ToMarkdown(i)
if strings.Contains(md, "|") {
fmt.Printf("--- Page %d ---\n%s\n", i+1, md)
}
}
C#
using var doc = PdfDocument.Open("report.pdf");
for (int i = 0; i < doc.PageCount; i++)
{
var md = doc.ToMarkdown(i);
if (md.Contains("|"))
Console.WriteLine($"--- Page {i + 1} ---\n{md}");
}
Java
try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("report.pdf"))) {
for (int i = 0; i < doc.pageCount(); i++) {
String md = doc.toMarkdown(i);
if (md.contains("|")) { // Page contains a table
System.out.println("--- Page " + (i + 1) + " ---");
System.out.println(md);
}
}
}
PHP
$doc = PdfDocument::open('report.pdf');
for ($i = 0; $i < $doc->pageCount(); $i++) {
$md = $doc->toMarkdown($i);
if (str_contains($md, '|')) { // Page contains a table
echo "--- Page " . ($i + 1) . " ---\n";
echo $md;
}
}
$doc->close();
Ruby
PdfOxide::PdfDocument.open('report.pdf') do |doc|
doc.page_count.times do |i|
md = doc.to_markdown(i)
if md.include?('|') # Page contains a table
puts "--- Page #{i + 1} ---"
puts md
end
end
end
C++
auto doc = pdf_oxide::Document::open("report.pdf");
for (int i = 0; i < doc.page_count(); i++) {
auto md = doc.to_markdown(i);
if (md.find('|') != std::string::npos) { // Page contains a table
std::cout << "--- Page " << (i + 1) << " ---\n" << md << '\n';
}
}
Swift
let doc = try Document.open("report.pdf")
for i in 0..<(try doc.pageCount()) {
let md = try doc.toMarkdown(i)
if md.contains("|") { // Page contains a table
print("--- Page \(i + 1) ---")
print(md)
}
}
Kotlin
PdfDocument.open(java.nio.file.Path.of("report.pdf")).use { doc ->
for (i in 0 until doc.pageCount()) {
val md = doc.toMarkdown(i)
if (md.contains("|")) { // Page contains a table
println("--- Page ${i + 1} ---")
println(md)
}
}
}
Dart
final doc = PdfDocument.open('report.pdf');
for (var i = 0; i < doc.pageCount; i++) {
final md = doc.toMarkdown(i);
if (md.contains('|')) { // Page contains a table
print('--- Page ${i + 1} ---');
print(md);
}
}
doc.close();
R
doc <- pdf_open("report.pdf")
for (i in 0:(pdf_page_count(doc) - 1)) {
md <- pdf_to_markdown(doc, i)
if (grepl("\\|", md)) { # Page contains a table
cat(sprintf("--- Page %d ---\n%s\n", i + 1, md))
}
}
Julia
doc = open_document("report.pdf")
for i in 0:(page_count(doc) - 1)
md = to_markdown(doc, i)
if occursin("|", md) # Page contains a table
println("--- Page $(i + 1) ---")
println(md)
end
end
Zig
var doc = try pdf_oxide.Document.open("report.pdf");
const n = try doc.pageCount();
var i: i32 = 0;
while (i < n) : (i += 1) {
const md = try doc.toMarkdown(a, i);
defer a.free(md);
if (std.mem.indexOfScalar(u8, md, '|') != null) { // Page contains a table
std.debug.print("--- Page {d} ---\n{s}\n", .{ i + 1, md });
}
}
Scala
Using.resource(PdfDocument.open("report.pdf")) { doc =>
for (i <- 0 until doc.pageCount()) {
val md = doc.toMarkdown(i)
if (md.contains("|")) { // Page contains a table
println(s"--- Page ${i + 1} ---")
println(md)
}
}
}
Clojure
(with-open [d (pdf/open "report.pdf")]
(doseq [i (range (pdf/page-count d))]
(let [md (pdf/to-markdown d i)]
(when (.contains md "|") ; Page contains a table
(println (str "--- Page " (inc i) " ---"))
(println md)))))
Objective-C
NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
for (NSInteger i = 0; i < [doc pageCountError:&err]; i++) {
NSString *md = [doc toMarkdown:i error:&err];
if ([md containsString:@"|"]) { // Page contains a table
NSLog(@"--- Page %ld ---\n%@", (long)(i + 1), md);
}
}
Elixir
{:ok, doc} = PdfOxide.open("report.pdf")
{:ok, n} = PdfOxide.page_count(doc)
for i <- 0..(n - 1) do
{:ok, md} = PdfOxide.to_markdown(doc, i)
if String.contains?(md, "|") do # Page contains a table
IO.puts("--- Page #{i + 1} ---")
IO.puts(md)
end
end
结构化表格提取 (v0.3.34)
若要在不解析 Markdown 的情况下获得对行和边界框的类型化访问,可调用 ExtractTables(pageIndex) (Go, C#) / extract_tables(page) (Python, Rust)。每个表格都公开结构化单元格,让你可以直接将结果输入数据库或 DataFrame,无需正则表达式。
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("invoice.pdf")
for table in doc.extract_tables(0):
for row in table.rows:
print(row)
Rust
let mut doc = PdfDocument::open("invoice.pdf")?;
for table in doc.extract_tables(0)? {
for row in &table.rows {
println!("{:?}", row);
}
}
Go
doc, _ := pdfoxide.Open("invoice.pdf")
defer doc.Close()
tables, _ := doc.ExtractTables(0)
for _, t := range tables {
for _, row := range t.Rows {
fmt.Println(row)
}
}
C#
using var doc = PdfDocument.Open("invoice.pdf");
foreach (var table in doc.ExtractTables(0))
foreach (var row in table.Rows)
Console.WriteLine(string.Join(" | ", row));
Java
import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.table.Table;
import fyi.oxide.pdf.table.TableCell;
try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("invoice.pdf"))) {
for (Table table : doc.page(0).tables()) {
String[][] grid = new String[table.rows()][table.cols()];
for (TableCell c : table.cells()) grid[c.row()][c.col()] = c.text();
for (String[] row : grid) System.out.println(String.join(" | ", row));
}
}
C++
auto doc = pdf_oxide::Document::open("invoice.pdf");
for (const auto& table : doc.extract_tables(0)) {
for (int r = 0; r < table.row_count; r++) {
for (int c = 0; c < table.col_count; c++) {
std::cout << table.cell(r, c);
if (c + 1 < table.col_count) std::cout << " | ";
}
std::cout << '\n';
}
}
Swift
let doc = try Document.open("invoice.pdf")
for table in try doc.extractTables(0) {
for r in 0..<table.rowCount {
let row = (0..<table.colCount).map { table.cell(r, $0) }
print(row.joined(separator: " | "))
}
}
Kotlin
import fyi.oxide.pdf.PdfDocument
PdfDocument.open(java.nio.file.Path.of("invoice.pdf")).use { doc ->
for (table in doc.page(0).tables()) {
val grid = Array(table.rows()) { arrayOfNulls<String>(table.cols()) }
table.cells().forEach { grid[it.row()][it.col()] = it.text() }
grid.forEach { println(it.joinToString(" | ")) }
}
}
Dart
final doc = PdfDocument.open('invoice.pdf');
for (final table in doc.extractTables(0)) {
for (var r = 0; r < table.rowCount; r++) {
final row = [for (var c = 0; c < table.colCount; c++) table.cell(r, c)];
print(row.join(' | '));
}
}
doc.close();
R
doc <- pdf_open("invoice.pdf")
for (table in pdf_extract_tables(doc, 0)) {
for (r in seq_len(table$row_count)) {
cat(paste(table$cells[r, ], collapse = " | "), "\n")
}
}
Julia
doc = open_document("invoice.pdf")
for table in extract_tables(doc, 0)
for r in 1:table.row_count
println(join(table.cells[r, :], " | "))
end
end
Zig
var doc = try pdf_oxide.Document.open("invoice.pdf");
const tables = try doc.extractTables(a, 0);
defer pdf_oxide.Document.freeTables(a, tables);
for (tables) |table| {
var r: i32 = 0;
while (r < table.rowCount) : (r += 1) {
var c: i32 = 0;
while (c < table.colCount) : (c += 1) {
std.debug.print("{s}", .{table.cell(r, c)});
if (c + 1 < table.colCount) std.debug.print(" | ", .{});
}
std.debug.print("\n", .{});
}
}
Scala
import fyi.oxide.pdf.PdfDocument
import scala.jdk.CollectionConverters._
import scala.util.Using
Using.resource(PdfDocument.open("invoice.pdf")) { doc =>
for (table <- doc.page(0).tables().asScala) {
val grid = Array.ofDim[String](table.rows(), table.cols())
table.cells().asScala.foreach(c => grid(c.row())(c.col()) = c.text())
grid.foreach(row => println(row.mkString(" | ")))
}
}
Clojure
(with-open [d (pdf/open "invoice.pdf")]
(doseq [table (pdf/tables (pdf/page d 0))]
(let [grid (make-array String (.rows table) (.cols table))]
(doseq [c (.cells table)]
(aset grid (.row c) (.col c) (.text c)))
(doseq [row grid]
(println (clojure.string/join " | " row))))))
Objective-C
NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"invoice.pdf" error:&err];
for (POXTable *table in [doc extractTables:0 error:&err]) {
for (NSInteger r = 0; r < table.rowCount; r++) {
NSMutableArray<NSString *> *row = [NSMutableArray array];
for (NSInteger c = 0; c < table.colCount; c++)
[row addObject:([table cellTextAtRow:r col:c] ?: @"")];
NSLog(@"%@", [row componentsJoinedByString:@" | "]);
}
}
Elixir
{:ok, doc} = PdfOxide.open("invoice.pdf")
{:ok, tables} = PdfOxide.extract_tables(doc, 0)
for table <- tables do
for r <- 0..(table.row_count - 1) do
row = for c <- 0..(table.col_count - 1), do: PdfOxide.cell(table, r, c)
IO.puts(Enum.join(row, " | "))
end
end
将 Markdown 表格解析为行
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("invoice.pdf")
md = doc.to_markdown(0)
# Extract table rows from Markdown
rows = []
for line in md.split("\n"):
line = line.strip()
if line.startswith("|") and not line.startswith("|--"):
cells = [cell.strip() for cell in line.split("|")[1:-1]]
rows.append(cells)
header = rows[0] if rows else []
data = rows[1:] if len(rows) > 1 else []
print(f"Columns: {header}")
for row in data:
print(row)
WASM
const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdown(0);
const rows = [];
for (const line of md.split("\n")) {
const trimmed = line.trim();
if (trimmed.startsWith("|") && !trimmed.startsWith("|--")) {
const cells = trimmed.split("|").slice(1, -1).map(c => c.trim());
rows.push(cells);
}
}
const header = rows[0] || [];
const data = rows.slice(1);
console.log("Columns:", header);
data.forEach(row => console.log(row));
doc.free();
Rust
let mut doc = PdfDocument::open("invoice.pdf")?;
let md = doc.to_markdown(0, false)?;
let rows: Vec<Vec<String>> = md.lines()
.map(|l| l.trim())
.filter(|l| l.starts_with('|') && !l.starts_with("|--"))
.map(|l| l.split('|').skip(1).map(|c| c.trim().to_string())
.take_while(|c| !c.is_empty()).collect())
.collect();
if let Some(header) = rows.first() {
println!("Columns: {:?}", header);
for row in &rows[1..] {
println!("{:?}", row);
}
}
Java
import fyi.oxide.pdf.PdfDocument;
import java.util.*;
try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("invoice.pdf"))) {
String md = doc.toMarkdown(0);
List<List<String>> rows = new ArrayList<>();
for (String line : md.split("\n")) {
line = line.strip();
if (line.startsWith("|") && !line.startsWith("|--")) {
String[] parts = line.substring(1, line.length() - 1).split("\\|");
List<String> cells = new ArrayList<>();
for (String p : parts) cells.add(p.strip());
rows.add(cells);
}
}
System.out.println("Columns: " + (rows.isEmpty() ? List.of() : rows.get(0)));
for (int i = 1; i < rows.size(); i++) System.out.println(rows.get(i));
}
PHP
$doc = PdfDocument::open('invoice.pdf');
$md = $doc->toMarkdown(0);
$rows = [];
foreach (explode("\n", $md) as $line) {
$line = trim($line);
if (str_starts_with($line, '|') && !str_starts_with($line, '|--')) {
$cells = array_map('trim', array_slice(explode('|', $line), 1, -1));
$rows[] = $cells;
}
}
$header = $rows[0] ?? [];
echo "Columns: " . implode(', ', $header) . "\n";
foreach (array_slice($rows, 1) as $row) {
echo implode(' | ', $row) . "\n";
}
$doc->close();
Ruby
PdfOxide::PdfDocument.open('invoice.pdf') do |doc|
md = doc.to_markdown(0)
rows = md.lines.map(&:strip)
.select { |l| l.start_with?('|') && !l.start_with?('|--') }
.map { |l| l.split('|')[1..-2].map(&:strip) }
header = rows.first || []
puts "Columns: #{header.inspect}"
rows.drop(1).each { |row| puts row.inspect }
end
C++
#include <pdf_oxide/pdf_oxide.hpp>
#include <sstream>
#include <vector>
auto doc = pdf_oxide::Document::open("invoice.pdf");
auto md = doc.to_markdown(0);
std::vector<std::vector<std::string>> rows;
std::istringstream stream(md);
for (std::string line; std::getline(stream, line);) {
auto s = line.find_first_not_of(" \t");
if (s == std::string::npos) continue;
line = line.substr(s);
if (line.rfind("|", 0) != 0 || line.rfind("|--", 0) == 0) continue;
std::vector<std::string> cells;
std::istringstream cs(line.substr(1, line.size() - 2));
for (std::string cell; std::getline(cs, cell, '|');) cells.push_back(cell);
rows.push_back(cells);
}
Swift
let doc = try Document.open("invoice.pdf")
let md = try doc.toMarkdown(0)
let rows = md.split(separator: "\n").map { $0.trimmingCharacters(in: .whitespaces) }
.filter { $0.hasPrefix("|") && !$0.hasPrefix("|--") }
.map { line -> [String] in
line.dropFirst().dropLast().split(separator: "|", omittingEmptySubsequences: false)
.map { $0.trimmingCharacters(in: .whitespaces) }
}
if let header = rows.first {
print("Columns:", header)
for row in rows.dropFirst() { print(row) }
}
Kotlin
PdfDocument.open(java.nio.file.Path.of("invoice.pdf")).use { doc ->
val md = doc.toMarkdown(0)
val rows = md.split("\n").map { it.trim() }
.filter { it.startsWith("|") && !it.startsWith("|--") }
.map { it.removeSurrounding("|").split("|").map(String::trim) }
rows.firstOrNull()?.let { println("Columns: $it") }
rows.drop(1).forEach { println(it) }
}
Dart
final doc = PdfDocument.open('invoice.pdf');
final md = doc.toMarkdown(0);
final rows = md.split('\n').map((l) => l.trim())
.where((l) => l.startsWith('|') && !l.startsWith('|--'))
.map((l) => l.substring(1, l.length - 1).split('|').map((c) => c.trim()).toList())
.toList();
if (rows.isNotEmpty) {
print('Columns: ${rows.first}');
for (final row in rows.skip(1)) print(row);
}
doc.close();
R
doc <- pdf_open("invoice.pdf")
md <- pdf_to_markdown(doc, 0)
lines <- trimws(strsplit(md, "\n")[[1]])
lines <- lines[startsWith(lines, "|") & !startsWith(lines, "|--")]
rows <- lapply(lines, function(l) {
cells <- strsplit(l, "\\|")[[1]]
trimws(cells[2:(length(cells) - 1)])
})
if (length(rows) > 0) {
cat("Columns:", rows[[1]], "\n")
for (row in rows[-1]) cat(row, "\n")
}
Julia
doc = open_document("invoice.pdf")
md = to_markdown(doc, 0)
rows = [strip.(split(l, "|")[2:end-1])
for l in strip.(split(md, "\n"))
if startswith(l, "|") && !startswith(l, "|--")]
if !isempty(rows)
println("Columns: ", rows[1])
for row in rows[2:end]
println(row)
end
end
Zig
var doc = try pdf_oxide.Document.open("invoice.pdf");
const md = try doc.toMarkdown(a, 0);
defer a.free(md);
var lines = std.mem.splitScalar(u8, md, '\n');
while (lines.next()) |raw| {
const line = std.mem.trim(u8, raw, " \t\r");
if (!std.mem.startsWith(u8, line, "|") or std.mem.startsWith(u8, line, "|--")) continue;
const inner = line[1 .. line.len - 1];
var cells = std.mem.splitScalar(u8, inner, '|');
while (cells.next()) |cell| {
std.debug.print("{s}\t", .{std.mem.trim(u8, cell, " \t")});
}
std.debug.print("\n", .{});
}
Scala
Using.resource(PdfDocument.open("invoice.pdf")) { doc =>
val md = doc.toMarkdown(0)
val rows = md.split("\n").map(_.trim)
.filter(l => l.startsWith("|") && !l.startsWith("|--"))
.map(_.stripPrefix("|").stripSuffix("|").split("\\|").map(_.trim).toList)
.toList
rows.headOption.foreach(h => println(s"Columns: $h"))
rows.drop(1).foreach(println)
}
Clojure
(with-open [d (pdf/open "invoice.pdf")]
(let [md (pdf/to-markdown d 0)
rows (->> (clojure.string/split-lines md)
(map clojure.string/trim)
(filter #(and (.startsWith % "|") (not (.startsWith % "|--"))))
(map #(->> (clojure.string/split % #"\|")
(drop 1) (butlast) (map clojure.string/trim) vec)))]
(when-let [header (first rows)]
(println "Columns:" header)
(doseq [row (rest rows)] (println row)))))
Objective-C
NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"invoice.pdf" error:&err];
NSString *md = [doc toMarkdown:0 error:&err];
NSMutableArray<NSArray<NSString *> *> *rows = [NSMutableArray array];
for (NSString *raw in [md componentsSeparatedByString:@"\n"]) {
NSString *line = [raw stringByTrimmingCharactersInSet:
[NSCharacterSet whitespaceCharacterSet]];
if (![line hasPrefix:@"|"] || [line hasPrefix:@"|--"]) continue;
NSArray<NSString *> *parts = [line componentsSeparatedByString:@"|"];
NSMutableArray<NSString *> *cells = [NSMutableArray array];
for (NSUInteger i = 1; i + 1 < parts.count; i++)
[cells addObject:[parts[i] stringByTrimmingCharactersInSet:
[NSCharacterSet whitespaceCharacterSet]]];
[rows addObject:cells];
}
if (rows.count > 0) NSLog(@"Columns: %@", rows[0]);
Elixir
{:ok, doc} = PdfOxide.open("invoice.pdf")
{:ok, md} = PdfOxide.to_markdown(doc, 0)
rows =
md
|> String.split("\n")
|> Enum.map(&String.trim/1)
|> Enum.filter(&(String.starts_with?(&1, "|") and not String.starts_with?(&1, "|--")))
|> Enum.map(fn line ->
line |> String.split("|") |> Enum.slice(1..-2//1) |> Enum.map(&String.trim/1)
end)
case rows do
[header | data] ->
IO.puts("Columns: #{inspect(header)}")
Enum.each(data, &IO.inspect/1)
[] ->
:ok
end
导出为 CSV
import csv
from pdf_oxide import PdfDocument
doc = PdfDocument("invoice.pdf")
md = doc.to_markdown(0)
rows = []
for line in md.split("\n"):
line = line.strip()
if line.startswith("|") and not line.startswith("|--"):
cells = [cell.strip() for cell in line.split("|")[1:-1]]
rows.append(cells)
with open("table.csv", "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(rows)
导出为 Pandas DataFrame
import pandas as pd
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
md = doc.to_markdown(0)
rows = []
for line in md.split("\n"):
line = line.strip()
if line.startswith("|") and not line.startswith("|--"):
cells = [cell.strip() for cell in line.split("|")[1:-1]]
rows.append(cells)
if rows:
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
利用字符位置进行自定义表格解析
对于更精细的控制,可使用字符级提取和空间分析:
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("financial.pdf")
chars = doc.extract_chars(0)
# Group characters by Y position (rows)
rows = {}
for ch in chars:
row_key = round(ch.y / 2) * 2 # Snap to 2pt grid
rows.setdefault(row_key, []).append(ch)
# Sort rows top-to-bottom, characters left-to-right
for y in sorted(rows.keys(), reverse=True):
line_chars = sorted(rows[y], key=lambda c: c.x)
text = "".join(c.char for c in line_chars)
print(text)
WASM
const doc = new WasmPdfDocument(bytes);
const chars = doc.extractChars(0);
// Group characters by Y position (rows)
const rows = new Map();
for (const ch of chars) {
const rowKey = Math.round(ch.y / 2) * 2; // Snap to 2pt grid
if (!rows.has(rowKey)) rows.set(rowKey, []);
rows.get(rowKey).push(ch);
}
// Sort rows top-to-bottom, characters left-to-right
const sortedKeys = [...rows.keys()].sort((a, b) => b - a);
for (const y of sortedKeys) {
const lineChars = rows.get(y).sort((a, b) => a.x - b.x);
const text = lineChars.map(c => c.char).join("");
console.log(text);
}
doc.free();
Rust
use std::collections::BTreeMap;
let mut doc = PdfDocument::open("financial.pdf")?;
let chars = doc.extract_chars(0)?;
let mut rows: BTreeMap<i32, Vec<_>> = BTreeMap::new();
for ch in &chars {
let row_key = ((ch.y / 2.0).round() * 2.0) as i32;
rows.entry(row_key).or_default().push(ch);
}
for (_, line_chars) in rows.iter().rev() {
let mut sorted = line_chars.clone();
sorted.sort_by(|a, b| a.x.partial_cmp(&b.x).unwrap());
let text: String = sorted.iter().map(|c| c.char).collect();
println!("{}", text);
}
Go
doc, _ := pdfoxide.Open("financial.pdf")
defer doc.Close()
chars, _ := doc.ExtractChars(0)
rows := map[int][]pdfoxide.Char{}
for _, ch := range chars {
key := int(math.Round(float64(ch.Y)/2) * 2)
rows[key] = append(rows[key], ch)
}
keys := make([]int, 0, len(rows))
for k := range rows { keys = append(keys, k) }
sort.Sort(sort.Reverse(sort.IntSlice(keys)))
for _, y := range keys {
line := rows[y]
sort.Slice(line, func(i, j int) bool { return line[i].X < line[j].X })
var b strings.Builder
for _, c := range line { b.WriteString(c.Char) }
fmt.Println(b.String())
}
C#
using var doc = PdfDocument.Open("financial.pdf");
var chars = doc.ExtractChars(0);
var rows = chars
.GroupBy(c => (int)(Math.Round(c.Y / 2) * 2))
.OrderByDescending(g => g.Key);
foreach (var row in rows)
{
var line = string.Concat(row.OrderBy(c => c.X).Select(c => c.Char));
Console.WriteLine(line);
}
C++
#include <pdf_oxide/pdf_oxide.hpp>
#include <map>
#include <vector>
#include <algorithm>
auto doc = pdf_oxide::Document::open("financial.pdf");
auto chars = doc.extract_chars(0);
// Group characters by Y position (rows)
std::map<int, std::vector<pdf_oxide::Char>> rows;
for (const auto& ch : chars) {
int key = static_cast<int>(std::lround(ch.bbox.y / 2.0) * 2);
rows[key].push_back(ch);
}
// Sort rows top-to-bottom, characters left-to-right
for (auto it = rows.rbegin(); it != rows.rend(); ++it) {
auto& line = it->second;
std::sort(line.begin(), line.end(),
[](const auto& a, const auto& b) { return a.bbox.x < b.bbox.x; });
std::string text;
for (const auto& c : line) text += static_cast<char>(c.character);
std::cout << text << '\n';
}
Swift
let doc = try Document.open("financial.pdf")
let chars = try doc.extractChars(0)
// Group characters by Y position (rows)
var rows: [Int: [Char]] = [:]
for ch in chars {
let key = Int((ch.bbox.y / 2).rounded()) * 2 // Snap to 2pt grid
rows[key, default: []].append(ch)
}
// Sort rows top-to-bottom, characters left-to-right
for y in rows.keys.sorted(by: >) {
let line = rows[y]!.sorted { $0.bbox.x < $1.bbox.x }
let text = String(line.compactMap { Unicode.Scalar($0.character).map(Character.init) })
print(text)
}
Dart
final doc = PdfDocument.open('financial.pdf');
final chars = doc.extractChars(0);
// Group characters by Y position (rows)
final rows = <int, List<Char>>{};
for (final ch in chars) {
final key = (ch.bbox.y / 2).round() * 2; // Snap to 2pt grid
rows.putIfAbsent(key, () => []).add(ch);
}
// Sort rows top-to-bottom, characters left-to-right
final keys = rows.keys.toList()..sort((a, b) => b - a);
for (final y in keys) {
final line = rows[y]!..sort((a, b) => a.bbox.x.compareTo(b.bbox.x));
final text = String.fromCharCodes(line.map((c) => c.character));
print(text);
}
doc.close();
R
doc <- pdf_open("financial.pdf")
chars <- pdf_extract_chars(doc, 0)
# Group characters by Y position (rows), snapped to a 2pt grid
keys <- sapply(chars, function(ch) round(ch$bbox$y / 2) * 2)
for (y in sort(unique(keys), decreasing = TRUE)) {
line <- chars[keys == y]
line <- line[order(sapply(line, function(c) c$bbox$x))]
text <- paste(intToUtf8(sapply(line, function(c) c$character), multiple = TRUE),
collapse = "")
cat(text, "\n")
}
Julia
doc = open_document("financial.pdf")
chars = extract_chars(doc, 0)
# Group characters by Y position (rows), snapped to a 2pt grid
rows = Dict{Int,Vector}()
for ch in chars
key = round(Int, ch.bbox.y / 2) * 2
push!(get!(rows, key, []), ch)
end
for y in sort(collect(keys(rows)), rev = true)
line = sort(rows[y], by = c -> c.bbox.x)
text = join(Char.(getfield.(line, :character)))
println(text)
end
Zig
var doc = try pdf_oxide.Document.open("financial.pdf");
const chars = try doc.extractChars(a, 0);
defer pdf_oxide.Document.freeChars(a, chars);
// Group characters by Y position (rows)
var rows = std.AutoArrayHashMap(i32, std.ArrayList(pdf_oxide.Char)).init(a);
for (chars) |ch| {
const key: i32 = @intFromFloat(@round(ch.bbox.y / 2.0) * 2.0);
const gop = try rows.getOrPut(key);
if (!gop.found_existing) gop.value_ptr.* = std.ArrayList(pdf_oxide.Char).init(a);
try gop.value_ptr.append(ch);
}
// Sort keys descending (top-to-bottom), characters left-to-right
const keys = rows.keys();
std.mem.sort(i32, keys, {}, comptime std.sort.desc(i32));
for (keys) |y| {
var line = rows.get(y).?;
std.mem.sort(pdf_oxide.Char, line.items, {}, struct {
fn lt(_: void, x: pdf_oxide.Char, z: pdf_oxide.Char) bool { return x.bbox.x < z.bbox.x; }
}.lt);
for (line.items) |c| {
var buf: [4]u8 = undefined;
const len = std.unicode.utf8Encode(@intCast(c.character), &buf) catch 0;
std.debug.print("{s}", .{buf[0..len]});
}
std.debug.print("\n", .{});
}
Objective-C
NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"financial.pdf" error:&err];
NSArray<POXChar *> *chars = [doc extractChars:0 error:&err];
// Group characters by Y position (rows)
NSMutableDictionary<NSNumber *, NSMutableArray<POXChar *> *> *rows =
[NSMutableDictionary dictionary];
for (POXChar *ch in chars) {
NSNumber *key = @((NSInteger)(round(ch.bbox.y / 2.0) * 2));
if (!rows[key]) rows[key] = [NSMutableArray array];
[rows[key] addObject:ch];
}
// Sort rows top-to-bottom, characters left-to-right
NSArray<NSNumber *> *keys = [[rows allKeys]
sortedArrayUsingSelector:@selector(compare:)];
for (NSNumber *y in [keys reverseObjectEnumerator]) {
NSArray<POXChar *> *line = [rows[y] sortedArrayUsingComparator:
^(POXChar *x, POXChar *z) { return [@(x.bbox.x) compare:@(z.bbox.x)]; }];
NSMutableString *text = [NSMutableString string];
for (POXChar *c in line)
[text appendString:[[NSString alloc] initWithBytes:&(uint32_t){c.character}
length:4 encoding:NSUTF32LittleEndianStringEncoding]];
NSLog(@"%@", text);
}
Elixir
{:ok, doc} = PdfOxide.open("financial.pdf")
{:ok, chars} = PdfOxide.extract_chars(doc, 0)
# Group characters by Y position (rows), snapped to a 2pt grid
chars
|> Enum.group_by(fn ch -> round(ch.bbox.y / 2) * 2 end)
|> Enum.sort_by(fn {y, _} -> -y end)
|> Enum.each(fn {_y, line} ->
text =
line
|> Enum.sort_by(fn c -> c.bbox.x end)
|> Enum.map(fn c -> <<c.character::utf8>> end)
|> Enum.join()
IO.puts(text)
end)
将表格导出为 Markdown
当你需要将 PDF 内容输入大型语言模型、构建 RAG 流水线,或以人类可读且机器可解析的格式存储提取数据时,Markdown 是理想的输出格式。PDF Oxide 原生以 GitHub Flavored Markdown (GFM) 格式输出表格,无需额外转换步骤。
from pdf_oxide import PdfDocument
doc = PdfDocument("quarterly-report.pdf")
# Extract all tables across all pages as Markdown
all_tables = []
for i in range(doc.page_count()):
md = doc.to_markdown(i, detect_headings=True)
# Split the markdown into sections and find table blocks
in_table = False
current_table = []
for line in md.split("\n"):
if line.strip().startswith("|"):
in_table = True
current_table.append(line)
else:
if in_table and current_table:
all_tables.append("\n".join(current_table))
current_table = []
in_table = False
if current_table:
all_tables.append("\n".join(current_table))
print(f"Found {len(all_tables)} tables")
for idx, table in enumerate(all_tables):
print(f"\n--- Table {idx + 1} ---")
print(table)
GFM 表格输出与 LLM 提示词直接兼容。你可以直接将其传入 OpenAI 或 Anthropic 的 API 调用,模型无需任何额外格式处理即可理解表格结构:
# Feed extracted table to an LLM for analysis
prompt = f"""Analyze the following financial table and summarize the key trends:
{all_tables[0]}
"""
这种方法比先用 pdfplumber 提取表格再手动转换为 Markdown 要快得多。
处理跨页表格
跨页表格是 PDF 提取中的常见挑战。财务报表、库存清单和监管文件中经常包含跨越两页、五页甚至数十页的表格。关键在于从每页单独提取表格,然后将各行拼接在一起,同时小心处理重复的标题行和页面碎片。
from pdf_oxide import PdfDocument
doc = PdfDocument("long-report.pdf")
def extract_table_rows(md_text):
"""Extract table rows from markdown text, returning header and data separately."""
header = None
data_rows = []
for line in md_text.split("\n"):
line = line.strip()
if not line.startswith("|") or line.startswith("|--"):
continue
cells = [cell.strip() for cell in line.split("|")[1:-1]]
if header is None:
header = cells
else:
data_rows.append(cells)
return header, data_rows
# Collect rows across all pages
combined_header = None
combined_rows = []
for i in range(doc.page_count()):
md = doc.to_markdown(i)
header, rows = extract_table_rows(md)
if header is None:
continue # No table on this page
if combined_header is None:
combined_header = header
elif header == combined_header:
pass # Skip repeated header on subsequent pages
else:
# Different table — save current and start new
print(f"Table with {len(combined_rows)} rows found")
combined_header = header
combined_rows = []
combined_rows.extend(rows)
if combined_header and combined_rows:
print(f"Columns: {combined_header}")
print(f"Total rows: {len(combined_rows)}")
for row in combined_rows[:5]:
print(row)
if len(combined_rows) > 5:
print(f"... and {len(combined_rows) - 5} more rows")
这一模式对每页重复标题行的表格(最常见的情况)能可靠运行。对于标题只出现在第一页的表格,可以简化逻辑:只从第一个含表格的页面获取标题,将后续所有行视为数据行。
将表格导出为 CSV 或 DataFrame
提取表格数据后,通常需要将其转换为结构化格式以便进一步分析。以下示例展示如何用几行代码将 PDF 内容转换为 pandas DataFrame 或 CSV 文件。
批量导出:将所有表格分别导出为 CSV
import csv
from pdf_oxide import PdfDocument
doc = PdfDocument("catalog.pdf")
table_count = 0
for i in range(doc.page_count()):
md = doc.to_markdown(i)
rows = []
for line in md.split("\n"):
line = line.strip()
if line.startswith("|") and not line.startswith("|--"):
cells = [cell.strip() for cell in line.split("|")[1:-1]]
rows.append(cells)
if len(rows) > 1: # At least header + one data row
table_count += 1
filename = f"table_page{i + 1}_{table_count}.csv"
with open(filename, "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(rows)
print(f"Saved {filename} ({len(rows) - 1} data rows)")
print(f"Exported {table_count} tables total")
跨页表格导入 DataFrame
对于跨多页的表格,将多页拼接模式与 pandas 结合使用:
import pandas as pd
from pdf_oxide import PdfDocument
doc = PdfDocument("financial-statement.pdf")
header = None
all_rows = []
for i in range(doc.page_count()):
md = doc.to_markdown(i)
for line in md.split("\n"):
line = line.strip()
if not line.startswith("|") or line.startswith("|--"):
continue
cells = [cell.strip() for cell in line.split("|")[1:-1]]
if header is None:
header = cells
elif cells == header:
continue # Skip repeated header
else:
all_rows.append(cells)
if header and all_rows:
df = pd.DataFrame(all_rows, columns=header)
# Clean up numeric columns
for col in df.columns:
# Try to convert columns that look numeric
cleaned = df[col].str.replace(r"[$,%]", "", regex=True).str.strip()
try:
df[col] = pd.to_numeric(cleaned)
except (ValueError, TypeError):
pass # Keep as string
print(df.dtypes)
print(df.head(10))
df.to_csv("financial_data.csv", index=False)
这一工作流将为你生成具有正确数值类型的干净 DataFrame,可直接用于 pandas 分析、matplotlib 绘图或加载到数据库中。
复杂表格:何时使用 pdfplumber
PDF Oxide 的表格检测能很好地处理标准对齐表格。对于复杂情况——合并单元格、跨列标题、无边框表格或多行单元格内容——pdfplumber 专用的表格提取算法更为强大:
import pdfplumber
with pdfplumber.open("complex-report.pdf") as pdf:
page = pdf.pages[0]
tables = page.extract_tables()
for table in tables:
for row in table:
print(row)
各场景推荐工具
| 场景 | 推荐 |
|---|---|
| 简单对齐表格 | PDF Oxide(快 29 倍) |
| 全页 Markdown 中的表格 | PDF Oxide |
| 复杂合并单元格 / 跨列标题 | pdfplumber |
| 无边框表格 | pdfplumber |
| 对速度敏感的批量处理 | PDF Oxide |
两者结合使用
用 PDF Oxide 快速提取文本,用 pdfplumber 提取复杂表格:
from pdf_oxide import PdfDocument
import pdfplumber
# Fast full-text extraction
doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
# Targeted table extraction for complex pages
with pdfplumber.open("report.pdf") as pdf:
tables = pdf.pages[0].extract_tables()
相关页面
- Markdown 转换 — 完整 Markdown API 参考
- 文本提取 — 纯文本与字符提取
- PDF Oxide vs pdfplumber — 详细对比
- PDF 转 Markdown — Markdown 转换指南