Extract Tables from PDF in Python
Extracting tables from PDF documents is one of the most common tasks in document processing pipelines. Whether you are pulling financial data from annual reports, scraping product catalogs, or feeding structured data into an LLM, reliable table extraction is essential. This guide covers everything you need to know about extracting tables from PDFs in Python, from quick one-liners to production-grade workflows for multi-page tables.
Detection engine
PDF Oxide uses the universal edges → snap/merge → intersections → cells → groups table-detection pipeline — the same approach used by Tabula, pdfplumber, and PyMuPDF, implemented in pure Rust.
Detection capabilities:
- Intersection-based — finds H×V line crossings, builds cells from four-corner rectangles, groups into tables via union-find.
- Extended grid — when horizontal and vertical lines live in different page regions, a virtual grid is built from the Cartesian product of all coordinates.
- Column-aware text detection — segments 2-column layouts via X-projection histogram, then runs text-only table detection per column.
- H-rule-bounded text tables — detects tables bounded by horizontal rules but no vertical lines (common in academic papers).
- Hybrid row detection — infers row boundaries from text Y-positions when only vertical borders exist (invoice line items).
- Dotted / dashed line reconstitution — merges short line segments into continuous edges.
- Section divider splitting — splits multi-section forms at full-width horizontal dividers.
- Edge coverage filtering — removes orphan edges that don’t participate in any potential grid.
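The snap/merge and extended-grid steps in the list above can be illustrated with a small pure-Python sketch. It is not PDF Oxide's Rust implementation: the helper names `snap` and `grid_cells` are hypothetical, and the sketch assumes every horizontal line crosses every vertical line, which is the simplest case of the Cartesian-product grid.

```python
def snap(values, tolerance=3.0):
    """Cluster nearby coordinates and replace each cluster with its mean."""
    snapped = []
    cluster = []
    for v in sorted(values):
        # Start a new cluster when the gap to the previous value is too large.
        if cluster and v - cluster[-1] > tolerance:
            snapped.append(sum(cluster) / len(cluster))
            cluster = []
        cluster.append(v)
    if cluster:
        snapped.append(sum(cluster) / len(cluster))
    return snapped


def grid_cells(xs, ys):
    """Build (x0, y0, x1, y1) cells from a full grid of line coordinates."""
    xs, ys = sorted(xs), sorted(ys)
    return [
        (x0, y0, x1, y1)
        for y0, y1 in zip(ys, ys[1:])
        for x0, x1 in zip(xs, xs[1:])
    ]


# Slightly misaligned vertical edges collapse to three column lines.
xs = snap([50.0, 51.0, 200.0, 199.0, 350.0])
ys = snap([700.0, 680.0, 660.0])
cells = grid_cells(xs, ys)
print(xs, len(cells))
```

The default tolerance of 3.0 mirrors the `snap_tolerance` default listed under Configuration; a real detector would also check that each candidate cell is actually bounded by edges before keeping it.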
Configuration
TableDetectionConfig exposes tunable parameters:
| Field | Default | Description |
|---|---|---|
| horizontal_strategy | "lines_strict" | "lines_strict", "lines", "text", or "explicit" |
| vertical_strategy | "lines_strict" | Same vocabulary |
| v_split_gap | 20.0 pt | Gap between vertical lines that triggers splitting into separate tables (hardcoded to 4 pt before v0.3.20) |
| snap_tolerance | 3.0 pt | Edge-snap merging tolerance |
| text_tolerance | 3.0 pt | Text-line merging tolerance |
Behavior change
From v0.3.20 onwards, the default strategy for Python extract_tables() is Both (detection via both lines and text). Code that relied on the old Text-only default should pass horizontal_strategy="text" and vertical_strategy="text" explicitly.
The Python binding now correctly reads vertical_strategy from the table_settings dict — previously it was silently ignored.
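To restore the old text-only behavior, the settings dict might be built as follows. This is a sketch: `build_table_settings` is a hypothetical helper, the key names come from the Configuration table, and the exact call shape of `extract_tables()` (shown only in a comment) is an assumption that should be checked against the API reference.

```python
# Strategy names listed in the Configuration table.
ALLOWED_STRATEGIES = {"lines_strict", "lines", "text", "explicit"}


def build_table_settings(horizontal_strategy="lines_strict",
                         vertical_strategy="lines_strict",
                         **extra):
    """Validate the two strategy names and return a table_settings dict."""
    for name, value in (("horizontal_strategy", horizontal_strategy),
                        ("vertical_strategy", vertical_strategy)):
        if value not in ALLOWED_STRATEGIES:
            raise ValueError(
                f"{name} must be one of {sorted(ALLOWED_STRATEGIES)}, got {value!r}"
            )
    settings = {
        "horizontal_strategy": horizontal_strategy,
        "vertical_strategy": vertical_strategy,
    }
    settings.update(extra)  # e.g. snap_tolerance, text_tolerance
    return settings


text_only = build_table_settings("text", "text", snap_tolerance=3.0)
print(text_only)

# Assumed call shape, not confirmed by this page:
# tables = doc.extract_tables(0, table_settings=text_only)
```

Validating the names up front turns a typo in a strategy name into an immediate error instead of a silently different detection result.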
Rendering
Extracted tables are emitted with space-padded column alignment (replacing the ASCII box-drawing characters from earlier versions). Currency and number columns are right-aligned automatically. Form-number prefixes ("1 Apr 11" → "Apr 11") and decorative dash / underscore cells ("------") are stripped during rendering.
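The rendering rule described above can be approximated in a few lines of Python. This is an illustration of the idea only, not the library's renderer: `render_padded` and the `NUMERIC` pattern are hypothetical, and the sketch assumes every row has the same number of cells.

```python
import re

# A cell counts as numeric if it looks like an optionally signed amount,
# with an optional dollar sign, thousands separators, decimals, or percent.
NUMERIC = re.compile(r"^[-+]?\$?\d[\d,]*(\.\d+)?%?$")


def render_padded(rows):
    """Render rows as space-padded columns; numeric columns are right-aligned."""
    columns = list(zip(*rows))
    widths = [max(len(cell) for cell in col) for col in columns]
    # A column is numeric when every body cell (header excluded) matches.
    numeric = [all(NUMERIC.match(cell) for cell in col[1:]) for col in columns]
    lines = []
    for row in rows:
        cells = [
            cell.rjust(width) if is_num else cell.ljust(width)
            for cell, width, is_num in zip(row, widths, numeric)
        ]
        lines.append("  ".join(cells).rstrip())
    return "\n".join(lines)


table = [
    ["Item", "Qty", "Price"],
    ["Widget", "10", "$9.99"],
    ["Gadget", "2", "$120.00"],
]
rendered = render_padded(table)
print(rendered)
```

Right-aligning numeric columns keeps decimal points close to a common position, which makes the padded output easier to scan than left-aligned digits.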
Extract table data from a PDF using Markdown conversion:
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("invoice.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)
# Output includes tables in GFM format:
# | Item | Qty | Price |
# |------|-----|-------|
# | Widget | 10 | $9.99 |
WASM
import { WasmPdfDocument } from "pdf-oxide-wasm";
const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdown(0);
console.log(md);
// Output includes tables in GFM format:
// | Item | Qty | Price |
// |------|-----|-------|
// | Widget | 10 | $9.99 |
doc.free();
Rust
use pdf_oxide::PdfDocument;
let mut doc = PdfDocument::open("invoice.pdf")?;
let md = doc.to_markdown(0, true)?;
println!("{}", md);
Go
package main
import (
    "fmt"
    "log"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)
func main() {
    doc, err := pdfoxide.Open("invoice.pdf")
    if err != nil { log.Fatal(err) }
    defer doc.Close()
    md, err := doc.ToMarkdown(0)
    if err != nil { log.Fatal(err) }
    fmt.Println(md)
}
C#
using PdfOxide;
using var doc = PdfDocument.Open("invoice.pdf");
Console.WriteLine(doc.ToMarkdown(0));
PDF Oxide detects tabular layouts from spatial analysis of aligned text blocks and emits GitHub Flavored Markdown tables.
Why Table Extraction from PDFs Is Challenging
If you have ever tried to copy a table from a PDF and paste it into a spreadsheet, you know the result is usually a mess. That is not a bug in your PDF viewer — it reflects a fundamental limitation of the PDF format itself.
PDFs have no concept of a “table.” Unlike HTML, which uses <table>, <tr>, and <td> tags to define tabular structure, a PDF file stores only drawing instructions: place this glyph at coordinates (x, y), draw a line from point A to point B. There is no semantic layer that says “these characters belong to a cell in row 3, column 2.” Every table extraction library must reconstruct that structure by analyzing the spatial positions of text and lines on the page.
This reconstruction is hard for several reasons:
- Bordered vs. borderless tables. When a table has visible grid lines, extraction tools can use those lines as cell boundaries. Borderless tables (common in financial statements, government reports, and academic papers) have no lines at all. The library must infer column boundaries purely from whitespace gaps between text blocks, which is error-prone when columns have variable widths or when numeric values are right-aligned.
- Merged cells and spanning headers. A header cell that spans three columns looks like a single wide text block. Without the grid lines to delineate it, a parser has no reliable way to know which columns the header covers. Some libraries handle this well; many silently produce garbled output.
- Multi-line cell content. When a cell contains a paragraph of text that wraps to multiple lines, naive row-based parsing treats each wrapped line as a separate row. Correctly grouping those lines back into a single cell requires understanding the vertical extent of each row.
- Multi-page tables. Large tables often span two or more pages. The header row may or may not be repeated on each page, and page footers, watermarks, or page numbers may appear between table rows. Stitching these fragments back into a single coherent table requires page-aware logic.
- Rotated text and non-standard layouts. Some PDFs use rotated text for column headers, or place tables in multi-column page layouts. These edge cases break assumptions that most parsers make about left-to-right, top-to-bottom reading order.
Understanding these challenges helps you choose the right tool for your specific documents. For straightforward aligned tables — the majority of invoices, order confirmations, and simple reports — a fast spatial analysis approach like PDF Oxide works well. For documents with complex merging, borderless layouts, or unusual formatting, you may need a library with more sophisticated heuristics.
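To make the borderless case concrete, here is a minimal sketch of inferring column boundaries from whitespace gaps. It is a generic illustration rather than any library's algorithm: `column_boundaries` is a hypothetical helper, the input is assumed to be the horizontal extent `(x0, x1)` of each word, and `min_gap` is an arbitrary threshold in points.

```python
def column_boundaries(spans, min_gap=10.0):
    """Return x positions where an empty band at least min_gap wide
    separates two groups of words.

    spans: iterable of (x0, x1) horizontal extents, one per word.
    """
    boundaries = []
    covered_until = None  # right edge of the text seen so far
    for x0, x1 in sorted(spans):
        if covered_until is not None and x0 - covered_until >= min_gap:
            # Place the boundary in the middle of the empty band.
            boundaries.append((covered_until + x0) / 2)
        covered_until = x1 if covered_until is None else max(covered_until, x1)
    return boundaries


# Three columns; the middle one is right-aligned, so its left edges vary.
words = [(50, 90), (52, 110), (200, 230), (195, 230), (330, 360), (340, 360)]
gaps = column_boundaries(words)
print(gaps)
```

The approach breaks down exactly where the text above says it does: a wide cell that reaches into the neighboring column's band closes the gap, and the two columns merge into one.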
Table Extraction: PDF Oxide vs Other Libraries
Choosing a library for PDF table extraction in Python depends on your documents, your performance requirements, and how you need the output formatted. Here is how the major options compare:
| Library | Table Detection | Bordered Tables | Borderless Tables | Output Format | Speed |
|---|---|---|---|---|---|
| PDF Oxide | Built-in | Yes | Basic | Markdown/HTML | 0.8ms |
| pdfplumber | Built-in | Yes | Advanced | Python lists | 23.2ms |
| Camelot | Built-in | Yes | Yes (lattice/stream) | DataFrames | ~50ms+ |
| PyMuPDF | Basic (v1.23+) | Yes | Limited | DataFrames | 4.6ms |
| pypdf | No | No | No | N/A | N/A |
| tabula-py | Built-in | Yes | Yes | DataFrames | ~100ms+ (Java) |
PDF Oxide is the fastest option by a wide margin. It detects tables through spatial analysis of aligned text blocks and outputs clean GitHub Flavored Markdown tables. At 0.8ms mean extraction time, it is 29x faster than pdfplumber and over 100x faster than tabula-py. It handles bordered tables and simple aligned borderless tables well. For LLM pipelines where you need Markdown output anyway, it is the natural choice.
pdfplumber has the most mature borderless table detection. Its find_tables() method uses configurable strategies for detecting rows and columns based on text alignment, and it handles merged cells and multi-line cell content better than most alternatives. The trade-off is speed: at 23.2ms per page, it is significantly slower for batch processing.
Camelot offers two detection modes: lattice (for bordered tables) and stream (for borderless tables). It produces pandas DataFrames directly, which is convenient for data analysis workflows. However, it depends on Ghostscript and OpenCV, making installation heavier, and at ~50ms+ it is slower than every option in the table except the Java-based tabula-py.
PyMuPDF (fitz) added basic table extraction in version 1.23. It is fast (4.6ms) and works well for simple bordered tables, but its borderless table support is limited compared to pdfplumber or Camelot.
pypdf does not have any table detection capability. It extracts raw text, so you would need to write your own parsing logic to reconstruct table structure.
tabula-py is a Python wrapper around the Java-based Tabula library. It provides good table detection for both bordered and borderless tables, but requires a Java runtime and is the slowest option due to JVM startup overhead. It is best suited for one-off extraction tasks rather than high-throughput pipelines.
For most production use cases, the recommended approach is to use PDF Oxide as your primary extractor for speed and simplicity, and fall back to pdfplumber for the subset of documents with complex table layouts that require advanced heuristics.
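One way to wire up that primary-plus-fallback approach is sketched below. The extractors are passed in as plain callables so the routing logic stays independent of either library; `extract_with_fallback` and `is_ragged` are hypothetical names, and treating rows of unequal length as a sign of a complex layout is an assumption, not a documented PDF Oxide signal.

```python
def is_ragged(rows):
    """Heuristic: rows with differing cell counts suggest a misparsed table."""
    return len({len(row) for row in rows}) > 1


def extract_with_fallback(page_index, primary, fallback, needs_fallback=is_ragged):
    """Try the fast extractor first; use the slower one for suspicious output."""
    rows = primary(page_index)
    if not rows or needs_fallback(rows):
        return fallback(page_index), "fallback"
    return rows, "primary"


# Stand-in extractors so the sketch runs without any PDF library installed.
def fake_fast(page_index):
    return [["Item", "Qty"], ["Widget", "10", "stray"]]  # ragged on purpose


def fake_robust(page_index):
    return [["Item", "Qty"], ["Widget", "10"]]


rows, source = extract_with_fallback(0, fake_fast, fake_robust)
print(source, rows)
```

In a real pipeline, `primary` would wrap the PDF Oxide call and `fallback` the pdfplumber call for the same page, so only the pages that trip the heuristic pay the slower extraction cost.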
Installation
pip install pdf_oxide
Basic Table Extraction
As Markdown Tables
The simplest approach — convert the page to Markdown, which includes tables in GFM syntax:
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
for i in range(doc.page_count()):
    md = doc.to_markdown(i, detect_headings=True)
    if "|" in md:  # Page contains a table
        print(f"--- Page {i + 1} ---")
        print(md)
WASM
const doc = new WasmPdfDocument(bytes);
for (let i = 0; i < doc.pageCount(); i++) {
    const md = doc.toMarkdown(i);
    if (md.includes("|")) { // Page contains a table
        console.log(`--- Page ${i + 1} ---`);
        console.log(md);
    }
}
doc.free();
Rust
let mut doc = PdfDocument::open("report.pdf")?;
for i in 0..doc.page_count()? {
    let md = doc.to_markdown(i, true)?;
    if md.contains("|") {
        println!("--- Page {} ---", i + 1);
        println!("{}", md);
    }
}
Go
doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
n, _ := doc.PageCount()
for i := 0; i < n; i++ {
    md, _ := doc.ToMarkdown(i)
    if strings.Contains(md, "|") {
        fmt.Printf("--- Page %d ---\n%s\n", i+1, md)
    }
}
C#
using var doc = PdfDocument.Open("report.pdf");
for (int i = 0; i < doc.PageCount; i++)
{
    var md = doc.ToMarkdown(i);
    if (md.Contains("|"))
        Console.WriteLine($"--- Page {i + 1} ---\n{md}");
}
Structured Table Extraction (v0.3.34)
For typed access to rows and bounding boxes without parsing Markdown, call ExtractTables(pageIndex) (Go, C#) / extract_tables(page) (Python, Rust). Each table exposes structured cells so you can pipe results directly into a database or DataFrame without regex.
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("invoice.pdf")
for table in doc.extract_tables(0):
    for row in table.rows:
        print(row)
Rust
let mut doc = PdfDocument::open("invoice.pdf")?;
for table in doc.extract_tables(0)? {
    for row in &table.rows {
        println!("{:?}", row);
    }
}
Go
doc, _ := pdfoxide.Open("invoice.pdf")
defer doc.Close()
tables, _ := doc.ExtractTables(0)
for _, t := range tables {
    for _, row := range t.Rows {
        fmt.Println(row)
    }
}
C#
using var doc = PdfDocument.Open("invoice.pdf");
foreach (var table in doc.ExtractTables(0))
    foreach (var row in table.Rows)
        Console.WriteLine(string.Join(" | ", row));
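For loading structured rows into a database or DataFrame, it is often handy to turn them into records keyed by column name. The sketch below assumes each table's rows are sequences of strings with the header in the first row; that layout is an assumption about the structured output, and `rows_to_records` is a hypothetical helper rather than part of the library.

```python
def rows_to_records(rows):
    """Turn [header, *data] rows into a list of dicts keyed by header cell."""
    if not rows:
        return []
    header, *data = rows
    records = []
    for row in data:
        # Pad short rows and drop extra cells so every record has the same keys.
        padded = list(row[: len(header)]) + [""] * (len(header) - len(row))
        records.append(dict(zip(header, padded)))
    return records


sample_rows = [["Item", "Qty"], ["Widget", "10"], ["Gadget"]]
records = rows_to_records(sample_rows)
print(records)

# With a real document this might be used as (assumed attribute name `rows`):
# for table in doc.extract_tables(0):
#     records = rows_to_records(table.rows)
```

A list of dicts can be passed straight to `pandas.DataFrame(records)` or to `csv.DictWriter`.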
Parse Markdown Tables to Rows
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("invoice.pdf")
md = doc.to_markdown(0)
# Extract table rows from Markdown
rows = []
for line in md.split("\n"):
    line = line.strip()
    if line.startswith("|") and not line.startswith("|--"):
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        rows.append(cells)
header = rows[0] if rows else []
data = rows[1:] if len(rows) > 1 else []
print(f"Columns: {header}")
for row in data:
    print(row)
WASM
const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdown(0);
const rows = [];
for (const line of md.split("\n")) {
    const trimmed = line.trim();
    if (trimmed.startsWith("|") && !trimmed.startsWith("|--")) {
        const cells = trimmed.split("|").slice(1, -1).map(c => c.trim());
        rows.push(cells);
    }
}
const header = rows[0] || [];
const data = rows.slice(1);
console.log("Columns:", header);
data.forEach(row => console.log(row));
doc.free();
Rust
let mut doc = PdfDocument::open("invoice.pdf")?;
let md = doc.to_markdown(0, false)?;
let rows: Vec<Vec<String>> = md
    .lines()
    .map(|l| l.trim())
    .filter(|l| l.starts_with('|') && !l.starts_with("|--"))
    .map(|l| {
        // Drop the outer pipes, then split, so empty cells are kept
        let inner = l.strip_prefix('|').unwrap_or(l);
        let inner = inner.strip_suffix('|').unwrap_or(inner);
        inner.split('|').map(|c| c.trim().to_string()).collect()
    })
    .collect();
if let Some(header) = rows.first() {
    println!("Columns: {:?}", header);
    for row in &rows[1..] {
        println!("{:?}", row);
    }
}
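The line-based parsers above treat every pipe as a cell boundary and recognize only separator rows that start with `|--`. If the Markdown may contain alignment markers (`|:---|`), spaced separators (`| --- |`), or escaped pipes inside cells, a slightly more defensive parser can help. The following is a sketch with hypothetical names; it handles those three cases and splits the input into one row list per table.

```python
import re

# A separator row contains only dashes, optional colons, pipes and spaces.
SEPARATOR = re.compile(r"^\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?$")
# Split on pipes that are not preceded by a backslash.
CELL_SPLIT = re.compile(r"(?<!\\)\|")


def parse_gfm_tables(md):
    """Return a list of tables; each table is a list of rows of cell strings."""
    tables, current = [], []
    for line in md.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            # A non-table line ends the table collected so far.
            if current:
                tables.append(current)
                current = []
            continue
        if SEPARATOR.match(line):
            continue
        inner = line[1:-1] if line.endswith("|") else line[1:]
        current.append(
            [cell.strip().replace("\\|", "|") for cell in CELL_SPLIT.split(inner)]
        )
    if current:
        tables.append(current)
    return tables


sample = "\n".join([
    "Intro text",
    "| Item | Note |",
    "|:-----|-----:|",
    r"| A \| B | x |",
    "",
    "| K | V |",
    "| --- | --- |",
    "| 1 |  |",
])
tables = parse_gfm_tables(sample)
print(tables)
```

One limitation to keep in mind: a data row whose cells contain only dashes is indistinguishable from a separator row and will be skipped.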
Export to CSV
import csv
from pdf_oxide import PdfDocument
doc = PdfDocument("invoice.pdf")
md = doc.to_markdown(0)
rows = []
for line in md.split("\n"):
    line = line.strip()
    if line.startswith("|") and not line.startswith("|--"):
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        rows.append(cells)
with open("table.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(rows)
Export to Pandas DataFrame
import pandas as pd
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
md = doc.to_markdown(0)
rows = []
for line in md.split("\n"):
    line = line.strip()
    if line.startswith("|") and not line.startswith("|--"):
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        rows.append(cells)
if rows:
    df = pd.DataFrame(rows[1:], columns=rows[0])
    print(df)
Using Character Positions for Custom Table Parsing
For fine-grained control, use character-level extraction and spatial analysis:
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("financial.pdf")
chars = doc.extract_chars(0)
# Group characters by Y position (rows)
rows = {}
for ch in chars:
    row_key = round(ch.y / 2) * 2  # Snap to 2pt grid
    rows.setdefault(row_key, []).append(ch)
# Sort rows top-to-bottom, characters left-to-right
for y in sorted(rows.keys(), reverse=True):
    line_chars = sorted(rows[y], key=lambda c: c.x)
    text = "".join(c.char for c in line_chars)
    print(text)
WASM
const doc = new WasmPdfDocument(bytes);
const chars = doc.extractChars(0);
// Group characters by Y position (rows)
const rows = new Map();
for (const ch of chars) {
    const rowKey = Math.round(ch.y / 2) * 2; // Snap to 2pt grid
    if (!rows.has(rowKey)) rows.set(rowKey, []);
    rows.get(rowKey).push(ch);
}
// Sort rows top-to-bottom, characters left-to-right
const sortedKeys = [...rows.keys()].sort((a, b) => b - a);
for (const y of sortedKeys) {
    const lineChars = rows.get(y).sort((a, b) => a.x - b.x);
    const text = lineChars.map(c => c.char).join("");
    console.log(text);
}
doc.free();
Rust
use std::collections::BTreeMap;
let mut doc = PdfDocument::open("financial.pdf")?;
let chars = doc.extract_chars(0)?;
let mut rows: BTreeMap<i32, Vec<_>> = BTreeMap::new();
for ch in &chars {
    let row_key = ((ch.y / 2.0).round() * 2.0) as i32;
    rows.entry(row_key).or_default().push(ch);
}
for (_, line_chars) in rows.iter().rev() {
    let mut sorted = line_chars.clone();
    sorted.sort_by(|a, b| a.x.partial_cmp(&b.x).unwrap());
    let text: String = sorted.iter().map(|c| c.char).collect();
    println!("{}", text);
}
Go
doc, _ := pdfoxide.Open("financial.pdf")
defer doc.Close()
chars, _ := doc.ExtractChars(0)
rows := map[int][]pdfoxide.Char{}
for _, ch := range chars {
    key := int(math.Round(float64(ch.Y)/2) * 2)
    rows[key] = append(rows[key], ch)
}
keys := make([]int, 0, len(rows))
for k := range rows { keys = append(keys, k) }
sort.Sort(sort.Reverse(sort.IntSlice(keys)))
for _, y := range keys {
    line := rows[y]
    sort.Slice(line, func(i, j int) bool { return line[i].X < line[j].X })
    var b strings.Builder
    for _, c := range line { b.WriteString(c.Char) }
    fmt.Println(b.String())
}
C#
using var doc = PdfDocument.Open("financial.pdf");
var chars = doc.ExtractChars(0);
var rows = chars
    .GroupBy(c => (int)(Math.Round(c.Y / 2) * 2))
    .OrderByDescending(g => g.Key);
foreach (var row in rows)
{
    var line = string.Concat(row.OrderBy(c => c.X).Select(c => c.Char));
    Console.WriteLine(line);
}
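The examples above rebuild text lines; turning a line into table cells needs one more step, splitting wherever the horizontal gap between consecutive characters is large. The sketch below uses a stand-in `Char` tuple so it runs without a PDF. The `width` field is an assumption (the examples above only show `x`, `y`, and `char`), and `split_cells`, `make_run`, and the 8-point gap are hypothetical choices.

```python
from collections import namedtuple

# Stand-in for an extracted character; `width` is an assumed attribute.
Char = namedtuple("Char", "char x y width")


def split_cells(line_chars, gap=8.0):
    """Split one row of characters into cell strings at wide horizontal gaps."""
    cells, current, prev_end = [], [], None
    for ch in sorted(line_chars, key=lambda c: c.x):
        if prev_end is not None and ch.x - prev_end > gap:
            cells.append("".join(current).strip())
            current = []
        current.append(ch.char)
        prev_end = ch.x + ch.width
    if current:
        cells.append("".join(current).strip())
    return cells


def make_run(text, x, y=700.0, width=6.0):
    """Lay out a string as evenly spaced characters starting at x."""
    return [Char(c, x + i * width, y, width) for i, c in enumerate(text)]


line = make_run("Widget", 50) + make_run("10", 200) + make_run("$9.99", 300)
cells = split_cells(line)
print(cells)
```

If character widths are not available, the distance between consecutive `x` values compared against a typical glyph advance can serve as a rougher gap estimate.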
Extract Tables to Markdown
Markdown is the ideal output format when you are feeding PDF content to a large language model, building a RAG pipeline, or storing extracted data in a format that is both human-readable and machine-parseable. PDF Oxide outputs tables in GitHub Flavored Markdown (GFM) format natively, so no additional conversion step is needed.
from pdf_oxide import PdfDocument
doc = PdfDocument("quarterly-report.pdf")
# Extract all tables across all pages as Markdown
all_tables = []
for i in range(doc.page_count()):
    md = doc.to_markdown(i, detect_headings=True)
    # Split the markdown into sections and find table blocks
    in_table = False
    current_table = []
    for line in md.split("\n"):
        if line.strip().startswith("|"):
            in_table = True
            current_table.append(line)
        else:
            if in_table and current_table:
                all_tables.append("\n".join(current_table))
                current_table = []
            in_table = False
    # Flush a table that runs to the end of the page
    if current_table:
        all_tables.append("\n".join(current_table))
print(f"Found {len(all_tables)} tables")
for idx, table in enumerate(all_tables):
    print(f"\n--- Table {idx + 1} ---")
    print(table)
The GFM table output is directly compatible with LLM prompts. You can pass it straight into an OpenAI or Anthropic API call and the model will understand the tabular structure without any additional formatting:
# Feed extracted table to an LLM for analysis
prompt = f"""Analyze the following financial table and summarize the key trends:
{all_tables[0]}
"""
Because the tables are already Markdown, this avoids extracting them with pdfplumber and converting the result yourself, and it keeps the speed advantage shown in the comparison table (0.8ms versus 23.2ms).
Handling Multi-Page Tables
Tables that span multiple pages are a common challenge in PDF extraction. Financial statements, inventory lists, and regulatory filings frequently contain tables that run across two, five, or even dozens of pages. The key insight is that you need to extract the table from each page separately and then stitch the rows together, being careful to handle repeated headers and page artifacts.
from pdf_oxide import PdfDocument
doc = PdfDocument("long-report.pdf")
def extract_table_rows(md_text):
    """Extract table rows from markdown text, returning header and data separately."""
    header = None
    data_rows = []
    for line in md_text.split("\n"):
        line = line.strip()
        if not line.startswith("|") or line.startswith("|--"):
            continue
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        if header is None:
            header = cells
        else:
            data_rows.append(cells)
    return header, data_rows
# Collect rows across all pages
combined_header = None
combined_rows = []
for i in range(doc.page_count()):
    md = doc.to_markdown(i)
    header, rows = extract_table_rows(md)
    if header is None:
        continue  # No table on this page
    if combined_header is None:
        combined_header = header
    elif header == combined_header:
        pass  # Skip repeated header on subsequent pages
    else:
        # Different table: report the current one and start a new one
        print(f"Table with {len(combined_rows)} rows found")
        combined_header = header
        combined_rows = []
    combined_rows.extend(rows)
if combined_header and combined_rows:
    print(f"Columns: {combined_header}")
    print(f"Total rows: {len(combined_rows)}")
    for row in combined_rows[:5]:
        print(row)
    if len(combined_rows) > 5:
        print(f"... and {len(combined_rows) - 5} more rows")
This pattern works reliably for tables where the header row is repeated on each page (the most common case). For tables where the header only appears on the first page, you can simplify the logic by only capturing the header from the first page that contains a table and treating all subsequent rows as data.
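The simplified first-page-header variant might look like the sketch below. It works on a list of per-page Markdown strings so it can be run without a PDF; `stitch_pages` is a hypothetical helper, and it assumes the page range contains a single logical table.

```python
def stitch_pages(pages_md):
    """Combine pipe-table rows from consecutive pages into (header, rows)."""
    header, rows = None, []
    for md in pages_md:
        for line in md.splitlines():
            line = line.strip()
            if not line.startswith("|") or line.startswith("|--"):
                continue
            cells = [cell.strip() for cell in line.split("|")[1:-1]]
            if header is None:
                header = cells      # first table row seen on any page
            elif cells != header:
                rows.append(cells)  # data row; repeated headers are skipped
    return header, rows


page1 = "| Item | Qty |\n|------|-----|\n| Widget | 10 |"
page2 = "| Gadget | 2 |\n| Gizmo | 7 |"  # continuation page without a header
header, rows = stitch_pages([page1, page2])
print(header, rows)

# With a real document, the input could be built as:
# pages_md = [doc.to_markdown(i) for i in range(doc.page_count())]
```

Because repeated headers are dropped by comparison rather than by position, the same function covers both the repeated-header and the first-page-only case.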
Export Tables to CSV or DataFrame
Once you have extracted table data, you often need it in a structured format for further analysis. The examples below show how to go from a PDF to a pandas DataFrame or a CSV file in just a few lines.
Batch Export: All Tables to Separate CSV Files
import csv
from pdf_oxide import PdfDocument
doc = PdfDocument("catalog.pdf")
table_count = 0
for i in range(doc.page_count()):
    md = doc.to_markdown(i)
    rows = []
    for line in md.split("\n"):
        line = line.strip()
        if line.startswith("|") and not line.startswith("|--"):
            cells = [cell.strip() for cell in line.split("|")[1:-1]]
            rows.append(cells)
    if len(rows) > 1:  # At least header + one data row
        table_count += 1
        filename = f"table_page{i + 1}_{table_count}.csv"
        with open(filename, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(rows)
        print(f"Saved {filename} ({len(rows) - 1} data rows)")
print(f"Exported {table_count} tables total")
Multi-Page Table to DataFrame
For tables that span multiple pages, combine the multi-page stitching pattern with pandas:
import pandas as pd
from pdf_oxide import PdfDocument
doc = PdfDocument("financial-statement.pdf")
header = None
all_rows = []
for i in range(doc.page_count()):
    md = doc.to_markdown(i)
    for line in md.split("\n"):
        line = line.strip()
        if not line.startswith("|") or line.startswith("|--"):
            continue
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        if header is None:
            header = cells
        elif cells == header:
            continue  # Skip repeated header
        else:
            all_rows.append(cells)
if header and all_rows:
    df = pd.DataFrame(all_rows, columns=header)
    # Clean up numeric columns
    for col in df.columns:
        # Try to convert columns that look numeric
        cleaned = df[col].str.replace(r"[$,%]", "", regex=True).str.strip()
        try:
            df[col] = pd.to_numeric(cleaned)
        except (ValueError, TypeError):
            pass  # Keep as string
    print(df.dtypes)
    print(df.head(10))
    df.to_csv("financial_data.csv", index=False)
This workflow gives you a clean DataFrame with proper numeric types, ready for analysis with pandas, plotting with matplotlib, or loading into a database.
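The column-level conversion above is all-or-nothing: one non-numeric cell keeps the whole column as strings. When pandas is not available, or when mixed columns should keep whatever numbers they contain, a per-cell conversion is an alternative. The helper below is a hypothetical sketch that strips the same `$`, `,`, and `%` characters as the pandas example.

```python
def to_number(cell):
    """Return a float for numeric-looking cells, otherwise the original string."""
    cleaned = cell.replace("$", "").replace(",", "").replace("%", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return cell


row = ["Widget", "1,250", "$9.99", "12%", "N/A"]
converted = [to_number(cell) for cell in row]
print(converted)
```

Note that stripping the percent sign discards the unit, so a value such as 12% becomes 12.0 rather than 0.12; keep the header or a separate unit column if that distinction matters.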
Complex Tables: When to Use pdfplumber
PDF Oxide’s table detection handles standard aligned tables well. For complex cases — merged cells, spanning headers, borderless tables, or multi-line cell content — pdfplumber’s dedicated table extraction algorithms are more robust:
import pdfplumber
with pdfplumber.open("complex-report.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()
    for table in tables:
        for row in table:
            print(row)
When to Use Each
| Scenario | Recommended |
|---|---|
| Simple aligned tables | PDF Oxide (29× faster) |
| Tables as part of full-page Markdown | PDF Oxide |
| Complex merged cells / spanning headers | pdfplumber |
| Borderless tables | pdfplumber |
| Speed-critical batch processing | PDF Oxide |
Use Both Together
Use PDF Oxide for fast text extraction and pdfplumber for the complex tables:
from pdf_oxide import PdfDocument
import pdfplumber
# Fast full-text extraction
doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
# Targeted table extraction for complex pages
with pdfplumber.open("report.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()
Related Pages
- Markdown Conversion — full Markdown API reference
- Text Extraction — plain text and character extraction
- PDF Oxide vs pdfplumber — detailed comparison
- PDF to Markdown — Markdown conversion guide