Extract Tables from PDF in Python

Extracting tables from PDF documents is one of the most common tasks in document processing pipelines. Whether you are pulling financial data from annual reports, scraping product catalogs, or feeding structured data into an LLM, reliable table extraction is essential. This guide covers everything you need to know about extracting tables from PDFs in Python, from quick one-liners to production-grade workflows for multi-page tables.

Detection engine

PDF Oxide uses the universal edges → snap/merge → intersections → cells → groups table-detection pipeline — the same approach used by Tabula, pdfplumber, and PyMuPDF, implemented in pure Rust.

Detection capabilities:

  • Intersection-based — finds H×V line crossings, builds cells from four-corner rectangles, groups into tables via union-find.
  • Extended grid — when horizontal and vertical lines live in different page regions, a virtual grid is built from the Cartesian product of all coordinates.
  • Column-aware text detection — segments 2-column layouts via X-projection histogram, then runs text-only table detection per column.
  • H-rule-bounded text tables — detects tables bounded by horizontal rules but no vertical lines (common in academic papers).
  • Hybrid row detection — infers row boundaries from text Y-positions when only vertical borders exist (invoice line items).
  • Dotted / dashed line reconstitution — merges short line segments into continuous edges.
  • Section divider splitting — splits multi-section forms at full-width horizontal dividers.
  • Edge coverage filtering — removes orphan edges that don’t participate in any potential grid.
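The intersection-based step in the list above can be illustrated with a small pure-Python sketch. It is not PDF Oxide's Rust implementation: the edge tuple layout, the function names (`find_cells`, `group_cells`), and the `tol` parameter are assumptions made for illustration, and merged cells are ignored.

```python
from itertools import product


def _covered(edges, pos, lo, hi, tol):
    """True if some edge at coordinate `pos` spans the interval [lo, hi]."""
    return any(abs(p - pos) <= tol and a - tol <= lo and hi <= b + tol
               for p, a, b in edges)


def find_cells(h_edges, v_edges, tol=0.5):
    """h_edges: (y, x0, x1); v_edges: (x, y0, y1). Returns (x0, y0, x1, y1) boxes."""
    # 1. Intersections: every point where a horizontal and a vertical edge cross.
    points = set()
    for (y, hx0, hx1), (x, vy0, vy1) in product(h_edges, v_edges):
        if hx0 - tol <= x <= hx1 + tol and vy0 - tol <= y <= vy1 + tol:
            points.add((x, y))

    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})

    # 2. Cells: adjacent grid coordinates whose four corners are crossings
    #    and whose four sides are actually drawn.
    cells = []
    for x0, x1 in zip(xs, xs[1:]):
        for y0, y1 in zip(ys, ys[1:]):
            corners = {(x0, y0), (x1, y0), (x0, y1), (x1, y1)}
            if (corners <= points
                    and _covered(h_edges, y0, x0, x1, tol)
                    and _covered(h_edges, y1, x0, x1, tol)
                    and _covered(v_edges, x0, y0, y1, tol)
                    and _covered(v_edges, x1, y0, y1, tol)):
                cells.append((x0, y0, x1, y1))
    return cells


def group_cells(cells):
    """Group cells that share a corner into tables, using union-find."""
    parent = list(range(len(cells)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    corner_owner = {}
    for idx, (x0, y0, x1, y1) in enumerate(cells):
        for corner in ((x0, y0), (x1, y0), (x0, y1), (x1, y1)):
            if corner in corner_owner:
                parent[find(idx)] = find(corner_owner[corner])
            else:
                corner_owner[corner] = idx

    groups = {}
    for idx, cell in enumerate(cells):
        groups.setdefault(find(idx), []).append(cell)
    return list(groups.values())
```

Checking that the sides are drawn, and not only that the corners exist, keeps two neighbouring tables at the same height from being bridged by a phantom cell.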

Configuration

TableDetectionConfig exposes tunable parameters:

| Field | Default | Description |
|-------|---------|-------------|
| horizontal_strategy | "lines_strict" | "lines_strict", "lines", "text", or "explicit" |
| vertical_strategy | "lines_strict" | Same vocabulary |
| v_split_gap | 20.0 pt | Gap between vertical lines that triggers splitting into separate tables (was hardcoded 4pt prior to v0.3.20) |
| snap_tolerance | 3.0 pt | Edge-snap merging tolerance |
| text_tolerance | 3.0 pt | Text-line merging tolerance |

Behavior change

From v0.3.20 onwards, the default strategy for Python extract_tables() is Both (detects via both lines and text). Pages that relied on the old Text-only default should pass horizontal_strategy="text" and vertical_strategy="text" explicitly.

The Python binding now correctly reads vertical_strategy from the table_settings dict — previously it was silently ignored.
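As a rough sketch of pinning both strategies explicitly, the helper below builds and validates a settings dict using the strategy vocabulary from the configuration table. The helper name `make_table_settings` is hypothetical, and the `table_settings=` keyword in the commented call is an assumption about the binding's signature; check it against your installed version.

```python
# Strategy names taken from the configuration table above.
ALLOWED_STRATEGIES = {"lines_strict", "lines", "text", "explicit"}


def make_table_settings(horizontal="text", vertical="text"):
    """Return a settings dict, rejecting strategy names outside the vocabulary."""
    for name, value in (("horizontal_strategy", horizontal),
                        ("vertical_strategy", vertical)):
        if value not in ALLOWED_STRATEGIES:
            raise ValueError(
                f"{name} must be one of {sorted(ALLOWED_STRATEGIES)}, got {value!r}"
            )
    return {"horizontal_strategy": horizontal, "vertical_strategy": vertical}


settings = make_table_settings()  # text-only detection on both axes

# Assumed call shape (not confirmed by this guide):
# tables = doc.extract_tables(0, table_settings=settings)
```

Validating the names up front avoids a typo such as "line" silently falling back to a default.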

Rendering

Extracted tables are emitted with space-padded column alignment (replacing the ASCII box-drawing characters from earlier versions). Currency and number columns are right-aligned automatically. Form-number prefixes ("1 Apr 11" → "Apr 11") and decorative dash / underscore cells ("------") are stripped during rendering.
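To make the alignment rule concrete, here is a minimal sketch of space-padded rendering with right-aligned numeric columns. It is an illustration only, not the library's renderer: the `NUMERIC` pattern, the two-space column separator, and the function name `render_padded` are assumptions, and the rows are assumed to be rectangular.

```python
import re

# Loose "looks like a number or currency amount" pattern (an assumption).
NUMERIC = re.compile(r"^[-+(]?[$€£]?\d[\d,]*(\.\d+)?%?\)?$")


def render_padded(rows):
    """Render rows (first row = header) as space-padded, aligned text."""
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]

    # A column counts as numeric when every non-empty body cell matches NUMERIC.
    numeric = []
    for i in range(len(widths)):
        body = [row[i] for row in rows[1:] if row[i]]
        numeric.append(bool(body) and all(NUMERIC.match(c) for c in body))

    lines = []
    for row in rows:
        cells = [c.rjust(w) if num else c.ljust(w)
                 for c, w, num in zip(row, widths, numeric)]
        lines.append("  ".join(cells).rstrip())
    return "\n".join(lines)
```

Right-aligning numeric columns lines up the decimal digits, which is why the header of such a column is right-aligned along with its values.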

Extract table data from a PDF using Markdown conversion:

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)
# Output includes tables in GFM format:
# | Item | Qty | Price |
# |------|-----|-------|
# | Widget | 10 | $9.99 |

WASM

import { WasmPdfDocument } from "pdf-oxide-wasm";

const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdown(0);
console.log(md);
// Output includes tables in GFM format:
// | Item | Qty | Price |
// |------|-----|-------|
// | Widget | 10 | $9.99 |
doc.free();

Rust

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("invoice.pdf")?;
let md = doc.to_markdown(0, true)?;
println!("{}", md);

Go

package main

import (
    "fmt"
    "log"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("invoice.pdf")
    if err != nil { log.Fatal(err) }
    defer doc.Close()

    md, err := doc.ToMarkdown(0)
    if err != nil { log.Fatal(err) }
    fmt.Println(md)
}

C#

using PdfOxide;

using var doc = PdfDocument.Open("invoice.pdf");
Console.WriteLine(doc.ToMarkdown(0));

PDF Oxide detects tabular layouts from spatial analysis of aligned text blocks and emits GitHub Flavored Markdown tables.

Why Table Extraction from PDFs Is Challenging

If you have ever tried to copy a table from a PDF and paste it into a spreadsheet, you know the result is usually a mess. That is not a bug in your PDF viewer — it reflects a fundamental limitation of the PDF format itself.

PDFs have no concept of a “table.” Unlike HTML, which uses <table>, <tr>, and <td> tags to define tabular structure, a PDF file stores only drawing instructions: place this glyph at coordinates (x, y), draw a line from point A to point B. There is no semantic layer that says “these characters belong to a cell in row 3, column 2.” Every table extraction library must reconstruct that structure by analyzing the spatial positions of text and lines on the page.

This reconstruction is hard for several reasons:

  • Bordered vs. borderless tables. When a table has visible grid lines, extraction tools can use those lines as cell boundaries. Borderless tables — common in financial statements, government reports, and academic papers — have no lines at all. The library must infer column boundaries purely from whitespace gaps between text blocks, which is error-prone when columns have variable widths or when numeric values are right-aligned.

  • Merged cells and spanning headers. A header cell that spans three columns looks like a single wide text block. Without the grid lines to delineate it, a parser has no reliable way to know which columns the header covers. Some libraries handle this well; many silently produce garbled output.

  • Multi-line cell content. When a cell contains a paragraph of text that wraps to multiple lines, naive row-based parsing treats each wrapped line as a separate row. Correctly grouping those lines back into a single cell requires understanding the vertical extent of each row.

  • Multi-page tables. Large tables often span two or more pages. The header row may or may not be repeated on each page, and page footers, watermarks, or page numbers may appear between table rows. Stitching these fragments back into a single coherent table requires page-aware logic.

  • Rotated text and non-standard layouts. Some PDFs use rotated text for column headers, or place tables in multi-column page layouts. These edge cases break assumptions that most parsers make about left-to-right, top-to-bottom reading order.
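The whitespace-gap inference described for borderless tables can be sketched in a few lines. This is a simplified illustration under stated assumptions: each text block is reduced to its horizontal extent `(x0, x1)`, and the `min_gap` threshold and function name `column_gaps` are hypothetical.

```python
def column_gaps(spans, min_gap=8.0):
    """spans: (x0, x1) extents of the text blocks in a table region.

    Returns the x-coordinate at the centre of each whitespace gap that is
    at least `min_gap` wide; these are candidate column separators.
    """
    merged = []
    for x0, x1 in sorted(spans):
        if merged and x0 <= merged[-1][1]:
            # Overlaps the running interval, so extend it.
            merged[-1][1] = max(merged[-1][1], x1)
        else:
            merged.append([x0, x1])

    return [(a[1] + b[0]) / 2
            for a, b in zip(merged, merged[1:])
            if b[0] - a[1] >= min_gap]
```

The same sketch shows why spanning headers are a problem: a single wide block that covers two columns fills the gap between them, and that separator disappears.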

Understanding these challenges helps you choose the right tool for your specific documents. For straightforward aligned tables — the majority of invoices, order confirmations, and simple reports — a fast spatial analysis approach like PDF Oxide works well. For documents with complex merging, borderless layouts, or unusual formatting, you may need a library with more sophisticated heuristics.

Table Extraction: PDF Oxide vs Other Libraries

Choosing a library for PDF table extraction in Python depends on your documents, your performance requirements, and how you need the output formatted. Here is how the major options compare:

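| Library | Table Detection | Bordered Tables | Borderless Tables | Output Format | Speed |
|---------|-----------------|-----------------|-------------------|---------------|-------|
| PDF Oxide | Built-in | Yes | Basic | Markdown/HTML | 0.8ms |
| pdfplumber | Built-in | Yes | Advanced | Python lists | 23.2ms |
| Camelot | Built-in | Yes | Yes (lattice/stream) | DataFrames | ~50ms+ |
| PyMuPDF | Basic (v1.23+) | Yes | Limited | DataFrames | 4.6ms |
| pypdf | No | No | No | N/A | N/A |
| tabula-py | Built-in | Yes | Yes | DataFrames | ~100ms+ (Java) |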

PDF Oxide is the fastest option by a wide margin. It detects tables through spatial analysis of aligned text blocks and outputs clean GitHub Flavored Markdown tables. At 0.8ms mean extraction time, it is 29x faster than pdfplumber and over 100x faster than tabula-py. It handles bordered tables and simple aligned borderless tables well. For LLM pipelines where you need Markdown output anyway, it is the natural choice.

pdfplumber has the most mature borderless table detection. Its find_tables() method uses configurable strategies for detecting rows and columns based on text alignment, and it handles merged cells and multi-line cell content better than most alternatives. The trade-off is speed: at 23.2ms per page, it is significantly slower for batch processing.

Camelot offers two detection modes: lattice (for bordered tables) and stream (for borderless tables). It produces pandas DataFrames directly, which is convenient for data analysis workflows. However, it depends on Ghostscript and OpenCV, making installation heavier, and at ~50ms+ it is the slowest of the options that do not require a Java runtime.

PyMuPDF (fitz) added basic table extraction in version 1.23. It is fast (4.6ms) and works well for simple bordered tables, but its borderless table support is limited compared to pdfplumber or Camelot.

pypdf does not have any table detection capability. It extracts raw text, so you would need to write your own parsing logic to reconstruct table structure.

tabula-py is a Python wrapper around the Java-based Tabula library. It provides good table detection for both bordered and borderless tables, but requires a Java runtime and is the slowest option due to JVM startup overhead. It is best suited for one-off extraction tasks rather than high-throughput pipelines.

For most production use cases, the recommended approach is to use PDF Oxide as your primary extractor for speed and simplicity, and fall back to pdfplumber for the subset of documents with complex table layouts that require advanced heuristics.
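One way to wire up that primary-plus-fallback arrangement is sketched below. The extractors are passed in as plain callables so the routing logic stays independent of either library; `extract_with_fallback` and `consistent_width` are hypothetical names, and the quality check is only one possible heuristic.

```python
def consistent_width(rows):
    """Heuristic: a header plus at least one data row, all with equal cell counts."""
    return len(rows) > 1 and len({len(r) for r in rows}) == 1


def extract_with_fallback(page_index, primary, fallback, looks_ok=consistent_width):
    """Try the fast extractor first; use the slower one only when the result looks off.

    `primary` and `fallback` are callables taking a page index and returning
    a list of rows (each row a list of cell strings).
    """
    rows = primary(page_index)
    if looks_ok(rows):
        return rows, "primary"
    return fallback(page_index), "fallback"
```

In practice `primary` would wrap the PDF Oxide call and `fallback` the pdfplumber call; returning the source label alongside the rows makes it easy to measure how often the fallback is needed.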

Installation

pip install pdf_oxide

Basic Table Extraction

As Markdown Tables

The simplest approach — convert the page to Markdown, which includes tables in GFM syntax:

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
for i in range(doc.page_count()):
    md = doc.to_markdown(i, detect_headings=True)
    if "|" in md:  # Page contains a table
        print(f"--- Page {i + 1} ---")
        print(md)

WASM

const doc = new WasmPdfDocument(bytes);
for (let i = 0; i < doc.pageCount(); i++) {
    const md = doc.toMarkdown(i);
    if (md.includes("|")) { // Page contains a table
        console.log(`--- Page ${i + 1} ---`);
        console.log(md);
    }
}
doc.free();

Rust

let mut doc = PdfDocument::open("report.pdf")?;
for i in 0..doc.page_count()? {
    let md = doc.to_markdown(i, true)?;
    if md.contains("|") {
        println!("--- Page {} ---", i + 1);
        println!("{}", md);
    }
}

Go

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()

n, _ := doc.PageCount()
for i := 0; i < n; i++ {
    md, _ := doc.ToMarkdown(i)
    if strings.Contains(md, "|") {
        fmt.Printf("--- Page %d ---\n%s\n", i+1, md)
    }
}

C#

using var doc = PdfDocument.Open("report.pdf");
for (int i = 0; i < doc.PageCount; i++)
{
    var md = doc.ToMarkdown(i);
    if (md.Contains("|"))
        Console.WriteLine($"--- Page {i + 1} ---\n{md}");
}

Structured Table Extraction (v0.3.34)

For typed access to rows and bounding boxes without parsing Markdown, call ExtractTables(pageIndex) (Go, C#) / extract_tables(page) (Python, Rust). Each table exposes structured cells so you can pipe results directly into a database or DataFrame without regex.

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")
for table in doc.extract_tables(0):
    for row in table.rows:
        print(row)

Rust

let mut doc = PdfDocument::open("invoice.pdf")?;
for table in doc.extract_tables(0)? {
    for row in &table.rows {
        println!("{:?}", row);
    }
}

Go

doc, _ := pdfoxide.Open("invoice.pdf")
defer doc.Close()

tables, _ := doc.ExtractTables(0)
for _, t := range tables {
    for _, row := range t.Rows {
        fmt.Println(row)
    }
}

C#

using var doc = PdfDocument.Open("invoice.pdf");
foreach (var table in doc.ExtractTables(0))
    foreach (var row in table.Rows)
        Console.WriteLine(string.Join(" | ", row));
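As an illustration of loading structured rows into a database, the sketch below writes one table into SQLite using only the standard library. It assumes the rows arrive as a list of lists of strings with the header in the first row, which this guide does not confirm for `table.rows`; the function name `rows_to_sqlite` is hypothetical.

```python
import sqlite3


def rows_to_sqlite(conn, table_name, rows):
    """Create `table_name` from rows (first row = header) and insert the rest.

    All columns are stored as TEXT. Header cells and the table name are assumed
    not to contain double quotes, since they are interpolated as identifiers.
    """
    header, *data = rows
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table_name}" ({columns})')

    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table_name}" VALUES ({placeholders})', data)
    conn.commit()
    return len(data)
```

Cell values go through `?` placeholders, so text from the PDF is never interpolated into the SQL statement itself.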

Parse Markdown Tables to Rows

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")
md = doc.to_markdown(0)

# Extract table rows from Markdown
rows = []
for line in md.split("\n"):
    line = line.strip()
    if line.startswith("|") and not line.startswith("|--"):
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        rows.append(cells)

header = rows[0] if rows else []
data = rows[1:] if len(rows) > 1 else []
print(f"Columns: {header}")
for row in data:
    print(row)

WASM

const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdown(0);

const rows = [];
for (const line of md.split("\n")) {
    const trimmed = line.trim();
    if (trimmed.startsWith("|") && !trimmed.startsWith("|--")) {
        const cells = trimmed.split("|").slice(1, -1).map(c => c.trim());
        rows.push(cells);
    }
}

const header = rows[0] || [];
const data = rows.slice(1);
console.log("Columns:", header);
data.forEach(row => console.log(row));
doc.free();

Rust

let mut doc = PdfDocument::open("invoice.pdf")?;
let md = doc.to_markdown(0, false)?;

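let rows: Vec<Vec<String>> = md.lines()
    .map(|l| l.trim())
    .filter(|l| l.starts_with('|') && !l.starts_with("|--"))
    .map(|l| {
        // Strip one leading and one trailing pipe so empty interior cells
        // are kept instead of truncating the row at the first empty cell.
        let inner = l.strip_prefix('|').unwrap_or(l);
        let inner = inner.strip_suffix('|').unwrap_or(inner);
        inner.split('|').map(|c| c.trim().to_string()).collect()
    })
    .collect();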

if let Some(header) = rows.first() {
    println!("Columns: {:?}", header);
    for row in &rows[1..] {
        println!("{:?}", row);
    }
}
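The line-by-line parsers above assume a single table per page, a separator row that starts with `|--`, and no pipe characters inside cells. A slightly more defensive sketch is shown below; it also accepts alignment separators such as `| :--- | ---: |`, handles escaped pipes (`\|`), and returns each table separately. The function name `parse_gfm_tables` is hypothetical, and whether PDF Oxide ever emits these forms is not stated in this guide.

```python
import re

# Separator row: cells made only of dashes with optional alignment colons.
SEPARATOR = re.compile(r"^\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?$")
# Split on pipes that are not escaped with a backslash.
CELL_SPLIT = re.compile(r"(?<!\\)\|")


def parse_gfm_tables(md):
    """Return a list of tables; each table is a list of rows of cell strings."""
    tables, current = [], []
    for line in md.splitlines():
        line = line.strip()
        if line.startswith("|"):
            if SEPARATOR.match(line):
                continue  # skip the header/body delimiter row
            parts = CELL_SPLIT.split(line)[1:]   # drop text before the first pipe
            if parts and parts[-1].strip() == "":
                parts = parts[:-1]               # drop text after the closing pipe
            current.append([p.strip().replace("\\|", "|") for p in parts])
        elif current:
            tables.append(current)               # a non-table line ends the table
            current = []
    if current:
        tables.append(current)
    return tables
```

Each returned table keeps its header as the first row, so `pd.DataFrame(t[1:], columns=t[0])` or `csv.writer(f).writerows(t)` works per table.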

Export to CSV

import csv
from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")
md = doc.to_markdown(0)

rows = []
for line in md.split("\n"):
    line = line.strip()
    if line.startswith("|") and not line.startswith("|--"):
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        rows.append(cells)

with open("table.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(rows)

Export to Pandas DataFrame

import pandas as pd
from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
md = doc.to_markdown(0)

rows = []
for line in md.split("\n"):
    line = line.strip()
    if line.startswith("|") and not line.startswith("|--"):
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        rows.append(cells)

if rows:
    df = pd.DataFrame(rows[1:], columns=rows[0])
    print(df)

Using Character Positions for Custom Table Parsing

For fine-grained control, use character-level extraction and spatial analysis:

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("financial.pdf")
chars = doc.extract_chars(0)

# Group characters by Y position (rows)
rows = {}
for ch in chars:
    row_key = round(ch.y / 2) * 2  # Snap to 2pt grid
    rows.setdefault(row_key, []).append(ch)

# Sort rows top-to-bottom, characters left-to-right
for y in sorted(rows.keys(), reverse=True):
    line_chars = sorted(rows[y], key=lambda c: c.x)
    text = "".join(c.char for c in line_chars)
    print(text)

WASM

const doc = new WasmPdfDocument(bytes);
const chars = doc.extractChars(0);

// Group characters by Y position (rows)
const rows = new Map();
for (const ch of chars) {
    const rowKey = Math.round(ch.y / 2) * 2; // Snap to 2pt grid
    if (!rows.has(rowKey)) rows.set(rowKey, []);
    rows.get(rowKey).push(ch);
}

// Sort rows top-to-bottom, characters left-to-right
const sortedKeys = [...rows.keys()].sort((a, b) => b - a);
for (const y of sortedKeys) {
    const lineChars = rows.get(y).sort((a, b) => a.x - b.x);
    const text = lineChars.map(c => c.char).join("");
    console.log(text);
}
doc.free();

Rust

use std::collections::BTreeMap;

let mut doc = PdfDocument::open("financial.pdf")?;
let chars = doc.extract_chars(0)?;

let mut rows: BTreeMap<i32, Vec<_>> = BTreeMap::new();
for ch in &chars {
    let row_key = ((ch.y / 2.0).round() * 2.0) as i32;
    rows.entry(row_key).or_default().push(ch);
}

for (_, line_chars) in rows.iter().rev() {
    let mut sorted = line_chars.clone();
    sorted.sort_by(|a, b| a.x.partial_cmp(&b.x).unwrap());
    let text: String = sorted.iter().map(|c| c.char).collect();
    println!("{}", text);
}

Go

doc, _ := pdfoxide.Open("financial.pdf")
defer doc.Close()

chars, _ := doc.ExtractChars(0)
rows := map[int][]pdfoxide.Char{}
for _, ch := range chars {
    key := int(math.Round(float64(ch.Y)/2) * 2)
    rows[key] = append(rows[key], ch)
}

keys := make([]int, 0, len(rows))
for k := range rows { keys = append(keys, k) }
sort.Sort(sort.Reverse(sort.IntSlice(keys)))

for _, y := range keys {
    line := rows[y]
    sort.Slice(line, func(i, j int) bool { return line[i].X < line[j].X })
    var b strings.Builder
    for _, c := range line { b.WriteString(c.Char) }
    fmt.Println(b.String())
}

C#

using var doc = PdfDocument.Open("financial.pdf");
var chars = doc.ExtractChars(0);

var rows = chars
    .GroupBy(c => (int)(Math.Round(c.Y / 2) * 2))
    .OrderByDescending(g => g.Key);

foreach (var row in rows)
{
    var line = string.Concat(row.OrderBy(c => c.X).Select(c => c.Char));
    Console.WriteLine(line);
}
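The examples above rebuild each text row but leave it as one string. A possible next step, sketched here in Python, is to split a row into cells wherever the horizontal distance between consecutive characters exceeds a threshold. It relies only on the `x` and `char` attributes used above; the `Char` namedtuple is a stand-in for testing, and the `gap` default of 12 pt is an assumption to tune per document.

```python
from collections import namedtuple

# Stand-in for the character objects returned by extract_chars().
Char = namedtuple("Char", "char x y")


def split_cells(line_chars, gap=12.0):
    """Split one row of characters into cell strings at wide horizontal gaps.

    The gap is measured between the left edges of consecutive characters, so
    `gap` should be comfortably larger than the widest glyph advance.
    """
    cells, current, prev_x = [], [], None
    for ch in sorted(line_chars, key=lambda c: c.x):
        if prev_x is not None and ch.x - prev_x > gap:
            cells.append("".join(current).strip())
            current = []
        current.append(ch.char)
        prev_x = ch.x
    if current:
        cells.append("".join(current).strip())
    return cells
```

If the character objects also expose a width, measuring from the previous character's right edge would be more precise than this left-edge approximation.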

Extract Tables to Markdown

Markdown is the ideal output format when you are feeding PDF content to a large language model, building a RAG pipeline, or storing extracted data in a format that is both human-readable and machine-parseable. PDF Oxide outputs tables in GitHub Flavored Markdown (GFM) format natively, so no additional conversion step is needed.

from pdf_oxide import PdfDocument

doc = PdfDocument("quarterly-report.pdf")

# Extract all tables across all pages as Markdown
all_tables = []
for i in range(doc.page_count()):
    md = doc.to_markdown(i, detect_headings=True)
    # Split the markdown into sections and find table blocks
    in_table = False
    current_table = []
    for line in md.split("\n"):
        if line.strip().startswith("|"):
            in_table = True
            current_table.append(line)
        else:
            if in_table and current_table:
                all_tables.append("\n".join(current_table))
                current_table = []
            in_table = False

    if current_table:
        all_tables.append("\n".join(current_table))

print(f"Found {len(all_tables)} tables")
for idx, table in enumerate(all_tables):
    print(f"\n--- Table {idx + 1} ---")
    print(table)

The GFM table output is directly compatible with LLM prompts. You can pass it straight into an OpenAI or Anthropic API call and the model will understand the tabular structure without any additional formatting:

# Feed extracted table to an LLM for analysis
prompt = f"""Analyze the following financial table and summarize the key trends:

{all_tables[0]}
"""

This approach is significantly faster than extracting tables with pdfplumber and then converting them to Markdown yourself.

Handling Multi-Page Tables

Tables that span multiple pages are a common challenge in PDF extraction. Financial statements, inventory lists, and regulatory filings frequently contain tables that run across two, five, or even dozens of pages. The key insight is that you need to extract the table from each page separately and then stitch the rows together, being careful to handle repeated headers and page artifacts.

from pdf_oxide import PdfDocument

doc = PdfDocument("long-report.pdf")

def extract_table_rows(md_text):
    """Extract table rows from markdown text, returning header and data separately."""
    header = None
    data_rows = []
    for line in md_text.split("\n"):
        line = line.strip()
        if not line.startswith("|") or line.startswith("|--"):
            continue
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        if header is None:
            header = cells
        else:
            data_rows.append(cells)
    return header, data_rows

# Collect rows across all pages
tables = []  # Finished (header, rows) pairs
combined_header = None
combined_rows = []

for i in range(doc.page_count()):
    md = doc.to_markdown(i)
    header, rows = extract_table_rows(md)

    if header is None:
        continue  # No table on this page

    if combined_header is None:
        combined_header = header
    elif header != combined_header:
        # Different table: save the current one and start a new one
        tables.append((combined_header, combined_rows))
        combined_header = header
        combined_rows = []
    # A matching header is a repeat from the previous page, so only its rows are kept

    combined_rows.extend(rows)

if combined_header is not None:
    tables.append((combined_header, combined_rows))

for header, rows in tables:
    print(f"Columns: {header}")
    print(f"Total rows: {len(rows)}")
    for row in rows[:5]:
        print(row)
    if len(rows) > 5:
        print(f"... and {len(rows) - 5} more rows")

This pattern works reliably for tables where the header row is repeated on each page (the most common case). For tables where the header only appears on the first page, you can simplify the logic by only capturing the header from the first page that contains a table and treating all subsequent rows as data.

Export Tables to CSV or DataFrame

Once you have extracted table data, you often need it in a structured format for further analysis. The examples below show how to go from a PDF to a pandas DataFrame or a CSV file in just a few lines.

Batch Export: All Tables to Separate CSV Files

import csv
from pdf_oxide import PdfDocument

doc = PdfDocument("catalog.pdf")
table_count = 0

for i in range(doc.page_count()):
    md = doc.to_markdown(i)
    rows = []
    for line in md.split("\n"):
        line = line.strip()
        if line.startswith("|") and not line.startswith("|--"):
            cells = [cell.strip() for cell in line.split("|")[1:-1]]
            rows.append(cells)

    if len(rows) > 1:  # At least header + one data row
        table_count += 1
        filename = f"table_page{i + 1}_{table_count}.csv"
        with open(filename, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(rows)
        print(f"Saved {filename} ({len(rows) - 1} data rows)")

print(f"Exported {table_count} tables total")

Multi-Page Table to DataFrame

For tables that span multiple pages, combine the multi-page stitching pattern with pandas:

import pandas as pd
from pdf_oxide import PdfDocument

doc = PdfDocument("financial-statement.pdf")

header = None
all_rows = []

for i in range(doc.page_count()):
    md = doc.to_markdown(i)
    for line in md.split("\n"):
        line = line.strip()
        if not line.startswith("|") or line.startswith("|--"):
            continue
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        if header is None:
            header = cells
        elif cells == header:
            continue  # Skip repeated header
        else:
            all_rows.append(cells)

if header and all_rows:
    df = pd.DataFrame(all_rows, columns=header)
    # Clean up numeric columns
    for col in df.columns:
        # Try to convert columns that look numeric
        cleaned = df[col].str.replace(r"[$,%]", "", regex=True).str.strip()
        try:
            df[col] = pd.to_numeric(cleaned)
        except (ValueError, TypeError):
            pass  # Keep as string

    print(df.dtypes)
    print(df.head(10))
    df.to_csv("financial_data.csv", index=False)

This workflow gives you a clean DataFrame with proper numeric types, ready for analysis with pandas, plotting with matplotlib, or loading into a database.

Complex Tables: When to Use pdfplumber

PDF Oxide’s table detection handles standard aligned tables well. For complex cases — merged cells, spanning headers, borderless tables, or multi-line cell content — pdfplumber’s dedicated table extraction algorithms are more robust:

import pdfplumber

with pdfplumber.open("complex-report.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()
    for table in tables:
        for row in table:
            print(row)

When to Use Each

| Scenario | Recommended |
|----------|-------------|
| Simple aligned tables | PDF Oxide (29× faster) |
| Tables as part of full-page Markdown | PDF Oxide |
| Complex merged cells / spanning headers | pdfplumber |
| Borderless tables | pdfplumber |
| Speed-critical batch processing | PDF Oxide |

Use Both Together

Use PDF Oxide for fast text extraction and pdfplumber for complex table extraction:

from pdf_oxide import PdfDocument
import pdfplumber

# Fast full-text extraction
doc = PdfDocument("report.pdf")
text = doc.extract_text(0)

# Targeted table extraction for complex pages
with pdfplumber.open("report.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()