What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Extract Tables from PDF in Python

Extracting tables from PDF documents is one of the most common tasks in document processing pipelines. Whether you are pulling financial data from annual reports, scraping product catalogs, or feeding structured data into an LLM, reliable table extraction is essential. This guide covers everything you need to know about extracting tables from PDFs in Python, from quick one-liners to production-grade workflows for multi-page tables.

Detection engine

PDF Oxide uses the universal edges → snap/merge → intersections → cells → groups table-detection pipeline — the same approach used by Tabula, pdfplumber, and PyMuPDF, implemented in pure Rust.

Detection capabilities:

Intersection-based — finds H×V line crossings, builds cells from four-corner rectangles, groups into tables via union-find.
Extended grid — when horizontal and vertical lines live in different page regions, a virtual grid is built from the Cartesian product of all coordinates.
Column-aware text detection — segments 2-column layouts via X-projection histogram, then runs text-only table detection per column.
H-rule-bounded text tables — detects tables bounded by horizontal rules but no vertical lines (common in academic papers).
Hybrid row detection — infers row boundaries from text Y-positions when only vertical borders exist (invoice line items).
Dotted / dashed line reconstitution — merges short line segments into continuous edges.
Section divider splitting — splits multi-section forms at full-width horizontal dividers.
Edge coverage filtering — removes orphan edges that don’t participate in any potential grid.

Configuration

TableDetectionConfig exposes tunable parameters:

Field	Default	Description
`horizontal_strategy`	`"lines_strict"`	`"lines_strict"`, `"lines"`, `"text"`, or `"explicit"`
`vertical_strategy`	`"lines_strict"`	Same vocabulary
`v_split_gap`	`20.0` pt	Gap between vertical lines that triggers splitting into separate tables (was hardcoded 4pt prior to v0.3.20)
`snap_tolerance`	`3.0` pt	Edge-snap merging tolerance
`text_tolerance`	`3.0` pt	Text-line merging tolerance

Behavior change

From v0.3.20 onwards, the default strategy for Python extract_tables() is Both (detects via both lines and text). Pages that relied on the old Text-only default should pass horizontal_strategy="text" and vertical_strategy="text" explicitly.

The Python binding now correctly reads vertical_strategy from the table_settings dict — previously it was silently ignored.

Rendering

Extracted tables are emitted with space-padded column alignment (replacing the ASCII box-drawing characters from earlier versions). Right-aligns currency and number columns automatically. Form-number prefixes ("1 Apr 11" → "Apr 11") and decorative dash / underscore cells ("------") are stripped during rendering.

Extract table data from a PDF using Markdown conversion:

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)
# Output includes tables in GFM format:
# | Item | Qty | Price |
# |------|-----|-------|
# | Widget | 10 | $9.99 |

WASM

import { WasmPdfDocument } from "pdf-oxide-wasm";

const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdown(0);
console.log(md);
// Output includes tables in GFM format:
// | Item | Qty | Price |
// |------|-----|-------|
// | Widget | 10 | $9.99 |
doc.free();

Rust

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("invoice.pdf")?;
let md = doc.to_markdown(0, true)?;
println!("{}", md);

package main

import (
    "fmt"
    "log"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("invoice.pdf")
    if err != nil { log.Fatal(err) }
    defer doc.Close()

    md, err := doc.ToMarkdown(0)
    if err != nil { log.Fatal(err) }
    fmt.Println(md)
}

using PdfOxide;

using var doc = PdfDocument.Open("invoice.pdf");
Console.WriteLine(doc.ToMarkdown(0));

Java

import fyi.oxide.pdf.PdfDocument;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("invoice.pdf"))) {
    System.out.println(doc.toMarkdown(0));
}

PHP

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('invoice.pdf');
echo $doc->toMarkdown(0);
$doc->close();

Ruby

require 'pdf_oxide'

PdfOxide::PdfDocument.open('invoice.pdf') do |doc|
  puts doc.to_markdown(0)
end

C++

#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>

auto doc = pdf_oxide::Document::open("invoice.pdf");
std::cout << doc.to_markdown(0) << '\n';

Swift

import PdfOxide

let doc = try Document.open("invoice.pdf")
print(try doc.toMarkdown(0))

Kotlin

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(java.nio.file.Path.of("invoice.pdf")).use { doc ->
    println(doc.toMarkdown(0))
}

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('invoice.pdf');
print(doc.toMarkdown(0));
doc.close();

library(pdfoxide)

doc <- pdf_open("invoice.pdf")
cat(pdf_to_markdown(doc, 0))

Julia

using PdfOxide

doc = open_document("invoice.pdf")
println(to_markdown(doc, 0))

Zig

const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("invoice.pdf");
const md = try doc.toMarkdown(a, 0);
defer a.free(md);
std.debug.print("{s}\n", .{md});

Scala

import fyi.oxide.pdf.PdfDocument
import scala.util.Using

Using.resource(PdfDocument.open("invoice.pdf")) { doc =>
  println(doc.toMarkdown(0))
}

Clojure

(require '[pdf-oxide.core :as pdf])

(with-open [d (pdf/open "invoice.pdf")]
  (println (pdf/to-markdown d 0)))

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"invoice.pdf" error:&err];
NSLog(@"%@", [doc toMarkdown:0 error:&err]);

Elixir

{:ok, doc} = PdfOxide.open("invoice.pdf")
{:ok, md} = PdfOxide.to_markdown(doc, 0)
IO.puts(md)

PDF Oxide detects tabular layouts from spatial analysis of aligned text blocks and emits GitHub Flavored Markdown tables.

Why Table Extraction from PDFs Is Challenging

If you have ever tried to copy a table from a PDF and paste it into a spreadsheet, you know the result is usually a mess. That is not a bug in your PDF viewer — it reflects a fundamental limitation of the PDF format itself.

PDFs have no concept of a “table.” Unlike HTML, which uses <table>, <tr>, and <td> tags to define tabular structure, a PDF file stores only drawing instructions: place this glyph at coordinates (x, y), draw a line from point A to point B. There is no semantic layer that says “these characters belong to a cell in row 3, column 2.” Every table extraction library must reconstruct that structure by analyzing the spatial positions of text and lines on the page.

This reconstruction is hard for several reasons:

Bordered vs. borderless tables. When a table has visible grid lines, extraction tools can use those lines as cell boundaries. Borderless tables — common in financial statements, government reports, and academic papers — have no lines at all. The library must infer column boundaries purely from whitespace gaps between text blocks, which is error-prone when columns have variable widths or when numeric values are right-aligned.
Merged cells and spanning headers. A header cell that spans three columns looks like a single wide text block. Without the grid lines to delineate it, a parser has no reliable way to know which columns the header covers. Some libraries handle this well; many silently produce garbled output.
Multi-line cell content. When a cell contains a paragraph of text that wraps to multiple lines, naive row-based parsing treats each wrapped line as a separate row. Correctly grouping those lines back into a single cell requires understanding the vertical extent of each row.
Multi-page tables. Large tables often span two or more pages. The header row may or may not be repeated on each page, and page footers, watermarks, or page numbers may appear between table rows. Stitching these fragments back into a single coherent table requires page-aware logic.
Rotated text and non-standard layouts. Some PDFs use rotated text for column headers, or place tables in multi-column page layouts. These edge cases break assumptions that most parsers make about left-to-right, top-to-bottom reading order.

Understanding these challenges helps you choose the right tool for your specific documents. For straightforward aligned tables — the majority of invoices, order confirmations, and simple reports — a fast spatial analysis approach like PDF Oxide works well. For documents with complex merging, borderless layouts, or unusual formatting, you may need a library with more sophisticated heuristics.

Table Extraction: PDF Oxide vs Other Libraries

Choosing a library for PDF table extraction in Python depends on your documents, your performance requirements, and how you need the output formatted. Here is how the major options compare:

Library	Table Detection	Bordered Tables	Borderless Tables	Output Format	Speed
PDF Oxide	Built-in	Yes	Basic	Markdown/HTML	0.8ms
pdfplumber	Built-in	Yes	Advanced	Python lists	23.2ms
Camelot	Built-in	Yes	Yes (lattice/stream)	DataFrames	~50ms+
PyMuPDF	Basic (v1.23+)	Yes	Limited	DataFrames	4.6ms
pypdf	No	No	No	N/A	N/A
tabula-py	Built-in	Yes	Yes	DataFrames	~100ms+ (Java)

PDF Oxide is the fastest option by a wide margin. It detects tables through spatial analysis of aligned text blocks and outputs clean GitHub Flavored Markdown tables. At 0.8ms mean extraction time, it is 29x faster than pdfplumber and over 100x faster than tabula-py. It handles bordered tables and simple aligned borderless tables well. For LLM pipelines where you need Markdown output anyway, it is the natural choice.

pdfplumber has the most mature borderless table detection. Its find_tables() method uses configurable strategies for detecting rows and columns based on text alignment, and it handles merged cells and multi-line cell content better than most alternatives. The trade-off is speed: at 23.2ms per page, it is significantly slower for batch processing.

Camelot offers two detection modes — lattice (for bordered tables) and stream (for borderless tables). It produces pandas DataFrames directly, which is convenient for data analysis workflows. However, it depends on Ghostscript and OpenCV, making installation heavier, and its speed is the slowest among pure-Python options.

PyMuPDF (fitz) added basic table extraction in version 1.23. It is fast (4.6ms) and works well for simple bordered tables, but its borderless table support is limited compared to pdfplumber or Camelot.

pypdf does not have any table detection capability. It extracts raw text, so you would need to write your own parsing logic to reconstruct table structure.

tabula-py is a Python wrapper around the Java-based Tabula library. It provides good table detection for both bordered and borderless tables, but requires a Java runtime and is the slowest option due to JVM startup overhead. It is best suited for one-off extraction tasks rather than high-throughput pipelines.

For most production use cases, the recommended approach is to use PDF Oxide as your primary extractor for speed and simplicity, and fall back to pdfplumber for the subset of documents with complex table layouts that require advanced heuristics.

Installation

pip install pdf_oxide

Basic Table Extraction

As Markdown Tables

The simplest approach — convert the page to Markdown, which includes tables in GFM syntax:

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
for i in range(doc.page_count()):
    md = doc.to_markdown(i, detect_headings=True)
    if "|" in md:  # Page contains a table
        print(f"--- Page {i + 1} ---")
        print(md)

WASM

const doc = new WasmPdfDocument(bytes);
for (let i = 0; i < doc.pageCount(); i++) {
    const md = doc.toMarkdown(i);
    if (md.includes("|")) { // Page contains a table
        console.log(`--- Page ${i + 1} ---`);
        console.log(md);
    }
}
doc.free();

Rust

let mut doc = PdfDocument::open("report.pdf")?;
for i in 0..doc.page_count()? {
    let md = doc.to_markdown(i, true)?;
    if md.contains("|") {
        println!("--- Page {} ---", i + 1);
        println!("{}", md);
    }
}

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()

n, _ := doc.PageCount()
for i := 0; i < n; i++ {
    md, _ := doc.ToMarkdown(i)
    if strings.Contains(md, "|") {
        fmt.Printf("--- Page %d ---\n%s\n", i+1, md)
    }
}

using var doc = PdfDocument.Open("report.pdf");
for (int i = 0; i < doc.PageCount; i++)
{
    var md = doc.ToMarkdown(i);
    if (md.Contains("|"))
        Console.WriteLine($"--- Page {i + 1} ---\n{md}");
}

Java

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("report.pdf"))) {
    for (int i = 0; i < doc.pageCount(); i++) {
        String md = doc.toMarkdown(i);
        if (md.contains("|")) {  // Page contains a table
            System.out.println("--- Page " + (i + 1) + " ---");
            System.out.println(md);
        }
    }
}

PHP

$doc = PdfDocument::open('report.pdf');
for ($i = 0; $i < $doc->pageCount(); $i++) {
    $md = $doc->toMarkdown($i);
    if (str_contains($md, '|')) {  // Page contains a table
        echo "--- Page " . ($i + 1) . " ---\n";
        echo $md;
    }
}
$doc->close();

Ruby

PdfOxide::PdfDocument.open('report.pdf') do |doc|
  doc.page_count.times do |i|
    md = doc.to_markdown(i)
    if md.include?('|')  # Page contains a table
      puts "--- Page #{i + 1} ---"
      puts md
    end
  end
end

C++

auto doc = pdf_oxide::Document::open("report.pdf");
for (int i = 0; i < doc.page_count(); i++) {
    auto md = doc.to_markdown(i);
    if (md.find('|') != std::string::npos) {  // Page contains a table
        std::cout << "--- Page " << (i + 1) << " ---\n" << md << '\n';
    }
}

Swift

let doc = try Document.open("report.pdf")
for i in 0..<(try doc.pageCount()) {
    let md = try doc.toMarkdown(i)
    if md.contains("|") {  // Page contains a table
        print("--- Page \(i + 1) ---")
        print(md)
    }
}

Kotlin

PdfDocument.open(java.nio.file.Path.of("report.pdf")).use { doc ->
    for (i in 0 until doc.pageCount()) {
        val md = doc.toMarkdown(i)
        if (md.contains("|")) {  // Page contains a table
            println("--- Page ${i + 1} ---")
            println(md)
        }
    }
}

Dart

final doc = PdfDocument.open('report.pdf');
for (var i = 0; i < doc.pageCount; i++) {
  final md = doc.toMarkdown(i);
  if (md.contains('|')) {  // Page contains a table
    print('--- Page ${i + 1} ---');
    print(md);
  }
}
doc.close();

doc <- pdf_open("report.pdf")
for (i in 0:(pdf_page_count(doc) - 1)) {
  md <- pdf_to_markdown(doc, i)
  if (grepl("\\|", md)) {  # Page contains a table
    cat(sprintf("--- Page %d ---\n%s\n", i + 1, md))
  }
}

Julia

doc = open_document("report.pdf")
for i in 0:(page_count(doc) - 1)
    md = to_markdown(doc, i)
    if occursin("|", md)  # Page contains a table
        println("--- Page $(i + 1) ---")
        println(md)
    end
end

Zig

var doc = try pdf_oxide.Document.open("report.pdf");
const n = try doc.pageCount();
var i: i32 = 0;
while (i < n) : (i += 1) {
    const md = try doc.toMarkdown(a, i);
    defer a.free(md);
    if (std.mem.indexOfScalar(u8, md, '|') != null) {  // Page contains a table
        std.debug.print("--- Page {d} ---\n{s}\n", .{ i + 1, md });
    }
}

Scala

Using.resource(PdfDocument.open("report.pdf")) { doc =>
  for (i <- 0 until doc.pageCount()) {
    val md = doc.toMarkdown(i)
    if (md.contains("|")) {  // Page contains a table
      println(s"--- Page ${i + 1} ---")
      println(md)
    }
  }
}

Clojure

(with-open [d (pdf/open "report.pdf")]
  (doseq [i (range (pdf/page-count d))]
    (let [md (pdf/to-markdown d i)]
      (when (.contains md "|")  ; Page contains a table
        (println (str "--- Page " (inc i) " ---"))
        (println md)))))

Objective-C

NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
for (NSInteger i = 0; i < [doc pageCountError:&err]; i++) {
    NSString *md = [doc toMarkdown:i error:&err];
    if ([md containsString:@"|"]) {  // Page contains a table
        NSLog(@"--- Page %ld ---\n%@", (long)(i + 1), md);
    }
}

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")
{:ok, n} = PdfOxide.page_count(doc)
for i <- 0..(n - 1) do
  {:ok, md} = PdfOxide.to_markdown(doc, i)
  if String.contains?(md, "|") do  # Page contains a table
    IO.puts("--- Page #{i + 1} ---")
    IO.puts(md)
  end
end

Structured Table Extraction (v0.3.34)

For typed access to rows and bounding boxes without parsing Markdown, call ExtractTables(pageIndex) (Go, C#) / extract_tables(page) (Python, Rust). Each table exposes structured cells so you can pipe results directly into a database or DataFrame without regex.

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")
for table in doc.extract_tables(0):
    for row in table.rows:
        print(row)

Rust

let mut doc = PdfDocument::open("invoice.pdf")?;
for table in doc.extract_tables(0)? {
    for row in &table.rows {
        println!("{:?}", row);
    }
}

doc, _ := pdfoxide.Open("invoice.pdf")
defer doc.Close()

tables, _ := doc.ExtractTables(0)
for _, t := range tables {
    for _, row := range t.Rows {
        fmt.Println(row)
    }
}

using var doc = PdfDocument.Open("invoice.pdf");
foreach (var table in doc.ExtractTables(0))
    foreach (var row in table.Rows)
        Console.WriteLine(string.Join(" | ", row));

Java

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.table.Table;
import fyi.oxide.pdf.table.TableCell;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("invoice.pdf"))) {
    for (Table table : doc.page(0).tables()) {
        String[][] grid = new String[table.rows()][table.cols()];
        for (TableCell c : table.cells()) grid[c.row()][c.col()] = c.text();
        for (String[] row : grid) System.out.println(String.join(" | ", row));
    }
}

C++

auto doc = pdf_oxide::Document::open("invoice.pdf");
for (const auto& table : doc.extract_tables(0)) {
    for (int r = 0; r < table.row_count; r++) {
        for (int c = 0; c < table.col_count; c++) {
            std::cout << table.cell(r, c);
            if (c + 1 < table.col_count) std::cout << " | ";
        }
        std::cout << '\n';
    }
}

Swift

let doc = try Document.open("invoice.pdf")
for table in try doc.extractTables(0) {
    for r in 0..<table.rowCount {
        let row = (0..<table.colCount).map { table.cell(r, $0) }
        print(row.joined(separator: " | "))
    }
}

Kotlin

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(java.nio.file.Path.of("invoice.pdf")).use { doc ->
    for (table in doc.page(0).tables()) {
        val grid = Array(table.rows()) { arrayOfNulls<String>(table.cols()) }
        table.cells().forEach { grid[it.row()][it.col()] = it.text() }
        grid.forEach { println(it.joinToString(" | ")) }
    }
}

Dart

final doc = PdfDocument.open('invoice.pdf');
for (final table in doc.extractTables(0)) {
  for (var r = 0; r < table.rowCount; r++) {
    final row = [for (var c = 0; c < table.colCount; c++) table.cell(r, c)];
    print(row.join(' | '));
  }
}
doc.close();

doc <- pdf_open("invoice.pdf")
for (table in pdf_extract_tables(doc, 0)) {
  for (r in seq_len(table$row_count)) {
    cat(paste(table$cells[r, ], collapse = " | "), "\n")
  }
}

Julia

doc = open_document("invoice.pdf")
for table in extract_tables(doc, 0)
    for r in 1:table.row_count
        println(join(table.cells[r, :], " | "))
    end
end

Zig

var doc = try pdf_oxide.Document.open("invoice.pdf");
const tables = try doc.extractTables(a, 0);
defer pdf_oxide.Document.freeTables(a, tables);
for (tables) |table| {
    var r: i32 = 0;
    while (r < table.rowCount) : (r += 1) {
        var c: i32 = 0;
        while (c < table.colCount) : (c += 1) {
            std.debug.print("{s}", .{table.cell(r, c)});
            if (c + 1 < table.colCount) std.debug.print(" | ", .{});
        }
        std.debug.print("\n", .{});
    }
}

Scala

import fyi.oxide.pdf.PdfDocument
import scala.jdk.CollectionConverters._
import scala.util.Using

Using.resource(PdfDocument.open("invoice.pdf")) { doc =>
  for (table <- doc.page(0).tables().asScala) {
    val grid = Array.ofDim[String](table.rows(), table.cols())
    table.cells().asScala.foreach(c => grid(c.row())(c.col()) = c.text())
    grid.foreach(row => println(row.mkString(" | ")))
  }
}

Clojure

(with-open [d (pdf/open "invoice.pdf")]
  (doseq [table (pdf/tables (pdf/page d 0))]
    (let [grid (make-array String (.rows table) (.cols table))]
      (doseq [c (.cells table)]
        (aset grid (.row c) (.col c) (.text c)))
      (doseq [row grid]
        (println (clojure.string/join " | " row))))))

Objective-C

NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"invoice.pdf" error:&err];
for (POXTable *table in [doc extractTables:0 error:&err]) {
    for (NSInteger r = 0; r < table.rowCount; r++) {
        NSMutableArray<NSString *> *row = [NSMutableArray array];
        for (NSInteger c = 0; c < table.colCount; c++)
            [row addObject:([table cellTextAtRow:r col:c] ?: @"")];
        NSLog(@"%@", [row componentsJoinedByString:@" | "]);
    }
}

Elixir

{:ok, doc} = PdfOxide.open("invoice.pdf")
{:ok, tables} = PdfOxide.extract_tables(doc, 0)
for table <- tables do
  for r <- 0..(table.row_count - 1) do
    row = for c <- 0..(table.col_count - 1), do: PdfOxide.cell(table, r, c)
    IO.puts(Enum.join(row, " | "))
  end
end

Parse Markdown Tables to Rows

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")
md = doc.to_markdown(0)

# Extract table rows from Markdown
rows = []
for line in md.split("\n"):
    line = line.strip()
    if line.startswith("|") and not line.startswith("|--"):
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        rows.append(cells)

header = rows[0] if rows else []
data = rows[1:] if len(rows) > 1 else []
print(f"Columns: {header}")
for row in data:
    print(row)

WASM

const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdown(0);

const rows = [];
for (const line of md.split("\n")) {
    const trimmed = line.trim();
    if (trimmed.startsWith("|") && !trimmed.startsWith("|--")) {
        const cells = trimmed.split("|").slice(1, -1).map(c => c.trim());
        rows.push(cells);
    }
}

const header = rows[0] || [];
const data = rows.slice(1);
console.log("Columns:", header);
data.forEach(row => console.log(row));
doc.free();

Rust

let mut doc = PdfDocument::open("invoice.pdf")?;
let md = doc.to_markdown(0, false)?;

let rows: Vec<Vec<String>> = md.lines()
    .map(|l| l.trim())
    .filter(|l| l.starts_with('|') && !l.starts_with("|--"))
    .map(|l| l.split('|').skip(1).map(|c| c.trim().to_string())
        .take_while(|c| !c.is_empty()).collect())
    .collect();

if let Some(header) = rows.first() {
    println!("Columns: {:?}", header);
    for row in &rows[1..] {
        println!("{:?}", row);
    }
}

Java

import fyi.oxide.pdf.PdfDocument;
import java.util.*;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("invoice.pdf"))) {
    String md = doc.toMarkdown(0);
    List<List<String>> rows = new ArrayList<>();
    for (String line : md.split("\n")) {
        line = line.strip();
        if (line.startsWith("|") && !line.startsWith("|--")) {
            String[] parts = line.substring(1, line.length() - 1).split("\\|");
            List<String> cells = new ArrayList<>();
            for (String p : parts) cells.add(p.strip());
            rows.add(cells);
        }
    }
    System.out.println("Columns: " + (rows.isEmpty() ? List.of() : rows.get(0)));
    for (int i = 1; i < rows.size(); i++) System.out.println(rows.get(i));
}

PHP

$doc = PdfDocument::open('invoice.pdf');
$md = $doc->toMarkdown(0);

$rows = [];
foreach (explode("\n", $md) as $line) {
    $line = trim($line);
    if (str_starts_with($line, '|') && !str_starts_with($line, '|--')) {
        $cells = array_map('trim', array_slice(explode('|', $line), 1, -1));
        $rows[] = $cells;
    }
}
$header = $rows[0] ?? [];
echo "Columns: " . implode(', ', $header) . "\n";
foreach (array_slice($rows, 1) as $row) {
    echo implode(' | ', $row) . "\n";
}
$doc->close();

Ruby

PdfOxide::PdfDocument.open('invoice.pdf') do |doc|
  md = doc.to_markdown(0)

  rows = md.lines.map(&:strip)
            .select { |l| l.start_with?('|') && !l.start_with?('|--') }
            .map { |l| l.split('|')[1..-2].map(&:strip) }

  header = rows.first || []
  puts "Columns: #{header.inspect}"
  rows.drop(1).each { |row| puts row.inspect }
end

C++

#include <pdf_oxide/pdf_oxide.hpp>
#include <sstream>
#include <vector>

auto doc = pdf_oxide::Document::open("invoice.pdf");
auto md = doc.to_markdown(0);

std::vector<std::vector<std::string>> rows;
std::istringstream stream(md);
for (std::string line; std::getline(stream, line);) {
    auto s = line.find_first_not_of(" \t");
    if (s == std::string::npos) continue;
    line = line.substr(s);
    if (line.rfind("|", 0) != 0 || line.rfind("|--", 0) == 0) continue;
    std::vector<std::string> cells;
    std::istringstream cs(line.substr(1, line.size() - 2));
    for (std::string cell; std::getline(cs, cell, '|');) cells.push_back(cell);
    rows.push_back(cells);
}

Swift

let doc = try Document.open("invoice.pdf")
let md = try doc.toMarkdown(0)

let rows = md.split(separator: "\n").map { $0.trimmingCharacters(in: .whitespaces) }
    .filter { $0.hasPrefix("|") && !$0.hasPrefix("|--") }
    .map { line -> [String] in
        line.dropFirst().dropLast().split(separator: "|", omittingEmptySubsequences: false)
            .map { $0.trimmingCharacters(in: .whitespaces) }
    }

if let header = rows.first {
    print("Columns:", header)
    for row in rows.dropFirst() { print(row) }
}

Kotlin

PdfDocument.open(java.nio.file.Path.of("invoice.pdf")).use { doc ->
    val md = doc.toMarkdown(0)
    val rows = md.split("\n").map { it.trim() }
        .filter { it.startsWith("|") && !it.startsWith("|--") }
        .map { it.removeSurrounding("|").split("|").map(String::trim) }

    rows.firstOrNull()?.let { println("Columns: $it") }
    rows.drop(1).forEach { println(it) }
}

Dart

final doc = PdfDocument.open('invoice.pdf');
final md = doc.toMarkdown(0);

final rows = md.split('\n').map((l) => l.trim())
    .where((l) => l.startsWith('|') && !l.startsWith('|--'))
    .map((l) => l.substring(1, l.length - 1).split('|').map((c) => c.trim()).toList())
    .toList();

if (rows.isNotEmpty) {
  print('Columns: ${rows.first}');
  for (final row in rows.skip(1)) print(row);
}
doc.close();

doc <- pdf_open("invoice.pdf")
md  <- pdf_to_markdown(doc, 0)

lines <- trimws(strsplit(md, "\n")[[1]])
lines <- lines[startsWith(lines, "|") & !startsWith(lines, "|--")]
rows  <- lapply(lines, function(l) {
  cells <- strsplit(l, "\\|")[[1]]
  trimws(cells[2:(length(cells) - 1)])
})

if (length(rows) > 0) {
  cat("Columns:", rows[[1]], "\n")
  for (row in rows[-1]) cat(row, "\n")
}

Julia

doc = open_document("invoice.pdf")
md  = to_markdown(doc, 0)

rows = [strip.(split(l, "|")[2:end-1])
        for l in strip.(split(md, "\n"))
        if startswith(l, "|") && !startswith(l, "|--")]

if !isempty(rows)
    println("Columns: ", rows[1])
    for row in rows[2:end]
        println(row)
    end
end

Zig

var doc = try pdf_oxide.Document.open("invoice.pdf");
const md = try doc.toMarkdown(a, 0);
defer a.free(md);

var lines = std.mem.splitScalar(u8, md, '\n');
while (lines.next()) |raw| {
    const line = std.mem.trim(u8, raw, " \t\r");
    if (!std.mem.startsWith(u8, line, "|") or std.mem.startsWith(u8, line, "|--")) continue;
    const inner = line[1 .. line.len - 1];
    var cells = std.mem.splitScalar(u8, inner, '|');
    while (cells.next()) |cell| {
        std.debug.print("{s}\t", .{std.mem.trim(u8, cell, " \t")});
    }
    std.debug.print("\n", .{});
}

Scala

Using.resource(PdfDocument.open("invoice.pdf")) { doc =>
  val md = doc.toMarkdown(0)
  val rows = md.split("\n").map(_.trim)
    .filter(l => l.startsWith("|") && !l.startsWith("|--"))
    .map(_.stripPrefix("|").stripSuffix("|").split("\\|").map(_.trim).toList)
    .toList

  rows.headOption.foreach(h => println(s"Columns: $h"))
  rows.drop(1).foreach(println)
}

Clojure

(with-open [d (pdf/open "invoice.pdf")]
  (let [md   (pdf/to-markdown d 0)
        rows (->> (clojure.string/split-lines md)
                  (map clojure.string/trim)
                  (filter #(and (.startsWith % "|") (not (.startsWith % "|--"))))
                  (map #(->> (clojure.string/split % #"\|")
                             (drop 1) (butlast) (map clojure.string/trim) vec)))]
    (when-let [header (first rows)]
      (println "Columns:" header)
      (doseq [row (rest rows)] (println row)))))

Objective-C

NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"invoice.pdf" error:&err];
NSString *md = [doc toMarkdown:0 error:&err];

NSMutableArray<NSArray<NSString *> *> *rows = [NSMutableArray array];
for (NSString *raw in [md componentsSeparatedByString:@"\n"]) {
    NSString *line = [raw stringByTrimmingCharactersInSet:
        [NSCharacterSet whitespaceCharacterSet]];
    if (![line hasPrefix:@"|"] || [line hasPrefix:@"|--"]) continue;
    NSArray<NSString *> *parts = [line componentsSeparatedByString:@"|"];
    NSMutableArray<NSString *> *cells = [NSMutableArray array];
    for (NSUInteger i = 1; i + 1 < parts.count; i++)
        [cells addObject:[parts[i] stringByTrimmingCharactersInSet:
            [NSCharacterSet whitespaceCharacterSet]]];
    [rows addObject:cells];
}
if (rows.count > 0) NSLog(@"Columns: %@", rows[0]);

Elixir

{:ok, doc} = PdfOxide.open("invoice.pdf")
{:ok, md} = PdfOxide.to_markdown(doc, 0)

rows =
  md
  |> String.split("\n")
  |> Enum.map(&String.trim/1)
  |> Enum.filter(&(String.starts_with?(&1, "|") and not String.starts_with?(&1, "|--")))
  |> Enum.map(fn line ->
    line |> String.split("|") |> Enum.slice(1..-2//1) |> Enum.map(&String.trim/1)
  end)

case rows do
  [header | data] ->
    IO.puts("Columns: #{inspect(header)}")
    Enum.each(data, &IO.inspect/1)
  [] ->
    :ok
end

Export to CSV

import csv
from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")
md = doc.to_markdown(0)

rows = []
for line in md.split("\n"):
    line = line.strip()
    if line.startswith("|") and not line.startswith("|--"):
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        rows.append(cells)

with open("table.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(rows)

Export to Pandas DataFrame

import pandas as pd
from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
md = doc.to_markdown(0)

rows = []
for line in md.split("\n"):
    line = line.strip()
    if line.startswith("|") and not line.startswith("|--"):
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        rows.append(cells)

if rows:
    df = pd.DataFrame(rows[1:], columns=rows[0])
    print(df)

Using Character Positions for Custom Table Parsing

For fine-grained control, use character-level extraction and spatial analysis:

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("financial.pdf")
chars = doc.extract_chars(0)

# Group characters by Y position (rows)
rows = {}
for ch in chars:
    row_key = round(ch.y / 2) * 2  # Snap to 2pt grid
    rows.setdefault(row_key, []).append(ch)

# Sort rows top-to-bottom, characters left-to-right
for y in sorted(rows.keys(), reverse=True):
    line_chars = sorted(rows[y], key=lambda c: c.x)
    text = "".join(c.char for c in line_chars)
    print(text)

WASM

const doc = new WasmPdfDocument(bytes);
const chars = doc.extractChars(0);

// Group characters by Y position (rows)
const rows = new Map();
for (const ch of chars) {
    const rowKey = Math.round(ch.y / 2) * 2; // Snap to 2pt grid
    if (!rows.has(rowKey)) rows.set(rowKey, []);
    rows.get(rowKey).push(ch);
}

// Sort rows top-to-bottom, characters left-to-right
const sortedKeys = [...rows.keys()].sort((a, b) => b - a);
for (const y of sortedKeys) {
    const lineChars = rows.get(y).sort((a, b) => a.x - b.x);
    const text = lineChars.map(c => c.char).join("");
    console.log(text);
}
doc.free();

Rust

use std::collections::BTreeMap;

let mut doc = PdfDocument::open("financial.pdf")?;
let chars = doc.extract_chars(0)?;

let mut rows: BTreeMap<i32, Vec<_>> = BTreeMap::new();
for ch in &chars {
    let row_key = ((ch.y / 2.0).round() * 2.0) as i32;
    rows.entry(row_key).or_default().push(ch);
}

for (_, line_chars) in rows.iter().rev() {
    let mut sorted = line_chars.clone();
    sorted.sort_by(|a, b| a.x.partial_cmp(&b.x).unwrap());
    let text: String = sorted.iter().map(|c| c.char).collect();
    println!("{}", text);
}

doc, _ := pdfoxide.Open("financial.pdf")
defer doc.Close()

chars, _ := doc.ExtractChars(0)
rows := map[int][]pdfoxide.Char{}
for _, ch := range chars {
    key := int(math.Round(float64(ch.Y)/2) * 2)
    rows[key] = append(rows[key], ch)
}

keys := make([]int, 0, len(rows))
for k := range rows { keys = append(keys, k) }
sort.Sort(sort.Reverse(sort.IntSlice(keys)))

for _, y := range keys {
    line := rows[y]
    sort.Slice(line, func(i, j int) bool { return line[i].X < line[j].X })
    var b strings.Builder
    for _, c := range line { b.WriteString(c.Char) }
    fmt.Println(b.String())
}

using var doc = PdfDocument.Open("financial.pdf");
var chars = doc.ExtractChars(0);

var rows = chars
    .GroupBy(c => (int)(Math.Round(c.Y / 2) * 2))
    .OrderByDescending(g => g.Key);

foreach (var row in rows)
{
    var line = string.Concat(row.OrderBy(c => c.X).Select(c => c.Char));
    Console.WriteLine(line);
}

C++

#include <pdf_oxide/pdf_oxide.hpp>
#include <map>
#include <vector>
#include <algorithm>

auto doc = pdf_oxide::Document::open("financial.pdf");
auto chars = doc.extract_chars(0);

// Group characters by Y position (rows)
std::map<int, std::vector<pdf_oxide::Char>> rows;
for (const auto& ch : chars) {
    int key = static_cast<int>(std::lround(ch.bbox.y / 2.0) * 2);
    rows[key].push_back(ch);
}

// Sort rows top-to-bottom, characters left-to-right
for (auto it = rows.rbegin(); it != rows.rend(); ++it) {
    auto& line = it->second;
    std::sort(line.begin(), line.end(),
              [](const auto& a, const auto& b) { return a.bbox.x < b.bbox.x; });
    std::string text;
    for (const auto& c : line) text += static_cast<char>(c.character);
    std::cout << text << '\n';
}

Swift

let doc = try Document.open("financial.pdf")
let chars = try doc.extractChars(0)

// Group characters by Y position (rows)
var rows: [Int: [Char]] = [:]
for ch in chars {
    let key = Int((ch.bbox.y / 2).rounded()) * 2  // Snap to 2pt grid
    rows[key, default: []].append(ch)
}

// Sort rows top-to-bottom, characters left-to-right
for y in rows.keys.sorted(by: >) {
    let line = rows[y]!.sorted { $0.bbox.x < $1.bbox.x }
    let text = String(line.compactMap { Unicode.Scalar($0.character).map(Character.init) })
    print(text)
}

Dart

final doc = PdfDocument.open('financial.pdf');
final chars = doc.extractChars(0);

// Group characters by Y position (rows)
final rows = <int, List<Char>>{};
for (final ch in chars) {
  final key = (ch.bbox.y / 2).round() * 2; // Snap to 2pt grid
  rows.putIfAbsent(key, () => []).add(ch);
}

// Sort rows top-to-bottom, characters left-to-right
final keys = rows.keys.toList()..sort((a, b) => b - a);
for (final y in keys) {
  final line = rows[y]!..sort((a, b) => a.bbox.x.compareTo(b.bbox.x));
  final text = String.fromCharCodes(line.map((c) => c.character));
  print(text);
}
doc.close();

doc   <- pdf_open("financial.pdf")
chars <- pdf_extract_chars(doc, 0)

# Group characters by Y position (rows), snapped to a 2pt grid
keys <- sapply(chars, function(ch) round(ch$bbox$y / 2) * 2)
for (y in sort(unique(keys), decreasing = TRUE)) {
  line <- chars[keys == y]
  line <- line[order(sapply(line, function(c) c$bbox$x))]
  text <- paste(intToUtf8(sapply(line, function(c) c$character), multiple = TRUE),
                collapse = "")
  cat(text, "\n")
}

Julia

doc   = open_document("financial.pdf")
chars = extract_chars(doc, 0)

# Group characters by Y position (rows), snapped to a 2pt grid
rows = Dict{Int,Vector}()
for ch in chars
    key = round(Int, ch.bbox.y / 2) * 2
    push!(get!(rows, key, []), ch)
end

for y in sort(collect(keys(rows)), rev = true)
    line = sort(rows[y], by = c -> c.bbox.x)
    text = join(Char.(getfield.(line, :character)))
    println(text)
end

Zig

var doc = try pdf_oxide.Document.open("financial.pdf");
const chars = try doc.extractChars(a, 0);
defer pdf_oxide.Document.freeChars(a, chars);

// Group characters by Y position (rows)
var rows = std.AutoArrayHashMap(i32, std.ArrayList(pdf_oxide.Char)).init(a);
for (chars) |ch| {
    const key: i32 = @intFromFloat(@round(ch.bbox.y / 2.0) * 2.0);
    const gop = try rows.getOrPut(key);
    if (!gop.found_existing) gop.value_ptr.* = std.ArrayList(pdf_oxide.Char).init(a);
    try gop.value_ptr.append(ch);
}

// Sort keys descending (top-to-bottom), characters left-to-right
const keys = rows.keys();
std.mem.sort(i32, keys, {}, comptime std.sort.desc(i32));
for (keys) |y| {
    var line = rows.get(y).?;
    std.mem.sort(pdf_oxide.Char, line.items, {}, struct {
        fn lt(_: void, x: pdf_oxide.Char, z: pdf_oxide.Char) bool { return x.bbox.x < z.bbox.x; }
    }.lt);
    for (line.items) |c| {
        var buf: [4]u8 = undefined;
        const len = std.unicode.utf8Encode(@intCast(c.character), &buf) catch 0;
        std.debug.print("{s}", .{buf[0..len]});
    }
    std.debug.print("\n", .{});
}

Objective-C

NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"financial.pdf" error:&err];
NSArray<POXChar *> *chars = [doc extractChars:0 error:&err];

// Group characters by Y position (rows)
NSMutableDictionary<NSNumber *, NSMutableArray<POXChar *> *> *rows =
    [NSMutableDictionary dictionary];
for (POXChar *ch in chars) {
    NSNumber *key = @((NSInteger)(round(ch.bbox.y / 2.0) * 2));
    if (!rows[key]) rows[key] = [NSMutableArray array];
    [rows[key] addObject:ch];
}

// Sort rows top-to-bottom, characters left-to-right
NSArray<NSNumber *> *keys = [[rows allKeys]
    sortedArrayUsingSelector:@selector(compare:)];
for (NSNumber *y in [keys reverseObjectEnumerator]) {
    NSArray<POXChar *> *line = [rows[y] sortedArrayUsingComparator:
        ^(POXChar *x, POXChar *z) { return [@(x.bbox.x) compare:@(z.bbox.x)]; }];
    NSMutableString *text = [NSMutableString string];
    for (POXChar *c in line)
        [text appendString:[[NSString alloc] initWithBytes:&(uint32_t){c.character}
            length:4 encoding:NSUTF32LittleEndianStringEncoding]];
    NSLog(@"%@", text);
}

Elixir

{:ok, doc} = PdfOxide.open("financial.pdf")
{:ok, chars} = PdfOxide.extract_chars(doc, 0)

# Group characters by Y position (rows), snapped to a 2pt grid
chars
|> Enum.group_by(fn ch -> round(ch.bbox.y / 2) * 2 end)
|> Enum.sort_by(fn {y, _} -> -y end)
|> Enum.each(fn {_y, line} ->
  text =
    line
    |> Enum.sort_by(fn c -> c.bbox.x end)
    |> Enum.map(fn c -> <<c.character::utf8>> end)
    |> Enum.join()

  IO.puts(text)
end)

Extract Tables to Markdown

Markdown is the ideal output format when you are feeding PDF content to a large language model, building a RAG pipeline, or storing extracted data in a format that is both human-readable and machine-parseable. PDF Oxide outputs tables in GitHub Flavored Markdown (GFM) format natively, so no additional conversion step is needed.

from pdf_oxide import PdfDocument

doc = PdfDocument("quarterly-report.pdf")

# Extract all tables across all pages as Markdown
all_tables = []
for i in range(doc.page_count()):
    md = doc.to_markdown(i, detect_headings=True)
    # Split the markdown into sections and find table blocks
    in_table = False
    current_table = []
    for line in md.split("\n"):
        if line.strip().startswith("|"):
            in_table = True
            current_table.append(line)
        else:
            if in_table and current_table:
                all_tables.append("\n".join(current_table))
                current_table = []
            in_table = False

    if current_table:
        all_tables.append("\n".join(current_table))

print(f"Found {len(all_tables)} tables")
for idx, table in enumerate(all_tables):
    print(f"\n--- Table {idx + 1} ---")
    print(table)

The GFM table output is directly compatible with LLM prompts. You can pass it straight into an OpenAI or Anthropic API call and the model will understand the tabular structure without any additional formatting:

# Feed extracted table to an LLM for analysis
prompt = f"""Analyze the following financial table and summarize the key trends:

{all_tables[0]}
"""

This approach is significantly faster than extracting tables with pdfplumber and then converting them to Markdown yourself.

Handling Multi-Page Tables

Tables that span multiple pages are a common challenge in PDF extraction. Financial statements, inventory lists, and regulatory filings frequently contain tables that run across two, five, or even dozens of pages. The key insight is that you need to extract the table from each page separately and then stitch the rows together, being careful to handle repeated headers and page artifacts.

from pdf_oxide import PdfDocument

doc = PdfDocument("long-report.pdf")

def extract_table_rows(md_text):
    """Extract table rows from markdown text, returning header and data separately."""
    header = None
    data_rows = []
    for line in md_text.split("\n"):
        line = line.strip()
        if not line.startswith("|") or line.startswith("|--"):
            continue
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        if header is None:
            header = cells
        else:
            data_rows.append(cells)
    return header, data_rows

# Collect rows across all pages
combined_header = None
combined_rows = []

for i in range(doc.page_count()):
    md = doc.to_markdown(i)
    header, rows = extract_table_rows(md)

    if header is None:
        continue  # No table on this page

    if combined_header is None:
        combined_header = header
    elif header == combined_header:
        pass  # Skip repeated header on subsequent pages
    else:
        # Different table — save current and start new
        print(f"Table with {len(combined_rows)} rows found")
        combined_header = header
        combined_rows = []

    combined_rows.extend(rows)

if combined_header and combined_rows:
    print(f"Columns: {combined_header}")
    print(f"Total rows: {len(combined_rows)}")
    for row in combined_rows[:5]:
        print(row)
    if len(combined_rows) > 5:
        print(f"... and {len(combined_rows) - 5} more rows")

This pattern works reliably for tables where the header row is repeated on each page (the most common case). For tables where the header only appears on the first page, you can simplify the logic by only capturing the header from the first page that contains a table and treating all subsequent rows as data.

Export Tables to CSV or DataFrame

Once you have extracted table data, you often need it in a structured format for further analysis. The examples below show how to go from a PDF to a pandas DataFrame or a CSV file in just a few lines.

Batch Export: All Tables to Separate CSV Files

import csv
from pdf_oxide import PdfDocument

doc = PdfDocument("catalog.pdf")
table_count = 0

for i in range(doc.page_count()):
    md = doc.to_markdown(i)
    rows = []
    for line in md.split("\n"):
        line = line.strip()
        if line.startswith("|") and not line.startswith("|--"):
            cells = [cell.strip() for cell in line.split("|")[1:-1]]
            rows.append(cells)

    if len(rows) > 1:  # At least header + one data row
        table_count += 1
        filename = f"table_page{i + 1}_{table_count}.csv"
        with open(filename, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(rows)
        print(f"Saved {filename} ({len(rows) - 1} data rows)")

print(f"Exported {table_count} tables total")

Multi-Page Table to DataFrame

For tables that span multiple pages, combine the multi-page stitching pattern with pandas:

import pandas as pd
from pdf_oxide import PdfDocument

doc = PdfDocument("financial-statement.pdf")

header = None
all_rows = []

for i in range(doc.page_count()):
    md = doc.to_markdown(i)
    for line in md.split("\n"):
        line = line.strip()
        if not line.startswith("|") or line.startswith("|--"):
            continue
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        if header is None:
            header = cells
        elif cells == header:
            continue  # Skip repeated header
        else:
            all_rows.append(cells)

if header and all_rows:
    df = pd.DataFrame(all_rows, columns=header)
    # Clean up numeric columns
    for col in df.columns:
        # Try to convert columns that look numeric
        cleaned = df[col].str.replace(r"[$,%]", "", regex=True).str.strip()
        try:
            df[col] = pd.to_numeric(cleaned)
        except (ValueError, TypeError):
            pass  # Keep as string

    print(df.dtypes)
    print(df.head(10))
    df.to_csv("financial_data.csv", index=False)

This workflow gives you a clean DataFrame with proper numeric types, ready for analysis with pandas, plotting with matplotlib, or loading into a database.

Complex Tables: When to Use pdfplumber

PDF Oxide’s table detection handles standard aligned tables well. For complex cases — merged cells, spanning headers, borderless tables, or multi-line cell content — pdfplumber’s dedicated table extraction algorithms are more robust:

import pdfplumber

with pdfplumber.open("complex-report.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()
    for table in tables:
        for row in table:
            print(row)

When to Use Each

Scenario	Recommended
Simple aligned tables	PDF Oxide (29× faster)
Tables as part of full-page Markdown	PDF Oxide
Complex merged cells / spanning headers	pdfplumber
Borderless tables	pdfplumber
Speed-critical batch processing	PDF Oxide

Use Both Together

Fast text extraction with PDF Oxide, complex table extraction with pdfplumber:

from pdf_oxide import PdfDocument
import pdfplumber

# Fast full-text extraction
doc = PdfDocument("report.pdf")
text = doc.extract_text(0)

# Targeted table extraction for complex pages
with pdfplumber.open("report.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()

Markdown Conversion — full Markdown API reference
Text Extraction — plain text and character extraction
PDF Oxide vs pdfplumber — detailed comparison
PDF to Markdown — Markdown conversion guide

Extract Tables from PDF in Python

Detection engine

Configuration

Behavior change

Rendering

Why Table Extraction from PDFs Is Challenging

Table Extraction: PDF Oxide vs Other Libraries

Installation

Basic Table Extraction

As Markdown Tables

Structured Table Extraction (v0.3.34)

Parse Markdown Tables to Rows

Export to CSV

Export to Pandas DataFrame

Using Character Positions for Custom Table Parsing

Extract Tables to Markdown

Handling Multi-Page Tables

Export Tables to CSV or DataFrame

Batch Export: All Tables to Separate CSV Files

Multi-Page Table to DataFrame

Complex Tables: When to Use pdfplumber

When to Use Each

Use Both Together

Related Pages