Extract Text from PDF in Python

PDF text extraction is one of the most common tasks in document processing pipelines, from building search indexes and feeding RAG systems to data mining and compliance workflows. This guide shows how to extract text from PDFs with PDF Oxide, with examples in Python, JavaScript (WASM), Rust, Go, and C#. It covers plain text extraction, character-level positioning, styled spans, OCR for scanned documents, encrypted file handling, and performance tuning for batch pipelines.

Extract text from any PDF in three lines:

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("document.pdf")
text = doc.extract_text(0)  # page 0
print(text)

WASM

import { WasmPdfDocument } from "pdf-oxide-wasm";

const bytes = new Uint8Array(buffer);
const doc = new WasmPdfDocument(bytes);
const text = doc.extractText(0); // page 0
console.log(text);
doc.free();

Rust

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("document.pdf")?;
let text = doc.extract_text(0)?;
println!("{}", text);

Go

package main

import (
    "fmt"
    "log"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("document.pdf")
    if err != nil { log.Fatal(err) }
    defer doc.Close()

    text, err := doc.ExtractText(0) // page 0
    if err != nil { log.Fatal(err) }
    fmt.Println(text)
}

C#

using PdfOxide;

using var doc = PdfDocument.Open("document.pdf");
var text = doc.ExtractText(0); // page 0
Console.WriteLine(text);

PDF Oxide extracts text at 0.8ms mean per page — 5× faster than PyMuPDF, 15× faster than pypdf — with a 100% pass rate on 3,830 test PDFs.

Why PDF Text Extraction Is Hard

PDF is a visual format, not a text format. Unlike HTML or Markdown, a PDF file does not store “paragraphs” or “sentences” — it stores individual characters positioned at specific coordinates on a page. Extracting readable text requires:

  • Font decoding — PDF fonts map character codes to glyphs using encoding tables (WinAnsi, MacRoman, Unicode CMaps, Type 1, TrueType, CIDFont). A character code of 0x41 might mean “A” in one font and “α” in another.
  • Text stream parsing — Text operators like Tj, TJ, ', " place characters on the page. Kerning adjustments in TJ arrays shift characters by fractions of a point. Missing spaces must be inferred from gaps between character positions.
  • Layout reconstruction — Characters on a page have no explicit reading order. Two-column layouts, headers, footers, tables, and sidebars must be spatially analyzed to produce a linear text flow.
  • Encoding edge cases — CJK text (Chinese, Japanese, Korean) uses CIDFont/CMap encoding with thousands of glyphs. Arabic and Hebrew require right-to-left reordering. Ligatures (fi, fl, ffi) must be decomposed.
  • Embedded subsets — Many PDFs embed only the glyphs they use, with custom encoding vectors. A font might map glyph index 1→“T”, 2→“h”, 3→“e” with no standard encoding.

This is why different PDF libraries produce different text output for the same file — and why some fail entirely on complex documents. PDF Oxide handles all of these cases with a Rust-based parser that has been tested on 3,830 real-world PDFs with a 100% pass rate.
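Two of the steps above, font decoding and ligature decomposition, can be illustrated with a small sketch. The two encoding tables are made up for this example and do not come from any real font. The ligature step uses Unicode NFKC normalization from the standard library, which is one possible approach and not necessarily what PDF Oxide does internally.

```python
import unicodedata
from typing import Dict

# Hypothetical per-font encoding tables: the same character code
# decodes to different text depending on the font in use.
LATIN_FONT: Dict[int, str] = {0x41: "A", 0x42: "B"}
GREEK_FONT: Dict[int, str] = {0x41: "α", 0x42: "β"}


def decode(codes: bytes, table: Dict[int, str]) -> str:
    """Map each character code through a font table; U+FFFD marks unmapped codes."""
    return "".join(table.get(code, "\ufffd") for code in codes)


def decompose_ligatures(text: str) -> str:
    """Expand compatibility ligatures such as U+FB01 (fi) and U+FB03 (ffi).

    NFKC also folds other compatibility characters (for example full-width
    forms), so a real pipeline may prefer a narrower mapping.
    """
    return unicodedata.normalize("NFKC", text)


print(decode(b"AB", LATIN_FONT))
print(decode(b"AB", GREEK_FONT))
print(decompose_ligatures("\ufb01nal o\ufb03ce"))
```

The same two bytes produce different strings under the two tables, which is why a parser that ignores the font's encoding yields garbled output.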

Installation

Python (PyPI):

pip install pdf_oxide

Pre-built wheels for Linux (x86_64, aarch64), macOS (Intel and Apple Silicon), and Windows (x86_64). Python 3.8+. No system dependencies — the Rust core is compiled into the wheel, so there is no need to install Poppler, MuPDF, or any C libraries.

JavaScript (npm):

npm install pdf-oxide-wasm

Works in Node.js 18+ and modern browsers. The WASM binary is bundled in the package.

Rust (Cargo):

cargo add pdf_oxide

Requires Rust 1.70+. No system dependencies beyond a standard Rust toolchain.

Extract All Pages

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
full_text = []
for i in range(doc.page_count()):
    text = doc.extract_text(i)
    full_text.append(text)

print("\n".join(full_text))

WASM

const doc = new WasmPdfDocument(bytes);
const fullText = doc.extractAllText();
console.log(fullText);
doc.free();

Rust

let mut doc = PdfDocument::open("report.pdf")?;
let mut full_text = Vec::new();
for i in 0..doc.page_count()? {
    full_text.push(doc.extract_text(i)?);
}
println!("{}", full_text.join("\n"));

Go

doc, err := pdfoxide.Open("report.pdf")
if err != nil { log.Fatal(err) }
defer doc.Close()

full, err := doc.ExtractAllText()
if err != nil { log.Fatal(err) }
fmt.Println(full)

C#

using var doc = PdfDocument.Open("report.pdf");
var parts = new List<string>();
for (int i = 0; i < doc.PageCount; i++)
    parts.Add(doc.ExtractText(i));
Console.WriteLine(string.Join("\n", parts));

Extract Text with Character Positions

Get exact coordinates, font names, and sizes for every character:

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
chars = doc.extract_chars(0)

for ch in chars[:20]:
    print(f"'{ch.char}' at ({ch.x:.1f}, {ch.y:.1f}) "
          f"font={ch.font_name} size={ch.font_size:.1f}")

WASM

const doc = new WasmPdfDocument(bytes);
const chars = doc.extractChars(0);
for (const ch of chars.slice(0, 20)) {
    console.log(`'${ch.char}' at (${ch.x.toFixed(1)}, ${ch.y.toFixed(1)}) font=${ch.fontName} size=${ch.fontSize.toFixed(1)}`);
}
doc.free();

Rust

let mut doc = PdfDocument::open("paper.pdf")?;
let chars = doc.extract_chars(0)?;
for ch in chars.iter().take(20) {
    println!("'{}' at ({:.1}, {:.1}) font={} size={:.1}",
        ch.char, ch.x, ch.y, ch.font_name, ch.font_size);
}

Go

doc, _ := pdfoxide.Open("paper.pdf")
defer doc.Close()

chars, _ := doc.ExtractChars(0)
if len(chars) > 20 {
    chars = chars[:20] // avoid a panic on pages with fewer than 20 characters
}
for _, ch := range chars {
    fmt.Printf("%q at (%.1f, %.1f) font=%s size=%.1f\n",
        ch.Char, ch.X, ch.Y, ch.FontName, ch.FontSize)
}

C#

using var doc = PdfDocument.Open("paper.pdf");
var chars = doc.ExtractChars(0);
foreach (var ch in chars.Take(20))
    Console.WriteLine($"'{ch.Char}' at ({ch.X:F1}, {ch.Y:F1}) font={ch.FontName} size={ch.FontSize:F1}");

Each character includes:

Field | Type | Description
--- | --- | ---
char | str | The Unicode character
x, y | float | Position in points
font_size | float | Font size in points
font_name | str | PostScript font name
bbox | tuple | Bounding box (x0, y0, x1, y1)

Character-level extraction is useful for reconstructing tables, detecting headings by font size, or building bounding boxes around text regions. For example, you can group characters into lines by their y coordinate and detect column boundaries by gaps in x positions.
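The line-grouping idea above can be sketched in plain Python. The `Char` tuple below is a hypothetical stand-in that carries only the `char`, `x`, and `y` fields from the table. The tolerance values are illustrative guesses, and the sort order assumes y grows upward (the default PDF coordinate system), which may not match what the library reports.

```python
from typing import List, NamedTuple


class Char(NamedTuple):
    # Hypothetical stand-in for an extracted character.
    char: str
    x: float
    y: float


def group_into_lines(chars: List[Char], y_tolerance: float = 2.0) -> List[List[Char]]:
    """Cluster characters whose y values lie within y_tolerance of a line's first character."""
    lines: List[List[Char]] = []
    # Assumption: larger y means higher on the page, so sort descending.
    for ch in sorted(chars, key=lambda c: (-c.y, c.x)):
        if lines and abs(lines[-1][0].y - ch.y) <= y_tolerance:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    # Restore left-to-right order inside each line.
    return [sorted(line, key=lambda c: c.x) for line in lines]


def column_gaps(line: List[Char], min_gap: float = 20.0) -> List[float]:
    """Return the x positions that start a new column (gap to the previous x exceeds min_gap)."""
    return [b.x for a, b in zip(line, line[1:]) if b.x - a.x > min_gap]


chars = [Char("A", 72, 700), Char("B", 78, 700.5), Char("C", 300, 700), Char("D", 72, 680)]
lines = group_into_lines(chars)
for line in lines:
    print("".join(c.char for c in line), column_gaps(line))
```

This version measures gaps between character start positions only. Using the right edge from `bbox` would be more precise for wide glyphs.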

Extract Styled Text Spans

Group consecutive characters by font and size:

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
spans = doc.extract_spans(0)

for span in spans:
    print(f"'{span.text}' font={span.font_name} size={span.font_size:.1f}")

WASM

const doc = new WasmPdfDocument(bytes);
const spans = doc.extractSpans(0);
for (const span of spans) {
    console.log(`'${span.text}' font=${span.fontName} size=${span.fontSize.toFixed(1)}`);
}
doc.free();

Rust

let mut doc = PdfDocument::open("paper.pdf")?;
let spans = doc.extract_spans(0)?;
for span in &spans {
    println!("'{}' font={} size={:.1}", span.text, span.font_name, span.font_size);
}

Useful for detecting headings, bold text, or building structured output.
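As a rough illustration of heading detection, the sketch below treats any span noticeably larger than the dominant body size as a heading. The `Span` tuple is a hypothetical stand-in with the three fields printed above, and the 1.2 ratio is an arbitrary starting point, not a library default.

```python
from collections import Counter
from typing import List, NamedTuple


class Span(NamedTuple):
    # Hypothetical stand-in for an extracted span.
    text: str
    font_name: str
    font_size: float


def find_headings(spans: List[Span], ratio: float = 1.2) -> List[Span]:
    """Return spans whose font size is at least `ratio` times the body size."""
    if not spans:
        return []
    # Weight each size by character count so running text dominates.
    weights: Counter = Counter()
    for s in spans:
        weights[round(s.font_size, 1)] += len(s.text)
    body_size = weights.most_common(1)[0][0]
    return [s for s in spans if s.font_size >= body_size * ratio]


spans = [
    Span("Introduction", "Helvetica-Bold", 18.0),
    Span("PDF is a visual format, not a text format.", "Times-Roman", 10.0),
    Span("More body text follows here.", "Times-Roman", 10.0),
]
print([s.text for s in find_headings(spans)])
```

Bold text at body size is not caught by this heuristic; checking `font_name` for a "Bold" suffix is a common complement.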

Batch Processing

Process hundreds or thousands of PDFs:

from pdf_oxide import PdfDocument, PdfError
from pathlib import Path

pdf_dir = Path("documents/")
for pdf_path in pdf_dir.glob("*.pdf"):
    try:
        doc = PdfDocument(str(pdf_path))
        for i in range(doc.page_count()):
            text = doc.extract_text(i)
            # Process text...
    except PdfError as e:
        print(f"Skipped {pdf_path.name}: {e}")

At a mean of 0.8ms per page, extracting 3,830 pages (for example, 3,830 single-page PDFs) takes about 3.1 seconds; multi-page documents scale with their page count. For production pipelines, see the Batch Processing guide for parallel processing patterns using multiprocessing and async I/O.
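One way to spread a batch across workers while keeping per-file error isolation is sketched below. The helper names (`extract_many`, `_safe_extract`, `fake_extract`) are hypothetical, and the extraction function is injected so the sketch runs without any PDFs. The demo uses a thread pool for simplicity; a `ProcessPoolExecutor` can be passed instead as long as the extraction function is defined at module level.

```python
from concurrent.futures import ThreadPoolExecutor


def _safe_extract(job):
    """Run one extraction and capture any failure instead of raising."""
    extract_one, path = job
    try:
        return path, extract_one(path), None
    except Exception as exc:  # one bad file must not stop the batch
        return path, None, str(exc)


def extract_many(paths, extract_one, executor):
    """Map extract_one over paths on the given executor.

    Returns (texts, failures), both dictionaries keyed by path.
    """
    texts, failures = {}, {}
    jobs = [(extract_one, p) for p in paths]
    for path, text, error in executor.map(_safe_extract, jobs):
        if error is None:
            texts[path] = text
        else:
            failures[path] = error
    return texts, failures


def fake_extract(path):
    # Stand-in for opening a PDF and joining doc.extract_text(i) over its pages.
    if path.startswith("bad"):
        raise ValueError("cannot parse")
    return f"text of {path}"


with ThreadPoolExecutor(max_workers=2) as pool:
    texts, failures = extract_many(["a.pdf", "bad.pdf", "b.pdf"], fake_extract, pool)
print(texts)
print(failures)
```

Whether threads give a real speedup depends on whether the extension releases the GIL during extraction, which this guide does not state; processes are the safer default for CPU-bound work.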

Handling Scanned PDFs (OCR)

If a PDF contains scanned images instead of text, extract_text() returns empty or minimal output. Use PDF Oxide’s built-in OCR:

from pdf_oxide import PdfDocument

doc = PdfDocument("scanned.pdf")
text = doc.extract_text(0)

if not text.strip():
    # Page is likely scanned, so fall back to OCR
    text = doc.extract_text_ocr(0)

print(text)

PDF Oxide uses PaddleOCR via ONNX Runtime, so no Tesseract installation is required. See the OCR guide for model selection and configuration.
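Because scanned pages can return minimal rather than strictly empty text, a length threshold is sometimes more robust than an emptiness check. The helper below is a hypothetical sketch: the name `extract_with_ocr_fallback` and the 20-character threshold are assumptions, and both extraction functions are passed in as callables.

```python
from typing import Callable, Tuple


def extract_with_ocr_fallback(
    extract_text: Callable[[int], str],
    extract_ocr: Callable[[int], str],
    page: int,
    min_chars: int = 20,
) -> Tuple[str, bool]:
    """Return (text, used_ocr); OCR runs only when the text layer is too short."""
    text = extract_text(page)
    if len(text.strip()) >= min_chars:
        return text, False
    return extract_ocr(page), True


# Demo with stand-in callables: page 0 has a text layer, page 1 is a scan.
pages = {0: "A normal page with a real text layer.", 1: " \n"}
result_text, used_ocr = extract_with_ocr_fallback(
    lambda p: pages[p], lambda p: "text recovered by OCR", 1
)
print(result_text, used_ocr)
```

With a real document this would be called as `extract_with_ocr_fallback(doc.extract_text, doc.extract_text_ocr, 0)`, using the two methods shown above.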

Handling Encrypted PDFs

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("protected.pdf", password="secret")
text = doc.extract_text(0)
print(text)

WASM

const doc = new WasmPdfDocument(bytes);
doc.authenticate("secret");
const text = doc.extractText(0);
console.log(text);
doc.free();

Rust

let mut doc = PdfDocument::open_with_password("protected.pdf", "secret")?;
let text = doc.extract_text(0)?;
println!("{}", text);

Go

doc, _ := pdfoxide.Open("protected.pdf")
defer doc.Close()

if _, err := doc.Authenticate("secret"); err != nil { log.Fatal(err) }
text, _ := doc.ExtractText(0)
fmt.Println(text)

C#

using var doc = PdfDocument.OpenWithPassword("protected.pdf", "secret");
Console.WriteLine(doc.ExtractText(0));

Supports AES-256, AES-128, and RC4 encrypted PDFs. Unlike pdfplumber (which cannot open encrypted files at all) and pdfminer (which fails on AES-256), PDF Oxide handles all standard PDF encryption methods transparently.

Output as Markdown

For structured output with headings and formatting:

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)

WASM

const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdown(0);
console.log(md);
doc.free();

Rust

let mut doc = PdfDocument::open("paper.pdf")?;
let md = doc.to_markdown(0, true)?;
println!("{}", md);

Go

doc, _ := pdfoxide.Open("paper.pdf")
defer doc.Close()

md, _ := doc.ToMarkdown(0)
fmt.Println(md)

C#

using var doc = PdfDocument.Open("paper.pdf");
Console.WriteLine(doc.ToMarkdown(0));

See the PDF to Markdown guide for RAG and LLM integration patterns.

Search Within PDFs

Find text across all pages with position data:

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("manual.pdf")
results = doc.search("configuration")
for r in results:
    print(f"Page {r.page}: '{r.text}' at ({r.x:.0f}, {r.y:.0f})")

WASM

const doc = new WasmPdfDocument(bytes);
const results = doc.search("configuration", false);
for (const r of results) {
    console.log(`Page ${r.page}: '${r.text}' at (${r.x.toFixed(0)}, ${r.y.toFixed(0)})`);
}
doc.free();

Rust

let mut pdf = Pdf::open("manual.pdf")?;
let results = pdf.search("configuration")?;
for r in &results {
    println!("Page {}: '{}' at ({:.0}, {:.0})", r.page, r.text, r.bbox.x, r.bbox.y);
}

Go

doc, _ := pdfoxide.Open("manual.pdf")
defer doc.Close()

results, _ := doc.SearchAll("configuration", false)
for _, r := range results {
    fmt.Printf("Page %d: %q at (%.0f, %.0f)\n", r.PageIndex, r.Text, r.X, r.Y)
}

C#

using var doc = PdfDocument.Open("manual.pdf");
foreach (var r in doc.SearchAll("configuration", caseSensitive: false))
    Console.WriteLine($"Page {r.PageIndex}: '{r.Text}' at ({r.X:F0}, {r.Y:F0})");

Comparison with Other Python PDF Libraries

There are several Python libraries for PDF text extraction. Here is how they compare:

  • pypdf — Pure Python, no C dependencies. Easy to install but slow (12ms per page) and fails on 1.6% of PDFs due to limited font and encoding support. No character position data. Good for simple PDFs where speed does not matter.
  • pdfplumber — Built on pdfminer, provides detailed character and table extraction. Very slow (23ms per page) and cannot open encrypted PDFs. Best for table extraction when you need cell-level data and do not need performance.
  • PyMuPDF (fitz) — Python bindings to the MuPDF C library. Fast (4.6ms per page) and reliable (99.3% pass rate). Requires a C library installation and has AGPL licensing. A solid choice if the license works for your project.
  • pypdfium2 — Python bindings to Google’s PDFium engine. Fast (4.1ms per page) but p99 latency is high (42ms) on complex documents. Limited API surface compared to PyMuPDF.
  • pdfminer.six — Pure Python with detailed layout analysis. Very slow and unmaintained. Fails on AES-256 encrypted PDFs. Largely superseded by pdfplumber.
  • PDF Oxide — Rust-based with Python bindings via PyO3. Fastest option (0.8ms per page), 100% pass rate, handles all encryption methods, includes built-in OCR. MIT licensed with no system dependencies.

PDF Oxide was built specifically to address the gaps in existing libraries: the speed limitations of pure-Python parsers, the licensing restrictions of MuPDF, and the reliability issues that cause libraries to fail on real-world PDFs with unusual fonts, broken cross-reference tables, or non-standard encodings.

Performance: How Fast Is PDF Oxide?

Benchmarked on 3,830 PDFs from three independent public test suites:

Library | Mean (per page) | p99 | Pass Rate
--- | --- | --- | ---
PDF Oxide | 0.8ms | 9ms | 100%
PyMuPDF | 4.6ms | 28ms | 99.3%
pypdfium2 | 4.1ms | 42ms | 99.2%
pypdf | 12.1ms | 97ms | 98.4%
pdfplumber | 23.2ms | 189ms | 98.8%

For a pipeline processing 10,000 pages (for example, 10,000 single-page PDFs), the mean times above translate to:

  • PDF Oxide: 8 seconds
  • PyMuPDF: 46 seconds
  • pypdf: 2 minutes
  • pdfplumber: 3.9 minutes
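The estimates above are the benchmark means multiplied out. A minimal sketch of that arithmetic follows, assuming single-threaded extraction and that the mean holds for your documents; real totals also include file I/O and any downstream processing.

```python
# Mean extraction time per page in milliseconds, from the benchmark table.
MEAN_MS_PER_PAGE = {
    "PDF Oxide": 0.8,
    "PyMuPDF": 4.6,
    "pypdfium2": 4.1,
    "pypdf": 12.1,
    "pdfplumber": 23.2,
}


def estimated_seconds(pages: int, mean_ms: float) -> float:
    """Total extraction time in seconds for `pages` pages at `mean_ms` each."""
    return pages * mean_ms / 1000.0


for name, ms in MEAN_MS_PER_PAGE.items():
    seconds = estimated_seconds(10_000, ms)
    print(f"{name}: {seconds:.0f} s ({seconds / 60:.1f} min)")
```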

See the full benchmarks for methodology and reproduction steps.

Common Issues and Troubleshooting

Empty text output

If extract_text() returns an empty string, the page likely contains scanned images rather than text. Use extract_text_ocr() instead. See OCR Scanned PDFs for setup instructions.

Garbled or incorrect characters

This usually indicates a font with a non-standard encoding vector or a missing ToUnicode CMap. PDF Oxide handles most encoding edge cases, but some intentionally obfuscated PDFs (DRM-protected content) may produce incorrect output.

Missing spaces or merged words

PDF text operators place characters individually. Space inference depends on the gap between character positions relative to the font’s space width. If words appear merged, try extract_chars() and apply custom spacing logic based on character positions.
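A minimal sketch of custom spacing logic is shown below. It approximates the font's space width as a fraction of the font size, which is a simplification of the approach described above. The `Char` tuple is a hypothetical stand-in whose `x0` and `x1` would come from the left and right edges of each character's `bbox`, and the 0.25 ratio is a tunable guess.

```python
from typing import List, NamedTuple


class Char(NamedTuple):
    # Hypothetical stand-in; x0 and x1 are the left and right bbox edges.
    char: str
    x0: float
    x1: float
    font_size: float


def join_with_spaces(line: List[Char], gap_ratio: float = 0.25) -> str:
    """Join characters left to right, inserting a space where the gap is wide.

    A gap counts as a space when it exceeds gap_ratio * font_size of the
    preceding character.
    """
    out: List[str] = []
    prev = None
    for ch in sorted(line, key=lambda c: c.x0):
        if prev is not None and ch.x0 - prev.x1 > gap_ratio * prev.font_size:
            out.append(" ")
        out.append(ch.char)
        prev = ch
    return "".join(out)


line = [
    Char("H", 0.0, 6.0, 10.0),
    Char("i", 6.0, 8.0, 10.0),
    Char("y", 12.0, 18.0, 10.0),
    Char("o", 18.0, 24.0, 10.0),
]
print(join_with_spaces(line))
```

Lowering `gap_ratio` splits more aggressively, which helps with tightly set text but can break kerned words apart.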

Different output than other libraries

Different libraries use different heuristics for space inference, line breaking, and reading order. PDF Oxide achieves 99.5% text parity with PyMuPDF across 3,830 PDFs. The 0.5% difference is in whitespace normalization and ligature handling.

Real-World Use Cases

Search indexing — Extract text from every page of every PDF in a document repository, then feed the text into Elasticsearch, Typesense, or a vector database for full-text search. PDF Oxide’s speed makes it practical to re-index thousands of documents on demand.

RAG pipelines (retrieval-augmented generation) — Extract and chunk PDF text for embedding with OpenAI, Cohere, or open-source models. Use extract_spans() to preserve heading structure so chunks align with document sections. See the PDF to Markdown guide for LLM-optimized output.

Compliance and audit — Scan contracts, invoices, and regulatory filings for specific clauses or keywords. Use doc.search() to locate terms across all pages with exact positions, or extract full text for NLP-based clause detection.

Data extraction — Pull structured data from invoices, receipts, bank statements, and forms. Combine extract_chars() for positional data with domain-specific rules to locate fields like “Total Amount” or “Invoice Date” and extract adjacent values.
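A label-and-adjacent-value rule of this kind might look like the sketch below. It works on positioned text items (for example, words or spans assembled from character positions); the `Item` tuple, the `value_right_of` helper, and the sample invoice values are all hypothetical.

```python
from typing import List, NamedTuple, Optional


class Item(NamedTuple):
    # Hypothetical positioned text item (a word or span).
    text: str
    x: float
    y: float


def value_right_of(items: List[Item], label: str, y_tolerance: float = 2.0) -> Optional[str]:
    """Return the nearest item to the right of `label` on the same line, if any."""
    wanted = label.lower()
    anchors = [it for it in items if it.text.strip().rstrip(":").lower() == wanted]
    if not anchors:
        return None
    anchor = anchors[0]
    same_line = [
        it for it in items
        if abs(it.y - anchor.y) <= y_tolerance and it.x > anchor.x
    ]
    if not same_line:
        return None
    return min(same_line, key=lambda it: it.x).text.strip()


items = [
    Item("Invoice Date:", 72, 700),
    Item("2024-03-01", 200, 700),
    Item("Total Amount:", 72, 680),
    Item("$1,250.00", 200, 680.5),
]
print(value_right_of(items, "Total Amount"))
print(value_right_of(items, "Invoice Date"))
```

Forms that place values below their labels need a second rule that searches downward within a similar x range.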

Academic research — Process thousands of research papers for literature review, citation extraction, or meta-analysis. PDF Oxide handles the full range of PDF producers (LaTeX, Word, InDesign, Quark) and font encodings found in academic publications.