Skip to content

Convert PDF to Markdown in Python

PDF to Markdown conversion is one of the most important steps in modern document processing. Whether you are building an LLM-powered application, a RAG pipeline, or simply archiving documents in a readable format, converting PDF to Markdown in Python gives you structured, portable output that works everywhere.

Why Convert PDF to Markdown?

Markdown has become the standard interchange format for AI and document workflows. Here is why converting PDF to Markdown matters:

LLM context windows work best with structured text. Large language models like GPT-4, Claude, and Llama produce dramatically better results when their input is clean Markdown rather than raw extracted text. Headings give the model a map of the document, and formatting like bold and italic carries semantic weight that plain text discards.

RAG pipelines need clean, chunked text with headings preserved. Retrieval-augmented generation systems split documents into chunks, embed them, and retrieve the most relevant pieces at query time. Markdown headings are natural chunk boundaries – splitting on ## gives you semantically coherent sections with a built-in title for each chunk. Plain text extraction loses these boundaries entirely, forcing you to rely on heuristics like paragraph length or sentence count.

Markdown preserves document structure while being plain text. Headings, bullet lists, numbered lists, tables, bold, and italic all survive the conversion in a format that is both human-readable and machine-parseable. A Markdown file is just a text file – it works with version control, text search, and every programming language.

The alternatives are worse. Plain text extraction loses all structure: headings become indistinguishable from body text, tables collapse into jumbled lines, and lists lose their hierarchy. HTML conversion preserves structure but adds enormous bloat – a 2 KB Markdown file might become 15 KB of HTML with nested <div> tags, CSS classes, and escaped entities. Markdown hits the sweet spot: structured, lightweight, and universally supported.

Quick Start

Convert a PDF page to clean Markdown in three lines:

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)

WASM

import { WasmPdfDocument } from "pdf-oxide-wasm";

const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdown(0);
console.log(md);
doc.free();

Rust

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("paper.pdf")?;
let md = doc.to_markdown(0, true)?;
println!("{}", md);

Go

package main

import (
    "fmt"
    "log"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("paper.pdf")
    if err != nil { log.Fatal(err) }
    defer doc.Close()

    md, err := doc.ToMarkdown(0)
    if err != nil { log.Fatal(err) }
    fmt.Println(md)
}

C#

using PdfOxide;

using var doc = PdfDocument.Open("paper.pdf");
Console.WriteLine(doc.ToMarkdown(0));

PDF Oxide detects headings from font size clusters, preserves bold/italic formatting, converts tables to GFM syntax, and optionally embeds images. No other Python PDF library provides built-in Markdown conversion.

Installation

pip install pdf_oxide

Convert Entire Document

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("book.pdf")
md = doc.to_markdown_all(detect_headings=True)

with open("book.md", "w", encoding="utf-8") as f:
    f.write(md)

WASM

const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdownAll();
console.log(md);
doc.free();

Rust

let mut doc = PdfDocument::open("book.pdf")?;
let md = doc.to_markdown_all(true)?;
std::fs::write("book.md", &md)?;

Go

doc, _ := pdfoxide.Open("book.pdf")
defer doc.Close()

md, _ := doc.ToMarkdownAll()
_ = os.WriteFile("book.md", []byte(md), 0644)

C#

using var doc = PdfDocument.Open("book.pdf");
File.WriteAllText("book.md", doc.ToMarkdownAll());

to_markdown_all() converts every page and joins them with --- separators.

Conversion Options

Parameter Default Description
detect_headings True Map font sizes to #, ##, ### headings
preserve_layout False Preserve visual positioning
include_images True Include images in output
embed_images True Embed as base64 data URIs
image_output_dir None Save images to this directory instead

Headings Only (No Images)

doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True, include_images=False)

Save Images to a Directory

doc = PdfDocument("report.pdf")
md = doc.to_markdown(0,
    detect_headings=True,
    embed_images=False,
    image_output_dir="output/images"
)
with open("output/report.md", "w") as f:
    f.write(md)

RAG / LLM Pipeline Integration

Markdown is the ideal format for RAG pipelines. Headings provide natural chunk boundaries, and the structured format preserves meaning that plain text loses.

Chunk by Heading

Python

from pdf_oxide import PdfDocument
import re

doc = PdfDocument("paper.pdf")
md = doc.to_markdown_all(detect_headings=True)

# Split on headings for semantic chunking
chunks = re.split(r'\n(?=#{1,3} )', md)
chunks = [chunk.strip() for chunk in chunks if chunk.strip()]

for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk[:80]}...")

WASM

const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdownAll();

// Split on headings for semantic chunking
const chunks = md.split(/\n(?=#{1,3} )/).filter(c => c.trim());
chunks.forEach((chunk, i) => {
    console.log(`Chunk ${i}: ${chunk.slice(0, 80)}...`);
});
doc.free();

Rust

let mut doc = PdfDocument::open("paper.pdf")?;
let md = doc.to_markdown_all(true)?;

let chunks: Vec<&str> = md.split("\n#")
    .map(|c| c.trim())
    .filter(|c| !c.is_empty())
    .collect();

for (i, chunk) in chunks.iter().enumerate() {
    println!("Chunk {}: {}...", i, &chunk[..chunk.len().min(80)]);
}

Go

doc, _ := pdfoxide.Open("paper.pdf")
defer doc.Close()

md, _ := doc.ToMarkdownAll()

re := regexp.MustCompile(`\n(?=#{1,3} )`)
for i, chunk := range re.Split(md, -1) {
    chunk = strings.TrimSpace(chunk)
    if chunk == "" { continue }
    if len(chunk) > 80 { chunk = chunk[:80] }
    fmt.Printf("Chunk %d: %s...\n", i, chunk)
}

C#

using var doc = PdfDocument.Open("paper.pdf");
var md = doc.ToMarkdownAll();

var chunks = Regex.Split(md, @"\n(?=#{1,3} )")
    .Select(c => c.Trim())
    .Where(c => c.Length > 0)
    .ToList();

for (int i = 0; i < chunks.Count; i++)
{
    var preview = chunks[i].Length > 80 ? chunks[i][..80] : chunks[i];
    Console.WriteLine($"Chunk {i}: {preview}...");
}

Page-Level Chunking

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
chunks = []
for i in range(doc.page_count()):
    md = doc.to_markdown(i, detect_headings=True, include_images=False)
    chunks.append({
        "page": i,
        "content": md,
        "source": "report.pdf"
    })

WASM

const doc = new WasmPdfDocument(bytes);
const chunks = [];
for (let i = 0; i < doc.pageCount(); i++) {
    const md = doc.toMarkdown(i);
    chunks.push({ page: i, content: md, source: "report.pdf" });
}
doc.free();

Rust

let mut doc = PdfDocument::open("report.pdf")?;
let mut chunks = Vec::new();
for i in 0..doc.page_count()? {
    let md = doc.to_markdown(i, true)?;
    chunks.push((i, md));
}

Go

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()

type Chunk struct {
    Page    int
    Content string
    Source  string
}
n, _ := doc.PageCount()
chunks := make([]Chunk, 0, n)
for i := 0; i < n; i++ {
    md, _ := doc.ToMarkdown(i)
    chunks = append(chunks, Chunk{Page: i, Content: md, Source: "report.pdf"})
}

C#

using var doc = PdfDocument.Open("report.pdf");
var chunks = Enumerable.Range(0, doc.PageCount)
    .Select(i => new { Page = i, Content = doc.ToMarkdown(i), Source = "report.pdf" })
    .ToList();

Batch Convert for Vector Database

from pdf_oxide import PdfDocument, PdfError
from pathlib import Path

pdf_dir = Path("documents/")
documents = []

for pdf_path in pdf_dir.glob("*.pdf"):
    try:
        doc = PdfDocument(str(pdf_path))
        md = doc.to_markdown_all(detect_headings=True, include_images=False)
        documents.append({
            "source": pdf_path.name,
            "content": md,
            "pages": doc.page_count()
        })
    except PdfError as e:
        print(f"Skipped {pdf_path.name}: {e}")

print(f"Converted {len(documents)} PDFs")

At 0.8ms per page, converting thousands of PDFs for your vector database takes seconds, not minutes.

How Heading Detection Works

PDF Oxide clusters font sizes across the page to identify heading levels:

  1. Extract all text spans with font size and weight metadata
  2. Cluster spans by size — the most common size is body text
  3. Map larger/bolder sizes to # (largest), ##, ### headings
  4. Preserve bold (**text**) and italic (*text*) inline formatting

This works well for academic papers, reports, and documentation. For PDFs with unusual font schemes, disable heading detection:

md = doc.to_markdown(0, detect_headings=False)

PDF to Markdown for LLM and RAG Pipelines

PDF Oxide’s built-in Markdown conversion is purpose-built for AI workflows. The heading hierarchy it detects maps directly to semantic structure, making downstream processing straightforward.

Feed Markdown to an LLM

Convert a PDF and send the Markdown directly to a language model for summarization, Q&A, or analysis:

from pdf_oxide import PdfDocument

doc = PdfDocument("quarterly-report.pdf")
md = doc.to_markdown_all(detect_headings=True, include_images=False)

# Send to any LLM API -- the Markdown structure helps the model
# understand the document's organization
prompt = f"""Summarize the following document. Pay attention to the
heading structure to identify the main sections.

{md}
"""
# response = llm_client.generate(prompt)

Because PDF Oxide preserves the heading hierarchy (#, ##, ###), the LLM can distinguish section titles from body text and produce section-aware summaries. With plain text extraction, the model has to guess where sections begin and end.

Chunk by Headings for RAG

Splitting on Markdown headings produces semantically meaningful chunks that embed well and retrieve accurately:

from pdf_oxide import PdfDocument
import re

doc = PdfDocument("technical-manual.pdf")
md = doc.to_markdown_all(detect_headings=True, include_images=False)

# Split into chunks at heading boundaries
chunks = re.split(r'\n(?=#{1,3} )', md)
chunks = [c.strip() for c in chunks if c.strip()]

# Each chunk has a heading as its first line -- use it as metadata
for chunk in chunks:
    lines = chunk.split('\n', 1)
    title = lines[0].lstrip('#').strip()
    body = lines[1].strip() if len(lines) > 1 else ""
    # embed_and_store(title=title, content=body, source="technical-manual.pdf")

This approach gives you chunks that are coherent (each chunk is a complete section), titled (the heading serves as metadata for retrieval), and consistently sized (document authors naturally create sections of similar length). PDF Oxide’s heading detection makes this possible without any manual configuration – the font-size clustering algorithm identifies heading levels automatically.

Why PDF Oxide Is Ideal for AI Pipelines

At 0.8ms per page, PDF Oxide is fast enough to convert documents on-the-fly at query time, not just at indexing time. This opens up workflows that are impractical with slower tools:

  • On-demand conversion: Convert a PDF to Markdown when a user uploads it, with no noticeable delay
  • Re-processing: Update your RAG index by re-converting all PDFs when you change your chunking strategy – thousands of pages process in seconds
  • Streaming pipelines: Convert PDFs as they arrive in a queue without building up a backlog

Batch Processing

Convert an entire directory of PDFs to Markdown files:

from pdf_oxide import PdfDocument
from pathlib import Path

for pdf_path in Path("documents/").glob("*.pdf"):
    doc = PdfDocument(str(pdf_path))
    md_parts = []
    for i in range(doc.page_count()):
        md_parts.append(doc.to_markdown(i, detect_headings=True))

    md_path = pdf_path.with_suffix(".md")
    md_path.write_text("\n\n".join(md_parts))
    print(f"Converted {pdf_path.name}{md_path.name}")

At sub-millisecond speeds per page, batch converting hundreds of PDFs completes in seconds. For production workloads with thousands of files, see the Batch Processing guide for parallel processing patterns.

PDF to Markdown: PDF Oxide vs Alternatives

Tool Speed Built-in Heading Detection Table Preservation
PDF Oxide 0.8ms Yes Yes Yes
pymupdf4llm 55.5ms (69x slower) No (separate package) Yes Yes
marker ~500ms+ No (separate tool) Yes Yes
pdfplumber + custom code ~23ms+ No (manual) No Manual
pypdf + custom code ~12ms+ No (manual) No No

PDF Oxide is the only Python PDF library with built-in, fast Markdown conversion. It detects headings from font-size clustering, converts tables to GitHub Flavored Markdown syntax, and preserves inline formatting – all in a single to_markdown() call.

pymupdf4llm requires PyMuPDF (which is AGPL-licensed) plus an additional pymupdf4llm package on top. It is 69 times slower than PDF Oxide and carries copyleft license obligations that may be incompatible with proprietary applications.

marker is a standalone tool, not a library. It uses deep learning models for layout detection, which makes it accurate on complex layouts but orders of magnitude slower. It also requires significant GPU memory for best performance.

pdfplumber and pypdf do not offer Markdown conversion at all. You would need to write custom code to detect headings, reconstruct tables, and format the output as Markdown – a substantial engineering effort to replicate what PDF Oxide provides out of the box.