Convert PDF to Markdown in Python
PDF to Markdown conversion is a common early step in document processing. Whether you are building an LLM-powered application, a RAG pipeline, or simply archiving documents in a readable format, converting PDF to Markdown in Python gives you structured, portable plain-text output.
Why Convert PDF to Markdown?
Markdown has become a common interchange format for AI and document workflows. Here is why converting PDF to Markdown matters:
LLMs benefit from structured input. Large language models like GPT-4, Claude, and Llama tend to produce better results when their input is clean Markdown rather than raw extracted text. Headings give the model a map of the document, and formatting like bold and italic carries emphasis that plain text discards.
RAG pipelines need clean, chunked text with headings preserved. Retrieval-augmented generation systems split documents into chunks, embed them, and retrieve the most relevant pieces at query time. Markdown headings are natural chunk boundaries – splitting on ## gives you semantically coherent sections with a built-in title for each chunk. Plain text extraction loses these boundaries entirely, forcing you to rely on heuristics like paragraph length or sentence count.
Markdown preserves document structure while being plain text. Headings, bullet lists, numbered lists, tables, bold, and italic all survive the conversion in a format that is both human-readable and machine-parseable. A Markdown file is just a text file – it works with version control, text search, and every programming language.
The alternatives are worse. Plain text extraction loses all structure: headings become indistinguishable from body text, tables collapse into jumbled lines, and lists lose their hierarchy. HTML conversion preserves structure but adds enormous bloat – a 2 KB Markdown file might become 15 KB of HTML with nested <div> tags, CSS classes, and escaped entities. Markdown hits the sweet spot: structured, lightweight, and universally supported.
Quick Start
Convert a PDF page to clean Markdown in a few lines:
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)
WASM
import { WasmPdfDocument } from "pdf-oxide-wasm";
const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdown(0);
console.log(md);
doc.free();
Rust
use pdf_oxide::PdfDocument;
let mut doc = PdfDocument::open("paper.pdf")?;
let md = doc.to_markdown(0, true)?;
println!("{}", md);
Go
package main
import (
    "fmt"
    "log"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)
func main() {
    doc, err := pdfoxide.Open("paper.pdf")
    if err != nil { log.Fatal(err) }
    defer doc.Close()
    md, err := doc.ToMarkdown(0)
    if err != nil { log.Fatal(err) }
    fmt.Println(md)
}
C#
using PdfOxide;
using var doc = PdfDocument.Open("paper.pdf");
Console.WriteLine(doc.ToMarkdown(0));
PDF Oxide detects headings from font size clusters, preserves bold/italic formatting, converts tables to GFM syntax, and optionally embeds images. Among the Python tools compared later on this page, it is the only one with Markdown conversion built into the library itself.
Installation
pip install pdf_oxide
Convert Entire Document
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("book.pdf")
md = doc.to_markdown_all(detect_headings=True)
with open("book.md", "w", encoding="utf-8") as f:
    f.write(md)
WASM
const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdownAll();
console.log(md);
doc.free();
Rust
let mut doc = PdfDocument::open("book.pdf")?;
let md = doc.to_markdown_all(true)?;
std::fs::write("book.md", &md)?;
Go
doc, _ := pdfoxide.Open("book.pdf")
defer doc.Close()
md, _ := doc.ToMarkdownAll()
_ = os.WriteFile("book.md", []byte(md), 0644)
C#
using var doc = PdfDocument.Open("book.pdf");
File.WriteAllText("book.md", doc.ToMarkdownAll());
to_markdown_all() converts every page and joins them with --- separators.
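If you later need per-page text from the combined output, one option is to split on the separator again. The sketch below assumes the separator is a line containing only `---` (the exact surrounding whitespace is an assumption, not something this page specifies), and `split_pages` is a hypothetical helper, not part of pdf_oxide. A horizontal rule inside a page's own content would be treated as a page break too.

```python
import re

# Assumption: pages are separated by a line that contains only "---".
PAGE_SEPARATOR = re.compile(r"^[ \t]*---[ \t]*$", flags=re.MULTILINE)


def split_pages(markdown: str) -> list[str]:
    """Split combined Markdown into per-page strings, dropping empty pages."""
    pages = PAGE_SEPARATOR.split(markdown)
    return [page.strip() for page in pages if page.strip()]


combined = "# Title\n\nFirst page.\n\n---\n\nSecond page.\n\n---\n\n## End\n\nThird page."
pages = split_pages(combined)
print(len(pages))
```

When page provenance matters, converting page by page with to_markdown(i) avoids the ambiguity entirely.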
Conversion Options
| Parameter | Default | Description |
|---|---|---|
| detect_headings | True | Map font sizes to #, ##, ### headings |
| preserve_layout | False | Preserve visual positioning |
| include_images | True | Include images in output |
| embed_images | True | Embed as base64 data URIs |
| image_output_dir | None | Save images to this directory instead |
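If the same option set is reused across many calls, it can help to keep it in one place. The following is a minimal sketch: `MarkdownOptions` is a hypothetical container (not a pdf_oxide type) whose field names and defaults simply mirror the table above, and it assumes to_markdown() accepts all five as keyword arguments.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class MarkdownOptions:
    """Hypothetical container; fields and defaults mirror the options table."""

    detect_headings: bool = True
    preserve_layout: bool = False
    include_images: bool = True
    embed_images: bool = True
    image_output_dir: Optional[str] = None

    def as_kwargs(self) -> dict:
        """Return the options as keyword arguments for a conversion call."""
        return asdict(self)


# Text-only profile for LLM input: keep headings, skip images.
text_only = MarkdownOptions(include_images=False)
print(text_only.as_kwargs())

# Usage (requires pdf_oxide; left commented so the sketch runs without it):
# md = doc.to_markdown(0, **text_only.as_kwargs())
```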
Headings Only (No Images)
doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True, include_images=False)
Save Images to a Directory
doc = PdfDocument("report.pdf")
md = doc.to_markdown(
    0,
    detect_headings=True,
    embed_images=False,
    image_output_dir="output/images",
)
with open("output/report.md", "w", encoding="utf-8") as f:
    f.write(md)
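After saving images to a directory, you may want to check which files the Markdown actually references. A small sketch, assuming the output uses standard Markdown image syntax (`![alt](target)`); the file name in the sample is made up for illustration and `image_targets` / `linked_files` are hypothetical helpers:

```python
import re

# Standard Markdown image syntax: ![alt](target "optional title")
IMAGE_PATTERN = re.compile(r'!\[([^\]]*)\]\(([^)\s]+)(?:\s+"[^"]*")?\)')


def image_targets(markdown: str) -> list[str]:
    """Return the target of every image reference, in document order."""
    return [match.group(2) for match in IMAGE_PATTERN.finditer(markdown)]


def linked_files(markdown: str) -> list[str]:
    """Return only targets that point at files rather than base64 data URIs."""
    return [t for t in image_targets(markdown) if not t.startswith("data:")]


sample = (
    "# Report\n\n"
    "![Figure 1](output/images/page0_img0.png)\n\n"
    "![](data:image/png;base64,iVBORw0KGgo=)\n"
)
print(linked_files(sample))
```

Comparing `linked_files(md)` against the directory listing is one way to spot missing or unused image files.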
RAG / LLM Pipeline Integration
Markdown is the ideal format for RAG pipelines. Headings provide natural chunk boundaries, and the structured format preserves meaning that plain text loses.
Chunk by Heading
Python
from pdf_oxide import PdfDocument
import re
doc = PdfDocument("paper.pdf")
md = doc.to_markdown_all(detect_headings=True)
# Split on headings for semantic chunking
chunks = re.split(r'\n(?=#{1,3} )', md)
chunks = [chunk.strip() for chunk in chunks if chunk.strip()]
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk[:80]}...")
WASM
const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdownAll();
// Split on headings for semantic chunking
const chunks = md.split(/\n(?=#{1,3} )/).filter(c => c.trim());
chunks.forEach((chunk, i) => {
    console.log(`Chunk ${i}: ${chunk.slice(0, 80)}...`);
});
doc.free();
Rust
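let mut doc = PdfDocument::open("paper.pdf")?;
let md = doc.to_markdown_all(true)?;
// Start a new chunk at every line that begins with a #, ## or ### heading,
// keeping the heading markers in the chunk (split("\n#") would drop them).
let mut chunks: Vec<String> = Vec::new();
for line in md.lines() {
    let level = line.chars().take_while(|&c| c == '#').count();
    let is_heading = (1..=3).contains(&level) && line[level..].starts_with(' ');
    if is_heading || chunks.is_empty() {
        chunks.push(String::new());
    }
    let current = chunks.last_mut().unwrap();
    current.push_str(line);
    current.push('\n');
}
for (i, chunk) in chunks.iter().filter(|c| !c.trim().is_empty()).enumerate() {
    // Take characters, not bytes, so multi-byte text cannot cause a panic
    let preview: String = chunk.trim().chars().take(80).collect();
    println!("Chunk {}: {}...", i, preview);
}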
Go
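doc, _ := pdfoxide.Open("paper.pdf")
defer doc.Close()
md, _ := doc.ToMarkdownAll()
// Go's regexp package (RE2 syntax) has no lookahead, so find where each
// heading line starts and slice the document between those offsets.
re := regexp.MustCompile(`(?m)^#{1,3} `)
starts := []int{0}
for _, loc := range re.FindAllStringIndex(md, -1) {
    if loc[0] > 0 { starts = append(starts, loc[0]) }
}
starts = append(starts, len(md))
for i := 0; i+1 < len(starts); i++ {
    chunk := strings.TrimSpace(md[starts[i]:starts[i+1]])
    if chunk == "" { continue }
    if len(chunk) > 80 { chunk = chunk[:80] }
    fmt.Printf("Chunk %d: %s...\n", i, chunk)
}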
C#
using var doc = PdfDocument.Open("paper.pdf");
var md = doc.ToMarkdownAll();
var chunks = Regex.Split(md, @"\n(?=#{1,3} )")
    .Select(c => c.Trim())
    .Where(c => c.Length > 0)
    .ToList();
for (int i = 0; i < chunks.Count; i++)
{
    var preview = chunks[i].Length > 80 ? chunks[i][..80] : chunks[i];
    Console.WriteLine($"Chunk {i}: {preview}...");
}
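One caveat with a pure regex split: if the Markdown contains fenced code blocks, a line such as a `# comment` inside the fence looks like a heading and triggers a split. Whether PDF Oxide emits fenced code is not stated here, so treat the following as a defensive sketch; `split_on_headings` is a hypothetical helper, and its fence tracking is deliberately simple (it does not handle nested or mismatched fences).

```python
import re

HEADING = re.compile(r"^#{1,3} ")
FENCE = re.compile(r"^(`{3}|~{3})")  # opening or closing code fence


def split_on_headings(markdown: str) -> list[str]:
    """Split at #, ##, ### headings, ignoring '#' lines inside fenced code."""
    chunks: list[list[str]] = [[]]
    in_fence = False
    for line in markdown.splitlines():
        if FENCE.match(line):
            in_fence = not in_fence
        elif not in_fence and HEADING.match(line) and chunks[-1]:
            chunks.append([])  # heading outside a fence starts a new chunk
        chunks[-1].append(line)
    joined = ("\n".join(lines).strip() for lines in chunks)
    return [chunk for chunk in joined if chunk]


fence = "`" * 3  # built this way to avoid a literal fence inside this example
sample = "\n".join([
    "# Setup",
    "Install the package.",
    fence + "python",
    "# this comment is not a heading",
    "print('hi')",
    fence,
    "## Usage",
    "Call the function.",
])
chunks = split_on_headings(sample)
print(len(chunks))
```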
Page-Level Chunking
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
chunks = []
for i in range(doc.page_count()):
    md = doc.to_markdown(i, detect_headings=True, include_images=False)
    chunks.append({
        "page": i,
        "content": md,
        "source": "report.pdf"
    })
WASM
const doc = new WasmPdfDocument(bytes);
const chunks = [];
for (let i = 0; i < doc.pageCount(); i++) {
    const md = doc.toMarkdown(i);
    chunks.push({ page: i, content: md, source: "report.pdf" });
}
doc.free();
Rust
let mut doc = PdfDocument::open("report.pdf")?;
let mut chunks = Vec::new();
for i in 0..doc.page_count()? {
    let md = doc.to_markdown(i, true)?;
    chunks.push((i, md));
}
Go
doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
type Chunk struct {
    Page    int
    Content string
    Source  string
}
n, _ := doc.PageCount()
chunks := make([]Chunk, 0, n)
for i := 0; i < n; i++ {
    md, _ := doc.ToMarkdown(i)
    chunks = append(chunks, Chunk{Page: i, Content: md, Source: "report.pdf"})
}
C#
using var doc = PdfDocument.Open("report.pdf");
var chunks = Enumerable.Range(0, doc.PageCount)
    .Select(i => new { Page = i, Content = doc.ToMarkdown(i), Source = "report.pdf" })
    .ToList();
Batch Convert for Vector Database
from pdf_oxide import PdfDocument, PdfError
from pathlib import Path
pdf_dir = Path("documents/")
documents = []
for pdf_path in pdf_dir.glob("*.pdf"):
    try:
        doc = PdfDocument(str(pdf_path))
        md = doc.to_markdown_all(detect_headings=True, include_images=False)
        documents.append({
            "source": pdf_path.name,
            "content": md,
            "pages": doc.page_count()
        })
    except PdfError as e:
        print(f"Skipped {pdf_path.name}: {e}")
print(f"Converted {len(documents)} PDFs")
At roughly 0.8 ms per page, conversion is rarely the bottleneck when filling a vector database: 10,000 pages take about 8 seconds of conversion time, excluding file I/O.
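The per-page figure makes rough capacity estimates easy. A small sketch, treating 0.8 ms as a fixed per-page cost and ignoring file I/O and process startup (both simplifying assumptions); `estimated_seconds` is a hypothetical helper and the corpus size is illustrative:

```python
def estimated_seconds(total_pages: int, ms_per_page: float = 0.8) -> float:
    """Estimate single-threaded conversion time from a fixed per-page cost."""
    return total_pages * ms_per_page / 1000.0


# Illustrative corpus: 2,000 PDFs averaging 12 pages each.
pages = 2_000 * 12
print(f"{pages} pages -> about {estimated_seconds(pages):.1f} s")
```

Real throughput depends on page complexity and disk speed, so measure on a sample of your own documents before sizing a pipeline.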
How Heading Detection Works
PDF Oxide clusters font sizes across the page to identify heading levels:
- Extract all text spans with font size and weight metadata
- Cluster spans by size; the most common size is body text
- Map larger/bolder sizes to # (largest), ##, ### headings
- Preserve bold (**text**) and italic (*text*) inline formatting
This works well for academic papers, reports, and documentation. For PDFs with unusual font schemes, disable heading detection:
md = doc.to_markdown(0, detect_headings=False)
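The steps listed above can be sketched in plain Python to make the idea concrete. This is not PDF Oxide's implementation: `Span`, `heading_levels`, and `spans_to_markdown` are hypothetical names, exact-size matching stands in for real clustering, and weighting sizes by character count is an assumption made here so that long body paragraphs outweigh short headings.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical text span; not a pdf_oxide type."""

    text: str
    size: float
    bold: bool = False


def heading_levels(spans: list[Span], max_levels: int = 3) -> dict[float, int]:
    """Map font sizes larger than the body size to heading levels 1..max_levels."""
    if not spans:
        return {}
    # The size covering the most characters is treated as body text.
    counts: Counter = Counter()
    for span in spans:
        counts[span.size] += len(span.text)
    body_size = counts.most_common(1)[0][0]
    larger = sorted({s.size for s in spans if s.size > body_size}, reverse=True)
    return {size: level for level, size in enumerate(larger[:max_levels], start=1)}


def spans_to_markdown(spans: list[Span]) -> str:
    """Emit headings for large sizes and ** ** for bold body-size spans."""
    levels = heading_levels(spans)
    lines = []
    for span in spans:
        if span.size in levels:
            lines.append("#" * levels[span.size] + " " + span.text)
        elif span.bold:
            lines.append(f"**{span.text}**")
        else:
            lines.append(span.text)
    return "\n\n".join(lines)


spans = [
    Span("Title", 24.0),
    Span("Background", 18.0),
    Span("Body text that is long enough to dominate the character count.", 11.0),
    Span("Note", 11.0, bold=True),
]
print(spans_to_markdown(spans))
```

A document whose headings use the same size as body text (for example, bold-only headings) would yield no heading levels under this scheme, which is the kind of unusual font scheme where disabling detection makes sense.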
PDF to Markdown for LLM and RAG Pipelines
PDF Oxide’s built-in Markdown conversion is purpose-built for AI workflows. The heading hierarchy it detects maps directly to semantic structure, making downstream processing straightforward.
Feed Markdown to an LLM
Convert a PDF and send the Markdown directly to a language model for summarization, Q&A, or analysis:
from pdf_oxide import PdfDocument
doc = PdfDocument("quarterly-report.pdf")
md = doc.to_markdown_all(detect_headings=True, include_images=False)
# Send to any LLM API -- the Markdown structure helps the model
# understand the document's organization
prompt = f"""Summarize the following document. Pay attention to the
heading structure to identify the main sections.
{md}
"""
# response = llm_client.generate(prompt)
Because PDF Oxide preserves the heading hierarchy (#, ##, ###), the LLM can distinguish section titles from body text and produce section-aware summaries. With plain text extraction, the model has to guess where sections begin and end.
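For long documents, it can also help to hand the model an explicit outline before the full text. A minimal sketch that derives one from the Markdown headings; `outline` and `render_outline` are hypothetical helpers, the sample text is invented, and lines inside fenced code blocks are not treated specially:

```python
import re

HEADING_LINE = re.compile(r"^(#{1,3}) +(.+?)[ \t]*$", flags=re.MULTILINE)


def outline(markdown: str) -> list[tuple[int, str]]:
    """Return (level, title) for every #, ##, ### heading in document order."""
    return [(len(m.group(1)), m.group(2)) for m in HEADING_LINE.finditer(markdown)]


def render_outline(markdown: str) -> str:
    """Format the outline as an indented bullet list for use in a prompt."""
    return "\n".join(
        f"{'  ' * (level - 1)}- {title}" for level, title in outline(markdown)
    )


sample_md = (
    "# Quarterly Report\n\nIntro.\n\n"
    "## Revenue\n\nText.\n\n"
    "### By Region\n\nText.\n\n"
    "## Outlook\n"
)
print(render_outline(sample_md))
```

The rendered outline can be placed at the top of the prompt, ahead of the full Markdown.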
Chunk by Headings for RAG
Splitting on Markdown headings produces semantically meaningful chunks that embed well and retrieve accurately:
from pdf_oxide import PdfDocument
import re
doc = PdfDocument("technical-manual.pdf")
md = doc.to_markdown_all(detect_headings=True, include_images=False)
# Split into chunks at heading boundaries
chunks = re.split(r'\n(?=#{1,3} )', md)
chunks = [c.strip() for c in chunks if c.strip()]
# Each chunk has a heading as its first line -- use it as metadata
for chunk in chunks:
    lines = chunk.split('\n', 1)
    title = lines[0].lstrip('#').strip()
    body = lines[1].strip() if len(lines) > 1 else ""
    # embed_and_store(title=title, content=body, source="technical-manual.pdf")
This approach gives you chunks that are coherent (each chunk is a complete section) and titled (the heading serves as metadata for retrieval). Chunk sizes follow the author's section lengths, which are often comparable within a document but can vary widely, so very long sections may need a secondary split. PDF Oxide’s heading detection makes this possible without any manual configuration – the font-size clustering algorithm identifies heading levels automatically.
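Section lengths are ultimately up to the document's author, so an occasional section may exceed what your embedding model handles well. One possible safeguard is a secondary split at paragraph boundaries. The sketch below is a generic post-processing step, not a PDF Oxide feature; `cap_chunk_size` and the 2,000-character default are arbitrary illustrative choices.

```python
def cap_chunk_size(chunks: list[str], max_chars: int = 2000) -> list[str]:
    """Split chunks longer than max_chars at blank-line paragraph boundaries."""
    result: list[str] = []
    for chunk in chunks:
        if len(chunk) <= max_chars:
            result.append(chunk)
            continue
        current = ""
        for paragraph in chunk.split("\n\n"):
            candidate = f"{current}\n\n{paragraph}" if current else paragraph
            if current and len(candidate) > max_chars:
                result.append(current)  # flush before the limit is exceeded
                current = paragraph
            else:
                current = candidate
        if current:
            result.append(current)
    return result


long_section = "## A\n\n" + "x" * 30 + "\n\n" + "y" * 30
capped = cap_chunk_size([long_section, "## B\n\nshort"], max_chars=40)
print(len(capped))
```

A single paragraph longer than `max_chars` is kept whole, and pieces after the first lose their heading line; if titles matter for retrieval, carry the title separately as metadata.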
Why PDF Oxide Is Ideal for AI Pipelines
At 0.8ms per page, PDF Oxide is fast enough to convert documents on-the-fly at query time, not just at indexing time. This opens up workflows that are impractical with slower tools:
- On-demand conversion: Convert a PDF to Markdown when a user uploads it, with no noticeable delay
- Re-processing: Update your RAG index by re-converting all PDFs when you change your chunking strategy – thousands of pages process in seconds
- Streaming pipelines: Convert PDFs as they arrive in a queue without building up a backlog
Batch Processing
Convert an entire directory of PDFs to Markdown files:
from pdf_oxide import PdfDocument
from pathlib import Path
for pdf_path in Path("documents/").glob("*.pdf"):
    doc = PdfDocument(str(pdf_path))
    md_parts = []
    for i in range(doc.page_count()):
        md_parts.append(doc.to_markdown(i, detect_headings=True))
    md_path = pdf_path.with_suffix(".md")
    md_path.write_text("\n\n".join(md_parts), encoding="utf-8")
    print(f"Converted {pdf_path.name} → {md_path.name}")
At sub-millisecond speeds per page, batch converting hundreds of PDFs completes in seconds. For production workloads with thousands of files, see the Batch Processing guide for parallel processing patterns.
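As a generic illustration of fanning work out across files (not necessarily the pattern the Batch Processing guide recommends), the sketch below runs a conversion callable in a thread pool and collects failures separately. `convert_many` and `fake_convert` are hypothetical; whether threads actually speed up conversion depends on whether the Python binding releases the GIL, which this page does not state.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def convert_many(
    paths: list[Path],
    convert: Callable[[Path], str],
    max_workers: int = 4,
) -> tuple[dict[str, str], dict[str, str]]:
    """Run convert() over paths in a thread pool; return (converted, failed)."""

    def attempt(path: Path):
        try:
            return path, convert(path), None
        except Exception as exc:  # one bad file should not stop the batch
            return path, None, exc

    converted: dict[str, str] = {}
    failed: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for path, markdown, error in pool.map(attempt, paths):
            if error is None:
                converted[path.name] = markdown
            else:
                failed[path.name] = str(error)
    return converted, failed


# Stand-in converter so the sketch runs without pdf_oxide installed.
def fake_convert(path: Path) -> str:
    if path.stem == "bad":
        raise ValueError("unreadable file")
    return f"# {path.stem}"


converted, failed = convert_many(
    [Path("a.pdf"), Path("bad.pdf"), Path("b.pdf")], fake_convert
)
print(sorted(converted), failed)
```

With pdf_oxide installed, the callable could be something like `lambda p: PdfDocument(str(p)).to_markdown_all(detect_headings=True)`.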
PDF to Markdown: PDF Oxide vs Alternatives
| Tool | Speed (per page) | Built-in Markdown | Heading Detection | Table Preservation |
|---|---|---|---|---|
| PDF Oxide | 0.8ms | Yes | Yes | Yes |
| pymupdf4llm | 55.5ms (69x slower) | No (separate package) | Yes | Yes |
| marker | ~500ms+ | No (separate tool) | Yes | Yes |
| pdfplumber + custom code | ~23ms+ | No (manual) | No | Manual |
| pypdf + custom code | ~12ms+ | No (manual) | No | No |
Of the tools compared here, PDF Oxide is the only one with fast Markdown conversion built into the library itself. It detects headings from font-size clustering, converts tables to GitHub Flavored Markdown syntax, and preserves inline formatting – all in a single to_markdown() call.
pymupdf4llm requires PyMuPDF (which is AGPL-licensed) plus an additional pymupdf4llm package on top. It is 69 times slower than PDF Oxide and carries copyleft license obligations that may be incompatible with proprietary applications.
marker is a standalone tool, not a library. It uses deep learning models for layout detection, which makes it accurate on complex layouts but orders of magnitude slower. It also requires significant GPU memory for best performance.
pdfplumber and pypdf do not offer Markdown conversion at all. You would need to write custom code to detect headings, reconstruct tables, and format the output as Markdown – a substantial engineering effort to replicate what PDF Oxide provides out of the box.
Related Pages
- Markdown Conversion API — full API reference
- PDF for RAG Pipelines — complete RAG integration guide
- Extract Text from PDF — plain text extraction
- Batch Processing — parallel processing patterns