What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Convert PDF to Markdown in Python

PDF to Markdown conversion is one of the most important steps in modern document processing. Whether you are building an LLM-powered application, a RAG pipeline, or simply archiving documents in a readable format, converting PDF to Markdown in Python gives you structured, portable output that works everywhere.

Why Convert PDF to Markdown?

Markdown has become the standard interchange format for AI and document workflows. Here is why converting PDF to Markdown matters:

LLM context windows work best with structured text. Large language models like GPT-4, Claude, and Llama tend to produce better results when their input is clean Markdown rather than raw extracted text. Headings give the model a map of the document, and formatting like bold and italic carries semantic weight that plain text discards.

RAG pipelines need clean, chunked text with headings preserved. Retrieval-augmented generation systems split documents into chunks, embed them, and retrieve the most relevant pieces at query time. Markdown headings are natural chunk boundaries – splitting on ## gives you semantically coherent sections with a built-in title for each chunk. Plain text extraction loses these boundaries entirely, forcing you to rely on heuristics like paragraph length or sentence count.

Markdown preserves document structure while being plain text. Headings, bullet lists, numbered lists, tables, bold, and italic all survive the conversion in a format that is both human-readable and machine-parseable. A Markdown file is just a text file – it works with version control, text search, and every programming language.

The alternatives are worse. Plain text extraction loses all structure: headings become indistinguishable from body text, tables collapse into jumbled lines, and lists lose their hierarchy. HTML conversion preserves structure but adds enormous bloat – a 2 KB Markdown file might become 15 KB of HTML with nested <div> tags, CSS classes, and escaped entities. Markdown hits the sweet spot: structured, lightweight, and universally supported.

Quick Start

Convert a PDF page to clean Markdown in a few lines:

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)

WASM

import { WasmPdfDocument } from "pdf-oxide-wasm";

const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdown(0);
console.log(md);
doc.free();

Rust

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("paper.pdf")?;
let md = doc.to_markdown(0, true)?;
println!("{}", md);

Go

package main

import (
    "fmt"
    "log"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("paper.pdf")
    if err != nil { log.Fatal(err) }
    defer doc.Close()

    md, err := doc.ToMarkdown(0)
    if err != nil { log.Fatal(err) }
    fmt.Println(md)
}

C#

using PdfOxide;

using var doc = PdfDocument.Open("paper.pdf");
Console.WriteLine(doc.ToMarkdown(0));

Java

import fyi.oxide.pdf.PdfDocument;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("paper.pdf"))) {
    System.out.println(doc.toMarkdown(0));
}

PHP

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('paper.pdf');
echo $doc->toMarkdown(0);
$doc->close();

Ruby

require 'pdf_oxide'

PdfOxide::PdfDocument.open('paper.pdf') do |doc|
  puts doc.to_markdown(0)
end

C++

#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("paper.pdf");
std::cout << doc.to_markdown(0);

Swift

import PdfOxide

let doc = try Document.open("paper.pdf")
print(try doc.toMarkdown(0))

Kotlin

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(java.nio.file.Path.of("paper.pdf")).use { doc ->
    println(doc.toMarkdown(0))
}

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('paper.pdf');
print(doc.toMarkdown(0));
doc.close();

R

library(pdfoxide)

doc <- pdf_open("paper.pdf")
cat(pdf_to_markdown(doc, 0))

Julia

using PdfOxide

doc = open_document("paper.pdf")
print(to_markdown(doc, 0))

Zig

const pdf_oxide = @import("pdf_oxide");

var doc = try pdf_oxide.Document.open("paper.pdf");
const md = try doc.toMarkdown(std.heap.page_allocator, 0);
std.debug.print("{s}", .{md});

Scala

import fyi.oxide.pdf.PdfDocument
import scala.util.Using

Using.resource(PdfDocument.open("paper.pdf")) { doc =>
  println(doc.toMarkdown(0))
}

Clojure

(require '[pdf-oxide.core :as pdf])

(with-open [doc (pdf/open "paper.pdf")]
  (println (pdf/to-markdown doc 0)))

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"paper.pdf" error:&err];
NSLog(@"%@", [doc toMarkdown:0 error:&err]);

Elixir

{:ok, doc} = PdfOxide.open("paper.pdf")
{:ok, md} = PdfOxide.to_markdown(doc, 0)
IO.puts(md)

PDF Oxide detects headings from font size clusters, preserves bold/italic formatting, converts tables to GFM syntax, and optionally embeds images. No other Python PDF library provides built-in Markdown conversion.

Installation

pip install pdf_oxide

Convert Entire Document

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("book.pdf")
md = doc.to_markdown_all(detect_headings=True)

with open("book.md", "w", encoding="utf-8") as f:
    f.write(md)

WASM

const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdownAll();
console.log(md);
doc.free();

Rust

let mut doc = PdfDocument::open("book.pdf")?;
let md = doc.to_markdown_all(true)?;
std::fs::write("book.md", &md)?;

Go

doc, _ := pdfoxide.Open("book.pdf")
defer doc.Close()

md, _ := doc.ToMarkdownAll()
_ = os.WriteFile("book.md", []byte(md), 0644)

C#

using var doc = PdfDocument.Open("book.pdf");
File.WriteAllText("book.md", doc.ToMarkdownAll());

Java

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("book.pdf"))) {
    java.nio.file.Files.writeString(java.nio.file.Path.of("book.md"), doc.toMarkdown());
}

PHP

$doc = PdfDocument::open('book.pdf');
file_put_contents('book.md', $doc->toMarkdownAll());
$doc->close();

Ruby

PdfOxide::PdfDocument.open('book.pdf') do |doc|
  File.write('book.md', doc.to_markdown)
end

C++

auto doc = pdf_oxide::Document::open("book.pdf");
std::ofstream("book.md") << doc.to_markdown_all();

Swift

let doc = try Document.open("book.pdf")
try doc.toMarkdownAll().write(toFile: "book.md", atomically: true, encoding: .utf8)

Kotlin

PdfDocument.open(java.nio.file.Path.of("book.pdf")).use { doc ->
    java.nio.file.Files.writeString(java.nio.file.Path.of("book.md"), doc.toMarkdown())
}

Dart

final doc = PdfDocument.open('book.pdf');
File('book.md').writeAsStringSync(doc.toMarkdownAll());
doc.close();

R

doc <- pdf_open("book.pdf")
writeLines(pdf_to_markdown_all(doc), "book.md")

Julia

doc = open_document("book.pdf")
write("book.md", to_markdown_all(doc))

Zig

var doc = try pdf_oxide.Document.open("book.pdf");
const md = try doc.toMarkdownAll(std.heap.page_allocator);
try std.fs.cwd().writeFile(.{ .sub_path = "book.md", .data = md });

Scala

Using.resource(PdfDocument.open("book.pdf")) { doc =>
  java.nio.file.Files.writeString(java.nio.file.Path.of("book.md"), doc.toMarkdown())
}

Clojure

(with-open [doc (pdf/open "book.pdf")]
  (spit "book.md" (pdf/to-markdown doc)))

Objective-C

POXDocument *doc = [POXDocument openPath:@"book.pdf" error:&err];
[[doc toMarkdownAllWithError:&err] writeToFile:@"book.md"
    atomically:YES encoding:NSUTF8StringEncoding error:&err];

Elixir

{:ok, doc} = PdfOxide.open("book.pdf")
{:ok, md} = PdfOxide.to_markdown_all(doc)
File.write!("book.md", md)

to_markdown_all() converts every page and joins them with --- separators.
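
If you need page boundaries back after a whole-document conversion, one option is to split the combined string on those separator lines. The sketch below assumes each separator is a line containing only `---` with a blank line on either side. `split_pages` is a hypothetical helper, not part of PDF Oxide, and a page whose own content contains a horizontal rule in the same form would be split as well.

```python
import re

# Assumption: pages are joined by a line containing only "---",
# surrounded by blank lines. Adjust the pattern if the real output differs.
PAGE_SEPARATOR = re.compile(r"\n\s*\n---\s*\n\s*\n")


def split_pages(markdown: str) -> list[str]:
    """Split combined Markdown into per-page strings (hypothetical helper)."""
    pages = PAGE_SEPARATOR.split(markdown)
    return [page.strip() for page in pages if page.strip()]


combined = "# Title\n\nIntro text.\n\n---\n\n## Section\n\nMore text."
pages = split_pages(combined)
print(len(pages))
# prints "2"
```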

Conversion Options

Parameter	Default	Description
`detect_headings`	`True`	Map font sizes to `#`, `##`, `###` headings
`preserve_layout`	`False`	Preserve visual positioning
`include_images`	`True`	Include images in output
`embed_images`	`True`	Embed as base64 data URIs
`image_output_dir`	`None`	Save images to this directory instead
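
With `embed_images` enabled, base64 data URIs can account for most of the output size. If you already have Markdown with embedded images and want a text-only copy, for example before computing embeddings, a regular expression can replace them with short placeholders. This is a minimal sketch: `strip_embedded_images` is a hypothetical helper, and it assumes the images use the standard `![alt](data:image/...)` syntax.

```python
import re

# Assumption: embedded images use standard Markdown image syntax with a
# data URI, e.g. ![alt](data:image/png;base64,AAAA...).
DATA_URI_IMAGE = re.compile(r"!\[([^\]]*)\]\(data:image/[^)]*\)")


def strip_embedded_images(markdown: str) -> str:
    """Replace inline base64 images with a short text placeholder."""
    def placeholder(match: re.Match) -> str:
        alt = match.group(1).strip()
        return f"[image: {alt}]" if alt else "[image]"
    return DATA_URI_IMAGE.sub(placeholder, markdown)


sample = "Before ![Figure 1](data:image/png;base64,iVBORw0KGgo=) after."
print(strip_embedded_images(sample))
# prints "Before [image: Figure 1] after."
```

Images that point to files or URLs are left untouched, so the same helper is safe to run on output produced with `image_output_dir`.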

Headings Only (No Images)

doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True, include_images=False)

Save Images to a Directory

doc = PdfDocument("report.pdf")
md = doc.to_markdown(0,
    detect_headings=True,
    embed_images=False,
    image_output_dir="output/images"
)
with open("output/report.md", "w", encoding="utf-8") as f:
    f.write(md)

RAG / LLM Pipeline Integration

Markdown is the ideal format for RAG pipelines. Headings provide natural chunk boundaries, and the structured format preserves meaning that plain text loses.

Chunk by Heading

Python

from pdf_oxide import PdfDocument
import re

doc = PdfDocument("paper.pdf")
md = doc.to_markdown_all(detect_headings=True)

# Split on headings for semantic chunking
chunks = re.split(r'\n(?=#{1,3} )', md)
chunks = [chunk.strip() for chunk in chunks if chunk.strip()]

for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk[:80]}...")

WASM

const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdownAll();

// Split on headings for semantic chunking
const chunks = md.split(/\n(?=#{1,3} )/).filter(c => c.trim());
chunks.forEach((chunk, i) => {
    console.log(`Chunk ${i}: ${chunk.slice(0, 80)}...`);
});
doc.free();

Rust

let mut doc = PdfDocument::open("paper.pdf")?;
let md = doc.to_markdown_all(true)?;

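// Splitting on "\n#" is a simple approximation: it removes one leading '#'
// from every chunk after the first and also splits at deeper headings.
let chunks: Vec<&str> = md.split("\n#")
    .map(|c| c.trim())
    .filter(|c| !c.is_empty())
    .collect();

for (i, chunk) in chunks.iter().enumerate() {
    // Take 80 characters rather than 80 bytes so that multi-byte text
    // cannot panic on a char boundary.
    let preview: String = chunk.chars().take(80).collect();
    println!("Chunk {}: {}...", i, preview);
}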

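Go

doc, _ := pdfoxide.Open("paper.pdf")
defer doc.Close()

md, _ := doc.ToMarkdownAll()

// Go's regexp package (RE2) has no lookahead, so find where each heading
// starts and slice the text between those positions.
re := regexp.MustCompile(`(?m)^#{1,3} `)
starts := []int{0}
for _, loc := range re.FindAllStringIndex(md, -1) {
    if loc[0] > 0 { starts = append(starts, loc[0]) }
}
starts = append(starts, len(md))

n := 0
for i := 0; i+1 < len(starts); i++ {
    chunk := strings.TrimSpace(md[starts[i]:starts[i+1]])
    if chunk == "" { continue }
    if len(chunk) > 80 { chunk = chunk[:80] }
    fmt.Printf("Chunk %d: %s...\n", n, chunk)
    n++
}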

C#

using var doc = PdfDocument.Open("paper.pdf");
var md = doc.ToMarkdownAll();

var chunks = Regex.Split(md, @"\n(?=#{1,3} )")
    .Select(c => c.Trim())
    .Where(c => c.Length > 0)
    .ToList();

for (int i = 0; i < chunks.Count; i++)
{
    var preview = chunks[i].Length > 80 ? chunks[i][..80] : chunks[i];
    Console.WriteLine($"Chunk {i}: {preview}...");
}

Java

import java.util.regex.Pattern;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("paper.pdf"))) {
    String md = doc.toMarkdown();
    String[] chunks = Pattern.compile("\\n(?=#{1,3} )").split(md);
    int i = 0;
    for (String chunk : chunks) {
        chunk = chunk.strip();
        if (chunk.isEmpty()) continue;
        String preview = chunk.substring(0, Math.min(80, chunk.length()));
        System.out.printf("Chunk %d: %s...%n", i++, preview);
    }
}

PHP

$doc = PdfDocument::open('paper.pdf');
$md = $doc->toMarkdownAll();

$chunks = preg_split('/\n(?=#{1,3} )/', $md);
$i = 0;
foreach ($chunks as $chunk) {
    $chunk = trim($chunk);
    if ($chunk === '') continue;
    printf("Chunk %d: %s...\n", $i++, substr($chunk, 0, 80));
}
$doc->close();

Ruby

PdfOxide::PdfDocument.open('paper.pdf') do |doc|
  md = doc.to_markdown

  # Escape the '#' so Ruby does not treat #{...} as string interpolation
  chunks = md.split(/\n(?=\#{1,3} )/).map(&:strip).reject(&:empty?)
  chunks.each_with_index do |chunk, i|
    puts "Chunk #{i}: #{chunk[0, 80]}..."
  end
end

C++

#include <regex>

auto doc = pdf_oxide::Document::open("paper.pdf");
auto md = doc.to_markdown_all();

std::regex sep(R"(\n(?=#{1,3} ))");
std::sregex_token_iterator it(md.begin(), md.end(), sep, -1), end;
for (int i = 0; it != end; ++it) {
    std::string chunk = *it;
    if (chunk.empty()) continue;
    std::cout << "Chunk " << i++ << ": " << chunk.substr(0, 80) << "...\n";
}

Swift

let doc = try Document.open("paper.pdf")
let md = try doc.toMarkdownAll()

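// components(separatedBy:) does not accept a regular expression;
// Regex-based split requires Swift 5.7 or later.
let separator = try Regex("\\n(?=#{1,3} )")
let chunks = md.split(separator: separator)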
for (i, chunk) in chunks.enumerated() where !chunk.isEmpty {
    print("Chunk \(i): \(chunk.prefix(80))...")
}

Kotlin

PdfDocument.open(java.nio.file.Path.of("paper.pdf")).use { doc ->
    val md = doc.toMarkdown()
    md.split(Regex("\\n(?=#{1,3} )"))
        .map { it.trim() }
        .filter { it.isNotEmpty() }
        .forEachIndexed { i, chunk ->
            println("Chunk $i: ${chunk.take(80)}...")
        }
}

Dart

final doc = PdfDocument.open('paper.pdf');
final md = doc.toMarkdownAll();

final chunks = md.split(RegExp(r'\n(?=#{1,3} )'))
    .map((c) => c.trim())
    .where((c) => c.isNotEmpty)
    .toList();
for (var i = 0; i < chunks.length; i++) {
  final preview = chunks[i].length > 80 ? chunks[i].substring(0, 80) : chunks[i];
  print('Chunk $i: $preview...');
}
doc.close();

R

doc <- pdf_open("paper.pdf")
md <- pdf_to_markdown_all(doc)

chunks <- strsplit(md, "\n(?=#{1,3} )", perl = TRUE)[[1]]
chunks <- trimws(chunks)
chunks <- chunks[nchar(chunks) > 0]
for (i in seq_along(chunks)) {
  cat(sprintf("Chunk %d: %s...\n", i - 1, substr(chunks[i], 1, 80)))
}

Julia

doc = open_document("paper.pdf")
md = to_markdown_all(doc)

chunks = filter(!isempty, strip.(split(md, r"\n(?=#{1,3} )")))
for (i, chunk) in enumerate(chunks)
    println("Chunk $(i-1): $(first(chunk, 80))...")
end

Zig

var doc = try pdf_oxide.Document.open("paper.pdf");
const md = try doc.toMarkdownAll(std.heap.page_allocator);

var it = std.mem.splitSequence(u8, md, "\n#");
var i: usize = 0;
while (it.next()) |chunk| {
    const trimmed = std.mem.trim(u8, chunk, " \n");
    if (trimmed.len == 0) continue;
    std.debug.print("Chunk {d}: {s}...\n", .{ i, trimmed[0..@min(80, trimmed.len)] });
    i += 1;
}

Scala

Using.resource(PdfDocument.open("paper.pdf")) { doc =>
  val md = doc.toMarkdown()
  md.split("(?=\\n#{1,3} )")
    .map(_.trim)
    .filter(_.nonEmpty)
    .zipWithIndex
    .foreach { case (chunk, i) =>
      println(s"Chunk $i: ${chunk.take(80)}...")
    }
}

Clojure

(with-open [doc (pdf/open "paper.pdf")]
  (let [md (pdf/to-markdown doc)
        chunks (->> (clojure.string/split md #"\n(?=#{1,3} )")
                    (map clojure.string/trim)
                    (remove clojure.string/blank?))]
    (doseq [[i chunk] (map-indexed vector chunks)]
      (println (format "Chunk %d: %s..." i (subs chunk 0 (min 80 (count chunk))))))))

Objective-C

POXDocument *doc = [POXDocument openPath:@"paper.pdf" error:&err];
NSString *md = [doc toMarkdownAllWithError:&err];

NSRegularExpression *re = [NSRegularExpression
    regularExpressionWithPattern:@"\\n(?=#{1,3} )" options:0 error:&err];
// Mark each split point with a sentinel character that should not occur in Markdown text
NSString *split = [re stringByReplacingMatchesInString:md options:0
    range:NSMakeRange(0, md.length) withTemplate:@"\x01"];
NSInteger i = 0;
for (NSString *chunk in [split componentsSeparatedByString:@"\x01"]) {
    NSString *t = [chunk stringByTrimmingCharactersInSet:
        NSCharacterSet.whitespaceAndNewlineCharacterSet];
    if (t.length == 0) continue;
    NSLog(@"Chunk %ld: %@...", (long)i++, [t substringToIndex:MIN(80, t.length)]);
}

Elixir

{:ok, doc} = PdfOxide.open("paper.pdf")
{:ok, md} = PdfOxide.to_markdown_all(doc)

md
# [#] avoids #{...} being parsed as interpolation inside the ~r sigil
|> String.split(~r/\n(?=[#]{1,3} )/)
|> Enum.map(&String.trim/1)
|> Enum.reject(&(&1 == ""))
|> Enum.with_index()
|> Enum.each(fn {chunk, i} ->
  IO.puts("Chunk #{i}: #{String.slice(chunk, 0, 80)}...")
end)
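
Splitting with a regular expression discards the context of parent sections: a `###` chunk no longer records which `##` section it belonged to. One possible refinement is to walk the Markdown line by line and keep the heading path for each chunk. The sketch below works on any Markdown string; `chunk_with_heading_path` is a hypothetical helper, and it does not account for `#` lines inside code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,3}) +(.*)$")


def chunk_with_heading_path(markdown: str) -> list[dict]:
    """Split Markdown at #, ##, ### headings, recording each chunk's heading path."""
    chunks = []
    path = []   # current headings as (level, title), outermost first
    body = []   # lines collected for the current chunk

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": [title for _, title in path], "content": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level, title = len(match.group(1)), match.group(2).strip()
            # Drop headings at the same or a deeper level before adding this one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, title))
        else:
            body.append(line)
    flush()
    return chunks


md = "# Guide\n\nOverview.\n\n## Install\n\nRun pip.\n\n### Linux\n\nUse apt.\n\n## Usage\n\nCall it."
for chunk in chunk_with_heading_path(md):
    print(" > ".join(chunk["path"]), "|", chunk["content"])
```

Storing the path alongside the content lets a retriever show "Guide > Install > Linux" as the source of a hit instead of only the innermost title.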

Page-Level Chunking

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
chunks = []
for i in range(doc.page_count()):
    md = doc.to_markdown(i, detect_headings=True, include_images=False)
    chunks.append({
        "page": i,
        "content": md,
        "source": "report.pdf"
    })

WASM

const doc = new WasmPdfDocument(bytes);
const chunks = [];
for (let i = 0; i < doc.pageCount(); i++) {
    const md = doc.toMarkdown(i);
    chunks.push({ page: i, content: md, source: "report.pdf" });
}
doc.free();

Rust

let mut doc = PdfDocument::open("report.pdf")?;
let mut chunks = Vec::new();
for i in 0..doc.page_count()? {
    let md = doc.to_markdown(i, true)?;
    chunks.push((i, md));
}

Go

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()

type Chunk struct {
    Page    int
    Content string
    Source  string
}
n, _ := doc.PageCount()
chunks := make([]Chunk, 0, n)
for i := 0; i < n; i++ {
    md, _ := doc.ToMarkdown(i)
    chunks = append(chunks, Chunk{Page: i, Content: md, Source: "report.pdf"})
}

C#

using var doc = PdfDocument.Open("report.pdf");
var chunks = Enumerable.Range(0, doc.PageCount)
    .Select(i => new { Page = i, Content = doc.ToMarkdown(i), Source = "report.pdf" })
    .ToList();

Java

import java.util.*;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("report.pdf"))) {
    List<Map<String, Object>> chunks = new ArrayList<>();
    for (int i = 0; i < doc.pageCount(); i++) {
        chunks.add(Map.of("page", i, "content", doc.toMarkdown(i), "source", "report.pdf"));
    }
}

PHP

$doc = PdfDocument::open('report.pdf');
$chunks = [];
for ($i = 0; $i < $doc->pageCount(); $i++) {
    $chunks[] = ['page' => $i, 'content' => $doc->toMarkdown($i), 'source' => 'report.pdf'];
}
$doc->close();

Ruby

PdfOxide::PdfDocument.open('report.pdf') do |doc|
  chunks = (0...doc.page_count).map do |i|
    { page: i, content: doc.to_markdown(i), source: 'report.pdf' }
  end
end

C++

auto doc = pdf_oxide::Document::open("report.pdf");
struct Chunk { int page; std::string content; std::string source; };
std::vector<Chunk> chunks;
for (int i = 0; i < doc.page_count(); i++) {
    chunks.push_back({ i, doc.to_markdown(i), "report.pdf" });
}

Swift

let doc = try Document.open("report.pdf")
let chunks = try (0..<doc.pageCount()).map { i in
    (page: i, content: try doc.toMarkdown(i), source: "report.pdf")
}

Kotlin

PdfDocument.open(java.nio.file.Path.of("report.pdf")).use { doc ->
    val chunks = (0 until doc.pageCount()).map { i ->
        mapOf("page" to i, "content" to doc.toMarkdown(i), "source" to "report.pdf")
    }
}

Dart

final doc = PdfDocument.open('report.pdf');
final chunks = [
  for (var i = 0; i < doc.pageCount; i++)
    {'page': i, 'content': doc.toMarkdown(i), 'source': 'report.pdf'}
];
doc.close();

R

doc <- pdf_open("report.pdf")
chunks <- lapply(0:(pdf_page_count(doc) - 1), function(i) {
  list(page = i, content = pdf_to_markdown(doc, i), source = "report.pdf")
})

Julia

doc = open_document("report.pdf")
chunks = [(page = i, content = to_markdown(doc, i), source = "report.pdf")
          for i in 0:(page_count(doc) - 1)]

Zig

var doc = try pdf_oxide.Document.open("report.pdf");
const a = std.heap.page_allocator;
const Chunk = struct { page: usize, content: []const u8 };
var chunks = std.ArrayList(Chunk).init(a);
var i: usize = 0;
while (i < try doc.pageCount()) : (i += 1) {
    try chunks.append(.{ .page = i, .content = try doc.toMarkdown(a, i) });
}

Scala

Using.resource(PdfDocument.open("report.pdf")) { doc =>
  val chunks = (0 until doc.pageCount()).map { i =>
    Map("page" -> i, "content" -> doc.toMarkdown(i), "source" -> "report.pdf")
  }
}

Clojure

(with-open [doc (pdf/open "report.pdf")]
  (def chunks
    (mapv (fn [i] {:page i :content (pdf/to-markdown doc i) :source "report.pdf"})
          (range (pdf/page-count doc)))))

Objective-C

POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
NSMutableArray *chunks = [NSMutableArray array];
for (NSInteger i = 0; i < [doc pageCountError:&err]; i++) {
    [chunks addObject:@{ @"page": @(i),
                         @"content": [doc toMarkdown:i error:&err],
                         @"source": @"report.pdf" }];
}

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")
{:ok, n} = PdfOxide.page_count(doc)
chunks =
  for i <- 0..(n - 1) do
    {:ok, md} = PdfOxide.to_markdown(doc, i)
    %{page: i, content: md, source: "report.pdf"}
  end

Batch Convert for Vector Database

from pdf_oxide import PdfDocument, PdfError
from pathlib import Path

pdf_dir = Path("documents/")
documents = []

for pdf_path in pdf_dir.glob("*.pdf"):
    try:
        doc = PdfDocument(str(pdf_path))
        md = doc.to_markdown_all(detect_headings=True, include_images=False)
        documents.append({
            "source": pdf_path.name,
            "content": md,
            "pages": doc.page_count()
        })
    except PdfError as e:
        print(f"Skipped {pdf_path.name}: {e}")

print(f"Converted {len(documents)} PDFs")

At 0.8ms per page, converting thousands of PDFs for your vector database takes seconds, not minutes.
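
For a rough sense of scale, the arithmetic behind that estimate can be written out. The sketch treats 0.8 ms as a per-page mean and ignores disk I/O and Python overhead; the corpus size is made up for illustration.

```python
def estimated_seconds(num_pdfs: int, pages_per_pdf: float, ms_per_page: float = 0.8) -> float:
    """Back-of-the-envelope conversion time, ignoring I/O and startup costs."""
    return num_pdfs * pages_per_pdf * ms_per_page / 1000.0


# Hypothetical corpus: 5,000 PDFs averaging 10 pages each.
print(f"{estimated_seconds(5_000, 10):.1f} s")
# prints "40.0 s"
```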

How Heading Detection Works

PDF Oxide clusters font sizes across the page to identify heading levels:

1. Extract all text spans with font size and weight metadata
2. Cluster spans by size; the most common size is body text
3. Map larger/bolder sizes to # (largest), ##, ### headings
4. Preserve bold (**text**) and italic (*text*) inline formatting

This works well for academic papers, reports, and documentation. For PDFs with unusual font schemes, disable heading detection:

md = doc.to_markdown(0, detect_headings=False)
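
The clustering steps described above can be illustrated in a few lines of plain Python. This is a simplified stand-in, not PDF Oxide's actual implementation: it takes hypothetical `(text, font_size)` spans, treats the size that covers the most characters as body text, and maps up to three larger sizes to `#`, `##`, and `###`. Font weight is ignored.

```python
from collections import Counter


def assign_heading_levels(spans: list[tuple[str, float]]) -> list[str]:
    """Render (text, font_size) spans as Markdown lines using size-based headings."""
    if not spans:
        return []

    # The size that covers the most characters is assumed to be body text.
    chars_per_size = Counter()
    for text, size in spans:
        chars_per_size[size] += len(text)
    body_size = chars_per_size.most_common(1)[0][0]

    # The three largest sizes above body size become levels 1-3, largest first.
    # Any other size is rendered as plain text.
    heading_sizes = sorted((s for s in chars_per_size if s > body_size), reverse=True)[:3]
    level_for_size = {size: level for level, size in enumerate(heading_sizes, start=1)}

    lines = []
    for text, size in spans:
        level = level_for_size.get(size)
        lines.append(f"{'#' * level} {text}" if level else text)
    return lines


spans = [
    ("Annual Report", 24.0),
    ("Summary", 16.0),
    ("Revenue grew steadily over the year.", 11.0),
    ("Details", 16.0),
    ("Costs were flat.", 11.0),
]
print("\n".join(assign_heading_levels(spans)))
```

The example also shows why unusual font schemes cause trouble: a document that sets pull quotes or captions in a larger size than the body would have them promoted to headings by this kind of rule.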

PDF to Markdown for LLM and RAG Pipelines

PDF Oxide’s built-in Markdown conversion is purpose-built for AI workflows. The heading hierarchy it detects maps directly to semantic structure, making downstream processing straightforward.

Feed Markdown to an LLM

Convert a PDF and send the Markdown directly to a language model for summarization, Q&A, or analysis:

from pdf_oxide import PdfDocument

doc = PdfDocument("quarterly-report.pdf")
md = doc.to_markdown_all(detect_headings=True, include_images=False)

# Send to any LLM API -- the Markdown structure helps the model
# understand the document's organization
prompt = f"""Summarize the following document. Pay attention to the
heading structure to identify the main sections.

{md}
"""
# response = llm_client.generate(prompt)

Because PDF Oxide preserves the heading hierarchy (#, ##, ###), the LLM can distinguish section titles from body text and produce section-aware summaries. With plain text extraction, the model has to guess where sections begin and end.
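
One way to make that hierarchy explicit is to place a short outline ahead of the document text in the prompt. The sketch below builds such an outline from any Markdown string; `extract_outline` is a hypothetical helper, not a PDF Oxide function.

```python
import re

HEADING = re.compile(r"^(#{1,3}) +(.+)$", re.MULTILINE)


def extract_outline(markdown: str) -> str:
    """Return an indented bullet outline built from #, ##, ### headings."""
    lines = []
    for match in HEADING.finditer(markdown):
        level = len(match.group(1))
        lines.append("  " * (level - 1) + "- " + match.group(2).strip())
    return "\n".join(lines)


md = "# Q3 Report\n\nText.\n\n## Revenue\n\nText.\n\n### By Region\n\nText.\n\n## Outlook\n\nText."
print(extract_outline(md))
```

The outline costs only a few tokens and can be prepended to the prompt under a label such as "Document outline:" before the full text.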

Chunk by Headings for RAG

Splitting on Markdown headings produces semantically meaningful chunks that embed well and retrieve accurately:

from pdf_oxide import PdfDocument
import re

doc = PdfDocument("technical-manual.pdf")
md = doc.to_markdown_all(detect_headings=True, include_images=False)

# Split into chunks at heading boundaries
chunks = re.split(r'\n(?=#{1,3} )', md)
chunks = [c.strip() for c in chunks if c.strip()]

# Most chunks start with a heading line -- use it as metadata.
# Text that appears before the first heading has no title.
for chunk in chunks:
    first_line, _, rest = chunk.partition('\n')
    if first_line.startswith('#'):
        title, body = first_line.lstrip('#').strip(), rest.strip()
    else:
        title, body = "", chunk
    # embed_and_store(title=title, content=body, source="technical-manual.pdf")

This approach gives you chunks that are coherent (each chunk is a complete section) and titled (the heading serves as metadata for retrieval). Chunk sizes follow the author's section lengths, which are often similar within a document but are not guaranteed to be. PDF Oxide's heading detection makes this possible without manual configuration: the font-size clustering algorithm identifies heading levels automatically.
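
Because section lengths can still vary, a size cap is a common safeguard before embedding. The sketch below splits an oversized chunk at paragraph boundaries and repeats the heading on each piece. `cap_chunk_size` and the 1,000-character default are illustrative assumptions, and a single paragraph longer than the limit is kept whole.

```python
def cap_chunk_size(chunk: str, max_chars: int = 1000) -> list[str]:
    """Split one chunk into pieces whose bodies stay within max_chars where possible."""
    if len(chunk) <= max_chars:
        return [chunk]

    heading, _, rest = chunk.partition("\n")
    if not heading.startswith("#"):
        heading, rest = "", chunk  # no heading to repeat

    pieces, current = [], []
    for paragraph in rest.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = "\n\n".join(current + [paragraph])
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the limit: close the piece.
            pieces.append("\n\n".join(current))
            current = [paragraph]
        else:
            current.append(paragraph)
    if current:
        pieces.append("\n\n".join(current))

    # Repeat the heading so every piece keeps its section title.
    prefix = heading + "\n\n" if heading else ""
    return [prefix + piece for piece in pieces]


section = "## Setup\n\n" + "\n\n".join(["a" * 40, "b" * 40, "c" * 40])
for piece in cap_chunk_size(section, max_chars=90):
    print(len(piece), piece.splitlines()[0])
```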

Why PDF Oxide Is Ideal for AI Pipelines

At 0.8ms per page, PDF Oxide is fast enough to convert documents on-the-fly at query time, not just at indexing time. This opens up workflows that are impractical with slower tools:

On-demand conversion: Convert a PDF to Markdown when a user uploads it, with no noticeable delay
Re-processing: Update your RAG index by re-converting all PDFs when you change your chunking strategy – thousands of pages process in seconds
Streaming pipelines: Convert PDFs as they arrive in a queue without building up a backlog

Batch Processing

Convert an entire directory of PDFs to Markdown files:

from pdf_oxide import PdfDocument
from pathlib import Path

for pdf_path in Path("documents/").glob("*.pdf"):
    doc = PdfDocument(str(pdf_path))
    md_parts = []
    for i in range(doc.page_count()):
        md_parts.append(doc.to_markdown(i, detect_headings=True))

    md_path = pdf_path.with_suffix(".md")
    md_path.write_text("\n\n".join(md_parts), encoding="utf-8")
    print(f"Converted {pdf_path.name} → {md_path.name}")

At sub-millisecond speeds per page, batch converting hundreds of PDFs completes in seconds. For production workloads with thousands of files, see the Batch Processing guide for parallel processing patterns.
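
As a rough idea of what a parallel pattern can look like, the sketch below fans files out over a thread pool from the standard library. `convert_directory` and `convert_one` are hypothetical names: `convert_one` is a placeholder for the `PdfDocument` calls shown above. Whether threads or processes give better throughput with PDF Oxide is not stated on this page, so treat the choice of a thread pool as an assumption to benchmark.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def convert_directory(
    pdf_dir: Path,
    convert_one: Callable[[Path], str],
    max_workers: int = 4,
) -> dict[str, str]:
    """Convert every *.pdf in pdf_dir concurrently; returns {file name: error message}."""
    errors = {}

    def worker(pdf_path: Path) -> None:
        try:
            markdown = convert_one(pdf_path)
            pdf_path.with_suffix(".md").write_text(markdown, encoding="utf-8")
        except Exception as exc:  # keep going if one file fails
            errors[pdf_path.name] = str(exc)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Consume the iterator; the with-block waits for all workers to finish.
        list(pool.map(worker, sorted(pdf_dir.glob("*.pdf"))))
    return errors


# convert_one would wrap the library call, for example:
#   lambda p: PdfDocument(str(p)).to_markdown_all(detect_headings=True)
```

Collecting errors in a dictionary instead of raising keeps one malformed file from stopping the whole batch, mirroring the try/except in the vector database example.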

PDF to Markdown: PDF Oxide vs Alternatives

Tool	Speed	Built-in	Heading Detection	Table Preservation
PDF Oxide	0.8ms	Yes	Yes	Yes
pymupdf4llm	55.5ms (69x slower)	No (separate package)	Yes	Yes
marker	~500ms+	No (separate tool)	Yes	Yes
pdfplumber + custom code	~23ms+	No (manual)	No	Manual
pypdf + custom code	~12ms+	No (manual)	No	No

PDF Oxide is the only Python PDF library with built-in, fast Markdown conversion. It detects headings from font-size clustering, converts tables to GitHub Flavored Markdown syntax, and preserves inline formatting – all in a single to_markdown() call.

pymupdf4llm requires PyMuPDF (which is AGPL-licensed) plus an additional pymupdf4llm package on top. It is 69 times slower than PDF Oxide and carries copyleft license obligations that may be incompatible with proprietary applications.

marker is a standalone tool, not a library. It uses deep learning models for layout detection, which makes it accurate on complex layouts but orders of magnitude slower. It also requires significant GPU memory for best performance.

pdfplumber and pypdf do not offer Markdown conversion at all. You would need to write custom code to detect headings, reconstruct tables, and format the output as Markdown – a substantial engineering effort to replicate what PDF Oxide provides out of the box.

Related Pages

Markdown Conversion API — full API reference
PDF for RAG Pipelines — complete RAG integration guide
Extract Text from PDF — plain text extraction
Batch Processing — parallel processing patterns
