Markdown Conversion

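PDF Oxide converts PDF pages into clean, readable Markdown. The conversion pipeline extracts text spans, clusters them into lines, queries /StructTreeRoot on Tagged PDFs for heading and list roles, detects multi-column gaps and reverse-x reading-order wraps, infers paragraphs, and emits Markdown syntax.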

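Since v0.3.36, for Tagged PDFs the converter reads StructRole (Heading(1..6) | ListItem | ListItemLabel | ListItemBody) directly from /StructTreeRoot instead of inferring heading levels from font size. Role information propagates down through nested MCRs (H1 → Span → MCR; LI → LBody → Span → MCR). Untagged documents still use a geometric fallback: bold text with a 5 % font-size increase is promoted to H4, and is_ordered_list_marker recognizes 1. / 12. / a) / iv. / A. while excluding figure and table captions and years.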
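The marker rules above can be illustrated with a minimal sketch. This is a hypothetical re-implementation, not the library's is_ordered_list_marker: the function name, the three-digit cutoff used to reject years, and the Roman-numeral pattern are all assumptions, and caption filtering (which needs the surrounding line) is left out.

```python
import re

# Hypothetical stand-in for illustration only; the real
# is_ordered_list_marker in PDF Oxide may apply different rules.
_ROMAN = re.compile(
    r"^(?=[ivxlcdm]+$)m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$",
    re.IGNORECASE,
)


def looks_like_ordered_marker(token: str) -> bool:
    """Return True for tokens such as '1.', '12.', 'a)', 'iv.' or 'A.'."""
    # A marker is a short body followed by '.' or ')'.
    if len(token) < 2 or token[-1] not in ".)":
        return False
    body = token[:-1]
    if body.isdigit():
        # Assumption: at most three digits, so "2024." is read as a year.
        return len(body) <= 3
    if len(body) == 1 and body.isascii() and body.isalpha():
        return True  # single-letter markers: a) / A.
    return bool(_ROMAN.match(body))  # Roman numerals: iv. / XII)
```

With these rules, `looks_like_ordered_marker("iv.")` is accepted while `looks_like_ordered_marker("2024.")` is rejected as a probable year.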

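Multi-column handling: spans on the same baseline that are separated by more than max(3 × font_size, 30 pt) are treated as belonging to different columns. A reverse-x reading-order wrap (a column-major jump from the last span back to the first) breaks the paragraph instead of being spliced into meaningless tokens.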
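As a worked example of the gap rule, the sketch below computes the threshold max(3 × font_size, 30 pt) and applies it to two spans on one baseline. The function names and the coordinate parameters are hypothetical; only the formula comes from the text above.

```python
def column_gap_threshold(font_size: float) -> float:
    """Gap (in points) above which two same-baseline spans count as separate columns."""
    return max(3.0 * font_size, 30.0)


def is_column_break(prev_x_end: float, next_x_start: float, font_size: float) -> bool:
    """True when the horizontal gap between two spans exceeds the threshold."""
    gap = next_x_start - prev_x_end
    return gap > column_gap_threshold(font_size)
```

For 9 pt text the 30 pt floor applies (3 × 9 = 27 < 30), so a 31 pt gap is a column break; for 12 pt text the threshold rises to 36 pt, so a 35 pt gap is not.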

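RTL: bidirectional reordering is off by default. The earlier unconditional visual → logical reordering corrupted PDFs that were already in logical order (the Hebrew בנימין came out reversed). Spurious **bold** markers near Arabic contextual glyph forms are stripped. If the input is in visual order, callers can invoke text::bidi::reorder_visual_to_logical (Rust) manually.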

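Inline images: the base64 payload is capped at 200 KB (new in v0.3.36). An image over the limit is replaced by an HTML comment that records its original size; to write images to disk, use image_output_dir instead.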
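The cap can be pictured with a short sketch. Everything here is an assumption made for illustration: the helper name, the 1 KB = 1024 bytes reading of the limit, and the wording of the HTML comment are not taken from the library.

```python
import base64

# Assumption: "200 KB" is read as 200 * 1024 bytes of base64 text.
MAX_INLINE_BASE64_BYTES = 200 * 1024


def inline_image_markdown(image_bytes: bytes, alt: str = "image") -> str:
    """Embed an image as a data URI, or emit a comment when it is too large."""
    payload = base64.b64encode(image_bytes).decode("ascii")
    if len(payload) > MAX_INLINE_BASE64_BYTES:
        # Hypothetical comment format; it only needs to record the original size.
        return f"<!-- image omitted: {len(image_bytes)} bytes exceeds inline limit -->"
    return f"![{alt}](data:image/png;base64,{payload})"
```

Because base64 expands data by roughly a third, the limit is checked on the encoded payload rather than on the raw bytes.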

Quick Example

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)

Node.js

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("paper.pdf");
const md = doc.toMarkdown(0, { detectHeadings: true });
console.log(md);
doc.close();

Go

import pdfoxide "github.com/yfedoseev/pdf_oxide/go"

doc, _ := pdfoxide.Open("paper.pdf")
defer doc.Close()
md, _ := doc.ToMarkdown(0)
fmt.Println(md)

C#

using PdfOxide.Core;

using var doc = PdfDocument.Open("paper.pdf");
var md = doc.ToMarkdown(0);
Console.WriteLine(md);

WASM

const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdown(0, true);
console.log(md);
doc.free();

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::converters::ConversionOptions;

let mut doc = PdfDocument::open("paper.pdf")?;
let options = ConversionOptions { detect_headings: true, ..Default::default() };
let md = doc.to_markdown(0, &options)?;
println!("{}", md);

API Reference

to_markdown(page_index, ...) -> str

Convert a single page to Markdown.

Python Signature

doc.to_markdown(
    page: int,
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
) -> str

JavaScript Signature

doc.toMarkdown(pageIndex, detectHeadings?, includeImages?, includeFormFields?) -> string

Rust Signature

pub fn to_markdown(
    &mut self,
    page_index: usize,
    options: &ConversionOptions,
) -> Result<String>

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| page_index | int / usize / number | (required) | Zero-based page index |
| preserve_layout | bool | false | Preserve visual layout positioning |
| detect_headings | bool | true | Detect headings based on font size and weight |
| include_images | bool | true | Include images in output |
| image_output_dir | str / None | None | Directory to save extracted images (Python/Rust only) |
| embed_images | bool | true | Embed images as base64 data URIs (Python/Rust only) |
| include_form_fields | bool | true | Include form field values (Python/JS) |

Returns: Markdown string for the page.


to_markdown_all(...) -> str

Convert all pages to Markdown, separated by horizontal rules (---).

Python Signature

doc.to_markdown_all(
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
) -> str

JavaScript Signature

doc.toMarkdownAll(detectHeadings?, includeImages?, includeFormFields?) -> string

Rust Signature

pub fn to_markdown_all(
    &mut self,
    options: &ConversionOptions,
) -> Result<String>

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| preserve_layout | bool | false | Preserve visual layout |
| detect_headings | bool | true | Detect headings |
| include_images | bool | true | Include images |
| image_output_dir | str / None | None | Image output directory |
| embed_images | bool | true | Embed images as base64 |

Returns: Markdown string for all pages joined with --- separators.


to_markdown_with_ocr(page_index, model_path, options) -> str

Convert a page to Markdown with OCR fallback for scanned pages. When the page has little or no extractable text, OCR is used to recognize text from the rendered page image. Requires the ocr feature.

| Parameter | Type | Description |
| --- | --- | --- |
| page_index | usize | Zero-based page index |
| model_path | &str | Path to the OCR model files |
| options | &ConversionOptions | Conversion options |

Rust

let mut doc = PdfDocument::open("scanned.pdf")?;
let options = ConversionOptions { detect_headings: true, ..Default::default() };
let md = doc.to_markdown_with_ocr(0, "/path/to/models", &options)?;
println!("{}", md);
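The "little or no extractable text" condition can be approximated with a simple check on the text layer. This is only a sketch of the idea: the helper name and the 20-character threshold are hypothetical, and the library's actual criterion is not documented here.

```python
def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Guess whether a page is scanned by counting non-whitespace characters.

    min_chars is an illustrative threshold, not a value used by PDF Oxide.
    """
    visible = "".join(extracted_text.split())  # drop all whitespace
    return len(visible) < min_chars
```

A caller could run a check like this on a page's extracted text to decide between plain conversion and the slower OCR path.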

ConversionOptions

The ConversionOptions struct controls all conversion behavior.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| preserve_layout | bool | false | Preserve visual layout with positioning |
| detect_headings | bool | true | Auto-detect headings from font size clusters |
| extract_tables | bool | false | Extract tables (experimental) |
| include_images | bool | true | Include images in output |
| image_output_dir | Option<String> | None | Save images to this directory |
| embed_images | bool | true | Embed images as base64 data URIs |
| reading_order_mode | ReadingOrderMode | Auto | How to determine reading order |
| bold_marker_behavior | BoldMarkerBehavior | Conservative | Bold marker application strategy |

How It Works

The Markdown conversion pipeline operates in several stages:

  1. Text Extraction – Extracts TextSpan objects from the page content stream, capturing text, position, font, size, weight, and color.

  2. Character Clustering – Groups characters into words based on inter-character gaps, then words into lines based on vertical proximity.

  3. Reading Order – Determines reading order using either the Tagged PDF structure tree (preferred) or a graph-based spatial analysis of text block positions.

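  4. Heading Detection – When detect_headings is enabled, Tagged PDFs take heading levels from the roles in the structure tree. For untagged pages, the converter falls back to font geometry, clustering font sizes across the page so that larger and bolder text maps to #, ##, ### headings.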

  5. Formatting – Applies bold (**text**) and italic (*text*) markers based on font weight and style metadata.

  6. Table Detection – Identifies tabular layouts using spatial analysis of aligned text blocks and emits GFM-style Markdown tables.

  7. Whitespace Cleanup – Normalizes spacing, removes redundant blank lines, and ensures consistent paragraph breaks.
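Stage 2 of the list above (grouping into lines by vertical proximity) can be sketched as follows. The Span type, its fields, and the half-font-size tolerance are assumptions made for this example, not the library's TextSpan or its actual thresholds.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical simplified span; PDF Oxide's TextSpan carries more fields."""
    text: str
    x: float          # left edge, in points
    y: float          # baseline; in PDF coordinates larger y is higher on the page
    font_size: float


def group_into_lines(spans: list[Span], tolerance_ratio: float = 0.5) -> list[str]:
    """Cluster spans into lines by baseline, then order each line left to right."""
    lines: list[list[Span]] = []
    # Walk the page top to bottom (descending y).
    for span in sorted(spans, key=lambda s: -s.y):
        # Same line if the baseline is within a fraction of the font size
        # of the first span already placed on the current line.
        if lines and abs(lines[-1][0].y - span.y) <= tolerance_ratio * span.font_size:
            lines[-1].append(span)
        else:
            lines.append([span])
    return [" ".join(s.text for s in sorted(line, key=lambda s: s.x)) for line in lines]
```

Comparing against the first span of each line keeps small baseline drift from chaining unrelated lines together; a production implementation would also need to handle superscripts and rotated text.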


Advanced Examples

Convert entire PDF to a Markdown file

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("book.pdf")
md = doc.to_markdown_all(detect_headings=True)

with open("book.md", "w", encoding="utf-8") as f:
    f.write(md)

Node.js

const fs = require("node:fs");
const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("book.pdf");
const md = doc.toMarkdownAll();
fs.writeFileSync("book.md", md);
doc.close();

Go

doc, _ := pdfoxide.Open("book.pdf")
defer doc.Close()
md, _ := doc.ToMarkdownAll()
os.WriteFile("book.md", []byte(md), 0644)

C#

using var doc = PdfDocument.Open("book.pdf");
var md = doc.ToMarkdownAll();
File.WriteAllText("book.md", md);

WASM

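// writeFileSync comes from Node's fs module; adjust the import style to your bundler.
const { writeFileSync } = require("node:fs");

const doc = new WasmPdfDocument(bytes);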
const md = doc.toMarkdownAll(true);
writeFileSync("book.md", md);
doc.free();

Convert with images saved to a directory

use pdf_oxide::PdfDocument;
use pdf_oxide::converters::ConversionOptions;

let mut doc = PdfDocument::open("report.pdf")?;
let options = ConversionOptions {
    detect_headings: true,
    include_images: true,
    embed_images: false,
    image_output_dir: Some("output/images".to_string()),
    ..Default::default()
};

let md = doc.to_markdown_all(&options)?;
std::fs::write("output/report.md", &md)?;

Page-by-page conversion with progress

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
pages = doc.page_count()

parts = []
for i in range(pages):
    md = doc.to_markdown(i, detect_headings=True)
    parts.append(md)
    print(f"Converted page {i + 1}/{pages}")

full_md = "\n\n---\n\n".join(parts)
with open("report.md", "w", encoding="utf-8") as f:
    f.write(full_md)

Disable heading detection for flat text

from pdf_oxide import PdfDocument

doc = PdfDocument("form.pdf")
md = doc.to_markdown(0, detect_headings=False)
# All text rendered as paragraphs, no # headings