Markdown Conversion
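PDF Oxide converts PDF pages into clean, readable Markdown. The conversion pipeline extracts text spans and groups them into lines, reads heading and list roles directly from /StructTreeRoot in Tagged PDFs, detects multi-column gutters and reading-order wraps signaled by a backward jump in x position, groups lines into paragraphs, and emits Markdown syntax.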
Since v0.3.36, the converter for Tagged PDFs reads StructRole (Heading(1..6) | ListItem | ListItemLabel | ListItemBody) directly from /StructTreeRoot instead of inferring heading levels from font size. Role information propagates through nested MCRs (H1 → Span → MCR, LI → LBody → Span → MCR). Untagged documents still use the geometric fallback: bold text with a 5% size increase is promoted to H4, and is_ordered_list_marker recognizes 1. / 12. / a) / iv. / A. while rejecting figure captions and years.
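The geometric fallback for untagged documents can be pictured with a small sketch. This is not the library's implementation: `looks_like_ordered_list_marker` and `promote_to_h4` are hypothetical names, the regular expression is only one plausible way to accept the marker shapes listed above, and measuring the 5% increase against the body font size is an assumption.

```python
import re

# Hypothetical re-creation of the idea; the library's real rules may differ.
_MARKER = re.compile(
    r"""^(?:
        \d{1,3}        # 1, 12, 100 (a four-digit number looks like a year)
      | [ivxlcdm]+     # lowercase roman numerals such as iv or xii (order not validated)
      | [A-Za-z]       # a single letter such as a or A
    )[.)]$""",
    re.VERBOSE,
)


def looks_like_ordered_list_marker(token: str) -> bool:
    """Return True when `token` resembles a list marker such as '1.' or 'a)'."""
    return bool(_MARKER.match(token.strip()))


def promote_to_h4(is_bold: bool, font_size: float, body_size: float) -> bool:
    """Assumed reading of the rule: bold text at least 5% larger than body text."""
    return is_bold and font_size >= body_size * 1.05
```

Limiting numeric markers to three digits is what keeps years such as `2024.` out in this sketch, and a word like `Fig.` fails because it is neither a single letter nor a roman numeral.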
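Multi-column handling: spans on the same baseline that are more than max(3 × font_size, 30 pt) apart are treated as a column gap. A wrap with reversed x order (from the last span of one column to the first span of the next in a column-major layout) breaks the paragraph instead of joining it.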
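The gap threshold is simple enough to write out. The sketch below restates the rule from the paragraph above; the function name and the coordinate parameters are illustrative and not part of the PDF Oxide API.

```python
def is_column_gap(prev_right_x: float, next_left_x: float, font_size: float) -> bool:
    """True when two same-baseline spans are separated by a column gutter.

    The threshold is max(3 * font_size, 30 pt): small fonts still need at
    least 30 pt of white space, larger fonts need proportionally more.
    """
    gap = next_left_x - prev_right_x
    return gap > max(3.0 * font_size, 30.0)
```

For 8 pt text the floor of 30 pt applies, so a 25 pt gap stays within one column. For 12 pt text the threshold rises to 36 pt.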
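RTL: bidi reordering is off by default. The visual-to-logical reordering that used to run unconditionally broke PDFs that were already in logical order (for example, the Hebrew name בנימין came out reversed). Spurious **bold** markers around Arabic contextual glyphs are also removed. Callers whose input is in visual order can call text::bidi::reorder_visual_to_logical directly from Rust.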
Inline image base64 payloads are capped at 200 KB (added in v0.3.36). An image over the limit is replaced by an HTML comment that records its original size. To write images to files instead, use image_output_dir.
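As a rough illustration of the cap, the following sketch either emits a data URI or falls back to an HTML comment. It is an approximation under stated assumptions: `image_markdown` is a hypothetical helper, "200 KB" is read as 200 × 1024 characters of base64 payload, and the exact comment text the library writes is not specified here.

```python
import base64

# Assumption: the 200 KB limit applies to the base64 text, counted as 200 * 1024.
MAX_INLINE_PAYLOAD = 200 * 1024


def image_markdown(png_bytes: bytes, alt: str = "image") -> str:
    """Return inline Markdown for a PNG, or an HTML comment when it is too large."""
    payload = base64.b64encode(png_bytes).decode("ascii")
    if len(payload) > MAX_INLINE_PAYLOAD:
        # Record the original size so readers know an image was dropped.
        return f"<!-- image omitted: {len(png_bytes)} bytes exceeds inline limit -->"
    return f"![{alt}](data:image/png;base64,{payload})"
```

Because base64 expands data by about a third, a raw image of roughly 150 KB already reaches the cap in this sketch.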
Quick Example
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)
Node.js
const { PdfDocument } = require("pdf-oxide");
const doc = new PdfDocument("paper.pdf");
const md = doc.toMarkdown(0, { detectHeadings: true });
console.log(md);
doc.close();
Go
import pdfoxide "github.com/yfedoseev/pdf_oxide/go"
doc, _ := pdfoxide.Open("paper.pdf")
defer doc.Close()
md, _ := doc.ToMarkdown(0)
fmt.Println(md)
C#
using PdfOxide.Core;
using var doc = PdfDocument.Open("paper.pdf");
var md = doc.ToMarkdown(0);
Console.WriteLine(md);
WASM
const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdown(0, true);
console.log(md);
Rust
use pdf_oxide::PdfDocument;
use pdf_oxide::converters::ConversionOptions;
let mut doc = PdfDocument::open("paper.pdf")?;
let options = ConversionOptions { detect_headings: true, ..Default::default() };
let md = doc.to_markdown(0, &options)?;
println!("{}", md);
API Reference
to_markdown(page_index, ...) -> str
Convert a single page to Markdown.
Python Signature
doc.to_markdown(
    page: int,
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
) -> str
JavaScript Signature
doc.toMarkdown(pageIndex, detectHeadings?, includeImages?, includeFormFields?) -> string
Rust Signature
pub fn to_markdown(
    &mut self,
    page_index: usize,
    options: &ConversionOptions,
) -> Result<String>
| Parameter | Type | Default | Description |
|---|---|---|---|
| `page_index` | `int / usize / number` | – | Zero-based page index |
| `preserve_layout` | `bool` | `false` | Preserve visual layout positioning |
| `detect_headings` | `bool` | `true` | Detect headings based on font size and weight |
| `include_images` | `bool` | `true` | Include images in output |
| `image_output_dir` | `str / None` | `None` | Directory to save extracted images (Python/Rust only) |
| `embed_images` | `bool` | `true` | Embed images as base64 data URIs (Python/Rust only) |
| `include_form_fields` | `bool` | `true` | Include form field values (Python/JS) |
Returns: Markdown string for the page.
to_markdown_all(...) -> str
Convert all pages to Markdown, separated by horizontal rules (---).
Python Signature
doc.to_markdown_all(
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
) -> str
JavaScript Signature
doc.toMarkdownAll(detectHeadings?, includeImages?, includeFormFields?) -> string
Rust Signature
pub fn to_markdown_all(
    &mut self,
    options: &ConversionOptions,
) -> Result<String>
| Parameter | Type | Default | Description |
|---|---|---|---|
| `preserve_layout` | `bool` | `false` | Preserve visual layout |
| `detect_headings` | `bool` | `true` | Detect headings |
| `include_images` | `bool` | `true` | Include images |
| `image_output_dir` | `str / None` | `None` | Image output directory |
| `embed_images` | `bool` | `true` | Embed images as base64 |
Returns: Markdown string for all pages joined with --- separators.
to_markdown_with_ocr(page_index, model_path, options) -> str
Convert a page to Markdown with OCR fallback for scanned pages. When the page has little or no extractable text, OCR is used to recognize text from the rendered page image. Requires the ocr feature.
| Parameter | Type | Description |
|---|---|---|
| `page_index` | `usize` | Zero-based page index |
| `model_path` | `&str` | Path to the OCR model files |
| `options` | `&ConversionOptions` | Conversion options |
Rust
let mut doc = PdfDocument::open("scanned.pdf")?;
let options = ConversionOptions { detect_headings: true, ..Default::default() };
let md = doc.to_markdown_with_ocr(0, "/path/to/models", &options)?;
println!("{}", md);
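The fallback decision described for `to_markdown_with_ocr` can be sketched as a simple threshold check. Everything here is illustrative: `text_or_ocr`, the `run_ocr` callback, and the 20-character cutoff are assumptions, since the source only says OCR is used when a page has "little or no extractable text".

```python
from typing import Callable


def text_or_ocr(extracted: str, run_ocr: Callable[[], str], min_chars: int = 20) -> str:
    """Prefer extracted text; call the OCR callback only for near-empty pages."""
    # Count non-whitespace characters so blank lines do not look like content.
    meaningful = sum(1 for ch in extracted if not ch.isspace())
    if meaningful >= min_chars:
        return extracted
    # Assumed cutoff reached: treat the page as scanned and recognize it instead.
    return run_ocr()
```

Passing OCR as a callback keeps the expensive page rendering and recognition step from running on pages that already have a text layer.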
ConversionOptions
The ConversionOptions struct controls all conversion behavior.
| Field | Type | Default | Description |
|---|---|---|---|
| `preserve_layout` | `bool` | `false` | Preserve visual layout with positioning |
| `detect_headings` | `bool` | `true` | Auto-detect headings from font size clusters |
| `extract_tables` | `bool` | `false` | Extract tables (experimental) |
| `include_images` | `bool` | `true` | Include images in output |
| `image_output_dir` | `Option<String>` | `None` | Save images to this directory |
| `embed_images` | `bool` | `true` | Embed images as base64 data URIs |
| `reading_order_mode` | `ReadingOrderMode` | `Auto` | How to determine reading order |
| `bold_marker_behavior` | `BoldMarkerBehavior` | `Conservative` | Bold marker application strategy |
How It Works
The Markdown conversion pipeline operates in several stages:
- Text Extraction – Extracts `TextSpan` objects from the page content stream, capturing text, position, font, size, weight, and color.
- Character Clustering – Groups characters into words based on inter-character gaps, then words into lines based on vertical proximity.
- Reading Order – Determines reading order using either the Tagged PDF structure tree (preferred) or a graph-based spatial analysis of text block positions.
- Heading Detection – When `detect_headings` is enabled, clusters font sizes across the page to identify heading levels. Larger and bolder text is mapped to `#`, `##`, `###` headings.
- Formatting – Applies bold (`**text**`) and italic (`*text*`) markers based on font weight and style metadata.
- Table Detection – Identifies tabular layouts using spatial analysis of aligned text blocks and emits GFM-style Markdown tables.
- Whitespace Cleanup – Normalizes spacing, removes redundant blank lines, and ensures consistent paragraph breaks.
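To make the heading-detection stage concrete, here is a minimal sketch of font-size clustering. It is a simplification, not PDF Oxide's algorithm: `assign_heading_levels` is a hypothetical function, lines are reduced to `(text, font_size)` pairs, font weight is ignored, and the body size is assumed to be whichever size covers the most characters.

```python
from collections import Counter


def assign_heading_levels(lines, max_level=3):
    """Map (text, font_size) pairs to Markdown lines with #, ##, ### prefixes."""
    if not lines:
        return []
    # Treat the size that covers the most characters as body text.
    weight = Counter()
    for text, size in lines:
        weight[round(size, 1)] += len(text)
    body_size = weight.most_common(1)[0][0]
    # Distinct sizes above body text, largest first, become levels 1..max_level.
    larger = sorted((s for s in weight if s > body_size), reverse=True)
    level_of = {s: i + 1 for i, s in enumerate(larger[:max_level])}
    out = []
    for text, size in lines:
        level = level_of.get(round(size, 1))
        out.append(f"{'#' * level} {text}" if level else text)
    return out
```

Given a 24 pt title, a 16 pt section name, and 10 pt body text, the sketch yields `# Title`, `## Section`, and unprefixed paragraphs. Sizes beyond the third-largest are left as plain text.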
Advanced Examples
Convert entire PDF to a Markdown file
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("book.pdf")
md = doc.to_markdown_all(detect_headings=True)
with open("book.md", "w", encoding="utf-8") as f:
    f.write(md)
Node.js
const fs = require("node:fs");
const doc = new PdfDocument("book.pdf");
const md = doc.toMarkdownAll();
fs.writeFileSync("book.md", md);
doc.close();
Go
doc, _ := pdfoxide.Open("book.pdf")
defer doc.Close()
md, _ := doc.ToMarkdownAll()
os.WriteFile("book.md", []byte(md), 0644)
C#
using var doc = PdfDocument.Open("book.pdf");
var md = doc.ToMarkdownAll();
File.WriteAllText("book.md", md);
WASM
const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdownAll(true);
writeFileSync("book.md", md);
doc.free();
Convert with images saved to a directory
use pdf_oxide::PdfDocument;
use pdf_oxide::converters::ConversionOptions;
let mut doc = PdfDocument::open("report.pdf")?;
let options = ConversionOptions {
    detect_headings: true,
    include_images: true,
    embed_images: false,
    image_output_dir: Some("output/images".to_string()),
    ..Default::default()
};
let md = doc.to_markdown_all(&options)?;
std::fs::write("output/report.md", &md)?;
Page-by-page conversion with progress
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
pages = doc.page_count()
parts = []
for i in range(pages):
    md = doc.to_markdown(i, detect_headings=True)
    parts.append(md)
    print(f"Converted page {i + 1}/{pages}")
full_md = "\n\n---\n\n".join(parts)
# Write UTF-8 explicitly so non-ASCII text survives on every platform.
with open("report.md", "w", encoding="utf-8") as f:
    f.write(full_md)
Disable heading detection for flat text
doc = PdfDocument("form.pdf")
md = doc.to_markdown(0, detect_headings=False)
# All text rendered as paragraphs, no # headings
Related Pages
- Text Extraction – Raw text and span extraction
- HTML Conversion – Convert to HTML instead of Markdown
- Image Extraction – Extract images separately