Conversão para Markdown
O PDF Oxide converte páginas PDF em Markdown limpo e legível. O pipeline de conversão extrai spans de texto, os agrupa em linhas, consulta /StructTreeRoot para obter títulos e papéis de lista em PDFs marcados, detecta gutters de múltiplas colunas e quebras com ordem de leitura reversa em x, agrupa parágrafos e emite a sintaxe Markdown.
A partir da v0.3.36, para PDFs marcados o conversor lê StructRole(Heading(1..6) | ListItem | ListItemLabel | ListItemBody) diretamente de /StructTreeRoot em vez de rederivar os níveis de título a partir do tamanho da fonte. A informação de papel é propagada através de MCRs aninhadas (H1 → Span → MCR, LI → LBody → Span → MCR). Para documentos sem tags, vale o fallback geométrico: negrito + aumento de 5 % no tamanho promove a H4, e is_ordered_list_marker reconhece 1. / 12. / a) / iv. / A. ao mesmo tempo em que rejeita legendas de figura e anos.
Múltiplas colunas: spans na mesma baseline separados por > max(3 × font_size, 30 pt) são tratados como cross-column. Quebras de ordem de leitura com x decrescente (último span da col 1 → primeiro span da col 2) quebram parágrafos em vez de unir tudo em tokens sem sentido.
RTL: o reorder bidi fica desligado por padrão — o reorder incondicional visual→lógico anterior quebrava PDFs já em ordem lógica (o nome hebraico בנימין estava sendo invertido). Marcadores **bold** espúrios ao redor de glifos árabes contextuais são removidos. Chamadores cuja entrada esteja em ordem visual podem invocar text::bidi::reorder_visual_to_logical manualmente (Rust).
Imagens inline têm payload base64 limitado a 200 KB (adicionado na v0.3.36). Imagens acima do limite emitem um comentário HTML informando o tamanho original; use image_output_dir para gravá-las em disco.
Quick Example
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)
Node.js
const { PdfDocument } = require("pdf-oxide");
const doc = new PdfDocument("paper.pdf");
const md = doc.toMarkdown(0, { detectHeadings: true });
console.log(md);
doc.close();
Go
import pdfoxide "github.com/yfedoseev/pdf_oxide/go"
doc, _ := pdfoxide.Open("paper.pdf")
defer doc.Close()
md, _ := doc.ToMarkdown(0)
fmt.Println(md)
C#
using PdfOxide.Core;
using var doc = PdfDocument.Open("paper.pdf");
var md = doc.ToMarkdown(0);
Console.WriteLine(md);
WASM
const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdown(0, true);
console.log(md);
Rust
use pdf_oxide::PdfDocument;
use pdf_oxide::converters::ConversionOptions;
let mut doc = PdfDocument::open("paper.pdf")?;
let options = ConversionOptions { detect_headings: true, ..Default::default() };
let md = doc.to_markdown(0, &options)?;
println!("{}", md);
API Reference
to_markdown(page_index, ...) -> str
Convert a single page to Markdown.
Python Signature
doc.to_markdown(
page: int,
preserve_layout: bool = False,
detect_headings: bool = True,
include_images: bool = True,
image_output_dir: str | None = None,
embed_images: bool = True,
) -> str
JavaScript Signature
doc.toMarkdown(pageIndex, detectHeadings?, includeImages?, includeFormFields?) -> string
Rust Signature
pub fn to_markdown(
&mut self,
page_index: usize,
options: &ConversionOptions,
) -> Result<String>
| Parameter | Type | Default | Description |
|---|---|---|---|
page_index |
int / usize / number |
– | Zero-based page index |
preserve_layout |
bool |
false |
Preserve visual layout positioning |
detect_headings |
bool |
true |
Detect headings based on font size and weight |
include_images |
bool |
true |
Include images in output |
image_output_dir |
str / None |
None |
Directory to save extracted images (Python/Rust only). Unaffected by the 200 KB inline cap. |
embed_images |
bool |
true |
Embed images as base64 data URIs (Python/Rust only). Payloads over 200 KB emit a placeholder HTML comment noting the original size (v0.3.36). |
include_form_fields |
bool |
true |
Include form field values (Python/JS) |
Returns: Markdown string for the page.
to_markdown_all(...) -> str
Convert all pages to Markdown, separated by horizontal rules (---).
Python Signature
doc.to_markdown_all(
preserve_layout: bool = False,
detect_headings: bool = True,
include_images: bool = True,
image_output_dir: str | None = None,
embed_images: bool = True,
) -> str
JavaScript Signature
doc.toMarkdownAll(detectHeadings?, includeImages?, includeFormFields?) -> string
Rust Signature
pub fn to_markdown_all(
&mut self,
options: &ConversionOptions,
) -> Result<String>
| Parameter | Type | Default | Description |
|---|---|---|---|
preserve_layout |
bool |
false |
Preserve visual layout |
detect_headings |
bool |
true |
Detect headings |
include_images |
bool |
true |
Include images |
image_output_dir |
str / None |
None |
Image output directory |
embed_images |
bool |
true |
Embed images as base64 |
Returns: Markdown string for all pages joined with --- separators.
to_markdown_with_ocr(page_index, model_path, options) -> str
Convert a page to Markdown with OCR fallback for scanned pages. When the page has little or no extractable text, OCR is used to recognize text from the rendered page image. Requires the ocr feature.
| Parameter | Type | Description |
|---|---|---|
page_index |
usize |
Zero-based page index |
model_path |
&str |
Path to the OCR model files |
options |
&ConversionOptions |
Conversion options |
Rust
let mut doc = PdfDocument::open("scanned.pdf")?;
let options = ConversionOptions { detect_headings: true, ..Default::default() };
let md = doc.to_markdown_with_ocr(0, "/path/to/models", &options)?;
println!("{}", md);
ConversionOptions
The ConversionOptions struct controls all conversion behavior.
| Field | Type | Default | Description |
|---|---|---|---|
preserve_layout |
bool |
false |
Preserve visual layout with positioning |
detect_headings |
bool |
true |
Auto-detect headings from font size clusters |
extract_tables |
bool |
false |
Extract tables (experimental) |
include_images |
bool |
true |
Include images in output |
image_output_dir |
Option<String> |
None |
Save images to this directory |
embed_images |
bool |
true |
Embed images as base64 data URIs |
reading_order_mode |
ReadingOrderMode |
Auto |
How to determine reading order |
bold_marker_behavior |
BoldMarkerBehavior |
Conservative |
Bold marker application strategy |
How It Works
The Markdown conversion pipeline operates in several stages:
-
Text Extraction – Extracts
TextSpanobjects from the page content stream, capturing text, position, font, size, weight, and color. -
Character Clustering – Groups characters into words based on inter-character gaps, then words into lines based on vertical proximity.
-
Reading Order – Determines reading order using either the Tagged PDF structure tree (preferred) or a graph-based spatial analysis of text block positions.
-
Heading Detection – When
detect_headingsis enabled, clusters font sizes across the page to identify heading levels. Larger and bolder text is mapped to#,##,###headings. -
Formatting – Applies bold (
**text**) and italic (*text*) markers based on font weight and style metadata. -
Table Detection – Identifies tabular layouts using spatial analysis of aligned text blocks and emits GFM-style Markdown tables.
-
Whitespace Cleanup – Normalizes spacing, removes redundant blank lines, and ensures consistent paragraph breaks.
Advanced Examples
Convert entire PDF to a Markdown file
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("book.pdf")
md = doc.to_markdown_all(detect_headings=True)
with open("book.md", "w", encoding="utf-8") as f:
f.write(md)
Node.js
const fs = require("node:fs");
const doc = new PdfDocument("book.pdf");
const md = doc.toMarkdownAll();
fs.writeFileSync("book.md", md);
doc.close();
Go
doc, _ := pdfoxide.Open("book.pdf")
defer doc.Close()
md, _ := doc.ToMarkdownAll()
os.WriteFile("book.md", []byte(md), 0644)
C#
using var doc = PdfDocument.Open("book.pdf");
var md = doc.ToMarkdownAll();
File.WriteAllText("book.md", md);
WASM
const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdownAll(true);
writeFileSync("book.md", md);
doc.free();
Convert with images saved to a directory
use pdf_oxide::PdfDocument;
use pdf_oxide::converters::ConversionOptions;
let mut doc = PdfDocument::open("report.pdf")?;
let options = ConversionOptions {
detect_headings: true,
include_images: true,
embed_images: false,
image_output_dir: Some("output/images".to_string()),
..Default::default()
};
let md = doc.to_markdown_all(&options)?;
std::fs::write("output/report.md", &md)?;
Page-by-page conversion with progress
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
pages = doc.page_count()
parts = []
for i in range(pages):
md = doc.to_markdown(i, detect_headings=True)
parts.append(md)
print(f"Converted page {i + 1}/{pages}")
full_md = "\n\n---\n\n".join(parts)
with open("report.md", "w") as f:
f.write(full_md)
Disable heading detection for flat text
doc = PdfDocument("form.pdf")
md = doc.to_markdown(0, detect_headings=False)
# All text rendered as paragraphs, no # headings
Páginas relacionadas
- Extração de texto — extração de texto cru e spans
- Conversão para HTML — converter para HTML em vez de Markdown
- Extração de imagens — extrair imagens separadamente