Skip to content

Markdown 转换

PDF Oxide 将 PDF 页面转换为清晰易读的 Markdown。转换流水线提取文本段,将其聚类成行,在带标签的 PDF 中从 /StructTreeRoot 获取标题和列表角色,检测多列间距和逆向 x 阅读顺序换行,对段落进行分组,最终输出 Markdown 语法。

v0.3.36 起,对于带标签的 PDF,转换器直接从 /StructTreeRoot 读取 StructRole(Heading(1..6) | ListItem | ListItemLabel | ListItemBody),而不再通过字体大小重新推导标题级别。角色信息通过嵌套 MCR 传播(H1 → Span → MCRLI → LBody → Span → MCR)。对于未标签文档,几何回退仍然适用:粗体 + 5% 大小提升可晋升为 H4,而 is_ordered_list_marker 能识别 1. / 12. / a) / iv. / A.,同时排除图注和年份。

多列处理: 基线相同但间距超过 > max(3 × font_size, 30 pt) 的文本段被视为跨列内容。逆向 x 阅读顺序换行(列优先的末尾→首段)会拆分段落,而不是将其拼接成无意义的字符串。

RTL: bidi 重排默认关闭——之前无条件的视觉顺序→逻辑顺序重排会破坏逻辑顺序 PDF(希伯来语 בנימין 被反转了)。阿拉伯语上下文字形周围多余的 **bold** 标记已被剔除。若输入为视觉顺序,调用方可手动调用 text::bidi::reorder_visual_to_logical(Rust)。

内联图片的 base64 载荷上限为 200 KB(v0.3.36 新增)。超过上限的图片会输出一条注明原始大小的 HTML 注释;使用 image_output_dir 可将图片写入磁盘。

快速示例

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)

Node.js

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("paper.pdf");
const md = doc.toMarkdown(0, { detectHeadings: true });
console.log(md);
doc.close();

Go

import pdfoxide "github.com/yfedoseev/pdf_oxide/go"

doc, _ := pdfoxide.Open("paper.pdf")
defer doc.Close()
md, _ := doc.ToMarkdown(0)
fmt.Println(md)

C#

using PdfOxide.Core;

using var doc = PdfDocument.Open("paper.pdf");
var md = doc.ToMarkdown(0);
Console.WriteLine(md);

WASM

const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdown(0, true);
console.log(md);

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::converters::ConversionOptions;

let mut doc = PdfDocument::open("paper.pdf")?;
let options = ConversionOptions { detect_headings: true, ..Default::default() };
let md = doc.to_markdown(0, &options)?;
println!("{}", md);

Java

import fyi.oxide.pdf.PdfDocument;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("paper.pdf"))) {
    String md = doc.toMarkdown(0);
    System.out.println(md);
}

Kotlin

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(java.nio.file.Path.of("paper.pdf")).use { doc ->
    val md = doc.toMarkdown(0)
    println(md)
}

Scala

import fyi.oxide.pdf.PdfDocument
import scala.util.Using

Using.resource(PdfDocument.open("paper.pdf")) { doc =>
  val md = doc.toMarkdown(0)
  println(md)
}

Clojure

(require '[pdf-oxide.core :as pdf])

(with-open [doc (pdf/open "paper.pdf")]
  (println (pdf/to-markdown doc 0)))

PHP

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('paper.pdf');
echo $doc->toMarkdown(0);
$doc->close();

Ruby

require 'pdf_oxide'

PdfOxide::PdfDocument.open('paper.pdf') do |doc|
  puts doc.to_markdown(0)
end

C++

#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("paper.pdf");
auto md = doc.to_markdown(0);
std::cout << md << std::endl;

Swift

import PdfOxide

let doc = try Document.open("paper.pdf")
let md = try doc.toMarkdown(0)
print(md)

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('paper.pdf');
final md = doc.toMarkdown(0);
print(md);

R

library(pdfoxide)

doc <- pdf_open("paper.pdf")
md <- pdf_to_markdown(doc, 0)
cat(md)

Julia

using PdfOxide

doc = open_document("paper.pdf")
md = to_markdown(doc, 0)
println(md)

Zig

const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("paper.pdf");
const md = try doc.toMarkdown(a, 0);
std.debug.print("{s}\n", .{md});

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"paper.pdf" error:&err];
NSString *md = [doc toMarkdown:0 error:&err];
NSLog(@"%@", md);

Elixir

{:ok, doc} = PdfOxide.open("paper.pdf")
{:ok, md} = PdfOxide.to_markdown(doc, 0)
IO.puts(md)

API 参考

to_markdown(page_index, ...) -> str

将单个页面转换为 Markdown。

Python Signature

doc.to_markdown(
    page: int,
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
) -> str

JavaScript Signature

doc.toMarkdown(pageIndex, detectHeadings?, includeImages?, includeFormFields?) -> string

Rust Signature

pub fn to_markdown(
    &mut self,
    page_index: usize,
    options: &ConversionOptions,
) -> Result<String>

Java Signature

String toMarkdown(int pageIndex)

Kotlin Signature

fun toMarkdown(pageIndex: Int): String

Scala Signature

def toMarkdown(pageIndex: Int): String

Clojure Signature

(pdf/to-markdown doc page-index) ; => String

PHP Signature

public function toMarkdown(int $pageIndex): string

Ruby Signature

doc.to_markdown(page_index) # => String

C++ Signature

std::string to_markdown(int page_index) const;

Swift Signature

func toMarkdown(_ pageIndex: Int) throws -> String

Dart Signature

String toMarkdown(int pageIndex)

R Signature

pdf_to_markdown(doc, page_index)  # character

Julia Signature

to_markdown(doc, page_index)::String

Zig Signature

pub fn toMarkdown(self: *Document, allocator: std.mem.Allocator, page_index: usize) ![]u8

Objective-C Signature

- (NSString *)toMarkdown:(NSInteger)pageIndex error:(NSError **)error;

Elixir Signature

PdfOxide.to_markdown(doc, page_index) :: {:ok, String.t()} | {:error, term()}
参数 类型 默认值 说明
page_index int / usize / number 从零开始的页面索引
preserve_layout bool false 保留视觉布局定位
detect_headings bool true 根据字体大小和粗细检测标题
include_images bool true 在输出中包含图片
image_output_dir str / None None 保存提取图片的目录(仅 Python/Rust)。不受 200 KB 内联上限影响。
embed_images bool true 将图片以 base64 数据 URI 嵌入(仅 Python/Rust)。超过 200 KB 的载荷会输出注明原始大小的占位 HTML 注释(v0.3.36)。
include_form_fields bool true 包含表单字段值(Python/JS)

返回值: 该页面的 Markdown 字符串。


to_markdown_all(...) -> str

将所有页面转换为 Markdown,各页面之间以水平线(---)分隔。

Python Signature

doc.to_markdown_all(
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
) -> str

JavaScript Signature

doc.toMarkdownAll(detectHeadings?, includeImages?, includeFormFields?) -> string

Rust Signature

pub fn to_markdown_all(
    &mut self,
    options: &ConversionOptions,
) -> Result<String>

Java Signature

String toMarkdown()  // no-arg overload = whole document

Kotlin Signature

fun toMarkdown(): String  // no-arg = whole document

Scala Signature

def toMarkdown(): String  // no-arg = whole document

Clojure Signature

(pdf/to-markdown doc) ; no page index = whole document => String

PHP Signature

public function toMarkdownAll(): string

Ruby Signature

doc.to_markdown # nil page index = whole document => String

C++ Signature

std::string to_markdown_all() const;

Swift Signature

func toMarkdownAll() throws -> String

Dart Signature

String toMarkdownAll()

R Signature

pdf_to_markdown_all(doc)  # character

Julia Signature

to_markdown_all(doc)::String

Zig Signature

pub fn toMarkdownAll(self: *Document, allocator: std.mem.Allocator) ![]u8

Objective-C Signature

- (NSString *)toMarkdownAllWithError:(NSError **)error;

Elixir Signature

PdfOxide.to_markdown_all(doc) :: {:ok, String.t()} | {:error, term()}
参数 类型 默认值 说明
preserve_layout bool false 保留视觉布局
detect_headings bool true 检测标题
include_images bool true 包含图片
image_output_dir str / None None 图片输出目录
embed_images bool true 将图片以 base64 嵌入

返回值: 所有页面以 --- 分隔符拼接而成的 Markdown 字符串。


to_markdown_with_ocr(page_index, model_path, options) -> str

将页面转换为 Markdown,并在扫描页面时使用 OCR 回退。当页面几乎没有可提取的文本时,会对渲染后的页面图像进行 OCR 识别。需要 ocr 特性。

参数 类型 说明
page_index usize 从零开始的页面索引
model_path &str OCR 模型文件的路径
options &ConversionOptions 转换选项

Rust

let mut doc = PdfDocument::open("scanned.pdf")?;
let options = ConversionOptions { detect_headings: true, ..Default::default() };
let md = doc.to_markdown_with_ocr(0, "/path/to/models", &options)?;
println!("{}", md);

ConversionOptions

ConversionOptions 结构体控制所有转换行为。

字段 类型 默认值 说明
preserve_layout bool false 保留带定位信息的视觉布局
detect_headings bool true 从字体大小聚类中自动检测标题
extract_tables bool false 提取表格(实验性)
include_images bool true 在输出中包含图片
image_output_dir Option<String> None 将图片保存到该目录
embed_images bool true 将图片以 base64 数据 URI 嵌入
reading_order_mode ReadingOrderMode Auto 阅读顺序的确定方式
bold_marker_behavior BoldMarkerBehavior Conservative 粗体标记的应用策略

工作原理

Markdown 转换流水线分多个阶段执行:

  1. 文本提取 – 从页面内容流中提取 TextSpan 对象,捕获文本、位置、字体、大小、字重和颜色。

  2. 字符聚类 – 根据字符间距将字符归组为单词,再根据垂直距离将单词归组为行。

  3. 阅读顺序 – 使用带标签 PDF 的结构树(优先)或文本块位置的图形化空间分析来确定阅读顺序。

  4. 标题检测 – 启用 detect_headings 时,对整页字体大小进行聚类以识别标题级别。更大更粗的文本会映射到 ###### 标题。

  5. 格式化 – 根据字体字重和样式元数据应用粗体(**text**)和斜体(*text*)标记。

  6. 表格检测 – 通过对对齐文本块的空间分析识别表格布局,并输出 GFM 风格的 Markdown 表格。

  7. 空白清理 – 规范化间距,删除冗余空行,确保段落分隔的一致性。


进阶示例

将整个 PDF 转换为 Markdown 文件

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("book.pdf")
md = doc.to_markdown_all(detect_headings=True)

with open("book.md", "w", encoding="utf-8") as f:
    f.write(md)

Node.js

const fs = require("node:fs");

const doc = new PdfDocument("book.pdf");
const md = doc.toMarkdownAll();
fs.writeFileSync("book.md", md);
doc.close();

Go

doc, _ := pdfoxide.Open("book.pdf")
defer doc.Close()
md, _ := doc.ToMarkdownAll()
os.WriteFile("book.md", []byte(md), 0644)

C#

using var doc = PdfDocument.Open("book.pdf");
var md = doc.ToMarkdownAll();
File.WriteAllText("book.md", md);

WASM

const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdownAll(true);
writeFileSync("book.md", md);
doc.free();

Java

import fyi.oxide.pdf.PdfDocument;
import java.nio.file.*;

try (PdfDocument doc = PdfDocument.open(Path.of("book.pdf"))) {
    String md = doc.toMarkdown();
    Files.writeString(Path.of("book.md"), md);
}

Kotlin

import fyi.oxide.pdf.PdfDocument
import java.nio.file.*

PdfDocument.open(Path.of("book.pdf")).use { doc ->
    Files.writeString(Path.of("book.md"), doc.toMarkdown())
}

Scala

import fyi.oxide.pdf.PdfDocument
import java.nio.file.{Files, Path}
import scala.util.Using

Using.resource(PdfDocument.open("book.pdf")) { doc =>
  Files.writeString(Path.of("book.md"), doc.toMarkdown())
}

Clojure

(require '[pdf-oxide.core :as pdf]
         '[clojure.java.io :as io])

(with-open [doc (pdf/open "book.pdf")]
  (spit "book.md" (pdf/to-markdown doc)))

PHP

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('book.pdf');
file_put_contents('book.md', $doc->toMarkdownAll());
$doc->close();

Ruby

require 'pdf_oxide'

PdfOxide::PdfDocument.open('book.pdf') do |doc|
  File.write('book.md', doc.to_markdown)
end

C++

#include <pdf_oxide/pdf_oxide.hpp>
#include <fstream>

auto doc = pdf_oxide::Document::open("book.pdf");
auto md = doc.to_markdown_all();
std::ofstream("book.md") << md;

Swift

import PdfOxide

let doc = try Document.open("book.pdf")
let md = try doc.toMarkdownAll()
try md.write(toFile: "book.md", atomically: true, encoding: .utf8)

Dart

import 'dart:io';
import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('book.pdf');
final md = doc.toMarkdownAll();
File('book.md').writeAsStringSync(md);

R

library(pdfoxide)

doc <- pdf_open("book.pdf")
md <- pdf_to_markdown_all(doc)
writeLines(md, "book.md")

Julia

using PdfOxide

doc = open_document("book.pdf")
md = to_markdown_all(doc)
write("book.md", md)

Zig

const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("book.pdf");
const md = try doc.toMarkdownAll(a);
try std.fs.cwd().writeFile(.{ .sub_path = "book.md", .data = md });

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"book.pdf" error:&err];
NSString *md = [doc toMarkdownAllWithError:&err];
[md writeToFile:@"book.md" atomically:YES encoding:NSUTF8StringEncoding error:&err];

Elixir

{:ok, doc} = PdfOxide.open("book.pdf")
{:ok, md} = PdfOxide.to_markdown_all(doc)
File.write!("book.md", md)

将图片保存到目录并转换

use pdf_oxide::PdfDocument;
use pdf_oxide::converters::ConversionOptions;

let mut doc = PdfDocument::open("report.pdf")?;
let options = ConversionOptions {
    detect_headings: true,
    include_images: true,
    embed_images: false,
    image_output_dir: Some("output/images".to_string()),
    ..Default::default()
};

let md = doc.to_markdown_all(&options)?;
std::fs::write("output/report.md", &md)?;

逐页转换并显示进度

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
pages = doc.page_count()

parts = []
for i in range(pages):
    md = doc.to_markdown(i, detect_headings=True)
    parts.append(md)
    print(f"Converted page {i + 1}/{pages}")

full_md = "\n\n---\n\n".join(parts)
with open("report.md", "w") as f:
    f.write(full_md)

对纯文本禁用标题检测

doc = PdfDocument("form.pdf")
md = doc.to_markdown(0, detect_headings=False)
# All text rendered as paragraphs, no # headings

相关页面