Metadata & XMP

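PDF Oxide reads document-level metadata from several sources: the PDF file header (version number), the trailer and catalog dictionaries, the XMP metadata stream (ISO 16684), and page label definitions. XmpExtractor parses the Dublin Core, XMP Core, PDF, and XMP Rights namespaces, as well as arbitrary custom properties.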

Use version() and catalog() for basic document properties, XmpExtractor::extract() for rich metadata, and PageLabelExtractor for page numbering schemes.

Quick example

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
major, minor = doc.version()
print(f"PDF {major}.{minor}, {doc.page_count()} pages")

Node.js

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("report.pdf");
const { major, minor } = doc.getVersion();
console.log(`PDF ${major}.${minor}, ${doc.pageCount()} pages`);
doc.close();

Go

import pdfoxide "github.com/yfedoseev/pdf_oxide/go"

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
major, minor, _ := doc.Version()
pages, _ := doc.PageCount()
fmt.Printf("PDF %d.%d, %d pages\n", major, minor, pages)

C#

using PdfOxide.Core;

using var doc = PdfDocument.Open("report.pdf");
var (major, minor) = doc.Version;
Console.WriteLine($"PDF {major}.{minor}, {doc.PageCount} pages");

WASM

const doc = new WasmPdfDocument(bytes);
const version = doc.version();
console.log(`PDF ${version}, ${doc.pageCount()} pages`);

Rust

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("report.pdf")?;
let (major, minor) = doc.version();
println!("PDF {}.{}", major, minor);
println!("Pages: {}", doc.page_count()?);

PHP

use PdfOxide\PdfDocument;

$doc = PdfDocument::open("report.pdf");
$v = $doc->version(); // ['major' => int, 'minor' => int]
echo "PDF {$v['major']}.{$v['minor']}, {$doc->pageCount()} pages\n";
$doc->close();

Ruby

require "pdf_oxide"

PdfOxide::PdfDocument.open("report.pdf") do |doc|
  puts "PDF #{doc.pdf_version}, #{doc.page_count} pages"
end

C++

#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("report.pdf");
auto v = doc.version();
std::cout << "PDF " << static_cast<int>(v.major) << "."
          << static_cast<int>(v.minor) << ", " << doc.page_count() << " pages\n";

Swift

import PdfOxide

let doc = try Document.open("report.pdf")
let v = try doc.version()
print("PDF \(v.major).\(v.minor), \(try doc.pageCount()) pages")

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('report.pdf');
final v = doc.version;
print('PDF ${v.major}.${v.minor}, ${doc.pageCount} pages');
doc.close();

R

library(pdfoxide)

doc <- pdf_open("report.pdf")
v <- pdf_version(doc)
cat(sprintf("PDF %d.%d, %d pages\n", v$major, v$minor, pdf_page_count(doc)))

Julia

using PdfOxide

doc = open_document("report.pdf")
v = version(doc)
println("PDF $(v.major).$(v.minor), $(page_count(doc)) pages")

Zig

const pdf_oxide = @import("pdf_oxide");

var doc = try pdf_oxide.Document.open("report.pdf");
const v = doc.version();
std.debug.print("PDF {d}.{d}, {d} pages\n", .{ v.major, v.minor, try doc.pageCount() });

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
POXVersion v = [doc version];
printf("PDF %d.%d, %ld pages\n", v.major, v.minor, (long)[doc pageCountError:&err]);

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")
%{major: maj, minor: min} = PdfOxide.version(doc)
{:ok, pages} = PdfOxide.page_count(doc)
IO.puts("PDF #{maj}.#{min}, #{pages} pages")

API reference

version() -> (u8, u8)

Returns the PDF version number from the file header.

Returns: a (major, minor) tuple, for example (1, 7) for PDF 1.7 and (2, 0) for PDF 2.0.
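
For orientation, the version comes from the header comment at the start of the file, which has the form %PDF-M.N. The sketch below is a hypothetical standalone helper, not PDF Oxide's implementation: parse_pdf_header_version is an illustrative name, and the 1024-byte search window is an assumption made for this example.

```python
import re

# Matches a header such as b"%PDF-1.7".
_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def parse_pdf_header_version(data: bytes) -> tuple[int, int]:
    """Return (major, minor) from the %PDF-M.N header, or raise ValueError."""
    # Some files carry leading junk before the header, so search a small
    # window instead of requiring the header at offset 0 (sketch assumption).
    match = _HEADER.search(data[:1024])
    if match is None:
        raise ValueError("no %PDF-M.N header found")
    return int(match.group(1)), int(match.group(2))


print(parse_pdf_header_version(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"))  # (1, 7)
```

The tuple has the same shape as the (major, minor) value that version() returns.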


catalog() -> Result<Object>

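Returns the document catalog dictionary. The catalog is the root of the PDF object hierarchy and holds references to the page tree, outlines, name dictionary, and other document-level structures.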

Rust

let mut doc = PdfDocument::open("report.pdf")?;
let catalog = doc.catalog()?;
if let Some(dict) = catalog.as_dict() {
    for (key, _) in dict {
        println!("Catalog key: {}", key);
    }
}

trailer() -> &Object

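Returns the document trailer dictionary. The trailer holds the location of the cross-reference table, the document ID, the reference to the encryption dictionary, and the reference to the Info dictionary.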

Rust

let doc = PdfDocument::open("report.pdf")?;
let trailer = doc.trailer();
println!("Trailer: {:?}", trailer);

XmpExtractor::extract(doc) -> Result<Option<XmpMetadata>>

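Extracts XMP (Extensible Metadata Platform) metadata from the document's metadata stream. XMP uses standard XML namespaces and offers richer metadata than the legacy Info dictionary.

| Parameter | Type | Description |
| --- | --- | --- |
| doc | &mut PdfDocument | The PDF document |

Returns: Some(XmpMetadata) if XMP data is present, otherwise None.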

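XmpMetadata fields

Dublin Core namespace (dc:)

| Field | Type | Description |
| --- | --- | --- |
| dc_title | Option<String> | Document title |
| dc_creator | Vec<String> | List of authors/creators |
| dc_description | Option<String> | Document description |
| dc_subject | Vec<String> | Subject keywords |
| dc_language | Option<String> | Document language (e.g. "en-US") |
| dc_rights | Option<String> | Copyright statement |
| dc_format | Option<String> | MIME format (e.g. "application/pdf") |

XMP Core namespace (xmp:)

| Field | Type | Description |
| --- | --- | --- |
| xmp_creator_tool | Option<String> | Tool used to create the document |
| xmp_create_date | Option<String> | Creation date (ISO 8601) |
| xmp_modify_date | Option<String> | Last modification date |
| xmp_metadata_date | Option<String> | Metadata modification date |

PDF namespace (pdf:)

| Field | Type | Description |
| --- | --- | --- |
| pdf_producer | Option<String> | Application that produced the PDF |
| pdf_keywords | Option<String> | Keyword string |
| pdf_version | Option<String> | PDF version recorded in XMP (may differ from the file header) |
| pdf_trapped | Option<String> | Trapping status |

XMP Rights namespace (xmpRights:)

| Field | Type | Description |
| --- | --- | --- |
| xmp_rights_usage_terms | Option<String> | Usage terms |
| xmp_rights_marked | Option<bool> | Whether the document is marked as rights-managed |
| xmp_rights_web_statement | Option<String> | URL of the web copyright statement |

Other

| Field | Type | Description |
| --- | --- | --- |
| custom | HashMap<String, String> | Custom properties (namespace:property → value) |
| raw_xml | Option<String> | Raw XMP XML packet |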
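
To show how these fields relate to the underlying packet, here is a minimal sketch that reads a few of them from raw XMP XML (for example, the contents of raw_xml) with Python's standard library. It is not how XmpExtractor is implemented. The sample packet and the helper names parse_xmp, list_items, and simple_text are made up for illustration, and the sketch only handles properties serialized as child elements, not the attribute shorthand that XMP also allows.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for the prefixes used in the field tables.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "xmp": "http://ns.adobe.com/xap/1.0/",
    "pdf": "http://ns.adobe.com/pdf/1.3/",
}

# Made-up sample packet used only for this illustration.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:xmp="http://ns.adobe.com/xap/1.0/"
        xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
      <dc:title><rdf:Alt><rdf:li xml:lang="x-default">Annual Report</rdf:li></rdf:Alt></dc:title>
      <dc:creator><rdf:Seq><rdf:li>Ada</rdf:li><rdf:li>Grace</rdf:li></rdf:Seq></dc:creator>
      <xmp:CreatorTool>ExampleWriter 1.0</xmp:CreatorTool>
      <pdf:Producer>ExamplePDF</pdf:Producer>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def list_items(root, path):
    """Collect the rdf:li values of an array property such as dc:creator."""
    return [li.text or "" for li in root.findall(f".//{path}//rdf:li", NS)]


def simple_text(root, path):
    """Return the text of a simple property element, or None if absent."""
    node = root.find(f".//{path}", NS)
    return node.text if node is not None else None


def parse_xmp(xml_text):
    root = ET.fromstring(xml_text)
    titles = list_items(root, "dc:title")
    return {
        "dc_title": titles[0] if titles else None,
        "dc_creator": list_items(root, "dc:creator"),
        "xmp_creator_tool": simple_text(root, "xmp:CreatorTool"),
        "pdf_producer": simple_text(root, "pdf:Producer"),
    }


print(parse_xmp(SAMPLE_XMP))
```

dc:title is a language-alternative array and dc:creator is an ordered array, which is why both are read through their rdf:li items, while xmp:CreatorTool and pdf:Producer are plain text properties.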

Rust

use pdf_oxide::extractors::xmp::XmpExtractor;

let mut doc = PdfDocument::open("report.pdf")?;
if let Some(xmp) = XmpExtractor::extract(&mut doc)? {
    if let Some(title) = &xmp.dc_title {
        println!("Title: {}", title);
    }
    for creator in &xmp.dc_creator {
        println!("Author: {}", creator);
    }
    if let Some(tool) = &xmp.xmp_creator_tool {
        println!("Created with: {}", tool);
    }
    if let Some(date) = &xmp.xmp_create_date {
        println!("Created: {}", date);
    }
    if let Some(producer) = &xmp.pdf_producer {
        println!("Producer: {}", producer);
    }
}

WASM

const doc = new WasmPdfDocument(bytes);
const xmp = doc.xmpMetadata();

if (xmp) {
  console.log(`Title: ${xmp.dc_title}`);
  console.log(`Authors: ${xmp.dc_creator}`);
  console.log(`Created with: ${xmp.xmp_creator_tool}`);
  console.log(`Created: ${xmp.xmp_create_date}`);
  console.log(`Producer: ${xmp.pdf_producer}`);
}
doc.free();

Python

doc = PdfDocument("report.pdf")
xmp = doc.xmp_metadata()

if xmp:
    print(f"Title: {xmp.get('dc_title')}")
    print(f"Authors: {xmp.get('dc_creator')}")
    print(f"Created with: {xmp.get('xmp_creator_tool')}")
    print(f"Created: {xmp.get('xmp_create_date')}")
    print(f"Producer: {xmp.get('pdf_producer')}")

<!-- Node.js: no equivalent on PdfDocumentImpl — xmp metadata not exposed in js/src/index.ts -->

Go

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
xmp, _ := doc.XmpMetadata() // returns JSON string
fmt.Println(xmp)

C#

using var doc = PdfDocument.Open("report.pdf");
var xmp = doc.GetXmpMetadata(); // returns JSON string
Console.WriteLine(xmp);

C++

auto doc = pdf_oxide::Document::open("report.pdf");
std::string xmp = doc.get_xmp_metadata(); // raw XMP XML packet
std::cout << xmp << "\n";

Swift

let doc = try Document.open("report.pdf")
let xmp = try doc.xmpMetadata() // raw XMP XML packet
print(xmp)

Dart

final doc = PdfDocument.open('report.pdf');
final xmp = doc.getXmpMetadata(); // raw XMP XML packet
print(xmp);
doc.close();

R

doc <- pdf_open("report.pdf")
xmp <- pdf_get_xmp_metadata(doc) # XMP metadata as JSON
cat(xmp, "\n")

Julia

doc = open_document("report.pdf")
xmp = get_xmp_metadata(doc) # XMP metadata string
println(xmp)

Zig

var doc = try pdf_oxide.Document.open("report.pdf");
const xmp = try doc.xmpMetadata(a); // caller owns the slice
defer a.free(xmp);
std.debug.print("{s}\n", .{xmp});

Objective-C

POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
NSString *xmp = [doc xmpMetadataWithError:&err];
printf("%s\n", xmp.UTF8String);

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")
xmp = PdfOxide.xmp_metadata(doc) # XMP metadata as an XML/JSON string
IO.puts(xmp)

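Pdf convenience methods

The high-level Pdf API provides shortcuts for common metadata queries.

xmp_metadata() -> Result<Option<XmpMetadata>>

Returns the complete XMP metadata object.

xmp_title() -> Result<Option<String>>

Returns only the document title from XMP.

xmp_creators() -> Result<Vec<String>>

Returns the list of creators/authors from XMP.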

Rust

use pdf_oxide::api::Pdf;

let mut pdf = Pdf::open("report.pdf")?;

if let Some(title) = pdf.xmp_title()? {
    println!("Title: {}", title);
}

let creators = pdf.xmp_creators()?;
for creator in &creators {
    println!("Author: {}", creator);
}

PageLabelExtractor::extract(doc) -> Result<Vec<PageLabelRange>>

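Extracts page label definitions from the document. Page labels define how page numbers are displayed (for example, Roman numerals for front matter and Arabic numerals for the body).

| Parameter | Type | Description |
| --- | --- | --- |
| doc | &mut PdfDocument | The PDF document |

Returns: a vector of PageLabelRange definitions.

PageLabelRange fields

| Field | Type | Description |
| --- | --- | --- |
| start_page | usize | Page index at which the range starts |
| style | PageLabelStyle | Numbering style |
| prefix | Option<String> | Label prefix string |
| start_number | u32 | First number of the range |

PageLabelStyle variants

| Variant | Description | Example |
| --- | --- | --- |
| DecimalArabic | Arabic numerals | 1, 2, 3 |
| UppercaseRoman | Uppercase Roman numerals | I, II, III |
| LowercaseRoman | Lowercase Roman numerals | i, ii, iii |
| UppercaseLetters | Uppercase letters | A, B, C |
| LowercaseLetters | Lowercase letters | a, b, c |
| None | No numbering (prefix only) | |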
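
The way a set of ranges resolves to a label for a given page can be sketched as follows. This is an illustration under stated assumptions, not the library's code: LabelRange mirrors the PageLabelRange fields, the style strings ("decimal", "lower_roman", and so on) are names invented for this sketch, and the fallback for pages before the first range is a choice made here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LabelRange:
    # Mirrors the PageLabelRange fields; style strings are this sketch's own.
    start_page: int
    style: Optional[str]
    prefix: Optional[str] = None
    start_number: int = 1


_ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
          (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
          (5, "V"), (4, "IV"), (1, "I")]


def to_roman(n: int) -> str:
    parts = []
    for value, symbol in _ROMAN:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)


def to_letters(n: int) -> str:
    # A..Z for 1..26, then AA..ZZ for 27..52, and so on.
    return chr(ord("A") + (n - 1) % 26) * ((n - 1) // 26 + 1)


def format_label(ranges: list[LabelRange], page_index: int) -> str:
    # The applicable range is the one with the largest start_page that does
    # not exceed the page index.
    applicable = [r for r in ranges if r.start_page <= page_index]
    if not applicable:
        return str(page_index + 1)  # fallback chosen for this sketch
    r = max(applicable, key=lambda item: item.start_page)
    number = r.start_number + (page_index - r.start_page)
    numeral = {
        "decimal": str(number),
        "upper_roman": to_roman(number),
        "lower_roman": to_roman(number).lower(),
        "upper_letters": to_letters(number),
        "lower_letters": to_letters(number).lower(),
        None: "",  # prefix-only label
    }[r.style]
    return (r.prefix or "") + numeral


ranges = [LabelRange(0, "lower_roman"), LabelRange(3, "decimal")]
print([format_label(ranges, i) for i in range(5)])  # ['i', 'ii', 'iii', '1', '2']
```

Each range restarts numbering at its own start_number, which is how a document can move from i, ii, iii in the front matter to 1, 2 in the body.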

Pdf page label convenience methods

page_labels() -> Result<Vec<PageLabelRange>>

Returns all page label range definitions.

page_label(page) -> Result<String>

Returns the display label for the given page index.

Rust

use pdf_oxide::api::Pdf;

let mut pdf = Pdf::open("book.pdf")?;

// Get all label ranges
let ranges = pdf.page_labels()?;
for range in &ranges {
    println!(
        "Pages from {}: {:?} style, prefix={:?}, start={}",
        range.start_page, range.style, range.prefix, range.start_number
    );
}

// Get label for a specific page
let label = pdf.page_label(0)?;
println!("Page 0 label: {}", label); // e.g., "i" or "Cover"

WASM

const doc = new WasmPdfDocument(bytes);
const labels = doc.pageLabels();

for (const range of labels) {
  console.log(`Pages from ${range.start_page}: style=${range.style}, prefix=${range.prefix}`);
}
doc.free();

Python

doc = PdfDocument("book.pdf")
labels = doc.page_labels()

for label_range in labels:  # avoid shadowing the built-in range()
    print(f"Pages from {label_range['start_page']}: style={label_range['style']}, prefix={label_range['prefix']}")

<!-- Node.js: no equivalent on PdfDocumentImpl — pageLabels not exposed on class, only via properties mixin -->

Go

doc, _ := pdfoxide.Open("book.pdf")
defer doc.Close()
labels, _ := doc.PageLabels() // returns JSON string
fmt.Println(labels)

C#

using var doc = PdfDocument.Open("book.pdf");
var labels = doc.GetPageLabels(); // returns JSON string
Console.WriteLine(labels);

C++

auto doc = pdf_oxide::Document::open("book.pdf");
std::string labels = doc.get_page_labels(); // JSON string
std::cout << labels << "\n";

Swift

let doc = try Document.open("book.pdf")
let labels = try doc.pageLabels() // JSON string
print(labels)

Dart

final doc = PdfDocument.open('book.pdf');
final labels = doc.getPageLabels(); // JSON string
print(labels);
doc.close();

R

doc <- pdf_open("book.pdf")
labels <- pdf_get_page_labels(doc) # JSON string
cat(labels, "\n")

Julia

doc = open_document("book.pdf")
labels = get_page_labels(doc) # JSON string
println(labels)

Zig

var doc = try pdf_oxide.Document.open("book.pdf");
const labels = try doc.pageLabels(a); // JSON string; caller owns the slice
defer a.free(labels);
std.debug.print("{s}\n", .{labels});

Objective-C

POXDocument *doc = [POXDocument openPath:@"book.pdf" error:&err];
NSString *labels = [doc pageLabelsWithError:&err];
printf("%s\n", labels.UTF8String);

Elixir

{:ok, doc} = PdfOxide.open("book.pdf")
labels = PdfOxide.page_labels(doc) # JSON string
IO.puts(labels)

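get_producer: document producer

The producer is the tool that generated the PDF (/Info.Producer). The editor interface exposes it as a read-write accessor: read it through get_producer / Producer, and set it through the matching setter, which writes /Info.Producer on save. The XMP counterpart is the pdf_producer field of XmpMetadata described above.

The accessor is backed by the C ABI function document_editor_get_producer:

char *document_editor_get_producer(DocumentEditor *handle, int32_t *error_code);

Returns a caller-owned C string (release it with free_string), or null when no producer is set.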

Rust

use pdf_oxide::editor::DocumentEditor;

let mut editor = DocumentEditor::open("report.pdf")?;
if let Some(producer) = editor.producer()? {
    println!("Producer: {}", producer);
}

Go

import pdfoxide "github.com/yfedoseev/pdf_oxide/go"

editor, _ := pdfoxide.OpenEditor("report.pdf")
defer editor.Close()
producer, _ := editor.Producer()
fmt.Printf("Producer: %s\n", producer)

C#

using PdfOxide.Core;

using var editor = DocumentEditor.Open("report.pdf");
Console.WriteLine($"Producer: {editor.Producer}");

Swift

import PdfOxide

let editor = try DocumentEditor.open("report.pdf")
let producer = try editor.getProducer()
print("Producer: \(producer)")

PHP

use PdfOxide\DocumentEditor;

$editor = DocumentEditor::open("report.pdf");
echo "Producer: " . $editor->getProducer() . "\n";

C++

auto editor = pdf_oxide::DocumentEditor::open("report.pdf");
std::cout << "Producer: " << editor.get_producer() << "\n";

Dart

final editor = DocumentEditor.open('report.pdf');
print('Producer: ${editor.getProducer()}');

R

editor <- pdf_editor_open("report.pdf")
cat("Producer:", pdf_editor_get_producer(editor), "\n")

Julia

editor = open_editor("report.pdf")
println("Producer: ", get_producer(editor))

Zig

var editor = try pdf_oxide.Document.openEditor("report.pdf");
const producer = try editor.getProducer(a); // caller owns the slice
defer a.free(producer);
std.debug.print("Producer: {s}\n", .{producer});

Objective-C

POXDocumentEditor *editor = [POXDocumentEditor openEditor:@"report.pdf" error:&err];
NSString *producer = [editor producerError:&err];
printf("Producer: %s\n", producer.UTF8String);

Elixir

{:ok, editor} = PdfOxide.open_editor("report.pdf")
producer = PdfOxide.get_producer(editor)
IO.puts("Producer: #{producer}")

Binding coverage. get_producer lives on the editor (DocumentEditor), not on the read-only PdfDocument. It is exposed in Rust (editor.producer()), Go (editor.Producer()), C# (the editor.Producer property), Swift (editor.getProducer()), and the C ABI (document_editor_get_producer). The matching setter (set_producer / SetProducer / Producer = ...) persists the change on save. The accessor is not compiled for the WASM target.


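embedded_fonts: fonts used on a page

embedded_fonts lists the fonts referenced by a page's content stream, deriving each font's name, embedding status, and subset status from the page's text spans. (Subset fonts are detected through the standard naming rule of a six-letter prefix followed by +, as in ABCDEF+Helvetica.) It is backed by the C ABI function pdf_document_get_embedded_fonts and the pdf_oxide_font_* accessor family.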

Go

import pdfoxide "github.com/yfedoseev/pdf_oxide/go"

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()

fonts, _ := doc.Fonts(0) // []pdfoxide.Font
for _, f := range fonts {
    fmt.Printf("%s (%s) embedded=%v subset=%v\n",
        f.Name, f.Encoding, f.IsEmbedded, f.IsSubset)
}

Swift

import PdfOxide

let doc = try Document.open("report.pdf")
let fonts = try doc.embeddedFonts(0) // [Font]
for f in fonts {
    print("\(f.name) (\(f.encoding)) embedded=\(f.embedded) subset=\(f.subset)")
}

C ABI

#include "pdf_oxide.h"

int32_t err = 0;
FfiFontList *fonts = pdf_document_get_embedded_fonts(doc, /*page=*/0, &err);
int32_t n = pdf_oxide_font_count(fonts);
for (int32_t i = 0; i < n; i++) {
    char *name = pdf_oxide_font_get_name(fonts, i, &err);
    int32_t embedded = pdf_oxide_font_is_embedded(fonts, i, &err);
    int32_t subset = pdf_oxide_font_is_subset(fonts, i, &err);
    printf("%s embedded=%d subset=%d\n", name, embedded, subset);
    free_string(name);
}
pdf_oxide_font_list_free(fonts);

C++

auto doc = pdf_oxide::Document::open("report.pdf");
for (const auto& f : doc.embedded_fonts(0)) { // std::vector<Font>
    std::cout << f.name << " (" << f.encoding << ") embedded="
              << f.embedded << " subset=" << f.subset << "\n";
}

Dart

final doc = PdfDocument.open('report.pdf');
for (final f in doc.embeddedFonts(0)) { // List<Font>
  print('${f.name} (${f.encoding}) embedded=${f.embedded} subset=${f.subset}');
}
doc.close();

R

doc <- pdf_open("report.pdf")
for (f in pdf_embedded_fonts(doc, 0)) { # list of Font records
  cat(sprintf("%s (%s) embedded=%s subset=%s\n",
              f$name, f$encoding, f$embedded, f$subset))
}

Julia

doc = open_document("report.pdf")
for f in embedded_fonts(doc, 0) # Vector{Font}
    println("$(f.name) ($(f.encoding)) embedded=$(f.embedded) subset=$(f.subset)")
end

Zig

var doc = try pdf_oxide.Document.open("report.pdf");
const fonts = try doc.embeddedFonts(a, 0); // []Font
defer pdf_oxide.Document.freeFonts(a, fonts);
for (fonts) |f| {
    std.debug.print("{s} ({s}) embedded={} subset={}\n",
        .{ f.name, f.encoding, f.embedded, f.subset });
}

Objective-C

POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
for (POXFont *f in [doc embeddedFonts:0 error:&err]) {
    printf("%s (%s) embedded=%d subset=%d\n",
        f.name.UTF8String, f.encoding.UTF8String, f.embedded, f.subset);
}

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")
for f <- PdfOxide.embedded_fonts(doc, 0) do # list of %Font{}
  IO.puts("#{f.name} (#{f.encoding}) embedded=#{f.embedded} subset=#{f.subset}")
end

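Font accessor fields

| Field (Go / Swift) | Type | Description |
| --- | --- | --- |
| Name / name | string | Font resource name (e.g. "ABCDEF+Helvetica") |
| Type / type | string | Font subtype |
| Encoding / encoding | string | Font encoding |
| IsEmbedded / embedded | bool | Whether the font program is embedded |
| IsSubset / subset | bool | Whether the font is a subset |
| Size (Go) | float32 | Font size (when available) |

Binding coverage. embedded_fonts is exposed in Go (doc.Fonts(page)), Swift (doc.embeddedFonts(page)), and the C ABI (pdf_document_get_embedded_fonts). It is not compiled for the WASM target.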
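
The subset naming rule mentioned in this section (six uppercase letters followed by +) can be checked with a short pattern. The sketch below is a standalone illustration, not the library's detection code; split_subset_name is a hypothetical helper name.

```python
import re

# Six uppercase ASCII letters, a plus sign, then the base font name.
_SUBSET_TAG = re.compile(r"^[A-Z]{6}\+(.+)$")


def split_subset_name(font_name: str) -> tuple[bool, str]:
    """Return (is_subset, base_name) for a font name."""
    match = _SUBSET_TAG.match(font_name)
    if match:
        return True, match.group(1)
    return False, font_name


print(split_subset_name("ABCDEF+Helvetica"))  # (True, 'Helvetica')
print(split_subset_name("Helvetica"))         # (False, 'Helvetica')
```

Stripping the tag is handy when grouping fonts by family, since the six letters differ between subsets of the same font.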


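fonts_to_json: serialize page fonts

fonts_to_json serializes the entire font list returned by embedded_fonts into a JSON array in a single FFI call. The Go binding uses it internally to build []Font; Swift exposes it directly as fontsToJson. The C ABI signature:

char *pdf_oxide_fonts_to_json(const FfiFontList *fonts, int32_t *error_code);

The returned UTF-8 string is owned by the caller (release it with free_string). Its schema is: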

[{"name": "...", "type": "...", "encoding": "...",
  "isEmbedded": true, "isSubset": false, "size": 0}]

Swift

import PdfOxide

let doc = try Document.open("report.pdf")
let json = try doc.fontsToJson(0) // String of JSON
print(json)

C ABI

#include "pdf_oxide.h"

int32_t err = 0;
FfiFontList *fonts = pdf_document_get_embedded_fonts(doc, /*page=*/0, &err);
char *json = pdf_oxide_fonts_to_json(fonts, &err);
printf("%s\n", json);
free_string(json);
pdf_oxide_font_list_free(fonts);

C++

auto doc = pdf_oxide::Document::open("report.pdf");
std::string json = doc.fonts_to_json(0); // JSON array string
std::cout << json << "\n";

Dart

final doc = PdfDocument.open('report.pdf');
final json = doc.embeddedFontsJson(0); // JSON array string
print(json);
doc.close();

R

doc <- pdf_open("report.pdf")
json <- pdf_fonts_to_json(doc, 0) # JSON array string
cat(json, "\n")

Julia

doc = open_document("report.pdf")
json = fonts_to_json(doc, 0) # JSON array string
println(json)

Zig

var doc = try pdf_oxide.Document.open("report.pdf");
var fl = try doc.fontList(0); // owned FontList handle
defer fl.deinit();
const json = try fl.toJson(a); // JSON array string; caller owns the slice
defer a.free(json);
std.debug.print("{s}\n", .{json});

Objective-C

POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
NSString *json = [doc embeddedFontsJson:0 error:&err]; // JSON array string
printf("%s\n", json.UTF8String);

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")
json = PdfOxide.fonts_to_json(doc, 0) # JSON array string
IO.puts(json)

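Binding coverage. fonts_to_json is exposed directly in Swift (doc.fontsToJson(page)) and the C ABI (pdf_oxide_fonts_to_json); the Go binding calls it internally to decode doc.Fonts(page) into typed structs. It is not compiled for the WASM target.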
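
When a binding hands back only the JSON string, the caller decodes it. A minimal sketch, assuming the schema documented in this section: the Font dataclass and decode_fonts are hypothetical names, and the sample payload values are invented for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class Font:
    name: str
    type: str
    encoding: str
    is_embedded: bool
    is_subset: bool
    size: float


def decode_fonts(payload: str) -> list[Font]:
    """Decode a fonts_to_json-style JSON array into typed records."""
    return [
        Font(
            name=item["name"],
            type=item["type"],
            encoding=item["encoding"],
            is_embedded=bool(item["isEmbedded"]),
            is_subset=bool(item["isSubset"]),
            size=float(item.get("size", 0)),  # treated as optional here
        )
        for item in json.loads(payload)
    ]


# Invented sample payload that follows the documented schema.
sample = (
    '[{"name": "ABCDEF+Helvetica", "type": "Type1", '
    '"encoding": "WinAnsiEncoding", "isEmbedded": true, '
    '"isSubset": true, "size": 0}]'
)
fonts = decode_fonts(sample)
print(fonts[0].name, fonts[0].is_embedded, fonts[0].is_subset)
```

Mapping the camelCase JSON keys onto typed fields in one place keeps the rest of the calling code independent of the wire format.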


Advanced examples

Display full document metadata

use pdf_oxide::PdfDocument;
use pdf_oxide::extractors::xmp::XmpExtractor;

let mut doc = PdfDocument::open("report.pdf")?;

// Basic info
let (major, minor) = doc.version();
println!("PDF Version: {}.{}", major, minor);
println!("Pages: {}", doc.page_count()?);

// XMP metadata
if let Some(xmp) = XmpExtractor::extract(&mut doc)? {
    println!("\nXMP Metadata:");
    println!("  Title:       {:?}", xmp.dc_title);
    println!("  Authors:     {:?}", xmp.dc_creator);
    println!("  Description: {:?}", xmp.dc_description);
    println!("  Keywords:    {:?}", xmp.pdf_keywords);
    println!("  Creator:     {:?}", xmp.xmp_creator_tool);
    println!("  Producer:    {:?}", xmp.pdf_producer);
    println!("  Created:     {:?}", xmp.xmp_create_date);
    println!("  Modified:    {:?}", xmp.xmp_modify_date);
    println!("  Language:    {:?}", xmp.dc_language);
    println!("  Rights:      {:?}", xmp.dc_rights);

    if !xmp.custom.is_empty() {
        println!("\n  Custom properties:");
        for (key, value) in &xmp.custom {
            println!("    {}: {}", key, value);
        }
    }
}

Access the raw XMP XML

use pdf_oxide::extractors::xmp::XmpExtractor;

let mut doc = PdfDocument::open("report.pdf")?;
if let Some(xmp) = XmpExtractor::extract(&mut doc)? {
    if let Some(xml) = &xmp.raw_xml {
        std::fs::write("metadata.xml", xml)?;
        println!("Raw XMP saved ({} bytes)", xml.len());
    }
}

Generate page number display strings

use pdf_oxide::api::Pdf;

let mut pdf = Pdf::open("thesis.pdf")?;
let page_count = pdf.page_count()?;

for i in 0..page_count {
    let label = pdf.page_label(i)?;
    println!("Physical page {} -> display label '{}'", i + 1, label);
}
// Example output:
//   Physical page 1 -> display label 'i'
//   Physical page 2 -> display label 'ii'
//   Physical page 3 -> display label 'iii'
//   Physical page 4 -> display label '1'
//   Physical page 5 -> display label '2'

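FAQ

Where does get_producer read its data from? From the Producer entry of the document's Info dictionary (/Info.Producer). It lives on DocumentEditor (read-write), and the matching setter persists changes to /Info.Producer on save. The XMP pdf:Producer value is available separately through the pdf_producer field of XmpMetadata.

Why does embedded_fonts return only fonts that appear in text? The font list is derived from the page's rendered text spans, so it reflects the fonts actually used to draw glyphs on that page. Subset detection follows the PDF specification's convention of a six-character tag followed by + (as in ABCDEF+Helvetica).

What JSON schema does fonts_to_json return? A JSON array of objects with the fields name, type, encoding, isEmbedded, isSubset, and size, the same format that the Go binding deserializes into Font structs.

Is metadata extraction fast? Yes. On the benchmark corpus, PDF Oxide's extraction core averages about 0.8 ms, with a p99 of 9 ms and a 100% pass rate.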


Related pages