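Metadata & XMP
PDF Oxide reads document-level metadata from several sources: the PDF file header (version number), the trailer and catalog dictionaries, the XMP metadata stream (ISO 16684), and page label definitions. XmpExtractor parses the Dublin Core, XMP Core, PDF, and XMP Rights namespaces, along with any custom properties.
Use version() and catalog() for basic document properties, XmpExtractor::extract() for rich metadata, and PageLabelExtractor for page numbering schemes.
Quick Example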
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
major, minor = doc.version()
print(f"PDF {major}.{minor}, {doc.page_count()} pages")
Node.js
const { PdfDocument } = require("pdf-oxide");
const doc = new PdfDocument("report.pdf");
const { major, minor } = doc.getVersion();
console.log(`PDF ${major}.${minor}, ${doc.pageCount()} pages`);
doc.close();
Go
import pdfoxide "github.com/yfedoseev/pdf_oxide/go"
doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
major, minor, _ := doc.Version()
pages, _ := doc.PageCount()
fmt.Printf("PDF %d.%d, %d pages\n", major, minor, pages)
C#
using PdfOxide.Core;
using var doc = PdfDocument.Open("report.pdf");
var (major, minor) = doc.Version;
Console.WriteLine($"PDF {major}.{minor}, {doc.PageCount} pages");
WASM
const doc = new WasmPdfDocument(bytes);
const version = doc.version();
console.log(`PDF ${version}, ${doc.pageCount()} pages`);
Rust
use pdf_oxide::PdfDocument;
let mut doc = PdfDocument::open("report.pdf")?;
let (major, minor) = doc.version();
println!("PDF {}.{}", major, minor);
println!("Pages: {}", doc.page_count()?);
PHP
use PdfOxide\PdfDocument;
$doc = PdfDocument::open("report.pdf");
$v = $doc->version(); // ['major' => int, 'minor' => int]
echo "PDF {$v['major']}.{$v['minor']}, {$doc->pageCount()} pages\n";
$doc->close();
Ruby
require "pdf_oxide"
PdfOxide::PdfDocument.open("report.pdf") do |doc|
puts "PDF #{doc.pdf_version}, #{doc.page_count} pages"
end
C++
#include <pdf_oxide/pdf_oxide.hpp>
auto doc = pdf_oxide::Document::open("report.pdf");
auto v = doc.version();
std::cout << "PDF " << static_cast<int>(v.major) << "."
<< static_cast<int>(v.minor) << ", " << doc.page_count() << " pages\n";
Swift
import PdfOxide
let doc = try Document.open("report.pdf")
let v = try doc.version()
print("PDF \(v.major).\(v.minor), \(try doc.pageCount()) pages")
Dart
import 'package:pdf_oxide/pdf_oxide.dart';
final doc = PdfDocument.open('report.pdf');
final v = doc.version;
print('PDF ${v.major}.${v.minor}, ${doc.pageCount} pages');
doc.close();
R
library(pdfoxide)
doc <- pdf_open("report.pdf")
v <- pdf_version(doc)
cat(sprintf("PDF %d.%d, %d pages\n", v$major, v$minor, pdf_page_count(doc)))
Julia
using PdfOxide
doc = open_document("report.pdf")
v = version(doc)
println("PDF $(v.major).$(v.minor), $(page_count(doc)) pages")
Zig
const pdf_oxide = @import("pdf_oxide");
var doc = try pdf_oxide.Document.open("report.pdf");
const v = doc.version();
std.debug.print("PDF {d}.{d}, {d} pages\n", .{ v.major, v.minor, try doc.pageCount() });
Objective-C
#import "POXPdfOxide.h"
NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
POXVersion v = [doc version];
printf("PDF %d.%d, %ld pages\n", v.major, v.minor, (long)[doc pageCountError:&err]);
Elixir
{:ok, doc} = PdfOxide.open("report.pdf")
%{major: maj, minor: min} = PdfOxide.version(doc)
{:ok, pages} = PdfOxide.page_count(doc)
IO.puts("PDF #{maj}.#{min}, #{pages} pages")
API Reference
version() -> (u8, u8)
Returns the PDF version number from the file header.
Returns: a (major, minor) tuple, for example (1, 7) for PDF 1.7 and (2, 0) for PDF 2.0.
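To make the mapping from file header to version tuple concrete, here is a minimal standard-library sketch. It is not PDF Oxide's implementation: the function name parse_pdf_version and the 1024-byte search window are assumptions chosen for illustration.

```python
import re
from typing import Tuple

# A PDF file opens with a header comment such as b"%PDF-1.7".
_HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def parse_pdf_version(data: bytes) -> Tuple[int, int]:
    """Return (major, minor) parsed from the leading bytes of a PDF file.

    Hypothetical helper for illustration; not part of the pdf_oxide API.
    """
    # Search a small window rather than insisting on offset 0, because
    # some files carry a few stray bytes before the header (assumption:
    # 1024 bytes is a generous enough window).
    match = _HEADER_RE.search(data[:1024])
    if match is None:
        raise ValueError("no %PDF-x.y header found")
    return int(match.group(1)), int(match.group(2))


print(parse_pdf_version(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"))  # (1, 7)
print(parse_pdf_version(b"%PDF-2.0\n"))  # (2, 0)
```

The tuple form keeps comparisons simple, for example `parse_pdf_version(data) >= (1, 5)`.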
catalog() -> Result<Object>
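Returns the document catalog dictionary. The catalog is the root of the PDF object hierarchy and holds references to the page tree, outlines, names dictionary, and other document-level structures.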
Rust
let mut doc = PdfDocument::open("report.pdf")?;
let catalog = doc.catalog()?;
if let Some(dict) = catalog.as_dict() {
for (key, _) in dict {
println!("Catalog key: {}", key);
}
}
trailer() -> &Object
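Returns the document trailer dictionary. The trailer holds the location of the cross-reference table, the document ID, and references to the encryption dictionary and the Info dictionary.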
Rust
let doc = PdfDocument::open("report.pdf")?;
let trailer = doc.trailer();
println!("Trailer: {:?}", trailer);
XmpExtractor::extract(doc) -> Result<Option<XmpMetadata>>
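Extracts XMP (Extensible Metadata Platform) metadata from the document's metadata stream. XMP uses standard XML namespaces and carries richer metadata than the legacy Info dictionary.
| Parameter | Type | Description |
|---|---|---|
| doc | &mut PdfDocument | The PDF document |
Returns: Some(XmpMetadata) if XMP data is present, otherwise None.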
XmpMetadata Fields
Dublin Core namespace (dc:)
| Field | Type | Description |
|---|---|---|
| dc_title | Option<String> | Document title |
| dc_creator | Vec<String> | List of authors/creators |
| dc_description | Option<String> | Document description |
| dc_subject | Vec<String> | Subject keywords |
| dc_language | Option<String> | Document language (e.g. "en-US") |
| dc_rights | Option<String> | Copyright notice |
| dc_format | Option<String> | MIME format (e.g. "application/pdf") |
XMP Core namespace (xmp:)
| Field | Type | Description |
|---|---|---|
| xmp_creator_tool | Option<String> | Tool used to create the document |
| xmp_create_date | Option<String> | Creation date (ISO 8601) |
| xmp_modify_date | Option<String> | Last modification date |
| xmp_metadata_date | Option<String> | Metadata modification date |
PDF namespace (pdf:)
| Field | Type | Description |
|---|---|---|
| pdf_producer | Option<String> | Application that produced the PDF |
| pdf_keywords | Option<String> | Keywords string |
| pdf_version | Option<String> | PDF version recorded in XMP (may differ from the file header) |
| pdf_trapped | Option<String> | Trapping status |
XMP Rights namespace (xmpRights:)
| Field | Type | Description |
|---|---|---|
| xmp_rights_usage_terms | Option<String> | Usage terms |
| xmp_rights_marked | Option<bool> | Whether the document is marked as rights-managed |
| xmp_rights_web_statement | Option<String> | URL of the web copyright statement |
Other
| Field | Type | Description |
|---|---|---|
| custom | HashMap<String, String> | Custom properties (namespace:property → value) |
| raw_xml | Option<String> | Raw XMP XML packet |
Rust
use pdf_oxide::extractors::xmp::XmpExtractor;
let mut doc = PdfDocument::open("report.pdf")?;
if let Some(xmp) = XmpExtractor::extract(&mut doc)? {
if let Some(title) = &xmp.dc_title {
println!("Title: {}", title);
}
for creator in &xmp.dc_creator {
println!("Author: {}", creator);
}
if let Some(tool) = &xmp.xmp_creator_tool {
println!("Created with: {}", tool);
}
if let Some(date) = &xmp.xmp_create_date {
println!("Created: {}", date);
}
if let Some(producer) = &xmp.pdf_producer {
println!("Producer: {}", producer);
}
}
WASM
const doc = new WasmPdfDocument(bytes);
const xmp = doc.xmpMetadata();
if (xmp) {
console.log(`Title: ${xmp.dc_title}`);
console.log(`Authors: ${xmp.dc_creator}`);
console.log(`Created with: ${xmp.xmp_creator_tool}`);
console.log(`Created: ${xmp.xmp_create_date}`);
console.log(`Producer: ${xmp.pdf_producer}`);
}
doc.free();
Python
from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
xmp = doc.xmp_metadata()
if xmp:
    print(f"Title: {xmp.get('dc_title')}")
    print(f"Authors: {xmp.get('dc_creator')}")
    print(f"Created with: {xmp.get('xmp_creator_tool')}")
    print(f"Created: {xmp.get('xmp_create_date')}")
    print(f"Producer: {xmp.get('pdf_producer')}")
<!-- Node.js: no equivalent on PdfDocumentImpl — xmp metadata not exposed in js/src/index.ts -->
Go
doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
xmp, _ := doc.XmpMetadata() // returns JSON string
fmt.Println(xmp)
C#
using var doc = PdfDocument.Open("report.pdf");
var xmp = doc.GetXmpMetadata(); // returns JSON string
Console.WriteLine(xmp);
C++
auto doc = pdf_oxide::Document::open("report.pdf");
std::string xmp = doc.get_xmp_metadata(); // raw XMP XML packet
std::cout << xmp << "\n";
Swift
let doc = try Document.open("report.pdf")
let xmp = try doc.xmpMetadata() // raw XMP XML packet
print(xmp)
Dart
final doc = PdfDocument.open('report.pdf');
final xmp = doc.getXmpMetadata(); // raw XMP XML packet
print(xmp);
doc.close();
R
doc <- pdf_open("report.pdf")
xmp <- pdf_get_xmp_metadata(doc) # XMP metadata as JSON
cat(xmp, "\n")
Julia
doc = open_document("report.pdf")
xmp = get_xmp_metadata(doc) # XMP metadata string
println(xmp)
Zig
var doc = try pdf_oxide.Document.open("report.pdf");
const xmp = try doc.xmpMetadata(a); // caller owns the slice
defer a.free(xmp);
std.debug.print("{s}\n", .{xmp});
Objective-C
POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
NSString *xmp = [doc xmpMetadataWithError:&err];
printf("%s\n", xmp.UTF8String);
Elixir
{:ok, doc} = PdfOxide.open("report.pdf")
xmp = PdfOxide.xmp_metadata(doc) # XMP metadata as an XML/JSON string
IO.puts(xmp)
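Pdf Convenience Methods
The high-level Pdf API provides shortcuts for common metadata queries.
xmp_metadata() -> Result<Option<XmpMetadata>>
Returns the full XMP metadata object.
xmp_title() -> Result<Option<String>>
Returns only the document title from XMP.
xmp_creators() -> Result<Vec<String>>
Returns the list of creators/authors from XMP.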
Rust
use pdf_oxide::api::Pdf;
let mut pdf = Pdf::open("report.pdf")?;
if let Some(title) = pdf.xmp_title()? {
println!("Title: {}", title);
}
let creators = pdf.xmp_creators()?;
for creator in &creators {
println!("Author: {}", creator);
}
PageLabelExtractor::extract(doc) -> Result<Vec<PageLabelRange>>
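Extracts page label definitions from the document. Page labels define how page numbers are displayed (for example, Roman numerals for front matter and Arabic numerals for the body).
| Parameter | Type | Description |
|---|---|---|
| doc | &mut PdfDocument | The PDF document |
Returns: a vector of PageLabelRange definitions.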
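PageLabelRange Fields
| Field | Type | Description |
|---|---|---|
| start_page | usize | Page index at which the range starts |
| style | PageLabelStyle | Numbering style |
| prefix | Option<String> | Label prefix string |
| start_number | u32 | First number used in the range |
PageLabelStyle Variants
| Variant | Description | Example |
|---|---|---|
| DecimalArabic | Arabic numerals | 1, 2, 3 |
| UppercaseRoman | Uppercase Roman numerals | I, II, III |
| LowercaseRoman | Lowercase Roman numerals | i, ii, iii |
| UppercaseLetters | Uppercase letters | A, B, C |
| LowercaseLetters | Lowercase letters | a, b, c |
| None | No numbering (prefix only) | – |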
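The way a range, a style, and a prefix combine into a display label can be sketched in a few lines of standalone Python. This follows the general page-label scheme from the PDF specification and is not PDF Oxide's code. The LabelRange class, the string style names, and the 1-based fallback for pages with no range are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LabelRange:
    """Hypothetical stand-in for PageLabelRange (not the pdf_oxide type)."""
    start_page: int               # zero-based index of the first page in the range
    style: Optional[str]          # style name from the table above, or None
    prefix: Optional[str] = None
    start_number: int = 1


def to_roman(n: int) -> str:
    """Uppercase Roman numeral for n >= 1."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
             (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
             (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in pairs:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


def to_letters(n: int) -> str:
    """A..Z, then AA..ZZ, then AAA..ZZZ: the letter repeats, it is not base 26."""
    letter = chr(ord("A") + (n - 1) % 26)
    return letter * ((n - 1) // 26 + 1)


def page_label(ranges: List[LabelRange], page: int) -> str:
    # The applicable range is the last one that starts at or before the page.
    candidates = [r for r in ranges if r.start_page <= page]
    if not candidates:
        return str(page + 1)      # assumption: fall back to a plain 1-based number
    r = max(candidates, key=lambda c: c.start_page)
    number = r.start_number + (page - r.start_page)
    if r.style == "DecimalArabic":
        body = str(number)
    elif r.style == "UppercaseRoman":
        body = to_roman(number)
    elif r.style == "LowercaseRoman":
        body = to_roman(number).lower()
    elif r.style == "UppercaseLetters":
        body = to_letters(number)
    elif r.style == "LowercaseLetters":
        body = to_letters(number).lower()
    else:
        body = ""                 # style None: the label is the prefix alone
    return (r.prefix or "") + body


thesis = [LabelRange(0, "LowercaseRoman"), LabelRange(3, "DecimalArabic")]
print([page_label(thesis, i) for i in range(5)])  # ['i', 'ii', 'iii', '1', '2']
```

Each range restarts numbering at its own start_number, which is why page index 3 is labeled "1" rather than "4".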
Pdf Page Label Convenience Methods
page_labels() -> Result<Vec<PageLabelRange>>
Returns all page label range definitions.
page_label(page) -> Result<String>
Returns the display label for the given page index.
Rust
use pdf_oxide::api::Pdf;
let mut pdf = Pdf::open("book.pdf")?;
// Get all label ranges
let ranges = pdf.page_labels()?;
for range in &ranges {
println!(
"Pages from {}: {:?} style, prefix={:?}, start={}",
range.start_page, range.style, range.prefix, range.start_number
);
}
// Get label for a specific page
let label = pdf.page_label(0)?;
println!("Page 0 label: {}", label); // e.g., "i" or "Cover"
WASM
const doc = new WasmPdfDocument(bytes);
const labels = doc.pageLabels();
for (const range of labels) {
console.log(`Pages from ${range.start_page}: style=${range.style}, prefix=${range.prefix}`);
}
doc.free();
Python
from pdf_oxide import PdfDocument

doc = PdfDocument("book.pdf")
labels = doc.page_labels()
for label_range in labels:
    print(f"Pages from {label_range['start_page']}: style={label_range['style']}, prefix={label_range['prefix']}")
<!-- Node.js: no equivalent on PdfDocumentImpl — pageLabels not exposed on class, only via properties mixin -->
Go
doc, _ := pdfoxide.Open("book.pdf")
defer doc.Close()
labels, _ := doc.PageLabels() // returns JSON string
fmt.Println(labels)
C#
using var doc = PdfDocument.Open("book.pdf");
var labels = doc.GetPageLabels(); // returns JSON string
Console.WriteLine(labels);
C++
auto doc = pdf_oxide::Document::open("book.pdf");
std::string labels = doc.get_page_labels(); // JSON string
std::cout << labels << "\n";
Swift
let doc = try Document.open("book.pdf")
let labels = try doc.pageLabels() // JSON string
print(labels)
Dart
final doc = PdfDocument.open('book.pdf');
final labels = doc.getPageLabels(); // JSON string
print(labels);
doc.close();
R
doc <- pdf_open("book.pdf")
labels <- pdf_get_page_labels(doc) # JSON string
cat(labels, "\n")
Julia
doc = open_document("book.pdf")
labels = get_page_labels(doc) # JSON string
println(labels)
Zig
var doc = try pdf_oxide.Document.open("book.pdf");
const labels = try doc.pageLabels(a); // JSON string; caller owns the slice
defer a.free(labels);
std.debug.print("{s}\n", .{labels});
Objective-C
POXDocument *doc = [POXDocument openPath:@"book.pdf" error:&err];
NSString *labels = [doc pageLabelsWithError:&err];
printf("%s\n", labels.UTF8String);
Elixir
{:ok, doc} = PdfOxide.open("book.pdf")
labels = PdfOxide.page_labels(doc) # JSON string
IO.puts(labels)
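get_producer: Document Producer
The producer is the tool that generated the PDF (/Info.Producer). The editor interface exposes it as a read-write accessor: read it via get_producer / Producer, and set it via the corresponding setter, which writes /Info.Producer on save. The XMP counterpart is the pdf_producer field of XmpMetadata described above.
The accessor is backed by the C ABI function document_editor_get_producer:
char *document_editor_get_producer(DocumentEditor *handle, int32_t *error_code);
Returns a caller-owned C string (release it with free_string), or null if no producer is set.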
Rust
use pdf_oxide::editor::DocumentEditor;
let mut editor = DocumentEditor::open("report.pdf")?;
if let Some(producer) = editor.producer()? {
println!("Producer: {}", producer);
}
Go
import pdfoxide "github.com/yfedoseev/pdf_oxide/go"
editor, _ := pdfoxide.OpenEditor("report.pdf")
defer editor.Close()
producer, _ := editor.Producer()
fmt.Printf("Producer: %s\n", producer)
C#
using PdfOxide.Core;
using var editor = DocumentEditor.Open("report.pdf");
Console.WriteLine($"Producer: {editor.Producer}");
Swift
import PdfOxide
let editor = try DocumentEditor.open("report.pdf")
let producer = try editor.getProducer()
print("Producer: \(producer)")
PHP
use PdfOxide\DocumentEditor;
$editor = DocumentEditor::open("report.pdf");
echo "Producer: " . $editor->getProducer() . "\n";
C++
auto editor = pdf_oxide::DocumentEditor::open("report.pdf");
std::cout << "Producer: " << editor.get_producer() << "\n";
Dart
final editor = DocumentEditor.open('report.pdf');
print('Producer: ${editor.getProducer()}');
R
editor <- pdf_editor_open("report.pdf")
cat("Producer:", pdf_editor_get_producer(editor), "\n")
Julia
editor = open_editor("report.pdf")
println("Producer: ", get_producer(editor))
Zig
var editor = try pdf_oxide.Document.openEditor("report.pdf");
const producer = try editor.getProducer(a); // caller owns the slice
defer a.free(producer);
std.debug.print("Producer: {s}\n", .{producer});
Objective-C
POXDocumentEditor *editor = [POXDocumentEditor openEditor:@"report.pdf" error:&err];
NSString *producer = [editor producerError:&err];
printf("Producer: %s\n", producer.UTF8String);
Elixir
{:ok, editor} = PdfOxide.open_editor("report.pdf")
producer = PdfOxide.get_producer(editor)
IO.puts("Producer: #{producer}")
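Binding coverage.
get_producer lives on the editor (DocumentEditor), not on the read-only PdfDocument. It is exposed in Rust (editor.producer()), Go (editor.Producer()), C# (the editor.Producer property), Swift (editor.getProducer()), and the C ABI (document_editor_get_producer). The corresponding setter (set_producer / SetProducer / Producer = ...) persists the change on save. The accessor is not compiled for the WASM target.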
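embedded_fonts: Fonts Used on a Page
embedded_fonts lists the fonts referenced by a page's content stream, deriving each font's name, embedding status, and subset status from the page's text spans. (Subset fonts are detected via the standard naming rule of a six-letter prefix followed by +, as in ABCDEF+Helvetica.) It is backed by the C ABI function pdf_document_get_embedded_fonts and the pdf_oxide_font_* accessor family.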
Go
import pdfoxide "github.com/yfedoseev/pdf_oxide/go"
doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
fonts, _ := doc.Fonts(0) // []pdfoxide.Font
for _, f := range fonts {
fmt.Printf("%s (%s) embedded=%v subset=%v\n",
f.Name, f.Encoding, f.IsEmbedded, f.IsSubset)
}
Swift
import PdfOxide
let doc = try Document.open("report.pdf")
let fonts = try doc.embeddedFonts(0) // [Font]
for f in fonts {
print("\(f.name) (\(f.encoding)) embedded=\(f.embedded) subset=\(f.subset)")
}
C ABI
#include "pdf_oxide.h"
int32_t err = 0;
FfiFontList *fonts = pdf_document_get_embedded_fonts(doc, /*page=*/0, &err);
int32_t n = pdf_oxide_font_count(fonts);
for (int32_t i = 0; i < n; i++) {
char *name = pdf_oxide_font_get_name(fonts, i, &err);
int32_t embedded = pdf_oxide_font_is_embedded(fonts, i, &err);
int32_t subset = pdf_oxide_font_is_subset(fonts, i, &err);
printf("%s embedded=%d subset=%d\n", name, embedded, subset);
free_string(name);
}
pdf_oxide_font_list_free(fonts);
C++
auto doc = pdf_oxide::Document::open("report.pdf");
for (const auto& f : doc.embedded_fonts(0)) { // std::vector<Font>
std::cout << f.name << " (" << f.encoding << ") embedded="
<< f.embedded << " subset=" << f.subset << "\n";
}
Dart
final doc = PdfDocument.open('report.pdf');
for (final f in doc.embeddedFonts(0)) { // List<Font>
print('${f.name} (${f.encoding}) embedded=${f.embedded} subset=${f.subset}');
}
doc.close();
R
doc <- pdf_open("report.pdf")
for (f in pdf_embedded_fonts(doc, 0)) { # list of Font records
cat(sprintf("%s (%s) embedded=%s subset=%s\n",
f$name, f$encoding, f$embedded, f$subset))
}
Julia
doc = open_document("report.pdf")
for f in embedded_fonts(doc, 0) # Vector{Font}
println("$(f.name) ($(f.encoding)) embedded=$(f.embedded) subset=$(f.subset)")
end
Zig
var doc = try pdf_oxide.Document.open("report.pdf");
const fonts = try doc.embeddedFonts(a, 0); // []Font
defer pdf_oxide.Document.freeFonts(a, fonts);
for (fonts) |f| {
std.debug.print("{s} ({s}) embedded={} subset={}\n",
.{ f.name, f.encoding, f.embedded, f.subset });
}
Objective-C
POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
for (POXFont *f in [doc embeddedFonts:0 error:&err]) {
printf("%s (%s) embedded=%d subset=%d\n",
f.name.UTF8String, f.encoding.UTF8String, f.embedded, f.subset);
}
Elixir
{:ok, doc} = PdfOxide.open("report.pdf")
for f <- PdfOxide.embedded_fonts(doc, 0) do # list of %Font{}
IO.puts("#{f.name} (#{f.encoding}) embedded=#{f.embedded} subset=#{f.subset}")
end
Font Accessor Fields
| Field (Go / Swift) | Type | Description |
|---|---|---|
| Name / name | string | Font resource name (e.g. "ABCDEF+Helvetica") |
| Type / type | string | Font subtype |
| Encoding / encoding | string | Font encoding |
| IsEmbedded / embedded | bool | Whether the font program is embedded |
| IsSubset / subset | bool | Whether the font is a subset |
| Size (Go) | float32 | Font size (if available) |
Binding coverage.
embedded_fonts is exposed in Go (doc.Fonts(page)), Swift (doc.embeddedFonts(page)), and the C ABI (pdf_document_get_embedded_fonts). It is not compiled for the WASM target.
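The subset naming rule mentioned in this section (six uppercase letters, a plus sign, then the base font name) is easy to check by hand. The sketch below is a standalone illustration of that rule in Python; split_subset_tag and is_subset are hypothetical helper names, not pdf_oxide functions.

```python
import re
from typing import Optional, Tuple

# Six uppercase ASCII letters, a literal "+", then the base font name.
_SUBSET_RE = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_tag(font_name: str) -> Tuple[Optional[str], str]:
    """Return (tag, base_name); tag is None when the name carries no subset tag."""
    match = _SUBSET_RE.match(font_name)
    if match is None:
        return None, font_name
    return match.group(1), match.group(2)


def is_subset(font_name: str) -> bool:
    """True when the name follows the subset-tag naming rule."""
    return split_subset_tag(font_name)[0] is not None


print(split_subset_tag("ABCDEF+Helvetica"))  # ('ABCDEF', 'Helvetica')
print(is_subset("Helvetica"))                # False
print(is_subset("abcdef+Helvetica"))         # False: the tag must be uppercase
```

Stripping the tag is handy when grouping fonts across pages, since two subsets of the same font usually carry different tags.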
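fonts_to_json: Serialize Page Fonts
fonts_to_json serializes the entire font list returned by embedded_fonts into a JSON array in a single FFI call. The Go binding uses it internally to build []Font; Swift exposes it directly as fontsToJson. The C ABI signature:
char *pdf_oxide_fonts_to_json(const FfiFontList *fonts, int32_t *error_code);
The returned UTF-8 string is owned by the caller (release it with free_string). Its schema is: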
[{"name": "...", "type": "...", "encoding": "...",
"isEmbedded": true, "isSubset": false, "size": 0}]
Swift
import PdfOxide
let doc = try Document.open("report.pdf")
let json = try doc.fontsToJson(0) // String of JSON
print(json)
C ABI
#include "pdf_oxide.h"
int32_t err = 0;
FfiFontList *fonts = pdf_document_get_embedded_fonts(doc, /*page=*/0, &err);
char *json = pdf_oxide_fonts_to_json(fonts, &err);
printf("%s\n", json);
free_string(json);
pdf_oxide_font_list_free(fonts);
C++
auto doc = pdf_oxide::Document::open("report.pdf");
std::string json = doc.fonts_to_json(0); // JSON array string
std::cout << json << "\n";
Dart
final doc = PdfDocument.open('report.pdf');
final json = doc.embeddedFontsJson(0); // JSON array string
print(json);
doc.close();
R
doc <- pdf_open("report.pdf")
json <- pdf_fonts_to_json(doc, 0) # JSON array string
cat(json, "\n")
Julia
doc = open_document("report.pdf")
json = fonts_to_json(doc, 0) # JSON array string
println(json)
Zig
var doc = try pdf_oxide.Document.open("report.pdf");
var fl = try doc.fontList(0); // owned FontList handle
defer fl.deinit();
const json = try fl.toJson(a); // JSON array string; caller owns the slice
defer a.free(json);
std.debug.print("{s}\n", .{json});
Objective-C
POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
NSString *json = [doc embeddedFontsJson:0 error:&err]; // JSON array string
printf("%s\n", json.UTF8String);
Elixir
{:ok, doc} = PdfOxide.open("report.pdf")
json = PdfOxide.fonts_to_json(doc, 0) # JSON array string
IO.puts(json)
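Binding coverage.
fonts_to_json is exposed directly in Swift (doc.fontsToJson(page)) and the C ABI (pdf_oxide_fonts_to_json); the Go binding calls it internally to decode doc.Fonts(page) into typed structs. It is not compiled for the WASM target.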
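For bindings that hand back the JSON string rather than typed structs, decoding is a single json.loads call plus a mapping step. The sketch below assumes only the schema documented in this section. The Font dataclass, the parse_fonts helper, and the sample payload are hypothetical and exist only for illustration.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Font:
    """Hypothetical typed view of one entry in the fonts_to_json array."""
    name: str
    type: str
    encoding: str
    is_embedded: bool
    is_subset: bool
    size: float


def parse_fonts(payload: str) -> List[Font]:
    """Decode a fonts_to_json payload into Font records."""
    return [
        Font(
            name=entry["name"],
            type=entry["type"],
            encoding=entry["encoding"],
            is_embedded=entry["isEmbedded"],
            is_subset=entry["isSubset"],
            size=float(entry["size"]),
        )
        for entry in json.loads(payload)
    ]


# Made-up sample that follows the documented schema.
sample = (
    '[{"name": "ABCDEF+Helvetica", "type": "Type1", '
    '"encoding": "WinAnsiEncoding", "isEmbedded": true, '
    '"isSubset": true, "size": 0}]'
)
for font in parse_fonts(sample):
    print(f"{font.name} embedded={font.is_embedded} subset={font.is_subset}")
```

Mapping the camelCase JSON keys to snake_case attributes keeps the Python side idiomatic while leaving the wire format untouched.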
Advanced Examples
Display full document metadata
use pdf_oxide::PdfDocument;
use pdf_oxide::extractors::xmp::XmpExtractor;
let mut doc = PdfDocument::open("report.pdf")?;
// Basic info
let (major, minor) = doc.version();
println!("PDF Version: {}.{}", major, minor);
println!("Pages: {}", doc.page_count()?);
// XMP metadata
if let Some(xmp) = XmpExtractor::extract(&mut doc)? {
println!("\nXMP Metadata:");
println!(" Title: {:?}", xmp.dc_title);
println!(" Authors: {:?}", xmp.dc_creator);
println!(" Description: {:?}", xmp.dc_description);
println!(" Keywords: {:?}", xmp.pdf_keywords);
println!(" Creator: {:?}", xmp.xmp_creator_tool);
println!(" Producer: {:?}", xmp.pdf_producer);
println!(" Created: {:?}", xmp.xmp_create_date);
println!(" Modified: {:?}", xmp.xmp_modify_date);
println!(" Language: {:?}", xmp.dc_language);
println!(" Rights: {:?}", xmp.dc_rights);
if !xmp.custom.is_empty() {
println!("\n Custom properties:");
for (key, value) in &xmp.custom {
println!(" {}: {}", key, value);
}
}
}
Access the raw XMP XML
use pdf_oxide::extractors::xmp::XmpExtractor;
let mut doc = PdfDocument::open("report.pdf")?;
if let Some(xmp) = XmpExtractor::extract(&mut doc)? {
if let Some(xml) = &xmp.raw_xml {
std::fs::write("metadata.xml", xml)?;
println!("Raw XMP saved ({} bytes)", xml.len());
}
}
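Once the raw packet is on disk, any XML library can read it, which is useful with bindings that return only the XMP string. The following is a rough Python sketch using the standard library. The sample packet is made up, parse_xmp is a hypothetical helper, and only the element form of each property is handled (XMP also allows simple properties to be written as attributes on rdf:Description).

```python
import xml.etree.ElementTree as ET

# Well-known namespace URIs used by XMP packets.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "xmp": "http://ns.adobe.com/xap/1.0/",
    "pdf": "http://ns.adobe.com/pdf/1.3/",
}

# Made-up packet for illustration only.
SAMPLE = """<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:xmp="http://ns.adobe.com/xap/1.0/"
      xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
   <dc:title><rdf:Alt><rdf:li xml:lang="x-default">Annual Report</rdf:li></rdf:Alt></dc:title>
   <dc:creator><rdf:Seq><rdf:li>Ada</rdf:li><rdf:li>Grace</rdf:li></rdf:Seq></dc:creator>
   <xmp:CreatorTool>ExampleWriter 1.0</xmp:CreatorTool>
   <pdf:Producer>ExampleLib</pdf:Producer>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>"""


def parse_xmp(packet: str) -> dict:
    """Pull a few common properties out of an XMP packet (element form only)."""
    root = ET.fromstring(packet)
    return {
        # dc:title is a language alternative; take the first entry.
        "dc_title": root.findtext(".//dc:title/rdf:Alt/rdf:li", namespaces=NS),
        # dc:creator is an ordered list of names.
        "dc_creator": [li.text for li in root.findall(".//dc:creator/rdf:Seq/rdf:li", NS)],
        "xmp_creator_tool": root.findtext(".//xmp:CreatorTool", namespaces=NS),
        "pdf_producer": root.findtext(".//pdf:Producer", namespaces=NS),
    }


print(parse_xmp(SAMPLE))
```

The dictionary keys mirror the XmpMetadata field names so the result lines up with the tables in the API reference; missing properties come back as None or an empty list.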
Generate page number display strings
use pdf_oxide::api::Pdf;
let mut pdf = Pdf::open("thesis.pdf")?;
let page_count = pdf.page_count()?;
for i in 0..page_count {
let label = pdf.page_label(i)?;
println!("Physical page {} -> display label '{}'", i + 1, label);
}
// Example output:
// Physical page 1 -> display label 'i'
// Physical page 2 -> display label 'ii'
// Physical page 3 -> display label 'iii'
// Physical page 4 -> display label '1'
// Physical page 5 -> display label '2'
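Frequently Asked Questions
Where does get_producer read its data from?
It reads the /Info.Producer entry of the document's Info dictionary. The accessor lives on DocumentEditor (read-write), and the corresponding setter persists the change to /Info.Producer on save. The XMP pdf:Producer value is available separately through the pdf_producer field of XmpMetadata.
Why does embedded_fonts return only fonts that appear in text?
The font list is derived from the page's rendered text spans, so it reflects the fonts actually used to draw glyphs on that page. Subset detection follows the PDF specification's convention of a six-character tag followed by + (e.g. ABCDEF+Helvetica).
What JSON schema does fonts_to_json return?
A JSON array of objects with the fields name, type, encoding, isEmbedded, isSubset, and size, the same format the Go binding deserializes into its Font struct.
Is metadata extraction fast?
Yes. On the benchmark corpus, PDF Oxide's extraction core averages about 0.8 ms, with a p99 of 9 ms and a 100% pass rate.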