Metadata & XMP
PDF Oxide reads document-level metadata from multiple sources: the PDF header (version), the trailer and catalog dictionaries, XMP metadata streams (ISO 16684), and page label definitions. The XmpExtractor parses the Dublin Core, XMP Core, PDF, and XMP Rights namespaces, plus any custom properties.
Use version() and catalog() for basic document properties, XmpExtractor::extract() for rich metadata, and PageLabelExtractor for page numbering schemes.
Quick Example
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
major, minor = doc.version()
print(f"PDF {major}.{minor}, {doc.page_count()} pages")
Node.js
const { PdfDocument } = require("pdf-oxide");
const doc = new PdfDocument("report.pdf");
const { major, minor } = doc.getVersion();
console.log(`PDF ${major}.${minor}, ${doc.pageCount()} pages`);
doc.close();
Go
import pdfoxide "github.com/yfedoseev/pdf_oxide/go"
doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
major, minor, _ := doc.Version()
pages, _ := doc.PageCount()
fmt.Printf("PDF %d.%d, %d pages\n", major, minor, pages)
C#
using PdfOxide.Core;
using var doc = PdfDocument.Open("report.pdf");
var (major, minor) = doc.Version;
Console.WriteLine($"PDF {major}.{minor}, {doc.PageCount} pages");
WASM
const doc = new WasmPdfDocument(bytes);
const version = doc.version();
console.log(`PDF ${version}, ${doc.pageCount()} pages`);
Rust
use pdf_oxide::PdfDocument;
let mut doc = PdfDocument::open("report.pdf")?;
let (major, minor) = doc.version();
println!("PDF {}.{}", major, minor);
println!("Pages: {}", doc.page_count()?);
API Reference
version() -> (u8, u8)
Get the PDF version from the file header.
Returns: A tuple of (major, minor), e.g., (1, 7) for PDF 1.7 or (2, 0) for PDF 2.0.
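To make the source of these numbers concrete: a PDF file opens with a comment of the form %PDF-M.m. The sketch below reads that header from raw bytes using only the Python standard library. parse_pdf_version is a hypothetical helper written for illustration, not part of PDF Oxide, and scanning the first 1024 bytes is an assumed leniency rather than documented library behavior.

```python
import re
from typing import Tuple

# Hypothetical helper, not part of pdf_oxide: matches the "%PDF-M.m"
# comment that opens a PDF file.
_HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def parse_pdf_version(data: bytes) -> Tuple[int, int]:
    """Return (major, minor) from the header found in the first 1024 bytes."""
    match = _HEADER_RE.search(data[:1024])
    if match is None:
        raise ValueError("no %PDF-M.m header found")
    return int(match.group(1)), int(match.group(2))


print(parse_pdf_version(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"))  # (1, 7)
print(parse_pdf_version(b"%PDF-2.0\n"))  # (2, 0)
```

In practice, prefer version() on an opened document; the sketch only illustrates what the header looks like.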
catalog() -> Result<Object>
Get the document catalog dictionary. The catalog is the root of the PDF object hierarchy and contains references to the page tree, outlines, names, and other document-level structures.
Rust
use pdf_oxide::PdfDocument;
let mut doc = PdfDocument::open("report.pdf")?;
let catalog = doc.catalog()?;
// Print every top-level key in the catalog dictionary
if let Some(dict) = catalog.as_dict() {
    for (key, _) in dict {
        println!("Catalog key: {}", key);
    }
}
trailer() -> &Object
Get the document trailer dictionary. The trailer contains the cross-reference table location, document ID, encryption dictionary reference, and info dictionary reference.
Rust
let doc = PdfDocument::open("report.pdf")?;
let trailer = doc.trailer();
println!("Trailer: {:?}", trailer);
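The cross-reference location mentioned above is recorded near the end of the file, after the startxref keyword. As a rough illustration of that layout, the following sketch pulls the offset out of raw bytes. find_startxref is a hypothetical helper, not a PDF Oxide function, and limiting the search to the last 1024 bytes is an assumption made for the example.

```python
def find_startxref(data: bytes) -> int:
    """Return the byte offset that follows the last 'startxref' keyword."""
    tail = data[-1024:]
    pos = tail.rfind(b"startxref")
    if pos == -1:
        raise ValueError("startxref keyword not found")
    # The offset is the first whitespace-delimited token after the keyword.
    tokens = tail[pos + len(b"startxref"):].split()
    if not tokens or not tokens[0].isdigit():
        raise ValueError("startxref is not followed by an offset")
    return int(tokens[0])


sample = b"trailer\n<< /Size 8 /Root 1 0 R >>\nstartxref\n416\n%%EOF\n"
print(find_startxref(sample))  # 416
```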
XmpExtractor::extract(doc) -> Result<Option<XmpMetadata>>
Extract XMP (Extensible Metadata Platform) metadata from the document’s metadata stream. XMP provides richer metadata than the traditional Info dictionary, using standard XML namespaces.
| Parameter | Type | Description |
|---|---|---|
| doc | &mut PdfDocument | The PDF document |
Returns: Some(XmpMetadata) if XMP data is present, None otherwise.
XmpMetadata Fields
Dublin Core namespace (dc:)
| Field | Type | Description |
|---|---|---|
| dc_title | Option<String> | Document title |
| dc_creator | Vec<String> | Authors/creators list |
| dc_description | Option<String> | Document description |
| dc_subject | Vec<String> | Subject keywords |
| dc_language | Option<String> | Document language (e.g., "en-US") |
| dc_rights | Option<String> | Copyright statement |
| dc_format | Option<String> | MIME format (e.g., "application/pdf") |
XMP Core namespace (xmp:)
| Field | Type | Description |
|---|---|---|
| xmp_creator_tool | Option<String> | Tool used to create the document |
| xmp_create_date | Option<String> | Creation date (ISO 8601) |
| xmp_modify_date | Option<String> | Last modification date |
| xmp_metadata_date | Option<String> | Metadata modification date |
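Because the date fields are returned as ISO 8601 strings rather than parsed values, callers usually convert them before comparing or sorting. The sketch below shows one way to do that with the Python standard library. parse_xmp_date is a hypothetical helper, not part of PDF Oxide, and it covers only the common forms: a full timestamp with an optional zone, a trailing Z, and the truncated YYYY and YYYY-MM forms.

```python
from datetime import datetime


def parse_xmp_date(value: str) -> datetime:
    """Parse an XMP date string into a datetime (naive if no zone is given)."""
    text = value.strip()
    # Older Python versions reject a trailing "Z", so spell out the UTC offset.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    # Pad the truncated forms "YYYY" and "YYYY-MM" to a full calendar date.
    if len(text) == 4:
        text += "-01-01"
    elif len(text) == 7:
        text += "-01"
    return datetime.fromisoformat(text)


print(parse_xmp_date("2024-03-05T14:30:00Z").isoformat())  # 2024-03-05T14:30:00+00:00
print(parse_xmp_date("2024-03").isoformat())  # 2024-03-01T00:00:00
```

Padding truncated dates to the first day of the period is a design choice for this example; an application that needs to preserve the original precision should keep the raw string alongside the parsed value.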
PDF namespace (pdf:)
| Field | Type | Description |
|---|---|---|
| pdf_producer | Option<String> | PDF producer application |
| pdf_keywords | Option<String> | Keywords string |
| pdf_version | Option<String> | PDF version from XMP (may differ from header) |
| pdf_trapped | Option<String> | Trapping status |
XMP Rights namespace (xmpRights:)
| Field | Type | Description |
|---|---|---|
| xmp_rights_usage_terms | Option<String> | Usage terms |
| xmp_rights_marked | Option<bool> | Whether marked with rights |
| xmp_rights_web_statement | Option<String> | Web statement URL |
Other
| Field | Type | Description |
|---|---|---|
| custom | HashMap<String, String> | Custom properties (namespace:property to value) |
| raw_xml | Option<String> | The original XMP XML packet |
Rust
use pdf_oxide::PdfDocument;
use pdf_oxide::extractors::xmp::XmpExtractor;
let mut doc = PdfDocument::open("report.pdf")?;
// extract() yields None when the document has no XMP data
if let Some(xmp) = XmpExtractor::extract(&mut doc)? {
    if let Some(title) = &xmp.dc_title {
        println!("Title: {}", title);
    }
    for creator in &xmp.dc_creator {
        println!("Author: {}", creator);
    }
    if let Some(tool) = &xmp.xmp_creator_tool {
        println!("Created with: {}", tool);
    }
    if let Some(date) = &xmp.xmp_create_date {
        println!("Created: {}", date);
    }
    if let Some(producer) = &xmp.pdf_producer {
        println!("Producer: {}", producer);
    }
}
WASM
const doc = new WasmPdfDocument(bytes);
const xmp = doc.xmpMetadata();
if (xmp) {
  console.log(`Title: ${xmp.dc_title}`);
  console.log(`Authors: ${xmp.dc_creator}`);
  console.log(`Created with: ${xmp.xmp_creator_tool}`);
  console.log(`Created: ${xmp.xmp_create_date}`);
  console.log(`Producer: ${xmp.pdf_producer}`);
}
doc.free();
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
xmp = doc.xmp_metadata()
if xmp:
    print(f"Title: {xmp.get('dc_title')}")
    print(f"Authors: {xmp.get('dc_creator')}")
    print(f"Created with: {xmp.get('xmp_creator_tool')}")
    print(f"Created: {xmp.get('xmp_create_date')}")
    print(f"Producer: {xmp.get('pdf_producer')}")
<!-- Node.js: no equivalent on PdfDocumentImpl — xmp metadata not exposed in js/src/index.ts -->
Go
doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
xmp, _ := doc.XmpMetadata() // returns JSON string
fmt.Println(xmp)
C#
using var doc = PdfDocument.Open("report.pdf");
var xmp = doc.GetXmpMetadata(); // returns JSON string
Console.WriteLine(xmp);
Pdf Convenience Methods
The high-level Pdf API provides shortcut methods for common metadata queries.
xmp_metadata() -> Result<Option<XmpMetadata>>
Get the full XMP metadata object.
xmp_title() -> Result<Option<String>>
Get just the document title from XMP.
xmp_creators() -> Result<Vec<String>>
Get the list of creators/authors from XMP.
Rust
use pdf_oxide::api::Pdf;
let mut pdf = Pdf::open("report.pdf")?;
if let Some(title) = pdf.xmp_title()? {
    println!("Title: {}", title);
}
let creators = pdf.xmp_creators()?;
for creator in &creators {
    println!("Author: {}", creator);
}
PageLabelExtractor::extract(doc) -> Result<Vec<PageLabelRange>>
Extract page label definitions from the document. Page labels define how page numbers are displayed (e.g., Roman numerals for front matter, Arabic numerals for body).
| Parameter | Type | Description |
|---|---|---|
| doc | &mut PdfDocument | The PDF document |
Returns: A vector of PageLabelRange definitions.
PageLabelRange Fields
| Field | Type | Description |
|---|---|---|
| start_page | usize | First page index this range applies to |
| style | PageLabelStyle | Numbering style |
| prefix | Option<String> | Label prefix string |
| start_number | u32 | Starting number for this range |
PageLabelStyle Variants
| Variant | Description | Example |
|---|---|---|
| DecimalArabic | Arabic numerals | 1, 2, 3 |
| UppercaseRoman | Uppercase Roman | I, II, III |
| LowercaseRoman | Lowercase Roman | i, ii, iii |
| UppercaseLetters | Uppercase letters | A, B, C |
| LowercaseLetters | Lowercase letters | a, b, c |
| None | No numbering (prefix only) | – |
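To show how a range definition turns into a display label, here is a minimal sketch in plain Python. It is not PDF Oxide's implementation. It assumes each range is a dict with the four fields listed above, that start_page is a 0-based index, and that styles are identified by the variant names as strings; the actual bindings may represent these differently. All function names are hypothetical.

```python
from typing import List


def to_roman(n: int) -> str:
    """Convert a positive integer to uppercase Roman numerals."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
             (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
             (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def to_letters(n: int) -> str:
    """A..Z for 1..26, then AA..ZZ for 27..52, and so on (PDF convention)."""
    letter = chr(ord("A") + (n - 1) % 26)
    return letter * ((n - 1) // 26 + 1)


def format_number(style: str, n: int) -> str:
    """Render n in the given style; unknown styles and "None" yield no number."""
    if style == "DecimalArabic":
        return str(n)
    if style == "UppercaseRoman":
        return to_roman(n)
    if style == "LowercaseRoman":
        return to_roman(n).lower()
    if style == "UppercaseLetters":
        return to_letters(n)
    if style == "LowercaseLetters":
        return to_letters(n).lower()
    return ""


def label_for(ranges: List[dict], page_index: int) -> str:
    """Build the display label for a 0-based page index."""
    applicable = [r for r in ranges if r["start_page"] <= page_index]
    if not applicable:
        return str(page_index + 1)  # no range applies: plain numbering
    # The governing range is the one that starts closest to the page.
    current = max(applicable, key=lambda r: r["start_page"])
    number = current["start_number"] + (page_index - current["start_page"])
    return (current.get("prefix") or "") + format_number(current["style"], number)


ranges = [
    {"start_page": 0, "style": "LowercaseRoman", "prefix": None, "start_number": 1},
    {"start_page": 3, "style": "DecimalArabic", "prefix": None, "start_number": 1},
]
print([label_for(ranges, i) for i in range(5)])  # ['i', 'ii', 'iii', '1', '2']
```

The key step is picking the range with the greatest start_page that does not exceed the page index, then offsetting start_number by the distance from that range's first page.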
Pdf Page Label Convenience Methods
page_labels() -> Result<Vec<PageLabelRange>>
Get all page label range definitions.
page_label(page) -> Result<String>
Get the display label for a page, given its 0-based index.
Rust
use pdf_oxide::api::Pdf;
let mut pdf = Pdf::open("book.pdf")?;
// Get all label ranges
let ranges = pdf.page_labels()?;
for range in &ranges {
    println!(
        "Pages from {}: {:?} style, prefix={:?}, start={}",
        range.start_page, range.style, range.prefix, range.start_number
    );
}
// Get label for a specific page
let label = pdf.page_label(0)?;
println!("Page 0 label: {}", label); // e.g., "i" or "Cover"
WASM
const doc = new WasmPdfDocument(bytes);
const labels = doc.pageLabels();
for (const range of labels) {
  console.log(`Pages from ${range.start_page}: style=${range.style}, prefix=${range.prefix}`);
}
doc.free();
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("book.pdf")
labels = doc.page_labels()
for label_range in labels:
    print(f"Pages from {label_range['start_page']}: style={label_range['style']}, prefix={label_range['prefix']}")
<!-- Node.js: no equivalent on PdfDocumentImpl — pageLabels not exposed on class, only via properties mixin -->
Go
doc, _ := pdfoxide.Open("book.pdf")
defer doc.Close()
labels, _ := doc.PageLabels() // returns JSON string
fmt.Println(labels)
C#
using var doc = PdfDocument.Open("book.pdf");
var labels = doc.GetPageLabels(); // returns JSON string
Console.WriteLine(labels);
Advanced Examples
Display complete document metadata
use pdf_oxide::PdfDocument;
use pdf_oxide::extractors::xmp::XmpExtractor;
let mut doc = PdfDocument::open("report.pdf")?;
// Basic info
let (major, minor) = doc.version();
println!("PDF Version: {}.{}", major, minor);
println!("Pages: {}", doc.page_count()?);
// XMP metadata
if let Some(xmp) = XmpExtractor::extract(&mut doc)? {
    println!("\nXMP Metadata:");
    println!(" Title: {:?}", xmp.dc_title);
    println!(" Authors: {:?}", xmp.dc_creator);
    println!(" Description: {:?}", xmp.dc_description);
    println!(" Keywords: {:?}", xmp.pdf_keywords);
    println!(" Creator: {:?}", xmp.xmp_creator_tool);
    println!(" Producer: {:?}", xmp.pdf_producer);
    println!(" Created: {:?}", xmp.xmp_create_date);
    println!(" Modified: {:?}", xmp.xmp_modify_date);
    println!(" Language: {:?}", xmp.dc_language);
    println!(" Rights: {:?}", xmp.dc_rights);
    if !xmp.custom.is_empty() {
        println!("\n Custom properties:");
        for (key, value) in &xmp.custom {
            println!(" {}: {}", key, value);
        }
    }
}
Access raw XMP XML
use pdf_oxide::PdfDocument;
use pdf_oxide::extractors::xmp::XmpExtractor;
let mut doc = PdfDocument::open("report.pdf")?;
if let Some(xmp) = XmpExtractor::extract(&mut doc)? {
    // raw_xml holds the original packet, so it can be saved unchanged
    if let Some(xml) = &xmp.raw_xml {
        std::fs::write("metadata.xml", xml)?;
        println!("Raw XMP saved ({} bytes)", xml.len());
    }
}
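Once the raw packet is available, any XML parser can read properties that the typed fields do not cover. The sketch below uses Python's standard xml.etree.ElementTree on a small hand-written packet to pull out the Dublin Core title and creators. The sample XML and the read_dc helper are illustrative assumptions; real packets vary in structure, and this is not how PDF Oxide parses XMP internally.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for RDF and Dublin Core as used in XMP packets.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample packet, for illustration only.
SAMPLE = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title><rdf:Alt><rdf:li xml:lang="x-default">Annual Report</rdf:li></rdf:Alt></dc:title>
      <dc:creator><rdf:Seq><rdf:li>Ada</rdf:li><rdf:li>Grace</rdf:li></rdf:Seq></dc:creator>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def read_dc(xml_text: str):
    """Return (title, creators) from an XMP packet; title is None if absent."""
    root = ET.fromstring(xml_text)
    title = root.find(".//dc:title/rdf:Alt/rdf:li", NS)
    creators = [li.text for li in root.findall(".//dc:creator/rdf:Seq/rdf:li", NS)]
    return (title.text if title is not None else None), creators


print(read_dc(SAMPLE))  # ('Annual Report', ['Ada', 'Grace'])
```

The same pattern extends to other namespaces by adding their URIs to the NS mapping.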
Generate page number display strings
use pdf_oxide::api::Pdf;
let mut pdf = Pdf::open("thesis.pdf")?;
let page_count = pdf.page_count()?;
for i in 0..page_count {
    let label = pdf.page_label(i)?;
    println!("Physical page {} -> display label '{}'", i + 1, label);
}
// Example output:
// Physical page 1 -> display label 'i'
// Physical page 2 -> display label 'ii'
// Physical page 3 -> display label 'iii'
// Physical page 4 -> display label '1'
// Physical page 5 -> display label '2'
Related Pages
- Text Extraction – Extract text content from pages
- Annotation Extraction – Access bookmarks and annotations
- Form Data Extraction – Extract form field data