Image Extraction
PDF Oxide extracts images from PDF pages by parsing the content stream, resolving XObject references via Do operators, recursing into nested Form XObjects, and decoding inline images. Use extract_images() to get image objects in memory, or extract_images_to_files() to save them directly to disk as PNG or JPEG files.
Since v0.3.5, image extraction processes the full page content stream rather than only scanning the XObject dictionary. This correctly handles images placed via Do operators, nested Form XObjects with cycle detection, and inline images embedded with BI/ID/EI sequences.
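The recursion with cycle detection can be pictured with a toy model. The sketch below is not PDF Oxide's implementation: `XObject`, `collect_images`, `sample_xobjects`, and the string-keyed resource map are hypothetical names chosen for illustration. It keeps a set of the Form XObjects on the current recursion path and skips any form that is already being visited.

```rust
use std::collections::{HashMap, HashSet};

/// Toy stand-in for a PDF XObject (hypothetical, not a pdf_oxide type).
enum XObject {
    /// An image, identified here only by a label.
    Image(String),
    /// A form whose content stream invokes these XObject names via `Do`.
    Form(Vec<String>),
}

/// Collects image labels reachable from `name`, in `Do` order.
/// `active` holds the forms on the current recursion path, so a form that
/// directly or indirectly invokes itself is skipped instead of looping.
fn collect_images(
    name: &str,
    xobjects: &HashMap<String, XObject>,
    active: &mut HashSet<String>,
    out: &mut Vec<String>,
) {
    match xobjects.get(name) {
        Some(XObject::Image(label)) => out.push(label.clone()),
        Some(XObject::Form(children)) => {
            // `insert` returns false when the form is already on the path.
            if !active.insert(name.to_string()) {
                return;
            }
            for child in children {
                collect_images(child, xobjects, active, out);
            }
            active.remove(name);
        }
        None => {} // dangling reference: nothing to collect
    }
}

/// Builds a small resource map in which Fm1 and Fm2 reference each other.
fn sample_xobjects() -> HashMap<String, XObject> {
    let mut m = HashMap::new();
    m.insert("Im1".to_string(), XObject::Image("logo".to_string()));
    m.insert("Im2".to_string(), XObject::Image("chart".to_string()));
    m.insert(
        "Fm1".to_string(),
        XObject::Form(vec!["Im1".to_string(), "Fm2".to_string()]),
    );
    m.insert(
        "Fm2".to_string(),
        XObject::Form(vec!["Im2".to_string(), "Fm1".to_string()]),
    );
    m
}
```

Calling `collect_images("Fm1", ...)` on the sample map visits each image once even though the two forms point at each other. Removing a form from `active` after its children are processed lets the same form be drawn twice from different places on a page, which is legitimate, while still blocking true cycles.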
Color-space support
Extracted images are decoded without lossy round-tripping. Each colour space is handled as follows:
- DeviceRGB / DeviceGray / DeviceCMYK: returned as-is.
- Indexed (1, 2, 4, 8 bits per component): palette resolved via `resolve_indexed_palette` and expanded through `expand_indexed_to_rgb`. Supports Indexed palettes built on RGB, Grayscale, and CMYK base colour spaces. Previously emitted `Invalid RGB image dimensions` errors on many real-world PDFs.
- CalRGB / CalGray / ICCBased: converted to RGB during decode.
Palette expansion is hardened against malicious inputs with a checked_mul overflow guard and a 256 MiB allocation cap; truncated streams are rejected cleanly instead of producing garbage pixels.
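To make the hardening concrete, here is a minimal sketch of a guarded palette expansion. It is an illustration under stated assumptions, not the library's code: `expand_indexed` and `MAX_OUTPUT_BYTES` are hypothetical names, the palette is assumed to be a flat RGB table already, and rows are assumed to start on a byte boundary with samples packed most-significant bit first.

```rust
/// Upper bound on the output buffer, mirroring the 256 MiB cap described above.
const MAX_OUTPUT_BYTES: usize = 256 * 1024 * 1024;

/// Expands packed palette indices into RGB triples.
/// `bits` is the bits per component (1, 2, 4, or 8); `palette` is a flat
/// RGB table with three bytes per entry.
fn expand_indexed(
    width: usize,
    height: usize,
    bits: usize,
    indices: &[u8],
    palette: &[u8],
) -> Result<Vec<u8>, String> {
    if !matches!(bits, 1 | 2 | 4 | 8) {
        return Err(format!("unsupported bits per component: {}", bits));
    }
    // Checked arithmetic: hostile dimensions must not wrap around.
    let out_len = width
        .checked_mul(height)
        .and_then(|px| px.checked_mul(3))
        .ok_or("image dimensions overflow")?;
    if out_len > MAX_OUTPUT_BYTES {
        return Err("output exceeds allocation cap".to_string());
    }
    // Each row is padded up to a whole number of bytes.
    let row_bits = width
        .checked_mul(bits)
        .and_then(|b| b.checked_add(7))
        .ok_or("row size overflow")?;
    let row_bytes = row_bits / 8;
    let needed = row_bytes.checked_mul(height).ok_or("stream size overflow")?;
    if indices.len() < needed {
        return Err("truncated index stream".to_string());
    }

    let mask = (1u16 << bits) - 1;
    let mut out = Vec::with_capacity(out_len);
    for y in 0..height {
        let row = &indices[y * row_bytes..(y + 1) * row_bytes];
        for x in 0..width {
            let bit_pos = x * bits;
            let byte = row[bit_pos / 8] as u16;
            // Shift the wanted sample down to the low bits, then mask it.
            let shift = 8 - bits - (bit_pos % 8);
            let index = ((byte >> shift) & mask) as usize;
            let entry = palette
                .get(index * 3..index * 3 + 3)
                .ok_or("palette index out of range")?;
            out.extend_from_slice(entry);
        }
    }
    Ok(out)
}
```

All size checks run before any allocation, so an oversized or truncated image fails with an error rather than reserving memory or reading past the end of the stream.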
Malformed-image tolerance
Images with missing /ColorSpace entries, zero dimensions, or invalid streams are skipped with a warning — they no longer panic the page render. The same tolerance applies to malformed images nested inside Form XObjects.
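The skip-with-a-warning behaviour can be sketched as a validation pass that separates usable images from rejected ones. Everything named here (`Candidate`, `rejection_reason`, `partition_candidates`, and the warning wording) is hypothetical and only illustrates the idea; the library's internal checks may differ.

```rust
/// Hypothetical description of an image as found in a PDF, before decoding.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    name: String,
    width: u32,
    height: u32,
    color_space: Option<String>,
}

/// Returns why a candidate cannot be decoded, or `None` if it looks usable.
fn rejection_reason(c: &Candidate) -> Option<&'static str> {
    if c.width == 0 || c.height == 0 {
        Some("zero dimensions")
    } else if c.color_space.is_none() {
        Some("missing /ColorSpace")
    } else {
        None
    }
}

/// Keeps the usable candidates and reports the rest as warnings,
/// so one bad image never aborts extraction for the whole page.
fn partition_candidates(candidates: Vec<Candidate>) -> (Vec<Candidate>, Vec<String>) {
    let mut kept = Vec::new();
    let mut warnings = Vec::new();
    for c in candidates {
        match rejection_reason(&c) {
            Some(reason) => warnings.push(format!("skipping image {}: {}", c.name, reason)),
            None => kept.push(c),
        }
    }
    (kept, warnings)
}
```

Returning warnings as data instead of failing keeps the caller in control: the valid images are still delivered, and the skipped ones can be logged or surfaced as needed.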
Quick Example
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for img in images:
    print(f"{img['width']}x{img['height']}")
Node.js
const { PdfDocument } = require("pdf-oxide");
const doc = new PdfDocument("report.pdf");
const images = doc.getEmbeddedImages(0);
for (const img of images) {
  console.log(`${img.width}x${img.height}`);
}
Go
import (
	"fmt"
	pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)
doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
images, _ := doc.Images(0)
for _, img := range images {
	fmt.Printf("%dx%d\n", img.Width, img.Height)
}
C#
using PdfOxide.Core;
using var doc = PdfDocument.Open("report.pdf");
var images = doc.ExtractImages(0);
foreach (var img in images)
{
    Console.WriteLine($"{img.Width}x{img.Height}");
}
WASM
const doc = new WasmPdfDocument(bytes);
const images = doc.extractImages(0);
for (const img of images) {
  console.log(`${img.width}x${img.height}`);
}
Rust
use pdf_oxide::PdfDocument;
let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;
for img in &images {
    println!("{}x{} {:?}", img.width(), img.height(), img.color_space());
}
API Reference
extract_images(page_index) -> Vec<PdfImage>
Extract all images from a page. Parses the page content stream to find:
- XObject images referenced via `Do` operators
- Form XObjects containing nested images (recursive, with cycle detection)
- Inline images embedded with `BI`/`ID`/`EI` sequences
CTM (Current Transformation Matrix) tracking provides bounding boxes for each image.
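As a rough illustration of how a bounding box follows from the CTM: PDF paints every image into the unit square of its own coordinate space, so transforming that square's four corners by the current matrix and taking the extremes yields the box in user space. The `Matrix` type and `image_bbox` function below are hypothetical helpers for this sketch, not part of the PDF Oxide API.

```rust
/// A PDF transformation matrix [a b c d e f] (hypothetical helper type).
#[derive(Debug, Clone, Copy)]
struct Matrix {
    a: f64,
    b: f64,
    c: f64,
    d: f64,
    e: f64,
    f: f64,
}

impl Matrix {
    /// Maps a point: x' = a*x + c*y + e, y' = b*x + d*y + f.
    fn apply(&self, x: f64, y: f64) -> (f64, f64) {
        (
            self.a * x + self.c * y + self.e,
            self.b * x + self.d * y + self.f,
        )
    }
}

/// Axis-aligned bounds as (x_min, y_min, x_max, y_max) in user space.
fn image_bbox(ctm: &Matrix) -> (f64, f64, f64, f64) {
    // Images are painted into the unit square, so transform its corners.
    let corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)];
    let mut bounds = (
        f64::INFINITY,
        f64::INFINITY,
        f64::NEG_INFINITY,
        f64::NEG_INFINITY,
    );
    for &(x, y) in corners.iter() {
        let (px, py) = ctm.apply(x, y);
        bounds = (
            bounds.0.min(px),
            bounds.1.min(py),
            bounds.2.max(px),
            bounds.3.max(py),
        );
    }
    bounds
}
```

For a content stream containing `200 0 0 100 50 400 cm` followed by a `Do`, the matrix scales the unit square to 200 by 100 units and moves its lower-left corner to (50, 400). Using all four corners rather than only the origin and the opposite corner keeps the result correct for rotated or skewed images.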
| Parameter | Type | Description |
|---|---|---|
| `page_index` | `int` / `usize` | Zero-based page index |
Returns: A vector of PdfImage objects.
PdfImage Fields and Methods
| Method / Field | Type | Description |
|---|---|---|
| `width()` | `u32` | Image width in pixels |
| `height()` | `u32` | Image height in pixels |
| `color_space()` | `&ColorSpace` | Color space (DeviceRGB, DeviceGray, DeviceCMYK, etc.) |
| `bits_per_component()` | `u8` | Bits per color component (typically 8) |
| `data()` | `&ImageData` | Raw image data (JPEG bytes or raw pixels) |
| `bbox()` | `Option<&Rect>` | Bounding box in PDF user space (if CTM was tracked) |
| `save_as_png(path)` | `Result<()>` | Save image as PNG file |
| `save_as_jpeg(path)` | `Result<()>` | Save image as JPEG file |
| `to_png_bytes()` | `Result<Vec<u8>>` | Encode as PNG bytes in memory |
| `to_jpeg_bytes()` | `Result<Vec<u8>>` | Encode as JPEG bytes in memory |
ColorSpace Variants
| Variant | Description |
|---|---|
| `DeviceRGB` | 3-channel RGB |
| `DeviceGray` | Single-channel grayscale |
| `DeviceCMYK` | 4-channel CMYK |
| `Indexed` | Palette-based color |
| `ICCBased` | ICC profile-based color |
| `CalGray` | Calibrated grayscale |
| `CalRGB` | Calibrated RGB |
| `Lab` | CIE L\*a\*b\* color |
ImageData Variants
| Variant | Description |
|---|---|
| `Jpeg(Vec<u8>)` | JPEG-compressed data (DCT pass-through) |
| `Raw { pixels, format }` | Decoded pixel data with `PixelFormat` (RGB, Gray, CMYK, RGBA) |
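When working with raw pixel data, it can help to check that the buffer length matches the image dimensions before handing it to an encoder. The sketch below assumes 8 bits per component and uses a hypothetical `PixelFormat` enum that only mirrors the four formats named above; the real type's variant names and the helper functions `expected_len` and `buffer_matches` are assumptions made for illustration.

```rust
/// Hypothetical mirror of the pixel formats listed above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PixelFormat {
    Rgb,
    Gray,
    Cmyk,
    Rgba,
}

impl PixelFormat {
    /// Channels per pixel, one byte each at 8 bits per component.
    fn channels(self) -> usize {
        match self {
            PixelFormat::Gray => 1,
            PixelFormat::Rgb => 3,
            PixelFormat::Cmyk | PixelFormat::Rgba => 4,
        }
    }
}

/// Expected byte length of a raw 8-bit buffer, or `None` on overflow.
fn expected_len(width: u32, height: u32, format: PixelFormat) -> Option<usize> {
    (width as usize)
        .checked_mul(height as usize)?
        .checked_mul(format.channels())
}

/// True when `pixels` holds exactly one byte per channel per pixel.
fn buffer_matches(pixels: &[u8], width: u32, height: u32, format: PixelFormat) -> bool {
    expected_len(width, height, format) == Some(pixels.len())
}
```

A 4x2 RGB image, for example, should carry exactly 24 bytes; the same buffer interpreted as CMYK would fail the check.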
Rust
let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;
for (i, image) in images.iter().enumerate() {
    println!(
        "Image {}: {}x{} {:?} {}bpc",
        i, image.width(), image.height(),
        image.color_space(), image.bits_per_component(),
    );
    if let Some(bbox) = image.bbox() {
        println!(" Position: ({:.1}, {:.1})", bbox.x, bbox.y);
    }
    image.save_as_png(&format!("output/image_{}.png", i))?;
}
extract_images_to_files(page_index, output_dir, prefix, start_index) -> Vec<ExtractedImageRef>
Extract images from a page and save them directly to files. JPEG images are saved in their original format (zero re-encoding loss); other images are saved as PNG.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `page_index` | `usize` | – | Zero-based page index |
| `output_dir` | `impl AsRef<Path>` | – | Directory to save images (created if absent) |
| `prefix` | `Option<&str>` | `"img"` | Filename prefix |
| `start_index` | `Option<usize>` | `1` | Starting index for filenames |
Returns: A vector of ExtractedImageRef describing saved files.
ExtractedImageRef Fields
| Field | Type | Description |
|---|---|---|
| `filename` | `String` | Saved filename (e.g., `"img_001.png"`) |
| `format` | `ImageFormat` | `Png` or `Jpeg` |
| `width` | `u32` | Image width in pixels |
| `height` | `u32` | Image height in pixels |
Rust
let mut doc = PdfDocument::open("report.pdf")?;
let refs = doc.extract_images_to_files(0, "output/images", Some("fig"), Some(1))?;
for img_ref in &refs {
    println!("Saved: {} ({}x{}, {:?})", img_ref.filename, img_ref.width, img_ref.height, img_ref.format);
}
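The filename example in the ExtractedImageRef table (`"img_001.png"`) suggests a prefix, an underscore, a zero-padded index, and an extension chosen by format. The sketch below reproduces that pattern so output names can be predicted ahead of time. It is a guess at the scheme rather than the library's code: the three-digit padding is inferred from that single example, the `jpg` extension is an assumption, and `Format` and `planned_filenames` are hypothetical names.

```rust
/// Output formats named on this page (hypothetical stand-in for ImageFormat).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Format {
    Png,
    Jpeg,
}

/// Builds names such as "img_001.png". The defaults follow the parameter
/// table: prefix "img" and start index 1. Padding width and the "jpg"
/// extension are assumptions, not confirmed behaviour.
fn planned_filenames(
    formats: &[Format],
    prefix: Option<&str>,
    start_index: Option<usize>,
) -> Vec<String> {
    let prefix = prefix.unwrap_or("img");
    let start = start_index.unwrap_or(1);
    formats
        .iter()
        .enumerate()
        .map(|(i, fmt)| {
            let ext = match fmt {
                Format::Png => "png",
                Format::Jpeg => "jpg",
            };
            format!("{}_{:03}.{}", prefix, start + i, ext)
        })
        .collect()
}
```

When extracting several pages into one directory, passing a distinct prefix per page (or continuing `start_index` from the previous page's count) avoids name collisions between pages.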
Advanced Examples
Extract all images from all pages
use pdf_oxide::PdfDocument;
let mut doc = PdfDocument::open("book.pdf")?;
let page_count = doc.page_count()?;
let mut total = 0;
for page in 0..page_count {
    // Prefix each file with its 1-based page number to avoid name clashes.
    let refs = doc.extract_images_to_files(
        page,
        "output/images",
        Some(&format!("page{}", page + 1)),
        Some(1),
    )?;
    total += refs.len();
    println!("Page {}: {} images", page + 1, refs.len());
}
println!("Total: {} images extracted", total);
Get image bytes in memory (no disk I/O)
let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;
for image in &images {
    let png_bytes = image.to_png_bytes()?;
    println!("PNG size: {} bytes", png_bytes.len());
    // Use png_bytes with an HTTP response, database, etc.
}
Filter images by size
let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;
// Only keep images larger than 100x100 pixels
let large_images: Vec<_> = images.iter()
    .filter(|img| img.width() > 100 && img.height() > 100)
    .collect();
println!("{} large images on page 1", large_images.len());
for img in &large_images {
    println!(" {}x{} {:?}", img.width(), img.height(), img.color_space());
}
Distinguish JPEG pass-through from re-encoded images
use pdf_oxide::extractors::ImageData;
let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;
for (i, image) in images.iter().enumerate() {
    match image.data() {
        ImageData::Jpeg(bytes) => {
            // Original JPEG data -- save directly for zero quality loss
            std::fs::write(format!("image_{}.jpg", i), bytes)?;
            println!("Image {}: JPEG pass-through ({} bytes)", i, bytes.len());
        }
        // The pixel buffer is not needed here, so only `format` is bound.
        ImageData::Raw { format, .. } => {
            // Raw pixels -- must encode to a file format
            image.save_as_png(&format!("image_{}.png", i))?;
            println!("Image {}: raw {:?} ({}x{})", i, format, image.width(), image.height());
        }
    }
}
Related Pages
- Text Extraction – Extract text alongside images
- HTML Conversion – Embed extracted images in HTML output
- Markdown Conversion – Include images in Markdown output