Image Extraction

PDF Oxide extracts images from PDF pages by parsing the content stream, resolving XObject references via Do operators, recursing into nested Form XObjects, and decoding inline images. Use extract_images() to get image objects in memory, or extract_images_to_files() to save them directly to disk as PNG or JPEG files.

Since v0.3.5, image extraction processes the full page content stream rather than only scanning the XObject dictionary. This correctly handles images placed via Do operators, nested Form XObjects with cycle detection, and inline images embedded with BI/ID/EI sequences.

Color-space support

Extracted images are decoded without lossy round-tripping. Device color spaces are returned unchanged; palette-based and calibrated color spaces are converted to RGB:

  • DeviceRGB / DeviceGray / DeviceCMYK: returned as-is.
  • Indexed (1, 2, 4, 8 bits per component): palette resolved via resolve_indexed_palette and expanded through expand_indexed_to_rgb. Supports Indexed palettes built on RGB, Grayscale, and CMYK base color spaces. Earlier versions emitted "Invalid RGB image dimensions" errors for these images on many real-world PDFs.
  • CalRGB / CalGray / ICCBased: converted to RGB during decode.

Palette expansion is hardened against malicious inputs with a checked_mul overflow guard and a 256 MiB allocation cap; truncated streams are rejected cleanly instead of producing garbage pixels.
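
The guards described above can be illustrated with a small standalone sketch. This is not the library's implementation: the function name expand_indexed_sketch, the ExpandError enum, and the MAX_ALLOC_BYTES constant are hypothetical, and the palette is assumed to already hold 8-bit RGB triples.

```rust
/// Hypothetical cap mirroring the 256 MiB limit described in the text.
const MAX_ALLOC_BYTES: usize = 256 * 1024 * 1024;

/// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq)]
enum ExpandError {
    BadBitDepth,
    Overflow,
    TooLarge,
    Truncated,
    IndexOutOfRange,
}

/// Expand packed palette indices into 8-bit RGB triples.
/// `palette` is assumed to hold RGB triples; `bpc` must be 1, 2, 4, or 8.
fn expand_indexed_sketch(
    indices: &[u8],
    palette: &[u8],
    width: usize,
    height: usize,
    bpc: u8,
) -> Result<Vec<u8>, ExpandError> {
    if !matches!(bpc, 1 | 2 | 4 | 8) {
        return Err(ExpandError::BadBitDepth);
    }
    // Output size is width * height * 3, computed with overflow checks.
    let out_len = width
        .checked_mul(height)
        .and_then(|px| px.checked_mul(3))
        .ok_or(ExpandError::Overflow)?;
    if out_len > MAX_ALLOC_BYTES {
        return Err(ExpandError::TooLarge);
    }
    // Each row of PDF sample data starts on a byte boundary.
    let row_bits = width
        .checked_mul(bpc as usize)
        .ok_or(ExpandError::Overflow)?;
    let row_bytes = row_bits / 8 + usize::from(row_bits % 8 != 0);
    let needed = row_bytes
        .checked_mul(height)
        .ok_or(ExpandError::Overflow)?;
    // Reject truncated streams instead of reading garbage.
    if indices.len() < needed {
        return Err(ExpandError::Truncated);
    }
    let mask = ((1u16 << bpc) - 1) as u8;
    let mut out = Vec::with_capacity(out_len);
    for y in 0..height {
        let row = &indices[y * row_bytes..(y + 1) * row_bytes];
        for x in 0..width {
            // Samples are packed most-significant bits first.
            let bit = x * bpc as usize;
            let shift = 8 - bpc as usize - (bit % 8);
            let idx = ((row[bit / 8] >> shift) & mask) as usize;
            let rgb = palette
                .get(idx * 3..idx * 3 + 3)
                .ok_or(ExpandError::IndexOutOfRange)?;
            out.extend_from_slice(rgb);
        }
    }
    Ok(out)
}
```

The size checks run before any allocation, so a hostile width or height fails fast with an error rather than exhausting memory.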

Malformed-image tolerance

Images with missing /ColorSpace entries, zero dimensions, or invalid streams are skipped with a warning — they no longer panic the page render. The same tolerance applies to malformed images nested inside Form XObjects.
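
The skip-with-warning behavior can be pictured with the following minimal sketch. The Candidate struct and filter_candidates function are hypothetical names for illustration, and "invalid stream" is simplified here to "empty stream"; the library's real validation is likely more involved.

```rust
/// Hypothetical, simplified view of an image dictionary before decoding.
struct Candidate {
    name: &'static str,
    width: u32,
    height: u32,
    color_space: Option<&'static str>,
    stream_len: usize,
}

/// Returns the names of usable images plus one warning per skipped image.
/// A bad image never aborts the loop; it is recorded and skipped.
fn filter_candidates(candidates: &[Candidate]) -> (Vec<&'static str>, Vec<String>) {
    let mut kept = Vec::new();
    let mut warnings = Vec::new();
    for c in candidates {
        let problem = if c.color_space.is_none() {
            Some("missing /ColorSpace")
        } else if c.width == 0 || c.height == 0 {
            Some("zero dimensions")
        } else if c.stream_len == 0 {
            // Simplification: treat an empty stream as the invalid case.
            Some("empty stream")
        } else {
            None
        };
        match problem {
            Some(reason) => warnings.push(format!("skipping image {}: {}", c.name, reason)),
            None => kept.push(c.name),
        }
    }
    (kept, warnings)
}
```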

Quick Example

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for img in images:
    print(f"{img['width']}x{img['height']}")

Node.js

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("report.pdf");
const images = doc.getEmbeddedImages(0);
for (const img of images) {
    console.log(`${img.width}x${img.height}`);
}

Go

import pdfoxide "github.com/yfedoseev/pdf_oxide/go"

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
images, _ := doc.Images(0)
for _, img := range images {
    fmt.Printf("%dx%d\n", img.Width, img.Height)
}

C#

using PdfOxide.Core;

using var doc = PdfDocument.Open("report.pdf");
var images = doc.ExtractImages(0);
foreach (var img in images)
{
    Console.WriteLine($"{img.Width}x{img.Height}");
}

WASM

const doc = new WasmPdfDocument(bytes);
const images = doc.extractImages(0);
for (const img of images) {
    console.log(`${img.width}x${img.height}`);
}

Rust

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;
for img in &images {
    println!("{}x{} {:?}", img.width(), img.height(), img.color_space());
}

API Reference

extract_images(page_index) -> Vec<PdfImage>

Extract all images from a page. Parses the page content stream to find:

  1. XObject images referenced via Do operators
  2. Form XObjects containing nested images (recursive, with cycle detection)
  3. Inline images embedded with BI/ID/EI sequences

CTM (Current Transformation Matrix) tracking provides bounding boxes for each image.
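
To make the CTM idea concrete: in PDF, an image is painted into the unit square, and the CTM maps that square into user space. A bounding box can therefore be derived by transforming the four corners. The sketch below assumes this standard model; the Matrix type and function names are hypothetical and are not the library's Rect or matrix types.

```rust
/// PDF-style affine matrix [a b c d e f] (hypothetical type for this sketch).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Matrix {
    a: f64,
    b: f64,
    c: f64,
    d: f64,
    e: f64,
    f: f64,
}

impl Matrix {
    /// Compose two transforms: `self` is applied first, then `other`.
    /// This matches the `cm` operator, where the new CTM is M x CTM.
    fn then(self, other: Matrix) -> Matrix {
        Matrix {
            a: self.a * other.a + self.b * other.c,
            b: self.a * other.b + self.b * other.d,
            c: self.c * other.a + self.d * other.c,
            d: self.c * other.b + self.d * other.d,
            e: self.e * other.a + self.f * other.c + other.e,
            f: self.e * other.b + self.f * other.d + other.f,
        }
    }

    /// Transform a point: x' = a*x + c*y + e, y' = b*x + d*y + f.
    fn apply(self, x: f64, y: f64) -> (f64, f64) {
        (
            self.a * x + self.c * y + self.e,
            self.b * x + self.d * y + self.f,
        )
    }
}

/// Axis-aligned bounding box (x, y, width, height) of the unit square under `ctm`.
fn unit_square_bbox(ctm: Matrix) -> (f64, f64, f64, f64) {
    let corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)].map(|(x, y)| ctm.apply(x, y));
    let (mut min_x, mut min_y) = corners[0];
    let (mut max_x, mut max_y) = corners[0];
    for &(x, y) in &corners[1..] {
        min_x = min_x.min(x);
        min_y = min_y.min(y);
        max_x = max_x.max(x);
        max_y = max_y.max(y);
    }
    (min_x, min_y, max_x - min_x, max_y - min_y)
}
```

Taking the min and max over all four corners keeps the box correct for rotated or mirrored images, where the transformed origin is not the lower-left corner.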

Parameter    Type          Description
page_index   int / usize   Zero-based page index

Returns: A vector of PdfImage objects.

PdfImage Fields and Methods

Method / Field         Type              Description
width()                u32               Image width in pixels
height()               u32               Image height in pixels
color_space()          &ColorSpace       Color space (DeviceRGB, DeviceGray, DeviceCMYK, etc.)
bits_per_component()   u8                Bits per color component (typically 8)
data()                 &ImageData        Raw image data (JPEG bytes or raw pixels)
bbox()                 Option<&Rect>     Bounding box in PDF user space (if CTM was tracked)
save_as_png(path)      Result<()>        Save image as PNG file
save_as_jpeg(path)     Result<()>        Save image as JPEG file
to_png_bytes()         Result<Vec<u8>>   Encode as PNG bytes in memory
to_jpeg_bytes()        Result<Vec<u8>>   Encode as JPEG bytes in memory

ColorSpace Variants

Variant      Description
DeviceRGB    3-channel RGB
DeviceGray   Single-channel grayscale
DeviceCMYK   4-channel CMYK
Indexed      Palette-based color
ICCBased     ICC profile-based color
CalGray      Calibrated grayscale
CalRGB       Calibrated RGB
Lab          CIE L*a*b* color

ImageData Variants

Variant                  Description
Jpeg(Vec<u8>)            JPEG-compressed data (DCT pass-through)
Raw { pixels, format }   Decoded pixel data with PixelFormat (RGB, Gray, CMYK, RGBA)

Rust

let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;

for (i, image) in images.iter().enumerate() {
    println!(
        "Image {}: {}x{} {:?} {}bpc",
        i, image.width(), image.height(),
        image.color_space(), image.bits_per_component(),
    );

    if let Some(bbox) = image.bbox() {
        println!("  Position: ({:.1}, {:.1})", bbox.x, bbox.y);
    }

    image.save_as_png(&format!("output/image_{}.png", i))?;
}

extract_images_to_files(page_index, output_dir, prefix, start_index) -> Vec<ExtractedImageRef>

Extract images from a page and save them directly to files. JPEG images are saved in their original format (zero re-encoding loss); other images are saved as PNG.

Parameter     Type               Default      Description
page_index    usize              (required)   Zero-based page index
output_dir    impl AsRef<Path>   (required)   Directory to save images (created if absent)
prefix        Option<&str>       "img"        Filename prefix
start_index   Option<usize>      1            Starting index for filenames

Returns: A vector of ExtractedImageRef describing saved files.

ExtractedImageRef Fields

Field      Type          Description
filename   String        Saved filename (e.g., "img_001.png")
format     ImageFormat   Png or Jpeg
width      u32           Image width in pixels
height     u32           Image height in pixels
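
The way prefix and start_index combine into filenames can be sketched as follows. This is an assumption-laden illustration, not the library's code: the three-digit zero padding is inferred from the "img_001.png" example above, the ".jpg" extension for JPEG pass-through is a guess, and output_filename and plan_filenames are hypothetical helpers.

```rust
/// Build one filename. Defaults follow the parameter table: prefix "img".
/// Assumption: index is zero-padded to three digits, as in "img_001.png".
fn output_filename(prefix: Option<&str>, index: usize, is_jpeg: bool) -> String {
    let prefix = prefix.unwrap_or("img");
    // Assumption: JPEG pass-through files use the ".jpg" extension.
    let ext = if is_jpeg { "jpg" } else { "png" };
    format!("{}_{:03}.{}", prefix, index, ext)
}

/// Plan filenames for a page, one per image, numbering from `start_index`
/// (default 1). `jpeg_flags[i]` says whether image i is JPEG pass-through.
fn plan_filenames(
    prefix: Option<&str>,
    start_index: Option<usize>,
    jpeg_flags: &[bool],
) -> Vec<String> {
    let start = start_index.unwrap_or(1);
    jpeg_flags
        .iter()
        .enumerate()
        .map(|(i, &is_jpeg)| output_filename(prefix, start + i, is_jpeg))
        .collect()
}
```

When extracting several pages into one directory, passing a running start_index (or a per-page prefix, as in the all-pages example further down) avoids filename collisions.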

Rust

let mut doc = PdfDocument::open("report.pdf")?;
let refs = doc.extract_images_to_files(0, "output/images", Some("fig"), Some(1))?;

for img_ref in &refs {
    println!("Saved: {} ({}x{}, {:?})", img_ref.filename, img_ref.width, img_ref.height, img_ref.format);
}

Advanced Examples

Extract all images from all pages

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("book.pdf")?;
let page_count = doc.page_count()?;
let mut total = 0;

for page in 0..page_count {
    let refs = doc.extract_images_to_files(
        page,
        "output/images",
        Some(&format!("page{}", page + 1)),
        Some(1),
    )?;
    total += refs.len();
    println!("Page {}: {} images", page + 1, refs.len());
}
println!("Total: {} images extracted", total);

Get image bytes in memory (no disk I/O)

let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;

for image in &images {
    let png_bytes = image.to_png_bytes()?;
    println!("PNG size: {} bytes", png_bytes.len());

    // Use png_bytes with an HTTP response, database, etc.
}

Filter images by size

let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;

// Only keep images larger than 100x100 pixels
let large_images: Vec<_> = images.iter()
    .filter(|img| img.width() > 100 && img.height() > 100)
    .collect();

println!("{} large images on page 1", large_images.len());
for img in &large_images {
    println!("  {}x{} {:?}", img.width(), img.height(), img.color_space());
}

Distinguish JPEG pass-through from re-encoded images

use pdf_oxide::extractors::ImageData;

let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;

for (i, image) in images.iter().enumerate() {
    match image.data() {
        ImageData::Jpeg(bytes) => {
            // Original JPEG data -- save directly for zero quality loss
            std::fs::write(format!("image_{}.jpg", i), bytes)?;
            println!("Image {}: JPEG pass-through ({} bytes)", i, bytes.len());
        }
        ImageData::Raw { format, .. } => {
            // Raw pixels -- must encode to a file format
            image.save_as_png(&format!("image_{}.png", i))?;
            println!("Image {}: raw {:?} ({}x{})", i, format, image.width(), image.height());
        }
    }
}
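
After writing files this way, a quick signature check can confirm that each file's bytes match its extension. The sketch below relies only on the well-known JPEG start-of-image marker and the PNG file signature; the Sniffed enum and sniff_format function are hypothetical helpers and are not part of pdf_oxide.

```rust
/// Result of a signature check (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum Sniffed {
    Jpeg,
    Png,
    Unknown,
}

/// Identify JPEG or PNG data from its leading bytes.
fn sniff_format(bytes: &[u8]) -> Sniffed {
    // PNG files begin with this fixed 8-byte signature.
    const PNG_SIG: [u8; 8] = [0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A];
    // JPEG streams begin with the SOI marker FF D8, followed by another marker (FF ..).
    if bytes.starts_with(&[0xFF, 0xD8, 0xFF]) {
        Sniffed::Jpeg
    } else if bytes.starts_with(&PNG_SIG) {
        Sniffed::Png
    } else {
        Sniffed::Unknown
    }
}
```

The same check could be applied to the bytes from to_png_bytes() or to the payload of ImageData::Jpeg before they are sent over the network or stored.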