Image Extraction
PDF Oxide extracts images from PDF pages by parsing the content stream, resolving XObject references via Do operators, recursing into nested Form XObjects, and decoding inline images. Use extract_images() to get image objects in memory, or extract_images_to_files() to save them directly to disk as PNG or JPEG files.
Since v0.3.5, image extraction processes the full page content stream rather than only scanning the XObject dictionary. This correctly handles images placed via Do operators, nested Form XObjects with cycle detection, and inline images embedded with BI/ID/EI sequences.
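The cycle detection mentioned above matters because a Form XObject can reference itself, directly or through another form. The following is a minimal sketch of the idea, assuming a simplified model in which every XObject has a name and a form simply lists the names it invokes with Do. The `XObject` enum and `collect_images` function are hypothetical illustrations, not PDF Oxide's internal types.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified stand-in for a PDF XObject (hypothetical, for illustration only).
enum XObject {
    Image { width: u32, height: u32 },
    /// A form lists the XObject names its content stream invokes via `Do`.
    Form { children: Vec<String> },
}

/// Collects (width, height) for every image reachable from `name`,
/// skipping any form that is already on the current recursion path.
fn collect_images(
    name: &str,
    xobjects: &HashMap<String, XObject>,
    active: &mut HashSet<String>,
    out: &mut Vec<(u32, u32)>,
) {
    match xobjects.get(name) {
        Some(XObject::Image { width, height }) => out.push((*width, *height)),
        Some(XObject::Form { children }) => {
            // `insert` returns false when the form is already being visited: a cycle.
            if !active.insert(name.to_string()) {
                return;
            }
            for child in children {
                collect_images(child, xobjects, active, out);
            }
            active.remove(name);
        }
        None => {} // Dangling reference: nothing to extract.
    }
}

fn main() {
    let mut xobjects = HashMap::new();
    xobjects.insert("Im1".to_string(), XObject::Image { width: 640, height: 480 });
    // Fm1 draws Im1 and Fm2; Fm2 points back at Fm1, forming a cycle.
    xobjects.insert(
        "Fm1".to_string(),
        XObject::Form { children: vec!["Im1".to_string(), "Fm2".to_string()] },
    );
    xobjects.insert("Fm2".to_string(), XObject::Form { children: vec!["Fm1".to_string()] });

    let mut images = Vec::new();
    collect_images("Fm1", &xobjects, &mut HashSet::new(), &mut images);
    println!("{:?}", images);
}
```

Tracking only the forms on the current path (rather than every form ever seen) lets the same form be drawn several times on a page while still breaking true cycles.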
Color-space support
Extracted images are decoded and delivered in their original color space, with no lossy round-tripping:
- DeviceRGB / DeviceGray / DeviceCMYK: returned as-is.
- Indexed (1, 2, 4, 8 bits per component): palette resolved via `resolve_indexed_palette` and expanded through `expand_indexed_to_rgb`. Supports Indexed palettes built on RGB, Grayscale, and CMYK base color spaces. Previously, these images triggered `Invalid RGB image dimensions` errors on many real-world PDFs.
- CalRGB / CalGray / ICCBased: converted to RGB during decode.
Palette expansion is hardened against malicious inputs with a checked_mul overflow guard and a 256 MiB allocation cap; truncated streams are rejected cleanly instead of producing garbage pixels.
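To make the hardening concrete, here is a rough sketch of an overflow-checked palette expansion. It assumes an RGB palette stored as consecutive triples and samples packed most-significant-bit first with byte-aligned rows, as the PDF format specifies. The function name `expand_indexed_sketch` and the error strings are hypothetical; the library's own `expand_indexed_to_rgb` may be structured differently.

```rust
/// Allocation cap from the text above: 256 MiB.
const MAX_OUTPUT_BYTES: usize = 256 * 1024 * 1024;

/// Expands packed palette indices into RGB bytes (illustrative sketch only).
fn expand_indexed_sketch(
    data: &[u8],
    width: usize,
    height: usize,
    bits: u8,
    palette: &[u8], // consecutive RGB triples
) -> Result<Vec<u8>, String> {
    if !matches!(bits, 1 | 2 | 4 | 8) {
        return Err(format!("unsupported bits per component: {bits}"));
    }
    if width == 0 || height == 0 {
        return Err("zero image dimensions".into());
    }
    // Overflow-checked output size: width * height * 3 bytes.
    let out_len = width
        .checked_mul(height)
        .and_then(|px| px.checked_mul(3))
        .ok_or("image dimensions overflow")?;
    if out_len > MAX_OUTPUT_BYTES {
        return Err("output exceeds allocation cap".into());
    }
    // Each row is padded to a whole number of bytes.
    let row_bytes = (width * bits as usize + 7) / 8;
    if data.len() < row_bytes * height {
        return Err("truncated image stream".into());
    }
    let mask = ((1u16 << bits) - 1) as u8;
    let per_byte = 8 / bits as usize;
    let mut out = Vec::with_capacity(out_len);
    for row in data.chunks(row_bytes).take(height) {
        for x in 0..width {
            // Samples are packed most-significant bits first.
            let shift = 8 - bits as usize * (x % per_byte + 1);
            let index = ((row[x / per_byte] >> shift) & mask) as usize;
            let rgb = palette
                .get(index * 3..index * 3 + 3)
                .ok_or("palette index out of range")?;
            out.extend_from_slice(rgb);
        }
    }
    Ok(out)
}

fn main() {
    // 4x1 image at 1 bit per sample: indices 1,0,1,1 followed by padding bits.
    let data = [0b1011_0000u8];
    let palette: [u8; 6] = [0, 0, 0, 255, 255, 255]; // 0 = black, 1 = white
    let rgb = expand_indexed_sketch(&data, 4, 1, 1, &palette).unwrap();
    println!("{:?}", rgb);
}
```

Checking the size before allocating means a hostile width/height pair fails with an error instead of wrapping around or exhausting memory.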
Malformed-image tolerance
Images with missing /ColorSpace entries, zero dimensions, or invalid streams are skipped with a warning — they no longer panic the page render. The same tolerance applies to malformed images nested inside Form XObjects.
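The skip-with-a-warning behavior can be pictured as a validation pass that filters out bad candidates rather than returning an error for the whole page. This is only a sketch under assumed types: `ImageCandidate`, `skip_reason`, and `usable_images` are hypothetical names and do not correspond to PDF Oxide internals.

```rust
/// Hypothetical summary of an image dictionary, for illustration only.
struct ImageCandidate {
    name: &'static str,
    width: u32,
    height: u32,
    color_space: Option<&'static str>,
}

/// Returns why a candidate should be skipped, or None if it looks decodable.
fn skip_reason(c: &ImageCandidate) -> Option<&'static str> {
    if c.width == 0 || c.height == 0 {
        Some("zero dimensions")
    } else if c.color_space.is_none() {
        Some("missing /ColorSpace")
    } else {
        None
    }
}

/// Keeps decodable candidates and logs a warning for the rest instead of failing.
fn usable_images(candidates: Vec<ImageCandidate>) -> Vec<ImageCandidate> {
    candidates
        .into_iter()
        .filter(|c| match skip_reason(c) {
            Some(reason) => {
                eprintln!("warning: skipping image {}: {}", c.name, reason);
                false
            }
            None => true,
        })
        .collect()
}

fn main() {
    let found = vec![
        ImageCandidate { name: "Im1", width: 640, height: 480, color_space: Some("DeviceRGB") },
        ImageCandidate { name: "Im2", width: 0, height: 480, color_space: Some("DeviceGray") },
        ImageCandidate { name: "Im3", width: 32, height: 32, color_space: None },
    ];
    let usable = usable_images(found);
    println!("{} of 3 images kept", usable.len());
}
```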
Quick Example
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for img in images:
print(f"{img['width']}x{img['height']}")
Node.js
const { PdfDocument } = require("pdf-oxide");
const doc = new PdfDocument("report.pdf");
const images = doc.getEmbeddedImages(0);
for (const img of images) {
console.log(`${img.width}x${img.height}`);
}
Go
import pdfoxide "github.com/yfedoseev/pdf_oxide/go"
doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
images, _ := doc.Images(0)
for _, img := range images {
fmt.Printf("%dx%d\n", img.Width, img.Height)
}
C#
using PdfOxide.Core;
using var doc = PdfDocument.Open("report.pdf");
var images = doc.ExtractImages(0);
foreach (var img in images)
{
Console.WriteLine($"{img.Width}x{img.Height}");
}
WASM
const doc = new WasmPdfDocument(bytes);
const images = doc.extractImages(0);
for (const img of images) {
console.log(`${img.width}x${img.height}`);
}
Rust
use pdf_oxide::PdfDocument;
let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;
for img in &images {
println!("{}x{} {:?}", img.width(), img.height(), img.color_space());
}
Java
import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.image.ExtractedImage;
import java.nio.file.Path;
import java.util.List;
try (PdfDocument doc = PdfDocument.open(Path.of("report.pdf"))) {
List<ExtractedImage> images = doc.page(0).images();
for (ExtractedImage img : images) {
System.out.println(img.width() + "x" + img.height());
}
}
Kotlin
import fyi.oxide.pdf.PdfDocument
PdfDocument.open(java.nio.file.Path.of("report.pdf")).use { doc ->
for (img in doc.page(0).images()) {
println("${img.width()}x${img.height()}")
}
}
Scala
import fyi.oxide.pdf.{PdfDocument, imagesSeq}
import scala.util.Using
Using.resource(PdfDocument.open("report.pdf")) { doc =>
for (img <- doc.page(0).imagesSeq) {
println(s"${img.width}x${img.height}")
}
}
Clojure
(require '[pdf-oxide.core :as pdf])
(with-open [doc (pdf/open "report.pdf")]
(doseq [img (pdf/images (pdf/page doc 0))]
(println (str (.width img) "x" (.height img)))))
C++
#include <pdf_oxide/pdf_oxide.hpp>
auto doc = pdf_oxide::Document::open("report.pdf");
for (const auto& img : doc.embedded_images(0)) {
std::printf("%dx%d\n", img.width, img.height);
}
Swift
import PdfOxide
let doc = try Document.open("report.pdf")
for img in try doc.embeddedImages(0) {
print("\(img.width)x\(img.height)")
}
Dart
import 'package:pdf_oxide/pdf_oxide.dart';
final doc = PdfDocument.open('report.pdf');
for (final img in doc.embeddedImages(0)) {
print('${img.width}x${img.height}');
}
R
library(pdfoxide)
doc <- pdf_open("report.pdf")
for (img in pdf_embedded_images(doc, 0)) {
cat(sprintf("%dx%d\n", img$width, img$height))
}
Julia
using PdfOxide
doc = open_document("report.pdf")
for img in embedded_images(doc, 0)
println("$(img.width)x$(img.height)")
end
Zig
const std = @import("std");
const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;
var doc = try pdf_oxide.Document.open("report.pdf");
const images = try doc.embeddedImages(a, 0);
for (images) |img| {
std.debug.print("{d}x{d}\n", .{ img.width, img.height });
}
Objective-C
#import "POXPdfOxide.h"
NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
for (POXImage *img in [doc embeddedImages:0 error:&err]) {
NSLog(@"%ldx%ld", (long)img.width, (long)img.height);
}
Elixir
{:ok, doc} = PdfOxide.open("report.pdf")
{:ok, images} = PdfOxide.embedded_images(doc, 0)
for img <- images do
IO.puts("#{img.width}x#{img.height}")
end
API Reference
extract_images(page_index) -> Vec<PdfImage>
Extract all images from a page. Parses the page content stream to find:
- XObject images referenced via `Do` operators
- Form XObjects containing nested images (recursive, with cycle detection)
- Inline images embedded with `BI`/`ID`/`EI` sequences
CTM (Current Transformation Matrix) tracking provides bounding boxes for each image.
| Parameter | Type | Description |
|---|---|---|
| `page_index` | `int` / `usize` | Zero-based page index |
Returns: A vector of PdfImage objects.
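The CTM tracking mentioned above works because PDF paints every image into the unit square, and the matrix in effect at the Do operator maps that square onto the page. A bounding box can therefore be derived by transforming the four corners and taking their extremes. The sketch below illustrates that calculation; the `Matrix` struct and `image_bbox` function are hypothetical and not part of the PDF Oxide API.

```rust
/// PDF transformation matrix [a b c d e f].
struct Matrix {
    a: f64,
    b: f64,
    c: f64,
    d: f64,
    e: f64,
    f: f64,
}

impl Matrix {
    /// Maps a point from image space into PDF user space.
    fn apply(&self, x: f64, y: f64) -> (f64, f64) {
        (
            self.a * x + self.c * y + self.e,
            self.b * x + self.d * y + self.f,
        )
    }
}

/// Axis-aligned bounding box of the unit square under `ctm`,
/// returned as (x, y, width, height).
fn image_bbox(ctm: &Matrix) -> (f64, f64, f64, f64) {
    let corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
        .map(|(x, y)| ctm.apply(x, y));
    let min_x = corners.iter().map(|p| p.0).fold(f64::INFINITY, f64::min);
    let max_x = corners.iter().map(|p| p.0).fold(f64::NEG_INFINITY, f64::max);
    let min_y = corners.iter().map(|p| p.1).fold(f64::INFINITY, f64::min);
    let max_y = corners.iter().map(|p| p.1).fold(f64::NEG_INFINITY, f64::max);
    (min_x, min_y, max_x - min_x, max_y - min_y)
}

fn main() {
    // "200 0 0 100 72 500 cm": a 200x100 pt image with its lower-left corner at (72, 500).
    let ctm = Matrix { a: 200.0, b: 0.0, c: 0.0, d: 100.0, e: 72.0, f: 500.0 };
    let (x, y, w, h) = image_bbox(&ctm);
    println!("bbox: origin ({x}, {y}), size {w}x{h}");
}
```

Using all four corners, rather than only the origin and the scale terms, keeps the box correct for rotated or skewed images.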
PdfImage Fields and Methods
| Method / Field | Type | Description |
|---|---|---|
| `width()` | `u32` | Image width in pixels |
| `height()` | `u32` | Image height in pixels |
| `color_space()` | `&ColorSpace` | Color space (DeviceRGB, DeviceGray, DeviceCMYK, etc.) |
| `bits_per_component()` | `u8` | Bits per color component (typically 8) |
| `data()` | `&ImageData` | Raw image data (JPEG bytes or raw pixels) |
| `bbox()` | `Option<&Rect>` | Bounding box in PDF user space (if CTM was tracked) |
| `save_as_png(path)` | `Result<()>` | Save image as PNG file |
| `save_as_jpeg(path)` | `Result<()>` | Save image as JPEG file |
| `to_png_bytes()` | `Result<Vec<u8>>` | Encode as PNG bytes in memory |
| `to_jpeg_bytes()` | `Result<Vec<u8>>` | Encode as JPEG bytes in memory |
ColorSpace Variants
| Variant | Description |
|---|---|
| `DeviceRGB` | 3-channel RGB |
| `DeviceGray` | Single-channel grayscale |
| `DeviceCMYK` | 4-channel CMYK |
| `Indexed` | Palette-based color |
| `ICCBased` | ICC profile-based color |
| `CalGray` | Calibrated grayscale |
| `CalRGB` | Calibrated RGB |
| `Lab` | CIE L\*a\*b\* color |
ImageData Variants
| Variant | Description |
|---|---|
| `Jpeg(Vec<u8>)` | JPEG-compressed data (DCT pass-through) |
| `Raw { pixels, format }` | Decoded pixel data with `PixelFormat` (RGB, Gray, CMYK, RGBA) |
Rust
let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;
for (i, image) in images.iter().enumerate() {
println!(
"Image {}: {}x{} {:?} {}bpc",
i, image.width(), image.height(),
image.color_space(), image.bits_per_component(),
);
if let Some(bbox) = image.bbox() {
println!(" Position: ({:.1}, {:.1})", bbox.x, bbox.y);
}
image.save_as_png(&format!("output/image_{}.png", i))?;
}
extract_images_to_files(page_index, output_dir, prefix, start_index) -> Vec<ExtractedImageRef>
Extract images from a page and save them directly to files. JPEG images are saved in their original format (zero re-encoding loss); other images are saved as PNG.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `page_index` | `usize` | – | Zero-based page index |
| `output_dir` | `impl AsRef<Path>` | – | Directory to save images (created if absent) |
| `prefix` | `Option<&str>` | `"img"` | Filename prefix |
| `start_index` | `Option<usize>` | `1` | Starting index for filenames |
Returns: A vector of ExtractedImageRef describing saved files.
ExtractedImageRef Fields
| Field | Type | Description |
|---|---|---|
| `filename` | `String` | Saved filename (e.g., `"img_001.png"`) |
| `format` | `ImageFormat` | `Png` or `Jpeg` |
| `width` | `u32` | Image width in pixels |
| `height` | `u32` | Image height in pixels |
Rust
let mut doc = PdfDocument::open("report.pdf")?;
let refs = doc.extract_images_to_files(0, "output/images", Some("fig"), Some(1))?;
for img_ref in &refs {
println!("Saved: {} ({}x{}, {:?})", img_ref.filename, img_ref.width, img_ref.height, img_ref.format);
}
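To see how `prefix` and `start_index` combine into filenames, the sketch below mimics the naming. It is an assumption-heavy illustration: the zero-padded three-digit pattern is inferred solely from the documented example `"img_001.png"`, the `.jpg` extension for JPEG output is a guess, and `image_filename` is a hypothetical helper rather than a library function.

```rust
/// Hypothetical naming helper. The pattern is inferred from the documented
/// example "img_001.png" and may not match the library exactly.
fn image_filename(prefix: Option<&str>, index: usize, is_jpeg: bool) -> String {
    let prefix = prefix.unwrap_or("img"); // documented default prefix
    let ext = if is_jpeg { "jpg" } else { "png" }; // ".jpg" is an assumption
    format!("{prefix}_{index:03}.{ext}")
}

fn main() {
    let start_index = 1; // documented default starting index
    // Pretend the page holds one JPEG followed by two non-JPEG images.
    for (offset, is_jpeg) in [true, false, false].into_iter().enumerate() {
        println!("{}", image_filename(Some("fig"), start_index + offset, is_jpeg));
    }
}
```

Check the `filename` field of the returned `ExtractedImageRef` values for the names actually written to disk instead of relying on this pattern.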
Advanced Examples
Extract all images from all pages
use pdf_oxide::PdfDocument;
let mut doc = PdfDocument::open("book.pdf")?;
let page_count = doc.page_count()?;
let mut total = 0;
for page in 0..page_count {
let refs = doc.extract_images_to_files(
page,
"output/images",
Some(&format!("page{}", page + 1)),
Some(1),
)?;
total += refs.len();
println!("Page {}: {} images", page + 1, refs.len());
}
println!("Total: {} images extracted", total);
Get image bytes in memory (no disk I/O)
let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;
for image in &images {
let png_bytes = image.to_png_bytes()?;
println!("PNG size: {} bytes", png_bytes.len());
// Use png_bytes with an HTTP response, database, etc.
}
Filter images by size
let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;
// Only keep images larger than 100x100 pixels
let large_images: Vec<_> = images.iter()
.filter(|img| img.width() > 100 && img.height() > 100)
.collect();
println!("{} large images on page 1", large_images.len());
for img in &large_images {
println!(" {}x{} {:?}", img.width(), img.height(), img.color_space());
}
Distinguish JPEG pass-through from re-encoded images
use pdf_oxide::extractors::ImageData;
let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;
for (i, image) in images.iter().enumerate() {
match image.data() {
ImageData::Jpeg(bytes) => {
// Original JPEG data -- save directly for zero quality loss
std::fs::write(format!("image_{}.jpg", i), bytes)?;
println!("Image {}: JPEG pass-through ({} bytes)", i, bytes.len());
}
ImageData::Raw { format, .. } => {
// Raw pixels -- must encode to a file format
image.save_as_png(&format!("image_{}.png", i))?;
println!("Image {}: raw {:?} ({}x{})", i, format, image.width(), image.height());
}
}
}
The embedded-images accessor (embedded_images)
extract_images() is the rich, in-memory Rust API. The cross-language bindings expose a leaner embedded-images accessor built on the same content-stream walk, returning each image’s pixel dimensions, format, color space, bits-per-component, and raw decoded bytes. It is backed by the C ABI function pdf_document_get_embedded_images plus the pdf_oxide_image_* accessor family.
How do I list embedded images with the bindings?
Go
import (
"fmt"
pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)
doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
images, _ := doc.Images(0) // []pdfoxide.Image
for _, img := range images {
fmt.Printf("%dx%d %s/%s %dbpc, %d bytes\n",
img.Width, img.Height, img.Format, img.Colorspace,
img.BitsPerComponent, len(img.Data))
}
Swift
import PdfOxide
let doc = try Document.open("report.pdf")
let images = try doc.embeddedImages(0) // [Image]
for img in images {
print("\(img.width)x\(img.height) \(img.format)/\(img.colorspace) "
+ "\(img.bitsPerComponent)bpc, \(img.data.count) bytes")
}
C ABI
#include "pdf_oxide.h"
int32_t err = 0;
FfiImageList *images = pdf_document_get_embedded_images(doc, /*page=*/0, &err);
int32_t n = pdf_oxide_image_count(images);
for (int32_t i = 0; i < n; i++) {
int32_t w = pdf_oxide_image_get_width(images, i, &err);
int32_t h = pdf_oxide_image_get_height(images, i, &err);
char *fmt = pdf_oxide_image_get_format(images, i, &err);
char *cs = pdf_oxide_image_get_colorspace(images, i, &err);
printf("%dx%d %s/%s\n", w, h, fmt, cs);
free_string(fmt);
free_string(cs);
}
pdf_oxide_image_list_free(images);
Java
import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.image.ExtractedImage;
import java.nio.file.Path;
try (PdfDocument doc = PdfDocument.open(Path.of("report.pdf"))) {
for (ExtractedImage img : doc.page(0).images()) {
System.out.printf("%dx%d %s, %d bytes%n",
img.width(), img.height(), img.format(), img.bytes().length);
}
}
Kotlin
import fyi.oxide.pdf.PdfDocument
PdfDocument.open(java.nio.file.Path.of("report.pdf")).use { doc ->
for (img in doc.page(0).images()) {
println("${img.width()}x${img.height()} ${img.format()}, ${img.bytes().size} bytes")
}
}
Scala
import fyi.oxide.pdf.{PdfDocument, imagesSeq}
import scala.util.Using
Using.resource(PdfDocument.open("report.pdf")) { doc =>
for (img <- doc.page(0).imagesSeq) {
println(s"${img.width}x${img.height} ${img.format}, ${img.bytes.length} bytes")
}
}
Clojure
(require '[pdf-oxide.core :as pdf])
(with-open [doc (pdf/open "report.pdf")]
(doseq [img (pdf/images (pdf/page doc 0))]
(println (format "%dx%d %s, %d bytes"
(.width img) (.height img) (.format img) (count (.bytes img))))))
C++
#include <pdf_oxide/pdf_oxide.hpp>
auto doc = pdf_oxide::Document::open("report.pdf");
for (const auto& img : doc.embedded_images(0)) {
std::printf("%dx%d %s/%s %dbpc, %zu bytes\n",
img.width, img.height, img.format.c_str(), img.colorspace.c_str(),
img.bits_per_component, img.data.size());
}
Dart
import 'package:pdf_oxide/pdf_oxide.dart';
final doc = PdfDocument.open('report.pdf');
for (final img in doc.embeddedImages(0)) {
print('${img.width}x${img.height} ${img.format}/${img.colorspace} '
'${img.bitsPerComponent}bpc, ${img.data.length} bytes');
}
R
library(pdfoxide)
doc <- pdf_open("report.pdf")
for (img in pdf_embedded_images(doc, 0)) {
cat(sprintf("%dx%d %s/%s %dbpc, %d bytes\n",
img$width, img$height, img$format, img$colorspace,
img$bits_per_component, length(img$data)))
}
Julia
using PdfOxide
doc = open_document("report.pdf")
for img in embedded_images(doc, 0)
println("$(img.width)x$(img.height) $(img.format)/$(img.colorspace) " *
"$(img.bitsPerComponent)bpc, $(length(img.data)) bytes")
end
Zig
const std = @import("std");
const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;
var doc = try pdf_oxide.Document.open("report.pdf");
const images = try doc.embeddedImages(a, 0);
for (images) |img| {
std.debug.print("{d}x{d} {s}/{s} {d}bpc, {d} bytes\n", .{
img.width, img.height, img.format, img.colorspace,
img.bits_per_component, img.data.len,
});
}
Objective-C
#import "POXPdfOxide.h"
NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
for (POXImage *img in [doc embeddedImages:0 error:&err]) {
NSLog(@"%ldx%ld %@/%@ %ldbpc, %lu bytes",
(long)img.width, (long)img.height, img.format, img.colorspace,
(long)img.bitsPerComponent, (unsigned long)img.data.length);
}
Elixir
{:ok, doc} = PdfOxide.open("report.pdf")
{:ok, images} = PdfOxide.embedded_images(doc, 0)
for img <- images do
IO.puts("#{img.width}x#{img.height} #{img.format}/#{img.colorspace} " <>
"#{img.bits_per_component}bpc, #{byte_size(img.data)} bytes")
end
Image accessor fields
| Field (Go / Swift) | Type | Description |
|---|---|---|
| `Width` / `width` | `int` | Image width in pixels |
| `Height` / `height` | `int` | Image height in pixels |
| `Format` / `format` | `string` | Source format string (e.g. `"jpeg"`, `"raw"`) |
| `Colorspace` / `colorspace` | `string` | Color space name (e.g. `"DeviceRGB"`) |
| `BitsPerComponent` / `bitsPerComponent` | `int` | Bits per color component |
| `Data` / `data` | `[]byte` / `[UInt8]` | Raw decoded image bytes |
Binding coverage. The embedded-images accessor is exposed in Go (`doc.Images(page)`), Swift (`doc.embeddedImages(page)`), and the C ABI (`pdf_document_get_embedded_images`). In Rust, use the richer `extract_images()` shown above. The accessor is compiled out of the WASM target.
The page-elements accessor (page_elements)
page_elements returns every laid-out element (text spans, with their type, text, and bounding box) on a page as a single list. The bindings marshal the whole list in one FFI call via pdf_oxide_elements_to_json, so it is the cheapest way to walk a page’s layout without re-running text extraction per region. It is backed by the C ABI function pdf_page_get_elements and the pdf_oxide_element_* accessor family.
How do I walk a page’s layout elements?
Go
import (
"fmt"
pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)
doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
elements, _ := doc.PageElements(0) // []pdfoxide.Element
for _, el := range elements {
fmt.Printf("[%s] %q at (%.1f, %.1f) %.1fx%.1f\n",
el.Type, el.Text, el.X, el.Y, el.Width, el.Height)
}
Swift
import PdfOxide
let doc = try Document.open("report.pdf")
let elements = try doc.pageElements(0) // ElementList
for el in try elements.all() {
print("[\(el.type)] \(el.text) at "
+ "(\(el.rect.x), \(el.rect.y)) \(el.rect.width)x\(el.rect.height)")
}
// Serialize the whole list to JSON in one call:
let json = try elements.toJson()
C ABI
#include "pdf_oxide.h"
int32_t err = 0;
FfiElementList *els = pdf_page_get_elements(doc, /*page=*/0, &err);
// One-shot JSON serialization (caller frees with free_string):
char *json = pdf_oxide_elements_to_json(els, &err);
printf("%s\n", json);
free_string(json);
pdf_oxide_elements_free(els);
Dart
import 'package:pdf_oxide/pdf_oxide.dart';
final doc = PdfDocument.open('report.pdf');
final elements = doc.pageElements(0); // ElementList
for (final el in elements.toList()) {
print('[${el.type}] ${el.text} at '
'(${el.rect.x}, ${el.rect.y}) ${el.rect.width}x${el.rect.height}');
}
// Serialize the whole list to JSON in one call:
final json = elements.toJson();
Objective-C
#import "POXPdfOxide.h"
NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
POXElementList *els = [doc pageElements:0 error:&err];
for (int32_t i = 0; i < [els count]; i++) {
NSString *type = [els typeAtIndex:i error:&err];
NSString *text = [els textAtIndex:i error:&err];
POXBbox rect = [els rectAtIndex:i error:&err];
NSLog(@"[%@] %@ at (%.1f, %.1f) %.1fx%.1f",
type, text, rect.x, rect.y, rect.width, rect.height);
}
// One-shot JSON serialization:
NSString *json = [els toJsonWithError:&err];
Elixir
{:ok, doc} = PdfOxide.open("report.pdf")
{:ok, els} = PdfOxide.page_elements(doc, 0)
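# The explicit //1 step keeps the range empty when the list has no elements
# (a plain 0..-1 would count downwards).
for i <- 0..(PdfOxide.element_count(els) - 1)//1 do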
{:ok, type} = PdfOxide.element_type(els, i)
{:ok, text} = PdfOxide.element_text(els, i)
{:ok, rect} = PdfOxide.element_rect(els, i)
IO.puts("[#{type}] #{text} at (#{rect.x}, #{rect.y}) #{rect.width}x#{rect.height}")
end
# Serialize the whole list to JSON in one call:
{:ok, json} = PdfOxide.elements_to_json(els)
Element fields
| Field (Go / Swift) | Type | Description |
|---|---|---|
| `Type` / `type` | `string` | Element type (e.g. `"text"`) |
| `Text` / `text` | `string` | Element text content |
| `X`, `Y` / `rect.x`, `rect.y` | `float` | Bounding-box origin in PDF user space |
| `Width`, `Height` / `rect.width`, `rect.height` | `float` | Bounding-box size |
Binding coverage. `page_elements` is exposed in Go (`doc.PageElements(page)`), Swift (`doc.pageElements(page)` → `ElementList`), and the C ABI (`pdf_page_get_elements` + `pdf_oxide_elements_to_json`). It is compiled out of the WASM target.
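Because the element list comes back in one call, per-region queries can run over the in-memory list instead of re-extracting text. The sketch below shows one way to select elements whose bounding box overlaps a region. The `Element` and `Region` structs mirror the documented fields but are hypothetical stand-ins, since this accessor is described here for the bindings rather than as a Rust type.

```rust
/// Stand-in for a page element with the documented fields (hypothetical type).
#[derive(Debug)]
struct Element {
    kind: String, // the documented "type" field; `type` is reserved in Rust
    text: String,
    x: f64,
    y: f64,
    width: f64,
    height: f64,
}

/// Rectangular query region in PDF user space.
struct Region {
    x: f64,
    y: f64,
    width: f64,
    height: f64,
}

/// True when the element's bounding box overlaps the region.
fn intersects(el: &Element, r: &Region) -> bool {
    el.x < r.x + r.width
        && r.x < el.x + el.width
        && el.y < r.y + r.height
        && r.y < el.y + el.height
}

/// Returns references to the elements overlapping `region`, in list order.
fn elements_in_region<'a>(elements: &'a [Element], region: &Region) -> Vec<&'a Element> {
    elements.iter().filter(|el| intersects(el, region)).collect()
}

fn main() {
    let elements = vec![
        Element { kind: "text".into(), text: "Annual Report".into(), x: 72.0, y: 720.0, width: 200.0, height: 18.0 },
        Element { kind: "text".into(), text: "Body paragraph".into(), x: 72.0, y: 400.0, width: 450.0, height: 12.0 },
    ];
    // Top strip of a US Letter page (612x792 pt); PDF's origin is the bottom-left corner.
    let header = Region { x: 0.0, y: 700.0, width: 612.0, height: 92.0 };
    for el in elements_in_region(&elements, &header) {
        println!("[{}] {}", el.kind, el.text);
    }
}
```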
FAQ
What is the difference between extract_images() and the embedded-images accessor?
extract_images() (Rust) returns rich PdfImage objects with save_as_png, to_jpeg_bytes, CTM bounding boxes, and typed ColorSpace/ImageData enums. The embedded-images accessor (doc.Images / doc.embeddedImages / pdf_document_get_embedded_images) returns a flat list of dimensions, format, color space, and raw bytes — the cross-language path to the same content-stream walk.
Is image extraction fast?
Yes. PDF Oxide’s extraction core runs at roughly 0.8 ms mean / 9 ms p99 with a 100% pass rate on the benchmark corpus, decoding images in their original color space with no lossy round-tripping.
Does the embedded-images accessor re-encode JPEGs?
No. JPEG-backed images are returned with their original DCT bytes (format == "jpeg"); only raw pixel data is decoded. The richer extract_images() API exposes the same distinction via ImageData::Jpeg vs ImageData::Raw.
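When consuming the accessor's output, a caller may want to confirm that bytes labelled `"jpeg"` really are a JPEG file before writing them out unchanged. A small sketch of that check follows, relying on the fact that JPEG files begin with the bytes FF D8 FF. The `WritePlan` enum and both functions are hypothetical helpers, not part of any PDF Oxide binding.

```rust
/// What to do with an image's bytes (hypothetical helper type).
#[derive(Debug, PartialEq)]
enum WritePlan {
    /// Bytes are already a complete JPEG file; write them unchanged.
    WriteAsJpeg,
    /// Bytes are decoded pixels; run them through an encoder (for example PNG) first.
    EncodeFirst,
}

/// True if `bytes` begins with the JPEG start-of-image marker followed by a marker prefix.
fn looks_like_jpeg(bytes: &[u8]) -> bool {
    bytes.starts_with(&[0xFF, 0xD8, 0xFF])
}

/// Chooses a plan from the reported format string and a sanity check on the data.
fn plan_for(format: &str, data: &[u8]) -> WritePlan {
    if format == "jpeg" && looks_like_jpeg(data) {
        WritePlan::WriteAsJpeg
    } else {
        WritePlan::EncodeFirst
    }
}

fn main() {
    let jpeg_header = [0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10];
    let raw_pixels = [0x10, 0x20, 0x30, 0x40, 0x50, 0x60];
    println!("{:?}", plan_for("jpeg", &jpeg_header));
    println!("{:?}", plan_for("raw", &raw_pixels));
}
```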
Why is data empty for some images?
Malformed images (missing /ColorSpace, zero dimensions, truncated streams) are skipped with a warning rather than panicking the page, so their byte buffer may come back empty.
Related Pages
- Text Extraction – Extract text alongside images
- HTML Conversion – Embed extracted images in HTML output
- Markdown Conversion – Include images in Markdown output
- Metadata & XMP – Read embedded fonts and document producer