What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

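Image extraction

PDF Oxide extracts images from PDF pages by parsing the content stream, resolving XObject references through the Do operator, recursing into nested Form XObjects, and decoding inline images. Use extract_images() to get image objects in memory, or extract_images_to_files() to save images directly as PNG or JPEG files.

Since v0.3.5, image extraction processes the full page content stream rather than only scanning the XObject dictionary. This correctly handles images placed with the Do operator, nested Form XObjects (with cycle detection), and inline images embedded as BI/ID/EI sequences.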
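The cycle detection mentioned for nested Form XObjects can be illustrated with a small, self-contained sketch. This is not PDF Oxide's implementation: the `XObject` enum, the integer object ids, and `collect_images` are hypothetical names chosen for illustration. It only shows the general idea of tracking the forms that are currently being expanded so that a form referencing itself does not recurse forever.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified XObject: either an image or a form that
/// references other XObjects by object id.
enum XObject {
    Image { width: u32, height: u32 },
    Form { children: Vec<u32> },
}

/// Collects the dimensions of every image reachable from `id`.
/// `active` holds the forms currently being expanded, so a form that
/// (directly or indirectly) references itself is skipped, not re-entered.
fn collect_images(
    id: u32,
    objects: &HashMap<u32, XObject>,
    active: &mut HashSet<u32>,
    out: &mut Vec<(u32, u32)>,
) {
    match objects.get(&id) {
        Some(XObject::Image { width, height }) => out.push((*width, *height)),
        Some(XObject::Form { children }) => {
            // `insert` returns false if the id is already on the active path.
            if !active.insert(id) {
                return;
            }
            for child in children {
                collect_images(*child, objects, active, out);
            }
            active.remove(&id);
        }
        None => {} // dangling reference: nothing to collect
    }
}
```

Removing the id from `active` after the loop means the same form can still be painted twice from different parents; only true cycles are cut.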

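Color space support

Extracted images are decoded in their original color space, with no lossy conversion:

DeviceRGB / DeviceGray / DeviceCMYK: returned as-is.
Indexed (1, 2, 4, or 8 bits per component): the palette is resolved by resolve_indexed_palette and expanded by expand_indexed_to_rgb. Indexed palettes built on RGB, grayscale, and CMYK base color spaces are supported. Previously, these images triggered an Invalid RGB image dimensions error on many real-world PDFs.
CalRGB / CalGray / ICCBased: converted to RGB during decoding.

Palette expansion is hardened against malicious input with checked_mul overflow protection and a 256 MiB allocation cap; truncated streams are rejected cleanly instead of producing corrupted pixels.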
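As a rough illustration of the safeguards described above, the following sketch expands 8-bit palette indices to RGB. It is an assumption-laden example, not the library's expand_indexed_to_rgb: the function name `expand_indexed8_sketch`, its signature, and the `MAX_OUTPUT_BYTES` constant are invented here, and it handles only 8-bit indices over an RGB palette (1-, 2-, and 4-bit indices would need unpacking first).

```rust
/// Allocation cap mirroring the 256 MiB limit mentioned above.
const MAX_OUTPUT_BYTES: usize = 256 * 1024 * 1024;

/// Expands 8-bit palette indices into packed RGB bytes.
/// `palette` holds 3 bytes (R, G, B) per entry. Returns `None` on
/// overflow, an oversized result, a truncated stream, or a bad index.
fn expand_indexed8_sketch(
    width: usize,
    height: usize,
    indices: &[u8],
    palette: &[u8],
) -> Option<Vec<u8>> {
    // checked_mul returns None instead of wrapping on overflow.
    let pixel_count = width.checked_mul(height)?;
    let out_len = pixel_count.checked_mul(3)?;
    if out_len > MAX_OUTPUT_BYTES {
        return None;
    }
    // A truncated stream has fewer indices than pixels: reject it.
    let indices = indices.get(..pixel_count)?;

    let mut rgb = Vec::with_capacity(out_len);
    for &index in indices {
        let start = usize::from(index) * 3;
        // An index past the end of the palette is also rejected.
        rgb.extend_from_slice(palette.get(start..start + 3)?);
    }
    Some(rgb)
}
```

The size checks run before any allocation, so hostile dimensions never reach `Vec::with_capacity`.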

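Tolerant handling of malformed images

Images with a missing /ColorSpace entry, zero dimensions, or an invalid stream are skipped with a warning instead of crashing page rendering. The same tolerance applies to malformed images nested inside Form XObjects.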
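The skip-with-a-warning behavior can be sketched as a simple filter. The `RawImage` struct and `skip_malformed` function below are hypothetical and exist only for this example; treating an empty stream as the "invalid stream" case is a simplifying assumption, and the real library may check more conditions and report warnings differently.

```rust
/// Hypothetical image descriptor, as it might look before decoding.
struct RawImage {
    width: u32,
    height: u32,
    color_space: Option<String>,
    stream: Vec<u8>,
}

/// Returns the images that look decodable plus a warning for each skipped one.
fn skip_malformed(images: Vec<RawImage>) -> (Vec<RawImage>, Vec<String>) {
    let mut kept = Vec::new();
    let mut warnings = Vec::new();
    for (i, img) in images.into_iter().enumerate() {
        // Find the first reason, if any, to reject this image.
        let problem = if img.color_space.is_none() {
            Some("missing /ColorSpace")
        } else if img.width == 0 || img.height == 0 {
            Some("zero dimensions")
        } else if img.stream.is_empty() {
            Some("empty stream")
        } else {
            None
        };
        match problem {
            Some(reason) => warnings.push(format!("skipping image {}: {}", i, reason)),
            None => kept.push(img),
        }
    }
    (kept, warnings)
}
```

Collecting warnings instead of returning an error keeps one bad image from hiding the valid ones on the same page.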

Quick examples

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for img in images:
    print(f"{img['width']}x{img['height']}")

Node.js

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("report.pdf");
const images = doc.getEmbeddedImages(0);
for (const img of images) {
    console.log(`${img.width}x${img.height}`);
}

Go

import (
    "fmt"

    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
images, _ := doc.Images(0)
for _, img := range images {
    fmt.Printf("%dx%d\n", img.Width, img.Height)
}

C#

using PdfOxide.Core;

using var doc = PdfDocument.Open("report.pdf");
var images = doc.ExtractImages(0);
foreach (var img in images)
{
    Console.WriteLine($"{img.Width}x{img.Height}");
}

WASM

const doc = new WasmPdfDocument(bytes);
const images = doc.extractImages(0);
for (const img of images) {
    console.log(`${img.width}x${img.height}`);
}

Rust

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;
for img in &images {
    println!("{}x{} {:?}", img.width(), img.height(), img.color_space());
}

Java

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.image.ExtractedImage;
import java.nio.file.Path;
import java.util.List;

try (PdfDocument doc = PdfDocument.open(Path.of("report.pdf"))) {
    List<ExtractedImage> images = doc.page(0).images();
    for (ExtractedImage img : images) {
        System.out.println(img.width() + "x" + img.height());
    }
}

Kotlin

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(java.nio.file.Path.of("report.pdf")).use { doc ->
    for (img in doc.page(0).images()) {
        println("${img.width()}x${img.height()}")
    }
}

Scala

import fyi.oxide.pdf.{PdfDocument, imagesSeq}
import scala.util.Using

Using.resource(PdfDocument.open("report.pdf")) { doc =>
  for (img <- doc.page(0).imagesSeq) {
    println(s"${img.width}x${img.height}")
  }
}

Clojure

(require '[pdf-oxide.core :as pdf])

(with-open [doc (pdf/open "report.pdf")]
  (doseq [img (pdf/images (pdf/page doc 0))]
    (println (str (.width img) "x" (.height img)))))

C++

#include <cstdio>
#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("report.pdf");
for (const auto& img : doc.embedded_images(0)) {
    std::printf("%dx%d\n", img.width, img.height);
}

Swift

import PdfOxide

let doc = try Document.open("report.pdf")
for img in try doc.embeddedImages(0) {
    print("\(img.width)x\(img.height)")
}

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('report.pdf');
for (final img in doc.embeddedImages(0)) {
    print('${img.width}x${img.height}');
}

R

library(pdfoxide)

doc <- pdf_open("report.pdf")
for (img in pdf_embedded_images(doc, 0)) {
    cat(sprintf("%dx%d\n", img$width, img$height))
}

Julia

using PdfOxide

doc = open_document("report.pdf")
for img in embedded_images(doc, 0)
    println("$(img.width)x$(img.height)")
end

Zig

const std = @import("std");
const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("report.pdf");
const images = try doc.embeddedImages(a, 0);
for (images) |img| {
    std.debug.print("{d}x{d}\n", .{ img.width, img.height });
}

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
for (POXImage *img in [doc embeddedImages:0 error:&err]) {
    NSLog(@"%ldx%ld", (long)img.width, (long)img.height);
}

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")
{:ok, images} = PdfOxide.embedded_images(doc, 0)
for img <- images do
  IO.puts("#{img.width}x#{img.height}")
end

API reference

`extract_images(page_index) -> Vec<PdfImage>`

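Extracts all images from a page. The page content stream is parsed to find:

XObject images: referenced through the Do operator
Form XObjects: containing nested images (processed recursively, with cycle detection)
Inline images: embedded as BI/ID/EI sequences

CTM (current transformation matrix) tracking provides a bounding box for each image.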
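To make the CTM-based bounding box concrete: in PDF, an image is painted into the unit square, and the CTM maps that square into user space. The sketch below computes the axis-aligned box of the transformed square. The `BBox` struct and `image_bbox` function are illustrative names, not PDF Oxide's API, and the library's own `Rect` type may be laid out differently.

```rust
/// Axis-aligned bounding box in PDF user space.
#[derive(Debug, PartialEq)]
struct BBox {
    x: f64,
    y: f64,
    width: f64,
    height: f64,
}

/// Computes the bounding box of an image painted under `ctm`.
/// `ctm` is the PDF matrix [a b c d e f]; a point (x, y) maps to
/// (a*x + c*y + e, b*x + d*y + f). Images fill the unit square.
fn image_bbox(ctm: [f64; 6]) -> BBox {
    let [a, b, c, d, e, f] = ctm;
    let corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)];
    let (mut min_x, mut min_y) = (f64::INFINITY, f64::INFINITY);
    let (mut max_x, mut max_y) = (f64::NEG_INFINITY, f64::NEG_INFINITY);
    for (x, y) in corners {
        // Transform each corner, then grow the box to contain it.
        let (tx, ty) = (a * x + c * y + e, b * x + d * y + f);
        min_x = min_x.min(tx);
        max_x = max_x.max(tx);
        min_y = min_y.min(ty);
        max_y = max_y.max(ty);
    }
    BBox { x: min_x, y: min_y, width: max_x - min_x, height: max_y - min_y }
}
```

For example, `image_bbox([200.0, 0.0, 0.0, 100.0, 50.0, 700.0])` describes an image scaled to 200 by 100 units with its lower-left corner at (50, 700). Transforming all four corners, rather than just two, keeps the result correct for rotated or skewed placements.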

Parameter	Type	Description
`page_index`	`int` / `usize`	Zero-based page index

Returns: a vector of PdfImage objects.

PdfImage fields and methods

Method / field	Type	Description
`width()`	`u32`	Image width in pixels
`height()`	`u32`	Image height in pixels
`color_space()`	`&ColorSpace`	Color space (DeviceRGB, DeviceGray, DeviceCMYK, etc.)
`bits_per_component()`	`u8`	Bits per color component (usually 8)
`data()`	`&ImageData`	Raw image data (JPEG bytes or raw pixels)
`bbox()`	`Option<&Rect>`	Bounding box in PDF user space (if the CTM was tracked)
`save_as_png(path)`	`Result<()>`	Saves the image as a PNG file
`save_as_jpeg(path)`	`Result<()>`	Saves the image as a JPEG file
`to_png_bytes()`	`Result<Vec<u8>>`	Encodes the image to PNG bytes in memory
`to_jpeg_bytes()`	`Result<Vec<u8>>`	Encodes the image to JPEG bytes in memory

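ColorSpace variants

Variant	Description
`DeviceRGB`	3-channel RGB
`DeviceGray`	Single-channel grayscale
`DeviceCMYK`	4-channel CMYK
`Indexed`	Palette-based color
`ICCBased`	ICC profile-based color
`CalGray`	Calibrated grayscale
`CalRGB`	Calibrated RGB
`Lab`	CIE L*a*b* color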

ImageData variants

Variant	Description
`Jpeg(Vec<u8>)`	JPEG-compressed data (DCT pass-through)
`Raw { pixels, format }`	Decoded pixel data with a `PixelFormat` (RGB, Gray, CMYK, RGBA)

Rust

let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;

for (i, image) in images.iter().enumerate() {
    println!(
        "Image {}: {}x{} {:?} {}bpc",
        i, image.width(), image.height(),
        image.color_space(), image.bits_per_component(),
    );

    if let Some(bbox) = image.bbox() {
        println!("  Position: ({:.1}, {:.1})", bbox.x, bbox.y);
    }

    image.save_as_png(&format!("output/image_{}.png", i))?;
}

`extract_images_to_files(page_index, output_dir, prefix, start_index) -> Vec<ExtractedImageRef>`

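Extracts images from a page and saves them directly to files. JPEG images are saved in their original format (no re-encoding loss); all other images are saved as PNG.

Parameter	Type	Default	Description
`page_index`	`usize`	–	Zero-based page index
`output_dir`	`impl AsRef<Path>`	–	Directory to save images in (created automatically if it does not exist)
`prefix`	`Option<&str>`	`"img"`	File name prefix
`start_index`	`Option<usize>`	`1`	Starting index for file names

Returns: a vector of ExtractedImageRef values describing the saved files.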
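How `prefix` and `start_index` might combine into file names can be sketched as follows. The scheme here is an assumption inferred from the single documented example `"img_001.png"` and the defaults in the table: the three-digit zero padding, the `jpg` extension, and the names `SavedFormat` and `output_names` are all illustrative and are not confirmed by the library.

```rust
/// Output format, mirroring the PNG / JPEG split described above.
enum SavedFormat {
    Png,
    Jpeg,
}

/// Builds file names such as "img_001.png" from an optional prefix and
/// start index, falling back to the documented defaults ("img" and 1).
fn output_names(
    prefix: Option<&str>,
    start_index: Option<usize>,
    formats: &[SavedFormat],
) -> Vec<String> {
    let prefix = prefix.unwrap_or("img");
    let start = start_index.unwrap_or(1);
    formats
        .iter()
        .enumerate()
        .map(|(offset, format)| {
            // Assumed extensions; the real library may choose differently.
            let ext = match format {
                SavedFormat::Png => "png",
                SavedFormat::Jpeg => "jpg",
            };
            format!("{}_{:03}.{}", prefix, start + offset, ext)
        })
        .collect()
}
```

Under these assumptions, passing a per-page prefix such as `page3` keeps files from different pages from colliding in a shared output directory.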

ExtractedImageRef fields

Field	Type	Description
`filename`	`String`	Saved file name (e.g. `"img_001.png"`)
`format`	`ImageFormat`	`Png` or `Jpeg`
`width`	`u32`	Image width in pixels
`height`	`u32`	Image height in pixels

Rust

let mut doc = PdfDocument::open("report.pdf")?;
let refs = doc.extract_images_to_files(0, "output/images", Some("fig"), Some(1))?;

for img_ref in &refs {
    println!("Saved: {} ({}x{}, {:?})", img_ref.filename, img_ref.width, img_ref.height, img_ref.format);
}

Advanced examples

Extract images from all pages

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("book.pdf")?;
let page_count = doc.page_count()?;
let mut total = 0;

for page in 0..page_count {
    let refs = doc.extract_images_to_files(
        page,
        "output/images",
        Some(&format!("page{}", page + 1)),
        Some(1),
    )?;
    total += refs.len();
    println!("Page {}: {} images", page + 1, refs.len());
}
println!("Total: {} images extracted", total);

Get image bytes in memory (no disk I/O)

let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;

for image in &images {
    let png_bytes = image.to_png_bytes()?;
    println!("PNG size: {} bytes", png_bytes.len());

    // Use png_bytes with an HTTP response, database, etc.
}

Filter images by size

let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;

// Only keep images larger than 100x100 pixels
let large_images: Vec<_> = images.iter()
    .filter(|img| img.width() > 100 && img.height() > 100)
    .collect();

println!("{} large images on page 1", large_images.len());
for img in &large_images {
    println!("  {}x{} {:?}", img.width(), img.height(), img.color_space());
}

Distinguish JPEG pass-through from re-encoded images

use pdf_oxide::extractors::ImageData;

let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;

for (i, image) in images.iter().enumerate() {
    match image.data() {
        ImageData::Jpeg(bytes) => {
            // Original JPEG data -- save directly for zero quality loss
            std::fs::write(format!("image_{}.jpg", i), bytes)?;
            println!("Image {}: JPEG pass-through ({} bytes)", i, bytes.len());
        }
        ImageData::Raw { format, .. } => {
            // Raw pixels -- must encode to a file format
            image.save_as_png(&format!("image_{}.png", i))?;
            println!("Image {}: raw {:?} ({}x{})", i, format, image.width(), image.height());
        }
    }
}

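Embedded image accessor (`embedded_images`)

extract_images() is the full-featured in-memory Rust API. The cross-language bindings expose a lighter embedded image accessor built on the same content stream traversal. It returns each image's pixel dimensions, format, color space, bits per component, and raw decoded bytes. Under the hood it is implemented by the C ABI function pdf_document_get_embedded_images and the pdf_oxide_image_* accessor family.

How to list embedded images through the bindings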

Go

import (
    "fmt"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()

images, _ := doc.Images(0) // []pdfoxide.Image
for _, img := range images {
    fmt.Printf("%dx%d %s/%s %dbpc, %d bytes\n",
        img.Width, img.Height, img.Format, img.Colorspace,
        img.BitsPerComponent, len(img.Data))
}

Swift

import PdfOxide

let doc = try Document.open("report.pdf")
let images = try doc.embeddedImages(0) // [Image]
for img in images {
    print("\(img.width)x\(img.height) \(img.format)/\(img.colorspace) "
        + "\(img.bitsPerComponent)bpc, \(img.data.count) bytes")
}

C ABI

#include "pdf_oxide.h"

int32_t err = 0;
FfiImageList *images = pdf_document_get_embedded_images(doc, /*page=*/0, &err);
int32_t n = pdf_oxide_image_count(images);
for (int32_t i = 0; i < n; i++) {
    int32_t w = pdf_oxide_image_get_width(images, i, &err);
    int32_t h = pdf_oxide_image_get_height(images, i, &err);
    char *fmt = pdf_oxide_image_get_format(images, i, &err);
    char *cs  = pdf_oxide_image_get_colorspace(images, i, &err);
    printf("%dx%d %s/%s\n", w, h, fmt, cs);
    free_string(fmt);
    free_string(cs);
}
pdf_oxide_image_list_free(images);

Java

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.image.ExtractedImage;
import java.nio.file.Path;

try (PdfDocument doc = PdfDocument.open(Path.of("report.pdf"))) {
    for (ExtractedImage img : doc.page(0).images()) {
        System.out.printf("%dx%d %s, %d bytes%n",
            img.width(), img.height(), img.format(), img.bytes().length);
    }
}

Kotlin

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(java.nio.file.Path.of("report.pdf")).use { doc ->
    for (img in doc.page(0).images()) {
        println("${img.width()}x${img.height()} ${img.format()}, ${img.bytes().size} bytes")
    }
}

Scala

import fyi.oxide.pdf.{PdfDocument, imagesSeq}
import scala.util.Using

Using.resource(PdfDocument.open("report.pdf")) { doc =>
  for (img <- doc.page(0).imagesSeq) {
    println(s"${img.width}x${img.height} ${img.format}, ${img.bytes.length} bytes")
  }
}

Clojure

(require '[pdf-oxide.core :as pdf])

(with-open [doc (pdf/open "report.pdf")]
  (doseq [img (pdf/images (pdf/page doc 0))]
    (println (format "%dx%d %s, %d bytes"
                     (.width img) (.height img) (.format img) (count (.bytes img))))))

C++

#include <cstdio>
#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("report.pdf");
for (const auto& img : doc.embedded_images(0)) {
    std::printf("%dx%d %s/%s %dbpc, %zu bytes\n",
        img.width, img.height, img.format.c_str(), img.colorspace.c_str(),
        img.bits_per_component, img.data.size());
}

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('report.pdf');
for (final img in doc.embeddedImages(0)) {
    print('${img.width}x${img.height} ${img.format}/${img.colorspace} '
        '${img.bitsPerComponent}bpc, ${img.data.length} bytes');
}

R

library(pdfoxide)

doc <- pdf_open("report.pdf")
for (img in pdf_embedded_images(doc, 0)) {
    cat(sprintf("%dx%d %s/%s %dbpc, %d bytes\n",
        img$width, img$height, img$format, img$colorspace,
        img$bits_per_component, length(img$data)))
}

Julia

using PdfOxide

doc = open_document("report.pdf")
for img in embedded_images(doc, 0)
    println("$(img.width)x$(img.height) $(img.format)/$(img.colorspace) " *
            "$(img.bitsPerComponent)bpc, $(length(img.data)) bytes")
end

Zig

const std = @import("std");
const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("report.pdf");
const images = try doc.embeddedImages(a, 0);
for (images) |img| {
    std.debug.print("{d}x{d} {s}/{s} {d}bpc, {d} bytes\n", .{
        img.width, img.height, img.format, img.colorspace,
        img.bits_per_component, img.data.len,
    });
}

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
for (POXImage *img in [doc embeddedImages:0 error:&err]) {
    NSLog(@"%ldx%ld %@/%@ %ldbpc, %lu bytes",
        (long)img.width, (long)img.height, img.format, img.colorspace,
        (long)img.bitsPerComponent, (unsigned long)img.data.length);
}

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")
{:ok, images} = PdfOxide.embedded_images(doc, 0)
for img <- images do
  IO.puts("#{img.width}x#{img.height} #{img.format}/#{img.colorspace} " <>
          "#{img.bits_per_component}bpc, #{byte_size(img.data)} bytes")
end

Image accessor fields

Field (Go / Swift)	Type	Description
`Width` / `width`	`int`	Image width in pixels
`Height` / `height`	`int`	Image height in pixels
`Format` / `format`	`string`	Source format string (e.g. `"jpeg"`, `"raw"`)
`Colorspace` / `colorspace`	`string`	Color space name (e.g. `"DeviceRGB"`)
`BitsPerComponent` / `bitsPerComponent`	`int`	Bits per color component
`Data` / `data`	`[]byte` / `[UInt8]`	Raw decoded image bytes

Binding coverage. The embedded image accessor is available in Go (doc.Images(page)), Swift (doc.embeddedImages(page)), and the C ABI (pdf_document_get_embedded_images). In Rust, use the richer extract_images() described above. The accessor is not compiled for the WASM target.

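Page element accessor (`page_elements`)

page_elements returns every layout element on a page (text paragraphs, each with a type, text, and bounding box) as a single list. The bindings marshal the whole list in one FFI call through pdf_oxide_elements_to_json, which makes it the most efficient way to walk a page's layout without re-running text extraction region by region. Under the hood it is implemented by the C ABI function pdf_page_get_elements and the pdf_oxide_element_* accessor family.

How to iterate over page layout elements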

Go

import (
    "fmt"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()

elements, _ := doc.PageElements(0) // []pdfoxide.Element
for _, el := range elements {
    fmt.Printf("[%s] %q at (%.1f, %.1f) %.1fx%.1f\n",
        el.Type, el.Text, el.X, el.Y, el.Width, el.Height)
}

Swift

import PdfOxide

let doc = try Document.open("report.pdf")
let elements = try doc.pageElements(0) // ElementList
for el in try elements.all() {
    print("[\(el.type)] \(el.text) at "
        + "(\(el.rect.x), \(el.rect.y)) \(el.rect.width)x\(el.rect.height)")
}

// Serialize the whole list to JSON in one call:
let json = try elements.toJson()

C ABI

#include "pdf_oxide.h"

int32_t err = 0;
FfiElementList *els = pdf_page_get_elements(doc, /*page=*/0, &err);

// One-shot JSON serialization (caller frees with free_string):
char *json = pdf_oxide_elements_to_json(els, &err);
printf("%s\n", json);
free_string(json);

pdf_oxide_elements_free(els);

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('report.pdf');
final elements = doc.pageElements(0); // ElementList
for (final el in elements.toList()) {
    print('[${el.type}] ${el.text} at '
        '(${el.rect.x}, ${el.rect.y}) ${el.rect.width}x${el.rect.height}');
}

// Serialize the whole list to JSON in one call:
final json = elements.toJson();

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
POXElementList *els = [doc pageElements:0 error:&err];
for (int32_t i = 0; i < [els count]; i++) {
    NSString *type = [els typeAtIndex:i error:&err];
    NSString *text = [els textAtIndex:i error:&err];
    POXBbox rect = [els rectAtIndex:i error:&err];
    NSLog(@"[%@] %@ at (%.1f, %.1f) %.1fx%.1f",
        type, text, rect.x, rect.y, rect.width, rect.height);
}

// One-shot JSON serialization:
NSString *json = [els toJsonWithError:&err];

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")
{:ok, els} = PdfOxide.page_elements(doc, 0)
for i <- 0..(PdfOxide.element_count(els) - 1) do
  {:ok, type} = PdfOxide.element_type(els, i)
  {:ok, text} = PdfOxide.element_text(els, i)
  {:ok, rect} = PdfOxide.element_rect(els, i)
  IO.puts("[#{type}] #{text} at (#{rect.x}, #{rect.y}) #{rect.width}x#{rect.height}")
end

# Serialize the whole list to JSON in one call:
{:ok, json} = PdfOxide.elements_to_json(els)

Element fields

Field (Go / Swift)	Type	Description
`Type` / `type`	`string`	Element type (e.g. `"text"`)
`Text` / `text`	`string`	Text content of the element
`X`, `Y` / `rect.x`, `rect.y`	`float`	Origin of the bounding box in PDF user space
`Width`, `Height` / `rect.width`, `rect.height`	`float`	Bounding box dimensions

Binding coverage. page_elements is available in Go (doc.PageElements(page)), Swift (doc.pageElements(page) → ElementList), and the C ABI (pdf_page_get_elements + pdf_oxide_elements_to_json). It is not compiled for the WASM target.

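FAQ

What is the difference between extract_images() and the embedded image accessor? extract_images() (Rust) returns rich PdfImage objects with save_as_png, to_jpeg_bytes, CTM bounding boxes, and typed ColorSpace/ImageData enums. The embedded image accessor (doc.Images / doc.embeddedImages / pdf_document_get_embedded_images) is the cross-language path over the same content stream traversal and returns dimensions, format, color space, and raw bytes as a flat list.

Is image extraction fast? Yes. PDF Oxide's extraction core runs at roughly 0.8 ms mean / 9 ms p99 on the benchmark corpus with a 100% pass rate, and it decodes images in their original color space with no lossy conversion.

Does the embedded image accessor re-encode JPEGs? No. JPEG-based images are returned as the original DCT bytes (format == "jpeg"); only raw pixel data is decoded. The richer extract_images() API exposes the same distinction through ImageData::Jpeg and ImageData::Raw.

Why is data empty for some images? Malformed images (missing /ColorSpace, zero dimensions, truncated streams) are skipped with a warning rather than crashing the page, so their byte buffer may come back empty.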
