What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

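Image extraction

PDF Oxide extracts images from PDF pages by parsing the content stream, resolving XObject references made through the Do operator, recursing into nested Form XObjects, and decoding inline images. Use extract_images() to get in-memory image objects, or extract_images_to_files() to save them directly to disk as PNG or JPEG.

Since v0.3.5, image extraction processes the full page content stream rather than only scanning the XObject dictionary. This correctly handles images placed with the Do operator, nested Form XObjects (with cycle detection), and inline images embedded as BI/ID/EI sequences.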
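The recursion with cycle detection described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not PDF Oxide's implementation: the XObject enum, the single shared resources map, and collect_images are hypothetical names, and a real parser would resolve each form's own /Resources dictionary.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified model of an XObject (not a pdf_oxide type).
#[derive(Debug, Clone)]
pub enum XObject {
    /// An image XObject, reduced here to its pixel dimensions.
    Image { width: u32, height: u32 },
    /// A Form XObject whose content stream invokes other XObjects via `Do`.
    Form { invokes: Vec<String> },
}

/// Collects (width, height) for every image reachable from `names`, in paint order.
/// `active` holds the forms currently being expanded, so a form that directly or
/// indirectly invokes itself is skipped instead of recursing forever.
pub fn collect_images(
    names: &[String],
    resources: &HashMap<String, XObject>,
    active: &mut HashSet<String>,
    out: &mut Vec<(u32, u32)>,
) {
    for name in names {
        match resources.get(name) {
            Some(XObject::Image { width, height }) => out.push((*width, *height)),
            Some(XObject::Form { invokes }) => {
                // `insert` returns false when the form is already on the stack: a cycle.
                if active.insert(name.clone()) {
                    collect_images(invokes, resources, active, out);
                    active.remove(name);
                }
            }
            None => {} // unresolved name: nothing to paint
        }
    }
}
```

Tracking only the forms on the current expansion path, rather than every form ever visited, lets the same form be painted legitimately more than once on a page while still breaking self-references.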

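Color space support

Extracted images are decoded in their original color space, with no lossy conversion.

DeviceRGB / DeviceGray / DeviceCMYK: returned as-is.
Indexed (1, 2, 4, or 8 bits per component): the palette is resolved with resolve_indexed_palette and expanded with expand_indexed_to_rgb. Indexed palettes with an RGB, grayscale, or CMYK base color space are supported. Previously, many real-world PDFs failed with an Invalid RGB image dimensions error.
CalRGB / CalGray / ICCBased: converted to RGB at decode time.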
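To make the Indexed case concrete, here is a minimal sketch of expanding packed palette indices to RGB. It assumes an RGB palette that has already been resolved, and relies on the PDF conventions that samples are packed high-order bits first and that each row is padded to a whole byte. The function name expand_indexed_sketch is hypothetical and is not the library's expand_indexed_to_rgb.

```rust
/// Expands packed palette indices to an RGB byte buffer.
/// Returns None for unsupported bit depths, zero-size images,
/// truncated data, or an index outside the palette.
pub fn expand_indexed_sketch(
    data: &[u8],
    width: usize,
    height: usize,
    bits: u8,
    palette: &[[u8; 3]],
) -> Option<Vec<u8>> {
    if !matches!(bits, 1 | 2 | 4 | 8) || width == 0 || height == 0 {
        return None;
    }
    let bits = bits as usize;
    // Each row of samples is padded to a whole number of bytes.
    let row_bytes = width.checked_mul(bits)?.checked_add(7)? / 8;
    let mask = ((1u16 << bits) - 1) as u8;
    let mut rgb = Vec::new();
    for y in 0..height {
        let start = y.checked_mul(row_bytes)?;
        let row = data.get(start..start.checked_add(row_bytes)?)?;
        for x in 0..width {
            let bit_pos = x * bits;
            // Samples are stored high-order bits first within each byte.
            let shift = 8 - bits - (bit_pos % 8);
            let index = (row[bit_pos / 8] >> shift) & mask;
            rgb.extend_from_slice(palette.get(index as usize)?);
        }
    }
    Some(rgb)
}
```

For example, a 2x2 one-bit image with a black and white palette expands to 12 bytes, three per pixel.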

Palette expansion is hardened against malicious input with checked_mul overflow guards and a 256 MiB allocation cap, and incomplete streams are rejected cleanly instead of producing invalid pixels.
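The guards mentioned above might look something like the following sketch. The helper name checked_rgb_len, its error strings, and the choice to validate before allocating are assumptions for illustration; only checked_mul and the 256 MiB figure come from the text.

```rust
/// Allocation ceiling for an expanded image (the 256 MiB figure from the text).
pub const MAX_EXPANDED_BYTES: usize = 256 * 1024 * 1024;

/// Validates an Indexed image before expansion and returns the size of the
/// RGB output buffer. Hypothetical helper, not pdf_oxide's actual code.
pub fn checked_rgb_len(
    width: u32,
    height: u32,
    bits: u8,
    packed_len: usize,
) -> Result<usize, &'static str> {
    if !matches!(bits, 1 | 2 | 4 | 8) {
        return Err("unsupported bits per component");
    }
    // Every multiplication is checked so hostile dimensions cannot wrap around.
    let pixels = (width as usize)
        .checked_mul(height as usize)
        .ok_or("dimension overflow")?;
    let out_len = pixels.checked_mul(3).ok_or("dimension overflow")?;
    if out_len == 0 {
        return Err("zero-size image");
    }
    if out_len > MAX_EXPANDED_BYTES {
        return Err("expanded image exceeds allocation cap");
    }
    // Rows of packed samples are padded to a whole byte.
    let row_bytes = (width as usize)
        .checked_mul(bits as usize)
        .and_then(|b| b.checked_add(7))
        .ok_or("dimension overflow")?
        / 8;
    let needed = row_bytes
        .checked_mul(height as usize)
        .ok_or("dimension overflow")?;
    if packed_len < needed {
        return Err("truncated stream");
    }
    Ok(out_len)
}
```

Running all checks before any allocation means a hostile header costs only a few integer operations.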

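Resilience to malformed images

Images with a missing /ColorSpace entry, zero size, or an invalid stream are skipped with a warning, and page rendering never panics. The same resilience applies to malformed images nested inside Form XObjects.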
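As a rough illustration of the skip-with-warning behavior, the sketch below partitions candidate images into usable entries and warning messages. ImageEntry, rejection_reason, and partition_images are hypothetical names, and the warning wording is invented for the example.

```rust
/// Hypothetical stand-in for an image dictionary plus its stream
/// (illustrative only; not a pdf_oxide type).
#[derive(Debug, Clone, PartialEq)]
pub struct ImageEntry {
    pub color_space: Option<String>,
    pub width: u32,
    pub height: u32,
    pub stream: Vec<u8>,
}

/// Returns a human-readable reason when the entry cannot be decoded.
pub fn rejection_reason(entry: &ImageEntry) -> Option<&'static str> {
    if entry.color_space.is_none() {
        Some("missing /ColorSpace")
    } else if entry.width == 0 || entry.height == 0 {
        Some("zero-size image")
    } else if entry.stream.is_empty() {
        Some("empty stream")
    } else {
        None
    }
}

/// Splits entries into usable images and warnings; it never panics.
pub fn partition_images(entries: Vec<ImageEntry>) -> (Vec<ImageEntry>, Vec<String>) {
    let mut kept = Vec::new();
    let mut warnings = Vec::new();
    for (i, entry) in entries.into_iter().enumerate() {
        match rejection_reason(&entry) {
            Some(reason) => warnings.push(format!("skipping image {i}: {reason}")),
            None => kept.push(entry),
        }
    }
    (kept, warnings)
}
```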

Quick examples

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for img in images:
    print(f"{img['width']}x{img['height']}")

Node.js

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("report.pdf");
const images = doc.getEmbeddedImages(0);
for (const img of images) {
    console.log(`${img.width}x${img.height}`);
}

Go

import pdfoxide "github.com/yfedoseev/pdf_oxide/go"

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
images, _ := doc.Images(0)
for _, img := range images {
    fmt.Printf("%dx%d\n", img.Width, img.Height)
}

C#

using PdfOxide.Core;

using var doc = PdfDocument.Open("report.pdf");
var images = doc.ExtractImages(0);
foreach (var img in images)
{
    Console.WriteLine($"{img.Width}x{img.Height}");
}

WASM

const doc = new WasmPdfDocument(bytes);
const images = doc.extractImages(0);
for (const img of images) {
    console.log(`${img.width}x${img.height}`);
}

Rust

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;
for img in &images {
    println!("{}x{} {:?}", img.width(), img.height(), img.color_space());
}

Java

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.image.ExtractedImage;
import java.nio.file.Path;
import java.util.List;

try (PdfDocument doc = PdfDocument.open(Path.of("report.pdf"))) {
    List<ExtractedImage> images = doc.page(0).images();
    for (ExtractedImage img : images) {
        System.out.println(img.width() + "x" + img.height());
    }
}

Kotlin

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(java.nio.file.Path.of("report.pdf")).use { doc ->
    for (img in doc.page(0).images()) {
        println("${img.width()}x${img.height()}")
    }
}

Scala

import fyi.oxide.pdf.{PdfDocument, imagesSeq}
import scala.util.Using

Using.resource(PdfDocument.open("report.pdf")) { doc =>
  for (img <- doc.page(0).imagesSeq) {
    println(s"${img.width}x${img.height}")
  }
}

Clojure

(require '[pdf-oxide.core :as pdf])

(with-open [doc (pdf/open "report.pdf")]
  (doseq [img (pdf/images (pdf/page doc 0))]
    (println (str (.width img) "x" (.height img)))))

C++

#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("report.pdf");
for (const auto& img : doc.embedded_images(0)) {
    std::printf("%dx%d\n", img.width, img.height);
}

Swift

import PdfOxide

let doc = try Document.open("report.pdf")
for img in try doc.embeddedImages(0) {
    print("\(img.width)x\(img.height)")
}

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('report.pdf');
for (final img in doc.embeddedImages(0)) {
    print('${img.width}x${img.height}');
}

R

library(pdfoxide)

doc <- pdf_open("report.pdf")
for (img in pdf_embedded_images(doc, 0)) {
    cat(sprintf("%dx%d\n", img$width, img$height))
}

Julia

using PdfOxide

doc = open_document("report.pdf")
for img in embedded_images(doc, 0)
    println("$(img.width)x$(img.height)")
end

Zig

const std = @import("std");
const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("report.pdf");
const images = try doc.embeddedImages(a, 0);
for (images) |img| {
    std.debug.print("{d}x{d}\n", .{ img.width, img.height });
}

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
for (POXImage *img in [doc embeddedImages:0 error:&err]) {
    NSLog(@"%ldx%ld", (long)img.width, (long)img.height);
}

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")
{:ok, images} = PdfOxide.embedded_images(doc, 0)
for img <- images do
  IO.puts("#{img.width}x#{img.height}")
end

API reference

`extract_images(page_index) -> Vec<PdfImage>`

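Extracts all images from a page. It parses the page content stream and looks for the following:

XObject images: referenced by the Do operator
Form XObjects: containing nested images (recursive processing with cycle detection)
Inline images: embedded as BI/ID/EI sequences

CTM (Current Transformation Matrix) tracking provides a bounding box for each image.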
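The bounding box follows from standard PDF geometry: an image is painted into the unit square, and the CTM [a b c d e f] maps that square into user space. The sketch below computes the axis-aligned box of the four transformed corners. Matrix, Rect, and image_bbox are hypothetical names for illustration and may not match the library's own types.

```rust
/// A PDF transformation matrix [a b c d e f] (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Matrix {
    pub a: f64,
    pub b: f64,
    pub c: f64,
    pub d: f64,
    pub e: f64,
    pub f: f64,
}

/// Axis-aligned rectangle in PDF user space (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub x: f64,
    pub y: f64,
    pub width: f64,
    pub height: f64,
}

impl Matrix {
    /// Maps a point: x' = a*x + c*y + e, y' = b*x + d*y + f.
    pub fn apply(&self, x: f64, y: f64) -> (f64, f64) {
        (
            self.a * x + self.c * y + self.e,
            self.b * x + self.d * y + self.f,
        )
    }
}

/// Bounding box of the unit square after applying `ctm`.
pub fn image_bbox(ctm: &Matrix) -> Rect {
    let corners =
        [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)].map(|(x, y)| ctm.apply(x, y));
    let min_x = corners.iter().map(|p| p.0).fold(f64::INFINITY, f64::min);
    let max_x = corners.iter().map(|p| p.0).fold(f64::NEG_INFINITY, f64::max);
    let min_y = corners.iter().map(|p| p.1).fold(f64::INFINITY, f64::min);
    let max_y = corners.iter().map(|p| p.1).fold(f64::NEG_INFINITY, f64::max);
    Rect { x: min_x, y: min_y, width: max_x - min_x, height: max_y - min_y }
}
```

Taking the min and max over all four corners keeps the result correct for rotated or mirrored placements, where the origin corner is not the lower-left one.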

Parameter	Type	Description
`page_index`	`int` / `usize`	Zero-based page index

Returns: a vector of PdfImage objects.

PdfImage fields and methods

Method / field	Type	Description
`width()`	`u32`	Image width in pixels
`height()`	`u32`	Image height in pixels
`color_space()`	`&ColorSpace`	Color space (DeviceRGB, DeviceGray, DeviceCMYK, etc.)
`bits_per_component()`	`u8`	Bits per color component (usually 8)
`data()`	`&ImageData`	Raw image data (JPEG bytes or raw pixels)
`bbox()`	`Option<&Rect>`	Bounding box in PDF user space (when the CTM is tracked)
`save_as_png(path)`	`Result<()>`	Saves the image as a PNG file
`save_as_jpeg(path)`	`Result<()>`	Saves the image as a JPEG file
`to_png_bytes()`	`Result<Vec<u8>>`	Encodes to PNG bytes in memory
`to_jpeg_bytes()`	`Result<Vec<u8>>`	Encodes to JPEG bytes in memory

ColorSpace variants

Variant	Description
`DeviceRGB`	Three-channel RGB
`DeviceGray`	Single-channel grayscale
`DeviceCMYK`	Four-channel CMYK
`Indexed`	Palette-based color
`ICCBased`	ICC profile-based color
`CalGray`	Calibrated grayscale
`CalRGB`	Calibrated RGB
`Lab`	CIE L*a*b* color

ImageData variants

Variant	Description
`Jpeg(Vec<u8>)`	JPEG-compressed data (DCT pass-through)
`Raw { pixels, format }`	Decoded pixel data with a `PixelFormat` (RGB, Gray, CMYK, RGBA)

Rust

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;

for (i, image) in images.iter().enumerate() {
    println!(
        "Image {}: {}x{} {:?} {}bpc",
        i, image.width(), image.height(),
        image.color_space(), image.bits_per_component(),
    );

    if let Some(bbox) = image.bbox() {
        println!("  Position: ({:.1}, {:.1})", bbox.x, bbox.y);
    }

    image.save_as_png(&format!("output/image_{}.png", i))?;
}

`extract_images_to_files(page_index, output_dir, prefix, start_index) -> Vec<ExtractedImageRef>`

Extracts images from a page and saves them directly to files. JPEG images are saved in their original format (no re-encoding, so no quality loss). All other images are saved as PNG.

Parameter	Type	Default	Description
`page_index`	`usize`	–	Zero-based page index
`output_dir`	`impl AsRef<Path>`	–	Directory to save images to (created if it does not exist)
`prefix`	`Option<&str>`	`"img"`	Filename prefix
`start_index`	`Option<usize>`	`1`	Starting index for filenames

Returns: a vector of ExtractedImageRef describing the saved files.

ExtractedImageRef fields

Field	Type	Description
`filename`	`String`	Saved filename (e.g. `"img_001.png"`)
`format`	`ImageFormat`	`Png` or `Jpeg`
`width`	`u32`	Image width in pixels
`height`	`u32`	Image height in pixels
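To show how the prefix and start_index defaults could combine into a filename such as "img_001.png", here is a small sketch. The underscore separator and three-digit zero padding are inferred from that one example, the ".jpg" extension is an assumption, and output_filename together with this local ImageFormat enum are hypothetical rather than the library's own code.

```rust
/// Local stand-in for the output format (mirrors the `Png` / `Jpeg` names above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ImageFormat {
    Png,
    Jpeg,
}

/// Builds the filename for the image at `position` (0-based) on a page.
/// The defaults follow the parameter table: prefix "img", start index 1.
pub fn output_filename(
    prefix: Option<&str>,
    start_index: Option<usize>,
    position: usize,
    format: ImageFormat,
) -> String {
    let prefix = prefix.unwrap_or("img");
    let index = start_index.unwrap_or(1) + position;
    // Assumed extensions: JPEG data is written as-is, everything else as PNG.
    let ext = match format {
        ImageFormat::Png => "png",
        ImageFormat::Jpeg => "jpg",
    };
    format!("{prefix}_{index:03}.{ext}")
}
```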

Rust

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("report.pdf")?;
let refs = doc.extract_images_to_files(0, "output/images", Some("fig"), Some(1))?;

for img_ref in &refs {
    println!("Saved: {} ({}x{}, {:?})", img_ref.filename, img_ref.width, img_ref.height, img_ref.format);
}

Advanced examples

Extract images from every page

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("book.pdf")?;
let page_count = doc.page_count()?;
let mut total = 0;

for page in 0..page_count {
    let refs = doc.extract_images_to_files(
        page,
        "output/images",
        Some(&format!("page{}", page + 1)),
        Some(1),
    )?;
    total += refs.len();
    println!("Page {}: {} images", page + 1, refs.len());
}
println!("Total: {} images extracted", total);

Get image bytes in memory (no disk I/O)

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;

for image in &images {
    let png_bytes = image.to_png_bytes()?;
    println!("PNG size: {} bytes", png_bytes.len());

    // Use png_bytes with an HTTP response, database, etc.
}

Filter images by size

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;

// Only keep images larger than 100x100 pixels
let large_images: Vec<_> = images.iter()
    .filter(|img| img.width() > 100 && img.height() > 100)
    .collect();

println!("{} large images on page 1", large_images.len());
for img in &large_images {
    println!("  {}x{} {:?}", img.width(), img.height(), img.color_space());
}

Distinguish JPEG pass-through from re-encoded images

use pdf_oxide::extractors::ImageData;
use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;

for (i, image) in images.iter().enumerate() {
    match image.data() {
        ImageData::Jpeg(bytes) => {
            // Original JPEG data -- save directly for zero quality loss
            std::fs::write(format!("image_{}.jpg", i), bytes)?;
            println!("Image {}: JPEG pass-through ({} bytes)", i, bytes.len());
        }
        ImageData::Raw { format, .. } => {
            // Raw pixels -- must encode to a file format
            image.save_as_png(&format!("image_{}.png", i))?;
            println!("Image {}: raw {:?} ({}x{})", i, format, image.width(), image.height());
        }
    }
}

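Embedded image accessor (`embedded_images`)

extract_images() is the Rust API and returns rich in-memory data. The cross-language bindings provide a lighter-weight embedded image accessor built on the same content-stream walk. For each image it returns the pixel dimensions, format, color space, bits per component, and the raw decoded bytes. It is implemented by the C ABI function pdf_document_get_embedded_images and the pdf_oxide_image_* accessor family.

Listing embedded images from the bindings

Go

import (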
    "fmt"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()

images, _ := doc.Images(0) // []pdfoxide.Image
for _, img := range images {
    fmt.Printf("%dx%d %s/%s %dbpc, %d bytes\n",
        img.Width, img.Height, img.Format, img.Colorspace,
        img.BitsPerComponent, len(img.Data))
}

Swift

import PdfOxide

let doc = try Document.open("report.pdf")
let images = try doc.embeddedImages(0) // [Image]
for img in images {
    print("\(img.width)x\(img.height) \(img.format)/\(img.colorspace) "
        + "\(img.bitsPerComponent)bpc, \(img.data.count) bytes")
}

C ABI

#include "pdf_oxide.h"

int32_t err = 0;
FfiImageList *images = pdf_document_get_embedded_images(doc, /*page=*/0, &err);
int32_t n = pdf_oxide_image_count(images);
for (int32_t i = 0; i < n; i++) {
    int32_t w = pdf_oxide_image_get_width(images, i, &err);
    int32_t h = pdf_oxide_image_get_height(images, i, &err);
    char *fmt = pdf_oxide_image_get_format(images, i, &err);
    char *cs  = pdf_oxide_image_get_colorspace(images, i, &err);
    printf("%dx%d %s/%s\n", w, h, fmt, cs);
    free_string(fmt);
    free_string(cs);
}
pdf_oxide_image_list_free(images);

Java

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.image.ExtractedImage;
import java.nio.file.Path;

try (PdfDocument doc = PdfDocument.open(Path.of("report.pdf"))) {
    for (ExtractedImage img : doc.page(0).images()) {
        System.out.printf("%dx%d %s, %d bytes%n",
            img.width(), img.height(), img.format(), img.bytes().length);
    }
}

Kotlin

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(java.nio.file.Path.of("report.pdf")).use { doc ->
    for (img in doc.page(0).images()) {
        println("${img.width()}x${img.height()} ${img.format()}, ${img.bytes().size} bytes")
    }
}

Scala

import fyi.oxide.pdf.{PdfDocument, imagesSeq}
import scala.util.Using

Using.resource(PdfDocument.open("report.pdf")) { doc =>
  for (img <- doc.page(0).imagesSeq) {
    println(s"${img.width}x${img.height} ${img.format}, ${img.bytes.length} bytes")
  }
}

Clojure

(require '[pdf-oxide.core :as pdf])

(with-open [doc (pdf/open "report.pdf")]
  (doseq [img (pdf/images (pdf/page doc 0))]
    (println (format "%dx%d %s, %d bytes"
                     (.width img) (.height img) (.format img) (count (.bytes img))))))

C++

#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("report.pdf");
for (const auto& img : doc.embedded_images(0)) {
    std::printf("%dx%d %s/%s %dbpc, %zu bytes\n",
        img.width, img.height, img.format.c_str(), img.colorspace.c_str(),
        img.bits_per_component, img.data.size());
}

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('report.pdf');
for (final img in doc.embeddedImages(0)) {
    print('${img.width}x${img.height} ${img.format}/${img.colorspace} '
        '${img.bitsPerComponent}bpc, ${img.data.length} bytes');
}

R

library(pdfoxide)

doc <- pdf_open("report.pdf")
for (img in pdf_embedded_images(doc, 0)) {
    cat(sprintf("%dx%d %s/%s %dbpc, %d bytes\n",
        img$width, img$height, img$format, img$colorspace,
        img$bits_per_component, length(img$data)))
}

Julia

using PdfOxide

doc = open_document("report.pdf")
for img in embedded_images(doc, 0)
    println("$(img.width)x$(img.height) $(img.format)/$(img.colorspace) " *
            "$(img.bitsPerComponent)bpc, $(length(img.data)) bytes")
end

Zig

const std = @import("std");
const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("report.pdf");
const images = try doc.embeddedImages(a, 0);
for (images) |img| {
    std.debug.print("{d}x{d} {s}/{s} {d}bpc, {d} bytes\n", .{
        img.width, img.height, img.format, img.colorspace,
        img.bits_per_component, img.data.len,
    });
}

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
for (POXImage *img in [doc embeddedImages:0 error:&err]) {
    NSLog(@"%ldx%ld %@/%@ %ldbpc, %lu bytes",
        (long)img.width, (long)img.height, img.format, img.colorspace,
        (long)img.bitsPerComponent, (unsigned long)img.data.length);
}

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")
{:ok, images} = PdfOxide.embedded_images(doc, 0)
for img <- images do
  IO.puts("#{img.width}x#{img.height} #{img.format}/#{img.colorspace} " <>
          "#{img.bits_per_component}bpc, #{byte_size(img.data)} bytes")
end

Image accessor fields

Field (Go / Swift)	Type	Description
`Width` / `width`	`int`	Image width in pixels
`Height` / `height`	`int`	Image height in pixels
`Format` / `format`	`string`	Source format string (e.g. `"jpeg"`, `"raw"`)
`Colorspace` / `colorspace`	`string`	Color space name (e.g. `"DeviceRGB"`)
`BitsPerComponent` / `bitsPerComponent`	`int`	Bits per color component
`Data` / `data`	`[]byte` / `[UInt8]`	Raw decoded image bytes

Binding coverage. The embedded image accessor is exposed in Go (doc.Images(page)), Swift (doc.embeddedImages(page)), and the C ABI (pdf_document_get_embedded_images). In Rust, use the richer extract_images() described above. This accessor is not compiled for the WASM target.

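Page element accessor (`page_elements`)

page_elements returns every layout element on a page (text spans, with their type, text, and bounding box) as a single list. The bindings marshal the whole list in one FFI call through pdf_oxide_elements_to_json, which makes this the most efficient way to walk a page's layout without re-running text extraction for each region. It is implemented by the C ABI function pdf_page_get_elements and the pdf_oxide_element_* accessor family.

Walking a page's layout elements

Go

import (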
    "fmt"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()

elements, _ := doc.PageElements(0) // []pdfoxide.Element
for _, el := range elements {
    fmt.Printf("[%s] %q at (%.1f, %.1f) %.1fx%.1f\n",
        el.Type, el.Text, el.X, el.Y, el.Width, el.Height)
}

Swift

import PdfOxide

let doc = try Document.open("report.pdf")
let elements = try doc.pageElements(0) // ElementList
for el in try elements.all() {
    print("[\(el.type)] \(el.text) at "
        + "(\(el.rect.x), \(el.rect.y)) \(el.rect.width)x\(el.rect.height)")
}

// Serialize the whole list to JSON in one call:
let json = try elements.toJson()

C ABI

#include "pdf_oxide.h"

int32_t err = 0;
FfiElementList *els = pdf_page_get_elements(doc, /*page=*/0, &err);

// One-shot JSON serialization (caller frees with free_string):
char *json = pdf_oxide_elements_to_json(els, &err);
printf("%s\n", json);
free_string(json);

pdf_oxide_elements_free(els);

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('report.pdf');
final elements = doc.pageElements(0); // ElementList
for (final el in elements.toList()) {
    print('[${el.type}] ${el.text} at '
        '(${el.rect.x}, ${el.rect.y}) ${el.rect.width}x${el.rect.height}');
}

// Serialize the whole list to JSON in one call:
final json = elements.toJson();

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
POXElementList *els = [doc pageElements:0 error:&err];
for (int32_t i = 0; i < [els count]; i++) {
    NSString *type = [els typeAtIndex:i error:&err];
    NSString *text = [els textAtIndex:i error:&err];
    POXBbox rect = [els rectAtIndex:i error:&err];
    NSLog(@"[%@] %@ at (%.1f, %.1f) %.1fx%.1f",
        type, text, rect.x, rect.y, rect.width, rect.height);
}

// One-shot JSON serialization:
NSString *json = [els toJsonWithError:&err];

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")
{:ok, els} = PdfOxide.page_elements(doc, 0)
for i <- 0..(PdfOxide.element_count(els) - 1) do
  {:ok, type} = PdfOxide.element_type(els, i)
  {:ok, text} = PdfOxide.element_text(els, i)
  {:ok, rect} = PdfOxide.element_rect(els, i)
  IO.puts("[#{type}] #{text} at (#{rect.x}, #{rect.y}) #{rect.width}x#{rect.height}")
end

# Serialize the whole list to JSON in one call:
{:ok, json} = PdfOxide.elements_to_json(els)

Element fields

Field (Go / Swift)	Type	Description
`Type` / `type`	`string`	Element type (e.g. `"text"`)
`Text` / `text`	`string`	Text content of the element
`X`, `Y` / `rect.x`, `rect.y`	`float`	Origin of the bounding box in PDF user space
`Width`, `Height` / `rect.width`, `rect.height`	`float`	Size of the bounding box

Binding coverage. page_elements is exposed in Go (doc.PageElements(page)), Swift (doc.pageElements(page) → ElementList), and the C ABI (pdf_page_get_elements + pdf_oxide_elements_to_json). It is not compiled for the WASM target.
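Once the element list has been fetched, a region query needs only the bounding boxes, with no further extraction calls. The sketch below assumes the list has already been loaded into a plain struct. Element, Region, and elements_in_region are hypothetical names, and no Rust binding for page_elements is implied.

```rust
/// Hypothetical plain-data copy of one page element (type, text, bounding box).
#[derive(Debug, Clone, PartialEq)]
pub struct Element {
    pub kind: String,
    pub text: String,
    pub x: f64,
    pub y: f64,
    pub width: f64,
    pub height: f64,
}

/// A query rectangle in PDF user space.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Region {
    pub x: f64,
    pub y: f64,
    pub width: f64,
    pub height: f64,
}

/// Returns the elements whose bounding boxes overlap `region`.
pub fn elements_in_region<'a>(elements: &'a [Element], region: &Region) -> Vec<&'a Element> {
    elements
        .iter()
        .filter(|el| {
            // Two axis-aligned boxes overlap when they overlap on both axes.
            el.x < region.x + region.width
                && region.x < el.x + el.width
                && el.y < region.y + region.height
                && region.y < el.y + el.height
        })
        .collect()
}
```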

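Frequently asked questions

What is the difference between extract_images() and the embedded image accessor? extract_images() (Rust) returns rich PdfImage objects with save_as_png, to_jpeg_bytes, CTM bounding boxes, and typed ColorSpace/ImageData enums. The embedded image accessor (doc.Images / doc.embeddedImages / pdf_document_get_embedded_images) is the cross-language path to the same content-stream walk and returns a flat list of dimensions, format, color space, and raw bytes.

Is image extraction fast? Yes. PDF Oxide's extraction core runs at a mean of 0.8ms (p99 9ms) on the benchmark corpus with a 100% pass rate, and it decodes images in their original color space without lossy conversion.

Does the embedded image accessor re-encode JPEGs? No. JPEG-backed images are returned as their original DCT bytes (format == "jpeg"); only raw pixel data is decoded. The richer extract_images() API exposes the same distinction as ImageData::Jpeg versus ImageData::Raw.

Why is data empty for some images? Malformed images (missing /ColorSpace, zero size, incomplete stream) are skipped with a warning instead of causing a panic, so an empty byte buffer may be returned.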
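Taken together, the format string and a possibly empty buffer suggest a simple decision per image when consuming the accessor. The sketch below is one hedged way to express it: Disposition and disposition are hypothetical names, the "jpeg" and "raw" strings come from the field table, and the check for the JPEG start-of-image marker (FF D8 followed by FF) is an extra sanity test, not something the library is stated to perform.

```rust
/// What a caller might do with one image returned by the accessor.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Disposition {
    /// Bytes are an original JPEG stream and can be written to disk unchanged.
    WriteJpegAsIs,
    /// Bytes are decoded pixels and need encoding (for example to PNG) first.
    EncodeFirst,
    /// Nothing usable: empty buffer or a format this sketch does not recognize.
    Skip,
}

/// Chooses a disposition from the accessor's `format` string and data buffer.
pub fn disposition(format: &str, data: &[u8]) -> Disposition {
    if data.is_empty() {
        return Disposition::Skip;
    }
    match format {
        // A JPEG stream starts with the SOI marker FF D8, followed by another marker (FF ..).
        "jpeg" if data.starts_with(&[0xFF, 0xD8, 0xFF]) => Disposition::WriteJpegAsIs,
        "raw" => Disposition::EncodeFirst,
        _ => Disposition::Skip,
    }
}
```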
