
Changelog

All notable changes to PDF Oxide are documented here.


v0.3.38 – 2026-04-22

DocumentBuilder lands in every binding; AES-256 on the write path; signature verification; multi-target WASM; Go purego backend

Write-side API parity across all bindings (#384)

  • DocumentBuilder + FluentPageBuilder + EmbeddedFont now ship in Python, Node/TypeScript, C#, Go, and WASM alongside Rust. Multi-page construction with full CJK / Cyrillic / Greek support through embedded fonts. Closes #382 cross-language.
  • 15 annotation methods on every binding: link_url / link_page / link_named, highlight, underline, strikeout, squiggly, sticky note, stamp (14 standard + custom), free text, watermark (custom / DRAFT / CONFIDENTIAL).
  • 5 AcroForm widget types on every binding: text_field, checkbox, combo_box, radio_group, push_button.
  • Graphics primitives on every binding: rect, filled_rect, line.
  • HTML+CSS pipeline – Pdf.from_html_css(...) and from_html_css_with_fonts(...) for multi-font cascades in every binding.

AES-256 encryption on the write path (#386)

  • save_encrypted(path, user_pw, owner_pw) / to_bytes_encrypted(user_pw, owner_pw) on DocumentBuilder in every binding.
  • save_with_encryption in Rust for custom algorithm + permissions.

Real font subsetting (#385 / FONT-3b)

  • CJK faces are now embedded as a subset rather than the full face. A 5-character PDF built from a ~17 MB CJK font typically ships under 100 KB. Content streams, /W widths, and the ToUnicode CMap are re-keyed onto the subset GID space; extract_text round-trips unchanged.
  • Internal writer API change: EmbeddedFont::encode_string / encode_shaped_run return Vec<u16> and build_embedded_font_objects returns a GlyphRemapper that callers pass to ContentStreamBuilder::build_with_remappers. No change to high-level APIs.

Digital signature verification (#208, verification half)

  • Signature.verify() and Signature.verify_detached(pdf_bytes) (and binding-native equivalents) in every binding. RFC 5652 §5.4 signer-attributes + §11.2 messageDigest checks.
  • RSA-PKCS#1 v1.5 over SHA-1 / SHA-256 / SHA-384 / SHA-512 returns Valid / Invalid. RSA-PSS and ECDSA surface as Unknown / UnsupportedFeatureException; callers can still read the certificate and run their own check.
  • Certificate – DER inspection (subject, issuer, serial, validity, is_valid) via x509-parser – every binding.
  • Signature – enumerate + inspect + .get_certificate() – every binding.
  • Timestamp – RFC 3161 TSTInfo parsing (time, serial, policy, TSA name, hash algorithm, message imprint) – every binding.
  • TsaClient – RFC 3161 HTTP POST with nonce and HTTP Basic auth behind a tsa-client Cargo feature – every binding except WASM. Intentionally not wired on WASM (ureq is wasm-incompatible).
  • DocumentEditor::set_producer / set_creation_date metadata writers.
  • render_page_region and render_page_fit — clipped and fitted rendering surface.
  • Bicubic image filtering (pdf.js #19978 parity) — scanned / bilevel pages with Multiply-blended overlays no longer collapse their grayscale range on downscale.

Signing itself (as opposed to verification) is not covered; #208 remains open for that half.

Multi-target WASM packaging (#392)

  • pdf-oxide-wasm now ships three builds side-by-side with package.json conditional exports: nodejs/, bundler/ (Vite / webpack / Rollup / esbuild / Bun), and web/ (browsers / Deno / Cloudflare Workers).
  • Fixes the ReferenceError: Can't find variable: __dirname thrown under browser bundlers.
  • Subpath imports (pdf-oxide-wasm/web, /nodejs, /bundler) available for manual routing.

Go binding — purego backend + cache-dir install

  • A second backend built on ebitengine/purego dlopens libpdf_oxide.{so,dylib,dll} at runtime, so CGO_ENABLED=0 builds now work. Backend selection is automatic: //go:build cgo → full CGo API, //go:build !cgo → purego.
  • Purego surface: PdfDocument open (path / bytes / password), page count, version, text / Markdown / HTML / plain-text extraction, fonts, annotations, page elements, search, page dimensions, logging, plus PdfCreator.FromMarkdown for test fixtures.
  • CGo-only (compile-time error under !cgo): DocumentEditor, DocumentBuilder, barcodes, signatures, TSA, rendering, OCR, form mutation.
  • Installer: new -shared flag fetches the cdylib instead of the staticlib and prints CGO_ENABLED=0 + PDF_OXIDE_LIB_PATH=… to export.
  • Install dir moved to os.UserCacheDir(): ~/.cache/pdf_oxide (Linux), ~/Library/Caches/pdf_oxide (macOS), %LocalAppData%\pdf_oxide (Windows). Matches Go’s own GOCACHE convention.
  • Release assets now include pdf_oxide-go-ffi-shared-<platform>.tar.gz for every Tier-1 platform alongside the existing staticlib archives.

Bug fixes

  • #395 – RenderPage no longer raises SignatureException when a page contains unparseable signature-field metadata but no interactive signature widget. Reported by @gevorgter.

Thanks


v0.3.37 – 2026-04-20

HTML + CSS → PDF (#248) — first credible pure-Rust pipeline

New API — Pdf::from_html_css

let font = std::fs::read("DejaVuSans.ttf")?;
let pdf = Pdf::from_html_css(
    "<h1>Hello</h1><p>World</p>",
    "h1 { color: blue; font-size: 24pt }",
    font,
)?;
pdf.save("out.pdf")?;

Pass HTML + CSS + font bytes, get a paginated PDF back. Pure Rust, MIT/Apache only (no MPL transitive deps), extract_text round-trips byte-equal so produced PDFs participate in the existing test infrastructure.

What shipped

  • Font subsystem — TTF/OTF embedding with Type 0 / CIDFontType2 / Identity-H / ToUnicode emission; Latin, Cyrillic, Greek, Hebrew, Arabic round-trip via extract_text. System-font discovery via fontdb, text shaping via rustybuzz.
  • Hand-rolled CSS engine (~6,500 LoC, zero MPL deps) – tokenizer, parser, L3+L4 selectors (:is/:where/:not/:has), matcher, cascade, calc() / min() / max() / clamp(), var() with cycle detection, typed property values, at-rules (@media print, @page with :first/:left/:right/:blank, @font-face, @import, @supports), counters, pseudo-element content.
  • HTML – HTML5 tokenizer, flat arena DOM, stylesheet extraction (<style>, <link rel="stylesheet">, inline style=""), resource extraction (<img> + srcset, <picture>/<source>, <a href>).
  • Layout – Taffy-backed block / flex / grid, UAX #14 line breaking, margin collapsing, multi-column, tables (auto + fixed).
  • Paint – text + borders, RTL via rustybuzz, <a href>/Link annotation, <img> data-URI → /XObject, ::before / ::after, page-break-{before,after}: always, opacity, transform: translate*(), <ul> / <ol> list markers, embedded fonts via DocumentBuilder::register_embedded_font (#382).

Multi-font cascade

  • Pdf::from_html_css_with_fonts(html, css, Vec<(family, bytes)>) — CSS font-family on any element resolves against registered families (case-insensitive, with/without quotes, multi-word unquoted).

Bug fixes in corner-case pass

  • Base-14 bold text now renders bold (resource-dict key mismatch against Tf /Helvetica-Bold).
  • TTC system fonts (Helvetica.ttc, msgothic.ttc) now resolve via fontdb Source::SharedFile.
  • Unquoted multi-word font-family tokenises correctly.
  • Memory leak in Pdf::from_html_css factories closed (four Box::leak sites replaced with scoped locals).
  • PNG alpha / soft-mask (SMask) now renders.
  • Shaped text round-trips via extract_text (encode_shaped_run maps glyph clusters back to source codepoints).
  • PdfWriter::finish embeds fonts in registration order (was HashMap-random).
  • Embedded-font name collisions isolated via monotonic EFn resource names.
  • fontdb Mutex no longer held across fs::read of font bytes.

Out of scope

CSS filters, 3D transforms, animations, SVG-in-HTML (every viable Rust SVG crate is MPL), MathML, hyphens: auto, shape-outside, JavaScript, full-matrix transform (scale/rotate), gradients, box-shadow.

Licence audit

cargo deny check licenses passes with zero MPL transitive dependencies. The Mozilla CSS stack (cssparser, selectors, html5ever, lightningcss, stylo) is all MPL-2.0; v0.3.37 hand-rolls the equivalents to keep pdf_oxide entirely under MIT/Apache.

Thanks

  • @jmriebold – #248 (“CSS support”) is the root of this release’s entire HTML+CSS→PDF pipeline.

v0.3.36 – 2026-04-19

Markdown structural extraction — Tagged-PDF heading/list emission, multi-column reading order, safer RTL handling

Markdown structural extraction (#377)

to_markdown() now wires /StructTreeRoot directly into the markdown pipeline instead of re-deriving heading levels from font-size heuristics and list markers from glyph detection:

  • Heading and list emission from /StructTreeRoot. New StructRole (Heading(1..6), ListItem, ListItemLabel, ListItemBody) attached to every span. Word-tagged documents recover their full heading hierarchy; lists emit - item with paragraph breaks at every role transition.
  • Role propagated through nested MCRs. H1 → Span → MCR and LI → LBody → Span → MCR patterns now carry the right semantic role via InheritedContext { heading_level, list_role }.
  • Per-/StructTreeRoot block boundary forces paragraph break. OrderedContent.block_id increments on every entry into /P, /H1..6, /LI, /Lbl, /LBody, /Sect, /Div, /Art, /TR, /TH, /TD, /Note, /Reference, /BibEntry, /Code; tight-gap layouts no longer merge.
  • Same-baseline gate against form-heading over-fragmentation — same-baseline spans re-join into one heading.
  • Multi-column gutter detection — same-baseline spans separated by > max(3 × font_size, 30 pt) are treated as cross-column.
  • Backward-x reading-order wrap detection — column-major reading order (last span of col 1 at x=976 → first span of col 2 at x=192 same baseline) now breaks paragraphs instead of joining.
  • Geometric heading + list-prefix detection for untagged docs. Bold + 5 % size bump promotes to H4. New is_ordered_list_marker recognises 1. / 12. / a) / iv. / A. while rejecting figure captions and years.
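
The two same-baseline rules above (gutter gap and backward-x wrap) can be sketched roughly as follows. This is only an illustration of the stated thresholds: the `Span` type, the function names, the baseline tolerance parameter, and the choice to use the previous span's font size are all assumptions, not the library's actual internals.

```rust
/// Hypothetical minimal span description for this sketch (all values in points).
#[derive(Debug, Clone, Copy)]
struct Span {
    x_start: f32,
    x_end: f32,
    baseline: f32,
    font_size: f32,
}

/// True when two spans sit on (roughly) the same baseline.
fn same_baseline(a: &Span, b: &Span, tolerance: f32) -> bool {
    (a.baseline - b.baseline).abs() <= tolerance
}

/// Gutter rule from the notes: gap > max(3 x font_size, 30 pt).
/// Assumption: the font size of the previous span is the one that counts.
fn is_cross_column_gap(prev: &Span, next: &Span) -> bool {
    let gap = next.x_start - prev.x_end;
    gap > (3.0 * prev.font_size).max(30.0)
}

/// Backward-x wrap: the next span starts to the left of the previous one
/// even though both share a baseline (column-major reading order).
fn is_backward_wrap(prev: &Span, next: &Span) -> bool {
    next.x_start < prev.x_start
}

/// Combined decision: should a paragraph break be placed between the spans?
fn breaks_paragraph(prev: &Span, next: &Span, tolerance: f32) -> bool {
    same_baseline(prev, next, tolerance)
        && (is_cross_column_gap(prev, next) || is_backward_wrap(prev, next))
}
```

With a 10 pt font the 30 pt floor dominates, so a 40 pt gap counts as a gutter while a 25 pt gap does not; the x=976 → x=192 case from the notes is caught by the backward-wrap check.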

RTL text — safe-by-default

  • Spurious **bold** markers around Arabic contextual glyphs are now stripped (shape transitions flipped the font-weight detector).
  • Bidi reorder is OFF by default. An earlier draft ran unicode-bidi’s visual→logical reorder on every RTL line, which broke previously-correct logical-order PDFs (Hebrew name בנימין was being reversed). Reorder helper remains at text::bidi::reorder_visual_to_logical for callers whose input is visual-order.

Markdown output

  • Inline-image base64 data URIs capped at 200 KB. PDFs with high-resolution diagrams previously inflated markdown output 10–20× (a 1.9 MB paper produced 11.3 MB of markdown). Images over the cap emit an HTML-comment placeholder with the original size. File-based image output (image_output_dir) is unaffected.
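
A minimal sketch of the cap logic described above, assuming the payload is already base64-encoded. The function name, the exact placeholder wording, measuring the cap on the full data URI, and treating 200 KB as 200 × 1024 bytes are assumptions made for illustration.

```rust
/// Assumed interpretation of the 200 KB cap (200 * 1024 bytes).
const INLINE_IMAGE_CAP_BYTES: usize = 200 * 1024;

/// Returns either an inline Markdown image or an HTML-comment placeholder.
/// `base64_payload` is the already-encoded image; `original_len` is the
/// size of the raw image in bytes, reported in the placeholder.
fn inline_image_markdown(
    alt: &str,
    mime: &str,
    base64_payload: &str,
    original_len: usize,
) -> String {
    let uri = format!("data:{};base64,{}", mime, base64_payload);
    if uri.len() <= INLINE_IMAGE_CAP_BYTES {
        format!("![{}]({})", alt, uri)
    } else {
        // Hypothetical placeholder format; the real wording may differ.
        format!(
            "<!-- image omitted: {} bytes exceeds inline cap of {} bytes -->",
            original_len, INLINE_IMAGE_CAP_BYTES
        )
    }
}
```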

Empirical impact

Validated against v0.3.35 on a 369-PDF regression spanning academic, government, forms, newspapers, technical, theses, IRS, pdfium, pdfjs, safedocs, and slow-corpus subsets:

  • 0 catastrophic regressions.
  • Token Jaccard vs pdfium and pdftotext: median 1.000, ≥0.95 on 95/106 fixtures.
  • Token Jaccard vs pymupdf4llm: median 0.978, ≥0.95 on 65/106 fixtures.
  • ~2× more headings emitted than pymupdf4llm across the corpus.

Thanks

  • @Goldziher (kreuzberg) – filed #377 with a 727-document benchmark methodology plus 9 reproducer PDFs. The framing (“TF1 within ±3 % so text content is fine, structure is the issue”) made the whole investigation tractable.

v0.3.35 – 2026-04-19

Narrow-glyph doublet preservation in text extraction

Text extraction correctness

  • Adjacent narrow-glyph doublets no longer collapsed at small font sizes (#378, PR #379). TextExtractor::deduplicate_overlapping_chars and deduplicate_overlapping_spans used a hardcoded 2 pt absolute threshold. For narrow glyphs (l, r, I, i) in compact fonts at small sizes the per-glyph advance width drops to ≤ 2 pt (Helvetica l ≈ 2.0 pt at 9 pt), so legitimate adjacent doublets one full advance apart fell inside the dedup window and one of the two glyphs was silently dropped. Visible corruption included controller → controler, billed → biled, warranty → warrnty, following → folowing, VIII → VII. The threshold now scales with each glyph’s own advance_width as min(advance_width * 0.30, 2.0). Tunables are hoisted to the TextExtractor::DEDUP_OVERLAP_RATIO / DEDUP_OVERLAP_CAP_PT associated constants.
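
The advance-scaled threshold can be sketched as below. The constant names and the formula come from the entry above; the free functions and the simple x-distance comparison are assumptions for illustration (the real dedup presumably also checks that the two glyphs are identical and on the same line).

```rust
/// Fraction of a glyph's advance width used as the overlap window.
const DEDUP_OVERLAP_RATIO: f32 = 0.30;
/// Upper bound on the overlap window, in points.
const DEDUP_OVERLAP_CAP_PT: f32 = 2.0;

/// min(advance_width * 0.30, 2.0), as stated in the release note.
fn dedup_threshold_pt(advance_width: f32) -> f32 {
    (advance_width * DEDUP_OVERLAP_RATIO).min(DEDUP_OVERLAP_CAP_PT)
}

/// Hypothetical helper: two occurrences of the same glyph are treated as
/// one overdrawn glyph only if their origins are closer than the threshold.
fn is_overlapping_duplicate(x_prev: f32, x_next: f32, advance_width: f32) -> bool {
    (x_next - x_prev).abs() < dedup_threshold_pt(advance_width)
}
```

For a narrow glyph with a 2.0 pt advance the window is 0.6 pt, so a second glyph one full advance away is kept, while a copy drawn 0.2 pt away (a bold or shadow effect) is still removed. Wide glyphs hit the 2.0 pt cap.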

Thanks

  • @Hugues-DTANKOUO – reported #378 with precise root-cause analysis and authored PR #379 with the advance-scaled threshold and a parametrised regression matrix (4 narrow glyphs × 3 body-text sizes).

v0.3.34 – 2026-04-17

Idiomatic page API across all bindings; structured table extraction

New Features

  • Page API (#371) – Python, Node.js, C#, and Go now expose a PdfPage object. Iterate with for page in doc, for (const p of doc), foreach (var p in doc.Pages), or doc.Pages(); index with doc[i], doc.page(i), doc[i], or doc.Page(i). Each page exposes lazy text, markdown(), html(), words, lines, tables, images, paths, annotations, search(), and more.
  • Structured table extraction (#289) – extract_tables() (Python), ExtractTables() (C#/Go), and extractTables() (Node.js) now return rows and cells with text plus bounding boxes, not just Markdown. Available on both PdfDocument and the new PdfPage.
  • Node.js parity – extractWords, extractTextLines, extractTables, extractPaths, getEmbeddedImages, ocrExtractText wired into the TypeScript layer (previously native-only).
  • ExtractedTable → Table – Rust core rename; the redundant Extracted prefix is dropped. FFI-facing types updated.

Text Extraction Quality

  • XY-cut column detection on mixed-layout pages (#319) – is_multi_column_page guard tightened to require at least 15 spans per column; column-ordered spans are no longer re-sorted with the row-aware sort in extract_text.

Thanks

  • @SeanPedersen for proposing the page-first API (#371). @pdenapo for requesting structured table extraction (#289).

v0.3.33 – 2026-04-16

Text extraction, image correctness, and memory safety fixes

Bug Fixes

  • ToUnicode CMap miss (#363) – Subset Type0 fonts now emit U+FFFD when a CID is missing from the ToUnicode CMap, instead of falling through to Identity-H ciphertext (e.g. %B+$%8A//$2*%01*1%6APP).
  • Intra-word TJ kerning no longer splits words (#365) – 0.10–0.20 em letter-pair kerning inside single words ([(diffe) -150 (rent)]) no longer triggers space insertion.
  • Cyrillic UTF-8 mojibake recovered (#317) – Fonts with Latin-only encoding and raw UTF-8 byte sequences now decode correctly.
  • FlateDecode partial-recovery rejects garbage output (#364) – MS Reporting Services PDFs whose content streams fail mid-decompress no longer return 128 bytes of pseudo-random data.
  • Indexed + ICCBased palette (#373) – Unresolved ICC stream references inside the Indexed base array no longer default /N to 3 instead of CMYK’s 4, fixing diagonal-stripe artefacts. Reported by @Charltsing.
  • Lab-base Indexed palettes → sRGB (#337) – CIE L*a*b* palette bytes now converted Lab→XYZ→sRGB instead of reinterpreted as raw RGB.
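
The ToUnicode fallback in #363 comes down to substituting U+FFFD for any CID the CMap does not cover, rather than passing the raw code through. A minimal sketch, assuming a plain `HashMap` stands in for the parsed CMap (the function name and map shape are hypothetical):

```rust
use std::collections::HashMap;

/// Decodes a run of CIDs through a ToUnicode-style mapping.
/// CIDs without an entry become U+FFFD instead of leaking raw codes.
fn decode_cids(cids: &[u16], to_unicode: &HashMap<u16, String>) -> String {
    let mut out = String::new();
    for cid in cids {
        match to_unicode.get(cid) {
            Some(text) => out.push_str(text),
            None => out.push('\u{FFFD}'),
        }
    }
    out
}
```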

Memory and Performance

  • All internal caches bounded (PRs #369, #354) – Object cache (64 MB), font caches (256–512 entries), XObject span/image caches (1024 entries), and global CMap cache (1024 entries) now use FIFO eviction.
  • Path extraction OOM on chart-heavy PDFs fixed (#369) – CTM-aware XObject dedup added so the same XObject at the same position is deduplicated but the same XObject at different positions processes separately.
  • Mutex poison resilience – MutexExt::lock_or_recover() replaces 72 .lock().unwrap() call sites.
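
Two of the techniques above, a bounded FIFO cache and poison-tolerant locking, can be sketched with the standard library alone. The trait and method names `MutexExt::lock_or_recover` come from the entry above; the body shown here and the whole `FifoCache` type are assumptions about how such helpers are commonly written, not the library's actual code.

```rust
use std::collections::{HashMap, VecDeque};
use std::hash::Hash;
use std::sync::{Mutex, MutexGuard, PoisonError};

/// Hypothetical bounded cache: evicts the oldest inserted key first.
struct FifoCache<K, V> {
    capacity: usize,
    order: VecDeque<K>,
    entries: HashMap<K, V>,
}

impl<K: Hash + Eq + Clone, V> FifoCache<K, V> {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { capacity, order: VecDeque::new(), entries: HashMap::new() }
    }

    fn insert(&mut self, key: K, value: V) {
        // Only new keys join the eviction queue; overwrites keep their slot.
        if self.entries.insert(key.clone(), value).is_none() {
            self.order.push_back(key);
            if self.order.len() > self.capacity {
                if let Some(oldest) = self.order.pop_front() {
                    self.entries.remove(&oldest);
                }
            }
        }
    }

    fn get(&self, key: &K) -> Option<&V> {
        self.entries.get(key)
    }

    fn len(&self) -> usize {
        self.entries.len()
    }
}

/// Extension trait: lock a mutex, recovering the guard if it was poisoned.
trait MutexExt<T> {
    fn lock_or_recover(&self) -> MutexGuard<'_, T>;
}

impl<T> MutexExt<T> for Mutex<T> {
    fn lock_or_recover(&self) -> MutexGuard<'_, T> {
        // A poisoned lock still holds valid memory; take the inner guard.
        self.lock().unwrap_or_else(PoisonError::into_inner)
    }
}
```

Recovering from poison is a trade-off: the protected data may be mid-update after a panic, so it suits caches and similar state that can tolerate or rebuild stale entries.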

Dependencies

  • RustCrypto cipher 0.5 ecosystem (PRs #352, #295, #291): aes 0.8→0.9, cbc 0.1→0.2, sha2/sha1/md-5 0.10→0.11.

Test Suite

  • 13 dead/stale ignored tests removed; 3 previously-ignored tests fixed. Regression tests added for every bug fix above. Suite now 6,300 passed, 0 failed, 228 ignored.

Thanks

  • @Charltsing for the Indexed + CMYK image extraction bug report (#373).
  • @ddxtanx for profiling the unbounded memory growth during multi-page extraction (#354).
  • @andrewjradcliffe for PR #369: bounded FIFO caches, CTM-aware XObject dedup, MutexExt poison-recovery trait, Python binding hardening.

v0.3.32 – 2026-04-15

Release pipeline fix for Windows-x64 Go FFI tarball

Release Pipeline

  • Fix x86_64-pc-windows-gnu native-lib build failing the v0.3.31 release – scripts/shrink-staticlib.sh ran objcopy --strip-debug on every archive member, but the MinGW cross-compile toolchain emits split-debug .dwo members containing only DWARF sections; after stripping, the member had no sections left and objcopy aborted the whole archive. Fix: drop .dwo archive members via ar d before invoking objcopy. No functional change to Rust, Python, Node, WASM, or C# artifacts – this release exists solely to unblock the Windows-x64 Go install path.

v0.3.31 – 2026-04-13

Bug fixes, Go build changes, release infra improvements

Bug Fixes

  • Xref recovery – Fixed recovery for mis-flagged free page objects and off-by-few-bytes xref offset entries.

Breaking Changes

  • Go native libs – Native libraries are no longer committed to go/lib/. Consumers must run go run github.com/yfedoseev/pdf_oxide/go/cmd/install@latest once per machine.

Release Infra

  • Shrunk Rust staticlibs 63% (71 MB to 26 MB), stripped npm .node addon, dropped sourcemaps from npm, fixed crate sdist leak, tightened NuGet snupkg packaging.

v0.3.27 – 2026-04-12

Go staticlib, Node.js native bindings, C# NativeAOT, OCR FFI, major bug fixes

New Features

  • Go staticlib migration – Switched from cdylib to staticlib for self-contained Go binaries.
  • Node.js native bindings – Prebuilt platform subpackages via napi-rs style distribution.
  • C# LibraryImport – Migrated 881 P/Invoke declarations from DllImport to LibraryImport for NativeAOT compatibility.
  • OCR FFI bridge – OCR support now available in Go, C#, and Node.js bindings.
  • Regression harness – 60-PDF curated corpus for automated quality testing.

Bug Fixes

  • Indexed color space images, AES-256 (V=5, R=6) encryption, reading order for single-column and tabular content, Arabic text extraction, word separation, font width fallback, object cache invalidation, rendering improvements.

v0.3.24 – 2026-04-09

Official bindings for JavaScript/TypeScript, Go, and C#

New Features

  • JavaScript/TypeScript bindings – Published on npm with full API coverage.
  • Go bindings – Native Go package with complete API surface.
  • C# bindings – .NET package published on NuGet.
  • C FFI layer – 270+ extern "C" functions with shared pdf_oxide.h header.
  • Global log level control – Configurable across all bindings.

v0.3.23 – 2026-04-09

Critical stability fixes

Bug Fixes

  • Fixed SIGABRT on pages with degenerate CTM from rotated dvips PDFs.
  • Fixed images/XObjects being stripped on save.
  • Fixed garbled rendering on systems without common fonts.
  • Fixed form field page index always returning 0.

v0.3.22 – 2026-04-08

Thread-safe documents, async Python, free-threaded Python, word/line segmentation tuning

New Features

  • Thread-safe PdfDocument – Send + Sync via Mutex (replaced RefCell).
  • Async Python API – AsyncPdfDocument, AsyncPdf, AsyncOfficeConverter.
  • Free-threaded Python – Support for cp314t (no-GIL builds).
  • Segmentation thresholds – word_gap_threshold, line_gap_threshold, profile for tuning word/line detection.
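
Why swapping RefCell for Mutex makes a type Send + Sync can be shown with a small stand-in. `SharedDoc` and its field are hypothetical and are not PdfDocument's real layout; the point is only that interior state behind a `Mutex` lets the compiler accept sharing across threads.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a document with interior, lazily filled state.
/// With RefCell in place of Mutex this type would not be Sync.
struct SharedDoc {
    page_cache: Mutex<Vec<String>>,
}

impl SharedDoc {
    fn new() -> Self {
        Self { page_cache: Mutex::new(Vec::new()) }
    }
}

/// Compile-time check: only compiles for types that are Send + Sync.
fn assert_send_sync<T: Send + Sync>() {}

/// Lets several worker threads record results into the same document.
fn extract_in_parallel(doc: Arc<SharedDoc>, workers: usize) -> usize {
    let handles: Vec<_> = (0..workers)
        .map(|i| {
            let doc = Arc::clone(&doc);
            thread::spawn(move || {
                doc.page_cache.lock().unwrap().push(format!("page {}", i));
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let count = doc.page_cache.lock().unwrap().len();
    count
}
```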

Bug Fixes

  • CLI split/merge blank pages, rendering skip for malformed images, structure tree cycle SIGSEGV, table strategy gating.

Performance

  • Cached structure tree and decompressed content streams, O(1) MCID lookup, O(log n) page tree traversal, lazy page tree population.

v0.3.21 – 2026-04-04

Multi-arch Python wheels, log level fix

Bug Fixes

  • Log level now fully respected in Python (macros forwarded to log crate).

New Features

  • Multi-arch Python wheels – Linux aarch64, musl x86_64/aarch64, Windows ARM64; lowered glibc requirement to 2_28.

v0.3.20 – 2026-04-04

Major table extraction rewrite, text quality improvements, silent logging by default

New Features

  • Table extraction engine rewrite – Intersection pipeline, text-edge detection, extended grid, column-aware text detection, dotted/dashed line reconstitution, hybrid row detection.
  • Text extraction quality – Adjacent value spacing, split decimal merging, bold span consolidation, HTML heading hierarchy, label-value pairing, columnar group merging.
  • Silent logging – Logging now silent by default across all bindings; Python logs flow through logging module via pyo3-log.

Bug Fixes

  • Encrypted PDF clear error message, ObjStm/XRef stream decryption, stream parser trailing newline handling.

v0.3.19 – 2026-04-02

Single-call page extraction, column-aware reading order, per-character bounding boxes

New Features

  • extract_page_text() – Single-call DTO for streamlined page extraction.
  • Column-aware reading order – XY-Cut spatial partitioning for multi-column documents.
  • Per-character bounding boxes – Derived from font metrics for precise character positioning.
  • is_monospace flag – Available on TextSpan and TextChar.
  • Pdf::from_bytes() – New constructor across all bindings.
  • Path operations – extract_paths() in Python bindings.

Bug Fixes

  • UTF-8 panic on multi-byte debug log, markdown spacing, Form XObject /Matrix, rotated text matrix, prescan CTM loss, deduplication, Tm-scale text drop, markdown word merging, CLI merge blank docs.

Breaking Changes

  • WASM – JSON field names now use camelCase.

v0.3.18 – 2026-04-01

Rendering engine overhaul, new Python and WASM APIs, batteries-included Python

New Features

  • Rendering engine overhaul – Correct character spacing, embedded font support, standard font metrics, fill-and-stroke, clip path, gradient shading, alpha transparency, stencil image masks, page rotation, separation color spaces.
  • New Python APIs – validate_pdf_a, validate_pdf_ua, validate_pdf_x, extract_pages, delete_page, move_page, flatten_to_images, password constructor, merge.
  • New WASM APIs – validatePdfA, deletePage, extractPages, save, password constructor, merge.
  • Batteries-included Python – Rendering, parallel, signatures, and office conversion enabled by default.

Bug Fixes

  • Degenerate CTM abort, FlateDecode flate-bomb protection (256 MB cap), clipping stack sync.
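
A size cap like the 256 MB flate-bomb limit is typically enforced by bounding how much is read from the decoder. The sketch below works over any `Read` implementation. The decompressor itself is left out, and the function name, constant name, and error kind are assumptions.

```rust
use std::io::{self, Read};

/// Cap from the release note: 256 MB of decoded output.
const MAX_DECODED_BYTES: u64 = 256 * 1024 * 1024;

/// Reads everything from `reader`, failing if more than `cap` bytes arrive.
fn read_capped<R: Read>(reader: R, cap: u64) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    // Allow one extra byte so "exactly cap" and "over cap" are distinguishable.
    reader.take(cap.saturating_add(1)).read_to_end(&mut out)?;
    if out.len() as u64 > cap {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "decoded stream exceeds size cap",
        ));
    }
    Ok(out)
}
```

Because `take` stops pulling from the source once the limit is reached, an endless or maliciously expanding stream costs at most `cap + 1` bytes of memory before it is rejected.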

v0.3.17 – 2026-03-08

Table detection refinement, tagged PDF optimization

Improvements

  • Refined table detection – Requires 2+ columns, reducing false positives.
  • Optimized tagged PDF extraction pipeline.

Bug Fixes

  • Fixed RefCell already borrowed panic on recursive Form XObject processing.

v0.3.16 – 2026-03-08

Smart hybrid table extraction, Python type stubs, pathlib support

New Features

  • Smart hybrid table extraction – Union-Find clustering, visual line analysis, visual spans/headers.
  • Professional ASCII tables – Multiline wrapping for terminal output.
  • Python type stubs – Auto-generated via mypy stubgen.
  • Python PdfDocument – Accepts pathlib.Path and supports context manager.

Bug Fixes

  • Segfault in nested Form XObject, Python coordinate scaling, ASCII table UTF-8 panic.

v0.3.15 – 2026-03-06

Header/footer management, page templates, scoped extraction

New Features

  • Header/footer management API – Add, remove, and edit PDF artifacts.
  • Page templates – Dynamic placeholders for page numbering, dates, etc.
  • Scoped extraction – Respects erase_regions for filtered output.
  • PdfDocument.from_bytes() – New Python constructor.

Bug Fixes

  • Multi-column reading order (XY-Cut), font identity collisions, Lines table strategy false positives.

v0.3.14 – 2026-03-03

High-level rendering, word/line extraction, geometric primitives, hybrid tables

New Features

  • High-level rendering API – Pdf::render_page in Rust, Python, and WASM.
  • Word and line extraction – extract_words, extract_text_lines across all bindings.
  • Geometric primitive extraction – extract_rects, extract_lines.
  • Hybrid table detection – Vector line hints improve table boundary detection.
  • API harmonization – Fluent .within(page, rect) pattern.
  • CLI commands – render and paths commands with --area filtering.

Bug Fixes

  • OCR feature gating discovery, XObject span cache poisoning, V=4 crypt filters, encrypted CIDToGIDMap.

v0.3.13 – 2026-03-02

CJK text extraction fixes

Bug Fixes

  • Multi-byte decoding in extract_chars for CJK/Type0 fonts, improved character positioning accuracy, character spacing scaling.

v0.3.12 – 2026-03-01

Text extraction quality, markdown conversion, performance

Improvements

  • Text extraction quality – CID font width calculation, font-change word boundary detection, non-standard CID mapping fallback, RTL text directionality.
  • Markdown conversion – XY-Cut recursive spatial partitioning, heading detection, list reconstruction.

Performance

  • Zero-copy page tree traversal, structure tree caching, BT operator early-out, larger I/O buffer, removed xref reconstruction threshold.

v0.3.10 – 2026-02-26

Parallel extraction, WASM/JavaScript support, batch processing, text quality improvements

New Features

  • WASM/JavaScript support – WebAssembly bindings via wasm-bindgen. Full text extraction, PDF creation, editing, form fields, and search available in the browser and Node.js. Published as pdf-oxide-wasm on npm.

  • Parallel page extraction – New parallel feature flag with rayon-based multi-threaded extraction. ParallelExtractor distributes pages across worker threads. Global font cache ensures fonts are parsed only once.

  • Batch processing API – New BatchProcessor for multi-PDF workflows with progress callbacks and error collection. Supports both sequential and parallel processing.

  • OCR hybrid detection – New PageType enum (NativeText, ScannedPage, HybridPage) with multi-heuristic detection for intelligent OCR fallback.

  • Full WASM/Python API parity – 10 new method groups across WASM and Python bindings: form field get/set, image bytes extraction, PDF-from-images, form flattening, PDF merging, file embedding, page labels, XMP metadata.

Bug Fixes

  • Circular XObject segfault – Fixed segfault from circular Form XObject references during image extraction
  • XRef /Prev chain overflow – XRef /Prev chain parsing rewritten from recursive to iterative with cycle detection
  • Broken ligature text – repair_ligatures() post-processor fixes corrupted text from LaTeX PDFs
  • Text extraction quality – Annotation text extraction, leader dot normalization, Priority 3 CMap support
  • Table extraction – Merged cells, multi-line cell content, font-based header detection
  • Form field persistence – Incremental save now correctly persists form field value changes
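
The /Prev chain fix above replaces recursion with a loop plus a visited set. A minimal sketch, assuming the xref sections have been reduced to a map from section offset to its /Prev offset (that map, the function name, and returning an error on a cycle are illustrative assumptions):

```rust
use std::collections::{HashMap, HashSet};

/// Walks an xref /Prev chain iteratively, newest section first.
/// `prev_of` maps a section offset to its /Prev offset, if any.
fn walk_prev_chain(
    start: u64,
    prev_of: &HashMap<u64, Option<u64>>,
) -> Result<Vec<u64>, String> {
    let mut visited = HashSet::new();
    let mut chain = Vec::new();
    let mut current = Some(start);
    while let Some(offset) = current {
        // Seeing an offset twice means the file's /Prev links form a loop.
        if !visited.insert(offset) {
            return Err(format!("cycle in /Prev chain at offset {}", offset));
        }
        chain.push(offset);
        // Unknown offsets are treated as the end of the chain.
        current = prev_of.get(&offset).copied().flatten();
    }
    Ok(chain)
}
```

The loop uses constant stack space regardless of chain length, which is what removes the overflow risk on files with very long or self-referential update histories.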

Performance

  • Image-only page skip – page_cannot_have_text() pre-check skips decompression for pages with no fonts
  • SmallVec operator operands – Stack-allocated operands eliminate per-operator heap allocation
  • Cross-document font cache – Process-level LRU font cache shared across all PdfDocument instances

v0.3.9 – 2026-02-24

20+ micro-optimizations – 40% faster text extraction

Performance

  • O(n^2) string concat fix – Pre-allocated Vec<&str> joined at end replaces quadratic String::push_str() accumulation
  • Image-only content stream parser – New fast path for extract_images() that skips text and graphics operators (3-5x faster)
  • Fingerprint-based font cache – Font identity by hashing encoding+widths+flags instead of full struct comparison
  • Streaming parser – Content stream operators streamed instead of collected into Vec
  • Fast inline parser for BT/ET – Direct byte matching for common text operators
  • Byte-to-char lookup table – 256-entry lookup replaces HashMap in hot path
  • Width lookup table – Fixed-size array replaces HashMap for glyph widths
  • Shrink Operator enum – 112 to 40 bytes via boxing large variants (64% smaller)
  • zlib-rs backend – 15-25% faster stream decompression via zlib-ng port
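
The enum-shrinking item relies on a general Rust property: an enum is as large as its largest variant, so boxing the rare large payloads shrinks every value. The two enums below are hypothetical and much smaller than the real Operator type; they only demonstrate the effect.

```rust
use std::mem::size_of;

/// Hypothetical operator enum with one large inline payload.
#[allow(dead_code)]
enum WideOp {
    Save,
    SetMatrix([f64; 6]),
    Curve([f64; 12]),
}

/// Same variants, but the large payload lives behind a Box.
#[allow(dead_code)]
enum SlimOp {
    Save,
    SetMatrix([f64; 6]),
    Curve(Box<[f64; 12]>),
}

/// Returns (size of WideOp, size of SlimOp) in bytes.
fn operator_sizes() -> (usize, usize) {
    (size_of::<WideOp>(), size_of::<SlimOp>())
}
```

The cost is one heap allocation whenever the boxed variant is constructed, so the technique pays off when the large variants are uncommon in the stream.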

Bug Fixes

  • Font encoding with embedded programs – Correct base encoding resolution per PDF spec
  • Supplementary Unicode (U+10000+) – Fixed truncation of supplementary code points
  • StandardEncoding ligature mapping – Correct fi, fl, ff, ffi, ffl mapping via Adobe Glyph List
  • Kangxi Radical normalization – Full U+2F00-U+2FD5 mapping table
  • RTL text character order – Arabic/Hebrew extracted in logical reading order
  • Multi-column text separation – Improved column detection via gap analysis

Features

  • extract_all_text() – New convenience method for all-page text extraction
  • source_role for StructElem – Preserves original PDF role name before role mapping

v0.3.8 – 2026-02-20

Text-only parser – graphics-heavy pages 10-30x faster

Performance

  • Text-only content stream parser – New parse_content_stream_text_only() fast path skips graphics operators outside BT/ET blocks using byte-level scanning instead of full nom parsing
  • Byte-level graphics scanner – Raw index arithmetic replaces nom-based operand loop, processing at near-memcpy speed
  • Skip color operators – 12 color operators added to byte-level skip list
  • Defer q/cm/Q emission – Graphics state ops deferred until text is confirmed, eliminating ~75% of backtrack overhead
  • Arc-wrap FontInfo cache – Avoids cloning full FontInfo structs on cache hits
  • O(n) page map construction – Single-pass traversal replaces recursive descent
  • XObject name-to-ref cache – Eliminates O(n^2) dictionary cloning on XObject-heavy pages
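
The Arc-wrapped font cache item can be illustrated as follows: a cache hit hands out another reference to the same allocation instead of cloning the struct. `FontInfo` here is a reduced stand-in, and the cache type, key type, and method name are assumptions.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Reduced stand-in for a parsed font description.
struct FontInfo {
    name: String,
    widths: Vec<f32>,
}

/// Hypothetical cache keyed by object number.
struct FontCache {
    fonts: HashMap<u32, Arc<FontInfo>>,
    parses: usize,
}

impl FontCache {
    fn new() -> Self {
        Self { fonts: HashMap::new(), parses: 0 }
    }

    /// Returns the cached font, parsing it only on the first request.
    fn get_or_parse(
        &mut self,
        object_id: u32,
        parse: impl FnOnce() -> FontInfo,
    ) -> Arc<FontInfo> {
        if let Some(hit) = self.fonts.get(&object_id) {
            // Cheap: bumps a reference count, no deep copy of widths.
            return Arc::clone(hit);
        }
        self.parses += 1;
        let font = Arc::new(parse());
        self.fonts.insert(object_id, Arc::clone(&font));
        font
    }
}
```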

v0.3.7 – 2026-02-19

Text extraction quality: 95.7% to 99.6% clean rate

Verified – 3,829-PDF Corpus

Metric | v0.3.6 | v0.3.7 | Change
Clean rate | 95.7% | 99.6% | 3,812 of 3,829 PDFs
Dirty PDFs | 165 | 17 | -90%

Added – Parser & Decoders

  • BrotliDecode stream filter (PDF 2.0) – New decoder for Brotli-compressed streams
  • Xref trailer selection – Correct trailer selection when multiple trailers exist
  • Headerless PDF recovery – Search for first object marker when %PDF- header is missing

Added – Font Encoding

  • CFF font encoding parser – Parse CFF/OpenType font programs for character encoding
  • Type1 font encoding parser – Parse embedded Type 1 font programs for glyph mappings
  • 80K+ CID-to-Unicode mappings – Expanded Adobe-CNS1, Adobe-GB1, Adobe-Japan1, Adobe-Korea1
  • Shift-JIS/RKSJ decoding – Japanese Shift-JIS encoded CMap stream support
  • Identity-H cmap propagation – Propagate TrueType cmap tables from CIDFont descendants

Fixed – Text Extraction Pipeline

  • Tf buffer flush – Flush pending text on font switch to prevent text loss
  • Adaptive space threshold – Replace fixed 0.25em threshold with bbox-based spacing
  • Span deduplication – Deduplicate overlapping spans rendered for bold/shadow effects
  • Character deduplication – Remove duplicate characters within 2pt on the same line
  • BT operator check removal – Fix incorrect validation that skipped valid text blocks
  • ByteMode decoding – Proper 1-byte, 2-byte, and variable-width character code decoding
  • Annotation text extraction – Extract text from Widget, FreeText, and appearance streams

v0.3.6 – 2026-02-16

10x faster – two O(n^2) bottlenecks eliminated

Performance

  • Bulk page tree cache – On first page access, the entire page tree is walked once and all pages are cached. Previously get_page() traversed from root for every uncached page, resulting in O(n) per page and O(n^2) total for sequential access. Now O(1) per page after a single O(n) walk. A 10,000-page veraPDF test file went from 55,667ms to 332ms (168x faster).

  • Scan-for-object offset cache – When objects are missing from the xref table, scan_for_object() previously read the entire PDF file for each missing object. Tagged PDFs with hundreds of structure tree elements not in xref triggered hundreds of full file reads. Now the file is scanned once and all object offsets are cached. A 10-page tagged PDF went from ~10s to 68ms (146x faster). A 154-page academic PDF with 571 fonts went from ~18s to 405ms (44x faster).

  • Single-pass text extraction – extract_spans() no longer runs two passes (classify document type, then extract). The classification pass was eliminated entirely; adaptive font-aware thresholds now produce equal or better results in a single pass.

  • Content stream Vec pre-allocation – parse_content_stream() pre-allocates operator Vec capacity based on stream size, reducing reallocations for large content streams.
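
As a rough illustration of the bulk page tree cache, the sketch below walks a simplified tree once and stores the leaf pages in document order, so each later lookup is a plain index. The `Node` enum and `build_page_cache` are hypothetical and do not reflect the library's internal types. Cycle detection is omitted for brevity.

```rust
use std::collections::HashMap;

/// Hypothetical simplified page tree node, stored by object number.
enum Node {
    Pages { kids: Vec<u32> }, // intermediate /Pages node
    Page,                     // leaf /Page node
}

/// Walks the tree once from `root` and returns the object numbers of all
/// leaf pages in document order. Page `i` is then `cache[i]`, an O(1) lookup.
fn build_page_cache(objects: &HashMap<u32, Node>, root: u32) -> Vec<u32> {
    let mut pages = Vec::new();
    let mut stack = vec![root];
    while let Some(id) = stack.pop() {
        match objects.get(&id) {
            Some(Node::Pages { kids }) => {
                // Push in reverse so the first kid is popped first.
                stack.extend(kids.iter().rev().copied());
            }
            Some(Node::Page) => pages.push(id),
            None => {} // missing node: skipped in this sketch
        }
    }
    pages
}
```

The single walk costs O(n) in the number of nodes. Without the cache, resolving each page from the root repeats part of that walk every time, which is where the quadratic total for sequential access comes from.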

Verified – 3,830-PDF Corpus (v0.3.5 to v0.3.6)

Metric            v0.3.5      v0.3.6    Change
Pass rate         99.8%       99.8%     3,823 of 3,830 valid PDFs
Slow (>5s)        2           0         Eliminated
Mean              23.3ms      2.1ms     -91%
p50               0.6ms       0.6ms
p90               3.0ms       2.6ms     -13%
p99               33.2ms      18.0ms    -46%
Max               68,722ms    625ms     -99%
Sum (all PDFs)    89.1s       8.0s      -91%

Text output verified byte-identical on 11 PDFs (862 KB of extracted text). 4 PDFs showed improved extraction quality from adaptive spacing.


v0.3.5 – 2026-02-15

Performance, 3,830-PDF stability, and error recovery

Performance

  • Font caching across pages – Document-level font cache keyed by ObjectRef avoids re-parsing shared fonts on every page
  • Page object caching – get_page() caches resolved page objects, eliminating repeated page tree traversal for multi-page extraction
  • Structure tree caching – Structure tree result cached after first access, avoiding redundant parsing on every extract_text() call
  • BT operator early-out – Text extraction skips the full pipeline for image-only pages that contain no BT (Begin Text) operators
  • Larger I/O buffer for big files – BufReader capacity increased from 8 KB to 256 KB for files over 100 MB
  • Xref reconstruction threshold removed – Eliminated the heuristic that triggered full-file reconstruction on valid portfolio PDFs with few objects
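
The BT operator early-out can be approximated with a byte scan over the decoded content stream. The sketch below is an assumption about how such a check might look, with hypothetical function names. It does not skip string literals or inline image data, so it may report a false positive, which only costs a full extraction pass.

```rust
/// PDF whitespace characters: NUL, TAB, LF, FF, CR, SPACE.
fn is_pdf_whitespace(b: u8) -> bool {
    matches!(b, 0x00 | 0x09 | 0x0A | 0x0C | 0x0D | 0x20)
}

/// Whitespace or a PDF delimiter character ends a token.
fn is_boundary(b: u8) -> bool {
    is_pdf_whitespace(b)
        || matches!(
            b,
            b'(' | b')' | b'<' | b'>' | b'[' | b']' | b'{' | b'}' | b'/' | b'%'
        )
}

/// Returns true if the stream contains `BT` as a standalone token.
fn has_text_object(stream: &[u8]) -> bool {
    stream.windows(2).enumerate().any(|(i, w)| {
        w[0] == b'B'
            && w[1] == b'T'
            && (i == 0 || is_boundary(stream[i - 1]))
            && (i + 2 == stream.len() || is_boundary(stream[i + 2]))
    })
}
```

A caller would skip the text pipeline for a page when `has_text_object` returns false for every one of its content streams.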

Verified – 3,830-PDF Corpus

  • 100% pass rate on 3,830 PDFs across veraPDF (2,907), Mozilla pdf.js (897), SafeDocs (26)
  • Zero timeouts, zero panics
  • p50 = 0.6ms, p90 = 3.0ms, p99 = 33ms

Added – Encryption

  • Owner password authentication – Algorithm 7 for R<=4, Algorithm 12 for R>=5
  • R>=5 user password verification with SASLprep – Full AES-256 password verification using SHA-256
  • Public password authentication API – Pdf::authenticate(password) and PdfDocument::authenticate(password)

Added – PDF/A Compliance Validation

  • XMP metadata validation – Checks for pdfaid:part and pdfaid:conformance entries
  • Color space validation – Scans page content streams for device-dependent color operators without output intent
  • AFRelationship validation – PDF/A-3 embedded file spec validation

Added – PDF/X Compliance Validation

  • XMP PDF/X identification – Validates pdfxid:GTS_PDFXVersion
  • Page box relationship validation – TrimBox within BleedBox within MediaBox
  • ExtGState transparency detection – SMask, CA/ca, BM checks
  • Device-dependent color detection – Flags unsupported color spaces
  • ICC profile validation – Validates ICCBased profile streams
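
The page box relationship check (TrimBox within BleedBox within MediaBox) reduces to rectangle containment. A minimal sketch follows, using a hypothetical `Rect` type and `boxes_nested` helper rather than the validator's real types, and assuming the boxes are already normalized so that the lower-left corner comes first.

```rust
/// Hypothetical rectangle in PDF user space: lower-left and upper-right corners.
#[derive(Debug, Clone, Copy)]
struct Rect {
    llx: f64,
    lly: f64,
    urx: f64,
    ury: f64,
}

impl Rect {
    /// True if `inner` lies entirely inside `self` (shared edges allowed).
    fn contains(&self, inner: &Rect) -> bool {
        inner.llx >= self.llx
            && inner.lly >= self.lly
            && inner.urx <= self.urx
            && inner.ury <= self.ury
    }
}

/// Checks the nesting TrimBox ⊆ BleedBox ⊆ MediaBox.
fn boxes_nested(media: &Rect, bleed: &Rect, trim: &Rect) -> bool {
    media.contains(bleed) && bleed.contains(trim)
}
```

A production check would likely allow a small tolerance for floating-point rounding. The sketch compares exactly.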

Added – Rendering

  • Spec-correct clipping – Clip state scoped to q/Q save/restore
  • Glyph advance width calculation – Per PDF spec section 9.4.4
  • Form XObject rendering – Parses /Matrix transform, uses form’s /Resources
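
Scoping clip state to q/Q means the clip is part of the graphics state that is pushed on `q` and popped on `Q`. The sketch below models this with a hypothetical `StateStack` and treats the clip as an axis-aligned rectangle, which is a simplification of real clip paths.

```rust
/// Axis-aligned clip rectangle as (x0, y0, x1, y1); a stand-in for a clip path.
type ClipRect = (f32, f32, f32, f32);

#[derive(Debug, Clone, PartialEq, Default)]
struct GraphicsState {
    clip: Option<ClipRect>, // None means no clip has been applied
}

#[derive(Default)]
struct StateStack {
    current: GraphicsState,
    saved: Vec<GraphicsState>,
}

impl StateStack {
    /// `q`: push a copy of the current graphics state.
    fn save(&mut self) {
        self.saved.push(self.current.clone());
    }

    /// `Q`: restore the most recently saved state; an unbalanced `Q` is ignored.
    fn restore(&mut self) {
        if let Some(state) = self.saved.pop() {
            self.current = state;
        }
    }

    /// Clipping only ever shrinks the region: intersect with the current clip.
    fn clip(&mut self, r: ClipRect) {
        self.current.clip = Some(match self.current.clip {
            None => r,
            Some(c) => (c.0.max(r.0), c.1.max(r.1), c.2.min(r.2), c.3.min(r.3)),
        });
    }
}
```

Because a clip can only be widened again by restoring an earlier state, keeping it inside the saved state is what stops a clip set within one q/Q pair from leaking into later drawing.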

Fixed – Error Recovery (28+ real-world PDFs)

  • Missing objects resolve to Null per PDF spec section 7.3.10
  • Lenient header version parsing for unusual version strings
  • Non-standard encryption algorithm matching (V=1, R=3 combinations)
  • Non-dictionary Resources treated as empty instead of erroring
  • Null nodes in page tree gracefully skipped
  • Corrupt content streams return empty content instead of errors
  • Enhanced page tree scanning with /Resources+/Parent heuristic

Fixed – DoS Protection

  • Page count validated against PDF spec Annex C.2 limit (8,388,607)

Fixed – Image Extraction

  • Content stream image extraction via Do operators
  • Nested Form XObject images with cycle detection
  • Inline images (BI…ID…EI sequences)
  • CTM transformations for image positioning
  • ColorSpace indirect reference resolution

Fixed – Parser Robustness

  • Multi-line object headers (1 0\nobj format used by Google-generated PDFs)
  • Extended header search from 1024 to 8192 bytes
  • Lenient version parsing for malformed headers

Fixed – Page Access Robustness

  • Pages without /Contents return empty content
  • Cyclic page tree detection prevents stack overflow
  • Null stream references handled gracefully
  • Pages without /Type entry found by /MediaBox or /Contents keys
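
Cyclic page tree detection is commonly done with a visited set during traversal. The following is a sketch under that assumption. The `collect_pages` function and the `kids_of` map are hypothetical, and any node without a kids entry is treated as a leaf page.

```rust
use std::collections::{HashMap, HashSet};

/// Recursively collects leaf pages in document order. A node that was already
/// visited is skipped, so a /Kids entry pointing back at an ancestor cannot
/// cause unbounded recursion.
fn collect_pages(
    kids_of: &HashMap<u32, Vec<u32>>,
    node: u32,
    visited: &mut HashSet<u32>,
    out: &mut Vec<u32>,
) {
    // `insert` returns false when the node was already in the set.
    if !visited.insert(node) {
        return;
    }
    match kids_of.get(&node) {
        Some(kids) => {
            for &kid in kids {
                collect_pages(kids_of, kid, visited, out);
            }
        }
        None => out.push(node), // no kids entry: treat as a leaf page
    }
}
```

The visited set bounds the number of calls by the number of distinct nodes, so a malformed tree ends with a partial page list instead of a stack overflow.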

Fixed – Encryption Robustness

  • AES decryption with undersized keys returns error instead of panic
  • Xref stream parsing hardened against malformed entries
  • Indirect /Encrypt references resolved before parsing

Fixed – Content Stream Processing

  • Dictionary-as-Stream fallback for bare dictionaries
  • Abbreviated filter names (AHx, A85, LZW, Fl, RL, CCF, DCT)
  • Content stream operator limit (default 1,000,000)
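
The abbreviated filter names listed above map onto the full names defined by the PDF specification. A minimal lookup sketch, with the hypothetical function name `expand_filter_name`:

```rust
/// Expands an abbreviated filter name to its full form.
/// Names that are not abbreviations are returned unchanged.
fn expand_filter_name(name: &str) -> &str {
    match name {
        "AHx" => "ASCIIHexDecode",
        "A85" => "ASCII85Decode",
        "LZW" => "LZWDecode",
        "Fl" => "FlateDecode",
        "RL" => "RunLengthDecode",
        "CCF" => "CCITTFaxDecode",
        "DCT" => "DCTDecode",
        other => other,
    }
}
```

Normalizing names once, before dispatching to a decoder, means the rest of the pipeline only has to know the full names.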

Fixed – Code Quality

  • Structure tree indirect object references resolved at parse time
  • Lexer R/RG token disambiguation
  • Stream whitespace trimming no longer strips NUL bytes or spaces from binary data

Tests

  • 8 previously ignored tests un-ignored and fixed

Removed

  • Empty PdfImage stub (extraction uses ImageInfo)
  • Commented-out DocumentType::detect() test block

v0.3.4 – 2026-02-12

Parsing robustness, character extraction, and XObject paths

Breaking Changes

  • parse_header() signature changed from (u8, u8) to (u8, u8, u64) to include byte offset

Fixed – PDF Parsing Robustness (Issue #41)

  • PDFs with binary prefixes or BOM headers now open successfully
  • Header search scans first 1024 bytes for %PDF- marker
  • Supports UTF-8 BOM, email headers, and other leading binary data
  • Lenient mode handles real-world malformed PDFs; strict mode for compliance testing
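
To illustrate the header search, here is a sketch that scans a bounded window for the %PDF- marker and returns the version together with the marker's byte offset. The `find_header` function is hypothetical and is not the library's parse_header(). It is also stricter than the lenient mode described above, since it rejects anything other than a single-digit major and minor version.

```rust
/// Looks for `%PDF-M.m` within the first `window` bytes of `data`.
/// Returns (major, minor, byte offset of the marker) on success.
fn find_header(data: &[u8], window: usize) -> Option<(u8, u8, u64)> {
    const MARKER: &[u8] = b"%PDF-";
    let limit = data.len().min(window);
    let haystack = &data[..limit];

    // Position of the first marker inside the search window, if any.
    let pos = haystack.windows(MARKER.len()).position(|w| w == MARKER)?;
    let rest = &haystack[pos + MARKER.len()..];

    // Expect "<digit>.<digit>" right after the marker.
    match rest {
        [maj, b'.', min, ..] if maj.is_ascii_digit() && min.is_ascii_digit() => {
            Some((*maj - b'0', *min - b'0', pos as u64))
        }
        _ => None,
    }
}
```

With a UTF-8 BOM in front of the header, the returned offset is 3, and later offsets can be adjusted relative to it.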

Added – Character-Level Text Extraction (Issue #39)

  • extract_chars() returns Vec<TextChar> with per-character positioning
  • Includes transformation matrix, rotation angle, advance width
  • Sorted in reading order with overlapping character deduplication
  • 30-50% faster than span extraction for character-only use cases
  • Exposed in both Rust and Python APIs

Added – XObject Path Extraction (Issue #40)

  • extract_paths() recursively processes Form XObjects via Do operator
  • Coordinate transformations via /Matrix properly applied
  • Graphics state properly isolated (save/restore)
  • Duplicate XObject detection prevents infinite loops
  • Nested XObjects supported

Changed

  • Upgraded nom parser library from 7.1 to 8.0

v0.3.3 – 2026-02-11

CJK support, structure tree enhancements, and compliance foundations

Includes all changes from v0.2.5 and v0.2.6 as a consolidated release.

Highlights

  • TagSuspect/MarkInfo support – Parse MarkInfo dictionary from document catalog
  • Word Break /WB structure element for CJK text
  • Predefined CMap support for Adobe-GB1 (Simplified Chinese), Adobe-Japan1 (Japanese), Adobe-CNS1 (Traditional Chinese), Adobe-Korea1 (Korean)
  • Abbreviation expansion /E support
  • Type 0 /W array parsing for CIDFont glyph widths
  • Soft hyphen (U+00AD) handling fix
  • Enhanced artifact filtering with subtype support
  • Image embedding in HTML and Markdown output (base64 data URIs)
  • Image file export with embed_images=false and image_output_dir
  • PdfImage::to_base64_data_uri() and to_png_bytes() methods

v0.3.2 – 2026-02-01

Editing, encryption, and document security

Added – PDF Editing

  • DocumentEditor for modifying existing PDFs
  • Full annotation support (text markup, shapes, stamps, ink, file attachments, redactions)
  • Interactive form field creation (text, checkbox, radio, dropdown, list, button)
  • Form flattening
  • Link annotations (URLs, internal page navigation)
  • Outline/bookmark builder
  • PDF layers (Optional Content Groups)

Added – Encryption

  • Encryption on write (AES-256, AES-128, RC4-128, RC4-40)
  • Permission controls (print, copy, modify, annotate)
  • EncryptionConfig builder with EncryptionAlgorithm and Permissions
  • Digital signature foundation
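
For context on the permission controls, the PDF standard security handler stores permissions as bit flags in the /P entry of the encryption dictionary. The sketch below shows how the four controls named above map to bits 3 through 6 (numbered from 1). The constants and `permissions_value` are hypothetical and do not reflect the library's Permissions type. Bits 9 to 12 (form filling, accessibility extraction, assembly, high-quality printing) are left cleared here.

```rust
// Bit positions follow the PDF specification, where bit 1 is the lowest bit.
const PRINT: u32 = 1 << 2; // bit 3
const MODIFY: u32 = 1 << 3; // bit 4
const COPY: u32 = 1 << 4; // bit 5
const ANNOTATE: u32 = 1 << 5; // bit 6

/// Reserved bits 7-8 and 13-32 are set to 1; bits 1-2 stay 0.
const RESERVED_ONES: u32 = 0xFFFF_F0C0;

/// Builds the signed 32-bit /P value from the granted permission flags.
fn permissions_value(flags: u32) -> i32 {
    let granted = flags & (PRINT | MODIFY | COPY | ANNOTATE);
    (RESERVED_ONES | granted) as i32
}
```

With no flags granted this yields -3904, and granting print plus copy yields -3884.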

v0.3.1 – 2026-01-14

Form fields, multimedia, creation tools, and search

Added – PDF Creation

  • Pdf::from_markdown(), Pdf::from_html(), Pdf::from_text(), Pdf::from_image()
  • PdfBuilder fluent pattern for metadata and layout configuration
  • DocumentBuilder for programmatic PDF generation
  • Table rendering with TableRenderer
  • Graphics API: colors, gradients, patterns, blend modes, transparency
  • Page templates with headers, footers, page numbering, watermarks
  • Barcode generation (QR, Code128, EAN-13, UPC-A, Code39, ITF)
  • Text search with regex, case-sensitive/insensitive, whole word, page ranges
  • SearchOptions and SearchResult types
  • Position tracking with page/coordinates

Added – Form Field Coverage (95%)

  • Hierarchical field creation (parent/child structures with dotted names)
  • Field property modification (readonly, required, rect, tooltip, max length, alignment, default value)
  • FDF/XFDF export for form data exchange

Added – Multimedia Annotations

  • MovieAnnotation, SoundAnnotation, ScreenAnnotation, RichMediaAnnotation
  • ThreeDAnnotation with U3D and PRC format support

Added – XFA Form Support

  • XfaExtractor, XfaParser, XfaConverter (XFA to AcroForm conversion)

Changed – Python Bindings

  • True Python 3.8-3.14 support via abi3-py38
  • Modern tooling: uv, pdm, ruff integration

v0.3.0 – 2026-01-10

Extraction foundation – unified API and core capabilities

Added – Unified Pdf API

  • Pdf::open() for reading existing PDFs
  • DOM-like page navigation with pdf.page(0)
  • PdfDocument low-level handle for advanced use cases

Added – Text Extraction

  • extract_text() – full-page plain text
  • extract_spans() – styled text runs with font metadata
  • Structure tree-based reading order for tagged PDFs
  • Intelligent line-break and space detection for untagged PDFs

Added – Image Extraction

  • extract_images() – extract all images from a page
  • Format detection (JPEG, PNG, TIFF, JBIG2, CCITT)
  • Color space handling (DeviceRGB, DeviceCMYK, DeviceGray, ICCBased)

Added – Metadata Extraction

  • Document info dictionary (title, author, subject, keywords)
  • XMP metadata read/write
  • Page info (dimensions, rotation, media/crop/trim boxes)

Added – Form Extraction

  • extract_form_fields() for AcroForm field enumeration
  • Text, button, choice, and signature field types

Added – Conversion

  • to_markdown() – page-level Markdown conversion
  • to_html() – page-level HTML conversion
  • to_plain_text() – configurable plain text output

Added – Compliance

  • PDF/A validation (ISO 19005, levels 1a through 3b)
  • PDF/X validation (ISO 15930, levels X-1a through X-6p)
  • PDF/UA validation (ISO 14289, levels UA-1 and UA-2)

Added – Rendering (requires rendering feature)

  • Render pages to PNG/JPEG via tiny-skia
  • Configurable DPI and scale

Added – Python Bindings

  • PdfDocument class with full extraction API
  • Pdf class with creation and high-level API
  • PyO3-based, published to PyPI as pdf_oxide

v0.2.4 – 2026-01-09

  • CTM transformation fix for text positioning
  • Structure tree /Alt and /Pg parsing
  • FormulaRenderer for formula images

v0.2.3 – 2026-01-07

  • BT/ET matrix reset per PDF spec
  • Geometric spacing detection in Markdown converter
  • apply_intelligent_text_processing() for ligatures and hyphenation

v0.2.2 – 2025-12-15

  • Keyword optimization for discoverability

v0.2.1 – 2025-12-15

  • Encrypted stream decoding improvements

v0.1.4 – 2025-12-12

  • Encrypted stream decoding fixes

v0.1.0 – 2025-11-06

  • Initial release
  • PDF text extraction with spec-compliant Unicode mapping
  • Intelligent reading order detection
  • Python bindings via PyO3
  • Encrypted PDF support
  • Form field extraction
  • Image extraction