Changelog
All notable changes to PDF Oxide are documented here.
v0.3.38 – 2026-04-22
DocumentBuilder lands in every binding; AES-256 on the write path; signature verification; multi-target WASM; Go purego backend
Write-side API parity across all bindings (#384)
- DocumentBuilder + FluentPageBuilder + EmbeddedFont now ship in Python, Node/TypeScript, C#, Go, and WASM alongside Rust. Multi-page construction with full CJK / Cyrillic / Greek support through embedded fonts. Closes #382 cross-language.
- 15 annotation methods on every binding: link_url / link_page / link_named, highlight, underline, strikeout, squiggly, sticky note, stamp (14 standard + custom), free text, watermark (custom / DRAFT / CONFIDENTIAL).
- 5 AcroForm widget types on every binding: text_field, checkbox, combo_box, radio_group, push_button.
- Graphics primitives on every binding: rect, filled_rect, line.
- HTML+CSS pipeline – Pdf.from_html_css(...) and from_html_css_with_fonts(...) for multi-font cascades in every binding.
AES-256 encryption on the write path (#386)
- save_encrypted(path, user_pw, owner_pw) / to_bytes_encrypted(user_pw, owner_pw) on DocumentBuilder in every binding.
- save_with_encryption in Rust for custom algorithm + permissions.
Real font subsetting (#385 / FONT-3b)
- CJK faces are now embedded as a subset rather than the full face. A 5-character PDF built from a ~17 MB CJK font typically ships under 100 KB. Content streams, /W widths, and the ToUnicode CMap are re-keyed onto the subset GID space; extract_text round-trips unchanged.
- Internal writer API change: EmbeddedFont::encode_string / encode_shaped_run return Vec<u16>, and build_embedded_font_objects returns a GlyphRemapper that callers pass to ContentStreamBuilder::build_with_remappers. No change to high-level APIs.
Digital signature verification (#208, verification half)
- Signature.verify() and Signature.verify_detached(pdf_bytes) (and binding-native equivalents) in every binding. RFC 5652 §5.4 signer-attributes + §11.2 messageDigest checks.
- RSA-PKCS#1 v1.5 over SHA-1 / SHA-256 / SHA-384 / SHA-512 returns Valid / Invalid. RSA-PSS and ECDSA surface as Unknown / UnsupportedFeatureException; callers can still read the certificate and run their own check.
- Certificate – DER inspection (subject, issuer, serial, validity, is_valid) via x509-parser – every binding.
- Signature – enumerate + inspect + .get_certificate() – every binding.
- Timestamp – RFC 3161 TSTInfo parsing (time, serial, policy, TSA name, hash algorithm, message imprint) – every binding.
- TsaClient – RFC 3161 HTTP POST with nonce and HTTP Basic auth behind a tsa-client Cargo feature – every binding except WASM. Intentionally not wired on WASM (ureq is wasm-incompatible).
- DocumentEditor::set_producer / set_creation_date metadata writers.
- render_page_region and render_page_fit – clipped and fitted rendering surface.
- Bicubic image filtering (pdf.js #19978 parity) – scanned / bilevel pages with Multiply-blended overlays no longer collapse their grayscale range on downscale.
Signing itself (as opposed to verification) is not covered; #208 remains open for that half.
Multi-target WASM packaging (#392)
- pdf-oxide-wasm now ships three builds side-by-side with package.json conditional exports: nodejs/, bundler/ (Vite / webpack / Rollup / esbuild / Bun), and web/ (browsers / Deno / Cloudflare Workers).
- Fixes the ReferenceError: Can't find variable: __dirname thrown under browser bundlers.
- Subpath imports (pdf-oxide-wasm/web, /nodejs, /bundler) available for manual routing.
Go binding — purego backend + cache-dir install
- Second backend via ebitengine/purego dlopen’s libpdf_oxide.{so,dylib,dll} at runtime. CGO_ENABLED=0 builds now work. Backend selection is automatic: //go:build cgo → full CGo API, //go:build !cgo → purego.
- Purego surface: PdfDocument open (path / bytes / password), page count, version, text / Markdown / HTML / plain-text extraction, fonts, annotations, page elements, search, page dimensions, logging, plus PdfCreator.FromMarkdown for test fixtures.
- CGo-only (compile-time error under !cgo): DocumentEditor, DocumentBuilder, barcodes, signatures, TSA, rendering, OCR, form mutation.
- Installer: new -shared flag fetches the cdylib instead of the staticlib and prints CGO_ENABLED=0 + PDF_OXIDE_LIB_PATH=… to export.
- Install dir moved to os.UserCacheDir(): ~/.cache/pdf_oxide (Linux), ~/Library/Caches/pdf_oxide (macOS), %LocalAppData%\pdf_oxide (Windows). Matches Go’s own GOCACHE convention.
- Release assets now include pdf_oxide-go-ffi-shared-<platform>.tar.gz for every Tier-1 platform alongside the existing staticlib archives.
Bug fixes
- #395 – RenderPage no longer raises SignatureException when a page contains unparseable signature-field metadata but no interactive signature widget. Reported by @gevorgter.
Thanks
- @sparkyandrew – #382 (CJK via DocumentBuilder), #385 (subsetter).
- @arthurlassagne – #392 (browser build breakage).
- @gevorgter – #395 (RenderPage signature exception).
v0.3.37 – 2026-04-20
HTML + CSS → PDF (#248) — first credible pure-Rust pipeline
New API — Pdf::from_html_css
let font = std::fs::read("DejaVuSans.ttf")?;
let pdf = Pdf::from_html_css(
    "<h1>Hello</h1><p>World</p>",
    "h1 { color: blue; font-size: 24pt }",
    font,
)?;
pdf.save("out.pdf")?;
Pass HTML + CSS + font bytes, get a paginated PDF back. Pure Rust, MIT/Apache only (no MPL transitive deps), extract_text round-trips byte-equal so produced PDFs participate in the existing test infrastructure.
What shipped
- Font subsystem – TTF/OTF embedding with Type 0 / CIDFontType2 / Identity-H / ToUnicode emission; Latin, Cyrillic, Greek, Hebrew, Arabic round-trip via extract_text. System-font discovery via fontdb, text shaping via rustybuzz.
- Hand-rolled CSS engine (~6,500 LoC, zero MPL deps) – tokenizer, parser, L3+L4 selectors (:is / :where / :not / :has), matcher, cascade, calc() / min() / max() / clamp(), var() with cycle detection, typed property values, at-rules (@media print, @page with :first / :left / :right / :blank, @font-face, @import, @supports), counters, pseudo-element content.
- HTML – HTML5 tokenizer, flat arena DOM, stylesheet extraction (<style>, <link rel="stylesheet">, inline style=""), resource extraction (<img> + srcset, <picture> / <source>, <a href>).
- Layout – Taffy-backed block / flex / grid, UAX #14 line breaking, margin collapsing, multi-column, tables (auto + fixed).
- Paint – text + borders, RTL via rustybuzz, <a href> → /Link annotation, <img> data-URI → /XObject, ::before / ::after, page-break-{before,after}: always, opacity, transform: translate*(), <ul> / <ol> list markers, embedded fonts via DocumentBuilder::register_embedded_font (#382).
Multi-font cascade
- Pdf::from_html_css_with_fonts(html, css, Vec<(family, bytes)>) – CSS font-family on any element resolves against registered families (case-insensitive, with/without quotes, multi-word unquoted).
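The matching rules listed for the multi-font cascade (case-insensitive, quoted or unquoted, multi-word) can be pictured as a normalisation step applied to both the CSS value and the registered family names. The sketch below is only an illustration: `normalize_family` and `resolve_family` are hypothetical names, not part of pdf_oxide, and the naive comma split assumes no commas inside quoted names.

```rust
/// Hypothetical helper: canonical key for a CSS font-family name.
/// Trims surrounding quotes, collapses whitespace runs, and lowercases.
fn normalize_family(raw: &str) -> String {
    let trimmed = raw.trim();
    let unquoted = trimmed
        .strip_prefix('"')
        .and_then(|s| s.strip_suffix('"'))
        .or_else(|| trimmed.strip_prefix('\'').and_then(|s| s.strip_suffix('\'')))
        .unwrap_or(trimmed);
    unquoted
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
        .to_lowercase()
}

/// Hypothetical helper: walk a `font-family` list left to right and return
/// the first registered family whose normalised name matches.
fn resolve_family<'a>(font_family: &str, registered: &'a [&'a str]) -> Option<&'a str> {
    font_family
        .split(',')
        .map(normalize_family)
        .find_map(|wanted| {
            registered
                .iter()
                .copied()
                .find(|candidate| normalize_family(candidate) == wanted)
        })
}
```

With a single registered family "DejaVu Sans", the values `"DejaVu Sans"`, `dejavu sans`, and `'DEJAVU  SANS'` would all resolve to it under these assumptions.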
Bug fixes in corner-case pass
- Base-14 bold text now renders bold (resource-dict key mismatch against Tf /Helvetica-Bold).
- TTC system fonts (Helvetica.ttc, msgothic.ttc) now resolve via fontdb Source::SharedFile.
- Unquoted multi-word font-family tokenises correctly.
- Memory leak in Pdf::from_html_css factories closed (four Box::leak sites replaced with scoped locals).
- PNG alpha / soft-mask (SMask) now renders.
- Shaped text round-trips via extract_text (encode_shaped_run maps glyph clusters back to source codepoints).
- PdfWriter::finish embeds fonts in registration order (was HashMap-random).
- Embedded-font name collisions isolated via monotonic EFn resource names.
- fontdb Mutex no longer held across fs::read of font bytes.
Out of scope
CSS filters, 3D transforms, animations, SVG-in-HTML (every viable Rust SVG crate is MPL), MathML, hyphens: auto, shape-outside, JavaScript, full-matrix transform (scale/rotate), gradients, box-shadow.
Licence audit
cargo deny check licenses passes with zero MPL transitive dependencies. The Mozilla CSS stack (cssparser, selectors, html5ever, lightningcss, stylo) is all MPL-2.0; v0.3.37 hand-rolls the equivalents to keep pdf_oxide entirely under MIT/Apache.
Thanks
- @jmriebold – #248 (“CSS support”) is the root of this release’s entire HTML+CSS→PDF pipeline.
v0.3.36 – 2026-04-19
Markdown structural extraction — Tagged-PDF heading/list emission, multi-column reading order, safer RTL handling
Markdown structural extraction (#377)
to_markdown() now wires /StructTreeRoot directly into the markdown pipeline instead of re-deriving heading levels from font-size heuristics and list markers from glyph detection:
- Heading and list emission from /StructTreeRoot. New StructRole (Heading(1..6), ListItem, ListItemLabel, ListItemBody) attached to every span. Word-tagged documents recover their full heading hierarchy; lists emit "- item" with paragraph breaks at every role transition.
- Role propagated through nested MCRs. H1 → Span → MCR and LI → LBody → Span → MCR patterns now carry the right semantic role via InheritedContext { heading_level, list_role }.
- Per-/StructTreeRoot block boundary forces paragraph break. OrderedContent.block_id increments on every entry into /P, /H1..6, /LI, /Lbl, /LBody, /Sect, /Div, /Art, /TR, /TH, /TD, /Note, /Reference, /BibEntry, /Code; tight-gap layouts no longer merge.
- Same-baseline gate against form-heading over-fragmentation – same-baseline spans re-join into one heading.
- Multi-column gutter detection – same-baseline spans separated by > max(3 × font_size, 30 pt) are treated as cross-column.
- Backward-x reading-order wrap detection – column-major reading order (last span of col 1 at x=976 → first span of col 2 at x=192 same baseline) now breaks paragraphs instead of joining.
- Geometric heading + list-prefix detection for untagged docs. Bold + 5 % size bump promotes to H4. New is_ordered_list_marker recognises 1. / 12. / a) / iv. / A. while rejecting figure captions and years.
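The gutter rule in the list above, a gap wider than max(3 × font_size, 30 pt) between same-baseline spans, is simple enough to sketch. This is an illustration, not the library's code: the `Span` struct, the function names, and the choice to take the font size from the left-hand span are all assumptions.

```rust
/// One text span, reduced to the fields this sketch needs (all in points).
#[derive(Debug, Clone, Copy)]
struct Span {
    x_start: f32,
    x_end: f32,
    font_size: f32,
}

/// Gutter threshold from the rule above: max(3 × font_size, 30 pt).
fn gutter_threshold(font_size: f32) -> f32 {
    (3.0 * font_size).max(30.0)
}

/// True when two same-baseline spans are far enough apart to be treated as
/// belonging to different columns. `left` must precede `right` in x.
/// Assumption: the left span's font size drives the threshold.
fn is_cross_column(left: &Span, right: &Span) -> bool {
    let gap = right.x_start - left.x_end;
    gap > gutter_threshold(left.font_size)
}
```

For body text at 10 pt or below, the 30 pt floor is what applies; the font-relative term only takes over for larger type.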
RTL text — safe-by-default
- Spurious **bold** markers around Arabic contextual glyphs are now stripped (shape transitions flipped the font-weight detector).
- Bidi reorder is OFF by default. An earlier draft ran unicode-bidi’s visual→logical reorder on every RTL line, which broke previously-correct logical-order PDFs (Hebrew name בנימין was being reversed). Reorder helper remains at text::bidi::reorder_visual_to_logical for callers whose input is visual-order.
Markdown output
- Inline-image base64 data URIs capped at 200 KB. PDFs with high-resolution diagrams previously inflated markdown output 10–20× (a 1.9 MB paper produced 11.3 MB of markdown). Images over the cap emit an HTML-comment placeholder with the original size. File-based image output (image_output_dir) is unaffected.
Empirical impact
Validated against v0.3.35 on a 369-PDF regression spanning academic, government, forms, newspapers, technical, theses, IRS, pdfium, pdfjs, safedocs, and slow-corpus subsets:
- 0 catastrophic regressions.
- Token Jaccard vs pdfium and pdftotext: median 1.000, ≥0.95 on 95/106 fixtures.
- Token Jaccard vs pymupdf4llm: median 0.978, ≥0.95 on 65/106 fixtures.
- ~2× more headings emitted than pymupdf4llm across the corpus.
Thanks
- @Goldziher (kreuzberg) – filed #377 with a 727-document benchmark methodology plus 9 reproducer PDFs. The framing (“TF1 within ±3 % so text content is fine, structure is the issue”) made the whole investigation tractable.
v0.3.35 – 2026-04-19
Narrow-glyph doublet preservation in text extraction
Text extraction correctness
- Adjacent narrow-glyph doublets no longer collapsed at small font sizes (#378, PR #379). TextExtractor::deduplicate_overlapping_chars and deduplicate_overlapping_spans used a hardcoded 2 pt absolute threshold; for narrow glyphs (l, r, I, i) in compact fonts at small sizes the per-glyph advance width drops to ≤ 2 pt (Helvetica l ≈ 2.0 pt at 9 pt), so legitimate adjacent doublets one full advance apart fell inside the dedup window and one of the two glyphs was silently dropped. Visible corruption included controller → controler, billed → biled, warranty → warrnty, following → folowing, VIII → VII. Threshold now scales with each glyph’s own advance_width as min(advance_width * 0.30, 2.0). Tunables hoisted to TextExtractor::DEDUP_OVERLAP_RATIO / DEDUP_OVERLAP_CAP_PT associated constants.
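The advance-scaled threshold is easiest to see with numbers. The sketch below reuses the ratio and cap quoted in this entry; the free functions and the `is_duplicate` helper are hypothetical and stand in for whatever the extractor does internally.

```rust
/// Values quoted in the entry above; the names mirror the constants it mentions.
const DEDUP_OVERLAP_RATIO: f32 = 0.30;
const DEDUP_OVERLAP_CAP_PT: f32 = 2.0;

/// Distance (in points) under which two identical glyphs are treated as one:
/// min(advance_width * 0.30, 2.0).
fn dedup_threshold(advance_width: f32) -> f32 {
    (advance_width * DEDUP_OVERLAP_RATIO).min(DEDUP_OVERLAP_CAP_PT)
}

/// Hypothetical check: was the same character drawn twice at nearly the same x
/// (as bold or shadow effects do), rather than typed twice in a row?
fn is_duplicate(prev_x: f32, next_x: f32, advance_width: f32) -> bool {
    (next_x - prev_x).abs() < dedup_threshold(advance_width)
}
```

For a narrow glyph with a 1.8 pt advance the window shrinks to about 0.54 pt, so a second glyph one full advance away is kept, while an overprint 0.2 pt away is still removed. Wide glyphs keep the old 2 pt ceiling.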
Thanks
- @Hugues-DTANKOUO – reported #378 with precise root-cause analysis and authored PR #379 with the advance-scaled threshold and a parametrised regression matrix (4 narrow glyphs × 3 body-text sizes).
v0.3.34 – 2026-04-17
Idiomatic page API across all bindings; structured table extraction
New Features
- Page API (#371) – Python, Node.js, C#, and Go now expose a PdfPage object. Iterate with for page in doc, for (const p of doc), foreach (var p in doc.Pages), or doc.Pages(); index with doc[i], doc.page(i), doc[i], or doc.Page(i). Each page exposes lazy text, markdown(), html(), words, lines, tables, images, paths, annotations, search(), and more.
- Structured table extraction (#289) – extract_tables() (Python), ExtractTables() (C#/Go), and extractTables() (Node.js) now return rows and cells with text plus bounding boxes, not just Markdown. Available on both PdfDocument and the new PdfPage.
- Node.js parity – extractWords, extractTextLines, extractTables, extractPaths, getEmbeddedImages, ocrExtractText wired into the TypeScript layer (previously native-only).
- ExtractedTable → Table – Rust core rename; the redundant Extracted prefix is dropped. FFI-facing types updated.
Text Extraction Quality
- XY-cut column detection on mixed-layout pages (#319) – is_multi_column_page guard tightened to require at least 15 spans per column; column-ordered spans are no longer re-sorted with the row-aware sort in extract_text.
Thanks
- @SeanPedersen for proposing the page-first API (#371).
- @pdenapo for requesting structured table extraction (#289).
v0.3.33 – 2026-04-16
Text extraction, image correctness, and memory safety fixes
Bug Fixes
- ToUnicode CMap miss (#363) – Subset Type0 fonts now emit U+FFFD when a CID is missing from the ToUnicode CMap, instead of falling through to Identity-H ciphertext (e.g. %B+$%8A//$2*%01*1%6APP).
- Intra-word TJ kerning no longer splits words (#365) – 0.10–0.20 em letter-pair kerning inside single words ([(diffe) -150 (rent)]) no longer triggers space insertion.
- Cyrillic UTF-8 mojibake recovered (#317) – Fonts with Latin-only encoding and raw UTF-8 byte sequences now decode correctly.
- FlateDecode partial-recovery rejects garbage output (#364) – MS Reporting Services PDFs whose content streams fail mid-decompress no longer return 128 bytes of pseudo-random data.
- Indexed + ICCBased palette (#373) – Unresolved ICC stream references inside the Indexed base array no longer default /N to 3 instead of CMYK’s 4, fixing diagonal-stripe artefacts. Reported by @Charltsing.
- Lab-base Indexed palettes → sRGB (#337) – CIE L*a*b* palette bytes now converted Lab→XYZ→sRGB instead of reinterpreted as raw RGB.
Memory and Performance
- All internal caches bounded (PRs #369, #354) – Object cache (64 MB), font caches (256–512 entries), XObject span/image caches (1024 entries), and global CMap cache (1024 entries) now use FIFO eviction.
- Path extraction OOM on chart-heavy PDFs fixed (#369) – CTM-aware XObject dedup added so the same XObject at the same position is deduplicated but the same XObject at different positions processes separately.
- Mutex poison resilience – MutexExt::lock_or_recover() replaces 72 .lock().unwrap() call sites.
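A poison-tolerant lock helper of the kind named above can be written as a small extension trait over the standard library. This is a sketch of the general technique, assuming the trait shape shown here; the actual MutexExt in pdf_oxide may be defined differently.

```rust
use std::sync::{Mutex, MutexGuard};

/// Sketch of a poison-tolerant lock helper; the real trait may differ.
trait MutexExt<T> {
    fn lock_or_recover(&self) -> MutexGuard<'_, T>;
}

impl<T> MutexExt<T> for Mutex<T> {
    fn lock_or_recover(&self) -> MutexGuard<'_, T> {
        // A poisoned mutex still holds its data. Take the guard back out of
        // the PoisonError instead of panicking as `.lock().unwrap()` would.
        self.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
    }
}
```

The trade-off is that data recovered this way may have been left half-updated by the thread that panicked, which is usually acceptable for caches that can be rebuilt but deserves care for anything else.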
Dependencies
- RustCrypto cipher 0.5 ecosystem (PRs #352, #295, #291): aes 0.8→0.9, cbc 0.1→0.2, sha2 / sha1 / md-5 0.10→0.11.
Test Suite
- 13 dead/stale ignored tests removed; 3 previously-ignored tests fixed. Regression tests added for every bug fix above. Suite now 6,300 passed, 0 failed, 228 ignored.
Thanks
- @Charltsing for the Indexed + CMYK image extraction bug report (#373).
- @ddxtanx for profiling the unbounded memory growth during multi-page extraction (#354).
- @andrewjradcliffe for PR #369: bounded FIFO caches, CTM-aware XObject dedup, MutexExt poison-recovery trait, Python binding hardening.
v0.3.32 – 2026-04-15
Release pipeline fix for Windows-x64 Go FFI tarball
Release Pipeline
- Fix x86_64-pc-windows-gnu native-lib build failing the v0.3.31 release – scripts/shrink-staticlib.sh ran objcopy --strip-debug on every archive member, but the MinGW cross-compile toolchain emits split-debug .dwo members containing only DWARF sections; after stripping, the member had no sections left and objcopy aborted the whole archive. Fix: drop .dwo archive members via ar d before invoking objcopy. No functional change to Rust, Python, Node, WASM, or C# artifacts – this release exists solely to unblock the Windows-x64 Go install path.
v0.3.31 – 2026-04-13
Bug fixes, Go build changes, release infra improvements
Bug Fixes
- Xref recovery – Fixed recovery for mis-flagged free page objects and off-by-few-bytes xref offset entries.
Breaking Changes
- Go native libs – Native libraries are no longer committed to go/lib/. Consumers must run go run github.com/yfedoseev/pdf_oxide/go/cmd/install@latest once per machine.
Release Infra
- Shrunk Rust staticlibs 63% (71 MB to 26 MB), stripped npm .node addon, dropped sourcemaps from npm, fixed crate sdist leak, tightened NuGet snupkg packaging.
v0.3.27 – 2026-04-12
Go staticlib, Node.js native bindings, C# NativeAOT, OCR FFI, major bug fixes
New Features
- Go staticlib migration – Switched from cdylib to staticlib for self-contained Go binaries.
- Node.js native bindings – Prebuilt platform subpackages via napi-rs style distribution.
- C# LibraryImport – Migrated 881 P/Invoke declarations from DllImport to LibraryImport for NativeAOT compatibility.
- OCR FFI bridge – OCR support now available in Go, C#, and Node.js bindings.
- Regression harness – 60-PDF curated corpus for automated quality testing.
Bug Fixes
- Indexed color space images, AES-256 (V=5, R=6) encryption, reading order for single-column and tabular content, Arabic text extraction, word separation, font width fallback, object cache invalidation, rendering improvements.
v0.3.24 – 2026-04-09
Official bindings for JavaScript/TypeScript, Go, and C#
New Features
- JavaScript/TypeScript bindings – Published on npm with full API coverage.
- Go bindings – Native Go package with complete API surface.
- C# bindings – .NET package published on NuGet.
- C FFI layer – 270+ extern "C" functions with shared pdf_oxide.h header.
- Global log level control – Configurable across all bindings.
v0.3.23 – 2026-04-09
Critical stability fixes
Bug Fixes
- Fixed SIGABRT on pages with degenerate CTM from rotated dvips PDFs.
- Fixed images/XObjects being stripped on save.
- Fixed garbled rendering on systems without common fonts.
- Fixed form field page index always returning 0.
v0.3.22 – 2026-04-08
Thread-safe documents, async Python, free-threaded Python, word/line segmentation tuning
New Features
- Thread-safe PdfDocument – Send + Sync via Mutex (replaced RefCell).
- Async Python API – AsyncPdfDocument, AsyncPdf, AsyncOfficeConverter.
- Free-threaded Python – Support for cp314t (no-GIL builds).
- Segmentation thresholds – word_gap_threshold, line_gap_threshold, profile for tuning word/line detection.
Bug Fixes
- CLI split/merge blank pages, rendering skip for malformed images, structure tree cycle SIGSEGV, table strategy gating.
Performance
- Cached structure tree and decompressed content streams, O(1) MCID lookup, O(log n) page tree traversal, lazy page tree population.
v0.3.21 – 2026-04-04
Multi-arch Python wheels, log level fix
Bug Fixes
- Log level now fully respected in Python (macros forwarded to log crate).
New Features
- Multi-arch Python wheels – Linux aarch64, musl x86_64/aarch64, Windows ARM64; lowered glibc requirement to 2.28.
v0.3.20 – 2026-04-04
Major table extraction rewrite, text quality improvements, silent logging by default
New Features
- Table extraction engine rewrite – Intersection pipeline, text-edge detection, extended grid, column-aware text detection, dotted/dashed line reconstitution, hybrid row detection.
- Text extraction quality – Adjacent value spacing, split decimal merging, bold span consolidation, HTML heading hierarchy, label-value pairing, columnar group merging.
- Silent logging – Logging now silent by default across all bindings; Python logs flow through the logging module via pyo3-log.
Bug Fixes
- Encrypted PDF clear error message, ObjStm/XRef stream decryption, stream parser trailing newline handling.
v0.3.19 – 2026-04-02
Single-call page extraction, column-aware reading order, per-character bounding boxes
New Features
- extract_page_text() – Single-call DTO for streamlined page extraction.
- Column-aware reading order – XY-Cut spatial partitioning for multi-column documents.
- Per-character bounding boxes – Derived from font metrics for precise character positioning.
- is_monospace flag – Available on TextSpan and TextChar.
- Pdf::from_bytes() – New constructor across all bindings.
- Path operations – extract_paths() in Python bindings.
Bug Fixes
- UTF-8 panic on multi-byte debug log, markdown spacing, Form XObject /Matrix, rotated text matrix, prescan CTM loss, deduplication, Tm-scale text drop, markdown word merging, CLI merge blank docs.
Breaking Changes
- WASM – JSON field names now use camelCase.
v0.3.18 – 2026-04-01
Rendering engine overhaul, new Python and WASM APIs, batteries-included Python
New Features
- Rendering engine overhaul – Correct character spacing, embedded font support, standard font metrics, fill-and-stroke, clip path, gradient shading, alpha transparency, stencil image masks, page rotation, separation color spaces.
- New Python APIs – validate_pdf_a, validate_pdf_ua, validate_pdf_x, extract_pages, delete_page, move_page, flatten_to_images, password constructor, merge.
- New WASM APIs – validatePdfA, deletePage, extractPages, save, password constructor, merge.
- Batteries-included Python – Rendering, parallel, signatures, and office conversion enabled by default.
Bug Fixes
- Degenerate CTM abort, FlateDecode flate-bomb protection (256 MB cap), clipping stack sync.
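A size cap like the 256 MB limit mentioned above is commonly enforced by bounding the reader that wraps the decompressor. The sketch below shows that general pattern with the standard library only; `read_capped` is a hypothetical name, and `reader` stands in for a Flate decoder, which is not included here.

```rust
use std::io::{self, Read};

/// Read at most `cap` bytes from `reader`; fail if the stream holds more.
/// `reader` stands in for a decompressor such as a Flate decoder.
fn read_capped<R: Read>(reader: R, cap: u64) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    // Allow one extra byte so "exactly cap" and "more than cap" can be told apart.
    reader.take(cap.saturating_add(1)).read_to_end(&mut out)?;
    if out.len() as u64 > cap {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "decoded stream exceeds size cap",
        ));
    }
    Ok(out)
}
```

Because the read stops one byte past the cap, an endless or maliciously inflated stream costs at most `cap + 1` bytes of memory before it is rejected.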
v0.3.17 – 2026-03-08
Table detection refinement, tagged PDF optimization
Improvements
- Refined table detection – Requires 2+ columns, reducing false positives.
- Optimized tagged PDF extraction pipeline.
Bug Fixes
- Fixed RefCell already borrowed panic on recursive Form XObject processing.
v0.3.16 – 2026-03-08
Smart hybrid table extraction, Python type stubs, pathlib support
New Features
- Smart hybrid table extraction – Union-Find clustering, visual line analysis, visual spans/headers.
- Professional ASCII tables – Multiline wrapping for terminal output.
- Python type stubs – Auto-generated via mypy stubgen.
- Python PdfDocument – Accepts pathlib.Path and supports context manager.
Bug Fixes
- Segfault in nested Form XObject, Python coordinate scaling, ASCII table UTF-8 panic.
v0.3.15 – 2026-03-06
Header/footer management, page templates, scoped extraction
New Features
- Header/footer management API – Add, remove, and edit PDF artifacts.
- Page templates – Dynamic placeholders for page numbering, dates, etc.
- Scoped extraction – Respects erase_regions for filtered output.
- PdfDocument.from_bytes() – New Python constructor.
Bug Fixes
- Multi-column reading order (XY-Cut), font identity collisions, Lines table strategy false positives.
v0.3.14 – 2026-03-03
High-level rendering, word/line extraction, geometric primitives, hybrid tables
New Features
- High-level rendering API – Pdf::render_page in Rust, Python, and WASM.
- Word and line extraction – extract_words, extract_text_lines across all bindings.
- Geometric primitive extraction – extract_rects, extract_lines.
- Hybrid table detection – Vector line hints improve table boundary detection.
- API harmonization – Fluent .within(page, rect) pattern.
- CLI commands – render and paths commands with --area filtering.
Bug Fixes
- OCR feature gating discovery, XObject span cache poisoning, V=4 crypt filters, encrypted CIDToGIDMap.
v0.3.13 – 2026-03-02
CJK text extraction fixes
Bug Fixes
- Multi-byte decoding in extract_chars for CJK/Type0 fonts, improved character positioning accuracy, character spacing scaling.
v0.3.12 – 2026-03-01
Text extraction quality, markdown conversion, performance
Improvements
- Text extraction quality – CID font width calculation, font-change word boundary detection, non-standard CID mapping fallback, RTL text directionality.
- Markdown conversion – XY-Cut recursive spatial partitioning, heading detection, list reconstruction.
Performance
- Zero-copy page tree traversal, structure tree caching, BT operator early-out, larger I/O buffer, removed xref reconstruction threshold.
v0.3.10 – 2026-02-26
Parallel extraction, WASM/JavaScript support, batch processing, text quality improvements
New Features
- WASM/JavaScript support – WebAssembly bindings via wasm-bindgen. Full text extraction, PDF creation, editing, form fields, and search available in the browser and Node.js. Published as pdf-oxide-wasm on npm.
- Parallel page extraction – New parallel feature flag with rayon-based multi-threaded extraction. ParallelExtractor distributes pages across worker threads. Global font cache ensures fonts are parsed only once.
- Batch processing API – New BatchProcessor for multi-PDF workflows with progress callbacks and error collection. Supports both sequential and parallel processing.
- OCR hybrid detection – New PageType enum (NativeText, ScannedPage, HybridPage) with multi-heuristic detection for intelligent OCR fallback.
- Full WASM/Python API parity – 10 new method groups across WASM and Python bindings: form field get/set, image bytes extraction, PDF-from-images, form flattening, PDF merging, file embedding, page labels, XMP metadata.
Bug Fixes
- Circular XObject segfault – Fixed segfault from circular Form XObject references during image extraction
- XRef /Prev chain overflow – XRef /Prev chain parsing rewritten from recursive to iterative with cycle detection
- Broken ligature text – repair_ligatures() post-processor fixes corrupted text from LaTeX PDFs
- Text extraction quality – Annotation text extraction, leader dot normalization, Priority 3 CMap support
- Table extraction – Merged cells, multi-line cell content, font-based header detection
- Form field persistence – Incremental save now correctly persists form field value changes
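The /Prev chain fix in the list above follows a standard pattern: replace recursion with a loop and remember every offset already visited. The sketch below illustrates that pattern under simplifying assumptions; `walk_prev_chain` is a hypothetical name, and the `prev_of` map stands in for parsing the trailer found at each byte offset.

```rust
use std::collections::{HashMap, HashSet};

/// Follow a chain of xref sections from `start`, newest first.
/// `prev_of` maps a section's byte offset to its /Prev offset, if any;
/// in a real parser that lookup would parse the trailer at the offset.
fn walk_prev_chain(
    start: u64,
    prev_of: &HashMap<u64, Option<u64>>,
) -> Result<Vec<u64>, String> {
    let mut visited = HashSet::new();
    let mut order = Vec::new();
    let mut next = Some(start);

    // A loop instead of recursion: stack depth stays constant however long
    // the chain is, so a crafted file cannot overflow the stack.
    while let Some(offset) = next {
        if !visited.insert(offset) {
            return Err(format!("xref /Prev cycle at offset {offset}"));
        }
        order.push(offset);
        next = match prev_of.get(&offset) {
            Some(prev) => *prev,
            None => return Err(format!("no xref section at offset {offset}")),
        };
    }
    Ok(order)
}
```

A chain 900 → 500 → 100 yields the offsets in that order; if the section at 100 points back to 900, the second visit to 900 is reported as a cycle instead of looping forever.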
Performance
- Image-only page skip – page_cannot_have_text() pre-check skips decompression for pages with no fonts
- SmallVec operator operands – Stack-allocated operands eliminate per-operator heap allocation
- Cross-document font cache – Process-level LRU font cache shared across all PdfDocument instances
v0.3.9 – 2026-02-24
20+ micro-optimizations – 40% faster text extraction
Performance
- O(n^2) string concat fix – Pre-allocated Vec<&str> joined at end replaces quadratic String::push_str() accumulation
- Image-only content stream parser – New fast path for extract_images() that skips text and graphics operators (3-5x faster)
- Fingerprint-based font cache – Font identity by hashing encoding+widths+flags instead of full struct comparison
- Streaming parser – Content stream operators streamed instead of collected into Vec
- Fast inline parser for BT/ET – Direct byte matching for common text operators
- Byte-to-char lookup table – 256-entry lookup replaces HashMap in hot path
- Width lookup table – Fixed-size array replaces HashMap for glyph widths
- Shrink Operator enum – 112 to 40 bytes via boxing large variants (64% smaller)
- zlib-rs backend – 15-25% faster stream decompression via zlib-ng port
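The enum-shrinking item in the list above relies on a general Rust property: an enum is as large as its largest variant, so boxing a rare, bulky variant shrinks every value. The toy types below are invented for illustration and will not reproduce the 112 → 40 byte figures quoted for the real Operator enum.

```rust
use std::mem::size_of;

/// Every value of this enum is as large as its biggest variant.
#[allow(dead_code)]
enum Wide {
    Move(f32, f32),
    Curve([f32; 6]),
    InlineImage { dict: [u64; 12], data: Vec<u8> },
}

/// Payload of the rare, bulky variant, moved into its own type.
#[allow(dead_code)]
struct InlineImage {
    dict: [u64; 12],
    data: Vec<u8>,
}

/// Same variants, with the large payload stored behind a pointer.
#[allow(dead_code)]
enum Slim {
    Move(f32, f32),
    Curve([f32; 6]),
    InlineImage(Box<InlineImage>),
}

/// Returns (size of the unboxed enum, size of the boxed enum) in bytes.
fn sizes() -> (usize, usize) {
    (size_of::<Wide>(), size_of::<Slim>())
}
```

The cost is one heap allocation and one pointer hop whenever the boxed variant is built or read, which is why the technique suits variants that are large but uncommon.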
Bug Fixes
- Font encoding with embedded programs – Correct base encoding resolution per PDF spec
- Supplementary Unicode (U+10000+) – Fixed truncation of supplementary code points
- StandardEncoding ligature mapping – Correct fi, fl, ff, ffi, ffl mapping via Adobe Glyph List
- Kangxi Radical normalization – Full U+2F00-U+2FD5 mapping table
- RTL text character order – Arabic/Hebrew extracted in logical reading order
- Multi-column text separation – Improved column detection via gap analysis
Features
- extract_all_text() – New convenience method for all-page text extraction
- source_role for StructElem – Preserves original PDF role name before role mapping
v0.3.8 – 2026-02-20
Text-only parser – graphics-heavy pages 10-30x faster
Performance
- Text-only content stream parser – New parse_content_stream_text_only() fast path skips graphics operators outside BT/ET blocks using byte-level scanning instead of full nom parsing
- Byte-level graphics scanner – Raw index arithmetic replaces nom-based operand loop, processing at near-memcpy speed
- Skip color operators – 12 color operators added to byte-level skip list
- Defer q/cm/Q emission – Graphics state ops deferred until text is confirmed, eliminating ~75% of backtrack overhead
- Arc-wrap FontInfo cache – Avoids cloning full FontInfo structs on cache hits
- O(n) page map construction – Single-pass traversal replaces recursive descent
- XObject name-to-ref cache – Eliminates O(n^2) dictionary cloning on XObject-heavy pages
v0.3.7 – 2026-02-19
Text extraction quality: 95.7% to 99.6% clean rate
Verified – 3,829-PDF Corpus
| Metric | v0.3.6 | v0.3.7 | Change |
|---|---|---|---|
| Clean rate | 95.7% | 99.6% | 3,812 of 3,829 PDFs |
| Dirty PDFs | 165 | 17 | -90% |
Added – Parser & Decoders
- BrotliDecode stream filter (PDF 2.0) – New decoder for Brotli-compressed streams
- Xref trailer selection – Correct trailer selection when multiple trailers exist
- Headerless PDF recovery – Search for first object marker when %PDF- header is missing
Added – Font Encoding
- CFF font encoding parser – Parse CFF/OpenType font programs for character encoding
- Type1 font encoding parser – Parse embedded Type 1 font programs for glyph mappings
- 80K+ CID-to-Unicode mappings – Expanded Adobe-CNS1, Adobe-GB1, Adobe-Japan1, Adobe-Korea1
- Shift-JIS/RKSJ decoding – Japanese Shift-JIS encoded CMap stream support
- Identity-H cmap propagation – Propagate TrueType cmap tables from CIDFont descendants
Fixed – Text Extraction Pipeline
- Tf buffer flush – Flush pending text on font switch to prevent text loss
- Adaptive space threshold – Replace fixed 0.25em threshold with bbox-based spacing
- Span deduplication – Deduplicate overlapping spans rendered for bold/shadow effects
- Character deduplication – Remove duplicate characters within 2pt on the same line
- BT operator check removal – Fix incorrect validation that skipped valid text blocks
- ByteMode decoding – Proper 1-byte, 2-byte, and variable-width character code decoding
- Annotation text extraction – Extract text from Widget, FreeText, and appearance streams
v0.3.6 – 2026-02-16
10x faster – two O(n) bottlenecks eliminated
Performance
- Bulk page tree cache – On first page access, the entire page tree is walked once and all pages are cached. Previously get_page() traversed from root for every uncached page, resulting in O(n) per page and O(n^2) total for sequential access. Now O(1) per page after a single O(n) walk. A 10,000-page veraPDF test file went from 55,667ms to 332ms (168x faster).
- Scan-for-object offset cache – When objects are missing from the xref table, scan_for_object() previously read the entire PDF file for each missing object. Tagged PDFs with hundreds of structure tree elements not in xref triggered hundreds of full file reads. Now the file is scanned once and all object offsets are cached. A 10-page tagged PDF went from ~10s to 68ms (146x faster). A 154-page academic PDF with 571 fonts went from ~18s to 405ms (44x faster).
- Single-pass text extraction – extract_spans() no longer runs two passes (classify document type, then extract). The classification pass was eliminated entirely; adaptive font-aware thresholds now produce equal or better results in a single pass.
- Content stream Vec pre-allocation – parse_content_stream() pre-allocates operator Vec capacity based on stream size, reducing reallocations for large content streams.
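The bulk page tree cache described in the list above amounts to flattening the tree once and indexing into the result afterwards. The sketch below models that idea on a toy tree; `Node`, `Doc`, and `build_page_cache` are illustrative names, and a real page tree would be resolved through indirect object references rather than owned children.

```rust
/// Toy page tree: interior nodes hold kids, leaves hold a page's object number.
enum Node {
    Pages(Vec<Node>),
    Page(u32),
}

/// Flatten the tree once, in document order, using an explicit stack.
fn build_page_cache(root: &Node) -> Vec<u32> {
    let mut pages = Vec::new();
    let mut stack = vec![root];
    while let Some(node) = stack.pop() {
        match node {
            Node::Page(obj) => pages.push(*obj),
            // Push kids in reverse so the leftmost kid is visited first.
            Node::Pages(kids) => stack.extend(kids.iter().rev()),
        }
    }
    pages
}

/// Document wrapper that walks the tree on first access only.
struct Doc {
    root: Node,
    cache: Option<Vec<u32>>,
}

impl Doc {
    /// O(n) on the first call, O(1) on every later call.
    fn get_page(&mut self, index: usize) -> Option<u32> {
        let root = &self.root;
        let cache = self.cache.get_or_insert_with(|| build_page_cache(root));
        cache.get(index).copied()
    }
}
```

Sequential access to all n pages then costs one O(n) walk plus n constant-time lookups, instead of n separate descents from the root.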
Verified – 3,830-PDF Corpus (v0.3.5 to v0.3.6)
| Metric | v0.3.5 | v0.3.6 | Change |
|---|---|---|---|
| Pass rate | 99.8% | 99.8% | 3,823 of 3,830 valid PDFs |
| Slow (>5s) | 2 | 0 | Eliminated |
| Mean | 23.3ms | 2.1ms | -91% |
| p50 | 0.6ms | 0.6ms | – |
| p90 | 3.0ms | 2.6ms | -13% |
| p99 | 33.2ms | 18.0ms | -46% |
| Max | 68,722ms | 625ms | -99% |
| Sum (all PDFs) | 89.1s | 8.0s | -91% |
Text output verified byte-identical on 11 PDFs (862 KB of extracted text). 4 PDFs showed improved extraction quality from adaptive spacing.
v0.3.5 – 2026-02-15
Performance, 3,830-PDF stability, and error recovery
Performance
- Font caching across pages – Document-level font cache keyed by ObjectRef avoids re-parsing shared fonts on every page
- Page object caching – get_page() caches resolved page objects, eliminating repeated page tree traversal for multi-page extraction
- Structure tree caching – Structure tree result cached after first access, avoiding redundant parsing on every extract_text() call
- BT operator early-out – Text extraction skips the full pipeline for image-only pages that contain no BT (Begin Text) operators
- Larger I/O buffer for big files – BufReader capacity increased from 8 KB to 256 KB for files over 100 MB
- Xref reconstruction threshold removed – Eliminated the heuristic that triggered full-file reconstruction on valid portfolio PDFs with few objects
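The larger I/O buffer for big files amounts to choosing a BufReader capacity from the file length. A minimal sketch follows, assuming "100 MB" means 100 * 1024 * 1024 bytes; the function and constant names are hypothetical and not taken from the PDF Oxide source.

```rust
use std::fs::File;
use std::io::{self, BufReader};
use std::path::Path;

const DEFAULT_BUF: usize = 8 * 1024; // 8 KB
const LARGE_BUF: usize = 256 * 1024; // 256 KB
// Assumption: the threshold is interpreted in binary megabytes.
const LARGE_FILE_THRESHOLD: u64 = 100 * 1024 * 1024;

/// Pick the buffer capacity for a file of `file_len` bytes.
fn buffer_capacity_for(file_len: u64) -> usize {
    if file_len > LARGE_FILE_THRESHOLD {
        LARGE_BUF
    } else {
        DEFAULT_BUF
    }
}

/// Open `path` with a buffer sized according to the file length.
fn open_buffered(path: &Path) -> io::Result<BufReader<File>> {
    let file = File::open(path)?;
    let len = file.metadata()?.len();
    Ok(BufReader::with_capacity(buffer_capacity_for(len), file))
}
```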
Verified – 3,830-PDF Corpus
- 100% pass rate on 3,830 PDFs across veraPDF (2,907), Mozilla pdf.js (897), SafeDocs (26)
- Zero timeouts, zero panics
- p50 = 0.6ms, p90 = 3.0ms, p99 = 33ms
Added – Encryption
- Owner password authentication – Algorithm 7 for R<=4, Algorithm 12 for R>=5
- R>=5 user password verification with SASLprep – Full AES-256 password verification using SHA-256
- Public password authentication API – Pdf::authenticate(password) and PdfDocument::authenticate(password)
Added – PDF/A Compliance Validation
- XMP metadata validation – Checks for pdfaid:part and pdfaid:conformance entries
- Color space validation – Scans page content streams for device-dependent color operators without output intent
- AFRelationship validation – PDF/A-3 embedded file spec validation
Added – PDF/X Compliance Validation
- XMP PDF/X identification – Validates pdfxid:GTS_PDFXVersion
- Page box relationship validation – TrimBox within BleedBox within MediaBox
- ExtGState transparency detection – SMask, CA/ca, BM checks
- Device-dependent color detection – Flags unsupported color spaces
- ICC profile validation – Validates ICCBased profile streams
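The page box relationship check (TrimBox within BleedBox within MediaBox) reduces to two rectangle containment tests. The sketch below uses hypothetical types and exact comparisons; a real validator would probably allow a small tolerance and handle absent boxes, neither of which is shown here.

```rust
/// Axis-aligned rectangle in PDF user space (lower-left and upper-right corners).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    llx: f64,
    lly: f64,
    urx: f64,
    ury: f64,
}

impl Rect {
    /// True when `inner` lies entirely inside `self` (edges may touch).
    fn contains(&self, inner: &Rect) -> bool {
        inner.llx >= self.llx
            && inner.lly >= self.lly
            && inner.urx <= self.urx
            && inner.ury <= self.ury
    }
}

/// Hypothetical check: TrimBox inside BleedBox inside MediaBox.
fn check_page_boxes(media: Rect, bleed: Rect, trim: Rect) -> Result<(), String> {
    if !media.contains(&bleed) {
        return Err("BleedBox extends beyond MediaBox".to_string());
    }
    if !bleed.contains(&trim) {
        return Err("TrimBox extends beyond BleedBox".to_string());
    }
    Ok(())
}
```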
Added – Rendering
- Spec-correct clipping – Clip state scoped to q/Q save/restore
- Glyph advance width calculation – Per PDF spec section 9.4.4
- Form XObject rendering – Parses /Matrix transform, uses form’s /Resources
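For reference, the horizontal glyph displacement in PDF spec section 9.4.4 is tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th, where word spacing Tw applies only to the single-byte space character. The sketch below expresses that formula directly; the struct and function names are hypothetical and do not reflect the renderer's internal types.

```rust
/// Subset of the text state that affects horizontal advance (hypothetical type).
struct TextState {
    font_size: f64,        // Tfs
    char_spacing: f64,     // Tc
    word_spacing: f64,     // Tw
    horizontal_scale: f64, // Th, i.e. the Tz operand divided by 100
}

/// Horizontal displacement for one glyph in horizontal writing mode.
/// `w0` is the glyph width in text space units (font width / 1000 for simple fonts),
/// `tj_adjust` is the TJ array adjustment applied before this glyph (0 if none),
/// `is_space` is true only for the single-byte character code 32.
fn horizontal_advance(w0: f64, tj_adjust: f64, is_space: bool, ts: &TextState) -> f64 {
    let tw = if is_space { ts.word_spacing } else { 0.0 };
    ((w0 - tj_adjust / 1000.0) * ts.font_size + ts.char_spacing + tw) * ts.horizontal_scale
}
```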
Fixed – Error Recovery (28+ real-world PDFs)
- Missing objects resolve to Null per PDF spec section 7.3.10
- Lenient header version parsing for unusual version strings
- Non-standard encryption algorithm matching (V=1, R=3 combinations)
- Non-dictionary Resources treated as empty instead of erroring
- Null nodes in page tree gracefully skipped
- Corrupt content streams return empty content instead of errors
- Enhanced page tree scanning with /Resources+/Parent heuristic
Fixed – DoS Protection
- Page count validated against PDF spec Annex C.2 limit (8,388,607)
Fixed – Image Extraction
- Content stream image extraction via Do operators
- Nested Form XObject images with cycle detection
- Inline images (BI…ID…EI sequences)
- CTM transformations for image positioning
- ColorSpace indirect reference resolution
Fixed – Parser Robustness
- Multi-line object headers (1 0\nobj format used by Google-generated PDFs)
- Extended header search from 1024 to 8192 bytes
- Lenient version parsing for malformed headers
Fixed – Page Access Robustness
- Pages without /Contents return empty content
- Cyclic page tree detection prevents stack overflow
- Null stream references handled gracefully
- Pages without /Type entry found by /MediaBox or /Contents keys
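Cycle detection and null-node skipping in a page tree walk can be illustrated on a toy object table. This is a sketch under stated assumptions, not the library's traversal code: Node and collect_pages are hypothetical, objects are keyed by a plain integer id, and a missing id stands in for a null reference. An explicit stack plus a visited set keeps the walk bounded even when a /Kids entry points back at an ancestor.

```rust
use std::collections::{HashMap, HashSet};

/// Toy page tree node: an intermediate node with kids, or a leaf page.
enum Node {
    Pages(Vec<u32>),
    Page,
}

/// Collect leaf page ids in document order, tolerating cycles and missing nodes.
fn collect_pages(objects: &HashMap<u32, Node>, root: u32) -> Vec<u32> {
    let mut pages = Vec::new();
    let mut visited = HashSet::new();
    let mut stack = vec![root];
    while let Some(id) = stack.pop() {
        // `insert` returns false when the id was already seen: a cycle or duplicate.
        if !visited.insert(id) {
            continue;
        }
        match objects.get(&id) {
            Some(Node::Page) => pages.push(id),
            // Push kids in reverse so the first kid is popped first.
            Some(Node::Pages(kids)) => stack.extend(kids.iter().rev().copied()),
            // Missing object: treated like a null node and skipped.
            None => {}
        }
    }
    pages
}
```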
Fixed – Encryption Robustness
- AES decryption with undersized keys returns error instead of panic
- Xref stream parsing hardened against malformed entries
- Indirect /Encrypt references resolved before parsing
Fixed – Content Stream Processing
- Dictionary-as-Stream fallback for bare dictionaries
- Abbreviated filter names (AHx, A85, LZW, Fl, RL, CCF, DCT)
- Content stream operator limit (default 1,000,000)
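The abbreviated filter names listed above map onto the standard filter names defined by the PDF specification for inline images. A small lookup is enough to normalise them; expand_filter_name is a hypothetical helper name used only for this illustration.

```rust
/// Expand an abbreviated filter name to its full form; unknown names pass through.
fn expand_filter_name(name: &str) -> &str {
    match name {
        "AHx" => "ASCIIHexDecode",
        "A85" => "ASCII85Decode",
        "LZW" => "LZWDecode",
        "Fl" => "FlateDecode",
        "RL" => "RunLengthDecode",
        "CCF" => "CCITTFaxDecode",
        "DCT" => "DCTDecode",
        other => other,
    }
}
```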
Fixed – Code Quality
- Structure tree indirect object references resolved at parse time
- Lexer R/RG token disambiguation
- Stream whitespace trimming no longer strips NUL bytes or spaces from binary data
Tests
- 8 previously ignored tests un-ignored and fixed
Removed
- Empty PdfImage stub (extraction uses ImageInfo)
- Commented-out DocumentType::detect() test block
v0.3.4 – 2026-02-12
Parsing robustness, character extraction, and XObject paths
Breaking Changes
- parse_header() signature changed from (u8, u8) to (u8, u8, u64) to include byte offset
Fixed – PDF Parsing Robustness (Issue #41)
- PDFs with binary prefixes or BOM headers now open successfully
- Header search scans first 1024 bytes for %PDF- marker
- Supports UTF-8 BOM, email headers, and other leading binary data
- Lenient mode handles real-world malformed PDFs; strict mode for compliance testing
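A bounded header search of this kind might look roughly like the sketch below, which returns major version, minor version, and the byte offset of the marker. It is a strict, simplified illustration: find_header is a hypothetical name, only single-digit "M.m" versions are accepted, and the lenient handling of malformed version strings is not shown.

```rust
/// Search the first `search_limit` bytes for "%PDF-" and parse a "M.m" version.
/// Returns (major, minor, byte offset of the marker).
fn find_header(data: &[u8], search_limit: usize) -> Option<(u8, u8, u64)> {
    const MARKER: &[u8] = b"%PDF-";
    // The marker must lie entirely within the search window.
    let window = &data[..data.len().min(search_limit)];
    let offset = window.windows(MARKER.len()).position(|w| w == MARKER)?;
    let rest = &data[offset + MARKER.len()..];
    match rest {
        [major, b'.', minor, ..] if major.is_ascii_digit() && minor.is_ascii_digit() => {
            Some((*major - b'0', *minor - b'0', offset as u64))
        }
        _ => None,
    }
}
```

A file that starts with a UTF-8 BOM would report an offset of 3, which a caller could use to rebase any offsets that are measured from the marker.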
Added – Character-Level Text Extraction (Issue #39)
- extract_chars() returns Vec<TextChar> with per-character positioning
- Includes transformation matrix, rotation angle, advance width
- Sorted in reading order with overlapping character deduplication
- 30-50% faster than span extraction for character-only use cases
- Exposed in both Rust and Python APIs
Added – XObject Path Extraction (Issue #40)
- extract_paths() recursively processes Form XObjects via Do operator
- Coordinate transformations via /Matrix properly applied
- Graphics state properly isolated (save/restore)
- Duplicate XObject detection prevents infinite loops
- Nested XObjects supported
Changed
- Upgraded nom parser library from 7.1 to 8.0
v0.3.3 – 2026-02-11
CJK support, structure tree enhancements, and compliance foundations
Includes all changes from v0.2.5 and v0.2.6 as a consolidated release.
Highlights
- TagSuspect/MarkInfo support – Parse MarkInfo dictionary from document catalog
- Word Break /WB structure element for CJK text
- Predefined CMap support for Adobe-GB1 (Simplified Chinese), Adobe-Japan1 (Japanese), Adobe-CNS1 (Traditional Chinese), Adobe-Korea1 (Korean)
- Abbreviation expansion /E support
- Type 0 /W array parsing for CIDFont glyph widths
- Soft hyphen (U+00AD) handling fix
- Enhanced artifact filtering with subtype support
- Image embedding in HTML and Markdown output (base64 data URIs)
- Image file export with embed_images=false and image_output_dir
- PdfImage::to_base64_data_uri() and to_png_bytes() methods
v0.3.2 – 2026-02-01
Editing, encryption, and document security
Added – PDF Editing
- DocumentEditor for modifying existing PDFs
- Full annotation support (text markup, shapes, stamps, ink, file attachments, redactions)
- Interactive form field creation (text, checkbox, radio, dropdown, list, button)
- Form flattening
- Link annotations (URLs, internal page navigation)
- Outline/bookmark builder
- PDF layers (Optional Content Groups)
Added – Encryption
- Encryption on write (AES-256, AES-128, RC4-128, RC4-40)
- Permission controls (print, copy, modify, annotate)
- EncryptionConfig builder with EncryptionAlgorithm and Permissions
- Digital signature foundation
v0.3.1 – 2026-01-14
Form fields, multimedia, creation tools, and search
Added – PDF Creation
- Pdf::from_markdown(), Pdf::from_html(), Pdf::from_text(), Pdf::from_image()
- PdfBuilder fluent pattern for metadata and layout configuration
- DocumentBuilder for programmatic PDF generation
- Table rendering with TableRenderer
- Graphics API: colors, gradients, patterns, blend modes, transparency
- Page templates with headers, footers, page numbering, watermarks
- Barcode generation (QR, Code128, EAN-13, UPC-A, Code39, ITF)
Added – Search
- Text search with regex, case-sensitive/insensitive, whole word, page ranges
- SearchOptions and SearchResult types
- Position tracking with page/coordinates
Added – Form Field Coverage (95%)
- Hierarchical field creation (parent/child structures with dotted names)
- Field property modification (readonly, required, rect, tooltip, max length, alignment, default value)
- FDF/XFDF export for form data exchange
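In a hierarchical form, a field's fully qualified name is built by joining the partial names of its ancestors with periods, which is where the dotted names come from. The sketch below shows that rule on a flat arena of fields; the Field struct and full_name function are hypothetical, and the parent links are assumed to be acyclic.

```rust
/// Toy form field: a partial name plus an optional index of its parent field.
struct Field {
    partial_name: String,
    parent: Option<usize>,
}

/// Build the fully qualified (dotted) name by walking up the parent chain.
/// Assumption: parent links contain no cycles.
fn full_name(fields: &[Field], index: usize) -> String {
    let mut parts = Vec::new();
    let mut current = Some(index);
    while let Some(i) = current {
        parts.push(fields[i].partial_name.as_str());
        current = fields[i].parent;
    }
    // Parts were collected leaf-first, so reverse to get root-first order.
    parts.reverse();
    parts.join(".")
}
```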
Added – Multimedia Annotations
- MovieAnnotation, SoundAnnotation, ScreenAnnotation, RichMediaAnnotation
- ThreeDAnnotation with U3D and PRC format support
Added – XFA Form Support
- XfaExtractor, XfaParser, XfaConverter (XFA to AcroForm conversion)
Changed – Python Bindings
- True Python 3.8-3.14 support via abi3-py38
- Modern tooling: uv, pdm, ruff integration
v0.3.0 – 2026-01-10
Extraction foundation – unified API and core capabilities
Added – Unified Pdf API
- Pdf::open() for reading existing PDFs
- DOM-like page navigation with pdf.page(0)
- PdfDocument low-level handle for advanced use cases
Added – Text Extraction
- extract_text() – full-page plain text
- extract_spans() – styled text runs with font metadata
- Structure tree-based reading order for tagged PDFs
- Intelligent line-break and space detection for untagged PDFs
Added – Image Extraction
- extract_images() – extract all images from a page
- Format detection (JPEG, PNG, TIFF, JBIG2, CCITT)
- Color space handling (DeviceRGB, DeviceCMYK, DeviceGray, ICCBased)
Added – Metadata Extraction
- Document info dictionary (title, author, subject, keywords)
- XMP metadata read/write
- Page info (dimensions, rotation, media/crop/trim boxes)
Added – Form Extraction
- extract_form_fields() for AcroForm field enumeration
- Text, button, choice, and signature field types
Added – Conversion
- to_markdown() – page-level Markdown conversion
- to_html() – page-level HTML conversion
- to_plain_text() – configurable plain text output
Added – Compliance
- PDF/A validation (ISO 19005, levels 1a through 3b)
- PDF/X validation (ISO 15930, levels X-1a through X-6p)
- PDF/UA validation (ISO 14289, levels UA-1 and UA-2)
Added – Rendering (requires rendering feature)
- Render pages to PNG/JPEG via tiny-skia
- Configurable DPI and scale
Added – Python Bindings
- PdfDocument class with full extraction API
- Pdf class with creation and high-level API
- PyO3-based, published to PyPI as pdf_oxide
v0.2.4 – 2026-01-09
- CTM transformation fix for text positioning
- Structure tree /Alt and /Pg parsing
- FormulaRenderer for formula images
v0.2.3 – 2026-01-07
- BT/ET matrix reset per PDF spec
- Geometric spacing detection in Markdown converter
- apply_intelligent_text_processing() for ligatures and hyphenation
v0.2.2 – 2025-12-15
- Keyword optimization for discoverability
v0.2.1 – 2025-12-15
- Encrypted stream decoding improvements
v0.1.4 – 2025-12-12
- Encrypted stream decoding fixes
v0.1.0 – 2025-11-06
- Initial release
- PDF text extraction with spec-compliant Unicode mapping
- Intelligent reading order detection
- Python bindings via PyO3
- Encrypted PDF support
- Form field extraction
- Image extraction