What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation is needed: run `pip install pdf_oxide` and call `extract_text_ocr()`. It supports the PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

JavaScript API Reference

PDF Oxide provides WebAssembly bindings for JavaScript and TypeScript. The npm package pdf-oxide-wasm runs in Node.js, browsers, bundlers, Deno, and Cloudflare Workers.

npm install pdf-oxide-wasm

Multi-target packaging (v0.3.38)

pdf-oxide-wasm now ships three builds side by side through conditional exports in package.json. Pick the subpath that matches your runtime. The auto-routed top-level import also resolves correctly through the exports field in most environments.

Subpath	Target
`pdf-oxide-wasm/nodejs`	Node.js (CommonJS + ESM)
`pdf-oxide-wasm/bundler`	Vite, webpack, Rollup, esbuild, Bun
`pdf-oxide-wasm/web`	Browsers, Deno, Cloudflare Workers

// Node.js
import { WasmPdfDocument } from "pdf-oxide-wasm/nodejs";

// Vite / webpack / Rollup
import init, { WasmPdfDocument } from "pdf-oxide-wasm/bundler";
await init();

// Browsers / Deno / Workers
import init, { WasmPdfDocument } from "pdf-oxide-wasm/web";
await init();

This fixes the ReferenceError: Can't find variable: __dirname error that browser bundlers threw before v0.3.38.

For the Rust API, see the Rust API Reference. For the Python API, see the Python API Reference. For type details, see Types and Enums.

Some methods are gated behind Rust build features (rendering, signatures, barcodes, ocr-tract). The default pdf-oxide-wasm package enables the common set; OCR ships in the separate wasm-ocr build. See Feature Availability.

Module functions

Free functions exported at the top level of the package.

import {
  setLogLevel, disableLogging,
  generateBarcodeSvg, generateQrSvg,
  planSplitByBookmarks, splitByBookmarks,
  setCryptoPolicy, cryptoPolicy, cryptoInventory, cryptoCbom,
  modelManifest, prefetchAvailable,
  signPdfBytes, signPdfBytesPades, hasDocumentTimestamp,
} from "pdf-oxide-wasm";

Logging

setLogLevel(level)   // Set log verbosity: "off" | "error" | "warn" | "info" | "debug" | "trace"
disableLogging()     // Silence all log output

Barcodes

generateBarcodeSvg(barcodeType, data) -> string  // 1D barcode as SVG; type 0–7 (Code128, Code39, Ean13, Ean8, UpcA, Itf, Code93, Codabar)
generateQrSvg(data, errorCorrection, size) -> string  // QR code as SVG; errorCorrection 0=Low 1=Medium 2=Quartile 3=High
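
The numeric type codes are easy to mistype, so a small lookup may help. This is a minimal sketch that assumes the codes 0 to 7 follow the order in which the comment above lists the symbologies. `BARCODE_TYPES`, `barcodeTypeCode`, and `QR_ERROR_CORRECTION` are hypothetical helpers, not part of pdf-oxide-wasm.

```javascript
// Symbology names in the order the reference lists them for type codes 0-7.
// Assumption: the numeric code equals the position in this list.
const BARCODE_TYPES = [
  "Code128", "Code39", "Ean13", "Ean8", "UpcA", "Itf", "Code93", "Codabar",
];

// Hypothetical helper: resolve a symbology name (case-insensitive) to its code.
function barcodeTypeCode(name) {
  const code = BARCODE_TYPES.findIndex(
    (t) => t.toLowerCase() === String(name).toLowerCase()
  );
  if (code === -1) {
    throw new RangeError(`Unknown barcode type: ${name}`);
  }
  return code;
}

// QR error-correction levels as listed in the generateQrSvg comment.
const QR_ERROR_CORRECTION = { Low: 0, Medium: 1, Quartile: 2, High: 3 };
```

With these helpers a call could read `generateBarcodeSvg(barcodeTypeCode("Ean13"), "4006381333931")`, which keeps the magic number out of application code.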

Split by bookmarks

planSplitByBookmarks(srcBytes, titlePrefix, ignoreCase, level, includeFrontMatter) -> Array  // Plan a split without producing PDFs; returns segment descriptors
splitByBookmarks(srcBytes, titlePrefix, ignoreCase, level, includeFrontMatter) -> Array       // Split at bookmark boundaries; returns [segment, bytes] pairs (level 0=all depths, 1=top-level)

Cryptographic governance

setCryptoPolicy(spec)   // Install the process-wide crypto policy ("compat" | "strict" | "fips-strict"[;…]); fail-closed
cryptoPolicy() -> string  // The active crypto policy as its canonical grammar string
cryptoInventory() -> string[]  // Algorithm tokens exercised so far this process
cryptoCbom() -> string  // CycloneDX 1.6 Cryptographic Bill of Materials (JSON string)

OCR model provisioning

modelManifest() -> string   // JSON manifest of OCR detector/recognizer cache filenames and source URLs (host-side fetch)
prefetchAvailable() -> boolean  // Whether this build can download OCR models to a local cache (always false in WASM)

Signing (free functions)

signPdfBytes(pdfData, cert, reason?, location?) -> Uint8Array  // Sign raw PDF bytes with a WasmCertificate; returns the signed PDF
signPdfBytesPades(pdfData, cert, level, timestampToken?, revocation?, reason?, location?) -> Uint8Array  // Sign at a PAdES baseline level (BB/BT/BLt); pass a pre-fetched RFC 3161 token for BT/BLt
hasDocumentTimestamp(pdfData) -> boolean  // Whether the PDF carries a document-scoped /DocTimeStamp (PAdES-B-LTA)

WasmPdfDocument

The main class for opening, extracting from, editing, and saving PDFs.

import { WasmPdfDocument } from "pdf-oxide-wasm";

Constructor

`new WasmPdfDocument(data, password?)`

Loads a PDF document from raw bytes.

Parameter	Type	Description
`data`	`Uint8Array`	The PDF file contents
`password`	`string \| undefined`	Optional password for encrypted PDFs

Throws: Error if the PDF is invalid or cannot be parsed.

import { readFileSync } from "node:fs";
// Read the PDF from disk (Node.js) and open it
const bytes = new Uint8Array(readFileSync("document.pdf"));
const doc = new WasmPdfDocument(bytes);

Static constructors

WasmPdfDocument.openFromDocxBytes(data) -> WasmPdfDocument  // Convert DOCX bytes to a PDF document
WasmPdfDocument.openFromPptxBytes(data) -> WasmPdfDocument  // Convert PPTX bytes to a PDF document
WasmPdfDocument.openFromXlsxBytes(data) -> WasmPdfDocument  // Convert XLSX bytes to a PDF document

Read-only, core

`pageCount() -> number`

Returns the number of pages in the document.

`version() -> Uint8Array`

Returns the PDF version as [major, minor].

const [major, minor] = doc.version();
console.log(`PDF ${major}.${minor}`);

`authenticate(password) -> boolean`

Decrypts an encrypted PDF. Returns true if authentication succeeded.

Parameter	Type	Description
`password`	`string`	The password string

`hasStructureTree() -> boolean`

Checks whether the document is a Tagged PDF with a structure tree.

Signature inspection

signatureCount() -> number          // Number of digital signatures in the document
signatures() -> WasmSignature[]     // Parsed signatures (signer, reason, time, verify())
dss() -> Dss | null                 // Document Security Store (certs/CRLs/OCSP), or null

Text extraction

`extractText(pageIndex, region?) -> string`

Extracts plain text from a single page. Pass an optional region [x, y, w, h] to limit extraction.

Parameter	Type	Description
`pageIndex`	`number`	Zero-based page number
`region`	`number[] \| undefined`	Optional clip `[x, y, width, height]`

const text = doc.extractText(0);

`extractAllText() -> string`

Extracts plain text from all pages, separated by form feed characters.
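
Because the pages are joined with form feeds, the combined string can be split back into per-page text. The sketch below assumes the separator appears only between pages (a trailing form feed would produce an empty last entry). `splitPages` and `pagesContaining` are hypothetical helpers, and the sample string stands in for the return value of `extractAllText()`.

```javascript
// Split the string returned by extractAllText() into one entry per page.
// "\f" is the form feed character (U+000C).
function splitPages(allText) {
  return allText.split("\f");
}

// Hypothetical helper: zero-based indices of the pages whose text contains `term`.
function pagesContaining(allText, term) {
  return splitPages(allText)
    .map((text, pageIndex) => (text.includes(term) ? pageIndex : -1))
    .filter((pageIndex) => pageIndex !== -1);
}

// Stand-in for doc.extractAllText() on a three-page document.
const sampleAllText = "Invoice 42\fTerms and conditions\fInvoice 42, continued";
console.log(pagesContaining(sampleAllText, "Invoice")); // [ 0, 2 ]
```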

`extractStructured(pageIndex) -> string`

Extracts a structured JSON representation of the page (blocks, lines, styles).

`extractChars(pageIndex, region?) -> Array`

Extracts individual characters with precise positioning and font metadata.

Parameter	Type	Description
`pageIndex`	`number`	Zero-based page number
`region`	`number[] \| undefined`	Optional clip `[x, y, width, height]`

Returns: an array of objects with the following fields:

Field	Type	Description
`char`	`string`	The character
`bbox`	`{x, y, width, height}`	Bounding box
`fontName`	`string`	Font name
`fontSize`	`number`	Font size in points
`fontWeight`	`string`	Weight (Normal, Bold, etc.)
`isItalic`	`boolean`	Italic flag
`color`	`{r, g, b}`	RGB color (0.0–1.0)

const chars = doc.extractChars(0);
for (const c of chars) {
  console.log(`'${c.char}' at (${c.bbox.x}, ${c.bbox.y})`);
}
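
The per-character records can be aggregated with ordinary array code. The following sketch uses only the `char`, `fontName`, and `fontWeight` fields from the table above. `countCharsByFont` and `boldText` are hypothetical helpers, the exact spelling `"Bold"` is an assumption, and the sample array stands in for the result of `doc.extractChars(0)`.

```javascript
// Count how many characters use each font.
function countCharsByFont(chars) {
  const counts = new Map();
  for (const c of chars) {
    counts.set(c.fontName, (counts.get(c.fontName) ?? 0) + 1);
  }
  return counts;
}

// Concatenate the characters whose fontWeight is "Bold".
// Assumption: the weight string is spelled exactly "Bold".
function boldText(chars) {
  return chars
    .filter((c) => c.fontWeight === "Bold")
    .map((c) => c.char)
    .join("");
}

// Stand-in for doc.extractChars(0); only the fields used here are filled in.
const sampleChars = [
  { char: "H", fontName: "Helvetica-Bold", fontWeight: "Bold" },
  { char: "i", fontName: "Helvetica-Bold", fontWeight: "Bold" },
  { char: "!", fontName: "Helvetica", fontWeight: "Normal" },
];
console.log(boldText(sampleChars)); // Hi
```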

`extractPageText(pageIndex, readingOrder?) -> object`

Gets the page's spans, characters, and dimensions in a single extraction pass. More efficient than calling extractSpans() + extractChars() separately. Pass "column_aware" for multi-column PDFs.

Parameter	Type	Description
`pageIndex`	`number`	Zero-based page number
`readingOrder`	`string \| undefined`	`"column_aware"` or `"top_to_bottom"` (default)

Returns: an object with the following fields:

Field	Type	Description
`spans`	`Array`	Array of span objects
`chars`	`Array`	Array of character objects
`pageWidth`	`number`	Page width in PDF points
`pageHeight`	`number`	Page height in PDF points
`text`	`string`	Full text content

const result = doc.extractPageText(0);
console.log(`Page: ${result.pageWidth}x${result.pageHeight} pt`);
for (const span of result.spans) {
  console.log(`'${span.text}' font=${span.fontName} size=${span.fontSize}`);
}

`extractSpans(pageIndex, region?, readingOrder?) -> Array`

Extracts styled text spans with font metadata. Pass "column_aware" as readingOrder for multi-column PDFs.

Parameter	Type	Description
`pageIndex`	`number`	Zero-based page number
`region`	`number[] \| undefined`	Optional clip `[x, y, width, height]`
`readingOrder`	`string \| undefined`	`"column_aware"` or `"top_to_bottom"` (default)

Returns: an array of objects with the following fields:

Field	Type	Description
`text`	`string`	The text content
`bbox`	`{x, y, width, height}`	Bounding box
`fontName`	`string`	Font name
`fontSize`	`number`	Font size in points
`fontWeight`	`string`	Weight (Normal, Bold, etc.)
`isItalic`	`boolean`	Italic flag
`isMonospace`	`boolean`	Whether the font is fixed-width
`charWidths`	`number[]`	Per-glyph advance widths
`color`	`{r, g, b}`	RGB color (0.0–1.0)

const spans = doc.extractSpans(0);
for (const span of spans) {
  console.log(`"${span.text}" size=${span.fontSize}`);
}
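
Span font sizes lend themselves to simple layout heuristics. As one illustration, the sketch below flags spans whose `fontSize` is noticeably larger than the page's median size. This is a generic heuristic chosen for the example, not how the library's own heading detection works; `median`, `likelyHeadings`, and the 1.2 ratio are assumptions.

```javascript
// Median of a non-empty list of numbers.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Hypothetical heuristic: spans at least `ratio` times the median font size.
function likelyHeadings(spans, ratio = 1.2) {
  if (spans.length === 0) return [];
  const bodySize = median(spans.map((s) => s.fontSize));
  return spans.filter((s) => s.fontSize >= bodySize * ratio);
}

// Stand-in for doc.extractSpans(0); only text and fontSize are filled in.
const sampleSpans = [
  { text: "Introduction", fontSize: 18 },
  { text: "PDF files store text", fontSize: 10 },
  { text: "as positioned glyphs.", fontSize: 10 },
  { text: "Spans group them.", fontSize: 10 },
];
console.log(likelyHeadings(sampleSpans).map((s) => s.text)); // [ 'Introduction' ]
```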

Words, lines, tables

extractWords(pageIndex, region?) -> Array       // Word-level boxes with text + font metadata
extractTextLines(pageIndex, region?) -> Array   // Line-level boxes, each with its words
extractTables(pageIndex, region?) -> Array      // Detected tables with rows/cells (text + bboxes)

Header / footer artifacts

Detects and removes or erases recurring headers, footers, and page-furniture artifacts.

removeHeaders(threshold) -> number     // Remove detected headers across the document; returns count removed
removeFooters(threshold) -> number     // Remove detected footers; returns count removed
removeArtifacts(threshold) -> number   // Remove detected page artifacts; returns count removed
eraseHeader(pageIndex)                 // Queue an erase of the header region on a page
editHeader(pageIndex)                  // Mark the header region for editing on a page
eraseFooter(pageIndex)                 // Queue an erase of the footer region on a page
editFooter(pageIndex)                  // Mark the footer region for editing on a page
eraseArtifacts(pageIndex)              // Queue an erase of detected artifacts on a page

Region extraction

`within(pageIndex, region) -> WasmPdfPageRegion`

Restricts subsequent extraction to a rectangular region of a page. region is [x, y, width, height]. See WasmPdfPageRegion.

const region = doc.within(0, [50, 600, 400, 150]);
const text = region.extractText();

Format conversion

`toMarkdown(pageIndex, detectHeadings?, includeImages?, includeFormFields?) -> string`

Converts a single page to Markdown.

Parameter	Type	Default	Description
`pageIndex`	`number`	–	Zero-based page number
`detectHeadings`	`boolean`	`true`	Detect headings from font size
`includeImages`	`boolean`	`true`	Include images
`includeFormFields`	`boolean`	`true`	Include form field values

`toMarkdownAll(detectHeadings?, includeImages?, includeFormFields?) -> string`

Converts all pages to Markdown.

`toHtml(pageIndex, preserveLayout?, detectHeadings?, includeFormFields?) -> string`

Converts a single page to HTML.

Parameter	Type	Default	Description
`pageIndex`	`number`	–	Zero-based page number
`preserveLayout`	`boolean`	`false`	Preserve the visual layout
`detectHeadings`	`boolean`	`true`	Detect headings
`includeFormFields`	`boolean`	`true`	Include form field values

`toHtmlAll(preserveLayout?, detectHeadings?, includeFormFields?) -> string`

Converts all pages to HTML.

`toPlainText(pageIndex) -> string`

Converts a single page to plain text.

`toPlainTextAll() -> string`

Converts all pages to plain text.

Office round-trip

toDocxBytes() -> Uint8Array   // Export the document as a DOCX file
toPptxBytes() -> Uint8Array   // Export the document as a PPTX file
toXlsxBytes() -> Uint8Array   // Export the document as an XLSX file

Search

`search(pattern, caseInsensitive?, literal?, wholeWord?, maxResults?) -> Array`

Searches for text across all pages.

Parameter	Type	Default	Description
`pattern`	`string`	–	Search pattern (string or regex)
`caseInsensitive`	`boolean`	`false`	Case-insensitive search
`literal`	`boolean`	`false`	Treat the pattern as a literal string
`wholeWord`	`boolean`	`false`	Match whole words only
`maxResults`	`number`	`0`	Maximum number of results (0 = unlimited)

Returns: an array of objects with the following fields:

Field	Type	Description
`page`	`number`	Page number
`text`	`string`	Matched text
`bbox`	`object`	Bounding box
`startIndex`	`number`	Start index in the page text
`endIndex`	`number`	End index in the page text

`searchPage(pageIndex, pattern, caseInsensitive?, literal?, wholeWord?, maxResults?) -> Array`

Searches for text within a single page.
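
Search results arrive as a flat array, so grouping them by page is a common follow-up step. This sketch relies only on the `page` and `text` fields listed for `search()`; `groupMatchesByPage` is a hypothetical helper and the sample array stands in for real search output.

```javascript
// Group matched text by the page it was found on.
function groupMatchesByPage(matches) {
  const byPage = new Map();
  for (const m of matches) {
    if (!byPage.has(m.page)) {
      byPage.set(m.page, []);
    }
    byPage.get(m.page).push(m.text);
  }
  return byPage;
}

// Stand-in for doc.search("total", true); only page and text are filled in.
const sampleMatches = [
  { page: 0, text: "Total" },
  { page: 2, text: "total" },
  { page: 2, text: "TOTAL" },
];
const grouped = groupMatchesByPage(sampleMatches);
console.log(grouped.get(2).length); // 2
```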

Image information

`extractImages(pageIndex) -> Array`

Gets image metadata for a page.

Field	Type	Description
`width`	`number`	Image width in pixels
`height`	`number`	Image height in pixels
`colorSpace`	`string`	Color space (e.g. `DeviceRGB`)
`bitsPerComponent`	`number`	Bits per color channel
`bbox`	`object`	Position on the page

`extractImageBytes(pageIndex) -> Array`

Extracts the raw image bytes from a page. Returns an array of objects:

Field	Type	Description
`width`	`number`	Image width in pixels
`height`	`number`	Image height in pixels
`data`	`Uint8Array`	Raw image bytes
`format`	`string`	Image format

`pageImages(pageIndex) -> Array`

Gets image names and bounds for positioning operations.

Field	Type	Description
`name`	`string`	XObject name
`bounds`	`number[]`	`[x, y, width, height]`
`matrix`	`number[]`	Transformation matrix `[a, b, c, d, e, f]`
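
In PDF, an image is painted into the unit square and placed on the page by a six-number matrix, so a bounding box can be derived from the matrix alone. The sketch below assumes the `matrix` field uses the standard PDF order `[a, b, c, d, e, f]`, where a point maps to `(a*x + c*y + e, b*x + d*y + f)`. `imageBoundsFromMatrix` is a hypothetical helper, and whether the library computes `bounds` this way is not confirmed here.

```javascript
// Axis-aligned bounding box [x, y, width, height] of the unit square
// after applying a PDF transformation matrix [a, b, c, d, e, f].
function imageBoundsFromMatrix([a, b, c, d, e, f]) {
  const corners = [[0, 0], [1, 0], [0, 1], [1, 1]].map(([x, y]) => [
    a * x + c * y + e,
    b * x + d * y + f,
  ]);
  const xs = corners.map((p) => p[0]);
  const ys = corners.map((p) => p[1]);
  const minX = Math.min(...xs);
  const minY = Math.min(...ys);
  return [minX, minY, Math.max(...xs) - minX, Math.max(...ys) - minY];
}

// A 200 x 100 pt image whose lower-left corner sits at (50, 600).
console.log(imageBoundsFromMatrix([200, 0, 0, 100, 50, 600])); // [ 50, 600, 200, 100 ]
```

A rotated image has non-zero `b` and `c`, which is why all four corners are transformed rather than reading the width and height directly from `a` and `d`.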

Vector content

extractPaths(pageIndex, region?) -> Array   // Vector paths (lines, curves, shapes) on a page
extractRects(pageIndex, region?) -> Array   // Axis-aligned rectangles detected from path segments
extractLines(pageIndex, region?) -> Array   // Straight line segments detected from path data

Document structure

`getOutline() -> Array | null`

Gets the document's bookmarks / table of contents. Returns null if there is no outline.

`getAnnotations(pageIndex) -> Array`

Gets annotation metadata (type, rect, contents, etc.) for a page.

`pageLabels() -> Array`

Gets the page label ranges. Returns an array of objects:

Field	Type	Description
`startPage`	`number`	First page of this range
`style`	`string`	Numbering style
`prefix`	`string`	Label prefix
`startValue`	`number`	Starting number

`xmpMetadata() -> object | null`

Gets the XMP metadata. Returns null if none is present. The object's fields include:

Field	Type	Description
`dcTitle`	`string \| null`	Document title
`dcCreator`	`string[] \| null`	List of creators
`dcDescription`	`string \| null`	Description
`xmpCreatorTool`	`string \| null`	Creator tool
`xmpCreateDate`	`string \| null`	Creation date
`xmpModifyDate`	`string \| null`	Modification date
`pdfProducer`	`string \| null`	PDF producer

Form fields

`getFormFields() -> Array`

Gets all form fields with name, type, value, and flags.

Field	Type	Description
`name`	`string`	Field name
`fieldType`	`string`	Field type (text, checkbox, etc.)
`value`	`string`	Current value
`flags`	`number`	Field flags

const fields = doc.getFormFields();
for (const f of fields) {
  console.log(`${f.name} (${f.fieldType}) = ${f.value}`);
}
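
When only the values matter, the field array can be collapsed into a plain object keyed by field name. This is a small sketch using the `name` and `value` fields from the table above; `formValuesByName` is a hypothetical helper and the sample array stands in for the result of `doc.getFormFields()`.

```javascript
// Collapse the field list into { fieldName: value }.
// If two fields share a name, the later one wins.
function formValuesByName(fields) {
  return Object.fromEntries(fields.map((f) => [f.name, f.value]));
}

// Stand-in for doc.getFormFields().
const sampleFields = [
  { name: "firstName", fieldType: "text", value: "Ada", flags: 0 },
  { name: "lastName", fieldType: "text", value: "Lovelace", flags: 0 },
];
console.log(formValuesByName(sampleFields).lastName); // Lovelace
```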

`hasXfa() -> boolean`

Checks whether the document contains XFA forms.

`getFormFieldValue(name) -> any`

Gets a form field's value by name. Returns a string, boolean, or null depending on the field type.

`setFormFieldValue(name, value) -> void`

Sets a form field's value by name.

Parameter	Type	Description
`name`	`string`	Field name
`value`	`string \| boolean`	New field value

`exportFormData(format?) -> Uint8Array`

Exports the form data as FDF (default) or XFDF.

Parameter	Type	Default	Description
`format`	`string`	`"fdf"`	Export format: `"fdf"` or `"xfdf"`

Form flattening

flattenForms()                    // Flatten all form fields into page content
flattenFormsOnPage(pageIndex)     // Flatten forms on a specific page
flattenWarnings() -> string[]     // Warnings produced by the last flatten operation

Editing

Metadata

Method	Parameters	Description
`setTitle(title)`	`string`	Set the document title
`setAuthor(author)`	`string`	Set the document author
`setSubject(subject)`	`string`	Set the document subject
`setKeywords(keywords)`	`string`	Set the document keywords

Page rotation

Method	Parameters	Description
`pageRotation(pageIndex)`	`number`	Get the current rotation (0, 90, 180, 270)
`setPageRotation(pageIndex, degrees)`	`number, number`	Set an absolute rotation
`rotatePage(pageIndex, degrees)`	`number, number`	Add to the current rotation
`rotateAllPages(degrees)`	`number`	Rotate all pages
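
The difference between setting and adding a rotation comes down to modular arithmetic on multiples of 90. The sketch below shows one plausible way a relative rotation combines with the current one so the result stays in 0, 90, 180, or 270. `addRotation` is a hypothetical helper, and the library's exact normalization (including how it treats negative or non-multiple-of-90 values) is an assumption.

```javascript
// Combine a current rotation with a relative one, normalized to [0, 360).
// Assumption: only multiples of 90 degrees are meaningful.
function addRotation(current, delta) {
  if (delta % 90 !== 0) {
    throw new RangeError("rotation must be a multiple of 90 degrees");
  }
  // The double modulo keeps the result non-negative for negative deltas.
  return (((current + delta) % 360) + 360) % 360;
}

console.log(addRotation(270, 90)); // 0
console.log(addRotation(0, -90));  // 270
```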

Page dimensions

Method	Parameters	Description
`pageMediaBox(pageIndex)`	`number`	Get the MediaBox `[llx, lly, urx, ury]`
`setPageMediaBox(pageIndex, llx, lly, urx, ury)`	`number, ...`	Set the MediaBox
`pageCropBox(pageIndex)`	`number`	Get the CropBox (may be null)
`setPageCropBox(pageIndex, llx, lly, urx, ury)`	`number, ...`	Set the CropBox
`cropMargins(left, right, top, bottom)`	`number, ...`	Crop all page margins
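
Boxes here are corner-based (`[llx, lly, urx, ury]`), while extraction regions are size-based (`[x, y, width, height]`), so converting between the two is a frequent chore. The helpers below are hypothetical and show the arithmetic only. They assume a region's `x, y` is the lower-left corner in the same coordinate space as the box, and `insetBox` merely illustrates what trimming margins means; it is not confirmed to match `cropMargins` exactly.

```javascript
// Width and height of a corner-based box.
function boxSize([llx, lly, urx, ury]) {
  return { width: urx - llx, height: ury - lly };
}

// Convert [llx, lly, urx, ury] to [x, y, width, height].
function boxToRegion([llx, lly, urx, ury]) {
  return [llx, lly, urx - llx, ury - lly];
}

// Shrink a box by the given margins (all values in points).
function insetBox([llx, lly, urx, ury], left, right, top, bottom) {
  return [llx + left, lly + bottom, urx - right, ury - top];
}

// US Letter is 612 x 792 pt.
const letter = [0, 0, 612, 792];
console.log(insetBox(letter, 36, 36, 72, 72)); // [ 36, 72, 576, 720 ]
```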

Page operations

deletePage(index)                 // Delete a page by index
movePage(fromIndex, toIndex)      // Move a page to a new position
extractPages(pages) -> Uint8Array // Build a new PDF from the given page indices

Erase / Whiteout

Method	Parameters	Description
`eraseRegion(pageIndex, llx, lly, urx, ury)`	`number, ...`	Erase a region
`eraseRegions(pageIndex, rects)`	`number, Float32Array`	Erase multiple regions
`clearEraseRegions(pageIndex)`	`number`	Clear pending erases

Annotations and redaction

Method	Parameters	Description
`flattenPageAnnotations(pageIndex)`	`number`	Flatten the page's annotations
`flattenAllAnnotations()`	–	Flatten all annotations
`applyPageRedactions(pageIndex)`	`number`	Apply the page's redactions
`applyAllRedactions()`	–	Apply all redactions
`addRedaction(page, x0, y0, x1, y1, fill?)`	`number, ...`	Queue a redaction box (optional `[r,g,b]` fill)
`redactionCount(page)`	`number`	Count the queued redactions on a page
`applyRedactionsDestructive(scrubMetadata?)`	`boolean`	Destructively remove content; returns a redaction report
`sanitizeDocument(scrubMetadata?, removeJavascript?, removeEmbeddedFiles?)`	`boolean, ...`	Remove metadata, scripts, and embedded files; returns a report

Merge and embed

`mergeFrom(data) -> number`

Merges pages from another PDF. Returns the number of pages merged.

Parameter	Type	Description
`data`	`Uint8Array`	The bytes of the source PDF file

`embedFile(name, data) -> void`

Attaches a file to the PDF.

Parameter	Type	Description
`name`	`string`	Attachment file name
`data`	`Uint8Array`	File contents

Image manipulation

Method	Parameters	Description
`repositionImage(pageIndex, name, x, y)`	`number, string, number, number`	Move an image
`resizeImage(pageIndex, name, w, h)`	`number, string, number, number`	Resize an image
`setImageBounds(pageIndex, name, x, y, w, h)`	`number, string, ...`	Set the image bounds

Classification and auto-extraction

classifyDocument() -> string                 // Classify the whole document (e.g. born-digital vs scanned)
classifyPage(pageIndex) -> string            // Classify a single page
extractTextAuto(pageIndex) -> string         // Auto-pick native vs OCR extraction for a page
extractPageAuto(pageIndex, optionsJson?) -> string  // Auto-extraction returning a structured JSON page

Validation

validatePdfA(level) -> object        // Validate against a PDF/A conformance level (e.g. "2b")
convertToPdfA(level) -> object       // Convert toward a PDF/A level; returns a report
validatePdfUa(level?) -> object      // Validate against PDF/UA accessibility
validatePdfX(level?) -> object       // Validate against a PDF/X print level

Rendering

Requires the rendering feature.

Method	Parameters	Returns	Description
`renderPage(pageIndex, dpi?)`	`number, number`	`Uint8Array`	Renders a page to PNG bytes (150 dpi by default)
`flattenToImages(dpi?)`	`number`	`Uint8Array`	Flattens all pages into an image-based PDF

OCR

Requires the wasm-ocr build. See WasmOcrEngine.

`extractTextOcr(pageIndex, engine) -> string`

Runs the in-WASM OCR pipeline on a page using a WasmOcrEngine built on the host. Returns the recognized text in reading order.

const text = doc.extractTextOcr(0, engine);

Saving

`save() -> Uint8Array`

Saves the edited PDF as bytes. saveToBytes() is available as an alias.

`saveWithOptions(compress?, garbageCollect?, linearize?) -> Uint8Array`

Saves with explicit serialization options.

Parameter	Type	Default	Description
`compress`	`boolean`	`true`	Compress object streams
`garbageCollect`	`boolean`	`true`	Discard unreferenced objects
`linearize`	`boolean`	`false`	Produce a linearized (“fast web view”) PDF

`saveEncryptedToBytes(password, ownerPassword?, allowPrint?, allowCopy?, allowModify?, allowAnnotate?) -> Uint8Array`

Saves with AES-256 encryption.

Parameter	Type	Default	Description
`password`	`string`	–	User password
`ownerPassword`	`string`	user password	Owner password
`allowPrint`	`boolean`	`true`	Allow printing
`allowCopy`	`boolean`	`true`	Allow copying
`allowModify`	`boolean`	`true`	Allow modification
`allowAnnotate`	`boolean`	`true`	Allow annotations

`free()`

Frees the WASM memory. Always call this method when you are done with the document.
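
Since the WASM memory is not reclaimed automatically, it helps to make the cleanup hard to forget. One common pattern is a wrapper that releases the document in a `finally` block, so `free()` runs even when the callback throws. `withDocument` is a hypothetical helper, not part of the package, and the mock object below only stands in for a real `WasmPdfDocument`.

```javascript
// Run `fn` with the document and always release it afterwards.
// Synchronous only: an async callback would need an awaiting variant.
function withDocument(doc, fn) {
  try {
    return fn(doc);
  } finally {
    doc.free();
  }
}

// Mock with the two methods used here, standing in for a real document.
const mockDoc = {
  freed: false,
  pageCount() { return 3; },
  free() { this.freed = true; },
};
const pageTotal = withDocument(mockDoc, (d) => d.pageCount());
console.log(pageTotal, mockDoc.freed); // 3 true
```

With the real class, the call would look like `withDocument(new WasmPdfDocument(bytes), (doc) => doc.extractAllText())`.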

WasmPdfPageRegion

A region handle returned by WasmPdfDocument.within(pageIndex, region). Extraction methods are restricted to the rectangle.

extractText() -> string       // Plain text within the region
extractChars() -> Array       // Characters within the region
extractWords() -> Array       // Words within the region
extractTextLines() -> Array   // Text lines within the region
extractTables() -> Array      // Tables within the region
extractImages() -> Array      // Images within the region
extractPaths() -> Array       // Vector paths within the region
extractRects() -> Array       // Rectangles within the region
extractLines() -> Array       // Line segments within the region
extractTextOcr(engine?) -> string  // OCR text within the region (wasm-ocr build)

WasmPdf

Factory class for creating new PDFs.

import { WasmPdf } from "pdf-oxide-wasm";

Static methods

WasmPdf.fromMarkdown(content, title?, author?) -> WasmPdf  // Create a PDF from Markdown text
WasmPdf.fromHtml(content, title?, author?) -> WasmPdf      // Create a PDF from HTML
WasmPdf.fromText(content, title?, author?) -> WasmPdf      // Create a PDF from plain text
WasmPdf.fromBytes(data) -> WasmPdf                         // Open an existing PDF from bytes for modification
WasmPdf.fromImageBytes(data) -> WasmPdf                    // Single-page PDF from one image (JPEG/PNG)
WasmPdf.fromMultipleImageBytes(imagesArray) -> WasmPdf     // Multi-page PDF, one page per image
WasmPdf.merge(pdfs) -> WasmPdf                             // Merge an array of PDF byte buffers into one
WasmPdf.fromHtmlCss(html, css, fontBytes) -> WasmPdf       // HTML + CSS with a single embedded font
WasmPdf.fromHtmlCssWithFonts(html, css, fonts) -> WasmPdf  // HTML + CSS with multiple [name, bytes] fonts

Parameter	Type	Description
`content`	`string`	Source content (Markdown / HTML / text)
`title`	`string \| undefined`	Document title
`author`	`string \| undefined`	Document author
`data`	`Uint8Array`	Bytes of a PDF or image file
`imagesArray`	`Uint8Array[]`	Array of image file bytes
`pdfs`	`Uint8Array[]`	Array of PDF file bytes to merge

Instance methods

`toBytes() -> Uint8Array`

Returns the PDF as bytes.

`size -> number`

Size of the PDF in bytes (read-only getter).

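import { writeFileSync } from "node:fs";
// Build a PDF from Markdown and write it to disk (Node.js)
const pdf = WasmPdf.fromMarkdown("# Hello World\n\nThis is a PDF.");
console.log(`PDF size: ${pdf.size} bytes`);
writeFileSync("output.pdf", pdf.toBytes());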

WasmDocumentBuilder

Low-level fluent builder for page layout, for composing PDFs page by page. Pair it with WasmFluentPageBuilder.

import { WasmDocumentBuilder } from "pdf-oxide-wasm";
const builder = new WasmDocumentBuilder();

Document configuration

new WasmDocumentBuilder()          // Create an empty builder
title(title)                       // Set document title
author(author)                     // Set document author
subject(subject)                   // Set document subject
keywords(keywords)                 // Set document keywords
creator(creator)                   // Set the creator tool name
onOpen(script)                     // Set a document-level open JavaScript action
taggedPdfUa1()                     // Enable Tagged PDF / PDF/UA-1 output
language(lang)                     // Set the document language (e.g. "en-US")
roleMap(custom, standard)          // Map a custom structure tag to a standard role
registerEmbeddedFont(name, font)   // Register a WasmEmbeddedFont under a name

Page creation and output

a4Page() -> WasmFluentPageBuilder         // Start a new A4 page
letterPage() -> WasmFluentPageBuilder     // Start a new US Letter page
page(width, height) -> WasmFluentPageBuilder  // Start a custom-size page (points)
commitPage(page)                          // Commit a completed page builder
build() -> Uint8Array                     // Finish and return the PDF bytes
toBytesEncrypted(userPassword, ownerPassword?) -> Uint8Array  // Finish with AES-256 encryption

WasmFluentPageBuilder

Per-page builder returned by a4Page() / letterPage() / page(). Queue operations, then commit them with done(builder) (or builder.commitPage(page)).

Text and flow

font(name, size)                 // Set the current font and size
at(x, y)                         // Move the cursor to an absolute position
text(text)                       // Draw text at the cursor
heading(level, text)             // Draw a heading (level 1–6)
paragraph(text)                  // Draw a wrapped paragraph
space(points)                    // Advance the cursor vertically
horizontalRule()                 // Draw a horizontal rule
newline()                        // Advance to the next line
columns(columnCount, gapPt, text)  // Lay text out in N balanced columns
footnote(refMark, noteText)      // Add a footnote marker + bottom-of-page note

Inline runs

inline(text)                     // Append an inline text run
inlineBold(text)                 // Append a bold inline run
inlineItalic(text)               // Append an italic inline run
inlineColor(r, g, b, text)       // Append a colored inline run (RGB 0.0–1.0)

Link and form actions

linkUrl(url)                     // Wrap the last element in a URL link
linkPage(page)                   // Link to another page index
linkNamed(destination)           // Link to a named destination
linkJavascript(script)           // Attach a JavaScript link action
onOpen(script)                   // Page open action
onClose(script)                  // Page close action
fieldKeystroke(script)           // Keystroke JavaScript for the last field
fieldFormat(script)              // Format JavaScript for the last field
fieldValidate(script)            // Validate JavaScript for the last field
fieldCalculate(script)           // Calculate JavaScript for the last field

Markup annotations

highlight(r, g, b)               // Highlight the last text run (RGB 0.0–1.0)
underline(r, g, b)               // Underline the last text run
strikeout(r, g, b)               // Strike out the last text run
squiggly(r, g, b)                // Squiggly-underline the last text run
stickyNote(text)                 // Add a sticky note at the cursor
stickyNoteAt(x, y, text)         // Add a sticky note at an absolute position
stamp(name)                      // Add a rubber-stamp annotation (e.g. "Approved")
freeText(x, y, w, h, text)       // Add a free-text annotation box
watermark(text)                  // Add a text watermark
watermarkConfidential()          // Add a "CONFIDENTIAL" watermark
watermarkDraft()                 // Add a "DRAFT" watermark

AcroForm widgets

textField(name, x, y, w, h, defaultValue?)            // Add a text field
checkbox(name, x, y, w, h, checked)                   // Add a checkbox
comboBox(name, x, y, w, h, options, selected?)        // Add a dropdown combo box
radioGroup(name, values, xs, ys, ws, hs, selected?)   // Add a radio-button group (parallel arrays)
pushButton(name, x, y, w, h, caption)                 // Add a clickable push button
signatureField(name, x, y, w, h)                      // Add an unsigned signature placeholder

Barcodes and images

barcode1d(barcodeType, data, x, y, w, h)   // Draw a 1D barcode (type 0–7)
barcodeQr(data, x, y, size)                // Draw a QR code
imageWithAlt(bytes, x, y, w, h, altText)   // Embed an image with accessibility alt text
imageArtifact(bytes, x, y, w, h)           // Embed a decorative image as an /Artifact

Graphics primitives

rect(x, y, w, h)                                  // Stroked 1pt rectangle outline
filledRect(x, y, w, h, r, g, b)                   // Filled rectangle (RGB 0.0–1.0)
line(x1, y1, x2, y2)                              // 1pt black line
strokeRect(x, y, w, h, width, r, g, b)            // Stroked rectangle, explicit width + color
strokeRectDashed(x, y, w, h, width, r, g, b, dash, phase)  // Dashed rectangle border
strokeLine(x1, y1, x2, y2, width, r, g, b)        // Line with explicit width + color
strokeLineDashed(x1, y1, x2, y2, width, r, g, b, dash, phase)  // Dashed line
textInRect(x, y, w, h, text, align)               // Lay text inside a rectangle (align 0/1/2)

Layout helpers and finishing

measure(text) -> number                  // Rendered width of text in the current font (points)
remainingSpace() -> number               // Vertical space left on the page (points)
newPageSameSize()                        // Start a new page with the same dimensions
table(spec)                              // Draw a buffered table from a spec object
streamingTable(spec) -> WasmStreamingTable  // Open a streaming table for large datasets
done(builder)                            // Commit this page's queued ops to the document builder

A table(spec) spec object uses { columns: [{ header, width, align }], rows: [[...]], hasHeader }. A streamingTable(spec) spec adds { repeatHeader, mode, sampleRows, minColWidthPt, maxColWidthPt, maxRowspan, batchSize }.
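
As a rough illustration of the spec shape, the sketch below builds a plain object with the documented fields. The helper name makeTableSpec and the string alignment keys are hypothetical, and it assumes that column widths are given in points and that cells are passed as strings; neither is confirmed by this reference.

```javascript
// Align discriminants as listed in the Enums section (Left = 0, Center = 1, Right = 2).
const ALIGN = { left: 0, center: 1, right: 2 };

// Hypothetical helper: build a plain spec object for table(spec).
// Assumption: widths are in points and every row has one cell per column.
function makeTableSpec(columns, rows) {
  const spec = {
    columns: columns.map(({ header, width, align = "left" }) => ({
      header,
      width,
      align: ALIGN[align],
    })),
    // Assumption: cells are strings, so other values are stringified here.
    rows: rows.map((row) => row.map(String)),
    hasHeader: true,
  };
  for (const row of spec.rows) {
    if (row.length !== spec.columns.length) {
      throw new Error(`Row has ${row.length} cells, expected ${spec.columns.length}`);
    }
  }
  return spec;
}

const spec = makeTableSpec(
  [
    { header: "Item", width: 200 },
    { header: "Qty", width: 60, align: "right" },
  ],
  [
    ["Widget", 3],
    ["Gadget", 12],
  ],
);
```

The resulting object would then be passed to table(spec) on a page builder. Validating the row length up front keeps a malformed row from surfacing as an error inside the WASM call.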

WasmStreamingTable

Row-streaming table handle returned by WasmFluentPageBuilder.streamingTable(spec). Push rows incrementally, then call finish().

columnCount() -> number       // Number of columns
pendingRowCount() -> number   // Rows in the current un-flushed batch
batchCount() -> number        // Number of completed batches
pushRow(cells)                // Push one row (array of cell strings)
pushRowSpan(cells)            // Push a row whose cells may carry rowspans
flush()                       // Flush the current batch
finish()                      // Finalize the table and replay it into the page
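
The push-then-finish flow might look like the sketch below. The helper streamRows and its flushEvery parameter are hypothetical. It flushes manually for illustration only; the batchSize option in the spec may already handle batching, in which case the explicit flush() calls are unnecessary.

```javascript
// Hypothetical helper: push rows from any iterable into a streaming-table
// handle, flushing every `flushEvery` rows, then finalize the table.
// `table` only needs pushRow / flush / pendingRowCount / finish as listed above.
function streamRows(table, rows, flushEvery = 500) {
  let pushed = 0;
  for (const row of rows) {
    table.pushRow(row.map(String)); // pushRow takes an array of cell strings
    pushed += 1;
    if (pushed % flushEvery === 0) table.flush();
  }
  // Flush whatever is left in the current batch before finalizing.
  if (table.pendingRowCount() > 0) table.flush();
  table.finish();
  return pushed;
}
```

Because rows can come from a generator, the full dataset never has to be held in memory on the JavaScript side.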

WasmEmbeddedFont

A font registered for embedding through WasmDocumentBuilder.registerEmbeddedFont.

WasmEmbeddedFont.fromBytes(data, name?) -> WasmEmbeddedFont  // Load a TTF/OTF font from bytes
font.name -> string                                          // The font's resolved name (getter)

Page Templates

Reusable header/footer furniture applied across multiple pages.

WasmArtifactStyle

new WasmArtifactStyle()        // Default style
font(name, size) -> this       // Set font family and size
bold() -> this                 // Make the text bold
color(r, g, b) -> this         // Set the text color (RGB 0.0–1.0)

WasmArtifact

new WasmArtifact()                       // Empty artifact
WasmArtifact.left(text) -> WasmArtifact   // Left-aligned artifact text
WasmArtifact.center(text) -> WasmArtifact // Center-aligned artifact text
WasmArtifact.right(text) -> WasmArtifact  // Right-aligned artifact text
withStyle(style) -> this                  // Apply a WasmArtifactStyle
withOffset(offset) -> this                // Set the vertical offset from the edge

WasmHeader / WasmFooter

new WasmHeader()                  // Empty header (WasmFooter is identical)
WasmHeader.left(text) -> WasmHeader     // Left-aligned header text
WasmHeader.center(text) -> WasmHeader   // Center-aligned header text
WasmHeader.right(text) -> WasmHeader    // Right-aligned header text

WasmPageTemplate

new WasmPageTemplate()         // Empty template
header(header) -> this         // Set the page header artifact
footer(footer) -> this         // Set the page footer artifact
skipFirstPage() -> this        // Omit header/footer on the first page

Digital Signatures

Requires the signatures feature.

WasmCertificate

WasmCertificate.load(data) -> WasmCertificate                  // Load a DER certificate + key bundle
WasmCertificate.loadPem(certPem, keyPem) -> WasmCertificate    // Load from PEM cert + key strings
WasmCertificate.loadPkcs12(data, password) -> WasmCertificate  // Load from a PKCS#12 (.p12/.pfx) blob
cert.subject -> string         // Subject distinguished name (getter)
cert.issuer -> string          // Issuer distinguished name (getter)
cert.serial -> string          // Serial number (getter)
cert.validity -> bigint[]      // [notBefore, notAfter] as unix seconds (getter)
cert.isValid -> boolean        // Whether the certificate is currently valid (getter)
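
Since cert.validity is a pair of bigint unix timestamps, a small conversion step is usually needed before display. The following is a minimal sketch with a hypothetical helper name, describeValidity; it works on any two-element bigint sequence and does not call the library itself.

```javascript
// Hypothetical helper: convert [notBefore, notAfter] (bigint unix seconds)
// into Date objects and report how many whole days remain before expiry.
function describeValidity(validity, now = new Date()) {
  // Array.from handles both plain arrays and BigInt64Array inputs.
  const [notBefore, notAfter] = Array.from(
    validity,
    (secs) => new Date(Number(secs) * 1000),
  );
  const msPerDay = 24 * 60 * 60 * 1000;
  return {
    notBefore,
    notAfter,
    daysRemaining: Math.floor((notAfter.getTime() - now.getTime()) / msPerDay),
    withinWindow: now >= notBefore && now <= notAfter,
  };
}
```

With a loaded certificate, this would be called as describeValidity(cert.validity). The withinWindow flag only checks dates, so it is not a substitute for cert.isValid.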

WasmSignature

Returned by WasmPdfDocument.signatures().

sig.signerName -> string | null          // Signer common name (getter)
sig.reason -> string | null              // Signing reason (getter)
sig.location -> string | null            // Signing location (getter)
sig.contactInfo -> string | null         // Signer contact info (getter)
sig.signingTime -> bigint | null         // Signing time as unix seconds (getter)
sig.coversWholeDocument -> boolean       // Whether the signature covers the entire file (getter)
sig.padesLevel -> PadesLevel             // PAdES baseline level of the signature (getter)
sig.verify() -> boolean                  // Verify the signature cryptographically
sig.verifyDetached(pdfData) -> boolean   // Verify including a messageDigest check against the bytes
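
A verification report over the array returned by signatures() could be assembled along these lines. The helper summarizeSignatures and the label table are hypothetical, and the sketch assumes padesLevel is exposed as the numeric discriminant listed under PadesLevel.

```javascript
// Labels indexed by the PadesLevel discriminants (BB = 0 ... BLta = 3).
const PADES_NAMES = ["B-B", "B-T", "B-LT", "B-LTA"];

// Hypothetical helper: turn signature objects into plain report rows.
// Assumption: sig.padesLevel is a number matching the PadesLevel enum.
function summarizeSignatures(signatures) {
  return signatures.map((sig) => {
    let valid = false;
    let error = null;
    try {
      valid = sig.verify();
    } catch (e) {
      // Failing methods throw Error objects; keep the message for the report.
      error = e instanceof Error ? e.message : String(e);
    }
    return {
      signer: sig.signerName ?? "(unknown)",
      signedAt:
        sig.signingTime == null
          ? null
          : new Date(Number(sig.signingTime) * 1000).toISOString(),
      level: PADES_NAMES[sig.padesLevel] ?? "unknown",
      coversWholeDocument: sig.coversWholeDocument,
      valid,
      error,
    };
  });
}
```

Typical use would be summarizeSignatures(doc.signatures()), with rows where coversWholeDocument is false flagged for closer inspection.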

WasmTimestamp

WasmTimestamp.parse(data) -> WasmTimestamp  // Parse a DER TimeStampToken / TSTInfo
ts.time -> bigint              // Timestamp time as unix seconds (getter)
ts.serial -> string            // Serial number (getter)
ts.policyOid -> string         // TSA policy OID (getter)
ts.tsaName -> string           // TSA name (getter)
ts.hashAlgorithm -> number     // Imprint hash algorithm id (getter)
ts.messageImprint -> Uint8Array  // The message imprint digest (getter)
ts.verify() -> boolean         // Verify the timestamp token

WasmRevocationMaterial

PAdES-B-LT offline validation material for signPdfBytesPades.

new WasmRevocationMaterial()   // Empty material set
addCert(der)                   // Add a DER X.509 certificate
addCrl(der)                    // Add a DER CRL
addOcsp(der)                   // Add a DER OCSP response

Dss

A parsed Document Security Store returned by WasmPdfDocument.dss().

dss.certCount -> number        // Number of DER certificates (getter)
getCert(i) -> Uint8Array | undefined   // i-th DER certificate
dss.crlCount -> number         // Number of DER CRLs (getter)
getCrl(i) -> Uint8Array | undefined    // i-th DER CRL
dss.ocspCount -> number        // Number of DER OCSP responses (getter)
getOcsp(i) -> Uint8Array | undefined   // i-th DER OCSP response
dss.vri -> string[]            // Per-signature VRI keys (uppercase-hex SHA-1 of /Contents) (getter)
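
The count getters and indexed accessors pair naturally into a loop. The sketch below, with a hypothetical helper named collectDssEntries, gathers everything into plain arrays and skips any index for which the accessor returns undefined.

```javascript
// Hypothetical helper: gather all DER blobs from a Dss object into arrays.
function collectDssEntries(dss) {
  const take = (count, get) => {
    const out = [];
    for (let i = 0; i < count; i++) {
      const item = get(i);
      if (item !== undefined) out.push(item); // accessors may return undefined
    }
    return out;
  };
  return {
    certs: take(dss.certCount, (i) => dss.getCert(i)),
    crls: take(dss.crlCount, (i) => dss.getCrl(i)),
    ocsps: take(dss.ocspCount, (i) => dss.getOcsp(i)),
    vriKeys: [...dss.vri],
  };
}
```

The collected certs, CRLs, and OCSP responses are the same three kinds of material that WasmRevocationMaterial accepts through addCert, addCrl, and addOcsp.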

OCR

OCR runs entirely in-WASM using the pure-Rust tract backend in the separate wasm-ocr build. Models are supplied by the host: download the detector/recognizer ONNX files and the dictionary (see modelManifest()), then pass the model bytes and the dictionary text to the constructor.

WasmOcrEngine

new WasmOcrEngine(detModel, recModel, dict, config?)  // Build from host-supplied model bytes
engine.ocrImage(imageBytes) -> string                 // OCR a raw image (PNG/JPEG/TIFF); returns JSON {text, confidence, spans}

Parameter	Type	Description
`detModel`	`Uint8Array`	ONNX bytes of the DBNet detector
`recModel`	`Uint8Array`	ONNX bytes of the SVTR recognizer
`dict`	`string`	Recognizer character dictionary, one character per line
`config`	`WasmOcrConfig | undefined`	Reserved (tuned defaults are used)
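
Because ocrImage returns a JSON string rather than an object, callers need a parsing step. This is a minimal sketch with a hypothetical helper, parseOcrResult. It assumes confidence is a score between 0 and 1 and treats spans as opaque, since their structure is not described here.

```javascript
// Hypothetical helper: parse the JSON string returned by engine.ocrImage().
// Assumption: `confidence` is a 0..1 score; the 0.5 threshold is arbitrary.
function parseOcrResult(json, minConfidence = 0.5) {
  const { text = "", confidence = 0, spans = [] } = JSON.parse(json);
  return { text, confidence, spans, reliable: confidence >= minConfidence };
}
```

With an engine instance, this would be used as parseOcrResult(engine.ocrImage(imageBytes)).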

WasmOcrConfig

new WasmOcrConfig()   // OCR configuration object (reserved for future tuning)

Enums

Align

Text/cell alignment discriminant used by textInRect and table column specs.

Align.Left   // 0
Align.Center // 1
Align.Right  // 2

PadesLevel

PAdES baseline level, used by signPdfBytesPades and WasmSignature.padesLevel.

PadesLevel.BB    // 0 — signed attrs incl. ESS signing-certificate-v2
PadesLevel.BT    // 1 — B-B + RFC 3161 signature-time-stamp
PadesLevel.BLt   // 2 — B-T + Document Security Store (DSS/VRI)
PadesLevel.BLta  // 3 — B-LT + document-scoped /DocTimeStamp

Feature Availability

Some features are gated behind Rust compile-time features. The default pdf-oxide-wasm package enables the common set; OCR ships in the separate wasm-ocr build.

Feature	WASM	Notes
Text extraction	Yes	Full support
Structured extraction	Yes	Chars, spans, words, lines, tables
PDF creation	Yes	Markdown, HTML, text, images, DocumentBuilder
PDF editing	Yes	Metadata, rotation, dimensions, erase, pages
Form fields	Yes	Read, write, export, flatten, build
Search	Yes	Full regex support
Encryption	Yes	AES-256 read and write
Annotations	Yes	Read, flatten, redact, sanitize
Merge / split PDFs	Yes	Merge pages and split by bookmarks
Embedded files	Yes	Attach files to PDFs
Page labels / XMP	Yes	Read page labels and XMP metadata
Office round-trip	Yes	DOCX/PPTX/XLSX import and export
Validation	Yes	PDF/A, PDF/UA, PDF/X
Barcodes	Yes (`barcodes`)	1D + QR as SVG or page images
Rendering	Yes (`rendering`)	Page → PNG, flatten to images
Digital signatures	Yes (`signatures`)	Sign, PAdES B-LT, verify, timestamps
OCR	`wasm-ocr` build	In-WASM tract OCR; models downloaded by the host

Error Handling

All methods that can fail throw JavaScript Error objects:

try {
  // These three bytes are not a valid PDF, so the constructor throws.
  const doc = new WasmPdfDocument(new Uint8Array([0, 1, 2]));
} catch (e) {
  console.error(`Failed to open: ${e.message}`);
}
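
When many call sites need the same try/catch, one option is to wrap the throwing call in a helper that returns a result object. The sketch below is illustrative only; tryOpen is a hypothetical name and is not part of the package.

```javascript
// Hypothetical helper: run a callback that may throw and return a result
// object instead, so callers can branch on `ok` without their own try/catch.
function tryOpen(open) {
  try {
    return { ok: true, doc: open(), error: null };
  } catch (e) {
    const message = e instanceof Error ? e.message : String(e);
    return { ok: false, doc: null, error: message };
  }
}

// Usage with the package would look like:
//   const result = tryOpen(() => new WasmPdfDocument(bytes));
//   if (!result.ok) console.error(`Failed to open: ${result.error}`);
```

Passing a closure keeps the helper independent of any particular constructor, so the same wrapper works for other fallible calls.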

TypeScript

The package ships full type definitions:

import { WasmPdfDocument, WasmPdf } from "pdf-oxide-wasm";

declare const bytes: Uint8Array; // PDF file contents, loaded elsewhere

const doc: WasmPdfDocument = new WasmPdfDocument(bytes);
const text: string = doc.extractText(0);
const pdf: WasmPdf = WasmPdf.fromMarkdown("# Hello");

Other Language Bindings

PDF Oxide provides native bindings for Rust, Python, Node.js, C#, Golang, Java, PHP, Ruby, C++, Swift, Kotlin, Dart, R, Julia, Zig, Scala, Clojure, Objective-C, and Elixir.

Next Steps

Types and Enums: all shared types and enums
Page API Reference: consistent page iteration across bindings
Getting Started with WASM: tutorial