Node.js Streams API
The pdf-oxide native binding ships readable streams for search results, pages, and tables — idiomatic for Node.js pipelines and memory-efficient for large documents.
All streams implement the standard Node.js Readable interface in object mode: they support backpressure, integrate with pipe(), and work with for await...of async iteration.
Streams are native to the Node.js binding. For the WASM build, iterate synchronously.
SearchStream
Emits one SearchResult at a time as the underlying SearchManager produces matches.
const { PdfDocument, SearchManager, SearchStream } = require("pdf-oxide");
const doc = new PdfDocument("large.pdf");
const manager = new SearchManager(doc);
const stream = new SearchStream(manager, "invoice");
stream.on("data", (r) => {
  console.log(`page ${r.pageIndex + 1}: ${r.text}`);
});
stream.on("end", () => {
  console.log("search complete");
  doc.close();
});
stream.on("error", (err) => {
  console.error(err);
  doc.close();
});
Case-sensitive search
const stream = new SearchStream(manager, "Invoice", { caseSensitive: true });
Async iteration
for await (const result of stream) {
  if (result.pageIndex > 50) break;
  console.log(result.text);
}
pipe() compatibility
const { Writable } = require("stream");
const sink = new Writable({
  objectMode: true,
  write(result, _enc, cb) {
    console.log(`${result.pageIndex}:${result.text}`);
    cb();
  },
});
stream.pipe(sink);
PageIteratorStream
Emits one page’s extracted text at a time. Useful for line-oriented output or when feeding an LLM with a rate-limited queue.
const { PageIteratorStream } = require("pdf-oxide");
const stream = new PageIteratorStream(doc, { format: "markdown" });
for await (const { pageIndex, content } of stream) {
  await indexPage(pageIndex, content);
}
format accepts "text" (default), "markdown", "html", "plain".
TableStream
Emits one table at a time as it’s detected.
const { TableStream } = require("pdf-oxide");
const stream = new TableStream(doc);
stream.on("data", (table) => {
  console.log(`${table.rows.length}x${table.rows[0].length} on page ${table.pageIndex}`);
});
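The rows shape above (an array of cell-string arrays) converts readily to CSV. A small sketch, assuming string cells as in the snippet above — tableToCsv is a local helper, not a pdf-oxide API:

```javascript
// Convert a table's rows (array of string arrays, as emitted per table)
// into CSV, quoting cells that contain commas, quotes, or newlines.
function tableToCsv(table) {
  const quote = (cell) =>
    /[",\n]/.test(cell) ? `"${cell.replace(/"/g, '""')}"` : cell;
  return table.rows.map((row) => row.map(quote).join(",")).join("\n");
}

// Hypothetical table shaped like the objects in the listing above.
const table = {
  pageIndex: 2,
  rows: [
    ["Item", "Qty"],
    ["Widget, large", "3"],
  ],
};

console.log(tableToCsv(table));
// Item,Qty
// "Widget, large",3
```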
Backpressure
All streams implement standard Node.js backpressure. If your consumer is slow, extraction pauses until the consumer is ready for more data:
stream.on("data", async (result) => {
  stream.pause();
  await slowIndex(result);
  stream.resume();
});
Or use for await, which handles pausing automatically.
Error handling
Errors during extraction are emitted as standard error events:
stream.on("error", (err) => {
  if (err.code === "PDF_INVALID_PAGE") {
    console.warn("skipping invalid page", err.pageIndex);
  } else {
    throw err;
  }
});
Memory efficiency
Streams keep only one result in flight. On a 10,000-page PDF producing 50,000 matches, a SearchStream uses constant memory — the entire result set is never materialised.
Cleanup
Closing the parent PdfDocument ends all attached streams. Streams also clean up their manager reference on end / error.
const doc = new PdfDocument("big.pdf");
const stream = new SearchStream(new SearchManager(doc), "TODO");
stream.on("end", () => doc.close());
stream.on("error", () => doc.close());
For Node.js 22+, the using keyword releases the document when the scope exits:
{
  using doc = new PdfDocument("big.pdf");
  const stream = new SearchStream(new SearchManager(doc), "TODO");
  for await (const r of stream) console.log(r);
} // doc.close() called automatically
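On runtimes without the using syntax, the same contract can be emulated by calling the well-known Symbol.dispose method in a finally block. A sketch with a mock document class — MockDoc is illustrative only, not part of pdf-oxide:

```javascript
// `using` works by calling [Symbol.dispose]() when the scope exits;
// a try/finally makes that call explicitly.
class MockDoc {
  constructor() {
    this.closed = false;
  }
  close() {
    this.closed = true;
  }
  [Symbol.dispose]() {
    this.close();
  }
}

const doc = new MockDoc();
try {
  // ... work with doc ...
} finally {
  doc[Symbol.dispose](); // the call `using` would make for you
}
console.log(doc.closed); // true
```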
Related
- Node.js Getting Started — install, quick start
- Node.js API Reference
- Search — non-streaming search options