Node.js Streams API

The pdf-oxide native binding ships readable streams for search results, pages, and tables — idiomatic for Node.js pipelines and memory-efficient for large documents.

All streams implement the standard Node.js Readable interface in object mode: they honour backpressure, integrate with pipe(), and support for await...of async iteration.

Streams are native to the Node.js binding. For the WASM build, iterate synchronously.

SearchStream

Emits one SearchResult at a time as the underlying SearchManager produces matches.

const { PdfDocument, SearchManager, SearchStream } = require("pdf-oxide");

const doc = new PdfDocument("large.pdf");
const manager = new SearchManager(doc);
const stream = new SearchStream(manager, "invoice");

stream.on("data", (r) => {
  console.log(`page ${r.pageIndex + 1}: ${r.text}`);
});

stream.on("end", () => {
  console.log("search complete");
  doc.close();
});

stream.on("error", (err) => {
  console.error(err);
  doc.close();
});

Search options go in a third argument:

const exact = new SearchStream(manager, "Invoice", { caseSensitive: true });

Async iteration

for await (const result of stream) {
  if (result.pageIndex > 50) break; // breaking out of the loop destroys the stream
  console.log(result.text);
}

pipe() compatibility

const { Writable } = require("stream");

const sink = new Writable({
  objectMode: true,
  write(result, _enc, cb) {
    console.log(`${result.pageIndex}:${result.text}`);
    cb();
  },
});

stream.pipe(sink);

PageIteratorStream

Emits one page’s extracted text at a time. Useful for line-oriented output or when feeding an LLM with a rate-limited queue.

const { PageIteratorStream } = require("pdf-oxide");

const stream = new PageIteratorStream(doc, { format: "markdown" });

for await (const { pageIndex, content } of stream) {
  await indexPage(pageIndex, content);
}

format accepts "text" (default), "markdown", "html", "plain".

TableStream

Emits one table at a time as it’s detected.

const { TableStream } = require("pdf-oxide");

const stream = new TableStream(doc);

stream.on("data", (table) => {
  console.log(`${table.rows.length}x${table.rows[0].length} on page ${table.pageIndex}`);
});

Backpressure

All streams implement standard Node.js backpressure. If your consumer is slow, the stream pauses extraction until the consumer is ready for more data:

stream.on("data", async (result) => {
  stream.pause();
  await slowIndex(result);
  stream.resume();
});

Or use for await, which handles pausing automatically.

Error handling

Errors during extraction are emitted as standard error events:

stream.on("error", (err) => {
  if (err.code === "PDF_INVALID_PAGE") {
    console.warn("skipping invalid page", err.pageIndex);
  } else {
    throw err;
  }
});

Memory efficiency

Streams keep only one result in flight. On a 10,000-page PDF producing 50,000 matches, a SearchStream uses constant memory — the entire result set is never materialised.

Cleanup

Closing the parent PdfDocument ends all attached streams. Streams also clean up their manager reference on end / error.

const doc = new PdfDocument("big.pdf");
const stream = new SearchStream(new SearchManager(doc), "TODO");

stream.on("end", () => doc.close());
stream.on("error", () => doc.close());

On Node.js versions with explicit resource management, the using declaration releases the document when the scope exits:

{
  using doc = new PdfDocument("big.pdf");
  const stream = new SearchStream(new SearchManager(doc), "TODO");
  for await (const r of stream) console.log(r);
} // doc.close() called automatically
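A using declaration works with any object that exposes a Symbol.dispose method. A sketch of the protocol (FakeDocument is illustrative, not pdf-oxide's implementation); on runtimes without the using keyword, Symbol.dispose can be polyfilled and the method invoked explicitly:

```javascript
// Polyfill for runtimes that predate explicit resource management.
Symbol.dispose ??= Symbol("Symbol.dispose");

class FakeDocument {
  constructor(path) {
    this.path = path;
    this.closed = false;
  }
  close() {
    this.closed = true;
  }
  [Symbol.dispose]() {
    this.close(); // what `using` invokes when the scope exits
  }
}

const doc = new FakeDocument("big.pdf");
// Without `using`, the dispose method can be called directly:
doc[Symbol.dispose]();
console.log(doc.closed); // true
```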