
Getting Started with PDF Oxide (Java)

PDF Oxide is the fastest Java PDF library for text extraction, with a 0.8 ms mean and a 100% pass rate on 3,830 real-world PDFs. The same Rust core ships to Python, Go, JS, and C#; the Java binding is a thin JNI layer that requires JDK 11 (LTS) or later and works from Kotlin using the same JAR.

Installation

The JAR embeds native libraries for Linux (x86_64/aarch64), macOS (x86_64/aarch64), and Windows (x86_64). No compiler or extra setup is needed; the library matching your platform is extracted on first call.
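To make the platform selection concrete, here is a minimal sketch of how a loader could map the JVM's os.name and os.arch properties onto the five bundled targets. The class name, method, and classifier strings are hypothetical illustrations, not PDF Oxide's actual loader.

```java
import java.util.Locale;

// Hypothetical sketch: illustrates platform detection only, not PDF Oxide's real loader.
final class NativeTarget {
    private NativeTarget() {}

    /** Maps os.name / os.arch values to an illustrative classifier such as "linux-x86_64". */
    static String classify(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);

        String family;
        if (os.contains("linux")) {
            family = "linux";
        } else if (os.contains("mac") || os.contains("darwin")) {
            family = "macos";
        } else if (os.contains("windows")) {
            family = "windows";
        } else {
            throw new UnsupportedOperationException("Unsupported OS: " + osName);
        }

        String cpu;
        if (arch.equals("amd64") || arch.equals("x86_64")) {
            cpu = "x86_64";
        } else if (arch.equals("aarch64") || arch.equals("arm64")) {
            cpu = "aarch64";
        } else {
            throw new UnsupportedOperationException("Unsupported architecture: " + osArch);
        }

        // The target list above names Windows for x86_64 only.
        if (family.equals("windows") && !cpu.equals("x86_64")) {
            throw new UnsupportedOperationException("Unsupported target: windows-" + cpu);
        }
        return family + "-" + cpu;
    }

    public static void main(String[] args) {
        // Prints the classifier for the JVM this sketch runs on.
        System.out.println(classify(System.getProperty("os.name"), System.getProperty("os.arch")));
    }
}
```

A check like this is handy in deployment diagnostics: if it throws on your target machine, the platform is outside the list the JAR is documented to support.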

Maven

<dependency>
  <groupId>fyi.oxide</groupId>
  <artifactId>pdf-oxide</artifactId>
  <version>0.3.69</version>
</dependency>

Gradle

// Kotlin DSL
implementation("fyi.oxide:pdf-oxide:0.3.69")
// Groovy
implementation 'fyi.oxide:pdf-oxide:0.3.69'

Quick Start

Open a PDF and extract text. PdfDocument is AutoCloseable, so use try-with-resources to free the native handle deterministically.

import fyi.oxide.pdf.PdfDocument;
import java.nio.file.Path;

try (PdfDocument doc = PdfDocument.open(Path.of("report.pdf"))) {
    System.out.println("Pages: " + doc.pageCount());
    System.out.println(doc.extractText(0)); // zero-based page index
}

You can open from a path string, a Path, raw byte[], or an InputStream:

import fyi.oxide.pdf.PdfDocument;

byte[] pdfBytes = downloadFromS3(); // placeholder for your own code that fetches the PDF bytes
try (PdfDocument doc = PdfDocument.open(pdfBytes)) {
    String text = doc.extractText(0);
}
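When the bytes come from a network stream, it can help to reject obviously wrong payloads before handing them to the library. The following is a minimal sketch using only the JDK; PdfBytes, looksLikePdf, and readPdf are hypothetical helper names, not part of PDF Oxide.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical pre-flight helper; the resulting byte[] would then go to PdfDocument.open(byte[]).
final class PdfBytes {
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfBytes() {}

    /** Returns true when the data starts with the "%PDF-" header at offset 0. */
    static boolean looksLikePdf(byte[] data) {
        if (data.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (data[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads the whole stream into memory and fails fast when the header is missing. */
    static byte[] readPdf(InputStream in) throws IOException {
        byte[] data = in.readAllBytes(); // available since Java 9
        if (!looksLikePdf(data)) {
            throw new IOException("Stream does not start with %PDF-");
        }
        return data;
    }
}
```

The check is deliberately strict. Some real-world files carry a few stray bytes before the header, so treat a failure here as a hint rather than proof that the file cannot be parsed.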

Text Extraction

Loop over every page by its zero-based index:

import fyi.oxide.pdf.PdfDocument;
import java.nio.file.Path;

try (PdfDocument doc = PdfDocument.open(Path.of("book.pdf"))) {
    for (int i = 0; i < doc.pageCount(); i++) {
        System.out.println("--- Page " + (i + 1) + " ---");
        System.out.println(doc.extractText(i));
    }
}
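If you want the whole document as one string rather than printed page by page, the same loop can be wrapped in a small helper. This is a sketch under the assumption that extractText(int) returns a String as in the examples above; PageJoiner and the form-feed separator are illustrative choices, not library features.

```java
import java.util.function.IntFunction;

// Hypothetical helper: the page extractor is passed in, so this compiles without the library.
final class PageJoiner {
    private PageJoiner() {}

    /** Concatenates pages 0..pageCount-1, separating them with a form feed character. */
    static String joinPages(int pageCount, IntFunction<String> extractPage) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < pageCount; i++) {
            if (i > 0) {
                out.append('\f'); // page separator, not emitted before the first page
            }
            out.append(extractPage.apply(i));
        }
        return out.toString();
    }
}
```

With an open document, the call would look like PageJoiner.joinPages(doc.pageCount(), doc::extractText). Keeping a separator between pages lets downstream code split the text back into pages later.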

Word-Level Extraction

A PdfPage exposes structured geometry. words() returns a list of TextWord, each with its text, bounding box, and OCR confidence.

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.PdfPage;
import fyi.oxide.pdf.text.TextWord;
import fyi.oxide.pdf.geometry.BBox;
import java.nio.file.Path;

try (PdfDocument doc = PdfDocument.open(Path.of("paper.pdf"))) {
    PdfPage page = doc.page(0);
    for (TextWord word : page.words()) {
        BBox b = word.bbox();
        System.out.printf("'%s' at (%.1f, %.1f) conf=%.2f%n",
            word.text(), b.x0(), b.y0(), word.confidence());
    }
}

PdfPage also offers lines(), chars(), tables(), images(), annotations(), plus width(), height(), and text(BBox region) to extract from a sub-region.
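Word boxes are often combined, for example to find the rectangle that encloses a phrase before passing it on as a region. Below is a minimal sketch of that geometry with a stand-in Rect class. It assumes boxes carry x0, y0, x1, y1 with x0 <= x1 and y0 <= y1; only x0() and y0() appear in the example above, so the other two coordinates and the Rect type itself are assumptions.

```java
import java.util.List;

// Stand-in rectangle type for illustration; it is not the library's BBox.
final class Rect {
    final double x0;
    final double y0;
    final double x1;
    final double y1;

    Rect(double x0, double y0, double x1, double y1) {
        this.x0 = x0;
        this.y0 = y0;
        this.x1 = x1;
        this.y1 = y1;
    }

    /** Smallest rectangle that encloses every box in the list. */
    static Rect union(List<Rect> boxes) {
        if (boxes.isEmpty()) {
            throw new IllegalArgumentException("need at least one box");
        }
        double minX = Double.POSITIVE_INFINITY;
        double minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY;
        double maxY = Double.NEGATIVE_INFINITY;
        for (Rect b : boxes) {
            minX = Math.min(minX, b.x0);
            minY = Math.min(minY, b.y0);
            maxX = Math.max(maxX, b.x1);
            maxY = Math.max(maxY, b.y1);
        }
        return new Rect(minX, minY, maxX, maxY);
    }

    /** True when the other box lies entirely inside this one (edges included). */
    boolean contains(Rect other) {
        return other.x0 >= x0 && other.y0 >= y0 && other.x1 <= x1 && other.y1 <= y1;
    }
}
```

Because the union only takes minima and maxima, it gives the same result whether the page's y axis grows upward or downward.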

Markdown Conversion

Convert a single page or the whole document to Markdown via the MarkdownConverter helper (or the doc.toMarkdown(...) convenience methods).

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.MarkdownConverter;
import java.nio.file.Files;
import java.nio.file.Path;

try (PdfDocument doc = PdfDocument.open(Path.of("report.pdf"))) {
    String md = MarkdownConverter.toMarkdown(doc); // whole document
    Files.writeString(Path.of("report.md"), md);

    String pageMd = doc.toMarkdown(0); // single page
    String pageHtml = doc.toHtml(0);   // or HTML
}

Search

search() scans the whole document and returns a list of SearchMatch, each with its page index, bounding box, and matched text.

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.search.SearchMatch;
import fyi.oxide.pdf.geometry.BBox;
import java.nio.file.Path;

try (PdfDocument doc = PdfDocument.open(Path.of("manual.pdf"))) {
    for (SearchMatch m : doc.search("configuration")) {
        BBox b = m.bbox();
        System.out.printf("Page %d: '%s' at (%.0f, %.0f)%n",
            m.pageIndex(), m.text(), b.x0(), b.y0());
    }
}
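Search results are frequently summarized per page, for instance to show where a term is concentrated. The sketch below counts hits per page from a plain list of page indexes; HitCounter is a hypothetical helper, and the list stands in for the pageIndex() values collected from the matches.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: works on page indexes so it does not depend on the library types.
final class HitCounter {
    private HitCounter() {}

    /** Returns page index -> number of hits, ordered by page index. */
    static Map<Integer, Integer> countByPage(List<Integer> pageIndexes) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int page : pageIndexes) {
            counts.merge(page, 1, Integer::sum); // start at 1, then add 1 per further hit
        }
        return counts;
    }
}
```

A TreeMap keeps the pages in ascending order, so iterating the result gives a ready-made per-page summary.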

Creating PDFs

The Pdf type builds PDFs from Markdown, HTML, or images. It is AutoCloseable and has no Cleaner backstop, so a Pdf that is never closed keeps its native handle alive; always close it explicitly or use try-with-resources.

import fyi.oxide.pdf.Pdf;
import java.nio.file.Path;

try (Pdf pdf = Pdf.fromMarkdown("# Hello\n\nThis is a PDF.")) {
    pdf.saveTo(Path.of("out.pdf"));
}

try (Pdf pdf = Pdf.fromHtml("<h1>Invoice</h1><p>Amount: $42</p>")) {
    byte[] bytes = pdf.save(); // serialize to memory instead of disk
}

Password-Protected PDFs

Pass a password to open(), or call authenticate() after catching a PdfEncryptedException.

import fyi.oxide.pdf.PdfDocument;
import java.nio.file.Path;

try (PdfDocument doc = PdfDocument.open(Path.of("confidential.pdf"), "secret")) {
    System.out.println(doc.extractText(0));
}

Error Handling

PdfException extends RuntimeException (unchecked), with typed subclasses and a kind() enum for switch dispatch.

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.exception.PdfEncryptedException;
import fyi.oxide.pdf.exception.PdfException;
import java.nio.file.Path;

try (PdfDocument doc = PdfDocument.open(Path.of("document.pdf"))) {
    String text = doc.extractText(0);
} catch (PdfEncryptedException e) {
    System.err.println("Password required");
} catch (PdfException e) {
    switch (e.kind()) { // classic switch syntax, so this also compiles on JDK 11
        case PARSE:
            System.err.println("Malformed PDF");
            break;
        case IO:
            System.err.println("I/O error");
            break;
        default:
            System.err.println("PDF error: " + e.getMessage());
    }
}

Kotlin

The same JAR works directly from Kotlin — record accessors become properties.

import fyi.oxide.pdf.PdfDocument
import java.nio.file.Path

PdfDocument.open(Path.of("report.pdf")).use { doc ->
    println("Pages: ${doc.pageCount()}")
    println(doc.extractText(0))
}

Next Steps