Quick Start with Node.js

Install @opendataloader/pdf and convert PDF files to Markdown or JSON using TypeScript or JavaScript. Requires Java 11+ and Node.js 20+.

The TypeScript package mirrors the Python API and exposes both a programmatic helper and a CLI (npx @opendataloader/pdf).

Requirements

  • Node.js 20 or later
  • Java 11+ available on the system PATH

Verify Java once before installing:

java -version

If java is not found, install a JDK:

OS             Install Command
macOS          brew install --cask temurin, or download from Adoptium
Ubuntu/Debian  sudo apt install openjdk-17-jdk
Windows        Download the installer from Adoptium (adds Java to PATH automatically)

Windows PATH tip: If java -version fails after installing, close and reopen your terminal. If it still fails, add C:\Program Files\Eclipse Adoptium\jdk-<version>\bin to your system PATH manually.
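The Java requirement can also be checked from Node.js before any conversion runs. The following is a minimal sketch, not part of @opendataloader/pdf: the helper names parseJavaMajor and detectJavaMajor are hypothetical, and the parsing assumes the usual version "..." line that JDKs print for java -version.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical helper (not part of @opendataloader/pdf).
// Extracts the major Java version from `java -version` output.
// Handles the legacy scheme ("1.8.0_392" -> 8) and the modern one ("17.0.9" -> 17).
function parseJavaMajor(versionOutput: string): number | null {
  const match = /version "(\d+)(?:\.(\d+))?/.exec(versionOutput);
  if (!match) return null;
  const first = Number(match[1]);
  // Java 8 and earlier report themselves as 1.x
  if (first === 1 && match[2] !== undefined) return Number(match[2]);
  return first;
}

// Hypothetical helper: runs `java -version` and returns the major version,
// or null if java is not on PATH or the output is not recognized.
function detectJavaMajor(): number | null {
  const result = spawnSync("java", ["-version"], { encoding: "utf8" });
  if (result.error) return null;
  // `java -version` typically prints to stderr, so both streams are checked.
  return parseJavaMajor(`${result.stderr ?? ""}\n${result.stdout ?? ""}`);
}
```

A caller could compare the result of detectJavaMajor() against 11 and print the install hints above when the value is null or too low.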

Install

npm install @opendataloader/pdf

Convert from TypeScript

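import { convert } from "@opendataloader/pdf";

async function main() {
  // Inputs can mix individual PDF files and folders.
  await convert(["path/to/document.pdf", "path/to/folder"], {
    outputDir: "path/to/output",
    // Comma-separated list of output formats.
    format: "json,html,pdf,markdown",
  });
}

main().catch((error) => {
  console.error("Error processing PDF:", error);
  // Signal failure to the calling shell or script.
  process.exitCode = 1;
});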
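The format value in the example above is a comma-separated string, and the options table on this page lists json, text, html, pdf, markdown, and tagged-pdf as accepted values. When format names come from user input or configuration, it may help to validate them before calling convert(). The sketch below is a client-side convenience only: normalizeFormats is a hypothetical helper, and the trimming, lowercasing, and error behavior are assumptions rather than something the package is documented to do.

```typescript
// Format names as listed in the convert() options table on this page.
const KNOWN_FORMATS = ["json", "text", "html", "pdf", "markdown", "tagged-pdf"];

// Hypothetical helper (not part of @opendataloader/pdf).
// Accepts "json,markdown" or ["json", "markdown"] and returns a validated,
// de-duplicated array. Throws on names that are not in KNOWN_FORMATS.
function normalizeFormats(input: string | string[]): string[] {
  const parts = (Array.isArray(input) ? input : input.split(","))
    .map((part) => part.trim().toLowerCase())
    .filter((part) => part.length > 0);

  const result: string[] = [];
  for (const part of parts) {
    if (KNOWN_FORMATS.indexOf(part) === -1) {
      throw new Error(`Unknown output format: ${part}`);
    }
    // Keep the first occurrence and skip duplicates.
    if (result.indexOf(part) === -1) result.push(part);
  }
  return result;
}
```

Since format is typed as string | string[], the returned array can be passed as format directly, or joined with commas to match the example above.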

convert() options

Option                            Type               Default        Description
outputDir                         string             -              Directory where output files are written. Default: input file directory
password                          string             -              Password for encrypted PDF files
format                            string | string[]  -              Output formats (comma-separated). Values: json, text, html, pdf, markdown, tagged-pdf. Default: json. For HTML inside Markdown use --markdown-with-html. For image extraction control use --image-output.
quiet                             boolean            false          Suppress console logging output
contentSafetyOff                  string | string[]  -              Disable content safety filters. Values: all, hidden-text, off-page, tiny, hidden-ocg
sanitize                          boolean            false          Enable sensitive data sanitization. Replaces emails, phone numbers, IPs, credit cards, and URLs with placeholders
keepLineBreaks                    boolean            false          Preserve original line breaks in extracted text
replaceInvalidChars               string             " "            Replacement character for invalid/unrecognized characters. Default: space
useStructTree                     boolean            false          Use PDF structure tree (tagged PDF) for reading order and semantic structure. Output quality depends on tag quality
tableMethod                       string             "default"      Table detection method. Values: default (border-based), cluster (border + cluster). Default: default
readingOrder                      string             "xycut"        Reading order algorithm. Values: off, xycut. Default: xycut
markdownPageSeparator             string             -              Separator between pages in Markdown output. Use %page-number% for page numbers. Default: none
markdownWithHtml                  boolean            false          Allow HTML tags inside Markdown output for complex structures such as multi-row-span tables. Implies --format markdown.
textPageSeparator                 string             -              Separator between pages in text output. Use %page-number% for page numbers. Default: none
htmlPageSeparator                 string             -              Separator between pages in HTML output. Use %page-number% for page numbers. Default: none
imageOutput                       string             "external"     Image output mode. Values: off (no images), embedded (Base64 data URIs), external (file references). Default: external
imageFormat                       string             "png"          Output format for extracted images. Values: png, jpeg. Default: png
imageDir                          string             -              Directory for extracted images (applies only with --image-output external)
pages                             string             -              Pages to extract (e.g., "1,3,5-7"). Default: all pages
includeHeaderFooter               boolean            false          Include page headers and footers in output
detectStrikethrough               boolean            false          Detect strikethrough text and wrap with ~~ in Markdown output or <del></del> tag in HTML output (experimental)
hybrid                            string             "off"          Hybrid backend (requires a running server). Quick start: pip install "opendataloader-pdf[hybrid]" && opendataloader-pdf-hybrid --port 5002. For remote servers use --hybrid-url. Values: off (default), docling-fast, hancom-ai
hybridMode                        string             "auto"         Hybrid triage mode. Values: auto (default, dynamic triage), full (skip triage, all pages to backend)
hybridUrl                         string             -              Hybrid backend server URL (overrides default)
hybridTimeout                     string             "0"            Hybrid backend request timeout in milliseconds (0 = no timeout). Default: 0
hybridFallback                    boolean            false          Opt in to Java fallback on hybrid backend error (default: disabled)
hybridHancomAiRegionlistStrategy  string             "table-first"  DLA label 7 (regionlist) handling. Requires --hybrid=hancom-ai. Values: table-first (default; check TSR overlap), list-only (skip TSR, always treat as list)
hybridHancomAiOcrStrategy         string             "auto"         OCR strategy. Requires --hybrid=hancom-ai. Values: off (stream-only), auto (default; stream first, OCR fallback), force (OCR-only)
hybridHancomAiImageCache          string             "memory"       Page image cache backing. Requires --hybrid=hancom-ai. Values: memory (default), disk
toStdout                          boolean            false          Write output to stdout instead of file (single format only)
threads                           string             "1"            Number of worker threads for per-page processing. Default: 1 (sequential, stable). Values >1 (experimental) run pages in parallel for faster throughput; output may vary slightly on some PDFs. Capped at the number of available CPU cores. Applies to the native Java pipeline only; ignored in --hybrid mode
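Several of these options take small string formats, such as the "1,3,5-7" syntax for pages and the %page-number% placeholder in the page separators. As an illustration, the sketch below builds a pages string from a list of page numbers and assembles an options object from names in the table. toPagesOption is a hypothetical helper, and the separator text is only an example value.

```typescript
// Hypothetical helper (not part of @opendataloader/pdf).
// Collapses page numbers into the "1,3,5-7" syntax shown for the `pages` option.
function toPagesOption(pages: number[]): string {
  // Sort ascending and drop duplicates.
  const sorted = pages
    .slice()
    .sort((a, b) => a - b)
    .filter((page, index, all) => index === 0 || page !== all[index - 1]);

  const ranges: string[] = [];
  let start = 0;
  while (start < sorted.length) {
    // Extend the run while the next page number is consecutive.
    let end = start;
    while (end + 1 < sorted.length && sorted[end + 1] === sorted[end] + 1) end++;
    ranges.push(start === end ? `${sorted[start]}` : `${sorted[start]}-${sorted[end]}`);
    start = end + 1;
  }
  return ranges.join(",");
}

// Example options object using names from the table above.
const markdownOptions = {
  outputDir: "path/to/output",
  format: "markdown",
  pages: toPagesOption([5, 1, 3, 7, 6]), // "1,3,5-7"
  markdownPageSeparator: "<!-- page %page-number% -->", // example separator text
  keepLineBreaks: true,
};
```

An object like markdownOptions can be passed as the second argument to convert(), in the same position as the options in the TypeScript example on this page.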

CLI usage

# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
npx @opendataloader/pdf file1.pdf file2.pdf folder/ \
  -o output/ \
  -f json,html,pdf,markdown

For CLI options, see the CLI Options Reference.
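When the CLI is driven from a Node.js script, the batching advice above suggests collecting all inputs into one argument list rather than spawning npx once per file. The following is a rough sketch under that assumption: buildCliArgs and runCli are hypothetical helpers, and only the -o and -f flags shown in the CLI example are used.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical helper: builds the argument list for a single batched
// `npx @opendataloader/pdf` call, using the -o and -f flags from the example.
function buildCliArgs(inputs: string[], outputDir: string, formats: string[]): string[] {
  if (inputs.length === 0) throw new Error("At least one input path is required");
  return ["@opendataloader/pdf", ...inputs, "-o", outputDir, "-f", formats.join(",")];
}

// Hypothetical helper: runs the CLI once for all inputs and returns its exit code.
function runCli(inputs: string[], outputDir: string, formats: string[]): number | null {
  const result = spawnSync("npx", buildCliArgs(inputs, outputDir, formats), {
    stdio: "inherit",
    // On Windows, npx is a .cmd shim and generally needs a shell to launch;
    // paths containing spaces may need quoting in that case.
    shell: process.platform === "win32",
  });
  if (result.error) throw result.error;
  return result.status;
}
```

For example, runCli(["file1.pdf", "file2.pdf", "folder/"], "output/", ["json", "markdown"]) would start one JVM for the whole batch. Within a TypeScript program, calling convert() directly is usually simpler than shelling out.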

Next steps

  • Need schema details for downstream parsing? See the JSON schema.
