Quick Start with Node.js
Install @opendataloader/pdf and convert PDF files to Markdown or JSON using TypeScript or JavaScript. Requires Java 11+ and Node.js 20+.
The TypeScript package mirrors the Python API and exposes both a programmatic helper and a CLI (npx @opendataloader/pdf).
Requirements
- Node.js 20 or later
- Java 11+ available on the system PATH
Verify Java once before installing:
java -version
If java is not found, install a JDK:
| OS | Install Command |
|---|---|
| macOS | brew install --cask temurin or download from Adoptium |
| Ubuntu/Debian | sudo apt install openjdk-17-jdk |
| Windows | Download installer from Adoptium (adds to PATH automatically) |
Windows PATH tip: If java -version fails after installing, close and reopen your terminal. If it still fails, add C:\Program Files\Eclipse Adoptium\jdk-<version>\bin to your system PATH manually.
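The Java 11+ requirement can also be checked from a script before calling the converter. The following is a minimal sketch, not part of @opendataloader/pdf: `parseJavaMajor` and `meetsJavaRequirement` are hypothetical helper names, and the parser assumes the usual `version "x.y.z"` line that `java -version` prints.

```typescript
// Sketch: extract the Java major version from the text printed by `java -version`.
// Both helpers are hypothetical and are not provided by @opendataloader/pdf.
function parseJavaMajor(versionOutput: string): number | null {
  // Matches the quoted version, e.g. "17.0.9", "21", or the legacy "1.8.0_392".
  const match = /version "(\d+)(?:\.(\d+))?/.exec(versionOutput);
  if (!match) {
    return null;
  }
  const first = Number(match[1]);
  // Java 8 and earlier report themselves as 1.x, so the major version is the second field.
  if (first === 1 && match[2] !== undefined) {
    return Number(match[2]);
  }
  return first;
}

// Returns true when the reported version is at least minMajor (11 by default, per the requirements above).
function meetsJavaRequirement(versionOutput: string, minMajor: number = 11): boolean {
  const major = parseJavaMajor(versionOutput);
  return major !== null && major >= minMajor;
}
```

The input string would typically come from the stderr of `spawnSync("java", ["-version"], { encoding: "utf8" })` in `node:child_process`, since `java -version` writes its banner to stderr rather than stdout.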
Install
npm install @opendataloader/pdf
Convert from TypeScript
import { convert } from "@opendataloader/pdf";

async function main() {
  await convert(["path/to/document.pdf", "path/to/folder"], {
    outputDir: "path/to/output",
    format: "json,html,pdf,markdown",
  });
}

main().catch((error) => {
  console.error("Error processing PDF:", error);
});
convert() options
| Option | Type | Default | Description |
|---|---|---|---|
| outputDir | string | - | Directory where output files are written. Default: input file directory |
| password | string | - | Password for encrypted PDF files |
| format | string \| string[] | - | Output formats (comma-separated). Values: json, text, html, pdf, markdown, tagged-pdf. Default: json. For HTML inside Markdown use --markdown-with-html. For image extraction control use --image-output. |
| quiet | boolean | false | Suppress console logging output |
| contentSafetyOff | string \| string[] | - | Disable content safety filters. Values: all, hidden-text, off-page, tiny, hidden-ocg |
| sanitize | boolean | false | Enable sensitive data sanitization. Replaces emails, phone numbers, IPs, credit cards, and URLs with placeholders |
| keepLineBreaks | boolean | false | Preserve original line breaks in extracted text |
| replaceInvalidChars | string | " " | Replacement character for invalid/unrecognized characters. Default: space |
| useStructTree | boolean | false | Use PDF structure tree (tagged PDF) for reading order and semantic structure. Output quality depends on tag quality |
| tableMethod | string | "default" | Table detection method. Values: default (border-based), cluster (border + cluster). Default: default |
| readingOrder | string | "xycut" | Reading order algorithm. Values: off, xycut. Default: xycut |
| markdownPageSeparator | string | - | Separator between pages in Markdown output. Use %page-number% for page numbers. Default: none |
| markdownWithHtml | boolean | false | Allow HTML tags inside Markdown output for complex structures such as multi-row-span tables. Implies --format markdown. |
| textPageSeparator | string | - | Separator between pages in text output. Use %page-number% for page numbers. Default: none |
| htmlPageSeparator | string | - | Separator between pages in HTML output. Use %page-number% for page numbers. Default: none |
| imageOutput | string | "external" | Image output mode. Values: off (no images), embedded (Base64 data URIs), external (file references). Default: external |
| imageFormat | string | "png" | Output format for extracted images. Values: png, jpeg. Default: png |
| imageDir | string | - | Directory for extracted images (applies only with --image-output external) |
| pages | string | - | Pages to extract (e.g., "1,3,5-7"). Default: all pages |
| includeHeaderFooter | boolean | false | Include page headers and footers in output |
| detectStrikethrough | boolean | false | Detect strikethrough text and wrap with ~~ in Markdown output or <del></del> tag in HTML output (experimental) |
| hybrid | string | "off" | Hybrid backend (requires a running server). Quick start: pip install "opendataloader-pdf[hybrid]" && opendataloader-pdf-hybrid --port 5002. For remote servers use --hybrid-url. Values: off (default), docling-fast, hancom-ai |
| hybridMode | string | "auto" | Hybrid triage mode. Values: auto (default, dynamic triage), full (skip triage, all pages to backend) |
| hybridUrl | string | - | Hybrid backend server URL (overrides default) |
| hybridTimeout | string | "0" | Hybrid backend request timeout in milliseconds (0 = no timeout). Default: 0 |
| hybridFallback | boolean | false | Opt in to Java fallback on hybrid backend error (default: disabled) |
| hybridHancomAiRegionlistStrategy | string | "table-first" | DLA label 7 (regionlist) handling. Requires --hybrid=hancom-ai. Values: table-first (default; check TSR overlap), list-only (skip TSR, always treat as list) |
| hybridHancomAiOcrStrategy | string | "auto" | OCR strategy. Requires --hybrid=hancom-ai. Values: off (stream-only), auto (default; stream first, OCR fallback), force (OCR-only) |
| hybridHancomAiImageCache | string | "memory" | Page image cache backing. Requires --hybrid=hancom-ai. Values: memory (default), disk |
| toStdout | boolean | false | Write output to stdout instead of file (single format only) |
| threads | string | "1" | Number of worker threads for per-page processing. Default: 1 (sequential, stable). Values >1 (experimental) run pages in parallel for faster throughput; output may vary slightly on some PDFs. Capped at the number of available CPU cores. Applies to the native Java pipeline only; ignored in --hybrid mode |
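To show how several of these options combine, here is a small sketch that assembles an options object. `ConvertOptionsSketch` is a local, partial mirror of the table written for illustration, not the package's own type definitions, and `joinFormats` is a hypothetical helper that rejects format names not listed in the table before any conversion starts.

```typescript
// Local, partial mirror of the options table; the package's real typings may differ.
interface ConvertOptionsSketch {
  outputDir?: string;
  format?: string;
  pages?: string;
  imageOutput?: "off" | "embedded" | "external";
  markdownPageSeparator?: string;
  quiet?: boolean;
}

// Format values listed in the table for the `format` option.
const KNOWN_FORMATS: string[] = ["json", "text", "html", "pdf", "markdown", "tagged-pdf"];

// Hypothetical helper: joins formats into the comma-separated form and fails early on unknown names.
function joinFormats(formats: string[]): string {
  const unknown = formats.filter((f) => KNOWN_FORMATS.indexOf(f) === -1);
  if (unknown.length > 0) {
    throw new Error("Unknown output format(s): " + unknown.join(", "));
  }
  return formats.join(",");
}

const options: ConvertOptionsSketch = {
  outputDir: "path/to/output",
  format: joinFormats(["markdown", "json"]),
  pages: "1,3,5-7", // page selection syntax from the table
  imageOutput: "embedded", // inline images as Base64 data URIs
  markdownPageSeparator: "--- page %page-number% ---",
  quiet: true,
};
```

An object shaped like `options` could then be passed as the second argument to `convert()`, in the same position as the options literal in the TypeScript example earlier on this page.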
CLI usage
# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
npx @opendataloader/pdf file1.pdf file2.pdf folder/ \
  -o output/ \
  -f json,html,pdf,markdown
For CLI options, see the CLI Options Reference.
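Because each CLI invocation starts its own JVM, a script that drives the CLI is better off collecting all inputs into one argument list. The sketch below builds such a list; `buildCliArgs` is a hypothetical helper, and it uses only the `-o` and `-f` flags shown in the command above.

```typescript
// Sketch: assemble a single batched CLI invocation instead of one call per file.
// `buildCliArgs` is a hypothetical helper, not part of @opendataloader/pdf.
function buildCliArgs(inputs: string[], outputDir: string, formats: string[]): string[] {
  if (inputs.length === 0) {
    throw new Error("At least one input file or folder is required");
  }
  // Package name first, then every input, then the output directory and format flags.
  return ["@opendataloader/pdf"].concat(inputs, ["-o", outputDir, "-f", formats.join(",")]);
}

const args = buildCliArgs(["file1.pdf", "file2.pdf", "folder/"], "output/", ["json", "markdown"]);
```

The resulting array could be handed to `spawn("npx", args, { stdio: "inherit" })` from `node:child_process`, so that all files share one JVM startup.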
Next steps
- Need schema details for downstream parsing? See the JSON schema.