CLI Options Reference

Complete reference for all CLI options

This page documents all available CLI options for opendataloader-pdf.

Options

| Option | Short | Type | Default | Description |
| --- | --- | --- | --- | --- |
| --output-dir | -o | string | - | Directory where output files are written. Default: input file directory |
| --password | -p | string | - | Password for encrypted PDF files |
| --format | -f | string | - | Output formats (comma-separated). Values: json, text, html, pdf, markdown, tagged-pdf. Default: json. For HTML inside Markdown use --markdown-with-html. For image extraction control use --image-output. |
| --quiet | -q | boolean | false | Suppress console logging output |
| --content-safety-off | - | string | - | Disable content safety filters. Values: all, hidden-text, off-page, tiny, hidden-ocg |
| --sanitize | - | boolean | false | Enable sensitive data sanitization. Replaces emails, phone numbers, IPs, credit cards, and URLs with placeholders |
| --keep-line-breaks | - | boolean | false | Preserve original line breaks in extracted text |
| --replace-invalid-chars | - | string | " " | Replacement character for invalid/unrecognized characters. Default: space |
| --use-struct-tree | - | boolean | false | Use PDF structure tree (tagged PDF) for reading order and semantic structure. Output quality depends on tag quality |
| --table-method | - | string | "default" | Table detection method. Values: default (border-based), cluster (border + cluster). Default: default |
| --reading-order | - | string | "xycut" | Reading order algorithm. Values: off, xycut. Default: xycut |
| --markdown-page-separator | - | string | - | Separator between pages in Markdown output. Use %page-number% for page numbers. Default: none |
| --markdown-with-html | - | boolean | false | Allow HTML tags inside Markdown output for complex structures such as multi-row-span tables. Implies --format markdown. |
| --text-page-separator | - | string | - | Separator between pages in text output. Use %page-number% for page numbers. Default: none |
| --html-page-separator | - | string | - | Separator between pages in HTML output. Use %page-number% for page numbers. Default: none |
| --image-output | - | string | "external" | Image output mode. Values: off (no images), embedded (Base64 data URIs), external (file references). Default: external |
| --image-format | - | string | "png" | Output format for extracted images. Values: png, jpeg. Default: png |
| --image-dir | - | string | - | Directory for extracted images (applies only with --image-output external) |
| --pages | - | string | - | Pages to extract (e.g., "1,3,5-7"). Default: all pages |
| --include-header-footer | - | boolean | false | Include page headers and footers in output |
| --detect-strikethrough | - | boolean | false | Detect strikethrough text and wrap with ~~ in Markdown output or <del></del> tag in HTML output (experimental) |
| --hybrid | - | string | "off" | Hybrid backend (requires a running server). Quick start: pip install "opendataloader-pdf[hybrid]" && opendataloader-pdf-hybrid --port 5002. For remote servers use --hybrid-url. Values: off (default), docling-fast, hancom-ai |
| --hybrid-mode | - | string | "auto" | Hybrid triage mode. Values: auto (default, dynamic triage), full (skip triage, all pages to backend) |
| --hybrid-url | - | string | - | Hybrid backend server URL (overrides default) |
| --hybrid-timeout | - | string | "0" | Hybrid backend request timeout in milliseconds (0 = no timeout). Default: 0 |
| --hybrid-fallback | - | boolean | false | Opt in to Java fallback on hybrid backend error (default: disabled) |
| --hybrid-hancom-ai-regionlist-strategy | - | string | "table-first" | DLA label 7 (regionlist) handling. Requires --hybrid=hancom-ai. Values: table-first (default; check TSR overlap), list-only (skip TSR, always treat as list) |
| --hybrid-hancom-ai-ocr-strategy | - | string | "auto" | OCR strategy. Requires --hybrid=hancom-ai. Values: off (stream-only), auto (default; stream first, OCR fallback), force (OCR-only) |
| --hybrid-hancom-ai-image-cache | - | string | "memory" | Page image cache backing. Requires --hybrid=hancom-ai. Values: memory (default), disk |
| --to-stdout | - | boolean | false | Write output to stdout instead of file (single format only) |
| --threads | - | string | "1" | Number of worker threads for per-page processing. Default: 1 (sequential, stable). Values >1 (experimental) run pages in parallel for faster throughput; output may vary slightly on some PDFs. Capped at the number of available CPU cores. Applies to the native Java pipeline only; ignored in --hybrid mode |

Examples

Basic conversion

opendataloader-pdf document.pdf -o ./output -f json,markdown
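
Because -f takes a comma-separated list, a wrapper script may want to check the list against the documented values before invoking the tool. The following is a minimal sketch of such a pre-flight check. The helper name valid_formats and the variable ALLOWED_FORMATS are hypothetical and not part of opendataloader-pdf; the list of values is copied from the options table.

```shell
#!/bin/sh
# Documented values for --format, copied from the options table
ALLOWED_FORMATS="json text html pdf markdown tagged-pdf"

# Hypothetical helper: succeed only if every entry of a comma-separated
# list is one of the documented format names
valid_formats() {
  [ -n "$1" ] || return 1
  for fmt in $(printf '%s' "$1" | tr ',' ' '); do
    case " $ALLOWED_FORMATS " in
      *" $fmt "*) ;;        # known format, keep checking
      *) return 1 ;;        # unknown format, reject the whole list
    esac
  done
  return 0
}

formats="json,markdown"
if valid_formats "$formats"; then
  echo "ok: $formats"
  # A real script would now call, for example:
  #   opendataloader-pdf document.pdf -o ./output -f "$formats"
else
  echo "unsupported format in: $formats" >&2
fi
```

The check is deliberately simple: it does not reject empty entries such as "json,,markdown", and how the CLI itself treats those is not covered here.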

Convert entire folder

opendataloader-pdf ./pdf-folder -o ./output -f json

Save images as external files

opendataloader-pdf document.pdf -f markdown --image-output external

Disable reading order sorting

opendataloader-pdf document.pdf -f json --reading-order off

Add page separators in output

opendataloader-pdf document.pdf -f markdown --markdown-page-separator "--- Page %page-number% ---"
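
To preview what such a separator might look like for a given page, the sketch below substitutes a page number into the template with sed. It only imitates the substitution outside the tool, and it assumes %page-number% is replaced by the bare page number, which the options table implies but does not spell out. The helper name render_separator is hypothetical.

```shell
#!/bin/sh
# Same template as passed to --markdown-page-separator above
separator='--- Page %page-number% ---'

# Hypothetical preview helper: replace the placeholder with a page number.
# Templates containing "/" or "&" would need escaping for sed.
render_separator() {
  printf '%s\n' "$1" | sed "s/%page-number%/$2/g"
}

render_separator "$separator" 3   # prints: --- Page 3 ---
```

According to the table, the same placeholder applies to --text-page-separator and --html-page-separator.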

Encrypted PDF

opendataloader-pdf encrypted.pdf -p mypassword -o ./output
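
Combining other options

The examples above cover only part of the options table. The sketch below combines a few more of the documented options. It assumes boolean options such as --sanitize are passed as bare flags, which this page does not show explicitly; document.pdf and ./output are placeholder paths. The run helper is hypothetical: it prints each command and executes it only when the program is on PATH, so the script works as a dry run on a machine without the CLI.

```shell
#!/bin/sh
# Print each command, and execute it only when the program is on PATH.
# Without the CLI installed this acts as a dry run.
run() {
  echo "+ $*"
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  fi
}

# Extract selected pages as plain text and write the result to stdout
run opendataloader-pdf document.pdf -f text --pages "1,3,5-7" --to-stdout

# Markdown with Base64-embedded images and sensitive data replaced by placeholders
run opendataloader-pdf document.pdf -o ./output -f markdown --image-output embedded --sanitize

# JSON that keeps page headers and footers, with console logging suppressed
run opendataloader-pdf document.pdf -o ./output -f json --include-header-footer --quiet
```

Note that --to-stdout is documented as working with a single format only, which is why the first command requests text alone.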
