CLI Options Reference

This page documents all available CLI options for opendataloader-pdf.

Options

Option	Short	Type	Default	Description
`--output-dir`	`-o`	`string`	-	Directory where output files are written. Default: input file directory
`--password`	`-p`	`string`	-	Password for encrypted PDF files
`--format`	`-f`	`string`	-	Output formats (comma-separated). Values: json, text, html, pdf, markdown, tagged-pdf. Default: json. For HTML inside Markdown use --markdown-with-html. For image extraction control use --image-output.
`--quiet`	`-q`	`boolean`	`false`	Suppress console logging output
`--content-safety-off`	-	`string`	-	Disable content safety filters. Values: all, hidden-text, off-page, tiny, hidden-ocg
`--sanitize`	-	`boolean`	`false`	Enable sensitive data sanitization. Replaces emails, phone numbers, IPs, credit cards, and URLs with placeholders
`--keep-line-breaks`	-	`boolean`	`false`	Preserve original line breaks in extracted text
`--replace-invalid-chars`	-	`string`	`" "`	Replacement character for invalid/unrecognized characters. Default: space
`--use-struct-tree`	-	`boolean`	`false`	Use PDF structure tree (tagged PDF) for reading order and semantic structure. Output quality depends on tag quality. Takes precedence over --hybrid: when both are set on a tagged PDF, the structure tree is used and the hybrid backend is not called
`--table-method`	-	`string`	`"default"`	Table detection method. Values: default (border-based), cluster (border + cluster). Default: default
`--reading-order`	-	`string`	`"xycut"`	Reading order algorithm. Values: off, xycut. Default: xycut
`--markdown-page-separator`	-	`string`	-	Separator between pages in Markdown output. Use %page-number% for page numbers. Default: none
`--markdown-with-html`	-	`boolean`	`false`	Allow HTML tags inside Markdown output for complex structures such as multi-row-span tables. Implies --format markdown.
`--text-page-separator`	-	`string`	-	Separator between pages in text output. Use %page-number% for page numbers. Default: none
`--html-page-separator`	-	`string`	-	Separator between pages in HTML output. Use %page-number% for page numbers. Default: none
`--image-output`	-	`string`	`"external"`	Image output mode. Values: off (no images), embedded (Base64 data URIs), external (file references). Default: external
`--image-format`	-	`string`	`"png"`	Output format for extracted images. Values: png, jpeg. Default: png
`--image-dir`	-	`string`	-	Directory for extracted images (applies only with --image-output external)
`--pages`	-	`string`	-	Pages to extract (e.g., "1,3,5-7"). Default: all pages
`--include-header-footer`	-	`boolean`	`false`	Include page headers and footers in output
`--detect-strikethrough`	-	`boolean`	`false`	Detect strikethrough text and wrap it in ~~ in Markdown output or in <del></del> tags in HTML output (experimental)
`--hybrid`	-	`string`	`"off"`	Hybrid backend (requires a running server). Quick start: pip install "opendataloader-pdf[hybrid]" && opendataloader-pdf-hybrid --port 5002. For remote servers use --hybrid-url. Values: off (default), docling-fast, hancom-ai. Ignored when --use-struct-tree is set on a tagged PDF (structure tree takes precedence)
`--hybrid-mode`	-	`string`	`"auto"`	Hybrid triage mode. Values: auto (default, dynamic triage), full (skip triage, all pages to backend)
`--hybrid-url`	-	`string`	-	Hybrid backend server URL (overrides default)
`--hybrid-timeout`	-	`string`	`"0"`	Hybrid backend request timeout in milliseconds (0 = no timeout). Default: 0
`--hybrid-fallback`	-	`boolean`	`false`	Opt in to falling back to Java processing when a hybrid backend error occurs. Default: disabled
`--hybrid-hancom-ai-regionlist-strategy`	-	`string`	`"table-first"`	DLA label 7 (regionlist) handling. Requires --hybrid=hancom-ai. Values: table-first (default; check TSR overlap), list-only (skip TSR, always treat as list)
`--hybrid-hancom-ai-ocr-strategy`	-	`string`	`"auto"`	OCR strategy. Requires --hybrid=hancom-ai. Values: off (stream-only), auto (default; stream first, OCR fallback), force (OCR-only)
`--hybrid-hancom-ai-image-cache`	-	`string`	`"memory"`	Page image cache backing. Requires --hybrid=hancom-ai. Values: memory (default), disk
`--to-stdout`	-	`boolean`	`false`	Write output to stdout instead of file (single format only)
`--threads`	-	`string`	`"1"`	Number of worker threads for per-page processing. Default: 1 (sequential, stable). Values >1 (experimental) run pages in parallel for faster throughput; output may vary slightly on some PDFs. Capped at the number of available CPU cores. Applies to the native Java pipeline only; ignored in --hybrid mode
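
The hybrid options in the table can be combined as in the following sketch, which wraps one conversion in a shell function. The function name convert_hybrid, the ./output directory, and the 60000 ms timeout are illustrative choices, not anything the tool prescribes. The backend is assumed to be running locally already, started with the quick-start command from the --hybrid description.

```shell
# Assumes the backend was started in another terminal, per the --hybrid quick start:
#   pip install "opendataloader-pdf[hybrid]" && opendataloader-pdf-hybrid --port 5002

# convert_hybrid is a hypothetical helper name, not part of the CLI.
convert_hybrid() {
  # $1: path to the input PDF
  opendataloader-pdf "$1" -o ./output -f json,markdown \
    --hybrid docling-fast \
    --hybrid-mode auto \
    --hybrid-timeout 60000 \
    --hybrid-fallback
}

# Usage: convert_hybrid document.pdf
```

Because --use-struct-tree takes precedence on tagged PDFs, it is left out here so that the hybrid backend handles the document. --threads is also omitted, since it is ignored in --hybrid mode.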

Examples

Basic conversion

opendataloader-pdf document.pdf -o ./output -f json,markdown

Convert entire folder

opendataloader-pdf ./pdf-folder -o ./output -f json

Save images as external files

opendataloader-pdf document.pdf -f markdown --image-output external

Disable reading order sorting

opendataloader-pdf document.pdf -f json --reading-order off

Add page separators in output

opendataloader-pdf document.pdf -f markdown --markdown-page-separator "--- Page %page-number% ---"
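
To preview how such a separator will read, the substitution can be mimicked in plain shell. This sketch assumes the tool replaces each %page-number% with the page's number. The sed call only imitates that behaviour and does not invoke opendataloader-pdf.

```shell
# Separator template, as it would be passed to --markdown-page-separator
sep="--- Page %page-number% ---"

# Imitate the placeholder substitution for the first three pages
preview=$(for n in 1 2 3; do
  printf '%s\n' "$sep" | sed "s/%page-number%/$n/g"
done)

printf '%s\n' "$preview"
```

The same placeholder is accepted by --text-page-separator and --html-page-separator.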

Encrypted PDF

opendataloader-pdf encrypted.pdf -p mypassword -o ./output
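
Extract selected pages to stdout

The options table also allows extracting only some pages and writing the result to stdout, so it can be piped or redirected. The sketch below is one way to combine those options. The helper name extract_pages and the file excerpt.txt are hypothetical. --to-stdout accepts a single format only, so -f names just one.

```shell
# extract_pages is a hypothetical helper name, not part of the CLI.
extract_pages() {
  # $1: input PDF, $2: page spec such as "1,3,5-7"
  opendataloader-pdf "$1" -f text --pages "$2" --to-stdout --quiet
}

# Usage: extract_pages document.pdf "1,3,5-7" > excerpt.txt
```

--quiet suppresses console logging, which might otherwise mix with the redirected output.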
