CLI Options Reference
Complete reference for all CLI options
This page documents all available CLI options for opendataloader-pdf.
Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --output-dir | -o | string | - | Directory where output files are written. Default: input file directory |
| --password | -p | string | - | Password for encrypted PDF files |
| --format | -f | string | - | Output formats (comma-separated). Values: json, text, html, pdf, markdown, tagged-pdf. Default: json. For HTML inside Markdown use --markdown-with-html. For image extraction control use --image-output. |
| --quiet | -q | boolean | false | Suppress console logging output |
| --content-safety-off | - | string | - | Disable content safety filters. Values: all, hidden-text, off-page, tiny, hidden-ocg |
| --sanitize | - | boolean | false | Enable sensitive data sanitization. Replaces emails, phone numbers, IPs, credit cards, and URLs with placeholders |
| --keep-line-breaks | - | boolean | false | Preserve original line breaks in extracted text |
| --replace-invalid-chars | - | string | " " | Replacement character for invalid/unrecognized characters. Default: space |
| --use-struct-tree | - | boolean | false | Use PDF structure tree (tagged PDF) for reading order and semantic structure. Output quality depends on tag quality |
| --table-method | - | string | "default" | Table detection method. Values: default (border-based), cluster (border + cluster). Default: default |
| --reading-order | - | string | "xycut" | Reading order algorithm. Values: off, xycut. Default: xycut |
| --markdown-page-separator | - | string | - | Separator between pages in Markdown output. Use %page-number% for page numbers. Default: none |
| --markdown-with-html | - | boolean | false | Allow HTML tags inside Markdown output for complex structures such as multi-row-span tables. Implies --format markdown. |
| --text-page-separator | - | string | - | Separator between pages in text output. Use %page-number% for page numbers. Default: none |
| --html-page-separator | - | string | - | Separator between pages in HTML output. Use %page-number% for page numbers. Default: none |
| --image-output | - | string | "external" | Image output mode. Values: off (no images), embedded (Base64 data URIs), external (file references). Default: external |
| --image-format | - | string | "png" | Output format for extracted images. Values: png, jpeg. Default: png |
| --image-dir | - | string | - | Directory for extracted images (applies only with --image-output external) |
| --pages | - | string | - | Pages to extract (e.g., "1,3,5-7"). Default: all pages |
| --include-header-footer | - | boolean | false | Include page headers and footers in output |
| --detect-strikethrough | - | boolean | false | Detect strikethrough text and wrap with ~~ in Markdown output or <del></del> tag in HTML output (experimental) |
| --hybrid | - | string | "off" | Hybrid backend (requires a running server). Quick start: pip install "opendataloader-pdf[hybrid]" && opendataloader-pdf-hybrid --port 5002. For remote servers use --hybrid-url. Values: off (default), docling-fast, hancom-ai |
| --hybrid-mode | - | string | "auto" | Hybrid triage mode. Values: auto (default, dynamic triage), full (skip triage, all pages to backend) |
| --hybrid-url | - | string | - | Hybrid backend server URL (overrides default) |
| --hybrid-timeout | - | string | "0" | Hybrid backend request timeout in milliseconds (0 = no timeout). Default: 0 |
| --hybrid-fallback | - | boolean | false | Opt in to Java fallback on hybrid backend error (default: disabled) |
| --hybrid-hancom-ai-regionlist-strategy | - | string | "table-first" | DLA label 7 (regionlist) handling. Requires --hybrid=hancom-ai. Values: table-first (default; check TSR overlap), list-only (skip TSR, always treat as list) |
| --hybrid-hancom-ai-ocr-strategy | - | string | "auto" | OCR strategy. Requires --hybrid=hancom-ai. Values: off (stream-only), auto (default; stream first, OCR fallback), force (OCR-only) |
| --hybrid-hancom-ai-image-cache | - | string | "memory" | Page image cache backing. Requires --hybrid=hancom-ai. Values: memory (default), disk |
| --to-stdout | - | boolean | false | Write output to stdout instead of file (single format only) |
| --threads | - | string | "1" | Number of worker threads for per-page processing. Default: 1 (sequential, stable). Values >1 (experimental) run pages in parallel for faster throughput; output may vary slightly on some PDFs. Capped at the number of available CPU cores. Applies to the native Java pipeline only; ignored in --hybrid mode |
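The --pages value in the table above mixes single pages and ranges in one comma-separated string. The sketch below shows how such a spec can be read, assuming ranges like 5-7 are inclusive on both ends (the table does not state this explicitly). The expand_pages function is a hypothetical helper written for illustration only; it is not part of opendataloader-pdf and may not match the tool's own parser.

```shell
#!/usr/bin/env bash

# expand_pages: hypothetical helper, NOT part of opendataloader-pdf.
# Prints the page numbers selected by a spec such as "1,3,5-7",
# assuming ranges are inclusive on both ends.
expand_pages() {
  local spec="$1" part i
  local -a parts out=()

  # Split the spec on commas into individual entries.
  IFS=',' read -ra parts <<< "$spec"

  for part in "${parts[@]}"; do
    if [[ "$part" == *-* ]]; then
      # A range "A-B": emit every page from A through B.
      for ((i = ${part%-*}; i <= ${part#*-}; i++)); do
        out+=("$i")
      done
    else
      # A single page number.
      out+=("$part")
    fi
  done

  echo "${out[*]}"
}

expand_pages "1,3,5-7"   # prints: 1 3 5 6 7
```

Under that assumption, passing --pages "1,3,5-7" would select five pages: 1, 3, 5, 6, and 7.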
Examples
Basic conversion
opendataloader-pdf document.pdf -o ./output -f json,markdown
Convert entire folder
opendataloader-pdf ./pdf-folder -o ./output -f json
Save images as external files
opendataloader-pdf document.pdf -f markdown --image-output external
Disable reading order sorting
opendataloader-pdf document.pdf -f json --reading-order off
Add page separators in output
opendataloader-pdf document.pdf -f markdown --markdown-page-separator "--- Page %page-number% ---"
Encrypted PDF
opendataloader-pdf encrypted.pdf -p mypassword -o ./output
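Hybrid backend with a local server
The hybrid options in the table are meant to be combined, so a sketch of one possible invocation may help. It uses only flags listed above. The server URL (http://localhost:5002, derived from the quick-start port), the 60000 ms timeout, and the file name are assumptions chosen for illustration. The script prints the assembled command and runs it only if the CLI is on PATH and the input file exists.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed values for illustration; adjust to your environment.
# The server is expected to be started separately, e.g. with the quick start
# from the options table:
#   pip install "opendataloader-pdf[hybrid]" && opendataloader-pdf-hybrid --port 5002
HYBRID_URL="http://localhost:5002"   # assumption: server running locally on port 5002
INPUT="document.pdf"                 # assumption: example input file

# Collect arguments in an array so each value stays a single word.
args=(
  "$INPUT"
  --format markdown
  --hybrid docling-fast        # select the hybrid backend
  --hybrid-url "$HYBRID_URL"   # override the default server URL
  --hybrid-timeout 60000       # request timeout in milliseconds (illustrative value)
  --hybrid-fallback            # opt in to Java fallback on backend error
)

# Show the command that would be executed.
printf '%s\n' "opendataloader-pdf ${args[*]}"

# Run only when the CLI is installed and the input file is present.
if command -v opendataloader-pdf >/dev/null 2>&1 && [ -f "$INPUT" ]; then
  opendataloader-pdf "${args[@]}"
fi
```

Keeping the flags in an array makes it easy to comment each one and to toggle a line, such as --hybrid-fallback, without touching the rest of the command.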