Hybrid mode combines the speed of local Java processing with the accuracy of AI backends. Instead of sending every page to an AI service, OpenDataLoader intelligently routes only complex pages (tables, OCR) to the backend while processing simple text pages locally.
Results: Table accuracy rises from 0.489 to 0.928 (about +90% relative), with an acceptable speed trade-off.
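The +90% figure is a relative gain over the local-only score, not a difference in percentage points. A quick check of the arithmetic, using a helper name (`relative_gain`) that is illustrative only and not part of OpenDataLoader:

```python
def relative_gain(before: float, after: float) -> float:
    """Return the relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# Table accuracy scores quoted above: local-only vs. hybrid.
gain = relative_gain(0.489, 0.928)
print(f"{gain:.1f}%")  # 89.8%
```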
    # Batch all files in one call: each invocation spawns a JVM process, so repeated calls are slow
    opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/
    import opendataloader_pdf

    # Batch all files in one call: each convert() spawns a JVM process, so repeated calls are slow
    opendataloader_pdf.convert(
        input_path=["file1.pdf", "file2.pdf", "folder/"],
        output_dir="output/",
        hybrid="docling-fast",  # Routes complex pages to AI backend
    )
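Because each call starts a JVM, it can help to assemble the full input list first and call convert() once. The sketch below only prepares that list; `build_batch` is a hypothetical helper, not part of the opendataloader_pdf package, and the final call is left as a comment so the snippet runs without the package installed.

```python
import os
from typing import Iterable, List


def build_batch(paths: Iterable[str]) -> List[str]:
    """Normalize paths and drop duplicates, preserving first-seen order."""
    normalized = (os.path.normpath(p) for p in paths)
    return list(dict.fromkeys(normalized))


# "file1.pdf" is listed twice on purpose; it is kept only once.
batch = build_batch(["file1.pdf", "file2.pdf", "folder/", "file1.pdf"])

# One call for the whole batch instead of one call per file:
# opendataloader_pdf.convert(input_path=batch, output_dir="output/", hybrid="docling-fast")
```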
    import opendataloader_pdf

    # Batch all files in one call: each convert() spawns a JVM process, so repeated calls are slow
    opendataloader_pdf.convert(
        input_path=["file1.pdf", "file2.pdf", "folder/"],
        output_dir="output/",
        hybrid="docling-fast",
        hybrid_url="http://localhost:5002",  # Custom backend URL
        hybrid_timeout="60000",              # 60 second timeout (milliseconds)
        hybrid_fallback=True,                # Opt in to Java fallback on error
    )
    # Batch all files in one call: each invocation spawns a JVM process, so repeated calls are slow
    opendataloader-pdf \
      --hybrid docling-fast \
      --hybrid-url http://localhost:5002 \
      --hybrid-timeout 60000 \
      --hybrid-fallback \
      file1.pdf file2.pdf folder/
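The routing and opt-in fallback described above can be pictured roughly as follows. This is a conceptual sketch only, not OpenDataLoader's implementation: the `complex` key, the `process_page` function, and both engine callables are hypothetical stand-ins, and the real routing criteria are internal to the tool.

```python
from typing import Callable


def process_page(
    page: dict,
    backend: Callable[[dict], str],
    local: Callable[[dict], str],
    fallback: bool = False,
) -> str:
    """Route one page: simple pages stay local, complex pages go to the backend."""
    if not page.get("complex", False):  # hypothetical flag for tables/OCR pages
        return local(page)
    try:
        return backend(page)
    except Exception:
        if fallback:
            # Opt-in behaviour: reprocess locally instead of failing.
            return local(page)
        raise


def failing_backend(page: dict) -> str:
    """Stand-in for a backend that is unreachable or too slow."""
    raise TimeoutError("backend did not answer in time")


def local_engine(page: dict) -> str:
    """Stand-in for local processing."""
    return f"local:{page['id']}"


result = process_page({"id": 7, "complex": True}, failing_backend, local_engine, fallback=True)
print(result)  # local:7
```

Without `fallback=True`, the same call re-raises the backend error, which matches the idea that fallback is something you opt in to.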
    # Batch all files in one call: each invocation spawns a JVM process, so repeated calls are slow
    opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/
For non-English documents, specify the OCR language. The default engine is EasyOCR, which uses ISO 639-1 codes:
Note for Arabic and other RTL scripts: With the default EasyOCR engine, character recognition uses EasyOCR's ar model. The current reading order algorithm processes text based on coordinates and does not perform RTL shaping or visual reordering, so text strings may appear in visual order rather than logical order. This limitation applies to all right-to-left scripts.
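Since RTL strings may come out in visual order, one option is to flag extracted text that contains right-to-left characters so it can be reviewed or post-processed. The sketch below only detects such text using the standard library; it does not reorder or reshape anything, and `contains_rtl` is an illustrative name rather than an OpenDataLoader function.

```python
import unicodedata

# Bidi classes: "R" (e.g. Hebrew letters), "AL" (Arabic letters), "AN" (Arabic-Indic digits).
RTL_CLASSES = {"R", "AL", "AN"}


def contains_rtl(text: str) -> bool:
    """Return True if any character in `text` has a right-to-left bidi class."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


print(contains_rtl("Hello"))  # False
print(contains_rtl("مرحبا"))  # True
```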
| Engine | Prerequisites | Notes |
|---|---|---|
| tesseract | tesseract binary on PATH + tessdata for each language | CLI bridge. Honors --psm |
| tesserocr | tesserocr Python package + tesseract tessdata | Tesseract via Python bindings. Honors --psm |
| rapidocr | rapidocr and onnxruntime Python packages (pip install rapidocr onnxruntime) | ONNX-based engine |
| ocrmac | ocrmac Python package; macOS only | Apple Vision framework |
| auto | None | Delegates engine selection to docling |
Each engine has its own license, language coverage, and accuracy characteristics; refer to the engine's own documentation. This server does not validate engine accuracy.
Prerequisite check: The server validates the selected engine at startup. If the binary or Python package is missing, it exits with code 2 and a message naming what to install — for example, "OCR engine 'tesseract' selected but the 'tesseract' binary was not found on PATH".
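If you want to run a similar check before launching the server, a rough preflight can be written with the standard library. This is a sketch under assumptions, not the server's code: the import names are assumed to match the package names listed above, the second message text is made up for illustration, and tessdata files are not checked.

```python
import importlib.util
import shutil
from typing import Optional

# Engine -> (binary expected on PATH, importable Python modules).
# Import names are assumed to equal the package names.
REQUIREMENTS = {
    "tesseract": ("tesseract", []),
    "tesserocr": (None, ["tesserocr"]),
    "rapidocr": (None, ["rapidocr", "onnxruntime"]),
    "ocrmac": (None, ["ocrmac"]),
}


def missing_prerequisite(engine: str) -> Optional[str]:
    """Return a message naming what is missing for `engine`, or None if nothing is."""
    binary, modules = REQUIREMENTS.get(engine, (None, []))
    if binary and shutil.which(binary) is None:
        return f"OCR engine '{engine}' selected but the '{binary}' binary was not found on PATH"
    for module in modules:
        if importlib.util.find_spec(module) is None:
            # Hypothetical wording; the server's actual message may differ.
            return f"OCR engine '{engine}' selected but the '{module}' Python package is not installed"
    return None


message = missing_prerequisite("tesseract")
if message:
    print(message)  # a launcher script could then exit with code 2
```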
When the input PDFs already contain reliable embedded text, OCR can re-extract text from images such as charts, diagrams, or screenshots, producing duplicate fragments. Use --no-ocr to skip OCR entirely:
opendataloader-pdf-hybrid --port 5002 --no-ocr
--no-ocr and --force-ocr are mutually exclusive. When --no-ocr is combined with --ocr-engine, --ocr-lang, or --psm, the server logs a warning naming the inert flags.
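For readers wiring similar flags into their own wrapper scripts, the two rules above map naturally onto `argparse`. The following is a minimal sketch, not the server's source: the parser, the `inert_flags` helper, and the warning text are illustrative assumptions, and only the flag names come from this page.

```python
import argparse
import logging
from typing import List, Optional

log = logging.getLogger("hybrid-server-sketch")


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="hybrid-server-sketch")
    parser.add_argument("--ocr-engine")
    parser.add_argument("--ocr-lang")
    parser.add_argument("--psm")
    # argparse rejects --no-ocr combined with --force-ocr (exit code 2).
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--no-ocr", action="store_true")
    mode.add_argument("--force-ocr", action="store_true")
    return parser.parse_args(argv)


def inert_flags(args: argparse.Namespace) -> List[str]:
    """List OCR flags that have no effect because --no-ocr was given."""
    if not args.no_ocr:
        return []
    candidates = {
        "--ocr-engine": args.ocr_engine,
        "--ocr-lang": args.ocr_lang,
        "--psm": args.psm,
    }
    return [flag for flag, value in candidates.items() if value is not None]


args = parse_args(["--no-ocr", "--ocr-lang", "ko,en", "--psm", "6"])
flags = inert_flags(args)
if flags:
    # A single warning that names every inert flag.
    log.warning("--no-ocr is set; these flags have no effect: %s", ", ".join(flags))
```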
    import opendataloader_pdf

    # Batch all files in one call: each convert() spawns a JVM process, so repeated calls are slow
    opendataloader_pdf.convert(
        input_path=["file1.pdf", "file2.pdf", "folder/"],
        output_dir="output/",
        hybrid="docling-fast",
    )
Start the backend server with --force-ocr before running the Python conversion.
Note: Standard digital PDFs do not need --force-ocr. Use it only for scanned or image-based PDFs where text cannot be selected.
Timeout: OCR is CPU-intensive. By default there is no timeout, but you can set one explicitly (the value is in milliseconds):

    # Batch all files in one call: each invocation spawns a JVM process, so repeated calls are slow
    opendataloader-pdf --hybrid docling-fast --hybrid-timeout 120000 file1.pdf file2.pdf folder/
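A Python equivalent might look like the sketch below. It assumes `hybrid_timeout` takes a millisecond string, as in the earlier `hybrid_timeout="60000"` example; `timeout_ms` is a hypothetical convenience helper, and the convert() call is commented out so the snippet runs without the package installed.

```python
def timeout_ms(seconds: float) -> str:
    """Convert seconds to the millisecond string form used for hybrid_timeout."""
    return str(int(round(seconds * 1000)))


kwargs = {
    "input_path": ["file1.pdf", "file2.pdf", "folder/"],
    "output_dir": "output/",
    "hybrid": "docling-fast",
    "hybrid_timeout": timeout_ms(120),  # "120000", the same value as the CLI example
}

# import opendataloader_pdf
# opendataloader_pdf.convert(**kwargs)
```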
Generate AI-powered natural language descriptions for images and charts in your PDFs. This makes visual content searchable in RAG pipelines and produces alt text for accessibility.
Important: Picture description requires --hybrid-mode full on the client side. Without it, the enrichment runs on the backend but the descriptions are not included in the output.
    # Batch all files in one call: each invocation spawns a JVM process, so repeated calls are slow
    opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/

    import opendataloader_pdf

    # Batch all files in one call: each convert() spawns a JVM process, so repeated calls are slow
    opendataloader_pdf.convert(
        input_path=["file1.pdf", "file2.pdf", "folder/"],
        output_dir="output/",
        hybrid="docling-fast",
        hybrid_mode="full",  # Required for picture description
    )
Start the backend server with --enrich-picture-description before running.
The description appears in the JSON output under "description" and as an italic caption in Markdown:
    {
      "type": "picture",
      "page number": 1,
      "bounding box": [72.0, 400.0, 540.0, 650.0],
      "description": "A bar chart showing waste generation by region from 2016 to 2030..."
    }
*A bar chart showing waste generation by region from 2016 to 2030...*
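To feed these descriptions into a RAG index, you can collect them from the JSON output. The sketch below walks any nested structure and picks up picture elements, so it does not depend on how the surrounding document is laid out. The `elements` wrapper and the paragraph entry in the sample are hypothetical; only the picture object mirrors the example above.

```python
import json
from typing import Any, Iterator


def iter_picture_descriptions(node: Any) -> Iterator[str]:
    """Yield the "description" of every picture element found anywhere in `node`."""
    if isinstance(node, dict):
        if node.get("type") == "picture" and "description" in node:
            yield node["description"]
        for value in node.values():
            yield from iter_picture_descriptions(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_picture_descriptions(item)


# Hypothetical wrapper around the picture element shown above.
document = json.loads("""
{"elements": [
  {"type": "paragraph", "content": "Intro"},
  {"type": "picture", "page number": 1,
   "bounding box": [72.0, 400.0, 540.0, 650.0],
   "description": "A bar chart showing waste generation by region from 2016 to 2030..."}
]}
""")

descriptions = list(iter_picture_descriptions(document))
print(len(descriptions))  # 1
```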
You can customize the prompt for specific document types:
    opendataloader-pdf-hybrid --enrich-picture-description \
      --picture-description-prompt "Describe this scientific figure in detail, including axis labels and data trends."
Note: Picture description uses SmolVLM (256M), a lightweight vision model. Results are suitable for general context but may not capture precise data values from complex charts. The model is English-centric — prompts asking for non-English output (e.g., "Describe the image in Korean.") will not produce coherent translations and are not recommended.
The --ocr-lang code system varies by engine. If omitted, each engine uses its own default languages.
| Engine | Code system | Example |
|---|---|---|
| easyocr | ISO 639-1 | ko,en |
| tesseract / tesserocr | ISO 639-2 | kor,eng |
| rapidocr | Plain English names | english,chinese |
| ocrmac | BCP-47 | en-US |
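Because the code system changes with the engine, it can be convenient to build the server arguments from one place. The sketch below is illustrative: `ocr_lang_args` and `EXAMPLE_CODES` are hypothetical names, the codes are the examples listed above, and the comma-separated value format is inferred from those examples.

```python
import shlex
from typing import List, Optional

# Example language codes per engine, copied from the examples above.
EXAMPLE_CODES = {
    "easyocr": ["ko", "en"],               # ISO 639-1
    "tesseract": ["kor", "eng"],           # ISO 639-2
    "tesserocr": ["kor", "eng"],           # ISO 639-2
    "rapidocr": ["english", "chinese"],    # plain English names
    "ocrmac": ["en-US"],                   # BCP-47
}


def ocr_lang_args(engine: str, codes: Optional[List[str]] = None) -> List[str]:
    """Build OCR arguments; omitting codes leaves the engine's default languages."""
    args = ["--ocr-engine", engine]
    if codes:
        args += ["--ocr-lang", ",".join(codes)]
    return args


tesseract_args = ocr_lang_args("tesseract", EXAMPLE_CODES["tesseract"])
print(shlex.join(["opendataloader-pdf-hybrid", "--port", "5002", *tesseract_args]))
```

Keeping the codes next to the engine name makes it harder to pass, for example, an ISO 639-1 code such as `ko` to Tesseract, which expects `kor`.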
Engine availability check: At startup, the server probes whether the selected engine's binary or Python package is installed. If not, it exits with code 2 and a message naming the missing prerequisite — for example, Tesseract requires the tesseract binary on PATH.
Inert flag warning: When --no-ocr is combined with OCR-related flags (--ocr-engine, --ocr-lang, --psm), the server logs a single warning naming the inert flags rather than silently dropping them.