Hybrid Mode

Route complex PDF pages to AI backends for OCR, formula extraction, and chart description while keeping simple pages fast and local.

Overview

Hybrid mode combines the speed of local Java processing with the accuracy of AI backends. Instead of sending every page to an AI service, OpenDataLoader intelligently routes only complex pages (tables, OCR) to the backend while processing simple text pages locally.

Results: Table accuracy (TEDS) rises from 0.489 to 0.928 (+90%), at the cost of roughly 31x slower processing (0.015 s/doc vs. 0.463 s/doc).

| Metric | Java-only | Hybrid | Improvement |
|---|---|---|---|
| Table accuracy (TEDS) | 0.489 | 0.928 | +90% |
| Heading accuracy (MHS) | 0.739 | 0.821 | +11% |
| Reading order (NID) | 0.902 | 0.934 | +4% |
| Speed | 0.015 s/doc | 0.463 s/doc | 31x slower |
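The Improvement column follows directly from the two measured columns. As a quick check, the snippet below recomputes it from the figures in the table above; the helper name is illustrative only.

```python
def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100.0

# Figures copied from the table above: (Java-only, Hybrid).
metrics = {
    "Table accuracy (TEDS)": (0.489, 0.928),
    "Heading accuracy (MHS)": (0.739, 0.821),
    "Reading order (NID)": (0.902, 0.934),
}

for name, (java_only, hybrid) in metrics.items():
    print(f"{name}: {relative_change(java_only, hybrid):+.0f}%")

# Speed is reported as a slowdown factor rather than a percentage.
slowdown = 0.463 / 0.015
print(f"Speed: {slowdown:.0f}x slower")
```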

Installation

pip install -U "opendataloader-pdf[hybrid]"

This installs the hybrid dependencies including docling and the backend server.

System Requirements

| Resource | Requirement |
|---|---|
| RAM | ~2–4 GB for the backend server (docling models are loaded into memory) |
| Disk | ~1–2 GB for model downloads (cached after first run) |
| GPU | Optional. CPU-only works fine; a GPU accelerates OCR and table detection |
| Port | Default 5002 (configurable with --port). Ensure it is not blocked by a firewall |

Quick Start

CLI

Start the backend server (first terminal)

opendataloader-pdf-hybrid --port 5002

Process PDFs with hybrid mode (second terminal)

# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/

Python

import opendataloader_pdf

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    hybrid="docling-fast"           # Routes complex pages to AI backend
)
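When the files to process are selected by a pattern rather than a whole folder, the list can be built first and passed to a single convert() call. The sketch below is one way to do that; `collect_pdfs` and `convert_in_one_batch` are hypothetical helper names, and only the `convert()` arguments shown above are used.

```python
from pathlib import Path
from typing import List

def collect_pdfs(root: str, pattern: str = "*.pdf") -> List[str]:
    """Return sorted paths of matching files under `root`, searched recursively."""
    return sorted(str(path) for path in Path(root).rglob(pattern) if path.is_file())

def convert_in_one_batch(root: str, output_dir: str) -> None:
    # Imported here so collect_pdfs() stays usable without the package installed.
    import opendataloader_pdf

    pdfs = collect_pdfs(root)
    if pdfs:
        # One convert() call for the whole list instead of one call per file.
        opendataloader_pdf.convert(
            input_path=pdfs,
            output_dir=output_dir,
            hybrid="docling-fast",
        )
```

Calling `convert_in_one_batch("folder/", "output/")` requires the backend server from the CLI step to be running.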

How It Works

PDF Input
    │
    ▼
┌─────────────────────────────────────┐
│         Triage Processor            │
│   Analyzes each page complexity     │
└─────────────────────────────────────┘
    │                    │
    ▼                    ▼
┌─────────────┐    ┌─────────────────┐
│  JAVA Path  │    │  BACKEND Path   │
│  (0.015s)   │    │  (AI processing)│
│  Simple     │    │  Complex tables │
│  text pages │    │  OCR pages      │
└─────────────┘    └─────────────────┘
    │                    │
    └────────┬───────────┘
             ▼
┌─────────────────────────────────────┐
│         Result Merger               │
│    Combines results by page order   │
└─────────────────────────────────────┘
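The flow in the diagram can be sketched in a few lines. This is a minimal illustration, not OpenDataLoader's implementation: `is_complex`, `process_locally`, and `process_on_backend` are hypothetical callables supplied by the caller.

```python
from typing import Callable, Dict, List

def hybrid_process(
    pages: List[int],
    is_complex: Callable[[int], bool],
    process_locally: Callable[[int], str],
    process_on_backend: Callable[[int], str],
) -> List[str]:
    """Route each page to one path, then merge the results in page order."""
    results: Dict[int, str] = {}
    for page in pages:
        handler = process_on_backend if is_complex(page) else process_locally
        results[page] = handler(page)
    # Merge step: order by page number, regardless of processing order.
    return [results[page] for page in sorted(results)]

merged = hybrid_process(
    pages=[3, 1, 2],
    is_complex=lambda page: page == 2,           # pretend page 2 has a table
    process_locally=lambda page: f"java:{page}",
    process_on_backend=lambda page: f"backend:{page}",
)
print(merged)  # ['java:1', 'backend:2', 'java:3']
```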

Triage Strategy

The triage processor uses a conservative strategy: it routes uncertain pages to the backend to minimize missed tables (false negatives). This means:

  • Simple text pages → Fast Java path
  • Pages with tables → Backend path
  • Uncertain pages → Backend path (better safe than sorry)
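The routing rule above can be expressed as a single threshold check. This page does not describe how page complexity is measured, so the `table_likelihood` score and the `simple_below` threshold are assumptions made purely for illustration.

```python
from enum import Enum

class Route(Enum):
    JAVA = "java"
    BACKEND = "backend"

def triage(table_likelihood: float, simple_below: float = 0.2) -> Route:
    """Conservative routing: only clearly simple pages stay on the Java path.

    `table_likelihood` is a hypothetical score in [0, 1]. Anything that is
    not confidently simple, including uncertain mid-range scores, goes to
    the backend, trading speed for fewer missed tables.
    """
    if not 0.0 <= table_likelihood <= 1.0:
        raise ValueError("table_likelihood must be between 0 and 1")
    return Route.JAVA if table_likelihood < simple_below else Route.BACKEND

print(triage(0.05))  # Route.JAVA
print(triage(0.50))  # Route.BACKEND (uncertain, so sent to the backend)
print(triage(0.95))  # Route.BACKEND
```

Lowering `simple_below` in this sketch sends more pages to the backend, which mirrors the trade-off between missed tables and speed.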

Configuration Options

| Option | Type | Default | Description |
|---|---|---|---|
| hybrid | string | "off" | Backend name: off, docling-fast |
| hybrid_url | string | auto | Backend server URL |
| hybrid_timeout | string | "0" | Request timeout in milliseconds (0 = no timeout) |
| hybrid_fallback | bool | false | Fall back to Java on backend error |

Python Options

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    hybrid="docling-fast",
    hybrid_url="http://localhost:5002",  # Custom backend URL
    hybrid_timeout="60000",               # 60 second timeout
    hybrid_fallback=True                  # Opt in to Java fallback on error
)
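Because hybrid_timeout is passed as a string of milliseconds, a small conversion helper can prevent unit mistakes. The helper below is a sketch and not part of the library; its name is made up.

```python
def timeout_ms(seconds: float) -> str:
    """Convert seconds to the millisecond string expected by hybrid_timeout.

    Per the options table, "0" means no timeout.
    """
    if seconds < 0:
        raise ValueError("timeout must not be negative")
    return str(int(round(seconds * 1000)))

print(timeout_ms(60))  # prints 60000, the value used in the example above
print(timeout_ms(0))   # prints 0, which disables the timeout
```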

CLI Options

# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf \
    --hybrid docling-fast \
    --hybrid-url http://localhost:5002 \
    --hybrid-timeout 60000 \
    --hybrid-fallback \
    file1.pdf file2.pdf folder/

Supported Backends

| Backend | Status | Description |
|---|---|---|
| off | Default | Java-only, no external calls |
| docling-fast | Available | Docling-serve backend (local) |
| hancom | Planned | Hancom Document AI |
| azure | Planned | Azure Document Intelligence |
| google | Planned | Google Document AI |

Privacy & Security

Hybrid mode is designed with privacy in mind:

  • Local-first: Simple pages never leave your machine
  • On-premise backend: Run docling-serve locally
  • Optional fallback: With hybrid_fallback enabled, processing continues with Java-only if the backend is unavailable (fallback is off by default)
  • No cloud dependency: Default configuration requires no external services

When to Use Hybrid Mode

| Use Case | Recommendation |
|---|---|
| High-volume simple documents | Java-only (faster) |
| Documents with complex tables | Hybrid mode |
| OCR-heavy scanned documents | Hybrid mode |
| Maximum speed priority | Java-only |
| Maximum accuracy priority | Hybrid mode |
| Air-gapped environments | Hybrid with local backend (pre-install dependencies while online) |

Scanned PDFs (OCR)

For image-based or scanned PDFs that contain no selectable text, enable OCR on the hybrid backend with --force-ocr.

CLI

Terminal 1: Start backend with OCR enabled

opendataloader-pdf-hybrid --port 5002 --force-ocr

Terminal 2: Process scanned PDF

# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/

For non-English documents, specify the OCR language. The default engine is EasyOCR, which uses ISO 639-1 codes:

opendataloader-pdf-hybrid --port 5002 --force-ocr --ocr-lang "ko,en"

Multiple languages can be combined with commas. For the full list of supported codes, see the EasyOCR documentation.

For Arabic documents:

opendataloader-pdf-hybrid --port 5002 --force-ocr --ocr-lang "ar,en"

Note for Arabic and other RTL scripts: With the default EasyOCR engine, character recognition uses EasyOCR's ar model. The current reading order algorithm processes text based on coordinates and does not perform RTL shaping or visual reordering, so text strings may appear in visual order rather than logical order. This limitation applies to all right-to-left scripts.

Choosing an OCR Engine

If the default EasyOCR engine does not support your language, switch engines. For example, requesting Malayalam from EasyOCR fails with the error ({'ml'}, 'is not supported'):

# Malayalam — use Tesseract with the matching tessdata
opendataloader-pdf-hybrid --port 5002 --force-ocr \
    --ocr-engine tesseract --ocr-lang "mal"

Each engine uses its own language code system — see the OCR Language Codes by Engine table in Server Options.

| Engine | Prerequisite | Notes |
|---|---|---|
| easyocr | Installed by opendataloader-pdf[hybrid] | Default. Pure Python |
| tesseract | tesseract binary on PATH + tessdata for each language | CLI bridge. Honors --psm |
| tesserocr | tesserocr Python package + tesseract tessdata | Tesseract via Python bindings. Honors --psm |
| rapidocr | rapidocr and onnxruntime Python packages (pip install rapidocr onnxruntime) | ONNX-based engine |
| ocrmac | ocrmac Python package; macOS only | Apple Vision framework |
| auto | | Delegates engine selection to docling |

Each engine has its own license, language coverage, and accuracy characteristics; refer to the engine's own documentation. This server does not validate engine accuracy.

Prerequisite check: The server validates the selected engine at startup. If the binary or Python package is missing, it exits with code 2 and a message naming what to install — for example, "OCR engine 'tesseract' selected but the 'tesseract' binary was not found on PATH".
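A similar check can be run by hand before starting the server. The sketch below is not the server's code: the prerequisite mapping is transcribed from the table above, the importable module names are assumed to match the package names, and the function name and message wording are made up.

```python
import importlib.util
import shutil
from typing import Optional

# Prerequisites per engine, transcribed from the table above.
# Each entry: (binaries required on PATH, importable Python modules).
# Assumption: each module is importable under its package name.
PREREQUISITES = {
    "easyocr": ((), ("easyocr",)),
    "tesseract": (("tesseract",), ()),
    "tesserocr": ((), ("tesserocr",)),
    "rapidocr": ((), ("rapidocr", "onnxruntime")),
    "ocrmac": ((), ("ocrmac",)),
}

def missing_prerequisite(engine: str) -> Optional[str]:
    """Return a message naming the first missing prerequisite, or None."""
    binaries, modules = PREREQUISITES[engine]
    for binary in binaries:
        if shutil.which(binary) is None:
            return f"binary '{binary}' was not found on PATH"
    for module in modules:
        if importlib.util.find_spec(module) is None:
            return f"Python package '{module}' is not installed"
    return None

problem = missing_prerequisite("tesseract")
print(problem or "tesseract prerequisites look satisfied")
```

This only checks that the binary or package can be found; it does not verify that tessdata for a given language is present.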

Disabling OCR

When the input PDFs already contain reliable embedded text, OCR can re-extract text from images such as charts, diagrams, or screenshots, producing duplicate fragments. Use --no-ocr to skip OCR entirely:

opendataloader-pdf-hybrid --port 5002 --no-ocr

--no-ocr and --force-ocr are mutually exclusive. When --no-ocr is combined with --ocr-engine, --ocr-lang, or --psm, the server logs a warning naming the inert flags.

Python

import opendataloader_pdf

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    hybrid="docling-fast"
)

Start the backend server with --force-ocr before running the Python conversion.

Note: Standard digital PDFs do not need --force-ocr. Use it only for scanned or image-based PDFs where text cannot be selected.

Timeout: OCR is CPU-intensive. By default there is no timeout, but you can set one explicitly:

# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf --hybrid docling-fast --hybrid-timeout 120000 file1.pdf file2.pdf folder/

Chart and Image Description

Generate AI-powered natural language descriptions for images and charts in your PDFs. This makes visual content searchable in RAG pipelines and produces alt text for accessibility.

Important: Picture description requires --hybrid-mode full on the client side. Without it, the enrichment runs on the backend but the descriptions are not included in the output.

CLI

Terminal 1: Start backend with picture description enabled

opendataloader-pdf-hybrid --port 5002 --enrich-picture-description

Terminal 2: Process with full backend mode

# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/

Python

import opendataloader_pdf

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    hybrid="docling-fast",
    hybrid_mode="full"              # Required for picture description
)

Start the backend server with --enrich-picture-description before running.

Output

The description appears in the JSON output under "description" and as an italic caption in Markdown:

{
  "type": "picture",
  "page number": 1,
  "bounding box": [72.0, 400.0, 540.0, 650.0],
  "description": "A bar chart showing waste generation by region from 2016 to 2030..."
}

![image 1](document_images/imageFile1.png)

*A bar chart showing waste generation by region from 2016 to 2030...*
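To use the descriptions downstream, for example to index them for search, the JSON output can be walked for picture elements. The sketch below assumes only the element shape shown above. How elements are nested in the full output file is not specified here, so it searches recursively, and the wrapping "kids" key in the sample is a made-up container.

```python
import json
from typing import Any, Iterator

def iter_picture_descriptions(node: Any) -> Iterator[dict]:
    """Yield every dict that has type "picture" and a "description" key."""
    if isinstance(node, dict):
        if node.get("type") == "picture" and "description" in node:
            yield node
        for value in node.values():
            yield from iter_picture_descriptions(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_picture_descriptions(item)

# Sample input for illustration; "kids" is a hypothetical container key.
sample = json.loads("""
{"kids": [
  {"type": "paragraph", "page number": 1},
  {"type": "picture", "page number": 1,
   "description": "A bar chart showing waste generation by region"}
]}
""")

for picture in iter_picture_descriptions(sample):
    print(picture["page number"], picture["description"])
```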

You can customize the prompt for specific document types:

opendataloader-pdf-hybrid --enrich-picture-description \
  --picture-description-prompt "Describe this scientific figure in detail, including axis labels and data trends."

Note: Picture description uses SmolVLM (256M), a lightweight vision model. Results are suitable for general context but may not capture precise data values from complex charts. The model is English-centric — prompts asking for non-English output (e.g., "Describe the image in Korean.") will not produce coherent translations and are not recommended.

Server Options

| Option | Description |
|---|---|
| --port PORT | Server port (default: 5002) |
| --host HOST | Bind address (default: 0.0.0.0) |
| --force-ocr | Force full-page OCR on all pages. Mutually exclusive with --no-ocr |
| --no-ocr | Disable OCR entirely. Use when input PDFs already have reliable embedded text. Mutually exclusive with --force-ocr |
| --ocr-engine ENGINE | OCR engine: auto, easyocr, ocrmac, rapidocr, tesseract, tesserocr (default: easyocr). The exact list is derived from the installed docling version |
| --ocr-lang LANG | OCR languages, comma-separated. Code system depends on --ocr-engine (see below) |
| --psm INT | Tesseract Page Segmentation Mode. Applied only when --ocr-engine is tesseract or tesserocr; ignored otherwise. See tesseract --help-extra |
| --enrich-formula | Enable formula enrichment (LaTeX extraction) |
| --no-enrich-formula | Disable formula enrichment |
| --enrich-picture-description | Enable picture description (alt text generation) |
| --no-enrich-picture-description | Disable picture description |
| --picture-description-prompt TEXT | Custom prompt for picture description |
| --device DEVICE | Accelerator device: auto, cpu, cuda, mps (Apple Silicon), xpu (Intel GPU). Default: auto |
| --max-file-size MB | Maximum upload file size in MB. 0 means no limit (default: 0) |
| --log-level LEVEL | Log level: debug, info, warning, error |

OCR Language Codes by Engine

The --ocr-lang code system varies by engine. If omitted, each engine uses its own default languages.

| Engine | Code system | Example |
|---|---|---|
| easyocr | ISO 639-1 | ko,en |
| tesseract / tesserocr | ISO 639-2 | kor,eng |
| rapidocr | Plain English names | english,chinese |
| ocrmac | BCP-47 | en-US |
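Since the same language is spelled differently per engine, a lookup table can keep commands consistent. The entries below come only from the examples on this page (Korean, English, Arabic, Malayalam). The helper name is hypothetical, and other languages would need to be added from each engine's own documentation.

```python
from typing import List

# Codes taken from the examples on this page; extend from each engine's docs.
LANGUAGE_CODES = {
    "easyocr": {"korean": "ko", "english": "en", "arabic": "ar"},
    "tesseract": {"korean": "kor", "english": "eng", "malayalam": "mal"},
    "tesserocr": {"korean": "kor", "english": "eng", "malayalam": "mal"},
}

def ocr_lang_argument(engine: str, languages: List[str]) -> str:
    """Build the comma-separated value for --ocr-lang for one engine."""
    if engine not in LANGUAGE_CODES:
        raise ValueError(f"no code table for engine {engine!r}")
    codes = LANGUAGE_CODES[engine]
    missing = [name for name in languages if name not in codes]
    if missing:
        raise ValueError(f"{engine} has no known code for: {', '.join(missing)}")
    return ",".join(codes[name] for name in languages)

print(ocr_lang_argument("easyocr", ["korean", "english"]))    # ko,en
print(ocr_lang_argument("tesseract", ["korean", "english"]))  # kor,eng
```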

Engine availability check: At startup, the server probes whether the selected engine's binary or Python package is installed. If not, it exits with code 2 and a message naming the missing prerequisite — for example, Tesseract requires the tesseract binary on PATH.

Inert flag warning: When --no-ocr is combined with OCR-related flags (--ocr-engine, --ocr-lang, --psm), the server logs a single warning naming the inert flags rather than silently dropping them.
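The two flag rules, mutual exclusion and the inert-flag warning, can be modeled with argparse. This is an illustrative sketch rather than the server's actual parser: only the flag names are taken from this page, and the program name, function names, and warning text are assumptions.

```python
import argparse
import logging
from typing import List, Optional

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ocr-flags-sketch")
    # argparse rejects the pair if both flags are given.
    ocr_mode = parser.add_mutually_exclusive_group()
    ocr_mode.add_argument("--force-ocr", action="store_true")
    ocr_mode.add_argument("--no-ocr", action="store_true")
    parser.add_argument("--ocr-engine")
    parser.add_argument("--ocr-lang")
    parser.add_argument("--psm", type=int)
    return parser

def inert_flags(args: argparse.Namespace) -> List[str]:
    """Return OCR flags that have no effect because --no-ocr is set."""
    if not args.no_ocr:
        return []
    candidates = {
        "--ocr-engine": args.ocr_engine,
        "--ocr-lang": args.ocr_lang,
        "--psm": args.psm,
    }
    return [flag for flag, value in candidates.items() if value is not None]

def parse(argv: Optional[List[str]] = None) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    ignored = inert_flags(args)
    if ignored:
        # One warning that names every ignored flag.
        logging.warning("--no-ocr is set; ignoring %s", ", ".join(ignored))
    return args
```

In this sketch, `parse(["--no-ocr", "--force-ocr"])` makes argparse print a usage error and exit with status 2, while `parse(["--no-ocr", "--psm", "6"])` succeeds and logs one warning.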

Troubleshooting

Backend Connection Failed

Error: Could not connect to hybrid backend at http://localhost:5002

Solution: Start the backend server first:

opendataloader-pdf-hybrid
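Before starting a long batch, it can help to confirm that something is listening on the backend port. The sketch below uses only a TCP connect, because this page does not document a health endpoint. The function name is made up, and the host and port defaults mirror the values above.

```python
import socket

def backend_is_listening(host: str = "localhost", port: int = 5002,
                         timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if not backend_is_listening():
    print("Nothing is listening on localhost:5002; "
          "start opendataloader-pdf-hybrid first.")
```

A successful connect only shows that the port is open; it does not prove the process behind it is the hybrid backend.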

Slow Processing

If hybrid mode is slower than expected:

  1. Check if the backend server is healthy
  2. Consider increasing hybrid_timeout for large documents
  3. Ensure the backend has sufficient resources (RAM, CPU)

Fallback Activated

Warning: Hybrid backend unavailable, falling back to Java processing

This is expected behavior when fallback is enabled (hybrid_fallback=True in Python, --hybrid-fallback on the CLI). The document is still processed, but without AI-enhanced table extraction.
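The opt-in nature of the fallback can be summarized in a few lines. This sketch is not the library's code: `run_backend` and `run_java` are hypothetical callables standing in for the two processing paths, and treating the failure as a `ConnectionError` is an assumption.

```python
import logging
from typing import Callable

def process_with_fallback(
    run_backend: Callable[[], str],
    run_java: Callable[[], str],
    hybrid_fallback: bool = False,
) -> str:
    """Try the backend; on a connection error, fall back only if opted in."""
    try:
        return run_backend()
    except ConnectionError:
        if not hybrid_fallback:
            raise  # Default behavior: surface the backend error.
        logging.warning(
            "Hybrid backend unavailable, falling back to Java processing"
        )
        return run_java()
```

With `hybrid_fallback=False`, the default, the error propagates to the caller instead of silently producing lower-accuracy output.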
