Hybrid Mode

Route complex PDF pages to AI backends for OCR, formula extraction, and chart description while keeping simple pages fast and local.

Overview

Hybrid mode combines the speed of local Java processing with the accuracy of AI backends. Instead of sending every page to an AI service, OpenDataLoader intelligently routes only complex pages (tables, OCR) to the backend while processing simple text pages locally.

Results: Table accuracy rises from 0.489 to 0.928 (+90%), at the cost of roughly 31x slower processing (0.015 s/doc vs. 0.463 s/doc).

Metric	Java-only	Hybrid	Improvement
Table accuracy (TEDS)	0.489	0.928	+90%
Heading accuracy (MHS)	0.739	0.821	+11%
Reading order (NID)	0.902	0.934	+4%
Speed	0.015s/doc	0.463s/doc	31x slower
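The Improvement column is a relative change against the Java-only score. The short check below recomputes it from the figures in the table; the helper name `relative_change` is illustrative and not part of the package.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a percentage."""
    return (after - before) / before * 100.0

# Scores copied from the table above: (Java-only, Hybrid)
scores = {
    "Table accuracy (TEDS)": (0.489, 0.928),
    "Heading accuracy (MHS)": (0.739, 0.821),
    "Reading order (NID)": (0.902, 0.934),
}

for metric, (java_only, hybrid) in scores.items():
    print(f"{metric}: {relative_change(java_only, hybrid):+.0f}%")

# Speed is reported in seconds per document, so the ratio is a slowdown factor.
slowdown = 0.463 / 0.015
print(f"Speed: {slowdown:.0f}x slower")
```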

Installation

pip install -U "opendataloader-pdf[hybrid]"

This installs the hybrid dependencies, including docling and the backend server.

System Requirements

Resource	Requirement
RAM	~2–4 GB for the backend server (docling models are loaded into memory)
Disk	~1–2 GB for model downloads (cached after first run)
GPU	Optional — CPU-only works fine; GPU accelerates OCR and table detection
Port	Default `5002` (configurable with `--port`). Ensure it is not blocked by a firewall

Quick Start

CLI

Start the backend server (first terminal)

opendataloader-pdf-hybrid --port 5002

Process PDFs with hybrid mode (second terminal)

# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/

Python

import opendataloader_pdf

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    hybrid="docling-fast"           # Routes complex pages to AI backend
)

How It Works

PDF Input
    │
    ▼
┌─────────────────────────────────────┐
│         Triage Processor            │
│   Analyzes each page complexity     │
└─────────────────────────────────────┘
    │                    │
    ▼                    ▼
┌─────────────┐    ┌─────────────────┐
│  JAVA Path  │    │  BACKEND Path   │
│  (0.015s)   │    │  (AI processing)│
│  Simple     │    │  Complex tables │
│  text pages │    │  OCR pages      │
└─────────────┘    └─────────────────┘
    │                    │
    └────────┬───────────┘
             ▼
┌─────────────────────────────────────┐
│         Result Merger               │
│    Combines results by page order   │
└─────────────────────────────────────┘

Triage Strategy

The triage processor uses a conservative strategy: it routes uncertain pages to the backend to minimize missed tables (false negatives). This means:

Simple text pages → Fast Java path
Pages with tables → Backend path
Uncertain pages → Backend path (better safe than sorry)
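The routing and merge steps can be pictured with a small sketch. It is not OpenDataLoader's implementation: the `PageSignal` fields, the `route` rule, and the two stand-in processors are hypothetical. It only models table detection (OCR routing is left out) and shows the conservative rule, where anything not clearly simple goes to the backend, followed by a merge in page order.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class PageSignal:
    """Hypothetical per-page triage result (not the real data model)."""
    page: int
    has_table: Optional[bool]  # None means the triage step could not decide

def route(signal: PageSignal) -> str:
    """Conservative rule: only pages known to be table-free stay local."""
    return "java" if signal.has_table is False else "backend"

def process(signals: List[PageSignal],
            java: Callable[[int], str],
            backend: Callable[[int], str]) -> List[str]:
    results: Dict[int, str] = {}
    for s in signals:
        handler = java if route(s) == "java" else backend
        results[s.page] = handler(s.page)
    # Merge step: emit results in page order regardless of which path ran.
    return [results[p] for p in sorted(results)]

signals = [PageSignal(2, True), PageSignal(1, False), PageSignal(3, None)]
merged = process(signals,
                 java=lambda p: f"java:{p}",
                 backend=lambda p: f"backend:{p}")
print(merged)  # ['java:1', 'backend:2', 'backend:3']
```

Treating "undecided" the same as "has a table" is what trades extra backend calls for fewer missed tables.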

Configuration Options

Option	Type	Default	Description
`hybrid`	string	`"off"`	Backend name: `off`, `docling-fast`
`hybrid_url`	string	auto	Backend server URL
`hybrid_timeout`	string	`"0"`	Request timeout in milliseconds (`"0"` = no timeout)
`hybrid_fallback`	bool	false	Fall back to Java processing on backend error

Python Options

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    hybrid="docling-fast",
    hybrid_url="http://localhost:5002",  # Custom backend URL
    hybrid_timeout="60000",               # 60-second timeout (value is in milliseconds)
    hybrid_fallback=True                  # Opt in to Java fallback on error
)

CLI Options

# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf \
    --hybrid docling-fast \
    --hybrid-url http://localhost:5002 \
    --hybrid-timeout 60000 \
    --hybrid-fallback \
    file1.pdf file2.pdf folder/
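For scripts that shell out to the CLI, the option table maps onto an argument list fairly directly. The helper below is a hypothetical convenience wrapper, not part of the package; it uses only the flags shown above and takes the timeout in milliseconds, as the CLI does.

```python
from typing import List, Optional, Sequence

def build_command(inputs: Sequence[str],
                  hybrid: str = "off",
                  hybrid_url: Optional[str] = None,
                  hybrid_timeout_ms: int = 0,
                  hybrid_fallback: bool = False) -> List[str]:
    """Assemble an opendataloader-pdf argument list from the options above."""
    cmd = ["opendataloader-pdf"]
    if hybrid != "off":
        cmd += ["--hybrid", hybrid]
        if hybrid_url:
            cmd += ["--hybrid-url", hybrid_url]
        if hybrid_timeout_ms > 0:  # 0 means "no timeout", so the flag is omitted
            cmd += ["--hybrid-timeout", str(hybrid_timeout_ms)]
        if hybrid_fallback:
            cmd.append("--hybrid-fallback")
    cmd += list(inputs)  # all inputs in one invocation, so only one JVM starts
    return cmd

cmd = build_command(["file1.pdf", "folder/"],
                    hybrid="docling-fast",
                    hybrid_url="http://localhost:5002",
                    hybrid_timeout_ms=60 * 1000,
                    hybrid_fallback=True)
print(" ".join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```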

Supported Backends

Backend	Status	Description
`off`	Default	Java-only, no external calls
`docling-fast`	Available	Docling-serve backend (local)
`hancom`	Planned	Hancom Document AI
`azure`	Planned	Azure Document Intelligence
`google`	Planned	Google Document AI

Privacy & Security

Hybrid mode is designed with privacy in mind:

Local-first: Simple pages never leave your machine
On-premise backend: Run docling-serve locally
Fallback (opt-in): With hybrid_fallback enabled, processing continues with Java-only if the backend is unavailable
No cloud dependency: Default configuration requires no external services

When to Use Hybrid Mode

Use Case	Recommendation
High-volume simple documents	Java-only (faster)
Documents with complex tables	Hybrid mode
OCR-heavy scanned documents	Hybrid mode
Maximum speed priority	Java-only
Maximum accuracy priority	Hybrid mode
Air-gapped environments	Hybrid with local backend (pre-install dependencies while online)

Scanned PDFs (OCR)

For image-based or scanned PDFs that contain no selectable text, enable OCR on the hybrid backend with --force-ocr.

CLI

Terminal 1: Start backend with OCR enabled

opendataloader-pdf-hybrid --port 5002 --force-ocr

Terminal 2: Process scanned PDF

# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/

For non-English documents, specify the OCR language. The default engine is EasyOCR, which uses ISO 639-1 codes:

opendataloader-pdf-hybrid --port 5002 --force-ocr --ocr-lang "ko,en"

Multiple languages can be combined with commas. For the full list of supported codes, see the EasyOCR documentation.

For Arabic documents:

opendataloader-pdf-hybrid --port 5002 --force-ocr --ocr-lang "ar,en"

Note for Arabic and other RTL scripts: With the default EasyOCR engine, character recognition uses EasyOCR's ar model. The current reading order algorithm processes text based on coordinates and does not perform RTL shaping or visual reordering, so text strings may appear in visual order rather than logical order. This limitation applies to all right-to-left scripts.

Choosing an OCR Engine

If the default EasyOCR engine does not support your language, switch engines. Malayalam is one example: EasyOCR rejects it with an error such as ({'ml'}, 'is not supported').

# Malayalam — use Tesseract with the matching tessdata
opendataloader-pdf-hybrid --port 5002 --force-ocr \
    --ocr-engine tesseract --ocr-lang "mal"

Each engine uses its own language code system — see the OCR Language Codes by Engine table in Server Options.

Engine	Prerequisite	Notes
`easyocr`	Installed by `opendataloader-pdf[hybrid]`	Default. Pure Python
`tesseract`	`tesseract` binary on `PATH` + tessdata for each language	CLI bridge. Honors `--psm`
`tesserocr`	`tesserocr` Python package + tesseract tessdata	Tesseract via Python bindings. Honors `--psm`
`rapidocr`	`rapidocr` and `onnxruntime` Python packages (`pip install rapidocr onnxruntime`)	ONNX-based engine
`ocrmac`	`ocrmac` Python package; macOS only	Apple Vision framework
`auto`	—	Delegates engine selection to `docling`

Each engine has its own license, language coverage, and accuracy characteristics; refer to the engine's own documentation. This server does not validate engine accuracy.

Prerequisite check: The server validates the selected engine at startup. If the binary or Python package is missing, it exits with code 2 and a message naming what to install — for example, "OCR engine 'tesseract' selected but the 'tesseract' binary was not found on PATH".
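The same prerequisites can be checked before starting the server. This is a rough pre-flight sketch, not the server's own check: it assumes each Python package is importable under the name listed in the table, and it does not verify that tessdata files are present.

```python
import importlib.util
import shutil
from typing import Dict, List

# Prerequisites per engine, taken from the table above.
# Assumption: each Python package is importable under the same name.
BINARIES: Dict[str, List[str]] = {"tesseract": ["tesseract"]}
MODULES: Dict[str, List[str]] = {
    "easyocr": ["easyocr"],
    "tesserocr": ["tesserocr"],
    "rapidocr": ["rapidocr", "onnxruntime"],
    "ocrmac": ["ocrmac"],
}

def missing_prerequisites(engine: str) -> List[str]:
    """Return the binaries and modules for `engine` that cannot be found locally."""
    missing = [f"binary:{b}" for b in BINARIES.get(engine, [])
               if shutil.which(b) is None]
    missing += [f"module:{m}" for m in MODULES.get(engine, [])
                if importlib.util.find_spec(m) is None]
    return missing

for engine in ("easyocr", "tesseract", "tesserocr", "rapidocr", "ocrmac"):
    gaps = missing_prerequisites(engine)
    print(engine, "ok" if not gaps else f"missing {gaps}")
```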

Disabling OCR

When the input PDFs already contain reliable embedded text, OCR can re-extract text from images such as charts, diagrams, or screenshots, producing duplicate fragments. Use --no-ocr to skip OCR entirely:

opendataloader-pdf-hybrid --port 5002 --no-ocr

--no-ocr and --force-ocr are mutually exclusive. When --no-ocr is combined with --ocr-engine, --ocr-lang, or --psm, the server logs a warning naming the inert flags.
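The flag rules above can be modelled with Python's `argparse`. This is an illustrative sketch only; the real server's parser, warning text, and attribute names may differ, and `inert_flags` is a name made up for this example.

```python
import argparse
import logging
from typing import List, Optional

def parse_ocr_flags(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="ocr-flags-sketch")
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--force-ocr", action="store_true")
    mode.add_argument("--no-ocr", action="store_true")
    # default=None distinguishes "flag not given" from an explicit value
    parser.add_argument("--ocr-engine", default=None)
    parser.add_argument("--ocr-lang", default=None)
    parser.add_argument("--psm", type=int, default=None)
    args = parser.parse_args(argv)

    args.inert_flags = []
    if args.no_ocr:
        given = {"--ocr-engine": args.ocr_engine,
                 "--ocr-lang": args.ocr_lang,
                 "--psm": args.psm}
        args.inert_flags = [flag for flag, value in given.items()
                            if value is not None]
        if args.inert_flags:
            # One warning that names every flag that has no effect
            logging.warning("--no-ocr set; ignoring %s",
                            ", ".join(args.inert_flags))
    return args

args = parse_ocr_flags(["--no-ocr", "--ocr-lang", "ko,en", "--psm", "6"])
print(args.inert_flags)  # ['--ocr-lang', '--psm']
```

Passing both `--no-ocr` and `--force-ocr` to this parser makes `argparse` print a usage error and exit.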

Python

import opendataloader_pdf

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    hybrid="docling-fast"
)

Start the backend server with --force-ocr before running the Python conversion.

Note: Standard digital PDFs do not need --force-ocr. Use it only for scanned or image-based PDFs where text cannot be selected.

Timeout: OCR is CPU-intensive. By default there is no timeout, but you can set one explicitly (the value is in milliseconds):

# Batch all files in one call; each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf --hybrid docling-fast --hybrid-timeout 120000 file1.pdf file2.pdf folder/

Chart and Image Description

Generate AI-powered natural language descriptions for images and charts in your PDFs. This makes visual content searchable in RAG pipelines and produces alt text for accessibility.

Important: Picture description requires --hybrid-mode full on the client side. Without it, the enrichment runs on the backend but the descriptions are not included in the output.

CLI

Terminal 1: Start backend with picture description enabled

opendataloader-pdf-hybrid --port 5002 --enrich-picture-description

Terminal 2: Process with full backend mode

# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/

Python

import opendataloader_pdf

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    hybrid="docling-fast",
    hybrid_mode="full"              # Required for picture description
)

Start the backend server with --enrich-picture-description before running.

Output

The description appears in the JSON output under "description" and as an italic caption in Markdown:

{
  "type": "picture",
  "page number": 1,
  "bounding box": [72.0, 400.0, 540.0, 650.0],
  "description": "A bar chart showing waste generation by region from 2016 to 2030..."
}

![image 1](document_images/imageFile1.png)

*A bar chart showing waste generation by region from 2016 to 2030...*
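To feed the descriptions into a search index or RAG pipeline, they can be collected from the JSON output. The sketch below assumes only the element shape shown above (`"type": "picture"` with a `"description"` key). How elements are nested in the full file is not specified here, so the walker searches recursively, and the `"kids"` wrapper in the sample is a made-up stand-in.

```python
import json
from typing import Any, Iterator, Optional, Tuple

def iter_picture_descriptions(node: Any) -> Iterator[Tuple[Optional[int], str]]:
    """Yield (page number, description) for every described picture element."""
    if isinstance(node, dict):
        if node.get("type") == "picture" and "description" in node:
            yield node.get("page number"), node["description"]
        for value in node.values():
            yield from iter_picture_descriptions(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_picture_descriptions(item)

# Stand-in for loading a real output file with json.load();
# the "kids" wrapper is an assumption made for this example.
sample = json.loads("""
{"kids": [
  {"type": "paragraph", "page number": 1},
  {"type": "picture", "page number": 1,
   "bounding box": [72.0, 400.0, 540.0, 650.0],
   "description": "A bar chart showing waste generation by region"}
]}
""")

for page, text in iter_picture_descriptions(sample):
    print(page, text)
```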

You can customize the prompt for specific document types:

opendataloader-pdf-hybrid --enrich-picture-description \
  --picture-description-prompt "Describe this scientific figure in detail, including axis labels and data trends."

Note: Picture description uses SmolVLM (256M), a lightweight vision model. Results are suitable for general context but may not capture precise data values from complex charts. The model is English-centric — prompts asking for non-English output (e.g., "Describe the image in Korean.") will not produce coherent translations and are not recommended.

Server Options

Option	Description
`--port PORT`	Server port (default: 5002)
`--host HOST`	Bind address (default: 0.0.0.0)
`--force-ocr`	Force full-page OCR on all pages. Mutually exclusive with `--no-ocr`
`--no-ocr`	Disable OCR entirely. Use when input PDFs already have reliable embedded text. Mutually exclusive with `--force-ocr`
`--ocr-engine ENGINE`	OCR engine: `auto`, `easyocr`, `ocrmac`, `rapidocr`, `tesseract`, `tesserocr` (default: `easyocr`). The exact list is derived from the installed `docling` version
`--ocr-lang LANG`	OCR languages, comma-separated. Code system depends on `--ocr-engine` (see below)
`--psm INT`	Tesseract Page Segmentation Mode. Applied only when `--ocr-engine` is `tesseract` or `tesserocr`; ignored otherwise. See `tesseract --help-extra`
`--enrich-formula`	Enable formula enrichment (LaTeX extraction)
`--no-enrich-formula`	Disable formula enrichment
`--enrich-picture-description`	Enable picture description (alt text generation)
`--no-enrich-picture-description`	Disable picture description
`--picture-description-prompt TEXT`	Custom prompt for picture description
`--device DEVICE`	Accelerator device: `auto`, `cpu`, `cuda`, `mps` (Apple Silicon), `xpu` (Intel GPU). Default: `auto`
`--max-file-size MB`	Maximum upload file size in MB. `0` means no limit (default: `0`)
`--log-level LEVEL`	Log level: `debug`, `info`, `warning`, `error`

OCR Language Codes by Engine

The --ocr-lang code system varies by engine. If omitted, each engine uses its own default languages.

Engine	Code system	Example
`easyocr`	ISO 639-1	`ko,en`
`tesseract` / `tesserocr`	ISO 639-2	`kor,eng`
`rapidocr`	Plain English names	`english,chinese`
`ocrmac`	BCP-47	`en-US`
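When the same language list has to work with more than one engine, a small lookup avoids mixing code systems. This is a sketch covering only EasyOCR and the Tesseract engines, and only the languages mentioned on this page; the function name is hypothetical, and whether a language actually works still depends on the installed models or tessdata.

```python
from typing import Dict, List

# ISO 639-1 -> ISO 639-2 for the languages mentioned on this page only.
ISO_639_1_TO_2: Dict[str, str] = {
    "en": "eng", "ko": "kor", "ar": "ara", "ml": "mal",
}

def ocr_lang_for(engine: str, iso_639_1: List[str]) -> str:
    """Build an --ocr-lang value from ISO 639-1 codes for the given engine."""
    if engine == "easyocr":
        codes = list(iso_639_1)
    elif engine in ("tesseract", "tesserocr"):
        # Raises KeyError for codes this sketch does not know about
        codes = [ISO_639_1_TO_2[c] for c in iso_639_1]
    else:
        raise ValueError(f"no mapping in this sketch for engine {engine!r}")
    return ",".join(codes)

print(ocr_lang_for("easyocr", ["ko", "en"]))    # ko,en
print(ocr_lang_for("tesseract", ["ko", "en"]))  # kor,eng
```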

Engine availability check: At startup, the server probes whether the selected engine's binary or Python package is installed. If not, it exits with code 2 and a message naming the missing prerequisite — for example, Tesseract requires the tesseract binary on PATH.

Inert flag warning: When --no-ocr is combined with OCR-related flags (--ocr-engine, --ocr-lang, --psm), the server logs a single warning naming the inert flags rather than silently dropping them.

Troubleshooting

Backend Connection Failed

Error: Could not connect to hybrid backend at http://localhost:5002

Solution: Start the backend server first:

opendataloader-pdf-hybrid
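In automated pipelines the client may start before the server is listening. A minimal wait loop, using only the Python standard library, is sketched below. It checks that the TCP port accepts connections and nothing more; it does not confirm that models have finished loading, and the function name and defaults are illustrative.

```python
import socket
import time

def wait_for_backend(host: str = "localhost", port: int = 5002,
                     timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused, unreachable, or timed out: retry until the deadline
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)

if __name__ == "__main__":
    ready = wait_for_backend(timeout_s=1.0)
    print("backend reachable" if ready else "backend not reachable on port 5002")
```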

Slow Processing

If hybrid mode is slower than expected:

Check if the backend server is healthy
Consider increasing hybrid_timeout for large documents
Ensure the backend has sufficient resources (RAM, CPU)

Fallback Activated

Warning: Hybrid backend unavailable, falling back to Java processing

This is expected behavior when fallback is enabled (hybrid_fallback=True in Python, --hybrid-fallback on the CLI). The document is still processed, but without AI-enhanced table extraction.

Hybrid Had No Effect

Warning: Both --use-struct-tree and --hybrid were set on a tagged PDF. The structure tree takes precedence, so the hybrid backend was NOT called. A well-tagged PDF already carries reading order and structure; drop --use-struct-tree if you want the hybrid backend instead.

--use-struct-tree takes precedence over --hybrid on tagged PDFs. If hybrid mode gave no accuracy improvement, check whether --use-struct-tree is set — drop it to route complex pages to the backend, or keep it to rely on the PDF's own tags. On PDFs with no structure tree, --use-struct-tree is ignored and hybrid runs normally.
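The precedence rule reduces to a small decision function. The sketch below restates the behavior described above; the function and its return labels are hypothetical and do not mirror the tool's internals.

```python
def effective_path(use_struct_tree: bool, hybrid: str, pdf_is_tagged: bool) -> str:
    """Which path handles the document, per the precedence rule described above."""
    if use_struct_tree and pdf_is_tagged:
        return "struct-tree"   # hybrid backend is not called
    if hybrid != "off":
        return "hybrid"        # --use-struct-tree is ignored on untagged PDFs
    return "java-only"

for tagged in (True, False):
    print(tagged, effective_path(True, "docling-fast", tagged))
# True struct-tree
# False hybrid
```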

Learn More

CLI Options Reference — Full list of CLI options
Benchmark Results — Detailed accuracy comparisons
RAG Integration — Using hybrid mode in RAG pipelines
