OpenDataLoader LogoOpenDataLoader

Frequently Asked Questions

Common questions about OpenDataLoader PDF for RAG, LLM, and document processing

General

What is OpenDataLoader PDF?

OpenDataLoader PDF is an open-source tool that converts PDF documents into structured formats (JSON, Markdown, HTML) optimized for AI applications like RAG (Retrieval-Augmented Generation), LLM processing, and vector search. It runs entirely on your local machine without requiring GPU or cloud services.

What is the best PDF parser for RAG?

For RAG pipelines, you need a PDF parser that:

  • Preserves correct reading order (especially for multi-column layouts)
  • Provides bounding boxes for citations
  • Outputs structured data (headings, paragraphs, tables)
  • Filters noise (headers, footers, hidden text)

OpenDataLoader PDF is designed specifically for these requirements. It uses the XY-Cut++ algorithm for reading order, provides coordinates for every element, and includes built-in AI safety filters.

How does OpenDataLoader compare to other PDF parsers?

OpenDataLoader PDF is the only open-source PDF parser that combines:

  • Rule-based extraction (no GPU needed)
  • Bounding boxes for every element
  • XY-Cut++ reading order algorithm
  • Built-in AI safety filters
  • Native Tagged PDF support

Most alternatives require GPU, lack coordinates, or ignore PDF structure tags.

What makes OpenDataLoader unique?

OpenDataLoader takes a different approach from many PDF parsers:

  • Rule-based extraction — Deterministic output without GPU requirements
  • Bounding boxes for all elements — Essential for citation systems
  • XY-Cut++ reading order — Handles multi-column layouts correctly
  • Built-in AI safety filters — Protects against prompt injection
  • Native Tagged PDF support — Leverages accessibility metadata

This means: consistent output (same input = same output), no GPU required, faster processing, and no model hallucinations.

Installation & Setup

What are the system requirements?

  • Java 11 or higher (must be installed and in PATH)
  • Python 3.10+ (for Python package)
  • Node.js 20+ (for Node.js package)
  • No GPU required
  • Works on Linux, macOS, and Windows

Why does OpenDataLoader require Java?

The core PDF parsing engine is written in Java for performance and reliability. The Python and Node.js packages automatically manage the Java runtime — you just need Java installed on your system.

How do I install OpenDataLoader PDF?

Python:

pip install opendataloader-pdf

Node.js:

npm install @opendataloader/pdf

Usage

How do I extract tables from PDF for LLM?

OpenDataLoader detects tables using both border analysis and text clustering, preserving row/column structure in the output:

import opendataloader_pdf

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    format="json"                       # JSON preserves table structure
)

Tables are exported as structured data with rows, columns, and cell content preserved.

How do I handle multi-column PDFs?

Reading order is enabled by default using the XY-Cut++ algorithm. No configuration needed:

import opendataloader_pdf

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
)

This ensures text is extracted in the order humans would read it, not left-to-right across columns.

How do I get bounding boxes for citations?

Use JSON output format. Every element includes a bounding box field:

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    format="json"
)

Output:

{
  "type": "paragraph",
  "page number": 1,
  "bounding box": [72.0, 650.5, 540.0, 700.2],
  "content": "This is the paragraph text..."
}

Coordinates are [left, bottom, right, top] in PDF points (72 points = 1 inch).

What output formats are available?

FormatUse Case
jsonStructured data with bounding boxes, semantic types
markdownClean text for LLM context, RAG chunks
htmlWeb display with styling
pdfAnnotated PDF showing detected structures
textPlain text extraction

You can combine formats: format="json,markdown"

Does OpenDataLoader work with LangChain?

Yes! OpenDataLoader PDF has an official LangChain integration:

pip install -U langchain-opendataloader-pdf
from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

loader = OpenDataLoaderPDFLoader(
    file_path=["file1.pdf", "file2.pdf", "folder/"],
    format="text"
)
documents = loader.load()

See the LangChain documentation for more details.

Privacy & Security

Can I use this without sending data to the cloud?

Yes. OpenDataLoader PDF runs 100% locally on your machine. No API calls, no data transmission — your documents never leave your environment. This makes it ideal for:

  • Legal documents
  • Medical records
  • Financial reports
  • Any sensitive data

What is AI Safety filtering?

PDFs can contain hidden text designed for prompt injection attacks — invisible instructions that manipulate LLMs. OpenDataLoader automatically filters:

  • Hidden text (transparent, zero-size fonts)
  • Off-page content
  • Suspicious invisible layers

This is enabled by default. Learn more in our AI Safety documentation.

Is my data safe?

Yes. OpenDataLoader:

  • Runs entirely on your machine
  • Makes no network requests
  • Stores no data externally
  • Is open-source (you can audit the code)

Performance

How fast is OpenDataLoader?

Local mode processes 60+ pages per second on CPU (0.015s/page). Hybrid mode processes 2+ pages per second (0.463s/page) with significantly higher accuracy for complex documents. No GPU required. Benchmarked on Apple M4. Full benchmark details. With multi-process batch processing, throughput exceeds 100 pages per second on 8+ core machines.

Can I process multiple PDFs at once?

Yes. Pass a list of files, a directory, or both:

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["report.pdf", "contract.pdf", "invoice.pdf"],
    output_dir="output/",
    format="json,markdown"
)

# Or a folder (recursively finds all PDFs)
opendataloader_pdf.convert(
    input_path="documents/",
    output_dir="output/",
    format="json,markdown"
)

CLI equivalent:

# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf report.pdf contract.pdf ./invoices/ -o ./output -f json,markdown

Performance tip: Always pass all files in a single call. Each separate CLI invocation starts a new Java process (~1-2s overhead), so batching is significantly faster for large document collections.

Does it work with scanned PDFs?

Yes, via hybrid mode with OCR. Install the hybrid extra, then start the backend with --force-ocr:

Terminal 1: Start backend with OCR enabled

pip install -U "opendataloader-pdf[hybrid]"
opendataloader-pdf-hybrid --port 5002 --force-ocr

Terminal 2: Process scanned PDF

# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/

Or use in Python:

import opendataloader_pdf

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    hybrid="docling-fast"
)

For non-English scanned documents, specify the OCR language:

opendataloader-pdf-hybrid --port 5002 --ocr-lang "ko,en"

See Hybrid Mode → Scanned PDFs (OCR) for details.

Does it work with images and charts?

Two levels of support:

  1. Image extraction (all modes): Embedded images are extracted with bounding boxes. Enable with image_output="external" (the default):
# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    image_output="external"             # Saves images as files; bounding boxes in JSON
)
  1. AI chart descriptions (hybrid only): Generate natural language descriptions of charts and figures, useful for RAG pipelines where visual content needs to be searchable:
# Start backend with picture description enabled
opendataloader-pdf-hybrid --port 5002 --enrich-picture-description

# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/

The description appears in the JSON output under "description" and as a caption in Markdown. See Hybrid Mode → Chart and Image Description for details.

Tagged PDF

What is Tagged PDF?

Tagged PDF is a document structure that includes semantic information (headings, paragraphs, lists, tables). When a PDF has proper tags, OpenDataLoader can extract the exact layout the author intended — no guessing required.

Why does Tagged PDF matter?

The European Accessibility Act (EAA) took effect on June 28, 2025, requiring accessible digital documents across the EU. This means more PDFs are now properly tagged.

How do I use Tagged PDF features?

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    use_struct_tree=True                # Use native PDF structure tags
)

Most PDF parsers ignore structure tags entirely. OpenDataLoader is one of the few that fully supports them.

Troubleshooting

Text from different columns is mixed together

Reading order is enabled by default (XY-Cut++). If still seeing issues, try --use-struct-tree for tagged PDFs.

Tables are not detected correctly

For complex tables, enable hybrid mode which routes table-heavy pages to an AI backend for 90% better accuracy:

pip install -U "opendataloader-pdf[hybrid]"

Terminal 1: Start the backend server

opendataloader-pdf-hybrid --port 5002

Terminal 2: Process PDFs with hybrid mode

# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/

Or use in Python:

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    hybrid="docling-fast"               # Routes complex pages to AI backend
)

This improves table accuracy from 0.489 to 0.928. See Hybrid Mode for details.

Headers and footers appear in my output

These should be filtered by default. If they're appearing, they may be part of the main content flow rather than repeated elements.

Java is not found

Ensure Java 11+ is installed and in your PATH:

java -version

If not installed, download from Adoptium or use your package manager.

Contributing

How can I contribute?

We welcome contributions! See our Contributing Guide for details on:

  • Reporting bugs
  • Suggesting features
  • Submitting pull requests

Where can I get help?

On this page