Frequently Asked Questions
Common questions about OpenDataLoader PDF for RAG, LLM, and document processing
General
What is OpenDataLoader PDF?
OpenDataLoader PDF is an open-source tool that converts PDF documents into structured formats (JSON, Markdown, HTML) optimized for AI applications like RAG (Retrieval-Augmented Generation), LLM processing, and vector search. It runs entirely on your local machine without requiring GPU or cloud services.
What is the best PDF parser for RAG?
For RAG pipelines, you need a PDF parser that:
- Preserves correct reading order (especially for multi-column layouts)
- Provides bounding boxes for citations
- Outputs structured data (headings, paragraphs, tables)
- Filters noise (headers, footers, hidden text)
OpenDataLoader PDF is designed specifically for these requirements. It uses the XY-Cut++ algorithm for reading order, provides coordinates for every element, and includes built-in AI safety filters.
How does OpenDataLoader compare to other PDF parsers?
OpenDataLoader PDF is the only open-source PDF parser that combines:
- Rule-based extraction (no GPU needed)
- Bounding boxes for every element
- XY-Cut++ reading order algorithm
- Built-in AI safety filters
- Native Tagged PDF support
Most alternatives require GPU, lack coordinates, or ignore PDF structure tags.
What makes OpenDataLoader unique?
OpenDataLoader takes a different approach from many PDF parsers:
- Rule-based extraction — Deterministic output without GPU requirements
- Bounding boxes for all elements — Essential for citation systems
- XY-Cut++ reading order — Handles multi-column layouts correctly
- Built-in AI safety filters — Protects against prompt injection
- Native Tagged PDF support — Leverages accessibility metadata
This means: consistent output (same input = same output), no GPU required, faster processing, and no model hallucinations.
Installation & Setup
What are the system requirements?
- Java 11 or higher (must be installed and in PATH)
- Python 3.10+ (for Python package)
- Node.js 20+ (for Node.js package)
- No GPU required
- Works on Linux, macOS, and Windows
Why does OpenDataLoader require Java?
The core PDF parsing engine is written in Java for performance and reliability. The Python and Node.js packages automatically manage the Java runtime — you just need Java installed on your system.
How do I install OpenDataLoader PDF?
Python:
pip install opendataloader-pdfNode.js:
npm install @opendataloader/pdfUsage
How do I extract tables from PDF for LLM?
OpenDataLoader detects tables using both border analysis and text clustering, preserving row/column structure in the output:
import opendataloader_pdf
# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
input_path=["file1.pdf", "file2.pdf", "folder/"],
output_dir="output/",
format="json" # JSON preserves table structure
)Tables are exported as structured data with rows, columns, and cell content preserved.
How do I handle multi-column PDFs?
Reading order is enabled by default using the XY-Cut++ algorithm. No configuration needed:
import opendataloader_pdf
# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
input_path=["file1.pdf", "file2.pdf", "folder/"],
output_dir="output/",
)This ensures text is extracted in the order humans would read it, not left-to-right across columns.
How do I get bounding boxes for citations?
Use JSON output format. Every element includes a bounding box field:
# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
input_path=["file1.pdf", "file2.pdf", "folder/"],
output_dir="output/",
format="json"
)Output:
{
"type": "paragraph",
"page number": 1,
"bounding box": [72.0, 650.5, 540.0, 700.2],
"content": "This is the paragraph text..."
}Coordinates are [left, bottom, right, top] in PDF points (72 points = 1 inch).
What output formats are available?
| Format | Use Case |
|---|---|
json | Structured data with bounding boxes, semantic types |
markdown | Clean text for LLM context, RAG chunks |
html | Web display with styling |
pdf | Annotated PDF showing detected structures |
text | Plain text extraction |
You can combine formats: format="json,markdown"
Does OpenDataLoader work with LangChain?
Yes! OpenDataLoader PDF has an official LangChain integration:
pip install -U langchain-opendataloader-pdffrom langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
loader = OpenDataLoaderPDFLoader(
file_path=["file1.pdf", "file2.pdf", "folder/"],
format="text"
)
documents = loader.load()See the LangChain documentation for more details.
Privacy & Security
Can I use this without sending data to the cloud?
Yes. OpenDataLoader PDF runs 100% locally on your machine. No API calls, no data transmission — your documents never leave your environment. This makes it ideal for:
- Legal documents
- Medical records
- Financial reports
- Any sensitive data
What is AI Safety filtering?
PDFs can contain hidden text designed for prompt injection attacks — invisible instructions that manipulate LLMs. OpenDataLoader automatically filters:
- Hidden text (transparent, zero-size fonts)
- Off-page content
- Suspicious invisible layers
This is enabled by default. Learn more in our AI Safety documentation.
Is my data safe?
Yes. OpenDataLoader:
- Runs entirely on your machine
- Makes no network requests
- Stores no data externally
- Is open-source (you can audit the code)
Performance
How fast is OpenDataLoader?
Local mode processes 60+ pages per second on CPU (0.015s/page). Hybrid mode processes 2+ pages per second (0.463s/page) with significantly higher accuracy for complex documents. No GPU required. Benchmarked on Apple M4. Full benchmark details. With multi-process batch processing, throughput exceeds 100 pages per second on 8+ core machines.
Can I process multiple PDFs at once?
Yes. Pass a list of files, a directory, or both:
# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
input_path=["report.pdf", "contract.pdf", "invoice.pdf"],
output_dir="output/",
format="json,markdown"
)
# Or a folder (recursively finds all PDFs)
opendataloader_pdf.convert(
input_path="documents/",
output_dir="output/",
format="json,markdown"
)CLI equivalent:
# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf report.pdf contract.pdf ./invoices/ -o ./output -f json,markdownPerformance tip: Always pass all files in a single call. Each separate CLI invocation starts a new Java process (~1-2s overhead), so batching is significantly faster for large document collections.
Does it work with scanned PDFs?
Yes, via hybrid mode with OCR. Install the hybrid extra, then start the backend with --force-ocr:
Terminal 1: Start backend with OCR enabled
pip install -U "opendataloader-pdf[hybrid]"
opendataloader-pdf-hybrid --port 5002 --force-ocrTerminal 2: Process scanned PDF
# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/Or use in Python:
import opendataloader_pdf
# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
input_path=["file1.pdf", "file2.pdf", "folder/"],
output_dir="output/",
hybrid="docling-fast"
)For non-English scanned documents, specify the OCR language:
opendataloader-pdf-hybrid --port 5002 --ocr-lang "ko,en"See Hybrid Mode → Scanned PDFs (OCR) for details.
Does it work with images and charts?
Two levels of support:
- Image extraction (all modes): Embedded images are extracted with bounding boxes. Enable with
image_output="external"(the default):
# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
input_path=["file1.pdf", "file2.pdf", "folder/"],
output_dir="output/",
image_output="external" # Saves images as files; bounding boxes in JSON
)- AI chart descriptions (hybrid only): Generate natural language descriptions of charts and figures, useful for RAG pipelines where visual content needs to be searchable:
# Start backend with picture description enabled
opendataloader-pdf-hybrid --port 5002 --enrich-picture-description
# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/The description appears in the JSON output under "description" and as a caption in Markdown. See Hybrid Mode → Chart and Image Description for details.
Tagged PDF
What is Tagged PDF?
Tagged PDF is a document structure that includes semantic information (headings, paragraphs, lists, tables). When a PDF has proper tags, OpenDataLoader can extract the exact layout the author intended — no guessing required.
Why does Tagged PDF matter?
The European Accessibility Act (EAA) took effect on June 28, 2025, requiring accessible digital documents across the EU. This means more PDFs are now properly tagged.
How do I use Tagged PDF features?
# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
input_path=["file1.pdf", "file2.pdf", "folder/"],
output_dir="output/",
use_struct_tree=True # Use native PDF structure tags
)Most PDF parsers ignore structure tags entirely. OpenDataLoader is one of the few that fully supports them.
Troubleshooting
Text from different columns is mixed together
Reading order is enabled by default (XY-Cut++). If still seeing issues, try --use-struct-tree for tagged PDFs.
Tables are not detected correctly
For complex tables, enable hybrid mode which routes table-heavy pages to an AI backend for 90% better accuracy:
pip install -U "opendataloader-pdf[hybrid]"Terminal 1: Start the backend server
opendataloader-pdf-hybrid --port 5002Terminal 2: Process PDFs with hybrid mode
# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/Or use in Python:
# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
input_path=["file1.pdf", "file2.pdf", "folder/"],
output_dir="output/",
hybrid="docling-fast" # Routes complex pages to AI backend
)This improves table accuracy from 0.489 to 0.928. See Hybrid Mode for details.
Headers and footers appear in my output
These should be filtered by default. If they're appearing, they may be part of the main content flow rather than repeated elements.
Java is not found
Ensure Java 11+ is installed and in your PATH:
java -versionIf not installed, download from Adoptium or use your package manager.
Contributing
How can I contribute?
We welcome contributions! See our Contributing Guide for details on:
- Reporting bugs
- Suggesting features
- Submitting pull requests
Where can I get help?
- GitHub Discussions — Q&A and general conversations
- GitHub Issues — Bug reports and feature requests
What's New in v2.0
OpenDataLoader PDF v2.0 release highlights: PDF to Markdown for RAG at 100+ pages/sec with no GPU, top benchmark performance, four free AI Add-ons, Apache 2.0 license, LangChain integration
Quick Start with Python
Install opendataloader-pdf and extract text, tables, and headings from PDF files using Python. Requires Java 11+ and Python 3.10+.