RAG Integration Guide
How to use OpenDataLoader PDF in Retrieval-Augmented Generation pipelines
Why PDF Parsing Matters for RAG
RAG (Retrieval-Augmented Generation) systems retrieve relevant context from documents to ground LLM responses. The quality of your PDF parsing directly impacts:
- Retrieval accuracy: Poorly parsed text → wrong chunks retrieved
- Answer quality: Jumbled text → confused LLM responses
- Citation accuracy: No coordinates → can't point to source location
OpenDataLoader is designed specifically for RAG pipelines, providing structured output with bounding boxes for every element.
Basic RAG Workflow
┌─────────────┐   ┌──────────────────┐   ┌─────────────┐
│     PDF     │ → │  OpenDataLoader  │ → │  Markdown/  │
│    Files    │   │       PDF        │   │    JSON     │
└─────────────┘   └──────────────────┘   └─────────────┘
                                                ↓
┌─────────────┐   ┌──────────────────┐   ┌─────────────┐
│     LLM     │ ← │   Vector Store   │ ← │  Chunking   │
│  Response   │   │   (Retrieval)    │   │   & Embed   │
└─────────────┘   └──────────────────┘   └─────────────┘
Working Examples
Complete, runnable examples are available in the repository:
git clone https://github.com/opendataloader-project/opendataloader-pdf
cd opendataloader-pdf/examples/python/rag
# Basic chunking (no external dependencies)
pip install opendataloader-pdf
python basic_chunking.py
# LangChain integration
pip install -r requirements.txt
python langchain_example.py
See examples/python/rag for details.
Quick Start
Step 1: Convert PDFs
import opendataloader_pdf
# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    format="json,markdown",
    quiet=True,
)
Step 2: Load and Chunk
import json
with open("output/document.json", encoding="utf-8") as f:
    doc = json.load(f)
# Chunk by semantic elements
chunks = []
for element in doc["kids"]:
    if element["type"] in ("paragraph", "heading", "list"):
        chunks.append({
            "text": element.get("content", ""),
            "metadata": {
                "type": element["type"],
                "page": element.get("page number"),
                "bbox": element.get("bounding box"),
                "source": doc.get("file name"),
            }
        })
Step 3: Embed and Store
Each chunk is ready for your embedding model and vector store:
for chunk in chunks:
    text = chunk["text"]  # Text to embed
    metadata = chunk["metadata"]  # Page, bbox, source for citations
    # Your embedding step:
    # embedding = your_model.embed(text)
    # vector_store.add(embedding, metadata=metadata)
Using Bounding Boxes for Citations
OpenDataLoader provides bounding boxes for every element, enabling precise source citations:
import json
with open("output/document.json", encoding="utf-8") as f:
    doc = json.load(f)
# Extract elements with locations
for element in doc["kids"]:
    content = element.get("content", "")
    bbox = element.get("bounding box")  # [left, bottom, right, top]
    page = element.get("page number")
    element_type = element.get("type")
    # Store with your chunks for citation
    chunk_metadata = {
        "page": page,
        "bbox": bbox,
        "type": element_type
    }
Citation Format Example
When your RAG system retrieves a chunk, you can generate precise citations:
def format_citation(metadata):
    source = metadata.get("source", "unknown")
    page = metadata.get("page")
    bbox = metadata.get("bbox")
    citation = f"Source: {source}"
    if page:
        citation += f", Page {page}"
    if bbox:
        citation += f", Position ({bbox[0]:.0f}, {bbox[1]:.0f})"
    return citation
# Output: "Source: document.pdf, Page 3, Position (72, 450)"
Chunking Strategies
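The strategies in this section all walk the top-level `doc["kids"]` list, so a small hand-written stand-in document can be handy for trying them without converting a PDF first. The sketch below is only that: the field names follow the snippets in this guide (including the assumption that list elements expose `content`), every value is invented, and `SAMPLE_DOC`, `iter_text_elements`, and `count_text_elements_per_page` are hypothetical helper names, not part of the library.

```python
# Hand-written stand-in for a parsed document. Field names follow the
# examples in this guide; every value below is invented for illustration.
SAMPLE_DOC = {
    "file name": "sample.pdf",
    "kids": [
        {"type": "heading", "content": "Introduction",
         "page number": 1, "bounding box": [72.0, 700.0, 300.0, 720.0]},
        {"type": "paragraph", "content": "RAG systems retrieve context from documents.",
         "page number": 1, "bounding box": [72.0, 640.0, 540.0, 690.0]},
        {"type": "table",
         "page number": 1, "bounding box": [72.0, 500.0, 540.0, 620.0]},
        {"type": "heading", "content": "Methods",
         "page number": 2, "bounding box": [72.0, 700.0, 300.0, 720.0]},
        {"type": "list", "content": "First step\nSecond step",
         "page number": 2, "bounding box": [72.0, 600.0, 540.0, 690.0]},
    ],
}

# Element types that the chunking strategies in this guide treat as text
TEXT_TYPES = ("paragraph", "heading", "list")


def iter_text_elements(doc):
    """Yield the top-level elements whose type is in TEXT_TYPES."""
    for element in doc.get("kids", []):
        if element.get("type") in TEXT_TYPES:
            yield element


def count_text_elements_per_page(doc):
    """Count text elements per page, a quick sanity check before chunking."""
    counts = {}
    for element in iter_text_elements(doc):
        page = element.get("page number")
        counts[page] = counts.get(page, 0) + 1
    return counts
```

Passing `SAMPLE_DOC` to any of the chunking functions in this section is a quick way to inspect the chunk shapes they produce before running them on real output.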
By Semantic Elements
Create one chunk per paragraph, heading, or list element:
def chunk_by_element(doc):
    """Best for: Fine-grained retrieval, precise citations."""
    chunks = []
    for element in doc["kids"]:
        if element["type"] in ("paragraph", "heading", "list"):
            chunks.append({
                "text": element.get("content", ""),
                "metadata": {
                    "type": element["type"],
                    "page": element.get("page number"),
                    "bbox": element.get("bounding box"),
                    "source": doc.get("file name"),
                }
            })
    return chunks
By Headings (Sections)
Group content under headings into coherent sections:
def chunk_by_section(doc):
    """Best for: Context-rich retrieval, topic-based search."""
    chunks = []
    current_heading = None
    current_content = []
    current_start_page = None
    for element in doc["kids"]:
        if element["type"] == "heading":
            # A new heading closes the previous section
            if current_content:
                chunks.append({
                    "text": "\n".join(current_content),
                    "metadata": {
                        "heading": current_heading,
                        "page": current_start_page,
                        "source": doc.get("file name"),
                    }
                })
            current_heading = element.get("content", "")
            current_content = [current_heading]
            current_start_page = element.get("page number")
        elif element["type"] in ("paragraph", "list"):
            content = element.get("content", "")
            if content:
                # Content before the first heading still needs a start page
                if current_start_page is None:
                    current_start_page = element.get("page number")
                current_content.append(content)
    # Save the last section, with the same metadata keys as the others
    if current_content:
        chunks.append({
            "text": "\n".join(current_content),
            "metadata": {
                "heading": current_heading,
                "page": current_start_page,
                "source": doc.get("file name"),
            }
        })
    return chunks
Merged Chunks (Minimum Size)
Combine small paragraphs to avoid overly fragmented chunks:
def chunk_with_min_size(doc, min_chars=200):
    """Best for: Balanced chunk sizes, reducing noise."""
    chunks = []
    buffer_text = ""
    buffer_pages = []
    for element in doc["kids"]:
        if element["type"] in ("paragraph", "heading", "list"):
            buffer_text += element.get("content", "") + "\n"
            page = element.get("page number")
            if page and page not in buffer_pages:
                buffer_pages.append(page)
            if len(buffer_text) >= min_chars:
                chunks.append({
                    "text": buffer_text.strip(),
                    "metadata": {"pages": buffer_pages.copy()}
                })
                buffer_text = ""
                buffer_pages = []
    if buffer_text.strip():
        chunks.append({"text": buffer_text.strip(), "metadata": {"pages": buffer_pages}})
    return chunks
Tables as Separate Chunks
Tables often contain dense information. Chunk them separately:
for element in doc["kids"]:
    if element["type"] == "table":
        chunks.append({
            "type": "table",
            "content": element,  # Keep full structure
            "page": element.get("page number")
        })
Handling Different Document Types
Academic Papers (Multi-Column)
# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["paper1.pdf", "paper2.pdf", "papers/"],
    output_dir="output/",
    format="json,markdown",
)
Financial Reports (Tables Heavy)
# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["report1.pdf", "report2.pdf", "reports/"],
    output_dir="output/",
    format="json",  # JSON preserves table structure
)
Legal Documents (Long Text)
# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["contract1.pdf", "contract2.pdf", "contracts/"],
    output_dir="output/",
    format="markdown",
)
Filtering Noise
OpenDataLoader automatically filters content that would pollute your RAG context:
- Headers/footers: Repeated page elements removed
- Hidden text: Transparent or off-page content filtered
- Watermarks: Background elements excluded
These filters are enabled by default. To disable them (not recommended for RAG):
# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    content_safety_off="all",  # Disable all filters
)
Performance Tips
Batch Processing
Process multiple files in a single call to avoid repeated Java startup overhead:
import opendataloader_pdf
# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["report1.pdf", "report2.pdf", "report3.pdf"],
    output_dir="output/",
    format="json,markdown",
    quiet=True,
)
# Or process an entire folder (recursive)
opendataloader_pdf.convert(
    input_path="documents/",
    output_dir="output/",
    format="json,markdown",
    quiet=True,
)
CLI equivalent:
# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf report1.pdf report2.pdf report3.pdf folder/ --format json,markdown --output-dir output/
Why batch matters: Each CLI invocation starts a new Java process (~1-2s overhead). Passing all files in one command processes them in a single JVM, which is significantly faster for large document collections.
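When the file list is built dynamically (for example, filtered from several folders), the same idea applies: gather the paths first, then convert once. The following is a minimal sketch under that assumption. `collect_pdfs` and `convert_in_one_call` are hypothetical helper names introduced here, and the `convert()` arguments are the ones already used in this guide.

```python
from pathlib import Path


def collect_pdfs(*roots):
    """Return a sorted, de-duplicated list of PDF paths under the given files or folders."""
    found = set()
    for root in roots:
        path = Path(root)
        if path.is_dir():
            # Walk the folder recursively and keep files with a .pdf suffix (any case)
            found.update(p for p in path.rglob("*")
                         if p.is_file() and p.suffix.lower() == ".pdf")
        elif path.is_file() and path.suffix.lower() == ".pdf":
            found.add(path)
    return sorted(str(p) for p in found)


def convert_in_one_call(paths, output_dir="output/"):
    """Hand the whole list to a single convert() call (hypothetical wrapper)."""
    # Imported lazily so collect_pdfs can be used without the package installed
    import opendataloader_pdf

    opendataloader_pdf.convert(
        input_path=paths,
        output_dir=output_dir,
        format="json,markdown",
        quiet=True,
    )
```

A typical use would be `convert_in_one_call(collect_pdfs("reports/", "extra/summary.pdf"))`, which starts the JVM once regardless of how many files were found.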
Output Format Selection
| Format | Use Case | Size |
|---|---|---|
| markdown | Text for chunking/embedding | Smallest |
| json | Structured data with metadata | Medium |
| json,markdown | Both (recommended for RAG) | Larger |
Common Issues and Solutions
Issue: Text from different columns mixed together
Solution: Reading order is enabled by default (XY-Cut++). If columns are still mixed, the PDF may have an irregular layout. For tagged PDFs, try --use-struct-tree.
Issue: Headers/footers appearing in chunks
Solution: These are filtered by default. If they still appear, check whether they are part of the main content flow.
Issue: Tables losing structure
Solution: Use JSON output for tables, which preserves row/column structure.
Issue: Too many small chunks
Solution: Use the merged chunking strategy with a minimum size threshold:
chunks = chunk_with_min_size(doc, min_chars=500)
Framework Integrations
LangChain
OpenDataLoader PDF has an official LangChain integration. Install it separately:
pip install -U langchain-opendataloader-pdf
from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
# Load documents
loader = OpenDataLoaderPDFLoader(
    file_path=["document.pdf", "folder/"],
    format="text",
    quiet=True,
)
documents = loader.load()
# Use with any LangChain pipeline
for doc in documents:
    print(doc.metadata)
    print(doc.page_content[:100])
See examples/python/rag/langchain_example.py for a complete working example.
Configuration options:
| Parameter | Type | Default | Description |
|---|---|---|---|
| file_path | List[str] | Required | PDF files or directories |
| format | str | None | Output format (json, html, markdown, text) |
| quiet | bool | False | Suppress CLI logging |
| content_safety_off | List[str] | None | Disable specific safety filters |
Best Practices Summary
- Keep reading order enabled (the default) for multi-column documents
- Use JSON output when you need bounding boxes for citations
- Use Markdown output for simple text chunking
- Keep AI safety filters on to avoid prompt injection
- Chunk by semantic elements (headings, paragraphs) rather than fixed sizes
- Store bounding boxes with chunks for precise citations
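To illustrate the last practice, here is a toy end-to-end sketch of retrieval in which the page and bounding box travel with each chunk and come back with the search hit. Everything in it is an assumption made for illustration: `embed` is a word-count stand-in for a real embedding model, `InMemoryStore` is a stand-in for a real vector store, and the two chunks are invented data in the chunk shape used earlier in this guide.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts. A real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryStore:
    """Minimal stand-in for a vector store: keeps each vector next to its chunk."""

    def __init__(self):
        self.rows = []

    def add(self, chunk):
        self.rows.append((embed(chunk["text"]), chunk))

    def search(self, query, k=1):
        query_vec = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(query_vec, row[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]


# Invented chunks in the {"text": ..., "metadata": ...} shape used in this guide
chunks = [
    {"text": "Revenue grew in the third quarter.",
     "metadata": {"source": "report.pdf", "page": 3, "bbox": [72.0, 450.0, 540.0, 480.0]}},
    {"text": "The board approved a new dividend policy.",
     "metadata": {"source": "report.pdf", "page": 7, "bbox": [72.0, 300.0, 540.0, 330.0]}},
]

store = InMemoryStore()
for chunk in chunks:
    store.add(chunk)

# The metadata stored with the chunk is what makes the citation possible
hit = store.search("dividend policy")[0]
meta = hit["metadata"]
citation = f'{meta["source"]}, page {meta["page"]}, bbox {meta["bbox"]}'
```

The design point is that the store returns the whole chunk, not just its text, so the citation can be built from the same metadata that was attached at chunking time.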