RAG Integration Guide

Why PDF Parsing Matters for RAG

RAG (Retrieval-Augmented Generation) systems retrieve relevant context from documents to ground LLM responses. The quality of your PDF parsing directly impacts:

Retrieval accuracy: Poorly parsed text → wrong chunks retrieved
Answer quality: Jumbled text → confused LLM responses
Citation accuracy: No coordinates → can't point to source location

OpenDataLoader is designed specifically for RAG pipelines, providing structured output with bounding boxes for every element.

Basic RAG Workflow

┌─────────────┐    ┌──────────────────┐    ┌─────────────┐
│   PDF       │ →  │  OpenDataLoader  │ →  │  Markdown/  │
│   Files     │    │  PDF             │    │  JSON       │
└─────────────┘    └──────────────────┘    └─────────────┘
                                                  ↓
┌─────────────┐    ┌──────────────────┐    ┌─────────────┐
│   LLM       │ ←  │  Vector Store    │ ←  │  Chunking   │
│   Response  │    │  (Retrieval)     │    │  & Embed    │
└─────────────┘    └──────────────────┘    └─────────────┘

Working Examples

Complete, runnable examples are available in the repository:

git clone https://github.com/opendataloader-project/opendataloader-pdf
cd opendataloader-pdf/examples/python/rag

# Basic chunking (no external dependencies)
pip install opendataloader-pdf
python basic_chunking.py

# LangChain integration
pip install -r requirements.txt
python langchain_example.py

See examples/python/rag for details.

Quick Start

Step 1: Convert PDFs

import opendataloader_pdf

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    format="json,markdown",
    quiet=True,
)

Step 2: Load and Chunk

import json

with open("output/document.json", encoding="utf-8") as f:
    doc = json.load(f)

# Chunk by semantic elements
chunks = []
for element in doc["kids"]:
    if element["type"] in ("paragraph", "heading", "list"):
        chunks.append({
            "text": element.get("content", ""),
            "metadata": {
                "type": element["type"],
                "page": element.get("page number"),
                "bbox": element.get("bounding box"),
                "source": doc.get("file name"),
            }
        })
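
Step 1 converts several PDFs at once, while the snippet above reads a single JSON file. A minimal sketch for chunking every converted file follows. It assumes each PDF produces one JSON file directly inside the output directory (as the `output/document.json` path above suggests); `load_chunks` and `TEXT_TYPES` are hypothetical names introduced for illustration, not part of the library.

```python
import json
from pathlib import Path

# Element types treated as text chunks (same set as in Step 2)
TEXT_TYPES = ("paragraph", "heading", "list")

def load_chunks(output_dir):
    """Chunk every JSON file found directly inside output_dir (assumed layout)."""
    chunks = []
    for json_path in sorted(Path(output_dir).glob("*.json")):
        with open(json_path, encoding="utf-8") as f:
            doc = json.load(f)
        for element in doc.get("kids", []):
            if element.get("type") in TEXT_TYPES:
                chunks.append({
                    "text": element.get("content", ""),
                    "metadata": {
                        "type": element["type"],
                        "page": element.get("page number"),
                        "bbox": element.get("bounding box"),
                        # Fall back to the JSON file name if "file name" is absent
                        "source": doc.get("file name", json_path.name),
                    },
                })
    return chunks
```

Calling `load_chunks("output/")` would then return one flat list of chunks across all converted documents, each tagged with its source file.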

Step 3: Embed and Store

Each chunk is ready for your embedding model and vector store:

for chunk in chunks:
    text = chunk["text"]           # Text to embed
    metadata = chunk["metadata"]   # Page, bbox, source for citations

    # Your embedding step:
    # embedding = your_model.embed(text)
    # vector_store.add(embedding, metadata=metadata)
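
To make the embed-and-store step concrete without committing to a particular model or database, here is a toy sketch. The `embed` function is a word-count stand-in for a real embedding model, and `InMemoryStore` is a hypothetical class, not an OpenDataLoader or vector-database API. It only illustrates how chunk metadata travels with the text through retrieval.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts. Replace with a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class InMemoryStore:
    """Hypothetical stand-in for a vector store; keeps everything in a list."""
    def __init__(self):
        self.items = []

    def add(self, chunk):
        # Store the vector together with the full chunk (text + metadata)
        self.items.append((embed(chunk["text"]), chunk))

    def search(self, query, k=3):
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]

# Sample chunks in the shape produced by Step 2 (values are made up)
chunks = [
    {"text": "Revenue grew 12 percent in 2023.", "metadata": {"page": 3}},
    {"text": "The board met twice in March.", "metadata": {"page": 7}},
]
store = InMemoryStore()
for chunk in chunks:
    store.add(chunk)

top = store.search("How much did revenue grow?", k=1)
print(top[0]["metadata"])  # metadata of the best-matching chunk
```

Because the whole chunk is stored next to its vector, the page and bounding box are available for citations as soon as a chunk is retrieved.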

Using Bounding Boxes for Citations

OpenDataLoader provides bounding boxes for every element, enabling precise source citations:

import json

with open("output/document.json", encoding="utf-8") as f:
    doc = json.load(f)

# Extract elements with locations
for element in doc["kids"]:
    content = element.get("content", "")
    bbox = element.get("bounding box")  # [left, bottom, right, top]
    page = element.get("page number")
    element_type = element.get("type")

    # Store with your chunks for citation
    chunk_metadata = {
        "page": page,
        "bbox": bbox,
        "type": element_type
    }

Citation Format Example

When your RAG system retrieves a chunk, you can generate precise citations:

def format_citation(metadata):
    source = metadata.get("source", "unknown")
    page = metadata.get("page")
    bbox = metadata.get("bbox")

    citation = f"Source: {source}"
    if page:
        citation += f", Page {page}"
    if bbox:
        citation += f", Position ({bbox[0]:.0f}, {bbox[1]:.0f})"
    return citation

print(format_citation({"source": "document.pdf", "page": 3, "bbox": [72, 450, 300, 470]}))
# Output: Source: document.pdf, Page 3, Position (72, 450)
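
If you want to highlight the cited region on a rendered page image, the bounding box has to be mapped to pixel coordinates. The sketch below assumes the box is `[left, bottom, right, top]` in PDF points (1/72 inch) with the origin at the bottom-left of the page, and that you know the page height in points from your renderer. Both are assumptions to verify against your own output. `bbox_to_pixels` is a hypothetical helper.

```python
def bbox_to_pixels(bbox, page_height_pt, dpi=144):
    """Map a [left, bottom, right, top] box in points to (x0, y0, x1, y1) pixels.

    Assumes a bottom-left origin in the PDF and a top-left origin in the image,
    so the vertical axis is flipped.
    """
    left, bottom, right, top = bbox
    scale = dpi / 72.0  # pixels per PDF point
    x0 = left * scale
    y0 = (page_height_pt - top) * scale
    x1 = right * scale
    y1 = (page_height_pt - bottom) * scale
    return tuple(round(v) for v in (x0, y0, x1, y1))

# US Letter page (792 pt tall) rendered at 144 DPI
print(bbox_to_pixels([72, 450, 300, 470], page_height_pt=792))
# → (144, 644, 600, 684)
```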

Chunking Strategies

By Semantic Elements

Create one chunk per paragraph, heading, or list element:

def chunk_by_element(doc):
    """Best for: Fine-grained retrieval, precise citations."""
    chunks = []
    for element in doc["kids"]:
        if element["type"] in ("paragraph", "heading", "list"):
            chunks.append({
                "text": element.get("content", ""),
                "metadata": {
                    "type": element["type"],
                    "page": element.get("page number"),
                    "bbox": element.get("bounding box"),
                    "source": doc.get("file name"),
                }
            })
    return chunks

By Headings (Sections)

Group content under headings into coherent sections:

def chunk_by_section(doc):
    """Best for: Context-rich retrieval, topic-based search."""
    chunks = []
    current_heading = None
    current_content = []
    current_start_page = None

    for element in doc["kids"]:
        if element["type"] == "heading":
            if current_content:
                chunks.append({
                    "text": "\n".join(current_content),
                    "metadata": {
                        "heading": current_heading,
                        "page": current_start_page,
                        "source": doc.get("file name"),
                    }
                })
            current_heading = element.get("content", "")
            current_content = [current_heading]
            current_start_page = element.get("page number")
        elif element["type"] in ("paragraph", "list"):
            content = element.get("content", "")
            if content:
                current_content.append(content)

    # Save the last section
    if current_content:
        chunks.append({
            "text": "\n".join(current_content),
            "metadata": {"heading": current_heading, "page": current_start_page, "source": doc.get("file name")}
        })

    return chunks

Merged Chunks (Minimum Size)

Combine small paragraphs to avoid overly fragmented chunks:

def chunk_with_min_size(doc, min_chars=200):
    """Best for: Balanced chunk sizes, reducing noise."""
    chunks = []
    buffer_text = ""
    buffer_pages = []

    for element in doc["kids"]:
        if element["type"] in ("paragraph", "heading", "list"):
            buffer_text += element.get("content", "") + "\n"
            page = element.get("page number")
            if page and page not in buffer_pages:
                buffer_pages.append(page)

            if len(buffer_text) >= min_chars:
                chunks.append({
                    "text": buffer_text.strip(),
                    "metadata": {"pages": buffer_pages.copy(), "source": doc.get("file name")}
                })
                buffer_text = ""
                buffer_pages = []

    if buffer_text.strip():
        chunks.append({
            "text": buffer_text.strip(),
            "metadata": {"pages": buffer_pages, "source": doc.get("file name")}
        })

    return chunks

Tables as Separate Chunks

Tables often contain dense information. Chunk them separately:

for element in doc["kids"]:
    if element["type"] == "table":
        chunks.append({
            "type": "table",
            "content": element,  # Keep full structure
            "page": element.get("page number")
        })
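
A table chunk that keeps the full structure still needs some plain text to embed. The sketch below walks a nested element and collects every string stored under a `"content"` key. It deliberately assumes nothing about the table schema beyond that key; the `rows` and `cells` names in the sample are made up for illustration, so inspect your own JSON output before relying on it. `collect_text` is a hypothetical helper.

```python
def collect_text(node):
    """Recursively gather every non-empty string stored under a "content" key."""
    texts = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "content" and isinstance(value, str):
                if value.strip():
                    texts.append(value.strip())
            else:
                # Descend into nested dicts/lists; other scalars yield nothing
                texts.extend(collect_text(value))
    elif isinstance(node, list):
        for item in node:
            texts.extend(collect_text(item))
    return texts

# Made-up table element; the real nesting may differ
table = {
    "type": "table",
    "page number": 2,
    "rows": [
        {"cells": [{"kids": [{"type": "paragraph", "content": "Year"}]},
                   {"kids": [{"type": "paragraph", "content": "Revenue"}]}]},
        {"cells": [{"kids": [{"type": "paragraph", "content": "2023"}]},
                   {"kids": [{"type": "paragraph", "content": "10M"}]}]},
    ],
}
table_text = " | ".join(collect_text(table))
print(table_text)
```

The joined string can be embedded for retrieval while the original element is kept alongside it for display or structured lookups.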

Handling Different Document Types

Academic Papers (Multi-Column)

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["paper1.pdf", "paper2.pdf", "papers/"],
    output_dir="output/",
    format="json,markdown",
)

Financial Reports (Table-Heavy)

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["report1.pdf", "report2.pdf", "reports/"],
    output_dir="output/",
    format="json",                      # JSON preserves table structure
)

Legal Documents (Long Text)

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["contract1.pdf", "contract2.pdf", "contracts/"],
    output_dir="output/",
    format="markdown",
)

Filtering Noise

OpenDataLoader automatically filters content that would pollute your RAG context:

Headers/footers: Repeated page elements removed
Hidden text: Transparent or off-page content filtered
Watermarks: Background elements excluded

This is enabled by default. To disable (not recommended for RAG):

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    content_safety_off="all"            # Disable all filters
)

Performance Tips

Batch Processing

Process multiple files in a single call to avoid repeated Java startup overhead:

import opendataloader_pdf

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["report1.pdf", "report2.pdf", "report3.pdf"],
    output_dir="output/",
    format="json,markdown",
    quiet=True,
)

# Or process an entire folder (recursive)
opendataloader_pdf.convert(
    input_path="documents/",
    output_dir="output/",
    format="json,markdown",
    quiet=True,
)

CLI equivalent:

# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf report1.pdf report2.pdf report3.pdf folder/ --format json,markdown --output-dir output/

Why batch matters: Each CLI invocation starts a new Java process (~1-2s overhead). Passing all files in one command processes them in a single JVM, which is significantly faster for large document collections.
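
For collections that grow over time, one way to keep batches small is to convert only the PDFs that have no output yet, still in a single `convert()` call. This sketch assumes each PDF produces `<stem>.json` directly in the output directory; that naming is an assumption to check against your setup. `pending_pdfs` and `convert_pending` are hypothetical helpers.

```python
from pathlib import Path

def pending_pdfs(input_dir, output_dir):
    """Return PDFs under input_dir that have no <stem>.json in output_dir yet."""
    out = Path(output_dir)
    return [
        str(pdf)
        for pdf in sorted(Path(input_dir).rglob("*.pdf"))
        if not (out / f"{pdf.stem}.json").exists()
    ]

def convert_pending(input_dir, output_dir):
    """Convert all pending PDFs in one call, so the JVM starts only once."""
    # Imported lazily so pending_pdfs() works even where the package is absent
    import opendataloader_pdf

    todo = pending_pdfs(input_dir, output_dir)
    if todo:
        opendataloader_pdf.convert(
            input_path=todo,
            output_dir=str(output_dir),
            format="json,markdown",
            quiet=True,
        )
    return todo
```

Note that two PDFs with the same file name in different subfolders would collide under this naming assumption, so a real pipeline may need a more careful mapping.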

Output Format Selection

Format	Use Case	Size
`markdown`	Text for chunking/embedding	Smallest
`json`	Structured data with metadata	Medium
`json,markdown`	Both (recommended for RAG)	Larger

Common Issues and Solutions

Issue: Text from different columns mixed together

Solution: Reading order detection (XY-Cut++) is enabled by default. If columns are still mixed, the PDF may have an irregular layout; for tagged PDFs, try the --use-struct-tree option.

Issue: Headers/footers appearing in chunks

Solution: These are filtered by default. If they still appear, check whether they are part of the main content flow.

Issue: Tables losing structure

Solution: Use JSON output for tables, which preserves row/column structure.

Issue: Too many small chunks

Solution: Use the merged chunking strategy with a minimum size threshold:

chunks = chunk_with_min_size(doc, min_chars=500)
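
Before picking a threshold, it can help to look at how long your chunks actually are. The following is a small sketch that summarizes chunk lengths; `chunk_length_stats` is a hypothetical helper, and it only assumes chunks shaped like `{"text": ...}` as in the strategies above.

```python
import statistics

def chunk_length_stats(chunks, min_chars=200):
    """Summarize chunk text lengths to help choose a min_chars threshold."""
    lengths = [len(c["text"]) for c in chunks]
    if not lengths:
        return {"count": 0, "median": 0, "below_min": 0}
    return {
        "count": len(lengths),
        "median": statistics.median(lengths),
        # Number of chunks shorter than the candidate threshold
        "below_min": sum(1 for n in lengths if n < min_chars),
    }

# Made-up chunks of 10, 300, and 50 characters
sample = [{"text": "a" * 10}, {"text": "b" * 300}, {"text": "c" * 50}]
print(chunk_length_stats(sample, min_chars=200))
```

If most chunks fall below the threshold, merging is likely to help; if only a few do, element-level chunks may already be large enough.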

Framework Integrations

LangChain

OpenDataLoader PDF has an official LangChain integration. Install it separately:

pip install -U langchain-opendataloader-pdf

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

# Load documents
loader = OpenDataLoaderPDFLoader(
    file_path=["document.pdf", "folder/"],
    format="text",
    quiet=True,
)
documents = loader.load()

# Use with any LangChain pipeline
for doc in documents:
    print(doc.metadata)
    print(doc.page_content[:100])

See examples/python/rag/langchain_example.py for a complete working example.

Configuration options:

Parameter	Type	Default	Description
`file_path`	List[str]	Required	PDF files or directories
`format`	str	None	Output format (json, html, markdown, text)
`quiet`	bool	False	Suppress CLI logging
`content_safety_off`	List[str]	None	Disable specific safety filters

Best Practices Summary

Keep reading order enabled (the default) for multi-column documents
Use JSON output when you need bounding boxes for citations
Use Markdown output for simple text chunking
Keep the content safety filters on to reduce the risk of prompt injection from hidden text
Chunk by semantic elements (headings, paragraphs) rather than fixed sizes
Store bounding boxes with chunks for precise citations
