Tagged PDF for RAG Pipelines

Leverage PDF structure tags for higher-quality AI data extraction in RAG applications

Why Tagged PDFs Improve RAG Quality

Retrieval-Augmented Generation (RAG) systems depend on accurate document parsing. When a PDF carries proper structure tags, the parser reads the structure the author declared (reading order, headings, lists, tables) instead of inferring it from page layout.

Tagged PDF advantages for RAG:

  • Exact reading order — No algorithmic guessing about column layouts
  • Semantic hierarchy — Headings, lists, and sections are explicitly marked
  • Table structure — Row/column relationships are preserved
  • Chunk boundaries — Natural semantic units for vector embedding

Tag-Aware vs Tag-Blind Extraction

Aspect           | Tag-Blind (Heuristics)          | Tag-Aware (Structure Tree)
-----------------|---------------------------------|------------------------------
Reading order    | Inferred from coordinates       | Author-defined, exact
Multi-column     | Often fails on complex layouts  | Correct by design
Headings         | Guessed from font size          | Semantically tagged (H1-H6)
Tables           | Cell boundaries estimated       | Row/column spans preserved
Lists            | Detected by bullet patterns     | List structure explicit
Processing speed | Slower (visual analysis)        | Faster (direct extraction)

Example: Multi-Column Document

Tag-Blind Result:                    Tag-Aware Result:
┌─────────────────────┐              ┌─────────────────────┐
│ Introduction The    │              │ Introduction        │
│ first column text   │              │                     │
│ continues here The  │              │ The first column    │
│ second column has   │              │ text continues here │
│ different content   │              │                     │
└─────────────────────┘              │ The second column   │
  ↑ Columns merged incorrectly       │ has different       │
                                     │ content             │
                                     └─────────────────────┘
                                       ↑ Correct reading order

Using Tagged PDFs in RAG Workflows

Check if a PDF is Tagged

Not all PDFs have structure tags. OpenDataLoader automatically detects and uses tags when available:

import opendataloader_pdf

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    format="json,markdown",
    use_struct_tree=True                # Use tags if present
)

If the PDF lacks structure tags, OpenDataLoader logs a warning and falls back to the XY-Cut++ algorithm for reading order detection.
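
For a quick pre-check outside OpenDataLoader, note that a tagged PDF declares a /StructTreeRoot entry in its document catalog. The sketch below is a crude standard-library heuristic that searches the raw bytes for that key. The helper names looks_tagged and looks_tagged_file are hypothetical and not part of OpenDataLoader's API. The check can return a false negative when the catalog is stored in a compressed object stream, so a PDF parsing library gives a more reliable answer.

```python
from pathlib import Path


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw bytes mention a structure tree root.

    Hypothetical helper, not an OpenDataLoader function. It misses PDFs
    whose catalog lives inside a compressed object stream.
    """
    return b"/StructTreeRoot" in pdf_bytes


def looks_tagged_file(path: str) -> bool:
    """Apply the same heuristic to a file on disk."""
    return looks_tagged(Path(path).read_bytes())
```

A result of False from this check is only a hint; the conversion call above still decides for itself whether to use the structure tree.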

CLI Usage

# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf file1.pdf file2.pdf folder/ \
  --output-dir output/ \
  -f json,markdown \
  --use-struct-tree

Semantic Chunking with Tagged PDFs

Tagged PDFs enable semantic chunking: splitting documents at structural boundaries such as headings and tables rather than at arbitrary character counts.

Strategy 1: Chunk by Heading Level

import json

# Load extracted JSON
with open("output/document.json") as f:
    doc = json.load(f)

# Split into chunks by H1/H2 boundaries
chunks = []
current_chunk = []

for element in doc["kids"]:
    if element.get("type") == "heading" and element.get("heading level") in [1, 2]:
        if current_chunk:
            chunks.append(current_chunk)
        current_chunk = [element]
    else:
        current_chunk.append(element)

if current_chunk:
    chunks.append(current_chunk)
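
Each chunk produced above is a list of element dicts, so it still has to be turned into a string before embedding. A minimal sketch, assuming each text-bearing element stores its text under "content" as in the loop above; chunk_to_text and chunk_title are hypothetical helper names, and the sample elements are made up for illustration.

```python
def chunk_to_text(chunk):
    """Join the text of a chunk's elements, one element per paragraph."""
    # Elements without a "content" key (e.g. images) contribute nothing.
    parts = [(elem.get("content") or "").strip() for elem in chunk]
    return "\n\n".join(p for p in parts if p)


def chunk_title(chunk):
    """Return the chunk's leading heading text, or None if it has none."""
    first = chunk[0] if chunk else {}
    return first.get("content") if first.get("type") == "heading" else None


# Illustrative data only; real elements come from the extracted JSON.
sample_chunk = [
    {"type": "heading", "heading level": 1, "content": "Introduction"},
    {"type": "paragraph", "content": "Tagged PDFs carry structure."},
    {"type": "image"},
]

text = chunk_to_text(sample_chunk)
title = chunk_title(sample_chunk)
```

Storing the title alongside the text as metadata lets a retriever show which section a hit came from.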

Strategy 2: Preserve Semantic Units

Keep related content together (e.g., a heading with its paragraphs):

def semantic_chunk(elements, max_tokens=512):
    """Chunk while preserving semantic units."""
    chunks = []
    current = []
    current_tokens = 0

    for elem in elements:
        # Whitespace word count, used as a rough stand-in for a tokenizer
        elem_tokens = len(elem.get("content", "").split())

        # Start new chunk at major headings (H1)
        is_h1 = elem.get("type") == "heading" and elem.get("heading level") == 1
        if is_h1 and current:
            chunks.append(current)
            current = [elem]
            current_tokens = elem_tokens
        # Or when exceeding the token limit (never flush an empty chunk)
        elif current and current_tokens + elem_tokens > max_tokens:
            chunks.append(current)
            current = [elem]
            current_tokens = elem_tokens
        else:
            current.append(elem)
            current_tokens += elem_tokens

    if current:
        chunks.append(current)

    return chunks
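
semantic_chunk expects a flat list of elements, but extracted JSON can nest elements inside containers. A depth-first flatten, sketched below, produces that flat list in document order. It assumes children are stored under "kids", as at the document root; the actual schema may use other keys for some containers, so child_keys is left configurable. The function name, the keep_whole parameter, and the sample type names are hypothetical.

```python
def iter_leaf_elements(node, child_keys=("kids",), keep_whole=("table",)):
    """Yield leaf elements under `node` depth-first, in document order.

    Elements whose type is in `keep_whole` are yielded intact rather than
    descended into, so they can later be chunked as a single unit.
    """
    for key in child_keys:
        for child in node.get(key, []):
            has_children = any(child.get(k) for k in child_keys)
            if child.get("type") in keep_whole or not has_children:
                yield child
            else:
                yield from iter_leaf_elements(child, child_keys, keep_whole)


# Illustrative document; type names below are assumptions.
sample_doc = {
    "kids": [
        {"type": "heading", "heading level": 1, "content": "A"},
        {"type": "list", "kids": [
            {"type": "list item", "content": "one"},
            {"type": "list item", "content": "two"},
        ]},
        {"type": "table", "kids": [{"type": "cell", "content": "x"}]},
    ]
}

flat = list(iter_leaf_elements(sample_doc))
```

The resulting list can be passed directly as the elements argument of the chunking function above.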

Strategy 3: Table-Aware Chunking

Never split tables across chunks:

def table_aware_chunk(elements, max_tokens=512):
    """Keep tables intact during chunking."""
    chunks = []
    current = []
    current_tokens = 0

    for elem in elements:
        elem_tokens = len(elem.get("content", "").split())

        # Tables stay together regardless of size
        if elem.get("type") == "table":
            if current:
                chunks.append(current)
            chunks.append([elem])  # Table as its own chunk
            current = []
            current_tokens = 0
        elif current and current_tokens + elem_tokens > max_tokens:
            chunks.append(current)
            current = [elem]
            current_tokens = elem_tokens
        else:
            current.append(elem)
            current_tokens += elem_tokens

    if current:
        chunks.append(current)

    return chunks
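
A table placed in its own chunk loses the heading that introduced it, which can hurt retrieval. One option is to record the enclosing headings for every element before chunking and attach them as metadata. The sketch below assumes the "type", "heading level", and "content" keys used above; with_heading_path is a hypothetical helper and the sample elements are made up.

```python
def with_heading_path(elements):
    """Pair each element with the titles of the headings that enclose it."""
    path = []  # stack of (level, title), outermost heading first
    pairs = []
    for elem in elements:
        if elem.get("type") == "heading":
            level = elem.get("heading level") or 1
            # A new heading closes any open heading at the same or deeper level.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, elem.get("content") or ""))
        pairs.append((elem, [title for _, title in path]))
    return pairs


# Illustrative data only.
sample_elements = [
    {"type": "heading", "heading level": 1, "content": "Results"},
    {"type": "heading", "heading level": 2, "content": "Latency"},
    {"type": "table"},
    {"type": "heading", "heading level": 2, "content": "Cost"},
    {"type": "paragraph", "content": "Text"},
]

pairs = with_heading_path(sample_elements)
```

Here the table is paired with the path Results, Latency, which can be prepended to the table's text or stored as chunk metadata.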

Handling Mixed Documents

Real-world PDF collections contain both tagged and untagged documents. OpenDataLoader processes both in a single call and chooses the extraction method per file:

import opendataloader_pdf

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    format="json,markdown",
    use_struct_tree=True                # Auto-fallback if no tags
)

Behavior:

  • If PDF has tags → Uses structure tree (exact)
  • If PDF lacks tags → Falls back to XY-Cut++ (heuristic)
  • Logs indicate which method was used

Auto-Tagging Untagged PDFs

Many legacy PDFs lack structure tags. The Auto-Tagging Engine generates tags automatically:

opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    format="tagged-pdf"                 # Generate Tagged PDF
)

This enables RAG-quality extraction even for older documents.

Integration with RAG Frameworks

LangChain Integration

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

loader = OpenDataLoaderPDFLoader(
    file_path=["file1.pdf", "file2.pdf", "folder/"],
    format="text",
    use_struct_tree=True,
)
documents = loader.load()
