Reading Order & XY-Cut++

How OpenDataLoader PDF handles multi-column layouts and preserves correct reading order

The Multi-Column Problem

PDF files don't store text in reading order. They store drawing instructions — "draw this glyph at position (x, y)". When you have a two-column academic paper or a newspaper layout, naive text extraction reads left-to-right across the entire page, mixing content from different columns:

❌ Wrong extraction:
"Introduction    Methods
This paper...   We used..."

✅ Correct extraction:
"Introduction
This paper presents a novel approach...

Methods
We used the following methodology..."

This is one of the most common complaints about PDF parsers in RAG pipelines. Jumbled text destroys context and confuses LLMs.
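The failure mode is easy to reproduce with a toy example. The sketch below is illustrative only: the `Word` type, the coordinates, and the fixed `split_x` column boundary are assumptions made for the demonstration, not part of OpenDataLoader's API. It assumes a top-left origin with y increasing down the page.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge
    y: float  # distance from the top of the page (assumed top-left origin)


# Two columns, two rows (hypothetical coordinates)
words = [
    Word("Methods", 320, 100),
    Word("This paper...", 50, 120),
    Word("Introduction", 50, 100),
    Word("We used...", 320, 120),
]


def naive_order(words):
    """Row by row, left to right: ignores column boundaries."""
    return [w.text for w in sorted(words, key=lambda w: (w.y, w.x))]


def column_order(words, split_x):
    """Read everything left of split_x top to bottom, then everything right of it."""
    left = sorted((w for w in words if w.x < split_x), key=lambda w: w.y)
    right = sorted((w for w in words if w.x >= split_x), key=lambda w: w.y)
    return [w.text for w in left + right]


print(naive_order(words))
# ['Introduction', 'Methods', 'This paper...', 'We used...']
print(column_order(words, split_x=300))
# ['Introduction', 'This paper...', 'Methods', 'We used...']
```

A real page has no known `split_x`; finding the column boundaries automatically is the job of the segmentation described in the next section.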

How XY-Cut++ Works

OpenDataLoader uses the XY-Cut++ algorithm, an enhanced version of the classic XY-Cut recursive segmentation. It works in four phases:

Phase 1: Cross-Layout Detection

First, we identify elements that span multiple columns — headers, footers, and full-width titles. These are extracted separately so they don't interfere with column detection.

┌─────────────────────────────────┐
│      DOCUMENT TITLE             │  ← Cross-layout (full width)
├───────────────┬─────────────────┤
│ Column 1      │ Column 2        │
│ text...       │ text...         │
│ text...       │ text...         │
├───────────────┴─────────────────┤
│      Page Footer                │  ← Cross-layout (full width)
└─────────────────────────────────┘
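One plausible way to flag cross-layout elements is to compare each element's width with the typical width on the page. The sketch below assumes that rule (wider than `beta` times the median width); the actual criterion in `XYCutPlusPlusSorter.java` may differ, and the `Box` type and `split_cross_layout` function are hypothetical names used only for illustration.

```python
from statistics import median
from typing import NamedTuple


class Box(NamedTuple):
    label: str
    x0: float  # left
    y0: float  # top (top-left origin assumed)
    x1: float  # right
    y1: float  # bottom


def split_cross_layout(boxes, beta=2.0):
    """Separate boxes much wider than the typical box from the rest.

    Assumed rule: a box is cross-layout when width > beta * median width.
    """
    if not boxes:
        return [], []
    typical_width = median(b.x1 - b.x0 for b in boxes)
    cross, body = [], []
    for b in boxes:
        is_cross = (b.x1 - b.x0) > beta * typical_width
        (cross if is_cross else body).append(b)
    return cross, body


page = [
    Box("title", 50, 40, 550, 80),
    Box("col1-a", 50, 100, 290, 300),
    Box("col2-a", 310, 100, 550, 300),
    Box("col1-b", 50, 310, 290, 500),
    Box("col2-b", 310, 310, 550, 500),
    Box("footer", 50, 520, 550, 540),
]
cross, body = split_cross_layout(page)
print([b.label for b in cross])  # ['title', 'footer']
```

With the column blocks set aside in `body`, the gap between the two columns is no longer bridged by the title or footer.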

Phase 2: Density Analysis

We calculate the content density ratio to determine whether the layout is content-dense (like newspapers) or sparse:

  • High density (>0.9): Prefer horizontal cuts first
  • Low density (≤0.9): Prefer vertical cuts first

This adaptive approach handles different document styles correctly.
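As a rough illustration, the density ratio can be thought of as the fraction of a region covered by content. The sketch below assumes that definition (summed box area divided by the area of the enclosing region); the exact formula OpenDataLoader uses is not stated here, and `density_ratio` and `preferred_first_cut` are hypothetical helper names.

```python
from typing import NamedTuple


class Box(NamedTuple):
    label: str
    x0: float  # left
    y0: float  # top (top-left origin assumed)
    x1: float  # right
    y1: float  # bottom


def density_ratio(boxes):
    """Summed box area over the area of the region enclosing all boxes.

    Assumes boxes do not overlap; overlapping boxes would be counted twice.
    """
    if not boxes:
        return 0.0
    width = max(b.x1 for b in boxes) - min(b.x0 for b in boxes)
    height = max(b.y1 for b in boxes) - min(b.y0 for b in boxes)
    if width <= 0 or height <= 0:
        return 0.0
    covered = sum((b.x1 - b.x0) * (b.y1 - b.y0) for b in boxes)
    return covered / (width * height)


def preferred_first_cut(boxes, density_threshold=0.9):
    """Apply the rule from the list above: dense pages try horizontal cuts first."""
    if density_ratio(boxes) > density_threshold:
        return "horizontal"
    return "vertical"


dense = [Box("a", 0, 0, 100, 50), Box("b", 0, 50, 100, 100)]
sparse = [Box("left", 0, 0, 40, 100), Box("right", 60, 0, 100, 100)]
print(preferred_first_cut(dense))   # horizontal
print(preferred_first_cut(sparse))  # vertical
```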

Phase 3: Recursive Segmentation

The algorithm recursively divides the page by finding the largest gaps:

  1. Project all content onto the X-axis and Y-axis
  2. Find the largest gap in each direction
  3. Cut along the axis with the larger gap
  4. Repeat recursively until regions contain single columns

For a typical two-column page, the steps play out as:

Step 1: Find vertical gap → Split into left/right columns
Step 2: Within each column, find horizontal gaps → Split into blocks
Step 3: Order blocks top-to-bottom within each column
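The recursion can be sketched in a few lines of Python. This is a minimal version of classic gap-based XY-Cut under stated assumptions, not the project's Java implementation: it omits the density-based axis preference from Phase 2, breaks ties in favor of the vertical cut, and uses hypothetical names (`Box`, `largest_gap`, `xy_cut`). Coordinates assume a top-left origin.

```python
from typing import NamedTuple


class Box(NamedTuple):
    label: str
    x0: float  # left
    y0: float  # top (top-left origin assumed)
    x1: float  # right
    y1: float  # bottom


def largest_gap(intervals, min_gap):
    """Return (gap_size, cut_position) for the widest empty run, or None."""
    intervals = sorted(intervals)
    best = None
    reach = intervals[0][1]  # farthest point covered so far
    for start, end in intervals[1:]:
        gap = start - reach
        if gap >= min_gap and (best is None or gap > best[0]):
            best = (gap, (reach + start) / 2)  # cut in the middle of the gap
        reach = max(reach, end)
    return best


def xy_cut(boxes, min_gap=5.0):
    """Return boxes in reading order by recursively cutting at the largest gap."""
    if len(boxes) <= 1:
        return list(boxes)
    x_gap = largest_gap([(b.x0, b.x1) for b in boxes], min_gap)
    y_gap = largest_gap([(b.y0, b.y1) for b in boxes], min_gap)
    if x_gap is None and y_gap is None:
        # No usable gap left: fall back to top-to-bottom, left-to-right
        return sorted(boxes, key=lambda b: (b.y0, b.x0))
    if y_gap is None or (x_gap is not None and x_gap[0] >= y_gap[0]):
        cut = x_gap[1]  # vertical cut: left part is read before right part
        first = [b for b in boxes if b.x1 <= cut]
        second = [b for b in boxes if b.x0 >= cut]
    else:
        cut = y_gap[1]  # horizontal cut: top part is read before bottom part
        first = [b for b in boxes if b.y1 <= cut]
        second = [b for b in boxes if b.y0 >= cut]
    return xy_cut(first, min_gap) + xy_cut(second, min_gap)


blocks = [
    Box("Methods", 320, 100, 550, 115),
    Box("This paper...", 50, 125, 280, 300),
    Box("Introduction", 50, 100, 280, 115),
    Box("We used...", 320, 125, 550, 300),
]
print([b.label for b in xy_cut(blocks)])
# ['Introduction', 'This paper...', 'Methods', 'We used...']
```

Because each cut falls in the middle of an empty gap, every box lands entirely on one side, and both sides are non-empty, so the recursion always terminates.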

Phase 4: Merge Cross-Layout Elements

Finally, cross-layout elements (headers, footers) are reinserted at the correct positions based on their Y-coordinates.
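A simple way to picture the merge, assuming a top-left origin: each cross-layout element is placed before the first body element that starts at or below its bottom edge, or at the end if there is none. The sketch below covers only the header and footer case; a full-width element in the middle of a page needs more care, and `merge_cross_layout` is a hypothetical name, not the library's function.

```python
from typing import NamedTuple


class Box(NamedTuple):
    label: str
    x0: float  # left
    y0: float  # top (top-left origin assumed)
    x1: float  # right
    y1: float  # bottom


def merge_cross_layout(body, cross):
    """Reinsert cross-layout boxes into an already ordered body list by Y position."""
    merged = list(body)
    for c in sorted(cross, key=lambda b: b.y0):
        # Index of the first element that starts at or below this box's bottom edge
        idx = next(
            (i for i, b in enumerate(merged) if b.y0 >= c.y1),
            len(merged),
        )
        merged.insert(idx, c)
    return merged


ordered_body = [
    Box("Introduction", 50, 100, 280, 115),
    Box("This paper...", 50, 125, 280, 300),
    Box("Methods", 320, 100, 550, 115),
    Box("We used...", 320, 125, 550, 300),
]
cross_layout = [
    Box("Page Footer", 50, 320, 550, 335),
    Box("DOCUMENT TITLE", 50, 40, 550, 80),
]
print([b.label for b in merge_cross_layout(ordered_body, cross_layout)])
# ['DOCUMENT TITLE', 'Introduction', 'This paper...', 'Methods', 'We used...', 'Page Footer']
```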

Why This Matters for RAG

Correct reading order is essential for:

  • Chunking: Semantic chunks should contain coherent text, not mixed columns
  • Context windows: LLMs need text in the order humans would read it
  • Citations: Bounding boxes are only useful if the text they reference is correct

Usage

XY-Cut++ is enabled by default. No configuration needed:

import opendataloader_pdf

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    format="markdown,json",
)

To disable reading-order sorting and keep the raw PDF order, pass --reading-order off on the command line:

# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf --reading-order off file1.pdf file2.pdf folder/

Comparison with Other Approaches

| Approach | Pros | Cons |
|---|---|---|
| Raw extraction | Fast | Wrong order, unusable for RAG |
| ML-based | Can learn complex layouts | GPU required, variable output |
| XY-Cut++ (OpenDataLoader) | Deterministic, fast, no GPU | May struggle with very irregular layouts |

Technical Details

The algorithm is implemented in:

  • XYCutPlusPlusSorter.java — Main algorithm

Key parameters:

  • Beta threshold (default: 2.0): Controls cross-layout element detection
  • Density threshold (default: 0.9): Switches between horizontal/vertical preference
  • Minimum gap (default: 5.0 points): Prevents splitting on insignificant gaps

When to Disable Reading Order

Reading order is enabled by default and works well for most documents. Disabling it (--reading-order off) is rarely needed; consider the following cases:

| Use Case | Notes |
|---|---|
| Debugging | Compare xycut output vs raw PDF order |
| Custom post-processing | When your pipeline handles ordering |
| Tagged PDFs | Use --use-struct-tree instead (not off) |
