Reading Order & XY-Cut++

How OpenDataLoader PDF handles multi-column layouts and preserves correct reading order

The Multi-Column Problem

PDF files don't store text in reading order. They store drawing instructions — "draw this glyph at position (x, y)". When you have a two-column academic paper or a newspaper layout, naive text extraction reads left-to-right across the entire page, mixing content from different columns:

❌ Wrong extraction:
"Introduction    Methods
This paper...   We used..."

✅ Correct extraction:
"Introduction
This paper presents a novel approach...

Methods
We used the following methodology..."

This is one of the most common complaints about PDF parsers in RAG pipelines. Jumbled text destroys context and confuses LLMs.
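The effect is easy to reproduce with a few positioned words. The sketch below is purely illustrative and does not use OpenDataLoader: the coordinates, the `naive_order` and `column_order` helpers, and the `split_x` value are all assumptions made for the example.

```python
# Each word is (text, x, y), with y growing downward (an assumption for this sketch).
words = [
    ("Introduction", 50, 100), ("Methods", 320, 100),
    ("This paper...", 50, 120), ("We used...", 320, 120),
]

def naive_order(words):
    """Sort by line (y), then left to right (x), as a naive extractor effectively does."""
    return [w[0] for w in sorted(words, key=lambda w: (w[2], w[1]))]

def column_order(words, split_x):
    """Read everything left of split_x top to bottom, then everything to its right."""
    left = sorted((w for w in words if w[1] < split_x), key=lambda w: w[2])
    right = sorted((w for w in words if w[1] >= split_x), key=lambda w: w[2])
    return [w[0] for w in left + right]

naive = naive_order(words)
by_column = column_order(words, split_x=300)
print(naive)      # columns interleaved line by line
print(by_column)  # left column first, then right column
```

The hard part in practice is finding `split_x` automatically, which is what the segmentation described in the next section is for.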

How XY-Cut++ Works

OpenDataLoader uses the XY-Cut++ algorithm, an enhanced version of the classic XY-Cut recursive segmentation. It works in four phases:

Phase 1: Cross-Layout Detection

First, we identify elements that span multiple columns — headers, footers, and full-width titles. These are extracted separately so they don't interfere with column detection.

┌─────────────────────────────────┐
│      DOCUMENT TITLE             │  ← Cross-layout (full width)
├───────────────┬─────────────────┤
│ Column 1      │ Column 2        │
│ text...       │ text...         │
│ text...       │ text...         │
├───────────────┴─────────────────┤
│      Page Footer                │  ← Cross-layout (full width)
└─────────────────────────────────┘
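One plausible way to detect such elements is a width test against the typical block width. The sketch below assumes that rule (a box counts as cross-layout when it is at least `beta` times wider than the median box); the actual criterion in OpenDataLoader may differ, and `split_cross_layout` and the sample coordinates are hypothetical.

```python
from statistics import median

# Boxes are (label, left, top, right, bottom) in points, with y growing downward.
# The layout and the width rule below are illustrative assumptions.
page = [
    ("DOCUMENT TITLE", 50, 40, 550, 70),
    ("Column 1 text", 50, 100, 290, 400),
    ("Column 1 more", 50, 410, 290, 680),
    ("Column 2 text", 310, 100, 550, 400),
    ("Column 2 more", 310, 410, 550, 680),
    ("Page Footer", 50, 700, 550, 720),
]

def split_cross_layout(boxes, beta=2.0):
    """Separate boxes at least beta times wider than the median box from the rest."""
    widths = [right - left for _, left, _, right, _ in boxes]
    threshold = beta * median(widths)
    cross = [b for b, w in zip(boxes, widths) if w >= threshold]
    regular = [b for b, w in zip(boxes, widths) if w < threshold]
    return cross, regular

cross, regular = split_cross_layout(page)
print([b[0] for b in cross])    # the full-width title and footer
print([b[0] for b in regular])  # the four column blocks
```

Here the median width is 240 points, so the threshold is 480 and only the 500-point-wide title and footer are set aside.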

Phase 2: Density Analysis

We calculate the content density ratio to determine whether the layout is content-dense (like newspapers) or sparse:

High density (>0.9): Prefer horizontal cuts first
Low density (0.9 or below): Prefer vertical cuts first

This adaptive approach handles different document styles correctly.
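As a rough illustration, the density ratio can be thought of as the share of the content region that is covered by element boxes. That definition is an assumption made for this sketch, as are the `density_ratio` and `preferred_first_cut` helpers; only the 0.9 threshold comes from this page.

```python
# Boxes are (left, top, right, bottom), with y growing downward (assumed convention).

def density_ratio(boxes):
    """Share of the content bounding region covered by boxes (overlaps are not deduplicated)."""
    left = min(b[0] for b in boxes)
    top = min(b[1] for b in boxes)
    right = max(b[2] for b in boxes)
    bottom = max(b[3] for b in boxes)
    region_area = (right - left) * (bottom - top)
    covered = sum((b[2] - b[0]) * (b[3] - b[1]) for b in boxes)
    return covered / region_area if region_area else 0.0

def preferred_first_cut(boxes, threshold=0.9):
    """Pick which cut direction to try first, based on the density threshold."""
    return "horizontal" if density_ratio(boxes) > threshold else "vertical"

# Two stacked full-width blocks with a thin gap: dense.
dense = [(0, 0, 100, 48), (0, 50, 100, 100)]
# Two narrow columns with a wide gutter: sparse.
sparse = [(0, 0, 40, 100), (60, 0, 100, 100)]

print(density_ratio(dense), preferred_first_cut(dense))
print(density_ratio(sparse), preferred_first_cut(sparse))
```

The dense sample covers 98% of its region and prefers a horizontal cut; the sparse one covers 80% and prefers a vertical cut.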

Phase 3: Recursive Segmentation

The algorithm recursively divides the page by finding the largest gaps:

Project all content onto the X-axis and Y-axis
Find the largest gap in each direction
Cut along the axis with the larger gap
Repeat recursively until each region contains a single column

On a typical two-column page, this plays out as follows:

Step 1: Find vertical gap → Split into left/right columns
Step 2: Within each column, find horizontal gaps → Split into blocks
Step 3: Order blocks top-to-bottom within each column
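The recursion can be sketched in a few lines of Python. This is a minimal version of classic XY-Cut, not the OpenDataLoader implementation: it ignores the density-based preference from Phase 2, and the `largest_gap` and `xy_cut` helpers and the sample boxes are assumptions for illustration. The `min_gap` default mirrors the 5.0-point minimum gap listed under Technical Details.

```python
# Boxes are (label, left, top, right, bottom), with y growing downward (assumed convention).

def largest_gap(intervals, min_gap):
    """Return (gap_size, cut_position) for the widest empty gap in a projection, or None."""
    best = None
    reach = None  # furthest end seen so far among the sorted intervals
    for start, end in sorted(intervals):
        if reach is not None and start - reach >= min_gap:
            if best is None or start - reach > best[0]:
                best = (start - reach, (start + reach) / 2)
        reach = end if reach is None else max(reach, end)
    return best

def xy_cut(boxes, min_gap=5.0):
    """Order boxes by recursively cutting along the axis with the larger gap."""
    if len(boxes) <= 1:
        return list(boxes)
    x_gap = largest_gap([(b[1], b[3]) for b in boxes], min_gap)  # gap on X: vertical cut
    y_gap = largest_gap([(b[2], b[4]) for b in boxes], min_gap)  # gap on Y: horizontal cut
    if x_gap is None and y_gap is None:
        # No usable gap left: fall back to top-to-bottom, then left-to-right.
        return sorted(boxes, key=lambda b: (b[2], b[1]))
    if y_gap is None or (x_gap is not None and x_gap[0] >= y_gap[0]):
        cut = x_gap[1]
        first = [b for b in boxes if b[3] <= cut]   # entirely left of the cut
        second = [b for b in boxes if b[1] >= cut]  # entirely right of the cut
    else:
        cut = y_gap[1]
        first = [b for b in boxes if b[4] <= cut]   # entirely above the cut
        second = [b for b in boxes if b[2] >= cut]  # entirely below the cut
    return xy_cut(first, min_gap) + xy_cut(second, min_gap)

page = [
    ("Introduction", 50, 100, 280, 115),
    ("Methods", 320, 100, 550, 115),
    ("This paper...", 50, 125, 280, 300),
    ("We used...", 320, 125, 550, 300),
]
ordered = [b[0] for b in xy_cut(page)]
print(ordered)
```

On this sample the 40-point gutter beats the 10-point line gap, so the page is split into columns first and each column is then split into its heading and body.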

Phase 4: Merge Cross-Layout Elements

Finally, the cross-layout elements set aside in Phase 1 (headers, footers, full-width titles) are reinserted at the correct positions based on their Y-coordinates.
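A simple way to picture the reinsertion is to place each cross-layout element before the first ordered block that starts at or below it. The sketch below assumes that rule and only covers elements sitting above or below the columns, such as titles and footers; `merge_cross_layout` and the sample data are hypothetical.

```python
# Boxes are (label, left, top, right, bottom), with y growing downward (assumed convention).

def merge_cross_layout(ordered, cross):
    """Insert each cross-layout box before the first ordered box that starts at or below it."""
    result = list(ordered)
    for element in sorted(cross, key=lambda b: b[2]):
        index = next(
            (i for i, b in enumerate(result) if b[2] >= element[2]),
            len(result),  # nothing starts below it: append at the end
        )
        result.insert(index, element)
    return result

# Column content already in reading order (left column, then right column).
ordered = [
    ("Introduction", 50, 100, 280, 115),
    ("This paper...", 50, 125, 280, 300),
    ("Methods", 320, 100, 550, 115),
    ("We used...", 320, 125, 550, 300),
]
cross = [
    ("Page Footer", 50, 700, 550, 720),
    ("DOCUMENT TITLE", 50, 40, 550, 70),
]

merged = [b[0] for b in merge_cross_layout(ordered, cross)]
print(merged)
```

The title lands before the first column block and the footer after the last one, while the column order itself is left untouched.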

Why This Matters for RAG

Correct reading order is essential for:

Chunking: Semantic chunks should contain coherent text, not mixed columns
Context windows: LLMs need text in the order humans would read it
Citations: Bounding boxes are only useful if the text they reference is correct

Usage

XY-Cut++ is enabled by default. No configuration needed:

import opendataloader_pdf

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    format="markdown,json",
)

To disable reading order sorting and keep the raw PDF order, pass --reading-order off on the command line:

# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf --reading-order off file1.pdf file2.pdf folder/

Comparison with Other Approaches

Approach	Pros	Cons
Raw extraction	Fast	Wrong order, unusable for RAG
ML-based	Can learn complex layouts	GPU required, variable output
XY-Cut++ (OpenDataLoader)	Deterministic, fast, no GPU	May struggle with very irregular layouts

Technical Details

The algorithm is implemented in:

XYCutPlusPlusSorter.java — Main algorithm

Key parameters:

Beta threshold (default: 2.0): Controls cross-layout element detection
Density threshold (default: 0.9): Switches between horizontal/vertical preference
Minimum gap (default: 5.0 points): Prevents splitting on insignificant gaps

When to Disable Reading Order

Reading order is enabled by default and works well for most documents. Disabling it (--reading-order off) is rarely needed; the main cases are:

Use Case	Notes
Debugging	Compare xycut output vs raw PDF order
Custom post-processing	When your pipeline handles ordering
Tagged PDFs	Use `--use-struct-tree` instead (not `off`)