Reading Order & XY-Cut++
How OpenDataLoader PDF handles multi-column layouts and preserves correct reading order
The Multi-Column Problem
PDF files don't store text in reading order. They store drawing instructions — "draw this glyph at position (x, y)". When you have a two-column academic paper or a newspaper layout, naive text extraction reads left-to-right across the entire page, mixing content from different columns:
❌ Wrong extraction:
"Introduction Methods
This paper... We used..."
✅ Correct extraction:
"Introduction
This paper presents a novel approach...
Methods
We used the following methodology..."
This is one of the most common complaints about PDF parsers in RAG pipelines. Jumbled text destroys context and confuses LLMs.
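The failure mode is easy to reproduce without a PDF at all. The sketch below uses four hypothetical text lines with made-up coordinates (y is measured from the top of the page for simplicity, which is an assumption; it is not how OpenDataLoader represents elements) and compares a naive row-by-row sort with a column-aware one.

```python
from typing import List, NamedTuple


class Line(NamedTuple):
    x: float   # left edge, in points
    y: float   # distance from the top of the page, in points (assumed convention)
    text: str


# Hypothetical two-column page: left column starts at x=50, right column at x=320.
lines = [
    Line(50, 100, "Introduction"),
    Line(320, 100, "Methods"),
    Line(50, 120, "This paper..."),
    Line(320, 120, "We used..."),
]


def naive_order(items: List[Line]) -> List[str]:
    # Row by row, left to right across the whole page: this mixes the columns.
    return [l.text for l in sorted(items, key=lambda l: (l.y, l.x))]


def column_order(items: List[Line], split_x: float) -> List[str]:
    # Read the whole left column first, then the whole right column.
    left = sorted((l for l in items if l.x < split_x), key=lambda l: l.y)
    right = sorted((l for l in items if l.x >= split_x), key=lambda l: l.y)
    return [l.text for l in left + right]


print(naive_order(lines))
# ['Introduction', 'Methods', 'This paper...', 'We used...']
print(column_order(lines, split_x=300))
# ['Introduction', 'This paper...', 'Methods', 'We used...']
```

Here the column boundary (`split_x`) is supplied by hand. Finding it automatically, for arbitrary layouts, is the job of the segmentation algorithm described next.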
How XY-Cut++ Works
OpenDataLoader uses the XY-Cut++ algorithm, an enhanced version of the classic XY-Cut recursive segmentation. It works in four phases:
Phase 1: Cross-Layout Detection
First, we identify elements that span multiple columns — headers, footers, and full-width titles. These are extracted separately so they don't interfere with column detection.
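As a rough illustration, cross-layout detection can be approximated by flagging elements that are much wider than a typical element. The rule used below (width greater than beta times the median width) is an assumption made for this sketch; the exact test in XYCutPlusPlusSorter.java may differ, and the `Box` type and `split_cross_layout` function are hypothetical names.

```python
from statistics import median
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float
    label: str


def split_cross_layout(boxes: List[Box], beta: float = 2.0) -> Tuple[List[Box], List[Box]]:
    """Return (cross_layout, body).

    Assumed rule: a box is cross-layout when its width exceeds
    beta times the median width of all boxes on the page.
    """
    if not boxes:
        return [], []
    limit = beta * median(b.right - b.left for b in boxes)
    cross = [b for b in boxes if (b.right - b.left) > limit]
    body = [b for b in boxes if (b.right - b.left) <= limit]
    return cross, body


# Hypothetical page: a full-width title and footer around two 240pt columns.
page = [
    Box(50, 40, 550, 60, "title"),
    Box(50, 80, 290, 100, "col1-a"),
    Box(50, 110, 290, 130, "col1-b"),
    Box(310, 80, 550, 100, "col2-a"),
    Box(310, 110, 550, 130, "col2-b"),
    Box(50, 760, 550, 780, "footer"),
]

cross, body = split_cross_layout(page)
print([b.label for b in cross])  # ['title', 'footer']
print([b.label for b in body])   # ['col1-a', 'col1-b', 'col2-a', 'col2-b']
```

With the median width at 240 points and beta at 2.0, only the 500-point title and footer exceed the 480-point limit, so the column text is left alone for the later phases.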
┌─────────────────────────────────┐
│         DOCUMENT TITLE          │ ← Cross-layout (full width)
├───────────────┬─────────────────┤
│   Column 1    │    Column 2     │
│   text...     │    text...      │
│   text...     │    text...      │
├───────────────┴─────────────────┤
│           Page Footer           │ ← Cross-layout (full width)
└─────────────────────────────────┘
Phase 2: Density Analysis
We calculate the content density ratio to determine whether the layout is content-dense (like newspapers) or sparse:
- High density (>0.9): Prefer horizontal cuts first
- Low density: Prefer vertical cuts first
This adaptive approach handles different document styles correctly.
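One plausible way to compute such a ratio is to divide the summed area of all elements by the area of the region that encloses them. That definition is an assumption for this sketch (the document does not spell out the formula), and the function names are hypothetical; only the 0.9 threshold comes from the text above.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom), in points

DENSITY_THRESHOLD = 0.9  # default value quoted in this document


def density_ratio(boxes: Sequence[Box]) -> float:
    """Assumed definition: summed box area / area of the enclosing region.

    Overlapping boxes are counted twice, which is acceptable for a sketch.
    """
    if not boxes:
        return 0.0
    region_w = max(b[2] for b in boxes) - min(b[0] for b in boxes)
    region_h = max(b[3] for b in boxes) - min(b[1] for b in boxes)
    if region_w <= 0 or region_h <= 0:
        return 0.0
    covered = sum((b[2] - b[0]) * (b[3] - b[1]) for b in boxes)
    return covered / (region_w * region_h)


def preferred_first_cut(boxes: Sequence[Box], threshold: float = DENSITY_THRESHOLD) -> str:
    # High density: try horizontal cuts first. Otherwise: vertical cuts first.
    return "horizontal" if density_ratio(boxes) > threshold else "vertical"


sparse = [(0, 0, 40, 100), (60, 0, 100, 100)]  # wide gutter between columns
dense = [(0, 0, 48, 100), (52, 0, 100, 100)]   # columns almost touching

print(preferred_first_cut(sparse))  # vertical
print(preferred_first_cut(dense))   # horizontal
```

The sparse page covers 80% of its enclosing region and the dense one 96%, so they fall on opposite sides of the threshold.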
Phase 3: Recursive Segmentation
The algorithm recursively divides the page by finding the largest gaps:
- Project all content onto the X-axis and Y-axis
- Find the largest gap in each direction
- Cut along the axis with the larger gap
- Repeat recursively until regions contain single columns
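The list above can be sketched as a short recursive function. This is the classic XY-Cut idea only, not the XY-Cut++ implementation: it leaves out the density-based axis preference and cross-layout handling, and the tuple layout and function names are assumptions made for illustration. Coordinates use y measured from the top of the page.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float, str]  # (left, top, right, bottom, text)


def largest_gap(intervals: List[Tuple[float, float]], min_gap: float) -> Optional[Tuple[float, float]]:
    """Project intervals onto one axis; return (gap_size, cut_position) or None."""
    ordered = sorted(intervals)
    best = None
    reach = ordered[0][1]  # farthest point covered so far
    for start, end in ordered[1:]:
        gap = start - reach
        if gap >= min_gap and (best is None or gap > best[0]):
            best = (gap, (reach + start) / 2)  # cut in the middle of the gap
        reach = max(reach, end)
    return best


def xy_cut(boxes: List[Box], min_gap: float = 5.0) -> List[Box]:
    """Return boxes in reading order by recursively cutting at the largest gap."""
    if len(boxes) <= 1:
        return list(boxes)
    x_gap = largest_gap([(b[0], b[2]) for b in boxes], min_gap)
    y_gap = largest_gap([(b[1], b[3]) for b in boxes], min_gap)
    if x_gap is None and y_gap is None:
        # No usable gap: fall back to top-to-bottom, then left-to-right.
        return sorted(boxes, key=lambda b: (b[1], b[0]))
    # Cut along the axis with the larger gap (ties go to the vertical cut).
    if y_gap is None or (x_gap is not None and x_gap[0] >= y_gap[0]):
        cut = x_gap[1]
        first = [b for b in boxes if b[2] <= cut]   # left of the cut
        second = [b for b in boxes if b[0] >= cut]  # right of the cut
    else:
        cut = y_gap[1]
        first = [b for b in boxes if b[3] <= cut]   # above the cut
        second = [b for b in boxes if b[1] >= cut]  # below the cut
    return xy_cut(first, min_gap) + xy_cut(second, min_gap)


# Hypothetical two-column page: 20pt gutter, 10pt gaps between blocks.
page = [
    (310, 100, 550, 115, "Methods"),
    (50, 125, 290, 200, "This paper..."),
    (310, 125, 550, 200, "We used..."),
    (50, 100, 290, 115, "Introduction"),
]
print([b[4] for b in xy_cut(page)])
# ['Introduction', 'This paper...', 'Methods', 'We used...']
```

Because every cut falls inside an empty gap, each box lands wholly on one side of it, so both halves are strictly smaller than the input and the recursion always terminates.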
Step 1: Find vertical gap → Split into left/right columns
Step 2: Within each column, find horizontal gaps → Split into blocks
Step 3: Order blocks top-to-bottom within each column
Phase 4: Merge Cross-Layout Elements
Finally, cross-layout elements (headers, footers) are reinserted at the correct positions based on their Y-coordinates.
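A minimal sketch of this merge step, assuming the body elements are already in reading order and that y is measured from the top of the page. The insertion rule (place each full-width element before the first body element that starts below it) is a simplification chosen for this example and the function name is hypothetical; it is enough for headers and footers but not a statement of how the Java implementation handles every case.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float, str]  # (left, top, right, bottom, label)


def merge_cross_layout(ordered_body: List[Box], cross_layout: List[Box]) -> List[Box]:
    """Reinsert full-width elements into an ordered body using Y-coordinates."""
    merged = list(ordered_body)
    for element in sorted(cross_layout, key=lambda b: b[1]):  # top-most first
        # Index of the first element that starts at or below this element's bottom edge.
        index = next(
            (i for i, b in enumerate(merged) if b[1] >= element[3]),
            len(merged),  # nothing below it: append at the end
        )
        merged.insert(index, element)
    return merged


# Body in reading order (column 1, then column 2), plus a title and a footer.
body = [
    (50, 80, 290, 100, "col1-a"),
    (50, 110, 290, 130, "col1-b"),
    (310, 80, 550, 100, "col2-a"),
    (310, 110, 550, 130, "col2-b"),
]
cross = [
    (50, 760, 550, 780, "footer"),
    (50, 40, 550, 60, "title"),
]

print([b[4] for b in merge_cross_layout(body, cross)])
# ['title', 'col1-a', 'col1-b', 'col2-a', 'col2-b', 'footer']
```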
Why This Matters for RAG
Correct reading order is essential for:
- Chunking: Semantic chunks should contain coherent text, not mixed columns
- Context windows: LLMs need text in the order humans would read it
- Citations: Bounding boxes are only useful if the text they reference is correct
Usage
XY-Cut++ is enabled by default. No configuration needed:
import opendataloader_pdf
# Batch all files in one call: each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    format="markdown,json",
)
To disable reading order sorting (use raw PDF order):
# Batch all files in one call: each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf --reading-order off file1.pdf file2.pdf folder/
Comparison with Other Approaches
| Approach | Pros | Cons |
|---|---|---|
| Raw extraction | Fast | Wrong order, unusable for RAG |
| ML-based | Can learn complex layouts | GPU required, variable output |
| XY-Cut++ (OpenDataLoader) | Deterministic, fast, no GPU | May struggle with very irregular layouts |
Technical Details
The algorithm is implemented in:
- XYCutPlusPlusSorter.java: Main algorithm
Key parameters:
- Beta threshold (default: 2.0): Controls cross-layout element detection
- Density threshold (default: 0.9): Switches between horizontal/vertical preference
- Minimum gap (default: 5.0 points): Prevents splitting on insignificant gaps
When to Disable Reading Order
Reading order is enabled by default and works well for most documents. Disabling (--reading-order off) is rarely needed:
| Use Case | Notes |
|---|---|
| Debugging | Compare xycut output vs raw PDF order |
| Custom post-processing | When your pipeline handles ordering |
| Tagged PDFs | Use --use-struct-tree instead (not off) |
Further Reading
- XY-Cut algorithm (Wikipedia)
- arXiv:2504.10258 — XY-Cut++ paper