Heading Levels (MHS)
Measures whether document structure is preserved
Why Heading Structure Matters for RAG
Headings define document hierarchy — chapters, sections, subsections. RAG systems use this structure to create meaningful chunks and understand context. If headings are missed or mis-leveled, chunks lose their semantic boundaries.
Example problem: A user asks about "Section 3.2" but the parser didn't detect it as a heading, so the RAG system can't locate that section.
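The "Section 3.2" problem can be made concrete with a minimal sketch of heading-based chunking. It assumes the parser emits Markdown with ATX-style (`#`) headings. The helper name `chunk_by_headings` and the chunk layout are hypothetical, chosen for illustration rather than taken from any particular RAG framework, and the sketch ignores edge cases such as `#` lines inside fenced code blocks.

```python
import re

# ATX heading: 1-6 '#' characters, whitespace, then the title text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")

def chunk_by_headings(markdown):
    """Split Markdown into one chunk per heading, tagged with its heading path."""
    chunks = []
    stack = []  # (level, title) pairs from the document root to the current heading
    body = []   # text lines collected since the last heading

    def flush():
        text = "\n".join(body).strip()
        if stack or text:
            chunks.append({"path": [title for _, title in stack], "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match is None:
            body.append(line)
            continue
        flush()
        level = len(match.group(1))
        # Leave any sections at the same or a deeper level before entering this one.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, match.group(2)))
    flush()
    return chunks

doc = """# Guide
intro
## 3.1 Setup
setup text
## 3.2 Usage
usage text
"""

# All headings detected: three chunks, each with its own heading path.
chunks = chunk_by_headings(doc)

# Heading for 3.2 missed by the parser: its text is absorbed into the 3.1 chunk.
missed = chunk_by_headings(doc.replace("## 3.2 Usage", "3.2 Usage"))
```

With all headings detected, the last chunk carries the path `["Guide", "3.2 Usage"]` and can be retrieved by section. With the heading missed, only two chunks remain and the text of 3.2 sits under `["Guide", "3.1 Setup"]`, so a lookup by section finds nothing.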
What MHS Measures
MHS (Markdown Heading Similarity) compares detected headings and their levels against ground truth. A score of 1.0 means all headings were correctly identified with proper hierarchy; lower scores indicate missed or incorrectly leveled headings.
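To build intuition for how such a score behaves, here is a simplified stand-in, not the benchmark's actual implementation (see the repository for that). It assumes both outputs are Markdown with `#` headings and compares the sequences of (level, title) pairs using Python's `difflib.SequenceMatcher`. The function names are hypothetical.

```python
import re
from difflib import SequenceMatcher

# ATX heading: 1-6 '#' characters, whitespace, then the title text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")

def extract_headings(markdown):
    """Return the document's headings as a list of (level, lowercased title)."""
    headings = []
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2).lower()))
    return headings

def heading_similarity_sketch(predicted_md, truth_md):
    """Illustrative score in [0, 1]; an assumption, NOT the real MHS formula."""
    predicted = extract_headings(predicted_md)
    truth = extract_headings(truth_md)
    # ratio() is 2 * matches / (len(predicted) + len(truth)).
    return SequenceMatcher(None, predicted, truth, autojunk=False).ratio()

truth = "# Intro\n## Setup\n## Usage\n"
perfect = heading_similarity_sketch(truth, truth)
misleveled = heading_similarity_sketch("# Intro\n## Setup\n### Usage\n", truth)
no_headings = heading_similarity_sketch("Intro\nSetup\nUsage\n", truth)
```

In this sketch a perfect match scores 1.0, one mis-leveled heading out of three lowers the score to about 0.67, and output with no headings at all scores 0.0. A title only counts as matched when its level also matches, which mirrors the idea that a mis-leveled heading is an error even if its text is right.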

Results
| Engine | Score | Rank |
|---|---|---|
| Docling | 0.824 | #1 |
| OpenDataLoader [hybrid] | 0.821 | #2 |
| Nutrient | 0.819 | #3 |
| Marker | 0.796 | #4 |
| Unstructured [hi_res] | 0.749 | #5 |
| MinerU | 0.743 | #6 |
| OpenDataLoader | 0.739 | #7 |
| Edgeparse | 0.706 | #8 |
| PyMuPDF4LLM | 0.412 | #9 |
| Unstructured | 0.388 | #10 |
| MarkItDown | 0.000 | #11 |
| LiteParse | 0.000 | #11 |
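- ML-based engines such as Docling (0.824) outperform rule-based engines for heading detection, though the top three engines score within 0.005 of one another
- MarkItDown and LiteParse score 0.000 because they don't extract heading levels at all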
When to Prioritize This Metric
| Use Case | Recommended Engine |
|---|---|
| Long documents with deep hierarchy | Docling |
| Legal documents, technical manuals | Docling |
| Semantic chunking by section | Docling or OpenDataLoader |
| Simple documents, flat structure | Any engine works |
Trade-offs
Higher heading accuracy comes at the cost of slower processing. Docling scores 0.824 versus 0.739 for OpenDataLoader, but takes 16x longer. If your documents have a simple structure, speed may matter more.
Learn More
For detailed methodology, raw data, and reproduction scripts, see the opendataloader-bench repository.