
Benchmark Overview

Benchmarks for OpenDataLoader PDF

About the Benchmark Project

PDF documents are everywhere, but LLMs can't read them directly. Converting PDFs to Markdown preserves structure (headings, tables, reading order) that helps LLMs understand and answer questions accurately.

This benchmark compares open-source PDF-to-Markdown engines to help you choose the right tool for your RAG pipeline or document processing workflow.

What we measure:

  • Reading Order — Is the text extracted in the correct sequence?
  • Table Fidelity — Are tables accurately reconstructed?
  • Heading Hierarchy — Is the document structure preserved?
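
As a rough illustration of the first metric, a reading-order score can be sketched as the similarity between the extracted sequence of text blocks and a ground-truth sequence. The sketch below uses the `SequenceMatcher` ratio from Python's standard `difflib` module. This is an assumption made for illustration, not necessarily the metric this benchmark uses, and the function name `reading_order_score` is hypothetical.

```python
from difflib import SequenceMatcher


def reading_order_score(expected_blocks, extracted_blocks):
    """Return a similarity in [0, 1] between two sequences of text blocks.

    Hypothetical helper: 1.0 means the blocks match in the same order;
    lower values mean blocks are missing, extra, or out of sequence.
    """
    if not expected_blocks and not extracted_blocks:
        return 1.0  # two empty documents trivially agree
    matcher = SequenceMatcher(None, expected_blocks, extracted_blocks, autojunk=False)
    return matcher.ratio()


expected = ["Title", "Intro paragraph", "Left column", "Right column"]
in_order = ["Title", "Intro paragraph", "Left column", "Right column"]
swapped = ["Title", "Intro paragraph", "Right column", "Left column"]

print(reading_order_score(expected, in_order))  # identical order
print(reading_order_score(expected, swapped))   # two columns swapped
```

A sequence-based comparison like this penalizes swapped columns even when every block of text was extracted, which is the failure mode the Reading Order metric targets.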

The evaluation pipeline is modular: new engines, corpora, or metrics can be added with minimal effort.
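
One common way to get this kind of modularity is a registry that maps names to engine and metric functions. The following is a minimal sketch under that assumption; it does not reflect the benchmark's actual code, and every name in it (`ENGINES`, `METRICS`, `register`, `evaluate`, `dummy-engine`, `exact_match`) is hypothetical.

```python
from typing import Callable, Dict

# Hypothetical registries: engine name -> converter, metric name -> scorer.
ENGINES: Dict[str, Callable[[str], str]] = {}
METRICS: Dict[str, Callable[[str, str], float]] = {}


def register(registry, name):
    """Decorator that adds a function to the given registry under `name`."""
    def decorator(func):
        registry[name] = func
        return func
    return decorator


@register(ENGINES, "dummy-engine")
def dummy_engine(pdf_path: str) -> str:
    # Stand-in for a real PDF-to-Markdown conversion call.
    return "# Title\n\nBody text"


@register(METRICS, "exact_match")
def exact_match(expected_md: str, actual_md: str) -> float:
    # Toy metric: 1.0 if the Markdown matches exactly, else 0.0.
    return 1.0 if expected_md == actual_md else 0.0


def evaluate(pdf_path: str, expected_md: str) -> dict:
    """Run every registered engine and score it with every registered metric."""
    results = {}
    for engine_name, convert in ENGINES.items():
        actual = convert(pdf_path)
        results[engine_name] = {
            metric_name: score(expected_md, actual)
            for metric_name, score in METRICS.items()
        }
    return results


print(evaluate("sample.pdf", "# Title\n\nBody text"))
```

With this layout, adding an engine or metric means writing one decorated function; the evaluation loop itself does not change.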

Benchmark Results

View full benchmark results →

Quick Comparison

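| Engine | Overall | Reading Order | Table | Heading | Speed (s/page) | License |
|---|---|---|---|---|---|---|
| opendataloader [hybrid] | **0.907** | **0.934** | **0.928** | 0.821 | 0.463 | Apache-2.0 |
| nutrient | 0.885 | 0.925 | 0.708 | 0.819 | **0.008** | Commercial |
| docling | 0.882 | 0.898 | 0.887 | **0.824** | 0.762 | MIT |
| marker | 0.861 | 0.890 | 0.808 | 0.796 | 53.932 | GPL-3.0 |
| unstructured [hi_res] | 0.841 | 0.904 | 0.588 | 0.749 | 3.008 | Apache-2.0 |
| edgeparse | 0.837 | 0.894 | 0.717 | 0.706 | 0.036 | Apache-2.0 |
| opendataloader | 0.831 | 0.902 | 0.489 | 0.739 | 0.015 | Apache-2.0 |
| mineru | 0.831 | 0.857 | 0.873 | 0.743 | 5.962 | AGPL-3.0 |
| pymupdf4llm | 0.732 | 0.885 | 0.401 | 0.412 | 0.091 | AGPL-3.0 |
| unstructured | 0.686 | 0.882 | 0.000 | 0.388 | 0.077 | Apache-2.0 |
| markitdown | 0.589 | 0.844 | 0.273 | 0.000 | 0.114 | MIT |
| liteparse | 0.576 | 0.866 | 0.000 | 0.000 | 1.061 | Apache-2.0 |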

Accuracy scores (Overall, Reading Order, Table, Heading) are normalized to [0, 1]; higher is better. Speed is reported in seconds per page; lower is better. Bold indicates the best value in each column.
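
To show how these numbers might feed a tool choice, the sketch below filters the results by a minimum overall score, a speed budget, and a set of acceptable licenses. The data is copied from the comparison above; the function `pick_engines` and the thresholds in the example call are illustrative assumptions, not recommendations from the benchmark.

```python
# (engine, overall, reading_order, table, heading, seconds_per_page, license)
RESULTS = [
    ("opendataloader [hybrid]", 0.907, 0.934, 0.928, 0.821, 0.463, "Apache-2.0"),
    ("nutrient", 0.885, 0.925, 0.708, 0.819, 0.008, "Commercial"),
    ("docling", 0.882, 0.898, 0.887, 0.824, 0.762, "MIT"),
    ("marker", 0.861, 0.890, 0.808, 0.796, 53.932, "GPL-3.0"),
    ("unstructured [hi_res]", 0.841, 0.904, 0.588, 0.749, 3.008, "Apache-2.0"),
    ("edgeparse", 0.837, 0.894, 0.717, 0.706, 0.036, "Apache-2.0"),
    ("opendataloader", 0.831, 0.902, 0.489, 0.739, 0.015, "Apache-2.0"),
    ("mineru", 0.831, 0.857, 0.873, 0.743, 5.962, "AGPL-3.0"),
    ("pymupdf4llm", 0.732, 0.885, 0.401, 0.412, 0.091, "AGPL-3.0"),
    ("unstructured", 0.686, 0.882, 0.000, 0.388, 0.077, "Apache-2.0"),
    ("markitdown", 0.589, 0.844, 0.273, 0.000, 0.114, "MIT"),
    ("liteparse", 0.576, 0.866, 0.000, 0.000, 1.061, "Apache-2.0"),
]


def pick_engines(results, min_overall=0.0, max_seconds_per_page=float("inf"), licenses=None):
    """Return engine names meeting all constraints, best overall score first.

    Hypothetical helper; `licenses=None` means any license is acceptable.
    """
    matches = [
        row for row in results
        if row[1] >= min_overall
        and row[5] <= max_seconds_per_page
        and (licenses is None or row[6] in licenses)
    ]
    matches.sort(key=lambda row: row[1], reverse=True)
    return [row[0] for row in matches]


# Example: permissive license, at most 1 s/page, overall score of at least 0.83.
print(pick_engines(RESULTS, min_overall=0.83, max_seconds_per_page=1.0,
                   licenses={"Apache-2.0", "MIT"}))
```

Tightening the speed budget or adding a per-metric threshold (for example on the Table column for table-heavy documents) changes the shortlist, so the thresholds should be set from the workload rather than from the Overall column alone.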

Visual Comparison

[Figure: Benchmark]

Quality Breakdown

Detailed Metrics


