Benchmark Overview

About the Benchmark Project

PDF documents are everywhere, but LLMs can't read them directly. Converting PDFs to Markdown preserves structure (headings, tables, reading order) that helps LLMs understand and answer questions accurately.

This benchmark compares open-source PDF-to-Markdown engines to help you choose the right tool for your RAG pipeline or document processing workflow.

What we measure:

Reading Order — Is the text extracted in the correct sequence?
Table Fidelity — Are tables accurately reconstructed?
Heading Hierarchy — Is the document structure preserved?
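
To make the idea of a reading-order score concrete, here is a minimal sketch of one way such a score could be computed. It is an illustration under stated assumptions, not the benchmark's own metric: the function name `reading_order_score` is hypothetical, and the similarity measure (Python's `difflib.SequenceMatcher` over whitespace-separated tokens) is a stand-in chosen because it ships with the standard library.

```python
from difflib import SequenceMatcher


def reading_order_score(reference: str, extracted: str) -> float:
    """Return a similarity in [0, 1] between reference and extracted text.

    Assumption: both texts are compared as ordered lists of
    whitespace-separated tokens. This is a stand-in metric, not the
    benchmark's actual implementation.
    """
    ref_tokens = reference.split()
    out_tokens = extracted.split()
    if not ref_tokens and not out_tokens:
        return 1.0  # two empty documents are trivially identical
    # autojunk=False disables the popularity heuristic that would otherwise
    # ignore frequent tokens in long documents.
    matcher = SequenceMatcher(None, ref_tokens, out_tokens, autojunk=False)
    return matcher.ratio()


if __name__ == "__main__":
    reference = "left column first then right column"
    shuffled = "right column left column first then"
    print(reading_order_score(reference, reference))
    print(reading_order_score(reference, shuffled))
```

An order-sensitive comparison like this penalizes an engine that extracts every word correctly but interleaves columns, which is the failure mode a reading-order metric is meant to catch.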

The evaluation pipeline is modular: new engines, corpora, or metrics can be added with minimal effort.
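
As a rough illustration of what a modular design can look like, the sketch below uses a small registry so that adding an engine means writing one adapter function. Every name here (`ENGINES`, `register_engine`, `run_all`, and the `dummy` engine) is hypothetical and is not taken from the benchmark's code.

```python
from typing import Callable, Dict

# Hypothetical registry: engine name -> function that turns a PDF path into Markdown.
ENGINES: Dict[str, Callable[[str], str]] = {}


def register_engine(name: str):
    """Decorator that adds a converter function to the registry under `name`."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        ENGINES[name] = func
        return func
    return decorator


@register_engine("dummy")
def dummy_engine(pdf_path: str) -> str:
    # Placeholder converter; a real adapter would call the engine here.
    return f"# {pdf_path}\n"


def run_all(pdf_path: str) -> Dict[str, str]:
    """Run every registered engine on one PDF and collect the Markdown output."""
    return {name: convert(pdf_path) for name, convert in ENGINES.items()}


if __name__ == "__main__":
    print(run_all("sample.pdf"))
```

With this shape, the evaluation loop only depends on the registry, so a new engine does not require changes to the scoring code.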

Benchmark Results

Quick Comparison

Engine	Overall	Reading Order	Table Fidelity	Heading Hierarchy	Speed (s/page)	License
opendataloader [hybrid]	0.907	0.934	0.928	0.821	0.463	Apache-2.0
nutrient	0.885	0.925	0.708	0.819	0.008	Commercial
docling	0.882	0.898	0.887	0.824	0.762	MIT
marker	0.861	0.890	0.808	0.796	53.932	GPL-3.0
unstructured [hi_res]	0.841	0.904	0.588	0.749	3.008	Apache-2.0
edgeparse	0.837	0.894	0.717	0.706	0.036	Apache-2.0
opendataloader	0.831	0.902	0.489	0.739	0.015	Apache-2.0
mineru	0.831	0.857	0.873	0.743	5.962	AGPL-3.0
pymupdf4llm	0.732	0.885	0.401	0.412	0.091	AGPL-3.0
unstructured	0.686	0.882	0.000	0.388	0.077	Apache-2.0
markitdown	0.589	0.844	0.273	0.000	0.114	MIT
liteparse	0.576	0.866	0.000	0.000	1.061	Apache-2.0

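Accuracy scores (overall, reading order, table, heading) are normalized to [0, 1]; higher is better. Speed is reported in seconds per page; lower is better. Best in each column: opendataloader [hybrid] for the overall (0.907), reading order (0.934), and table (0.928) scores; docling for the heading score (0.824); nutrient for speed (0.008 s/page).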
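
One way to use the table when choosing an engine is to filter by hard constraints first (license, speed, table quality) and then rank by overall score. The sketch below does this with values copied from the table above; the `Result` type and the `shortlist` helper are illustrative names, not part of the benchmark.

```python
from typing import Iterable, List, NamedTuple, Set


class Result(NamedTuple):
    engine: str
    overall: float
    table: float
    seconds_per_page: float
    license: str


# Values copied from the Quick Comparison table.
RESULTS = [
    Result("opendataloader [hybrid]", 0.907, 0.928, 0.463, "Apache-2.0"),
    Result("nutrient", 0.885, 0.708, 0.008, "Commercial"),
    Result("docling", 0.882, 0.887, 0.762, "MIT"),
    Result("marker", 0.861, 0.808, 53.932, "GPL-3.0"),
    Result("unstructured [hi_res]", 0.841, 0.588, 3.008, "Apache-2.0"),
    Result("edgeparse", 0.837, 0.717, 0.036, "Apache-2.0"),
    Result("opendataloader", 0.831, 0.489, 0.015, "Apache-2.0"),
    Result("mineru", 0.831, 0.873, 5.962, "AGPL-3.0"),
    Result("pymupdf4llm", 0.732, 0.401, 0.091, "AGPL-3.0"),
    Result("unstructured", 0.686, 0.000, 0.077, "Apache-2.0"),
    Result("markitdown", 0.589, 0.273, 0.114, "MIT"),
    Result("liteparse", 0.576, 0.000, 1.061, "Apache-2.0"),
]


def shortlist(
    results: Iterable[Result],
    allowed_licenses: Set[str],
    max_seconds_per_page: float,
    min_table: float = 0.0,
) -> List[Result]:
    """Keep engines that satisfy all constraints, best overall score first."""
    candidates = [
        r
        for r in results
        if r.license in allowed_licenses
        and r.seconds_per_page <= max_seconds_per_page
        and r.table >= min_table
    ]
    return sorted(candidates, key=lambda r: r.overall, reverse=True)


if __name__ == "__main__":
    # Example: permissive licenses only, at most 1 s/page, strong table support.
    for r in shortlist(RESULTS, {"Apache-2.0", "MIT"}, 1.0, min_table=0.8):
        print(r.engine, r.overall)
```

The thresholds in the example are arbitrary; they should reflect the constraints of your own pipeline.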

Visual Comparison

[Figure: Benchmark]

[Figure: Quality Breakdown]
