Top Benchmark Scores · Apache-2.0

PDF Parsing
Built for RAG

Extract structured data for RAG pipelines. Reading order, tables, bounding boxes — top-ranked in benchmarks. Local-first. Open source.

Bounding Boxes · OCR (80+ Languages) · Tables · Formulas · Pictures · Charts

The Problem

PDFs Weren't Built for AI

Lost structure, broken tables, missing accessibility tags — the tool you choose determines 90% of your pipeline's output quality.

"If the data isn't parsed properly, your RAG system will never retrieve accurate answers. Garbage in = garbage out."

Scrambled Reading Order

Naive extraction reads multi-column layouts left to right across the whole page, mixing content from different columns. Your LLM receives jumbled text that makes no sense.

Lost Table Structure

Tables become walls of unformatted text. Row and column relationships disappear, making financial data and specifications unusable.

No Source Coordinates

No way to cite where information came from or highlight the original PDF location. Users can't verify your AI's answers.

Accessibility Non-Compliance

EAA, ADA, and Section 508 requirements are being enforced worldwide. Manual PDF remediation doesn't scale.

The Solution

Built for RAG, Not Just PDF Reading

OpenDataLoader PDF delivers what LLM pipelines actually need.

XY-Cut++ Reading Order

Correctly reads multi-column layouts. Text flows in the order humans read it.
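As a rough illustration of how gap-based reading order recovery works, here is a minimal sketch of the classic recursive XY-cut, the family of algorithms the name XY-Cut++ points to. It is not the library's implementation: the function names, the `min_gap` threshold, and the `(left, bottom, right, top)` box tuples are assumptions made for this example.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, bottom, right, top) in PDF points


def _widest_gap(boxes: List[Box], lo: int, hi: int) -> Tuple[float, Optional[float]]:
    """Return (gap size, cut position) of the widest empty band along one axis."""
    ordered = sorted(boxes, key=lambda b: b[lo])
    best: Tuple[float, Optional[float]] = (0.0, None)
    reach = ordered[0][hi]
    for box in ordered[1:]:
        gap = box[lo] - reach
        if gap > best[0]:
            best = (gap, (reach + box[lo]) / 2)
        reach = max(reach, box[hi])
    return best


def xy_cut(boxes: List[Box], min_gap: float = 5.0) -> List[Box]:
    """Order boxes by recursively cutting the page at its widest empty band."""
    if len(boxes) <= 1:
        return list(boxes)
    x_gap, x_cut = _widest_gap(boxes, 0, 2)  # vertical band between columns
    y_gap, y_cut = _widest_gap(boxes, 1, 3)  # horizontal band between rows
    widest = max(x_gap, y_gap)
    if widest <= 0 or widest < min_gap:
        # No usable gap: fall back to top-to-bottom, then left-to-right.
        return sorted(boxes, key=lambda b: (-b[3], b[0]))
    if x_gap >= y_gap:
        first = [b for b in boxes if b[2] <= x_cut]   # left of the cut
        second = [b for b in boxes if b[0] >= x_cut]  # right of the cut
    else:
        first = [b for b in boxes if b[1] >= y_cut]   # above the cut (PDF y grows upward)
        second = [b for b in boxes if b[3] <= y_cut]  # below the cut
    return xy_cut(first, min_gap) + xy_cut(second, min_gap)
```

Given a full-width title above two columns, this sketch returns the title first, then the left column top to bottom, then the right column. Plain XY-cut is known to struggle when an element spans several columns in the middle of a page, which is the kind of case refined variants are designed to handle.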

Hybrid OCR & AI

Optional LLM enhancement for OCR and complex tables. Hybrid mode scores 0.928 on the TEDS table benchmark (about 93% table accuracy).

Bounding Boxes

Every element includes [x1, y1, x2, y2] coordinates for precise citations.

Table Extraction

Detects borders and clusters text into rows/columns. Handles merged cells.
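To make the "clusters text into rows/columns" idea concrete, here is a small sketch that groups text fragments into rows by the vertical centre of their bounding boxes. It is an illustration only, not the library's table detector: `cluster_rows` and its `tolerance` parameter are hypothetical, the input dictionaries borrow the `content` and `bounding box` field names from the JSON output, and borders and merged cells are ignored.

```python
from typing import Dict, List


def cluster_rows(cells: List[Dict], tolerance: float = 3.0) -> List[List[str]]:
    """Group text fragments into table rows by the vertical centre of their boxes."""

    def y_center(cell: Dict) -> float:
        _, bottom, _, top = cell["bounding box"]
        return (bottom + top) / 2

    rows: List[List[Dict]] = []
    # PDF y grows upward, so sort descending to visit the top row first.
    for cell in sorted(cells, key=y_center, reverse=True):
        if rows and abs(y_center(rows[-1][0]) - y_center(cell)) <= tolerance:
            rows[-1].append(cell)  # same baseline band: same row
        else:
            rows.append([cell])    # start a new row
    # Within each row, order fragments left to right.
    return [
        [c["content"] for c in sorted(row, key=lambda c: c["bounding box"][0])]
        for row in rows
    ]
```

Fragments whose centres fall within `tolerance` points of the first fragment in a row are treated as the same row, which tolerates small baseline jitter between cells.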

Auto-Tagging to Tagged PDF

Open-source PDF auto-tagging pipeline. Untagged PDF in → screen-reader-ready Tagged PDF out. Based on PDF Association specifications, validated with veraPDF.

AI Safety Built-in

Filters hidden text, off-page content, and prompt injection attempts.
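One of these checks, dropping off-page content, can be approximated with bounding boxes alone. The sketch below is a hypothetical post-processing filter, not the library's built-in safety logic: `is_on_page` and `drop_off_page` are names invented here, and the default page size (612 × 792 points, US Letter) is an assumption you would replace with the real page dimensions.

```python
from typing import Dict, List


def is_on_page(element: Dict, page_width: float = 612.0, page_height: float = 792.0) -> bool:
    """Return True if the element's box overlaps the visible page rectangle."""
    left, bottom, right, top = element["bounding box"]
    return right > 0 and left < page_width and top > 0 and bottom < page_height


def drop_off_page(elements: List[Dict], page_width: float = 612.0,
                  page_height: float = 792.0) -> List[Dict]:
    """Keep only elements that are at least partly inside the page."""
    return [e for e in elements if is_on_page(e, page_width, page_height)]
```

Text positioned entirely outside the page is invisible to a human reader but would still reach an LLM, which is why it is worth filtering before indexing.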

Output Format

Structured Output with Bounding Boxes

JSON Output Example

{
  "type": "heading",
  "id": 42,
  "level": "Title",
  "page number": 1,
  "bounding box": [72.0, 700.0, 540.0, 730.0],
  "heading level": 1,
  "font": "Helvetica-Bold",
  "font size": 24.0,
  "content": "Introduction"
}

Field	Description
type	Element type: heading, paragraph, table, list, image, caption
id	Unique identifier for cross-referencing
page number	1-indexed page reference
bounding box	[left, bottom, right, top] in PDF points
heading level	Heading depth (1+)
font, font size	Typography info
content	Extracted text
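With these fields, a common RAG preprocessing step is to group elements into chunks that start at each heading. The sketch below assumes the extracted elements are available as a flat Python list of dictionaries in reading order, using the field names from the table above; `chunk_by_heading` and the shape of the returned chunks are hypothetical choices for this example.

```python
from typing import Dict, List


def chunk_by_heading(elements: List[Dict]) -> List[Dict]:
    """Group extracted elements into chunks, starting a new chunk at each heading."""
    chunks: List[Dict] = []
    for el in elements:
        is_heading = el["type"] == "heading"
        if is_heading or not chunks:
            chunks.append({
                "title": el.get("content", "") if is_heading else "",
                "page number": el["page number"],  # page where the chunk starts
                "ids": [],                          # element ids, for cross-referencing
                "text": [],
            })
        chunks[-1]["ids"].append(el["id"])
        if not is_heading and el.get("content"):
            chunks[-1]["text"].append(el["content"])
    return chunks
```

Keeping the element ids on each chunk means a retrieved chunk can be traced back to the original elements and their bounding boxes.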

Bounding Box Visualization

[Figure: PDF with bounding box overlays showing detected elements]

Why Bounding Boxes Matter for RAG

When your LLM answers a question, bounding boxes let you:

Highlight the exact source location in the PDF
Build citation links with page and position references
Verify extraction accuracy by visual comparison
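The first two uses above can be sketched in a few lines. This example assumes an element dictionary shaped like the JSON output example, with `bounding box` given as `[left, bottom, right, top]` in PDF points measured from the bottom-left corner of the page. The helper names, the `source` label, and the page height of 792 points are assumptions for illustration.

```python
from typing import Dict, Tuple


def to_citation(element: Dict, source: str = "document.pdf") -> str:
    """Build a human-readable citation from page number and bounding box."""
    left, bottom, right, top = element["bounding box"]
    return (f'{source}, p. {element["page number"]}, '
            f"box [{left:g}, {bottom:g}, {right:g}, {top:g}]")


def to_viewer_rect(element: Dict, page_height: float,
                   scale: float = 1.0) -> Tuple[float, float, float, float]:
    """Convert a bottom-left-origin PDF box to a top-left-origin
    (x, y, width, height) rectangle for drawing a highlight overlay."""
    left, bottom, right, top = element["bounding box"]
    return (left * scale, (page_height - top) * scale,
            (right - left) * scale, (top - bottom) * scale)


heading = {"page number": 1, "bounding box": [72.0, 700.0, 540.0, 730.0]}
print(to_citation(heading))            # document.pdf, p. 1, box [72, 700, 540, 730]
print(to_viewer_rect(heading, 792.0))  # (72.0, 62.0, 468.0, 30.0)
```

The y-axis flip matters because most viewers and image canvases place the origin at the top-left, while PDF coordinates grow upward from the bottom-left.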

Benchmarks

Why OpenDataLoader PDF?

Benchmark Comparison

Overall Score

#1 in benchmarks — per-document mean of available metrics

Tool	Overall score
opendataloader [hybrid]	0.907
nutrient	0.885
docling	0.882
marker	0.861
unstructured [hi_res]	0.841
edgeparse	0.837
opendataloader	0.831
mineru	0.831
pymupdf4llm	0.732
unstructured	0.686
markitdown	0.589
liteparse	0.576
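The overall score is described as a per-document mean of available metrics. One plausible reading, sketched below, is to average whichever metrics exist for each document (a document without tables has no table score) and then average those per-document values. The function name, the metric keys `nid`, `teds` and `mhs`, and the use of `None` for a missing metric are assumptions; the benchmark's own methodology is authoritative.

```python
from statistics import mean
from typing import Dict, List, Optional


def overall_score(per_document: List[Dict[str, Optional[float]]]) -> float:
    """Mean over documents of the mean of each document's available metrics."""
    doc_scores = []
    for metrics in per_document:
        available = [v for v in metrics.values() if v is not None]
        if available:  # skip documents with no metric at all
            doc_scores.append(mean(available))
    return mean(doc_scores)  # raises StatisticsError if no document was scored


# Hypothetical numbers, for illustration only.
docs = [
    {"nid": 0.9, "teds": 0.8, "mhs": 0.7},   # document with a table
    {"nid": 1.0, "teds": None, "mhs": 0.5},  # document without a table
]
print(round(overall_score(docs), 3))  # 0.775
```

Under this reading the overall score need not equal the plain average of the aggregate metric scores, because each document contributes only the metrics it actually has.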

Speed (s/page)

Lower is faster — full pipeline including layout analysis

Tool	Speed (s/page)
nutrient	0.008
opendataloader	0.015
edgeparse	0.036
unstructured	0.077
pymupdf4llm	0.091
markitdown	0.114
opendataloader [hybrid]	0.463
docling	0.762
liteparse	1.061
unstructured [hi_res]	3.008
mineru	5.962
marker	53.932

Reading Order (NID)

Text sequence accuracy

Tool	Reading order (NID)
opendataloader [hybrid]	0.934
nutrient	0.925
unstructured [hi_res]	0.904
opendataloader	0.902
docling	0.898
edgeparse	0.894
marker	0.890
pymupdf4llm	0.885
unstructured	0.882
liteparse	0.866
mineru	0.857
markitdown	0.844

Table Score (TEDS)

Table extraction accuracy

Tool	Table score (TEDS)
opendataloader [hybrid]	0.928
docling	0.887
mineru	0.873
marker	0.808
edgeparse	0.717
nutrient	0.708
unstructured [hi_res]	0.588
opendataloader	0.489
pymupdf4llm	0.401
markitdown	0.273
unstructured	0.000
liteparse	0.000

Heading Score (MHS)

Heading detection accuracy

Tool	Heading score (MHS)
docling	0.824
opendataloader [hybrid]	0.821
nutrient	0.819
marker	0.796
unstructured [hi_res]	0.749
mineru	0.743
opendataloader	0.739
edgeparse	0.706
pymupdf4llm	0.412
unstructured	0.388
markitdown	0.000
liteparse	0.000

Quick Start

Get Started in 60 Seconds

pip install -U opendataloader-pdf

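import opendataloader_pdf

# Convert the listed PDFs and write every requested format to output_dir.
# Several formats can be requested at once as a comma-separated string.
opendataloader_pdf.convert(
    input_path=["document.pdf"],
    output_dir="output/",
    format="json,html,pdf,markdown"
)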

Building a RAG pipeline?

Use our official LangChain integration:

pip install -U langchain-opendataloader-pdf

PDF Accessibility

Tagged PDF & PDF/UA Accessibility

Open-source PDF auto-tagging pipeline. Based on PDF Association specifications, developed with Hancom and Dual Lab (veraPDF developers).

Accessibility regulations are enforced worldwide (EAA June 2025, ADA/Section 508, Korea Digital Inclusion Act). Manual PDF remediation doesn't scale.

Accessibility Pipeline

Step	Tier	What it does	Status
Audit	Free	Check existing PDF tags, detect untagged PDFs	Shipped
Auto-tag	Free (Apache 2.0)	Generate structure tags for untagged PDFs	Available
Export PDF/UA	Enterprise	Convert to PDF/UA-1 or PDF/UA-2 compliant files	Available
Visual Editing	Enterprise	Accessibility studio: review and fix tags	Available

Get Started in Seconds

Ready to Parse PDFs the Right Way?

One command to get started. No API keys, no cloud, no hassle.

pip install -U opendataloader-pdf
