Top Benchmark ScoresApache-2.0

PDF Parsing
Built for RAG

Extract structured data for RAG pipelines. Reading order, tables, bounding boxes — top-ranked in benchmarks. Local-first. Open source.

Bounding BoxesOCR (80+ Languages)Tables · Formulas · Pictures · Charts
The Problem

PDFs Weren't Built for AI

Lost structure, broken tables, missing accessibility tags — the tool you choose determines 90% of your pipeline's output quality.

"If the data isn't parsed properly, your RAG system will never retrieve accurate answers. Garbage in = garbage out."

Scrambled Reading Order

Multi-column layouts read left-to-right across the page, mixing content from different columns. Your LLM receives jumbled text that makes no sense.

Lost Table Structure

Tables become walls of unformatted text. Row and column relationships disappear, making financial data and specifications unusable.

No Source Coordinates

No way to cite where information came from or highlight the original PDF location. Users can't verify your AI's answers.

Accessibility Non-Compliance

EAA, ADA, Section 508 enforced worldwide. Manual PDF remediation doesn’t scale.

The Solution

Built for RAG, Not Just PDF Reading

OpenDataLoader PDF delivers what LLM pipelines actually need.

XY-Cut++ Reading Order

Correctly reads multi-column layouts. Text flows in the order humans read it.

How it works

Hybrid OCR & AI

Optional LLM enhancement for OCR and complex tables. 93% table accuracy in benchmarks.

Enable hybrid

Bounding Boxes

Every element includes [x1, y1, x2, y2] coordinates for precise citations.

JSON schema

Table Extraction

Detects borders and clusters text into rows/columns. Handles merged cells.

Table schema

Auto-Tagging to Tagged PDF

Open-source PDF auto-tagging pipeline. Untagged PDF in → screen-reader-ready Tagged PDF out. Based on PDF Association specifications, validated with veraPDF.

Learn more

AI Safety Built-in

Filters hidden text, off-page content, and prompt injection attempts.

Safety docs
Output Format

Structured Output with Bounding Boxes

JSON Output Example

{  "type": "heading",  "id": 42,  "level": "Title",  "page number": 1,  "bounding box": [72.0, 700.0, 540.0, 730.0],  "heading level": 1,  "font": "Helvetica-Bold",  "font size": 24.0,  "content": "Introduction"}
FieldDescription
typeElement type: heading, paragraph, table, list, image, caption
idUnique identifier for cross-referencing
page number1-indexed page reference
bounding box[left, bottom, right, top] in PDF points
heading levelHeading depth (1+)
font, font sizeTypography info
contentExtracted text

Bounding Box Visualization

PDF with bounding box overlays showing detected elements

Why Bounding Boxes Matter for RAG

When your LLM answers a question, bounding boxes let you:

  • Highlight the exact source location in the PDF
  • Build citation links with page and position references
  • Verify extraction accuracy by visual comparison
Benchmarks

Why OpenDataLoader PDF?

Benchmark Comparison

Overall Score

#1 in benchmarks — per-document mean of available metrics

opendataloader [hybrid]
0.907
nutrient
0.885
docling
0.882
marker
0.861
unstructured [hi_res]
0.841
edgeparse
0.837
opendataloader
0.831
mineru
0.831
pymupdf4llm
0.732
unstructured
0.686
markitdown
0.589
liteparse
0.576

Speed (s/page)

Lower is faster — full pipeline including layout analysis

nutrient
0.008
opendataloader
0.015
edgeparse
0.036
unstructured
0.077
pymupdf4llm
0.091
markitdown
0.114
opendataloader [hybrid]
0.463
docling
0.762
liteparse
1.061
unstructured [hi_res]
3.008
mineru
5.962
marker
53.932

Reading Order (NID)

Text sequence accuracy

opendataloader [hybrid]
0.934
nutrient
0.925
unstructured [hi_res]
0.904
opendataloader
0.902
docling
0.898
edgeparse
0.894
marker
0.890
pymupdf4llm
0.885
unstructured
0.882
liteparse
0.866
mineru
0.857
markitdown
0.844

Table Score (TEDS)

Table extraction accuracy

opendataloader [hybrid]
0.928
docling
0.887
mineru
0.873
marker
0.808
edgeparse
0.717
nutrient
0.708
unstructured [hi_res]
0.588
opendataloader
0.489
pymupdf4llm
0.401
markitdown
0.273
unstructured
0.000
liteparse
0.000

Heading Score (MHS)

Heading detection accuracy

docling
0.824
opendataloader [hybrid]
0.821
nutrient
0.819
marker
0.796
unstructured [hi_res]
0.749
mineru
0.743
opendataloader
0.739
edgeparse
0.706
pymupdf4llm
0.412
unstructured
0.388
markitdown
0.000
liteparse
0.000
Quick Start

Get Started in 60 Seconds

pip install -U opendataloader-pdf
import opendataloader_pdfopendataloader_pdf.convert(    input_path=["document.pdf"],    output_dir="output/",    format="json,html,pdf,markdown")

Building a RAG pipeline?

Use our official LangChain integration:

pip install -U langchain-opendataloader-pdf
View RAG Integration Guide
PDF Accessibility

Tagged PDF & PDF/UA Accessibility

Open-source PDF auto-tagging pipeline. Based on PDF Association specifications, developed with Hancom and Dual Lab (veraPDF developers).

Accessibility regulations are enforced worldwide (EAA June 2025, ADA/Section 508, Korea Digital Inclusion Act). Manual PDF remediation doesn't scale.

Accessibility Pipeline

1
Free

Audit

Check existing PDF tags, detect untagged PDFs

Shipped
2
Free (Apache 2.0)

Auto-tag

Generate structure tags for untagged PDFs

Available
3
Enterprise

Export PDF/UA

Convert to PDF/UA-1 or PDF/UA-2 compliant files

Available
4
Enterprise

Visual Editing

Accessibility studio — review and fix tags

Available
Get Started in Seconds

Ready to Parse PDFs
the Right Way?

One command to get started. No API keys, no cloud, no hassle.

terminal
pip install -U opendataloader-pdf