What's New in v2.0
OpenDataLoader PDF v2.0 release highlights: PDF to Markdown for RAG at 100+ pages/sec with no GPU, top benchmark performance, four free AI Add-ons, Apache 2.0 license, LangChain integration
OpenDataLoader PDF v2.0 is out!
OpenDataLoader PDF v2.0 features a hybrid engine that combines AI-based and deterministic extraction methods. The combination delivers both high extraction quality and high processing speed. OpenDataLoader can be used free of charge in a fully air-gapped local environment, so no document data is sent to external servers.
It ranks No. 1 in benchmark performance in the open-source PDF data extraction category. The benchmark (ODL-Bench) has been openly released on GitHub so that users can reproduce and verify the results independently.
What's New
Four Free AI Add-ons, Out of the Box
OpenDataLoader PDF v2.0 includes the following four AI features as add-ons at no additional cost:
- OCR - improves text recognition on image-based and scanned PDFs
- Table Extraction - a lightweight AI model that handles merged cells and complex table structures with precision
- Formula Extraction - recognizes mathematical and scientific notation locally, without a cloud call
- Chart Analysis - converts chart visuals into natural-language descriptions
License Change: From MPL 2.0 to the More Permissive Apache 2.0
OpenDataLoader PDF 2.0 officially adopts the Apache License 2.0, replacing the MPL-2.0 (Mozilla Public License 2.0) that OpenDataLoader (ODL) originally used. The license change is not just a legal update. It is a conscious move to strengthen the brand through technological openness.
Ecosystem Expansion: LangChain Is In
OpenDataLoader PDF has an official LangChain integration. Install the `langchain-opendataloader-pdf` package to get a LangChain document loader for PDF files. See the LangChain docs for details.
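As a rough illustration of how the loader might be wired into a pipeline, the sketch below assumes the package exposes a class named `OpenDataLoaderPDFLoader` that accepts `file_path` and `format` arguments. Neither name is stated in this announcement, so confirm them against the LangChain docs before relying on them. The helper `load_pdf_documents` is hypothetical.

```python
from typing import List


def load_pdf_documents(paths: List[str], output_format: str = "text") -> list:
    """Load PDFs as LangChain Document objects (hypothetical helper)."""
    # Imported lazily so this sketch can be loaded without the package installed.
    # Install first: pip install langchain-opendataloader-pdf
    # Assumption: the class name and keyword arguments match the package.
    from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

    loader = OpenDataLoaderPDFLoader(file_path=paths, format=output_format)
    # load() is the standard LangChain document loader method.
    return loader.load()
```

Calling `load_pdf_documents(["report.pdf"])` (a placeholder path) would then return a list of documents whose `page_content` and `metadata` can be passed on to a text splitter or vector store.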
What makes OpenDataLoader unique?
OpenDataLoader takes a different approach from many PDF parsers:
- Rule-based extraction - Deterministic output without GPU requirements
- Bounding boxes for all elements - Essential for citation systems
- XY-Cut++ reading order - Handles multi-column layouts correctly
- Built-in AI safety filters - Protects against prompt injection
- Native Tagged PDF support - Leverages accessibility metadata
This means: consistent output (same input = same output), no GPU required, faster processing, and no model hallucinations.
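To make the reading-order idea concrete, here is a minimal sketch of a plain recursive XY-cut over bounding boxes. It is not the XY-Cut++ variant that OpenDataLoader uses, and the names `Box`, `xy_cut`, and `min_gap` are illustrative assumptions. Boxes are `(x0, y0, x1, y1)` with the origin at the top-left, which is also an assumption (native PDF coordinates start at the bottom-left).

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), origin top-left


def _widest_gap(boxes: List[Box], lo: int, hi: int) -> Tuple[float, float]:
    """Return (gap_width, cut_position) of the widest empty band on one axis."""
    intervals = sorted((b[lo], b[hi]) for b in boxes)
    best_gap, best_cut = 0.0, 0.0
    reach = intervals[0][1]  # farthest coordinate covered so far
    for start, end in intervals[1:]:
        if start - reach > best_gap:
            best_gap, best_cut = start - reach, (start + reach) / 2
        reach = max(reach, end)
    return best_gap, best_cut


def xy_cut(boxes: List[Box], min_gap: float = 5.0) -> List[Box]:
    """Order boxes by recursively splitting at the widest whitespace band."""
    if len(boxes) <= 1:
        return list(boxes)
    gap_x, cut_x = _widest_gap(boxes, 0, 2)
    gap_y, cut_y = _widest_gap(boxes, 1, 3)
    widest = max(gap_x, gap_y)
    if widest <= 0 or widest < min_gap:
        # No usable whitespace: fall back to top-to-bottom, left-to-right.
        return sorted(boxes, key=lambda b: (b[1], b[0]))
    if gap_x >= gap_y:
        # Vertical cut: everything left of the gap is read first.
        first = [b for b in boxes if b[2] <= cut_x]
        second = [b for b in boxes if b[0] >= cut_x]
    else:
        # Horizontal cut: everything above the gap is read first.
        first = [b for b in boxes if b[3] <= cut_y]
        second = [b for b in boxes if b[1] >= cut_y]
    return xy_cut(first, min_gap) + xy_cut(second, min_gap)


# A full-width header above two columns, given in scrambled order.
header = (0, 0, 220, 40)
left_top, left_bottom = (0, 60, 100, 150), (0, 160, 100, 300)
right_top, right_bottom = (120, 60, 220, 200), (120, 210, 220, 300)
ordered = xy_cut([right_bottom, left_top, header, right_top, left_bottom])
print(ordered)
```

For this layout the header comes first, then the left column top to bottom, then the right column. Because the procedure is purely geometric, the same boxes always yield the same order, which is the deterministic behavior described above.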
How to start
Check our Quick Start guide, Advanced Features, Frequently Asked Questions, and other technical documentation on GitHub.
AI-based auto-tagging to Tagged PDF
OpenDataLoader PDF now ships auto-tagging functionality based on its layout analysis engine. It is the first open-source PDF tool to implement AI-generated accessibility auto-tagging and produce Tagged PDF output entirely under an open-source license (Apache 2.0), with no proprietary dependency. Use `--format tagged-pdf` (CLI) or `format="tagged-pdf"` (Python/Node.js).
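A minimal Python sketch of the option might look like the following. Only the `format="tagged-pdf"` value comes from the text above. The `convert` function name and its `input_path` and `output_dir` parameters are assumptions about the Python package, and `tag_pdfs` is a hypothetical helper, so check the Quick Start guide for the actual call.

```python
from typing import List


def tag_pdfs(paths: List[str], output_dir: str = "output/") -> None:
    """Write Tagged PDF versions of the given files (hypothetical helper)."""
    # Imported lazily so this sketch can be loaded without the package installed.
    # Assumption: the package exposes convert() with input_path and output_dir.
    import opendataloader_pdf

    opendataloader_pdf.convert(
        input_path=paths,
        output_dir=output_dir,
        format="tagged-pdf",  # the format value named in the text above
    )
```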
Auto-tagging follows the PDF Association's Well-Tagged PDF specification and is validated using veraPDF, the industry-reference open-source PDF/A and PDF/UA validator.
This is the first major milestone on OpenDataLoader's roadmap toward PDF accessibility. With the European Accessibility Act (EAA) now in force, South Korea's anti-discrimination legislation tightening, and accessibility regulations expanding globally, compliance has become a real operational burden for enterprises.
Acknowledgments and Collaboration
The development of OpenDataLoader PDF v2.0 has been made possible through the contributions, feedback, and support of our community.
We thank the open-source community for their continued engagement through code contributions, issue reporting, testing, and thoughtful discussions. Your collaboration has been essential in improving the reliability, usability, and performance of OpenDataLoader PDF.
We welcome you to help improve OpenDataLoader PDF by joining us on GitHub.
You can open issues, review pull requests, submit test PRs based on open issues, or help others in discussions. If you have any questions, feel free to contact us at opendataloader@hancom.com.
Stay updated and connect with others by following us on X and LinkedIn.