Tagged PDF

Using native PDF structure tags for accurate AI data extraction and accessibility compliance

Why Tagged PDF Matters for AI

Tagged PDF includes semantic structure (headings, paragraphs, lists, tables) that tells AI exactly how a document is organized. When a PDF has proper tags, you get:

  • Exact layout intent — No guessing, no heuristics
  • Correct reading order — Author's intended flow preserved
  • Semantic hierarchy — Headings, lists, tables properly identified
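
To make this concrete, here is a minimal sketch of why a tag tree removes guesswork about reading order. The `StructNode` class and `reading_order` function are hypothetical names used only for illustration, not part of OpenDataLoader's API; the tag names (`Document`, `H1`, `P`, `L`, `LI`) are standard PDF structure types. A depth-first walk over the tree yields content in the logical order the author defined, with each item's semantic role attached.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class StructNode:
    """Hypothetical, simplified stand-in for a PDF structure element."""
    tag: str                      # standard structure type, e.g. "H1", "P", "LI"
    text: str = ""                # text carried by this element, if any
    children: list["StructNode"] = field(default_factory=list)


def reading_order(node: StructNode) -> Iterator[tuple[str, str]]:
    """Yield (tag, text) pairs depth-first, i.e. in the tree's logical order."""
    if node.text:
        yield (node.tag, node.text)
    for child in node.children:
        yield from reading_order(child)


# A toy document: one heading, one paragraph, and a two-item list.
doc = StructNode("Document", children=[
    StructNode("H1", "Annual Report"),
    StructNode("P", "Revenue grew this year."),
    StructNode("L", children=[
        StructNode("LI", "Europe"),
        StructNode("LI", "Asia"),
    ]),
])

for tag, text in reading_order(doc):
    print(f"{tag}: {text}")
```

No coordinates or layout analysis are involved: the order and the roles come straight from the tree, which is what an untagged PDF lacks.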

Accessibility Regulations

Multiple regulations now require accessible digital documents, driving widespread adoption of Tagged PDF. Key examples are the European Accessibility Act (EAA) in the EU and the ADA and Section 508 in the USA, with similar laws in other jurisdictions.

See Accessibility Compliance for details.

OpenDataLoader leverages this shift — when structure tags exist, we extract the exact layout the author intended, without guessing.

How to Use Tagged PDF

Enable Tagged PDF extraction with the use_struct_tree option:

import opendataloader_pdf

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    use_struct_tree=True                # Use native PDF structure tags
)

Most PDF parsers ignore structure tags entirely. OpenDataLoader is one of the few that fully supports them.

CLI Usage

# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf file1.pdf file2.pdf folder/ \
  --output-dir output/ \
  --use-struct-tree

Checking if a PDF is Tagged

If a PDF lacks structure tags, OpenDataLoader logs a warning and falls back to visual heuristics (XY-Cut++ algorithm). Check your logs for:

WARN: Document lacks structure tree, falling back to visual heuristics
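
If you capture the converter's log output, a small check can flag runs that fell back to heuristics. This is a sketch under assumptions: `fell_back_to_heuristics` is a hypothetical helper, it assumes the warning text matches the line shown above, and it assumes you already have the log as a string (how you capture it depends on your setup).

```python
# Warning text as shown above; adjust if your version words it differently.
FALLBACK_MARKER = "Document lacks structure tree"


def fell_back_to_heuristics(log_text: str) -> bool:
    """Return True if any log line contains the missing-structure-tree warning."""
    return any(FALLBACK_MARKER in line for line in log_text.splitlines())


sample_log = "WARN: Document lacks structure tree, falling back to visual heuristics"
print(fell_back_to_heuristics(sample_log))
```

Because the warning line shown above does not name a file, a check like this only tells you that at least one document in the batch was untagged.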

Development Status

| Feature | Purpose | Status |
| --- | --- | --- |
| Tag extraction | Use existing tags to determine document structure | Available |
| Auto-Tagging Engine | Generate structure tags for untagged PDFs | Available |
| Tag validation | Validate tags against PDF Association recommendations | In progress |
| PDF/UA Validation | Verify compliance with PDF/UA standards | Q3 2026 |
| Hybrid extraction | Combine tags with visual heuristics for best results | In progress |

Use Cases

Research Papers

A well-tagged paper lets AI accurately identify author names, affiliations, and sections — enabling automated citation building.

Financial Reports

Proper tags enable precise extraction of balance sheet titles and data cells, automating analysis without error-prone heuristics.

Legal Documents

Tags help AI quickly identify and cross-reference clauses, dates, and parties, speeding up legal review.
