Tagged PDF

Using native PDF structure tags for accurate AI data extraction and accessibility compliance

Why Tagged PDF Matters for AI

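A Tagged PDF carries a semantic structure tree (headings, paragraphs, lists, tables) that records how the author organized the document, so an extraction tool can read that organization instead of inferring it from the page layout. When a PDF has proper tags, you get: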

Exact layout intent — No guessing, no heuristics
Correct reading order — Author's intended flow preserved
Semantic hierarchy — Headings, lists, tables properly identified
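
To make the three points above concrete, here is a toy model of a structure tree. It is only an illustration: the StructElem class and the sample document are hypothetical and are not OpenDataLoader's internal representation, and real list items usually nest further elements that are omitted here. The tag names (Document, H1, P, L, LI) are standard PDF structure types. Walking the tree depth-first gives the reading order, and the tag on each node gives its role.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class StructElem:
    """Toy stand-in for a PDF structure element (hypothetical, for illustration)."""
    tag: str                      # standard structure type, e.g. "H1", "P", "L", "LI"
    text: str = ""                # text attached to this element, if any
    children: list["StructElem"] = field(default_factory=list)


def reading_order(elem: StructElem, depth: int = 0) -> Iterator[tuple[int, str, str]]:
    """Yield (depth, tag, text) depth-first, which is the order the tags encode."""
    yield depth, elem.tag, elem.text
    for child in elem.children:
        yield from reading_order(child, depth + 1)


# Made-up sample document: one heading, one paragraph, one two-item list.
doc = StructElem("Document", children=[
    StructElem("H1", "Quarterly Report"),
    StructElem("P", "Revenue grew in all regions."),
    StructElem("L", children=[
        StructElem("LI", "EMEA"),
        StructElem("LI", "APAC"),
    ]),
])

for depth, tag, text in reading_order(doc):
    print("  " * depth + f"{tag}: {text}".rstrip())
```

No coordinates or font sizes are consulted at any point, which is what "no heuristics" means in practice: the order and the roles come from the tree itself.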

Accessibility Regulations

Several regulations now require accessible digital documents, which is driving adoption of Tagged PDF. They include the European Accessibility Act (EAA) in the EU, the ADA and Section 508 in the USA, and similar laws in other jurisdictions.

See Accessibility Compliance for details.

OpenDataLoader leverages this shift — when structure tags exist, we extract the exact layout the author intended, without guessing.

How to Use Tagged PDF

Enable Tagged PDF extraction with the use_struct_tree option:

import opendataloader_pdf

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    use_struct_tree=True                # Use native PDF structure tags
)
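
Because each convert() call starts a JVM, it can help to gather every input first and convert once. The sketch below is one way to do that. The helper names collect_inputs and convert_tagged are hypothetical and not part of the library; only the convert() call and its three arguments come from the example above.

```python
from pathlib import Path


def collect_inputs(paths):
    """Return existing PDF files and directories, de-duplicated, in first-seen order."""
    seen, result = set(), []
    for raw in paths:
        p = Path(raw)
        is_pdf = p.is_file() and p.suffix.lower() == ".pdf"
        if (is_pdf or p.is_dir()) and p not in seen:
            seen.add(p)
            result.append(str(p))
    return result


def convert_tagged(paths, output_dir="output/"):
    """Run a single batched conversion with structure-tag extraction enabled."""
    import opendataloader_pdf  # imported lazily so collect_inputs works without it

    inputs = collect_inputs(paths)
    if not inputs:
        raise ValueError("no existing PDF files or directories to convert")
    opendataloader_pdf.convert(
        input_path=inputs,
        output_dir=output_dir,
        use_struct_tree=True,  # same option as in the example above
    )
    return inputs
```

Filtering out missing paths before the call is a design choice made for this sketch; whether the library itself rejects or skips them is not covered here.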

Many PDF parsers ignore structure tags entirely. OpenDataLoader reads them whenever use_struct_tree is enabled.

CLI Usage

# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf file1.pdf file2.pdf folder/ \
  --output-dir output/ \
  --use-struct-tree
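
If the CLI is driven from a script, the same batching advice applies: build one command for all inputs rather than looping. A minimal sketch follows, assuming the opendataloader-pdf executable is on PATH. The function names are hypothetical, and only the flags shown above are used.

```python
import subprocess


def build_command(inputs, output_dir="output/"):
    """Build one opendataloader-pdf invocation that covers all inputs."""
    if not inputs:
        raise ValueError("at least one input path is required")
    return ["opendataloader-pdf", *inputs, "--output-dir", output_dir, "--use-struct-tree"]


def run_batch(inputs, output_dir="output/"):
    """Run the CLI once; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(
        build_command(inputs, output_dir),
        check=True,
        capture_output=True,  # keep stdout and stderr for later inspection
        text=True,
    )
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.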

Checking if a PDF is Tagged

If a PDF lacks structure tags, OpenDataLoader logs a warning and falls back to visual heuristics (XY-Cut++ algorithm). Check your logs for:

WARN: Document lacks structure tree, falling back to visual heuristics
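
For large batches, scanning captured log text is easier than reading it by eye. The sketch below looks for the warning text quoted above. How the log text is captured is left to the caller, the helper name is hypothetical, and the INFO lines in the sample are made up for illustration.

```python
FALLBACK_MARKER = "Document lacks structure tree"  # taken from the warning shown above


def fallback_warnings(log_text):
    """Return the log lines that report a fallback to visual heuristics."""
    return [line.strip() for line in log_text.splitlines() if FALLBACK_MARKER in line]


# Made-up sample log; only the WARN line matches the documented message.
sample_log = """\
INFO: Processing file1.pdf
WARN: Document lacks structure tree, falling back to visual heuristics
INFO: Processing file2.pdf
"""

for line in fallback_warnings(sample_log):
    print(line)
```

An empty result suggests every document in the run had a structure tree, provided the captured text really includes the warnings.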

Development Status

Feature	Purpose	Status
Tag extraction	Use existing tags to determine document structure	Available
Auto-Tagging Engine	Generate structure tags for untagged PDFs	Available
Tag validation	Validate tags against PDF Association recommendations	In progress
PDF/UA Validation	Verify compliance with PDF/UA standards	Q3 2026
Hybrid extraction	Combine tags with visual heuristics for best results	In progress

Learn More

Tagged PDF for RAG: Optimizing extraction for AI pipelines
Accessibility Compliance: EAA, ADA, and regulatory requirements
PDF Accessibility Glossary: Key terms and concepts
Industry Collaboration: Based on PDF Association specifications, developed with Hancom and Dual Lab
