JSON Schema
Understand the layout structure emitted by OpenDataLoader PDF
Every conversion that includes the json format produces a hierarchical JSON document describing the detected elements (pages, tables, lists, captions, etc.). Use the reference below to map fields to your downstream processors.
| Field | Type | Required | Description |
|---|---|---|---|
| file name | string | Yes | Name of the processed PDF |
| number of pages | integer | Yes | Total page count |
| author | string \| null | Yes | PDF author metadata |
| title | string \| null | Yes | PDF title metadata |
| creation date | string \| null | Yes | PDF creation timestamp |
| modification date | string \| null | Yes | PDF modification timestamp |
| kids | array | Yes | Top-level content elements (per page) |
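As a minimal sketch of reading these root fields, the snippet below parses a hypothetical sample (the values and the `summarize` helper are illustrative, not part of OpenDataLoader PDF) and handles the metadata fields whose value may be null even though the key is required.

```python
import json

# Hypothetical sample shaped like the root fields listed above. A real
# document would come from json.load() on the converter's JSON output.
sample = json.loads("""
{
  "file name": "report.pdf",
  "number of pages": 2,
  "author": null,
  "title": "Quarterly Report",
  "creation date": null,
  "modification date": null,
  "kids": []
}
""")

def summarize(doc: dict) -> dict:
    """Collect root metadata, substituting a placeholder for null values."""
    nullable = ("author", "title", "creation date", "modification date")
    summary = {"file name": doc["file name"], "pages": doc["number of pages"]}
    for key in nullable:
        value = doc[key]  # the key is required, but its value may be None
        summary[key] = value if value is not None else "(not set)"
    summary["top-level elements"] = len(doc["kids"])
    return summary

summary = summarize(sample)
print(summary)
```

Indexing with `doc[key]` rather than `doc.get(key)` is deliberate here: a missing required key then fails loudly instead of being mistaken for a null value.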
All content elements share these base properties:
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Element type |
| id | integer | No | Unique content identifier |
| level | string | No | Heading or structural level |
| page number | integer | Yes | Page containing the element (1-indexed) |
| bounding box | boundingBox | Yes | Bounding box of the element on the page |
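Because every element carries `type` and `page number`, a generic traversal is often enough for simple statistics. The sketch below assumes child elements live under the keys named in this reference (`kids`, `rows`, `cells`, `list items`); the `walk` helper and the sample fragment are hypothetical, and the sample is trimmed to the fields it uses.

```python
from collections import Counter
from typing import Iterator

# Keys that hold child elements, taken from the field tables in this reference.
CHILD_KEYS = ("kids", "rows", "cells", "list items")

def walk(node: dict) -> Iterator[dict]:
    """Yield the node itself and every descendant, depth-first."""
    yield node
    for key in CHILD_KEYS:
        for child in node.get(key, []):
            yield from walk(child)

# Hypothetical fragment, trimmed to the fields used below.
doc = {
    "kids": [
        {"type": "heading", "id": 1, "page number": 1},
        {"type": "paragraph", "id": 2, "page number": 1},
        {"type": "paragraph", "page number": 2},
    ]
}

elements = [n for n in walk(doc) if "type" in n]  # skips the root object
by_type = Counter(e["type"] for e in elements)
by_page = Counter(e["page number"] for e in elements)
ids = [e["id"] for e in elements if "id" in e]  # id is optional

print(by_type, by_page, ids)
```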
Text nodes (paragraph, heading, caption, list item) include these additional fields:
| Field | Type | Required | Description |
|---|---|---|---|
| font | string | Yes | Font name |
| font size | number | Yes | Font size |
| text color | string | Yes | RGB color as a string-encoded array |
| content | string | Yes | Raw text value |
| hidden text | boolean | No | Whether this is hidden text (e.g., OCR layer) |
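One common use of these text fields is extracting readable text while skipping hidden layers. The following is a sketch only: `visible_text` is a hypothetical helper, the sample nodes are trimmed, and treating a missing `hidden text` field as "visible" is an assumption rather than documented behavior.

```python
def visible_text(nodes: list[dict]) -> list[str]:
    """Return the content of text nodes, skipping those flagged as hidden."""
    out = []
    for node in nodes:
        if "content" not in node:
            continue  # not a text node
        # "hidden text" is optional; absence is treated as visible (assumption).
        if node.get("hidden text", False):
            continue
        out.append(node["content"])
    return out

# Hypothetical nodes, trimmed to the fields used here.
nodes = [
    {"type": "paragraph", "content": "Visible body text", "font size": 10.0},
    {"type": "paragraph", "content": "ocr duplicate", "font size": 10.0,
     "hidden text": True},
    {"type": "heading", "content": "Results", "font size": 16.0},
    {"id": 4},  # an element without text fields
]

texts = visible_text(nodes)
largest = max(n["font size"] for n in nodes if "font size" in n)
print(texts, largest)
```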
Headings include text properties plus:

| Field | Type | Required | Description |
|---|---|---|---|
| heading level | integer | Yes | Heading level (e.g., 1 for h1) |
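For heading nodes, `heading level` is enough to rebuild a document outline. A minimal sketch, assuming headings have already been collected in reading order; the `outline` helper and the sample data are hypothetical.

```python
def outline(headings: list[dict]) -> str:
    """Render headings as an indented outline, two spaces per level."""
    lines = []
    for h in headings:
        indent = "  " * (h["heading level"] - 1)  # level 1 gets no indent
        lines.append(f"{indent}{h['content']}")
    return "\n".join(lines)

# Hypothetical headings, trimmed to the fields used here.
headings = [
    {"type": "heading", "heading level": 1, "content": "Introduction"},
    {"type": "heading", "heading level": 2, "content": "Scope"},
    {"type": "heading", "heading level": 1, "content": "Methods"},
]

toc = outline(headings)
print(toc)
```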
Captions include text properties plus:

| Field | Type | Required | Description |
|---|---|---|---|
| linked content id | integer | No | ID of the linked content element (table, image, etc.) |
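To pair a caption with the element it describes, `linked content id` can be looked up in an index keyed by `id`. This is a sketch under assumptions: the `"table"` type string and the `caption_targets` helper are illustrative, and because the field is optional, unlinked captions map to `None`.

```python
def caption_targets(elements: list[dict]) -> list[tuple]:
    """Pair each caption's text with the type of its linked element, if any."""
    by_id = {e["id"]: e for e in elements if "id" in e}
    pairs = []
    for e in elements:
        if e.get("type") != "caption":
            continue
        # "linked content id" is optional, so the lookup may find nothing.
        target = by_id.get(e.get("linked content id"))
        pairs.append((e["content"], target["type"] if target else None))
    return pairs

# Hypothetical flat list of elements, trimmed to the fields used here.
elements = [
    {"type": "table", "id": 7},  # "table" as a type string is an assumption
    {"type": "caption", "id": 8, "content": "Table 1: Revenue",
     "linked content id": 7},
    {"type": "caption", "id": 9, "content": "Orphan caption"},
]

pairs = caption_targets(elements)
print(pairs)
```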
Tables include these additional fields:

| Field | Type | Required | Description |
|---|---|---|---|
| number of rows | integer | Yes | Row count |
| number of columns | integer | Yes | Column count |
| previous table id | integer | No | Identifier of the preceding linked table (if broken across pages) |
| next table id | integer | No | Identifier of the following linked table (if broken across pages) |
| rows | array | Yes | Row objects |
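When a table is split across pages, the `previous table id` and `next table id` links can be followed to stitch its rows back together. The sketch below assumes these fields hold the `id` of the neighboring table fragment; `merged_rows` and the sample tables are hypothetical.

```python
def merged_rows(tables_by_id: dict, start_id: int) -> list[dict]:
    """Follow "next table id" links from start_id and concatenate the rows."""
    rows, seen = [], set()
    current = tables_by_id.get(start_id)
    while current is not None and current["id"] not in seen:
        seen.add(current["id"])  # guards against a cyclic chain
        rows.extend(current["rows"])
        current = tables_by_id.get(current.get("next table id"))
    return rows

def row(n: int) -> dict:
    """Build a hypothetical, empty table row."""
    return {"type": "table row", "row number": n, "cells": []}

# Hypothetical table fragments; "table" as a type string is an assumption.
tables = [
    {"type": "table", "id": 3, "next table id": 5, "rows": [row(1)]},
    {"type": "table", "id": 5, "previous table id": 3, "rows": [row(1), row(2)]},
    {"type": "table", "id": 9, "rows": []},
]
tables_by_id = {t["id"]: t for t in tables}

# A table with no "previous table id" starts a chain (or stands alone).
heads = [t["id"] for t in tables if "previous table id" not in t]
stitched = merged_rows(tables_by_id, 3)
print(heads, len(stitched))
```

Row numbers restart in each fragment in this sample, so a consumer that needs continuous numbering would renumber after merging.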
Each table row has these fields:

| Field | Type | Required | Description |
|---|---|---|---|
| type | "table row" | Yes | Element type |
| row number | integer | Yes | Row index (1-indexed) |
| cells | array | Yes | Cell objects |
Each table cell has these fields:

| Field | Type | Required | Description |
|---|---|---|---|
| row number | integer | Yes | Row index of the cell (1-indexed) |
| column number | integer | Yes | Column index of the cell (1-indexed) |
| row span | integer | Yes | Number of rows spanned |
| column span | integer | Yes | Number of columns spanned |
| kids | array | Yes | Nested content elements |
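The row and cell fields can be expanded into a plain 2D grid, which is often easier to feed into a CSV writer or a dataframe. This sketch assumes a spanning cell appears once, at its top-left position, and that cell text lives in the `content` of its `kids`; `to_grid` and the sample table are hypothetical.

```python
def to_grid(table: dict) -> list[list]:
    """Expand rows and cells into a 2D grid, repeating text across spans."""
    n_rows, n_cols = table["number of rows"], table["number of columns"]
    grid = [[None] * n_cols for _ in range(n_rows)]
    for row in table["rows"]:
        for cell in row["cells"]:
            text = " ".join(k.get("content", "") for k in cell["kids"])
            r0 = cell["row number"] - 1       # convert 1-indexed to 0-indexed
            c0 = cell["column number"] - 1
            # Clamp spans so a malformed cell cannot index past the grid.
            for r in range(r0, min(r0 + cell["row span"], n_rows)):
                for c in range(c0, min(c0 + cell["column span"], n_cols)):
                    grid[r][c] = text
    return grid

def cell(r: int, c: int, text: str, rs: int = 1, cs: int = 1) -> dict:
    """Build a hypothetical cell holding a single paragraph."""
    return {"row number": r, "column number": c, "row span": rs,
            "column span": cs, "kids": [{"type": "paragraph", "content": text}]}

# Hypothetical 2x2 table whose first row is one cell spanning both columns.
table = {
    "number of rows": 2,
    "number of columns": 2,
    "rows": [
        {"type": "table row", "row number": 1, "cells": [cell(1, 1, "Header", cs=2)]},
        {"type": "table row", "row number": 2, "cells": [cell(2, 1, "a"), cell(2, 2, "b")]},
    ],
}

grid = to_grid(table)
print(grid)
```

Repeating the text across spanned positions keeps every row the same width; leaving the extra positions as `None` is an equally reasonable choice if duplicates are undesirable.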
Lists include these additional fields:

| Field | Type | Required | Description |
|---|---|---|---|
| numbering style | string | Yes | Marker style (ordered, bullet, etc.) |
| number of list items | integer | Yes | Item count |
| previous list id | integer | No | Identifier of the preceding linked list |
| next list id | integer | No | Identifier of the following linked list |
| list items | array | Yes | Item nodes |
List items include text properties plus:
| Field | Type | Required | Description |
|---|---|---|---|
| kids | array | Yes | Nested content elements |
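Since list items carry both text and `kids`, nested lists can be flattened recursively. The sketch below detects a nested list by the presence of a `list items` field rather than by a type string, which avoids guessing the exact `type` value; `flatten` and the sample list are hypothetical.

```python
def flatten(lst: dict, depth: int = 0) -> list[tuple]:
    """Return (depth, text) pairs for every item, descending into sublists."""
    out = []
    for item in lst["list items"]:
        out.append((depth, item["content"]))
        for kid in item["kids"]:
            if "list items" in kid:  # a nested list, identified by its field
                out.extend(flatten(kid, depth + 1))
    return out

# Hypothetical list with one nested sublist, trimmed to the fields used here.
lst = {
    "numbering style": "bullet",
    "number of list items": 2,
    "list items": [
        {"content": "First", "kids": []},
        {"content": "Second", "kids": [
            {"numbering style": "ordered", "number of list items": 1,
             "list items": [{"content": "Nested", "kids": []}]},
        ]},
    ],
}

flat = flatten(lst)
for depth, text in flat:
    print("  " * depth + "- " + text)
```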
Images include these additional fields:

| Field | Type | Required | Description |
|---|---|---|---|
| source | string | No | Relative path to the image file |
| data | string | No | Base64 data URI (when image-output is "embedded") |
| format | string | No | Image format (png, jpeg) |
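For image elements with embedded output, the `data` field can be decoded back into bytes. This sketch assumes the value follows the standard `data:<mime>;base64,<payload>` form; the `image_bytes` helper and the sample payload are hypothetical.

```python
import base64
from typing import Optional

def image_bytes(image: dict) -> Optional[bytes]:
    """Decode the embedded data URI, or return None when "data" is absent."""
    data = image.get("data")
    if data is None:
        return None  # the image is referenced through "source" or not exported
    header, _, payload = data.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data URI")
    return base64.b64decode(payload)

# Hypothetical image element; the payload is placeholder bytes, not a real PNG.
raw = b"placeholder image bytes"
embedded = {
    "type": "image",  # the type string is an assumption
    "format": "png",
    "data": "data:image/png;base64," + base64.b64encode(raw).decode("ascii"),
}
external = {"type": "image", "format": "png", "source": "images/figure1.png"}

decoded = image_bytes(embedded)
print(len(decoded), image_bytes(external))
```

All three fields are optional, so callers should handle an element that has `source`, `data`, or neither.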
Headers and footers have these fields:

| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Either header or footer |
| kids | array | Yes | Content elements within the header or footer |
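Downstream text processing often wants the body without running headers and footers. A minimal sketch, assuming header and footer elements appear among a parent's `kids` with `type` set to `header` or `footer`; both helpers and the sample data are hypothetical.

```python
FURNITURE = ("header", "footer")

def body_elements(kids: list[dict]) -> list[dict]:
    """Return the elements that are neither a header nor a footer."""
    return [k for k in kids if k.get("type") not in FURNITURE]

def furniture_text(kids: list[dict]) -> list[str]:
    """Collect the text found inside header and footer elements."""
    texts = []
    for k in kids:
        if k.get("type") in FURNITURE:
            texts.extend(c["content"] for c in k["kids"] if "content" in c)
    return texts

# Hypothetical top-level elements, trimmed to the fields used here.
kids = [
    {"type": "header", "kids": [{"type": "paragraph", "content": "ACME Corp"}]},
    {"type": "paragraph", "content": "Body text."},
    {"type": "footer", "kids": [{"type": "paragraph", "content": "Page 1"}]},
]

body = body_elements(kids)
repeated = furniture_text(kids)
print([b["content"] for b in body], repeated)
```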
Text blocks include this additional field:

| Field | Type | Required | Description |
|---|---|---|---|
| kids | array | Yes | Text block children |
The complete JSON Schema is available at schema.json in the repository root.