
JSON Schema

Understand the layout structure emitted by OpenDataLoader PDF

Every conversion that includes the json format produces a hierarchical document describing detected elements (pages, tables, lists, captions, etc.). Use the following reference to map fields into your downstream processors.

Root node

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `file name` | string | Yes | Name of the processed PDF |
| `number of pages` | integer | Yes | Total page count |
| `author` | string \| null | Yes | PDF author metadata |
| `title` | string \| null | Yes | PDF title metadata |
| `creation date` | string \| null | Yes | PDF creation timestamp |
| `modification date` | string \| null | Yes | PDF modification timestamp |
| `kids` | array | Yes | Top-level content elements (per page) |
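
As a minimal sketch of reading these root fields, the Python snippet below parses a hand-written sample and pulls out the metadata. The sample values and the `summarize_root` helper are illustrative assumptions, not part of OpenDataLoader; only the field names come from the table above.

```python
import json

def summarize_root(doc: dict) -> dict:
    """Pick the root-level metadata fields out of a parsed output document."""
    return {
        "file": doc["file name"],
        "pages": doc["number of pages"],
        # author and title are required keys but may be null (None in Python)
        "author": doc["author"] or "(unknown)",
        "title": doc["title"] or "(untitled)",
        "top_level_elements": len(doc["kids"]),
    }

# Hand-written sample for illustration; real output comes from a conversion
# that includes the json format.
sample = json.loads("""
{
  "file name": "report.pdf",
  "number of pages": 2,
  "author": null,
  "title": "Quarterly Report",
  "creation date": null,
  "modification date": null,
  "kids": []
}
""")

summary = summarize_root(sample)
print(summary)
```

Because the metadata fields are nullable, downstream code should handle `None` explicitly rather than assume a string.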

Common content fields

All content elements share these base properties:

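| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `type` | string | Yes | Element type |
| `id` | integer | No | Unique content identifier |
| `level` | string | No | Heading or structural level |
| `page number` | integer | Yes | Page containing the element (1-indexed) |
| `bounding box` | boundingBox | Yes | Position of the element on the page |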
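
Since every element carries `type` and `page number`, a generic traversal is often the first thing a downstream processor needs. The sketch below walks the tree through the child arrays named on this page (`kids`, `rows`, `cells`, `list items`). The `walk` and `iter_elements` helpers, the sample data, and the `"list"` type value are assumptions made for illustration; fields not needed here (such as `bounding box`) are omitted from the sample.

```python
from collections import Counter
from typing import Iterator

# Child arrays documented on this page; each may hold nested elements.
CHILD_KEYS = ("kids", "rows", "cells", "list items")

def walk(node: dict) -> Iterator[dict]:
    """Yield a node and all of its descendants, depth-first in document order."""
    yield node
    for key in CHILD_KEYS:
        for child in node.get(key) or []:
            if isinstance(child, dict):
                yield from walk(child)

def iter_elements(doc: dict) -> Iterator[dict]:
    """Yield every content element below the root node (the root itself is skipped)."""
    for top in doc.get("kids") or []:
        yield from walk(top)

# Hand-written sample; "list" as a type value is an assumption.
sample = {
    "kids": [
        {"type": "heading", "page number": 1, "content": "Intro"},
        {"type": "list", "page number": 1, "list items": [
            {"type": "list item", "page number": 1, "content": "First", "kids": []},
        ]},
        {"type": "paragraph", "page number": 2, "content": "Body text"},
    ]
}

elements = list(iter_elements(sample))
counts = Counter(e["type"] for e in elements)
print(counts)
```

Counting by `type` like this is a quick way to check which element kinds a given PDF actually produces before writing type-specific handlers.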

Text properties

Text nodes (paragraph, heading, caption, list item) include these additional fields:

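| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `font` | string | Yes | Font name |
| `font size` | number | Yes | Font size |
| `text color` | string | Yes | RGB color, serialized as a string-encoded array |
| `content` | string | Yes | Raw text value |
| `hidden text` | boolean | No | Whether this is hidden text (e.g., OCR layer) |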

Headings

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `heading level` | integer | Yes | Heading level (e.g., 1 for h1) |
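
One common use of `heading level` is building a document outline. The following sketch indents each heading by its level; the `outline` helper and the sample elements are hypothetical, and it assumes heading elements carry `"type": "heading"`.

```python
def outline(elements: list[dict]) -> list[str]:
    """Return one indented line per heading, in document order."""
    lines = []
    for el in elements:
        if el.get("type") != "heading":
            continue
        # Level 1 gets no indent, level 2 gets two spaces, and so on.
        indent = "  " * (el["heading level"] - 1)
        lines.append(f'{indent}{el["content"]} (p. {el["page number"]})')
    return lines

# Hand-written sample elements (flat list, as produced by a tree traversal).
sample_elements = [
    {"type": "heading", "heading level": 1, "content": "Overview", "page number": 1},
    {"type": "paragraph", "content": "Some text.", "page number": 1},
    {"type": "heading", "heading level": 2, "content": "Scope", "page number": 2},
]

toc = outline(sample_elements)
print("\n".join(toc))
```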

Captions

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `linked content id` | integer | No | ID of the linked content element (table, image, etc.) |

Tables

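| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `number of rows` | integer | Yes | Row count |
| `number of columns` | integer | Yes | Column count |
| `previous table id` | integer | No | ID of the preceding linked table (if the table is broken across pages) |
| `next table id` | integer | No | ID of the following linked table (if the table is broken across pages) |
| `rows` | array | Yes | Row objects |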
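
When a table is broken across pages, `previous table id` and `next table id` let you regroup the fragments. The sketch below follows `next table id` from each table that has no predecessor. The `table_chains` helper and the sample IDs are illustrative assumptions, and the code assumes the links refer to the `id` field of other table elements.

```python
def table_chains(tables: list[dict]) -> list[list[dict]]:
    """Group table fragments into chains by following 'next table id' links."""
    by_id = {t["id"]: t for t in tables if "id" in t}
    chains = []
    for table in tables:
        if table.get("previous table id") is not None:
            continue  # continuation fragment; reached from its chain's first table
        chain = [table]
        seen = {table.get("id")}
        while True:
            nxt = by_id.get(chain[-1].get("next table id"))
            if nxt is None or nxt["id"] in seen:  # end of chain, or a cycle guard
                break
            chain.append(nxt)
            seen.add(nxt["id"])
        chains.append(chain)
    return chains

# Hand-written sample: table 5 continues as table 9; table 12 stands alone.
sample_tables = [
    {"type": "table", "id": 5, "next table id": 9, "rows": []},
    {"type": "table", "id": 9, "previous table id": 5, "rows": []},
    {"type": "table", "id": 12, "rows": []},
]

chain_ids = [[t["id"] for t in chain] for chain in table_chains(sample_tables)]
print(chain_ids)
```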

Table rows

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `type` | "table row" | Yes | Element type |
| `row number` | integer | Yes | Row index (1-indexed) |
| `cells` | array | Yes | Cell objects |

Table cells

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `row number` | integer | Yes | Row index of the cell (1-indexed) |
| `column number` | integer | Yes | Column index of the cell (1-indexed) |
| `row span` | integer | Yes | Number of rows spanned |
| `column span` | integer | Yes | Number of columns spanned |
| `kids` | array | Yes | Nested content elements |

Lists

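| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `numbering style` | string | Yes | Marker style (ordered, bullet, etc.) |
| `number of list items` | integer | Yes | Item count |
| `previous list id` | integer | No | ID of the preceding linked list |
| `next list id` | integer | No | ID of the following linked list |
| `list items` | array | Yes | Item nodes |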

List items

List items include text properties plus:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `kids` | array | Yes | Nested content elements |
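
Because list items carry both text properties and `kids`, nested lists can be rendered recursively. In the sketch below, `render_list` is a hypothetical helper and the sample is hand-written. It detects a nested list by the presence of a `list items` array inside an item's `kids`, which avoids assuming a particular `type` value for lists.

```python
def render_list(lst: dict, depth: int = 0) -> list[str]:
    """Flatten a list element into indented lines, one per list item."""
    lines = []
    for item in lst["list items"]:
        lines.append("  " * depth + item["content"])
        for kid in item.get("kids") or []:
            if "list items" in kid:  # a nested list under this item
                lines.extend(render_list(kid, depth + 1))
    return lines

# Hand-written sample: one top-level item with a nested two-item list.
sample_list = {
    "numbering style": "bullet",
    "number of list items": 2,
    "list items": [
        {"type": "list item", "content": "Fruit", "kids": [
            {"numbering style": "bullet", "number of list items": 2, "list items": [
                {"type": "list item", "content": "Apple", "kids": []},
                {"type": "list item", "content": "Pear", "kids": []},
            ]},
        ]},
        {"type": "list item", "content": "Vegetables", "kids": []},
    ],
}

rendered = render_list(sample_list)
print("\n".join(rendered))
```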

Images

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `source` | string | No | Relative path to the image file |
| `data` | string | No | Base64 data URI (when image-output is "embedded") |
| `format` | string | No | Image format (png, jpeg) |

Headers and footers

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `type` | string | Yes | Either header or footer |
| `kids` | array | Yes | Content elements within the header or footer |

Text blocks

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `kids` | array | Yes | Text block children |

JSON Schema

The complete JSON Schema is available at schema.json in the repository root.
