Data Structures
Before the engine can run your VTL script, it needs to know what your data looks like. How many columns does each dataset have? Are they integers, strings, dates? Which columns uniquely identify a row, and which carry the values you’ll be transforming? That description — the data structure — is one of the three required inputs to every run.
The concept and the vocabulary both come from SDMX. Each component of a dataset (i.e. each column) has a name, a type, and a role:
Identifiers are the columns that together uniquely identify a row. In SDMX they’re called dimensions.
Measures carry your actual observations — the values you’ll be transforming.
Attributes are extra metadata attached to an observation, like quality flags or units.
Components can also be marked nullable (allowed to contain nulls). Identifiers, by definition, never are.
So how do you actually tell the engine about this structure? You have a few options, and you pick whichever matches the data you already have.
Option 1: VTL JSON
The engine’s native format is a small JSON document, formally specified
by the VTL JSON Schema
shipped with the engine. You can write it directly as a Python dict
and pass it in:
data_structures = {
"datasets": [
{
"name": "DS_1",
"DataStructure": [
{"name": "Id_1", "type": "Integer", "role": "Identifier", "nullable": False},
{"name": "Me_1", "type": "Number", "role": "Measure", "nullable": True}
]
}
]
}
The shape mirrors what we just described: a list of datasets, each with a name and a list of components. Each component has the four fields:
Field |
Type |
Description |
|---|---|---|
|
string |
Component name. Must match the column name in the datapoints. |
|
string |
VTL data type. See Data Types for the full list. |
|
string |
One of |
|
bool |
Whether the component allows null values. Identifiers must be non-nullable. |
If you have many datasets, or you’d rather keep your Python code tidy,
save the same JSON to a .json file and pass the Path instead.
Scalar inputs (used with run’s scalar_values argument) can be
declared alongside datasets:
{
"datasets": [ /* ... */ ],
"scalars": [
{"name": "Sc_1", "type": "Number"}
]
}
Option 2: SDMX structure files
If you’re already working with SDMX, you don’t need to convert anything.
Hand the engine an SDMX-ML (.xml) or SDMX-JSON (.json) structure
file as a Path, and the engine will read it via pysdmx, extract
each DataStructureDefinition, and translate the components to VTL’s
equivalent representation internally. For the full list of supported
SDMX formats and versions, see pysdmx’s
Formats and versions supported.
from pathlib import Path
from vtlengine import run
result = run(
script="DS_r <- TEST_DSD [calc Me_2 := OBS_VALUE * 2];",
data_structures=Path("path/to/structure.xml"),
datapoints=Path("path/to/data.csv"),
)
By default, the SDMX DSD’s (or Dataflow’s) ID becomes the VTL dataset
name. If your script refers to the dataset by a different name, you can
remap it using sdmx_mappings — see SDMX Inputs.
Option 3: pysdmx objects
If you’ve already loaded your SDMX structures into memory through
pysdmx — typically as Schema, DataStructureDefinition, or
Dataflow objects — you can hand them straight to the engine, no
file paths needed.
from pathlib import Path
from pysdmx.io import read_sdmx
from vtlengine import run
msg = read_sdmx(Path("path/to/structure.xml"))
dsds = msg.get_data_structure_definitions()
result = run(script=script, data_structures=dsds, datapoints=datapoints)
Behind the scenes the engine converts each pysdmx structure to VTL JSON,
using a public helper called to_vtl_json(). You can also call it
yourself when you need to inspect or tweak the result — see
Inside the conversion below.
A useful trick when the same Dataflow should appear in your script
under two different aliases (e.g. a previous-vs-current snapshot): call
to_vtl_json() twice with different dataset_name arguments. See
Sharing one Dataflow between two datasets in SDMX Inputs.
Mixing input formats
You don’t have to commit to a single format. If you have some structures in VTL JSON and others in SDMX, pass a list mixing all of them:
data_structures = [
{"datasets": [...]}, # VTL JSON dict
Path("structure.xml"), # SDMX-ML file
some_pysdmx_dataflow, # pysdmx object
]
The engine walks the list, processes each element with the right loader, and merges the resulting dataset definitions before execution.
Inside the conversion
This section is for the curious — you don’t need it to get a script running, but it’s useful to understand what’s happening when you hand the engine an SDMX structure.
Whenever you pass an SDMX structure (a file path or a pysdmx object),
the engine calls vtlengine.files.sdmx_handler.to_vtl_json() to
translate it into the VTL JSON shape shown in Option 1.
from vtlengine.files.sdmx_handler import to_vtl_json
vtl_structure = to_vtl_json(my_dsd)
# {"datasets": [{"name": "...", "DataStructure": [...]}]}
It accepts three pysdmx types — Schema, DataStructureDefinition,
Dataflow — plus an optional dataset_name to override the default
(which is the structure’s ID). For a Dataflow, the embedded DSD is
extracted automatically; if the Dataflow has no resolved DSD attached,
the function raises an InputValidationException.
The mapping itself is deliberately simple. Roles translate one-to-one:
SDMX role |
VTL role |
Nullable? |
|---|---|---|
|
|
|
|
|
|
|
|
|
Nullability falls out of the role: dimensions are never nullable; everything else is.
Types are collapsed from the long SDMX list onto VTL’s smaller set, following the official SDMX-to-VTL correspondence published in the SDMX 3.1 Technical Notes:
SDMX data type |
VTL type |
|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
See Data Types for the VTL type semantics and which casts are supported.
See also
VTL Scripts — accepted forms of the script argument.
Datapoints — how datapoints reference these structures by name.
SDMX Inputs — SDMX-specific patterns and
sdmx_mappings.