10 minutes to VTL Engine
Summarizes the main functions of the VTL Engine.
The VTL Engine API provides eight basic methods:
- semantic_analysis: Validates the correctness of a VTL script and computes the data structures of the datasets created within the script. Supports VTL JSON format, SDMX structure files, and pysdmx objects.
- run: Executes a VTL script using the provided input datapoints. Supports loading data structures from VTL JSON, SDMX structure files (.xml, .json), and pysdmx objects, and loading datapoints from plain CSV files and SDMX data files (SDMX-ML, SDMX-JSON, SDMX-CSV).
- run_sdmx: Ensures compatibility with pysdmx by running a VTL script using pysdmx PandasDataset objects. The engine uses the input datapoints while mapping the SDMX DataStructureDefinition to the VTL data structure. Internally calls run after converting the PandasDatasets.
- generate_sdmx: Ensures compatibility with pysdmx by generating a TransformationScheme object from a VTL script.
- prettify: Formats a VTL script to make it more readable.
- validate_datasets: Validates the input datapoints against the provided data structures.
- validate_value_domains: Validates input value domains using a JSON Schema.
- validate_external_routines: Validates external routines using both JSON Schema and SQLGlot.
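All of these can be imported directly from the vtlengine package, as the examples in this guide do. A quick orientation sketch (the generate_sdmx and validate_* imports are assumed to follow the same pattern as the functions actually shown below):

from vtlengine import (
    semantic_analysis,  # validate a script and derive output data structures
    run,                # execute a script against input datapoints
    run_sdmx,           # execute a script against pysdmx PandasDatasets
    generate_sdmx,      # assumed import: build a TransformationScheme from a script
    prettify,           # format a script for readability
)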
Any VTL action requires the following elements as input (a combined call sketch follows this list):
- VTL Script:
The VTL script to be executed. It includes the transformation scheme, as well as any User Defined Operators, Hierarchical Rulesets, and Datapoint Rulesets. It can be provided as a string or as a Path object pointing to a .vtl file.
- Data Structures:
Define the structure of the input artifacts used in the VTL script, according to the VTL Information Model. Data structures can be provided in multiple formats:
VTL JSON format: As dictionaries or paths to JSON files
SDMX structure files: Paths to SDMX-ML (.xml) or SDMX-JSON (.json) structure files
pysdmx objects: Schema, DataStructureDefinition, or Dataflow objects
A list of mixed formats can also be provided.
- External Routines:
The VTL Engine supports the use of SQL (ISO/IEC 9075) within the eval operator. External routines can be provided as a SQL string, a Path object pointing to a .sql file, or a list of such elements. The default value is None, which should be used if external routines are not applicable to the script. See Example 5.
- Value Domains:
Define the value domains referenced in the VTL script, usually with the in operator. They can be provided as dictionaries, as paths to JSON files, or as lists of either. The default value is None, which should be used if value domains are not applicable. See Example 5 for details.
- Scalar Values:
The VTL Engine supports scalar values as input for VTL scripts. These can be provided as a dictionary where keys are scalar names and values are the corresponding data. The default value is None, which should be used if scalar values are not required.
- Output Folder:
Specifies the directory where the results of the VTL script execution are saved. This is useful for scripts that generate datasets or scalar values. The output folder can be provided as a Path object.
- SDMX Mappings (optional):
When using SDMX files, you can provide a mapping between SDMX URNs and VTL dataset names. This can be a dictionary or a VtlDataflowMapping object from pysdmx. The mapping is useful when SDMX dataset names differ from VTL script dataset names.
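Taken together, these elements map onto the keyword arguments of run(). A minimal sketch (keyword names follow the examples in this guide; the output-folder keyword is shown as output_folder here and should be checked against the API reference):

from pathlib import Path
from vtlengine import run

result = run(
    script=Path("script.vtl"),                # VTL script as str or Path
    data_structures=Path("structures.json"),  # VTL JSON, SDMX files, or pysdmx objects
    datapoints={"DS_1": Path("data.csv")},    # dataset name -> DataFrame or file path
    value_domains=None,      # None when the script uses no value domains
    external_routines=None,  # None when the script uses no eval() routines
    scalar_values=None,      # None when the script needs no scalar inputs
    # output_folder=Path("out"),  # optional; assumed keyword name, see the API reference
)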
Semantic Analysis
The vtlengine.semantic_analysis() method validates the correctness of a VTL script and computes the data structures of the datasets generated by the script itself.
If the VTL script is correct, the method returns a dictionary containing the data structures of all datasets generated by the script.
If the VTL script is incorrect, a SemanticError is raised.
The data_structures parameter accepts multiple formats:
- VTL JSON format: dictionaries or paths to .json files
- SDMX structure files: paths to SDMX-ML (.xml) or SDMX-JSON (.json) files
- pysdmx objects: Schema, DataStructureDefinition, or Dataflow objects
- Mixed lists: any combination of the above formats
Example 1: Correct VTL
from vtlengine import semantic_analysis
script = """
DS_A <- DS_1 * 10;
"""
data_structures = {
    'datasets': [
        {
            'name': 'DS_1',
            'DataStructure': [
                {'name': 'Id_1', 'type': 'Integer', 'role': 'Identifier', 'nullable': False},
                {'name': 'Me_1', 'type': 'Number', 'role': 'Measure', 'nullable': True},
            ],
        }
    ]
}
sa_result = semantic_analysis(script=script, data_structures=data_structures)
print(sa_result)
Returns:
{'DS_A': Dataset(name='DS_A', components={'Id_1': Component(name='Id_1', data_type="Integer", role="Identifier", nullable=False), 'Me_1': Component(name='Me_1', data_type="Number", role="Measure", nullable=True)}, data=None)}
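The returned Dataset objects can be inspected programmatically; a small sketch based on the printed representation above (attribute names taken from that repr):

ds_a = sa_result["DS_A"]
for component in ds_a.components.values():
    # Prints each component's name, role, and nullability
    print(component.name, component.role, component.nullable)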
Example 2: Incorrect VTL
Compared to Example 1, the only difference is that Me_1 uses a String data type instead of Number.
from vtlengine import semantic_analysis
script = """
DS_A <- DS_1 * 10;
"""
data_structures = {
    'datasets': [
        {
            'name': 'DS_1',
            'DataStructure': [
                {'name': 'Id_1', 'type': 'Integer', 'role': 'Identifier', 'nullable': False},
                {'name': 'Me_1', 'type': 'String', 'role': 'Measure', 'nullable': True},
            ],
        }
    ]
}
sa_result = semantic_analysis(script=script, data_structures=data_structures)
print(sa_result)
Raises the following error:
vtlengine.Exceptions.SemanticError: ('Invalid implicit cast from String and Integer to Number.', '1-1-1-2')
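In application code, semantic failures can be handled explicitly; a minimal sketch using the exception class shown in the error above:

from vtlengine import semantic_analysis
from vtlengine.Exceptions import SemanticError

try:
    semantic_analysis(script=script, data_structures=data_structures)
except SemanticError as err:
    print(f"Invalid VTL script: {err}")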
Example 2b: Using SDMX Structures
The semantic_analysis function can also accept SDMX structure files or pysdmx objects:
from pathlib import Path
from vtlengine import semantic_analysis
script = """
DS_A <- DS_1 * 10;
"""
# Using an SDMX-ML structure file
sdmx_structure = Path("path/to/structure.xml")
sa_result = semantic_analysis(script=script, data_structures=sdmx_structure)
print(sa_result)
Using pysdmx objects directly:
from pathlib import Path
from pysdmx.io import read_sdmx
from vtlengine import semantic_analysis
script = """
DS_A <- DS_1 * 10;
"""
# Load structure using pysdmx
msg = read_sdmx(Path("path/to/structure.xml"))
dsds = msg.get_data_structure_definitions()
sa_result = semantic_analysis(script=script, data_structures=dsds)
print(sa_result)
Run VTL Scripts
The vtlengine.run() method executes a VTL script with the provided datapoints.
It returns a dictionary containing all generated datasets. If the output parameter is set, the engine writes the computation results to the specified folder; otherwise, the data is returned within the dictionary of computed datasets.
Two validations are performed before execution, either of which may raise an error:
- Semantic analysis: equivalent to running vtlengine.semantic_analysis().
- Data load analysis: performs a basic check of data structure names and types.
Example 3: Simple run
from vtlengine import run
import pandas as pd
script = """
DS_A <- DS_1 * 10;
"""
data_structures = {
    'datasets': [
        {
            'name': 'DS_1',
            'DataStructure': [
                {'name': 'Id_1', 'type': 'Integer', 'role': 'Identifier', 'nullable': False},
                {'name': 'Me_1', 'type': 'Number', 'role': 'Measure', 'nullable': True},
            ],
        }
    ]
}
data_df = pd.DataFrame({"Id_1": [1, 2, 3], "Me_1": [10, 20, 30]})
datapoints = {"DS_1": data_df}
run_result = run(script=script, data_structures=data_structures, datapoints=datapoints)
print(run_result["DS_A"].data)
| Id_1 | Me_1 |
|---|---|
| 1 | 100.0 |
| 2 | 200.0 |
| 3 | 300.0 |
Example 4: Run from SDMX Dataset
The vtlengine.run_sdmx() method executes a VTL script using SDMX files, via the get_datasets function from pysdmx.
It runs a VTL script with one or more PandasDataset instances from the pysdmx library.
This method prepares the required VTL data structures and datapoints, maps dataset structures to VTL identifiers, and delegates execution to the VTL Engine. It performs internal validation of dataset structures and input dependencies using DAG analysis.
For details on reading and writing SDMX datasets, see the pysdmx documentation.
The process works as follows:
1. Provide a VTL script (as a string, TransformationScheme, or file).
2. Supply one or more SDMX datasets as PandasDataset objects, including structural metadata (from the pysdmx Schema).
3. When using multiple datasets, provide a mapping linking each dataset to the corresponding name expected in the script. This can be a VtlDataflowMapping object or a dictionary pairing the short URN with the VTL dataset name.
4. The method validates all inputs, converts them into a VTL-compatible format, and executes the script.
The result is a new dataset (or several) generated according to the logic defined in the VTL script.
Important
The short URN is the meaningful part of the URN. Its format is: SDMX_type=Agency:ID(Version).
Example:
Dataflow=MD:TEST_DF(1.0) is the short URN for urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=MD:TEST_DF(1.0)
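If only the full URN is available, the short URN can be derived with plain string handling; an illustrative helper, not part of the vtlengine API:

import re

def short_urn(full_urn: str) -> str:
    # Keep the trailing "SDMX_type=Agency:ID(Version)" part of the URN.
    match = re.search(r"[A-Za-z]+=.+$", full_urn)
    return match.group(0) if match else full_urn

print(short_urn("urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=MD:TEST_DF(1.0)"))
# Dataflow=MD:TEST_DF(1.0)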
Optional settings are the same as in the run method (a combined sketch follows this list), including:
- Providing value domains for data validation.
- Using external routines as SQL statements.
- Controlling how time period columns are formatted in the output.
- Saving results to a specified output folder.
- Filtering output datasets to return only those marked as “persistent” in the VTL script.
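A hedged sketch of a call using these settings (the value_domains, external_routines, and return_only_persistent keywords are assumed to mirror run(); mappings is shown in the examples below):

run_result = run_sdmx(
    script,
    datasets,
    mappings=mapping,                     # SDMX short URN -> VTL dataset name
    value_domains=value_domains,          # assumed to mirror run()
    external_routines=external_routines,  # assumed to mirror run()
    return_only_persistent=True,          # keep only persistent results
)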
from pathlib import Path
from pysdmx.io import get_datasets
from vtlengine import run_sdmx
data = Path("Docs/_static/data.xml")
structure = Path("Docs/_static/metadata.xml")
datasets = get_datasets(data, structure)
script = "DS_r <- DS_1 [calc Me_4 := OBS_VALUE];"
print(run_sdmx(script, datasets)['DS_r'].data)
| DIM_1 | DIM_2 | OBS_VALUE | Me_4 |
|---|---|---|---|
| A | A | 100 | 100 |
| A | B | 102 | 102 |
| B | B | 14206.490766 | 14206.490766 |
| C | B | 10130 | 10130 |
| C | C | | |
| C | D | 29929.036014 | 29929.036014 |
As part of its compatibility with pysdmx, the function can also take a TransformationScheme object as input. If no mapping is provided, the VTL script must have a single input, and the data file must contain only one dataset.
from pathlib import Path
from pysdmx.io import get_datasets
from pysdmx.model.vtl import TransformationScheme, Transformation
from vtlengine import run_sdmx
data = Path("Docs/_static/data.xml")
structure = Path("Docs/_static/metadata.xml")
datasets = get_datasets(data, structure)
script = TransformationScheme(
id="TS1",
version="1.0",
agency="MD",
vtl_version="2.1",
items=[
Transformation(
id="T1",
uri=None,
urn=None,
name=None,
description=None,
expression="DS_1 [calc Me_4 := OBS_VALUE];",
is_persistent=True,
result="DS_r1",
annotations=(),
),
Transformation(
id="T2",
uri=None,
urn=None,
name=None,
description=None,
expression="DS_1 [rename OBS_VALUE to Me_5];",
is_persistent=True,
result="DS_r2",
annotations=(),
)
],
)
run_sdmx(script, datasets=datasets)
Finally, mapping information can be used to link an SDMX input dataset to a VTL input dataset via the VtlDataflowMapping object from pysdmx or a dictionary.
from pathlib import Path
from pysdmx.io import get_datasets
from pysdmx.model.vtl import TransformationScheme, Transformation
from pysdmx.model.vtl import VtlDataflowMapping
from vtlengine import run_sdmx
data = Path("Docs/_static/data.xml")
structure = Path("Docs/_static/metadata.xml")
datasets = get_datasets(data, structure)
script = TransformationScheme(
id="TS1",
version="1.0",
agency="MD",
vtl_version="2.1",
items=[
Transformation(
id="T1",
uri=None,
urn=None,
name=None,
description=None,
expression="DS_1 [calc Me_4 := OBS_VALUE]",
is_persistent=True,
result="DS_r",
annotations=(),
),
],
)
# Mapping using VtlDataflowMapping object:
mapping = VtlDataflowMapping(
dataflow="urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=MD:TEST_DF(1.0)",
dataflow_alias="DS_1",
id="VTL_MAP_1",
)
# Mapping using dictionary:
mapping = {
"Dataflow=MD:TEST_DF(1.0)": "DS_1"
}
run_sdmx(script, datasets, mappings=mapping)
Example 4b: Run with SDMX Files
The vtlengine.run() function can also load SDMX files directly, without using the run_sdmx function.
This provides a seamless workflow for SDMX data without requiring manual conversion to VTL JSON format.
Supported SDMX formats for data_structures:
- SDMX-ML structure files (.xml)
- SDMX-JSON structure files (.json)
- pysdmx objects (Schema, DataStructureDefinition, Dataflow)
Supported SDMX formats for datapoints:
- SDMX-ML data files (.xml)
- SDMX-JSON data files (.json)
- SDMX-CSV data files (.csv), with automatic detection
SDMX files are automatically detected by their extension. For CSV files, the engine first attempts to parse as SDMX-CSV, then falls back to plain CSV if SDMX parsing fails.
When using SDMX files, the dataset name in the structure file (from the DataStructureDefinition ID) may differ
from the name in the data file (from the Dataflow reference). Use the sdmx_mappings parameter to map
the data file’s URN to the VTL dataset name used in your script:
from pathlib import Path
from vtlengine import run
# Using SDMX structure and data files directly
structure_file = Path("path/to/structure.xml") # SDMX-ML structure
data_file = Path("path/to/data.xml") # SDMX-ML data
# Map the data file's Dataflow URN to the structure's DSD name
mapping = {"Dataflow=AGENCY:DATAFLOW_ID(1.0)": "DSD_NAME"}
script = "DS_r <- DSD_NAME [calc Me_2 := OBS_VALUE * 2];"
result = run(
script=script,
data_structures=structure_file,
datapoints=data_file,
sdmx_mappings=mapping
)
You can also use sdmx_mappings to give datasets custom names in your VTL script:
from pathlib import Path
from vtlengine import run
structure_file = Path("path/to/structure.xml")
data_file = Path("path/to/data.xml")
script = "DS_r <- MY_DATASET [calc Me_2 := OBS_VALUE * 2];"
# Map SDMX URN to VTL dataset name
mapping = {"Dataflow=MD:TEST_DF(1.0)": "MY_DATASET"}
result = run(
script=script,
data_structures=structure_file,
datapoints=data_file,
sdmx_mappings=mapping
)
You can also mix VTL JSON structures with SDMX structures and plain CSV datapoints with SDMX data files:
from pathlib import Path
from vtlengine import run
# Mix of VTL JSON and SDMX structures
vtl_structure = {"datasets": [{"name": "DS_1", "DataStructure": [...]}]}
sdmx_structure = Path("path/to/sdmx_structure.xml")
# Mix of plain CSV and SDMX data
datapoints = {
"DS_1": Path("path/to/plain_data.csv"), # Plain CSV
"DS_2": Path("path/to/sdmx_data.xml"), # SDMX-ML
}
script = "DS_r <- DS_1;"  # any VTL script that references the datasets above

result = run(
    script=script,
    data_structures=[vtl_structure, sdmx_structure],
    datapoints=datapoints
)
Example 5: Run with Value Domains and External Routines as dictionaries
import pandas as pd
from vtlengine import run

script = """
Example_5 <- DS_1 [calc Me_2 := Me_1 in Countries];
Example_5_2 <- eval(SQL_3(DS_1) language "sqlite"
    returns dataset {identifier<integer> Id_1, measure<number> Me_1});
"""

data_structures = {
    "datasets": [
        {
            "name": "DS_1",
            "DataStructure": [
                {"name": "Id_1", "type": "Integer", "role": "Identifier", "nullable": False},
                {"name": "Id_2", "type": "String", "role": "Identifier", "nullable": False},
                {"name": "Me_1", "type": "Number", "role": "Measure", "nullable": True},
            ],
        }
    ]
}

data_df = pd.DataFrame(
    {"Id_1": [2012, 2012, 2012], "Id_2": ["AT", "DE", "FR"], "Me_1": [0, 4, 9]}
)
datapoints = {"DS_1": data_df}

external_routines = {
    "name": "SQL_3",
    "query": "SELECT Id_1, COUNT(*) AS Me_1 FROM DS_1 GROUP BY Id_1;",
}
value_domains = {"name": "Countries", "setlist": ["DE", "FR", "IT"], "type": "String"}

run_result = run(
    script=script,
    data_structures=data_structures,
    datapoints=datapoints,
    value_domains=value_domains,
    external_routines=external_routines,
)
print(run_result)
Returns:
| Id_1 | Id_2 | Me_1 | Me_2 |
|---|---|---|---|
| 2012 | AT | 0.0 | False |
| 2012 | DE | 4.0 | False |
| 2012 | FR | 9.0 | False |

| Id_1 | Me_1 |
|---|---|
| 2012 | 3.0 |
Example 6: Run using Paths
The following example shows how to run a VTL script by providing all inputs as Path objects. In this case, datapoints can be passed as a dictionary where the dataset name is the key and the corresponding file path is the value. Here, DS_1 is the dictionary key that matches the dataset defined in the data structures file, but the input data file itself can have any name.
from pathlib import Path
from vtlengine import run

filepath_external_routines = Path("Docs/_static/SQL_4.json")
filepath_value_domains = Path("Docs/_static/VD_2.json")
filepath_vtl_script = Path("Docs/_static/Example_6.vtl")
filepath_data_structures = Path("Docs/_static/Example_6.json")
filepath_data = Path("Docs/_static/Example_6_input.csv")

script = filepath_vtl_script
data_structures = filepath_data_structures
datapoints = {"DS_1": filepath_data}
external_routines = filepath_external_routines
value_domains = filepath_value_domains

run_result = run(
    script=script,
    data_structures=data_structures,
    datapoints=datapoints,
    value_domains=value_domains,
    external_routines=external_routines,
)
print(run_result)
Returns:
| Id_1 | Id_2 | Me_1 | Me_2 |
|---|---|---|---|
| 2012 | AT | 0.0 | False |
| 2012 | DE | 4.0 | False |
| 2012 | FR | 9.0 | False |

| Id_1 | Me_1 |
|---|---|
| 2012 | 0.0 |
| 2012 | 4.0 |
| 2012 | 9.0 |
Run with Scalar Values
The VTL Engine also supports scalar values as input to a VTL script. When an output path is provided, the engine writes the results of the script execution as CSV files; scalar results are saved as a single CSV file containing all resulting scalar values.
Example 7: Run with Scalar Values
from vtlengine import run
import pandas as pd
script = """
DS_r <- DS_1[filter Me_1 = Sc_1];
Sc_r <- Sc_1 + 10;
Sc_r2 <- Sc_r * 2;
Sc_r3 <- null;
"""
data_structures = {
"datasets": [
{
"name": "DS_1",
"DataStructure": [
{"name": "Id_1", "type": "Integer", "role": "Identifier", "nullable": False},
{"name": "Me_1", "type": "Number", "role": "Measure", "nullable": True},
],
}
],
"scalars": [
{
"name": "Sc_1",
"type": "Number",
}
],
}
data_df = pd.DataFrame({"Id_1": [1, 2, 3], "Me_1": [10, 20, 30]})
datapoints = {"DS_1": data_df}
scalars = {"Sc_1": 20}
run_result = run(
script=script,
data_structures=data_structures,
datapoints=datapoints,
scalar_values=scalars,
return_only_persistent=True
)
print(run_result)
Returns:
| Id_1 | Me_1 |
|---|---|
| 2 | 20.0 |

| Name | Value |
|---|---|
| Sc_r | 30 |
| Sc_r2 | 60 |
| Sc_r3 | |
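To write these results to disk instead of (or in addition to) receiving them in memory, pass an output path; a minimal sketch (output_folder is the assumed keyword name; datasets are written as CSV files and scalar results as a single name/value CSV, as described above):

from pathlib import Path

run(
    script=script,
    data_structures=data_structures,
    datapoints=datapoints,
    scalar_values=scalars,
    output_folder=Path("output"),  # assumed keyword; see the API reference
)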
Prettify
The vtlengine.prettify() method formats a VTL script to improve readability.
from vtlengine import prettify
script = """
define hierarchical ruleset accountingEntry (variable rule ACCOUNTING_ENTRY) is
B = C - D errorcode "Balance (credit-debit)" errorlevel 4;
N = A - L errorcode "Net (assets-liabilities)" errorlevel 4
end hierarchical ruleset;
DS_r <- check_hierarchy(BOP, accountingEntry rule ACCOUNTING_ENTRY dataset);
"""
prettified_script = prettify(script)
print(prettified_script)
Returns:
define hierarchical ruleset accountingEntry(variable rule ACCOUNTING_ENTRY) is
B = C - D
errorcode "Balance (credit-debit)"
errorlevel 4;
N = A - L
errorcode "Net (assets-liabilities)"
errorlevel 4
end hierarchical ruleset;
DS_r <-
check_hierarchy(
BOP,
accountingEntry,
rule ACCOUNTING_ENTRY);
For more information on usage, please refer to the API documentation.