10 minutes to VTL Engine
Summarizes the main functions of the VTL Engine.
The VTL Engine API provides eight basic methods:
- semantic_analysis: Validates the correctness of a VTL script and computes the data structures of the datasets created within the script.
- run: Executes a VTL script using the provided input datapoints.
- run_sdmx: Ensures compatibility with pysdmx by running a VTL script using the pysdmx PandasDataset. The VTL Engine uses the input datapoints while mapping the SDMX DataStructureDefinition to the VTL data structure.
- generate_sdmx: Ensures compatibility with pysdmx by generating a TransformationScheme object from a VTL script.
- prettify: Formats a VTL script to make it more readable.
- validate_datasets: Validates the input datapoints against the provided data structures.
- validate_value_domains: Validates input value domains using a JSON Schema.
- validate_external_routines: Validates external routines using both JSON Schema and SQLGlot.
Any VTL action requires the following elements as input:
- VTL Script:
The VTL script to be executed. It includes the transformation scheme, as well as any User Defined Operators, Hierarchical Rulesets, and Datapoint Rulesets. It can be provided as a string or as a Path object pointing to a .vtl file.
- Data Structures:
Define the structure of the input artifacts used in the VTL script, according to the VTL Information Model. As the current version of the VTL standard does not prescribe a standard format for this information, the VTL Engine uses a JSON-based format, which can be found here. Data structures can be provided as dictionaries or as Path objects pointing to JSON files.
- External Routines:
The VTL Engine supports the use of SQL (ISO/IEC 9075) within the eval operator. External routines can be provided as an SQL string, a Path object pointing to a .sql file, or a list of such elements. The default value is None, which should be used if external routines are not applicable to the script. See Example 5.
- Value Domains:
Define the value domains referenced in the VTL script, usually with an in operator. They can be provided as dictionaries, as Path objects pointing to JSON files, or as lists of such elements. The default value is None, which should be used if value domains are not applicable. See Example 5 for details.
- Scalar Values:
The VTL Engine supports scalar values as input for VTL scripts. These can be provided as a dictionary where keys are scalar names and values are the corresponding data. The default value is None, which should be used if scalar values are not required.
- Output Folder:
Specifies the directory where the results of the VTL script execution are saved. This is useful for scripts that generate datasets or scalar values. The output folder can be provided as a Path object.
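The JSON-based data-structure format mentioned above is the same one used in every example on this page. As a quick orientation, it can be assembled as a plain Python dictionary; the following is a sketch with illustrative dataset and component names, mirroring the examples later in this page:

```python
import json

# A minimal data-structure definition in the JSON-based format used by the
# VTL Engine (the dataset and component names here are illustrative).
data_structures = {
    "datasets": [
        {
            "name": "DS_1",
            "DataStructure": [
                # Identifiers uniquely address a datapoint and cannot be null.
                {"name": "Id_1", "type": "Integer", "role": "Identifier", "nullable": False},
                # Measures carry the observed values and may be null.
                {"name": "Me_1", "type": "Number", "role": "Measure", "nullable": True},
            ],
        }
    ]
}

# Because the format is plain JSON, the same structure can be written to a
# .json file and passed to the engine as a Path instead of a dictionary.
roundtrip = json.loads(json.dumps(data_structures))
assert roundtrip == data_structures
```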
Semantic Analysis
The vtlengine.semantic_analysis() method validates the correctness of a VTL script and computes the data structures of
the datasets generated by the script, based on the data structures provided for its input datasets.
If the VTL script is correct, the method returns a dictionary containing the data structures of all datasets generated by the script.
If the VTL script is incorrect, a SemanticError is raised.
Example 1: Correct VTL
from vtlengine import semantic_analysis

script = """
DS_A <- DS_1 * 10;
"""

data_structures = {
    'datasets': [
        {
            'name': 'DS_1',
            'DataStructure': [
                {'name': 'Id_1', 'type': 'Integer', 'role': 'Identifier', 'nullable': False},
                {'name': 'Me_1', 'type': 'Number', 'role': 'Measure', 'nullable': True}
            ]
        }
    ]
}

sa_result = semantic_analysis(script=script, data_structures=data_structures)
print(sa_result)
Returns:
{'DS_A': Dataset(name='DS_A', components={'Id_1': Component(name='Id_1', data_type="Integer", role="Identifier", nullable=False), 'Me_1': Component(name='Me_1', data_type="Number", role="Measure", nullable=True)}, data=None)}
Example 2: Incorrect VTL
Compared to Example 1, the only difference is that Me_1 uses a String data type instead of Number.
from vtlengine import semantic_analysis

script = """
DS_A <- DS_1 * 10;
"""

data_structures = {
    'datasets': [
        {
            'name': 'DS_1',
            'DataStructure': [
                {'name': 'Id_1', 'type': 'Integer', 'role': 'Identifier', 'nullable': False},
                {'name': 'Me_1', 'type': 'String', 'role': 'Measure', 'nullable': True}
            ]
        }
    ]
}

sa_result = semantic_analysis(script=script, data_structures=data_structures)
print(sa_result)
Raises the following error:

vtlengine.Exceptions.SemanticError: ('Invalid implicit cast from String and Integer to Number.', '1-1-1-2')
Run VTL Scripts
The vtlengine.run() method executes a VTL script with the provided datapoints.
It returns a dictionary containing all generated datasets. If the output parameter is set, the engine writes the computation results to the specified folder; otherwise, the data is returned within the dictionary of computed datasets.
Two validations are performed before execution, either of which may raise an error:
- Semantic analysis: equivalent to running vtlengine.semantic_analysis().
- Data load analysis: performs a basic check of data structure names and types.
Example 3: Simple run
from vtlengine import run
import pandas as pd

script = """
DS_A <- DS_1 * 10;
"""

data_structures = {
    'datasets': [
        {
            'name': 'DS_1',
            'DataStructure': [
                {'name': 'Id_1', 'type': 'Integer', 'role': 'Identifier', 'nullable': False},
                {'name': 'Me_1', 'type': 'Number', 'role': 'Measure', 'nullable': True}
            ]
        }
    ]
}

data_df = pd.DataFrame(
    {"Id_1": [1, 2, 3],
     "Me_1": [10, 20, 30]})

datapoints = {"DS_1": data_df}

run_result = run(script=script, data_structures=data_structures,
                 datapoints=datapoints)
print(run_result["DS_A"].data)
print(run_result["DS_A"].data)
| Id_1 | Me_1 |
|---|---|
| 1 | 100.0 |
| 2 | 200.0 |
| 3 | 300.0 |
Example 4: Run from SDMX Dataset
The vtlengine.run_sdmx() method executes a VTL script using SDMX files, via the get_datasets function from pysdmx.
It runs a VTL script with one or more PandasDataset instances from the pysdmx library.
This method prepares the required VTL data structures and datapoints, maps dataset structures to VTL identifiers, and delegates execution to the VTL Engine. It performs internal validation of dataset structures and input dependencies using DAG analysis.
For details on reading and writing SDMX datasets, see the pysdmx documentation.
The process works as follows:
1. Provide a VTL script (as a string, a TransformationScheme, or a file).
2. Supply one or more SDMX datasets as PandasDataset objects, including structural metadata (from the pysdmx Schema).
3. When using multiple datasets, provide a mapping linking each dataset to the corresponding name expected in the script. This can be a VtlDataflowMapping object or a dictionary pairing the short-URN with the VTL dataset name.

The method validates all inputs, converts them into a VTL-compatible format, and executes the script. The result is one or more new datasets generated according to the logic defined in the VTL script.
Important
The short-URN is the meaningful part of the URN. Its format is: SDMX_type=Agency:ID(Version).
Example:
Dataflow=MD:TEST_DF(1.0) is the short-URN for urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=MD:TEST_DF(1.0)
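Given that format, a short-URN can be recovered from a full URN with a small string helper. The function below is an illustrative sketch, not part of the vtlengine or pysdmx API:

```python
import re

def to_short_urn(urn: str) -> str:
    """Extract the short-URN (SDMX_type=Agency:ID(Version)) from a full URN.

    Illustrative helper, not part of the vtlengine or pysdmx API.
    """
    # The short-URN starts at the artefact type: the word immediately
    # preceding '=', followed by everything up to the end of the string.
    match = re.search(r"(\w+=[^=]+)$", urn)
    if match is None:
        raise ValueError(f"Not a valid SDMX URN: {urn!r}")
    return match.group(1)

full = "urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=MD:TEST_DF(1.0)"
print(to_short_urn(full))  # Dataflow=MD:TEST_DF(1.0)
```

Passing an already-short URN through the helper returns it unchanged, so it can be applied uniformly when building a mapping dictionary.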
Optional settings are the same as in the run method, including:
- Providing value domains for data validation.
- Using external routines as SQL statements.
- Controlling how time period columns are formatted in the output.
- Saving results to a specified output folder.
- Filtering output datasets to return only those marked as persistent in the VTL script.
from pathlib import Path
from pysdmx.io import get_datasets
from vtlengine import run_sdmx
data = Path("Docs/_static/data.xml")
structure = Path("Docs/_static/metadata.xml")
datasets = get_datasets(data, structure)
script = "DS_r <- DS_1 [calc Me_4 := OBS_VALUE];"
print(run_sdmx(script, datasets)['DS_r'].data)
| DIM_1 | DIM_2 | OBS_VALUE | Me_4 |
|---|---|---|---|
| A | A | 100 | 100 |
| A | B | 102 | 102 |
| B | B | 14206.490766 | 14206.490766 |
| C | B | 10130 | 10130 |
| C | C | | |
| C | D | 29929.036014 | 29929.036014 |
As part of its compatibility with pysdmx, the function can also take a TransformationScheme object as input. If no mapping is provided, the VTL script must have a single input, and the data file must contain only one dataset.
from pathlib import Path
from pysdmx.io import get_datasets
from pysdmx.model.vtl import TransformationScheme, Transformation
from vtlengine import run_sdmx
data = Path("Docs/_static/data.xml")
structure = Path("Docs/_static/metadata.xml")
datasets = get_datasets(data, structure)
script = TransformationScheme(
id="TS1",
version="1.0",
agency="MD",
vtl_version="2.1",
items=[
Transformation(
id="T1",
uri=None,
urn=None,
name=None,
description=None,
expression="DS_1 [calc Me_4 := OBS_VALUE];",
is_persistent=True,
result="DS_r1",
annotations=(),
),
Transformation(
id="T2",
uri=None,
urn=None,
name=None,
description=None,
expression="DS_1 [rename OBS_VALUE to Me_5];",
is_persistent=True,
result="DS_r2",
annotations=(),
)
],
)
run_sdmx(script, datasets=datasets)
Finally, mapping information can be used to link an SDMX input dataset to a VTL input dataset via the VtlDataflowMapping object from pysdmx or a dictionary.
from pathlib import Path
from pysdmx.io import get_datasets
from pysdmx.model.vtl import TransformationScheme, Transformation, VtlDataflowMapping
from vtlengine import run_sdmx
data = Path("Docs/_static/data.xml")
structure = Path("Docs/_static/metadata.xml")
datasets = get_datasets(data, structure)
script = TransformationScheme(
id="TS1",
version="1.0",
agency="MD",
vtl_version="2.1",
items=[
Transformation(
id="T1",
uri=None,
urn=None,
name=None,
description=None,
expression="DS_1 [calc Me_4 := OBS_VALUE]",
is_persistent=True,
result="DS_r",
annotations=(),
),
],
)
# Mapping using VTLDataflowMapping object:
mapping = VtlDataflowMapping(
dataflow="urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=MD:TEST_DF(1.0)",
dataflow_alias="DS_1",
id="VTL_MAP_1",
)
# Mapping using dictionary:
mapping = {
"Dataflow=MD:TEST_DF(1.0)": "DS_1"
}
run_sdmx(script, datasets, mapping=mapping)
Files used in the example can be found here:
Example 5: Run with multiple Value Domains and External Routines as dictionaries.
import pandas as pd

from vtlengine import run


def main():
    script = """
    Example_5 <- DS_1 [calc Me_2 := Me_1 in Countries];
    Example_5_2 <- eval(SQL_3(DS_1) language "sqlite" returns dataset { identifier<integer> Id_1,
    measure<number> Me_1});
    """
    data_structures = {
        "datasets": [
            {
                "name": "DS_1",
                "DataStructure": [
                    {"name": "Id_1", "type": "Integer", "role": "Identifier", "nullable": False},
                    {"name": "Id_2", "type": "String", "role": "Identifier", "nullable": False},
                    {"name": "Me_1", "type": "Number", "role": "Measure", "nullable": True},
                ],
            }
        ]
    }
    data_df = pd.DataFrame(
        {"Id_1": [2012, 2012, 2012], "Id_2": ["AT", "DE", "FR"], "Me_1": [0, 4, 9]}
    )
    datapoints = {"DS_1": data_df}
    external_routines = {
        "name": "SQL_3",
        "query": "SELECT Id_1, COUNT(*) AS Me_1 FROM DS_1 GROUP BY Id_1;",
    }
    value_domains = {"name": "Countries", "setlist": ["DE", "FR", "IT"], "type": "String"}
    run_result = run(
        script=script,
        data_structures=data_structures,
        datapoints=datapoints,
        value_domains=value_domains,
        external_routines=external_routines,
    )
    print(run_result)


if __name__ == "__main__":
    main()
Returns:
Example_5:

| Id_1 | Id_2 | Me_1 | Me_2 |
|---|---|---|---|
| 2012 | AT | 0.0 | False |
| 2012 | DE | 4.0 | False |
| 2012 | FR | 9.0 | False |

Example_5_2:

| Id_1 | Me_1 |
|---|---|
| 2012 | 3.0 |
Example 6: Run using Paths
The following example shows how to run a VTL script by providing all inputs as Path objects. In this case, datapoints can be passed as a dictionary where the dataset name is the key and the corresponding file path is the value. Here, DS_1 is the dictionary key that matches the dataset defined in the data structures file, but the input data file itself can have any name.
from pathlib import Path

from vtlengine import run


def main():
    filepath_external_routines = Path("Docs/_static/SQL_4.json")
    filepath_value_domains = Path("Docs/_static/VD_2.json")
    filepath_vtl_script = Path("Docs/_static/Example_6.vtl")
    filepath_data_structures = Path("Docs/_static/Example_6.json")
    filepath_data = Path("Docs/_static/Example_6_input.csv")

    script = filepath_vtl_script
    data_structures = filepath_data_structures
    datapoints = {"DS_1": filepath_data}
    external_routines = filepath_external_routines
    value_domains = filepath_value_domains

    run_result = run(
        script=script,
        data_structures=data_structures,
        datapoints=datapoints,
        value_domains=value_domains,
        external_routines=external_routines,
    )
    print(run_result)


if __name__ == "__main__":
    main()
Returns:
| Id_1 | Id_2 | Me_1 | Me_2 |
|---|---|---|---|
| 2012 | AT | 0.0 | False |
| 2012 | DE | 4.0 | False |
| 2012 | FR | 9.0 | False |

| Id_1 | Me_1 |
|---|---|
| 2012 | 0.0 |
| 2012 | 4.0 |
| 2012 | 9.0 |
Run with Scalar Values
The VTL Engine also supports scalar values as input to a VTL script. When an output path is provided, the engine writes the results of the script execution as CSV files; scalar results are saved as CSV files containing the resulting scalar value.
Example 7: Run with Scalar Values
from vtlengine import run
import pandas as pd
script = """
DS_r <- DS_1[filter Me_1 = Sc_1];
Sc_r <- Sc_1 + 10;
"""
data_structures = {
    "datasets": [
        {
            "name": "DS_1",
            "DataStructure": [
                {"name": "Id_1", "type": "Integer", "role": "Identifier", "nullable": False},
                {"name": "Me_1", "type": "Number", "role": "Measure", "nullable": True},
            ],
        }
    ],
    "scalars": [
        {
            "name": "Sc_1",
            "type": "Number",
        }
    ],
}
data_df = pd.DataFrame({"Id_1": [1, 2, 3], "Me_1": [10, 20, 30]})
datapoints = {"DS_1": data_df}
scalars = {"Sc_1": 20}
run_result = run(
    script=script,
    data_structures=data_structures,
    datapoints=datapoints,
    scalar_values=scalars,
    return_only_persistent=True,
)
print(run_result["DS_r"].data)
print(run_result["Sc_r"].value)
Returns:
DS_r:

| Id_1 | Me_1 |
|---|---|
| 2 | 20.0 |

Sc_r:

30
Prettify
The vtlengine.prettify() method formats a VTL script to improve readability.
from vtlengine import prettify
script = """
define hierarchical ruleset accountingEntry (variable rule ACCOUNTING_ENTRY) is
B = C - D errorcode "Balance (credit-debit)" errorlevel 4;
N = A - L errorcode "Net (assets-liabilities)" errorlevel 4
end hierarchical ruleset;
DS_r <- check_hierarchy(BOP, accountingEntry rule ACCOUNTING_ENTRY dataset);
"""
prettified_script = prettify(script)
print(prettified_script)
Returns:
define hierarchical ruleset accountingEntry(variable rule ACCOUNTING_ENTRY) is
B = C - D
errorcode "Balance (credit-debit)"
errorlevel 4;
N = A - L
errorcode "Net (assets-liabilities)"
errorlevel 4
end hierarchical ruleset;
DS_r <-
check_hierarchy(
BOP,
accountingEntry,
rule ACCOUNTING_ENTRY);
For more information on usage, please refer to the API documentation.