10 minutes to VTL Engine
Summarizes the main functions of the VTL Engine.
The VTL Engine API provides eight basic methods:
- semantic_analysis: Validates the correctness of a VTL script and computes the data structures of the datasets created within the script. Supports VTL JSON format, SDMX structure files, and pysdmx objects.
- run: Executes a VTL script using the provided input datapoints. Supports loading data structures from VTL JSON, SDMX structure files (.xml, .json), and pysdmx objects, and loading datapoints from plain CSV files and SDMX data files (SDMX-ML, SDMX-JSON, SDMX-CSV).
- run_sdmx: Ensures compatibility with pysdmx by running a VTL script using pysdmx PandasDataset objects. The engine uses the input datapoints while mapping the SDMX DataStructureDefinition to the VTL data structure. Internally calls run after converting the PandasDatasets.
- generate_sdmx: Ensures compatibility with pysdmx by generating a TransformationScheme object from a VTL script.
- prettify: Formats a VTL script to make it more readable.
- validate_datasets: Validates the input datapoints against the provided data structures.
- validate_value_domains: Validates input value domains using a JSON Schema.
- validate_external_routines: Validates external routines using both JSON Schema and SQLGlot.
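All of these can be imported directly from the vtlengine package, as the examples in this guide do. A quick orientation sketch (the generate_sdmx and validate_* imports are assumed to follow the same pattern as the functions actually shown below):

from vtlengine import (
    semantic_analysis,  # validate a script and derive output data structures
    run,                # execute a script against input datapoints
    run_sdmx,           # execute a script against pysdmx PandasDatasets
    generate_sdmx,      # assumed import: build a TransformationScheme from a script
    prettify,           # format a script for readability
)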
Any VTL action requires the following elements as input (a combined call sketch follows this list):
- VTL Script:
The VTL script to be executed. It includes the transformation scheme, as well as any User Defined Operators, Hierarchical Rulesets, and Datapoint Rulesets. It can be provided as a string or as a Path object pointing to a .vtl file.
- Data Structures:
Define the structure of the input artifacts used in the VTL script, according to the VTL Information Model. Data structures can be provided in multiple formats:
VTL JSON format: As dictionaries or paths to JSON files
SDMX structure files: Paths to SDMX-ML (.xml) or SDMX-JSON (.json) structure files
pysdmx objects: Schema, DataStructureDefinition, or Dataflow objects
A list of mixed formats can also be provided.
- External Routines:
The VTL Engine supports the use of SQL (ISO/IEC 9075) within the eval operator. External routines can be provided as a SQL string, a Path object pointing to a .sql file, or a list of such elements. The default value is None, which should be used if external routines are not applicable to the script. See Example 5.
- Value Domains:
Define the value domains referenced in the VTL script, usually with the in operator. They can be provided as dictionaries, as paths to JSON files, or as lists of either. The default value is None, which should be used if value domains are not applicable. See Example 5 for details.
- Scalar Values:
The VTL Engine supports scalar values as input for VTL scripts. These can be provided as a dictionary where keys are scalar names and values are the corresponding data. The default value is None, which should be used if scalar values are not required.
- Output Folder:
Specifies the directory where the results of the VTL script execution are saved. This is useful for scripts that generate datasets or scalar values. The output folder can be provided as a Path object.
- SDMX Mappings (optional):
When using SDMX files, you can provide a mapping between SDMX URNs and VTL dataset names. This can be a dictionary or a VtlDataflowMapping object from pysdmx. The mapping is useful when SDMX dataset names differ from VTL script dataset names.
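Taken together, these elements map onto the keyword arguments of run(). A minimal sketch (keyword names follow the examples in this guide; the output-folder keyword is shown as output_folder here and should be checked against the API reference):

from pathlib import Path
from vtlengine import run

result = run(
    script=Path("script.vtl"),                # VTL script as str or Path
    data_structures=Path("structures.json"),  # VTL JSON, SDMX files, or pysdmx objects
    datapoints={"DS_1": Path("data.csv")},    # dataset name -> DataFrame or file path
    value_domains=None,      # None when the script uses no value domains
    external_routines=None,  # None when the script uses no eval() routines
    scalar_values=None,      # None when the script needs no scalar inputs
    # output_folder=Path("out"),  # optional; assumed keyword name, see the API reference
)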
Semantic Analysis
The vtlengine.semantic_analysis() method validates the correctness of a VTL script and computes the data structures of the datasets generated by the script itself.
If the VTL script is correct, the method returns a dictionary containing the data structures of all datasets generated by the script.
If the VTL script is incorrect, a SemanticError is raised.
The data_structures parameter accepts multiple formats:
- VTL JSON format: dictionaries or paths to .json files
- SDMX structure files: paths to SDMX-ML (.xml) or SDMX-JSON (.json) files
- pysdmx objects: Schema, DataStructureDefinition, or Dataflow objects
- Mixed lists: any combination of the above formats
Example 1: Correct VTL
from vtlengine import semantic_analysis
script = """
DS_A <- DS_1 * 10;
"""
data_structures = {
    'datasets': [
        {
            'name': 'DS_1',
            'DataStructure': [
                {'name': 'Id_1', 'type': 'Integer', 'role': 'Identifier', 'nullable': False},
                {'name': 'Me_1', 'type': 'Number', 'role': 'Measure', 'nullable': True},
            ],
        }
    ]
}
sa_result = semantic_analysis(script=script, data_structures=data_structures)
print(sa_result)
Returns:
{'DS_A': Dataset(name='DS_A', components={'Id_1': Component(name='Id_1', data_type="Integer", role="Identifier", nullable=False), 'Me_1': Component(name='Me_1', data_type="Number", role="Measure", nullable=True)}, data=None)}
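The returned Dataset objects can be inspected programmatically; a small sketch based on the printed representation above (attribute names taken from that repr):

ds_a = sa_result["DS_A"]
for component in ds_a.components.values():
    # Prints each component's name, role, and nullability
    print(component.name, component.role, component.nullable)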
Example 2: Incorrect VTL
Compared to Example 1, the only difference is that Me_1 uses a String data type instead of Number.
from vtlengine import semantic_analysis
script = """
DS_A <- DS_1 * 10;
"""
data_structures = {
    'datasets': [
        {
            'name': 'DS_1',
            'DataStructure': [
                {'name': 'Id_1', 'type': 'Integer', 'role': 'Identifier', 'nullable': False},
                {'name': 'Me_1', 'type': 'String', 'role': 'Measure', 'nullable': True},
            ],
        }
    ]
}
sa_result = semantic_analysis(script=script, data_structures=data_structures)
print(sa_result)
Raises the following error:
vtlengine.Exceptions.SemanticError: ('Invalid implicit cast from String and Integer to Number.', '1-1-1-2')
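In application code, semantic failures can be handled explicitly; a minimal sketch using the exception class shown in the error above:

from vtlengine import semantic_analysis
from vtlengine.Exceptions import SemanticError

try:
    semantic_analysis(script=script, data_structures=data_structures)
except SemanticError as err:
    print(f"Invalid VTL script: {err}")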
Example 2b: Using SDMX Structures
The semantic_analysis function can also accept SDMX structure files or pysdmx objects:
from pathlib import Path
from vtlengine import semantic_analysis
script = """
DS_A <- DS_1 * 10;
"""
# Using an SDMX-ML structure file
sdmx_structure = Path("path/to/structure.xml")
sa_result = semantic_analysis(script=script, data_structures=sdmx_structure)
print(sa_result)
Using pysdmx objects directly:
from pathlib import Path
from pysdmx.io import read_sdmx
from vtlengine import semantic_analysis
script = """
DS_A <- DS_1 * 10;
"""
# Load structure using pysdmx
msg = read_sdmx(Path("path/to/structure.xml"))
dsds = msg.get_data_structure_definitions()
sa_result = semantic_analysis(script=script, data_structures=dsds)
print(sa_result)
Run VTL Scripts
The vtlengine.run() method executes a VTL script with the provided datapoints.
It returns a dictionary containing all generated datasets. If the output parameter is set, the engine writes the computation results to the specified folder; otherwise, the data is returned within the dictionary of computed datasets.
Two validations are performed before execution, either of which may raise an error:
- Semantic analysis: equivalent to running vtlengine.semantic_analysis().
- Data load analysis: performs a basic check of data structure names and types.
Example 3: Simple run
from vtlengine import run
import pandas as pd
script = """
DS_A <- DS_1 * 10;
"""
data_structures = {
    'datasets': [
        {
            'name': 'DS_1',
            'DataStructure': [
                {'name': 'Id_1', 'type': 'Integer', 'role': 'Identifier', 'nullable': False},
                {'name': 'Me_1', 'type': 'Number', 'role': 'Measure', 'nullable': True},
            ],
        }
    ]
}
data_df = pd.DataFrame({"Id_1": [1, 2, 3], "Me_1": [10, 20, 30]})
datapoints = {"DS_1": data_df}
run_result = run(script=script, data_structures=data_structures, datapoints=datapoints)
print(run_result["DS_A"].data)
| Id_1 | Me_1 |
|---|---|
| 1 | 100.0 |
| 2 | 200.0 |
| 3 | 300.0 |
Example 4: Run from SDMX Dataset
The vtlengine.run_sdmx() method executes a VTL script using SDMX files, via the get_datasets function from pysdmx.
It runs a VTL script with one or more PandasDataset instances from the pysdmx library.
This method prepares the required VTL data structures and datapoints, maps dataset structures to VTL identifiers, and delegates execution to the VTL Engine. It performs internal validation of dataset structures and input dependencies using DAG analysis.
For details on reading and writing SDMX datasets, see the pysdmx documentation.
The process works as follows:
1. Provide a VTL script (as a string, TransformationScheme, or file).
2. Supply one or more SDMX datasets as PandasDataset objects, including structural metadata (from the pysdmx Schema).
3. When using multiple datasets, provide a mapping linking each dataset to the corresponding name expected in the script. This can be a VtlDataflowMapping object or a dictionary pairing the short URN with the VTL dataset name.
4. The method validates all inputs, converts them into a VTL-compatible format, and executes the script.
The result is a new dataset (or several) generated according to the logic defined in the VTL script.
Important
The short URN is the meaningful part of the URN. Its format is: SDMX_type=Agency:ID(Version).
Example:
Dataflow=MD:TEST_DF(1.0) is the short URN for urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=MD:TEST_DF(1.0)
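If only the full URN is available, the short URN can be derived with plain string handling; an illustrative helper, not part of the vtlengine API:

import re

def short_urn(full_urn: str) -> str:
    # Keep the trailing "SDMX_type=Agency:ID(Version)" part of the URN.
    match = re.search(r"[A-Za-z]+=.+$", full_urn)
    return match.group(0) if match else full_urn

print(short_urn("urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=MD:TEST_DF(1.0)"))
# Dataflow=MD:TEST_DF(1.0)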
Optional settings are the same as in the run method (a combined sketch follows this list), including:
- Providing value domains for data validation.
- Using external routines as SQL statements.
- Controlling how time period columns are formatted in the output.
- Saving results to a specified output folder.
- Filtering output datasets to return only those marked as “persistent” in the VTL script.
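A hedged sketch of a call using these settings (the value_domains, external_routines, and return_only_persistent keywords are assumed to mirror run(); mappings is shown in the examples below):

run_result = run_sdmx(
    script,
    datasets,
    mappings=mapping,                     # SDMX short URN -> VTL dataset name
    value_domains=value_domains,          # assumed to mirror run()
    external_routines=external_routines,  # assumed to mirror run()
    return_only_persistent=True,          # keep only persistent results
)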
from pathlib import Path
from pysdmx.io import get_datasets
from vtlengine import run_sdmx
data = Path("Docs/_static/data.xml")
structure = Path("Docs/_static/metadata.xml")
datasets = get_datasets(data, structure)
script = "DS_r <- DS_1 [calc Me_4 := OBS_VALUE];"
print(run_sdmx(script, datasets)['DS_r'].data)
| DIM_1 | DIM_2 | OBS_VALUE | Me_4 |
|---|---|---|---|
| A | A | 100 | 100 |
| A | B | 102 | 102 |
| B | B | 14206.490766 | 14206.490766 |
| C | B | 10130 | 10130 |
| C | C | | |
| C | D | 29929.036014 | 29929.036014 |
As part of its compatibility with pysdmx, the function can also take a TransformationScheme object as input. If no mapping is provided, the VTL script must have a single input, and the data file must contain only one dataset.
from pathlib import Path
from pysdmx.io import get_datasets
from pysdmx.model.vtl import TransformationScheme, Transformation
from vtlengine import run_sdmx
data = Path("Docs/_static/data.xml")
structure = Path("Docs/_static/metadata.xml")
datasets = get_datasets(data, structure)
script = TransformationScheme(
id="TS1",
version="1.0",
agency="MD",
vtl_version="2.1",
items=[
Transformation(
id="T1",
uri=None,
urn=None,
name=None,
description=None,
expression="DS_1 [calc Me_4 := OBS_VALUE];",
is_persistent=True,
result="DS_r1",
annotations=(),
),
Transformation(
id="T2",
uri=None,
urn=None,
name=None,
description=None,
expression="DS_1 [rename OBS_VALUE to Me_5];",
is_persistent=True,
result="DS_r2",
annotations=(),
)
],
)
run_sdmx(script, datasets=datasets)
Finally, mapping information can be used to link an SDMX input dataset to a VTL input dataset via the VtlDataflowMapping object from pysdmx or a dictionary.
from pathlib import Path
from pysdmx.io import get_datasets
from pysdmx.model.vtl import TransformationScheme, Transformation
from pysdmx.model.vtl import VtlDataflowMapping
from vtlengine import run_sdmx
data = Path("Docs/_static/data.xml")
structure = Path("Docs/_static/metadata.xml")
datasets = get_datasets(data, structure)
script = TransformationScheme(
id="TS1",
version="1.0",
agency="MD",
vtl_version="2.1",
items=[
Transformation(
id="T1",
uri=None,
urn=None,
name=None,
description=None,
expression="DS_1 [calc Me_4 := OBS_VALUE]",
is_persistent=True,
result="DS_r",
annotations=(),
),
],
)
# Mapping using VtlDataflowMapping object:
mapping = VtlDataflowMapping(
dataflow="urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=MD:TEST_DF(1.0)",
dataflow_alias="DS_1",
id="VTL_MAP_1",
)
# Mapping using dictionary:
mapping = {
"Dataflow=MD:TEST_DF(1.0)": "DS_1"
}
run_sdmx(script, datasets, mappings=mapping)
Example 4b: Run with SDMX Files
The vtlengine.run() function can also load SDMX files directly, without using the run_sdmx function.
This provides a seamless workflow for SDMX data without requiring manual conversion to VTL JSON format.
Supported SDMX formats for data_structures:
- SDMX-ML structure files (.xml)
- SDMX-JSON structure files (.json)
- pysdmx objects (Schema, DataStructureDefinition, Dataflow)
Supported SDMX formats for datapoints:
- SDMX-ML data files (.xml)
- SDMX-JSON data files (.json)
- SDMX-CSV data files (.csv), with automatic detection
SDMX files are automatically detected by their extension. For CSV files, the engine first attempts to parse as SDMX-CSV, then falls back to plain CSV if SDMX parsing fails.
When using SDMX files, the dataset name in the structure file (from the DataStructureDefinition ID) may differ
from the name in the data file (from the Dataflow reference). Use the sdmx_mappings parameter to map
the data file’s URN to the VTL dataset name used in your script:
from pathlib import Path
from vtlengine import run
# Using SDMX structure and data files directly
structure_file = Path("path/to/structure.xml") # SDMX-ML structure
data_file = Path("path/to/data.xml") # SDMX-ML data
# Map the data file's Dataflow URN to the structure's DSD name
mapping = {"Dataflow=AGENCY:DATAFLOW_ID(1.0)": "DSD_NAME"}
script = "DS_r <- DSD_NAME [calc Me_2 := OBS_VALUE * 2];"
result = run(
script=script,
data_structures=structure_file,
datapoints=data_file,
sdmx_mappings=mapping
)
You can also use sdmx_mappings to give datasets custom names in your VTL script:
from pathlib import Path
from vtlengine import run
structure_file = Path("path/to/structure.xml")
data_file = Path("path/to/data.xml")
script = "DS_r <- MY_DATASET [calc Me_2 := OBS_VALUE * 2];"
# Map SDMX URN to VTL dataset name
mapping = {"Dataflow=MD:TEST_DF(1.0)": "MY_DATASET"}
result = run(
script=script,
data_structures=structure_file,
datapoints=data_file,
sdmx_mappings=mapping
)
You can also mix VTL JSON structures with SDMX structures and plain CSV datapoints with SDMX data files:
from pathlib import Path
from vtlengine import run
# Mix of VTL JSON and SDMX structures
vtl_structure = {"datasets": [{"name": "DS_1", "DataStructure": [...]}]}
sdmx_structure = Path("path/to/sdmx_structure.xml")
# Mix of plain CSV and SDMX data
datapoints = {
"DS_1": Path("path/to/plain_data.csv"), # Plain CSV
"DS_2": Path("path/to/sdmx_data.xml"), # SDMX-ML
}
script = "DS_r <- DS_1;"  # any VTL script that references the datasets above

result = run(
    script=script,
    data_structures=[vtl_structure, sdmx_structure],
    datapoints=datapoints
)
Example 5: Run with Value Domains and External Routines as dictionaries
import pandas as pd
from vtlengine import run

script = """
Example_5 <- DS_1 [calc Me_2 := Me_1 in Countries];
Example_5_2 <- eval(SQL_3(DS_1) language "sqlite"
    returns dataset {identifier<integer> Id_1, measure<number> Me_1});
"""

data_structures = {
    "datasets": [
        {
            "name": "DS_1",
            "DataStructure": [
                {"name": "Id_1", "type": "Integer", "role": "Identifier", "nullable": False},
                {"name": "Id_2", "type": "String", "role": "Identifier", "nullable": False},
                {"name": "Me_1", "type": "Number", "role": "Measure", "nullable": True},
            ],
        }
    ]
}

data_df = pd.DataFrame(
    {"Id_1": [2012, 2012, 2012], "Id_2": ["AT", "DE", "FR"], "Me_1": [0, 4, 9]}
)
datapoints = {"DS_1": data_df}

external_routines = {
    "name": "SQL_3",
    "query": "SELECT Id_1, COUNT(*) AS Me_1 FROM DS_1 GROUP BY Id_1;",
}
value_domains = {"name": "Countries", "setlist": ["DE", "FR", "IT"], "type": "String"}

run_result = run(
    script=script,
    data_structures=data_structures,
    datapoints=datapoints,
    value_domains=value_domains,
    external_routines=external_routines,
)
print(run_result)
Returns:
| Id_1 | Id_2 | Me_1 | Me_2 |
|---|---|---|---|
| 2012 | AT | 0.0 | False |
| 2012 | DE | 4.0 | False |
| 2012 | FR | 9.0 | False |

| Id_1 | Me_1 |
|---|---|
| 2012 | 3.0 |
Example 6: Run using Paths
The following example shows how to run a VTL script by providing all inputs as Path objects. In this case, datapoints can be passed as a dictionary where the dataset name is the key and the corresponding file path is the value. Here, DS_1 is the dictionary key that matches the dataset defined in the data structures file, but the input data file itself can have any name.
from pathlib import Path
from vtlengine import run

filepath_external_routines = Path("Docs/_static/SQL_4.json")
filepath_value_domains = Path("Docs/_static/VD_2.json")
filepath_vtl_script = Path("Docs/_static/Example_6.vtl")
filepath_data_structures = Path("Docs/_static/Example_6.json")
filepath_data = Path("Docs/_static/Example_6_input.csv")

script = filepath_vtl_script
data_structures = filepath_data_structures
datapoints = {"DS_1": filepath_data}
external_routines = filepath_external_routines
value_domains = filepath_value_domains

run_result = run(
    script=script,
    data_structures=data_structures,
    datapoints=datapoints,
    value_domains=value_domains,
    external_routines=external_routines,
)
print(run_result)
Returns:
| Id_1 | Id_2 | Me_1 | Me_2 |
|---|---|---|---|
| 2012 | AT | 0.0 | False |
| 2012 | DE | 4.0 | False |
| 2012 | FR | 9.0 | False |

| Id_1 | Me_1 |
|---|---|
| 2012 | 0.0 |
| 2012 | 4.0 |
| 2012 | 9.0 |
Run with Scalar Values
The VTL Engine also supports scalar values as input to a VTL script. When an output path is provided, the engine writes the results of the script execution as CSV files; scalar results are saved as a single CSV file containing all resulting scalar values.
Example 7: Run with Scalar Values
from vtlengine import run
import pandas as pd
script = """
DS_r <- DS_1[filter Me_1 = Sc_1];
Sc_r <- Sc_1 + 10;
Sc_r2 <- Sc_r * 2;
Sc_r3 <- null;
"""
data_structures = {
"datasets": [
{
"name": "DS_1",
"DataStructure": [
{"name": "Id_1", "type": "Integer", "role": "Identifier", "nullable": False},
{"name": "Me_1", "type": "Number", "role": "Measure", "nullable": True},
],
}
],
"scalars": [
{
"name": "Sc_1",
"type": "Number",
}
],
}
data_df = pd.DataFrame({"Id_1": [1, 2, 3], "Me_1": [10, 20, 30]})
datapoints = {"DS_1": data_df}
scalars = {"Sc_1": 20}
run_result = run(
script=script,
data_structures=data_structures,
datapoints=datapoints,
scalar_values=scalars,
return_only_persistent=True
)
print(run_result)
Returns:
| Id_1 | Me_1 |
|---|---|
| 2 | 20.0 |

| Name | Value |
|---|---|
| Sc_r | 30 |
| Sc_r2 | 60 |
| Sc_r3 | |
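To write these results to disk instead of (or in addition to) receiving them in memory, pass an output path; a minimal sketch (output_folder is the assumed keyword name; datasets are written as CSV files and scalar results as a single name/value CSV, as described above):

from pathlib import Path

run(
    script=script,
    data_structures=data_structures,
    datapoints=datapoints,
    scalar_values=scalars,
    output_folder=Path("output"),  # assumed keyword; see the API reference
)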
Prettify
The vtlengine.prettify() method formats a VTL script to improve readability.
from vtlengine import prettify
script = """
define hierarchical ruleset accountingEntry (variable rule ACCOUNTING_ENTRY) is
B = C - D errorcode "Balance (credit-debit)" errorlevel 4;
N = A - L errorcode "Net (assets-liabilities)" errorlevel 4
end hierarchical ruleset;
DS_r <- check_hierarchy(BOP, accountingEntry rule ACCOUNTING_ENTRY dataset);
"""
prettified_script = prettify(script)
print(prettified_script)
Returns:
define hierarchical ruleset accountingEntry(variable rule ACCOUNTING_ENTRY) is
B = C - D
errorcode "Balance (credit-debit)"
errorlevel 4;
N = A - L
errorcode "Net (assets-liabilities)"
errorlevel 4
end hierarchical ruleset;
DS_r <-
check_hierarchy(
BOP,
accountingEntry,
rule ACCOUNTING_ENTRY);
For more information on usage, please refer to the API documentation.