10 minutes to VTL Engine

Summarizes the main functions of the VTL Engine.

The VTL Engine API provides eight basic methods:

  • Semantic Analysis: Validates the correctness of a VTL script and computes the data structures of the datasets created within the script. Supports VTL JSON format, SDMX structure files, and pysdmx objects.

  • Run: Executes a VTL script using the provided input datapoints. Supports loading data structures from VTL JSON, SDMX structure files (.xml, .json), and pysdmx objects. Also supports loading datapoints from plain CSV files and SDMX data files (SDMX-ML, SDMX-JSON, SDMX-CSV).

  • Run_sdmx: Ensures compatibility with pysdmx by running a VTL script using the pysdmx PandasDataset. The VTL engine uses the input datapoints while mapping the SDMX DataStructureDefinition to the VTL datastructure. Internally uses the run function after converting PandasDatasets.

  • Generate_sdmx: Ensures compatibility with pysdmx by generating a TransformationScheme object from a VTL script.

  • Prettify: Formats a VTL script to make it more readable.

  • validate_datasets: Validates the input datapoints against the provided data structures.

  • validate_value_domains: Validates input value domains using a JSON Schema.

  • validate_external_routines: Validates external routines using both JSON Schema and SQLGlot.

Any VTL action requires the following elements as input:

  • VTL Script:

    The VTL script to be executed. It includes the transformation scheme, as well as any User Defined Operators, Hierarchical Rulesets, and Datapoint Rulesets. It can be provided as a string or as a Path object pointing to a .vtl file.

  • Data Structures:

    Define the structure of the input artifacts used in the VTL script, according to the VTL Information Model. Data structures can be provided in multiple formats:

    • VTL JSON format: As dictionaries or paths to JSON files

    • SDMX structure files: Paths to SDMX-ML (.xml) or SDMX-JSON (.json) structure files

    • pysdmx objects: Schema, DataStructureDefinition, or Dataflow objects

    A list of mixed formats can also be provided.

  • External Routines:

    The VTL Engine supports the use of SQL (ISO/IEC 9075) within the eval operator. External routines can be provided as a dictionary, a Path object pointing to a .json file or directory, or a list of such elements. The default value is None, which should be used if external routines are not applicable to the script. See example 5 for an example.

  • Value Domains:

    Define the value domains referenced in the VTL script, usually with an in operator. They can be provided as dictionaries, as paths, or as lists of JSON files. The default value is None, which should be used if value domains are not applicable. See example 5 for details.

  • Scalar Values:

    The VTL Engine supports scalar values as input for VTL scripts. These can be provided as a dictionary where keys are scalar names and values are the corresponding data. The default value is None, which should be used if scalar values are not required.

  • Output Folder:

    Specifies the directory where the results of the VTL script execution are saved. This is useful for scripts that generate datasets or scalar values. The output folder can be provided as a Path object.

  • SDMX Mappings (optional):

    When using SDMX files, you can provide a mapping between SDMX URNs and VTL dataset names. This can be a dictionary or a VtlDataflowMapping object from pysdmx. The mapping is useful when SDMX dataset names differ from VTL script dataset names.

Semantic Analysis

The vtlengine.semantic_analysis() method validates the correctness of a VTL script and computes the data structures of the datasets generated by the script itself (a prerequisite for semantic analysis).

  • If the VTL script is correct, the method returns a dictionary containing the data structures of all datasets generated by the script.

  • If the VTL script is incorrect, a SemanticError is raised.

The data_structures parameter accepts multiple formats:

  • VTL JSON format: Dictionaries or paths to .json files

  • SDMX structure files: Paths to SDMX-ML (.xml) or SDMX-JSON (.json) files

  • pysdmx objects: Schema, DataStructureDefinition, or Dataflow objects

  • Mixed lists: Any combination of the above formats

Example 1: Correct VTL

from vtlengine import semantic_analysis

script = """
    DS_A <- DS_1 * 10;
"""

data_structures = {
    'datasets': [
        {'name': 'DS_1',
         'DataStructure': [
             {'name': 'Id_1',
              'type':
                  'Integer',
              'role': 'Identifier',
              'nullable': False},
             {'name': 'Me_1',
              'type': 'Number',
              'role': 'Measure',
              'nullable': True}
         ]
         }
    ]
}

sa_result = semantic_analysis(script=script, data_structures=data_structures)

print(sa_result)

Returns:

{'DS_A': Dataset(name='DS_A', components={'Id_1': Component(name='Id_1', data_type="Integer", role="Identifier", nullable=False), 'Me_1': Component(name='Me_1', data_type="Number", role="Measure", nullable=True)}, data=None)}

Example 2: Incorrect VTL

Compared to Example 1, the only difference is that Me_1 uses a String data type instead of Number.

from vtlengine import semantic_analysis

script = """
    DS_A <- DS_1 * 10;
"""

data_structures = {
    'datasets': [
        {'name': 'DS_1',
         'DataStructure': [
             {'name': 'Id_1',
              'type':
                  'Integer',
              'role': 'Identifier',
              'nullable': False},
             {'name': 'Me_1',
              'type': 'String',
              'role': 'Measure',
              'nullable': True}
         ]
         }
    ]
}

sa_result = semantic_analysis(script=script, data_structures=data_structures)

print(sa_result)

Raises the following Error:

raise SemanticError(code="1-1-1-2",
vtlengine.Exceptions.SemanticError: ('Invalid implicit cast from String and Integer to Number.', '1-1-1-2')

Example 2b: Using SDMX Structures

The semantic_analysis function can also accept SDMX structure files or pysdmx objects:

from pathlib import Path

from vtlengine import semantic_analysis

script = """
    DS_A <- DS_1 * 10;
"""

# Using an SDMX-ML structure file
sdmx_structure = Path("path/to/structure.xml")

sa_result = semantic_analysis(script=script, data_structures=sdmx_structure)

print(sa_result)

Using pysdmx objects directly:

from pathlib import Path

from pysdmx.io import read_sdmx

from vtlengine import semantic_analysis

script = """
    DS_A <- DS_1 * 10;
"""

# Load structure using pysdmx
msg = read_sdmx(Path("path/to/structure.xml"))
dsds = msg.get_data_structure_definitions()

sa_result = semantic_analysis(script=script, data_structures=dsds)

print(sa_result)

Run VTL Scripts

The vtlengine.run() method executes a VTL script with the provided datapoints.

It returns a dictionary containing all generated datasets. If the output parameter is set, the engine writes the computation results to the specified folder; otherwise, the data is returned within the dictionary of computed datasets.

Two validations are performed before execution, either of which may raise an error:

  • Semantic analysis: Equivalent to running vtlengine.semantic_analysis().

  • Data load analysis: Performs a basic check of data structure names and types.

Example 3: Simple run

from vtlengine import run
import pandas as pd

script = """
    DS_A <- DS_1 * 10;
"""

data_structures = {
    'datasets': [
        {'name': 'DS_1',
         'DataStructure': [
             {'name': 'Id_1',
              'type':
                  'Integer',
              'role': 'Identifier',
              'nullable': False},
             {'name': 'Me_1',
              'type': 'Number',
              'role': 'Measure',
              'nullable': True}
         ]
         }
    ]
}

data_df = pd.DataFrame(
    {"Id_1": [1, 2, 3],
     "Me_1": [10, 20, 30]})

datapoints = {"DS_1": data_df}

run_result = run(script=script, data_structures=data_structures,
                 datapoints=datapoints)

print(run_result["DS_A"].data)
Returns:

Id_1

Me_1

1

100.0

2

200.0

3

300.0

Example 3b: Run using the DuckDB engine

The vtlengine.run() method can transpile VTL to SQL and execute it on DuckDB by passing use_duckdb=True. The DuckDB engine is recommended for large datasets and is required for S3 URI support. The recommended pattern is to provide an output_folder so each result dataset is streamed straight to a CSV file rather than materialised in memory. See DuckDB Engine for details.

from pathlib import Path

import pandas as pd

from vtlengine import run

script = "DS_A <- DS_1 * 10;"

data_structures = {
    "datasets": [
        {"name": "DS_1",
         "DataStructure": [
             {"name": "Id_1", "type": "Integer", "role": "Identifier", "nullable": False},
             {"name": "Me_1", "type": "Number",  "role": "Measure",    "nullable": True},
         ]}
    ]
}

datapoints = {"DS_1": pd.DataFrame({"Id_1": [1, 2, 3], "Me_1": [10, 20, 30]})}
output_folder = Path("./vtl-output")

run_result = run(
    script=script,
    data_structures=data_structures,
    datapoints=datapoints,
    output_folder=output_folder,
    use_duckdb=True,
)

# With output_folder set, results are written as CSV files; the returned
# Dataset objects carry no in-memory data.
print(run_result["DS_A"])
print(sorted(p.name for p in output_folder.iterdir()))

The output_folder will contain DS_A.csv with the same rows shown in Example 3. Drop output_folder to receive the result as an in-memory pandas.DataFrame instead.

Warning

Running on large datasets without an output_folder forces the DuckDB engine to materialise every result fully into memory as a pandas.DataFrame. This negates most of the throughput and memory-headroom advantages of the backend and can drop performance significantly. If any individual output dataset is larger than available memory, the run will raise an out-of-memory error, since pandas requires the complete object to be materialised in memory. For anything beyond small/exploratory inputs, set output_folder.

Example 4: Run from SDMX Dataset

The vtlengine.run_sdmx() method executes a VTL script using SDMX files, via the get_datasets function from pysdmx. It runs a VTL script with one or more PandasDataset instances from the pysdmx library.

This method prepares the required VTL data structures and datapoints, maps dataset structures to VTL identifiers, and delegates execution to the VTL Engine. It performs internal validation of dataset structures and input dependencies using DAG analysis.

For details on reading and writing SDMX datasets, see the pysdmx documentation.

The process works as follows:

  • Provide a VTL script (as a string, TransformationScheme, or file).

  • Supply one or more SDMX datasets as PandasDataset objects including structural metadata (from the pysdmx Schema).

  • When using multiple datasets, provide a mapping linking each dataset to the corresponding name expected in the script. This can be a VTLDataflowMapping object or a dictionary pairing the short-URN with the VTL dataset name.

  • The method validates all inputs, converts them into a VTL-compatible format, and executes the script.

  • The result is a new dataset (or several) generated according to the logic defined in the VTL script.

Important

The short-urn is the meaningful part of the URN. The format is: SDMX_type=Agency:ID(Version).

Example:

Dataflow=MD:TEST_DF(1.0) is the short-urn for urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=MD:TEST_DF(1.0)

Optional settings are the same as in the run method, including:

  • Providing value domains for data validation.

  • Using external routines as SQL statements.

  • Controlling how time period columns are formatted in the output.

  • Saving results to a specified output folder.

  • Filtering output datasets to return only those marked as “persistent” in the VTL script.

from pathlib import Path

from pysdmx.io import get_datasets

from vtlengine import run_sdmx

data = Path("docs/_static/data.xml")
structure = Path("docs/_static/metadata.xml")
datasets = get_datasets(data, structure)
script = "DS_r <- DS_1 [calc Me_4 := OBS_VALUE];"
print(run_sdmx(script, datasets)['DS_r'].data)
Returns:

DIM_1

DIM_2

OBS_VALUE

Me_4

A

A

100

100

A

B

102

102

B

B

14206.490766

14206.490766

C

B

10130

10130

C

C

C

D

29929.036014

29929.036014

As part of its compatibility with pysdmx, the function can also take a TransformationScheme object as input. If no mapping is provided, the VTL script must have a single input, and the data file must contain only one dataset.

from pathlib import Path

from pysdmx.io import get_datasets
from pysdmx.model.vtl import TransformationScheme, Transformation

from vtlengine import run_sdmx

data = Path("docs/_static/data.xml")
structure = Path("docs/_static/metadata.xml")
datasets = get_datasets(data, structure)
script = TransformationScheme(
    id="TS1",
    version="1.0",
    agency="MD",
    vtl_version="2.1",
    items=[
        Transformation(
            id="T1",
            uri=None,
            urn=None,
            name=None,
            description=None,
            expression="DS_1 [calc Me_4 := OBS_VALUE];",
            is_persistent=True,
            result="DS_r1",
            annotations=(),
        ),
        Transformation(
            id="T2",
            uri=None,
            urn=None,
            name=None,
            description=None,
            expression="DS_1 [rename OBS_VALUE to Me_5];",
            is_persistent=True,
            result="DS_r2",
            annotations=(),
        )
    ],
)
run_sdmx(script, datasets=datasets)

Finally, mapping information can be used to link an SDMX input dataset to a VTL input dataset via the VtlDataflowMapping object from pysdmx or a dictionary.

from pathlib import Path

from pysdmx.io import get_datasets
from pysdmx.model.vtl import TransformationScheme, Transformation
from pysdmx.model.vtl import VtlDataflowMapping

from vtlengine import run_sdmx

data = Path("docs/_static/data.xml")
structure = Path("docs/_static/metadata.xml")
datasets = get_datasets(data, structure)
script = TransformationScheme(
    id="TS1",
    version="1.0",
    agency="MD",
    vtl_version="2.1",
    items=[
        Transformation(
            id="T1",
            uri=None,
            urn=None,
            name=None,
            description=None,
            expression="DS_1 [calc Me_4 := OBS_VALUE]",
            is_persistent=True,
            result="DS_r",
            annotations=(),
        ),
    ],
)
# Mapping using VtlDataflowMapping object:
mapping = VtlDataflowMapping(
        dataflow="urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=MD:TEST_DF(1.0)",
        dataflow_alias="DS_1",
        id="VTL_MAP_1",
    )

# Mapping using dictionary:
mapping = {
"Dataflow=MD:TEST_DF(1.0)": "DS_1"
}
run_sdmx(script, datasets, mappings=mapping)

Files used in the example can be found here:

Example 4b: Run with SDMX Files

The vtlengine.run() function can also load SDMX files directly, without using the run_sdmx function. This provides a seamless workflow for SDMX data without requiring manual conversion to VTL JSON format.

Supported SDMX formats for data_structures:

  • SDMX-ML structure files (.xml)

  • SDMX-JSON structure files (.json)

  • pysdmx objects (Schema, DataStructureDefinition, Dataflow)

Supported SDMX formats for datapoints:

  • SDMX-ML data files (.xml)

  • SDMX-JSON data files (.json)

  • SDMX-CSV data files (.csv) - with automatic detection

SDMX files are automatically detected by their extension. For CSV files, the engine first attempts to parse as SDMX-CSV, then falls back to plain CSV if SDMX parsing fails.

When using SDMX files, the dataset name in the structure file (from the DataStructureDefinition ID) may differ from the name in the data file (from the Dataflow reference). Use the sdmx_mappings parameter to map the data file’s URN to the VTL dataset name used in your script:

from pathlib import Path

from vtlengine import run

# Using SDMX structure and data files directly
structure_file = Path("path/to/structure.xml")  # SDMX-ML structure
data_file = Path("path/to/data.xml")            # SDMX-ML data

# Map the data file's Dataflow URN to the structure's DSD name
mapping = {"Dataflow=AGENCY:DATAFLOW_ID(1.0)": "DSD_NAME"}

script = "DS_r <- DSD_NAME [calc Me_2 := OBS_VALUE * 2];"

result = run(
    script=script,
    data_structures=structure_file,
    datapoints=data_file,
    sdmx_mappings=mapping
)

You can also use sdmx_mappings to give datasets custom names in your VTL script:

from pathlib import Path

from vtlengine import run

structure_file = Path("path/to/structure.xml")
data_file = Path("path/to/data.xml")

script = "DS_r <- MY_DATASET [calc Me_2 := OBS_VALUE * 2];"

# Map SDMX URN to VTL dataset name
mapping = {"Dataflow=MD:TEST_DF(1.0)": "MY_DATASET"}

result = run(
    script=script,
    data_structures=structure_file,
    datapoints=data_file,
    sdmx_mappings=mapping
)

You can also mix VTL JSON structures with SDMX structures and plain CSV datapoints with SDMX data files:

from pathlib import Path

from vtlengine import run

# Mix of VTL JSON and SDMX structures
vtl_structure = {"datasets": [{"name": "DS_1", "DataStructure": [...]}]}
sdmx_structure = Path("path/to/sdmx_structure.xml")

# Mix of plain CSV and SDMX data
datapoints = {
    "DS_1": Path("path/to/plain_data.csv"),          # Plain CSV
    "DS_2": Path("path/to/sdmx_data.xml"),           # SDMX-ML
}

result = run(
    script=script,
    data_structures=[vtl_structure, sdmx_structure],
    datapoints=datapoints
)

Example 4c: Comparing two datasets sharing one SDMX Dataflow

A common SDMX pattern is having two datasets that share a single Dataflow (and therefore one DataStructureDefinition) but contain different data — for example, two reporting periods or a previous-vs-current snapshot. The same Dataflow object can be passed to to_vtl_json twice with different dataset_name arguments to bind it to two VTL aliases without cloning the structure.

import pandas as pd
from pysdmx.model.concept import Concept, DataType
from pysdmx.model.dataflow import (
    Component, Components, Dataflow, DataStructureDefinition, Role,
)

from vtlengine import run
from vtlengine.files.sdmx_handler import to_vtl_json


def build_components() -> Components:
    return Components([
        Component(id="Id_1", required=True, role=Role.DIMENSION,
                  concept=Concept(id="Id_1", dtype=DataType.INTEGER)),
        Component(id="Me_1", required=False, role=Role.MEASURE,
                  concept=Concept(id="Me_1", dtype=DataType.FLOAT)),
    ])


dataflow = Dataflow(
    id="DF_1", agency="ME", version="1.0",
    structure=DataStructureDefinition(
        id="DSD_1", agency="ME", version="1.0",
        components=build_components(),
    ),
)

data_structures = [
    to_vtl_json(dataflow, dataset_name="DS_1"),
    to_vtl_json(dataflow, dataset_name="DS_2"),
]

datapoints = {
    "DS_1": pd.DataFrame({"Id_1": [1, 2, 3], "Me_1": [10.0, 20.0, 30.0]}),
    "DS_2": pd.DataFrame({"Id_1": [1, 2, 3], "Me_1": [10.0, 25.0, 30.0]}),
}

script = """
    DS_diff  <- DS_2 - DS_1;
    DS_equal <- DS_1 = DS_2;
"""

result = run(script=script, data_structures=data_structures,
             datapoints=datapoints, return_only_persistent=False)

Expected output for DS_diff:

Id_1  Me_1
   1   0.0
   2   5.0
   3   0.0

Expected output for DS_equal:

Id_1  bool_var
   1      True
   2     False
   3      True

Example 5: Run with multiple Value Domains and External Routines as dictionaries.

from pathlib import Path

import pandas as pd

from vtlengine import run

def main():
    script = """
                Example_5 <- DS_1 [ calc Me_2:= Id_2 in Countries];
                Example_5_2 <- eval(SQL_3(DS_1) language "SQL" returns dataset { identifier<integer> Id_1,
                measure<number> Me_1});
            """

    data_structures = {
        "datasets": [
            {
                "name": "DS_1",
                "DataStructure": [
                    {"name": "Id_1", "type": "Integer", "role": "Identifier", "nullable": False},
                    {"name": "Id_2", "type": "String", "role": "Identifier", "nullable": False},
                    {"name": "Me_1", "type": "Number", "role": "Measure", "nullable": True},
                ],
            }
        ]
    }

    data_df = pd.DataFrame(
        {"Id_1": [2012, 2012, 2012], "Id_2": ["AT", "DE", "FR"], "Me_1": [0, 4, 9]}
    )

    datapoints = {"DS_1": data_df}

    external_routines = {
        "name": "SQL_3",
        "query": "SELECT Id_1, COUNT(*) AS Me_1 FROM DS_1 GROUP BY Id_1;",
    }


    value_domains = {
        "name": "Countries",
        "setlist": ["DE", "FR", "IT"],
        "type": "String",
    }

    run_result = run(
        script=script,
        data_structures=data_structures,
        datapoints=datapoints,
        value_domains=value_domains,
        external_routines=external_routines,
    )
    print(run_result)

Returns:

Id_1

Id_2

Me_1

Me_2

2012

AT

0.0

False

2012

DE

4.0

True

2012

FR

9.0

True

Id_1

Me_1

2012

3.0

Example 6: Run using Paths

The following example shows how to run a VTL script by providing all inputs as Path objects. In this case, datapoints can be passed as a dictionary where the dataset name is the key and the corresponding file path is the value. Here, DS_1 is the dictionary key that matches the dataset defined in the data structures file, but the input data file itself can have any name.

from pathlib import Path

import pandas as pd

from vtlengine import run

def main():
    filepath_external_routines = Path("docs/_static/SQL_4.json")
    filepath_ValueDomains = Path("docs/_static/VD_2.json")
    filepath_vtl_script = Path("docs/_static/Example_6.vtl")
    filepath_data_structures = Path("docs/_static/Example_6.json")
    filepath_data = Path("docs/_static/Example_6_input.csv")

    data_structures = filepath_data_structures
    datapoints = {"DS_1": filepath_data}
    script = filepath_vtl_script
    external_routines = filepath_external_routines
    value_domains = filepath_ValueDomains
    run_result = run(
        script=script,
        data_structures=data_structures,
        datapoints=datapoints,
        value_domains=value_domains,
        external_routines=external_routines,
    )
    print(run_result)

Returns:

Id_1

Id_2

Me_1

Me_2

2012

AT

0.0

True

2012

DE

4.0

False

2012

FR

9.0

False

Id_1

Me_1

2012

0.0

2012

4.0

2012

9.0

Run with Scalar Values

The VTL Engine also supports the use of scalar values to be used as input within the VTL script. When an output path is provided, the engine generates CSV files containing the results of the script execution. Scalar results are saved as a CSV file containing all resulting scalar values.

Example 7: Run with Scalar Values

from vtlengine import run
import pandas as pd

script = """
    DS_r <- DS_1[filter Me_1 = Sc_1];
    Sc_r <- Sc_1 + 10;
    Sc_r2 <- Sc_r * 2;
    Sc_r3 <- null;
"""

data_structures = {
    "datasets": [
        {
            "name": "DS_1",
            "DataStructure": [
                {"name": "Id_1", "type": "Integer", "role": "Identifier", "nullable": False},
                {"name": "Me_1", "type": "Number", "role": "Measure", "nullable": True},
            ],
        }
    ],
    "scalars": [
        {
            "name": "Sc_1",
            "type": "Number",
        }
    ],
}

data_df = pd.DataFrame({"Id_1": [1, 2, 3], "Me_1": [10, 20, 30]})
datapoints = {"DS_1": data_df}
scalars = {"Sc_1": 20}

run_result = run(
    script=script,
    data_structures=data_structures,
    datapoints=datapoints,
    scalar_values=scalars,
    return_only_persistent=True
)

print(run_result)

Returns:

Id_1

Me_1

2

20.0

Name

Value

Sc_r

30

Sc_r2

60

Sc_r3

Prettify

The vtlengine.prettify() method formats a VTL script to improve readability.

from vtlengine import prettify
script = """
    define hierarchical ruleset accountingEntry (variable rule ACCOUNTING_ENTRY) is
                    B = C - D errorcode "Balance (credit-debit)" errorlevel 4;
                    N = A - L errorcode "Net (assets-liabilities)" errorlevel 4
                end hierarchical ruleset;

    DS_r <- check_hierarchy(BOP, accountingEntry rule ACCOUNTING_ENTRY dataset);
    """
prettified_script = prettify(script)
print(prettified_script)

Returns:

define hierarchical ruleset accountingEntry(variable rule ACCOUNTING_ENTRY) is
    B = C - D
    errorcode "Balance (credit-debit)"
    errorlevel 4;

    N = A - L
    errorcode "Net (assets-liabilities)"
    errorlevel 4
end hierarchical ruleset;

DS_r <-
    check_hierarchy(
        BOP,
        accountingEntry,
        rule ACCOUNTING_ENTRY);

For more information on usage, please refer to the API documentation