10 minutes to VTL Engine

Summarizes the main functions of the VTL Engine

The VTL Engine API implements five basic methods:

  • Semantic Analysis: aimed at validating the correctness of a script and computing the data structures of the datasets created in the script.

  • Run: aimed at executing the provided VTL script on the provided input datasets.

  • Run_sdmx: as part of the compatibility with pysdmx, this method runs a VTL script using the pysdmx PandasDataset. It allows the use of SDMX datasets as input to run the given script.

  • Generate_sdmx: as part of the compatibility with pysdmx, this method generates a TransformationScheme object from a VTL script.

  • Prettify: aimed at formatting the VTL script to make it more readable.

Any action with VTL requires the following elements as input:

  • VTL Script: the VTL code to be executed, which includes the transformation scheme as well as the User Defined Operators, Hierarchical Rulesets and Datapoint Rulesets. It is provided as a string or as a Path object to a vtl file.

  • Data structures: provide the structure of the input artifacts of the VTL script, according to the VTL Information Model. Given that the current version of the standard doesn't prescribe a format for providing this information, the VTL Engine implements a JSON format that can be found here. Data structures can be provided as dictionaries or as Paths to JSON files; it is also possible to provide a list of these objects.

  • External routines: the VTL Engine allows using SQL (SQLite) with the eval operator. They can be provided as a string with the SQL code or as a Path object to an SQL file. The default value is None, which shall be used if external routines are not applicable to the VTL script.

  • Value domains: provide the value domains used in the VTL script, normally with an in operator. They can be provided as a dictionary or as a Path to a JSON file. The default value is None, which shall be used if value domains are not applicable to the VTL script.
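The four inputs above can be assembled in plain Python before calling the engine. The sketch below builds a data structures dictionary in the JSON layout used throughout this page; the dataset and component names (DS_1, Id_1, Me_1) and the make_component helper are illustrative, not part of the vtlengine API.

```python
def make_component(name, data_type, role, nullable):
    """Build one component entry of a VTL data structure (illustrative helper)."""
    return {"name": name, "type": data_type, "role": role, "nullable": nullable}

# Structure of one input dataset, following the JSON format used in this guide.
data_structures = {
    "datasets": [
        {
            "name": "DS_1",
            "DataStructure": [
                make_component("Id_1", "Integer", "Identifier", False),
                make_component("Me_1", "Number", "Measure", True),
            ],
        }
    ]
}

# External routines and value domains default to None when not needed:
external_routines = None
value_domains = None
```

The same dictionary could equally be loaded from a JSON file and passed as a Path object.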

Semantic Analysis

The vtlengine.semantic_analysis() method serves to validate the correctness of a VTL script, as well as to calculate the data structures of the datasets generated by the VTL script itself (that calculation is a pre-requisite for the semantic analysis).

  • If the VTL script is correct, the method returns a dictionary with the data structures of all the datasets generated by the script.

  • If the VTL script is incorrect, the method raises a SemanticError.

Example 1: Correct VTL

from vtlengine import semantic_analysis

script = """
    DS_A <= DS_1 * 10;
"""

data_structures = {
    'datasets': [
        {'name': 'DS_1',
         'DataStructure': [
             {'name': 'Id_1',
              'type': 'Integer',
              'role': 'Identifier',
              'nullable': False},
             {'name': 'Me_1',
              'type': 'Number',
              'role': 'Measure',
              'nullable': True}
         ]}
    ]
}

sa_result = semantic_analysis(script=script, data_structures=data_structures)

print(sa_result)

Returns:

{'DS_A': Dataset(name='DS_A', components={'Id_1': Component(name='Id_1', data_type="Integer", role="Identifier", nullable=False), 'Me_1': Component(name='Me_1', data_type="Number", role="Measure", nullable=True)}, data=None)}

Example 2: Incorrect VTL

Note that, as compared to Example 1, the only change is that Me_1 is of the String data type, instead of Number.

from vtlengine import semantic_analysis

script = """
    DS_A <= DS_1 * 10;
"""

data_structures = {
    'datasets': [
        {'name': 'DS_1',
         'DataStructure': [
             {'name': 'Id_1',
              'type': 'Integer',
              'role': 'Identifier',
              'nullable': False},
             {'name': 'Me_1',
              'type': 'String',
              'role': 'Measure',
              'nullable': True}
         ]}
    ]
}

sa_result = semantic_analysis(script=script, data_structures=data_structures)

print(sa_result)

Will raise the following Error:

vtlengine.Exceptions.SemanticError: ('Invalid implicit cast from String and Integer to Number.', '1-1-1-2')

Run VTL Scripts

The vtlengine.run() method serves to execute a VTL script with input datapoints.

The method returns a dictionary with all the generated datasets. When the output parameter is set, the engine writes the result of the computation to the output folder; otherwise, it includes the data in the returned dictionary of computed datasets.

Two validations are performed before running, which can raise errors:

  • Semantic analysis: Equivalent to running the vtlengine.semantic_analysis() method

  • Data load analysis: Basic check of the data structure (names and types)
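The data load analysis can be pictured as a comparison between the declared structure and the actual DataFrame. The function below is a simplified sketch of that idea, not the engine's actual implementation; it only checks column names, while the real check also covers types.

```python
import pandas as pd

def check_datapoints(structure, df):
    """Illustrative data load check: the DataFrame columns must match the
    component names declared in the data structure. Sketch only; the real
    engine also validates data types."""
    expected = {component["name"] for component in structure["DataStructure"]}
    actual = set(df.columns)
    if expected != actual:
        raise ValueError(f"Column mismatch: expected {expected}, got {actual}")
    return True

structure = {
    "name": "DS_1",
    "DataStructure": [
        {"name": "Id_1", "type": "Integer", "role": "Identifier", "nullable": False},
        {"name": "Me_1", "type": "Number", "role": "Measure", "nullable": True},
    ],
}
df = pd.DataFrame({"Id_1": [1, 2, 3], "Me_1": [10, 20, 30]})
assert check_datapoints(structure, df)
```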

Example 3: Simple run

from vtlengine import run
import pandas as pd

script = """
    DS_A <= DS_1 * 10;
"""

data_structures = {
    'datasets': [
        {'name': 'DS_1',
         'DataStructure': [
             {'name': 'Id_1',
              'type': 'Integer',
              'role': 'Identifier',
              'nullable': False},
             {'name': 'Me_1',
              'type': 'Number',
              'role': 'Measure',
              'nullable': True}
         ]}
    ]
}

data_df = pd.DataFrame(
    {"Id_1": [1, 2, 3],
     "Me_1": [10, 20, 30]})

datapoints = {"DS_1": data_df}

run_result = run(script=script, data_structures=data_structures,
                 datapoints=datapoints)

print(run_result["DS_A"].data)

Returns:

Id_1   Me_1
   1  100.0
   2  200.0
   3  300.0

Example 4: Run from SDMX Dataset

The vtlengine.run_sdmx() method serves to execute a VTL script with input SDMX files, using one or more PandasDataset instances obtained through the get_datasets function from the pysdmx library.

This function prepares the required VTL data structures and datapoints, handles mapping from dataset structures to VTL identifiers, and delegates execution to the VTL engine. It performs internal validation of dataset structures and the VTL script’s input dependencies using DAG analysis.
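The input-dependency analysis mentioned above can be illustrated with a toy example: dataset names that a script reads but never assigns are its external inputs. The regex-based function below is a deliberately simplistic sketch of that idea; the real engine uses a full parser and DAG analysis.

```python
import re

def external_inputs(script):
    """Toy illustration of input-dependency analysis: names referenced but
    never assigned are external inputs. The real engine parses the script
    and builds a DAG; this regex sketch only handles simple expressions."""
    assigned = set(re.findall(r"(\w+)\s*(?:<-|<=|:=)", script))
    referenced = set(re.findall(r"\b([A-Za-z_]\w*)\b", script))
    keywords = {"calc", "rename", "to", "and", "or", "not"}  # partial list
    return referenced - assigned - keywords

script = "DS_A <= DS_1 * 10; DS_B <= DS_A + DS_2;"
print(sorted(external_inputs(script)))  # ['DS_1', 'DS_2']
```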

See the pysdmx documentation on reading and writing SDMX datasets.

The process works as follows:

  • The user provides a VTL script. This can be passed as a string, a Transformation Scheme, or a file.

  • One or more SDMX datasets are supplied as PandasDataset. These datasets must include structural metadata (using pysdmx Schema).

  • If multiple datasets are used, a mapping is required to link each dataset to the corresponding name expected in the script. This mapping can be a VtlDataflowMapping object or a dictionary with key-value pairs of the short-urn and the VTL dataset name.

  • The function checks that all inputs are valid, converts them into VTL-compatible format, and then executes the script.

  • The result is a new dataset (or several) generated by the logic defined in the VTL script.

Important

The short-urn is the meaningful part of the URN. The format is: SDMX_type=Agency:ID(Version).

Example:

Dataflow=MD:TEST_DF(1.0) is the short-urn for urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=MD:TEST_DF(1.0)
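Deriving the short-urn from a full URN is a simple string operation: keep the class name after the last dot and everything after the first "=". The helper below is an illustration, not part of the vtlengine or pysdmx API.

```python
def short_urn(urn):
    """Derive the short-urn (SDMX_type=Agency:ID(Version)) from a full URN.
    Illustrative helper, not part of the vtlengine API."""
    prefix, _, remainder = urn.partition("=")
    # The SDMX class name is the last dot-separated token before '=':
    sdmx_type = prefix.rsplit(".", 1)[-1]
    return f"{sdmx_type}={remainder}"

full = "urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=MD:TEST_DF(1.0)"
print(short_urn(full))  # Dataflow=MD:TEST_DF(1.0)
```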

Optional settings are the same as in the run method, including:

  • Providing value domains for data validation.

  • Using external routines as SQL statements.

  • Controlling how time period columns are formatted in the output.

  • Saving the result to a specified output folder.

  • Filtering output datasets to only return those marked as “persistent” in the VTL script.

from pathlib import Path

from pysdmx.io import get_datasets

from vtlengine import run_sdmx

data = Path("Docs/_static/data.xml")
structure = Path("Docs/_static/metadata.xml")
datasets = get_datasets(data, structure)
script = "DS_r <= DS_1 [calc Me_4 := OBS_VALUE];"
print(run_sdmx(script, datasets)['DS_r'].data)

Returns:

DIM_1 DIM_2     OBS_VALUE          Me_4
    A     A           100           100
    A     B           102           102
    B     B  14206.490766  14206.490766
    C     B         10130         10130
    C     C             …             …
    C     D  29929.036014  29929.036014

As part of the compatibility with pysdmx, the function can also take a TransformationScheme object as input. If no mapping is included, the VTL script must have a single input, and the data file must contain only one dataset:

from pathlib import Path

from pysdmx.io import get_datasets
from pysdmx.model.vtl import TransformationScheme, Transformation
from vtlengine import run_sdmx

data = Path("Docs/_static/data.xml")
structure = Path("Docs/_static/metadata.xml")
datasets = get_datasets(data, structure)
script = TransformationScheme(
    id="TS1",
    version="1.0",
    agency="MD",
    vtl_version="2.1",
    items=[
        Transformation(
            id="T1",
            uri=None,
            urn=None,
            name=None,
            description=None,
            expression="DS_1 [calc Me_4 := OBS_VALUE];",
            is_persistent=True,
            result="DS_r1",
            annotations=(),
        ),
        Transformation(
            id="T2",
            uri=None,
            urn=None,
            name=None,
            description=None,
            expression="DS_1 [rename OBS_VALUE to Me_5];",
            is_persistent=True,
            result="DS_r2",
            annotations=(),
        )
    ],
)
run_sdmx(script, datasets=datasets)

Finally, the mapping information is used to map the SDMX input dataset to the VTL input dataset, using either the VtlDataflowMapping object from pysdmx or a dictionary.

from pathlib import Path

from pysdmx.io import get_datasets
from pysdmx.model.vtl import TransformationScheme, Transformation
from pysdmx.model.vtl import VtlDataflowMapping
from vtlengine import run_sdmx

data = Path("Docs/_static/data.xml")
structure = Path("Docs/_static/metadata.xml")
datasets = get_datasets(data, structure)
script = TransformationScheme(
    id="TS1",
    version="1.0",
    agency="MD",
    vtl_version="2.1",
    items=[
        Transformation(
            id="T1",
            uri=None,
            urn=None,
            name=None,
            description=None,
            expression="DS_1 [calc Me_4 := OBS_VALUE]",
            is_persistent=True,
            result="DS_r",
            annotations=(),
        ),
    ],
)
# Mapping using a VtlDataflowMapping object:
mapping = VtlDataflowMapping(
    dataflow="urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=MD:TEST_DF(1.0)",
    dataflow_alias="DS_1",
    id="VTL_MAP_1",
)

# Mapping using a dictionary:
mapping = {
    "Dataflow=MD:TEST_DF(1.0)": "DS_1"
}
run_sdmx(script, datasets, mapping=mapping)

The files used in these examples can be found here:

Prettify

The vtlengine.prettify() method serves to format a VTL script to make it more readable.

from vtlengine import prettify
script = """
    define hierarchical ruleset accountingEntry (variable rule ACCOUNTING_ENTRY) is
                    B = C - D errorcode "Balance (credit-debit)" errorlevel 4;
                    N = A - L errorcode "Net (assets-liabilities)" errorlevel 4
                end hierarchical ruleset;

    DS_r <= check_hierarchy(BOP, accountingEntry rule ACCOUNTING_ENTRY dataset);
    """
prettified_script = prettify(script)
print(prettified_script)

Returns:

define hierarchical ruleset accountingEntry(variable rule ACCOUNTING_ENTRY) is
    B = C - D
    errorcode "Balance (credit-debit)"
    errorlevel 4;

    N = A - L
    errorcode "Net (assets-liabilities)"
    errorlevel 4
end hierarchical ruleset;

DS_r <=
    check_hierarchy(
        BOP,
        accountingEntry,
        rule ACCOUNTING_ENTRY);

For more information on usage, please refer to the API documentation.