6. YAML usage

NOTE

This tutorial stays mostly xscen-specific and therefore does not cover more advanced YAML features such as anchors. More information on those can be found here, and this template makes ample use of them.

While parameters can be given explicitly to functions, most xscen functions can also receive their arguments automatically from YAML configuration files. This tutorial goes over the basic principles of writing and preparing configuration files, and provides a few examples.

An xscen function supports YAML parametrisation if it is decorated with the parse_config wrapper in the code. Currently supported functions are:

[1]:
from xscen.config import get_configurable

list(get_configurable().keys())
[1]:
['xscen.aggregate.climatological_mean',
 'xscen.aggregate.climatological_op',
 'xscen.aggregate.compute_deltas',
 'xscen.aggregate.compute_indicators',
 'xscen.aggregate.produce_horizon',
 'xscen.aggregate.spatial_mean',
 'xscen.aggregate.subset_warming_level',
 'xscen.biasadjust.adjust',
 'xscen.biasadjust.train',
 'xscen.catutils.build_path',
 'xscen.catutils.parse_directory',
 'xscen.diagnostics.health_checks',
 'xscen.diagnostics.properties_and_measures',
 'xscen.diagnostics.unstack_fill_nan',
 'xscen.ensembles.compute_indicators',
 'xscen.ensembles.ensemble_stats',
 'xscen.ensembles.reduce_ensemble',
 'xscen.ensembles.regrid_dataset',
 'xscen.extract.extract_dataset',
 'xscen.extract.get_warming_level',
 'xscen.extract.search_data_catalogs',
 'xscen.extract.subset_warming_level',
 'xscen.indicators.compute_indicators',
 'xscen.io.rechunk',
 'xscen.io.save_to_netcdf',
 'xscen.io.save_to_zarr',
 'xscen.regrid.create_mask',
 'xscen.regrid.regrid_dataset',
 'xscen.scripting.measure_time',
 'xscen.scripting.send_mail',
 'xscen.scripting.send_mail_on_exit',
 'xscen.spatial.creep_fill',
 'xscen.spatial.creep_weights',
 'xscen.utils.maybe_unstack',
 'xscen.utils.stack_drop_nans',
 'xscen.utils.unstack_fill_nan']
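
The idea behind such a wrapper can be illustrated with a small standard-library sketch. This is not xscen's implementation: the `configurable` decorator, the `FAKE_CONFIG` dictionary and the toy `compute_deltas` function are all hypothetical stand-ins, and the rule that explicitly passed arguments take precedence over configured ones is an assumption of this sketch.

```python
import functools
import inspect

# Hypothetical stand-in for a loaded configuration: module -> function -> arguments.
FAKE_CONFIG = {"aggregate": {"compute_deltas": {"kind": "+", "to_level": "delta"}}}


def configurable(module):
    """Build a decorator that fills missing arguments from FAKE_CONFIG[module][function]."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            from_config = FAKE_CONFIG.get(module, {}).get(func.__name__, {})
            # Find which parameters the caller already provided.
            bound = inspect.signature(func).bind_partial(*args, **kwargs)
            # Only add configured values for parameters that were not given explicitly.
            extra = {k: v for k, v in from_config.items() if k not in bound.arguments}
            return func(*args, **kwargs, **extra)

        return wrapper

    return decorator


@configurable("aggregate")
def compute_deltas(ds, kind="-", to_level="deltas"):
    """Toy function that simply reports the arguments it received."""
    return ds, kind, to_level


result_default = compute_deltas("dataset")  # both arguments come from FAKE_CONFIG
result_override = compute_deltas("dataset", kind="/")  # explicit 'kind' is kept
print(result_default)
print(result_override)
```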

6.1. Loading an existing YAML config file

YAML files are read using xscen.load_config. Any number of files can be passed; they are merged together into a single Python dictionary accessed through xscen.CONFIG.

[2]:
from pathlib import Path

import xscen as xs
from xscen import CONFIG
[3]:
# Load configuration
xs.load_config(
    str(
        Path().absolute().parent.parent
        / "templates"
        / "1-basic_workflow_with_config"
        / "config1.yml"
    ),
    # str(Path().absolute().parent.parent / "templates" / "1-basic_workflow_with_config" / "paths1_example.yml")  We can't actually load this file due to the fake paths, but this would be the format
)

# Display the dictionary keys
print(CONFIG.keys())
dict_keys(['tasks', 'extract', 'regrid', 'biasadjust', 'cleanup', 'rechunk', 'diagnostics', 'indicators', 'aggregate', 'ensembles', 'project', 'scripting', 'dask', 'logging', 'xclim', 'to_dataset_dict'])
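
To make the merging of several files more concrete, here is a minimal sketch that uses plain dictionaries in place of parsed YAML files. The `merge_nested` helper is hypothetical, and the behaviour it shows (nested sections merged key by key, with later files taking precedence) is an assumption about what a merge could look like rather than a description of xscen's internals.

```python
from copy import deepcopy


def merge_nested(base, update):
    """Return a new dict in which `update` is merged into `base`, section by section."""
    merged = deepcopy(base)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are sections: merge them recursively.
            merged[key] = merge_nested(merged[key], value)
        else:
            # Otherwise the later value replaces the earlier one.
            merged[key] = deepcopy(value)
    return merged


# Two dictionaries standing in for the content of two YAML files (made-up values).
config_file = {"project": {"id": "t1"}, "tasks": ["extract"]}
paths_file = {"project": {"title": "Example"}, "paths": {"output": "/hypothetical/output"}}

merged = merge_nested(config_file, paths_file)
print(merged)
```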

xscen.CONFIG behaves similarly to a Python dictionary, but has a custom __getitem__ that returns a deepcopy of the requested item. Modifying a retrieved value therefore never alters the stored configuration, which makes CONFIG effectively immutable and thus reliable and robust.

[4]:
# A normal python dictionary is mutable, but a CONFIG dictionary is not.
pydict = dict(CONFIG["project"])
print(CONFIG["project"]["id"], ", ", pydict["id"])
pydict2 = pydict
pydict2["id"] = "modified id"
print(CONFIG["project"]["id"], ", ", pydict["id"], ", ", pydict2["id"])
pydict3 = pydict2
pydict3["id"] = "even more modified id"
print(
    CONFIG["project"]["id"],
    ", ",
    pydict["id"],
    ", ",
    pydict2["id"],
    ", ",
    pydict3["id"],
)
t1 ,  t1
t1 ,  modified id ,  modified id
t1 ,  even more modified id ,  even more modified id ,  even more modified id
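
The copy-on-read behaviour described above can be reproduced with a few lines of standard Python. The `CopyOnReadDict` class below is a hypothetical illustration of the principle, not xscen's actual class.

```python
from copy import deepcopy


class CopyOnReadDict(dict):
    """Dictionary whose item access returns a deep copy of the stored value."""

    def __getitem__(self, key):
        return deepcopy(super().__getitem__(key))


config = CopyOnReadDict(project={"id": "t1"})

# The retrieved section is an independent copy: editing it leaves `config` untouched.
section = config["project"]
section["id"] = "modified id"

# Chained assignment also acts on a temporary copy, so it has no lasting effect.
config["project"]["id"] = "another id"

print(config["project"]["id"], ",", section["id"])
```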

If one really wants to modify the CONFIG dictionary from within the workflow itself, its set method must be used.

[5]:
CONFIG.set("project.id", "modified id")
print(CONFIG["project"]["id"])
modified id
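
The dotted key passed to set ("project.id") designates a nested entry. A minimal sketch of how such a key can be resolved on a plain dictionary is given below; `set_by_dotted_key` is a hypothetical helper, and creating missing intermediate sections is a choice made for this sketch only.

```python
def set_by_dotted_key(mapping, dotted_key, value):
    """Assign `value` to the nested entry designated by a key such as 'project.id'."""
    *parents, last = dotted_key.split(".")
    node = mapping
    for part in parents:
        # Walk down one level, creating the section if it does not exist yet.
        node = node.setdefault(part, {})
    node[last] = value


settings = {"project": {"id": "t1"}}
set_by_dotted_key(settings, "project.id", "modified id")
set_by_dotted_key(settings, "dask.n_workers", 2)  # 'dask' is created on the fly
print(settings)
```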

6.2. Building a YAML config file

6.2.1. Generic arguments

Since CONFIG behaves like a Python dictionary, anything deemed useful for the execution of the script can be written in it. A good practice, as seen in this template’s config1.yml, is to use the YAML file to provide a list of tasks to be accomplished, give a general description of the project, or provide a dask configuration:

[6]:
print(CONFIG["tasks"])
print(CONFIG["project"])
print(CONFIG["regrid"]["dask"])
['extract', 'regrid', 'biasadjust', 'cleanup', 'rechunk', 'diagnostics', 'indicators', 'climatology', 'delta', 'ensembles']
{'name': 'Template 1 - basic_workflow_with_config', 'version': '1.0.0', 'description': 'Template for xscen workflow', 'id': 'modified id'}
{'n_workers': 2, 'threads_per_worker': 5, 'memory_limit': '10GB'}

These entries are not linked to any function and will not be called upon automatically by xscen, but they can be referred to during the execution of the script. Below is an example where tasks is used to decide which tasks to accomplish and which to skip. Many such examples can be seen throughout the provided templates.

[7]:
if "extract" in CONFIG["tasks"]:
    print("This will start the extraction process.")

if "figures" in CONFIG["tasks"]:
    print(
        "This would start creating figures, but it will be skipped since it is not in the list of tasks."
    )
This will start the extraction process.

6.2.2. Function-specific parameters

In addition to generic arguments, a major convenience of YAML files is that parameters can be automatically fed to functions if they are wrapped by @parse_config (see above for the list of currently supported functions). The exact following format has to be used:

module:
    function:
        argument:

The most up-to-date list of modules can be consulted here, as well as at the start of this tutorial. A simple example would be as follows:

aggregate:
  compute_deltas:
    kind: "+"
    reference_horizon: "1991-2020"
    to_level: 'delta'

Some functions have arguments in the form of lists and dictionaries. These are also supported:

extract:
    search_data_catalogs:
      variables_and_freqs:
        tasmax: D
        tasmin: D
        pr: D
        dtr: D
      allow_resampling: False
      allow_conversion: True
      periods: ['1991', '2020']
      other_search_criteria:
        source:
          "ERA5-Land"
[8]:
# Note that the YAML used here is more complex and separates tasks between 'reconstruction' and 'simulation', which would break the automatic passing of arguments.
print(
    CONFIG["extract"]["reconstruction"]["search_data_catalogs"]["variables_and_freqs"]
)  # Dictionary
print(CONFIG["extract"]["reconstruction"]["search_data_catalogs"]["periods"])  # List
{'tasmax': 'D', 'tasmin': 'D', 'pr': 'D', 'dtr': 'D'}
['1991', '2020']
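
When the YAML layout contains an extra level such as 'reconstruction', the relevant sub-dictionary can still be handed to a function explicitly through keyword unpacking. The sketch below shows that pattern with a plain dictionary and a hypothetical `search_stub` function standing in for the real call; the values are shortened for illustration.

```python
# Plain dictionary standing in for the loaded configuration.
config = {
    "extract": {
        "reconstruction": {
            "search_data_catalogs": {
                "variables_and_freqs": {"tasmax": "D", "pr": "D"},
                "periods": ["1991", "2020"],
            }
        }
    }
}


def search_stub(variables_and_freqs, periods, allow_resampling=False):
    """Hypothetical stand-in that only reports the arguments it received."""
    return {
        "variables": sorted(variables_and_freqs),
        "periods": periods,
        "allow_resampling": allow_resampling,
    }


# Select the nested section by hand, then unpack it as keyword arguments.
kwargs = config["extract"]["reconstruction"]["search_data_catalogs"]
received = search_stub(**kwargs)
print(received)
```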

Let’s test that it is working, using climatological_op:

[9]:
# We should obtain 30-year means separated in 10-year intervals.
CONFIG["aggregate"]["climatological_op"]
[9]:
{'op': 'mean',
 'window': 30,
 'stride': 10,
 'periods': [['1951', '2100']],
 'to_level': 'climatology'}
[10]:
import pandas as pd
import xarray as xr

# Create a dummy dataset
time = pd.date_range("1951-01-01", "2100-01-01", freq="YS-JAN")
da = xr.DataArray([0] * len(time), coords={"time": time})
da.name = "test"
ds = da.to_dataset()

# Call climatological_op using no argument other than what's in CONFIG
print(xs.climatological_op(ds))
<xarray.Dataset> Size: 676B
Dimensions:         (time: 13)
Coordinates:
    horizon         (time) <U9 468B '1951-1980' '1961-1990' ... '2071-2100'
  * time            (time) datetime64[ns] 104B 1951-01-01 ... 2071-01-01
Data variables:
    test_clim_mean  (time) float64 104B 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
Attributes:
    cat:processing_level:  climatology

6.2.3. Managing paths

As a final note, YAML files are a good way to privately provide paths to a script without having to write them explicitly in the code. An example is provided here. As stated earlier, xs.load_config merges the provided YAML files into a single dictionary, so the separation is seamless once the script is running.

As an added protection, if the script is to be hosted on GitHub, paths.yml (or whatever the file is called) can then be added to the .gitignore.

6.2.4. Configuration of external packages

As explained in the load_config documentation, a few top-level sections can be used to configure packages external to xscen. For example, everything under the logging section will be sent to logging.config.dictConfig(...), allowing the full configuration of Python’s built-in logging mechanism. The current config does exactly that by configuring a logger for xscen that logs to the console, with a level set to INFO and a specified record formatting:

[11]:
CONFIG["logging"]
[11]:
{'formatters': {'default': {'format': '%(asctime)s %(levelname)-8s %(name)-15s %(message)s',
   'datefmt': '%Y-%m-%d %H:%M:%S'}},
 'handlers': {'console': {'class': 'logging.StreamHandler',
   'formatter': 'default',
   'level': 'INFO'}},
 'loggers': {'xscen': {'propagate': False,
   'level': 'INFO',
   'handlers': ['console']}},
 'version': 1}
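
For reference, the same dictionary can be passed by hand to Python's standard logging.config.dictConfig. The sketch below only uses the standard library and reproduces the section shown above; the two log messages are made up for the illustration.

```python
import logging
import logging.config

# Same content as the 'logging' section displayed above.
logging_section = {
    "version": 1,
    "formatters": {
        "default": {
            "format": "%(asctime)s %(levelname)-8s %(name)-15s %(message)s",
            "datefmt": "%Y-%m-%d %H:%M:%S",
        }
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "default",
            "level": "INFO",
        }
    },
    "loggers": {
        "xscen": {"propagate": False, "level": "INFO", "handlers": ["console"]}
    },
}

logging.config.dictConfig(logging_section)

logger = logging.getLogger("xscen")
logger.info("This message is at the INFO level, so it reaches the console.")
logger.debug("This message is below the INFO level, so it is filtered out.")
```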

6.3. Passing configuration through the command line

For a more flexible configuration, it can be useful to modify it from the command line. This way, the workflow can be started with different values without having to edit and save the YAML file each time. Command line arguments can also be used to determine which configuration file to use, so that the same workflow can be launched with different configurations without duplicating the code. The second template workflow uses this method.

The idea is simply to create an ArgumentParser with Python’s built-in argparse:

[12]:
from argparse import ArgumentParser

parser = ArgumentParser(description="An example CLI arguments parser.")
parser.add_argument("-c", "--conf", action="append")

# Let's simulate command line arguments
example_args = (
    "-c ../../templates/2-indicators_only/config2.yml "
    '-c project.title="Title" '
    "--conf project.id=newID"
)

args = parser.parse_args(example_args.split())
print(args.conf)
['../../templates/2-indicators_only/config2.yml', 'project.title="Title"', 'project.id=newID']

And then we can simply pass this list to load_config, which accepts file paths and “key=value” pairs.

[13]:
xs.load_config(*args.conf)

print(CONFIG["project"]["title"])
print(CONFIG["project"]["id"])
Title
newID
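
To illustrate how a mixed list of file paths and “key=value” pairs can be told apart, here is a standard-library sketch. It is not xscen's parsing code: `split_cli_items` is a hypothetical helper, the use of ast.literal_eval to interpret values is an assumption of this sketch, and the "dask.n_workers=4" item is an invented extra example.

```python
import ast


def split_cli_items(items):
    """Separate file paths from 'key=value' overrides in a list of CLI items."""
    files, overrides = [], {}
    for item in items:
        if "=" in item:
            key, raw = item.split("=", 1)
            try:
                # Interpret quoted strings, numbers, lists, etc. as Python literals.
                value = ast.literal_eval(raw)
            except (ValueError, SyntaxError):
                # Anything else is kept as a plain string.
                value = raw
            overrides[key] = value
        else:
            files.append(item)
    return files, overrides


files, overrides = split_cli_items(
    [
        "../../templates/2-indicators_only/config2.yml",
        'project.title="Title"',
        "project.id=newID",
        "dask.n_workers=4",  # invented item, to show a non-string value
    ]
)
print(files)
print(overrides)
```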