Good to know

Which function to use when opening data

There are many ways to open data in xscen workflows. The list below tries to make the differences clear:

Search and extract:

Using search_data_catalogs() followed by extract_dataset(). This is the recommended method for parsing catalogs of “raw” data, that is, data not yet modified by your workflow. It has features meant to ease the aggregation and extraction of raw files:

variable conversion and resampling of subdaily data
spatial and temporal subsetting
matching historical and future runs for simulations

search_data_catalogs returns a dictionary with one catalog for each unique id found in the search. Iterate over this dictionary and call extract_dataset on each item; each call returns a dictionary with a single dataset for each xrfreq. You thus end up with one dataset per frequency and id.
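The loop described above can be sketched as follows. This is a minimal sketch rather than a definitive recipe: the helper name extract_all is hypothetical, the keyword arguments passed to search_data_catalogs() and extract_dataset() are assumptions to check against your installed xscen version, and the import is placed inside the function only so that the snippet stays readable without xscen installed.

```python
def extract_all(data_catalogs, variables_and_freqs, **extract_kwargs):
    """Return {(id, xrfreq): dataset} for everything matched by the search.

    Hypothetical helper: it only wires together the two xscen calls
    named in the text; all argument values are supplied by the caller.
    """
    import xscen as xs  # imported lazily; requires xscen at call time

    # One sub-catalog per unique id found in the search.
    cat_dict = xs.search_data_catalogs(
        data_catalogs=data_catalogs,
        variables_and_freqs=variables_and_freqs,
        match_hist_and_fut=True,  # assumption: pair historical and future runs
    )

    extracted = {}
    for ds_id, sub_catalog in cat_dict.items():
        # One dataset per xrfreq for this id.
        ds_by_freq = xs.extract_dataset(catalog=sub_catalog, **extract_kwargs)
        for xrfreq, ds in ds_by_freq.items():
            extracted[(ds_id, xrfreq)] = ds
    return extracted
```

Keying the result by (id, xrfreq) makes the "one dataset per frequency and id" outcome explicit; a nested dictionary would work equally well.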

to_dataset_dict:

Using to_dataset_dict(). Use this when all the data you need is in a single catalog (for example, your ProjectCatalog()) and you don’t need any of the features listed above. It can be combined with a simple .search beforehand to subset the catalog. As explained in Columns, it creates a dictionary with a single Dataset for each combination of id, domain, processing_level and xrfreq, unless different aggregation rules were set during the catalog creation.

to_dataset:

Using to_dataset(). Similar to to_dataset_dict, but it returns a single dataset. If the catalog would yield more than one dataset, the call fails. It behaves like to_dask(), but exposes options to add aggregations. This is useful when constructing an ensemble dataset that would otherwise result in distinct entries in the output of to_dataset_dict. It can usually replace a combination of to_dataset_dict and create_ensemble().
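As a rough illustration of the single-dataset route, the sketch below subsets a catalog and then opens it with to_dataset(). The helper name open_single_dataset is hypothetical, and the create_ensemble_on keyword is an assumption about the aggregation options exposed by to_dataset(); check both against the catalog API of your installed version.

```python
def open_single_dataset(cat, ensemble_columns=None, **query):
    """Open one dataset from a catalog, optionally as an ensemble.

    cat: a catalog object exposing .search() and .to_dataset().
    ensemble_columns: catalog columns to merge into a single ensemble
        dimension (assumed to map to ``create_ensemble_on``).
    query: column=value pairs forwarded to .search().
    """
    # Subset first so that to_dataset() sees as few entries as possible.
    subset = cat.search(**query) if query else cat
    if ensemble_columns:
        return subset.to_dataset(create_ensemble_on=ensemble_columns)
    # Without aggregation options, this fails if several datasets remain.
    return subset.to_dataset()
```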

open_dataset:

Of course, xscen workflows can still use the conventional open_dataset(). Just be aware that datasets opened this way will lack the attributes automatically added by the previous functions, which can result in poorer metadata or even failures in some xscen functions. The same applies to open_mfdataset(). If your data is listed in a catalog, the functions above will usually provide what you need; for example, xr.open_mfdataset(cat.df.path) is very rarely optimal.

create_ensemble:

With to_dataset() or ensemble_stats(), you should usually find what you need. create_ensemble() is not needed in xscen workflows.

Which function to use when resampling data

extract_dataset : extract_dataset()’s resampling capabilities are meant to provide daily data from finer sources.
resample : xscen.extract.resample() extends xarray’s resample methods with support for weighted resampling when starting from data coarser than daily and for handling missing timesteps or values.
xclim indicators : Through compute_indicators(), xscen workflows can easily use xclim indicators to go from daily data to coarser frequencies (monthly, seasonal, annual), with missing values handling. This option adds more metadata than the first two.
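To see why weighting matters when starting from data coarser than daily, consider averaging monthly means into an annual mean: months have different lengths, so each one should count in proportion to its number of days. The snippet below is a plain-Python illustration of that idea only; it is not the implementation of xscen.extract.resample, and the function name weighted_annual_mean is hypothetical.

```python
import calendar


def weighted_annual_mean(monthly_means, year):
    """Average 12 monthly means, weighting each by its number of days."""
    if len(monthly_means) != 12:
        raise ValueError("expected exactly 12 monthly values")
    # Days per month for the given year (handles leap years).
    days = [calendar.monthrange(year, month)[1] for month in range(1, 13)]
    total = sum(value * n_days for value, n_days in zip(monthly_means, days))
    return total / sum(days)
```

An unweighted mean would give every month a weight of 1/12, which slightly over-represents February and under-represents the 31-day months.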

Metadata translation

xscen itself does not add many translatable attributes, but when it does, it looks into xclim’s options to determine which locales to translate them to. Similar to xclim, it always adds a given attribute in English, followed by translations under the same attribute name suffixed with “_XX”, where “XX” is the two-letter language code defined in the ISO 639-1 standard. For example, if a function adds a long_name and Inuktitut translation is activated, the function will also add a long_name_iu attribute.
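The naming convention can be illustrated with a few lines of plain Python. This is only a sketch of the resulting attribute layout, not the code xscen uses; the helper name add_localized_attr and the French text are made up for the example.

```python
def add_localized_attr(attrs, name, english, translations):
    """Return a copy of attrs with `name` in English plus suffixed translations.

    translations maps a two-letter ISO 639-1 code to the translated text.
    """
    out = dict(attrs)
    out[name] = english  # the unsuffixed attribute is always in English
    for locale, text in translations.items():
        out[f"{name}_{locale}"] = text  # e.g. long_name_fr
    return out


attrs = add_localized_attr(
    {"units": "K"},
    "long_name",
    "Mean temperature",
    {"fr": "Température moyenne"},
)
```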

In a config file, activating French translations for both xclim’s indicators and xscen (and figanos) is done with:

xclim:
    metadata_locales:
        - fr

This can also be activated in the code using xclim.core.options.set_options(). Note that this only applies to attributes that are added to a dataset. Some xscen functions will instead update an existing attribute. For example, when calculating the climatology of a variable with long_name Mean temperature, climatological_mean() will update the long_name to 30-year average of Mean temperature. This automatic update is done for all locales available in the variable, no matter which xclim option is activated. For example, if a long_name_eu exists in the variable and a Basque translation catalog exists in that xscen instance, then the attribute will be translated, no matter what xclim’s metadata_locales is set to.

Translation is of course not automatic but relies on manually populated gettext catalogs. xscen ships with a catalog of French (fr) translations. See Translating xscen to learn how to add translations to xscen. xclim’s documentation also covers the same subject.

If your xscen is installed in “editable” mode in its source directory (pip install -e .), you should run make translate each time you pull changes from the upstream source.

Module-wide options

As seen above, it can be useful to use the “special” sections of the config file to set some module-wide options. For example:

logging:
    # same arguments as python's logging.config.dictConfig
xarray:
    keep_attrs: True
xclim:
    metadata_locales:
        - fr
    check_missing: "skip"
warning:
    # warning_category : filter_action
    all: ignore
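For readers wondering what the warning section amounts to, the sketch below applies a {warning_category: filter_action} mapping with Python's standard warnings module. It is an assumption about the intent of that section, not xscen's actual config loader; the function name apply_warning_filters is hypothetical, and category names are assumed to be built-in warning classes such as FutureWarning.

```python
import builtins
import warnings


def apply_warning_filters(filters):
    """Apply a {category_name: action} mapping to the warnings module.

    The special key "all" applies the action to every warning category.
    """
    for category_name, action in filters.items():
        if category_name == "all":
            warnings.simplefilter(action)
        else:
            # Assumption: names refer to built-in warning classes.
            category = getattr(builtins, category_name)
            warnings.filterwarnings(action, category=category)
```

With this helper, the config fragment above would correspond to apply_warning_filters({"all": "ignore"}).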

Global warming dataset

The xscen.extract.get_warming_level() and xscen.extract.subset_warming_level() functions use a custom-made database of global temperature averages to find the global warming levels of known climate simulations. The database is stored as a netCDF file inside the package itself. It stores the global temperature average (land and ocean) from 1850 to 2100 for multiple simulations (not all simulations cover the entire temporal range). Simulations are defined through 4 fields:

mip_era : “CMIP6”, “CMIP5” or “obs” (see below)
source : The model name for GCMs (same as the source column) and the driving model name for RCMs (driving_model column)
experiment : The CMIP experiment name of the run. The “historical” and “pre-industrial” experiments have been merged into each future experiment (similar to what match_hist_and_fut does in search_data_catalogs())
member : The realization variant label of the run (same as the member column)

An extra data_source field is also available and describes how the data has been obtained:

“IPCC Atlas” : The timeseries was copied directly from the public data of the IPCC Atlas
“From Amon” : The monthly temperature average was resampled annually and averaged over the globe using a cos-lat weighting
“From Amon with xscen” : Same as “From Amon”, but xscen was used to perform the computation.
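The cos-lat weighting mentioned for the “From Amon” entries can be sketched in a few lines. This is a generic, plain-Python illustration of a cosine-of-latitude weighted global mean on a regular grid, not the code that produced the database; the function name coslat_global_mean is hypothetical and the annual resampling step is not shown.

```python
import math


def coslat_global_mean(field, lats_deg):
    """Global mean of a lat-lon field weighted by cos(latitude).

    field[i][j] is the value at latitude lats_deg[i] and the j-th longitude.
    Grid cells shrink toward the poles, so each row is weighted by cos(lat).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for row, lat in zip(field, lats_deg):
        weight = math.cos(math.radians(lat))
        weighted_sum += weight * sum(row)
        weight_total += weight * len(row)
    return weighted_sum / weight_total
```

For example, a row at 60° latitude counts for half as much as a row at the equator, since cos(60°) = 0.5.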

In addition to the climate simulations, a few “observational” datasets are made available in the database. The choice of datasets and the methodology were adapted from the WMO’s State of the Global Climate 2021. However, to have some consistency between these and the simulated series, an estimated 1850-1900 mean temperature was added to the WMO-compliant anomalies to get absolute values. Keep in mind that this is only an estimate: the timeseries should only be used to compute anomalies. The observational series have a short dataset name in the source field, “obs” in mip_era and experiment, and an empty member (“”). Their data_source is “Computed following WMO guidelines”.