API

Catalog

Catalog objects and related tools.

xscen.catalog.COLUMNS = ['id', 'type', 'processing_level', 'bias_adjust_institution', 'bias_adjust_project', 'mip_era', 'activity', 'driving_model', 'institution', 'source', 'experiment', 'member', 'xrfreq', 'frequency', 'variable', 'domain', 'date_start', 'date_end', 'version', 'format', 'path']

Official column names.

class xscen.catalog.DataCatalog(*args, **kwargs)[source]

A read-only intake_esm catalog adapted to xscen’s syntax.

This class expects the catalog to have the columns listed in xscen.catalog.COLUMNS and it comes with default arguments for reading the CSV files (xscen.catalog.csv_kwargs). For example, all string columns (except path) are cast to a categorical dtype and the datetime columns are parsed with a special function that allows dates outside the conventional datetime64[ns] bounds by storing the data using pandas.Period objects.

Parameters:
  • *args (str or os.PathLike or dict) – Path to a catalog JSON file. If a dict, it must have two keys: ‘esmcat’ and ‘df’. ‘esmcat’ must be a dict representation of the ESM catalog. ‘df’ must be a Pandas DataFrame containing content that would otherwise be in the CSV file.

  • check_valid (bool) – If True, will check that all files in the catalog exist on disk and remove those that don’t.

  • drop_duplicates (bool) – If True, will drop duplicates in the catalog based on the ‘id’ and ‘path’ columns.

  • **kwargs (dict) – Any other arguments are passed to intake_esm.esm_datastore.

See also

intake_esm.core.esm_datastore
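
Examples

A minimal sketch of opening and querying a catalog. The JSON path below is hypothetical; any catalog following xscen.catalog.COLUMNS behaves the same way.

>>> import xscen as xs
>>> cat = xs.catalog.DataCatalog("/path/to/catalog.json")  # hypothetical path
>>> cat.unique("variable")  # list the variables present in the catalog
>>> sub = cat.search(variable="tas", processing_level="raw")  # returns a smaller DataCatalog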

check_valid()[source]

Verify that all files in the catalog exist on disk and remove those that don’t.

If a file is a Zarr, it will also check that all variables are present and remove those that aren’t.

drop_duplicates(columns: list[str] | None = None)[source]

Drop duplicates in the catalog based on a subset of columns.

Parameters:

columns (list of str, optional) – The columns used to identify duplicates. If None, ‘id’ and ‘path’ are used.

exists_in_cat(**columns) bool[source]

Check if there is an entry in the catalogue corresponding to the arguments given.

Parameters:

**columns – Arguments that will be given to catalog.search().

Returns:

bool – True if there is an entry in the catalogue corresponding to the arguments given.

classmethod from_df(data: DataFrame | PathLike | Sequence[PathLike], esmdata: PathLike | dict | None = None, *, read_csv_kwargs: Mapping[str, Any] | None = None, name: str = 'virtual', **intake_kwargs)[source]

Create a DataCatalog from one or more csv files.

Parameters:
  • data (DataFrame or path or sequence of paths) – A DataFrame or one or more paths to csv files.

  • esmdata (path or dict, optional) – The “ESM collection data” as a path to a json file or a dict. If None (default), xscen’s default esm_col_data is used.

  • read_csv_kwargs (dict, optional) – Extra kwargs to pass to pd.read_csv, in addition to the ones in csv_kwargs.

  • name (str) – A name to give to the catalog, used if the metadata doesn’t already contain one.

See also

pandas.read_csv

iter_unique(*columns)[source]

Iterate over sub-catalogs for each group of unique values for all specified columns.

This is a generator that yields a tuple of the unique values of the current group, in the same order as the arguments, and the sub-catalog.

search(**columns)[source]

Modification of .search() to add the ‘periods’ keyword.

to_dataset(concat_on: list[str] | str | None = None, create_ensemble_on: list[str] | str | None = None, ensemble_name: list[str] | None = None, calendar: str | None = 'standard', **kwargs) Dataset[source]

Open the catalog’s entries into a single dataset.

Same as to_dask(), but with additional control over the aggregations. The dataset definition logic is left untouched by this method (by default: [“id”, “domain”, “processing_level”, “xrfreq”]), except that newly aggregated columns are removed from the “id”. This will override any “custom” id, i.e. any id that cannot be unstacked with unstack_id().

Ensemble preprocessing logic is taken from xclim.ensembles.create_ensemble(). When create_ensemble_on is given, the function ensures all entries have the correct time coordinate according to xrfreq.

Parameters:
  • concat_on (list of str or str, optional) – A list of catalog columns over which to concat the datasets (in addition to ‘time’). Each will become a new dimension with the column values as coordinates. Xarray concatenation rules apply and can be acted upon through xarray_combine_by_coords_kwargs.

  • create_ensemble_on (list of str or str, optional) – The given column values will be merged into a new id-like “realization” column, which will be concatenated over. The given columns are removed from the dataset id, to remove them from the groupby_attrs logic. Xarray concatenation rules apply and can be acted upon through xarray_combine_by_coords_kwargs.

  • ensemble_name (list of strings, optional) – If create_ensemble_on is given, this can be a subset of those column names to use when constructing the realization coordinate. If None, this will be the same as create_ensemble_on. The resulting coordinate must be unique.

  • calendar (str, optional) – If create_ensemble_on is given, all datasets are converted to this calendar before concatenation. Ignored otherwise (default). If None, no conversion is done. align_on is always “date”.

  • kwargs – Any other arguments are passed to to_dataset_dict(). The preprocess argument cannot be used if create_ensemble_on is given.

Returns:

Dataset

See also

intake_esm.core.esm_datastore.to_dataset_dict, intake_esm.core.esm_datastore.to_dask, xclim.ensembles.create_ensemble
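
Examples

A hedged sketch, assuming sub is a DataCatalog holding several members of a single simulation; the members are merged into a “realization” dimension. The xarray_open_kwargs value is only an illustration of an argument forwarded to to_dataset_dict().

>>> import xscen as xs
>>> ds = sub.to_dataset(
...     create_ensemble_on=["member"],
...     calendar="standard",
...     xarray_open_kwargs={"chunks": {}},
... )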

unique(columns: str | Sequence[str] | None = None)[source]

Return a series of unique values in the catalog.

Parameters:

columns (str or sequence of str, optional) – The columns to get unique values from. If None, all columns are used.

xscen.catalog.ID_COLUMNS = ['bias_adjust_project', 'mip_era', 'activity', 'driving_model', 'institution', 'source', 'experiment', 'member', 'domain']

Default columns used to create a unique ID

class xscen.catalog.ProjectCatalog(*args, **kwargs)[source]

A DataCatalog with additional ‘write’ functionalities that can update and upload itself.

See also

intake_esm.core.esm_datastore

classmethod create(filename: PathLike | str, *, project: dict | None = None, overwrite: bool = False)[source]

Create a new project catalog from some project metadata.

Creates the json from default esm_col_data and an empty csv file.

Parameters:
  • filename (os.PathLike or str) – A path to the json file (with or without suffix).

  • project (dict, optional) – Metadata to create the catalog. If None, CONFIG[‘project’] will be used. Valid fields are:

    • title : Name of the project, given as the catalog’s “title”.

    • id : Slug-like version of the name, given as the catalog’s id (should be URL-proof). Defaults to a modified name.

    • version : Version of the project (and thus the catalog), string like “x.y.z”.

    • description : Detailed description of the project, given to the catalog’s “description”.

    • Any other entry defined in esm_col_data.

    At least one of id and title must be given, the rest is optional.

  • overwrite (bool) – If True, will overwrite any existing JSON and CSV file.

Returns:

ProjectCatalog – An empty intake_esm catalog.
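
Examples

A minimal sketch; the file location and project metadata are hypothetical.

>>> import xscen as xs
>>> pcat = xs.catalog.ProjectCatalog.create(
...     "/path/to/project.json",  # hypothetical location; an empty CSV is created alongside
...     project={"title": "My Project", "version": "0.1.0"},
... )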

refresh()[source]

Reread the catalog CSV saved on disk.

update(df: DataCatalog | esm_datastore | DataFrame | Series | Sequence[Series] | None = None)[source]

Update the catalog with new data and write the new data to the csv file.

Once the internal dataframe is updated with df, the csv on disk is parsed, updated with the internal dataframe, duplicates are dropped and everything is written back to the csv. This means that nothing is removed from the csv when calling this method, and it is safe to use even with a subset of the catalog.

Warning

If a file was deleted between the parsing of the catalog and this call, it will be removed from the csv when check_valid is called.

Parameters:

df (Union[DataCatalog, intake_esm.esm_datastore, pd.DataFrame, pd.Series, Sequence[pd.Series]], optional) – Data to be added to the catalog. If None, nothing is added, but the catalog is still updated.

update_from_ds(ds: Dataset, path: PathLike | str, info_dict: dict | None = None, **info_kwargs)[source]

Update the catalog with new data and write the new data to the csv file.

We get the new data from the attributes of ds, the dictionary info_dict and path.

Once the internal dataframe is updated with the new data, the csv on disk is parsed, updated with the internal dataframe, duplicates are dropped and everything is written back to the csv. This means that nothing is removed from the csv when calling this method, and it is safe to use even with a subset of the catalog.

Warning

If a file was deleted between the parsing of the catalog and this call, it will be removed from the csv when check_valid is called.

Parameters:
  • ds (xarray.Dataset) – Dataset that we want to add to the catalog. The columns of the catalog will be filled from the global attributes starting with ‘cat:’ of the dataset.

  • info_dict (dict, optional) – Extra information to fill in the catalog.

  • path (os.PathLike or str) – Path to the file that contains the dataset. This will be added to the ‘path’ column of the catalog.
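
Examples

A hedged sketch, assuming ds carries the “cat:” attributes and pcat is an existing ProjectCatalog; the output path and extra information are hypothetical.

>>> out_path = "/path/to/output.zarr"  # hypothetical
>>> ds.to_zarr(out_path)
>>> pcat.update_from_ds(ds, path=out_path, info_dict={"processing_level": "extracted"})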

xscen.catalog.concat_data_catalogs(*dcs)[source]

Concatenate multiple DataCatalogs.

Output catalog is the union of all rows and all derived variables, with the “esmcat” of the first DataCatalog. Duplicate rows are dropped and the index is reset.

xscen.catalog.generate_id(df: DataFrame | Dataset, id_columns: list | None = None) Series[source]

Create an ID from column entries.

Parameters:
  • df (pd.DataFrame, xr.Dataset) – Data for which to create an ID.

  • id_columns (list, optional) – List of column names on which to base the dataset definition. Empty columns will be skipped. If None (default), uses ID_COLUMNS.

Returns:

pd.Series – A series of IDs, one per row of the input DataFrame.

xscen.catalog.subset_file_coverage(df: DataFrame, periods: list[str] | list[list[str]], *, coverage: float = 0.99, duplicates_ok: bool = False) DataFrame[source]

Return a subset of files that overlap with the target periods.

Parameters:
  • df (pd.DataFrame) – List of files to be evaluated, with at least a date_start and date_end column, which are expected to be datetime64 objects.

  • periods (list of str or list of lists of str) – Either [start, end] or list of [start, end] for the periods to be evaluated. All periods must be covered, otherwise an empty subset is returned.

  • coverage (float) – Percentage of hours that need to be covered in a given period for the dataset to be valid. Use 0 to ignore this checkup. The coverage calculation is only valid if there are no overlapping periods in df (ensure with duplicates_ok=False).

  • duplicates_ok (bool) – If True, no checkup is done on possible duplicates.

Returns:

pd.DataFrame – Subset of files that overlap the targeted periods.

xscen.catalog.unstack_id(df: DataFrame | ProjectCatalog | DataCatalog) dict[source]

Reverse-engineer an ID using catalog entries.

Parameters:

df (Union[pd.DataFrame, ProjectCatalog, DataCatalog]) – Either a Project/DataCatalog or a pandas DataFrame.

Returns:

dict – Dictionary with one entry per unique ID, which are themselves dictionaries of all the individual parts of the ID.

Catalog creation and path building tools.

xscen.catutils.build_path(data: dict | Dataset | DataArray | Series | DataCatalog | DataFrame, schemas: str | PathLike | dict | None = None, root: str | PathLike | None = None, **extra_facets) Path | DataCatalog | DataFrame[source]

Parse the schema from a configuration and construct path using a dictionary of facets.

Parameters:
  • data (dict or xr.Dataset or xr.DataArray or pd.Series or DataCatalog or pd.DataFrame) – Dict of facets, or xarray object to read the facets from. In the latter case, variable and time-dependent facets are read with parse_from_ds() and supplemented with all the object’s attributes, giving priority to the “official” xscen attributes (prefixed with cat:, see xscen.utils.get_cat_attrs()). Can also be a catalog or a DataFrame, in which case a “new_path” column is generated for each item.

  • schemas (Path or dict, optional) – Path to YAML schematic of database schema. If None, will use a default schema. See the comments in the xscen/data/file_schema.yml file for more details on its construction. A dict of dict schemas can be given (same as reading the yaml). Or a single schema dict (single element of the yaml).

  • root (str or Path, optional) – If given, the generated path(s) is given under this root one.

  • **extra_facets – Extra facets to supplement or override metadata missing from the first input.

Returns:

Path or catalog – Constructed path. If “format” is absent from the facets, it has no suffix. If data was a catalog, a copy with a “new_path” column is returned. Another “new_path_type” column is also added if schemas was a collection of schemas (like the default).

Examples

To rename a full catalog, the simplest way is to do:

>>> import xscen as xs
>>> import shutil as sh
>>> new_cat = xs.catutils.build_path(old_cat)
>>> for i, row in new_cat.iterrows():
...     sh.move(row.path, row.new_path)
...
xscen.catutils.parse_directory(directories: list[str | PathLike], patterns: list[str], *, id_columns: list[str] | None = None, read_from_file: bool | Sequence[str] | tuple[Sequence[str], Sequence[str]] | Sequence[tuple[Sequence[str], Sequence[str]]] = False, homogenous_info: dict | None = None, cvs: str | PathLike | dict | None = None, dirglob: str | None = None, xr_open_kwargs: Mapping[str, Any] | None = None, only_official_columns: bool = True, progress: bool = False, parallel_dirs: bool | int = False, file_checks: list[str] | None = None) DataFrame[source]

Parse files in a directory and return them as a pd.DataFrame.

Parameters:
  • directories (list of os.PathLike or list of str) – List of directories to parse. The parse is recursive.

  • patterns (list of str) – List of possible patterns to be used by parse.parse() to decode the file names. See Notes below.

  • id_columns (list of str, optional) – List of column names on which to base the dataset definition. Empty columns will be skipped. If None (default), it uses ID_COLUMNS.

  • read_from_file (boolean or set of strings or tuple of 2 sets of strings or list of tuples) – If True, if some fields were not parsed from their path, files are opened and missing fields are parsed from their metadata, if found. If a sequence of column names, only those fields are parsed from the file, if missing. If False (default), files are never opened. If a tuple of 2 lists of strings, only the first file of groups defined by the first list of columns is read and the second list of columns is parsed from the file and applied to the whole group. For example, ([“source”], [“institution”, “activity”]) will find a group with all the files that have the same source, open only one of the files to read the institution and activity, and write this information in the catalog for all files of the group. It can also be a list of those tuples.

  • homogenous_info (dict, optional) – Using the {column_name: description} format, information to apply to all files. These are applied before the cvs.

  • cvs (str or os.PathLike or dict, optional) – Dictionary with mapping from parsed term to preferred terms (Controlled VocabularieS) for each column. May have an additional “attributes” entry which maps from attribute names in the files to official column names. The attribute translation is done before the rest. In the “variable” entry, if a name is mapped to None (null), that variable will not be listed in the catalog. A term can map to another mapping from field name to values, so that a value on one column triggers the filling of other columns. In the latter case, that other column must exist beforehand, whether it was in the pattern or in the homogenous_info.

  • dirglob (str, optional) – A glob pattern for path matching to accelerate the parsing of a directory tree if only a subtree is needed. Only folders matching the pattern are parsed to find datasets.

  • xr_open_kwargs (dict) – If needed, arguments to send xr.open_dataset() when opening the file to read the attributes.

  • only_official_columns (bool) – If True (default), this ensures the final catalog only has the columns defined in xscen.catalog.COLUMNS. Other fields in the patterns will raise an error. If False, the columns are those used in the patterns and the homogenous info. In that case, the column order is not determined. Path, format and id are always present in the output.

  • progress (bool) – If True, a counter is shown in stdout when finding files on disk. Does nothing if parallel_dirs is not False.

  • parallel_dirs (bool or int) – If True, each directory is searched in parallel. If an int, it is the number of parallel searches. This should only be significantly useful if the directories are on different disks.

  • file_checks (list of str, optional) – A list of file checks to run on the parsed files. Any check will slow down the parsing. Available values are:

    • “readable” : Check that the file is readable by the current user.

    • “writable” : Check that the file is writable by the current user.

    • “ncvalid” : For netCDF, check that it is valid (openable with netCDF4).

Notes

  • Official column names are controlled and ordered by COLUMNS: [“id”, “type”, “processing_level”, “mip_era”, “activity”, “driving_institution”, “driving_model”, “institution”, “source”, “bias_adjust_institution”, “bias_adjust_project”, “experiment”, “member”, “xrfreq”, “frequency”, “variable”, “domain”, “date_start”, “date_end”, “version”]

  • Not all column names have to be present, but “xrfreq” (obtainable through “frequency”), “variable”, “date_start” and “processing_level” are necessary for a workable catalog.

  • ‘patterns’ should highlight the columns with braces.

    This acts like the reverse operation of format(). It is a template string with {field name:type} elements. The default “type” will match alphanumeric parts of the path, excluding the “_”, “/” and “\” characters. The “_” type will allow underscores. Field names prefixed by “?” will not be included in the output. See the documentation of parse for more type options. You can also add your own types using the register_parse_type() decorator.

    The “DATES” field is special as it will only match dates, either as a single date (YYYY, YYYYMM, YYYYMMDD) assigned to “{date_start}” (with “date_end” automatically inferred) or two dates of the same format as “{date_start}-{date_end}”.

    Example: “{source}/{?ignored project name}_{?:_}_{DATES}.nc”. Here, “source” will be the full folder name and it can’t include underscores. The first section of the filename will be excluded from the output; it was given a name (“ignored project name”) to make the pattern readable. The last section of the filename (the dates) will yield a “date_start” / “date_end” couple. All other sections in the middle will be ignored, as they match “{?:_}”.

Returns:

pd.DataFrame – Parsed directory files
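
Examples

A hedged sketch of building a catalog from files on disk; the directory, pattern and metadata below are purely illustrative.

>>> import xscen as xs
>>> df = xs.catutils.parse_directory(
...     directories=["/data/simulations"],  # hypothetical root
...     patterns=["{source}/{variable}_{frequency}_{DATES}.nc"],
...     homogenous_info={"type": "simulation", "processing_level": "raw"},
...     read_from_file=["institution"],  # read this column from the files when missing
... )
>>> pcat.update(df)  # the resulting DataFrame can then feed a ProjectCatalog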

xscen.catutils.parse_from_ds(obj: str | PathLike | Dataset, names: Sequence[str], attrs_map: Mapping[str, str] | None = None, **xrkwargs)[source]

Parse a list of catalog fields from the file/dataset itself.

If passed a path, this opens the file.

Infers the variable names from the data variables. Infers xrfreq, frequency, date_start and date_end from the time coordinate if present. Infers other attributes from the coordinates or the global attributes. Attribute names can be translated using the attrs_map mapping (from file attribute name to name in names).

If the obj is the path to a Zarr dataset and none of “frequency”, “xrfreq”, “date_start” or “date_end” are requested, parse_from_zarr() is used instead of opening the file.

Parameters:
  • obj (str or os.PathLike or xr.Dataset) – Dataset to parse.

  • names (sequence of str) – List of attributes to be parsed from the dataset.

  • attrs_map (dict, optional) – In the case of non-standard names in the file, this can be used to match entries in the files to specific ‘names’ in the requested list.

  • xrkwargs – Arguments to be passed to open_dataset().

xscen.catutils.register_parse_type(name: str, regex: str = '([^\\_\\/\\\\]*)', group_count: int = 1)[source]

Register a new parse type to be available in parse_directory() patterns.

Functions decorated by this will be registered in EXTRA_PARSE_TYPES. The decorated function must take a single string and should return a single string. If it returns a different type, it may interfere with the other steps of parse_directory.

Parameters:
  • name (str) – The type name. To make use of this type, put “{field:name}” in your pattern.

  • regex (str) – A regex string to determine what can be matched by this type. The default matches anything but / and _, same as the default parse type.

  • group_count (int) – The number of regex groups in the previous regex string.
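
Examples

A hedged sketch of registering a custom type; the type name and regex are illustrative.

>>> import xscen as xs
>>> @xs.catutils.register_parse_type("loweralpha", regex="([a-z]*)", group_count=1)
... def _parse_loweralpha(text):
...     # Return the matched text unchanged; any cleanup could happen here.
...     return text
...
>>> # The new type can then be used in a pattern, e.g. "{source:loweralpha}_{DATES}.nc".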

Extraction

Functions to find and extract data from a catalog.

xscen.extract.extract_dataset(catalog: DataCatalog, *, variables_and_freqs: dict | None = None, periods: list[str] | list[list[str]] | None = None, region: dict | None = None, to_level: str = 'extracted', ensure_correct_time: bool = True, xr_open_kwargs: dict | None = None, xr_combine_kwargs: dict | None = None, preprocess: Callable | None = None, resample_methods: dict | None = None, mask: bool | Dataset | DataArray = False) dict[source]

Take one element of the output of search_data_catalogs and return a dataset, performing conversions and resampling as needed.

Nothing is written to disk within this function.

Parameters:
  • catalog (DataCatalog) – Sub-catalog for a single dataset, one value of the output of search_data_catalogs.

  • variables_and_freqs (dict, optional) – Variables and freqs, following a ‘variable: xrfreq-compatible str’ format. A list of strings can also be provided. If None, it will be read from catalog._requested_variables and catalog._requested_variable_freqs (set by variables_and_freqs in search_data_catalogs)

  • periods (list of str or list of lists of str, optional) – Either [start, end] or list of [start, end] for the periods to be evaluated. Will be read from catalog._requested_periods if None. Leave both None to extract everything.

  • region (dict, optional) – Description of the region and the subsetting method (required fields listed in the Notes) used in xscen.spatial.subset.

  • to_level (str) – The processing level to assign to the output. Defaults to ‘extracted’

  • ensure_correct_time (bool) – When True (default), even if the data has the correct frequency, its time coordinate is checked so that it exactly matches the frequency code (xrfreq). For example, daily data given at noon would be transformed to be given at midnight. If the time coordinate is invalid, it raises an error.

  • xr_open_kwargs (dict, optional) – A dictionary of keyword arguments to pass to DataCatalogs.to_dataset_dict, which will be passed to xr.open_dataset.

  • xr_combine_kwargs (dict, optional) – A dictionary of keyword arguments to pass to DataCatalogs.to_dataset_dict, which will be passed to xr.combine_by_coords.

  • preprocess (callable, optional) – If provided, call this function on each dataset prior to aggregation.

  • resample_methods (dict, optional) – Dictionary where the keys are the variables and the values are the resampling method. Options for the resampling method are {‘mean’, ‘min’, ‘max’, ‘sum’, ‘wind_direction’}. If the method is not given for a variable, it is guessed from the variable name and frequency, using the mapping in CVs/resampling_methods.json. If the variable is not found there, “mean” is used by default.

  • mask (xr.Dataset or xr.DataArray or bool) – A mask that is applied to all variables and only keeps data where it is True. Where the mask is False, variable values are replaced by NaNs. The mask should have the same dimensions as the variables extracted. If mask is a dataset, the dataset should have a variable named ‘mask’. If mask is True, it will expect a mask variable at xrfreq fx to have been extracted.

Returns:

dict – Dictionary (keys = xrfreq) with datasets containing all available and computed variables, subsetted to the region, everything resampled to the requested frequency.

Notes

‘region’ fields:

  name: str
    Region name used to overwrite domain in the catalog.

  method: str
    One of [‘gridpoint’, ‘bbox’, ‘shape’, ‘sel’].

  tile_buffer: float, optional
    Multiplier to apply to the model resolution.

  kwargs
    Arguments specific to the method used.

See also

intake_esm.core.esm_datastore.to_dataset_dict, xarray.open_dataset, xarray.combine_by_coords
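
Examples

A minimal sketch, assuming sub_cat is one value of the search_data_catalogs() output and that daily data was requested.

>>> import xscen as xs
>>> ds_dict = xs.extract.extract_dataset(sub_cat, periods=["1991", "2020"])
>>> ds_day = ds_dict["D"]  # keys are the xrfreq codes of the extracted variables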

xscen.extract.get_warming_level(realization: Dataset | DataArray | dict | Series | DataFrame | str | list, wl: float, *, window: int = 20, tas_baseline_period: Sequence[str] | None = None, ignore_member: bool = False, tas_src: str | PathLike | None = None, return_horizon: bool = True) dict | list[str] | str[source]

Use the IPCC Atlas method to return the window of time over which the requested level of global warming is first reached.

Parameters:
  • realization (xr.Dataset, xr.DataArray, dict, str, Series or sequence of those) – Model to be evaluated. Needs the four fields mip_era, source, experiment and member, as a dict or in a Dataset’s attributes. Strings should follow this formatting: {mip_era}_{source}_{experiment}_{member}. Lists of dicts, strings or Datasets are also accepted, in which case the output will be a dict. Regex wildcards (.*) are accepted, but may lead to unexpected results. Datasets should include the catalogue attributes (starting by “cat:”) required to create such a string: ‘cat:mip_era’, ‘cat:experiment’, ‘cat:member’, and either ‘cat:source’ for global models or ‘cat:driving_model’ for regional models. e.g. ‘CMIP5_CanESM2_rcp85_r1i1p1’

  • wl (float) – Warming level. e.g. 2 for a global warming level of +2 degree Celsius above the mean temperature of the tas_baseline_period.

  • window (int) – Size of the rolling window in years over which to compute the warming level.

  • tas_baseline_period (list, optional) – [start, end] of the base period. The warming is calculated with respect to it. The default is [“1850”, “1900”].

  • ignore_member (bool) – Decides whether to ignore the member when searching for the model run in tas_src.

  • tas_src (str, optional) – Path to a netCDF of annual global mean temperature (tas) with an annual “time” dimension and a “simulation” dimension with the following coordinates: “mip_era”, “source”, “experiment” and “member”. If None, it will default to data/IPCC_annual_global_tas.nc which was built from the IPCC atlas data from Iturbide et al., 2020 (https://doi.org/10.5194/essd-12-2959-2020) and extra data for missing CMIP6 models and pilot models of CRCM5 and ClimEx.

  • return_horizon (bool) – If True, the output will be a list following the format [‘start_yr’, ‘end_yr’]. If False, the output will be a string representing the middle of the period.

Returns:

dict, list or str – If realization is not a sequence, the output will follow the format indicated by return_horizon. If realization is a sequence, the output will be of the same type, with values following the format indicated by return_horizon.
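
Examples

A minimal sketch; the realization string follows the {mip_era}_{source}_{experiment}_{member} format described above.

>>> import xscen as xs
>>> horizon = xs.extract.get_warming_level(
...     "CMIP6_CanESM5_ssp585_r1i1p1f1", wl=2, window=20
... )  # a ["start_yr", "end_yr"] pair, since return_horizon defaults to True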

xscen.extract.resample(da: DataArray, target_frequency: str, *, ds: Dataset | None = None, method: str | None = None, missing: str | dict | None = None) DataArray[source]

Aggregate variable to the target frequency.

If the input frequency is coarser than a week, the resampling operation is weighted by the number of days in each sampling period.

Parameters:
  • da (xr.DataArray) – DataArray of the variable to resample, must have a “time” dimension and be of a finer temporal resolution than “target_frequency”.

  • target_frequency (str) – The target frequency/freq str, must be one of the frequencies supported by xarray.

  • ds (xr.Dataset, optional) – The “wind_direction” resampling method needs extra variables, which can be given here.

  • method ({‘mean’, ‘min’, ‘max’, ‘sum’, ‘wind_direction’}, optional) – The resampling method. If None (default), it is guessed from the variable name and frequency, using the mapping in CVs/resampling_methods.json. If the variable is not found there, “mean” is used by default.

  • missing ({‘mask’, ‘drop’} or dict, optional) – If ‘mask’ or ‘drop’, target periods that would have been computed from fewer timesteps than expected are masked or dropped, using a threshold of 5% of missing data. E.g. the first season of a target_frequency of “QS-DEC” will be masked or dropped if data starts in January. If a dict, points to a xclim check missing method which will mask periods according to the number of NaN values. The dict must contain a “method” field corresponding to the xclim method name and may contain any other args to pass. Options are documented in xclim.core.missing.

Returns:

xr.DataArray – Resampled variable
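
Examples

A minimal sketch, assuming ds.tas is a daily temperature series.

>>> import xscen as xs
>>> tas_mon = xs.extract.resample(ds.tas, "MS", method="mean", missing="mask")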

xscen.extract.search_data_catalogs(data_catalogs: str | PathLike | DataCatalog | list[str | PathLike | DataCatalog], variables_and_freqs: dict, *, other_search_criteria: dict | None = None, exclusions: dict | None = None, match_hist_and_fut: bool = False, periods: list[str] | list[list[str]] | None = None, coverage_kwargs: dict | None = None, id_columns: list[str] | None = None, allow_resampling: bool = False, allow_conversion: bool = False, conversion_yaml: str | None = None, restrict_resolution: str | None = None, restrict_members: dict | None = None, restrict_warming_level: dict | bool | None = None) dict[source]

Search through DataCatalogs.

Parameters:
  • data_catalogs (str, os.PathLike, DataCatalog, or a list of those) – DataCatalog (or multiple, in a list) or paths to JSON/CSV data catalogs. They must use the same columns and aggregation options.

  • variables_and_freqs (dict) – Variables and freqs to search for, following a ‘variable: xr-freq-compatible-str’ format. A list of strings can also be provided.

  • other_search_criteria (dict, optional) – Other criteria to search for in the catalogs’ columns, following a ‘column_name: list(subset)’ format. You can also pass ‘require_all_on: list(columns_name)’ in order to only return results that correspond to all other criteria across the listed columns. More details available at https://intake-esm.readthedocs.io/en/stable/how-to/enforce-search-query-criteria-via-require-all-on.html .

  • exclusions (dict, optional) – Same as other_search_criteria, but for eliminating results. Any result that matches any of the exclusions will be removed.

  • match_hist_and_fut (bool) – If True, historical and future simulations will be combined into the same line, and search results lacking one of them will be rejected.

  • periods (list of str or list of lists of str, optional) – Either [start, end] or list of [start, end] for the periods to be evaluated.

  • coverage_kwargs (dict, optional) – Arguments to pass to subset_file_coverage (only used when periods is not None).

  • id_columns (list, optional) – List of columns used to create an id column. If None is given, the original “id” is left.

  • allow_resampling (bool) – If True, variables with a higher time resolution than requested are considered.

  • allow_conversion (bool) – If True and if the requested variable cannot be found, intermediate variables are searched given that there exists a converting function in the “derived variable registry”.

  • conversion_yaml (str, optional) – Path to a YAML file that defines the possible conversions (used alongside ‘allow_conversion’=True). This file should follow the xclim conventions for building a virtual module. If None, the “derived variable registry” will be defined by the file in “xscen/xclim_modules/conversions.yml”

  • restrict_resolution (str, optional) – Used to restrict the results to the finest/coarsest resolution available for a given simulation. [‘finest’, ‘coarsest’].

  • restrict_members (dict, optional) – Used to restrict the results to a given number of members for a given simulation. Currently only supports {“ordered”: int} format.

  • restrict_warming_level (bool or dict, optional) – Used to restrict the results only to datasets that exist in the csv used to compute warming levels in subset_warming_level. If True, this will only keep the datasets that have a mip_era, source, experiment and member combination that exist in the csv. This does not guarantee that a given warming level will be reached, only that the datasets have corresponding columns in the csv. More options can be added by passing a dictionary instead of a boolean. If {‘ignore_member’: True}, it will disregard the member when trying to match the dataset to a column. If {tas_src: Path_to_netcdf}, it will use an alternative netcdf instead of the default one provided by xscen. If ‘wl’ is a provided key, then xs.get_warming_level will be called and only datasets that reach the given warming level will be kept. This can be combined with other arguments of the function, for example {‘wl’: 1.5, ‘window’: 30}.

Notes

  • The “other_search_criteria” and “exclusions” arguments accept wildcard (*) and regular expressions.

  • Frequency can be wildcarded with ‘NA’ in the variables_and_freqs dict.

  • Variable names cannot be wildcarded, they must be CMIP6-standard.

Returns:

dict – Keys are the id and values are the DataCatalogs for each entry. A single DataCatalog can be retrieved with concat_data_catalogs(*out.values()). Each DataCatalog has a subset of the derived variable registry that corresponds to the needs of this specific group. Usually, each entry can be written to file in a single Dataset when using extract_dataset with the same arguments.

See also

intake_esm.core.esm_datastore.search
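
Examples

A hedged sketch; the catalog path and search criteria are hypothetical.

>>> import xscen as xs
>>> cat_dict = xs.extract.search_data_catalogs(
...     data_catalogs=["/path/to/catalog.json"],  # hypothetical
...     variables_and_freqs={"tas": "D", "pr": "D"},
...     other_search_criteria={"experiment": ["ssp245"], "source": ["CanESM5"]},
...     match_hist_and_fut=True,
...     allow_conversion=True,
... )
>>> for ds_id, sub_cat in cat_dict.items():
...     ds = xs.extract.extract_dataset(sub_cat)["D"]
...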

xscen.extract.subset_warming_level(ds: Dataset, wl: float | Sequence[float], to_level: str = 'warminglevel-{wl}vs{period0}-{period1}', wl_dim: str | bool = '+{wl}Cvs{period0}-{period1}', **kwargs) Dataset | None[source]

Subsets the input dataset with only the window of time over which the requested level of global warming is first reached, using the IPCC Atlas method. A warming level is considered reached only if the full window years are available in the dataset.

Parameters:
  • ds (xr.Dataset) – Input dataset. The dataset should include attributes to help recognize it and find its warming levels - ‘cat:mip_era’, ‘cat:experiment’, ‘cat:member’, and either ‘cat:source’ for global models or ‘cat:driving_institution’ (optional) + ‘cat:driving_model’ for regional models. Or, it should include a realization dimension constructed as “{mip_era}_{source or driving_model}_{experiment}_{member}” for vectorized subsetting. Vectorized subsetting is currently only implemented for annual data.

  • wl (float or sequence of floats) – Warming level. e.g. 2 for a global warming level of +2 degree Celsius above the mean temperature of the tas_baseline_period. Multiple levels can be passed, in which case using “{wl}” in to_level and wl_dim is not recommended. Multiple levels are currently only implemented for annual data.

  • to_level (str) – The processing level to assign to the output. Use “{wl}”, “{period0}” and “{period1}” in the string to dynamically include wl, ‘tas_baseline_period[0]’ and ‘tas_baseline_period[1]’.

  • wl_dim (str or boolean, optional) – The value to use to fill the new warminglevel dimension. Use “{wl}”, “{period0}” and “{period1}” in the string to dynamically include wl, ‘tas_baseline_period[0]’ and ‘tas_baseline_period[1]’. If None, no new dimensions will be added, invalid if wl is a sequence. If True, the dimension will include wl as numbers and units of “degC”.

  • **kwargs – Instructions on how to search for warming levels, passed to get_warming_level().

Returns:

xr.Dataset or None – Warming level dataset, or None if ds can’t be subsetted for the requested warming level. The dataset will have a new dimension warminglevel with wl_dim as coordinates. If wl was a list or if ds had a “realization” dim, the “time” axis is replaced by a fake time starting in 1000-01-01 and with a length of window years. Start and end years of the subsets are bound in the new coordinate “warminglevel_bounds”.
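
Examples

A minimal sketch, assuming ds carries the required “cat:” attributes; window is forwarded to get_warming_level().

>>> import xscen as xs
>>> ds_wl = xs.extract.subset_warming_level(ds, wl=2, window=30)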

Regridding

Functions to regrid datasets.

xscen.regrid.create_mask(ds: Dataset | DataArray, mask_args: dict) DataArray[source]

Create a 0-1 mask based on incoming arguments.

Parameters:
  • ds (xr.Dataset or xr.DataArray) – Dataset or DataArray to be evaluated

  • mask_args (dict) – Instructions to build the mask (required fields listed in the Notes).

Note

‘mask’ fields:

  variable: str, optional
    Variable on which to base the mask, if ds is not a DataArray.

  where_operator: str, optional
    Conditional operator such as ‘>’.

  where_threshold: str, optional
    Value threshold to be used in conjunction with where_operator.

  mask_nans: bool
    Whether to apply a mask on NaNs.

Returns:

xr.DataArray – Mask array.

xscen.regrid.regrid_dataset(ds: Dataset, ds_grid: Dataset, weights_location: str | PathLike, *, regridder_kwargs: dict | None = None, intermediate_grids: dict | None = None, to_level: str = 'regridded') Dataset[source]

Regrid a dataset according to weights and a reference grid.

The regridding is performed with xESMF; the weights file is written to (or reused from) weights_location.

Parameters:
  • ds (xarray.Dataset) – Dataset to regrid. The Dataset needs to have lat/lon coordinates. Supports a ‘mask’ variable compatible with ESMF standards.

  • weights_location (Union[str, os.PathLike]) – Path to the folder where weight file is saved.

  • ds_grid (xr.Dataset) – Destination grid. The Dataset needs to have lat/lon coordinates. Supports a ‘mask’ variable compatible with ESMF standards.

  • regridder_kwargs (dict, optional) – Arguments to send xe.Regridder(). If it contains skipna or output_chunks, those are passed to the regridder call directly.

  • intermediate_grids (dict, optional) – This argument is used to do a regridding in many steps, regridding to regular grids before regridding to the final ds_grid. This is useful when there is a large jump in resolution between ds and ds_grid. The format is a nested dictionary shown in Notes. If None, no intermediary grid is used, there is only a regrid from ds to ds_grid.

  • to_level (str) – The processing level to assign to the output. Defaults to ‘regridded’

Returns:

xarray.Dataset – Regridded dataset

Notes

intermediate_grids = {
    ‘name_of_inter_grid_1’: {
        ‘cf_grid_2d’: {arguments for util.cf_grid_2d},
        ‘regridder_kwargs’: {arguments for xe.Regridder},
    },
    ‘name_of_inter_grid_2’: dictionary_as_above,
}

See also

xesmf.regridder, xesmf.util.cf_grid_2d
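
Examples

A hedged sketch; ds_target and the weights folder are hypothetical, and “bilinear” is a standard xESMF regridding method.

>>> import xscen as xs
>>> out = xs.regrid.regrid_dataset(
...     ds,
...     ds_grid=ds_target,  # any Dataset with the destination lat/lon
...     weights_location="/path/to/weights",  # hypothetical folder
...     regridder_kwargs={"method": "bilinear"},
... )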

Bias Adjustment

Functions to train and adjust a dataset using a bias-adjustment algorithm.

xscen.biasadjust.adjust(dtrain: Dataset, dsim: Dataset, periods: list[str] | list[list[str]], *, xclim_adjust_args: dict | None = None, to_level: str = 'biasadjusted', bias_adjust_institution: str | None = None, bias_adjust_project: str | None = None, align_on: str | None = 'year') Dataset[source]

Adjust a simulation.

Parameters:
  • dtrain (xr.Dataset) – A trained algorithm’s dataset, as returned by train.

  • dsim (xr.Dataset) – Simulated timeseries, projected period.

  • periods (list of str or list of lists of str) – Either [start, end] or list of [start, end] of the simulation periods to be adjusted (one at a time).

  • xclim_adjust_args (dict, optional) – Dict of arguments to pass to the .adjust of the adjustment object.

  • to_level (str) – The processing level to assign to the output. Defaults to ‘biasadjusted’

  • bias_adjust_institution (str, optional) – The institution to assign to the output.

  • bias_adjust_project (str, optional) – The project to assign to the output.

  • align_on (str, optional) – align_on argument for the function xclim.core.calendar.convert_calendar.

Returns:

xr.Dataset – dscen, the bias-adjusted timeseries.

xscen.biasadjust.train(dref: Dataset, dhist: Dataset, var: str | list[str], period: list[str], *, method: str = 'DetrendedQuantileMapping', group: Grouper | str | dict | None = None, xclim_train_args: dict | None = None, maximal_calendar: str = 'noleap', adapt_freq: dict | None = None, jitter_under: dict | None = None, jitter_over: dict | None = None, align_on: str | None = 'year') Dataset[source]

Train a bias-adjustment.

Parameters:
  • dref (xr.Dataset) – The target timeseries, on the reference period.

  • dhist (xr.Dataset) – The timeseries to adjust, on the reference period.

  • var (str or list of str) – Variable on which to do the adjustment. Currently only supports one variable.

  • period (list of str) – [start, end] of the reference period

  • method (str) – Name of the sdba.TrainAdjust method of xclim.

  • group (str or sdba.Grouper or dict, optional) – Grouping information. If a string, it is interpreted as a grouper on the time dimension. If a dict, it is passed to sdba.Grouper.from_kwargs. Defaults to {“group”: “time.dayofyear”, “window”: 31}.

  • xclim_train_args (dict) – Dict of arguments to pass to the .train of the adjustment object.

  • maximal_calendar (str) – Maximal calendar dhist can be. The hierarchy: 360_day < noleap < standard < all_leap. If dhist’s calendar is higher than maximal calendar, it will be converted to the maximal calendar.

  • adapt_freq (dict, optional) – If given, a dictionary of args to pass to the frequency adaptation function.

  • jitter_under (dict, optional) – If given, a dictionary of args to pass to jitter_under_thresh.

  • jitter_over (dict, optional) – If given, a dictionary of args to pass to jitter_over_thresh.

  • align_on (str, optional) – align_on argument for the function xclim.core.calendar.convert_calendar.

Returns:

xr.Dataset – Trained algorithm’s data.
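
Examples

A hedged sketch of a train/adjust workflow, assuming dref, dhist and dsim are datasets holding a tas variable; the xclim arguments shown are usual DetrendedQuantileMapping options.

>>> import xscen as xs
>>> dtrain = xs.biasadjust.train(
...     dref, dhist, var="tas", period=["1991", "2020"],
...     method="DetrendedQuantileMapping",
...     xclim_train_args={"kind": "+", "nquantiles": 50},
... )
>>> scen = xs.biasadjust.adjust(
...     dtrain, dsim, periods=["1950", "2100"],
...     xclim_adjust_args={"interp": "nearest", "extrapolation": "constant"},
... )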

Indicators

Functions to compute xclim indicators.

xscen.indicators.compute_indicators(ds: Dataset, indicators: str | PathLike | Sequence[Indicator] | Sequence[tuple[str, Indicator]] | ModuleType, *, periods: list[str] | list[list[str]] | None = None, restrict_years: bool = True, to_level: str | None = 'indicators') dict[source]

Calculate variables and indicators based on a YAML call to xclim.

The function cuts the output to be the same years as the inputs. Hence, if an indicator creates a timestep outside the original year range (e.g. the first DJF for QS-DEC), it will not appear in the output.

Parameters:
  • ds (xr.Dataset) – Dataset to use for the indicators.

  • indicators (Union[str, os.PathLike, Sequence[Indicator], Sequence[tuple[str, Indicator]], ModuleType]) – Path to a YAML file that instructs on how to calculate missing variables. Can also be only the “stem”, if translations and custom indices are implemented. Can be the indicator module directly, or a sequence of indicators or a sequence of tuples (indicator name, indicator) as returned by iter_indicators().

  • periods (list of str or list of lists of str, optional) – Either [start, end] or list of [start, end] of continuous periods over which to compute the indicators. This is needed when the time axis of ds contains some jumps in time. If None, the dataset will be considered continuous.

  • restrict_years (bool) – If True, cut the time axis to be within the same years as the input. This is mostly useful for frequencies that do not start in January, such as QS-DEC. In that instance, xclim would start on previous_year-12-01 (DJF), with a NaN. restrict_years will cut that first timestep. This should have no effect on YS and MS indicators.

  • to_level (str, optional) – The processing level to assign to the output. If None, the processing level of the inputs is preserved.

Returns:

dict – Dictionary (keys = timedeltas) with indicators separated by temporal resolution.
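
Examples

A minimal sketch; the YAML path is hypothetical.

>>> import xscen as xs
>>> ind_dict = xs.indicators.compute_indicators(ds, indicators="indicators.yml")  # hypothetical YAML
>>> list(ind_dict)  # one entry per output frequency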

xscen.indicators.load_xclim_module(filename: str | PathLike, reload: bool = False) ModuleType[source]

Return the xclim module described by the yaml file (or group of yaml, json and py files).

Parameters:
  • filename (str or os.PathLike) – The filepath to the yaml file of the module or to the stem of yaml, jsons and py files.

  • reload (bool) – If False (default) and the module already exists in xclim.indicators, it is not rebuilt.

Returns:

ModuleType – The xclim module.

xscen.indicators.registry_from_module(module: ModuleType, registry: DerivedVariableRegistry | None = None, variable_column: str = 'variable') DerivedVariableRegistry[source]

Convert an xclim virtual indicators module to an intake_esm Derived Variable Registry.

Parameters:
  • module (ModuleType) – A module of xclim.

  • registry (DerivedVariableRegistry, optional) – If given, this registry is extended, instead of creating a new one.

  • variable_column (str) – The name of the variable column (the name used in the query).

Returns:

DerivedVariableRegistry – A variable registry where each indicator and each of its outputs has been registered. If an indicator returns multiple values, each of them is mapped individually, as the DerivedVariableRegistry only supports single-output functions. Each indicator was wrapped into a new function that only accepts a dataset and returns it with the extra variable appended. This means all other parameters are given their defaults.

Ensembles

Ensemble statistics and weights.

xscen.ensembles.build_partition_data(datasets: dict | list[Dataset], partition_dim: list[str] = ['source', 'experiment', 'bias_adjust_project'], subset_kw: dict = None, regrid_kw: dict = None, indicators_kw: dict = None, rename_dict: dict = None)[source]

Get the input for the xclim partition functions.

From a list or dictionary of datasets, create a single dataset with partition_dim dimensions (and time) to pass to one of the xclim partition functions (https://xclim.readthedocs.io/en/stable/api.html#uncertainty-partitioning). If the inputs have different grids, they have to be subsetted and regridded to a common grid/point. Indicators can also be computed before combining the datasets.

Parameters:
  • datasets (dict or list of Dataset) – List or dictionary of Dataset objects that will be included in the ensemble. The datasets should include the necessary (“cat:”) attributes to understand their metadata. Tip: With a project catalog, you can do: datasets = pcat.search(**search_dict).to_dataset_dict().

  • partition_dim (list[str]) – Components of the partition. They will become the dimension of the output. The default is [‘source’, ‘experiment’, ‘bias_adjust_project’]. For source, the dimension will actually be institution_source_member.

  • subset_kw (dict, optional) – Arguments to pass to xs.spatial.subset().

  • regrid_kw (dict, optional) – Arguments to pass to xs.regrid_dataset().

  • indicators_kw (dict, optional) – Arguments to pass to xs.indicators.compute_indicators(). All indicators have to be for the same frequency, in order to be put on a single time axis.

  • rename_dict (dict, optional) – Dictionary to rename the dimensions from xscen names to xclim names. The default is {‘source’: ‘model’, ‘bias_adjust_project’: ‘downscaling’, ‘experiment’: ‘scenario’}.

Returns:

xr.Dataset – The input data for the partition functions.

See also

xclim.ensembles

xscen.ensembles.ensemble_stats(datasets: dict | list[str | PathLike] | list[Dataset] | list[DataArray] | Dataset, statistics: dict, *, create_kwargs: dict | None = None, weights: DataArray | None = None, common_attrs_only: bool = True, to_level: str = 'ensemble') Dataset[source]

Create an ensemble and compute statistics on it.

Parameters:
  • datasets (dict or list of [str, os.PathLike, Dataset or DataArray], or Dataset) – List of file paths or xarray Dataset/DataArray objects to include in the ensemble. A dictionary can be passed instead of a list, in which case the keys are used as coordinates along the new realization axis. Tip: With a project catalog, you can do: datasets = pcat.search(**search_dict).to_dataset_dict(). If a single Dataset is passed, it is assumed to already be an ensemble and will be used as is. The ‘realization’ dimension is required.

  • statistics (dict) – xclim.ensembles statistics to be called. Dictionary in the format {function: arguments}. If a function requires ‘weights’, you can leave it out of this dictionary and it will be applied automatically if the ‘weights’ argument is provided. See the Notes section for more details on robustness statistics, which are more complex in their usage.

  • create_kwargs (dict, optional) – Dictionary of arguments for xclim.ensembles.create_ensemble.

  • weights (xr.DataArray, optional) – Weights to apply along the ‘realization’ dimension. This array cannot contain missing values.

  • common_attrs_only (bool) – If True, keeps only the global attributes that are the same for all datasets and generates a new id. If False, keeps the global attrs of the first dataset (same behaviour as xclim.ensembles.create_ensemble).

  • to_level (str) – The processing level to assign to the output.

Returns:

xr.Dataset – Dataset with ensemble statistics

Notes

  • The positive fraction in ‘change_significance’ and ‘robustness_fractions’ is calculated by xclim using ‘v > 0’, which is not appropriate for relative deltas. This function will attempt to detect relative deltas by using the ‘delta_kind’ attribute (‘rel.’, ‘relative’, ‘*’, or ‘/’) and will apply ‘v - 1’ before calling the function.

  • The ‘robustness_categories’ statistic requires the outputs of ‘robustness_fractions’. Thus, there are two ways to build the ‘statistics’ dictionary:

    1. Having ‘robustness_fractions’ and ‘robustness_categories’ as separate entries in the dictionary. In this case, all outputs will be returned.

    2. Having ‘robustness_fractions’ as a nested dictionary under ‘robustness_categories’. In this case, only the robustness categories will be returned.

  • A ‘ref’ DataArray can be passed to ‘change_significance’ and ‘robustness_fractions’, which will be used by xclim to compute deltas and perform some significance tests. However, this supposes that both ‘datasets’ and ‘ref’ are still timeseries (e.g. annual means), not climatologies where the ‘time’ dimension represents the period over which the climatology was computed. Thus, using ‘ref’ is only accepted if ‘robustness_fractions’ (or ‘robustness_categories’) is the only statistic being computed.

  • If you want to compute a robustness statistic on a climatology, you should first compute the climatologies and deltas yourself, then leave ‘ref’ as None and pass the deltas as the ‘datasets’ argument. This will be compatible with other statistics.
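
Examples

A minimal sketch, assuming datasets is a dictionary of timeseries datasets keyed by realization; the statistics named are standard xclim.ensembles functions.

>>> import xscen as xs
>>> ens = xs.ensembles.ensemble_stats(
...     datasets,
...     statistics={
...         "ensemble_mean_std_max_min": {},
...         "ensemble_percentiles": {"values": [10, 50, 90]},
...     },
... )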

xscen.ensembles.generate_weights(datasets: dict | list, *, independence_level: str = 'model', balance_experiments: bool = False, attribute_weights: dict | None = None, skipna: bool = True, v_for_skipna: str | None = None, standardize: bool = False, experiment_weights: bool = False) DataArray[source]

Use realization attributes to automatically generate weights along the ‘realization’ dimension.

Parameters:
  • datasets (dict) – List of Dataset objects that will be included in the ensemble. The datasets should include the necessary attributes to understand their metadata - See ‘Notes’ below. A dictionary can be passed instead of a list, in which case the keys are used for the ‘realization’ coordinate. Tip: With a project catalog, you can do: datasets = pcat.search(**search_dict).to_dataset_dict().

  • independence_level (str) – ‘model’: Weights using the method ‘1 model - 1 Vote’, where every unique combination of ‘source’ and ‘driving_model’ is considered a model. ‘GCM’: Weights using the method ‘1 GCM - 1 Vote’. ‘institution’: Weights using the method ‘1 institution - 1 Vote’.

  • balance_experiments (bool) – If True, each experiment will be given a total weight of 1 (prior to subsequent weighting made through attribute_weights). This option requires the ‘cat:experiment’ attribute to be present in all datasets.

  • attribute_weights (dict, optional) – Nested dictionaries of weights to apply to each dataset. These weights are applied after the independence weighting. The first level of keys are the attributes for which weights are being given. The second level of keys are unique entries for the attribute, with the value being either an individual weight or an xr.DataArray. If a DataArray is used, its dimensions must be the same non-stationary coordinate as the datasets (ex: time, horizon) and the attribute being weighted (ex: experiment). An ‘others’ key can be used to give the same weight to all entries not specifically named in the dictionary. Example #1: {‘source’: {‘MPI-ESM-1-2-HAM’: 0.25, ‘MPI-ESM1-2-HR’: 0.5}}, Example #2: {‘experiment’: {‘ssp585’: xr.DataArray, ‘ssp126’: xr.DataArray}, ‘institution’: {‘CCCma’: 0.5, ‘others’: 1}}

  • skipna (bool) – If True, weights will be computed from attributes only. If False, weights will be computed from the number of non-missing values. skipna=False requires either a ‘time’ or ‘horizon’ dimension in the datasets.

  • v_for_skipna (str, optional) – Variable to use for skipna=False. If None, the first variable in the first dataset is used.

  • standardize (bool) – If True, the weights are standardized to sum to 1 (per timestep/horizon, if skipna=False).

  • experiment_weights (bool) – Deprecated. Use balance_experiments instead.

Notes

The following attributes are required for the function to work:
  • ‘cat:source’ in all datasets

  • ‘cat:driving_model’ in regional climate models

  • ‘cat:institution’ in all datasets if independence_level=’institution’

  • ‘cat:experiment’ in all datasets if balance_experiments=True

Even when not required, the ‘cat:member’ and ‘cat:experiment’ attributes are strongly recommended to ensure the weights are computed correctly.

Returns:

xr.DataArray – Weights along the ‘realization’ dimension, or 2D weights along the ‘realization’ and ‘time/horizon’ dimensions if skipna=False.
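
Examples

A minimal sketch, assuming the datasets carry the required “cat:” attributes; the resulting weights can then be passed to ensemble_stats().

>>> import xscen as xs
>>> weights = xs.ensembles.generate_weights(
...     datasets, independence_level="GCM", balance_experiments=True
... )
>>> ens = xs.ensembles.ensemble_stats(
...     datasets, statistics={"ensemble_percentiles": {}}, weights=weights
... )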

Aggregation

Functions to aggregate data over time and space.

xscen.aggregate.climatological_mean(ds: Dataset, *, window: int | None = None, min_periods: int | None = None, interval: int = 1, periods: list[str] | list[list[str]] | None = None, to_level: str | None = 'climatology') Dataset[source]

Compute the mean over ‘year’ for given time periods, respecting the temporal resolution of ds.

Parameters:
  • ds (xr.Dataset) – Dataset to use for the computation.

  • window (int, optional) – Number of years to use for the time periods. If left at None and periods is given, window will be the size of the first period. If left at None and periods is not given, the window will be the size of the input dataset.

  • min_periods (int, optional) – For the rolling operation, minimum number of years required for a value to be computed. If left at None and the xrfreq is either QS or AS and doesn’t start in January, min_periods will be one less than window. If left at None, it will be deemed the same as ‘window’.

  • interval (int) – Interval (in years) at which to provide an output.

  • periods (list of str or list of lists of str, optional) – Either [start, end] or list of [start, end] of continuous periods to be considered. This is needed when the time axis of ds contains some jumps in time. If None, the dataset will be considered continuous.

  • to_level (str, optional) – The processing level to assign to the output. If None, the processing level of the inputs is preserved.

Returns:

xr.Dataset – Returns a Dataset of the climatological mean, by calling climatological_op with option op==’mean’.

xscen.aggregate.climatological_op(ds: Dataset, *, op: str | dict = 'mean', window: int | None = None, min_periods: int | float | None = None, stride: int = 1, periods: list[str] | list[list[str]] | None = None, rename_variables: bool = True, to_level: str = 'climatology', horizons_as_dim: bool = False) Dataset[source]

Perform an operation ‘op’ over time, for given time periods, respecting the temporal resolution of ds.

Parameters:
  • ds (xr.Dataset) – Dataset to use for the computation.

  • op (str or dict) – Operation to perform over time. The operation can be any method name of xarray.core.rolling.DatasetRolling, ‘linregress’, or a dictionary. If ‘op’ is a dictionary, the key is the operation name and the value is a dict of kwargs accepted by the operation. While other operations are technically possible, the following are recommended and tested: [‘max’, ‘mean’, ‘median’, ‘min’, ‘std’, ‘sum’, ‘var’, ‘linregress’]. Operations beyond methods of xarray.core.rolling.DatasetRolling include:

    • ‘linregress’ : Computes the linear regression over time, using scipy.stats.linregress and employing years as regressors. The output will have a new dimension ‘linreg_param’ with coordinates: [‘slope’, ‘intercept’, ‘rvalue’, ‘pvalue’, ‘stderr’, ‘intercept_stderr’].

    Only one operation per call is supported, so len(op)==1 if a dict.

  • window (int, optional) – Number of years to use for the rolling operation. If left at None and periods is given, window will be the size of the first period. Hence, if periods are of different lengths, the shortest period should be passed first. If left at None and periods is not given, the window will be the size of the input dataset.

  • min_periods (int or float, optional) – For the rolling operation, minimum number of years required for a value to be computed. If left at None and the xrfreq is either QS or AS and doesn’t start in January, min_periods will be one less than window. Otherwise, if left at None, it will be deemed the same as ‘window’. If passed as a float value between 0 and 1, this will be interpreted as the floor of the fraction of the window size.

  • stride (int) – Stride (in years) at which to provide an output from the rolling window operation.

  • periods (list of str or list of lists of str, optional) – Either [start, end] or list of [start, end] of continuous periods to be considered. This is needed when the time axis of ds contains some jumps in time. If None, the dataset will be considered continuous.

  • rename_variables (bool) – If True, ‘_clim_{op}’ will be added to variable names.

  • to_level (str, optional) – The processing level to assign to the output. If None, the processing level of the inputs is preserved.

  • horizons_as_dim (bool) – If True, the output will have ‘horizon’ and the frequency as ‘month’, ‘season’ or ‘year’ as dimensions and coordinates. The ‘time’ coordinate will be unstacked to horizon and frequency dimensions. Horizons originate from periods and/or windows and their stride in the rolling operation.

Returns:

xr.Dataset – Dataset with the results from the climatological operation.
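
Example

A minimal sketch of a 30-year rolling mean evaluated every 10 years, followed by a linear trend over the same windows (xscen imported as xs; ds is assumed to carry a regular ‘time’ axis):

>>> clim = xs.aggregate.climatological_op(ds, op="mean", window=30, stride=10)
>>> trend = xs.aggregate.climatological_op(ds, op="linregress", window=30, stride=10)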

xscen.aggregate.compute_deltas(ds: Dataset, reference_horizon: str | Dataset, *, kind: str | dict = '+', rename_variables: bool = True, to_level: str | None = 'deltas') Dataset[source]

Compute deltas in comparison to a reference time period, respecting the temporal resolution of ds.

Parameters:
  • ds (xr.Dataset) – Dataset to use for the computation.

  • reference_horizon (str or xr.Dataset) – Either a YYYY-YYYY string corresponding to the ‘horizon’ coordinate of the reference period, or a xr.Dataset containing the climatological mean.

  • kind (str or dict) – [‘+’, ‘/’, ‘%’] Whether to provide absolute, relative, or percentage deltas. Can also be a dictionary separated per variable name.

  • rename_variables (bool) – If True, ‘_delta_YYYY-YYYY’ will be added to variable names.

  • to_level (str, optional) – The processing level to assign to the output. If None, the processing level of the inputs is preserved.

Returns:

xr.Dataset – Returns a Dataset with the requested deltas.
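
Example

A minimal sketch computing percentage deltas for precipitation and absolute deltas for temperature against the 1981-2010 horizon (variable names are illustrative):

>>> deltas = xs.aggregate.compute_deltas(
...     ds, reference_horizon="1981-2010", kind={"pr": "%", "tas": "+"}
... )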

xscen.aggregate.produce_horizon(ds: Dataset, indicators: str | PathLike | Sequence[Indicator] | Sequence[tuple[str, Indicator]] | ModuleType, *, periods: list[str] | list[list[str]] | None = None, warminglevels: dict | None = None, to_level: str | None = 'horizons', period: list | None = None) Dataset[source]

Compute indicators, then the climatological mean, and finally unstack dates in order to have a single dataset with all indicators of different frequencies.

Once this is done, the function drops ‘time’ in favor of ‘horizon’. This function computes the indicators and does an interannual mean. It stacks the season and month in different dimensions and adds a dimension horizon for the period or the warming level, if given.

Parameters:
  • ds (xr.Dataset) – Input dataset with a time dimension.

  • indicators (Union[str, os.PathLike, Sequence[Indicator], Sequence[Tuple[str, Indicator]], ModuleType]) – Indicators to compute. It will be passed to the indicators argument of xs.compute_indicators.

  • periods (list of str or list of lists of str, optional) – Either [start, end] or list of [start_year, end_year] for the period(s) to be evaluated. If both periods and warminglevels are None, the full time series will be used.

  • warminglevels (dict, optional) – Dictionary of arguments to pass to py:func:xscen.subset_warming_level. If ‘wl’ is a list, the function will be called for each value and produce multiple horizons. If both periods and warminglevels are None, the full time series will be used.

  • to_level (str, optional) – The processing level to assign to the output. If there is only one horizon, you can use “{wl}”, “{period0}” and “{period1}” in the string to dynamically include that information in the processing level.

Returns:

xr.Dataset – Horizon dataset.
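
Example

A minimal sketch computing indicators from a YAML module over two periods (the file name is illustrative; warminglevels could be passed in the same way):

>>> hz = xs.aggregate.produce_horizon(
...     ds, indicators="indicators.yml", periods=[["1981", "2010"], ["2041", "2070"]]
... )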

xscen.aggregate.spatial_mean(ds: Dataset, method: str, *, spatial_subset: bool | None = None, call_clisops: bool | None = False, region: str | dict | None = None, kwargs: dict | None = None, simplify_tolerance: float | None = None, to_domain: str | None = None, to_level: str | None = None) Dataset[source]

Compute the spatial mean using a variety of available methods.

Parameters:
  • ds (xr.Dataset) – Dataset to use for the computation.

  • method (str) – ‘cos-lat’ will weight the area covered by each pixel using an approximation based on latitude. ‘interp_centroid’ will find the region’s centroid (if coordinates are not fed through kwargs), then perform a .interp() over the spatial dimensions of the Dataset. The coordinate can also be directly fed to .interp() through the ‘kwargs’ argument below. ‘xesmf’ will make use of xESMF’s SpatialAverager. This will typically be more precise, especially for irregular regions, but can be much slower than other methods.

  • spatial_subset (bool, optional) – If True, xscen.spatial.subset will be called prior to the other operations. This requires the ‘region’ argument. If None, this will automatically become True if ‘region’ is provided and the subsetting method is either ‘cos-lat’ or ‘mean’.

  • region (dict or str, optional) – Description of the region and the subsetting method (required fields listed in the Notes). If method==’interp_centroid’, this is used to find the region’s centroid. If method==’xesmf’, the bounding box or shapefile is given to SpatialAverager. Can also be “global”, for global averages. This is simply a shortcut for {‘name’: ‘global’, ‘method’: ‘bbox’, ‘lon_bnds’: [-180, 180], ‘lat_bnds’: [-90, 90]}.

  • kwargs (dict, optional) – Arguments to send to either mean(), interp() or SpatialAverager(). For SpatialAverager, one can give skipna or output_chunks here, to be passed to the averager call itself.

  • simplify_tolerance (float, optional) – Precision (in degree) used to simplify a shapefile before sending it to SpatialAverager(). The simpler the polygons, the faster the averaging, but it will lose some precision.

  • to_domain (str, optional) – The domain to assign to the output. If None, the domain of the inputs is preserved.

  • to_level (str, optional) – The processing level to assign to the output. If None, the processing level of the inputs is preserved.

Returns:

xr.Dataset – Returns a Dataset with the spatial dimensions averaged.

Notes

‘region’ required fields:
name: str

Region name used to overwrite domain in the catalog.

method: str

[‘gridpoint’, ‘bbox’, ‘shape’, ‘sel’]

tile_buffer: float, optional

Multiplier to apply to the model resolution. Only used if spatial_subset==True.

kwargs

Arguments specific to the method used.

See also

xarray.Dataset.mean, xarray.Dataset.interp, xesmf.SpatialAverager
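
Example

A minimal sketch of an xESMF average over a bounding-box region built from the fields listed above (bounds and names are illustrative):

>>> region = {
...     "name": "domain_qc",
...     "method": "bbox",
...     "lon_bnds": [-80, -60],
...     "lat_bnds": [45, 60],
... }
>>> ds_avg = xs.aggregate.spatial_mean(ds, method="xesmf", region=region)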

Reduction

Functions to reduce an ensemble of simulations.

xscen.reduce.build_reduction_data(datasets: dict | list[Dataset], *, xrfreqs: list[str] | None = None, horizons: list[str] | None = None) DataArray[source]

Construct the input required for ensemble reduction.

This will combine all variables into a single DataArray and stack all dimensions except “realization”.

Parameters:
  • datasets (Union[dict, list]) – Dictionary of datasets in the format {“id”: dataset}, or list of datasets. This can be generated by calling .to_dataset_dict() on a catalog.

  • xrfreqs (list of str, optional) – List of unique frequencies across the datasets. If None, the script will attempt to guess the frequencies from the datasets’ metadata or with xr.infer_freq().

  • horizons (list of str, optional) – Subset of horizons on which to create the data.

Returns:

xr.DataArray – 2D DataArray of dimensions “realization” and “criteria”, to be used as input for ensemble reduction.

xscen.reduce.reduce_ensemble(data: DataArray, method: str, kwargs: dict)[source]

Reduce an ensemble of simulations using clustering algorithms from xclim.ensembles.

Parameters:
  • data (xr.DataArray) – Selection criteria data : 2-D xr.DataArray with dimensions ‘realization’ and ‘criteria’. These are the values used for clustering. Realizations represent the individual original ensemble members and criteria the variables/indicators used in the grouping algorithm. This data can be generated using build_reduction_data().

  • method (str) – [‘kkz’, ‘kmeans’]. Clustering method.

  • kwargs (dict) – Arguments to send to either xclim.ensembles.kkz_reduce_ensemble or xclim.ensembles.kmeans_reduce_ensemble

Returns:

  • selected (xr.DataArray) – DataArray of dimension ‘realization’ with the selected simulations.

  • clusters (dict) – If using kmeans clustering, realizations grouped by cluster.

  • fig_data (dict) – If using kmeans clustering, data necessary to call xclim.ensembles.plot_rsqprofile()
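
Example

A minimal sketch chaining the two functions (datasets is assumed to be a dictionary of horizon datasets; the kwargs shown are forwarded to xclim.ensembles.kmeans_reduce_ensemble and are illustrative):

>>> data = xs.reduce.build_reduction_data(datasets, horizons=["2041-2070"])
>>> selected, clusters, fig_data = xs.reduce.reduce_ensemble(
...     data, method="kmeans", kwargs={"method": {"n_clusters": 5}}
... )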

Diagnostics and Quality Checks

Functions to perform diagnostics on datasets.

xscen.diagnostics.health_checks(ds: Dataset | DataArray, *, structure: dict | None = None, calendar: str | None = None, start_date: str | None = None, end_date: str | None = None, variables_and_units: dict | None = None, cfchecks: dict | None = None, freq: str | None = None, missing: dict | str | list | None = None, flags: dict | None = None, flags_kwargs: dict | None = None, return_flags: bool = False, raise_on: list | None = None) None | Dataset[source]

Perform a series of health checks on the dataset. Be aware that missing data checks and flag checks can be slow.

Parameters:
  • ds (xr.Dataset or xr.DataArray) – Dataset to check.

  • structure (dict, optional) – Dictionary with keys “dims” and “coords” containing the expected dimensions and coordinates. This check will fail if extra dimensions or coordinates are found.

  • calendar (str, optional) – Expected calendar. Synonyms should be detected correctly (e.g. “standard” and “gregorian”).

  • start_date (str, optional) – To check if the dataset starts at least at this date.

  • end_date (str, optional) – To check if the dataset ends at least at this date.

  • variables_and_units (dict, optional) – Dictionary containing the expected variables and units.

  • cfchecks (dict, optional) – Dictionary where the key is the variable to check and the values are the cfchecks. The cfchecks themselves must be a dictionary with the keys being the cfcheck names and the values being the arguments to pass to the cfcheck. See xclim.core.cfchecks for more details.

  • freq (str, optional) – Expected frequency, written as the result of xr.infer_freq(ds.time).

  • missing (dict or str or list of str, optional) – String, list of strings, or dictionary where the key is the method to check for missing data and the values are the arguments to pass to the method. The methods are: “missing_any”, “at_least_n_valid”, “missing_pct”, “missing_wmo”. See xclim.core.missing() for more details.

  • flags (dict, optional) – Dictionary where the key is the variable to check and the values are the flags. The flags themselves must be a dictionary with the keys being the data_flags names and the values being the arguments to pass to the data_flags. If None is passed instead of a dictionary, then xclim’s default flags for the given variable are run. See xclim.core.utils.VARIABLES. See also xclim.core.dataflags.data_flags() for the list of possible flags.

  • flags_kwargs (dict, optional) – Additional keyword arguments to pass to the data_flags (“dims” and “freq”).

  • return_flags (bool) – Whether to return the Dataset created by data_flags.

  • raise_on (list of str, optional) – Whether to raise an error if a check fails, else there will only be a warning. The possible values are the names of the checks. Use [“all”] to raise on all checks.

Returns:

xr.Dataset or None – Dataset containing the flags if return_flags is True & raise_on is False for the “flags” check.
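
Example

A minimal sketch checking the calendar, frequency, units and missing data, raising only if the frequency check fails (values and the check name passed to raise_on are illustrative):

>>> xs.diagnostics.health_checks(
...     ds,
...     calendar="standard",
...     freq="MS",
...     variables_and_units={"tas": "K"},
...     missing="missing_any",
...     raise_on=["freq"],
... )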

xscen.diagnostics.measures_heatmap(meas_datasets: list[Dataset] | dict, to_level: str = 'diag-heatmap') Dataset[source]

Create a heatmap to compare the performance of the different datasets.

The columns are properties and the rows are datasets. Each point is the absolute value of the mean of the measure over the whole domain. Each column is normalized from 0 (best) to 1 (worst).

Parameters:
  • meas_datasets (list of xr.Dataset or dict) – List or dictionary of datasets of measures of properties. If it is a dictionary, the keys will be used to name the rows. If it is a list, the rows will be given a number.

  • to_level (str) – The processing_level to assign to the output.

Returns:

xr.Dataset – Dataset containing the heatmap.

xscen.diagnostics.measures_improvement(meas_datasets: list[Dataset] | dict, to_level: str = 'diag-improved') Dataset[source]

Calculate the fraction of improved grid points for each property between two datasets of measures.

Parameters:
  • meas_datasets (list of xr.Dataset or dict) – List of 2 datasets: Initial dataset of measures and final (improved) dataset of measures. Both datasets must have the same variables. It is also possible to pass a dictionary where the values are the datasets and the key are not used.

  • to_level (str) – processing_level to assign to the output dataset

Returns:

xr.Dataset – Dataset containing information on the fraction of improved grid points for each property.

xscen.diagnostics.properties_and_measures(ds: Dataset, properties: str | PathLike | Sequence[Indicator] | Sequence[tuple[str, Indicator]] | ModuleType, period: list[str] | None = None, unstack: bool = False, rechunk: dict | None = None, dref_for_measure: Dataset | None = None, change_units_arg: dict | None = None, to_level_prop: str = 'diag-properties', to_level_meas: str = 'diag-measures') tuple[Dataset, Dataset][source]

Calculate properties and measures of a dataset.

Parameters:
  • ds (xr.Dataset) – Input dataset.

  • properties (Union[str, os.PathLike, Sequence[Indicator], Sequence[tuple[str, Indicator]], ModuleType]) – Path to a YAML file that instructs on how to calculate properties. Can be the indicator module directly, or a sequence of indicators or a sequence of tuples (indicator name, indicator) as returned by iter_indicators().

  • period (list of str, optional) – [start, end] of the period to be evaluated. The period will be selected on ds and dref_for_measure if it is given.

  • unstack (bool) – Whether to unstack ds before computing the properties.

  • rechunk (dict, optional) – Dictionary of chunks to use for a rechunk before computing the properties.

  • dref_for_measure (xr.Dataset, optional) – Dataset of properties to be used as the ref argument in the computation of the measure. Ideally, this is the first output (prop) of a previous call to this function. Only measures on properties that are provided both in this dataset and in the properties list will be computed. If None, the second output of the function (meas) will be an empty Dataset.

  • change_units_arg (dict, optional) – If not None, calls xscen.utils.change_units on ds before computing properties using this dictionary for the variables_and_units argument. It can be useful to convert units before computing the properties, because it is sometimes easier to convert the units of the variables than the units of the properties (e.g. variance).

  • to_level_prop (str) – processing_level to give the first output (prop)

  • to_level_meas (str) – processing_level to give the second output (meas)

Returns:

  • prop (xr.Dataset) – Dataset of properties of ds

  • meas (xr.Dataset) – Dataset of measures between prop and dref_for_meas
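
Example

A minimal sketch of the diagnostics workflow: properties are computed on a reference and on a simulation, measures compare the two, and the measures are then summarized (the YAML file name and labels are illustrative):

>>> prop_ref, _ = xs.diagnostics.properties_and_measures(ds_ref, properties="properties.yml")
>>> prop_sim, meas_sim = xs.diagnostics.properties_and_measures(
...     ds_sim, properties="properties.yml", dref_for_measure=prop_ref
... )
>>> hm = xs.diagnostics.measures_heatmap({"sim": meas_sim})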

Input / Output

Input/Output functions for xscen.

xscen.io.clean_incomplete(path: str | PathLike, complete: Sequence[str]) None[source]

Delete un-catalogued variables from a zarr folder.

The goal of this function is to clean up an incomplete calculation. It will remove any variable in the zarr that is neither in the complete list nor in the coords.

Parameters:
  • path (str, Path) – A path to a zarr folder.

  • complete (sequence of strings) – Name of variables that were completed.

Returns:

None

xscen.io.estimate_chunks(ds: str | PathLike | Dataset, dims: list, target_mb: float = 50, chunk_per_variable: bool = False) dict[source]

Return an approximate chunking for a file or dataset.

Parameters:
  • ds (xr.Dataset, str) – Either a xr.Dataset or the path to a NetCDF file. Existing chunks are not taken into account.

  • dims (list) – Dimension(s) on which to estimate the chunking. Not implemented for more than 2 dimensions.

  • target_mb (float) – Roughly the size of chunks (in Mb) to aim for.

  • chunk_per_variable (bool) – If True, the output will be separated per variable. Otherwise, a common chunking will be found.

Returns:

dict – A dictionary mapping dimensions to chunk sizes.
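
Example

A minimal sketch targeting roughly 50 MB chunks over the spatial dimensions, then applying them (dimension names are illustrative):

>>> chunks = xs.io.estimate_chunks(ds, dims=["lat", "lon"], target_mb=50)
>>> ds = ds.chunk(chunks)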

xscen.io.get_engine(file: str | PathLike) str[source]

Use functionality of h5py to determine if a NetCDF file is compatible with h5netcdf.

Parameters:

file (str or os.PathLike) – Path to the file.

Returns:

str – Engine to use with xarray

xscen.io.make_toc(ds: Dataset | DataArray, loc: str | None = None) DataFrame[source]

Make a table of content describing a dataset’s variables.

This returns a simple DataFrame with variable names as index, the long_name as “description” and units. Column names and long names are taken from the activated locale if found, otherwise the English version is taken.

Parameters:
  • ds (xr.Dataset or xr.DataArray) – Dataset or DataArray from which to extract the relevant metadata.

  • loc (str, optional) – The locale to use. If None, either the first locale in the list of activated xclim locales is used, or “en” if none is activated.

Returns:

pd.DataFrame – A DataFrame with variables as index, and columns “description” and “units”.

xscen.io.rechunk(path_in: PathLike | str | Dataset, path_out: PathLike | str, *, chunks_over_var: dict | None = None, chunks_over_dim: dict | None = None, worker_mem: str, temp_store: str | PathLike | None = None, overwrite: bool = False) None[source]

Rechunk a dataset into a new zarr.

Parameters:
  • path_in (path, str or xr.Dataset) – Input to rechunk.

  • path_out (path or str) – Path to the target zarr.

  • chunks_over_var (dict) – Mapping from variables to mappings from dimension name to size. Give this argument or chunks_over_dim.

  • chunks_over_dim (dict) – Mapping from dimension name to size that will be used for all variables in ds. Give this argument or chunks_over_var.

  • worker_mem (str) – The maximal memory usage of each task. When using a distributed Client, this is an approximate memory per thread. Each worker of the client should have access to 10-20% more memory than this times the number of threads.

  • temp_store (path or str, optional) – A path to a zarr where to store intermediate results.

  • overwrite (bool) – If True, it will delete whatever is in path_out before doing the rechunking.

Returns:

None

xscen.io.rechunk_for_saving(ds: Dataset, rechunk: dict)[source]

Rechunk before saving to .zarr or .nc, generalized as Y/X for different axes lat/lon, rlat/rlon.

Parameters:
  • ds (xr.Dataset) – The xr.Dataset to be rechunked.

  • rechunk (dict) – A dictionary with the dimension names of ds and the new chunk size. Spatial dimensions can be provided as X/Y.

Returns:

xr.Dataset – The dataset with new chunking.

xscen.io.round_bits(da: DataArray, keepbits: int)[source]

Round floating point variable by keeping a given number of bits in the mantissa, dropping the rest. This allows for a much better compression.

Parameters:
  • da (xr.DataArray) – Variable to be rounded.

  • keepbits (int) – The number of bits of the mantissa to keep.

xscen.io.save_to_netcdf(ds: Dataset, filename: str | PathLike, *, rechunk: dict | None = None, bitround: bool | int | dict = False, compute: bool = True, netcdf_kwargs: dict | None = None)[source]

Save a Dataset to NetCDF, rechunking or compressing if requested.

Parameters:
  • ds (xr.Dataset) – Dataset to be saved.

  • filename (str or os.PathLike) – Name of the NetCDF file to be saved.

  • rechunk (dict, optional) – This is a mapping from dimension name to new chunks (in any format understood by dask). Spatial dimensions can be generalized as ‘X’ and ‘Y’, which will be mapped to the actual grid type’s dimension names. Rechunking is only done on data variables sharing dimensions with this argument.

  • bitround (bool or int or dict) – If not False, float variables are bit-rounded by dropping a certain number of bits from their mantissa, allowing for a much better compression. If an int, this is the number of bits to keep for all float variables. If a dict, a mapping from variable name to the number of bits to keep. If True, the number of bits to keep is guessed based on the variable’s name, defaulting to 12, which yields a relative error below 0.013%.

  • compute (bool) – Whether to start the computation or return a delayed object.

  • netcdf_kwargs (dict, optional) – Additional arguments to send to_netcdf()

Returns:

None
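
Example

A minimal sketch writing a NetCDF with the spatial dimensions rechunked (via the X/Y shortcuts) and 12 mantissa bits kept (values are illustrative):

>>> xs.io.save_to_netcdf(
...     ds, "output.nc", rechunk={"time": -1, "X": 50, "Y": 50}, bitround=12
... )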

xscen.io.save_to_table(ds: Dataset | DataArray, filename: str | PathLike, output_format: str | None = None, *, row: str | Sequence[str] | None = None, column: None | str | Sequence[str] = 'variable', sheet: str | Sequence[str] | None = None, coords: bool | Sequence[str] = True, col_sep: str = '_', row_sep: str | None = None, add_toc: bool | DataFrame = False, **kwargs)[source]

Save the dataset to a tabular file (csv, excel, …).

This function will trigger a computation of the dataset.

Parameters:
  • ds (xr.Dataset or xr.DataArray) – Dataset or DataArray to be saved. If a Dataset with more than one variable is given, the dimension “variable” must appear in one of row, column or sheet.

  • filename (str or os.PathLike) – Name of the file to be saved.

  • output_format ({‘csv’, ‘excel’, …}, optional) – The output format. If None (default), it is inferred from the extension of filename. Not all possible output formats are supported for inference. Valid values are any that match a pandas.DataFrame method like “df.to_{format}”.

  • row (str or sequence of str, optional) – Name of the dimension(s) to use as indexes (rows). Default is all data dimensions.

  • column (str or sequence of str, optional) – Name of the dimension(s) to use as columns. Default is “variable”, i.e. the name of the variable(s).

  • sheet (str or sequence of str, optional) – Name of the dimension(s) to use as sheet names. Only valid if the output format is excel.

  • coords (bool or sequence of str) – A list of auxiliary coordinates to add to the columns (as would variables). If True, all (if any) are added.

  • col_sep (str) – Multi-columns (except in excel) and sheet names are concatenated with this separator.

  • row_sep (str, optional) – Multi-index names are concatenated with this separator, except in excel. If None (default), each level is written in its own column.

  • add_toc (bool or DataFrame) – A table of content to add as the first sheet. Only valid if the output format is excel. If True, make_toc() is used to generate the toc. The sheet name of the toc can be given through the “name” attribute of the DataFrame, otherwise “Content” is used.

  • kwargs – Other arguments passed to the pandas function. If the output format is excel, kwargs to pandas.ExcelWriter can be given here as well.

xscen.io.save_to_zarr(ds: Dataset, filename: str | PathLike, *, rechunk: dict | None = None, zarr_kwargs: dict | None = None, compute: bool = True, encoding: dict | None = None, bitround: bool | int | dict = False, mode: str = 'f', itervar: bool = False, timeout_cleanup: bool = True)[source]

Save a Dataset to Zarr format, rechunking and compressing if requested.

According to mode, removes variables that we don’t want to re-compute in ds.

Parameters:
  • ds (xr.Dataset) – Dataset to be saved.

  • filename (str) – Name of the Zarr file to be saved.

  • rechunk (dict, optional) – This is a mapping from dimension name to new chunks (in any format understood by dask). Spatial dimensions can be generalized as ‘X’ and ‘Y’ which will be mapped to the actual grid type’s dimension names. Rechunking is only done on data variables sharing dimensions with this argument.

  • zarr_kwargs (dict, optional) – Additional arguments to send to_zarr()

  • compute (bool) – Whether to start the computation or return a delayed object.

  • mode ({‘f’, ‘o’, ‘a’}) – If ‘f’, fails if any variable already exists. If ‘o’, removes the existing variables. If ‘a’, skips existing variables and writes the others.

  • encoding (dict, optional) – If given, skipped variables are popped in place.

  • bitround (bool or int or dict) – If not False, float variables are bit-rounded by dropping a certain number of bits from their mantissa, allowing for a much better compression. If an int, this is the number of bits to keep for all float variables. If a dict, a mapping from variable name to the number of bits to keep. If True, the number of bits to keep is guessed based on the variable’s name, defaulting to 12, which yields a relative error of 0.012%.

  • itervar (bool) – If True, (data) variables are written one at a time, appending to the zarr. If False, this function computes, no matter what was passed to kwargs.

  • timeout_cleanup (bool) – If True (default) and a xscen.scripting.TimeoutException is raised during the writing, the variable being written is removed from the dataset as it is incomplete. This does nothing if compute is False.

Returns:

dask.delayed object if compute=False, None otherwise.

xscen.io.subset_maxsize(ds: Dataset, maxsize_gb: float) list[source]

Estimate a dataset’s size and, if higher than the given limit, subset it along the ‘time’ dimension.

Parameters:
  • ds (xr.Dataset) – Dataset to be saved.

  • maxsize_gb (float) – Target size for the NetCDF files. If the dataset is bigger than this number, it will be separated along the ‘time’ dimension.

Returns:

list – List of xr.Dataset subsetted along ‘time’ to limit the filesize to the requested maximum.

xscen.io.to_table(ds: Dataset | DataArray, *, row: str | Sequence[str] | None = None, column: str | Sequence[str] | None = None, sheet: str | Sequence[str] | None = None, coords: bool | str | Sequence[str] = True) DataFrame | dict[source]

Convert a dataset to a pandas DataFrame with support for multicolumns and multisheet.

This function will trigger a computation of the dataset.

Parameters:
  • ds (xr.Dataset or xr.DataArray) – Dataset or DataArray to be saved. If a Dataset with more than one variable is given, the dimension “variable” must appear in one of row, column or sheet.

  • row (str or sequence of str, optional) – Name of the dimension(s) to use as indexes (rows). Default is all data dimensions.

  • column (str or sequence of str, optional) – Name of the dimension(s) to use as columns. Default is “variable”, i.e. the name of the variable(s).

  • sheet (str or sequence of str, optional) – Name of the dimension(s) to use as sheet names.

  • coords (bool or str or sequence of str) – A list of auxiliary coordinates to add to the columns (as would variables). If True, all (if any) are added.

Returns:

pd.DataFrame or dict – DataFrame with a MultiIndex with levels row and MultiColumn with levels column. If sheet is given, the output is dictionary with keys for each unique “sheet” dimensions tuple, values are DataFrames. The DataFrames are always sorted with level priority as given in row and in ascending order.
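
Example

A minimal sketch building a DataFrame with time as rows and one column per variable and season (dimension names are illustrative):

>>> df = xs.io.to_table(ds, row="time", column=["variable", "season"])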

Spatial tools

Spatial tools.

xscen.spatial.creep_fill(da: DataArray, w: DataArray) DataArray[source]

Creep fill using pre-computed weights.

Parameters:
  • da (DataArray) – A DataArray sharing the dimensions with the one used to compute the weights. It can have other dimensions. Dask is supported as long as there are no chunks over the creeped dims.

  • w (DataArray) – The result of creep_weights.

Returns:

xarray.DataArray, same shape as da, but values filled according to w.

Examples

>>> w = creep_weights(da.isel(time=0).notnull(), n=1)
>>> da_filled = creep_fill(da, w)
xscen.spatial.creep_weights(mask: DataArray, n: int = 1, mode: str = 'clip') DataArray[source]

Compute weights for the creep fill.

The output is a sparse matrix with the same dimensions as mask, twice.

Parameters:
  • mask (DataArray) – A boolean DataArray. False values are candidates to the filling. Usually they represent missing values (mask = da.notnull()). All dimensions are creep filled.

  • n (int) – The order of neighbouring to use. 1 means only the adjacent grid cells are used.

  • mode ({‘clip’, ‘wrap’}) – If a cell is on the edge of the domain, mode=’wrap’ will wrap around to find neighbours.

Returns:

DataArray – Weights. The dot product must be taken over the last N dimensions.

xscen.spatial.subset(ds: Dataset, method: str, *, name: str | None = None, tile_buffer: float = 0, **kwargs) Dataset[source]

Subset the data to a region.

Either creates a slice and uses the .sel() method, or customizes a call to clisops.subset() that allows for an automatic buffer around the region.

Parameters:
  • ds (xr.Dataset) – Dataset to be subsetted.

  • method (str) – [‘gridpoint’, ‘bbox’, ‘shape’, ‘sel’] If the method is sel, this is not a call to clisops but only a subsetting with the xarray .sel() function.

  • name (str, optional) – Used to rename the ‘cat:domain’ attribute.

  • tile_buffer (float) – For [‘bbox’, ‘shape’], uses an approximation of the grid cell size to add a buffer around the requested region. This differs from clisops’ ‘buffer’ argument in subset_shape().

  • **kwargs (dict) – Arguments to be sent to clisops. See relevant function for details. Depending on the method, required kwargs are: - gridpoint: lon, lat - bbox: lon_bnds, lat_bnds - shape: shape - sel: slices for each dimension

Returns:

xr.Dataset – Subsetted Dataset.
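
Example

A minimal sketch of a bounding-box subset with a buffer of one grid cell around the region (bounds and name are illustrative; lon_bnds and lat_bnds are forwarded to clisops through **kwargs):

>>> ds_sub = xs.spatial.subset(
...     ds,
...     method="bbox",
...     name="domain_qc",
...     tile_buffer=1,
...     lon_bnds=[-80, -60],
...     lat_bnds=[45, 60],
... )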

Controlled Vocabulary and Mappings

Mappings of (controlled) vocabulary. This module is generated automatically from json files in xscen/CVs. Functions are essentially mappings, most of which are meant to provide translations between columns.

Json files must be shallow dictionaries to be supported. If the json file contains a is_regex: True entry, then the keys are automatically translated as regex patterns and the function returns the value of the first key that matches the pattern. Otherwise the function essentially acts like a normal dictionary. The ‘raw’ data parsed from the json file is added in the dict attribute of the function. Example:

xs.utils.CV.frequency_to_timedelta.dict
frequency_to_timedelta
{
  "1hr": "1h",
  "3hr": "3h",
  "6hr": "6h",
  "day": "1D",
  "sem": "1W",
  "2sem": "2W",
  "mon": "30D",
  "qtr": "90D",
  "6mon": "180D",
  "yr": "365D",
  "fx": "NAN"
}
frequency_to_xrfreq
{
  "1hr": "h",
  "3hr": "3h",
  "6hr": "6h",
  "day": "D",
  "sem": "W",
  "2sem": "2W",
  "mon": "MS",
  "qtr": "QS-DEC",
  "6mon": "2QS-DEC",
  "yr": "YS",
  "fx": "fx"
}
infer_resolution
{
  "CMIP": [
    "^gn[a-g]{0,1}$",
    "^gr[0-9]{0,1}[a-g]{0,1}$",
    "^global$",
    "^gnz$",
    "^gr[0-9]{1}z$",
    "^gm"
  ],
  "CORDEX": [
    "^[A-Z]{3}-[0-9]{2}[i]{0,1}$",
    "^[A-Z]{3}-[0-9]{2}i$"
  ]
}
resampling_methods
{
  "any": {
    "sfcWindfromdir": "wind_direction",
    "sfcWind": "wind_direction",
    "uas": "wind_direction",
    "vas": "wind_direction"
  },
  "D": {
    "tasmin": "min",
    "tasmax": "max"
  }
}
variable_names
{
  "latitude": "lat",
  "longitude": "lon",
  "t2m": "tas",
  "d2m": "tdps",
  "tp": "pr",
  "u10": "uas",
  "v10": "vas"
}
xrfreq_to_frequency
{
  "is_regex": true,
  "h": "1hr",
  "H": "1hr",
  "3h": "3hr",
  "3H": "3hr",
  "6h": "6hr",
  "6H": "6hr",
  "D": "day",
  "W": "sem",
  "2W": "2sem",
  "14D": "2sem",
  "M.*": "mon",
  "Q.*": "qtr",
  "2Q.*": "6mon",
  "A.*": "yr",
  "Y.*": "yr",
  "fx": "fx"
}
xrfreq_to_timedelta
{
  "is_regex": true,
  "h": "1h",
  "H": "1h",
  "3h": "3h",
  "3H": "3h",
  "6h": "6h",
  "6H": "6h",
  "D": "1D",
  "W": "7D",
  "2W": "14D",
  "M.*": "30D",
  "Q.*": "90D",
  "2Q.*": "180D",
  "A.*": "365D",
  "Y.*": "365D",
  "fx": "NAN"
}
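
For example, following the access pattern shown above and the regex behaviour described earlier (“M.*” is the first pattern matching “MS”):

>>> xs.utils.CV.xrfreq_to_frequency("MS")  # returns "mon"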

Configuration Utilities

Configuration module.

Configuration in this module is taken from yaml files.

Functions wrapped by parse_config() have their kwargs automatically patched by values in the config.

The CONFIG dictionary contains all values, structured by submodules and functions. For example, for a function named function defined in module.py of this package, the config would look like:

module:
    function:
        ...kwargs...

The load_config() function fills the CONFIG dict from yaml files. It always updates the dictionary, so the latest file read has the highest priority.

At calling time, the priority order is always (from highest to lowest priority):

  1. Explicitly passed keyword-args

  2. Values in the loaded config

  3. Function’s default values.

Special sections

After parsing the files, load_config() will look into the config and perform some extra actions when finding the following special sections:

  • logging: The content of this section will be sent directly to logging.config.dictConfig().

  • xarray: The content of this section will be sent directly to xarray.set_options().

  • xclim: The content of this section will be sent directly to xclim.set_options(). Here goes metadata_locales: - fr to activate the automatic translation of added attributes, for example.

  • warnings: The content of this section must be a simple mapping. The keys are understood as python warning categories (types) and the values as an action to add to the filter. The key “all” applies the filter to any warnings. Only built-in warnings are supported.

xscen.config.args_as_str(*args: tuple[Any, ...]) tuple[str, ...][source]

Return arguments as strings.

xscen.config.load_config(*elements, reset: bool = False, encoding: str = None, verbose: bool = False)[source]

Load configuration from given files or key=value pairs.

Once all elements are loaded, special sections are dispatched to their module, but only if the section was changed by the loaded elements. These special sections are:

  • locales : The locales to use when writing metadata in xscen, xclim and figanos. This section must be a list of 2-char strings.

  • logging : Everything passed to logging.config.dictConfig().

  • xarray : Passed to xarray.set_options().

  • xclim : Passed to xclim.set_options().

  • warning : Mappings where the key is a Warning category (or “all”) and the value an action to pass to warnings.simplefilter().

Parameters:
  • elements (str) – Files or values to add into the config. If a directory is passed, all .yml files of this directory are added, in alphabetical order. If a “key=value” string, “key” is a dotted name and value will be evaluated if possible. “key=value” pairs are set last, after all files have been processed.

  • reset (bool) – If True, the current config is erased before loading files.

  • encoding (str, optional) – The encoding to use when reading files.

  • verbose (bool) – If True, each element triggers an INFO log line.

Example

load_config("my_config.yml", "config_dir/", "logging.loggers.xscen.level=DEBUG")

Will load configuration from my_config.yml, then from all yml files in config_dir and then the logging level of xscen’s logger will be set to DEBUG.

xscen.config.parse_config(func_or_cls)[source]
xscen.config.recursive_update(d, other)[source]

Update a dictionary recursively with another dictionary.

Values that are Mappings are updated recursively as well.

Script Utilities

A collection of various convenience objects and functions to use in scripts.

exception xscen.scripting.TimeoutException(seconds: int, task: str = '', **kwargs)[source]

An exception raised when a timeout occurs.

class xscen.scripting.measure_time(name: str | None = None, cpu: bool = False, logger: ~logging.Logger = <Logger xscen.scripting (WARNING)>)[source]

Context for timing a code block.

Parameters:
  • name (str, optional) – A name to give to the block being timed, for meaningful logging.

  • cpu (boolean) – If True, the CPU time is also measured and logged.

  • logger (logging.Logger, optional) – The logger object to use when sending Info messages with the measured time. Defaults to a logger from this module.

xscen.scripting.move_and_delete(moving: list[list[str | PathLike]], pcat: ProjectCatalog, deleting: list[str | PathLike] | None = None, copy: bool = False)[source]

First, move files, then update the catalog with new locations. Finally, delete directories.

This function can be used at the end of a for loop in a workflow to clean up temporary files.

Parameters:
  • moving (list of lists of str or os.PathLike) – list of lists of path of files to move, following the format: [[source 1, destination1], [source 2, destination2],…]

  • pcat (ProjectCatalog) – Catalog to update with new destinations

  • deleting (list of str or os.PathLike, optional) – list of directories to be deleted including all contents and recreated empty. E.g. the working directory of a workflow.

  • copy (bool, optional) – If True, copy directories instead of moving them.

xscen.scripting.save_and_update(ds: Dataset, pcat: ProjectCatalog, path: str | PathLike | None = None, file_format: str | None = None, build_path_kwargs: dict | None = None, save_kwargs: dict | None = None, update_kwargs: dict | None = None)[source]

Construct the path, save the dataset and update the catalog.

This function can be used after each task of a workflow.

Parameters:
  • ds (xr.Dataset) – Dataset to save.

  • pcat (ProjectCatalog) – Catalog to update after saving the dataset.

  • path (str or os.pathlike, optional) – Path where to save the dataset. If the string contains variables in curly brackets, they will be filled by catalog attributes. If None, the catutils.build_path function will be used to create a path.

  • file_format ({‘nc’, ‘zarr’}) – Format of the file. If None, look for the following in order: build_path_kwargs[‘format’], a suffix in path, ds.attrs[‘cat:format’]. If nothing is found, it will default to zarr.

  • build_path_kwargs (dict, optional) – Arguments to pass to build_path.

  • save_kwargs (dict, optional) – Arguments to pass to save_to_netcdf or save_to_zarr.

  • update_kwargs (dict, optional) – Arguments to pass to update_from_ds.
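
Example

A minimal sketch saving a dataset to a path templated from its catalog attributes and registering the result in the project catalog (the template is illustrative):

>>> xs.scripting.save_and_update(ds, pcat, path="{processing_level}/{id}.zarr")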

xscen.scripting.send_mail(*, subject: str, msg: str, to: str | None = None, server: str = '127.0.0.1', port: int = 25, attachments: list[tuple[str, Figure | PathLike] | Figure | PathLike] | None = None) None[source]

Send email.

Email a single address through a login-less SMTP server. The default values of server and port should work out-of-the-box on Ouranos’s systems.

Parameters:
  • subject (str) – Subject line.

  • msg (str) – Main content of the email. Can be UTF-8 and multi-line.

  • to (str, optional) – Email address to which to send the email. If None (default), the email is sent to “{os.getlogin()}@{os.uname().nodename}”. On unix systems simply put your real email address in $HOME/.forward to receive the emails sent to this local address.

  • server (str) – SMTP server url. Defaults to 127.0.0.1, the local host. This function does not try to log-in.

  • port (int) – Port of the SMTP service on the server. Defaults to 25, which is usually the default port on unix-like systems.

  • attachments (list of paths or matplotlib figures or tuples of a string and a path or figure, optional) – List of files to attach to the email. Elements of the list can be paths, in which case their mimetypes are guessed and the files are read and sent. Elements can also be matplotlib Figures, which are sent as png images (savefig) with names like “Figure00.png”. Finally, elements can be tuples of a filename to use in the email and the attachment, handled as above.

Returns:

None

xscen.scripting.send_mail_on_exit(*, subject: str | None = None, msg_ok: str | None = None, msg_err: str | None = None, on_error_only: bool = False, skip_ctrlc: bool = True, **mail_kwargs) None[source]

Send an email with content depending on how the system exited.

This function is best used by registering it with atexit. Calls send_mail().

Parameters:
  • subject (str, optional) – Email subject. Will be appended by “Success”, “No errors” or “Failure” depending on how the system exits.

  • msg_ok (str, optional) – Content of the email if the system exits successfully.

  • msg_err (str, optional) – Content of the email if the system exits with a non-zero code or with an error. The message will be appended by the exit code or with the error traceback.

  • on_error_only (boolean) – Whether to only send an email on a non-zero/error exit.

  • skip_ctrlc (boolean) – If True (default), exiting with a KeyboardInterrupt will not send an email.

  • mail_kwargs – Other arguments passed to send_mail(). The to argument is necessary for this function to work.

Returns:

None

Example

Send an email titled “Woups” upon non-successful program exit. We assume the to field was given in the config.

>>> import atexit
>>> atexit.register(send_mail_on_exit, subject="Woups", on_error_only=True)
xscen.scripting.skippable(seconds: int = 2, task: str = '', logger: Logger | None = None)[source]

Skippable context manager.

When CTRL-C (SIGINT, KeyboardInterrupt) is sent within the context, this catches it, prints to the log and gives a timeout during which a subsequent interruption will stop the script. Otherwise, the context exits normally.

This is meant to be used within a loop so that we can skip some iterations:

for i in iterable:
    with skippable(2, i):
        some_skippable_code()
Parameters:
  • seconds (int) – Number of seconds to wait for a second CTRL-C.

  • task (str) – A name for the skippable task, to have an explicit script.

  • logger (logging.Logger, optional) – The logger to use when printing the messages. The interruption signal is notified with ERROR, while the skipping is notified with INFO. If not given (default), a brutal print is used.

xscen.scripting.timeout(seconds: int, task: str = '')[source]

Timeout context manager.

Only one can be used at a time; this is not multithread-safe: the context can only be used in the main thread, but the code within it may use multithreading.

Parameters:
  • seconds (int) – Number of seconds after which the context exits with a TimeoutException. If None or negative, no timeout is set and this context does nothing.

  • task (str, optional) – A name to give to the task, allowing a more meaningful exception.

Packaging Utilities

Common utilities to be used in many places.

xscen.utils.add_attr(ds: Dataset | DataArray, attr: str, new: str, **fmt)[source]

Add a formatted translatable attribute to a dataset.

xscen.utils.change_units(ds: Dataset, variables_and_units: dict) Dataset[source]

Change units of Datasets to non-CF units.

Parameters:
  • ds (xr.Dataset) – Dataset to use

  • variables_and_units (dict) – Description of the variables and units to output

Returns:

xr.Dataset

xscen.utils.clean_up(ds: Dataset, *, variables_and_units: dict | None = None, convert_calendar_kwargs: dict | None = None, missing_by_var: dict | None = None, maybe_unstack_dict: dict | None = None, round_var: dict | None = None, common_attrs_only: dict | list[Dataset | str | PathLike] | None = None, common_attrs_open_kwargs: dict | None = None, attrs_to_remove: dict | None = None, remove_all_attrs_except: dict | None = None, add_attrs: dict | None = None, change_attr_prefix: str | None = None, to_level: str | None = None) Dataset[source]

Clean up of the dataset.

It can:
  • convert to the right units using xscen.finalize.change_units

  • convert the calendar and interpolate over missing dates

  • call the xscen.common.maybe_unstack function

  • remove a list of attributes

  • remove everything but a list of attributes

  • add attributes

  • change the prefix of the catalog attrs

in that order.

Parameters:
  • ds (xr.Dataset) – Input dataset to clean up

  • variables_and_units (dict, optional) – Dictionary of variables to convert. eg. {‘tasmax’: ‘degC’, ‘pr’: ‘mm d-1’}

  • convert_calendar_kwargs (dict, optional) – Dictionary of arguments to feed to xclim.core.calendar.convert_calendar. This will be the same for all variables. If missing_by_var is given, it will override the ‘missing’ argument given here. Eg. {‘target’: ‘default’, ‘align_on’: ‘random’}

  • missing_by_var (dict, optional) – Dictionary where the keys are the variables and the values are the argument to feed the missing parameters of the xclim.core.calendar.convert_calendar for the given variable with the convert_calendar_kwargs. When the value of an entry is ‘interpolate’, the missing values will be filled with NaNs, then linearly interpolated over time.

  • maybe_unstack_dict (dict, optional) – Dictionary to pass to xscen.common.maybe_unstack function. The format should be: {‘coords’: path_to_coord_file, ‘rechunk’: {‘time’: -1 }, ‘stack_drop_nans’: True}.

  • round_var (dict, optional) – Dictionary where the keys are the variables of the dataset and the values are the number of decimal places to round to

  • common_attrs_only (dict, list of datasets, or list of paths, optional) – Dictionary of datasets or list of datasets, or path to NetCDF or Zarr files. Keeps only the global attributes that are the same for all datasets and generates a new id.

  • common_attrs_open_kwargs (dict, optional) – Dictionary of arguments for xarray.open_dataset(). Used with common_attrs_only if given paths.

  • attrs_to_remove (dict, optional) – Dictionary where the keys are the variables and the values are a list of the attrs that should be removed. For global attrs, use the key ‘global’. The element of the list can be exact matches for the attributes name or use the same substring matching rules as intake_esm: - ending with a ‘*’ means checks if the substring is contained in the string - starting with a ‘^’ means check if the string starts with the substring. eg. {‘global’: [‘unnecessary note’, ‘cell*’], ‘tasmax’: ‘old_name’}

  • remove_all_attrs_except (dict, optional) – Dictionary where the keys are the variables and the values are a list of the attrs that should NOT be removed, all other attributes will be deleted. If None (default), nothing will be deleted. For global attrs, use the key ‘global’. The element of the list can be exact matches for the attributes name or use the same substring matching rules as intake_esm: - ending with a ‘*’ means checks if the substring is contained in the string - starting with a ‘^’ means check if the string starts with the substring. eg. {‘global’: [‘necessary note’, ‘^cat:’], ‘tasmax’: ‘new_name’}

  • add_attrs (dict, optional) – Dictionary where the keys are the variables and the values are a another dictionary of attributes. For global attrs, use the key ‘global’. eg. {‘global’: {‘title’: ‘amazing new dataset’}, ‘tasmax’: {‘note’: ‘important info about tasmax’}}

  • change_attr_prefix (str, optional) – Replace “cat:” in the catalog global attrs by this new string

  • to_level (str, optional) – The processing level to assign to the output.

Returns:

xr.Dataset – Cleaned up dataset
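
Example

A minimal sketch converting units, rounding a variable and removing an attribute (names and values are illustrative):

>>> ds_clean = xs.utils.clean_up(
...     ds,
...     variables_and_units={"tasmax": "degC", "pr": "mm d-1"},
...     round_var={"pr": 2},
...     attrs_to_remove={"global": ["history"]},
... )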

xscen.utils.date_parser(date: str | datetime | Timestamp | datetime | Period, *, end_of_period: bool | str = False, out_dtype: str = 'datetime', strtime_format: str = '%Y-%m-%d', freq: str = 'H') str | Period | Timestamp[source]

Return a datetime from a string.

Parameters:
  • date (str, cftime.datetime, pd.Timestamp, datetime.datetime, pd.Period) – Date to be converted

  • end_of_period (bool or str) – If ‘YE’ or ‘ME’, the returned date will be the end of the year or month that contains the received date. If True, the period is inferred from the date’s precision, but date must be a string, otherwise nothing is done.

  • out_dtype (str) – Choices are ‘datetime’, ‘period’ or ‘str’

  • strtime_format (str) – If out_dtype==’str’, this sets the strftime format

  • freq (str) – If out_dtype==’period’, this sets the frequency of the period.

Returns:

pd.Timestamp, pd.Period, str – Parsed date
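
Example

A minimal sketch of a few call forms (outputs omitted):

>>> xs.utils.date_parser("2011", out_dtype="datetime")
>>> xs.utils.date_parser("2011-07", end_of_period=True, out_dtype="str")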

xscen.utils.ensure_correct_time(ds: Dataset, xrfreq: str) Dataset[source]

Ensure a dataset has the correct time coordinate, as expected for the given frequency.

Daily or finer datasets are “floored” even if xr.infer_freq succeeds. Errors are raised if the number of data points per period is not 1. The dataset is modified in-place, but returned nonetheless.

xscen.utils.ensure_new_xrfreq(freq: str) str[source]

Convert the frequency string to the newer syntax (pandas >= 2.2) if needed.

xscen.utils.get_cat_attrs(ds: Dataset | DataArray | dict, prefix: str = 'cat:', var_as_str=False) dict[source]

Return the catalog-specific attributes from a dataset or dictionary.

Parameters:
  • ds (xr.Dataset, dict) – Dataset to be parsed. If a dictionary, it is assumed to be the attributes of the dataset (ds.attrs).

  • prefix (str) – Prefix automatically generated by intake-esm. With xscen, this should be ‘cat:’

  • var_as_str (bool) – If True, ‘variable’ will be returned as a string if there is only one.

Returns:

dict – Compilation of all attributes in a dictionary.

xscen.utils.maybe_unstack(ds: Dataset, coords: str | None = None, rechunk: dict | None = None, stack_drop_nans: bool = False) Dataset[source]

If stack_drop_nans is True, unstack and rechunk.

Parameters:
  • ds (xr.Dataset) – Dataset to unstack.

  • coords (str, optional) – Path to a dataset containing the coords to unstack (and only those).

  • rechunk (dict, optional) – If not None, rechunk the dataset after unstacking.

  • stack_drop_nans (bool) – If True, unstack the dataset and rechunk it. If False, do nothing.

Returns:

xr.Dataset – Unstacked dataset.

xscen.utils.minimum_calendar(*calendars) str[source]

Return the minimum calendar from a list.

Uses the hierarchy: 360_day < noleap < standard < all_leap, and returns one of those names.

xscen.utils.natural_sort(_list: list[str])[source]

For strings of numbers: an alternative to sorted() that detects a more natural order.

e.g. [r3i1p1, r1i1p1, r10i1p1] is sorted as [r1i1p1, r3i1p1, r10i1p1] instead of [r10i1p1, r1i1p1, r3i1p1]

xscen.utils.publish_release_notes(style: str = 'md', file: PathLike | StringIO | TextIO | None = None, changes: str | PathLike = None) str | None[source]

Format release history in Markdown or ReStructuredText.

Parameters:
  • style ({“rst”, “md”}) – Use ReStructuredText (rst) or Markdown (md) formatting. Default: Markdown.

  • file ({os.PathLike, StringIO, TextIO, None}) – If provided, prints to the given file-like object. Otherwise, returns a string.

  • changes ({str, os.PathLike}, optional) – If provided, manually points to the file where the changelog can be found. Assumes a relative path otherwise.

Returns:

str, optional

Notes

This function exists solely for development purposes. Adapted from xclim.testing.utils.publish_release_notes.

xscen.utils.stack_drop_nans(ds: Dataset, mask: DataArray, *, new_dim: str = 'loc', to_file: str | None = None) Dataset[source]

Stack dimensions into a single axis and drop indexes where the mask is false.

Parameters:
  • ds (xr.Dataset) – A dataset with the same coords as mask.

  • mask (xr.DataArray) – A boolean DataArray with True on the points to keep. Mask will be loaded within this function.

  • new_dim (str) – The name of the new stacked dim.

  • to_file (str, optional) – A netCDF filename where to write the stacked coords for use in unstack_fill_nan. If given a string with {shape} and {domain}, the formatting will fill them with the original shape of the dataset and the global attributes ‘cat:domain’. If None (default), nothing is written to disk. It is recommended to fill this argument in the config. It will be parsed automatically. E.g.:

    utils:
      stack_drop_nans:
        to_file: /some_path/coords/coords_{domain}_{shape}.nc
      unstack_fill_nan:
        coords: /some_path/coords/coords_{domain}_{shape}.nc

Returns:

xr.Dataset – Same as ds, but all dimensions of mask have been stacked to a single new_dim. Indexes where mask is False have been dropped.

See also

unstack_fill_nan

The inverse operation.
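
Example

A minimal sketch of the round trip, stacking the grid on the non-null points of one variable and unstacking from the saved coords later (variable and file names are illustrative):

>>> mask = ds.tas.isel(time=0).notnull()
>>> ds_stacked = xs.utils.stack_drop_nans(ds, mask, to_file="coords_{domain}_{shape}.nc")
>>> ds_back = xs.utils.unstack_fill_nan(ds_stacked, coords="coords_{domain}_{shape}.nc")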

xscen.utils.standardize_periods(periods: list[str] | list[list[str]] | None, multiple: bool = True) list[str] | list[list[str]] | None[source]

Reformats the input to a list of strings, [‘start’, ‘end’], or a list of such lists.

Parameters:
  • periods (list of str or list of lists of str, optional) – The period(s) to standardize. If None, return None.

  • multiple (bool) – If True, return a list of periods, otherwise return a single period.

xscen.utils.translate_time_chunk(chunks: dict, calendar: str, timesize) dict[source]

Translate chunk specification for time into a number.

-1 translates to timesize. ‘Nyear’ translates to N times the number of days in a year of the given calendar.

xscen.utils.unstack_dates(ds: Dataset, seasons: dict[int, str] | None = None, new_dim: str = 'season', winter_starts_year: bool = False)[source]

Unstack a multi-season timeseries into a yearly axis and a season one.

Parameters:
  • ds (xr.Dataset or DataArray) – The xarray object with a “time” coordinate. Only supports monthly or coarser frequencies. The time axis must be complete and regular (xr.infer_freq(ds.time) doesn’t fail).

  • seasons (dict, optional) – A dictionary from month number (as int) to a season name. If not given, it is guessed from the time coord’s frequency. See notes.

  • new_dim (str) – The name of the new dimension.

  • winter_starts_year (bool) – If True, the year of winter (DJF) is built from the year of January, not December. i.e. DJF made from [Dec 1980, Jan 1981, and Feb 1981] will be associated with the year 1981, not 1980.

Returns:

xr.Dataset or DataArray – Same as ds but the time axis is now yearly (YS-JAN) and the seasons are along the new dimension.

Notes

When season is None, the inferred frequency determines the new coordinate:

  • For MS, the coordinates are the month abbreviations in english (JAN, FEB, etc.)

  • For ?QS-? and other ?MS frequencies, the coordinates are the initials of the months in each season. Ex: QS-DEC (with winter_starts_year=True) : DJF, MAM, JJA, SON.

  • For YS or YS-JAN, the new coordinate has a single value of “annual”.

  • For ?YS-? frequencies, the new coordinate has a single value of “annual-{anchor}”, where “anchor” is the abbreviation of the first month of the year. Ex: YS-JUL -> “annual-JUL”.
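
Example

A minimal sketch unstacking a seasonal (QS-DEC) timeseries so that winter is attached to the year of its January:

>>> ds_seasonal = xs.utils.unstack_dates(ds, winter_starts_year=True)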

xscen.utils.unstack_fill_nan(ds: Dataset, *, dim: str = 'loc', coords: str | PathLike | Sequence[str | PathLike] | dict | None = None)[source]

Unstack a Dataset that was stacked by stack_drop_nans().

Parameters:
  • ds (xr.Dataset) – A dataset with some dims stacked by stack_drop_nans.

  • dim (str) – The dimension to unstack, same as new_dim in stack_drop_nans.

  • coords (Sequence of strings, Mapping of str to array, str, optional) – If a sequence: if the dataset has coords along dim that are not original dimensions, those original dimensions must be listed here. If a dict: a mapping from the name to the array of the coords to unstack. If a str: a filename to a dataset containing only those coords (as coords). If given a string with {shape} and {domain}, the formatting will fill them with the original shape of the dataset (that should have been stored in the attributes of the stacked dimensions by stack_drop_nans) and the global attribute ‘cat:domain’. It is recommended to fill this argument in the config. It will be parsed automatically. E.g.:

    utils:
      stack_drop_nans:
        to_file: /some_path/coords/coords_{domain}_{shape}.nc
      unstack_fill_nan:
        coords: /some_path/coords/coords_{domain}_{shape}.nc

    If None (default), all coords that have dim a single dimension are used as the new dimensions/coords in the unstacked output. Coordinates will be loaded within this function.

Returns:

xr.Dataset – Same as ds, but dim has been unstacked to coordinates in coords. Missing elements are filled according to the defaults of fill_value of xarray.Dataset.unstack().

xscen.utils.update_attr(ds: Dataset | DataArray, attr: str, new: str, others: Sequence[Dataset | DataArray] | None = None, **fmt) Dataset | DataArray[source]

Format an attribute referencing itself in a translatable way.

Parameters:
  • ds (Dataset or DataArray) – The input object with the attribute to update.

  • attr (str) – Attribute name.

  • new (str) – New attribute as a template string. It may refer to the old version of the attribute with the “{attr}” field.

  • others (Sequence of Datasets or DataArrays) – Other objects from which we can extract the attribute attr. These can be referenced as “{attrXX}” in new, where XX is the based-1 index of the other source in others. If they don’t have the attr attribute, an empty string is sent to the string formatting. See notes.

  • fmt – Other formatting data.

Returns:

ds, but updated with the new version of attr, in each of the activated languages.

Notes

This is meant for constructing attributes by extending a previous version or combining it from different sources. For example, given a ds that has long_name=”Variability”:

>>> update_attr(ds, "long_name", _("Mean of {attr}"))

Will update the “long_name” of ds with long_name=”Mean of Variability”. The use of _(…) allows the detection of this string by the translation manager. The function will be able to add a translatable version of the string for each activated language, for example adding a long_name_fr=”Moyenne de Variabilité” (assuming a long_name_fr was present on the initial ds).

If the new attribute is an aggregation from multiple sources, these can be passed in others.

>>> update_attr(
...     ds0,
...     "long_name",
...     _("Addition of {attr} and {attr1}, divided by {attr2}"),
...     others=[ds1, ds2],
... )

Here, ds0 will have its long_name updated with the passed string, where attr1 is the long_name of ds1 and attr2 the long_name of ds2. The process will be repeated for each localized long_name available on ds0. For example, if ds0 has a long_name_fr, the template string is translated and filled with the long_name_fr attributes of ds0, ds1 and ds2. If the latter don’t exist, the English version is used instead.

xscen.utils.xrfreq_to_timedelta(freq: str)[source]

Approximate the length of a period based on its frequency offset.