1. Using and understanding Catalogs

INFO

Catalogs in xscen are built upon intake-esm’s ESM datastores. For more information on basic usage, such as the search() function, please consult the intake-esm documentation.

Catalogs are made of two files:

  • A JSON file containing metadata such as the catalog’s title, description, etc. It also contains a catalog_file attribute that points to the CSV. Most xscen catalogs will have very similar JSON files.

  • A CSV file containing the catalog itself. This file can be zipped.
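
For illustration, here is a minimal sketch of what such a JSON file can contain, shown as a Python dict. The field names follow the intake-esm ESM collection specification, but the id, description, column choices and file name below are made up:

```python
import json

# Hypothetical minimal catalog JSON, shown as a Python dict.
# Field names follow the intake-esm ESM collection specification;
# the id, description, columns and file names are illustrative only.
minimal_catalog = {
    "esmcat_version": "0.1.0",
    "id": "my-project",
    "description": "Example project catalog",
    "catalog_file": "my-project.csv",  # points to the CSV holding the entries
    "attributes": [
        {"column_name": "variable", "vocabulary": ""},
        {"column_name": "xrfreq", "vocabulary": ""},
    ],
    "assets": {"column_name": "path", "format_column_name": "format"},
}
print(json.dumps(minimal_catalog, indent=2))
```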

Two types of catalogs have been implemented in xscen.

  • Static catalogs: A `DataCatalog <../xscen.rst#xscen.catalog.DataCatalog>`__ is a read-only intake-esm catalog that contains information on all available data. Usually, this type of catalog should only be consulted at the start of a new project.

  • Updatable catalogs: A `ProjectCatalog <../xscen.rst#xscen.catalog.ProjectCatalog>`__ is a DataCatalog with additional write functionalities. This kind of catalog should be used to keep track of the new data created during the course of a project, such as regridded or bias-corrected data, since it can update itself and append new information to the associated CSV file.

NOTE: So as not to accidentally lose data, neither catalog currently provides a function to remove entries from the CSV file. However, upon initialisation and when updating or refreshing itself, the catalog validates that all entries still exist and, if files have been removed manually, deletes their entries from the catalog.

Catalogs in xscen follow a nomenclature that is as close as possible to the Python Earth Science Standard Vocabulary (https://github.com/ES-DOC/pyessv). The columns are listed below; for more details and concrete examples of the entries, consult the relevant page in the documentation:

  • id: Unique DatasetID generated by xscen based on a subset of columns.
  • type: Type of data: [forecast, station-obs, gridded-obs, reconstruction, simulation].
  • processing_level: Level of post-processing reached: [raw, extracted, regridded, biasadjusted].
  • bias_adjust_institution: Institution that computed the bias adjustment.
  • bias_adjust_project: Name of the project that computed the bias adjustment.
  • mip_era: CMIP generation associated with the data.
  • activity: Model Intercomparison Project (MIP) associated with the data.
  • driving_model: Name of the driver.
  • institution: Institution associated with the source.
  • source: Name of the model or the dataset.
  • experiment: Name of the experiment of the model.
  • member: Name of the realisation (or of the driving realisation in the case of RCMs).
  • xrfreq: Pandas/xarray frequency.
  • frequency: Frequency in letters (CMIP6 format).
  • variable: Variable(s) in the dataset.
  • domain: Name of the region covered by the dataset.
  • date_start: First date of the dataset.
  • date_end: Last date of the dataset.
  • version: Version of the dataset.
  • format: Format of the dataset.
  • path: Path to the dataset.

Individual projects may use a different set of columns, but the ones listed above will always be present in the official Ouranos internal catalogs. Some parts of xscen expect certain column names, however, so diverging from the official list should be done with care.

1.1. Basic Catalog Usage

If an official catalog already exists, it should be opened using xs.DataCatalog by pointing it to the JSON file:

[1]:
from pathlib import Path

from xscen import DataCatalog, ProjectCatalog

# Prepare a dummy folder where data will be put
output_folder = Path().absolute() / "_data"
output_folder.mkdir(exist_ok=True)

DC = DataCatalog(f"{Path().absolute()}/samples/pangeo-cmip6.json")
DC

pangeo-cmip6 catalog with 21 dataset(s) from 47 asset(s):

unique
activity 2
institution 2
source 3
experiment 3
member 3
frequency 3
xrfreq 3
variable 4
domain 3
path 47
date_start 2
date_end 2
version 6
id 13
processing_level 1
format 1
mip_era 1
derived_variable 0

The content of the catalog can be accessed through the df property, which returns a pandas.DataFrame.

[2]:
# Access the catalog
DC.df[0:3]
[2]:
activity institution source experiment member frequency xrfreq variable domain path date_start date_end version id processing_level format mip_era
0 CMIP NOAA-GFDL GFDL-CM4 historical r1i1p1f1 3hr 3h (pr,) gr2 gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/histo... 1985-01-01 2014-12-31 20180701 CMIP_NOAA-GFDL_GFDL-CM4_historical_r1i1p1f1_gr2 raw zarr CMIP6
1 CMIP NOAA-GFDL GFDL-CM4 historical r1i1p1f1 3hr 3h (pr,) gr1 gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/histo... 1985-01-01 2014-12-31 20180701 CMIP_NOAA-GFDL_GFDL-CM4_historical_r1i1p1f1_gr1 raw zarr CMIP6
2 CMIP NOAA-GFDL GFDL-CM4 historical r1i1p1f1 day D (pr,) gr2 gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/histo... 1985-01-01 2014-12-31 20180701 CMIP_NOAA-GFDL_GFDL-CM4_historical_r1i1p1f1_gr2 raw zarr CMIP6

The unique function lists the unique elements of either the whole catalog or a subset of columns. It can be called in a few different ways, listed below:

[3]:
# List all unique elements in the catalog, returns a pandas.Series
DC.unique()
[3]:
activity                                          [CMIP, ScenarioMIP]
institution                                          [NOAA-GFDL, NCC]
source                             [GFDL-CM4, NorESM2-LM, NorESM2-MM]
experiment                               [historical, ssp126, ssp585]
member                                 [r1i1p1f1, r3i1p1f1, r2i1p1f1]
frequency                                              [3hr, day, fx]
xrfreq                                                    [3h, D, fx]
variable                                  [pr, tasmin, tasmax, sftlf]
domain                                                 [gr2, gr1, gn]
path                [gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/hist...
date_start                 [1985-01-01 00:00:00, 2015-01-01 00:00:00]
date_end                   [2014-12-31 00:00:00, 2100-12-31 00:00:00]
version             [20180701, 20190815, 20190920, 20191108, 20200...
id                  [CMIP_NOAA-GFDL_GFDL-CM4_historical_r1i1p1f1_g...
processing_level                                                [raw]
format                                                         [zarr]
mip_era                                                       [CMIP6]
dtype: object
[4]:
# List all unique elements in a subset of columns, returns a pandas.Series
DC.unique(["variable", "frequency"])
[4]:
variable     [pr, tasmin, tasmax, sftlf]
frequency                 [3hr, day, fx]
dtype: object
[5]:
# List all unique elements in a single column, returns a list
DC.unique("id")[0:5]
[5]:
['CMIP_NOAA-GFDL_GFDL-CM4_historical_r1i1p1f1_gr2',
 'CMIP_NOAA-GFDL_GFDL-CM4_historical_r1i1p1f1_gr1',
 'CMIP_NCC_NorESM2-LM_historical_r1i1p1f1_gn',
 'CMIP_NCC_NorESM2-LM_historical_r3i1p1f1_gn',
 'CMIP_NCC_NorESM2-LM_historical_r2i1p1f1_gn']

1.1.1. Basic .search() commands

The search function comes from intake-esm and allows searching for specific elements in the catalog’s columns. It accepts both wildcards and regular expressions (except for variable, which must be exact due to being in tuples).

While regex isn’t great at inverse matching (“does not contain”), it is possible. Here are a few useful commands:

- ^string            : Starts with string

- string$            : Ends with string

- ^(?!string).*$     : Does not start with string

- .*(?<!string)$     : Does not end with string

- ^((?!string).)*$   : Does not contain substring

- ^(?!string$).*$    : Is not that exact string

This website can be used to test regex commands: https://regex101.com/
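
These patterns can also be verified with plain Python before using them in a search. The quick standalone check below is independent of xscen and uses only the standard library:

```python
import re

# Standalone check of the patterns listed above.
experiments = ["historical", "ssp126", "ssp585"]

starts_with_ssp = [e for e in experiments if re.match("^ssp", e)]
not_starting_with_ssp = [e for e in experiments if re.match("^(?!ssp).*$", e)]
not_exactly_ssp126 = [e for e in experiments if re.match("^(?!ssp126$).*$", e)]

print(starts_with_ssp)        # ['ssp126', 'ssp585']
print(not_starting_with_ssp)  # ['historical']
print(not_exactly_ssp126)     # ['historical', 'ssp585']
```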

[6]:
# Regex: Find all entries that start with "ssp"
print(DC.search(experiment="^ssp").unique("experiment"))
['ssp126', 'ssp585']
[7]:
# Regex: Exclude all entries that start with "ssp"
print(DC.search(experiment="^(?!ssp).*$").unique("experiment"))
['historical']
[8]:
# Regex: Find all experiments except the exact string "ssp126"
print(DC.search(experiment="^(?!ssp126$).*$").unique("experiment"))
['historical', 'ssp585']
[9]:
# Wildcard: Find all entries that start with NorESM2
print(DC.search(source="NorESM2.*").unique("source"))
['NorESM2-LM', 'NorESM2-MM']

Notice that the search function returns every entry that matches each criterion individually; it does not require that all combinations of criteria exist. Here, sftlf is not available for r1i1p1f1, but the other matches are still returned.

[10]:
# r1i1p1f1 sftlf is not available
DC.search(
    source="NorESM2-MM",
    experiment="historical",
    member=["r1i1p1f1", "r2i1p1f1"],
    variable=["sftlf", "pr"],
).df
[10]:
activity institution source experiment member frequency xrfreq variable domain path date_start date_end version id processing_level format mip_era
0 CMIP NCC NorESM2-MM historical r1i1p1f1 day D (pr,) gn gs://cmip6/CMIP6/CMIP/NCC/NorESM2-MM/historica... 1985-01-01 2014-12-31 20191108 CMIP_NCC_NorESM2-MM_historical_r1i1p1f1_gn raw zarr CMIP6
1 CMIP NCC NorESM2-MM historical r2i1p1f1 day D (pr,) gn gs://cmip6/CMIP6/CMIP/NCC/NorESM2-MM/historica... 1985-01-01 2014-12-31 20200218 CMIP_NCC_NorESM2-MM_historical_r2i1p1f1_gn raw zarr CMIP6
2 CMIP NCC NorESM2-MM historical r2i1p1f1 fx fx (sftlf,) gn gs://cmip6/CMIP6/CMIP/NCC/NorESM2-MM/historica... 1985-01-01 2014-12-31 20200218 CMIP_NCC_NorESM2-MM_historical_r2i1p1f1_gn raw zarr CMIP6

Using require_all_on, you can restrict your search to keep only the entries that match all the criteria across a list of columns.
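
The idea behind require_all_on can be sketched with plain pandas. This is a toy reimplementation for illustration, not intake-esm’s actual code:

```python
import pandas as pd

# Toy catalog: sftlf only exists for r2i1p1f1, while pr exists for both members.
df = pd.DataFrame(
    {
        "member": ["r1i1p1f1", "r2i1p1f1", "r2i1p1f1"],
        "variable": ["pr", "pr", "sftlf"],
    }
)
requested_members = {"r1i1p1f1", "r2i1p1f1"}

# Group on "variable" and keep only groups that cover every requested member.
kept = df.groupby("variable").filter(
    lambda g: requested_members.issubset(g["member"])
)
print(sorted(kept["variable"].unique()))  # ['pr']
```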

[11]:
# Only returns variables that have all members, source and experiment asked for. In this case, pr, but not sftlf.
DC.search(
    source="NorESM2-MM",
    experiment="historical",
    member=["r1i1p1f1", "r2i1p1f1"],
    variable=["sftlf", "pr"],
    require_all_on=["variable"],
).df
[11]:
activity institution source experiment member frequency xrfreq variable domain path date_start date_end version id processing_level format mip_era
0 CMIP NCC NorESM2-MM historical r1i1p1f1 day D (pr,) gn gs://cmip6/CMIP6/CMIP/NCC/NorESM2-MM/historica... 1985-01-01 2014-12-31 20191108 CMIP_NCC_NorESM2-MM_historical_r1i1p1f1_gn raw zarr CMIP6
1 CMIP NCC NorESM2-MM historical r2i1p1f1 day D (pr,) gn gs://cmip6/CMIP6/CMIP/NCC/NorESM2-MM/historica... 1985-01-01 2014-12-31 20200218 CMIP_NCC_NorESM2-MM_historical_r2i1p1f1_gn raw zarr CMIP6

It is also possible to search for files that intersect a specific time period.

[12]:
DC.search(periods=[["2016", "2017"]]).unique(["date_start", "date_end"])
[12]:
date_start    [2015-01-01 00:00:00]
date_end      [2100-12-31 00:00:00]
dtype: object
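
The overlap test implied by periods can be sketched in plain pandas: a file is kept if its time span intersects the requested period. This is an illustrative reimplementation, not xscen’s actual code:

```python
import pandas as pd

# Illustrative overlap test for periods=[["2016", "2017"]]:
# keep a file if [date_start, date_end] intersects [2016, 2017].
date_start = pd.Timestamp("2015-01-01")
date_end = pd.Timestamp("2100-12-31")
period_start, period_end = pd.Timestamp("2016-01-01"), pd.Timestamp("2017-12-31")

overlaps = (date_start <= period_end) and (date_end >= period_start)
print(overlaps)  # True: the 2015-2100 file intersects 2016-2017
```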

1.1.2. Advanced search: xs.search_data_catalogs

search has multiple notable limitations for more advanced searches:

  • It can’t match specific criteria together, such as finding a dataset that would have both 3h precipitation and daily temperature.

  • It has no explicit understanding of climate datasets, and thus can’t match historical and future simulations together or know how realization members and grid resolutions work.

xs.search_data_catalogs was thus created as a more advanced version that is closer to the needs of climate services. It also plays the double role of preparing certain arguments for the extraction function.

Due to how different reference datasets are from climate simulations, this function might have to be called multiple times and the results concatenated into a single dictionary. The main arguments are:

  • variables_and_freqs is used to indicate which variables and frequencies are required. NOTE: With the exception of fixed fields, where ‘fx’ should be used, frequencies here use the pandas nomenclature (‘D’, ‘H’, ‘6H’, ‘MS’, etc.).

  • other_search_criteria is used to search for specific entries in other columns of the catalog, such as activity. require_all_on can also be passed here.

  • exclusions is used to exclude certain simulations or keywords from the results.

  • match_hist_and_fut is used to indicate that RCP/SSP simulations should be matched with their historical counterparts.

  • periods is used to search for specific time periods.

  • allow_resampling is used to allow searching for data at higher frequencies than requested.

  • allow_conversion is used to allow searching for calculable variables, in the case where the requested variable would not be available.

  • restrict_resolution is used to limit the results to the finest or coarsest resolution available for each source.

  • restrict_members is used to limit the results to a maximum number of realizations for each source.

  • restrict_warming_level is used to limit the results to only datasets that are present in the csv used for calculating warming levels. You can also pass a dict to verify that a given warming level is reached.

Note that compared to search, the result of search_data_catalogs is a dictionary with one entry per unique ID. A given unique ID might contain multiple datasets as per intake-esm’s definition, because catalog lines are grouped per id - domain - processing_level - xrfreq. Thus, model data that exists at different frequencies is kept separate.
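
Since each result is keyed by its unique ID, the outputs of separate calls (for example, one search for reference data and one for simulations) can simply be merged into a single dictionary. A minimal sketch, with placeholder keys and values rather than real catalog entries:

```python
# Placeholder results from two hypothetical search_data_catalogs calls,
# keyed by unique dataset ID as in the real function's output.
cat_ref = {"reconstruction_ECMWF_ERA5_NAM": "subcatalog for the reference data"}
cat_sim = {"ScenarioMIP_NCC_NorESM2-MM_ssp585_r1i1p1f1_gn": "subcatalog for the simulation"}

# Unique IDs do not collide between the two searches, so a plain merge is safe.
cat_all = {**cat_ref, **cat_sim}
print(sorted(cat_all))
```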

1.1.2.1. Example 1: Multiple variables and frequencies + Historical and future

Let’s start by searching for CMIP6 data that has subdaily precipitation, daily minimum temperature and the land fraction data. The main difference compared to searching for reference datasets is that in most cases, match_hist_and_fut will be required to match historical simulations to their future counterparts. This works for both CMIP5 and CMIP6 nomenclatures.

[13]:
import xscen as xs

variables_and_freqs = {"tasmin": "D", "pr": "3h", "sftlf": "fx"}
other_search_criteria = {"institution": ["NOAA-GFDL"]}

cat_sim = xs.search_data_catalogs(
    data_catalogs=[f"{Path().absolute()}/samples/pangeo-cmip6.json"],
    variables_and_freqs=variables_and_freqs,
    other_search_criteria=other_search_criteria,
    match_hist_and_fut=True,
)

cat_sim
INFO:xscen.extract:Catalog opened: <pangeo-cmip6 catalog with 21 dataset(s) from 47 asset(s)> from 1 files.
INFO:xscen.extract:Dispatching historical dataset to future experiments.
INFO:xscen.extract:16 assets matched the criteria : {'institution': ['NOAA-GFDL']}.
INFO:xscen.extract:Iterating over 2 potential datasets.
INFO:xscen.extract:Found 2 with all variables requested and corresponding to the criteria.
[13]:
{'ScenarioMIP_NOAA-GFDL_GFDL-CM4_ssp585_r1i1p1f1_gr2': <pangeo-cmip6 catalog with 3 dataset(s) from 4 asset(s)>,
 'ScenarioMIP_NOAA-GFDL_GFDL-CM4_ssp585_r1i1p1f1_gr1': <pangeo-cmip6 catalog with 3 dataset(s) from 4 asset(s)>}

If required, a dataset can be examined in more detail at this stage. Looking at the ‘date_start’ and ‘date_end’ columns of the results, we see that the function successfully found historical simulations in the CMIP activity and renamed both their activity and experiment to match the future simulations.

[14]:
cat_sim["ScenarioMIP_NOAA-GFDL_GFDL-CM4_ssp585_r1i1p1f1_gr1"].df
[14]:
activity institution source experiment member frequency xrfreq variable domain path date_start date_end version id processing_level format mip_era
0 ScenarioMIP NOAA-GFDL GFDL-CM4 ssp585 r1i1p1f1 day D (tasmin,) gr1 gs://cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-CM... 2015-01-01 2100-12-31 20180701 ScenarioMIP_NOAA-GFDL_GFDL-CM4_ssp585_r1i1p1f1... raw zarr CMIP6
1 ScenarioMIP NOAA-GFDL GFDL-CM4 ssp585 r1i1p1f1 day D (tasmin,) gr1 gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/histo... 1985-01-01 2014-12-31 20180701 ScenarioMIP_NOAA-GFDL_GFDL-CM4_ssp585_r1i1p1f1... raw zarr CMIP6
2 ScenarioMIP NOAA-GFDL GFDL-CM4 ssp585 r1i1p1f1 3hr 3h (pr,) gr1 gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/histo... 1985-01-01 2014-12-31 20180701 ScenarioMIP_NOAA-GFDL_GFDL-CM4_ssp585_r1i1p1f1... raw zarr CMIP6
3 ScenarioMIP NOAA-GFDL GFDL-CM4 ssp585 r1i1p1f1 fx fx (sftlf,) gr1 gs://cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-CM... 2015-01-01 2100-12-31 20180701 ScenarioMIP_NOAA-GFDL_GFDL-CM4_ssp585_r1i1p1f1... raw zarr CMIP6

1.1.2.2. Example 2: Restricting results

The two previous search results were the same simulation, but on 2 different grids (gr1 and gr2). If desired, restrict_resolution can be called to choose the finest or coarsest grid.

[15]:
variables_and_freqs = {"tasmin": "D", "pr": "3h", "sftlf": "fx"}
other_search_criteria = {"institution": ["NOAA-GFDL"], "experiment": ["ssp585"]}

cat_sim = xs.search_data_catalogs(
    data_catalogs=[f"{Path().absolute()}/samples/pangeo-cmip6.json"],
    variables_and_freqs=variables_and_freqs,
    other_search_criteria=other_search_criteria,
    match_hist_and_fut=True,
    restrict_resolution="finest",
)

cat_sim
INFO:xscen.extract:Catalog opened: <pangeo-cmip6 catalog with 21 dataset(s) from 47 asset(s)> from 1 files.
INFO:xscen.extract:Dispatching historical dataset to future experiments.
INFO:xscen.extract:16 assets matched the criteria : {'institution': ['NOAA-GFDL'], 'experiment': ['ssp585']}.
INFO:xscen.extract:Iterating over 2 potential datasets.
INFO:xscen.extract:Found 2 with all variables requested and corresponding to the criteria.
INFO:xscen.extract:Dataset CMIP6_r1i1p1f1_GFDL-CM4_NOAA-GFDL_ssp585_ScenarioMIP appears to have multiple resolutions.
INFO:xscen.extract:Removing ScenarioMIP_NOAA-GFDL_GFDL-CM4_ssp585_r1i1p1f1_gr2 from the results.
[15]:
{'ScenarioMIP_NOAA-GFDL_GFDL-CM4_ssp585_r1i1p1f1_gr1': <pangeo-cmip6 catalog with 3 dataset(s) from 4 asset(s)>}

Similarly, if we search for historical NorESM2-MM data, we’ll find that it has 3 members. If desired, restrict_members can be called to choose a maximum number of realizations per model.

[16]:
variables_and_freqs = {"tasmin": "D"}
other_search_criteria = {"source": ["NorESM2-MM"], "experiment": ["historical"]}

cat_sim = xs.search_data_catalogs(
    data_catalogs=[f"{Path().absolute()}/samples/pangeo-cmip6.json"],
    variables_and_freqs=variables_and_freqs,
    other_search_criteria=other_search_criteria,
    restrict_members={"ordered": 2},
)

cat_sim
INFO:xscen.extract:Catalog opened: <pangeo-cmip6 catalog with 21 dataset(s) from 47 asset(s)> from 1 files.
INFO:xscen.extract:11 assets matched the criteria : {'source': ['NorESM2-MM'], 'experiment': ['historical']}.
INFO:xscen.extract:Iterating over 3 potential datasets.
INFO:xscen.extract:Found 3 with all variables requested and corresponding to the criteria.
INFO:xscen.extract:Dataset gn_CMIP6_NorESM2-MM_NCC_historical_CMIP has 3 valid members. Restricting as per requested.
INFO:xscen.extract:Removing CMIP_NCC_NorESM2-MM_historical_r3i1p1f1_gn from the results.
[16]:
{'CMIP_NCC_NorESM2-MM_historical_r1i1p1f1_gn': <pangeo-cmip6 catalog with 1 dataset(s) from 1 asset(s)>,
 'CMIP_NCC_NorESM2-MM_historical_r2i1p1f1_gn': <pangeo-cmip6 catalog with 1 dataset(s) from 1 asset(s)>}

Finally, restrict_warming_level can be used to ensure that the results either exist in xscen’s warming level database (when given a boolean) or reach a given warming level (when given a dict).

[17]:
variables_and_freqs = {"tasmin": "D"}

cat_sim = xs.search_data_catalogs(
    data_catalogs=[f"{Path().absolute()}/samples/pangeo-cmip6.json"],
    variables_and_freqs=variables_and_freqs,
    match_hist_and_fut=True,
    restrict_warming_level={
        "wl": 2
    },  # SSP126  gets eliminated, since it doesn't reach +2°C by 2100.
)

cat_sim
INFO:xscen.extract:Catalog opened: <pangeo-cmip6 catalog with 21 dataset(s) from 47 asset(s)> from 1 files.
INFO:xscen.extract:Dispatching historical dataset to future experiments.
INFO:xscen.extract:Global warming level of +2C is not reached by the last year (2100) of the provided 'tas_src' database for CMIP6_NorESM2-MM_ssp126_r1i1p1f1.
INFO:xscen.extract:Removing the following datasets because of the restriction for warming levels: ['ScenarioMIP_NCC_NorESM2-MM_ssp126_r1i1p1f1_gn']
INFO:xscen.extract:Iterating over 4 potential datasets.
INFO:xscen.extract:Found 4 with all variables requested and corresponding to the criteria.
[17]:
{'ScenarioMIP_NOAA-GFDL_GFDL-CM4_ssp585_r1i1p1f1_gr2': <pangeo-cmip6 catalog with 1 dataset(s) from 2 asset(s)>,
 'ScenarioMIP_NOAA-GFDL_GFDL-CM4_ssp585_r1i1p1f1_gr1': <pangeo-cmip6 catalog with 1 dataset(s) from 2 asset(s)>,
 'ScenarioMIP_NCC_NorESM2-MM_ssp585_r1i1p1f1_gn': <pangeo-cmip6 catalog with 1 dataset(s) from 2 asset(s)>,
 'ScenarioMIP_NCC_NorESM2-LM_ssp585_r1i1p1f1_gn': <pangeo-cmip6 catalog with 1 dataset(s) from 2 asset(s)>}

1.1.2.3. Example 3: Search for data that can be computed from what’s available

allow_resampling and allow_conversion are powerful search tools for finding data that doesn’t explicitly exist in the catalog, but that can easily be computed.

[18]:
cat_sim_adv = xs.search_data_catalogs(
    data_catalogs=[f"{Path().absolute()}/samples/pangeo-cmip6.json"],
    variables_and_freqs={"evspsblpot": "D", "tas": "YS"},
    other_search_criteria={"source": ["NorESM2-MM"], "processing_level": ["raw"]},
    match_hist_and_fut=True,
    allow_resampling=True,
    allow_conversion=True,
)
cat_sim_adv
INFO:xscen.extract:Catalog opened: <pangeo-cmip6 catalog with 21 dataset(s) from 47 asset(s)> from 1 files.
INFO:xscen.extract:Dispatching historical dataset to future experiments.
INFO:xscen.extract:12 assets matched the criteria : {'source': ['NorESM2-MM'], 'processing_level': ['raw']}.
INFO:xscen.extract:Iterating over 2 potential datasets.
INFO:xscen.extract:Found 2 with all variables requested and corresponding to the criteria.
[18]:
{'ScenarioMIP_NCC_NorESM2-MM_ssp126_r1i1p1f1_gn': <pangeo-cmip6 catalog with 1 dataset(s) from 4 asset(s)>,
 'ScenarioMIP_NCC_NorESM2-MM_ssp585_r1i1p1f1_gn': <pangeo-cmip6 catalog with 1 dataset(s) from 4 asset(s)>}

If we examine the SSP5-8.5 results, we’ll see that while the search failed to find evspsblpot, it successfully understood that tasmin and tasmax can be used to compute it. It also understood that daily tasmin and tasmax are a valid search result for {tas: YS}, since tas can be computed first, then aggregated to a yearly frequency.

[19]:
cat_sim_adv["ScenarioMIP_NCC_NorESM2-MM_ssp585_r1i1p1f1_gn"].unique()
[19]:
activity                                                [ScenarioMIP]
institution                                                     [NCC]
source                                                   [NorESM2-MM]
experiment                                                   [ssp585]
member                                                     [r1i1p1f1]
frequency                                                       [day]
xrfreq                                                            [D]
variable                                             [tasmax, tasmin]
domain                                                           [gn]
path                [gs://cmip6/CMIP6/ScenarioMIP/NCC/NorESM2-MM/s...
date_start                 [2015-01-01 00:00:00, 1985-01-01 00:00:00]
date_end                   [2100-12-31 00:00:00, 2014-12-31 00:00:00]
version                                                    [20191108]
id                    [ScenarioMIP_NCC_NorESM2-MM_ssp585_r1i1p1f1_gn]
processing_level                                                [raw]
format                                                         [zarr]
mip_era                                                       [CMIP6]
dtype: object

It’s also possible to search for multiple frequencies at the same time by passing a list of xrfreq values.

[20]:
cat_sim_adv_multifreq = xs.search_data_catalogs(
    data_catalogs=[f"{Path().absolute()}/samples/pangeo-cmip6.json"],
    variables_and_freqs={"tas": ["D", "MS", "YS"]},
    other_search_criteria={
        "source": ["NorESM2-MM"],
        "processing_level": ["raw"],
        "experiment": ["ssp585"],
    },
    match_hist_and_fut=True,
    allow_resampling=True,
    allow_conversion=True,
)
print(
    cat_sim_adv_multifreq[
        "ScenarioMIP_NCC_NorESM2-MM_ssp585_r1i1p1f1_gn"
    ]._requested_variable_freqs
)
INFO:xscen.extract:Catalog opened: <pangeo-cmip6 catalog with 21 dataset(s) from 47 asset(s)> from 1 files.
INFO:xscen.extract:Dispatching historical dataset to future experiments.
INFO:xscen.extract:6 assets matched the criteria : {'source': ['NorESM2-MM'], 'processing_level': ['raw'], 'experiment': ['ssp585']}.
INFO:xscen.extract:Iterating over 1 potential datasets.
INFO:xscen.extract:Found 1 with all variables requested and corresponding to the criteria.
['D', 'MS', 'YS']

1.1.2.4. Derived variables

The allow_conversion argument is built upon xclim’s virtual indicators module and intake-esm’s DerivedVariableRegistry in a way that should be seamless to the user. It works by using the methods defined in xscen/xclim_modules/conversions.yml to build a registry of derived variables that exist virtually through computation methods.

In the example above, we can see that the search failed to find evspsblpot within NorESM2-MM, but understood that tasmin and tasmax could be used to estimate it using xclim’s potential_evapotranspiration.

Most use cases should already be covered by the aforementioned file. The preferred way to add new methods is to submit a new indicator to xclim, and then to add a call to that indicator in conversions.yml. In the case where this is not possible or where the transformation would be out of scope for xclim, the calculation can be implemented into xscen/xclim_modules/conversions.py instead.

Alternatively, if other functions or other parameters are required for a specific use case (e.g. using relative_humidity instead of relative_humidity_from_dewpoint, or using a different formula), then a custom YAML file can be used. This custom file can be referred to using the conversion_yaml argument of search_data_catalogs.

.derivedcat can be accessed on a catalog to obtain the list of DerivedVariable entries and the function associated with each of them. In addition, ._requested_variables will display the list of variables that will be opened by the to_dataset_dict() function, including DerivedVariables.

WARNING

_requested_variables should NOT be modified under any circumstances, as doing so is likely to make to_dataset_dict() fail. To add some transparency on which variables were requested and which ones they depend on, xscen adds _requested_variables_true and _dependent_variables. This is very likely to change in the future.

[21]:
cat_sim_adv["ScenarioMIP_NCC_NorESM2-MM_ssp585_r1i1p1f1_gn"].derivedcat
[21]:
DerivedVariableRegistry({'evspsblpot': DerivedVariable(func=functools.partial(<function _derived_func.<locals>.func at 0x7f9363816ca0>, ind=<xclim.indicators.conversions.POTENTIAL_EVAPOTRANSPIRATION object at 0x7f936384cbf0>, nout=0), variable='evspsblpot', query={'variable': ['tasmin', 'tasmax']}, prefer_derived=False), 'tas': DerivedVariable(func=functools.partial(<function _derived_func.<locals>.func at 0x7f93638168e0>, ind=<xclim.indicators.conversions.TAS_MIDPOINT object at 0x7f9363999370>, nout=0), variable='tas', query={'variable': ['tasmin', 'tasmax']}, prefer_derived=False)})
[22]:
print(cat_sim_adv["ScenarioMIP_NCC_NorESM2-MM_ssp585_r1i1p1f1_gn"]._requested_variables)
print(
    f"Requested: {cat_sim_adv['ScenarioMIP_NCC_NorESM2-MM_ssp585_r1i1p1f1_gn']._requested_variables_true}"
)
print(
    f"Dependent: {cat_sim_adv['ScenarioMIP_NCC_NorESM2-MM_ssp585_r1i1p1f1_gn']._dependent_variables}"
)
['tasmax', 'evspsblpot', 'tasmin', 'tas', 'tasmax', 'tasmin']
Requested: ['evspsblpot', 'tas']
Dependent: ['tasmax', 'tasmin', 'tasmax', 'tasmin']

INFO

allow_conversion currently fails if:

  • The requested DerivedVariable also requires a DerivedVariable itself.

  • The dependent variables exist at different frequencies (e.g. ‘pr @1hr’ & ‘tas @3hr’)

1.2. Creating a New Catalog from a Directory

1.2.1. Initialisation

The create argument of ProjectCatalog can be used to create an empty ProjectCatalog, writing a new set of JSON and CSV files.

By default, xscen will populate the JSON with generic information, defined in catalog.esm_col_data. That metadata can be changed using the project argument with entries compatible with the ESM Catalog Specification (refer to the link above). Usually, the most useful and common entries will be:

  • title

  • description

xscen will also instruct intake_esm to group catalog lines per id - domain - processing_level - xrfreq. This should be adequate for most uses. In the case that it is not, the following can be added to project:

  • “aggregation_control”: {“groupby_attrs”: [list_of_columns]}

Other attributes and behaviours of the project definition can be modified in a similar way.
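
For instance, a project definition overriding the default grouping might look like this (the title, description and column list here are hypothetical; only the "aggregation_control"/"groupby_attrs" structure is prescribed):

```python
# Hypothetical project definition with a custom grouping of catalog lines.
project = {
    "title": "my-catalog",
    "description": "Example catalog with custom grouping.",
    # Group datasets only by these three columns instead of the default four.
    "aggregation_control": {"groupby_attrs": ["id", "domain", "xrfreq"]},
}
```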

[23]:
project = {
    "title": "tutorial-catalog",
    "description": "Catalog for the tutorial NetCDFs.",
}

PC = ProjectCatalog(
    str(output_folder / "tutorial-catalog.json"),
    create=True,
    project=project,
    overwrite=True,
)
Successfully wrote ESM catalog json file to: file:///home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/_data/tutorial-catalog.json
[24]:
# The metadata is stored in PC.esmcat
PC.esmcat
[24]:
ESMCatalogModel(esmcat_version='0.1.0', attributes=[], assets=Assets(column_name='path', format=None, format_column_name='format'), aggregation_control=AggregationControl(variable_column_name='variable', groupby_attrs=['id', 'domain', 'processing_level', 'xrfreq'], aggregations=[Aggregation(type=<AggregationType.join_existing: 'join_existing'>, attribute_name='date_start', options={'dim': 'time'}), Aggregation(type=<AggregationType.union: 'union'>, attribute_name='variable', options={})]), id='tutorial-catalog', catalog_dict=None, catalog_file='/home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/_data/tutorial-catalog.csv', description='Catalog for the tutorial NetCDFs.', title='tutorial-catalog', last_updated=datetime.datetime(2024, 4, 19, 14, 1, 8, tzinfo=TzInfo(UTC)))

1.2.2. Appending new data to a ProjectCatalog

At this stage, the CSV is still empty. There are two main ways to populate a catalog with data:

  • Using xs.ProjectCatalog.update_from_ds to append a Dataset and populate the catalog columns using metadata.

  • Using xs.catutils.parse_directory to parse through existing NetCDF or Zarr data and decode their information based on file and directory names.

This tutorial will focus on catutils.parse_directory, since update_from_ds is primarily meant to be called during a climate-scenario-generation workflow. See the Getting Started tutorial for more details on update_from_ds.

1.2.2.1. Parsing a directory

INFO

If you are an Ouranos employee, this section should be of limited use (unless you need to retroactively parse a directory containing existing datasets). Please consult the existing Ouranos catalogs using xs.search_data_catalogs instead.

The `parse_directory <../xscen.rst#xscen.catutils.parse_directory>`__ function relies on patterns to adequately decode file and directory names and store that information in the catalog.

  • Patterns are a succession of column names in curly brackets. See below for examples. The pattern starts where the directory path stops.

  • If necessary, read_from_file can be used to open the files and read metadata from global attributes. Refer to the API for Docstrings and usage.

  • In cases where some column information is the same across all data, homogenous_info can be used to explicitly assign an attribute to the datasets being processed.

  • Anything that isn’t filled will be marked as None.

The following example will search through the samples folder and infer information from the folder names. The filename is ignored, except its extension. The variable name and time bounds are read from the file itself.

[25]:
from xscen.catutils import parse_directory

df = parse_directory(
    directories=[f"{Path().absolute()}/samples/tutorial/"],
    patterns=[
        "{activity}/{domain}/{institution}/{source}/{experiment}/{member}/{frequency}/{?:_}.nc"
    ],
    homogenous_info={
        "mip_era": "CMIP6",
        "type": "simulation",
        "processing_level": "raw",
    },
    read_from_file=["variable", "date_start", "date_end"],
)
df
INFO:xscen.catutils:Parsing attributes with netCDF4 from /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/samples/tutorial/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp370/r1i1p1f1/fx/ScenarioMIP_NCC_NorESM2-MM_ssp370_r1i1p1f1_gn_raw.nc.
INFO:xscen.catutils:Parsing attributes with netCDF4 from /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/samples/tutorial/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp370/r1i1p1f1/day/ScenarioMIP_NCC_NorESM2-MM_ssp370_r1i1p1f1_gn_raw.nc.
INFO:xscen.catutils:Parsing attributes with netCDF4 from /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/samples/tutorial/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp245/r2i1p1f1/fx/ScenarioMIP_NCC_NorESM2-MM_ssp245_r2i1p1f1_gn_raw.nc.
INFO:xscen.catutils:Parsing attributes with netCDF4 from /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/samples/tutorial/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp245/r2i1p1f1/day/ScenarioMIP_NCC_NorESM2-MM_ssp245_r2i1p1f1_gn_raw.nc.
INFO:xscen.catutils:Parsing attributes with netCDF4 from /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/samples/tutorial/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp245/r1i1p1f1/fx/ScenarioMIP_NCC_NorESM2-MM_ssp245_r1i1p1f1_gn_raw.nc.
INFO:xscen.catutils:Parsing attributes with netCDF4 from /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/samples/tutorial/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp245/r1i1p1f1/day/ScenarioMIP_NCC_NorESM2-MM_ssp245_r1i1p1f1_gn_raw.nc.
INFO:xscen.catutils:Parsing attributes with netCDF4 from /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/samples/tutorial/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp585/r1i1p1f1/fx/ScenarioMIP_NCC_NorESM2-MM_ssp585_r1i1p1f1_gn_raw.nc.
INFO:xscen.catutils:Parsing attributes with netCDF4 from /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/samples/tutorial/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp585/r1i1p1f1/day/ScenarioMIP_NCC_NorESM2-MM_ssp585_r1i1p1f1_gn_raw.nc.
INFO:xscen.catutils:Parsing attributes with netCDF4 from /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/samples/tutorial/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp126/r1i1p1f1/fx/ScenarioMIP_NCC_NorESM2-MM_ssp126_r1i1p1f1_gn_raw.nc.
INFO:xscen.catutils:Parsing attributes with netCDF4 from /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/samples/tutorial/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp126/r1i1p1f1/day/ScenarioMIP_NCC_NorESM2-MM_ssp126_r1i1p1f1_gn_raw.nc.
INFO:xscen.catutils:Found and parsed 10 files.
[25]:
id type processing_level bias_adjust_institution bias_adjust_project mip_era activity driving_model institution source ... member xrfreq frequency variable domain date_start date_end version format path
0 CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp370_r1i1p1... simulation raw None None CMIP6 ScenarioMIP None NCC NorESM2-MM ... r1i1p1f1 fx fx (sftlf,) example-region NaT NaT None nc /home/docs/checkouts/readthedocs.org/user_buil...
1 CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp370_r1i1p1... simulation raw None None CMIP6 ScenarioMIP None NCC NorESM2-MM ... r1i1p1f1 D day (tas,) example-region 2001-01-01 2002-12-31 None nc /home/docs/checkouts/readthedocs.org/user_buil...
2 CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp245_r2i1p1... simulation raw None None CMIP6 ScenarioMIP None NCC NorESM2-MM ... r2i1p1f1 fx fx (sftlf,) example-region NaT NaT None nc /home/docs/checkouts/readthedocs.org/user_buil...
3 CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp245_r2i1p1... simulation raw None None CMIP6 ScenarioMIP None NCC NorESM2-MM ... r2i1p1f1 D day (tas,) example-region 2001-01-01 2002-12-31 None nc /home/docs/checkouts/readthedocs.org/user_buil...
4 CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp245_r1i1p1... simulation raw None None CMIP6 ScenarioMIP None NCC NorESM2-MM ... r1i1p1f1 fx fx (sftlf,) example-region NaT NaT None nc /home/docs/checkouts/readthedocs.org/user_buil...
5 CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp245_r1i1p1... simulation raw None None CMIP6 ScenarioMIP None NCC NorESM2-MM ... r1i1p1f1 D day (tas,) example-region 2001-01-01 2002-12-31 None nc /home/docs/checkouts/readthedocs.org/user_buil...
6 CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp585_r1i1p1... simulation raw None None CMIP6 ScenarioMIP None NCC NorESM2-MM ... r1i1p1f1 fx fx (sftlf,) example-region NaT NaT None nc /home/docs/checkouts/readthedocs.org/user_buil...
7 CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp585_r1i1p1... simulation raw None None CMIP6 ScenarioMIP None NCC NorESM2-MM ... r1i1p1f1 D day (tas,) example-region 2001-01-01 2002-12-31 None nc /home/docs/checkouts/readthedocs.org/user_buil...
8 CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp126_r1i1p1... simulation raw None None CMIP6 ScenarioMIP None NCC NorESM2-MM ... r1i1p1f1 fx fx (sftlf,) example-region NaT NaT None nc /home/docs/checkouts/readthedocs.org/user_buil...
9 CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp126_r1i1p1... simulation raw None None CMIP6 ScenarioMIP None NCC NorESM2-MM ... r1i1p1f1 D day (tas,) example-region 2001-01-01 2002-12-31 None nc /home/docs/checkouts/readthedocs.org/user_buil...

10 rows × 21 columns

1.2.2.2. Unique Dataset ID

In addition to the parsing itself, parse_directory will create a unique dataset ID that can be used to easily distinguish one simulation from another. This can be edited with the id_columns argument of parse_directory, but by default, IDs are based on CMIP6’s ID structure with additions related to regional models and bias adjustment:

  • {bias_adjust_project} _ {mip_era} _ {activity} _ {driving_model} _ {institution} _ {source} _ {experiment} _ {member} _ {domain}

This utility can also be called by itself through xs.catalog.generate_id().

INFO

When constructing IDs, empty columns will be skipped.
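
The skipping of empty columns can be sketched in plain Python (this helper and its column list are illustrative, not xscen's actual implementation of generate_id):

```python
def build_id(facets, columns):
    """Join non-empty facets with underscores, skipping empty columns."""
    return "_".join(str(facets[c]) for c in columns if facets.get(c))


facets = {
    "mip_era": "CMIP6",
    "activity": "ScenarioMIP",
    "driving_model": None,  # empty column: skipped in the ID
    "institution": "NCC",
    "source": "NorESM2-MM",
    "experiment": "ssp370",
    "member": "r1i1p1f1",
    "domain": "example-region",
}
columns = [
    "bias_adjust_project", "mip_era", "activity", "driving_model",
    "institution", "source", "experiment", "member", "domain",
]
build_id(facets, columns)
# → 'CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp370_r1i1p1f1_example-region'
```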

[26]:
df.iloc[0]["id"]
[26]:
'CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp370_r1i1p1f1_example-region'

1.2.2.3. Appending data using ProjectCatalog.update()

At this stage, df is a pandas.DataFrame. ProjectCatalog.update is used to append this data to the CSV file and save the results on disk.

[27]:

tutorial-catalog catalog with 10 dataset(s) from 10 asset(s):

unique
id 5
type 1
processing_level 1
bias_adjust_institution 0
bias_adjust_project 0
mip_era 1
activity 1
driving_model 0
institution 1
source 1
experiment 4
member 2
xrfreq 2
frequency 2
variable 2
domain 1
date_start 1
date_end 1
version 0
format 1
path 10
derived_variable 0

1.2.2.4. More on patterns and advanced features

The patterns argument acts as a reverse format string.

  • The “_” format specifier (like in {field:_}) allows matching a name containing underscores for this field. The path separators (/, \) are still excluded. Any format specifier supported by `parse <https://github.com/r1chardj0n3s/parse>`__ is usable.

  • Fields starting with a “?” will be ignored in the output. This allows readable patterns that identify parts we know exist, but that we do not want included in the metadata.

  • The DATES special field will match single dates or date bounds (see below).

  • {?:_} is useful in filenames as a “wildcard” match. For example, {?:_}_{DATES}.nc will read the last “element” of the filename into date_start and date_end, ignoring all previous parts.
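
The "reverse format string" idea can be illustrated with a named-group regex (a hypothetical sketch only; the real matching is done by the `parse` library, which handles many more specifiers):

```python
import re


def pattern_to_regex(pattern):
    # Toy translation of a "{field}" pattern into a regex with named groups.
    # Plain fields exclude underscores and path separators; "{field:_}"
    # allows underscores but still excludes path separators.
    def repl(m):
        name, spec = m.group(1), m.group(2)
        chars = r"[^/\\]" if spec == "_" else r"[^_/\\]"
        return f"(?P<{name}>{chars}+)"

    return re.compile(re.sub(r"\{(\w+)(?::(\w+))?\}", repl, pattern) + "$")


m = pattern_to_regex("{institution}/{source}/{frequency}").match("CCCma/CanESM2/day")
m.groupdict()  # → {'institution': 'CCCma', 'source': 'CanESM2', 'frequency': 'day'}
```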

[28]:
# Create fake files for the example:
root = Path(".").absolute() / "_data" / "parser_examples"
root.mkdir(exist_ok=True)

paths = [
    # Folder name includes underscore, single year implicitly means the full year
    "CCCma/CanESM2/day/tg_mean/tg_mean_1950.nc",
    # Fx frequency, no date bounds, strange model name
    "CCCma/CanESM-2/fx/sftlf/sftlf_fx.nc",
    # Bounds given as range at a monthly frequency
    "MIROC/MIROC6/mon/uas/uas_199901-200011.nc",
    # Version number included in the source name, range given as years
    "ERA/ERA5_v2/yr/heat_wave_frequency/hwf_2100-2399.nc",
]
for path in paths:
    (root / path).parent.mkdir(exist_ok=True, parents=True)
    with (root / path).open("w") as f:
        f.write("example")
1.2.2.4.1. Example 1 - Wrong

The variable field does not allow underscores, so the first and last files are not parsed correctly.

Notice how the DATES field was parsed into date_start and date_end. It also matched with fx, returning NaT for both fields, as expected.

[29]:
patt = "{institution}/{source}/{frequency}/{variable}/{?var}_{DATES}.nc"
parse_directory(directories=[root], patterns=[patt])
INFO:xscen.catutils:Found and parsed 2 files.
[29]:
id type processing_level bias_adjust_institution bias_adjust_project mip_era activity driving_model institution source ... member xrfreq frequency variable domain date_start date_end version format path
0 CCCma_CanESM-2 None None None None None None None CCCma CanESM-2 ... None fx fx sftlf None NaT NaT None nc /home/docs/checkouts/readthedocs.org/user_buil...
1 MIROC_MIROC6 None None None None None None None MIROC MIROC6 ... None MS mon uas None 1999-01-01 2000-11-30 23:59:59 None nc /home/docs/checkouts/readthedocs.org/user_buil...

2 rows × 21 columns

1.2.2.4.2. Example 2 - Wrong again

We fixed the variable field by allowing underscores. We also modified the filename pattern to match any string, including underscores, except for the last element.

Notice how the “1950” part of tg_mean has been converted to date_start='1950-01-01' and date_end='1950-12-31'.

The source field does not allow underscores, so “ERA5_v2” is not parsed correctly. However, what we actually want is to assign “v2” to the version field.

[30]:
patt = "{institution}/{source}/{frequency}/{variable:_}/{?:_}_{DATES}.nc"
parse_directory(directories=[root], patterns=[patt])
INFO:xscen.catutils:Found and parsed 3 files.
[30]:
id type processing_level bias_adjust_institution bias_adjust_project mip_era activity driving_model institution source ... member xrfreq frequency variable domain date_start date_end version format path
0 CCCma_CanESM-2 None None None None None None None CCCma CanESM-2 ... None fx fx sftlf None NaT NaT None nc /home/docs/checkouts/readthedocs.org/user_buil...
1 CCCma_CanESM2 None None None None None None None CCCma CanESM2 ... None D day tg_mean None 1950-01-01 1950-12-31 23:59:59 None nc /home/docs/checkouts/readthedocs.org/user_buil...
2 MIROC_MIROC6 None None None None None None None MIROC MIROC6 ... None MS mon uas None 1999-01-01 2000-11-30 23:59:59 None nc /home/docs/checkouts/readthedocs.org/user_buil...

3 rows × 21 columns

1.2.2.4.3. Example 3 - Correct!

We added a second pattern that includes the version field.

[31]:
patts = [
    "{institution}/{source}_{version}/{frequency}/{variable:_}/{?:_}_{DATES}.nc",
    "{institution}/{source}/{frequency}/{variable:_}/{?:_}_{DATES}.nc",
]
parse_directory(directories=[root], patterns=patts)
INFO:xscen.catutils:Found and parsed 4 files.
[31]:
id type processing_level bias_adjust_institution bias_adjust_project mip_era activity driving_model institution source ... member xrfreq frequency variable domain date_start date_end version format path
0 ERA_ERA5 None None None None None None None ERA ERA5 ... None YS yr heat_wave_frequency None 2100-01-01 2399-12-31 23:59:59 v2 nc /home/docs/checkouts/readthedocs.org/user_buil...
1 CCCma_CanESM-2 None None None None None None None CCCma CanESM-2 ... None fx fx sftlf None NaT NaT NaN nc /home/docs/checkouts/readthedocs.org/user_buil...
2 CCCma_CanESM2 None None None None None None None CCCma CanESM2 ... None D day tg_mean None 1950-01-01 1950-12-31 23:59:59 NaN nc /home/docs/checkouts/readthedocs.org/user_buil...
3 MIROC_MIROC6 None None None None None None None MIROC MIROC6 ... None MS mon uas None 1999-01-01 2000-11-30 23:59:59 NaN nc /home/docs/checkouts/readthedocs.org/user_buil...

4 rows × 21 columns

1.2.2.4.4. Example 4 - Filter on folder names

We can filter the results to include only some folders with the dirglob argument.

[32]:
parse_directory(directories=[root], patterns=patts, dirglob="*/CanESM*")
INFO:xscen.catutils:Found and parsed 2 files.
[32]:
id type processing_level bias_adjust_institution bias_adjust_project mip_era activity driving_model institution source ... member xrfreq frequency variable domain date_start date_end version format path
0 CCCma_CanESM-2 None None None None None None None CCCma CanESM-2 ... None fx fx sftlf None NaT NaT None nc /home/docs/checkouts/readthedocs.org/user_buil...
1 CCCma_CanESM2 None None None None None None None CCCma CanESM2 ... None D day tg_mean None 1950-01-01 1950-12-31 23:59:59 None nc /home/docs/checkouts/readthedocs.org/user_buil...

2 rows × 21 columns

1.2.2.4.5. Example 5 - Modifying metadata

We use the cvs (Controlled VocabularieS) argument here to replace some terms found in the paths by others we prefer.

Two replacement types are used:

  • simple: in the source column, all values of “CanESM-2” are replaced by “CanESM2”.

  • complex: in the institution column, if the value “MIROC” is seen, it triggers the setting of “global” in this row’s domain column, overriding whatever was already present in that field.
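
The two behaviours can be sketched in plain Python (a toy illustration, not xscen's actual cvs handling):

```python
def apply_cvs(row, cvs):
    # Apply controlled-vocabulary replacements to one parsed row (a dict).
    for col, mapping in cvs.items():
        val = row.get(col)
        if val in mapping:
            repl = mapping[val]
            if isinstance(repl, dict):
                row.update(repl)   # complex: the match sets other columns
            else:
                row[col] = repl    # simple: the value itself is replaced
    return row


cvs = {
    "source": {"CanESM-2": "CanESM2"},
    "institution": {"MIROC": {"domain": "global"}},
}
apply_cvs({"source": "CanESM-2", "domain": None}, cvs)
# → {'source': 'CanESM2', 'domain': None}
apply_cvs({"institution": "MIROC", "domain": None}, cvs)
# → {'institution': 'MIROC', 'domain': 'global'}
```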

[33]:
parse_directory(
    directories=[root],
    patterns=patts,
    cvs={
        "source": {"CanESM-2": "CanESM2"},
        "institution": {"MIROC": {"domain": "global"}},
    },
)
INFO:xscen.catutils:Found and parsed 4 files.
[33]:
id type processing_level bias_adjust_institution bias_adjust_project mip_era activity driving_model institution source ... member xrfreq frequency variable domain date_start date_end version format path
0 ERA_ERA5 None None None None None None None ERA ERA5 ... None YS yr heat_wave_frequency None 2100-01-01 2399-12-31 23:59:59 v2 nc /home/docs/checkouts/readthedocs.org/user_buil...
1 CCCma_CanESM2 None None None None None None None CCCma CanESM2 ... None fx fx sftlf None NaT NaT NaN nc /home/docs/checkouts/readthedocs.org/user_buil...
2 CCCma_CanESM2 None None None None None None None CCCma CanESM2 ... None D day tg_mean None 1950-01-01 1950-12-31 23:59:59 NaN nc /home/docs/checkouts/readthedocs.org/user_buil...
3 MIROC_MIROC6_global None None None None None None None MIROC MIROC6 ... None MS mon uas global 1999-01-01 2000-11-30 23:59:59 NaN nc /home/docs/checkouts/readthedocs.org/user_buil...

4 rows × 21 columns

1.2.2.4.6. Example 6 - Even more complex field processing

In the preceding example, we used the cvs argument to replace values with others, or to trigger replacements based on the values of other columns. In both cases, exact values must be matched and mapped to exact values. An alternative way to transform the parsed fields is to feed a function to the path-parsing step, by declaring a new “type” to the parser. In the following example, we’ll implement an admittedly useless transformation that reverses the letters of the institution.

[34]:
from xscen.catutils import register_parse_type


@register_parse_type("rev")
def _reverse_word(text):
    return "".join(reversed(text))


patts_mod = [
    "{institution:rev}/{source}_{version}/{frequency}/{variable:_}/{?:_}_{DATES}.nc",
    "{institution:rev}/{source}/{frequency}/{variable:_}/{?:_}_{DATES}.nc",
]
parse_directory(directories=[root], patterns=patts_mod)
INFO:xscen.catutils:Found and parsed 4 files.
[34]:
id type processing_level bias_adjust_institution bias_adjust_project mip_era activity driving_model institution source ... member xrfreq frequency variable domain date_start date_end version format path
0 ARE_ERA5 None None None None None None None ARE ERA5 ... None YS yr heat_wave_frequency None 2100-01-01 2399-12-31 23:59:59 v2 nc /home/docs/checkouts/readthedocs.org/user_buil...
1 amCCC_CanESM-2 None None None None None None None amCCC CanESM-2 ... None fx fx sftlf None NaT NaT NaN nc /home/docs/checkouts/readthedocs.org/user_buil...
2 amCCC_CanESM2 None None None None None None None amCCC CanESM2 ... None D day tg_mean None 1950-01-01 1950-12-31 23:59:59 NaN nc /home/docs/checkouts/readthedocs.org/user_buil...
3 CORIM_MIROC6 None None None None None None None CORIM MIROC6 ... None MS mon uas None 1999-01-01 2000-11-30 23:59:59 NaN nc /home/docs/checkouts/readthedocs.org/user_buil...

4 rows × 21 columns

1.2.3. Restructuring catalogued files on disk

The opposite operation to parse_directory is also handled by xscen.catutils. In this section, we show how to create a Path from an xscen-extracted dataset or from a catalog entry.

1.2.3.1. Simple : template string and attributes

Given a dataset that was opened by xs.extract_dataset or DataCatalog.to_dataset(), we can easily construct a path from the xscen-added attributes.

[35]:
# Open
ds = PC.search(variable="tas", experiment="ssp585").to_dataset()

path_template = "{institution}/{source}/{experiment}_{frequency}.nc"

print(path_template.format(**xs.utils.get_cat_attrs(ds)))
NCC/NorESM2-MM/ssp585_day.nc

While this method is simple, it can handle neither the list-like variable field nor the date_start and date_end datetime fields.

1.2.3.2. Complete : build_path

The `build_path <../xscen.rst#xscen.catutils.build_path>`__ function has a more complex interface to be used in more complex workflows.

The default parameters define a sensible folder structure that depends on the type column (usually one of simulation, reconstruction or station-obs) and the processing_level column (often raw, biasadjusted or something else).

[36]:
PosixPath('simulation/raw/CMIP6/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp585/r1i1p1f1/day/tas/tas_day_v20191108_CMIP6_ScenarioMIP_example-region_NCC_NorESM2-MM_ssp585_r1i1p1f1_2001-2002')

The folder schema can be passed explicitly, as a dictionary with two entries:

  • “folders”: a list of fields used to build the folder hierarchy.

  • “filename”: a list of fields used to build the filename.

In both cases, a special “DATES” field can be given. It will be translated to the most efficient way to write the temporal bounds of the dataset.
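
A minimal sketch of the idea behind the DATES field (hypothetical helper; the real logic handles many more calendar and frequency cases):

```python
from datetime import datetime


def format_dates(date_start, date_end):
    # Collapse temporal bounds into the shortest unambiguous string.
    if date_start is None:
        return "fx"  # fixed fields have no temporal bounds
    if date_start.year == date_end.year:
        return str(date_start.year)  # a single year covers the whole range
    return f"{date_start.year}-{date_end.year}"


format_dates(datetime(2001, 1, 1), datetime(2002, 12, 31))  # → '2001-2002'
```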

[37]:
custom_schema = {
    "folders": ["type", "institution", "source", "experiment"],
    "filename": ["variable", "DATES"],
}
xs.catutils.build_path(ds, schemas=custom_schema)
[37]:
PosixPath('simulation/NCC/NorESM2-MM/ssp585/tas_2001-2002')

The function has more options:

  • A “root” folder can be specified

  • Other fields can be passed to override those in the data or fill for missing ones.

[38]:
xs.catutils.build_path(ds, root=Path("/tmp"), domain="REG")
[38]:
PosixPath('/tmp/simulation/raw/CMIP6/ScenarioMIP/REG/NCC/NorESM2-MM/ssp585/r1i1p1f1/day/tas/tas_day_v20191108_CMIP6_ScenarioMIP_REG_NCC_NorESM2-MM_ssp585_r1i1p1f1_2001-2002')

Above, we called the function with a dataset. In this case, the “facets” are extracted from various sources, with this priority (highest at the top):

  1. Facets passed explicitly to build_path as keyword arguments

  2. Attributes prefixed with “cat:”

  3. Other Attributes

  4. Variable names, start and end dates, and frequency, as extracted from the dataset by parse_from_ds.
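
The priority order amounts to updating a dictionary from the lowest-priority source to the highest, so that later sources win (an illustrative sketch, not build_path's actual code; all argument names here are made up):

```python
def resolve_facets(explicit_kwargs, cat_attrs, other_attrs, parsed):
    # Apply sources in reverse priority order: each update overrides the
    # previous ones, so explicit keyword arguments win in the end.
    facets = {}
    for source in (parsed, other_attrs, cat_attrs, explicit_kwargs):
        facets.update({k: v for k, v in source.items() if v is not None})
    return facets


resolve_facets(
    explicit_kwargs={"domain": "REG"},
    cat_attrs={"domain": "example-region", "source": "NorESM2-MM"},
    other_attrs={},
    parsed={"frequency": "day"},
)
# → {'frequency': 'day', 'domain': 'REG', 'source': 'NorESM2-MM'}
```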

But the function can also take a single dataframe row:

[39]:
xs.catutils.build_path(PC.search(variable="tas", experiment="ssp585").df.iloc[0])
[39]:
PosixPath('simulation/raw/CMIP6/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp585/r1i1p1f1/day/tas/tas_day_CMIP6_ScenarioMIP_example-region_NCC_NorESM2-MM_ssp585_r1i1p1f1_2001-2002.nc')

Or a full DataFrame/Catalog. In this case, the return value is a DataFrame: a copy of the catalog with a “new_path” column added.

[40]:
# We show only three columns of the output catalog
xs.catutils.build_path(PC.search(variable="tas"))[["id", "path", "new_path"]]
[40]:
id path new_path
0 CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp370_r1i1p1... /home/docs/checkouts/readthedocs.org/user_buil... simulation/raw/CMIP6/ScenarioMIP/example-regio...
1 CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp245_r2i1p1... /home/docs/checkouts/readthedocs.org/user_buil... simulation/raw/CMIP6/ScenarioMIP/example-regio...
2 CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp245_r1i1p1... /home/docs/checkouts/readthedocs.org/user_buil... simulation/raw/CMIP6/ScenarioMIP/example-regio...
3 CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp585_r1i1p1... /home/docs/checkouts/readthedocs.org/user_buil... simulation/raw/CMIP6/ScenarioMIP/example-regio...
4 CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp126_r1i1p1... /home/docs/checkouts/readthedocs.org/user_buil... simulation/raw/CMIP6/ScenarioMIP/example-regio...

This can be used in a workflow that renames or copies the files to their new name, usually using shutil.

[41]:
import shutil as sh

# Create the destination folder
root = Path(".").absolute() / "_data" / "path_builder_examples"
root.mkdir(exist_ok=True)

# Get new names:
newdf = xs.catutils.build_path(PC, root=root)

# Copy files
for i, row in newdf.iterrows():
    Path(row["new_path"]).parent.mkdir(parents=True, exist_ok=True)
    sh.copy(row["path"], row["new_path"])
    print(f"Copied {row['path']}\n\tto {row['new_path']}")

# Update catalog:
PC.df["path"] = newdf["new_path"]
PC.update()
Copied /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/samples/tutorial/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp370/r1i1p1f1/fx/ScenarioMIP_NCC_NorESM2-MM_ssp370_r1i1p1f1_gn_raw.nc
        to /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/_data/path_builder_examples/simulation/raw/CMIP6/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp370/r1i1p1f1/fx/sftlf/sftlf_fx_CMIP6_ScenarioMIP_example-region_NCC_NorESM2-MM_ssp370_r1i1p1f1_fx.nc
Copied /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/samples/tutorial/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp370/r1i1p1f1/day/ScenarioMIP_NCC_NorESM2-MM_ssp370_r1i1p1f1_gn_raw.nc
        to /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/_data/path_builder_examples/simulation/raw/CMIP6/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp370/r1i1p1f1/day/tas/tas_day_CMIP6_ScenarioMIP_example-region_NCC_NorESM2-MM_ssp370_r1i1p1f1_2001-2002.nc
Copied /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/samples/tutorial/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp245/r2i1p1f1/fx/ScenarioMIP_NCC_NorESM2-MM_ssp245_r2i1p1f1_gn_raw.nc
        to /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/_data/path_builder_examples/simulation/raw/CMIP6/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp245/r2i1p1f1/fx/sftlf/sftlf_fx_CMIP6_ScenarioMIP_example-region_NCC_NorESM2-MM_ssp245_r2i1p1f1_fx.nc
Copied /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/samples/tutorial/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp245/r2i1p1f1/day/ScenarioMIP_NCC_NorESM2-MM_ssp245_r2i1p1f1_gn_raw.nc
        to /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/_data/path_builder_examples/simulation/raw/CMIP6/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp245/r2i1p1f1/day/tas/tas_day_CMIP6_ScenarioMIP_example-region_NCC_NorESM2-MM_ssp245_r2i1p1f1_2001-2002.nc
Copied /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/samples/tutorial/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp245/r1i1p1f1/fx/ScenarioMIP_NCC_NorESM2-MM_ssp245_r1i1p1f1_gn_raw.nc
        to /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/_data/path_builder_examples/simulation/raw/CMIP6/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp245/r1i1p1f1/fx/sftlf/sftlf_fx_CMIP6_ScenarioMIP_example-region_NCC_NorESM2-MM_ssp245_r1i1p1f1_fx.nc
Copied /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/samples/tutorial/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp245/r1i1p1f1/day/ScenarioMIP_NCC_NorESM2-MM_ssp245_r1i1p1f1_gn_raw.nc
        to /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/_data/path_builder_examples/simulation/raw/CMIP6/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp245/r1i1p1f1/day/tas/tas_day_CMIP6_ScenarioMIP_example-region_NCC_NorESM2-MM_ssp245_r1i1p1f1_2001-2002.nc
Copied /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/samples/tutorial/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp585/r1i1p1f1/fx/ScenarioMIP_NCC_NorESM2-MM_ssp585_r1i1p1f1_gn_raw.nc
        to /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/_data/path_builder_examples/simulation/raw/CMIP6/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp585/r1i1p1f1/fx/sftlf/sftlf_fx_CMIP6_ScenarioMIP_example-region_NCC_NorESM2-MM_ssp585_r1i1p1f1_fx.nc
Copied /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/samples/tutorial/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp585/r1i1p1f1/day/ScenarioMIP_NCC_NorESM2-MM_ssp585_r1i1p1f1_gn_raw.nc
        to /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/_data/path_builder_examples/simulation/raw/CMIP6/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp585/r1i1p1f1/day/tas/tas_day_CMIP6_ScenarioMIP_example-region_NCC_NorESM2-MM_ssp585_r1i1p1f1_2001-2002.nc
Copied /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/samples/tutorial/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp126/r1i1p1f1/fx/ScenarioMIP_NCC_NorESM2-MM_ssp126_r1i1p1f1_gn_raw.nc
        to /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/_data/path_builder_examples/simulation/raw/CMIP6/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp126/r1i1p1f1/fx/sftlf/sftlf_fx_CMIP6_ScenarioMIP_example-region_NCC_NorESM2-MM_ssp126_r1i1p1f1_fx.nc
Copied /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/samples/tutorial/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp126/r1i1p1f1/day/ScenarioMIP_NCC_NorESM2-MM_ssp126_r1i1p1f1_gn_raw.nc
        to /home/docs/checkouts/readthedocs.org/user_builds/xscen/checkouts/latest/docs/notebooks/_data/path_builder_examples/simulation/raw/CMIP6/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp126/r1i1p1f1/day/tas/tas_day_CMIP6_ScenarioMIP_example-region_NCC_NorESM2-MM_ssp126_r1i1p1f1_2001-2002.nc