{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0",
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "# Remove flox spam\n",
    "\n",
    "import logging\n",
    "\n",
    "# Get the logger for the 'flox' package\n",
    "logger = logging.getLogger(\"flox\")\n",
    "# Set the logging level to WARNING\n",
    "logger.setLevel(logging.WARNING)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1",
   "metadata": {},
   "source": [
    "# Getting Started\n",
    "\n",
    "This Notebook will go through all major steps of creating a climate scenario using `xscen`. These steps are:\n",
    "\n",
    "- `search_data_catalogs` to find a subset of datasets that match a given project's requirements.\n",
    "- `extract_dataset` to extract them.\n",
    "- `regrid_dataset` to regrid all data to a common grid.\n",
    "- `train` and `adjust` to bias correct the raw simulations.\n",
    "- `compute_indicators` to compute a list of indicators.\n",
    "- `climatological_op` and `spatial_mean` for spatio-temporal aggregation.\n",
    "- `compute_deltas` to compute deltas.\n",
    "- `ensemble_stats` for ensemble statistics.\n",
    "- `clean_up` for minor adjustments that have to be made in preparation for the final product.\n",
    "\n",
    "\n",
    "## Initialisation\n",
    "\n",
    "Typically, the first step should be to create a new *ProjectCatalog* to store the files that will be created during the process. More details on basic usage are provided in the Catalogs Notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "# A temporary bug fix waiting for xclim 0.49\n",
    "import xclim as xc\n",
    "\n",
    "import xscen as xs\n",
    "\n",
    "# Folder where to put the data\n",
    "output_folder = Path().absolute() / \"_data\"\n",
    "output_folder.mkdir(exist_ok=True)\n",
    "\n",
    "project = {\n",
    "    \"title\": \"example-gettingstarted\",\n",
    "    \"description\": \"This is an example catalog for xscen's documentation.\",\n",
    "}\n",
    "\n",
    "pcat = xs.ProjectCatalog(\n",
    "    str(output_folder / \"example-gettingstarted.json\"),\n",
    "    create=True,\n",
    "    project=project,\n",
    "    overwrite=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3",
   "metadata": {},
   "source": [
    "### Searching a subset of datasets within *DataCatalogs*\n",
    "\n",
    "<div class=\"alert alert-info\"> <b>INFO</b>\n",
    "\n",
    "At this stage, the search criteria should be for variables that will be **bias corrected**, not necessarily the variables required for the final product. For example, if `sfcWindfromdir` is the final product, then `uas` and `vas` should be searched for since these are the variables that will be bias corrected.\n",
    "</div>\n",
    "\n",
    "`xs.search_data_catalogs` is used to consult a list of *DataCatalogs* and find a subset of datasets that match given search parameters. More details on that function and possible usage are given in the [Understanding Catalogs](1_catalog.ipynb#Advanced-search:-xs.search_data_catalogs) Notebook.\n",
    "\n",
    "The function also plays the double role of preparing certain arguments for the extraction function, as detailed in the relevant [section of this tutorial](#Simplifying-the-call-to-extract_dataset()-with-search_data_catalogs()).\n",
    "\n",
    "Due to how different reference datasets are from climate simulations, this function might have to be called multiple times and the results concatenated into a single dictionary.\n",
    "\n",
    "For the purpose of this tutorial, temperatures and the land fraction from NorESM2-MM will be used:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "variables_and_freqs = {\"tas\": \"D\", \"sftlf\": \"fx\"}\n",
    "other_search_criteria = {\n",
    "    \"source\": [\"NorESM2-MM\"],\n",
    "    \"processing_level\": [\"raw\"],\n",
    "    \"experiment\": \"ssp245\",\n",
    "}\n",
    "\n",
    "cat_sim = xs.search_data_catalogs(\n",
    "    data_catalogs=[str(output_folder / \"tutorial-catalog.json\")],\n",
    "    variables_and_freqs=variables_and_freqs,\n",
    "    other_search_criteria=other_search_criteria,\n",
    "    periods=[2001, 2002],\n",
    "    match_hist_and_fut=True,\n",
    ")\n",
    "\n",
    "cat_sim"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5",
   "metadata": {},
   "source": [
    "The result of `search_data_catalog` is a dictionary with one entry per unique ID. Note that a unique ID can be associated to multiple *intake datasets*, as is the case here, because `intake-esm` groups catalog lines per *id - domain - processing_level - xrfeq*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "cat_sim[\"CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp245_r1i1p1f1_example-region\"].df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7",
   "metadata": {},
   "source": [
    "## Extracting data\n",
    "\n",
    "<div class=\"alert alert-warning\"> <b>WARNING</b>\n",
    "\n",
    "It is heavily recommended to stop and analyse the results of `search_data_catalogs` before proceeding to the extraction function.\n",
    "</div>\n",
    "\n",
    "### Defining the region\n",
    "\n",
    "The region for a given project is defined using a dictionary with the relevant information to be used by `clisops.core.subset`. The required fields are as follows:\n",
    "\n",
    "region =\n",
    "\n",
    "    \"name\": str,                    {this will overwrite the *domain* column in the catalog}\n",
    "    \"method\": str,                  [gridpoint, bbox, shape, sel]\n",
    "    \"tile_buffer\": float,    {approximate number of pixels used to expand the domain based on model resolution}\n",
    "    **kwargs                        {other arguments to send `clisops`}\n",
    "\n",
    "The argument *tile_buffer* (optional) is used to apply a buffer zone around the region that is adjusted dynamically according to model resolution during the extraction process (for *bbox* and *shape* only). This is useful to make sure that grid cells that only partially cover the region are selected too.\n",
    "\n",
    "The documentation for the supported subsetting methods are in the following links:\n",
    "\n",
    "- [gridpoint](https://clisops.readthedocs.io/en/latest/api.html#clisops.core.subset.subset_gridpoint)\n",
    "- [bbox](https://clisops.readthedocs.io/en/latest/api.html#clisops.core.subset.subset_bbox)\n",
    "- [shape](https://clisops.readthedocs.io/en/latest/api.html#clisops.core.subset.subset_shape)\n",
    "- *sel* is simply a call to xarray"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "region = {\n",
    "    \"name\": \"example-region\",\n",
    "    \"method\": \"bbox\",\n",
    "    \"tile_buffer\": 1.5,\n",
    "    \"lon_bnds\": [-75, -74],\n",
    "    \"lat_bnds\": [45, 46],\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9",
   "metadata": {},
   "source": [
    "### Preparing arguments for *xarray*\n",
    "\n",
    "`xscen` makes use of `intake_esm`'s [to_dataset_dict()](https://intake-esm.readthedocs.io/en/stable/reference/api.html?highlight=to_dataset_dict) for the extraction process, which will automatically compute missing variables as required. Also, this function manages Catalogs, IDs, and both NetCDF and Zarr files seamlessly. When the catalog is made of a single dataset, `to_dataset()` can be used instead to directly obtain an *xr.Dataset* instead of a dictionary.\n",
    "\n",
    "There are a few key differences compared to using *xarray* directly, one of which being that it uses `xr.open_dataset`, even when multiple files are involved, with a subsequent call to `xr.combine_by_coords`. Kwargs are therefore separated in two:\n",
    "\n",
    "- `xr_open_kwargs` is used for optional arguments in `xarray.open_dataset`.\n",
    "\n",
    "- `xr_combine_kwargs` is used for optional arguments in `xarray.combine_by_coords`.\n",
    "\n",
    "More information on possible kwargs can be obtained here: [xarray.open_dataset](https://xarray.pydata.org/en/stable/generated/xarray.open_dataset.html) & [xarray.combine_by_coords](https://xarray.pydata.org/en/stable/generated/xarray.combine_by_coords.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Kwargs for xr.open_dataset\n",
    "xr_open_kwargs = {\"drop_variables\": [\"height\", \"time_bnds\"], \"engine\": \"h5netcdf\"}\n",
    "\n",
    "# Kwargs for xr.combine_by_coords\n",
    "xr_combine_kwargs = {\"data_vars\": \"minimal\"}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11",
   "metadata": {},
   "source": [
    "### Extraction function\n",
    "\n",
    "Extraction is done on each dataset by calling `xs.extract_dataset()`. Since the output could have multiple frequencies, the function returns a python dictionary with keys following the output frequency.\n",
    "\n",
    "- `catalog` is the *DataCatalog* to extract.\n",
    "- `variables_and_freqs` is the same as previously used for `search_data_catalogs`.\n",
    "- `periods` is used to extract specific time periods.\n",
    "- `to_level` will change the *processing_level* of the output. Defaults to \"extracted\".\n",
    "- `region`, `xr_open_kwargs`, and `xr_combine_kwargs` are described above.\n",
    "\n",
    "**NOTE:** Calling the extraction function without passing by `search_data_catalogs` beforehand is possible, but will not support *DerivedVariables*.\n",
    "\n",
    "<div class=\"alert alert-info\"> <b>NOTE</b>\n",
    "\n",
    "`extract_dataset` currently only accepts a single unique ID at a time.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "12",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Example with a single simulation\n",
    "ds_dict = xs.extract_dataset(\n",
    "    catalog=cat_sim[\"CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp245_r1i1p1f1_example-region\"],\n",
    "    variables_and_freqs=variables_and_freqs,\n",
    "    periods=[2001, 2002],\n",
    "    region=region,\n",
    "    xr_open_kwargs=xr_open_kwargs,\n",
    "    xr_combine_kwargs=xr_combine_kwargs,\n",
    ")\n",
    "\n",
    "ds_dict"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13",
   "metadata": {},
   "source": [
    "### Saving files to disk\n",
    "\n",
    "`extract_dataset` does not actually *save* anything to disk. It simply opens and prepares the files as per requested, using lazy computing. The result is a python dictionary containing the results, separated per *xrfreq*.\n",
    "\n",
    "`xscen` has two functions for the purpose of saving data: `save_to_netcdf` and `save_to_zarr`. If possible for a given project, it is strongly recommended to use Zarr files since these are often orders of magnitude faster to read and create compared to NetCDF. They do have a few quirks, however:\n",
    "\n",
    "- Chunk size must separate the dataset in *exactly* equal chunks (with the exception of the last). While it is recommended to calculate ideal chunking and provide them explicitly to the function, `io.estimate_chunks()` can be used to automatically estimate a decent chunking. In a similar way, `io.subset_maxsize()` can be used to roughly cut a dataset along the *time* dimension into subsets of a given size (in Gb), which is especially useful for saving NetCDF files.\n",
    " Chunk sizes can be passed to the two saving functions in a dictionary. Spatial dimensions can be generalized as `'X'` and `'Y'`, which will be mapped to the *xarray.Dataset*'s actual grid type's dimension names.\n",
    "\n",
    "\n",
    "- Default behaviour for a Zarr is to act like a directory, with each new variable being assigned a subfolder. This is great when all required variables have the same dimensions and frequency, but will crash otherwise. If you have daily *tasmax* and subdaily *pr*, for example, they need to be assigned different paths.\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "#### Updating the catalog\n",
    "\n",
    "`intake-esm` will automatically copy the catalog's entry in the dataset's metadata, with a `cat:attr` format. Where appropriate, `xscen` updates that information to keep the metadata up to date with the manipulations. `ProjectCatalog.update_from_ds` will in turn read the metadata of a Dataset and fill in the information into a new catalog entry.\n",
    "\n",
    "This loop means that upon completing a step in the creation of a climate scenario, `ProjectCatalog.update_from_ds()` can be called to update the catalog.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "for ds in ds_dict.values():\n",
    "    filename = str(\n",
    "        output_folder\n",
    "        / f\"{ds.attrs['cat:id']}.{ds.attrs['cat:domain']}.{ds.attrs['cat:processing_level']}.{ds.attrs['cat:frequency']}.zarr\"\n",
    "    )\n",
    "    chunks = xs.io.estimate_chunks(ds, dims=[\"time\"], target_mb=50)\n",
    "    xs.save_to_zarr(ds, filename, rechunk=chunks, mode=\"o\")\n",
    "\n",
    "    # Strongly suggested to update the project catalog AFTER you save to disk, in case it crashes during the process\n",
    "    pcat.update_from_ds(ds=ds, path=filename, info_dict={\"format\": \"zarr\"})\n",
    "\n",
    "pcat.df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15",
   "metadata": {},
   "source": [
    "### Simplifying the call to extract_dataset() with search_data_catalogs()\n",
    "\n",
    "When a catalog was produced using `search_data_catalogs`, `xscen` will automatically save the requested periods and frequencies, in addition to *DerivedVariables*. This means that these items do not need to be included during the call to `extract_dataset` and make it possible to extract datasets that have different requirements (such as reference datasets and future simulations)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "16",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "cat_sim[\n",
    "    \"CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp245_r1i1p1f1_example-region\"\n",
    "]._requested_periods"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "print(\n",
    "    cat_sim[\n",
    "        \"CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp245_r1i1p1f1_example-region\"\n",
    "    ]._requested_variables_true\n",
    ")\n",
    "print(\n",
    "    cat_sim[\n",
    "        \"CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp245_r1i1p1f1_example-region\"\n",
    "    ]._requested_variable_freqs\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18",
   "metadata": {},
   "source": [
    "Since `cat_sim` contains multiple datasets, extracting the data should be done by looping on `.items()` or `.values()`. Also, since 'CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp126_r1i1p1f1_example-region' was extracted in the previous step, `pcat.exists_in_cat` can be used to skip re-extracting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "19",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "for key, dc in cat_sim.items():\n",
    "    if not pcat.exists_in_cat(id=key, processing_level=\"extracted\"):\n",
    "        dset_dict = xs.extract_dataset(\n",
    "            catalog=dc,\n",
    "            region=region,\n",
    "            xr_open_kwargs=xr_open_kwargs,\n",
    "            xr_combine_kwargs=xr_combine_kwargs,\n",
    "        )\n",
    "\n",
    "        for ds in dset_dict.values():\n",
    "            filename = str(\n",
    "                output_folder\n",
    "                / f\"{ds.attrs['cat:id']}.{ds.attrs['cat:domain']}.{ds.attrs['cat:processing_level']}.{ds.attrs['cat:frequency']}.zarr\"\n",
    "            )\n",
    "            chunks = xs.io.estimate_chunks(ds, dims=[\"time\"], target_mb=50)\n",
    "            xs.save_to_zarr(ds, filename, rechunk=chunks, mode=\"o\")\n",
    "\n",
    "            # Strongly suggested to update the project catalog AFTER you save to disk, in case it crashes during the process\n",
    "            pcat.update_from_ds(ds=ds, path=filename, info_dict={\"format\": \"zarr\"})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20",
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "# This is a hidden cell. Since the sample files are very small, we'll create fake data covering a longer time period and hijack the previously saved files.\n",
    "\n",
    "import shutil\n",
    "\n",
    "from xscen.testing import datablock_3d, fake_data\n",
    "\n",
    "ds_dict = pcat.search(processing_level=\"extracted\", variable=\"tas\").to_dataset_dict()\n",
    "for key, ds in ds_dict.items():\n",
    "    attrs = ds.attrs\n",
    "    filename = pcat.search(id=ds.attrs[\"cat:id\"], variable=\"tas\").df.path.iloc[0]\n",
    "\n",
    "    shutil.rmtree(filename)\n",
    "\n",
    "    data = fake_data(\n",
    "        nyears=71,\n",
    "        ny=len(ds.lat),\n",
    "        nx=len(ds.lon),\n",
    "        rand_type=\"tas\",\n",
    "        seed=sorted(list(ds_dict.keys())).index(key),\n",
    "        amplitude=15,\n",
    "        offset=2,\n",
    "    )\n",
    "    ds = datablock_3d(\n",
    "        data,\n",
    "        \"tas\",\n",
    "        \"lon\",\n",
    "        -75,\n",
    "        \"lat\",\n",
    "        45,\n",
    "        x_step=1,\n",
    "        y_step=1.5,\n",
    "        start=\"1/1/1981\",\n",
    "        freq=\"D\",\n",
    "        as_dataset=True,\n",
    "    )\n",
    "    ds.attrs = attrs\n",
    "    ds.attrs[\"cat:date_start\"] = \"1981-01-01\"\n",
    "    ds.attrs[\"cat:date_end\"] = \"2050-01-01\"\n",
    "    chunks = xs.io.estimate_chunks(ds, dims=[\"time\"], target_mb=50)\n",
    "    xs.save_to_zarr(ds, filename, rechunk=chunks, mode=\"o\")\n",
    "\n",
    "    pcat.update_from_ds(ds=ds, path=filename, info_dict={\"format\": \"zarr\"})\n",
    "\n",
    "# For this tutorial, we'll also create a fake reference dataset\n",
    "data = fake_data(\n",
    "    nyears=31, ny=5, nx=4, rand_type=\"tas\", seed=42, amplitude=25, offset=-2\n",
    ")\n",
    "ds_ref = datablock_3d(\n",
    "    data,\n",
    "    \"tas\",\n",
    "    \"lon\",\n",
    "    -74.875,\n",
    "    \"lat\",\n",
    "    45.275,\n",
    "    x_step=0.25,\n",
    "    y_step=0.55,\n",
    "    start=\"1/1/1981\",\n",
    "    freq=\"D\",\n",
    "    as_dataset=True,\n",
    ")\n",
    "ds_ref.attrs = attrs\n",
    "ds_ref.attrs[\"cat:date_start\"] = \"1981-01-01\"\n",
    "ds_ref.attrs[\"cat:date_end\"] = \"2010-01-01\"\n",
    "ds_ref.attrs[\"cat:source\"] = \"ERA5-Land\"\n",
    "ds_ref.attrs[\"cat:institution\"] = \"ECMWF\"\n",
    "ds_ref.attrs[\"cat:domain\"] = \"finer-grid\"\n",
    "ds_ref.attrs[\"cat:activity\"] = None\n",
    "ds_ref.attrs[\"cat:experiment\"] = None\n",
    "ds_ref.attrs[\"cat:member\"] = None\n",
    "ds_ref.attrs[\"cat:mip_era\"] = None\n",
    "ds_ref.attrs[\"cat:path\"] = None\n",
    "ds_ref.attrs[\"cat:id\"] = xs.catalog.generate_id(ds_ref)[0]\n",
    "\n",
    "filename = filename.replace(Path(filename).stem, ds_ref.attrs[\"cat:id\"])\n",
    "xs.save_to_zarr(ds_ref, filename, rechunk=chunks, mode=\"o\")\n",
    "pcat.update_from_ds(ds=ds_ref, path=filename, info_dict={\"format\": \"zarr\"})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21",
   "metadata": {},
   "source": [
    "## Regridding data\n",
    "\n",
    "<div class=\"alert alert-info\"> <b>NOTE</b>\n",
    "\n",
    "Regridding in `xscen` is built upon `xESMF`. For more information on basic usage and available regridding methods, [please consult their documentation](https://xesmf.readthedocs.io/en/latest/). Their [masking and extrapolation tutorial](https://xesmf.readthedocs.io/en/latest/notebooks/Masking.html) is of particular interest.\n",
    "\n",
    "More details on the regridding functions themselves can be found within the [ESMPy](https://earthsystemmodeling.org/esmpy/) and [ESMF](https://earthsystemmodeling.org/) documentation.\n",
    "</div>\n",
    "\n",
    "The only requirement for using datasets in `xESMF` is that they contain *lon* and *lat*, with *mask* as an optional data variable. Using these, the package can manage both regular and rotated grids. The main advantage of `xESMF` compared to other tools such as *scipy*'s *griddata*, in addition to the fact that the methods are climate science-based, is that the transformation weights are calculated once and broadcasted on the *time* dimension.\n",
    "\n",
    "### Preparing the destination grid\n",
    "\n",
    "`xscen` currently does not explicitly support any function to create a destination grid. If required, however, `xESMF` itself has utilities that can easily create custom regular grids, such as `xesmf.util.cf_grid_2d`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "22",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import xesmf\n",
    "\n",
    "ds_grid = xesmf.util.cf_grid_2d(-75, -74, 0.25, 45, 48, 0.55)\n",
    "\n",
    "# cf_grid_2d does not set the 'axis' attribute\n",
    "ds_grid[\"lon\"].attrs[\"axis\"] = \"X\"\n",
    "ds_grid[\"lat\"].attrs[\"axis\"] = \"Y\"\n",
    "\n",
    "# xscen will use the domain to re-assign attributes, so it is important to set it up for custom grids like this\n",
    "ds_grid.attrs[\"cat:domain\"] = \"finer-grid\"\n",
    "ds_grid"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "23",
   "metadata": {},
   "source": [
    "### Masking grid cells\n",
    "\n",
    "Masks can be used on both the original grid and the destination grid to ignore certain grid cells during the regridding process. These masks follow the `ESMF` convention, meaning that the mask is a variable within the Dataset, named *mask* and comprised of 0 and 1.\n",
    "\n",
    "`xs.create_mask` will create an adequate DataArray, following the instructions given to the function. In the case of variables that have a time component, the first timestep will be chosen."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# to_dataset() will open the dataset, as long as the search() gave a single result.\n",
    "ds_example = pcat.search(\n",
    "    id=\"CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp245_r1i1p1f1_example-region\",\n",
    "    processing_level=\"extracted\",\n",
    "    variable=\"sftlf\",\n",
    ").to_dataset()\n",
    "\n",
    "# Will mask all pixels that do not match these requirements (at least 25% land)\n",
    "ds_example[\"mask\"] = xs.create_mask(\n",
    "    ds_example, variable=\"sftlf\", where_operator=\">\", where_threshold=25, mask_nans=True\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "25",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import matplotlib.patches as patches\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Plot sftlf\n",
    "ax = plt.subplot(121)\n",
    "ds_example.sftlf.plot.imshow()\n",
    "plt.title(\"sftlf\")\n",
    "\n",
    "# Plot the mask\n",
    "plt.subplot(122)\n",
    "ds_example.mask.plot.imshow()\n",
    "plt.title(\"mask\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26",
   "metadata": {},
   "source": [
    "### Preparing arguments for xESMF.Regridder\n",
    "\n",
    "<div class=\"alert alert-info\">  <b>NOTE</b>\n",
    "\n",
    "xESMF's API appears to be broken on their ReadTheDocs. For a list of available arguments and options in `Regridder()`, please consult their [Github page](https://github.com/pangeo-data/xESMF/blob/master/xesmf/frontend.py) directly.\n",
    "</div>\n",
    "\n",
    "`xESMF.Regridder` is the main utility that computes the transformation weights and performs the regridding. It is supported by many optional arguments and methods, which can be called in `xscen` through `regridder_kwargs`.\n",
    "\n",
    "Available options are:\n",
    "\n",
    "     method: str\n",
    "         Regridding method. More details are given within the ESMF documentation and xESMF tutorials.\n",
    "        - 'bilinear'  (Default)\n",
    "        - 'nearest_s2d'\n",
    "        - 'nearest_d2s'\n",
    "        - 'conservative'\n",
    "        - 'conservative_normed'\n",
    "        - 'patch'\n",
    "\n",
    "    extrap_method: str\n",
    "        Extrapolation method. Defaults to None.\n",
    "        - 'inverse_dist'\n",
    "        - 'nearest_s2d'\n",
    "\n",
    "    extrap_dist_exponent: float\n",
    "        Exponent to raise the distance to when calculating weights for extrapolation.\n",
    "        Defaults to 2.0.\n",
    "\n",
    "    extrap_num_src_pnts : int, optional\n",
    "        The number of source points to use for the extrapolation methods that use more than one source point.\n",
    "        Defaults to 8\n",
    "\n",
    "    unmapped_to_nan: boolean, optional\n",
    "        Set values of unmapped points to `np.nan` instead of the ESMF default of 0.\n",
    "        Defaults to True.\n",
    "\n",
    "    periodic: boolean, optional\n",
    "        Only really useful for global grids with non-conservative regridding.\n",
    "\n",
    "Other options exist in `ESMF/ESMPy`, but not `xESMF`. As they get implemented, they should automatically get supported by `xscen`.\n",
    "\n",
    "<div class=\"alert alert-info\">  <b>NOTE</b>\n",
    "\n",
    "Some utilities that exist in `xESMF` have not yet been explicitly added to `xscen`. If *conservative* regridding is desired, for instance, some additional scripts might be required on the User's side to create the lon/lat boundaries\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "regridder_kwargs = {\"extrap_method\": \"inverse_dist\"}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28",
   "metadata": {},
   "source": [
    "### Regridding function\n",
    "\n",
    "Regridding for a Dataset is done through `xs.regrid_dataset`, which manages calls to `xESMF.Regridder` and makes sure that the output is CF-compliant.\n",
    "\n",
    "- `weights_location` provides a path where to save the regridding weights (NetCDF file). This file (alongside `reuse_weights=True`) is used by `xESMF` to reuse the transformation weights between datasets that are deemed to have the same grid and vastly improve the speed of the function.\n",
    "- `intermediate_grids` can be called to perform the regridding process in multiple steps. This is recommended when the jump in resolution is very high between the original and destination grid (e.g. from 3° to 0.08°).\n",
    "- `ds_grid` & `regridder_kwargs` are described above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "29",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# to_dataset_dict() is called to cast the search results as xr.Dataset objects\n",
    "# frequency=\"^(?!fx$).*$\" is used to exclude fixed fields from the results\n",
    "ds_dict = pcat.search(\n",
    "    processing_level=\"extracted\", frequency=\"^(?!fx$).*$\", domain=\"example-region\"\n",
    ").to_dataset_dict()\n",
    "\n",
    "mask_args = {\n",
    "    \"variable\": \"sftlf\",\n",
    "    \"where_operator\": \">\",\n",
    "    \"where_threshold\": 25,\n",
    "    \"mask_nans\": True,\n",
    "}\n",
    "\n",
    "for ds in ds_dict.values():\n",
    "    # Add a mask on the original grid.\n",
    "    ds[\"mask\"] = xs.create_mask(\n",
    "        pcat.search(\n",
    "            id=ds.attrs[\"cat:id\"], processing_level=\"extracted\", variable=\"sftlf\"\n",
    "        ).to_dataset(),\n",
    "        **mask_args,\n",
    "    )\n",
    "\n",
    "    # Regridding function\n",
    "    ds_regrid = xs.regrid_dataset(\n",
    "        ds=ds,\n",
    "        weights_location=str(output_folder / \"gs-weights\"),\n",
    "        ds_grid=ds_grid,\n",
    "        regridder_kwargs=regridder_kwargs,\n",
    "    )\n",
    "\n",
    "    # Save to zarr\n",
    "    filename = str(\n",
    "        output_folder\n",
    "        / f\"{ds_regrid.attrs['cat:id']}.{ds_regrid.attrs['cat:domain']}.{ds_regrid.attrs['cat:processing_level']}.{ds_regrid.attrs['cat:frequency']}.zarr\"\n",
    "    )\n",
    "    chunks = xs.io.estimate_chunks(ds, dims=[\"time\"], target_mb=50)\n",
    "    xs.save_to_zarr(ds=ds_regrid, filename=filename, rechunk=chunks, mode=\"o\")\n",
    "    pcat.update_from_ds(ds_regrid, path=filename, format=\"zarr\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "30",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import matplotlib.patches as patches\n",
    "\n",
    "plt.figure(figsize=[15, 5])\n",
    "\n",
    "vmin = float(ds.tas.isel(time=0).min())\n",
    "vmax = float(ds.tas.isel(time=0).max())\n",
    "\n",
    "ax = plt.subplot(131)\n",
    "ds.tas.isel(time=0).plot.imshow(vmin=vmin, vmax=vmax)\n",
    "plt.title(\"tas: original grid\")\n",
    "rect = patches.Rectangle(\n",
    "    (-75, 45), 1, 2.75, linewidth=1, edgecolor=\"r\", facecolor=\"none\"\n",
    ")\n",
    "ax.add_patch(rect)\n",
    "\n",
    "ax = plt.subplot(132)\n",
    "(ds.tas.isel(time=0).where(ds.mask == 1)).plot.imshow(vmin=vmin, vmax=vmax)\n",
    "rect = patches.Rectangle(\n",
    "    (-75, 45), 1, 2.75, linewidth=1, edgecolor=\"r\", facecolor=\"none\"\n",
    ")\n",
    "ax.add_patch(rect)\n",
    "plt.title(\"tas: original + mask\")\n",
    "plt.tight_layout()\n",
    "\n",
    "plt.subplot(133)\n",
    "ds_regrid.tas.isel(time=0).plot.imshow(vmin=vmin, vmax=vmax)\n",
    "plt.title(\"tas: regridded with mask + extrapolation\")\n",
    "plt.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31",
   "metadata": {},
   "source": [
    "## Bias adjusting data\n",
    "\n",
    "<div class=\"alert alert-info\"> <b>NOTE</b>\n",
    "\n",
    "Bias adjustment in `xscen` is built upon `xsdba`. For more information on basic usage and available methods, [please consult their documentation](https://xsdba.readthedocs.io/en/stable/).\n",
    "</div>\n",
    "\n",
    "### Preparing arguments for xsdba\n",
    "\n",
    "Many optional arguments are used by `xsdba` during the training and adjustment processes. These options heavily depend on the bias adjustment method used, so it is recommended to consult their documentation before proceeding further.\n",
    "\n",
    "These arguments can be sent by using `xsdba_train_kwargs` and `xsdba_adjust_kwargs` during the call to `xs.train` and `xs.adjust` respectively."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "32",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "xsdba_train_args = {\"kind\": \"+\", \"nquantiles\": 50}\n",
    "\n",
    "xsdba_adjust_args = {\"detrend\": 3, \"interp\": \"linear\", \"extrapolation\": \"constant\"}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "33",
   "metadata": {},
   "source": [
    "### Bias adjustment function\n",
    "\n",
    "Bias adjustment is done through `xs.train` and `xs.adjust`. They are kept separate to account for cases where a voluminous dataset would require saving after the training step.\n",
    "\n",
    "The arguments to *train()* are:\n",
    "\n",
    "- `dref` and `dhist` indicate the reference and historical datasets.\n",
    "- `var` indicates which variable to bias correct.\n",
    "- `period` defines the period used for building the transfer function.\n",
    "- `method` indicates which bias adjusting method to call within `xsdba`.\n",
    "- `maximal_calendar` instructs on which calendar to use, following this hierarchy: 360_day < noleap < standard < all_leap\n",
    "- `jitter_under` adds a random noise under a given threshold.\n",
    "- `jitter_over`adds a random noise over a given threshold.\n",
    "- `xsdba_train_kwargs` is described above.\n",
    "\n",
    "The arguments to *adjust()* are:\n",
    "\n",
    "- `dtrain` is the result of `biasadjust.train`.\n",
    "- `dsim` is the simulation to bias adjust.\n",
    "- `periods` defines the period(s) to bias adjust.\n",
    "- `xsdba_adjust_kwargs` is described above.\n",
    "\n",
    "<div class=\"alert alert-info\">  <b>NOTE</b>\n",
    "\n",
    "These functions currently do not support multiple variables due to the fact that train and adjust arguments might vary. The function must be called separately for every variable.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "34",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "ds_dict = pcat.search(processing_level=\"regridded\").to_dataset_dict()\n",
    "\n",
    "# # Open the reference dataset, in this case ERA5-Land\n",
    "ds_ref = pcat.search(processing_level=\"extracted\", source=\"ERA5-Land\").to_dataset()\n",
    "\n",
    "# Currently, only a single variable can be bias adjusted at a time\n",
    "variables = [\"tas\"]\n",
    "for v in variables:\n",
    "    for ds in ds_dict.values():\n",
    "        # Train\n",
    "        ds_train = xs.train(\n",
    "            dref=ds_ref,\n",
    "            dhist=ds,\n",
    "            var=[\"tas\"],\n",
    "            period=[\"1981\", \"2010\"],\n",
    "            xsdba_train_args=xsdba_train_args,\n",
    "        )\n",
    "\n",
    "        # Adjust\n",
    "        ds_adj = xs.adjust(\n",
    "            dtrain=ds_train,\n",
    "            dsim=ds,\n",
    "            periods=[\"1981\", \"2050\"],\n",
    "            bias_adjust_institution=\"Ouranos\",  # add new attribute cat:bias_adjust_institution\n",
    "            bias_adjust_project=\"xscen-tutorial\",  # add new attribute cat:bias_adjust_project\n",
    "            xsdba_adjust_args=xsdba_adjust_args,\n",
    "        )\n",
    "\n",
    "        # Save to zarr\n",
    "        filename = str(\n",
    "            output_folder\n",
    "            / f\"{ds_adj.attrs['cat:id']}.{ds_adj.attrs['cat:domain']}.{ds_adj.attrs['cat:processing_level']}.{ds_adj.attrs['cat:frequency']}.zarr\"\n",
    "        )\n",
    "        chunks = xs.io.estimate_chunks(ds_adj, dims=[\"time\"], target_mb=50)\n",
    "        xs.save_to_zarr(ds=ds_adj, filename=filename, rechunk=chunks, mode=\"o\")\n",
    "        pcat.update_from_ds(ds_adj, path=filename, format=\"zarr\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "ds = pcat.search(\n",
    "    processing_level=\"regridded\",\n",
    "    variable=\"tas\",\n",
    "    id=\"CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp245_r1i1p1f1_example-region\",\n",
    ").to_dataset()\n",
    "ds_adj = pcat.search(\n",
    "    processing_level=\"biasadjusted\",\n",
    "    variable=\"tas\",\n",
    "    id=\"CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp245_r1i1p1f1_example-region\",\n",
    ").to_dataset()\n",
    "\n",
    "vmin = min(\n",
    "    [\n",
    "        float(ds_ref.tas.sel(time=slice(\"1981\", \"2010\")).mean(dim=\"time\").min()),\n",
    "        float(ds.tas.sel(time=slice(\"1981\", \"2010\")).mean(dim=\"time\").min()),\n",
    "    ]\n",
    ")\n",
    "vmax = max(\n",
    "    [\n",
    "        float(ds_ref.tas.sel(time=slice(\"1981\", \"2010\")).mean(dim=\"time\").max()),\n",
    "        float(ds.tas.sel(time=slice(\"1981\", \"2010\")).mean(dim=\"time\").max()),\n",
    "    ]\n",
    ")\n",
    "\n",
    "fig = plt.figure(figsize=(15, 5))\n",
    "plt.subplot(141)\n",
    "ds_ref.tas.sel(time=slice(\"1981\", \"2010\")).mean(dim=\"time\").transpose(\n",
    "    \"lat\", ...\n",
    ").plot.imshow(vmin=vmin, vmax=vmax)\n",
    "plt.title(\"tas - Ref\")\n",
    "\n",
    "plt.subplot(142)\n",
    "ds.tas.sel(time=slice(\"1981\", \"2010\")).mean(dim=\"time\").transpose(\n",
    "    \"lat\", ...\n",
    ").plot.imshow(vmin=vmin, vmax=vmax)\n",
    "plt.title(\"tas - Raw\")\n",
    "\n",
    "plt.subplot(143)\n",
    "ds_adj.tas.sel(time=slice(\"1981\", \"2010\")).mean(dim=\"time\").transpose(\n",
    "    \"lat\", ...\n",
    ").plot.imshow(vmin=vmin, vmax=vmax)\n",
    "plt.title(\"tas - Bias adjusted\")\n",
    "plt.tight_layout()\n",
    "\n",
    "plt.subplot(144)\n",
    "(\n",
    "    ds_adj.tas.sel(time=slice(\"1981\", \"2010\")).mean(dim=\"time\")\n",
    "    - ds_ref.tas.sel(time=slice(\"1981\", \"2010\")).mean(dim=\"time\")\n",
    ").transpose(\"lat\", ...).plot.imshow(vmin=-1, vmax=1, cmap=\"RdBu_r\")\n",
    "plt.title(\"Bias (°C)\")\n",
    "plt.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36",
   "metadata": {},
   "source": [
    "## Computing indicators\n",
    "\n",
    "<div class=\"alert alert-info\"> <b>NOTE</b>\n",
    "\n",
    "`xscen` relies heavily on `xclim`'s YAML support for calculating indicators. For more information on how to build the YAML file, consult [this Notebook](https://xclim.readthedocs.io/en/latest/notebooks/extendxclim.html?highlight=yaml#YAML-file).\n",
    "</div>\n",
    "\n",
    "`xs.compute_indicators` makes use of *xclim*'s indicator modules functionalities to compute a given list of indicators. It is called by either using:\n",
    "\n",
    "- The path to a [YAML file](https://xclim.readthedocs.io/en/stable/api.html#yaml-file-structure) structured in a way compatible with *xclim*'s `build_indicator_module_from_yaml`\n",
    "- An indicator module directly\n",
    "- A sequence of indicators\n",
    "- A sequence of tuples as returned by calling `iter_indicators()` on an indicator module.\n",
    "\n",
    "Same as the extraction function, since the output could have multiple frequencies, the function returns a python dictionary with the output frequency as keys. The inputs of `xs.compute_indicators` are:\n",
    "\n",
    "- `ds` is the *xr.Dataset* containing the required variables.\n",
    "- `indicators` instructs on which indicator(s) to compute. It can be a number of things, as listed above.\n",
    "- `periods` is a list of [start, end] of continuous periods to be considered.\n",
    "\n",
    "This example will use a simple YAML file structured like this:\n",
    "\n",
    "```\n",
    "realm: atmos\n",
    "indicators:\n",
    "  growing_degree_days:\n",
    "    base: growing_degree_days\n",
    "  tg_min:\n",
    "    base: tg_min\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "37",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "ds_dict = pcat.search(processing_level=\"biasadjusted\").to_dataset_dict()\n",
    "\n",
    "for ds in ds_dict.values():\n",
    "    # Output is dict, but it has only one frequency.\n",
    "    _, ds_ind = xs.compute_indicators(\n",
    "        ds=ds,\n",
    "        indicators=Path().absolute() / \"samples\" / \"indicators.yml\",\n",
    "    ).popitem()\n",
    "\n",
    "    # Save the results\n",
    "    filename = str(\n",
    "        output_folder\n",
    "        / f\"{ds_ind.attrs['cat:id']}.{ds_ind.attrs['cat:domain']}.{ds_ind.attrs['cat:processing_level']}.{ds_ind.attrs['cat:frequency']}.zarr\"\n",
    "    )\n",
    "    chunks = xs.io.estimate_chunks(ds, dims=[\"time\"], target_mb=50)\n",
    "    xs.save_to_zarr(ds_ind, filename, rechunk=chunks, mode=\"o\")\n",
    "\n",
    "    # Strongly suggested to update the project catalog AFTER you save to disk, in case it crashes during the process\n",
    "    pcat.update_from_ds(ds=ds_ind, path=filename, format=\"zarr\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "38",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "display(ds_ind)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39",
   "metadata": {},
   "source": [
    "## Spatio-temporal aggregation\n",
    "\n",
    "### Climatological operations\n",
    "\n",
    "`xs.climatological_op` is used to perform *n*-year operations over `ds.time.dt.year`.\n",
    "\n",
    "**NOTE:** The aggregation is over *year*, __*not* over *time*__. For example, if given monthly data, the climatological operation will be computed separately for January, February, etc. This means that the data should already be aggregated at the required frequency, for example using `xs.compute_indicators` to compute yearly, seasonal, or monthly indicators.\n",
    "\n",
    "The function call requires a `xr.Dataset` and argument `op` specifies the operation to perform. It can be any of the following:\n",
    "`['max', 'mean', 'median', 'min', 'std', 'sum', 'var', 'linregress']`.\n",
    "\n",
    "The optional arguments are as follows:\n",
    "\n",
    "- `window` indicates how many year to use for the average. Uses all available years by default.\n",
    "- `min_period` minimum number of years required for a value to be computed during the `rolling` operation.\n",
    "- `stride` indicates the stride (in years) at which to provide an output.\n",
    "- `periods` is a list of [start, end] of continuous periods to be considered.\n",
    "\n",
    "Additional arguments allow to control the output of the function by automatically renaming variables to reflect the operation performed, restructuring the output dataset and setting the `to_level` attribute.\n",
    "\n",
    "In the following example, we will use `op='mean'`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "40",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "ds_dict = pcat.search(processing_level=\"indicators\").to_dataset_dict()\n",
    "\n",
    "for key, ds in ds_dict.items():\n",
    "    ds_mean = xs.climatological_op(\n",
    "        ds=ds,\n",
    "        op=\"mean\",\n",
    "        window=30,\n",
    "        stride=10,\n",
    "        rename_variables=False,\n",
    "        to_level=\"30yr-climatology\",\n",
    "        horizons_as_dim=False,\n",
    "    )\n",
    "\n",
    "    # Save to zarr\n",
    "    filename = str(\n",
    "        output_folder\n",
    "        / f\"{ds_mean.attrs['cat:id']}.{ds_mean.attrs['cat:domain']}.{ds_mean.attrs['cat:processing_level']}.{ds_mean.attrs['cat:frequency']}.zarr\"\n",
    "    )\n",
    "    xs.save_to_zarr(ds=ds_mean, filename=filename, mode=\"o\")\n",
    "    pcat.update_from_ds(ds_mean, path=filename, format=\"zarr\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "41",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "display(ds_mean)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42",
   "metadata": {},
   "source": [
    "#### Horizon coordinate and time dimension\n",
    "\n",
    "Even if no `stride` is called, `xs.climatological_op` will substantially change the nature of the `time` dimension, because it now represents an aggregation over time. While no standards exist on how to reflect that in a dataset, the following was chosen for `xscen`:\n",
    "\n",
    "- `time` corresponds to the first timestep of each temporal average.\n",
    "- `horizon` is a new coordinate that either follows the format YYYY-YYYY or a warming-level specific nomenclature.\n",
    "- The `cat:frequency` and `cat:xrfreq` attributes remain unchanged.\n",
    "\n",
    "\n",
    "Alternatively, setting the `horizons_as_dim` argument to *True* will rearrange the dataset with a new dimension `horizon` and a dimension named according to the temporal aggregation when it is `month` or `season`, but omitting the singleton dimension `year`. The time stamps are conserved in the `time` coordinate as an array with those new dimensions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "43",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "print(f\"time: {ds_mean.time.values}\")\n",
    "print(f\"horizon: {ds_mean.horizon.values}\")\n",
    "print(f\"cat:xrfreq attribute: {ds_mean.attrs['cat:xrfreq']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44",
   "metadata": {},
   "source": [
    "### Computing deltas\n",
    "\n",
    "`xs.compute_deltas` is pretty self-explanatory. However, note that this function relies on the `horizon` coordinate described above and, thus, is intended to be performed following some kind of temporal aggregation.\n",
    "\n",
    "It has the following arguments:\n",
    "\n",
    "- `reference_horizon` indicates which horizon to use as reference.\n",
    "- `kind` is either \"+\", \"/\", or \"%\" for absolute, relative, or percentage deltas respectively. This argument can also be a dictionary, with the keys corresponding to data variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "45",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "ds_dict = pcat.search(processing_level=\"30yr-climatology\").to_dataset_dict()\n",
    "\n",
    "for key, ds in ds_dict.items():\n",
    "    ds_delta = xs.compute_deltas(\n",
    "        ds=ds,\n",
    "        reference_horizon=\"1981-2010\",\n",
    "        kind={\"growing_degree_days\": \"%\", \"tg_min\": \"+\"},\n",
    "        to_level=\"deltas\",\n",
    "    )\n",
    "\n",
    "    # Save to zarr\n",
    "    filename = str(\n",
    "        output_folder\n",
    "        / f\"{ds_delta.attrs['cat:id']}.{ds_delta.attrs['cat:domain']}.{ds_delta.attrs['cat:processing_level']}.{ds_delta.attrs['cat:frequency']}.zarr\"\n",
    "    )\n",
    "    chunks = xs.io.estimate_chunks(ds, dims=[\"time\"], target_mb=50)\n",
    "    xs.save_to_zarr(ds=ds_delta, filename=filename, rechunk=chunks, mode=\"o\")\n",
    "    pcat.update_from_ds(ds_delta, path=filename, format=\"zarr\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "46",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "print(f\"Deltas over {ds_delta.horizon.values}\")\n",
    "display(ds_delta.tg_min_delta_1981_2010.isel(lon=0, lat=0).values)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "47",
   "metadata": {},
   "source": [
    "### Spatial mean\n",
    "\n",
    "`xs.spatial_mean` is used to compute the spatial average over a given region, using various methods. The argument `call_clisops` can also be used to subset the domain prior to the averaging.\n",
    "\n",
    "- `method: cos-lat` will perform an average operation over the spatial dimensions, accounting for changes in grid cell area along the 'lat' coordinate.\n",
    "\n",
    "- `method: xesmf` will perform a call to *xESMF*'s [SpatialAverager](https://pangeo-xesmf.readthedocs.io/en/latest/notebooks/Spatial_Averaging.html). This method is the most precise, especially for irregular regions, but can be much slower.\n",
    "   - `kwargs` is used to sent arguments to `xesmf.SpatialAverager`.\n",
    "   - `region` is used to send a bbox or shapefile to the `SpatialAverager`. This argument is a dictionary that follows the same requirements as the one for `xs.extract` described previously.\n",
    "   - `simplify_tolerance` is a float that can be used to change the precision (in degree) of a shapefile before sending it to `SpatialAverager`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "ds_dict = pcat.search(processing_level=\"deltas\", domain=\"finer-grid\").to_dataset_dict()\n",
    "\n",
    "for key, ds in ds_dict.items():\n",
    "    ds_savg = xs.spatial_mean(\n",
    "        ds=ds,\n",
    "        method=\"cos-lat\",\n",
    "        to_domain=\"aggregated\",\n",
    "    )\n",
    "\n",
    "    # Save to zarr\n",
    "    filename = str(\n",
    "        output_folder\n",
    "        / f\"{ds_savg.attrs['cat:id']}.{ds_savg.attrs['cat:domain']}.{ds_savg.attrs['cat:processing_level']}.{ds_savg.attrs['cat:frequency']}.zarr\"\n",
    "    )\n",
    "    chunks = xs.io.estimate_chunks(ds, dims=[\"time\"], target_mb=50)\n",
    "    xs.save_to_zarr(ds=ds_savg, filename=filename, rechunk=chunks, mode=\"o\")\n",
    "    pcat.update_from_ds(ds_savg, path=filename, format=\"zarr\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "49",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Aggregated deltas over the study area\n",
    "display(ds_savg)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "50",
   "metadata": {},
   "source": [
    "## Ensemble statistics\n",
    "\n",
    "### Weights\n",
    "\n",
    "Typically, if an ensemble is inhomogeneous (uneven number of realizations per model, mix of GCMs and RCMs, etc.), the first step should be to call `xs.generate_weights` to create an adequate guess of what the weights should be between the various datasets. Do note, however, that this function does not replace an explicit assessment of the performance or independence of the simulations, and the results provided should be taken with a grain of salt.\n",
    "\n",
    "The arguments are as follows:\n",
    "\n",
    "- `independence_level` instruct on which weighting scheme to use and strongly influences the outputs. One of 'model', 'GCM', 'institution'.\n",
    "- `experiment_weights` can be used to assign a given total weight to each experiment (currently only supports giving 1 to each experiment).\n",
    "- `skipna` instructs on whether the weights should account for simulations with missing data.\n",
    "- `v_for_skipna` is the variable to use in the case of `skipna=False`.\n",
    "- `standardize` to make the weights sum to 1 for each instance of 'horizon' or 'time'.\n",
    "\n",
    "<div class=\"alert alert-info\"> <b>NOTE</b>\n",
    "\n",
    "`generate_weights` was built with `xscen` in mind, and thus relies on the `cat:` attributes automatically generated by `intake-esm` when data is loaded from a catalog. In the case of data generated elsewhere, the required and recommended attributes should minimally be added to the dataset before using this function.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "51",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# We don't have many simulations in this example to perform ensemble statistics, but we'll use the two SSP2-4.5 realizations\n",
    "datasets = pcat.search(\n",
    "    processing_level=\"deltas\", domain=\"aggregated\", experiment=\"ssp245\"\n",
    ").to_dataset_dict()\n",
    "\n",
    "weights = xs.generate_weights(datasets, independence_level=\"model\")\n",
    "display(weights)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "52",
   "metadata": {},
   "source": [
    "### Ensemble stats\n",
    "\n",
    "`xs.ensemble_stats` creates an ensemble out of many datasets and computes statistics on that ensemble (min, max, mean, percentiles, etc.) using the `xclim.ensembles` module. The inputs can be given in the form of a list or a dictionary of xr.Dataset or of paths.\n",
    "\n",
    "The arguments are as follows:\n",
    "\n",
    "- `statistics` is a dictionary that instructs on which `xclim.ensembles` statistics to call. It follows the format {function: arguments}.\n",
    "- `weights` is used to weight the various datasets, if needed.\n",
    "- `common_attrs_only`: `xclim.ensembles.create_ensemble` copies the attributes from the first dataset, but this might not be representative of the new ensemble. If `common_attrs_only` is True, it only keeps the global attributes that are the same for all datasets and generates a new ID.\n",
    "- `create_kwargs`: If given a set of paths, `xclim.ensembles.create_ensemble` will ignore the chunking on disk and open the datasets with only a chunking along the new `realization` dimension. Thus, for large datasets, this should be used to explicitly specify chunks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "53",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "ens_stats = xs.ensemble_stats(\n",
    "    datasets=datasets,\n",
    "    statistics={\n",
    "        \"ensemble_percentiles\": {\"split\": False}\n",
    "    },  # should be an existing function in xclim.ensembles\n",
    "    weights=weights,\n",
    "    common_attrs_only=True,\n",
    ")\n",
    "\n",
    "path = output_folder / f\"ensemble_{ens_stats.attrs['cat:id']}.zarr\"\n",
    "xs.save_to_zarr(ens_stats, filename=path, mode=\"o\")\n",
    "pcat.update_from_ds(ds=ens_stats, path=path, format=\"zarr\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "54",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "display(ens_stats)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "55",
   "metadata": {},
   "source": [
    "## Clean up\n",
    "\n",
    "At any time, such as after bias adjustment, `xs.clean_up` can be called to perform a number of small modifications to the datasets. That function can:\n",
    "\n",
    " - convert the variables to non-CF units using `xs.utils.change_units`\n",
    " - call the `xs.utils.maybe_unstack` function\n",
    " - convert the calendar and interpolate over missing dates\n",
    " - remove, remove everything but, and/or add a list of attributes\n",
    " - change the prefix of the catalog attrs (by default: `cat:`)\n",
    "\n",
    "in that order.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56",
   "metadata": {},
   "source": [
    "### Calendars\n",
    "\n",
    "During the bias adjustment step, it is frequent to convert the calendar to 'noleap'. However, once that step has been processed, we might want to put back all the February 29th (or other missing data in the case of '360_day' calendar). This can be done using the `convert_calendar_kwargs` argument of `xs.clean_up`, which passes a dictionary to `xclim.core.calendar.convert_calendar`.\n",
    "\n",
    "Usually, we want to linearly interpolate the missing temperatures, but put 0 mm/day for missing precipitation. If our dataset has many variables, the `missing` argument (for `convert_calendar`) can be set for each variable with `missing_by_var`. If `missing_by_var` is given 'interpolate', the missing data will be filled with NaNs, then linearly interpolated over time.\n",
    "\n",
    "Eg. `{'tasmax':'interpolate', 'pr':[0]}`\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "57",
   "metadata": {},
   "outputs": [],
   "source": [
    "convert_calendar_kwargs = {\"calendar\": \"standard\"}\n",
    "missing_by_var = {\"tas\": \"interpolate\"}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "58",
   "metadata": {},
   "source": [
    "### Attributes\n",
    "\n",
    "We might want to add, remove or modify the attributes.\n",
    "\n",
    "It is possible to write a list of attributes to remove with `attrs_to_remove`, or a list of attributes to keep and remove everything else with `remove_all_attrs_except`. Both take the shape of a dictionary where the keys are the variables (and 'global' for global attrs) and the values are the list.\n",
    "\n",
    "The element of the list can be exact matches for the attribute names or use regex matching rules (using a `fullmatch`):\n",
    "\n",
    "Attributes can also be added to datasets using `add_attrs`. This is a dictionary where the keys are the variables and the values are a another dictionary of attributes.\n",
    "\n",
    "It is also possible to modify the catalogue prefix 'cat:' by a new string with `change_attr_prefix`. Don't use this if this is not the last step of your workflow, as it may break some functions that rely on those prefixes to find the right dataset attributes.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "59",
   "metadata": {},
   "outputs": [],
   "source": [
    "attrs_to_remove = {\n",
    "    \"tas\": [\".*name.*\"]\n",
    "}  # remove tas attrs that contain the substring 'name'\n",
    "remove_all_attrs_except = {\n",
    "    \"global\": [\"cat:.*\"]\n",
    "}  # remove all the global attrs EXCEPT for the one starting with cat:\n",
    "add_attrs = {\n",
    "    \"tas\": {\"notes\": \"some crucial information\"}\n",
    "}  # add a new tas attribute named 'notes' with value 'some crucial information'\n",
    "change_attr_prefix = \"dataset:\"  # change 'cat': to 'dataset:'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "60",
   "metadata": {},
   "outputs": [],
   "source": [
    "ds = pcat.search(\n",
    "    processing_level=\"biasadjusted\", variable=\"tas\", experiment=\"ssp245\", member=\"r1.*\"\n",
    ").to_dataset()\n",
    "\n",
    "ds_clean = xs.clean_up(\n",
    "    ds=ds,\n",
    "    variables_and_units={\"tas\": \"degC\"},  # convert units\n",
    "    convert_calendar_kwargs=convert_calendar_kwargs,\n",
    "    missing_by_var=missing_by_var,\n",
    "    attrs_to_remove=attrs_to_remove,\n",
    "    remove_all_attrs_except=remove_all_attrs_except,\n",
    "    add_attrs=add_attrs,\n",
    "    change_attr_prefix=change_attr_prefix,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "61",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect calendars and the interpolated values\n",
    "print(\"Initial calendar: \", ds.time.dt.calendar)\n",
    "print(ds.time.sel(time=slice(\"2000-02-28\", \"2000-03-01\")).values)\n",
    "print(ds.tas.sel(time=slice(\"2000-02-28\", \"2000-03-01\")).isel(lat=1, lon=1).values)\n",
    "\n",
    "print(\"Final calendar: \", ds_clean.time.dt.calendar)\n",
    "print(ds_clean.time.sel(time=slice(\"2000-02-28\", \"2000-03-01\")).values)\n",
    "print(\n",
    "    ds_clean.tas.sel(time=slice(\"2000-02-28\", \"2000-03-01\")).isel(lat=1, lon=1).values\n",
    ")\n",
    "print(\"\")\n",
    "print(\"Inspect initial attributes\")\n",
    "display(ds)\n",
    "\n",
    "print(\"\")\n",
    "print(\"Inspect final attributes\")\n",
    "display(ds_clean)"
   ]
  }
 ],
 "metadata": {
  "@webio": {
   "lastCommId": null,
   "lastKernelId": null
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}