credit.datasets.downscaling_dataset
===================================

.. py:module:: credit.datasets.downscaling_dataset


Attributes
----------

.. autoapisummary::

   credit.datasets.downscaling_dataset.Array


Classes
-------

.. autoapisummary::

   credit.datasets.downscaling_dataset.Sample
   credit.datasets.downscaling_dataset.DownscalingDataset


Module Contents
---------------

.. py:data:: Array

.. py:class:: Sample

   Bases: :py:obj:`TypedDict`


   Simple class for structuring data for the ML model.

   - x: predictor (input) data, stored in `input`
   - y: predictand (target) data, stored in `target`

   Using typing.TypedDict gives us two advantages:

   1. A single 'source of truth' for the type and documentation of each example.
   2. A static type checker can check that the types are correct.

   Instead of TypedDict, we could use typing.NamedTuple, which would
   provide runtime checks, but the deal-breaker with tuples is that
   they're immutable, so we cannot change the values in the
   transforms.
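
   The mutability point above can be illustrated with a minimal sketch.
   It uses plain Python lists as a stand-in for `Array`, and the names
   `SampleSketch`, `FrozenSample`, and `double_input` are hypothetical;
   none of them are part of this module.

   .. code-block:: python

      from typing import List, NamedTuple, TypedDict


      class SampleSketch(TypedDict):
          # Hypothetical stand-in for Sample, with lists instead of Array
          input: List[float]
          target: List[float]


      class FrozenSample(NamedTuple):
          # Same fields, but as an immutable tuple
          input: List[float]
          target: List[float]


      def double_input(sample: SampleSketch) -> SampleSketch:
          # A transform can reassign a key of a TypedDict in place
          sample["input"] = [2 * v for v in sample["input"]]
          return sample


      sample = double_input(SampleSketch(input=[1.0, 2.0], target=[3.0]))

      frozen = FrozenSample(input=[1.0, 2.0], target=[3.0])
      try:
          frozen.input = [0.0]  # NamedTuple fields cannot be reassigned
          reassigned = True
      except AttributeError:
          reassigned = False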


   .. py:attribute:: input
      :type:  Array


   .. py:attribute:: target
      :type:  Array


.. py:class:: DownscalingDataset

   Bases: :py:obj:`torch.utils.data.Dataset`


   Class that wrangles a collection of DataMaps and their
   associated DataTransforms for use in ML.  Intended to be
   initialized with `**conf['data']`; see `config/downscaling.yml`
   for an example.

   :param rootpath: path to the data files
   :type rootpath: str
   :param history_len: number of input timesteps for training
   :type history_len: int
   :param forecast_len: number of output timesteps for training
   :type forecast_len: int
   :param valid_history_len: number of input timesteps for validation
   :type valid_history_len: int
   :param valid_forecast_len: number of output timesteps for validation.
                              All four history and forecast length arguments default to 1.
   :type valid_forecast_len: int
   :param first_date: starting point of dataset
   :type first_date: YYYY-MM-DD string
   :param last_date: ending point of dataset.
                     Use first_date and last_date to define a subset of the dataset,
                     e.g., to use 1980-2000 for training and 2001-2010 for testing.
                     They default to the first and last timesteps in the files, respectively.
   :type last_date: YYYY-MM-DD string
   :param datasets: dicts of parameters for initializing
                    DataMaps and their corresponding DataTransforms via
                    `**kwargs`.  These dicts are replaced with actual objects
                    during initialization.
   :type datasets: nested dict
   :param image_width: width of dataset in gridcells
   :type image_width: int
   :param image_height: height of dataset in gridcells.
                        image_width and image_height are used to automatically resize
                        different constituent datasets to the same size via
                        expansion and padding.  E.g., if the declared image size is
                        120x120, a 55x55 dataset would be expanded by a factor of
                        2 and padded by 10 to match.  They default to the width
                        (height) of the widest (tallest) dataset.
   :type image_height: int
   :param get_time_from: name of the dataset to pull time
                         coordinates from when creating a sample; used when writing
                         output to netcdf.  Defaults to first non-static dataset
                         that has boundary variables.
   :type get_time_from: str
   :param transform: whether to apply normalizing transforms to samples.
                     Defaults to True.
   :type transform: bool

   .. attribute:: sample_len

      number of timesteps in a sample
      (sample_len = history_len + forecast_len)

   .. attribute:: len

      number of samples in the dataset, which will be less than
      the number of timesteps in the netcdf files, due to
      first_date/last_date subsetting and sample_len > 1.

   .. attribute:: mode

      determines which variables are returned when sampling.
      Valid values are 'train', 'test', and 'predict'.
      Defaults to 'train'.

      - train: boundary and prognostic variables for input timesteps,
        prognostic and diagnostic variables for target timesteps
      - test: boundary variables for input timesteps,
        prognostic and diagnostic variables for target timesteps
      - predict: boundary variables for input timesteps,
        nothing for target timesteps

   .. attribute:: output

      determines the format of samples; see __getitem__ for
      details.  Valid values: 'by_dset', 'by_io', 'tensor'.
      Defaults to 'tensor'.

   .. attribute:: arrangement

      a pandas dataframe that defines how variables are
      ordered in the tensor.  Columns: dataset, dim, var, usage, name

   .. attribute:: tnames

      a list of names (structured `dataset.var[.z-level]`)
      corresponding to the channels in an output tensor (i.e.,
      prognostic and diagnostic variables only).


   .. py:attribute:: rootpath
      :type:  str


   .. py:attribute:: history_len
      :type:  int
      :value: 1


   .. py:attribute:: forecast_len
      :type:  int
      :value: 1


   .. py:attribute:: valid_history_len
      :type:  int
      :value: 1


   .. py:attribute:: valid_forecast_len
      :type:  int
      :value: 1


   .. py:attribute:: first_date
      :type:  str
      :value: None


   .. py:attribute:: last_date
      :type:  str
      :value: None


   .. py:attribute:: datasets
      :type:  Dict


   .. py:attribute:: image_width
      :type:  int
      :value: None


   .. py:attribute:: image_height
      :type:  int
      :value: None


   .. py:attribute:: transform
      :type:  bool
      :value: True


   .. py:attribute:: get_time_from
      :type:  str
      :value: None


   .. py:attribute:: _mode
      :type:  str
      :value: 'train'


   .. py:attribute:: _output
      :type:  str
      :value: 'tensor'


   .. py:method:: __post_init__()


   .. py:method:: _setup_datasets()

      Replace the `datasets` argument dict (a nested dict of
      configurations for the constituent datasets & their
      associated transforms) with DataMap and DataTransform
      objects created from those configurations.

      Datasets are allowed to be different sizes.  When sampling, we
      automatically resize them to (image_width x image_height) by
      adding Expand and/or Pad transforms to the sequence defined
      for each dataset.  To do this, we need to construct the
      self.datasets dict in two passes, because we need the shape of
      each dataset in order to figure out those resizing transforms.
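
      The resizing arithmetic can be sketched as below.  This is only an
      illustration: the helper name `resize_plan` is hypothetical, and it
      assumes an integer expansion factor followed by padding of whatever
      remains.  How the padding is split between edges is not specified
      here.

      .. code-block:: python

         from typing import Tuple


         def resize_plan(size: int, target: int) -> Tuple[int, int]:
             """Return (expansion factor, total padding) to grow size to target."""
             if size <= 0 or size > target:
                 raise ValueError("size must be between 1 and target")
             # Assumption: use the largest integer expansion that still fits
             factor = target // size
             # Gridcells left over after expansion are filled by padding
             padding = target - size * factor
             return factor, padding


         factor, padding = resize_plan(55, 120)

      For the 55x55 dataset and 120x120 image from the parameter
      description, this gives an expansion factor of 2 and a padding of 10
      along each dimension.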


   .. py:method:: _setup_arrangement()

      Construct a pandas dataframe that defines the ordering used
      by :meth:`~.rearrange` to go from the nested dict arrangement
      returned by :meth:`~.getdata` to the arrangement used as input
      by :meth:`~.to_tensor`.  See :meth:`~.__getitem__` for more
      details.

      Also creates self.tnames, a list of names for the channels in
      the output tensor formatted "<dataset>.<variable>[.z<level>]"
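
      A small sketch of that naming pattern follows.  The helper
      `channel_names` and the dataset and variable names are hypothetical,
      and whether the level is an index or a coordinate value is an
      assumption not confirmed by this documentation.

      .. code-block:: python

         from typing import List, Optional


         def channel_names(dataset: str, var: str,
                           levels: Optional[List[int]] = None) -> List[str]:
             """Build names shaped like <dataset>.<variable>[.z<level>]."""
             if levels is None:
                 # Variables without z-levels get a single channel name
                 return [f"{dataset}.{var}"]
             # Assumption: one channel per z-level, labelled with the level
             return [f"{dataset}.{var}.z{lev}" for lev in levels]


         names = channel_names("gcm", "tas") + channel_names("gcm", "ta", [0, 1])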


   .. py:method:: __len__()

      Number of samples in the dataset.  Note that this is
      smaller than the number of timesteps in the files, both
      because `first_date` and `last_date` may not cover the full
      range, and because sample length = `history_len` +
      `forecast_len` and `~.__getitem__` only returns complete
      samples.
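
      As a rough sketch of why the count shrinks, assume that samples are
      complete windows of consecutive timesteps and that a new window
      starts at every timestep.  That stride is an assumption, and
      `n_samples` is a hypothetical helper, not the class's own code.

      .. code-block:: python

         def n_samples(n_timesteps: int, history_len: int = 1,
                       forecast_len: int = 1) -> int:
             """Count complete windows of length history_len + forecast_len."""
             sample_len = history_len + forecast_len
             # The last (sample_len - 1) start positions cannot form a full sample
             return max(0, n_timesteps - sample_len + 1)


         count = n_samples(10)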


   .. py:method:: __getitem__(index)

      Gets the index'th sample from the dataset.  The value of the
      'output' attribute controls the format of the returned object.

      `~.output == 'by_dset'` returns a nested dict structured
      [dataset][usage][variable] (the format returned by
      :meth:`~.getdata`); the leaf elements are numpy ndarrays
      covering the full time period (history_len+forecast_len).

      `~.output == 'by_io'` splits the ndarrays into history and
      forecast periods and reorganizes the nested dict to
      [input/target][dataset.variable] (the format returned by
      :meth:`~.rearrange` and taken as input by
      :meth:`~.to_tensor`).  The variables are ordered:

          - first: boundary > prognostic > diagnostic (in-only > in-out > out-only)
          - then: static > 2D > 3D
          - then: order that datasets are defined in config
          - then: alphabetical by variable name

      `~.output == 'tensor'` stacks the ndarrays in the z-dimension
      and converts them to a pair of PyTorch tensors (input and
      target), returning them as a Sample.  It also includes the
      associated time coordinates in the sample.
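
      The variable ordering listed for 'by_io' can be sketched as a
      multi-key sort.  The row layout mirrors the columns of the
      `arrangement` attribute, but the label strings in `USAGE_RANK` and
      `DIM_RANK`, the helper `order_variables`, and the example rows are
      assumptions made for illustration.

      .. code-block:: python

         from typing import Dict, List

         # Assumed labels; the real values stored in the dataframe may differ
         USAGE_RANK = {"boundary": 0, "prognostic": 1, "diagnostic": 2}
         DIM_RANK = {"static": 0, "2D": 1, "3D": 2}


         def order_variables(rows: List[Dict[str, str]],
                             dataset_order: List[str]) -> List[Dict[str, str]]:
             """Sort by usage, then dimensionality, then config order, then name."""
             ds_rank = {name: i for i, name in enumerate(dataset_order)}
             return sorted(
                 rows,
                 key=lambda r: (USAGE_RANK[r["usage"]], DIM_RANK[r["dim"]],
                                ds_rank[r["dataset"]], r["var"]),
             )


         rows = [
             {"dataset": "b", "dim": "2D", "var": "tas", "usage": "prognostic"},
             {"dataset": "a", "dim": "3D", "var": "ua", "usage": "boundary"},
             {"dataset": "a", "dim": "static", "var": "orog", "usage": "boundary"},
             {"dataset": "a", "dim": "2D", "var": "pr", "usage": "prognostic"},
             {"dataset": "b", "dim": "2D", "var": "hus", "usage": "diagnostic"},
         ]
         ordered = [f"{r['dataset']}.{r['var']}"
                    for r in order_variables(rows, ["a", "b"])]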


   .. py:method:: getdata(dset, index)

      Gets data for the index'th sample from dataset `dset`.
      Returns a nested dict organized [dataset][usage][variable].


   .. py:method:: rearrange(items)

      Rearranges a nested dict of data from [dataset][usage][var]
      to [input/target][dataset.var]. Elements returned depend on
      `~.mode`:

          - `train`: input contains boundary and prognostic
            variables, target contains prognostic and diagnostic
            variables.

          - `init`: input contains boundary and prognostic
            variables, target is empty.

          - `infer`: input contains boundary variables, `target` is
            empty.
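
      The regrouping described above can be sketched with plain dicts.
      This is not the method's implementation: `rearrange_sketch` is a
      hypothetical name, the usage labels are assumed to be 'boundary',
      'prognostic', and 'diagnostic', and the split into history and
      forecast periods and the variable ordering are left out.

      .. code-block:: python

         # Which usages feed each side, per mode, following the list above
         INPUT_USAGES = {
             "train": ("boundary", "prognostic"),
             "init": ("boundary", "prognostic"),
             "infer": ("boundary",),
         }
         TARGET_USAGES = {
             "train": ("prognostic", "diagnostic"),
             "init": (),
             "infer": (),
         }


         def rearrange_sketch(items, mode="train"):
             """Regroup [dataset][usage][var] into [input/target][dataset.var]."""
             result = {"input": {}, "target": {}}
             for dataset, by_usage in items.items():
                 for usage, variables in by_usage.items():
                     for var, data in variables.items():
                         name = f"{dataset}.{var}"
                         if usage in INPUT_USAGES[mode]:
                             result["input"][name] = data
                         if usage in TARGET_USAGES[mode]:
                             result["target"][name] = data
             return result


         items = {
             "gcm": {"boundary": {"ua": [1]}, "prognostic": {"tas": [2]}},
             "obs": {"diagnostic": {"pr": [3]}},
         }
         train = rearrange_sketch(items, "train")
         infer = rearrange_sketch(items, "infer")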


   .. py:method:: to_tensor(sample)

      Takes a nested dict organized [input/target][vars], with
      data arrays (numpy ndarrays) dimensioned [T, Z, Y, X].

      Stacks variables along the z-dimension (i.e., so different
      z-levels of a 3D variable are treated as different variables).

      Combines variable stacks into tensors ordered [V, T, Y, X] and
      returns a Sample whose `input` holds the historical (input) data
      and whose `target` holds the forecast (target) data.
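
      The stacking step can be sketched with NumPy alone.  The helper
      `stack_variables` and the variable names are hypothetical, and the
      final conversion to a PyTorch tensor is omitted to keep the example
      self-contained.

      .. code-block:: python

         import numpy as np


         def stack_variables(variables):
             """Stack [T, Z, Y, X] arrays along Z and return [V, T, Y, X]."""
             # Concatenating along axis 1 turns every z-level into its own channel
             stacked = np.concatenate(list(variables.values()), axis=1)  # [T, V, Y, X]
             # Move the channel axis to the front to get [V, T, Y, X]
             return np.moveaxis(stacked, 1, 0)


         variables = {
             "gcm.tas": np.zeros((2, 1, 4, 5)),  # 2D variable: a single level
             "gcm.ta": np.ones((2, 3, 4, 5)),    # 3D variable: three levels
         }
         stacked = stack_variables(variables)

      Here the one-level and three-level variables together produce four
      channels, so `stacked.shape` is `(4, 2, 4, 5)`.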


   .. py:method:: revert(prediction)

      Converts a tensor (ML model output) back into a nested dict
      of numpy arrays (i.e., reverses the `getdata -> rearrange ->
      to_tensor` pipeline).


   .. py:property:: mode
      :type: str


   .. py:property:: output
      :type: str