credit.data404
==============

.. py:module:: credit.data404


Attributes
----------

.. autoapisummary::

   credit.data404.Array
   credit.data404.IMAGE_ATTR_NAMES


Classes
-------

.. autoapisummary::

   credit.data404.Sample
   credit.data404.CONUS404Dataset


Functions
---------

.. autoapisummary::

   credit.data404.get_forward_data
   credit.data404.flatten
   credit.data404.lazymerge
   credit.data404.testC4loader


Module Contents
---------------

.. py:function:: get_forward_data(filename) -> xarray.DataArray

   Lazily opens the Zarr store at ``filename``.


.. py:data:: Array

.. py:data:: IMAGE_ATTR_NAMES
   :value: ('historical_ERA5_images', 'target_ERA5_images')


.. py:class:: Sample

   Bases: :py:obj:`TypedDict`


   Simple class for structuring data for the ML model.

   - ``x``: input (predictor) data, i.e., ``C404dataset[historical mask]``
   - ``y``: target (predictand) data, i.e., ``C404dataset[forecast mask]``

   Using ``typing.TypedDict`` gives us two advantages:

   1. Single 'source of truth' for the type and documentation of each example.
   2. A static type checker can check the types are correct.

   Instead of ``TypedDict``, we could use ``typing.NamedTuple``,
   which would provide runtime checks, but the deal-breaker with tuples is that they're immutable,
   so we cannot change the values in the transforms.


   .. py:attribute:: x
      :type:  Array


   .. py:attribute:: y
      :type:  Array
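
The mutability trade-off described in the ``Sample`` docstring can be illustrated with a minimal, self-contained sketch. The names ``SampleDict``, ``SampleTuple`` and ``scale_x`` are hypothetical, and ``Array`` is replaced by a plain list of floats because the module's actual alias is not shown on this page.

```python
from typing import List, NamedTuple, TypedDict

# Stand-in for the module's ``Array`` alias (assumption: the real alias differs).
Array = List[float]


class SampleDict(TypedDict):
    """Dict-based sample: a plain ``dict`` at runtime, so values can be replaced."""

    x: Array
    y: Array


class SampleTuple(NamedTuple):
    """Tuple-based sample: fields cannot be reassigned after construction."""

    x: Array
    y: Array


def scale_x(sample: SampleDict, factor: float) -> SampleDict:
    """Hypothetical transform that overwrites the predictors in place."""
    sample["x"] = [v * factor for v in sample["x"]]
    return sample


d = scale_x(SampleDict(x=[1.0, 2.0], y=[3.0]), 10.0)

t = SampleTuple(x=[1.0, 2.0], y=[3.0])
try:
    t.x = [0.0]  # reassigning a NamedTuple field raises AttributeError
    tuple_is_mutable = True
except AttributeError:
    tuple_is_mutable = False
```

A transform written against the dict form can update ``x`` or ``y`` directly, whereas the tuple form would have to build a new object on every step.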


.. py:function:: flatten(array)

   Flattens a list of lists.
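
As an illustration of what flattening a list of lists means, here is a minimal sketch. It is not the module's implementation: ``flatten_sketch`` is a hypothetical name, and the assumption is that only one level of nesting is removed.

```python
from itertools import chain
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def flatten_sketch(nested: Iterable[Iterable[T]]) -> List[T]:
    """Remove exactly one level of nesting, preserving element order."""
    return list(chain.from_iterable(nested))


flat = flatten_sketch([[1, 2], [3], []])
```

Deeper nesting is left untouched in this sketch, so ``[["a", ["b"]]]`` becomes ``["a", ["b"]]``.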


.. py:function:: lazymerge(zlist, rename=None)

   merges zarr stores opened lazily with get_forward_data()


.. py:class:: CONUS404Dataset

   Bases: :py:obj:`torch.utils.data.Dataset`


   Each Zarr store for the CONUS-404 data contains one year of
   hourly data for one variable.

   When we're sampling data, we only want to load from a single Zarr
   store; we don't want the samples to span Zarr store boundaries.
   This lets us leave years out from across the entire span for
   validation during training.

   To do this, we segment the dataset by year.  We figure out how
   many samples we could have, then subtract all the ones that start
   in one year and end in another (or end past the end of the
   dataset).  Then we create an index of which segment each sample
   belongs to, and the number of that sample within the segment.

   Then, for the ``__getitem__`` method, we look up which segment the
   sample is in and its numbering within the segment, then open the
   corresponding Zarr store and read only the data we want with an
   ``isel()`` call.

   For multiple variables, we necessarily end up reading from multiple
   stores, but they're merged into a single xarray Dataset, so
   hopefully that won't cause a big performance hit.


   .. py:attribute:: zarrpath
      :type:  str
      :value: '/glade/campaign/ral/risc/DATA/conus404/zarr'


   .. py:attribute:: varnames
      :type:  List[str]
      :value: []


   .. py:attribute:: history_len
      :type:  int
      :value: 2


   .. py:attribute:: forecast_len
      :type:  int
      :value: 1


   .. py:attribute:: transform
      :type:  Optional[Callable]
      :value: None


   .. py:attribute:: seed
      :type:  int
      :value: 22


   .. py:attribute:: skip_periods
      :type:  int
      :value: None


   .. py:attribute:: one_shot
      :type:  bool
      :value: False


   .. py:attribute:: start
      :type:  str
      :value: None


   .. py:attribute:: finish
      :type:  str
      :value: None


   .. py:method:: __post_init__()


   .. py:method:: __len__()


   .. py:method:: __getitem__(index)


   .. py:method:: get_data(index, do_transform=True)

      Gets an element by index, as ``__getitem__`` does, but
      with an optional argument to skip applying the normalization
      transform.
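
The per-year segmentation described in the ``CONUS404Dataset`` docstring can be sketched in plain Python. This is one possible way to build such an index, using cumulative counts and a binary search, and not necessarily how the class does it. The function names are hypothetical, ``skip_periods`` is ignored, and each sample is assumed to cover ``history_len + forecast_len`` consecutive time steps.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple


def build_segment_index(segment_lengths: List[int], sample_len: int) -> List[int]:
    """Cumulative count of samples that fit entirely inside each segment.

    A segment of n time steps holds n - sample_len + 1 samples (or none if it
    is too short), which excludes every sample that would cross a boundary.
    """
    per_segment = [max(0, n - sample_len + 1) for n in segment_lengths]
    return list(accumulate(per_segment))


def locate(index: int, cumulative: List[int]) -> Tuple[int, int]:
    """Map a global sample index to (segment number, start offset in segment)."""
    total = cumulative[-1] if cumulative else 0
    if not 0 <= index < total:
        raise IndexError(index)
    seg = bisect_right(cumulative, index)
    start = index - (cumulative[seg - 1] if seg > 0 else 0)
    return seg, start


# Three toy "years" of 5, 4 and 6 time steps; history_len=2, forecast_len=1.
cumulative = build_segment_index([5, 4, 6], sample_len=2 + 1)
first_of_second_year = locate(3, cumulative)
```

The resulting pair could then drive a positional read such as ``ds.isel(time=slice(start, start + sample_len))`` on the store for that segment, where the dimension name ``time`` is an assumption.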


.. py:function:: testC4loader()

   Test load speed for different numbers of variables and storage locations.
   A full load for C404 takes about 4 sec on campaign and 5 sec on scratch.