credit.datasets.datamap
=======================

.. py:module:: credit.datasets.datamap

.. autoapi-nested-parse::

   datamap.py
   --------------------------------------------------

   The xarray library can be slow on very large datasets, so we manage
   data manually when training ML models.  A datamap object performs
   many of the same functions as an xarray.Dataset (not to be confused
   with a torch.utils.data.Dataset), but omits much of the functionality
   and checking in exchange for speed.

   A datamap provides a list-like interface to a set of netCDF files,
   virtually concatenating them along the time dimension.  It tracks
   which index values live in which file, and lazily opens files and
   reads data only for the requested indexes.  It can also convert
   between datetimes and indexes.

   NOTE: the datamap object assumes that:

   1) The time coordinates are uniformly spaced and have no gaps (i.e.,
   after setup, it only tracks how many timesteps are in each file, not
   the actual values of the time coordinate)

   2) Lexicographic sort of the filenames is equivalent to chronological
   ordering of the data.  This is true if the filenames are uniformly
   named except for a time element, and the time element is in ISO-8601
   format (YYYY-MM-DD etc.).

   3) Contents of the files are uniform.  All of the variables in the
   datamap have the same dimensionality, and all variables exist in each
   file.

   4) The time dimension has a coordinate variable with the attributes
   'axis', 'calendar', and 'units' set to the appropriate CF-compliant
   values.
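
   Assumption 2 can be checked with a plain ``sorted`` call.  The
   filenames below are invented for illustration; the second group shows
   how a date element that is not ISO-8601 breaks the equivalence.

   .. code-block:: python

      # Hypothetical filenames: uniform except for an ISO-8601 date element.
      iso_names = [
          "model_2001-02-01.nc",
          "model_2000-12-01.nc",
          "model_2001-01-01.nc",
      ]
      chronological = [
          "model_2000-12-01.nc",
          "model_2001-01-01.nc",
          "model_2001-02-01.nc",
      ]
      # Lexicographic order matches chronological order.
      iso_sorted_ok = sorted(iso_names) == chronological

      # The same dates written as D-M-YYYY without zero padding,
      # listed here in chronological order.
      dmy_chronological = [
          "model_1-12-2000.nc",
          "model_1-1-2001.nc",
          "model_1-2-2001.nc",
      ]
      # Sorting reorders them, so the assumption would be violated.
      dmy_sorted_ok = sorted(dmy_chronological) == dmy_chronological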



Classes
-------

.. autoapisummary::

   credit.datasets.datamap.VarDict
   credit.datasets.datamap.DataMap


Functions
---------

.. autoapisummary::

   credit.datasets.datamap.rescale_minmax


Module Contents
---------------

.. py:class:: VarDict

   Bases: :py:obj:`TypedDict`


   A dictionary of the variables that could be in a dataset, grouped by
   role.


   .. py:attribute:: boundary
      :type:  List


   .. py:attribute:: prognostic
      :type:  List


   .. py:attribute:: diagnostic
      :type:  List


   .. py:attribute:: unused
      :type:  List
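
   As an illustration of the expected shape, the sketch below defines a
   stand-in ``TypedDict`` with the same four keys and fills it in.
   ``VarDictSketch`` and the variable names are placeholders chosen for
   this example, not names taken from the module.

   .. code-block:: python

      from typing import List, TypedDict


      class VarDictSketch(TypedDict):
          """Stand-in that mirrors the four keys documented for VarDict."""

          boundary: List
          prognostic: List
          diagnostic: List
          unused: List


      # Variable names here are placeholders, not names required by the module.
      variables: VarDictSketch = {
          "boundary": ["solar"],
          "prognostic": ["T", "U", "V"],
          "diagnostic": ["precip"],
          "unused": [],
      }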


.. py:function:: rescale_minmax(x)

   Rescale data to the range [0, 1].  This function is used instead of
   ``sklearn.preprocessing.minmax_scale`` because the latter requires
   reshaping the data, which is unnecessary overhead for a use case this
   simple.
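
   The function body is not shown in this reference.  A minimal sketch of
   min-max rescaling with NumPy, assuming the usual formula
   ``(x - min) / (max - min)`` applied over the whole array, might look
   like this; ``rescale_minmax_sketch`` is a hypothetical name.

   .. code-block:: python

      import numpy as np


      def rescale_minmax_sketch(x):
          """Linearly map x so its minimum becomes 0 and its maximum becomes 1."""
          x = np.asarray(x, dtype=float)
          lo, hi = x.min(), x.max()
          # Assumes hi > lo; a constant array would divide by zero here.
          return (x - lo) / (hi - lo)


      # Works on an array of any shape without reshaping.
      field = np.array([[2.0, 4.0], [6.0, 10.0]])
      scaled = rescale_minmax_sketch(field)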


.. py:class:: DataMap

   Class for reading in netCDF data from multiple files.

   - ``rootpath``: path to the files
   - ``glob``: filename glob of netCDF files (relative to ``rootpath``)
   - ``dim``: dimensions of the data:

     - ``static``: no time dimension; data is loaded on initialization
     - ``3D``: data has a z-dimension; levels can be subset using
       ``zstride``
     - ``2D`` (default): time-varying 2D data

   - ``normalize``: if ``dim == 'static'`` and ``normalize`` is True,
     scale data to the range [0, 1]
   - ``zstride``: if ``dim == '3D'``, subset in the Z dimension by
     ``::zstride`` when reading
   - ``boundary``: list of variable names to use as (input-only) boundary
     conditions
   - ``diagnostic``: list of diagnostic (output-only) variable names
   - ``prognostic``: list of prognostic (state / input-output) variable
     names
   - ``unused``: list of unused variables (optional)
   - ``history_len``: number of input timesteps
   - ``forecast_len``: number of output timesteps
   - ``first_date``: restrict dataset to timesteps >= this point in time
   - ``last_date``: restrict dataset to timesteps <= this point in time

   first_date and last_date default to None, which means use the
   first/last timestep in the dataset.  Note that they must be
   YYYY-MM-DD strings (with optional HH:MM:SS), not datetime objects.

   Note also that the time coordinate must be contiguous across the
   files, with no gaps or overlaps.


   .. py:attribute:: rootpath
      :type:  str


   .. py:attribute:: glob
      :type:  str


   .. py:attribute:: dim
      :type:  str
      :value: '2D'


   .. py:attribute:: normalize
      :type:  bool
      :value: False


   .. py:attribute:: zstride
      :type:  int
      :value: 1


   .. py:attribute:: variables
      :type:  VarDict[str, List]
      :value: []


   .. py:attribute:: history_len
      :type:  int
      :value: 2


   .. py:attribute:: forecast_len
      :type:  int
      :value: 1


   .. py:attribute:: first_date
      :type:  str
      :value: None


   .. py:attribute:: last_date
      :type:  str
      :value: None


   .. py:method:: __post_init__()


   .. py:method:: date2tindex(datestring)

      Convert datestring (in ISO8601 YYYY-MM-DD format) to
      internal time index.  Datestring can optionally also have an
      HH:MM:SS component; if absent, it defaults to 00:00:00.
      Returns 0 if dataset is static.
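
      Under the module's assumption of a uniformly spaced time axis, this
      kind of conversion reduces to integer division.  The sketch below
      illustrates the idea with the standard-library ``datetime``; it
      ignores CF calendars and units, and ``date_to_index``,
      ``first_time``, and ``step`` are hypothetical names rather than part
      of this class.

      .. code-block:: python

         from datetime import datetime, timedelta


         def date_to_index(datestring, first_time, step):
             """Return the position of datestring on a uniformly spaced time axis."""
             # fromisoformat accepts 'YYYY-MM-DD' with or without ' HH:MM:SS';
             # when the time part is absent it defaults to 00:00:00.
             when = datetime.fromisoformat(datestring)
             index, remainder = divmod(when - first_time, step)
             if remainder:
                 raise ValueError(f"{datestring} does not fall on a timestep")
             return index


         # Hypothetical axis: starts 2000-01-01 00:00:00 with 6-hourly steps.
         start = datetime(2000, 1, 1)
         six_hours = timedelta(hours=6)
         midnight_next_day = date_to_index("2000-01-02", start, six_hours)
         noon_first_day = date_to_index("2000-01-01 12:00:00", start, six_hours)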


   .. py:method:: sindex2dates(sindex)

      Returns dates associated with sample index as a dict
      containing time coordinates, units, calendar, and ISO8601
      dates from the cftime library.  Returns None if dataset is
      static.


   .. py:method:: __len__()


   .. py:method:: __getitem__(index)


   .. py:property:: mode
      :type: str


   .. py:method:: read(segment, start, finish)

      Open the file and read data from ``start`` to ``finish`` for the
      needed variables.