credit.datasets.datamap

The xarray library can be slow on very large datasets, so we need to manage data manually when training ML models. A datamap object performs many of the same functions as an xarray.Dataset (not to be confused with a torch.utils.data.Dataset), but omits much of its functionality and checking in exchange for speed.

A datamap provides a list-like interface to a set of netCDF files, virtually concatenating them along the time dimension. It tracks which index values live in which file, and lazily opens files and reads data only for the requested indexes. It can also convert between datetimes and indexes.
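The index bookkeeping described above can be illustrated with a minimal sketch. It is not the DataMap implementation: the function name `locate` and the per-file counts are hypothetical, and the sketch assumes only that the number of timesteps in each file is known.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(index, steps_per_file):
    """Map a global time index to (file number, index within that file)."""
    # Cumulative end offsets: for counts [3, 4, 2] this is [3, 7, 9].
    ends = list(accumulate(steps_per_file))
    if not 0 <= index < ends[-1]:
        raise IndexError(f"time index {index} out of range")
    # The owning file is the first one whose end offset exceeds the index.
    file_no = bisect_right(ends, index)
    file_start = ends[file_no] - steps_per_file[file_no]
    return file_no, index - file_start


counts = [3, 4, 2]          # hypothetical number of timesteps in each file
print(locate(0, counts))    # (0, 0)
print(locate(3, counts))    # (1, 0)
print(locate(8, counts))    # (2, 1)
```

Only the file that owns the requested index needs to be opened, which is what makes lazy reading possible.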

NOTE: the datamap object assumes that:

1) The time coordinates are uniformly spaced and have no gaps. (After setup, the datamap tracks only how many timesteps are in each file, not the actual values of the time coordinate.)

2) Lexicographic sorting of the filenames is equivalent to chronological ordering of the data. This is true if the filenames are uniformly named except for a time element, and the time element is in ISO-8601 format (YYYY-MM-DD etc.).

3) Contents of the files are uniform. All of the variables in the datamap have the same dimensionality, and all variables exist in each file.

4) The time dimension has a coordinate variable with the attributes ‘axis’, ‘calendar’, and ‘units’ set to the appropriate CF-compliant values.
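Assumption 2 can be checked with a short example. The filenames below are hypothetical; the point is that zero-padded ISO-8601 dates sort chronologically as plain strings, while unpadded date elements do not.

```python
# Hypothetical filenames; only the date element differs between them.
iso_names = ["era5_2001-10-01.nc", "era5_2001-02-01.nc", "era5_2000-12-01.nc"]
iso_sorted = sorted(iso_names)
print(iso_sorted)
# ['era5_2000-12-01.nc', 'era5_2001-02-01.nc', 'era5_2001-10-01.nc']

# Unpadded month numbers break the equivalence: "10" sorts before "2".
unpadded = ["era5_2001-2.nc", "era5_2001-10.nc"]
unpadded_sorted = sorted(unpadded)
print(unpadded_sorted)
# ['era5_2001-10.nc', 'era5_2001-2.nc']
```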

Content:

Classes#

`VarDict`	a dictionary of the variables that could be in a dataset
`DataMap`	Class for reading in netCDF data from multiple files.

Functions#

rescale_minmax(x)

Rescale data to [0,1].

Module Contents#

class credit.datasets.datamap.VarDict#

Bases: TypedDict

a dictionary of the variables that could be in a dataset

boundary: List#

prognostic: List#

diagnostic: List#

unused: List#

credit.datasets.datamap.rescale_minmax(x)#: Rescale data to [0,1]. This is used instead of sklearn.preprocessing.minmax_scale, which requires reshaping the data, an unnecessary step for a use case this simple.
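For illustration, min-max rescaling over a whole array might look like the following. This is a sketch under assumptions, not the body of rescale_minmax: the name `rescale_minmax_sketch` is hypothetical, the input is assumed to be a NumPy-compatible array, and the handling of constant fields is left open.

```python
import numpy as np


def rescale_minmax_sketch(x):
    """Linearly rescale an array so its minimum maps to 0 and its maximum to 1."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    # Assumes hi > lo; a constant field would divide by zero here.
    return (x - lo) / (hi - lo)


field = np.array([[2.0, 4.0], [6.0, 10.0]])   # any shape works, no reshaping
scaled = rescale_minmax_sketch(field)
print(scaled.tolist())   # [[0.0, 0.25], [0.5, 1.0]]
```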

class credit.datasets.datamap.DataMap#

Class for reading in netCDF data from multiple files.

rootpath: pathway to the files
glob: filename glob of netcdf files (relative to rootpath)
dim: dimensions of the data:
    static: no time dimension; data is loaded on initialization
    3D: data has z-dimension; can subset levels using zstride
    2D: default: time-varying 2D data
normalize: if dim == 'static' & normalize == True, scale data to range [0,1]
zstride: if dim == '3D', subset in Z dimension by ::zstride when reading
boundary: list of variable names to use as (input-only) boundary conditions
diagnostic: list of diagnostic (output-only) variable names
prognostic: list of prognostic (state / input-output) variable names
unused: list of unused variables (optional)
history_len: number of input timesteps
forecast_len: number of output timesteps
first_date: restrict dataset to timesteps >= this point in time
last_date: restrict dataset to timesteps <= this point in time

first_date and last_date default to None, which means use the first/last timestep in the dataset. Note that they must be YYYY-MM-DD strings (with optional HH:MM:SS), not datetime objects.

Note also that the time coordinate must be contiguous across the files, with no gaps or overlaps.
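One plausible reading of history_len and forecast_len is that each sample is a sliding window of consecutive timesteps, with the first history_len steps as input and the next forecast_len steps as output. The sketch below illustrates that reading only; the function `sample_windows` is hypothetical, and the actual behaviour of `__len__` and `__getitem__` is not documented here.

```python
def sample_windows(n_timesteps, history_len=2, forecast_len=1):
    """Yield (input_indexes, output_indexes) for every complete sample."""
    window = history_len + forecast_len
    # Assumed: only windows that fit entirely inside the data are samples.
    for start in range(n_timesteps - window + 1):
        split = start + history_len
        yield list(range(start, split)), list(range(split, start + window))


samples = list(sample_windows(5))
print(len(samples))   # 3
print(samples[0])     # ([0, 1], [2])
print(samples[-1])    # ([2, 3], [4])
```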

rootpath: str#

glob: str#

dim: str = '2D'#

normalize: bool = False#

zstride: int = 1#

variables: VarDict[str, List] = []#

history_len: int = 2#

forecast_len: int = 1#

first_date: str = None#

last_date: str = None#

__post_init__()#

date2tindex(datestring)#: Convert datestring (in ISO8601 YYYY-MM-DD format) to internal time index. Datestring can optionally also have an HH:MM:SS component; if absent, it defaults to 00:00:00. Returns 0 if dataset is static.
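Because the time coordinate is assumed to be uniformly spaced, a date can be turned into an index with simple arithmetic. The sketch below uses the standard library rather than cftime, so it handles only the proleptic Gregorian calendar. The function `date_to_tindex` and its `first_time` and `step` arguments are hypothetical stand-ins for whatever DataMap derives from the time coordinate's units and calendar.

```python
from datetime import datetime, timedelta


def date_to_tindex(datestring, first_time, step):
    """Convert an ISO 8601 date string to an integer time index."""
    # fromisoformat accepts "YYYY-MM-DD" (midnight) and "YYYY-MM-DD HH:MM:SS".
    when = datetime.fromisoformat(datestring)
    index, remainder = divmod(when - first_time, step)
    if remainder:
        raise ValueError(f"{datestring} does not fall on a timestep")
    return index


first_time = datetime(2000, 1, 1)   # hypothetical first timestep
step = timedelta(hours=6)           # hypothetical uniform spacing
print(date_to_tindex("2000-01-02", first_time, step))            # 4
print(date_to_tindex("2000-01-01 18:00:00", first_time, step))   # 3
```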

sindex2dates(sindex)#: Returns dates associated with sample index as a dict containing time coordinates, units, calendar, and ISO8601 dates from the cftime library. Returns None if dataset is static.

__len__()#

__getitem__(index)#

property mode: str#

read(segment, start, finish)#: Open the file and read data from start to finish for the needed variables.