credit.data404#

Attributes#

`Array`
`IMAGE_ATTR_NAMES`

Classes#

`Sample`	Simple class for structuring data for the ML model.
`CONUS404Dataset`	Each Zarr store for the CONUS-404 data contains one year of hourly data for one variable.

Functions#

`get_forward_data`(→ xarray.DataArray)	Lazily opens a Zarr store
`flatten`(array)	Flattens a list of lists.
`lazymerge`(zlist[, rename])	Merges Zarr stores opened lazily with get_forward_data().
`testC4loader`()	Tests load speed for different numbers of variables and storage locations.

Module Contents#

credit.data404.get_forward_data(filename) → xarray.DataArray#: Lazily opens a Zarr store

credit.data404.Array#

credit.data404.IMAGE_ATTR_NAMES = ('historical_ERA5_images', 'target_ERA5_images')#

class credit.data404.Sample#

Bases: TypedDict

Simple class for structuring data for the ML model.

x = input (predictor) data (i.e., C404dataset[historical mask])
y = target (predictand) data (i.e., C404dataset[forecast mask])

Using typing.TypedDict gives us several advantages:

Single ‘source of truth’ for the type and documentation of each example.
A static type checker can check the types are correct.

Instead of TypedDict, we could use typing.NamedTuple, which would provide runtime checks, but the deal-breaker with Tuples is that they’re immutable so we cannot change the values in the transforms.

x: Array#

y: Array#
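As a minimal sketch of the mutability point above, the following defines a stand-in TypedDict and a transform that updates it in place. `DemoSample` and `scale_inputs` are hypothetical names, and plain lists of floats stand in for the `Array` type so that the sketch has no dependencies; none of this is the module's actual code.

```python
from typing import List, TypedDict


class DemoSample(TypedDict):
    """Hypothetical stand-in for Sample; lists replace the Array type."""

    x: List[float]  # input (predictor) data
    y: List[float]  # target (predictand) data


def scale_inputs(sample: DemoSample, factor: float) -> DemoSample:
    """Illustrative transform: rescale the inputs by replacing the 'x' entry."""
    sample["x"] = [value * factor for value in sample["x"]]
    return sample


sample: DemoSample = {"x": [1.0, 2.0], "y": [3.0]}
scaled = scale_inputs(sample, 10.0)  # the same dict object, with 'x' updated
```

A typing.NamedTuple with the same fields would raise AttributeError on assignment to a field, so a transform could not update it in place this way.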

credit.data404.flatten(array)#: Flattens a list of lists.
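The source of `flatten` is not shown on this page. One common way to flatten a list of lists by one level is `itertools.chain.from_iterable`; `flatten_once` below is a hypothetical name for that sketch, not the module's implementation, and the variable-name strings are only illustrative.

```python
from itertools import chain
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def flatten_once(nested: Iterable[Iterable[T]]) -> List[T]:
    """Concatenate the inner iterables into one list (one level only)."""
    return list(chain.from_iterable(nested))


# Illustrative input: groups of variable names, including an empty group.
flat = flatten_once([["T2", "U10"], ["V10"], []])
```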

credit.data404.lazymerge(zlist, rename=None)#: Merges Zarr stores opened lazily with get_forward_data().

class credit.data404.CONUS404Dataset#

Bases: torch.utils.data.Dataset

Each Zarr store for the CONUS-404 data contains one year of hourly data for one variable.

When we’re sampling data, we only want to load from a single zarr store; we don’t want the samples to span zarr store boundaries. This lets us hold out whole years from anywhere in the full span for validation during training.

To do this, we segment the dataset by year. We figure out how many samples we could have, then subtract all the ones that start in one year and end in another (or end past the end of the dataset). Then we create an index of which segment each sample belongs to, and the number of that sample within the segment.

Then, for the __getitem__ method, we look up which segment the sample is in and its numbering within the segment, then open the corresponding zarr store and read only the data we want with an isel() call.
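The segmenting and lookup steps described above can be sketched in plain Python. This is a sketch under stated assumptions, not the class's actual code: `samples_per_segment` and `locate` are hypothetical helpers, a sample is assumed to span `history_len + forecast_len` consecutive time steps, and options such as `skip_periods` and `one_shot` are ignored. Counting the windows that fit inside each segment gives the same total as counting every possible start position and subtracting those whose window would cross a boundary or run past the end.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple


def samples_per_segment(segment_lengths: List[int], sample_len: int) -> List[int]:
    """Count the windows of sample_len steps that fit entirely inside each segment."""
    return [max(0, n - sample_len + 1) for n in segment_lengths]


def locate(index: int, counts: List[int]) -> Tuple[int, int]:
    """Map a global sample index to (segment number, sample number within segment)."""
    ends = list(accumulate(counts))  # running total of samples through each segment
    total = ends[-1] if ends else 0
    if not 0 <= index < total:
        raise IndexError(index)
    segment = bisect_right(ends, index)  # first segment whose running total exceeds index
    first_in_segment = ends[segment] - counts[segment]
    return segment, index - first_in_segment


# Toy example: three "years" of 5, 4, and 6 time steps; history_len=2, forecast_len=1.
lengths = [5, 4, 6]
sample_len = 2 + 1
counts = samples_per_segment(lengths, sample_len)
total = sum(counts)
```

The pair returned by `locate` is what a `__getitem__` along these lines would need: the segment number selects the Zarr store to open, and the within-segment number gives the start of the window to read with isel().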

For multiple variables, we necessarily end up reading from multiple stores, but they’re merged into a single xarray Dataset, so hopefully that won’t cause a big performance hit.

zarrpath: str = '/glade/campaign/ral/risc/DATA/conus404/zarr'#

varnames: List[str] = []#

history_len: int = 2#

forecast_len: int = 1#

transform: Callable | None = None#

seed: int = 22#

skip_periods: int = None#

one_shot: bool = False#

start: str = None#

finish: str = None#

__post_init__()#

__len__()#

__getitem__(index)#

get_data(index, do_transform=True)#: Gets an element by index (as __getitem__ does), but with an optional argument to skip applying the normalization transform.
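One way such a pair of methods can be arranged is for `__getitem__` to delegate to `get_data` with the default flag. Whether CONUS404Dataset does exactly this is an assumption; `ToyDataset`, `Record`, and `halve` below are hypothetical names used only to illustrate the pattern.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, List[float]]


class ToyDataset:
    """Hypothetical dataset illustrating an optional transform on retrieval."""

    def __init__(
        self,
        records: List[Record],
        transform: Optional[Callable[[Record], Record]] = None,
    ) -> None:
        self.records = records
        self.transform = transform

    def __len__(self) -> int:
        return len(self.records)

    def get_data(self, index: int, do_transform: bool = True) -> Record:
        # Copy so the transform never modifies the stored record.
        sample = {key: list(values) for key, values in self.records[index].items()}
        if do_transform and self.transform is not None:
            sample = self.transform(sample)
        return sample

    def __getitem__(self, index: int) -> Record:
        # Indexing applies the transform whenever one is set.
        return self.get_data(index)


def halve(sample: Record) -> Record:
    """Illustrative 'normalization': divide every value by 2."""
    return {key: [v / 2 for v in values] for key, values in sample.items()}


ds = ToyDataset([{"x": [2.0, 4.0], "y": [6.0]}], transform=halve)
normalized = ds[0]                        # transform applied
raw = ds.get_data(0, do_transform=False)  # transform skipped
```

Skipping the transform is useful for inspecting or plotting values in their original units.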

credit.data404.testC4loader()#: Tests load speed for different numbers of variables and storage locations. A full load for C404 takes about 4 sec on campaign and 5 sec on scratch.