credit.datasets.downscaling_dataset#

Attributes#

Classes#

Sample

Simple class for structuring data for the ML model.

DownscalingDataset

Class that wrangles a collection of DataMaps and their associated DataTransforms for use in ML.

Module Contents#

credit.datasets.downscaling_dataset.Array#
class credit.datasets.downscaling_dataset.Sample#

Bases: TypedDict

Simple class for structuring data for the ML model.

x = predictor (input) data

y = predictand (target) data

Using typing.TypedDict gives us several advantages:
  1. Single ‘source of truth’ for the type and documentation of each example.

  2. A static type checker can check the types are correct.

Instead of TypedDict, we could use typing.NamedTuple, which would provide runtime checks, but the deal-breaker with Tuples is that they’re immutable so we cannot change the values in the transforms.

input: Array#
target: Array#
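
The mutability argument above can be illustrated with a minimal sketch. It assumes a plain list as a stand-in for the module's Array alias; FrozenSample and scale_input are hypothetical names introduced only for this comparison and are not part of the module.

```python
from typing import List, NamedTuple, TypedDict

# Stand-in for the module's Array alias (assumption: any array-like type).
Array = List[float]


class Sample(TypedDict):
    """Dict-based sample: values can be reassigned by a transform."""

    input: Array
    target: Array


class FrozenSample(NamedTuple):
    """Hypothetical tuple-based alternative, shown only for contrast."""

    input: Array
    target: Array


def scale_input(sample: Sample, factor: float) -> Sample:
    """Hypothetical transform that replaces the input values in place."""
    sample["input"] = [v * factor for v in sample["input"]]
    return sample


sample: Sample = {"input": [1.0, 2.0], "target": [3.0]}
scale_input(sample, 10.0)  # allowed: a TypedDict is an ordinary dict at runtime

frozen = FrozenSample(input=[1.0, 2.0], target=[3.0])
try:
    frozen.input = [0.0]  # NamedTuple fields cannot be reassigned
    reassigned = True
except AttributeError:
    reassigned = False
```

A transform can therefore overwrite sample["input"] directly, whereas the tuple-based variant would have to build a new object on every change.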
class credit.datasets.downscaling_dataset.DownscalingDataset#

Bases: torch.utils.data.Dataset

Class that wrangles a collection of DataMaps and their associated DataTransforms for use in ML. Intended to be initialized with **conf[‘data’]; see config/downscaling.yml for an example.

Parameters:
  • rootpath (str) – pathway to the files

  • history_len (int) – number of input timesteps for training

  • forecast_len (int) – number of output timesteps for training

  • valid_history_len (int) – number of input timesteps for validation

  • valid_forecast_len (int) – number of output timesteps for validation. All *_history_len and *_forecast_len arguments default to 1.

  • first_date (YYYY-MM-DD string) – starting point of dataset

  • last_date (YYYY-MM-DD string) – ending point of dataset. Use first_date and last_date to define a subset of the dataset, e.g., to use 1980-2000 for training and 2001-2010 for testing. They default to the first and last timesteps in the files, respectively.

  • datasets (nested dict) – dicts of parameters for initializing DataMaps and their corresponding DataTransforms via **kwargs. These dicts are replaced with actual objects during initialization.

  • image_width (int) – width of dataset in gridcells

  • image_height (int) – height of dataset in gridcells. image_width and image_height are used to automatically resize the constituent datasets to the same size via expansion and padding. E.g., if the declared image size is 120x120, a 55x55 dataset would be expanded by a factor of 2 (to 110x110) and padded by 10 to match. Defaults to the width (height) of the widest (tallest) dataset.

  • get_time_from (str) – name of the dataset to pull time coordinates from when creating a sample; used when writing output to netcdf. Defaults to the first non-static dataset that has boundary variables.

  • transform (bool) – whether to apply normalizing transforms to samples. Defaults to True.

sample_len#

number of timesteps in a sample (sample_len = history_len + forecast_len)

len#

number of samples in the dataset, which will be less than the number of timesteps in the netcdf files, due to first_date/last_date subsetting and sample_len > 1.

mode#

determines which variables are returned when sampling. Valid values are ‘train’, ‘test’, and ‘predict’. Defaults to ‘train’.

  • train: boundary and prognostic variables for input timesteps; prognostic and diagnostic variables for target timesteps

  • test: boundary variables for input timesteps; prognostic and diagnostic variables for target timesteps

  • predict: boundary variables for input timesteps; nothing for target timesteps

output#

determines the format of samples; see __getitem__ for details. Valid values: ‘by_dset’, ‘by_io’, ‘tensor’. Defaults to ‘tensor’.

arrangement#

a pandas dataframe that defines how variables are ordered in the tensor. Columns: dataset, dim, var, usage, name

tnames#

a list of names (structured dataset.var[.z-level]) corresponding to the channels in an output tensor (i.e., prognostic and diagnostic variables only).

rootpath: str#
history_len: int = 1#
forecast_len: int = 1#
valid_history_len: int = 1#
valid_forecast_len: int = 1#
first_date: str = None#
last_date: str = None#
datasets: Dict#
image_width: int = None#
image_height: int = None#
transform: bool = True#
get_time_from: str = None#
_mode: str = 'train'#
_output: str = 'tensor'#
__post_init__()#
_setup_datasets()#

Replace the datasets argument dict (a nested dict of configurations for the constituent datasets & their associated transforms) with DataMap and DataTransform objects created from those configurations.

Datasets are allowed to be different sizes. When sampling, we automatically resize them to (image_width x image_height) by adding Expand and/or Pad transforms to the sequence defined for each dataset. To do this, we need to construct the self.datasets dict in two passes, because we need the shape of each dataset in order to figure out those resizing transforms.
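
The resizing arithmetic can be sketched as follows. This is an illustration, not the module's code: resize_plan is a hypothetical helper, and it assumes an integer expansion factor followed by padding for the remainder. How the padding is split between the edges is not stated here, so the sketch reports only the total.

```python
from typing import Tuple


def resize_plan(size: int, target: int) -> Tuple[int, int]:
    """Return (expansion factor, total padding) to grow `size` cells to `target`.

    Assumption: the largest integer factor that still fits is used, and
    padding fills whatever is left over.
    """
    if size <= 0 or target < size:
        raise ValueError("target must be at least as large as a positive size")
    factor = target // size
    padding = target - size * factor
    return factor, padding


# The 55 -> 120 case from the parameter description: 55 * 2 = 110, plus 10 of padding.
plan_small = resize_plan(55, 120)
# A dataset that already matches the declared size needs no change.
plan_same = resize_plan(120, 120)
```

The same calculation would be applied independently to width and height.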

_setup_arrangement()#

Constructs a pandas DataFrame that defines the ordering used by rearrange() to go from the nested dict arrangement returned by getdata() to the arrangement used as input by to_tensor(). See __getitem__() for more details.

Also creates self.tnames, a list of names for the channels in the output tensor, formatted “<dataset>.<variable>[.z<level>]”.

__len__()#

Number of samples in the dataset. Note that this is smaller than the number of timesteps in the files, both because first_date and last_date may not cover the full range, and because each sample spans history_len + forecast_len timesteps and __getitem__ only returns complete samples.
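
As a rough sketch of why the sample count falls short of the timestep count, the function below counts complete windows. count_samples is a hypothetical helper, and it assumes consecutive samples start one timestep apart, which this page does not state.

```python
def count_samples(n_timesteps: int, history_len: int = 1, forecast_len: int = 1) -> int:
    """Count the complete samples that fit in `n_timesteps`.

    Assumption: a sliding window of length history_len + forecast_len
    that advances one timestep per sample.
    """
    sample_len = history_len + forecast_len
    return max(n_timesteps - sample_len + 1, 0)


# With the default lengths (sample_len = 2), ten timesteps give nine samples.
default_count = count_samples(10)
# A longer history leaves fewer complete windows.
longer_count = count_samples(10, history_len=3, forecast_len=2)
```

Here n_timesteps stands for the number of timesteps left after first_date/last_date subsetting.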

__getitem__(index)#

Gets the index’th sample from the dataset. The value of the ‘output’ attribute controls the format of the returned object.

output == ‘by_dset’ returns a nested dict structured [dataset][usage][variable] (the format returned by getdata()); the leaf elements are numpy ndarrays covering the full time period (history_len+forecast_len).

output == ‘by_io’ splits the ndarrays into history and forecast periods and reorganizes the nested dict to [input/target][dataset.variable] (the format returned by rearrange() and taken as input by to_tensor()). The variables are ordered:

  • first: boundary > prognostic > diagnostic (in-only > in-out > out-only)

  • then: static > 2D > 3D

  • then: order that datasets are defined in config

  • then: alphabetical by variable name

output == ‘tensor’ stacks the ndarrays in the z-dimension and converts them to a pair of PyTorch tensors (input and target), returning them as a Sample. It also includes the associated time coordinates in the sample.
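
The four-level variable ordering used for ‘by_io’ output can be sketched as a sort with a tuple key. The snippet works on a list of plain dicts rather than the pandas dataframe held in arrangement, and the dataset names, variable names, and the labels used for the dim and usage values are all assumptions made for illustration.

```python
from typing import Dict, List

# Assumed labels for the usage and dim columns; the real values may differ.
USAGE_RANK = {"boundary": 0, "prognostic": 1, "diagnostic": 2}
DIM_RANK = {"static": 0, "2D": 1, "3D": 2}


def order_variables(rows: List[Dict[str, str]], dataset_order: List[str]) -> List[Dict[str, str]]:
    """Sort rows by usage, then dimensionality, then config order, then name."""
    dataset_rank = {name: i for i, name in enumerate(dataset_order)}
    return sorted(
        rows,
        key=lambda r: (
            USAGE_RANK[r["usage"]],
            DIM_RANK[r["dim"]],
            dataset_rank[r["dataset"]],
            r["var"],
        ),
    )


# Hypothetical variables from two hypothetical datasets, in arbitrary order.
rows = [
    {"dataset": "fine", "dim": "2D", "var": "tas", "usage": "diagnostic"},
    {"dataset": "coarse", "dim": "3D", "var": "ua", "usage": "boundary"},
    {"dataset": "coarse", "dim": "2D", "var": "t2m", "usage": "boundary"},
    {"dataset": "fine", "dim": "static", "var": "elev", "usage": "boundary"},
    {"dataset": "fine", "dim": "2D", "var": "pr", "usage": "prognostic"},
    {"dataset": "coarse", "dim": "2D", "var": "psl", "usage": "boundary"},
]

ordered = order_variables(rows, dataset_order=["coarse", "fine"])
names = [f"{r['dataset']}.{r['var']}" for r in ordered]
```

With a dataframe, the equivalent would be a sort on the same four keys after mapping usage, dim, and dataset to their ranks.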

getdata(dset, index)#

Gets data for the index’th sample from dataset dset. Returns a nested dict organized [dataset][usage][variable].

rearrange(items)#

Rearranges a nested dict of data from [dataset][usage][var] to [input/target][dataset.var]. Elements returned depend on mode:

  • train: input contains boundary and prognostic variables, target contains prognostic and diagnostic variables.

  • init: input contains boundary and prognostic variables, target is empty.

  • infer: input contains boundary variables, target is empty.

to_tensor(sample)#

Takes a nested dict organized [input/target][vars], with data arrays (numpy ndarrays) dimensioned [T, Z, Y, X].

Stacks variables along the z-dimension (i.e., so different z-levels of a 3D variable are treated as different variables).

Combines variable stacks into tensors ordered [V, T, Y, X] and returns a Sample where x is historical / input data and y is forecast / target data.
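
The stacking step can be sketched with NumPy alone. stack_variables and the variable names are hypothetical, and the sketch assumes 2D variables carry a singleton z-dimension; the conversion to PyTorch tensors and the split into input and target are left out.

```python
from typing import Dict

import numpy as np


def stack_variables(variables: Dict[str, np.ndarray]) -> np.ndarray:
    """Stack [T, Z, Y, X] arrays along z, then reorder to [V, T, Y, X]."""
    # Concatenating on axis 1 turns every z-level into its own channel: [T, V, Y, X].
    stacked = np.concatenate(list(variables.values()), axis=1)
    # Move the channel axis to the front: [V, T, Y, X].
    return np.transpose(stacked, (1, 0, 2, 3))


T, Y, X = 2, 4, 5
variables = {
    "dsetA.t2m": np.zeros((T, 1, Y, X)),  # 2D variable: one z-level (assumed)
    "dsetA.ua": np.ones((T, 3, Y, X)),  # 3D variable: three z-levels
}

channels = stack_variables(variables)
```

The result has four channels, one for t2m and one for each z-level of ua, which matches the per-level naming described for tnames.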

revert(prediction)#

Converts a tensor (ML model output) back into a nested dict of numpy arrays (i.e., reverses the getdata -> rearrange -> to_tensor pipeline).

property mode: str#
property output: str#