credit.datasets.era5#

Attributes#

Classes#

ERA5Dataset

PyTorch Dataset for processed ERA5 data. Relies on a configuration dictionary to define the variables, the datetime range and frequency, and the data path.

Module Contents#

credit.datasets.era5.logger#
credit.datasets.era5.VALID_FIELD_TYPES#
class credit.datasets.era5.ERA5Dataset(config, return_target=False)#

Bases: torch.utils.data.Dataset

PyTorch Dataset for processed ERA5 data. Relies on a configuration dictionary to define:
  1. 2D / 3D variables

  2. Start, End and Frequency of Datetimes

  3. Path (glob pattern) to the data files

Example YAML configuration#

data:
  source:
    ERA5:
      prognostic:
        vars_3D: ['T', 'U', 'V', 'Q']
        vars_2D: ['T500', 'U500', 'V500', 'Q500' ,'Z500', 'tsi', 't2m','SP']
        path: "<path to prognostic>"
      diagnostic:
        vars_3D: ['T', 'U', 'V', 'Q']
        vars_2D: ['T500', 'U500', 'V500', 'Q500' ,'Z500', 'tsi', 't2m','SP']
        path: "<path to diagnostic>"
      static:
        vars_3D: ['T', 'U', 'V', 'Q']
        vars_2D: ['T500', 'U500', 'V500', 'Q500' ,'Z500', 'tsi', 't2m','SP']
        path: "<path to static>"
      dynamic_forcing:
        vars_3D: ['T', 'U', 'V', 'Q']
        vars_2D: ['T500', 'U500', 'V500', 'Q500' ,'Z500', 'tsi', 't2m','SP']
        path: "<path to dynamic forcing>"

start_datetime: "2017-01-01"
end_datetime: "2019-12-31"
timestep: "6h"
Assumptions:
  1. The data MUST be stored in yearly zarr or netCDF files with a unique 4-digit year (YYYY) in the file name

  2. A “time” dimension / coordinate is present

  3. A “level” dimension represents the vertical level

  4. Dimension order of (‘time’, ‘level’, ‘latitude’, ‘longitude’) for 3D vars (remove ‘level’ for 2D)

  5. Data should be chunked efficiently for a fast read (recommend small chunks across time dimension).
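As a rough illustration of how the `start_datetime`, `end_datetime` and `timestep` settings from the example configuration might expand into a list of time steps and the set of years they span, here is a minimal standard-library sketch. The helper names `parse_timestep` and `build_datetimes` are hypothetical, only hour-based steps such as "6h" are handled, and the class itself may well rely on pandas instead.

```python
import re
from datetime import datetime, timedelta


def parse_timestep(timestep: str) -> timedelta:
    """Parse an hour-based step such as "6h" (assumed format) into a timedelta."""
    match = re.fullmatch(r"(\d+)h", timestep.strip().lower())
    if match is None:
        raise ValueError(f"unsupported timestep: {timestep!r}")
    return timedelta(hours=int(match.group(1)))


def build_datetimes(start: str, end: str, timestep: str) -> list[datetime]:
    """Return every step from start to end (inclusive) at the given spacing."""
    step = parse_timestep(timestep)
    current = datetime.fromisoformat(start)
    stop = datetime.fromisoformat(end)
    datetimes = []
    while current <= stop:
        datetimes.append(current)
        current += step
    return datetimes


# Values taken from the example YAML configuration above
datetimes = build_datetimes("2017-01-01", "2019-12-31", "6h")
# Years covered by the requested range; one yearly file is expected per year
years = sorted({dt.year for dt in datetimes})
```

Because the end date is parsed as midnight, the last generated step in this sketch is 2019-12-31 00:00.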

source_name = 'ERA5'#
return_target = False#
dt#
num_forecast_steps#
start_datetime#
end_datetime#
datetimes#
years#
file_dict#
var_dict#
_timestamps()#

Return the total number of time steps.

__len__()#
_map_files(file_list)#

Create a dictionary to look up the file for a timestep

Parameters:

file_list (list) – List of file paths
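One plausible shape for such a lookup, given the assumption that each file name contains a unique 4-digit year, is a dictionary keyed by year. The sketch below is not the library's implementation; the function name `map_files_by_year` and the example paths are hypothetical.

```python
import re
from pathlib import Path


def map_files_by_year(file_list):
    """Map the 4-digit year found in each file name to that file's path."""
    # Match exactly four digits that are not part of a longer digit run
    year_pattern = re.compile(r"(?<!\d)(\d{4})(?!\d)")
    file_dict = {}
    for path in file_list:
        matches = year_pattern.findall(Path(path).name)
        if len(matches) != 1:
            raise ValueError(f"expected exactly one 4-digit year in {path!r}")
        year = int(matches[0])
        if year in file_dict:
            raise ValueError(f"duplicate file for year {year}: {path!r}")
        file_dict[year] = path
    return file_dict


# Hypothetical paths, for illustration only
files = ["/data/era5_prognostic_2017.zarr", "/data/era5_prognostic_2018.zarr"]
file_dict = map_files_by_year(files)
```

With a mapping like this, the file for a given timestamp can be found from the timestamp's year alone.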

__getitem__(args)#

Returns a sample of data.

Parameters:

args (tuple) – Input time step and step index, both supplied by the sampler

_open_ds_extract_fields(field_type, t, return_data, is_target=False)#

Opens the dataset, reshapes and concatenates the variables into a NumPy array, and packs it into the return dict if the data exists.

Parameters:
  • field_type (str) – Field type (“prognostic”, “diagnostic”, etc)

  • t (pd.Timestamp) – Current timestamp

  • return_data (dict) – Dictionary of data to return

  • is_target (bool) – Whether the data is target (y) data rather than input (x) data

_reshape_and_concat(ds_3D, ds_2D)#

Stack 3D variables along level and variable, concatenate with 2D variables, and reorder dimensions.

Parameters:
  • ds_3D (xr.Dataset) – Xarray dataset with 3D spatial variables

  • ds_2D (xr.Dataset) – Xarray dataset with 2D spatial variables
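The stacking step can be sketched with plain NumPy arrays standing in for the xarray datasets. This is an illustration under stated assumptions, not the method's actual code: the function name, the dictionary inputs, and the final (channel, time, latitude, longitude) layout are all assumed here.

```python
import numpy as np


def reshape_and_concat(vars_3d, vars_2d):
    """Combine per-variable arrays into one (channel, time, lat, lon) array.

    vars_3d: dict of name -> array shaped (time, level, lat, lon)
    vars_2d: dict of name -> array shaped (time, lat, lon)
    """
    # Stack 3D variables on a new leading axis: (variable, time, level, lat, lon)
    stacked = np.stack(list(vars_3d.values()), axis=0)
    n_var, n_time, n_level, n_lat, n_lon = stacked.shape
    # Put time first, then merge variable and level into a single channel axis
    stacked = stacked.transpose(1, 0, 2, 3, 4)
    stacked = stacked.reshape(n_time, n_var * n_level, n_lat, n_lon)
    # Each 2D variable contributes one channel: (time, variable, lat, lon)
    surface = np.stack(list(vars_2d.values()), axis=1)
    combined = np.concatenate([stacked, surface], axis=1)
    # Reorder to (channel, time, lat, lon); this final layout is an assumption
    return combined.transpose(1, 0, 2, 3)


rng = np.random.default_rng(0)
vars_3d = {name: rng.random((2, 3, 4, 5)) for name in ("T", "U")}
vars_2d = {name: rng.random((2, 4, 5)) for name in ("t2m", "SP")}
combined = reshape_and_concat(vars_3d, vars_2d)
```

In this layout the channel index of a 3D variable is `variable_index * n_level + level_index`, and the 2D variables follow after all 3D channels.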

_add_metadata(return_data, t, t_target=None)#

Update metadata dictionary

Parameters:
  • return_data (dict) – Return dictionary

  • t (int) – Time step

  • t_target – Target time step or None

_convert_cf_time(ts)#

Convert pandas timestamp to cftime

Parameters:

ts – pandas timestamp

_pop_and_merge_targets(return_data, dim=0)#

Look for target diagnostic and prognostic variables. If both exist, concatenate them along specified dimension.

Parameters:
  • return_data – Dictionary of current data to return

  • dim – Concat dimension
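A minimal sketch of this merge, using NumPy arrays, might look as follows. The dictionary keys `target_prognostic`, `target_diagnostic` and `target` are hypothetical placeholders, and the fallback when only one of the two exists is an assumption; the source only describes the case where both are present.

```python
import numpy as np


def pop_and_merge_targets(return_data, dim=0):
    """Remove the separate target entries and store a single merged target."""
    # Key names are hypothetical; the real dictionary may use different ones
    prognostic = return_data.pop("target_prognostic", None)
    diagnostic = return_data.pop("target_diagnostic", None)
    if prognostic is not None and diagnostic is not None:
        # Both exist: concatenate along the requested dimension
        return_data["target"] = np.concatenate([prognostic, diagnostic], axis=dim)
    elif prognostic is not None:
        # Assumed fallback: keep whichever target is available
        return_data["target"] = prognostic
    elif diagnostic is not None:
        return_data["target"] = diagnostic
    return return_data


sample = {
    "target_prognostic": np.zeros((8, 1, 4, 5)),
    "target_diagnostic": np.ones((2, 1, 4, 5)),
}
merged = pop_and_merge_targets(sample, dim=0)
```

Concatenating along dimension 0 matches a layout in which the leading axis indexes variables or channels.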