credit.datasets.era5
Refactored ERA5Dataset with nested input/target structure.
Sample structure returned by __getitem__:
{
    "input": {
        "era5/prognostic/3d/T": tensor,   # (n_levels, 1, lat, lon)
        "era5/prognostic/2d/SP": tensor,  # (1, 1, lat, lon)
        "era5/dynamic_forcing/2d/tsi": tensor,
        "era5/static/2d/LSM": tensor,
        ...
    },
    "target": {  # only when return_target=True
        "era5/prognostic/3d/T": tensor,
        "era5/prognostic/2d/SP": tensor,
        ...
    },
    "metadata": {
        "input_datetime": int,   # nanoseconds since epoch
        "target_datetime": int,  # only when return_target=True
    },
}
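The metadata timestamps are plain integers, so they usually need converting before they can be logged or compared. The following is a minimal sketch using only the standard library. It assumes the epoch is the Unix epoch in UTC; the helper name `ns_to_datetime` and the sample value are illustrative and not part of `credit`.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "epoch" means the Unix epoch, 1970-01-01T00:00:00Z.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ns_to_datetime(ns: int) -> datetime:
    """Convert integer nanoseconds since the Unix epoch to a UTC datetime.

    datetime only stores microseconds, so sub-microsecond digits are dropped.
    """
    return EPOCH + timedelta(microseconds=ns // 1_000)


# Hypothetical metadata value corresponding to 2017-01-01 06:00 UTC.
example_ns = 1_483_250_400 * 1_000_000_000
print(ns_to_datetime(example_ns).isoformat())
```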
- Output key format (flat, slash-delimited):
"{source}/{field_type}/{dim}/{varname}"
source : "era5" (ERA5Dataset) | "arco_era5" (ARCOERA5Dataset)
field_type : "prognostic" | "dynamic_forcing" | "static" | "diagnostic"
dim : "2d" (surface / single-level) | "3d" (multi-level upper-air)
varname : variable name as given in config (e.g. "T", "SP", "tsi")
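Because the keys are flat strings, downstream code often needs to build them or split them back into their four parts. The sketch below shows one way to do that. The names `SampleKey`, `make_key`, `parse_key`, `FIELD_TYPES` and `DIMS` are hypothetical helpers written for this example; they are not functions exported by `credit.datasets.era5`.

```python
from typing import NamedTuple

# Allowed values as listed in the key format description (illustrative constants).
FIELD_TYPES = ("prognostic", "dynamic_forcing", "static", "diagnostic")
DIMS = ("2d", "3d")


class SampleKey(NamedTuple):
    source: str
    field_type: str
    dim: str
    varname: str


def make_key(source: str, field_type: str, dim: str, varname: str) -> str:
    """Join the four components into a slash-delimited sample key."""
    return "/".join((source, field_type, dim, varname))


def parse_key(key: str) -> SampleKey:
    """Split a sample key into its components and check field_type and dim."""
    parts = key.split("/", 3)  # at most 4 pieces; varname keeps any extra slashes
    if len(parts) != 4:
        raise ValueError(f"expected 4 slash-delimited parts, got {key!r}")
    parsed = SampleKey(*parts)
    if parsed.field_type not in FIELD_TYPES:
        raise ValueError(f"unknown field type in key: {parsed.field_type!r}")
    if parsed.dim not in DIMS:
        raise ValueError(f"unknown dim in key: {parsed.dim!r}")
    return parsed


print(make_key("era5", "prognostic", "3d", "T"))
print(parse_key("era5/static/2d/LSM"))
```

Filtering a sample by field type then reduces to checking `parse_key(k).field_type` for each key.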
- Tensor shapes (no batch dimension):
3D variable : (n_levels, 1, lat, lon), where n_levels = len(config levels)
2D variable : (1, 1, lat, lon), with a singleton level dim
- After DataLoader collation the batch dimension is prepended:
(batch, n_levels, 1, lat, lon)
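The shape rules above can be written down as plain tuple arithmetic, which is handy for sanity checks in tests. This is only a sketch: the function names and the 721 x 1440 grid size are assumptions chosen for illustration, and no tensors are created.

```python
def sample_shape(dim: str, n_levels: int, n_lat: int, n_lon: int) -> tuple:
    """Per-variable shape without a batch dimension.

    3D variables keep n_levels on the first axis; 2D variables use a
    singleton level axis instead.
    """
    if dim not in ("2d", "3d"):
        raise ValueError(f"dim must be '2d' or '3d', got {dim!r}")
    levels = n_levels if dim == "3d" else 1
    return (levels, 1, n_lat, n_lon)


def batched_shape(batch_size: int, shape: tuple) -> tuple:
    """Shape after collation: the batch dimension is prepended."""
    return (batch_size, *shape)


# Illustrative numbers: 16 configured levels on a hypothetical 721 x 1440 grid.
shape_3d = sample_shape("3d", n_levels=16, n_lat=721, n_lon=1440)
shape_2d = sample_shape("2d", n_levels=16, n_lat=721, n_lon=1440)
print(shape_3d, batched_shape(4, shape_3d))
print(shape_2d, batched_shape(4, shape_2d))
```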
- File naming:
Each field type supports an optional filename_time_format config key that specifies a strftime format string describing how the datetime appears in the file name. Defaults to "%Y" (annual files).
Examples:
filename_time_format: "%Y" # era5_2021.zarr
filename_time_format: "%Y_%m" # era5_2021_06.nc
filename_time_format: "%Y%m%d" # era5_20210601.nc
If only a single file matches the glob pattern, filename_time_format is ignored and that file is used for all timestamps.
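To see what each format string produces, the standard `datetime.strftime` and `datetime.strptime` calls are enough. The sketch below reproduces the time tokens from the examples above. How the dataset actually locates the token inside a file name is not described here, so the helper names are hypothetical and only the token itself is handled.

```python
from datetime import datetime


def filename_time_token(t: datetime, filename_time_format: str = "%Y") -> str:
    """Render the part of a file name that encodes the time of t."""
    return t.strftime(filename_time_format)


def token_start_time(token: str, filename_time_format: str = "%Y") -> datetime:
    """Parse a time token back into the first instant it can represent."""
    return datetime.strptime(token, filename_time_format)


t = datetime(2021, 6, 1)
for fmt in ("%Y", "%Y_%m", "%Y%m%d"):
    token = filename_time_token(t, fmt)
    print(f"{fmt!r:10} -> {token:10} starts at {token_start_time(token, fmt)}")
```

Fields missing from the format fall back to their defaults when parsed, so a "%Y" token maps to 1 January of that year.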
Attributes
logger
VALID_FIELD_TYPES
Classes
ERA5Dataset | PyTorch Dataset for processed ERA5 data with nested input/target structure.
ARCOERA5Dataset | PyTorch Dataset for Google Cloud ARCO ERA5 data with nested input/target structure.
Module Contents
- credit.datasets.era5.logger
- credit.datasets.era5.VALID_FIELD_TYPES
- class credit.datasets.era5.ERA5Dataset(config: dict, return_target: bool = False)
Bases: torch.utils.data.Dataset
PyTorch Dataset for processed ERA5 data with nested input/target structure.
See module docstring for full description of output format and file naming.
Example YAML configuration:
data:
  source:
    ERA5:
      level_coord: "level"
      levels: [10, 30, 40, 50, 60, 70, 80, 90, 95, 100, 105, 110, 120, 130, 136, 137]
      variables:
        prognostic:
          vars_3D: ['T', 'U', 'V', 'Q']
          vars_2D: ['SP', 't2m']
          path: "/data/era5_*.zarr"
          filename_time_format: "%Y"  # annual (default)
        dynamic_forcing:
          vars_2D: ['tsi']
          path: "/data/solar_*.nc"
          filename_time_format: "%Y_%m"  # monthly
        static:
          vars_2D: ['Z_GDS4_SFC', 'LSM']
          path: "/data/lsm.nc"  # single file; filename_time_format not needed
        diagnostic: null
  start_datetime: "2017-01-01"
  end_datetime: "2019-12-31"
  timestep: "6h"
  forecast_len: 1
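As a quick way to check a configuration against the documented key format, the `variables` block can be walked to list the sample keys it should produce. This is a sketch under the assumption that each listed variable yields exactly one key; `expected_keys` is a hypothetical helper, and the dict literal simply mirrors the `variables` block of the example configuration.

```python
def expected_keys(variables: dict, source: str = "era5") -> list:
    """List the sample keys implied by a `variables` config block."""
    keys = []
    for field_type, cfg in variables.items():
        if cfg is None:  # e.g. `diagnostic: null` disables the field type
            continue
        for dim, cfg_key in (("3d", "vars_3D"), ("2d", "vars_2D")):
            for name in cfg.get(cfg_key) or []:
                keys.append(f"{source}/{field_type}/{dim}/{name}")
    return keys


# Mirrors the `variables` block of the example configuration.
variables = {
    "prognostic": {"vars_3D": ["T", "U", "V", "Q"], "vars_2D": ["SP", "t2m"]},
    "dynamic_forcing": {"vars_2D": ["tsi"]},
    "static": {"vars_2D": ["Z_GDS4_SFC", "LSM"]},
    "diagnostic": None,
}

for key in expected_keys(variables):
    print(key)
```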
- Assumptions:
A "time" dimension / coordinate is present for non-static fields.
A level coordinate (name given by level_coord) represents the vertical axis of 3D variables.
Dimension order: (time, level, latitude, longitude) for 3D; (time, latitude, longitude) for 2D; (latitude, longitude) for static.
- source_name: str = 'era5'
- level_coord: str
- levels: list[int]
- return_target: bool = False
- static_metadata: dict
- dt
- num_forecast_steps: int
- start_datetime
- end_datetime
- datetimes: pandas.DatetimeIndex
- file_dict: dict[str, list[tuple[pandas.Timestamp, pandas.Timestamp, str]] | None]
- var_dict: dict[str, dict[str, list[str]]]
- __len__() -> int
- __getitem__(args: tuple) -> dict
Return a nested input/target sample dict.
- Parameters:
args – (t, i) where t is the current timestamp (nanoseconds or pd.Timestamp) and i is the within-sequence step index produced by the sampler. When i == 0, prognostic and static fields are loaded in addition to dynamic forcing.
- Returns:
Dict with keys "input", "metadata", and optionally "target" (when return_target=True). Both "input" and "target" are dicts of per-variable tensors keyed by "era5/{field_type}/{dim}/{varname}".
- _register_field(field_type: str, d: dict | None) -> None
Validate and register one field type from the config variables block.
Populates self.file_dict and self.var_dict for field_type.
- Parameters:
field_type – One of "prognostic", "dynamic_forcing", "static", "diagnostic".
d – Field-type config dict, or None / null to disable the field.
- Raises:
KeyError – If field_type is not a recognised field type.
ValueError – If d defines neither vars_3D nor vars_2D.
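The validation contract described above (KeyError for an unknown field type, ValueError when neither variable list is given, None to disable a field) can be illustrated with a standalone function. This is a sketch, not the method's actual body: `check_field_config` and `FIELD_TYPES` are hypothetical, the order of the checks is an assumption, and the file discovery that fills `file_dict` is left out.

```python
from typing import Optional

# Illustrative copy of the four documented field types.
FIELD_TYPES = ("prognostic", "dynamic_forcing", "static", "diagnostic")


def check_field_config(field_type: str, d: Optional[dict]) -> Optional[dict]:
    """Validate one field-type block and return its variable lists.

    Returns None when the field is disabled (d is None).
    """
    if field_type not in FIELD_TYPES:
        raise KeyError(f"unrecognised field type: {field_type!r}")
    if d is None:
        return None  # `field_type: null` in the config disables the field
    vars_3d = list(d.get("vars_3D") or [])
    vars_2d = list(d.get("vars_2D") or [])
    if not vars_3d and not vars_2d:
        raise ValueError(f"{field_type}: define vars_3D and/or vars_2D")
    return {"vars_3D": vars_3d, "vars_2D": vars_2d}


print(check_field_config("static", {"vars_2D": ["LSM"], "path": "/data/lsm.nc"}))
print(check_field_config("diagnostic", None))
```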
- _build_timestamps() -> pandas.DatetimeIndex
Return valid initialisation timestamps for the dataset.
- Returns:
DatetimeIndex from start_datetime to end_datetime minus the forecast horizon, at the configured timestep frequency.
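The effect of trimming the forecast horizon is easiest to see with concrete numbers. The sketch below uses only the standard library and returns a plain list rather than a DatetimeIndex. It rests on two assumptions that are not confirmed here: the forecast horizon equals forecast_len * timestep, and the last valid initialisation time is inclusive. The function name `build_timestamps` is illustrative.

```python
from datetime import datetime, timedelta


def build_timestamps(start: datetime, end: datetime,
                     timestep: timedelta, forecast_len: int) -> list:
    """List initialisation times whose forecast still fits before `end`.

    Assumption: horizon = forecast_len * timestep, end point inclusive.
    """
    last_init = end - forecast_len * timestep
    stamps = []
    t = start
    while t <= last_init:
        stamps.append(t)
        t += timestep
    return stamps


stamps = build_timestamps(
    start=datetime(2017, 1, 1),
    end=datetime(2017, 1, 2),
    timestep=timedelta(hours=6),
    forecast_len=1,
)
for t in stamps:
    print(t.isoformat())
```

With these inputs the final initialisation time is 18:00 on 1 January, so its one-step target lands exactly on the end time.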
- _extract_field(field_type: str, t: pandas.Timestamp, sample: dict) -> None
Open the dataset for field_type at time t and populate sample.
Keys written are "era5/{field_type}/3d/{varname}" for 3D variables and "era5/{field_type}/2d/{varname}" for 2D variables.
- Parameters:
field_type – One of "prognostic", "dynamic_forcing", "static", "diagnostic".
t – Timestamp to select.
sample – Dict to write variable tensors into (modified in place). Tensor shapes (no batch dimension):
3D variable: (n_levels, 1, lat, lon)
2D variable: (1, 1, lat, lon)
- static _to_cftime(ts: pandas.Timestamp, calendar: str) -> cftime.datetime
Convert a pandas Timestamp to a cftime.datetime.
- Parameters:
ts – Pandas Timestamp to convert.
calendar – cftime calendar string read from the dataset (e.g. "noleap", "gregorian", "proleptic_gregorian").
- Returns:
cftime.datetime with the specified calendar.
- class credit.datasets.era5.ARCOERA5Dataset(data_config: dict, return_target: bool = False)
Bases: torch.utils.data.Dataset
PyTorch Dataset for Google Cloud ARCO ERA5 data with nested input/target structure.
See module docstring for full description of output format and file naming.
Example YAML configuration:
data:
  source:
    ARCO_ERA5:
      level_coord: "hybrid"
      levels: [10, 30, 40, 50, 60, 70, 80, 90, 95, 100, 105, 110, 120, 130, 136, 137]
      variables:
        prognostic:
          vars_3D: ["temperature", "u_component_of_wind", "v_component_of_wind", "specific_humidity"]
          vars_2D: ["surface_pressure"]
        dynamic_forcing:
          vars_2D: ["toa_incident_solar_radiation"]
        static:
          vars_2D: ["land_sea_mask"]
        diagnostic:
          vars_2D: ["total_precipitation"]
  start_datetime: "2017-01-01"
  end_datetime: "2019-12-31"
  timestep: "6h"
  forecast_len: 1
- Assumptions:
A "time" dimension / coordinate is present for non-static fields.
A level coordinate (name given by level_coord) represents the vertical axis of 3D variables.
Dimension order: (time, level, latitude, longitude) for 3D; (time, latitude, longitude) for 2D; (latitude, longitude) for static.
- pressure_lev_era5_path = 'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3'
- model_lev_era5_path = 'gs://gcp-public-data-arco-era5/ar/model-level-1h-0p25deg.zarr-v1'
- model_lev_vars = ['divergence', 'fraction_of_cloud_cover', 'geopotential', 'ozone_mass_mixing_ratio',...
- source_name: str = 'arco_era5'
- level_coord: str
- return_target: bool = False
- static_metadata: dict
- dt
- num_forecast_steps: int
- start_datetime
- end_datetime
- datetimes: pandas.DatetimeIndex
- var_dict: dict[str, dict[str, list[str]]]
- fs = None
- mod_level_store = None
- pres_level_store = None
- __len__() -> int
- __getitem__(args: tuple) -> dict
Return a nested input/target sample dict.
- Parameters:
args – (t, i) where t is the current timestamp (nanoseconds or pd.Timestamp) and i is the within-sequence step index produced by the sampler. When i == 0, prognostic and static fields are loaded in addition to dynamic forcing.
- Returns:
Dict with keys "input", "metadata", and optionally "target" (when return_target=True). Both "input" and "target" are dicts of per-variable tensors keyed by "arco_era5/{field_type}/{dim}/{varname}".
- _init_fs()
- _register_field(field_type: str, d: dict | None) -> None
Validate and register one field type from the config variables block.
Populates self.file_dict and self.var_dict for field_type.
- Parameters:
field_type – One of "prognostic", "dynamic_forcing", "static", "diagnostic".
d – Field-type config dict, or None / null to disable the field.
- Raises:
KeyError – If field_type is not a recognised field type.
ValueError – If d defines neither vars_3D nor vars_2D.
- _build_timestamps() -> pandas.DatetimeIndex
Return valid initialisation timestamps for the dataset.
- Returns:
DatetimeIndex from start_datetime to end_datetime minus the forecast horizon, at the configured timestep frequency.
- _extract_field(field_type: str, t: pandas.Timestamp, sample: dict) -> None
Open the dataset for field_type at time t and populate sample.
Keys written are "arco_era5/{field_type}/3d/{varname}" for 3D variables and "arco_era5/{field_type}/2d/{varname}" for 2D variables.
- Parameters:
field_type – One of "prognostic", "dynamic_forcing", "static", "diagnostic".
t – Timestamp to select.
sample – Dict to write variable tensors into (modified in place). Tensor shapes (no batch dimension):
3D variable: (n_levels, 1, lat, lon)
2D variable: (1, 1, lat, lon)
- static _to_cftime(ts: pandas.Timestamp, calendar: str) -> cftime.datetime
Convert a pandas Timestamp to a cftime.datetime.
- Parameters:
ts – Pandas Timestamp to convert.
calendar – cftime calendar string read from the dataset (e.g. "noleap", "gregorian", "proleptic_gregorian").
- Returns:
cftime.datetime with the specified calendar.