credit.datasets.era5#
ARCOERA5Dataset: PyTorch Dataset for streaming ERA5 data from the Google Cloud ARCO ERA5 public Zarr store.
Sample structure returned by __getitem__:
{
    "input": {
        "{source_name}/prognostic/3d/temperature": tensor,       # (n_levels, 1, lat, lon)
        "{source_name}/prognostic/2d/surface_pressure": tensor,  # (1, 1, lat, lon)
        "{source_name}/dynamic_forcing/2d/toa_incident_solar_radiation": tensor,
        "{source_name}/static/2d/land_sea_mask": tensor,
        ...
    },
    "target": {  # only when return_target=True
        "{source_name}/prognostic/3d/temperature": tensor,
        "{source_name}/prognostic/2d/surface_pressure": tensor,
        ...
    },
    "metadata": {
        "input_datetime": int,   # nanoseconds since epoch
        "target_datetime": int,  # only when return_target=True
    },
}
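The metadata timestamps are plain integers, so they usually need decoding before logging or plotting. The following is a minimal sketch using only the standard library; it assumes the epoch is the Unix epoch in UTC, which the docstring does not state explicitly. The helper name `ns_to_datetime` and the example value are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "epoch" means the Unix epoch, interpreted as UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ns_to_datetime(ns: int) -> datetime:
    """Convert integer nanoseconds since the epoch to a UTC datetime.

    datetime resolves microseconds only, so sub-microsecond digits are dropped.
    """
    return EPOCH + timedelta(microseconds=ns // 1_000)


# Hypothetical metadata block shaped like the "metadata" entry above.
metadata = {"input_datetime": 1_483_228_800_000_000_000}
print(ns_to_datetime(metadata["input_datetime"]).isoformat())
# prints 2017-01-01T00:00:00+00:00
```

If pandas is already in use, `pandas.Timestamp(ns)` interprets an integer as nanoseconds since the Unix epoch and keeps full nanosecond precision.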
- Output key format (flat, slash-delimited):
"{source_name}/{field_type}/{dim}/{varname}"
field_type: "prognostic" | "dynamic_forcing" | "static" | "diagnostic"
dim: "2d" (surface / single-level) | "3d" (multi-level; level_coord = "level" or "hybrid")
varname: variable name as in the ARCO ERA5 Zarr store
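Because the keys are flat strings, downstream code often needs to split them back into their four parts. The sketch below shows one way to do that; `SampleKey`, `parse_key`, `build_key`, and the local `FIELD_TYPES` tuple are hypothetical helpers for illustration and are not part of `credit.datasets.era5`.

```python
from typing import NamedTuple

# Local stand-ins for the allowed values listed in the key format above.
FIELD_TYPES = ("prognostic", "dynamic_forcing", "static", "diagnostic")
DIMS = ("2d", "3d")


class SampleKey(NamedTuple):
    source_name: str
    field_type: str
    dim: str
    varname: str


def parse_key(key: str) -> SampleKey:
    """Split "{source_name}/{field_type}/{dim}/{varname}" into its parts."""
    # rsplit keeps any extra "/" inside source_name intact.
    parts = key.rsplit("/", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 slash-delimited parts, got: {key!r}")
    parsed = SampleKey(*parts)
    if parsed.field_type not in FIELD_TYPES:
        raise ValueError(f"unknown field_type: {parsed.field_type!r}")
    if parsed.dim not in DIMS:
        raise ValueError(f"unknown dim: {parsed.dim!r}")
    return parsed


def build_key(key: SampleKey) -> str:
    """Inverse of parse_key."""
    return "/".join(key)


k = parse_key("Example_ARCOERA5/prognostic/3d/temperature")
print(k.field_type, k.dim, k.varname)  # prints: prognostic 3d temperature
```

Parsing from the right means a `source_name` that itself contains a slash still round-trips through `build_key`.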
Attributes#
Classes#
ARCOERA5Dataset | PyTorch Dataset for Google Cloud ARCO ERA5 data with nested input/target structure.
WeatherBench2ERA5Dataset | PyTorch Dataset for WeatherBench2 ERA5 data on Google Cloud Storage.
Module Contents#
- class credit.datasets.era5.ARCOERA5Dataset(data_config: dict[str, Any], return_target: bool = False)#
Bases: credit.datasets.base_dataset.BaseDataset
PyTorch Dataset for Google Cloud ARCO ERA5 data with nested input/target structure.
See the module docstring for a full description of the output format and file naming.
Example YAML configuration:
data:
  source:
    Example_ARCOERA5:  # User-provided name (arbitrary key)
      dataset_type: "arco_era5"
      level_coord: "hybrid"
      levels: [10, 30, 40, 50, 60, 70, 80, 90, 95, 100, 105, 110, 120, 130, 136, 137]
      variables:
        prognostic:
          vars_3D: ["temperature", "u_component_of_wind", "v_component_of_wind", "specific_humidity"]
          vars_2D: ["surface_pressure"]
        dynamic_forcing:
          vars_2D: ["toa_incident_solar_radiation"]
        static:
          vars_2D: ["land_sea_mask"]
        diagnostic:
          vars_2D: ["total_precipitation"]
  start_datetime: "2017-01-01"
  end_datetime: "2019-12-31"
  timestep: "6h"
  forecast_len: 1
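To see how a configuration like this maps onto the output key format, the sketch below lists the keys one would expect for a single source entry. It is an illustration under assumptions, not the dataset's own logic: `expected_keys` is a hypothetical helper, the dict is a trimmed copy of the example's `variables` block, and the sketch makes no claim about which keys end up under "input" versus "target".

```python
def expected_keys(source_name: str, source_cfg: dict) -> list[str]:
    """List "{source_name}/{field_type}/{dim}/{varname}" keys for one source."""
    keys = []
    for field_type, groups in source_cfg["variables"].items():
        if not groups:  # a field type may be left empty (null in YAML)
            continue
        # vars_3D maps to the "3d" segment, vars_2D to "2d".
        for group, dim in (("vars_3D", "3d"), ("vars_2D", "2d")):
            for varname in groups.get(group) or []:
                keys.append(f"{source_name}/{field_type}/{dim}/{varname}")
    return keys


# Trimmed, hand-written copy of the example configuration's source entry.
source_cfg = {
    "variables": {
        "prognostic": {"vars_3D": ["temperature"], "vars_2D": ["surface_pressure"]},
        "dynamic_forcing": {"vars_2D": ["toa_incident_solar_radiation"]},
        "static": {"vars_2D": ["land_sea_mask"]},
        "diagnostic": None,
    }
}

for key in expected_keys("Example_ARCOERA5", source_cfg):
    print(key)
```

A list like this can be handy for checking that a sample contains every configured variable before training starts.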
- Assumptions:
A “time” dimension / coordinate is present for non-static fields.
A level coordinate (name given by level_coord) represents the vertical axis of 3D variables.
Dimension order: (time, level, latitude, longitude) for 3D; (time, latitude, longitude) for 2D; (latitude, longitude) for static.
- dataset_type = 'arco_era5'#
- pressure_lev_era5_path = 'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3'#
- model_lev_era5_path = 'gs://gcp-public-data-arco-era5/ar/model-level-1h-0p25deg.zarr-v1'#
- model_lev_vars = ['divergence', 'fraction_of_cloud_cover', 'geopotential', 'ozone_mass_mixing_ratio',...#
- level_coord: str#
- mod_level_store = None#
- pres_level_store = None#
- static_metadata: dict[str, Any]#
- mode = 'remote'#
- _fs = None#
- _init_fs()#
Initialize the GCSFileSystem and zarr stores for pressure-level and model-level ERA5 data.
- _extract_field(field_type: credit.datasets.base_dataset.VALID_FIELD_TYPES, t: pandas.Timestamp, sample: dict[str, Any]) None#
Open the dataset for field_type at time t and populate sample.
Keys written are "{source_name}/{field_type}/3d/{varname}" for 3D variables and "{source_name}/{field_type}/2d/{varname}" for 2D variables.
- Parameters:
field_type – One of "prognostic", "dynamic_forcing", "static", "diagnostic".
t – Timestamp to select.
sample – Dict to write variable tensors into (modified in place). Tensor shapes (no batch dimension):
3D variable: (n_levels, 1, lat, lon)
2D variable: (1, 1, lat, lon)
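The relationship between the stored dimension order and the per-sample tensor shapes can be summarised in a few lines. This is a sketch of the shape bookkeeping only, assuming that the second axis is a singleton left after selecting one time step and that 2D fields receive a singleton level axis; neither detail is confirmed by the docstring. `per_sample_shape` is a hypothetical helper.

```python
def per_sample_shape(dims: dict[str, int], level_coord: str = "level") -> tuple[int, int, int, int]:
    """Return the documented per-sample shape for a variable with the given stored dims.

    dims maps dimension names to sizes, e.g. {"time": 8, "level": 16, ...}.
    """
    # Assumption: fields without a level dimension get a length-1 level axis.
    n_levels = dims.get(level_coord, 1)
    # Assumption: the second axis is a singleton axis of length 1.
    return (n_levels, 1, dims["latitude"], dims["longitude"])


# 3D variable stored as (time, level, latitude, longitude)
print(per_sample_shape({"time": 8, "level": 16, "latitude": 721, "longitude": 1440}))
# prints (16, 1, 721, 1440)

# 2D variable stored as (time, latitude, longitude)
print(per_sample_shape({"time": 8, "latitude": 721, "longitude": 1440}))
# prints (1, 1, 721, 1440)
```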
- credit.datasets.era5._WB2_ERA5_BASE = 'gs://weatherbench2/datasets/era5'#
- credit.datasets.era5._WB2_ERA5_STORE_PATHS: dict[str, str]#
- credit.datasets.era5._WB2_ERA5_DEFAULT_LEVELS: dict[str, list[int]]#
- class credit.datasets.era5.WeatherBench2ERA5Dataset(data_config: dict, resolution: str = '1440x721', return_target: bool = False)#
Bases: credit.datasets.base_dataset.BaseDataset
PyTorch Dataset for WeatherBench2 ERA5 data on Google Cloud Storage.
Provides access to ERA5 reanalysis data prepared for the WeatherBench2 benchmark at multiple resolutions. All data is read lazily from public Google Cloud Storage zarr stores (anonymous access, no credentials required).
Available resolutions:
See _WB2_ERA5_DEFAULT_LEVELS for default pressure levels per resolution.
Example YAML configuration:
data:
  source:
    WeatherBench2_ERA5:
      dataset_type: "weatherbench2_era5"
      resolution: "1440x721"  # optional; overridden by the resolution kwarg
      level_coord: "level"
      levels: [50, 100, 200, 500, 850, 1000]  # optional; defaults to all available
      variables:
        prognostic:
          vars_3D: ["temperature", "u_component_of_wind", "v_component_of_wind", "specific_humidity"]
          vars_2D: ["surface_pressure", "2m_temperature"]
        dynamic_forcing:
          vars_2D: ["total_precipitation_6hr"]
        static:
          vars_2D: ["geopotential_at_surface"]
        diagnostic: null
  start_datetime: "2017-01-01"
  end_datetime: "2019-12-31"
  timestep: "6h"
  forecast_len: 1
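The `start_datetime`, `end_datetime`, and `timestep` fields together define a regular time grid. The sketch below enumerates that grid with the standard library. It is one plausible reading rather than the dataset's actual indexing: it assumes `end_datetime` is inclusive at midnight and handles only hour-based steps such as "6h". `parse_timestep` and `iter_times` are hypothetical helpers.

```python
from datetime import datetime, timedelta
from typing import Iterator


def parse_timestep(step: str) -> timedelta:
    """Parse an hour-based step string such as "6h" (the only form handled here)."""
    if not step.endswith("h"):
        raise ValueError(f"unsupported timestep: {step!r}")
    return timedelta(hours=int(step[:-1]))


def iter_times(start: str, end: str, timestep: str) -> Iterator[datetime]:
    """Yield timestamps from start to end (assumed inclusive) at the given step."""
    step = parse_timestep(timestep)
    t = datetime.fromisoformat(start)
    last = datetime.fromisoformat(end)
    while t <= last:
        yield t
        t += step


times = list(iter_times("2017-01-01", "2017-01-02", "6h"))
print(len(times))            # prints 5
print(times[1].isoformat())  # prints 2017-01-01T06:00:00
```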
Output key format:
"weatherbench2_era5/{field_type}/{dim}/{varname}"
- Assumptions:
Non-static variables have a “time” dimension in the zarr store.
3D pressure-level variables have a “level” coordinate (hPa).
Dimension order: (time, level, latitude, longitude) for 3D; (time, latitude, longitude) for 2D; (latitude, longitude) for static.
- dataset_type: str = 'weatherbench2_era5'#
- resolution: str#
- store_path: str#
- level_coord: str#
- levels: list[int]#
- static_metadata: dict#
- _fs = None#
- store = None#
- mode = 'remote'#
- _init_fs() None#
- _extract_field(field_type: credit.datasets.base_dataset.VALID_FIELD_TYPES, t: pandas.Timestamp, sample: dict) None#
Open the zarr store and extract variables for field_type at time t.
Keys written to sample:
"weatherbench2_era5/{field_type}/3d/{varname}" with shape (n_levels, 1, lat, lon)
"weatherbench2_era5/{field_type}/2d/{varname}" with shape (1, 1, lat, lon)
- Parameters:
field_type – One of "prognostic", "dynamic_forcing", "static", "diagnostic".
t – Timestamp to select.
sample – Dict to write variable tensors into (modified in place).