credit.datasets.local
LocalDataset: generic PyTorch Dataset for loading atmospheric data from local NetCDF/Zarr files. Supports any combination of prognostic, dynamic_forcing, static, and diagnostic field types with optional 3D (multi-level) and 2D (surface/single-level) variables.
Sample structure returned by __getitem__:
{
    "input": {
        "{source_name}/prognostic/3d/T": tensor,  # (n_levels, 1, lat, lon)
        "{source_name}/prognostic/2d/SP": tensor,  # (1, 1, lat, lon)
        "{source_name}/dynamic_forcing/2d/tsi": tensor,
        "{source_name}/static/2d/LSM": tensor,
        ...
    },
    "target": {  # only when return_target=True
        "{source_name}/prognostic/3d/T": tensor,
        "{source_name}/prognostic/2d/SP": tensor,
        ...
    },
    "metadata": {
        "input_datetime": int,  # nanoseconds since epoch
        "target_datetime": int,  # only when return_target=True
    },
}
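As a rough illustration of how a sample with this structure might be consumed, the sketch below splits the flat keys into their four components and converts the integer timestamp back to a datetime. The helper names `parse_key` and `ns_to_datetime`, the source name `era5`, and the stand-in sample are hypothetical and not part of `credit.datasets.local`. It also assumes the epoch is the Unix epoch in UTC, which the docstring does not state explicitly.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple


class KeyParts(NamedTuple):
    source_name: str
    field_type: str
    dim: str
    varname: str


def parse_key(key: str) -> KeyParts:
    """Split a flat sample key into its four slash-delimited components."""
    # Split from the right so a source name that itself contains "/" stays intact.
    source_name, field_type, dim, varname = key.rsplit("/", 3)
    return KeyParts(source_name, field_type, dim, varname)


def ns_to_datetime(ns: int) -> datetime:
    """Convert integer nanoseconds since the Unix epoch to a UTC datetime (assumed UTC)."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    # datetime only resolves microseconds, so sub-microsecond digits are dropped.
    return epoch + timedelta(microseconds=ns // 1000)


# Stand-in sample: strings replace the tensors a real sample would hold.
sample = {
    "input": {
        "era5/prognostic/3d/T": "tensor",
        "era5/prognostic/2d/SP": "tensor",
        "era5/static/2d/LSM": "tensor",
    },
    "metadata": {"input_datetime": 1_483_228_800_000_000_000},
}

for key in sample["input"]:
    print(parse_key(key))
print(ns_to_datetime(sample["metadata"]["input_datetime"]))
```

Splitting from the right is a design choice for this sketch only: it tolerates a user-provided source name that contains a slash, since the last three components have a fixed meaning.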
- Output key format (flat, slash-delimited):
    "{source_name}/{field_type}/{dim}/{varname}"
    field_type : "prognostic" | "dynamic_forcing" | "static" | "diagnostic"
    dim        : "2d" (surface / single-level)
               | "3d" (multi-level upper-air; requires level_coord in config;
                 if levels is omitted, all levels in the file are used)
    varname    : variable name as given in config (e.g. "T", "SP", "tsi")
- Tensor shapes (no batch dimension):
    3D variable : (n_levels, 1, lat, lon), where n_levels = len(config levels)
    2D variable : (1, 1, lat, lon), with a singleton level dimension
- After DataLoader collation the batch dimension is prepended:
    (batch, n_levels, 1, lat, lon)
- File naming:
    Each field type supports an optional filename_time_format config key that specifies a strftime format string describing how the datetime appears in the file name. Defaults to "%Y" (annual files).
    Examples:
        filename_time_format: "%Y"      # data_2021.zarr
        filename_time_format: "%Y_%m"   # data_2021_06.nc
        filename_time_format: "%Y%m%d"  # data_20210601.nc
    If only a single file matches the glob pattern, filename_time_format is ignored and that file is used for all timestamps.
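The file-naming rule can be sketched roughly as follows. This is not the lookup that `LocalDataset` actually performs, which is not shown here. `select_file` is a hypothetical helper that assumes the list of glob matches is already available and that the formatted timestamp appears as a substring of exactly one file name.

```python
from datetime import datetime
from pathlib import PurePosixPath


def select_file(files: list[str], t: datetime, filename_time_format: str = "%Y") -> str:
    """Pick the file whose name contains t rendered with filename_time_format."""
    if not files:
        raise FileNotFoundError("glob pattern matched no files")
    if len(files) == 1:
        # Single match: the format is ignored and the file serves every timestamp.
        return files[0]
    stamp = t.strftime(filename_time_format)
    # Compare against the file name only, so directory names cannot match by accident.
    matches = [f for f in files if stamp in PurePosixPath(f).name]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one file containing {stamp!r}, found {len(matches)}"
        )
    return matches[0]


monthly = ["/data/solar_2021_05.nc", "/data/solar_2021_06.nc"]
print(select_file(monthly, datetime(2021, 6, 1), "%Y_%m"))
print(select_file(["/data/lsm.nc"], datetime(2021, 6, 1), "%Y_%m"))
```

Plain substring matching is a simplification. A format that is too coarse for the files on disk (for example "%Y" with monthly files) would match several names, which this sketch reports as an error rather than guessing.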
Classes
LocalDataset | Generic PyTorch Dataset for local NetCDF/Zarr atmospheric data files.
Module Contents
- class credit.datasets.local.LocalDataset(data_config: dict[str, Any], return_target: bool = False)
Bases: credit.datasets.base_dataset.BaseDataset
Generic PyTorch Dataset for local NetCDF/Zarr atmospheric data files.
See module docstring for full description of output format and file naming.
Example YAML configuration:
data:
  source:
    My_Surface_Data:  # User-provided name (arbitrary key)
      dataset_type: "local"
      level_coord: "level"
      levels: [10, 30, 40, 50, 60, 70, 80, 90, 95, 100, 105, 110, 120, 130, 136, 137]
      variables:
        prognostic:
          vars_3D: ['T', 'U', 'V', 'Q']
          vars_2D: ['SP', 't2m']
          path: "/data/era5_*.zarr"
          filename_time_format: "%Y"  # annual (default)
        dynamic_forcing:
          vars_2D: ['tsi']
          path: "/data/solar_*.nc"
          filename_time_format: "%Y_%m"  # monthly
        static:
          vars_2D: ['Z_GDS4_SFC', 'LSM']
          path: "/data/lsm.nc"  # single file; filename_time_format not needed
        diagnostic: null
  start_datetime: "2017-01-01"
  end_datetime: "2019-12-31"
  timestep: "6h"
  forecast_len: 1
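To connect this configuration to the output key format, the sketch below lists the keys that the `variables` block of one source would contribute. `expected_keys` is a hypothetical helper, not part of the package. The dict mirrors the YAML with the `path` and `filename_time_format` entries left out, and the sketch assumes that a field type set to `null` contributes no keys.

```python
from typing import Any

# Python equivalent of the "variables" block for the My_Surface_Data source.
variables: dict[str, Any] = {
    "prognostic": {"vars_3D": ["T", "U", "V", "Q"], "vars_2D": ["SP", "t2m"]},
    "dynamic_forcing": {"vars_2D": ["tsi"]},
    "static": {"vars_2D": ["Z_GDS4_SFC", "LSM"]},
    "diagnostic": None,
}


def expected_keys(source_name: str, variables: dict[str, Any]) -> list[str]:
    """List flat keys of the form {source_name}/{field_type}/{dim}/{varname}."""
    keys = []
    for field_type, spec in variables.items():
        if spec is None:  # e.g. "diagnostic: null" (assumed to mean no variables)
            continue
        for dim, config_key in (("3d", "vars_3D"), ("2d", "vars_2D")):
            # A field type may omit vars_3D or vars_2D entirely.
            for varname in spec.get(config_key) or []:
                keys.append(f"{source_name}/{field_type}/{dim}/{varname}")
    return keys


keys = expected_keys("My_Surface_Data", variables)
for key in keys:
    print(key)
```

Which of these keys land in "input" and which in "target" depends on the field type and on return_target. The sketch only enumerates the names.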
- Assumptions:
    A "time" dimension / coordinate is present for non-static fields.
    A level coordinate (name given by level_coord) represents the vertical axis of 3D variables.
    Dimension order: (time, level, latitude, longitude) for 3D; (time, latitude, longitude) for 2D; (latitude, longitude) for static.
- dataset_type = 'local'
- level_coord: str | None
- levels: list | None
- static_metadata: dict[str, Any]
- mode = 'local'
- time_coord
- _extract_field(field_type: str, t: pandas.Timestamp, sample: dict[str, Any]) -> None
Open the dataset for field_type at time t and populate sample.
Keys written are "{source_name}/{field_type}/3d/{varname}" for 3D variables and "{source_name}/{field_type}/2d/{varname}" for 2D variables.
- Parameters:
    field_type – One of "prognostic", "dynamic_forcing", "static", "diagnostic".
    t – Timestamp to select.
    sample – Dict to write variable tensors into (modified in place). Tensor shapes (no batch dimension):
        3D variable: (n_levels, 1, lat, lon)
        2D variable: (1, 1, lat, lon)
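The shape rules above, together with the batch dimension that DataLoader collation prepends, can be written down as a small check. `expected_shape` is a hypothetical helper for illustration, and the 181 x 360 grid is an arbitrary example size rather than something taken from the package.

```python
from typing import Optional


def expected_shape(
    dim: str,
    n_levels: int,
    n_lat: int,
    n_lon: int,
    batch_size: Optional[int] = None,
) -> tuple[int, ...]:
    """Return the tensor shape documented for a "3d" or "2d" variable."""
    if dim == "3d":
        lead = n_levels  # one entry per configured level
    elif dim == "2d":
        lead = 1  # singleton level dimension
    else:
        raise ValueError(f"dim must be '3d' or '2d', got {dim!r}")
    shape = (lead, 1, n_lat, n_lon)
    # Collation prepends the batch dimension; per-sample shapes have none.
    return shape if batch_size is None else (batch_size, *shape)


print(expected_shape("3d", n_levels=16, n_lat=181, n_lon=360))
print(expected_shape("2d", n_levels=16, n_lat=181, n_lon=360, batch_size=4))
```

A check like this may be useful in a unit test that compares `tensor.shape` for each key of a sample against the configured number of levels.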