credit.datasets.local

LocalDataset: generic PyTorch Dataset for loading atmospheric data from local NetCDF/Zarr files. Supports any combination of prognostic, dynamic_forcing, static, and diagnostic field types with optional 3D (multi-level) and 2D (surface/single-level) variables.

Sample structure returned by __getitem__:

{
    "input": {
        "{source_name}/prognostic/3d/T":        tensor,  # (n_levels, 1, lat, lon)
        "{source_name}/prognostic/2d/SP":       tensor,  # (1,        1, lat, lon)
        "{source_name}/dynamic_forcing/2d/tsi": tensor,
        "{source_name}/static/2d/LSM":          tensor,
        ...
    },
    "target": {                                  # only when return_target=True
        "{source_name}/prognostic/3d/T":        tensor,
        "{source_name}/prognostic/2d/SP":       tensor,
        ...
    },
    "metadata": {
        "input_datetime":  int,                  # nanoseconds since epoch
        "target_datetime": int,                  # only when return_target=True
    },
}
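
The metadata timestamps are plain integers, so they usually need converting before they can be logged or compared. A minimal sketch, assuming "epoch" means the Unix epoch (1970-01-01 UTC); the helper names `ns_to_datetime` and `datetime_to_ns` are hypothetical and not part of credit:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)  # assumed reference epoch


def ns_to_datetime(ns: int) -> datetime:
    """Convert integer nanoseconds since the Unix epoch to a UTC datetime."""
    # datetime resolves microseconds only, so sub-microsecond digits are dropped.
    return EPOCH + timedelta(microseconds=ns // 1_000)


def datetime_to_ns(dt: datetime) -> int:
    """Inverse conversion, using integer arithmetic to avoid float rounding."""
    delta = dt - EPOCH
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * 1_000_000_000 + delta.microseconds * 1_000


# Illustrative metadata dict; the value is made up for this example.
metadata = {
    "input_datetime": datetime_to_ns(datetime(2017, 1, 1, 6, tzinfo=timezone.utc))
}
print(ns_to_datetime(metadata["input_datetime"]).isoformat())
# 2017-01-01T06:00:00+00:00
```
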
Output key format (flat, slash-delimited):

"{source_name}/{field_type}/{dim}/{varname}"

field_type : "prognostic" | "dynamic_forcing" | "static" | "diagnostic"

dim : "2d" (surface / single-level) | "3d" (multi-level upper-air; requires level_coord in config; if levels is omitted, all levels in the file are used)

varname : variable name as given in config (e.g. "T", "SP", "tsi")
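
Because the keys are flat strings, downstream code often has to split them back into their parts. A small sketch of one way to do that; `SampleKey` and `parse_key` are hypothetical helpers, and the sketch assumes none of the four components contains a slash:

```python
from typing import NamedTuple

FIELD_TYPES = {"prognostic", "dynamic_forcing", "static", "diagnostic"}
DIMS = {"2d", "3d"}


class SampleKey(NamedTuple):
    source_name: str
    field_type: str
    dim: str
    varname: str


def parse_key(key: str) -> SampleKey:
    """Split a flat sample key into its four slash-delimited components."""
    parts = key.split("/")
    if len(parts) != 4:
        raise ValueError(f"expected 4 slash-delimited parts, got {len(parts)}: {key!r}")
    parsed = SampleKey(*parts)
    if parsed.field_type not in FIELD_TYPES:
        raise ValueError(f"unknown field_type {parsed.field_type!r} in {key!r}")
    if parsed.dim not in DIMS:
        raise ValueError(f"unknown dim {parsed.dim!r} in {key!r}")
    return parsed


# "era5" is an illustrative source name only.
key = parse_key("era5/prognostic/3d/T")
print(key.field_type, key.dim, key.varname)  # prognostic 3d T
```
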

Tensor shapes (no batch dimension):

3D variable : (n_levels, 1, lat, lon), where n_levels = len(config levels), or the number of levels in the file if levels is omitted

2D variable : (1, 1, lat, lon), with a singleton level dim

After DataLoader collation the batch dimension is prepended:

(batch, n_levels, 1, lat, lon)
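
The effect of collation on the nested sample dict can be checked without any data files. The sketch below builds fake samples with the documented shapes and passes them through PyTorch's `default_collate` (available from `torch.utils.data` in recent PyTorch releases); `fake_sample`, the source name "era5", and the grid sizes are all illustrative assumptions:

```python
import torch
from torch.utils.data import default_collate


def fake_sample(n_levels: int = 3, lat: int = 4, lon: int = 8) -> dict:
    """Stand-in for one __getitem__ result, using the documented tensor shapes."""
    return {
        "input": {
            "era5/prognostic/3d/T": torch.zeros(n_levels, 1, lat, lon),
            "era5/prognostic/2d/SP": torch.zeros(1, 1, lat, lon),
        },
        "metadata": {"input_datetime": 1_483_228_800_000_000_000},
    }


# default_collate recurses into nested dicts and stacks tensors along a new dim 0.
batch = default_collate([fake_sample() for _ in range(2)])

print(batch["input"]["era5/prognostic/3d/T"].shape)   # torch.Size([2, 3, 1, 4, 8])
print(batch["input"]["era5/prognostic/2d/SP"].shape)  # torch.Size([2, 1, 1, 4, 8])
print(batch["metadata"]["input_datetime"].shape)      # torch.Size([2])
```

Integer metadata values are collated into a 1-D integer tensor with one entry per sample.
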

File naming:

Each field type supports an optional filename_time_format config key that specifies a strftime format string describing how the datetime appears in the file name. Defaults to "%Y" (annual files).

Examples:

filename_time_format: "%Y"       # data_2021.zarr
filename_time_format: "%Y_%m"    # data_2021_06.nc
filename_time_format: "%Y%m%d"   # data_20210601.nc

If only a single file matches the glob pattern, filename_time_format is ignored and that file is used for all timestamps.
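
The rules above can be sketched in a few lines. This is not the library's implementation: `pick_file` is a hypothetical helper that works on a list of candidate names instead of the filesystem, and it assumes the formatted datetime appears verbatim somewhere in the file name:

```python
from datetime import datetime
from fnmatch import fnmatchcase
from typing import Sequence


def pick_file(candidates: Sequence[str], pattern: str,
              time_format: str, when: datetime) -> str:
    """Choose the file for `when` among the names matching the glob `pattern`."""
    matches = sorted(name for name in candidates if fnmatchcase(name, pattern))
    if not matches:
        raise FileNotFoundError(f"no file matches {pattern!r}")
    if len(matches) == 1:
        # Single match: the time format is ignored, as described above.
        return matches[0]
    stamp = when.strftime(time_format)  # e.g. "2021_06" for "%Y_%m"
    hits = [name for name in matches if stamp in name]
    if len(hits) != 1:
        raise LookupError(f"{len(hits)} files contain {stamp!r}; expected exactly 1")
    return hits[0]


files = ["/data/solar_2021_05.nc", "/data/solar_2021_06.nc", "/data/lsm.nc"]
print(pick_file(files, "/data/solar_*.nc", "%Y_%m", datetime(2021, 6, 1)))
# /data/solar_2021_06.nc
print(pick_file(files, "/data/lsm.nc", "%Y", datetime(2021, 6, 1)))
# /data/lsm.nc
```
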

Classes

LocalDataset

Generic PyTorch Dataset for local NetCDF/Zarr atmospheric data files.

Module Contents

class credit.datasets.local.LocalDataset(data_config: dict[str, Any], return_target: bool = False)

Bases: credit.datasets.base_dataset.BaseDataset

Generic PyTorch Dataset for local NetCDF/Zarr atmospheric data files.

See module docstring for full description of output format and file naming.

Example YAML configuration:

data:
  source:
    My_Surface_Data:  # User-provided name (arbitrary key)
      dataset_type: "local"
      level_coord: "level"
      levels: [10, 30, 40, 50, 60, 70, 80, 90, 95, 100, 105, 110, 120, 130, 136, 137]
      variables:
        prognostic:
          vars_3D: ['T', 'U', 'V', 'Q']
          vars_2D: ['SP', 't2m']
          path: "/data/era5_*.zarr"
          filename_time_format: "%Y"        # annual (default)
        dynamic_forcing:
          vars_2D: ['tsi']
          path: "/data/solar_*.nc"
          filename_time_format: "%Y_%m"     # monthly
        static:
          vars_2D: ['Z_GDS4_SFC', 'LSM']
          path: "/data/lsm.nc"
          # single file — filename_time_format not needed
        diagnostic: null

  start_datetime: "2017-01-01"
  end_datetime: "2019-12-31"
  timestep: "6h"
  forecast_len: 1
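
To see which sample keys a configuration like this would produce, the variables block can be walked directly. The sketch below mirrors the YAML as a Python dict and applies the documented key format; `expected_keys` is a hypothetical helper, and it assumes the user-provided source key is what fills `{source_name}`:

```python
from typing import Any, Dict, List, Optional

source_name = "My_Surface_Data"
variables: Dict[str, Optional[Dict[str, Any]]] = {
    "prognostic": {"vars_3D": ["T", "U", "V", "Q"], "vars_2D": ["SP", "t2m"]},
    "dynamic_forcing": {"vars_2D": ["tsi"]},
    "static": {"vars_2D": ["Z_GDS4_SFC", "LSM"]},
    "diagnostic": None,  # disabled field type
}


def expected_keys(source_name: str,
                  variables: Dict[str, Optional[Dict[str, Any]]]) -> List[str]:
    """List "{source_name}/{field_type}/{dim}/{varname}" for every configured variable."""
    keys = []
    for field_type, cfg in variables.items():
        if not cfg:  # null / missing field type contributes nothing
            continue
        for dim, cfg_key in (("3d", "vars_3D"), ("2d", "vars_2D")):
            for varname in cfg.get(cfg_key) or []:
                keys.append(f"{source_name}/{field_type}/{dim}/{varname}")
    return keys


for key in expected_keys(source_name, variables):
    print(key)
```

The sketch only enumerates keys; which of them land under "input" and which under "target" is decided by the dataset itself.
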
Assumptions:
  1. A "time" dimension / coordinate is present for non-static fields.

  2. A level coordinate (name given by level_coord) represents the vertical axis of 3D variables.

  3. Dimension order: (time, level, latitude, longitude) for 3D; (time, latitude, longitude) for 2D; (latitude, longitude) for static.
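
These assumptions can be checked up front before training starts. A minimal sketch, using the dimension names exactly as written in assumption 3; `expected_dims` and `check_dims` are hypothetical helpers, not part of credit:

```python
from typing import Optional, Sequence, Tuple


def expected_dims(kind: str, level_coord: Optional[str] = None) -> Tuple[str, ...]:
    """Return the assumed dimension order for kind in {"3d", "2d", "static"}."""
    if kind == "static":
        return ("latitude", "longitude")
    if kind == "2d":
        return ("time", "latitude", "longitude")
    if kind == "3d":
        if level_coord is None:
            raise ValueError("3D variables require level_coord")
        return ("time", level_coord, "latitude", "longitude")
    raise ValueError(f"unknown kind {kind!r}")


def check_dims(varname: str, dims: Sequence[str], kind: str,
               level_coord: Optional[str] = None) -> None:
    """Raise ValueError if dims does not match the assumed order."""
    want = expected_dims(kind, level_coord)
    if tuple(dims) != want:
        raise ValueError(f"{varname}: expected dims {want}, got {tuple(dims)}")


check_dims("T", ("time", "level", "latitude", "longitude"), "3d", level_coord="level")
check_dims("LSM", ("latitude", "longitude"), "static")
```

With xarray, the `dims` attribute of a DataArray is a tuple of dimension names and can be passed in as `dims`.
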

dataset_type = 'local'
level_coord: str | None
levels: list | None
static_metadata: dict[str, Any]
mode = 'local'
time_coord
_extract_field(field_type: str, t: pandas.Timestamp, sample: dict[str, Any]) -> None

Open the dataset for field_type at time t and populate sample.

Keys written are "{source_name}/{field_type}/3d/{varname}" for 3D variables and "{source_name}/{field_type}/2d/{varname}" for 2D variables.

Parameters:
  • field_type – One of "prognostic", "dynamic_forcing", "static", "diagnostic".

  • t – Timestamp to select.

  • sample – Dict to write variable tensors into (modified in place). Tensor shapes (no batch dimension):

    • 3D variable: (n_levels, 1, lat, lon)

    • 2D variable: (1, 1, lat, lon)
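
The shape convention and in-place write can be illustrated as follows. This is a sketch under stated assumptions, not the method's actual body: `write_variable` is a hypothetical helper that receives values already selected at time t, with shape (level, lat, lon) for 3D variables or (lat, lon) for 2D variables:

```python
from typing import Any, Dict

import torch


def write_variable(sample: Dict[str, Any], source_name: str, field_type: str,
                   varname: str, values: Any) -> None:
    """Add singleton dims to match the documented shapes and store under the flat key."""
    tensor = torch.as_tensor(values)
    if tensor.ndim == 3:      # (n_levels, lat, lon) -> (n_levels, 1, lat, lon)
        dim = "3d"
        tensor = tensor.unsqueeze(1)
    elif tensor.ndim == 2:    # (lat, lon) -> (1, 1, lat, lon)
        dim = "2d"
        tensor = tensor[None, None]
    else:
        raise ValueError(f"{varname}: expected 2 or 3 dims, got {tensor.ndim}")
    sample[f"{source_name}/{field_type}/{dim}/{varname}"] = tensor


sample: Dict[str, Any] = {}
write_variable(sample, "era5", "prognostic", "T", torch.zeros(16, 4, 8))
write_variable(sample, "era5", "prognostic", "SP", torch.zeros(4, 8))
print({key: tuple(value.shape) for key, value in sample.items()})
# {'era5/prognostic/3d/T': (16, 1, 4, 8), 'era5/prognostic/2d/SP': (1, 1, 4, 8)}
```
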