credit.datasets.MRMS#

MRMSDataset with nested input/target structure.

Field type semantics (mirrors ERA5 conventions):

prognostic: input at step 0 AND target; the model prediction is fed back at step > 0 (autoregressive rollout)

diagnostic: target only; not fed back into the model

dynamic_forcing: input at every step; never a target
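The field-type rules can be sketched as a small helper. This is an illustration only: `fields_for_step` is a hypothetical name, not a function in credit.datasets.MRMS, and it simply encodes the three rules stated here.

```python
def fields_for_step(step: int, return_target: bool = True) -> dict[str, list[str]]:
    """Return which field types feed "input" and "target" at a rollout step.

    Hypothetical helper that mirrors the documented semantics only.
    """
    inputs = ["dynamic_forcing"]        # forcing is read at every step
    if step == 0:
        inputs.insert(0, "prognostic")  # initial state is read from data only at step 0
    # Prognostic and diagnostic fields are targets; forcing never is.
    targets = ["prognostic", "diagnostic"] if return_target else []
    return {"input": inputs, "target": targets}


print(fields_for_step(0))
print(fields_for_step(1))
```

At step 1 and later the prognostic entry drops out of the input list, because the model's own prediction takes its place.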

Sample structure returned by __getitem__:

{
    "input": {
        "mrms/prognostic/2d/MultiSensor_QPE_01H_Pass2_00.00": tensor,
        "mrms/dynamic_forcing/2d/MultiSensor_QPE_06H_Pass2_00.00": tensor,
        …
    },
    "target": {  # only when return_target=True
        "mrms/prognostic/2d/MultiSensor_QPE_01H_Pass2_00.00": tensor,
        …
    },
    "metadata": {
        "input_datetime": int,   # nanoseconds since epoch
        "target_datetime": int,  # only when return_target=True
    },
}

All MRMS variables are 2D. Tensor shape (no batch dimension):

(1, 1, lat, lon) — singleton level dim, consistent with ERA5 2D convention

After DataLoader collation the batch dimension is prepended:

(batch, 1, 1, lat, lon)
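The shape change under collation can be checked with placeholder samples. The sketch below assumes the standard `default_collate` from `torch.utils.data` is used; `fake_sample` and the 4 x 5 grid are made up for illustration and do not come from the MRMS code.

```python
import torch
from torch.utils.data import default_collate

KEY = "mrms/prognostic/2d/MultiSensor_QPE_01H_Pass2_00.00"


def fake_sample(lat: int = 4, lon: int = 5) -> dict:
    """Build a placeholder sample with the documented layout (values are zeros)."""
    return {
        "input": {KEY: torch.zeros(1, 1, lat, lon)},
        "metadata": {"input_datetime": 1_717_221_600_000_000_000},
    }


# default_collate recurses into nested dicts and stacks tensors on a new dim 0.
batch = default_collate([fake_sample() for _ in range(3)])
input_shape = tuple(batch["input"][KEY].shape)
print(input_shape)
```

The integer metadata values are collated into a 1-D tensor with one entry per sample.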

Modes:

local: load from NetCDF (.nc) or Zarr (.zarr) files on disk using the same filename_time_format strftime convention as ERA5.

remote: stream directly from AWS S3 (noaa-mrms-pds, anonymous access) via s3fs + pygrib.

File naming (local mode):

Controlled by the optional filename_time_format config key. Defaults to "%Y%m%d-%H%M%S" (one file per timestamp).

Examples:

filename_time_format: "%Y%m%d-%H%M%S"   # MRMS_20240601-060000.nc
filename_time_format: "%Y%m%d"           # MRMS_20240601.nc  (daily)
filename_time_format: "%Y%m"             # MRMS_202406.nc    (monthly)

If only a single file matches the glob pattern, filename_time_format is ignored and that file is used for all timestamps.
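The naming examples can be reproduced with plain `strftime`. In this sketch `local_filename` is a hypothetical helper, and the `MRMS_` prefix and `.nc` suffix are taken from the examples; in practice the actual path comes from the configured glob pattern.

```python
from datetime import datetime


def local_filename(t: datetime, filename_time_format: str = "%Y%m%d-%H%M%S") -> str:
    """Format the time-dependent part of a file name (hypothetical helper)."""
    # Prefix and suffix are assumptions copied from the documented examples.
    return f"MRMS_{t.strftime(filename_time_format)}.nc"


t = datetime(2024, 6, 1, 6, 0, 0)
per_timestamp = local_filename(t)
daily = local_filename(t, "%Y%m%d")
monthly = local_filename(t, "%Y%m")
print(per_timestamp, daily, monthly)
```

With a coarser format, many timestamps map to the same file name, which is how daily and monthly files are shared across samples.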

Attributes#

Classes#

MRMSDataset

PyTorch Dataset for MRMS data with nested input/target structure.

Functions#

_apply_extent(→ xarray.DataArray)

Subset da to a spatial extent if provided.

Module Contents#

credit.datasets.MRMS.logger#
credit.datasets.MRMS.VALID_FIELD_TYPES#
credit.datasets.MRMS._S3_URI = 's3://noaa-mrms-pds/{region}/{varname}/{date_str}/MRMS_{varname}_{datetime_str}.grib2.gz'#
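A minimal sketch of filling in the `_S3_URI` template follows. The `strftime` formats used for `date_str` and `datetime_str` are assumptions (they are not stated on this page), and `build_s3_uri` is a hypothetical helper name.

```python
from datetime import datetime

_S3_URI = "s3://noaa-mrms-pds/{region}/{varname}/{date_str}/MRMS_{varname}_{datetime_str}.grib2.gz"


def build_s3_uri(region: str, varname: str, t: datetime) -> str:
    """Fill the URI template for one variable and timestamp (hypothetical helper)."""
    return _S3_URI.format(
        region=region,
        varname=varname,
        date_str=t.strftime("%Y%m%d"),              # assumed format
        datetime_str=t.strftime("%Y%m%d-%H%M%S"),   # assumed format
    )


uri = build_s3_uri("CONUS", "MultiSensor_QPE_01H_Pass2_00.00", datetime(2024, 6, 1, 6))
print(uri)
```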
credit.datasets.MRMS._apply_extent(da: xarray.DataArray, extent: list[float] | None) xarray.DataArray#

Subset da to a spatial extent if provided.

Parameters:
  • da – DataArray with lat and lon coordinates (0-360 longitude).

  • extent – [min_lon, max_lon, min_lat, max_lat] in either -180–180 or 0–360 format; normalised to 0–360 internally. None returns da unchanged.

Returns:

Spatially subsetted DataArray, or da unchanged if extent is None.

class credit.datasets.MRMS.MRMSDataset(config: dict, return_target: bool = False)#

Bases: torch.utils.data.Dataset

PyTorch Dataset for MRMS data with nested input/target structure.

Field types follow ERA5 conventions: prognostic variables appear in both input (at step 0) and target; dynamic_forcing appears in input at every step; diagnostic appears in target only. At step i > 0 the model’s own prognostic predictions are fed back — no disk read occurs for prognostic fields at those steps.

Supports loading directly from AWS S3 (remote mode) or from local NetCDF / Zarr files (local mode). Spatial subsetting via extent is applied at load time on the native MRMS grid.

See module docstring for full description of output format and file naming.

Example YAML configuration (local mode):

data:
  source:
    MRMS:
      mode: "local"
      variables:
        prognostic:                         # input at step 0 + target
          vars_2D:
            - "MultiSensor_QPE_01H_Pass2_00.00"
          path: "/data/MRMS_*.nc"
          filename_time_format: "%Y%m%d-%H%M%S"
        dynamic_forcing:                    # input every step
          vars_2D:
            - "MultiSensor_QPE_06H_Pass2_00.00"
          path: "/data/MRMS_*.nc"
          filename_time_format: "%Y%m%d-%H%M%S"
      extent: [-130, -60, 20, 55]   # [min_lon, max_lon, min_lat, max_lat]

  start_datetime: "2024-06-01"
  end_datetime:   "2024-07-01"
  timestep:       "6h"
  forecast_len:   0

Example YAML configuration (remote mode):

data:
  source:
    MRMS:
      mode: "remote"
      region: "CONUS"
      variables:
        prognostic:
          vars_2D:
            - "MultiSensor_QPE_01H_Pass2_00.00"
      extent: [-130, -60, 20, 55]

Assumptions:
  1. Local files have time, lat, lon dimensions/coordinates.

  2. Longitude coordinates are in the 0–360 convention (both local and remote).

  3. extent is specified as [min_lon, max_lon, min_lat, max_lat] in either -180–180 or 0–360 format; it is normalised to 0–360 internally.

source_name: str = 'mrms'#
return_target: bool = False#
mode: str#
region: str#
extent: list[float] | None#
static_metadata: dict#
dt#
num_forecast_steps: int#
start_datetime#
end_datetime#
datetimes: pandas.DatetimeIndex#
file_dict: dict[str, list[tuple[pandas.Timestamp, pandas.Timestamp, str]] | None]#
var_dict: dict[str, dict[str, list[str]]]#
__len__() int#
__getitem__(args: tuple) dict#

Return a nested input/target sample dict.

Prognostic fields are loaded into input only at step i == 0 (consistent with ERA5 autoregressive rollout semantics). Dynamic forcing is loaded at every step. Diagnostic fields never appear in input.

Parameters:

args – (t, i), where t is the current timestamp (nanoseconds or pd.Timestamp) and i is the within-sequence step index produced by the sampler.

Returns:

Dict with keys "input", "metadata", and optionally "target" (when return_target=True). Both "input" and "target" are dicts of per-variable tensors keyed by "mrms/{field_type}/2d/{varname}".
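Downstream code often needs to split the per-variable keys and decode the nanosecond timestamps in "metadata". The sketch below uses only the standard library; `parse_key` and `ns_to_datetime` are hypothetical helpers, and treating the timestamps as UTC is an assumption.

```python
from datetime import datetime, timedelta, timezone


def parse_key(key: str) -> dict[str, str]:
    """Split "mrms/{field_type}/2d/{varname}" into its four parts."""
    # maxsplit=3 keeps the variable name whole; its dots and underscores are untouched.
    source, field_type, dim, varname = key.split("/", 3)
    return {"source": source, "field_type": field_type, "dim": dim, "varname": varname}


def ns_to_datetime(ns: int) -> datetime:
    """Convert nanoseconds since the epoch to an aware datetime (UTC assumed)."""
    seconds, remainder = divmod(ns, 1_000_000_000)
    whole = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return whole + timedelta(microseconds=remainder // 1000)


parts = parse_key("mrms/prognostic/2d/MultiSensor_QPE_01H_Pass2_00.00")
print(parts["field_type"], parts["varname"])
```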

_register_field(field_type: str, d: dict | None) None#

Validate and register one field type from the config variables block.

Parameters:
  • field_type – One of "prognostic", "diagnostic", "dynamic_forcing".

  • d – Field-type config dict, or None / null to disable the field.

Raises:
  • KeyError – If field_type is not a recognised MRMS field type.

  • ValueError – If d defines no vars_2D.

_build_timestamps() pandas.DatetimeIndex#

Return valid initialisation timestamps for the dataset.

Returns:

DatetimeIndex from start_datetime to end_datetime minus the forecast horizon, at the configured timestep frequency.
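A standard-library sketch of the timestamp range follows. It assumes the forecast horizon equals `num_forecast_steps` times the timestep and that the last valid time is inclusive; neither detail is confirmed on this page. The real method returns a pandas DatetimeIndex, whereas this hypothetical `build_timestamps` returns a list.

```python
from datetime import datetime, timedelta


def build_timestamps(
    start: datetime, end: datetime, timestep: timedelta, num_forecast_steps: int
) -> list[datetime]:
    """List initialisation times that leave room for the forecast horizon."""
    last_init = end - num_forecast_steps * timestep  # assumed horizon definition
    stamps = []
    t = start
    while t <= last_init:  # inclusive upper bound is an assumption
        stamps.append(t)
        t += timestep
    return stamps


stamps = build_timestamps(
    datetime(2024, 6, 1), datetime(2024, 6, 2), timedelta(hours=6), num_forecast_steps=1
)
print(len(stamps), stamps[-1])
```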

_extract_field(field_type: str, t: pandas.Timestamp, sample: dict) None#

Load all variables for field_type at time t into sample.

Dispatches to local or remote loading based on self.mode.

Parameters:
  • field_type – Registered field type (e.g. "prognostic").

  • t – Timestamp to load.

  • sample – Dict to write variable tensors into (modified in place). Tensor shape (no batch dimension): (1, 1, lat, lon).

_load_local_var(field_type: str, vname: str, t: pandas.Timestamp)#

Load a single variable from a local NetCDF or Zarr file.

Parameters:
  • field_type – Field type key used to look up file intervals.

  • vname – Variable name within the dataset.

  • t – Timestamp to select.

Returns:

2-D numpy array (lat, lon) after optional extent subsetting.

Raises:

KeyError – If no files are registered for field_type.
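The file lookup implied by `file_dict` can be sketched as an interval search over `(start, end, path)` tuples. `find_file` is a hypothetical helper; the closed interval test and the `LookupError` for an uncovered timestamp are assumptions. The single-file shortcut follows the file-naming rule in the module docstring.

```python
from datetime import datetime


def find_file(file_dict: dict, field_type: str, t: datetime) -> str:
    """Return the path of the file whose time interval contains t."""
    intervals = file_dict.get(field_type)
    if not intervals:
        raise KeyError(f"no files registered for field type {field_type!r}")
    if len(intervals) == 1:
        return intervals[0][2]  # a single file serves every timestamp
    for start, end, path in intervals:
        if start <= t <= end:  # closed interval is an assumption
            return path
    raise LookupError(f"no file covers {t}")


files = {
    "prognostic": [
        (datetime(2024, 6, 1), datetime(2024, 6, 1, 23, 59, 59), "/data/MRMS_20240601.nc"),
        (datetime(2024, 6, 2), datetime(2024, 6, 2, 23, 59, 59), "/data/MRMS_20240602.nc"),
    ],
    "diagnostic": None,
}
path = find_file(files, "prognostic", datetime(2024, 6, 2, 6))
print(path)
```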

_load_remote_var(vname: str, t: pandas.Timestamp)#

Stream a single variable from the MRMS S3 bucket.

Imports s3fs and pygrib lazily so they are only required when remote mode is actually used.

Parameters:
  • vname – MRMS variable name (used in the S3 path).

  • t – Timestamp to fetch.

Returns:

2-D numpy array (lat, lon) after optional extent subsetting.