credit.datasets.MRMS#
MRMSDataset with nested input/target structure.
- Field type semantics (mirrors ERA5 conventions):
  prognostic: input at step 0 AND target; the model prediction is fed back at step > 0 (autoregressive rollout)
  diagnostic: target only; not fed back into the model
  dynamic_forcing: input at every step; never a target
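The routing rules above can be sketched as a small helper. This is only an illustration of the semantics described here, not the dataset's actual code; the function name `fields_read_at_step` is hypothetical.

```python
def fields_read_at_step(step: int, return_target: bool = False) -> dict[str, list[str]]:
    """Return the field types read from disk for 'input' and 'target' at one step.

    Hypothetical helper: it only encodes the rules stated in the docstring.
    """
    if step < 0:
        raise ValueError("step must be non-negative")

    # dynamic_forcing is an input at every step.
    inputs = ["dynamic_forcing"]
    # prognostic is read from disk only at step 0; afterwards the model's
    # own prediction is fed back, so no read is needed.
    if step == 0:
        inputs.insert(0, "prognostic")

    # prognostic and diagnostic fields are targets; dynamic_forcing never is.
    targets = ["prognostic", "diagnostic"] if return_target else []
    return {"input": inputs, "target": targets}


first = fields_read_at_step(0, return_target=True)
later = fields_read_at_step(3, return_target=True)
```

At step 0 both `prognostic` and `dynamic_forcing` are read as inputs; at later steps only `dynamic_forcing` is.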
Sample structure returned by __getitem__:
    {
        "input": {
            "mrms/prognostic/2d/MultiSensor_QPE_01H_Pass2_00.00": tensor,
            "mrms/dynamic_forcing/2d/MultiSensor_QPE_06H_Pass2_00.00": tensor,
            …
        },
        "target": {  # only when return_target=True
            "mrms/prognostic/2d/MultiSensor_QPE_01H_Pass2_00.00": tensor,
            …
        },
        "metadata": {
            "input_datetime": int,   # nanoseconds since epoch
            "target_datetime": int,  # only when return_target=True
        },
    }
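As a rough sketch of how a consumer might read such a sample, the snippet below splits the variable keys and decodes the metadata timestamp. The helper names are hypothetical, placeholder strings stand in for tensors, and the timestamp is assumed to be nanoseconds since the Unix epoch in UTC.

```python
from datetime import datetime, timedelta, timezone

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ns_to_datetime(ns: int) -> datetime:
    """Convert integer nanoseconds since the Unix epoch to a UTC datetime."""
    # datetime resolves microseconds only, so sub-microsecond digits are dropped.
    return _EPOCH + timedelta(microseconds=ns // 1_000)


def split_key(key: str) -> tuple[str, str, str, str]:
    """Split 'mrms/{field_type}/2d/{varname}' into its four parts."""
    source, field_type, dim, varname = key.split("/", 3)
    return source, field_type, dim, varname


def summarise_sample(sample: dict) -> dict:
    """Group input variable names by field type and decode the input time."""
    by_type: dict[str, list[str]] = {}
    for key in sample["input"]:
        _, field_type, _, varname = split_key(key)
        by_type.setdefault(field_type, []).append(varname)
    return {
        "input_vars": by_type,
        "input_time": ns_to_datetime(sample["metadata"]["input_datetime"]),
    }


# Placeholder strings stand in for tensors so the sketch runs without torch.
sample = {
    "input": {
        "mrms/prognostic/2d/MultiSensor_QPE_01H_Pass2_00.00": "tensor",
        "mrms/dynamic_forcing/2d/MultiSensor_QPE_06H_Pass2_00.00": "tensor",
    },
    "metadata": {"input_datetime": 1_717_221_600_000_000_000},
}
summary = summarise_sample(sample)
```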
- All MRMS variables are 2D. Tensor shape (no batch dimension):
(1, 1, lat, lon) — singleton level dim, consistent with ERA5 2D convention
- After DataLoader collation the batch dimension is prepended:
(batch, 1, 1, lat, lon)
- Modes:
  local: load from NetCDF (.nc) or Zarr (.zarr) files on disk using the same filename_time_format strftime convention as ERA5.
  remote: stream directly from AWS S3 (noaa-mrms-pds, anonymous access) via s3fs + pygrib.
- File naming (local mode):
  Controlled by the optional filename_time_format config key. Defaults to "%Y%m%d-%H%M%S" (one file per timestamp).
  Examples:
    filename_time_format: "%Y%m%d-%H%M%S"  # MRMS_20240601-060000.nc
    filename_time_format: "%Y%m%d"         # MRMS_20240601.nc (daily)
    filename_time_format: "%Y%m"           # MRMS_202406.nc (monthly)
  If only a single file matches the glob pattern, filename_time_format is ignored and that file is used for all timestamps.
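To see how each format string maps a timestamp to a filename, the standard-library sketch below reproduces the three examples. The `MRMS_` prefix and `.nc` suffix are taken from those examples. `local_filename` is a hypothetical helper, and the dataset itself may instead parse timestamps out of the globbed filenames.

```python
from datetime import datetime


def local_filename(
    t: datetime,
    filename_time_format: str = "%Y%m%d-%H%M%S",
    prefix: str = "MRMS_",
    suffix: str = ".nc",
) -> str:
    """Build the filename expected to hold timestamp t (illustrative only)."""
    return f"{prefix}{t.strftime(filename_time_format)}{suffix}"


t = datetime(2024, 6, 1, 6, 0, 0)
per_timestamp = local_filename(t)          # MRMS_20240601-060000.nc
daily = local_filename(t, "%Y%m%d")        # MRMS_20240601.nc
monthly = local_filename(t, "%Y%m")        # MRMS_202406.nc
```

Coarser formats make many timestamps resolve to the same file, which is why daily and monthly files work with a sub-daily timestep.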
Attributes#
logger |
VALID_FIELD_TYPES |
_S3_URI |
Classes#
MRMSDataset | PyTorch Dataset for MRMS data with nested input/target structure. |
Functions#
_apply_extent(da, extent) | Subset da to a spatial extent if provided. |
Module Contents#
- credit.datasets.MRMS.logger#
- credit.datasets.MRMS.VALID_FIELD_TYPES#
- credit.datasets.MRMS._S3_URI = 's3://noaa-mrms-pds/{region}/{varname}/{date_str}/MRMS_{varname}_{datetime_str}.grib2.gz'#
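A sketch of how this template might be filled in is shown below. The template string is copied from the attribute above. The strftime formats used for `date_str` and `datetime_str` are assumptions chosen to match the local filename convention, and `build_s3_uri` is a hypothetical helper.

```python
from datetime import datetime

_S3_URI = (
    "s3://noaa-mrms-pds/{region}/{varname}/{date_str}/"
    "MRMS_{varname}_{datetime_str}.grib2.gz"
)


def build_s3_uri(region: str, varname: str, t: datetime) -> str:
    """Fill the URI template for one variable and timestamp."""
    return _S3_URI.format(
        region=region,
        varname=varname,
        date_str=t.strftime("%Y%m%d"),               # assumed format
        datetime_str=t.strftime("%Y%m%d-%H%M%S"),    # assumed format
    )


uri = build_s3_uri("CONUS", "MultiSensor_QPE_01H_Pass2_00.00", datetime(2024, 6, 1, 6))
```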
- credit.datasets.MRMS._apply_extent(da: xarray.DataArray, extent: list[float] | None) → xarray.DataArray#
Subset da to a spatial extent if provided.
- Parameters:
  da – DataArray with lat and lon coordinates (0-360 longitude).
  extent – [min_lon, max_lon, min_lat, max_lat] in either -180 to 180 or 0-360 format; normalised to 0-360 internally. None returns da unchanged.
- Returns:
  Spatially subsetted DataArray, or da unchanged if extent is None.
- class credit.datasets.MRMS.MRMSDataset(config: dict, return_target: bool = False)#
Bases: torch.utils.data.Dataset
PyTorch Dataset for MRMS data with nested input/target structure.
Field types follow ERA5 conventions: prognostic variables appear in both input (at step 0) and target; dynamic_forcing appears in input at every step; diagnostic appears in target only. At step i > 0 the model's own prognostic predictions are fed back, so no disk read occurs for prognostic fields at those steps.
Supports loading directly from AWS S3 (remote mode) or from local NetCDF / Zarr files (local mode). Spatial subsetting via extent is applied at load time on the native MRMS grid.
See module docstring for full description of output format and file naming.
Example YAML configuration (local mode):
    data:
      source:
        MRMS:
          mode: "local"
          variables:
            prognostic:          # input at step 0 + target
              vars_2D:
                - "MultiSensor_QPE_01H_Pass2_00.00"
              path: "/data/MRMS_*.nc"
              filename_time_format: "%Y%m%d-%H%M%S"
            dynamic_forcing:     # input every step
              vars_2D:
                - "MultiSensor_QPE_06H_Pass2_00.00"
              path: "/data/MRMS_*.nc"
              filename_time_format: "%Y%m%d-%H%M%S"
          extent: [-130, -60, 20, 55]  # [min_lon, max_lon, min_lat, max_lat]
      start_datetime: "2024-06-01"
      end_datetime: "2024-07-01"
      timestep: "6h"
      forecast_len: 0
Example YAML configuration (remote mode):
    data:
      source:
        MRMS:
          mode: "remote"
          region: "CONUS"
          variables:
            prognostic:
              vars_2D:
                - "MultiSensor_QPE_01H_Pass2_00.00"
          extent: [-130, -60, 20, 55]
- Assumptions:
  Local files have time, lat, lon dimensions/coordinates.
  Longitude coordinates are in the 0-360 convention (both local and remote).
  extent is specified as [min_lon, max_lon, min_lat, max_lat] in either -180 to 180 or 0-360 format; it is normalised to 0-360 internally.
- source_name: str = 'mrms'#
- return_target: bool = False#
- mode: str#
- region: str#
- extent: list[float] | None#
- static_metadata: dict#
- dt#
- num_forecast_steps: int#
- start_datetime#
- end_datetime#
- datetimes: pandas.DatetimeIndex#
- file_dict: dict[str, list[tuple[pandas.Timestamp, pandas.Timestamp, str]] | None]#
- var_dict: dict[str, dict[str, list[str]]]#
- __len__() → int#
- __getitem__(args: tuple) → dict#
Return a nested input/target sample dict.
Prognostic fields are loaded into input only at step i == 0 (consistent with ERA5 autoregressive rollout semantics). Dynamic forcing is loaded at every step. Diagnostic fields never appear in input.
- Parameters:
  args – (t, i) where t is the current timestamp (nanoseconds or pd.Timestamp) and i is the within-sequence step index produced by the sampler.
- Returns:
  Dict with keys "input", "metadata", and optionally "target" (when return_target=True). Both "input" and "target" are dicts of per-variable tensors keyed by "mrms/{field_type}/2d/{varname}".
- _register_field(field_type: str, d: dict | None) → None#
Validate and register one field type from the config variables block.
- Parameters:
  field_type – One of "prognostic", "diagnostic", "dynamic_forcing".
  d – Field-type config dict, or None / null to disable the field.
- Raises:
  KeyError – If field_type is not a recognised MRMS field type.
  ValueError – If d defines no vars_2D.
- _build_timestamps() → pandas.DatetimeIndex#
Return valid initialisation timestamps for the dataset.
- Returns:
  DatetimeIndex from start_datetime to end_datetime minus the forecast horizon, at the configured timestep frequency.
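The idea of trimming the forecast horizon from the end of the range can be sketched with the standard library alone. The sketch assumes the horizon equals a whole number of timesteps (`horizon_steps`, a hypothetical parameter) and that both endpoints are inclusive; the real method returns a pandas.DatetimeIndex and may define the horizon differently.

```python
from datetime import datetime, timedelta


def build_timestamps(
    start: datetime, end: datetime, timestep: timedelta, horizon_steps: int
) -> list[datetime]:
    """List initialisation times whose full forecast fits before `end`."""
    last_init = end - timestep * horizon_steps  # assumed horizon definition
    out = []
    t = start
    while t <= last_init:  # inclusive end, as an assumption
        out.append(t)
        t += timestep
    return out


start = datetime(2024, 6, 1)
end = datetime(2024, 6, 2)
six_hours = timedelta(hours=6)
no_horizon = build_timestamps(start, end, six_hours, horizon_steps=0)
two_steps = build_timestamps(start, end, six_hours, horizon_steps=2)
```

With a two-step horizon the last valid initialisation time moves 12 hours earlier, so fewer samples are available.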
- _extract_field(field_type: str, t: pandas.Timestamp, sample: dict) → None#
Load all variables for field_type at time t into sample.
Dispatches to local or remote loading based on self.mode.
- Parameters:
  field_type – Registered field type (e.g. "prognostic").
  t – Timestamp to load.
  sample – Dict to write variable tensors into (modified in place). Tensor shape (no batch dimension): (1, 1, lat, lon).
- _load_local_var(field_type: str, vname: str, t: pandas.Timestamp)#
Load a single variable from a local NetCDF or Zarr file.
- Parameters:
field_type – Field type key used to look up file intervals.
vname – Variable name within the dataset.
t – Timestamp to select.
- Returns:
  2-D numpy array (lat, lon) after optional extent subsetting.
- Raises:
  KeyError – If no files are registered for field_type.
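The file lookup implied by `file_dict` (a list of `(start, end, path)` intervals per field type) might work along these lines. This is a hedged sketch: `find_file` is hypothetical, interval bounds are assumed inclusive, the paths are made up, and the `ValueError` for an uncovered timestamp is an assumption, since only the `KeyError` case is documented.

```python
from datetime import datetime


def find_file(file_dict: dict, field_type: str, t: datetime) -> str:
    """Return the path of the file whose time interval covers t."""
    intervals = file_dict.get(field_type)
    if not intervals:
        raise KeyError(f"no files registered for field type {field_type!r}")
    if len(intervals) == 1:
        # A single matching file is used for all timestamps.
        return intervals[0][2]
    for start, end, path in intervals:
        if start <= t <= end:
            return path
    raise ValueError(f"no file covers {t.isoformat()}")  # assumed behaviour


# Hypothetical daily files.
file_dict = {
    "prognostic": [
        (datetime(2024, 6, 1), datetime(2024, 6, 1, 23, 59, 59), "/data/MRMS_20240601.nc"),
        (datetime(2024, 6, 2), datetime(2024, 6, 2, 23, 59, 59), "/data/MRMS_20240602.nc"),
    ],
    "diagnostic": None,
}
path = find_file(file_dict, "prognostic", datetime(2024, 6, 2, 6))
```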
- _load_remote_var(vname: str, t: pandas.Timestamp)#
Stream a single variable from the MRMS S3 bucket.
Imports s3fs and pygrib lazily so they are only required when remote mode is actually used.
- Parameters:
  vname – MRMS variable name (used in the S3 path).
  t – Timestamp to fetch.
- Returns:
  2-D numpy array (lat, lon) after optional extent subsetting.
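The lazy-import pattern mentioned above can be sketched as follows. The helper names `require` and `load_remote_deps` are hypothetical; the point is only that the optional packages are imported inside the function that needs them, so local-mode users never have to install them.

```python
import importlib
from types import ModuleType


def require(module_name: str, purpose: str) -> ModuleType:
    """Import a module on first use, with a clearer error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for {purpose}; install it to use this mode"
        ) from exc


def load_remote_deps() -> tuple[ModuleType, ModuleType]:
    """Import the remote-mode dependencies only when remote mode is used."""
    s3fs = require("s3fs", "MRMS remote mode")
    pygrib = require("pygrib", "MRMS remote mode")
    return s3fs, pygrib


# A standard-library module is used here so the sketch runs anywhere.
json_module = require("json", "demonstration")
```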