credit.datasets.mrms#
MRMSDataset: PyTorch Dataset for MRMS data with nested input/target structure.
Sample structure returned by __getitem__:
{
    "input": {<user_provided_name>: {
        "<user_provided_name>/prognostic/2d/MultiSensor_QPE_01H_Pass2_00.00": tensor,
        "<user_provided_name>/prognostic/2d/MultiSensor_QPE_06H_Pass2_00.00": tensor}},
    "target": {<user_provided_name>: {
        "<user_provided_name>/prognostic/2d/MultiSensor_QPE_01H_Pass2_00.00": tensor,
        "<user_provided_name>/prognostic/2d/MultiSensor_QPE_06H_Pass2_00.00": tensor}},  # only populated when return_target=True
    "metadata": {<user_provided_name>: {"input_datetime": int, "target_datetime": int}},
}
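The nested layout above can be navigated with plain dictionary access. The following is a minimal sketch that builds a stand-in sample (strings in place of tensors) and flattens it. The helpers `make_key` and `flatten_sample` and the source name `Example_MRMS` are hypothetical and are not part of the CREDIT API.

```python
def make_key(source: str, field_type: str, var: str) -> str:
    """Build a variable key in the '<source>/<field_type>/2d/<var>' layout."""
    return f"{source}/{field_type}/2d/{var}"


def flatten_sample(sample: dict) -> dict:
    """Collapse {'input': {source: {key: value}}} into {('input', key): value}."""
    flat = {}
    for split in ("input", "target"):
        # .get() tolerates a missing or empty "target" entry.
        for _source, variables in sample.get(split, {}).items():
            for key, value in variables.items():
                flat[(split, key)] = value
    return flat


source = "Example_MRMS"  # hypothetical user-provided name
key = make_key(source, "prognostic", "MultiSensor_QPE_01H_Pass2_00.00")
sample = {
    "input": {source: {key: "tensor(1, 1, lat, lon)"}},  # placeholder, not a real tensor
    "target": {},  # left empty here; populated only when return_target=True
    "metadata": {source: {"input_datetime": 0, "target_datetime": 0}},
}
flat = flatten_sample(sample)
```

Flattening like this is only a convenience for inspection; the dataset itself returns the nested form.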
- All MRMS variables are 2D. Tensor shape (no batch dimension):
(1, 1, lat, lon) — singleton level dim, consistent with ERA5 2D convention
- After DataLoader collation the batch dimension is prepended:
(batch, 1, 1, lat, lon)
- Modes:
  - local: load from NetCDF (.nc) or Zarr (.zarr) files on disk using the same filename_time_format strftime convention as ERA5.
  - remote: stream directly from AWS S3 (noaa-mrms-pds, anonymous access) via s3fs + pygrib.
- File naming (local mode):
  Controlled by the optional filename_time_format config key. Defaults to "%Y%m%d-%H%M%S" (one file per timestamp).
  Examples:
    filename_time_format: "%Y%m%d-%H%M%S"  # MRMS_20240601-060000.nc
    filename_time_format: "%Y%m%d"         # MRMS_20240601.nc (daily)
    filename_time_format: "%Y%m"           # MRMS_202406.nc (monthly)
  If only a single file matches the glob pattern, filename_time_format is ignored and that file is used for all timestamps.
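To illustrate how a strftime pattern translates into the file names listed above, here is a minimal sketch using only the standard library. The helper `expected_filename` and its `prefix`/`suffix` defaults are assumptions taken from the examples; how the dataset actually matches files to timestamps is not shown in this page.

```python
from datetime import datetime


def expected_filename(t: datetime, time_format: str,
                      prefix: str = "MRMS_", suffix: str = ".nc") -> str:
    """Render the file name that would hold timestamp t for a given pattern."""
    return f"{prefix}{t.strftime(time_format)}{suffix}"


t = datetime(2024, 6, 1, 6, 0, 0)
per_timestamp = expected_filename(t, "%Y%m%d-%H%M%S")  # one file per timestamp
daily = expected_filename(t, "%Y%m%d")                 # one file per day
monthly = expected_filename(t, "%Y%m")                 # one file per month
```

A coarser pattern means many timestamps resolve to the same file, which is why daily and monthly files must carry a time dimension.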
Classes#
MRMSDataset | PyTorch Dataset for MRMS data with nested input/target structure.
Functions#
_apply_extent(da, extent) | Subset da to a spatial extent if provided.
Module Contents#
- credit.datasets.mrms._apply_extent(da: xarray.DataArray, extent: list[float] | None) -> xarray.DataArray#
Subset da to a spatial extent if provided.
- Parameters:
  da – DataArray with lat and lon coordinates (0–360 longitude).
  extent – [min_lon, max_lon, min_lat, max_lat] in either -180–180 or 0–360 format; normalised to 0–360 internally. None returns da unchanged.
- Returns:
  Spatially subsetted DataArray, or da unchanged if extent is None.
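The longitude normalisation described above can be sketched with a modulo operation. This is an illustration only: `normalise_extent` and `in_extent` are hypothetical helpers, and the real `_apply_extent` operates on an xarray.DataArray rather than on individual points. Edge cases such as a max_lon of exactly 360 (which maps to 0) or an extent crossing the prime meridian are not handled here, and this page does not say how the real function treats them.

```python
from typing import Optional


def normalise_extent(extent: Optional[list]) -> Optional[list]:
    """Map [min_lon, max_lon, min_lat, max_lat] longitudes into the 0-360 range."""
    if extent is None:
        return None  # mirrors "None returns da unchanged"
    min_lon, max_lon, min_lat, max_lat = extent
    # Python's % returns a non-negative result for a positive modulus.
    return [min_lon % 360, max_lon % 360, min_lat, max_lat]


def in_extent(lon: float, lat: float, extent: list) -> bool:
    """Check whether a 0-360 longitude / latitude point lies inside the extent."""
    min_lon, max_lon, min_lat, max_lat = extent
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


conus = normalise_extent([-130, -60, 20, 55])
already_0_360 = normalise_extent([230, 300, 20, 55])
```

Both input conventions resolve to the same internal extent, so a config can use whichever is more convenient.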
- class credit.datasets.mrms.MRMSDataset(data_config: dict[str, Any], return_target: bool = False)#
Bases: credit.datasets.base_dataset.BaseDataset
PyTorch Dataset for MRMS data with nested input/target structure.
Field types follow CREDIT Gen2 conventions: prognostic variables appear in both input (at step 0) and target; dynamic_forcing appears in input at every step; diagnostic appears in target only. At step i > 0 the model's own prognostic predictions are fed back, so no disk read occurs for prognostic fields at those steps.
Supports loading directly from AWS S3 (remote mode) or from local NetCDF / Zarr files (local mode). Spatial subsetting via extent is applied at load time on the native MRMS grid.
See module docstring for full description of output format and file naming.
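The field-type routing described above can be summarised in a few lines. This is a simplified sketch of the stated convention, not the class's actual implementation; `input_fields_from_disk` and `TARGET_FIELDS` are hypothetical names.

```python
# Field types that contribute to the target, per the convention above.
TARGET_FIELDS = ["prognostic", "diagnostic"]


def input_fields_from_disk(step: int) -> list:
    """Field types that must be read from disk to build the input at a rollout step."""
    if step == 0:
        # Initial condition: prognostic state plus forcing both come from disk.
        return ["prognostic", "dynamic_forcing"]
    # Later steps: prognostic input is the model's own prediction, so only
    # forcing is read from disk.
    return ["dynamic_forcing"]


first_step = input_fields_from_disk(0)
later_step = input_fields_from_disk(3)
```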
Example YAML configuration (local mode):
    data:
      source:
        Example_MRMS:  # User-provided name (arbitrary key)
          dataset_type: "mrms"
          mode: "local"
          variables:
            prognostic:  # input at step 0 + target
              vars_2D:
                - "MultiSensor_QPE_01H_Pass2_00.00"
              path: "/data/MRMS_*.nc"
              filename_time_format: "%Y%m%d-%H%M%S"
            dynamic_forcing:  # input every step
              vars_2D:
                - "MultiSensor_QPE_06H_Pass2_00.00"
              path: "/data/MRMS_*.nc"
              filename_time_format: "%Y%m%d-%H%M%S"
          extent: [-130, -60, 20, 55]  # [min_lon, max_lon, min_lat, max_lat]
      start_datetime: "2024-06-01"
      end_datetime: "2024-07-01"
      timestep: "6h"
      forecast_len: 0
Example YAML configuration (remote mode):
    data:
      source:
        Example_MRMS:  # User-provided name (arbitrary key)
          dataset_type: "mrms"
          mode: "remote"
          region: "CONUS"
          variables:
            prognostic:
              vars_2D:
                - "MultiSensor_QPE_01H_Pass2_00.00"
          extent: [-130, -60, 20, 55]
- Assumptions:
  - Local files have time, lat, lon dimensions/coordinates.
  - Longitude coordinates are in the 0–360 convention (both local and remote).
  - extent is specified as [min_lon, max_lon, min_lat, max_lat] in either -180–180 or 0–360 format; it is normalised to 0–360 internally.
- dataset_type: str = 'mrms'#
- region: str#
- extent: list[float] | None#
- static_metadata: dict#
- _fs = None#
- _get_file_source(field_config: dict[str, Any]) -> list[tuple[pandas.Timestamp, pandas.Timestamp, str]] | bool | None#
Return the file source for a field. Override in subclasses for different modes/backends.
- Parameters:
  field_config (dict[str, Any]) – Validated field-type config dict.
- Raises:
  ValueError – If self.mode is not a recognised mode.
- Returns:
  Depending on the mode and field type, this method may return a list of (start_time, end_time, file_path) tuples produced by _map_files, a boolean indicating the presence of the field (e.g., for remote data), or None if the field is disabled. The expected return type should be consistent within a dataset class.
- Return type:
  list[tuple[pd.Timestamp, pd.Timestamp, str]] | bool | None
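A list of (start_time, end_time, file_path) tuples lends itself to a simple interval lookup when resolving which file holds a given timestamp. The sketch below uses `datetime` in place of `pandas.Timestamp` and assumes both interval ends are inclusive; the helper `find_file`, the sample paths, and the inclusive-end choice are assumptions, since the behaviour of `_map_files` is not documented here.

```python
from datetime import datetime
from typing import Optional

Interval = tuple[datetime, datetime, str]  # (start_time, end_time, file_path)


def find_file(intervals: list[Interval], t: datetime) -> Optional[str]:
    """Return the path of the first interval containing t, or None."""
    for start, end, path in intervals:
        if start <= t <= end:  # assumption: both ends inclusive
            return path
    return None


# Hypothetical daily files, mirroring the MRMS_%Y%m%d.nc naming example.
intervals = [
    (datetime(2024, 6, 1), datetime(2024, 6, 1, 23, 59, 59), "/data/MRMS_20240601.nc"),
    (datetime(2024, 6, 2), datetime(2024, 6, 2, 23, 59, 59), "/data/MRMS_20240602.nc"),
]
path = find_file(intervals, datetime(2024, 6, 2, 6))
missing = find_file(intervals, datetime(2024, 7, 1))
```

A linear scan is adequate for a handful of files; a sorted list with bisection would be a natural choice for long archives.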
- _extract_field(field_type: str, t: pandas.Timestamp, sample: dict) -> None#
Load all 2-D variables for field_type at time t into sample.
Dispatches to local or remote loading based on self.mode.
- Parameters:
  field_type – Registered field type (e.g. "prognostic").
  t – Timestamp to load.
  sample – Dict to write variable tensors into (modified in place). Tensor shape (no batch dimension): (1, 1, lat, lon).
- _load_local_var(field_type: str, vname: str, t: pandas.Timestamp)#
Load a single variable from a local NetCDF or Zarr file.
- Parameters:
  field_type – Field type key used to look up file intervals.
  vname – Variable name within the dataset.
  t – Timestamp to select.
- Returns:
  2-D numpy array (lat, lon) after optional extent subsetting.
- Raises:
  KeyError – If no files are registered for field_type.
- _load_remote_var(vname: str, t: pandas.Timestamp)#
Stream a single variable from the MRMS S3 bucket.
Imports s3fs and pygrib lazily so they are only required when remote mode is actually used.
- Parameters:
  vname – MRMS variable name (used in the S3 path).
  t – Timestamp to fetch.
- Returns:
  2-D numpy array (lat, lon) after optional extent subsetting.
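The lazy-import pattern mentioned above defers an optional dependency until the code path that needs it runs. Here is a generic sketch using `importlib`, demonstrated with the standard-library `json` module so it runs anywhere. The helper `lazy_import` is hypothetical; the real method may simply place `import s3fs` and `import pygrib` inside the function body.

```python
import importlib


def lazy_import(module_name: str, purpose: str):
    """Import a module at call time, with a clearer error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for {purpose}; install it to use this mode."
        ) from exc


# Stand-in for an optional dependency; the import happens here, not at module load.
json_mod = lazy_import("json", "demonstration")
encoded = json_mod.dumps({"mode": "remote"})
```

With this approach, users who only work in local mode never need the remote-only packages installed.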