credit.datasets.goes
GOESDataset: PyTorch Dataset for GOES data with nested input/target structure.
Sample structure returned by __getitem__:
    {
        "input": {<user_provided_name>: {"<user_provided_name>/prognostic/2d/CMI_C04": tensor,
                                         "<user_provided_name>/prognostic/2d/CMI_C07": tensor}},
        "target": {<user_provided_name>: {"<user_provided_name>/prognostic/2d/CMI_C04": tensor,
                                          "<user_provided_name>/prognostic/2d/CMI_C07": tensor}},  # only populated when return_target=True
        "metadata": {<user_provided_name>: {"input_datetime": int, "target_datetime": int}},
    }
- All GOES variables are 2D. Tensor shape (no batch dimension):
(1, 1, lat, lon) — singleton level dim, consistent with CREDIT Gen2 2D convention
- After DataLoader collation the batch dimension is prepended:
(batch, 1, 1, lat, lon)
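The shape convention above can be illustrated with a minimal sketch. It uses NumPy arrays as stand-ins for tensors, and `np.stack` as a stand-in for the stacking that a DataLoader's default collation performs; the grid size, the `make_sample` helper, and the placeholder metadata values are assumptions made only for illustration.

```python
import numpy as np

LAT, LON = 4, 6            # tiny illustrative grid, not a real GOES grid size
SOURCE = "Example_GOES"    # user-provided source name (arbitrary key)
KEY = f"{SOURCE}/prognostic/2d/CMI_C04"


def make_sample() -> dict:
    """Hypothetical helper: one sample in the documented nested layout."""
    field = np.zeros((1, 1, LAT, LON), dtype=np.float32)  # no batch dimension
    return {
        "input": {SOURCE: {KEY: field}},
        "metadata": {SOURCE: {"input_datetime": 0, "target_datetime": 0}},
    }


samples = [make_sample() for _ in range(3)]

# Stacking along a new leading axis mimics what collation does per key.
batched = np.stack([s["input"][SOURCE][KEY] for s in samples])
print(batched.shape)  # (3, 1, 1, 4, 6)
```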
Classes
GOESDataset | PyTorch Dataset for GOES-R ABI Level-2 (L2) satellite imagery.
Functions
_build_spatial_slices | Compute row (latitude) and column (longitude) slices that bound a geographic extent on a 2-D grid.
_find_nearest_latlon | Find the 2-D grid indices of the point nearest to a target lat/lon using Haversine distance.
Module Contents
- credit.datasets.goes._build_spatial_slices(extent: list[int] | None, lat2d: numpy.ndarray | None = None, lon2d: numpy.ndarray | None = None) -> tuple[slice, slice]
Compute row (latitude) and column (longitude) slices that bound a geographic extent on a 2-D grid.
Given an optional bounding box in geographic coordinates, returns a pair of slice objects that can be passed directly to xarray.Dataset.isel (or used for NumPy basic indexing) to crop a 2-D field to the requested region.
- Parameters:
extent – Bounding box as [lon_min, lon_max, lat_min, lat_max] in decimal degrees. Pass None to select the entire grid (both slices become slice(None)). Valid range for longitude: [-180, 180]; for latitude: [-90, 90].
lat2d – 2-D array of latitudes (degrees) with the same shape as the target grid. Required when extent is not None.
lon2d – 2-D array of longitudes (degrees) with the same shape as the target grid. Required when extent is not None.
- Returns:
A (y_slice, x_slice) tuple where y_slice indexes rows (latitude axis) and x_slice indexes columns (longitude axis) of the 2-D grid.
- Raises:
ValueError – If extent is a list but lat2d or lon2d is None.
TypeError – If extent is neither None nor a list.
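One way such bounding slices could be derived is sketched below. This is not the module's implementation: `bounding_slices` is a hypothetical helper that takes the smallest row and column ranges containing every grid point inside the extent, and the regular demo grid stands in for the curvilinear GOES grid.

```python
import numpy as np


def bounding_slices(extent, lat2d, lon2d):
    """Hypothetical helper: smallest (y_slice, x_slice) covering all points inside extent."""
    if extent is None:
        return slice(None), slice(None)
    lon_min, lon_max, lat_min, lat_max = extent
    inside = (
        (lon2d >= lon_min) & (lon2d <= lon_max)
        & (lat2d >= lat_min) & (lat2d <= lat_max)
    )
    rows = np.where(inside.any(axis=1))[0]  # rows holding at least one point inside
    cols = np.where(inside.any(axis=0))[0]  # columns holding at least one point inside
    if rows.size == 0:
        raise ValueError("extent does not overlap the grid")
    return (
        slice(int(rows[0]), int(rows[-1]) + 1),
        slice(int(cols[0]), int(cols[-1]) + 1),
    )


# Demo grid: 10-degree spacing, rows running north to south.
lon2d, lat2d = np.meshgrid(np.arange(-140.0, -49.0, 10.0), np.arange(60.0, 9.0, -10.0))
y_slice, x_slice = bounding_slices([-130, -60, 20, 55], lat2d, lon2d)
cropped = lat2d[y_slice, x_slice]
print(y_slice, x_slice, cropped.shape)
```

On a curvilinear grid the resulting rectangle generally contains some points outside the requested box, since the slices bound the region rather than mask it.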
- credit.datasets.goes._find_nearest_latlon(lat2d: numpy.ndarray, lon2d: numpy.ndarray, lat_target: float, lon_target: float) -> tuple[int, int]
Find the 2-D grid indices of the point nearest to a target lat/lon using Haversine distance.
- Parameters:
lat2d – 2-D array of latitudes in decimal degrees.
lon2d – 2-D array of longitudes in decimal degrees.
lat_target – Target latitude in decimal degrees. Valid range for latitude: [-90, 90].
lon_target – Target longitude in decimal degrees. Valid range for longitude: [-180, 180].
- Returns:
A (i, j) tuple of the row and column indices of the nearest grid point.
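A Haversine nearest-point search of this kind can be sketched as follows. The function name `nearest_index`, the Earth radius constant, and the use of `np.nanargmin` to skip NaN grid cells are assumptions for illustration, not details confirmed by the module.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def nearest_index(lat2d, lon2d, lat_target, lon_target):
    """Hypothetical helper: (row, col) of the grid point closest to the target."""
    lat1, lon1 = np.radians(lat2d), np.radians(lon2d)
    lat2, lon2 = np.radians(lat_target), np.radians(lon_target)
    # Haversine formula for great-circle distance
    a = (
        np.sin((lat2 - lat1) / 2.0) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2
    )
    dist = 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
    # nanargmin ignores NaN cells; unravel_index turns the flat index into (row, col)
    i, j = np.unravel_index(np.nanargmin(dist), dist.shape)
    return int(i), int(j)


# Demo grid: 10-degree spacing, rows running north to south.
lon2d, lat2d = np.meshgrid(np.arange(-140.0, -49.0, 10.0), np.arange(60.0, 9.0, -10.0))
i, j = nearest_index(lat2d, lon2d, lat_target=39.7, lon_target=-104.9)
print(i, j, lat2d[i, j], lon2d[i, j])
```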
- class credit.datasets.goes.GOESDataset(data_config: dict[str, Any], return_target: bool = False)
Bases: credit.datasets.base_dataset.BaseDataset
PyTorch Dataset for GOES-R ABI Level-2 (L2) satellite imagery.
Field types follow CREDIT Gen2 conventions: prognostic variables appear in both input (at step 0) and target; dynamic_forcing appears in input at every step; diagnostic appears in target only. At step i > 0 the model's own prognostic predictions are fed back, so no disk read occurs for prognostic fields at those steps.
Supports loading directly from AWS S3 (remote mode) or from local NetCDF files (local mode). Spatial subsetting via extent is applied at load time on the curvilinear GOES grid.
See module docstring for full description of output format and file naming.
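The field-type rules above can be summarized in a small sketch. `fields_for_step` is a hypothetical helper, not part of the class; it only encodes which field types the description says are placed in the input and target at a given rollout step.

```python
def fields_for_step(step: int, return_target: bool = False) -> dict[str, list[str]]:
    """Hypothetical helper: field types loaded for the input and target at a rollout step."""
    inputs = ["dynamic_forcing"]          # forcing is supplied at every step
    if step == 0:
        inputs.insert(0, "prognostic")    # prognostic input is read only at step 0
    # Later steps reuse the model's own prognostic predictions instead.
    targets = ["prognostic", "diagnostic"] if return_target else []
    return {"input": inputs, "target": targets}


print(fields_for_step(0))
print(fields_for_step(2, return_target=True))
```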
Example YAML configuration (local mode):
    data:
      source:
        Example_GOES:  # User-provided name (arbitrary key)
          dataset_type: "goes"
          goes_position: "east"  # or "west"
          mode: "local"
          product: "ABI-L2-MCMIPC"
          variables:
            prognostic:
              vars_2D: ["CMI_C04", "CMI_C07", "CMI_C08", "CMI_C09", "CMI_C10", "CMI_C13"]
              path: "/glade/derecho/scratch/kevinyang/datasets/goes/"
            diagnostic: null
            dynamic_forcing: null
          latlon2d_dir: "/glade/derecho/scratch/kevinyang/datasets/goes/"
          extent: [-130, -60, 20, 55]
      start_datetime: "2021-06-01"
      end_datetime: "2021-06-04"
      timestep: "6h"
      forecast_len: 0
Example YAML configuration (remote mode):
    data:
      source:
        Example_GOES:  # User-provided name (arbitrary key)
          dataset_type: "goes"
          goes_position: "east"  # or "west"
          mode: "remote"
          product: "ABI-L2-MCMIPC"
          variables:
            prognostic:
              vars_2D: ["CMI_C04", "CMI_C07", "CMI_C08", "CMI_C09", "CMI_C10", "CMI_C13"]
            diagnostic: null
            dynamic_forcing: null
          latlon2d_dir: "/glade/derecho/scratch/kevinyang/datasets/goes/"
          extent: [-130, -60, 20, 55]
- Parameters:
config – Top-level experiment configuration dictionary. The relevant sub-keys are:
    config["source"]["Example_GOES"]: user-provided source name, containing:
        dataset_type (str): Must be "goes" to select this dataset class.
        goes_position (str): Satellite position. One of "east", "west". Defaults to "east".
        mode (str): "local" or "remote" (S3). Defaults to "local".
        product (str): ABI product string, e.g. "ABI-L2-MCMIPC".
        extent (list, optional): Bounding box [lon_min, lon_max, lat_min, lat_max] used to spatially crop each field.
        latlon2d_dir (str): Directory containing pre-computed lat/lon grid NetCDF files.
        qc_path (str, optional): Path to a Parquet QC table. Timestamps whose file is missing or that fail QC are replaced with None in the file map.
        variables (dict): Mapping of field type to variable spec.
    config["timestep"] (str): Model timestep as a pandas.Timedelta-parseable string (e.g. "1h").
    config["forecast_len"] (int): Number of autoregressive forecast steps.
    config["start_datetime"] (str): Start of the data range.
    config["end_datetime"] (str): End of the data range.
return_target – When True, the sample also contains a "target" key populated with prognostic and diagnostic fields at t + dt. Defaults to False.
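To show how the time-related keys fit together, here is a rough sketch that enumerates candidate timestamps from a configuration dictionary using only the standard library. The `parse_hours` and `candidate_times` helpers are hypothetical, the parser handles whole-hour strings only (the class itself accepts any pandas.Timedelta-parseable string), and treating the end of the range as inclusive is an assumption.

```python
from datetime import datetime, timedelta

# Minimal config with the same key layout as the parameter description.
config = {
    "source": {"Example_GOES": {"dataset_type": "goes", "mode": "local"}},
    "start_datetime": "2021-06-01",
    "end_datetime": "2021-06-04",
    "timestep": "6h",
    "forecast_len": 0,
}


def parse_hours(step: str) -> timedelta:
    """Hypothetical parser for whole-hour strings such as '6h'."""
    if not step.endswith("h"):
        raise ValueError(f"only whole-hour timesteps are handled here: {step!r}")
    return timedelta(hours=int(step[:-1]))


def candidate_times(cfg: dict) -> list:
    """Hypothetical helper: timestamps from start to end (inclusive, by assumption)."""
    start = datetime.fromisoformat(cfg["start_datetime"])
    end = datetime.fromisoformat(cfg["end_datetime"])
    step = parse_hours(cfg["timestep"])
    times = []
    t = start
    while t <= end:
        times.append(t)
        t += step
    return times


times = candidate_times(config)
print(len(times), times[0], times[-1])
```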
- datetimes
Valid input times for which samples can be fetched.
- Type: pd.DatetimeIndex
- file_dict
Maps each field type to a list of (period_start, period_end, file_path) tuples built during initialization.
- Type: dict
- var_dict
Maps each field type to {"vars_2D": [<variable names>]}.
- Type: dict
- y_slice
Row crop derived from extent (or slice(None) for the full grid).
- Type: slice
- x_slice
Column crop derived from extent (or slice(None) for the full grid).
- Type: slice
- Raises:
FileNotFoundError – If the lat/lon grid NetCDF cannot be found under latlon2d_dir.
- dataset_type = 'goes'
- goes_position: str
- product: str
- static_metadata: dict[str, Any]
- qc_path: str
- tolerance
- _fs = None
- latlon2d_dir: str
- extent
- _collect_GOES_file_path(base_dir: str = '', verbose: bool = False)
Build a time-ordered file map for the dataset's datetime range.
For each requested timestamp the method lists the appropriate S3 or local hourly directory, parses GOES L2 filenames, and associates each timestamp with the nearest file within tolerance (default 3 minutes). QC filtering is applied automatically when self.qc_path is set, masking bad intervals by setting their path entry to None.
- Parameters:
base_dir – Root directory prepended to relative paths when mode is "local". Ignored for remote mode.
verbose – When True, print a warning for each hour directory that cannot be listed (missing data, permission errors, etc.).
- Returns:
A list of (period_start, period_end, file_path) tuples, one per timestamp in datetimes. file_path is "NONE" when no file was found within tolerance, or None when the interval was masked by the QC table at self.qc_path.
- Raises:
FileNotFoundError – If GOES L2 files are not found for the requested datetime.
ValueError – If the GOES L2 filenames do not match the expected naming convention (fewer than 6 underscore-separated tokens).
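The filename parsing and nearest-file matching described above might look roughly like the sketch below. It assumes the standard GOES-R naming pattern, in which the fourth underscore-separated token is the scan start time written as `s` + year + day-of-year + HHMMSS + tenths of a second. The helpers `parse_start_time` and `match_nearest` and the example filenames are illustrative only; directory listing and QC masking are left out.

```python
from datetime import datetime, timedelta

NO_FILE = "NONE"  # sentinel mirroring the documented "no file within tolerance" value


def parse_start_time(filename: str) -> datetime:
    """Hypothetical helper: scan start time from a GOES L2 filename."""
    tokens = filename.split("_")
    if len(tokens) < 6:
        raise ValueError(f"unexpected GOES filename: {filename}")
    # tokens[3] looks like 's20211520001172'; drop the 's' and the trailing tenths digit
    return datetime.strptime(tokens[3][1:14], "%Y%j%H%M%S")


def match_nearest(targets, filenames, tolerance=timedelta(minutes=3)):
    """Hypothetical helper: pair each target time with the nearest file within tolerance."""
    starts = [(parse_start_time(f), f) for f in filenames]
    matched = []
    for t in targets:
        best = min(starts, key=lambda pair: abs(pair[0] - t), default=None)
        if best is not None and abs(best[0] - t) <= tolerance:
            matched.append((t, best[1]))
        else:
            matched.append((t, NO_FILE))
    return matched


files = [
    "OR_ABI-L2-MCMIPC-M6_G16_s20211520001172_e20211520003545_c20211520004046.nc",
    "OR_ABI-L2-MCMIPC-M6_G16_s20211520006172_e20211520008545_c20211520009046.nc",
]
targets = [datetime(2021, 6, 1, 0, 0), datetime(2021, 6, 1, 6, 0)]
matched = match_nearest(targets, files)
for t, f in matched:
    print(t, f)
```

The first target matches the file whose scan starts 77 seconds later; the second has no file within 3 minutes and receives the sentinel.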
- _get_file_source(field_config: dict[str, Any]) -> list[tuple[pandas.Timestamp, pandas.Timestamp, str]] | bool | None
Return the file source for a field. Override in subclasses for different modes/backends.
- Parameters:
field_config (dict[str, Any]) – Validated field-type config dict.
- Raises:
ValueError – If self.mode is not a recognised mode.
- Returns:
Depending on the mode and field type, this method may return a list of (start_time, end_time, file_path) tuples produced by _map_files, a boolean indicating the presence of the field (e.g., for remote data), or None if the field is disabled. The expected return type should be consistent within a dataset class.
- Return type:
list[tuple[pd.Timestamp, pd.Timestamp, str]] | bool | None
- _extract_field(field_type: str, t: pandas.Timestamp, sample: dict) -> None
Load all 2-D variables for a field type at time t into sample.
Dispatches to _load_local_var or _load_remote_var depending on mode, then stores each variable as a torch.Tensor of shape (1, 1, ny, nx) under the key "{source_name}/{field_type}/2d/{vname}" in sample. Does nothing if the field type has no registered variables.
- Parameters:
field_type – One of "prognostic", "diagnostic", or "dynamic_forcing".
t – Timestamp for which to load data.
sample – Output dictionary that is updated in-place.
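The key format and the two leading singleton axes can be sketched with NumPy. `store_fields` is a hypothetical stand-in for the storing step only: it takes already-loaded 2-D arrays and keeps them as NumPy arrays, whereas the method described above stores torch.Tensor objects (a conversion such as `torch.from_numpy` would be needed for that).

```python
import numpy as np


def store_fields(sample: dict, source_name: str, field_type: str, arrays: dict) -> None:
    """Hypothetical helper: add each (ny, nx) array to sample with shape (1, 1, ny, nx)."""
    for vname, arr in arrays.items():
        key = f"{source_name}/{field_type}/2d/{vname}"
        sample[key] = arr[np.newaxis, np.newaxis, ...]  # prepend two singleton axes


sample = {}
loaded = {"CMI_C04": np.ones((5, 7), dtype=np.float32)}  # stand-in for a loader's output
store_fields(sample, "Example_GOES", "prognostic", loaded)
print({k: v.shape for k, v in sample.items()})
```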
- _load_local_var(field_type: str, vnames: list[str], t: pandas.Timestamp)
Load variables from a local NetCDF file and apply spatial cropping.
- Parameters:
field_type – Field type used to look up the file map in file_dict.
vnames – Variable names to extract from the dataset.
t – Timestamp used to locate the correct file via _find_file.
- Returns:
A dict mapping each variable name to its cropped numpy.ndarray.
- Raises:
KeyError – If no files are registered for field_type.
- _load_remote_var(field_type: str, vnames: list[str], t: pandas.Timestamp)
Load variables from a remote S3 NetCDF file and apply spatial cropping.
Uses the cached _fs S3FileSystem to open the file as a byte stream and reads it with the h5netcdf engine.
- Parameters:
field_type – Field type used to look up the file map in file_dict.
vnames – Variable names to extract from the dataset.
t – Timestamp used to locate the correct file via _find_file.
- Returns:
A dict mapping each variable name to its cropped numpy.ndarray.
- Raises:
KeyError – If no files are registered for field_type.