credit.datasets.era5
====================

.. py:module:: credit.datasets.era5

.. autoapi-nested-parse::

   era5.py
   -------
   ARCOERA5Dataset: PyTorch Dataset for streaming ERA5 data from the public
   ARCO ERA5 Zarr store on Google Cloud.

   WeatherBench2ERA5Dataset: PyTorch Dataset for WeatherBench2 ERA5 data on
   Google Cloud Storage.

   Sample structure returned by __getitem__::

       {
           "input": {
               "{source_name}/prognostic/3d/temperature":              tensor,  # (n_levels, 1, lat, lon)
               "{source_name}/prognostic/2d/surface_pressure":         tensor,  # (1,        1, lat, lon)
               "{source_name}/dynamic_forcing/2d/toa_incident_solar_radiation": tensor,
               "{source_name}/static/2d/land_sea_mask":                tensor,
               ...
           },
           "target": {                                  # only when return_target=True
               "{source_name}/prognostic/3d/temperature":      tensor,
               "{source_name}/prognostic/2d/surface_pressure": tensor,
               ...
           },
           "metadata": {
               "input_datetime":  int,                  # nanoseconds since epoch
               "target_datetime": int,                  # only when return_target=True
           },
       }
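
   The ``metadata`` timestamps are plain integers, so they usually need
   converting before they can be printed or compared. The following is a
   minimal sketch using only the standard library. The ``metadata`` dict is a
   hand-built stand-in for a real sample, the helper name ``ns_to_datetime`` is
   hypothetical, and the values are assumed to be in UTC.

   .. code-block:: python

      from datetime import datetime, timedelta, timezone

      EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


      def ns_to_datetime(ns: int) -> datetime:
          """Convert integer nanoseconds since the Unix epoch to a UTC datetime.

          ``datetime`` resolves microseconds only, so finer digits are dropped.
          """
          return EPOCH + timedelta(microseconds=ns // 1_000)


      # Stand-in for sample["metadata"]; the values are illustrative only.
      metadata = {
          "input_datetime": 1_483_228_800_000_000_000,   # 2017-01-01T00:00:00Z
          "target_datetime": 1_483_250_400_000_000_000,  # 2017-01-01T06:00:00Z
      }

      input_time = ns_to_datetime(metadata["input_datetime"])
      target_time = ns_to_datetime(metadata["target_datetime"])
      lead_time = target_time - input_time  # time between input and target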

   Output key format (flat, slash-delimited)::

       "{source_name}/{field_type}/{dim}/{varname}"

       field_type: "prognostic" | "dynamic_forcing" | "static" | "diagnostic"
       dim       : "2d"  (surface / single-level)
                   "3d"  (multi-level; level_coord = "level" or "hybrid")
       varname   : variable name as in the ARCO ERA5 Zarr store
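
   Because the keys are flat strings, downstream code often has to split them
   back into their components. A minimal sketch of one way to do that follows;
   ``FieldKey`` and ``parse_key`` are hypothetical helpers and are not part of
   this module. It assumes that variable names never contain a slash.

   .. code-block:: python

      from typing import NamedTuple

      FIELD_TYPES = {"prognostic", "dynamic_forcing", "static", "diagnostic"}
      DIMS = {"2d", "3d"}


      class FieldKey(NamedTuple):
          source_name: str
          field_type: str
          dim: str
          varname: str


      def parse_key(key: str) -> FieldKey:
          """Split a flat output key into its four components."""
          # Split from the right so a source_name containing "/" stays intact.
          source_name, field_type, dim, varname = key.rsplit("/", 3)
          if field_type not in FIELD_TYPES:
              raise ValueError(f"unknown field_type {field_type!r} in {key!r}")
          if dim not in DIMS:
              raise ValueError(f"unknown dim {dim!r} in {key!r}")
          return FieldKey(source_name, field_type, dim, varname)


      parsed = parse_key("Example_ARCOERA5/prognostic/3d/temperature")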



Attributes
----------

.. autoapisummary::

   credit.datasets.era5._WB2_ERA5_BASE
   credit.datasets.era5._WB2_ERA5_STORE_PATHS
   credit.datasets.era5._WB2_ERA5_DEFAULT_LEVELS


Classes
-------

.. autoapisummary::

   credit.datasets.era5.ARCOERA5Dataset
   credit.datasets.era5.WeatherBench2ERA5Dataset


Module Contents
---------------

.. py:class:: ARCOERA5Dataset(data_config: dict[str, Any], return_target: bool = False)

   Bases: :py:obj:`credit.datasets.base_dataset.BaseDataset`


   PyTorch Dataset for Google Cloud ARCO ERA5 data with nested input/target structure.

   See the module docstring for a full description of the output format.

   Example YAML configuration::

       data:
         source:
           Example_ARCOERA5:  # User-provided name (arbitrary key)
             dataset_type: "arco_era5"
             level_coord: "hybrid"
             levels: [10, 30, 40, 50, 60, 70, 80, 90, 95, 100, 105, 110, 120, 130, 136, 137]
             variables:
               prognostic:
                 vars_3D: ["temperature", "u_component_of_wind", "v_component_of_wind", "specific_humidity"]
                 vars_2D: ["surface_pressure"]
               dynamic_forcing:
                 vars_2D: ["toa_incident_solar_radiation"]
               static:
                 vars_2D: ["land_sea_mask"]
               diagnostic:
                 vars_2D: ["total_precipitation"]

         start_datetime: "2017-01-01"
         end_datetime: "2019-12-31"
         timestep: "6h"
         forecast_len: 1

   Assumptions:
       1. A "time" dimension / coordinate is present for non-static fields.
       2. A level coordinate (name given by ``level_coord``) represents the
          vertical axis of 3D variables.
       3. Dimension order: (time, level, latitude, longitude) for 3D;
          (time, latitude, longitude) for 2D; (latitude, longitude) for static.
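
   To make the link between the configuration and the output keys concrete,
   the sketch below lists the keys implied by a ``variables`` block. The helper
   ``expected_keys`` is hypothetical, and the dict is the example YAML written
   out by hand. It does not say whether a key ends up under ``"input"`` or
   ``"target"``, since this page does not specify that split.

   .. code-block:: python

      def expected_keys(source_name: str, variables: dict) -> list[str]:
          """List the flat output keys implied by a ``variables`` config block."""
          keys = []
          for field_type, groups in variables.items():
              if not groups:  # e.g. ``diagnostic: null`` in YAML
                  continue
              for group, dim in (("vars_3D", "3d"), ("vars_2D", "2d")):
                  for varname in groups.get(group) or []:
                      keys.append(f"{source_name}/{field_type}/{dim}/{varname}")
          return keys


      # The ``variables`` block of the example configuration, as a Python dict.
      variables = {
          "prognostic": {
              "vars_3D": ["temperature", "u_component_of_wind",
                          "v_component_of_wind", "specific_humidity"],
              "vars_2D": ["surface_pressure"],
          },
          "dynamic_forcing": {"vars_2D": ["toa_incident_solar_radiation"]},
          "static": {"vars_2D": ["land_sea_mask"]},
          "diagnostic": {"vars_2D": ["total_precipitation"]},
      }

      keys = expected_keys("Example_ARCOERA5", variables)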


   .. py:attribute:: dataset_type
      :value: 'arco_era5'



   .. py:attribute:: pressure_lev_era5_path
      :value: 'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3'



   .. py:attribute:: model_lev_era5_path
      :value: 'gs://gcp-public-data-arco-era5/ar/model-level-1h-0p25deg.zarr-v1'



   .. py:attribute:: model_lev_vars
      :value: ['divergence', 'fraction_of_cloud_cover', 'geopotential', 'ozone_mass_mixing_ratio',...



   .. py:attribute:: level_coord
      :type:  str


   .. py:attribute:: mod_level_store
      :value: None



   .. py:attribute:: pres_level_store
      :value: None



   .. py:attribute:: static_metadata
      :type:  dict[str, Any]


   .. py:attribute:: mode
      :value: 'remote'



   .. py:attribute:: _fs
      :value: None



   .. py:method:: _init_fs()

      Initialize the GCSFileSystem and zarr stores for pressure-level and model-level ERA5 data.



   .. py:method:: _extract_field(field_type: credit.datasets.base_dataset.VALID_FIELD_TYPES, t: pandas.Timestamp, sample: dict[str, Any]) -> None

      Open the dataset for *field_type* at time *t* and populate *sample*.

      Keys written are ``"{source_name}/{field_type}/3d/{varname}"`` for 3D variables
      and ``"{source_name}/{field_type}/2d/{varname}"`` for 2D variables.

      :param field_type: One of ``"prognostic"``, ``"dynamic_forcing"``,
                         ``"static"``, ``"diagnostic"``.
      :param t: Timestamp to select.
      :param sample: Dict to write variable tensors into (modified in place).
                     Tensor shapes (no batch dimension):

                     - 3D variable: ``(n_levels, 1, lat, lon)``
                     - 2D variable: ``(1, 1, lat, lon)``
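
      The shapes above can be written down as a small checker, which is handy
      in tests. This is a sketch only: ``expected_shape`` is a hypothetical
      helper, and the 721 x 1440 grid size is an assumption based on the
      0.25-degree store paths.

      .. code-block:: python

         def expected_shape(
             dim: str, n_levels: int, n_lat: int, n_lon: int
         ) -> tuple[int, int, int, int]:
             """Return the documented per-variable tensor shape (no batch axis)."""
             if dim == "3d":
                 return (n_levels, 1, n_lat, n_lon)
             if dim == "2d":
                 return (1, 1, n_lat, n_lon)
             raise ValueError(f"dim must be '2d' or '3d', got {dim!r}")


         # 16 levels, as in the example configuration, on an assumed
         # 0.25-degree global grid (721 latitudes x 1440 longitudes).
         shape_3d = expected_shape("3d", n_levels=16, n_lat=721, n_lon=1440)
         shape_2d = expected_shape("2d", n_levels=16, n_lat=721, n_lon=1440)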



.. py:data:: _WB2_ERA5_BASE
   :value: 'gs://weatherbench2/datasets/era5'


.. py:data:: _WB2_ERA5_STORE_PATHS
   :type:  dict[str, str]

.. py:data:: _WB2_ERA5_DEFAULT_LEVELS
   :type:  dict[str, list[int]]

.. py:class:: WeatherBench2ERA5Dataset(data_config: dict, resolution: str = '1440x721', return_target: bool = False)

   Bases: :py:obj:`credit.datasets.base_dataset.BaseDataset`


   PyTorch Dataset for WeatherBench2 ERA5 data on Google Cloud Storage.

   Provides access to ERA5 reanalysis data prepared for the WeatherBench2
   benchmark at multiple resolutions. All data is read lazily from public
   Google Cloud Storage zarr stores (anonymous access, no credentials required).

   Available resolutions:

   +----------------+-------------------+------------+------------------+
   | ``resolution`` | Grid              | Approx deg | Timestep, levels |
   +================+===================+============+==================+
   | ``"1440x721"`` | 1440 × 721 global | 0.25°      | 6-hourly, 13 lev |
   +----------------+-------------------+------------+------------------+
   | ``"240x121"``  | 240 × 121 global  | 1.5°       | 6-hourly, 13 lev |
   +----------------+-------------------+------------+------------------+
   | ``"64x32"``    | 64 × 32 global    | ~5.6°      | 6-hourly, 13 lev |
   +----------------+-------------------+------------+------------------+
   | ``"full"``     | 1440 × 721 global | 0.25°      | hourly, 37 lev   |
   +----------------+-------------------+------------+------------------+

   See ``_WB2_ERA5_DEFAULT_LEVELS`` for default pressure levels per resolution.
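
   The approximate spacing in the table follows from the grid size. The sketch
   below derives it from the ``resolution`` string; ``grid_size`` and
   ``approx_spacing_deg`` are hypothetical helpers, and reading the string as
   longitudes by latitudes is an assumption consistent with the table.

   .. code-block:: python

      def grid_size(resolution: str) -> tuple[int, int]:
          """Return (n_lon, n_lat) for a resolution string from the table."""
          if resolution == "full":
              # "full" shares the 0.25-degree grid with "1440x721".
              return (1440, 721)
          n_lon, n_lat = (int(part) for part in resolution.split("x"))
          return (n_lon, n_lat)


      def approx_spacing_deg(resolution: str) -> float:
          """Approximate longitudinal grid spacing in degrees (360 / n_lon)."""
          n_lon, _ = grid_size(resolution)
          return 360.0 / n_lon


      spacings = {
          r: approx_spacing_deg(r)
          for r in ("1440x721", "240x121", "64x32", "full")
      }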

   Example YAML configuration::

       data:
         source:
           WeatherBench2_ERA5:
             dataset_type: "weatherbench2_era5"
             resolution: "1440x721"   # optional; overridden by the resolution kwarg
             level_coord: "level"
             levels: [50, 100, 200, 500, 850, 1000]  # optional; defaults to all available
             variables:
               prognostic:
                 vars_3D: ["temperature", "u_component_of_wind", "v_component_of_wind",
                            "specific_humidity"]
                 vars_2D: ["surface_pressure", "2m_temperature"]
               dynamic_forcing:
                 vars_2D: ["total_precipitation_6hr"]
               static:
                 vars_2D: ["geopotential_at_surface"]
               diagnostic: null

         start_datetime: "2017-01-01"
         end_datetime:   "2019-12-31"
         timestep: "6h"
         forecast_len: 1

   Output key format::

       "weatherbench2_era5/{field_type}/{dim}/{varname}"

   Assumptions:
       1. Non-static variables have a "time" dimension in the zarr store.
       2. 3D pressure-level variables have a "level" coordinate (hPa).
       3. Dimension order: (time, level, latitude, longitude) for 3D;
          (time, latitude, longitude) for 2D; (latitude, longitude) for static.
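
   The time settings in the example configuration imply a sequence of
   input/target pairs. The sketch below enumerates them under two assumptions
   that this page does not confirm: the target lies ``forecast_len`` timesteps
   after the input, and ``end_datetime`` is inclusive. ``sample_times`` is a
   hypothetical helper, not part of the dataset class.

   .. code-block:: python

      from datetime import datetime, timedelta


      def sample_times(start: str, end: str, step_hours: int, forecast_len: int):
          """Yield (input_time, target_time) pairs on a fixed timestep grid."""
          step = timedelta(hours=step_hours)
          lead = forecast_len * step
          t = datetime.fromisoformat(start)
          last = datetime.fromisoformat(end)
          # Assumption: an input is kept only if its target is still in range.
          while t + lead <= last:
              yield t, t + lead
              t += step


      # One day at a 6-hour timestep, shortened from the example date range.
      pairs = list(
          sample_times("2017-01-01", "2017-01-02", step_hours=6, forecast_len=1)
      )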


   .. py:attribute:: dataset_type
      :type:  str
      :value: 'weatherbench2_era5'



   .. py:attribute:: resolution
      :type:  str


   .. py:attribute:: store_path
      :type:  str


   .. py:attribute:: level_coord
      :type:  str


   .. py:attribute:: levels
      :type:  list[int]


   .. py:attribute:: static_metadata
      :type:  dict


   .. py:attribute:: _fs
      :value: None



   .. py:attribute:: store
      :value: None



   .. py:attribute:: mode
      :value: 'remote'



   .. py:method:: _init_fs() -> None


   .. py:method:: _extract_field(field_type: credit.datasets.base_dataset.VALID_FIELD_TYPES, t: pandas.Timestamp, sample: dict) -> None

      Open the zarr store and extract variables for *field_type* at time *t*.

      Keys written to *sample*:

      - ``"weatherbench2_era5/{field_type}/3d/{varname}"`` — shape ``(n_levels, 1, lat, lon)``
      - ``"weatherbench2_era5/{field_type}/2d/{varname}"`` — shape ``(1, 1, lat, lon)``

      :param field_type: One of ``"prognostic"``, ``"dynamic_forcing"``,
                         ``"static"``, ``"diagnostic"``.
      :param t: Timestamp to select.
      :param sample: Dict to write variable tensors into (modified in place).



