credit.datasets.local
=====================

.. py:module:: credit.datasets.local

.. autoapi-nested-parse::

   local.py
   --------
   LocalDataset: generic PyTorch Dataset for loading atmospheric data from local
   NetCDF/Zarr files. Supports any combination of prognostic, dynamic_forcing,
   static, and diagnostic field types with optional 3D (multi-level) and 2D
   (surface/single-level) variables.

   Sample structure returned by __getitem__::

       {
           "input": {
               "{source_name}/prognostic/3d/T":        tensor,  # (n_levels, 1, lat, lon)
               "{source_name}/prognostic/2d/SP":       tensor,  # (1,        1, lat, lon)
               "{source_name}/dynamic_forcing/2d/tsi": tensor,
               "{source_name}/static/2d/LSM":          tensor,
               ...
           },
           "target": {                                  # only when return_target=True
               "{source_name}/prognostic/3d/T":        tensor,
               "{source_name}/prognostic/2d/SP":       tensor,
               ...
           },
           "metadata": {
               "input_datetime":  int,                  # nanoseconds since epoch
               "target_datetime": int,                  # only when return_target=True
           },
       }
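
   The ``metadata`` datetimes are plain integers, so they usually need converting
   before logging or plotting. The following is a minimal standard-library sketch,
   assuming "epoch" means the Unix epoch in UTC. The helper names
   ``ns_to_datetime`` and ``datetime_to_ns`` are hypothetical and not part of
   ``credit``.

   .. code-block:: python

       from datetime import datetime, timedelta, timezone

       # Assumption: the epoch is 1970-01-01T00:00:00 UTC.
       EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


       def ns_to_datetime(ns: int) -> datetime:
           """Convert integer nanoseconds since the epoch to an aware UTC datetime.

           Python datetimes have microsecond resolution, so any sub-microsecond
           remainder is dropped.
           """
           return EPOCH + timedelta(microseconds=ns // 1_000)


       def datetime_to_ns(dt: datetime) -> int:
           """Convert a timezone-aware datetime to integer nanoseconds since the epoch."""
           delta = dt - EPOCH
           whole_seconds = delta.days * 86_400 + delta.seconds
           return whole_seconds * 1_000_000_000 + delta.microseconds * 1_000

   Integer arithmetic is used throughout so that large nanosecond values do not
   lose precision through a float conversion.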

   Output key format (flat, slash-delimited)::

       "{source_name}/{field_type}/{dim}/{varname}"

       field_type: "prognostic" | "dynamic_forcing" | "static" | "diagnostic"
       dim       : "2d"  (surface / single-level)
                   "3d"  (multi-level upper-air; requires level_coord in config;
                          if levels is omitted all levels in the file are used)
       varname   : variable name as given in config (e.g. "T", "SP", "tsi")
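
   Because the keys are flat strings, downstream code often needs to build or
   split them. A minimal sketch follows, assuming that ``field_type``, ``dim`` and
   ``varname`` never contain ``/``. The names ``SampleKey``, ``make_key`` and
   ``parse_key`` are hypothetical helpers, not part of ``credit``.

   .. code-block:: python

       from typing import NamedTuple

       FIELD_TYPES = ("prognostic", "dynamic_forcing", "static", "diagnostic")
       DIMS = ("2d", "3d")


       class SampleKey(NamedTuple):
           source_name: str
           field_type: str
           dim: str
           varname: str


       def make_key(source_name: str, field_type: str, dim: str, varname: str) -> str:
           """Build a flat key of the form source_name/field_type/dim/varname."""
           return f"{source_name}/{field_type}/{dim}/{varname}"


       def parse_key(key: str) -> SampleKey:
           """Split a flat key into its four components and validate them."""
           # Split from the right so that only the last three "/" act as delimiters.
           parts = key.rsplit("/", 3)
           if len(parts) != 4:
               raise ValueError(f"expected 4 slash-delimited parts, got {key!r}")
           parsed = SampleKey(*parts)
           if parsed.field_type not in FIELD_TYPES:
               raise ValueError(f"unknown field_type in {key!r}")
           if parsed.dim not in DIMS:
               raise ValueError(f"unknown dim in {key!r}")
           return parsed

   For example, ``parse_key("ERA5/prognostic/3d/T").varname`` gives ``"T"``, which
   makes it straightforward to group tensors by field type or dimensionality.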

   Tensor shapes (no batch dimension)::

       3D variable : (n_levels, 1, lat, lon)   # n_levels = len(config levels), or all
                                               # levels in the file if levels is omitted
       2D variable : (1,        1, lat, lon)   # singleton level dim

   After DataLoader collation the batch dimension is prepended::

       (batch, n_levels, 1, lat, lon)
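
   The shape rules can be written down as a small bookkeeping helper, which is
   handy for assertions in tests. This sketch works on plain tuples and does not
   touch PyTorch. ``sample_shape`` and ``batched_shape`` are hypothetical names,
   and the singleton second axis is simply kept as documented.

   .. code-block:: python

       def sample_shape(dim: str, n_lat: int, n_lon: int, n_levels: int = 1) -> tuple:
           """Return the documented per-sample tensor shape for a variable."""
           if dim == "3d":
               return (n_levels, 1, n_lat, n_lon)
           if dim == "2d":
               # 2D variables carry a singleton level dimension.
               return (1, 1, n_lat, n_lon)
           raise ValueError(f"dim must be '2d' or '3d', got {dim!r}")


       def batched_shape(shape: tuple, batch_size: int) -> tuple:
           """Return the shape after collation prepends the batch dimension."""
           return (batch_size, *shape)

   With 16 levels on a 181 x 360 grid and a batch size of 4, a 3D variable would
   therefore have shape ``(4, 16, 1, 181, 360)`` after collation.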

   File naming:
       Each field type supports an optional ``filename_time_format`` config key
       that specifies a strftime format string describing how the datetime appears
       in the file name.  Defaults to ``"%Y"`` (annual files).

       Examples::

           filename_time_format: "%Y"       # data_2021.zarr
           filename_time_format: "%Y_%m"    # data_2021_06.nc
           filename_time_format: "%Y%m%d"   # data_20210601.nc

       If only a single file matches the glob pattern, ``filename_time_format`` is
       ignored and that file is used for all timestamps.
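
       The lookup described above can be sketched as follows. This is only an
       illustration: it assumes the formatted timestamp appears verbatim in the
       file name and uses substring matching, which may differ from what
       ``LocalDataset`` does internally. ``select_file`` is a hypothetical helper.

       .. code-block:: python

           from datetime import datetime
           from pathlib import PurePosixPath


           def select_file(paths: list[str], t: datetime,
                           filename_time_format: str = "%Y") -> str:
               """Pick the file whose name contains ``t`` rendered with the format."""
               if not paths:
                   raise FileNotFoundError("no files matched the glob pattern")
               if len(paths) == 1:
                   # Single match: the format is ignored, one file serves all times.
                   return paths[0]
               stamp = t.strftime(filename_time_format)
               # Assumption: compare against the file name only, not parent folders.
               matches = [p for p in paths if stamp in PurePosixPath(p).name]
               if len(matches) != 1:
                   raise LookupError(
                       f"expected exactly one file containing {stamp!r}, "
                       f"found {len(matches)}"
                   )
               return matches[0]

       Raising on zero or multiple matches keeps a mismatched
       ``filename_time_format`` from silently selecting the wrong file.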



Classes
-------

.. autoapisummary::

   credit.datasets.local.LocalDataset


Module Contents
---------------

.. py:class:: LocalDataset(data_config: dict[str, Any], return_target: bool = False)

   Bases: :py:obj:`credit.datasets.base_dataset.BaseDataset`


   Generic PyTorch Dataset for local NetCDF/Zarr atmospheric data files.

   See the module docstring for a full description of the output format and file naming.

   Example YAML configuration::

       data:
         source:
           My_Surface_Data:  # User-provided name (arbitrary key)
             dataset_type: "local"
             level_coord: "level"
             levels: [10, 30, 40, 50, 60, 70, 80, 90, 95, 100, 105, 110, 120, 130, 136, 137]
             variables:
               prognostic:
                 vars_3D: ['T', 'U', 'V', 'Q']
                 vars_2D: ['SP', 't2m']
                 path: "/data/era5_*.zarr"
                 filename_time_format: "%Y"        # annual (default)
               dynamic_forcing:
                 vars_2D: ['tsi']
                 path: "/data/solar_*.nc"
                 filename_time_format: "%Y_%m"     # monthly
               static:
                 vars_2D: ['Z_GDS4_SFC', 'LSM']
                 path: "/data/lsm.nc"
                 # single file — filename_time_format not needed
               diagnostic: null

         start_datetime: "2017-01-01"
         end_datetime: "2019-12-31"
         timestep: "6h"
         forecast_len: 1
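
   To see how this configuration maps onto sample keys, the sketch below
   enumerates the keys a source would produce. It assumes the YAML has already
   been loaded into a dict, and the dict shown is trimmed to the entries the
   helper reads. ``expected_keys`` is a hypothetical helper, and which keys end up
   under ``"input"`` versus ``"target"`` is not modelled here.

   .. code-block:: python

       def expected_keys(source_name: str, source_cfg: dict) -> list[str]:
           """List the flat sample keys implied by one source's ``variables`` block."""
           keys = []
           for field_type, field_cfg in source_cfg["variables"].items():
               if not field_cfg:
                   # Field types set to null (e.g. ``diagnostic: null``) are skipped.
                   continue
               vars_3d = field_cfg.get("vars_3D") or []
               vars_2d = field_cfg.get("vars_2D") or []
               if vars_3d and not source_cfg.get("level_coord"):
                   raise ValueError("3D variables require level_coord in the config")
               keys += [f"{source_name}/{field_type}/3d/{v}" for v in vars_3d]
               keys += [f"{source_name}/{field_type}/2d/{v}" for v in vars_2d]
           return keys


       # Trimmed Python equivalent of the YAML source entry above.
       source_cfg = {
           "level_coord": "level",
           "variables": {
               "prognostic": {"vars_3D": ["T", "U", "V", "Q"], "vars_2D": ["SP", "t2m"]},
               "dynamic_forcing": {"vars_2D": ["tsi"]},
               "static": {"vars_2D": ["Z_GDS4_SFC", "LSM"]},
               "diagnostic": None,
           },
       }

       keys = expected_keys("My_Surface_Data", source_cfg)

   Here ``keys`` starts with ``"My_Surface_Data/prognostic/3d/T"`` and contains
   nine entries, none of them for the ``diagnostic`` field type.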

   Assumptions:
       1. A "time" dimension / coordinate is present for non-static fields.
       2. A level coordinate (name given by ``level_coord``) represents the
          vertical axis of 3D variables.
       3. Dimension order: (time, level, latitude, longitude) for 3D;
          (time, latitude, longitude) for 2D; (latitude, longitude) for static.
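
       A quick pre-flight check of the third assumption can be sketched as below.
       It takes the dimension names literally from the list above (``"time"``,
       ``"latitude"``, ``"longitude"``), which is an assumption about your files.
       ``expected_dims`` and ``check_dims`` are hypothetical helpers.

       .. code-block:: python

           def expected_dims(kind: str, level_coord: str = "level") -> tuple[str, ...]:
               """Return the assumed dimension order for a variable kind."""
               if kind == "3d":
                   return ("time", level_coord, "latitude", "longitude")
               if kind == "2d":
                   return ("time", "latitude", "longitude")
               if kind == "static":
                   return ("latitude", "longitude")
               raise ValueError(f"kind must be '3d', '2d' or 'static', got {kind!r}")


           def check_dims(dims, kind: str, level_coord: str = "level") -> None:
               """Raise ValueError if ``dims`` does not match the assumed order."""
               expected = expected_dims(kind, level_coord)
               if tuple(dims) != expected:
                   raise ValueError(f"expected dims {expected}, got {tuple(dims)}")

       With xarray, ``check_dims(da.dims, "3d")`` would validate a ``DataArray``,
       and ``da.transpose(*expected_dims("3d"))`` is one way to reorder a file
       that stores the same dimensions in a different order.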


   .. py:attribute:: dataset_type
      :value: 'local'



   .. py:attribute:: level_coord
      :type:  str | None


   .. py:attribute:: levels
      :type:  list | None


   .. py:attribute:: static_metadata
      :type:  dict[str, Any]


   .. py:attribute:: mode
      :value: 'local'



   .. py:attribute:: time_coord


   .. py:method:: _extract_field(field_type: str, t: pandas.Timestamp, sample: dict[str, Any]) -> None

      Open the dataset for *field_type* at time *t* and populate *sample*.

      Keys written are ``"{source_name}/{field_type}/3d/{varname}"`` for 3D variables
      and ``"{source_name}/{field_type}/2d/{varname}"`` for 2D variables.

      :param field_type: One of ``"prognostic"``, ``"dynamic_forcing"``,
                         ``"static"``, ``"diagnostic"``.
      :param t: Timestamp to select.
      :param sample: Dict to write variable tensors into (modified in place).
                     Tensor shapes (no batch dimension):

                     - 3D variable: ``(n_levels, 1, lat, lon)``
                     - 2D variable: ``(1, 1, lat, lon)``



