credit.cli
==========

.. py:module:: credit.cli

.. autoapi-nested-parse::

   CREDIT unified command-line interface.

   Single entrypoint for training, rollout, job submission, and config generation.

   .. rubric:: Examples

   ::

      credit init     --grid 0.25deg -o my_config.yml
      credit train    -c config.yml
      credit realtime -c config.yml --init-time 2024-01-15T00 --steps 40
      credit rollout  -c config.yml
      credit submit   --cluster derecho -c config.yml --gpus 4 --nodes 2
      credit submit   --cluster casper  -c config.yml --mode rollout --jobs 10
      credit submit   --cluster casper  -c config.yml --mode realtime --init-time 2024-01-15T00


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/credit/cli/_ask/index
   /autoapi/credit/cli/_common/index
   /autoapi/credit/cli/_convert/index
   /autoapi/credit/cli/_parser/index
   /autoapi/credit/cli/_plot/index
   /autoapi/credit/cli/_submit/index


Attributes
----------

.. autoapisummary::

   credit.cli._AGENT_SYSTEM_PROMPT
   credit.cli._AGENT_TOOL_DEFS
   credit.cli._CREDIT_SYSTEM_PROMPT
   credit.cli._AGENT_BASH_BLOCKLIST
   credit.cli._PBS_DEFAULTS


Exceptions
----------

.. autoapisummary::

   credit.cli._ProviderError


Functions
---------

.. autoapisummary::

   credit.cli._agent
   credit.cli._ask
   credit.cli._collect_run_context
   credit.cli._agent_bash
   credit.cli._agent_list_files
   credit.cli._agent_read_file
   credit.cli._dispatch_tool
   credit.cli._find_torchrun
   credit.cli._is_ncar_system
   credit.cli._prompt
   credit.cli._prompt_bool
   credit.cli._repo_root
   credit.cli._setup_logging
   credit.cli._convert
   credit.cli._init
   credit.cli._write_reload_config
   credit.cli._build_parser
   credit.cli.main
   credit.cli._build_channel_map
   credit.cli._build_denorm_stats
   credit.cli._metrics
   credit.cli._plot
   credit.cli._build_pbs_script
   credit.cli._build_realtime_pbs_script
   credit.cli._build_rollout_pbs_script
   credit.cli._compute_chain
   credit.cli._do_submit_realtime
   credit.cli._do_submit_rollout
   credit.cli._load_pbs_config
   credit.cli._print_ensemble_rollout_plan
   credit.cli._print_job_plan
   credit.cli._qsub
   credit.cli._realtime
   credit.cli._resolve_pbs_opts
   credit.cli._rollout
   credit.cli._rollout_ensemble
   credit.cli._submit
   credit.cli._train


Package Contents
----------------

.. py:data:: _AGENT_SYSTEM_PROMPT
   :value: Multiline-String

   .. raw:: html

      <details><summary>Show Value</summary>

   .. code-block:: python

      """You are CREDIT-Agent, an agentic AI assistant for the CREDIT software package (Community Research Earth Digital Intelligence Twin),
      an AI-based numerical weather prediction framework developed by the NCAR MILES group.
      When introducing yourself, use the name "CREDIT-Agent". Do not call yourself "CREDIT" — that is the name of the software package you support.
      
      You have access to tools that let you read files, list files, and run safe read-only shell commands.
      Use them to investigate the user's question thoroughly before answering.
      
      Typical tasks:
      - Diagnose why a training run crashed (read PBS logs, config, Python tracebacks)
      - Explain what a config option does (read the relevant source file)
      - Suggest config changes based on the user's hardware and dataset
      - Check whether a job is still running (qstat) and interpret its output
      - Diff configs between two experiments
      
      Guidelines:
      - Always read relevant files before speculating — the answer is usually in the logs or config.
      - When reading PBS output files (*.o*), focus on the last 100 lines first.
      - Suggest concrete, actionable fixes — not generic advice.
      - Keep responses concise; use markdown headers and code blocks.
      """

   .. raw:: html

      </details>


.. py:data:: _AGENT_TOOL_DEFS

.. py:data:: _CREDIT_SYSTEM_PROMPT
   :value: Multiline-String

   .. raw:: html

      <details><summary>Show Value</summary>

   .. code-block:: python

      """You are CREDIT-Ask, an AI assistant for the CREDIT software package (Community Research Earth Digital Intelligence Twin),
      an AI-based numerical weather prediction framework developed by the NCAR MILES group.
      When introducing yourself, use the name "CREDIT-Ask". Do not call yourself "CREDIT" — that is the name of the software package you support.
      
      ## What CREDIT is
      CREDIT trains deep learning models (primarily WXFormer) to forecast global atmospheric state.
      It runs on NCAR HPC clusters: Casper (single-node, A100/H100 GPUs) and Derecho (multi-node, A100 GPUs).
      The main entry point is the `credit` CLI.
      
      ## Key CLI commands
      - `credit train -c config.yml`                    — start/resume training
      - `credit submit --cluster casper|derecho -c config.yml [--mode train|rollout|realtime] --gpus N [--nodes N] [--chain N] [--reload] [--jobs N] [--init-time YYYY-MM-DDTHH] [--steps N]`
      - `credit plot -c config.yml --field VAR_2T --denorm`   — quick visualisation from checkpoint
      - `credit rollout -c config.yml`                  — batch forecast to NetCDF
      - `credit realtime -c config.yml --init-time YYYY-MM-DDTHH --steps N`
      - `credit init --grid 1deg|0.25deg -o my_config.yml`    — generate starter config
      - `credit ask "..."`                              — this command
      
      ## v2 data schema (YAML)
      ```yaml
      data:
        source:
          ERA5:
            levels: [...]          # pressure/model levels
            variables:
              prognostic:
                vars_3D: [U, V, T, Q, Z]    # each × n_levels channels
                vars_2D: [SP, VAR_2T, ...]
              diagnostic:
                vars_2D: [precip, evap, ...]
        mean_path: /path/to/mean.nc
        std_path:  /path/to/std.nc
      ```
      
      ## Trainer config
      ```yaml
      trainer:
        type: era5-gen2
        mode: ddp           # none | ddp | fsdp
        train_batch_size: 8
        num_epoch: 5        # epochs per PBS job
        epochs: 70          # total training target
        thread_workers: 4   # DataLoader workers per GPU
        prefetch_factor: 4
        use_tensorboard: True
        use_ema: True
        ema_decay: 0.9999
        use_scheduler: True
        scheduler:
          scheduler_type: linear-warmup-cosine
          warmup_steps: 1000
          total_steps: 500000
          min_lr: 1.0e-5
        dataloader_timeout_s: 300   # preflight hang detection
      ```
      
      ## Cluster specifics
      - **Casper**: 1 node, torchrun --standalone, GPUs: V100/A100/H100, queue: casper
        - Pre-built env: `/glade/u/home/schreck/.conda/envs/credit-casper`
      - **Derecho single-node**: torchrun --standalone (NOT mpiexec)
      - **Derecho multi-node**: mpiexec + torchrun --rdzv-backend=c10d
        - Pre-built env: `/glade/work/benkirk/conda-envs/credit-derecho-torch28-nccl221`
      - Data root: `/glade/campaign/cisl/aiml/ksha/CREDIT_data/`
      - Default save_loc: `/glade/derecho/scratch/$USER/CREDIT_runs/`
      
      ## Common problems and fixes
      | Symptom | Likely cause | Fix |
      |---------|-------------|-----|
      | Training loop hangs on startup | DataLoader OOM (too many workers × prefetch × batch × channels) | Reduce `thread_workers` to 1 or 0, or `prefetch_factor` to 1 |
      | `RendezvousConnectionError` on Derecho | Single-node job using c10d rendezvous | Use `--nodes 1` so `credit submit` generates `--standalone` |
      | Loss > 100 or growing | Bad normalization or wrong data paths | Check `mean_path`/`std_path`; run `credit plot --denorm` |
      | Loss stuck (not decreasing) | LR too low/high, wrong scheduler, EMA misconfigured | Check scheduler config; try reducing LR 10×; check warmup_steps |
      | `KeyError: 'linear-warmup-cosine'` | Old CREDIT version | `pip install -e . --no-deps` to reinstall |
      | Checkpoint not found | Wrong `save_loc` or first epoch | Set `load_weights: False` for first run |
      | PBS job cancelled after failure | Normal: `afterok` chain auto-cancels remaining jobs | Use `credit submit --reload --chain N` to restart |
      | FSDP + EMA slow | EMA does extra full-param sync on FSDP | Use `use_ema: False` with FSDP or accept overhead |
      
      ## How --chain works (train mode)
      `--chain N` submits N PBS jobs with afterok dependencies. Job 1 runs fresh (or --reload).
      Jobs 2..N auto-generate `config_reload.yml` and resume from checkpoint.
      Rule of thumb: chain = ceil(total_epochs / num_epoch). E.g., 70 epochs / 5 per job = 14.
      
      ## submit --mode options
      - `--mode train` (default): training job, supports --chain and --reload
      - `--mode rollout`: N parallel jobs covering all init times, use --jobs N; reads predict: section
      - `--mode realtime`: single forecast job, requires --init-time YYYY-MM-DDTHH and --steps N
      
      ## What healthy training looks like
      - After epoch 1: train_loss ≈ 1–3 (order 1)
      - Loss should decrease steadily each epoch
      - Validation loss should track training loss (not diverge)
      - `credit plot -c config.yml --field VAR_2T --denorm` should show recognisable weather patterns after ~10 epochs
      
      Be concise, specific, and actionable. When referencing config keys use inline code. If you see a training log or config in the context, use it to give run-specific advice.
      """

   .. raw:: html

      </details>


.. py:exception:: _ProviderError

   Bases: :py:obj:`Exception`


   Raised when a provider call fails in a way that should trigger fallback.
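
   The fallback behaviour this exception enables can be sketched roughly as below. This is a minimal illustration, not the code in ``credit.cli``: ``ProviderError`` stands in for ``_ProviderError``, and ``first_successful``, ``agentic`` and ``simple_chat`` are hypothetical names.

   .. code-block:: python

      class ProviderError(Exception):
          """Stand-in for credit.cli._ProviderError (illustration only)."""


      def first_successful(providers, question):
          """Call each provider in order; move on only when one raises ProviderError."""
          errors = []
          for provider in providers:
              try:
                  return provider(question)
              except ProviderError as exc:
                  errors.append(str(exc))  # remember why this provider was skipped
          raise RuntimeError("all providers failed: " + "; ".join(errors))


      def agentic(question):  # hypothetical provider that is unavailable
          raise ProviderError("no API key configured")


      def simple_chat(question):  # hypothetical fallback provider
          return "answer: " + question


      print(first_successful([agentic, simple_chat], "why did my run crash?"))

   Any other exception type propagates unchanged, so only failures explicitly marked as recoverable trigger the fallback.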


.. py:function:: _agent(args) -> None

   Run an agentic session: Claude reads files and runs commands to answer your question.


.. py:function:: _ask(args) -> None

   Unified AI assistant: tries agentic mode first, falls back to simple chat.


.. py:function:: _collect_run_context(args) -> str

   Gather config, training log, and recent PBS output for context injection.


.. py:data:: _AGENT_BASH_BLOCKLIST
   :value: ('rm ', 'rmdir', 'mv ', 'cp ', '> ', '>>', 'tee ', 'dd ', 'mkfs', 'chmod', 'chown', 'curl',...
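
   One plausible way to apply such a blocklist is plain substring matching, sketched below. How ``_agent_bash`` actually uses the tuple is an assumption here; ``VISIBLE_BLOCKLIST`` holds only the entries visible in the truncated value above, and ``is_blocked`` is a hypothetical helper.

   .. code-block:: python

      # Only the entries visible in the truncated value above; the real tuple is longer.
      VISIBLE_BLOCKLIST = (
          "rm ", "rmdir", "mv ", "cp ", "> ", ">>", "tee ", "dd ",
          "mkfs", "chmod", "chown", "curl",
      )


      def is_blocked(command: str, blocklist=VISIBLE_BLOCKLIST) -> bool:
          """Return True if any blocklisted fragment occurs anywhere in the command."""
          return any(fragment in command for fragment in blocklist)


      print(is_blocked("rm -rf /tmp/run"))  # True
      print(is_blocked("qstat -u me"))      # False

   Substring matching is deliberately conservative: in this sketch a harmless command such as ``echo confirm ok`` is also rejected, because ``confirm ok`` contains ``rm``.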


.. py:data:: _PBS_DEFAULTS

.. py:function:: _agent_bash(command: str) -> str

.. py:function:: _agent_list_files(pattern: str) -> str

.. py:function:: _agent_read_file(path: str, tail: int = 400) -> str
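
   The ``tail`` parameter suggests that only the end of a file is returned. A minimal sketch of that idea follows, assuming ``tail`` counts lines (the source does not say whether it counts lines or characters); ``read_tail`` is a hypothetical name.

   .. code-block:: python

      from collections import deque


      def read_tail(path: str, tail: int = 400) -> str:
          """Return the last `tail` lines of a text file as a single string."""
          with open(path, "r", errors="replace") as handle:
              # A bounded deque keeps only the most recent `tail` lines in memory.
              return "".join(deque(handle, maxlen=tail))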

.. py:function:: _dispatch_tool(name: str, tool_input: dict) -> str

.. py:function:: _find_torchrun() -> str

   Return the path to torchrun, preferring the active conda env.
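
   A lookup of this kind might be sketched as follows. It assumes the active conda env is identified by the ``CONDA_PREFIX`` environment variable and falls back to ``PATH``; the real function may differ, and ``find_torchrun_sketch`` is a hypothetical name.

   .. code-block:: python

      import os
      import shutil
      from typing import Optional


      def find_torchrun_sketch() -> Optional[str]:
          """Prefer torchrun from the active conda env, else fall back to PATH."""
          conda_prefix = os.environ.get("CONDA_PREFIX")  # set by `conda activate`
          if conda_prefix:
              candidate = os.path.join(conda_prefix, "bin", "torchrun")
              if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
                  return candidate
          return shutil.which("torchrun")  # None if torchrun is not on PATH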


.. py:function:: _is_ncar_system() -> bool

   Return True if running on a known NCAR HPC system (Casper or Derecho).


.. py:function:: _prompt(prompt: str, default=None) -> str

   Print a prompt and return stripped input, or *default* if empty.
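
   The described behaviour can be sketched as below. ``prompt_with_default`` and its ``read`` parameter are hypothetical; ``read`` exists only so that the sketch can be exercised without a terminal, and the ``[default]`` suffix is an assumption about the display format.

   .. code-block:: python

      from typing import Callable, Optional


      def prompt_with_default(
          prompt: str,
          default: Optional[str] = None,
          read: Callable[[str], str] = input,
      ) -> Optional[str]:
          """Ask a question; return the stripped answer, or `default` if it is empty."""
          suffix = f" [{default}]" if default is not None else ""
          answer = read(f"{prompt}{suffix}: ").strip()
          return answer if answer else default


      # Simulate a user who just presses Enter, so the default is returned.
      print(prompt_with_default("Grid", "1deg", read=lambda _: ""))  # 1deg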


.. py:function:: _prompt_bool(prompt: str, default: bool = True) -> bool

.. py:function:: _repo_root() -> str

   Absolute path to the miles-credit repo root.


.. py:function:: _setup_logging(level: int = logging.INFO) -> None

.. py:function:: _convert(args: argparse.Namespace) -> None

   Interactive v1 → v2 config converter.


.. py:function:: _init(args: argparse.Namespace) -> None

   Copy a config template to the user's desired location.


.. py:function:: _write_reload_config(config_path: str) -> str

   Patch trainer reload fields and write a reload config next to the checkpoint.

   Reads the YAML at *config_path*, sets the five fields required for a clean
   resume, and writes the result to ``<save_loc>/config_reload.yml``.

   Returns the path to the written reload config.


.. py:function:: _build_parser() -> argparse.ArgumentParser
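
   For orientation, a subcommand parser covering two of the commands from the module examples might look like the sketch below. It is not the actual parser: only the flags shown in the examples are used, the long option ``--config`` is an assumption, and ``build_parser_sketch`` is a hypothetical name.

   .. code-block:: python

      import argparse


      def build_parser_sketch() -> argparse.ArgumentParser:
          """Build a small `credit`-style parser with two subcommands."""
          parser = argparse.ArgumentParser(prog="credit")
          sub = parser.add_subparsers(dest="command", required=True)

          train = sub.add_parser("train", help="start or resume training")
          train.add_argument("-c", "--config", required=True)  # --config is assumed

          submit = sub.add_parser("submit", help="generate and submit PBS jobs")
          submit.add_argument("--cluster", choices=["casper", "derecho"], required=True)
          submit.add_argument("-c", "--config", required=True)
          submit.add_argument("--mode", choices=["train", "rollout", "realtime"], default="train")
          submit.add_argument("--gpus", type=int)
          submit.add_argument("--nodes", type=int)
          return parser


      args = build_parser_sketch().parse_args(
          ["submit", "--cluster", "derecho", "-c", "config.yml", "--gpus", "4", "--nodes", "2"]
      )
      print(args.command, args.cluster, args.mode, args.gpus, args.nodes)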

.. py:function:: main() -> None

.. py:function:: _build_channel_map(conf)

   Return a dict mapping variable name -> list of channel indices in the output tensor.
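
   The shape of such a mapping can be illustrated with the sketch below. It assumes that 3D variables come first and each occupies ``n_levels`` consecutive channels, followed by one channel per 2D variable; the actual channel order is defined by the dataset code, and ``channel_map_sketch`` is a hypothetical name.

   .. code-block:: python

      def channel_map_sketch(vars_3d, vars_2d, n_levels):
          """Map each variable name to its channel indices (3D first, then 2D)."""
          mapping, offset = {}, 0
          for name in vars_3d:  # each 3D variable occupies n_levels channels
              mapping[name] = list(range(offset, offset + n_levels))
              offset += n_levels
          for name in vars_2d:  # each 2D variable occupies a single channel
              mapping[name] = [offset]
              offset += 1
          return mapping


      print(channel_map_sketch(["U", "V"], ["SP"], n_levels=3))
      # {'U': [0, 1, 2], 'V': [3, 4, 5], 'SP': [6]}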


.. py:function:: _build_denorm_stats(conf)

   Return (mean_arr, std_arr) aligned with ERA5Dataset target channel order.


.. py:function:: _metrics(args) -> None

   Run WeatherBench2-style evaluation and optionally generate scorecard plots.


.. py:function:: _plot(args) -> None

   Load checkpoint, run one forward pass, produce global maps.


.. py:function:: _build_pbs_script(args: argparse.Namespace, config: str, repo: str, account: str = None, depend_on: str = None, save_loc: str = None) -> str

   Return a PBS batch script string for the given args and config path.


.. py:function:: _build_realtime_pbs_script(args: argparse.Namespace, config: str, repo: str, init_time: str, steps: int, save_loc: str = None) -> str

   Return a PBS script that runs a single realtime forecast.


.. py:function:: _build_rollout_pbs_script(args: argparse.Namespace, config: str, repo: str, subset: int, n_subsets: int, save_loc: str = None) -> str

   Return a PBS script for one subset of an ensemble rollout.


.. py:function:: _compute_chain(args: argparse.Namespace) -> int

   Return the number of jobs to chain.
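
   The rule of thumb quoted in ``_CREDIT_SYSTEM_PROMPT`` (``chain = ceil(total_epochs / num_epoch)``) can be written out as below. This is an illustration of that arithmetic only, not the body of ``_compute_chain``; ``chain_length`` is a hypothetical helper.

   .. code-block:: python

      import math


      def chain_length(total_epochs: int, epochs_per_job: int) -> int:
          """Number of PBS jobs needed to reach total_epochs (rule of thumb)."""
          if epochs_per_job <= 0:
              raise ValueError("epochs_per_job must be positive")
          return math.ceil(total_epochs / epochs_per_job)


      print(chain_length(70, 5))  # 14, matching the example in the prompt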


.. py:function:: _do_submit_realtime(args: argparse.Namespace) -> None

   Submit a single PBS job for a realtime forecast.


.. py:function:: _do_submit_rollout(args: argparse.Namespace) -> None

   Submit N parallel PBS rollout jobs to cover all init times.
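
   How init times are divided among the N jobs is not documented here. One simple scheme, near-equal contiguous chunks, is sketched below as an assumption; ``split_init_times`` is a hypothetical helper, and the real code may partition differently (for example by striding).

   .. code-block:: python

      def split_init_times(init_times, n_subsets):
          """Split init_times into n_subsets near-equal contiguous chunks."""
          if n_subsets <= 0:
              raise ValueError("n_subsets must be positive")
          base, extra = divmod(len(init_times), n_subsets)
          chunks, start = [], 0
          for i in range(n_subsets):
              size = base + (1 if i < extra else 0)  # first `extra` chunks get one more
              chunks.append(init_times[start:start + size])
              start += size
          return chunks


      print(split_init_times(list(range(10)), 3))
      # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]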


.. py:function:: _load_pbs_config(config_path: str) -> dict

   Return the ``pbs:`` section from a YAML config file.


.. py:function:: _print_ensemble_rollout_plan(args: argparse.Namespace, n_jobs: int, n_forecasts: int, ensemble_size: int) -> None

   Print a human-readable summary of an ensemble rollout submission.


.. py:function:: _print_job_plan(args: argparse.Namespace, n_jobs: int) -> None

   Print a human-readable summary of what is about to be submitted.


.. py:function:: _qsub(script: str, save_loc: str | None = None) -> str

   Write *script* to ``<save_loc>/pbs_scripts/``, call ``qsub``, and return the job ID string.


.. py:function:: _realtime(args: argparse.Namespace) -> None

.. py:function:: _resolve_pbs_opts(args: argparse.Namespace, pbs_cfg: dict) -> argparse.Namespace

   Return a copy of *args* in which every field that is ``None`` is filled from *pbs_cfg*, falling back to the cluster defaults.
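
   The precedence (command line, then the ``pbs:`` config section, then cluster defaults) can be sketched as follows. The field names and default values in the usage lines are illustrative assumptions, and ``resolve_opts`` is a hypothetical stand-in, not the real function.

   .. code-block:: python

      import argparse
      import copy


      def resolve_opts(args, pbs_cfg, defaults):
          """Fill None-valued fields from pbs_cfg first, then from defaults."""
          resolved = copy.copy(args)  # shallow copy; the caller's namespace is untouched
          for key, value in vars(args).items():
              if value is not None:
                  continue  # explicit command-line values always win
              for source in (pbs_cfg, defaults):
                  if source.get(key) is not None:
                      setattr(resolved, key, source[key])
                      break
          return resolved


      cli = argparse.Namespace(gpus=None, nodes=2, walltime=None)
      out = resolve_opts(cli, {"gpus": 4}, {"gpus": 1, "nodes": 1, "walltime": "12:00:00"})
      print(out.gpus, out.nodes, out.walltime)  # 4 2 12:00:00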


.. py:function:: _rollout(args: argparse.Namespace) -> None

.. py:function:: _rollout_ensemble(args: argparse.Namespace) -> None

   Deprecated: use ``credit submit --mode rollout`` instead.


.. py:function:: _submit(args: argparse.Namespace) -> None

   Generate and optionally submit PBS batch scripts, with optional chaining.


.. py:function:: _train(args: argparse.Namespace) -> None