credit.cli#

CREDIT unified command-line interface.

Single entrypoint for training, rollout, job submission, and config generation.

Examples

    credit init --grid 0.25deg -o my_config.yml
    credit train -c config.yml
    credit realtime -c config.yml --init-time 2024-01-15T00 --steps 40
    credit rollout -c config.yml
    credit submit --cluster derecho -c config.yml --gpus 4 --nodes 2
    credit submit --cluster casper -c config.yml --mode rollout --jobs 10
    credit submit --cluster casper -c config.yml --mode realtime --init-time 2024-01-15T00

Attributes#

`_AGENT_SYSTEM_PROMPT`
`_AGENT_TOOL_DEFS`
`_CREDIT_SYSTEM_PROMPT`
`_AGENT_BASH_BLOCKLIST`
`_PBS_DEFAULTS`

Exceptions#

_ProviderError

Raised when a provider call fails in a way that should trigger fallback.

Functions#

`_agent`(→ None)	Run an agentic session: Claude reads files and runs commands to answer your question.
`_ask`(→ None)	Unified AI assistant: tries agentic mode first, falls back to simple chat.
`_collect_run_context`(→ str)	Gather config, training log, and recent PBS output for context injection.
`_agent_bash`(→ str)
`_agent_list_files`(→ str)
`_agent_read_file`(→ str)
`_dispatch_tool`(→ str)
`_find_torchrun`(→ str)	Return the path to torchrun, preferring the active conda env.
`_is_ncar_system`(→ bool)	Return True if running on a known NCAR HPC system (Casper or Derecho).
`_prompt`(→ str)	Print a prompt and return stripped input, or default if empty.
`_prompt_bool`(→ bool)
`_repo_root`(→ str)	Absolute path to the miles-credit repo root.
`_setup_logging`(→ None)
`_convert`(→ None)	Interactive v1 → v2 config converter.
`_init`(→ None)	Copy a config template to the user's desired location.
`_write_reload_config`(→ str)	Patch trainer reload fields and write a reload config next to the checkpoint.
`_build_parser`(→ argparse.ArgumentParser)
`main`(→ None)
`_build_channel_map`(conf)	Return a dict mapping variable name -> list of channel indices in the output tensor.
`_build_denorm_stats`(conf)	Return (mean_arr, std_arr) aligned with ERA5Dataset target channel order.
`_metrics`(→ None)	Run WeatherBench2-style evaluation and optionally generate scorecard plots.
`_plot`(→ None)	Load checkpoint, run one forward pass, produce global maps.
`_build_pbs_script`(→ str)	Return a PBS batch script string for the given args and config path.
`_build_realtime_pbs_script`(→ str)	Return a PBS script that runs a single realtime forecast.
`_build_rollout_pbs_script`(→ str)	Return a PBS script for one subset of an ensemble rollout.
`_compute_chain`(→ int)	Return the number of jobs to chain.
`_do_submit_realtime`(→ None)	Submit a single PBS job for a realtime forecast.
`_do_submit_rollout`(→ None)	Submit N parallel PBS rollout jobs to cover all init times.
`_load_pbs_config`(→ dict)	Return the `pbs:` section from a YAML config file.
`_print_ensemble_rollout_plan`(→ None)	Print a human-readable summary of an ensemble rollout submission.
`_print_job_plan`(→ None)	Print a human-readable summary of what is about to be submitted.
`_qsub`(→ str)	Write script to save_loc/pbs_scripts/, call qsub, and return the job ID string.
`_realtime`(→ None)
`_resolve_pbs_opts`(→ argparse.Namespace)	Return a copy of args with None fields filled from pbs_cfg then cluster defaults.
`_rollout`(→ None)
`_rollout_ensemble`(→ None)	Deprecated: use `credit submit --mode rollout` instead.
`_submit`(→ None)	Generate and optionally submit PBS batch scripts, with optional chaining.
`_train`(→ None)

Package Contents#

credit.cli._AGENT_SYSTEM_PROMPT = Multiline-String#

"""You are CREDIT-Agent, an agentic AI assistant for the CREDIT software package (Community Research Earth Digital Intelligence Twin),
an AI-based numerical weather prediction framework developed by the NCAR MILES group.
When introducing yourself, use the name "CREDIT-Agent". Do not call yourself "CREDIT" — that is the name of the software package you support.

You have access to tools that let you read files, list files, and run safe read-only shell commands.
Use them to investigate the user's question thoroughly before answering.

Typical tasks:
- Diagnose why a training run crashed (read PBS logs, config, Python tracebacks)
- Explain what a config option does (read the relevant source file)
- Suggest config changes based on the user's hardware and dataset
- Check whether a job is still running (qstat) and interpret its output
- Diff configs between two experiments

Guidelines:
- Always read relevant files before speculating — the answer is usually in the logs or config.
- When reading PBS output files (*.o*), focus on the last 100 lines first.
- Suggest concrete, actionable fixes — not generic advice.
- Keep responses concise; use markdown headers and code blocks.
"""

credit.cli._AGENT_TOOL_DEFS#

credit.cli._CREDIT_SYSTEM_PROMPT = Multiline-String#

"""You are CREDIT-Ask, an AI assistant for the CREDIT software package (Community Research Earth Digital Intelligence Twin),
an AI-based numerical weather prediction framework developed by the NCAR MILES group.
When introducing yourself, use the name "CREDIT-Ask". Do not call yourself "CREDIT" — that is the name of the software package you support.

## What CREDIT is
CREDIT trains deep learning models (primarily WXFormer) to forecast global atmospheric state.
It runs on NCAR HPC clusters: Casper (single-node, A100/H100 GPUs) and Derecho (multi-node, A100 GPUs).
The main entry point is the `credit` CLI.

## Key CLI commands
- `credit train -c config.yml`                    — start/resume training
- `credit submit --cluster casper|derecho -c config.yml [--mode train|rollout|realtime] --gpus N [--nodes N] [--chain N] [--reload] [--jobs N] [--init-time YYYY-MM-DDTHH] [--steps N]`
- `credit plot -c config.yml --field VAR_2T --denorm`   — quick visualisation from checkpoint
- `credit rollout -c config.yml`                  — batch forecast to NetCDF
- `credit realtime -c config.yml --init-time YYYY-MM-DDTHH --steps N`
- `credit init --grid 1deg|0.25deg -o my_config.yml`    — generate starter config
- `credit ask "..."`                              — this command

## v2 data schema (YAML)
```yaml
data:
  source:
    ERA5:
      levels: [...]          # pressure/model levels
      variables:
        prognostic:
          vars_3D: [U, V, T, Q, Z]    # each × n_levels channels
          vars_2D: [SP, VAR_2T, ...]
        diagnostic:
          vars_2D: [precip, evap, ...]
  mean_path: /path/to/mean.nc
  std_path:  /path/to/std.nc
```

## Trainer config
```yaml
trainer:
  type: era5-gen2
  mode: ddp           # none | ddp | fsdp
  train_batch_size: 8
  num_epoch: 5        # epochs per PBS job
  epochs: 70          # total training target
  thread_workers: 4   # DataLoader workers per GPU
  prefetch_factor: 4
  use_tensorboard: True
  use_ema: True
  ema_decay: 0.9999
  use_scheduler: True
  scheduler:
    scheduler_type: linear-warmup-cosine
    warmup_steps: 1000
    total_steps: 500000
    min_lr: 1.0e-5
  dataloader_timeout_s: 300   # preflight hang detection
```

## Cluster specifics
- **Casper**: 1 node, torchrun --standalone, GPUs: V100/A100/H100, queue: casper
  - Pre-built env: `/glade/u/home/schreck/.conda/envs/credit-casper`
- **Derecho single-node**: torchrun --standalone (NOT mpiexec)
- **Derecho multi-node**: mpiexec + torchrun --rdzv-backend=c10d
  - Pre-built env: `/glade/work/benkirk/conda-envs/credit-derecho-torch28-nccl221`
- Data root: `/glade/campaign/cisl/aiml/ksha/CREDIT_data/`
- Default save_loc: `/glade/derecho/scratch/$USER/CREDIT_runs/`

## Common problems and fixes
| Symptom | Likely cause | Fix |
|---------|-------------|-----|
| Training loop hangs on startup | DataLoader OOM (too many workers × prefetch × batch × channels) | Reduce `thread_workers` to 1 or 0, or `prefetch_factor` to 1 |
| `RendezvousConnectionError` on Derecho | Single-node job using c10d rendezvous | Use `--nodes 1` so `credit submit` generates `--standalone` |
| Loss > 100 or growing | Bad normalization or wrong data paths | Check `mean_path`/`std_path`; run `credit plot --denorm` |
| Loss stuck (not decreasing) | LR too low/high, wrong scheduler, EMA misconfigured | Check scheduler config; try reducing LR 10×; check warmup_steps |
| `KeyError: 'linear-warmup-cosine'` | Old CREDIT version | `pip install -e . --no-deps` to reinstall |
| Checkpoint not found | Wrong `save_loc` or first epoch | Set `load_weights: False` for first run |
| PBS job cancelled after failure | Normal: `afterok` chain auto-cancels remaining jobs | Use `credit submit --reload --chain N` to restart |
| FSDP + EMA slow | EMA does extra full-param sync on FSDP | Use `use_ema: False` with FSDP or accept overhead |

## How --chain works (train mode)
`--chain N` submits N PBS jobs with afterok dependencies. Job 1 runs fresh (or --reload).
Jobs 2..N auto-generate `config_reload.yml` and resume from checkpoint.
Rule of thumb: chain = ceil(total_epochs / num_epoch). E.g., 70 epochs / 5 per job = 14.

## submit --mode options
- `--mode train` (default): training job, supports --chain and --reload
- `--mode rollout`: N parallel jobs covering all init times, use --jobs N; reads predict: section
- `--mode realtime`: single forecast job, requires --init-time YYYY-MM-DDTHH and --steps N

## What healthy training looks like
- After epoch 1: train_loss ≈ 1–3 (order 1)
- Loss should decrease steadily each epoch
- Validation loss should track training loss (not diverge)
- `credit plot -c config.yml --field VAR_2T --denorm` should show recognisable weather patterns after ~10 epochs

Be concise, specific, and actionable. When referencing config keys use inline code. If you see a training log or config in the context, use it to give run-specific advice.
"""

exception credit.cli._ProviderError#

Bases: Exception

Raised when a provider call fails in a way that should trigger fallback.

credit.cli._agent(args) → None#: Run an agentic session: Claude reads files and runs commands to answer your question.

credit.cli._ask(args) → None#: Unified AI assistant: tries agentic mode first, falls back to simple chat.
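
The "agentic first, simple chat on failure" behaviour described here can be illustrated with a minimal sketch. This is not the CREDIT implementation: `ProviderError` is a local stand-in for `_ProviderError`, and `ask_with_fallback`, `run_agent`, and `run_chat` are hypothetical names introduced only for this example.

```python
class ProviderError(Exception):
    """Local stand-in for credit.cli._ProviderError (re-declared for the sketch)."""


def ask_with_fallback(question, run_agent, run_chat):
    """Try the agentic handler first; fall back to plain chat on ProviderError.

    run_agent and run_chat are hypothetical callables taking the question
    string and returning the answer string.
    """
    try:
        return run_agent(question)
    except ProviderError:
        # Only provider failures trigger the fallback; any other
        # exception propagates to the caller unchanged.
        return run_chat(question)
```

Catching only the dedicated exception type keeps genuine bugs visible instead of silently routing them to the simpler mode.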

credit.cli._collect_run_context(args) → str#: Gather config, training log, and recent PBS output for context injection.

credit.cli._AGENT_BASH_BLOCKLIST = ('rm ', 'rmdir', 'mv ', 'cp ', '> ', '>>', 'tee ', 'dd ', 'mkfs', 'chmod', 'chown', 'curl',...#

credit.cli._PBS_DEFAULTS#

credit.cli._agent_bash(command: str) → str#
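
One plausible way a substring blocklist such as `_AGENT_BASH_BLOCKLIST` could gate commands is sketched below. The check itself is an assumption, since the body of `_agent_bash` is not shown on this page; `BLOCKLIST_SUBSET` and `is_blocked` are hypothetical names, and the tuple contains only the entries visible in the truncated listing above.

```python
# Only the entries visible in the truncated listing; the real tuple is longer.
BLOCKLIST_SUBSET = (
    "rm ", "rmdir", "mv ", "cp ", "> ", ">>", "tee ", "dd ",
    "mkfs", "chmod", "chown", "curl",
)


def is_blocked(command, blocklist=BLOCKLIST_SUBSET):
    """Return True if any blocklisted substring occurs anywhere in the command."""
    return any(token in command for token in blocklist)
```

Plain substring matching is deliberately conservative: a harmless command such as `cat confirm notes.txt` is rejected because `confirm ` contains `rm `. A stricter design would tokenize the command first, at the cost of more code.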

credit.cli._agent_list_files(pattern: str) → str#

credit.cli._agent_read_file(path: str, tail: int = 400) → str#

credit.cli._dispatch_tool(name: str, tool_input: dict) → str#

credit.cli._find_torchrun() → str#: Return the path to torchrun, preferring the active conda env.
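
A minimal sketch of "prefer the active conda env" lookup follows. It assumes the active environment is identified by the `CONDA_PREFIX` variable (which `conda activate` sets) and that a bare `torchrun` is returned when nothing is found; both are assumptions about behaviour not documented here, and `find_torchrun` with its `env` parameter is a hypothetical signature.

```python
import os
import shutil


def find_torchrun(env=None):
    """Return a torchrun path, checking $CONDA_PREFIX/bin before PATH."""
    env = os.environ if env is None else env
    prefix = env.get("CONDA_PREFIX")
    if prefix:
        candidate = os.path.join(prefix, "bin", "torchrun")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    # Fall back to a PATH lookup; return the bare name if that also fails
    # (assumed behaviour, so the shell can still attempt to resolve it).
    return shutil.which("torchrun", path=env.get("PATH")) or "torchrun"
```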

credit.cli._is_ncar_system() → bool#: Return True if running on a known NCAR HPC system (Casper or Derecho).

credit.cli._prompt(prompt: str, default=None) → str#: Print a prompt and return stripped input, or default if empty.
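
The documented behaviour (stripped input, or the default when the reply is empty) could be sketched as below. The `read` parameter and the `[default]` suffix are assumptions added to make the sketch testable and readable; they are not part of the signature above.

```python
def prompt(message, default=None, read=input):
    """Ask for input; return the stripped reply, or default when it is empty.

    read is a hypothetical injection point so the function can be tested
    without a terminal; it defaults to the built-in input().
    """
    suffix = f" [{default}]" if default is not None else ""
    reply = read(f"{message}{suffix}: ").strip()
    return reply if reply else default
```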

credit.cli._prompt_bool(prompt: str, default: bool = True) → bool#

credit.cli._repo_root() → str#: Absolute path to the miles-credit repo root.

credit.cli._setup_logging(level: int = logging.INFO) → None#

credit.cli._convert(args: argparse.Namespace) → None#: Interactive v1 → v2 config converter.

credit.cli._init(args: argparse.Namespace) → None#: Copy a config template to the user’s desired location.

credit.cli._write_reload_config(config_path: str) → str#

Patch trainer reload fields and write a reload config next to the checkpoint.

Reads the YAML at config_path, sets the five fields required for a clean resume, and writes the result to <save_loc>/config_reload.yml.

Returns the path to the written reload config.

credit.cli._build_parser() → argparse.ArgumentParser#

credit.cli.main() → None#

credit.cli._build_channel_map(conf)#: Return a dict mapping variable name -> list of channel indices in the output tensor.
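
To illustrate what such a mapping might contain, here is a sketch built on the v2 data schema quoted in `_CREDIT_SYSTEM_PROMPT`, where each 3D variable occupies one channel per level and each 2D variable occupies one channel. The channel ordering (prognostic 3D, then prognostic 2D, then diagnostic 2D) is an assumption, and `build_channel_map` here is a hypothetical re-implementation rather than the function documented above.

```python
def build_channel_map(conf):
    """Map each variable name to its channel indices (assumed ordering)."""
    era5 = conf["data"]["source"]["ERA5"]
    n_levels = len(era5["levels"])
    variables = era5["variables"]
    channel_map, start = {}, 0
    # Assumed order: prognostic 3D, prognostic 2D, then diagnostic 2D.
    for name in variables.get("prognostic", {}).get("vars_3D", []):
        # Each 3D variable takes n_levels consecutive channels.
        channel_map[name] = list(range(start, start + n_levels))
        start += n_levels
    for group in ("prognostic", "diagnostic"):
        for name in variables.get(group, {}).get("vars_2D", []):
            # Each 2D variable takes a single channel.
            channel_map[name] = [start]
            start += 1
    return channel_map
```

With two levels, 3D variables `U` and `V`, prognostic 2D variable `SP`, and diagnostic `precip`, this sketch assigns `U` channels 0 and 1, `V` channels 2 and 3, `SP` channel 4, and `precip` channel 5.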

credit.cli._build_denorm_stats(conf)#: Return (mean_arr, std_arr) aligned with ERA5Dataset target channel order.

credit.cli._metrics(args) → None#: Run WeatherBench2-style evaluation and optionally generate scorecard plots.

credit.cli._plot(args) → None#: Load checkpoint, run one forward pass, produce global maps.

credit.cli._build_pbs_script(args: argparse.Namespace, config: str, repo: str, account: str = None, depend_on: str = None, save_loc: str = None) → str#: Return a PBS batch script string for the given args and config path.

credit.cli._build_realtime_pbs_script(args: argparse.Namespace, config: str, repo: str, init_time: str, steps: int, save_loc: str = None) → str#: Return a PBS script that runs a single realtime forecast.

credit.cli._build_rollout_pbs_script(args: argparse.Namespace, config: str, repo: str, subset: int, n_subsets: int, save_loc: str = None) → str#: Return a PBS script for one subset of an ensemble rollout.
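
The `subset` and `n_subsets` parameters suggest that the init times are partitioned across jobs. How CREDIT actually partitions them is not stated here; the sketch below shows one common approach (contiguous, near-equal chunks) under that assumption, with `split_init_times` as a hypothetical helper name.

```python
def split_init_times(init_times, n_subsets):
    """Split init times into n_subsets contiguous, near-equal chunks."""
    if n_subsets < 1:
        raise ValueError("n_subsets must be >= 1")
    base, extra = divmod(len(init_times), n_subsets)
    chunks, start = [], 0
    for i in range(n_subsets):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if i < extra else 0)
        chunks.append(init_times[start:start + size])
        start += size
    return chunks
```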

credit.cli._compute_chain(args: argparse.Namespace) → int#: Return the number of jobs to chain.
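
The rule of thumb quoted in `_CREDIT_SYSTEM_PROMPT`, chain = ceil(total_epochs / num_epoch), can be written out as a short sketch. `chain_length` is a hypothetical helper; whether `_compute_chain` uses exactly this formula is not confirmed by this page.

```python
def chain_length(total_epochs, epochs_per_job):
    """Number of PBS jobs needed to cover total_epochs at epochs_per_job each."""
    if epochs_per_job < 1:
        raise ValueError("epochs_per_job must be >= 1")
    # Integer ceiling division avoids floating-point rounding.
    return max(1, -(-total_epochs // epochs_per_job))
```

For the values used in the trainer config example (`epochs: 70`, `num_epoch: 5`), `chain_length(70, 5)` gives 14, matching the figure in the prompt.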

credit.cli._do_submit_realtime(args: argparse.Namespace) → None#: Submit a single PBS job for a realtime forecast.

credit.cli._do_submit_rollout(args: argparse.Namespace) → None#: Submit N parallel PBS rollout jobs to cover all init times.

credit.cli._load_pbs_config(config_path: str) → dict#: Return the `pbs:` section from a YAML config file.

credit.cli._print_ensemble_rollout_plan(args: argparse.Namespace, n_jobs: int, n_forecasts: int, ensemble_size: int) → None#: Print a human-readable summary of an ensemble rollout submission.

credit.cli._print_job_plan(args: argparse.Namespace, n_jobs: int) → None#: Print a human-readable summary of what is about to be submitted.

credit.cli._qsub(script: str, save_loc: str | None = None) → str#: Write script to save_loc/pbs_scripts/, call qsub, and return the job ID string.

credit.cli._realtime(args: argparse.Namespace) → None#

credit.cli._resolve_pbs_opts(args: argparse.Namespace, pbs_cfg: dict) → argparse.Namespace#: Return a copy of args with None fields filled from pbs_cfg then cluster defaults.
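
The precedence described here (explicit CLI value, then the `pbs:` section, then cluster defaults) could be sketched as follows. The contents of `_PBS_DEFAULTS` are not shown on this page, so `EXAMPLE_DEFAULTS` and its values are purely hypothetical, as is the `resolve_opts` name.

```python
import argparse

# Hypothetical defaults for illustration only; not the real _PBS_DEFAULTS.
EXAMPLE_DEFAULTS = {"casper": {"gpus": 1, "walltime": "12:00:00"}}


def resolve_opts(args, pbs_cfg, defaults=EXAMPLE_DEFAULTS):
    """Return a copy of args with None fields filled from pbs_cfg, then defaults."""
    resolved = argparse.Namespace(**vars(args))  # shallow copy; args is untouched
    cluster_defaults = defaults.get(getattr(args, "cluster", None), {})
    for key, value in vars(args).items():
        if value is None:
            # Precedence: CLI value > pbs: section > cluster default.
            setattr(resolved, key, pbs_cfg.get(key, cluster_defaults.get(key)))
    return resolved
```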

credit.cli._rollout(args: argparse.Namespace) → None#

credit.cli._rollout_ensemble(args: argparse.Namespace) → None#: Deprecated: use `credit submit --mode rollout` instead.

credit.cli._submit(args: argparse.Namespace) → None#: Generate and optionally submit PBS batch scripts, with optional chaining.
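
The chaining described in `_CREDIT_SYSTEM_PROMPT` (N jobs linked by `afterok` dependencies) can be sketched as a loop in which each job depends on the previous job's ID. `submit_chain` and its `submit` callable are hypothetical; in a real run, `submit` would wrap something like `qsub -W depend=afterok:<jobid>`, which is standard PBS syntax.

```python
def submit_chain(n_jobs, submit):
    """Submit n_jobs in sequence, each depending on the previous via afterok.

    submit is a hypothetical callable that takes a dependency string
    (or None for the first job) and returns the new job's ID.
    """
    job_ids, previous = [], None
    for _ in range(n_jobs):
        depend = f"afterok:{previous}" if previous else None
        previous = submit(depend)
        job_ids.append(previous)
    return job_ids
```

Because `afterok` only releases a job when its predecessor exits successfully, a failure anywhere in the chain leaves the remaining jobs unrun, which is the behaviour the troubleshooting table in the prompt describes.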

credit.cli._train(args: argparse.Namespace) → None#