Quickstart#

Get from zero to a running training job in under 10 minutes. This page covers the full loop — install, configure, submit, monitor, visualise, get help. Every command is copy-pasteable.

1. Set up your environment#

Note

NCAR users on Casper — pre-built environment, no conda create needed:

conda activate /glade/campaign/cisl/aiml/credit/conda_envs/credit-casper

git clone https://github.com/NCAR/miles-credit.git
cd miles-credit
# Installs credit into the .local directory in your home directory.
pip install --user -e .

Note

NCAR users on Derecho:

conda activate /glade/campaign/cisl/aiml/credit/conda_envs/credit-derecho

git clone https://github.com/NCAR/miles-credit.git
cd miles-credit
pip install --user -e .

Note

Other systems:

conda create -n credit python=3.12
conda activate credit
pip install miles-credit

Or install the main development branch:

git clone https://github.com/NCAR/miles-credit.git
cd miles-credit
pip install -e .

Verify the install worked:

credit --help

More detail: Installation | Getting Started

2. Generate a config#

CREDIT ships with ready-to-use configs for ERA5. Pick your resolution:

# 1-degree ERA5 — good starting point, fast to train
credit init --grid 1deg -o my_run.yml

# 0.25-degree ERA5 — full resolution, needs more memory and time
credit init --grid 0.25deg -o my_run.yml

Note

NCAR users: data paths in these configs already point to /glade/campaign/cisl/aiml/ksha/CREDIT_data/ — readable by all NCAR staff. save_loc defaults to /glade/derecho/scratch/$USER/CREDIT_runs/... No edits required to get started.

Open my_run.yml and find the # USER SETTINGS block. The only things you may want to change before your first run:

Field	Default	Notes
`trainer.num_epoch`	`5`	Epochs per PBS job. Increase if walltime allows.
`trainer.train_batch_size`	`8`	Per-GPU. Reduce if you hit OOM.
`save_loc`	scratch dir	Where checkpoints and logs are written.

More detail: Config reference | Training guide

3. Submit a training job#

Submit#

credit submit computes the chain length from trainer.epochs / trainer.num_epoch in your config (total epochs divided by epochs per job), so you don’t need to calculate it yourself.

# Casper — chain computed automatically from config
credit submit --cluster casper  -c my_run.yml --gpus 4

# Derecho — 1 node × 4 GPUs
credit submit --cluster derecho -c my_run.yml --gpus 4 --nodes 1

# Derecho — multi-node (e.g. 4 nodes × 4 GPUs = 16 GPUs total)
credit submit --cluster derecho -c my_run.yml --gpus 4 --nodes 4
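Because train_batch_size is per GPU, the number of samples per optimizer step grows with the GPU count. The sketch below works this out for the three layouts above. It assumes plain data-parallel training with no gradient accumulation, and that Casper jobs run on a single node; `global_batch_size` is a hypothetical helper, not part of CREDIT.

```python
def global_batch_size(per_gpu_batch: int, gpus_per_node: int, nodes: int = 1) -> int:
    """Samples per optimizer step, assuming plain data parallelism
    (each GPU processes its own batch, no gradient accumulation)."""
    if min(per_gpu_batch, gpus_per_node, nodes) < 1:
        raise ValueError("all arguments must be positive integers")
    return per_gpu_batch * gpus_per_node * nodes


# Default train_batch_size of 8 on the three layouts shown above
print(global_batch_size(8, gpus_per_node=4))           # Casper, 4 GPUs
print(global_batch_size(8, gpus_per_node=4, nodes=1))  # Derecho, 1 node x 4 GPUs
print(global_batch_size(8, gpus_per_node=4, nodes=4))  # Derecho, 4 nodes x 4 GPUs
```

Under these assumptions, moving from 1 node to 4 nodes quadruples the effective batch size, which may call for revisiting the learning rate.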

Before submitting, credit submit always prints a job plan:

====================================================
  Job plan
====================================================
  Cluster  : casper
  Config   : my_run.yml
  GPUs     : 4 GPU(s)
  Walltime : 12:00:00 per job
  Chain    : 14 jobs  (70 epochs ÷ 5 per job)
  DataLoader memory est. : ~8 GB
====================================================

If the DataLoader memory estimate exceeds 24 GB, credit submit warns you to reduce thread_workers or prefetch_factor. Without that change the job may hang silently.
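The Chain line in the job plan is total epochs divided by epochs per job. The sketch below reproduces that arithmetic. Rounding up when the division is not exact is an assumption here, not confirmed behaviour of credit submit, and `chain_length` is a hypothetical helper name.

```python
import math


def chain_length(total_epochs: int, epochs_per_job: int) -> int:
    """Number of chained jobs needed to cover total_epochs.

    Assumption: a partial final job still counts as one job (round up).
    """
    if total_epochs < 1 or epochs_per_job < 1:
        raise ValueError("epoch counts must be positive integers")
    return math.ceil(total_epochs / epochs_per_job)


# trainer.epochs = 70, trainer.num_epoch = 5, as in the job plan above
print(chain_length(70, 5))
```

For the values in the example plan this gives 14, matching the Chain line.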

Override the chain length manually if needed:

credit submit --cluster casper -c my_run.yml --gpus 4 --chain 5

Preview the full PBS script without submitting:

credit submit --cluster casper -c my_run.yml --gpus 4 --dry-run

Job 1 is submitted right away; jobs 2–N are queued with a PBS afterok dependency, so each one starts automatically once the previous job succeeds.
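For readers unfamiliar with afterok, the sketch below prints the kind of qsub commands a dependency chain consists of. It illustrates the standard PBS `-W depend=afterok:<jobid>` option only; it is not the code credit submit runs. The script name `train.pbs` and the `JOBID_n` placeholders are hypothetical (a real chain would use the job ID that qsub prints for the previous submission).

```python
from typing import List, Optional


def qsub_command(script: str, previous_job_id: Optional[str] = None) -> List[str]:
    """Build a qsub argument list, adding an afterok dependency if needed."""
    cmd = ["qsub"]
    if previous_job_id is not None:
        # Run only after the previous job exits successfully
        cmd += ["-W", f"depend=afterok:{previous_job_id}"]
    cmd.append(script)
    return cmd


def preview_chain(script: str, n_jobs: int) -> List[str]:
    """Return the commands for an n_jobs-long chain as display strings."""
    lines = []
    previous = None
    for index in range(1, n_jobs + 1):
        lines.append(" ".join(qsub_command(script, previous)))
        previous = f"JOBID_{index}"  # placeholder for the ID qsub would print
    return lines


for line in preview_chain("train.pbs", 3):
    print(line)
```

The first command has no dependency; every later command waits on the job before it, which is why one failure cancels the rest of the chain.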

Resuming a failed chain#

If a job fails mid-run (preemption, node fault), the remaining afterok jobs are cancelled by PBS. Restart from the last good checkpoint:

credit submit --cluster derecho -c my_run.yml --gpus 4 --nodes 1 --reload --chain 5

--reload patches the config to set load_weights: True and all related reload flags automatically — no manual YAML editing required.

More detail: Training guide | credit submit --help

4. Monitor progress#

Training log#

The trainer appends a row to a CSV log after every epoch:

# Quick check: last 5 epochs
tail -5 /glade/derecho/scratch/$USER/CREDIT_runs/my_run/training_log.csv

Columns: epoch, train_loss, val_loss, lr, epoch_time_s.

What healthy training looks like:

After epoch 1: train_loss ≈ 1–3
Loss should decrease steadily each epoch
val_loss should track train_loss (not diverge)
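The last two checks can be automated with a short script. This is a minimal sketch, assuming the CSV header uses exactly the column names listed above and that losses are positive. The 1.5 ratio used to flag divergence is an arbitrary illustrative threshold, and `load_losses` and `check_health` are hypothetical helpers, not CREDIT functions.

```python
import csv
from typing import List, Tuple


def load_losses(path: str) -> Tuple[List[float], List[float]]:
    """Read train_loss and val_loss columns from a training_log.csv file."""
    train, val = [], []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            train.append(float(row["train_loss"]))
            val.append(float(row["val_loss"]))
    return train, val


def check_health(train: List[float], val: List[float],
                 max_gap_ratio: float = 1.5) -> List[str]:
    """Return warnings; an empty list means neither check fired.

    max_gap_ratio is an illustrative threshold, not a CREDIT default.
    """
    if len(train) < 2 or len(val) < 2:
        return ["need at least two epochs to judge a trend"]
    warnings = []
    if train[-1] >= train[0]:
        warnings.append("train_loss has not decreased since the first epoch")
    if val[-1] > max_gap_ratio * train[-1]:
        warnings.append("val_loss is diverging from train_loss")
    return warnings
```

Call `check_health(*load_losses(path))` with the path to your training_log.csv; treat the result as a prompt to look at the curves, not as a verdict.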

TensorBoard#

tensorboard --logdir /glade/derecho/scratch/$USER/CREDIT_runs/my_run/tensorboard

Then open http://localhost:6006 in your browser. On HPC you will need SSH port-forwarding — see Monitoring with TensorBoard.

5. Visualise a prediction#

Once at least one checkpoint exists, run a forward pass and produce a 3-panel global map (truth | prediction | difference) for any field:

# Denormalised to physical units (K for temperature, Pa for pressure)
credit plot -c my_run.yml --field VAR_2T --denorm

# Multiple fields at once
credit plot -c my_run.yml --field VAR_2T SP VAR_10U --denorm

# Specific pressure level (index into your levels list)
credit plot -c my_run.yml --field U --level 5 --denorm

Plots are saved to <save_loc>/plots/. No GPU required — runs on CPU.

What to look for:

What you see	Meaning
Recognisable weather patterns after ~10 epochs	Training is going well
Uniform grey prediction	Too few epochs, or LR/normalisation problem
Loss > 100 or growing	Check `mean_path` / `std_path` in config
Small smooth difference map	Model is converging correctly

More detail: credit plot --help

6. Get help from the AI assistant#

credit ask is a unified AI assistant. It runs in agent mode (reads files, runs commands, iterates) when an Anthropic API key is set, and falls back to simple chat (OpenAI, Google Gemini, or Groq) otherwise.

pip install "miles-credit[ask]"

# Set whichever key you have — free options work well for quick questions:
export GROQ_API_KEY=gsk_...           # https://console.groq.com       (free, no card needed)
export GOOGLE_API_KEY=AIza...         # https://aistudio.google.com    (free)
export OPENAI_API_KEY=sk-...          # https://platform.openai.com
export ANTHROPIC_API_KEY=sk-ant-...   # https://console.anthropic.com  (enables agent mode)

credit ask "how do I resume a failed Derecho job?"
credit ask -c my_run.yml "my loss stopped decreasing at epoch 12, what should I check?"

Provider	Env var	Mode	Cost
Anthropic	`ANTHROPIC_API_KEY`	Agent (multi-turn, reads files)	~$0.01–0.05/session
OpenAI	`OPENAI_API_KEY`	Simple chat	Pay-per-use
Google	`GOOGLE_API_KEY`	Simple chat	Free
Groq	`GROQ_API_KEY`	Simple chat	Free tier (no card needed)

Priority when multiple keys are set: Anthropic agent → OpenAI → Google → Groq.
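The priority order amounts to a first-match lookup over environment variables. The sketch below shows one way to express it; it mirrors the table above but is not the actual credit ask implementation, and `pick_provider` is a hypothetical name.

```python
import os
from typing import Mapping, Optional, Tuple

# (environment variable, provider, mode), highest priority first
PROVIDER_PRIORITY = [
    ("ANTHROPIC_API_KEY", "Anthropic", "agent"),
    ("OPENAI_API_KEY", "OpenAI", "chat"),
    ("GOOGLE_API_KEY", "Google", "chat"),
    ("GROQ_API_KEY", "Groq", "chat"),
]


def pick_provider(env: Optional[Mapping[str, str]] = None) -> Optional[Tuple[str, str]]:
    """Return (provider, mode) for the first key that is set, else None."""
    env = os.environ if env is None else env
    for variable, provider, mode in PROVIDER_PRIORITY:
        if env.get(variable):  # unset or empty values are skipped
            return provider, mode
    return None


# With both a Groq and an OpenAI key set, OpenAI wins (placeholder key values)
print(pick_provider({"GROQ_API_KEY": "placeholder", "OPENAI_API_KEY": "placeholder"}))
```

Passing a plain dictionary instead of reading `os.environ` keeps the function easy to check without touching your shell environment.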

# Agent mode: reads your PBS log, config, and source to give a specific answer
credit ask -c my_run.yml "why did my training run crash?"
credit ask -c my_run.yml "review this config before I start a 200-epoch run on 8 H100s"
credit ask "what PBS jobs are running and how much walltime do they have left?"

See the full AI Assistant documentation for all examples, options, and cost details.

Common problems#

Symptom	Fix
Training hangs on startup, no error	DataLoader is using too much RAM. Set `thread_workers: 1` and `prefetch_factor: 1` in your config.
`RendezvousConnectionError` on Derecho	Use `--nodes 1` so the job gets `torchrun --standalone` instead of MPI rendezvous.
`ANTHROPIC_API_KEY is not set`	Run `export ANTHROPIC_API_KEY=sk-ant-...` or add it to `~/.bashrc`.
PBS chain cancelled after job failure	Expected — PBS `afterok` cancels remaining jobs. Use `--reload --chain N` to restart.
Checkpoint not found on first run	Expected, since no checkpoint exists yet. Make sure `load_weights: False` (the default) is set in your config.
Out of GPU memory	Reduce `trainer.train_batch_size`. For 0.25° start with `train_batch_size: 1`.

What’s next#

Goal	Where to go
Understand every config field	Config reference
Multi-node training details	Training guide
Run a forecast from a trained model	Inference guide
Serve forecasts over HTTP	Forecast API Server
Set up TensorBoard on HPC	TensorBoard
Evaluate your model against baselines	Evaluation
Use a custom dataset	Dataset structure
Add a new model architecture	Model architectures
