Quickstart#

Get from zero to a running training job in under 10 minutes. This page covers the full loop — install, configure, submit, monitor, visualise, get help. Every command is copy-pasteable.


1. Set up your environment#

Note

NCAR users on Casper — pre-built environment, no conda create needed:

conda activate /glade/campaign/cisl/aiml/credit/conda_envs/credit-casper

git clone https://github.com/NCAR/miles-credit.git
cd miles-credit
# Installs credit into the .local directory in your home directory.
pip install --user -e .

Note

NCAR users on Derecho:

conda activate /glade/campaign/cisl/aiml/credit/conda_envs/credit-derecho

git clone https://github.com/NCAR/miles-credit.git
cd miles-credit
pip install --user -e .

Note

Other systems:

conda create -n credit python=3.12
conda activate credit
pip install miles-credit

Or install the main development branch:

git clone https://github.com/NCAR/miles-credit.git
cd miles-credit
pip install -e .

Verify the install worked:

credit --help

2. Generate a config#

CREDIT ships with ready-to-use configs for ERA5. Pick your resolution:

# 1-degree ERA5 — good starting point, fast to train
credit init --grid 1deg -o my_run.yml

# 0.25-degree ERA5 — full resolution, needs more memory and time
credit init --grid 0.25deg -o my_run.yml

Note

NCAR users: data paths in these configs already point to /glade/campaign/cisl/aiml/ksha/CREDIT_data/ — readable by all NCAR staff. save_loc defaults to /glade/derecho/scratch/$USER/CREDIT_runs/... No edits required to get started.

Open my_run.yml and find the # USER SETTINGS block. The only things you may want to change before your first run:

Field

Default

Notes

trainer.num_epoch

5

Epochs per PBS job. Increase if walltime allows.

trainer.train_batch_size

8

Per-GPU. Reduce if you hit OOM.

save_loc

scratch dir

Where checkpoints and logs are written.


3. Submit a training job#

Submit#

credit submit automatically figures out how many jobs to chain from trainer.epochs / trainer.num_epoch in your config — you don’t need to calculate it yourself.

# Casper — chain computed automatically from config
credit submit --cluster casper  -c my_run.yml --gpus 4

# Derecho — 1 node × 4 GPUs
credit submit --cluster derecho -c my_run.yml --gpus 4 --nodes 1

# Derecho — multi-node (e.g. 4 nodes × 4 GPUs = 16 GPUs total)
credit submit --cluster derecho -c my_run.yml --gpus 4 --nodes 4

Before submitting, credit submit always prints a job plan:

====================================================
  Job plan
====================================================
  Cluster  : casper
  Config   : my_run.yml
  GPUs     : 4 GPU(s)
  Walltime : 12:00:00 per job
  Chain    : 14 jobs  (70 epochs ÷ 5 per job)
  DataLoader memory est. : ~8 GB
====================================================

If the memory estimate is high (> 24 GB) it will warn you to reduce thread_workers or prefetch_factor before the job hangs silently.

Override the chain length manually if needed:

credit submit --cluster casper -c my_run.yml --gpus 4 --chain 5

Preview the full PBS script without submitting:

credit submit --cluster casper -c my_run.yml --gpus 4 --dry-run

Job 1 starts immediately; jobs 2–N are queued with PBS afterok and start automatically when the previous job succeeds.

Resuming a failed chain#

If a job fails mid-run (preemption, node fault), the remaining afterok jobs are cancelled by PBS. Restart from the last good checkpoint:

credit submit --cluster derecho -c my_run.yml --gpus 4 --nodes 1 --reload --chain 5

--reload patches the config to set load_weights: True and all related reload flags automatically — no manual YAML editing required.

More detail: Training guide | credit submit --help


4. Monitor progress#

Training log#

The trainer writes a CSV after every epoch:

# Quick check: last 5 epochs
tail -5 /glade/derecho/scratch/$USER/CREDIT_runs/my_run/training_log.csv

Columns: epoch, train_loss, val_loss, lr, epoch_time_s.

What healthy training looks like:

  • After epoch 1: train_loss ≈ 1–3

  • Loss should decrease steadily each epoch

  • val_loss should track train_loss (not diverge)

TensorBoard#

tensorboard --logdir /glade/derecho/scratch/$USER/CREDIT_runs/my_run/tensorboard

Then open http://localhost:6006 in your browser. On HPC you will need SSH port-forwarding — see Monitoring with TensorBoard.


5. Visualise a prediction#

Once at least one checkpoint exists, run a forward pass and produce a 3-panel global map (truth | prediction | difference) for any field:

# Denormalised to physical units (K for temperature, Pa for pressure)
credit plot -c my_run.yml --field VAR_2T --denorm

# Multiple fields at once
credit plot -c my_run.yml --field VAR_2T SP VAR_10U --denorm

# Specific pressure level (index into your levels list)
credit plot -c my_run.yml --field U --level 5 --denorm

Plots are saved to <save_loc>/plots/. No GPU required — runs on CPU.

What to look for:

What you see

Meaning

Recognisable weather patterns after ~10 epochs

Training is going well

Uniform grey prediction

Too few epochs, or LR/normalisation problem

Loss > 100 or growing

Check mean_path / std_path in config

Small smooth difference map

Model is converging correctly

More detail: credit plot --help


6. Get help from the AI assistant#

credit ask is a unified AI assistant — it automatically runs in agent mode (reads files, runs commands, iterates) when Anthropic is available, or falls back to simple chat (Groq, Gemini, OpenAI) otherwise.

pip install "miles-credit[ask]"

# Set whichever key you have — free options work well for quick questions:
export GROQ_API_KEY=gsk_...           # https://console.groq.com       (free, no card needed)
export GOOGLE_API_KEY=AIza...         # https://aistudio.google.com    (free)
export OPENAI_API_KEY=sk-...          # https://platform.openai.com
export ANTHROPIC_API_KEY=sk-ant-...   # https://console.anthropic.com  (enables agent mode)

credit ask "how do I resume a failed Derecho job?"
credit ask -c my_run.yml "my loss stopped decreasing at epoch 12, what should I check?"

Provider

Env var

Mode

Cost

Anthropic

ANTHROPIC_API_KEY

Agent (multi-turn, reads files)

~$0.01–0.05/session

OpenAI

OPENAI_API_KEY

Simple chat

Pay-per-use

Google

GOOGLE_API_KEY

Simple chat

Free

Groq

GROQ_API_KEY

Simple chat

Free tier (no card needed)

Priority when multiple keys are set: Anthropic agent → OpenAI → Google → Groq.

# Agent mode: reads your PBS log, config, and source to give a specific answer
credit ask -c my_run.yml "why did my training run crash?"
credit ask -c my_run.yml "review this config before I start a 200-epoch run on 8 H100s"
credit ask "what PBS jobs are running and how much walltime do they have left?"

See the full AI Assistant documentation for all examples, options, and cost details.


Common problems#

Symptom

Fix

Training hangs on startup, no error

DataLoader is using too much RAM. Set thread_workers: 1 and prefetch_factor: 1 in your config.

RendezvousConnectionError on Derecho

Use --nodes 1 so the job gets torchrun --standalone instead of MPI rendezvous.

ANTHROPIC_API_KEY is not set

Run export ANTHROPIC_API_KEY=sk-ant-... or add it to ~/.bashrc.

PBS chain cancelled after job failure

Expected — PBS afterok cancels remaining jobs. Use --reload --chain N to restart.

Checkpoint not found on first run

Normal — set load_weights: False in config (the default).

Out of GPU memory

Reduce train_batch_size. For 0.25° start with train_batch_size: 1.


What’s next#

Goal

Where to go

Understand every config field

Config reference

Multi-node training details

Training guide

Run a forecast from a trained model

Inference guide

Serve forecasts over HTTP

Forecast API Server

Set up TensorBoard on HPC

TensorBoard

Evaluate your model against baselines

Evaluation

Use a custom dataset

Dataset structure

Add a new model architecture

Model architectures