Quickstart
Get from zero to a running training job in under 10 minutes. This page covers the full loop: install, configure, submit, monitor, visualise, and get help. Every command is copy-pasteable.
1. Set up your environment
Note
NCAR users on Casper — pre-built environment, no conda create needed:
conda activate /glade/campaign/cisl/aiml/credit/conda_envs/credit-casper
git clone https://github.com/NCAR/miles-credit.git
cd miles-credit
# Installs credit into the .local directory in your home directory.
pip install --user -e .
Note
NCAR users on Derecho:
conda activate /glade/campaign/cisl/aiml/credit/conda_envs/credit-derecho
git clone https://github.com/NCAR/miles-credit.git
cd miles-credit
pip install --user -e .
Note
Other systems:
conda create -n credit python=3.12
conda activate credit
pip install miles-credit
Or install the main development branch:
git clone https://github.com/NCAR/miles-credit.git
cd miles-credit
pip install -e .
Verify the install worked:
credit --help
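If the shell reports that credit cannot be found, the entry point is probably not on your PATH; on Linux, pip install --user usually places scripts in ~/.local/bin. The following is a minimal Python sketch of that check. The helper name find_cli is hypothetical and is not part of CREDIT.

```python
import shutil
from pathlib import Path


def find_cli(name="credit"):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


if __name__ == "__main__":
    location = find_cli()
    if location is None:
        # Assumption: a --user install on Linux puts scripts in ~/.local/bin
        user_bin = Path.home() / ".local" / "bin"
        print(f"'credit' not found on PATH; check whether {user_bin} is in $PATH")
    else:
        print(f"credit found at {location}")
```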
More detail: Installation | Getting Started
2. Generate a config
CREDIT ships with ready-to-use configs for ERA5. Pick your resolution:
# 1-degree ERA5 — good starting point, fast to train
credit init --grid 1deg -o my_run.yml
# 0.25-degree ERA5 — full resolution, needs more memory and time
credit init --grid 0.25deg -o my_run.yml
Note
NCAR users: data paths in these configs already point to
/glade/campaign/cisl/aiml/ksha/CREDIT_data/ — readable by all NCAR staff.
save_loc defaults to /glade/derecho/scratch/$USER/CREDIT_runs/...
No edits required to get started.
Open my_run.yml and find the # USER SETTINGS block. The only things you
may want to change before your first run:
| Field | Default | Notes |
|---|---|---|
| trainer.num_epoch | | Epochs per PBS job. Increase if walltime allows. |
| | | Per-GPU. Reduce if you hit OOM. |
| save_loc | scratch dir | Where checkpoints and logs are written. |
More detail: Config reference | Training guide
3. Submit a training job
Submit
credit submit works out how many jobs to chain from
trainer.epochs / trainer.num_epoch in your config, so you don’t need to
calculate it yourself.
# Casper — chain computed automatically from config
credit submit --cluster casper -c my_run.yml --gpus 4
# Derecho — 1 node × 4 GPUs
credit submit --cluster derecho -c my_run.yml --gpus 4 --nodes 1
# Derecho — multi-node (e.g. 4 nodes × 4 GPUs = 16 GPUs total)
credit submit --cluster derecho -c my_run.yml --gpus 4 --nodes 4
Before submitting, credit submit always prints a job plan:
====================================================
Job plan
====================================================
Cluster : casper
Config : my_run.yml
GPUs : 4 GPU(s)
Walltime : 12:00:00 per job
Chain : 14 jobs (70 epochs ÷ 5 per job)
DataLoader memory est. : ~8 GB
====================================================
If the memory estimate is high (> 24 GB), credit submit warns you to reduce
thread_workers or prefetch_factor, so you can fix the config before the job hangs silently.
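The Chain line in the job plan above is plain arithmetic on two config values. A minimal sketch, assuming the result is rounded up when the epochs do not divide evenly (this page only shows the exact case, 70 ÷ 5); chain_length is a hypothetical helper, not a CREDIT function:

```python
def chain_length(total_epochs, epochs_per_job):
    """Number of chained jobs needed to cover total_epochs."""
    if total_epochs <= 0 or epochs_per_job <= 0:
        raise ValueError("both values must be positive integers")
    # Integer ceiling division. Rounding up is an assumption: a partial
    # final job would still need its own slot in the chain.
    return (total_epochs + epochs_per_job - 1) // epochs_per_job


print(chain_length(70, 5))  # 14, matching the job plan above
```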
Override the chain length manually if needed:
credit submit --cluster casper -c my_run.yml --gpus 4 --chain 5
Preview the full PBS script without submitting:
credit submit --cluster casper -c my_run.yml --gpus 4 --dry-run
Job 1 starts immediately; jobs 2–N are queued with PBS afterok and start
automatically when the previous job succeeds.
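For readers unfamiliar with PBS dependencies, the chaining described above relies on qsub's -W depend=afterok:<jobid> option. The sketch below only builds the command lines and never calls qsub; it is an illustration, not how credit submit is implemented. The script name train.pbs and the job IDs are made up.

```python
def qsub_command(script, after_job_id=None):
    """Build a PBS qsub command, optionally gated on another job succeeding."""
    cmd = ["qsub"]
    if after_job_id is not None:
        # afterok: start only if the named job finishes with exit status 0
        cmd += ["-W", f"depend=afterok:{after_job_id}"]
    cmd.append(script)
    return cmd


# Dry run with made-up job IDs; a real submission would read each ID
# from the output of the previous qsub call.
previous = None
for fake_id in ["1001.example", "1002.example", "1003.example"]:
    print(" ".join(qsub_command("train.pbs", previous)))
    previous = fake_id
```

The first command has no dependency, so it can start immediately; each later command waits on the job before it.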
Resuming a failed chain
If a job fails mid-run (preemption, node fault), the remaining afterok jobs
are cancelled by PBS. Restart from the last good checkpoint:
credit submit --cluster derecho -c my_run.yml --gpus 4 --nodes 1 --reload --chain 5
--reload patches the config to set load_weights: True and all related
reload flags automatically — no manual YAML editing required.
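Conceptually, the patch amounts to flipping a flag on a copy of the loaded config. The sketch below is a rough illustration on a plain dict, assuming load_weights sits under a trainer section (this page does not say where it lives); the other reload flags are not named here, so they are left out. patch_for_reload is a hypothetical name.

```python
import copy


def patch_for_reload(config):
    """Return a copy of config with weight reloading switched on."""
    patched = copy.deepcopy(config)  # leave the caller's dict untouched
    # Assumption: load_weights lives under a "trainer" section.
    patched.setdefault("trainer", {})["load_weights"] = True
    return patched


original = {"trainer": {"load_weights": False, "epochs": 70, "num_epoch": 5}}
resumed = patch_for_reload(original)
print(resumed["trainer"]["load_weights"])   # True
print(original["trainer"]["load_weights"])  # False
```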
More detail: Training guide | credit submit --help
4. Monitor progress
Training log
The trainer writes a CSV after every epoch:
# Quick check: last 5 epochs
tail -5 /glade/derecho/scratch/$USER/CREDIT_runs/my_run/training_log.csv
Columns: epoch, train_loss, val_loss, lr, epoch_time_s.
What healthy training looks like:
After epoch 1: train_loss ≈ 1–3
Loss should decrease steadily each epoch
val_loss should track train_loss (not diverge)
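The checks above can be automated with the standard library. This is a minimal sketch that assumes the column names listed earlier (epoch, train_loss, val_loss, lr, epoch_time_s); summarise_log is a hypothetical helper, and the sample numbers are invented for illustration.

```python
import csv
import io


def summarise_log(csv_text, last_n=5):
    """Parse training-log CSV text and report simple health signals."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("training log contains no epochs yet")
    recent = rows[-last_n:]
    train = [float(r["train_loss"]) for r in recent]
    val = [float(r["val_loss"]) for r in recent]
    return {
        "epochs": [int(float(r["epoch"])) for r in recent],
        # True if train_loss never rises from one epoch to the next
        "train_decreasing": all(b <= a for a, b in zip(train, train[1:])),
        # Largest |val_loss - train_loss|; a growing gap suggests divergence
        "max_gap": max(abs(v - t) for t, v in zip(train, val)),
    }


# Invented sample data, not real training output
sample = """epoch,train_loss,val_loss,lr,epoch_time_s
1,2.10,2.30,0.001,610
2,1.40,1.55,0.001,602
3,1.05,1.20,0.001,598
"""
print(summarise_log(sample))
```

To run it on a real log, pass the file contents, for example pathlib.Path("training_log.csv").read_text().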
TensorBoard
tensorboard --logdir /glade/derecho/scratch/$USER/CREDIT_runs/my_run/tensorboard
Then open http://localhost:6006 in your browser.
On HPC you will need SSH port-forwarding — see Monitoring with TensorBoard.
5. Visualise a prediction
Once at least one checkpoint exists, run a forward pass and produce a 3-panel global map (truth | prediction | difference) for any field:
# Denormalised to physical units (K for temperature, Pa for pressure)
credit plot -c my_run.yml --field VAR_2T --denorm
# Multiple fields at once
credit plot -c my_run.yml --field VAR_2T SP VAR_10U --denorm
# Specific pressure level (index into your levels list)
credit plot -c my_run.yml --field U --level 5 --denorm
Plots are saved to <save_loc>/plots/. No GPU required — runs on CPU.
What to look for:
| What you see | Meaning |
|---|---|
| Recognisable weather patterns after ~10 epochs | Training is going well |
| Uniform grey prediction | Too few epochs, or LR/normalisation problem |
| Loss > 100 or growing | Check |
| Small smooth difference map | Model is converging correctly |
More detail: credit plot --help
6. Get help from the AI assistant
credit ask is a unified AI assistant. It runs in agent mode (reads files,
runs commands, iterates) when an Anthropic API key is set, and falls back to simple chat
(Groq, Gemini, OpenAI) otherwise.
pip install "miles-credit[ask]"
# Set whichever key you have — free options work well for quick questions:
export GROQ_API_KEY=gsk_... # https://console.groq.com (free, no card needed)
export GOOGLE_API_KEY=AIza... # https://aistudio.google.com (free)
export OPENAI_API_KEY=sk-... # https://platform.openai.com
export ANTHROPIC_API_KEY=sk-ant-... # https://console.anthropic.com (enables agent mode)
credit ask "how do I resume a failed Derecho job?"
credit ask -c my_run.yml "my loss stopped decreasing at epoch 12, what should I check?"
| Provider | Env var | Mode | Cost |
|---|---|---|---|
| Anthropic | ANTHROPIC_API_KEY | Agent (multi-turn, reads files) | ~$0.01–0.05/session |
| OpenAI | OPENAI_API_KEY | Simple chat | Pay-per-use |
| Google | GOOGLE_API_KEY | Simple chat | Free |
| Groq | GROQ_API_KEY | Simple chat | Free tier (no card needed) |
Priority when multiple keys are set: Anthropic agent → OpenAI → Google → Groq.
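To predict which mode you will get with several keys exported, the stated priority can be expressed as a first-match lookup. This sketch uses the environment variable names from this page; pick_provider is a hypothetical helper and does not reflect how credit ask is implemented internally.

```python
import os

# (env var, provider, mode) in the priority order stated above
PRIORITY = [
    ("ANTHROPIC_API_KEY", "Anthropic", "agent"),
    ("OPENAI_API_KEY", "OpenAI", "simple chat"),
    ("GOOGLE_API_KEY", "Google", "simple chat"),
    ("GROQ_API_KEY", "Groq", "simple chat"),
]


def pick_provider(env=None):
    """Return (provider, mode) for the first key that is set, or None."""
    env = os.environ if env is None else env
    for var, provider, mode in PRIORITY:
        if env.get(var):  # empty strings count as unset
            return provider, mode
    return None


print(pick_provider({"GROQ_API_KEY": "gsk_x", "OPENAI_API_KEY": "sk-x"}))
# ('OpenAI', 'simple chat')
```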
# Agent mode: reads your PBS log, config, and source to give a specific answer
credit ask -c my_run.yml "why did my training run crash?"
credit ask -c my_run.yml "review this config before I start a 200-epoch run on 8 H100s"
credit ask "what PBS jobs are running and how much walltime do they have left?"
See the full AI Assistant documentation for all examples, options, and cost details.
Common problems
| Symptom | Fix |
|---|---|
| Training hangs on startup, no error | DataLoader is using too much RAM. Reduce thread_workers or prefetch_factor. |
| | Use |
| | Run |
| PBS chain cancelled after job failure | Expected: PBS cancels the remaining afterok jobs. Resubmit with --reload. |
| Checkpoint not found on first run | Normal: set |
| Out of GPU memory | Reduce |
What’s next
| Goal | Where to go |
|---|---|
| Understand every config field | |
| Multi-node training details | |
| Run a forecast from a trained model | |
| Serve forecasts over HTTP | |
| Set up TensorBoard on HPC | |
| Evaluate your model against baselines | |
| Use a custom dataset | |
| Add a new model architecture | |