Training a Model#

Note

New to CREDIT? Jump straight to Training with the v2 Data Schema — it is the recommended path for all new experiments and uses a single credit command for everything. The sections below document the legacy v1 workflow.

CREDIT supports three modes for training a model. In your configuration file (model.yml), under the trainer field, you can set mode to one of the following:

  • None: Trains on a single GPU without any special distributed settings.

  • ddp: Uses Distributed Data Parallel (DDP) for multi-GPU training.

  • fsdp: Uses Fully Sharded Data Parallel (FSDP) for multi-GPU training.

Training on a Single GPU (No Distributed Training)#

To start a training run from epoch 0, use:

credit_train -c config/model.yml

Ensure the trainer section in model.yml is set as follows:

trainer:
    load_weights: False
    load_optimizer: False
    load_scaler: False
    load_scheduler: False
    reload_epoch: False
    start_epoch: 0
    num_epoch: 10
    epochs: &epochs 70

These settings ensure training starts at epoch 0 without loading any pre-existing weights. The model will train for 10 epochs and save a checkpoint (checkpoint.pt) to the save_loc directory as well as a training_log.csv file that will report on statistics such as the epoch number and the training and validation loss.

To continue training from epoch 11, update these settings:

trainer:
    load_weights: True
    load_optimizer: True
    load_scaler: True
    load_scheduler: True
    reload_epoch: True
    start_epoch: 0
    num_epoch: 10
    epochs: &epochs 70

Setting reload_epoch: True ensures that training resumes from the last saved checkpoint and will automatically load training_log.csv. Once training has been run seven times, reaching epoch 70, the training process is complete.

Training with Distributed Data Parallel (DDP) or Fully Sharded Data Parallel (FSDP)#

To train on multiple GPUs, set mode to ddp or fsdp in model.yml.

trainer:
    mode: ddp  # Use 'fsdp' for Fully Sharded Data Parallel

Then, start training as usual:

credit_train -c config/model.yml

This command generates a PBS script and submits it via qsub. Job resources are controlled by the pbs: section of your config — see below.

PBS configuration in your config file#

The pbs: block is the primary place to set your allocation code, walltime, node count, conda environment, and other job parameters. You do not need to pass these on the command line every time.

# ---- Derecho ----------------------------------------------------------------
pbs:
    project: "NCAR0001"        # YOUR allocation code (PBS -A) — change this!
    job_name: "credit_gen2"      # job name shown in qstat
    walltime: "12:00:00"       # wall-clock limit per job (HH:MM:SS)
    nodes: 1                   # number of nodes (derecho only; casper is always 1)
    ncpus: 64                  # CPUs per node
    ngpus: 4                   # GPUs per node
    mem: ‘480GB’               # memory per node
    queue: ‘main’              # queue name
    conda: "credit-derecho"    # conda env name or full path
# ---- Casper -----------------------------------------------------------------
pbs:
    project: "NCAR0001"
    job_name: "credit_gen2"
    walltime: "04:00:00"
    ncpus: 8
    ngpus: 1
    mem: ‘128GB’
    queue: ‘casper’
    gpu_type: ‘a100_80gb’      # a100_80gb, v100, h100, etc.
    conda: "credit"

Resolution order — the same setting can come from three places, highest priority first:

Priority

Source

Example

1

CLI flag

--account NCAR0001 --gpus 4

2

pbs: section in config

project: "NCAR0001"

3

Built-in cluster default

4 GPUs, 12 h walltime, etc.

You can also export PBS_ACCOUNT in your shell as a global fallback for the account code (useful if you work across multiple configs but always charge the same project).

Running on Casper vs. Derecho#

Key Differences#

Feature

Derecho

Casper

GPUs per node

4

1

Total GPUs

32 (8 nodes × 4)

1

Memory

480GB

128GB

Walltime

12:00:00

4:00:00

GPU Type

A100

V100/A100/H100

Queue

main

casper

Casper is best for small-scale experiments, while Derecho is designed for large-scale, multi-node training. Derecho only has A100 GPUs with 40 Gb of memory. Casper has both 40 Gb and 80 Gb A100s along with a small number of H100s with 80 Gb of memory.


Training with the v2 Data Schema#

CREDIT v2 is the recommended path for all new experiments. It uses a cleaner nested data schema with explicit variable categories (prognostic, diagnostic, dynamic_forcing, static) and a unified credit command for everything.

Quickstart on Casper or Derecho#

# 1. Activate the pre-built environment (NCAR users)
#    Casper:
conda activate /glade/u/home/schreck/.conda/envs/credit-casper
#    Derecho:
#    conda activate /glade/work/benkirk/conda-envs/credit-derecho-torch28-nccl221

# 2. Clone the repo and install
git clone https://github.com/NCAR/miles-credit.git
cd miles-credit
pip install -e . --no-deps

# 3. Generate a config from a built-in template
credit init --grid 1deg -o my_experiment.yml      # or --grid 0.25deg for full-res

# 4. Set your allocation in the pbs: section of my_experiment.yml, then submit.
#    --chain auto-computes job count from ceil(epochs / num_epoch) in the config.
credit submit --cluster casper  -c my_experiment.yml --chain 10
credit submit --cluster derecho -c my_experiment.yml --chain 10

That’s it. --chain 10 submits 10 back-to-back jobs via PBS afterok dependencies — no manual resubmission needed.

Note

NCAR users: data paths in the built-in configs point to /glade/campaign/cisl/aiml/ksha/CREDIT_data/ which is readable by all NCAR staff. save_loc defaults to /glade/derecho/scratch/$USER/CREDIT_runs/... — no config edits required to get started.

How many jobs do I need?#

Rule of thumb: --chain = ceil(total_epochs / epochs_per_job).

num_epoch in the trainer config controls how many epochs run per job submission (default 5). epochs is the total training target (default 70).

total epochs

epochs/job (num_epoch)

--chain

70

5

14

70

10

7

100

10

10

Use --dry-run to inspect the PBS scripts before submitting:

credit submit --cluster derecho -c my_experiment.yml --chain 10 --dry-run

Available configs#

Grid

File

Notes

config/wxformer_1dg_6hr_v2.yml

ERA5 model-level — good starting point

0.25°

config/wxformer_025deg_6hr_v2.yml

Full-res pressure-level, 13 levels

starter

config/starter_v2.yml

Minimal template with USER SETTINGS comments

What does a healthy training run look like?#

After the first epoch, train_loss should be O(1) (roughly 1–3). It should decrease steadily across epochs. If losses are > 100 or growing, something is wrong with normalization or the data paths.

Check progress at any time:

# Quick check: tail the CSV log
tail -5 /glade/derecho/scratch/$USER/CREDIT_runs/my_run/training_log.csv

# Global map: truth vs prediction in physical units (saves to <save_loc>/plots/)
credit plot -c my_experiment.yml --field VAR_2T --denorm

# Visual dashboard: TensorBoard (see Monitoring with TensorBoard for SSH forwarding)
tensorboard --logdir /glade/derecho/scratch/$USER/CREDIT_runs/my_run/tensorboard

Trainer configuration#

Set trainer.type: era5-gen2 in your config. Key fields:

trainer:
    type: era5-gen2
    mode: ddp               # none | ddp | fsdp
    train_batch_size: 8     # per-GPU; total = batch_size × n_gpus
    num_epoch: 5            # epochs per job submission
    epochs: &epochs 70      # total training target
    use_tensorboard: True   # write TensorBoard logs to save_loc/tensorboard/
    use_ema: True           # recommended: EMA shadow weights for checkpointing
    ema_decay: 0.9999
    use_scheduler: True
    scheduler:
        scheduler_type: linear-warmup-cosine
        warmup_steps: 1000
        total_steps: 500000
        min_lr: 1.0e-5

When use_tensorboard: True, metrics are written to <save_loc>/tensorboard/ after each epoch. Launch the viewer from any machine with access to the filesystem:

tensorboard --logdir /glade/derecho/scratch/$USER/my_run/tensorboard

See Monitoring with TensorBoard for port-forwarding instructions for Casper and Derecho.

Job submission#

The credit submit command generates a ready-to-use PBS script and optionally calls qsub. Resource settings are read from the pbs: section of your config (see above); CLI flags override them when provided.

# Minimal — all settings come from the pbs: block in config.yml
credit submit --cluster derecho -c config.yml

# Override specific settings on the fly
credit submit --cluster derecho -c config.yml --nodes 2 --walltime 06:00:00

# Charge a different account for this run only
credit submit --cluster derecho -c config.yml --account NCAR0002

# Preview the generated PBS script without submitting
credit submit --cluster derecho -c config.yml --dry-run

# Casper
credit submit --cluster casper -c config.yml

See credit submit --help for the full option list.

Resuming training#

Wall-time limits on Casper (12 h) and Derecho mean a 70-epoch run typically needs multiple job submissions. Two options:

Option A — chain jobs upfront with --chain N#

Submit all jobs at once before training starts. PBS afterok dependencies ensure each job only starts after the previous one completes successfully:

# Submit 10 back-to-back jobs (job 1 fresh, jobs 2–10 auto-reload)
credit submit --cluster derecho -c config.yml --chain 10

# Same for Casper
credit submit --cluster casper -c config.yml --chain 10

If you estimate ~5 epochs per 12 h wall time and need 70 epochs total, --chain 14 covers the full run without any manual resubmission.

Use --dry-run to preview all scripts before submitting:

credit submit --cluster derecho -c config.yml --chain 10 --dry-run

Option B — manual reload with --reload#

Submit one job at a time. After each job completes, resubmit with --reload:

# First job
credit submit --cluster derecho -c config.yml

# Every subsequent job
credit submit --cluster derecho -c config.yml --reload

Restarting a failed chain#

If the cluster kills a job mid-run (preemption, node failure, etc.), the remaining afterok jobs in the chain are automatically cancelled by PBS. To restart from the last good checkpoint, combine --reload and --chain:

# Resume and re-queue 5 more jobs from the latest checkpoint
credit submit --cluster derecho -c config.yml --reload --chain 5

Job 1 picks up from the checkpoint; jobs 2–5 are chained behind it with afterok. The epoch counter stays continuous because reload_epoch: True always reads the next epoch from the checkpoint file.

Both options write <save_loc>/config_reload.yml with these five fields patched automatically — no manual config editing required:

load_weights: True
load_optimizer: True
load_scaler: True
load_scheduler: True
reload_epoch: True   # auto-detects next epoch from checkpoint

reload_epoch: True causes the trainer to read the epoch from the checkpoint and set start_epoch = checkpoint_epoch + 1, so the epoch counter is always continuous.