Training a Model
CREDIT supports three modes for training a model. In your configuration file (model.yml), under the trainer field, you can set mode to one of the following:
None: Trains on a single GPU without any special distributed settings.
ddp: Uses Distributed Data Parallel (DDP) for multi-GPU training.
fsdp: Uses Fully Sharded Data Parallel (FSDP) for multi-GPU training.
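As a minimal sketch of how the three values map to behavior, the helper below takes an already-parsed trainer section as a plain dictionary and reports which strategy it selects. The function and the `VALID_MODES` table are hypothetical and not part of CREDIT. The sketch also assumes that a missing mode key, a null value, and the string "None" are treated the same way, which this page does not confirm.

```python
# Hypothetical helper, not part of CREDIT: maps trainer.mode to a description.
VALID_MODES = {
    "none": "single GPU, no distributed settings",
    "ddp": "multi-GPU, Distributed Data Parallel",
    "fsdp": "multi-GPU, Fully Sharded Data Parallel",
}


def describe_mode(trainer_conf):
    """Return a short description of the training mode in a trainer section."""
    # Assumption: a missing key, a null value, and the string "None" are equivalent.
    raw = trainer_conf.get("mode")
    key = "none" if raw is None else str(raw).strip().lower()
    if key not in VALID_MODES:
        raise ValueError(f"Unknown trainer mode: {raw!r}")
    return VALID_MODES[key]


print(describe_mode({"mode": "fsdp"}))
```

Rejecting unknown values early, as this sketch does, surfaces a misspelled mode before any job is submitted.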
Training on a Single GPU (No Distributed Training)
To start a training run from epoch 0, use:
credit_train -c config/model.yml
Ensure the trainer section in model.yml is set as follows:
trainer:
    load_weights: False
    load_optimizer: False
    load_scaler: False
    load_scheduler: False
    reload_epoch: False
    start_epoch: 0
    num_epoch: 10
    epochs: &epochs 70
These settings ensure training starts at epoch 0 without loading any pre-existing weights or optimizer, scaler, or scheduler state. The model trains for 10 epochs (num_epoch) and saves a checkpoint (checkpoint.pt) to the save_loc directory, along with a training_log.csv file that records statistics such as the epoch number and the training and validation loss.
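A quick way to inspect such a log is to find the epoch with the lowest validation loss. The sketch below uses only the standard library. The column names `epoch`, `train_loss`, and `valid_loss` are assumptions chosen for illustration, so check the header of your own training_log.csv and adjust them. The sample rows are made-up values, not real results.

```python
import csv
import io


def best_epoch(log_file, loss_column="valid_loss"):
    """Return (epoch, loss) for the row with the lowest value in loss_column."""
    rows = list(csv.DictReader(log_file))
    if not rows:
        raise ValueError("training log is empty")
    best = min(rows, key=lambda row: float(row[loss_column]))
    return int(best["epoch"]), float(best[loss_column])


# Made-up values for illustration only; a real log lives in the save_loc directory.
sample = io.StringIO(
    "epoch,train_loss,valid_loss\n"
    "0,0.90,0.95\n"
    "1,0.70,0.80\n"
    "2,0.65,0.85\n"
)
print(best_epoch(sample))
```

For a real run, pass an open file handle instead of the in-memory sample, for example `open("training_log.csv", newline="")` from inside the save_loc directory.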
To continue training after the first 10 epochs, update these settings:
trainer:
    load_weights: True
    load_optimizer: True
    load_scaler: True
    load_scheduler: True
    reload_epoch: True
    start_epoch: 0
    num_epoch: 10
    epochs: &epochs 70
Setting reload_epoch: True ensures that training resumes from the last saved checkpoint and that training_log.csv is loaded automatically. Once training has been run seven times (7 runs × 10 epochs each), reaching epoch 70, the training process is complete.
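The arithmetic behind "seven times" can be sketched as follows. The helper is hypothetical and assumes that epochs are counted from 0 and that each run trains exactly num_epoch epochs until the total given by epochs is reached.

```python
import math


def resume_schedule(total_epochs, epochs_per_run):
    """List the (first_epoch, last_epoch) pair covered by each run, 0-indexed."""
    n_runs = math.ceil(total_epochs / epochs_per_run)
    schedule = []
    for run in range(n_runs):
        first = run * epochs_per_run
        # The final run may be shorter if the total is not a multiple of the run length.
        last = min(first + epochs_per_run, total_epochs) - 1
        schedule.append((first, last))
    return schedule


# epochs: 70 and num_epoch: 10, as in the trainer settings
for run, (first, last) in enumerate(resume_schedule(70, 10), start=1):
    print(f"run {run}: epochs {first}-{last}")
```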
Training with Distributed Data Parallel (DDP) or Fully Sharded Data Parallel (FSDP)
To train on multiple GPUs, set mode to ddp or fsdp in model.yml.
trainer:
    mode: ddp  # Use 'fsdp' for Fully Sharded Data Parallel
Then, start training as usual:
credit_train -c config/model.yml
This command generates a launch script (launch.sh) and submits a job on Derecho, allocating the required number of nodes and GPUs. The settings for this job are controlled by the pbs field in model.yml.
Example PBS Configuration (Derecho)
pbs:
    conda: "credit-derecho"
    project: "NAML0001"
    job_name: "train_model"
    walltime: "12:00:00"
    nodes: 8
    ncpus: 64
    ngpus: 4
    mem: '480GB'
    queue: 'main'
conda: The environment containing the miles-credit installation.
project: Your project code.
nodes and ngpus: The number of nodes and GPUs per node. In this example, 8 nodes × 4 GPUs = 32 GPUs total.
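The GPU count and the matching PBS resource request follow directly from these fields. The sketch below is a hypothetical illustration, not CREDIT's launch-script generator, which may format the request differently. It relies on the standard PBS convention that select=N requests N chunks, here one chunk per node.

```python
def total_gpus(pbs_conf):
    """Total GPUs across the job: nodes times GPUs per node."""
    return int(pbs_conf["nodes"]) * int(pbs_conf["ngpus"])


def select_directive(pbs_conf):
    """Build a PBS resource request from the nodes, ncpus, and ngpus fields."""
    # str.format ignores keys it does not use, so extra pbs fields are harmless.
    return "#PBS -l select={nodes}:ncpus={ncpus}:ngpus={ngpus}".format(**pbs_conf)


# Values taken from the example pbs section
pbs = {"nodes": 8, "ncpus": 64, "ngpus": 4, "queue": "main"}
print(total_gpus(pbs))
print(select_directive(pbs))
```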
Example launch.sh Script for Derecho
#!/bin/bash
#PBS -A NAML0001
#PBS -N train_model
#PBS -l walltime=12:00:00
#PBS -l select=8:ncpus=64:ngpus=4
#PBS -q main
#PBS -j oe
#PBS -k eod
#PBS -r n
# Load modules
module load conda cuda cudnn mkl
conda activate credit-derecho
# Export environment variables
export LSCRATCH=/glade/derecho/scratch/schreck/
export LOGLEVEL=INFO
# Launch training
mpiexec --cpu-bind none --no-transfer \
    python applications/train.py -c model.yml --backend nccl
This script utilizes MPI for coordinating training across multiple nodes and GPUs. It includes necessary environment variables for Derecho’s system configuration. Users should not need to modify this script, as it is tailored for Derecho and may change with system updates.
Running on Casper vs. Derecho
For Casper, modify model.yml as follows:
pbs:
    conda: "credit"
    project: "NAML0001"
    job_name: "train_model"
    nodes: 1
    ncpus: 32
    ngpus: 4
    mem: '900GB'
    walltime: '4:00:00'
    gpu_type: 'a100'
    queue: 'casper'
Once again, to launch the job on Casper, run:
credit_train -c config/example-v2026.1.0.yml -l 1
This command generates a launch script (launch.sh), which will look like:
#!/bin/bash -l
#PBS -N train_model
#PBS -l select=1:ncpus=32:ngpus=4:mem=900g:gpu_type=a100
#PBS -l walltime=4:00:00
#PBS -A NAML0001
#PBS -q casper
#PBS -j oe
#PBS -k eod
source ~/.bashrc
conda activate credit-casper
torchrun --standalone --nnodes 1 --nproc-per-node=4 applications/train.py -c config/example-v2026.1.0.yml
Note that this script uses torchrun rather than MPI. Using MPI requires compiling PyTorch from source against the MPI installation on your system. torchrun can run distributed training across all GPUs on a single node with minimal configuration and is recommended on Casper or other servers focused on single-node training. torchrun can also orchestrate multi-node training, but this requires starting a torchrun instance separately on each node and coordinating communication between them.
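With --nproc-per-node=4, torchrun starts four worker processes and exports the environment variables RANK, LOCAL_RANK, and WORLD_SIZE to each of them. The sketch below shows one way a script can read these values. The function name is hypothetical, and how applications/train.py actually consumes the variables is not described on this page.

```python
import os


def read_torchrun_env(environ=None):
    """Collect the rank information that torchrun exports to each worker."""
    env = os.environ if environ is None else environ
    # Defaults describe a plain single-process run started without torchrun.
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


info = read_torchrun_env()
print(f"worker {info['rank']} of {info['world_size']}, local rank {info['local_rank']}")
```

A common convention, assumed here rather than taken from CREDIT, is to use the local rank as the index of the GPU that the worker binds to on its node.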
Key Differences
| Feature | Derecho | Casper |
|---|---|---|
| GPUs per node | 4 | 1 |
| Total GPUs | 32 (8 nodes × 4) | 1 |
| Memory | 480GB | 128GB |
| Walltime | 12:00:00 | 4:00:00 |
| GPU Type | A100 | V100/A100/H100 |
| Queue | main | casper |
Casper is best for small-scale experiments, while Derecho is designed for large-scale, multi-node training. Derecho only has A100 GPUs with 40 GB of memory. Casper has both 40 GB and 80 GB A100s along with a small number of H100s with 80 GB of memory.