Monitoring Training with TensorBoard#
CREDIT writes training metrics to TensorBoard after every epoch when use_tensorboard: True
is set in the trainer config. Logs are written to <save_loc>/tensorboard/.
Enabling TensorBoard#
Add one line to the trainer block of your config:
trainer:
  type: era5-gen2
  use_tensorboard: True  # <-- add this
  ...
TensorBoard is off by default so existing configs are unaffected. All production v2 configs
(config/wxformer_025deg_6hr_v2.yml, config/wxformer_1dg_6hr_v2.yml) have it enabled.
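Before submitting a job, it can be handy to confirm that the flag is actually set in the config you are about to use. The sketch below is a hypothetical helper (`tb_enabled` is not part of CREDIT). It only greps for an uncommented `use_tensorboard: True` line and does not parse YAML, so it cannot confirm that the key sits inside the `trainer` block.

```shell
# Hypothetical helper, not part of CREDIT.
# Succeeds if CONFIG contains an uncommented "use_tensorboard: True" line.
# This is a plain text match: it does not verify the key is under "trainer:".
tb_enabled() {
    grep -Eq '^[[:space:]]*use_tensorboard:[[:space:]]*True([[:space:]]|#|$)' "$1"
}
```

For example, `tb_enabled config/wxformer_1dg_6hr_v2.yml && echo "TensorBoard logging is on"`.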
What gets logged#
Each scalar is grouped so that train and validation curves appear on the same chart:
| TensorBoard tag | Metric |
|---|---|
| | Training loss (mean over epoch) |
| | Validation loss (mean over epoch) |
| | Training accuracy |
| | Validation accuracy |
| | Training mean absolute error |
| | Validation mean absolute error |
| | Learning rate |
| | Rollout forecast length (auto-regressive curriculum) |
Additional per-variable metrics are logged if save_metric_vars is set in the config.
Viewing logs locally#
If the scratch filesystem is available on the machine where you run TensorBoard (e.g. mounted via SSHFS, or directly on a login node):
tensorboard --logdir /glade/derecho/scratch/$USER/my_run/tensorboard
# then open http://localhost:6006 in your browser
Viewing logs from Casper or Derecho (SSH port forwarding)#
Step 1 — start TensorBoard on the HPC node#
SSH to a login node or submit an interactive job, then:
tensorboard --logdir /glade/derecho/scratch/$USER/my_run/tensorboard --port 6006
Step 2 — forward the port to your laptop#
In a separate terminal on your laptop:
# Replace <username> with your NCAR username
ssh -N -L 6006:localhost:6006 <username>@derecho.hpc.ucar.edu
# or for Casper:
ssh -N -L 6006:localhost:6006 <username>@casper.ucar.edu
Step 3 — open in your browser#
Navigate to http://localhost:6006.
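If port 6006 is already taken on your laptop, the local side of the forward can be any free port, while the remote side stays on the port TensorBoard was started with. The helper below is a hypothetical sketch (`tb_forward_cmd` is not part of CREDIT) that only prints the `ssh` command for a chosen pair of ports, so you can inspect it before running it.

```shell
# Hypothetical helper, not part of CREDIT.
# Usage: tb_forward_cmd USER HOST [LOCAL_PORT] [REMOTE_PORT]
# Prints (does not run) an ssh command that forwards LOCAL_PORT on your
# laptop to REMOTE_PORT on HOST. Both ports default to 6006.
tb_forward_cmd() {
    user="$1"
    host="$2"
    local_port="${3:-6006}"
    remote_port="${4:-6006}"
    printf 'ssh -N -L %s:localhost:%s %s@%s\n' \
        "$local_port" "$remote_port" "$user" "$host"
}
```

With a local port of 16006, for instance, the printed command forwards to the same remote TensorBoard, and the page is then reached at http://localhost:16006 instead.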
One-liner (no separate terminal)#
# Single quotes make $USER expand on the remote host, not on your laptop
ssh -L 6006:localhost:6006 <username>@casper.ucar.edu \
  'tensorboard --logdir /glade/derecho/scratch/$USER/my_run/tensorboard --port 6006'
Comparing multiple runs#
Point TensorBoard at a parent directory to overlay runs in the same chart:
# Each subdirectory becomes a separate run in the legend
tensorboard --logdir /glade/derecho/scratch/$USER/experiments/
This works well when the save_loc directories of your runs share a common parent, for example:
experiments/
  wxformer_025deg_run1/tensorboard/
  wxformer_025deg_run2/tensorboard/
  wxformer_1deg_baseline/tensorboard/
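To see which runs TensorBoard will find under a parent directory before launching it, you can list the directories that contain event files. The sketch below assumes the default TensorBoard event file naming (`events.out.tfevents.*`). `list_tb_runs` is a hypothetical helper, not part of CREDIT.

```shell
# Hypothetical helper, not part of CREDIT.
# Prints each directory under PARENT that holds at least one TensorBoard
# event file (assumes the default "events.out.tfevents.*" file naming).
list_tb_runs() {
    parent="$1"
    find "$parent" -type f -name 'events.out.tfevents.*' 2>/dev/null \
        | while IFS= read -r event_file; do dirname "$event_file"; done \
        | sort -u
}
```

A run that is missing from the output has not written any events yet, which usually means it has not finished its first epoch or was started without `use_tensorboard: True`.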
Resuming a run#
TensorBoard appends new events to the existing log directory each time training resumes.
Epoch numbers are preserved (the trainer continues from start_epoch), so the loss curve
remains continuous across job restarts.
Installation#
PyTorch provides the torch.utils.tensorboard logging module, but the tensorboard package itself is a separate dependency and may not be present in every environment. If it is missing from yours:
pip install tensorboard
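A quick way to tell whether the install is needed is to check if the `tensorboard` command is on your `PATH`. The helper below is a generic, hypothetical sketch (`have_cmd` is not part of CREDIT). It only checks for the command-line entry point in the currently active environment.

```shell
# Hypothetical helper, not part of CREDIT.
# Succeeds if the named command can be found in the current environment.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}
```

For example, `have_cmd tensorboard || pip install tensorboard` installs the package only when the command is not already available.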