What’s in the Configuration File?#
Your configuration file drives everything from model training to inference to validation runs. This page documents the available config options and what each flag or setting does.
CREDIT Configuration Guide#
Overview#
This document provides detailed instructions on configuring configuration.yml for running CREDIT.
Key Topics Covered:
Understanding and modifying configuration.yml
Standard Configuration Values and Recommendations
Best Practices and troubleshooting
Summary tables are included at the end of each subsection.
General Setup#
Workspace Configuration#
The following settings define where CREDIT will store output files:
save_loc: '/path/to/workspace/'
seed: 1000
save_loc: Directory where model weights, logs, and scripts are stored. If it doesn’t exist, CREDIT will create it automatically. Model weights can be large, so make sure ample storage is available.
seed: Random seed for reproducibility. Changing this affects experiment results.
Data Configuration#
CREDIT requires multiple types of atmospheric data, stored as yearly .nc or .zarr files. The variables described below can live in the same files or in different files.
Upper-Air Variables#
Upper-air variables are those defined on pressure or model levels. They are treated as prognostic (both input and output) and are expected to cover the whole spatial domain and all model levels.
variables: ['U', 'V', 'T', 'Q']
save_loc: '/path/to/upper_air_data/'
Expected format: (time, level, latitude, longitude)
Normalization: Handled automatically by the dataloader, so there is no need to preprocess.
Surface Variables#
Despite being named ‘surface variables’, these are prognostic variables (input and output) on a single level, whether at the surface, at the top of the model, or somewhere in between.
surface_variables: ['SP', 't2m', 'Z500', 'T500', 'U500', 'V500', 'Q500']
save_loc_surface: '/path/to/surface_data/'
Expected format: (time, latitude, longitude)
Timestamps must align with the upper-air variable timestamps.
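The two expected layouts can be checked before training starts. The helper below is a minimal sketch, not part of CREDIT: `check_dims` and `EXPECTED_DIMS` are hypothetical names, and the dimension names are taken from the "Expected format" lines in this section.

```python
# Hypothetical helper (not a CREDIT function); dimension names follow the
# "Expected format" lines for upper-air and surface variables.
EXPECTED_DIMS = {
    "upper_air": ("time", "level", "latitude", "longitude"),
    "surface": ("time", "latitude", "longitude"),
}


def check_dims(kind, dims):
    """Return True if dims match the expected order, otherwise raise ValueError."""
    expected = EXPECTED_DIMS[kind]
    if tuple(dims) != expected:
        raise ValueError(f"{kind} variable has dims {tuple(dims)}, expected {expected}")
    return True
```

With a library such as xarray, the `dims` argument would typically be the `.dims` tuple of each data variable.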
Forcing & Diagnostic Variables#
dynamic_forcing_variables: ['tsi','sst']
save_loc_dynamic_forcing: '/path/to/dynamic_forcing_data/'
diagnostic_variables: ['Z500', 'T500', 'U500', 'V500', 'Q500']
save_loc_diagnostic: '/path/to/diagnostic_data/'
Dynamic forcing variables provide additional time-dependent inputs (e.g., solar forcing or SST forcing). They change in time and are supplied at run time.
Diagnostic variables are used for evaluation but not directly predicted by the model.
Periodic & Static Forcing#
forcing_variables: ['TSI', 'SST']
save_loc_forcing: '/path/to/forcing_data.nc'
static_variables: ['Z_GDS4_SFC', 'LSM']
save_loc_static: '/path/to/static_data.nc'
Periodic forcing: Should cover an entire leap year (e.g., 366 days for an hourly model).
Static variables: Must be normalized by the user before use.
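As a quick sanity check on the leap-year requirement, the number of time steps a periodic forcing file should contain can be computed from the model time step. This is a sketch under the assumption that the file holds one record per model step; `periodic_forcing_steps` is a hypothetical helper name.

```python
def periodic_forcing_steps(hours_per_step):
    """Number of records needed to cover a 366-day leap year."""
    if hours_per_step <= 0 or 24 % hours_per_step != 0:
        raise ValueError("hours_per_step must be a positive divisor of 24")
    return 366 * (24 // hours_per_step)


hourly_steps = periodic_forcing_steps(1)      # hourly model
six_hourly_steps = periodic_forcing_steps(6)  # 6-hourly model
```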
Physics and Normalization Files#
CREDIT requires external reference files for conservation physics and data normalization. These files must be provided in .zarr or .nc format.
Physics File: save_loc_physics#
save_loc_physics: '/path/to/physics_data.zarr'
Purpose: Stores grid information and coefficients needed for enforcing conservation constraints in the post-processing step (post_block).
Required for:
Mass conservation (global_mass_fixer)
Water conservation (global_water_fixer)
Energy conservation (global_energy_fixer)
Must include the following variables:
For pressure-level grids: lon2d, lat2d (longitude/latitude coordinates).
For hybrid sigma-pressure grids: lon2d, lat2d, coef_a, coef_b (sigma coordinate coefficients).
💡 If conservation constraints (post_conf) are enabled, this file is required!
Normalization Files#
CREDIT uses z-score normalization to standardize input variables. The mean and standard deviation files must contain all variables used in the model (upper-air, surface, forcing, diagnostic).
mean_path: '/path/to/mean.nc'
std_path: '/path/to/std.nc'
mean_path: NetCDF/Zarr file containing mean values for all variables.
std_path: NetCDF/Zarr file containing standard deviation values.
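The z-score transform itself is simple. The sketch below shows the forward and inverse operations on plain Python lists; it illustrates the formula only and is not CREDIT's dataloader code. The function names and sample values are illustrative assumptions.

```python
def zscore(values, mean, std):
    """Standardize values: (x - mean) / std."""
    return [(v - mean) / std for v in values]


def inverse_zscore(values, mean, std):
    """Map standardized values back to physical units: x * std + mean."""
    return [v * std + mean for v in values]


# Illustrative temperatures in K, with an assumed mean of 290 and std of 10.
normalized = zscore([280.0, 290.0, 300.0], mean=290.0, std=10.0)
restored = inverse_zscore(normalized, mean=290.0, std=10.0)
```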
Expected Format#
Both mean_path and std_path should store 1D variables indexed by level:
| Variable Type | Expected Dimensions | Example Variables |
|---|---|---|
| Upper-Air |  |  |
| Surface |  |  |
| Forcing |  |  |
| Diagnostics |  |  |
💡 Ensure these files contain ALL variables listed in the configuration.yml sections for variables, surface_variables, dynamic_forcing_variables, and diagnostic_variables.
Summary of Key Physics & Normalization Recommendations#
| Parameter | Required For | Notes |
|---|---|---|
|  | Conservation constraints ( | Required if conservation physics is enabled. |
|  |  | Required for z-score normalization. |
|  |  | Must include all model variables. |
Training Data Selection#
train_years: [1979, 2014] # 1979 - 2013
valid_years: [2014, 2018] # 2014 - 2017
Defines training/validation split. Adjust these to match the dataset.
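The inline comments above imply that each pair is a half-open interval: the first year is included and the second is not. A minimal sketch of that reading follows; `years_in_split` is a hypothetical helper, not a CREDIT function.

```python
def years_in_split(bounds):
    """Expand [start, stop] into the list of years start, ..., stop - 1."""
    start, stop = bounds
    return list(range(start, stop))


train = years_in_split([1979, 2014])  # 1979 through 2013
valid = years_in_split([2014, 2018])  # 2014 through 2017
```

Under this reading the two splits do not overlap, because 2014 belongs only to the validation set.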
Data Preprocessing and Temporal Configuration#
CREDIT supports different data normalization workflows, input histories, and forecast strategies. These settings control how data is preprocessed, how the model receives historical context, and how it is trained to predict future states.
Normalization: scaler_type#
scaler_type: 'std_new' # Options: 'std_new', 'std_cached'
std_new: The recommended approach. Uses z-score normalization with precomputed means and standard deviations from training data.
std_cached: Assumes data has already been pre-normalized (e.g., stored in a cached dataset). Use only when working with preprocessed inputs.
Historical Context: history_len#
history_len: 1
valid_history_len: 1
history_len: Number of time steps used as input during training.
valid_history_len: Same as history_len; modifying this separately is not recommended.
💡 For example, if history_len: 4, the model will use the last 4 time steps to predict the next state.
Forecast Lead Time Configuration#
CREDIT can be trained in single-step or multi-step forecasting mode:
forecast_len: 0
valid_forecast_len: 0
forecast_len:
0 → Single-step prediction (predicts only the next time step).
1, 2, 3, ... → Multi-step prediction (predicts several time steps ahead).
valid_forecast_len:
Can be equal to or smaller than forecast_len.
If forecast_len > 1, setting a smaller valid_forecast_len allows shorter validation sequences (useful for debugging).
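To make the interaction of history_len and forecast_len concrete, the sketch below lists which time indices a single training sample would use. It assumes, based on the descriptions above, that forecast_len: 0 yields one target step and that each increment adds one more; `sample_indices` is a hypothetical helper and CREDIT's datasets may index samples differently.

```python
def sample_indices(start, history_len, forecast_len):
    """Return (input_indices, target_indices) for one sample starting at `start`."""
    inputs = list(range(start, start + history_len))
    first_target = start + history_len
    # Assumption: forecast_len = 0 means a single predicted step.
    targets = list(range(first_target, first_target + forecast_len + 1))
    return inputs, targets


single_step = sample_indices(0, history_len=4, forecast_len=0)
multi_step = sample_indices(10, history_len=1, forecast_len=2)
```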
Multi-Step Training Options#
If forecast_len > 0, CREDIT supports customized backpropagation strategies to improve training efficiency.
backprop_on_timestep: [1, 2, 3, 5, 6, 7]
Specifies which time steps contribute to the loss during backpropagation.
If unspecified, the trainer will backpropagate on all timesteps
Helps control memory usage by skipping certain time steps.
💡 For example, [1, 2, 3, 5, 6, 7] means the model backpropagates on these timesteps but skips others.
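The selection logic can be sketched as a simple filter. The version below assumes rollout steps are numbered from 1 and that a run with forecast_len: N has N + 1 steps; both are assumptions for illustration, and `steps_with_backprop` is not a CREDIT function.

```python
def steps_with_backprop(forecast_len, backprop_on_timestep=None):
    """Return the rollout steps (1-based) on which backpropagation would run."""
    all_steps = list(range(1, forecast_len + 2))
    if backprop_on_timestep is None:
        # Unspecified: backpropagate on every step.
        return all_steps
    wanted = set(backprop_on_timestep)
    return [step for step in all_steps if step in wanted]


selected = steps_with_backprop(6, [1, 2, 3, 5, 6, 7])
skipped = [s for s in range(1, 8) if s not in selected]
```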
retain_graph: False
Specifies whether the trainer keeps the computation graph through the autoregressive prediction during training.
If True, backpropagation runs from each backprop_on_timestep back to the start of the autoregressive rollout.
This uses considerably more memory.
One-Shot Loss Computation#
one_shot: False
True: Computes loss only on the final predicted time step (useful for speeding up multi-step training).
False: Computes loss at every time step, which may improve stability.
Temporal Resolution and Data Alignment#
CREDIT supports models trained on different time step intervals:
lead_time_periods: 6 # Example: 6-hourly training data
Controls the time step between consecutive forecast states.
6 → 6-hourly model (common for ERA5).
1 → Hourly model.
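The effect of lead_time_periods on forecast valid times can be illustrated with the standard library. This sketch assumes the value is expressed in hours, as the examples above suggest; `forecast_times` is a hypothetical helper and the initialization date is arbitrary.

```python
from datetime import datetime, timedelta


def forecast_times(init_time, lead_time_periods, n_steps):
    """Valid times of the first n_steps forecast states after init_time."""
    step = timedelta(hours=lead_time_periods)
    return [init_time + step * k for k in range(1, n_steps + 1)]


times = forecast_times(datetime(2020, 1, 1, 0), lead_time_periods=6, n_steps=4)
```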
Input Data Ordering: static_first#
CREDIT provides flexibility in how input tensors are structured:
static_first: False
True → Order: [static → dynamic forcing → periodic forcing] (matches the older std workflow).
False → Order: [dynamic forcing → periodic forcing → static] (recommended for std_new).
💡 If you are using std_new, set static_first: False.
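The two orderings can be written out explicitly. The function below is only an illustration of the ordering rule described above, using the example variable names from the data sections of this guide; `forcing_order` is a hypothetical name and does not reflect how CREDIT concatenates tensors internally.

```python
def forcing_order(static_first, dynamic, periodic, static):
    """Return input-only variable names in the order implied by static_first."""
    if static_first:
        return static + dynamic + periodic
    return dynamic + periodic + static


dynamic = ["tsi", "sst"]
periodic = ["TSI", "SST"]
static = ["Z_GDS4_SFC", "LSM"]

new_order = forcing_order(False, dynamic, periodic, static)
old_order = forcing_order(True, dynamic, periodic, static)
```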
Dataset Type#
CREDIT supports multiple data loading strategies:
dataset_type: ERA5_MultiStep_Batcher
Options:
ERA5_MultiStep_Batcher
ERA5_and_Forcing_MultiStep
ERA5_and_Forcing_SingleStep
Ocean_Tensor_Batcher
Ocean_MultiStep_Batcher
The default (ERA5_MultiStep_Batcher) is recommended for efficient parallel data loading.
Summary of Key Data Processing Recommendations#
| Parameter | Recommended Setting | Notes |
|---|---|---|
|  |  | Ensures data is properly normalized. |
|  |  | Use longer history for improved forecasts. |
|  |  | Multi-step training requires additional tuning. |
|  |  | Skipping some timesteps helps manage memory. |
|  |  | Set |
|  |  | Controls forecast step size. |
|  |  | Recommended for |
|  |  | Optimized for performance. |
Training Configuration#
The trainer section controls how CREDIT handles GPU parallelism, gradient updates, checkpointing, and logging.
Training type and mode#
trainer:
    type: era5-gen1 # era5-gen1, era5-gen2, or conus404
    mode: none # Options: "none" (single GPU), "fsdp" (fully sharded), "ddp" (distributed)
Use an era5 type (era5-gen1 or era5-gen2) for global data.
Use fsdp or ddp for multi-GPU training.
💡 For large models, fsdp helps distribute computation across multiple GPUs, reducing memory usage.
CREDIT supports single-GPU, multi-GPU, and distributed training.#
FSDP-Specific GPU Optimization#
If using fsdp, you can enable additional optimizations:
cpu_offload: False
activation_checkpoint: True
checkpoint_all_layers: False
cpu_offload: Moves gradients to CPU memory (frees GPU memory but can cause CPU OOM errors).
activation_checkpoint: Recomputes activations during the backward pass instead of keeping them all from the forward pass (reduces GPU memory but slows training).
checkpoint_all_layers:
True → Checkpoints activations for all layers.
False → Uses custom layer-wise checkpointing (set in credit/distributed.py).
💡 Use activation_checkpoint: True if training large models on limited memory GPUs.
Torch Compilation#
PyTorch 2.0 introduced torch.compile(), which just-in-time compiles the model into optimized kernels to speed up training.
compile: False
True → Enables torch.compile() (can improve performance).
False → Default setting (recommended for maximum compatibility).
💡 Setting compile: True may break custom models—test before enabling.
Checkpointing & Weight Management#
CREDIT automatically saves and reloads model states. It will warn you if you are trying to load a model when no weights are available. To continue a run (or to extend the multi-step training), it is crucial to set the weight-loading to True.
load_weights: True
load_optimizer: True
load_scaler: True
load_scheduler: True
load_weights → Loads existing model weights.
load_optimizer → Restores optimizer state (needed for resuming training).
load_scaler → Loads mixed-precision gradient scaler (if using AMP).
load_scheduler → Restores learning rate scheduler state.
💡 When starting multi-step training, initially set only load_weights: True, then enable all options for full restoration.
Saving Checkpoints#
save_backup_weights: True
save_best_weights: True
save_backup_weights → Saves a checkpoint at the start of every epoch (acts as a recovery point).
save_best_weights → Saves the best model based on validation loss.
💡 If skip_validation: True, save_best_weights will NOT work!
Logging & Training Metrics#
CREDIT logs training performance in training_log.csv.
save_metric_vars: True
True → Saves metrics for all predicted variables.
List of variables → Saves only the specified ones:
save_metric_vars: ["Z500", "Q500", "Q", "T"]
[] or None → Saves only bulk metrics (averaged over all variables).
💡 Reducing the number of tracked variables speeds up metric logging.
Learning Rate Updates#
update_learning_rate: False
False → Learning rate is controlled by the scheduler.
True → Manually updates optimizer.param_groups.
💡 Set this to False if you are using a scheduler!
Summary of Key Hardware Utilization Recommendations#
| Parameter | Recommended Setting | Notes |
|---|---|---|
|  |  |  |
|  |  | Saves GPU memory but can cause CPU OOM errors. |
|  |  | Saves memory but slows training. |
|  |  | Use custom layer-wise checkpointing. |
|  |  | Test before enabling ( |
|  |  | Creates a checkpoint every epoch. |
|  |  | Saves best validation checkpoint (requires |
|  |  | Controls what gets logged. |
|  |  | Disable if using a scheduler. |
Learning Rate & Optimization#
learning_rate: 1.0e-03
use_scheduler: False
Set use_scheduler: True to enable learning rate decay.
Regularization & Weight Decay#
weight_decay: 0
L2 regularization: Helps prevent overfitting by penalizing large weights.
0 → Turns off regularization.
Typical values: 1e-5 to 1e-3 (increase for stronger regularization).
💡 If training a very deep model, try weight_decay: 1e-4 to reduce overfitting.
Batch Size Configuration#
train_batch_size: 1
valid_batch_size: 1
ensemble_size: 1
train_batch_size: Number of samples per training batch.
valid_batch_size: Number of samples per validation batch.
ensemble_size: Controls stochastic ensemble training (default = 1, meaning deterministic behavior).
💡 For multi-GPU training (fsdp or ddp), the effective batch size = train_batch_size × num_GPUs.
Number of Batches Per Epoch#
batches_per_epoch: 1000
valid_batches_per_epoch: 20
batches_per_epoch:
0 → Uses the full dataset.
Custom value (e.g., 1000) → Limits the number of training batches per epoch.
valid_batches_per_epoch: Controls how many validation batches run per epoch.
💡 Reducing batches_per_epoch helps debug faster before full-scale training.
Early Stopping & Validation Skipping#
stopping_patience: 50
skip_validation: False
stopping_patience: Stops training if validation loss does not improve for N epochs.
skip_validation:
True → Always saves weights, but does NOT run validation.
False → Runs validation before saving checkpoints.
💡 If skip_validation: True, save_best_weights will not work.
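The patience rule can be sketched in a few lines. This is a generic early-stopping check written for illustration, assuming one validation loss per epoch; CREDIT's trainer may track the best loss differently, and `should_stop` is a hypothetical name.

```python
def should_stop(valid_losses, patience):
    """True if the best validation loss is at least `patience` epochs old."""
    if not valid_losses:
        return False
    # Index of the first occurrence of the lowest loss seen so far.
    best_epoch = min(range(len(valid_losses)), key=valid_losses.__getitem__)
    epochs_since_best = len(valid_losses) - 1 - best_epoch
    return epochs_since_best >= patience


history = [1.00, 0.90, 0.95, 0.97]
stop_with_patience_2 = should_stop(history, patience=2)
stop_with_patience_3 = should_stop(history, patience=3)
```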
Epoch & Checkpoint Management#
start_epoch: 0
num_epoch: 10
reload_epoch: True
epochs: &epochs 70
start_epoch: First epoch (useful for resuming training).
num_epoch: Total epochs before training stops.
reload_epoch:
True → Reads the last saved epoch and resumes training.
False → Starts fresh.
epochs: Total number of epochs that the scheduler sees.
💡 If using epoch-based schedulers, reload_epoch: True ensures proper continuation.
Learning Rate Scheduling#
use_scheduler: False
scheduler:
    scheduler_type: cosine-annealing-restarts
    first_cycle_steps: 250
    cycle_mult: 6.0
    max_lr: 1.0e-05
    min_lr: 1.0e-08
    warmup_steps: 249
    gamma: 0.7
use_scheduler → Enables learning rate scheduling (True or False).
Supported scheduler types:
cosine-annealing → Reduces LR smoothly over epochs.
cosine-annealing-restarts → Periodically resets the LR.
step-lr → Reduces LR at fixed intervals.
💡 For long training runs, cosine-annealing-restarts helps escape bad local minima by periodically resetting the LR.
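For intuition, the textbook cosine-annealing curve within a single cycle is shown below. It is a sketch of the standard formula only: it ignores warmup_steps, cycle_mult, and gamma, and it is not CREDIT's scheduler implementation. The function name and argument names are assumptions.

```python
import math


def cosine_lr(step, cycle_steps, max_lr, min_lr):
    """Learning rate at `step` within one cosine cycle of length cycle_steps."""
    progress = min(max(step, 0), cycle_steps) / cycle_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


lr_start = cosine_lr(0, 250, max_lr=1.0e-05, min_lr=1.0e-08)
lr_mid = cosine_lr(125, 250, max_lr=1.0e-05, min_lr=1.0e-08)
lr_end = cosine_lr(250, 250, max_lr=1.0e-05, min_lr=1.0e-08)
```

The rate starts at max_lr, passes through the midpoint of the two bounds halfway through the cycle, and reaches min_lr at the end; a restart scheduler then jumps back up and repeats.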
Mixed Precision & Gradient Scaling#
To improve GPU memory efficiency, CREDIT supports mixed precision training:
amp: False
mixed_precision:
    param_dtype: "float32"
    reduce_dtype: "float32"
    buffer_dtype: "float32"
amp: True → Enables PyTorch’s Automatic Mixed Precision (AMP).
mixed_precision → Fine-grained FSDP precision control:
param_dtype: Weight precision (e.g., "float32", "bfloat16").
reduce_dtype: Precision used when gradients are reduced across GPUs during backprop.
buffer_dtype: Buffer storage precision.
💡 For large models, use param_dtype: "bfloat16" to reduce memory usage with minimal accuracy loss.
Gradient Accumulation & Clipping#
grad_accum_every: 1
grad_max_norm: 'dynamic'
grad_accum_every:
1 → Normal training.
>1 → Accumulates gradients over multiple steps before updating weights (useful for small batch sizes).
grad_max_norm:
'dynamic' → Uses adaptive gradient clipping.
0 → No clipping.
💡 Enable gradient accumulation (grad_accum_every > 1) if batch size is constrained by memory but you need a higher effective batch size.
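Combining the multi-GPU rule from the batch size section with gradient accumulation gives the number of samples that contribute to each weight update. The sketch below assumes data-parallel training in which every GPU processes its own batch; `effective_batch_size` is a hypothetical helper.

```python
def effective_batch_size(train_batch_size, num_gpus, grad_accum_every=1):
    """Samples contributing to one optimizer step."""
    return train_batch_size * num_gpus * grad_accum_every


no_accumulation = effective_batch_size(1, num_gpus=8)
with_accumulation = effective_batch_size(1, num_gpus=8, grad_accum_every=4)
```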
CPU Thread & Prefetch Optimization#
CREDIT allows fine-tuning CPU utilization for better dataloader performance.
thread_workers: 4
valid_thread_workers: 4
prefetch_factor: 4
thread_workers: Number of CPU threads for loading training data.
valid_thread_workers: Number of CPU threads for validation data.
prefetch_factor: Number of samples preloaded into the buffer (works with ERA5_MultiStep_Batcher).
💡 Increase thread_workers for faster data loading, but avoid exceeding available CPU cores.
Summary of Key Training Strategy Recommendations#
| Parameter | Recommended Setting | Notes |
|---|---|---|
|  |  | Helps prevent overfitting. |
|  |  | Larger batch size speeds up training. |
|  |  | Reduce for faster debugging. |
|  |  | Stops training if no improvement. |
|  |  | Needed for |
|  |  | Ensures proper resumption of training. |
|  |  | Improves long-term stability. |
|  |  | Saves GPU memory. |
|  |  | Prevents gradient explosion. |
|  |  | Tune based on available CPUs. |
Model Configuration#
CREDIT supports multiple architectures. Example:
type: "crossformer"
frames: 1
image_height: 640
image_width: 1280
levels: 16
channels: 4
surface_channels: 7
type: Model architecture (crossformer, fuxi, etc.).
frames: Number of input states (historical time steps).
image_height, image_width: Spatial resolution (latitude × longitude).
levels: Number of atmospheric levels.
channels: Number of upper-air variables.
Selecting a Model Architecture#
type: "crossformer"
crossformer → Default model based on transformer architecture.
fuxi → Alternative model architecture.
debugger → Debugging mode (useful for checking data flow).
💡 The choice of architecture affects model scalability and computational efficiency.
Temporal and Spatial Resolution#
frames: 1
image_height: 640
image_width: 1280
levels: 16
frames: Number of historical time steps used as input.
image_height, image_width: Spatial resolution of the input fields (latitude × longitude).
levels: Number of vertical pressure levels for upper-air variables.
💡 For higher resolution datasets, ensure these values match the input data format.
Channel Configuration#
channels: 4
surface_channels: 7
input_only_channels: 3
output_only_channels: 0
channels → Number of upper-air input variables.
surface_channels → Number of surface input variables.
input_only_channels → Channels for dynamic forcing, static features, or external variables.
output_only_channels → Reserved for diagnostic variables (default = 0).
💡 If using additional input features (e.g., solar forcing), update input_only_channels.
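A quick way to check these settings against your variable lists is to count the flattened channels per frame. The sketch below assumes that each upper-air variable contributes one channel per level and that the other groups contribute one channel each; this is an assumption about how the counts combine, not a statement of CREDIT's internals, and `channel_counts` is a hypothetical helper.

```python
def channel_counts(channels, levels, surface_channels,
                   input_only_channels, output_only_channels):
    """Flattened input and output channel counts for a single frame."""
    prognostic = channels * levels + surface_channels
    return {
        "input": prognostic + input_only_channels,
        "output": prognostic + output_only_channels,
    }


counts = channel_counts(channels=4, levels=16, surface_channels=7,
                        input_only_channels=3, output_only_channels=0)
```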
Patch Embedding (For Transformer-Based Models)#
CREDIT supports patch-based embeddings, where the spatial domain is divided into small patches for transformer processing.
patch_width: 1
patch_height: 1
frame_patch_size: 1
patch_width, patch_height → Size of each spatial patch (latitude × longitude).
frame_patch_size → Number of time steps per patch (default = 1).
💡 Larger patch sizes can reduce computational cost but may impact fine-scale feature representation.
Transformer Depth and Dimensions#
dim: [32, 64, 128, 256]
depth: [2, 2, 2, 2]
dim → Hidden size at each transformer stage.
depth → Number of transformer blocks per stage.
💡 Deeper models capture more complex patterns but require more memory.
Attention Mechanism#
CREDIT supports global and local attention mechanisms to efficiently model atmospheric dynamics.
global_window_size: [10, 5, 2, 1]
local_window_size: 10
global_window_size → Size of global attention windows at each layer.
local_window_size → Size of local attention windows.
💡 Smaller window sizes focus on localized interactions, while larger sizes improve long-range dependencies.
Cross-Embedding (Multi-Scale Feature Extraction)#
cross_embed_kernel_sizes:
- [4, 8, 16, 32]
- [2, 4]
- [2, 4]
- [2, 4]
cross_embed_strides: [2, 2, 2, 2]
cross_embed_kernel_sizes → Defines kernel sizes for hierarchical embeddings.
cross_embed_strides → Controls how much spatial downsampling occurs.
💡 Larger kernel sizes extract broader-scale features, while smaller strides preserve fine details.
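To see how the strides translate into downsampling, the sketch below tracks the feature-map size after each stage. It assumes every stage divides both spatial dimensions by its stride and ignores any boundary padding; `stage_resolutions` is a hypothetical helper and the real model may round differently.

```python
def stage_resolutions(height, width, strides):
    """Feature-map (height, width) after each strided embedding stage."""
    sizes = []
    for stride in strides:
        height, width = height // stride, width // stride
        sizes.append((height, width))
    return sizes


sizes = stage_resolutions(640, 1280, [2, 2, 2, 2])
```

With the example configuration, four stride-2 stages reduce each spatial dimension by a factor of 16 overall.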
Regularization & Normalization#
CREDIT includes various techniques to improve training stability and prevent overfitting.
attn_dropout: 0.
ff_dropout: 0.
use_spectral_norm: True
attn_dropout → Dropout rate for attention layers (default = 0.0).
ff_dropout → Dropout rate for feed-forward layers (default = 0.0).
use_spectral_norm → Enables spectral normalization (helps with stability in deep networks).
💡 Increase dropout (0.1 - 0.3) for regularization in larger models.
Interpolation & Output Matching#
interp: True
True → Interpolates outputs to match input spatial resolution.
False → Outputs raw model predictions.
💡 Set interp: True to ensure predictions align with input grid resolution.
Summary of Key Model Recommendations#
| Parameter | Recommended Setting | Notes |
|---|---|---|
|  |  | Default transformer-based model. |
|  |  | More frames improve historical context. |
|  |  | Must match input dataset resolution. |
|  |  | Number of vertical pressure levels. |
|  |  | Controls model capacity. |
|  |  | Number of layers per stage. |
|  |  | Attention window size per layer. |
|  |  | Regularization for attention layers. |
|  |  | Stabilizes training. |
|  |  | Ensures output matches input grid. |
Handling Boundary Effects with Padding#
To improve numerical stability at domain edges, CREDIT supports boundary padding.
padding_conf:
    activate: True
    mode: earth
    pad_lat: 80
    pad_lon: 80
activate: True → Enables padding at spatial domain edges.
mode: 'earth' → Specifies Earth-system-aware padding (useful for atmospheric models), which is described in Schreck et al. 2025.
pad_lat → Extends padding by 80 latitude points.
pad_lon → Extends padding by 80 longitude points.
💡 Padding ensures continuity at boundaries, preventing artifacts in global simulations.
Summary of Key Padding Recommendations#
| Parameter | Recommended Setting | Notes |
|---|---|---|
|  |  | Enables domain padding. |
|  |  | Uses Earth-system-specific padding. |
|  |  | Adjust based on dataset resolution. |
|  |  | Ensures global continuity. |
Post-Block (post_conf)#
The post-processing block (post_conf) enforces physical conservation constraints on model outputs, correcting imbalances in mass, water, energy, and tracers.
Activating Post-Processing#
post_conf:
    activate: True
True→ Enables post-processing corrections.False→ Disables post-processing (not recommended for production runs).
💡 Always enable post_conf for physically consistent forecasts.
Stochastic Kinetic Energy Backscatter (SKEBS)#
SKEBS introduces stochastic perturbations to correct underdispersed forecasts in weather models.
True → Enables kinetic energy backscatter corrections (experimental).
False → Disables SKEBS.
💡 Enable if testing ensemble perturbations for uncertainty quantification.
skebs:
    activate: True
    freeze_base_model_weights: True # turn off training of the basemodel
    # skebs module training options
    trainable: True # is skebs trainable at all
    freeze_dissipation_weights: False # turn off training for dissipation
    freeze_pattern_weights: True # turn off training for the spectral pattern
    lmax: None # lmax, mmax for spectral transforms
    mmax: None
    # custom initialization of alpha
    alpha_init: 0.95
    train_alpha: False #trains alpha no matter what
    # dissipation config:
    zero_out_levels_top_of_model: 3 # zero out backscatter at top k levels of the model
    dissipation_scaling_coefficient: 10.
    dissipation_type: FCNN
    # available types:
    # - prescribed: fixed dissipation rate spatially, varies by level starts at sigma_max level (see below)
    # - uniform: fixed dissipation rate spatially, varies by level starts at 2.5
    # - FCNN: two layer small MLP
    # - FCNN_wide: four layer wide MLP
    # - unet: user specified arch, default: unet++
    # - CNN: single 3x3 convolution with padding for each column
    # unet - see models/unet.py for examples
    # architecture:
    padding: 48
    # prescribed dissipation:
    sigma_max: 2.0 # what sigma level to set as the max wind. perturbation will be roughly sigma_max * std for wind at each level
    # spectral filters, will anneal to 0 from anneal_start (linspace)
    max_pattern_wavenum: 60
    pattern_filter_anneal_start: 40
    max_backscatter_wavenum: 100
    backscatter_filter_anneal_start: 90
    # [Optional] default is off
    train_backscatter_filter: False
    train_pattern_filter: False
    # data config - does the backscatter model get statics variables?
    use_statics: False
    # [Optional] early skebs shutoff on iteration number:
    iteration_stop: 0 # if 0, skebs is always run
    #### debugging ####
    # write files during training:
    write_train_debug_files: False #writing out files while training, if this is False
    write_train_every: 999
    # write files during inference
    write_rollout_debug_files: False # saves only when no_grad
Conservation Schemes#
CREDIT enforces physical conservation laws for:
Water Conservation (tracers, precipitation, evaporation).
Mass Conservation (fixes inconsistencies in pressure/height fields).
Energy Conservation (balances fluxes and temperature).
General Settings for Conservation Fixers#
Each conservation scheme follows these shared settings:
# Applies the correction method
activate: True
# Converts from normalized values back to real units before applying fixes
denorm: True
# Runs the correction outside the model (useful for multi-step training)
activate_outside_model: False
# Specifies the grid type:
# "pressure" = constant pressure levels
# "sigma" = hybrid sigma-pressure levels
grid_type: "sigma"
# Required grid variables (latitude, longitude, vertical levels)
lon_lat_level_name: ["lon2d", "lat2d", "coef_a", "coef_b"]
# Specifies whether levels represent layer edges (midpoint=True) or centers (midpoint=False)
midpoint: True
💡 For sigma-coordinate models, ensure the physics file includes coef_a and coef_b. These are the hybrid sigma-pressure level coefficients, given in Pa and as a dimensionless fraction, respectively.
Tracer Fixer: Ensuring Non-Negative Water Content#
This correction ensures no negative values for total water content and precipitation.
tracer_fixer:
    activate: True
    denorm: True
    tracer_name: ["specific_total_water", "total_precipitation"]
    tracer_thres: [0, 0]
tracer_name → List of variables to fix (e.g., specific humidity, precipitation).
tracer_thres → Threshold values (e.g., 0 means no negative values allowed).
💡 Negative values can appear due to numerical instability—this ensures physically meaningful water content.
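Conceptually, the correction is a lower bound applied per variable. The sketch below assumes that values below the threshold are simply replaced by the threshold; that replacement rule is an assumption for illustration, and `fix_tracer` is not CREDIT's implementation.

```python
def fix_tracer(values, threshold=0.0):
    """Replace every value below `threshold` with `threshold`."""
    return [v if v >= threshold else threshold for v in values]


# Illustrative specific-humidity values with small negative artifacts.
fixed = fix_tracer([1.2e-3, -4.0e-6, 0.0, -1.0e-9], threshold=0.0)
```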
Global Mass Fixer#
This correction ensures total mass is conserved across all vertical levels.
global_mass_fixer:
    activate: True
    activate_outside_model: False
    simple_demo: False
    denorm: True
    grid_type: "sigma"
    midpoint: True
    fix_level_num: 7
    lon_lat_level_name: ["lon2d", "lat2d", "coef_a", "coef_b"]
    surface_pressure_name: ["SP"]
    specific_total_water_name: ["specific_total_water"]
fix_level_num: 7 → Ensures conservation only up to the 7th level (avoids modifying upper layers).
surface_pressure_name → Name of the surface pressure variable (used for pressure-mass balancing).
specific_total_water_name → Name of the specific humidity variable.
💡 Use this to prevent mass drift in long-term climate simulations.
Global Water Fixer#
This correction ensures global water conservation by adjusting precipitation and evaporation terms.
global_water_fixer:
    activate: True
    activate_outside_model: False
    simple_demo: False
    denorm: True
    grid_type: "sigma"
    midpoint: True
    lon_lat_level_name: ["lon2d", "lat2d", "coef_a", "coef_b"]
    surface_pressure_name: ["SP"]
    specific_total_water_name: ["specific_total_water"]
    precipitation_name: ["total_precipitation"]
    evaporation_name: ["evaporation"]
precipitation_name → Variable name for total precipitation.
evaporation_name → Variable name for evaporation flux.
💡 Prevents artificial drift in atmospheric moisture by correcting evaporation/precipitation imbalances.
Global Energy Fixer#
This correction ensures total energy conservation by adjusting heat fluxes, radiation, and wind kinetic energy.
global_energy_fixer:
    activate: True
    activate_outside_model: False
    simple_demo: False
    denorm: True
    grid_type: "sigma"
    midpoint: True
    lon_lat_level_name: ["lon2d", "lat2d", "coef_a", "coef_b"]
    surface_pressure_name: ["SP"]
    air_temperature_name: ["temperature"]
    specific_total_water_name: ["specific_total_water"]
    u_wind_name: ["u_component_of_wind"]
    v_wind_name: ["v_component_of_wind"]
    surface_geopotential_name: ["geopotential_at_surface"]
    TOA_net_radiation_flux_name: ["top_net_solar_radiation", "top_net_thermal_radiation"]
    surface_net_radiation_flux_name: ["surface_net_solar_radiation", "surface_net_thermal_radiation"]
    surface_energy_flux_name: ["surface_sensible_heat_flux", "surface_latent_heat_flux"]
Key Adjustments#
| Variable | Purpose |
|---|---|
|  | Balances total heat content. |
|  | Adjusts for latent heat effects. |
|  | Ensures kinetic energy conservation. |
|  | Ensures consistency with potential energy. |
|  | Accounts for top-of-atmosphere radiation balance. |
|  | Balances incoming and outgoing radiation. |
|  | Adjusts for surface energy exchanges. |
💡 Use this to prevent temperature drift and ensure radiative balance in climate models.
Summary of Key Conservation Fixers#
| Fixer | Purpose | Key Variables |
|---|---|---|
| Tracer Fixer | Prevents negative water values |  |
| Mass Fixer | Ensures total air mass conservation |  |
| Water Fixer | Balances precipitation and evaporation |  |
| Energy Fixer | Maintains energy balance (radiation, heat, wind) |  |
Best Practices#
✅ Always enable post_conf for physically consistent model outputs.
✅ Ensure save_loc_physics contains required grid variables (lon2d, lat2d, coef_a, coef_b).
✅ Adjust fix_level_num if conservation should only apply to certain layers.
✅ Test with simple_demo: True first to visualize corrections before full training.
Loss Configuration#
The loss section defines how CREDIT computes training loss, including options for custom loss functions, spectral constraints, and latitude-based weighting.
Selecting the Training Loss Function#
training_loss: "mse"
Available loss functions:
"mse" → Mean Squared Error (default; penalizes large errors).
"mae" → Mean Absolute Error (more robust to outliers).
"huber" → Huber Loss (combination of MSE and MAE).
"logcosh" → Log-Cosh Loss (similar to Huber, smooths large errors).
"xtanh" → Custom loss using hyperbolic tangent.
"xsigmoid" → Custom loss using sigmoid transformation.
"KCRPS" → Bias-corrected CRPS for ensemble training.
"almost-fair-crps" → Bias-corrected CRPS for ensemble training with small ensembles.
💡 mse is recommended for smooth loss surfaces, while huber or logcosh are better for handling outliers.
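To show why huber and logcosh are gentler on outliers than mse, here are the textbook per-element definitions. These are the standard formulas, written as a sketch; the `delta` default and the function names are assumptions, and CREDIT's own loss classes may differ in reduction and scaling.

```python
import math


def huber(err, delta=1.0):
    """Quadratic for |err| <= delta, linear beyond it."""
    a = abs(err)
    if a <= delta:
        return 0.5 * err * err
    return delta * (a - 0.5 * delta)


def logcosh(err):
    """log(cosh(err)), computed in a form that does not overflow for large |err|."""
    a = abs(err)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)


small = huber(0.5)   # inside the quadratic region
large = huber(3.0)   # inside the linear region
```

For an error of 3.0, the squared error is 9.0 while the Huber value is 2.5, which is the sense in which large errors are penalized less steeply.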
Power & Spectral Loss#
CREDIT supports spectral and power-based losses to penalize errors in the frequency domain.
use_power_loss: False
use_spectral_loss: False
spectral_lambda_reg: 0.1
spectral_wavenum_init: 20
use_power_loss → Enables power spectrum loss (recommended for climate models).
use_spectral_loss → Enables spectral loss (alternative to power loss).
spectral_lambda_reg → Weighting factor for spectral loss (0.1 = mild effect).
spectral_wavenum_init → Truncates low-wavenumber components, ensuring loss focuses on fine-scale structures.
💡 Enable only one of use_power_loss or use_spectral_loss—they should not be used together.
Latitude-Based Loss Weighting#
Since Earth’s surface area varies with latitude, CREDIT supports weighting loss by latitude.
latitude_weights: "/path/to/latitude_weights.nc"
use_latitude_weights: True
latitude_weights → NetCDF file containing cos(latitude) as a variable (coslat).
use_latitude_weights: True → Enables latitude-based weighting to prevent polar regions from dominating training loss.
💡 This is strongly recommended for global models to ensure loss scaling matches physical area coverage.
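The weights themselves are straightforward to compute from the latitude axis. The sketch below returns cos(latitude) scaled so that the weights average to one; the rescaling is an assumption made here so the overall loss magnitude stays comparable, and `latitude_weights` is a hypothetical helper rather than the script CREDIT uses to build its file.

```python
import math


def latitude_weights(lats_deg):
    """cos(latitude) weights, rescaled to have a mean of 1."""
    raw = [math.cos(math.radians(lat)) for lat in lats_deg]
    mean_raw = sum(raw) / len(raw)
    return [w / mean_raw for w in raw]


weights = latitude_weights([-60.0, 0.0, 60.0])
```

Grid points at 60° receive half the weight of points on the equator, reflecting the smaller area each one represents on a regular latitude-longitude grid.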
Variable-Specific Loss Weighting#
CREDIT allows custom loss weighting per variable, ensuring critical variables are penalized more heavily.
use_variable_weights: False
True → Enables custom per-variable loss weighting.
False → All variables contribute equally to the loss function.
Example: Custom Variable Weights#
variable_weights:
    U: [0.132, 0.123, 0.113, 0.104, 0.095, 0.085, 0.076, 0.067, 0.057, 0.048, 0.039, 0.029, 0.02, 0.011, 0.005]
    V: [0.132, 0.123, 0.113, 0.104, 0.095, 0.085, 0.076, 0.067, 0.057, 0.048, 0.039, 0.029, 0.02, 0.011, 0.005]
    T: [0.132, 0.123, 0.113, 0.104, 0.095, 0.085, 0.076, 0.067, 0.057, 0.048, 0.039, 0.029, 0.02, 0.011, 0.005]
    Q: [0.132, 0.123, 0.113, 0.104, 0.095, 0.085, 0.076, 0.067, 0.057, 0.048, 0.039, 0.029, 0.02, 0.011, 0.005]
    SP: 0.1
    t2m: 1.0
    V500: 0.1
    U500: 0.1
    T500: 0.1
    Z500: 0.1
    Q500: 0.1
Upper-air variables (U, V, T, Q): Different weights per level.
Surface variables (SP, t2m, etc.): Single weight per variable.
💡 Increase weighting for critical variables (e.g., T500, Z500) to improve accuracy in key forecast fields.
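One way to read the two weight shapes (a list per upper-air variable, a scalar per surface variable) is as coefficients in a weighted sum of per-variable losses. The helper below is a hypothetical sketch of that reading, with a shortened three-level example; it is not taken from the CREDIT source.

```python
def weighted_total_loss(per_variable_loss, variable_weights):
    """Combine per-variable losses into one scalar.

    per_variable_loss maps a variable name to either a float (single-level
    variable) or a list of floats (one loss per vertical level).
    variable_weights uses the same shapes, as in the YAML example above.
    """
    total = 0.0
    for name, loss in per_variable_loss.items():
        weight = variable_weights[name]
        if isinstance(loss, (list, tuple)):
            if len(weight) != len(loss):
                raise ValueError(f"{name}: {len(weight)} weights for {len(loss)} levels")
            total += sum(w * l for w, l in zip(weight, loss))
        else:
            total += weight * loss
    return total

# Shortened, illustrative values: three levels instead of fifteen.
weights = {"U": [0.5, 0.3, 0.2], "t2m": 1.0, "SP": 0.1}
losses = {"U": [1.0, 1.0, 1.0], "t2m": 2.0, "SP": 10.0}

print(weighted_total_loss(losses, weights))
```

The length check matters in practice: a weight list whose length does not match the number of levels is an easy configuration mistake to make.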
Summary of Key Loss Recommendations#
| Parameter | Recommended Setting | Notes |
|---|---|---|
|  |  | Use |
|  |  | Set |
|  |  | Do not enable both spectral and power loss. |
|  |  | Adjust to control spectral penalty strength. |
|  |  | Recommended for global datasets. |
|  |  | Enable if some variables are more important. |
Prediction (Inference) Configuration#
The predict section controls how CREDIT runs forecasts after training, including:
Batching and parallel execution
Forecast initialization settings
Storage format for predicted fields
Post-processing options (e.g., low-pass filtering, anomaly computation)
GPU Usage for Inference#
CREDIT supports single-GPU and distributed inference.
mode: none # Options: "none", "fsdp", "ddp"
none → Runs inference on a single GPU.
fsdp → Fully Sharded Data Parallel (recommended for multi-GPU).
ddp → Distributed Data Parallel (alternative for multi-GPU).
💡 Use fsdp for large models to optimize memory usage during inference.
Batch Size & Ensemble Forecasting#
batch_size: 1
ensemble_size: 1
batch_size → Number of forecast initializations processed at once.
ensemble_size → Number of ensemble members per initialization.
💡 Increase batch_size if running inference on multiple GPUs.
Forecast Initialization Settings#
CREDIT can initialize forecasts at specific times and run for a set duration.
forecasts:
    type: "custom"
    start_year: 2019
    start_month: 1
    start_day: 1
    start_hours: [0, 12]
    duration: 1152
    days: 10
type → "custom" (default; allows user-defined start dates).
start_year, start_month, start_day → Define the first forecast initialization.
start_hours → List of times per day for initializing forecasts (e.g., 0 for 00Z, 12 for 12Z).
duration → Total number of days to initialize forecasts. Should be divisible by the number of GPUs for parallel execution.
days → Forecast lead time in days (e.g., 10 = 10-day forecast).
💡 For year-long forecasts, set duration: 365 and start_hours: [0] (daily initialization).
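The sketch below enumerates the initialization times implied by these settings. It assumes the reading given above, namely that `duration` counts consecutive days and every day is initialized at each entry of `start_hours`. The helper names are hypothetical and CREDIT's own scheduling code may differ.

```python
from datetime import datetime, timedelta

def init_times(start_year, start_month, start_day, start_hours, duration):
    """List every forecast initialization time, in chronological order."""
    start = datetime(start_year, start_month, start_day)
    times = []
    for day in range(duration):
        for hour in sorted(start_hours):
            times.append(start + timedelta(days=day, hours=hour))
    return times

def splits_evenly(duration, n_gpus):
    """True if the initialization days divide evenly across the GPUs."""
    return duration % n_gpus == 0

times = init_times(2019, 1, 1, [0, 12], duration=3)
for t in times:
    print(t.isoformat())

print(splits_evenly(1152, 4))
```

With `start_hours: [0, 12]` the number of initializations is twice `duration`, which is worth keeping in mind when estimating output storage.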
Output Storage & File Naming#
save_forecast: '/path/to/forecast_output/'
Defines where forecast outputs are stored.
Each initialization creates a separate subdirectory inside save_forecast/.
Output files are saved in NetCDF format (.nc).
💡 Ensure the path has enough storage capacity for long-duration forecasts!
Selecting Output Variables#
metadata: '/path/to/metadata/era5.yaml'
CREDIT automatically selects which variables to save based on this metadata file.
To save all variables, remove save_vars from configuration.yml.
💡 Modify metadata.yaml if custom variables need to be included/excluded.
Low-Pass Filtering for Smoother Predictions#
use_laplace_filter: False
True → Applies a low-pass filter to reduce high-frequency noise.
False → Saves raw model outputs without filtering.
💡 Enable use_laplace_filter: True if forecasts contain unrealistic high-frequency oscillations.
Climatology File for Anomaly Computation#
CREDIT can compute anomaly correlations using a reference climatology.
climatology: '/path/to/climatology.nc'
If provided, rollout_metrics.py will compute the Anomaly Correlation Coefficient (ACC).
If missing, Pearson correlation is used instead.
💡 Use a 30-year climatology (e.g., ERA5 1990-2019) for best results.
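The difference between the two metrics is easiest to see on a small example. The sketch below uses the common uncentered ACC definition (the correlation of forecast and observed anomalies relative to climatology); whether `rollout_metrics.py` uses exactly this variant is an assumption.

```python
import math

def pearson(x, y):
    """Pearson correlation of the raw fields."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def acc(forecast, observed, climatology):
    """Uncentered anomaly correlation: correlate departures from climatology."""
    f_anom = [f - c for f, c in zip(forecast, climatology)]
    o_anom = [o - c for o, c in zip(observed, climatology)]
    num = sum(a * b for a, b in zip(f_anom, o_anom))
    den = math.sqrt(sum(a * a for a in f_anom) * sum(b * b for b in o_anom))
    return num / den

climatology = [0.0, 10.0, 20.0, 30.0]  # strong background gradient
observed = [1.0, 9.0, 21.0, 29.0]      # climatology + [1, -1, 1, -1]
forecast = [-1.0, 11.0, 19.0, 31.0]    # climatology + [-1, 1, -1, 1]

print(pearson(forecast, observed))           # high: dominated by the gradient
print(acc(forecast, observed, climatology))  # negative: anomalies are opposite
```

Pearson correlation rewards the forecast for reproducing the climatological gradient, while ACC exposes that every anomaly has the wrong sign. This is why supplying a climatology file gives a stricter evaluation.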
Summary of Key Prediction Recommendations#
| Parameter | Recommended Setting | Notes |
|---|---|---|
|  |  |  |
|  |  | Processes multiple initializations at once. |
|  |  | Supports probabilistic forecasting. |
|  |  | Runs forecasts twice daily. |
|  |  | Should be divisible by the number of GPUs. |
|  |  | Enable if forecasts contain high-frequency noise. |
|  |  | Improves anomaly-based evaluation. |
PBS Job Submission (HPC)#
For running CREDIT on NCAR HPC systems (Derecho, Casper):
pbs:
    conda: "credit-derecho"
    project: "NAML0001"
    job_name: "train_model"
    walltime: "12:00:00"
    nodes: 8
    ncpus: 64
    ngpus: 4
nodes, ncpus, ngpus: Adjust based on compute resources.
For Casper: Change queue: 'casper' and specify gpu_type: 'v100'.
Troubleshooting#
| Issue | Possible Cause | Solution |
|---|---|---|
| Training loss does not decrease | Learning rate too high/low | Adjust |
| Model runs out of memory | Batch size too large | Reduce |
| Output fields look unrealistic | Conservation schemes disabled | Ensure |
| Forecasts diverge quickly | Model lacks historical context | Increase |
| Data loading errors | Incorrect file format or missing variables | Ensure |
Best Practices#
Check Data Formats: Ensure variables follow expected dimensions (time, level, lat, lon).
Use a Seed for Reproducibility: Keep seed fixed unless testing variations.
Enable Conservation Schemes: To maintain physical consistency.
Run Small Tests First: Before launching full-scale HPC jobs, test with fewer epochs (num_epoch: 5).
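The data-format check from the list above can be automated before submitting a job. The helper below is a hypothetical sketch that works on a plain mapping of variable name to dimension names (with xarray, these could be collected from `ds[name].dims`). The exact dimension names are assumptions based on the expected formats described earlier in this guide.

```python
# Assumed dimension names; adjust if your files use e.g. "lat"/"lon".
EXPECTED_DIMS = {
    "upper_air": ("time", "level", "latitude", "longitude"),
    "surface": ("time", "latitude", "longitude"),
}

def check_dims(var_dims, upper_air_vars, surface_vars):
    """Return a list of problems; an empty list means every variable passed."""
    problems = []
    for kind, names in (("upper_air", upper_air_vars), ("surface", surface_vars)):
        expected = EXPECTED_DIMS[kind]
        for name in names:
            if name not in var_dims:
                problems.append(f"{name}: missing from dataset")
            elif tuple(var_dims[name]) != expected:
                problems.append(f"{name}: dims {tuple(var_dims[name])}, expected {expected}")
    return problems

dims = {
    "U": ("time", "level", "latitude", "longitude"),
    "t2m": ("time", "longitude", "latitude"),  # latitude/longitude swapped
}
for problem in check_dims(dims, upper_air_vars=["U", "V"], surface_vars=["t2m"]):
    print(problem)
```

Running a check like this alongside a short `num_epoch` test catches most data-loading errors before a long job reaches the queue.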
Additional Resources#
This guide is a living document—please report issues or suggest improvements! 🚀