Prediction Rollouts#

Prediction Ingredients#

Before beginning rollouts of a CREDIT model, you will need the following ingredients/files available on your machine.

  1. 🌎 Initial conditions for upper air and surface variables in Zarr format. If running processed ERA5 on Derecho or Casper, you can access the processed files at /glade/campaign/cisl/aiml/credit/era5_zarr/. The y_TOTAL*.zarr and SixHourly_y_TOTAL*.zarr files are at 0.28 degree grid spacing, and the SixHourly_y_ONEdeg*.zarr files are at 1 degree grid spacing.

  2. 🌞 Dynamic forcing files covering the full period of prediction. In the current CREDIT models, the only dynamic forcing variable is top-of-atmosphere shortwave irradiance. Pre-calculated solar irradiance values integrated over 1 and 6 hour periods are available on Derecho/Casper at /glade/campaign/cisl/aiml/credit/credit_solar_nc_6h_0.25deg and credit_solar_nc_1h_0.25deg. You can calculate top of atmosphere solar irradiance for any grid and integration time period with credit_calc_global_solar. If you plan to issue regular predictions, we recommend pre-computing solar irradiance values for a given year of inference rather than calculating on the fly.

  3. ⛰️ Static forcing files with and without normalization. These forcing files include elements like terrain height, land-sea mask, and land-use type. Static forcing files for the initial CREDIT models are currently archived at /glade/campaign/cisl/aiml/credit/static_scalers/. static_norm_old.nc has normalized terrain height and land-sea mask, while unnormalized values are in LSM_static_variables_ERA5_zhght.nc. The unnormalized values are needed for interpolation to pressure and height levels.

  4. Files containing the mean and standard deviation scaling values for each variable. CREDIT currently reads these values from netCDF files, which are stored on Derecho in /glade/campaign/cisl/aiml/credit/static_scalers/. The appropriate files to use are mean_6h_1979_2018_16lev_0.25deg.nc for the mean and std_residual_6h_1979_2018_16lev_0.25deg.nc for the combined standard deviation of each variable and the standard deviation of the temporal residual.
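For readers who want a feel for what the dynamic forcing in item 2 represents, the following is a minimal sketch of instantaneous top-of-atmosphere irradiance at a single point. It is not the algorithm used by credit_calc_global_solar; the function name toa_irradiance, the 1361 W m^-2 solar constant, and the simple cosine approximations for declination and Earth-Sun distance are assumptions made for illustration, and the sketch ignores the equation of time and does not integrate over a 1 or 6 hour window.

```python
import math
from datetime import datetime, timezone

SOLAR_CONSTANT = 1361.0  # W m^-2, approximate total solar irradiance (assumed value)


def toa_irradiance(lat_deg, lon_deg, when_utc):
    """Approximate instantaneous top-of-atmosphere irradiance in W m^-2."""
    day_of_year = when_utc.timetuple().tm_yday
    # Simple cosine fit for the solar declination (radians).
    declination = math.radians(-23.44) * math.cos(
        2.0 * math.pi * (day_of_year + 10) / 365.0
    )
    # Correction for the varying Earth-Sun distance.
    distance_factor = 1.0 + 0.034 * math.cos(2.0 * math.pi * day_of_year / 365.0)
    # Local solar hour angle, ignoring the equation of time.
    utc_hours = when_utc.hour + when_utc.minute / 60.0 + when_utc.second / 3600.0
    hour_angle = math.radians(15.0 * (utc_hours - 12.0) + lon_deg)
    lat = math.radians(lat_deg)
    cos_zenith = (
        math.sin(lat) * math.sin(declination)
        + math.cos(lat) * math.cos(declination) * math.cos(hour_angle)
    )
    # The sun is below the horizon when cos_zenith is negative.
    return SOLAR_CONSTANT * distance_factor * max(cos_zenith, 0.0)


noon = toa_irradiance(0.0, 0.0, datetime(2024, 3, 20, 12, tzinfo=timezone.utc))
midnight = toa_irradiance(0.0, 0.0, datetime(2024, 3, 20, 0, tzinfo=timezone.utc))
print(round(noon, 1), midnight)
```

An integrated value over a forecast step could be approximated by averaging this function over several sub-step times, which is one reason pre-computing a full year is cheaper than evaluating it on the fly.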

Realtime Rollouts#

The goal of realtime inference is to launch model forecasts from GFS, GEFS, or ERA5 initial conditions. The predict section of your configuration file should contain the following fields:

predict:
  mode: none
  realtime:
    forecast_start_time: "2025-04-14 12:00:00" # change to your init date
    forecast_end_time: "2025-04-24 12:00:00" # Should be sometime after init date
    forecast_timestep: "6h" # Needs to contain h for hours and should match 1 or 6 hour model.
  initial_condition_path: "/path/to/gfs_init/" # change 
  static_fields: "/Users/dgagne/data/CREDIT_data/LSM_static_variables_ERA5_zhght.nc" # Static forcing file.
  metadata: '/Users/dgagne/miles-credit/credit/metadata/era5.yaml' # Path to metadata for output
  save_forecast: '/Users/dgagne/data/wxformer_6h_test/' # path to save forecast data
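As a quick check on the realtime settings above, the sketch below counts how many model steps a given start time, end time, and timestep imply. The helper name count_forecast_steps is hypothetical and is not part of CREDIT; it only assumes the timestamp format and the "h" suffix shown in the config.

```python
from datetime import datetime, timedelta


def count_forecast_steps(start, end, timestep):
    """Return how many timesteps fit between start and end (timestep like '6h')."""
    fmt = "%Y-%m-%d %H:%M:%S"
    if not timestep.endswith("h"):
        raise ValueError("timestep must end with 'h', e.g. '6h'")
    step = timedelta(hours=int(timestep[:-1]))
    span = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if span <= timedelta(0):
        raise ValueError("forecast_end_time must be after forecast_start_time")
    # Integer division of two timedeltas gives a whole number of steps.
    return span // step


n_steps = count_forecast_steps("2025-04-14 12:00:00", "2025-04-24 12:00:00", "6h")
print(n_steps)  # 40
```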

If you want to use GFS initial conditions, run python applications/gfs_init.py -c <config file>. It will download fields from a GFS initial condition on model levels, which are archived for the past 10 days on the NOAA NOMADS server. GDAS Analyses and GFS initial timesteps on model levels are also available on Google Cloud back to 2021. The credit_gfs_init program regrids the data onto the appropriate CREDIT grid and interpolates in the vertical from the GFS to selected CREDIT ERA5 hybrid sigma-pressure levels.

Important

credit_gfs_init requires xesmf, which depends on the ESMF suite and cannot be installed from PyPI. The easiest way to install xesmf without breaking your CREDIT environment is to build the CREDIT environment first, then run conda install -c conda-forge esmf esmpy followed by pip install xesmf.

If you want to launch ensemble rollouts, you can use credit_gefs_init to convert raw GEFS cubed-sphere data to grids for CREDIT models.

Realtime rollouts are handled by credit_rollout_realtime. Update the paths in the data section of the config file to point to the GFS initial conditions Zarr file. credit_rollout_realtime outputs only one forecast at a time.

Rollout to netCDF for ERA5 initiated forecasts#

credit rollout generates forecasts for many initialization times using processed ERA5 data as initial conditions. It supports deterministic and ensemble rollouts, serial and parallel modes, single and multi-node execution.

Add the following section to your config file:

predict:
    mode: none
    forecasts:
        type: "custom"       # keep it as "custom"
        start_year: 2020     # year of the first initialization (where rollout will start)
        start_month: 1       # month of the first initialization
        start_day: 1         # day of the first initialization
        start_hours: [0, 12] # hour-of-day for each initialization, 0 for 00Z, 12 for 12Z
        duration: 32         # number of days to initialize, starting from the (year, mon, day) above
                             # duration should be divisible by the number of GPUs
                             # (e.g., duration: 384 for 365-day rollout using 32 GPUs)
        days: 10             # forecast lead time as days (1 means 24-hour forecast)
    ensemble_size: 1         # set > 1 to save ensemble members to NetCDF
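To see how many forecasts a forecasts block implies, the sketch below enumerates initialization times under the reading given in the config comments: one initialization per entry in start_hours on each of duration consecutive days. The function list_init_times is hypothetical and only mirrors those comments; it is not CREDIT's own loader.

```python
from datetime import datetime, timedelta


def list_init_times(start_year, start_month, start_day, start_hours, duration):
    """List init times: one per entry in start_hours for each of `duration` days."""
    first_day = datetime(start_year, start_month, start_day)
    return [
        first_day + timedelta(days=day, hours=hour)
        for day in range(duration)
        for hour in start_hours
    ]


init_times = list_init_times(2020, 1, 1, [0, 12], duration=32)
print(len(init_times))                 # 64
print(init_times[0], init_times[-1])   # first and last initialization
```

With duration: 32 and two start hours per day, this gives 64 forecasts, each running for the lead time set by days.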

Running locally#

# Deterministic rollout (reads ensemble_size from config)
credit rollout -c config.yml

# Ensemble rollout: override ensemble_size from the CLI
credit rollout -c config.yml --ensemble-size 50

# Multi-GPU on a single node
credit rollout -c config.yml -m ddp

Submitting PBS jobs#

Use credit submit to submit rollout jobs to the cluster. The --rollout flag switches from training submission to parallel rollout submission. --jobs N splits init times across N independent PBS jobs (all start at once, no afterok chain).

# Submit 10 parallel rollout jobs on Casper (deterministic or ensemble, as set by config)
credit submit --cluster casper -c config.yml --rollout --jobs 10

# Set the number of GPUs at submission time
credit submit --cluster casper -c config.yml --rollout --jobs 10 --gpus 1

# Dry run: inspect the PBS scripts before submitting
credit submit --cluster casper -c config.yml --rollout --jobs 10 --dry-run

--jobs controls how many PBS jobs split the init-time work. ensemble_size in the config (or --ensemble-size at the CLI) controls how many ensemble members are run per init time. These are independent settings.
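The independence of the two settings can be illustrated with a small sketch. It assumes a round-robin split of init times across jobs, which is an illustrative choice only; the actual partitioning used by credit submit may differ, and split_round_robin is a hypothetical helper.

```python
def split_round_robin(init_times, n_jobs):
    """Deal init times out to jobs one at a time (assumed scheme, for illustration)."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    return [init_times[job::n_jobs] for job in range(n_jobs)]


init_times = [f"init_{i:02d}" for i in range(64)]  # placeholder labels
ensemble_size = 50                                 # members per init time
shares = split_round_robin(init_times, n_jobs=10)

for job, share in enumerate(shares):
    # Each job runs every ensemble member for each of its init times.
    print(job, len(share), len(share) * ensemble_size)
```

Changing the job count only changes how the 64 init times are divided; the number of members per init time stays at ensemble_size.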

Multi-node rollout (MPI)#

For MPI-enabled PyTorch installations:

nodes=( $( cat $PBS_NODEFILE ) )
head_node=${nodes[0]}
head_node_ip=$(ssh $head_node hostname -i | awk '{print $1}')
export NUM_RANKS=32
export MASTER_ADDR=$head_node_ip
export MASTER_PORT=1234
mpiexec -n $NUM_RANKS -ppn 4 --cpu-bind none python applications/rollout_to_netcdf_v2.py -c config.yml

Interpolation to constant pressure and height above ground levels#

Both credit_rollout_realtime and credit_rollout_to_netcdf support vertical interpolation from the hybrid sigma-pressure levels used by most models in CREDIT to constant pressure levels and constant height above ground level (AGL) levels. To enable interpolation, add the following lines to the data and predict sections of your config file:

data:
  level_ids: [10, 30, 40, 50, 60, 70, 80, 90, 95, 100, 105, 110, 120, 130, 136, 137]
predict:
  interp_pressure:
    pressure_levels: [300.0, 500.0, 850.0, 925.0] # in hPa
    height_levels: [100.0, 500.0, 1000.0, 2000.0, 3000.0, 4000.0, 5000.0, 6000.0] # in meters

More configuration options are listed in full_state_pressure_interpolation in credit/interp.py and can be set from the config file in the interp_pressure section. The interpolation routine interpolates to pressure levels using approximately the same approach as ECMWF, although results will not be exactly the same due to slight numerical and implementation differences. The routine also calculates pressure and geopotential on all levels. Mean sea level pressure is also calculated in this routine.
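The core idea of interpolating a model column to a constant pressure level can be sketched as linear interpolation in the logarithm of pressure. This is a simplified illustration for a single column, not the routine in credit/interp.py; the function name interp_to_pressure and the sample column values are made up for the example.

```python
import math
from bisect import bisect_left


def interp_to_pressure(column_values, column_pressures_hpa, target_hpa):
    """Interpolate linearly in log(pressure); pressures must increase with index."""
    if not column_pressures_hpa[0] <= target_hpa <= column_pressures_hpa[-1]:
        raise ValueError("target pressure is outside the model column")
    upper = bisect_left(column_pressures_hpa, target_hpa)
    if column_pressures_hpa[upper] == target_hpa:
        return column_values[upper]  # target sits exactly on a model level
    lower = upper - 1
    log_lo = math.log(column_pressures_hpa[lower])
    log_hi = math.log(column_pressures_hpa[upper])
    weight = (math.log(target_hpa) - log_lo) / (log_hi - log_lo)
    return column_values[lower] + weight * (column_values[upper] - column_values[lower])


# Illustrative column: pressure (hPa) and temperature (K) on five model levels.
pressures = [250.0, 400.0, 600.0, 900.0, 1000.0]
temperatures = [220.0, 245.0, 262.0, 280.0, 285.0]
t500 = interp_to_pressure(temperatures, pressures, 500.0)
print(round(t500, 2))
```

A full implementation also has to handle targets below the surface or above the model top, which this sketch simply rejects.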

Saving compressed and chunked netCDF files#

By default, the rollout scripts save uncompressed netCDF files. These can grow quite large if you are producing many forecasts and saving all the fields. You can save substantial space by turning on netCDF compression and choosing chunk sizes that align with your preferred access pattern. Encoding options like the ones below go in the predict section of the config file.

model:
    # crossformer example
    type: "crossformer"
    frames: 1                         # number of input states (default: 1)
    image_height: &height 640         # number of latitude grids (default: 640)
    image_width: &width 1280          # number of longitude grids (default: 1280)
    levels: &levels 16                # number of upper-air variable levels (default: 15)
    channels: 4                       # upper-air variable channels
predict:
  ua_var_encoding:
    zlib: True # turns on zlib compression.
    complevel: 1 # ranges from 1 to 9. 1 is faster with a lower compression ratio, 9 is slower with a higher compression ratio.
    shuffle: True
    chunksizes: [1, *levels, *height, *width]

  pressure_var_encoding:
    zlib: True
    complevel: 1
    shuffle: True
    chunksizes: [ 1, 4, *height, *width] # second dim should match number of interp pres. levels
    
  height_var_encoding:
    zlib: True
    complevel: 1
    shuffle: True
    chunksizes: [ 1, 8, *height, *width] # second dim should match number of interp height levels

  surface_var_encoding:
    zlib: True
    complevel: 1
    shuffle: True
    chunksizes: [1, *height, *width]
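To judge whether the chunk shapes above are sensible, it can help to compute the uncompressed size of one chunk. The sketch below assumes 4-byte (float32) values, which is an assumption about the output dtype rather than something stated in the config; chunk_mebibytes is a hypothetical helper.

```python
from math import prod

BYTES_PER_VALUE = 4  # assumes float32 output


def chunk_mebibytes(chunksizes):
    """Uncompressed size of one chunk in MiB."""
    return prod(chunksizes) * BYTES_PER_VALUE / 2**20


height, width, levels = 640, 1280, 16
chunks = {
    "ua_var_encoding": [1, levels, height, width],
    "pressure_var_encoding": [1, 4, height, width],
    "height_var_encoding": [1, 8, height, width],
    "surface_var_encoding": [1, height, width],
}
for name, shape in chunks.items():
    print(f"{name}: {chunk_mebibytes(shape):.3f} MiB per chunk")
```

Under these assumptions each upper-air chunk holds one full time step of one variable (50 MiB before compression), so reading a single level still decompresses the whole 16-level chunk; a smaller level dimension in chunksizes would suit level-by-level access better.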

Running Rollouts with the v2 Data Schema#

If you trained with trainer.type: era5-gen2, use the v2 rollout commands. The same YAML config used for training drives inference; no separate rollout config is needed.

Batch rollout to NetCDF#

credit rollout steps the model forward over a set of historical initial conditions and writes one NetCDF file per forecast:

credit rollout -c config/wxformer_1dg_6hr_v2.yml

To run on multiple GPUs pass --mode ddp:

credit rollout -c config/wxformer_1dg_6hr_v2.yml --mode ddp

The predict block in your config controls which dates are run and where output goes:

predict:
    mode: ddp           # none | ddp
    batch_size: 4       # initial conditions per GPU per batch
    ensemble_size: 1    # > 1 enables ensemble inference (requires ensemble model)
    forecasts:
        type: "custom"
        start_year: 2020
        start_month: 1
        start_day: 1
        start_hours: [0, 12]   # UTC hours to initialise each day
        duration: 1             # number of days to initialize, starting from the start date
        days: 1                 # forecast lead time in days
    metadata: '/path/to/credit/metadata/era5.yaml'
    save_forecast: '/glade/derecho/scratch/$USER/CREDIT_runs/my_run'
    use_laplace_filter: False

Output files land in save_forecast/. Filename format is <YYYY><MM><DD><HH>Z_<lead_hours>h.nc.
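When post-processing many output files, the filename pattern above can be parsed back into an initialization time and a valid time. The sketch below assumes the lead time is written as a plain integer number of hours (with or without zero padding); the example filename is invented for illustration and parse_forecast_filename is a hypothetical helper.

```python
import re
from datetime import datetime, timedelta

# Matches <YYYY><MM><DD><HH>Z_<lead_hours>h.nc
FILENAME_PATTERN = re.compile(r"^(\d{4})(\d{2})(\d{2})(\d{2})Z_(\d+)h\.nc$")


def parse_forecast_filename(name):
    """Return (init_time, valid_time) parsed from a forecast filename."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected filename: {name}")
    year, month, day, hour, lead_hours = (int(group) for group in match.groups())
    init_time = datetime(year, month, day, hour)
    return init_time, init_time + timedelta(hours=lead_hours)


init_time, valid_time = parse_forecast_filename("2020010112Z_024h.nc")
print(init_time, valid_time)
```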

Realtime forecast from a single init time#

credit realtime runs one forecast from a user-specified initialisation time, writing output as it steps (useful for operational or near-realtime use):

credit realtime -c config/wxformer_1dg_6hr_v2.yml \
    --init-time 2024-01-15T00 \
    --steps 40

--steps 40 = 40 × 6 h = 10-day forecast. Output lands in predict.save_forecast.

To override the output directory:

credit realtime -c config.yml --init-time 2024-06-01T12 --steps 40 \
    --save-dir /tmp/test_forecast

Quick sanity-check after training#

The fastest way to verify a freshly trained model produces sensible output:

# Plot 2m temperature in physical units (Kelvin), the recommended starting point
credit plot -c config/wxformer_1dg_6hr_v2.yml --field VAR_2T --denorm

# Multiple fields at once
credit plot -c config/wxformer_1dg_6hr_v2.yml --field VAR_2T SP --denorm

# 3D variable: temperature at level index 5 (pressure-level ordering)
credit plot -c config/wxformer_1dg_6hr_v2.yml --field temperature --level 5 --denorm

# Point at a specific checkpoint or date
credit plot -c config/wxformer_1dg_6hr_v2.yml --field VAR_2T \
    --checkpoint /glade/derecho/scratch/$USER/CREDIT_runs/my_run/checkpoint.pt \
    --sample-date 2020-06-15T00 --denorm

Each PNG is saved to <save_loc>/plots/ and shows truth | prediction | difference as a global map.

--denorm converts outputs from normalised (σ) units to physical units using the mean and std files from your config, e.g. Kelvin for temperature, Pascals for surface pressure. Without --denorm the colourbar is in standard-deviation units, which is useful for diagnosing normalisation issues but harder to interpret at a glance.
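The conversion itself is ordinary z-score scaling. The sketch below shows the arithmetic with made-up mean and standard deviation values for 2m temperature; the real values come from the mean and std files named in your config, and the function names are hypothetical.

```python
import math


def normalize(value, mean, std):
    """Convert a physical value to standard-deviation units."""
    return (value - mean) / std


def denormalize(z, mean, std):
    """Convert a value in standard-deviation units back to physical units."""
    return z * std + mean


# Illustrative statistics only, not taken from the CREDIT scaler files.
t2m_mean, t2m_std = 278.0, 21.0
z = 0.5
kelvin = denormalize(z, t2m_mean, t2m_std)
print(kelvin)                                # 288.5
print(normalize(kelvin, t2m_mean, t2m_std))  # back to 0.5
```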

What to look for:

| Symptom | Likely cause |
| --- | --- |
| Loss > 100 or NaN | Normalisation broken; check mean/std paths |
| Prediction is uniform (no structure) | Too few epochs or learning rate too high |
| Tiling / grid artefacts in prediction | Normal at early epochs for window-based models; disappears with training |
| Difference panel is smooth and small | Training is going well |

NCAR data paths#

The built-in v2 configs already point to the shared ERA5 archive at /glade/campaign/cisl/aiml/ksha/CREDIT_data/ and the shared metadata at /glade/u/home/akn7/miles-credit/credit/metadata/era5.yaml. No path edits are needed for NCAR users.