Preparing your dataset for CREDIT models

Overall steps

Customized data can be prepared through the following steps:

  • Pack your variables into yearly files using netCDF or zarr data formats.

  • Prepare mean and std files for z-score normalization.

  • (Optional) Prepare your static and periodic forcing variables.

For example, you can have:

train_data_2000.zarr, train_data_2001.zarr, …, train_data_2024.zarr
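
As a minimal sketch of how such yearly files might be produced, the example below splits a small synthetic dataset by calendar year with xarray. The variable name `T`, the grid sizes, and the helper names `split_by_year` and `write_yearly` are hypothetical and not part of CREDIT; writing to zarr additionally assumes the `zarr` package is installed.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy dataset: one upper-air variable "T", daily, 2000-2001.
time = pd.date_range("2000-01-01", "2001-12-31", freq="D")
level = np.arange(3)
lat = np.linspace(-90, 90, 4)
lon = np.linspace(0, 270, 4)
rng = np.random.default_rng(0)
values = rng.normal(size=(time.size, level.size, lat.size, lon.size))

ds = xr.Dataset(
    {"T": (("time", "level", "latitude", "longitude"), values.astype("float32"))},
    coords={"time": time, "level": level, "latitude": lat, "longitude": lon},
)


def split_by_year(dataset):
    """Return {year: sub-dataset} with one entry per calendar year."""
    return {int(year): ds_year for year, ds_year in dataset.groupby("time.year")}


def write_yearly(yearly, template="train_data_{year}.zarr"):
    """Write each yearly sub-dataset to its own zarr store (needs `zarr`)."""
    for year, ds_year in yearly.items():
        ds_year.to_zarr(template.format(year=year), mode="w")


yearly = split_by_year(ds)
```

Calling `write_yearly(yearly)` would then create `train_data_2000.zarr` and `train_data_2001.zarr` in the working directory; `to_netcdf` can be used in the same way for netCDF output.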

Files need to be referenced in the YAML config file using glob-style wildcard patterns:

data:
  # upper-air variables
  variables: ['your_varnames',]
  save_loc: '/your_path/train_data_*.zarr'
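
To illustrate what the `*` in `save_loc` selects, the sketch below expands a pattern of the same shape with Python's standard `glob` module. It assumes the pattern is matched the way `glob` matches it; the temporary directory and the extra `static_fields.zarr` entry are made up for the demonstration.

```python
import glob
import os
import tempfile

# Create empty stand-ins for yearly zarr stores in a temporary directory.
tmpdir = tempfile.mkdtemp()
for year in (2000, 2001, 2002):
    os.mkdir(os.path.join(tmpdir, f"train_data_{year}.zarr"))
os.mkdir(os.path.join(tmpdir, "static_fields.zarr"))  # should not match

# The wildcard matches any run of characters between the prefix and suffix.
pattern = os.path.join(tmpdir, "train_data_*.zarr")
matched = sorted(glob.glob(pattern))
names = [os.path.basename(path) for path in matched]
```

Sorting the matches puts the yearly stores in chronological order as long as the year is written with a fixed number of digits.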

Supported variable types

The following variable types are supported:

  • Upper-air variables: stored in yearly files with coordinates (time, level, latitude, longitude); used as both inputs and outputs, e.g., air temperature.

  • Surface variables: stored in yearly files with coordinates (time, latitude, longitude); used as both inputs and outputs, e.g., surface pressure.

  • Dynamic forcing variables: stored in yearly files with coordinates (time, latitude, longitude); used as inputs only, e.g., solar radiation.

  • Diagnostic variables: stored in yearly files with coordinates (time, latitude, longitude); used as outputs only, e.g., precipitation.

  • Periodic forcing variables: stored in a single file with coordinates (time, latitude, longitude), where the time coordinate covers the 366 days of a year. These variables are reused as an annual cycle and serve as inputs only, e.g., periodic sea surface temperature.

  • Static variables: stored in a single file with coordinates (latitude, longitude); used as inputs only, e.g., terrain elevation.

CREDIT uses periodic forcing variables and static variables as they are, so users should normalize them beforehand.
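
The source does not prescribe a particular normalization for these fields. As a rough sketch, the example below shows two common choices, z-score and min-max scaling, applied to a made-up terrain elevation field with NumPy. The array values and the helper names `zscore` and `minmax` are illustrative assumptions.

```python
import numpy as np

# Hypothetical terrain elevation in meters on a (latitude, longitude) grid.
rng = np.random.default_rng(0)
elevation = rng.uniform(0.0, 4000.0, size=(4, 8))


def zscore(field):
    """Standardize a field to zero mean and unit standard deviation."""
    return (field - field.mean()) / field.std()


def minmax(field):
    """Rescale a field linearly to the range [0, 1]."""
    return (field - field.min()) / (field.max() - field.min())


elevation_z = zscore(elevation)
elevation_01 = minmax(elevation)
```

The normalized field would then be saved to the static or periodic forcing file in place of the raw values.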

All variable information needs to be added in the data section of the YAML config file:

data:
    # upper-air variables
    variables: ['your_varnames',]
    save_loc: '/your_path/train_data_*.zarr'
    
    # surface variables
    surface_variables: ['your_varnames',]
    save_loc_surface: '/your_path/train_data_*.zarr'
  
    # dynamic forcing variables
    dynamic_forcing_variables: ['your_varnames',]
    save_loc_dynamic_forcing: '/your_path/train_data_*.zarr'
  
    # diagnostic variables
    diagnostic_variables: ['your_varnames',]
    save_loc_diagnostic: '/your_path/train_data_*.zarr'
    
    # periodic forcing variables
    forcing_variables: ['your_varnames',]
    save_loc_forcing: '/your_path/forcing_*.zarr'
    
    # static variables
    static_variables: ['your_varnames',]
    save_loc_static: '/your_path/static_*.zarr'
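
Each variable list in the config above is paired with a file-location key. The sketch below is a small, hypothetical pre-check (not CREDIT's own validation) that flags a variable list without its matching location; it operates on the `data` section once it has been loaded into a Python dict. The variable names `T` and `SP` are placeholders.

```python
# Pairs of (variable-list key, file-location key) taken from the config above.
KEY_PAIRS = [
    ("variables", "save_loc"),
    ("surface_variables", "save_loc_surface"),
    ("dynamic_forcing_variables", "save_loc_dynamic_forcing"),
    ("diagnostic_variables", "save_loc_diagnostic"),
    ("forcing_variables", "save_loc_forcing"),
    ("static_variables", "save_loc_static"),
]


def find_config_problems(data_conf):
    """Return a list of problems found in the `data` section (empty if none)."""
    problems = []
    # Upper-air variables are the only mandatory variable type.
    if not data_conf.get("variables"):
        problems.append("upper-air 'variables' is required")
    for var_key, loc_key in KEY_PAIRS:
        if data_conf.get(var_key) and not data_conf.get(loc_key):
            problems.append(f"'{var_key}' is set but '{loc_key}' is missing")
    return problems


good = {"variables": ["T"], "save_loc": "/your_path/train_data_*.zarr"}
bad = {
    "variables": ["T"],
    "save_loc": "/your_path/train_data_*.zarr",
    "surface_variables": ["SP"],  # no save_loc_surface given
}
good_problems = find_config_problems(good)
bad_problems = find_config_problems(bad)
```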

z-score files

The mean and standard deviation of upper-air, surface, dynamic forcing, and diagnostic variables should be prepared, and the paths to these files listed in the YAML config file:

data:
    mean_path: '/your_path/mean_file.nc'
    std_path: '/your_path/std_file.nc'

Periodic forcing variables and static variables do not need entries in the mean and std files.
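
Z-score normalization maps a value x to (x - mean) / std. The sketch below computes these statistics for a synthetic upper-air field with NumPy, under the assumption that one mean and one standard deviation are kept per vertical level; the array shape and values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical upper-air field with shape (time, level, latitude, longitude).
temperature = rng.normal(loc=250.0, scale=10.0, size=(20, 3, 4, 8))

# Reduce over time, latitude, and longitude; keep the level axis.
mean_per_level = temperature.mean(axis=(0, 2, 3))  # shape (level,)
std_per_level = temperature.std(axis=(0, 2, 3))    # shape (level,)

# Apply the z-score, broadcasting the statistics along the level axis.
normalized = (
    temperature - mean_per_level[None, :, None, None]
) / std_per_level[None, :, None, None]
```

For a surface variable without a level axis, the same reduction over all axes yields a single mean and a single standard deviation. In practice the statistics would be computed over the full training period rather than a single array.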

Mandatory steps

  • The customized dataset must use the coordinate names time, level, latitude, and longitude, in that order.

  • Z-score files must have the coordinate name level.

  • Upper-air variables are mandatory to run CREDIT models. Other variable types are optional.

  • Prepared data must not contain NaN values.

  • Every variable name listed in the config must exist in the corresponding file.

  • The CREDIT data loader supports yearly files only.
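
Several of the requirements above can be checked before training. The following is a minimal sketch using xarray, not CREDIT's own checker: the helper `check_dataset` and the variable names `T`, `SP`, and `Q` are hypothetical, and the dataset is built in memory rather than opened from a file.

```python
import numpy as np
import xarray as xr

EXPECTED_ORDER = ("time", "level", "latitude", "longitude")


def check_dataset(ds, varnames):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    for name in varnames:
        if name not in ds.data_vars:
            problems.append(f"{name}: missing from file")
            continue
        da = ds[name]
        # Dimensions must use the expected names in the expected relative order.
        expected = tuple(dim for dim in EXPECTED_ORDER if dim in da.dims)
        if da.dims != expected:
            problems.append(f"{name}: dims {da.dims} do not follow {EXPECTED_ORDER}")
        if bool(da.isnull().any()):
            problems.append(f"{name}: contains NaN values")
    return problems


# Toy dataset: "T" is valid, "SP" is all NaN, and "Q" is absent.
ds = xr.Dataset(
    {
        "T": (EXPECTED_ORDER, np.zeros((2, 3, 4, 5))),
        "SP": (("time", "latitude", "longitude"), np.full((2, 4, 5), np.nan)),
    }
)
problems = check_dataset(ds, ["T", "SP", "Q"])
```

With real data, `ds` would come from `xr.open_dataset` or `xr.open_zarr` on one of the yearly files.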

Test the validity of your data preparation

Once the data is prepared and registered in the YAML config file, you can run the following code for a quick test:

import yaml

from credit.parser import credit_main_parser, training_data_check, predict_data_check

# load your config
config_name = 'your_path/model.yml'
with open(config_name, 'r') as stream:
    conf = yaml.safe_load(stream)

# parse the config for both training and prediction
conf = credit_main_parser(conf, parse_training=True, parse_predict=True, print_summary=True)

# check the data referenced by the training settings
training_data_check(conf, print_summary=True)

# check the data referenced by the prediction settings
predict_data_check(conf, print_summary=True)