credit.pbs

Attributes

logger

config_file

Functions

launch_script(config_file, script_path[, launch, backend])

Generates and optionally launches a PBS script for a single-node MPI job on Casper.

launch_script_mpi(config_file, script_path[, launch, ...])

Generates and optionally launches a PBS script for a multi-node MPI job.

launch_script_torchrun(config_file, script_path[, ...])

Generates and optionally launches a PBS script using torchrun.

get_num_cpus()

Module Contents

credit.pbs.logger
credit.pbs.launch_script(config_file, script_path, launch=True, backend='nccl')

Generates and optionally launches a PBS script for a single-node MPI job on Casper.

Parameters:
  • config_file (str) – Path to the YAML configuration file.

  • script_path (str) – Path to the script that will be executed by the PBS job.

  • launch (bool, optional) – If True, the PBS job will be submitted to the queue. Defaults to True.

  • backend (str, optional) – Backend for distributed training. Defaults to 'nccl'.
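
The "generate, then optionally submit" behaviour controlled by launch can be sketched as follows. This is a minimal illustration and not the code inside credit.pbs: the helper names build_pbs_script and write_and_maybe_submit, the resource requests, and the training command are all hypothetical. Only the `#PBS` directive syntax and the `qsub` command are standard PBS.

```python
import shutil
import subprocess
import tempfile


def build_pbs_script(directives, command):
    """Render a PBS script: shebang, one '#PBS' line per directive, then the command."""
    lines = ["#!/bin/bash"]
    lines += [f"#PBS -{flag} {value}" for flag, value in directives]
    lines.append(command)
    return "\n".join(lines) + "\n"


def write_and_maybe_submit(script_text, launch=True):
    """Write the script to a temp file; run qsub only if launch is True and qsub exists."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as handle:
        handle.write(script_text)
        path = handle.name
    if launch and shutil.which("qsub") is not None:
        result = subprocess.run(
            ["qsub", path], capture_output=True, text=True, check=True
        )
        return path, result.stdout.strip()  # qsub prints the job ID on success
    return path, None  # script was generated but not submitted


# Hypothetical resource requests and training command, for illustration only.
directives = [
    ("N", "credit_demo"),
    ("l", "select=1:ncpus=8:ngpus=1"),
    ("l", "walltime=01:00:00"),
]
script = build_pbs_script(directives, "python train.py --config config.yml")
path, job_id = write_and_maybe_submit(script, launch=False)  # dry run
print(script)
```

Passing launch=False is a convenient way to inspect the generated script before anything reaches the queue.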

credit.pbs.launch_script_mpi(config_file, script_path, launch=True, backend='nccl')

Generates and optionally launches a PBS script for a multi-node MPI job.

Parameters:
  • config_file (str) – Path to the YAML configuration file.

  • script_path (str) – Path to the script that will be executed by the MPI job.

  • launch (bool, optional) – If True, the PBS job will be submitted to the queue. Defaults to True.

  • backend (str, optional) – Backend for distributed training. Defaults to 'nccl'.

credit.pbs.launch_script_torchrun(config_file, script_path, launch=True, backend='nccl')

Generates and optionally launches a PBS script using torchrun.

Preferred over launch_script_mpi for FSDP2 / v2-parallelism jobs, because torchrun manages rendezvous and sets LOCAL_RANK, RANK, and WORLD_SIZE automatically. Single-node jobs use the c10d rendezvous backend on localhost; multi-node jobs broadcast the head node's IP address for use as the rendezvous endpoint.
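
Because torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE to every worker process, a training script can pick them up from the environment. The sketch below shows one way to do that. The function name read_dist_env is hypothetical and is not part of credit.pbs, and the single-process defaults are an assumption for running without torchrun.

```python
import os


def read_dist_env(environ=None):
    """Return the rank information that torchrun exports, as integers.

    Falls back to single-process values when the variables are absent,
    for example when the script is run directly without torchrun.
    """
    env = os.environ if environ is None else environ
    return {
        "rank": int(env.get("RANK", 0)),              # global process index
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # index within the node
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total process count
    }


# Simulated environment of the fourth process in an 8-process job.
print(read_dist_env({"RANK": "3", "LOCAL_RANK": "1", "WORLD_SIZE": "8"}))
```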

Parameters:
  • config_file (str) – Path to the YAML config file.

  • script_path (str) – Path to the training script (e.g., applications/train_gen2.py).

  • launch (bool, optional) – If True, submit the job with qsub. Defaults to True.

  • backend (str, optional) – torch.distributed backend. Defaults to 'nccl'.
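
The single-node versus multi-node rendezvous choice described above can be illustrated by assembling a torchrun command line. This is a sketch under assumptions, not the command that launch_script_torchrun actually writes: build_torchrun_cmd, the port 29500, the rendezvous ID, and the `-c` script argument are hypothetical. The torchrun options themselves (`--nnodes`, `--nproc-per-node`, `--rdzv-backend`, `--rdzv-endpoint`, `--rdzv-id`) are standard.

```python
def build_torchrun_cmd(script_path, config_file, nnodes, gpus_per_node,
                       head_ip=None, port=29500, rdzv_id="credit"):
    """Assemble a torchrun argument list using the c10d rendezvous backend."""
    if nnodes == 1:
        endpoint = f"localhost:{port}"  # single node: rendezvous on localhost
    elif head_ip is None:
        raise ValueError("multi-node jobs need the head node IP address")
    else:
        endpoint = f"{head_ip}:{port}"  # multi-node: every node contacts the head node
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc-per-node={gpus_per_node}",
        "--rdzv-backend=c10d",
        f"--rdzv-endpoint={endpoint}",
        f"--rdzv-id={rdzv_id}",
        script_path,
        "-c", config_file,  # hypothetical script argument
    ]


single = build_torchrun_cmd("applications/train_gen2.py", "config.yml", 1, 4)
multi = build_torchrun_cmd("applications/train_gen2.py", "config.yml", 2, 4,
                           head_ip="10.0.0.1")
print(" ".join(single))
print(" ".join(multi))
```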

credit.pbs.get_num_cpus()
credit.pbs.config_file = '../config/vit2d.yml'