credit.pbs
Attributes
- logger
- config_file
Functions
- launch_script: Generates and optionally launches a PBS script for a single-node MPI job on Casper.
- launch_script_mpi: Generates and optionally launches a PBS script for a multi-node MPI job.
- launch_script_torchrun: Generates and optionally launches a PBS script using torchrun.
- get_num_cpus
Module Contents
- credit.pbs.logger
- credit.pbs.launch_script(config_file, script_path, launch=True, backend='nccl')
Generates and optionally launches a PBS script for a single-node MPI job on Casper.
- Parameters:
config_file (str) – Path to the YAML configuration file.
script_path (str) – Path to the script that will be executed by the PBS job.
launch (bool, optional) – If True, the PBS job will be submitted to the queue. Defaults to True.
backend (str, optional) – Backend for distributed training. Defaults to 'nccl'.
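The generate-then-optionally-submit pattern described above can be sketched in a few lines. This is a minimal illustration, not the credit.pbs implementation: the helper names build_pbs_script and write_and_maybe_submit, the resource values, the PROJECT_CODE account placeholder, the queue name, and the -c config flag passed to the script are all assumptions made for the example. The generated command is a plain python invocation, whereas the real function is documented as producing an MPI job.

```python
import shutil
import subprocess
from pathlib import Path


def build_pbs_script(config_file, script_path, job_name="credit_job",
                     account="PROJECT_CODE", walltime="01:00:00",
                     ncpus=8, ngpus=1, queue="casper"):
    """Return the text of a single-node PBS script (illustrative layout only)."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",
        f"#PBS -A {account}",          # placeholder project code (assumption)
        f"#PBS -q {queue}",            # queue name is an assumption
        f"#PBS -l select=1:ncpus={ncpus}:ngpus={ngpus}",
        f"#PBS -l walltime={walltime}",
        "#PBS -j oe",                  # merge stdout and stderr into one file
        "",
        # The "-c" flag is a hypothetical way to hand the config to the script.
        f"python {script_path} -c {config_file}",
        "",
    ]
    return "\n".join(lines)


def write_and_maybe_submit(script_text, out_path="launch.sh", launch=True):
    """Write the script to disk; submit it with qsub only when launch is True."""
    path = Path(out_path)
    path.write_text(script_text)
    if not launch:
        return None
    if shutil.which("qsub") is None:
        raise RuntimeError("qsub not found on PATH; is PBS available on this host?")
    result = subprocess.run(["qsub", str(path)],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()      # qsub prints the job identifier
```

Calling write_and_maybe_submit(..., launch=False) writes the script without submitting it, which is a convenient way to inspect the generated directives before queueing a job.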
- credit.pbs.launch_script_mpi(config_file, script_path, launch=True, backend='nccl')
Generates and optionally launches a PBS script for a multi-node MPI job.
- Parameters:
config_file (str) – Path to the YAML configuration file.
script_path (str) – Path to the script that will be executed by the MPI job.
launch (bool, optional) – If True, the PBS job will be submitted to the queue. Defaults to True.
backend (str, optional) – Backend used for distributed training. Defaults to 'nccl'.
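A script started by an MPI launcher generally has to work out its own rank and world size from the launcher's environment. The sketch below shows one common approach under stated assumptions: it checks the torchrun-style names first, then the Open MPI and PMI names. The helper resolve_rank_and_world_size is hypothetical, and how credit.pbs or the training scripts actually handle this is not stated here.

```python
import os

# Environment variable names checked in order of preference.
# RANK / WORLD_SIZE are the torchrun-style names; the others are set by
# Open MPI and by PMI-based launchers respectively.
_RANK_VARS = ("RANK", "OMPI_COMM_WORLD_RANK", "PMI_RANK")
_SIZE_VARS = ("WORLD_SIZE", "OMPI_COMM_WORLD_SIZE", "PMI_SIZE")


def _first_int(names, env, default):
    """Return the first variable in `names` found in `env`, as an int."""
    for name in names:
        value = env.get(name)
        if value is not None:
            return int(value)
    return default


def resolve_rank_and_world_size(env=None):
    """Return (rank, world_size), falling back to (0, 1) for a serial run."""
    env = os.environ if env is None else env
    return _first_int(_RANK_VARS, env, 0), _first_int(_SIZE_VARS, env, 1)
```

Passing a plain dict instead of os.environ makes the lookup easy to exercise without launching a job.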
- credit.pbs.launch_script_torchrun(config_file, script_path, launch=True, backend='nccl')
Generates and optionally launches a PBS script using torchrun.
Preferred over launch_script_mpi for FSDP2 / v2-parallelism jobs, because torchrun manages rendezvous and sets LOCAL_RANK, RANK, and WORLD_SIZE automatically. Single-node jobs use the c10d rendezvous backend with a localhost endpoint; multi-node jobs broadcast the head node's IP address for use as the rendezvous endpoint.
- Parameters:
config_file (str) – Path to the YAML config file.
script_path (str) – Path to the training script (e.g., applications/train_gen2.py).
launch (bool, optional) – If True, submit the generated script with qsub. Defaults to True.
backend (str, optional) – torch.distributed backend. Defaults to 'nccl'.
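The single-node versus multi-node rendezvous choice described above can be illustrated by building the torchrun command line. This is a sketch under assumptions, not the command credit.pbs generates: the helper build_torchrun_cmd, the port 29500, the rendezvous id, and the -c config flag are hypothetical. The torchrun options themselves (--nnodes, --nproc_per_node, --rdzv_backend, --rdzv_endpoint, --rdzv_id) are standard.

```python
def build_torchrun_cmd(config_file, script_path, nnodes=1, nproc_per_node=1,
                       head_addr="localhost", port=29500, job_id="credit"):
    """Return a torchrun argument list using the c10d rendezvous backend."""
    if nnodes > 1 and head_addr == "localhost":
        # Every node must reach the same endpoint, so localhost cannot work
        # once the job spans more than one node.
        raise ValueError("multi-node jobs need the head node's address")
    if nnodes == 1:
        head_addr = "localhost"       # single-node: rendezvous stays local
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={head_addr}:{port}",
        f"--rdzv_id={job_id}",
        script_path,
        "-c", config_file,            # hypothetical config flag
    ]
```

Each worker started this way can read LOCAL_RANK, RANK, and WORLD_SIZE from its environment, since torchrun sets them before the script runs.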
- credit.pbs.get_num_cpus()
- credit.pbs.config_file = '../config/vit2d.yml'