credit.pbs
==========

.. py:module:: credit.pbs


Attributes
----------

.. autoapisummary::

   credit.pbs.logger
   credit.pbs.config_file


Functions
---------

.. autoapisummary::

   credit.pbs.launch_script
   credit.pbs.launch_script_mpi
   credit.pbs.launch_script_torchrun
   credit.pbs.get_num_cpus


Module Contents
---------------

.. py:data:: logger

.. py:function:: launch_script(config_file, script_path, launch=True, backend='nccl')

   Generates and optionally launches a PBS script for a single-node MPI job on Casper.

   :param config_file: Path to the YAML configuration file.
   :type config_file: str
   :param script_path: Path to the script that will be executed by the PBS job.
   :type script_path: str
   :param launch: If True, the PBS job will be submitted to the queue. Defaults to True.
   :type launch: bool, optional
   :param backend: Backend for distributed training. Defaults to 'nccl'.
   :type backend: str, optional
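
The generate-then-optionally-submit behaviour controlled by ``launch`` can be sketched as follows. This is a minimal illustration and not the code in ``credit.pbs``: ``write_pbs_script`` and ``maybe_submit`` are hypothetical helpers, and the resource directives and the ``-c`` config flag are placeholder assumptions rather than values read from a real configuration file.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def write_pbs_script(directives, command, path):
    """Write a batch script made of '#PBS' directive lines followed by one command."""
    lines = ["#!/bin/bash -l"]
    lines += [f"#PBS {directive}" for directive in directives]
    lines += ["", command, ""]
    path = Path(path)
    path.write_text("\n".join(lines))
    return path


def maybe_submit(path, launch=True):
    """Submit the script with qsub when launch is True; otherwise do nothing.

    Returns the text printed by qsub (normally the job ID), or None for a dry run.
    """
    if not launch:
        return None
    if shutil.which("qsub") is None:
        raise RuntimeError("qsub was not found on PATH")
    result = subprocess.run(
        ["qsub", str(path)], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


# Placeholder resource request (assumed values, for illustration only).
example_directives = [
    "-N credit_train",
    "-l select=1:ncpus=8:ngpus=1:mem=128GB",
    "-l walltime=12:00:00",
]

script = write_pbs_script(
    example_directives,
    # The '-c' flag is an assumption about how the script receives its config.
    "python applications/train_gen2.py -c ../config/vit2d.yml",
    Path(tempfile.mkdtemp()) / "launch.sh",
)
job_id = maybe_submit(script, launch=False)  # dry run: nothing is submitted
```

With ``launch=False`` the script is only written to disk, which makes it possible to inspect the generated directives before queueing a job.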


.. py:function:: launch_script_mpi(config_file, script_path, launch=True, backend='nccl')

   Generates and optionally launches a PBS script for a multi-node MPI job.

   :param config_file: Path to the YAML configuration file.
   :type config_file: str
   :param script_path: Path to the script that will be executed by the MPI job.
   :type script_path: str
   :param launch: If True, the PBS job will be submitted to the queue. Defaults to True.
   :type launch: bool, optional
   :param backend: Backend to be used for distributed training (e.g., 'nccl'). Defaults to 'nccl'.
   :type backend: str, optional
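
Unlike torchrun, which exports ``RANK``, ``WORLD_SIZE``, and ``LOCAL_RANK`` for each worker, an MPI launcher exposes rank information under its own variable names. The sketch below shows one way a training script could normalise these. ``resolve_rank_info`` is a hypothetical helper, the variable names are those commonly set by Open MPI and PMI-based launchers, and nothing here is confirmed to be how CREDIT's scripts read their rank.

```python
import os


def resolve_rank_info(env=None):
    """Return rank, world size, and local rank from launcher environment variables.

    Checks torchrun-style names first, then Open MPI, then PMI-style names.
    Falls back to a single-process layout when none are present.
    """
    env = os.environ if env is None else env
    candidates = [
        # (rank, world size, local rank)
        ("RANK", "WORLD_SIZE", "LOCAL_RANK"),  # torchrun
        (
            "OMPI_COMM_WORLD_RANK",
            "OMPI_COMM_WORLD_SIZE",
            "OMPI_COMM_WORLD_LOCAL_RANK",
        ),  # Open MPI
        ("PMI_RANK", "PMI_SIZE", "PMI_LOCAL_RANK"),  # PMI-based launchers (assumed)
    ]
    for rank_key, size_key, local_key in candidates:
        if rank_key in env and size_key in env:
            return {
                "rank": int(env[rank_key]),
                "world_size": int(env[size_key]),
                # Some launchers do not export a local rank; default to 0.
                "local_rank": int(env.get(local_key, 0)),
            }
    return {"rank": 0, "world_size": 1, "local_rank": 0}


info = resolve_rank_info(
    {
        "OMPI_COMM_WORLD_RANK": "3",
        "OMPI_COMM_WORLD_SIZE": "8",
        "OMPI_COMM_WORLD_LOCAL_RANK": "1",
    }
)
```

Passing the environment as a dictionary keeps the helper easy to test without launching an MPI job.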


.. py:function:: launch_script_torchrun(config_file, script_path, launch=True, backend='nccl')

   Generates and optionally launches a PBS script using torchrun.

   Preferred over ``launch_script_mpi`` for FSDP2 / v2-parallelism jobs because
   torchrun manages rendezvous and sets ``LOCAL_RANK``, ``RANK``, and
   ``WORLD_SIZE`` automatically. Single-node jobs use the c10d rendezvous
   backend on localhost; multi-node jobs broadcast the head node's IP address
   for use as the rendezvous endpoint.

   :param config_file: Path to the YAML config file.
   :type config_file: str
   :param script_path: Path to the training script (e.g., applications/train_gen2.py).
   :type script_path: str
   :param launch: If True, submit the generated script with ``qsub``. Defaults to True.
   :type launch: bool, optional
   :param backend: ``torch.distributed`` backend. Defaults to 'nccl'.
   :type backend: str, optional
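
The difference between the single-node and multi-node rendezvous endpoints can be illustrated with a small command builder. This is a sketch under stated assumptions, not the command that ``launch_script_torchrun`` actually emits: ``build_torchrun_command`` is a hypothetical helper, the port ``29500`` and the ``-c`` config flag are assumed, and the c10d backend is assumed for the multi-node case as well.

```python
def build_torchrun_command(
    script_path, config_file, nnodes, gpus_per_node, head_node_ip=None, port=29500
):
    """Build a torchrun argument list for a single-node or multi-node job."""
    if nnodes == 1:
        # Single node: every worker can reach the rendezvous on localhost.
        endpoint = f"localhost:{port}"
    else:
        # Multi-node: all nodes must agree on one reachable address.
        if head_node_ip is None:
            raise ValueError("head_node_ip is required when nnodes > 1")
        endpoint = f"{head_node_ip}:{port}"

    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={gpus_per_node}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={endpoint}",
        script_path,
        "-c",  # assumed flag for passing the YAML config to the script
        config_file,
    ]


single = build_torchrun_command(
    "applications/train_gen2.py", "../config/vit2d.yml", nnodes=1, gpus_per_node=4
)
multi = build_torchrun_command(
    "applications/train_gen2.py",
    "../config/vit2d.yml",
    nnodes=2,
    gpus_per_node=4,
    head_node_ip="10.0.0.1",  # placeholder address
)
```

Only the ``--rdzv_endpoint`` value changes between the two cases; torchrun then assigns ``RANK``, ``LOCAL_RANK``, and ``WORLD_SIZE`` to each worker it starts.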


.. py:function:: get_num_cpus()

.. py:data:: config_file
   :value: '../config/vit2d.yml'


