credit.distributed

Functions

`setup`(rank, world_size, mode[, backend])	Initializes the distributed process group.
`get_rank_info`(trainer_mode)	Gets rank and size information for distributed training.
`should_not_checkpoint`(module)
`distributed_model_wrapper`(conf, neural_network, device)	Wraps the neural network model for distributed training.

Module Contents

credit.distributed.setup(rank, world_size, mode, backend='nccl')

Initializes the distributed process group.

Parameters:

rank (int) – The rank of the process within the distributed setup.
world_size (int) – The total number of processes in the distributed setup.
mode (str) – The mode of operation (e.g., ‘fsdp’, ‘ddp’).
backend (str, optional) – The backend to use for distributed training. Defaults to ‘nccl’.
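
As a rough illustration of how rank, world_size, and backend typically feed into process-group initialization, the sketch below builds keyword arguments for PyTorch's `torch.distributed.init_process_group`. The helper names `process_group_kwargs` and `init_group` are hypothetical and are not part of credit.distributed. The `mode` argument is left out because this page does not say how `setup` uses it.

```python
def process_group_kwargs(rank, world_size, backend="nccl"):
    """Validate the arguments and build keyword arguments for init_process_group."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return {"backend": backend, "rank": rank, "world_size": world_size}


def init_group(rank, world_size, backend="nccl"):
    """Hypothetical stand-in for setup(); not the library's implementation."""
    # Imported lazily so the validation helper above runs without PyTorch installed.
    import torch.distributed as dist

    # With the default env:// rendezvous, MASTER_ADDR and MASTER_PORT must be set
    # in the environment before this call.
    dist.init_process_group(**process_group_kwargs(rank, world_size, backend))
```

On a machine without GPUs, passing backend='gloo' instead of the default 'nccl' is the usual choice, since NCCL requires CUDA devices.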

credit.distributed.get_rank_info(trainer_mode)

Gets rank and size information for distributed training.

Parameters:

trainer_mode (str) – The mode of training (e.g., ‘fsdp’, ‘ddp’).

Returns:

A tuple containing LOCAL_RANK (int), WORLD_RANK (int), and WORLD_SIZE (int).

Return type:

tuple
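
To make the shape of the returned tuple concrete, here is a minimal sketch that reads the three values from the environment variables that launchers such as torchrun export (LOCAL_RANK, RANK, WORLD_SIZE). The function name `rank_info_from_env` and its single-process defaults are assumptions for illustration; the actual `get_rank_info` may obtain these values differently, for example depending on `trainer_mode` or the launcher in use.

```python
import os


def rank_info_from_env(env=None):
    """Return (local_rank, world_rank, world_size) from torchrun-style variables.

    Falls back to single-process values (0, 0, 1) when a variable is missing.
    """
    env = os.environ if env is None else env
    local_rank = int(env.get("LOCAL_RANK", 0))   # index of this process on its node
    world_rank = int(env.get("RANK", 0))         # index of this process across all nodes
    world_size = int(env.get("WORLD_SIZE", 1))   # total number of processes
    return local_rank, world_rank, world_size


# Example: the second process on the second node of a 2-node, 4-GPU-per-node job.
local_rank, world_rank, world_size = rank_info_from_env(
    {"LOCAL_RANK": "1", "RANK": "5", "WORLD_SIZE": "8"}
)
```

LOCAL_RANK is usually what selects the GPU on a node, while WORLD_RANK identifies the process globally, for instance to decide which process writes logs or checkpoints.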

credit.distributed.should_not_checkpoint(module)

credit.distributed.distributed_model_wrapper(conf, neural_network, device)

Wraps the neural network model for distributed training.

Supports modes: ‘fsdp’, ‘ddp’, ‘domain_parallel’, ‘fsdp+domain_parallel’.

For domain_parallel modes, the model’s Conv2d/Conv3d/ConvTranspose2d/GroupNorm layers are replaced with domain-parallel equivalents that handle halo exchange and distributed normalization. For fsdp+domain_parallel, domain-parallel conversion is done first, then FSDP wrapping uses the data-parallel subgroup.
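
The halo exchange mentioned above can be illustrated with a toy example. The sketch below is one-dimensional, pure Python, and runs in a single process: it splits a signal into shards, pads each shard with edge values borrowed from its neighbours, and checks that the per-shard convolutions reproduce the unsharded result. It is not the library's implementation; the real domain-parallel layers exchange halos between processes, and all function names here are hypothetical.

```python
def conv1d_valid(signal, kernel):
    """Cross-correlate signal with kernel without padding ('valid' mode)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def conv1d_same(signal, kernel):
    """Zero-pad so the output length equals the input length (odd-length kernel)."""
    halo = len(kernel) // 2
    padded = [0] * halo + list(signal) + [0] * halo
    return conv1d_valid(padded, kernel)


def sharded_conv1d_same(signal, kernel, num_shards):
    """Same result as conv1d_same, computed shard by shard with halo padding."""
    halo = len(kernel) // 2
    size = len(signal) // num_shards
    if size * num_shards != len(signal) or size < halo:
        raise ValueError("signal must split evenly into shards at least halo wide")
    shards = [list(signal[r * size:(r + 1) * size]) for r in range(num_shards)]

    out = []
    for r, shard in enumerate(shards):
        # "Halo exchange": borrow edge values from the neighbouring shards.
        # At the outer boundary of the domain, fall back to zero padding.
        left = shards[r - 1][size - halo:] if r > 0 else [0] * halo
        right = shards[r + 1][:halo] if r < num_shards - 1 else [0] * halo
        out.extend(conv1d_valid(left + shard + right, kernel))
    return out


signal = [1, 2, 3, 4, 5, 6, 7, 8]
kernel = [1, 2, 1]
assert sharded_conv1d_same(signal, kernel, 2) == conv1d_same(signal, kernel)
```

Without the borrowed halo values, each shard would see zeros at its interior edges and the outputs near shard boundaries would differ from the unsharded convolution.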

Parameters:

conf (dict) – The configuration dictionary containing training settings.
neural_network (torch.nn.Module) – The neural network model to be wrapped.
device (torch.device) – The device on which the model will be trained.

Returns:

The wrapped model ready for distributed training.

Return type:

torch.nn.Module