credit.distributed#

Functions#

setup(rank, world_size, mode[, backend])

Initializes the distributed process group.

get_rank_info(trainer_mode)

Gets rank and size information for distributed training.

should_not_checkpoint(module)

distributed_model_wrapper(conf, neural_network, device)

Wraps the neural network model for distributed training.

Module Contents#

credit.distributed.setup(rank, world_size, mode, backend='nccl')#

Initializes the distributed process group.

Parameters:
  • rank (int) – The rank of the process within the distributed setup.

  • world_size (int) – The total number of processes in the distributed setup.

  • mode (str) – The mode of operation (e.g., ‘fsdp’, ‘ddp’).

  • backend (str, optional) – The backend to use for distributed training. Defaults to ‘nccl’.
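As a rough illustration of what initializing a process group involves, the sketch below prepares the environment variables that PyTorch's default `env://` rendezvous reads and then calls `torch.distributed.init_process_group`. It is not the source of `credit.distributed.setup`. The helper names `prepare_rendezvous_env` and `setup_sketch`, the default address, and the default port are hypothetical. The `mode` argument is left out because this page does not say how `setup` uses it.

```python
import os


def prepare_rendezvous_env(rank, world_size, master_addr="127.0.0.1", master_port=29500):
    """Set the variables read by torch.distributed's env:// init method.

    Hypothetical helper: the address and port defaults are assumptions
    suitable only for a single-node test run.
    """
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside [0, {world_size})")
    # Keep any address/port already exported by a launcher such as torchrun.
    os.environ.setdefault("MASTER_ADDR", master_addr)
    os.environ.setdefault("MASTER_PORT", str(master_port))
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)


def setup_sketch(rank, world_size, backend="nccl"):
    """Hypothetical stand-in for a process-group initializer."""
    # Imported lazily so the helper above can be used without PyTorch installed.
    import torch.distributed as dist

    prepare_rendezvous_env(rank, world_size)
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
```

The ‘nccl’ backend requires CUDA devices; on a CPU-only machine, passing `backend="gloo"` to a call like `setup_sketch(0, 1, backend="gloo")` is the usual substitute.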

credit.distributed.get_rank_info(trainer_mode)#

Gets rank and size information for distributed training.

Parameters:
  • trainer_mode (str) – The mode of training (e.g., ‘fsdp’, ‘ddp’).

Returns:

A tuple containing LOCAL_RANK (int), WORLD_RANK (int), and WORLD_SIZE (int).

Return type:

tuple
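For orientation, here is a minimal sketch of how a (LOCAL_RANK, WORLD_RANK, WORLD_SIZE) tuple can be derived when a job is launched with `torchrun`, which exports `LOCAL_RANK`, `RANK`, and `WORLD_SIZE` to each process. The function name `rank_info_from_env` is hypothetical, and the single-process defaults are an assumption. The actual `get_rank_info` may consult other sources (for example MPI or scheduler variables) depending on `trainer_mode`; this page does not say.

```python
import os


def rank_info_from_env(env=None):
    """Return (local_rank, world_rank, world_size) from torchrun-style variables.

    Hypothetical helper: falls back to a single-process layout (0, 0, 1)
    when the variables are absent.
    """
    env = os.environ if env is None else env
    local_rank = int(env.get("LOCAL_RANK", 0))   # index of this process on its node
    world_rank = int(env.get("RANK", 0))         # index of this process across all nodes
    world_size = int(env.get("WORLD_SIZE", 1))   # total number of processes
    return local_rank, world_rank, world_size


if __name__ == "__main__":
    local_rank, world_rank, world_size = rank_info_from_env()
    print(f"local={local_rank} rank={world_rank} size={world_size}")
```

Passing a plain dict as `env` makes the helper easy to exercise without touching the real process environment.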

credit.distributed.should_not_checkpoint(module)#

credit.distributed.distributed_model_wrapper(conf, neural_network, device)#

Wraps the neural network model for distributed training.

Supports modes: ‘fsdp’, ‘ddp’, ‘domain_parallel’, ‘fsdp+domain_parallel’.

For the domain_parallel modes, the model’s Conv2d, Conv3d, ConvTranspose2d, and GroupNorm layers are replaced with domain-parallel equivalents that handle halo exchange and distributed normalization. For fsdp+domain_parallel, the domain-parallel conversion runs first, and FSDP then wraps the converted model using the data-parallel subgroup.

Parameters:
  • conf (dict) – The configuration dictionary containing training settings.

  • neural_network (torch.nn.Module) – The neural network model to be wrapped.

  • device (torch.device) – The device on which the model will be trained.

Returns:

The wrapped model ready for distributed training.

Return type:

torch.nn.Module
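The ordering described above for the combined mode can be sketched as a small dispatch table. This is an illustration only, not the wrapper's implementation. The function name `wrapping_steps` and the step labels are hypothetical, and the use of a "default_group" for plain ‘fsdp’ and ‘ddp’ is an assumption, since this page only states the process group for ‘fsdp+domain_parallel’.

```python
SUPPORTED_MODES = ("fsdp", "ddp", "domain_parallel", "fsdp+domain_parallel")


def wrapping_steps(mode):
    """Return the ordered, human-readable steps implied by a mode string.

    Hypothetical helper: the step labels are placeholders, not real API calls.
    """
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"unsupported mode: {mode!r}")

    parts = mode.split("+")
    uses_domain_parallel = "domain_parallel" in parts
    steps = []

    # Domain-parallel layer conversion always comes before any wrapping.
    if uses_domain_parallel:
        steps.append("convert_layers_to_domain_parallel")

    if "fsdp" in parts:
        # Combined mode shards over the data-parallel subgroup (stated above);
        # the group used for plain FSDP is an assumption.
        group = "data_parallel_subgroup" if uses_domain_parallel else "default_group"
        steps.append(f"wrap_with_fsdp({group})")

    if "ddp" in parts:
        steps.append("wrap_with_ddp(default_group)")

    return steps


if __name__ == "__main__":
    for mode in SUPPORTED_MODES:
        print(mode, "->", wrapping_steps(mode))
```

Splitting the mode string on `+` keeps the combined case from needing its own branch: each component contributes its step, and the conversion step is simply emitted first.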