credit.distributed
Functions
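| setup(rank, world_size, mode, backend='nccl') | Initializes the distributed process group. |
| get_rank_info(trainer_mode) | Gets rank and size information for distributed training. |
| should_not_checkpoint(module) | |
| distributed_model_wrapper(conf, neural_network, device) | Wraps the neural network model for distributed training. |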
Module Contents
- credit.distributed.setup(rank, world_size, mode, backend='nccl')
Initializes the distributed process group.
- Parameters:
rank (int) – The rank of the process within the distributed setup.
world_size (int) – The total number of processes in the distributed setup.
mode (str) – The mode of operation (e.g., ‘fsdp’, ‘ddp’).
backend (str, optional) – The backend to use for distributed training. Defaults to ‘nccl’.
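The parameters above correspond closely to the keyword arguments of `torch.distributed.init_process_group`. The following is a minimal sketch of that mapping, not the actual body of `setup`: the helper name `build_init_kwargs`, the argument validation, and the `MASTER_ADDR`/`MASTER_PORT` fallbacks are all assumptions made for illustration, and `mode` is left out because the source does not say how it is used.

```python
import os


def build_init_kwargs(rank, world_size, backend="nccl"):
    """Hypothetical helper (not part of credit.distributed): collect the
    keyword arguments for torch.distributed.init_process_group."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside [0, {world_size})")
    # The default env:// rendezvous reads these two variables. The fallback
    # values are assumptions that only make sense for a single-node run.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    return {"backend": backend, "rank": rank, "world_size": world_size}


if __name__ == "__main__":
    kwargs = build_init_kwargs(rank=0, world_size=1, backend="gloo")
    print(kwargs)
    # With PyTorch installed, the process group could then be created with:
    #   import torch.distributed as dist
    #   dist.init_process_group(**kwargs)
```

The `'nccl'` default targets GPU training; on a CPU-only machine `'gloo'` is the usual substitute.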
- credit.distributed.get_rank_info(trainer_mode)
Gets rank and size information for distributed training.
- Parameters:
trainer_mode (str) – The mode of training (e.g., ‘fsdp’, ‘ddp’).
- Returns:
A tuple containing LOCAL_RANK (int), WORLD_RANK (int), and WORLD_SIZE (int).
- Return type:
tuple
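To make the shape of the returned tuple concrete, here is a sketch that derives the same three values from the environment variables that `torchrun` sets (`LOCAL_RANK`, `RANK`, `WORLD_SIZE`). The function name `rank_info_from_env` and the single-process defaults are hypothetical; the real `get_rank_info` may consult other sources, such as MPI or scheduler variables, depending on `trainer_mode`.

```python
import os


def rank_info_from_env(env=None):
    """Hypothetical sketch (not the credit implementation): read
    (LOCAL_RANK, WORLD_RANK, WORLD_SIZE) from torchrun-style variables."""
    env = os.environ if env is None else env
    # Fall back to a single-process layout when a variable is unset.
    local_rank = int(env.get("LOCAL_RANK", 0))
    world_rank = int(env.get("RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return local_rank, world_rank, world_size


# Example: the second GPU on a node, sixth process overall, eight processes.
local_rank, world_rank, world_size = rank_info_from_env(
    {"LOCAL_RANK": "1", "RANK": "5", "WORLD_SIZE": "8"}
)
print(local_rank, world_rank, world_size)  # prints: 1 5 8
```

Passing a plain dictionary instead of `os.environ` keeps the sketch easy to exercise without launching multiple processes.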
- credit.distributed.should_not_checkpoint(module)
- credit.distributed.distributed_model_wrapper(conf, neural_network, device)
Wraps the neural network model for distributed training.
Supports modes: ‘fsdp’, ‘ddp’, ‘domain_parallel’, ‘fsdp+domain_parallel’.
For domain_parallel modes, the model’s Conv2d/Conv3d/ConvTranspose2d/GroupNorm layers are replaced with domain-parallel equivalents that handle halo exchange and distributed normalization. For fsdp+domain_parallel, domain-parallel conversion is done first, then FSDP wrapping uses the data-parallel subgroup.
- Parameters:
conf (dict) – The configuration dictionary containing training settings.
neural_network (torch.nn.Module) – The neural network model to be wrapped.
device (torch.device) – The device on which the model will be trained.
- Returns:
The wrapped model ready for distributed training.
- Return type:
torch.nn.Module
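The ordering described for the four supported modes can be summarized in a small sketch. It only models the sequence of steps as labels and wraps no model; the function name `wrapping_steps`, the step labels, and the use of a default process group for plain `'fsdp'` are assumptions for illustration, not identifiers or behavior confirmed for `credit.distributed`.

```python
def wrapping_steps(mode):
    """Hypothetical sketch: ordered wrapping steps implied by a mode string.
    The step names are illustrative labels, not credit.distributed functions."""
    supported = {"fsdp", "ddp", "domain_parallel", "fsdp+domain_parallel"}
    if mode not in supported:
        raise ValueError(f"unsupported mode: {mode!r}")
    parts = mode.split("+")
    steps = []
    # Domain-parallel layer replacement always happens before any wrapping.
    if "domain_parallel" in parts:
        steps.append("replace_layers_with_domain_parallel")
    if "fsdp" in parts:
        # In the combined mode, FSDP shards over the data-parallel subgroup.
        # Using the default group for plain 'fsdp' is an assumption.
        if "domain_parallel" in parts:
            group = "data_parallel_subgroup"
        else:
            group = "default_group"
        steps.append(f"wrap_fsdp[{group}]")
    if "ddp" in parts:
        steps.append("wrap_ddp")
    return steps


for mode in ("ddp", "fsdp", "domain_parallel", "fsdp+domain_parallel"):
    print(mode, "->", wrapping_steps(mode))
```

For `'fsdp+domain_parallel'` the sketch yields the layer replacement step followed by the FSDP step on the data-parallel subgroup, matching the order stated in the description.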