credit.domain_parallel.manager#
Domain Parallel Manager - process group creation and coordination.
Creates a 2D logical mesh of (data_parallel, domain_parallel) process groups from a flat world of GPUs. Domain-parallel ranks share the same data sample but hold different spatial shards. Data-parallel ranks hold the same spatial shard but process different data samples.
- Example with 8 GPUs and domain_parallel_size=2:
domain groups: [0,1], [2,3], [4,5], [6,7]
data-parallel groups: [0,2,4,6], [1,3,5,7]
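The grouping in the example can be reproduced with plain Python. This is a minimal sketch, not the module's actual implementation: the helper name `build_groups` and the divisibility check are assumptions made for illustration, and the layout (contiguous ranks form a domain group, strided ranks form a data-parallel group) is inferred from the 8-GPU example.

```python
def build_groups(world_size, domain_parallel_size):
    """Enumerate rank lists for a (data_parallel, domain_parallel) mesh.

    Hypothetical helper: it mirrors the layout shown in the 8-GPU example
    and does not create any torch.distributed process groups.
    """
    # Assumption: the world must split evenly into domain groups.
    if world_size % domain_parallel_size != 0:
        raise ValueError("world_size must be divisible by domain_parallel_size")

    data_parallel_size = world_size // domain_parallel_size

    # Contiguous ranks share one data sample and hold different spatial shards.
    domain_groups = [
        list(range(g * domain_parallel_size, (g + 1) * domain_parallel_size))
        for g in range(data_parallel_size)
    ]
    # Strided ranks hold the same spatial shard and process different samples.
    data_parallel_groups = [
        list(range(r, world_size, domain_parallel_size))
        for r in range(domain_parallel_size)
    ]
    return domain_groups, data_parallel_groups
```

Calling `build_groups(8, 2)` yields the two lists of groups given in the example.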
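Attributes#
logger |
_MANAGER |
Classes#
DomainParallelManager | Manages process groups for domain parallelism.
Functions#
initialize_domain_parallel(world_size, domain_parallel_size, shard_dim=-2) | Initialize the global DomainParallelManager singleton.
get_domain_parallel_manager() | Get the global DomainParallelManager singleton.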
Module Contents#
- credit.domain_parallel.manager.logger#
- credit.domain_parallel.manager._MANAGER = None#
- class credit.domain_parallel.manager.DomainParallelManager(world_size, domain_parallel_size, shard_dim=-2)#
Manages process groups for domain parallelism.
- Parameters:
world_size – Total number of GPUs.
domain_parallel_size – Number of GPUs per domain-parallel group.
shard_dim – Which spatial dimension to shard. -2 means latitude (H) in a (B, C, H, W) tensor.
- world_size#
- domain_parallel_size#
- data_parallel_size#
- shard_dim = -2#
- _domain_group_idx#
- _domain_rank#
- _dp_rank#
- _domain_group = None#
- _dp_group = None#
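The private index attributes can be read as coordinates of a global rank in the 2D mesh. The sketch below shows one way to derive them, assuming the contiguous layout from the 8-GPU example. The function name `rank_coordinates` is hypothetical, and the real class may compute these values differently.

```python
def rank_coordinates(rank, domain_parallel_size):
    """Map a global rank to (domain_group_idx, domain_rank, dp_rank).

    Hypothetical helper based on the layout in the 8-GPU example:
    domain groups are contiguous blocks of domain_parallel_size ranks.
    """
    # Which domain group this rank belongs to, e.g. rank 5 -> group [4, 5] -> 2.
    domain_group_idx = rank // domain_parallel_size
    # Position inside the domain group, e.g. rank 5 -> 1.
    domain_rank = rank % domain_parallel_size
    # Position inside the data-parallel group. With this layout it equals the
    # domain group index, e.g. rank 5 in [1, 3, 5, 7] -> 2.
    dp_rank = domain_group_idx
    return domain_group_idx, domain_rank, dp_rank
```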
- property domain_group#
Process group for domain-parallel communication (halo exchange, reductions).
- property data_parallel_group#
Process group for data-parallel communication (gradient sync).
- property domain_rank#
Rank within the domain-parallel group (0 to domain_parallel_size-1).
- property domain_world_size#
Number of ranks in the domain-parallel group.
- property dp_rank#
Rank within the data-parallel group (0 to data_parallel_size-1).
- property dp_world_size#
Number of ranks in the data-parallel group.
- property is_first_domain_rank#
True if this is the first rank in its domain group (north edge).
- property is_last_domain_rank#
True if this is the last rank in its domain group (south edge).
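To illustrate how the domain rank relates to the north and south edges, the sketch below computes which latitude rows a domain rank would hold when H is sharded along `shard_dim=-2`. It assumes an even split of H across the domain group and rows ordered north to south. Neither assumption is confirmed by this page, and the helper names `shard_rows` and `edge_flags` are hypothetical.

```python
def shard_rows(height, domain_rank, domain_world_size):
    """Return the (start, stop) row range held by one domain rank.

    Assumption: H divides evenly across the domain group. How the real
    module treats a remainder is not documented here.
    """
    if height % domain_world_size != 0:
        raise ValueError("height must be divisible by domain_world_size in this sketch")
    rows_per_rank = height // domain_world_size
    start = domain_rank * rows_per_rank
    return start, start + rows_per_rank


def edge_flags(domain_rank, domain_world_size):
    """Return (is_first_domain_rank, is_last_domain_rank) for a domain rank."""
    return domain_rank == 0, domain_rank == domain_world_size - 1
```

With 192 latitude rows and 4 domain ranks, rank 0 would hold rows 0 to 47 (north edge) and rank 3 rows 144 to 191 (south edge). A domain group of size 1 is both first and last.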
- neighbor_ranks()#
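Returns the global ranks (prev_rank, next_rank) of this rank's neighbors within its domain group, for halo exchange.
An entry is None where no neighbor exists, that is, prev_rank on the first domain rank and next_rank on the last.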
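The neighbor lookup can be sketched as a standalone function. This is an illustration under the assumption that domain groups are contiguous blocks of global ranks, as in the 8-GPU example. The real method takes no arguments and reads the manager's own state, so the signature below is hypothetical.

```python
def neighbor_ranks(rank, domain_parallel_size):
    """Return (prev_rank, next_rank) as global ranks, with None at the edges.

    Hypothetical standalone version: assumes neighbors inside a domain
    group are adjacent global ranks.
    """
    domain_rank = rank % domain_parallel_size
    # No previous neighbor on the first domain rank (north edge).
    prev_rank = rank - 1 if domain_rank > 0 else None
    # No next neighbor on the last domain rank (south edge).
    next_rank = rank + 1 if domain_rank < domain_parallel_size - 1 else None
    return prev_rank, next_rank
```

Halo exchange never crosses a domain group boundary in this sketch: with `domain_parallel_size=4`, rank 3 has no next neighbor even though global rank 4 exists.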
- credit.domain_parallel.manager.initialize_domain_parallel(world_size, domain_parallel_size, shard_dim=-2)#
Initialize the global DomainParallelManager singleton.
- Parameters:
world_size – Total number of GPUs.
domain_parallel_size – Number of GPUs per domain group.
shard_dim – Spatial dimension to shard. -2 selects latitude (H) in a (B, C, H, W) tensor.
- Returns:
DomainParallelManager instance.
- credit.domain_parallel.manager.get_domain_parallel_manager()#
Get the global DomainParallelManager singleton.
- Returns:
DomainParallelManager or None if not initialized.
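The initialize/get pair follows a module-level singleton pattern. The sketch below shows that pattern in isolation with a stand-in class. `_StubManager`, `initialize`, and `get_manager` are hypothetical names, no process groups are created, and what the real function does on a second initialization is not stated on this page.

```python
_MANAGER = None  # module-level singleton slot, None until initialized


class _StubManager:
    """Stand-in for DomainParallelManager that only stores its arguments."""

    def __init__(self, world_size, domain_parallel_size, shard_dim=-2):
        self.world_size = world_size
        self.domain_parallel_size = domain_parallel_size
        self.data_parallel_size = world_size // domain_parallel_size
        self.shard_dim = shard_dim


def initialize(world_size, domain_parallel_size, shard_dim=-2):
    """Create the singleton and return it."""
    global _MANAGER
    # Assumption: a repeated call simply replaces the previous instance.
    _MANAGER = _StubManager(world_size, domain_parallel_size, shard_dim)
    return _MANAGER


def get_manager():
    """Return the singleton, or None if initialize() has not been called."""
    return _MANAGER
```

Because the getter can return None, callers that may run without domain parallelism should check the result before using it.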