credit.domain_parallel.manager#

Domain Parallel Manager - process group creation and coordination.

Creates a 2D logical mesh of (data_parallel, domain_parallel) process groups from a flat world of GPUs. Domain-parallel ranks share the same data sample but hold different spatial shards. Data-parallel ranks hold the same spatial shard but process different data samples.

Example with 8 GPUs and domain_parallel_size=2:

domain groups: [0,1], [2,3], [4,5], [6,7]
data-parallel groups: [0,2,4,6], [1,3,5,7]
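The grouping in this example is what you get when domain groups are contiguous blocks of ranks and data-parallel groups are strided across those blocks. The plain-Python sketch below reproduces the listed groups under that reading. `build_groups` is a hypothetical helper, not part of this module, and the divisibility check is an assumption about how an invalid configuration would be handled.

```python
def build_groups(world_size, domain_parallel_size):
    """Return (domain_groups, data_parallel_groups) as lists of global ranks."""
    # Assumption: the world must split evenly into domain groups.
    if world_size % domain_parallel_size != 0:
        raise ValueError("world_size must be divisible by domain_parallel_size")
    data_parallel_size = world_size // domain_parallel_size

    # Domain groups: contiguous blocks of domain_parallel_size ranks.
    domain_groups = [
        list(range(g * domain_parallel_size, (g + 1) * domain_parallel_size))
        for g in range(data_parallel_size)
    ]
    # Data-parallel groups: ranks with the same position inside their block.
    data_parallel_groups = [
        list(range(r, world_size, domain_parallel_size))
        for r in range(domain_parallel_size)
    ]
    return domain_groups, data_parallel_groups


domain_groups, data_parallel_groups = build_groups(8, 2)
print(domain_groups)         # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(data_parallel_groups)  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

In a real job each list of ranks would be turned into a communicator; this sketch covers only the rank arithmetic.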

Attributes#

logger

_MANAGER

Classes#

DomainParallelManager

Manages process groups for domain parallelism.

Functions#

initialize_domain_parallel(world_size, ...[, shard_dim])

Initialize the global DomainParallelManager singleton.

get_domain_parallel_manager()

Get the global DomainParallelManager singleton.

Module Contents#

credit.domain_parallel.manager.logger#
credit.domain_parallel.manager._MANAGER = None#
class credit.domain_parallel.manager.DomainParallelManager(world_size, domain_parallel_size, shard_dim=-2)#

Manages process groups for domain parallelism.

Parameters:
  • world_size – Total number of GPUs.

  • domain_parallel_size – Number of GPUs per domain-parallel group.

  • shard_dim – Which spatial dimension to shard. -2 means latitude (H) in a (B, C, H, W) tensor.
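To make the meaning of `shard_dim=-2` concrete, the sketch below computes the local tensor shape each domain rank would hold when the H axis of a (B, C, H, W) shape is split across a domain group. `shard_bounds` and `local_shape` are hypothetical helpers. The contiguous split, with any remainder rows going to the lowest ranks, is an assumption; the module may divide the axis differently.

```python
def shard_bounds(length, num_shards, shard_index):
    """Return the [start, stop) range of one contiguous shard along an axis."""
    # Assumption: remainder rows are handed to the first `extra` shards.
    base, extra = divmod(length, num_shards)
    start = shard_index * base + min(shard_index, extra)
    stop = start + base + (1 if shard_index < extra else 0)
    return start, stop


def local_shape(global_shape, shard_dim, num_shards, shard_index):
    """Shape of the local shard when `shard_dim` is split across `num_shards`."""
    start, stop = shard_bounds(global_shape[shard_dim], num_shards, shard_index)
    shape = list(global_shape)
    shape[shard_dim] = stop - start  # negative indices work as in tensors
    return tuple(shape)


# (B, C, H, W) = (1, 3, 10, 8), sharded along H (dim -2) over 2 domain ranks.
print(local_shape((1, 3, 10, 8), -2, 2, 0))  # (1, 3, 5, 8)
print(local_shape((1, 3, 10, 8), -2, 2, 1))  # (1, 3, 5, 8)
```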

world_size#
domain_parallel_size#
data_parallel_size#
shard_dim = -2#
_domain_group_idx#
_domain_rank#
_dp_rank#
_domain_group = None#
_dp_group = None#
property domain_group#

Process group for domain-parallel communication (halo exchange, reductions).

property data_parallel_group#

Process group for data-parallel communication (gradient sync).

property domain_rank#

Rank within the domain-parallel group (0 to domain_parallel_size-1).

property domain_world_size#

Number of ranks in the domain-parallel group.

property dp_rank#

Rank within the data-parallel group.

property dp_world_size#

Number of ranks in the data-parallel group.

property is_first_domain_rank#

True if this is the first rank in its domain group (north edge).

property is_last_domain_rank#

True if this is the last rank in its domain group (south edge).
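The rank properties above can be related to a global rank with simple integer arithmetic. The sketch below assumes the contiguous layout from the module example (domain groups [0,1], [2,3], ...); `rank_coordinates` is a hypothetical helper that mirrors the property names and is not the class's actual implementation.

```python
def rank_coordinates(rank, world_size, domain_parallel_size):
    """Map a global rank to the values the manager properties would report."""
    data_parallel_size = world_size // domain_parallel_size
    # Assumption: domain groups are contiguous blocks of global ranks.
    domain_group_idx = rank // domain_parallel_size
    domain_rank = rank % domain_parallel_size
    return {
        "domain_rank": domain_rank,
        "domain_world_size": domain_parallel_size,
        # Position inside the strided data-parallel group equals the block index.
        "dp_rank": domain_group_idx,
        "dp_world_size": data_parallel_size,
        "is_first_domain_rank": domain_rank == 0,
        "is_last_domain_rank": domain_rank == domain_parallel_size - 1,
    }


coords = rank_coordinates(rank=5, world_size=8, domain_parallel_size=2)
print(coords["domain_rank"], coords["dp_rank"])  # 1 2
print(coords["is_first_domain_rank"], coords["is_last_domain_rank"])  # False True
```

With 8 GPUs and `domain_parallel_size=2`, global rank 5 sits in domain group [4,5] and in data-parallel group [1,3,5,7], which matches the values printed here.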

neighbor_ranks()#

Returns the global ranks (prev_rank, next_rank) of this rank's neighbors for halo exchange.

An entry is None where no neighbor exists, that is, at the edges of the domain group.
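A minimal sketch of the neighbor lookup, assuming contiguous domain groups and no wraparound between the first and last rank of a group (consistent with the None entries at edges). The free function below is a hypothetical stand-in for the method and takes the global rank explicitly.

```python
def neighbor_ranks(rank, domain_parallel_size):
    """Return (prev_rank, next_rank) as global ranks, None at a group edge."""
    domain_rank = rank % domain_parallel_size
    # Assumption: neighbors within a domain group are adjacent global ranks.
    prev_rank = rank - 1 if domain_rank > 0 else None
    next_rank = rank + 1 if domain_rank < domain_parallel_size - 1 else None
    return prev_rank, next_rank


# With domain_parallel_size=4 the domain groups are [0,1,2,3] and [4,5,6,7].
print(neighbor_ranks(4, 4))  # (None, 5)
print(neighbor_ranks(5, 4))  # (4, 6)
print(neighbor_ranks(7, 4))  # (6, None)
```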

credit.domain_parallel.manager.initialize_domain_parallel(world_size, domain_parallel_size, shard_dim=-2)#

Initialize the global DomainParallelManager singleton.

Parameters:
  • world_size – Total number of GPUs.

  • domain_parallel_size – Number of GPUs per domain-parallel group.

  • shard_dim – Which spatial dimension to shard. -2 means latitude (H) in a (B, C, H, W) tensor.

Returns:

DomainParallelManager instance.

credit.domain_parallel.manager.get_domain_parallel_manager()#

Get the global DomainParallelManager singleton.

Returns:

The DomainParallelManager instance, or None if initialize_domain_parallel has not been called.
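The two functions follow a module-level singleton pattern built around `_MANAGER`. The sketch below illustrates that pattern in isolation. `StubManager` is a hypothetical stand-in that creates no process groups, and overwriting an existing instance on a second `initialize_domain_parallel` call is an assumption, since the behavior on re-initialization is not documented here.

```python
_MANAGER = None  # module-level slot, None until initialized


class StubManager:
    """Hypothetical stand-in: stores the arguments, creates no process groups."""

    def __init__(self, world_size, domain_parallel_size, shard_dim=-2):
        self.world_size = world_size
        self.domain_parallel_size = domain_parallel_size
        self.data_parallel_size = world_size // domain_parallel_size
        self.shard_dim = shard_dim


def initialize_domain_parallel(world_size, domain_parallel_size, shard_dim=-2):
    """Create the manager, store it in the module-level slot, and return it."""
    global _MANAGER
    # Assumption: a repeated call replaces the previous instance.
    _MANAGER = StubManager(world_size, domain_parallel_size, shard_dim)
    return _MANAGER


def get_domain_parallel_manager():
    """Return the stored manager, or None if none has been created yet."""
    return _MANAGER


before = get_domain_parallel_manager()   # None: nothing initialized yet
manager = initialize_domain_parallel(world_size=8, domain_parallel_size=2)
after = get_domain_parallel_manager()    # the same object as `manager`
print(before is None, after is manager, manager.data_parallel_size)  # True True 4
```

Because the getter can return None, callers that may run before initialization should check the result before using it.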