credit.parallel#

CREDIT v2 parallelism package.

Provides FSDP2, Tensor Parallelism (TP), and integration with Negin’s domain parallelism — all composed via PyTorch DeviceMesh.

Config block (trainer.parallelism):

data: fsdp2 | ddp | none (data-parallel mode)
tensor: int >= 1 (TP degree; 1 = disabled)
domain: int >= 1 (spatial domain shards; 1 = disabled)

Total GPUs = dp_size × tensor × domain

where dp_size = world_size // (tensor × domain), and world_size must be divisible by tensor × domain.
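As a worked instance of the sizing rule above, the following sketch computes dp_size for two layouts. The helper name dp_size_for is hypothetical and not part of credit.parallel; it only mirrors the stated arithmetic.

```python
def dp_size_for(world_size: int, tensor: int = 1, domain: int = 1) -> int:
    """Return the data-parallel degree left after TP and domain sharding.

    Hypothetical helper: mirrors dp_size = world_size // (tensor * domain).
    """
    return world_size // (tensor * domain)


# 16 GPUs, TP degree 2, 2 spatial domain shards: 4-way data parallel.
print(dp_size_for(16, tensor=2, domain=2))  # 4
# 8 GPUs with TP and domain parallelism disabled: all 8 ranks are data parallel.
print(dp_size_for(8))  # 8
```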

Usage (called from distributed_model_wrapper_gen2):

mesh, submeshes = build_device_mesh(conf["trainer"]["parallelism"])
if submeshes.get("tp"):
    model = apply_tensor_parallel(model, submeshes["tp"])
if submeshes.get("domain"):
    model = apply_domain_parallel(model, submeshes["domain"])
if conf["trainer"]["parallelism"]["data"] == "fsdp2":
    model = apply_fsdp2(model, submeshes.get("dp"), conf)
elif conf["trainer"]["parallelism"]["data"] == "ddp":
    model = apply_ddp(model, submeshes.get("dp"))

Submodules#

Functions#

build_device_mesh(parallelism_conf[, device])

Build a DeviceMesh from a parallelism config block.

apply_fsdp2(→ torch.nn.Module)

Apply FSDP2 to model using the data-parallel submesh.

apply_tensor_parallel(→ torch.nn.Module)

Walk model and apply TP to all blocks that declare _tp_col/_tp_row.

get_domain_manager(model)

get_raw_model(model)

shard_spatial(tensor, manager)

unpad_shard_interp(y_pred, padding_opt, manager, ...)

sync_domain_gradients(model, manager)

Average gradients across the domain-parallel group.

Package Contents#

credit.parallel.build_device_mesh(parallelism_conf: dict, device: str = 'cuda')#

Build a DeviceMesh from a parallelism config block.

Parameters:
  • parallelism_conf – dict with keys:
      data (str): "fsdp2" | "ddp" | "none"
      tensor (int): TP degree, >= 1
      domain (int): domain-parallel degree, >= 1

  • device – "cuda" (default) or "cpu" for tests

Returns:

(mesh, submeshes), where:

mesh: DeviceMesh (or None if no parallelism)
submeshes: dict mapping dim name -> submesh (or None if single-dim)

Keys present: "dp" if dp > 1, "tp" if tp > 1, "domain" if domain > 1

Return type:

tuple

Raises:

ValueError – if world_size is not divisible by tensor * domain.
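The divisibility check and the "keys present" rule can be sketched in plain Python, without torch. plan_mesh_dims is a hypothetical name, the defaults of 1 for missing keys are an assumption, and the data mode is ignored here; how build_device_mesh orders or names its mesh dimensions internally is not stated on this page.

```python
def plan_mesh_dims(parallelism_conf: dict, world_size: int) -> dict:
    """Return {dim_name: size} for every parallel dimension larger than 1.

    Hypothetical sketch; not the credit.parallel implementation.
    """
    tensor = int(parallelism_conf.get("tensor", 1))  # assumed default
    domain = int(parallelism_conf.get("domain", 1))  # assumed default
    if tensor < 1 or domain < 1:
        raise ValueError("tensor and domain must be >= 1")
    if world_size % (tensor * domain) != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"tensor * domain = {tensor * domain}"
        )
    dp = world_size // (tensor * domain)
    sizes = {"dp": dp, "tp": tensor, "domain": domain}  # assumed ordering
    # Only dimensions with more than one rank get a submesh entry.
    return {name: size for name, size in sizes.items() if size > 1}


conf = {"data": "fsdp2", "tensor": 2, "domain": 1}
print(plan_mesh_dims(conf, world_size=8))  # {'dp': 4, 'tp': 2}
```

With world_size=6 and tensor=4 the same call raises ValueError, matching the documented failure mode.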

credit.parallel.apply_fsdp2(model: torch.nn.Module, dp_mesh, conf: dict) torch.nn.Module#

Apply FSDP2 to model using the data-parallel submesh.

Shards Transformer and UpBlock/UpBlockPS submodules first, then wraps the whole model.

Parameters:
  • model – Raw (or TP-converted) model.

  • dp_mesh – 1-D DeviceMesh for the data-parallel dimension. Pass None to shard over the default global mesh.

  • conf – Full training config dict (reads trainer.amp for mp_policy).

Returns:

model with fully_shard applied (in-place, returns same object).

credit.parallel.apply_tensor_parallel(model: torch.nn.Module, tp_mesh) torch.nn.Module#

Walk model and apply TP to all blocks that declare _tp_col/_tp_row.

Any nn.Module subclass can opt in by setting two class attributes:

class MyBlock(nn.Module):
    _tp_col = "proj_up"   # dotted path to the column-parallel layer
    _tp_row = "proj_out"  # dotted path to the row-parallel layer

Paths may address layers inside Sequentials (e.g. "layers.1"). Supported layer types: nn.Conv2d (1×1 only) and nn.Linear.
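To make the dotted-path convention concrete, here is a minimal pure-Python resolver. resolve_dotted and the stand-in classes are hypothetical. For real nn.Module trees, torch.nn.Module.get_submodule accepts the same kind of dotted path; whether apply_tensor_parallel uses it internally is not stated here.

```python
from functools import reduce


def resolve_dotted(root, path: str):
    """Follow a dotted path such as "layers.1" from root to a nested object.

    Numeric components index into list-like containers; other components
    are attribute lookups. Hypothetical helper for illustration only.
    """
    def step(obj, name):
        if name.isdigit() and not hasattr(obj, name):
            return obj[int(name)]
        return getattr(obj, name)

    return reduce(step, path.split("."), root)


class Layer:  # stand-in for nn.Linear / nn.Conv2d
    def __init__(self, label):
        self.label = label


class MyBlock:  # stand-in for an nn.Module that opts in to TP
    _tp_col = "layers.0"
    _tp_row = "layers.1"

    def __init__(self):
        # A plain list plays the role of nn.Sequential here.
        self.layers = [Layer("up"), Layer("out")]


block = MyBlock()
print(resolve_dotted(block, block._tp_col).label)  # up
print(resolve_dotted(block, block._tp_row).label)  # out
```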

Converts in-place. Safe to call before apply_fsdp2.

Parameters:
  • model – The model to convert.

  • tp_mesh – 1-D DeviceMesh for the tensor-parallel dimension.

Returns:

model (same object, modified in-place).
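For intuition about why a block names one column-parallel and one row-parallel layer, the sketch below shows the standard pairing in plain Python. Splitting the first weight by output columns and the second by input rows lets each shard compute a partial result, and summing the partials reproduces the unsharded output. This illustrates the general technique only, with a TP degree of 2 and no nonlinearity; that CREDIT's converted layers follow exactly this scheme is an assumption.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Elementwise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


x = [[1.0, 2.0]]                      # one sample, 2 features
w_up = [[1.0, 0.0, 2.0, 1.0],         # 2 x 4 weight of the first layer
        [0.0, 1.0, 1.0, 3.0]]
w_out = [[1.0], [2.0], [0.0], [1.0]]  # 4 x 1 weight of the second layer

# Unsharded reference: x @ w_up @ w_out
reference = matmul(matmul(x, w_up), w_out)

# "Rank 0" owns columns 0-1 of w_up and rows 0-1 of w_out; "rank 1" the rest.
partials = []
for cols in (slice(0, 2), slice(2, 4)):
    w_up_shard = [row[cols] for row in w_up]  # column-parallel split
    w_out_shard = w_out[cols]                 # row-parallel split
    partials.append(matmul(matmul(x, w_up_shard), w_out_shard))

# Summing the partial outputs stands in for the all-reduce across TP ranks.
combined = add(partials[0], partials[1])
print(reference, combined)  # [[12.0]] [[12.0]]
```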

credit.parallel.get_domain_manager(model)#
credit.parallel.get_raw_model(model)#
credit.parallel.shard_spatial(tensor, manager)#
credit.parallel.unpad_shard_interp(y_pred, padding_opt, manager, image_h, image_w)#
credit.parallel.sync_domain_gradients(model, manager)#

Average gradients across the domain-parallel group.

See credit.parallel.collectives.allreduce_grads_avg for the bucketing and DTensor handling.
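As a rough picture of bucketed gradient averaging, the plain-Python simulation below flattens each rank's gradients into one buffer, averages the buffers elementwise across ranks (the role an all-reduce plays), and unpacks the result into per-parameter shapes. Everything here is a hypothetical stand-in: it does not reproduce allreduce_grads_avg, makes no torch.distributed calls, and ignores DTensor handling.

```python
def average_bucketed(grads_per_rank):
    """Average gradients across simulated ranks using one flat bucket per rank.

    grads_per_rank: list (one entry per rank) of lists of 1-D gradient lists.
    Returns the averaged gradients in the same per-parameter layout.
    """
    sizes = [len(g) for g in grads_per_rank[0]]
    # Flatten every rank's gradients into a single bucket.
    buckets = [[v for g in grads for v in g] for grads in grads_per_rank]
    n_ranks = len(buckets)
    # Elementwise mean across ranks: sum (the all-reduce), then divide by group size.
    mean_bucket = [sum(vals) / n_ranks for vals in zip(*buckets)]
    # Unflatten the bucket back into per-parameter gradients.
    out, start = [], 0
    for size in sizes:
        out.append(mean_bucket[start:start + size])
        start += size
    return out


rank0 = [[1.0, 3.0], [2.0]]  # gradients for two parameters on rank 0
rank1 = [[3.0, 5.0], [6.0]]  # the same parameters on rank 1
print(average_bucketed([rank0, rank1]))  # [[2.0, 4.0], [4.0]]
```

Packing many small gradients into one buffer trades a little copying for fewer collective calls, which is the usual motivation for bucketing.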