credit.parallel

CREDIT v2 parallelism package.

Provides FSDP2, Tensor Parallelism (TP), and integration with Negin’s domain parallelism — all composed via PyTorch DeviceMesh.

Config block (trainer.parallelism):

    data: fsdp2 | ddp | none (data-parallel mode)
    tensor: int >= 1 (TP degree; 1 = disabled)
    domain: int >= 1 (spatial domain shards; 1 = disabled)

Total GPUs = dp_size × tensor × domain

where dp_size = world_size // (tensor × domain)
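
The arithmetic above can be sketched in a few lines of Python. `parallel_sizes` is a hypothetical helper for illustration, not part of credit.parallel; it only mirrors the documented formula.

```python
def parallel_sizes(world_size: int, tensor: int = 1, domain: int = 1) -> dict:
    """Return the size of each parallel dimension for a given world size.

    Hypothetical helper (not a credit.parallel function); it applies
    dp_size = world_size // (tensor * domain) from the text above.
    """
    dp_size = world_size // (tensor * domain)
    return {"dp": dp_size, "tensor": tensor, "domain": domain}


# 16 GPUs with TP degree 2 and 4 spatial domain shards
# leave 2 data-parallel replicas.
sizes = parallel_sizes(world_size=16, tensor=2, domain=4)
total_gpus = sizes["dp"] * sizes["tensor"] * sizes["domain"]
print(sizes, total_gpus)
```

The product of the three sizes only recovers the world size when tensor × domain divides it evenly, which is the condition build_device_mesh checks.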

Usage (called from distributed_model_wrapper_gen2):

    mesh, submeshes = build_device_mesh(conf["trainer"]["parallelism"])
    if submeshes.get("tp"):
        model = apply_tensor_parallel(model, submeshes["tp"])
    if submeshes.get("domain"):
        model = apply_domain_parallel(model, submeshes["domain"])
    if conf["trainer"]["parallelism"]["data"] == "fsdp2":
        model = apply_fsdp2(model, submeshes.get("dp"), conf)
    elif conf["trainer"]["parallelism"]["data"] == "ddp":
        model = apply_ddp(model, submeshes.get("dp"))

Submodules

Functions

`build_device_mesh`(parallelism_conf[, device])	Build a DeviceMesh from a parallelism config block.
`apply_fsdp2`(→ torch.nn.Module)	Apply FSDP2 to model using the data-parallel submesh.
`apply_tensor_parallel`(→ torch.nn.Module)	Walk model and apply TP to all blocks that declare `_tp_col`/`_tp_row`.
`get_domain_manager`(model)
`get_raw_model`(model)
`shard_spatial`(tensor, manager)
`unpad_shard_interp`(y_pred, padding_opt, manager, ...)
`sync_domain_gradients`(model, manager)	Average gradients across the domain-parallel group.

Package Contents

credit.parallel.build_device_mesh(parallelism_conf: dict, device: str = 'cuda')

Build a DeviceMesh from a parallelism config block.

Parameters:

parallelism_conf – dict with keys:
    data (str): "fsdp2" | "ddp" | "none"
    tensor (int): TP degree, >= 1
    domain (int): domain-parallel degree, >= 1
device – "cuda" (default) or "cpu" for tests

Returns:

A tuple (mesh, submeshes):

mesh – DeviceMesh (or None if no parallelism).
submeshes – dict mapping dim name -> submesh (or None if single-dim). Keys present: "dp" if dp > 1, "tp" if tp > 1, "domain" if domain > 1.

Raises:

ValueError – if world_size is not divisible by tensor * domain.
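
As a rough illustration of the return contract and the divisibility check, the sketch below predicts which submesh keys would be present for a given world size. `expected_submesh_keys` is a hypothetical helper, and two things in it are assumptions: missing `tensor`/`domain` keys default to 1, and the `data` key is ignored. It does not build a real DeviceMesh.

```python
def expected_submesh_keys(world_size: int, parallelism_conf: dict) -> list:
    """Predict the keys of the `submeshes` dict for a config block.

    Hypothetical helper, not part of credit.parallel. Assumes missing
    `tensor`/`domain` entries mean 1 and ignores the `data` key.
    """
    tensor = int(parallelism_conf.get("tensor", 1))
    domain = int(parallelism_conf.get("domain", 1))
    if tensor < 1 or domain < 1:
        raise ValueError("tensor and domain must both be >= 1")
    if world_size % (tensor * domain) != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"tensor * domain = {tensor * domain}"
        )
    dp = world_size // (tensor * domain)
    sizes = {"dp": dp, "tp": tensor, "domain": domain}
    # Per the documented contract, a key is present only if its size exceeds 1.
    return [name for name, size in sizes.items() if size > 1]


keys_dp_tp = expected_submesh_keys(8, {"data": "fsdp2", "tensor": 2, "domain": 1})
keys_tp_domain = expected_submesh_keys(8, {"data": "none", "tensor": 2, "domain": 4})
print(keys_dp_tp, keys_tp_domain)
```

Because absent dimensions are simply missing from the dict, callers would use `submeshes.get(...)` rather than indexing, as the usage snippet at the top of this page does.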

credit.parallel.apply_fsdp2(model: torch.nn.Module, dp_mesh, conf: dict) → torch.nn.Module

Apply FSDP2 to model using the data-parallel submesh.

Shards Transformer and UpBlock/UpBlockPS submodules first, then wraps the whole model.

Parameters:

model – Raw (or TP-converted) model.
dp_mesh – 1-D DeviceMesh for the data-parallel dimension. Pass None to shard over the default global mesh.
conf – Full training config dict (reads trainer.amp for mp_policy).

Returns:

model with fully_shard applied (in-place, returns same object).
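
The "submodules first, then the whole model" order can be pictured with the pure-Python sketch below. Everything in it is a stand-in: `Module` imitates only the `named_modules` traversal of torch.nn.Module, the class names are matched by string, and `fsdp2_wrap_order` is a hypothetical helper that lists the order in which a sharding call might be applied. It is not the implementation of apply_fsdp2 and calls no PyTorch API.

```python
class Module:
    """Minimal stand-in for torch.nn.Module: holds named child modules."""

    def __init__(self, **children):
        self.children = children

    def named_modules(self, prefix=""):
        """Yield (dotted_name, module) pairs in pre-order, root first."""
        yield prefix, self
        for name, child in self.children.items():
            child_name = f"{prefix}.{name}" if prefix else name
            yield from child.named_modules(child_name)


class Transformer(Module):
    """Stand-in for a block type that is sharded individually."""


class UpBlock(Module):
    """Stand-in for a block type that is sharded individually."""


class Conv(Module):
    """Stand-in for a layer that is left to the root wrap."""


# Block types named in the description above, matched by class name.
SHARD_FIRST = {"Transformer", "UpBlock", "UpBlockPS"}


def fsdp2_wrap_order(model):
    """List submodule names to shard first, ending with the root model."""
    inner = [
        name
        for name, module in model.named_modules()
        if name and type(module).__name__ in SHARD_FIRST
    ]
    # Reversing a pre-order walk puts every descendant before its ancestor.
    inner.reverse()
    return inner + ["<root>"]


model = Module(encoder=Transformer(), up1=UpBlock(), up2=UpBlock(), head=Conv())
order = fsdp2_wrap_order(model)
print(order)
```

In this toy model `head` is never sharded on its own; its parameters would be picked up when the root model is wrapped last.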

credit.parallel.apply_tensor_parallel(model: torch.nn.Module, tp_mesh) → torch.nn.Module

Walk model and apply TP to all blocks that declare _tp_col/_tp_row.

Any nn.Module subclass can opt in by setting two class attributes:

class MyBlock(nn.Module):
    _tp_col = "proj_up"   # dotted path to the column-parallel layer
    _tp_row = "proj_out"  # dotted path to the row-parallel layer

Paths may address layers inside Sequentials (e.g. "layers.1"). Supported layer types: nn.Conv2d (1×1 only) and nn.Linear.
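
A dotted path such as "layers.1" can be resolved one segment at a time. The sketch below shows the idea on plain Python objects; `resolve_dotted_path` is a hypothetical helper, and how apply_tensor_parallel actually resolves paths is not stated here. For real modules, `torch.nn.Module.get_submodule` accepts the same dotted form.

```python
from types import SimpleNamespace


def resolve_dotted_path(root, path: str):
    """Follow a dotted path such as "layers.1" from root to a nested object.

    Hypothetical helper for illustration. Numeric segments index into
    lists or tuples; every other segment is an attribute lookup.
    """
    target = root
    for part in path.split("."):
        if part.isdigit() and isinstance(target, (list, tuple)):
            target = target[int(part)]
        else:
            target = getattr(target, part)
    return target


# A toy "block": strings stand in for layers, a list stands in for a Sequential.
block = SimpleNamespace(proj_up="linear_a", layers=["norm", "linear_b"])
col_layer = resolve_dotted_path(block, "proj_up")
row_layer = resolve_dotted_path(block, "layers.1")
print(col_layer, row_layer)
```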

Converts in-place. Safe to call before apply_fsdp2.

Parameters:

model – The model to convert.
tp_mesh – 1-D DeviceMesh for the tensor-parallel dimension.

Returns:

model (same object, modified in-place).
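
Why a block declares one column-parallel and one row-parallel layer can be seen from the matrix arithmetic. The sketch below uses plain Python lists rather than PyTorch and assumes a two-layer block with an elementwise activation in between: the first weight is split by columns, the second by rows, and the partial outputs are summed (the sum stands in for an all-reduce across the TP group). All function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Elementwise ReLU, so it can be applied to each column shard separately."""
    return [[max(0.0, v) for v in row] for row in m]


def split_columns(m, parts):
    """Split a matrix into `parts` equal groups of columns."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` equal groups of rows."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def tp_forward(x, w_up, w_out, tp):
    """Simulate a column-parallel then row-parallel forward pass on `tp` ranks."""
    partials = []
    for up_shard, out_shard in zip(split_columns(w_up, tp), split_rows(w_out, tp)):
        hidden = relu(matmul(x, up_shard))          # local slice of the hidden state
        partials.append(matmul(hidden, out_shard))  # partial sum of the output
    total = partials[0]
    for partial in partials[1:]:
        # Summing the partial outputs plays the role of the all-reduce.
        total = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(total, partial)]
    return total


x = [[1.0, -2.0], [0.5, 3.0]]
w_up = [[1.0, 0.0, -1.0, 2.0], [0.5, 1.0, 1.0, -1.0]]
w_out = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [1.0, 1.0]]

reference = matmul(relu(matmul(x, w_up)), w_out)
sharded = tp_forward(x, w_up, w_out, tp=2)
print(reference == sharded)
```

Pairing the two layers this way needs only one reduction per block, since the hidden state never has to be gathered between them.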

credit.parallel.get_domain_manager(model)

credit.parallel.get_raw_model(model)

credit.parallel.shard_spatial(tensor, manager)

credit.parallel.unpad_shard_interp(y_pred, padding_opt, manager, image_h, image_w)

credit.parallel.sync_domain_gradients(model, manager)

Average gradients across the domain-parallel group.

See credit.parallel.collectives.allreduce_grads_avg for the bucketing and DTensor handling.
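
The effect of averaging gradients across a group can be simulated without any distributed backend. In the sketch below each inner list is one rank's gradient for the same parameters; `allreduce_average` is a hypothetical stand-in that sums across ranks and divides by the group size. The bucketing and DTensor handling mentioned above are not reproduced.

```python
def allreduce_average(per_rank_grads):
    """Simulate an averaging all-reduce over per-rank gradient vectors.

    Hypothetical stand-in for illustration; returns one copy of the
    averaged gradient per rank, as every rank would hold after the call.
    """
    group_size = len(per_rank_grads)
    summed = [sum(values) for values in zip(*per_rank_grads)]
    averaged = [value / group_size for value in summed]
    return [list(averaged) for _ in range(group_size)]


# Four domain-parallel ranks, each with a gradient from its own spatial shard.
grads = [[1.0, 2.0], [3.0, 6.0], [5.0, 1.0], [7.0, 3.0]]
synced = allreduce_average(grads)
print(synced[0])
```

After the call every rank holds the same gradient, so the replicated parameters stay identical after the optimizer step.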