credit.parallel#
CREDIT v2 parallelism package.
Provides FSDP2, Tensor Parallelism (TP), and integration with Negin’s domain parallelism — all composed via PyTorch DeviceMesh.
- Config block (trainer.parallelism):
data: fsdp2 | ddp | none — data-parallel mode tensor: int >= 1 — TP degree (1 = disabled) domain: int >= 1 — spatial domain shards (1 = disabled)
- Total GPUs = dp_size × tensor × domain
where dp_size = world_size // (tensor × domain)
- Usage (called from distributed_model_wrapper_gen2):
mesh, submeshes = build_device_mesh(conf[“trainer”][“parallelism”]) if submeshes.get(“tp”):
model = apply_tensor_parallel(model, submeshes[“tp”])
- if submeshes.get(“domain”):
model = apply_domain_parallel(model, submeshes[“domain”])
- if conf[“trainer”][“parallelism”][“data”] == “fsdp2”:
model = apply_fsdp2(model, submeshes.get(“dp”), conf)
- elif conf[“trainer”][“parallelism”][“data”] == “ddp”:
model = apply_ddp(model, submeshes.get(“dp”))
Submodules#
Functions#
|
Build a DeviceMesh from a parallelism config block. |
|
Apply FSDP2 to model using the data-parallel submesh. |
|
Walk model and apply TP to all blocks that declare |
|
|
|
|
|
|
|
|
|
Average gradients across the domain-parallel group. |
Package Contents#
- credit.parallel.build_device_mesh(parallelism_conf: dict, device: str = 'cuda')#
Build a DeviceMesh from a parallelism config block.
- Parameters:
parallelism_conf – dict with keys: data (str): “fsdp2” | “ddp” | “none” tensor (int): TP degree, >= 1 domain (int): domain parallel degree, >= 1
device – “cuda” (default) or “cpu” for tests
- Returns:
DeviceMesh (or None if no parallelism) submeshes: dict mapping dim name -> submesh (or None if single-dim)
Keys present: “dp” if dp > 1, “tp” if tp > 1, “domain” if domain > 1
- Return type:
mesh
- Raises:
ValueError – if world_size is not divisible by tensor * domain.
- credit.parallel.apply_fsdp2(model: torch.nn.Module, dp_mesh, conf: dict) torch.nn.Module#
Apply FSDP2 to model using the data-parallel submesh.
Shards Transformer and UpBlock/UpBlockPS submodules first, then wraps the whole model.
- Parameters:
model – Raw (or TP-converted) model.
dp_mesh – 1-D DeviceMesh for the data-parallel dimension. Pass None to shard over the default global mesh.
conf – Full training config dict (reads trainer.amp for mp_policy).
- Returns:
model with fully_shard applied (in-place, returns same object).
- credit.parallel.apply_tensor_parallel(model: torch.nn.Module, tp_mesh) torch.nn.Module#
Walk model and apply TP to all blocks that declare
_tp_col/_tp_row.Any
nn.Modulesubclass can opt in by setting two class attributes:class MyBlock(nn.Module): _tp_col = "proj_up" # dotted path to the column-parallel layer _tp_row = "proj_out" # dotted path to the row-parallel layer
Paths may address layers inside Sequentials (e.g.
"layers.1"). Supported layer types:nn.Conv2d(1×1 only) andnn.Linear.Converts in-place. Safe to call before apply_fsdp2.
- Parameters:
model – The model to convert.
tp_mesh – 1-D DeviceMesh for the tensor-parallel dimension.
- Returns:
model (same object, modified in-place).
- credit.parallel.get_domain_manager(model)#
- credit.parallel.get_raw_model(model)#
- credit.parallel.shard_spatial(tensor, manager)#
- credit.parallel.unpad_shard_interp(y_pred, padding_opt, manager, image_h, image_w)#
- credit.parallel.sync_domain_gradients(model, manager)#
Average gradients across the domain-parallel group.
See credit.parallel.collectives.allreduce_grads_avg for the bucketing and DTensor handling.