credit.trainers.preflight#

Pre-training checks that run before the epoch loop starts.

The goal is to catch silent hangs and out-of-memory (OOM) conditions early and to emit clear, actionable error messages, rather than letting jobs hang on the cluster for hours.

Public API#

estimate_dataloader_memory_gib(conf) -> float

Pure function. Computes the expected peak DataLoader CPU RAM footprint from the trainer and data config. It performs no I/O, so it can be tested in isolation.

check_dataloader_startup(conf, loader, rank, timeout_s) -> None

Fetches one batch from loader with a timeout. Raises RuntimeError with a user-friendly message if the fetch hangs or if estimated memory looks dangerous.

check_model_gpu_memory(conf, model, optimizer, rank) -> None

Runs a synthetic forward/backward/optimizer step and logs peak VRAM.

Attributes#

Functions#

estimate_dataloader_memory_gib(→ float)

Estimate peak CPU RAM used by the DataLoader (GiB).

_available_ram_gib(→ float)

Return available system RAM in GiB, or 0 if psutil is not installed.

_fetch_one_batch(loader)

Return the first batch from loader, or raise on error.

check_dataloader_startup(→ None)

Run pre-training data loading checks (rank-0 only).

check_model_gpu_memory(→ None)

Run a synthetic forward/backward/optimizer step and log peak VRAM.

Module Contents#

credit.trainers.preflight.logger#
credit.trainers.preflight.estimate_dataloader_memory_gib(conf: dict) → float#

Estimate peak CPU RAM used by the DataLoader (GiB).

Formula:

workers × prefetch_factor × batch_size × sample_bytes

where sample_bytes = H × W × total_channels × 4 (float32). Input and target tensors are counted separately (×2).

Parameters:

conf – Full training config dict.

Returns:

Estimated peak DataLoader RAM in GiB. Returns 0.0 if config is missing required keys (non-fatal — estimation is best-effort).
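As a rough illustration of the formula above, the following is a minimal sketch, not the module's actual implementation. The function name `sketch_estimate_gib` and every config key it reads (`trainer.num_workers`, `trainer.prefetch_factor`, `trainer.batch_size`, `data.height`, `data.width`, `data.total_channels`) are hypothetical, and it assumes the ×2 for input and target is applied once to the whole product.

```python
GIB = 1024 ** 3  # bytes per GiB


def sketch_estimate_gib(conf: dict) -> float:
    """Best-effort DataLoader RAM estimate; returns 0.0 on missing keys."""
    try:
        trainer = conf["trainer"]  # hypothetical key layout
        data = conf["data"]        # hypothetical key layout
        workers = trainer["num_workers"]
        prefetch = trainer["prefetch_factor"]
        batch = trainer["batch_size"]
        height = data["height"]
        width = data["width"]
        channels = data["total_channels"]
    except (KeyError, TypeError):
        # Estimation is best-effort, so a missing key is not an error.
        return 0.0

    sample_bytes = height * width * channels * 4  # float32 = 4 bytes
    # Assumption: input and target are each one sample-sized tensor (x2).
    total_bytes = workers * prefetch * batch * sample_bytes * 2
    return total_bytes / GIB


example_conf = {
    "trainer": {"num_workers": 4, "prefetch_factor": 2, "batch_size": 2},
    "data": {"height": 640, "width": 1280, "total_channels": 100},
}
print(f"{sketch_estimate_gib(example_conf):.2f} GiB")
```

With these example values the estimate is a little under 10 GiB, which shows how quickly the footprint grows as workers and prefetch_factor multiply.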

credit.trainers.preflight._available_ram_gib() → float#

Return available system RAM in GiB, or 0 if psutil is not installed.
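One plausible way to get this behavior is an optional import of psutil with a fallback. The sketch below is an assumption about the approach, not the module's source, and the name `sketch_available_ram_gib` is hypothetical.

```python
def sketch_available_ram_gib() -> float:
    """Return available system RAM in GiB, or 0.0 when psutil is missing."""
    try:
        import psutil  # optional dependency
    except ImportError:
        return 0.0
    # virtual_memory().available is reported in bytes.
    return psutil.virtual_memory().available / 1024 ** 3


print(f"available RAM: {sketch_available_ram_gib():.1f} GiB")
```

A caller should treat a return value of 0 as "unknown" rather than "no memory left", since it only means psutil could not be imported.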

credit.trainers.preflight._fetch_one_batch(loader)#

Return the first batch from loader, or raise on error.

credit.trainers.preflight.check_dataloader_startup(conf: dict, loader, rank: int = 0, timeout_s: float = 300.0) → None#

Run pre-training data loading checks (rank-0 only).

  1. Logs estimated DataLoader memory and warns if it looks dangerous.

  2. Attempts to fetch the first batch within timeout_s seconds. Raises RuntimeError with a clear, actionable message if it hangs.

Parameters:
  • conf – Full training config dict.

  • loader – Training DataLoader.

  • rank – Global rank. Checks only run on rank 0.

  • timeout_s – Seconds to wait for the first batch before failing.
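The timed first-batch fetch described above could be sketched with a background thread, as below. This is only one possible approach under stated assumptions: the names `sketch_fetch_with_timeout` and `SlowLoader` are hypothetical, the error wording is invented for illustration, and the real function may use a different mechanism.

```python
import threading
import time


def sketch_fetch_with_timeout(loader, timeout_s: float):
    """Fetch the first batch from loader, failing if it takes too long."""
    result = {}

    def _worker():
        try:
            result["batch"] = next(iter(loader))
        except Exception as exc:  # hand the error back to the caller
            result["error"] = exc

    # Daemon thread, so a hung fetch cannot block interpreter exit.
    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()
    thread.join(timeout_s)

    if thread.is_alive():
        raise RuntimeError(
            f"First batch not received within {timeout_s:g} s. "
            "Consider checking data paths, worker count, and available RAM."
        )
    if "error" in result:
        raise RuntimeError("Fetching the first batch failed.") from result["error"]
    return result["batch"]


class SlowLoader:
    """Stand-in for a hung DataLoader: sleeps before yielding anything."""

    def __init__(self, delay_s: float):
        self.delay_s = delay_s

    def __iter__(self):
        time.sleep(self.delay_s)
        return iter([0])


print(sketch_fetch_with_timeout([10, 20, 30], timeout_s=1.0))
```

A thread cannot be killed from outside, so this pattern only detects the hang and reports it; the stuck fetch keeps running in the background until the process exits.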

credit.trainers.preflight.check_model_gpu_memory(conf: dict, model, optimizer, rank: int = 0) → None#

Run a synthetic forward/backward/optimizer step and log peak VRAM.

Creates a zero-filled batch of the expected input shape, runs it through the model, backprops, and steps the optimizer. Logs peak VRAM so users can verify their model fits on the target GPU before a long training run.

Input channel count is inferred from the model config:

frames × (channels × levels + surface_channels + input_only_channels)
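A small worked example of the channel formula above, written as a hypothetical helper (`sketch_input_channels` is not part of the module, and the numbers are made up for illustration):

```python
def sketch_input_channels(
    frames: int,
    channels: int,
    levels: int,
    surface_channels: int,
    input_only_channels: int,
) -> int:
    """frames x (channels x levels + surface_channels + input_only_channels)."""
    per_frame = channels * levels + surface_channels + input_only_channels
    return frames * per_frame


# 2 frames, 4 upper-air variables on 16 levels, 7 surface, 3 input-only:
# 2 * (4 * 16 + 7 + 3) = 2 * 74 = 148
print(sketch_input_channels(2, 4, 16, 7, 3))
```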

Skips silently if:
  • rank != 0 (only report from rank 0)

  • CUDA is not available

  • the input channel count cannot be inferred from the config (the inferred count is 0)

  • any exception occurs during the synthetic pass

Parameters:
  • conf – Full training config dict.

  • model – The model (possibly DDP/FSDP wrapped).

  • optimizer – The optimizer (used to test a full optimizer step).

  • rank – Global rank. Check only runs on rank 0.