
How to fix CUDA out of memory during training

torch.cuda.OutOfMemoryError means the model weights, optimizer state, gradients, and activations don't fit in VRAM together. Before renting a bigger GPU, work through the standard ladder of fixes below, in order.
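To get a feel for how much of the budget goes to weights, gradients, and optimizer state before any activations are counted, a back-of-the-envelope estimate helps. The sketch below is only an approximation under stated assumptions: fp32 weights and gradients (4 bytes each per parameter) and an Adam-style optimizer holding two fp32 moments (8 bytes per parameter). The function name and the 7-billion-parameter figure are hypothetical, chosen for illustration.

```python
def static_training_memory_gib(n_params, weight_bytes=4, grad_bytes=4, optimizer_bytes=8):
    """Estimate weights + gradients + optimizer state in GiB (activations excluded)."""
    total_bytes = n_params * (weight_bytes + grad_bytes + optimizer_bytes)
    return total_bytes / 1024 ** 3


# Hypothetical 7-billion-parameter model, full fp32 training with an Adam-style optimizer
full_fp32 = static_training_memory_gib(7_000_000_000)
print(f"static training memory: {full_fp32:.1f} GiB")
```

Under these assumptions the static cost alone is roughly 104 GiB, which is why the later rungs of the ladder (changing the method, or moving to a bigger GPU) exist at all.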

The fix ladder, cheapest first

1. Shrink the batch, keep the math: cut batch_size and raise gradient accumulation steps so the effective batch stays the same. Activation memory scales with batch size — this is the single most common fix.
2. Gradient checkpointing: recompute activations in the backward pass instead of storing them (model.gradient_checkpointing_enable() in Hugging Face). Trades ~20-30% speed for a large memory cut.
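3. Mixed precision: train in bf16/fp16 instead of fp32. With autocast this roughly halves activation memory, while the weights, their gradients, and the optimizer state typically stay in fp32. bf16 needs a GPU generation that supports it (Ampere or newer on NVIDIA).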
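4. Change the method: for fine-tuning, LoRA trains small adapter matrices instead of all weights, which shrinks gradient and optimizer-state memory, and QLoRA additionally quantizes the frozen base model to 4-bit. This is often the difference between fitting on 24 GB and needing 80 GB.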
5. Shorter sequences: truncate or pack sequences — attention activation memory grows with context length.
6. Bigger GPU: if the model genuinely doesn't fit, move up a VRAM class. See how much GPU memory you need.
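Steps 1 and 3 of the ladder can be combined in a single training loop. The following is a minimal sketch, not a drop-in trainer: the helper name train_with_accumulation, the toy linear model, and the random data are all hypothetical, and it defaults to CPU so it runs anywhere. On a GPU you would pass device_type="cuda" and move the model and tensors to the device.

```python
import torch
from torch import nn


def train_with_accumulation(model, optimizer, batches, accum_steps, device_type="cpu"):
    """Run micro-batches, stepping the optimizer once every `accum_steps` batches."""
    model.train()
    optimizer.zero_grad(set_to_none=True)
    steps_taken = 0
    for i, (x, y) in enumerate(batches, start=1):
        # Forward pass under bf16 autocast; the model weights themselves stay fp32
        with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
            pred = model(x)
        # Compute the loss in fp32 for numerical stability
        loss = nn.functional.mse_loss(pred.float(), y)
        # Divide so the accumulated gradient matches the mean over the effective batch
        (loss / accum_steps).backward()
        if i % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
            steps_taken += 1
    return steps_taken


torch.manual_seed(0)
model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# 8 micro-batches of 4 samples, accumulated 4 at a time: effective batch size 16
batches = [(torch.randn(4, 16), torch.randn(4, 1)) for _ in range(8)]
steps = train_with_accumulation(model, optimizer, batches, accum_steps=4)
print("optimizer steps:", steps)
```

Only one micro-batch of activations is alive at a time, so peak activation memory follows the micro-batch size rather than the effective batch size.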

Diagnose before you fix

Check where memory goes before changing anything: watch live VRAM during the run. On VaultLayer, vl gpu-stats <job_id> shows VRAM, utilization, and temperature while training. If usage spikes at the first forward pass, the problem is model plus activations (use the fix ladder above). If it creeps up over steps, you likely have a leak, such as tensors retained in a Python list or a missing detach().
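The same spike-versus-creep question can also be checked from inside the training process using PyTorch's own counters. The sketch below assumes a PyTorch job; the helper name vram_report is hypothetical, and it returns None on machines without CUDA so it is safe to leave in CPU test runs.

```python
import torch


def vram_report(device=0):
    """Return allocated, reserved, and peak VRAM in MiB, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    return {
        "allocated_mib": torch.cuda.memory_allocated(device) / mib,   # live tensors
        "reserved_mib": torch.cuda.memory_reserved(device) / mib,     # held by the caching allocator
        "peak_mib": torch.cuda.max_memory_allocated(device) / mib,    # high-water mark so far
    }


report = vram_report()
print(report)
```

Logging this every few hundred steps is one way to tell the two cases apart: a peak that is set in the first step and then stays flat points at model plus activations, while an allocated figure that keeps rising points at a leak.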

Retry on a bigger GPU in one flag

When the ladder isn't enough, resubmit on a larger class: vl run --gpu A100 python train.py (or H100). Run vl gpus to compare VRAM and current prices, and vl estimate to see cost before submitting.

Frequently asked questions

Why do I get CUDA OOM partway through training instead of at the start?

Usually one of three causes: a memory leak (tensors accumulated with the graph attached; append loss.item(), not loss), batches that vary in sequence length so that an unusually long one exceeds the limit, or evaluation running with a larger batch than training. Watch VRAM over time to tell a spike from a creep.
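The loss.item() advice is easiest to see side by side. This is a small illustration with a hypothetical toy model and hypothetical list names: the tensors in the first list each keep a reference to their autograd graph, while the second list holds plain Python floats.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(8, 1)
x, y = torch.randn(4, 8), torch.randn(4, 1)

leaky_history = []  # tensors: each one keeps its autograd graph alive
safe_history = []   # plain Python floats: no graph, no tensor storage retained

for _ in range(3):
    loss = nn.functional.mse_loss(model(x), y)
    leaky_history.append(loss)         # retains the graph behind this loss
    safe_history.append(loss.item())   # copies out the scalar value only

holds_graph = all(t.grad_fn is not None for t in leaky_history)
all_floats = all(isinstance(v, float) for v in safe_history)
print(holds_graph, all_floats)
```

In a real run the retained graphs include the activations of every logged step, which is what makes VRAM creep upward until the job fails.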

Does gradient accumulation change results?

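With the same effective batch size (micro-batch × accumulation steps) and the loss divided by the number of accumulation steps, the gradients match full-batch training up to floating-point rounding. Layers that depend on per-batch statistics, such as batch norm, still see the smaller micro-batch. It's the standard way to fit large-batch training in small VRAM.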
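This can be checked numerically on a toy problem. The sketch below is an illustration under simple assumptions (a hypothetical linear model, mean-reduced MSE loss, equally sized micro-batches, no batch-dependent layers); grads_for is a hypothetical helper that rebuilds the same model each time so the two runs are comparable.

```python
import torch
from torch import nn

torch.manual_seed(0)
x, y = torch.randn(16, 8), torch.randn(16, 1)


def grads_for(micro_batch_size):
    """Accumulate gradients over the full 16 samples using the given micro-batch size."""
    torch.manual_seed(1)  # identical initial weights on every call
    model = nn.Linear(8, 1)
    accum_steps = x.shape[0] // micro_batch_size
    for xb, yb in zip(x.split(micro_batch_size), y.split(micro_batch_size)):
        loss = nn.functional.mse_loss(model(xb), yb)
        # Scale each micro-batch loss so the sum equals the full-batch mean
        (loss / accum_steps).backward()
    return model.weight.grad.clone()


full = grads_for(16)        # one batch of 16
accumulated = grads_for(4)  # four micro-batches of 4
max_diff = (full - accumulated).abs().max().item()
print("largest gradient difference:", max_diff)
```

The remaining difference comes only from the order of floating-point additions. If the division by accum_steps is left out, the accumulated gradient is accum_steps times too large, which acts like an unintended learning-rate increase.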
