How to fix CUDA out of memory during training
torch.cuda.OutOfMemoryError means the model, optimizer state, gradients, and activations don't fit in VRAM. Before renting a bigger GPU, there's a standard ladder of fixes — try them in this order.
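To see why a model may not fit, the static part of the footprint (weights, gradients, optimizer state) can be estimated from the parameter count alone. The sketch below assumes full fp32 training with Adam, which keeps two fp32 values per parameter. The function name and defaults are illustrative, and activation memory is deliberately left out because it depends on batch size and sequence length.

```python
def static_training_memory_gb(num_params, weight_bytes=4, grad_bytes=4, optimizer_bytes=8):
    """Estimate weights + gradients + optimizer state in GB (1 GB = 1e9 bytes).

    Defaults assume fp32 weights, fp32 gradients, and Adam's two fp32 moments.
    Activations are NOT included.
    """
    bytes_per_param = weight_bytes + grad_bytes + optimizer_bytes
    return num_params * bytes_per_param / 1e9


# A 7-billion-parameter model under the assumptions above.
print(static_training_memory_gb(7_000_000_000))  # 112.0
```

Under these assumptions the static cost is 16 bytes per parameter, so whatever remains of the card's VRAM is the budget the activations have to fit into.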
The fix ladder, cheapest first
- 1. Shrink the batch, keep the math: cut batch_size and raise gradient accumulation steps so the effective batch stays the same. Activation memory scales with batch size, so this is the single most common fix.
- 2. Gradient checkpointing: recompute activations in the backward pass instead of storing them (model.gradient_checkpointing_enable() in Hugging Face). Trades ~20-30% speed for a large memory cut.
- 3. Mixed precision: train in bf16/fp16 instead of fp32. This roughly halves activation memory on modern GPUs; gradient memory shrinks too only when the weights themselves are held in half precision.
- 4. Change the method: for fine-tuning, QLoRA or LoRA slash memory versus a full fine-tune — often the difference between fitting on 24 GB and needing 80 GB.
- 5. Shorter sequences: truncate or pack sequences — attention activation memory grows with context length.
- 6. Bigger GPU: if the model genuinely doesn't fit, move up a VRAM class. See how much GPU memory you need.
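Step 1 of the ladder can be sketched as a small training loop. This is a minimal illustration, not a definitive implementation: it assumes objects that follow PyTorch conventions (loss.backward() adds into stored gradients, optimizer.step() applies them, optimizer.zero_grad() clears them), and the function name, batches, and loss_fn are hypothetical placeholders.

```python
def train_with_accumulation(model, batches, loss_fn, optimizer, accum_steps):
    """Run one pass over `batches`, stepping the optimizer every `accum_steps` micro-batches.

    Returns the number of optimizer steps taken. Micro-batches left over at the
    end (fewer than `accum_steps`) are back-propagated but never applied.
    """
    optimizer_steps = 0
    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(batches, start=1):
        loss = loss_fn(model(inputs), targets)
        # Scale each micro-batch loss so the accumulated gradient matches
        # the mean over the full effective batch.
        (loss / accum_steps).backward()
        if i % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            optimizer_steps += 1
    return optimizer_steps
```

With a micro-batch of 8 and accum_steps=4, the effective batch is 32 while only 8 samples' worth of activations are held in memory at a time.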
Diagnose before you fix
Check where memory goes before changing anything: watch live VRAM during the run. On VaultLayer, vl gpu-stats <job_id> shows VRAM, utilization, and temperature while training. If usage spikes at the first forward pass, the cause is model plus activations (use the fix ladder above). If it creeps up over steps, you likely have a leak, such as tensors retained in a Python list or a missing detach().
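The spike-versus-creep distinction can be made mechanical once you have one VRAM reading per training step, however you collect them (for example from torch.cuda.memory_allocated(), which reports bytes). The helper below is a rough sketch: its name, the warm-up length, and the growth threshold are all assumptions to tune for your own job.

```python
def classify_vram_trace(samples_mb, warmup_steps=2, creep_mb_per_step=1.0):
    """Label per-step VRAM readings (in MB) as 'spike' or 'creep'.

    The first `warmup_steps` readings are skipped, since the initial jump is
    the model and first activations being allocated. After that, sustained
    growth above `creep_mb_per_step` suggests a leak.
    """
    steady = samples_mb[warmup_steps:]
    if len(steady) < 2:
        raise ValueError("need at least two readings after warm-up")
    # Naive endpoint slope; a noisy trace may deserve a proper line fit.
    growth_per_step = (steady[-1] - steady[0]) / (len(steady) - 1)
    return "creep" if growth_per_step > creep_mb_per_step else "spike"


print(classify_vram_trace([2000, 21000, 21010, 21005, 21008]))     # spike
print(classify_vram_trace([2000, 9000, 9100, 9200, 9300, 9400]))   # creep
```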
Retry on a bigger GPU in one flag
When the ladder isn't enough, resubmit on a larger class: vl run --gpu A100 python train.py (or H100). Run vl gpus to compare VRAM and current prices, and vl estimate to see cost before submitting.
Frequently asked questions
Why do I get CUDA OOM partway through training instead of at the start?
Usually a memory leak (tensors accumulated with the graph attached — append loss.item(), not loss), a batch that varies in sequence length, or evaluation running with a larger batch than training. Watch VRAM over time to tell a spike from a creep.
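To illustrate the variable-length case: when sequences are padded to the longest one in each batch, a single long example inflates the whole batch. The sketch below uses made-up sequence lengths and a hypothetical helper name, and simply counts padded tokens per batch as a proxy for activation memory.

```python
def padded_tokens_per_batch(seq_lengths, batch_size):
    """Return the padded token count (rows x longest sequence) for each batch."""
    counts = []
    for start in range(0, len(seq_lengths), batch_size):
        batch = seq_lengths[start:start + batch_size]
        counts.append(len(batch) * max(batch))
    return counts


# Made-up lengths: one 2048-token outlier lands in the second batch.
lengths = [120, 130, 110, 125, 140, 2048, 115, 135]
print(padded_tokens_per_batch(lengths, batch_size=4))  # [520, 8192]
```

Here the second batch holds roughly 16 times as many padded tokens as the first, which is the kind of step where an OOM appears mid-run even though earlier steps were fine.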
Does gradient accumulation change results?
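With the same effective batch size (micro-batch × accumulation steps) and each micro-batch loss divided by the number of accumulation steps, the accumulated gradient matches the large-batch gradient up to floating-point rounding. The main exception is layers that compute batch statistics, such as BatchNorm, which still see only the micro-batch. It's the standard way to fit large-batch training in small VRAM.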
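The equivalence can be checked by hand on a toy problem. The sketch below uses a one-parameter model y_hat = w * x with mean squared error and made-up data; it assumes the dataset size is divisible by the micro-batch size, and all names are illustrative.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error with respect to w for y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch):
    """Sum micro-batch gradients, each scaled by 1 / accum_steps."""
    accum_steps = len(xs) // micro_batch
    total = 0.0
    for k in range(accum_steps):
        lo = k * micro_batch
        g = mse_grad(w, xs[lo:lo + micro_batch], ys[lo:lo + micro_batch])
        total += g / accum_steps
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
full = mse_grad(0.5, xs, ys)                         # one batch of 8
accum = accumulated_grad(0.5, xs, ys, micro_batch=2)  # 4 micro-batches of 2
print(abs(full - accum) < 1e-9)  # True
```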