How to fix CUDA out of memory during training

torch.cuda.OutOfMemoryError means the model weights, optimizer state, gradients, and activations together don't fit in VRAM. Before renting a bigger GPU, work through the standard ladder of fixes in order.
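As a rough illustration of how quickly those terms add up, the sketch below estimates the static part of the footprint (weights, gradients, optimizer state) from a parameter count. It assumes fp32 weights and gradients and an Adam-style optimizer that keeps two fp32 moment buffers per parameter. Activations are left out because they depend on batch size and architecture. The function name static_training_bytes is hypothetical, not part of PyTorch or the vl CLI.

```python
def static_training_bytes(num_params,
                          bytes_per_weight=4,           # assumption: fp32 weights
                          bytes_per_grad=4,             # assumption: fp32 gradients
                          optimizer_bytes_per_param=8): # assumption: two fp32 Adam moments
    """Estimate bytes for weights + gradients + optimizer state (no activations)."""
    per_param = bytes_per_weight + bytes_per_grad + optimizer_bytes_per_param
    return num_params * per_param


GIB = 2**30
params = 1_000_000_000  # illustrative 1B-parameter model
estimate_gib = static_training_bytes(params) / GIB
print(f"{estimate_gib:.1f} GiB before activations")  # prints "14.9 GiB before activations"
```

Under these assumptions a 1B-parameter model needs about 15 GiB before a single activation is stored, which is why the activation-side fixes matter most on mid-sized GPUs.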

The fix ladder, cheapest first

Diagnose before you fix

Check where memory goes before changing anything by watching live VRAM during the run. On VaultLayer, vl gpu-stats <job_id> shows VRAM, utilization, and temperature while training. If usage spikes at the first forward pass, the model and its activations are too large (use the fix ladder above). If it creeps up over steps, you likely have a leak, such as tensors retained in a Python list or a missing detach().
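The spike-versus-creep distinction can also be checked from logged numbers rather than by eye. The following is a minimal sketch, assuming you record one VRAM reading per training step at the same point in the loop (in PyTorch, torch.cuda.memory_allocated() is one possible source). The function name classify_vram_trace and the 50 MiB threshold are illustrative assumptions, not VaultLayer features.

```python
def classify_vram_trace(samples_mib, creep_threshold_mib=50.0):
    """Label a per-step VRAM trace as 'creep' or 'flat'.

    samples_mib: readings in MiB, one per step, first step first.
    creep_threshold_mib: hypothetical tolerance for normal jitter.
    """
    if len(samples_mib) < 3:
        raise ValueError("need at least 3 samples to judge a trend")
    # Ignore the first step: the initial forward pass and allocator
    # warm-up dominate it, so it says nothing about a leak.
    steady = samples_mib[1:]
    growth = steady[-1] - steady[0]
    return "creep" if growth > creep_threshold_mib else "flat"


print(classify_vram_trace([9000, 9100, 9100, 9105]))         # flat
print(classify_vram_trace([9000, 9100, 9400, 9700, 10000]))  # creep
```

A "flat" trace that still runs out of memory points back to the fix ladder. A "creep" trace points to retained tensors.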

Retry on a bigger GPU in one flag

When the ladder isn't enough, resubmit on a larger class: vl run --gpu A100 python train.py (substitute H100 for A100 if you need more). Run vl gpus to compare VRAM and current prices, and vl estimate to see the cost before submitting.

Frequently asked questions

Why do I get CUDA OOM partway through training instead of at the start?

Usually one of three causes: a memory leak (tensors accumulated with the autograd graph still attached; append loss.item(), not loss), batches whose sequence length varies so that an unusually long batch exceeds the limit, or evaluation running with a larger batch than training. Watch VRAM over time to tell a spike from a creep.

Does gradient accumulation change results?

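With the same effective batch size (micro-batch × accumulation steps), the accumulated gradient matches the large-batch gradient up to floating-point rounding, provided each micro-batch loss is divided by the number of accumulation steps. Layers that compute per-batch statistics, such as batch normalization, still see only the micro-batch, so results can differ slightly there. It's the standard way to fit large-batch training in small VRAM.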
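The equivalence can be verified by hand on a toy problem. This sketch uses plain Python instead of PyTorch so the arithmetic is visible: a one-parameter model y = w * x with mean squared error, four samples, and two equal-sized micro-batches. All names and data values are illustrative assumptions.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error for the model y = w * x, w.r.t. w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Gradient from one large batch of 4 samples.
full_batch_grad = mse_grad(w, xs, ys)

# Same 4 samples as 2 micro-batches of 2, accumulating gradients.
micro_batch_size = 2
accum_steps = len(xs) // micro_batch_size
accumulated_grad = 0.0
for i in range(accum_steps):
    start = i * micro_batch_size
    stop = start + micro_batch_size
    # Dividing by accum_steps mirrors scaling the loss before backward().
    accumulated_grad += mse_grad(w, xs[start:stop], ys[start:stop]) / accum_steps

print(full_batch_grad, accumulated_grad)  # -22.5 -22.5
```

The match is exact here because the micro-batches are equal in size. With uneven micro-batches, each one would need to be weighted by its share of the samples instead of by 1 / accum_steps.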
