Which GPU should you use for AI training?
The right GPU for training depends on model size, precision, and budget more than on raw benchmarks. Here's how the common training GPUs compare on VRAM and throughput, and how to pick one without overpaying.
Common training GPUs compared
| GPU | VRAM | Best for |
|---|---|---|
| H100 | 80 GB | The largest models, FP8/bf16, and multi-node — highest throughput |
| A100 | 40 / 80 GB | The workhorse for large fine-tunes, including full fine-tuning |
| L40S | 48 GB | Strong price/performance for mid-size fine-tunes |
| A10G | 24 GB | LoRA / QLoRA and smaller fine-tunes |
| RTX 4090 | 24 GB | Cost-effective single-GPU LoRA / QLoRA on interruptible capacity |
How to choose
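- Start with VRAM. It has to hold the model, optimizer state, gradients, and activations. Full fine-tuning needs far more than LoRA or QLoRA: LoRA trains only small adapter matrices on top of frozen weights, and QLoRA also quantizes those frozen weights to 4-bit, so both cut memory sharply.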
- Then throughput. H100 and A100 train fastest; L40S and A10G trade speed for cost on smaller jobs.
- Then cost and availability. The fastest GPU isn't always the cheapest per finished job, so match the GPU to the workload rather than always reaching for an H100.
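The VRAM step can be roughed out before you pick a card. The sketch below is a back-of-the-envelope estimate, not a measurement: the bytes-per-parameter figures assume bf16 weights with an Adam-style optimizer for full fine-tuning, and the `activation_overhead` multiplier is a placeholder, since real activation memory depends heavily on batch size and sequence length. The function name and parameters are illustrative.

```python
# Assumed bytes per parameter (rules of thumb, not measurements):
#   full  : 2 (bf16 weights) + 2 (bf16 gradients)
#           + 12 (fp32 master weights and two Adam moments) = 16
#   lora  : 2 (frozen bf16 weights); adapter state treated as negligible
#   qlora : 0.5 (frozen 4-bit weights); adapter state treated as negligible
BYTES_PER_PARAM = {"full": 16.0, "lora": 2.0, "qlora": 0.5}


def estimate_training_vram_gb(params_billions, method="full", activation_overhead=1.2):
    """Return a rough VRAM estimate in GB for one training method.

    activation_overhead is an assumed multiplier covering activations and
    framework buffers; tune it for your batch size and sequence length.
    """
    if method not in BYTES_PER_PARAM:
        raise ValueError(f"unknown method: {method!r}")
    # Billions of parameters times bytes per parameter gives GB directly.
    base_gb = params_billions * BYTES_PER_PARAM[method]
    return base_gb * activation_overhead


if __name__ == "__main__":
    for size in (7, 13):
        for method in ("full", "lora", "qlora"):
            gb = estimate_training_vram_gb(size, method)
            print(f"{size}B {method:>5}: ~{gb:.0f} GB")
```

Under these assumptions, full fine-tuning of a 7B model lands above what a single 80 GB card holds, while QLoRA on the same model fits well within 24 GB. Treat the numbers as a starting point and confirm with a short trial run.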
Picking a GPU on VaultLayer
VaultLayer can route to the cheapest available GPU automatically, or pin a class with vl run --gpu H100 python train.py. Run vl gpus to list available types with their VRAM and current best price before you submit.
Frequently asked questions
How much GPU memory do I need to fine-tune a 7B or 13B model?
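With QLoRA, a 24 GB card (A10G or RTX 4090) can handle a 7B and often a 13B model. Full fine-tuning of those sizes typically wants 80 GB A100s or H100s, often more than one, because weights, gradients, and optimizer state together can exceed a single card. VRAM, not raw speed, is usually the deciding factor.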
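Once you have a VRAM figure, matching it against the comparison table is mechanical. The following sketch encodes the VRAM column from the table above and returns the cards that fit, smallest first. The `headroom` factor (use only 90% of a card's memory) is an assumption to leave room for fragmentation and framework overhead, and `gpus_that_fit` is an illustrative helper, not part of any tool.

```python
# VRAM per card in GB, taken from the comparison table above.
GPUS = [
    ("H100", 80),
    ("A100", 80),
    ("A100", 40),
    ("L40S", 48),
    ("A10G", 24),
    ("RTX 4090", 24),
]


def gpus_that_fit(required_gb, headroom=0.9):
    """Return (name, vram_gb) pairs whose usable VRAM covers required_gb.

    headroom is an assumed fraction of VRAM treated as usable.
    Results are sorted by VRAM so the smallest adequate card comes first.
    """
    fits = [(name, vram) for name, vram in GPUS if vram * headroom >= required_gb]
    return sorted(fits, key=lambda gpu: gpu[1])


if __name__ == "__main__":
    for need in (20, 40, 100):
        names = [f"{name} ({vram} GB)" for name, vram in gpus_that_fit(need)]
        print(f"need {need} GB:", names or "no single card; shard across GPUs")
```

An empty result means the job does not fit on any single card in the table and needs to be split across several GPUs or moved to a lighter method such as QLoRA.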
Do I need an H100, or is an A100 or L40S enough?
Only the largest models and multi-node runs really need H100 throughput. Many fine-tunes run comfortably and more cheaply on A100, L40S, or even 24 GB cards with QLoRA.
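To see why a faster card is not automatically the cheaper one, compare cost per finished job rather than cost per hour. The sketch below uses entirely hypothetical prices and speedups (they are placeholders, not VaultLayer or market figures); substitute your own measured throughput and the current prices you are quoted.

```python
def cost_per_job(price_per_hour, relative_speed, baseline_hours):
    """Return the total cost of one job on a given GPU.

    relative_speed is throughput relative to a baseline GPU (1.0 = baseline),
    and baseline_hours is how long the job takes on that baseline.
    """
    hours = baseline_hours / relative_speed
    return price_per_hour * hours


if __name__ == "__main__":
    # Hypothetical numbers for illustration only: (price per hour, relative speed).
    candidates = {
        "fast_gpu": (4.00, 2.0),
        "mid_gpu": (1.50, 1.0),
    }
    baseline_hours = 10.0
    for name, (price, speed) in candidates.items():
        total = cost_per_job(price, speed, baseline_hours)
        print(f"{name}: {baseline_hours / speed:.1f} h, total {total:.2f}")
```

With these placeholder values the faster card finishes in half the time but still costs more for the whole job, because its hourly price is more than double. The comparison flips whenever the speedup exceeds the price ratio, which is why measuring throughput on your own workload matters.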
Keep every training job moving.
VaultLayer is in invite-only early access for teams running real GPU workloads.