
Which GPU should you use for AI training?

The right GPU for training depends on model size, precision, and budget more than on raw benchmarks. Here's how the common training GPUs compare on VRAM and throughput, and how to pick one without overpaying.

Common training GPUs compared

GPU	VRAM	Best for
H100	80 GB	The largest models, FP8/bf16, and multi-node — highest throughput
A100	40 / 80 GB	The workhorse for large fine-tunes, including full fine-tuning
L40S	48 GB	Strong price/performance for mid-size fine-tunes
A10G	24 GB	LoRA / QLoRA and smaller fine-tunes
RTX 4090	24 GB	Cost-effective single-GPU LoRA / QLoRA on interruptible capacity

How to choose

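Start with VRAM. It has to hold the model weights, optimizer state, gradients, and activations. Full fine-tuning needs far more than LoRA, which trains only small adapter matrices on top of frozen weights, or QLoRA, which also quantizes those frozen weights to 4-bit. Both cut memory sharply.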
Then throughput. H100 and A100 train fastest; L40S and A10G trade speed for cost on smaller jobs.
Then cost and availability. The fastest GPU isn't the cheapest per finished job — match the GPU to the workload rather than always reaching for an H100.
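
To make the "cheapest per finished job" point concrete, here is a minimal sketch of the arithmetic: filter GPUs by VRAM, then multiply hourly price by the hours each one needs. The prices and relative speeds are hypothetical placeholders chosen only to illustrate the calculation, and GpuOption, job_cost, and cheapest_fit are illustrative names, not part of VaultLayer.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class GpuOption:
    name: str
    vram_gb: int
    price_per_hour: float  # hypothetical USD/hour, not a real quote
    relative_speed: float  # hypothetical throughput relative to a baseline GPU


def job_cost(option: GpuOption, baseline_hours: float) -> float:
    """Cost of one finished job: hourly price times the hours this GPU needs."""
    hours = baseline_hours / option.relative_speed
    return option.price_per_hour * hours


def cheapest_fit(
    options: Sequence[GpuOption], required_vram_gb: float, baseline_hours: float
) -> Optional[GpuOption]:
    """Among GPUs with enough VRAM, return the cheapest per finished job."""
    fitting = [o for o in options if o.vram_gb >= required_vram_gb]
    if not fitting:
        return None
    return min(fitting, key=lambda o: job_cost(o, baseline_hours))


# Placeholder numbers: substitute current prices and your own measured speeds.
OPTIONS = [
    GpuOption("H100", 80, price_per_hour=4.00, relative_speed=3.0),
    GpuOption("A100", 80, price_per_hour=2.00, relative_speed=2.0),
    GpuOption("L40S", 48, price_per_hour=0.90, relative_speed=1.0),
]

if __name__ == "__main__":
    for opt in OPTIONS:
        print(f"{opt.name}: ${job_cost(opt, baseline_hours=12):.2f} per job")
    best = cheapest_fit(OPTIONS, required_vram_gb=40, baseline_hours=12)
    print("cheapest fit:", best.name if best else "none")
```

With these made-up inputs the fastest card finishes first but costs the most per job. Real results depend entirely on current prices and on how your workload scales on each GPU.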

Picking a GPU on VaultLayer

VaultLayer can route your job to the cheapest available GPU automatically, or you can pin a GPU class, for example: vl run --gpu H100 python train.py. Run vl gpus to list available types with their VRAM and current best price before you submit.

Frequently asked questions

How much GPU memory do I need to fine-tune a 7B or 13B model?

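With QLoRA, a 24 GB card (A10G or RTX 4090) can handle a 7B and often a 13B model. Full fine-tuning of those sizes typically wants A100s or H100s, often more than one, because weights, gradients, and Adam optimizer state come to roughly 16 bytes per parameter in mixed precision, before activations. VRAM, not raw speed, is usually the deciding factor.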
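
As a rough back-of-the-envelope check on those numbers, the sketch below estimates model-state memory only (weights, gradients, optimizer state). It assumes mixed-precision Adam at about 16 bytes per trainable parameter, bf16 frozen weights for LoRA, 4-bit frozen weights for QLoRA, and a hypothetical adapter size of 1% of the base model. Activations, quantization metadata, and framework overhead are left out, so real usage will be higher. The function name model_state_gb is illustrative.

```python
# Bytes per parameter of model state. Activations and overhead are NOT included.
# Trainable: bf16 weight (2) + bf16 gradient (2)
#            + fp32 master weight, Adam momentum and variance (12).
TRAINABLE_BYTES = 16.0
# Frozen base weights: bf16 for LoRA, 4-bit for QLoRA.
FROZEN_BYTES = {"lora": 2.0, "qlora": 0.5}


def model_state_gb(
    params_billion: float, method: str, adapter_fraction: float = 0.01
) -> float:
    """Estimate model-state memory in GB (1e9 bytes).

    adapter_fraction is an assumed ratio of trainable adapter parameters to
    base parameters; 1% is a placeholder, not a measured value.
    """
    if method == "full":
        # Billions of parameters times bytes per parameter gives GB directly.
        return params_billion * TRAINABLE_BYTES
    if method not in FROZEN_BYTES:
        raise ValueError(f"unknown method: {method!r}")
    frozen = params_billion * FROZEN_BYTES[method]
    adapters = params_billion * adapter_fraction * TRAINABLE_BYTES
    return frozen + adapters


if __name__ == "__main__":
    for size in (7, 13):
        for method in ("full", "lora", "qlora"):
            gb = model_state_gb(size, method)
            print(f"{size}B {method:>5}: ~{gb:.1f} GB before activations")
```

Under these assumptions, QLoRA on a 7B or 13B model leaves room for activations on a 24 GB card, while full fine-tuning state for a 7B model already exceeds a single 80 GB card and would need to be spread across GPUs or trimmed with memory-saving techniques.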

Do I need an H100, or is an A100 or L40S enough?

Only the largest models and multi-node runs really need H100 throughput. Many fine-tunes run comfortably and more cheaply on A100, L40S, or even 24 GB cards with QLoRA.

Keep every training job moving.

VaultLayer is in invite-only early access for teams running real GPU workloads.
