QLoRA vs LoRA vs full fine-tuning
These three fine-tuning methods trade GPU memory against flexibility. Full fine-tuning updates every weight, LoRA trains small adapters on a frozen model, and QLoRA adds 4-bit quantization so large models fit on small GPUs. Here's how to choose.
The three methods compared
| Method | What it does | GPU memory |
|---|---|---|
| Full fine-tune | Updates all model weights | Highest: weights, gradients, and optimizer state for every parameter |
| LoRA | Trains small low-rank adapters on a frozen base | Moderate: the base weights stay in memory, but only the adapters carry gradients and optimizer state |
| QLoRA | LoRA on a 4-bit quantized base | Lowest: 4-bit storage cuts the frozen base weights to roughly a quarter of their 16-bit size |
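As a rough illustration of why the memory column ranks the way it does, the sketch below estimates weights, gradients, and optimizer state for each method. The byte counts are common rule-of-thumb assumptions (bf16 weights, an Adam-style optimizer with fp32 state, adapters sized at 1% of the base model), not VaultLayer measurements, and the function name is hypothetical. Activations, quantization overhead, and framework buffers are ignored, so real usage will be higher.

```python
# Rule-of-thumb bytes per parameter (assumptions, not measured values).
BYTES_BF16_WEIGHT = 2.0   # 16-bit weight
BYTES_4BIT_WEIGHT = 0.5   # 4-bit quantized weight, ignoring quantization constants
BYTES_TRAIN_STATE = 14.0  # bf16 gradient (2) + fp32 master copy (4) + two fp32 Adam moments (8)


def estimate_state_gb(params_b: float, mode: str, adapter_fraction: float = 0.01) -> float:
    """Estimate GB for weights, gradients, and optimizer state.

    params_b is the base model size in billions of parameters, so
    (billions of parameters) x (bytes per parameter) gives GB directly.
    adapter_fraction is an assumed adapter size relative to the base model.
    """
    adapters_b = params_b * adapter_fraction
    adapter_gb = adapters_b * (BYTES_BF16_WEIGHT + BYTES_TRAIN_STATE)
    if mode == "full":
        # Every parameter is trainable: weight plus full training state.
        return params_b * (BYTES_BF16_WEIGHT + BYTES_TRAIN_STATE)
    if mode == "lora":
        # Frozen 16-bit base, trainable adapters only.
        return params_b * BYTES_BF16_WEIGHT + adapter_gb
    if mode == "qlora":
        # Frozen 4-bit base, trainable adapters only.
        return params_b * BYTES_4BIT_WEIGHT + adapter_gb
    raise ValueError(f"unknown mode: {mode!r}")


for mode in ("full", "lora", "qlora"):
    print(f"{mode}: {estimate_state_gb(7, mode):.1f} GB")
# full: 112.0 GB
# lora: 15.1 GB
# qlora: 4.6 GB
```

Under these assumptions a 7B-parameter model differs by more than an order of magnitude between a full fine-tune and QLoRA, which matches the ordering in the table even though the absolute numbers are only approximate.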
How to choose
- QLoRA when GPU memory is the constraint — it fits large models on a single smaller card with a small quality trade-off.
- LoRA for most fine-tunes — near-full quality at a fraction of the memory, with no quantization.
- Full fine-tuning when you need maximum quality or are changing the model deeply, and have the GPU budget for it.
For the actual VRAM numbers by model size, see the guide on how much GPU memory it takes to fine-tune an LLM.
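The guidance above can be sketched as a small decision helper. This is one possible encoding of the rules, not a VaultLayer feature: the function name, its arguments, and the memory requirements in the example are all hypothetical placeholders that you would replace with figures for your own model.

```python
from typing import Dict, Optional


def choose_method(
    vram_gb: float,
    need_gb: Dict[str, float],
    want_max_quality: bool = False,
) -> Optional[str]:
    """Pick a fine-tuning method for a given GPU memory budget.

    need_gb maps "full", "lora", and "qlora" to the memory each method
    needs for your model (supplied by you, not computed here).
    Returns None when even QLoRA does not fit.
    """
    # Full fine-tuning only when maximum quality is the goal and it fits.
    if want_max_quality and vram_gb >= need_gb["full"]:
        return "full"
    # LoRA is the default whenever it fits: no quantization involved.
    if vram_gb >= need_gb["lora"]:
        return "lora"
    # QLoRA when memory is the constraint.
    if vram_gb >= need_gb["qlora"]:
        return "qlora"
    return None


# Illustrative requirements only; these are made-up numbers, not measurements.
need = {"full": 120.0, "lora": 20.0, "qlora": 8.0}
print(choose_method(24.0, need))                          # lora
print(choose_method(12.0, need))                          # qlora
print(choose_method(160.0, need, want_max_quality=True))  # full
```

Returning `None` instead of raising keeps the "nothing fits" case explicit, so the caller can decide whether to pick a larger GPU or a smaller model.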
Running each on VaultLayer
Pick the method with one flag: `vl run --train-mode qlora|lora|full python train.py`, and pair it with `--model-params` to size the GPU for your model. See the guide on fine-tuning LLMs on your own cloud for the full workflow.
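For orientation, here is a minimal sketch of how a `train.py` might build the model for each of the three modes. It assumes the script uses the Hugging Face `transformers`, `peft`, and `bitsandbytes` libraries, which this page does not confirm. It also does not show how `vl run --train-mode` reaches the script, since that is not described here, so `build_model` simply takes the mode as an argument. The helper names and the hyperparameters (`r=16`, `lora_alpha=32`, and so on) are illustrative choices, not recommendations.

```python
MODES = ("qlora", "lora", "full")


def plan_for(mode: str) -> dict:
    """Return which pieces a mode needs: a 4-bit base and/or LoRA adapters."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    return {
        "quantize_4bit": mode == "qlora",
        "use_adapters": mode in ("qlora", "lora"),
    }


def build_model(model_name: str, mode: str):
    """Load a causal LM prepared for the given mode (hypothetical helper)."""
    # Heavy imports are deferred so this module loads without a GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    plan = plan_for(mode)
    load_kwargs = {"torch_dtype": torch.bfloat16}
    if plan["quantize_4bit"]:
        # QLoRA: store the frozen base weights in 4-bit NF4.
        load_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
        )

    model = AutoModelForCausalLM.from_pretrained(model_name, **load_kwargs)

    if plan["quantize_4bit"]:
        # Prepares the quantized model for training (e.g. casts some layers).
        model = prepare_model_for_kbit_training(model)
    if plan["use_adapters"]:
        # LoRA and QLoRA: freeze the base and train low-rank adapters only.
        lora_config = LoraConfig(
            r=16,
            lora_alpha=32,
            lora_dropout=0.05,
            target_modules="all-linear",
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, lora_config)
    # In "full" mode the model is returned as loaded, with every weight trainable.
    return model


# Example call (placeholder model id, requires the libraries above and a GPU):
# model = build_model("your-org/your-model", "qlora")
```

The only difference between the LoRA and QLoRA paths in this sketch is the 4-bit loading step; the adapter configuration is shared, which mirrors the description of QLoRA as LoRA on a quantized base.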
Frequently asked questions
Is QLoRA worse than full fine-tuning?
For many tasks the quality gap is small, and QLoRA makes large models trainable on limited hardware. Full fine-tuning can still win when you need maximum quality or are reshaping the model substantially.
Which should I start with?
QLoRA and LoRA are the usual starting points: cheaper, faster, and enough for most fine-tunes. Move to a full fine-tune only if the adapter approach leaves quality on the table.