QLoRA vs LoRA vs full fine-tuning

These three fine-tuning methods trade GPU memory against flexibility. Full fine-tuning updates every weight, LoRA trains small adapters on a frozen model, and QLoRA adds 4-bit quantization so large models fit on small GPUs. Here's how to choose.
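To see why the adapters count as small, here is a minimal sketch of the parameter arithmetic for a single weight matrix. It assumes the standard LoRA formulation, in which a frozen matrix W of shape d_out x d_in gets a trainable update B·A, with B of shape d_out x rank and A of shape rank x d_in. The function name and the 4096-wide, rank-8 example are illustrative choices, not VaultLayer defaults.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (full, adapter) trainable-parameter counts for one weight matrix."""
    full = d_in * d_out              # full fine-tuning trains every entry of W
    adapter = rank * (d_in + d_out)  # LoRA trains only A (rank x d_in) and B (d_out x rank)
    return full, adapter


# Illustrative sizes: a 4096 x 4096 projection with a rank-8 adapter.
full, adapter = lora_param_counts(d_in=4096, d_out=4096, rank=8)
print(full, adapter, f"{adapter / full:.2%}")  # 16777216 65536 0.39%
```

Under these assumptions the adapter holds well under 1% of the matrix's parameters, which is why its gradients and optimizer state stay cheap.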

The three methods compared

Method	What it does	GPU memory
Full fine-tune	Updates all model weights	Highest — weights, gradients, and optimizer state for every parameter
LoRA	Trains small low-rank adapters on a frozen base	Moderate — only the adapters have gradients/optimizer state
QLoRA	LoRA on a 4-bit quantized base	Lowest — quantization shrinks the frozen weights sharply
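The ordering in the table can be sketched as back-of-envelope arithmetic. The estimator below is a rough illustration, not a measurement. It assumes mixed-precision training with Adam at about 16 bytes per trained parameter, 2 bytes per frozen 16-bit weight, 0.5 bytes per frozen 4-bit weight, and adapters sized at 1% of the base model. It ignores activations, quantization metadata, and framework overhead, and every name and default in it is hypothetical.

```python
GB = 1e9  # decimal gigabytes

# Assumed bytes per parameter (rough conventions, not measurements).
TRAINED_BYTES = 16.0      # bf16 weight 2 + bf16 grad 2 + fp32 master copy 4 + Adam moments 8
FROZEN_BF16_BYTES = 2.0   # frozen base kept in 16-bit
FROZEN_4BIT_BYTES = 0.5   # frozen base quantized to 4-bit


def estimate_gb(model_params: float, mode: str, adapter_fraction: float = 0.01) -> float:
    """Estimate memory for weights, gradients, and optimizer state only."""
    adapter_params = model_params * adapter_fraction  # assumed adapter size
    if mode == "full":
        total = model_params * TRAINED_BYTES
    elif mode == "lora":
        total = model_params * FROZEN_BF16_BYTES + adapter_params * TRAINED_BYTES
    elif mode == "qlora":
        total = model_params * FROZEN_4BIT_BYTES + adapter_params * TRAINED_BYTES
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return total / GB


for mode in ("full", "lora", "qlora"):
    print(f"{mode:>5}: {estimate_gb(7e9, mode):.1f} GB")
```

For a 7-billion-parameter model these assumptions give roughly 112 GB, 15 GB, and 4.6 GB. Real jobs need more once activations and batch size are counted, so treat the output as a way to compare methods rather than a sizing guide.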

How to choose

QLoRA when GPU memory is the constraint — it fits large models on a single smaller card with a small quality trade-off.
LoRA for most fine-tunes — near-full quality at a fraction of the memory, with no quantization.
Full fine-tuning when you need maximum quality or are changing the model deeply, and have the GPU budget for it.
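The three rules above can be sketched as a small decision helper. This is one possible encoding, not VaultLayer's logic: the function name, its arguments, and the memory figures in the usage lines are hypothetical placeholders, and it assumes you already have a per-method memory estimate for your model.

```python
from typing import Optional


def choose_method(need_gb: dict[str, float], gpu_gb: float,
                  need_max_quality: bool = False) -> Optional[str]:
    """Pick a method from estimated memory needs (keys: 'full', 'lora', 'qlora')."""
    # Full fine-tuning only when maximum quality is required and the budget allows it.
    if need_max_quality and need_gb["full"] <= gpu_gb:
        return "full"
    # LoRA is the default whenever it fits.
    if need_gb["lora"] <= gpu_gb:
        return "lora"
    # QLoRA when memory is the constraint.
    if need_gb["qlora"] <= gpu_gb:
        return "qlora"
    return None  # nothing fits: use a larger GPU or a smaller model


# Placeholder requirements in GB, for illustration only.
need = {"full": 120.0, "lora": 20.0, "qlora": 8.0}
print(choose_method(need, gpu_gb=24))                          # lora
print(choose_method(need, gpu_gb=16))                          # qlora
print(choose_method(need, gpu_gb=160, need_max_quality=True))  # full
```

Checking LoRA before QLoRA mirrors the guidance that quantization is worth its small quality cost only when memory forces it.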

For the actual VRAM numbers by model size, see "how much GPU memory to fine-tune an LLM".

Running each on VaultLayer

Pick the method with one flag: vl run --train-mode qlora|lora|full python train.py. Pair it with --model-params to size the GPU for your model. See "fine-tune LLMs on your own cloud" for the full workflow.

Frequently asked questions

Is QLoRA worse than full fine-tuning?

For many tasks the quality gap is small, and QLoRA makes large models trainable on limited hardware. Full fine-tuning can still win when you need maximum quality or are reshaping the model substantially.

Which should I start with?

QLoRA and LoRA are the usual starting points: cheaper, faster, and enough for most fine-tunes. Move to full fine-tuning only if the adapter approach leaves quality on the table.
