FSDP vs DDP: which should you use?

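Both are forms of PyTorch data parallelism: every GPU processes a different slice of each batch, and gradients are synchronized across GPUs on every step. The difference is memory. DDP (DistributedDataParallel) keeps a full model copy on each GPU; FSDP (Fully Sharded Data Parallel) shards the model's parameters, gradients, and optimizer state across them. Choose based on whether the full training state fits on one GPU.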

At a glance

	DDP	FSDP
Model per GPU	Full replica	Sharded params, grads, optimizer state
Memory ceiling	Full training state must fit on one GPU	Training state must fit in the combined memory of all GPUs
Communication	One gradient all-reduce per step	Parameter all-gather and gradient reduce-scatter for each layer; more traffic
Speed when both fit	Usually faster and simpler	Slower, due to gather/scatter overhead
Typical use	Small/medium models, throughput scaling	Large models that don't fit one GPU
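
To put rough numbers on the Communication row, here is a minimal sketch of per-step traffic per GPU. It assumes the standard ring-collective approximation: an all-reduce moves about 2x the parameter count, while fully sharded training adds a second parameter all-gather in the backward pass for about 3x. The function names, the 2-byte element size, and the 50 GB/s bandwidth are illustrative assumptions, and the estimate ignores any overlap between communication and compute.

```python
def comm_elements_per_step(num_params: int, strategy: str) -> int:
    """Approximate elements each GPU sends per step (large-group limit)."""
    if strategy == "ddp":
        # Ring all-reduce = reduce-scatter + all-gather of the gradients.
        return 2 * num_params
    if strategy == "fsdp":
        # All-gather params (forward) + all-gather params (backward)
        # + reduce-scatter gradients. Assumes full sharding.
        return 3 * num_params
    raise ValueError(f"unknown strategy: {strategy!r}")


def comm_seconds(
    num_params: int,
    strategy: str,
    bytes_per_element: int = 2,    # assumption: bf16/fp16 traffic
    bandwidth_gb_s: float = 50.0,  # assumption: hypothetical link speed
) -> float:
    """Lower-bound transfer time per step, with no compute overlap."""
    bytes_moved = comm_elements_per_step(num_params, strategy) * bytes_per_element
    return bytes_moved / (bandwidth_gb_s * 1e9)


if __name__ == "__main__":
    for strategy in ("ddp", "fsdp"):
        print(strategy, comm_seconds(7_000_000_000, strategy))
```

Under these assumptions FSDP moves about 1.5x the data DDP does, which is why the gap widens as link bandwidth drops.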

The decision rule

If the full training state (parameters, gradients, optimizer state, and activations) fits on one GPU, use DDP: it's simpler and typically faster. Reach for FSDP when it doesn't fit; sharding is what lets a 70B-class model train across a group of 80 GB GPUs. In mixed cases, where the model fits with QLoRA but not with full fine-tuning, choose the fine-tuning method first, then the parallelism.
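
The rule can be sketched as a back-of-the-envelope check. This example assumes mixed-precision training with Adam, commonly estimated at about 16 bytes per parameter (2 for half-precision weights, 2 for gradients, 12 for fp32 master weights and the two Adam moments). It ignores activations and the temporarily gathered layers FSDP needs, so the `headroom` fraction is a hypothetical safety margin rather than a measured value, and the function names are illustrative.

```python
def state_gb_per_gpu(
    num_params: int,
    num_gpus: int,
    strategy: str,
    bytes_per_param: int = 16,  # assumption: mixed precision + Adam
) -> float:
    """Model state (params + grads + optimizer) held by each GPU, in GB."""
    total_gb = num_params * bytes_per_param / 1e9
    if strategy == "ddp":
        return total_gb             # every GPU holds a full replica
    if strategy == "fsdp":
        return total_gb / num_gpus  # state is sharded evenly
    raise ValueError(f"unknown strategy: {strategy!r}")


def pick_strategy(
    num_params: int,
    num_gpus: int,
    gpu_gb: float = 80.0,
    headroom: float = 0.7,  # assumption: leave ~30% for activations etc.
) -> str:
    """Return 'ddp' if a full replica fits, else 'fsdp' if shards fit."""
    budget = gpu_gb * headroom
    if state_gb_per_gpu(num_params, num_gpus, "ddp") <= budget:
        return "ddp"
    if state_gb_per_gpu(num_params, num_gpus, "fsdp") <= budget:
        return "fsdp"
    return "neither"


if __name__ == "__main__":
    print(pick_strategy(1_000_000_000, 8))    # 16 GB replica fits one GPU
    print(pick_strategy(70_000_000_000, 8))   # 140 GB per shard is too much
    print(pick_strategy(70_000_000_000, 32))  # 35 GB per shard fits
```

With these assumptions, full fine-tuning of a 70B model needs roughly 1,120 GB of model state, so the group has to be large enough that each shard leaves room for activations. A parameter-efficient method such as QLoRA changes `bytes_per_param` substantially, which is why the fine-tuning choice comes first.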

Running either on VaultLayer

Standard torchrun-style DDP and FSDP scripts run unchanged, either on a single node with multiple GPUs or across machines via multi-node training. In the multi-node case, VaultLayer provisions the group, wires NCCL networking, and recovers the whole job from checkpoint on failure.
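
One reason such scripts can run unchanged is that torchrun passes the job layout to each process through environment variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `LOCAL_WORLD_SIZE`). A minimal sketch of reading them follows; the `LaunchInfo` type and `read_launch_info` helper are hypothetical names for illustration, and the fallbacks for missing variables are an assumption so the script also runs as a plain single process.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LaunchInfo:
    rank: int              # global process index across all nodes
    local_rank: int        # process index on this node (usually the GPU id)
    world_size: int        # total number of processes in the job
    local_world_size: int  # processes on this node

    @property
    def num_nodes(self) -> int:
        return self.world_size // self.local_world_size

    @property
    def is_multi_node(self) -> bool:
        return self.num_nodes > 1


def read_launch_info(env: Optional[Mapping[str, str]] = None) -> LaunchInfo:
    """Read the torchrun-provided layout, defaulting to a single process."""
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    return LaunchInfo(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=world_size,
        # Assumption: if unset, treat every process as being on this node.
        local_world_size=int(env.get("LOCAL_WORLD_SIZE", str(world_size))),
    )


if __name__ == "__main__":
    info = read_launch_info()
    print(info, "multi-node" if info.is_multi_node else "single-node")
```

Because the script only reads its layout from the environment, the same file works whether the launcher starts 8 processes on one machine or 16 across two.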

Frequently asked questions

Is FSDP always better than DDP for big models?

FSDP is necessary when the model doesn't fit on one GPU. When it does fit, DDP is usually faster because it avoids per-layer gather/scatter traffic. Use the simplest thing that fits.

Does FSDP work across multiple machines?

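Yes. FSDP shards across all GPUs in the job, including multi-node groups. Inter-node network bandwidth matters more for FSDP than for DDP, because parameters are gathered for every layer rather than gradients being synchronized once per step, so node placement and networking setup count.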

