FSDP vs DDP: which should you use?

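DDP (DistributedDataParallel) and FSDP (Fully Sharded Data Parallel) are both PyTorch data-parallel strategies: every GPU processes a different slice of the data, and gradients are synchronized across GPUs. The difference is memory. DDP keeps a full model copy per GPU; FSDP shards the model across them. Choose based on whether the model fits on a single GPU.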

At a glance

| | DDP | FSDP |
| --- | --- | --- |
| Model per GPU | Full replica | Sharded params, grads, optimizer state |
| Memory ceiling | Model must fit on ONE GPU | Model must fit across ALL GPUs |
| Communication | All-reduce gradients | All-gather/reduce-scatter each layer; more traffic |
| Speed when both fit | Usually faster and simpler | Overhead from gather/scatter |
| Typical use | Small/medium models, throughput scaling | Large models that don't fit one GPU |

The decision rule

If the model fits on one GPU together with its gradients, optimizer state, and activations, use DDP: it's simpler and typically faster. Reach for FSDP when it doesn't fit: sharding is what lets a 70B-class model train across a group of 80 GB GPUs. Mixed cases, where the model fits on one GPU with QLoRA but not with full fine-tuning, often mean choosing the fine-tuning method first, then the parallelism.
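The decision rule above can be sketched as a back-of-the-envelope calculation. This is a rough estimate, not a profiler: it assumes full fine-tuning with Adam in mixed precision at about 16 bytes per parameter (2 for weights, 2 for gradients, 12 for optimizer state), and it reserves a share of each GPU for activations through a hypothetical `headroom` factor. The function names and the 0.7 default are illustrative assumptions.

```python
# Rough training-memory estimate. Activations and framework overhead are NOT
# modeled directly; they are only covered by the `headroom` safety factor.

# Assumption: mixed-precision Adam, about 16 bytes per parameter
# (2 weights + 2 gradients + 12 optimizer state).
BYTES_PER_PARAM = 16


def training_state_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Memory for weights, gradients, and optimizer state, in GB (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def choose_strategy(num_params: float, gpu_mem_gb: float, num_gpus: int,
                    headroom: float = 0.7) -> str:
    """Return "DDP", "FSDP", or "does not fit" for the given hardware.

    `headroom` is a hypothetical fraction of GPU memory left for training
    state; the remainder is reserved for activations and overhead.
    """
    budget_gb = gpu_mem_gb * headroom
    total_gb = training_state_gb(num_params)
    if total_gb <= budget_gb:
        return "DDP"            # a full replica fits on every GPU
    if total_gb / num_gpus <= budget_gb:
        return "FSDP"           # only the sharded state fits
    return "does not fit"       # more GPUs or a lighter method is needed


if __name__ == "__main__":
    for params, gpus in [(1e9, 8), (7e9, 8), (70e9, 8), (70e9, 32)]:
        print(f"{params / 1e9:.0f}B on {gpus} x 80 GB:",
              choose_strategy(params, gpu_mem_gb=80, num_gpus=gpus))
```

Under these assumptions a 70B model needs roughly 1,120 GB for weights, gradients, and optimizer state alone, so it does not fit on 8 × 80 GB GPUs with 30% headroom but does fit on 32. Parameter-efficient methods such as QLoRA shrink the per-parameter cost considerably, which is why the fine-tuning method is worth settling first.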

Running either on VaultLayer

Standard torchrun-style DDP and FSDP scripts run unchanged, either on a single node with multiple GPUs or across machines via multi-node training. In the multi-node case, VaultLayer provisions the group, wires NCCL networking, and recovers the whole job from checkpoint on failure.
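For reference, a "torchrun-style" script is one that reads its rank from the environment variables torchrun exports and wraps the model accordingly. The following is a minimal sketch of that shape, not VaultLayer-specific code: the placeholder model, the single training step, and the `STRATEGY` environment variable are assumptions made for illustration.

```python
import os


def read_torchrun_env(environ=os.environ):
    """Read the rank variables that torchrun exports to each worker process."""
    return {
        "rank": int(environ.get("RANK", 0)),
        "local_rank": int(environ.get("LOCAL_RANK", 0)),
        "world_size": int(environ.get("WORLD_SIZE", 1)),
    }


def wrap_model(model, strategy, local_rank):
    """Wrap an nn.Module in DDP or FSDP on the GPU given by local_rank."""
    if strategy not in ("ddp", "fsdp"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    # torch is imported lazily so the helpers above work without it installed.
    import torch
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.nn.parallel import DistributedDataParallel as DDP

    torch.cuda.set_device(local_rank)
    if strategy == "ddp":
        device = torch.device("cuda", local_rank)
        return DDP(model.to(device), device_ids=[local_rank])  # full replica
    return FSDP(model, device_id=local_rank)                   # sharded


def main(strategy="ddp"):
    import torch
    import torch.distributed as dist

    env = read_torchrun_env()
    dist.init_process_group(backend="nccl")
    model = torch.nn.Linear(1024, 1024)  # placeholder model (assumption)
    model = wrap_model(model, strategy, env["local_rank"])
    # Create the optimizer after wrapping so it sees the wrapped parameters.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    x = torch.randn(32, 1024, device=torch.device("cuda", env["local_rank"]))
    loss = model(x).pow(2).mean()
    loss.backward()        # gradients are synchronized during backward
    optimizer.step()
    dist.destroy_process_group()


# Only start training when launched by torchrun, which sets LOCAL_RANK.
if __name__ == "__main__" and "LOCAL_RANK" in os.environ:
    main(os.environ.get("STRATEGY", "ddp"))  # STRATEGY is a hypothetical switch
```

Saved as `train.py`, a script like this would typically be launched on one node with `torchrun --nproc_per_node=8 train.py`. Switching between DDP and FSDP changes only the wrapping call; the launch command and training loop stay the same.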

Frequently asked questions

Is FSDP always better than DDP for big models?

FSDP is necessary when the model doesn't fit on one GPU. When it does fit, DDP is usually faster because it avoids per-layer gather/scatter traffic. Use the simplest thing that fits.
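To put a rough number on that extra traffic, the sketch below compares per-GPU communication volume per training step. It assumes ring-style collectives and full sharding in which parameters are gathered again for the backward pass; the function name is illustrative, and real volumes depend on the sharding strategy, wrapping policy, and overlap with compute.

```python
def comm_elements_per_step(num_params: float, num_gpus: int, strategy: str) -> float:
    """Approximate number of elements each GPU sends per training step.

    Assumes ring collectives, where a reduce-scatter or an all-gather over
    P elements moves about P * (N - 1) / N elements per GPU.
    """
    if strategy not in ("ddp", "fsdp"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    if num_gpus < 2:
        return 0.0  # nothing to synchronize on a single GPU
    ring = (num_gpus - 1) / num_gpus
    if strategy == "ddp":
        # Gradient all-reduce = reduce-scatter + all-gather.
        return 2 * num_params * ring
    # FSDP: all-gather params (forward) + all-gather params (backward)
    # + reduce-scatter gradients.
    return 3 * num_params * ring


if __name__ == "__main__":
    ddp = comm_elements_per_step(7e9, 8, "ddp")
    fsdp = comm_elements_per_step(7e9, 8, "fsdp")
    print(f"DDP:  {ddp:.3e} elements per GPU per step")
    print(f"FSDP: {fsdp:.3e} elements per GPU per step")
    print(f"FSDP / DDP: {fsdp / ddp:.2f}")
```

Under these assumptions FSDP moves about 1.5 times as much data as DDP per step, which is the overhead DDP avoids when the model fits on one GPU.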

Does FSDP work across multiple machines?

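Yes. FSDP shards across all GPUs in the job, including multi-node groups. Inter-node network bandwidth matters more for FSDP than for DDP because FSDP also gathers parameters over the network during the forward and backward passes, so node placement and networking setup directly affect throughput.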
