Fault-tolerant GPU training that survives interruptions
Interruptible and spot GPUs are cheap because they can be reclaimed at any moment, and a reclaim mid-run normally wipes out hours of training progress. Fault-tolerant training limits that risk: VaultLayer detects the failure and resumes your job from the last checkpoint, so you lose at most the work done since that checkpoint.
Why training jobs lose progress
When a provider reclaims a spot instance, the host crashes, or a GPU becomes unhealthy, most setups hand you a fresh, empty pod. Without your own checkpoint sync and resume logic, the run restarts from step zero and you pay for the same work twice. Writing that recovery glue correctly, for every framework and every provider, is the part teams underestimate.
How checkpoint-and-resume works
VaultLayer checkpoints your training state to durable storage as the job runs, on a health-gated cadence. If the host is interrupted, VaultLayer re-provisions and restarts your script from the last saved step:
vl run python train.py # failover is on by default
vl run --no-failover python train.py # opt out for a single attempt
Hugging Face Trainer, PyTorch, JAX/Flax, Lightning, and DeepSpeed are all supported; the resume snippet is usually auto-inserted on your first run. See how checkpoint & resume works on interruptible GPUs.
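To make the checkpoint-and-resume pattern concrete, here is a minimal, framework-free sketch in Python. It is not VaultLayer's integration or its auto-inserted snippet. The function names (save_checkpoint, load_checkpoint, train), the JSON file format, and the stop_after parameter used to simulate an interruption are all assumptions made for illustration. A real training script would save model and optimizer state with its framework's own tools instead of a small dictionary.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, state: dict) -> None:
    """Write state atomically so a crash mid-write cannot corrupt the last good checkpoint."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Renaming over the old file is atomic when both paths are on the same filesystem.
    os.replace(tmp_name, path)


def load_checkpoint(path: Path) -> dict:
    """Return the last saved state, or a fresh state if no checkpoint exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"step": 0, "loss": None}


def train(path: Path, total_steps: int, checkpoint_every: int, stop_after=None) -> dict:
    """Run a toy training loop that resumes from the last checkpoint.

    stop_after is a hypothetical hook that simulates a reclaimed host by
    raising before the step with that index runs.
    """
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if stop_after is not None and step >= stop_after:
            raise RuntimeError("simulated interruption")
        # Stand-in for a real forward/backward pass and optimizer update.
        state = {"step": step + 1, "loss": 1.0 / (step + 1)}
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

In this sketch, a run interrupted at step 6 with a checkpoint every 4 steps restarts at step 4 rather than step 0. The work redone after an interruption is bounded by the checkpoint interval, which is the trade-off any checkpoint cadence has to balance against write cost.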
Cross-provider recovery
Recovery is not limited to the same machine. When your original GPU or even an entire provider runs out of capacity, VaultLayer can resume on the next available option — your own BYOC capacity first, then external GPU capacity when that path is configured — so a single provider's outage doesn't stall the job.
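The fallback order described above can be sketched as a small selection function. This is an illustrative model only, not VaultLayer's scheduler: the Pool type, the "byoc" and "external" labels, the free_gpus field, and the external_enabled flag are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Pool:
    name: str
    kind: str       # "byoc" or "external" (hypothetical labels)
    free_gpus: int


def pick_resume_target(
    pools: List[Pool], gpus_needed: int, external_enabled: bool
) -> Optional[Pool]:
    """Prefer BYOC capacity, then external capacity if that path is enabled."""
    priority = {"byoc": 0, "external": 1}
    candidates = [
        p for p in pools
        if p.kind in priority
        and p.free_gpus >= gpus_needed
        and (p.kind == "byoc" or external_enabled)
    ]
    # min() returns the first pool with the best priority; None if nothing fits.
    return min(candidates, key=lambda p: priority[p.kind], default=None)
```

Returning None when no pool qualifies keeps the "no capacity anywhere" case explicit, so the caller can decide whether to wait and retry or surface the failure.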
Frequently asked questions
What happens if a GPU dies mid-run?
VaultLayer detects it and resumes your training from the last checkpoint on available compute — your connected BYOC capacity first, or external GPU capacity when configured. No restart from scratch.
Do I have to write my own checkpointing?
No. VaultLayer handles checkpoint sync and resume. The integration is optional and usually auto-inserted on your first run; you can also add three lines yourself for fine control.
Can I turn failover off?
Yes. Failover is on by default; pass --no-failover to run a single attempt with no automatic recovery.
Keep every training job moving.
VaultLayer is in invite-only early access for teams running real GPU workloads.
Get early access