My training job keeps getting preempted — how do I finish it?
Preemption is the price of cheap GPUs: spot and interruptible capacity can be reclaimed by the provider at any time, sometimes with only seconds of notice. You can't prevent it, so the winning strategy is to make each preemption cost minutes of work instead of the whole run.
Why it keeps happening
Spot capacity is surplus — when paying demand returns, your instance is reclaimed. Reclaim rates vary by GPU class, region, and time of day; a hot GPU class (H100s in a popular region) can see multiple preemptions in a single day. If your job restarts from step 0 each time, it may literally never finish.
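To put a rough number on "may never finish", here is a minimal sketch under a simplifying assumption that is not from the text above: preemptions arrive at random with a constant average rate (a Poisson process), and restart overhead is zero. Under that model, the expected wall-clock time to get one uninterrupted stretch of T hours is (e^(rate × T) − 1) / rate. The function names and the example numbers are illustrative only.

```python
import math


def expected_hours_no_checkpoint(job_hours: float, preemptions_per_hour: float) -> float:
    """Expected wall-clock hours to complete job_hours of work when every
    preemption sends the job back to step 0 (Poisson preemptions assumed)."""
    if preemptions_per_hour == 0:
        return job_hours
    return math.expm1(preemptions_per_hour * job_hours) / preemptions_per_hour


def expected_hours_with_checkpoints(
    job_hours: float, preemptions_per_hour: float, interval_hours: float
) -> float:
    """Same model, but progress is saved every interval_hours, so only the
    current segment is redone. Checkpoint-write time is ignored here."""
    segments = job_hours / interval_hours
    return segments * expected_hours_no_checkpoint(interval_hours, preemptions_per_hour)


# Hypothetical run: 48 hours of work, about 3 preemptions per day.
rate = 3 / 24
restart_from_zero = expected_hours_no_checkpoint(48, rate)
checkpointed = expected_hours_with_checkpoints(48, rate, interval_hours=0.25)
print(f"restart from step 0: {restart_from_zero:,.0f} h")
print(f"checkpoint every 15 min: {checkpointed:.1f} h")
```

With these made-up inputs the restart-from-zero estimate comes out at roughly 3,200 hours (over four months), while the checkpointed estimate stays just under 49 hours. Real reclaim patterns are burstier than this model, so treat the output as an order-of-magnitude illustration.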
The survival checklist
- Checkpoint to durable storage on a regular cadence — object storage, not the instance disk that vanishes with the machine. See checkpointing to S3.
- Resume automatically — restart from the last complete checkpoint without a human in the loop (the PyTorch pattern).
- Fail over across capacity — if the same GPU class is being reclaimed repeatedly, resume on a different provider or class instead of re-entering the same fight.
- Escalate deliberately: for a deadline-critical run, on-demand capacity costs more per hour but can be cheaper than paying for the same spot hours three times.
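The first two items on the checklist can be sketched in a few dozen lines. This is a standard-library illustration, not a production recipe: `CKPT_DIR`, `CKPT_EVERY`, `TOTAL_STEPS`, and the placeholder "training step" are all assumptions made for the example. In a real PyTorch job, `state` would hold the model and optimizer state dicts, typically written with `torch.save`, and the checkpoint directory would be backed by object storage rather than local disk.

```python
import json
import os
import signal
import tempfile
from pathlib import Path

# Hypothetical location. In practice this should point at durable storage
# (for example a mounted bucket), not the instance's own disk.
CKPT_DIR = Path(os.environ.get("CKPT_DIR", "checkpoints"))
CKPT_EVERY = 100      # steps between checkpoints (assumed cadence)
TOTAL_STEPS = 1_000   # assumed length of the run

stop_requested = False


def install_sigterm_handler() -> None:
    """Assumes the platform sends SIGTERM before shutdown, which is common
    but not guaranteed. Must be called from the main thread."""
    def _on_sigterm(signum, frame):
        global stop_requested
        stop_requested = True
    signal.signal(signal.SIGTERM, _on_sigterm)


def save_checkpoint(step: int, state: dict) -> None:
    """Write to a temp file, then rename over 'latest.json', so a preemption
    mid-write never leaves a half-written checkpoint behind."""
    CKPT_DIR.mkdir(parents=True, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=str(CKPT_DIR), suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, CKPT_DIR / "latest.json")


def load_checkpoint() -> tuple:
    """Return (step, state) from the last complete checkpoint, or a fresh start."""
    path = CKPT_DIR / "latest.json"
    if not path.exists():
        return 0, {"loss": None}
    data = json.loads(path.read_text())
    return data["step"], data["state"]


def train() -> int:
    """Resume from the last checkpoint and run until done or asked to stop."""
    step, state = load_checkpoint()
    while step < TOTAL_STEPS:
        state["loss"] = 1.0 / (step + 1)  # placeholder for a real training step
        step += 1
        if step % CKPT_EVERY == 0 or stop_requested:
            save_checkpoint(step, state)
        if stop_requested:
            return step
    save_checkpoint(step, state)
    return step
```

To use it, call `install_sigterm_handler()` once at startup and then `train()`. Because `train()` always begins with `load_checkpoint()`, rerunning the same command after a preemption picks up where the last complete checkpoint left off, with no human in the loop.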
What VaultLayer automates
This checklist is exactly what fault-tolerant training on VaultLayer does by default: checkpoints sync durably as the job runs, preemption is detected, and the job resumes from the last checkpointed step on available capacity, whether on the same provider or another. Repeatedly unhealthy machines are excluded from re-provisioning. You submit once; the job finishes.
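If you are building failover yourself, the idea of steering away from capacity that keeps getting reclaimed can be sketched as below. This is a generic illustration and not a description of VaultLayer's internals; the pool names, the `pick_pool` helper, and the `max_recent` threshold are all hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def pick_pool(
    pools: Sequence[str],
    recent_preemptions: Sequence[str],
    max_recent: int = 2,
) -> Optional[str]:
    """Return the first pool, in preference order, that has been preempted
    fewer than max_recent times recently, or None if every pool is excluded."""
    counts = Counter(recent_preemptions)
    for pool in pools:
        if counts[pool] < max_recent:
            return pool
    return None


# Hypothetical preference order and preemption history.
pools = ["provider-a/h100", "provider-a/a100", "provider-b/h100"]
history = ["provider-a/h100", "provider-a/h100"]
print(pick_pool(pools, history))
```

Here the preferred pool has already been reclaimed twice, so the sketch falls through to the next one in the list. A real scheduler would also age out old history and check price and availability before resuming.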
Frequently asked questions
How much progress do I lose per preemption?
At most the work since the last durable checkpoint, so the checkpoint interval is the upper bound on the loss. With a cadence of a few minutes, that's a few minutes. With no checkpointing it's everything, which is why frequent preemption makes unprotected spot training effectively unusable.
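One common rule of thumb for choosing the cadence, which is not part of the checklist above, is Young's approximation: checkpoint roughly every sqrt(2 × checkpoint write time × mean time between preemptions). The sketch below applies it to made-up numbers. The function names are illustrative, and the "average loss is half the interval" estimate assumes preemptions are equally likely at any point between checkpoints.

```python
import math


def young_interval_minutes(
    checkpoint_write_minutes: float, mean_minutes_between_preemptions: float
) -> float:
    """Young's approximation for the checkpoint interval that balances
    time spent writing checkpoints against work lost to preemptions."""
    return math.sqrt(2 * checkpoint_write_minutes * mean_minutes_between_preemptions)


def loss_per_preemption_minutes(interval_minutes: float) -> tuple:
    """Return (worst_case, rough_average) minutes of work lost per preemption."""
    return interval_minutes, interval_minutes / 2


# Hypothetical job: 1-minute checkpoint writes, one preemption every 8 hours on average.
interval = young_interval_minutes(1.0, 8 * 60)
worst, average = loss_per_preemption_minutes(interval)
print(f"checkpoint every ~{interval:.0f} min; lose at most {worst:.0f}, about {average:.0f} on average")
```

For these inputs the suggested interval is about 31 minutes. Cheaper checkpoint writes or more frequent preemptions both push the interval shorter.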
Should I just switch to on-demand GPUs?
Sometimes. If your job is short or deadline-critical, on-demand can be the cheaper total. For long runs, checkpoint-and-resume on spot capacity usually wins — you keep the discount and stop paying for repeated work.
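A back-of-the-envelope comparison makes the trade-off concrete. Every number below is invented for illustration (the hourly prices, the job length, and the preemption counts are not quotes from any provider), and the helper names are hypothetical.

```python
def spot_cost(
    job_hours: float,
    spot_price: float,
    preemptions: int,
    lost_hours_per_preemption: float,
    restart_hours: float = 0.0,
) -> float:
    """Spot bill: useful hours plus the hours redone or spent restarting."""
    billed_hours = job_hours + preemptions * (lost_hours_per_preemption + restart_hours)
    return billed_hours * spot_price


def on_demand_cost(job_hours: float, on_demand_price: float) -> float:
    """On-demand bill: no preemptions, so no repeated work."""
    return job_hours * on_demand_price


# Hypothetical 10-hour job at $1.50/h spot versus $4.00/h on-demand.
no_checkpoints = spot_cost(10, 1.50, preemptions=2, lost_hours_per_preemption=10)
with_checkpoints = spot_cost(10, 1.50, preemptions=6, lost_hours_per_preemption=0.25)
baseline = on_demand_cost(10, 4.00)
print(f"spot, rerun from zero twice: ${no_checkpoints:.2f}")
print(f"spot, 15-minute checkpoints: ${with_checkpoints:.2f}")
print(f"on-demand:                   ${baseline:.2f}")
```

In this made-up case, paying for the same spot hours three times ($45.00) costs more than on-demand ($40.00), while checkpointed spot stays far cheaper ($17.25) even with six preemptions. The model leaves out the calendar time lost while waiting for capacity, which is what matters for deadline-critical runs.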
Keep every training job moving.
Sign up, install the CLI, and submit your first training job in minutes — on your own cloud or elastic GPU capacity.