
My training job keeps getting preempted — how do I finish it?

Preemption is the price of cheap GPUs: spot and interruptible capacity can be reclaimed by the provider at any time, sometimes with seconds of notice. You can't prevent it — the winning strategy is making preemption cost minutes instead of the whole run.

Why it keeps happening

Spot capacity is surplus — when paying demand returns, your instance is reclaimed. Reclaim rates vary by GPU class, region, and time of day; a hot GPU class (H100s in a popular region) can see multiple preemptions in a single day. If your job restarts from step 0 each time, it may literally never finish.

The survival checklist

What VaultLayer automates

This checklist is exactly what fault-tolerant training on VaultLayer does by default: checkpoints sync durably as the job runs, preemption is detected, and the job resumes from the last checkpointed step on available capacity, whether from the same provider or another. Machines that are repeatedly unhealthy are excluded from re-provisioning. You submit once; the job finishes.
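If you are wiring this up by hand, the checkpoint-and-resume pattern can be sketched roughly as follows. This is a minimal illustration in plain Python, not VaultLayer's implementation: the function names, the JSON state, the `hard_kill_at` test hook, and the local file standing in for durable remote storage are all assumptions made for the example. A real job would save model and optimizer state and copy each checkpoint off the machine.

```python
import json
import os
import signal
import tempfile
import threading

# Set when the provider sends SIGTERM ahead of a reclaim (not all providers do).
stop_requested = threading.Event()


def install_sigterm_handler():
    """Call once from the main thread so a graceful notice triggers a final save."""
    signal.signal(signal.SIGTERM, lambda signum, frame: stop_requested.set())


def save_checkpoint(path, state):
    """Write to a temp file, fsync, then rename, so a kill mid-write
    never leaves a half-written checkpoint at `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "loss_sum": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every=100, hard_kill_at=None):
    """Run (or resume) training. `hard_kill_at` is a hypothetical hook that
    simulates an abrupt reclaim: the function returns without saving."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        state["step"] += 1
        state["loss_sum"] += 1.0 / state["step"]  # placeholder for a real training step
        if state["step"] == hard_kill_at:
            return state["step"]  # instance vanished; nothing since the last save survives
        finished = state["step"] == total_steps
        if finished or stop_requested.is_set() or state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
        if stop_requested.is_set():
            break  # graceful notice: state was just saved, exit cleanly
    return state["step"]
```

With `checkpoint_every=100`, a run killed abruptly at step 250 leaves a checkpoint at step 200, and the next call to `train` with the same path picks up from there instead of from step 0.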

Frequently asked questions

How much progress do I lose per preemption?

At most the work since the last durable checkpoint. With a reasonable checkpoint cadence, that's minutes. With no checkpointing, it's everything, which is why frequent preemption makes unprotected spot training effectively unusable.
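To put rough numbers on this, the sketch below estimates lost work for a given checkpoint interval. It assumes preemptions land at a uniformly random point between checkpoints, so the average loss is half the interval and the worst case is the full interval. The last function uses Young's first-order approximation for a checkpoint interval, which is a standard rule of thumb brought in for illustration, not something VaultLayer is stated to use. All function names and inputs are hypothetical.

```python
import math


def expected_loss_minutes(checkpoint_interval_min):
    """Average work lost per preemption, assuming the preemption falls
    uniformly at random within a checkpoint interval."""
    return checkpoint_interval_min / 2


def worst_case_loss_minutes(checkpoint_interval_min):
    """Upper bound: preempted just before the next checkpoint would land."""
    return checkpoint_interval_min


def young_interval_minutes(checkpoint_cost_min, mean_time_between_preemptions_min):
    """Young's approximation: interval ~ sqrt(2 * checkpoint cost * mean time
    between interruptions). A rule of thumb, not an exact optimum."""
    return math.sqrt(2 * checkpoint_cost_min * mean_time_between_preemptions_min)
```

For example, if writing a checkpoint takes about 1 minute and preemptions arrive about every 8 hours (480 minutes), `young_interval_minutes(1, 480)` suggests checkpointing roughly every 31 minutes, for an average loss of around 15 minutes per preemption.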

Should I just switch to on-demand GPUs?

Sometimes. If your job is short or deadline-critical, on-demand can cost less in total. For long runs, checkpoint-and-resume on spot capacity usually wins: you keep the discount and stop paying for repeated work.
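A back-of-the-envelope comparison can make the trade-off concrete. The sketch below is a simplified cost model with made-up inputs: the prices, preemption count, and overheads are placeholders to replace with your own figures, and the function names are hypothetical.

```python
def on_demand_cost(compute_hours, price_per_hour):
    """On-demand: no interruptions, so you pay only for the useful hours."""
    return compute_hours * price_per_hour


def spot_cost_with_checkpoints(compute_hours, price_per_hour, preemptions,
                               lost_hours_each, restart_hours_each):
    """Spot with checkpoint-and-resume: useful hours plus, for each preemption,
    the redone work since the last checkpoint and the time to restart."""
    overhead_hours = preemptions * (lost_hours_each + restart_hours_each)
    return (compute_hours + overhead_hours) * price_per_hour


# Illustrative numbers only: a 100-hour job, $4.00/h on-demand vs $1.50/h spot,
# 20 preemptions, each costing 0.25 h of redone work and 0.25 h to restart.
on_demand_total = on_demand_cost(100, 4.00)                              # 400.0
spot_total = spot_cost_with_checkpoints(100, 1.50, 20, 0.25, 0.25)       # 165.0
```

Under these assumptions spot stays well ahead even with 20 interruptions. The model ignores wall-clock delay while waiting for capacity, which is what tends to matter for deadline-critical jobs.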

Keep every training job moving.

Sign up, install the CLI, and submit your first training job in minutes — on your own cloud or elastic GPU capacity.
