My training job keeps getting preempted — how do I finish it?

Preemption is the price of cheap GPUs: spot and interruptible capacity can be reclaimed by the provider at any time, sometimes with seconds of notice. You can't prevent it — the winning strategy is making preemption cost minutes instead of the whole run.

Why it keeps happening

Spot capacity is surplus — when paying demand returns, your instance is reclaimed. Reclaim rates vary by GPU class, region, and time of day; a hot GPU class (H100s in a popular region) can see multiple preemptions in a single day. If your job restarts from step 0 each time, it may literally never finish.
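To put a rough number on "may never finish", here is a back-of-the-envelope model. It assumes preemptions arrive as a Poisson process at a constant rate and ignores restart and checkpoint-write overhead; real reclaim behavior is burstier than this. The function names and the example figures (a 48-hour run, about three preemptions a day, 15-minute checkpoints) are illustrative, not measured.

```python
import math


def expected_hours_from_scratch(work_hours: float, preemptions_per_hour: float) -> float:
    """Expected wall-clock hours until one uninterrupted stretch of `work_hours`
    completes, when every preemption sends the job back to step 0.

    Under Poisson preemptions with rate lam this is (e^(lam * T) - 1) / lam.
    """
    lam = preemptions_per_hour
    return math.expm1(lam * work_hours) / lam


def expected_hours_with_checkpoints(
    work_hours: float, preemptions_per_hour: float, interval_hours: float
) -> float:
    """Same model, but a preemption only repeats the current checkpoint interval."""
    segments = work_hours / interval_hours
    return segments * expected_hours_from_scratch(interval_hours, preemptions_per_hour)


# Illustrative inputs: 48 hours of work, one preemption every 8 hours on average.
no_ckpt = expected_hours_from_scratch(48, 1 / 8)
with_ckpt = expected_hours_with_checkpoints(48, 1 / 8, 0.25)
print(f"restart from step 0: {no_ckpt:.0f} h, 15-minute checkpoints: {with_ckpt:.1f} h")
```

Under these assumptions the unprotected run is expected to take thousands of hours, while the checkpointed run finishes in under 49 hours, less than an hour over the ideal 48.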

The survival checklist

Checkpoint to durable storage on a regular cadence: use object storage, not the instance disk that vanishes with the machine. See checkpointing to S3.
Resume automatically: restart from the last complete checkpoint without a human in the loop (see the PyTorch pattern).
Fail over across capacity: if the same GPU class is being reclaimed repeatedly, resume on a different provider or class instead of re-entering the same fight.
Escalate deliberately: for a deadline-critical run, on-demand capacity costs more per hour but can be cheaper than paying for the same spot hours three times.
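The first two items on the checklist can be sketched in a framework-agnostic way. This is a minimal illustration, not the PyTorch pattern referenced above: the state is a plain dict saved as JSON, the checkpoint path is a local file standing in for durable storage, and `train`, `stop_after`, and the placeholder "loss" are hypothetical names invented for the example. In a real job you would serialize model and optimizer state with your framework's own save and load calls and then copy the file to object storage.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temp file, then rename, so a preemption mid-write
    never corrupts the last complete checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str) -> dict:
    """Return the last complete checkpoint, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "loss": None}
    with open(path) as f:
        return json.load(f)


def train(path: str, total_steps: int, checkpoint_every: int, stop_after=None) -> dict:
    """Resume from `path` and run to `total_steps`.

    `stop_after` simulates a preemption after that many steps in this process.
    """
    state = load_checkpoint(path)
    steps_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and steps_run >= stop_after:
            return state  # simulated preemption: in-memory progress is lost
        state["step"] += 1
        state["loss"] = 1.0 / state["step"]  # placeholder for a real training step
        steps_run += 1
        if state["step"] % checkpoint_every == 0 or state["step"] == total_steps:
            save_checkpoint(path, state)
    return state
```

Calling `train` again with the same path after an interruption picks up at the last saved step, so a supervisor only needs to relaunch the same command. The atomic rename matters because a half-written checkpoint is worse than an old one.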

What VaultLayer automates

This checklist is exactly what fault-tolerant training on VaultLayer does by default: checkpoints sync durably as the job runs, preemption is detected, and the job resumes from the last step on available capacity — the same provider or another. Repeatedly unhealthy machines are excluded from re-provisioning. You submit once; the job finishes.

Frequently asked questions

How much progress do I lose per preemption?

At most the work since the last durable checkpoint. With a sane cadence that's minutes. With no checkpointing it's everything — which is why frequent preemption makes unprotected spot training effectively unusable.
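One common rule of thumb for picking a cadence is Young's approximation, which balances time spent writing checkpoints against work lost to interruptions: interval ≈ sqrt(2 × checkpoint cost × mean time between preemptions). The sketch below assumes you can estimate both inputs; the 30-second write time and the 8-hour mean time between preemptions are made-up example values, and the function name is illustrative.

```python
import math


def young_interval_seconds(checkpoint_cost_s: float, mean_time_between_preemptions_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval that
    minimizes expected wasted time (checkpoint writes plus redone work)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mean_time_between_preemptions_s)


# Example values (assumptions): 30 s per checkpoint, one preemption every 8 h on average.
interval_s = young_interval_seconds(30, 8 * 3600)
print(f"checkpoint roughly every {interval_s / 60:.0f} minutes")
```

The worst-case loss per preemption is one full interval. If preemptions are equally likely at any point in the interval, the average loss is about half of it.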

Should I just switch to on-demand GPUs?

Sometimes. If your job is short or deadline-critical, on-demand can be the cheaper total. For long runs, checkpoint-and-resume on spot capacity usually wins — you keep the discount and stop paying for repeated work.
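The trade-off comes down to simple arithmetic: spot wins as long as the discount is larger than the amount of work you repeat. The sketch below uses a hypothetical "rework factor" (total hours billed divided by useful hours) and made-up hourly prices; substitute your own numbers.

```python
def spot_cost(useful_hours: float, spot_rate: float, rework_factor: float) -> float:
    """Total spot bill when preemptions force `rework_factor` times the useful hours."""
    return useful_hours * rework_factor * spot_rate


def on_demand_cost(useful_hours: float, on_demand_rate: float) -> float:
    """Total on-demand bill, assuming no interruptions and no repeated work."""
    return useful_hours * on_demand_rate


def breakeven_rework_factor(spot_rate: float, on_demand_rate: float) -> float:
    """Rework factor at which spot and on-demand cost the same."""
    return on_demand_rate / spot_rate


# Hypothetical prices: spot at 1.00 per GPU-hour, on-demand at 2.50 per GPU-hour.
hours = 100
unprotected = spot_cost(hours, 1.00, rework_factor=3.0)    # same hours paid three times
protected = spot_cost(hours, 1.00, rework_factor=1.02)     # about 2% redone after resumes
baseline = on_demand_cost(hours, 2.50)
print(unprotected, protected, baseline, breakeven_rework_factor(1.00, 2.50))
```

With these example prices, spot stays cheaper until you are paying for more than 2.5 times the useful hours. Paying for the same hours three times crosses that line, while checkpoint-and-resume keeps the factor close to 1.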
