How checkpoint & resume works on interruptible GPUs
Spot and interruptible GPUs can be reclaimed at any moment. Checkpoint-and-resume is the technique that makes them safe for long training runs: training state is written to durable storage as the job progresses, and after an interruption the job restarts from the last saved step instead of from scratch.
Checkpointing: saving progress
A checkpoint captures the state needed to continue training — model weights, optimizer state, and the current step. Saving it to durable storage (object storage, not just local disk) on a regular, health-gated cadence means that if the host disappears, the latest checkpoint is still safe and complete.
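As a rough illustration of "safe and complete", the sketch below writes a checkpoint to a temporary file and renames it into place only after the bytes are flushed. It is a minimal stdlib-only example, not VaultLayer's implementation: the `save_checkpoint` name, the `step-<number>.json` file layout, and the JSON payload are assumptions made for illustration, and a local directory stands in for durable storage (uploading to object storage is not shown).

```python
import json
import os
import tempfile


def save_checkpoint(directory, step, state):
    """Write `state` for `step` so readers never see a half-written file.

    Hypothetical helper: the file naming and JSON format are illustrative only.
    """
    os.makedirs(directory, exist_ok=True)
    final_path = os.path.join(directory, f"step-{step:08d}.json")
    # Write to a temporary file in the same directory first.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump({"step": step, "state": state}, f)
            f.flush()
            os.fsync(f.fileno())  # force the bytes to disk before publishing
        # Atomic rename: the final name appears only once the file is complete.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

A real training job would serialize model weights and optimizer state with its framework's own save routine; the write-then-rename pattern is the part this sketch is meant to show.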
Resume: picking up where it stopped
On interruption, a new host is provisioned, the latest checkpoint is restored, and training continues from that step. Two details matter most: the restore path must work across framework versions, and the job must resume only from a checkpoint that finished writing. A partially written checkpoint is worse than none, because restoring it can fail outright or silently load corrupted state.
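The "only resume from a finished checkpoint" rule can be sketched as a search that tries the newest candidate first and falls back to older ones when a file is unreadable. This is an illustrative stdlib-only example under stated assumptions, not VaultLayer's restore logic: it assumes checkpoints are files named `step-<number>.json`, each holding a JSON object with `step` and `state` keys, and the `load_latest_checkpoint` name is hypothetical.

```python
import json
import os
import re

# Hypothetical naming scheme: step-00000042.json
CHECKPOINT_NAME = re.compile(r"^step-(\d+)\.json$")


def load_latest_checkpoint(directory):
    """Return (step, state) from the newest readable checkpoint, or None."""
    if not os.path.isdir(directory):
        return None
    candidates = []
    for name in os.listdir(directory):
        match = CHECKPOINT_NAME.match(name)
        if match:  # anything else, such as leftover *.tmp files, is ignored
            candidates.append((int(match.group(1)), name))
    # Try the newest step first; fall back to older ones if a file is damaged.
    for step, name in sorted(candidates, reverse=True):
        try:
            with open(os.path.join(directory, name)) as f:
                payload = json.load(f)
        except (OSError, ValueError):
            continue  # truncated or unreadable file: skip it
        if (
            isinstance(payload, dict)
            and payload.get("step") == step
            and "state" in payload
        ):
            return step, payload["state"]
    return None
```

Returning `None` lets the caller start from step 0 when no usable checkpoint exists, rather than failing on a damaged file.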
How VaultLayer handles it
VaultLayer checkpoints your run automatically and resumes from the last good step on available compute — the same provider, or another when capacity runs out. Integration is usually auto-inserted on your first run; for Hugging Face Trainer it's a single resume argument, and for PyTorch or JAX it's about three lines. Failover is on by default. See fault-tolerant training for the end-to-end flow.
Frequently asked questions
How often should training checkpoint?
Often enough that a reclaim costs minutes, not hours — but not so often that checkpoint I/O dominates. VaultLayer saves on a health-gated cadence and resumes only from a checkpoint that finished writing.
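One common way to put numbers on this trade-off is Young's first-order approximation, which sets the interval to the square root of twice the checkpoint write time multiplied by the mean time between interruptions. The sketch below applies that formula; it is a general rule of thumb, not VaultLayer's cadence, and the function names and the example figures (a 30-second write, one reclaim every 4 hours on average) are assumptions chosen for illustration.

```python
import math


def young_checkpoint_interval(checkpoint_seconds, mean_seconds_between_interruptions):
    """Young's approximation: interval = sqrt(2 * write_time * mean_time_between_interruptions)."""
    if checkpoint_seconds <= 0 or mean_seconds_between_interruptions <= 0:
        raise ValueError("both inputs must be positive")
    return math.sqrt(2.0 * checkpoint_seconds * mean_seconds_between_interruptions)


def overhead_fraction(interval_seconds, checkpoint_seconds, mean_seconds_between_interruptions):
    """First-order estimate of lost time: checkpoint writes plus expected rework."""
    write_cost = checkpoint_seconds / interval_seconds
    # On average, an interruption loses half an interval of work.
    rework_cost = interval_seconds / (2.0 * mean_seconds_between_interruptions)
    return write_cost + rework_cost


# Hypothetical figures: 30 s per checkpoint, one reclaim every 4 hours on average.
interval = young_checkpoint_interval(30, 4 * 3600)
overhead = overhead_fraction(interval, 30, 4 * 3600)
```

With these example inputs the interval comes out at roughly 930 seconds, or about 15.5 minutes, with an estimated overhead of about 6.5%. Checkpointing much more or much less often raises the estimated overhead under this model.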
Does resume work across different providers?
Yes. Because checkpoints are stored durably, VaultLayer can resume on a different GPU or provider when the original capacity is gone — not just on the same machine.
Keep every training job moving.
VaultLayer is in invite-only early access for teams running real GPU workloads.