
How to resume PyTorch training after a crash

If training dies partway — a spot reclaim, a host crash, or an out-of-memory error — you don't want to start over. The fix is checkpointing the right state and restoring it on restart. Here's how to do it by hand in PyTorch, and how to skip the boilerplate entirely.

1. Save the right state, atomically

A checkpoint needs enough state to continue: model weights, optimizer state, and the index of the last completed step (add the LR scheduler state if you use one). Write it to a temp file and then rename that file over the final path, so a process killed mid-write never leaves a corrupt file that looks like the latest checkpoint:

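import os
import torch

# Everything needed to continue: weights, optimizer state, last completed step.
checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "step": step,
}
tmp = path + ".tmp"     # temp file next to the final path, on the same filesystem
torch.save(checkpoint, tmp)
os.replace(tmp, path)   # atomic rename: never a half-written "latest"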
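
The temp-file-and-rename pattern above can be wrapped in a small reusable helper. The following is a minimal sketch, not part of PyTorch or VaultLayer: the name atomic_write and its write_fn callback are hypothetical. It also calls os.fsync before the rename, an extra step beyond the snippet above, on the assumption that you want the bytes on disk before the new file becomes visible.

```python
import os
import tempfile


def atomic_write(path, write_fn):
    """Call write_fn(file_obj) on a temp file, then atomically rename it to path.

    Hypothetical helper for illustration; write_fn receives a binary file object.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target directory so the rename stays on
    # one filesystem, which is what lets os.replace swap it in atomically.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            write_fn(f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp, path)
    except BaseException:
        # Remove the partial temp file; any previous checkpoint is untouched.
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

Since torch.save accepts a file-like object as well as a path, a call such as atomic_write(path, lambda f: torch.save(checkpoint, f)) should slot into the training loop in place of the three lines above.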

Checkpoint on a regular cadence so an interruption costs minutes, not hours.

2. Restore on startup

On restart, load the latest complete checkpoint and continue the loop from the saved step instead of zero:

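start_step = 0
if os.path.exists(path):
    # map_location="cpu" avoids depending on the GPU layout of the crashed host.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_step = ckpt["step"] + 1   # the saved step already finished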

for step in range(start_step, total_steps):
    ...
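
Save and restore only work as a pair if both sides agree on the same keys, and that gets harder to keep in sync once an LR scheduler is added. One way to keep them aligned is a pair of helpers that treat every component uniformly. This is a sketch under the assumption that each component exposes state_dict() and load_state_dict(), as PyTorch modules, optimizers, and LR schedulers do; the names pack_state and restore_state are hypothetical.

```python
def pack_state(step, **components):
    """Collect state_dict() from each named component, plus the step counter."""
    state = {name: obj.state_dict() for name, obj in components.items()}
    state["step"] = step  # index of the last completed step
    return state


def restore_state(ckpt, **components):
    """Load each component's saved state and return the step to resume from."""
    for name, obj in components.items():
        if name not in ckpt:
            # Fail loudly instead of silently resuming with fresh state.
            raise KeyError(f"checkpoint has no state for {name!r}")
        obj.load_state_dict(ckpt[name])
    return ckpt["step"] + 1  # the saved step already finished
```

With these, pack_state(step, model=model, optimizer=optimizer, scheduler=scheduler) produces the same "model", "optimizer", and "step" keys used above plus a "scheduler" entry, and start_step = restore_state(ckpt, model=model, optimizer=optimizer, scheduler=scheduler) mirrors it on startup.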

Store the checkpoint on durable storage (object storage, not just local disk) so it survives the host disappearing.

3. Or let VaultLayer handle it

The boilerplate above is the same on every project — and easy to get subtly wrong. VaultLayer does it for you: vl run python train.py checkpoints your run and resumes from the last step on available compute if the host fails, with failover on by default. The integration is usually auto-inserted on your first run; for Hugging Face Trainer it's a single resume argument. See how checkpoint & resume works on interruptible GPUs.

Frequently asked questions

How often should I checkpoint?

Often enough that a crash costs minutes rather than hours, but not so often that checkpoint I/O dominates training time. Tie it to a step interval and save to durable storage.
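
As a rough way to turn that trade-off into a number, you can pick the smallest step interval that keeps save time under a chosen fraction of compute time. The sketch below assumes you have measured the average step time and checkpoint save time yourself; the function names and the 2% default budget are illustrative choices, not recommendations from any library.

```python
import math


def checkpoint_interval(step_seconds, save_seconds, max_overhead=0.02):
    """Smallest step interval that keeps save time within max_overhead of compute time."""
    if step_seconds <= 0 or save_seconds < 0 or not 0 < max_overhead <= 1:
        raise ValueError("need step_seconds > 0, save_seconds >= 0, 0 < max_overhead <= 1")
    # save_seconds / (interval * step_seconds) <= max_overhead, solved for interval.
    return max(1, math.ceil(save_seconds / (max_overhead * step_seconds)))


def worst_case_loss_seconds(interval, step_seconds):
    """Compute time lost if the crash lands just before the next checkpoint."""
    return interval * step_seconds
```

For example, with 0.5-second steps, a 30-second save, and a 2% budget, this works out to roughly 3000 steps between checkpoints, or about 25 minutes of work at risk in the worst case. If that is more than you are willing to lose, raise the budget or make saves cheaper.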

Why did my training resume from step 0?

Usually because the checkpoint was missing or only partially written when the run restarted, so the resume path fell back to a fresh start, or because the saved step counter was never restored. Write checkpoints atomically (temp file + rename) and resume only from the latest complete one.
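
If you keep several checkpoints rather than a single file, "latest complete" has to be decided at startup. A minimal sketch, assuming a hypothetical naming scheme of step_<N>.pt in one directory, with in-progress writes carrying a .tmp suffix as in the save step:

```python
import os
import re

# Hypothetical naming scheme: step_<N>.pt; anything else (including .tmp) is ignored.
_CKPT_RE = re.compile(r"^step_(\d+)\.pt$")


def latest_checkpoint(directory):
    """Return (step, path) for the highest-numbered complete checkpoint, or None."""
    if not os.path.isdir(directory):
        return None
    best = None
    for name in os.listdir(directory):
        match = _CKPT_RE.match(name)
        if match is None:
            continue  # skips step_123.pt.tmp and unrelated files
        path = os.path.join(directory, name)
        if os.path.getsize(path) == 0:
            continue  # a zero-byte file cannot be a usable checkpoint
        step = int(match.group(1))  # compare numerically, not as strings
        if best is None or step > best[0]:
            best = (step, path)
    return best
```

A return value of None means there is nothing to resume from, so starting at step 0 is correct; logging which case you hit at startup makes an unexpected restart from zero much easier to diagnose.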
