
How to checkpoint training to S3

A checkpoint on the instance's local disk dies with the instance — and on spot capacity, instances die. Durable checkpointing means every save lands in object storage (S3, GCS, R2) so a new machine can pick up where the old one stopped.

The pattern

  1. Save atomically on local disk first — write to a temp file, then rename. Fast, and never leaves a half-written file that looks complete.
  2. Upload in the background — push the finished file to S3 while training continues (a thread or a sync tool), so the GPU isn't idle during upload.
  3. Mark the latest — only after the upload completes, update a small "latest" pointer (a key like latest.json). Resume logic reads the pointer, never a possibly-mid-upload file.
  4. Prune old checkpoints — keep the last N so storage doesn't grow unbounded.
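The four steps above can be sketched in Python roughly as follows. This is a minimal illustration, not VaultLayer's implementation. It assumes `s3` is a boto3-style S3 client (for example `boto3.client("s3")`), and the key layout (`<prefix>/step-XXXXXXXX.ckpt`), the pointer format (`{"key": ..., "step": ...}`), and all function names are hypothetical. The checkpoint is passed as bytes to keep the sketch dependency-free; with PyTorch you would write the temp file with `torch.save` instead.

```python
import json
import os
import tempfile
import threading


def save_atomically(data: bytes, path: str) -> None:
    """Step 1: write to a temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes are on disk before renaming
        os.replace(tmp_path, path)  # atomic when both paths are on one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def upload_then_mark(s3, bucket, prefix, path, step, keep_last=3):
    """Steps 2-4: upload the file, then update the pointer, then prune."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    key = f"{prefix}/step-{step:08d}.ckpt"
    s3.upload_file(path, bucket, key)

    # Only reached if the upload returned without raising.
    pointer = json.dumps({"key": key, "step": step}).encode()
    s3.put_object(Bucket=bucket, Key=f"{prefix}/latest.json", Body=pointer)

    # Zero-padded step numbers make lexicographic order match numeric order.
    # list_objects_v2 returns at most 1000 keys per call; a long run would
    # need pagination, which this sketch omits.
    listing = s3.list_objects_v2(Bucket=bucket, Prefix=f"{prefix}/step-")
    keys = sorted(obj["Key"] for obj in listing.get("Contents", []))
    for old_key in keys[:-keep_last]:
        s3.delete_object(Bucket=bucket, Key=old_key)


def upload_in_background(s3, bucket, prefix, path, step, keep_last=3):
    """Run upload_then_mark on a worker thread so training can continue."""
    worker = threading.Thread(
        target=upload_then_mark,
        args=(s3, bucket, prefix, path, step, keep_last),
    )
    worker.start()
    return worker
```

In a training loop you would call `save_atomically` at each checkpoint, hand the path to `upload_in_background`, and join the returned thread before the process exits so the final upload is not cut off. Giving each step its own local filename avoids overwriting a file that is still being uploaded.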

The failure mode this prevents: an instance dies mid-upload and your resume path loads a truncated checkpoint. The pointer-after-upload step is what makes the scheme safe.
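On the resume side, the same idea might look like the sketch below: read the pointer first, then fetch only the object it names. It assumes a boto3-style S3 client passed in as `s3`, and a hypothetical pointer at `<prefix>/latest.json` containing `{"key": ..., "step": ...}`; the function name is illustrative.

```python
import json
import os


def download_latest(s3, bucket, prefix, dest_dir):
    """Fetch the checkpoint named by the pointer.

    Returns (local_path, step), or None when no pointer exists yet
    (a fresh run with nothing to resume from).
    """
    try:
        response = s3.get_object(Bucket=bucket, Key=f"{prefix}/latest.json")
    except s3.exceptions.NoSuchKey:
        return None

    pointer = json.loads(response["Body"].read())

    # Download to a temporary name and rename, so a crash during the
    # download never leaves a partial file under the final name.
    local_path = os.path.join(dest_dir, os.path.basename(pointer["key"]))
    partial_path = local_path + ".part"
    s3.download_file(bucket, pointer["key"], partial_path)
    os.replace(partial_path, local_path)
    return local_path, pointer["step"]
```

Because the function never lists the bucket and picks the newest object, a checkpoint that was uploaded but never marked is simply ignored.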

What VaultLayer does for you

VaultLayer runs this pipeline automatically: your job's checkpoints sync to durable storage as training runs, health-gated so only complete saves count, and resume pulls the latest good state onto the replacement machine. On BYOC, checkpoints go to your own bucket — S3, GCS, or Azure Blob — connected with vl connect storage.

If you're wiring it yourself

Combine the atomic-save snippet from the guide on resuming PyTorch after a crash with a background uploader, then test the ugly path: kill the process mid-upload and confirm that resume still loads a complete checkpoint. That test is the difference between a checkpointing system and a checkpointing hope.
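A fast, in-process version of that test could look like the sketch below. It is a simulation under stated assumptions, not a substitute for a real `kill -9` during an upload: `FlakyStore` is a hypothetical in-memory stand-in for a bucket that leaves a truncated object behind when told to fail, and `publish` and `resume` are illustrative names for the upload-then-mark and pointer-first resume logic.

```python
import json


class FlakyStore:
    """In-memory stand-in for a bucket that can fail partway through a put."""

    def __init__(self):
        self.objects = {}
        self.die_on_next_put = False

    def put(self, key, data):
        if self.die_on_next_put:
            # Worst case: half the bytes are visible under the real key.
            self.objects[key] = data[: len(data) // 2]
            raise ConnectionError("simulated instance death mid-upload")
        self.objects[key] = data


def publish(store, step, data):
    """Upload the checkpoint, and only then update the pointer."""
    key = f"step-{step:08d}.ckpt"
    store.put(key, data)
    pointer = json.dumps({"key": key, "step": step}).encode()
    store.put("latest.json", pointer)


def resume(store):
    """Load whatever the pointer names, never the newest-looking object."""
    pointer = json.loads(store.objects["latest.json"])
    return pointer["step"], store.objects[pointer["key"]]


def run_kill_test():
    store = FlakyStore()
    publish(store, 1, b"A" * 1000)  # a complete, marked checkpoint

    store.die_on_next_put = True
    try:
        publish(store, 2, b"B" * 1000)  # dies during the checkpoint upload
    except ConnectionError:
        pass

    return store, resume(store)
```

After `run_kill_test()`, the store holds a truncated `step-00000002.ckpt`, but `resume` still returns step 1 with all 1000 bytes because the pointer was never advanced. If you swap the two `put` calls in `publish`, the same test should start failing, which is a useful check that the test itself works.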

Frequently asked questions

How often should checkpoints upload to S3?

Every checkpoint. The local save is your fast path; the upload is what survives. If upload time becomes a bottleneck, checkpoint less often or save adapter-only state (LoRA) — don't skip durability.
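To judge whether upload time is likely to become a bottleneck, a back-of-the-envelope estimate can help. The sketch below is illustrative only: the function names are hypothetical, and it assumes the upload runs at the full uplink rate, which real transfers usually do not reach.

```python
def upload_seconds(checkpoint_bytes, uplink_bits_per_second):
    """Idealized upload time: size in bits divided by link speed."""
    return checkpoint_bytes * 8 / uplink_bits_per_second


def uploads_keep_up(checkpoint_bytes, uplink_bits_per_second, interval_seconds):
    """True if one upload finishes before the next checkpoint is due."""
    return upload_seconds(checkpoint_bytes, uplink_bits_per_second) < interval_seconds
```

With made-up numbers, a 14 GB full checkpoint over a 1 Gbit/s link takes about 112 seconds at best, so a 10-minute checkpoint interval leaves headroom while a 1-minute interval does not. A 200 MB adapter-only checkpoint on the same link takes under 2 seconds.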

Can I checkpoint to my own bucket instead of the platform's?

On VaultLayer, yes — BYOC jobs checkpoint to your own S3/GCS/Azure bucket via vl connect storage. Managed jobs use managed storage with job-scoped credentials.
