How to checkpoint training to S3
A checkpoint on the instance's local disk dies with the instance — and on spot capacity, instances die. Durable checkpointing means every save lands in object storage (S3, GCS, R2) so a new machine can pick up where the old one stopped.
The pattern
- Save atomically on local disk first: write to a temp file, then rename. Fast, and never leaves a half-written file that looks complete.
- Upload in the background: push the finished file to S3 while training continues (a thread or a sync tool), so the GPU isn't idle during upload.
- Mark the latest: only after the upload completes, update a small "latest" pointer (a key like latest.json). Resume logic reads the pointer, never a possibly-mid-upload file.
- Prune old checkpoints: keep the last N so storage doesn't grow unbounded.
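The save side of these steps can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the step-XXXXXXXX.pt key layout, the pointer fields, and the keep_last parameter are all assumptions made for the example. The s3 argument is expected to behave like a boto3 S3 client (for instance boto3.client("s3")); passing it in keeps the sketch free of a hard dependency and easy to test with a fake.

```python
import json
import os
import tempfile


def atomic_write(path, data):
    """Write bytes to `path` via temp file + rename, so `path` is never partial."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # get the bytes onto disk before renaming
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def publish(s3, bucket, prefix, local_path, step, keep_last=3):
    """Upload one finished checkpoint, then move the pointer, then prune."""
    assert keep_last >= 1
    key = f"{prefix}/step-{step:08d}.pt"  # hypothetical key layout
    s3.upload_file(local_path, bucket, key)  # raises if the upload fails

    # Only reached after a successful upload: point "latest" at the new object.
    pointer = {"key": key, "step": step, "size": os.path.getsize(local_path)}
    s3.put_object(
        Bucket=bucket,
        Key=f"{prefix}/latest.json",
        Body=json.dumps(pointer).encode("utf-8"),
    )

    # Prune: zero-padded step numbers sort lexicographically in step order.
    # list_objects_v2 returns at most 1000 keys per call; a long-lived prefix
    # would need pagination, which this sketch leaves out.
    listing = s3.list_objects_v2(Bucket=bucket, Prefix=f"{prefix}/step-")
    keys = sorted(obj["Key"] for obj in listing.get("Contents", []))
    stale = keys[:-keep_last]
    if stale:
        s3.delete_objects(
            Bucket=bucket, Delete={"Objects": [{"Key": k} for k in stale]}
        )
    return key
```

In a training loop, the bytes handed to atomic_write would typically be a serialized state dict, and publish would run off the training thread. The ordering inside publish is the important part: if upload_file raises, latest.json is never touched.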
The failure mode this prevents: an instance dies mid-upload and your resume path loads a truncated checkpoint. The pointer-after-upload step is what makes the scheme safe.
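On the resume side, the same idea might look like the sketch below. It assumes latest.json holds a small JSON object with key, step, and size fields; that layout, like the function name fetch_latest, is hypothetical. The s3 argument is again assumed to be a boto3-style S3 client.

```python
import json
import os


def fetch_latest(s3, bucket, prefix, dest_dir):
    """Download the checkpoint named by latest.json.

    Returns (local_path, step), or None when no pointer exists yet.
    """
    try:
        resp = s3.get_object(Bucket=bucket, Key=f"{prefix}/latest.json")
    except s3.exceptions.NoSuchKey:
        return None  # first run: nothing has been published

    # Assumed pointer layout: {"key": ..., "step": ..., "size": ...}
    pointer = json.loads(resp["Body"].read())

    local_path = os.path.join(dest_dir, os.path.basename(pointer["key"]))
    s3.download_file(bucket, pointer["key"], local_path)

    # Cheap sanity check against the size recorded at publish time.
    actual = os.path.getsize(local_path)
    if actual != pointer["size"]:
        raise IOError(
            f"{local_path}: expected {pointer['size']} bytes, got {actual}"
        )
    return local_path, pointer["step"]
```

Only a missing pointer is treated as "start from scratch". Any other error propagates, so a transient network failure is not mistaken for an empty bucket and does not silently restart training from step zero.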
What VaultLayer does for you
VaultLayer runs this pipeline automatically: your job's checkpoints sync to durable storage as training runs, health-gated so only complete saves count, and resume pulls the latest good state onto the replacement machine. On BYOC, checkpoints go to your own bucket — S3, GCS, or Azure Blob — connected with vl connect storage.
If you're wiring it yourself
Combine the atomic-save snippet from "Resuming PyTorch after a crash" with a background uploader, and test the ugly path: kill the process mid-upload and confirm resume still loads a complete checkpoint. That test is the difference between a checkpointing system and a checkpointing hope.
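One possible shape for the background uploader is a single worker thread fed by a queue, sketched below. The class name, the publish_fn callback, and the in-memory "bucket" used by the demo are all illustrative assumptions. The demo function is only an in-process stand-in for the kill test; it does not replace actually killing a process mid-upload.

```python
import queue
import threading


class BackgroundUploader:
    """Runs publish jobs on one worker thread so the training loop never waits."""

    def __init__(self, publish_fn):
        self._publish = publish_fn  # hypothetical: callable(local_path, step)
        self._jobs = queue.Queue()
        self.errors = []  # (job, exception) pairs; read after close()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, local_path, step):
        """Queue a finished local checkpoint; returns immediately."""
        self._jobs.put((local_path, step))

    def _run(self):
        while True:
            job = self._jobs.get()
            try:
                if job is None:  # sentinel from close()
                    return
                self._publish(*job)
            except Exception as exc:
                self.errors.append((job, exc))
            finally:
                self._jobs.task_done()

    def close(self):
        """Drain queued jobs, then stop the worker."""
        self._jobs.put(None)
        self._worker.join()


def crash_mid_upload_demo():
    """Simulate an upload that dies halfway through checkpoint 2."""
    store = {}  # in-memory stand-in for a bucket: key -> bytes

    def publish(path, step):
        data = f"weights-{step}".encode()
        key = f"step-{step}.pt"
        store[key] = data[: len(data) // 2]  # partial object appears first
        if step == 2:
            raise ConnectionError("instance died mid-upload")
        store[key] = data  # upload completes
        store["latest"] = key.encode()  # pointer moves last

    uploader = BackgroundUploader(publish)
    uploader.submit("ckpt-1.pt", 1)
    uploader.submit("ckpt-2.pt", 2)
    uploader.close()
    return store, uploader.errors
```

After the demo runs, the pointer still names the first checkpoint even though a truncated second object exists in the store, which is the property the real kill test should confirm. Because uploads run concurrently with training, each checkpoint needs its own local filename so a later save cannot overwrite a file that is still being uploaded.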
Frequently asked questions
How often should checkpoints upload to S3?
Every checkpoint. The local save is your fast path; the upload is what survives. If upload time becomes a bottleneck, checkpoint less often or save adapter-only state (LoRA) — don't skip durability.
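To judge whether upload time is actually a bottleneck, a back-of-the-envelope estimate is often enough. The helper below is a rough sketch: the function names and the sizes, link speed, and interval in the example are hypothetical, and the formula ignores protocol overhead and contention on the link.

```python
def upload_seconds(checkpoint_gb, uplink_gbit_per_s):
    """Rough transfer time: gigabytes * 8 bits per byte / gigabits per second."""
    return checkpoint_gb * 8 / uplink_gbit_per_s


def upload_keeps_up(checkpoint_gb, uplink_gbit_per_s, interval_s):
    """True when one upload finishes before the next checkpoint is due."""
    return upload_seconds(checkpoint_gb, uplink_gbit_per_s) < interval_s


# Hypothetical numbers: a 14 GB full checkpoint and a 0.2 GB adapter-only one,
# both over a 5 Gbit/s link, with a checkpoint every 10 minutes.
full = upload_seconds(14, 5)
adapter = upload_seconds(0.2, 5)
ok = upload_keeps_up(14, 5, interval_s=600)
```

If upload_keeps_up returns False for your own measured numbers, uploads will queue up behind training, and that is the point at which a longer interval or adapter-only saves become worth considering.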
Can I checkpoint to my own bucket instead of the platform's?
On VaultLayer, yes — BYOC jobs checkpoint to your own S3/GCS/Azure bucket via vl connect storage. Managed jobs use managed storage with job-scoped credentials.