
Run PyTorch training on cloud GPUs

VaultLayer runs your existing PyTorch training script on cloud GPUs with no code changes — single-GPU or distributed — and adds checkpointing and automatic recovery so a reclaimed or crashed GPU never costs you the run.

Run your script as-is

There's no SDK to adopt and no decorators to add. Point vl run at the command you already use:

vl run python train.py
vl run --gpu H100 python train.py     # pin a GPU class

The default training image ships PyTorch and CUDA, so most scripts run with no setup. Need extra packages? Drop a requirements.txt next to your script and the packages it lists are installed before training starts, or bring your own image with --image.

Single-GPU to multi-node

Standard PyTorch distributed works the same way. Single-GPU jobs need nothing special; for DDP and FSDP across multiple GPUs or machines, see distributed multi-node training — VaultLayer provisions the group, wires the networking, and recovers the whole job on failure.

Checkpoint and resume

Add checkpoint integration once, and VaultLayer resumes your run from the last step if a host is reclaimed or crashes. The integration is usually auto-inserted on your first run. The full pattern is in how to resume PyTorch training after a crash, and fault-tolerant training covers the end-to-end recovery flow.
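
To make the idea concrete, here is a generic save-and-resume loop. It is a minimal sketch, not VaultLayer's integration (which this page doesn't show): the step_<n>.pt naming, the helper names, and the toy training state are all assumptions. pickle stands in for the serializer so the sketch runs anywhere.

```python
import os
import pickle
import re
import tempfile
from pathlib import Path

CKPT_PATTERN = re.compile(r"^step_(\d+)\.pt$")  # hypothetical naming scheme


def save_checkpoint(state, ckpt_dir, step, dump=pickle.dump):
    """Write the state for `step` atomically so a crash never leaves a partial file."""
    ckpt_dir = Path(ckpt_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final_path = ckpt_dir / f"step_{step}.pt"
    # Write to a temp file in the same directory, then rename it over the target.
    fd, tmp_name = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_name, final_path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return final_path


def latest_checkpoint(ckpt_dir):
    """Return (step, path) for the highest-numbered checkpoint, or None if none exist."""
    ckpt_dir = Path(ckpt_dir)
    if not ckpt_dir.is_dir():
        return None
    found = []
    for path in ckpt_dir.iterdir():
        match = CKPT_PATTERN.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return max(found, key=lambda item: item[0]) if found else None


def train(ckpt_dir, total_steps, save_every=2, load=pickle.load, dump=pickle.dump):
    """Toy loop: start from the latest checkpoint if there is one, else from step 0."""
    state = {"step": 0, "loss_sum": 0.0}
    latest = latest_checkpoint(ckpt_dir)
    if latest is not None:
        with open(latest[1], "rb") as f:
            state = load(f)
    for step in range(state["step"] + 1, total_steps + 1):
        state["loss_sum"] += 1.0 / step  # stand-in for a real training step
        state["step"] = step
        if step % save_every == 0:
            save_checkpoint(state, ckpt_dir, step, dump=dump)
    return state
```

In a real PyTorch script, the state would hold model.state_dict() and optimizer.state_dict(), and you could pass dump=torch.save and load=torch.load, since both accept an open binary file. Writing to a temporary file and renaming it means an interruption mid-save leaves the previous checkpoint intact.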

Frequently asked questions

Do I need to change my PyTorch code?

No. vl run python train.py wraps the command you already use; your PyTorch script runs unchanged. Checkpoint integration is optional and usually auto-inserted.

Does it support DDP and FSDP?

Yes. Standard PyTorch DDP and FSDP scripts run on VaultLayer, single-node or across nodes — see the distributed multi-node training page.

Keep every training job moving.

VaultLayer is in invite-only early access for teams running real GPU workloads.
