Invite-only beta · $25 free credit

Reliable AI training
on affordable GPUs.

VaultLayer makes jobs on affordable, interruptible GPUs finish. When a GPU crashes or gets reclaimed mid-run, your job auto-resumes from its last checkpoint: all the savings, none of the lost work. Start a job, walk away, come back to a finished model.

Get early access · See how it works
Built for any long-running GPU job
Fine-tuning · Batch inference · Synthetic data · Hyperparameter sweeps · Evals · Embeddings
Why VaultLayer
Your training always finishes.
VaultLayer lets you train on the most affordable GPUs anywhere and still get a finished model. If a GPU is reclaimed or crashes mid-run, we detect it and resume from your last checkpoint — big-cloud reliability at a fraction of the price, with zero babysitting.
💸

Affordable, automatically

Type vl run python train.py and the lowest-cost available GPU wins, across every provider we support. No portals, no price-shopping, no lock-in.

🛡️

Survives any failure

When a GPU dies mid-run (reclaimed, crashed, or gone), VaultLayer detects it, shuts it down cleanly, and resumes your training from the last checkpoint, on the same provider or a different one, whichever is available.

🧩

Zero code changes

No SDK to import, no decorators, no framework to adopt. Your PyTorch or JAX script doesn't know it's being managed. vl run wraps the command you already have.
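
One general way a wrapper can manage an unmodified script is to launch it as a child process and rerun it when it exits abnormally. The sketch below shows only that pattern. It is not VaultLayer's implementation, and run_with_restarts and max_restarts are hypothetical names chosen for this example.

```python
import subprocess


def run_with_restarts(cmd, max_restarts=3):
    """Run cmd as a child process and rerun it if it exits with a non-zero code.

    Returns (exit_code, attempts). Hypothetical helper, for illustration only.
    """
    attempts = 0
    while True:
        attempts += 1
        # The child is the user's unchanged command, e.g. ["python", "train.py"].
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return 0, attempts
        if attempts > max_restarts:
            # Out of restarts: surface the last failure code to the caller.
            return result.returncode, attempts
```

Calling run_with_restarts(["python", "train.py"]) would run the script up to four times in total (one initial run plus three restarts). The script itself needs no import or decorator, which is the point of wrapping the command rather than the code.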


How it works
Drop in. Walk away.
Two steps. We handle provisioning, checkpointing, failure detection, and cross-provider recovery.
# one-time setup
pip install vaultlayer && vaultlayer init

# run any training script — that's it
vl run python train.py

Submit your script

Wraps any PyTorch / JAX / Hugging Face training script unchanged.

Lowest-cost GPU wins

Routed to the lowest-cost available provider with capacity, in seconds.
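
The routing rule described here can be sketched as a filter followed by a minimum. This is a simplified illustration, not VaultLayer's actual scheduler: the Offer type, the pick_offer function, the provider names, and the prices are all made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Offer:
    provider: str         # placeholder provider name
    gpu: str              # GPU model, e.g. "H100"
    usd_per_hour: float   # made-up price, for illustration only
    has_capacity: bool    # whether the provider can start the job now


def pick_offer(offers, gpu=None) -> Optional[Offer]:
    """Return the cheapest offer with capacity, optionally pinned to one GPU model."""
    candidates = [
        o for o in offers
        if o.has_capacity and (gpu is None or o.gpu == gpu)
    ]
    # default=None covers the case where nothing is available right now.
    return min(candidates, key=lambda o: o.usd_per_hour, default=None)


offers = [
    Offer("provider_a", "H100", 3.10, True),
    Offer("provider_b", "H100", 2.80, False),  # cheapest H100, but no capacity
    Offer("provider_c", "A100", 1.90, True),
]

print(pick_offer(offers))              # cheapest offer that has capacity
print(pick_offer(offers, gpu="H100"))  # cheapest H100 that has capacity
```

Note that the lowest hourly rate is not always the lowest total cost, since a faster GPU can finish sooner. A real router would likely weigh that as well.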

Auto-checkpoint

Your training progress is saved to secure storage as the job runs, so a failure never sends you back to the start.

Dies → resumes

A failed GPU triggers a re-provision and resume from the last checkpoint — same provider or another, whichever comes back fastest.


The difference
With VaultLayer vs. without.

Without VaultLayer

  • A reclaimed or crashed GPU kills your run — hours of progress gone.
  • You restart from scratch and pay for the same work twice.
  • You babysit jobs, or overpay for "safe" hardware to avoid failures.
  • You write and maintain your own checkpointing and recovery glue.
💸 What you actually pay for: idle boot minutes, re-runs after a crash, and a premium for "safe" hardware.

With VaultLayer

  • Auto-resume from your last checkpoint the moment a GPU dies.
  • No lost work, no paying twice — the job just keeps going.
  • Run on the most affordable GPU available, reliably.
  • Zero code changes — we wrap the command you already run.
💸 ~47% lower cost on average — you pay only for the GPU that finishes your job.
Same job — a 10-epoch Qwen2.5-1.5B fine-tune (9m23s on an H100)
VaultLayer: $0.48
AWS on-demand H100: ~$1.08
~55% lower cost on the same job.
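
The percentage follows directly from the two prices above. The short check below reproduces it and also derives an hourly rate from the $0.48 bill, under the assumption (ours, not stated on this page) that the bill covers exactly the 9m23s run on a single GPU with no other fees.

```python
vaultlayer_cost = 0.48      # USD, from the comparison above
on_demand_cost = 1.08       # USD, approximate, from the comparison above
run_seconds = 9 * 60 + 23   # 9m23s

# Relative saving versus the on-demand price.
savings = (on_demand_cost - vaultlayer_cost) / on_demand_cost

# Implied rate, assuming the bill covers only this run on one GPU.
implied_rate = vaultlayer_cost / (run_seconds / 3600)  # USD per GPU-hour

print(f"savings: {savings:.1%}")
print(f"implied rate: ${implied_rate:.2f}/hour")
```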

From an early user
"The babysitting tax. Renting H100s directly from RunPod / Lambda / Voltage Park, I'd lose 20–30 minutes per session to wrapper work around the actual training: hunting for a region with capacity, paying for idle minutes while the pod boots and the image pulls, wiring my own checkpoint sync to S3/R2 so a spot preemption wouldn't wipe out an hour of progress, and writing fragile resume logic because providers just hand you a fresh pod after a crash. … The 10-epoch fine-tune finished in 9 minutes 23 seconds for $0.48 — clean exit, adapter saved, no leftover pod to clean up, no manual sync, no babysitting."
Amit Pal · CTO, Vettable

FAQ
Questions, answered.
What happens if a GPU dies mid-run?

VaultLayer detects it and your training resumes from its last checkpoint, on the same provider or another, whichever is available. No restart from scratch.

Do I have to change my training code?

No SDK, no decorators, no framework to adopt. vl run python train.py wraps the command you already use — your PyTorch / JAX / Hugging Face script runs unchanged.

Can I pick the GPU, or is it automatic?

Both. By default VaultLayer routes to the best available GPU at the lowest current rate. Want a specific one — say, to run faster? Pin it: vl run --gpu H100 python train.py.

How does pricing work?

You pay for the GPU at the current market rate, and you always see the quote before a run starts — billed on actual run time, no surprises. Rates move with supply and demand, so VaultLayer shops for the best available price each run.

Why not just add my own checkpointing?

You can — but you'll spend time hunting for capacity, paying for idle boot minutes, wiring checkpoint sync, and writing resume logic for every run. VaultLayer does all of that automatically, so you start a job and walk away.
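
For a sense of what the do-it-yourself route involves, here is a minimal sketch of checkpoint and resume logic inside a training script. It is a toy under stated assumptions: a JSON file stands in for model and optimizer state, the fail_at parameter exists only to simulate a crash, and syncing the checkpoint to remote storage is left out entirely. A real script would use its framework's own serialization.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write cannot corrupt the checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename within the same directory


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0}


def train(path, total_epochs, fail_at=None):
    """Toy training loop that resumes from the last completed epoch."""
    state = load_checkpoint(path)
    for epoch in range(state["epoch"], total_epochs):
        if fail_at is not None and epoch == fail_at:
            raise RuntimeError("simulated GPU failure")
        # ... one epoch of real training would go here ...
        state["epoch"] = epoch + 1
        save_checkpoint(path, state)
    return state
```

If train("ckpt.json", 10, fail_at=4) crashes, a second call to train("ckpt.json", 10) picks up at epoch 4 rather than epoch 0. Anything done since the last saved checkpoint is repeated, which is why checkpoint frequency matters.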

Is my workload isolated from other users?

Yes. Each job runs on its own provisioned GPU with credentials scoped to that job. You're not sharing a container with other customers, and no customer can access another customer's data.

Which frameworks does it work with?

Any standard training script — PyTorch, JAX, Hugging Face / TRL, Axolotl, and similar. If it runs with python train.py, it works. VaultLayer is focused on training and fine-tuning, not real-time inference serving.

How do I get access?

VaultLayer is in invite-only early access. Request an invite and we'll email you a token plus $25 in free credits.


Start a job. Walk away.
Come back to a finished model.

VaultLayer is in private, invite-only beta. Drop your email and we'll send you an invite as soon as a spot opens up — no marketing drips.

🎁 Early access includes $25 in free credits

Questions? rahuljain@vaultlayer.cloud