VaultLayer Docs — Run PyTorch / JAX / Hugging Face GPU training jobs from the CLI

Quickstart #

Five commands from zero to a finished training run.

pip install vaultlayer
vaultlayer init                          # one-time authentication
vl run train.py                          # submit your training script
vl logs <job_id> --follow                # stream logs as the job runs
vl credits                               # check your balance

Run a training job #

Submit any Python training script — VaultLayer routes it to GPU capacity, streams logs back, checkpoints as it runs, and auto-recovers on interruption. The command is the same whether you wrote your script in PyTorch, JAX, or Hugging Face (see Frameworks).

The interpreter is optional for a local .py script: vl run train.py and vl run python train.py do the same thing. Explicit launchers — python -u, torchrun, accelerate launch — are always passed through unchanged.

`vl run train.py`	Submit your script on the default route — the cheapest available GPU across your connected / allow-listed providers.
`vl run --gpu H100 train.py`	Pin a specific GPU type (H100, A100, L40S, A10G, RTX 4090, etc.).
`vl run --byoc train.py`	Bring Your Own Cloud — run on your cloud account and checkpoint to your bucket, never the managed pool. See BYOC.
`vl run --data s3://bucket/path train.py`	Point the job at your dataset. Supports `s3://`, `gs://`, `az://`, `hf://` (Hugging Face), and `r2://` (a dataset you uploaded with `vl sync`). Cloud sources are mirrored once before training starts. See Storage.
`vl run --image your-org/your-image:tag train.py`	Use a custom container image instead of the default training base image. Add `--registry-user` / `--registry-pass` (or the matching env vars) for a private registry.
`vl run --env KEY=VALUE train.py`	Pass an environment variable through to the remote job (e.g. `--env WANDB_API_KEY=…`). Repeat the flag for multiple values.
`vl run --project nlp --experiment sft-llama7b train.py`	Tag the run for per-project / per-experiment cost attribution — shows up in `vl spend`.
`vl run --keep-alive 30m train.py`	Hold the instance alive for 30 min after the script exits — pull artifacts, inspect logs while the host is warm, or fix and re-try. Range: 5m–24h, billed at the job's GPU rate. See Keep alive.
`vl run --verbose train.py`	Show provider names, retry counts, and bootstrap progress. `--quiet` suppresses everything except the final result.
`vl run --yes train.py`	Skip the pre-flight confirmation prompt (for scripts / CI).

Tip: Run vl run --help to see every available option.

What's preinstalled: The default training image ships PyTorch, CUDA, Hugging Face Transformers, Accelerate, PEFT, TRL, and the common fine-tuning stack — your script runs unchanged. Need extra packages? Drop a requirements.txt next to your script (auto-installed before training) — scaffold a starter with vaultlayer requirements init — or bring your own image with --image.

Advanced routing, cost, and safety flags:

`--train-mode qlora\|lora\|full`	Fine-tuning mode — `qlora` (default) is cheapest, `full` is a full fine-tune. Pair with `--model-params 13` to size the GPU for a 13B model.
`--provider "<name>"`	Restrict the run to specific providers (comma-separated). `--excluded-providers` is a hard denylist that is never relaxed.
`--regions eu-central-1,eu-west-1`	Restrict provisioning to specific regions (e.g. GDPR-only). `--excluded-regions` blocks regions. Applies to region-aware providers.
`--max-cost 50`	Auto-cancel the job once spend reaches this USD threshold (minimum $1).
`--setup setup.sh`	Run a shell script — a file path or an inline command (e.g. `--setup "pip install wandb"`) — on the node before training.
`--min-cuda 12.8`	Only route to hosts whose GPU driver supports this CUDA version or newer (e.g. for torch 2.8 / cu128 + `torch.compile`).
`--checkpoint-gb N`	Reserve N GB of scratch space for checkpoints when training a large model.
`--accept-interruptible`	Opt into cheaper interruptible / spot capacity — VaultLayer auto-checkpoints and resumes from the last step on preemption.
`--strict-preflight`	Treat preflight warnings as hard errors — fail fast instead of prompting.
`--skip-preflight`	Skip the preflight environment check and all prompts (non-interactive / CI).
`--skip-compile-check`	Skip the `torch.compile` compatibility probe.
`--no-failover`	Disable auto-recovery on interruption. Failover is on by default — it retries on the next available provider from your last checkpoint.

Multi-node distributed training (private beta):

Link multiple GPU nodes into one elastic torchrun job — FSDP / DeepSpeed / DDP, whole-gang checkpointing, and automatic reshard on node loss. Available on managed capacity, or on your own AWS account (BYOC gang), validated to multi-node H100. In private beta — contact us to enable it for your account.

`--nodes N`	Number of linked GPU nodes (2–8). `N=1` is standard single-node. Multi-node requires explicit `--gpu` and `--gpus-per-node`.
`--gpus-per-node N`	GPUs per node. Required when `--nodes` > 1 — multi-node has no default. RunPod: 1–8; AWS: must match the instance shape (H100/A100 → 8 per node, A10G/V100 → 1).
`--min-nodes N`	Elastic floor: the lowest node count `torchrun` may shrink to after a permanent node loss (`0` = fixed gang).
`--dist-framework fsdp\|ddp\|deepspeed`	Distributed launch strategy.
`--max-restarts N`	`torchrun` elastic whole-gang restart ceiling.

Frameworks #

VaultLayer wraps the command you already run — PyTorch, JAX, or Hugging Face — with no code changes required. The run command is identical across all of them:

vl run train.py

The only optional step is checkpoint integration. Add it once and VaultLayer can resume your job from the last saved step if a GPU is reclaimed or the host fails mid-run. vaultlayer init writes the helper files locally, and on a first run vl run offers to auto-insert the right snippet for you — so the lines below are usually added for you, not by hand. The helper module itself is provided automatically on the GPU node; you never upload it.

PyTorch — run it, then add 3 lines for resume-on-interruption:

vl run train.py

# inside train.py (optional — enables auto-resume across migrations):
from vaultlayer_checkpoint import checkpoint, restore, CHECKPOINT_DIR
start_step = restore(CHECKPOINT_DIR, model=model, optimizer=optimizer)
# ...inside your training loop:
checkpoint(step=step, model=model, optimizer=optimizer, save_path=CHECKPOINT_DIR)

Hugging Face (Transformers Trainer / TRL / Accelerate) — checkpoints auto-save and VaultLayer resumes from the last one automatically. Nothing to add:

vl run train.py

# Single-node: checkpoints save to mirrored storage and training
# auto-resumes after an interruption — no code changes.
# Multi-node only (advanced): set output_dir=$VAULTLAYER_CHECKPOINT_DIR, then
# pass trainer.train(resume_from_checkpoint=os.environ.get("VAULTLAYER_RESUME_CHECKPOINT"))

JAX / Flax — run it, then add 3 lines for resume-on-interruption:

vl run train.py

# inside train.py (optional — enables auto-resume across migrations):
from vaultlayer_checkpoint_jax import vl_restore, vl_checkpoint
params, opt_state, start_step = vl_restore(params, opt_state)
# ...inside your training loop:
vl_checkpoint(step, params, opt_state, loss=float(loss))

Also supported: PyTorch Lightning (single-node auto-resume — no code changes) and DeepSpeed (from vaultlayer_checkpoint_deepspeed import vl_init, vl_checkpoint). If it runs with python train.py, it works on VaultLayer. Run vl examples to download ready-to-run scripts.

Bring Your Own Cloud (BYOC) #

Run jobs on your own cloud account — AWS, Azure, GCP, or any provider — instead of the managed pool. VaultLayer provisions on your compute, checkpoints to your bucket, and adds the reliability layer — orchestration, monitoring, and auto-recovery — on top. A BYOC job is never routed to the managed pool, and there is no per-run charge from VaultLayer: your compute is billed by your own cloud provider under a monthly contract.

`vl connect lambda-labs`	Connect your Lambda Labs account. Also: `vl connect runpod`, `vl connect vast-ai`.
`vl connect aws`	Connect your AWS account — STS role (recommended) or static keys. Azure and GCP too: `vl connect azure` / `vl connect gcp`.
`vl connect list`	Show which compute and storage accounts are connected.
`vl connect test`	Validate that connected credentials still work.
`vl run --byoc train.py`	Run on your connected cloud. Fails closed with a hard error if no compute credentials are set — it never falls back to the managed pool.

What you need: at least one cloud connected with compute credentials — AWS, Azure, GCP, or any provider. Set them with vl connect (stored in ~/.vaultlayer/config.env) or place them in a vaultlayer.env file — any one of: LAMBDA_LABS_API_KEY; or GCP_SERVICE_ACCOUNT_JSON + GCP_PROJECT_ID; or the four AZURE_* service-principal fields; or AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY (or VL_STS_ROLE_ARN for a role).

Watch + inspect a running job #

Status, logs, GPU health, recent jobs — all from the CLI.

`vl status <job_id>`	One-shot snapshot of the job's current state.
`vl logs <job_id>`	Show recent log output from your script.
`vl logs <job_id> --tail 200`	Show the last 200 lines.
`vl logs <job_id> --follow`	Stream log output live as it lands.
`vl gpu-stats <job_id>`	Live GPU VRAM, utilization, temperature, and disk usage — useful for tuning batch size.
`vl jobs`	Show your job history, most recent first. Jobs in a keep-alive window show a countdown: `KEEP_ALIVE (12m04s left)`.
`vl ps`	Show all active and recent jobs (with status).

When a job fails #

The CLI shows the last 15 lines of error output inline on every failure — no extra command needed. For deeper investigation:

`vl diagnose <job_id>`	One-command post-failure investigation — failure cause, last logs, GPU snapshot, fix suggestions.
`vl logs <job_id> --tail 200`	Pull more error context if `vl diagnose` isn't enough.
`vl download <job_id>`	Download job checkpoints, artifacts, and manifest after a run finishes.

Save money on debugging: Catch import and syntax bugs locally in 1 second before submitting to a GPU:
python -m py_compile train.py

Keep the GPU alive for debugging #

After your script exits, the instance is normally torn down so billing stops. Pass --keep-alive to hold it for a debugging window — inspect logs while the host is still warm, pull artifacts, or fix and re-try without paying a fresh cold-boot.

`vl run --keep-alive 30m train.py`	Submit a job that stays up for 30 min after exit. Range: 5m–24h, billed at the job's GPU rate.
`vl extend <job_id> 20m`	Extend the keep-alive window by another 20 min. Each extension can push the deadline to at most 24h from the moment you run it (a rolling cap — not 24h total). Not every provider supports mid-window extension — if you hit that, set a longer `--keep-alive` at submit instead.
`vl terminate <job_id>`	End the window early and destroy the instance. Billing stops immediately. Use `-y` / `--yes` to skip confirmation.
`vl jobs`	Shows a live countdown for jobs still in the window: `KEEP_ALIVE (12m04s left)`.

Watch your spend: Each window or extension is capped at 24h from the moment it's set, but repeated vl extend calls can keep an instance alive — and billing at the full GPU rate — indefinitely. End it early with vl terminate as soon as you're done — don't let an idle window outlive the value of debugging.

Stop or restart a job #

`vl stop <job_id>`	Stop a running job and terminate its instance now. Stop does not force a final checkpoint — you resume from your last auto-checkpoint (the periodic save your script writes, e.g. Hugging Face `save_steps`, synced to storage as training runs). Continue with `vl restart`.
`vl restart <job_id>`	Restart a stopped, suspended, or interrupted job from its last checkpoint.
`vl delete-job <job_id>`	Delete all saved data for a job.

Pre-flight estimation #

See what a job will cost before you submit. Estimates are price quotes only — they don't check your credit balance (use vl credits for that).

`vl estimate train.py`	Estimate the job cost across available GPU options before submission.
`vl gpus`	List available GPU types with VRAM and current best price.
`vl env-check`	Validate the remote training environment without submitting a full run. The check itself takes ~30 seconds (typically ~$0.04) — but a first run on a cold host can take several minutes of billed boot time pulling the container before it starts. Point it at your own probe with `--script check_env.py`, validate a custom image with `--docker-image name:tag`, or pin the probe GPU with `--gpu TYPE`.
`vl regions list-all`	List the AWS region codes you can pass to `--regions`. Region pinning applies to AWS only; the default managed GPU capacity is global and ignores region selection.
`vl connect`	Connect compute providers or data storage to VaultLayer.

Datasets #

Two ways to get data to a job: upload local data once with vl sync (then reuse it via r2://), or point straight at cloud storage with --data.

`vl sync /path/to/data`	Upload a local dataset once. Reuse it on any run with `--data r2://<dataset-id>`.
`vl upload /path/to/data`	Upload a dataset (alias of `vl sync`).
`vl datasets`	List uploaded datasets and their `r2://` IDs.
`vl datasets delete <dataset-id>`	Delete a dataset. Files are purged within 24h and monthly storage billing stops immediately.
`vl download <job_id>`	Download a finished job's checkpoints and artifacts to your machine.

Connect your own storage #

Bring your own buckets — jobs read your datasets and write checkpoints back to storage you control, instead of going through an upload.

`vl connect storage`	Interactive setup for your cloud storage credentials (S3-compatible, Google Cloud Storage, or Azure Blob).
`vl run --data s3://bucket/path train.py`	Point a job at connected storage. Schemes: `s3://`, `gs://`, `az://`, `hf://`, `r2://`.
`vl connect`	Interactive picker — connect storage or compute to VaultLayer.
`vl connect list`	Show which accounts are currently connected.
`vl connect test`	Verify a connected account's credentials still work.
`vl connect remove`	Remove a connected account.

Your credentials stay yours. Connected keys are scoped to your account, used only to run your jobs, and never printed in logs. Re-run vl connect anytime to update or replace them.

Account, credits, and tagging #

`vl credits`	Show your current credit balance.
`vl credits buy`	Top up your balance — opens a Stripe-hosted checkout in your browser. Credits are added automatically once payment completes.
`vl spend`	Spend breakdown for the last 90 days. Group with `--by day\|experiment\|project\|user`, bound the window with `--since` / `--until`, widen to your org with `--scope org` (optionally `--user USER_ID`), and cap rows with `--limit N`.
`vl dashboard`	Open the web dashboard (jobs, spend, GPU stats) in your browser — or add `--no-browser` to just print the URL. Sign in with your account email — a magic link is sent; no password.
`vl tag <job_id> --project X --experiment Y`	Retroactively tag a past job with a project and/or experiment name.
`vaultlayer init --reauth`	Re-authenticate if your token expires. Pass `--token <TOKEN>` to authenticate non-interactively (CI).
`vaultlayer init --reset-pin`	Reset a forgotten recovery PIN — verify a one-time code sent to your account email, then set a new PIN. Also re-authenticates this machine.
`vl examples`	Download ready-to-run example training scripts.
`vl update`	Update the VaultLayer CLI to the latest version.
`vl feedback`	Submit feedback or a crash report.
`vl --version`	Print the installed CLI version.
`vl --help`	Show all top-level commands.

Troubleshooting #

"Token revoked. Run: vaultlayer init --reauth" Run vaultlayer init --reauth and enter your account email. If the same message keeps coming back after a successful reauth, check that VAULTLAYER_TOKEN isn't exported in your shell from a source .env step — run unset VAULTLAYER_TOKEN and retry.

Job stuck at "Loading training environment (2–5 min)..." for 5+ minutes First runs pull a multi-GB container, which can take 3–7 minutes on a cold instance. The container is cached after the first pull, so subsequent runs start faster. If you see no log output after 10+ minutes, run vl status <job_id> to confirm the phase.

Job fails with "exit code 1" The CLI shows the last 15 lines of your script's error output inline. If you need more context, run vl logs <job_id> --tail 200. To catch import and syntax errors in 1 second locally (before paying for a GPU), run python -m py_compile your_script.py.

Job fails immediately on submission (no GPU started) Usually means you're out of credits or the script path is wrong. Check vl credits, and confirm the file exists with ls your_script.py.

VRAM out-of-memory partway through training Reduce batch size or use a larger GPU. Run vl gpus to see VRAM per option. Live VRAM during a run: vl gpu-stats <job_id>.

Need help? #

Email rahuljain@vaultlayer.cloud for anything — bugs, feature requests, or quick questions on how to use a command. We typically reply within a day.

For one-off feedback or a crash report from the CLI, you can also run vl feedback.