Quickstart

Five commands from zero to a finished training run.

pip install vaultlayer
vaultlayer init                          # one-time authentication
vl run python train.py                   # submit your training script
vl logs <job_id> --follow                # stream logs as the job runs
vl credits                               # check your balance

Run a training job

Submit any Python training script. By default the job runs on the lowest-cost available GPU.

vl run python train.py
    Submit your script. Defaults to the lowest-cost available GPU.
vl run --gpu H100 python train.py
    Pin a specific GPU type (H100, A100, L40S, L4, A10G, RTX 4090, etc.).
vl run --project nlp python train.py
    Tag the run with a project; the tag shows up in vl spend.
vl run --experiment sft-llama7b python train.py
    Tag the run with an experiment name for per-run cost attribution.
vl run --data s3://bucket/path python train.py
    Point the job at your dataset. Supports s3://, gs://, az://, hf:// (Hugging Face), and r2:// (a dataset you uploaded with vl sync). Cloud sources are mirrored once before training starts. See Storage.
vl run --image your-org/your-image:tag python train.py
    Use a custom container image instead of the default training base image.
vl run --env KEY=VALUE python train.py
    Pass an environment variable through to the remote job. Repeat the flag for multiple values.
vl run --accept-interruptible python train.py
    Skip the pre-flight warning when picking an interruptible (consumer-tier) GPU.
vl run --verbose python train.py
    Show extra status output during the run.
vl run --quiet python train.py
    Suppress all output except the final job result.
vl run --keep-alive 30m python train.py
    Hold the instance alive for 30 min after the script exits so you can pull artifacts, inspect logs while the host is warm, or fix and retry. Range: 5m–24h, billed at the job's GPU rate. See Keep alive.

Tip: Run vl run --help to see every available option.

What's preinstalled: The default training image ships PyTorch, CUDA, Hugging Face Transformers, Accelerate, PEFT, TRL, and the common fine-tuning stack, so your script runs unchanged. Need extra packages? Drop a requirements.txt next to your script (auto-installed before training), or bring your own image with --image.
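
The flags above are shown one per example. As a rough sketch, and assuming they can be combined in a single invocation (this page does not say so explicitly; vl run --help is the authority), a pinned, tagged run with two environment variables might be assembled as below. LOG_LEVEL and SEED are hypothetical variables that your own script would read. The command is only printed here, not executed.

```shell
# Assemble the invocation as an argument list so it can be reviewed first.
# Assumption: these documented flags may be combined in one `vl run` call.
# LOG_LEVEL and SEED are hypothetical variables read by your own script.
set -- vl run \
  --gpu H100 \
  --project nlp \
  --experiment sft-llama7b \
  --env LOG_LEVEL=debug \
  --env SEED=42 \
  python train.py

# Dry run: print the command instead of executing it.
echo "$*"
```

To submit for real, replace the final line with "$@", which runs the assembled argument list as a command.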

Watch + inspect a running job

Status, logs, GPU health, recent jobs — all from the CLI.

vl status <job_id>
    One-shot snapshot of the job's current state.
vl logs <job_id>
    Show recent log output from your script.
vl logs <job_id> --tail 200
    Show the last 200 lines.
vl logs <job_id> --follow
    Stream log output live as it lands.
vl gpu-stats <job_id>
    Live GPU VRAM, utilization, temperature, and disk usage. Useful for tuning batch size.
vl jobs
    Show your job history, most recent first. Jobs in a keep-alive window show a countdown: KEEP_ALIVE (12m04s left).
vl ps
    Show all active and recent jobs (with status).

When a job fails

The CLI shows the last 15 lines of error output inline on every failure — no extra command needed. For deeper investigation:

vl diagnose <job_id>
    One-command post-failure investigation: failure cause, last logs, GPU snapshot, fix suggestions.
vl logs <job_id> --tail 200
    Pull more error context if vl diagnose isn't enough.
vl download <job_id>
    Download job checkpoints, artifacts, and manifest after a run finishes.

Save money on debugging: Catch syntax errors locally in about a second before submitting to a GPU. Note that py_compile only compiles the file; it does not execute imports, so a missing package still surfaces only at run time.
python -m py_compile train.py
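
The local check can be chained in front of the submit so that a script with a syntax error never reaches a GPU. The following is a minimal sketch, not part of the CLI: submit_checked is a hypothetical helper name, and it assumes python on your PATH is the interpreter you want to check against.

```shell
# Hypothetical helper: submit only if the script compiles locally.
submit_checked() {
  script="$1"
  # py_compile parses the file; a syntax error makes it exit non-zero.
  if ! python -m py_compile "$script"; then
    echo "Not submitting: $script failed the local compile check." >&2
    return 1
  fi
  vl run python "$script"
}
```

Usage: submit_checked train.py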

Keep the GPU alive for debugging

After your script exits, the instance is normally torn down so billing stops. Pass --keep-alive to hold it for a debugging window: inspect logs while the host is still warm, pull artifacts, or fix and retry without paying for a fresh cold boot.

vl run --keep-alive 30m python train.py
    Submit a job that stays up for 30 min after exit. Range: 5m–24h, billed at the job's GPU rate.
vl extend <job_id> 20m
    Extend the keep-alive window by another 20 min. Caps at 24h total. Not every provider supports mid-window extension; if you hit that, set a longer --keep-alive at submit instead.
vl terminate <job_id>
    End the window early and destroy the instance. Billing stops immediately. Use -y / --yes to skip confirmation.
vl jobs
    Shows a live countdown for jobs still in the window: KEEP_ALIVE (12m04s left).

Watch your spend: The window is hard-capped at 24h, and you pay the full GPU rate while it's open. End it early with vl terminate as soon as you're done, so an idle window doesn't outlive the value of debugging.
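
One way to avoid leaving a window open is to pair the download with the terminate. This is a sketch under assumptions: finish_debug_window is a hypothetical helper name, it assumes vl download works while the job is still in its keep-alive window, and it assumes --yes is accepted after the job ID.

```shell
# Hypothetical helper: save outputs, then close the keep-alive window.
# Assumption: `vl download` works while the job is in its keep-alive window.
finish_debug_window() {
  job_id="$1"
  # Terminate only if the download succeeded.
  vl download "$job_id" && vl terminate "$job_id" --yes
}
```

Usage: finish_debug_window <job_id>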

Stop or restart a job

vl stop <job_id>
    Stop a running job (checkpoints before terminating so you don't lose progress).
vl restart <job_id>
    Restart a suspended or interrupted job from its last checkpoint.
vl delete-job <job_id>
    Delete all saved data for a job.

Pre-flight estimation

See what a job will cost before you submit, and validate the remote environment cheaply before committing to a full run.

vl estimate python train.py
    Estimate the job cost across available GPU options before submission.
vl gpus
    List available GPU types with VRAM and current best price.
vl env-check
    Validate the remote training environment in ~30 seconds (~$0.04) without submitting a full run.
vl regions
    Query available regions for provisioning.
vl connect
    Connect compute providers or data storage to VaultLayer.

Datasets

Two ways to get data to a job: upload local data once with vl sync (then reuse it via r2://), or point straight at cloud storage with --data.

vl sync /path/to/data
    Upload a local dataset once. Reuse it on any run with --data r2://<dataset-id>.
vl upload /path/to/data
    Upload a dataset (alias of vl sync).
vl datasets
    List uploaded datasets and their r2:// IDs.
vl datasets delete <dataset-id>
    Delete a dataset. Files are purged within 24h and monthly storage billing stops immediately.
vl download <job_id>
    Download a finished job's checkpoints and artifacts to your machine.
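
Once a dataset has been uploaded, reusing it is a matter of passing its ID with the r2:// scheme. As a small sketch (run_with_dataset is a hypothetical helper name, and the dataset ID is whatever vl datasets lists for your upload):

```shell
# Hypothetical helper: run a script against a dataset uploaded with `vl sync`.
# The dataset ID is the one `vl datasets` lists for your upload.
run_with_dataset() {
  dataset_id="$1"
  script="$2"
  vl run --data "r2://${dataset_id}" python "$script"
}
```

Usage: run_with_dataset <dataset-id> train.py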

Connect your own storage

Bring your own buckets — jobs read your datasets and write checkpoints back to storage you control, instead of going through an upload.

vl connect storage
    Interactive setup for your cloud storage credentials (S3-compatible, Google Cloud Storage, or Azure Blob).
vl run --data s3://bucket/path python train.py
    Point a job at connected storage. Schemes: s3://, gs://, az://, hf://, r2://.
vl connect
    Interactive picker: connect storage or compute to VaultLayer.
vl connect list
    Show which accounts are currently connected.
vl connect test
    Verify a connected account's credentials still work.
vl connect remove
    Remove a connected account.

Your credentials stay yours. Connected keys are scoped to your account, used only to run your jobs, and never printed in logs. Re-run vl connect anytime to update or replace them.

Account, credits, and tagging

vl credits
    Show your current credit balance.
vl credits buy
    Top up your balance. Opens a Stripe-hosted checkout in your browser; credits are added automatically once payment completes.
vl spend
    Spend breakdown by experiment, project, user, or day (last 90 days).
vl tag <job_id> --project X --experiment Y
    Retroactively tag a past job with a project and/or experiment name.
vaultlayer init --reauth
    Re-authenticate if your token expires.
vl examples
    Download ready-to-run example training scripts.
vl update
    Update the VaultLayer CLI to the latest version.
vl feedback
    Submit feedback or a crash report.
vl --version
    Print the installed CLI version.
vl --help
    Show all top-level commands.

Troubleshooting

"Token revoked. Run: vaultlayer init --reauth"
    Run vaultlayer init --reauth and enter your account email. If the same message keeps coming back after a successful reauth, check whether VAULTLAYER_TOKEN is exported in your shell, for example by an earlier source .env step. If it is, run unset VAULTLAYER_TOKEN and retry.
Job stuck at "Loading training environment (2–5 min)..." for 5+ minutes
    First runs pull a multi-GB container, which can take 3–7 minutes on a cold instance. The container is cached after the first pull, so subsequent runs start faster. If you see no log output after 10+ minutes, run vl status <job_id> to confirm the phase.
Job fails with "exit code 1"
    The CLI shows the last 15 lines of your script's error output inline. If you need more context, run vl logs <job_id> --tail 200. To catch syntax errors locally in about a second (before paying for a GPU), run python -m py_compile your_script.py. This only compiles the file; it does not execute imports, so a missing package still shows up only at run time.
Job fails immediately on submission (no GPU started)
    Usually means you're out of credits or the script path is wrong. Check vl credits, and confirm the file exists with ls your_script.py.
VRAM out-of-memory partway through training
    Reduce batch size or use a larger GPU. Run vl gpus to see VRAM per option. For live VRAM during a run, use vl gpu-stats <job_id>.
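
For the "Token revoked" case above, a quick way to check for a stale VAULTLAYER_TOKEN without printing its value is sketched below. It has to be pasted into (or sourced by) the shell you run vl from; executed as a separate script, the unset would not affect your interactive shell.

```shell
# Report whether VAULTLAYER_TOKEN is set in this shell, without echoing its value.
if [ -n "${VAULTLAYER_TOKEN:-}" ]; then
  echo "VAULTLAYER_TOKEN is set in this shell; clearing it."
  unset VAULTLAYER_TOKEN
else
  echo "VAULTLAYER_TOKEN is not set in this shell."
fi
```

After clearing it, retry the command that failed. If the variable comes back in new shells, look for the line that exports it in your shell startup files or .env.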

Need help?

Email rahuljain@vaultlayer.cloud for anything — bugs, feature requests, or quick questions on how to use a command. We typically reply within a day.

For one-off feedback or a crash report from the CLI, you can also run vl feedback.