Train Hugging Face models on cloud GPUs
VaultLayer runs Hugging Face training (Transformers Trainer, TRL, Accelerate, and PEFT/QLoRA) on cloud GPUs without code changes. Checkpoints are saved automatically, and a single resume argument lets an interrupted job continue from its last checkpoint.
Run Trainer or TRL unchanged
The default training image ships Transformers, Accelerate, PEFT, and TRL, so your script runs as-is:
vl run python train.py
No SDK, no decorators, no container conventions to learn — VaultLayer wraps the command you already run.
Auto-resume with one argument
The Hugging Face Trainer already writes checkpoints; pass the VaultLayer resume path so an interrupted job picks up from the last one instead of restarting:
import os
# os.environ.get returns None when the variable is unset, so training then starts from scratch.
trainer.train(resume_from_checkpoint=os.environ.get("VAULTLAYER_RESUME_CHECKPOINT"))
VaultLayer syncs those checkpoints to durable storage and re-provisions on failure — see how checkpoint & resume works.
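If you prefer to guard against a blank value, the lookup can be wrapped in a small helper. This is a minimal sketch, not part of VaultLayer: the helper name resolve_resume_checkpoint is hypothetical, and it assumes the variable holds a checkpoint path that can be handed straight to the Trainer. It normalizes an unset or blank variable to None, which the Trainer treats as a fresh start.

```python
import os
from typing import Mapping, Optional

# Variable name taken from the text above; how VaultLayer populates it is an assumption.
RESUME_ENV_VAR = "VAULTLAYER_RESUME_CHECKPOINT"


def resolve_resume_checkpoint(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the resume path, or None when the job should start from scratch."""
    source = os.environ if env is None else env
    value = source.get(RESUME_ENV_VAR, "").strip()
    # Treat an unset or blank variable as "no checkpoint yet" (for example, a first run).
    return value or None


# Usage inside an existing training script (trainer is your own Trainer instance):
# trainer.train(resume_from_checkpoint=resolve_resume_checkpoint())
```

Accepting an optional mapping instead of reading os.environ directly keeps the helper easy to unit-test without touching the real environment.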
LoRA, QLoRA, and full fine-tunes
Run any fine-tuning mode with --train-mode qlora|lora|full, sized to your model. For choosing a GPU by model size, see GPU types for training; for the broader workflow, fine-tune LLMs on your own cloud.
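To get a feel for how the three modes differ in memory footprint, the rough arithmetic can be sketched as below. The per-parameter figures are common rules of thumb, not VaultLayer numbers: about 16 bytes per parameter for a full fine-tune with Adam in mixed precision, about 2 bytes for a frozen 16-bit base model under LoRA, and about 0.5 bytes for a frozen 4-bit base model under QLoRA. The function name is hypothetical, and activations, adapter weights, and framework overhead are deliberately left out, so real usage will be higher.

```python
# Approximate bytes of GPU memory per base-model parameter, by training mode.
# Rules of thumb only; activations, LoRA adapter weights, and framework
# overhead are excluded.
BYTES_PER_PARAM = {
    "full": 16.0,   # 16-bit weights and grads, fp32 master weights, two Adam moments
    "lora": 2.0,    # frozen 16-bit base weights
    "qlora": 0.5,   # frozen 4-bit base weights
}


def estimate_base_memory_gb(params_billions: float, train_mode: str) -> float:
    """Estimate memory (in GB, 1 GB = 1e9 bytes) for the base model and its optimizer state."""
    if train_mode not in BYTES_PER_PARAM:
        raise ValueError(f"unknown train mode: {train_mode!r}")
    # (params_billions * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billions * BYTES_PER_PARAM[train_mode]


if __name__ == "__main__":
    for mode in ("full", "lora", "qlora"):
        print(mode, estimate_base_memory_gb(7, mode), "GB")
```

For a 7-billion-parameter model this gives roughly 112 GB for a full fine-tune, 14 GB for LoRA, and 3.5 GB for QLoRA, before activations are counted.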
Frequently asked questions
Which Hugging Face libraries are supported?
Transformers (including the Trainer), TRL, Accelerate, and PEFT. If your script runs with python train.py, it runs on VaultLayer.
How does resume work with the Trainer?
Pass the value of the VAULTLAYER_RESUME_CHECKPOINT environment variable as the resume_from_checkpoint argument to trainer.train(). VaultLayer stores checkpoints durably and, after an interruption, restarts the job from the most recent one.
Keep every training job moving.
VaultLayer is in invite-only early access for teams running real GPU workloads.