
Jobs

Run batch training, fine-tuning, and data processing on self-terminating GPU instances.

Outpost Jobs runs batch workloads — model training, fine-tuning, data processing, evaluation — on GPU instances that automatically terminate when the work is done. Billing stops the moment your script exits.

Key features

  • Self-terminating — instances shut down when your job completes. No idle resources, no forgotten machines.
  • Spot instance support — run on spot instances at a fraction of on-demand cost. Outpost handles preemption and recovery.
  • Multi-node distribution — scale training across multiple nodes with a single configuration parameter.
  • Log streaming — stream stdout/stderr to the dashboard and CLI in real time.
  • Pay-per-second — billed for actual compute time only. No minimum commitments.
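
To make the pay-per-second point concrete, here is a minimal arithmetic sketch. The hourly rate is a made-up figure, not an Outpost price, and the comparison with hour-rounded billing is only for illustration. It also assumes the billed time equals the job's runtime, which this page does not spell out.

```python
import math

# Hypothetical on-demand rate in USD per hour (NOT an actual Outpost price).
HYPOTHETICAL_HOURLY_RATE = 3.60


def per_second_cost(hourly_rate: float, runtime_seconds: int) -> float:
    """Cost when every second of runtime is billed at hourly_rate / 3600."""
    return hourly_rate / 3600 * runtime_seconds


def hourly_rounded_cost(hourly_rate: float, runtime_seconds: int) -> float:
    """Cost under a billing model that rounds runtime up to whole hours."""
    return hourly_rate * math.ceil(runtime_seconds / 3600)


# A job that runs for 40 minutes and 17 seconds.
runtime = 40 * 60 + 17

print(f"per-second billing:  ${per_second_cost(HYPOTHETICAL_HOURLY_RATE, runtime):.2f}")
print(f"hour-rounded billing: ${hourly_rounded_cost(HYPOTHETICAL_HOURLY_RATE, runtime):.2f}")
```

The gap between the two figures is largest for short jobs, which is where per-second billing matters most.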

Quick start

# Launch a training job on 4x A100s
outpost jobs launch --name train-resnet --gpus A100:4 --cloud aws --region us-east-1 \
  --command "torchrun --nproc_per_node=4 train.py --epochs 50"

# Check status
outpost jobs status train-resnet

# Stream logs
outpost jobs logs train-resnet

How it works

  1. Define — specify the GPU, command, and resource requirements.
  2. Launch — submit the job from the CLI or dashboard. Outpost provisions the instance and starts execution.
  3. Monitor — stream logs in real time, track GPU utilization.
  4. Complete — when the job finishes, Outpost terminates the instance and stops billing.
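
The Define and Launch steps can also be scripted. The sketch below builds the argument list for `outpost jobs launch` from a small job definition, using only the flags that appear in the Quick start command. `JobSpec` and `launch_argv` are hypothetical helper names, not part of any Outpost SDK, and the script only prints the command rather than submitting it.

```python
import shlex
from dataclasses import dataclass


@dataclass
class JobSpec:
    """Hypothetical container for the fields used in the Quick start command."""
    name: str
    gpus: str      # same "TYPE:COUNT" form as the Quick start, e.g. "A100:4"
    cloud: str
    region: str
    command: str   # the command the job runs on the instance


def launch_argv(spec: JobSpec) -> list:
    """Build the argv for `outpost jobs launch` using only Quick start flags."""
    return [
        "outpost", "jobs", "launch",
        "--name", spec.name,
        "--gpus", spec.gpus,
        "--cloud", spec.cloud,
        "--region", spec.region,
        "--command", spec.command,
    ]


spec = JobSpec(
    name="train-resnet",
    gpus="A100:4",
    cloud="aws",
    region="us-east-1",
    command="torchrun --nproc_per_node=4 train.py --epochs 50",
)
argv = launch_argv(spec)

# Print a shell-safe rendering of the command for review.
print(shlex.join(argv))

# To submit for real, pass the list to subprocess.run(argv, check=True)
# on a machine where the outpost CLI is installed and authenticated.
```

Passing the arguments as a list instead of a single string avoids shell-quoting problems when the job command itself contains spaces or quotes.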

Use cases

  • Model training — train on high-end GPUs without managing infrastructure.
  • Fine-tuning — fine-tune foundation models (LLaMA, Mistral, Gemma) on your own data.
  • Data processing — large-scale ETL, dataset preparation, feature engineering on GPU instances.
  • Evaluation — run evaluation suites, benchmark inference latency, compare architectures.

Spot instances

Spot instances offer up to 90% savings compared to on-demand. Outpost manages the complexity:

  • Automatic failover — if a spot instance is reclaimed, Outpost provisions a replacement.
  • Cross-cloud fallback — if capacity is unavailable on one provider, Outpost can fall back to another.
Tip
For fault-tolerant training, save checkpoints periodically. If a spot instance is preempted, your job can resume from the latest checkpoint.
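
As a rough illustration of the checkpoint-and-resume pattern, the sketch below uses only the Python standard library and a JSON file in place of real model state. The function names and the `stop_after` parameter (which simulates a preemption) are hypothetical. It assumes the checkpoint path lives on storage that is still available to the replacement instance; how Outpost persists files across preemptions is not covered on this page.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Write state atomically so an interruption mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename within one filesystem


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0}


def train(path: str, total_epochs: int, stop_after: Optional[int] = None) -> int:
    """Resume from the latest checkpoint; return how many epochs ran in this call."""
    state = load_checkpoint(path)
    ran = 0
    for epoch in range(state["epoch"], total_epochs):
        if stop_after is not None and ran >= stop_after:
            break  # simulated preemption
        # ... one epoch of real training would go here ...
        state["epoch"] = epoch + 1
        save_checkpoint(path, state)
        ran += 1
    return ran


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "checkpoint.json")
    first = train(ckpt, total_epochs=5, stop_after=2)   # interrupted run
    second = train(ckpt, total_epochs=5)                # resumed run
    print(f"first run: {first} epochs, resumed run: {second} epochs")
```

In a real training script the saved state would typically include model and optimizer state as well as the epoch counter, written with the framework's own serialization.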

Next steps

  • Create a Job — step-by-step guide to define, launch, and monitor a job
