Jobs

Run batch training, fine-tuning, and data processing on self-terminating GPU instances.

Outpost Jobs runs batch workloads — model training, fine-tuning, data processing, evaluation — on GPU instances that automatically terminate when the work is done. Billing stops the moment your script exits.

Key features

Self-terminating — instances shut down when your job completes. No idle resources, no forgotten machines.
Spot instance support — run on spot instances at a fraction of on-demand cost. Outpost handles preemption and recovery.
Multi-node distribution — scale training across multiple nodes with a single configuration parameter.
Log streaming — stream stdout/stderr to the dashboard and CLI in real time.
Pay-per-second — billed for actual compute time only. No minimum commitments.
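
To make the per-second billing arithmetic concrete, here is a minimal sketch. The function name, the per-GPU pricing model, and the rate used below are illustrative assumptions, not Outpost's actual price list or API.

```python
def job_cost(hourly_rate_per_gpu, gpus, runtime_seconds, spot_discount=0.0):
    """Estimate the cost of a job billed per second.

    All inputs are hypothetical: real rates depend on the cloud, region,
    and GPU type. `spot_discount` is the fraction saved versus on-demand.
    """
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    per_second_rate = hourly_rate_per_gpu / 3600
    return per_second_rate * gpus * runtime_seconds * (1.0 - spot_discount)
```

At an assumed rate of $3.60 per GPU-hour, a 4-GPU job that runs for 1,000 seconds comes to about $4.00, where billing rounded up to a full hour would come to $14.40.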

Quick start

# Launch a training job on 4x A100s
outpost jobs launch \
  --name train-resnet \
  --gpus A100:4 \
  --cloud aws \
  --region us-east-1 \
  --command "torchrun --nproc_per_node=4 train.py --epochs 50"

# Check status
outpost jobs status train-resnet

# Stream logs
outpost jobs logs train-resnet

How it works

Define — specify the GPU, command, and resource requirements.
Launch — submit the job from the CLI or dashboard. Outpost provisions the instance and starts execution.
Monitor — stream logs in real time, track GPU utilization.
Complete — when the job finishes, Outpost terminates the instance and stops billing.
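
The self-terminating behavior in the last step can be pictured as a "run, then always clean up" wrapper. The sketch below is a conceptual illustration only, not Outpost's implementation; `run_then_terminate` and the `terminate` callback are hypothetical names.

```python
import subprocess
import sys


def run_then_terminate(command, terminate):
    """Run `command`, then always call `terminate`; return the exit code."""
    try:
        return subprocess.run(command).returncode
    finally:
        # Runs whether the command succeeded, failed, or raised,
        # so the instance is never left idle after the job ends.
        terminate()


if __name__ == "__main__":
    exit_code = run_then_terminate(
        [sys.executable, "-c", "print('training...')"],
        terminate=lambda: print("instance terminated, billing stopped"),
    )
    print("job exit code:", exit_code)
```

The point of the `try`/`finally` structure is that cleanup does not depend on the job succeeding: a crashed script releases the instance just like a finished one.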

Use cases

Model training — train on high-end GPUs without managing infrastructure.
Fine-tuning — fine-tune foundation models (LLaMA, Mistral, Gemma) on your own data.
Data processing — large-scale ETL, dataset preparation, feature engineering on GPU instances.
Evaluation — run evaluation suites, benchmark inference latency, compare architectures.

Spot instances

Spot instances can cost up to 90% less than on-demand instances, but the provider can reclaim them at any time. Outpost handles that for you:

Automatic failover — if a spot instance is reclaimed, Outpost provisions a replacement.
Cross-cloud fallback — if capacity is unavailable on one provider, Outpost can fall back to another.

[Note] For fault-tolerant training, save checkpoints periodically. If a spot instance is preempted, your job can resume from the latest checkpoint.
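
A minimal sketch of the checkpoint-and-resume pattern the note describes, using only the Python standard library. The function names, the JSON format, and the `step-XXXXXXXX.json` naming are assumptions for illustration; a real training script would typically save model and optimizer state with its framework's own utilities. It also assumes the checkpoint directory lives on storage that outlives the instance, which this page does not cover.

```python
import json
import os
import tempfile


def save_checkpoint(directory, step, state):
    """Atomically write the state for `step` into `directory`."""
    os.makedirs(directory, exist_ok=True)
    final_path = os.path.join(directory, f"step-{step:08d}.json")
    # Write to a temp file and rename it, so a preemption mid-write
    # never leaves a truncated file under the final name.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, final_path)
    return final_path


def load_latest_checkpoint(directory):
    """Return (step, state) from the newest checkpoint, or (0, None)."""
    if not os.path.isdir(directory):
        return 0, None
    # Zero-padded step numbers make lexicographic order match step order.
    names = sorted(n for n in os.listdir(directory)
                   if n.startswith("step-") and n.endswith(".json"))
    if not names:
        return 0, None
    with open(os.path.join(directory, names[-1])) as f:
        data = json.load(f)
    return data["step"], data["state"]


def train(directory, total_steps, checkpoint_every):
    """Toy loop: resume from the latest checkpoint, save periodically."""
    step, state = load_latest_checkpoint(directory)
    state = state or {"loss": 1.0}
    while step < total_steps:
        step += 1
        state["loss"] *= 0.9  # stand-in for a real optimizer step
        if step % checkpoint_every == 0 or step == total_steps:
            save_checkpoint(directory, step, state)
    return step, state
```

If the job is restarted after a preemption, calling `train` again with the same directory picks up at the last saved step, so at most `checkpoint_every` steps of work are repeated.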

Next steps

Create a Job — step-by-step guide to define, launch, and monitor a job
