Autoscaling

Configure traffic-aware autoscaling with CoDel-based replica management, scale-to-zero, and fine-tuned scaling policies.

Outpost Services automatically scales your application between a minimum and maximum replica count based on real-time traffic. The autoscaler uses a CoDel-based algorithm that targets P99 latency under 100ms, ensuring your service stays responsive without over-provisioning resources.

Scaling modes

Outpost supports two scaling modes: fixed replicas for predictable workloads and autoscaling for dynamic traffic patterns.

Fixed replicas

Use fixed replicas when your traffic is consistent and you want a guaranteed number of instances running at all times.

service:
  readiness_probe: /health
  replicas: 3

service:
  readiness_probe: /health
  replicas: 3

Outpost will maintain exactly 3 replicas. If a replica fails, it is automatically replaced.

Autoscaling

Autoscaling dynamically adjusts replica count based on incoming request volume. Configure it with a replica_policy:

service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 1
    max_replicas: 10
    target_qps_per_replica: 5
    upscale_delay_seconds: 300
    downscale_delay_seconds: 1200

service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 1
    max_replicas: 10
    target_qps_per_replica: 5
    upscale_delay_seconds: 300
    downscale_delay_seconds: 1200

Configuration reference

Parameter	Type	Default	Description
`min_replicas`	integer	`1`	Minimum number of replicas. Set to `0` to enable scale-to-zero.
`max_replicas`	integer	required	Upper bound on replica count. Controls maximum cost.
`target_qps_per_replica`	integer	required	Target queries per second each replica should handle. The autoscaler adds replicas when average QPS exceeds this value and removes them when it drops below.
`upscale_delay_seconds`	integer	`300`	Seconds to wait before scaling up after demand exceeds the target. Prevents thrashing from short traffic bursts.
`downscale_delay_seconds`	integer	`1200`	Seconds to wait before scaling down after demand drops. Provides a buffer for traffic that may return.

How the CoDel algorithm works

Traditional autoscalers react to simple CPU or memory thresholds. Outpost uses an adaptation of the Controlled Delay (CoDel) algorithm -- originally designed for network queue management -- applied to request queuing.

The algorithm works in three stages:

1. Queue monitoring. Every incoming request enters a per-replica queue. The autoscaler continuously measures how long requests spend waiting in the queue before being processed.

2. Latency targeting. The controller maintains a target for P99 queue latency of under 100ms. When queue sojourn times consistently exceed this target, the system determines that replicas are overloaded.

3. Scaling decisions. Based on the observed QPS and queue latency, the autoscaler calculates the ideal replica count:

If average QPS per replica exceeds target_qps_per_replica and queue latency is rising, replicas are added (up to max_replicas).
If average QPS per replica is well below the target and queue latency is minimal, replicas are removed (down to min_replicas).
Scaling actions respect the configured delay parameters to prevent oscillation.

[Note] Threshold-based autoscalers (scale at 80% CPU) often react too late for latency-sensitive workloads. CoDel responds to queue pressure — the earliest signal that a service is becoming overloaded — before latency degrades to the point where users notice.

Scale-to-zero

For services with intermittent traffic, setting min_replicas: 0 enables scale-to-zero. When no requests arrive for the duration of downscale_delay_seconds, Outpost terminates all replicas and stops billing for compute.

service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 0
    max_replicas: 10
    target_qps_per_replica: 5
    upscale_delay_seconds: 60
    downscale_delay_seconds: 1800

service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 0
    max_replicas: 10
    target_qps_per_replica: 5
    upscale_delay_seconds: 60
    downscale_delay_seconds: 1800

When a new request arrives, Outpost provisions a replica and routes the request once the readiness probe passes.

[Warning] Scale-to-zero introduces cold start time. For GPU workloads (loading large models), this can take several minutes. Consider setting min_replicas: 1 for latency-sensitive production services and reserving scale-to-zero for development, staging, or batch inference endpoints.

Cold start optimization

To reduce cold start latency for scale-to-zero services:

Keep setup lightweight. Pre-bake dependencies into a container or use cached layers.
Use a fast readiness probe. Load models asynchronously and report ready once the HTTP server is listening.
Increase downscale_delay_seconds. A longer cooldown means replicas stay warm through brief traffic gaps.

Monitoring autoscaling behavior

The Outpost dashboard provides real-time visibility into scaling decisions:

Replica count over time: See how your service scales in response to traffic.
QPS per replica: Verify that the autoscaler is maintaining your target.
P99 latency: Confirm that queue latency stays under the 100ms target.
Scaling events: A timeline of scale-up and scale-down actions with the triggering metrics.

Access these metrics from your service's Monitoring tab in the dashboard.

Best practices

Start conservative, then tune. Begin with a generous target_qps_per_replica (lower number = more headroom per replica) and adjust based on observed latency.

Set max_replicas to control costs. The autoscaler will never exceed this limit, even under extreme load. Size it based on your budget and expected peak traffic.

Use longer delays for expensive replicas. GPU instances take time to provision. Set upscale_delay_seconds high enough to avoid scaling on brief spikes, and downscale_delay_seconds long enough to avoid repeatedly paying cold start costs.

Match target_qps_per_replica to your workload profile. A lightweight API proxy might handle 100+ QPS per replica, while an LLM inference server might max out at 2-5 QPS. Benchmark your application to find the right value.

Test in staging first. Validate your autoscaling configuration under simulated load before deploying to production. Use tools like wrk, hey, or locust to generate realistic traffic patterns.

Example configurations

High-throughput API

service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 2
    max_replicas: 20
    target_qps_per_replica: 100
    upscale_delay_seconds: 60
    downscale_delay_seconds: 300

service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 2
    max_replicas: 20
    target_qps_per_replica: 100
    upscale_delay_seconds: 60
    downscale_delay_seconds: 300

GPU model serving

service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 1
    max_replicas: 5
    target_qps_per_replica: 3
    upscale_delay_seconds: 300
    downscale_delay_seconds: 1800

service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 1
    max_replicas: 5
    target_qps_per_replica: 3
    upscale_delay_seconds: 300
    downscale_delay_seconds: 1800

Development / staging (scale-to-zero)

service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 0
    max_replicas: 2
    target_qps_per_replica: 5
    upscale_delay_seconds: 0
    downscale_delay_seconds: 600

service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 0
    max_replicas: 2
    target_qps_per_replica: 5
    upscale_delay_seconds: 0
    downscale_delay_seconds: 600

Next steps

Deploy a Service — configuration and deployment guide
Custom Domains — DNS, TLS, and wildcard setup