Autoscaling
Configure traffic-aware autoscaling with CoDel-based replica management, scale-to-zero, and fine-tuned scaling policies.
Outpost Services automatically scales your application between a minimum and maximum replica count based on real-time traffic. The autoscaler uses a CoDel-based algorithm that targets P99 latency under 100ms, ensuring your service stays responsive without over-provisioning resources.
Scaling modes
Outpost supports two scaling modes: fixed replicas for predictable workloads and autoscaling for dynamic traffic patterns.
Fixed replicas
Use fixed replicas when your traffic is consistent and you want a guaranteed number of instances running at all times.
Outpost maintains exactly the configured number of replicas, for example 3. If a replica fails, it is automatically replaced.
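A fixed-replica configuration might look like the following sketch. The key name shown here is an assumption for illustration; only the behavior (a constant replica count with automatic replacement) comes from this guide.

```yaml
# Illustrative sketch: run exactly 3 replicas at all times.
# The exact key name for fixed-count mode may differ in your Outpost version.
replica_policy:
  fixed_replicas: 3
```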
Autoscaling
Autoscaling dynamically adjusts replica count based on incoming request volume. Configure it with a replica_policy:
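As an example, a replica_policy combining the parameters from the configuration reference might look like this (the nesting is a sketch; the parameter names and defaults are from this guide):

```yaml
replica_policy:
  min_replicas: 2            # never fewer than 2 replicas
  max_replicas: 10           # hard cap on cost
  target_qps_per_replica: 20 # add replicas above this load, remove below it
  upscale_delay_seconds: 300
  downscale_delay_seconds: 1200
```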
Configuration reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| min_replicas | integer | 1 | Minimum number of replicas. Set to 0 to enable scale-to-zero. |
| max_replicas | integer | required | Upper bound on replica count. Controls maximum cost. |
| target_qps_per_replica | integer | required | Target queries per second each replica should handle. The autoscaler adds replicas when average QPS exceeds this value and removes them when it drops below. |
| upscale_delay_seconds | integer | 300 | Seconds to wait before scaling up after demand exceeds the target. Prevents thrashing from short traffic bursts. |
| downscale_delay_seconds | integer | 1200 | Seconds to wait before scaling down after demand drops. Provides a buffer for traffic that may return. |
How the CoDel algorithm works
Traditional autoscalers react to simple CPU or memory thresholds. Outpost uses an adaptation of the Controlled Delay (CoDel) algorithm -- originally designed for network queue management -- applied to request queuing.
The algorithm works in three stages:
1. Queue monitoring. Every incoming request enters a per-replica queue. The autoscaler continuously measures how long requests spend waiting in the queue before being processed.
2. Latency targeting. The controller maintains a target for P99 queue latency of under 100ms. When queue sojourn times consistently exceed this target, the system determines that replicas are overloaded.
3. Scaling decisions. Based on the observed QPS and queue latency, the autoscaler calculates the ideal replica count:
- If average QPS per replica exceeds target_qps_per_replica and queue latency is rising, replicas are added (up to max_replicas).
- If average QPS per replica is well below the target and queue latency is minimal, replicas are removed (down to min_replicas).
- Scaling actions respect the configured delay parameters to prevent oscillation.
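The decision logic above can be sketched in Python. This is a simplified illustration, not Outpost's actual implementation: the function name, thresholds for "well below" and "minimal", and the omission of the upscale/downscale delay timers are all assumptions.

```python
def desired_replicas(current, avg_qps_per_replica, p99_queue_ms,
                     target_qps, min_replicas, max_replicas,
                     latency_target_ms=100):
    """Return the replica count a CoDel-style autoscaler would aim for.

    Scale up when per-replica load exceeds the target AND queue latency
    breaches the 100 ms goal; scale down when load is well below target
    and queues are nearly empty. Delay timers (hysteresis) are omitted.
    """
    total_qps = avg_qps_per_replica * current

    if avg_qps_per_replica > target_qps and p99_queue_ms > latency_target_ms:
        # Size the fleet so each replica carries roughly target_qps.
        ideal = -(-total_qps // target_qps)  # ceiling division
        return min(int(ideal), max_replicas)

    if (avg_qps_per_replica < 0.5 * target_qps
            and p99_queue_ms < 0.1 * latency_target_ms):
        ideal = -(-total_qps // target_qps) if total_qps > 0 else min_replicas
        return max(int(ideal), min_replicas)

    return current  # within tolerance: no change
```

For example, with a target of 20 QPS per replica, 2 replicas each seeing 30 QPS with 150 ms queue latency would grow to 3 replicas; 4 nearly idle replicas would shrink toward the minimum.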
Scale-to-zero
For services with intermittent traffic, setting min_replicas: 0 enables scale-to-zero. When no requests arrive for the duration of downscale_delay_seconds, Outpost terminates all replicas and stops billing for compute.
When a new request arrives, Outpost provisions a replica and routes the request once the readiness probe passes.
Cold start optimization
To reduce cold start latency for scale-to-zero services:
- Keep setup lightweight. Pre-bake dependencies into a container or use cached layers.
- Use a fast readiness probe. Load models asynchronously and report ready once the HTTP server is listening.
- Increase downscale_delay_seconds. A longer cooldown means replicas stay warm through brief traffic gaps.
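The "fast readiness probe" tip can be sketched with Python's standard library. This is a hypothetical service, not Outpost-specific code: the endpoint paths and the simulated model load are invented for illustration. The server answers its readiness endpoint as soon as it is listening, while the heavy initialization runs in a background thread.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

model = None  # populated by the background loader


def load_model():
    """Simulate a slow model load that runs off the request path."""
    global model
    time.sleep(0.1)  # stand-in for downloading weights, warming caches, etc.
    model = "loaded"


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Report ready as soon as the server is listening, without
            # waiting for the model, so the readiness probe passes quickly.
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        elif self.path == "/predict":
            if model is None:
                self.send_response(503)  # still warming up
                self.end_headers()
            else:
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"prediction")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the example quiet


# To run: start the loader thread, then serve:
#   threading.Thread(target=load_model, daemon=True).start()
#   HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```

Whether in-flight requests should fail fast (as with the 503 above) or block until the model is ready depends on your clients' retry behavior.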
Monitoring autoscaling behavior
The Outpost dashboard provides real-time visibility into scaling decisions:
- Replica count over time: See how your service scales in response to traffic.
- QPS per replica: Verify that the autoscaler is maintaining your target.
- P99 latency: Confirm that queue latency stays under the 100ms target.
- Scaling events: A timeline of scale-up and scale-down actions with the triggering metrics.
Access these metrics from your service's Monitoring tab in the dashboard.
Best practices
Start conservative, then tune. Begin with a low target_qps_per_replica (a lower value gives each replica more headroom) and raise it gradually based on observed latency.
Set max_replicas to control costs. The autoscaler will never exceed this limit, even under extreme load. Size it based on your budget and expected peak traffic.
Use longer delays for expensive replicas. GPU instances take time to provision. Set upscale_delay_seconds high enough to avoid scaling on brief spikes, and downscale_delay_seconds long enough to avoid repeatedly paying cold start costs.
Match target_qps_per_replica to your workload profile. A lightweight API proxy might handle 100+ QPS per replica, while an LLM inference server might max out at 2-5 QPS. Benchmark your application to find the right value.
Test in staging first. Validate your autoscaling configuration under simulated load before deploying to production. Use tools like wrk, hey, or locust to generate realistic traffic patterns.
Example configurations
High-throughput API
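A possible configuration for a lightweight, high-QPS service. The values are illustrative, not recommendations; tune them against your own benchmarks.

```yaml
# Lightweight API: each replica handles many requests per second.
replica_policy:
  min_replicas: 2
  max_replicas: 50
  target_qps_per_replica: 100
  upscale_delay_seconds: 60      # react quickly to sustained growth
  downscale_delay_seconds: 600
```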
GPU model serving
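A sketch for expensive, slow-to-provision GPU replicas, following the best practice of longer delays. Values are illustrative.

```yaml
# GPU inference: low per-replica QPS, long delays to avoid churn.
replica_policy:
  min_replicas: 1
  max_replicas: 8
  target_qps_per_replica: 3
  upscale_delay_seconds: 600     # ignore brief spikes; provisioning is slow
  downscale_delay_seconds: 1800  # avoid repeated cold starts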
Development / staging (scale-to-zero)
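A sketch for an intermittently used environment, combining scale-to-zero with a modest cap. Values are illustrative.

```yaml
# Dev/staging: pay nothing while idle.
replica_policy:
  min_replicas: 0                # enables scale-to-zero
  max_replicas: 2
  target_qps_per_replica: 10
  downscale_delay_seconds: 1800  # stay warm through short idle gaps
```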
Next steps
- Deploy a Service — configuration and deployment guide
- Custom Domains — DNS, TLS, and wildcard setup