Performance & Optimization

Auto-Scaling Setup — Pay for What You Use, Handle What Comes

Running at peak capacity 24/7 wastes money. Running at minimum capacity and hoping traffic stays low risks outages. Auto-scaling dynamically adjusts your infrastructure to match demand — scaling up during traffic spikes and scaling down during quiet periods. We configure auto-scaling policies based on the right metrics for your workload, with proper cooldown periods and target tracking that actually works.

Need this done for your project?

We implement, you ship. Async, documented, done in days.

Start a Brief

Why Default Auto-Scaling Fails

AWS's default auto-scaling uses CPU utilization as the scaling metric. This is fine for CPU-bound workloads but misleading for I/O-bound applications (which most web services are). Your API might have 20% CPU utilization while being completely saturated on database connections. Scaling based on CPU adds more instances that also saturate the database, making things worse.

The second common failure is cooldown configuration. Default cooldowns are 300 seconds (5 minutes). If traffic spikes suddenly, the first scale-up event adds capacity, then the cooldown prevents additional scaling for 5 minutes — even if the new capacity is still insufficient. By the time the next scale-up triggers, requests have been timing out for minutes.
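The capacity lag described above can be made concrete with a small sketch. The numbers are illustrative (a spike needing 10 instances, starting from 2, each scale-up event adding 2), not a model of any particular AWS policy:

```python
def time_to_capacity(needed, current, step=2, cooldown_s=300):
    """Seconds spent under-provisioned when each scale-up event adds
    `step` instances and a fixed cooldown separates the events."""
    elapsed = 0
    while current < needed:
        current += step          # one scale-up event fires
        if current < needed:     # still short: wait out the full cooldown
            elapsed += cooldown_s
    return current, elapsed

# Four scale-up events with three 5-minute cooldowns between them:
cap, wait = time_to_capacity(needed=10, current=2)
```

With the default 300-second cooldown, that is 15 minutes of under-capacity; cutting the cooldown or using target tracking (which is not gated the same way) closes the gap much faster.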

Scaling to zero is another missed opportunity. Your staging environment does not need to run overnight. Your development ECS services do not need to be up on weekends. Auto-scaling to zero (or near-zero with Fargate Spot) during non-business hours can cut non-production costs by 60-70%.
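The 60-70% figure falls out of the hours alone. A sketch with an assumed hourly cost (the rate and schedule are placeholders; substitute your own):

```python
# Hypothetical staging environment: run 12 business hours on weekdays
# instead of 24/7, and compare the two monthly bills.
HOURLY_COST = 0.50            # assumed blended per-hour cost
always_on_hours = 24 * 30     # ~720 hours in a month
business_hours = 12 * 22      # ~12 h/day across ~22 weekdays

full_bill = always_on_hours * HOURLY_COST
scheduled_bill = business_hours * HOURLY_COST
savings_pct = round(100 * (1 - scheduled_bill / full_bill))  # ~63%
```

Fargate Spot discounts on top of the schedule push the total reduction toward the upper end of the range.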

Our Auto-Scaling Implementation

Metric Selection: We choose scaling metrics based on your workload characteristics. For web services, we typically use request count per target (ALB metric) — it scales based on actual load, not indirect proxies like CPU. For queue workers, we use queue depth (SQS ApproximateNumberOfMessagesVisible). For memory-intensive workloads, we use custom CloudWatch metrics for memory utilization. The metric must correlate with user-perceived performance, not just resource consumption.
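For the queue-worker case, a common heuristic is to size the fleet so the visible backlog drains within a latency budget. A minimal sketch — the throughput and drain-time numbers are assumptions, and in practice the backlog value comes from the SQS ApproximateNumberOfMessagesVisible metric:

```python
import math

def workers_for_backlog(backlog, msgs_per_worker_per_s, target_drain_s):
    """Desired worker count so the current queue backlog drains
    within target_drain_s seconds."""
    required_rate = backlog / target_drain_s          # msgs/s needed
    return max(1, math.ceil(required_rate / msgs_per_worker_per_s))

# 6,000 visible messages, 5 msg/s per worker, drain within 120 s:
workers_for_backlog(6000, 5, 120)  # -> 10 workers
```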

Target Tracking: We use target tracking policies that maintain a target value for the chosen metric. Example: maintain 100 requests per target. If traffic increases to 500 concurrent requests, the policy scales to 5 instances; when traffic drops, it scales back down. Target tracking handles both scale-up and scale-down automatically without step-scaling configuration.
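The proportional scaling behind target tracking can be sketched in a few lines. This is the general idea, not AWS's exact internal algorithm (which also applies cooldowns and instance warm-up):

```python
import math

def target_tracking_desired(current_count, metric_value, target_value,
                            min_count, max_count):
    """Scale the fleet proportionally so the per-target metric returns
    to its target, clamped to the configured capacity limits."""
    desired = math.ceil(current_count * metric_value / target_value)
    return min(max(desired, min_count), max_count)

# The example above: 1 instance seeing 500 requests against a target of
# 100 requests per target, in a 1-20 fleet:
target_tracking_desired(1, 500, 100, 1, 20)  # -> 5
```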

ECS Auto-Scaling: For ECS Fargate, we configure Application Auto Scaling with target tracking on request count or CPU. Minimum and maximum task counts are set per environment — production might scale 2-20, staging 0-3. We enable Fargate Spot for non-production workloads to reduce costs by up to 70%.
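One way this can be wired up is with boto3's Application Auto Scaling client. The cluster, service, and load-balancer names below are placeholders, and the target value and cooldowns are illustrative defaults, not recommendations:

```python
import boto3

aas = boto3.client("application-autoscaling")

# Register the ECS service's DesiredCount as a scalable target (2-20).
aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",   # placeholder names
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=20,
)

# Target tracking on ALB request count per target.
aas.put_scaling_policy(
    PolicyName="request-count-target-tracking",
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # requests per target, per the example above
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ALBRequestCountPerTarget",
            # ResourceLabel ties the metric to a specific target group;
            # the value below is a placeholder format, not a real ARN suffix.
            "ResourceLabel": "app/my-alb/123abc/targetgroup/my-tg/456def",
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 120,
    },
)
```

This is a configuration sketch: it requires AWS credentials and an existing service to run, and Terraform or CloudFormation expressing the same two resources is the more common delivery format.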

Kubernetes HPA and Karpenter: For Kubernetes, we configure Horizontal Pod Autoscaler (HPA) with custom metrics from Prometheus via the Prometheus Adapter. Karpenter (or Cluster Autoscaler) provisions new nodes when pods cannot be scheduled. We configure pod disruption budgets to ensure scaling events do not take down more pods than your service can tolerate.

Scheduled Scaling: For predictable traffic patterns (business hours, weekday peaks), we add scheduled scaling actions that pre-provision capacity before the load arrives. This avoids the reactive delay of metric-based scaling and ensures capacity is ready when users show up. For non-production, scheduled actions scale to zero outside business hours.
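The schedule logic amounts to a capacity floor keyed on time of day and environment. A sketch with assumed hours and counts (both are illustrative, tuned per client in practice):

```python
from datetime import datetime

def scheduled_min_capacity(now: datetime, environment: str) -> int:
    """Minimum capacity by schedule: production pre-provisions ahead of
    the weekday peak; non-production scales to zero off-hours."""
    weekday = now.weekday() < 5                    # Mon-Fri
    business = weekday and 8 <= now.hour < 20      # assumed 8:00-20:00
    if environment == "production":
        return 6 if business else 2   # never below 2 in production
    return 1 if business else 0       # staging/dev drop to zero

scheduled_min_capacity(datetime(2024, 6, 3, 9, 0), "staging")  # Monday 9:00 -> 1
```

On AWS this becomes scheduled actions (cron expressions) on the Auto Scaling group or Application Auto Scaling target, with metric-based policies still handling variation inside the window.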

Predictive Scaling: For workloads with consistent daily patterns, we configure AWS Predictive Scaling, which uses machine learning to forecast traffic and pre-scale proactively. This works best for services with regular, repeatable traffic patterns — e-commerce sites, B2B SaaS with business-hours usage, or media sites with morning traffic peaks.

What You Get

A production auto-scaling configuration:

  • Metric-based policies — scaling on the metric that correlates with actual user impact
  • Target tracking — automatic scale-up and scale-down to maintain target performance
  • Environment-appropriate limits — production scales wide, non-production scales to zero
  • Scheduled scaling — pre-provisioned capacity for predictable patterns
  • Cooldown tuning — appropriate cooldowns that balance responsiveness and stability
  • Cost optimization — Fargate Spot, scheduled scale-down, and right-sized minimum capacity
  • Monitoring — scaling event dashboards, capacity headroom alerts, and cost-per-request tracking

Why Anubiz Engineering

100% async — no calls, no meetings
Delivered in days, not weeks
Full documentation included
Production-grade from day one
Security-first approach
Post-delivery support included

Ready to get started?

Skip the research. Tell us what you need, and we'll scope it, implement it, and hand it back — fully documented and production-ready.