Kubernetes Horizontal Pod Autoscaler (HPA): Complete Guide

The Horizontal Pod Autoscaler (HPA) automatically adjusts the number of pod replicas in a Deployment, or another scalable workload such as a StatefulSet, based on observed metrics. When traffic spikes, HPA scales up to handle load; when traffic drops, it scales down to save resources. Properly configured, it keeps your application responsive without over-provisioning.

Basic HPA with CPU and Memory Metrics

Create an HPA with `kubectl autoscale deployment my-app --cpu-percent=70 --min=2 --max=10`. This tells Kubernetes to target an average CPU utilization of 70% across all pods of the Deployment, scaling between 2 and 10 replicas. The metrics-server must be running in your cluster. Some managed Kubernetes services install it by default and others do not, so check first: `kubectl top pods` only returns data when it is available. You can also target memory utilization by defining the HPA in YAML with `type: Resource` and `name: memory`. Always set resource requests on your containers, since HPA calculates utilization as a percentage of the requested amount and cannot compute it for pods without requests.
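As a minimal sketch of the YAML form, the manifest below targets both CPU and memory utilization using the `autoscaling/v2` API. The Deployment name `my-app`, the replica bounds, and the 70% CPU target mirror the command above; the 80% memory target is an illustrative assumption rather than a recommendation.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app            # the workload to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the CPU request, averaged over pods
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80   # assumed value, for illustration only
```

When several metrics are listed, HPA computes a desired replica count for each one and uses the largest, so either signal can trigger a scale-up.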

Custom Metrics with Prometheus Adapter

CPU and memory are often poor scaling signals. For web applications, requests-per-second or queue depth are better indicators. The Prometheus Adapter exposes Prometheus metrics as Kubernetes custom metrics, which HPA can consume. Install the adapter, configure a metrics rule that maps a PromQL query to a Kubernetes metric name, and reference it in your HPA spec with `type: Pods` or `type: Object`. For example, scale based on `http_requests_per_second` averaged across pods, or scale a worker deployment based on the length of a RabbitMQ queue.
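The two fragments below sketch how the pieces might fit together. They assume the application exports a Prometheus counter named `http_requests_total` carrying `namespace` and `pod` labels; that counter name, the 2-minute rate window, and the target of 100 requests per second per pod are all hypothetical values chosen for illustration. The first fragment is a Prometheus Adapter rule that turns the counter into a per-pod rate called `http_requests_per_second`:

```yaml
# Prometheus Adapter rules configuration (assumed counter: http_requests_total)
rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}   # map the label to the Kubernetes resource
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"                  # exposed as http_requests_per_second
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

The second fragment is an HPA that consumes that metric with `type: Pods`, averaging the value across the pods of the target Deployment:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"   # assumed target: 100 req/s per pod
```

Before relying on this, it is worth confirming that the metric is actually served, for example with `kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1`.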

Scaling Behavior and Stabilization

The `autoscaling/v2` API adds a `behavior` field that controls how fast scaling occurs. Set `scaleUp.stabilizationWindowSeconds` (default 0) to a nonzero value to avoid reacting to brief traffic spikes, and tune `scaleDown.stabilizationWindowSeconds` (default 300s) to avoid premature scale-down. You can also limit the rate of change: a policy with `type: Percent`, `value: 100`, and `periodSeconds: 60` means HPA can at most double the replica count within any 60-second period. This prevents a sudden burst in a metric from scaling to max replicas instantly, which could overwhelm downstream services.
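A minimal sketch of such a configuration follows. The window lengths, the 60-second periods, and the scale-down limit of 2 pods per period are assumptions picked for illustration; suitable values depend on how quickly your pods start and how spiky your traffic is.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60   # assumed: ignore spikes shorter than a minute
      policies:
        - type: Percent
          value: 100                   # at most double the replicas...
          periodSeconds: 60            # ...within any 60-second period
    scaleDown:
      stabilizationWindowSeconds: 300  # the default, shown explicitly
      policies:
        - type: Pods
          value: 2                     # assumed: remove at most 2 pods per period
          periodSeconds: 60
```

If a direction lists several policies, HPA by default applies the one that allows the largest change; setting `selectPolicy: Min` in that direction picks the most restrictive one instead.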
