Horizontal Pod Autoscaling

This page provides an overview of Horizontal Pod Autoscaler (HPA) and explains how it works. You can also read about how to configure and use Horizontal Pod Autoscaler on your clusters.

HPA changes the shape of your Kubernetes workload by automatically increasing or decreasing the number of Pods in response to the workload's CPU or memory consumption, or in response to custom metrics reported from within Kubernetes or external metrics from sources outside of your cluster.

HPA cannot be used for workloads that cannot be scaled, such as DaemonSets.


When you first deploy your workload to a Kubernetes cluster, you may not be sure about its resource requirements and how those requirements might change depending on usage patterns, external dependencies, or other factors. HPA helps to ensure that your workload functions consistently in different situations, and allows you to control costs by only paying for extra capacity when you need it.

It's not always easy to predict the indicators that show whether your workload is under-resourced or under-utilized. HPA can automatically scale the number of Pods in your workload based on one or more metrics of the following types:

  • Actual resource usage: when a given Pod's CPU or memory usage exceeds a threshold. This can be expressed as a raw value or as a percentage of the amount the Pod requests for that resource.

  • Custom metrics: based on any metric reported by a Kubernetes object in a cluster, such as the rate of client requests per second or I/O writes per second.

    This can be useful if your application is prone to network bottlenecks rather than CPU or memory bottlenecks.

  • External metrics: based on a metric from an application or service external to your cluster.

    For example, your workload might need more CPU when ingesting a large number of requests from a pipeline such as Pub/Sub. You can create an external metric for the size of the queue, and configure HPA to automatically increase the number of Pods when the queue size reaches a given threshold, and to reduce the number of Pods when the queue size shrinks.
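The three metric types above can be sketched as a single HPA manifest. The following is a minimal illustration, assuming the `autoscaling/v2` API; the object names (`example-hpa`, `example-deployment`), the metric names (`requests_per_second`, `queue_depth`), and all target values are hypothetical placeholders rather than values from this page.

```python
import json


def build_hpa_manifest(name, target_deployment, min_replicas, max_replicas):
    """Return an autoscaling/v2 HorizontalPodAutoscaler manifest as a dict.

    All metric names and target values below are illustrative assumptions.
    """
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": target_deployment,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [
                # Actual resource usage: CPU as a percentage of the Pod's request.
                {
                    "type": "Resource",
                    "resource": {
                        "name": "cpu",
                        "target": {"type": "Utilization", "averageUtilization": 60},
                    },
                },
                # Custom metric reported per Pod (hypothetical metric name).
                {
                    "type": "Pods",
                    "pods": {
                        "metric": {"name": "requests_per_second"},
                        "target": {"type": "AverageValue", "averageValue": "100"},
                    },
                },
                # External metric, such as a queue size (hypothetical metric name).
                {
                    "type": "External",
                    "external": {
                        "metric": {"name": "queue_depth"},
                        "target": {"type": "AverageValue", "averageValue": "30"},
                    },
                },
            ],
        },
    }


manifest = build_hpa_manifest("example-hpa", "example-deployment", 2, 10)
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON manifests as well as YAML, the printed output could be saved to a file and applied with kubectl apply -f, once the placeholder names are replaced with real ones.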

You can combine HPA with Vertical Pod Autoscaler, with some limitations, discussed later in this topic.

How HPA works

Each configured Horizontal Pod Autoscaler object operates using a control loop. A separate HPA object exists for each workload. Each HPA object periodically checks a given workload's metrics against the target thresholds you configure, and changes the shape of the workload automatically.

Per-Pod resources

For resources that are allocated per-Pod, such as CPU, the controller queries the resource metrics API for each container running in the Pod.

  • If you specify a raw value for CPU or memory, the value is used.
  • If you specify a percentage value for CPU or memory, HPA calculates the average utilization value as a percentage of that Pod's CPU or memory requests.
  • Custom and external metrics are expressed as raw values or average values.

The controller uses the average or raw value for a reported metric to produce a ratio, and uses that ratio to autoscale the workload. You can read a description of the Horizontal Pod Autoscaler algorithm in the Kubernetes project documentation.
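The ratio calculation can be sketched as follows. This is a simplified illustration of the formula given in the Kubernetes project documentation, desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with a tolerance of 0.1 assumed as the default. The function name is hypothetical, and the real controller also accounts for details such as Pods that are not ready and the configured minimum and maximum replica counts.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Sketch of the HPA ratio calculation (not the actual controller code).

    current_value and target_value are the observed and target metric values,
    for example average CPU utilization percentages.
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    ratio = current_value / target_value
    # Skip scaling when the ratio is close enough to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# 4 Pods at 90% average utilization against a 60% target: scale up.
print(desired_replicas(4, 90, 60))
# 4 Pods at 30% average utilization against a 60% target: scale down.
print(desired_replicas(4, 30, 60))
# 4 Pods at 63% against a 60% target: within tolerance, no change.
print(desired_replicas(4, 63, 60))
```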

Responding to multiple metrics

If you configure a workload to autoscale based on multiple metrics, HPA evaluates each metric separately and uses the scaling algorithm to determine the new workload scale based on each one. The largest scale is selected for the autoscale action.

If one or more of the metrics are unavailable for some reason, HPA still scales up based on the largest size calculated, but does not scale down.
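The selection rule described above can be sketched as follows. This is a hedged illustration, not the controller's implementation: the function names are hypothetical, each metric is modeled as a (current value, target value) pair, and an unavailable metric is modeled as a current value of None.

```python
import math


def proposal(current_replicas, current_value, target_value):
    """Replica count proposed by one metric, or None if the metric is unavailable."""
    if current_value is None:
        return None
    return math.ceil(current_replicas * current_value / target_value)


def combine(current_replicas, metrics):
    """Pick the largest proposal; refuse to scale down if any metric is missing."""
    proposals = [proposal(current_replicas, value, target) for value, target in metrics]
    available = [p for p in proposals if p is not None]
    if not available:
        # Nothing to base a decision on: keep the current size.
        return current_replicas
    largest = max(available)
    any_missing = len(available) < len(proposals)
    if any_missing and largest < current_replicas:
        # A missing metric might have called for more Pods, so do not shrink.
        return current_replicas
    return largest


# Two healthy metrics proposing 6 and 2 replicas: the larger wins.
print(combine(4, [(90, 60), (50, 100)]))
# One metric proposes scaling down to 2 but another is unavailable: stay at 4.
print(combine(4, [(30, 60), (None, 100)]))
# One metric proposes scaling up to 6 while another is unavailable: scale up.
print(combine(4, [(90, 60), (None, 100)]))
```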

Preventing thrashing

Thrashing refers to a situation in which HPA attempts to perform subsequent autoscaling actions before the workload finishes responding to prior autoscaling actions. To prevent thrashing, HPA chooses the largest recommendation from the last five minutes.
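One way to picture this behavior is a sliding window over recent recommendations. The sketch below is an illustration under assumptions, not the controller's code: the class name `Stabilizer` is hypothetical, timestamps are plain seconds, and the window defaults to 300 seconds to match the five minutes mentioned above.

```python
from collections import deque


class Stabilizer:
    """Keep recent recommendations and return the largest one in the window."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp, recommended replicas)

    def recommend(self, now, replicas):
        """Record a new recommendation and return the stabilized replica count."""
        self._history.append((now, replicas))
        # Drop recommendations that are older than the window.
        cutoff = now - self.window_seconds
        while self._history and self._history[0][0] < cutoff:
            self._history.popleft()
        return max(r for _, r in self._history)


stabilizer = Stabilizer()
print(stabilizer.recommend(0, 10))   # first recommendation is used as-is
print(stabilizer.recommend(60, 4))   # the earlier, larger recommendation still applies
print(stabilizer.recommend(301, 4))  # the recommendation from t=0 has expired
```

With this approach, a brief dip in load does not shrink the workload immediately; the count drops only after the higher recommendations have aged out of the window.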

Limitations

  • Do not use HPA together with Vertical Pod Autoscaling (VPA) on CPU or memory. However, you can use HPA with VPA if HPA evaluates metrics other than CPU or memory.
  • If you have a Deployment, don't configure HPA on the ReplicaSet or Replication Controller backing it. When you perform a rolling update, the backing ReplicaSet or Replication Controller is effectively replaced by a new one, so an HPA bound to the old object is not bound to its replacement. Instead, configure HPA on the Deployment itself.

Interacting with HPA objects

You can configure an HPA for a workload, and get information about autoscaling events and what caused them, by visiting the Workloads page in Google Cloud Console.

Each HPA exists in the cluster as an hpa object. You can use commands like kubectl get hpa or kubectl describe hpa [HPA-NAME] to interact with these objects.

You can also create hpa objects using the kubectl autoscale command.

To learn more about configuring and observing HPAs, visit Configure a Horizontal Pod Autoscaler.

What's next