Horizontal Pod autoscaling

This page provides an overview of horizontal Pod autoscaling and explains how it works in Google Kubernetes Engine (GKE). You can also read about how to configure and use horizontal Pod autoscaling on your clusters.

The Horizontal Pod Autoscaler changes the shape of your Kubernetes workload by automatically increasing or decreasing the number of Pods in response to the workload's CPU or memory consumption, or in response to custom metrics reported from within Kubernetes or external metrics from sources outside of your cluster.

GKE clusters with node auto-provisioning automatically scale the number of nodes in the cluster based on changes in the number of Pods. For that reason, we recommend that you use horizontal Pod autoscaling for all clusters.

Why use horizontal Pod autoscaling

When you first deploy your workload to a Kubernetes cluster, you may not be sure about its resource requirements and how those requirements might change depending on usage patterns, external dependencies, or other factors. Horizontal Pod autoscaling helps to ensure that your workload functions consistently in different situations, and lets you control costs by only paying for extra capacity when you need it.

It's not always easy to predict the indicators that show whether your workload is under-resourced or under-utilized. The Horizontal Pod Autoscaler can automatically scale the number of Pods in your workload based on one or more metrics of the following types:

Actual resource usage: when a given Pod's CPU or memory usage exceeds a threshold. This can be expressed as a raw value or as a percentage of the amount the Pod requests for that resource.
Custom metrics: based on any metric reported by a Kubernetes object in a cluster, such as the rate of client requests per second or I/O writes per second. This can be useful if your application is prone to network bottlenecks, rather than CPU or memory.
External metrics: based on a metric from an application or service external to your cluster. For example, your workload might need more CPU when ingesting a large number of requests from a pipeline such as Pub/Sub. You can create an external metric for the size of the queue, and configure the Horizontal Pod Autoscaler to automatically increase the number of Pods when the queue size reaches a given threshold, and to reduce the number of Pods when the queue size shrinks.
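As a rough illustration of the queue example, the following Python sketch derives a Pod count from a queue size. The function name `replicas_for_queue`, the per-Pod target of 100 messages, and the replica bounds are assumptions chosen for illustration. The sketch only shows the basic proportional idea and leaves out the other checks the Horizontal Pod Autoscaler performs.

```python
import math


def replicas_for_queue(queue_size: int, target_per_pod: int,
                       min_replicas: int, max_replicas: int) -> int:
    """Return a Pod count so that each Pod handles about target_per_pod messages.

    All parameters are hypothetical inputs for this sketch.
    """
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    # Enough Pods to bring the per-Pod share of the queue down to the target.
    wanted = math.ceil(queue_size / target_per_pod)
    # Keep the result inside the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


for size in (0, 950, 5000):
    print(size, replicas_for_queue(size, 100, 1, 20))
```

With these assumed values, the Pod count grows as the queue grows and shrinks again as the queue drains, always staying between the minimum and maximum.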

You can combine a Horizontal Pod Autoscaler with a Vertical Pod Autoscaler, with some limitations.

How horizontal Pod autoscaling works

Each configured Horizontal Pod Autoscaler operates using a control loop. A separate Horizontal Pod Autoscaler exists for each workload. Each Horizontal Pod Autoscaler periodically checks a given workload's metrics against the target thresholds you configure, and changes the shape of the workload automatically.

Per-Pod resources

For resources that are allocated per-Pod, such as CPU, the controller queries the resource metrics API for each container running in the Pod.

If you specify a raw value for CPU or memory, the value is used.
If you specify a percentage value for CPU or memory, the Horizontal Pod Autoscaler calculates the average utilization value as a percentage of that Pod's CPU or memory requests.
Custom and external metrics are expressed as raw values or average values.

The controller uses the average or raw value for a reported metric to produce a ratio, and uses that ratio to autoscale the workload. You can read a description of the Horizontal Pod Autoscaler algorithm in the Kubernetes project documentation.
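The Kubernetes project documentation describes the ratio as `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`. The following Python sketch applies that formula to a CPU utilization target. It is a simplified illustration, not the controller's implementation: the function names are hypothetical, utilization is computed as total usage over total requests across Pods, and the tolerance of 0.1 is the upstream Kubernetes default, which is assumed here rather than stated on this page.

```python
import math
from typing import Sequence


def utilization_percent(usage_millicores: Sequence[float],
                        request_millicores: Sequence[float]) -> float:
    """Usage across all Pods as a percentage of what those Pods request."""
    return 100.0 * sum(usage_millicores) / sum(request_millicores)


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Apply the ratio formula; skip scaling when the ratio is close to 1."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # within tolerance: leave the workload as-is
    return math.ceil(current_replicas * ratio)


# Three Pods, each requesting 500m CPU, with a 50% utilization target.
current = utilization_percent([300, 500, 400], [500, 500, 500])
print(current)                           # 80.0
print(desired_replicas(3, current, 50))  # 5
```

In this example the workload runs at 80% of its requests against a 50% target, so the ratio is 1.6 and three Pods become five.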

Responding to multiple metrics

If you configure a workload to autoscale based on multiple metrics, the Horizontal Pod Autoscaler evaluates each metric separately and uses the scaling algorithm to determine the new workload scale based on each one. The largest scale is selected for the autoscale action.

If one or more of the metrics are unavailable for some reason, the Horizontal Pod Autoscaler still scales up based on the largest size calculated, but does not scale down.
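The selection rule above can be sketched in Python as follows. The function name `combine_recommendations` and the use of `None` to mark an unavailable metric are assumptions made for this illustration; the sketch models only the behavior described in this section.

```python
from typing import Optional, Sequence


def combine_recommendations(current_replicas: int,
                            proposals: Sequence[Optional[int]]) -> int:
    """Pick a replica count from per-metric proposals.

    Each entry in proposals is the scale computed for one metric,
    or None when that metric is unavailable.
    """
    available = [p for p in proposals if p is not None]
    if not available:
        return current_replicas  # nothing to act on
    largest = max(available)
    some_missing = len(available) < len(proposals)
    if some_missing and largest < current_replicas:
        return current_replicas  # don't scale down on incomplete data
    return largest


print(combine_recommendations(4, [6, 3]))     # 6
print(combine_recommendations(4, [2, None]))  # 4
print(combine_recommendations(4, [7, None]))  # 7
```

The second call keeps four replicas because one metric is missing and the remaining metric would scale the workload down. The third call still scales up to seven.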

Preventing thrashing

Thrashing refers to a situation in which the Horizontal Pod Autoscaler attempts to perform subsequent autoscaling actions before the workload finishes responding to prior autoscaling actions. To prevent thrashing, the Horizontal Pod Autoscaler chooses the largest recommendation made during the last five minutes.
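One way to picture this behavior is a sliding window over recent recommendations. The following Python sketch is a minimal model under stated assumptions: the `Stabilizer` class, its method names, and the use of timestamps in seconds are hypothetical, and only the five-minute window comes from this page.

```python
from collections import deque


class Stabilizer:
    """Keep recent recommendations and return the largest one in the window."""

    def __init__(self, window_seconds: float = 300.0):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp, replicas) pairs, oldest first

    def recommend(self, now: float, proposed: int) -> int:
        self._history.append((now, proposed))
        cutoff = now - self.window_seconds
        # Drop recommendations that are older than the window.
        while self._history and self._history[0][0] < cutoff:
            self._history.popleft()
        return max(replicas for _, replicas in self._history)


s = Stabilizer()
print(s.recommend(0, 10))    # 10
print(s.recommend(60, 4))    # 10: the earlier, larger recommendation still counts
print(s.recommend(301, 4))   # 4: the recommendation from t=0 has expired
```

Because the largest recent recommendation wins, a brief dip in load does not immediately remove Pods, while a new higher recommendation takes effect right away.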

Limitations

Don't use the Horizontal Pod Autoscaler together with the Vertical Pod Autoscaler on CPU or memory. You can use the Horizontal Pod Autoscaler with the Vertical Pod Autoscaler for other metrics. You can configure multidimensional Pod autoscaling (in beta) in order to scale horizontally on CPU and vertically on memory at the same time.
If you have a Deployment, don't configure horizontal Pod autoscaling on the ReplicaSet or Replication Controller backing it. When you perform a rolling update, the backing ReplicaSet or Replication Controller is replaced by a new one. Instead, configure horizontal Pod autoscaling on the Deployment itself.
You can't use horizontal Pod autoscaling for workloads that can't be scaled, such as DaemonSets.
Horizontal Pod autoscaling exposes metrics as Kubernetes resources, which imposes limitations on metric names. For example, metric names can't contain uppercase letters or '/' characters. Your metric adapter might allow renaming. For example, see the prometheus-adapter `as` operator.
The Horizontal Pod Autoscaler won't scale down if any of the metrics that it's configured to monitor is unavailable. To check if you have unavailable metrics, see Viewing details about a Horizontal Pod Autoscaler.

Scalability

While the Horizontal Pod Autoscaler doesn't have a hard limit on the number of supported HPA objects, its performance can be affected as this number grows. Specifically, the period between HPA recalculations might become longer than the standard 15 seconds.

In GKE minor version 1.22 or later, the recalculation period should stay within 15 seconds with up to 300 HPA objects.
In GKE minor version 1.31 or later, if the Performance HPA profile is configured, the recalculation period should stay within 15 seconds with up to 1,000 HPA objects. Learn how to configure the Performance HPA profile.
In GKE minor version 1.33 or later, if the Performance HPA profile is configured, the recalculation period should stay within 15 seconds with up to 5,000 HPA objects. The Performance HPA profile is enabled by default on all clusters that meet the requirements.

The following factors can also affect performance:

Scaling on multiple metrics: each metric adds a fetch call for recommendation calculations, which affects the recalculation period.
The latency of the custom metrics stack: response times above approximately 50 milliseconds are higher than those typically observed with the standard Kubernetes metrics, and they lengthen the recalculation period.

Interacting with `HorizontalPodAutoscaler` objects

You can configure a Horizontal Pod Autoscaler for a workload, and get information about autoscaling events and what caused them, by visiting the Workloads page in the Google Cloud console.

Each Horizontal Pod Autoscaler exists in the cluster as a `HorizontalPodAutoscaler` object. You can use commands like `kubectl get hpa` or `kubectl describe hpa HPA_NAME` to interact with these objects.

You can also create `HorizontalPodAutoscaler` objects using the `kubectl autoscale` command.
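To show what such an object might contain, the following Python sketch builds a minimal manifest that uses the `autoscaling/v2` API and prints it as JSON. The helper name `cpu_hpa_manifest`, the object name `web-hpa`, the Deployment name `web`, and the numeric values are all hypothetical. Treat this as a starting point under those assumptions, not as a recommended configuration.

```python
import json


def cpu_hpa_manifest(name: str, deployment: str, min_replicas: int,
                     max_replicas: int, cpu_percent: int) -> dict:
    """Build a HorizontalPodAutoscaler manifest that targets CPU utilization."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            # Target the Deployment itself, not the ReplicaSet backing it.
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    # Percentage of the CPU that the Pods request.
                    "target": {"type": "Utilization",
                               "averageUtilization": cpu_percent},
                },
            }],
        },
    }


manifest = cpu_hpa_manifest("web-hpa", "web", 1, 10, 50)
print(json.dumps(manifest, indent=2))
```

You can save the printed JSON to a file and apply it with `kubectl apply -f`, then inspect the result with `kubectl describe hpa web-hpa`.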

What's next

Learn how to configure horizontal Pod autoscaling
Learn how to manually scale an application
Learn how to scale to zero using KEDA
Learn more about Vertical Pod Autoscaler
Learn more about Cluster Autoscaler