This page explains how to automatically resize your Google Kubernetes Engine (GKE) cluster's node pools based on the demands of your workloads. When demand is high, the cluster autoscaler adds nodes to the node pool. When demand is low, the cluster autoscaler scales back down to a minimum size that you designate. This can increase the availability of your workloads when you need it, while controlling costs. You can also configure cluster autoscaler on a cluster.
GKE's cluster autoscaler automatically resizes the number of nodes in a given node pool, based on the demands of your workloads. You don't need to manually add or remove nodes or over-provision your node pools. Instead, you specify a minimum and maximum size for the node pool, and the rest is automatic.
If resources are deleted or moved when autoscaling your cluster, your workloads might experience transient disruption. For example, if your workload consists of a controller with a single replica, that replica's Pod might be rescheduled onto a different node if its current node is deleted. Before enabling cluster autoscaler, design your workloads to tolerate potential disruption or ensure that critical Pods are not interrupted.
How cluster autoscaler works
Cluster autoscaler works on a per-node pool basis. When you configure a node pool with cluster autoscaler, you specify a minimum and maximum size for the node pool.
Cluster autoscaler increases or decreases the size of the node pool automatically, based on the resource requests (rather than actual resource utilization) of Pods running on that node pool's nodes. It periodically checks the status of Pods and nodes, and takes action:
- If Pods are unschedulable because there are not enough nodes in the node pool, cluster autoscaler adds nodes, up to the maximum size of the node pool.
- If nodes are under-utilized, and all Pods could be scheduled even with fewer nodes in the node pool, cluster autoscaler removes nodes, down to the minimum size of the node pool. If a node cannot be drained gracefully after a timeout period (currently 10 minutes), the node is forcibly terminated. The grace period is not configurable for GKE clusters.
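The periodic check described above can be sketched as a simplified Python model. This is illustrative only and is not GKE's implementation: `PoolState` and `next_size` are hypothetical names, and changing the pool by one node per check is an assumption made to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class PoolState:
    """Hypothetical snapshot of one node pool at a single periodic check."""
    current_nodes: int
    min_nodes: int
    max_nodes: int
    unschedulable_pods: int  # Pods whose resource requests fit on no existing node
    removable_nodes: int     # under-utilized nodes whose Pods would fit elsewhere


def next_size(state: PoolState) -> int:
    """Return the node pool size after one check (assumes one node per step)."""
    if state.unschedulable_pods > 0:
        # Scale up, but never past the configured maximum.
        if state.current_nodes < state.max_nodes:
            return state.current_nodes + 1
        return state.current_nodes
    if state.removable_nodes > 0 and state.current_nodes > state.min_nodes:
        # Scale down, but never below the configured minimum.
        return state.current_nodes - 1
    return state.current_nodes
```

In this model, a pool that is already at its maximum size stays unchanged even when Pods are unschedulable, and a pool at its minimum size keeps its under-utilized nodes.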
If your Pods have requested too few resources (or haven't changed the defaults, which might be insufficient) and your nodes are experiencing shortages, cluster autoscaler does not correct the situation. You can help ensure cluster autoscaler works as accurately as possible by making explicit resource requests for all of your workloads.
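To illustrate why under-sized requests go unnoticed, the following minimal Python sketch compares what a request-based view sees with what a node is actually doing. The function name and the numbers are hypothetical, and CPU is expressed in millicores.

```python
def fraction_of_allocatable(values_mcpu, allocatable_mcpu):
    """Return the share of a node's allocatable CPU covered by the given values."""
    return sum(values_mcpu) / allocatable_mcpu


# Hypothetical node with 4000 millicores allocatable, running four Pods.
allocatable = 4000
requests = [100, 100, 100, 100]   # what each Pod asked for
actual_use = [900, 900, 900, 900] # what each Pod really consumes

# A request-based view sees a nearly empty node.
requested = fraction_of_allocatable(requests, allocatable)
# The node itself is close to saturation.
used = fraction_of_allocatable(actual_use, allocatable)
```

Because scaling decisions follow `requested` rather than `used`, the node in this example looks under-utilized even though it is close to running out of CPU.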
Cluster autoscaler makes the following assumptions when resizing a node pool:
- All replicated Pods can be restarted on some other node, possibly causing a brief disruption. If your services are not disruption-tolerant, using cluster autoscaler is not recommended.
- Users or administrators are not manually managing nodes. Cluster autoscaler can override any manual node management operations you perform.
- All nodes in a single node pool have the same set of labels.
- Cluster autoscaler considers the relative cost of the instance types in the various pools, and attempts to expand the least expensive possible node pool. The reduced cost of node pools containing preemptible VMs is taken into account.
- Labels manually added after initial cluster or node pool creation are not tracked. Nodes created by cluster autoscaler are assigned the labels specified with --node-labels at the time of node pool creation.
Balancing across zones
If your node pool contains multiple managed instance groups with the same instance type, cluster autoscaler attempts to keep these managed instance group sizes balanced when scaling up. This can help prevent an uneven distribution of nodes among managed instance groups in multiple zones of a node pool.
For more information on how cluster autoscaler makes balancing decisions, see the Kubernetes documentation's FAQ for autoscaling.
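As a rough illustration of balanced scale-up, the Python sketch below places each new node in the smallest managed instance group. It is a simplified model under that single assumption, not the algorithm cluster autoscaler actually uses. The function name is hypothetical.

```python
from typing import Dict


def balanced_scale_up(group_sizes: Dict[str, int], nodes_to_add: int) -> Dict[str, int]:
    """Add nodes one at a time, always to the currently smallest group."""
    sizes = dict(group_sizes)  # copy so the caller's mapping is not modified
    for _ in range(nodes_to_add):
        # Ties are broken by zone name so the result is deterministic.
        smallest = min(sizes, key=lambda zone: (sizes[zone], zone))
        sizes[smallest] += 1
    return sizes
```

For example, starting from groups of size 2, 1, and 1 and adding two nodes, this model ends with two nodes in every group.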
Minimum and maximum node pool size
You can specify the minimum and maximum size for each node pool in your cluster, and cluster autoscaler makes rescaling decisions within these boundaries. If the current node pool size is lower than the specified minimum or greater than the specified maximum when you enable autoscaling, the autoscaler waits to take effect until a new node is needed in the node pool or until a node can be safely deleted from the node pool.
When you autoscale clusters, node pool scaling limits are determined by zone availability.
For example, the following command creates an autoscaling multi-zonal cluster with six nodes across three zones, with a minimum of one node per zone and a maximum of four nodes per zone:
gcloud container clusters create example-cluster \
    --zone us-central1-a \
    --node-locations us-central1-a,us-central1-b,us-central1-f \
    --num-nodes 2 --enable-autoscaling --min-nodes 1 --max-nodes 4
The total size of this cluster is between three and twelve nodes, spread across three zones. If one of the zones fails, the total size of the cluster becomes between two and eight nodes.
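The arithmetic behind these totals is the per-zone limit multiplied by the number of available zones. A minimal Python sketch, using a hypothetical helper name:

```python
def cluster_size_bounds(min_per_zone, max_per_zone, available_zones):
    """Return (minimum, maximum) total nodes for per-zone autoscaling limits."""
    return (min_per_zone * available_zones, max_per_zone * available_zones)


# Limits from the example command: --min-nodes 1 --max-nodes 4, three zones.
all_zones_up = cluster_size_bounds(1, 4, 3)
# The same limits with one zone unavailable.
one_zone_down = cluster_size_bounds(1, 4, 2)
```

Here `all_zones_up` is `(3, 12)` and `one_zone_down` is `(2, 8)`, matching the totals described above.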
The decision of when to remove a node is a trade-off between optimizing for utilization or the availability of resources. Removing underutilized nodes improves cluster utilization, but new workloads might have to wait for resources to be provisioned again before they can run.
You can specify which autoscaling profile to use when making such decisions. The currently available profiles are:
- balanced: The default profile.
- optimize-utilization: Prioritize optimizing utilization over keeping spare resources in the cluster. When selected, the cluster autoscaler scales down the cluster more aggressively: it can remove more nodes, and remove nodes faster. This profile has been optimized for use with batch workloads that are not sensitive to start-up latency. We do not currently recommend using this profile with serving workloads.
In GKE version 1.18 and later, when you specify the
optimize-utilization autoscaling profile, GKE prefers to
schedule Pods in nodes that already have high utilization, helping the cluster
autoscaler to identify and remove underutilized nodes. To achieve this
optimization, GKE sets the scheduler name in the Pod spec to
gke.io/optimize-utilization-scheduler. Pods that specify a custom scheduler
are not affected.
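The effect of preferring highly utilized nodes can be sketched in Python as follows. This is a simplified model of a "fullest node that still fits" choice, not the GKE scheduler itself. The function name and the node data layout are hypothetical, and only CPU requests (in millicores) are considered.

```python
from typing import Dict, Optional, Tuple


def pick_fullest_node(nodes: Dict[str, Tuple[int, int]],
                      pod_request_mcpu: int) -> Optional[str]:
    """Pick the node with the highest requested share that can still fit the Pod.

    nodes maps a node name to (already_requested_mcpu, allocatable_mcpu).
    Returns None when the Pod fits on no node.
    """
    best_name = None
    best_fraction = -1.0
    for name, (requested, allocatable) in sorted(nodes.items()):
        if requested + pod_request_mcpu > allocatable:
            continue  # the Pod's request does not fit on this node
        fraction = requested / allocatable
        if fraction > best_fraction:
            best_name, best_fraction = name, fraction
    return best_name
```

Packing Pods onto already busy nodes leaves other nodes emptier, which makes them easier for the autoscaler to identify as removable.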
The following command enables the optimize-utilization autoscaling profile in an existing cluster:
gcloud beta container clusters update example-cluster \
    --autoscaling-profile optimize-utilization
Considering Pod scheduling and disruption
When scaling down, cluster autoscaler respects scheduling and eviction rules set on Pods. These restrictions can prevent a node from being deleted by the autoscaler. A node's deletion could be prevented if it contains a Pod with any of these conditions:
- The Pod's affinity or anti-affinity rules prevent rescheduling.
- The Pod has local storage.
- The Pod is not managed by a Controller such as a Deployment, StatefulSet, Job or ReplicaSet.
An application's PodDisruptionBudget can also prevent autoscaling; if deleting nodes would cause the budget to be exceeded, the cluster does not scale down.
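The conditions above can be modeled as a per-Pod check, as in the following Python sketch. It is illustrative only: `PodInfo` and its fields are hypothetical, and the `pdb_disruptions_allowed` field stands in for the number of evictions that the Pod's PodDisruptionBudget still permits.

```python
from dataclasses import dataclass


@dataclass
class PodInfo:
    """Hypothetical summary of the Pod properties relevant to scale-down."""
    name: str
    has_controller: bool
    uses_local_storage: bool
    affinity_allows_move: bool
    pdb_disruptions_allowed: int = 1  # assumed: evictions the budget still permits


def scale_down_blockers(pods):
    """Map each Pod that could block node removal to the reasons why."""
    blockers = {}
    for pod in pods:
        reasons = []
        if not pod.affinity_allows_move:
            reasons.append("affinity or anti-affinity rules prevent rescheduling")
        if pod.uses_local_storage:
            reasons.append("uses local storage")
        if not pod.has_controller:
            reasons.append("not managed by a controller")
        if pod.pdb_disruptions_allowed < 1:
            reasons.append("PodDisruptionBudget would be violated")
        if reasons:
            blockers[pod.name] = reasons
    return blockers
```

In this model, a node is a candidate for removal only when `scale_down_blockers` returns an empty mapping for the Pods running on it.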
For more information about cluster autoscaler and preventing disruptions, see the following questions in the Cluster autoscaler FAQ:
- How does scale-down work?
- Does Cluster autoscaler work with PodDisruptionBudget in scale-down?
- What types of Pods can prevent Cluster autoscaler from removing a node?
You can find more information about cluster autoscaler in the Autoscaling FAQ in the open-source Kubernetes project.
Cluster autoscaler has the following limitations:
- Local PersistentVolumes are currently not supported by cluster autoscaler.
- Scaling up a node group of size 0 is not supported for Pods requesting resources beyond CPU, memory, and GPU (for example, ephemeral-storage).
- The cluster autoscaler supports up to 5000 nodes running 30 Pods each. For more details on scalability guarantees, refer to Scalability report.
- When scaling down, the cluster autoscaler honors a graceful termination period of 10 minutes for rescheduling the node's Pods onto a different node before forcibly terminating the node.
- Occasionally, the cluster autoscaler cannot scale down completely and an extra node exists after scaling down. This can occur when required system Pods are scheduled onto different nodes, because there is no trigger for any of those Pods to be moved to a different node. See "I have a couple of nodes with low utilization, but they are not scaled down. Why?" To work around this limitation, you can configure a Pod disruption budget.
- Custom scheduling with altered Filters is not supported.