This page explains Google Kubernetes Engine's cluster autoscaler feature. To learn how to autoscale clusters, refer to Autoscaling a Cluster.
GKE's cluster autoscaler automatically resizes clusters based on the demands of the workloads you want to run. With autoscaling enabled, GKE automatically adds a new node to your cluster if you've created new Pods that don't have enough capacity to run; conversely, if a node in your cluster is underutilized and its Pods can be run on other nodes, GKE can delete the node.
Cluster autoscaling allows you to pay only for resources that are needed at any given moment, and to automatically get additional resources when demand increases.
Keep in mind that when resources are deleted or moved in the course of autoscaling your cluster, your services can experience some disruption. For example, if your service consists of a controller with a single replica, that replica's Pod might be restarted on a different node if its current node is deleted. Before enabling autoscaling, ensure that your services can tolerate potential disruption.
How cluster autoscaler works
The cluster autoscaler works on a per-node pool basis. For each node pool, the autoscaler periodically checks whether there are any Pods that are not being scheduled and are waiting for a node with available resources. If such Pods exist, and the autoscaler determines that resizing a node pool would allow the waiting Pods to be scheduled, then the autoscaler expands that node pool.
Cluster autoscaler also measures the usage of each node against the node pool's total demand for capacity. If a node has had no new Pods scheduled on it for a set period of time, and all Pods running on that node can be scheduled onto other nodes in the pool, the autoscaler moves the Pods and deletes the node.
Note that cluster autoscaler works based on Pod resource requests, that is, how many resources your Pods have requested. Cluster autoscaler does not take into account the resources your Pods are actively using. Essentially, cluster autoscaler trusts that the Pod resource requests you've provided are accurate and schedules Pods on nodes based on that assumption.
If your Pods have requested too few resources (or haven't changed the defaults, which might be insufficient) and your nodes are experiencing shortages, cluster autoscaler does not correct the situation. You can help ensure cluster autoscaler works as accurately as possible by making explicit resource requests for all of your workloads.
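To make the request-based behavior concrete, here is a minimal sketch of the kind of fit check described above. It is a toy model under stated assumptions, not GKE's implementation: the function name `fits_on_node` and the millicore figures are hypothetical, and only CPU requests are considered.

```shell
#!/bin/sh
# Toy model (not GKE's implementation): decide whether a Pod fits on a node
# by comparing CPU *requests* (in millicores) with the node's allocatable CPU.
# Actual CPU usage is deliberately not an input.

# fits_on_node ALLOCATABLE REQUESTED_SO_FAR NEW_REQUEST
# Exits 0 if the new request still fits on the node, 1 otherwise.
fits_on_node() {
  [ $(( $2 + $3 )) -le "$1" ]
}

# Hypothetical node: 2000m allocatable, 1800m already requested by other Pods.
if fits_on_node 2000 1800 250; then
  echo "fits"
else
  echo "pending: a scale-up would be considered"
fi
```

Because the check never looks at live usage, a Pod that requests 250m but actually consumes far more would still be counted as 250m, which is why explicit and realistic requests matter.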
Cluster autoscaler makes the following assumptions when resizing a node pool:
- All replicated Pods can be restarted on some other node, possibly causing a brief disruption. If your services are not disruption-tolerant, using autoscaling is not recommended.
- Users or administrators are not manually managing nodes; cluster autoscaler may override any manual node management operations you perform.
- All nodes in a single node pool have the same set of labels.
- The cluster autoscaler considers the relative cost of the instance types in the various pools, and attempts to expand the least expensive possible node pool. The reduced cost of node pools containing preemptible VMs is taken into account.
- Labels manually added after initial cluster or node pool creation are not tracked. Nodes created by the cluster autoscaler are assigned the labels specified with --node-labels at the time of node pool creation.
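As an illustration of setting labels at node pool creation, the following sketch wraps a `gcloud container node-pools create` call in a function. The pool name `example-pool`, the cluster name `example-cluster`, the zone, and the label `env=dev` are placeholders chosen for this example, and the function name `create_labeled_pool` is hypothetical.

```shell
#!/bin/sh
# Sketch only: create an autoscaled node pool whose labels are fixed at
# creation time, so nodes added later by the autoscaler carry the same labels.
# All names and values below are placeholders, not taken from a real project.
create_labeled_pool() {
  gcloud container node-pools create example-pool \
    --cluster example-cluster \
    --zone us-central1-a \
    --node-labels env=dev \
    --enable-autoscaling --min-nodes 1 --max-nodes 4
}

# Defining the function has no side effects; call create_labeled_pool to run it.
```

Wrapping the command in a function keeps the snippet safe to source and review before anything is created.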
Balancing across zones
If your cluster contains multiple node pools with the same instance type, cluster autoscaler will attempt to keep those node pools' sizes balanced when scaling up. This can help prevent an uneven distribution of node pool sizes when you have node pools in multiple zones.
If you have a node pool that you want to exempt from pool size balancing, you can do so by giving that node pool a custom label.
For more information on how cluster autoscaler makes balancing decisions, see the Kubernetes documentation's FAQ for autoscaling.
Minimum and maximum node pool size
You can specify the minimum and maximum size for each node pool in your cluster, and cluster autoscaler makes rescaling decisions within these boundaries. If the current node pool size is lower than the specified minimum or greater than the specified maximum when you enable autoscaling, the autoscaler waits to take effect until a new node is needed in the node pool or until a node can be safely deleted from the node pool.
When you autoscale a multi-zone cluster, the minimum and maximum sizes apply per zone, so the cluster-wide scaling limits depend on how many zones are available.
For example, the following command creates an autoscaling multi-zone cluster with six nodes across three zones, with a minimum of one node per zone and a maximum of four nodes per zone:
gcloud container clusters create example-cluster \
    --zone us-central1-a \
    --node-locations us-central1-a,us-central1-b,us-central1-f \
    --num-nodes 2 --enable-autoscaling --min-nodes 1 --max-nodes 4
The total size of this cluster is between three and twelve nodes, spread across three zones. If one of the zones fails, the total size of the cluster becomes between two and eight nodes.
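The arithmetic behind these totals can be sketched as follows. The per-zone limits come from `--min-nodes` and `--max-nodes` in the command above; the function name `cluster_bounds` is hypothetical and exists only for this illustration.

```shell
#!/bin/sh
# Per-zone limits taken from --min-nodes and --max-nodes in the example command.
min_per_zone=1
max_per_zone=4

# cluster_bounds ZONES: print "<minimum total> <maximum total>" for that
# number of available zones.
cluster_bounds() {
  echo "$(( $1 * min_per_zone )) $(( $1 * max_per_zone ))"
}

cluster_bounds 3   # all three zones available: prints "3 12"
cluster_bounds 2   # one zone unavailable: prints "2 8"
```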
Considering Pod scheduling and disruption
When scaling down, cluster autoscaler respects scheduling and eviction rules set on Pods. These restrictions can prevent a node from being deleted by the autoscaler. A node's deletion could be prevented if it contains a Pod with any of these conditions:
- The Pod's affinity or anti-affinity rules prevent rescheduling.
- The Pod has local storage.
- The Pod is not managed by a controller such as a Deployment, StatefulSet, Job, or ReplicaSet.
An application's PodDisruptionBudget can also prevent autoscaling; if deleting nodes would cause the budget to be exceeded, the cluster does not scale down.
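The effect of a PodDisruptionBudget on scale-down can be approximated with a simple check. This is a simplified sketch that assumes a budget expressed as a minimum number of available replicas and a single eviction at a time; the function name `can_evict` and the replica counts are hypothetical, and the real eviction logic in Kubernetes covers more cases.

```shell
#!/bin/sh
# Toy model: a Pod can be evicted only if the number of healthy replicas
# left afterwards still meets the budget's minimum.

# can_evict HEALTHY_REPLICAS MIN_AVAILABLE
# Exits 0 if one Pod can be evicted without violating the budget.
can_evict() {
  [ $(( $1 - 1 )) -ge "$2" ]
}

# Three healthy replicas with a minimum of two: one eviction is allowed.
can_evict 3 2 && echo "eviction allowed"

# Two healthy replicas with a minimum of two: the node cannot be drained.
can_evict 2 2 || echo "eviction blocked; node is not removed"
```

Under this model, a single-replica workload with a minimum of one available Pod blocks removal of whichever node it runs on.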
You can find more information about cluster autoscaler in the Autoscaling FAQ in the open-source Kubernetes project.
Cluster autoscaler has the following limitations:
- Cluster autoscaler supports up to 1000 nodes running 30 Pods each. For more details on scalability guarantees, refer to Scalability report.
- When scaling down, cluster autoscaler supports a graceful termination period for a Pod of up to 10 minutes. A Pod is always killed after a maximum of 10 minutes, even if the Pod is configured with a higher grace period.
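The grace period limit amounts to capping the configured value at 10 minutes. A minimal sketch, assuming the grace period is expressed in seconds (the function name `effective_grace` is hypothetical):

```shell
#!/bin/sh
# Maximum graceful termination period honored during scale-down: 10 minutes.
max_grace_seconds=600

# effective_grace CONFIGURED_SECONDS: print the grace period that would
# actually apply when the autoscaler removes the Pod's node.
effective_grace() {
  if [ "$1" -gt "$max_grace_seconds" ]; then
    echo "$max_grace_seconds"
  else
    echo "$1"
  fi
}

effective_grace 30    # prints 30
effective_grace 900   # prints 600
```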