This page explains how node auto-provisioning works in Standard Google Kubernetes Engine (GKE) clusters. With node auto-provisioning, nodes are automatically scaled to meet the requirements of your workloads.
With Autopilot clusters, you don't need to manually provision nodes or manage node pools because GKE automatically manages node scaling and provisioning.
Why use node auto-provisioning
Node auto-provisioning automatically manages and scales a set of node pools on the user's behalf. Without node auto-provisioning, the GKE cluster autoscaler creates nodes only from user-created node pools. With node auto-provisioning, GKE automatically creates and deletes node pools.
Unsupported features
Node auto-provisioning doesn't create node pools that use any of the following features. However, the cluster autoscaler scales nodes in existing node pools with these features:
- GKE Sandbox.
- Windows operating systems.
- Controlling reservation affinity.
- Autoscaling local PersistentVolumes.
- Auto-provisioning nodes with local SSDs as ephemeral storage.
- Auto-provisioning through custom scheduling that uses altered Filters.
- Configuring simultaneous multi-threading (SMT).
How node auto-provisioning works
Node auto-provisioning is a mechanism of the cluster autoscaler. The cluster autoscaler only scales existing node pools. With node auto-provisioning enabled, the cluster autoscaler can create node pools automatically based on the specifications of unschedulable Pods.
Node auto-provisioning creates node pools based on the following information:
- CPU, memory, and ephemeral storage resource requests.
- GPU requests.
- Pending Pods' node affinities and label selectors.
- Pending Pods' node taints and tolerations.
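For example, a pending Pod with requests like the following (the name, image, and request sizes are illustrative) gives node auto-provisioning the information it uses to choose a machine shape for a new node pool:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-heavy-job   # hypothetical name
spec:
  containers:
  - name: worker
    image: nginx:1.14.2   # placeholder image
    resources:
      requests:
        cpu: "4"                  # CPU request that node auto-provisioning considers
        memory: 16Gi              # memory request that node auto-provisioning considers
        ephemeral-storage: 10Gi   # ephemeral storage request that node auto-provisioning considers
      limits:
        memory: 16Gi
```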
Resource limits
Node auto-provisioning and the cluster autoscaler have limits at the following levels:
- Node pool level: Auto-provisioned node pools are limited to 1000 nodes.
- Cluster level:
- Any auto-provisioning limits that you define are enforced based on the total CPU and memory resources used across all node pools, not just auto-provisioned pools.
- The cluster autoscaler does not create new nodes if doing so would exceed one of the defined limits. If limits are already exceeded, GKE doesn't delete the nodes.
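As a sketch, assuming you manage node auto-provisioning with a configuration file that you pass to gcloud through the --autoprovisioning-config-file flag, cluster-level limits take roughly this shape (the resource types and numbers here are illustrative):

```yaml
# resourceLimits caps the total CPU, memory, and accelerator resources
# across all node pools in the cluster, not just auto-provisioned ones.
resourceLimits:
- resourceType: 'cpu'
  minimum: 4
  maximum: 100
- resourceType: 'memory'
  maximum: 1000
- resourceType: 'nvidia-tesla-t4'
  maximum: 4
```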
Workload separation
If pending Pods have node affinities and tolerations, node auto-provisioning can provision nodes with matching labels and taints.
Node auto-provisioning might create node pools with labels and taints if all of the following conditions are met:
- A pending Pod requires a node with a specific label key and value.
- The Pod has a toleration for a taint with the same key.
- The toleration is for the NoSchedule effect, the NoExecute effect, or all effects.
For instructions, refer to Configure workload separation in GKE.
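As an illustration of those conditions, a pending Pod along these lines (the dedicated key and ui-team value are hypothetical) lets node auto-provisioning create a node pool whose nodes carry a matching label and taint:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: separated-workload   # hypothetical name
spec:
  nodeSelector:
    dedicated: ui-team        # the Pod requires a node with this label key and value
  tolerations:
  - key: dedicated            # toleration key matches the required label key
    operator: Equal
    value: ui-team
    effect: NoSchedule        # NoSchedule, NoExecute, or omit the effect to tolerate all effects
  containers:
  - name: app
    image: nginx:1.14.2       # placeholder image
```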
Limitations of using labels for workload separation
Node auto-provisioning triggers new node pool creation when you use labels supported by node auto-provisioning, like cloud.google.com/gke-spot or machine families. You can use other labels in your Pod manifests to narrow down the nodes on which GKE places Pods, but GKE won't use these labels to provision new node pools. For the list of labels that don't explicitly trigger node pool creation, see Limitations of workload separation with taints and tolerations.
Deletion of auto-provisioned node pools
When there are no nodes in an auto-provisioned node pool, GKE deletes the node pool. GKE does not delete node pools that are not auto-provisioned.
Supported machine types
Node auto-provisioning considers the Pod requirements in your cluster to determine what type of nodes would best fit those Pods.
By default, GKE uses the E2 machine series unless any of the following conditions apply:
- The workload requests a feature that is not available in the E2 machine series. For example, if the workload requests a GPU, GKE uses the N1 machine series for the new node pool.
- The workload requests TPU resources. To learn more about TPUs, see the Introduction to Cloud TPU.
- The workload uses the machine-family label (see the example after this list). For more information, see Using a custom machine family.
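For example, a Pod that selects a machine family roughly as follows (the choice of N2 is illustrative) steers node auto-provisioning toward that machine series:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: n2-workload   # hypothetical name
spec:
  nodeSelector:
    cloud.google.com/machine-family: n2   # ask for an auto-provisioned node pool in the N2 series
  containers:
  - name: app
    image: nginx:1.14.2   # placeholder image
    resources:
      requests:
        cpu: "2"
        memory: 8Gi
```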
Supported node images
Node auto-provisioning creates node pools using one of the following node images:
- Container-Optimized OS (cos_containerd)
- Ubuntu (ubuntu_containerd)
Supported machine learning accelerators
Node auto-provisioning can create node pools with hardware accelerators such as GPU and Cloud TPU. Node auto-provisioning supports TPUs in GKE version 1.28 and later.
GPUs
If the Pod requests GPUs, node auto-provisioning assigns a machine type sufficiently large to support the number of GPUs that the Pod requests. The number of GPUs restricts the CPU and memory that the node can have. For more information, see GPU platforms.
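For example, a Pod like the following (the GPU model and image are illustrative) causes node auto-provisioning to pick a machine type that can carry the requested GPU:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload   # hypothetical name
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-t4   # GPU model for the auto-provisioned nodes
  containers:
  - name: trainer
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04   # placeholder image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # GPU count; this also constrains the node's CPU and memory
```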
Cloud TPUs
GKE supports Tensor Processing Units (TPUs) to accelerate machine learning workloads. Both single-host and multi-host TPU slice node pools support autoscaling and auto-provisioning.
With the --enable-autoprovisioning flag on a GKE cluster, GKE creates or deletes single-host or multi-host TPU slice node pools with a TPU version and topology that meets the requirements of pending workloads.

When you use --enable-autoscaling, GKE scales the node pool based on its type, as follows:
- Single-host TPU slice node pool: GKE adds or removes TPU nodes in the existing node pool. The node pool may contain any number of TPU nodes between zero and the maximum size of the node pool, as determined by the --max-nodes and --total-max-nodes flags. When the node pool scales, all the TPU nodes in the node pool have the same machine type and topology. To learn how to create a single-host TPU slice node pool, see Create a node pool.
- Multi-host TPU slice node pool: GKE atomically scales up the node pool from zero to the number of nodes required to satisfy the TPU topology. For example, with a TPU node pool that has the machine type ct5lp-hightpu-4t and a topology of 16x16, the node pool contains 64 nodes. The GKE autoscaler ensures that this node pool has exactly 0 or 64 nodes. When scaling back down, GKE evicts all scheduled Pods and drains the entire node pool to zero. To learn how to create a multi-host TPU slice node pool, see Create a node pool.
If a specific TPU slice has no Pods that are running or are pending to be scheduled, GKE scales down the node pool. Multi-host TPU slice node pools are scaled down atomically. Single-host TPU slice node pools are scaled down by removing individual single-host TPU slices.
When you enable node auto-provisioning with TPUs, GKE makes scaling decisions based on the values defined in the Pod request. The following manifest is an example of a Deployment specification that results in one node pool that contains a TPU v4 slice with a 2x2x2 topology and two ct4p-hightpu-4t machines:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tpu-workload
  labels:
    app: tpu-workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-tpu
  template:
    metadata:
      labels:
        app: nginx-tpu
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
        cloud.google.com/gke-tpu-topology: 2x2x2
        cloud.google.com/reservation-name: my-reservation
      containers:
      - name: nginx
        image: nginx:1.14.2
        resources:
          requests:
            google.com/tpu: 4
          limits:
            google.com/tpu: 4
        ports:
        - containerPort: 80
```
Where:

- cloud.google.com/gke-tpu-accelerator: The TPU version and type. For example, you can use any of the following:
  - TPU v4 with tpu-v4-podslice.
  - TPU v5e with tpu-v5-lite-podslice.
  - TPU Trillium (v6e) with tpu-v6e-slice.
- cloud.google.com/gke-tpu-topology: The number and physical arrangement of TPU chips within a TPU slice. When creating a node pool and enabling node auto-provisioning, you select the TPU topology. For more information about Cloud TPU topologies, see TPU configurations.
- limit.google.com/tpu: The number of TPU chips on the TPU VM. Most configurations have only one correct value. However, the tpu-v5-lite-podslice with 2x4 topology configuration has two valid values (see the example after this list):
  - If you specify google.com/tpu = 8, node auto-provisioning scales up a single-host TPU slice node pool by adding one ct5lp-hightpu-8t machine.
  - If you specify google.com/tpu = 4, node auto-provisioning creates a multi-host TPU slice node pool with two ct5lp-hightpu-4t machines.
- cloud.google.com/reservation-name: The name of the reservation that the workload uses. If omitted, the workload doesn't use any reservation.
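As a minimal sketch of that 2x4 special case, the following Pod snippet (the Pod name and container image are illustrative) requests all 8 chips on one host, so node auto-provisioning would add a single ct5lp-hightpu-8t machine:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tpu-v5e-single-host   # hypothetical name
spec:
  nodeSelector:
    cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice   # TPU v5e
    cloud.google.com/gke-tpu-topology: 2x4                       # 8 chips in total
  containers:
  - name: worker
    image: python:3.11   # placeholder image
    resources:
      requests:
        google.com/tpu: 8   # all 8 chips on one VM -> single-host slice on ct5lp-hightpu-8t
      limits:
        google.com/tpu: 8
```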
If you set the accelerator type to tpu-v6e-slice (to indicate TPU Trillium), node auto-provisioning makes the following decisions. You set the values in the first two columns in the Pod manifest; node auto-provisioning decides the rest.

| gke-tpu-topology | limit.google.com/tpu | Type of node pool | Node pool size | Machine type |
|---|---|---|---|---|
| 1x1 | 1 | Single-host TPU slice | Flexible | ct6e-standard-1t |
| 2x2 | 4 | Single-host TPU slice | Flexible | ct6e-standard-4t |
| 2x4 | 8 | Single-host TPU slice | Flexible | ct6e-standard-8t |
| 2x4 | 4 | Multi-host TPU slice | 2 | ct6e-standard-4t |
| 4x4 | 4 | Multi-host TPU slice | 4 | ct6e-standard-4t |
| 4x8 | 4 | Multi-host TPU slice | 8 | ct6e-standard-4t |
| 8x8 | 4 | Multi-host TPU slice | 16 | ct6e-standard-4t |
| 8x16 | 4 | Multi-host TPU slice | 32 | ct6e-standard-4t |
| 16x16 | 4 | Multi-host TPU slice | 64 | ct6e-standard-4t |
If you set the accelerator type to tpu-v4-podslice (to indicate TPU v4), node auto-provisioning makes the following decisions. You set the values in the first two columns in the Pod manifest; node auto-provisioning decides the rest.

| gke-tpu-topology | limit.google.com/tpu | Type of node pool | Node pool size | Machine type |
|---|---|---|---|---|
| 2x2x1 | 4 | Single-host TPU slice | Flexible | ct4p-hightpu-4t |
| {A}x{B}x{C} | 4 | Multi-host TPU slice | {A}x{B}x{C}/4 | ct4p-hightpu-4t |
The product of {A}x{B}x{C} defines the number of chips in the node pool. For example, you can define a small topology of 64 chips with combinations such as 4x4x4. If you use topologies larger than 64 chips, the values that you assign to {A}, {B}, and {C} must meet the following conditions:

- {A}, {B}, and {C} are either all less than or equal to four, or multiples of four.
- The largest topology supported is 12x16x16.
- The assigned values keep the A ≤ B ≤ C pattern. For example, 2x2x4 or 2x4x4 for small topologies.
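To make the arithmetic concrete, here is a hedged sketch of a Pod nodeSelector for a 4x4x4 topology; per the table above, 4x4x4 is 64 chips, and with 4 chips per ct4p-hightpu-4t machine the multi-host node pool has 64/4 = 16 nodes (the Pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tpu-v4-multi-host   # hypothetical name
spec:
  nodeSelector:
    cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
    cloud.google.com/gke-tpu-topology: 4x4x4   # {A}x{B}x{C} = 64 chips
  containers:
  - name: worker
    image: python:3.11   # placeholder image
    resources:
      limits:
        google.com/tpu: 4   # 4 chips per VM -> 64 / 4 = 16 ct4p-hightpu-4t nodes
```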
If you set the accelerator type to tpu-v5-lite-podslice (to indicate TPU v5e with machine types that begin with ct5lp-), node auto-provisioning makes the following decisions. You set the values in the first two columns in the Pod manifest; node auto-provisioning decides the rest.

| gke-tpu-topology | limit.google.com/tpu | Type of node pool | Node pool size | Machine type |
|---|---|---|---|---|
| 1x1 | 1 | Single-host TPU slice | Flexible | ct5lp-hightpu-1t |
| 2x2 | 4 | Single-host TPU slice | Flexible | ct5lp-hightpu-4t |
| 2x4 | 8 | Single-host TPU slice | Flexible | ct5lp-hightpu-8t |
| 2x4 ¹ | 4 | Multi-host TPU slice | 2 (8/4) | ct5lp-hightpu-4t |
| 4x4 | 4 | Multi-host TPU slice | 4 (16/4) | ct5lp-hightpu-4t |
| 4x8 | 4 | Multi-host TPU slice | 8 (32/4) | ct5lp-hightpu-4t |
| 8x8 | 4 | Multi-host TPU slice | 16 (64/4) | ct5lp-hightpu-4t |
| 8x16 | 4 | Multi-host TPU slice | 32 (128/4) | ct5lp-hightpu-4t |
| 16x16 | 4 | Multi-host TPU slice | 64 (256/4) | ct5lp-hightpu-4t |
¹ Special case where the machine type depends on the value you defined in the google.com/tpu limits field.
If you set the accelerator type to tpu-v5-lite-device (to indicate TPU v5e with machine types that begin with ct5l-), node auto-provisioning makes the following decisions. You set the values in the first two columns in the Pod manifest; node auto-provisioning decides the rest.

| gke-tpu-topology | limit.google.com/tpu | Type of node pool | Node pool size | Machine type |
|---|---|---|---|---|
| 1x1 | 1 | Single-host TPU slice | Flexible | ct5l-hightpu-1t |
| 2x2 | 4 | Single-host TPU slice | Flexible | ct5l-hightpu-4t |
| 2x4 | 8 | Single-host TPU slice | Flexible | ct5l-hightpu-8t |
To learn how to set up node auto-provisioning, see Configuring TPUs.
Support for Spot VMs
Node auto-provisioning supports creating node pools based on Spot VMs.
GKE only considers creating node pools based on Spot VMs if there are unschedulable Pods with a toleration for the cloud.google.com/gke-spot="true":NoSchedule taint. GKE automatically applies this taint to nodes in auto-provisioned node pools that are based on Spot VMs.
You can combine the toleration with a nodeSelector or node affinity rule for the cloud.google.com/gke-spot="true" or cloud.google.com/gke-provisioning=spot (for nodes running GKE version 1.25.5-gke.2500 or later) node labels to ensure that your workloads run only on node pools based on Spot VMs.
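For example, a Pod like the following (the name and image are illustrative) combines the toleration and the node selector so that it only ever lands on an auto-provisioned Spot VM node pool:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spot-workload   # hypothetical name
spec:
  nodeSelector:
    cloud.google.com/gke-spot: "true"   # only schedule onto Spot VM nodes
  tolerations:
  - key: cloud.google.com/gke-spot
    operator: Equal
    value: "true"
    effect: NoSchedule                  # tolerate the taint that GKE applies to Spot VM nodes
  containers:
  - name: app
    image: nginx:1.14.2                 # placeholder image
```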
Support for Pods requesting ephemeral storage
Node auto-provisioning supports creating node pools when Pods request ephemeral storage. The boot disk size is the same for all new auto-provisioned node pools, and you can customize it. The default is 100 GiB. Ephemeral storage backed by local SSDs is not supported.

Node auto-provisioning provisions a node pool only if the allocatable ephemeral storage of a node with the specified boot disk is greater than or equal to the ephemeral storage request of a pending Pod. If the ephemeral storage request is higher than what is allocatable, node auto-provisioning doesn't provision a node pool. GKE doesn't dynamically size node disks based on the ephemeral storage requests of pending Pods.
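For example, a pending Pod with a request like the following (the name, image, and sizes are illustrative) is provisioned for only if the request fits within the allocatable ephemeral storage of a node with the configured boot disk size:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: scratch-workload   # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:1.14.2    # placeholder image
    resources:
      requests:
        ephemeral-storage: 20Gi   # must fit within the node's allocatable ephemeral storage
      limits:
        ephemeral-storage: 20Gi
```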
Scalability limitations
Node auto-provisioning has the same limitations as the cluster autoscaler, as well as the following additional limitations:
- Limit on the number of separated workloads: Node auto-provisioning supports a maximum of 100 distinct separated workloads.
- Limit on the number of node pools: Node auto-provisioning deprioritizes creating new node pools when the number of pools in the cluster approaches 100. Creating more than 100 node pools is possible, but only when creating a node pool is the only option to schedule a pending Pod.