This page explains how node auto-provisioning works in Standard Google Kubernetes Engine (GKE) clusters. With node auto-provisioning, nodes are automatically scaled to meet the requirements of your workloads.
With Autopilot clusters, you don't need to manually provision nodes or manage node pools because GKE automatically manages node scaling and provisioning.
Why use node auto-provisioning
Node auto-provisioning automatically manages and scales a set of node pools on the user's behalf. Without node auto-provisioning, the GKE cluster autoscaler creates nodes only from user-created node pools. With node auto-provisioning, GKE automatically creates and deletes node pools.
Unsupported features
Node auto-provisioning doesn't create node pools that use any of the following features. However, the cluster autoscaler scales nodes in existing node pools with these features:
- GKE Sandbox.
- Windows operating systems.
- Controlling reservation affinity.
- Autoscaling local PersistentVolumes.
- Auto-provisioning nodes with local SSDs as ephemeral storage.
- Auto-provisioning through custom scheduling that uses altered Filters.
- Configuring simultaneous multi-threading (SMT).
How node auto-provisioning works
Node auto-provisioning is a mechanism of the cluster autoscaler. On its own, the cluster autoscaler only scales existing node pools. With node auto-provisioning enabled, the cluster autoscaler can also create node pools automatically based on the specifications of unschedulable Pods.
Node auto-provisioning creates node pools based on the following information:
- CPU, memory, and ephemeral storage resource requests.
- GPU requests.
- Pending Pods' node affinities and label selectors.
- Pending Pods' node taints and tolerations.
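As a rough illustration of the inputs listed above, the following Python sketch collects those fields from a Pod spec represented as a plain dictionary. The function name `provisioning_inputs` is hypothetical, and the GPU resource name `nvidia.com/gpu` is an assumption that this page does not state. This is not how GKE is implemented, only a way to see which parts of a Pod matter.

```python
# Assumption: GPUs are requested through this extended resource name.
GPU_RESOURCE = "nvidia.com/gpu"
RELEVANT = ("cpu", "memory", "ephemeral-storage", GPU_RESOURCE)


def provisioning_inputs(pod_spec):
    """Collect the Pod fields that the list above says drive node pool creation."""
    requests = {name: [] for name in RELEVANT}
    for container in pod_spec.get("containers", []):
        resources = container.get("resources", {})
        # Extended resources such as GPUs are often set only under limits,
        # so fall back to limits when a request is missing.
        merged = {**resources.get("limits", {}), **resources.get("requests", {})}
        for name, quantity in merged.items():
            if name in requests:
                requests[name].append(quantity)
    return {
        "requests": requests,
        "node_selector": pod_spec.get("nodeSelector", {}),
        "node_affinity": pod_spec.get("affinity", {}).get("nodeAffinity", {}),
        "tolerations": pod_spec.get("tolerations", []),
    }
```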
Resource limits
Node auto-provisioning and the cluster autoscaler have limits at the following levels:
- Node pool level: Auto-provisioned node pools are limited to 1000 nodes.
- Cluster level:
  - Any auto-provisioning limits that you define are enforced based on the total CPU and memory resources used across all node pools, not just auto-provisioned pools.
  - The cluster autoscaler does not create new nodes if doing so would exceed one of the defined limits. If limits are already exceeded, GKE doesn't delete the nodes.
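The cluster-level rule can be sketched as a simple check. The example below is a simplified model with hypothetical field names (`nodes`, `cpu_per_node`, `memory_gib_per_node`, `max_cpu`, `max_memory_gib`); it only illustrates that totals are summed across every node pool, whether or not the pool was auto-provisioned.

```python
def can_add_node(node_pools, new_node, limits):
    """Return True if adding new_node keeps cluster totals within the limits.

    node_pools: list of dicts with "nodes", "cpu_per_node", "memory_gib_per_node"
    new_node:   dict with "cpu" and "memory_gib"
    limits:     dict with "max_cpu" and "max_memory_gib"
    """
    # Totals cover all node pools, not only auto-provisioned ones.
    total_cpu = sum(p["nodes"] * p["cpu_per_node"] for p in node_pools)
    total_mem = sum(p["nodes"] * p["memory_gib_per_node"] for p in node_pools)
    return (
        total_cpu + new_node["cpu"] <= limits["max_cpu"]
        and total_mem + new_node["memory_gib"] <= limits["max_memory_gib"]
    )


pools = [
    {"nodes": 3, "cpu_per_node": 4, "memory_gib_per_node": 16},  # user-created
    {"nodes": 2, "cpu_per_node": 8, "memory_gib_per_node": 32},  # auto-provisioned
]
limits = {"max_cpu": 32, "max_memory_gib": 128}
print(can_add_node(pools, {"cpu": 4, "memory_gib": 16}, limits))  # True
print(can_add_node(pools, {"cpu": 8, "memory_gib": 16}, limits))  # False
```

In this model the check only blocks new nodes. It never removes nodes when totals already exceed the limits, which matches the behavior described above.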
Workload separation
If pending Pods have node affinities and tolerations, node auto-provisioning can provision nodes with matching labels and taints.
Node auto-provisioning might create node pools with labels and taints if all of the following conditions are met:
- A pending Pod requires a node with a specific label key and value.
- The Pod has a toleration for a taint with the same key.
- The toleration is for the NoSchedule effect, the NoExecute effect, or all effects.
For instructions, refer to Configure workload separation in GKE.
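The three conditions can be checked mechanically. The sketch below is a simplified model that looks only at a Pod's `nodeSelector` (node affinity rules are ignored for brevity) and treats an omitted toleration effect as "all effects", which is the standard Kubernetes meaning. The function name `separation_candidates` is hypothetical.

```python
def separation_candidates(pod_spec):
    """Return (key, value) label pairs that satisfy the three conditions above."""
    selector = pod_spec.get("nodeSelector", {})
    tolerations = pod_spec.get("tolerations", [])
    matches = []
    for key, value in selector.items():
        for tol in tolerations:
            same_key = tol.get("key") == key
            # An omitted or empty effect means the toleration covers all effects.
            effect_ok = tol.get("effect", "") in ("", "NoSchedule", "NoExecute")
            if same_key and effect_ok:
                matches.append((key, value))
                break
    return matches


pod = {
    "nodeSelector": {"dedicated": "batch"},
    "tolerations": [
        {"key": "dedicated", "operator": "Equal", "value": "batch", "effect": "NoSchedule"}
    ],
}
print(separation_candidates(pod))  # [('dedicated', 'batch')]
```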
Deletion of auto-provisioned node pools
When there are no nodes in an auto-provisioned node pool, GKE deletes the node pool. GKE does not delete node pools that are not auto-provisioned.
Supported machine types
Node auto-provisioning considers the Pod requirements in your cluster to determine what type of nodes would best fit those Pods.
By default, GKE uses the E2 machine series unless any of the following conditions apply:
- The workload requests a feature that is not available in the E2 machine series. For example, if a GPU is requested by the workload, the N1 machine series is used for the new node pool.
- The workload requests TPU resources. To learn more about TPUs, see the Introduction to Cloud TPU.
- The workload uses the machine-family label. For more information, see Using a custom machine family.
If the Pod requests GPUs, node auto-provisioning assigns a machine type sufficiently large to support the number of GPUs that the Pod requests. The number of GPUs restricts the CPU and memory that the node can have. For more information, see GPU platforms.
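The default selection described in this section can be summarized as a small decision function. This is only a sketch: the function name is hypothetical, and the order in which the conditions are evaluated (TPU first, then an explicit machine family, then GPU) is an assumption, because this page lists the conditions without stating precedence.

```python
def pick_machine_series(requests_gpu=False, requests_tpu=False, machine_family=None):
    """Return the machine series for a new node pool, or None for TPU workloads."""
    if requests_tpu:
        # TPU workloads get a TPU machine type derived from the Pod request,
        # so no general-purpose series applies here.
        return None
    if machine_family is not None:
        # Value of the machine-family label set by the workload.
        return machine_family
    if requests_gpu:
        return "n1"  # E2 does not offer GPUs, so N1 is used instead
    return "e2"      # default series


print(pick_machine_series())                     # e2
print(pick_machine_series(requests_gpu=True))    # n1
print(pick_machine_series(machine_family="n2"))  # n2
```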
Supported node images
Node auto-provisioning creates node pools using one of the following node images:
- Container-Optimized OS (cos_containerd).
- Ubuntu (ubuntu_containerd).
Supported machine learning accelerators
Node auto-provisioning can create node pools with hardware accelerators such as GPU and Cloud TPU. Node auto-provisioning supports TPUs in GKE version 1.28 and later.
GPUs
If the Pod requests GPUs, node auto-provisioning assigns a machine type sufficiently large to support the number of GPUs that the Pod requests. The number of GPUs restricts the CPU and memory that the node can have. For more information, see GPU platforms.
Cloud TPUs
In GKE version 1.28 and later, if the Pod requests Cloud TPUs, node auto-provisioning scales up by creating node pools with TPU resources based on the Pod requirements such as TPU version, topology, machine type, and type of node pool. The Pod can also specify a reservation that GKE uses to provision TPU nodes.
GKE supports the following types of TPU node pools:
- Single-host TPU slice: Node auto-provisioning creates a node pool that contains one or more independent TPU VMs, which scales like a regular GKE node pool. GKE increases or decreases the number of nodes based on workload demand.
- Multi-host TPU slice: Node auto-provisioning creates a node pool that contains two or more interconnected TPU VMs. Each TPU VM has an attached TPU device that has a number of TPU chips. The number of nodes is defined by the TPU topology and the machine type. The number of TPU nodes equals the number of TPU chips in the topology divided by the number of TPU chips on each node (TPU topology / TPU chips in each node). For example, the ct4p-hightpu-4t machine type has four chips. If you request a 2x2x2 topology, the node pool has two nodes (2x2x2/4). Node auto-provisioning creates an empty node pool first and scales it atomically from zero to the target number of nodes. When GKE scales down a multi-host TPU slice, its non-empty nodes are drained and the TPU node pool is deleted.
If a specific TPU slice has no Pods that are running or are pending to be scheduled, GKE scales down the node pool. Multi-host TPU slice node pools are scaled down atomically. Single-host TPU slice node pools are scaled down by removing individual single-host TPU slices.
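The node count rule for multi-host TPU slices is plain arithmetic: multiply the topology dimensions to get the chip count, then divide by the chips on each node. The following sketch shows the calculation; the function names are hypothetical.

```python
from math import prod


def chips_in_topology(topology):
    """Return the number of TPU chips in a topology string such as '2x2x2'."""
    return prod(int(dim) for dim in topology.split("x"))


def multi_host_node_count(topology, chips_per_node):
    """Return the number of nodes in a multi-host TPU slice node pool."""
    chips = chips_in_topology(topology)
    if chips % chips_per_node:
        raise ValueError(f"{topology} is not divisible into {chips_per_node}-chip nodes")
    return chips // chips_per_node


# ct4p-hightpu-4t has four chips, so a 2x2x2 topology needs 8 / 4 nodes.
print(multi_host_node_count("2x2x2", 4))  # 2
```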
When you enable node auto-provisioning with TPUs, GKE makes scaling decisions based on the values defined in the Pod request. The following manifest is an example of a Deployment specification that results in one node pool that contains a TPU v4 slice with a 2x2x2 topology and two ct4p-hightpu-4t machines:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tpu-workload
  labels:
    app: tpu-workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-tpu
  template:
    metadata:
      labels:
        app: nginx-tpu
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
        cloud.google.com/gke-tpu-topology: 2x2x2
        cloud.google.com/reservation-name: my-reservation
      containers:
      - name: nginx
        image: nginx:1.14.2
        resources:
          requests:
            google.com/tpu: 4
          limits:
            google.com/tpu: 4
        ports:
        - containerPort: 80
Where:
- cloud.google.com/gke-tpu-accelerator: The TPU version and type. For example, TPU v4 with tpu-v4-podslice or TPU v5e with tpu-v5-lite-podslice.
- cloud.google.com/gke-tpu-topology: The number and physical arrangement of TPU chips within a TPU slice. When creating a node pool and enabling node auto-provisioning, you select the TPU topology. For more information about Cloud TPU topologies, see TPU configurations.
- limit.google.com/tpu: The number of TPU chips on the TPU VM. Most configurations have just one correct value. However, the tpu-v5-lite-podslice with 2x4 topology configuration has two:
  - If you specify google.com/tpu = 8, node auto-provisioning scales up a single-host TPU slice node pool by adding one ct5lp-hightpu-8t machine.
  - If you specify google.com/tpu = 4, node auto-provisioning creates a multi-host TPU slice node pool with two ct5lp-hightpu-4t machines.
- cloud.google.com/reservation-name: The name of the reservation that the workload uses. If omitted, the workload doesn't use any reservation.
If you set tpu-v4-podslice, node auto-provisioning makes the following decisions. The first two columns are values set in the Pod manifest; the remaining columns are decided by node auto-provisioning.
gke-tpu-topology | limit.google.com/tpu | Type of node pool | Node pool size | Machine type |
---|---|---|---|---|
2x2x1 | 4 | Single-host TPU slice | Flexible | ct4p-hightpu-4t |
{A}x{B}x{C} | 4 | Multi-host TPU slice | {A}x{B}x{C}/4 | ct4p-hightpu-4t |
The product of {A}x{B}x{C} defines the number of chips in the node pool. For example, you can define a small topology of 64 chips with combinations such as 4x4x4. If you use topologies larger than 64 chips, the values you assign to {A}, {B}, and {C} must meet the following conditions:
- {A}, {B}, and {C} are either all lower than or equal to four, or multiples of four.
- The largest topology supported is 12x16x16.
- The assigned values keep the A ≤ B ≤ C pattern. For example, 2x2x4 or 2x4x4 for small topologies.
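The topology conditions above can be turned into a validation helper. The sketch below encodes one reading of the rules and is not an official validator. It assumes that the A ≤ B ≤ C ordering applies to every topology, that "multiples of four" means all three dimensions, and that the 12x16x16 maximum is applied per dimension. The function name is hypothetical.

```python
def is_valid_v4_topology(topology):
    """Check a TPU v4 topology string such as '4x4x8' against the rules above."""
    try:
        dims = [int(dim) for dim in topology.split("x")]
    except ValueError:
        return False
    if len(dims) != 3 or any(dim < 1 for dim in dims):
        return False
    a, b, c = dims
    if not a <= b <= c:                 # keep the A <= B <= C pattern
        return False
    if a > 12 or b > 16 or c > 16:      # largest supported topology is 12x16x16
        return False
    if a * b * c > 64:                  # extra rule for topologies above 64 chips
        all_small = all(dim <= 4 for dim in dims)
        all_multiples_of_four = all(dim % 4 == 0 for dim in dims)
        return all_small or all_multiples_of_four
    return True


for candidate in ("2x2x4", "4x4x8", "4x6x8", "4x2x2", "16x16x16"):
    print(candidate, is_valid_v4_topology(candidate))
```

Under these assumptions, 2x2x4 and 4x4x8 pass, 4x6x8 fails because 6 is not a multiple of four, 4x2x2 fails the ordering rule, and 16x16x16 exceeds the maximum.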
If you set tpu-v5-lite-podslice, node auto-provisioning makes the following decisions. The first two columns are values set in the Pod manifest; the remaining columns are decided by node auto-provisioning.
gke-tpu-topology | limit.google.com/tpu | Type of node pool | Node pool size | Machine type |
---|---|---|---|---|
1x1 | 1 | Single-host TPU slice | Flexible | ct5lp-hightpu-1t |
2x2 | 4 | Single-host TPU slice | Flexible | ct5lp-hightpu-4t |
2x4 | 8 | Single-host TPU slice | Flexible | ct5lp-hightpu-8t |
2x4¹ | 4 | Multi-host TPU slice | 2 (8/4) | ct5lp-hightpu-4t |
4x4 | 4 | Multi-host TPU slice | 4 (16/4) | ct5lp-hightpu-4t |
4x8 | 4 | Multi-host TPU slice | 8 (32/4) | ct5lp-hightpu-4t |
8x8 | 4 | Multi-host TPU slice | 16 (64/4) | ct5lp-hightpu-4t |
8x16 | 4 | Multi-host TPU slice | 32 (128/4) | ct5lp-hightpu-4t |
16x16 | 4 | Multi-host TPU slice | 64 (256/4) | ct5lp-hightpu-4t |
¹ Special case where the machine type depends on the value you defined in the google.com/tpu limits field.
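The tpu-v5-lite-podslice decisions can be expressed as a small lookup. The sketch below is table-driven and covers only the topologies listed in this section; the function and constant names are hypothetical. Multi-host node pool sizes are computed as the chip count divided by four, the chip count of ct5lp-hightpu-4t.

```python
from math import prod

# Single-host combinations: (topology, google.com/tpu limit) -> machine type.
SINGLE_HOST = {
    ("1x1", 1): "ct5lp-hightpu-1t",
    ("2x2", 4): "ct5lp-hightpu-4t",
    ("2x4", 8): "ct5lp-hightpu-8t",
}
MULTI_HOST_TOPOLOGIES = ("2x4", "4x4", "4x8", "8x8", "8x16", "16x16")
MULTI_HOST_CHIPS_PER_NODE = 4


def v5e_node_pool(topology, tpu_limit):
    """Return (node pool type, node pool size, machine type) for a Pod request."""
    if (topology, tpu_limit) in SINGLE_HOST:
        return ("Single-host TPU slice", "Flexible", SINGLE_HOST[(topology, tpu_limit)])
    if topology in MULTI_HOST_TOPOLOGIES and tpu_limit == MULTI_HOST_CHIPS_PER_NODE:
        chips = prod(int(dim) for dim in topology.split("x"))
        nodes = chips // MULTI_HOST_CHIPS_PER_NODE
        return ("Multi-host TPU slice", nodes, "ct5lp-hightpu-4t")
    raise ValueError(f"unsupported combination: {topology} with limit {tpu_limit}")


# The 2x4 topology is the special case: the limit decides the outcome.
print(v5e_node_pool("2x4", 8))  # ('Single-host TPU slice', 'Flexible', 'ct5lp-hightpu-8t')
print(v5e_node_pool("2x4", 4))  # ('Multi-host TPU slice', 2, 'ct5lp-hightpu-4t')
```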
To learn how to set up node auto-provisioning, see Configuring TPUs.
Support for Spot VMs
Node auto-provisioning supports creating node pools based on Spot VMs.
Creating node pools based on Spot VMs is only considered if there are unschedulable Pods with a toleration for the cloud.google.com/gke-spot="true":NoSchedule taint. The taint is automatically applied to nodes in auto-provisioned node pools that are based on Spot VMs.
You can combine the toleration with a nodeSelector or node affinity rule for the cloud.google.com/gke-spot="true" or cloud.google.com/gke-provisioning=spot (for nodes running GKE version 1.25.5-gke.2500 or later) node labels to ensure that your workloads only run on node pools based on Spot VMs.
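Whether a Pod qualifies for Spot VM node pools comes down to standard Kubernetes toleration matching against the taint named above. The following sketch checks a Pod spec dictionary for such a toleration. It follows the usual Kubernetes rules (operator defaults to Equal, an empty effect matches all effects, and Exists with an empty key matches every taint); the function name is hypothetical.

```python
SPOT_KEY = "cloud.google.com/gke-spot"


def tolerates_spot_taint(pod_spec):
    """Return True if the Pod tolerates cloud.google.com/gke-spot="true":NoSchedule."""
    for tol in pod_spec.get("tolerations", []):
        operator = tol.get("operator", "Equal")
        key = tol.get("key", "")
        effect = tol.get("effect", "")
        if effect not in ("", "NoSchedule"):
            continue  # this toleration targets a different effect
        if operator == "Exists" and key in ("", SPOT_KEY):
            return True
        if operator == "Equal" and key == SPOT_KEY and tol.get("value") == "true":
            return True
    return False


pod = {
    # The label selector keeps the Pod on Spot VM nodes only.
    "nodeSelector": {SPOT_KEY: "true"},
    "tolerations": [
        {"key": SPOT_KEY, "operator": "Equal", "value": "true", "effect": "NoSchedule"}
    ],
}
print(tolerates_spot_taint(pod))  # True
```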
Support for Pods requesting ephemeral storage
Node auto-provisioning supports creating node pools when Pods request ephemeral storage. The size of the boot disk provisioned in the node pools is constant for all new auto-provisioned node pools. You can customize the boot disk size; the default is 100 GiB. Ephemeral storage backed by local SSDs is not supported.
Node auto-provisioning will provision a node pool only if the allocatable ephemeral storage of a node with a specified boot disk is greater than or equal to the ephemeral storage request of a pending Pod. If the ephemeral storage request is higher than what is allocatable, node auto-provisioning will not provision a node pool. Disk sizes for nodes are not dynamically configured based on ephemeral storage requests of pending Pods.
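The rule reduces to a single comparison. In the sketch below, the allocatable figure is a hypothetical input: this page does not say how much of a boot disk is allocatable as ephemeral storage, so the value must come from your own nodes.

```python
def provisions_for_request(request_gib, allocatable_gib):
    """Return True if a node pool would be provisioned for this request."""
    # The disk size is fixed, so a request above the allocatable amount
    # is never satisfied by provisioning a larger disk.
    return allocatable_gib >= request_gib


ALLOCATABLE_GIB = 47.0  # hypothetical allocatable ephemeral storage per node
for request in (10, 47.0, 60):
    print(request, provisions_for_request(request, ALLOCATABLE_GIB))
```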
Scalability limitations
Node auto-provisioning has the same limitations as the cluster autoscaler, as well as the following additional limitations:
- Limit on number of separated workloads: Node auto-provisioning supports a maximum of 100 distinct separated workloads.
- Limit on number of node pools: Node auto-provisioning de-prioritizes creating new node pools when the number of pools in the cluster approaches 100. Creating over 100 node pools is possible but only when creating a node pool is the only option to schedule a pending Pod.
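The node pool limit can be modeled as a soft threshold. The sketch below is a deliberately coarse model with hypothetical names: it treats 100 as a hard switch point and does not model the gradual de-prioritization that happens as the count approaches 100.

```python
NODE_POOL_THRESHOLD = 100


def may_create_node_pool(current_pool_count, pod_fits_existing_pool):
    """Return True if creating another node pool is permitted in this model."""
    if current_pool_count < NODE_POOL_THRESHOLD:
        return True
    # At or above the threshold, a new pool is created only as a last resort.
    return not pod_fits_existing_pool


print(may_create_node_pool(10, True))    # True
print(may_create_node_pool(100, True))   # False
print(may_create_node_pool(100, False))  # True
```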