About node auto-provisioning

This page explains how node auto-provisioning works in Standard Google Kubernetes Engine (GKE) clusters. With node auto-provisioning, nodes are automatically scaled to meet the requirements of your workloads.

With Autopilot clusters, you don't need to manually provision nodes or manage node pools because GKE automatically manages node scaling and provisioning.

Why use node auto-provisioning

Node auto-provisioning automatically manages and scales a set of node pools on the user's behalf. Without node auto-provisioning, the GKE cluster autoscaler creates nodes only from user-created node pools. With node auto-provisioning, GKE automatically creates and deletes node pools.

Unsupported features

Node auto-provisioning doesn't create node pools that use any of the following features. However, the cluster autoscaler scales nodes in existing node pools with these features:

GKE Sandbox.
Windows operating systems.
Controlling reservation affinity.
Autoscaling local PersistentVolumes.
Auto-provisioning nodes with local SSDs as ephemeral storage.
Auto-provisioning through custom scheduling that uses altered Filters.
Configuring simultaneous multi-threading (SMT).

How node auto-provisioning works

Node auto-provisioning is a mechanism of the cluster autoscaler. On its own, the cluster autoscaler only scales existing node pools. With node auto-provisioning enabled, the cluster autoscaler can also create node pools automatically based on the specifications of unschedulable Pods.

Node auto-provisioning creates node pools based on the following information:

CPU, memory, and ephemeral storage resource requests.
GPU requests.
Pending Pods' node affinities and label selectors.
Pending Pods' node taints and tolerations.
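
As an illustration of where these inputs appear in a Pod specification, the following manifest is a minimal sketch. The Pod name, the example.com/team label key, and the request sizes are hypothetical and chosen only to show one field per item in the preceding list.

```yaml
# Hypothetical Pod: every name and value here is illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: nap-inputs-example
spec:
  affinity:
    nodeAffinity:
      # Node affinity: the Pod requires a node with this label.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: example.com/team
            operator: In
            values:
            - analytics
  # Toleration: the Pod accepts a taint with the same key.
  tolerations:
  - key: example.com/team
    operator: Equal
    value: analytics
    effect: NoSchedule
  containers:
  - name: app
    image: nginx:1.14.2
    resources:
      # CPU, memory, and ephemeral storage requests.
      requests:
        cpu: "2"
        memory: 4Gi
        ephemeral-storage: 1Gi
```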

Resource limits

Node auto-provisioning and the cluster autoscaler have limits at the following levels:

Node pool level: Auto-provisioned node pools are limited to 1000 nodes.
Cluster level:
- Any auto-provisioning limits that you define are enforced based on the total CPU and memory resources used across all node pools, not just auto-provisioned pools.
- The cluster autoscaler does not create new nodes if doing so would exceed one of the defined limits. If limits are already exceeded, GKE doesn't delete the nodes.
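
As a rough sketch of how cluster-level limits might be written down, the following shows a configuration file of the kind that can be passed to gcloud with the --autoprovisioning-config-file flag. The file format and the flag are not described on this page and are assumptions here, as are all the values. Check the configuration guide for your GKE version before relying on them.

```yaml
# Assumed auto-provisioning config file format; values are illustrative.
resourceLimits:
  # Total CPU cores across all node pools in the cluster.
  - resourceType: 'cpu'
    minimum: 4
    maximum: 100
  # Total memory across all node pools (assumed to be expressed in GB).
  - resourceType: 'memory'
    maximum: 400
```

Under limits like these, a scale-up that would take the cluster past 100 cores or the memory maximum would not happen, even if Pods remain pending.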

Workload separation

If pending Pods have node affinities and tolerations, node auto-provisioning can provision nodes with matching labels and taints.

Node auto-provisioning might create node pools with labels and taints if all of the following conditions are met:

A pending Pod requires a node with a specific label key and value.
The Pod has a toleration for a taint with the same key.
The toleration is for the NoSchedule effect, NoExecute effect, or all effects.

For instructions, refer to Configure workload separation in GKE.
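
The conditions above can be sketched with a Pod like the following. The dedicated key, the ui-team value, and the Pod name are hypothetical. The point is that the node selector and the toleration share the same key and that the toleration uses one of the listed effects.

```yaml
# Hypothetical Pod that satisfies all three conditions.
apiVersion: v1
kind: Pod
metadata:
  name: separated-workload
spec:
  # Condition 1: requires a node with a specific label key and value.
  nodeSelector:
    dedicated: ui-team
  # Conditions 2 and 3: toleration with the same key and the NoSchedule effect.
  tolerations:
  - key: dedicated
    operator: Equal
    value: ui-team
    effect: NoSchedule
  containers:
  - name: app
    image: nginx:1.14.2
```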

Deletion of auto-provisioned node pools

When there are no nodes in an auto-provisioned node pool, GKE deletes the node pool. GKE does not delete node pools that are not auto-provisioned.

Supported machine types

Node auto-provisioning considers the Pod requirements in your cluster to determine what type of nodes would best fit those Pods.

By default, GKE uses the E2 machine series unless any of the following conditions apply:

The workload requests a feature that is not available in the E2 machine series. For example, if a GPU is requested by the workload, the N1 machine series is used for the new node pool.
The workload requests TPU resources. To learn more about TPUs, see the Introduction to Cloud TPU.
The workload uses the machine-family label. For more information, see Using a custom machine family.
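
For the machine-family case, a Pod might select a machine series with a node selector, as in the following sketch. The full label key cloud.google.com/machine-family and the n2 value are assumptions that this page does not spell out, and the Pod name is hypothetical.

```yaml
# Hypothetical Pod that asks for a non-default machine series.
apiVersion: v1
kind: Pod
metadata:
  name: machine-family-example
spec:
  nodeSelector:
    # Assumed label key and value for selecting the N2 machine series.
    cloud.google.com/machine-family: n2
  containers:
  - name: app
    image: nginx:1.14.2
    resources:
      requests:
        cpu: "4"
        memory: 8Gi
```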

If the Pod requests GPUs, node auto-provisioning assigns a machine type sufficiently large to support the number of GPUs that the Pod requests. The number of GPUs restricts the CPU and memory that the node can have. For more information, see GPU platforms.
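
A GPU request of this kind could look like the following sketch. The cloud.google.com/gke-accelerator label, the nvidia-tesla-t4 value, and the nvidia.com/gpu resource name are not stated on this page and are assumptions here. The Pod name and the image are placeholders.

```yaml
# Hypothetical Pod that requests one GPU.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  nodeSelector:
    # Assumed label for choosing the GPU type.
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
  containers:
  - name: app
    image: nginx:1.14.2   # placeholder image reused from this page
    resources:
      limits:
        # The number of GPUs constrains which machine types can be used.
        nvidia.com/gpu: 1
```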

Supported node images

Node auto-provisioning creates node pools using one of the following node images:

Container-Optimized OS (cos_containerd).
Ubuntu (ubuntu_containerd).

Supported machine learning accelerators

Node auto-provisioning can create node pools with hardware accelerators such as GPUs and Cloud TPUs. Node auto-provisioning supports TPUs in GKE version 1.28 and later.

GPUs

Cloud TPUs

GKE supports Tensor Processing Units (TPUs) to accelerate machine learning workloads. Both single-host TPU slice node pools and multi-host TPU slice node pools support autoscaling and auto-provisioning.

With the --enable-autoprovisioning flag on a GKE cluster, GKE creates or deletes single-host or multi-host TPU slice node pools with a TPU version and topology that meets the requirements of pending workloads.

When you use --enable-autoscaling, GKE scales the node pool based on its type, as follows:

Single-host TPU slice node pool: GKE adds or removes TPU nodes in the existing node pool. The node pool can contain any number of TPU nodes between zero and the maximum size of the node pool, as determined by the --max-nodes and --total-max-nodes flags. When the node pool scales, all the TPU nodes in the node pool have the same machine type and topology. To learn more about how to create a single-host TPU slice node pool, see Create a node pool.
Multi-host TPU slice node pool: GKE atomically scales up the node pool from zero to the number of nodes required to satisfy the TPU topology. For example, a TPU node pool with the machine type ct5lp-hightpu-4t and a topology of 16x16 contains 64 nodes. The GKE autoscaler ensures that this node pool has exactly 0 or 64 nodes. When scaling back down, GKE evicts all scheduled Pods and drains the entire node pool to zero. To learn more about how to create a multi-host TPU slice node pool, see Create a node pool.

If a specific TPU slice has no Pods that are running or are pending to be scheduled, GKE scales down the node pool. Multi-host TPU slice node pools are scaled down atomically. Single-host TPU slice node pools are scaled down by removing individual single-host TPU slices.

When you enable node auto-provisioning with TPUs, GKE makes scaling decisions based on the values defined in the Pod request. The following manifest is an example of a Deployment specification that results in one node pool that contains a TPU v4 slice with a 2x2x2 topology and two ct4p-hightpu-4t machines:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: tpu-workload
      labels:
        app: tpu-workload
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: nginx-tpu
      template:
        metadata:
          labels:
            app: nginx-tpu
        spec:
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
            cloud.google.com/gke-tpu-topology: 2x2x2
            cloud.google.com/reservation-name: my-reservation
          containers:
          - name: nginx
            image: nginx:1.14.2
            resources:
              requests:
                google.com/tpu: 4
              limits:
                google.com/tpu: 4
            ports:
            - containerPort: 80

Where:

cloud.google.com/gke-tpu-accelerator: The TPU version and type. For example, TPU v4 with tpu-v4-podslice or TPU v5e with tpu-v5-lite-podslice.
cloud.google.com/gke-tpu-topology: The number and physical arrangement of TPU chips within a TPU slice. When creating a node pool and enabling node auto-provisioning, you select the TPU topology. For more information about Cloud TPU topologies, see TPU configurations.
limit.google.com/tpu: The number of TPU chips on the TPU VM. Most configurations have just one correct value. However, the tpu-v5-lite-podslice accelerator with a 2x4 topology accepts two values:
- If you specify google.com/tpu = 8, node auto-provisioning scales up a single-host TPU slice node pool by adding one ct5lp-hightpu-8t machine.
- If you specify google.com/tpu = 4, node auto-provisioning creates a multi-host TPU slice node pool with two ct5lp-hightpu-4t machines.
cloud.google.com/reservation-name: The name of the reservation that the workload uses. If omitted, the workload doesn't use any reservation.
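
To make the 2x4 special case concrete, the following Pod is a minimal sketch that uses the labels described above. The Pod name is hypothetical, and the image is the placeholder reused from the Deployment example. With a limit of 8, the description above implies a single-host node pool with one ct5lp-hightpu-8t machine.

```yaml
# Hypothetical Pod for the tpu-v5-lite-podslice 2x4 case.
apiVersion: v1
kind: Pod
metadata:
  name: tpu-v5e-example
spec:
  nodeSelector:
    cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
    cloud.google.com/gke-tpu-topology: 2x4
  containers:
  - name: nginx
    image: nginx:1.14.2
    resources:
      requests:
        # 8 chips on one VM selects the single-host case.
        google.com/tpu: 8
      limits:
        google.com/tpu: 8
```

Changing both google.com/tpu values to 4 would instead correspond to the multi-host case with two ct5lp-hightpu-4t machines.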

If you set tpu-v4-podslice, node auto-provisioning makes the following decisions:

Values set in the Pod manifest		Decided by node auto-provisioning
`gke-tpu-topology`	`limit.google.com/tpu`	Type of node pool	Node pool size	Machine type
2x2x1	4	Single-host TPU slice	Flexible	`ct4p-hightpu-4t`
{A}x{B}x{C}	4	Multi-host TPU slice	{A}x{B}x{C}/4	`ct4p-hightpu-4t`

The product of {A}x{B}x{C} defines the number of chips in the node pool. For example, you can define a small topology of 64 chips with combinations such as 4x4x4. If you use topologies larger than 64 chips, the values you assign to {A}, {B}, and {C} must meet the following conditions:

{A}, {B}, and {C} are either all lower than or equal to four, or multiples of four.
The largest topology supported is 12x16x16.
The assigned values keep the A ≤ B ≤ C pattern. For example, 2x2x4 or 2x4x4 for small topologies.

If you set tpu-v5-lite-podslice, node auto-provisioning makes the following decisions:

Values set in the Pod manifest		Decided by node auto-provisioning
`gke-tpu-topology`	`limit.google.com/tpu`	Type of node pool	Node pool size	Machine type
1x1	1	Single-host TPU slice	Flexible	`ct5lp-hightpu-1t`
2x2	4	Single-host TPU slice	Flexible	`ct5lp-hightpu-4t`
2x4	8	Single-host TPU slice	Flexible	`ct5lp-hightpu-8t`
2x4¹	4	Multi-host TPU slice	2 (8/4)	`ct5lp-hightpu-4t`
4x4	4	Multi-host TPU slice	4 (16/4)	`ct5lp-hightpu-4t`
4x8	4	Multi-host TPU slice	8 (32/4)	`ct5lp-hightpu-4t`
8x8	4	Multi-host TPU slice	16 (64/4)	`ct5lp-hightpu-4t`
8x16	4	Multi-host TPU slice	32 (128/4)	`ct5lp-hightpu-4t`
16x16	4	Multi-host TPU slice	64 (256/4)	`ct5lp-hightpu-4t`

¹ Special case where the machine type depends on the value you defined in the google.com/tpu limits field.

If you set the accelerator type to tpu-v5-lite-device, node auto-provisioning makes the following decisions:

Values set in the Pod manifest		Decided by node auto-provisioning
`gke-tpu-topology`	`limit.google.com/tpu`	Type of node pool	Node pool size	Machine type
1x1	1	Single-host TPU slice	Flexible	`ct5l-hightpu-1t`
2x2	4	Single-host TPU slice	Flexible	`ct5l-hightpu-4t`
2x4	8	Single-host TPU slice	Flexible	`ct5l-hightpu-8t`

To learn how to set up node auto-provisioning, see Configuring TPUs.

Support for Spot VMs

Node auto-provisioning supports creating node pools based on Spot VMs.

GKE considers creating node pools based on Spot VMs only if there are unschedulable Pods with a toleration for the cloud.google.com/gke-spot="true":NoSchedule taint. GKE automatically applies this taint to nodes in auto-provisioned node pools that are based on Spot VMs.

To ensure that your workloads run only on node pools based on Spot VMs, combine the toleration with a nodeSelector or node affinity rule for one of the following node labels: cloud.google.com/gke-spot="true", or cloud.google.com/gke-provisioning=spot for nodes running GKE version 1.25.5-gke.2500 or later.
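
The toleration and node selector described above might be combined as in the following sketch. The Pod name and image are placeholders. The label and taint come from the text of this section.

```yaml
# Hypothetical Pod that runs only on Spot VM nodes.
apiVersion: v1
kind: Pod
metadata:
  name: spot-example
spec:
  # Restricts scheduling to nodes labeled as Spot VMs.
  nodeSelector:
    cloud.google.com/gke-spot: "true"
  # Tolerates the taint that GKE applies to auto-provisioned Spot node pools.
  tolerations:
  - key: cloud.google.com/gke-spot
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: app
    image: nginx:1.14.2
```

The toleration alone only allows the Pod onto Spot nodes. The node selector is what keeps it off other nodes.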

Support for Pods requesting ephemeral storage

Node auto-provisioning supports creating node pools when Pods request ephemeral storage. The boot disk size is the same for all new auto-provisioned node pools. You can customize this boot disk size.

The default is 100 GiB. Ephemeral storage backed by local SSDs is not supported.

Node auto-provisioning will provision a node pool only if the allocatable ephemeral storage of a node with a specified boot disk is greater than or equal to the ephemeral storage request of a pending Pod. If the ephemeral storage request is higher than what is allocatable, node auto-provisioning will not provision a node pool. Disk sizes for nodes are not dynamically configured based on ephemeral storage requests of pending Pods.
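
As a sketch of such a request, the following Pod asks for 10Gi of ephemeral storage. The Pod name and the size are hypothetical. The size is chosen on the assumption that it fits within the allocatable ephemeral storage of a node with the default 100 GiB boot disk, which is smaller than the disk itself.

```yaml
# Hypothetical Pod with an ephemeral storage request.
apiVersion: v1
kind: Pod
metadata:
  name: ephemeral-storage-example
spec:
  containers:
  - name: app
    image: nginx:1.14.2
    resources:
      requests:
        # Must not exceed the node's allocatable ephemeral storage.
        ephemeral-storage: 10Gi
      limits:
        ephemeral-storage: 10Gi
```

A much larger request, for example one above the boot disk size, would leave the Pod pending, because disk sizes are not adjusted to fit the request.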

Scalability limitations

Node auto-provisioning has the same limitations as the cluster autoscaler, as well as the following additional limitations:

Limit on number of separated workloads: Node auto-provisioning supports a maximum of 100 distinct separated workloads.
Limit on number of node pools: Node auto-provisioning de-prioritizes creating new node pools when the number of pools in the cluster approaches 100. GKE can still create more than 100 node pools, but only when creating a node pool is the only way to schedule a pending Pod.