This page shows you how to resolve issues with GKE Standard mode node pools.
Node pool creation issues
This section lists issues that might occur when creating new node pools in Standard clusters and provides suggestions for how you might fix them.
Node pool creation fails due to resource availability
The following issue occurs when you create a node pool with specific hardware in a Google Cloud zone that doesn't have enough hardware available to meet your requirements.
To validate that node pool creation failed because a zone didn't have enough resources, check your logs for relevant error messages.
Go to Logs Explorer in the Google Cloud console:
In the Query field, specify the following query:
log_id(cloudaudit.googleapis.com/activity)
resource.labels.cluster_name="CLUSTER_NAME"
protoPayload.status.message:("ZONE_RESOURCE_POOL_EXHAUSTED" OR "does not have enough resources available to fulfill the request" OR "resource pool exhausted" OR "does not exist in zone")
Replace CLUSTER_NAME with the name of your GKE cluster.
Click Run query.
You might see one of the following error messages:
resource pool exhausted
The zone does not have enough resources available to fulfill the request. Try a different zone, or try again later.
Machine type with name 'MACHINE_NAME' does not exist in zone 'ZONE_NAME'
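If you prefer the command line to Logs Explorer, the same filter can be passed to `gcloud logging read`. The following is a minimal sketch, not an official procedure: `build_exhaustion_filter` is a hypothetical helper name, the filter clauses are joined with explicit `AND` operators, and the `--limit` and `--freshness` values in the usage comment are arbitrary assumptions.

```shell
# build_exhaustion_filter CLUSTER_NAME
# Prints a single-line Cloud Logging filter equivalent to the query above.
# (Hypothetical helper name; only the filter text comes from this page.)
build_exhaustion_filter() {
  cluster_name="$1"
  printf '%s AND %s AND %s\n' \
    'log_id("cloudaudit.googleapis.com/activity")' \
    "resource.labels.cluster_name=\"${cluster_name}\"" \
    'protoPayload.status.message:("ZONE_RESOURCE_POOL_EXHAUSTED" OR "does not have enough resources available to fulfill the request" OR "resource pool exhausted" OR "does not exist in zone")'
}

# Usage (assumes an authenticated gcloud CLI; limit and freshness are arbitrary):
# gcloud logging read "$(build_exhaustion_filter my-cluster)" --limit=20 --freshness=7d
```

Keeping the filter in a function makes it easy to reuse across clusters without retyping the error strings.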
To resolve this issue, try the following suggestions:
- Ensure that the selected Google Cloud region or zone has the specific hardware that you need. Use the Compute Engine availability table to check whether specific zones support specific hardware. Choose a different Google Cloud region or zone for your nodes that might have better availability of the hardware that you need.
- Create the node pool with smaller machine types. Increase the number of nodes in the node pool so that the total compute capacity remains the same.
- Use Compute Engine capacity reservation to reserve the resources in advance.
- Use best-effort provisioning, described below, so that node pool creation succeeds if GKE can provision at least a specified minimum number of the requested nodes.
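For the suggestion about smaller machine types, the node count needed to keep total capacity the same is a ceiling division. The sketch below is illustrative only: `nodes_needed` is a hypothetical helper, and it considers vCPUs alone, so memory and per-node system overhead still need to be checked separately.

```shell
# nodes_needed TOTAL_VCPUS VCPUS_PER_NODE
# Prints the smallest node count whose combined vCPUs cover TOTAL_VCPUS.
# (Hypothetical helper; vCPU-only, ignores memory and per-node overhead.)
nodes_needed() {
  total="$1"
  per_node="$2"
  # Integer ceiling division: round up when there is a remainder.
  echo $(( (total + per_node - 1) / per_node ))
}

# Example: a pool of 4 nodes with 32 vCPUs each (128 vCPUs in total),
# rebuilt with 8-vCPU machines:
nodes_needed 128 8   # prints 16
```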
For certain hardware, you can use best-effort provisioning, which tells GKE to successfully create the node pool if it can provision at least a specified minimum number of nodes. GKE continues attempting to provision the remaining nodes to satisfy the original request over time. To tell GKE to use best-effort provisioning, use the following command:
gcloud container node-pools create NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --node-locations=ZONE1,ZONE2,... \
    --machine-type=MACHINE_TYPE \
    --best-effort-provision \
    --min-provision-nodes=MINIMUM_NODES
Replace the following:
NODE_POOL_NAME: the name of the new node pool.
CLUSTER_NAME: the name of the existing cluster.
ZONE1,ZONE2,...: the Compute Engine zones for the nodes. These zones must support the selected hardware.
MACHINE_TYPE: the Compute Engine machine type for the nodes. For example,
MINIMUM_NODES: the minimum number of nodes for GKE to provision and successfully create the node pool. If omitted, the default is
For example, consider a scenario in which you need 10 nodes with attached
NVIDIA A100 40GB GPUs in
us-central1-c. According to the
GPU regions and zones availability table,
this zone supports A100 GPUs. To avoid node pool creation failure if 10 GPU
machines aren't available, you use best-effort provisioning.
gcloud container node-pools create a100-nodes \
    --cluster=ml-cluster \
    --node-locations=us-central1-c \
    --num-nodes=10 \
    --machine-type=a2-highgpu-1g \
    --accelerator=type=nvidia-tesla-a100,count=1 \
    --best-effort-provision \
    --min-provision-nodes=5
GKE creates the node pool even if only five GPUs are available in
us-central1-c. Over time, GKE attempts to provision more nodes
until there are 10 nodes in the node pool.
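To watch the pool grow toward the requested size, you could count the nodes that carry the node pool label. This is a rough sketch under assumptions: `pool_node_count` and `pool_at_target` are hypothetical helper names, `kubectl` is assumed to be configured for the cluster, and the count includes every registered node regardless of readiness.

```shell
# pool_node_count POOL_NAME
# Prints how many nodes are currently registered in the node pool.
# (Hypothetical helper; counts registered nodes, whether Ready or not.)
pool_node_count() {
  kubectl get nodes -l "cloud.google.com/gke-nodepool=$1" --no-headers 2>/dev/null |
    awk 'END { print NR }'
}

# pool_at_target POOL_NAME TARGET
# Succeeds once the pool has at least TARGET registered nodes.
pool_at_target() {
  [ "$(pool_node_count "$1")" -ge "$2" ]
}

# Usage with the values from the example above:
# pool_at_target a100-nodes 10 && echo "all 10 nodes are provisioned"
```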
Migrate workloads between node pools
Use the following instructions to migrate workloads from one node pool to another node pool. If you want to change the machine attributes of the nodes in your node pool, see Vertically scale by changing the node machine attributes.
How to migrate Pods to a new node pool
To migrate Pods to a new node pool, you must do the following:
Cordon the existing node pool: This operation marks the nodes in the existing node pool as unschedulable, so Kubernetes stops scheduling new Pods onto them.
Drain the existing node pool: This operation evicts the workloads running on the nodes of the existing node pool gracefully.
These steps cause Pods running in your existing node pool to gracefully terminate. Kubernetes reschedules them onto other available nodes.
To make sure Kubernetes terminates your applications gracefully, your containers
should handle the SIGTERM
signal. Use this approach to close active connections to clients and to commit or
roll back database transactions cleanly. In your Pod manifest, you can use the
spec.terminationGracePeriodSeconds field to specify how long Kubernetes
must wait before stopping containers in the Pod. This defaults to 30 seconds.
You can read more about Pod termination
in the Kubernetes documentation.
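As an illustration of SIGTERM handling, a container entrypoint written as a shell script might look like the following. This is a sketch, not a recommended implementation: `on_sigterm` and `run_worker` are hypothetical names, the loop body stands in for real work, and the cleanup must finish within the Pod's termination grace period.

```shell
#!/bin/sh
# Hypothetical entrypoint sketch: exit cleanly when Kubernetes sends SIGTERM.
shutting_down=0

on_sigterm() {
  # Only set a flag here; the main loop exits after its current iteration.
  shutting_down=1
}
trap on_sigterm TERM

run_worker() {
  while [ "$shutting_down" -eq 0 ]; do
    # Placeholder for one unit of real work.
    sleep 1
  done
  # Close client connections and commit or roll back transactions here.
  echo "clean shutdown complete"
}

# A real entrypoint would end by calling: run_worker
```

Setting a flag in the handler, rather than exiting from it, lets the current unit of work finish before cleanup runs.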
You can cordon and drain nodes using the
kubectl cordon and kubectl drain commands.
Create node pool and migrate workloads
To migrate your workloads to a new node pool, create the new node pool, then cordon and drain the nodes in the existing node pool:
Add a node pool to your cluster.
Verify that the new node pool is created by running the following command:
gcloud container node-pools list --cluster CLUSTER_NAME
Run the following command to see which node the Pods are running on (see the NODE column):
kubectl get pods -o=wide
Get a list of nodes in the existing node pool, replacing
EXISTING_NODE_POOL_NAME with the name of that node pool:
kubectl get nodes -l cloud.google.com/gke-nodepool=EXISTING_NODE_POOL_NAME
Run the kubectl cordon NODE command (substitute
NODE with the names from the previous command). The following shell command iterates over each node in the existing node pool and marks it as unschedulable:
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=EXISTING_NODE_POOL_NAME -o=name); do kubectl cordon "$node"; done
Optionally, update your workloads running on the existing node pool to add a nodeSelector for the label
cloud.google.com/gke-nodepool:NEW_NODE_POOL_NAME, where NEW_NODE_POOL_NAME is the name of the new node pool. This ensures that GKE places those workloads on nodes in the new node pool.
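One way to add that nodeSelector without editing manifests by hand is `kubectl patch`. The sketch below assumes the workload is a Deployment; `node_selector_patch` is a hypothetical helper and DEPLOYMENT_NAME is a placeholder. Patching the Pod template triggers a rollout, so the new Pods stay pending if the new node pool lacks capacity.

```shell
# node_selector_patch NEW_NODE_POOL_NAME
# Prints a patch that pins a workload's Pods to the given node pool.
# (Hypothetical helper; assumes the workload has a spec.template, as Deployments do.)
node_selector_patch() {
  printf '{"spec":{"template":{"spec":{"nodeSelector":{"cloud.google.com/gke-nodepool":"%s"}}}}}\n' "$1"
}

# Usage (DEPLOYMENT_NAME is a placeholder for one of your Deployments):
# kubectl patch deployment DEPLOYMENT_NAME -p "$(node_selector_patch NEW_NODE_POOL_NAME)"
```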
Drain each node by evicting Pods with an allotted graceful termination period:
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=EXISTING_NODE_POOL_NAME -o=name); do kubectl drain --force --ignore-daemonsets --delete-emptydir-data --grace-period=GRACEFUL_TERMINATION_SECONDS "$node"; done
Replace GRACEFUL_TERMINATION_SECONDS with the required amount of time, in seconds, for graceful termination (for example, 10).
Run the following command to see that the nodes in the existing node pool have
SchedulingDisabled status in the node list:
kubectl get nodes
Additionally, you should see that the Pods are now running on the nodes in the new node pool:
kubectl get pods -o=wide
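Before removing the old node pool, it can help to list any Pods that are still bound to its nodes. The following is a sketch under assumptions: `pods_on_pool` is a hypothetical helper, and because the drain command ignores DaemonSets, DaemonSet-managed Pods are expected to remain in the output.

```shell
# pods_on_pool POOL_NAME
# Lists Pods, across all namespaces, still bound to nodes in the node pool.
# (Hypothetical helper; DaemonSet Pods are expected to remain after a drain.)
pods_on_pool() {
  for node in $(kubectl get nodes -l "cloud.google.com/gke-nodepool=$1" -o=name); do
    # -o=name prints "node/NAME"; the field selector needs the bare name.
    kubectl get pods --all-namespaces --no-headers \
      --field-selector "spec.nodeName=${node#node/}"
  done
}

# Usage:
# pods_on_pool EXISTING_NODE_POOL_NAME
```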
Delete the existing node pool if you don't need it anymore:
gcloud container node-pools delete default-pool --cluster CLUSTER_NAME