This page shows you how to discover and resolve issues with cluster autoscaler not scaling up nodes in your Google Kubernetes Engine (GKE) clusters.
This page is for Application developers who want to resolve an unexpected or negative situation with their app or service, and for Platform admins and operators who want to prevent interruption to the delivery of products and services.
Understand when cluster autoscaler scales up your nodes
Before you proceed to the troubleshooting steps, it can be helpful to understand when cluster autoscaler would try to scale up your nodes. Cluster autoscaler only adds nodes when existing resources are insufficient.
Every 10 seconds, cluster autoscaler checks if there are any Pods that are unschedulable. A Pod becomes unschedulable when the Kubernetes scheduler cannot place it on any existing node due to insufficient resources, node constraints, or unmet Pod requirements.
When cluster autoscaler finds unschedulable Pods, it evaluates whether adding a node would allow the Pod to be scheduled. If adding a node would let a Pod get scheduled, cluster autoscaler adds a new node to the managed instance group (MIG). The Kubernetes scheduler can then schedule the Pod on the newly provisioned node.
Check if you have unschedulable Pods
To determine if your cluster needs to scale up, check for unschedulable Pods:
In the Google Cloud console, go to the Workloads page.
In the Filter field, enter unschedulable and press Enter.
If there are any Pods listed, then you have unschedulable Pods. To troubleshoot unschedulable Pods, see Error: Pod unschedulable. Resolving the underlying cause of unschedulable Pods can often enable cluster autoscaler to scale up. To identify and resolve errors that are specific to cluster autoscaler, explore the following sections.
If there are no Pods listed, cluster autoscaler doesn't need to scale up and is working as expected.
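If you prefer to check from the command line, the following kubectl sketch (assuming kubectl is configured for your cluster) shows one way to surface similar information; unschedulable Pods stay in the Pending phase and generate FailedScheduling events:

# List Pending Pods across all namespaces. Note that Pending also includes Pods
# that aren't running yet for reasons other than scheduling.
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Show the scheduler's FailedScheduling events, which explain why Pods can't be placed.
kubectl get events --all-namespaces --field-selector=reason=FailedScheduling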
Check if you previously had unschedulable Pods
If you're investigating what caused cluster autoscaler to fail in the past, check for previously unschedulable Pods:
In the Google Cloud console, go to the Logs Explorer page.
Specify a time range for the log entries that you want to view.
In the query pane, enter the following query:
logName="projects/PROJECT_ID/logs/events" jsonPayload.source.component="default-scheduler" jsonPayload.reason="FailedScheduling"
Replace PROJECT_ID with your project ID.
Click Run query.
If there are any results listed, then you had unschedulable Pods in the time range that you specified.
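You can also run this query from the command line with the gcloud CLI. The following is a sketch, assuming the gcloud CLI is installed and authenticated; adjust --freshness to match the time range that you want to inspect:

gcloud logging read \
  'logName="projects/PROJECT_ID/logs/events" AND jsonPayload.source.component="default-scheduler" AND jsonPayload.reason="FailedScheduling"' \
  --project=PROJECT_ID \
  --freshness=3d \
  --limit=20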
Check if the issue is caused by a limitation
After you've confirmed that you have unschedulable Pods, make sure your issue with cluster autoscaler isn't caused by one of the limitations for cluster autoscaler.
View errors
You can often diagnose the cause of scale up issues by viewing error messages:
If you've already seen an error message, see the error messages table for advice on resolving the error.
If you haven't seen a message yet, use one of the following options:
- Issues less than 72 hours old: View error notifications in the Google Cloud console.
- Issues over 72 hours old: View errors in events in Cloud Logging.
View errors in notifications
If the issue you observed happened less than 72 hours ago, view notifications about errors in the Google Cloud console. These notifications provide valuable insights into why cluster autoscaler didn't scale up and offer advice on how to resolve the error and view relevant logs for further investigation.
To view the notifications in the Google Cloud console, complete the following steps:
In the Google Cloud console, go to the Kubernetes clusters page.
Review the Notifications column. The following notifications are associated with scale up issues:
- Can't scale up
- Can't scale up pods
- Can't scale up a node pool
Click the relevant notification to see a pane with details about what caused the issue and recommended actions to resolve it.
Optional: To view the logs for this event, click Logs. This action takes you to Logs Explorer with a pre-populated query to help you further investigate the scaling event. To learn more about how scale up events work, see View cluster autoscaler events.
If you're still experiencing issues after reviewing the advice in the notification, consult the error messages tables for further help.
View errors in events
If the issue you observed happened more than 72 hours ago, view events in Cloud Logging. When there has been an error, it's often recorded in an event.
To view cluster autoscaler logs in the Google Cloud console, complete the following steps:
In the Google Cloud console, go to the Kubernetes clusters page.
Select the name of the cluster that you want to investigate to view its Cluster details page.
On the Cluster details page, click the Logs tab.
On the Logs tab, click the Autoscaler Logs tab to view the logs.
Optional: To apply more advanced filters to narrow the results, click the button with the arrow on the right side of the page to view the logs in Logs Explorer.
To learn more about how scale up events work, see View cluster autoscaler events. For one example of how to use Cloud Logging, see the following troubleshooting example.
Example: Troubleshoot an issue over 72 hours old
The following example shows you how you might investigate and resolve an issue with a cluster not scaling up.
Scenario: A Pod was marked as unschedulable for an hour, but cluster autoscaler didn't provision any new nodes to schedule it. The issue occurred more than 72 hours ago.
Solution:
- Because the issue happened over 72 hours ago, you investigate the issue using Cloud Logging instead of looking at the notification messages.
- In Cloud Logging, you find the logging details for cluster autoscaler events, as described in View errors in events.
- You search for scaleUp events that contain the Pod that you're investigating in the triggeringPods field. You could filter the log entries, including filtering by a particular JSON field value. Learn more in Advanced logs queries.
- You don't find any scale up events. However, if you did, you could try to find an EventResult that contains the same eventId as the scaleUp event. You could then look at the errorMsg field and consult the list of possible scaleUp error messages.
- Because you didn't find any scaleUp events, you continue to search for noScaleUp events and review the following fields:
  - unhandledPodGroups: contains information about the Pod (or Pod's controller).
  - reason: provides global reasons indicating scaling up could be blocked.
  - skippedMigs: provides reasons why some MIGs might be skipped.
- You find a noScaleUp event for your Pod, and all MIGs in the rejectedMigs field have the same reason message ID of "no.scale.up.mig.failing.predicate" with two parameters: "NodeAffinity" and "node(s) did not match node selector".
Resolution:
After consulting the list of error messages, you discover that cluster autoscaler can't scale up a node pool because of a failing scheduling predicate for the pending Pods. The parameters are the name of the failing predicate and the reason why it failed.
To resolve the issue, you review the manifest of the Pod, and discover that it has a node selector that doesn't match any MIG in the cluster. You delete the selector from the manifest of the Pod and recreate the Pod. Cluster autoscaler adds a new node and the Pod is scheduled.
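If you want to confirm this kind of mismatch from the command line, a quick kubectl check like the following can help. This is a sketch; POD_NAME is a placeholder for the Pod that you're investigating:

# Print the Pod's node selector.
kubectl get pod POD_NAME -o jsonpath='{.spec.nodeSelector}'

# List the labels that actually exist on the cluster's nodes for comparison.
kubectl get nodes --show-labels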
Resolve scale up errors
After you have identified your error, use the following tables to help you understand what caused the error and how to resolve it.
ScaleUp errors
You can find event error messages for scaleUp
events in the corresponding
eventResult
event, in the resultInfo.results[].errorMsg
field.
Message | Details | Parameters | Mitigation |
---|---|---|---|
"scale.up.error.out.of.resources" | Resource errors occur when you try to request new resources in a zone that cannot accommodate your request due to the current unavailability of a Compute Engine resource, such as GPUs or CPUs. | Failing MIG IDs. | Follow the resource availability troubleshooting steps in the Compute Engine documentation. |
"scale.up.error.quota.exceeded" | The scaleUp event failed because some of the MIGs couldn't be increased due to exceeded Compute Engine quota. | Failing MIG IDs. | Check the Errors tab of the MIG in the Google Cloud console to see what quota is being exceeded. After you know which quota is being exceeded, follow the instructions to request a quota increase. |
"scale.up.error.waiting.for.instances.timeout" | The managed instance group failed to scale up due to a timeout. | Failing MIG IDs. | This message should be transient. If it persists, contact Cloud Customer Care for further investigation. |
"scale.up.error.ip.space.exhausted" | Can't scale up because instances in some of the managed instance groups ran out of IP addresses. This means that the cluster doesn't have enough unallocated IP address space to add new nodes or Pods. | Failing MIG IDs. | Follow the troubleshooting steps in Not enough free IP address space for Pods. |
"scale.up.error.service.account.deleted" | Can't scale up because the service account was deleted. | Failing MIG IDs. | Try to undelete the service account. If that procedure is unsuccessful, contact Cloud Customer Care for further investigation. |
Reasons for a noScaleUp event
A noScaleUp
event is periodically emitted when there are unschedulable Pods
in the cluster and cluster autoscaler cannot scale the cluster up to schedule
the Pods. noScaleUp
events are best-effort, and don't cover all possible cases.
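To list recent noScaleUp events from the command line, you can read the cluster autoscaler visibility log with gcloud. The following is a sketch, assuming the gcloud CLI is authenticated for your project; the :* operator checks that the noScaleUp field is present:

gcloud logging read \
  'logName="projects/PROJECT_ID/logs/container.googleapis.com%2Fcluster-autoscaler-visibility" AND jsonPayload.noDecisionStatus.noScaleUp:*' \
  --project=PROJECT_ID \
  --freshness=3d \
  --limit=10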
NoScaleUp top-level reasons
Top-level reason messages for noScaleUp
events appear in the
noDecisionStatus.noScaleUp.reason
field. The message contains a top-level
reason for why cluster autoscaler cannot scale the cluster up.
Message | Details | Mitigation |
---|---|---|
"no.scale.up.in.backoff" | No scale up because scaling up is in a backoff period (temporarily blocked). This message can occur during scale up events with a large number of Pods. | This message should be transient. Check this error after a few minutes. If it persists, contact Cloud Customer Care for further investigation. |
NoScaleUp top-level node auto-provisioning reasons
Top-level node auto-provisioning reason messages for noScaleUp
events appear
in the noDecisionStatus.noScaleUp.napFailureReason
field. The message contains
a top-level reason for why cluster autoscaler cannot provision new node pools.
Message | Details | Mitigation |
---|---|---|
"no.scale.up.nap.disabled" | Node auto-provisioning couldn't scale up because node auto-provisioning is not enabled at the cluster level. If node auto-provisioning is disabled, new nodes won't be automatically provisioned if the pending Pod has requirements that can't be satisfied by any existing node pool. | Review the cluster configuration and consider enabling node auto-provisioning. |
NoScaleUp MIG-level reasons
MIG-level reason messages for noScaleUp
events appear in the
noDecisionStatus.noScaleUp.skippedMigs[].reason
and
noDecisionStatus.noScaleUp.unhandledPodGroups[].rejectedMigs[].reason
fields.
The message contains a reason why cluster autoscaler can't increase the size of
a particular MIG.
Message | Details | Parameters | Mitigation |
---|---|---|---|
"no.scale.up.mig.skipped" | Cannot scale up a MIG because it was skipped during the simulation. | Reasons why the MIG was skipped (for example, missing a Pod requirement). | Review the parameters included in the error message and address why the MIG was skipped. |
"no.scale.up.mig.failing.predicate" | Can't scale up a node pool because of a failing scheduling predicate for the pending Pods. | Name of the failing predicate and reasons why it failed. | Review the Pod requirements, such as affinity rules, taints or tolerations, and resource requirements. |
NoScaleUp Pod-group-level node auto-provisioning reasons
Pod-group-level node auto-provisioning reason messages for noScaleUp
events
appear in the
noDecisionStatus.noScaleUp.unhandledPodGroups[].napFailureReasons[]
field. The
message contains a reason why cluster autoscaler cannot provision a new node
pool to schedule a particular Pod group.
Message | Details | Parameters | Mitigation |
---|---|---|---|
"no.scale.up.nap.pod.gpu.no.limit.defined" | Node auto-provisioning couldn't provision any node group because a pending Pod has a GPU request, but GPU resource limits are not defined at the cluster level. | Requested GPU type. | Review the pending Pod's GPU request, and update the cluster-level node auto-provisioning configuration for GPU limits (for one way to do this with gcloud, see the sketch after this table). |
"no.scale.up.nap.pod.gpu.type.not.supported" | Node auto-provisioning did not provision any node group for the Pod because it has requests for an unknown GPU type. | Requested GPU type. | Check the pending Pod's configuration for the GPU type to ensure that it matches a supported GPU type. |
"no.scale.up.nap.pod.zonal.resources.exceeded" | Node auto-provisioning did not provision any node group for the Pod in this zone because doing so would either violate the cluster-wide maximum resource limits or exceed the available resources in the zone, or because no machine type could fit the request. | Name of the considered zone. | Review and update cluster-wide maximum resource limits, the Pod resource requests, or the available zones for node auto-provisioning. |
"no.scale.up.nap.pod.zonal.failing.predicates" | Node auto-provisioning did not provision any node group for the Pod in this zone because of failing predicates. | Name of the considered zone and reasons why predicates failed. | Review the pending Pod's requirements, such as affinity rules, taints, tolerations, or resource requirements. |
Conduct further investigation
The following sections provide guidance on how to use Logs Explorer and
gcpdiag
to gain additional insights into your errors.
Investigate errors in Logs Explorer
If you want to further investigate your error message, view logs specific to your error:
In the Google Cloud console, go to the Logs Explorer page.
In the query pane, enter the following query:
resource.type="k8s_cluster" log_id("container.googleapis.com/cluster-autoscaler-visibility") jsonPayload.resultInfo.results.errorMsg.messageId="ERROR_MESSAGE"
Replace ERROR_MESSAGE with the message that you want to investigate. For example, scale.up.error.out.of.resources.
Click Run query.
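If you prefer the command line, the following gcloud sketch reads the same cluster-autoscaler-visibility log; it uses the logName form of the filter and assumes the gcloud CLI is authenticated for your project:

gcloud logging read \
  'logName="projects/PROJECT_ID/logs/container.googleapis.com%2Fcluster-autoscaler-visibility" AND jsonPayload.resultInfo.results.errorMsg.messageId="ERROR_MESSAGE"' \
  --project=PROJECT_ID \
  --freshness=3d \
  --limit=10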
Debug some errors with gcpdiag
gcpdiag
is an open source tool created with support from Google Cloud
technical engineers. It isn't an officially supported Google Cloud product.
If you've experienced one of the following error messages, you can use
gcpdiag
to help troubleshoot the issue:
- scale.up.error.out.of.resources
- scale.up.error.quota.exceeded
- scale.up.error.waiting.for.instances.timeout
- scale.up.error.ip.space.exhausted
- scale.up.error.service.account.deleted
For a list and description of all gcpdiag
tool flags, see the gcpdiag
usage
instructions.
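As an illustration, a typical invocation looks like the following sketch. It assumes that you run gcpdiag in Cloud Shell or through the wrapper script from the gcpdiag repository, and it runs all lint rules for the project rather than targeting a single error:

./gcpdiag lint --project=PROJECT_ID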
Resolve complex scale up errors
The following sections offer guidance on resolving errors where the mitigations involve multiple steps and errors that don't have a cluster autoscaler event message associated with them.
Issue: Pod doesn't fit on node
Cluster autoscaler only schedules a Pod on a node if there is a node with sufficient resources such as GPUs, memory, and storage to meet the Pod's requirements. To determine if this is why cluster autoscaler didn't scale up, compare resource requests with the resources provided.
The following example shows you how to check CPU resources but the same steps are applicable for GPUs, memory, and storage resources. To compare CPU requests with CPUs provisioned, complete the following steps:
In the Google Cloud console, go to the Workloads page.
Click the PodUnschedulable error message.
In the Details pane, click the name of the Pod. If there are multiple Pods, start with the first Pod and repeat the following process for each Pod.
In the Pod details page, go to the Events tab.
From the Events tab, go to the YAML tab.
Make note of each container's resource requests in the Pod to find the total of the resource requests. For example, in the following Pod configuration, the Pod needs 2 vCPUs:
resources:
  limits:
    cpu: "3"
  requests:
    cpu: "2"
View the node pool details from the cluster with the unschedulable Pod:
In the Google Cloud console, go to the Kubernetes clusters page.
Click the name of the cluster that has the Pods unschedulable error message.
In the Cluster details page, go to the Nodes tab.
In the Node pools section, make note of the value in the Machine type column. For example, n1-standard-1.
Compare the resource request with the vCPUs provided by the machine type. For example, if a Pod requests 2 vCPUs, but the available nodes have the n1-standard-1 machine type, the nodes would only have 1 vCPU. With a configuration like this, cluster autoscaler wouldn't trigger scale up because even if it added a new node, this Pod wouldn't fit on it. If you want to know more about available machine types, see Machine families resource and comparison guide in the Compute Engine documentation.
Also keep in mind that the allocatable resources of a node are less than the total resources, as a portion is needed to run system components. To learn more about how this is calculated, see Node allocatable resources.
To resolve this issue, decide if the resource requests defined for the workload are suitable for your needs. If the Pod's resource requests are accurate, create a node pool with a machine type that can support the requests coming from the Pod. If the Pod resource requests aren't accurate, update the Pod's definition so that the Pods can fit on nodes.
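To compare Pod resource requests with node capacity from the command line, a kubectl sketch like the following can help; POD_NAME and NAMESPACE are placeholders for the unschedulable Pod:

# Print each container's CPU request in the Pod so that you can total them.
kubectl get pod POD_NAME -n NAMESPACE \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests.cpu}{"\n"}{end}'

# Show the allocatable CPU on each node, which is less than the machine type's total vCPUs.
kubectl get nodes -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu'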
Issue: Unhealthy clusters preventing scale up
Cluster autoscaler might not perform scale up if it considers a cluster to be unhealthy. Cluster unhealthiness isn't based on the control plane being healthy, but on the ratio of healthy and ready nodes. If 45% of nodes in a cluster are unhealthy or not ready, cluster autoscaler halts all operations.
If this is why your cluster autoscaler isn't scaling up, there is an event in
the cluster autoscaler ConfigMap with the type Warning
with ClusterUnhealthy
listed as the reason.
To view the ConfigMap, run the following command:
kubectl describe configmap cluster-autoscaler-status -n kube-system
To resolve this issue, decrease the number of unhealthy nodes.
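To gauge how close the cluster is to that threshold, you can count node statuses. This is a rough sketch using kubectl and standard shell tools:

# Count nodes by status (Ready, NotReady, and so on).
kubectl get nodes --no-headers | awk '{print $2}' | sort | uniq -c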
It's also possible that some of the nodes are ready, though not considered ready
by cluster autoscaler. This happens when a taint with the prefix
ignore-taint.cluster-autoscaler.kubernetes.io/
is present on a node. Cluster
autoscaler considers a node to be NotReady
as long as that taint is present.
If the behavior is caused by the presence of a taint with the
ignore-taint.cluster-autoscaler.kubernetes.io/ prefix, remove it.
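To find and remove such a taint from the command line, a kubectl sketch like the following can help; NODE_NAME and TAINT_KEY are placeholders, where TAINT_KEY is the full taint key, including the ignore-taint.cluster-autoscaler.kubernetes.io/ prefix:

# List each node's taint keys.
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

# Remove the taint; the trailing "-" deletes the taint with that key.
kubectl taint nodes NODE_NAME TAINT_KEY-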
What's next
- Review the Kubernetes cluster autoscaler FAQ.
- Watch a YouTube video about troubleshooting and resolving scaling issues.
- If you need additional assistance, reach out to Cloud Customer Care.