Viewing cluster autoscaler events

The Google Kubernetes Engine (GKE) cluster autoscaler emits visibility events, which are available as log entries in Cloud Logging. This page shows you how to view those logged events to gain insight into when and why the GKE cluster autoscaler makes autoscaling decisions.

Availability requirements

The ability to view logged events for cluster autoscaler is available in the following cluster versions:

Event type                                 Cluster version
status, scaleUp, scaleDown, eventResult    1.15.4-gke.7 and later
nodePoolCreated, nodePoolDeleted           1.15.4-gke.18 and later
noScaleUp                                  1.16.6-gke.3 and later
noScaleDown                                1.16.8-gke.2 and later

Viewing events

The visibility events for the cluster autoscaler are stored in a Cloud Logging log in the same project as your GKE cluster.

To view the logs, perform the following:

  1. In the Cloud Console, go to the Logs Viewer page.

  2. Search for the logs using the basic or advanced query interface.

    To search for logs using the basic query interface, perform the following:

    1. From the resources drop-down list, select Kubernetes Cluster, then select the location and the name of your cluster.
    2. From the logs type drop-down list, select container.googleapis.com/cluster-autoscaler-visibility.
    3. From the time-range drop-down list, select the desired time range.

    To search for logs using the advanced query interface, apply the following advanced filter:

    resource.type="k8s_cluster"
    resource.labels.location="cluster-location"
    resource.labels.cluster_name="cluster-name"
    logName="projects/project-id/logs/container.googleapis.com%2Fcluster-autoscaler-visibility"
    

    where:

    • cluster-location is the location of the cluster you are inspecting.
    • cluster-name is the name of your cluster.
    • project-id is the ID of the project.
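
If you prefer to query the log programmatically, the same filter works with the Cloud Logging client libraries. The following is a minimal sketch using the google-cloud-logging Python client; it assumes the library is installed and that your environment has credentials for the project, and the project, location, and cluster values are placeholders that you replace with your own:

# Minimal sketch: list recent cluster autoscaler visibility events with the
# google-cloud-logging Python client. Replace the placeholder values.
import itertools

from google.cloud import logging

client = logging.Client(project="project-id")

log_filter = (
    'resource.type="k8s_cluster" '
    'resource.labels.location="cluster-location" '
    'resource.labels.cluster_name="cluster-name" '
    'logName="projects/project-id/logs/'
    'container.googleapis.com%2Fcluster-autoscaler-visibility"'
)

# Print the 20 most recent visibility events; entry.payload holds the
# JSON event described in the "Types of events" section.
entries = client.list_entries(filter_=log_filter, order_by=logging.DESCENDING)
for entry in itertools.islice(entries, 20):
    print(entry.timestamp, entry.payload)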

Types of events

All logged events use the JSON format and can be found in the jsonPayload field of a log entry. All timestamps in the events are Unix timestamps, expressed in seconds.
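
For example, a measureTime value such as the one in the status event shown later on this page can be converted to a readable time by treating it as epoch seconds; here is a small Python illustration:

from datetime import datetime, timezone

# measureTime, decideTime, and similar fields are strings of Unix epoch seconds.
print(datetime.fromtimestamp(int("1582898536"), tz=timezone.utc))
# 2020-02-28 14:02:16+00:00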

Here's a summary of the types of events emitted by the cluster autoscaler:

Event type Description
status Occurs periodically and describes the actual size and the target size of all autoscaled node pools, as observed by the cluster autoscaler.
scaleUp Occurs when cluster autoscaler scales the cluster up.
scaleDown Occurs when cluster autoscaler scales the cluster down.
eventResult Occurs when a scaleUp or a scaleDown event completes successfully or unsuccessfully.
nodePoolCreated Occurs when cluster autoscaler with node auto-provisioning enabled creates a new node pool.
nodePoolDeleted Occurs when cluster autoscaler with node auto-provisioning enabled deletes a node pool.
noScaleUp Occurs when there are unschedulable Pods in the cluster, and cluster autoscaler cannot scale the cluster up to accommodate the Pods.
noScaleDown Occurs when there are nodes that are blocked from being deleted by cluster autoscaler.

Status event

A status event is emitted periodically and describes the actual size and the target size of all autoscaled node pools, as observed by cluster autoscaler.

Example

The following log sample shows a status event:

{
  "status": {
    "autoscaledNodesCount": 4,
    "autoscaledNodesTarget": 4,
    "measureTime": "1582898536"
  }
}
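
To restrict a query to a single event type, you can add a field-presence clause to the advanced filter from the Viewing events section. For example, appending the following line returns only entries that contain a status event (the :* operator matches entries in which the field is present):

jsonPayload.status:*

The same pattern works for the other event types, for example jsonPayload.decision.scaleUp:* or jsonPayload.noDecisionStatus.noScaleDown:*.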

ScaleUp event

A scaleUp event is emitted when the cluster autoscaler scales the cluster up. This event contains information about which managed instance groups (MIGs) were scaled up, by how many nodes, and which unschedulable Pods triggered the event.

The list of triggering Pods is truncated to 50 arbitrary entries. The actual number of triggering Pods can be found in the triggeringPodsTotalCount field.

Example

The following log sample shows a scaleUp event:

{
  "decision": {
    "decideTime": "1582124907",
    "eventId": "ed5cb16d-b06f-457c-a46d-f75dcca1f1ee",
    "scaleUp": {
      "increasedMigs": [
        {
          "mig": {
            "name": "test-cluster-default-pool-a0c72690-grp",
            "nodepool": "default-pool",
            "zone": "us-central1-c"
          },
          "requestedNodes": 1
        }
      ],
      "triggeringPods": [
        {
          "controller": {
            "apiVersion": "apps/v1",
            "kind": "ReplicaSet",
            "name": "test-85958b848b"
          },
          "name": "test-85958b848b-ptc7n",
          "namespace": "default"
        }
      ],
      "triggeringPodsTotalCount": 1
    }
  }
}

ScaleDown event

A scaleDown event is emitted when cluster autoscaler scales the cluster down. This event contains information about which nodes will be removed and which Pods will be evicted as a result.

The cpuRatio and memRatio fields describe the CPU and memory utilization of the node, as a percentage. This utilization is the sum of the requests of the Pods on the node divided by the node's allocatable resources, not the node's actual utilization. For example, a node with 4000m of allocatable CPU whose Pods request a total of 920m of CPU is reported with a cpuRatio of 23.

The list of evicted Pods is truncated to 50 arbitrary entries. The actual number of evicted Pods can be found in the evictedPodsTotalCount field.

Example

The following log sample shows a scaleDown event:

{
  "decision": {
    "decideTime": "1580594665",
    "eventId": "340dac18-8152-46ff-b79a-747f70854c81",
    "scaleDown": {
      "nodesToBeRemoved": [
        {
          "evictedPods": [
            {
              "controller": {
                "apiVersion": "apps/v1",
                "kind": "ReplicaSet",
                "name": "kube-dns-5c44c7b6b6"
              },
              "name": "kube-dns-5c44c7b6b6-xvpbk"
            }
          ],
          "evictedPodsTotalCount": 1,
          "node": {
            "cpuRatio": 23,
            "memRatio": 5,
            "mig": {
              "name": "test-cluster-default-pool-c47ef39f-grp",
              "nodepool": "default-pool",
              "zone": "us-central1-f"
            },
            "name": "test-cluster-default-pool-c47ef39f-p395"
          }
        }
      ]
    }
  }
}

EventResult event

An eventResult event is emitted when a scaleUp or a scaleDown event completes successfully or unsuccessfully. This event contains a list of event IDs (from the eventId field in scaleUp or scaleDown events), along with error messages. An empty error message indicates that the event completed successfully. The event results are aggregated in the results field.

To diagnose errors, consult the ScaleUp errors and ScaleDown errors sections.
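
Because a decision and its result are linked only by eventId, one way to look up the outcome of a specific scaleUp or scaleDown event is to add an eventId clause to the advanced filter from the Viewing events section. For example, the following clause matches the result of the scaleUp event shown earlier on this page:

jsonPayload.resultInfo.results.eventId="ed5cb16d-b06f-457c-a46d-f75dcca1f1ee"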

Example

The following log sample shows an eventResult event:

{
  "resultInfo": {
    "measureTime": "1582878896",
    "results": [
      {
        "eventId": "2fca91cd-7345-47fc-9770-838e05e28b17"
      },
      {
        "errorMsg": {
          "messageId": "scale.down.error.failed.to.delete.node.min.size.reached",
          "parameters": [
            "test-cluster-default-pool-5c90f485-nk80"
          ]
        },
        "eventId": "ea2e964c-49b8-4cd7-8fa9-fefb0827f9a6"
      }
    ]
  }
}

NodePoolCreated event

A nodePoolCreated event is emitted when cluster autoscaler with node auto-provisioning enabled creates a new node pool. This event contains the name of the created node pool and a list of its MIGs. If the node pool was created because of a scaleUp event, the eventId of the corresponding scaleUp event is included in the triggeringScaleUpId field.
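
To trace a node pool creation back to the scaleUp decision that requested it, you can search for the decision whose eventId equals the triggeringScaleUpId. For example, the following clause, added to the advanced filter from the Viewing events section, matches the triggering scaleUp event for the sample below:

jsonPayload.decision.eventId="d25e0e6e-25e3-4755-98eb-49b38e54a728"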

Example

The following log sample shows a nodePoolCreated event:

{
  "decision": {
    "decideTime": "1585838544",
    "eventId": "822d272c-f4f3-44cf-9326-9cad79c58718",
    "nodePoolCreated": {
      "nodePools": [
        {
          "migs": [
            {
              "name": "test-cluster-nap-n1-standard--b4fcc348-grp",
              "nodepool": "nap-n1-standard-1-1kwag2qv",
              "zone": "us-central1-f"
            },
            {
              "name": "test-cluster-nap-n1-standard--jfla8215-grp",
              "nodepool": "nap-n1-standard-1-1kwag2qv",
              "zone": "us-central1-c"
            }
          ],
          "name": "nap-n1-standard-1-1kwag2qv"
        }
      ],
      "triggeringScaleUpId": "d25e0e6e-25e3-4755-98eb-49b38e54a728"
    }
  }
}

NodePoolDeleted event

A nodePoolDeleted event is emitted when cluster autoscaler with node auto-provisioning enabled deletes a node pool.

Example

The following log sample shows a nodePoolDeleted event:

{
  "decision": {
    "decideTime": "1585830461",
    "eventId": "68b0d1c7-b684-4542-bc19-f030922fb820",
    "nodePoolDeleted": {
      "nodePoolNames": [
        "nap-n1-highcpu-8-ydj4ewil"
      ]
    }
  }
}

NoScaleUp event

A noScaleUp event is periodically emitted when there are unschedulable Pods in the cluster and cluster autoscaler cannot scale the cluster up to accommodate the Pods.

  • noScaleUp events are best-effort; that is, they do not cover all possible reasons why cluster autoscaler cannot scale up.
  • noScaleUp events are throttled to limit the produced log volume. Each persisting reason is only emitted every couple of minutes.
  • All the reasons can be arbitrarily split across multiple events. For example, there is no guarantee that all rejected MIG reasons for a single Pod group will appear in the same event.
  • The list of unhandled Pod groups is truncated to 50 arbitrary entries. The actual number of unhandled Pod groups can be found in the unhandledPodGroupsTotalCount field.

Reason fields

The following fields help to explain why scaling up did not occur:

  • reason: Provides a global reason why cluster autoscaler is prevented from scaling up. Refer to the NoScaleUp top-level reasons section for details.
  • napFailureReason: Provides a global reason preventing cluster autoscaler from provisioning additional node pools (for example, node auto-provisioning is disabled). Refer to the NoScaleUp top-level node auto-provisioning reasons section for details.
  • skippedMigs[].reason: Provides information about why a particular MIG was skipped. Cluster autoscaler skips some MIGs from consideration for any Pod during a scaling up attempt (for example, because adding another node would exceed cluster-wide resource limits). Refer to the NoScaleUp MIG-level reasons section for details.
  • unhandledPodGroups: Contains information about why a particular group of unschedulable Pods does not trigger scaling up. The Pods are grouped by their immediate controller. Pods without a controller are in groups by themselves. Each Pod group contains an arbitrary example Pod and the number of Pods in the group, as well as the following reasons:
    • napFailureReasons: Reasons why cluster autoscaler cannot provision a new node pool to accommodate this Pod group (for example, Pods have affinity constraints). Refer to the NoScaleUp Pod-group-level node auto-provisioning reasons section for details.
    • rejectedMigs[].reason: Per-MIG reasons why cluster autoscaler cannot increase the size of a particular MIG to accommodate this Pod group (for example, the MIG's node is too small for the Pods). Refer to the NoScaleUp MIG-level reasons section for details.
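
The following sketch illustrates how these fields nest; it walks the payload of a noScaleUp entry (assumed to be already parsed into a Python dict) and prints the recorded reasons for each unhandled Pod group. The summarize_no_scale_up helper is hypothetical, not part of any GKE or Cloud Logging API.

# Hypothetical helper: print the per-group reasons recorded in a noScaleUp event.
# `payload` is the parsed jsonPayload of a noScaleUp log entry.
def summarize_no_scale_up(payload):
    no_scale_up = payload["noDecisionStatus"]["noScaleUp"]
    for group in no_scale_up.get("unhandledPodGroups", []):
        pod = group["podGroup"]["samplePod"]
        print(f"Pod group (sample Pod {pod['namespace']}/{pod['name']}):")
        for reason in group.get("napFailureReasons", []):
            print("  auto-provisioning:", reason["messageId"], reason.get("parameters", []))
        for rejected in group.get("rejectedMigs", []):
            print("  MIG", rejected["mig"]["name"] + ":",
                  rejected["reason"]["messageId"], rejected["reason"].get("parameters", []))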

Example

The following log sample shows a noScaleUp event:

{
  "noDecisionStatus": {
    "measureTime": "1582523362",
    "noScaleUp": {
      "skippedMigs": [
        {
          "mig": {
            "name": "test-cluster-nap-n1-highmem-4-fbdca585-grp",
            "nodepool": "nap-n1-highmem-4-1cywzhvf",
            "zone": "us-central1-f"
          },
          "reason": {
            "messageId": "no.scale.up.mig.skipped",
            "parameters": [
              "max cluster cpu limit reached"
            ]
          }
        }
      ],
      "unhandledPodGroups": [
        {
          "napFailureReasons": [
            {
              "messageId": "no.scale.up.nap.pod.zonal.resources.exceeded",
              "parameters": [
                "us-central1-f"
              ]
            }
          ],
          "podGroup": {
            "samplePod": {
              "controller": {
                "apiVersion": "v1",
                "kind": "ReplicationController",
                "name": "memory-reservation2"
              },
              "name": "memory-reservation2-6zg8m",
              "namespace": "autoscaling-1661"
            },
            "totalPodCount": 1
          },
          "rejectedMigs": [
            {
              "mig": {
                "name": "test-cluster-default-pool-b1808ff9-grp",
                "nodepool": "default-pool",
                "zone": "us-central1-f"
              },
              "reason": {
                "messageId": "no.scale.up.mig.failing.predicate",
                "parameters": [
                  "NodeResourcesFit",
                  "Insufficient memory"
                ]
              }
            }
          ]
        }
      ],
      "unhandledPodGroupsTotalCount": 1
    }
  }
}

NoScaleDown event

A noScaleDown event is periodically emitted when there are nodes that are blocked from being deleted by cluster autoscaler.

  • Nodes that cannot be removed because their utilization is high are not included in noScaleDown events.
  • NoScaleDown events are best-effort; that is, they do not cover all possible reasons why cluster autoscaler cannot scale down.
  • NoScaleDown events are throttled to limit the produced log volume. Each persisting reason will only be emitted every couple of minutes.
  • The list of nodes is truncated to 50 arbitrary entries. The actual number of nodes can be found in the nodesTotalCount field.

Reason fields

The following fields help to explain why scaling down did not occur:

  • reason: Provides a global reason why cluster autoscaler is prevented from scaling down (for example, a backoff period after recently scaling up). Refer to the NoScaleDown top-level reasons section for details.
  • nodes[].reason: Provides per-node reasons why cluster autoscaler is prevented from deleting a particular node (for example, there's no place to move the node's Pods to). Refer to the NoScaleDown node-level reasons section for details.

Example

The following log sample shows a noScaleDown event:

{
  "noDecisionStatus": {
    "measureTime": "1582858723",
    "noScaleDown": {
      "nodes": [
        {
          "node": {
            "cpuRatio": 42,
            "mig": {
              "name": "test-cluster-default-pool-f74c1617-grp",
              "nodepool": "default-pool",
              "zone": "us-central1-c"
            },
            "name": "test-cluster-default-pool-f74c1617-fbhk"
          },
          "reason": {
            "messageId": "no.scale.down.node.no.place.to.move.pods"
          }
        }
      ],
      "nodesTotalCount": 1,
      "reason": {
        "messageId": "no.scale.down.in.backoff"
      }
    }
  }
}

Debugging scenarios

This section provides guidance for how to debug scaling events.

Cluster not scaling up

Scenario: I created a Pod in my cluster, but it has been stuck in the Pending state for the past hour. Cluster autoscaler did not provision any new nodes to accommodate the Pod.

Solution:

  1. In the Logs Viewer, find the logging details for cluster autoscaler events, as described in the Viewing events section.
  2. Search for scaleUp events that contain the desired Pod in the triggeringPods field. You can filter the log entries by a particular JSON field value; an example filter follows this procedure. Learn more in Advanced logs queries.

    1. Find an eventResult event that contains the same eventId as the scaleUp event.
    2. Look at the errorMsg field and consult the list of possible scaleUp error messages.

    ScaleUp error example: For a scaleUp event, you discover the error is "scale.up.error.quota.exceeded", which indicates that "A scaleUp event failed because some of the MIGs could not be increased due to exceeded quota". To resolve the issue, you review your quota settings and increase the settings that are close to being exceeded. Cluster autoscaler adds a new node and the Pod is scheduled.

  3. Otherwise, search for noScaleUp events and review the following fields:

    • unhandledPodGroups: contains information about the Pod (or Pod's controller).
    • reason: provides global reasons indicating scaling up could be blocked.
    • skippedMigs: provides reasons why some MIGs might be skipped.
  4. Refer to the Reasons for a NoScaleUp event section for the possible reason messages.

    NoScaleUp example: You found a noScaleUp event for your Pod, and all MIGs in the rejectedMigs field have the same reason message ID of "no.scale.up.mig.failing.predicate" with two parameters: "NodeAffinity" and "node(s) did not match node selector". After consulting the list of error messages, you discover that you "cannot scale up a MIG because a predicate failed for it"; the parameters are the name of the failing predicate and the reason why it failed. To resolve the issue, you review the Pod spec, and discover that it has a node selector that doesn't match any MIG in the cluster. You delete the selector from the Pod spec and recreate the Pod. Cluster autoscaler adds a new node and the Pod is scheduled.

  5. If there are no noScaleUp events, use other debugging methods to resolve the issue.
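
For step 2 of this procedure, one way to narrow the search to scaleUp events triggered by a specific Pod is to add a clause such as the following to the advanced filter from the Viewing events section, where pod-name is a placeholder for the name of your Pod:

jsonPayload.decision.scaleUp.triggeringPods.name="pod-name"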

Cluster not scaling down

Scenario: I have a node in my cluster that has utilized only 10% of its CPU and memory for the past couple of days. Despite the low utilization, cluster autoscaler did not delete the node as expected.

Solution:

  1. In the Logs Viewer, find the logging details for cluster autoscaler events, as described in the Viewing events section.
  2. Search for scaleDown events that contain the desired node in the nodesToBeRemoved field. You can filter the log entries by a particular JSON field value; an example filter follows this procedure. Learn more in Advanced logs queries.
    1. Find an eventResult event that contains the same eventId as the scaleDown event.
    2. Look at the errorMsg field and consult the list of possible scaleDown error messages.
  3. Otherwise, search for noScaleDown events that have the desired node in the nodes field. Review the reason field for any global reasons indicating that scaling down could be blocked.
  4. Refer to the Reasons for a NoScaleDown event section for the possible reason messages.

    NoScaleDown example: You found a noScaleDown event that contains a per-node reason for your node. The message ID is "no.scale.down.node.pod.has.local.storage" and there is a single parameter: "test-single-pod". After consulting the list of error messages, you discover this means that the "Pod is blocking scale down because it requests local storage". You consult the Kubernetes Cluster Autoscaler FAQ and find out that the solution is to add a "cluster-autoscaler.kubernetes.io/safe-to-evict": "true" annotation to the Pod. After applying the annotation, cluster autoscaler scales down the cluster correctly.

  5. If there are no noScaleDown events, use other debugging methods to resolve the issue.
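
For steps 2 and 3 of this procedure, you can narrow the search to events that mention a specific node by filtering on the node name. For example, the following clause, added to the advanced filter from the Viewing events section, matches scaleDown events for the node, where node-name is a placeholder; for noScaleDown events, filter on jsonPayload.noDecisionStatus.noScaleDown.nodes.node.name instead:

jsonPayload.decision.scaleDown.nodesToBeRemoved.node.name="node-name"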

Messages

The events emitted by the cluster autoscaler use parameterized messages (seen in the messageId field) to provide explanations for the event.

This section describes the various messageId values and their corresponding parameters. It does not cover all possible messages and may be extended at any time.
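
If you are scanning for failures rather than for a particular event, one approach is to add a presence test on the error message field to the advanced filter from the Viewing events section, so that only eventResult entries that report an error are returned:

jsonPayload.resultInfo.results.errorMsg.messageId:*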

ScaleUp errors

Error messages for scaleUp events are found in the corresponding eventResult event, in the resultInfo.results[].errorMsg field.

Message Description
"scale.up.error.out.of.resources"
The scaleUp event failed because some of the MIGs could not be increased, due to lack of resources.

Parameters: Failing MIG IDs.

"scale.up.error.quota.exceeded"
The scaleUp event failed because some of the MIGs could not be increased, due to exceeded Compute Engine quota.

Parameters: Failing MIG IDs.

"scale.up.error.waiting.for.instances.timeout"
The scaleUp event failed because instances in some of the MIGs failed to appear in time.

Parameters: Failing MIG IDs.

ScaleDown errors

Error messages for scaleDown events are found in the corresponding eventResult event, in the resultInfo.results[].errorMsg field.

Message Description
"scale.down.error.failed.to.mark.to.be.deleted"
The scaleDown event failed because a node could not be marked for deletion.

Parameters: Failing node name.

"scale.down.error.failed.to.evict.pods"
The scaleDown event failed because some of the Pods could not be evicted from a node.

Parameters: Failing node name.

"scale.down.error.failed.to.delete.node.min.size.reached"
The scaleDown event failed because a node could not be deleted due to the cluster already being at its minimum size.

Parameters: Failing node name.

Reasons for a NoScaleUp event

NoScaleUp top-level reasons

Top-level reason messages for noScaleUp events appear in the noDecisionStatus.noScaleUp.reason field. The message contains a top-level reason why cluster autoscaler cannot scale the cluster up.

Message Description
"no.scale.up.in.backoff"
A noScaleUp event occurred because scaling up is in a backoff period (temporarily blocked).

NoScaleUp top-level node auto-provisioning reasons

Top-level node auto-provisioning reason messages for noScaleUp events appear in the noDecisionStatus.noScaleUp.napFailureReason field. The message contains a top-level reason for why cluster autoscaler cannot provision new node pools.

Message Description
"no.scale.up.nap.disabled"
Node auto-provisioning did not provision any node groups because node auto-provisioning was disabled. See Enabling Node auto-provisioning for more details.
"no.scale.up.nap.no.locations.available"
Node auto-provisioning did not provision any node groups because no node auto-provisioning locations were available.

NoScaleUp MIG-level reasons

MIG-level reason messages for noScaleUp events appear in the noDecisionStatus.noScaleUp.skippedMigs[].reason and noDecisionStatus.noScaleUp.unhandledPodGroups[].rejectedMigs[].reason fields. The message contains a reason why cluster autoscaler cannot increase the size of a particular MIG.

Message Description
"no.scale.up.mig.skipped"
Cannot scale up a MIG because it was skipped during the simulation.

Parameters: human-readable reasons why it was skipped.

"no.scale.up.mig.failing.predicate"
Cannot scale up a MIG because a predicate failed for the MIG.

Parameters: Name of the failing predicate, human-readable reasons why it failed.

NoScaleUp Pod-group-level node auto-provisioning reasons

Pod-group-level node auto-provisioning reason messages for noScaleUp events appear in the noDecisionStatus.noScaleUp.unhandledPodGroups[].napFailureReasons[] field. The message contains a reason why cluster autoscaler cannot provision a new node pool to accommodate a particular Pod group.

Message Description
"no.scale.up.nap.pod.gpu.no.limit.defined"
Node auto-provisioning did not provision any node group for the Pod because the Pod has a GPU request, and the GPU doesn't have a limit defined.

Parameters: Requested GPU type.

"no.scale.up.nap.pod.gpu.type.not.supported"
Node auto-provisioning did not provision any node group for the Pod because it specifies an unsupported GPU. See Configuring GPU limits for more details.

Parameters: Requested GPU type.

"no.scale.up.nap.pod.gpu.other.error"
Node auto-provisioning did not provision any node group for the Pod because of other issues with the GPU configuration. See Configuring GPU limits for more details.
"no.scale.up.nap.pod.zonal.resources.exceeded"
Node auto-provisioning did not provision any node group for the Pod in this zone because doing so would violate resource limits.

Parameters: Name of the considered zone.

"no.scale.up.nap.pod.zonal.failing.predicates"
Node auto-provisioning did not provision any node group for the Pod in this zone because of failing predicates.

Parameters: Name of the considered zone, human-readable reasons why predicates failed.

Reasons for a NoScaleDown event

NoScaleDown top-level reasons

Top-level reason messages for noScaleDown events appear in the noDecisionStatus.noScaleDown.reason field. The message contains a top-level reason why cluster autoscaler cannot scale the cluster down.

Message Description
"no.scale.down.in.backoff"
A noScaleDown event occurred because scaling down is in a backoff period (temporarily blocked).
"no.scale.down.in.progress"
A noScaleDown event occurred because a previous scaleDown event is still in progress.

NoScaleDown node-level reasons

Node-level reason messages for noScaleDown events appear in the noDecisionStatus.noScaleDown.nodes[].reason field. The message contains a reason why cluster autoscaler cannot remove a particular node.

Message Description
"no.scale.down.node.scale.down.disabled.annotation"
Node cannot be removed because it has a "scale down disabled" annotation. See the Kubernetes Cluster Autoscaler FAQ for more details.
"no.scale.down.node.node.group.min.size.reached"
Node cannot be removed because its node group is already at its minimal size.
"no.scale.down.node.minimal.resource.limits.exceeded"
Node cannot be removed because it would violate cluster-wide minimal resource limits.
"no.scale.down.node.no.place.to.move.pods"
Node cannot be removed because there's no place to move its Pods to.
"no.scale.down.node.pod.not.backed.by.controller"
Pod is blocking scale down because it is not backed by a controller. See the Kubernetes Cluster Autoscaler FAQ for more details.

Parameters: Name of the blocking pod.

"no.scale.down.node.pod.has.local.storage"
Pod is blocking scale down because it requests local storage. See the Kubernetes Cluster Autoscaler FAQ for more details.

Parameters: Name of the blocking pod.

"no.scale.down.node.pod.not.safe.to.evict.annotation"
Pod is blocking scale down because it has a "not safe to evict" annotation. See the Kubernetes Cluster Autoscaler FAQ for more details.

Parameters: Name of the blocking pod.

"no.scale.down.node.pod.kube.system.unmovable"
Pod is blocking scale down because it's a non-daemonset, non-mirrored, non-pdb-assigned kube-system pod. See the Kubernetes Cluster Autoscaler FAQ for more details.

Parameters: Name of the blocking pod.

"no.scale.down.node.pod.not.enough.pdb"
Pod is blocking scale down because it doesn't have enough PodDisruptionBudget left. See the Kubernetes Cluster Autoscaler FAQ for more details.

Parameters: Name of the blocking pod.

What's next