Identify which pod triggered Autoscaler to add nodes

Problem

You notice that the Google Kubernetes Engine (GKE) cluster autoscaler added nodes to a cluster. You want to identify which pod triggered the scale-up and which node pool the nodes were added to.

Environment

Google Kubernetes Engine
Cluster Autoscaler

Solution

You can review autoscaling events from within the project in the following ways:

Cluster-autoscaler-visibility logs from Cloud Logging provide status and decision events.

Use the following Cloud Logging advanced filter, replacing PROJECT_NAME with your project ID and CLUSTER_NAME with the name of your cluster:

logName="projects/PROJECT_NAME/logs/container.googleapis.com%2Fcluster-autoscaler-visibility"
resource.labels.cluster_name="CLUSTER_NAME"
-"noDecisionStatus"
-"resultInfo"
-"status"
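
If you run this query often, the filter text can be assembled programmatically. The following is a minimal Python sketch, assuming the filter shown above; the function name build_visibility_filter and the sample values are hypothetical, and double-quoting the cluster name is a precaution rather than a documented requirement.

```python
def build_visibility_filter(project_id: str, cluster_name: str) -> str:
    """Return the Cloud Logging filter shown above with the placeholders filled in."""
    log_name = (
        f"projects/{project_id}/logs/"
        "container.googleapis.com%2Fcluster-autoscaler-visibility"
    )
    lines = [
        f'logName="{log_name}"',
        f'resource.labels.cluster_name="{cluster_name}"',
        # Exclude entries containing these strings, as the filter above does.
        '-"noDecisionStatus"',
        '-"resultInfo"',
        '-"status"',
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical project ID and cluster name.
    print(build_visibility_filter("gke-project", "cluster"))
```

The resulting string can be pasted into the Logs Explorer query box or passed as the filter argument of gcloud logging read.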

In each decision event, jsonPayload.decision shows whether the entry is a scaleUp or a scaleDown event. For a scale-up, increasedMigs lists each managed instance group (MIG) and node pool that was resized, with the number of nodes requested (requestedNodes). The triggeringPods and triggeringPodsTotalCount fields identify the pods that triggered the scale-up and how many there were.

Example of a scale-up event (excerpt)

  "jsonPayload": {
    "decision": {
      "scaleUp": {
        "triggeringPodsTotalCount": 1,
        "triggeringPods": [
          {
            "namespace": "default",
            "name": "gpu-pod"
          }
        ],
        "increasedMigs": [
          {
            "mig": {
              "name": "gke-cluster-gpu-np-91ae67a2-grp",
              "zone": "us-central1-a",
              "nodepool": "gpu-np"
            },
            "requestedNodes": 1
          }
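
To pull the same fields out of exported log entries, the lookup can be scripted. The following is a minimal Python sketch, assuming each entry has been loaded as a dictionary with the jsonPayload structure shown in the excerpt above; the function name summarize_scale_up and the output keys are hypothetical.

```python
from typing import Any


def summarize_scale_up(entry: dict[str, Any]) -> dict[str, Any]:
    """Extract the triggering pods and the resized node pools from one log entry."""
    scale_up = entry.get("jsonPayload", {}).get("decision", {}).get("scaleUp", {})
    pods = [
        f'{pod.get("namespace", "")}/{pod.get("name", "")}'
        for pod in scale_up.get("triggeringPods", [])
    ]
    pools = [
        {
            "nodepool": item.get("mig", {}).get("nodepool"),
            "mig": item.get("mig", {}).get("name"),
            "requested_nodes": item.get("requestedNodes"),
        }
        for item in scale_up.get("increasedMigs", [])
    ]
    return {
        "triggering_pods": pods,
        # Fall back to the list length if the total count field is absent.
        "triggering_pods_total": scale_up.get("triggeringPodsTotalCount", len(pods)),
        "node_pools": pools,
    }


if __name__ == "__main__":
    # Sample entry mirroring the excerpt above, with the brackets closed.
    sample = {
        "jsonPayload": {
            "decision": {
                "scaleUp": {
                    "triggeringPodsTotalCount": 1,
                    "triggeringPods": [{"namespace": "default", "name": "gpu-pod"}],
                    "increasedMigs": [
                        {
                            "mig": {
                                "name": "gke-cluster-gpu-np-91ae67a2-grp",
                                "zone": "us-central1-a",
                                "nodepool": "gpu-np",
                            },
                            "requestedNodes": 1,
                        }
                    ],
                }
            }
        }
    }
    print(summarize_scale_up(sample))
```

Entries that are not scale-up decisions yield empty lists, so the function can be applied to every entry the filter returns.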

For a scale-down event, jsonPayload lists under nodesToBeRemoved the name of each node being removed, together with its MIG and node pool.

Example of a scale-down event (excerpt)

  "jsonPayload": {
    "decision": {
      "scaleDown": {
        "nodesToBeRemoved": [
          {
            "node": {
              "name": "gke-cluster-gpu-np-91ae67a2-2nrt",
              "mig": {
                "zone": "us-central1-a",
                "nodepool": "gpu-np",
                "name": "gke-cluster-gpu-np-91ae67a2-grp"
              }
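
The removed nodes can be extracted in the same way. The following is a minimal Python sketch, assuming a log entry loaded as a dictionary with the structure shown in the excerpt above; the function name removed_nodes is hypothetical.

```python
from typing import Any


def removed_nodes(entry: dict[str, Any]) -> list[tuple[str, str]]:
    """Return (node name, node pool) pairs from a scale-down decision entry."""
    scale_down = entry.get("jsonPayload", {}).get("decision", {}).get("scaleDown", {})
    result = []
    for item in scale_down.get("nodesToBeRemoved", []):
        node = item.get("node", {})
        result.append((node.get("name", ""), node.get("mig", {}).get("nodepool", "")))
    return result


if __name__ == "__main__":
    # Sample entry mirroring the excerpt above, with the brackets closed.
    sample = {
        "jsonPayload": {
            "decision": {
                "scaleDown": {
                    "nodesToBeRemoved": [
                        {
                            "node": {
                                "name": "gke-cluster-gpu-np-91ae67a2-2nrt",
                                "mig": {
                                    "zone": "us-central1-a",
                                    "nodepool": "gpu-np",
                                    "name": "gke-cluster-gpu-np-91ae67a2-grp",
                                },
                            }
                        }
                    ]
                }
            }
        }
    }
    for name, pool in removed_nodes(sample):
        print(f"{name} removed from node pool {pool}")
```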

The cluster-autoscaler-status ConfigMap of the cluster is available from the Object browser menu under Kubernetes Engine in the Google Cloud console, or with kubectl as shown below. From this ConfigMap you can identify the LastTransitionTime of each status and the number of nodes currently registered in each node pool. The Events section keeps historical events for 60 minutes before they are cleared.

user@cloudshell:~ (gke-project)$ kubectl describe configmaps cluster-autoscaler-status -n kube-system
Name:         cluster-autoscaler-status
Namespace:    kube-system
Labels:       <none>
Annotations:  cluster-autoscaler.kubernetes.io/last-updated: 2021-11-25 18:40:07.187647947 +0000 UTC

Data
====
status:
----
Cluster-autoscaler status at 2021-11-25 02:50:00.339233866 +0000 UTC:
Cluster-wide:
  Health:      Healthy (ready=5 unready=0 notStarted=0 longNotStarted=0 registered=5 longUnregistered=0)
               LastProbeTime:      2021-11-25 18:40:07.037933877 +0000 UTC m=+1774240.331236149
               LastTransitionTime: 2021-11-05 05:49:52.663234191 +0000 UTC m=+25.956536488
  ScaleUp:     NoActivity (ready=5 registered=5)
               LastProbeTime:      2021-11-25 18:40:07.037933877 +0000 UTC m=+1774240.331236149
               LastTransitionTime: 2021-11-25 02:30:16.422735958 +0000 UTC m=+1716049.716038243
  ScaleDown:   NoCandidates (candidates=0)
               LastProbeTime:      2021-11-25 18:40:07.037933877 +0000 UTC m=+1774240.331236149
               LastTransitionTime: 2021-11-25 16:27:52.826844322 +0000 UTC m=+1766306.120146611

NodeGroups:
  Name:        https://content.googleapis.com/compute/v1/projects/project/zones/us-central1-a/instanceGroups/gke-cluster-pool-8cpu-cd2383fa-grp
  Health:      Healthy (ready=5 unready=0 notStarted=0 longNotStarted=0 registered=5 longUnregistered=0 cloudProviderTarget=5 (minSize=3, maxSize=8))
               LastProbeTime:      2021-11-25 18:40:07.037933877 +0000 UTC m=+1774240.331236149
               LastTransitionTime: 2021-11-05 05:49:52.663234191 +0000 UTC m=+25.956536488
  ScaleUp:     NoActivity (ready=5 cloudProviderTarget=5)
               LastProbeTime:      2021-11-25 18:40:07.037933877 +0000 UTC m=+1774240.331236149
               LastTransitionTime: 2021-11-05 05:49:52.663234191 +0000 UTC m=+25.956536488
  ScaleDown:   NoCandidates (candidates=0)
               LastProbeTime:      2021-11-25 18:40:07.037933877 +0000 UTC m=+1774240.331236149
               LastTransitionTime: 2021-11-25 16:27:52.826844322 +0000 UTC m=+1766306.120146611

  Name:        https://content.googleapis.com/compute/v1/projects/gke-project/zones/us-central1-a/instanceGroups/gke-cluster-gpu-np-91ae67a2-grp
  Health:      Healthy (ready=0 unready=0 notStarted=0 longNotStarted=0 registered=0 longUnregistered=0 cloudProviderTarget=0 (minSize=0, maxSize=1))
               LastProbeTime:      0001-01-01 00:00:00 +0000 UTC
               LastTransitionTime: 0001-01-01 00:00:00 +0000 UTC
  ScaleUp:     NoActivity (ready=0 cloudProviderTarget=0)
               LastProbeTime:      0001-01-01 00:00:00 +0000 UTC
               LastTransitionTime: 2021-11-25 02:30:16.422735958 +0000 UTC m=+1716049.716038243
  ScaleDown:   NoCandidates (candidates=0)
               LastProbeTime:      2021-11-25 18:40:07.037933877 +0000 UTC m=+1774240.331236149
               LastTransitionTime: 2021-11-25 02:40:21.423099054 +0000 UTC m=+1716654.716401333

Events:
  Type    Reason          Age    From                Message
  ----    ------          ----   ----                -------
  Normal  ScaledUpGroup   21m    cluster-autoscaler  Scale-up: setting group https://content.googleapis.com/compute/v1/projects/gke-project/zones/us-central1-a/instanceGroups/gke-cluster-gpu-np-91ae67a2-grp size to 1
  Normal  ScaleDownEmpty  9m48s  cluster-autoscaler  Scale-down: removing empty node gke-cluster-gpu-np-91ae67a2-2nrt
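
The per-node-pool counts in the NodeGroups section can also be read programmatically. The following is a minimal Python sketch, assuming the plain-text status layout shown in the output above (other autoscaler versions may format the status differently); the function name node_group_sizes and the output keys are hypothetical.

```python
import re

NAME_RE = re.compile(r"^\s*Name:\s+(\S+)")
HEALTH_RE = re.compile(
    r"^\s*Health:.*\bregistered=(\d+).*\bcloudProviderTarget=(\d+) "
    r"\(minSize=(\d+), maxSize=(\d+)\)"
)


def node_group_sizes(status_text: str) -> dict[str, dict[str, int]]:
    """Map each MIG name to its registered, target, minimum, and maximum node counts."""
    groups: dict[str, dict[str, int]] = {}
    current = None
    in_node_groups = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped == "NodeGroups:":
            in_node_groups = True
            continue
        if stripped == "Events:":
            break  # the node group section ends where the events begin
        if not in_node_groups:
            continue
        name = NAME_RE.match(line)
        if name:
            # Keep only the last URL segment, which is the MIG name.
            current = name.group(1).rsplit("/", 1)[-1]
            continue
        health = HEALTH_RE.match(line)
        if health and current:
            registered, target, min_size, max_size = map(int, health.groups())
            groups[current] = {
                "registered": registered,
                "target": target,
                "min_size": min_size,
                "max_size": max_size,
            }
    return groups


if __name__ == "__main__":
    # Two lines copied from the sample output above.
    sample = (
        "NodeGroups:\n"
        "  Name:        https://content.googleapis.com/compute/v1/projects/gke-project/"
        "zones/us-central1-a/instanceGroups/gke-cluster-gpu-np-91ae67a2-grp\n"
        "  Health:      Healthy (ready=0 unready=0 notStarted=0 longNotStarted=0 "
        "registered=0 longUnregistered=0 cloudProviderTarget=0 (minSize=0, maxSize=1))\n"
    )
    print(node_group_sizes(sample))
```

In practice the input would be the saved output of the kubectl describe command shown above.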

Cause

The cluster autoscaler checks every 10 seconds for pods that are pending because they cannot be scheduled. If it finds any, it checks whether adding a new node would allow the pod to be scheduled. If so, it resizes the node pool to add a node, and the scheduler then places the pod on the newly provisioned node.
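
The decision described above can be pictured with a toy model. The following Python sketch is a deliberate simplification and not the autoscaler's actual algorithm: it considers CPU requests only, adds one node per pending pod, and picks the first node pool that fits. The names NodePool and plan_scale_up and all sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodePool:
    name: str
    node_cpu: float  # CPU available on one new node (simplifying assumption)
    size: int        # current number of nodes
    max_size: int    # autoscaling upper bound


def plan_scale_up(pending_cpu: dict[str, float], pools: list[NodePool]) -> dict[str, str]:
    """Map each pending pod to the node pool that would grow by one node for it."""
    decisions: dict[str, str] = {}
    for pod, cpu in pending_cpu.items():
        for pool in pools:
            has_room = pool.size < pool.max_size
            pod_fits = cpu <= pool.node_cpu
            if has_room and pod_fits:
                pool.size += 1  # stands in for resizing the node pool
                decisions[pod] = pool.name
                break
        # A pod that fits no pool stays pending and triggers no scale-up.
    return decisions


if __name__ == "__main__":
    pools = [NodePool("small", 4.0, 3, 3), NodePool("large", 8.0, 1, 2)]
    print(plan_scale_up({"pod-a": 2.0, "pod-b": 16.0}, pools))
```

In this model, pod-a triggers a scale-up of the pool named large because small is already at its maximum size, while pod-b triggers nothing because no node would be big enough for it.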

You do not need to monitor these events continuously, but if you suspect a problem with autoscaling, reviewing the autoscaler events and status is a useful first step.