Problem
Environment
- Google Kubernetes Engine
- Cluster Autoscaler
Solution
You can review a cluster's autoscaling events from within the project in the following ways:
- Cluster-autoscaler-visibility logs from Cloud Logging provide status and decision events.
Use the following Cloud Logging advanced filter, replacing PROJECT_NAME and CLUSTER_NAME with your project ID and cluster name:
logName="projects/PROJECT_NAME/logs/container.googleapis.com%2Fcluster-autoscaler-visibility" resource.labels.cluster_name=CLUSTER_NAME -"noDecisionStatus" -"resultInfo" -"status"
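If you generate this filter from a script, a small helper keeps the URL-encoded log name free of typos. The following is a minimal sketch: the function name `build_visibility_filter` is hypothetical, it only assembles the filter string shown above, and it does not call Cloud Logging. Unlike the filter above, it wraps the cluster name in double quotes, which is an assumption made here so that names containing hyphens are passed as a single value.

```python
def build_visibility_filter(project: str, cluster: str) -> str:
    """Return the decision-event filter from this article as one string.

    Hypothetical helper: it formats text only and makes no API calls.
    """
    log_name = (
        f"projects/{project}/logs/"
        "container.googleapis.com%2Fcluster-autoscaler-visibility"
    )
    clauses = [
        f'logName="{log_name}"',
        # Quoting the cluster name is an assumption added in this sketch.
        f'resource.labels.cluster_name="{cluster}"',
        # Exclude status and no-decision entries so only decisions remain.
        '-"noDecisionStatus"',
        '-"resultInfo"',
        '-"status"',
    ]
    return " ".join(clauses)


if __name__ == "__main__":
    print(build_visibility_filter("gke-project", "cluster"))
```

The returned string can be pasted into the Logs Explorer query box.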
In the decision event logs, the jsonPayload field shows whether the event is a scale-up or a scale-down. For a scale-up, the scaleUp block shows which Managed Instance Group (MIG) and node pool the operation targeted, how many pods triggered it, and how many nodes were requested.
Example of Scaleup Event
"jsonPayload": {
  "decision": {
    "scaleUp": {
      "triggeringPodsTotalCount": 1,
      "triggeringPods": [
        {
          "namespace": "default",
          "name": "gpu-pod"
        }
      ],
      "increasedMigs": [
        {
          "mig": {
            "name": "gke-cluster-gpu-np-91ae67a2-grp",
            "zone": "us-central1-a",
            "nodepool": "gpu-np"
          },
          "requestedNodes": 1
        }
        ...
For a scale-down, the jsonPayload identifies the name of each node being removed and the node pool it belongs to.
Example of Scaledown Event
"jsonPayload": {
  "decision": {
    "scaleDown": {
      "nodesToBeRemoved": [
        {
          "node": {
            "name": "gke-cluster-gpu-np-91ae67a2-2nrt",
            "mig": {
              "zone": "us-central1-a",
              "nodepool": "gpu-np",
              "name": "gke-cluster-gpu-np-91ae67a2-grp"
            }
            ...
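When you export many decision events, it can help to reduce each jsonPayload to one line. The sketch below assumes the payload has already been parsed into a Python dict and uses only the fields visible in the two examples above; `summarize_decision` is a hypothetical name, and real payloads may carry additional fields that this sketch ignores.

```python
def summarize_decision(json_payload: dict) -> str:
    """Describe a cluster-autoscaler decision event in one line.

    Only the fields shown in this article's examples are read.
    """
    decision = json_payload.get("decision", {})

    if "scaleUp" in decision:
        scale_up = decision["scaleUp"]
        # One entry per MIG whose size the autoscaler asked to increase.
        requests = [
            f'{item["requestedNodes"]} node(s) in {item["mig"]["nodepool"]} '
            f'({item["mig"]["name"]})'
            for item in scale_up.get("increasedMigs", [])
        ]
        pods = scale_up.get("triggeringPodsTotalCount", 0)
        return f"scaleUp: {pods} triggering pod(s); requested " + ", ".join(requests)

    if "scaleDown" in decision:
        nodes = [
            entry["node"]["name"]
            for entry in decision["scaleDown"].get("nodesToBeRemoved", [])
        ]
        return "scaleDown: removing " + ", ".join(nodes)

    return "no decision"


if __name__ == "__main__":
    example = {
        "decision": {
            "scaleDown": {
                "nodesToBeRemoved": [
                    {"node": {"name": "gke-cluster-gpu-np-91ae67a2-2nrt"}}
                ]
            }
        }
    }
    print(summarize_decision(example))
```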
- The cluster-autoscaler-status ConfigMap in the kube-system namespace shows the current autoscaler status and recent scaling events.
user@cloudshell:~ (gke-project)$ kubectl describe configmaps cluster-autoscaler-status -n kube-system
Name:         cluster-autoscaler-status
Namespace:    kube-system
Labels:       <none>
Annotations:  cluster-autoscaler.kubernetes.io/last-updated: 2021-11-25 18:40:07.187647947 +0000 UTC

Data
====
status:
----
Cluster-autoscaler status at 2021-11-25 02:50:00.339233866 +0000 UTC:
Cluster-wide:
  Health:      Healthy (ready=5 unready=0 notStarted=0 longNotStarted=0 registered=5 longUnregistered=0)
               LastProbeTime:      2021-11-25 18:40:07.037933877 +0000 UTC m=+1774240.331236149
               LastTransitionTime: 2021-11-05 05:49:52.663234191 +0000 UTC m=+25.956536488
  ScaleUp:     NoActivity (ready=5 registered=5)
               LastProbeTime:      2021-11-25 18:40:07.037933877 +0000 UTC m=+1774240.331236149
               LastTransitionTime: 2021-11-25 02:30:16.422735958 +0000 UTC m=+1716049.716038243
  ScaleDown:   NoCandidates (candidates=0)
               LastProbeTime:      2021-11-25 18:40:07.037933877 +0000 UTC m=+1774240.331236149
               LastTransitionTime: 2021-11-25 16:27:52.826844322 +0000 UTC m=+1766306.120146611

NodeGroups:
  Name:        https://content.googleapis.com/compute/v1/projects/project/zones/us-central1-a/instanceGroups/gke-cluster-pool-8cpu-cd2383fa-grp
  Health:      Healthy (ready=5 unready=0 notStarted=0 longNotStarted=0 registered=5 longUnregistered=0 cloudProviderTarget=5 (minSize=3, maxSize=8))
               LastProbeTime:      2021-11-25 18:40:07.037933877 +0000 UTC m=+1774240.331236149
               LastTransitionTime: 2021-11-05 05:49:52.663234191 +0000 UTC m=+25.956536488
  ScaleUp:     NoActivity (ready=5 cloudProviderTarget=5)
               LastProbeTime:      2021-11-25 18:40:07.037933877 +0000 UTC m=+1774240.331236149
               LastTransitionTime: 2021-11-05 05:49:52.663234191 +0000 UTC m=+25.956536488
  ScaleDown:   NoCandidates (candidates=0)
               LastProbeTime:      2021-11-25 18:40:07.037933877 +0000 UTC m=+1774240.331236149
               LastTransitionTime: 2021-11-25 16:27:52.826844322 +0000 UTC m=+1766306.120146611

  Name:        https://content.googleapis.com/compute/v1/projects/gke-project/zones/us-central1-a/instanceGroups/gke-cluster-gpu-np-91ae67a2-grp
  Health:      Healthy (ready=0 unready=0 notStarted=0 longNotStarted=0 registered=0 longUnregistered=0 cloudProviderTarget=0 (minSize=0, maxSize=1))
               LastProbeTime:      0001-01-01 00:00:00 +0000 UTC
               LastTransitionTime: 0001-01-01 00:00:00 +0000 UTC
  ScaleUp:     NoActivity (ready=0 cloudProviderTarget=0)
               LastProbeTime:      0001-01-01 00:00:00 +0000 UTC
               LastTransitionTime: 2021-11-25 02:30:16.422735958 +0000 UTC m=+1716049.716038243
  ScaleDown:   NoCandidates (candidates=0)
               LastProbeTime:      2021-11-25 18:40:07.037933877 +0000 UTC m=+1774240.331236149
               LastTransitionTime: 2021-11-25 02:40:21.423099054 +0000 UTC m=+1716654.716401333

Events:
  Type    Reason          Age    From                Message
  ----    ------          ----   ----                -------
  Normal  ScaledUpGroup   21m    cluster-autoscaler  Scale-up: setting group https://content.googleapis.com/compute/v1/projects/gke-project/zones/us-central1-a/instanceGroups/gke-cluster-gpu-np-91ae67a2-grp size to 1
  Normal  ScaleDownEmpty  9m48s  cluster-autoscaler  Scale-down: removing empty node gke-cluster-gpu-np-91ae67a2-2nrt
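To check the Health, ScaleUp, and ScaleDown states for each node group in a script, the status text can be scanned line by line. This is a rough sketch, assuming the plain-text status layout shown in this article with one field per line; other autoscaler versions may format the status differently. The function name `parse_autoscaler_status` and the abbreviated sample text are illustrative only.

```python
def parse_autoscaler_status(status_text: str) -> dict:
    """Map "cluster-wide" and each node group name to its state words.

    Assumes the plain-text status layout shown in this article.
    """
    result = {}
    current = None
    in_node_groups = False
    for raw_line in status_text.splitlines():
        line = raw_line.strip()
        if line == "Cluster-wide:":
            current = "cluster-wide"
            result[current] = {}
        elif line == "NodeGroups:":
            in_node_groups = True
            current = None
        elif in_node_groups and line.startswith("Name:"):
            # Keep only the last URL segment, i.e. the instance group name.
            current = line.split(":", 1)[1].strip().rsplit("/", 1)[-1]
            result[current] = {}
        elif current is not None:
            for key in ("Health", "ScaleUp", "ScaleDown"):
                if line.startswith(key + ":"):
                    words = line.split(":", 1)[1].split()
                    if words:
                        # First word after the colon, e.g. "Healthy".
                        result[current][key] = words[0]
    return result


# Abbreviated sample based on the output above (timestamps omitted).
SAMPLE_STATUS = """\
Cluster-wide:
  Health:      Healthy (ready=5 unready=0 notStarted=0 longNotStarted=0 registered=5 longUnregistered=0)
  ScaleUp:     NoActivity (ready=5 registered=5)
  ScaleDown:   NoCandidates (candidates=0)
NodeGroups:
  Name:        https://content.googleapis.com/compute/v1/projects/gke-project/zones/us-central1-a/instanceGroups/gke-cluster-gpu-np-91ae67a2-grp
  Health:      Healthy (ready=0 unready=0 notStarted=0 longNotStarted=0 registered=0 longUnregistered=0 cloudProviderTarget=0 (minSize=0, maxSize=1))
  ScaleUp:     NoActivity (ready=0 cloudProviderTarget=0)
  ScaleDown:   NoCandidates (candidates=0)
"""

if __name__ == "__main__":
    for name, states in parse_autoscaler_status(SAMPLE_STATUS).items():
        print(name, states)
```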
Cause
The cluster autoscaler checks every 10 seconds for pods that are pending because they cannot be scheduled. If it finds any, it checks whether adding a new node would allow those pods to be scheduled. If so, it resizes the node pool to add a node, and the scheduler then places the pod on the newly provisioned node.
You do not need to monitor these events routinely, but when autoscaling does not behave as expected, it is useful to review the autoscaler events and status.
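The decision described above can be illustrated with a toy model. This is not the autoscaler's real algorithm, which simulates the scheduler against all scheduling constraints; the sketch considers CPU requests only, and the `NodePool` type and `plan_scale_up` function are hypothetical names introduced for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class NodePool:
    name: str
    node_cpu_millis: int  # allocatable CPU on one node (assumed uniform)
    size: int             # current number of nodes
    max_size: int         # autoscaling upper limit


def plan_scale_up(pending_cpu_millis: list[int], pool: NodePool) -> int:
    """Return the node pool size needed for the pending pods (toy model)."""
    target = pool.size
    free_on_newest = 0  # unused CPU on the most recently added simulated node
    for request in pending_cpu_millis:
        if request > pool.node_cpu_millis:
            continue  # a new node of this size would not help this pod
        if request <= free_on_newest:
            free_on_newest -= request  # fits on a node already planned
            continue
        if target >= pool.max_size:
            continue  # the pool cannot grow any further
        target += 1
        free_on_newest = pool.node_cpu_millis - request
    return target


if __name__ == "__main__":
    pool = NodePool("gpu-np", node_cpu_millis=4000, size=0, max_size=1)
    print(plan_scale_up([1000], pool))
```

A returned value equal to the current size means the model found no useful scale-up, either because the pods would not fit on a new node or because the pool is already at its maximum.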