Unable to schedule workloads after cluster update

Problem

You have multiple workloads impacted in production and the cluster is not able to autoscale, impacting end user's ability to use your service. You are having issues with workloads and believe this might be related to a CLUSTER_UPDATE operation that took place recently.

Furthermore, you have also noticed metrics-server is down shortly after that and this is affecting Horizontal Pod Autoscaling (HPA) as well and kubectl commands timing out.

Environment

  • Google Kubernetes Engine Version:​ 1.16.13-gke.404 and Node Version:​ 1.15.12-gke.20 (Version information for context only, as it was not relevant)
  • New Relic Mutating Webhook is configured on the cluster with a scope of all namespaces

Solution

The issue was mitigated by deleting the New Relic Mutating Webhook.

Cause

Mutating Webhooks work by intercepting API requests and changing them. If you are experiencing similar symptoms, and you have webhooks configured on your cluster, it would be important to confirm their configuration to make sure they are not preventing valid requests from reaching the Kubernetes API.

In this case, the webhook was deployed with a scope of all namespaces, which means it was impacting every single pod scheduling on the cluster due to a misconfiguration.