Troubleshoot upgrades


This page shows you how to resolve issues with Google Kubernetes Engine (GKE) cluster upgrades.

If you need additional assistance, reach out to Cloud Customer Care.

kube-apiserver unhealthy after control plane upgrade

This issue occurs when you manually upgrade your cluster's control plane to a new GKE version. Some user-deployed admission webhooks can block system components from creating the permissive RBAC roles that they need to function correctly. During a control plane upgrade, Google Cloud re-creates the Kubernetes API server (kube-apiserver) component. If a webhook blocks the creation or update of an RBAC role that the API server needs, the API server can't start and the cluster upgrade can't complete.

The error message in the gcloud CLI is similar to the following:

FAILED: All cluster resources were brought up, but: component "KubeApiserverReady" from endpoint "readyz of kube apiserver is not successful" is unhealthy.

To identify the failing webhook, check your GKE audit logs for RBAC calls that match the following fields:

protoPayload.resourceName="RBAC_RULE"
protoPayload.authenticationInfo.principalEmail="system:apiserver"

RBAC_RULE is the full name of an RBAC role, such as rbac.authorization.k8s.io/v1/clusterroles/system:controller:horizontal-pod-autoscaler.

The name of the failing webhook is displayed in the log with the following format:

admission webhook WEBHOOK_NAME denied the request
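
The lookup above can be sketched in Python as a minimal illustration, not an official tool. The helper names build_audit_filter and find_webhook_name are hypothetical. The regular expression assumes that the webhook name may or may not be wrapped in double quotes in the log entry.

```python
import re
from typing import Optional


def build_audit_filter(rbac_rule: str) -> str:
    """Return a log filter for API server RBAC calls on one role.

    rbac_rule is the full RBAC role name (the RBAC_RULE placeholder).
    """
    return "\n".join([
        f'protoPayload.resourceName="{rbac_rule}"',
        'protoPayload.authenticationInfo.principalEmail="system:apiserver"',
    ])


# Matches "admission webhook WEBHOOK_NAME denied the request".
# Assumption: the name contains no whitespace and may be quoted.
_WEBHOOK_RE = re.compile(r'admission webhook "?([^"\s]+)"? denied the request')


def find_webhook_name(message: str) -> Optional[str]:
    """Extract the webhook name from a denial message, or None if absent."""
    match = _WEBHOOK_RE.search(message)
    return match.group(1) if match else None


if __name__ == "__main__":
    role = ("rbac.authorization.k8s.io/v1/clusterroles/"
            "system:controller:horizontal-pod-autoscaler")
    print(build_audit_filter(role))
    print(find_webhook_name("admission webhook WEBHOOK_NAME denied the request"))
```

You can paste the generated filter into the Logs Explorer query box or pass it to gcloud logging read.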

To resolve this issue, try the following:

  • Adjust your constraints to allow creating and updating ClusterRoles that have the system: prefix.
  • Adjust your webhook to not intercept requests for creating and updating system RBAC roles.
  • Disable the webhook.
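
If you maintain the webhook's code, the second option might look like the following sketch. It assumes the handler receives the request portion of an admission.k8s.io/v1 AdmissionReview as a Python dictionary. The names is_system_rbac_request, review, and policy_allows are hypothetical and stand in for your own handler and policy logic.

```python
from typing import Callable

RBAC_GROUP = "rbac.authorization.k8s.io"
RBAC_KINDS = {"ClusterRole", "ClusterRoleBinding"}
SYSTEM_PREFIX = "system:"


def is_system_rbac_request(request: dict) -> bool:
    """Return True if the request creates or updates a system: RBAC object."""
    kind = request.get("kind") or {}
    if kind.get("group") != RBAC_GROUP or kind.get("kind") not in RBAC_KINDS:
        return False
    if request.get("operation") not in {"CREATE", "UPDATE"}:
        return False
    # Fall back to the embedded object if the top-level name is empty.
    metadata = (request.get("object") or {}).get("metadata") or {}
    name = request.get("name") or metadata.get("name") or ""
    return name.startswith(SYSTEM_PREFIX)


def review(request: dict, policy_allows: Callable[[dict], bool]) -> dict:
    """Build an AdmissionReview response that never denies system RBAC objects."""
    allowed = is_system_rbac_request(request) or bool(policy_allows(request))
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {"uid": request.get("uid"), "allowed": allowed},
    }
```

The check runs before the custom policy, so a strict policy can still apply to every other request. An alternative that needs no handler changes is to narrow the rules in the webhook configuration so that these requests never reach the webhook.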

Why does this happen?

Kubernetes auto-reconciles the default system RBAC roles with the default policies in the latest minor version. The default policies for system roles sometimes change in new Kubernetes versions.

To perform this reconciliation, GKE creates or updates the ClusterRoles and ClusterRoleBindings in the cluster. If you have a webhook that intercepts and rejects the create or update requests because of the scope of permissions that the default RBAC policies use, the API server can't function on the new minor version.

Workloads evicted after Standard cluster upgrade

Your workloads might be at risk of eviction after a cluster upgrade if all of the following are true:

  • The system workloads require more node resources when the cluster's control plane is running the new GKE version.
  • Your existing nodes do not have enough resources to run the new system workloads and your existing workloads.
  • Cluster autoscaler is disabled for the cluster.
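
The three conditions above can be combined into a rough capacity check. The following sketch is a simplification: it compares cluster-wide totals and ignores how Pods are packed onto individual nodes. The function name eviction_risk and the resource keys cpu_m and memory_mib are hypothetical.

```python
def eviction_risk(
    allocatable: dict,
    workload_requests: dict,
    system_requests_after_upgrade: dict,
    autoscaler_enabled: bool,
) -> bool:
    """Return True if existing workloads might be evicted after the upgrade.

    Each dict holds cluster-wide totals, for example
    {"cpu_m": 4000, "memory_mib": 16000}.
    """
    if autoscaler_enabled:
        # Cluster autoscaler can add nodes, so the conditions are not all met.
        return False
    # At risk if any resource is oversubscribed once the new system
    # workloads are added to the existing workloads.
    return any(
        workload_requests[res] + system_requests_after_upgrade[res] > allocatable[res]
        for res in allocatable
    )


if __name__ == "__main__":
    allocatable = {"cpu_m": 4000, "memory_mib": 16000}
    workloads = {"cpu_m": 3000, "memory_mib": 8000}
    system_new = {"cpu_m": 1200, "memory_mib": 2000}
    print(eviction_risk(allocatable, workloads, system_new, autoscaler_enabled=False))
```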

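To resolve this issue, try one of the following:

  • Enable cluster autoscaler on your existing node pools.
  • Add capacity by scaling up an existing node pool or by creating a new node pool.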

Next steps

If you need additional assistance, reach out to Cloud Customer Care.