Troubleshoot upgrades

Autopilot Standard

This page shows you how to resolve issues with Google Kubernetes Engine (GKE) cluster upgrades.

If you need additional assistance, reach out to Cloud Customer Care.

Issue: kube-apiserver unhealthy after control plane upgrade

This issue occurs when you start a manual upgrade of your cluster's control plane to a new GKE version. Some user-deployed admission webhooks can block system components from creating the permissive RBAC roles that those components need to function correctly. During a control plane upgrade, Google Cloud re-creates the Kubernetes API server (kube-apiserver) component. If a webhook blocks the RBAC role for the API server component, the API server won't start and the cluster upgrade won't complete.

Even if a webhook is working correctly, it can cause the cluster upgrade to fail because the webhook might be unreachable from the newly created control plane.

The error message in the gcloud CLI is similar to the following:

FAILED: All cluster resources were brought up, but: component "KubeApiserverReady" from endpoint "readyz of kube apiserver is not successful" is unhealthy.

To identify the failing webhook, check your GKE audit logs for RBAC calls with the following information:

protoPayload.resourceName="RBAC_RULE"
protoPayload.authenticationInfo.principalEmail="system:apiserver"

RBAC_RULE is the full name of an RBAC role, such as rbac.authorization.k8s.io/v1/clusterroles/system:controller:horizontal-pod-autoscaler.
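As an illustration, the two filter lines above can be assembled for a specific role with a small helper. This is a minimal sketch: the function name build_audit_log_filter is hypothetical, and it relies on Cloud Logging treating newline-separated comparisons as an implicit AND.

```python
def build_audit_log_filter(rbac_rule: str, principal: str = "system:apiserver") -> str:
    """Return a Cloud Logging filter for RBAC calls made by the API server.

    rbac_rule is the full name of an RBAC role, in the form shown above.
    Hypothetical helper: it only formats the two filter lines from this page.
    """
    lines = [
        f'protoPayload.resourceName="{rbac_rule}"',
        f'protoPayload.authenticationInfo.principalEmail="{principal}"',
    ]
    # Comparisons on separate lines are combined with an implicit AND.
    return "\n".join(lines)


if __name__ == "__main__":
    role = (
        "rbac.authorization.k8s.io/v1/clusterroles/"
        "system:controller:horizontal-pod-autoscaler"
    )
    print(build_audit_log_filter(role))
```

You can paste the printed filter into the Logs Explorer query field.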

The name of the failing webhook is displayed in the log with the following format:

admission webhook WEBHOOK_NAME denied the request
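If you export the matching log entries, you could pull the webhook name out of each message with a regular expression. The following is a sketch that assumes the message follows the format above, with the name optionally wrapped in double quotes. The function name failing_webhook and the sample message are illustrative only.

```python
import re
from typing import Optional

# Matches the denial format shown above. The webhook name may or may not
# be wrapped in double quotes (assumption for this sketch).
WEBHOOK_DENIAL = re.compile(r'admission webhook "?([^"\s]+)"? denied the request')


def failing_webhook(message: str) -> Optional[str]:
    """Return the webhook name from a denial message, or None if absent."""
    match = WEBHOOK_DENIAL.search(message)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical log message, for illustration only.
    sample = 'admission webhook "policy.example.com" denied the request: not allowed'
    print(failing_webhook(sample))
```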

To resolve this issue, try the following:

Adjust your constraints to allow creating and updating ClusterRoles that have the system: prefix.
Adjust your webhook to not intercept requests for creating and updating system RBAC roles.
Disable the webhook.
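If you maintain the webhook yourself, the second option could look like the following check, run before any policy evaluation. This is a sketch, not a definitive implementation: it assumes the handler receives the request object of a standard AdmissionReview as a dictionary, and the function name is_system_rbac_request is hypothetical.

```python
RBAC_GROUP = "rbac.authorization.k8s.io"
RBAC_RESOURCES = {"clusterroles", "clusterrolebindings"}
WRITE_OPERATIONS = {"CREATE", "UPDATE"}


def is_system_rbac_request(request: dict) -> bool:
    """Return True for create or update requests on system: RBAC objects.

    request is the "request" field of an AdmissionReview, parsed from JSON.
    A webhook handler could allow such requests without further checks.
    """
    resource = request.get("resource", {})
    return (
        resource.get("group") == RBAC_GROUP
        and resource.get("resource") in RBAC_RESOURCES
        and request.get("operation") in WRITE_OPERATIONS
        and request.get("name", "").startswith("system:")
    )


if __name__ == "__main__":
    example = {
        "operation": "UPDATE",
        "name": "system:controller:horizontal-pod-autoscaler",
        "resource": {
            "group": RBAC_GROUP,
            "version": "v1",
            "resource": "clusterroles",
        },
    }
    print(is_system_rbac_request(example))
```

A stricter variant might also require that the requesting user is system:apiserver, which matches the principal used in the audit log filter earlier in this section.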

Why does this happen?

Kubernetes auto-reconciles{track-name="k8sLink" track-type="troubleshooting"} the default system RBAC roles with the default policies in the latest minor version. The default policies for system roles sometimes change in new Kubernetes versions.

To perform this reconciliation, GKE creates or updates the ClusterRoles and ClusterRoleBindings in the cluster. If you have a webhook that intercepts and rejects the create or update requests because of the scope of permissions that the default RBAC policies use, the API server can't function on the new minor version.

Issue: Workloads evicted after Standard cluster upgrade

Your workloads might be at risk of eviction after a cluster upgrade if the following are all true:

The system workloads require more space when the cluster's control plane is running the new GKE version.
Your existing nodes don't have enough resources to run the new system workloads and your existing workloads.
Cluster autoscaler is disabled for the cluster.
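The three conditions can be expressed as a rough check. The sketch below is a simplification under stated assumptions: it compares aggregate CPU in millicores only, ignores memory and per-node bin packing, and uses hypothetical parameter names and example numbers.

```python
def at_risk_of_eviction(
    allocatable_mcpu: int,
    workload_mcpu: int,
    old_system_mcpu: int,
    new_system_mcpu: int,
    autoscaler_enabled: bool,
) -> bool:
    """Return True when all three conditions from the list above hold.

    All CPU values are cluster-wide totals in millicores (assumption).
    """
    # Condition 1: system workloads need more space on the new version.
    needs_more_space = new_system_mcpu > old_system_mcpu
    # Condition 2: existing nodes can't fit system plus user workloads.
    insufficient = workload_mcpu + new_system_mcpu > allocatable_mcpu
    # Condition 3: cluster autoscaler is disabled.
    return needs_more_space and insufficient and not autoscaler_enabled


if __name__ == "__main__":
    # Illustrative numbers only.
    print(at_risk_of_eviction(8000, 7000, 800, 1200, autoscaler_enabled=False))
```

Because the check is aggregate, it can miss cases where total capacity is sufficient but no single node has room for a given Pod.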

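To resolve this issue, try the following steps:

Enable autoscaling for existing node pools.
Enable node auto-provisioning.
Create a new node pool.
Scale up an existing node pool.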

Issue: Node version not compatible with control plane version

Check what version of Kubernetes your cluster's control plane is running, and then check what version of Kubernetes your cluster's node pools are running. If any of the cluster's node pools are more than two minor versions older than the control plane, this might be causing issues with your cluster.

Periodically, the GKE team performs upgrades of the cluster control plane on your behalf. Control planes are upgraded to newer stable versions of Kubernetes. By default, a cluster's nodes have auto-upgrade enabled, and we recommend that you don't disable it.

If auto-upgrade is disabled for a cluster's nodes, and you don't manually upgrade your node pool version to a version that is compatible with the control plane, your control plane will eventually become incompatible with your nodes as the control plane is automatically upgraded over time. Incompatibility between your cluster's control plane and the nodes can cause unexpected issues.

The Kubernetes version and version skew support policy states that control planes are compatible with nodes up to two minor versions older than the control plane. For example, Kubernetes 1.19 control planes are compatible with Kubernetes 1.19, 1.18, and 1.17 nodes. To resolve this issue, manually upgrade the node pool version to a version that is compatible with the control plane.
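The skew rule described above can be sketched as a small check. This is an illustration that assumes version strings start with MAJOR.MINOR, as in 1.19.10-gke.1000. The function names and the example version strings are hypothetical.

```python
import re
from typing import Tuple

# Maximum number of minor versions a node can lag behind the control plane,
# as stated in the policy described above.
MAX_MINOR_SKEW = 2


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version such as "1.19.10-gke.1000"."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_node_compatible(control_plane: str, node: str) -> bool:
    """Return True if the node version is within the supported skew."""
    cp_major, cp_minor = major_minor(control_plane)
    node_major, node_minor = major_minor(node)
    if cp_major != node_major:
        return False
    # Nodes must not be newer than the control plane.
    return 0 <= cp_minor - node_minor <= MAX_MINOR_SKEW


if __name__ == "__main__":
    for node_version in ("1.19.9-gke.100", "1.17.17-gke.200", "1.16.15-gke.300"):
        print(node_version, is_node_compatible("1.19.10-gke.1000", node_version))
```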

If you are concerned about the upgrade process causing disruption to workloads running on the affected nodes, complete the following steps to migrate your workloads to a new node pool:

Create a new node pool with a compatible version.
Cordon the nodes of the existing node pool.
Optional: Update your workloads running on the existing node pool to add a nodeSelector for the label cloud.google.com/gke-nodepool:NEW_NODE_POOL_NAME, where NEW_NODE_POOL_NAME is the name of the new node pool. This ensures that GKE places those workloads on nodes in the new node pool.
Drain the existing node pool.
Check that the workloads are running successfully in the new node pool. If they are, you can delete the old node pool. If you notice workload disruptions, reschedule the workloads on the existing nodes by uncordoning the nodes in the existing node pool and draining the new nodes. Troubleshoot the issue and try again.
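For the optional nodeSelector step, one way to update a workload is to apply a patch to its Pod template. The following sketch only builds the patch body as JSON. It assumes the workload is a Deployment or another resource with a spec.template.spec field, and the function name node_selector_patch and the pool name are hypothetical.

```python
import json

# Node label that GKE sets to the name of the node pool, as referenced above.
NODE_POOL_LABEL = "cloud.google.com/gke-nodepool"


def node_selector_patch(new_node_pool_name: str) -> str:
    """Return a JSON patch body that pins Pods to the given node pool."""
    patch = {
        "spec": {
            "template": {
                "spec": {
                    "nodeSelector": {NODE_POOL_LABEL: new_node_pool_name}
                }
            }
        }
    }
    return json.dumps(patch)


if __name__ == "__main__":
    # "new-pool" is a placeholder for NEW_NODE_POOL_NAME.
    print(node_selector_patch("new-pool"))
```

You could pass the printed JSON to kubectl patch with the --patch flag for each workload that you want to move.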

Issue: Pods stuck in pending state after configuring Node Allocatable

After you configure Node Allocatable and perform a node version upgrade, you might notice that Pods that were running have changed to the Pending state.

If Pods are pending after an upgrade, we suggest the following:

Ensure that the CPU and memory requests for your Pods don't exceed their peak usage. Because GKE reserves CPU and memory for overhead, Pods can't request these reserved resources. Pods that request more CPU or memory than they use prevent other Pods from requesting these resources, and might leave the cluster underutilized. For more information, see How Pods with resource requests are scheduled.
Consider resizing your cluster. For instructions, see Resizing a cluster.
Revert this change by downgrading your cluster. For instructions, see Manually upgrading a cluster or node pool.

Configure your cluster to send Kubernetes scheduler metrics to Cloud Monitoring and view scheduler metrics.
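To act on the first suggestion, you could compare each Pod's CPU request with its observed peak usage. The sketch below assumes that you have already collected both values as Kubernetes CPU quantities, such as 500m or 1.5. The function names and the sample data are hypothetical.

```python
from typing import Dict


def parse_cpu(quantity: str) -> int:
    """Convert a CPU quantity such as "500m" or "1.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def over_requested(
    requests: Dict[str, str], peak_usage: Dict[str, str]
) -> Dict[str, int]:
    """Return {pod: excess millicores} for Pods that request more than peak."""
    excess = {}
    for pod, requested in requests.items():
        diff = parse_cpu(requested) - parse_cpu(peak_usage[pod])
        if diff > 0:
            excess[pod] = diff
    return excess


if __name__ == "__main__":
    # Illustrative values only.
    requests = {"frontend": "500m", "backend": "1"}
    peaks = {"frontend": "200m", "backend": "1.5"}
    print(over_requested(requests, peaks))
```

Pods that appear in the result are candidates for lower requests, which can free allocatable capacity for Pods that are pending.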

What's next

If you need additional assistance, reach out to Cloud Customer Care.