Understand the impact of failures in GKE on Bare Metal

GKE on Bare Metal is designed to limit the scope of failures and to prioritize functionality that's critical to business continuity. This document explains how the functionality of your clusters is impacted when there's a failure. This information can help you prioritize areas to troubleshoot if you have a problem.

The core functionality of GKE on Bare Metal includes the following categories:

  • Run workloads: Existing workloads can continue to run. This is the most important consideration to maintain business continuity. Even if your cluster has a problem, the existing workloads might continue to run without interruption.
  • Manage workloads: You can create, update, and delete workloads. This is the second most important consideration, because it lets you scale workloads when traffic increases, even if the cluster has a problem.
  • Manage user clusters: You can manage nodes, and you can update, upgrade, and delete user clusters. This is less important than the application lifecycle considerations. If there's available capacity on the existing nodes, the inability to modify user clusters doesn't affect user workloads.
  • Manage admin clusters: You can update and upgrade the admin cluster.
    • For deployments that use separate admin and user clusters, this is the least important consideration because the admin cluster doesn't host any user workloads. If your admin cluster has a problem, your application workloads on other clusters continue to run without interruption.
    • If you use other deployment models, such as hybrid or standalone, the admin cluster runs application workloads. If the admin cluster has a problem and the control plane is down, you also can't manage application workloads or user cluster components.

The following sections use these categories of core functionality to describe the impact of specific types of failure scenarios. When a failure scenario causes disruption, the approximate duration (order of magnitude) of the disruption is also noted, where possible.

Node failures

A node in GKE on Bare Metal might stop functioning or become unreachable on the network. Depending on the node pool and cluster that the failed machine is part of, there are several different failure modes.

Control plane node

The following list outlines the behavior, by category of core functionality, for nodes that are part of the control plane in GKE on Bare Metal:

  • Run workloads
    • Disruption (duration): No disruption
  • Manage workloads
    • Disruption (duration): Possible disruption (unknown)
    • Explanation: If the node failure affects the single control plane node in a non-highly available (HA) user cluster, or at least half of the control plane nodes in an HA user cluster, there is disruption. The control plane quorum of the user cluster is lost.
    • Recovery: For more information, see how to recover from quorum loss.
    • Prevention: Deploy user clusters in HA mode to minimize the possibility of disruption.
  • Manage user clusters
    • Disruption (duration): Possible disruption (unknown)
    • Explanation: If the node failure affects the single control plane node in a non-HA admin cluster, or at least half of the control plane nodes in an HA admin cluster, there is disruption. The control plane quorum of the admin cluster is lost.
    • Recovery: For more information, see how to recover from quorum loss.
    • Prevention: Deploy admin clusters in HA mode to minimize the possibility of disruption.
  • Manage admin clusters
    • Disruption (duration): Possible disruption (unknown)
    • Explanation: If the node failure affects the single control plane node in a non-HA admin cluster, or at least half of the control plane nodes in an HA admin cluster, there is disruption. The control plane quorum of the admin cluster is lost.
    • Recovery: For more information, see how to recover from quorum loss.
    • Prevention: Deploy admin clusters in HA mode to minimize the possibility of disruption.
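The quorum condition described for control plane nodes comes down to simple arithmetic. The following Python sketch is illustrative only: the function name `control_plane_has_quorum` is hypothetical and isn't part of any GKE on Bare Metal tooling, and it assumes the usual majority rule, in which strictly more than half of the control plane nodes must stay healthy.

```python
def control_plane_has_quorum(total_nodes: int, failed_nodes: int) -> bool:
    """Return True if a majority of control plane nodes is still healthy.

    Hypothetical helper for illustration; assumes a simple majority quorum.
    """
    if total_nodes < 1 or not 0 <= failed_nodes <= total_nodes:
        raise ValueError("expected total_nodes >= 1 and 0 <= failed_nodes <= total_nodes")
    healthy_nodes = total_nodes - failed_nodes
    # Quorum needs strictly more than half of the members, so losing
    # at least half of the nodes means that quorum is lost.
    return healthy_nodes > total_nodes / 2


if __name__ == "__main__":
    for total, failed in [(1, 1), (3, 1), (3, 2), (5, 2)]:
        state = "kept" if control_plane_has_quorum(total, failed) else "lost"
        print(f"{total} control plane nodes, {failed} failed: quorum {state}")
```

Under this model, a three-node HA control plane tolerates one failed node, while a single-node control plane tolerates none.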

Load balancer node

This guidance applies only to bundled load balancers in layer 2 mode. For manual load balancing, consult the failure modes of your external load balancers. The following list outlines the behavior, by category of core functionality, for nodes that host the load balancers in GKE on Bare Metal:

  • Run workloads
    • Disruption (duration): Possible disruption (varies)
    • Explanation: If external workloads rely on the data plane load balancer to communicate with workloads in the cluster and you only have one load balancer node, there is disruption.
    • Recovery: If there are multiple load balancer nodes, MetalLB failover happens within a few seconds. If the load balancer node pool isn't HA, consider deploying additional load balancer nodes.
    • Prevention: To minimize the possibility of disruption, deploy load balancer node pools in HA mode.
  • Manage workloads
    • Disruption (duration): Possible disruption (varies)
    • Explanation: The control plane virtual IP address of the user cluster resides on one load balancer node. If the load balancer node pool of the user cluster isn't HA, there is disruption.
    • Recovery: If the load balancer node pool is HA, failover is automatic and takes on the order of seconds. If it isn't HA, consider deploying additional load balancer nodes.
    • Prevention: To minimize the possibility of disruption, deploy load balancer node pools in HA mode.
  • Manage user clusters
    • Disruption (duration): Possible disruption (varies)
    • Explanation: The control plane virtual IP address of the admin cluster resides on one load balancer node. If the load balancer node pool of the admin cluster isn't HA, there is disruption.
    • Recovery: If the load balancer node pool is HA, failover is automatic and takes on the order of seconds. If it isn't HA, consider deploying additional load balancer nodes.
    • Prevention: To minimize the possibility of disruption, deploy load balancer node pools in HA mode.
  • Manage admin clusters
    • Disruption (duration): Possible disruption (varies)
    • Explanation: The control plane virtual IP address of the admin cluster resides on one load balancer node. If the load balancer node pool of the admin cluster isn't HA, there is disruption.
    • Recovery: If the load balancer node pool is HA, failover is automatic and takes on the order of seconds. If it isn't HA, consider deploying additional load balancer nodes.
    • Prevention: To minimize the possibility of disruption, deploy load balancer node pools in HA mode.

Worker node

The following list outlines the behavior, by category of core functionality, for worker nodes in GKE on Bare Metal:

  • Run workloads
    • Disruption (duration): Possible disruption (order of seconds)
    • Explanation:
      • The Pods that run on the failed node are disrupted, and are automatically rescheduled onto other healthy nodes with a default eviction timeout of 5 minutes.
      • If user applications have spare workload capacity and are spread across multiple nodes, the disruption isn't observable by clients that implement retries.
      • The Pods are automatically restarted on healthy nodes.
      • If the cluster doesn't have spare capacity, the disruption might last until new nodes are added to the cluster.
    • Recovery: If the cluster doesn't have spare capacity, you must deploy more nodes spread across multiple failure zones and move failed workloads to the new nodes.
    • Prevention:
      • Deploy nodes that spread across multiple failure zones.
      • Deploy workloads with multiple replicas spread across multiple failure zones to minimize the possibility of disruption.
  • Manage workloads
    • Disruption (duration): No disruption
  • Manage user clusters
    • Disruption (duration): No disruption
  • Manage admin clusters
    • Disruption (duration): No disruption
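Whether the Pods from a failed worker node can be rescheduled depends on the spare capacity of the remaining nodes. The following Python sketch illustrates that reasoning with a deliberately simplified model: it compares only aggregate CPU requests against aggregate allocatable CPU, and ignores memory, affinity rules, taints, and per-node fragmentation, all of which the real Kubernetes scheduler takes into account. The `Node` type, the function name, and the numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    allocatable_cpu: float  # CPU cores the node can offer to Pods
    requested_cpu: float    # CPU cores already requested by Pods on the node


def can_absorb_failure(nodes: List[Node], failed_name: str) -> bool:
    """Rough check: do the surviving nodes have enough free CPU in aggregate?"""
    failed = [n for n in nodes if n.name == failed_name]
    if not failed:
        raise ValueError(f"unknown node: {failed_name}")
    displaced_cpu = failed[0].requested_cpu
    spare_cpu = sum(
        n.allocatable_cpu - n.requested_cpu for n in nodes if n.name != failed_name
    )
    return spare_cpu >= displaced_cpu


if __name__ == "__main__":
    cluster = [Node("worker-1", 8, 2), Node("worker-2", 8, 4), Node("worker-3", 8, 7)]
    for node in cluster:
        print(node.name, "failure absorbable:", can_absorb_failure(cluster, node.name))
```

A result of `False` corresponds to the case in which the disruption might last until new nodes are added to the cluster.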

Storage failure

Storage in GKE on Bare Metal might stop functioning or become unreachable on the network. Depending on the storage that fails, there are several different failure modes.

etcd

The contents of etcd might become corrupted due to an ungraceful power down of the node or an underlying storage failure. The following list outlines the behavior of the core functionality when etcd fails:

  • Run workloads
    • Disruption (duration): No disruption
    • Explanation: If the existing workloads don't rely on the Kubernetes control plane, they continue to function without disruption.
  • Manage workloads
    • Disruption (duration): Possible disruption (unknown)
    • Explanation: If etcd fails in a user cluster that has a single control plane node, or on at least half of the control plane nodes in an HA user cluster, there is disruption. The control plane quorum of the user cluster is lost.
    • Recovery: For more information, see how to recover from quorum loss.
    • Prevention: To minimize the possibility of disruption, deploy user clusters in HA mode.
  • Manage user clusters
    • Disruption (duration): Possible disruption (unknown)
    • Explanation: If etcd fails in an admin cluster that has a single control plane node, or on at least half of the control plane nodes in an HA admin cluster, there is disruption. The control plane quorum of the admin cluster is lost.
    • Recovery: For more information, see how to recover from quorum loss.
    • Prevention: To minimize the possibility of disruption, deploy admin clusters in HA mode.
  • Manage admin clusters
    • Disruption (duration): Possible disruption (unknown)
    • Explanation: If etcd fails in an admin cluster that has a single control plane node, or on at least half of the control plane nodes in an HA admin cluster, there is disruption. The control plane quorum of the admin cluster is lost.
    • Recovery: For more information, see how to recover from quorum loss.
    • Prevention: To minimize the possibility of disruption, deploy admin clusters in HA mode.

User application PersistentVolume

The following list outlines the behavior of the core functionality when a PersistentVolume fails:

  • Run workloads
    • Disruption (duration): Possible disruption (unknown)
    • Explanation: The workloads that use the failed PersistentVolume are affected.
    • Prevention: To minimize the possibility of disruption, deploy the user workload in HA mode.
  • Manage workloads
    • Disruption (duration): No disruption
  • Manage user clusters
    • Disruption (duration): No disruption
  • Manage admin clusters
    • Disruption (duration): No disruption

Fluent Bit corrupted disk

The corruption of a Fluent Bit disk doesn't affect any core functionality, but it does impact the capability to collect and inspect logs on Google Cloud.

A SIGSEGV event can sometimes be observed in the logs of stackdriver-log-forwarder. This error might be caused by corrupted buffered logs on the disk.

Fluent Bit has a mechanism to filter out and drop the broken chunks. This feature is available in the Fluent Bit version (v1.8.3) used in GKE on Bare Metal.

Out of LoadBalancer IP

If all the IP addresses in the assigned pools are currently occupied, newly created LoadBalancer services can't acquire a LoadBalancer IP address. This scenario impacts the capability of the clients of the service to talk to the LoadBalancer services.

To recover from this IP address exhaustion, assign more IP addresses to the address pool by modifying the cluster custom resource.
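To anticipate exhaustion before it happens, you can compare the size of the address pools with the number of LoadBalancer services. The following Python sketch uses only the standard `ipaddress` module. It assumes that each pool entry is written either in CIDR form or as a `start-end` range. The function names and the sample addresses are hypothetical, and the count is an upper bound because it doesn't account for addresses that the load balancer might skip or for overlapping entries.

```python
import ipaddress
from typing import List


def count_pool_addresses(entries: List[str]) -> int:
    """Count addresses covered by CIDR ("192.0.2.0/28") or range
    ("192.0.2.100-192.0.2.109") entries. Overlaps are not deduplicated."""
    total = 0
    for entry in entries:
        if "-" in entry:
            start_text, end_text = entry.split("-", 1)
            start = ipaddress.ip_address(start_text.strip())
            end = ipaddress.ip_address(end_text.strip())
            if end < start:
                raise ValueError(f"range end precedes start: {entry}")
            total += int(end) - int(start) + 1
        else:
            network = ipaddress.ip_network(entry.strip(), strict=False)
            total += network.num_addresses
    return total


def remaining_addresses(entries: List[str], addresses_in_use: int) -> int:
    """Addresses still free, given how many LoadBalancer IPs are assigned."""
    return count_pool_addresses(entries) - addresses_in_use


if __name__ == "__main__":
    pool = ["192.0.2.0/28", "192.0.2.100-192.0.2.109"]
    print("pool size:", count_pool_addresses(pool))
    print("free after 20 services:", remaining_addresses(pool, 20))
```

When the remaining count approaches zero, that is the signal to extend the address pool before new LoadBalancer services are created.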

Certificate expiration

The certificates used in cluster operation might expire if the cluster hasn't been upgraded for a year and no on-demand rotation has been performed. The following list outlines the impact on the core functionality:

  • Run workloads
    • Disruption (duration): No disruption
    • Explanation: If the user workloads don't communicate with the Kubernetes control plane components, there is no disruption.
  • Manage workloads
    • Disruption (duration): Possible disruption (unknown)
    • Explanation: If the certificate authorities for user clusters expire, there is disruption.
    • Recovery: Follow the steps to manually trigger a certificate rotation on the user cluster. There is disruption during CA rotation.
  • Manage user clusters
    • Disruption (duration): Possible disruption (unknown)
    • Explanation: If the certificate authorities for admin clusters expire, there is disruption.
    • Recovery: Follow the steps to manually trigger a certificate rotation on the admin cluster. There is disruption during CA rotation.
  • Manage admin clusters
    • Disruption (duration): Possible disruption (unknown)
    • Explanation: If the certificate authorities for user clusters expire, there is disruption.
    • Recovery: Follow the steps to manually trigger a certificate rotation on the admin cluster. There is disruption during CA rotation.

Prevention: Set up monitors for certificate expiration. An example metric, "kubelet_certificate_manager_server_expiration_seconds", can be found on the list of metrics.
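A monitor for certificate expiration typically compares an expiry timestamp with the current time and warns ahead of the deadline. The following Python sketch shows that comparison. It assumes that the metric value is the expiry time expressed in seconds since the Unix epoch, which you should verify against the metric's documentation. The function names and the 30-day warning window are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def time_until_expiry(expiration_epoch_seconds: float,
                      now: Optional[datetime] = None) -> timedelta:
    """Return the time left before a certificate expires.

    Assumes the input is the expiry time in seconds since the Unix epoch.
    """
    now = now or datetime.now(timezone.utc)
    expires_at = datetime.fromtimestamp(expiration_epoch_seconds, tz=timezone.utc)
    return expires_at - now


def needs_rotation(expiration_epoch_seconds: float,
                   warn_before: timedelta = timedelta(days=30),
                   now: Optional[datetime] = None) -> bool:
    """True when the certificate expires within the warning window or has expired."""
    return time_until_expiry(expiration_epoch_seconds, now) <= warn_before


if __name__ == "__main__":
    reference = datetime(2024, 1, 1, tzinfo=timezone.utc)
    soon = (reference + timedelta(days=10)).timestamp()
    later = (reference + timedelta(days=90)).timestamp()
    print("expires in 10 days, rotate:", needs_rotation(soon, now=reference))
    print("expires in 90 days, rotate:", needs_rotation(later, now=reference))
```

Passing `now` explicitly keeps the check deterministic in tests. In a real monitor, the timestamp would come from your monitoring system rather than from a hard-coded value.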

Upgrade failures

The following list outlines the impact of upgrade failures on the core functionality:

  • Run workloads
    • Disruption (duration): No disruption
    • Explanation:
      • If the upgrade fails on the user cluster control plane, there is no disruption to existing workloads.
      • If the upgrade fails on a particular worker node, the workloads on that node are drained and moved to other healthy nodes, provided that the healthy nodes have extra capacity.
  • Manage workloads
    • Disruption (duration): No disruption
    • Explanation: The upgrade stops if any one of the control plane nodes fails to upgrade. If the user cluster is HA, the cluster remains functional when the upgrade fails.
  • Manage user clusters
    • Disruption (duration): Possible disruption (unknown)
    • Explanation: If the upgrade fails on the admin cluster control plane, there is disruption until the upgrade finishes.
    • Recovery: The upgrade is retryable. For more information, see how to diagnose upgrade issues and resume.
    • Prevention: For more information, see how to create a backup before upgrade.
  • Manage admin clusters
    • Disruption (duration): Possible disruption (unknown)
    • Explanation: If the upgrade fails on the admin cluster control plane, there is disruption until the upgrade finishes.
    • Recovery: The upgrade is retryable. For more information, see how to diagnose upgrade issues and resume.
    • Prevention: For more information, see how to create a backup before upgrade.

What's next

For more information on known product issues and workarounds, see GKE on Bare Metal known issues.