# Understand the impact of failures in Google Distributed Cloud

Last updated: 2025-07-22 (UTC).

Google Distributed Cloud is designed to limit the scope of failures and to prioritize
functionality that's critical to business continuity. This document explains how
the functionality of your clusters is impacted when there's a failure. This
information can help you prioritize areas to troubleshoot if you have a problem.

The core functionality of Google Distributed Cloud includes the following categories:

- **Run workloads**: Existing workloads can continue to run. This is the most important consideration for maintaining business continuity. Even if your cluster has a problem, the existing workloads might continue to run without interruption.
- **Manage workloads**: You can create, update, and delete workloads. This is the second most important consideration: you can scale workloads when traffic increases, even if the cluster has a problem.
- **Manage user clusters**: You can manage nodes, and update, upgrade, and delete user clusters. This is less important than the application lifecycle considerations. If there's available capacity on the existing nodes, the inability to modify user clusters doesn't affect user workloads.
- **Manage admin clusters**: You can update and upgrade the admin cluster.
  - For deployments that use separate admin and user clusters, this is the least important consideration because the admin cluster doesn't host any user workloads.
If your admin cluster has a problem, your application workloads on other clusters continue to run without interruption.
  - If you use other deployment models, such as hybrid or standalone, the admin cluster runs application workloads. If the admin cluster has a problem and the control plane is down, you also can't manage application workloads or user cluster components.

The following sections use these categories of core functionality to describe
the impact of specific types of failure scenarios. When a failure scenario
causes disruption, the order of magnitude of the disruption's duration is also
noted, where possible.

| **Tip:** When assessing the impact of a failure, use [Gemini Cloud Assist](/gemini/docs/cloud-assist/overview) ([Preview](/products#product-launch-stages)) to help you interpret complex behavior patterns and identify potential cascading effects. For example, you could ask: "How would a Google Distributed Cloud control plane node failure affect my running workloads, and what are the long-term consequences if it's not restored quickly?"

Node failures
-------------

A node in Google Distributed Cloud might stop functioning or become unreachable on the
network. Depending on the node pool and cluster that the failed machine is part
of, there are several different failure modes.

### Control plane node

The following table outlines the behavior for nodes that are part of the control
plane in Google Distributed Cloud:

| **Note:** The disruption duration depends on the configuration of the liveness probe for the user workloads. In this document, the liveness probe is assumed to have granularity configured in the order of seconds.

### Load balancer node

The following table outlines the behavior for nodes that host the load balancers
in Google Distributed Cloud. This guidance only applies to bundled load balancers with
[layer 2 mode](/kubernetes-engine/distributed-cloud/bare-metal/docs/installing/bundled-lb).
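For reference, bundled layer 2 load balancing is configured in the cluster custom resource. The following fragment is an illustrative sketch only; the VIPs, port, and address pool values are placeholders, not recommendations:

```yaml
# Illustrative fragment of a cluster custom resource that uses the
# bundled load balancer in layer 2 mode. All addresses are placeholders.
spec:
  loadBalancer:
    mode: bundled
    ports:
      controlPlaneLBPort: 443        # port that the control plane VIP listens on
    vips:
      controlPlaneVIP: 10.200.0.49   # VIP for the Kubernetes API server
      ingressVIP: 10.200.0.50        # VIP for the ingress Service
    addressPools:
    - name: pool1
      addresses:
      - 10.200.0.50-10.200.0.70      # range used for LoadBalancer Services
```

In this mode, the load balancer nodes themselves answer ARP requests for the VIPs, which is why a load balancer node failure affects reachability of these addresses.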
For manual load balancing,
consult the failure modes of your external load balancers:

### Worker node

The following table outlines the behavior for worker nodes in Google Distributed Cloud:

| **Note:** The disruption duration depends on the configuration of the liveness probe for the user workloads. In this document, the liveness probe is assumed to have granularity configured in the order of seconds.

Storage failure
---------------

Storage in Google Distributed Cloud might stop functioning or become unreachable on the
network. Depending on the storage that fails, there are several different
failure modes.

### etcd

The contents of the `/var/lib/etcd` and `/var/lib/etcd-events` directories might
become corrupted if there's an ungraceful power down of the node or an
underlying failure of storage. The following table outlines the behavior of the
core functionality due to `etcd` failures:

### User application `PersistentVolume`

The following table outlines the behavior of the core functionality due to the
failure of a `PersistentVolume`:

### Fluent Bit corrupted disk

The corruption of a Fluent Bit disk doesn't affect any core functionality, but
it does impact the capability to collect and inspect logs on Google Cloud.

A `SIGSEGV` event can sometimes be observed in the logs of
`stackdriver-log-forwarder`. This error might be caused by corrupted
buffered logs on the disk.

Fluent Bit has a mechanism to filter out and drop the broken chunks. This
feature is available in the Fluent Bit version (v1.8.3) used in
Google Distributed Cloud.

Out of `LoadBalancer` IP addresses
----------------------------------

If all the IP addresses in the assigned pools are currently occupied, newly
created `LoadBalancer` services can't acquire a `LoadBalancer` IP address.
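As a hypothetical illustration, the Service name and labels below are placeholders. A Service of this shape requests an address from the cluster's address pools; when the pools are exhausted, its external IP stays in the pending state until an address is freed or the pool is expanded:

```yaml
# Hypothetical Service of type LoadBalancer. If every address in the
# cluster's address pools is already in use, this Service's EXTERNAL-IP
# remains <pending> in `kubectl get services` output.
apiVersion: v1
kind: Service
metadata:
  name: my-app        # placeholder name
spec:
  type: LoadBalancer
  selector:
    app: my-app       # placeholder label
  ports:
  - port: 80
    targetPort: 8080
```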
This
scenario impacts the capability of the clients of the service to talk to the
`LoadBalancer` services.

To recover from this IP address exhaustion,
[assign more IP addresses to the address pool](/kubernetes-engine/distributed-cloud/bare-metal/docs/installing/bundled-lb#address-pools)
by modifying the cluster custom resource.

Certificate expiration
----------------------

Google Distributed Cloud generates a self-signed certificate authority (CA) during
the cluster installation process. The CA has a 10-year expiry and is responsible
for generating certificates, which expire after one year. Rotate certificates
regularly to prevent cluster downtime. You can rotate certificates by
[upgrading](/kubernetes-engine/distributed-cloud/bare-metal/docs/how-to/upgrade) your cluster, which is the recommended
method. If you are unable to upgrade your cluster, you can perform an
[on-demand CA rotation](/kubernetes-engine/distributed-cloud/bare-metal/docs/how-to/ca-rotation). For more information
about cluster certificates, see
[PKI certificates and requirements](https://kubernetes.io/docs/setup/best-practices/certificates/)
in the Kubernetes documentation.

If the cluster certificates have expired, they must be
[renewed manually](/kubernetes-engine/distributed-cloud/bare-metal/docs/troubleshooting/expired-certs).

Upgrade failures
----------------

What's next
-----------

For more information on known product issues and workarounds, see
[Google Distributed Cloud known issues](/kubernetes-engine/distributed-cloud/bare-metal/docs/troubleshooting/known-issues).

If you need additional assistance, reach out to
[Cloud Customer Care](/support-hub).

You can also see
[Getting support](/kubernetes-engine/distributed-cloud/bare-metal/docs/getting-support) for more information about support resources, including the following:

- [Requirements](/kubernetes-engine/distributed-cloud/bare-metal/docs/getting-support#intro-support) for opening a
support case.
- [Tools](/kubernetes-engine/distributed-cloud/bare-metal/docs/getting-support#support-tools) to help you troubleshoot, such as your environment configuration, logs, and metrics.
- Supported [components](/kubernetes-engine/distributed-cloud/bare-metal/docs/getting-support#what-we-support).