Reset a failed node in Google Distributed Cloud

Version 1.14. This version is no longer supported. For information about how to upgrade to version 1.15, see Upgrading Anthos on bare metal in the 1.15 documentation. For more information about supported and unsupported versions, see the Version history page in the latest documentation.
If nodes in Google Distributed Cloud fail, such as from storage, network, or OS
misconfiguration, you want to restore cluster health efficiently. After you
restore cluster health, you can troubleshoot the node failure.
This document shows you how to recover from node failure scenarios by resetting
a node and, if needed, forcefully removing it from the cluster.
If you want to add or remove nodes from a cluster under normal circumstances
when a node hasn't failed, see
Update clusters.
Overview
When a node fails, you sometimes can't run reset commands against it because
the node is unreachable. In that case, you might need to forcefully remove the
node from the cluster.

When you cleanly reset a node and update the cluster, the following actions
happen:

1. The node resets, similar to kubeadm reset, and the machine reverts to the
   pre-installed state.
2. The related references to the node are removed from the nodepool and
   cluster custom resources.

Note: In some of the following bmctl commands to reset nodes, the --force
parameter indicates whether the reset commands (step 1) should be skipped. If
you use the --force parameter, bmctl only performs the removal step (step 2)
and doesn't run the reset commands.
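Before you choose between a clean reset and a forceful removal, it can help to
confirm whether the failed node is actually reachable. The following checks
are a minimal sketch, not part of the official procedure; the IP address, SSH
user, and kubeconfig path are example values:

    # Check whether the node answers on the network.
    ping -c 3 10.200.0.8

    # Check whether you can run commands on the node over SSH,
    # which a clean reset depends on.
    ssh root@10.200.0.8 'echo reachable'

    # Check the node status as the cluster sees it.
    kubectl get nodes --kubeconfig bmctl-workspace/cluster1/cluster1-kubeconfig

If the node doesn't respond over SSH, the clean reset in the following
sections is likely to fail, and you'll probably need the forceful removal
path.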
Worker node

To remove a failed worker node from a cluster, first try to cleanly remove it:

1. Try to cleanly reset the node. After the node is reset, the node is removed
   from the cluster:

       bmctl reset nodes \
         --addresses COMMA_SEPARATED_IPS \
         --cluster CLUSTER_NAME \
         --kubeconfig ADMIN_KUBECONFIG

   Replace the following values:

   - COMMA_SEPARATED_IPS: the IP addresses of the nodes to reset, such as
     10.200.0.8,10.200.0.9.
   - CLUSTER_NAME: the name of the target cluster that contains the failed
     nodes.
   - ADMIN_KUBECONFIG: the path to the admin cluster kubeconfig file.

   You can now diagnose the node and fix any misconfigurations that caused the
   initial failure. Skip the remaining steps in this section.
2. If the previous step to reset the node fails, you can forcefully remove the
   node from the cluster. This forceful removal skips the step that runs the
   reset commands and only removes the related references to the node from the
   nodepool and cluster custom resources:

       bmctl reset nodes \
         --addresses COMMA_SEPARATED_IPS \
         --cluster CLUSTER_NAME \
         --kubeconfig ADMIN_KUBECONFIG \
         --force

   You can now diagnose the node and fix any misconfigurations that caused the
   initial failure.

3. If you forcefully removed the node from the cluster in the previous step,
   run the bmctl reset command again to reset the nodes:

       bmctl reset nodes \
         --addresses COMMA_SEPARATED_IPS \
         --cluster CLUSTER_NAME \
         --kubeconfig ADMIN_KUBECONFIG
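As a concrete illustration, a clean reset of a single failed worker node might
look like the following; the IP address, cluster name, and kubeconfig path are
example values only:

    bmctl reset nodes \
      --addresses 10.200.0.8 \
      --cluster cluster1 \
      --kubeconfig bmctl-workspace/cluster1/cluster1-kubeconfig

If this command fails because the machine is unreachable, rerun it with
--force to remove the node's references (step 2), and run it again without
--force after you recover the machine (step 3).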
Single control plane node failure

The process is the same as for worker nodes. For control plane nodes, bmctl
also cleans the etcd membership.

Note: The cluster becomes non-HA after you remove the failed node. To return
to a highly available state, add a healthy node to the cluster.

To remove a failed control plane node from a cluster, first try to cleanly
remove it:

1. Try to cleanly reset the node. After the node is reset, the node is removed
   from the cluster:

       bmctl reset nodes \
         --addresses COMMA_SEPARATED_IPS \
         --cluster CLUSTER_NAME \
         --kubeconfig ADMIN_KUBECONFIG

   Replace the following values:

   - COMMA_SEPARATED_IPS: the IP addresses of the nodes to reset, such as
     10.200.0.8,10.200.0.9.
   - CLUSTER_NAME: the name of the target cluster that contains the failed
     nodes.
   - ADMIN_KUBECONFIG: the path to the admin cluster kubeconfig file.

   You can now diagnose the node and fix any misconfigurations that caused the
   initial failure. Skip the remaining steps in this section.
2. If the previous step to reset the node fails, you can forcefully remove the
   node from the cluster. This forceful removal skips the step that runs the
   reset commands and only removes the related references to the node from the
   nodepool and cluster custom resources:

       bmctl reset nodes \
         --addresses COMMA_SEPARATED_IPS \
         --cluster CLUSTER_NAME \
         --kubeconfig ADMIN_KUBECONFIG \
         --force

   You can now diagnose the node and fix any misconfigurations that caused the
   initial failure.

3. If you forcefully removed the node from the cluster in the previous step,
   run the bmctl reset command again to reset the nodes:

       bmctl reset nodes \
         --addresses COMMA_SEPARATED_IPS \
         --cluster CLUSTER_NAME \
         --kubeconfig ADMIN_KUBECONFIG
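The procedures above leave the diagnosis itself to you. The following checks
are a sketch of common starting points for the storage, network, and OS
misconfigurations mentioned earlier, not an official checklist; run them on
the recovered machine and adapt them to your failure symptoms:

    # Look for full disks, a common storage misconfiguration.
    df -h

    # Confirm the expected network interfaces and addresses are present.
    ip addr show

    # Check for recent kernel and system errors.
    journalctl -p err --since "1 hour ago" --no-pager | tail -n 50

    # Confirm time synchronization, which etcd and TLS certificates depend on.
    timedatectl status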
Quorum lost in HA control plane

If too many control plane nodes in an HA cluster enter a failed state, the
cluster loses quorum and becomes unavailable.

Note: When you need to restore management clusters, don't provide the
kubeconfig file in the reset commands. If you provide the kubeconfig file for
a management cluster, it forces a new cluster to perform the reset operation.
When you restore a user cluster, provide the path to the kubeconfig file.

1. To recover a cluster that has lost quorum, run the following command on a
   remaining healthy node:

       bmctl restore --control-plane-node CONTROL_PLANE_NODE \
         --cluster CLUSTER_NAME \
         [--kubeconfig KUBECONFIG_FILE]

   Replace the following values:

   - CONTROL_PLANE_NODE: the IP address of a healthy node that remains as part
     of the cluster.
   - CLUSTER_NAME: the name of the target cluster that contains the failed
     nodes.
   - KUBECONFIG_FILE: if recovering a user cluster, the path to the user
     cluster kubeconfig file.

2. After you recover the failed nodes, run the bmctl reset command to reset
   them:

       bmctl reset nodes \
         --addresses COMMA_SEPARATED_IPS \
         --cluster CLUSTER_NAME \
         [--kubeconfig KUBECONFIG_FILE]

   Replace the following values:

   - COMMA_SEPARATED_IPS: the IP addresses of the nodes to reset, such as
     10.200.0.8,10.200.0.9.
   - CLUSTER_NAME: the name of the target cluster that contains the failed
     nodes.
   - KUBECONFIG_FILE: the path to the admin cluster kubeconfig file.
   If the failed nodes were part of the load balancer nodepools, the recovered
   nodes might contend for the control plane virtual IP address and make the
   new cluster unstable. Run the reset commands against the failed nodes as
   soon as possible after you recover them.

Note: This process only handles disaster recovery for a 3-node control plane
HA deployment. It doesn't support recovery for HA setups with 5 or more nodes.
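As a concrete illustration, recovering a user cluster that has lost quorum
might look like the following, where 10.200.0.10 stands in for the remaining
healthy control plane node and all other values are examples:

    # Restore quorum by using the healthy control plane node.
    bmctl restore --control-plane-node 10.200.0.10 \
      --cluster cluster1 \
      --kubeconfig bmctl-workspace/cluster1/cluster1-kubeconfig

    # After you recover the failed machines, reset them promptly to avoid
    # contention for the control plane virtual IP address.
    bmctl reset nodes \
      --addresses 10.200.0.11,10.200.0.12 \
      --cluster cluster1 \
      --kubeconfig bmctl-workspace/cluster1/cluster1-kubeconfig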
What's next
For more information about how to add or remove nodes from a cluster when
there isn't a failure, and how to check node status, see
Update clusters.
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-09-04 UTC."],[[["\u003cp\u003eThis document guides users on how to recover from node failures in Google Distributed Cloud, focusing on resetting and, if necessary, forcefully removing nodes from a cluster.\u003c/p\u003e\n"],["\u003cp\u003eTo remove a failed node, users should first try a clean reset using \u003ccode\u003ebmctl reset nodes\u003c/code\u003e, which resets the node and removes its references from the cluster.\u003c/p\u003e\n"],["\u003cp\u003eIf a clean reset fails, users can forcefully remove the node using \u003ccode\u003ebmctl reset nodes --force\u003c/code\u003e, which skips the reset command but removes the node's references.\u003c/p\u003e\n"],["\u003cp\u003eThe process for recovering from a single control plane node failure is similar to that of worker nodes, with the added step of \u003ccode\u003ebmctl\u003c/code\u003e cleaning the \u003ccode\u003eetcd\u003c/code\u003e membership.\u003c/p\u003e\n"],["\u003cp\u003eIn high availability (HA) control plane setups that have lost quorum, a \u003ccode\u003ebmctl restore\u003c/code\u003e command can be used on a healthy node to recover the cluster, followed by resetting the failed nodes.\u003c/p\u003e\n"]]],[],null,["# Reset a failed node in Google Distributed Cloud\n\n\u003cbr /\u003e\n\nIf nodes in Google Distributed Cloud fail, such as due to storage, network, or OS\nmisconfiguration, you want to efficiently restore cluster health. After you\nrestore the cluster health, you can troubleshoot the node failure.\n\nThis document shows you how to recover from node failure scenarios by resetting\na node, and forcefully removing the node if needed.\n\nIf you want to add or remove nodes from a cluster under normal circumstances\nwhen a node hasn't failed, see\n[Update clusters](/anthos/clusters/docs/bare-metal/1.14/how-to/update-clusters).\n\nOverview\n--------\n\nWhen there's a node failure, sometimes you can't run reset commands on the nodes\nas the node might be unreachable. You might need to forcefully remove the node\nfrom the cluster.\n\nWhen you\n[cleanly reset a node and update the cluster](/anthos/clusters/docs/bare-metal/1.14/how-to/update-clusters),\nthe following actions happen:\n\n1. The node resets, similar to `kubeadm reset`, and the machine reverts to the pre-installed state.\n2. The related references to the node are removed from the nodepool and cluster custom resources.\n\n| **Note:** In some of the following `bmctl` commands to reset nodes, the `--force` parameter indicates whether the reset commands (step 1) should be skipped. If the `--force` parameter is used, `bmctl` only performs the removal step (step 2), and doesn't run the reset commands.\n\n### Worker node\n\nTo remove a node from a cluster, first try to cleanly remove it:\n\n1. Try to cleanly reset the node. 
After the node is reset, the node is removed\n from the cluster:\n\n bmctl reset nodes \\\n --addresses \u003cvar translate=\"no\"\u003eCOMMA_SEPARATED_IPS\u003c/var\u003e \\\n --cluster \u003cvar translate=\"no\"\u003eCLUSTER_NAME\u003c/var\u003e \\\n --kubeconfig \u003cvar translate=\"no\"\u003eADMIN_KUBECONFIG\u003c/var\u003e\n\n Replace the following values:\n - \u003cvar translate=\"no\"\u003eCOMMA_SEPARATED_IP\u003c/var\u003e: the IP addresses of the nodes to reset, such as `10.200.0.8,10.200.0.9`.\n - \u003cvar translate=\"no\"\u003eCLUSTER_NAME\u003c/var\u003e: the name of the target cluster that contains the failed nodes.\n - \u003cvar translate=\"no\"\u003eADMIN_KUBECONFIG\u003c/var\u003e: the path to the admin cluster kubeconfig file.\n\n You can now diagnose the node and fix any misconfigurations that caused the\n initial failure. Skip the remaining steps in this section.\n2. If the previous step to reset the node fails, you can forcefully remove the\n node from the cluster. This forceful removal skips the previous step that runs\n that reset commands, and only performs the step to remove the related\n references to the node from the nodepool and cluster custom resources:\n\n bmctl reset nodes \\\n --addresses \u003cvar translate=\"no\"\u003eCOMMA_SEPARATED_IPS\u003c/var\u003e \\\n --cluster \u003cvar translate=\"no\"\u003eCLUSTER_NAME\u003c/var\u003e \\\n --kubeconfig \u003cvar translate=\"no\"\u003eADMIN_KUBECONFIG\u003c/var\u003e \\\n --force\n\n You can now diagnose the node and fix any misconfigurations that caused the\n initial failure.\n3. If you forcefully removed the node from the node cluster in the previous\n step, run the `bmctl reset` command again to reset the nodes:\n\n bmctl reset nodes \\\n --addresses \u003cvar translate=\"no\"\u003eCOMMA_SEPARATED_IPS\u003c/var\u003e \\\n --cluster \u003cvar translate=\"no\"\u003eCLUSTER_NAME\u003c/var\u003e \\\n --kubeconfig \u003cvar translate=\"no\"\u003eADMIN_KUBECONFIG\u003c/var\u003e\n\n### Single control plane node failure\n\nThe process is the same as for worker nodes. For control plane nodes, `bmctl`\nalso cleans the `etcd` membership.\n| **Note:** The cluster becomes non-HA after you remove the failed node. To return to a highly available state, [add a healthy node to the cluster](/anthos/clusters/docs/bare-metal/1.14/how-to/update-clusters).\n\nTo remove a node from a cluster, first try to cleanly remove it:\n\n1. Try to cleanly reset the node. After the node is reset, the node is removed\n from the cluster:\n\n bmctl reset nodes \\\n --addresses \u003cvar translate=\"no\"\u003eCOMMA_SEPARATED_IPS\u003c/var\u003e \\\n --cluster \u003cvar translate=\"no\"\u003eCLUSTER_NAME\u003c/var\u003e \\\n --kubeconfig \u003cvar translate=\"no\"\u003eADMIN_KUBECONFIG\u003c/var\u003e\n\n Replace the following values:\n - \u003cvar translate=\"no\"\u003eCOMMA_SEPARATED_IP\u003c/var\u003e: the IP addresses of the nodes to reset, such as `10.200.0.8,10.200.0.9`.\n - \u003cvar translate=\"no\"\u003eCLUSTER_NAME\u003c/var\u003e: the name of the target cluster that contains the failed nodes.\n - \u003cvar translate=\"no\"\u003eADMIN_KUBECONFIG\u003c/var\u003e: the path to the admin cluster kubeconfig file.\n\n You can now diagnose the node and fix any misconfigurations that caused the\n initial failure. Skip the remaining steps in this section.\n2. If the previous step to reset the node fails, you can forcefully remove the\n node from the cluster. 
This forceful removal skips the previous step that runs\n that reset commands, and only performs the step to remove the related\n references to the node from the nodepool and cluster custom resources:\n\n bmctl reset nodes \\\n --addresses \u003cvar translate=\"no\"\u003eCOMMA_SEPARATED_IPS\u003c/var\u003e \\\n --cluster \u003cvar translate=\"no\"\u003eCLUSTER_NAME\u003c/var\u003e \\\n --kubeconfig \u003cvar translate=\"no\"\u003eADMIN_KUBECONFIG\u003c/var\u003e \\\n --force\n\n You can now diagnose the node and fix any misconfigurations that caused the\n initial failure.\n3. If you forcefully removed the node from the node cluster in the previous\n step, run the `bmctl reset` command again to reset the nodes:\n\n bmctl reset nodes \\\n --addresses \u003cvar translate=\"no\"\u003eCOMMA_SEPARATED_IPS\u003c/var\u003e \\\n --cluster \u003cvar translate=\"no\"\u003eCLUSTER_NAME\u003c/var\u003e \\\n --kubeconfig \u003cvar translate=\"no\"\u003eADMIN_KUBECONFIG\u003c/var\u003e\n\n### Quorum lost in HA control plane\n\nIf too many control planes nodes in an HA cluster enter a failed state, the\ncluster loses quorum and becomes unavailable.\n| **Note:** When you need to restore management clusters, don't provide the kubeconfig file in the reset commands. If you provide the kubeconfig file for a management cluster, it forces a new cluster to perform the reset operation. When you restore a user cluster, provide the path to the kubeconfig file.\n\n1. To recover a cluster that has lost quorum, run the following command on a\n remaining healthy node:\n\n bmctl restore --control-plane-node \u003cvar translate=\"no\"\u003eCONTROL_PLANE_NODE\u003c/var\u003e \\\n --cluster \u003cvar translate=\"no\"\u003eCLUSTER_NAME\u003c/var\u003e \\\n [--kubeconfig \u003cvar translate=\"no\"\u003eKUBECONFIG_FILE\u003c/var\u003e]\n\n Replace the following values:\n - \u003cvar translate=\"no\"\u003eCONTROL_PLANE_NODE\u003c/var\u003e: the IP addresses of a healthy node that remains as part of the cluster.\n - \u003cvar translate=\"no\"\u003eCLUSTER_NAME\u003c/var\u003e: the name of the target cluster that contains the failed nodes.\n - \u003cvar translate=\"no\"\u003eKUBECONFIG_FILE\u003c/var\u003e: if recovering a user cluster, the path to the user cluster kubeconfig file.\n2. After you recover the failed nodes, run the `bmctl reset` command to reset\n the nodes:\n\n bmctl reset nodes \\\n --addresses \u003cvar translate=\"no\"\u003eCOMMA_SEPARATED_IPS\u003c/var\u003e \\\n --cluster \u003cvar translate=\"no\"\u003eCLUSTER_NAME\u003c/var\u003e \\\n [--kubeconfig \u003cvar translate=\"no\"\u003eKUBECONFIG_FILE\u003c/var\u003e]\n\n Replace the following values:\n - \u003cvar translate=\"no\"\u003eCOMMA_SEPARATED_IP\u003c/var\u003e: the IP addresses of the nodes to reset, such as `10.200.0.8,10.200.0.9`.\n - \u003cvar translate=\"no\"\u003eCLUSTER_NAME\u003c/var\u003e: the name of the target cluster that contains the failed nodes.\n - \u003cvar translate=\"no\"\u003eKUBECONFIG_FILE\u003c/var\u003e: the path to the admin cluster kubeconfig file.\n\n If the failed nodes were part of the load balancer nodepools, after the nodes\n recover they might contend for the control plane virtual IP address and make\n the new cluster unstable. Run the reset commands against the failed nodes as\n soon as possible after you recover the nodes.\n\n| **Note:** This process only handles the disaster recovery for a 3-node control plane HA deployment. 
This process doesn't support the recovery for HA setups with 5 nodes or more.\n\nWhat's next\n-----------\n\nFor more information on how to add or remove nodes from a cluster when there\nisn't a failure and check the node status, see\n[Update clusters](/anthos/clusters/docs/bare-metal/1.14/how-to/update-clusters)."]]