If a Cloud Bigtable cluster becomes unresponsive, replication makes it possible for incoming traffic to fail over to another cluster in the same instance. Failovers can be either manual or automatic, depending on the app profile an application is using and how the app profile is configured.
This page explains how manual and automatic failovers work in an instance that uses replication. To learn how to complete a failover, see Managing Failovers.
Before you read this page, you should be familiar with the overview of Cloud Bigtable replication.
If an app profile uses single-cluster routing to direct all requests to one cluster, you must use your own judgment to decide when to start failing over to a different cluster.
Here are some signals that might indicate that it would be helpful to fail over to a different cluster:
- The cluster starts to return a large number of transient system errors.
- A large number of requests start timing out.
- The average response latency increases to an unacceptable level.
Because these signals can appear for many different reasons, failing over to a different cluster is not guaranteed to resolve the underlying issue. Monitor your instance before and after the failover to verify that the metrics have improved.
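The signals listed above can be turned into a simple check over monitoring data. The following is a minimal sketch, not part of any Cloud Bigtable API: the `ClusterStats` and `Thresholds` types, the `failover_signals` function, and the threshold values are all hypothetical, and it assumes you already collect these counts from your own monitoring. It only flags signals; the decision to fail over remains a judgment call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterStats:
    """Hypothetical snapshot of one cluster over a monitoring window."""
    total_requests: int
    transient_errors: int
    timeouts: int
    avg_latency_ms: float


@dataclass(frozen=True)
class Thresholds:
    """Illustrative limits (assumptions); choose values that fit your workload."""
    max_error_rate: float = 0.05
    max_timeout_rate: float = 0.05
    max_avg_latency_ms: float = 200.0


def failover_signals(stats: ClusterStats, limits: Thresholds = Thresholds()) -> list[str]:
    """Return the names of the signals that exceed their limits."""
    if stats.total_requests == 0:
        return []  # No traffic in the window, so there is nothing to judge.
    signals = []
    if stats.transient_errors / stats.total_requests > limits.max_error_rate:
        signals.append("transient_errors")
    if stats.timeouts / stats.total_requests > limits.max_timeout_rate:
        signals.append("timeouts")
    if stats.avg_latency_ms > limits.max_avg_latency_ms:
        signals.append("latency")
    return signals
```

An empty result means none of the three signals crossed its limit. A non-empty result is a prompt to investigate, because the same symptoms can have causes that a failover will not fix.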
For details about how to complete a manual failover, see Completing a manual failover.
If an app profile uses multi-cluster routing to automatically distribute traffic between clusters, Cloud Bigtable handles failovers automatically as well. When a cluster is unable to handle a request, Cloud Bigtable routes traffic to the other cluster.
Automatic failovers can occur even if a cluster is unavailable for a very short period of time. For example, if Cloud Bigtable routes a request to one cluster, and that cluster is excessively slow to reply or returns a transient error, Cloud Bigtable will typically retry that request on the other cluster.
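The retry behavior described above can be pictured with a small conceptual model. This sketch does not reflect how Cloud Bigtable is implemented, and it is not something you write in application code, because the service performs this routing for you. The `TransientError` class and the `route_with_failover` function are hypothetical names, and each cluster is modeled as a plain callable.

```python
from typing import Callable, Sequence


class TransientError(Exception):
    """Hypothetical stand-in for a transient error or an overly slow reply."""


def route_with_failover(request: str, clusters: Sequence[Callable[[str], str]]) -> str:
    """Try each cluster in order and return the first successful response."""
    if not clusters:
        raise ValueError("at least one cluster is required")
    last_error = None
    for cluster in clusters:
        try:
            return cluster(request)
        except TransientError as exc:
            # This cluster could not handle the request; try the next one.
            last_error = exc
    # Every cluster failed, so surface the most recent error to the caller.
    raise last_error
```

In this model, a single failed attempt is enough to send the request elsewhere, which mirrors the point that failovers can happen even during very short periods of unavailability.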
Many automatic failovers are so brief that you won't notice them. However, if an automatic failover lasts for an extended period of time and you are monitoring your instance, you will typically see that one cluster is using significantly more CPU than the other cluster.
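One way to spot the CPU imbalance described above is to compare per-cluster CPU utilization from your monitoring data. The following is a minimal sketch under stated assumptions: the `busiest_cluster_if_imbalanced` function, the `min_gap` value of 0.25, and the input format (cluster ID mapped to CPU utilization between 0 and 1) are all hypothetical and are not part of any Cloud Bigtable API.

```python
from typing import Optional


def busiest_cluster_if_imbalanced(
    cpu_by_cluster: dict[str, float], min_gap: float = 0.25
) -> Optional[str]:
    """Return the ID of the cluster whose CPU leads the next one by min_gap, else None."""
    if len(cpu_by_cluster) < 2:
        return None  # An imbalance needs at least two clusters to compare.
    # Sort clusters from highest to lowest CPU utilization.
    ranked = sorted(cpu_by_cluster.items(), key=lambda item: item[1], reverse=True)
    (top_id, top_cpu), (_, next_cpu) = ranked[0], ranked[1]
    return top_id if top_cpu - next_cpu >= min_gap else None
```

A sustained non-`None` result suggests that one cluster may be absorbing traffic that would normally be spread across clusters. It is only a hint, since uneven load can have other causes.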