If a Cloud Bigtable cluster becomes unresponsive, replication makes it possible for incoming traffic to fail over to another cluster in the same instance. Failovers can be either manual or automatic, depending on the app profile an application is using and how the app profile is configured.
This page explains how manual and automatic failovers work in an instance that uses replication. To learn how to complete a failover, see Managing Failovers.
Before you read this page, you should be familiar with the overview of Cloud Bigtable replication.
If an app profile uses single-cluster routing to direct all requests to one cluster, failover is manual: you must use your own judgment to decide when to start failing over to a different cluster.
Here are some signals that might indicate that it would be helpful to fail over to a different cluster:
- The cluster starts to return a large number of transient system errors.
- A large number of requests start timing out.
- The average response latency increases to an unacceptable level.
Because these signals can appear for many different reasons, failing over to a different cluster is not guaranteed to resolve the underlying issue. Monitor your instance before and after the failover to verify that the metrics have improved.
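The signals listed above can be turned into a simple check that you run against your own monitoring data. The following Python sketch is illustrative only: the `ClusterStats` shape, the `failover_signals` function, and all threshold values are hypothetical and are not part of any Cloud Bigtable API. Choose thresholds that reflect what is unacceptable for your application.

```python
from dataclasses import dataclass


@dataclass
class ClusterStats:
    """Metrics sampled over one monitoring window (hypothetical shape)."""
    requests: int
    transient_errors: int
    timeouts: int
    avg_latency_ms: float


def failover_signals(stats, max_error_rate=0.05, max_timeout_rate=0.05,
                     max_latency_ms=200.0):
    """Return the names of the signals that exceed their assumed thresholds."""
    if stats.requests == 0:
        # No traffic in this window, so there is nothing to evaluate.
        return []
    signals = []
    if stats.transient_errors / stats.requests > max_error_rate:
        signals.append("transient_errors")
    if stats.timeouts / stats.requests > max_timeout_rate:
        signals.append("timeouts")
    if stats.avg_latency_ms > max_latency_ms:
        signals.append("latency")
    return signals
```

An empty result means none of the three signals fired. A non-empty result is a prompt to investigate, not proof that a failover will help, for the reasons given above.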
For details about how to complete a manual failover, see Completing a manual failover.
If an app profile uses multi-cluster routing, Cloud Bigtable handles failovers automatically. When the nearest cluster is unable to handle a request, Cloud Bigtable routes traffic to the nearest cluster that is available.
Automatic failovers can occur even if a cluster is unavailable for a very short period of time. For example, if Cloud Bigtable routes a request to one cluster, and that cluster is excessively slow to reply or returns a transient error, Cloud Bigtable will typically retry that request on another cluster.
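The routing behavior described above can be modeled in a few lines. Cloud Bigtable performs this routing itself, so you do not write this logic in your application. The sketch below is only a simplified model, and `TransientError`, `send_with_failover`, and the `send` callback are hypothetical names introduced for illustration.

```python
class TransientError(Exception):
    """Stands in for a transient error or an excessively slow reply."""


def send_with_failover(request, clusters_nearest_first, send):
    """Try each cluster in nearest-first order; move on after a transient error.

    `send` is a hypothetical callable taking (cluster, request) that either
    returns a response or raises TransientError.
    """
    last_error = None
    for cluster in clusters_nearest_first:
        try:
            return cluster, send(cluster, request)
        except TransientError as err:
            # This cluster could not handle the request; try the next nearest.
            last_error = err
    raise RuntimeError("no cluster could handle the request") from last_error
```

In this model, a request that fails on the nearest cluster is answered by the next cluster in the list, which mirrors how a short outage can trigger a failover without the caller seeing an error.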
If you are using multi-cluster routing and you send a request with a request deadline, Cloud Bigtable automatically fails over to another cluster once half of the deadline has elapsed without a response. For example, if a request is sent with a deadline of 2 seconds, and 1 second has passed with no response, the request is automatically rerouted to the next closest cluster.
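The deadline arithmetic in this example can be written out as a short sketch. The function names `failover_point` and `should_reroute` are hypothetical and serve only to illustrate the halfway rule described above; they are not part of a Cloud Bigtable client library.

```python
def failover_point(deadline_s):
    """Return the elapsed time, in seconds, at which a failover is triggered."""
    return deadline_s / 2


def should_reroute(elapsed_s, deadline_s, got_response):
    """True if no response has arrived by the halfway point of the deadline."""
    return (not got_response) and elapsed_s >= failover_point(deadline_s)
```

With a 2-second deadline, `failover_point(2.0)` is 1 second, so a request that has waited 1 second with no response would be rerouted. One practical consequence is that the deadline you set also determines how soon a failover can start.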
If you are using replication with multi-cluster routing to achieve high availability for your application, you should locate your client servers or VMs in or near more than one Google Cloud region. This recommendation applies even if your application server is not hosted by Google Cloud, because your data enters the Google Cloud network through the Google Cloud region that is closest to your application server. Like any request, a failover completes more quickly over shorter distances.
Many automatic failovers are so brief that you won't notice them. You can check the Automatic Failovers graph in the Cloud Console to see the number of requests that were automatically rerouted over a given period of time: open the list of instances, click the instance name, then click Monitoring.