Failovers

If a Bigtable cluster becomes unresponsive, replication makes it possible for incoming traffic to fail over to another cluster in the same instance. Failovers can be either manual or automatic, depending on the app profile an application is using and how the app profile is configured.

This page explains how manual and automatic failovers work in an instance that uses replication. To learn how to complete a failover, see Managing Failovers.

Before you read this page, you should be familiar with the overview of Bigtable replication. You should also be familiar with the available routing options.

Manual failovers

If an app profile uses single-cluster routing to direct all requests to one cluster, you must use your own judgment to decide when to start failing over to a different cluster.
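Conceptually, a manual failover with single-cluster routing amounts to updating the app profile so that it routes all requests to a different cluster. A minimal sketch of that state change, assuming a simplified stand-in class rather than the real Bigtable client API (the `AppProfile` class, `fail_over` helper, and cluster IDs are hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AppProfile:
    """Hypothetical stand-in for an app profile that uses single-cluster routing."""
    profile_id: str
    cluster_id: str  # the one cluster this profile routes requests to

def fail_over(profile: AppProfile, target_cluster: str) -> AppProfile:
    """Repoint the profile at a different cluster; new requests follow it."""
    return AppProfile(profile.profile_id, target_cluster)

# Manual failover: redirect traffic from an unhealthy cluster to a healthy one.
profile = AppProfile("my-app-profile", "cluster-us-east1-b")
profile = fail_over(profile, "cluster-us-west1-a")
```

The actual update is done through the Bigtable admin tools; the sketch only illustrates that the routing target, not the data, is what changes during a manual failover.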

Here are some signals that might indicate that it would be helpful to fail over to a different cluster:

  • The cluster starts to return a large number of transient system errors.
  • A large number of requests start timing out.
  • The average response latency increases to an unacceptable level.

Because these signals can appear for many different reasons, failing over to a different cluster is not guaranteed to resolve the underlying issue. Monitor your instance before and after the failover to verify that the metrics have improved.
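The three signals above can be combined into a single go/no-go check against thresholds you choose for your workload. A sketch, where the function name and the default limits are illustrative assumptions, not Bigtable defaults:

```python
def should_fail_over(error_rate: float,
                     timeout_rate: float,
                     avg_latency_ms: float,
                     max_error_rate: float = 0.05,    # assumed threshold
                     max_timeout_rate: float = 0.02,  # assumed threshold
                     max_latency_ms: float = 100.0    # assumed threshold
                     ) -> bool:
    """Return True if any monitored signal crosses its threshold:
    transient errors, request timeouts, or average response latency."""
    return (error_rate > max_error_rate
            or timeout_rate > max_timeout_rate
            or avg_latency_ms > max_latency_ms)
```

Because a failover is not guaranteed to fix the underlying issue, you would keep evaluating the same check after the failover to confirm the metrics actually improved.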

For details about how to complete a manual failover, see Completing a manual failover.

Automatic failovers

If an app profile uses multi-cluster routing, Bigtable handles failovers automatically. When the nearest cluster is unable to handle a request, Bigtable routes traffic to the nearest cluster that is available.

Automatic failovers can occur even if a cluster is unavailable only briefly. For example, if Bigtable routes a request to one cluster, and that cluster is excessively slow to reply or returns a transient error, Bigtable typically retries the request on another cluster.
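The retry behavior described above can be sketched as trying clusters in order of proximity and moving to the next one on a transient error. This is a simplified model, not Bigtable's internal routing code; the `TransientError` class and `send` callback are assumptions:

```python
class TransientError(Exception):
    """Stands in for a transient system error from one cluster."""

def send_with_failover(request, clusters, send):
    """Try clusters in order of proximity; on a transient error,
    retry the same request on the next nearest cluster."""
    last_error = None
    for cluster in clusters:
        try:
            return send(cluster, request)
        except TransientError as err:
            last_error = err  # this cluster is unavailable; try the next one
    raise last_error  # every cluster failed
```

With multi-cluster routing, the application never sees the brief outage: the first cluster's transient error is absorbed by the retry on the next cluster.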

If you are using multi-cluster routing and you send a request with a deadline, Bigtable automatically fails over when necessary to help meet the deadline. If the request deadline approaches but the initial cluster has not sent a response, Bigtable reroutes the request to the next closest cluster.
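Deadline-aware rerouting can be modeled as giving the nearest cluster a slice of the remaining time budget and moving on if it has not responded. The function below is an illustrative sketch under that assumption (the `send` callback, margin value, and `None`-means-no-response convention are all hypothetical):

```python
import time

def send_with_deadline(request, clusters, send, deadline_s, margin_s=0.1):
    """Route to the nearest cluster first; if the deadline approaches
    with no response, reroute the request to the next closest cluster."""
    start = time.monotonic()
    for cluster in clusters:
        remaining = deadline_s - (time.monotonic() - start)
        if remaining <= 0:
            break
        # Give this cluster part of the budget, keeping a margin to reroute.
        response = send(cluster, request,
                        timeout=max(remaining - margin_s, 0.01))
        if response is not None:
            return response
    raise TimeoutError("deadline exceeded on all clusters")
```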

Bigtable uses an internal last-write-wins algorithm to resolve any data conflicts that occur when a failover happens before replication has completed. See Conflict resolution for more details.
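A last-write-wins merge can be sketched as keeping whichever version of a cell carries the later timestamp. The representation below is illustrative only; Bigtable's actual internal algorithm and tiebreaking are not exposed:

```python
def resolve_conflict(cell_a, cell_b):
    """Last write wins: keep the cell version with the later timestamp.
    Cells are (timestamp_micros, value) pairs; on a timestamp tie, the
    tuple comparison falls back to the value, which is an assumption made
    here purely to keep the result deterministic."""
    return max(cell_a, cell_b)
```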

If you are using replication with multi-cluster routing to achieve high availability (HA) for your application, you should locate your client servers or VMs in or near more than one Google Cloud region. This recommendation applies even if your application server is not hosted by Google Cloud, because your data enters the Google Cloud network through the Google Cloud region that is closest to your application server. Like any request, a failover completes more quickly over shorter distances.

Many automatic failovers are so brief that you won't notice them. You can check the Automatic Failovers graph in the Google Cloud console to see the number of requests that were automatically rerouted over a given period of time: open the list of instances, click the instance name, then click Monitoring.

What's next