Overview of the High Availability Configuration

This page describes the high availability configuration for Second Generation instances. The high availability configuration, sometimes called a cluster, provides failover capability.

For help with configuring high availability, see Configuring an Instance for High Availability.

What the high availability configuration is

A Second Generation instance is in a high availability configuration when it has a failover replica. The failover replica must be in a different zone than the original instance, also called the master. All changes made to the data on the master, including to user tables, are replicated to the failover replica using semisynchronous replication.

What the high availability configuration provides

If an instance configured for high availability experiences an outage or becomes unresponsive, Cloud SQL automatically fails over to the failover replica, and your data continues to be available to clients. This is called a failover.

This capability is built in for First Generation instances. Cloud SQL Second Generation provides the high availability configuration as an option so you can reduce your costs for non-production instances.

How to configure an instance for high availability

The easiest time to configure a Second Generation instance for high availability is when you create the instance. You can also configure an existing instance for high availability. For more information, see Configuring an Instance for High Availability.

Which instances should be configured for high availability

You should configure all of your instances that contain production data for high availability.

Requirements for the high availability configuration

Failover replicas must be in the same project and region as the master instance.

How configuring an instance for high availability affects your charges

The failover replica is billed as a separate instance.

When failover is triggered

Failover is triggered when one of the following scenarios occurs:

  • The zone where the master instance is located experiences an outage.
  • The master instance is unresponsive for approximately 60 seconds.

In addition, for failover to be triggered, the configuration must meet all of the following requirements:

  • Replication must be in a healthy state.
  • There can be no operations in progress on the failover replica (for example, an export).
  • The master instance must be in a normal operating state (not stopped or undergoing maintenance).
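The triggers and requirements above can be read as a single boolean rule. The following is a minimal sketch of that rule as a model for reasoning about it, not Cloud SQL's actual implementation. The HaState class, its field names, and the failover_would_trigger function are hypothetical names introduced for illustration; the 60-second threshold comes from the "approximately 60 seconds" figure above.

```python
from dataclasses import dataclass

# "Approximately 60 seconds" per the text; the exact value is an assumption.
UNRESPONSIVE_THRESHOLD_SECONDS = 60


@dataclass
class HaState:
    """Hypothetical snapshot of a master and its failover replica."""
    zone_outage: bool                # the master's zone is experiencing an outage
    unresponsive_seconds: float      # how long the master has been unresponsive
    replication_healthy: bool        # replication is in a healthy state
    replica_operation_running: bool  # for example, an export on the replica
    master_normal: bool              # master is not stopped or under maintenance


def failover_would_trigger(state: HaState) -> bool:
    """Return True when a trigger is present and every requirement holds."""
    trigger = (
        state.zone_outage
        or state.unresponsive_seconds >= UNRESPONSIVE_THRESHOLD_SECONDS
    )
    requirements = (
        state.replication_healthy
        and not state.replica_operation_running
        and state.master_normal
    )
    return trigger and requirements
```

For example, a zone outage while an export is running on the failover replica does not satisfy the rule, because one of the requirements is not met.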

You can also initiate failover manually. For information, see Initiating failover.

How failover affects your applications and your instances

When a master fails over to its failover replica, any existing connections to the instance are closed. However, your application can reconnect using the same connection string or IP address; you do not need to update your application after a failover.
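Because existing connections are closed but the connection string stays the same, an application typically only needs to retry the connection. The following is a minimal sketch of a retry loop with exponential backoff. The connect callable stands in for whatever your database driver provides, and the use of ConnectionError is an assumption: real drivers raise their own exception types, so adjust the except clause accordingly.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            # Same connection string or IP address as before the failover.
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The sleep parameter is injectable so that you can exercise the retry logic in tests without waiting. Choose attempts and base_delay so that the total wait covers the failover duration you observe in your test environment.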

For some time during the failover, your applications experience an outage. A major factor in outage duration is the size of the replication lag (how far the replica is behind the master) when the failover starts. This is because the replica cannot start servicing requests until it "catches up" to the state of the master. You should monitor replication lag and take steps to address it if it becomes too large for your failover duration requirements.

After the failover, the instance resumes serving data from the zone where the failover replica is located. You can see what zone your instance is serving data from by going to its Overview page in the GCP Console.

You should initiate a failover in a test environment to see exactly how your applications are affected.

About using the failover replica as a read replica

You can use the failover replica as a read replica to offload read operations from the master.

You can create only one failover replica for every master. You can create additional read replicas to offload read operations from the master.

For more information about creating read replicas, see Configuring Replication.

How the failover replica is configured

The failover replica is configured with the same database flags, users (including root) and passwords, authorized applications and networks, and databases as the master. You cannot change the replica's activation policy or maintenance window (failover replicas cannot be stopped, and can experience a disruptive upgrade at any time). You cannot enable backups on the replica; backups must be performed on the master instance.

When replication can be disabled

A master instance falls out of high availability mode when the failover replica becomes unavailable. This can happen, for example, if the network connection between the master instance and failover replica is interrupted, or if the failover replica is down due to its own zone failure. During this time, the master instance is not in high availability mode, and you cannot fail over to the replica because it is not safe to do so. The failover replica resumes replication on reconnection, and high availability mode is reenabled when the failover replica catches up completely and returns to semisynchronous replication.

How the high availability configuration affects backups and restores

You can only perform backups on the master instance. Before you can restore a master instance from a backup or perform a point-in-time recovery on a master, you must delete all replicas. After the restore completes, you must recreate the replicas.

How the high availability configuration differs between MySQL and PostgreSQL

There are some differences in the high availability configuration between MySQL and PostgreSQL instances that impact how you work with highly available instances:

  • Highly available PostgreSQL instances do not have a separate failover instance the way MySQL instances do.

    This has the following consequences:

    • There is no concept of replication lag, as there is for MySQL instances. As long as the secondary zone is healthy, failover can occur.

    • If you need to offload read operations, you must create a read replica.

  • If a failover occurs, read replicas replicating from a PostgreSQL regional instance do not change zones; they continue to serve data even if they are now in a different zone than the primary instance. You can initiate another failover, in this case called a failback, to return the regional instance to serving data from its original zone.

  • Enabling automatic backups is not required for highly available PostgreSQL instances, as it is for MySQL. However, enabling automatic backups is recommended for increased data durability.

How you view the health of your high availability configuration

Replica availability

The state of failover replica availability (true or false) is available as a metric of the master:

cloudsql.googleapis.com/database/mysql/replication/available_for_failover

This state is also included in the response of the Get request of the master instance in the failoverReplica.available field.
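As a minimal sketch, you could read this field from an instance resource that has already been fetched and parsed into a dictionary. The sample response body below is hypothetical and trimmed for illustration; only the failoverReplica.available field is taken from this page. Treating a missing field as "not available" is an assumption made for safety.

```python
import json


def available_for_failover(instance: dict) -> bool:
    """Read failoverReplica.available; treat a missing field as not available."""
    replica = instance.get("failoverReplica") or {}
    return bool(replica.get("available", False))


# Hypothetical, trimmed response body for illustration only.
sample = json.loads(
    '{"name": "example-master",'
    ' "failoverReplica": {"name": "example-failover", "available": true}}'
)
print(available_for_failover(sample))  # prints True
```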

You can also see metrics for your high availability configuration by using Stackdriver. For a complete list of Cloud SQL metrics provided by Stackdriver, see the Cloud SQL metrics list. For more information about using Stackdriver with GCP, see the Stackdriver documentation.

Replication lag

For information about monitoring replication lag, see Monitoring replication lag.

Replication lag

What replication lag is

As provided by semisynchronous replication, every write operation to a master instance requires the same update to be made to its failover replica. To minimize performance impact on the master while ensuring that changes are never lost, the replica logs the update events and then performs the updates in order. If the update events arrive faster than the replica can perform them, the replica falls behind the master. The difference in time between when the master makes an update and when the replica catches up with that update from its log is called replication lag.
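To make the definition concrete, the following toy model computes lag for a replica that applies logged events serially and in order. It is a simplified sketch, assuming a fixed apply time per event; the function name and parameters are hypothetical and the model is not a description of MySQL internals.

```python
def replication_lag(commit_times, apply_seconds):
    """Return per-event lag for a replica that applies events one at a time.

    commit_times: ascending times (in seconds) at which the master made each update.
    apply_seconds: assumed fixed time the replica needs to apply one event.
    """
    lags = []
    replica_free_at = 0
    for committed in commit_times:
        # The replica starts an event once it has arrived and the
        # previous event has finished applying.
        start = max(committed, replica_free_at)
        replica_free_at = start + apply_seconds
        lags.append(replica_free_at - committed)
    return lags


# One update per second, but each takes 2 seconds to apply: lag keeps growing.
print(replication_lag([0, 1, 2, 3], 2))    # prints [2, 3, 4, 5]
# The same updates applied in 0.5 seconds each: lag stays constant.
print(replication_lag([0, 1, 2, 3], 0.5))  # prints [0.5, 0.5, 0.5, 0.5]
```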

Monitoring replication lag

The state of replication lag is available as a metric of the replica:

cloudsql.googleapis.com/database/mysql/replication/seconds_behind_master

The value for this metric represents the number of seconds the replica is behind the master.
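Once you have samples of this metric, you can compare them against your failover duration requirements. The following is a minimal sketch, assuming the samples have already been retrieved as a list of seconds ordered oldest first. The function name, the max_lag_seconds budget, and the sustained window are hypothetical values that you would choose yourself.

```python
def lag_exceeds_budget(samples, max_lag_seconds, sustained=3):
    """Return True when the most recent `sustained` samples all exceed the budget.

    samples: seconds_behind_master values, oldest first.
    Requiring several consecutive samples avoids reacting to a brief spike.
    """
    if sustained < 1:
        raise ValueError("sustained must be at least 1")
    recent = samples[-sustained:]
    return len(recent) == sustained and all(s > max_lag_seconds for s in recent)


print(lag_exceeds_budget([5, 40, 45, 50], max_lag_seconds=30))  # prints True
print(lag_exceeds_budget([50, 40, 10], max_lag_seconds=30))     # prints False
```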

For information about setting up a Stackdriver alert for this metric, see Creating an alert for replication lag.

For information on viewing replication metrics, see Viewing and exporting MySQL error logs.

Addressing replication lag

If replication lag is due to an unusual spike of writes to the master, the failover replica can usually catch up to the master after the load decreases again. If replication lag is high but steady, you can delete the replica and recreate it. However, if the write load continues, and the replication lag continues to increase, you must take action. If replication lag becomes too large, it can affect the SLA coverage for your instance.

Because replicas get the pending writes in a serial fashion, they function as a single-threaded instance. For this reason, increasing system resources on the replica is unlikely to address replication lag. If your replica is chronically falling behind the master, you must shard your database so that the write operations are shared between multiple master instances.
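This page does not prescribe a sharding scheme. As one common approach, the following sketch routes each write to a master chosen by a stable hash of a sharding key. The function name and the instance names are hypothetical, and modulo hashing is an assumption chosen for brevity: it remaps most keys when the number of masters changes, so a production design might prefer consistent hashing or a lookup table.

```python
import hashlib
from typing import Sequence


def shard_for_key(key: str, masters: Sequence[str]) -> str:
    """Pick a master for a sharding key using a stable hash.

    hashlib is used instead of the built-in hash() so that the mapping
    is the same across processes and restarts.
    """
    if not masters:
        raise ValueError("at least one master is required")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(masters)
    return masters[index]


# Hypothetical instance names for illustration only.
masters = ["example-master-0", "example-master-1", "example-master-2"]
print(shard_for_key("customer-1234", masters))
```

Each master in such a design would have its own failover replica, so each replica applies only a fraction of the total write load.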
