This page is an overview of the high availability (HA) configuration for MySQL Second Generation instances. To configure a new or existing instance for HA, see Configuring an Instance for High Availability.
HA configuration overview
The HA configuration, sometimes called a cluster, provides data redundancy. The configuration is made up of a primary instance (master) in the primary zone and a failover replica in the secondary zone. Through semisynchronous replication, all changes made to the primary instance's data and user tables are copied onto the failover replica. In the event of an instance or zone failure, this configuration reduces downtime, and your data continues to be available to client applications.
The HA failover replica is billed as a separate instance. For more information, see the pricing page.
The failover replica is configured with the same database flags, users (including root) and passwords, authorized applications and networks, and databases as the primary instance.
Failover replicas have the following limitations:
- The failover replica must be in the same region as the primary instance, but in a different zone.
- You can create only one failover replica for every primary instance.
- You cannot change the failover replica's activation policy or maintenance window. Failover replicas have the same maintenance window as the primary instance.
- You cannot enable backups on the failover replica. Backups must be performed on the primary instance.
- You can create a failover replica only from the primary instance, not from read replicas.
If an HA-configured instance becomes unresponsive, Cloud SQL automatically switches to serving data from the failover replica. This is called a failover. To see if failover has occurred, check your operations log's failover history.
The following process occurs:
1. The primary instance fails.
   Each second, the primary instance writes to a system database as a heartbeat signal. If multiple heartbeats aren't detected (and the failover replica is healthy), failover is initiated. This occurs if the primary instance is unresponsive for approximately 60 seconds or the primary zone experiences an outage.
2. Cloud SQL waits for the failover replica to catch up to the primary instance's state.
   The amount of time this step takes is affected by replication lag.
3. The failover replica is promoted to the primary instance role.
   The failover replica now serves data from the secondary zone, and the primary instance name and IP address move to the old failover replica. The client application reconnects to the new primary instance without needing to change its connection string because the primary instance IP address has moved automatically. To see what zone your instance is serving data from, go to its Overview page in the GCP Console.
4. A failover replica is recreated.
   The new failover replica retains the incoming failover replica IP address and is automatically recreated in a healthy zone.
5. Read replicas are recreated.
   New read replicas retain the incoming read replica IP address and are automatically recreated in a healthy zone.
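The heartbeat check that triggers failover can be pictured with a small model. The following Python sketch is illustrative only and is not Cloud SQL's implementation: the class `HeartbeatMonitor` and its method names are hypothetical, and the only values taken from this page are the one-second heartbeat interval and the approximately 60-second unresponsive threshold.

```python
from dataclasses import dataclass

# Values taken from the description above.
HEARTBEAT_INTERVAL_S = 1    # the primary writes one heartbeat per second
UNRESPONSIVE_LIMIT_S = 60   # approximate silence that triggers failover


@dataclass
class HeartbeatMonitor:
    """Hypothetical model of the heartbeat check; not a Cloud SQL API."""

    last_heartbeat_s: float  # timestamp of the most recent heartbeat seen

    def record(self, now_s: float) -> None:
        """Note that a heartbeat was observed at time now_s."""
        self.last_heartbeat_s = now_s

    def should_fail_over(self, now_s: float, replica_healthy: bool) -> bool:
        """Return True when the primary has been silent long enough
        and the failover replica is healthy."""
        silent_for_s = now_s - self.last_heartbeat_s
        return replica_healthy and silent_for_s >= UNRESPONSIVE_LIMIT_S
```

In this model, a monitor that last saw a heartbeat at second 10 reports no failover at second 30, reports failover at second 70, and never reports failover while the replica is unhealthy.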
For Cloud SQL to allow a failover, the configuration must meet the following requirements:
- Replication must be in a healthy state.
- The primary instance must be in a normal operating state (not stopped, undergoing maintenance, or performing a long-running operation).
- There can be no administrative operations in progress on the failover replica. If failover is initiated immediately after or during an administrative operation, like an export, the failover request will fail.
- The failover replica must be available. If failover is initiated, and the failover replica is unavailable, the failover request will fail. If the request is due to an unhealthy primary instance, then the primary instance will experience an outage.
- Replication lag must be acceptable. If replication lag is over 10 minutes, Cloud SQL will not initiate failover for an unhealthy instance. User-initiated failovers and failovers due to a zonal outage will still be attempted.
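The requirements above can be read as a checklist. The following Python sketch encodes them as a function that lists the reasons a failover would be refused. It is a hypothetical model for reasoning about the rules, not a Cloud SQL API: `ClusterState`, `Trigger`, and `failover_blockers` are names invented for this example, and only the 10-minute lag limit and its exceptions come from this page.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Trigger(Enum):
    UNHEALTHY_INSTANCE = "unhealthy_instance"
    ZONAL_OUTAGE = "zonal_outage"
    USER_INITIATED = "user_initiated"


MAX_LAG_S = 10 * 60  # from the text: lag over 10 minutes blocks failover


@dataclass
class ClusterState:
    """Hypothetical snapshot of an HA configuration."""

    replication_healthy: bool
    primary_stopped_or_busy: bool   # stopped, in maintenance, or long-running operation
    replica_admin_operation: bool   # for example, an export in progress
    replica_available: bool
    replication_lag_s: float


def failover_blockers(state: ClusterState, trigger: Trigger) -> List[str]:
    """Return the unmet requirements; an empty list means failover can proceed."""
    blockers = []
    if not state.replication_healthy:
        blockers.append("replication is not healthy")
    if state.primary_stopped_or_busy:
        blockers.append("primary is not in a normal operating state")
    if state.replica_admin_operation:
        blockers.append("administrative operation in progress on the replica")
    if not state.replica_available:
        blockers.append("failover replica is unavailable")
    # The lag limit applies only to failovers for an unhealthy instance;
    # user-initiated and zonal-outage failovers are still attempted.
    if trigger is Trigger.UNHEALTHY_INSTANCE and state.replication_lag_s > MAX_LAG_S:
        blockers.append("replication lag exceeds 10 minutes")
    return blockers
```

For example, a configuration that is otherwise healthy but has 700 seconds of lag is blocked for `Trigger.UNHEALTHY_INSTANCE` and allowed for `Trigger.ZONAL_OUTAGE`.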
Failover replica availability
Failover replica availability is provided as a primary instance metric, not as a failover replica metric:
This state is also included in the Get request response of the primary instance.
You can use Stackdriver to view your HA configuration metrics. For a complete list of Cloud SQL metrics provided by Stackdriver, see the Cloud SQL metrics list. For more information about using Stackdriver with GCP, see the Stackdriver documentation.
Through semisynchronous replication, every write operation to the primary instance requires the same update to be made to its failover replica. To minimize performance impact on the primary instance while ensuring that changes are never lost, the failover replica logs the update events and then performs the updates in order. If the update events arrive faster than the failover replica can perform them, the failover replica falls behind the primary instance. The difference in time from when the primary instance makes an update to when the failover replica catches up with that update from its log is called replication lag.
Although replication lag can increase failover time, all transactions handled by the primary instance are also recorded on the failover replica's logs; replication lag does not cause data loss.
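To see how lag builds up and drains, consider a simplified model in which the replica applies logged events at a fixed rate. The Python sketch below is an approximation made for illustration: the function `simulate_lag` is hypothetical, the rates are made-up numbers, and lag is estimated as the time the replica would need to work through its current backlog rather than measured the way Cloud SQL measures it.

```python
from typing import List, Sequence


def simulate_lag(writes_per_s: Sequence[float], apply_rate_per_s: float) -> List[float]:
    """Estimate replication lag, in seconds, after each one-second tick.

    writes_per_s: update events logged by the replica during each second.
    apply_rate_per_s: events the replica can apply per second (assumed constant).
    """
    backlog = 0.0  # events logged but not yet applied
    lags = []
    for incoming in writes_per_s:
        # Events are never dropped: they wait in the log until applied.
        backlog = max(0.0, backlog + incoming - apply_rate_per_s)
        lags.append(backlog / apply_rate_per_s)
    return lags


# A two-second write spike above the replica's apply rate, then normal load.
print(simulate_lag([100, 300, 300, 100, 100], apply_rate_per_s=200))
# prints [0.0, 0.5, 1.0, 0.5, 0.0]
```

The backlog grows while writes arrive faster than the replica applies them and shrinks once the load drops, but no event is discarded, which matches the point that lag delays failover without losing data.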
Addressing replication lag
If replication lag is caused by an unusual spike of writes to the primary instance, the failover replica can usually catch up to the primary instance after the load decreases again. If replication lag is high but steady, you can delete the failover replica and recreate it. However, if the write load continues and the replication lag keeps increasing, you must take action. If replication lag becomes too high, it can affect the SLA coverage for your instance.
Because replicas receive the pending writes in a serial fashion, they function as a single-threaded instance. Increasing RAM on the failover replica or increasing its disk to allow higher I/O throughput may help in some situations, but if your failover replica is chronically falling behind the primary instance, you must shard your database so that the write operations are shared between multiple primary instances.
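The three situations described above (a spike that recovers, lag that is high but steady, and lag that keeps growing) can be told apart from a series of lag samples. The following Python sketch shows one way to do that. The function `classify_lag` and both thresholds are assumptions chosen for the example; they are not values defined by Cloud SQL, and you should pick limits that suit your workload.

```python
from typing import Sequence


def classify_lag(samples_s: Sequence[float],
                 high_s: float = 60.0,
                 tolerance_s: float = 5.0) -> str:
    """Classify a window of replication lag samples (in seconds).

    high_s and tolerance_s are hypothetical thresholds for this sketch.
    """
    if len(samples_s) < 2:
        raise ValueError("need at least two samples to see a trend")
    first, last = samples_s[0], samples_s[-1]
    if last - first > tolerance_s:
        return "increasing"       # take action: resize the replica or shard
    if first - last > tolerance_s:
        return "recovering"       # the replica is catching up after a spike
    if last >= high_s:
        return "high_but_steady"  # consider deleting and recreating the replica
    return "ok"
```

For example, `classify_lag([10, 40, 90])` returns `"increasing"`, and `classify_lag([100, 102, 101])` returns `"high_but_steady"`.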
The state of replication lag is provided as a failover replica metric:
The value for this metric represents the number of seconds the replica is behind the primary instance.
For information about setting up a Stackdriver alert for this metric, see Creating an alert for replication lag.
Backups and restores
Configuring an instance for HA does not affect your need for backups or how you create them, but it does affect how you restore an instance. Before you can restore your primary instance from a backup or perform a point-in-time recovery on your primary instance, you must delete all failover and read replicas. After the restore completes, you must recreate your failover and read replicas.
Applications and instances
There is no difference in working with non-HA and HA instances, so your application does not need to be configured in any particular way. When a failover occurs, any existing connections to the instance are closed. However, your application can reconnect using the same connection string or IP address; you do not need to update your application after failover. For some time during the failover, your applications won't be able to connect to the database.
To see exactly how your applications are affected by failover, you should manually initiate failover.
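Because existing connections are closed and new connections fail for some time during failover, applications typically reconnect in a retry loop. The following Python sketch shows one common approach, exponential backoff. It is a minimal example under stated assumptions: `connect_with_retry` is a hypothetical helper, the `connect` argument stands in for whatever function your database driver uses to open a connection, and the sketch catches the built-in `ConnectionError`, whereas a real driver usually raises its own exception types that you would catch instead.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def connect_with_retry(connect: Callable[[], T],
                       attempts: int = 8,
                       base_delay_s: float = 0.5,
                       max_delay_s: float = 30.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call connect() until it succeeds or the attempts are used up.

    The delay doubles after each failure, capped at max_delay_s. The same
    connection settings are reused on every attempt, because the instance
    IP address does not change after failover.
    """
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(min(max_delay_s, base_delay_s * 2 ** attempt))
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

With the default settings, the waits between attempts add up to about one minute, which you can adjust after observing how long a manually initiated failover takes for your instance.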
- Configure a new or existing instance for HA.
- Initiate failover.
- Learn more about managing your database connections.
- Learn more about regions and zones in Cloud SQL.