This page is an overview of the high availability (HA) configuration for Cloud SQL instances. To configure a new instance for HA, or to enable HA on an existing instance, see Enabling and disabling high availability on an instance.
HA configuration overview
The purpose of an HA configuration is to reduce downtime when a zone or instance becomes unavailable. This might happen during a zonal outage, or when an instance becomes corrupted. With HA, your data continues to be available to client applications.
The HA configuration, sometimes called a cluster, provides data redundancy. A Cloud SQL instance configured for HA is also called a regional instance and is located in a primary and secondary zone within the configured region. Within a regional instance, the configuration is made up of a primary instance and a standby instance. Through synchronous replication to each zone's persistent disk, all writes made to the primary instance are replicated to disks in both zones before a transaction is reported as committed. If an instance or zone fails, the persistent disk is attached to the standby instance, and the standby becomes the new primary instance. Users are then rerouted to the new primary. This process is called a failover.
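The commit rule for synchronous replication can be illustrated with a toy model. This is a minimal sketch, not how Cloud SQL is implemented: `ZonalDisk` and `commit` are hypothetical names, and the model only shows that a transaction is reported as committed once the write has reached the disk in both zones. It does not model how the real service behaves when one zone is degraded.

```python
from dataclasses import dataclass, field


@dataclass
class ZonalDisk:
    """Hypothetical stand-in for one zone's copy of the replicated persistent disk."""
    zone: str
    available: bool = True
    blocks: list[str] = field(default_factory=list)

    def write(self, record: str) -> bool:
        # The write is acknowledged only if this zone's disk is reachable.
        if not self.available:
            return False
        self.blocks.append(record)
        return True


def commit(record: str, disks: list[ZonalDisk]) -> bool:
    """Report the transaction as committed only if every zone acknowledges the write."""
    acks = [disk.write(record) for disk in disks]
    return all(acks)


primary_disk = ZonalDisk("zone-a")
standby_disk = ZonalDisk("zone-b")
print(commit("txn-1", [primary_disk, standby_disk]))  # True: both zones acknowledged
```

Zone names such as `zone-a` are placeholders.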
After a failover, the instance that received the failover continues to be the primary instance, even after the original instance comes back online. Once the zone or instance that experienced an outage becomes available again, the original primary instance is destroyed and recreated. Then it becomes the new standby instance. If a failover occurs in the future, the new primary will fail over to the original instance in the original zone.
If you need to have the primary instance in the zone that had the outage, you can do a failback. A failback performs the same steps as the failover, only in the opposite direction, to reroute traffic back to the original instance. To perform a failback, use the procedure in Initiating failover.
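The role changes described above can be sketched as a small state model. This is an illustration only, assuming a hypothetical `RegionalInstance` class that tracks which zone hosts the primary and which hosts the standby; it is not part of any Cloud SQL API. Note that nothing swaps the roles back automatically: the new primary stays primary until you explicitly fail back.

```python
from dataclasses import dataclass


@dataclass
class RegionalInstance:
    """Hypothetical model of which zone hosts the primary and which hosts the standby."""
    primary_zone: str
    standby_zone: str

    def failover(self) -> None:
        # The standby takes over as primary. The original primary's zone later
        # hosts the recreated standby once that zone is available again.
        self.primary_zone, self.standby_zone = self.standby_zone, self.primary_zone

    def failback(self) -> None:
        # A failback runs the same steps as a failover, in the opposite direction.
        self.failover()


instance = RegionalInstance(primary_zone="zone-a", standby_zone="zone-b")
instance.failover()   # outage in zone-a: zone-b now hosts the primary
instance.failback()   # explicit failback: zone-a hosts the primary again
```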
Regional persistent disk support for Cloud SQL and the Cloud SQL HA configuration have full Service Level Agreement (SLA) coverage. An HA-configured instance is charged at double the price of a standalone instance. This price includes CPU, RAM, and storage. For more information, see the pricing page.
Read replicas cannot be made highly available like primary instances. During a zonal outage, traffic to read replicas in that zone stops. Once the zone becomes available again, any read replicas in the zone will resume replication from the primary instance. If read replicas are in a zone that is not in an outage, they are connected to the standby instance when it becomes the primary instance.
As a best practice, put read replicas in a different zone from the primary and standby instances. For example, if you have a primary instance in zone A and a standby instance in zone B, put the read replicas in zone C. This practice ensures that read replicas continue to operate even if the zone for the primary instance goes down. You should also add business logic in the client application to send reads to the primary instance when read replicas are unavailable.
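The client-side business logic for sending reads to the primary when replicas are unavailable might look like the following. This is a minimal sketch under stated assumptions: `route_read` is a hypothetical helper, endpoints are plain strings, and `is_available` is a health check that your application would have to supply.

```python
from typing import Callable, Sequence


def route_read(
    replicas: Sequence[str],
    primary: str,
    is_available: Callable[[str], bool],
) -> str:
    """Return the endpoint a read query should use.

    Prefers the first reachable read replica and falls back to the primary
    instance when no replica is reachable. `is_available` is a hypothetical,
    application-supplied health check.
    """
    for replica in replicas:
        if is_available(replica):
            return replica
    return primary


# During an outage in the replica's zone, reads fall back to the primary.
target = route_read(["replica-zone-c"], "primary", lambda endpoint: False)
```

Checking replicas in order keeps the routing deterministic; a real application might also spread reads across several healthy replicas.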
Note: The standby instance cannot be used for read queries. This differs from the Cloud SQL for MySQL legacy HA configuration.
If an HA-configured instance becomes unresponsive, Cloud SQL automatically switches to serving data from the standby instance. To see if failover has occurred, check your operation log's failover history.
The following process occurs:
- The primary instance or zone fails.
- Each second, the primary instance writes to a system database as a heartbeat signal. If multiple heartbeats aren't detected, failover is initiated. This occurs if the primary instance is unresponsive for approximately 60 seconds or if the zone containing the primary instance experiences an outage.
- Upon reconnection, the standby instance serves data from the secondary zone through a static IP address that it shares with the primary instance.
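The heartbeat-based trigger can be sketched as a timeout check. This is a simplified illustration, not Cloud SQL's actual detector: the function names are hypothetical, the 1-second interval and the roughly 60-second threshold come from the description of the failover process, and a zonal outage is modeled as a flag that triggers failover immediately.

```python
HEARTBEAT_INTERVAL_S = 1   # the primary writes a heartbeat each second
FAILOVER_THRESHOLD_S = 60  # approximate unresponsive period before failover


def missed_heartbeats(last_heartbeat_s: float, now_s: float) -> int:
    """Count whole heartbeat intervals that have passed since the last heartbeat."""
    return int((now_s - last_heartbeat_s) // HEARTBEAT_INTERVAL_S)


def should_fail_over(last_heartbeat_s: float, now_s: float,
                     zone_outage: bool = False) -> bool:
    """Return True if the primary's zone is down or too many heartbeats are missing."""
    if zone_outage:
        return True
    allowed_misses = FAILOVER_THRESHOLD_S // HEARTBEAT_INTERVAL_S
    return missed_heartbeats(last_heartbeat_s, now_s) >= allowed_misses
```

Requiring many consecutive misses, rather than reacting to one, avoids failing over on a brief hiccup.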
For Cloud SQL to allow a failover, the configuration must meet the following requirements:
- The primary instance must be in a normal operating state (not stopped, undergoing maintenance, or performing a long-running Cloud SQL instance operation such as a backup, import or export operation).
- The secondary zone and standby instance must both be in a healthy state. When the standby instance is unresponsive and/or replication to the secondary zone is interrupted, failover operations are blocked. After Cloud SQL repairs the standby instance and the secondary zone is available, replication resumes and Cloud SQL allows failover.
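The two failover requirements can be expressed as a simple eligibility check. This is a hedged sketch: `PrimaryState` and `failover_allowed` are hypothetical names, and the enum values merely paraphrase the states in the requirements rather than mirror any Cloud SQL API field.

```python
from enum import Enum, auto


class PrimaryState(Enum):
    """Hypothetical primary-instance states relevant to failover eligibility."""
    NORMAL = auto()
    STOPPED = auto()
    MAINTENANCE = auto()
    LONG_RUNNING_OPERATION = auto()  # for example, a backup, import, or export


def failover_allowed(primary_state: PrimaryState,
                     standby_healthy: bool,
                     replication_active: bool) -> bool:
    """Check the primary-instance and secondary-zone requirements for failover."""
    primary_ok = primary_state is PrimaryState.NORMAL
    # Failover is blocked while the standby is unresponsive or replication is interrupted.
    secondary_ok = standby_healthy and replication_active
    return primary_ok and secondary_ok
```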
Backup and restore
Automated backups and point-in-time recovery must be enabled for high availability (point-in-time recovery uses binary logging).
Applications and instances
There is no difference in working with non-HA and HA instances, so your application does not need to be configured in any particular way. When failover occurs, any existing connections to the primary instance and read replicas are closed, and reestablishing connections to the primary instance takes approximately 2 to 3 minutes. Connections to replicas can take longer. Your application reconnects using the same connection string or IP address, so you do not need to update your application after failover.
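Because connections are closed during failover and can take a few minutes to come back, applications typically retry with backoff. The following is a minimal sketch, assuming a hypothetical `connect` callable that opens a connection with the unchanged connection string or IP address. Real database drivers raise their own exception types, so `retry_on` defaults to Python's built-in `ConnectionError` only as a placeholder, and the 300-second budget is an assumption chosen to exceed the 2 to 3 minute reconnection window.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    max_wait_s: float = 300.0,
    initial_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `connect` until it succeeds, backing off exponentially between attempts.

    Gives up and re-raises the last error once the total backoff time reaches
    `max_wait_s`. `sleep` is injectable so the retry schedule can be tested.
    """
    waited = 0.0
    delay = initial_delay_s
    while True:
        try:
            return connect()
        except retry_on:
            if waited >= max_wait_s:
                raise
            sleep(delay)
            waited += delay
            delay = min(delay * 2, max_delay_s)
```

Capping the delay keeps the client from waiting long after the new primary is already accepting connections.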
To see exactly how your applications are affected by failover, manually initiate failover.
Maintenance events affect primary instances configured with HA in the same way as any other instance. You can expect primary instances to be down during this time. To minimize impact to your service, you can set a maintenance window to control when downtime occurs.
When maintenance occurs on an instance, it does not fail over to the standby instance. Maintenance updates are applied to the standby instance at the same time as the primary instance.
Regional persistent disk performance depends on many factors, in particular the VM instance type and size and your workload's input and output patterns. Latency for regional persistent disk with solid-state drives (SSD) is also higher than for persistent disk with local SSD, because the redundancy needed to write two copies increases the tail latency. As a result, a workload that is latency sensitive and is not a streaming workload can't reach the input/output operations per second (IOPS) limit of regional persistent disk with SSD.
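One way to see why higher latency caps IOPS is Little's law: with a fixed number of I/O operations in flight, throughput equals concurrency divided by per-operation latency. The sketch below is an illustration under that assumption; the latency and queue-depth values are hypothetical inputs, not measured Cloud SQL or persistent disk figures.

```python
def achievable_iops(outstanding_ios: int, latency_ms: float) -> float:
    """Little's law: IOPS = I/Os in flight / latency per I/O (in seconds).

    Computed as outstanding_ios * 1000 / latency_ms to keep the units in milliseconds.
    """
    return outstanding_ios * 1000.0 / latency_ms


# A latency-sensitive workload with one I/O in flight (hypothetical latencies):
single_1ms = achievable_iops(1, 1.0)   # 1000.0 IOPS
single_2ms = achievable_iops(1, 2.0)   # 500.0 IOPS: doubling latency halves IOPS
# A workload that keeps many I/Os in flight can offset the extra latency:
deep_2ms = achievable_iops(32, 2.0)    # 16000.0 IOPS
```

Workloads that keep many operations outstanding can still approach the IOPS limit, while workloads that wait on each I/O before issuing the next are bounded by latency.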
Legacy MySQL high availability option
Until Q1 2021, you have the option of using the legacy process for adding high availability to MySQL instances, which uses a failover replica. The legacy functionality is not available in the Cloud Console. Instead, use gcloud or cURL commands. See Legacy configuration: Creating a new instance configured for high availability or Legacy configuration: Configuring an existing instance for high availability.
What's next
- Enabling and disabling high availability on an instance.
- Initiate failover.
- Learn more about managing your database connections.
- Learn more about regions and zones in Cloud SQL.