This page describes how to troubleshoot and fix replication lag for Cloud SQL read replicas.
Overview
Cloud SQL read replicas use SQL Server Always-On Availability Groups for replication. Changes are written to the transaction log in the primary instance. The primary instance forwards transactions to any secondary replica instances, where the changes are applied. The availability mode used is Asynchronous-commit mode.Replication lag can happen in a few scenarios, such as:
- The primary instance can't send the changes fast enough to the replica.
- The replica can't receive the changes quickly enough.
- The replica can't apply the changes quickly enough.
network_lag
metric
. The third reason is observed via the seconds_behind_master
metric. A
high seconds_behind_master
metric means that the replica can't apply
replication changes fast enough. These metrics are described in the
Monitor replication lag section below.
Optimize queries and schema
This section suggests some common query and schema optimizations you can make to improve replication performance.
Long-running queries in the read replica
Long-running queries in the replica might block replication for Cloud SQL. You might want to have separate replicas for online transaction processing (OLTP) and online analytical processing (OLAP) purposes and only send long-running queries to the OLAP replica.
Exclusive locks due to DDL
Data definition language (DDL) commands, such as ALTER TABLE
and
CREATE INDEX
, can cause replication lag in the replica due to
exclusive locks. To avoid lock contention, consider scheduling DDL execution
during times when the query load is lower on the replicas.
CREATE INDEX
, ALTER INDEX
, and INDEX
MAINTENANCE
can cause replication lag due to the large number of transaction
log records that these statements can generate.
Overloaded replica
If a read replica is receiving too many queries, replication could be blocked. Consider splitting the reads among multiple replicas to reduce the load on each one.
To avoid query spikes, consider throttling replica read queries in your application logic or in a proxy layer if you use one.
If there are spikes of activity on the primary instance, consider spreading out updates.
Monolithic primary database
Consider sharding the primary database vertically (or horizontally) to prevent one or more lagging tables from holding back all the other tables.
Monitor replication lag
You can use the replica_lag
and network_lag
metrics to monitor replication
lag and identify whether the cause of the lag is in the primary database,
the network, or the replica.
Metric | Description |
---|---|
Replication lag ( cloudsql.googleapis.com/database/replication/seconds_behind_master ) |
The number of seconds that the replica's state is lagging behind the state of the primary instance. This is the difference between the timestamp of the last redone log record on the secondary and the timestamp of the last sent log record on the primary. |
Network lag ( cloudsql.googleapis.com ) |
The difference between the timestamp of the last received log entry on the replica and the last sent log entry on the primary. |
Verify replication
To verify that replication is working, check the value of thecloudsql.googleapis.com/database/replication/state
metric on the primary
instance. If the state is Running
, then replication is healthy.