Replication lag

MySQL | PostgreSQL | SQL Server

This page describes how to troubleshoot and fix replication lag for Cloud SQL read replicas.

Overview

Cloud SQL read replicas use SQL Server Always-On Availability Groups for replication. Changes are written to the transaction log in the primary instance. The primary instance forwards transactions to any secondary replica instances, where the changes are applied. The availability mode used is Asynchronous-commit mode.

Replication lag can happen in a few scenarios, such as:

The primary instance can't send the changes fast enough to the replica.
The replica can't receive the changes quickly enough.
The replica can't apply the changes quickly enough.

The first two reasons above can be monitored with the

network_lag
metric

. The third reason is observed via the seconds_behind_master metric. A high seconds_behind_master metric means that the replica can't apply replication changes fast enough. These metrics are described in the Monitor replication lag section below.

Ensure replica is adequately provisioned

A replica instance that is smaller than the primary instance (for example, with less vCPUs and memory) can experience replication lag. A smaller replica might also have different default configuration flags compared to a larger primary instance. We recommend that the replica instance is at least as large as the primary instance to have enough resources to handle the replication load.

High CPU utilization on the replica can also cause replication lag. If the replica's CPU utilization is high (for example, over 90%), consider increasing the replica's CPU capacity.

Optimize queries and schema

This section suggests some common query and schema optimizations you can make to improve replication performance.

Long-running queries in the read replica

Long-running queries in the replica might block replication for Cloud SQL. You might want to have separate replicas for online transaction processing (OLTP) and online analytical processing (OLAP) purposes and only send long-running queries to the OLAP replica.

Exclusive locks due to DDL

Data definition language (DDL) commands, such as ALTER TABLE and CREATE INDEX, can cause replication lag in the replica due to exclusive locks. To avoid lock contention, consider scheduling DDL execution during times when the query load is lower on the replicas.

Additionally, DDL statements such as CREATE INDEX, ALTER INDEX, and

INDEX
MAINTENANCE

can cause replication lag due to the large number of transaction log records that these statements can generate.

Overloaded replica

If a read replica is receiving too many queries, replication could be blocked. Consider splitting the reads among multiple replicas to reduce the load on each one.

To avoid query spikes, consider throttling replica read queries in your application logic or in a proxy layer if you use one.

If there are spikes of activity on the primary instance, consider spreading out updates.

Monolithic primary database

Consider sharding the primary database vertically (or horizontally) to prevent one or more lagging tables from holding back all the other tables.

Monitor replication lag

You can use the replica_lag and network_lag metrics to monitor replication lag and identify whether the cause of the lag is in the primary database, the network, or the replica.

Metric	Description
Replication lag (`cloudsql.googleapis.com/database/replication/seconds_behind_master`)	The number of seconds that the replica's state is lagging behind the state of the primary instance. This is the difference between the timestamp of the last redone log record on the secondary and the timestamp of the last sent log record on the primary.
Network lag (`cloudsql.googleapis.com/database/replication/network_lag`)	The difference between the timestamp of the last received log entry on the replica and the last sent log entry on the primary.

Verify replication

To verify that replication is working, check the value of the cloudsql.googleapis.com/database/replication/state metric on the primary instance. If the state is Running, then replication is healthy.

What's next:

Promote replicas for regional migration or disaster recovery