This page describes how to detect and debug hotspots in your database. You can access statistics about hotspots in splits with both GoogleSQL and PostgreSQL.
Spanner stores your data as a contiguous key space, ordered by the primary keys of your tables and indexes. A split is a range of rows from a set of tables or an index. The split's start is called the split start. The split limit sets the end of the split. The split includes the split start, but not the split limit.
In Spanner, hotspots are situations where too many requests are sent to the same server, saturating the server's resources and potentially causing high latencies. Splits affected by hotspots are known as hot or warm splits.
A split's hotspot statistic (identified in the system as CPU_USAGE_SCORE) is a measurement of the load on a split that's constrained by the resources available on the server. This measurement is given as a percentage. If more than 50% of the load on a split is constrained by the available resources, the split is considered warm. If 100% of the load on a split is constrained, the split is considered hot.
Spanner uses load-based splitting to evenly distribute the data load across the instance's servers. The warm and hot splits can be moved across servers for load balancing or can be broken into smaller splits. However, Spanner might not be able to balance the load, even after multiple attempts at splitting, due to anti-patterns in the application. Hence, persistent hotspots that last at least 10 minutes might need further troubleshooting and potential application changes.
The Spanner hot split statistics help you identify the splits where hotspots occur. You can then make changes to your application or schema as needed. You can retrieve these statistics from the SPANNER_SYS.SPLIT_STATS_TOP_MINUTE system table using SQL statements.
Availability of hot split statistics
Spanner provides the hot split statistics in the SPANNER_SYS schema. SPANNER_SYS data is available only through GoogleSQL and PostgreSQL interfaces. You can use the following ways to access this data:

- A database's Spanner Studio page in the Google Cloud console
- The gcloud spanner databases execute-sql command
- The executeQuery API

Spanner single read APIs don't support SPANNER_SYS.
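For example, a minimal gcloud CLI invocation might look like the following; example-db and test-instance are placeholder names:

```
gcloud spanner databases execute-sql example-db \
  --instance=test-instance \
  --sql='SELECT * FROM SPANNER_SYS.SPLIT_STATS_TOP_MINUTE'
```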
Hot split statistics
You use the following table to track hot splits:

- SPANNER_SYS.SPLIT_STATS_TOP_MINUTE: shows splits that are hot during 1-minute intervals.
This table has the following properties:

- The table contains data for non-overlapping time intervals of the duration the table name specifies. Intervals are based on clock times: 1-minute intervals end on the minute. After each interval, Spanner collects data from all servers and then makes the data available in the SPANNER_SYS table shortly thereafter. For example, at 11:59:30 AM, the most recent interval available to SQL queries is 11:58:00-11:58:59 AM.
- Spanner groups the statistics by splits. For each split that Spanner captures statistics for during the specified interval, the table contains a row with a percentage that indicates how hot or warm the split is.
- If less than 50% of the load on a split is constrained by the available resources, Spanner doesn't capture the statistic. If Spanner is unable to store all the hot splits during an interval, the system prioritizes the splits with the highest CPU_USAGE_SCORE percentage during that interval. If the query returns no splits, there were no hotspots.
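As a quick check, the following GoogleSQL sketch counts the splits captured in the most recent interval; a count of 0 means no warm or hot splits were recorded:

```sql
-- Count the warm or hot splits captured in the most recent 1-minute interval.
SELECT COUNT(*) AS hot_split_count
FROM SPANNER_SYS.SPLIT_STATS_TOP_MINUTE
WHERE interval_end = (SELECT MAX(interval_end)
                      FROM SPANNER_SYS.SPLIT_STATS_TOP_MINUTE);
```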
Table schema
The following table shows the table schema for SPANNER_SYS.SPLIT_STATS_TOP_MINUTE:

| Column name | Type | Description |
|---|---|---|
| INTERVAL_END | TIMESTAMP | End of the time interval during which the split was hot |
| SPLIT_START | STRING | The starting key of the range of rows in the split. The split start might also be <begin>, indicating the beginning of the key space |
| SPLIT_LIMIT | STRING | The limit key for the range of rows in the split. The limit key might also be <end>, indicating the end of the key space |
| CPU_USAGE_SCORE | INT64 | The CPU_USAGE_SCORE percentage of the split. A percentage of 50% or higher indicates the presence of warm or hot splits |
| AFFECTED_TABLES | STRING ARRAY | The tables whose rows might be in the split |
What are split start and split limit keys?
A split is a contiguous row range of a database, and is defined by its start and limit keys. A split can be a single row, a narrow row range, or a wide row range, and the split can include multiple tables or indexes.
The SPLIT_START and SPLIT_LIMIT columns identify the primary keys of a warm or hot split.
Example schema
The following schema is an example table for the topics in this page.
GoogleSQL
CREATE TABLE Users (
UserId INT64 NOT NULL,
FirstName STRING(MAX),
LastName STRING(MAX),
) PRIMARY KEY(UserId);
CREATE INDEX UsersByFirstName ON Users(FirstName DESC);
CREATE TABLE Threads (
UserId INT64 NOT NULL,
ThreadId INT64 NOT NULL,
Starred BOOL,
) PRIMARY KEY(UserId, ThreadId),
INTERLEAVE IN PARENT Users ON DELETE CASCADE;
CREATE TABLE Messages (
UserId INT64 NOT NULL,
ThreadId INT64 NOT NULL,
MessageId INT64 NOT NULL,
Subject STRING(MAX),
Body STRING(MAX),
) PRIMARY KEY(UserId, ThreadId, MessageId),
INTERLEAVE IN PARENT Threads ON DELETE CASCADE;
CREATE INDEX MessagesIdx ON Messages(UserId, ThreadId, Subject),
INTERLEAVE IN Threads;
PostgreSQL
CREATE TABLE users
(
  userid BIGINT NOT NULL PRIMARY KEY,  -- INT64 to BIGINT
  firstname VARCHAR,                   -- STRING(MAX) to VARCHAR
  lastname VARCHAR
);

CREATE INDEX usersbyfirstname ON users(firstname DESC);

CREATE TABLE threads
(
  userid BIGINT NOT NULL,
  threadid BIGINT NOT NULL,
  starred BOOLEAN,  -- BOOL to BOOLEAN
  PRIMARY KEY (userid, threadid),
  CONSTRAINT fk_threads_user FOREIGN KEY (userid)
    REFERENCES users(userid) ON DELETE CASCADE  -- Interleave to foreign key constraint
);

CREATE TABLE messages
(
  userid BIGINT NOT NULL,
  threadid BIGINT NOT NULL,
  messageid BIGINT NOT NULL,
  subject VARCHAR,
  body VARCHAR,
  PRIMARY KEY (userid, threadid, messageid),
  CONSTRAINT fk_messages_thread FOREIGN KEY (userid, threadid)
    REFERENCES threads(userid, threadid) ON DELETE CASCADE  -- Interleave to foreign key constraint
);

CREATE INDEX messagesidx ON messages(userid, threadid, subject);
Imagine your key space looks like this:

| PRIMARY KEY |
|---|
| <begin> |
| Users() |
| Threads() |
| Users(2) |
| Users(3) |
| Threads(3) |
| Threads(3,"a") |
| Messages(3,"a",1) |
| Messages(3,"a",2) |
| Threads(3, "aa") |
| Users(9) |
| Users(10) |
| Threads(10) |
| UsersByFirstName("abc") |
| UsersByFirstName("abcd") |
| <end> |
Example of splits
The following shows some example splits to help you understand what splits look like.
The SPLIT_START and SPLIT_LIMIT might indicate the row of a table or index, or they can be <begin> and <end>, representing the boundaries of the key space of the database. The SPLIT_START and SPLIT_LIMIT might also contain truncated keys, which are keys preceding any full key in the table. For example, Threads(10) is a prefix for any Threads row interleaved in Users(10).
| SPLIT_START | SPLIT_LIMIT | AFFECTED_TABLES | EXPLANATION |
|---|---|---|---|
| Users(3) | Users(10) | UsersByFirstName, Users, Threads, Messages, MessagesIdx | The split starts at the row with UserId=3 and ends at the row before the row with UserId=10. The split contains the Users table rows and all its interleaved table rows for UserId=3 to 10. |
| Messages(3,"a",1) | Threads(3,"aa") | Threads, Messages, MessagesIdx | The split starts at the row with UserId=3, ThreadId="a", and MessageId=1 and ends at the row preceding the row with the key of UserId=3 and ThreadId="aa". The split contains all the table rows between Messages(3,"a",1) and Threads(3,"aa"). Because the split start and split limit are interleaved in the same top-level table row, the split contains the interleaved table rows between the start and limit. See the schema overview to understand how interleaved tables are colocated. |
| Messages(3,"a",1) | <end> | UsersByFirstName, Users, Threads, Messages, MessagesIdx | The split starts in the Messages table at the row with key UserId=3, ThreadId="a", and MessageId=1. The split hosts all the rows from the split start to <end>, the end of the key space of the database. All the rows of the tables following the split start, such as Users(4), are included in the split. |
| <begin> | Users(9) | UsersByFirstName, Users, Threads, Messages, MessagesIdx | The split starts at <begin>, the beginning of the key space of the database, and ends at the row preceding the Users row with UserId=9. So the split has all the table rows preceding Users, all the rows of the Users table preceding UserId=9, and the rows of its interleaved tables. |
| Messages(3,"a",1) | Threads(10) | UsersByFirstName, Users, Threads, Messages, MessagesIdx | The split starts at Messages(3,"a",1), interleaved in Users(3), and ends at the row preceding Threads(10). Threads(10) is a truncated split key that is a prefix of any key of the Threads table interleaved in Users(10). |
| Users() | <end> | UsersByFirstName, Users, Threads, Messages, MessagesIdx | The split starts at the truncated split key Users(), which precedes any full key of the Users table. The split extends until the end of the possible key space in the database. The AFFECTED_TABLES hence cover the Users table, its interleaved tables and indexes, and all the tables that might appear after Users. |
| Threads(10) | UsersByFirstName("abc") | UsersByFirstName, Users, Threads, Messages, MessagesIdx | The split starts at the Threads row with UserId=10 and ends at the UsersByFirstName index key preceding "abc". |
Example queries to find hot splits
The following example shows a SQL statement that you can use to retrieve the hot split statistics. You can run these SQL statements using the client libraries, gcloud, or the Google Cloud console.
GoogleSQL
SELECT t.split_start,
t.split_limit,
t.cpu_usage_score,
t.affected_tables
FROM SPANNER_SYS.SPLIT_STATS_TOP_MINUTE t
WHERE t.interval_end =
(SELECT MAX(interval_end)
FROM SPANNER_SYS.SPLIT_STATS_TOP_MINUTE)
ORDER BY t.cpu_usage_score DESC;
PostgreSQL
SELECT t.split_start,
t.split_limit,
t.cpu_usage_score,
t.affected_tables
FROM SPANNER_SYS.SPLIT_STATS_TOP_MINUTE t
WHERE t.interval_end = (
SELECT MAX(interval_end)
FROM SPANNER_SYS.SPLIT_STATS_TOP_MINUTE
)
ORDER BY t.cpu_usage_score DESC;
The query output looks like the following:

| SPLIT_START | SPLIT_LIMIT | CPU_USAGE_SCORE | AFFECTED_TABLES |
|---|---|---|---|
| Users(13) | Users(76) | 82 | Messages,Users,Threads |
| Users(101) | Users(102) | 90 | Messages,Users,Threads |
| Threads(10, "a") | Threads(10, "aa") | 100 | Messages,Threads |
| Messages(631, "abc", 1) | Messages(631, "abc", 3) | 100 | Messages |
| Threads(12, "zebra") | Users(14) | 76 | Messages,Users,Threads |
| Users(620) | <end> | 100 | Messages,Users,Threads |
Data retention for the hot split statistics
At a minimum, Spanner keeps data for each table for the following time period:

- SPANNER_SYS.SPLIT_STATS_TOP_MINUTE: intervals that cover the previous 6 hours.
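To check how far back statistics are currently available in your database, you can query the boundaries of the retained intervals, as in this GoogleSQL sketch:

```sql
-- Find the oldest and newest 1-minute intervals retained in the table.
SELECT MIN(interval_end) AS oldest_interval,
       MAX(interval_end) AS newest_interval
FROM SPANNER_SYS.SPLIT_STATS_TOP_MINUTE;
```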
Troubleshoot hotspots using hot split statistics
This section describes how to detect and troubleshoot hotspots.
Select a time period to investigate
Check the latency metrics for your Spanner database to find the time period when your application experienced high latency and CPU usage. For example, it might show you that an issue started around 10:50 PM on May 18, 2024.
Find persistent hotspotting
As Spanner balances your load with load-based splitting, we recommend that you investigate whether hotspotting has continued for more than 10 minutes. You can do so by querying the SPANNER_SYS.SPLIT_STATS_TOP_MINUTE table, as shown in the following example:

GoogleSQL

SELECT COUNT(DISTINCT t.interval_end)
FROM SPANNER_SYS.SPLIT_STATS_TOP_MINUTE t
WHERE t.cpu_usage_score >= 50
  AND t.interval_end >= "start_date_time"
  AND t.interval_end <= "end_date_time";

Replace start_date_time and end_date_time with the start and end of the 10-minute range to investigate, using the format 2024-05-18T17:40:00Z.
PostgreSQL
SELECT COUNT(DISTINCT t.interval_end)
FROM SPANNER_SYS.SPLIT_STATS_TOP_MINUTE t
WHERE t.cpu_usage_score >= 50
  AND t.interval_end >= 'start_date_time'::timestamptz
  AND t.interval_end <= 'end_date_time'::timestamptz;

Replace start_date_time and end_date_time with the start and end of the 10-minute range to investigate, using the format 2024-05-18T17:40:00Z.

If the query returns 10 for a 10-minute range, your database experienced hotspotting in every minute of that range and might need further debugging.
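If you want to check the most recent 10 minutes instead of a fixed range, a GoogleSQL sketch using a relative time window might look like the following:

```sql
-- Count how many of the last 10 one-minute intervals contained a warm or
-- hot split (cpu_usage_score >= 50).
SELECT COUNT(DISTINCT t.interval_end) AS hot_minutes
FROM SPANNER_SYS.SPLIT_STATS_TOP_MINUTE t
WHERE t.cpu_usage_score >= 50
  AND t.interval_end > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 MINUTE);
```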
Find the splits with the highest CPU_USAGE_SCORE level
For this example, we run the following SQL to find the row ranges with the highest CPU_USAGE_SCORE level:
GoogleSQL
SELECT t.split_start,
t.split_limit,
t.affected_tables,
t.cpu_usage_score
FROM SPANNER_SYS.SPLIT_STATS_TOP_MINUTE t
WHERE t.cpu_usage_score >= 50
AND t.interval_end = "interval_end_date_time";
Replace interval_end_date_time with the date and time for the
interval, using the format 2024-05-18T17:40:00Z
.
PostgreSQL
SELECT t.split_start,
t.split_limit,
t.affected_tables,
t.cpu_usage_score
FROM SPANNER_SYS.SPLIT_STATS_TOP_MINUTE t
WHERE t.cpu_usage_score >= 50
AND t.interval_end = 'interval_end_date_time'::timestamptz;
Replace interval_end_date_time with the date and time for the
interval, using the format 2024-05-18T17:40:00Z
.
The previous SQL outputs the following:
| SPLIT_START | SPLIT_LIMIT | CPU_USAGE_SCORE | AFFECTED_TABLES |
|---|---|---|---|
| Users(180) | <end> | 85 | Messages,Users,Threads |
| Users(24) | Users(76) | 76 | Messages,Users,Threads |
From this table of results, we can see that hotspots occurred on two splits. Spanner load-based splitting might try to resolve hotspots on these splits. However, it might not be able to do so if there are problematic patterns in the schema or workload. To detect if there are splits that need your intervention, we recommend tracking the splits for at least 10 minutes. For example, the following SQL tracks the first split over the last ten minutes.
GoogleSQL
SELECT t.interval_end,
t.split_start,
t.split_limit,
t.cpu_usage_score
FROM SPANNER_SYS.SPLIT_STATS_TOP_MINUTE t
WHERE t.split_start = "users(180)"
AND t.split_limit = "<end>"
AND t.interval_end >= "interval_end_date_time"
AND t.interval_end <= "interval_end_date_time";
Replace interval_end_date_time with the date and time for the
interval, using the format 2024-05-18T17:40:00Z
.
PostgreSQL
SELECT t.interval_end,
t.split_start,
t.split_limit,
t.cpu_usage_score
FROM SPANNER_SYS.SPLIT_STATS_TOP_MINUTE t
WHERE t.split_start = 'Users(180)'
AND t.split_limit = '<end>'
AND t.interval_end >= 'start_date_time'::timestamptz
AND t.interval_end <= 'end_date_time'::timestamptz;

Replace start_date_time and end_date_time with the start and end of the 10-minute range to track, using the format 2024-05-18T17:40:00Z.
The previous SQL outputs the following:
| INTERVAL_END | SPLIT_START | SPLIT_LIMIT | CPU_USAGE_SCORE |
|---|---|---|---|
| 2024-05-18T17:46:00Z | Users(180) | <end> | 85 |
| 2024-05-18T17:47:00Z | Users(180) | <end> | 85 |
| 2024-05-18T17:48:00Z | Users(180) | <end> | 85 |
| 2024-05-18T17:49:00Z | Users(180) | <end> | 85 |
| 2024-05-18T17:50:00Z | Users(180) | <end> | 85 |
The split seems to have been hot for the past few minutes. You might observe the split for longer to determine whether Spanner load-based splitting mitigates the hotspot. There might be cases where Spanner can't balance the load any further.

To observe behavior over time, query the SPANNER_SYS.SPLIT_STATS_TOP_MINUTE table across a time range, as shown next, and compare the results against the example scenarios that follow.
GoogleSQL
SELECT t.interval_end,
t.split_start,
t.split_limit,
t.cpu_usage_score
FROM SPANNER_SYS.SPLIT_STATS_TOP_MINUTE t
WHERE t.interval_end >= "interval_end_date_time"
AND t.interval_end <= "interval_end_date_time";
Replace interval_end_date_time with the date and time for the
interval, using the format 2024-05-18T17:40:00Z
.
PostgreSQL
SELECT t.interval_end,
t.split_start,
t.split_limit,
t.cpu_usage_score
FROM SPANNER_SYS.SPLIT_STATS_TOP_MINUTE t
WHERE t.interval_end >= 'start_date_time'::timestamptz
AND t.interval_end <= 'end_date_time'::timestamptz;

Replace start_date_time and end_date_time with the start and end of the time range to investigate, using the format 2024-05-18T17:40:00Z.
Single hot row
In the following example, it looks like Threads(10,"spanner") is in a single-row split that remained hot for over 10 minutes. This can happen when there's persistent load on a popular row.
| INTERVAL_END | SPLIT_START | SPLIT_LIMIT | CPU_USAGE_SCORE |
|---|---|---|---|
| 2024-05-16T20:40:00Z | Threads(10,"spanner") | Threads(10,"spanner1") | 62 |
| 2024-05-16T20:41:00Z | Threads(10,"spanner") | Threads(10,"spanner1") | 62 |
| 2024-05-16T20:42:00Z | Threads(10,"spanner") | Threads(10,"spanner1") | 62 |
| 2024-05-16T20:43:00Z | Threads(10,"spanner") | Threads(10,"spanner1") | 62 |
| 2024-05-16T20:44:00Z | Threads(10,"spanner") | Threads(10,"spanner1") | 62 |
| 2024-05-16T20:45:00Z | Threads(10,"spanner") | Threads(10,"spanner1") | 62 |
| 2024-05-16T20:46:00Z | Threads(10,"spanner") | Threads(10,"spanner1") | 80 |
| 2024-05-16T20:47:00Z | Threads(10,"spanner") | Threads(10,"spanner1") | 80 |
| 2024-05-16T20:48:00Z | Threads(10,"spanner") | Threads(10,"spanner1") | 80 |
| 2024-05-16T20:49:00Z | Threads(10,"spanner") | Threads(10,"spanner1") | 100 |
| 2024-05-16T20:50:00Z | Threads(10,"spanner") | Threads(10,"spanner1") | 100 |
Spanner can't balance the load for this single key as it can't be split further.
Moving hotspot
In the following example, the load moves through contiguous splits over time, hitting a new split in each time interval.
| INTERVAL_END | SPLIT_START | SPLIT_LIMIT | CPU_USAGE_SCORE |
|---|---|---|---|
| 2024-05-16T20:40:00Z | Threads(1,"a") | Threads(1,"aa") | 100 |
| 2024-05-16T20:41:00Z | Threads(1,"aa") | Threads(1,"ab") | 100 |
| 2024-05-16T20:42:00Z | Threads(1,"ab") | Threads(1,"c") | 100 |
| 2024-05-16T20:43:00Z | Threads(1,"c") | Threads(1,"ca") | 100 |
This could occur, for example, due to a workload that reads or writes keys in monotonically increasing order. Spanner can't balance the load to mitigate the effects of this application behavior.
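If you can change the workload, one common mitigation is to avoid monotonically increasing keys. As a sketch (the sequence, table, and column names here are illustrative, not part of this page's example schema), GoogleSQL supports bit-reversed sequences that spread new keys across the key space:

```sql
-- Illustrative only: a bit-reversed sequence generates key values whose bit
-- order is reversed, so consecutive inserts land in different splits instead
-- of always appending to the end of the key space.
CREATE SEQUENCE EventIdSequence OPTIONS (
  sequence_kind = 'bit_reversed_positive'
);

CREATE TABLE Events (
  EventId INT64 NOT NULL DEFAULT (GET_NEXT_SEQUENCE_VALUE(SEQUENCE EventIdSequence)),
  Payload STRING(MAX),
) PRIMARY KEY (EventId);
```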
Normal load balancing
Spanner tries to balance the load by adding more splits or moving splits around. The following example shows what that might look like.
| INTERVAL_END | SPLIT_START | SPLIT_LIMIT | CPU_USAGE_SCORE |
|---|---|---|---|
| 2024-05-16T20:40:00Z | Threads(1000,"zebra") | <end> | 82 |
| 2024-05-16T20:41:00Z | Threads(1000,"zebra") | <end> | 90 |
| 2024-05-16T20:42:00Z | Threads(1000,"zebra") | <end> | 100 |
| 2024-05-16T20:43:00Z | Threads(1000,"zebra") | Threads(2000,"spanner") | 100 |
| 2024-05-16T20:44:00Z | Threads(1200,"c") | Threads(2000) | 92 |
| 2024-05-16T20:45:00Z | Threads(1500,"c") | Threads(1700,"zach") | 76 |
| 2024-05-16T20:46:00Z | Threads(1700) | Threads(1700,"c") | 76 |
| 2024-05-16T20:47:00Z | Threads(1700) | Threads(1700,"c") | 50 |
| 2024-05-16T20:48:00Z | Threads(1700) | Threads(1700,"c") | 39 |
Here, the larger split was split further starting at 2024-05-16T20:43:00Z, and as a result the CPU_USAGE_SCORE statistic decreased. Spanner might not split all the way down to individual rows; the resulting splits mirror the workload causing the high CPU_USAGE_SCORE statistic.
If you have observed a persistent hot split for over 10 minutes, see Best practices to mitigate hotspots.
Best practices to mitigate hotspots
If load balancing doesn't decrease latency, the next step is to identify the cause of the hotspots. After that, you can either reduce the hotspotting workload or optimize the application schema and logic to avoid hotspots.
Identify the cause
Use Lock & Transaction Insights to look for transactions that have high lock wait time where the row range start key is within the hot split.
Use Query Insights to look for queries that read from the table that contains the hot split, and have recently increased latency, or a higher ratio of latency to CPU.
Use Oldest Active Queries to look for queries that read from the table that contains the hot split, and that have higher than expected latency.
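For the first point, if you prefer SQL over the console, you can query the lock statistics tables; a sketch using SPANNER_SYS.LOCK_STATS_TOP_10MINUTE, which reports lock wait time by row range, might look like this:

```sql
-- Find the row ranges with the highest lock wait times in the most recent
-- 10-minute interval, to compare against the hot split's start key.
SELECT s.row_range_start_key,
       s.lock_wait_seconds
FROM SPANNER_SYS.LOCK_STATS_TOP_10MINUTE s
WHERE s.interval_end = (SELECT MAX(interval_end)
                        FROM SPANNER_SYS.LOCK_STATS_TOP_10MINUTE)
ORDER BY s.lock_wait_seconds DESC;
```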
Some special cases to watch for:

- Check whether time to live (TTL) was enabled recently. If there are a lot of splits from old data, TTL can raise CPU_USAGE_SCORE levels during mass deletes. In this case, the issue should resolve itself once the initial deletions complete.
Optimize the workload
- Follow SQL best practices. Consider stale reads, writes that don't perform reads first, or adding indexes.
- Follow Schema best practices. Ensure your schema is designed to handle load balancing and avoid hotspots.
What's next
- Learn about schema design best practices.
- Learn about Key Visualizer.
- Look through examples of schema designs.
- Learn how to use the split hotspots insights dashboard.