This page describes how to use CPU utilization metrics and charts, along with other introspection tools, to investigate high CPU usage in your database.
Identifying whether a system or user task is causing high CPU utilization
The Cloud Console provides several monitoring tools for Cloud Spanner, allowing you to see the status of the most essential metrics for your instance. One of these charts is CPU utilization - high priority. The chart shows you CPU utilization of high-priority tasks against the recommended maximum CPU usage for both single-region and multi- region instances. There are two types of tasks; user tasks, such as reads and writes, and system tasks, such as performing backups and compaction.
To distinguish between task types on this chart:
Click the More chart options icon on the chart.
Select View in Metrics Explorer.
In the Metrics Explorer, add
is_systemto the Group By option. The chart now displays CPU usage broken down by system task and user task.
Figure 1 shows an example of the CPU utilization by high priority chart.
Figure 1. CPU utilization by high priority chart in the Monitoring dashboard in Cloud Console.
Now, imagine you receive an alert from Cloud Monitoring that CPU usage has
increased significantly. You open the Monitoring dashboard for your instance
in Cloud Console and examine the CPU Utilization - high priority
chart by opening it in the Metrics Explorer. As shown in Figure 1, you
can see the increase in high priority CPU utilization and, looking at the value
is_system, you determine that the spike is caused by user tasks. The next
step is to find out what user operation is causing this CPU usage increase.
Identifying what user operation is causing the CPU utilization spike
The CPU utilization - high priority chart in Figure 1 showed us that high priority user tasks are the cause of higher CPU usage.
Next, we'll examine the CPU Utilization by operation type metric by creating a chart using the Cloud Monitoring metrics explorer. This chart will show us CPU utilization broken down by high-priority, user-initiated, operations.
What is a user-initiated operation?
A user-initiated operation is an operation that is initiated through an API request. Cloud Spanner groups these requests into operation types or categories, and we can display each operation type as a line on the CPU utilization by operation type chart. The following table describes the API methods that are included in each operation type.
|Includes reads which fetch rows from the database using key lookups and scans.|
|Includes reads inside read-write transactions.|
|Includes read operations performed using a set of partition tokens.|
|Includes execute Select SQL statement.|
|Includes execute Select statement inside read-write transactions.|
|Includes execute Select statement performed using a set of partition tokens.|
|Includes execute DML SQL statement.|
|Includes execute Partitioned DML SQL statement.|
|Includes begin, commit, and rollback transactions.|
|Includes PartitionQuery, PartitionRead, Create Database, Create Instance, session related operations, internal time-critical serving operations, etc.|
Creating a chart for CPU utilization by operations type in Metrics Explorer
- In the Cloud Console, select Monitoring, or use the following button:
- Select Metrics Explorer in the navigation pane.
In the Find resource type and metric field, enter the value
spanner.googleapis.com/instance/cpu/utilization_by_operation_type, then select the row that appears below the box.
In the Filter field, enter the value
instance_id, then enter the instance ID you want to examine and click >Apply.
In the Group By field, select
categoryfrom the dropdown list. The chart will show CPU utilization of user tasks grouped by operation type, or category.
Here's an example chart of the CPU utilization by operation types metric.
Figure 2. CPU utilization by operation type chart in Metrics Explorer.
Each operation type, or category, is plotted on a line graph. The category filter below the chart identifies each graph. You can hide and show each graph by selecting or deselecting its respective category filter.
So, while the CPU utilization by priority metric in the preceding section helped determine whether a user or system task caused an increase in CPU resource usage, with the CPU utilization by operation type metric we can dig deeper and find out the type of user-initiated operation behind this rise in CPU usage.
Identifying which user request is contributing to increased CPU usage
To determine what specific user request is responsible for the spike in CPU utilization in the executesql_select_readonly operation type graph we saw in Figure 2, we'll use the built-in introspection statistics tables to gain more insight.
Use the following table as a guide to determine which statistics table to query based on the operation type that is causing high CPU usage.
For example, if read_withpartitiontoken is the problem, troubleshoot using read statistics.
In our scenario, the executsql_select_readonly operation seems to be the reason for the CPU usage increase we are observing. Based on the preceding table, we should look at query statistics next to find out what queries are expensive, run frequently or scan a lot of data.
To find out the queries with the highest CPU usage in the previous hour, we run
the following query on the
query_stats_top_hour statistics table.
SELECT text, execution_count AS count, avg_latency_seconds AS latency, avg_cpu_seconds AS cpu, execution_count * avg_cpu_seconds AS total_cpu FROM spanner_sys.query_stats_top_hour WHERE interval_end = (SELECT MAX(interval_end) FROM spanner_sys.query_stats_top_hour) ORDER BY total_cpu DESC;
The output will show queries sorted by CPU usage. Once we identify the query with highest CPU usage, we can try the following options to tune it.
Review the query execution plan to identify any possible inefficiencies that might contribute to high CPU utilization.
Review your query to make sure it is following SQL best practices for Cloud Spanner.
Review the database schema design and update the schema to allow for more efficient queries.
Establish a baseline for the number of times Cloud Spanner executes a query during an interval. Using this baseline, you'll be able to detect and investigate the cause of any unexpected deviations from normal behavior.
If you didn't manage to find a CPU-intensive query, add nodes to the instance. Adding nodes provides more CPU resources and enables Cloud Spanner to handle a larger workload. For more information, see Adding nodes.
Learn about CPU utilization metrics
Learn about other Introspection tools.
Learn about Monitoring with Cloud Monitoring.
Learn more about SQL best practices for Cloud Spanner.
See the list of Metrics from Cloud Spanner.