Investigating high CPU utilization

This page describes how to use CPU utilization metrics and charts, along with other introspection tools, to investigate high CPU usage in your database.

Identifying whether a system or user task is causing high CPU utilization

The Cloud Console provides several monitoring tools for Cloud Spanner, allowing you to see the status of the most essential metrics for your instance. One of these charts is CPU utilization - high priority. The chart shows you CPU utilization of high-priority tasks against the recommended maximum CPU usage for both single-region and multi- region instances. There are two types of tasks; user tasks, such as reads and writes, and system tasks, such as performing backups and compaction.

To distinguish between task types on this chart:

  1. Click the More chart options icon on the chart.

  2. Select View in Metrics Explorer.

  3. In the Metrics Explorer, add is_system to the Group By option. The chart now displays CPU usage broken down by system task and user task.

Figure 1 shows an example of the CPU utilization by high priority chart.

Example of a CPU utilization by high priority chart

Figure 1. CPU utilization by high priority chart in the Monitoring dashboard in Cloud Console.

Now, imagine you receive an alert from Cloud Monitoring that CPU usage has increased significantly. You open the Monitoring dashboard for your instance in Cloud Console and examine the CPU Utilization - high priority chart by opening it in the Metrics Explorer. As shown in Figure 1, you can see the increase in high priority CPU utilization and, looking at the value of is_system, you determine that the spike is caused by user tasks. The next step is to find out what user operation is causing this CPU usage increase.

Identifying what user operation is causing the CPU utilization spike

The CPU utilization - high priority chart in Figure 1 showed us that high priority user tasks are the cause of higher CPU usage.

Next, we'll examine the CPU Utilization by operation type metric by creating a chart using the Cloud Monitoring metrics explorer. This chart will show us CPU utilization broken down by high-priority, user-initiated, operations.

What is a user-initiated operation?

A user-initiated operation is an operation that is initiated through an API request. Cloud Spanner groups these requests into operation types or categories, and we can display each operation type as a line on the CPU utilization by operation type chart. The following table describes the API methods that are included in each operation type.

Operation API methods Description
read_readonly Read
StreamingRead
Includes reads which fetch rows from the database using key lookups and scans.
read_readwrite Read
StreamingRead
Includes reads inside read-write transactions.
read_withpartitiontoken Read
StreamingRead
Includes read operations performed using a set of partition tokens.
executesql_select_readonly ExecuteSql
ExecuteStreamingSql
Includes execute Select SQL statement.
executesql_select_readwrite ExecuteSql
ExecuteStreamingSql
Includes execute Select statement inside read-write transactions.
executesql_select_withpartitiontoken ExecuteSql
ExecuteStreamingSql
Includes execute Select statement performed using a set of partition tokens.
executesql_dml_readwrite ExecuteSql
ExecuteStreamingSql
ExecuteBatchDml
Includes execute DML SQL statement.
executesql_dml_partitioned ExecuteSql
ExecuteStreamingSql
ExecuteBatchDml
Includes execute Partitioned DML SQL statement.
beginorcommit BeginTransaction
Commit
Rollback
Includes begin, commit, and rollback transactions.
misc PartitionQuery
PartitionRead
GetSession
CreateSession
Includes PartitionQuery, PartitionRead, Create Database, Create Instance, session related operations, internal time-critical serving operations, etc.

Creating a chart for CPU utilization by operations type in Metrics Explorer

  1. In the Cloud Console, select Monitoring, or use the following button:

    Go to Monitoring

  2. Select Metrics Explorer in the navigation pane.
  3. In the Find resource type and metric field, enter the value spanner.googleapis.com/instance/cpu/utilization_by_operation_type , then select the row that appears below the box.
  4. In the Filter field, enter the value instance_id, then enter the instance ID you want to examine and click >Apply.
  5. In the Group By field, select category from the dropdown list. The chart will show CPU utilization of user tasks grouped by operation type, or category.

Here's an example chart of the CPU utilization by operation types metric.

Example of a CPU utilization by operation types chart

Figure 2. CPU utilization by operation type chart in Metrics Explorer.

Each operation type, or category, is plotted on a line graph. The category filter below the chart identifies each graph. You can hide and show each graph by selecting or deselecting its respective category filter.

So, while the CPU utilization by priority metric in the preceding section helped determine whether a user or system task caused an increase in CPU resource usage, with the CPU utilization by operation type metric we can dig deeper and find out the type of user-initiated operation behind this rise in CPU usage.

Identifying which user request is contributing to increased CPU usage

To determine what specific user request is responsible for the spike in CPU utilization in the executesql_select_readonly operation type graph we saw in Figure 2, we'll use the built-in introspection statistics tables to gain more insight.

Use the following table as a guide to determine which statistics table to query based on the operation type that is causing high CPU usage.

Operation type Query Read Transaction
read_readonly No Yes No
read_readwrite No Yes Yes
read_withpartitiontoken No Yes No
executesql_select_readonly Yes No No
executesql_select_withpartitiontoken Yes No No
executesql_select_readwrite Yes No Yes
executesql_dml_readwrite No No Yes
executesql_dml_partitioned No No Yes
beginorcommit No No Yes

For example, if read_withpartitiontoken is the problem, troubleshoot using read statistics.

In our scenario, the executsql_select_readonly operation seems to be the reason for the CPU usage increase we are observing. Based on the preceding table, we should look at query statistics next to find out what queries are expensive, run frequently or scan a lot of data.

To find out the queries with the highest CPU usage in the previous hour, we run the following query on the query_stats_top_hour statistics table.

SELECT text,
       execution_count AS count,
       avg_latency_seconds AS latency,
       avg_cpu_seconds AS cpu,
       execution_count * avg_cpu_seconds AS total_cpu
FROM spanner_sys.query_stats_top_hour
WHERE interval_end =
  (SELECT MAX(interval_end)
   FROM spanner_sys.query_stats_top_hour)
ORDER BY total_cpu DESC;

The output will show queries sorted by CPU usage. Once we identify the query with highest CPU usage, we can try the following options to tune it.

  • Review the query execution plan to identify any possible inefficiencies that might contribute to high CPU utilization.

  • Review your query to make sure it is following SQL best practices for Cloud Spanner.

  • Review the database schema design and update the schema to allow for more efficient queries.

  • Establish a baseline for the number of times Cloud Spanner executes a query during an interval. Using this baseline, you'll be able to detect and investigate the cause of any unexpected deviations from normal behavior.

If you didn't manage to find a CPU-intensive query, add compute capacity to the instance. Adding compute capacity provides more CPU resources and enables Cloud Spanner to handle a larger workload. For more information, see Increasing compute capacity.

What's next