Query statistics

Cloud Spanner provides built-in tables that keep many statistics for the queries that used the most CPU, and all queries in aggregate. The Google Cloud Console uses the data in these SPANNER_SYS.QUERY_STATS* tables to generate its Query statistics view, but you can also retrieve statistics from the tables using SQL statements.

Availability

SPANNER_SYS data is available only through SQL interfaces (for example, executeQuery and gcloud spanner databases execute-sql); other single read methods Cloud Spanner provides do not support SPANNER_SYS.

CPU usage grouped by query

The following tables track the queries with the highest CPU usage during a specific time period:

  • SPANNER_SYS.QUERY_STATS_TOP_MINUTE: Queries during 1 minute intervals
  • SPANNER_SYS.QUERY_STATS_TOP_10MINUTE: Queries during 10 minute intervals
  • SPANNER_SYS.QUERY_STATS_TOP_HOUR: Queries during 1 hour intervals

These tables have the following properties:

  • Each table contains data for non-overlapping time intervals of the length the table name specifies.

  • Intervals are based on clock times. 1 minute intervals end on the minute, 10 minute intervals end every 10 minutes starting on the hour, and 1 hour intervals end on the hour.

    For example, at 11:59:30 AM, the most recent intervals available to SQL queries are:

    • 1 minute: 11:58:00–11:58:59 AM
    • 10 minute: 11:40:00–11:49:59 AM
    • 1 hour: 10:00:00–10:59:59 AM
  • Cloud Spanner groups the statistics by the text of the SQL query. If a query uses query parameters, Cloud Spanner groups all executions of that query into one row. If the query uses string literals, Cloud Spanner only groups the statistics if the full query text is identical; when any text differs, each query appears as a separate row.

  • Each row contains statistics for all executions of a particular SQL query that Cloud Spanner captures statistics for during the specified interval.

  • If Cloud Spanner is unable to store all queries run during the interval, the system prioritizes queries with the highest CPU usage during the specified interval.

  • Tracked queries include those that completed, failed, or were canceled by the user.

  • A subset of statistics is specific to queries that ran but did not complete:

    • Execution count and mean latency in seconds across all the queries that did not succeed.

    • Execution count for queries that timed out.

    • Execution count for queries that were canceled by the user or failed due to network connectivity issues.

Table schema

Column name Type Description
INTERVAL_END TIMESTAMP End of the time interval that the included query executions occurred in.
TEXT STRING SQL query text, truncated to approximately 64KB.
TEXT_TRUNCATED BOOL Whether or not the query text was truncated.
TEXT_FINGERPRINT INT64 Hash of the query text.
EXECUTION_COUNT INT64 Number of times Cloud Spanner saw the query during the interval.
AVG_LATENCY_SECONDS FLOAT64 Average length of time, in seconds, for each query execution within the database. This average excludes the encoding and transmission time for the result set as well as overhead.
AVG_ROWS FLOAT64 Average number of rows that the query returned.
AVG_BYTES FLOAT64 Average number of data bytes that the query returned, excluding transmission encoding overhead.
AVG_ROWS_SCANNED FLOAT64 Average number of rows that the query scanned, excluding deleted values.
AVG_CPU_SECONDS FLOAT64 Average number of seconds of CPU time Cloud Spanner spent on all operations to execute the query.
ALL_FAILED_EXECUTION_COUNT INT64 Number of times the query failed during the interval.
ALL_FAILED_AVG_LATENCY_SECONDS FLOAT64 Average length of time, in seconds, for each query execution that failed within the database. This average excludes the encoding and transmission time for the result set as well as overhead.
CANCELLED_OR_DISCONNECTED_EXECUTION_COUNT INT64 Number of times the query was canceled by the user or failed due to broken network connection during the interval.
TIMED_OUT_EXECUTION_COUNT INT64 Number of times the query timed out during the interval.

EXECUTION_COUNT and AVG_LATENCY_SECONDS for failed queries include queries that failed due to incorrect syntax or encountered a transient error but succeeded on retry.

Aggregate statistics

There are also tables that track aggregate data for all the queries for which Cloud Spanner captured statistics in a specific time period:

  • SPANNER_SYS.QUERY_STATS_TOTAL_MINUTE: Queries during 1 minute intervals
  • SPANNER_SYS.QUERY_STATS_TOTAL_10MINUTE: Queries during 10 minute intervals
  • SPANNER_SYS.QUERY_STATS_TOTAL_HOUR: Queries during 1 hour intervals

These tables have the following properties:

  • Each table contains data for non-overlapping time intervals of the length the table name specifies.

  • Intervals are based on clock times. 1 minute intervals end on the minute, 10 minute intervals end every 10 minutes starting on the hour, and 1 hour intervals end on the hour.

    For example, at 11:59:30 AM, the most recent intervals available to SQL queries are:

    • 1 minute: 11:58:00–11:58:59 AM
    • 10 minute: 11:40:00–11:49:59 AM
    • 1 hour: 10:00:00–10:59:59 AM
  • Each row contains statistics for all queries executed over the database during the specified interval, aggregated together. There is only one row per time interval and it includes completed queries, failed queries, and queries canceled by the user.

  • The statistics captured in the TOTAL tables might include queries that Cloud Spanner did not capture in the TOP tables.

Table schema

Column name Type Description
INTERVAL_END TIMESTAMP End of the time interval that the included query executions occurred in.
EXECUTION_COUNT INT64 Number of times Cloud Spanner saw the query during the interval of time.
AVG_LATENCY_SECONDS FLOAT64 Average length of time, in seconds, for each query execution within the database. This average excludes the encoding and transmission time for the result set as well as overhead.
AVG_ROWS FLOAT64 Average number of rows that the query returned.
AVG_BYTES FLOAT64 Average number of data bytes that the query returned, excluding transmission encoding overhead.
AVG_ROWS_SCANNED FLOAT64 Average number of rows that the query scanned, excluding deleted values.
AVG_CPU_SECONDS FLOAT64 Average number of seconds of CPU time Cloud Spanner spent on all operations to execute the query.
ALL_FAILED_EXECUTION_COUNT INT64 Number of times the query failed during the interval.
ALL_FAILED_AVG_LATENCY_SECONDS FLOAT64 Average length of time, in seconds, for each query execution that failed within the database. This average excludes the encoding and transmission time for the result set as well as overhead.
CANCELLED_OR_DISCONNECTED_EXECUTION_COUNT INT64 Number of times the query was canceled by the user or failed due to broken network connection during the interval.
TIMED_OUT_EXECUTION_COUNT INT64 Number of times the query timed out during the interval.

Data retention

At a minimum, Cloud Spanner keeps data for each table for the following time periods:

  • SPANNER_SYS.QUERY_STATS_TOP_MINUTE and SPANNER_SYS.QUERY_STATS_TOTAL_MINUTE: Intervals covering the previous 6 hours.

  • SPANNER_SYS.QUERY_STATS_TOP_10MINUTE and SPANNER_SYS.QUERY_STATS_TOTAL_10MINUTE: Intervals covering the previous 4 days.

  • SPANNER_SYS.QUERY_STATS_TOP_HOUR and SPANNER_SYS.QUERY_STATS_TOTAL_HOUR: Intervals covering the previous 30 days.

Example queries

This section includes several example SQL statements that retrieve query statistics. You can run these SQL statements using the client libraries, the gcloud command-line tool, or the Cloud Console.

Listing the basic statistics for each query in a given time period

The following query returns the raw data for the top queries in the previous minute:

SELECT text,
       interval_end,
       execution_count,
       avg_latency_seconds,
       avg_rows,
       avg_bytes,
       avg_rows_scanned,
       avg_cpu_seconds
FROM spanner_sys.query_stats_top_minute
ORDER BY interval_end DESC;

Listing the queries with the highest CPU usage

The following query returns the queries with the highest CPU usage in the previous hour:

SELECT text,
       execution_count AS count,
       avg_latency_seconds AS latency,
       avg_cpu_seconds AS cpu,
       execution_count * avg_cpu_seconds AS total_cpu
FROM spanner_sys.query_stats_top_hour
WHERE interval_end =
  (SELECT MAX(interval_end)
   FROM spanner_sys.query_stats_top_hour)
ORDER BY total_cpu DESC;

Finding the total execution count in a given time period

The following query returns the total number of queries executed in the most recent complete 1-minute interval:

SELECT interval_end,
       execution_count
FROM spanner_sys.query_stats_total_minute
WHERE interval_end =
  (SELECT MAX(interval_end)
   FROM spanner_sys.query_stats_top_minute);

Finding the average latency for a query

The following query returns the average latency information for a specific query:

SELECT avg_latency_seconds
FROM spanner_sys.query_stats_top_hour
WHERE text LIKE "SELECT x FROM table WHERE x=@foo;";

Finding the queries that scan the most data

You can use the number of rows scanned by a query as a measure of the amount of data that the query scanned. The following query returns the number of rows scanned by queries executed in the previous hour:

SELECT text,
       execution_count,
       avg_rows_scanned
FROM spanner_sys.query_stats_top_hour
WHERE interval_end =
  (SELECT MAX(interval_end)
   FROM spanner_sys.query_stats_top_hour)
ORDER BY avg_rows_scanned DESC;

Totaling the CPU usage across all queries

The following query returns the number of CPU hours used in the previous hour:

SELECT (avg_cpu_seconds * execution_count / 60 / 60)
  AS total_cpu_hours
FROM spanner_sys.query_stats_total_hour
WHERE interval_end =
  (SELECT MAX(interval_end)
   FROM spanner_sys.query_stats_total_hour);

Listing the queries that have failed in a given time period

The following query returns the raw data including execution count and average latency of failed queries for the top queries in the previous minute:

SELECT text,
       interval_end,
       execution_count,
       all_failed_execution_count,
       all_failed_avg_latency_seconds,
       avg_latency_seconds,
       avg_rows,
       avg_bytes,
       avg_rows_scanned,
       avg_cpu_seconds
FROM spanner_sys.query_stats_top_minute
WHERE all_failed_execution_count > 0
ORDER BY interval_end;

Finding the total error count in a given time period

The following query returns the total number of queries that failed to execute in the most recent complete 1-minute interval.

SELECT interval_end,
       all_failed_execution_count
FROM spanner_sys.query_stats_total_minute
WHERE interval_end =
  (SELECT MAX(interval_end)
   FROM spanner_sys.query_stats_top_minute)
ORDER BY interval_end;

Listing the queries that timeout the most

The following query returns the queries with the highest timeout count in the previous hour.

SELECT text,
       execution_count AS count,
       timed_out_execution_count AS timeout_count,
       avg_latency_seconds AS latency,
       avg_cpu_seconds AS cpu,
       execution_count * avg_cpu_seconds AS total_cpu
FROM spanner_sys.query_stats_top_hour
WHERE interval_end =
  (SELECT MAX(interval_end)
   FROM spanner_sys.query_stats_top_hour)
ORDER BY timed_out_execution_count DESC;

Finding the average latency of successful and failed executions for a query

The following query returns the combined average latency, average latency for successful executions, and average latency for failed executions for a specific query.

SELECT avg_latency_seconds AS combined_avg_latency,
       all_failed_avg_latency_seconds AS failed_execution_latency,
       ( avg_latency_seconds * execution_count -
         all_failed_avg_latency_seconds * all_failed_execution_count
       ) / (
       execution_count - all_failed_execution_count ) AS success_execution_latency
FROM   spanner_sys.query_stats_top_hour
WHERE  text LIKE "select x from table where x=@foo;";

Troubleshooting high CPU usage or elevated query latency with query statistics

Query statistics are useful when you need to investigate high CPU usage on your Cloud Spanner database or when you are just trying to understand the CPU- heavy query shapes on your database. Inspecting queries that use significant amounts of database resources gives Cloud Spanner users a potential way to reduce operational costs and possibly improve general system latencies. Using the following steps, we'll show you how to use query statistics to investigate high CPU usage in your database.

While the example below focuses on CPU usage, similar steps can be followed to troubleshoot elevated query latency and find the queries with the highest latencies. Simply select time intervals and queries by latency instead of CPU usage.

Selecting a time period to investigate

Start your investigation by looking for a time when your application began to experience high CPU usage. For example, let's say the issue started occurring around 5:00 pm on July 24, 2020 UTC.

Gathering query statistics for the selected time period

Having selected a time period to start our investigation, we'll look at statistics gathered in the QUERY_STATS_TOTAL_10MINUTE table around that time. The results of this query might indicate how CPU and other query statistics changed over that period of time.

The following query returns the aggregated query statistics from 16:30 to 17:30 UTC inclusive. We are using ROUND in our query to restrict the number of decimal places for display purposes.

SELECT interval_end,
       execution_count AS count,
       ROUND(avg_latency_seconds,2) AS latency,
       ROUND(avg_rows,2) AS rows_returned,
       ROUND(avg_bytes,2) AS bytes,
       ROUND(avg_rows_scanned,2) AS rows_scanned,
       ROUND(avg_cpu_seconds,3) AS avg_cpu
FROM spanner_sys.query_stats_total_10minute
WHERE
  interval_end >= "2020-07-24T16:30:00Z"
  AND interval_end <= "2020-07-20T17:30:00Z"
ORDER BY interval_end;

Running the query produced the following results.

interval_end count latency rows_returned bytes rows_scanned avg_cpu
2020-07-24T16:30:00Z 6 0.06 5.00 536.00 16.67 0.035
2020-07-24T16:40:00Z 55 0.02 0.22 25.29 0.22 0.004
2020-07-24T16:50:00Z 102 0.02 0.30 33.35 0.30 0.004
2020-07-24T17:00:00Z 154 1.06 4.42 486.33 7792208.12 4.633
2020-07-24T17:10:00Z 94 0.02 1.68 106.84 1.68 0.006
2020-07-24T17:20:00Z 110 0.02 0.38 34.60 0.38 0.005
2020-07-24T17:30:00Z 47 0.02 0.23 24.96 0.23 0.004

In the preceding table we see that average CPU time, the avg_cpu column in the results table, is highest in the highlighted intervals ending at 17:00. We also see a much higher number of rows scanned on average. This indicates that more expensive queries ran between 16:50 and 17:00. Let's choose that interval to investigate further in the next step.

Finding which queries are causing high CPU usage

With a time interval to investigate selected, we now query the QUERY_STATS_TOP_10MINUTE table. The results of this query can help indicate which queries cause high CPU usage.

SELECT text_fingerprint AS fingerprint,
       execution_count AS count,
       ROUND(avg_latency_seconds,2) AS latency,
       ROUND(avg_cpu_seconds,3) AS cpu,
       ROUND(execution_count * avg_cpu_seconds,3) AS total_cpu
FROM spanner_sys.query_stats_top_10MINUTE
WHERE
  interval_end = "2020-07-20T17:00:00Z"
ORDER BY total_cpu DESC;

Running this query yields the following results.

fingerprint count latency cpu total_cpu
5505124206529314852 30 3.88 17.635 529.039
1697951036096498470 10 4.49 18.388 183.882
2295109096748351518 1 0.33 0.048 0.048
11618299167612903606 1 0.25 0.021 0.021
10302798842433860499 1 0.04 0.006 0.006
123771704548746223 1 0.04 0.006 0.006
4216063638051261350 1 0.04 0.006 0.006
3654744714919476398 1 0.04 0.006 0.006
2999453161628434990 1 0.04 0.006 0.006
823179738756093706 1 0.02 0.005 0.0056

The top 2 queries, highlighted in the results table, are outliers in terms of average CPU and latency, as well as number of executions and total CPU. Let's investigate the first query listed in these results.

Comparing query runs over time

Having narrowed down the investigation, we can turn our attention to the QUERY_STATS_TOP_MINUTE table. By comparing runs over time for a particular query, we can look for correlations between the number of rows or bytes returned, or the number of rows scanned and elevated CPU or latency. A deviation may indicate non-uniformity in the data. Consistently high numbers of rows scanned may indicate the lack of appropriate indexes or sub-optimal join ordering.

Let's investigate the query exhibiting highest average CPU usage and highest latency by running the following statement which filters on the text_fingerprint of that query.

SELECT interval_end,
       ROUND(avg_latency_seconds,2) AS latency,
       avg_rows AS rows_returned,
       avg_bytes AS bytes_returned,
       avg_rows_scanned AS rows_scanned,
       ROUND(avg_cpu_seconds,3) AS cpu,
FROM spanner_sys.query_stats_top_minute
WHERE text_fingerprint = 5505124206529314852
ORDER BY interval_end DESC;

Running this query returns the following results.

interval_end latency rows_returned bytes_returned rows_scanned cpu
2020-07-24T17:00:00Z 4.55 21 2365 30000000 19.255
2020-07-24T16:00:00Z 3.62 21 2365 30000000 17.255
2020-07-24T15:00:00Z 4.37 21 2365 30000000 18.350
2020-07-24T14:00:00Z 4.02 21 2365 30000000 17.748
2020-07-24T13:00:00Z 3.12 21 2365 30000000 16.380
2020-07-24T12:00:00Z 3.45 21 2365 30000000 15.476
2020-07-24T11:00:00Z 4.94 21 2365 30000000 22.611
2020-07-24T10:00:00Z 6.48 21 2365 30000000 21.265
2020-07-24T09:00:00Z 0.23 21 2365 5 0.040
2020-07-24T08:00:00Z 0.04 21 2365 5 0.021
2020-07-24T07:00:00Z 0.09 21 2365 5 0.030

Examining the preceding results, we see that the number of rows scanned, CPU used, and latency all changed significantly around 9:00 am. To understand why these numbers increased so dramatically, we'll examine the query text and see whether any changes in the schema might have impacted the query.

Use the following query to retrieve the query text for the query we are investigating.

SELECT text,
       text_truncated
FROM spanner_sys.query_stats_top_hour
WHERE text_fingerprint = 5505124206529314852
LIMIT 1;

This returns the following result.

text text_truncated
select * from orders where o_custkey = 36901; false

Examining the query text which is returned, we realize that the query is filtering on a field called o_custkey. This is a non-key column on the orders table. As it happens, there used to be an index on that column that was dropped around 9 am. This explains the change in cost for this query. We can add the index back in or, if the query is infrequently run, decide to not have the index and accept the higher read cost.

Our investigation focused so far on queries that completed successfully and we found one reason why the database was experiencing some performance degradation. In the next step, we'll focus on failed or canceled queries and show how to examine that data for more insights.

Investigating failed queries

Queries that don't complete successfully still consume resources before they time out, are cancelled, or otherwise fail. Cloud Spanner tracks the execution count and resources consumed by failed queries along with successful ones.

To check whether failed queries are a significant contributor to system utilization we can first check how many queries failed in the time interval of interest.

SELECT interval_end,
       all_failed_execution_count AS failed_count,
       all_failed_avg_latency_seconds AS latency
FROM spanner_sys.query_stats_total_minute
WHERE
  interval_end >= "2020-07-24T16:50:00Z"
  AND interval_end <= "2020-07-24T17:00:00Z"
ORDER BY interval_end;
interval_end failed_count latency
2020-07-24T16:52:00Z 1 15.211391
2020-07-24T16:53:00Z 3 58.312232

Investigating further, we can look for queries that are most likely to fail using the following query.

SELECT interval_end,
       text_fingerprint,
       execution_count,
       avg_latency_seconds AS avg_latency,
       all_failed_execution_count AS failed_count,
       all_failed_avg_latency_seconds AS failed_latency,
       cancelled_or_disconnected_execution_count AS cancel_count,
       timed_out_execution_count AS to_count
FROM spanner_sys.query_stats_top_minute
WHERE all_failed_execution_count > 0
ORDER BY interval_end;
interval_end text_fingerprint execution_count failed_count cancel_count to_count
2020-07-24T16:52:00Z 5505124206529314852 3 1 1 0
2020-07-24T16:53:00Z 1697951036096498470 2 1 1 0
2020-07-24T16:53:00Z 5505124206529314852 5 2 1 1

As the preceding table shows, the query with fingerprint 5505124206529314852 has failed multiple times during different time intervals. Given a pattern of failures such as this, it is then interesting to compare latency of successful and unsuccessful runs.

SELECT interval_end,
       avg_latency_seconds AS combined_avg_latency,
       all_failed_avg_latency_seconds AS failed_execution_latency,
       ( avg_latency_seconds * execution_count -
         all_failed_avg_latency_seconds * all_failed_execution_count
       ) / (
       execution_count - all_failed_execution_count ) AS success_execution_latency
FROM   spanner_sys.query_stats_top_hour
WHERE  text_fingerprint = 5505124206529314852;
interval_end combined_avg_latency failed_execution_latency success_execution_latency
2020-07-24T17:00:00Z 3.880420 13.830709 2.774832

Applying best practices

Having identified a candidate query for optimization, we can next look at the query profile and try to optimize using SQL best practices.

What's next