Transaction statistics

Cloud Spanner provides built-in tables that store statistics about transactions. You can retrieve statistics from these SPANNER_SYS.TXN_STATS* tables using SQL statements.

When to use transaction statistics

Transaction statistics are useful when investigating performance issues. For example, you can check whether there are any slow running transactions that might be impacting performance or Queries Per Second (QPS) in your database. Another scenario is when your client applications are experiencing high transaction execution latency. Analyzing transaction statistics may help to discover potential bottlenecks, such as large volumes of updates to a particular column, which might be impacting latency.

Availability

SPANNER_SYS data is available only through SQL interfaces (for example, executeQuery and gcloud spanner databases execute-sql); other single read methods Cloud Spanner provides do not support SPANNER_SYS.

Latency statistics grouped by transaction

The following tables track the statistics for TOP resource consuming transactions during a specific time period.

  • SPANNER_SYS.TXN_STATS_TOP_MINUTE: Transaction statistics aggregated across 1 minute intervals.

  • SPANNER_SYS.TXN_STATS_TOP_10MINUTE: Transaction statistics aggregated across 10 minute intervals.

  • SPANNER_SYS.TXN_STATS_TOP_HOUR: Transaction statistics aggregated across 1 hour intervals.

These tables have the following properties:

  • Each table contains data for non-overlapping time intervals of the length the table name specifies.

  • Intervals are based on clock times. 1 minute intervals end on the minute, 10 minute intervals end every 10 minutes starting on the hour, and 1 hour intervals end on the hour.

    For example, at 11:59:30 AM, the most recent intervals available to SQL queries are:

    • 1 minute: 11:58:01–11:59:00 AM
    • 10 minute: 11:40:01–11:50:00 AM
    • 1 hour: 10:00:01–11:00:00 AM
  • Cloud Spanner groups the statistics by FPRINT (Fingerprint) of the transactions. FPRINT is the hash calculated based on the operations involved in the transaction.

  • Since statistics are grouped based on FPRINT, if the same transaction is executed multiple times within any time interval, we still see only one entry for that transaction in these tables.

  • Each row contains statistics for all executions of a particular transaction that Cloud Spanner captures statistics for during the specified interval.

If Cloud Spanner is unable to store statistics for all transactions run during the interval in these tables, the system prioritizes transactions with the highest latency, commit attempts and bytes written during the specified interval.

Table schema

Column name Type Description
INTERVAL_END TIMESTAMP End of the time interval in which the included transaction executions occurred.
FPRINT INT64 Fingerprint is the hash calculated based on the operations involved in the transaction. INTERVAL_END and FPRINT together act as an unique key for these tables.
READ_COLUMNS ARRAY<STRING> The set of columns that were read by the transaction.
WRITE_CONSTRUCTIVE_COLUMNS ARRAY<STRING> The set of columns that were constructively written (i.e. assigned to new values) by the transaction.
WRITE_DELETE_TABLES ARRAY<STRING> The set of tables that had rows deleted or replaced by the transaction.
COMMIT_ATTEMPT_COUNT INT64 Total number of commit attempts on the transaction.
COMMIT_FAILED_PRECONDITION_COUNT INT64 Total number of precondition failures (FAILED_PRECONDITION) for the transaction.
COMMIT_ABORT_COUNT INT64 Number of times the commits were aborted for the transaction.
AVG_PARTICIPANTS FLOAT64 Average number of participants in each commit attempt. To learn more about participants, see Life of Cloud Spanner Reads & Writes.
AVG_TOTAL_LATENCY_SECONDS FLOAT64 Average seconds taken from the first operation of the transaction to commit/abort.
AVG_COMMIT_LATENCY_SECONDS FLOAT64 Average seconds taken to perform the commit operation.
AVG_BYTES FLOAT64 Average number of bytes written by the transaction.

Example queries

This section includes several example SQL statements that retrieve transaction statistics. You can run these SQL statements using the client libraries, the gcloud command-line tool, or the Cloud Console.

Listing the basic statistics for each transaction in a given time period

The following query returns the raw data for the top transactions in the previous minute.

SELECT fprint,
       read_columns,
       write_constructive_columns,
       write_delete_tables,
       avg_total_latency_seconds,
       avg_commit_latency_seconds,
       avg_bytes
FROM spanner_sys.txn_stats_top_minute
WHERE interval_end =
  (SELECT MAX(interval_end)
   FROM spanner_sys.txn_stats_top_minute);
Query output
fprint read_columns write_constructive_columns write_delete_tables avg_total_latency_seconds avg_commit_latency_seconds avg_bytes
40015598317 [] ["Routes"] ["Users"] 0.006578737 0.006547737 25286
20524969030 ["id", "no"] [] [] 0.001732442 0.000247442 0
77848338483 [] [] ["Cars", "Routes"] 0.033467418 0.000251418 0

Listing the transactions with the highest average commit latency

The following query returns the transactions with high average commit latency in the previous hour, ordered from highest to lowest average commit latency.

SELECT fprint,
       read_columns,
       write_constructive_columns,
       write_delete_tables,
       avg_total_latency_seconds,
       avg_commit_latency_seconds,
       avg_bytes
FROM spanner_sys.txn_stats_top_hour
WHERE interval_end =
  (SELECT MAX(interval_end)
   FROM spanner_sys.txn_stats_top_hour)
ORDER BY avg_commit_latency_seconds DESC;
Query output
fprint read_columns write_constructive_columns write_delete_tables avg_total_latency_seconds avg_commit_latency_seconds avg_bytes
40015598317 [] ["Routes"] ["Users"] 0.006578737 0.006547737 25286
77848338483 [] [] ["Cars", "Routes"] 0.033467418 0.000251418 0
20524969030 ["id", "no"] [] [] 0.001732442 0.000247442 0

Finding the average latency of transactions which read certain columns

The following query returns the average latency information for transactions that read the column ADDRESS from 1 hour stats:

SELECT fprint,
       read_columns,
       write_constructive_columns,
       write_delete_tables,
       avg_total_latency_seconds
FROM spanner_sys.txn_stats_top_hour
WHERE 'ADDRESS' IN UNNEST(read_columns)
ORDER BY avg_total_latency_seconds DESC;
Query output
fprint read_columns write_constructive_columns write_delete_tables avg_total_latency_seconds
77848338483 ["ID", "ADDRESS"] [] ["Cars", "Routes"] 0.033467418
40015598317 ["ID", "NAME", "ADDRESS"] [] ["Users"] 0.006578737

Listing transactions by the average number of bytes modified

The following query returns the transactions sampled in the last hour, ordered by the average number of bytes modified by the transaction.

SELECT fprint,
       read_columns,
       write_constructive_columns,
       write_delete_tables,
       avg_bytes
FROM spanner_sys.txn_stats_top_hour
ORDER BY avg_bytes DESC;
Query output
fprint read_columns write_constructive_columns write_delete_tables avg_bytes
40015598317 [] [] ["Users"] 25286
77848338483 [] [] ["Cars", "Routes"] 12005
20524969030 ["ID", "ADDRESS"] [] ["Users"] 10923

Aggregate statistics

SPANNER_SYS also contains tables to store aggregate data for all the transactions for which Cloud Spanner captured statistics in a specific time period:

  • SPANNER_SYS.TXN_STATS_TOTAL_MINUTE: Aggregate statistics for all transactions during 1 minute intervals
  • SPANNER_SYS.TXN_STATS_TOTAL_10MINUTE: Aggregate statistics for all transactions during 10 minute intervals
  • SPANNER_SYS.TXN_STATS_TOTAL_HOUR: Aggregate statistics for all transactions during 1 hour intervals

Aggregate statistics tables have the following properties:

  • Each table contains data for non-overlapping time intervals of the length the table name specifies.

  • Intervals are based on clock times. 1 minute intervals end on the minute, 10 minute intervals end every 10 minutes starting on the hour, and 1 hour intervals end on the hour.

    For example, at 11:59:30 AM, the most recent intervals available to SQL queries on aggregate transaction statistics are:

    • 1 minute: 11:58:01–11:59:00 AM
    • 10 minute: 11:40:01–11:50:00 AM
    • 1 hour: 10:00:01–11:00:00 AM
  • Each row contains statistics for all transactions executed over the database during the specified interval, aggregated together. There is only one row per time interval.

  • The statistics captured in the SPANNER_SYS.TXN_STATS_TOTAL_* tables might include transactions that Cloud Spanner did not capture in the SPANNER_SYS.TXN_STATS_TOP_* tables.

Table schema

Column name Type Description
INTERVAL_END TIMESTAMP End of the time interval in which this statistic was captured.
COMMIT_ATTEMPT_COUNT INT64 Total number of commit attempts on the transaction.
COMMIT_FAILED_PRECONDITION_COUNT INT64 Total number of precondition failures (FAILED_PRECONDITION) for the transaction.
COMMIT_ABORT_COUNT INT64 Number of times the commits were aborted for the transaction.
AVG_PARTICIPANTS FLOAT64 Average number of participants in each commit attempt. To learn more about participants, see Life of Cloud Spanner Reads & Writes.
AVG_TOTAL_LATENCY_SECONDS FLOAT64 Average seconds taken from the first operation of the transaction to commit/abort.
AVG_COMMIT_LATENCY_SECONDS FLOAT64 Average seconds taken to perform commit operation.
AVG_BYTES FLOAT64 Average number of bytes written by the transaction.

Example queries

This section includes several example SQL statements that retrieve transaction statistics. You can run these SQL statements using the client libraries, the gcloud command-line tool, or the Cloud Console.

Finding the total number of commit attempts for a transaction in a given time period

The following query returns the total number of commit attempts for all transactions in the most recent complete 1-minute interval:

SELECT interval_end,
       commit_attempt_count
FROM spanner_sys.txn_stats_total_minute
WHERE interval_end =
  (SELECT MAX(interval_end)
   FROM spanner_sys.txn_stats_total_minute)
ORDER BY interval_end;
Query output
interval_end commit_attempt_count
2020-01-17 11:46:00-08:00 21

Note that there is only one row in the result because aggregated stats have only one entry per interval_end for any time duration.

Finding the total commit latency across all transactions

The following query returns the total commit latency across all transactions in the previous 10 minutes:

SELECT (avg_commit_latency_seconds * commit_attempt_count / 60 / 60)
  AS total_commit_latency_hours
FROM spanner_sys.txn_stats_total_10minute
WHERE interval_end =
  (SELECT MAX(interval_end)
   FROM spanner_sys.txn_stats_total_10minute);
Query output
total_commit_latency_hours
0.8967

Note that there is only one row in the result because aggregated stats have only one entry per interval_end for any time duration.

Data retention

At a minimum, Cloud Spanner keeps data for each table for the following time periods:

  • SPANNER_SYS.TXN_STATS_TOP_MINUTE and SPANNER_SYS.TXN_STATS_TOTAL_MINUTE: Intervals covering the previous 6 hours.

  • SPANNER_SYS.TXN_STATS_TOP_10MINUTE and SPANNER_SYS.TXN_STATS_TOTAL_10MINUTE: Intervals covering the previous 4 days.

  • SPANNER_SYS.TXN_STATS_TOP_HOUR and SPANNER_SYS.TXN_STATS_TOTAL_HOUR: Intervals covering the previous 30 days.

Transaction statistics in Cloud Spanner give insight into how an application is using the database, and are useful when investigating performance issues. For example, you can check whether there are any slow running transactions that might be causing contention, or you can identify potential sources of high load, such as large volumes of updates to a particular column. Using the following steps, we'll show you how to use transaction statistics to investigate contentions in your database.

Troubleshooting database contentions using transaction statistics

Selecting a time period to investigate

This can be found from the application which is using Cloud Spanner.

For the purposes of this exercise, let's say that the issue started occurring at around 5:20pm on the 17th May, 2020.

Gathering transaction statistics for the selected time period

To start our investigation, we'll query the TXN_STATS_TOTAL_10MINUTE table around the start of the issue. The results of this query will show us how latency and other transaction statistics changed over that period of time.

For example, the following query returns the aggregated transaction statistics from 4:30 pm to 7:40 pm (inclusive).

SELECT
  interval_end,
  ROUND(avg_total_latency_seconds,4) as avg_total_latency_seconds,
  commit_attempt_count,
  commit_abort_count
FROM SPANNER_SYS.TXN_STATS_TOTAL_10MINUTE
WHERE
  interval_end >= "2020-05-17T16:40:00"
  AND interval_end <= "2020-05-17T19:40:00"
ORDER BY interval_end;

The following table lists example data returned from our query.

interval_end avg_total_latency_seconds commit_attempt_count commit_abort_count
2020-05-17 16:40:00-07:00 0.0284 315691 5170
2020-05-17 16:50:00-07:00 0.0250 302124 3828
2020-05-17 17:00:00-07:00 0.0460 346087 11382
2020-05-17 17:10:00-07:00 0.0864 379964 33826
2020-05-17 17:20:00-07:00 0.1291 390343 52549
2020-05-17 17:30:00-07:00 0.1314 456455 76392
2020-05-17 17:40:00-07:00 0.1598 507774 121458
2020-05-17 17:50:00-07:00 0.1641 516587 115875
2020-05-17 18:00:00-07:00 0.1578 552711 122626
2020-05-17 18:10:00-07:00 0.1750 569460 154205
2020-05-17 18:20:00-07:00 0.1727 613571 160772
2020-05-17 18:30:00-07:00 0.1588 601994 143044
2020-05-17 18:40:00-07:00 0.2025 604211 170019
2020-05-17 18:50:00-07:00 0.1615 601622 135601
2020-05-17 19:00:00-07:00 0.1653 596804 129511
2020-05-17 19:10:00-07:00 0.1414 560023 112247
2020-05-17 19:20:00-07:00 0.1367 570864 100596
2020-05-17 19:30:00-07:00 0.0894 539729 65316
2020-05-17 19:40:00-07:00 0.0820 479151 40398

Here we see that aggregated latency and abort count is higher in the highlighted periods. We can pick any 10 minute interval where aggregated latency and/or abort count are high. Let's choose the interval ending at 2020-05-17T18:40:00 and use it in the next step to identify which transactions are contributing to high latency and abort count.

Identifying transactions that are experiencing high latency

Let's now query the TXN_STATS_TOP_10MINUTE table for the interval which was picked in the previous step. Using this data, we can start to identify which transactions are experiencing high latency and/or high abort count.

Execute the following query to get top performance-impacting transactions in descending order of total latency for our example interval ending at 2020-05-17T18:40:00.

SELECT
  interval_end,
  fprint,
  ROUND(avg_total_latency_seconds,4) as avg_total_latency_seconds,
  ROUND(avg_commit_latency_seconds,4) as avg_commit_latency_seconds,
  commit_attempt_count,
  commit_abort_count
FROM SPANNER_SYS.TXN_STATS_TOP_10MINUTE
WHERE
  interval_end = "2020-05-17T18:40:00"
ORDER BY avg_total_latency_seconds DESC;
interval_end fprint avg_total_latency_seconds avg_commit_latency_seconds commit_attempt_count commit_abort_count
2020-05-17 18:40:00-07:00 15185072816865185658 0.3508 0.0139 278802 142205
2020-05-17 18:40:00-07:00 15435530087434255496 0.1633 0.0142 129012 27177
2020-05-17 18:40:00-07:00 14175643543447671202 0.1423 0.0133 5357 636
2020-05-17 18:40:00-07:00 898069986622520747 0.0198 0.0158 6 0
2020-05-17 18:40:00-07:00 10510121182038036893 0.0168 0.0125 7 0
2020-05-17 18:40:00-07:00 9287748709638024175 0.0159 0.0118 4269 1
2020-05-17 18:40:00-07:00 7129109266372596045 0.0142 0.0102 182227 0
2020-05-17 18:40:00-07:00 15630228555662391800 0.0120 0.0107 58 0
2020-05-17 18:40:00-07:00 7907238229716746451 0.0108 0.0097 65 0
2020-05-17 18:40:00-07:00 10158167220149989178 0.0095 0.0047 3454 0
2020-05-17 18:40:00-07:00 9353100217060788102 0.0093 0.0045 725 0
2020-05-17 18:40:00-07:00 9521689070912159706 0.0093 0.0045 164 0
2020-05-17 18:40:00-07:00 11079878968512225881 0.0064 0.0019 65 0

We can see clearly that the first row (highlighted) in the preceding table shows a transaction experiencing high latency because of a high number of commit aborts. In the next step, we'll investigate further to see what is causing this issue.

Discovering what columns are involved in a transaction experiencing high latency

In this step, we'll check whether high latency transactions are operating on the same set of columns by fetching read_columns, write_constructive_columns and write_delete_tables data for transactions with high abort count. The FPRINT value will also be useful in the next step.

SELECT
  fprint,
  read_columns,
  write_constructive_columns,
  write_delete_tables
FROM SPANNER_SYS.TXN_STATS_TOP_10MINUTE
WHERE
  interval_end = "2020-05-17T18:40:00"
ORDER BY avg_total_latency_seconds DESC LIMIT 3;
fprint read_columns write_constructive_columns write_delete
15185072816865185658 [TestHigherLatency._exists,TestHigherLatency.lang_status,TestHigherLatency.score,globalTagAffinity.shares] [TestHigherLatency._exists,TestHigherLatency.shares,TestHigherLatency_lang_status_score_index.shares] []
15435530087434255496 [TestHigherLatency._exists,TestHigherLatency.lang_status,TestHigherLatency.likes,globalTagAffinity.score] [TestHigherLatency._exists,TestHigherLatency.likes,TestHigherLatency_lang_status_score_index.likes] []
14175643543447671202 [TestHigherLatency._exists,TestHigherLatency.lang_status,TestHigherLatency.score,globalTagAffinity.ugcCount] [TestHigherLatency._exists,TestHigherLatency.ugcCount,TestHigherLatency_lang_status_score_index.ugcCount] []

As the output shows in the preceding table, the transactions with the highest average total latency are reading the same columns. We can also observe some write contention since the transactions are writing to the same column, that is to say, TestHigherLatency._exists.

Determining how the transaction performance has changed over time

We can see how the statistics associated with this transaction shape have changed over a period of time. Use the following query, where $FPRINT is the fingerprint of the high latency transaction from the previous step.

SELECT
  interval_end,
  ROUND(avg_total_latency_seconds, 3) AS latency,
  ROUND(avg_commit_latency_seconds, 3) AS commit_latency,
  commit_attempt_count,
  commit_abort_count,
  commit_failed_precondition_count,
  avg_bytes
FROM SPANNER_SYS.TXN_STATS_TOP_10MINUTE
WHERE
  interval_end >= "2020-05-17T16:40:00"
  AND interval_end <= "2020-05-17T19:40:00"
  AND fprint = $FPRINT
ORDER BY interval_end;
interval_end latency commit_latency commit_attempt_count commit_abort_count commit_failed_precondition_count avg_bytes
2020-05-17 16:40:00-07:00 0.095 0.010 53230 4752 0 91
2020-05-17 16:50:00-07:00 0.069 0.009 61264 3589 0 91
2020-05-17 17:00:00-07:00 0.150 0.010 75868 10557 0 91
2020-05-17 17:10:00-07:00 0.248 0.013 103151 30220 0 91
2020-05-17 17:20:00-07:00 0.310 0.012 130078 45655 0 91
2020-05-17 17:30:00-07:00 0.294 0.012 160064 64930 0 91
2020-05-17 17:40:00-07:00 0.315 0.013 209614 104949 0 91
2020-05-17 17:50:00-07:00 0.322 0.012 215682 100408 0 90
2020-05-17 18:00:00-07:00 0.310 0.012 230932 106728 0 91
2020-05-17 18:10:00-07:00 0.309 0.012 259645 131049 0 91
2020-05-17 18:20:00-07:00 0.315 0.013 272171 137910 0 90
2020-05-17 18:30:00-07:00 0.292 0.013 258944 121475 0 91
2020-05-17 18:40:00-07:00 0.350 0.013 278802 142205 0 91
2020-05-17 18:50:00-07:00 0.302 0.013 256259 115626 0 91
2020-05-17 19:00:00-07:00 0.315 0.014 250560 110662 0 91
2020-05-17 19:10:00-07:00 0.271 0.014 238384 99025 0 91
2020-05-17 19:20:00-07:00 0.273 0.014 219687 84019 0 91
2020-05-17 19:30:00-07:00 0.198 0.013 195357 59370 0 91
2020-05-17 19:40:00-07:00 0.181 0.013 167514 35705 0 91

In above output we can observe that total latency is high for the highlighted period of time. And, wherever total latency is high, both commit_attempt_count and commit_abort_coun are also high even though commit latency (commit_latency) has not changed very much. Since transaction commits are getting aborted more frequently, commit attempts are also high because of commit retries.

Conclusion

In this example, we saw that high commit abort count was the cause of high latency. The next step is to look at the commit abort error messages received by the application to know the reason for aborts. By inspecting logs in the application, we see the application actually changed its workload during this time, i.e. some other transaction shape showed up with high attempts_per_second, and that different transaction (maybe a nightly cleanup job) was responsible for the additional lock conflicts.

What's next