Transaction statistics

Spanner provides built-in tables that store statistics about transactions. You can retrieve statistics from these SPANNER_SYS.TXN_STATS* tables using SQL statements.

When to use transaction statistics

Transaction statistics are useful when investigating performance issues. For example, you can check whether there are any slow-running transactions that might be impacting performance or queries per second (QPS) in your database. Another scenario is when your client applications are experiencing high transaction execution latency. Analyzing transaction statistics can help you discover potential bottlenecks, such as large volumes of updates to a particular column, which might be impacting latency.

Availability

SPANNER_SYS data is available only through SQL interfaces; for example:

  • A database's Spanner Studio page in the Google Cloud console

  • The gcloud spanner databases execute-sql command

  • The Transaction insights dashboard

  • The executeQuery API

Single read methods that Spanner provides don't support SPANNER_SYS.

Latency statistics grouped by transaction

The following tables track statistics for the top resource-consuming transactions during a specific time period:

  • SPANNER_SYS.TXN_STATS_TOP_MINUTE: Transaction statistics aggregated across 1 minute intervals.

  • SPANNER_SYS.TXN_STATS_TOP_10MINUTE: Transaction statistics aggregated across 10 minute intervals.

  • SPANNER_SYS.TXN_STATS_TOP_HOUR: Transaction statistics aggregated across 1 hour intervals.

These tables have the following properties:

  • Each table contains data for non-overlapping time intervals of the length the table name specifies.

  • Intervals are based on clock times. 1 minute intervals end on the minute, 10 minute intervals end every 10 minutes starting on the hour, and 1 hour intervals end on the hour.

    For example, at 11:59:30 AM, the most recent intervals available to SQL queries are:

    • 1 minute: 11:58:00–11:58:59 AM
    • 10 minute: 11:40:00–11:49:59 AM
    • 1 hour: 10:00:00–10:59:59 AM
  • Spanner groups the statistics by the FPRINT (fingerprint) of the transaction. If a transaction tag is present, FPRINT is the hash of the tag. Otherwise, it is a hash calculated from the operations involved in the transaction.

  • Because statistics are grouped by FPRINT, if the same transaction is executed multiple times within a time interval, these tables still contain only one entry for that transaction.

  • Each row contains statistics for all executions of a particular transaction that Spanner captures statistics for during the specified interval.

  • If Spanner is unable to store statistics for all transactions run during the interval in these tables, the system prioritizes transactions with the highest latency, commit attempts, and bytes written during the specified interval.

  • All columns in the tables are nullable.

Table schema

Column name Type Description
INTERVAL_END TIMESTAMP End of the time interval in which the included transaction executions occurred.
TRANSACTION_TAG STRING The optional transaction tag for this transaction operation. For more information about using tags, see Troubleshooting with transaction tags. Statistics for multiple transactions that have the same tag string are grouped in a single row with the `TRANSACTION_TAG` matching that tag string.
FPRINT INT64 The hash of the TRANSACTION_TAG if present; otherwise, a hash calculated from the operations involved in the transaction. INTERVAL_END and FPRINT together act as a unique key for these tables.
READ_COLUMNS ARRAY<STRING> The set of columns that were read by the transaction.
WRITE_CONSTRUCTIVE_COLUMNS ARRAY<STRING> The set of columns that were constructively written (that is, assigned new values) by the transaction.

For change streams, if the transaction wrote to columns and tables watched by a change stream, WRITE_CONSTRUCTIVE_COLUMNS contains two entries, .data and ._exists¹, prefixed with the change stream name.
WRITE_DELETE_TABLES ARRAY<STRING> The set of tables that had rows deleted or replaced by the transaction.
ATTEMPT_COUNT INT64 Total number of times that the transaction was attempted, including attempts that aborted before calling `commit`.
COMMIT_ATTEMPT_COUNT INT64 Total number of transaction commit attempts. This must match the number of calls to the transaction's commit method.
COMMIT_ABORT_COUNT INT64 Total number of transaction attempts that were aborted, including those that were aborted before calling the transaction's commit method.
COMMIT_RETRY_COUNT INT64 Total number of attempts that are retries from previously aborted attempts. A Spanner transaction might be tried multiple times before it commits due to lock contentions or transient events. A high number of retries relative to commit attempts indicates that there might be issues worth investigating. For more information, see Understanding transactions and commit counts on this page.
COMMIT_FAILED_PRECONDITION_COUNT INT64 Total number of transaction commit attempts that returned failed precondition errors, such as UNIQUE index violations, row already exists, row not found, and so on.
AVG_PARTICIPANTS FLOAT64 Average number of participants in each commit attempt. To learn more about participants, see Life of Spanner Reads & Writes.
AVG_TOTAL_LATENCY_SECONDS FLOAT64 Average seconds taken from the first operation of the transaction to commit or abort.
AVG_COMMIT_LATENCY_SECONDS FLOAT64 Average seconds taken to perform the commit operation.
AVG_BYTES FLOAT64 Average number of bytes written by the transaction.
TOTAL_LATENCY_DISTRIBUTION ARRAY<STRUCT>

A histogram of the total commit latency, which is the time from the first transactional operation start time to the commit or abort time, for all attempts of a transaction.

If a transaction aborts multiple times and then successfully commits, latency is measured for each attempt until the final successful commit. The values are measured in seconds.

The array contains a single element and has the following type:
ARRAY<STRUCT<
  COUNT INT64,
  MEAN FLOAT64,
  SUM_OF_SQUARED_DEVIATION FLOAT64,
  NUM_FINITE_BUCKETS INT64,
  GROWTH_FACTOR FLOAT64,
  SCALE FLOAT64,
  BUCKET_COUNTS ARRAY<INT64>>>

For more information about the values, see Distribution.

To calculate the desired percentile latency from the distribution, use the SPANNER_SYS.DISTRIBUTION_PERCENTILE(distribution, n FLOAT64) function, which returns the estimated nth percentile. For a related example, see Find the 99th percentile latency for transactions.

For more information, see Percentiles and distribution-valued metrics.

OPERATIONS_BY_TABLE ARRAY<STRUCT>

Impact of INSERT or UPDATE operations by the transaction on a per-table basis. This is indicated by the number of times that rows are affected and the number of bytes that are written.

This column helps visualize the load on tables and provides insights into the rate at which a transaction writes to tables.

The array has the following type:
ARRAY<STRUCT<
  TABLE STRING(MAX),
  INSERT_OR_UPDATE_COUNT INT64,
  INSERT_OR_UPDATE_BYTES INT64>>

¹ _exists is an internal field that is used to check whether a row exists.
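The TOTAL_LATENCY_DISTRIBUTION column described above is a histogram with exponentially sized buckets defined by SCALE, GROWTH_FACTOR, and NUM_FINITE_BUCKETS. Inside the database, SPANNER_SYS.DISTRIBUTION_PERCENTILE computes percentiles for you; if you export the struct's values and want a rough estimate in client code, a minimal sketch might look like the following (this assumes the common exponential-bucket layout, with an underflow bucket first and an overflow bucket last, and uses linear interpolation inside a bucket):

```python
def estimate_percentile(bucket_counts, growth_factor, scale, pct):
    """Estimate the pct-th percentile from an exponential-bucket histogram.

    Bucket 0 is the underflow bucket (values below scale); bucket i covers
    [scale * growth_factor**(i-1), scale * growth_factor**i); the final
    bucket is the overflow bucket.
    """
    total = sum(bucket_counts)
    if total == 0:
        return 0.0
    target = total * pct / 100.0
    cumulative = 0
    for i, count in enumerate(bucket_counts):
        if cumulative + count >= target:
            if i == 0:
                return scale  # underflow bucket: report its upper bound
            lower = scale * growth_factor ** (i - 1)
            upper = scale * growth_factor ** i
            # Linearly interpolate within the bucket that crosses the target.
            fraction = (target - cumulative) / count if count else 0.0
            return lower + (upper - lower) * fraction
        cumulative += count
    # Target falls in the overflow bucket: report its lower bound.
    return scale * growth_factor ** (len(bucket_counts) - 2)
```

For example, with scale 1.0 and growth factor 2.0, a histogram whose only occupied bucket covers [1, 2) yields a median estimate of 1.5.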

Example queries

This section includes several example SQL statements that retrieve transaction statistics. You can run these SQL statements using the client libraries, the gcloud CLI, or the Google Cloud console.

List the basic statistics for each transaction in a given time period

The following query returns the raw data for the top transactions in the previous minute.

SELECT fprint,
       read_columns,
       write_constructive_columns,
       write_delete_tables,
       avg_total_latency_seconds,
       avg_commit_latency_seconds,
       operations_by_table,
       avg_bytes
FROM spanner_sys.txn_stats_top_minute
WHERE interval_end =
  (SELECT MAX(interval_end)
   FROM spanner_sys.txn_stats_top_minute);
Query output
fprint read_columns write_constructive_columns write_delete_tables avg_total_latency_seconds avg_commit_latency_seconds operations_by_table avg_bytes
40015598317 [] ["Routes.name", "Cars.model"] ["Users"] 0.006578737 0.006547737 [["Cars",1107,30996],["Routes",560,26880]] 25286
20524969030 ["id", "no"] [] [] 0.001732442 0.000247442 [] 0
77848338483 [] [] ["Cars", "Routes"] 0.033467418 0.000251418 [] 0

List the transactions with the highest average commit latency

The following query returns the transactions with high average commit latency in the previous hour, ordered from highest to lowest average commit latency.

SELECT fprint,
       read_columns,
       write_constructive_columns,
       write_delete_tables,
       avg_total_latency_seconds,
       avg_commit_latency_seconds,
       avg_bytes
FROM spanner_sys.txn_stats_top_hour
WHERE interval_end =
  (SELECT MAX(interval_end)
   FROM spanner_sys.txn_stats_top_hour)
ORDER BY avg_commit_latency_seconds DESC;
Query output
fprint read_columns write_constructive_columns write_delete_tables avg_total_latency_seconds avg_commit_latency_seconds avg_bytes
40015598317 [] ["Routes.name", "Cars.model"] ["Users"] 0.006578737 0.006547737 25286
77848338483 [] [] ["Cars", "Routes"] 0.033467418 0.000251418 0
20524969030 ["id", "no"] [] [] 0.001732442 0.000247442 0

Find the average latency of transactions which read certain columns

The following query returns the average latency information for transactions that read the column ADDRESS, from the 1-hour statistics:

SELECT fprint,
       read_columns,
       write_constructive_columns,
       write_delete_tables,
       avg_total_latency_seconds
FROM spanner_sys.txn_stats_top_hour
WHERE 'ADDRESS' IN UNNEST(read_columns)
ORDER BY avg_total_latency_seconds DESC;
Query output
fprint read_columns write_constructive_columns write_delete_tables avg_total_latency_seconds
77848338483 ["ID", "ADDRESS"] [] ["Cars", "Routes"] 0.033467418
40015598317 ["ID", "NAME", "ADDRESS"] [] ["Users"] 0.006578737

List transactions by the average number of bytes modified

The following query returns the transactions sampled in the last hour, ordered by the average number of bytes modified by the transaction.

SELECT fprint,
       read_columns,
       write_constructive_columns,
       write_delete_tables,
       avg_bytes
FROM spanner_sys.txn_stats_top_hour
ORDER BY avg_bytes DESC;
Query output
fprint read_columns write_constructive_columns write_delete_tables avg_bytes
40015598317 [] [] ["Users"] 25286
77848338483 [] [] ["Cars", "Routes"] 12005
20524969030 ["ID", "ADDRESS"] [] ["Users"] 10923

Aggregate statistics

SPANNER_SYS also contains tables to store aggregate data for all the transactions for which Spanner captured statistics in a specific time period:

  • SPANNER_SYS.TXN_STATS_TOTAL_MINUTE: Aggregate statistics for all transactions during 1 minute intervals
  • SPANNER_SYS.TXN_STATS_TOTAL_10MINUTE: Aggregate statistics for all transactions during 10 minute intervals
  • SPANNER_SYS.TXN_STATS_TOTAL_HOUR: Aggregate statistics for all transactions during 1 hour intervals

Aggregate statistics tables have the following properties:

  • Each table contains data for non-overlapping time intervals of the length the table name specifies.

  • Intervals are based on clock times. 1 minute intervals end on the minute, 10 minute intervals end every 10 minutes starting on the hour, and 1 hour intervals end on the hour.

    For example, at 11:59:30 AM, the most recent intervals available to SQL queries on aggregate transaction statistics are:

    • 1 minute: 11:58:00–11:58:59 AM
    • 10 minute: 11:40:00–11:49:59 AM
    • 1 hour: 10:00:00–10:59:59 AM
  • Each row contains statistics for all transactions executed over the database during the specified interval, aggregated together. There is only one row per time interval.

  • The statistics captured in the SPANNER_SYS.TXN_STATS_TOTAL_* tables might include transactions that Spanner did not capture in the SPANNER_SYS.TXN_STATS_TOP_* tables.

  • Some columns in these tables are exposed as metrics in Cloud Monitoring. The exposed metrics are:

    • Commit attempts count
    • Commit retries count
    • Transaction participants
    • Transaction latencies
    • Bytes written

    For more information, see Spanner metrics.

Table schema

Column name Type Description
INTERVAL_END TIMESTAMP End of the time interval in which this statistic was captured.
ATTEMPT_COUNT INT64 Total number of times that transactions were attempted, including attempts that aborted before calling `commit`.
COMMIT_ATTEMPT_COUNT INT64 Total number of transaction commit attempts. This must match the number of calls to the transaction's commit method.
COMMIT_ABORT_COUNT INT64 Total number of transaction attempts that were aborted, including those that were aborted before calling the transaction's commit method.
COMMIT_RETRY_COUNT INT64 Number of commit attempts that are retries from previously aborted attempts. A Spanner transaction may have been tried multiple times before it commits due to lock contentions or transient events. A high number of retries relative to commit attempts indicates that there might be issues worth investigating. For more information, see Understanding transactions and commit counts on this page.
COMMIT_FAILED_PRECONDITION_COUNT INT64 Total number of transaction commit attempts that returned failed precondition errors, such as UNIQUE index violations, row already exists, row not found, and so on.
AVG_PARTICIPANTS FLOAT64 Average number of participants in each commit attempt. To learn more about participants, see Life of Spanner Reads & Writes.
AVG_TOTAL_LATENCY_SECONDS FLOAT64 Average seconds taken from the first operation of the transaction to commit/abort.
AVG_COMMIT_LATENCY_SECONDS FLOAT64 Average seconds taken to perform the commit operation.
AVG_BYTES FLOAT64 Average number of bytes written by the transaction.
TOTAL_LATENCY_DISTRIBUTION ARRAY<STRUCT>

A histogram of the total commit latency, which is the time from the first transactional operation start time to the commit or abort time for all transaction attempts.

If a transaction aborts multiple times and then successfully commits, latency is measured for each attempt until the final successful commit. The values are measured in seconds.

The array contains a single element and has the following type:
ARRAY<STRUCT<
  COUNT INT64,
  MEAN FLOAT64,
  SUM_OF_SQUARED_DEVIATION FLOAT64,
  NUM_FINITE_BUCKETS INT64,
  GROWTH_FACTOR FLOAT64,
  SCALE FLOAT64,
  BUCKET_COUNTS ARRAY<INT64>>>

For more information about the values, see Distribution.

To calculate the desired percentile latency from the distribution, use the SPANNER_SYS.DISTRIBUTION_PERCENTILE(distribution, n FLOAT64) function, which returns the estimated nth percentile. For an example, see Find the 99th percentile latency for transactions.

For more information, see Percentiles and distribution-valued metrics.

OPERATIONS_BY_TABLE ARRAY<STRUCT>

Impact of INSERT or UPDATE operations by all transactions on a per-table basis. This is indicated by the number of times that rows are affected and the number of bytes that are written.

This column helps visualize the load on tables and provides insights into the rate at which the transactions write to tables.

The array has the following type:
ARRAY<STRUCT<
  TABLE STRING(MAX),
  INSERT_OR_UPDATE_COUNT INT64,
  INSERT_OR_UPDATE_BYTES INT64>>

Example queries

This section includes several example SQL statements that retrieve transaction statistics. You can run these SQL statements using the client libraries, the gcloud CLI, or the Google Cloud console.

Find the total number of commit attempts for all transactions

The following query returns the total number of commit attempts for all transactions in the most recent complete 1-minute interval:

SELECT interval_end,
       commit_attempt_count
FROM spanner_sys.txn_stats_total_minute
WHERE interval_end =
  (SELECT MAX(interval_end)
   FROM spanner_sys.txn_stats_total_minute)
ORDER BY interval_end;
Query output
interval_end commit_attempt_count
2020-01-17 11:46:00-08:00 21

Note that there is only one row in the result, because the aggregate statistics tables contain only one entry per interval_end.

Find the total commit latency across all transactions

The following query returns the total commit latency across all transactions in the previous 10 minutes:

SELECT (avg_commit_latency_seconds * commit_attempt_count / 60 / 60)
  AS total_commit_latency_hours
FROM spanner_sys.txn_stats_total_10minute
WHERE interval_end =
  (SELECT MAX(interval_end)
   FROM spanner_sys.txn_stats_total_10minute);
Query output
total_commit_latency_hours
0.8967

Note that there is only one row in the result, because the aggregate statistics tables contain only one entry per interval_end.
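The SQL above multiplies the average commit latency by the number of commit attempts and converts seconds to hours. The same arithmetic in client code, as a small sketch (the input numbers below are hypothetical, not taken from a real database):

```python
def total_commit_latency_hours(avg_commit_latency_seconds, commit_attempt_count):
    """Total time spent committing across all attempts, in hours."""
    # Average latency times attempt count gives total seconds;
    # dividing by 3600 converts seconds to hours.
    return avg_commit_latency_seconds * commit_attempt_count / 3600

# e.g. total_commit_latency_hours(0.25, 14_400) == 1.0
```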

Find the 99th percentile latency for transactions

The following query returns the 99th percentile latency for transactions run in the previous 10 minutes:

SELECT interval_end, avg_total_latency_seconds,
       SPANNER_SYS.DISTRIBUTION_PERCENTILE(total_latency_distribution[OFFSET(0)], 99.0)
  AS percentile_latency
FROM spanner_sys.txn_stats_total_10minute
WHERE interval_end =
  (SELECT MAX(interval_end)
   FROM spanner_sys.txn_stats_total_10minute)
ORDER BY interval_end;

Query output
interval_end avg_total_latency_seconds percentile_latency
2022-08-17 11:46:00-08:00 0.34576998305986395 9.00296190476190476

Notice the large difference between the average and the 99th percentile latency. The 99th percentile latency helps identify possible outlier transactions with high latency.

There's only one row in the result, because the aggregate statistics tables contain only one entry per interval_end.

Data retention

At a minimum, Spanner keeps data for each table for the following time periods:

  • SPANNER_SYS.TXN_STATS_TOP_MINUTE and SPANNER_SYS.TXN_STATS_TOTAL_MINUTE: Intervals covering the previous 6 hours.

  • SPANNER_SYS.TXN_STATS_TOP_10MINUTE and SPANNER_SYS.TXN_STATS_TOTAL_10MINUTE: Intervals covering the previous 4 days.

  • SPANNER_SYS.TXN_STATS_TOP_HOUR and SPANNER_SYS.TXN_STATS_TOTAL_HOUR: Intervals covering the previous 30 days.

Transaction statistics in Spanner give insight into how an application is using the database, and are useful when investigating performance issues. For example, you can check whether there are any slow-running transactions that might be causing contention, or you can identify potential sources of high load, such as large volumes of updates to a particular column. The following steps show how to use transaction statistics to investigate contention in your database.

Understand transactions and commit counts

A Spanner transaction might have to be tried multiple times before it commits. This most commonly occurs when two transactions attempt to work on the same data at the same time and one of the transactions must be aborted to preserve the isolation property of transactions. Other transient events that can cause a transaction to be aborted include:

  • Transient network issues.

  • Database schema changes being applied while a transaction is in the process of committing.

  • The Spanner instance doesn't have the capacity to handle all the requests it's receiving.

In such scenarios, a client should retry the aborted transaction until it commits successfully or times out. Each of the official Spanner client libraries implements an automatic retry mechanism. If you are using a custom version of the client code, wrap your transaction commits in a retry loop.
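As noted above, the official client libraries already retry aborted transactions for you. For a custom client, the usual pattern is a bounded retry loop with exponential backoff. The following is a minimal, hypothetical sketch of that pattern; Aborted and commit_fn are stand-ins, not part of a real Spanner API:

```python
import random
import time

class Aborted(Exception):
    """Stand-in for the retriable ABORTED error a Spanner commit can return."""

def commit_with_retries(commit_fn, max_attempts=5, base_delay=0.1):
    """Run commit_fn, retrying on Aborted with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return commit_fn()
        except Aborted:
            if attempt == max_attempts - 1:
                raise  # give up after max_attempts
            # Back off before retrying: base_delay, 2x, 4x, ... plus jitter.
            time.sleep(base_delay * 2 ** attempt * (1 + random.random()))
```

Non-retriable errors (discussed next) should be allowed to propagate rather than caught by this loop.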

A Spanner transaction can also be aborted due to a non-retriable error, such as a transaction timeout, a permission issue, or an invalid table or column name. There is no need to retry such transactions, and the Spanner client library returns the error immediately.

The following table describes some examples of how COMMIT_ATTEMPT_COUNT, COMMIT_ABORT_COUNT, and COMMIT_RETRY_COUNT are logged in different scenarios.

Scenario COMMIT_ATTEMPT_COUNT COMMIT_ABORT_COUNT COMMIT_RETRY_COUNT
Transaction successfully committed on first attempt. 1 0 0
Transaction aborted due to time out error. 1 1 0
Transaction aborted due to transient network issue and successfully committed after one retry. 2 1 1
5 transactions with the same FPRINT are executed within a 10-minute interval. 3 of the transactions successfully committed on first attempt, while 2 transactions were aborted and then committed successfully on the first retry. 7 2 2
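The counting rules behind the table above can be sketched in a few lines. In this hypothetical model, each transaction is a list of commit-attempt outcomes, any attempt after the first counts as a retry, and the tallies correspond to COMMIT_ATTEMPT_COUNT, COMMIT_ABORT_COUNT, and COMMIT_RETRY_COUNT:

```python
def tally_commit_stats(transactions):
    """Tally commit statistics the way the scenario table describes.

    Each transaction is a list of attempt outcomes, e.g.
    ["aborted", "committed"] for one abort followed by a successful retry.
    """
    attempts = aborts = retries = 0
    for outcomes in transactions:
        for i, outcome in enumerate(outcomes):
            attempts += 1                 # every attempt counts
            if outcome == "aborted":
                aborts += 1               # aborted attempts
            if i > 0:
                retries += 1              # any attempt after the first
    return attempts, aborts, retries

# Last scenario row: 3 first-try commits plus 2 transactions that abort
# once and commit on the first retry -> tallies to (7, 2, 2).
workload = [["committed"]] * 3 + [["aborted", "committed"]] * 2
```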

The data in the transaction statistics tables is aggregated per time interval. For a particular transaction, an abort and its subsequent retry can happen near an interval boundary and fall into different intervals. As a result, the abort and retry counts within a given interval might not match.

These stats are designed for troubleshooting and introspection and are not guaranteed to be 100% accurate. Statistics are aggregated in memory before being stored in Spanner tables. During an upgrade or other maintenance activities, Spanner servers can restart, affecting the accuracy of the numbers.

Troubleshoot database contentions using transaction statistics

You can use SQL code or the Transaction insights dashboard to view the transactions in your database that might cause high latencies due to lock contentions.

The following topics show how you can investigate such transactions by using SQL code.

Select a time period to investigate

You can determine this time period from the application that is using Spanner.

For the purposes of this exercise, let's say that the issue started occurring at around 5:20 PM on May 17, 2020.

You can use transaction tags to identify the source of a transaction and to correlate the transaction statistics tables with the lock statistics tables for effective lock contention troubleshooting. For more information, see Troubleshooting with transaction tags.

Gather transaction statistics for the selected time period

To start our investigation, we'll query the TXN_STATS_TOTAL_10MINUTE table around the start of the issue. The results of this query will show us how latency and other transaction statistics changed over that period of time.

For example, the following query returns the aggregated transaction statistics from 4:30 pm to 7:40 pm (inclusive).

SELECT
  interval_end,
  ROUND(avg_total_latency_seconds,4) as avg_total_latency_seconds,
  commit_attempt_count,
  commit_abort_count
FROM SPANNER_SYS.TXN_STATS_TOTAL_10MINUTE
WHERE
  interval_end >= "2020-05-17T16:40:00"
  AND interval_end <= "2020-05-17T19:40:00"
ORDER BY interval_end;

The following table lists example data returned from our query.

interval_end avg_total_latency_seconds commit_attempt_count commit_abort_count
2020-05-17 16:40:00-07:00 0.0284 315691 5170
2020-05-17 16:50:00-07:00 0.0250 302124 3828
2020-05-17 17:00:00-07:00 0.0460 346087 11382
2020-05-17 17:10:00-07:00 0.0864 379964 33826
2020-05-17 17:20:00-07:00 0.1291 390343 52549
2020-05-17 17:30:00-07:00 0.1314 456455 76392
2020-05-17 17:40:00-07:00 0.1598 507774 121458
2020-05-17 17:50:00-07:00 0.1641 516587 115875
2020-05-17 18:00:00-07:00 0.1578 552711 122626
2020-05-17 18:10:00-07:00 0.1750 569460 154205
2020-05-17 18:20:00-07:00 0.1727 613571 160772
2020-05-17 18:30:00-07:00 0.1588 601994 143044
2020-05-17 18:40:00-07:00 0.2025 604211 170019
2020-05-17 18:50:00-07:00 0.1615 601622 135601
2020-05-17 19:00:00-07:00 0.1653 596804 129511
2020-05-17 19:10:00-07:00 0.1414 560023 112247
2020-05-17 19:20:00-07:00 0.1367 570864 100596
2020-05-17 19:30:00-07:00 0.0894 539729 65316
2020-05-17 19:40:00-07:00 0.0820 479151 40398

Here we see that aggregated latency and abort count are higher in the intervals from 17:10 through 19:20. We can pick any 10-minute interval in which aggregated latency and/or abort count are high. Let's choose the interval ending at 2020-05-17T18:40:00 and use it in the next step to identify which transactions are contributing to high latency and abort count.

Identify transactions that are experiencing high latency

Let's now query the TXN_STATS_TOP_10MINUTE table for the interval which was picked in the previous step. Using this data, we can start to identify which transactions are experiencing high latency and/or high abort count.

Execute the following query to get the top performance-impacting transactions, in descending order of total latency, for our example interval ending at 2020-05-17T18:40:00.

SELECT
  interval_end,
  fprint,
  ROUND(avg_total_latency_seconds,4) as avg_total_latency_seconds,
  ROUND(avg_commit_latency_seconds,4) as avg_commit_latency_seconds,
  commit_attempt_count,
  commit_abort_count,
  commit_retry_count
FROM SPANNER_SYS.TXN_STATS_TOP_10MINUTE
WHERE
  interval_end = "2020-05-17T18:40:00"
ORDER BY avg_total_latency_seconds DESC;
interval_end fprint avg_total_latency_seconds avg_commit_latency_seconds commit_attempt_count commit_abort_count commit_retry_count
2020-05-17 18:40:00-07:00 15185072816865185658 0.3508 0.0139 278802 142205 129884
2020-05-17 18:40:00-07:00 15435530087434255496 0.1633 0.0142 129012 27177 24559
2020-05-17 18:40:00-07:00 14175643543447671202 0.1423 0.0133 5357 636 433
2020-05-17 18:40:00-07:00 898069986622520747 0.0198 0.0158 6 0 0
2020-05-17 18:40:00-07:00 10510121182038036893 0.0168 0.0125 7 0 0
2020-05-17 18:40:00-07:00 9287748709638024175 0.0159 0.0118 4269 1 0
2020-05-17 18:40:00-07:00 7129109266372596045 0.0142 0.0102 182227 0 0
2020-05-17 18:40:00-07:00 15630228555662391800 0.0120 0.0107 58 0 0
2020-05-17 18:40:00-07:00 7907238229716746451 0.0108 0.0097 65 0 0
2020-05-17 18:40:00-07:00 10158167220149989178 0.0095 0.0047 3454 0 0
2020-05-17 18:40:00-07:00 9353100217060788102 0.0093 0.0045 725 0 0
2020-05-17 18:40:00-07:00 9521689070912159706 0.0093 0.0045 164 0 0
2020-05-17 18:40:00-07:00 11079878968512225881 0.0064 0.0019 65 0 0

We can see clearly that the first row in the preceding table shows a transaction experiencing high latency because of a high number of commit aborts. We can also see a high number of commit retries, which indicates that the aborted commits were subsequently retried. In the next step, we'll investigate further to see what is causing this issue.

Identify the columns involved in a transaction experiencing high latency

In this step, we'll check whether high latency transactions are operating on the same set of columns by fetching read_columns, write_constructive_columns and write_delete_tables data for transactions with high abort count. The FPRINT value will also be useful in the next step.

SELECT
  fprint,
  read_columns,
  write_constructive_columns,
  write_delete_tables
FROM SPANNER_SYS.TXN_STATS_TOP_10MINUTE
WHERE
  interval_end = "2020-05-17T18:40:00"
ORDER BY avg_total_latency_seconds DESC LIMIT 3;
fprint read_columns write_constructive_columns write_delete_tables
15185072816865185658 [TestHigherLatency._exists,TestHigherLatency.lang_status,TestHigherLatency.score,globalTagAffinity.shares] [TestHigherLatency._exists,TestHigherLatency.shares,TestHigherLatency_lang_status_score_index.shares] []
15435530087434255496 [TestHigherLatency._exists,TestHigherLatency.lang_status,TestHigherLatency.likes,globalTagAffinity.score] [TestHigherLatency._exists,TestHigherLatency.likes,TestHigherLatency_lang_status_score_index.likes] []
14175643543447671202 [TestHigherLatency._exists,TestHigherLatency.lang_status,TestHigherLatency.score,globalTagAffinity.ugcCount] [TestHigherLatency._exists,TestHigherLatency.ugcCount,TestHigherLatency_lang_status_score_index.ugcCount] []

As the preceding output shows, the transactions with the highest average total latency are reading the same columns. We can also observe some write contention, because the transactions are writing to the same column, TestHigherLatency._exists.

Determine how the transaction performance has changed over time

We can see how the statistics associated with this transaction shape have changed over a period of time. Use the following query, where $FPRINT is the fingerprint of the high latency transaction from the previous step.

SELECT
  interval_end,
  ROUND(avg_total_latency_seconds, 3) AS latency,
  ROUND(avg_commit_latency_seconds, 3) AS commit_latency,
  commit_attempt_count,
  commit_abort_count,
  commit_retry_count,
  commit_failed_precondition_count,
  avg_bytes
FROM SPANNER_SYS.TXN_STATS_TOP_10MINUTE
WHERE
  interval_end >= "2020-05-17T16:40:00"
  AND interval_end <= "2020-05-17T19:40:00"
  AND fprint = $FPRINT
ORDER BY interval_end;
interval_end latency commit_latency commit_attempt_count commit_abort_count commit_retry_count commit_failed_precondition_count avg_bytes
2020-05-17 16:40:00-07:00 0.095 0.010 53230 4752 4330 0 91
2020-05-17 16:50:00-07:00 0.069 0.009 61264 3589 3364 0 91
2020-05-17 17:00:00-07:00 0.150 0.010 75868 10557 9322 0 91
2020-05-17 17:10:00-07:00 0.248 0.013 103151 30220 28483 0 91
2020-05-17 17:20:00-07:00 0.310 0.012 130078 45655 41966 0 91
2020-05-17 17:30:00-07:00 0.294 0.012 160064 64930 59933 0 91
2020-05-17 17:40:00-07:00 0.315 0.013 209614 104949 96770 0 91
2020-05-17 17:50:00-07:00 0.322 0.012 215682 100408 95867 0 90
2020-05-17 18:00:00-07:00 0.310 0.012 230932 106728 99462 0 91
2020-05-17 18:10:00-07:00 0.309 0.012 259645 131049 125889 0 91
2020-05-17 18:20:00-07:00 0.315 0.013 272171 137910 129411 0 90
2020-05-17 18:30:00-07:00 0.292 0.013 258944 121475 115844 0 91
2020-05-17 18:40:00-07:00 0.350 0.013 278802 142205 134229 0 91
2020-05-17 18:50:00-07:00 0.302 0.013 256259 115626 109756 0 91
2020-05-17 19:00:00-07:00 0.315 0.014 250560 110662 100322 0 91
2020-05-17 19:10:00-07:00 0.271 0.014 238384 99025 90187 0 91
2020-05-17 19:20:00-07:00 0.273 0.014 219687 84019 79874 0 91
2020-05-17 19:30:00-07:00 0.198 0.013 195357 59370 55909 0 91
2020-05-17 19:40:00-07:00 0.181 0.013 167514 35705 32885 0 91

In the preceding output, we can observe that total latency is high from the interval ending at 17:10 onward. Wherever total latency is high, commit_attempt_count, commit_abort_count, and commit_retry_count are also high, even though commit latency (commit_latency) has not changed much. Because transaction commits are aborted more frequently, commit attempts are also higher as a result of commit retries.

Conclusion

In this example, we saw that a high commit abort count was the cause of high latency. The next step is to look at the commit abort error messages that the application received, to learn the reason for the aborts. By inspecting the application's logs, we see that the application changed its workload during this time: another transaction shape appeared with a high number of attempts per second, and that transaction (perhaps a nightly cleanup job) was responsible for the additional lock conflicts.

Identify transactions that were not retried correctly

The following query returns transactions that have a high commit abort count but no retries, aggregated across all retained 10-minute intervals.

SELECT
  *
FROM (
  SELECT
    fprint,
    SUM(commit_attempt_count) AS total_commit_attempt_count,
    SUM(commit_abort_count) AS total_commit_abort_count,
    SUM(commit_retry_count) AS total_commit_retry_count
  FROM
    SPANNER_SYS.TXN_STATS_TOP_10MINUTE
  GROUP BY
    fprint )
WHERE
  total_commit_retry_count = 0
  AND total_commit_abort_count > 0
ORDER BY
  total_commit_abort_count DESC;
fprint total_commit_attempt_count total_commit_abort_count total_commit_retry_count
1557557373282541312 3367894 44232 0
5776062322886969344 13566 14 0

We can see that the transaction with fprint 1557557373282541312 was aborted 44232 times but was never retried. This looks suspicious, because the abort count is high and it is unlikely that every abort was caused by a non-retriable error. On the other hand, the transaction with fprint 5776062322886969344 is less suspicious, because its total abort count is not that high.
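The reasoning above — aborts with zero retries are only suspicious when the abort volume is large — can be expressed as a simple post-processing filter over the query results. The threshold below is arbitrary and purely illustrative:

```python
def flag_missing_retries(rows, min_aborts=1000):
    """Return fprints whose transactions aborted often but never retried.

    rows: (fprint, attempt_count, abort_count, retry_count) tuples, as
    returned by the aggregation query. min_aborts is an arbitrary
    threshold; below it, aborts without retries are less likely to
    indicate a broken retry loop.
    """
    return [fprint for fprint, _, aborts, retries in rows
            if retries == 0 and aborts >= min_aborts]

# The two rows from the example output above.
rows = [
    (1557557373282541312, 3367894, 44232, 0),
    (5776062322886969344, 13566, 14, 0),
]
# Flags only the first fingerprint: 44232 aborts with no retries.
```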

The following query returns more details about the transaction with fprint 1557557373282541312, including the read_columns, write_constructive_columns, and write_delete_tables values. This information helps identify the transaction in client code, where the retry logic can be reviewed for this scenario.

SELECT
  interval_end,
  fprint,
  read_columns,
  write_constructive_columns,
  write_delete_tables,
  commit_attempt_count,
  commit_abort_count,
  commit_retry_count
FROM
  SPANNER_SYS.TXN_STATS_TOP_10MINUTE
WHERE
  fprint = 1557557373282541312
ORDER BY
  interval_end DESC;
interval_end fprint read_columns write_constructive_columns write_delete_tables commit_attempt_count commit_abort_count commit_retry_count
2021-01-27T18:30:00Z 1557557373282541312 ['Singers._exists'] ['Singers.FirstName', 'Singers.LastName', 'Singers._exists'] [] 805228 1839 0
2021-01-27T18:20:00Z 1557557373282541312 ['Singers._exists'] ['Singers.FirstName', 'Singers.LastName', 'Singers._exists'] [] 1034429 38779 0
2021-01-27T18:10:00Z 1557557373282541312 ['Singers._exists'] ['Singers.FirstName', 'Singers.LastName', 'Singers._exists'] [] 833677 2266 0
2021-01-27T18:00:00Z 1557557373282541312 ['Singers._exists'] ['Singers.FirstName', 'Singers.LastName', 'Singers._exists'] [] 694560 1348 0

We can see that the transaction reads the Singers._exists hidden column to check whether a row exists. The transaction also writes to the Singers.FirstName and Singers.LastName columns. This information can help you determine whether the transaction retry mechanism implemented in your custom client library is working as expected.

What's next