Spanner provides built-in tables that store statistics about transactions. You can retrieve statistics from these SPANNER_SYS.TXN_STATS* tables using SQL statements.
When to use transaction statistics
Transaction statistics are useful when investigating performance issues. For example, you can check whether there are any slow-running transactions that might be impacting performance or Queries Per Second (QPS) in your database. Another scenario is when your client applications are experiencing high transaction execution latency. Analyzing transaction statistics can help you discover potential bottlenecks, such as large volumes of updates to a particular column, that might be impacting latency.
Availability
SPANNER_SYS data is available only through SQL interfaces; for example:
- A database's Spanner Studio page in the Google Cloud console
- The gcloud spanner databases execute-sql command
- The Transaction insights dashboard
- The executeQuery API
Other single read methods that Spanner provides don't support SPANNER_SYS.
Latency statistics grouped by transaction
The following tables track statistics for the top resource-consuming transactions during a specific time period:
- SPANNER_SYS.TXN_STATS_TOP_MINUTE: Transaction statistics aggregated across 1 minute intervals.
- SPANNER_SYS.TXN_STATS_TOP_10MINUTE: Transaction statistics aggregated across 10 minute intervals.
- SPANNER_SYS.TXN_STATS_TOP_HOUR: Transaction statistics aggregated across 1 hour intervals.
These tables have the following properties:
- Each table contains data for non-overlapping time intervals of the length the table name specifies.
- Intervals are based on clock times. 1 minute intervals end on the minute, 10 minute intervals end every 10 minutes starting on the hour, and 1 hour intervals end on the hour. For example, at 11:59:30 AM, the most recent intervals available to SQL queries are (see the example query after this list):
  - 1 minute: 11:58:00–11:58:59 AM
  - 10 minute: 11:40:00–11:49:59 AM
  - 1 hour: 10:00:00–10:59:59 AM
- Spanner groups the statistics by the FPRINT (fingerprint) of the transactions. If a transaction tag is present, FPRINT is the hash of the tag. Otherwise, it is a hash calculated based on the operations involved in the transaction.
- Because statistics are grouped by FPRINT, if the same transaction is executed multiple times within a time interval, you still see only one entry for that transaction in these tables.
- Each row contains statistics for all executions of a particular transaction that Spanner captures statistics for during the specified interval.
- If Spanner can't store statistics for all transactions run during an interval in these tables, the system prioritizes transactions with the highest latency, commit attempts, and bytes written during that interval.
- All columns in the tables are nullable.
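For example, to check which 1 minute intervals currently have data, you can list the most recent INTERVAL_END values; a minimal sketch (the same pattern works for the 10 minute and 1 hour tables):
-- List the five most recent 1 minute intervals that have transaction statistics.
SELECT interval_end
FROM spanner_sys.txn_stats_top_minute
GROUP BY interval_end
ORDER BY interval_end DESC
LIMIT 5;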
Table schema
Column name | Type | Description |
---|---|---|
INTERVAL_END | TIMESTAMP | End of the time interval in which the included transaction executions occurred. |
TRANSACTION_TAG | STRING | The optional transaction tag for this transaction operation. For more information about using tags, see Troubleshooting with transaction tags. Statistics for multiple transactions that have the same tag string are grouped in a single row with the `TRANSACTION_TAG` matching that tag string. |
FPRINT | INT64 | The hash of the TRANSACTION_TAG if present; otherwise, a hash calculated based on the operations involved in the transaction. INTERVAL_END and FPRINT together act as a unique key for these tables. |
READ_COLUMNS | ARRAY<STRING> | The set of columns that were read by the transaction. |
WRITE_CONSTRUCTIVE_COLUMNS | ARRAY<STRING> | The set of columns that were constructively written (that is, assigned new values) by the transaction. If the transaction involved writes to columns and tables watched by a change stream, WRITE_CONSTRUCTIVE_COLUMNS also contains two columns, .data and ._exists¹, prefixed with the change stream name. |
WRITE_DELETE_TABLES | ARRAY<STRING> | The set of tables that had rows deleted or replaced by the transaction. |
ATTEMPT_COUNT | INT64 | Total number of times that the transaction was attempted, including attempts that aborted before calling `commit`. |
COMMIT_ATTEMPT_COUNT | INT64 | Total number of transaction commit attempts. This must match the number of calls to the transaction's commit method. |
COMMIT_ABORT_COUNT | INT64 | Total number of transaction attempts that were aborted, including those that were aborted before calling the transaction's commit method. |
COMMIT_RETRY_COUNT | INT64 | Total number of attempts that are retries of previously aborted attempts. A Spanner transaction might be tried multiple times before it commits due to lock contention or transient events. A high number of retries relative to commit attempts indicates that there might be issues worth investigating. For more information, see Understanding transactions and commit counts on this page. |
COMMIT_FAILED_PRECONDITION_COUNT | INT64 | Total number of transaction commit attempts that returned failed precondition errors, such as UNIQUE index violations, row already exists, row not found, and so on. |
AVG_PARTICIPANTS | FLOAT64 | Average number of participants in each commit attempt. To learn more about participants, see Life of Spanner Reads & Writes. |
AVG_TOTAL_LATENCY_SECONDS | FLOAT64 | Average seconds taken from the first operation of the transaction to commit or abort. |
AVG_COMMIT_LATENCY_SECONDS | FLOAT64 | Average seconds taken to perform the commit operation. |
AVG_BYTES | FLOAT64 | Average number of bytes written by the transaction. |
TOTAL_LATENCY_DISTRIBUTION | ARRAY<STRUCT> | A histogram of the total commit latency, which is the time from the start of the first transactional operation to the commit or abort time, across all attempts of a transaction. If a transaction aborts multiple times and then successfully commits, latency is measured for each attempt until the final successful commit. The values are measured in seconds. The array contains a single element. To calculate a desired percentile latency from the distribution, use the SPANNER_SYS.DISTRIBUTION_PERCENTILE function, as shown in the example after this table. For more information, see Percentiles and distribution-valued metrics. |
OPERATIONS_BY_TABLE | ARRAY<STRUCT> | Impact of the transaction, broken down by table. This column helps visualize the load on tables and provides insights into the rate at which a transaction writes to tables. |

¹ _exists is an internal field that is used to check whether a certain row exists or not.
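As an illustration of how you might use TOTAL_LATENCY_DISTRIBUTION, the following sketch estimates a per-transaction 99th percentile total latency for the most recent 10 minute interval, using the SPANNER_SYS.DISTRIBUTION_PERCENTILE function that also appears in the aggregate statistics examples later on this page:
-- Estimate p99 total latency per transaction shape for the latest 10 minute interval.
SELECT interval_end,
       fprint,
       SPANNER_SYS.DISTRIBUTION_PERCENTILE(total_latency_distribution[OFFSET(0)], 99.0)
         AS p99_total_latency_seconds
FROM spanner_sys.txn_stats_top_10minute
WHERE interval_end =
  (SELECT MAX(interval_end)
   FROM spanner_sys.txn_stats_top_10minute)
ORDER BY p99_total_latency_seconds DESC;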
Example queries
This section includes several example SQL statements that retrieve transaction statistics. You can run these SQL statements using the client libraries, the gcloud spanner databases execute-sql command, or the Google Cloud console.
List the basic statistics for each transaction in a given time period
The following query returns the raw data for the top transactions in the previous minute.
SELECT fprint,
read_columns,
write_constructive_columns,
write_delete_tables,
avg_total_latency_seconds,
avg_commit_latency_seconds,
operations_by_table,
avg_bytes
FROM spanner_sys.txn_stats_top_minute
WHERE interval_end =
(SELECT MAX(interval_end)
FROM spanner_sys.txn_stats_top_minute);
Query output
fprint | read_columns | write_constructive_columns | write_delete_tables | avg_total_latency_seconds | avg_commit_latency_seconds | operations_by_table | avg_bytes |
---|---|---|---|---|---|---|---|
40015598317 | [] | ["Routes.name", "Cars.model"] | ["Users"] | 0.006578737 | 0.006547737 | [["Cars",1107,30996],["Routes",560,26880]] | 25286 |
20524969030 | ["id", "no"] | [] | [] | 0.001732442 | 0.000247442 | [] | 0 |
77848338483 | [] | [] | ["Cars", "Routes"] | 0.033467418 | 0.000251418 | [] | 0 |
List the transactions with the highest average commit latency
The following query returns the transactions with high average commit latency in the previous hour, ordered from highest to lowest average commit latency.
SELECT fprint,
read_columns,
write_constructive_columns,
write_delete_tables,
avg_total_latency_seconds,
avg_commit_latency_seconds,
avg_bytes
FROM spanner_sys.txn_stats_top_hour
WHERE interval_end =
(SELECT MAX(interval_end)
FROM spanner_sys.txn_stats_top_hour)
ORDER BY avg_commit_latency_seconds DESC;
Query output
fprint | read_columns | write_constructive_columns | write_delete_tables | avg_total_latency_seconds | avg_commit_latency_seconds | avg_bytes |
---|---|---|---|---|---|---|
40015598317 | [] | ["Routes.name", "Cars.model"] | ["Users"] | 0.006578737 | 0.006547737 | 25286 |
77848338483 | [] | [] | ["Cars", "Routes"] | 0.033467418 | 0.000251418 | 0 |
20524969030 | ["id", "no"] | [] | [] | 0.001732442 | 0.000247442 | 0 |
Find the average latency of transactions which read certain columns
The following query returns the average latency information for transactions that read the column ADDRESS from 1 hour stats:
SELECT fprint,
read_columns,
write_constructive_columns,
write_delete_tables,
avg_total_latency_seconds
FROM spanner_sys.txn_stats_top_hour
WHERE 'ADDRESS' IN UNNEST(read_columns)
ORDER BY avg_total_latency_seconds DESC;
Query output
fprint | read_columns | write_constructive_columns | write_delete_tables | avg_total_latency_seconds |
---|---|---|---|---|
77848338483 | ["ID", "ADDRESS"] | [] | ["Cars", "Routes"] | 0.033467418 |
40015598317 | ["ID", "NAME", "ADDRESS"] | [] | ["Users"] | 0.006578737 |
List transactions by the average number of bytes modified
The following query returns transactions from the 1 hour statistics, ordered by the average number of bytes modified by the transaction.
SELECT fprint,
read_columns,
write_constructive_columns,
write_delete_tables,
avg_bytes
FROM spanner_sys.txn_stats_top_hour
ORDER BY avg_bytes DESC;
Query output
fprint | read_columns | write_constructive_columns | write_delete_tables | avg_bytes |
---|---|---|---|---|
40015598317 | [] | [] | ["Users"] | 25286 |
77848338483 | [] | [] | ["Cars", "Routes"] | 12005 |
20524969030 | ["ID", "ADDRESS"] | [] | ["Users"] | 10923 |
Aggregate statistics
SPANNER_SYS also contains tables that store aggregate data for all the transactions for which Spanner captured statistics in a specific time period:
- SPANNER_SYS.TXN_STATS_TOTAL_MINUTE: Aggregate statistics for all transactions during 1 minute intervals.
- SPANNER_SYS.TXN_STATS_TOTAL_10MINUTE: Aggregate statistics for all transactions during 10 minute intervals.
- SPANNER_SYS.TXN_STATS_TOTAL_HOUR: Aggregate statistics for all transactions during 1 hour intervals.
Aggregate statistics tables have the following properties:
- Each table contains data for non-overlapping time intervals of the length the table name specifies.
- Intervals are based on clock times. 1 minute intervals end on the minute, 10 minute intervals end every 10 minutes starting on the hour, and 1 hour intervals end on the hour. For example, at 11:59:30 AM, the most recent intervals available to SQL queries on aggregate transaction statistics are:
  - 1 minute: 11:58:00–11:58:59 AM
  - 10 minute: 11:40:00–11:49:59 AM
  - 1 hour: 10:00:00–10:59:59 AM
- Each row contains statistics for all transactions executed over the database during the specified interval, aggregated together. There is only one row per time interval.
- The statistics captured in the SPANNER_SYS.TXN_STATS_TOTAL_* tables might include transactions that Spanner did not capture in the SPANNER_SYS.TXN_STATS_TOP_* tables (see the example query after this list).
- Some columns in these tables are exposed as metrics in Cloud Monitoring. The exposed metrics are:
  - Commit attempts count
  - Commit retries count
  - Transaction participants
  - Transaction latencies
  - Bytes written
  For more information, see Spanner metrics.
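To get a rough sense of how much of the aggregate activity the top-transaction tables account for in a recent interval, you can compare the two table families, as referenced in the list above; a minimal sketch (small differences are expected because the TOTAL tables can include transactions that the TOP tables don't capture):
-- Compare total commit attempts with the sum over the captured top transactions
-- for the most recent 10 minute interval.
SELECT t.interval_end,
       t.commit_attempt_count AS total_commit_attempts,
       SUM(s.commit_attempt_count) AS top_commit_attempts
FROM spanner_sys.txn_stats_total_10minute AS t
JOIN spanner_sys.txn_stats_top_10minute AS s
  ON s.interval_end = t.interval_end
WHERE t.interval_end =
  (SELECT MAX(interval_end)
   FROM spanner_sys.txn_stats_total_10minute)
GROUP BY t.interval_end, t.commit_attempt_count;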
Table schema
Column name | Type | Description |
---|---|---|
INTERVAL_END | TIMESTAMP | End of the time interval in which this statistic was captured. |
ATTEMPT_COUNT | INT64 | Total number of times that the transactions were attempted, including attempts that aborted before calling `commit`. |
COMMIT_ATTEMPT_COUNT | INT64 | Total number of transaction commit attempts. This must match the number of calls to the transaction's commit method. |
COMMIT_ABORT_COUNT | INT64 | Total number of transaction attempts that were aborted, including those that were aborted before calling the transaction's commit method. |
COMMIT_RETRY_COUNT | INT64 | Number of commit attempts that are retries of previously aborted attempts. A Spanner transaction might have been tried multiple times before it commits due to lock contention or transient events. A high number of retries relative to commit attempts indicates that there might be issues worth investigating. For more information, see Understanding transactions and commit counts on this page. |
COMMIT_FAILED_PRECONDITION_COUNT | INT64 | Total number of transaction commit attempts that returned failed precondition errors, such as UNIQUE index violations, row already exists, row not found, and so on. |
AVG_PARTICIPANTS | FLOAT64 | Average number of participants in each commit attempt. To learn more about participants, see Life of Spanner Reads & Writes. |
AVG_TOTAL_LATENCY_SECONDS | FLOAT64 | Average seconds taken from the first operation of the transaction to commit or abort. |
AVG_COMMIT_LATENCY_SECONDS | FLOAT64 | Average seconds taken to perform the commit operation. |
AVG_BYTES | FLOAT64 | Average number of bytes written by the transaction. |
TOTAL_LATENCY_DISTRIBUTION | ARRAY<STRUCT> | A histogram of the total commit latency, which is the time from the start of the first transactional operation to the commit or abort time, across all transaction attempts. If a transaction aborts multiple times and then successfully commits, latency is measured for each attempt until the final successful commit. The values are measured in seconds. The array contains a single element. To calculate a desired percentile latency from the distribution, use the SPANNER_SYS.DISTRIBUTION_PERCENTILE function, as shown in the 99th percentile example on this page. For more information, see Percentiles and distribution-valued metrics. |
OPERATIONS_BY_TABLE | ARRAY<STRUCT> | Impact of the transactions, broken down by table. This column helps visualize the load on tables and provides insights into the rate at which the transactions write to tables. |
Example queries
This section includes several example SQL statements that retrieve transaction statistics. You can run these SQL statements using the client libraries, the gcloud spanner databases execute-sql command, or the Google Cloud console.
Find the total number of commit attempts for all transactions
The following query returns the total number of commit attempts for all transactions in the most recent complete 1-minute interval:
SELECT interval_end,
commit_attempt_count
FROM spanner_sys.txn_stats_total_minute
WHERE interval_end =
(SELECT MAX(interval_end)
FROM spanner_sys.txn_stats_total_minute)
ORDER BY interval_end;
Query output
interval_end | commit_attempt_count |
---|---|
2020-01-17 11:46:00-08:00 | 21 |
Note that there is only one row in the result because aggregated stats have only
one entry per interval_end
for any time duration.
Find the total commit latency across all transactions
The following query returns the total commit latency across all transactions in the previous 10 minutes:
SELECT (avg_commit_latency_seconds * commit_attempt_count / 60 / 60)
AS total_commit_latency_hours
FROM spanner_sys.txn_stats_total_10minute
WHERE interval_end =
(SELECT MAX(interval_end)
FROM spanner_sys.txn_stats_total_10minute);
Query output
total_commit_latency_hours |
---|
0.8967 |
Note that there is only one row in the result because aggregated stats have only
one entry per interval_end
for any time duration.
Find the 99th percentile latency for transactions
The following query returns the 99th percentile latency for transactions run in the previous 10 minutes:
SELECT interval_end, avg_total_latency_seconds,
SPANNER_SYS.DISTRIBUTION_PERCENTILE(total_latency_distribution[OFFSET(0)], 99.0)
AS percentile_latency
FROM spanner_sys.txn_stats_total_10minute
WHERE interval_end =
(SELECT MAX(interval_end)
FROM spanner_sys.txn_stats_total_10minute)
ORDER BY interval_end;
Query output
interval_end | avg_total_latency_seconds | percentile_latency |
---|---|---|
2022-08-17 11:46:00-08:00 | 0.34576998305986395 | 9.00296190476190476 |
Notice the large difference between the average and the 99th percentile latency. The 99th percentile latency helps identify possible outlier transactions with high latency.
There's only one row in the result because aggregated stats have only
one entry per interval_end
for any time duration.
Data retention
At a minimum, Spanner keeps data for each table for the following time periods:
- SPANNER_SYS.TXN_STATS_TOP_MINUTE and SPANNER_SYS.TXN_STATS_TOTAL_MINUTE: Intervals covering the previous 6 hours.
- SPANNER_SYS.TXN_STATS_TOP_10MINUTE and SPANNER_SYS.TXN_STATS_TOTAL_10MINUTE: Intervals covering the previous 4 days.
- SPANNER_SYS.TXN_STATS_TOP_HOUR and SPANNER_SYS.TXN_STATS_TOTAL_HOUR: Intervals covering the previous 30 days.
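To check how far back statistics are actually available in your database, you can look at the oldest and newest intervals present in a given table; a minimal sketch using the 1 hour top table:
-- Show the retention window currently available for 1 hour transaction statistics.
SELECT MIN(interval_end) AS oldest_interval_end,
       MAX(interval_end) AS newest_interval_end
FROM spanner_sys.txn_stats_top_hour;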
Transaction statistics in Spanner give insight into how an application is using the database and are useful when investigating performance issues. For example, you can check whether there are any slow-running transactions that might be causing contention, or you can identify potential sources of high load, such as large volumes of updates to a particular column. The following steps show how to use transaction statistics to investigate contention in your database.
Understand transactions and commit counts
A Spanner transaction might have to be tried multiple times before it commits. This most commonly occurs when two transactions attempt to work on the same data at the same time, and one of the transactions needs to be aborted to preserve the isolation property of the transaction. Other transient events that can also cause a transaction to be aborted include:
- Transient network issues.
- Database schema changes being applied while a transaction is in the process of committing.
- The Spanner instance not having enough capacity to handle all the requests it is receiving.
In such scenarios, a client should retry the aborted transaction until it commits successfully or times out. For users of the official Spanner client libraries, each library has implemented an automatic retry mechanism. If you are using a custom version of the client code, wrap your transaction commits in a retry loop.
A Spanner transaction may also be aborted due to a non-retriable error such as a transaction timeout, permission issues, or an invalid table/column name. There is no need to retry such transactions and the Spanner client library will return the error immediately.
The following table describes some examples of how COMMIT_ATTEMPT_COUNT, COMMIT_ABORT_COUNT, and COMMIT_RETRY_COUNT are logged in different scenarios.
Scenario | COMMIT_ATTEMPT_COUNT | COMMIT_ABORT_COUNT | COMMIT_RETRY_COUNT |
---|---|---|---|
Transaction successfully committed on first attempt. | 1 | 0 | 0 |
Transaction aborted due to a timeout error. | 1 | 1 | 0 |
Transaction aborted due to transient network issue and successfully committed after one retry. | 2 | 1 | 1 |
5 transactions with the same FPRINT are executed within a 10-minute interval. 3 of the transactions successfully committed on first attempt, while 2 transactions were aborted and then committed successfully on the first retry. | 7 | 2 | 2 |
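To see how these counters relate for the transactions in a recent interval in your own database, you can compute abort and retry ratios directly from the statistics; a minimal sketch (SAFE_DIVIDE returns NULL instead of an error when a count is 0):
-- Rank transaction shapes in the latest 10 minute interval by their abort ratio.
SELECT fprint,
       commit_attempt_count,
       commit_abort_count,
       commit_retry_count,
       SAFE_DIVIDE(commit_abort_count, commit_attempt_count) AS abort_ratio,
       SAFE_DIVIDE(commit_retry_count, commit_abort_count) AS retry_ratio
FROM spanner_sys.txn_stats_top_10minute
WHERE interval_end =
  (SELECT MAX(interval_end)
   FROM spanner_sys.txn_stats_top_10minute)
ORDER BY abort_ratio DESC;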
The data in the transaction statistics tables is aggregated per time interval. For a particular interval, it is possible that a transaction abort and its retry happen around an interval boundary and fall into different intervals. As a result, aborts and retries might not be equal within a particular time interval.
These stats are designed for troubleshooting and introspection and are not guaranteed to be 100% accurate. Statistics are aggregated in memory before being stored in Spanner tables. During an upgrade or other maintenance activities, Spanner servers can restart, affecting the accuracy of the numbers.
Troubleshoot database contentions using transaction statistics
You can use SQL code or the Transaction insights dashboard to view the transactions in your database that might cause high latencies due to lock contention.
The following topics show how you can investigate such transactions by using SQL code.
Select a time period to investigate
You can identify this time period from the application that is using Spanner. For the purposes of this exercise, let's say the issue started occurring at around 5:20 PM on May 17, 2020.
You can use transaction tags to identify the source of the transaction and to correlate between the transaction statistics and lock statistics tables for effective lock contention troubleshooting. Read more at Troubleshooting with transaction tags.
Gather transaction statistics for the selected time period
To start our investigation, we'll query the TXN_STATS_TOTAL_10MINUTE
table
around the start of the issue. The results of this query will show us how
latency and other transaction statistics changed over that period of time.
For example, the following query returns the aggregated transaction statistics
from 4:30 pm
to 7:40 pm
(inclusive).
SELECT
interval_end,
ROUND(avg_total_latency_seconds,4) as avg_total_latency_seconds,
commit_attempt_count,
commit_abort_count
FROM SPANNER_SYS.TXN_STATS_TOTAL_10MINUTE
WHERE
interval_end >= "2020-05-17T16:40:00"
AND interval_end <= "2020-05-17T19:40:00"
ORDER BY interval_end;
The following table lists example data returned from our query.
interval_end | avg_total_latency_seconds | commit_attempt_count | commit_abort_count |
---|---|---|---|
2020-05-17 16:40:00-07:00 | 0.0284 | 315691 | 5170 |
2020-05-17 16:50:00-07:00 | 0.0250 | 302124 | 3828 |
2020-05-17 17:00:00-07:00 | 0.0460 | 346087 | 11382 |
2020-05-17 17:10:00-07:00 | 0.0864 | 379964 | 33826 |
2020-05-17 17:20:00-07:00 | 0.1291 | 390343 | 52549 |
2020-05-17 17:30:00-07:00 | 0.1314 | 456455 | 76392 |
2020-05-17 17:40:00-07:00 | 0.1598 | 507774 | 121458 |
2020-05-17 17:50:00-07:00 | 0.1641 | 516587 | 115875 |
2020-05-17 18:00:00-07:00 | 0.1578 | 552711 | 122626 |
2020-05-17 18:10:00-07:00 | 0.1750 | 569460 | 154205 |
2020-05-17 18:20:00-07:00 | 0.1727 | 613571 | 160772 |
2020-05-17 18:30:00-07:00 | 0.1588 | 601994 | 143044 |
2020-05-17 18:40:00-07:00 | 0.2025 | 604211 | 170019 |
2020-05-17 18:50:00-07:00 | 0.1615 | 601622 | 135601 |
2020-05-17 19:00:00-07:00 | 0.1653 | 596804 | 129511 |
2020-05-17 19:10:00-07:00 | 0.1414 | 560023 | 112247 |
2020-05-17 19:20:00-07:00 | 0.1367 | 570864 | 100596 |
2020-05-17 19:30:00-07:00 | 0.0894 | 539729 | 65316 |
2020-05-17 19:40:00-07:00 | 0.0820 | 479151 | 40398 |
Here we see that aggregated latency and abort count are higher in some periods; in this example, they climb sharply from the interval ending at 17:10 onward. We can pick any 10 minute interval where aggregated latency and/or abort count are high. Let's choose the interval ending at 2020-05-17T18:40:00 and use it in the next step to identify which transactions are contributing to high latency and abort count.
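Rather than scanning the table by eye, you can also let SQL pick the interval with the highest abort count in the same window; a minimal sketch:
-- Find the 10 minute interval with the most commit aborts in the investigation window.
SELECT interval_end,
       commit_abort_count
FROM SPANNER_SYS.TXN_STATS_TOTAL_10MINUTE
WHERE
  interval_end >= "2020-05-17T16:40:00"
  AND interval_end <= "2020-05-17T19:40:00"
ORDER BY commit_abort_count DESC
LIMIT 1;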
Identify transactions that are experiencing high latency
Let's now query the TXN_STATS_TOP_10MINUTE
table for the interval which was
picked in the previous step. Using this data, we can start to identify which
transactions are experiencing high latency and/or high abort count.
Execute the following query to get top performance-impacting transactions in
descending order of total latency for our example interval ending at
2020-05-17T18:40:00
.
SELECT
interval_end,
fprint,
ROUND(avg_total_latency_seconds,4) as avg_total_latency_seconds,
ROUND(avg_commit_latency_seconds,4) as avg_commit_latency_seconds,
commit_attempt_count,
commit_abort_count,
commit_retry_count
FROM SPANNER_SYS.TXN_STATS_TOP_10MINUTE
WHERE
interval_end = "2020-05-17T18:40:00"
ORDER BY avg_total_latency_seconds DESC;
interval_end | fprint | avg_total_latency_seconds | avg_commit_latency_seconds | commit_attempt_count | commit_abort_count | commit_retry_count |
---|---|---|---|---|---|---|
2020-05-17 18:40:00-07:00 | 15185072816865185658 | 0.3508 | 0.0139 | 278802 | 142205 | 129884 |
2020-05-17 18:40:00-07:00 | 15435530087434255496 | 0.1633 | 0.0142 | 129012 | 27177 | 24559 |
2020-05-17 18:40:00-07:00 | 14175643543447671202 | 0.1423 | 0.0133 | 5357 | 636 | 433 |
2020-05-17 18:40:00-07:00 | 898069986622520747 | 0.0198 | 0.0158 | 6 | 0 | 0 |
2020-05-17 18:40:00-07:00 | 10510121182038036893 | 0.0168 | 0.0125 | 7 | 0 | 0 |
2020-05-17 18:40:00-07:00 | 9287748709638024175 | 0.0159 | 0.0118 | 4269 | 1 | 0 |
2020-05-17 18:40:00-07:00 | 7129109266372596045 | 0.0142 | 0.0102 | 182227 | 0 | 0 |
2020-05-17 18:40:00-07:00 | 15630228555662391800 | 0.0120 | 0.0107 | 58 | 0 | 0 |
2020-05-17 18:40:00-07:00 | 7907238229716746451 | 0.0108 | 0.0097 | 65 | 0 | 0 |
2020-05-17 18:40:00-07:00 | 10158167220149989178 | 0.0095 | 0.0047 | 3454 | 0 | 0 |
2020-05-17 18:40:00-07:00 | 9353100217060788102 | 0.0093 | 0.0045 | 725 | 0 | 0 |
2020-05-17 18:40:00-07:00 | 9521689070912159706 | 0.0093 | 0.0045 | 164 | 0 | 0 |
2020-05-17 18:40:00-07:00 | 11079878968512225881 | 0.0064 | 0.0019 | 65 | 0 | 0 |
The first row in the preceding table shows a transaction experiencing high latency because of a high number of commit aborts. We can also see a high number of commit retries, which indicates that the aborted commits were subsequently retried. In the next step, we'll investigate further to see what is causing this issue.
Identify the columns involved in a transaction experiencing high latency
In this step, we'll check whether high latency transactions are operating on the
same set of columns by fetching read_columns
, write_constructive_columns
and
write_delete_tables
data for transactions with high abort count. The FPRINT
value will also be useful in the next step.
SELECT
fprint,
read_columns,
write_constructive_columns,
write_delete_tables
FROM SPANNER_SYS.TXN_STATS_TOP_10MINUTE
WHERE
interval_end = "2020-05-17T18:40:00"
ORDER BY avg_total_latency_seconds DESC LIMIT 3;
fprint | read_columns | write_constructive_columns | write_delete_tables |
---|---|---|---|
15185072816865185658 | [TestHigherLatency._exists,TestHigherLatency.lang_status,TestHigherLatency.score,globalTagAffinity.shares] |
[TestHigherLatency._exists,TestHigherLatency.shares,TestHigherLatency_lang_status_score_index.shares] |
[] |
15435530087434255496 | [TestHigherLatency._exists,TestHigherLatency.lang_status,TestHigherLatency.likes,globalTagAffinity.score] |
[TestHigherLatency._exists,TestHigherLatency.likes,TestHigherLatency_lang_status_score_index.likes] |
[] |
14175643543447671202 | [TestHigherLatency._exists,TestHigherLatency.lang_status,TestHigherLatency.score,globalTagAffinity.ugcCount] |
[TestHigherLatency._exists,TestHigherLatency.ugcCount,TestHigherLatency_lang_status_score_index.ugcCount] |
[] |
As the output shows in the preceding table, the transactions with the highest
average total latency are reading the same columns. We can also observe some
write contention since the transactions are writing to the same column, that is
to say, TestHigherLatency._exists
.
Determine how the transaction performance has changed over time
We can see how the statistics associated with this transaction shape have changed over a period of time. Use the following query, where $FPRINT is the fingerprint of the high latency transaction from the previous step.
SELECT
interval_end,
ROUND(avg_total_latency_seconds, 3) AS latency,
ROUND(avg_commit_latency_seconds, 3) AS commit_latency,
commit_attempt_count,
commit_abort_count,
commit_retry_count,
commit_failed_precondition_count,
avg_bytes
FROM SPANNER_SYS.TXN_STATS_TOP_10MINUTE
WHERE
interval_end >= "2020-05-17T16:40:00"
AND interval_end <= "2020-05-17T19:40:00"
AND fprint = $FPRINT
ORDER BY interval_end;
interval_end | latency | commit_latency | commit_attempt_count | commit_abort_count | commit_retry_count | commit_failed_precondition_count | avg_bytes |
---|---|---|---|---|---|---|---|
2020-05-17 16:40:00-07:00 | 0.095 | 0.010 | 53230 | 4752 | 4330 | 0 | 91 |
2020-05-17 16:50:00-07:00 | 0.069 | 0.009 | 61264 | 3589 | 3364 | 0 | 91 |
2020-05-17 17:00:00-07:00 | 0.150 | 0.010 | 75868 | 10557 | 9322 | 0 | 91 |
2020-05-17 17:10:00-07:00 | 0.248 | 0.013 | 103151 | 30220 | 28483 | 0 | 91 |
2020-05-17 17:20:00-07:00 | 0.310 | 0.012 | 130078 | 45655 | 41966 | 0 | 91 |
2020-05-17 17:30:00-07:00 | 0.294 | 0.012 | 160064 | 64930 | 59933 | 0 | 91 |
2020-05-17 17:40:00-07:00 | 0.315 | 0.013 | 209614 | 104949 | 96770 | 0 | 91 |
2020-05-17 17:50:00-07:00 | 0.322 | 0.012 | 215682 | 100408 | 95867 | 0 | 90 |
2020-05-17 18:00:00-07:00 | 0.310 | 0.012 | 230932 | 106728 | 99462 | 0 | 91 |
2020-05-17 18:10:00-07:00 | 0.309 | 0.012 | 259645 | 131049 | 125889 | 0 | 91 |
2020-05-17 18:20:00-07:00 | 0.315 | 0.013 | 272171 | 137910 | 129411 | 0 | 90 |
2020-05-17 18:30:00-07:00 | 0.292 | 0.013 | 258944 | 121475 | 115844 | 0 | 91 |
2020-05-17 18:40:00-07:00 | 0.350 | 0.013 | 278802 | 142205 | 134229 | 0 | 91 |
2020-05-17 18:50:00-07:00 | 0.302 | 0.013 | 256259 | 115626 | 109756 | 0 | 91 |
2020-05-17 19:00:00-07:00 | 0.315 | 0.014 | 250560 | 110662 | 100322 | 0 | 91 |
2020-05-17 19:10:00-07:00 | 0.271 | 0.014 | 238384 | 99025 | 90187 | 0 | 91 |
2020-05-17 19:20:00-07:00 | 0.273 | 0.014 | 219687 | 84019 | 79874 | 0 | 91 |
2020-05-17 19:30:00-07:00 | 0.198 | 0.013 | 195357 | 59370 | 55909 | 0 | 91 |
2020-05-17 19:40:00-07:00 | 0.181 | 0.013 | 167514 | 35705 | 32885 | 0 | 91 |
In the above output, we can observe that total latency is high throughout the affected period of time. Wherever total latency is high, commit_attempt_count, commit_abort_count, and commit_retry_count are also high, even though commit latency (commit_latency) has not changed very much. Because transaction commits are getting aborted more frequently, commit attempts are also high due to commit retries.
Conclusion
In this example, we saw that a high commit abort count was the cause of the high latency. The next step is to look at the commit abort error messages received by the application to learn the reason for the aborts. By inspecting logs in the application, we see that the application actually changed its workload during this time: another transaction shape showed up with a high number of commit attempts per second, and that different transaction (perhaps a nightly cleanup job) was responsible for the additional lock conflicts.
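To confirm which other transaction shape ramped up during the affected window, a query like the following (a sketch reusing the interval from the earlier steps) ranks shapes by commit attempts and shows the columns they touch:
-- Rank transaction shapes in the affected interval by commit attempts.
SELECT fprint,
       commit_attempt_count,
       commit_abort_count,
       read_columns,
       write_constructive_columns
FROM SPANNER_SYS.TXN_STATS_TOP_10MINUTE
WHERE interval_end = "2020-05-17T18:40:00"
ORDER BY commit_attempt_count DESC
LIMIT 10;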
Identify transactions that were not retried correctly
The following query returns transactions that have a high commit abort count but no retries.
SELECT
*
FROM (
SELECT
fprint,
SUM(commit_attempt_count) AS total_commit_attempt_count,
SUM(commit_abort_count) AS total_commit_abort_count,
SUM(commit_retry_count) AS total_commit_retry_count
FROM
SPANNER_SYS.TXN_STATS_TOP_10MINUTE
GROUP BY
fprint )
WHERE
total_commit_retry_count = 0
AND total_commit_abort_count > 0
ORDER BY
total_commit_abort_count DESC;
fprint | total_commit_attempt_count | total_commit_abort_count | total_commit_retry_count |
---|---|---|---|
1557557373282541312 | 3367894 | 44232 | 0 |
5776062322886969344 | 13566 | 14 | 0 |
We can see that the transaction with fprint 1557557373282541312 was aborted 44232 times, but it was never retried. This looks suspicious because the abort count is high and it is unlikely that every abort was caused by a non-retriable error. On the other hand, for the transaction with fprint 5776062322886969344, it is less suspicious because the total abort count is not that high.
The following query returns more details about the transaction with fprint 1557557373282541312, including the read_columns, write_constructive_columns, and write_delete_tables. This information helps to identify the transaction in client code, where the retry logic can be reviewed for this scenario.
SELECT
interval_end,
fprint,
read_columns,
write_constructive_columns,
write_delete_tables,
commit_attempt_count,
commit_abort_count,
commit_retry_count
FROM
SPANNER_SYS.TXN_STATS_TOP_10MINUTE
WHERE
fprint = 1557557373282541312
ORDER BY
interval_end DESC;
interval_end | fprint | read_columns | write_constructive_columns | write_delete_tables | commit_attempt_count | commit_abort_count | commit_retry_count |
---|---|---|---|---|---|---|---|
2021-01-27T18:30:00Z | 1557557373282541312 | ['Singers._exists'] | ['Singers.FirstName', 'Singers.LastName', 'Singers._exists'] | [] | 805228 | 1839 | 0 |
2021-01-27T18:20:00Z | 1557557373282541312 | ['Singers._exists'] | ['Singers.FirstName', 'Singers.LastName', 'Singers._exists'] | [] | 1034429 | 38779 | 0 |
2021-01-27T18:10:00Z | 1557557373282541312 | ['Singers._exists'] | ['Singers.FirstName', 'Singers.LastName', 'Singers._exists'] | [] | 833677 | 2266 | 0 |
2021-01-27T18:00:00Z | 1557557373282541312 | ['Singers._exists'] | ['Singers.FirstName', 'Singers.LastName', 'Singers._exists'] | [] | 694560 | 1348 | 0 |
We can see that the transaction involves a read of the Singers._exists hidden column to check the existence of a row. The transaction also writes to the Singers.FirstName and Singers.LastName columns. This information can help determine whether the transaction retry mechanism implemented in your custom client library is working as expected.
What's next
- Learn about other Introspection tools.
- Learn about other information Spanner stores for each database in the database's information schema tables.
- Learn more about SQL best practices for Spanner.
- Learn more about Investigating high CPU utilization.