Monitor instances with system insights

This page describes how to use the system insights dashboard to monitor Spanner instances and databases.

About system insights

The system insights dashboard displays scorecards and graphs for a selected instance or database, and provides measures of latencies, CPU utilization, storage, throughput, and other performance statistics. You can view charts for several different time periods, ranging from the past hour to the past 30 days.

The system insights dashboard includes the following sections (refer to the screenshot):

  1. Databases list: Shows statistics of the selected database. You can view a single database or an aggregate of all databases. This is available for instances only.
  2. Layout toggle: Toggles between a single-column or two-column layout.
  3. Time range filter: Filters the statistics by time ranges, such as hours, days, or a custom range.
  4. Scorecards: Displays statistics at a point in time, over the selected period.
  5. Graphs: Displays graphs of CPU utilization, throughputs, latencies, storage use, and more.

    If you create a partition (in preview) in your instance, you see an additional drop-down list to view graphs for a single partition or an aggregate of all partitions. You don't see this drop-down list if you haven't created any partitions.

System insights dashboard

System insights scorecards, charts, and metrics

The system insights dashboard provides the following charts and metrics to show an instance's current and historical status. Most charts and metrics are available at the instance level. You can also view many charts and metrics for a single database within an instance.

Available scorecards

Name Description
CPU utilization Total CPU use within an instance or selected database. In a dual-region or multi-region instance, this metric represents the mean of CPU utilization across regions.
Latency: P99 P99 latency for read and write operations within an instance or selected database.
Latency: P50 P50 latency for read and write operations within an instance or selected database.
Throughput Amount of uncompressed data that was read from, or written to, the instance or database each second. This value is measured in binary megabytes (MB), where 1 MB is 2^20 bytes. This unit of measurement is also known as a mebibyte (MiB).
Operations Per Second Number of read and write operations per second (rate) within an instance or selected database.
Storage utilization At the instance level, this is the total storage utilization percentage within an instance. At the database level, this is the total storage used for the selected database.
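
The binary units used by the Throughput scorecard can be checked with a small conversion helper. The following is a minimal sketch; the function names are hypothetical and aren't part of any Spanner API.

```python
# Binary (power-of-two) unit sizes, as described for the Throughput scorecard.
BYTES_PER_MIB = 2 ** 20  # 1 binary megabyte (mebibyte)
BYTES_PER_GIB = 2 ** 30  # 1 binary gigabyte (gibibyte)


def bytes_to_mib(num_bytes: float) -> float:
    """Convert a byte count to binary megabytes (MiB)."""
    return num_bytes / BYTES_PER_MIB


def bytes_to_gib(num_bytes: float) -> float:
    """Convert a byte count to binary gigabytes (GiB)."""
    return num_bytes / BYTES_PER_GIB


# Example: 5,242,880 bytes per second is 5 MiB per second.
print(bytes_to_mib(5_242_880))  # 5.0
```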

Available charts and metrics

The following is a chart for a sample metric:

image

The toolbar on each chart card provides the following set of standard options:

  • To zoom into a particular section of a chart, click the chart and drag horizontally or vertically. To revert the zoom operation, click Reset zoom. Zoom operations apply to all charts on the dashboard at the same time.

  • To hide or display the legend, click Expand/collapse chart legend.

  • To view a chart in full-screen mode, click Enter/exit fullscreen. You can also exit full-screen mode by pressing Esc.

  • To view additional options, click More chart options.

    Most charts offer these options:

    • Download a PNG image.
    • Download a CSV file.
    • Add to Custom Dashboard. This option lets you add a chart to a new dashboard or an existing dashboard in Cloud Monitoring.
    • View in Metrics Explorer. This option opens the metric in Metrics Explorer. You can view other Spanner metrics in Metrics Explorer after selecting the Spanner Database resource type.

The following table describes the charts that appear by default on the system insights dashboard. The metric type for each chart is listed. The metric type strings follow this prefix: spanner.googleapis.com/. A metric type describes measurements that can be collected from a monitored resource.
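
To look up a chart's metric outside the dashboard, you combine the prefix with the listed metric type. The following is a minimal sketch that builds the full metric type and a Cloud Monitoring filter string. The helper names are hypothetical, and the use of the spanner_instance resource type with its instance_id label is an assumption that you should verify against your own metrics.

```python
SPANNER_METRIC_PREFIX = "spanner.googleapis.com/"


def full_metric_type(suffix: str) -> str:
    """Return the full metric type for a suffix listed in the table."""
    return SPANNER_METRIC_PREFIX + suffix.lstrip("/")


def monitoring_filter(suffix: str, instance_id: str) -> str:
    """Build a Cloud Monitoring filter string for one instance.

    Assumption: the metric is written against the `spanner_instance`
    monitored resource, which carries an `instance_id` label.
    """
    return (
        f'metric.type = "{full_metric_type(suffix)}" '
        f'AND resource.type = "spanner_instance" '
        f'AND resource.labels.instance_id = "{instance_id}"'
    )


# "my-instance" is a placeholder instance ID.
print(monitoring_filter("instance/cpu/utilization_by_priority", "my-instance"))
```

You can paste a filter like this into tools that accept Cloud Monitoring filters, or pass it to a time series list request.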

Chart name and metric type
Description Available for instances Available for databases

CPU utilization by priority


instance/cpu/utilization_by_priority

The percentage of the instance's CPU resources used by high-priority, medium-priority, low-priority, or all tasks. These tasks include requests that you initiate and maintenance tasks that Spanner must complete promptly.

For dual-region or multi-region instances, metrics are grouped by the region and priority.

Learn more about high-priority tasks.
Learn more about CPU utilization.



Total CPU utilization


instance/cpu/utilization_by_priority

The total CPU utilization, as a percentage of the instance's CPU resources.

For instances, you can view the stacked chart of total CPU utilization grouped by database, or grouped by combination of task type (User/System) and priority.

For databases, you can view the stacked chart of total CPU utilization grouped by combination of task type (User/System) and priority.

For dual-region or multi-region instances, you can choose the region to view or you can display all regions as multiple line charts.



CPU utilization by operation type


instance/cpu/utilization_by_operation_type

A stacked chart of CPU utilization as a percentage of the instance's CPU resources, grouped by user-initiated operations such as reads, writes, and commits. Use this metric to get a detailed breakdown of CPU usage and to troubleshoot further, as explained in Investigate high CPU utilization.

You can further filter by priority of the tasks using the Priority drop-down.

For dual-region or multi-region instances, metrics in the line chart show the mean percentage among regions.



CPU utilization (rolling 24-hour average)


instance/cpu/smoothed_utilization

A rolling average of total Spanner CPU utilization, as a percentage of the instance's CPU resources, for each database. Each data point is an average for the previous 24 hours.

For dual-region or multi-region instances, you can filter metrics in the line chart by region by using the Region drop-down.
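
To illustrate what a rolling 24-hour average means, the following minimal sketch averages the most recent samples in a series. It assumes one utilization sample per hour and averages over fewer samples until 24 are available; both are assumptions for illustration, not a description of how Spanner computes the metric.

```python
from collections import deque
from typing import Iterable, Iterator


def rolling_average(samples: Iterable[float], window: int = 24) -> Iterator[float]:
    """Yield the mean of the most recent `window` samples at each step."""
    recent = deque(maxlen=window)  # old samples drop off automatically
    for value in samples:
        recent.append(value)
        yield sum(recent) / len(recent)


# A short spike has a limited effect on the smoothed value.
hourly_cpu = [20.0] * 23 + [90.0]
print(list(rolling_average(hourly_cpu))[-1])
```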



Latency by change stream read


api/read_request_latencies_by_change_stream

The distribution of read request latencies by change stream. Use this metric to view all latencies and to distinguish whether a latency is for a change stream read or a non-change stream read.

Change stream queries are long-running and expected to be several seconds long. In contrast, non-change stream queries are mostly short-running. Using this metric, you can:
  • View the latencies for change stream queries.
  • View the latencies for non-change stream queries.
  • Identify if a non-change stream query is experiencing a high latency.



Peak split CPU usage


instance/peak_split_peak
The maximum peak split CPU usage observed across all splits in a database. This metric shows the percentage of the processing unit resources that are being used on a split. A percentage over 50% indicates a warm split, which means that the split is using more than half of the host server's processing unit resources. A percentage of 100% indicates a hot split, which is a split that's using the majority of the host server's processing unit resources. Spanner uses load-based splitting to resolve hotspots and balance the load. However, Spanner might not be able to balance the load, even after multiple attempts at splitting, because of problematic patterns in the application. Hotspots that last for at least 10 minutes might therefore need further troubleshooting and could require application changes. For more information, see Find hotspots in splits.
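
The thresholds described for this chart can be expressed as a small check. The following is a minimal sketch that assumes one usage sample per minute; the "normal" label and the function names are hypothetical and are used only for illustration.

```python
from typing import Sequence

WARM_THRESHOLD = 50.0    # over 50% is a warm split
HOT_THRESHOLD = 100.0    # 100% is a hot split
SUSTAINED_MINUTES = 10   # hotspots lasting this long might need troubleshooting


def classify_split(usage_percent: float) -> str:
    """Label a peak split CPU usage value as hot, warm, or normal."""
    if usage_percent >= HOT_THRESHOLD:
        return "hot"
    if usage_percent > WARM_THRESHOLD:
        return "warm"
    return "normal"  # hypothetical label for everything at or under 50%


def has_sustained_hotspot(per_minute_usage: Sequence[float],
                          minutes: int = SUSTAINED_MINUTES) -> bool:
    """Return True if usage stays hot for `minutes` consecutive samples."""
    run = 0
    for usage in per_minute_usage:
        run = run + 1 if classify_split(usage) == "hot" else 0
        if run >= minutes:
            return True
    return False
```

A result of True suggests that load-based splitting hasn't resolved the hotspot and that further investigation might be worthwhile.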


Latency


api/request_latencies

The amount of time that Spanner took to handle a read or write request. Use the Function drop-down to select Read or Write, or select Read/Write to view metrics for both. This measurement begins when Spanner receives a request, and it ends when Spanner starts to send a response.

You can view latency metrics for the 50th and 99th percentile latencies by using the Percentile drop-down:
  • 50th percentile latency: The maximum latency, in seconds, for the fastest 50% of all requests.
  • 99th percentile latency: The maximum latency, in seconds, for the fastest 99% of all requests.
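
The percentile definitions above can be illustrated with a nearest-rank calculation over raw samples. The following is a minimal sketch of the definition only. The dashboard works from aggregated metric data, so its values might differ slightly from a calculation over individual requests; the function name and sample data are hypothetical.

```python
import math
from typing import Sequence


def percentile_latency(latencies_seconds: Sequence[float], percentile: float) -> float:
    """Return the maximum latency among the fastest `percentile`% of requests."""
    if not latencies_seconds:
        raise ValueError("at least one latency sample is required")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in the range (0, 100]")
    ordered = sorted(latencies_seconds)
    # Number of requests that make up the fastest `percentile`% (nearest rank).
    count = math.ceil(percentile / 100 * len(ordered))
    return ordered[count - 1]


samples = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 2.0]
print(percentile_latency(samples, 50))  # 0.05
print(percentile_latency(samples, 99))  # 2.0
```

In this example, a single slow request doesn't affect the 50th percentile but determines the 99th percentile, which is why the two values are useful together.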



Latency by database


api/request_latencies

The amount of time that Spanner took to handle a read or write request, grouped by database. Use the Function drop-down to select Read or Write, or select Read/Write to view metrics for both. This measurement begins when Spanner receives a request, and it ends when Spanner starts to send a response.

You can view metrics for the 50th and 99th percentile latency by using the Percentile drop-down:
  • 50th percentile latency: The maximum latency, in seconds, for the fastest 50% of all requests.
  • 99th percentile latency: The maximum latency, in seconds, for the fastest 99% of all requests.



Latency by API method


api/request_latencies

The amount of time that Spanner took to handle a request, grouped by Spanner API methods. This measurement begins when Spanner receives a request, and it ends when Spanner starts to send a response.

You can view metrics for the 50th and 99th percentile latencies by using the Percentile drop-down:
  • 50th percentile latency: The maximum latency, in seconds, for the fastest 50% of all requests.
  • 99th percentile latency: The maximum latency, in seconds, for the fastest 99% of all requests.




Transaction latency


api/request_latencies_by_transaction_type

The amount of time that Spanner took to process a transaction. You can select to view metrics for read-write and read-only type transactions.

The major difference between the Latency chart and the Transaction latency chart is that the Transaction latency chart lets you select the leader involvement for the read-only type. You can select Leader is involved or No leader is involved for the read-only transaction. Reads that involve the leader might experience higher latency. You can use this chart to evaluate if you should use stale reads without communicating with the leader, assuming the timestamp bound is at least 15 seconds. For read-write transactions, the leader is always involved in the transaction, so the data shown on the chart always includes the time it took for the request to reach the leader and receive a response.

You can view metrics for the 50th and 99th percentile latency:
  • 50th percentile latency: The maximum latency, in seconds, for the fastest 50% of all transactions.
  • 99th percentile latency: The maximum latency, in seconds, for the fastest 99% of all transactions.



Transaction latency by database


api/request_latencies_by_transaction_type

The amount of time that Spanner took to process a transaction. You can select to view metrics for read-write and read-only type transactions.

The major difference between the Latency chart and the Transaction latency by database chart is that the Transaction latency by database chart lets you select the leader involvement for the read-only type. You can select Leader is involved or No leader is involved for the read-only transaction. Reads that involve the leader might experience higher latency. You can use this chart to evaluate if you should use stale reads without communicating with the leader, assuming the timestamp bound is at least 15 seconds. For read-write transactions, the leader is always involved in the transaction, so the data shown on the chart always includes the time it took for the request to reach the leader and receive a response.

You can view metrics for the 50th and 99th percentile latency:
  • 50th percentile latency: The maximum latency, in seconds, for the fastest 50% of all transactions.
  • 99th percentile latency: The maximum latency, in seconds, for the fastest 99% of all transactions.




Transaction latency by API method


api/request_latencies_by_transaction_type

The amount of time that Spanner took to process a transaction. You can select to view metrics for read-write and read-only type transactions.

The major difference between the Latency chart and the Transaction latency by API method chart is that the Transaction latency by API method chart lets you select the leader involvement for the read-only type. You can select Leader is involved or No leader is involved for the read-only transaction. Reads that involve the leader might experience higher latency. You can use this chart to evaluate if you should use stale reads without communicating with the leader, assuming the timestamp bound is at least 15 seconds. For read-write transactions, the leader is always involved in the transaction, so the data shown on the chart always includes the time it took for the request to reach the leader and receive a response.

You can view metrics for the 50th and 99th percentile latency:
  • 50th percentile latency: The maximum latency, in seconds, for the fastest 50% of all transactions.
  • 99th percentile latency: The maximum latency, in seconds, for the fastest 99% of all transactions.



Operations per second


api/api_request_count

The number of operations (read/write) that Spanner performed per second, or the number of errors that occurred on the Spanner server per second.

You can choose which operations to view in this chart:
  • Reads and writes (also includes read and write errors)
  • Reads only (also includes DML statements and read errors)
  • Writes only (excludes DML statements and includes write errors)
  • Errors on the Spanner server (grouped by read and write)



Operations per second by database


api/api_request_count

The number of operations (read/write) that Spanner performed per second, or the number of errors that occurred on the Spanner server per second. This chart is grouped by database.

You can choose which operations to view in this chart:
  • Reads and writes (also includes read and write errors)
  • Reads only (also includes DML statements and read errors)
  • Writes only (excludes DML statements and includes write errors)
  • Errors on the Spanner server (grouped by read and write)



Operations per second by API method


api/api_request_count

The number of operations that Spanner performed per second, grouped by Spanner API method.



Throughput


api/sent_bytes_count (read)

api/received_bytes_count (write)

The amount of uncompressed data that was read from, or written to, the instance or database each second. This value is measured in binary byte units. This unit of measurement is based on the power of 2. For example, 1 binary gigabyte (GB) is 2^30 bytes. This unit of measurement is also known as a gibibyte (GiB).

Read throughput includes requests and responses for methods in the read API and for SQL queries. It also includes requests and responses for DML statements.

Write throughput includes requests and responses to commit data through the mutation API. It excludes requests and responses for DML statements.



Throughput by database


api/sent_bytes_count (read)

api/received_bytes_count (write)

The amount of uncompressed data that was read from, or written to, the instance or database each second, grouped by database. This value is measured in binary byte units. This unit of measurement is based on the power of 2. For example, 1 binary gigabyte (GB) is 2^30 bytes. This unit of measurement is also known as a gibibyte (GiB).

Read throughput includes requests and responses for methods in the read API and for SQL queries. It also includes requests and responses for DML statements.

Write throughput includes requests and responses to commit data through the mutation API. It excludes requests and responses for DML statements.



Throughput by API method


api/sent_bytes_count (read)

api/received_bytes_count (write)

The amount of uncompressed data that was read from, or written to, the instance or database each second, grouped by API method. This value is measured in binary byte units. This unit of measurement is based on the power of 2. For example, 1 binary gigabyte (GB) is 2^30 bytes. This unit of measurement is also known as a gibibyte (GiB).

Read throughput includes requests and responses for methods in the read API and for SQL queries. It also includes requests and responses for DML statements.

Write throughput includes requests and responses to commit data through the mutation API. It excludes requests and responses for DML statements.



Total storage


instance/storage/used_bytes

The amount of data that is stored in the instance or database. This value is measured in binary byte units. For example, 1 binary gigabyte (GB) is 2^30 bytes. This unit of measurement is also known as a gibibyte (GiB).



Total database storage by database


instance/storage/used_bytes

The amount of data that is stored in the instance or database, grouped by database. This value is measured in binary byte units. For example, 1 binary gigabyte (GB) is 2^30 bytes. This unit of measurement is also known as a gibibyte (GiB).



Database storage by table


(none)

The amount of data that is stored in the instance or database, grouped by tables in the selected database. This value is measured in binary byte units. For example, 1 binary gigabyte (GB) is 2^30 bytes. This unit of measurement is also known as a gibibyte (GiB).

This chart obtains its data by querying SPANNER_SYS.TABLE_SIZES_STATS_1HOUR. For more information, see Table sizes statistics.



Most-used tables by operations


(none)

The 15 most-used tables and indexes in the instance or database, determined by the number of read, write, or delete operations.
This chart obtains its data by querying the table operations statistics tables. For more information, see Table operations statistics.



Least-used tables by operations


(none)

The 15 least-used tables and indexes in the instance or database, determined by the number of read, write, or delete operations.
This chart obtains its data by querying the table operations statistics tables. For more information, see Table operations statistics.



Lock wait time


lock_stat/total/lock_wait_time

Lock wait time for a transaction is the time needed to acquire a lock on a resource held by another transaction.

Total lock wait time for lock conflicts is recorded for the entire database.



Lock wait time by database


lock_stat/total/lock_wait_time

Lock wait time for a transaction is the time needed to acquire a lock on a resource held by another transaction.

Total lock wait time for lock conflicts is recorded for the entire database.



Total backup storage


instance/backup/used_bytes

The amount of data that is stored in the backups that are associated with the instance or database. This value is measured in binary byte units. For example, 1 binary gigabyte (GB) is 2^30 bytes. This unit of measurement is also known as a gibibyte (GiB).



Total backup storage by database


instance/backup/used_bytes

The amount of data that is stored in the backups that are associated with the instance or database, grouped by database. This value is measured in binary byte units. For example, 1 binary gigabyte (GB) is 2^30 bytes. This unit of measurement is also known as a gibibyte (GiB).



Compute capacity


instance/processing_units
instance/nodes

The compute capacity is the number of processing units or nodes available in an instance. You can choose to display the capacity in processing units or in nodes.




Leader distribution


instance/leader_percentage_by_region

For dual-region or multi-region instances, you can view the number of databases with the majority of leaders (>=50%) in a given region. Under the Regions drop-down menu, if you select a specific region, the chart shows the total number of databases within that instance that have the selected region as the leader region. If you select All regions under the Regions drop-down menu, the chart shows one line for each region, and each line shows the total number of databases in the instance that have that region as their leader region.

For databases in a dual-region or multi-region instance, you can view the percentage of leaders grouped by region. For example, if a database has five leaders, one in us-west1 and four in us-east1 at a point-in-time, the "All regions" chart shows two lines (one per region). One line for us-west1 is at 20%, and the other line for us-east1 is at 80%. The us-west1 chart shows one single line at 20%, and the us-east1 chart shows one single line at 80%.

Note that if a database was recently created or a leader region was recently modified, the charts might not stabilize right away.

This chart is only available for dual-region and multi-region instances.
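
The percentages in the database-level example can be reproduced with a short calculation. The following is a minimal sketch; the function name is hypothetical, and the input is simply the region of each leader at one point in time.

```python
from collections import Counter
from typing import Dict, Iterable


def leader_percentage_by_region(leader_regions: Iterable[str]) -> Dict[str, float]:
    """Return the percentage of leaders located in each region."""
    counts = Counter(leader_regions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {region: 100 * n / total for region, n in counts.items()}


# Five leaders: one in us-west1 and four in us-east1.
leaders = ["us-west1"] + ["us-east1"] * 4
print(leader_percentage_by_region(leaders))
# {'us-west1': 20.0, 'us-east1': 80.0}
```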




Dual-region quorum health timeline


instance/dual_region_quorum_availability

This chart is available only for dual-region instance configurations. It shows the health of three quorums: the dual-region quorum (Global), and the single region quorum in each region (for example, Sydney and Melbourne).

It shows an orange bar in the timeline when there is a service disruption. You can hover over the bar to see the start and end times of the disruption. Use this chart alongside the error rates and latency metrics to help you make self-managed, when-to-failover decisions in the case of regional failures. For more information, see Failover and failback.

To failover and failback manually, see Change dual-region quorum.




Remote service call count


query_stat/total/remote_service_calls_count

Count of remote service calls, grouped by the service and response codes.

The remote service responds with an HTTP response code, such as 200 or 500.




Remote service call latencies


query_stat/total/remote_service_calls_latencies

The latency of the remote service calls, grouped by service.

You can view latency metrics for the 50th and 99th percentile latencies by using the Percentile drop-down:
  • 50th percentile latency: The maximum latency, in seconds, for the fastest 50% of all requests.
  • 99th percentile latency: The maximum latency, in seconds, for the fastest 99% of all requests.




Remote service processed rows count


query_stat/total/remote_service_processed_rows_count

Count of rows processed by a remote service, grouped by the service and response codes.

The remote service responds with an HTTP response code, such as 200 or 500.




Remote service rows latencies


query_stat/total/remote_service_processed_rows_latencies

The latency of row processing by a remote service, grouped by the service and response codes.

You can view latency metrics for the 50th and 99th percentile latencies by using the Percentile drop-down:
  • 50th percentile latency: The maximum latency, in seconds, for the fastest 50% of all requests.
  • 99th percentile latency: The maximum latency, in seconds, for the fastest 99% of all requests.




Remote service network bytes


query_stat/total/remote_service_network_bytes_sizes

Network bytes exchanged with the remote service, grouped by service and direction.

This value is measured in binary byte units. This unit of measurement is based on the power of 2. For example, 1 binary gigabyte (GB) is 2^30 bytes. This unit of measurement is also known as a gibibyte (GiB).

Direction refers to traffic being sent or received.

You can view metrics for the 50th and 99th percentile of network bytes exchange by using the Percentile drop-down:
  • 50th percentile: The amount of data exchanged at the 50th percentile of requests.
  • 99th percentile: The amount of data exchanged at the 99th percentile of requests.


Managed autoscaler charts and metrics

In addition to the options shown in the previous section, when an instance has the managed autoscaler enabled, the compute capacity chart has a View Logs button. Click this button to display logs from the managed autoscaler.

The following metrics are available for instances that have the managed autoscaler enabled.

Chart name and metric type Description
Compute capacity With nodes selected.

instance/autoscaling/min_node_count

Minimum number of nodes that the autoscaler is configured to allocate to the instance.

instance/autoscaling/max_node_count

Maximum number of nodes that the autoscaler is configured to allocate to the instance.

instance/autoscaling/recommended_node_count_for_cpu

Recommended number of nodes based on the CPU usage of the instance.

instance/autoscaling/recommended_node_count_for_storage

Recommended number of nodes based on the storage usage of the instance.
Compute capacity With processing units selected.

instance/autoscaling/min_processing_units

Minimum number of processing units that the autoscaler is configured to allocate to the instance.

instance/autoscaling/max_processing_units

Maximum number of processing units that the autoscaler is configured to allocate to the instance.

instance/autoscaling/recommended_processing_units_for_cpu

Recommended number of processing units. This recommendation is based on the previous CPU usage of the instance.

instance/autoscaling/recommended_processing_units_for_storage

Recommended number of processing units to use. This recommendation is based on the previous storage usage of the instance.
CPU utilization by priority

instance/autoscaling/high_priority_cpu_utilization_target

High priority CPU utilization target to use for autoscaling.
Total storage With processing units selected.

instance/storage/limit_bytes

Storage limit for the instance in bytes.

instance/autoscaling/storage_utilization_target

Storage utilization target to use for autoscaling.
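
To relate the storage metrics to each other, you can compute utilization as used bytes divided by the storage limit and compare it with a target. The following is a minimal sketch. It assumes that the target is expressed as a fraction between 0 and 1, which you should confirm for your instance, and it doesn't describe how the managed autoscaler makes its decisions.

```python
def storage_utilization(used_bytes: int, limit_bytes: int) -> float:
    """Return storage utilization as a fraction of the storage limit."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    return used_bytes / limit_bytes


def exceeds_target(used_bytes: int, limit_bytes: int, target_fraction: float) -> bool:
    """Return True if utilization is above the target (assumed to be a fraction)."""
    return storage_utilization(used_bytes, limit_bytes) > target_fraction


# Example values: 3 GiB used out of a 4 GiB limit, with a hypothetical 0.7 target.
GIB = 2 ** 30
print(storage_utilization(3 * GIB, 4 * GIB))  # 0.75
print(exceeds_target(3 * GIB, 4 * GIB, 0.7))  # True
```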

Data retention

The maximum data retention for most metrics on the system insights dashboard is six weeks. However, for the Database storage by table graph, the data is consumed from the SPANNER_SYS.TABLE_SIZES_STATS_1HOUR table (instead of Spanner), which has a maximum retention of 30 days. See Data retention to learn more.
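
Because the Database storage by table graph reads from SPANNER_SYS.TABLE_SIZES_STATS_1HOUR, you can query that table directly. The following is a minimal sketch that uses the google-cloud-spanner client library. The column names (interval_end, table_name, used_bytes) and the helper name are assumptions; check them against the Table sizes statistics documentation before relying on them.

```python
# Assumed column names: interval_end, table_name, used_bytes.
LATEST_TABLE_SIZES_SQL = """
SELECT interval_end, table_name, used_bytes
FROM spanner_sys.table_sizes_stats_1hour
WHERE interval_end = (
  SELECT MAX(interval_end) FROM spanner_sys.table_sizes_stats_1hour
)
ORDER BY used_bytes DESC
"""


def latest_table_sizes(instance_id: str, database_id: str) -> list:
    """Return rows for the most recent one-hour interval, largest table first."""
    # Imported here so that the module loads even if the library isn't installed.
    from google.cloud import spanner

    client = spanner.Client()
    database = client.instance(instance_id).database(database_id)
    with database.snapshot() as snapshot:
        return list(snapshot.execute_sql(LATEST_TABLE_SIZES_SQL))
```

Running the query requires credentials and the permissions listed in the next section.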

View the system insights dashboard

To view the system insights page, you need the following Identity and Access Management (IAM) permissions, in addition to the Spanner permissions at the instance and database levels:

  • spanner.databases.beginReadOnlyTransaction
  • spanner.databases.select
  • spanner.sessions.create

For more information about Spanner IAM permissions, see Access control with IAM.

If you enable managed autoscaler on your instance, you also need the logging.logEntries.list permission to view the managed autoscaler logs.

For more information about this permission, see Predefined roles.
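
To check a principal's permissions against this list, you can compare sets. The following is a minimal sketch; the function name is hypothetical, and how you obtain the set of granted permissions is outside its scope.

```python
from typing import Iterable, Set

# Permissions listed on this page for viewing system insights.
BASE_PERMISSIONS = frozenset({
    "spanner.databases.beginReadOnlyTransaction",
    "spanner.databases.select",
    "spanner.sessions.create",
})
# Needed only to view managed autoscaler logs.
AUTOSCALER_LOG_PERMISSION = "logging.logEntries.list"


def missing_permissions(granted: Iterable[str],
                        managed_autoscaler: bool = False) -> Set[str]:
    """Return the listed permissions that aren't in `granted`."""
    required = set(BASE_PERMISSIONS)
    if managed_autoscaler:
        required.add(AUTOSCALER_LOG_PERMISSION)
    return required - set(granted)


granted = {"spanner.databases.select", "spanner.sessions.create"}
print(missing_permissions(granted))
# {'spanner.databases.beginReadOnlyTransaction'}
```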

To view the system insights dashboard, follow these steps:

  1. In the Google Cloud console, open the list of Spanner instances.

    Go to the instance list

  2. Do one of the following:

    1. To see metrics for an instance, click the name of the instance that you want to learn about, then click System insights in the navigation menu.

    2. To see metrics for a database, click the name of the instance, select a database, then click System insights in the navigation menu.

  3. Optional: To view historical data for a different time period, find the buttons at the top right of the page, then click the time period that you want to view.

  4. Optional: To control what data appears in the chart, click one of the drop-down lists in the chart. For example, if the instance uses a dual-region or multi-region configuration, some charts provide a drop-down list to view data for a specific region. Not all charts have drop-down lists.

What's next