Getting to know System insights, a simplified database system monitoring tool

October 13, 2023
https://storage.googleapis.com/gweb-cloudblog-publish/images/0_-_Hero_image___2436_x1200.max-2500x2500.jpg
Mani HK

Product Designer, Google Cloud

Kaushal Agrawal

Product Designer, Google Cloud

A slow-running database is hard to diagnose. When problems arise, database teams want to immediately understand what is causing the issue. Is the database experiencing resource saturation? If so, what is consuming resources: an application query or an operation initiated by Cloud SQL? Are there any new changes in the database? Are there any background tasks running, such as vacuum and backup operations? Answering these questions isn't easy. It requires investigating and correlating metrics from multiple sources to understand the state of the system and pinpoint where the issue is.

To make this process a bit easier, we designed System insights.

System insights is a database system monitoring tool that brings together critical metrics, events, and logs to provide a comprehensive view of both the external database performance and the internal system resources. Bringing all the signals into a single dashboard allows you to quickly identify potential sources of problems without having to switch between tools. Right now, System insights is generally available (GA) for Cloud SQL for PostgreSQL and Spanner, and in preview for Cloud SQL for MySQL.

https://storage.googleapis.com/gweb-cloudblog-publish/images/1_-_Cloud_SQL_System_insights_for_PostgreS.max-2200x2200.png

Fig 1: Cloud SQL System insights for PostgreSQL

Why design a new system monitoring tool?

Before System insights, users had two options for monitoring performance, both of which had significant user friction.

  1. Metrics on the instance overview page:
    While metrics shown here are useful, the interface did not allow users to compare and correlate multiple metrics at the same time — a key feature for successfully troubleshooting a database problem.

  2. Custom dashboards:
    Some users set up custom monitoring dashboards, either within Google Cloud or using third-party tools like Grafana. This provides great flexibility, but presents a steep learning curve to get going. Perhaps most importantly, it doesn’t let you see everything you need in a central place.

System insights addresses both of these challenges with a pre-built dashboard with richer, customizable metrics. This reduces the barrier to entry for novice users and facilitates complex troubleshooting journeys for advanced users, all within the database console UI.

In this blog, we walk you through the key features of System insights and the design rationale behind each of them.

Quick snapshot of system status

At the top of the System insights page, you’ll see a snapshot of your system’s key resources and its current status — a quick visual check on any potential issues that you may need to investigate. The system configuration helps you understand how the system is performing relative to the resources that were originally provisioned.

https://storage.googleapis.com/gweb-cloudblog-publish/images/2_-_Snapshot_of_key_system_resources_and_i.max-1800x1800.png

Fig 2: Snapshot of key system resources and their current status

Pre-built dashboard with actionable metrics

When designing system monitoring dashboards, we often run into the “problem of plenty”: there’s a wealth of metrics and data points to expose to users. How do we decide which ones to surface? In Google Cloud, we cater to users with a wide range of database expertise and expectations. So, the key design challenge was to balance the needs of generalists and specialists.

We drew inspiration from observability frameworks such as RED (Rate, Errors, and Duration) and USE (Utilization, Saturation, and Errors) and applied them to key system resources such as CPU, memory, and active connections. This gives a comprehensive yet easily digestible view of both the external and internal factors influencing database throughput. Within each metric, special attention is given to the way it is represented to make the underlying causes of behavior more readily identifiable. For example, disk storage is shown with a breakdown into data, logs, and other usage, to clearly indicate which part of the disk is driving an increase in storage utilization.

https://storage.googleapis.com/gweb-cloudblog-publish/images/3_-_Break-down_of_disk_storage_by_typesize.max-2200x2200.png

Fig 3: Break-down of disk storage by type
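To make the idea behind the breakdown concrete, here is a minimal Python sketch of attributing storage growth to a component. It is not how System insights is implemented; the `DiskSample` class, its field names, and the two helper functions are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskSample:
    """One storage reading in bytes, split into data, logs, and other (hypothetical shape)."""
    data: int
    logs: int
    other: int

    @property
    def total(self) -> int:
        return self.data + self.logs + self.other


def growth_by_component(earlier: DiskSample, later: DiskSample) -> dict[str, int]:
    """Return the byte growth of each component between two samples."""
    return {
        "data": later.data - earlier.data,
        "logs": later.logs - earlier.logs,
        "other": later.other - earlier.other,
    }


def main_contributor(earlier: DiskSample, later: DiskSample) -> str:
    """Name the component that grew the most between two samples."""
    growth = growth_by_component(earlier, later)
    return max(growth, key=growth.get)
```

With two samples where logs grow much faster than data, `main_contributor` returns `"logs"`, which is the kind of answer the stacked chart lets you read off visually.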

Also, wherever possible, saturation limit lines are shown on resource utilization graphs. For metrics like CPU or memory, utilization percentage intuitively communicates saturation. However, the saturation limit for a metric like active connections is less obvious. In such cases, we compute a saturation limit from the user’s current system configuration and show it on the graph, which makes the metric actionable.
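As a rough illustration of a computed limit, the sketch below expresses active connections as a fraction of a configured connection limit. This post does not describe how System insights derives the limit, so `connection_limit` is an assumed input (for PostgreSQL, the `max_connections` setting would be one plausible source), and the function names and the 0.8 threshold are hypothetical.

```python
def connection_saturation(active_connections: int, connection_limit: int) -> float:
    """Return active connections as a fraction of the configured limit."""
    if connection_limit <= 0:
        raise ValueError("connection_limit must be positive")
    if active_connections < 0:
        raise ValueError("active_connections cannot be negative")
    return active_connections / connection_limit


def is_near_saturation(
    active_connections: int,
    connection_limit: int,
    threshold: float = 0.8,  # assumed cutoff, chosen only for illustration
) -> bool:
    """Report whether usage has reached the given fraction of the limit."""
    return connection_saturation(active_connections, connection_limit) >= threshold
```

Turning the raw count into a fraction of the limit is what lets a connections graph be read the same way as a CPU or memory percentage.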

Correlation beyond metrics

To troubleshoot a database problem successfully, users often go beyond metrics and look into logs and system events. For example, a slow-running query without a proper index can create a long lock wait, resulting in increased error rates or CPU utilization, even if there is no change in the external workload. And it is not easy to understand the ‘why’ behind such behaviors just by looking at metrics. A trained database specialist will use a cloud logging tool along with system monitoring to see if there are any system events that coincide with a resource anomaly. This process involves not only switching between multiple tools, but also extracting fine-grained information from an ocean of logs that are not human-friendly to begin with.

https://storage.googleapis.com/gweb-cloudblog-publish/images/4_-_Events_timeline_brings_metrics_and_log.max-2200x2200.png

Fig 4: Events timeline brings metrics and logs together

With System insights, we simplified this journey for users. Instead of expecting users to maneuver between multiple tools, we integrated events into the System insights dashboard. Key system operations are filtered from cloud logs and presented to users as a visual timeline, referred to as the ‘events timeline’. It’s a dynamic experience that allows users to naturally scroll through time and correlate metrics with system events.
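The correlation that the events timeline supports visually can be sketched in a few lines of Python: given a time window in which a metric looks anomalous, list the system events that overlap it. The `SystemEvent` class and `events_overlapping` function are hypothetical names used for illustration, not part of any Google Cloud API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SystemEvent:
    """A system operation with a start and end time (hypothetical shape)."""
    name: str
    start: datetime
    end: datetime


def events_overlapping(
    events: list[SystemEvent],
    window_start: datetime,
    window_end: datetime,
) -> list[SystemEvent]:
    """Return events whose time range intersects the given window."""
    return [
        event
        for event in events
        if event.start <= window_end and event.end >= window_start
    ]
```

For example, if a CPU spike falls inside a window that overlaps a backup event, that backup becomes the first thing to investigate.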

Customizable views for specialist use cases

The dashboard defaults are a great starting point for most users. But for those who need more control over what metrics to monitor, we also provide customizable views. Users can add to the default selection of metrics, or choose a specific set of metrics that are more relevant to them. As System insights continues to expand its telemetry coverage and adds more dimensions to customize, this feature will address more of the needs of database specialists without making the default experience daunting for generalists.

https://storage.googleapis.com/gweb-cloudblog-publish/images/5_-_Customisable_viewssize_1x.max-2200x2200.png

Fig 5: Customizable views

To summarize, no matter where you are in your database system observability journey, System insights provides an easy-to-use starting point for generalists and customizable views for specialist use cases.

What next?

Our observability strategy is to build on a foundation of telemetry depth and ease of use. This effort began with Query Insights, a tool that helps developers fix query inefficiencies through visual query plans and latency analysis. In 2022, we launched proactive database wellness recommenders to help you increase the security, performance, reliability, and cost efficiency of your Cloud SQL fleet. Now, System insights extends that capability to system resource monitoring and troubleshooting.

We are actively working on improving complex troubleshooting journeys. Stay tuned for more!
