
Where is your Cloud Bigtable cluster spending its CPU?


CPU utilization is a key performance indicator for Cloud Bigtable, and understanding where CPU is spent is essential for optimizing Bigtable performance and cost. We have significantly improved Bigtable’s observability by letting you visualize your cluster’s CPU utilization in more detail: you can now break the utilization down by dimensions such as app profile, method and table. This finer-grained reporting can help you make more informed application design choices and diagnose performance-related incidents.

In this post, we present how this visibility may be used in the real world, through example persona-based user journeys.

User Journey: Investigate an incident with high latency
Target Persona: Site Reliability Engineer (SRE)

ABC Corp runs Cloud Bigtable in a multi-tenant environment. Multiple teams at ABC Corp use the same Bigtable instance.

Alice is an SRE at ABC Corp. Alice gets paged because the tail latency of a cluster exceeded the acceptable performance threshold. Alice looks at the cluster level CPU utilization chart and sees that the CPU usage spiked during the incident window.

P99 latency for app profile personalization-reader spikes
CPU utilization for the cluster spikes

Alice wants to drill down further to get more details about this spike. The primary question she wants to answer is “Which team should I be reaching out to?” 

Fortunately, teams at ABC Corp follow the best practice of tagging the usage of each team with an app profile in the following format: <teamname>-<workload-type>

The Bigtable instance has the following app profiles:

  • revenue-updater

  • info-updater

  • personalization-reader

  • personalization-batch-updater
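The naming convention above is what lets an SRE map an app profile back to its owning team. A minimal sketch of that mapping, assuming team names never contain a hyphen (so the first hyphen separates the team from the workload type); the helper `parse_app_profile` is hypothetical and not part of any Bigtable library:

```python
from typing import NamedTuple


class AppProfileName(NamedTuple):
    team: str
    workload_type: str


def parse_app_profile(app_profile_id: str) -> AppProfileName:
    """Split '<teamname>-<workload-type>' at the first hyphen.

    Assumption: team names contain no hyphen, so everything after the
    first hyphen is the workload type (for example 'batch-updater').
    """
    team, sep, workload_type = app_profile_id.partition("-")
    if not sep or not team or not workload_type:
        raise ValueError(
            f"not in <teamname>-<workload-type> form: {app_profile_id!r}"
        )
    return AppProfileName(team, workload_type)


profiles = [
    "revenue-updater",
    "info-updater",
    "personalization-reader",
    "personalization-batch-updater",
]
# Map each app profile to the team that owns it.
owners = {profile: parse_app_profile(profile).team for profile in profiles}
```

With this mapping, both `personalization-reader` and `personalization-batch-updater` resolve to the same team, `personalization`.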

The instance's data is stored in the following tables:

  • revenue

  • client-info

  • personalization

She uses the CPU per app profile chart to determine that the personalization-batch-updater app profile utilized the most CPU during the time of the incident and also saw a spike that corresponded with the spike in latency of the serving path traffic under the personalization-reader app profile.

CPU utilization by app profile

At this point, Alice knows that the personalization-batch-updater traffic is adversely impacting the personalization-reader traffic. She digs further into the dashboards in Metrics Explorer to identify the problematic method and table.

CPU usage breakdown by app profile, table and method

Alice has now identified the personalization-batch-updater app profile, the personalization table and the MutateRows method as the reason for the increase in CPU utilization that is causing high tail latency of the serving path traffic.
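The same drill-down can be approximated outside the dashboards. The sketch below builds a Cloud Monitoring filter string and ranks (app profile, table, method) combinations by average CPU load. The metric type and the resource label names are assumptions that should be checked against the Cloud Monitoring metrics list, the helper names are hypothetical, and the sample readings are made up for illustration rather than taken from the charts above:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumption: verify this metric type and the label names against the
# Cloud Monitoring metrics list before relying on them.
CPU_BREAKDOWN_METRIC = (
    "bigtable.googleapis.com/cluster/"
    "cpu_load_by_app_profile_by_method_by_table"
)

Key = Tuple[str, str, str]            # (app_profile, table, method)
Sample = Tuple[str, str, str, float]  # key fields plus one CPU load reading


def build_filter(instance_id: str, cluster_id: str) -> str:
    """Build a Cloud Monitoring filter string scoped to one cluster."""
    return (
        f'metric.type = "{CPU_BREAKDOWN_METRIC}" '
        f'AND resource.labels.instance = "{instance_id}" '
        f'AND resource.labels.cluster = "{cluster_id}"'
    )


def top_cpu_consumers(
    samples: Iterable[Sample], limit: int = 3
) -> List[Tuple[Key, float]]:
    """Average the readings per key and return the highest averages first."""
    readings: Dict[Key, List[float]] = defaultdict(list)
    for app_profile, table, method, cpu_load in samples:
        readings[(app_profile, table, method)].append(cpu_load)
    averages = {key: sum(vals) / len(vals) for key, vals in readings.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:limit]


# Hypothetical readings for the incident window (not real measurements).
incident_samples = [
    ("personalization-batch-updater", "personalization", "MutateRows", 0.55),
    ("personalization-batch-updater", "personalization", "MutateRows", 0.65),
    ("personalization-reader", "personalization", "ReadRows", 0.20),
    ("revenue-updater", "revenue", "MutateRow", 0.05),
]
culprit, _ = top_cpu_consumers(incident_samples, limit=1)[0]
print(culprit)
```

In practice the readings would come from the time series returned for the filter string rather than from a hard-coded list.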

With this information, she reaches out to the personalization team to provision the cluster correctly before the batch job starts so that the performance of other tenants is not affected. 

The following options can be considered in this scenario:

  1. Run the batch job on a replicated instance with multiple clusters. Provision a dedicated cluster for the batch job and use single-cluster routing to completely isolate the serving path traffic from the batch updates.

  2. Provision more nodes for the cluster before the batch job starts and keep them for the duration of the batch job. This option is less preferred than option 1, since serving path traffic may still be impacted. However, it is more cost-effective.
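For option 2, a rough way to size the temporary capacity is to scale the node count by the ratio of observed to target CPU utilization. This is a back-of-the-envelope sketch that assumes CPU demand spreads evenly across nodes and scales linearly with node count; the function name and the percentages in the usage note are hypothetical, and the target should come from the utilization guidance for your workload:

```python
def nodes_needed(current_nodes: int, observed_cpu_pct: int, target_cpu_pct: int) -> int:
    """Estimate the node count that brings CPU utilization down to the target.

    Percentages are whole numbers (for example 90 for 90%) so the ceiling
    division below is exact integer arithmetic with no float rounding.
    Assumption: load is evenly distributed and scales linearly with nodes.
    """
    if current_nodes < 1:
        raise ValueError("current_nodes must be at least 1")
    if observed_cpu_pct < 0 or not 0 < target_cpu_pct <= 100:
        raise ValueError("percentages out of range")
    total_demand = current_nodes * observed_cpu_pct
    # Ceiling division: round up so the estimate never undershoots.
    estimate = (total_demand + target_cpu_pct - 1) // target_cpu_pct
    return max(1, estimate)


# Hypothetical example: 6 nodes peaking at 90% CPU, aiming for 60%.
batch_window_nodes = nodes_needed(6, 90, 60)
```

The estimate ignores hot spots: if the batch job concentrates writes on a narrow key range, adding nodes may help less than the linear model suggests.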

User Journey: Schema and cost optimization
Target Persona: Developer

Bob is a developer who is onboarding a new workload on Bigtable. He completes the development of his feature and moves on to the performance benchmarking phase before releasing to production. He notices that the throughput of his queries is lower, and their latency higher, than he expected, and begins debugging the issue.

His first step is to look at the CPU utilization of the cluster, which is higher than expected and is hovering around the recommended max.

CPU utilization by cluster

To debug further, he looks at the CPU utilization by app profile and the CPU utilization by table charts. He determines that the majority of the CPU is consumed by the product-reader app profile and the product_info table.

CPU utilization by app profile
CPU utilization by table

He inspects the application code and notices that the query includes a value range filter. He realizes that value filters are expensive for the cluster to evaluate, so he moves the filtering into the application. This leads to a substantial decrease in Bigtable cluster CPU utilization. Consequently, he not only improves performance but can also lower costs for the Bigtable cluster.
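Moving the filter into the application could look roughly like the sketch below: the read is issued without a value filter, and the range check runs on the client. The `(row_key, value)` pairs stand in for cells returned by an unfiltered read, and the function name, row keys and values are hypothetical. Bounds are inclusive and compared bytewise in this sketch, so numeric values need a fixed-width encoding to sort correctly:

```python
from typing import Iterable, Iterator, Tuple

Cell = Tuple[bytes, bytes]  # (row_key, value), a stand-in for a cell read


def filter_by_value_range(
    cells: Iterable[Cell], start: bytes, end: bytes
) -> Iterator[Cell]:
    """Yield cells whose value lies in [start, end], compared bytewise.

    This runs in the application, so the cluster only serves the read
    and spends no CPU evaluating the value predicate.
    """
    for row_key, value in cells:
        if start <= value <= end:
            yield row_key, value


# Hypothetical cells, as if returned by a read with no value filter.
# Prices are zero-padded so bytewise order matches numeric order.
unfiltered = [
    (b"sku#1", b"0100"),
    (b"sku#2", b"0250"),
    (b"sku#3", b"0900"),
]
in_range = list(filter_by_value_range(unfiltered, b"0200", b"0500"))
```

The trade-off is that more data crosses the network and is discarded on the client, so this tends to pay off when the filter rejects a modest share of cells or when cluster CPU is the bottleneck, as it was for Bob.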

CPU utilization by cluster after removing value range filter
CPU utilization by app profile after removing value range filter
CPU utilization by table after removing value range filter

We hope that this post helps you understand why and when you might want to use our new observability metric: CPU per app profile, method and table.

Accessing the metrics

These metrics can be accessed in the Bigtable Monitoring UI under the Tables and Application Profiles tabs. To see the method breakdown, view the metric in Metrics Explorer, which you can also navigate to from the Cloud Monitoring UI.
