Troubleshooting query performance problems using Query Insights

This page describes how to use the Query Insights dashboard to detect and analyze performance problems. To learn more, see Overview of Query Insights.

Before you begin

If you or other users need to view the query plan or perform end-to-end tracing, you need specific IAM permissions to do so. You can create a custom role and add the necessary IAM permissions to it. Then you can add this role to each user account that will use Query Insights to troubleshoot an issue. See Creating a custom role.

The custom role needs to have the following IAM permission.

  • cloudtrace.traces.get

Opening the Query Insights dashboard

Console

  1. Go to the Cloud SQL Instances page in the Google Cloud Console.
  2. Click on an instance name to navigate to the Overview page.

  3. To open the Query Insights dashboard, either select the Query Insights tab or click Go to Query Insights for more in-depth info on queries and performance.

    Figure 1. Ways to open the Query Insights dashboard.

  4. If Query Insights is not yet enabled on your instance, click Enable.

  5. Select any Query Insights options you need. These options can be changed at any time. Options include:

    Store client IP addresses

    Lets you store the client IP addresses that queries come from and group that data to run metrics against it. Queries often come from more than one host; some hosts might be misbehaving, or different hosts might be handling different applications. Reviewing graphs of queries grouped by client IP address can help you identify the source of a problem.

    Store application tags

    Lets you determine which APIs and model-view-controller (MVC) routes are making requests, and group that data to run metrics against it. This information helps Query Insights identify the source of a problem, in this case, which MVC route the problem is coming from. Application paths also help you with application monitoring. This option requires you to annotate your queries with a specific set of tags; you can use the open source sqlcommenter library on your application server to do this.

    Customize query lengths

    The default query length is 1024B. You can change the default limit up to 4500B or down to 256B. Higher query lengths are more useful for analytical queries, but they also require more memory. Changing query length requires you to restart the instance. Tags will still be added to queries that exceed the length limit.

  6. Click ENABLE.

    After you enable the Query Insights dashboard, you see filtering options and a graph for the database load for all queries.

gcloud

To create a new instance with Insights enabled, run gcloud beta sql instances create with the --insights-config-query-insights-enabled flag.

This flag can be used with one or more of the following optional flags:

  • --insights-config-record-client-address

    Lets you store the client IP addresses that queries come from and group that data to run metrics against it. Queries often come from more than one host; some hosts might be misbehaving, or different hosts might be handling different applications. Reviewing graphs of queries grouped by client IP address can help you identify the source of a problem.

  • --insights-config-record-application-tags

    Lets you determine which APIs and model-view-controller (MVC) routes are making requests, and group that data to run metrics against it. This information helps Query Insights identify the source of a problem, in this case, which MVC route the problem is coming from. Application paths also help you with application monitoring. This option requires you to annotate your queries with a specific set of tags; you can use the open source sqlcommenter library on your application server to do this.

  • --insights-config-query-string-length

    The default query length is 1024B. You can change the default limit up to 4500B or down to 256B. Higher query lengths are more useful for analytical queries, but they also require more memory. Changing query length requires you to restart the instance.

Replace the following:

  • INSTANCE_NAME: The name of the new instance.
  • INSIGHTS_CONFIG_QUERY_STRING_LENGTH: The query string length in bytes to be stored by the Query Insights feature. The default length is 1024 bytes. Allowed range: 256 to 4500 bytes.
  • API_TIER_STRING: The custom machine type to use for the instance.
  • REGION: The region for the instance.
  gcloud beta sql instances create INSTANCE_NAME \
    --insights-config-query-insights-enabled \
    --insights-config-query-string-length=INSIGHTS_CONFIG_QUERY_STRING_LENGTH \
    --insights-config-record-application-tags \
    --insights-config-record-client-address \
    --tier=API_TIER_STRING \
    --region=REGION

REST v1

For reference information about this command, see the Instances settings.

Before using any of the request data below, make the following replacements:

  • project-id: The project ID
  • instance-id: The instance ID

HTTP method and URL:

PATCH https://sqladmin.googleapis.com/sql/v1beta4/projects/project-id/instances/instance-id

Request JSON body:

{
  "settings" : { "insightsConfig" : { "queryInsightsEnabled" : true } }
}
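A minimal sketch of constructing this PATCH call in Python. The project and instance IDs are placeholders; actually sending the request additionally requires an OAuth access token (for example, from `gcloud auth print-access-token`):

```python
import json

# Placeholder IDs; replace with your own project and instance.
PROJECT_ID = "project-id"
INSTANCE_ID = "instance-id"

# Cloud SQL Admin API endpoint for patching instance settings.
url = (
    "https://sqladmin.googleapis.com/sql/v1beta4/"
    f"projects/{PROJECT_ID}/instances/{INSTANCE_ID}"
)

# Request body that enables Query Insights, mirroring the JSON above.
body = {"settings": {"insightsConfig": {"queryInsightsEnabled": True}}}
payload = json.dumps(body)

print("PATCH", url)
print(payload)
```

To send it, pass the payload with an `Authorization: Bearer` header, for example with the requests library: `requests.patch(url, data=payload, headers={"Authorization": f"Bearer {token}"})`.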


You should receive a JSON response similar to the following:

{
  "kind": "sql#operation",
  "targetLink": "https://sqladmin.googleapis.com/sql/v1beta4/projects/project-id/instances/instance-id",
  "status": "PENDING",
  "user": "user@example.com",
  "insertTime": "2021-01-28T22:43:40.009Z",
  "operationType": "UPDATE",
  "name": "operation-id",
  "targetId": "instance-id",
  "selfLink": "https://sqladmin.googleapis.com/sql/v1beta4/projects/project-id/operations/operation-id",
  "targetProject": "project-id"
}

The Insights dashboard

Figure 2. The Query Insights dashboard.

The areas of the dashboard include:

  1. Databases. Filters query load on a specific database or all databases.
  2. User. Filters query load from specific user accounts.
  3. Client address. Filters query load from a specific IP address.
  4. Time range. Filters query load by time ranges, such as hour, day, week, or a custom range.
  5. Database Load graph. Displays the query load graph, based on filtered data.
  6. CPU capacity, CPU and CPU wait, IO Wait, and Lock Wait. Filters for loads based on the options you select. See Quick detection: Viewing database load for queries for details about each of these filters.
  7. Queries and tags. Filters query load for either a selected query or a SQL query tag. For more information about how tags work, see Quick detection: Filtering database load for tags. To learn how to add tags to your queries, see Adding tags to SQL queries.

High-level guide to troubleshooting performance issues

Query Insights helps you troubleshoot Cloud SQL queries and find performance issues. The Query Insights dashboard shows the query load based on factors that you select. Query load is a measurement of the total work done by all the queries in the instance in the selected time range.

Query Insights helps you detect and analyze query performance problems. You use Query Insights to troubleshoot queries in three phases:

  • Quick detection — identify the problematic query and application
  • Root cause analysis — determine why the query is slow with query plan analysis
  • Identify the source — identify the application, user, and host for the problematic query

Quick detection: Viewing database load for queries

The top-level Query Insights dashboard shows the Database load — all top queries graph. Database query load is a measure of the work (in CPU seconds) that executed queries in your selected database perform over time. Each running query is either using or waiting for CPU resources, IO resources, or lock resources. Database query load is the ratio of the amount of time taken by all the queries that are completed in a given time window to the wall clock time.
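That ratio can be sketched in a few lines of Python (the sample numbers are invented):

```python
def database_load(query_times_sec, window_sec):
    """Query load: total time spent by queries completed in a window,
    divided by the window's wall-clock length. A value of 1.5 means
    one and a half CPU-seconds of query work per elapsed second."""
    return sum(query_times_sec) / window_sec

# Three queries finishing within a one-minute window.
load = database_load([30.0, 45.0, 15.0], window_sec=60.0)
print(load)  # 1.5
```

If the result stays above the number of available cores (the CPU capacity line on the graph), queries are competing for CPU.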

Drop-down menus let you filter the graph for a specific database, user, or client address.

Database Load graph

Figure 3. The database load graph, showing load for CPU capacity, CPU and CPU wait, IO Wait, and Lock Wait.

Colored lines in the graph show the query load, split into four categories:

  • CPU capacity: The number of CPUs available on the instance.
  • CPU and CPU Wait: The ratio of the time taken by queries in an active state to wall clock time. IO and Lock waits do not block queries that are in an active state. This metric might mean that the query is using the CPU. Or the query might be waiting for the Linux scheduler to schedule the server process running the query while other processes are using the CPU.

    Note: CPU load accounts for both the runtime and the time waiting for the Linux scheduler to schedule the server process that is running. As a result, the CPU load can go beyond the maximum core line.

  • IO Wait: The ratio of time taken by queries that are waiting for IO to wall clock time. IO wait includes Read IO Wait and Write IO Wait. See the PostgreSQL event table. If you want a breakdown of information for IO waits, you can see it in Cloud Monitoring. See Insights metrics for more information.

  • Lock Wait: The ratio of time taken by queries that are waiting for Locks to wall clock time. It includes Lock Waits, LwLock Waits, and Buffer pin Lock waits. If you want a breakdown of information for lock waits, you can see it in Cloud Monitoring. See Insights metrics for more information.

Now review the graph and use the filtering options to answer these questions:

  1. Is the query load high? Review the query load graph. Is the graph spiking or elevated over time? If you don't see a high load, then the problem isn't with your queries.
  2. How long has the load been high? Is it only high now? Or has it been high for a long time? Use the range selection to select various time periods to find out how long the problem has occurred. Or you can zoom in to view a time window where query load spikes are observed. You can zoom out to view up to one week of the timeline.
  3. What is causing the high load? You can select options to look at the CPU capacity, CPU and CPU wait, lock wait, or IO wait. The graph for each of these options is a different color so that you can see which one has the highest load. The dark blue line on the graph shows the max CPU capacity of the system. It lets you compare the query load with the max CPU system capacity. This comparison helps you know whether an instance is running out of CPU resources.
  4. Which database is experiencing the load? Select different databases from the Databases drop-down menu to find the databases with the highest loads.
  5. Are specific users or IP addresses causing higher loads? Select different users and addresses from the drop-down menus to compare which ones are causing higher loads.

Quick detection: Filtering database load for all queries

The QUERIES table provides an overview of the queries that cause the most query load. The table shows all the normalized queries for the time window and options selected on the Query Insights dashboard.

The QUERIES table sorts queries by the total execution time during the time window you selected.

Figure 4. The database load graph for queries, with filters selected for CPU capacity, CPU and CPU wait, IO Wait, and Lock Wait.

To sort the table, select a property from Filter queries, or select a column heading. The table shows the following properties:

  • Query string. The normalized query string. Query Insights only shows 1024 characters in the query string by default.
  • Database. The database that the query was run against.
  • Load by total time/Load by CPU/Load by IO wait/Load by lock wait. These options let you filter specific queries to find the largest load for each option.
  • Avg exec time (ms). Average time for the query to execute.
  • Times called. The number of times the query was called by the application.
  • Avg rows fetched. The average number of rows fetched for the query.

For PostgreSQL 9.6 and 10, Query Insights displays normalized queries; that is, '?' replaces each literal constant value. In the example below, the name constant is removed and '?' replaces it:

UPDATE
  "demo_customer"
SET
  "customer_id" = ?::uuid,
  "name" = ?,
  "address" = ?,
  "rating" = ?,
  "balance" = ?,
  "current_city" = ?,
  "current_location" = ?
WHERE
  "demo_customer"."id" = ?

For PostgreSQL version 11 and above, $1, $2, and so on replaces the literal constant value. For example:

UPDATE
  "demo_customer"
SET
  "customer_id" = $1::uuid,
  "name" = $2,
  "address" = $3,
  "rating" = $4,
  "balance" = $5,
  "current_city" = $6,
  "current_location" = $7
WHERE
  "demo_customer"."id" = $8

The constant's value is ignored so that Query Insights can aggregate similar queries and remove any PII information that the constant might show.
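A rough illustration of this normalization in Python, replacing literals with $1, $2, and so on. This is a sketch only; the database normalizes on the parse tree, not with regular expressions:

```python
import re

def normalize(sql: str) -> str:
    """Replace string and numeric literals with $1, $2, ... in order
    of appearance, roughly as PostgreSQL 11+ normalization does."""
    counter = 0

    def repl(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        return f"${counter}"

    # Match single-quoted strings (with '' escapes) or bare numbers.
    return re.sub(r"'(?:[^']|'')*'|\b\d+(?:\.\d+)?\b", repl, sql)

print(normalize("UPDATE t SET name = 'bob', rating = 4 WHERE id = 17"))
# UPDATE t SET name = $1, rating = $2 WHERE id = $3
```

Because the literal values are discarded, two calls that differ only in their constants aggregate into the same normalized query.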

To determine whether a query is the root cause of a problem, do the following from the QUERIES tab:

  1. Click Load by total time to see if a query is taking a long time.
  2. If you see a query that looks like it is taking a longer time than other queries, go to the specific query graph for further troubleshooting. If queries look normal, go to the Root cause analysis: Application tags section.

Root cause analysis: Filtering database load for a specific query

When you find a query that is showing a large query load, click the query to open the Database load — specific query dashboard.

The Database load — specific query graph shows a measure of the work (in CPU seconds) that your selected normalized query has performed over time. To calculate load, it uses the ratio of the time taken by completed executions of the normalized query, computed at the minute boundary, to wall clock time.

At the top of the table, the first 1024 characters of the normalized query (where literals are removed for aggregation and PII reasons) are displayed.

Figure 5. The database load graph for a specific query, with filters selected for CPU capacity, CPU and CPU wait, IO Wait, and Lock Wait.

As with the total queries graph, you can filter the load for a specific query by Database, User, and Client Address. Query load is split into CPU capacity, CPU and CPU Wait, IO Wait, and Lock Wait.

Examining query latency

You use the Latency graph to examine latency on the query. Latency is the time taken for the normalized query to complete, in wall clock time. The latency dashboard shows 50th, 95th, and 99th percentile latency to find outlier behaviors.

Figure 6. The query latency graph at the 50th percentile for a specific query.

Latency of parallel queries is measured in wall clock time, even though query load can be higher for such a query because multiple cores are used to run parts of it.

Try to narrow down the problem by looking at the following:

  1. What is causing the high load? Select options to look at the CPU capacity, CPU and CPU wait, lock wait, or IO wait.
  2. Were there spikes in latency? You can change the time window to study the historical latency for the normalized query.
  3. When you find the areas and times for the highest load, go to Identifying the source of the problem using tracing to learn how to drill down further.

Quick detection: Filtering database load for tags

Query Insights provides application-centric monitoring to diagnose performance problems for applications built using ORMs.

If you are responsible for the entire application stack, Query Insights provides query monitoring from an application view. Query tagging helps you find issues in higher-level constructs, such as business logic or a microservice. You might tag queries by business logic, for example, using payment, inventory, business analytics, or shipping tags. You can then find the query load that each area of business logic creates. For example, you might find unexpected events, such as spikes for a business analytics tag at 1 PM, or abnormal growth for a payment service trending over the previous week.

Query load tags provide a breakdown of the query load of the selected tag over time.

To calculate the Database Load for Tag, Query Insights uses the amount of time taken by every query that uses the tag you select. Query Insights calculates the completion time at the minute boundary using wall clock time.
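The per-tag calculation can be sketched as follows (tags and timings are invented):

```python
from collections import defaultdict

# Hypothetical completed-query samples: (tag, execution time in seconds),
# all finishing inside the same one-minute window.
samples = [
    ("payment", 12.0),
    ("inventory", 3.0),
    ("payment", 18.0),
    ("business-analytics", 30.0),
]

WINDOW_SEC = 60.0

# Sum per-tag query time, then divide by wall-clock time to get load.
load_by_tag = defaultdict(float)
for tag, seconds in samples:
    load_by_tag[tag] += seconds / WINDOW_SEC

print(dict(load_by_tag))  # payment contributes 0.5 of the load here
```

Grouping by tag instead of by normalized query is what lets you attribute load to a business function rather than to a single SQL statement.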

On the Query Insights dashboard, select TAGS to view the tags table. The TAGS table sorts tags by their total execution time.

Figure 7. The database load for a selected tag.

You can sort the table by selecting a property from Filter queries, or by clicking a column heading. The table shows the following properties:

  • Action, Controller, Framework, Route, Application, DB Driver. Each property that you added to your queries is shown as a column. At least one of these properties must be added if you want to filter by tags.
  • Load by total time/Load by CPU/Load by IO wait/Load by lock wait. These options let you filter specific queries to find the largest load for each option.
  • Avg execution time (ms). Average time for the query to execute.
  • Times called. The number of times the query was called by the application.
  • Avg rows fetched. The average number of rows fetched for the query.
  • Database. The database that the query was run against.

To determine whether a tag is the root cause of a problem, do the following from the TAGS tab:

  1. Click Load by total time to see if a query is taking a long time.
  2. If you see a tag that looks like it is taking a longer time than other tags, go to the specific tag graph for further troubleshooting.

Root cause analysis: Viewing details for a tagged query load

Similar to a normalized query load, you can click any entry in the tags table to view the details for a tagged query load. For example, if all queries from a payment microservice are tagged payment, you can see how that query load is trending by viewing the payment tag.

Figure 8. Database load and latency graphs for a specific tag.

As with the total queries graph, you can filter the load for a specific query by Database, User, and Client Address. Query load is split into CPU capacity, CPU and CPU Wait, IO Wait, and Lock Wait.

Try to narrow down the problem by looking at the following:

  1. What is causing the high load? Select options to look at the CPU capacity, CPU and CPU wait, lock wait, or IO wait.
  2. How long has the load been high? Is it only high now? Or has it been high for a long time? Change the time ranges to find the date and time that the load started performing poorly.
  3. When you find the areas and times for the highest load, go to Identifying the source of the problem using tracing to learn how to drill down further.

Examining operations in a sampled query plan

A query plan takes a sample of your query and breaks it down into individual operations. It explains and analyzes each operation in the query. The Query plan samples graph shows all the query plans running at particular times and the amount of time each plan took to run.

Figure 9. Graph showing sample query plans, when they were run, and the number of seconds they ran.

To see details for the sample query plan, click the dots on the Sample query plans graph. There is a view of executed sample query plans for most, but not all, queries. Expanded details show a model of all the operations in the query plan. Each operation shows the latency, rows returned, and the cost for that operation. When you select an operation, you can see more details, such as shared hit blocks, the type of schema, the actual loops, plan rows, and more.
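The operations in such a plan form a tree, in the same shape as PostgreSQL's EXPLAIN (ANALYZE, FORMAT JSON) output. A sketch that walks one and prints latency and rows per operation (all field values below are invented):

```python
# A miniature plan tree shaped like PostgreSQL's JSON EXPLAIN output.
plan = {
    "Node Type": "Aggregate",
    "Actual Total Time": 31.06,
    "Actual Rows": 48,
    "Plans": [
        {
            "Node Type": "Nested Loop",
            "Actual Total Time": 28.0,
            "Actual Rows": 120,
            "Plans": [
                {"Node Type": "Index Scan",
                 "Actual Total Time": 9.5, "Actual Rows": 120},
                {"Node Type": "Seq Scan",
                 "Actual Total Time": 15.2, "Actual Rows": 4000},
            ],
        }
    ],
}

def walk(node, depth=0):
    """Yield (depth, node type, latency ms, rows) for every operation."""
    yield depth, node["Node Type"], node["Actual Total Time"], node["Actual Rows"]
    for child in node.get("Plans", []):
        yield from walk(child, depth + 1)

ops = list(walk(plan))
for depth, op, ms, rows in ops:
    print("  " * depth + f"{op}: {ms} ms, {rows} rows")
```

Operations that return far more rows than their parent consumes (here, the sequential scan) are typical starting points for adding an index or rewriting the query.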

Figure 10. The query plan shows the latency and cost for each operation run for the query.

Try to narrow down the problem by looking at the following:

  1. What is the resource consumption?
  2. How does it relate to other queries?
  3. Does consumption change over time?

Identifying the source of the problem using tracing

The final question to answer in our troubleshooting journey is: what is the source of the problem? More specifically: which model, view, controller, route, host, or user is causing the problem?

Query Insights provides an in-context, end-to-end trace view to understand what is going on at the database layer for a specific request. It uses an in-context, end-to-end application trace to find the source of the problematic query by model, view, controller, and route. If you enable OpenCensus or OpenTelemetry, span information is sent to the database along with the tag information inside SQL comments. Any traces from the application to Cloud Logging are linked with database query plan traces to identify the source of the problem. Click the END TO END tab to look at the in-context trace.

Figure 11. Select a tag under End to End to view specific information about the tag.

To determine the client and user causing the problem, use the top client addresses and top users tables to find the highest loads. You can add a user or IP address to the filter to further analyze a specific user or client address.

Figure 12. The details for Top client addresses and Top users include the percentage of the query load, the average execution time in ms, and the times called.

For queries, you can use Cloud Trace to see end-to-end tracing for each step in the query plan. On the Insights dashboard, click the View in trace link to open the Cloud Trace tool.

Figure 13. The trace graph shows all the traces that have been run during the selected period of time.

For details about how to use tools in Cloud Trace, see Finding and viewing traces.

Adding tags to SQL queries

Tagging SQL queries simplifies application troubleshooting. If you use an ORM and sqlcommenter together, the tags are created automatically without requiring you to add custom code to your application. If you are not using an ORM, you must add sqlcommenter tags to your SQL queries manually.

Using sqlcommenter to add tags manually

Augment each SQL statement with a comment containing one or more serialized key-value pairs. Use at least one of the following keys:

  • action=''
  • controller=''
  • framework=''
  • route=''
  • application=''
  • db_driver=''

Query Insights drops all other keys. See the sqlcommenter documentation for the correct SQL comment format.
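If you build the comment by hand, the format can be sketched as follows. This is a simplified version of the sqlcommenter serialization; the official library also handles escaping edge cases:

```python
from urllib.parse import quote

def tag_query(sql: str, **tags: str) -> str:
    """Append a sqlcommenter-style comment to a SQL statement.
    Keys are sorted and values are URL-encoded before being wrapped
    in single quotes inside a trailing /* ... */ comment."""
    serialized = ",".join(
        f"{key}='{quote(value)}'" for key, value in sorted(tags.items())
    )
    return f"{sql} /*{serialized}*/"

print(tag_query("SELECT * FROM users", action="run this", controller="foo"))
# SELECT * FROM users /*action='run%20this',controller='foo'*/
```

Because the tags live in a comment, the database executes the statement unchanged while Query Insights can still aggregate statistics by tag.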

Using sqlcommenter with ORM

When you use object-relational mapping (ORM) instead of writing SQL queries directly, you might not be able to find the application code that is causing performance problems, and you might have trouble analyzing how your application code affects query performance. To tackle that pain point, Query Insights provides an open source ORM instrumentation library called sqlcommenter, which you install on the application server. The library propagates application information related to your Model, View, and Controller (MVC) framework to the database along with the queries, as a SQL comment. The database picks up these tags and starts recording and aggregating statistics by tag, which are orthogonal to statistics aggregated by normalized queries. Query Insights shows the tags so that you know which application is causing the query load, and which application code is causing performance problems. The whole process requires no change to the application code.

When you examine results in SQL database logs, they look like the following:

SELECT * from USERS /*action='run+this',
controller='foo%3',
traceparent='00-01',
tracestate='rojo%2'*/

Supported tags include the controller name, route, framework, and action.

sqlcommenter supports ORMs for various programming languages:

Python
  • Django
  • psycopg2
  • SQLAlchemy
  • Flask
Java
  • Hibernate
  • Spring
Ruby
  • Rails
Node.js
  • Knex.js
  • Sequelize.js
  • Express.js

For more information, see the sqlcommenter documentation in GitHub.

Installing sqlcommenter with your ORM framework

For more information and instructions on installation, see the sqlcommenter documentation in GitHub.

Disabling Insights

To disable Insights on an instance:

Console

  1. Go to the Cloud SQL Instances page in the Google Cloud Console.
  2. Click on an instance name to navigate to the Overview page.

  3. On the Configuration tile, click Edit configuration.
  4. Under Configuration options, expand Query Insights.
  5. Uncheck the "Enable Query Insights" checkbox.
  6. Click Save.

gcloud

Run gcloud beta sql instances patch with the --no-insights-config-query-insights-enabled flag.

  gcloud beta sql instances patch INSTANCE_NAME \
    --no-insights-config-query-insights-enabled
  

REST

Before using any of the request data below, make the following replacements:

  • project-id: The project ID
  • instance-id: The instance ID

HTTP method and URL:

PATCH https://sqladmin.googleapis.com/sql/v1beta4/projects/project-id/instances/instance-id

Request JSON body:

{
  "settings" : { "insightsConfig" : { "queryInsightsEnabled" : false } }
}


You should receive a JSON response similar to the following:

{
  "kind": "sql#operation",
  "targetLink": "https://sqladmin.googleapis.com/sql/v1beta4/projects/project-id/instances/instance-id",
  "status": "PENDING",
  "user": "user@example.com",
  "insertTime": "2021-01-28T22:43:40.009Z",
  "operationType": "UPDATE",
  "name": "operation-id",
  "targetId": "instance-id",
  "selfLink": "https://sqladmin.googleapis.com/sql/v1beta4/projects/project-id/operations/operation-id",
  "targetProject": "project-id"
}

What's Next