Identify where latency occurs

This topic describes how to troubleshoot Spanner components to find the source of the latency. To learn more about possible latency points in a Spanner request, see Latency points in a Spanner request.

  1. In the client application that's experiencing the issue, confirm there's an increase in client round-trip latency. Check the following dimensions from your client-side metrics.

    • Client Application Name
    • Client locality (for example, Compute Engine VM zones) and Host (that is, VM names)
    • Spanner API method
    • Spanner API status

    Group by these dimensions to see if the issue is limited to a specific client, status, or method. For multi-regional workloads, see if the issue is limited to a specific client or Spanner region.
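For example, grouping latency samples by these dimensions can show whether one method or status dominates the tail. The sample data and dimension names below are hypothetical; in practice these values come from your client-side metrics exporter.

```python
from collections import defaultdict

# Hypothetical samples: (client_app, zone, api_method, api_status, latency_ms).
samples = [
    ("checkout", "us-east1-b", "ExecuteSql", "OK", 12.0),
    ("checkout", "us-east1-b", "ExecuteSql", "OK", 350.0),
    ("checkout", "us-east1-b", "Commit", "OK", 15.0),
    ("search", "us-central1-a", "ExecuteSql", "OK", 11.0),
    ("search", "us-central1-a", "ExecuteSql", "DEADLINE_EXCEEDED", 900.0),
]

def latency_by_dimension(samples, key_fn):
    """Group samples by a dimension tuple; report max latency per group."""
    groups = defaultdict(list)
    for app, zone, method, status, ms in samples:
        groups[key_fn(app, zone, method, status)].append(ms)
    return {k: max(v) for k, v in groups.items()}

# Group by (method, status): is the issue limited to one method or status?
print(latency_by_dimension(samples, lambda a, z, m, s: (m, s)))
```

The same helper can be re-keyed by client application name or zone to test the other dimensions listed above.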

  2. Check your client application health, especially the computing infrastructure on the client side (for example, VM, CPU, or memory utilization, connections, file descriptors, and so on).
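A quick client-side health snapshot can rule out resource exhaustion before you look at Spanner itself. This is a Linux-oriented sketch using only the standard library; the 80% file-descriptor threshold is an illustrative assumption, not official guidance.

```python
import os
import resource

def client_host_snapshot():
    """Collect basic client-side health signals (Linux-oriented sketch)."""
    soft_fd_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # /proc/self/fd lists this process's open file descriptors on Linux.
    open_fds = (
        len(os.listdir("/proc/self/fd")) if os.path.isdir("/proc/self/fd") else -1
    )
    load_1m, _load_5m, _load_15m = os.getloadavg()
    return {
        "cpu_count": os.cpu_count(),
        "load_1m": load_1m,
        "open_fds": open_fds,
        "fd_limit_soft": soft_fd_limit,
    }

snap = client_host_snapshot()
# Flag obvious resource exhaustion before suspecting Spanner.
if snap["open_fds"] >= 0 and snap["open_fds"] > 0.8 * snap["fd_limit_soft"]:
    print("warning: file descriptors nearly exhausted")
print(snap)
```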

  3. Check latency in Spanner components:

    a. Check client round-trip latency with OpenTelemetry or with OpenCensus.

    b. Check Google Front End (GFE) latency with OpenTelemetry or with OpenCensus.

    c. Check Spanner API request latency with OpenTelemetry or with OpenCensus.

    If you have high client round-trip latency but low GFE latency and low Spanner API request latency, the issue might be in the application code. It could also indicate a networking issue between the client and the regional GFE. If your application has a performance issue that causes some code paths to be slow, then the client round-trip latency for each API request might increase. There might also be an issue in the client computing infrastructure that wasn't detected in the previous step.

  4. Check the following dimensions for Spanner metrics:

    • Spanner Database Name
    • Spanner API method
    • Spanner API status

    Group by these dimensions to see if the issue is limited to a specific database, status, or method. For multi-regional workloads, check to see if the issue is limited to a specific region.

    If you have high GFE latency but low Spanner API request latency, it might have one of the following causes:

    • Accessing a database from another region. This can lead to high GFE latency and low Spanner API request latency. For example, traffic from a client in the us-east1 region to an instance in the us-central1 region might have high GFE latency but lower Spanner API request latency.

    • There's an issue at the GFE layer. Check the Google Cloud Status Dashboard to see if there are any ongoing networking issues in your region. If there aren't any issues, then open a support case and include this information so that support engineers can help with troubleshooting the GFE.
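The latency comparisons in steps 3 and 4 can be sketched as a heuristic. The `slow_factor` threshold below is illustrative only, not official guidance; tune it against your own baselines.

```python
def likely_latency_source(client_rt_ms, gfe_ms, api_ms, slow_factor=3.0):
    """Compare the three latency measurements to localize the problem."""
    if client_rt_ms > slow_factor * gfe_ms and gfe_ms <= slow_factor * api_ms:
        # High round-trip latency, but GFE and API latency are both low.
        return "client application code, client infrastructure, or client-to-GFE network"
    if gfe_ms > slow_factor * api_ms:
        # High GFE latency but low Spanner API request latency.
        return "cross-region access or a GFE-layer issue"
    # All three are comparable: the time is being spent in Spanner itself.
    return "Spanner backend (investigate CPU, hotspots, queries)"

print(likely_latency_source(500, 20, 18))    # client-side issue
print(likely_latency_source(120, 100, 15))   # cross-region or GFE
print(likely_latency_source(120, 110, 105))  # Spanner backend
```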

  5. Check the CPU utilization of the instance. If CPU utilization is above the recommended level, manually add more nodes, or set up autoscaling. For more information, see Autoscaling overview.
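A capacity check like the following can gate an alert or an autoscaling decision. The thresholds are assumptions based on commonly cited Spanner guidance (65% high-priority CPU for regional instances, 45% for multi-region); verify them against the current Spanner documentation for your configuration.

```python
# Assumed recommended high-priority CPU maximums (verify against current docs).
RECOMMENDED_MAX = {"regional": 0.65, "multi-region": 0.45}

def needs_more_capacity(high_priority_cpu, instance_config):
    """Return True if utilization exceeds the assumed recommended maximum."""
    return high_priority_cpu > RECOMMENDED_MAX[instance_config]

print(needs_more_capacity(0.72, "regional"))      # above 65%: add nodes or autoscale
print(needs_more_capacity(0.40, "multi-region"))  # below 45%: capacity is fine
```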

  6. Use Key Visualizer to observe and troubleshoot potential hotspots or unbalanced access patterns. Roll back any application code changes that strongly correlate with the issue timeframe.
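The idea behind a hotspot check can be sketched in miniature. The access log and 50% threshold below are hypothetical; a real investigation reads the heat patterns from Key Visualizer itself.

```python
from collections import Counter

# Hypothetical per-row access log; each key stands in for a row range
# that Key Visualizer would render as a band in its heatmap.
accessed_keys = [
    "user#1001", "user#1001", "user#1001", "user#1001",
    "user#1002", "user#2317", "user#9008",
]

def hot_keys(keys, threshold=0.5):
    """Flag keys that receive a disproportionate share of total traffic."""
    counts = Counter(keys)
    total = len(keys)
    return [k for k, n in counts.items() if n / total >= threshold]

print(hot_keys(accessed_keys))  # ["user#1001"] is a candidate hotspot
```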

  7. Check for any traffic pattern changes.

  8. Check Query insights and Transaction insights to identify any query or transaction performance bottlenecks. In most cases, following Spanner best practices helps you optimize your queries and transactions.
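When reviewing query statistics, total latency (execution count times average latency) often points to a different culprit than average latency alone. The rows below are hypothetical stand-ins for the per-query statistics that Query insights surfaces.

```python
# Hypothetical rows: (query_text, execution_count, avg_latency_seconds).
query_stats = [
    ("SELECT * FROM Orders WHERE customer_id = @id", 50_000, 0.004),
    ("SELECT * FROM Orders", 40, 2.5),
    ("SELECT COUNT(*) FROM Items", 1_000, 0.020),
]

def rank_by_total_latency(rows):
    """Sort queries by total latency contribution, highest first."""
    return sorted(rows, key=lambda r: r[1] * r[2], reverse=True)

for text, count, avg in rank_by_total_latency(query_stats):
    print(f"{count * avg:8.1f}s total  {text}")
```

Here the frequent cheap query contributes 200 seconds of total latency, outweighing the slow full scan's 100 seconds, so it may deserve attention first.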

  9. Use the procedures in Oldest active queries to find any expensive queries that might cause a performance bottleneck, and cancel those queries as needed.
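The filtering step can be sketched as follows. The row shape and the 10-minute cutoff are assumptions for illustration; check the Spanner introspection documentation for the actual columns returned by the oldest-active-queries statistics table.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical rows shaped like oldest-active-queries output:
# (session_id, start_time, query_text).
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
active = [
    ("s1", now - timedelta(seconds=5), "SELECT 1"),
    ("s2", now - timedelta(minutes=30), "SELECT * FROM BigTable"),
]

def queries_older_than(rows, now, limit=timedelta(minutes=10)):
    """Pick out long-running queries that are candidates for cancellation."""
    return [sid for sid, started, _text in rows if now - started > limit]

print(queries_older_than(active, now))  # ["s2"]
```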

  10. Use the procedures in the troubleshooting sections of the following topics to troubleshoot the issue further with Spanner introspection tools:

What's next