Troubleshooting app latency with Cloud Spanner and OpenCensus

This tutorial describes how to troubleshoot latency problems in apps that use Spanner, by using OpenCensus and Cloud Trace. It demonstrates how to identify common problems related to complex transactions, large payloads, and full table scans. The example app generates random load and intentionally exposes issues to demonstrate troubleshooting techniques.

This tutorial assumes that you are familiar with Go development, Cloud Spanner API, and OpenCensus tracing APIs.

The following diagram shows a schematic configuration of the test app used in this tutorial.

Architecture of tutorial app.

The app is instrumented with the OpenCensus library. The generated trace and metrics data is stored and visualized in Cloud Trace and Cloud Monitoring.

Objectives

  • Add OpenCensus trace and metrics instrumentation to an app that uses Spanner.
  • Interpret trace data to identify sources of latency.
  • Add trace-log correlation to enable a deeper understanding of app behavior.

Costs

This tutorial uses the following billable components of Google Cloud:

  • Cloud Spanner
  • Compute Engine
  • Cloud Trace
  • Cloud Logging
  • Cloud Monitoring

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Cleaning up.

Before you begin

  1. In the Cloud Console, go to the project selector page.

    Go to the project selector page

  2. Select or create a Cloud project.

  3. Make sure that billing is enabled for your Google Cloud project. Learn how to confirm billing is enabled for your project.

  4. In the Cloud Console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Cloud Console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Cloud SDK already installed, including the gcloud command-line tool, and with values already set for your current project. It can take a few seconds for the session to initialize.

  5. Enable the Cloud Monitoring, Cloud Trace, Cloud Logging, Compute Engine, and Cloud Spanner APIs:
    gcloud services enable stackdriver.googleapis.com \
        cloudtrace.googleapis.com \
        spanner.googleapis.com \
        logging.googleapis.com \
        compute.googleapis.com
    

Setting up your environment

  1. In Cloud Shell, clone the GitHub project.

    git clone https://github.com/GoogleCloudPlatform/opencensus-spanner-demo
    
  2. Import the environment variables contained in the setup.env file into your Cloud Shell environment:

    cd opencensus-spanner-demo
    source ./setup.env
    

Setting up Spanner

In the following steps, you set up Spanner and create some tables for the test app with the same schema as Getting started with Spanner in Go. There are two tables: Singers and Albums. Albums has a reference to Singers in the SingerId field.

  1. In Cloud Shell, create a Spanner instance:

    gcloud spanner instances create $SPANNER_INSTANCE \
        --config=regional-us-central1 \
        --description="Test Instance" \
        --nodes=1
    
  2. Create a database:

    gcloud spanner databases create $DATABASE --instance=$SPANNER_INSTANCE
    

    The tables in the test app represent singers and albums.

  3. In the Cloud Console, go to Spanner.

    Go to Spanner

  4. To create the Singers table, click Create table.

  5. Click Edit as text, and then enter the following in the DDL statements field:

    CREATE TABLE Singers (
      SingerId   INT64 NOT NULL,
      FirstName  STRING(1024),
      LastName   STRING(1024),
      BirthDate  DATE,
      LastUpdated TIMESTAMP,
    ) PRIMARY KEY(SingerId);
    
  6. Click Create.

  7. To create the Albums table, click Create table.

  8. Click Edit as text, and then enter the following in the DDL statements field:

    CREATE TABLE Albums (
      SingerId        INT64 NOT NULL,
      AlbumId         INT64 NOT NULL,
      AlbumTitle      STRING(MAX),
      MarketingBudget INT64,
    ) PRIMARY KEY(SingerId, AlbumId),
      INTERLEAVE IN PARENT Singers ON DELETE CASCADE;
    
  9. Click Create.

  10. To add an index for LastName, complete the following steps:

    1. Click the Singers table.
    2. Click Create index.
    3. Click Edit as text, and enter the following information in the DDL statements field:

      CREATE INDEX SingersByLastName ON Singers(LastName)
      
  11. Click Create.
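If you prefer the command line, you can apply the same index from Cloud Shell with a DDL update. This is a sketch that assumes the $DATABASE and $SPANNER_INSTANCE variables set earlier in this tutorial:

```shell
gcloud spanner databases ddl update $DATABASE \
    --instance=$SPANNER_INSTANCE \
    --ddl="CREATE INDEX SingersByLastName ON Singers(LastName)"
```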

Install the app on a Compute Engine instance

  1. In Cloud Shell, create a Compute Engine instance to run the test app from:

    gcloud compute instances create $CLIENT_INSTANCE \
        --zone=$ZONE \
        --scopes=https://www.googleapis.com/auth/cloud-platform \
        --boot-disk-size=200GB
    
  2. Find the email address of the service account associated with the instance:

    gcloud compute instances describe $CLIENT_INSTANCE --zone=$ZONE \
        --format="value(serviceAccounts.email)"
    

    Make a note of the service account email address for the next step.

  3. Grant the role of roles/spanner.databaseUser to the instance service account:

    SA_ACCOUNT=service-account-id
    gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT \
        --member serviceAccount:$SA_ACCOUNT \
        --role roles/spanner.databaseUser
    

    Replace the following:

    • service-account-id: the service account email address that you retrieved in the preceding step.
  4. Connect to the instance with SSH:

    gcloud compute ssh --zone $ZONE $CLIENT_INSTANCE
    
  5. Install Git on the instance:

    sudo apt-get update
    sudo apt-get install -y git
    
  6. Install Go version 1.12.

  7. Get the source code for this tutorial:

    git clone https://github.com/GoogleCloudPlatform/opencensus-spanner-demo.git
    cd opencensus-spanner-demo
    

Run the app

  1. In Cloud Shell, set your Google Cloud project ID:

    export GOOGLE_CLOUD_PROJECT=your-project-id
    

    Replace the following:

    • your-project-id: your Google Cloud project ID.
  2. Initialize the environment:

    source ./setup.env
    
  3. Build the code:

    go build
    
  4. Run the test app:

    nohup ./oc-spannerlab --project=$GOOGLE_CLOUD_PROJECT \
        --instance=$SPANNER_INSTANCE \
        --database=$DATABASE \
        --command=simulation \
        --iterations=10000 &
    

    This command runs 10,000 iterations of the test app in simulation mode, which executes a random combination of queries and updates to generate metrics and trace data. The command takes about ten minutes to run. You can track progress with the following command:

    tail -f nohup.out
    
  5. In the Cloud Console, go to the Logs Viewer page.

    Go to the Logs Viewer page

    Check for errors in the logs.

  6. In the Cloud Console, go to the Trace list page.

    Go to Trace list page

    Review the trace data.

    Scatterplot with trace data and timeline.

The test app code structure

The following sections show how to use Cloud Trace to solve problems that the test app is intentionally designed to show.

The approach of using the OpenCensus Spanner integration, as described in this tutorial, requires a relatively small amount of code to enable tracing of app code that uses Spanner. OpenCensus has a built-in integration with gRPC, which enables Spanner client API instrumentation. However, some issues might require app instrumentation beyond gRPC. For more information, see the OpenCensus Quickstarts and other resources in the what's next section of this tutorial.

Google Cloud collects and displays monitoring data for most services, including storage services. However, it's important to instrument your code to provide app performance data that reflects your users' experience. The OpenCensus Google Cloud integration guide describes how to instrument app code. In some cases, the client libraries are already instrumented, and you only need to export the stats and trace data. Currently, only the Java and Go Spanner clients are instrumented with OpenCensus tracing.

To send app logs to Logging, this tutorial uses the Cloud Client Libraries for Go. The applog package is included to correlate traces with logs; trace-log correlation is discussed in a later section. The OpenCensus Stackdriver exporter, the OpenCensus gRPC plug-in, and the stats and trace packages are imported in the following Go code snippet:

import (
	"bytes"
	"context"
	"flag"
	"fmt"
	"os"

	"cloud.google.com/go/spanner"

	"contrib.go.opencensus.io/exporter/stackdriver"
	"go.opencensus.io/plugin/ocgrpc"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/trace"

	log "github.com/GoogleCloudPlatform/opencensus-spanner-demo/applog"
	"github.com/GoogleCloudPlatform/opencensus-spanner-demo/query"
	"github.com/GoogleCloudPlatform/opencensus-spanner-demo/testdata"
	"github.com/GoogleCloudPlatform/opencensus-spanner-demo/update"
)

The app must initialize the Stackdriver exporter, which sends trace and metrics data to Cloud Trace and Cloud Monitoring:

func initOC(project string) *stackdriver.Exporter {
	se, err := stackdriver.NewExporter(stackdriver.Options{
		ProjectID:    project,
		MetricPrefix: "spanner-oc-test",
	})
	if err != nil {
		ctx := context.Background()
		log.Fatalf(ctx, "Failed to create exporter: %v", err)
	}
	trace.RegisterExporter(se)
	view.RegisterExporter(se)
	if err := view.Register(ocgrpc.DefaultClientViews...); err != nil {
		ctx := context.Background()
		log.Fatalf(ctx, "Failed to register gRPC default client views: %v", err)
	}
	trace.ApplyConfig(trace.Config{DefaultSampler: trace.AlwaysSample()})
	return se
}

The preceding code initializes the Stackdriver exporter, registers the default gRPC client views, and sets the trace sampler to AlwaysSample. Sampling every request is appropriate for a test app; in production, sample a smaller fraction of requests to reduce overhead.

In OpenCensus, a trace is a tree of spans. Top-level spans are created for each of the transactions, as illustrated in the following Go code:

ctx, span := trace.StartSpan(ctx, "query-albums")
defer span.End()

For more information, see tracing.

The preceding StartSpan() call is an example of a custom span that is added to the app. The Spanner client library also creates spans by using the gRPC integration with OpenCensus. The queryAlbums() function in query.go contains an example of these custom spans.

The following code snippet from queryAlbums() calls the Spanner ReadOnlyTransaction() method, which creates a span. The span closes when the ReadOnlyTransaction object ro goes out of scope.

ro := client.ReadOnlyTransaction()
defer ro.Close()

Interpreting the data

The following sections explain the reasons for differences in latency measures and how those sources appear in the trace data. In these steps, you compare data for different traces. The details that you see in your own traces might be different, but the same basic principles apply.

Check transaction latency

The app follows this sequence to add a singer and album:

  1. The AddAllTxn() function opens a read-write transaction to contain the database reads and writes.
  2. The getSingerId() function checks whether the singer is in the database, based on the singer's first and last name. If the singer is in the database, the function retrieves the singer's ID for later use. If not, the function adds the new singer.
  3. The getAlbumId() function uses the singer ID and album title to check whether the album is in the database. If the album is in the database, the app takes no further action.
  4. If the album isn't in the database, the app adds the new album.

The preceding oc-spannerlab command for the test app runs 10,000 generated iterations, randomly selecting between these different functions to simulate a real app.
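The following standalone Go sketch illustrates this control flow, with an in-memory map standing in for the database. The db type and its methods are hypothetical stand-ins for illustration only; in the real app, the equivalent reads and writes run inside a single Spanner read-write transaction.

```go
package main

import "fmt"

// db is an in-memory stand-in for the Singers and Albums tables,
// used only to illustrate the transaction's control flow.
type db struct {
	singers map[string]int64          // "First Last" -> SingerId
	albums  map[int64]map[string]bool // SingerId -> set of album titles
	nextID  int64
}

// getSingerID returns the existing SingerId for the name, or inserts
// a new singer and returns the new ID, mirroring getSingerId().
func (d *db) getSingerID(first, last string) int64 {
	key := first + " " + last
	if id, ok := d.singers[key]; ok {
		return id
	}
	d.nextID++
	d.singers[key] = d.nextID
	return d.nextID
}

// addAlbum inserts the album only if it isn't already present,
// mirroring the getAlbumId() check followed by the insert.
func (d *db) addAlbum(singerID int64, title string) bool {
	if d.albums[singerID] == nil {
		d.albums[singerID] = map[string]bool{}
	}
	if d.albums[singerID][title] {
		return false // album already exists; no further action
	}
	d.albums[singerID][title] = true
	return true
}

func main() {
	d := &db{singers: map[string]int64{}, albums: map[int64]map[string]bool{}}
	id := d.getSingerID("Alice", "Zero")
	fmt.Println(d.addAlbum(id, "Total Junk")) // newly inserted: true
	fmt.Println(d.addAlbum(id, "Total Junk")) // already present: false
}
```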

To review the trace data, complete the following steps:

  1. In the Cloud Console, go to the Trace list page.

    Go to Trace list page

  2. To explore the scatter plot, click four trace points. You can identify the different code paths in the test app with the span names. Compare the latency values for the different spans.

    Timeline of 4 trace points with different code paths.

    The root span at the top of the trace detail shows that the complete time for the add-singer trace, including child spans, is 101.925 ms. In this trace, the read-write transaction starts at the beginning of the request. The following screenshot shows a trace that contains multiple short transactions. For more information about transactions, see Transactions.

    Example trace with four short transactions.

    In this trace, the read-write transaction starts midway through the sequence, and the complete add-singer operation takes only 42.747 ms.

Check the effect of a large payload

Payload size is closely related to latency because a larger amount of data needs to be transmitted over the network and processed by the app. The example app loads enough data into Spanner to demonstrate the effects of response payload size. For more information on other factors affecting latency, see Latency metrics.

The following query in the app code returns all rows in the table:

SELECT SingerId, AlbumId, AlbumTitle FROM Albums
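One way to bound the response payload is to page through the table instead of reading it all at once. The following is a sketch of keyset pagination, where @lastSingerId and @lastAlbumId are hypothetical query parameters holding the last key of the previous page:

```sql
SELECT SingerId, AlbumId, AlbumTitle
FROM Albums
WHERE SingerId > @lastSingerId
   OR (SingerId = @lastSingerId AND AlbumId > @lastAlbumId)
ORDER BY SingerId, AlbumId
LIMIT 100
```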

To check the effect of payload size on latency, complete the following steps:

  1. In the Cloud Console, go to the Trace list page.

    Go to Trace List page

  2. In the Request filter field, enter query albums, and then press Enter.

  3. Click one of the points in the scatter plot.

  4. Click Show events.

    Events displaying a larger payload.

    You can identify the query by the trace name. The number of bytes received (109,560) is also shown in the trace detail.

Use a secondary index instead of full table scans

Querying a field other than the primary key can be expensive on large tables that lack secondary indexes, because Spanner must read the full table to find qualifying rows. Creating a secondary index avoids the table scan and can reduce latency by looking up qualifying rows directly through the indexed columns. In the example app, a secondary index is created on the Singers table for the LastName field with the following statement from the preceding instructions:

CREATE INDEX SingersByLastName ON Singers(LastName)

Use of secondary indexes like the preceding index is a best practice to speed up common queries. The secondary index is helpful for queries such as the following:

SELECT
  SingerId, FirstName, LastName
FROM
  Singers
WHERE
  LastName = 'Zero'

In many cases, Spanner automatically uses any secondary indexes that are likely to make the query more efficient. In other cases, Spanner doesn't automatically use an index, or it might choose an index that increases query latency.

If you've followed the troubleshooting steps for performance regressions and confirmed that you want to try a different index for the query, you can specify the index in the query with the FORCE_INDEX directive, as follows:

SELECT
  SingerId, FirstName, LastName
FROM
  Singers@{FORCE_INDEX=SingersByLastName}
WHERE
  LastName = 'Zero'

To check the query execution time, complete the following steps:

  1. In the Cloud Console, go to Spanner.

    Go to Spanner

  2. Click the name of your instance. In this tutorial, it's Test Instance.

  3. Click the name of your database. In this tutorial, it's test.

  4. Click the Query stats tab.

  5. Select a 1-hour time interval that includes the test run.

  6. Select the preceding query to view the Query stats.

    View the stats in the Details pane, including average latency (sec) and average rows scanned.

    This query does a back-join to the base table to find the SingerId and FirstName columns. You can improve the query further by following the suggestions in Spanner SQL best practices.

Correlating traces with logs

To see whether query-optimization best practices affect your app, you might need detailed information about code paths and the actual Spanner queries executed. The queries aren't available in the Spanner log, but the test app adds them to the app logs. You could add this information to trace annotations; however, logging is typically the first choice for debugging. With trace-log correlation, you can view log messages directly in the trace detail. If the trace ID is added to the Entry struct in the Cloud Client Libraries for Go, you can correlate trace and log data. The test app demonstrates this correlation.

The following example code for trace-log correlation is located in the applog package:

func printf(ctx context.Context, severity logging.Severity, format string,
	v ...interface{}) {
	span := trace.FromContext(ctx)
	if client == nil {
		log.Printf(format, v...)
	} else if span == nil {
		lg := client.Logger(LOGNAME)
		lg.Log(logging.Entry{
			Severity: severity,
			Payload:  fmt.Sprintf(format, v...),
		})
	} else {
		sCtx := span.SpanContext()
		tr := sCtx.TraceID.String()
		lg := client.Logger(LOGNAME)
		trace := fmt.Sprintf("projects/%s/traces/%s", projectId, tr)
		lg.Log(logging.Entry{
			Severity: severity,
			Payload:  fmt.Sprintf(format, v...),
			Trace:    trace,
			SpanID:   sCtx.SpanID.String(),
		})
	}
}
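The key to the correlation is the fully qualified trace name written to the Entry's Trace field. That formatting can be sketched in isolation as follows; the project and trace IDs are placeholders:

```go
package main

import "fmt"

// traceResource builds the fully qualified trace name that Cloud Logging
// expects in the logging.Entry Trace field for trace-log correlation.
func traceResource(projectID, traceID string) string {
	return fmt.Sprintf("projects/%s/traces/%s", projectID, traceID)
}

func main() {
	// Placeholder values; a real trace ID is the 32-character hex ID
	// taken from the span context, as in the printf function above.
	fmt.Println(traceResource("my-project", "4bf92f3577b34da6a3ce929d0e0e4736"))
}
```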

View log messages in traces

  1. In the Cloud Console, go to the Trace list page.

    Go to Trace List page

  2. Click one of the points.

  3. In the Timeline chart, click Show logs.

  4. To see the related log entry in the logs viewer, click Open in logs viewer .

    Trace-log correlation in timeline

  5. The log detail also has a link to the related entries in the Logs Viewer. Click Open in Logs Viewer to see the log detail. Copy the following query from the log to use in a later step.

    Log detail with query.

Use queries extracted from logs

The constructed filter is useful because it links all the log entries for a single request, which is hard to do without a trace ID. The executed query and the number of results returned appear in the log detail because they were added to a log statement in the app's code. You can take the query from there and execute it in Spanner to find the query execution plan.
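For reference, a Logs Viewer filter keyed on the trace name takes the following form; the project and trace IDs are placeholders:

```
trace="projects/my-project/traces/4bf92f3577b34da6a3ce929d0e0e4736"
```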

  1. In the Cloud Console, go to the Trace list page.

    Go to Trace List page

  2. In the Request field, enter query-limit.

  3. To see the log detail, follow the steps in the section View log messages in traces.

  4. Click Test instance and then click the test database.

  5. Click Query.

  6. In the Query database: test field, enter the query that you copied from the log.

  7. Click Run query.

    Query database.

The query execution plan includes a table scan. This query is fine for small tables, but is a problem with large tables. If only a few rows are required, a full table scan is an expensive way to obtain results and can lead to high latency. The Rows scanned field is important to check. This trace shows only 10 rows scanned. For more information on query optimization, see SQL best practices, Query execution plans, and Query execution operators.

Viewing metrics

Metrics are collected and stored as aggregates rather than as individual measurements. Aggregations are more useful for observing trends, whereas traces are better for debugging specific requests. Spanner provides a set of metrics, including CPU utilization and latency metrics. To diagnose problems, you can combine Spanner metrics with metrics collected by OpenCensus, which provide the client's view. To compare client and service latency metrics, follow these steps.
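The chart that you build next uses 50th and 95th percentile aggregators. As a rough, self-contained illustration of what a percentile aggregation computes over a window of latency samples (nearest-rank method; Cloud Monitoring computes percentiles from distribution buckets, so this is conceptual only):

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the sample at percentile p (0-100) using the
// nearest-rank method over a copy of the input.
func percentile(samples []float64, p float64) float64 {
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	rank := int(p/100*float64(len(s)) + 0.5)
	if rank < 1 {
		rank = 1
	}
	if rank > len(s) {
		rank = len(s)
	}
	return s[rank-1]
}

func main() {
	latencies := []float64{5, 6, 7, 8, 9, 10, 12, 15, 20, 40} // ms
	fmt.Println(percentile(latencies, 50))                    // median
	fmt.Println(percentile(latencies, 95))                    // tail latency
}
```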

  1. In the Google Cloud Console, go to Monitoring or use the following button:
    Go to Monitoring
  2. If Metrics Explorer is shown in the navigation pane, click Metrics Explorer. Otherwise, select Resources and then select Metrics Explorer.
  3. Ensure Metric is the selected tab.
  4. Click in the box labeled Find resource type and metric, and then select from the menu or enter the name for the resource and metric. Use the following information to complete the fields for this text box:
    1. For Resource, enter opencensus.
    2. For Metric, enter roundtrip_latency.
    3. For Filter, click grpc_client_method equals ExecuteSQL, and then click Apply.
    4. For Aggregator, select 50th percentile.

  5. Click Add Metric.
  6. Repeat the preceding steps to add the same Resource, Metric, and Filter. For Aggregator, select 95th percentile.
  7. Click Save chart.
  8. In the Chart title field, enter Spanner E-E Latency.
  9. Click New dashboard and enter Spanner Metrics.
  10. Click Save.
  11. To view the dashboard you created, go to Dashboards > Spanner Metrics.
  12. Click Show all legends.

    The chart filters for the ExecuteSql method.

    Chart for median and 95th percentile latency from the client perspective.

Compare app and Spanner view of metrics

Now that you've created a chart from the app's perspective, you create another chart from Spanner's perspective to compare the latency.

  1. In the Cloud Console, click Add Chart.
  2. In the Find resource type and metric text box, type spanner.
  3. Click the metric spanner.googleapis/api/request_latency.
  4. In the Filter field, click method equals ExecuteSQL, and then click Apply.
  5. In the Aggregator drop-down list, click 50th percentile.
  6. Click Add Metric.
  7. Repeat the preceding steps to add the same Resource, Metric, and Filter. For Aggregator, select 95th percentile.
  8. Click Add Metric.
  9. In the Chart title field, enter Spanner latency (backend).
  10. Click Save Chart.

    Chart for median and 95th percentile latency as reported by Spanner.

  11. Click the line in the chart at any point. For the purposes of this tutorial, 9:27 is the point clicked in the preceding chart to check the median values.

    At that time, the value for the client is 8.9 ms, compared to a median value of 6.1 ms reported by Spanner. The 2.8 ms difference is consumed in network transmission and gRPC client processing. If the difference is large, start by checking the relative distance between the app client and the Spanner service regions, and the size of the payloads, which you can judge from the metrics for bytes received from and sent to Spanner. If the latency reported by Spanner itself is high, check the rows scanned, the query execution plan, and the metrics for Spanner node CPU utilization.

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial:

Delete the project

  1. In the Cloud Console, go to the Manage resources page.

    Go to the Manage resources page

  2. In the project list, select the project that you want to delete and then click Delete .
  3. In the dialog, type the project ID and then click Shut down to delete the project.

Delete the individual resources

If you prefer not to delete the entire project, you can delete individual resources. However, you cannot delete the monitoring and trace data without deleting the project.

  1. In Cloud Shell, delete the Spanner instance:

    gcloud spanner instances delete $SPANNER_INSTANCE
    
  2. Delete the Compute Engine instance:

    gcloud compute instances delete $CLIENT_INSTANCE --zone=$ZONE
    

What's next