This tutorial describes how to troubleshoot latency problems of apps that use Spanner by using OpenCensus and Cloud Trace. This tutorial demonstrates how to identify common problems related to complex transactions, large payloads, and full table scans. The example app generates random load and intentionally exposes issues to demonstrate troubleshooting techniques.
This tutorial assumes that you are familiar with Go development, Cloud Spanner API, and OpenCensus tracing APIs.
The following diagram shows a schematic configuration of the test app used in this tutorial.
The app is instrumented with the OpenCensus library. The trace and metrics data generated is stored and visualized in Cloud Logging.
Objectives
- Add OpenCensus trace and metrics instrumentation to an app that uses Spanner.
- Interpret trace data to identify sources of latency.
- Add trace-log correlation to enable a deeper understanding of app behavior.
Costs
This tutorial uses the following billable components of Google Cloud:
To generate a cost estimate based on your projected usage,
use the pricing calculator.
When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Clean up.
Before you begin
-
In the Google Cloud console, go to the project selector page.
-
Select or create a Google Cloud project.
-
Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.
-
In the Google Cloud console, activate Cloud Shell.
At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
- Enable the Cloud Trace, Cloud Logging, Compute Engine, and
Cloud Spanner APIs:
gcloud services enable stackdriver.googleapis.com \ cloudtrace.googleapis.com \ spanner.googleapis.com \ logging.googleapis.com \ compute.googleapis.com
Setting up your environment
In Cloud Shell, clone the GitHub project.
git clone https://github.com/GoogleCloudPlatform/opencensus-spanner-demo
Import the environment variables contained in the
setup.env
file into your Cloud Shell environment:cd opencensus-spanner-demo source ./setup.env
Setting up Spanner
In the following steps, you set up Spanner and create some
tables for the test app with the same schema as
Getting started with Spanner in Go.
There are two tables: Singers
and Albums
. Albums
has a reference to
Singers
in the SingerId
field.
In Cloud Shell, create a Spanner instance:
gcloud spanner instances create $SPANNER_INSTANCE \ --config=regional-us-central1 \ --description="Test Instance" \ --nodes=1
Create a database:
gcloud spanner databases create $DATABASE --instance=$SPANNER_INSTANCE
The tables in the test app represent singers and albums.
In the Google Cloud console, go to Spanner.
To create the
Singers
table, click Create table.Click Edit as text, and then enter the following in the DDL statements field:
CREATE TABLE Singers ( SingerId INT64 NOT NULL, FirstName STRING(1024), LastName STRING(1024), BirthDate DATE, LastUpdated TIMESTAMP, ) PRIMARY KEY(SingerId);
Click Create.
To create the
Albums
table, click Create table.Click Edit as text, and then enter the following in the DDL statements field:
CREATE TABLE Albums ( SingerId INT64 NOT NULL, AlbumId INT64 NOT NULL, AlbumTitle STRING(MAX), MarketingBudget INT64, ) PRIMARY KEY(SingerId, AlbumId), INTERLEAVE IN PARENT Singers ON DELETE CASCADE;
Click Create.
To add an index for
LastName
, complete the following steps:- Click the
Singers
table. - Click Create index.
Click Edit as text, and enter the following information in the DDL statements field:
CREATE INDEX SingersByFLastName ON Singers(LastName)
- Click the
Click Create.
Install the app on a Compute Engine instance
In Cloud Shell, create a Compute Engine instance to run the test app from:
gcloud compute instances create $CLIENT_INSTANCE \ --zone=$ZONE \ --scopes=https://www.googleapis.com/auth/cloud-platform \ --boot-disk-size=200GB
Find the email address of the service account associated with the instance:
gcloud compute instances describe $CLIENT_INSTANCE --zone=$ZONE \ --format="value(serviceAccounts.email)"
Make a note of the service account email address for the next step.
Grant the role of
roles/spanner.databaseUser
to the instance service account:SA_ACCOUNT=service-account-id gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT \ --member serviceAccount:$SA_ACCOUNT \ --role roles/spanner.databaseUser
Replace the following:
service-account-id
: your service account email address that you retrieved in the preceding step.
Connect to the instance with SSH:
gcloud compute ssh --zone $ZONE $CLIENT_INSTANCE
Install Git on the instance:
sudo apt-get update sudo apt-get install -y git
Install Go version 1.12.
Get the source code for this tutorial:
git clone https://github.com/GoogleCloudPlatform/opencensus-spanner-demo.git cd opencensus-spanner-demo
Run the app
In Cloud Shell, set your Google Cloud project ID:
export GOOGLE_CLOUD_PROJECT=your project-id
Replace the following:
your project-id
: your Google Cloud project ID.
Initialize the environment:
source ./setup.env
Build the code:
go build
Run the test app:
nohup ./opencensus-spanner-demo --project=$GOOGLE_CLOUD_PROJECT \ --instance=$SPANNER_INSTANCE \ --database=$DATABASE \ --command=simulation \ --iterations=10000 &
This command runs 10,000 iterations of the test app in simulation mode, which executes a random combination of queries and updates to generate metrics and trace data. This command takes about ten minutes to run. You can track progress:
tail -f nohup.out
In the Google Cloud console, go to the Logs Explorer page.
Check for errors in the logs.
In the Google Cloud console, go to the Trace list page.
Review the trace data.
The test app code structure
The following sections show how to use Cloud Trace to solve problems that the test app is intentionally designed to show.
The approach of using the OpenCensus Spanner integration, as described in this tutorial, requires a relatively small amount of code to enable tracing of app code that uses Spanner. OpenCensus has a built-in integration with gRPC, which enables Spanner client API instrumentation. However, some issues might require app instrumentation beyond gRPC. For more information, see the OpenCensus Quickstarts and other resources in the what's next section of this tutorial.
Google Cloud collects and displays monitoring data for most services, including storage services. However, it's important to instrument your code to provide app performance data that reflects your users' experience. The OpenCensus Google Cloud integration guide describes how to instrument app code. In some cases, you can export the stats, and the trace data is already instrumented in the client libraries. Currently, only Java and Go Spanner clients are instrumented with OpenCensus tracing.
To send app logs to Logging, this tutorial uses Cloud Client
Libraries for Go. The applog
package is included to correlate traces with
logs. Trace-log correlation is discussed in the following section. The
OpenCensus Logging exporter, the OpenCensus gRPC plug-in, and
the stats and trace packages are imported in the following Go code snippet:
The app must call the Logging exporter. The app initializes the exporter:
The preceding code initializes the Logging exporter, the
gRPC view, and sets tracing to AlwaysSample
.
In OpenCensus, a trace is a tree of spans. Top-level spans are created for each of the transactions, as illustrated in the following Go code:
For more information, see tracing.
The preceding StartSpan()
call is an example of a custom span that is added
to the app. The Spanner client library also creates spans by
using the gRPC integration with OpenCensus. The queryAlbums()
function in
query.go
contains an example of these custom spans.
The following code snippet from queryAlbums()
calls
the Spanner ReadOnlyTransaction()
method, which creates a span.
The span closes when the ReadOnlyTransaction
object ro
goes out of scope.
Interpreting the data
The following sections explain the reasons for differences in latency measures and how those sources appear in the trace data. In these steps, you compare data for different traces. The details that you see in your own traces might be different, but the same basic principles apply.
Check transaction latency
The sequence the app follows to add a singer and album:
- The
AddAllTxn()
function opens a read-write transaction to contain the database reads and writes. - The
getSingerId()
command checks if the singer is in the database based on the singer's first and last name. If the singer is in the database, this command retrieves the singer's ID for later use. If the singer isn't in the database, then this command adds the new singer. - The
getAlbumId()
command uses the singer ID and album title to check whether the album is in the database. If the album is in the database, the app takes no further action. - If the singer ID and album aren't in the database, the app adds the new album.
The preceding opencensus-spanner-demo
command for the test app runs 10,000 generated
iterations, randomly selecting between these different functions to simulate a
real app.
To review the trace data, complete the following steps:
In the Google Cloud console, go to the Trace list page.
To explore the scatter plot, click four trace points. You can identify the different code paths in the test app with the span names. Compare the latency values for the different spans.
The root span at the top of the trace detail shows that the complete time for the
add-singer
trace, including child spans, takes 101.925 ms. The start of the read-write transaction is at the beginning of the request. The following screenshot shows multiple transactions in a single span. For more information on transactions, see Transactions.The read-write transaction is now mid-way through the sequence. The complete task for
add-singer
now takes only 42.747 ms.
Check the effect of a large payload
Payload size is closely related to latency because a larger amount of data needs to be transmitted over the network and processed by the app. The example app loads enough data into Spanner to demonstrate the effects of response payload size. For more information on other factors affecting latency, see Latency metrics.
The following query in the app code returns all rows in the table:
SELECT SingerId, AlbumId, AlbumTitle FROM Albums
To check the effect of payload size on latency, complete the following steps:
In the Google Cloud console, go to the Trace list page.
In the Request filter field, enter
query albums
, and then pressEnter
.Click one of the points in the scatter plot.
Click Show events.
You can identify the query by the trace name. The number of bytes received (109,560) is also shown in the trace detail.
Use a secondary index instead of full table scans
If you query a field other than the primary key, it can be expensive for large
tables without secondary indexes. These queries have high latency due to the
need to read the full table to find qualifying rows. Creating a
secondary index
avoids scanning the table and can reduce the latency by directly looking up
qualifying rows using indexed columns. In the example app, a secondary index is
created on the Singers
table for the field LastName
with the following
statement in the preceding instructions:
CREATE INDEX SingersByLastName ON Singers(LastName)
Use of secondary indexes like the preceding index is a best practice to speed up common queries. The secondary index is helpful for queries such as the following:
SELECT
SingerId, FirstName, LastName
FROM
Singers
WHERE
LastName = 'Zero'
In many cases, Spanner automatically uses any secondary indexes that are likely to make the query more efficient. In other cases, Spanner doesn't automatically use an index or might choose an index that causes query latency to increase.
In a few cases, though, Spanner might choose an index that
causes query latency to increase. If you've followed the
troubleshooting steps for performance regressions
and confirmed that you want to try a different index for the query, you can
specify the index as part of your query with the FORCE_INDEX
directive as
follows:
SELECT
SingerId, FirstName, LastName
FROM
Singers@{FORCE_INDEX=SingersByLastName}
WHERE
LastName = 'Zero'
To check the query execution time, complete the following steps:
In the Google Cloud console, go to Spanner.
Click the name of your instance. In this tutorial, it's
Test Instance
.Click the name of your database. In this tutorial, it's
test
.Click the Query stats tab.
Select a 1-hour time interval that includes the test run.
Select the preceding query to view the Query stats.
View the stats in the Details pane, including average latency (sec) and average rows scanned.
This query does a back-join to the base table to find the
SingerId
andFirstName
columns. You can improve the query further by following the suggestions in Spanner SQL best practices.
Correlating traces with logs
To see whether the best practices in query optimization impact your app, you
might need detailed information on code paths and the actual
Spanner queries executed. The queries aren't available in the
Spanner log, but are added to the app logs in the test app. You
can add this information to the
trace annotations;
however, logging is typically a first choice for debugging. By using trace-log
correlation, you can view log messages directly in the trace detail. If the
trace ID is added to the
Entry struct
in the
Cloud Client Libraries for Go,
you can correlate trace and log data. The test app demonstrates this
correlation.
The following example code for trace-log correlation is located in the applog
package:
View log messages in traces
In the Google Cloud console, go to the Trace list page.
Click one of the points.
In the Timeline chart, click Show logs.
To see the related log entry in the Logs Explorer, click Open in Logs Explorer .
The log detail also has a link to the related entries in the Logs Viewer. Click Open in Logs Explorer to see the log detail. Copy the following query from the log to use in a later step.
Use queries extracted from logs
The filter constructed is useful because it can link all the log entries in a single request, which is hard to do without a trace ID. The query executed and the number of results returned are in the log detail because they were added to a log statement in the app's code. You can take the query from there and execute it in Spanner to find the query execution plan.
In the Google Cloud console, go to the Trace list page.
In the Request field, enter
query-limit
.To see the log detail, follow the steps in the section View log messages in traces.
Click
Test instance
and then click thetest
database.Click Query.
In the Query database: test field, enter the query that you copied from the log.
Click Run query.
The query execution plan includes a table scan. This query is fine for small tables, but is a problem with large tables. If only a few rows are required, a full table scan is an expensive way to obtain results and can lead to high latency. The Rows scanned field is important to check. This trace shows only 10 rows scanned. For more information on query optimization, see SQL best practices, Query execution plans, and Query execution operators.
Viewing metrics
Metrics are collected and stored in aggregate rather than individual measurements. Aggregations are more useful to observe trends, whereas traces are better for debugging specific requests. Spanner provides a set of metrics, including CPU utilization metrics and latency metrics. You can combine Spanner metrics with metrics collected by OpenCensus, which provide a client view, to diagnose problems. To compare client and service latency metrics, follow these steps.
- In the Google Cloud console, go to Monitoring or use the following button:
Go to Monitoring - If Metrics Explorer is shown in the navigation pane, click
Metrics Explorer. Otherwise, select appsResources and then select Metrics Explorer.
- Ensure Metric is the selected tab.
- Click in the box labeled
Find resource type and metric, and then select from the menu or
enter the name for the resource and metric. Use the following information to complete the
fields for this text box:
- For Resource, enter
opencensus
. - For Metric, enter
roundtrip_latency
. - For Filter, click
grpc_client_method
equalsExecuteSQL
, and then click Apply. - For Aggregator, select 50th percentile.
- For Resource, enter
- Click Add Metric.
- Repeat the preceding steps to add the same Resource, Metric, and Filter. For Aggregator, select 95th percentile.
- Click Save chart.
- In the Chart title field, enter
Spanner E-E Latency
. - Click New dashboard and enter
Spanner Metrics
. - Click Save.
- To view the dashboard you created, go to Dashboards > Spanner Metrics.
Click Show all legends.
The chart filters for the
ExecuteSql
method.
Compare app and Spanner view of metrics
Now that you've created a chart from the app's perspective, you create another chart from Spanner's perspective to compare the latency.
- In the Google Cloud console, click Add Chart.
- In the Find resource type and metric text box, type
spanner
. - Click the metric
spanner.googleapis/api/request_latency
. - In the Filter field, click
method
equalsExecuteSQL
, and then click Apply. - In the Aggregator drop-down list, click 50th percentile.
- Click Add Metric
- Repeat the preceding steps to add the same Resource, Metric, and Filter. For Aggregator, select 95th percentile.
- Click Add Metric.
- In the Chart title field, enter
Spanner latency (backend)
. Click Save Chart.
Click the line in the chart at any point. For the purposes of this tutorial, 9:27 is the point clicked in the preceding chart to check the median values.
At that time, the value for the client is 8.9 ms, compared to the median value reported by Spanner of 6.1 ms. This difference is 2.8 ms, which is consumed in network transmission and gRPC client processing. If the difference was large, you can start by checking the relative distance between the app client and the Spanner service regions, and the size of the payloads. The latter can be judged from the metrics for bytes received and bytes sent to Spanner. However, if the latency reported by Spanner is high, check the rows scanned and query plan and also the metrics for Spanner node CPU utilization.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the project
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
Delete the individual resources
If you prefer not to delete the entire project, you can delete individual resources. However, you cannot delete the monitoring and trace data without deleting the project.
In Cloud Shell delete the Spanner instance:
gcloud spanner instances delete $SPANNER_INSTANCE
Delete the Compute Engine instance:
gcloud compute instances delete $CLIENT_INSTANCE --zone=$ZONE
What's next
- Read more about Spanner best practices.
- Review the Spanner API documentation.
- See the OpenCensus Go API documentation and gRPC integration guide.
- Check out these Google research publications to learn about the theoretical foundations of Spanner: Spanner: Google's Globally-Distributed Database and Spanner, TrueTime and the CAP Theorem.
- See the related tutorial Monitoring slow queries in MySQL.
- Explore reference architectures, diagrams, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.