Examine latency in a Spanner component with OpenCensus

This topic describes how to examine a Spanner component to find the source of latency and visualize that latency using OpenCensus. For a high-level overview of the components in this topic, see Latency points in a Spanner request.

Spanner client libraries provide statistics and traces with the use of the OpenCensus observability framework. This framework gives insights into the internals of the client, and aids in troubleshooting end-to-end (round-trip) latency. By default, the framework is disabled.

Before you begin

Use the procedure in Identify the latency point to find the components or components that are showing latency.

Capture and visualize client round-trip latency

Client round-trip latency is the length of time (in milliseconds) between the first byte of the Spanner API request that the client sends to the database (through both the Google Front End (GFE) and the Spanner API front end), and the last byte of response that the client receives from the database.

Capture client round-trip latency

You can capture client round-trip latency for the following languages:

Java

static void captureGrpcMetric(DatabaseClient dbClient) {
  // Register basic gRPC views.
  RpcViews.registerClientGrpcBasicViews();

  // Enable OpenCensus exporters to export metrics to Stackdriver Monitoring.
  // Exporters use Application Default Credentials to authenticate.
  // See https://developers.google.com/identity/protocols/application-default-credentials
  // for more details.
  try {
    StackdriverStatsExporter.createAndRegister();
  } catch (IOException | IllegalStateException e) {
    System.out.println("Error during StackdriverStatsExporter");
  }

  try (ResultSet resultSet =
      dbClient
          .singleUse() // Execute a single read or query against Cloud Spanner.
          .executeQuery(Statement.of("SELECT SingerId, AlbumId, AlbumTitle FROM Albums"))) {
    while (resultSet.next()) {
      System.out.printf(
          "%d %d %s", resultSet.getLong(0), resultSet.getLong(1), resultSet.getString(2));
    }
  }
}

Go


import (
	"context"
	"fmt"
	"io"
	"regexp"

	"cloud.google.com/go/spanner"
	"google.golang.org/api/iterator"

	"contrib.go.opencensus.io/exporter/stackdriver"
	"go.opencensus.io/plugin/ocgrpc"
	"go.opencensus.io/stats/view"
)

var validDatabasePattern = regexp.MustCompile("^projects/(?P<project>[^/]+)/instances/(?P<instance>[^/]+)/databases/(?P<database>[^/]+)$")

func queryWithGRPCMetric(w io.Writer, db string) error {
	projectID, _, _, err := parseDatabaseName(db)
	if err != nil {
		return err
	}

	ctx := context.Background()
	client, err := spanner.NewClient(ctx, db)
	if err != nil {
		return err
	}
	defer client.Close()

	// Register OpenCensus views.
	if err := view.Register(ocgrpc.DefaultClientViews...); err != nil {
		return err
	}

	// Create OpenCensus Stackdriver exporter.
	sd, err := stackdriver.NewExporter(stackdriver.Options{
		ProjectID: projectID,
	})
	if err != nil {
		return err
	}
	// It is imperative to invoke flush before your main function exits
	defer sd.Flush()

	// Start the metrics exporter
	sd.StartMetricsExporter()
	defer sd.StopMetricsExporter()

	stmt := spanner.Statement{SQL: `SELECT SingerId, AlbumId, AlbumTitle FROM Albums`}
	iter := client.Single().Query(ctx, stmt)
	defer iter.Stop()
	for {
		row, err := iter.Next()
		if err == iterator.Done {
			return nil
		}
		if err != nil {
			return err
		}
		var singerID, albumID int64
		var albumTitle string
		if err := row.Columns(&singerID, &albumID, &albumTitle); err != nil {
			return err
		}
		fmt.Fprintf(w, "%d %d %s\n", singerID, albumID, albumTitle)
	}
}

func parseDatabaseName(databaseUri string) (project, instance, database string, err error) {
	matches := validDatabasePattern.FindStringSubmatch(databaseUri)
	if len(matches) == 0 {
		return "", "", "", fmt.Errorf("failed to parse database name from %q according to pattern %q",
			databaseUri, validDatabasePattern.String())
	}
	return matches[1], matches[2], matches[3], nil
}

Visualize client round-trip latency

After retrieving the metrics, you can visualize client round-trip latency in Cloud Monitoring.

Here's an example of a graph that illustrates the 5th percentile latency for the client round-trip latency metric. To change the percentile latency to either the 50th or the 99th percentile, use the Aggregator menu.

The program creates an OpenCensus view called roundtrip_latency. This string becomes part of the name of the metric when it's exported to Cloud Monitoring.

Cloud Monitoring client round-trip latency.

Capture and visualize GFE latency

Google Front End (GFE) latency is the length of time (in milliseconds) between when the Google network receives a remote procedure call from the client and when the GFE receives the first byte of the response.

Capture GFE latency

You can capture GFE latency for the following languages:

Java

static void captureGfeMetric(DatabaseClient dbClient) {
  // Capture GFE Latency.
  SpannerRpcViews.registerGfeLatencyView();

  // Capture GFE Latency and GFE Header missing count.
  // SpannerRpcViews.registerGfeLatencyAndHeaderMissingCountViews();

  // Capture only GFE Header missing count.
  // SpannerRpcViews.registerGfeHeaderMissingCountView();

  // Enable OpenCensus exporters to export metrics to Stackdriver Monitoring.
  // Exporters use Application Default Credentials to authenticate.
  // See https://developers.google.com/identity/protocols/application-default-credentials
  // for more details.
  try {
    StackdriverStatsExporter.createAndRegister();
  } catch (IOException | IllegalStateException e) {
    System.out.println("Error during StackdriverStatsExporter");
  }

  try (ResultSet resultSet =
      dbClient
          .singleUse() // Execute a single read or query against Cloud Spanner.
          .executeQuery(Statement.of("SELECT SingerId, AlbumId, AlbumTitle FROM Albums"))) {
    while (resultSet.next()) {
      System.out.printf(
          "%d %d %s", resultSet.getLong(0), resultSet.getLong(1), resultSet.getString(2));
    }
  }
}

Go


// We are in the process of adding support in the Cloud Spanner Go Client Library
// to capture the gfe_latency metric.

import (
	"context"
	"fmt"
	"io"
	"strconv"
	"strings"

	spanner "cloud.google.com/go/spanner/apiv1"
	gax "github.com/googleapis/gax-go/v2"
	sppb "google.golang.org/genproto/googleapis/spanner/v1"
	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"

	"contrib.go.opencensus.io/exporter/stackdriver"
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

// OpenCensus Tag, Measure and View.
var (
	KeyMethod    = tag.MustNewKey("grpc_client_method")
	GFELatencyMs = stats.Int64("cloud.google.com/go/spanner/gfe_latency",
		"Latency between Google's network receives an RPC and reads back the first byte of the response", "ms")
	GFELatencyView = view.View{
		Name:        "cloud.google.com/go/spanner/gfe_latency",
		Measure:     GFELatencyMs,
		Description: "Latency between Google's network receives an RPC and reads back the first byte of the response",
		Aggregation: view.Distribution(0.0, 0.01, 0.05, 0.1, 0.3, 0.6, 0.8, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0, 13.0,
			16.0, 20.0, 25.0, 30.0, 40.0, 50.0, 65.0, 80.0, 100.0, 130.0, 160.0, 200.0, 250.0,
			300.0, 400.0, 500.0, 650.0, 800.0, 1000.0, 2000.0, 5000.0, 10000.0, 20000.0, 50000.0,
			100000.0),
		TagKeys: []tag.Key{KeyMethod}}
)

func queryWithGFELatency(w io.Writer, db string) error {
	projectID, _, _, err := parseDatabaseName(db)
	if err != nil {
		return err
	}

	ctx := context.Background()
	client, err := spanner.NewClient(ctx)
	if err != nil {
		return err
	}
	defer client.Close()

	// Register OpenCensus views.
	err = view.Register(&GFELatencyView)
	if err != nil {
		return err
	}

	// Create OpenCensus Stackdriver exporter.
	sd, err := stackdriver.NewExporter(stackdriver.Options{
		ProjectID: projectID,
	})
	if err != nil {
		return err
	}
	// It is imperative to invoke flush before your main function exits
	defer sd.Flush()

	// Start the metrics exporter
	sd.StartMetricsExporter()
	defer sd.StopMetricsExporter()

	// Create a session.
	req := &sppb.CreateSessionRequest{Database: db}
	session, err := client.CreateSession(ctx, req)
	if err != nil {
		return err
	}

	// Execute a SQL query and retrieve the GFE server-timing header in gRPC metadata.
	req2 := &sppb.ExecuteSqlRequest{
		Session: session.Name,
		Sql:     `SELECT SingerId, AlbumId, AlbumTitle FROM Albums`,
	}
	var md metadata.MD
	resultSet, err := client.ExecuteSql(ctx, req2, gax.WithGRPCOptions(grpc.Header(&md)))
	if err != nil {
		return err
	}
	for _, row := range resultSet.GetRows() {
		for _, value := range row.GetValues() {
			fmt.Fprintf(w, "%s ", value.GetStringValue())
		}
		fmt.Fprintf(w, "\n")
	}

	// The format is: "server-timing: gfet4t7; dur=[GFE latency in ms]"
	srvTiming := md.Get("server-timing")[0]
	gfeLtcy, err := strconv.Atoi(strings.TrimPrefix(srvTiming, "gfet4t7; dur="))
	if err != nil {
		return err
	}
	// Record GFE t4t7 latency with OpenCensus.
	ctx, err = tag.New(ctx, tag.Insert(KeyMethod, "ExecuteSql"))
	if err != nil {
		return err
	}
	stats.Record(ctx, GFELatencyMs.M(int64(gfeLtcy)))

	return nil
}

Visualize GFE latency

After retrieving the metrics, you can visualize GFE latency in Cloud Monitoring.

Here's an example of a graph that illustrates the 5th percentile latency for the GFE latency metric. To change the percentile latency to either the 50th or the 99th percentile, use the Aggregator menu.

The program creates an OpenCensus view called gfe_latency. This string becomes part of the name of the metric when it's exported to Cloud Monitoring.

Cloud Monitoring GFE latency.

Capture and visualize Spanner API request latency

Spanner API request latency is the length of time (in seconds) between the first byte of a request that the Spanner API frontend receives and the last byte of a response that the Spanner API frontend sends.

Capture Spanner API request latency

By default, this latency is available as part of Cloud Monitoring metrics. You don't have to do anything to capture and export it.

Visualize Spanner API request latency

You can use the Metrics Explorer charting tool to visualize the graph for the spanner.googleapis.com/api/request_latencies metric in Cloud Monitoring.

Here's an example of a graph that illustrates the 5th percentile latency for the Spanner API request latency metric. To change the percentile latency to either the 50th or the 99th percentile, use the Aggregator menu.

Cloud Monitoring API request latency.

Capture and visualize query latency

Query latency is the length of time (in milliseconds) that it takes to run SQL queries in the Spanner database.

Capture query latency

You can capture query latency for the following languages:

Java

private static final String MILLISECOND = "ms";
static final List<Double> RPC_MILLIS_BUCKET_BOUNDARIES =
    Collections.unmodifiableList(
        Arrays.asList(
            0.0, 0.01, 0.05, 0.1, 0.3, 0.6, 0.8, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0, 13.0,
            16.0, 20.0, 25.0, 30.0, 40.0, 50.0, 65.0, 80.0, 100.0, 130.0, 160.0, 200.0, 250.0,
            300.0, 400.0, 500.0, 650.0, 800.0, 1000.0, 2000.0, 5000.0, 10000.0, 20000.0, 50000.0,
            100000.0));
static final Aggregation AGGREGATION_WITH_MILLIS_HISTOGRAM =
    Distribution.create(BucketBoundaries.create(RPC_MILLIS_BUCKET_BOUNDARIES));

static MeasureDouble QUERY_STATS_ELAPSED =
    MeasureDouble.create(
        "cloud.google.com/java/spanner/query_stats_elapsed",
        "The execution of the query",
        MILLISECOND);

// Register the view. It is imperative that this step exists,
// otherwise recorded metrics will be dropped and never exported.
static View QUERY_STATS_LATENCY_VIEW = View
    .create(Name.create("cloud.google.com/java/spanner/query_stats_elapsed"),
        "The execution of the query",
        QUERY_STATS_ELAPSED,
        AGGREGATION_WITH_MILLIS_HISTOGRAM,
        Collections.emptyList());

static ViewManager manager = Stats.getViewManager();
private static final StatsRecorder STATS_RECORDER = Stats.getStatsRecorder();

static void captureQueryStatsMetric(DatabaseClient dbClient) {
  manager.registerView(QUERY_STATS_LATENCY_VIEW);

  // Enable OpenCensus exporters to export metrics to Cloud Monitoring.
  // Exporters use Application Default Credentials to authenticate.
  // See https://developers.google.com/identity/protocols/application-default-credentials
  // for more details.
  try {
    StackdriverStatsExporter.createAndRegister();
  } catch (IOException | IllegalStateException e) {
    System.out.println("Error during StackdriverStatsExporter");
  }

  try (ResultSet resultSet = dbClient.singleUse()
      .analyzeQuery(Statement.of("SELECT SingerId, AlbumId, AlbumTitle FROM Albums"),
          QueryAnalyzeMode.PROFILE)) {

    while (resultSet.next()) {
      System.out.printf(
          "%d %d %s", resultSet.getLong(0), resultSet.getLong(1), resultSet.getString(2));
    }
    Value value = resultSet.getStats().getQueryStats()
        .getFieldsOrDefault("elapsed_time", Value.newBuilder().setStringValue("0 msecs").build());
    double elapasedTime = Double.parseDouble(value.getStringValue().replaceAll(" msecs", ""));
    STATS_RECORDER.newMeasureMap()
        .put(QUERY_STATS_ELAPSED, elapasedTime)
        .record();
  }
}

Go


import (
	"context"
	"fmt"
	"io"
	"strconv"
	"strings"

	"cloud.google.com/go/spanner"
	"google.golang.org/api/iterator"

	"contrib.go.opencensus.io/exporter/stackdriver"
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

// OpenCensus Tag, Measure and View.
var (
	QueryStatsElapsed = stats.Float64("cloud.google.com/go/spanner/query_stats_elapsed",
		"The execution of the query", "ms")
	QueryStatsLatencyView = view.View{
		Name:        "cloud.google.com/go/spanner/query_stats_elapsed",
		Measure:     QueryStatsElapsed,
		Description: "The execution of the query",
		Aggregation: view.Distribution(0.0, 0.01, 0.05, 0.1, 0.3, 0.6, 0.8, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0, 13.0,
			16.0, 20.0, 25.0, 30.0, 40.0, 50.0, 65.0, 80.0, 100.0, 130.0, 160.0, 200.0, 250.0,
			300.0, 400.0, 500.0, 650.0, 800.0, 1000.0, 2000.0, 5000.0, 10000.0, 20000.0, 50000.0,
			100000.0),
		TagKeys: []tag.Key{}}
)

func queryWithQueryStats(w io.Writer, db string) error {
	projectID, _, _, err := parseDatabaseName(db)
	if err != nil {
		return err
	}

	ctx := context.Background()
	client, err := spanner.NewClient(ctx, db)
	if err != nil {
		return err
	}
	defer client.Close()

	// Register OpenCensus views.
	err = view.Register(&QueryStatsLatencyView)
	if err != nil {
		return err
	}

	// Create OpenCensus Stackdriver exporter.
	sd, err := stackdriver.NewExporter(stackdriver.Options{
		ProjectID: projectID,
	})
	if err != nil {
		return err
	}
	// It is imperative to invoke flush before your main function exits
	defer sd.Flush()

	// Start the metrics exporter
	sd.StartMetricsExporter()
	defer sd.StopMetricsExporter()

	// Execute a SQL query and get the query stats.
	stmt := spanner.Statement{SQL: `SELECT SingerId, AlbumId, AlbumTitle FROM Albums`}
	iter := client.Single().QueryWithStats(ctx, stmt)
	defer iter.Stop()
	for {
		row, err := iter.Next()
		if err == iterator.Done {
			// Record query execution time with OpenCensus.
			elapasedTime := iter.QueryStats["elapsed_time"].(string)
			elapasedTimeMs, err := strconv.ParseFloat(strings.TrimSuffix(elapasedTime, " msecs"), 64)
			if err != nil {
				return err
			}
			stats.Record(ctx, QueryStatsElapsed.M(elapasedTimeMs))
			return nil
		}
		if err != nil {
			return err
		}
		var singerID, albumID int64
		var albumTitle string
		if err := row.Columns(&singerID, &albumID, &albumTitle); err != nil {
			return err
		}
		fmt.Fprintf(w, "%d %d %s\n", singerID, albumID, albumTitle)
	}
}

Visualize query latency

After retrieving the metrics, you can visualize query latency in Cloud Monitoring.

Here's an example of a graph that illustrates the 5th percentile latency for the query latency metric. To change the percentile latency to either the 50th or the 99th percentile, use the Aggregator menu.

The program creates an OpenCensus view called query_stats_elapsed. This string becomes part of the name of the metric when it's exported to Cloud Monitoring.

Cloud Monitoring query latency.

What's next