Examine latency in a Spanner component with OpenTelemetry

This topic describes how to examine a Spanner component to find the source of latency and visualize that latency using OpenTelemetry. For a high-level overview of the components in this topic, see Latency points in a Spanner request.

OpenTelemetry is an open source observability framework and toolkit that lets you create and manage telemetry data such as traces, metrics, and logs. It is the result of a merger between OpenTracing and OpenCensus. For more information, see What is OpenTelemetry?

Spanner client libraries provide metrics and traces using the OpenTelemetry observability framework. Follow the procedure in Identify the latency point to find the component or components that are showing latency in Spanner.

Before you begin

Before you begin capturing latency metrics, familiarize yourself with manual instrumentation with OpenTelemetry. You must configure the OpenTelemetry SDK with appropriate options for exporting your telemetry data. Several OpenTelemetry exporter options are available; we recommend the OpenTelemetry Protocol (OTLP) exporter. Other options include using an OpenTelemetry Collector with the Google Cloud exporter or the Google Managed Service for Prometheus exporter.

If you are running your application on Compute Engine, you can use the Ops Agent to collect OpenTelemetry Protocol metrics and traces. For more information, see Collect OTLP metrics and traces.
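By default, the Go OTLP exporters send telemetry to localhost:4317 over gRPC. The following sketch (a minimal example, using the exporters added as dependencies in the next section) shows how you might point them at a specific collector or Ops Agent endpoint instead; the localhost:4317 address and plaintext connection here are assumptions for a local setup, not requirements.

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
)

// configureLocalOtlpExporters is a minimal sketch, assuming a collector or
// Ops Agent with an OTLP gRPC receiver on localhost:4317 and no TLS.
func configureLocalOtlpExporters(ctx context.Context) {
	traceExporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("localhost:4317"),
		otlptracegrpc.WithInsecure(), // plaintext; use TLS credentials in production
	)
	if err != nil {
		log.Fatal(err)
	}

	metricExporter, err := otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithEndpoint("localhost:4317"),
		otlpmetricgrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Wire these exporters into tracer and meter providers as shown in the
	// examples later in this topic.
	_ = traceExporter
	_ = metricExporter
}

The exporters also honor the standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable, so you can leave the options out and configure the endpoint at deployment time.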

Add dependencies

To configure the OpenTelemetry SDK and OTLP exporter, add the following dependencies to your application.

Java

<dependency>
  <groupId>com.google.cloud</groupId>
  <artifactId>google-cloud-spanner</artifactId>
</dependency>
<dependency>
  <groupId>io.opentelemetry</groupId>
  <artifactId>opentelemetry-sdk</artifactId>
</dependency>
<dependency>
  <groupId>io.opentelemetry</groupId>
  <artifactId>opentelemetry-sdk-metrics</artifactId>
</dependency>
<dependency>
  <groupId>io.opentelemetry</groupId>
  <artifactId>opentelemetry-sdk-trace</artifactId>
</dependency>
<dependency>
  <groupId>io.opentelemetry</groupId>
  <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>
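These snippets omit <version> elements. A common approach (an assumption here, not shown in this topic) is to manage versions with the com.google.cloud:libraries-bom and io.opentelemetry:opentelemetry-bom BOMs, or to pin explicit versions yourself.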

Go

go.opentelemetry.io/otel v1.23.1
go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc v1.23.1
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.23.1
go.opentelemetry.io/otel/metric v1.23.1
go.opentelemetry.io/otel/sdk v1.23.1
go.opentelemetry.io/otel/sdk/metric v1.23.1

Inject the OpenTelemetry object

Next, create an OpenTelemetry object with the OTLP exporter and inject it when you create the Spanner client: through SpannerOptions in Java, or through ClientConfig in Go.

Java

// Enable OpenTelemetry metrics and traces before injecting OpenTelemetry
SpannerOptions.enableOpenTelemetryMetrics();
SpannerOptions.enableOpenTelemetryTraces();

// Create a new meter provider
SdkMeterProvider sdkMeterProvider = SdkMeterProvider.builder()
    // Use Otlp exporter or any other exporter of your choice.
    .registerMetricReader(
        PeriodicMetricReader.builder(OtlpGrpcMetricExporter.builder().build()).build())
    .build();

// Create a new tracer provider
SdkTracerProvider sdkTracerProvider = SdkTracerProvider.builder()
    // Use Otlp exporter or any other exporter of your choice.
    .addSpanProcessor(
        SimpleSpanProcessor.builder(OtlpGrpcSpanExporter.builder().build()).build())
    .build();

// Configure OpenTelemetry object using Meter Provider and Tracer Provider
OpenTelemetry openTelemetry = OpenTelemetrySdk.builder()
    .setMeterProvider(sdkMeterProvider)
    .setTracerProvider(sdkTracerProvider)
    .build();

// Inject OpenTelemetry object via Spanner options or register as GlobalOpenTelemetry.
SpannerOptions options = SpannerOptions.newBuilder()
    .setOpenTelemetry(openTelemetry)
    .build();
Spanner spanner = options.getService();

Go

// Ensure that your Go runtime version is supported by the OpenTelemetry-Go compatibility policy before enabling OpenTelemetry instrumentation.
// Refer to compatibility here https://github.com/googleapis/google-cloud-go/blob/main/debug.md#opentelemetry

import (
	"context"
	"fmt"
	"io"
	"log"
	"strconv"
	"strings"

	"cloud.google.com/go/spanner"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
	"google.golang.org/api/iterator"
)

func enableOpenTelemetryMetricsAndTraces(w io.Writer, db string) error {
	// db = `projects/<project>/instances/<instance-id>/databases/<database-id>`
	ctx := context.Background()

	// Create a new resource to uniquely identify the application
	res, err := newResource()
	if err != nil {
		log.Fatal(err)
	}

	// Enable OpenTelemetry traces by setting environment variable GOOGLE_API_GO_EXPERIMENTAL_TELEMETRY_PLATFORM_TRACING to the case-insensitive value "opentelemetry" before loading the client library.
	// Enable OpenTelemetry metrics before injecting meter provider.
	spanner.EnableOpenTelemetryMetrics()

	// Create a new tracer provider
	tracerProvider, err := getOtlpTracerProvider(ctx, res)
	if err != nil {
		log.Fatal(err)
	}
	// Flush any buffered spans before the function returns.
	defer tracerProvider.ForceFlush(ctx)
	// Register tracer provider as global
	otel.SetTracerProvider(tracerProvider)

	// Create a new meter provider
	meterProvider := getOtlpMeterProvider(ctx, res)
	defer meterProvider.ForceFlush(ctx)

	// Inject the meter provider locally via ClientConfig when creating a Spanner client, or set it globally via otel.SetMeterProvider.
	client, err := spanner.NewClientWithConfig(ctx, db, spanner.ClientConfig{OpenTelemetryMeterProvider: meterProvider})
	if err != nil {
		return err
	}
	defer client.Close()
	return nil
}

func getOtlpMeterProvider(ctx context.Context, res *resource.Resource) *sdkmetric.MeterProvider {
	otlpExporter, err := otlpmetricgrpc.New(ctx)
	if err != nil {
		log.Fatal(err)
	}
	meterProvider := sdkmetric.NewMeterProvider(
		sdkmetric.WithResource(res),
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(otlpExporter)),
	)
	return meterProvider
}

func getOtlpTracerProvider(ctx context.Context, res *resource.Resource) (*sdktrace.TracerProvider, error) {
	traceExporter, err := otlptracegrpc.New(ctx)
	if err != nil {
		return nil, err
	}

	tracerProvider := sdktrace.NewTracerProvider(
		sdktrace.WithResource(res),
		sdktrace.WithBatcher(traceExporter),
		sdktrace.WithSampler(sdktrace.AlwaysSample()),
	)

	return tracerProvider, nil
}

func newResource() (*resource.Resource, error) {
	return resource.Merge(resource.Default(),
		resource.NewWithAttributes(semconv.SchemaURL,
			semconv.ServiceName("otlp-service"),
			semconv.ServiceVersion("0.1.0"),
		))
}
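Both providers buffer telemetry before exporting it. In addition to the ForceFlush calls shown above, you may want a shutdown helper so buffered spans and metrics are flushed and exporter resources are released when your application exits. Here's a minimal sketch (a hypothetical helper, not part of the client library), using the provider types from the preceding example:

// shutdownTelemetry flushes any buffered spans and metrics and releases
// exporter resources. Call it once, on process exit.
func shutdownTelemetry(ctx context.Context, tp *sdktrace.TracerProvider, mp *sdkmetric.MeterProvider) {
	if err := tp.Shutdown(ctx); err != nil {
		log.Printf("tracer provider shutdown: %v", err)
	}
	if err := mp.Shutdown(ctx); err != nil {
		log.Printf("meter provider shutdown: %v", err)
	}
}

For example, call defer shutdownTelemetry(ctx, tracerProvider, meterProvider) after both providers are created.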

Capture and visualize client round-trip latency

Client round-trip latency is the length of time (in milliseconds) between the first byte of the Spanner API request that the client sends to the database (through both the GFE and the Spanner API frontend), and the last byte of response that the client receives from the database.

Capture client round-trip latency

The Spanner client round-trip latency metric is not supported with OpenTelemetry. You can instrument the metric using OpenCensus and migrate the data to OpenTelemetry through the OpenCensus bridge, as sketched below.
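Here's a minimal Go sketch of that setup, assuming the go.opentelemetry.io/otel/bridge/opencensus package: the bridge's metric producer is attached to the periodic reader, so metrics recorded through OpenCensus, including the round-trip latency metric, flow through the same OTLP exporter as your OpenTelemetry metrics.

import (
	"context"

	"go.opentelemetry.io/otel/bridge/opencensus"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// newBridgedMeterProvider is a minimal sketch: it returns a meter provider
// whose periodic reader also collects metrics recorded through OpenCensus.
func newBridgedMeterProvider(ctx context.Context) (*sdkmetric.MeterProvider, error) {
	exporter, err := otlpmetricgrpc.New(ctx)
	if err != nil {
		return nil, err
	}
	reader := sdkmetric.NewPeriodicReader(exporter,
		// The bridge producer pulls OpenCensus metric data into this reader.
		sdkmetric.WithProducer(opencensus.NewMetricProducer()),
	)
	return sdkmetric.NewMeterProvider(sdkmetric.WithReader(reader)), nil
}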

Visualize client round-trip latency

After retrieving the metrics, you can visualize client round-trip latency in Cloud Monitoring.

Here's an example of a graph that illustrates the 5th percentile latency for the client round-trip latency metric. To change the percentile latency to either the 50th or the 99th percentile, use the Aggregator menu.

The program creates a view called roundtrip_latency. This string becomes part of the name of the metric when it's exported to Cloud Monitoring.

Cloud Monitoring client round-trip latency.

Capture and visualize GFE latency

Google Front End (GFE) latency is the length of time (in milliseconds) between when the Google network receives a remote procedure call from the client and when the GFE receives the first byte of the response.

Capture GFE latency

The Spanner client library captures GFE latency metrics automatically when OpenTelemetry metrics are enabled, as the following examples show.

Java

static void captureGfeMetric(DatabaseClient dbClient) {
  // GFE_latency and other Spanner metrics are automatically collected
  // when OpenTelemetry metrics are enabled.

  try (ResultSet resultSet =
      dbClient
          .singleUse() // Execute a single read or query against Cloud Spanner.
          .executeQuery(Statement.of("SELECT SingerId, AlbumId, AlbumTitle FROM Albums"))) {
    while (resultSet.next()) {
      System.out.printf(
          "%d %d %s", resultSet.getLong(0), resultSet.getLong(1), resultSet.getString(2));
    }
  }
}

Go

// GFE_Latency and other Spanner metrics are automatically collected
// when OpenTelemetry metrics are enabled.
func captureGFELatencyMetric(ctx context.Context, client spanner.Client) error {
	stmt := spanner.Statement{SQL: `SELECT SingerId, AlbumId, AlbumTitle FROM Albums`}
	iter := client.Single().Query(ctx, stmt)
	defer iter.Stop()
	for {
		row, err := iter.Next()
		if err == iterator.Done {
			return nil
		}
		if err != nil {
			return err
		}
		var singerID, albumID int64
		var albumTitle string
		if err := row.Columns(&singerID, &albumID, &albumTitle); err != nil {
			return err
		}
	}
}

Visualize GFE latency

After retrieving the metrics, you can visualize GFE latency in Cloud Monitoring.

Here's an example of a graph that illustrates the distribution aggregation for the GFE latency metric. To change the percentile latency to the 5th, 50th, 95th, or 99th percentile, use the Aggregator menu.

The program creates a view called spanner/gfe_latency. This string becomes part of the name of the metric when it's exported to Cloud Monitoring.

Cloud Monitoring GFE latency.

Capture and visualize Spanner API request latency

Spanner API request latency is the length of time (in seconds) between the first byte of request that the Spanner API frontend receives and the last byte of response that the Spanner API frontend sends.

Capture Spanner API request latency

By default, this latency is available as part of Cloud Monitoring metrics. You don't have to do anything to capture and export it.
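Because the metric is built in, you can also read it programmatically, for example to feed your own dashboards or alerts. Here's a minimal Go sketch, assuming the Cloud Monitoring API client (cloud.google.com/go/monitoring/apiv3/v2) and a hypothetical listAPIRequestLatencies helper; it lists the request_latencies time series for the last hour.

import (
	"context"
	"fmt"
	"time"

	monitoring "cloud.google.com/go/monitoring/apiv3/v2"
	"cloud.google.com/go/monitoring/apiv3/v2/monitoringpb"
	"google.golang.org/api/iterator"
	"google.golang.org/protobuf/types/known/timestamppb"
)

// listAPIRequestLatencies reads the built-in
// spanner.googleapis.com/api/request_latencies metric for the last hour.
// projectID is a placeholder for your project ID.
func listAPIRequestLatencies(ctx context.Context, projectID string) error {
	client, err := monitoring.NewMetricClient(ctx)
	if err != nil {
		return err
	}
	defer client.Close()

	now := time.Now()
	it := client.ListTimeSeries(ctx, &monitoringpb.ListTimeSeriesRequest{
		Name:   "projects/" + projectID,
		Filter: `metric.type="spanner.googleapis.com/api/request_latencies"`,
		Interval: &monitoringpb.TimeInterval{
			StartTime: timestamppb.New(now.Add(-time.Hour)),
			EndTime:   timestamppb.New(now),
		},
	})
	for {
		ts, err := it.Next()
		if err == iterator.Done {
			return nil
		}
		if err != nil {
			return err
		}
		fmt.Printf("%s: %d points\n", ts.GetMetric().GetType(), len(ts.GetPoints()))
	}
}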

Visualize Spanner API request latency

You can use the Metrics Explorer charting tool to visualize the graph for the spanner.googleapis.com/api/request_latencies metric in Cloud Monitoring.

Here's an example of a graph that illustrates the 5th percentile latency for the Spanner API request latency metric. To change the percentile latency to either the 50th or the 99th percentile, use the Aggregator menu.

Cloud Monitoring API request latency.

Capture and visualize query latency

Query latency is the length of time (in milliseconds) that it takes to run SQL queries in the Spanner database.

Capture query latency

You can capture query latency for the following languages:

Java

static void captureQueryStatsMetric(OpenTelemetry openTelemetry, DatabaseClient dbClient) {
  // Register query stats metric.
  // This should be done once, before you start recording data.
  Meter meter = openTelemetry.getMeter("cloud.google.com/java");
  DoubleHistogram queryStatsMetricLatencies =
      meter
          .histogramBuilder("spanner/query_stats_elapsed")
          .setDescription("The execution of the query")
          .setUnit("ms")
          .build();

  // Capture query stats metric data.
  try (ResultSet resultSet = dbClient.singleUse()
      .analyzeQuery(Statement.of("SELECT SingerId, AlbumId, AlbumTitle FROM Albums"),
          QueryAnalyzeMode.PROFILE)) {

    while (resultSet.next()) {
      System.out.printf(
          "%d %d %s", resultSet.getLong(0), resultSet.getLong(1), resultSet.getString(2));
    }

    String value = resultSet.getStats().getQueryStats()
        .getFieldsOrDefault("elapsed_time", Value.newBuilder().setStringValue("0 msecs").build())
        .getStringValue();
    double elapsedTime = value.contains("msecs")
        ? Double.parseDouble(value.replaceAll(" msecs", ""))
        : Double.parseDouble(value.replaceAll(" secs", "")) * 1000;
    queryStatsMetricLatencies.record(elapsedTime);
  }
}

Go

func captureQueryStatsMetric(ctx context.Context, mp metric.MeterProvider, client spanner.Client) error {
	meter := mp.Meter(spanner.OtInstrumentationScope)
	// Register query stats metric with OpenTelemetry to record the data.
	// This should be done once, before you start recording data.
	queryStats, err := meter.Float64Histogram(
		"spanner/query_stats_elapsed",
		metric.WithDescription("The execution of the query"),
		metric.WithUnit("ms"),
		metric.WithExplicitBucketBoundaries(0.0, 0.01, 0.05, 0.1, 0.3, 0.6, 0.8, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0, 13.0,
			16.0, 20.0, 25.0, 30.0, 40.0, 50.0, 65.0, 80.0, 100.0, 130.0, 160.0, 200.0, 250.0,
			300.0, 400.0, 500.0, 650.0, 800.0, 1000.0, 2000.0, 5000.0, 10000.0, 20000.0, 50000.0,
			100000.0),
	)
	if err != nil {
		return fmt.Errorf("failed to create query stats histogram: %w", err)
	}

	stmt := spanner.Statement{SQL: `SELECT SingerId, AlbumId, AlbumTitle FROM Albums`}
	iter := client.Single().QueryWithStats(ctx, stmt)
	defer iter.Stop()
	for {
		row, err := iter.Next()
		if err == iterator.Done {
			// Record query execution time with OpenTelemetry.
			elapsedTime := iter.QueryStats["elapsed_time"].(string)
			elapsedTimeMs, err := strconv.ParseFloat(strings.TrimSuffix(elapsedTime, " msecs"), 64)
			if err != nil {
				return err
			}
			queryStats.Record(ctx, elapsedTimeMs)
			return nil
		}
		if err != nil {
			return err
		}
		var singerID, albumID int64
		var albumTitle string
		if err := row.Columns(&singerID, &albumID, &albumTitle); err != nil {
			return err
		}
	}
}
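Note that Spanner can report elapsed_time in secs as well as msecs; the Java example above converts both units to milliseconds, and you can extend the Go example's parsing in the same way before recording the value.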

Visualize query latency

After retrieving the metrics, you can visualize query latency in Cloud Monitoring.

Here's an example of a graph that illustrates the distribution aggregation for the query latency metric. To change the percentile latency to the 5th, 50th, 95th, or 99th percentile, use the Aggregator menu.

The program creates a metric called spanner/query_stats_elapsed. This string becomes part of the name of the metric when it's exported to Cloud Monitoring.

Cloud Monitoring query latency.

What's next