Using OpenCensus metrics

This page covers the basics of creating OpenCensus metrics for availability and latency SLIs. It also provides implementation examples of how to define SLOs using OpenCensus metrics.

The basics of OpenCensus

OpenCensus is a single open-source distribution of libraries, available on the OpenCensus GitHub page, that automatically collects traces and metrics and sends them to any backend. You can use OpenCensus to instrument your services to emit custom metrics that can be ingested into Cloud Monitoring. You can then use these metrics as SLIs.

For an example using OpenCensus to create Monitoring metrics that aren't specifically intended as SLIs, see Custom metrics with OpenCensus.

Metrics

To collect metric data from your service by using OpenCensus, you must use the following OpenCensus constructs:

  • Measure, which represents the metric type to be recorded, specified with a metric name. A Measure can record Int64 or Float64 values.
  • Measurement, which records a specific data point collected and written by a Measure for a particular event. For example, a Measurement might record the latency of a specific response.
  • View, which specifies an aggregation applied to a Measure. OpenCensus supports the following aggregation types:
    • Count: a count of the number of measurement points.
    • Distribution: a histogram distribution of the measurement points.
    • Sum: a sum of the measurement values.
    • LastValue: the last value recorded by the measurement.

For more information, see OpenCensus stats/metrics. Note that OpenCensus often refers to metrics as stats.
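
To make the four aggregation types concrete, the following sketch applies each one to the same list of measurements in plain Python. It does not use the OpenCensus library: the function names and the sample values are hypothetical, and the bucket convention (lower bound inclusive, upper bound exclusive) is an assumption chosen for illustration.

```python
from bisect import bisect_right


def count_aggregation(values):
    """Count: the number of measurement points."""
    return len(values)


def sum_aggregation(values):
    """Sum: the sum of the measurement values."""
    return sum(values)


def last_value_aggregation(values):
    """LastValue: the most recently recorded value."""
    return values[-1]


def distribution_aggregation(values, bounds):
    """Distribution: a histogram with len(bounds) + 1 buckets.

    Bucket i holds values v with bounds[i-1] <= v < bounds[i]. The first
    bucket is the underflow bucket and the last is the overflow bucket.
    """
    buckets = [0] * (len(bounds) + 1)
    for v in values:
        buckets[bisect_right(bounds, v)] += 1
    return buckets


# Hypothetical latency measurements, in milliseconds.
latencies_ms = [120, 850, 990, 1500, 4200]
BOUNDS_MS = [0, 1000, 2000, 3000, 4000, 5000]

print(count_aggregation(latencies_ms))
print(sum_aggregation(latencies_ms))
print(last_value_aggregation(latencies_ms))
print(distribution_aggregation(latencies_ms, BOUNDS_MS))
```

In this sketch, three of the five measurements fall into the bucket between 0 and 1000 ms, which is the kind of per-bucket count that a latency SLI is later computed from.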

Instrumentation

The OpenCensus libraries are available for a number of languages. For language-specific information on instrumenting your service to emit metrics, see OpenCensus language support. Additionally, Custom metrics with OpenCensus provides examples for languages commonly used with Monitoring.

In the basic case, you need to do the following:

  • Instrument your service to record and export metrics.
  • Define an exporter to receive the metrics.

For each metric, define a Measure to specify the value type: Int64 or Float64. Then define and register a View to specify the name of your metric and the aggregation type (count, distribution, sum, or last-value). If you use the distribution aggregation type, you must also specify the histogram bucket boundaries explicitly.

Exporter

Finally, you need to use an exporter to collect the metrics and write them to Cloud Monitoring or another backend. For information on the language-specific exporters available for Monitoring, see OpenCensus exporters.

You can also write your own exporter; for more information, see Writing a custom exporter.

Creating metrics for SLIs

Your application must create OpenCensus metrics that can be used as SLIs in Cloud Monitoring:

  • For availability SLIs on request and error counts, use a Measure with count aggregation.
  • For latency SLIs, use a Measure with distribution aggregation.

Metrics for availability SLIs

You express a request-based availability SLI in the Cloud Monitoring API by using the TimeSeriesRatio structure to set up a ratio of "good" or "bad" requests to total requests. This ratio is used in the goodTotalRatio field of a RequestBasedSli structure.

Your application must create OpenCensus metrics that can be used to construct this ratio. In your application, you must create at least two of the following:

  1. A metric that counts total events; use this metric in the ratio's totalServiceFilter.

    You can create an OpenCensus metric of type Int64 with count aggregation, where you record a value of 1 for every received request.

  2. A metric that counts "bad" events; use this metric in the ratio's badServiceFilter.

    You can create an OpenCensus metric of type Int64 with count aggregation, where you record a value of 1 for every error or failed request.

  3. A metric that counts "good" events; use this metric in the ratio's goodServiceFilter.

    You can create an OpenCensus metric of type Int64 with count aggregation, where you record a value of 1 for every successful response.
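
The reason two of the three counters are enough is that the third can always be derived from the other two. The following sketch shows that arithmetic in plain Python. It is only an illustration of the ratio, not of how Cloud Monitoring evaluates it, and the function name and sample counts are hypothetical.

```python
def availability(total=None, good=None, bad=None):
    """Return the good/total ratio given any two of the three counts."""
    provided = sum(x is not None for x in (total, good, bad))
    if provided < 2:
        raise ValueError("at least two of total, good, and bad are required")
    if total is None:
        total = good + bad      # total derived from good and bad
    if good is None:
        good = total - bad      # good derived from total and bad
    if total == 0:
        raise ValueError("no requests recorded")
    return good / total


# Each pair of counters describes the same 98% availability.
print(availability(total=1000, bad=20))
print(availability(total=1000, good=980))
print(availability(good=980, bad=20))
```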

Metrics for latency SLIs

You express a request-based latency SLI in the Cloud Monitoring API by using a DistributionCut structure. This structure is used in the distributionCut field of a RequestBasedSli structure.

You can create an Int64 or Float64 Measure with a View using the distribution aggregation type. You must also explicitly define your bucket boundaries. Note that it is critical to define the buckets in a way that allows you to precisely measure the percentage of requests that are within your desired threshold. For a discussion of this topic, see Implementing SLOs in the Site Reliability Engineering Workbook.
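
To see why the bucket boundaries matter, consider the following sketch in plain Python. It builds histograms for two hypothetical sets of latencies that differ only inside one bucket. The histograms are identical, so a threshold that falls inside that bucket (1500 ms here) cannot be measured from the histogram, while a threshold on a boundary (1000 ms) can. The helper names, boundaries, and sample values are all assumptions made for illustration.

```python
from bisect import bisect_right


def histogram(values, bounds):
    """Count values per bucket; bucket i holds bounds[i-1] <= v < bounds[i]."""
    counts = [0] * (len(bounds) + 1)
    for v in values:
        counts[bisect_right(bounds, v)] += 1
    return counts


def true_fraction_below(values, threshold):
    """Fraction of raw values below the threshold (needs the raw data)."""
    return sum(v < threshold for v in values) / len(values)


BOUNDS_MS = [0, 1000, 2000, 5000]
fast = [120, 850, 1100, 1200, 4200]  # 4 of 5 requests under 1500 ms
slow = [120, 850, 1800, 1900, 4200]  # 2 of 5 requests under 1500 ms

# Both data sets produce the same bucket counts.
print(histogram(fast, BOUNDS_MS))
print(histogram(slow, BOUNDS_MS))

# The raw data disagree at 1500 ms, but agree at the 1000 ms boundary.
print(true_fraction_below(fast, 1500), true_fraction_below(slow, 1500))
print(true_fraction_below(fast, 1000), true_fraction_below(slow, 1000))
```

Under these assumptions, placing a bucket boundary exactly at the latency threshold you intend to use in the SLO is what keeps the measured percentage exact.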

Implementation example

This section presents an example that implements metrics for basic availability and latency SLIs using OpenCensus in Go, Node.js, and Python.

Instrumentation

To instrument your service to emit metrics using OpenCensus, do the following:

  1. Include the necessary libraries:

    Go

    import (
    	"flag"
    	"fmt"
    	"log"
    	"math/rand"
    	"net/http"
    	"time"
    
    	"contrib.go.opencensus.io/exporter/stackdriver"
    	"go.opencensus.io/stats"
    	"go.opencensus.io/stats/view"
    	"go.opencensus.io/tag"
    )
    

    Node.js

    // opencensus setup
    const {globalStats, MeasureUnit, AggregationType} = require('@opencensus/core');
    const {StackdriverStatsExporter} = require('@opencensus/exporter-stackdriver');

    Python

    import random
    import time
    
    from flask import Flask
    from opencensus.ext.prometheus import stats_exporter as prometheus
    from opencensus.stats import aggregation as aggregation_module
    from opencensus.stats import measure as measure_module
    from opencensus.stats import stats as stats_module
    from opencensus.stats import view as view_module
    from opencensus.tags import tag_map as tag_map_module
    
    from prometheus_flask_exporter import PrometheusMetrics
    
  2. Define and register the exporter:

    Go

    // Sets up Cloud Monitoring exporter.
    sd, err := stackdriver.NewExporter(stackdriver.Options{
    	ProjectID:         *projectID,
    	MetricPrefix:      "opencensus-demo",
    	ReportingInterval: 60 * time.Second,
    })
    if err != nil {
    	log.Fatalf("Failed to create the Cloud Monitoring exporter: %v", err)
    }
    defer sd.Flush()
    
    sd.StartMetricsExporter()
    defer sd.StopMetricsExporter()

    Node.js

    // Stackdriver export interval is 60 seconds
    const EXPORT_INTERVAL = 60;
    const exporter = new StackdriverStatsExporter({
      projectId: projectId,
      period: EXPORT_INTERVAL * 1000,
    });
    globalStats.registerExporter(exporter);

    Python

    def setup_openCensus_and_prometheus_exporter() -> None:
        stats = stats_module.stats
        view_manager = stats.view_manager
        exporter = prometheus.new_stats_exporter(prometheus.Options(namespace="oc_python"))
        view_manager.register_exporter(exporter)
        register_all_views(view_manager)
  3. Define a Measure for each metric:

    Go

    // Sets up metrics.
    var (
    	requestCount       = stats.Int64("oc_request_count", "total request count", "requests")
    	failedRequestCount = stats.Int64("oc_failed_request_count", "count of failed requests", "requests")
    	responseLatency    = stats.Float64("oc_latency_distribution", "distribution of response latencies", "s")
    )
    

    Node.js

    const REQUEST_COUNT = globalStats.createMeasureInt64(
      'request_count',
      MeasureUnit.UNIT,
      'Number of requests to the server'
    );
    const ERROR_COUNT = globalStats.createMeasureInt64(
      'error_count',
      MeasureUnit.UNIT,
      'Number of failed requests to the server'
    );
    const RESPONSE_LATENCY = globalStats.createMeasureInt64(
      'response_latency',
      MeasureUnit.MS,
      'The server response latency in milliseconds'
    );

    Python

    m_request_count = measure_module.MeasureInt(
        "python_request_count", "total requests", "requests"
    )
    m_failed_request_count = measure_module.MeasureInt(
        "python_failed_request_count", "failed requests", "requests"
    )
    m_response_latency = measure_module.MeasureFloat(
        "python_response_latency", "response latency", "s"
    )
  4. Define and register the View for each Measure with the appropriate aggregation type, and for response latency, the bucket boundaries:

    Go

    // Sets up views.
    var (
    	requestCountView = &view.View{
    		Name:        "oc_request_count",
    		Measure:     requestCount,
    		Description: "total request count",
    		Aggregation: view.Count(),
    	}
    	failedRequestCountView = &view.View{
    		Name:        "oc_failed_request_count",
    		Measure:     failedRequestCount,
    		Description: "count of failed requests",
    		Aggregation: view.Count(),
    	}
    	responseLatencyView = &view.View{
    		Name:        "oc_response_latency",
    		Measure:     responseLatency,
    		Description: "The distribution of the latencies",
    		// Bucket boundaries must be explicitly specified. They are given in
    		// seconds, the unit of the responseLatency Measure.
    		Aggregation: view.Distribution(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
    	}
    )
    
    	// Register the views.
    	if err := view.Register(requestCountView, failedRequestCountView, responseLatencyView); err != nil {
    		log.Fatalf("Failed to register the views: %v", err)
    	}

    Node.js

    const request_count_metric = globalStats.createView(
      'request_count_sli',
      REQUEST_COUNT,
      AggregationType.COUNT
    );
    globalStats.registerView(request_count_metric);
    const error_count_metric = globalStats.createView(
      'error_count_sli',
      ERROR_COUNT,
      AggregationType.COUNT
    );
    globalStats.registerView(error_count_metric);
    const latency_metric = globalStats.createView(
      'response_latency_sli',
      RESPONSE_LATENCY,
      AggregationType.DISTRIBUTION,
      [],
      'Server response latency distribution',
      // Latency bucket boundaries, in milliseconds:
      [0, 1000, 2000, 3000, 4000, 5000, 10000]
    );
    globalStats.registerView(latency_metric);

    Python

    # set up views
    latency_view = view_module.View(
        "python_response_latency",
        "The distribution of the latencies",
        [],
        m_response_latency,
        # Bucket boundaries are in seconds, the unit of m_response_latency.
        aggregation_module.DistributionAggregation(
            [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
        ),
    )
    
    request_count_view = view_module.View(
        "python_request_count",
        "total requests",
        [],
        m_request_count,
        aggregation_module.CountAggregation(),
    )
    
    failed_request_count_view = view_module.View(
        "python_failed_request_count",
        "failed requests",
        [],
        m_failed_request_count,
        aggregation_module.CountAggregation(),
    )
    
    
    # register views
    def register_all_views(view_manager: stats_module.stats.view_manager) -> None:
        view_manager.register_view(latency_view)
        view_manager.register_view(request_count_view)
        view_manager.register_view(failed_request_count_view)
  5. Record values for the request-count and error-count metrics:

    Go

    // Counts the request.
    stats.Record(ctx, requestCount.M(1))
    
    // Randomly fails 10% of the time.
    if rand.Intn(100) >= 90 {
    	// Counts the error.
    	stats.Record(ctx, failedRequestCount.M(1))

    Node.js

    // record a request count for every request
    globalStats.record([
      {
        measure: REQUEST_COUNT,
        value: 1,
      },
    ]);
    
    // randomly throw an error 10% of the time
    // randomValue is an integer from 1 to 10, so it equals 1 in 1 of 10 cases
    const randomValue = Math.floor(Math.random() * 10 + 1);
    if (randomValue === 1) {
      // Record a failed request.
      globalStats.record([
        {
          measure: ERROR_COUNT,
          value: 1,
        },
      ]);

    Python

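    # stats_recorder is assumed to be defined at module level,
    # for example: stats_recorder = stats_module.stats.stats_recorder
    mmap = stats_recorder.new_measurement_map()
    # count request
    mmap.measure_int_put(m_request_count, 1)
    # fail about 10% of the time (randint is inclusive: 10 of 101 values)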
    if random.randint(0, 100) > 90:
        mmap.measure_int_put(m_failed_request_count, 1)
        tmap = tag_map_module.TagMap()
        mmap.record(tmap)
        return ("error!", 500)
  6. Record latency values:

    Go

    requestReceived := time.Now()
    // Records latency for failure OR success.
    defer func() {
    	stats.Record(ctx, responseLatency.M(time.Since(requestReceived).Seconds()))
    }()

    Node.js

    globalStats.record([
      {
        measure: RESPONSE_LATENCY,
        value: stopwatch.elapsedMilliseconds,
      },
    ]);

    Python

    start_time = time.perf_counter()
    mmap = stats_recorder.new_measurement_map()
    if random.randint(0, 100) > 90:
        response_latency = time.perf_counter() - start_time
        mmap.measure_float_put(m_response_latency, response_latency)
        tmap = tag_map_module.TagMap()
        mmap.record(tmap)

Ingested metrics

When your metrics are exported to Cloud Monitoring, they appear as metric types with a prefix that indicates that they originated from OpenCensus. For example, the name of each OpenCensus View in the Node.js implementation is mapped as follows:

  • request_count_sli becomes custom.googleapis.com/opencensus/request_count_sli.
  • error_count_sli becomes custom.googleapis.com/opencensus/error_count_sli.
  • response_latency_sli becomes custom.googleapis.com/opencensus/response_latency_sli.

After your service is running, you can confirm that the metrics are being ingested into Monitoring by searching for them in Metrics Explorer.
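
The mapping above is a fixed prefix followed by the View name. The following sketch builds the metric type and a matching filter string in plain Python, which can be handy when composing SLO definitions for several Views. The helper names are hypothetical, and the prefix is taken from the examples above; an exporter configured with a different metric prefix would produce different names.

```python
# Prefix shown in the mapping above; assumed to be the exporter default.
OPENCENSUS_PREFIX = "custom.googleapis.com/opencensus/"


def metric_type(view_name: str) -> str:
    """Return the Cloud Monitoring metric type for an OpenCensus View name."""
    return OPENCENSUS_PREFIX + view_name


def metric_filter(view_name: str) -> str:
    """Return a filter string that selects the metric for this View."""
    return f'metric.type="{metric_type(view_name)}"'


for name in ("request_count_sli", "error_count_sli", "response_latency_sli"):
    print(metric_filter(name))
```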

Availability SLIs

In Cloud Monitoring, you express a request-based availability SLI by using a TimeSeriesRatio structure. The following example shows an SLO that uses the ingested OpenCensus metrics and expects the service to have 98% availability, calculated from the ratio of error_count_sli to request_count_sli, over a rolling 28-day window:

{
  "serviceLevelIndicator": {
    "requestBased": {
      "goodTotalRatio": {
        "totalServiceFilter":
          "metric.type=\"custom.googleapis.com/opencensus/request_count_sli\"",
        "badServiceFilter":
          "metric.type=\"custom.googleapis.com/opencensus/error_count_sli\""
      }
    }
  },
  "goal": 0.98,
  "rollingPeriod": "2419200s",
  "displayName": "98% Availability, rolling 28 days"
}
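
Writing the escaped quotation marks and the rolling period by hand is error-prone, so it can help to generate the SLO body. The following sketch builds the same structure with Python's json module and derives the rolling period from a number of days. The function name and its parameters are hypothetical, and the sketch only produces the JSON text; it does not call the Cloud Monitoring API.

```python
import json

SECONDS_PER_DAY = 86400


def availability_slo(total_metric, bad_metric, goal, rolling_days):
    """Build a request-based availability SLO body as a dictionary."""
    return {
        "serviceLevelIndicator": {
            "requestBased": {
                "goodTotalRatio": {
                    "totalServiceFilter": f'metric.type="{total_metric}"',
                    "badServiceFilter": f'metric.type="{bad_metric}"',
                }
            }
        },
        "goal": goal,
        "rollingPeriod": f"{rolling_days * SECONDS_PER_DAY}s",
        "displayName": f"{goal:.0%} Availability, rolling {rolling_days} days",
    }


slo = availability_slo(
    "custom.googleapis.com/opencensus/request_count_sli",
    "custom.googleapis.com/opencensus/error_count_sli",
    goal=0.98,
    rolling_days=28,
)
# json.dumps escapes the inner quotation marks in the filter strings.
print(json.dumps(slo, indent=2))
```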

Latency SLIs

In Cloud Monitoring, you express a request-based latency SLI by using a DistributionCut structure. The following example shows an SLO that uses the ingested OpenCensus latency metric and expects that 98% of requests complete in under 1000 ms over a rolling one-day window:

{
  "serviceLevelIndicator": {
    "requestBased": {
      "distributionCut": {
        "distributionFilter":
          "metric.type=\"custom.googleapis.com/opencensus/response_latency_sli\"",
        "range": {
          "min": 0,
          "max": 1000
        }
      }
    }
  },
  "goal": 0.98,
  "rollingPeriod": "86400s",
  "displayName": "98% requests under 1000 ms"
}
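
As a rough illustration of the arithmetic behind this SLO, the following sketch computes the fraction of requests that fall within the range from per-bucket counts and compares it with the goal. It is plain Python under stated assumptions, not a description of how Cloud Monitoring evaluates a DistributionCut: the function name, the bucket convention (lower bound inclusive), and the sample counts are hypothetical. The boundaries match the Node.js View defined earlier.

```python
def distribution_cut_ratio(bounds, counts, range_min, range_max):
    """Fraction of requests with range_min <= latency < range_max.

    counts[i] holds values in [bounds[i-1], bounds[i]); counts[0] is the
    underflow bucket and counts[-1] is the overflow bucket. Both range
    limits must be bucket boundaries (list.index raises ValueError
    otherwise), so that no bucket is split.
    """
    lo = bounds.index(range_min)
    hi = bounds.index(range_max)
    good = sum(counts[lo + 1 : hi + 1])
    return good / sum(counts)


BOUNDS_MS = [0, 1000, 2000, 3000, 4000, 5000, 10000]
# Hypothetical bucket counts for 1000 requests.
counts = [0, 985, 10, 3, 1, 1, 0, 0]

ratio = distribution_cut_ratio(BOUNDS_MS, counts, range_min=0, range_max=1000)
GOAL = 0.98
print(ratio, ratio >= GOAL)
```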