This page covers the basics of creating OpenCensus metrics for availability and latency SLIs. It also provides implementation examples of how to define SLOs using OpenCensus metrics.
The basics of OpenCensus
OpenCensus is a single open-source distribution of libraries, available on the OpenCensus GitHub page, that automatically collects traces and metrics and sends them to any backend. OpenCensus can be used to instrument your services to emit custom metrics that can be ingested into Cloud Monitoring. You can then use these metrics as SLIs.
For an example using OpenCensus to create Monitoring metrics that aren't specifically intended as SLIs, see Custom metrics with OpenCensus.
Metrics
To collect metric data from your service by using OpenCensus, you must use the following OpenCensus constructs:
- `Measure`, which represents the metric type to be recorded, specified with a metric name. A `Measure` can record Int64 or Float64 values.
- `Measurement`, which records a specific data point collected and written by a `Measure` for a particular event. For example, a `Measurement` might record the latency of a specific response.
- `View`, which specifies an aggregation applied to a `Measure`. OpenCensus supports the following aggregation types:
  - Count: a count of the number of measurement points.
  - Distribution: a histogram distribution of the measurement points.
  - Sum: a sum of the measurement values.
  - LastValue: the last value recorded by the measurement.
For more information, see OpenCensus stats/metrics. Note that OpenCensus often refers to metrics as stats.
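The four aggregation types can be illustrated with a plain JavaScript sketch. This is a conceptual stand-in, not the OpenCensus API; the `aggregate` helper and the bucket boundaries are hypothetical:

```javascript
// Conceptual stand-in for OpenCensus aggregations (not the real API).
// Given the raw values recorded by a Measure, compute each View aggregation.
function aggregate(values, type, bucketBoundaries = []) {
  switch (type) {
    case 'count':      // number of measurement points
      return values.length;
    case 'sum':        // sum of the measurement values
      return values.reduce((a, b) => a + b, 0);
    case 'lastValue':  // last value recorded
      return values[values.length - 1];
    case 'distribution': {
      // Histogram: counts[i] holds values in [boundaries[i-1], boundaries[i]);
      // the final bucket is overflow.
      const counts = new Array(bucketBoundaries.length + 1).fill(0);
      for (const v of values) {
        let i = bucketBoundaries.findIndex(b => v < b);
        if (i === -1) i = bucketBoundaries.length;
        counts[i]++;
      }
      return counts;
    }
  }
}

// Latencies (ms) recorded for five requests:
const latencies = [120, 340, 90, 1500, 800];
console.log(aggregate(latencies, 'count'));     // 5
console.log(aggregate(latencies, 'sum'));       // 2850
console.log(aggregate(latencies, 'lastValue')); // 800
console.log(aggregate(latencies, 'distribution', [100, 500, 1000])); // [1, 2, 1, 1]
```

Count and distribution are the two aggregations this page uses for SLIs; sum and last-value are shown only to round out the list.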
Instrumentation
The OpenCensus libraries are available for a number of languages. For language-specific information on instrumenting your service to emit metrics, see OpenCensus language support. Additionally, Custom metrics with OpenCensus provides examples for languages commonly used with Monitoring.
In the basic case, you need to do the following:
- Instrument your service to record and export metrics.
- Define an exporter to receive the metrics.
For each metric, you need to define a `Measure` to specify the value type: Int64 or Float64. You also need to define and register the `View` to specify the aggregation type (count, distribution, sum, or last-value). To use the distribution aggregation type, you also need to specify the histogram bucket boundaries explicitly. You also specify a name for your metric in the `View`.
Exporter
Finally, you need to use an exporter to collect the metrics and write them to Cloud Monitoring or another backend. For information on the language-specific exporters available for Monitoring, see OpenCensus exporters.
You can also write your own exporter; for more information, see Writing a custom exporter.
Creating metrics for SLIs
Your application must create OpenCensus metrics that can be used as SLIs in Cloud Monitoring:
- For availability SLIs on request and error counts, use a `Measure` with count aggregation.
- For latency SLIs, use a `Measure` with distribution aggregation.
Metrics for availability SLIs
You express a request-based availability SLI in the Cloud Monitoring API by using the `TimeSeriesRatio` structure to set up a ratio of "good" or "bad" requests to total requests. This ratio is used in the `goodTotalRatio` field of a `RequestBasedSli` structure.
Your application must create OpenCensus metrics that can be used to construct this ratio. You must create at least two of the following metrics:
- A metric that counts total events; use this metric in the ratio's `totalServiceFilter`. You can create an OpenCensus metric of type Int64 with count aggregation, where you record a value of 1 for every received request.
- A metric that counts "bad" events; use this metric in the ratio's `badServiceFilter`. You can create an OpenCensus metric of type Int64 with count aggregation, where you record a value of 1 for every error or failed request.
- A metric that counts "good" events; use this metric in the ratio's `goodServiceFilter`. You can create an OpenCensus metric of type Int64 with count aggregation, where you record a value of 1 for every successful response.
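The ratio itself is simple arithmetic. As a sketch (the counter variables are hypothetical stand-ins for the count-aggregated metrics, not the OpenCensus API), recording a value of 1 per request and 1 per error yields the good-total ratio like this:

```javascript
// Hypothetical counters, incremented as described above:
// record 1 for every received request, 1 for every failed request.
let requestCount = 0; // feeds the ratio's totalServiceFilter metric
let errorCount = 0;   // feeds the ratio's badServiceFilter metric

// Simulate 200 requests, 4 of which fail.
for (let i = 0; i < 200; i++) {
  requestCount += 1;
  if (i % 50 === 0) errorCount += 1; // i = 0, 50, 100, 150 -> 4 errors
}

// When you supply a "bad" metric, the good-to-total ratio is
// (total - bad) / total.
const goodTotalRatio = (requestCount - errorCount) / requestCount;
console.log(goodTotalRatio); // 0.98
```

With a "good" metric instead, the ratio is good / total directly; either pairing, together with the total, is enough to define the SLI.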
Metrics for latency SLIs
You express a request-based latency SLI in the Cloud Monitoring API by using a `DistributionCut` structure. This structure is used in the `distributionCut` field of a `RequestBasedSli` structure.
You can create an Int64 or Float64 `Measure` with a `View` using the distribution aggregation type. You must also explicitly define your bucket boundaries. Note that it is critical to define the buckets in a way that allows you to precisely measure the percentage of requests that fall within your desired threshold. For a discussion of this topic, see Implementing SLOs in the Site Reliability Engineering Workbook.
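To see why the boundaries matter, consider a sketch (the `fractionBelow` helper and the boundary values are hypothetical): the fraction of requests under a latency threshold can be read exactly from a histogram only when the threshold coincides with a bucket boundary.

```javascript
// Fraction of requests below `threshold`, computed from histogram buckets.
// counts[i] holds values in [boundaries[i-1], boundaries[i]); the last
// bucket is overflow. Returns null when the threshold is not a boundary,
// because the histogram cannot answer exactly in that case.
function fractionBelow(counts, boundaries, threshold) {
  const idx = boundaries.indexOf(threshold);
  if (idx === -1) return null; // threshold falls inside a bucket
  const below = counts.slice(0, idx + 1).reduce((a, b) => a + b, 0);
  const total = counts.reduce((a, b) => a + b, 0);
  return below / total;
}

const boundaries = [100, 500, 1000, 2000]; // ms
const counts = [40, 30, 28, 1, 1];         // 100 requests total

console.log(fractionBelow(counts, boundaries, 1000)); // 0.98: 1000 ms is a boundary
console.log(fractionBelow(counts, boundaries, 750));  // null: cannot measure exactly
```

If your SLO threshold is 1000 ms, include 1000 as an explicit bucket boundary; a threshold of 750 ms against these buckets could only be approximated.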
Implementation example
This section presents an example that implements metrics for basic availability and latency SLIs using OpenCensus in Node.js.
Instrumentation
To instrument your service to emit metrics using OpenCensus, do the following. Code samples for each step are available in Go, Node.js, and Python. To authenticate to Monitoring, set up Application Default Credentials; for more information, see Set up authentication for a local development environment.
- Include the necessary libraries.
- Define and register the exporter.
- Define a `Measure` for each metric.
- Define and register the `View` for each `Measure` with the appropriate aggregation type and, for response latency, the bucket boundaries.
- Record values for the request-count and error-count metrics.
- Record latency values.
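The steps above can be sketched end to end. The following is a schematic stand-in, not the OpenCensus libraries themselves; the view names mirror the Node.js example, while the bucket boundaries and recording logic are hypothetical:

```javascript
// Schematic stand-in for the instrumentation steps (not the OpenCensus API).
// Each "view" pairs a measure with an aggregation type, as in the steps above.
const views = {
  request_count_sli: { aggregation: 'count', value: 0 },
  error_count_sli: { aggregation: 'count', value: 0 },
  response_latency_sli: {
    aggregation: 'distribution',
    boundaries: [100, 500, 1000, 2000, 5000], // ms, hypothetical buckets
    counts: [0, 0, 0, 0, 0, 0],               // one bucket per range + overflow
  },
};

function record(name, value) {
  const view = views[name];
  if (view.aggregation === 'count') {
    view.value += 1; // each recorded measurement increments the count
  } else {
    let i = view.boundaries.findIndex(b => value < b);
    if (i === -1) i = view.boundaries.length;
    view.counts[i]++; // place the latency in its histogram bucket
  }
}

// Per request: record the request, its latency, and any error.
record('request_count_sli', 1);
record('response_latency_sli', 320);
record('request_count_sli', 1);
record('response_latency_sli', 1400);
record('error_count_sli', 1); // the second request failed

console.log(views.request_count_sli.value);     // 2
console.log(views.error_count_sli.value);       // 1
console.log(views.response_latency_sli.counts); // [0, 1, 0, 1, 0, 0]
```

In a real service, the OpenCensus library performs the aggregation and the registered exporter periodically writes the aggregated data to Cloud Monitoring.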
Ingested metrics
When your metrics are exported to Cloud Monitoring, they appear as metric types with a prefix that indicates that they originated from OpenCensus. For example, the name of each OpenCensus `View` in the Node.js implementation is mapped as follows:
- `request_count_sli` becomes `custom.googleapis.com/opencensus/request_count_sli`.
- `error_count_sli` becomes `custom.googleapis.com/opencensus/error_count_sli`.
- `response_latency_sli` becomes `custom.googleapis.com/opencensus/response_latency_sli`.
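The mapping is mechanical: the exported metric type is the view name under the `custom.googleapis.com/opencensus/` prefix. As a sketch (the helper name is hypothetical):

```javascript
// Map an OpenCensus view name to the Cloud Monitoring metric type it
// becomes after export, using the prefix documented above.
const toMetricType = viewName => `custom.googleapis.com/opencensus/${viewName}`;

console.log(toMetricType('request_count_sli'));
// custom.googleapis.com/opencensus/request_count_sli
```

These metric types are the values you use in the SLO filter strings shown in the following sections.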
After your service is running, you can confirm that the metrics are being ingested into Monitoring by searching for them in Metrics Explorer.
Availability SLIs
In Cloud Monitoring, you express a request-based availability SLI by using a `TimeSeriesRatio` structure. The following example shows an SLO that uses the ingested OpenCensus metrics and expects that the service has a 98% availability, as calculated by a ratio of `error_count_sli` to `request_count_sli`, over a rolling 28-day window:
{
  "serviceLevelIndicator": {
    "requestBased": {
      "goodTotalRatio": {
        "totalServiceFilter":
          "metric.type=\"custom.googleapis.com/opencensus/request_count_sli\"",
        "badServiceFilter":
          "metric.type=\"custom.googleapis.com/opencensus/error_count_sli\""
      }
    }
  },
  "goal": 0.98,
  "rollingPeriod": "2419200s",
  "displayName": "98% Availability, rolling 28 days"
}
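The `rollingPeriod` values in these SLOs are window lengths expressed in seconds. A quick check (the helper name is hypothetical):

```javascript
// A rollingPeriod string is the window length in seconds, suffixed with "s".
const rollingPeriod = days => `${days * 24 * 60 * 60}s`;

console.log(rollingPeriod(28)); // "2419200s" -- the 28-day availability window
console.log(rollingPeriod(1));  // "86400s"   -- a one-day window
```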
Latency SLIs
In Cloud Monitoring, you express a request-based latency SLI by using a `DistributionCut` structure. The following example shows an SLO that uses the ingested OpenCensus latency metric and expects that 98% of requests complete in under 1000 ms over a rolling one-day window:
{
  "serviceLevelIndicator": {
    "requestBased": {
      "distributionCut": {
        "distributionFilter":
          "metric.type=\"custom.googleapis.com/opencensus/response_latency_sli\"",
        "range": {
          "min": 0,
          "max": 1000
        }
      }
    }
  },
  "goal": 0.98,
  "rollingPeriod": "86400s",
  "displayName": "98% requests under 1000 ms"
}