Using Prometheus metrics

This page covers the basics of creating Prometheus metrics for availability and latency SLIs. It also provides implementation examples of how to define SLOs using Prometheus metrics.

The basics of Prometheus

Prometheus is a leading open-source monitoring solution for metrics and alerting.

Prometheus supports dimensional data with key-value identifiers for metrics, provides the PromQL query language, and supports many integrations by providing exporters for other products.
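As an illustration of PromQL, the queries below convert raw counters into rates and ratios. The metric names `node_requests` and `node_failed_requests` match the counters defined later on this page; the 5-minute window is an arbitrary choice:

```
# Per-second request rate over the last 5 minutes:
rate(node_requests[5m])

# Fraction of requests that failed over the same window:
rate(node_failed_requests[5m]) / rate(node_requests[5m])
```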

Prometheus integrates with Cloud Monitoring by using the Stackdriver collector.

Metrics

Prometheus supports the following types of metrics:

  • Counter: a single value that can only be monotonically increased or reset to 0 on restart.
  • Gauge: a single numeric value that can be arbitrarily set.
  • Histogram: a group of configurable buckets for sampling observations and recording values in ranges; it also provides a sum of all observed values.
  • Summary: like a histogram, but it also calculates configurable quantiles over a sliding time window.

For more information, see Metric types.
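The semantics of these types can be sketched with a few lines of plain JavaScript. The tiny classes below are illustrative only, not the prom-client API; real code should use a Prometheus client library. (Note that in the Prometheus exposition format, histogram buckets are reported cumulatively; here each bucket holds only its own count to keep the sketch short.)

```javascript
// Counter: only increases (or resets to 0, e.g., on process restart).
class Counter {
  constructor() { this.value = 0; }
  inc(n = 1) {
    if (n < 0) throw new Error('counters only increase');
    this.value += n;
  }
  reset() { this.value = 0; }
}

// Gauge: may be set to any value, up or down.
class Gauge {
  constructor() { this.value = 0; }
  set(v) { this.value = v; }
}

// Histogram: counts observations per bucket and keeps a running sum.
// upperBounds must be sorted ascending; an implicit +Inf bucket
// catches everything above the largest bound.
class Histogram {
  constructor(upperBounds) {
    this.upperBounds = upperBounds;
    this.counts = new Array(upperBounds.length + 1).fill(0);
    this.sum = 0;
  }
  observe(v) {
    this.sum += v;
    const i = this.upperBounds.findIndex((b) => v <= b);
    this.counts[i === -1 ? this.upperBounds.length : i] += 1;
  }
}

const h = new Histogram([100, 400]);
[50, 250, 900].forEach((v) => h.observe(v));
console.log(h.counts, h.sum); // [ 1, 1, 1 ] 1200
```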

Instrumentation

For Prometheus to collect metrics from your application, the application must expose a dedicated endpoint (often /metrics) that serves the current metric values. To expose such an endpoint, use the Prometheus client libraries.
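A scrape of such an endpoint returns plain text in the Prometheus exposition format. The hypothetical response below uses the metric names from the Node.js example later on this page; the values are made up, and the histogram buckets are cumulative, as the format requires:

```
# HELP node_requests total requests
# TYPE node_requests counter
node_requests 42
# HELP node_request_latency request latency by path
# TYPE node_request_latency histogram
node_request_latency_bucket{route="/",le="100"} 30
node_request_latency_bucket{route="/",le="400"} 40
node_request_latency_bucket{route="/",le="+Inf"} 42
node_request_latency_sum{route="/"} 5100
node_request_latency_count{route="/"} 42
```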

If your application already writes metrics to another destination, like a database or a file, you can create a Prometheus exporter to read the data and expose it. For more information, see Exporters.

Finally, you need to have both a Prometheus server where the metrics are stored and a way for metrics to be ingested into Cloud Monitoring from the server. The Stackdriver collector serves this purpose.
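A minimal scrape configuration for the Prometheus server might look like the following sketch; the job name, interval, and target are placeholders to adapt to your deployment:

```yaml
scrape_configs:
  - job_name: 'my-service'          # placeholder job name
    scrape_interval: 15s
    metrics_path: /metrics          # the endpoint your application exposes
    static_configs:
      - targets: ['localhost:8080'] # placeholder host:port
```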

Creating metrics for SLIs

Your application must create Prometheus metrics that can be used as SLIs in Cloud Monitoring:

  • For availability SLIs on request and error counts, you can start with Prometheus counter metrics.
  • For latency SLIs, you can use Prometheus histogram or summary metrics.

Metrics for availability SLIs

You express a request-based availability SLI in the Cloud Monitoring API by using the TimeSeriesRatio structure to set up a ratio of "good" or "bad" requests to total requests. This ratio is used in the goodTotalRatio field of a RequestBasedSli structure.

Your application must create Prometheus metrics that can be used to construct this ratio. In your application, you must create at least two of the following:

  1. A metric that counts total events; use this metric in the ratio's totalServiceFilter.

    You can create a Prometheus counter that's incremented for every event.

  2. A metric that counts "bad" events; use this metric in the ratio's badServiceFilter.

    You can create a Prometheus counter that's incremented for every error or other "bad" event.

  3. A metric that counts "good" events; use this metric in the ratio's goodServiceFilter.

    You can create a Prometheus counter that's incremented for every successful or other "good" event.

The example in Implementation example creates a counter for the total number of requests, nodeRequestsCounter, and a counter for the number of failed requests, nodeFailedRequestsCounter.
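Given a total counter and a "bad" counter, the good-total ratio that Cloud Monitoring evaluates is equivalent to the following arithmetic. This is a sketch of the math, not the API's implementation:

```javascript
// Availability as good/total, where good = total - bad.
function availability(totalCount, badCount) {
  if (totalCount === 0) return 1; // no traffic: treat as fully available
  return (totalCount - badCount) / totalCount;
}

// 20 failures out of 1000 requests meets a 98% availability target.
console.log(availability(1000, 20)); // 0.98
```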

Metrics for latency SLIs

You express a request-based latency SLI in the Cloud Monitoring API by creating a DistributionCut structure. This structure is used in the distributionCut field of a RequestBasedSli structure.

Your application must create a Prometheus metric that can be used to construct the distribution-cut value. You can use a Prometheus histogram or summary for this purpose. To determine how to define your buckets to accurately measure whether your responses fall within your SLO, see Metric types in the Prometheus documentation.
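Bucket boundaries matter because a histogram can only be evaluated precisely at a bucket boundary. The sketch below, with assumed bucket bounds and counts, shows why an SLO threshold (here 500 ms) should itself be a bound: observations between the nearest lower bound and the threshold would otherwise be indistinguishable from slower ones.

```javascript
// Per-bucket counts for upper bounds [100, 400, 500], plus the
// implicit +Inf bucket as the last entry. Values are illustrative.
const upperBounds = [100, 400, 500];  // ms; includes the 500 ms SLO threshold
const bucketCounts = [30, 10, 8, 2];  // last entry is the +Inf bucket

// Count observations the histogram can report as <= threshold.
// Only works exactly when threshold is one of the bucket bounds.
function countAtOrBelow(threshold) {
  let total = 0;
  for (let i = 0; i < upperBounds.length; i++) {
    if (upperBounds[i] <= threshold) total += bucketCounts[i];
  }
  return total;
}

const total = bucketCounts.reduce((a, b) => a + b, 0);
console.log(countAtOrBelow(500) / total); // 0.96: 48 of 50 requests under 500 ms
```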

The example in Implementation example creates a histogram for response latencies by path, nodeLatenciesHistogram.

Implementation example

This section presents an example that implements metrics for basic availability and latency SLIs by using Prometheus client libraries, with samples in Go, Node.js, and Python.

Instrumentation

To instrument your service to expose Prometheus metrics, do the following:

  1. Include or import the Prometheus client:

    Go

    import (
    	"fmt"
    	"log"
    	"math/rand"
    	"net/http"
    	"time"
    
    	"github.com/prometheus/client_golang/prometheus"
    	"github.com/prometheus/client_golang/prometheus/promauto"
    	"github.com/prometheus/client_golang/prometheus/promhttp"
    )
    

    Node.js

    const prometheus = require('prom-client');
    const collectDefaultMetrics = prometheus.collectDefaultMetrics;
    const Registry = prometheus.Registry;
    const register = new Registry();
    collectDefaultMetrics({register});

    Python

    import random
    import time
    
    from flask import Flask
    
    from prometheus_client import (
        Counter,
        generate_latest,
        Histogram,
        REGISTRY,
    )
    
  2. Use the client to define the metrics:

    Go

    // Sets up metrics.
    var (
    	requestCount = promauto.NewCounter(prometheus.CounterOpts{
    		Name: "go_request_count",
    		Help: "total request count",
    	})
    	failedRequestCount = promauto.NewCounter(prometheus.CounterOpts{
    		Name: "go_failed_request_count",
    		Help: "failed request count",
    	})
    	responseLatency = promauto.NewHistogram(prometheus.HistogramOpts{
    		Name: "go_response_latency",
    		Help: "response latencies",
    	})
    )
    

    Node.js

    // total requests - counter
    const nodeRequestsCounter = new prometheus.Counter({
      name: 'node_requests',
      help: 'total requests',
    });
    
    // failed requests - counter
    const nodeFailedRequestsCounter = new prometheus.Counter({
      name: 'node_failed_requests',
      help: 'failed requests',
    });
    
    // latency - histogram
    const nodeLatenciesHistogram = new prometheus.Histogram({
      name: 'node_request_latency',
      help: 'request latency by path',
      labelNames: ['route'],
      buckets: [100, 400],
    });

    Python

    PYTHON_REQUESTS_COUNTER = Counter("python_requests", "total requests")
    PYTHON_FAILED_REQUESTS_COUNTER = Counter("python_failed_requests", "failed requests")
    PYTHON_LATENCIES_HISTOGRAM = Histogram(
        "python_request_latency", "request latency by path"
    )
  3. Define the endpoint on which to expose your Prometheus metrics (the Node.js example uses Express):

    Go

    http.Handle("/metrics", promhttp.Handler())

    Node.js

    app.get('/metrics', async (req, res) => {
      try {
        res.set('Content-Type', register.contentType);
        res.end(await register.metrics());
      } catch (ex) {
        res.status(500).end(ex);
      }
    });

    Python

    @app.route("/metrics", methods=["GET"])
    def stats():
        return generate_latest(REGISTRY), 200
    
    
  4. Increment the counter metrics appropriately:

    Go

    requestCount.Inc()
    
    // Fails 10% of the time.
    if rand.Intn(100) >= 90 {
    	log.Printf("intentional failure encountered")
    	failedRequestCount.Inc()
    	http.Error(w, "intentional error!", http.StatusInternalServerError)
    	return
    }

    Node.js

    // increment total requests counter
    nodeRequestsCounter.inc();
    // return an error 10% of the time
    if (Math.floor(Math.random() * 100) >= 90) {
      // increment error counter
      nodeFailedRequestsCounter.inc();

    Python

    PYTHON_REQUESTS_COUNTER.inc()
    # fail 10% of the time
    if random.randint(0, 99) >= 90:
        PYTHON_FAILED_REQUESTS_COUNTER.inc()
  5. Track the latency metric appropriately:

    Go

    requestReceived := time.Now()
    defer func() {
    	responseLatency.Observe(time.Since(requestReceived).Seconds())
    }()

    Node.js

    // start latency timer
    const requestReceived = new Date().getTime();
    console.log('request made');
    // increment total requests counter
    nodeRequestsCounter.inc();
    // return an error 10% of the time
    if (Math.floor(Math.random() * 100) >= 90) {
      // increment error counter
      nodeFailedRequestsCounter.inc();
      // return error code
      res.status(500).send('error!');
    } else {
      // delay for a bit
      sleep.msleep(Math.floor(Math.random() * 1000));
      // record response latency
      const responseLatency = new Date().getTime() - requestReceived;
      nodeLatenciesHistogram.labels(req.route.path).observe(responseLatency);

    Python

    @PYTHON_LATENCIES_HISTOGRAM.time()

Configuring ingestion

After your service is running and emitting metrics on an endpoint, configure the appropriate settings for Prometheus scraping and the Stackdriver collector to ingest metrics into Cloud Monitoring.

This configuration determines how your Prometheus metrics appear in Monitoring. In this example, the Prometheus metrics are mapped as follows:

  • nodeRequestsCounter becomes external.googleapis.com/prometheus/total_request_count.
  • nodeFailedRequestsCounter becomes external.googleapis.com/prometheus/error_count.
  • nodeLatenciesHistogram becomes external.googleapis.com/prometheus/response_latency.

The associated monitored-resource type is k8s_container.

You use these ingested metrics to define your SLIs.

Availability SLIs

In Cloud Monitoring, you express a request-based availability SLI by using a TimeSeriesRatio structure. The following example shows an SLO that uses the ingested Prometheus metrics and expects the service to have 98% availability, computed from the ratio of bad requests to total requests, over a rolling 28-day window:

{
 "serviceLevelIndicator": {
   "requestBased": {
     "goodTotalRatio": {
       "totalServiceFilter":
         "metric.type=\"external.googleapis.com/prometheus/total_request_count\"
          resource.type=\"k8s_container\"",
       "badServiceFilter":
         "metric.type=\"external.googleapis.com/prometheus/error_count\"
          resource.type=\"k8s_container\""
     }
   }
 },
 "goal": 0.98,
 "rollingPeriod": "2419200s",
 "displayName": "98% Availability, rolling 28 days"
}
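One way to create this SLO is through the Cloud Monitoring API. The following sketch assumes the JSON above is saved as availability-slo.json and that PROJECT_ID and SERVICE_ID are placeholders for your project and the service the SLO belongs to:

```shell
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @availability-slo.json \
  "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/services/SERVICE_ID/serviceLevelObjectives"
```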

Latency SLIs

In Cloud Monitoring, you express a request-based latency SLI by using a DistributionCut structure. The following example shows an SLO that uses the ingested Prometheus latency metric and expects that 98% of requests complete in under 500 ms over a rolling one-day window:

{
  "serviceLevelIndicator": {
    "requestBased": {
      "distributionCut": {
        "distributionFilter":
          "metric.type=\"external.googleapis.com/prometheus/response_latency\"
           resource.type=\"k8s_container\"",
        "range": {
          "min": 0,
          "max": 500
        }
      }
    }
  },
  "goal": 0.98,
  "rollingPeriod": "86400s",
  "displayName": "98% requests under 500 ms"
}