This page covers the basics of creating Prometheus metrics for availability and latency SLIs. It also provides implementation examples of how to define SLOs using Prometheus metrics.
The basics of Prometheus
Prometheus is a leading open-source monitoring solution for metrics and alerting.
Prometheus supports dimensional data with key-value identifiers for metrics, provides the PromQL query language, and supports many integrations by providing exporters for other products.
Prometheus integrates with Cloud Monitoring by using the Stackdriver collector.
Metrics
Prometheus supports the following types of metrics:
- Counter: a single value that can only be monotonically increased or reset to 0 on restart.
- Gauge: a single numeric value that can be arbitrarily set.
- Histogram: a group of configurable buckets for sampling observations and recording values in ranges; also provides a sum of all observed values.
- Summary: like a histogram, but it also calculates configurable quantiles over a sliding time window.
For more information, see Metric types.
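For illustration, here is a minimal Node.js sketch of the four metric types, assuming the prom-client library (all metric names and options are placeholders):

```javascript
const prometheus = require('prom-client');

// Counter: only increases (or resets to 0 on restart).
const requestCount = new prometheus.Counter({
  name: 'request_count',
  help: 'Total number of requests received',
});

// Gauge: can be set, incremented, or decremented arbitrarily.
const activeSessions = new prometheus.Gauge({
  name: 'active_sessions',
  help: 'Number of currently active sessions',
});

// Histogram: counts observations into configurable buckets and tracks their sum.
const requestDuration = new prometheus.Histogram({
  name: 'http_request_duration_ms',
  help: 'Request duration in milliseconds',
  buckets: [100, 250, 500, 1000],
});

// Summary: like a histogram, but computes configurable quantiles over a sliding window.
const payloadSize = new prometheus.Summary({
  name: 'payload_size_bytes',
  help: 'Request payload size in bytes',
  percentiles: [0.5, 0.9, 0.99],
});
```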
Instrumentation
For Prometheus to receive metrics from your application, the application must expose a dedicated endpoint (often /metrics) that serves the metric values. To expose such an endpoint, you use the Prometheus client libraries.
If your application already writes metrics to another destination, like a database or a file, you can create a Prometheus exporter to read the data and expose it. For more information, see exporters.
Finally, you need a Prometheus server to store the metrics and a way to ingest those metrics into Cloud Monitoring from the server. The Stackdriver collector serves this purpose.
Creating metrics for SLIs
Your application must create Prometheus metrics that can be used as SLIs in Cloud Monitoring:
- For availability SLIs on request and error counts, you can start with Prometheus counter metrics.
- For latency SLIs, you can use Prometheus histogram or summary metrics.
Metrics for availability SLIs
You express a request-based availability SLI in the Cloud Monitoring API by using the TimeSeriesRatio structure to set up a ratio of "good" or "bad" requests to total requests. This ratio is used in the goodTotalRatio field of a RequestBasedSli structure.
Your application must create Prometheus metrics that can be used to construct this ratio; you must create at least two of the following:
- A metric that counts total events; use this metric in the ratio's totalServiceFilter. You can create a Prometheus counter that's incremented for every event.
- A metric that counts "bad" events; use this metric in the ratio's badServiceFilter. You can create a Prometheus counter that's incremented for every error or other "bad" event.
- A metric that counts "good" events; use this metric in the ratio's goodServiceFilter. You can create a Prometheus counter that's incremented for every successful or other "good" event.
The example in Implementation example creates a counter for the total number of requests, nodeRequestsCounter, and a counter for the number of failed requests, nodeFailedRequestsCounter.
Metrics for latency SLIs
You express a request-based latency SLI in the Cloud Monitoring API by creating a DistributionCut structure. This structure is used in the distributionCut field of a RequestBasedSli structure.
Your application must create a Prometheus metric that can be used to construct the distribution-cut value. You can use a Prometheus histogram or summary for this purpose. To determine how to define your buckets to accurately measure whether your responses fall within your SLO, see Metric types in the Prometheus documentation.
The example in Implementation example creates a histogram for response latencies by path, nodeLatenciesHistogram.
Implementation example
This section presents an example that implements metrics for basic availability and latency SLIs using Prometheus in Node.js.
Instrumentation
To instrument your service to expose Prometheus metrics, do the following:
- Include or import the Prometheus client:
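The original page provides Go, Node.js, and Python samples; only a minimal Node.js sketch is shown here, assuming the prom-client library:

```javascript
// Assumption: prom-client is used as the Prometheus client library for Node.js.
const prometheus = require('prom-client');
```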
- Use the client to define the metrics:
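Continuing the Node.js sketch, this defines the counters and histogram used in this example. The metric names match the ingestion mapping described below; the bucket boundaries are placeholder values chosen so that the 500 ms SLO threshold falls on a bucket boundary:

```javascript
// Counter for all requests; ingested as external.googleapis.com/prometheus/total_request_count.
const nodeRequestsCounter = new prometheus.Counter({
  name: 'total_request_count',
  help: 'Total number of requests received',
});

// Counter for failed requests; ingested as external.googleapis.com/prometheus/error_count.
const nodeFailedRequestsCounter = new prometheus.Counter({
  name: 'error_count',
  help: 'Number of failed requests',
});

// Histogram of response latency in milliseconds, labeled by request path;
// ingested as external.googleapis.com/prometheus/response_latency.
// Bucket boundaries are an assumption; include your SLO threshold (500 ms) as a boundary.
const nodeLatenciesHistogram = new prometheus.Histogram({
  name: 'response_latency',
  help: 'Response latency in milliseconds',
  labelNames: ['path'],
  buckets: [100, 200, 300, 400, 500, 1000, 2000],
});
```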
- Define the endpoint on which to expose your Prometheus metrics (using Express):
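A sketch of the /metrics endpoint using Express; note that in recent versions of prom-client, register.metrics() returns a promise:

```javascript
const express = require('express');
const app = express();

// Serve all registered metrics in the Prometheus text exposition format.
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', prometheus.register.contentType);
  res.end(await prometheus.register.metrics());
});

app.listen(8080);
```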
- Increment the counter metrics appropriately:
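A sketch of a request handler that increments the counters; the route and error handling are placeholders:

```javascript
app.get('/', (req, res) => {
  // Every request counts toward the SLI's total.
  nodeRequestsCounter.inc();
  try {
    // ... handle the request ...
    res.send('OK');
  } catch (err) {
    // Failures count toward the SLI's "bad" events.
    nodeFailedRequestsCounter.inc();
    res.status(500).send('Internal error');
  }
});
```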
- Track the latency metric appropriately:
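A sketch that records the handler's elapsed time, in milliseconds, in the histogram; in practice you would combine this with the counter logic from the previous step in a single handler:

```javascript
app.get('/', (req, res) => {
  const start = Date.now();
  // ... handle the request ...
  res.send('OK');
  // Record elapsed time in milliseconds, labeled by request path.
  nodeLatenciesHistogram.labels(req.path).observe(Date.now() - start);
});
```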
Configuring ingestion
After your service is running and emitting metrics on an endpoint, configure the appropriate settings for Prometheus scraping and the Stackdriver collector to ingest metrics into Cloud Monitoring.
This configuration determines how your Prometheus metrics appear in Monitoring. In this example, the Prometheus metrics are mapped as follows:
- nodeRequestsCounter becomes external.googleapis.com/prometheus/total_request_count.
- nodeFailedRequestsCounter becomes external.googleapis.com/prometheus/error_count.
- nodeLatenciesHistogram becomes external.googleapis.com/prometheus/response_latency.

The associated monitored-resource type is k8s_container. You use these ingested metrics to define your SLIs.
Availability SLIs
In Cloud Monitoring, you express a request-based availability SLI by using a TimeSeriesRatio structure. The following example shows an SLO that uses the ingested Prometheus metrics and expects the service to have 98% availability, as calculated from the ratio of bad to total requests, over a rolling 28-day window:

```json
{
  "serviceLevelIndicator": {
    "requestBased": {
      "goodTotalRatio": {
        "totalServiceFilter": "metric.type=\"external.googleapis.com/prometheus/total_request_count\" resource.type=\"k8s_container\"",
        "badServiceFilter": "metric.type=\"external.googleapis.com/prometheus/error_count\" resource.type=\"k8s_container\""
      }
    }
  },
  "goal": 0.98,
  "rollingPeriod": "2419200s",
  "displayName": "98% Availability, rolling 28 days"
}
```
Latency SLIs
In Cloud Monitoring, you express a request-based latency SLI by using a DistributionCut structure. The following example shows an SLO that uses the ingested Prometheus latency metric and expects that 98% of requests complete in under 500 ms over a rolling one-day window:

```json
{
  "serviceLevelIndicator": {
    "requestBased": {
      "distributionCut": {
        "distributionFilter": "metric.type=\"external.googleapis.com/prometheus/response_latency\" resource.type=\"k8s_container\"",
        "range": {
          "min": 0,
          "max": 500
        }
      }
    }
  },
  "goal": 0.98,
  "rollingPeriod": "86400s",
  "displayName": "98% requests under 500 ms"
}
```