Choose your service level indicators (SLIs)

Last reviewed 2024-03-29 UTC

This document in the Google Cloud Architecture Framework describes how to choose appropriate service level indicators (SLIs) for your service. This document builds on the concepts defined in Components of SLOs.

Metrics are required to determine if your service level objectives (SLOs) are being met. You define those metrics as SLIs. Each SLI is the measurement of a specific aspect of your service such as response time, availability, or success rate.

SLOs include one or more SLIs, and are ideally based on critical user journeys (CUJs). A CUJ is a specific set of interactions, or a path, that a user takes to accomplish a goal on a website. Consider a customer shopping on an ecommerce service: the customer logs in, searches for a product, adds the item to a cart, navigates to the checkout page, and checks out. Mapping CUJs helps you identify ways to help users complete tasks as quickly as possible.

When choosing SLIs, you need to consider the metrics that are appropriate to your service, the various metric types that you can use, the quality of the metric, and the correct number of metrics needed.

Choose appropriate SLIs for your service type

There are many service types. The following table lists common service types and provides examples of SLIs for each. Some SLIs are applicable to multiple service types. If an SLI appears more than once in the table, only the first SLI instance provides a definition. Recall that SLIs are often expressed by the number of "nines" in the metric.

Service type	Typical SLIs
Serving systems	Availability: the percentage of the service that is usable. Availability is defined as the number of successful requests divided by the total number of requests, and is expressed as a percentage such as 99.9%.
	Latency: how quickly a certain percentage of requests are fulfilled. For example, the 99th percentile of requests completes within 300 ms.
	Quality: the extent to which the content in the response to a request deviates from the ideal response content. For example, a scale from 0% to 100%.
Data processing systems	Coverage: the amount of data that has been processed, expressed as a fraction. For example, 95%.
	Correctness: the fraction of output data deemed to be correct. For example, 99.99%.
	Freshness: the freshness of the source data or aggregated output data. For example, data was refreshed 20 minutes ago.
	Throughput: the amount of data processed. For example, 500 MiB per second or 1,000 requests per second.
Storage systems	Durability: the likelihood that data written to the system can be retrieved in the future. For example, 99.9999%.
	Time to first byte (TTFB): the time from sending a request to receiving the first byte of the response.
	Blob availability: the ratio of customer requests returning a non-server error response to the total number of customer requests.
	Throughput
	Latency
Request-driven systems	Availability
	Latency
	Quality
Scheduled execution systems	Skew: the proportion of executions that start within an acceptable window of the expected start time.
	Execution duration: the time a job takes to complete. For a given execution, a common failure mode is for actual duration to exceed scheduled duration.
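As a rough illustration of the first two SLIs in the table, the following Python sketch computes availability as a percentage and a latency percentile from raw samples. The function names, the nearest-rank percentile method, and the choice to raise an error when there is no traffic are assumptions made for this example; they are not part of any Google Cloud API.

```python
import math


def availability(successful: int, total: int) -> float:
    """Availability SLI: successful requests / total requests, as a percentage."""
    if total <= 0:
        # Assumption: with no traffic the SLI is undefined, so signal that explicitly.
        raise ValueError("availability is undefined without any requests")
    return 100.0 * successful / total


def latency_percentile(latencies_ms: list[float], percentile: float) -> float:
    """Latency SLI: the value below which `percentile` percent of samples fall.

    Uses the nearest-rank method (an assumption; monitoring systems often
    estimate percentiles from histogram buckets instead).
    """
    if not latencies_ms:
        raise ValueError("no latency samples")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(latencies_ms)
    rank = math.ceil(percentile / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]
```

For example, `availability(999, 1000)` evaluates to 99.9, and `latency_percentile(samples, 99)` can be compared against a 300 ms threshold to check a latency target like the one in the table.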

Evaluate different metric types

In addition to choosing the appropriate SLI for your service, you need to decide the metric type to use for your SLI. The SLIs listed in the previous section tend to be one of the following types:

Counter: This type of metric can increase but not decrease. For example, the number of errors that occurred up to a given point of measurement.
Gauge: This type of metric can increase or decrease. For example, the actual value of a measurable part of the system (such as queue length).
Distribution (histogram): The number of events that fall into each measurement bucket during a given time period. For example, measuring how many requests take 0-10 ms to complete, how many take 11-30 ms, and how many take 31-100 ms. The result is a count for each bucket, such as [0-10: 50], [11-30: 220], and [31-100: 1103].
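The three metric types can be sketched in a few lines of Python. The classes below are hypothetical and exist only to show the behavior of each type; they are not the Prometheus or Cloud Monitoring client APIs. The histogram assumes each bucket is identified by its inclusive upper bound, with one extra overflow bucket for larger values.

```python
import bisect


class CounterMetric:
    """A value that can only increase, such as a running error count."""

    def __init__(self) -> None:
        self.value = 0

    def inc(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a counter cannot decrease")
        self.value += amount


class GaugeMetric:
    """A value that can go up or down, such as queue length."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class HistogramMetric:
    """Counts observations per bucket; buckets are defined by upper bounds."""

    def __init__(self, upper_bounds: list[float]) -> None:
        self.upper_bounds = sorted(upper_bounds)
        # One count per bucket, plus a final overflow bucket.
        self.counts = [0] * (len(self.upper_bounds) + 1)

    def observe(self, value: float) -> None:
        # bisect_left finds the first bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.counts[index] += 1
```

With `HistogramMetric([10, 30, 100])`, an observation of 10 ms lands in the first bucket, 11 ms in the second, and anything above 100 ms in the overflow bucket.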

For more information about these types, see the Prometheus project documentation and Value types and metric kinds in Cloud Monitoring.

Consider the quality of the metric

Not every metric is useful. Beyond being expressible as a ratio of successful events to total events, a metric also needs to be a good SLI for your needs. To help you make that determination, consider the following characteristics of a good metric:

Metrics relate directly to user happiness. Users are unhappy when a service does not behave as expected, such as when the service is slow, inaccurate, or fails completely. Validate any SLO based on these metrics by comparing the SLI to other signals of user happiness, such as the number of customer complaint tickets, support call volume, and social media sentiment. If your metric doesn't align with these other indicators of user happiness, it might not be a good SLI. (To learn more, see Continuous Improvement of SLO Targets.)
Metric deterioration correlates with outages. Any metric that reports good service results during an outage is clearly the wrong metric for an SLI. Conversely, a metric that looks bad during normal operation is also problematic.
The metric provides a good signal-to-noise ratio. Dismiss any metric that results in a large number of false negatives or false positives.
The metric scales monotonically and linearly with customer happiness. Simply put, as the metric improves, customer happiness also improves.
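One way to check the outage-correlation and signal-to-noise characteristics is to compare a candidate SLI against a record of known outages. The sketch below is a minimal illustration, assuming you have one SLI value and one outage flag per time window (for example, per day). The function name, the `SignalQuality` type, and the idea of treating a value below a threshold as a "breach" are all assumptions for this example.

```python
from dataclasses import dataclass


@dataclass
class SignalQuality:
    true_positives: int   # SLI breached during an outage
    false_positives: int  # SLI breached during normal operation
    false_negatives: int  # SLI looked healthy during an outage
    true_negatives: int   # SLI looked healthy during normal operation


def signal_quality(
    sli_values: list[float], outage_flags: list[bool], threshold: float
) -> SignalQuality:
    """Tally how often an SLI breach coincides with a known outage."""
    if len(sli_values) != len(outage_flags):
        raise ValueError("need one outage flag per SLI value")
    tp = fp = fn = tn = 0
    for value, outage in zip(sli_values, outage_flags):
        breached = value < threshold  # assumption: lower values are worse
        if breached and outage:
            tp += 1
        elif breached:
            fp += 1
        elif outage:
            fn += 1
        else:
            tn += 1
    return SignalQuality(tp, fp, fn, tn)
```

A metric with many false negatives reports good service during outages, and one with many false positives looks bad during normal operation. Either pattern suggests the metric is a poor SLI candidate.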

Select the correct number of metrics

A single service can have multiple SLIs, especially if the service performs different types of work or serves different types of users. It's best to choose the appropriate metrics for each type.

In contrast, some services perform similar types of work that are directly comparable, such as serving different pages on your site (for example, the homepage, subcategories, and the top-10 list). Instead of developing a separate SLI for each of these actions, combine them into a single SLI category, such as browse services.

Your users' expectations don't change much between actions of a similar category. Their happiness is quantifiable by the answer to the question: "Did I see a full page of items quickly?"
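Combining comparable actions into one SLI can be as simple as summing their good and total event counts. The following sketch assumes each page type reports a pair of (good events, total events). The function name and the page names are hypothetical.

```python
def combined_sli(counts: dict[str, tuple[int, int]]) -> float:
    """Merge per-action (good, total) counts into a single SLI ratio."""
    good = sum(g for g, _ in counts.values())
    total = sum(t for _, t in counts.values())
    if total == 0:
        raise ValueError("no events recorded for this SLI category")
    return good / total


# Hypothetical counts for a "browse" category: (good events, total events).
browse_counts = {
    "homepage": (9_900, 10_000),
    "subcategories": (4_950, 5_000),
    "top_10_list": (980, 1_000),
}
browse_sli = combined_sli(browse_counts)  # 15,830 good out of 16,000 total
```

Summing counts weights each action by its traffic, so high-volume pages dominate the result. If a low-traffic page matters disproportionately to users, a separate SLI for it might be a better fit.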

Use as few SLIs as possible to accurately represent your service tolerances. As a general guide, have two to six SLIs. With too few SLIs, you can miss valuable signals. With too many, your support team has too much data at hand with little added benefit. Your SLIs should simplify your understanding of production health and provide a sense of coverage, not overwhelm (or underwhelm) you.

What's next?

Read Measure your SLOs.
Check out other SRE resources:
- The SRE Book
- The SRE Workbook
Explore recommendations in other pillars of the Architecture Framework.
For more reference architectures, diagrams, and best practices, explore the Cloud Architecture Center.