This document in the Google Cloud Architecture Framework describes how to choose appropriate service level indicators (SLIs) for your service. This document builds on the concepts defined in Components of SLOs.
Metrics are required to determine if your service level objectives (SLOs) are being met. You define those metrics as SLIs. Each SLI is the measurement of a specific aspect of your service such as response time, availability, or success rate.
SLOs include one or more SLIs, and are ideally based on critical user journeys (CUJs). A CUJ is a specific set of interactions or paths that a user takes to accomplish a goal on a website. Consider a customer shopping on an ecommerce service. The customer logs in, searches for a product, adds the item to a cart, navigates to the checkout page, and checks out. CUJs identify the different ways to help users complete tasks as quickly as possible.
When choosing SLIs, you need to consider the metrics that are appropriate to your service, the various metric types that you can use, the quality of the metric, and the correct number of metrics needed.
Choose appropriate SLIs for your service type
There are many service types. The following table lists common service types and provides examples of SLIs for each. Some SLIs are applicable to multiple service types. If an SLI appears more than once in the table, only the first SLI instance provides a definition. Recall that SLIs are often expressed by the number of "nines" in the metric.
Service type | Typical SLIs |
---|---|
Serving systems |
|
Data processing systems |
|
Storage systems |
|
Request-driven systems |
|
Scheduled execution systems |
|
Evaluate different metric types
In addition to choosing the appropriate SLI for your service, you need to decide the metric type to use for your SLI. The SLIs listed in the previous section tend to be one of the following types:
- Counter: This type of metric can increase but not decrease. For example, the number of errors that occurred up to a given point of measurement.
- Gauge: This type of metric can increase or decrease. For example, the actual value of a measurable part of the system (such as queue length).
- Distribution (histogram): The number of events that inhabit a particular measurement segment for a given time period. For example, measuring how many requests take 0-10 ms to complete, how many take 11-30 ms, and how many take 31-100 ms. The result is a count for each bucket, such as [0-10: 50], [11-30: 220], and [31-100: 1103].
For more information about these types, see the Prometheus project documentation and Value types and metric kinds in Cloud Monitoring.
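The three metric types can be illustrated with a minimal sketch. The class names `Counter`, `Gauge`, and `Distribution`, and the method `fraction_at_or_below`, are hypothetical and written only for this example; they are not the Prometheus or Cloud Monitoring client APIs. The bucket bounds follow the 0-10 ms, 11-30 ms, and 31-100 ms example above, with an assumed extra overflow bucket for slower requests.

```python
from bisect import bisect_left


class Counter:
    """Cumulative value that can only increase (for example, total errors)."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("a counter cannot decrease")
        self.value += amount


class Gauge:
    """Point-in-time value that can go up or down (for example, queue length)."""

    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value


class Distribution:
    """Counts observations per bucket; each bucket has an inclusive upper bound."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot collects observations above the largest bound.
        self.counts = [0] * (len(self.upper_bounds) + 1)

    def observe(self, value):
        # bisect_left returns the index of the first bound >= value.
        self.counts[bisect_left(self.upper_bounds, value)] += 1

    def fraction_at_or_below(self, bound):
        """Share of observations in buckets up to and including `bound`."""
        last = self.upper_bounds.index(bound)
        total = sum(self.counts)
        return sum(self.counts[: last + 1]) / total if total else None


# Illustrative latency samples in milliseconds (made-up values).
latency_ms = Distribution([10, 30, 100])
for sample in [5, 10, 11, 30, 31, 100, 250]:
    latency_ms.observe(sample)
print(latency_ms.counts)  # [2, 2, 2, 1]
print(latency_ms.fraction_at_or_below(30))
```

A distribution is often the most convenient base for a latency SLI, because the share of requests at or below a bucket bound is already a ratio of good events to total events.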
Consider the quality of the metric
Not every metric is useful. Beyond expressing a metric as a ratio of successful events to total events, you need to determine whether that metric is a good SLI for your needs. To help you make that determination, consider the following characteristics of a good metric:
Metrics relate directly to user happiness. Users are unhappy when a service does not behave as expected, such as when the service is slow, inaccurate, or fails completely. Validate any SLO based on these metrics by comparing the SLI to other signals of user happiness. This comparison includes data such as the number of customer complaint tickets, support call volume, and social media sentiment. (To learn more, see Continuous Improvement of SLO Targets.)
If your metric doesn't align with these other indicators of user happiness, it might not be a good SLI.
The metric deterioration correlates with outages. Any metric reporting good service results during an outage is clearly the wrong metric for an SLI. Conversely, a metric that looks bad during normal operation is also problematic.
The metric provides a good signal-to-noise ratio. Dismiss any metric that results in a large number of false negatives or false positives.
The metric scales monotonically and linearly with customer happiness. Simply put, as the metric improves, customer happiness also improves.
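One way to check the outage-correlation and signal-to-noise characteristics is to replay a candidate SLI against intervals where you already know whether an outage occurred. The following is a minimal sketch under stated assumptions: the function name `evaluate_sli`, the per-interval inputs, and the 0.99 threshold are all hypothetical, and the sample data is made up for illustration.

```python
def evaluate_sli(sli_values, outage_flags, threshold):
    """Count intervals where the candidate SLI disagrees with known outages.

    sli_values:   one SLI reading per interval (for example, success ratio).
    outage_flags: True for intervals where an outage is known to have occurred.
    threshold:    readings below this value count as "looks bad".
    """
    if len(sli_values) != len(outage_flags):
        raise ValueError("inputs must cover the same intervals")

    false_positives = 0  # SLI looked bad during normal operation
    false_negatives = 0  # SLI looked fine during an outage
    for value, outage in zip(sli_values, outage_flags):
        looks_bad = value < threshold
        if looks_bad and not outage:
            false_positives += 1
        elif outage and not looks_bad:
            false_negatives += 1

    return {
        "intervals": len(sli_values),
        "false_positives": false_positives,
        "false_negatives": false_negatives,
    }


# Made-up readings for six intervals; two of them overlapped an outage.
readings = [0.999, 0.998, 0.90, 0.999, 0.97, 0.999]
outages = [False, False, True, True, False, False]
print(evaluate_sli(readings, outages, threshold=0.99))
```

A candidate that accumulates many disagreements of either kind is noisy and probably a poor SLI. What counts as "many" depends on your service and is not specified here.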
Select the correct number of metrics
A single service can have multiple SLIs, especially if the service performs different types of work or serves different types of users. It's best to choose the appropriate metrics for each type.
In contrast, some services perform similar types of work that are directly comparable. For example, users might view different pages on your site (such as the homepage, subcategories, and the top-10 list). Instead of developing a separate SLI for each of these actions, combine them into a single SLI category, such as browse services.
Your users' expectations don't change much between actions of a similar category. Their happiness is quantifiable by the answer to the question: "Did I see a full page of items quickly?"
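Combining similar actions into one SLI category can be sketched as pooling the good and total event counts before taking the ratio. The function name `combined_sli`, the page names, and the counts below are hypothetical and chosen only for illustration; "good" here stands for whatever you define as a full page of items served quickly.

```python
def combined_sli(events_by_page):
    """Pool (good, total) event counts across pages into a single ratio.

    events_by_page maps a page name to a (good_events, total_events) tuple.
    Returns None when there were no events at all.
    """
    good = sum(g for g, _ in events_by_page.values())
    total = sum(t for _, t in events_by_page.values())
    return good / total if total else None


# Made-up counts for three pages that belong to one "browse" category.
browse_events = {
    "homepage": (9900, 10000),
    "subcategories": (4950, 5000),
    "top-10 list": (990, 1000),
}
print(combined_sli(browse_events))
```

Pooling counts rather than averaging per-page ratios weights each page by its traffic, so a low-traffic page cannot dominate the combined SLI. That is a design choice in this sketch, not a rule from the framework.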
Use as few SLIs as possible to accurately represent your service tolerances. As a general guide, have two to six SLIs. With too few SLIs, you can miss valuable signals. Too many and your support team has too much data at hand with little added benefit. Your SLIs should simplify your understanding of production health and provide a sense of coverage, not overwhelm (or underwhelm) you.
What's next?
- Read Measure your SLOs.
- Check out other SRE resources:
- Explore recommendations in other pillars of the Architecture Framework.
- For more reference architectures, diagrams, and best practices, explore the Cloud Architecture Center.