Building good SLOs—CRE life lessons
In a previous episode of CRE Life Lessons, we discussed how choosing good service level indicators (SLIs) and service level objectives (SLOs) is critical for defining and measuring the reliability of your service. There’s also a whole chapter in the SRE book about this topic. In this episode, we’re going to get meta and go into more detail about some best practices we use at Google to formulate good SLOs for our SLIs.
SLOs are objectives that your business aspires to meet and intends to take action to defend; just remember, your SLOs are not your SLAs (service level agreements)! You should pick SLOs that represent the most critical aspects of the user experience. If you meet an SLO, your users and your business should be happy. Conversely, if the system does not meet the SLO, that implies there are users who are being made unhappy!
Your business needs to be able to defend an endangered SLO by reducing the frequency of outages, or by reducing the impact of outages when they occur. Some ways to do this include slowing down the rate at which you release new versions, or implementing reliability improvements instead of features. All parts of your business need to acknowledge that these SLOs are valuable and should be defended through trade-offs.
Here are some important things to keep in mind when designing your SLOs:
- An SLO can be a useful tool for resolving meaningful uncertainty about what a team should be doing. The objective is a line in the sand between "we definitely need to work on this issue" and "we might not need to work on this issue." Therefore, don’t pick SLO targets higher than what you actually need, even if you happen to be meeting them now: doing so reduces your flexibility to change things in the future, including making trade-offs against reliability, such as development velocity.
- Group queries into SLOs by user experience, rather than by specific product elements or internal implementation details. For example, direct responses to user action should be grouped into a different SLO than background or ancillary responses (e.g., thumbnails). Similarly, “read” operations (e.g., view product) should be grouped into a different SLO than lower volume but more important “write” ones (e.g., check out). Each SLO will likely have different availability and latency targets.
- Be explicit about the scope of your SLOs: what they cover (which queries, which data objects) and under what conditions they are offered. Be sure to consider questions like whether or not to count invalid user requests as errors, or what happens when a single client spams you with lots of requests.
- Finally, though somewhat in tension with the above, keep your SLOs simple and specific. It’s better not to cover non-critical operations with an SLO than to dilute what you really care about. Gain experience with a small set of SLOs; launch and iterate!
Availability
Here we're trying to answer the question "Was the service available to our users?" Our approach is to count the failures and known missed requests, and report the measurement as a percentage. Record errors from the first point that is in your control (e.g., data from your load balancer, not from the browser’s HTTP requests). For requests between microservices, record data from the client side, not the server side.
That leaves us with an SLO of the form:
Availability: Node.js will respond with a non-503 within 60 seconds for mobile API calls for at least 99.9% of requests in the month.
For requests that took longer than 30 seconds (60 seconds for mobile), the service might as well have been down, so they count against our availability SLO.
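The counting described above can be sketched in a few lines. This is a minimal illustration, not anything from the post: the request log, the field names, and the helper are all hypothetical, using the 30-second web / 60-second mobile cutoffs and the non-503 criterion from the example SLO.

```python
# Hypothetical request log entries: (http_status, duration_seconds, is_mobile).
requests = [
    (200, 0.8, False),
    (200, 45.0, True),   # slow mobile request, but within the 60 s mobile cutoff
    (503, 0.1, False),   # server error: counts against availability
    (200, 31.0, False),  # non-mobile request over 30 s: treated as unavailable
]

def is_available(status, duration, is_mobile):
    """A request counts as 'available' if it got a non-503 response
    within the cutoff (30 s for web, 60 s for mobile)."""
    cutoff = 60.0 if is_mobile else 30.0
    return status != 503 and duration <= cutoff

good = sum(is_available(*r) for r in requests)
availability = 100.0 * good / len(requests)
print(f"availability: {availability:.1f}%")  # 2 of 4 good -> 50.0%
```

The SLI is simply the good-request count divided by the total, reported as a percentage; the SLO then sets a floor (e.g., 99.9%) on that percentage over a month.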
Latency
Latency is a measure of how well a service performed for our users. We count the number of queries slower than a threshold, and report them as a percentage of total queries. The best measurements are taken as close to the client as possible: measure latency at the load balancer for incoming web requests, and from the client, not the server, for requests between microservices.
For example:
Latency: Node.js will respond within 250ms for at least 50% of requests in the month, and within 3000ms for at least 99% of requests in the month.
Percentages are your friend . . .
Note that we expressed our latency SLI as a percentage: “percentage of requests with latency < 3000ms” with a target of 99%, not “99th percentile latency in ms” with a target of “< 3000ms”. This keeps SLOs consistent and easy to understand, because they all have the same unit and the same range. Also, accurately computing percentiles across large data sets is hard, while counting the number of requests below a threshold is easy. You’ll likely want to monitor multiple thresholds (e.g., percentage of requests < 50ms, < 250ms, . . .), but having SLO targets of 99% for one threshold, and possibly 50% for another, is generally sufficient.
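To see why counting below a threshold is the easy formulation, here is a small sketch; the latency samples are made up for illustration. Each threshold needs only one counter, with no sorting or percentile estimation required.

```python
# Made-up per-request latencies in milliseconds.
latencies_ms = [12, 40, 180, 220, 240, 260, 900, 1500, 2800, 3500]

def pct_under(latencies, threshold_ms):
    """SLI as 'percentage of requests with latency < threshold'.
    Counting below a threshold is a single comparison per request,
    unlike computing an exact percentile across a large data set."""
    return 100.0 * sum(1 for l in latencies if l < threshold_ms) / len(latencies)

# Monitor several thresholds; SLO targets attach to only one or two of them.
for threshold in (50, 250, 3000):
    print(f"< {threshold}ms: {pct_under(latencies_ms, threshold):.0f}%")
```

With these samples, 50% of requests come in under 250ms and 90% under 3000ms, so this toy service would meet the 250ms/50% target but miss the 3000ms/99% one.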
Avoid targeting average (mean) latency; it's almost never what you want. Averages hide outliers, and sufficiently small values are indistinguishable from zero: users will not notice the difference between 50 ms and 250 ms for a full-page response time, so those should count as comparably good. There’s a big difference between an average of 250ms because all requests are taking 250ms, and an average of 250ms because 95% of requests are taking 1ms and 5% of requests are taking 5s.
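The arithmetic behind that last sentence is easy to check with made-up samples: 95 requests at 1 ms plus 5 requests at 5,000 ms average to roughly 250 ms, even though not a single request actually took 250 ms, while a threshold-based SLI tells the two situations apart.

```python
# 95 fast requests at 1 ms plus 5 outliers at 5,000 ms...
bimodal = [1.0] * 95 + [5000.0] * 5
# ...versus every request taking exactly 250 ms.
uniform = [250.0] * 100

mean_bimodal = sum(bimodal) / len(bimodal)  # ~250.95 ms
mean_uniform = sum(uniform) / len(uniform)  # exactly 250 ms
print(mean_bimodal, mean_uniform)  # nearly identical means

# A threshold-based SLI exposes the difference the mean hides:
under_3000_bimodal = 100 * sum(1 for l in bimodal if l < 3000) / len(bimodal)
under_3000_uniform = 100 * sum(1 for l in uniform if l < 3000) / len(uniform)
print(under_3000_bimodal, under_3000_uniform)  # 95.0 vs 100.0
```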
. . . except 100%
A target of 100% is impossible over any meaningful length of time. It’s also likely not necessary. SREs use SLOs to embrace risk: the inverse of your SLO target is your error budget, and if your SLO target is 100%, you have no error budget! In addition, SLOs are a tool for establishing team priorities, dividing top-priority work from work that's prioritized on a case-by-case basis. SLOs tend to lose their credibility if every individual failure is treated as a top priority.
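The error-budget arithmetic is simple: the budget is 100% minus the SLO target, converted into a count of tolerable bad events per period. A sketch, assuming a request-based SLO (the function name and traffic figure are illustrative):

```python
def error_budget(slo_target_pct, total_requests):
    """Allowed failed requests per period: the inverse of the SLO target."""
    return total_requests * (100.0 - slo_target_pct) / 100.0

# A 99.9% availability SLO over 10 million monthly requests leaves
# room for about 10,000 failed requests in the month.
print(round(error_budget(99.9, 10_000_000)))  # 10000

# A 100% target leaves a budget of zero: no failure is ever tolerable.
print(error_budget(100.0, 10_000_000))  # 0.0
```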
Regardless of the SLO target that you eventually choose, the discussion is likely to be very interesting; be sure to capture the rationale for your chosen target for posterity.
Reporting
Report on your SLOs quarterly, and use quarterly aggregates to guide policies, particularly pager thresholds. Using shorter periods tends to shift focus to smaller, day-to-day issues and away from the larger, infrequent issues that are more damaging. Any live reporting should use the same sliding window as the quarterly report, to avoid confusion; the published quarterly report is merely a snapshot of the live report.
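A trailing-window SLI series of the kind described above might be sketched as follows; the per-day counts and the helper name are illustrative, not from the post. The quarterly report is then just the value of this series on the day the quarter closes.

```python
from collections import deque

def sliding_sli(daily_good, daily_total, window_days=90):
    """Percentage of good events over the trailing `window_days` days,
    recomputed once per day from running sums."""
    good_win, total_win = deque(), deque()
    good_sum = total_sum = 0
    series = []
    for g, t in zip(daily_good, daily_total):
        good_win.append(g); total_win.append(t)
        good_sum += g; total_sum += t
        if len(good_win) > window_days:  # drop the day that fell out of the window
            good_sum -= good_win.popleft()
            total_sum -= total_win.popleft()
        series.append(100.0 * good_sum / total_sum)
    return series

# Three days of toy data with a 3-day window: the final value
# aggregates all three days, just as a period-end snapshot would.
print(sliding_sli([99, 98, 100], [100, 100, 100], window_days=3))
```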
Example quarterly SLO summary
This is how you might present the historical performance of your service against SLO, e.g., for a semi-annual service report, where the SLO period is one quarter:
| SLO | Target | Q1 | Q2 |
| --- | --- | --- | --- |
| Latency ≤ 250ms | 50% | 74% | 70% |
| Latency ≤ 3000ms | 99% | 99.4% | 98.9% |
For SLO-dependent policies such as paging alerts or freezing of releases when you’ve spent the error budget, use a sliding window shorter than a quarter. For example, you might trigger a page if you spent ≥1% of the quarterly error budget over the last four hours, or you might freeze releases if you spent ≥ ⅓ of the quarterly budget in the last 30 days.
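Those two policies can be sketched as threshold checks on budget consumption; everything here (helper names, the budget figure, the event counts) is illustrative, assuming a request-based quarterly error budget:

```python
def budget_fraction_spent(bad_events_in_window, quarterly_budget_events):
    """Fraction of the quarterly error budget consumed within a window."""
    return bad_events_in_window / quarterly_budget_events

def should_page(bad_last_4h, quarterly_budget):
    # Page if >= 1% of the quarterly budget was spent in four hours:
    # sustained, that burn rate exhausts the whole budget in ~17 days.
    return budget_fraction_spent(bad_last_4h, quarterly_budget) >= 0.01

def should_freeze_releases(bad_last_30d, quarterly_budget):
    # Freeze if >= 1/3 of the quarterly budget went in the last 30 days,
    # i.e., the budget is being consumed faster than its sustainable rate.
    return budget_fraction_spent(bad_last_30d, quarterly_budget) >= 1 / 3

budget = 10_000  # e.g., allowed failed requests per quarter
print(should_page(150, budget))              # True: 1.5% of budget in 4 hours
print(should_freeze_releases(3000, budget))  # False: 30% is under 1/3
```

The point of the two different windows is that the pager catches fast burns while the release freeze catches slow, steady ones.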
Breakdowns of SLI performance (by region, by zone, by customer, by specific RPC, etc.) are useful for debugging and possibly for alerting, but aren’t usually necessary in the SLO definition or quarterly summary.
Finally, be mindful about with whom you share your SLOs, especially early on. They can be a very useful tool for communicating expectations about your service, but the more broadly they are exposed the harder it is to change them.
SLOs are a deep topic, but we’re often asked about handy rules of thumb people can use to start reasoning about them. The SRE book has more on the topic, but if you start with these basic guidelines, you’ll be well on your way to avoiding the most common mistakes people make when starting with SLOs. Thanks for reading, we hope this post has been helpful. And as we say here at Google, may the queries flow, your SLOs be met and the pager stay silent!