Google Cloud Platform
Defining SLOs for services with dependencies—CRE life lessons
In a previous episode of CRE Life Lessons, we discussed how service level objectives (SLOs) are an important tool for defining and measuring the reliability of your service. There’s also a whole chapter in the SRE book about this topic. In this episode, we discuss how to define and manage SLOs for services with dependencies, each of which may (or may not!) have their own SLOs.
Any non-trivial service has dependencies. Some dependencies are direct: service A makes a Remote Procedure Call to service B, so A depends on B. Others are indirect: if B in turn depends on C and D, then A also depends on C and D, in addition to B. Still others are structurally implicit: a service may run in a particular Google Cloud Platform (GCP) zone or region, or depend on DNS or some other form of service discovery.
To make things more complicated, not all dependencies have the same impact. Outages for "hard" dependencies imply that your service is out as well. Outages for "soft" dependencies should have no impact on your service if they were designed appropriately. A common example is best-effort logging/tracing to an external monitoring system. Other dependencies are somewhere in between; for example, a failure in a caching layer might result in degraded latency performance, which may or may not be out of SLO.
Take a moment to think about one of your services. Do you have a list of its dependencies, and what impact they have? Do the dependencies have SLOs that cover your specific needs?
Given all this, how can you as a service owner define SLOs and be confident about meeting them? Consider the following complexities:
- Some of your dependencies may not even have SLOs, or their SLOs may not capture how you're using them.
- The effect of a dependency's SLO on your service isn't always straightforward. In addition to the "hard" vs "soft" vs "degraded" impact discussed above, your code may complicate the effect of a dependency's SLOs on your service. For example, you have a 10s timeout on an RPC, but its SLO is based on serving a response within 30s. Or, your code does retries, and its impact on your service depends on the effectiveness of those retries (e.g., if the dependency fails 0.1% of all requests, does your retry have a 0.1% chance of failing or is there something about your request that means it is more than 0.1% likely to fail again?).
- How to combine SLOs of multiple dependencies depends on the correlation between them. At the extremes, if all of your dependencies are always unavailable at the same time, then theoretically your unavailability is based on the max(), i.e., the dependency with the longest unavailability. If they are unavailable at distinct times, then theoretically your unavailability is the sum() of the unavailability of each dependency. The reality is likely somewhere in between.
- Services usually do better than their SLOs (and usually much better than their service level agreements), so using them to estimate your downtime is often too conservative.
At this point you may want to throw up your hands and give up on determining an achievable SLO for your service entirely. Don't despair! The way out of this thorny mess is to go back to the basics of how to define a good SLO. Instead of determining your SLO bottom-up ("What can my service achieve based on all of my dependencies?"), go top down: "What SLO do my customers need to be happy?" Use that as your SLO.
Risky businessYou may find that you can consistently meet that SLO with the availability you get from your dependencies (minus your own home-grown sources of unavailability). Great! Your users are happy. If not, you have some work to do. Either way, the top-down approach of setting your SLO doesn't mean you should ignore the risks that dependencies pose to it. CRE tech lead Matt Brown gave a great talk at SRECon18 Americas about prioritizing risk (slides), including a risk analysis spreadsheet that you can use to help identify, communicate, and prioritize the top risks to your error budget (the talk expands on a previous CRE Life Lessons blog post).
Some of the main sources of risk to your SLO will of course come from your dependencies. When modeling the risk from a dependency, you can use its published SLO, or choose to use observed/historical performance instead: SLOs tend to be conservative, so using them will likely overestimate the actual risk. In some cases, if a dependency doesn't have a published SLO and you don't have historical data, you'll have to use your best guess. When modeling risk, also keep in mind the difficulties described above about mapping a dependency's SLO onto yours. If you're using the spreadsheet, you can try out different values (for example, the published SLO for a dependency versus the observed performance) and see the effect they have on your projected SLO performance.1
Remember that you're making these estimates as a tool for prioritization; they don't have to be perfectly accurate, and your estimates won't result in any guarantees. However, the process should give you a better understanding of whether you're likely to consistently meet your SLO, and if not, what the biggest sources of risk to your error budget are. It also encourages you to document your assumptions, where they can be discussed and critiqued. From there, you can do a pragmatic cost/benefit analysis to decide which risks to mitigate.
For dependencies, mitigation might mean:
- Trying to remove it from your critical path
- Making it more reliable; e.g., running multiple copies and failing over between them
- Automating manual failover processes
- Replacing it with a more reliable alternative
- Sharding it so that the scope of failure is reduced
- Adding retries
- Increasing (or decreasing, sometimes it is better to fail fast and retry!) RPC timeouts
- Adding caching and using stale data instead of live data
- Adding graceful degradation using partial responses
- Asking for an SLO that better meets your needs
A series of earlier CRE Life Lessons posts (1, 2, 3) discussed consequences and escalations for SLO violations, as a way to balance velocity and risk; an example of a consequence might be to temporarily block new releases when the error budget is spent. If an outage was caused by one of your service's dependencies, should the consequences still apply? After all, it's not your fault, right?!? The answer is "yes"—the SLO is your proxy for your users' happiness, and users don't care whose "fault" it is. If a particular dependency causes frequent violations to your SLO, you need to mitigate the risk from it, or mitigate other risks to free up more error budget. As always, you can be pragmatic about how and when to enforce consequences for SLO violations, but if you're regularly making exceptions, especially for the same cause, that's a sign that you should consider lowering your SLOs, or increasing the time/effort you are putting into improving reliability.
In summary, every non-trivial service has dependencies, probably many of them. When choosing an SLO for your service, don't think about your dependencies and what SLO you can achieve—instead, think about your users, and what level of service they need to be happy. Once you have an SLO, your dependencies represent sources of risk, but they're not the only sources. Analyze all of the sources of risk together to predict whether you'll be able to consistently meet your SLO and prioritize which risks to mitigate.
1. If you're interested, The Calculus of Service Availability has more in-depth discussion about modeling risks from dependencies, and strategies for mitigating them.