Google Cloud

Understanding error budget overspend—CRE life lessons

June 28, 2018

Adrian Hilton

Customer Reliability Engineer, SRE

Alec Warner

Customer Reliability Engineer

In previous CRE Life Lessons blog posts, the Google Customer Reliability Engineering (CRE) team has spent a lot of time talking about service level objectives (SLOs), which measure whether your service is meeting its reliability targets from the point of view of its end users. Your SLO lets you specify how much downtime your service can have in a given period—for example, 43 minutes every 30 days for a service that needs to be available 99.9% of the time. This downtime allowance is your error budget. Like a household budget, it’s OK to spend this error budget over those 30 days, as long as you don’t spend more than that.
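To make the arithmetic concrete, here is a minimal sketch of how the 43-minute figure falls out of a 99.9% target over 30 days. The function names are ours, chosen for illustration, and the sketch assumes the budget is expressed purely as minutes of downtime.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_target: float, period_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability target such as 0.999."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return (1.0 - slo_target) * period_days * MINUTES_PER_DAY


def budget_remaining(slo_target: float, downtime_minutes: float,
                     period_days: int = 30) -> float:
    """Minutes of budget left; a negative value means the budget is overspent."""
    return error_budget_minutes(slo_target, period_days) - downtime_minutes


if __name__ == "__main__":
    # 99.9% over 30 days allows about 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
```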

If you do run out of your error budget, either by spending a bit too much each day, or by having a major outage that blows it all at once, that tells you that your service’s users are suffering too much and it’s time to give them a break. How do you do that? Here are a few questions to consider to see if you need to recalibrate your error budget.

Where are you spending your error budget?

Your SLOs are target values for the corresponding service level indicators (SLIs), which measure the critical parts of the end-user experience. One SLI for the 99.9% available example system above might be “the percentage of HTTP responses that are successful (200), out of all 20x and 50x HTTP responses.” You calculate your error budget spend as the percentage of the measurement period during which your service fails to meet all of its SLO targets; depending on the granularity and accuracy of your SLI measurement, you might do this on a per-minute, per-hour or even per-day basis.
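As a rough illustration of the per-minute approach, the sketch below computes the availability SLI for each one-minute window and counts the minutes that miss the target. The `MinuteCounts` type and the choice to treat a minute with no traffic as "good" are our assumptions, not something your monitoring system will necessarily share.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class MinuteCounts:
    ok: int     # responses with status 200 in this minute
    total: int  # all 20x and 50x responses in this minute


def availability_sli(window: MinuteCounts) -> float:
    """Fraction of successful responses in one window."""
    if window.total == 0:
        return 1.0  # assumption: no traffic means no user saw an error
    return window.ok / window.total


def bad_minutes(windows: Iterable[MinuteCounts], slo_target: float = 0.999) -> int:
    """Number of one-minute windows whose SLI fell below the SLO target."""
    return sum(1 for w in windows if availability_sli(w) < slo_target)
```

Each bad minute is one minute of error budget spent. A coarser SLI (per-hour or per-day) would use the same logic with larger windows, at the cost of charging a whole window for a brief failure.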

When you analyze your error budget spend day-by-day, try to attribute the spend over the measurement period to its main causes:

Do most of your errors happen when you’re doing binary releases? That implies that you’re not going to be able to keep within budget unless you do something to make releases less frequent, less error-prone or lower-impact when there is an error.
Are you seeing steady error spend coming from intermittent application failure, which adds up to the majority of your budget? That’s telling you that you’ve got a fundamental failure in your application. It’s a strong signal that you need to drill down in your logs to find the troublesome queries, and that you should expect to dedicate some of your engineers to identifying the root causes and either addressing them directly or planning to fix them in your next project planning cycle.
Are large chunks of your error budget getting spent by major application failures, where most of your service goes down for many minutes due to configuration pushes, excessive load or queries-of-death? You need to run effective postmortems to identify the root causes and mitigate them. You will need to redirect some of your development engineering effort to address the top action items from those postmortems—so feature development and releases will naturally happen more slowly. (More on this in another post.)
Is the bulk of your spend coming from a dependency outside your control, such as a critical backend or your compute platform? You’ll need to address the dependency or platform owner directly, showing them your SLI metrics and negotiating about how they can make their service more reliable—or how you can be more resilient to the expected failure modes.
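One simple way to answer these questions is to tag each incident with a cause and total up the minutes. The sketch below assumes you already have a list of (cause, minutes) pairs from your incident records; the cause labels shown in the usage note are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def spend_by_cause(
    incidents: Iterable[Tuple[str, float]],
) -> List[Tuple[str, float, float]]:
    """Return (cause, minutes, share_of_total_spend), largest spender first."""
    totals = defaultdict(float)
    for cause, minutes in incidents:
        totals[cause] += minutes
    grand_total = sum(totals.values())
    rows = [
        (cause, minutes, minutes / grand_total if grand_total else 0.0)
        for cause, minutes in totals.items()
    ]
    # Sort by minutes spent so the dominant cause is listed first.
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

For example, calling it with `[("release", 12), ("dependency", 30), ("release", 8), ("intermittent", 10)]` puts "dependency" first with half of the total spend, which would point you toward the conversation with the dependency owner before anything else.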

For each of these cases, you have an objective measurement of whether the problem has been sufficiently addressed: you should expect your SLIs to stay high in circumstances where they previously plummeted.

Are you measuring the right signal?

Something else you should consider: Did the outage reflect real user pain? If you have a strong indicator that users weren’t concerned by a major outage that spent a chunk of your error budget, then you may not have to change your development practices or architecture, but you still have something to fix. Either you should determine a new, lower target level for your SLO, or you should find a different SLI that better represents the user experience.

Can your users tolerate a slightly worse experience?

Suppose you’re trying to run your service at a 99.9% availability level, with the corresponding 43-minute-per-month error budget, but you’re consistently failing to meet that; you’re spending 50-60 minutes per month. How much does that actually matter?
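It can help to translate those minutes back into an availability figure before deciding how much they matter. The sketch below is a back-of-the-envelope calculation under the same 30-day assumption; the function names are ours.

```python
PERIOD_MINUTES_30D = 30 * 24 * 60  # 43,200 minutes in a 30-day period


def achieved_availability(downtime_minutes: float,
                          period_minutes: int = PERIOD_MINUTES_30D) -> float:
    """Availability actually delivered, given minutes of downtime in the period."""
    return 1.0 - downtime_minutes / period_minutes


def overspend_fraction(downtime_minutes: float, slo_target: float,
                       period_minutes: int = PERIOD_MINUTES_30D) -> float:
    """How far over budget you are, as a fraction (0.25 means 25% over)."""
    budget = (1.0 - slo_target) * period_minutes
    return downtime_minutes / budget - 1.0
```

With 50 to 60 minutes of downtime, the service is delivering roughly 99.88% to 99.86% availability, which is about 16% to 39% over a 99.9% budget.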

You probably have business intelligence channels for measuring customer happiness in terms of time spent on your site, purchase rate, support tickets raised and other fairly direct measurements of user happiness. Evaluate those statistics against your SLIs: Are your budget overspend periods correlated with lower user happiness, and if so, what’s the correlation function? If a 50% error budget overspend corresponds to a 1% decrease in customer revenue, then you may feel that you can adjust your SLO target and aim for a 99.5% availability level, rather than spend a lot of engineering effort trying to raise your availability to the original target.
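A first pass at that evaluation could be a plain Pearson correlation between per-period budget overspend and one happiness metric. This is only a sketch: it assumes you have one pair of numbers per measurement period, and a real analysis would also need to account for seasonality, launches and other confounders.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

A value near -1 between overspend and revenue would suggest users do feel the overspend; a value near 0 would suggest the current target is stricter than they need. For scale, a 50% overspend of the 43-minute budget is about 65 minutes per 30 days (roughly 99.85% availability), while a 99.5% target would allow 216 minutes.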

What is important in this case is to have, and document, the data used to determine the SLO target. You don’t want to fall into the trap of increasing your error budget by 50% each period because “users don’t really care”—you need to articulate the tradeoff in user happiness/spend vs. reliability in your SLO definition. An SLO specification shouldn’t just contain numbers and metric names. It should also reference the logic and data used to determine the SLO target.
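One way to keep the reasoning next to the numbers is to make it a required part of the SLO record itself. The structure below is purely hypothetical (the field names are ours, not a standard format), but it shows the idea: a spec with a target and no rationale or evidence is flagged as incomplete.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SLOSpec:
    name: str
    sli_description: str
    target: float             # e.g. 0.999
    period_days: int          # e.g. 30
    rationale: str = ""       # why this target was chosen
    evidence: Tuple[str, ...] = ()  # pointers to the data behind the rationale

    def missing_documentation(self) -> List[str]:
        """List the documentation fields that are still empty."""
        missing = []
        if not self.rationale.strip():
            missing.append("rationale")
        if not self.evidence:
            missing.append("evidence")
        return missing
```

A review step could then refuse any change to `target` while `missing_documentation()` returns a non-empty list, which guards against quietly loosening the SLO each period.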

When your users’ experience isn’t definitive

It may be true that the customer is always right, but what if your service’s users are part of your company? In some cases, the overall business decision may be that continuing to build and release the software is in the best interest of the company as a whole, even if you’re consistently going over budget. The error budget spend may cause an inconvenience to employees, but failing to release new versions of the software would have a significant cost to the company that outweighs user inconvenience.

This can occur when there's a disconnect between what the users of the software are perceived to need (for example, the 99.9% availability target of this example service) and what the executives who pay for the development of the software think these users should tolerate in the name of greater velocity.

Now that we understand what an error budget is telling us, in part two of this post we will look at how best to keep a positive balance.

Interested in learning more about site reliability engineering (SRE) in practice? We’ll be discussing how to apply SRE principles to deploy and manage services at Next ’18 in July. Join us!