Google Cloud

Applying the Escalation Policy—CRE life lessons

February 8, 2018

Will Tipton

Site Reliability Engineer

Alex Bramley

Customer Reliability Engineer

In past posts, we’ve discussed the importance of creating an explicit policy document describing how to escalate SLO violations, and given a real-world example of a document from an SRE team at Google. This final post works through hypothetical scenarios that exercise the policy and illustrate edge cases. The following scenarios all assume a "three nines" availability SLO for a service that burns half its error budget on background errors; that is, the error budget is 0.1% errors, and serving 0.05% errors is "normal."

First, let's recap the policy thresholds:

Threshold 1: Automated alerts notify SRE of an at-risk SLO
Threshold 2: SREs conclude they need help to defend the SLO and escalate to devs
Threshold 3: The 30-day error budget is exhausted and the root cause has not been found; SRE blocks releases and asks for more support from the dev team
Threshold 4: The 90-day error budget is exhausted and the root cause has not been found; SRE escalates to executive leadership to commandeer more engineering time for reliability work

With that refresher in mind, let’s dig into these SLO violations.

Scenario 1: A short but severe outage is quickly root-caused to a dependency problem

Scenario: A bad push of a critical dependency causes a 50% outage for an hour while the team responsible for the dependency rolls back the bad release. The error rate returns to previous levels when the rollback completes, and the team identifies the commit that is the root cause and reverts it. The team responsible for the dependency writes a postmortem, to which SREs for the service contribute some action items (AIs) to prevent recurrence.

Error budget: Assuming three nines, the service burned about 70% of a 30-day error budget during this outage alone: one hour at 50% errors is equivalent to 0.5 hours of total outage, against a 30-day budget of 0.72 hours (0.1% of 720 hours), assuming roughly uniform traffic. It has exceeded the 7-day budget and, given background errors, the 30-day budget as well.

Escalation: SRE is alerted to deal with the impact to production (Threshold 1). If SRE judges the class of issue (bad config or binary push) to be sufficiently rare or adequately mitigated, then the service is "brought back into SLO", and the policy requires no other escalation. This is by design: the policy errs on the side of development velocity, and blocking releases solely because the 30-day error budget is exhausted goes against that.

If this is a recurring issue (e.g., occurring two weeks in a row, burning most of a quarter's error budget) or is otherwise judged likely to recur, then it’s time to escalate to Threshold 2 or 3.
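The error budget arithmetic for this scenario can be checked with a few lines of Python. This is a minimal sketch, not part of the policy: the function name and parameters are hypothetical, and it assumes uniform traffic so that time-weighted error rates can stand in for request-weighted ones.

```python
HOURS_PER_DAY = 24


def budget_fraction_burned(error_rate, duration_hours, window_days,
                           error_budget=0.001):
    """Fraction of a window's error budget consumed by serving
    `error_rate` errors for `duration_hours`.

    Assumption: traffic is uniform, so one hour at 50% errors counts
    as half an hour of total outage. `error_budget` defaults to the
    0.1% implied by a three-nines SLO.
    """
    budget_hours = error_budget * window_days * HOURS_PER_DAY
    return error_rate * duration_hours / budget_hours


# The one-hour, 50% outage measured against the 30-day budget.
outage_30d = budget_fraction_burned(0.5, 1, 30)

# Background errors: 0.05% served continuously for the whole 30 days.
background_30d = budget_fraction_burned(0.0005, 30 * HOURS_PER_DAY, 30)

# The same outage measured against the 7-day budget.
outage_7d = budget_fraction_burned(0.5, 1, 7)

print(f"30-day budget burned by outage:     {outage_30d:.0%}")
print(f"30-day budget burned by background: {background_30d:.0%}")
print(f"30-day total:                       {outage_30d + background_30d:.0%}")
print(f"7-day budget burned by outage:      {outage_7d:.0%}")
```

Under these assumptions the outage alone consumes a little under 70% of the 30-day budget, and adding the background errors pushes the total past 100%, which is why both windows are reported as exceeded.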

Scenario 2: A short but severe outage has no obvious root cause

Scenario: The service has a 50% outage for an hour, cause unknown.

Error budget: Same as previous scenario: the service has exceeded both its 7-day and 30-day error budgets.

Escalation: SRE is alerted to deal with the impact to production (Threshold 1). SRE escalates quickly to the dev team. Together they may request new metrics to provide additional visibility into the problem and install fallback alerting, and the SRE and dev oncalls prioritize investigating the issue for the next week (Threshold 2).

If the root cause continues to evade understanding, SRE pauses feature releases after the first week (Threshold 3) until the outage passes out of the 30-day SLO measurement window. More of the SRE and dev teams are pulled from their project work to debug or try to reproduce the outage, which is their number-one priority until the service is back within SLO or they find the root cause. As the investigation continues, SRE and dev teams shift towards mitigation, workarounds and building defense-in-depth. Ideally, by the time the outage passes over the SLO horizon, SRE is confident that any recurrence will be easier to root-cause and will not have the same impact. In this situation, they can justify resuming release pushes even without having identified a concrete root cause.

Scenario 3: A slow burn of error budget with a badly attributed root cause

Scenario: A regression makes it into prod at time T, and the service begins serving 0.15% errors. The root cause of the regression eludes both SRE and developers for weeks; they attempt multiple fixes, but none reduces the impact.

Error budget: Serving 0.15% errors against a 0.1% error budget burns 1.5 months' error budget per month if left unresolved.

Escalation: The SRE oncall is notified via ticket at around T+5 days, when the 7-day budget is burned. SRE triggers Threshold 2 and escalates to the dev team at about T+7 days. Because of some correlations with increased CPU utilization, the SRE and dev oncall hypothesize that the root cause of the errors is a performance regression. The dev oncall finds some simple optimizations and submits them; these get into production at T+16 days as part of the normal release process. The fixes don't resolve the problem, but now it is impractical to roll back two weeks of daily code releases for the affected service.

At T+20 days, the service exceeds its 30-day error budget, triggering Threshold 3. SRE stops releases and escalates for more dev engagement. The dev team agrees to assemble a working party of two devs and one SRE, whose priority is to root-cause and remedy the problem. With no good correlation between code changes and the regression, they start doing in-depth performance profiling and removing more bottlenecks. All the fixes are aggregated into a weekly patch on top of the current production release. The efficiency of the service increases noticeably, but it still serves an elevated rate of errors.

At T+60 days, the service expends its 90-day error budget, triggering Threshold 4. SRE escalates to executive leadership to ask for a significant quantity of development effort to address the ongoing lack of reliability. The dev team does a deep dive and finds some architectural problems, eventually reimplementing some features to close out an edge-case interaction between server and client. Error rates revert to their previous state and blocked features roll out in batches once releases begin again.
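The timeline in this scenario follows from the burn rate. As a rough sketch, the snippet below computes how long the service must serve 0.15% errors before each budget window is spent. The function name is hypothetical, and the calculation counts only errors served since time T and assumes uniform traffic, a simplification that appears to match the T+5, T+20 and T+60 figures above.

```python
def days_to_exhaust(window_days, error_rate, error_budget=0.001):
    """Days of serving `error_rate` needed to spend the whole error
    budget of a `window_days` window.

    Assumptions: uniform traffic, and only errors served since the
    regression began (time T) are counted against the budget.
    """
    if error_rate <= 0:
        raise ValueError("error_rate must be positive")
    return window_days * error_budget / error_rate


REGRESSION_ERROR_RATE = 0.0015  # 0.15% errors, as in the scenario

for window in (7, 30, 90):
    days = days_to_exhaust(window, REGRESSION_ERROR_RATE)
    print(f"{window}-day budget exhausted at about T+{days:.1f} days")
```

At 1.5 times the sustainable rate, each window is spent in two-thirds of its length: a little under 5 days for the 7-day budget, 20 days for the 30-day budget, and 60 days for the 90-day budget.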

Scenario 4: Recurring, transient SLI excursions

Scenario: Binary releases cause transient error spikes that occur with the daily or weekly release.

Error budget: If the errors don’t threaten the long-term SLO, it's entirely SRE's responsibility to ensure the alert is properly tuned. So, for the sake of this scenario, assume that the error spikes don’t threaten the SLO over any window longer than a few hours.

Escalation: SLO-based alerting initially notifies SREs, and the escalation path proceeds as in the slow-burn case. In particular, the issue should not be considered "resolved" simply because the SLI returns to normal between releases, since the bar for bringing a service back into SLO is set higher than that. SRE may tune the alert so that a longer or faster burn of error budget triggers a page, and may decide to investigate the issue as part of their ongoing work on the service’s reliability.
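One way to picture the alert tuning described above is as a threshold on how much of the 30-day budget a single excursion consumes. The sketch below is illustrative only: the function names, the spike size (5% errors for 10 minutes) and the paging thresholds are all invented for this example and do not come from the policy. It assumes uniform traffic and the same three-nines budget as the other scenarios.

```python
def budget_consumed(error_rate, duration_hours,
                    error_budget=0.001, period_days=30):
    """Fraction of the period's error budget consumed by one excursion,
    assuming uniform traffic."""
    return error_rate * duration_hours / (error_budget * period_days * 24)


def should_page(error_rate, duration_hours, page_fraction):
    """Page only if a single excursion consumes at least `page_fraction`
    of the 30-day budget. `page_fraction` is the hypothetical tuning knob."""
    return budget_consumed(error_rate, duration_hours) >= page_fraction


# Hypothetical release spike: 5% errors for 10 minutes.
spike = {"error_rate": 0.05, "duration_hours": 10 / 60}

print(f"spike consumes {budget_consumed(**spike):.1%} of the 30-day budget")
print("pages at a 1% threshold:", should_page(**spike, page_fraction=0.01))
print("pages at a 5% threshold:", should_page(**spike, page_fraction=0.05))

# The one-hour, 50% outage from Scenario 1 still pages at the looser threshold.
print("Scenario 1 outage pages:", should_page(0.5, 1, page_fraction=0.05))
```

Raising the threshold silences the page for the transient release spike while still catching a severe outage. It does not make the recurring spikes go away, so the investigation still belongs on SRE's list of reliability work.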

Summary

It's important to “war-game” any escalation policy with hypothetical scenarios to make sure there are no unexpected edge cases and to check that the wording is clear. Circulate draft policies widely and address concerns that are raised in appendices, but don't clutter the policy itself with justifications of the chosen inflection points. Expect to have an extended discussion if your peers find your proposals contentious. Remember that the example shared here is just that—a real-life SRE team looking to meet a high availability target would likely structure their escalation policy quite differently.
