Google Cloud Platform

An example escalation policy — CRE life lessons

In an earlier blog post, we discussed the spectrum of engineering effort between reliability and feature development and the importance of describing when and how an organization should dedicate engineering time towards the reliability of a service that is out of SLO. In this post, we show a lightly-edited SLO escalation policy and associated rationales from a Google SRE team to illustrate the trade-offs that particular teams make to maintain a high development velocity.

This SRE team works with large teams of developers focused on different areas of the serving stack, which comprises around ten high-traffic services and a dozen or so smaller ones, all with SRE support. The team has shards in Europe and America, each covering 12 hours of a follow-the-sun on-call rotation. The supported services have both coarse top-level SLOs representing desired user experience and finer-grained SLOs representing the availability requirements of stack components; crucially the SRE team can route pages to dev teams at the granularity of an individual SLO, making "revoking support" for an SLO both cheap and quick. Alerting is configured to page when the service has burned nine hours of error budget within an hour, and file a ticket when it has burned one week of error budget over the previous week.

It's important to note that this policy is just an example, and probably a poor one if your SRE team supports a service with availability targets of 99.99% or higher. The industry that this Google team operates in is highly competitive and moves quickly, making feature iteration speed and time-to-market more important than maintaining high levels of availability.

Escalation policy preamble

Before getting into the specifics of the escalation policy, it's important to consider the following broad points.

The intent of an escalation policy is not to be completely proscriptive; SREs are expected to make judgement calls as to appropriate responses to situations they face. Instead, this document establishes reasonable thresholds for specific actions to take place, with the intent of reducing the likely range of responses and achieving a measure of consistency. It's structured as a series of thresholds that, when crossed, trigger the redirection of more engineering effort towards addressing an SLO violation.

Furthermore, SRE must focus on fixing the class of issue before declaring an incident resolved. This is a higher bar than fixing the issue itself. For example, if a bad flag flip causes a severe outage, reverting the flag flip is insufficient to bring the service back into SLO. SRE must instead ensure that flag flips in general are extremely unlikely to threaten the SLO in the future, with staged rollouts, automated rollbacks on push failures, and versioned configuration to tie flags to binary versions.

For the following four thresholds in the escalation policy, "bringing a service back into SLO" means:

  • finding the root cause and fixing the relevant class of issue, or
  • automating remediation such that ongoing manual intervention is no longer necessary, or 
  • simply waiting one week, if the class of issue is extremely unlikely to recur with frequency and severity sufficient to threaten the SLO in the future
In other words, a plan for manual remediation is not sufficient to consider the service back within SLO. Bear in mind that you usually need to understand the root cause of a violation to conclude that it's unlikely to recur or to automate remediation.


Escalation policy thresholds

Threshold 1 -  wherein SRE are notified that an SLO is potentially impacted

SRE will maintain alerting so as to be notified of danger to supported SLOs. Upon being notified, SRE will investigate and attempt to find and address the root cause. SRE will consider taking mitigating actions, including redirecting traffic at the load balancers and rolling back binary or configuration pushes. SRE on-call engineers will notify the dev team about the SLO impact and keep them updated as necessary, but no action on their part is required at this point.

Threshold 2 - wherein SRE escalates to the developers

  • If,
    • SRE have concluded they cannot bring the service into SLO without help, and
    • SRE and dev agree that the SLO represents desired user experience
  • SRE have concluded they cannot bring the service into SLO without help, and
  • SRE and dev agree that the SLO represents desired user experience
  • Then,
    • SRE and dev on-calls prioritize fixing the root cause and update the bug daily
    • SRE escalates to dev leads for visibility and additional assistance if necessary
    • Alerting thresholds may be relaxed to avoid continually paging for the known issue, while continuing to provide protection against further regressions
  • SRE and dev on-calls prioritize fixing the root cause and update the bug daily
  • SRE escalates to dev leads for visibility and additional assistance if necessary
  • Alerting thresholds may be relaxed to avoid continually paging for the known issue, while continuing to provide protection against further regressions
  • When the service is brought back into SLO,
    • SRE will revert any alerting changes
    • SRE may create a postmortem
    • Or, if the SLO does not accurately represent desired user experience, the SRE, dev and product teams will agree to change or retire the SLO
  • SRE will revert any alerting changes
  • SRE may create a postmortem
  • Or, if the SLO does not accurately represent desired user experience, the SRE, dev and product teams will agree to change or retire the SLO
Threshold 3 - wherein SRE pauses feature releases and focuses on reliability
  • If,
    • Conditions for the previous threshold are met for at least one week, and
    • The service has not been brought back into SLO, and
    • The 30-day error budget is exhausted
  • Conditions for the previous threshold are met for at least one week, and
  • The service has not been brought back into SLO, and
  • The 30-day error budget is exhausted
  • Then during the following week,
  • Only cherry-picked fixes for diagnosed root causes may be pushed to production
  • SRE may escalate to their leadership and dev management to request that members of the dev team prioritize finding and fixing the root cause over any non-emergency work
  • Daily updates may be made to an "escalations" mailing list (used to broadcast information about outages to a wide audience, including executive leadership).
  • When the service is brought back into SLO,
    • Normal binary releases resume
    • SRE creates a postmortem
    • Team members may re-prioritize normal project work
  • Normal binary releases resume
  • SRE creates a postmortem
  • Team members may re-prioritize normal project work

Threshold 4 - wherein SRE may escalate or revoke support

  • If,
    • Conditions for the previous threshold are met for at least one week, and
    • The service has not been brought back into SLO, and
    • The 90-day error budget is exhausted or the dev team is unwilling to pause feature work to improve reliability
  • Conditions for the previous threshold are met for at least one week, and
  • The service has not been brought back into SLO, and
  • The 90-day error budget is exhausted or the dev team is unwilling to pause feature work to improve reliability
  • Then,
    • SRE may escalate to executive leadership to commandeer more people dedicated to fixing the problem
    • SRE may revoke support for the SLO or the service, and re-direct or disable relevant alerting
  • SRE may escalate to executive leadership to commandeer more people dedicated to fixing the problem
  • SRE may revoke support for the SLO or the service, and re-direct or disable relevant alerting

On escalation and incident response

SREs are first responders, and there's an expectation that they'll make a reasonable effort to bring the service back within SLO before escalating to developers. As such, threshold 1 applies when the SRE team is notified about a violation, despite the one-week ticket alert indicating the seven-day budget is already exhausted. SRE should wait no longer than one week from the initial violation notification before escalating to developers, but they may exercise their own judgement as to whether escalation is appropriate before this point.

Every time SRE escalates, it’s important to ask developers whether the availability goals still represent the desired balance between reliability and development velocity. This gives them the choice between preserving availability goals by rolling back a new feature and temporarily relaxing them to preserve the availability of that feature for users if the latter is the desired user experience. For repeated violations of the same SLO in a short time window, you probably don't need to ask the question over and over again, though that's a strong signal that further escalation is necessary. It's also OK to insist that developers take back the pager for the service until they're willing to restore the previously-agreed availability targets—if they want to run a less reliable service temporarily so that a business-critical feature remains available while they work on its reliability, they can also shoulder the burden of its failures.

On blocking releases

Blocking releases is an appropriate course of action for three main reasons:
  1. Commonly, the largest source of burnt error budget at steady state is the release push. If you’ve already burned all your budget, not pushing new releases lowers the steady-state burn rate, bringing the service back into SLO more quickly
  2. It eliminates the risk of further unexpected SLO violations due to bugs in new code. This is also why any fixes for diagnosed root causes must be patched into the current release, rather than rolling forward to a new release
  3. While blocking releases is not intended as a punitive measure, it does directly impact release velocity, which the dev org cares about deeply. As such, tying SLO violations to reduced velocity aligns the incentives of both organizations. SRE wants the service to stay within SLO, the dev org wants to build new features quickly. This way, either both happen or neither do.
SRE should prefer to unblock feature releases sooner rather than later, once the root cause(s) of a violation has been found and fixed. Giving our dev teams the benefit of the doubt that there will be no further service degradation before the SLO is in compliance over a 30-day window strikes a more acceptable balance between reliability and velocity. This is effectively "borrowing" future error budget to unblock the release before the service is compliant, with the expectation that it will be within a reasonable timeframe. Absent any push-related outages, new features should increase user happiness with the service, repaying some of the unhappiness caused by the SLO violation.

SRE may choose not to unblock releases if pre-violation error-budget burn rates were close to the SLO threshold. In this case, there's less future budget to borrow, thus the risk of further violations is higher and the time until the service is SLO compliant will be significantly longer if releases are allowed to continue.

Summary

We hope that the above example gives you some ideas about how to make trade-offs between reliability and development velocity for a service where the latter is a key business priority. The main concessions to velocity are that SRE doesn’t immediately block releases when an SLO is violated, and provides a mechanism for them to resume before the SLO has returned to compliance with the informed consent of SRE. In the final post of the series, we'll take these policy thresholds out for a spin with some hypothetical scenarios.