Adventures in SRE-land: Incident management at Google
Paul Newson
SRE Mission Controller
Editor's Note: This is the second of a two-part series on Mission Control, a rotation program that gives Google engineers a taste of what it's like to be an SRE. You can read the first part here.
Have you ever wondered what happens at Google when something goes wrong? Our industry is fond of using colorful metaphors such as “putting out fires” to describe what we do.
Of course, unlike actual firefighters, our incidents don’t normally involve risk to life and limb. Despite the imperfect metaphor, Google Site Reliability Engineers (SREs) have a lot in common with first responders in other fields.
Like those first responders, SREs at Google regularly practice emergency response, honing the skills, tools, techniques and attitude required to quickly and effectively deal with the problem at hand.
In emergency services, and at Google, when something goes wrong, it's called an “incident.”
This is the story of my first “incident” as a Google SRE.
Prologue: preparation
For the past several months, I’ve been on a Mission Control rotation with the Google Compute Engine SRE team. I did one week of general SRE training. I learned about Compute Engine through weekly peer training sessions, and by taking on project work. I participated in weekly “Wheel of Misfortune” sessions, where we're given a typical on-call problem and try to solve it. I shadowed actual on-callers, helping them respond to problems. I was secondary on-call, assisting the primary with urgent issues, and handling less urgent issues independently.
Sooner or later, after all the preparation, it’s time to be at the sharp end. Primary on-call. The first responder.
Editor's Note: Chapter 28 “Accelerating SREs to On-Call and Beyond” in Site Reliability Engineering goes into detail about how we prepare new SREs to be ready to be first responders.
Going on-call
There’s a lot more to being an SRE than being on-call. On-call is, by design, a minority of what SREs do, but it’s also critical, not only because someone needs to respond when things go wrong, but also because the experience of being on-call informs many other things we do as SREs.

During my first on-call shifts, our alerting system saw fit to page¹ me twice, and two other problems were escalated to me by other people. With each page, I felt a hit of adrenaline. I wondered, "Can I handle this? What if I can’t?" But then I started to work the problem in front of me, like I was trained to, and I remembered that I don’t need to know everything — there are other people I can call on, and they will answer. I may be on point, but I’m not alone.
Editor’s Note: Chapter 11 “Being On-Call” in Site Reliability Engineering has lots of advice on how to organize on-call duties in a way that allows people to be effective over the long term.
It’s an incident!
Three of the pages I received were minor. The fourth was more, shall we say . . . interesting?

Another Google engineer using Compute Engine for their service had a test automation failure, and upon investigation noticed something unusual with a few of their instances. They notified the development team’s primary on-call, Parya, and she brought me into the loop. I reached out to my more experienced secondary, Benson, and the three of us started to investigate, along with others from the development team who were looped in. Relatively quickly, we determined it was a genuine problem. Having no reason to believe that the impact was limited to the single internal customer who reported the issue, we declared an incident.
What does declaring an incident mean? In principle, it means that an issue is of sufficient potential impact, scope and complexity that it will require a coordinated effort with well-defined roles to manage it effectively. At some point, everything you see on the summary page of the Google Cloud Status Dashboard was declared an incident by someone at Google. In practice, declaring an incident at Google means creating a new incident in our internal incident management tool.
As part of my on-call training, I was trained on the principles behind Google’s incident management protocol, and the internal tool that we use to facilitate incident response. The incident management protocol defines roles and responsibilities for the individuals involved. Earlier I asserted that Google SREs have a lot in common with other first responders. Not surprisingly, our incident management process was inspired by, and is similar to, well established incident command protocols used in other forms of emergency response.
My role was Incident Commander. Less than seven minutes after I declared the incident, a member of our support team took on the External Communications role. In this particular incident, we did not declare any other formal roles, but in retrospect, Parya was the Operations Lead; she led the efforts to root-cause the issue, pulling in others as needed. Benson was the Assistant Incident Commander, as I asked him a series of questions of the form “I think we should do X, Y and Z. Does that sound reasonable to you?”
One of the keys to effective incident response is clear communication between incident responders and others who may be affected by the incident. Part of that equation is the incident management tool itself, which gives Googlers a central place to learn about any ongoing incidents with Google services. The tool then directs Googlers to additional relevant resources, such as an issue in our issue-tracking database that contains more details, or the communications channels being used to coordinate the incident response.
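To make those roles and hand-offs concrete, here's a minimal sketch of what an incident record might capture. It's only an illustration of the concepts above, not the schema of Google's internal incident management tool; the Incident class and every field name in it are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical incident record illustrating the roles described above."""
    summary: str
    incident_commander: str                 # coordinates the overall response
    operations_lead: str | None = None      # leads the technical investigation
    communications_lead: str | None = None  # handles updates to those affected
    tracking_bug: str | None = None         # link to the detailed issue
    comms_channel: str | None = None        # where responders coordinate
    status: str = "open"


# Roughly how the incident in this story would look:
incident = Incident(
    summary="Unusual behavior on a small number of Compute Engine instances",
    incident_commander="Paul",
    operations_lead="Parya",
    communications_lead="support on-call",
    tracking_bug="<link to the tracking issue>",
    comms_channel="<incident coordination channel>",
)
```

The essential properties are that each role has exactly one named owner, and that anyone looking at the record can find both the detailed issue and the place where coordination is happening.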
Editor’s Note: Chapters 12, 13 and 14 of Site Reliability Engineering discuss effective troubleshooting, emergency response and managing incidents, respectively.
The rollback — an SRE’s fire extinguisher
While some of us worked to understand the scope of the issue, others looked for the proximate and root causes so we could take action to mitigate the incident. The scope was determined to be relatively limited, and the cause was tracked down to a particular change included in a release that was currently being rolled out.
This is quite typical. The majority of problems in production systems are caused by changing something — a new configuration, a new binary, or a service you depend on doing one of those things. There are two best practices that help in this very common situation.
First, all non-emergency changes should use a progressive rollout, which simply means don’t change everything at once. This gives you the time to notice problems, such as the one described here, before they become big problems affecting large numbers of customers.
Second, all rollouts should have a well understood and well tested rollback mechanism. This means that once you understand which change is responsible for the problem, you have an “undo” button you can press to restore service.
Keeping your problems small using a progressive rollout and then mitigating them quickly via a trusted rollback mechanism are two powerful tools in the quest to meet your Service Level Objectives (SLOs).
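Put together, the pattern is easy to sketch in code. The snippet below is a minimal illustration of a staged rollout guarded by an error-rate check, not Google's actual release tooling; deploy, error_rate, rollback, the stage percentages and the threshold are all hypothetical placeholders for your own deployment and monitoring systems.

```python
import random
import time

ROLLOUT_STAGES = [1, 10, 50, 100]  # percent of instances touched at each stage
ERROR_THRESHOLD = 0.001            # e.g. derived from a 99.9% availability SLO


def deploy(version: str, percent: int) -> None:
    """Hypothetical stand-in for your deployment system."""
    print(f"Deploying {version} to {percent}% of instances")


def error_rate(version: str) -> float:
    """Hypothetical stand-in for a query to your monitoring system."""
    return random.uniform(0.0, 0.002)  # fake measurement for the sketch


def rollback(new: str, previous: str) -> None:
    """Hypothetical 'undo button': restore the last known-good release."""
    print(f"Rolling back {new} -> {previous}")


def progressive_rollout(new: str, previous: str) -> bool:
    """Widen the rollout in stages, rolling back at the first sign of trouble."""
    for percent in ROLLOUT_STAGES:
        deploy(new, percent)
        time.sleep(1)  # stand-in for a real soak/observation period
        rate = error_rate(new)
        if rate > ERROR_THRESHOLD:
            print(f"Error rate {rate:.3%} exceeds threshold at {percent}%; mitigating")
            rollback(new, previous)
            return False
    return True


if __name__ == "__main__":
    progressive_rollout(new="release-2", previous="release-1")
```

The useful property is that, at the moment a problem is detected, the blast radius is bounded by the current stage, and the rollback path is something you exercise routinely rather than improvise under pressure.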
This particular incident followed this pattern. We caught the problem while it was small, and then were able to mitigate it quickly via a rollback.
Editor’s Note: Chapter 36 “A Collection of Best Practices for Production Services” in Site Reliability Engineering talks more about these, and other, best practices.
Epilogue: the postmortem
With the rollback complete, and the problem mitigated, I declared the incident “closed.” At this point, the incident management tool helpfully created a postmortem document for the incident responders to collaborate on. Taking our firefighting analogy to its logical conclusion, this is analogous to the part where the fire marshal analyzes the fire, and the response to the fire, to see how similar fires could be prevented in the future, or handled more effectively.
Google has a blameless postmortem culture. We believe that when something goes wrong, you should not look for someone to blame and punish. Chances are the people in the story were well intentioned, competent and doing the best they could with the information they had at the time. If you want to make lasting change and avoid similar problems in the future, you need to look at how you can improve the systems, tools and processes around the people, so that a similar problem simply can’t happen again.
Despite the relatively limited impact of the incident, and the relatively subtle nature of the bug, the postmortem identified nine specific follow-up actions that could potentially avoid the problem in the future, or allow us to detect and mitigate it faster if a similar problem occurs. These nine issues were all filed in our bug tracking database, with owners assigned, so they'll be considered, researched and followed up on in the future.
The follow-up actions are not the only outcome of the postmortem. Since every incident at Google has a postmortem, and since we use a common template for our postmortem documents, we can perform analysis of overall trends. For example, this is how we know that a significant fraction of incidents at Google come from configuration changes. (Remember this the next time someone says “but it’s just a config change” when trying to convince you that it’s a good idea to push it out late on the Friday before a long weekend . . .)
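Because the shared template gives every postmortem the same structure, that kind of trend analysis can be as simple as tallying a common field across documents. Here is a toy sketch with entirely made-up records and a hypothetical trigger field:

```python
from collections import Counter

# Made-up records standing in for structured fields extracted from postmortems.
postmortems = [
    {"id": "pm-101", "trigger": "configuration change"},
    {"id": "pm-102", "trigger": "binary rollout"},
    {"id": "pm-103", "trigger": "configuration change"},
    {"id": "pm-104", "trigger": "dependency outage"},
]

trigger_counts = Counter(pm["trigger"] for pm in postmortems)
total = sum(trigger_counts.values())

for trigger, count in trigger_counts.most_common():
    print(f"{trigger}: {count}/{total} ({count / total:.0%})")
```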
Postmortems are also shared within the teams involved. On the Compute Engine team, for example, we have a weekly incident review meeting, where incident responders present their postmortem to a broader group of SREs and developers who work on Compute Engine. This helps identify additional follow-up items that may have been overlooked, and it shares the lessons learned with the broader team, helping everyone get better at thinking about reliability through these case studies. It's also a very strong way to reinforce Google’s blameless postmortem culture. I recall one of these meetings where the person presenting the postmortem attempted to take the blame for the problem. The person running the meeting said, “While I appreciate your willingness to fall on your sword, we don’t do that here.”
The next time you read the phrase “We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence” on our status page, I hope you'll remember this story. Having experienced firsthand the way we follow up on incidents at Google, I can assure you that it's not an empty promise.
Editor's Note: Chapter 15, “Postmortem Culture: Learning from Failure” in Site Reliability Engineering discusses postmortem culture in depth.
¹ We don’t actually use pagers anymore, of course, but we still call it “getting paged” no matter what device or communications channel is used.