Building resilient systems to weather the unexpected
Ben Treynor
VP, 24x7
The global cloud that powers Google runs lots of products that people rely on every day—Google Search, YouTube, Gmail, Maps, and more. In this time of increased internet use and virtual everything, it’s natural to wonder if the internet can keep up with, and stay ahead of, all this new demand. The answer is yes, in large part due to an internal team and set of principles guiding the way: site reliability engineering (SRE).
Nearly two decades ago, I was asked to lead Google’s "production team," which at the time was seven engineers. Today, that team—Site Reliability Engineering, or SRE—has grown to be thousands of Googlers strong. SRE is one of our secret weapons for keeping Google up and running. We've learned a lot over the years about planning and resilience, and are glad to share these insights as you navigate your own business continuity and disaster recovery scenarios.
SRE follows a set of practices and principles engineering teams can use to ensure that services stay reliable for users. Since that small team formed nearly 20 years ago, we’ve evolved our practices, done a lot of testing, written three books, and seen other companies—like Samsung—build SRE organizations of their own. SRE work can be summed up with a phrase we use a lot around here: Hope is not a strategy; wish for the best, but prepare for the worst. Ideally, you won’t have to face the worst-case scenario—but being ready if that happens can make or break a business.
For more than a decade, extensive disaster recovery planning and testing have been a key part of SRE’s practice. At Google, we conduct disaster recovery testing, or DiRT for short: a regular, coordinated set of both real and fictitious incidents and outages across the company to test everything from our technical systems to processes and people. Yes, that’s right—we intentionally bring down parts of our production services as part of these exercises. To avoid affecting our users, we use capacity that is unneeded at the time of the test; if engineers can’t find the fix quickly, we’ll stop the test before the capacity is needed again. We’ve also simulated natural disasters in different locations, which has been useful in the current situation, where employees can’t come into the office.
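To make the mechanics concrete, here is a minimal, hypothetical sketch of a DiRT-style drill runner in Python: it injects a failure into spare capacity, gives responders a fixed window to mitigate, and always rolls the test back before that capacity is needed again. The function names and the 30-minute window are illustrative assumptions, not Google’s actual tooling.

```python
import time

# Hypothetical DiRT-style drill runner. All names and timings are
# illustrative assumptions, not Google's actual tooling.

TEST_WINDOW_SECONDS = 30 * 60  # abort before the spare capacity is needed again


def run_drill(inject_failure, check_mitigated, roll_back):
    """Inject a failure into spare capacity, then watch for mitigation.

    inject_failure:   callable that starts the simulated outage.
    check_mitigated:  callable returning True once responders have fixed it.
    roll_back:        callable that ends the test and restores normal serving.
    """
    deadline = time.monotonic() + TEST_WINDOW_SECONDS
    inject_failure()
    try:
        while time.monotonic() < deadline:
            if check_mitigated():
                print("Drill passed: mitigated within the window.")
                return True
            time.sleep(30)  # poll every 30 seconds
        print("Drill aborted: window expired before mitigation.")
        return False
    finally:
        roll_back()  # always restore capacity, whether the drill passed or not
```

In a real exercise, the pass/fail result matters less than what the drill reveals: which alerts fired, which runbooks were stale, and where handoffs broke down.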
This kind of testing takes time, but it pays off in the long run. Rigorous testing lets our SRE teams find unknown weaknesses, blind spots, and edge cases, and create processes to fix them. With any software or system, disruptions will happen, but when you’re prepared for a variety of scenarios, panic is optional. SRE takes into account that humans are running these systems, so practices like blameless postmortems and lots of communication let team members work together constructively.
If you’re just getting started with disaster recovery planning, consider beginning your drills with small, service-specific tests. That might include putting in place a handoff between on-call team members as they finish a shift, along with continuous documentation to pass on to colleagues, and making sure backup relief is accessible if needed. You can also find tips here on common initial SRE challenges and how to meet them.
Inside a service disruption
With any user-facing service, it’s not a matter of if, but when, a service disruption will happen. Here's a look at how we handle them at Google.
First, it’s important to detect the issue and start work on it immediately. Our SREs often carry pagers so they hear about a critical disruption or outage right away and can immediately post to internal admin channels. We page on service-level objectives (SLOs), and recommend customers do the same, so it’s clear that every alert requires human attention; a simple sketch of what paging on an SLO can look like follows these steps.
Define roles and responsibilities among on-call SRE team members. Some SREs will mitigate the actual issue, while others may act as project managers or communications managers, posting updates and fielding questions from customers and non-SRE colleagues.
Find and fix the root cause of the problem. The team finds what’s causing the disruption or outage and mitigates it. At the same time, communications managers on the team follow the work as it progresses and add updates on any customer-facing channels.
Hand off, if necessary. On-call SREs document progress and hand off to colleagues starting a shift or in the next time zone, if the problem persists that long. SREs also make sure to look out for each other and initiate backup if needed.
Finally, write the postmortem. This is a place to detail the incident, the contributing causes, and what the team and business will do to prevent future similar incidents. Note that SRE postmortems are blameless; we assume skill and good intent from everyone involved in the incident, and focus our attention on how to make the systems function better.
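The first step above mentioned paging on SLOs. As a simple illustration, here is a hypothetical sketch of that decision in Python: it compares how fast errors are consuming the error budget against a burn-rate threshold and only pages when the objective is genuinely at risk. The 99.9% target, the one-hour window, and the threshold value are assumptions for the example, not prescribed numbers.

```python
# Minimal sketch of SLO-based paging. The target, window, and threshold
# below are assumptions for illustration, not prescribed values.

SLO_TARGET = 0.999                # assumed availability objective
ERROR_BUDGET = 1.0 - SLO_TARGET   # 0.1% of requests may fail
BURN_RATE_THRESHOLD = 14.4        # at this rate, a 30-day budget is gone in ~2 days


def should_page(errors_in_window: int, requests_in_window: int) -> bool:
    """Decide whether to page, given counts from a recent window (e.g. the last hour)."""
    if requests_in_window == 0:
        return False
    error_ratio = errors_in_window / requests_in_window
    burn_rate = error_ratio / ERROR_BUDGET
    return burn_rate >= BURN_RATE_THRESHOLD


# 2,000 failures out of 1,000,000 requests is a 0.2% error ratio,
# a 2x burn rate: not worth waking anyone up yet.
print(should_page(2_000, 1_000_000))   # False
# 20,000 failures is a 20x burn rate: page a human.
print(should_page(20_000, 1_000_000))  # True
```

The point is the shape of the rule: alerts are tied to the user-facing objective, so every page means the SLO itself is at risk.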
Throughout any outage, remember that it’s difficult to overcommunicate. While SREs prioritize mitigation work, rotating across global locations to maintain 24x7 coverage, the rest of the business is going about its day. During that time, the SRE team sets a clear schedule for the response and maintains multiple communication channels (Google Meet, Chat rooms, Google Docs, and so on) for visibility, and in case any one of those channels goes down.
SRE during COVID-19
During this global coronavirus pandemic, our normal incident response process has only had to shift a little. SRE teams were generally already split between two geographic locations. For our employees working in data centers, we’ve separated staff members and taken other measures to avoid coronavirus exposure. In general, a big part of healthy SRE teams is culture: that includes maintaining work-life balance and a norm of “no heroism.” We’re finding those tenets even more important now to keep employees mentally and physically healthy.
For more on SRE, and more tips on improving system resilience within your own business, check out the video that I recently filmed with two of our infrastructure leads, Dave Rensin and Ben Lutch. We discuss additional lessons Google has learned as a result of the pandemic.
Planning, testing, then testing some more pays off in the long run with satisfied, productive, and well-informed users, whatever service you’re running. SRE is truly a team effort, and our Google SREs exemplify that collaborative, get-it-done spirit. We wish you reliable services, strong communication, and quick mitigation as you get started with your own SRE practices.
Learn more about meeting common SRE challenges when you’re getting started.