Lessons from a Google App Engine SRE on how to serve over 100 billion requests per day
In this blog post we caught up with Chris Jones, a Site Reliability Engineer on Google App Engine for the past three and a half years and an SRE at Google for almost nine years, to find out more about running production systems at Google. Chris is also one of the editors of Site Reliability Engineering: How Google Runs Production Systems, published by O’Reilly and available today.
Google App Engine serves over 100 billion requests per day. You might have heard about how our Site Reliability Engineers, or SREs, make this happen. It’s a little bit of magic, but mostly about applying the principles of computer science and engineering to the design and development of computing systems, generally very large, distributed ones.
Site Reliability Engineering is a set of engineering approaches that lets us, or anyone, run better production systems. It went on to inform the concept of DevOps for the wider IT community. It’s interesting because it’s a relatively straightforward way of improving performance and reliability at planet scale, yet it can be just as useful for any company for, say, rolling out Windows desktops. Done right, SRE techniques can increase the effectiveness of operating any computing service.
Q: Chris, tell us how many SREs operate App Engine and at what scale?
CJ: We have millions of apps on App Engine serving over 100 billion requests per day supported by dozens of SREs.
Q: How do we do that with so few people?
CJ: SRE is an engineering approach to operating large-scale distributed computing services. Making systems highly standardized is critical: when all systems work in similar ways to each other, fewer people are needed to operate them, because there are fewer complexities to understand and deal with.
Automation is also important: our turn-up processes to build new capacity or scale load balancing are automated so that we can scale these processes with computers rather than with more people. If you put a human on a process that’s boring and repetitive, you’ll notice errors creeping in. Computers also respond to failures much faster than we do: in the time it takes us to notice an error, the computer has already moved the traffic to another data center, keeping the service up and running. It’s better to have people do things people are good at and computers do things computers are good at.
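The failure-handling automation Chris describes can be sketched in a few lines. This is a deliberately minimal illustration, not App Engine’s actual tooling; the datacenter names, the 5% error-rate threshold, and both function names are hypothetical:

```python
# Minimal sketch of automated failover between data centers.
# All names and thresholds here are illustrative assumptions.

def healthy(datacenter, error_rates):
    """Treat a datacenter as healthy if its error rate is under 5%.

    Unknown datacenters default to an error rate of 1.0 (unhealthy).
    """
    return error_rates.get(datacenter, 1.0) < 0.05

def route_traffic(active, backups, error_rates):
    """Return the datacenter that should serve traffic.

    Prefer the active datacenter; fail over to the first healthy
    backup as soon as the active one starts erroring.
    """
    if healthy(active, error_rates):
        return active
    for backup in backups:
        if healthy(backup, error_rates):
            return backup
    raise RuntimeError("no healthy datacenter available")

# A monitoring loop calling this reacts far faster than a person could:
serving = route_traffic("us-east", ["us-central", "us-west"],
                        {"us-east": 0.30, "us-central": 0.01})
# serving == "us-central": traffic moved automatically
```

The point is the division of labor: the computer handles the fast, repetitive check-and-reroute step, and people handle the judgment calls that follow.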
Q: What are some of the other approaches behind the SRE model?
CJ: Because there are SRE teams working with many of Google’s services, we’re able to extend the principle of standardization across products: SRE-built tools originally used for deploying a new version of Gmail, for instance, might be generalized to cover more situations, so each team doesn’t need to build its own way of deploying updates. Every product then benefits from improvements to the shared tools, which leads to better tooling for the whole organization.
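The shape of that generalization can be sketched as a deploy routine written once and parameterized per service. Everything below is a hypothetical example of the pattern, not Google’s internal tooling:

```python
# Illustrative sketch of a shared, service-agnostic deploy tool.
# The interface (instance dicts, is_healthy callback) is an assumption
# made up for this example.

def rolling_deploy(instances, version, is_healthy, batch_size=1):
    """Update instances in small batches, checking health as we go.

    `instances` is a mutable list of {"name": ..., "version": ...}
    dicts; `is_healthy` is a callback, so each service plugs in its
    own health check instead of writing its own deploy logic.
    """
    for i in range(0, len(instances), batch_size):
        batch = instances[i : i + batch_size]
        for inst in batch:
            inst["version"] = version
        if not all(is_healthy(inst) for inst in batch):
            raise RuntimeError("health check failed; halting rollout")

# The same function serves any product that supplies its own check:
fleet = [{"name": f"task-{i}", "version": "v1"} for i in range(4)]
rolling_deploy(fleet, "v2", is_healthy=lambda inst: True, batch_size=2)
# afterwards every instance in `fleet` runs "v2"
```

Because the service-specific part is confined to the callback, a fix to the batching or halt-on-failure logic immediately benefits every team using the tool.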
In addition, the combination of software engineering and systems engineering knowledge in SRE often leads to solutions that synthesize the best of both backgrounds. Google’s software network load balancer, Maglev, is an example — and it’s the underlying technology for the Google Cloud Load Balancer.
Q: How do these approaches impact App Engine and our customers running on App Engine?
CJ: Here’s a story that illustrates it pretty well. In the summer of 2013 we moved all of App Engine’s US region from one side of the country to the other. The move caused no downtime for our customers.
CJ: We shut down one App Engine cluster, and as designed, the apps running on it automatically moved to the remaining clusters. We had created a copy of the US region’s High Replication Datastore in the destination data center ahead of time so that those applications’ data (and there were petabytes of it!) was already in place; changes to the Datastore were automatically replicated in near real-time so that it was consistently up to date. When it was time to turn on App Engine in the new location, apps assigned to that cluster automatically migrated from their backup clusters and had all their data already in place. We then repeated the process with the remaining clusters until we were done.
Advance preparation, combined with extensive testing and contingency plans, meant that we were ready when things went slightly wrong and were able to minimize the impact on customers. And of course, we put together an internal postmortem — another key part of how SRE works — to understand what went wrong and how to fix it for the future, without pointing fingers.
Q: Very cool. How can we find out more about SRE?
CJ: Sure. If you’re interested in learning more about how Site Reliability Engineering works at Google, including the lessons we learned along the way, check out this website and the new book. We’ll also be at SREcon this week (April 7-8), giving talks on this topic.