Google Cloud Platform

Adventures in SRE-land: Welcome to Google Mission Control

[Image: NASA's Mission Control Center in Houston]

Wait. That’s not Google. That’s Houston.

We do have a Mission Control at Google, named in honor of NASA’s Christopher C. Kraft Jr. Mission Control Center, pictured here. But at Google, Mission Control is not a place. It’s a six-month rotation program in which engineers working on product development experience what it’s like to be a Site Reliability Engineer (SRE). The goal is to increase the number of engineers who understand the challenges of building and operating a high-reliability service at Google's scale.

The Mission Control inspiration goes further: SREs at Google are issued jackets that bear a flight patch inspired by the one Gene Kranz had commissioned for the Mission Controllers in Houston [1]. It bears the “Kranz Dictum” of “Tough and Competent” in Latin: “Duri et Periti”. If you see someone wearing a leather jacket with this flight patch, you’re looking at a Google SRE.

sre-land-19uvi.PNG

But what is an SRE? According to Google Vice President of Engineering Ben Treynor Sloss, who coined the term SRE, “SRE is what happens when you ask a software engineer to design an operations function.” In 2003, Ben was asked to lead Google’s existing “Production Team,” which at the time consisted of seven software engineers. The team started as a software engineering team, and since Ben is also a software engineer, he continued to grow it into a team that he, as a software engineer, would still want to work on. Thirteen years later, Ben leads a team of roughly 2,000 SREs, and it is still a team that software engineers want to work on. About half of the engineers who do a Mission Control rotation choose to remain SREs after their rotation is complete.

Google has been putting the word out about SRE for the past couple of years. Ben gave a talk at SREcon14 where he shared the principles of SRE learned over 11 years of building the team at Google. Melissa Binde gave a talk at GCP Next 2016 where she offered pointers on how to apply some of the techniques we use at Google to your workloads running in our cloud. And if you really want to dig deep, the Site Reliability Engineering book is now available and is highly recommended reading.

Over the next six months, I will be on the uncomfortably exciting adventure of my own Mission Control rotation with the SRE team in Seattle that looks after Google Compute Engine. I will also be sharing some of the things I learn along the way here on this blog. So, if you want to learn more about being an SRE and how Site Reliability Engineering impacts our cloud services, keep watching this space.