Google Cloud

Introducing Google Customer Reliability Engineering (CRE)

October 10, 2016

Dave Rensin

Director, Customer Reliability Engineering, Google Cloud

In the 25 years that I’ve been in technology, nearly everything has changed. Computers have moved out of the labs and into our pockets. They’re connected together 24/7, and the things we can do with them are starting to rival our most optimistic science fiction.

Almost nothing looks the same as it did back then — except customer support. Support is (basically) still people in call centers wearing headsets. In this new world, that old model just isn’t enough.

We want to change that.

Last week, we announced a brand new profession at Google: Customer Reliability Engineering, or CRE. The mission of this new role is to create shared operational fate between Google and our Google Cloud Platform customers, to give you more control over the critical applications you're entrusting to us, and to share a fate greater than money.

Reducing customer anxiety

When you look out at organizations adopting cloud, you can’t help but notice high levels of anxiety.

It took me a while to figure out a reasonable explanation, but here’s where I finally landed:

Humans are evolutionarily disposed to want to control our environment, which is a really valuable survival attribute. As a result, we don’t react well when we feel like we’re losing that control. The higher the stakes, the more forcefully we react to the perceived loss.

Now think about the basic public cloud business model. It boils down to:

Give up control of your physical infrastructure and (to some extent) your data. In exchange for that uncertainty, the cloud will give you greater innovation, lower costs, better security and more stability.

It’s a completely rational exchange, but it also pushes against one of our strongest evolutionary impulses. No wonder people are anxious.

The last several years have taught me that many customers will not eat their anxieties in exchange for lower prices, at least not for long. This is especially true in cloud because of the stakes involved for most companies. A small number of high-profile companies have already gone back on-prem because the industry hasn’t done enough to recognize this reality.

Cloud providers ignore this risk at their own peril, and addressing this anxiety will be a central requirement to unlock the overwhelming majority of businesses not yet in the cloud.

The support mission

The support function in organizations used to be pretty straightforward: answer questions and fix problems quickly and efficiently. Over time, much of the IT support function has been boiled down to FAQs, help centers, checklists and procedures.

In the era of cloud technology, however, this is completely wrong.

Anxious customers need empathy, compassion and humanity. You need to know that you're not alone and that we take you seriously. You are, after all, betting your businesses on our platforms and tools.

There's only one true and proper mission of a support function in this day and age:

Drive Customer Anxiety -> 0

People who aren’t feeling anxious don’t spend the time and effort to think seriously about leaving a platform that’s working for them. The decision to churn starts with an unresolved anxiety.

Anxiety = 1 / Reliability

It seems obvious to say that the biggest driver of customer anxiety is reliability.

Here’s the non-obvious part, though.

As a cloud customer, you don’t really care about the reliability of your cloud provider; you care about the reliability of your production application. You only indirectly care about the reliability of the cloud in which you run.

The reliability of an application is the product of two things:

1. The reliability of the cloud provider
2. The reliability inherent in the design, code and operations of your application
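To make the "product of two things" concrete, here is a minimal Python sketch. The function name and the 99.99% and 99.9% inputs are made-up illustrations rather than measured figures, and treating the two factors as a simple product assumes they fail independently.

```python
def application_reliability(cloud_reliability: float, app_reliability: float) -> float:
    """Combine the two factors as a simple product (assumes independent failures)."""
    for value in (cloud_reliability, app_reliability):
        if not 0.0 <= value <= 1.0:
            raise ValueError("reliability values must be between 0 and 1")
    return cloud_reliability * app_reliability


if __name__ == "__main__":
    # Hypothetical inputs: a four-nines cloud running a three-nines application.
    combined = application_reliability(0.9999, 0.999)
    print(f"Combined reliability: {combined:.4%}")
```

With these illustrative inputs the combined figure lands just below the weaker of the two factors, which is why improving only the cloud side has limited effect on what your users see.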

Item (1) is a pretty well-understood problem in the industry. There are thousands of engineers employed at the major cloud vendors who focus exclusively on it.

Here at Google we pioneered a whole profession around it: Site Reliability Engineering (SRE).

https://storage.googleapis.com/gweb-cloudblog-publish/images/CRE-manifesto-1hkm3.max-400x400.PNG

We even wrote a book!

What about item (2)? Who’s worried about the reliability inherent in the design, implementation and operation of your production application?

So far, just you.

The standard answer in the industry is:

Here are some white papers, best practices and consultants. Don’t do silly things and your app will be mostly fine.

As an industry, we’re asking you to bet your livelihoods on our platforms, to let us be your business partner and to give up big chunks of control. And in exchange for that we’re giving you . . . whitepapers.

No wonder you’re anxious. You should be!

No matter how much innovation, speed or scale your cloud provider gives you, this arrangement will always feel unbalanced, especially at 3am when something goes wrong.

Perhaps you think I’m overstating the case?

Just a few months ago, Dropbox announced that it was leaving its public cloud provider to go back on-prem. They’ve spoken at length about their decision-making process around this choice and have expressed a strong desire to more fully “control their own destiny.” The cumulative weight of their loss of control just got to be too much. So they left.

SRE 101

The idea behind Google CRE comes from the decade-long journey of Google SRE. I realize you might not be familiar with the history of SRE, so let me spend a couple of paragraphs to catch you up . . .

https://storage.googleapis.com/gweb-cloudblog-publish/images/CRE-manifesto-2vdba.max-700x700.PNG

. . . there were two warring kingdoms — developers and operations.

The developers were interested in building and shipping interesting and useful features to users. The faster the innovation, the better. In the developer tribe’s perfect world there would never be a break in the development and deployment of new and awesome products.

The operations kingdom, on the other hand, was concerned with the reliability of the systems being shipped, because they were the ones getting paged at 3am when something went down. Once the system became stable they’d rather never ship anything new again since 100% of new bugs come from new code.

For decades these kingdoms warred and much blood was spilled. (OK. Not actual blood, but the emails could get pretty testy . . . )

Then, one day, this guy had an idea.

https://storage.googleapis.com/gweb-cloudblog-publish/images/CRE-manifesto-3stzl.max-400x400.PNG

Benjamin Treynor Sloss, VP, 24x7, Father of SRE

He realized that the underlying assumptions of this age-old conflict were wrong and recast the problem into an entirely new notion: the error budget.

No system you’re likely to build (except maybe a pacemaker) needs to be available 100% of the time. Users have lots of interruptions they never notice because they’re too busy living their lives.

It therefore follows that for nearly all systems there's a very small (but nonzero) acceptable quantity of unavailability. That downtime can be thought of as a budget. As long as a system is down less than its budget, it is considered healthy.

For example, let’s say you need a system to be available 99.9% of the time (three nines). That means it’s OK for the system to be unavailable 0.1% of the time (for any given 30-day month, that’s 43 minutes).

As long as you don’t do anything that causes the system to be down more than 43 minutes, you can develop and deploy to your heart’s content. Once you blow your budget, however, you need to spend 100% of your engineering time writing code that fixes the problem and generally makes your system more stable. The more stable you make things, the less likely you are to blow your error budget next month and the more new features you can build and deploy.
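The arithmetic behind that 43-minute figure can be sketched in a few lines of Python. The function names and the 30-day default window are illustrative choices for this post, not part of any Google tooling.

```python
def error_budget_minutes(availability_target: float, days: int = 30) -> float:
    """Allowed downtime, in minutes, for a target such as 0.999 over a window of days."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1.0 - availability_target)


def budget_remaining(availability_target: float, downtime_minutes: float, days: int = 30) -> float:
    """Budget left after the downtime observed so far; negative means the budget is blown."""
    return error_budget_minutes(availability_target, days) - downtime_minutes


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)
    print(f"Three nines over 30 days: {budget:.1f} minutes of budget")
    # Hypothetical month with 50 minutes of downtime: time to stop shipping features.
    if budget_remaining(0.999, downtime_minutes=50) < 0:
        print("Budget blown: spend engineering time on stability")
```

A negative result from `budget_remaining` corresponds to the "blown budget" state described above, where feature work pauses until the system is stable again.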

In short, error budgets align the interests of the developer and operations tribes and create a virtuous circle.

From this, a new profession was born: Site Reliability Engineering (SRE).

At Google, there's a basic agreement between SREs and developers.

The SREs will accept the responsibility for the uptime and healthy operation of a system if:

The system (as developed) can pass a strict inspection process — known as a Production Readiness Review (PRR)
The development team who built the system agrees to maintain critical support systems (like monitoring) and be active participants in key events like periodic reviews and postmortems
The system does not routinely blow its error budget

If the developers don’t maintain their responsibilities in the relationship then the SREs “offboard” the system. (And hand back the pagers!)

This basic relationship has helped create a culture of cooperation that has led to both incredible reliability and super fast innovation.

The Customer Reliability Engineering mission

At Google, we’ve decided we need a similar approach with our customers.

CRE is what you get when you take the principles and lessons of SRE and apply them towards customers.

The CRE team deeply inspects the key elements of a customer’s critical production application — code, design, implementation and operational procedures. We take what we find and put the application (and associated teams) through a strict PRR.

At the end of that process we'll tell you: “here are the reliability gaps in your system. Here is your error budget. If you want more nines here are the changes you should make.”

We'll also build common system monitoring so that we can have mutually agreed upon telemetry for paging and tickets.

It’ll be a lot of hard work on your part to get past our PRR, but in exchange for the effort you can expect the following:

Shared paging. When your pagers go off, so will ours.
Auto-creation and escalation of Priority 1 tickets
CRE participation in customer war rooms (because despite everyone’s best efforts, bad things will inevitably happen)
A Google-reviewed design and production system

Additional Cost: $0

Wait . . . that’s a lot of value. Why aren’t we charging money for it?

The most important lever SREs have in Google is the ability to hand back the pagers. It’s the same thing with CREs. When a customer fails to keep up their end of the work (timely bug fixes, participation in joint postmortems, good operational hygiene and so on), we'll “hand back the pagers” too.

Please note, however, that $0 is not the same as “free.” Achieving Google-class operational rigor requires a sustained commitment on your part. It takes time and effort. We’ll be there on the journey, but you still need to walk the path. If you want some idea of what you’re signing up for, get a copy of the Site Reliability Engineering book and ask yourself how willing you are to do the things it outlines.

It’s fashionable for companies to tell their customers that “we’re in this together,” but they don’t usually act the part.

People who are truly “in it together” are accountable to one another and have mutual responsibilities. They work together as a team for a common goal and share a fate greater than the dollars that pass between them.

This program won’t be for everyone. In fact, we expect that the overwhelming majority of customers won’t participate because of the effort involved. We think big enterprises betting multibillion-dollar businesses on the cloud, however, would be foolish to pass this up. Think of it as a de-risking exercise with a price tag any CFO will love.

Lowering the anxiety with a new social contract

Over the last few weeks we’ve been quietly talking to customers to gauge their interest in the CRE profession and our plans for it. Every time we do, there’s an audible sigh, a relaxing of the shoulders and the unmistakable expression of relief on people's faces.

Just the idea that Google would invest in this way is lowering our customers’ anxiety.

This isn’t altruism, of course. It’s just good business. These principles and practices are a strong incentive for a customer to stay with Google. It’s an affinity built on human relations instead of technical lock-in.

By driving inherent reliability into your critical applications we also increase the practical reliability of our platform. That, in turn, lets us innovate faster (a thing we really like to do).

If you’re a cloud customer, this is the new social contract we think you deserve.

If you’re a service provider looking to expand and innovate your cloud practice, we’d like to work with you to bring these practices to scale.

If you’re another cloud provider, we hope you’ll join us in growing this new profession. It’s what all our customers truly need.
