Are we there yet? Thoughts on assessing an SRE team’s maturity
One facet of our work as Customer Reliability Engineers—Google Site Reliability Engineers (SREs) tapped to help Google Cloud customers develop an SRE practice in their own organizations—is advising operations or SRE teams on how to improve their operational maturity. We've noticed a recurring question cropping up across many of these discussions, usually phrased along the lines of "is what we're currently doing 'SRE work'?" or, with a little more existential dread, "can we call ourselves SREs yet?"
We've answered this question before with a list of practices from the SRE workbook. But the list is long on the what and short on the why, which can make it hard to digest for folks already suffering an identity crisis. Instead, we hope to help answer this question by discussing some principles we consider fundamental to how an SRE team operates. We'll examine why they're important and suggest questions that characterize a team's progress towards embodying them.
Are we there yet?
This question is asked in different ways, for a myriad of different reasons, and it can be quite hard to answer due to the wide range of different circumstances that our customers operate in. Moreover, CRE, and Google in general, is not the final arbiter of what is and isn't "SRE" for your organization, so we can't provide an authoritative answer, if one even exists. We can only influence you and the community at large by expressing our opinions and experiences, in person or via our books and blog posts.
Further, discussions of this topic tend to be complicated by the fact that the term "SRE" is used interchangeably to mean three things:
A job role primarily focused on maintaining the reliability of a service or product.
A group of people working within an organization, usually in the above job role.
A set of principles and practices that the above people can utilize to improve service reliability.
When people ask "can we call ourselves SREs yet?" we could interpret it as a desire to link these three definitions together. A clearer restatement of this interpretation might be: "Is our group sufficiently advanced in our application of the principles and practices that we can justifiably term our job role SRE?"
We should stress that we're not saying that you need a clearly defined job role—or even a team—before you can start utilizing the principles and practices to do things that are recognizably SRE-like. Job roles and teams crystallize from a more fluid set of responsibilities as organizations grow larger. But as this process plays out, the people involved may feel less certain of the scope of their responsibilities, precipitating the ‘are we there yet?’ question. We suspect that's where the tone of existential dread comes from...
Key SRE indicators
Within the CRE team here at Google Cloud, the ‘are we there yet?’ question surfaced a wide variety of opinions about the core principles that should guide an SRE team. We did manage to reach a rough consensus, with one proviso—the answer is partially dependent on how a team engages with the services it supports.
We've chosen to structure this post around a set of principles that we would broadly expect groups of people working as SREs that directly support services in production to adhere to. Like a litmus test, this won't provide pinpoint accuracy, but in our collective opinion at least, alignment with most of the principles laid out below is a good signal that a team is practicing something that can recognizably be termed Site Reliability Engineering.
Directly engaged SRE teams are usually considered Accountable (in RACI terms) for the service’s reliability, with Responsibility shared between the SRE and development teams. As a team provides less direct support these indicators may be less applicable. We hope those teams can still adapt the principles to their own circumstances.
To illustrate how you might do this, for each principle we've given a counter-example of a team of SREs operating in an advisory capacity. They're subject-matter experts who are Consulted by development teams who are themselves Responsible and Accountable for service reliability.
Wherever your engagement model lies on the spectrum, being perceived by the rest of the organization as jointly responsible for a service's reliability, or as reliability subject-matter experts, is a key indicator of SRE-hood.
Principle #1: SREs mitigate present and future incidents
This principle is the one that usually underlies the perception of responsibility and accountability for a service's reliability. All the careful engineering and active regulation in the world can't guarantee reliability, especially in complex distributed systems—sometimes, things go wrong unexpectedly and the only thing left to do is react, mitigate, and fix. SREs have both the authority and the technical capability to act fast to restore service in these situations.
But mitigating the immediate problem isn't enough. If it can happen again tomorrow, then tomorrow isn't better than today, so SREs should work to understand the precipitating factors of incidents and propose changes that remediate the entire class of problem in the infrastructure they are responsible for. Don't have the same outage again next month!
How unique are your outages? Ask yourself these questions:
Can you mitigate the majority of the incidents without needing specialist knowledge from the development team?
Do you maintain training materials and practice incident response scenarios?
After a major outage happens to your service, are you a key participant in blamelessly figuring out what really went wrong, and how to prevent future outages?
Now, for a counter-example. In many organizations, SREs are a scarce resource and may add more value by developing platforms and best practices to uplift large swathes of the company, rather than being primarily focused on incident response. Thus, a consulting SRE team would probably not be directly involved in mitigating most incidents, though they may be called on to coordinate incident response for a widespread outage. Rather than authoring training materials and postmortems, they would be responsible for reviewing those created by the teams they advise.
Principle #2: SREs actively regulate service reliability
Reliability goals and feedback signals are fundamental for both motivating SRE work and influencing the prioritization of development work. At Google, we call our reliability goals Service Level Objectives and our feedback signals Error Budgets, and you can read more about how we use them in the Site Reliability Workbook.
Do your reliability signals affect your organization's priorities? Ask yourself these questions:
Do you agree with your organization on goals for the reliability of the services you support, and track performance against those goals in real time?
Do you have an established feedback loop that moderates the behaviour of the organization based on recent service reliability?
Do you have the influence to effect change within the organization in pursuit of the reliability goals?
Do you have the agency to refuse, or negotiate looser goals, when asked to make changes that may cause a service to miss its current reliability goals?
Each question builds on the last. It is almost impossible to establish a data-driven feedback loop without a well-defined and measured reliability goal. For those goals to be meaningful, SREs must have the capability to defend them. Periods of lower service reliability should result in consequences that temporarily reduce the aggregate risk of future production changes and shift engineering priorities towards reliability.
When it comes down to a choice between service reliability and the rollout of new but unreliable features, SREs need to be able to say "no". This should be a data-driven decision—when there's not enough spare error budget, there needs to be a valid business reason for making users unhappy. Sometimes, of course, there will be, and this can be accommodated with new, lower SLO targets that reflect the relaxed reliability requirements.
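To make the feedback loop concrete, here is a minimal sketch of the error-budget arithmetic for a request-based availability SLO. The target, request counts, and 90% policy threshold are illustrative assumptions, not a prescribed policy; real implementations would pull these numbers from a monitoring system.

```python
# Illustrative error-budget arithmetic for an availability SLO.
# The SLO target and request counts below are hypothetical examples.

SLO_TARGET = 0.999            # e.g. 99.9% of requests succeed over the window
WINDOW_REQUESTS = 10_000_000  # total requests served in the SLO window
FAILED_REQUESTS = 4_200       # requests that failed in the same window

# The error budget is the number of failures the SLO lets you tolerate.
error_budget = (1 - SLO_TARGET) * WINDOW_REQUESTS
budget_spent = FAILED_REQUESTS / error_budget  # fraction of budget consumed

print(f"Error budget: {error_budget:.0f} failed requests")
print(f"Budget consumed: {budget_spent:.1%}")

# A simple policy hook: when the budget is nearly exhausted, the feedback
# loop kicks in and shifts engineering priorities towards reliability.
if budget_spent > 0.9:
    print("Freeze risky rollouts; prioritize reliability work")
```

With these example numbers, 42% of the budget is spent, so feature rollouts could continue; had failures exceeded 90% of the budget, the policy would trigger the "consequences" described above.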
Consultant SREs, in contrast, help teams draft their reliability goals and may develop shared monitoring infrastructure for measuring them across the organization. They are the de-facto regulators of the reliability feedback loop and maintain the policy documents that underpin it. Their connection to many teams and services gives them broader insights that can spark cross-functional reliability improvements.
Principle #3: SREs engage early and comprehensively
As we said earlier, SREs should be empowered to make tomorrow better than today. Without the ability to change the code and configuration of the services they support, they cannot fix problems as they encounter them. Involving SREs earlier in the design process can head off common reliability anti-patterns that are costly to correct post-facto. And, with the ability to influence architectural decision making, SREs can drive convergence across an organization so that work to increase the reliability of one service can benefit the entire company.
Is your team actively working to make tomorrow better than today? Ask yourself these questions, which go from fine detail to a broad, high-level scope:
Do you engineer your service now to improve its reliability, e.g. by viewing and modifying the source code and/or configuration?
Are you involved in analysis and design of future iterations of your service, providing a lens on reliability/operability/maintainability?
Can you influence your organization’s wider architectural decision making?
Advising other teams naturally shifts priorities away from directly modifying the configuration or code of individual services. But consultant SREs may still maintain frameworks or shared libraries providing core reliability features, like exporting common metrics or graceful service degradation. Their breadth of engagements across many teams makes them naturally suited for providing high-level architectural advice to improve reliability across an entire organization.
Principle #4: SREs automate anything repetitive
Finally, SREs believe that computers are fundamentally better suited to doing repetitive work than humans are. People often underestimate the returns on investment when considering whether to automate a routine task, and that's before factoring in the exponential growth curve that comes with running a large, successful service. Moreover, computers never become inattentive and make mistakes when doing the same task for the hundredth time, or become demoralized and quit. Hiring or training SREs is expensive and time-consuming, so a successful SRE organization depends heavily on making computers do the grunt work.
Are you sufficiently automating your work? Ask yourself these questions:
Do you use—or create—automation and other tools to ensure that operational load won't scale linearly with organic growth or the number of services you support?
Do you try to measure repetitive work on your team, and reduce it over time?
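One way to stop underestimating the returns on automation is to do the break-even arithmetic explicitly. The sketch below is a back-of-the-envelope model under assumed numbers (all parameters are hypothetical); the compounding `growth_rate` captures the point above that toil tends to grow along with a successful service.

```python
# Back-of-the-envelope break-even estimate for automating a repetitive task.
# All numbers are hypothetical; growth_rate models organic service growth.

def payback_weeks(manual_minutes: float, runs_per_week: float,
                  automation_hours: float, growth_rate: float = 0.02) -> int:
    """Weeks until time saved by automation exceeds the time to build it.

    growth_rate is the assumed weekly increase in task frequency
    (compounding), since toil tends to grow with a successful service.
    Returns -1 if automation doesn't pay back within ten years.
    """
    cost = automation_hours * 60.0  # one-off automation cost, in minutes
    saved = 0.0
    runs = runs_per_week
    for week in range(1, 521):      # cap the search at ten years
        saved += manual_minutes * runs
        if saved >= cost:
            return week
        runs *= 1 + growth_rate     # task frequency grows with the service
    return -1

# e.g. a 15-minute task done 10x/week, costing 40 hours to automate:
print(payback_weeks(15, 10, 40))
```

In this example the automation pays for itself in a few months, and every run after that is pure savings; factoring in growth shortens the payback further, which is exactly the effect people tend to leave out when judging whether a task is "worth automating".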
A call for reflection
Most blog posts end with a call for action. We'd rather you took time to reflect, instead of jumping up to make changes straight after reading. There's a risk, when writing an opinionated piece like this, that the lines drawn in the sand are used to divide, not to grow and improve. We promise this isn't a deliberate effort to gatekeep SRE and exclude those who don't tick the boxes; we see no value in that. But in some ways gatekeeping is what job roles are designed to do, because specialization and the division of labour is critical to the success of any organization, and this makes it hard to avoid drawing those lines.
For those who aspire to call themselves SREs, or are concerned that others may disagree with their characterization of themselves as SREs, perhaps these opinions can assuage some of that existential dread.