Do you have an SRE team yet? How to start and assess your journey
We’re pleased to announce that The Site Reliability Workbook is available in HTML now! Site Reliability Engineering (SRE), as it has come to be generally defined at Google, is what happens when you ask a software engineer to solve an operational problem. SRE is an essential part of engineering at Google. It’s a mindset, and a set of practices, metrics, and prescriptive ways to ensure systems reliability. The new workbook is designed to give you actionable tips on getting started with SRE and maturing your SRE practice. We’ve included links to specific chapters of the workbook that align with our tips throughout this post.
We’re often asked what implementing SRE means in practice, since our customers face challenges quantifying their success when setting up their own SRE practices. In this post, we’re sharing a couple of checklists to be used by members of an organization responsible for any high-reliability services. These will be useful when you’re trying to move your team toward an SRE model. Implementing this model at your organization can benefit both your services and teams due to higher service reliability, lower operational cost, and higher-value work for the humans.
But how can you tell how far you have progressed along this journey? While there is no simple or canonical answer, the non-exhaustive checklists below can help you gauge your progress. They are organized in ascending order of team maturity. Within each checklist, the items are roughly in chronological order, though we recognize that any given team’s actual needs and priorities may vary.
If you’re part of a mature SRE team, these checklists can be useful as a form of industry benchmark, and we’d love to encourage others to publish theirs as well. Of course, SRE isn’t an exact science, and challenges arise along the way. You may not get to 100% completion of the items here, but we’ve learned at Google that SRE is an ongoing journey.
SRE: Just getting started
The following three practices are key principles of SRE, but they can largely be adopted by any team responsible for production systems, regardless of its name, before and in parallel with staffing an SRE team.
- Some service-level objectives (SLOs) have been defined (jointly with developers and business owners, if you aren’t part of one of these groups) and are met most months.
- There's a culture of authoring blameless postmortems.
- There's a process to manage production incidents. It may be company-wide.
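To make "SLOs are defined and met most months" concrete, here is a minimal Python sketch of how a team might check a request-based SLO. The function names, the 99.9% target, and the event counts are assumptions chosen for illustration; they are not taken from the workbook.

```python
from typing import List, Tuple


def sli(good_events: int, total_events: int) -> float:
    """Measured fraction of good events; treated as 1.0 when there was no traffic."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def slo_met(good_events: int, total_events: int, target: float) -> bool:
    """True if the measured SLI reaches the SLO target (e.g. 0.999)."""
    return sli(good_events, total_events) >= target


def error_budget_remaining(good_events: int, total_events: int, target: float) -> float:
    """Fraction of the period's error budget still unspent (negative if overspent).

    Assumes target < 1.0 and total_events > 0, so the budget is non-zero.
    """
    allowed_bad = (1.0 - target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


def fraction_of_months_met(months: List[Tuple[int, int]], target: float) -> float:
    """Share of months, given as (good, total) pairs, in which the SLO was met."""
    met = sum(1 for good, total in months if slo_met(good, total, target))
    return met / len(months)


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 500 of them failed.
    print(slo_met(999_500, 1_000_000, 0.999))
    print(error_budget_remaining(999_500, 1_000_000, 0.999))
```

Tracking the remaining error budget alongside the pass/fail result gives developers and business owners a shared number to discuss when they review the SLO.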
Beginner SRE teams
Most, if not all, SRE teams at Google have established the following practices and characteristics. We generally view these as fundamental to an effective SRE team, unless there are good reasons why they aren’t feasible for a specific team’s circumstances.
- A staffing and hiring plan is in place and funding has been approved.
- Once staffed, the team may be on-call for some services while taking at least part of the operational load (toil).
- There is documentation for the release process and for service setup and teardown (and failover, if applicable).
- A canary process for releases has been evaluated as a function of the SLO.
- A rollback mechanism is in place where it’s applicable (though it’s understood that this is a nontrivial exercise when mobile applications are involved, for example).
- An operational playbook/runbook exists, even if it is incomplete.
- Theoretical (role-playing) disaster recovery testing takes place, at least annually.
- SRE plans and executes project work that may not be immediately visible to its developer counterparts, such as operational load reduction efforts that may not need developer buy-in.
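One way to read "a canary process evaluated as a function of the SLO" is a release gate that compares the canary's observed error ratio with the error ratio the SLO allows. The sketch below is only an illustration under that assumption. The `Verdict` enum, `canary_verdict` function, and `min_requests` threshold are hypothetical names and values, and a real canary analysis would usually also compare against a baseline.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    INSUFFICIENT_DATA = "insufficient_data"


def canary_verdict(errors: int, requests: int, slo_target: float,
                   min_requests: int = 1000) -> Verdict:
    """Gate a release on whether the canary stays within the SLO's allowed error ratio.

    min_requests is an assumed safeguard: with too little traffic the
    observed ratio is too noisy to support a decision either way.
    """
    if requests < max(min_requests, 1):
        return Verdict.INSUFFICIENT_DATA
    allowed_error_ratio = 1.0 - slo_target
    observed_error_ratio = errors / requests
    if observed_error_ratio > allowed_error_ratio:
        return Verdict.FAIL
    return Verdict.PASS


if __name__ == "__main__":
    # Hypothetical canary: 5,000 requests against a 99.9% availability SLO.
    print(canary_verdict(errors=2, requests=5000, slo_target=0.999))
    print(canary_verdict(errors=10, requests=5000, slo_target=0.999))
```

A `FAIL` verdict is the natural trigger for the rollback mechanism mentioned in the checklist.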
The following practices are also common for SRE teams starting out. If they don’t exist, that can be a sign of poor team health and sustainability issues:
- Enough on-call load to exercise incident response procedures on a regular (e.g., weekly) basis.
- An SRE team charter that’s been reviewed by the appropriate leadership beyond SRE (e.g., the CTO).
- Periodic meetings between SRE and developer leadership to discuss issues and goals and share information.
- Project planning and execution are done jointly by developers and SRE. SRE work and its positive impact are visible to developer leadership.
Intermediate SRE teams
These characteristics are common in mature teams and generally indicate that the team is taking a proactive approach to efficient management of its services.
- There are periodic reviews of SRE project work and impact with business leaders.
- There are periodic reviews of SLIs and SLOs with business leaders.
- There’s a low volume of toil overall (<=50% of the team’s time), and it is measured beyond “just” low on-call load.
- The team has established an approach to configuration changes that takes reliability into account.
- SREs have established a plan to scale their impact beyond adding scope or services to their on-call load.
- There's a rollback mechanism in case of canary failures. It may be automated.
- There is periodic testing of incident management, using a combination of role-playing with some automation in place.
- There’s an escalation policy tied to SLO violations; this might be a release process freeze/unfreeze, or something else. Check out our previous post on the possible consequences of SLO violations.
- There are periodic reviews of postmortems and action items that are shared between developers and SRE.
- Disaster recovery is periodically tested against non-production environments.
- Teams measure demand vs. capacity and use active forecasting to determine when demand might exceed capacity.
- The SRE team may produce long-term plans (e.g., a yearly roadmap) jointly with developers.
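The demand-versus-capacity item above can be approximated with a simple trend forecast. The following Python sketch fits a straight line to past demand and estimates how many periods remain before the forecast reaches usable capacity. The linear model, the `headroom` parameter, and the sample numbers are all assumptions for illustration; real forecasting often needs to account for seasonality and launches.

```python
import math
from typing import Optional, Sequence, Tuple


def fit_line(ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = intercept + slope * x for x = 0, 1, 2, ..."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until_capacity(demand_history: Sequence[float], capacity: float,
                           headroom: float = 0.0) -> Optional[int]:
    """Periods after the last observation until forecast demand reaches usable capacity.

    Returns 0 if the fitted trend is already at or above usable capacity,
    and None if demand is flat or declining. headroom is the fraction of
    capacity deliberately kept in reserve.
    """
    if len(demand_history) < 2:
        raise ValueError("need at least two observations to fit a trend")
    slope, intercept = fit_line(demand_history)
    usable = capacity * (1.0 - headroom)
    last_x = len(demand_history) - 1
    if intercept + slope * last_x >= usable:
        return 0
    if slope <= 0:
        return None
    crossing_x = (usable - intercept) / slope
    return math.ceil(crossing_x - last_x)


if __name__ == "__main__":
    # Hypothetical monthly peak demand (requests per second) and total capacity.
    print(periods_until_capacity([100, 110, 120, 130], capacity=200))
```

Comparing the result with the lead time for provisioning new capacity shows whether an order needs to be placed now or can wait.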
Advanced SRE teams
These practices are common in more senior teams, or can sometimes be achieved when an organization or set of SRE teams shares a broader charter.
- At least some individuals on the team can claim major positive impact on some aspect of the business beyond firefighting or ops.
- Project work can be, and often is, executed horizontally, positively impacting many services at once rather than scaling linearly (or worse) with the number of services.
- Most service alerts are based on SLO burn rate.
- Automated disaster recovery testing is in place and positive impact can be measured.
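For the burn-rate item above, a burn rate is the observed error ratio divided by the error ratio the SLO allows, so a burn rate of 1 spends exactly the whole error budget over the SLO period. The Python sketch below shows a two-window paging check. The 14.4 threshold follows the workbook's example for a 99.9% SLO over 30 days (2% of the budget consumed in one hour), but the function names and the way error ratios are passed in are assumptions for illustration.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the service is failing."""
    return error_ratio / (1.0 - slo_target)


def budget_consumed(rate: float, window_hours: float, period_hours: float = 720.0) -> float:
    """Fraction of the period's error budget spent if `rate` is sustained for the window.

    period_hours defaults to 720, i.e. a 30-day SLO period.
    """
    return rate * window_hours / period_hours


def should_page(long_window_error_ratio: float, short_window_error_ratio: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both the long (e.g. 1h) and short (e.g. 5m) windows burn too fast.

    Requiring the short window as well lets the alert stop firing soon
    after the problem is fixed.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    # Hypothetical: 2% of requests failing in both windows against a 99.9% SLO.
    print(should_page(0.02, 0.02, slo_target=0.999))
    # Long window still elevated, but the short window has recovered.
    print(should_page(0.02, 0.0005, slo_target=0.999))
```

Alerting on burn rate ties pages directly to the risk of missing the SLO rather than to raw error counts, which is why it tends to appear late in a team's maturity.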
Another set of SRE “features” that may be desirable, but that most companies are unlikely to implement, is:
- SREs are not on-call 24x7. SRE teams are geographically distributed in two locations, such as U.S. and Europe. It’s worth pointing out that neither half is treated as secondary.
- SRE and developer organizations share common goals and may have separate reporting chains up to SVP level or higher. This arrangement helps to avoid conflicts of interest.
What should I do next?
Once you’ve looked through these checklists, your next step is to think about whether they match your company’s needs.
For those without an SRE team, where most of the beginner list is unfilled, we’d highly recommend reading the associated SRE Workbook chapters in the order they have been presented. If you happen to be a Google Cloud Platform (GCP) customer and would like to request Customer Reliability Engineering (CRE) involvement, contact your account manager to apply for this program. But to be clear, SRE is a methodology that will work on a huge variety of infrastructures, and using Google Cloud is not a prerequisite for pursuing this set of engineering practices.
We’d also recommend attending existing conferences and organizing summits with other companies in order to share best practices on how to solve some of the blockers, such as recruiting.
We have also seen teams struggle to fill out the advanced list because of churn. The rate of systems and personnel changes may be a deterrent to getting there. To keep teams from reverting to the beginner stage, and to avoid other problems, our SRE leadership reviews key metrics per team every six months. The scope is narrower than the checklists above because several of the items have now become standard.
As you may have guessed by now, answering the central question in this article involves addressing and attempting to assess a given team’s impact, health, and most importantly, how the actual work is done. After all, as we wrote in our first book on SRE: "If we are engineering processes and solutions that are not automatable, we continue having to staff humans to maintain the system. If we have to staff humans to do the work, we are feeding the machines with the blood, sweat, and tears of human beings."
So yes, you might have an SRE team already. Is it effective? Is it scalable? Are people happy? Wherever you are in your SRE journey, you can likely continue to evolve, grow and hone your team’s work and your company’s services. Learn more here about getting started building an SRE team.
Thanks to Adrian Hilton, Alec Warner, David Ferguson, Eric Harvieux, Matt Brown, Myk Taylor, Stephen Thorne, Todd Underwood and Vivek Rau among others for their contributions to this post.