Four steps to jumpstarting your SRE practice
A few months ago, we wrote about how the first step to implementing Site Reliability Engineering (SRE) in an organization is getting leadership on board. So, let’s assume that you’ve gone ahead and done that. Now what? What are some concrete steps you can take to get the SRE ball rolling? In this blog post, we’ll take a look at what you as an IT leader can do to fast-track SRE within your team.
Step 1: Start small and iterate
"Rome wasn't built in a day," the saying goes, but you do need to start somewhere. When it comes to implementing SRE principles, the approach that I (and my team) found to be the most effective is to start with a proof of concept, learn from our mistakes, and iterate!
Start by identifying a relevant application and/or team
There are many factors that go into choosing a specific team or application for your SRE proof of concept. Most of the time, though, this is a strategic decision for the organization, which is outside the scope of this article. Typical triggers include a team shifting from traditional operations or DevOps to SRE, or a need to increase the reliability of a business-critical product. No matter the reason, it’s crucial to select an application that is:
Critical to the business. Your customers should care deeply about its uptime and reliability.
Currently in development. Pick an application in which the business is actively investing resources.
Instrumented. In a perfect world, the application already provides data and metrics about its behaviour.
Conversely, stay away from proprietary software. If the application wasn’t built by you, it's not a good candidate for SRE! You need the ability to make strategic decisions about—and engineering changes to—the application as needed.
Pro tip: In general, if you have workloads both on-premises and in the cloud, try to start with the cloud-based app. If your engineers come from a traditional operations environment, shifting their thinking away from 'bare metal' and infrastructure metrics will be easier for a cloud-based app, as managed infrastructure turns practitioners into users and forces them to consume it like developers (APIs, infrastructure as code, etc.).
Remember: Set realistic goals. Discouraging your team with unrealistic expectations early on will have a negative effect on the initiative.
Step 2: Empower your teams
Implementing SRE principles requires fostering a learning culture. In that regard, team enablement means both training your teams, so they have the knowledge they need, and empowering them.
Building a training program is a topic in and of itself, but it’s important to think about an enablement strategy at an early stage. Especially in large organizations, you need to address topics like internal upskilling, hiring and scaling the team as well as onboarding and creating a learning community.
Your enablement strategy should also accommodate employees at different levels and in different functions. For example, senior leadership's training will look very different from practitioners’ training. Leadership's education should be sufficient to get buy-in and to be able to make organizational decisions. To drive change in the entire organization, additional training for leadership on cultural concepts and practices might be required.
When it comes to engineering leadership and/or middle management (managers that manage managers), training should be a combination of high-level cultural concepts to help foster the required culture, and technical SRE practices that are deep enough to understand prioritization, resource allocation, process creation, and future needs.
When it comes to practitioners, ideally you want the entire organization to be aligned both in knowledge and in culture. But as mentioned earlier, it’s best to start simple, with just one team.
The starting point for those teams should be to understand reliability and key concepts like service level agreements (SLAs), service level objectives (SLOs), service level indicators (SLIs) and error budgets. These are important because SRE is focused on the customer experience. Measuring whether systems meet customer expectations requires a shift in mindset and can take time.
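To make these concepts a little more concrete, here is a minimal Python sketch of how an SLI and an error budget might be computed. The 99.9% target, the 30-day window, the request counts, and all names in the code are illustrative assumptions rather than values from any real service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A hypothetical availability SLO: a target ratio over a rolling window."""
    target: float      # e.g. 0.999 means 99.9% of requests should succeed
    window_days: int   # length of the rolling window, in days

    def __post_init__(self):
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def sli(good_events: int, total_events: int) -> float:
    """SLI as the fraction of good events (treated as 1.0 when there is no traffic)."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def allowed_downtime_minutes(slo: Slo) -> float:
    """Time-based error budget: minutes of full outage the SLO tolerates per window."""
    return (1.0 - slo.target) * slo.window_days * 24 * 60


def budget_remaining(slo: Slo, good_events: int, total_events: int) -> float:
    """Fraction of the event-based error budget still unspent (negative if overspent)."""
    if total_events == 0:
        return 1.0
    allowed_bad = (1.0 - slo.target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


# Assumed example: 99.9% availability over 30 days, 1,000,000 requests, 500 failures.
slo = Slo(target=0.999, window_days=30)
print(sli(999_500, 1_000_000))                          # 0.9995
print(round(allowed_downtime_minutes(slo), 1))          # about 43.2 minutes
print(round(budget_remaining(slo, 999_500, 1_000_000), 2))  # about 0.5, half the budget left
```

In this sketch the SLI (99.95%) is above the SLO target (99.9%), so the team has spent roughly half of its error budget for the window.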
After identifying your first application and/or the team responsible for it, it's time to identify the app’s user journeys and rank them according to business impact. A user journey is the set of interactions a user has with a service to achieve a single goal, for example a single click or a multi-step pipeline. The most critical ones are called Critical User Journeys (CUJs), and these are where you should start drafting SLOs and SLIs.
Pro tip: There are some general technical practices that can help you embrace SRE faster. For example, using fewer repos rather than more can help you reduce silos within the organization and better utilize resources.
Likewise, prioritizing automatic processes and self-healing systems can benefit not only reliability but also team satisfaction, helping the organization retain talent.
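As a rough illustration of the ranking step, the sketch below sorts a handful of journeys by a business-impact score and picks the top ones as CUJs. The journey names, the 1-to-5 scoring scale, and the `top_n` cutoff are all hypothetical; in practice the ranking comes from a conversation with product and business owners, not from a script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserJourney:
    name: str
    business_impact: int  # hypothetical 1-5 score agreed with product owners


def critical_user_journeys(journeys, top_n=2):
    """Return the top_n journeys by business impact, highest first."""
    ranked = sorted(journeys, key=lambda j: j.business_impact, reverse=True)
    return ranked[:top_n]


# Made-up journeys for an imaginary online shop.
journeys = [
    UserJourney("browse catalog", 3),
    UserJourney("checkout", 5),
    UserJourney("update profile", 1),
    UserJourney("search", 4),
]

cujs = critical_user_journeys(journeys)
print([j.name for j in cujs])  # ['checkout', 'search']
```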
Final note: Similar to the way that you make architecture decisions, your chosen technology, solutions and implementation tools should enable you to do what you are trying to do and not vice versa.
Step 3: Scale those learnings
After you establish these SRE practices with one or a few teams, the next step is to think about building an SRE community and formalized processes across the organization. In some organizations, you can do this in parallel to the end of step 2, and in some organizations, only after you have a few successful implementations under your belt.
In this phase, you’ll probably want to address community, culture, enablement and processes. You will need to address them all, especially as they are intertwined, but which one you prioritize will depend on your organization.
Creating an SRE community in the organization is important not only for learning, but also for establishing a knowledge base of best practices, training subject-matter experts, helping create needed guardrails, and aligning processes.
Building a community goes hand in hand with fostering an empowered culture and training teams. The idea is that early adopters are ambassadors for SRE who share their learnings and train other teams in the organization.
It is also useful to identify potential ambassadors or champions in individual development teams who are passionate about SRE and will help with the adoption of those practices.
It is also crucial to create repeatable trainings for each functional role, including onboarding sessions. Onboarding new team members is a critical aspect of training and fostering an empowered SRE culture. Therefore it is vital to be mindful about your onboarding process and make sure that the knowledge is not lost when team members change roles.
During this phase, you also want to foster an org-wide culture that promotes psychological safety, accepts failure as normal and enables the team to learn from mistakes. For that, leadership must model the desired culture and promote transparency.
Finally, having structured and formalized processes can help reduce the stress around emergency response—especially being on-call. Processes can also provide clarity and make teams more collaborative and effective.
To have the most impact, start by prioritizing the most painful areas under your team’s remit. For example, clean up noisy alerts to avoid (or address) alert fatigue, automate your change management processes, and involve only the necessary people to save team bandwidth. Team members shouldn't work on software engineering projects while doing on-call incident management, and vice versa; make sure they have enough bandwidth to do both, separately. As in other areas, use data to drive your decisions: identify which activities your teams spend the most time on, and how much time each one takes.
If you find it challenging to collect this kind of data, be it quantitative or qualitative, a good starting point is often your emergency response processes, especially the escalation process, incident management and related policies, as those have a direct impact on the business.
Pro tip: All the above practices contribute to reducing silos and aligning goals across the organization. Those goals should also extend to your vendors and engineering partners, so make sure your contracts with them capture those goals as well.
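One way to start using data for the alert cleanup mentioned above is to look for alerts that fire often but rarely require action. The following Python sketch assumes you can export an alert log as (alert name, was it actionable) pairs; the log format, the alert names, and the thresholds are assumptions for illustration, not a recommendation for specific values.

```python
from collections import Counter


def noisy_alerts(alert_log, min_fires=5, max_actionable_ratio=0.2):
    """Return names of alerts that fire often but rarely need human action.

    alert_log is an iterable of (alert_name, was_actionable) tuples.
    """
    fires = Counter()
    actionable = Counter()
    for name, was_actionable in alert_log:
        fires[name] += 1
        if was_actionable:
            actionable[name] += 1
    noisy = [
        name
        for name, count in fires.items()
        if count >= min_fires and actionable[name] / count <= max_actionable_ratio
    ]
    return sorted(noisy)


# Made-up log: one chatty alert, one useful alert, one rare alert.
log = (
    [("disk_usage_high", False)] * 9
    + [("disk_usage_high", True)]
    + [("checkout_errors", True)] * 6
    + [("cpu_spike", False)] * 3
)
print(noisy_alerts(log))  # ['disk_usage_high']
```

Alerts flagged this way are candidates for tuning, downgrading to a ticket, or removal; the decision itself still belongs to the team that owns the service.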
Step 4: Embody a data-driven mindset
Starting the SRE journey can take time, even if you're just implementing it for one team. Two quick wins that make a positive impact are collecting data and running blameless postmortems.
In SRE we try to be as data-driven as possible, so creating a measurement culture in your organization is crucial. When prioritizing data collection, ideally look for data that represents the customer experience. Collecting that data will help you identify your gaps and prioritize according to business needs and, by extension, your customers' expectations.
Another thing that you can do is run or improve postmortems, which are an essential way of learning from failure and fostering a strong SRE culture. From our experience, even organizations that do run postmortems can benefit from them much more with a few minor improvements. It is important to remember that postmortems should be blameless in order to make the team feel safe to share and learn from failures. And to make tomorrow better than today, i.e., not repeat the same problems, it’s important that postmortems include action items and that each action item is assigned to an owner.
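The "actionable" half of that advice is easy to check mechanically. Below is a minimal Python sketch of a postmortem record with a review helper that flags missing action items or owners. The field names and the example incident are hypothetical, and blamelessness itself is a matter of how the document is written and discussed, which no script can verify.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None  # a team or role, to keep the record blameless


@dataclass
class Postmortem:
    title: str
    summary: str
    action_items: List[ActionItem] = field(default_factory=list)


def review_problems(pm: Postmortem) -> List[str]:
    """List reasons a postmortem is not yet actionable (empty list means OK)."""
    problems = []
    if not pm.action_items:
        problems.append("no action items")
    for item in pm.action_items:
        if not item.owner:
            problems.append(f"action item without owner: {item.description}")
    return problems


# Made-up incident for illustration.
pm = Postmortem(
    title="Checkout latency spike",
    summary="A config rollout exhausted the connection pool.",
    action_items=[
        ActionItem("Add canary stage to config rollouts", owner="platform team"),
        ActionItem("Alert on connection pool saturation"),
    ],
)
print(review_problems(pm))
# ['action item without owner: Alert on connection pool saturation']
```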
Creating a shared repository for postmortems can have a tremendous impact on the team: it increases transparency, reduces silos, and contributes to the learning culture. It also shows the team that the organization “practices what it preaches.” Implementing a repository can be as easy as creating a shared drive.
Pro tip: Postmortems should be blameless and actionable.
On the SRE fast track
Of course, no two organizations are alike, and no two SRE teams are either. But by following these steps, you can help get your team on the path to SRE success faster. To learn more about developing an effective SRE practice, check out the following resources.