DevOps & SRE

5 resources to help you get started with SRE

April 23, 2021

The Google Cloud content marketing team

Site reliability engineering (SRE) is an essential part of engineering at Google: a mindset combined with a set of practices, metrics, and prescriptive ways to ensure system reliability. But not everyone knows the best places to start when implementing SRE in their own organization. Here are our top resources at Google Cloud for getting started.

1. Do you have an SRE team yet? How to start and assess your journey

We’re often asked what implementing SRE means in practice, because customers often find it hard to quantify their success when setting up their own SRE practices. In this post, we share a couple of checklists for members of an organization who are responsible for high-reliability services. These will be useful when you’re trying to move your team toward an SRE model. Implementing this model at your organization can benefit both your services and your teams through higher service reliability, lower operational cost, and higher-value work for everyone on the team.

This post shares checklists you can use when you’re trying to move your team toward an SRE model. These checklists can be useful as a form of industry benchmark.

By Gustavo Franco • 6-minute read

2. SRE fundamentals: SLIs, SLAs and SLOs

Core to the definition of SRE is the idea that metrics should be closely tied to business objectives. Thus, a big part of the day-to-day work of SREs is establishing and monitoring these service-level metrics. At Google, we use several essential measurements (SLIs, SLOs, and SLAs) in SRE planning and practice. This post gives you an overview of what each of these acronyms stands for, what it means, and how to incorporate it.
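To make the three terms concrete, here is a minimal sketch of how they relate, and it is not taken from the linked post. It assumes a request-based availability SLI (good requests divided by total requests) and a hypothetical SLO target of 99.9%. The names `Window`, `sli`, `meets_slo`, and `error_budget_remaining`, and all the numbers, are made up for illustration. An SLA would typically be a looser, contractual threshold with consequences attached, so it is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts over one measurement window (hypothetical data)."""
    good_events: int
    total_events: int


def sli(window: Window) -> float:
    """Service-level indicator: the measured fraction of good events."""
    if window.total_events == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return window.good_events / window.total_events


def meets_slo(window: Window, slo_target: float) -> bool:
    """Service-level objective: the target the SLI is compared against."""
    return sli(window) >= slo_target


def error_budget_remaining(window: Window, slo_target: float) -> float:
    """Fraction of the allowed bad events that has not been used yet."""
    allowed_bad = (1.0 - slo_target) * window.total_events
    actual_bad = window.total_events - window.good_events
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else 0.0
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    window = Window(good_events=999_600, total_events=1_000_000)
    target = 0.999  # assumed SLO: 99.9% of requests succeed
    print(f"SLI: {sli(window):.4%}")
    print(f"Meets SLO: {meets_slo(window, target)}")
    print(f"Error budget remaining: {error_budget_remaining(window, target):.0%}")
```

In this made-up window, 400 of the roughly 1,000 permitted failures have occurred, so about 60% of the error budget remains. A negative result would mean the budget is overspent.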

A big part of SRE is establishing and monitoring service-level metrics like SLOs, SLAs and SLIs. This post gives you an overview of what each of these acronyms stands for, what it means, and how to use it.

By Jay Judkowitz • 5-minute read

3. How SRE teams are organized, and how to get started

You know what SREs do and understand which best practices should be implemented at various levels of SRE maturity. Now you’re ready to take the next step by setting up your own SRE team. In this post, we’ll cover how different implementations of SRE teams establish boundaries to achieve their goals. We describe six different implementations that we’ve experienced, and what we have observed to be their most important pros and cons.

Learn about six different implementations of SRE teams you can apply in your organization, and how each one establishes boundaries to achieve its goals.

By Gustavo Franco • 10-minute read

4. Meeting reliability challenges with SRE principles

Through years of work using SRE principles, we’ve found there are a few common challenges that teams face, and some important ways to meet or avoid those challenges. Learn what we at Google think are the three top sources of production stress and how we recommend addressing them.

Following SRE principles can help you build reliable production systems. When getting started, you may encounter three common challenges. Here’s how to solve them.

By Cheryl Kang • 6-minute read

5. Transitioning a typical engineering ops team into an SRE powerhouse

Perpetually adding engineers to ops teams to meet customer growth doesn’t scale. Google’s SRE principles can help, bringing software engineering solutions to operational problems. In this post, we’ll take a look at how we transformed our global network ops team by abandoning traditional network engineering orthodoxy and replacing it with SRE. You’ll learn how Google’s production networking team tackled this problem and consider how you might incorporate SRE principles in your own organization.

Moving a network operations team to an SRE-driven model took some time, but was well worth the effort, as teams can focus on reliability rather than hardware.

By Tom Wright • 7-minute read

Lots more to read

Can’t wait to read more about SRE? We wrote an entire book on SRE to help you get started (actually, we’ve written more than one). You can also find all our DevOps and SRE blog content or follow our columns on Customer Reliability Engineering.
