Helping SaaS partners run reliably with new SRE tools and training
Our Customer Reliability Engineering (CRE) team is on a mission to help make everyone more reliable by making it easy to adopt Site Reliability Engineering (SRE) principles and practices. Lately, we’ve been spending a lot of time with our SaaS company partners, helping them reduce the operational burden on their systems, become more agile, and run reliable services for their users and customers.
We’ve been doing this work with these SaaS partners for more than a year now, and we’ve learned some lessons along the way:
- Most companies are still in the early stages of their SRE journey. Interest in learning more about SRE principles, best practices, and tooling is coming from a wide variety of roles, many of which aren't specifically called "SRE." We’ve gotten consistent feedback that companies want self-paced, interactive online resources, such as a Coursera course, to learn more about SRE.
- While companies have unique combinations of customer requirements and solutions, we’ve found that they share many common architectural patterns as they relate to their customers' experiences. Overwhelmingly, customers want to be able to build service-level objectives (SLOs) quickly and effectively.
- The concept of reliability goes beyond defining and monitoring metrics. We’ve heard that companies want to prevent unanticipated failures and build resilient systems that can gracefully handle previously unknown failure modes when they first occur. They also want to take advantage of the collective knowledge and experience of Google engineers.
As we continue our mission to help all SaaS companies operate reliably on Google Cloud, we have been working on several ways to make it easy for newcomers to get started on their SRE journey.
Introducing a new Coursera course on Site Reliability Engineering
We want to make it easy for developers to start learning the basics of SRE and to help the larger SRE community establish baselines. We designed this new course to distill years of collective Google SRE experience in designing and managing complex systems that meet their reliability targets. We hope that it helps developers learn at their own pace and provides insight for new and experienced SREs alike. You can enroll in the course here.
Introducing SLO Guide, a tool that helps you discover what you should measure
At Google, we’ve always believed in building tools to solve complex problems at scale. A goal of our CRE team, our first customer-facing SRE team, is to help every single SaaS company in the world run reliably on Google Cloud Platform (GCP). In pursuit of this mission, we’ve built SLO Guide, a new tool to help SaaS companies discover what they should measure based on common architectures and critical user journeys (CUJs). Simply put, it will help you quickly create SLOs that measure what your users actually care about.
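To make the idea of an SLO tied to a user journey more concrete, here is a minimal Python sketch. It is not SLO Guide's output or API: the `Slo` class, the `checkout` journey, the 99.9% target, and the event counts are all hypothetical values chosen for illustration. It shows a common way to express a request-based SLO, where the indicator is the fraction of good events and the error budget is the share of bad events the target allows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A hypothetical SLO for one critical user journey (CUJ)."""
    journey: str
    target: float  # fraction of events that must be good, e.g. 0.999


def sli(good_events: int, total_events: int) -> float:
    """Service-level indicator: the fraction of events that were good."""
    if total_events == 0:
        return 1.0  # no traffic means nothing failed
    return good_events / total_events


def error_budget_remaining(slo: Slo, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = (1.0 - slo.target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # A 100% target (or zero traffic) leaves no room for bad events.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


# Illustrative numbers only: 1,000,000 checkout attempts, 500 of them failed.
checkout = Slo(journey="checkout", target=0.999)
measured = sli(good_events=999_500, total_events=1_000_000)
remaining = error_budget_remaining(checkout, 999_500, 1_000_000)

print(f"{checkout.journey}: SLI={measured:.4%}, budget left={remaining:.0%}")
```

With these made-up numbers, the journey meets its target and has spent roughly half of its error budget for the window. In practice, the hard part is choosing which journeys and events to count, which is the discovery problem described above.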
The SRE course and SLO Guide are available now as two of the key benefits for our Google Cloud SaaS partners. If you’re an existing partner, you can request access to the tool here. If you’re not a Google Cloud SaaS partner yet, you can become one here.