Site Reliability Engineering (SRE)
Reap the benefits of speed
Automate end to end, from writing code to running services in production. Align dev and ops around shared goals to go faster. Connect to the tools you love, including incident management tools, and minimize toil.
Improve reliability with proven SRE principles
Leverage SRE principles developed at Google and proven to work at scale. Easily implement SRE best practices with Google Cloud’s operations suite to speed up problem resolution and improve reliability.
We meet you where you are in your SRE journey
Drive higher software delivery performance, regardless of company size, industry, or whether you run on VMs, Kubernetes, or serverless. Choose from free tools or paid offerings to jump-start your SRE journey.
Google Site Reliability Engineering
Access the SRE books, hear from SREs, and learn how we practice SRE at Google.
Creating an SLO
To monitor a service, you need at least one service-level objective (SLO). Learn step by step how to create your first SLO in Cloud Monitoring.
Hands-on labs: Troubleshooting workloads on GKE for SREs
Learn how to navigate the GKE resource pages, use the GKE dashboard, create logs-based metrics, create an SLO, and define an alert to notify SRE staff of incidents.
Engineering for reliability
Learn how to define and defend your SLOs in Google Cloud's operations suite and improve observability of your applications running in Google Cloud.
SRE: Measuring and managing reliability
This course teaches the theory of service-level objectives (SLOs), a principled way of describing and measuring the desired reliability of a service.
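As a rough illustration of what "describing and measuring" reliability can look like, here is a minimal Python sketch. It assumes a request-based service where each request counts as good or bad. The function names and sample numbers are hypothetical and are not taken from the course or from Cloud Monitoring. The "allowed bad events" figure is what SRE practice commonly calls an error budget.

```python
def sli(good_events: int, total_events: int) -> float:
    """Measured reliability: the fraction of events that were good."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def allowed_bad_events(slo_target: float, total_events: int) -> float:
    """Bad events the SLO tolerates in the window (the error budget)."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return (1.0 - slo_target) * total_events


def meets_slo(good_events: int, total_events: int, slo_target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli(good_events, total_events) >= slo_target


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 of them failed.
    total, bad = 1_000_000, 250
    target = 0.999  # a "three nines" SLO, chosen only for illustration

    print(f"SLI: {sli(total - bad, total):.5f}")
    print(f"Allowed bad events: {allowed_bad_events(target, total):.0f}")
    print(f"Meets SLO: {meets_slo(total - bad, total, target)}")
```

With these sample numbers the service tolerates roughly 1,000 failed requests in the window, so 250 failures leave it within its objective. A real SLO would also fix the compliance window and define precisely which requests count as good.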
Developing a Google SRE culture
This course introduces key practices of Google SRE and the important role IT and business leaders play in the success of SRE organizational adoption.