Site Reliability Engineering (SRE)

SRE is a job function, a mindset, and a set of engineering practices to run reliable production systems. Google Cloud helps you implement SRE principles through tooling, professional services, and other resources.
Sabre
Lowe’s
adeo
Zebra
Optiva
Procter & Gamble
TELUS
Ulta

Benefits

Strike the balance between speed and reliability

Reap the benefits of speed

Automate end to end, from writing code to running services in production. Align dev and ops around shared goals to go faster. Connect to the tools you love, including incident management tools, and minimize toil.

Improve reliability with proven SRE principles

Leverage SRE principles developed at Google and proven to work at scale. Easily implement SRE best practices with Google Cloud’s operations suite to speed up problem resolution and improve reliability.

We meet you where you are in your SRE journey

Drive higher software delivery performance, regardless of company size, industry, or whether you run on VMs, Kubernetes, or serverless. Choose from free tools or paid offerings to jump-start your SRE journey.

Key features

SRE tools and resources to make your operations and SRE teams run better

Monitor service health using SRE principles

Monitor the health of your services and work with developers to increase the velocity of changes using built-in support for service monitoring. Select metrics for SLIs, set SLOs, and track error budgets to mitigate risk for your service. Use powerful dashboards to aggregate metrics and logs, including the golden signals, so you can reduce MTTR and quickly answer questions about service health.
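To make the relationship between SLIs, SLOs, and error budgets concrete, here is a minimal sketch of the arithmetic for a request-based SLO. It is an illustration only, not how Cloud Monitoring computes these values internally; the class name `ErrorBudget` and its fields are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical helper for a request-based SLO (illustrative only)."""

    slo_target: float    # e.g. 0.999 means 99.9% of requests should be good
    total_requests: int  # requests observed during the SLO window
    bad_requests: int    # requests that failed the SLI criterion

    @property
    def sli(self) -> float:
        """Measured fraction of good requests in the window."""
        if self.total_requests == 0:
            return 1.0  # no traffic, so nothing has failed
        return (self.total_requests - self.bad_requests) / self.total_requests

    @property
    def allowed_bad(self) -> float:
        """Number of bad requests the SLO tolerates in the window."""
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        """Share of the error budget still unspent (negative if overspent)."""
        if self.allowed_bad == 0:
            return 0.0 if self.bad_requests else 1.0
        return 1.0 - self.bad_requests / self.allowed_bad


# Example: a 99.9% target over one million requests, 250 of which failed.
budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, bad_requests=250)
print(f"SLI: {budget.sli:.5f}")
print(f"Error budget remaining: {budget.remaining_fraction:.0%}")
```

A team might use a value like `remaining_fraction` to decide how much risk to take: plenty of budget left favors shipping changes faster, while a nearly spent budget favors reliability work.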

Out-of-the-box integrations to increase automation, reduce toil

Use our built-in integrations with the tools you love to troubleshoot incidents quickly. Implement progressive rollouts and roll back changes safely. Pre-built Cloud Build integrations let you build, test, and deploy artifacts to Google Kubernetes Engine, App Engine, Cloud Functions, Firebase, and Cloud Run as part of your CI/CD pipeline.
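As a rough illustration of what a progressive rollout with safe rollback can look like, the sketch below decides the next traffic percentage for a canary release based on its observed error rate. It is a simplified assumption of one possible policy, not a Cloud Build or Google Cloud API; the function name `next_rollout_step`, the step sizes, and the threshold are all hypothetical.

```python
def next_rollout_step(
    current_percent: int,
    canary_errors: int,
    canary_requests: int,
    max_error_rate: float,
    steps: tuple = (1, 10, 50, 100),
) -> int:
    """Return the next traffic percentage for the new version.

    Returns 0 to signal a rollback. All names and thresholds here are
    hypothetical and only illustrate the idea.
    """
    if canary_requests == 0:
        # No data yet: hold at the current stage rather than guess.
        return current_percent
    if canary_errors / canary_requests > max_error_rate:
        # The canary is unhealthy: send all traffic back to the old version.
        return 0
    for step in steps:
        if step > current_percent:
            return step  # healthy: advance to the next, larger stage
    return current_percent  # already fully rolled out


# Example: at 10% traffic with 1 error in 1,000 requests and a 1% threshold.
print(next_rollout_step(10, canary_errors=1, canary_requests=1000, max_error_rate=0.01))
```

In practice the error threshold could be derived from the service's SLO, so that a rollout pauses or rolls back before it consumes a significant share of the error budget.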

One integrated view for faster resolution

Get one unified view across logs, events, metrics, and SLOs. Get in-context observability data right within the service consoles of Google Kubernetes Engine, Cloud Run, Compute Engine, Anthos, and other runtimes. Collect metrics, traces, and logs with zero setup. Sub-second ingestion latency and a terabyte-per-second ingestion rate let you perform real-time log management and analysis at scale.

Get extra help from Google Cloud SRE specialists

If you would like more hands-on help along the way, we offer additional services, including Google consulting services. Reach out to sales to see which option would work for your organization. Learn from our Customer Reliability Engineering (CRE) team and from customer success stories how Google Cloud tools and practices have helped other companies implement SRE.

Drive SRE/developer collaboration to “shift-left” observability

With OpenTelemetry (OT) packages and Google Exporter, developers can instrument and export trace data to Cloud Trace. Our new unified Ops agent (in preview) collects metrics and logs and also supports OpenTelemetry to capture and transport metrics. We are working to implement OT libraries as out-of-the-box features in many of our cloud products. Cloud SQL Insights is one example of this effort.


Customers

Meeting customer demand with SRE practices

Documentation

Learn how to implement SRE at your organization with these resources

Best Practice
Google Site Reliability Engineering

Access the SRE books, hear from SREs, and learn how we practice SRE at Google.

Google Cloud Basics
Creating an SLO

To monitor a service, you need at least one service-level objective (SLO). Learn step by step how to create your first SLO in Cloud Monitoring.

Tutorial
Hands-on labs: Troubleshooting workloads on GKE for SREs

Learn how to navigate resource pages of GKE, use the GKE dashboard, create logs-based metrics, create an SLO, and define an alert to notify SRE staff of incidents.

Tutorial
Engineering for reliability

Learn how to define and defend your SLOs in Google Cloud's operations suite and improve observability of your applications running in Google Cloud.

Tutorial
SRE: Measuring and managing reliability

This course teaches the theory of service-level objectives (SLOs), a principled way of describing and measuring the desired reliability of a service.

Tutorial
Developing a Google SRE culture

This course introduces key practices of Google SRE and the important role IT and business leaders play in the success of SRE organizational adoption.
