SRE is a job function, a mindset, and a set of engineering practices to run reliable production systems. Google Cloud helps you implement SRE principles through tooling, professional services, and other resources.
Benefits
Reap the benefits of speed
Automate end to end, from writing code to running services in production. Align dev and ops around shared goals to go faster. Connect to the tools you love, including incident management, as you minimize toil.
Improve reliability with proven SRE principles
Leverage SRE principles developed at Google and proven to work at scale. Easily implement SRE best practices with Google Cloud’s Observability to speed up problem resolution and improve reliability.
We meet you where you are in your SRE journey
Drive higher software delivery performance, regardless of company size, industry, or whether you run on VMs, Kubernetes, or serverless. Choose from free tools or paid offerings to jump-start your SRE journey.
Key features
Monitor the health of your services and work with developers to increase the velocity of changes using built-in support for service monitoring. Select metrics for service-level indicators (SLIs), set service-level objectives (SLOs), and track error budgets to mitigate risk for your service. Use dashboards to aggregate metrics and logs, including the golden signals (latency, traffic, errors, and saturation), so you can answer questions about service health quickly and reduce MTTR.
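To make the error-budget idea concrete, here is a minimal sketch of the arithmetic behind it, assuming a request-based SLI (good requests divided by total requests). The function name `error_budget` and the `ErrorBudget` fields are hypothetical and chosen for illustration; this is not the Cloud Monitoring API, only the calculation such a tool would track.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    allowed_failures: float    # failures the SLO tolerates over the window
    consumed_fraction: float   # share of the budget already spent (can exceed 1.0)
    remaining_failures: float  # failures left before the SLO is missed (negative if overspent)


def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> ErrorBudget:
    """Compute a request-based error budget for one compliance window.

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    # The budget is the share of requests the SLO allows to fail.
    allowed = (1.0 - slo_target) * total_requests
    return ErrorBudget(
        allowed_failures=allowed,
        consumed_fraction=failed_requests / allowed,
        remaining_failures=allowed - failed_requests,
    )


if __name__ == "__main__":
    budget = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"allowed failures:   {budget.allowed_failures:.0f}")
    print(f"budget consumed:    {budget.consumed_fraction:.0%}")
    print(f"remaining failures: {budget.remaining_failures:.0f}")
```

With a 99.9% target over one million requests, the budget is roughly 1,000 failed requests, so 250 failures consume about a quarter of it. A team might use the remaining budget to decide whether to keep shipping changes or to pause and focus on reliability.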
Use our built-in integrations with the tools you love to troubleshoot incidents quickly. Implement progressive rollouts and roll back changes safely. Pre-built integrations with Cloud Build let you build, test, and deploy artifacts to Google Kubernetes Engine, App Engine, Cloud Functions, Firebase, and Cloud Run as part of your CI/CD pipeline.
Get one unified view across logs, events, metrics, and SLOs. Get in-context observability data right within the service consoles of Google Kubernetes Engine, Cloud Run, Compute Engine, Anthos, and other runtimes. Collect metrics, traces, and logs with zero setup. Sub-second ingestion latency and a terabyte-per-second ingestion rate let you perform real-time log management and analysis at scale.
If you would like more hands-on help along the way, additional services are available, including Google consulting services. Reach out to sales to see which option would work for your organization. Learn from our Customer Reliability Engineering (CRE) team and from customer success stories about how Google Cloud tools and practices have helped other companies implement SRE.
With OpenTelemetry (OT) packages and the Google exporter, developers can instrument their applications and export trace data to Cloud Trace. Our new unified Ops agent (in preview) collects metrics and logs and also supports OpenTelemetry for capturing and transporting metrics. We are working to build OT libraries into many of our cloud products as out-of-the-box features. Cloud SQL Insights is one example of this effort.
Documentation
Access the SRE books, hear from SREs, and learn how we practice SRE at Google.
To monitor a service, you need at least one service-level objective (SLO). Learn step by step how to create your first SLO in Cloud Monitoring.
Learn how to define and defend your SLOs in Google Cloud’s Observability and improve observability of your applications running in Google Cloud.
This course teaches the theory of service-level objectives (SLOs), a principled way of describing and measuring the desired reliability of a service.
This course introduces key practices of Google SRE and the important role IT and business leaders play in the success of SRE organizational adoption.