How Google Does It: Applying SRE to cybersecurity

Aron Eidelman
Developer Relations Engineer, Security Advocate
Anton Chuvakin
Security Advisor, Office of the CISO
Ever wondered how Google does security? As part of our “How Google Does It” series, we share insights, observations, and top tips about how Google approaches some of today's most pressing security topics, challenges, and concerns, straight from Google experts. In this edition, Aron Eidelman, security advocate, shares insights about how Google uses Site Reliability Engineering to modernize security operations and deliver value quickly, safely, and securely.
For more than two decades, Site Reliability Engineering has been a driving force behind Google’s massive scale, proving the value of treating operations as a software problem. With modern security teams facing increasing complexity, overwhelming noise, and a stream of threats, these same foundational principles are now helping us achieve security and resilience at scale.
In many ways, Site Reliability Engineering (SRE) and security teams share similar goals: They want to move fast and make changes safely — without breaking anything along the way. Service level objectives (SLOs), error budgets, minimizing toil, and a culture of blameless retrospectives are all solid engineering practices that apply just as well to building security at scale as they do to building reliability at scale.
With that in mind, let’s take a look at some of the key SRE lessons that are helping shape our approach and improve our security posture at Google.
1. Eliminate toil
One of the greatest challenges in security engineering is sublinear scaling, where you grow your defense capabilities at a slower rate than the systems you are defending. Even at Google, growing a security team linearly to match the size of our operations is impossible to sustain.
In SRE, sublinear scaling requires tackling toil — the repetitive, manual operational work that scales alongside the system. Adopting this mindset forces us to abandon the hero culture that rewards practitioners for manually fighting fires 12 hours a day. Instead, we treat any manual operation as a bug. If a human operator needs to touch a system during normal operations, we view that as a place to improve the system's design.

Security teams can use these same practices to reduce their most taxing, repetitive work. For example, a core part of SRE at Google is building capabilities that enable developers to deploy code as fast as possible without causing issues. We have found that these same fast, reliable deployment processes can be used for security operations like deploying detection rules, managing policies, and triaging alerts.
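One way to picture "detection rules deployed like code" is to give every rule its own automated tests that run before rollout. The sketch below is illustrative, not a real Google pipeline: the rule format, field names, and sample events are all hypothetical, standing in for whatever schema a detection platform actually uses.

```python
"""Sketch: treating detection rules as deployable, testable artifacts.

A hypothetical CI gate that replays labeled sample events through a rule
before it ships, so rule pushes get the same automated testing as
application code. Rule schema and samples are illustrative only.
"""
import fnmatch


def rule_matches(rule: dict, event: dict) -> bool:
    """A rule matches when every field pattern matches the event."""
    return all(
        fnmatch.fnmatch(str(event.get(field, "")), pattern)
        for field, pattern in rule["match"].items()
    )


def validate_rule(rule: dict) -> list[str]:
    """Replay labeled samples through the rule; return any failures."""
    failures = []
    for sample in rule["tests"]["should_match"]:
        if not rule_matches(rule, sample):
            failures.append(f"{rule['name']}: missed expected detection {sample}")
    for sample in rule["tests"]["should_not_match"]:
        if rule_matches(rule, sample):
            failures.append(f"{rule['name']}: false positive on {sample}")
    return failures


# Hypothetical rule: flag service-account logins from an unexpected range.
rule = {
    "name": "suspicious-service-account-login",
    "match": {"event_type": "login", "principal": "svc-*", "source_ip": "203.0.113.*"},
    "tests": {
        "should_match": [
            {"event_type": "login", "principal": "svc-backup", "source_ip": "203.0.113.9"},
        ],
        "should_not_match": [
            {"event_type": "login", "principal": "alice", "source_ip": "203.0.113.9"},
        ],
    },
}

failures = validate_rule(rule)
print("deployable" if not failures else failures)
```

Because the rule and its tests live together as data, they can be reviewed, versioned, and rolled out through the same deployment machinery as any other change.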
Google’s site reliability engineers have found interesting gains with AI, too. They’ve been able to solve some of the problems they face today using Gemini 3 and Gemini CLI — the go-to tool for bringing agentic capabilities to the terminal.
By shifting our investment from hiring more to building software that automates these actions, we gain significant advantages: safer deployments through automated testing, consistent policies across our systems, and the ability to document and manage all our security decisions as code.
2. Alert on symptoms, not causes
Our security engineers, like SREs, face a common problem: alert fatigue from too much noise, which leads to burnout and reduced effectiveness.
To fix this, SRE teaches us to alert on the business impacts, or symptoms, rather than the system issues that cause those impacts. When users rely on a service that isn’t working, the users don't care what the underlying infrastructure issue is — they just want it fixed. The goal in alerting should be to make teams aware of a broken promise to users, not to make assumptions about what is causing the problem.
Historically, security teams set alerts on specific metrics, but these are often either too broad, creating noisy false positives like failed logins, or too narrow, relying on an exact IP address or file hash that attackers can easily bypass.
Instead, we should build automation to handle predictable threats and only alert on lagging indicators: events that measurably degrade our core objectives of confidentiality, integrity, and availability. Simply put, if we send a 3 a.m. page to a security engineer, there must be a clear, real need for immediate manual mitigation that is completely novel and could not have been responded to automatically.
We achieve this using SRE practices like SLOs and error budgets to define our risk tolerance. For example, we will trigger a manual alert only if the rate we’re spending our budget predicts that we will violate a core data integrity objective in less than an hour.
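The "page only on a fast budget burn" rule can be sketched in a few lines. The SLO target, window, and numbers below are illustrative assumptions, not Google's actual thresholds; the point is the shape of the decision: project when the error budget runs out at the current burn rate, and page a human only if that is inside the horizon.

```python
"""Sketch: error-budget burn-rate paging, under assumed numbers.

Page only when the current burn rate predicts the integrity SLO's
error budget will be exhausted within one hour; slower burns go to a
ticket queue or automation instead. All thresholds are illustrative.
"""

SLO_TARGET = 0.999                 # e.g. 99.9% of records pass integrity checks
WINDOW_SECONDS = 30 * 24 * 3600    # 30-day SLO window
TOTAL_BUDGET = 1.0 - SLO_TARGET    # fraction of "bad" events tolerated: 0.001


def seconds_until_budget_exhausted(budget_remaining: float,
                                   burn_per_second: float) -> float:
    """How long until the remaining error budget hits zero at this rate."""
    if burn_per_second <= 0:
        return float("inf")
    return budget_remaining / burn_per_second


def should_page(budget_remaining: float, burn_per_second: float,
                horizon_seconds: float = 3600) -> bool:
    """Page a human only if the budget runs out within the horizon."""
    eta = seconds_until_budget_exhausted(budget_remaining, burn_per_second)
    return eta < horizon_seconds


# Fast burn: at this rate the remaining budget is gone in ~10 minutes -> page.
print(should_page(budget_remaining=0.0006, burn_per_second=1e-6))   # True
# Slow burn: the budget lasts for days -> ticket it, let automation respond.
print(should_page(budget_remaining=0.0006, burn_per_second=1e-9))   # False
```

The same calculation works for any objective you can express as a budget, which is what lets one paging policy cover many different detection signals.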
This approach requires realistic risk calculation. SRE teaches us that moving from 99.9% to 99.99% availability doesn't cost 10% more — it can actually cost 10,000% more, both in redundant infrastructure and opportunity cost.
Security teams can adapt this mentality by abandoning the unrealistic goal of patching every vulnerability instantly, a 100% target. They might instead check for exploitability and set a narrower goal to patch those within a timeframe that doesn’t disrupt the stability of the rest of their system.
Following this approach can help define a clear error budget. If we miss that target, only then do we freeze feature launches to focus on security, aligning business incentives with security reality.
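As a concrete sketch of that error budget, the snippet below scopes a patching SLO to exploitable findings only and flags a breach when too many of them miss their deadline. The deadline, budget, and vulnerability records are hypothetical examples, not a prescribed policy.

```python
"""Sketch: a patching SLO scoped to exploitable vulnerabilities.

Instead of a 100% instant-patch target, count only exploitable findings
against the SLO, give each a patch deadline, and trigger a launch freeze
when the miss rate exceeds the error budget. Thresholds are illustrative.
"""

PATCH_DEADLINE_DAYS = 7    # remediation target for exploitable vulns
ERROR_BUDGET = 0.05        # tolerate missing the deadline 5% of the time


def slo_breached(findings: list[dict]) -> bool:
    """True when too many exploitable vulns missed their patch deadline."""
    in_scope = [f for f in findings if f["exploitable"]]
    if not in_scope:
        return False
    missed = sum(1 for f in in_scope if f["days_open"] > PATCH_DEADLINE_DAYS)
    return missed / len(in_scope) > ERROR_BUDGET


# Hypothetical findings from a vulnerability scanner.
findings = [
    {"id": "CVE-A", "exploitable": True,  "days_open": 3},
    {"id": "CVE-B", "exploitable": True,  "days_open": 12},  # missed deadline
    {"id": "CVE-C", "exploitable": False, "days_open": 90},  # out of SLO scope
]

# 1 of 2 in-scope findings missed the deadline (50% > 5% budget):
print("freeze feature launches" if slo_breached(findings) else "within budget")
```

Note that the non-exploitable finding never counts against the budget; that is what makes the target achievable without demanding instant patching of everything.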
3. Use blameless postmortems to engineer resilience
At Google, we treat failure as an opportunity to build resilience. A core SRE practice we use is the blameless postmortem. After every security incident, we document what happened and how to improve our systems without blaming individuals. We ask practical questions: How can we detect this with more precision? Respond with less cost? Prevent it entirely?
We share these postmortems across Google to translate our learnings into scalable defenses. However, our experience shows that the best way to improve a system is to pair the people building it with the people who know how it breaks.
For example, our product security engineers embed closely with software engineers and site reliability engineers throughout the entire development lifecycle. This tight partnership allows us to design secure software from the ground up, without slowing down our release velocity.
4. Embrace gradual, reversible change
As in SRE, our goal is to scale our defenses without endlessly growing our headcount by decoupling security operations from human operators. To do this, we can use release engineering, which treats security deployments (rolling out new rules, tools, or policies) as an engineering product.
Instead of deploying changes everywhere at once, we use gradual rollouts. We test updates on a small, isolated group first, tightly controlling the release to catch negative side effects, and only expand the rollout once we are happy with the results.
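A staged rollout of a security policy change might look like the sketch below. The stage sizes, the health signal (a spike in blocked legitimate requests), and the abort threshold are all illustrative assumptions; a real rollout system would also bake in soak time between stages.

```python
"""Sketch: a staged, reversible rollout for a security policy change.

Deploy to a small canary cohort first, check a health signal for
negative side effects, and either expand the rollout or back the change
out everywhere. Stage sizes and thresholds are illustrative.
"""

STAGES = [0.01, 0.10, 0.50, 1.00]   # fraction of the fleet per stage
MAX_FALSE_BLOCK_RATE = 0.002        # abort if legit traffic gets blocked


def rollout(apply_to_fraction, false_block_rate_for) -> str:
    """Walk the stages; roll back on the first unhealthy signal."""
    for fraction in STAGES:
        apply_to_fraction(fraction)
        if false_block_rate_for(fraction) > MAX_FALSE_BLOCK_RATE:
            apply_to_fraction(0.0)   # reversible: back the change out
            return f"rolled back at {fraction:.0%}"
    return "fully deployed"


# Simulated health signal: the policy misbehaves once it hits 50% of the fleet.
observed = {0.01: 0.0001, 0.10: 0.0005, 0.50: 0.004, 1.00: 0.004}
result = rollout(apply_to_fraction=lambda f: None,
                 false_block_rate_for=lambda f: observed.get(f, 0.0))
print(result)
```

Because the change is applied as a controllable fraction rather than a one-shot push, "undo" is just another deployment, which is what makes being wrong affordable.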
This approach is supported by research, including, most recently, our 2025 DORA report. Across teams at Google and hundreds of other participants, we found with a high degree of consistency that working in “small batches” (defined as committing fewer lines of code per change and including fewer changes per release) increases product stability and decreases friction.
Crucially, SRE and resilient design allow us to be wrong sometimes, so we can take action quickly without fear that a wrong decision can’t be undone.
This article includes insights from the episodes, “The Shared Problem of Alerting: More SRE Lessons for Security,” “Deploy Security Capabilities at Scale: SRE Explains How,” and “Product Security Engineering at Google: Resilience and Security,” of the Cloud Security Podcast.