Security & Identity

How Google Does It: Making threat detection high-quality, scalable, and modern

January 7, 2025
Anton Chuvakin

Security Advisor, Office of the CISO

Tim Nguyen

Director, Detection and Response, Google


Ever wondered how Google does security? As part of our new “How Google Does It” series, we’ll share insights, observations, and top tips about how Google approaches some of today's most pressing security topics, challenges, and concerns — straight from Google experts. In this edition, Tim Nguyen shares an inside look at the core aspects and principles of Google’s approach to modern threat detection and response.

Google’s threat detection and response team is charged with hunting down malicious system and network activity across all of Google and Alphabet. The team’s mandate covers the largest Linux fleet in the world, nearly every flavor of operating system available, Google Cloud’s infrastructure and services, and more than 180,000 employees.

We rely on a detection pipeline that combines several code engines working together to consume logs, apply intelligence, and turn them into usable signals. Our process begins by streaming a large pipeline of logs into our cloud data warehouse, which allows us to quickly run queries across our entire dataset of logs going back many months.
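
As a rough illustration of what querying months of log history in a cloud data warehouse can look like, here is a minimal sketch using the BigQuery Python client; the dataset, table, and column names are hypothetical placeholders, not our actual schema.

```python
# Minimal sketch of querying months of historical logs in a cloud data
# warehouse (BigQuery shown as one option). The dataset, table, and column
# names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()

QUERY = """
SELECT hostname, principal, process_path, event_time
FROM `sec_logs.process_events`          -- hypothetical table
WHERE event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 180 DAY)
  AND process_path LIKE '%suspicious_binary%'
ORDER BY event_time DESC
"""

# Run the query and print each matching event from the last ~6 months.
for row in client.query(QUERY).result():
    print(row.hostname, row.principal, row.process_path, row.event_time)
```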

When we find a signal that we want to surface for further investigation, it gets routed to a triage queue for a human member of the detection team to review, escalate, and remediate if necessary. When new indicators of compromise come to light, often through new intelligence, improved signaling, or broader coverage, we can automatically review all past signals and devices to see if they're affected by the new indicators.
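
To make that retroactive review concrete, here is a simplified sketch of matching newly published IoCs against stored past signals; the Signal structure and in-memory list stand in for whatever signal store a real pipeline would use.

```python
# Illustrative retro-matching of newly learned IoCs against past signals.
# The Signal dataclass and in-memory history are stand-ins for a real
# signal store.
from dataclasses import dataclass

@dataclass
class Signal:
    device: str
    indicator: str      # e.g. a file hash, domain, or IP observed
    timestamp: str

def retro_hunt(past_signals: list[Signal], new_iocs: set[str]) -> list[Signal]:
    """Return previously triaged signals that match newly published IoCs."""
    return [s for s in past_signals if s.indicator in new_iocs]

history = [
    Signal("ws-1234", "44d88612fea8a8f36de82e1278abb02f", "2024-11-02T03:14:00Z"),
    Signal("srv-9876", "evil-c2.example.net", "2024-12-19T22:41:00Z"),
]
new_iocs = {"evil-c2.example.net"}

for hit in retro_hunt(history, new_iocs):
    print(f"Re-open triage for {hit.device}: matched {hit.indicator}")
```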

All of these efforts can feel like an uphill battle, especially when you’re trying to scale them across an organization. We want to minimize the time we spend receiving and processing the same information, so we can start making critical decisions as efficiently as possible.

At Google, detection and response teams subscribe to a service-level objective (SLO) to detect and respond to threats promptly. We do this to minimize dwell time (the time an attacker is active on a network before being detected) as much as possible. While the industry average dwell time is weeks, we’ve driven dwell time down to hours.
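
As a simple illustration, dwell time can be tracked as the gap between an attacker's first observed activity and the moment of detection, then compared against an SLO target; the timestamps and the hours-based target below are illustrative, not our internal numbers.

```python
# Sketch of tracking dwell time against a detection SLO. The SLO target
# and timestamps are hypothetical examples.
from datetime import datetime, timedelta

SLO_MAX_DWELL = timedelta(hours=12)   # hypothetical target

def dwell_time(first_attacker_activity: datetime, detected_at: datetime) -> timedelta:
    """Time an attacker was active before being detected."""
    return detected_at - first_attacker_activity

observed = dwell_time(
    first_attacker_activity=datetime(2025, 1, 3, 1, 15),
    detected_at=datetime(2025, 1, 3, 7, 40),
)
print(f"Dwell time: {observed}, within SLO: {observed <= SLO_MAX_DWELL}")
```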

So, how do we deliver on this rapid-response objective while also scaling out our detection and response capabilities across our vast environment? Below, we’ll share some key ingredients that have helped us create a recipe for threat detection success.

1. Automate (almost) everything

When we’re asked to investigate whether an indicator of compromise (IoC) exists, we have to check everywhere, and we mean everywhere. Every corporate workstation, operating system, production server, virtual machine, and every resource underlying our products and services — it’s the definition of toil. With so many types of IoCs against that level of scale, there’s no such thing as manual.

Our motto is less gathering, more direct analysis. We believe people are uniquely good at understanding nuance, making judgments, and navigating ambiguous information. We want to automate context-building as much as possible to give our team more time to make the right decisions.

Instead of a human performing all the same steps for event analysis or investigation over and over again, we use machines to automate most of it. Ideally, we try to retrieve the majority of our machine telemetry, user information, and process executions automatically.

Roughly 97% of our events are generated through automated “hunts,” and then presented to a human along with a risk score and details about where to investigate. This allows our analysts to triage events much faster because they start with all the contextual information they need to make a decision. The automation also discards false avenues of investigation and gives humans a direction to follow, which helps determine whether an event is a true positive.
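
Here is a simplified sketch of what presenting a pre-enriched hunt result to an analyst might look like; the field names, score weights, and suggested next steps are illustrative, not our production logic.

```python
# Sketch of attaching context and a risk score to an automated hunt hit
# before a human sees it. Field names and score weights are illustrative.
def risk_score(event: dict) -> int:
    score = 0
    if event.get("binary_prevalence", 100) < 5:   # rare binary across the fleet
        score += 40
    if event.get("off_hours_execution"):
        score += 20
    if event.get("matches_known_ioc"):
        score += 40
    return min(score, 100)

def present_to_analyst(event: dict, context: dict) -> dict:
    """Bundle the raw event, pre-gathered context, and a score into one ticket."""
    return {
        "event": event,
        "context": context,   # asset owner, recent logins, asset history, ...
        "risk_score": risk_score(event),
        "suggested_next_steps": ["confirm binary provenance", "check for lateral movement"],
    }

ticket = present_to_analyst(
    event={"hostname": "ws-1234", "binary_prevalence": 2, "off_hours_execution": True},
    context={"asset_owner": "jdoe", "recent_logins": ["jdoe", "svc-backup"]},
)
print(ticket["risk_score"])   # 60
```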

Another real win from automation has been our ability to drive down the cost of investigating events. We can automatically replicate pieces of logic, gather information, and present it to our team, which allows us to work faster and accelerate our time-to-resolution. We’ve been able to drastically reduce the cost per ticket for handling individual events while simultaneously increasing the number of events we can process.

Naturally, generative AI is in wide use for automation as well. For example, large language model-generated drafts reduced the time that engineers spent writing executive summaries by 53%, while delivering at least on-par content quality in terms of factual accuracy and adherence to writing best practices.

At the same time, there are cases that require us to populate our own log sources or hunt down additional information to investigate an event. We always leave room for manual human expertise, but the overall goal is to ensure we're adding value and building towards something larger, not wasting our energy and efforts on repetitive tasks.

2. Collaborate for more powerful detections

A crucial yet often overlooked truth is that no detection and response team can be effective in isolation. Successful threat detection requires close, continuous collaboration with different departments, teams, and stakeholders.

At Google, all of our threat hunts, whether manual or automated, begin with threat modeling. It’s impossible to create a good detection without truly understanding exactly what you are chasing, so we always start by speaking with the people in charge of a project to create an accurate model of how the system works and gain a clear picture of the detections they want to build.

Once we understand what types of threats we want to detect, we review the existing logs to determine whether we have all the telemetry we need to support our efforts. If anything is missing, we'll work with the team to emit additional logs. In our experience, it's rare to get through an incident postmortem without discovering that you're missing information that could shed more light on an attacker's actions. The process of improving your logs, surfacing the right information to investigate, and responding to attacks effectively is constant and collaborative.
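
One lightweight way to picture this telemetry-coverage check is to map each modeled threat to the log sources needed to detect it, then flag gaps to raise with the owning team; the threat names and log sources below are made up for the example.

```python
# Illustrative coverage check: map threats from a threat-modeling session
# to the log sources needed to detect them, then flag any gaps.
REQUIRED_TELEMETRY = {
    "credential theft on workstations": {"auth_logs", "process_events"},
    "data exfiltration from the service": {"vpc_flow_logs", "storage_access_logs"},
}

AVAILABLE_SOURCES = {"auth_logs", "process_events", "vpc_flow_logs"}

for threat, needed in REQUIRED_TELEMETRY.items():
    missing = needed - AVAILABLE_SOURCES
    if missing:
        print(f"Gap for '{threat}': ask the owning team to emit {sorted(missing)}")
    else:
        print(f"'{threat}': telemetry coverage looks complete")
```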

3. Build an asset inventory

In the past, you had to do incident response without any sort of asset inventory or history. Imagine trying to figure out whether a host existed at a specific point in time, pinpointing when it was created or shut down, and whether it was compromised by an attacker, all without any record of the assets in your environment. If you're not aware of an entire class of assets, they can become a perfect entry point into your infrastructure.

We believe that an asset inventory is essential for protecting your entire infrastructure, and can help answer critical questions when detecting and responding to threats. Cloud environments can offer a tremendous advantage here because you can automatically inventory everything, retain asset creation history for long periods of time, and programmatically query your infrastructure.
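
As a sketch of the kind of question an asset inventory answers, here is a minimal point-in-time existence check; the AssetRecord structure is a stand-in for a real inventory system such as a cloud asset inventory or a CMDB.

```python
# Sketch of answering "did this host exist at a given point in time?"
# from an asset history. AssetRecord is a placeholder for a real inventory.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class AssetRecord:
    hostname: str
    created: datetime
    deleted: Optional[datetime]   # None if the asset is still running

def existed_at(inventory: list[AssetRecord], hostname: str, when: datetime) -> bool:
    """True if the host was alive at the given moment."""
    for a in inventory:
        if a.hostname == hostname and a.created <= when and (a.deleted is None or when <= a.deleted):
            return True
    return False

inventory = [AssetRecord("vm-build-42", datetime(2024, 6, 1), datetime(2024, 9, 30))]
print(existed_at(inventory, "vm-build-42", datetime(2024, 8, 15)))   # True
```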

4. You write it, you triage it

At Google, the people who write the detections are the same ones who respond to the signals. This is predicated on a very simple rule: If you're not in charge of engineering an alert, do you care if it fires off at 3:00 in the morning?

One of the more common practices in threat detection organizations is to separate the teams writing detection from those triaging alerts. However, this dynamic can be responsible for creating a lot of unnecessary tension. Security alerts are noisy by nature; at some point or another, a detection will inevitably end up flooding the triage team with an unreasonable amount of alerts, no matter how much effort went into writing it.

When there’s no understanding of the impact of alerts on triage, the incentive to improve detections is lower. Alert fatigue is a very real problem that can lead to burnout, frustration, and desensitization, which can cause teams to miss critical issues. We’ve found that having the same team that writes the detections also be responsible for triaging them brings more accountability for detection quality and keeps alerts from getting out of control.
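
One simple way to encode this ownership model is to have every detection rule carry its authoring team as the paging target, so an alert always routes back to the people who wrote it; the rule format below is illustrative, not a real rule engine.

```python
# Illustrative "you write it, you triage it" routing: each detection rule
# records its authoring team, and alerts page that same team.
DETECTIONS = [
    {
        "name": "rare_binary_off_hours",
        "author_team": "detection-eng-endpoint",
        "pages": "detection-eng-endpoint",   # same team that wrote the rule
        "query": "binary_prevalence < 5 AND hour NOT BETWEEN 8 AND 18",
    },
]

def route_alert(detection_name: str) -> str:
    """Return the team that gets paged for a firing detection."""
    rule = next(d for d in DETECTIONS if d["name"] == detection_name)
    return rule["pages"]

print(route_alert("rare_binary_off_hours"))   # detection-eng-endpoint
```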

5. Security engineering is software engineering

Software engineering is now at the core of every security discipline. Operating a cloud, and a detection infrastructure that can protect that cloud, requires writing code every single day.

That’s why all of our security engineers need to know how to read and write code.

At Google, the lion’s share of our daily workflow involves writing analysis code and detection logic so that our work can be triaged and absorbed by others on the team. We expect our security engineers to cover a wide range of responsibilities, including threat modeling, log acquisition, data modeling, signal development, analysis automation and triage, as well as incident response. At another company, these tasks might be split across two, three, or even four roles.

Increasing the engineering skills on our team has also helped us build more automation and, ultimately, reduce the repetitive toil that has to be done but doesn't bring any enduring value. This approach also means adopting many best practices from software engineering, including documenting our code, monitoring and tracking our progress, stress testing and validating our detections, and holding weekly reviews to discuss the quality of our analysis and response.
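
To make "detection is code" concrete, here is a minimal sketch of unit-testing a detection the same way you would test any other code; the detection logic and test events are made up for the example.

```python
# Minimal sketch of treating a detection like code: unit tests confirm it
# fires on the behavior it targets and stays quiet otherwise.
import unittest

def detect_rare_off_hours_binary(event: dict) -> bool:
    """Fire when a rare binary runs outside business hours (illustrative rule)."""
    return event["binary_prevalence"] < 5 and not 8 <= event["hour"] <= 18

class DetectionTest(unittest.TestCase):
    def test_fires_on_rare_binary_at_night(self):
        self.assertTrue(detect_rare_off_hours_binary({"binary_prevalence": 2, "hour": 3}))

    def test_ignores_common_binary_during_business_hours(self):
        self.assertFalse(detect_rare_off_hours_binary({"binary_prevalence": 500, "hour": 10}))

if __name__ == "__main__":
    unittest.main()
```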

In an ideal world, you could buy a tool and push a couple of buttons to get detections up and running, but the reality is that detection is code. Embracing engineering practices for detection can not only improve the quality of your detections and signals, but also enable you to scale detection and response across your entire organization.

This article includes insights from the Cloud Security Podcast episodes, “Modern Threat Detection at Google” and “How We Scale Detection and Response at Google: Automation, Metrics, Toil”. Check them out to learn more.
