How Google Does It: Securing production services, servers, and workloads

December 10, 2025
Michael Czapinski

Principal Engineer, Infrastructure SRE

Anton Chuvakin

Security Advisor, Office of the CISO

Ever wondered how Google does security? As part of our “How Google Does It” series, we share insights, observations, and top tips about how Google approaches some of today's most pressing security topics, challenges, and concerns — straight from Google experts. In this edition, Michael Czapinski, principal engineer, Infrastructure SRE, shares insights about Google’s approach to securing its production services, servers, and workloads.

At Google, we operate a vast production environment that provides products and services to billions around the world. We’ve spent decades refining our security policies and systems to safeguard the applications and data that power our ecosystem. Our strategy is based on a deep understanding of the most critical attacks that could compromise our production services, including unauthorized access threats and the growing risks of lateral movement.

At the same time, given the scale and complexity of our infrastructure, we have to strike a delicate balance between several competing priorities, including security, reliability, efficiency, velocity, and maintainability. For our Site Reliability Engineering (SRE) and security teams, the daily challenge is navigating these critical demands while upholding our industry-leading security posture.

The goal isn’t to hyper-focus on any single area, but to find a viable middle ground and, more critically, to develop an approach that allows us to reconcile all these different needs and requirements effectively. Below, we’ll explore three core pillars that define how we protect production workloads at Google scale.

Minimize human interaction to maximize security

As systems become increasingly complex, mistakes are inevitable. Unintentional errors can lead to significant outages and create opportunities for adversaries. Anything that can go wrong by mistake can also go wrong intentionally.

Our production security approach is largely defined by our Zero Touch Prod (ZTP) philosophy, a set of principles and tools that aim to minimize direct human interaction with production systems. ZTP dictates that all production changes are done through automation, pre-validated software, or an audited “break glass” emergency access mechanism. This can help reduce the risk of unintentional and malicious outages.

We strive to automate as much production administration as possible. In cases where some level of interaction is necessary, we provide safe proxies, which are pre-validated software tools that can only execute safe, well-defined commands approved by our SRE and security teams. We also use a suite of controls called No Persons (NoPe) to enforce access controls on our production systems. For access beyond automated and approved operations, we require a clear business justification and authorized approval. Elevated privileges are always temporary, never permanent.
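
To make the idea concrete, here is a minimal Python sketch of a safe proxy combined with time-limited grants. It is an illustration only, not Google's implementation: the command names, the `Grant` record, the one-hour default lifetime, and the rule that approver and requester must differ are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical allowlist: the only operations the proxy runs without a grant.
SAFE_COMMANDS = {"restart_task", "drain_server", "show_status"}


@dataclass(frozen=True)
class Grant:
    """A time-limited elevation with a recorded justification and approver."""
    user: str
    command: str
    justification: str
    approver: str
    expires_at: datetime


def issue_grant(user, command, justification, approver, now,
                ttl=timedelta(hours=1)):
    """Create a temporary grant that expires ttl after now."""
    if not justification.strip():
        raise ValueError("a business justification is required")
    # Assumption for this sketch: nobody may approve their own request.
    if not approver or approver == user:
        raise ValueError("approval from another person is required")
    return Grant(user, command, justification, approver, now + ttl)


def is_allowed(user, command, grants, now):
    """Allow pre-validated commands; anything else needs an unexpired grant."""
    if command in SAFE_COMMANDS:
        return True
    return any(
        g.user == user and g.command == command and now < g.expires_at
        for g in grants
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    grant = issue_grant("alice", "edit_acl", "incident mitigation", "bob", now)
    print(is_allowed("alice", "show_status", [], now))   # allowlisted command
    print(is_allowed("alice", "edit_acl", [], now))      # no grant on file
    print(is_allowed("alice", "edit_acl", [grant], now)) # grant still valid
```

Because every grant carries an expiry time, elevated access lapses on its own rather than relying on someone to remember to revoke it.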

In addition, our SRE and security teams collaborate closely to transform existing operational and on-call procedures to remove broad, unnecessary powers. For example, we created a system that can automatically review and restrict actions based on safety checks, team playbooks, and best practices.
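
As a rough illustration of automated review, the Python sketch below checks a proposed action against two safety rules before it may run. Everything in it is hypothetical: the `Action` fields, the 10% blast-radius threshold, and the rollback requirement are stand-ins for whatever checks a team's playbooks would actually define.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A proposed production change (all fields are illustrative)."""
    name: str
    targets: int        # number of machines the action would touch
    fleet_size: int     # total machines in the affected service
    has_rollback: bool  # whether a rollback procedure is defined


# Hypothetical threshold: one action may touch at most 10% of a fleet.
MAX_FRACTION = 0.10


def review(action):
    """Return the violated checks; an empty list means the action may proceed."""
    violations = []
    if action.fleet_size <= 0 or action.targets > action.fleet_size:
        violations.append("invalid target count")
    elif action.targets / action.fleet_size > MAX_FRACTION:
        violations.append("blast radius exceeds limit")
    if not action.has_rollback:
        violations.append("no rollback procedure")
    return violations


if __name__ == "__main__":
    print(review(Action("restart", targets=5, fleet_size=100, has_rollback=True)))
    print(review(Action("restart", targets=50, fleet_size=100, has_rollback=False)))
```

Returning every violation, rather than stopping at the first, gives the operator a complete picture of what needs to change before the action is retried.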

These efforts have also led to significant engineering initiatives to re-architect older, complex systems into smaller, functional components that greatly reduce risk while improving both our security and reliability.

Protect your crown jewels at all costs

Every infrastructure, no matter its size and scope, contains what we call “foundational services”: the core production services that are critical for securing production applications and workloads. In our case, these services represent the starting point for ensuring the security of our entire infrastructure, and therefore carry the highest security requirements.

We meticulously protect, monitor, and maintain these foundational services at all costs. They undergo regular, rigorous security engineering reviews and are protected with the latest levels of security controls that we offer, prioritizing security and reliability even at the expense of additional costs and inefficiencies.

For example, disabling core dumps may make debugging more challenging, but it can help prevent powerful secrets, such as root access keys, from leaking out of process memory.
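
On Unix-like systems, one common way for a process to opt out of core dumps is to set its core file size limit to zero. The Python sketch below uses the standard `resource` module to do that; it is a generic example of the technique, not a description of how Google's services are configured, and it covers only the per-process resource limit.

```python
import resource


def disable_core_dumps():
    """Set the core file size limit to zero for this process and its children."""
    # Lowering the hard limit as well as the soft limit means an
    # unprivileged process cannot raise the limit again later.
    resource.setrlimit(resource.RLIMIT_CORE, (0, 0))


if __name__ == "__main__":
    disable_core_dumps()
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    print(f"core dump limit: soft={soft}, hard={hard}")
```

Calling this early in startup, before any secrets are loaded into memory, narrows the window in which a crash could write them to disk.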

We keep the list of foundational services small, ideally in the dozens, and rarely add new ones because each of these services poses a substantial risk if compromised. In addition, we frequently audit foundational services in an effort to reduce their authority and control wherever possible.

Tailor defense for diverse needs and requirements

Not all data or services have the same security requirements. To defend against lateral movement, we employ a hierarchy to isolate our workloads, known as workload security rings (WSR). This concept allows us to categorize different services based on their security requirements and provide them with corresponding levels of isolation. As a result, we can make informed trade-offs between efficiency and security, ensuring that our most critical assets receive the highest degree of protection.

For example, foundational workloads, which could potentially impact all products, receive the highest levels of isolation and protection. They run on dedicated servers and are never scheduled alongside other types of workloads. Sensitive workloads that process or access product-specific or customer data also receive a very high standard of protection. By comparison, we apply less costly and less restrictive security measures to lower-priority workloads that can continue functioning even in the event of reduced performance or accuracy, such as experiments and batch processing.
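
The placement rule for foundational workloads can be sketched as a small scheduling check. The Python example below is a simplified model under stated assumptions: the three ring names mirror the categories described above, but the `Ring` enum and `can_place` function are hypothetical, and letting the non-foundational rings share machines is a placeholder, since the text does not say how those rings are separated from each other.

```python
from enum import IntEnum


class Ring(IntEnum):
    """Illustrative rings, ordered from most to least protected."""
    FOUNDATIONAL = 0
    SENSITIVE = 1
    LOWER_PRIORITY = 2


def can_place(workload_ring, machine_rings):
    """Decide whether a workload may be scheduled onto a machine.

    machine_rings is the set of rings already running on that machine.
    """
    if not machine_rings:
        return True  # an empty machine accepts any workload
    if workload_ring == Ring.FOUNDATIONAL or Ring.FOUNDATIONAL in machine_rings:
        # Foundational workloads only ever share a machine with each other.
        return (workload_ring == Ring.FOUNDATIONAL
                and machine_rings == {Ring.FOUNDATIONAL})
    # Assumption for this sketch only: the remaining rings may share machines.
    return True


if __name__ == "__main__":
    print(can_place(Ring.FOUNDATIONAL, {Ring.FOUNDATIONAL}))
    print(can_place(Ring.LOWER_PRIORITY, {Ring.FOUNDATIONAL}))
```

Encoding the rule in the scheduler, rather than in a review process, means a misconfigured job is rejected before it ever lands next to a foundational service.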

Interestingly, we have found that security improvements frequently lead to benefits in multiple areas, so we often look for solutions that can advance multiple goals at once. However, we always draw a hard line at the level of risk we’re unwilling to take, and we consistently uphold and defend it. While a suboptimal security posture might be accepted under specific circumstances, this is only done with leadership approval and a concrete plan to restore the optimal security state within a defined period.

Ultimately, absolute security isn’t a practical reality, so we need to find ways to protect our systems without impairing our ability to operate and serve our customers.

This article includes insights from the Cloud Security Podcast episode, “Zero Touch Prod, Security Rings, and Foundational Services: How Google Does Workload Security.”
