Google infrastructure security design overview

This content was last updated in June 2023, and represents the status quo as of the time it was written. Google's security policies and systems may change going forward, as we continually improve protection for our customers.

Introduction

This document provides an overview of how security is designed into Google's technical infrastructure. It is intended for security executives, security architects, and auditors.

This document describes the following:

  • Google's global technical infrastructure, which is designed to provide security through the entire information processing lifecycle at Google. This infrastructure helps provide the following:
    • Secure deployment of services
    • Secure storage of data with end-user privacy safeguards
    • Secure communication between services
    • Secure and private communication with customers over the internet
    • Safe operation by Google engineers
  • How we use this infrastructure to build internet services, including consumer services such as Google Search, Gmail, and Google Photos, and enterprise services such as Google Workspace and Google Cloud.
  • Our investment in securing our infrastructure and operations. We have many engineers who are dedicated to security and privacy across Google, including many who are recognized industry authorities.
  • The security products and services that are the result of innovations that we implemented internally to meet our security needs. For example, BeyondCorp is the direct result of our internal implementation of the zero-trust security model.
  • How the security of the infrastructure is designed in progressive layers. These layers include the following:
    • Secure low-level infrastructure
    • Secure service deployment
    • Secure data storage
    • Secure internet communication
    • Operational security

The remaining sections of this document describe the security layers.

Secure low-level infrastructure

This section describes how we secure the physical premises of our data centers, the hardware in our data centers, and the software stack running on the hardware.

Security of physical premises

We design and build our own data centers, which incorporate multiple layers of physical security. Access to these data centers is tightly controlled. We use multiple physical security layers to protect our data center floors, including biometric identification, metal detection, cameras, vehicle barriers, and laser-based intrusion detection systems. For more information, see Data center security.

We also host some servers in third-party data centers. In these data centers, we ensure that there are Google-controlled physical security measures on top of the security layers that are provided by the data center operator. For example, we operate biometric identification systems, cameras, and metal detectors that are independent from the security layers that the data center operator provides.

Hardware design and provenance

A Google data center consists of thousands of servers connected to a local network. We design the server boards and the networking equipment. We vet the component vendors that we work with and choose components with care. We work with vendors to audit and validate the security properties that are provided by the components. We also design custom chips, including a hardware security chip (called Titan), that we deploy on servers, devices, and peripherals. These chips let us identify and authenticate legitimate Google devices at the hardware level and serve as hardware roots of trust.

Secure boot stack and machine identity

Google servers use various technologies to ensure that they boot the correct software stack. We use cryptographic signatures for low-level components like the baseboard management controller (BMC), BIOS, bootloader, kernel, and base operating system image. These signatures can be validated during each boot or update cycle. The first integrity check for Google servers uses a hardware root of trust. These components are controlled and built by Google and hardened with integrity attestation. With each new generation of hardware, we strive to continually improve security. For example, depending on the generation of server design, the boot chain's root of trust is in one of the following (a sketch of a verified boot chain follows the list):

  • The Titan hardware chip
  • A lockable firmware chip
  • A microcontroller running our own security code
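
The following illustrative sketch shows a verified boot chain in which a hardware root of trust holds the expected digest of the first stage and each stage vouches for the next. All names and values here are hypothetical assumptions, not Google's implementation.

```python
"""Minimal sketch of a verified boot chain (illustrative only)."""
import hashlib

def digest(blob: bytes) -> str:
    return hashlib.sha256(blob).hexdigest()

def verify_boot_chain(root_of_trust_digest: str, stages: list[dict]) -> bool:
    """stages: ordered list of {'image': bytes, 'next_expected_digest': str | None}."""
    expected = root_of_trust_digest               # anchored in hardware (e.g. a Titan chip)
    for stage in stages:
        if digest(stage["image"]) != expected:
            return False                          # refuse to boot an unexpected image
        expected = stage["next_expected_digest"]  # each stage vouches for the next
    return True

# Example: BIOS -> bootloader -> kernel, each measured before control transfer.
bios = b"bios v7"
bootloader = b"bootloader v42"
kernel = b"kernel v6.1 (patched)"
stages = [
    {"image": bios, "next_expected_digest": digest(bootloader)},
    {"image": bootloader, "next_expected_digest": digest(kernel)},
    {"image": kernel, "next_expected_digest": None},
]
assert verify_boot_chain(digest(bios), stages)
```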

Each server in the data center has its own unique identity. This identity can be tied to the hardware root of trust and the software with which the machine boots. This identity is used to authenticate API calls to and from low-level management services on the machine. This identity is also used for mutual server authentication and transport encryption. We developed the Application Layer Transport Security (ALTS) system for securing remote procedure call (RPC) communications within our infrastructure. These machine identities can be centrally revoked to respond to a security incident. In addition, their certificates and keys are routinely rotated, and old ones revoked.

We developed automated systems to do the following:

  • Ensure that servers run up-to-date versions of their software stacks (including security patches).
  • Detect and diagnose hardware and software problems.
  • Ensure the integrity of the machines and peripherals with verified boot and implicit attestation.
  • Ensure that only machines running the intended software and firmware can access credentials that allow them to communicate on the production network (see the sketch after this list).
  • Remove machines from service or re-allocate them when they're no longer needed.
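
The following is a minimal sketch of the attestation-gated credential idea in the list above: production credentials are issued only to machines whose attested boot measurement matches an approved software and firmware release. The names and the credential format are hypothetical.

```python
"""Sketch: issue production credentials only to machines with an approved
boot measurement (illustrative names and values)."""
import hashlib
import secrets

def measure(blob: bytes) -> str:
    return hashlib.sha256(blob).hexdigest()

# Digests of the intended firmware and software stack (illustrative values).
APPROVED_MEASUREMENTS = {measure(b"firmware 7.2 + base OS 2024.05")}

def issue_production_credential(machine_id: str, attested_digest: str) -> str:
    """Return a short-lived network credential, but only for verified machines."""
    if attested_digest not in APPROVED_MEASUREMENTS:
        raise PermissionError(f"{machine_id}: boot measurement not approved")
    return secrets.token_hex(32)   # stand-in for a real, centrally revocable credential

# A machine presents the measurement recorded during its verified boot.
cred = issue_production_credential("machine-0042",
                                   measure(b"firmware 7.2 + base OS 2024.05"))
```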

Secure service deployment

Google services are the application binaries that our developers write and run on our infrastructure. Examples of Google services are Gmail servers, Spanner databases, Cloud Storage servers, YouTube video transcoders, and Compute Engine VMs running customer applications. To handle the required scale of the workload, thousands of machines might be running binaries of the same service. A cluster orchestration service, called Borg, controls the services that are running directly on the infrastructure.

The infrastructure does not assume any trust between the services that are running on the infrastructure. This trust model is referred to as a zero-trust security model. A zero-trust security model means that no devices or users are trusted by default, whether they are inside or outside of the network.

Because the infrastructure is designed to be multi-tenant, data from our customers (consumers and businesses), and even our own data, is distributed across shared infrastructure. This infrastructure is composed of tens of thousands of homogeneous machines. The infrastructure does not segregate customer data onto a single machine or set of machines, except in specific circumstances, such as when you use Google Cloud to provision VMs on sole-tenant nodes for Compute Engine.

Google Cloud and Google Workspace support regulatory requirements around data residency. For more information about data residency and Google Cloud, see Implement data residency and sovereignty requirements. For more information about data residency and Google Workspace, see Data regions: Choose a geographic location for your data.

Service identity, integrity, and isolation

To enable inter-service communication, applications use cryptographic authentication and authorization. Authentication and authorization provide strong access control at an abstraction level and granularity that administrators and services can understand.

Services do not rely on internal network segmentation or firewalling as the primary security mechanism. Ingress and egress filtering at various points in our network helps prevent IP spoofing. This approach also helps us to maximize our network's performance and availability. For Google Cloud, you can add further security mechanisms such as VPC Service Controls and Cloud Interconnect.

Each service that runs on the infrastructure has an associated service account identity. A service is provided with cryptographic credentials that it can use to prove its identity to other services when making or receiving RPCs. These identities are used in security policies. The security policies ensure that clients are communicating with the intended server, and that servers are limiting the methods and data that particular clients can access.
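
As an illustration of such a security policy, the sketch below checks an authenticated caller identity against a per-method allow list before an RPC is served. The identity names and policy layout are hypothetical and do not reflect Google's internal format.

```python
"""Sketch: per-method access policy keyed on authenticated service identity."""

ACCESS_POLICY = {
    # callee method            -> service identities allowed to call it
    "contacts.ListContacts":   {"gmail-frontend", "contacts-web"},
    "contacts.AdminExport":    {"contacts-admin"},
}

def authorize_rpc(caller_identity: str, method: str) -> None:
    """caller_identity is assumed to have been proven cryptographically
    (for example, over a mutually authenticated transport) before this check."""
    allowed = ACCESS_POLICY.get(method, set())
    if caller_identity not in allowed:
        raise PermissionError(f"{caller_identity} may not call {method}")

authorize_rpc("gmail-frontend", "contacts.ListContacts")   # permitted
# authorize_rpc("gmail-frontend", "contacts.AdminExport")  # would raise PermissionError
```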

We use various isolation and sandboxing techniques to help protect a service from other services running on the same machine. These techniques include Linux user separation, language-based sandboxes (such as the Sandboxed API), kernel-based sandboxes, an application kernel for containers (such as gVisor), and hardware virtualization. In general, we use more layers of isolation for riskier workloads. Riskier workloads include user-supplied items that require additional processing, such as running complex file converters on user-supplied data or running user-supplied code for products like App Engine or Compute Engine.

For extra security, sensitive services, such as the cluster orchestration service and some key management services, run exclusively on dedicated machines.

In Google Cloud, to provide stronger cryptographic isolation for your workloads and to protect data in use, we support Confidential Computing services for Compute Engine VMs and Google Kubernetes Engine (GKE) nodes.

Inter-service access management

The owner of a service can use access-management features provided by the infrastructure to specify exactly which other services can communicate with the service. For example, a service can restrict incoming RPCs solely to an allowed list of other services. That service can be configured with the allowed list of the service identities, and the infrastructure automatically enforces this access restriction. Enforcement includes audit logging, justifications, and unilateral access restriction (for engineer requests, for example).

Google engineers who need access to services are also issued individual identities. Services can be configured to allow or deny access based on these identities. All of these identities (machine, service, and employee) are in a global namespace that the infrastructure maintains.

To manage these identities, the infrastructure provides a workflow system that includes approval chains, logging, and notification. For example, the security policy can enforce multi-party authorization. This system uses the two-person rule to ensure that an engineer acting alone cannot perform sensitive operations without first getting approval from another, authorized engineer. This system allows secure access-management processes to scale to thousands of services running on the infrastructure.
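
The sketch below illustrates the two-person rule: a sensitive action can only be executed after approval from an authorized engineer other than the requester. The workflow structure shown is hypothetical.

```python
"""Sketch of a two-person rule for sensitive operations (illustrative only)."""
from dataclasses import dataclass, field

AUTHORIZED_APPROVERS = {"alice", "bob", "carol"}

@dataclass
class SensitiveRequest:
    requester: str
    action: str
    approvals: set[str] = field(default_factory=set)

    def approve(self, approver: str) -> None:
        if approver == self.requester:
            raise PermissionError("requester cannot approve their own request")
        if approver not in AUTHORIZED_APPROVERS:
            raise PermissionError(f"{approver} is not an authorized approver")
        self.approvals.add(approver)

    def execute(self) -> str:
        if not self.approvals:
            raise PermissionError("two-person rule: another engineer's approval required")
        # Audit logging and notification would happen here in a real workflow system.
        return f"executed {self.action!r} for {self.requester}"

req = SensitiveRequest(requester="dave", action="rotate signing key")
req.approve("alice")
print(req.execute())
```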

The infrastructure also provides a canonical service for user, group, and membership management so that services can implement custom, fine-grained access control where necessary.

End-user identities are managed separately, as described in Access management of end-user data in Google Workspace.

Encryption of inter-service communication

The infrastructure provides confidentiality and integrity for RPC data on the network. All Google Cloud virtual networking traffic is encrypted. All communication between infrastructure services is authenticated and most inter-service communication is encrypted, which adds an additional layer of security to help protect communication even if the network is tapped or a network device is compromised. Exceptions to the encryption requirement for inter-service communication are granted only for traffic that has low latency requirements, and that also doesn't leave a single networking fabric within the multiple layers of physical security in our data center.

The infrastructure automatically and efficiently (with help of hardware offload) provides end-to-end encryption for the infrastructure RPC traffic that goes over the network between data centers.

Access management of end-user data in Google Workspace

A typical Google Workspace service is written to do something for an end user. For example, an end user can store their email on Gmail. The end user's interaction with an application like Gmail might span other services within the infrastructure. For example, Gmail might call a People API to access the end user's address book.

The Encryption of inter-service communication section describes how a service (such as Google Contacts) is designed to protect RPC requests from another service (such as Gmail). However, this level of access control is still a broad set of permissions because Gmail is able to request the contacts of any user at any time.

When Gmail makes an RPC request to Google Contacts on behalf of an end user, the infrastructure lets Gmail present an end-user permission ticket in the RPC request. This ticket proves that Gmail is making the RPC request on behalf of that particular end user. The ticket enables Google Contacts to implement a safeguard so that it only returns data for the end user named in the ticket.

The infrastructure provides a central user identity service that issues these end-user context tickets. The identity service verifies the end-user login and then issues a user credential, such as a cookie or OAuth token, to the user's device. Every subsequent request from the device to our infrastructure must present that end-user credential.

When a service receives an end-user credential, the service passes the credential to the identity service for verification. If the end-user credential is verified, the identity service returns a short-lived end-user context ticket that can be used for RPCs related to the user's request. In our example, the service that gets the end-user context ticket is Gmail, which passes the ticket to Google Contacts. From that point on, for any cascading calls, the calling service can send the end-user context ticket to the callee as a part of the RPC.
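
The following sketch illustrates the idea of an end-user context ticket: a short-lived, signed assertion naming the user, which a downstream service verifies before returning only that user's data. The ticket format, the HMAC-based signing, and the service names are illustrative assumptions, not Google's actual protocol.

```python
"""Sketch of an end-user context ticket (illustrative only)."""
import hashlib
import hmac
import json
import time

IDENTITY_SERVICE_KEY = b"demo-only-secret"   # a real system would use asymmetric keys

def issue_context_ticket(user_id: str, ttl_seconds: int = 300) -> dict:
    body = {"user": user_id, "exp": time.time() + ttl_seconds}
    mac = hmac.new(IDENTITY_SERVICE_KEY, json.dumps(body, sort_keys=True).encode(),
                   hashlib.sha256).hexdigest()
    return {"body": body, "mac": mac}

def verify_context_ticket(ticket: dict) -> str:
    expected = hmac.new(IDENTITY_SERVICE_KEY,
                        json.dumps(ticket["body"], sort_keys=True).encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, ticket["mac"]):
        raise PermissionError("ticket signature invalid")
    if time.time() > ticket["body"]["exp"]:
        raise PermissionError("ticket expired")
    return ticket["body"]["user"]

CONTACTS = {"user-123": ["ada@example.com"], "user-456": ["grace@example.com"]}

def list_contacts(ticket: dict) -> list[str]:
    user = verify_context_ticket(ticket)     # only this user's data may be returned
    return CONTACTS.get(user, [])

ticket = issue_context_ticket("user-123")    # issued after the user's login is verified
print(list_contacts(ticket))                 # ['ada@example.com']
```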

The following diagram shows how Service A and Service B communicate. The infrastructure provides service identity, automatic mutual authentication, encrypted inter-service communication, and enforcement of the access policies that are defined by the service owner. Each service has a service configuration, which the service owner creates. For encrypted inter-service communication, automatic mutual authentication uses caller and callee identities. Communication is only possible when an access rule configuration permits it.

Diagram that shows how Service A and Service B communicate.


Access management of end-user data in Google Cloud

Similar to Access management of end-user data in Google Workspace, the infrastructure provides a central user identity service that authenticates service accounts and issues end-user context tickets after a service account is authenticated. Access management between Google Cloud services is typically done with service agents rather than using end-user context tickets.

Google Cloud uses Identity and Access Management (IAM) and context-aware products such as Identity-Aware Proxy to let you manage access to the resources in your Google Cloud organization. All requests to Google Cloud services go through IAM to verify permissions.

The access management process is as follows:

  1. A request comes in through the Google Front End service or the Cloud Front End service for customer VMs.
  2. The request is routed to the central user identity service that completes the authentication check and issues the end-user context tickets.
  3. The request is also routed through additional policy checks, such as IAM permissions and any VPC Service Controls restrictions that you have configured.
  4. After all of these checks pass, the Google Cloud backend services are called (a simplified sketch of this flow follows).
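
In the sketch below, the request is authenticated, an IAM-style permission is checked, and only then is the backend called. All names are hypothetical.

```python
"""Simplified sketch of the request flow above (illustrative only)."""

PERMISSIONS = {("alice@example.com", "storage.objects.get"): True}

def authenticate(token: str) -> str:
    # Stand-in for the central identity service; returns the verified principal.
    if not token.startswith("valid:"):
        raise PermissionError("authentication failed")
    return token.removeprefix("valid:")

def check_permission(principal: str, permission: str) -> None:
    if not PERMISSIONS.get((principal, permission), False):
        raise PermissionError(f"{principal} lacks {permission}")

def handle_request(token: str, permission: str) -> str:
    principal = authenticate(token)          # step 2: identity check
    check_permission(principal, permission)  # step 3: policy checks
    return "backend response"                # step 4: backend service is called

print(handle_request("valid:alice@example.com", "storage.objects.get"))
```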

For information about access management in Google Cloud, see IAM overview.

Secure data storage

This section describes how we implement security for data that is stored on the infrastructure.

Encryption at rest

Google's infrastructure provides various storage services and distributed file systems (for example, Spanner and Colossus), and a central key management service. Applications at Google access physical storage by using storage infrastructure. We use several layers of encryption to protect data at rest. By default, the storage infrastructure encrypts all user data before the user data is written to physical storage.

The infrastructure performs encryption at the application or storage infrastructure layer. Encryption lets the infrastructure isolate itself from potential threats at the lower levels of storage, such as malicious disk firmware. Where applicable, we also enable hardware encryption support in our hard drives and SSDs, and we meticulously track each drive through its lifecycle. Before a decommissioned, encrypted storage device can physically leave our custody, the device is cleaned by using a multi-step process that includes two independent verifications. Devices that do not pass this cleaning process are physically destroyed (that is, shredded) on-premises.
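
One common pattern consistent with the layered encryption described above is envelope encryption: user data is encrypted with a data encryption key (DEK), and the DEK is in turn wrapped with a key encryption key (KEK) held by a central key management service. The sketch below assumes the third-party Python `cryptography` package and is illustrative only, not Google's storage implementation.

```python
"""Sketch of envelope encryption with a DEK wrapped by a KMS-held KEK."""
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

kek = AESGCM.generate_key(bit_length=256)   # held by the central KMS, never by storage
dek = AESGCM.generate_key(bit_length=256)   # per-chunk data encryption key

# Encrypt the user data with the DEK before it touches physical storage.
data_nonce = os.urandom(12)
ciphertext = AESGCM(dek).encrypt(data_nonce, b"user data chunk", None)

# Wrap the DEK with the KEK; only the wrapped DEK is stored next to the ciphertext.
dek_nonce = os.urandom(12)
wrapped_dek = AESGCM(kek).encrypt(dek_nonce, dek, None)

# To read: unwrap the DEK via the KMS, then decrypt the chunk.
recovered_dek = AESGCM(kek).decrypt(dek_nonce, wrapped_dek, None)
assert AESGCM(recovered_dek).decrypt(data_nonce, ciphertext, None) == b"user data chunk"
```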

In addition to the encryption done by the infrastructure, Google Cloud and Google Workspace provide key management services. For Google Cloud, Cloud KMS is a cloud service that lets customers manage cryptographic keys. For Google Workspace, you can use client-side encryption. For more information, see Client-side encryption and strengthened collaboration in Google Workspace.

Deletion of data

Deletion of data typically starts with marking specific data as scheduled for deletion rather than actually deleting the data. This approach lets us recover from unintentional deletions, whether they are customer-initiated, are due to a bug, or are the result of an internal process error. After data is marked as scheduled for deletion, it is deleted in accordance with service-specific policies.

When an end user deletes their account, the infrastructure notifies the services that are handling the end-user data that the account has been deleted. The services can then schedule the data that is associated with the deleted end-user account for deletion. This feature enables an end user to control their own data.
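
The sketch below illustrates the mark-then-delete approach: a record is first marked as scheduled for deletion (and can still be restored during a recovery window), then purged once a service-specific window has passed. The structure and the window length are illustrative assumptions.

```python
"""Sketch of mark-then-delete data deletion (illustrative only)."""
import time

RECOVERY_WINDOW_SECONDS = 30 * 24 * 3600      # example service-specific policy

store = {"msg-1": {"data": "hello", "delete_after": None}}

def schedule_deletion(record_id: str) -> None:
    store[record_id]["delete_after"] = time.time() + RECOVERY_WINDOW_SECONDS

def restore(record_id: str) -> None:
    store[record_id]["delete_after"] = None   # recover from an unintentional deletion

def purge_expired(now: float) -> None:
    for record_id, rec in list(store.items()):
        if rec["delete_after"] is not None and now >= rec["delete_after"]:
            del store[record_id]              # actual deletion happens here

schedule_deletion("msg-1")
purge_expired(time.time())                            # too early: record is kept
purge_expired(time.time() + RECOVERY_WINDOW_SECONDS)  # window passed: record removed
assert "msg-1" not in store
```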

For more information, see Data deletion on Google Cloud.

Secure internet communication

This section describes how we secure communication between the internet and the services that run on Google infrastructure.

As discussed in Hardware design and provenance, the infrastructure consists of many physical machines that are interconnected over the LAN and WAN. The security of inter-service communication is not dependent on the security of the network. However, we isolate our infrastructure from the internet into a private IP address space. We only expose a subset of the machines directly to external internet traffic so that we can implement additional protections such as defenses against denial of service (DoS) attacks.

Google Front End service

When a service must make itself available on the internet, it can register itself with an infrastructure service called the Google Front End (GFE). The GFE ensures that all TLS connections are terminated with correct certificates and by following best practices such as supporting perfect forward secrecy. The GFE also applies protections against DoS attacks. The GFE then forwards requests for the service by using the RPC security protocol discussed in Access management of end-user data in Google Workspace.

In effect, any internal service that must publish itself externally uses the GFE as a smart reverse-proxy frontend. The GFE provides public IP address hosting of its public DNS name, DoS protection, and TLS termination. GFEs run on the infrastructure like any other service and can scale to match incoming request volumes.
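
Conceptually, a TLS-terminating front end configures modern protocol versions and forward-secret key exchange before forwarding traffic internally. The sketch below shows such a configuration with Python's standard ssl module, assuming certificate files named fullchain.pem and privkey.pem exist; it is not the GFE's implementation.

```python
"""Sketch of a TLS-terminating front end's server-side TLS configuration."""
import ssl

def make_server_tls_context(certfile: str = "fullchain.pem",
                            keyfile: str = "privkey.pem") -> ssl.SSLContext:
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2       # no legacy protocol versions
    # ECDHE-only cipher suites give perfect forward secrecy for TLS 1.2;
    # TLS 1.3 suites are forward secret by design.
    ctx.set_ciphers("ECDHE+AESGCM:ECDHE+CHACHA20")
    ctx.load_cert_chain(certfile=certfile, keyfile=keyfile)
    return ctx

# A reverse proxy would wrap its listening socket with this context, terminate TLS
# here, and forward the decrypted request to the backing service over an internal,
# separately authenticated channel.
```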

Customer VMs on Google Cloud do not register with GFE. Instead, they register with the Cloud Front End, which is a special configuration of GFE that uses the Compute Engine networking stack. Cloud Front End lets customer VMs access a Google service directly using their public or private IP address. (Private IP addresses are only available when Private Google Access is enabled.)

DoS protection

The scale of our infrastructure enables it to absorb many DoS attacks. To further reduce the risk of DoS impact on services, we have multi-tier, multi-layer DoS protections.

When our fiber-optic backbone delivers an external connection to one of our data centers, the connection passes through several layers of hardware and software load balancers. These load balancers report information about incoming traffic to a central DoS service running on the infrastructure. When the central DoS service detects a DoS attack, the service can configure the load balancers to drop or throttle traffic associated with the attack.

The GFE instances also report information about the requests that they are receiving to the central DoS service, including application-layer information that the load balancers don't have access to. The central DoS service can then configure the GFE instances to drop or throttle attack traffic.
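
The sketch below illustrates this control loop: edge reporters submit per-source traffic counters to a central service, which returns throttle or drop decisions for sources that exceed a limit. The thresholds and structure are illustrative only.

```python
"""Sketch of a central DoS control loop (illustrative thresholds and names)."""
from collections import Counter

REQUESTS_PER_MINUTE_LIMIT = 10_000

class CentralDosService:
    def __init__(self) -> None:
        self.per_source = Counter()

    def report(self, source: str, request_count: int) -> None:
        """Called by load balancers and front ends with per-source counters."""
        self.per_source[source] += request_count

    def throttle_decisions(self) -> dict[str, str]:
        """Returned to the edge: which sources to throttle or drop this interval."""
        return {src: ("drop" if n > 10 * REQUESTS_PER_MINUTE_LIMIT else "throttle")
                for src, n in self.per_source.items()
                if n > REQUESTS_PER_MINUTE_LIMIT}

dos = CentralDosService()
dos.report("198.51.100.7", 250_000)    # bursty source reported by several balancers
dos.report("203.0.113.9", 120)
print(dos.throttle_decisions())        # {'198.51.100.7': 'drop'}
```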

User authentication

After DoS protection, the next layer of defense for secure communication comes from the central identity service. End users interact with this service through the Google login page. The service asks for a username and password, and it can also challenge users for additional information based on risk factors. Example risk factors include whether the users have logged in from the same device or from a similar location in the past. After authenticating the user, the identity service issues credentials such as cookies and OAuth tokens that can be used for subsequent calls.
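
The following sketch illustrates risk-based challenges: a sign-in from a known device and usual location proceeds with a password, while unusual signals require a second factor. The signals and thresholds are illustrative assumptions.

```python
"""Sketch of risk-based login challenges (illustrative signals only)."""

KNOWN_DEVICES = {("user-123", "device-abc")}
USUAL_COUNTRIES = {"user-123": {"CH", "DE"}}

def required_challenges(user: str, device: str, country: str) -> list[str]:
    challenges = ["password"]
    risky = ((user, device) not in KNOWN_DEVICES or
             country not in USUAL_COUNTRIES.get(user, set()))
    if risky:
        challenges.append("second_factor")   # e.g. a security key or OTP
    return challenges

print(required_challenges("user-123", "device-abc", "CH"))   # ['password']
print(required_challenges("user-123", "device-new", "BR"))   # ['password', 'second_factor']
```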

When users sign in, they can use second factors such as OTPs or phishing-resistant security keys such as the Titan Security Key. The Titan Security Key is a physical token that supports the FIDO Universal 2nd Factor (U2F) open standard, which we helped develop with the FIDO Alliance. Most web platforms and browsers have adopted this open authentication standard.

Operational security

This section describes how we develop infrastructure software, protect our employees' machines and credentials, and defend against threats to the infrastructure from both insiders and external actors.

Safe software development

Besides the source control protections and two-party review process described in Source code protections, we use libraries that prevent developers from introducing certain classes of security bugs. For example, we have libraries and frameworks that help eliminate XSS vulnerabilities in web apps. We also use automated tools such as fuzzers, static analysis tools, and web security scanners to automatically detect security bugs.
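
The sketch below illustrates how a library can make the safe behavior the default: every value substituted into HTML is escaped, so call sites cannot forget to do it. This shows the general idea only and is not one of Google's internal web frameworks.

```python
"""Sketch of an XSS-hardened rendering helper (illustrative only)."""
import html
import string

class SafeTemplate(string.Template):
    def render(self, **values: str) -> str:
        # Escape every substituted value; callers cannot opt out by accident.
        return self.substitute({k: html.escape(v) for k, v in values.items()})

page = SafeTemplate("<p>Hello, $name!</p>")
print(page.render(name='<script>alert("xss")</script>'))
# <p>Hello, &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;!</p>
```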

As a final check, we use manual security reviews that range from quick triages for less risky features to in-depth design and implementation reviews for the most risky features. The team that conducts these reviews includes experts across web security, cryptography, and operating system security. The reviews can lead to the development of new security library features and new fuzzers that we can use for future products.

In addition, we run a Vulnerability Rewards Program that rewards anyone who discovers and informs us of bugs in our infrastructure or applications. For more information about this program, including the rewards that we've given, see Bug hunters key stats.

We also invest in finding zero-day exploits and other security issues in the open source software that we use. We run Project Zero, which is a team of Google researchers who are dedicated to researching zero-day vulnerabilities, including Spectre and Meltdown. In addition, we are the largest submitter of CVEs and security bug fixes for the Linux KVM hypervisor.

Source code protections

Our source code is stored in repositories with built-in source integrity and governance, where both current and past versions of the service can be audited. The infrastructure requires that a service's binaries be built from specific source code, after that code is reviewed, checked in, and tested. Binary Authorization for Borg (BAB) is an internal enforcement check that happens when a service is deployed. BAB does the following (a minimal sketch of such a check appears after this list):

  • Ensures that the production software and configuration that is deployed at Google is reviewed and authorized, particularly when that code can access user data.
  • Ensures that code and configuration deployments meet certain minimum standards.
  • Limits the ability of an insider or adversary to make malicious modifications to source code and also provides a forensic trail from a service back to its source.
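
The following sketch illustrates a deploy-time admission check in the spirit of BAB: a binary is admitted only if its build provenance shows a trusted builder and, for code that can access user data, a completed review. The field names and policy are hypothetical, not BAB's actual format.

```python
"""Sketch of a deploy-time provenance check (illustrative only)."""
from dataclasses import dataclass

TRUSTED_BUILDERS = {"central-ci"}

@dataclass
class Provenance:
    builder: str
    source_commit: str
    code_reviewed: bool

def admit_deployment(provenance: Provenance, accesses_user_data: bool) -> None:
    if provenance.builder not in TRUSTED_BUILDERS:
        raise PermissionError("binary was not built by a trusted builder")
    if accesses_user_data and not provenance.code_reviewed:
        raise PermissionError("code touching user data must be reviewed before deploy")
    # Admission also leaves a forensic trail from the running service back to
    # provenance.source_commit, for example in an audit log.

admit_deployment(Provenance("central-ci", "abc123", code_reviewed=True),
                 accesses_user_data=True)
```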

Keeping employee devices and credentials safe

We implement safeguards to help protect our employees' devices and credentials from compromise. To help protect our employees against sophisticated phishing attempts, we have replaced OTP second-factor authentication with the mandatory use of U2F-compatible security keys.

We monitor the client devices that our employees use to operate our infrastructure. We ensure that the operating system images for these devices are up to date with security patches and we control the applications that employees can install on their devices. We also have systems that scan user-installed applications, downloads, browser extensions, and web browser content to determine whether they are suitable for corporate devices.

Being connected to the corporate LAN is not our primary mechanism for granting access privileges. Instead, we use zero-trust security to help protect employee access to our resources. Access-management controls at the application level expose internal applications to employees only when employees use a managed device and are connecting from expected networks and geographic locations. A client device is trusted based on a certificate that's issued to the individual machine, and based on assertions about its configuration (such as up-to-date software). For more information, see BeyondCorp.
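
The sketch below illustrates an application-level, zero-trust access decision: access depends on the user's group membership, a valid certificate issued to the managed device, and device posture, rather than on the network the request came from. The fields and thresholds are illustrative assumptions.

```python
"""Sketch of a zero-trust access decision in the spirit of BeyondCorp."""
from dataclasses import dataclass

@dataclass
class AccessRequest:
    user: str
    user_groups: set[str]
    device_cert_valid: bool       # certificate issued to this individual machine
    os_patch_level: int
    network: str                  # informative, but never sufficient on its own

MIN_PATCH_LEVEL = 2024_05

def allow_internal_app(req: AccessRequest, required_group: str) -> bool:
    return (required_group in req.user_groups
            and req.device_cert_valid
            and req.os_patch_level >= MIN_PATCH_LEVEL)

req = AccessRequest("alice", {"eng"}, device_cert_valid=True,
                    os_patch_level=2024_06, network="corp-lan")
print(allow_internal_app(req, required_group="eng"))   # True, but not because of the LAN
```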

Reducing insider risk

Insider risk is the potential for a current or former employee, contractor, or other business partner who has or had authorized access to our network, system, or data to misuse that access to undermine the confidentiality, integrity, or availability of our information or information systems.

To help reduce insider risk, we limit and actively monitor the activities of employees who have been granted administrative access to the infrastructure. We continually work to eliminate the need for privileged access for particular tasks by using automation that can accomplish the same tasks in a safe and controlled way. We expose limited APIs that allow debugging without exposing sensitive data, and we require two-party approvals for certain sensitive actions performed by human operators.

Google employee access to end-user information can be logged through low-level infrastructure hooks. Our security team monitors access patterns and investigates unusual events. For more information, see Privileged Access Management in Google Cloud (PDF).

We use Binary Authorization for Borg to help protect our supply chain from insider risk. In addition, our investment in BeyondProd helps to protect user data in Google infrastructure and to establish trust in our services.

In Google Cloud, you can monitor access to your data using Access Transparency. Access Transparency logs let you verify that Google personnel are accessing your content only for valid business reasons, such as fixing an outage or attending to your support requests. Access Approval ensures that Cloud Customer Care and engineering require your explicit approval whenever they need to access your data. The approval is cryptographically verified to ensure the integrity of the access approval.

Threat monitoring

The Threat Analysis Group at Google monitors threat actors and the evolution of their tactics and techniques. The goals of this group are to help improve the safety and security of Google products and share this intelligence for the benefit of the online community.

For Google Cloud, you can use Google Cloud Threat Intelligence for Chronicle and VirusTotal to monitor and respond to many types of malware. Google Cloud Threat Intelligence for Chronicle is a team of threat researchers who develop threat intelligence for use with Chronicle. VirusTotal is a malware database and visualization solution that you can use to better understand how malware operates within your enterprise.

For more information about our threat monitoring activities, see the Threat Horizons report.

Intrusion detection

We use sophisticated data processing pipelines to integrate host-based signals on individual devices, network-based signals from various monitoring points in the infrastructure, and signals from infrastructure services. Rules and machine intelligence built on top of these pipelines give operational security engineers warnings of possible incidents. Our investigation and incident-response teams triage, investigate, and respond to these potential incidents 24 hours a day, 365 days a year. We conduct Red Team exercises to measure and improve the effectiveness of our detection and response mechanisms.
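
The sketch below illustrates signal correlation at a very small scale: host-based and network-based signals for the same machine are joined, and a simple rule flags combinations that warrant triage. The signal names and the rule are illustrative only.

```python
"""Sketch of signal correlation for intrusion detection (illustrative only)."""
from collections import defaultdict

signals = [
    {"machine": "m1", "source": "host",    "event": "unexpected_setuid_binary"},
    {"machine": "m1", "source": "network", "event": "outbound_to_rare_destination"},
    {"machine": "m2", "source": "network", "event": "outbound_to_rare_destination"},
]

by_machine = defaultdict(set)
for s in signals:
    by_machine[s["machine"]].add((s["source"], s["event"]))

def possible_incidents(grouped: dict) -> list[str]:
    alerts = []
    for machine, events in grouped.items():
        sources = {src for src, _ in events}
        # Rule: corroborating host AND network signals on one machine -> triage.
        if {"host", "network"} <= sources:
            alerts.append(machine)
    return alerts

print(possible_incidents(by_machine))   # ['m1']
```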

What's next