The building blocks of reliability in Google Cloud

Google Cloud infrastructure services run in locations around the globe. The locations are divided into failure domains called regions and zones, which are the foundational building blocks for designing reliable infrastructure for your cloud workloads.

A failure domain is a resource or a group of resources that can fail independently of other resources. A standalone Compute Engine VM is an example of a resource that's a failure domain. A Google Cloud region or zone is an example of a failure domain that consists of a group of resources. When an application is distributed redundantly across failure domains, it can achieve a higher aggregated level of availability than that provided by each failure domain.
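
The aggregation effect described above can be sketched with a little arithmetic. The following is a minimal sketch, not Google's method for deriving availability targets. It assumes that the failure domains fail independently, that each has the same availability, and that failover between them is instantaneous. The function name `combined_availability` is hypothetical.

```python
def combined_availability(per_domain_availability: float, domains: int) -> float:
    """Idealized availability of a workload replicated across failure domains.

    Assumes independent failures and perfect, instantaneous failover, so the
    workload is down only when every domain is down at the same time.
    """
    if not 0.0 <= per_domain_availability <= 1.0:
        raise ValueError("per_domain_availability must be between 0 and 1")
    if domains < 1:
        raise ValueError("domains must be at least 1")
    # Probability that all domains are down simultaneously.
    all_down = (1.0 - per_domain_availability) ** domains
    return 1.0 - all_down


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, combined_availability(0.999, n))
```

The result is an upper bound. Correlated failures, failover time, and shared dependencies lower the availability that a real deployment achieves.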

This part of the Google Cloud infrastructure reliability guide describes the building blocks of reliability in Google Cloud and how they affect the availability of your cloud resources.

Regions and zones

Google Cloud regions are independent geographic locations. Each Google Cloud region contains multiple zones. A failure in one zone is unlikely to affect the infrastructure in the other zones. A global network backbone provides high-bandwidth, low-latency connectivity across all the Google Cloud zones and regions.

Platform availability

Google Cloud infrastructure is designed to tolerate and recover from failures. Google continually invests in innovative approaches to maintain and improve the reliability of Google Cloud. The following capabilities of Google Cloud infrastructure help to provide a reliable platform for your cloud workloads:

  • Geographically separated regions to mitigate the effects of natural disasters and region outages on global services.
  • Hardware redundancy and replication to avoid single points of failure. This redundancy extends to machines, racks, storage systems, network devices like switches and routers, cables, and external fiber connectivity.
  • Live migration of resources during maintenance events. For example, during planned infrastructure maintenance, Compute Engine VMs can be moved to another host in the same zone by using live migration.
  • A secure-by-design infrastructure foundation for the physical infrastructure and software on which Google Cloud runs, and operational security controls to protect your data and workloads. For more information, see Google infrastructure security design overview.
  • A high-performance backbone network that uses an advanced software-defined networking (SDN) approach to network management, with edge-caching services to deliver consistent performance that scales well.
  • Continuous monitoring and reporting. You can view the status of Google Cloud services in every location by using the Google Cloud Service Health Dashboard.
  • Annual, company-wide Disaster Recovery Testing (DiRT) events to ensure that Google Cloud services and internal business operations continue to run during a disaster.

Google Cloud infrastructure is designed to support the following target levels of availability for most customer workloads:

| Deployment location | Availability (uptime) % | Approximate maximum downtime |
|---|---|---|
| Single zone | 3 nines: 99.9% | 43.2 minutes in a 30-day month |
| Multiple zones in a region | 4 nines: 99.99% | 4.3 minutes in a 30-day month |
| Multiple regions | 5 nines: 99.999% | 26 seconds in a 30-day month |
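
The downtime figures in the table above follow directly from the availability percentages. The following sketch shows the arithmetic for a 30-day month. The function name `max_downtime_seconds` is hypothetical and is used only for illustration.

```python
def max_downtime_seconds(availability_percent: float, days: int = 30) -> float:
    """Maximum downtime, in seconds, allowed by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    period_seconds = days * 24 * 60 * 60
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return period_seconds * unavailable_fraction


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        seconds = max_downtime_seconds(target)
        print(f"{target}%: {seconds / 60:.2f} minutes ({seconds:.1f} seconds)")
```

Each additional nine reduces the permitted downtime by a factor of ten.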

The availability percentages in the preceding table are targets. The uptime Service Level Agreements (SLAs) for specific Google Cloud services might be different from these availability targets. For example, the uptime SLA for a Cloud Bigtable instance depends on the number of clusters, their distribution across locations, and the routing policy that you configure.

The minimum uptime SLA for a Bigtable instance with clusters in three or more regions is 99.999% if the multi-cluster routing policy is configured. However, if the single-cluster routing policy is configured, then the minimum uptime SLA is 99.9% regardless of the number of clusters and their distribution.

The diagrams in this section show Bigtable instances with varying numbers of clusters in different locations, and the consequent differences in their uptime SLAs.

Single cluster

The following diagram shows a single-cluster Bigtable instance, with a minimum uptime SLA of 99.9%:

Single-cluster Bigtable instance (minimum uptime SLA: 99.9%).

Multiple clusters in a single region

The following diagram shows a multi-cluster Bigtable instance in multiple zones within a single region, with multi-cluster routing (minimum uptime SLA: 99.99%):

Multi-cluster Bigtable instance in multiple zones within a single region, with multi-cluster routing (minimum uptime SLA: 99.99%).

Multiple clusters in two regions

The following diagram shows a multi-cluster Bigtable instance in two regions, with multi-cluster routing (minimum uptime SLA: 99.99%):

Multi-cluster Bigtable instance in two regions, with multi-cluster routing (minimum uptime SLA: 99.99%).

Multiple clusters in three regions

The following diagram shows a multi-cluster Bigtable instance in three regions, with multi-cluster routing (minimum uptime SLA: 99.999%):

Multi-cluster Bigtable instance in three regions, with multi-cluster routing (minimum uptime SLA: 99.999%).
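
The Bigtable cases described in this section can be summarized as a small lookup. The following sketch encodes only the configurations mentioned above and is not an authoritative statement of the SLA. The function name `bigtable_min_uptime_sla` and the string values for the routing policy are assumptions made for illustration.

```python
def bigtable_min_uptime_sla(routing_policy: str, cluster_count: int,
                            region_count: int) -> float:
    """Minimum uptime SLA (percent) for the cases described in this section.

    routing_policy: "single-cluster" or "multi-cluster" (illustrative labels).
    cluster_count:  number of clusters in the instance.
    region_count:   number of distinct regions that contain a cluster.
    """
    if routing_policy not in ("single-cluster", "multi-cluster"):
        raise ValueError("unknown routing policy")
    if cluster_count < 1 or not 1 <= region_count <= cluster_count:
        raise ValueError("inconsistent cluster and region counts")

    # Single-cluster routing caps the SLA regardless of cluster distribution.
    if routing_policy == "single-cluster" or cluster_count == 1:
        return 99.9
    # Multi-cluster routing with clusters in three or more regions.
    if region_count >= 3:
        return 99.999
    # Multi-cluster routing across zones in one region, or across two regions.
    return 99.99


if __name__ == "__main__":
    print(bigtable_min_uptime_sla("multi-cluster", 3, 3))
```

For configurations that this section doesn't cover, consult the Bigtable SLA rather than relying on this sketch.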

Location scopes

The location scope of a Google Cloud resource determines the extent to which an infrastructure failure can affect the resource. Most resources that you provision in Google Cloud have one of the following location scopes: zonal, regional, multi-region, or global.

The location scope of some resource types is fixed; that is, you can't choose or change the location scope. For example, Virtual Private Cloud (VPC) networks are global resources, and Compute Engine virtual machines (VMs) are zonal resources. For certain resources, you can choose the location scope while provisioning the resource. For example, when you create a Google Kubernetes Engine (GKE) cluster, you can choose to create a zonal or regional GKE cluster.

The following sections describe location scopes in more detail.

Zonal resources

Zonal resources are deployed within a single zone in a Google Cloud region. The following are examples of zonal resources. This list is not exhaustive.

  • Compute Engine VMs
  • Zonal managed instance groups (MIGs)
  • Zonal persistent disks
  • Single-zone GKE clusters
  • Filestore Basic and Filestore High Scale instances
  • Dataflow jobs
  • Cloud SQL instances
  • Dataproc clusters on Compute Engine

A failure in a zone might affect the zonal resources that are provisioned within that zone. Zones are designed to minimize the risk of correlated failures with other zones in the region. A failure in one zone usually does not affect the resources in the other zones in the region. Also, a failure in a zone doesn't necessarily cause all the infrastructure in that zone to be unavailable. The zone merely defines the expected boundary for the effect of a failure.

To protect applications that use zonal resources against zonal incidents, you can distribute or replicate the resources across multiple zones or regions. For more information, see Design reliable infrastructure for your workloads in Google Cloud.
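
As an illustration of distributing zonal resources, the following sketch assigns replicas to zones in round-robin order and computes how many replicas a single zone outage could remove. It is a simplified model, not the placement algorithm of any Google Cloud service. The function names are hypothetical, and the zone names are examples only.

```python
import math
from itertools import cycle, islice
from typing import List, Sequence


def spread_across_zones(replica_count: int, zones: Sequence[str]) -> List[str]:
    """Return the zone for each replica, assigned in round-robin order."""
    if replica_count < 0:
        raise ValueError("replica_count must not be negative")
    if not zones:
        raise ValueError("at least one zone is required")
    return list(islice(cycle(zones), replica_count))


def worst_case_zone_loss(replica_count: int, zone_count: int) -> int:
    """Largest number of replicas that one zone outage can take down."""
    if zone_count < 1:
        raise ValueError("zone_count must be at least 1")
    return math.ceil(replica_count / zone_count)


if __name__ == "__main__":
    zones = ["us-central1-a", "us-central1-b", "us-central1-c"]
    print(spread_across_zones(5, zones))
    print(worst_case_zone_loss(5, len(zones)))
```

With an even spread, the remaining replicas must be able to carry the full load while the affected zone is unavailable, so capacity planning should account for the worst-case loss.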

Regional resources

Regional resources are deployed redundantly across multiple zones within a region. The following are examples of regional resources. This list is not exhaustive.

  • Regional MIGs
  • Regional Cloud Storage buckets
  • Regional persistent disks
  • Regional GKE clusters with the default (multi-zone) configuration
  • VPC subnets
  • Regional external HTTP(S) load balancers
  • Regional Cloud Spanner instances
  • Filestore Enterprise instances
  • Cloud Run services

Regional resources are resilient to incidents in a specific zone. A region outage can affect some or all the regional resources provisioned within that region. Such outages can be caused by natural disasters or by large-scale infrastructure failures.

Multi-region resources

Multi-region resources are distributed across specific regions. The following are examples of multi-region resources. This list is not exhaustive.

  • Dual-region and multi-region Cloud Storage buckets
  • Multi-region Cloud Spanner instances
  • Multi-cluster (multi-region) Bigtable instances
  • Multi-region key rings in Cloud Key Management Service

For a complete list of the Google Cloud services that are available in multi-region configurations, see Products available by location.

Multi-region resources are resilient to incidents in specific regions and zones. An infrastructure outage that occurs in multiple regions can affect the availability of some or all the multi-region resources that are provisioned in the affected regions.

Global resources

Global resources are available across all Google Cloud locations. The following are examples of global resources. This list is not exhaustive.

  • Projects. For guidance and best practices about organizing your Google Cloud resources into folders and projects, see Decide a resource hierarchy for your Google Cloud landing zone.
  • VPC networks, including associated routes and firewall rules
  • Cloud DNS zones
  • Global external HTTP(S) load balancers
  • Global key rings in Cloud Key Management Service
  • Pub/Sub topics
  • Secrets in Secret Manager

For a complete list of the Google Cloud services that are available globally, see Global products.

Global resources are resilient to zonal and regional incidents. These resources don't rely on infrastructure in any specific region. Google Cloud has systems and processes that help to minimize the risk of global infrastructure outages. Google also continually monitors the infrastructure, and quickly resolves any global outages.

The following table summarizes the relative resilience of zonal, regional, multi-region, and global resources to application and infrastructure issues. It also describes the effort required to set up these resources, and recommendations to mitigate the effects of outages.

| Resource scope | Resilience, and effort to set up | Recommendations to mitigate the effects of infrastructure outages |
|---|---|---|
| Zonal | Low | Deploy the resources redundantly in multiple zones or regions. |
| Regional | Medium | Deploy the resources redundantly in multiple regions. |
| Multi-region or global | High | Manage changes carefully, and use defense-in-depth fallbacks where possible. For more information, see Recommendations to manage the risk of outages of global resources. |
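
The relationship between location scope and outage resilience can also be expressed as a lookup. The following sketch restates the descriptions in the preceding sections. The names `RESILIENT_TO`, `MITIGATION`, and `is_resilient_to` are hypothetical, and the model treats every outage as a single-zone or single-region event.

```python
# Outage kinds that each location scope is designed to withstand,
# as described in the preceding sections.
RESILIENT_TO = {
    "zonal": frozenset(),
    "regional": frozenset({"zone"}),
    "multi-region": frozenset({"zone", "region"}),
    "global": frozenset({"zone", "region"}),
}

# Mitigation recommendations from the summary table.
MITIGATION = {
    "zonal": "Deploy the resources redundantly in multiple zones or regions.",
    "regional": "Deploy the resources redundantly in multiple regions.",
    "multi-region": "Manage changes carefully, and use defense-in-depth fallbacks.",
    "global": "Manage changes carefully, and use defense-in-depth fallbacks.",
}


def is_resilient_to(scope: str, outage: str) -> bool:
    """Return True if a resource with this scope is designed to survive
    a single outage of the given kind ("zone" or "region")."""
    if scope not in RESILIENT_TO:
        raise ValueError(f"unknown scope: {scope}")
    if outage not in ("zone", "region"):
        raise ValueError(f"unknown outage kind: {outage}")
    return outage in RESILIENT_TO[scope]


if __name__ == "__main__":
    for scope in RESILIENT_TO:
        print(scope, is_resilient_to(scope, "zone"), is_resilient_to(scope, "region"))
```

A lookup like this can serve as a review checklist: any resource whose scope isn't resilient to an outage kind that you need to tolerate requires the corresponding mitigation.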

Recommendations to manage the risk of outages of global resources

To take advantage of the resilience of global resources to zone and region outages, you might consider using certain global resources in your architecture. Google recommends the following approaches to manage the risk of outages of global resources:

Careful management of changes to global resources

Despite their resilience, critical global resources in your architecture can be single points of failure (SPOF). For example, you might want to use a global load balancer as the frontend for a geographically distributed application. A global load balancer is often a good choice for such an application. However, because all client requests pass through the global load balancer, a configuration error in that load balancer can affect the entire application. To reduce this risk, you must manage configuration changes to global resources carefully. For more information, see Control changes to global resources.

Use of regional resources as defense-in-depth fallbacks

For applications that have exceptionally high availability requirements, regional defense-in-depth fallbacks can help minimize the effect of outages of global resources. Consider the example of a geographically distributed application that has a global load balancer as the frontend. To ensure that the application remains accessible even if the global load balancer is affected by a global outage, you can deploy regional load balancers. You can configure the clients to prefer the global load balancer, but fail over to the nearest regional load balancer if the global load balancer is not available.
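
The client-side failover described above might look like the following sketch. It is one possible approach, not a Google-provided client. The URLs are placeholders, the function names are hypothetical, and the default fetcher uses only the Python standard library. A production client would also need health checks, retries with backoff, and a way to choose the nearest regional endpoint.

```python
import urllib.request
from typing import Callable, Sequence


def http_get(url: str, timeout: float = 5.0) -> bytes:
    """Fetch a URL with the standard library; raises OSError on failure."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_fallback(endpoints: Sequence[str],
                        fetch: Callable[[str], bytes] = http_get) -> bytes:
    """Try each endpoint in order of preference and return the first response.

    List the global load balancer first, followed by regional fallbacks.
    """
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    last_error = None
    for url in endpoints:
        try:
            return fetch(url)
        except OSError as exc:  # Includes URLError and socket timeouts.
            last_error = exc
    raise ConnectionError("all endpoints failed") from last_error


if __name__ == "__main__":
    # Placeholder hostnames for a global frontend and a regional fallback.
    endpoints = [
        "https://app.example.com/",
        "https://us-central1.app.example.com/",
    ]
    try:
        print(len(fetch_with_fallback(endpoints)))
    except ConnectionError as exc:
        print(f"request failed: {exc}")
```

Passing the fetcher as a parameter keeps the failover logic independent of the HTTP library and makes it straightforward to test with a stub.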

Example architecture with zonal, regional, and global resources

Your cloud topology can include a combination of zonal, regional, and global resources. The following diagram shows an example architecture for a multi-tier application that's deployed in Google Cloud.

Location scopes of Google Cloud resources.

As shown in the preceding diagram, a global external HTTP(S) load balancer receives client requests. The load balancer distributes the requests to the backend, which is a regional MIG that has two Compute Engine VMs. The application running on the VMs writes data to and reads data from a Cloud SQL database. The database is configured for high availability (HA). The primary and standby instances of the database are provisioned in separate zones, and the primary database is replicated synchronously to the standby database. In addition, the database is backed up automatically to a multi-region bucket in Cloud Storage.

The following table summarizes the Google Cloud resources in the preceding architecture and the resilience of each resource to zone and region outages:

| Resource | Resilience to outages |
|---|---|
| VPC network | VPC networks, including associated routes and firewall rules, are global resources. They are resilient to zone and region outages. |
| Subnets | VPC subnets are regional resources. They are resilient to zone outages. |
| Global external HTTP(S) load balancer | Global external HTTP(S) load balancers are resilient to zone and region outages. |
| Regional MIG | Regional MIGs are resilient to zone outages. |
| Compute Engine VMs | Compute Engine VMs are zonal resources. If a zone outage occurs, the individual Compute Engine VMs might be affected. However, the application can continue to serve requests because the backend for the load balancer is a regional MIG, and not standalone VMs. |
| Cloud SQL instances | The Cloud SQL deployment in this architecture is configured for HA; that is, the deployment includes a primary-standby pair of database instances. The primary database is replicated synchronously to the standby database by using regional persistent disks. If an outage occurs in the zone that hosts the primary database, the Cloud SQL service fails over to the standby database automatically. If a region outage occurs, you can restore the database in a different region by using the database backups. |
| Multi-region Cloud Storage bucket | Data that's stored in multi-region Cloud Storage buckets is resilient to single-region outages. |
| Persistent disks | Persistent disks can be zonal or regional. Regional persistent disks are resilient to zone outages. To prepare for recovery from region outages, you can schedule snapshots of persistent disks and store the snapshots in a multi-region Cloud Storage bucket. |