Google Cloud infrastructure reliability guide

Last reviewed 2023-09-01 UTC

Reliable infrastructure is a critical requirement for workloads in the cloud. To design reliable infrastructure for your workloads as a cloud architect, you need a good understanding of the reliability capabilities of your chosen cloud provider. This document describes the building blocks of reliability in Google Cloud (zones, regions, and location-scoped resources) and the availability levels that they provide. It also provides guidelines for assessing the reliability requirements of your workloads, and presents architectural recommendations for building and managing reliable infrastructure in Google Cloud.

This document is divided into the following parts:

If you've read this guide previously and need a summary of the changes, see the Change log section.

Overview of reliability

An application or workload is reliable when it meets your current objectives for availability and resilience to failures.

Availability (or uptime) is the percentage of time that an application is usable. For example, for an application that has an availability target of 99.99%, the total downtime must not exceed 8.64 seconds during a 24-hour period. Sometimes, availability is measured as the proportion of requests that the application serves successfully during a given period. For example, for an application that has an availability target of 99.99%, for every 100,000 requests received, not more than ten requests can fail. Availability is often expressed as the number of nines in the percentage. For example, 99.99% availability is expressed as "4 nines".
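The arithmetic behind these examples can be sketched in a few lines of Python. This is a minimal illustration, not part of any Google Cloud API: the function names are hypothetical, and availability targets are passed as strings so that `Decimal` keeps the percentages exact.

```python
import math
from decimal import Decimal


def unavailability(availability_pct: str) -> Decimal:
    """Fraction of time (or requests) that is allowed to fail, e.g. "99.99" -> 0.0001."""
    return (Decimal(100) - Decimal(availability_pct)) / Decimal(100)


def downtime_budget_seconds(availability_pct: str, period_seconds: int) -> Decimal:
    """Maximum total downtime, in seconds, for a time-based availability target."""
    return unavailability(availability_pct) * period_seconds


def failed_request_budget(availability_pct: str, total_requests: int) -> int:
    """Maximum number of failed requests for a request-based availability target."""
    # int() truncates, so a fractional budget is rounded down to stay within the target.
    return int(unavailability(availability_pct) * total_requests)


def number_of_nines(availability_pct: str) -> int:
    """Count of leading nines in the target, e.g. "99.99" -> 4 and "99.95" -> 3."""
    return math.floor(-unavailability(availability_pct).log10())


if __name__ == "__main__":
    seconds_per_day = 24 * 60 * 60
    print(downtime_budget_seconds("99.99", seconds_per_day))
    print(failed_request_budget("99.99", 100_000))
    print(number_of_nines("99.99"))
```

For a 99.99% target, these functions reproduce the figures in the preceding paragraph: a budget of 8.64 seconds of downtime in a 24-hour period, ten failed requests out of every 100,000, and "4 nines".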

Depending on the purpose of the application, you might have different sets of indicators for how reliable the application is. The following are examples of such reliability indicators:

For applications that serve content, availability, latency, and throughput are important reliability indicators. They indicate whether the application can respond to requests, how long the application takes to respond to requests, and how many requests the application can process successfully in a given period.
For databases and storage systems, latency, throughput, availability, and durability (how well data is protected against loss or corruption) are indicators of reliability. They indicate how long the system takes to read or write data, and whether data can be accessed on demand.
For big data and analytics workloads such as data processing pipelines, consistent pipeline performance (throughput and latency) is essential to ensure freshness of the data products, and is an important reliability indicator. It indicates how much data can be processed, and how long it takes for the pipeline to progress from data ingestion to data processing.
Most applications have data correctness as an essential reliability indicator.
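As an illustration of how the indicators for a content-serving application might be computed, the following Python sketch derives availability, latency, and throughput from a list of request records. The `RequestRecord` type, the function names, and the choice of a nearest-rank percentile are assumptions made for this example; they don't correspond to any specific monitoring product.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RequestRecord:
    """One served request (hypothetical record shape for this sketch)."""
    latency_ms: float
    succeeded: bool


def request_availability(records: Sequence[RequestRecord]) -> float:
    """Proportion of requests that were served successfully."""
    if not records:
        raise ValueError("at least one request record is required")
    return sum(r.succeeded for r in records) / len(records)


def latency_percentile(records: Sequence[RequestRecord], pct: float) -> float:
    """Latency at the given percentile (0 < pct <= 100), using the nearest-rank method."""
    if not records:
        raise ValueError("at least one request record is required")
    latencies = sorted(r.latency_ms for r in records)
    rank = max(1, math.ceil(pct / 100 * len(latencies)))
    return latencies[rank - 1]


def throughput_per_second(records: Sequence[RequestRecord], window_seconds: float) -> float:
    """Successfully processed requests per second over the observation window."""
    return sum(r.succeeded for r in records) / window_seconds


if __name__ == "__main__":
    # Nine successful requests (10 ms to 90 ms) and one slow failure, observed over 10 seconds.
    sample = [RequestRecord(latency_ms=10.0 * i, succeeded=True) for i in range(1, 10)]
    sample.append(RequestRecord(latency_ms=500.0, succeeded=False))
    print(request_availability(sample))
    print(latency_percentile(sample, 50))
    print(throughput_per_second(sample, window_seconds=10))
```

Durability and data correctness are not covered by this sketch, because they typically require checks on the stored data itself rather than on request records.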

For further guidelines to define the reliability objectives for your applications, see Assess the reliability requirements for your cloud workloads.

Factors that affect application reliability

The reliability of an application that's deployed in Google Cloud depends on the following factors:

The internal design of the application.
The secondary applications or components that the application depends on.
The Google Cloud infrastructure resources (such as compute, networking, storage, databases, and security) that the application runs on, and how the application uses those resources.
Infrastructure capacity that you provision, and how the capacity scales.
The DevOps processes and tools that you use to build, deploy, and maintain the application, its dependencies, and the Google Cloud infrastructure.

These factors are summarized in the following diagram:

Application reliability dependencies.

As shown in the preceding diagram, the reliability of an application that's deployed in Google Cloud depends on multiple factors. The focus of this guide is the reliability of the Google Cloud infrastructure.

Contributors

Authors:

Nir Tarcic | Cloud Lifecycle SRE UTL
Kumar Dhanagopal | Cross-Product Solution Developer

Other contributors:

Alok Kumar | Distinguished Engineer
Andrew Fikes | Engineering Fellow, Reliability
Chris Heiser | SRE TL
David Ferguson | Director, Site Reliability Engineering
Joe Tan | Senior Product Counsel
Krzysztof Duleba | Principal Engineer
Narayan Desai | Principal SRE
Sailesh Krishnamurthy | VP, Engineering
Steve McGhee | Reliability Advocate
Sudhanshu Jain | Product Manager
Yaniv Aknin | Software Engineer

Building blocks of reliability