Build highly available systems through resource redundancy

Last reviewed 2024-12-30 UTC

This principle in the reliability pillar of the Google Cloud Architecture Framework provides recommendations to plan, build, and manage resource redundancy, which can help you to avoid failures.

This principle is relevant to the scoping focus area of reliability.

Principle overview

After you decide the level of reliability that you need, you must design your systems to avoid any single points of failure. Every critical component in the system must be replicated across multiple machines, zones, and regions. For example, a critical database can't be located in only one region, and a metadata server can't be deployed in only a single zone or region. In those examples, if the sole zone or region has an outage, the entire system has an outage.
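To illustrate the idea, the following minimal sketch scans a deployment inventory for components that depend on only one zone or only one region. The inventory, the component names, and the `single_points_of_failure` helper are hypothetical and exist only for this example. In practice, you would build the list from your own asset inventory.

```python
from collections import defaultdict

# Hypothetical inventory: one (component, region, zone) tuple per replica.
DEPLOYMENTS = [
    ("frontend", "us-central1", "us-central1-a"),
    ("frontend", "us-central1", "us-central1-b"),
    ("frontend", "europe-west1", "europe-west1-b"),
    ("database", "us-central1", "us-central1-a"),
    ("database", "us-central1", "us-central1-c"),
    ("metadata-server", "us-central1", "us-central1-a"),
]


def single_points_of_failure(deployments):
    """Return {component: [failure domains that the component depends on exclusively]}."""
    regions = defaultdict(set)
    zones = defaultdict(set)
    for component, region, zone in deployments:
        regions[component].add(region)
        zones[component].add(zone)

    findings = {}
    for component in regions:
        issues = []
        if len(zones[component]) == 1:
            issues.append("zone " + next(iter(zones[component])))
        if len(regions[component]) == 1:
            issues.append("region " + next(iter(regions[component])))
        if issues:
            findings[component] = issues
    return findings


report = single_points_of_failure(DEPLOYMENTS)
for component, issues in sorted(report.items()):
    print(f"{component}: depends solely on {', '.join(issues)}")
```

In this sample inventory, the database spans two zones but only one region, and the metadata server runs in a single zone, so both are reported.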

Recommendations

To build redundant systems, consider the recommendations in the following subsections.

Identify failure domains and replicate services

Map out your system's failure domains, from individual VMs to regions, and design for redundancy across the failure domains.

To ensure high availability, distribute and replicate your services and applications across multiple zones and regions. Configure the system for automatic failover to make sure that the services and applications continue to be available in the event of zone or region outages.
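As a rough illustration of why replication across failure domains helps, the following sketch estimates the availability of a service that stays up as long as at least one replica is up. It assumes that replicas fail independently and that failover is instantaneous, which real systems only approximate. The 99% per-replica figure and the function names are illustrative assumptions, not published values.

```python
def combined_availability(replica_availability, replicas):
    """Probability that at least one of `replicas` independent copies is up."""
    if not 0.0 <= replica_availability <= 1.0:
        raise ValueError("replica_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The service is down only when every replica is down at the same time.
    return 1.0 - (1.0 - replica_availability) ** replicas


def downtime_minutes_per_year(availability):
    """Expected downtime in minutes over a 365-day year."""
    return (1.0 - availability) * 365 * 24 * 60


# Assumed per-replica availability of 99%, for illustration only.
for n in (1, 2, 3):
    availability = combined_availability(0.99, n)
    minutes = downtime_minutes_per_year(availability)
    print(f"{n} replica(s): {availability:.6f} available, ~{minutes:.1f} min/year down")
```

Correlated failures, such as a shared dependency or a faulty rollout that reaches every replica, reduce the benefit that this simple model predicts. That's why the replicas need to sit in separate failure domains.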

For examples of multi-zone and multi-region architectures, see Design reliable infrastructure for your workloads in Google Cloud.

Detect and address issues promptly

Continuously track the status of your failure domains to detect and address issues promptly.

You can monitor the current status of Google Cloud services in all regions by using the Google Cloud Service Health dashboard. You can also view incidents relevant to your project by using Personalized Service Health.

You can use load balancers to detect resource health and automatically route traffic to healthy backends. For more information, see Health checks overview.
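The following sketch shows the general idea behind health-based routing: a backend is taken out of rotation after a number of consecutive failed probes and is returned to rotation after a number of consecutive successful probes. It is a simplified model for illustration, not the implementation that Google Cloud load balancers use. The `BackendHealth` class, the threshold defaults, and the backend names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class BackendHealth:
    """Tracks one backend's health from a stream of probe results."""
    name: str
    healthy: bool = True
    _successes: int = 0  # consecutive successful probes
    _failures: int = 0   # consecutive failed probes

    def record(self, probe_ok, healthy_threshold=2, unhealthy_threshold=2):
        """Update the health state with the result of one probe."""
        if probe_ok:
            self._successes += 1
            self._failures = 0
            if not self.healthy and self._successes >= healthy_threshold:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0
            if self.healthy and self._failures >= unhealthy_threshold:
                self.healthy = False


def healthy_backends(backends):
    """Return the names of the backends that can currently receive traffic."""
    return [b.name for b in backends if b.healthy]


backends = [BackendHealth("zone-a"), BackendHealth("zone-b")]

# Two consecutive failed probes take zone-a out of rotation.
for probe_ok in (False, False):
    backends[0].record(probe_ok)

routable = healthy_backends(backends)
print(routable)  # ['zone-b']
```

Requiring consecutive results in both directions keeps a single dropped probe from removing a backend, and keeps a single lucky probe from restoring one that is still failing.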

Test failover scenarios

Regularly simulate failures, as you would run a fire drill, to validate the effectiveness of your replication and failover strategies.

For more information, see Simulate a zone outage for a regional MIG and Simulate a zone failure in GKE regional clusters.
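Before you run a drill against real infrastructure, a quick capacity check on paper can show which zone losses you would expect the system to survive. The following sketch removes one zone at a time from a capacity model and reports the zones whose loss would leave less capacity than the peak load. The zone names, capacities, and function names are hypothetical, and the model complements an actual failover test rather than replacing it.

```python
def surviving_capacity(capacity_by_zone, failed_zone):
    """Total capacity that remains when `failed_zone` is unavailable."""
    if failed_zone not in capacity_by_zone:
        raise KeyError(f"unknown zone: {failed_zone}")
    return sum(
        capacity
        for zone, capacity in capacity_by_zone.items()
        if zone != failed_zone
    )


def run_zone_failure_drill(capacity_by_zone, peak_load):
    """Return the zones whose loss would leave less capacity than `peak_load`."""
    return [
        zone
        for zone in capacity_by_zone
        if surviving_capacity(capacity_by_zone, zone) < peak_load
    ]


# Hypothetical serving capacity in requests per second.
capacity = {"zone-a": 400, "zone-b": 400, "zone-c": 200}

at_risk = run_zone_failure_drill(capacity, peak_load=700)
print(at_risk)  # ['zone-a', 'zone-b']
```

In this example, losing either of the two larger zones leaves 600 requests per second of capacity against a peak of 700, so the system would need more headroom or a more even distribution to pass the drill.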
