Google Cloud Well-Architected Framework: Reliability

Last reviewed 2025-02-14 UTC

The reliability pillar in the Google Cloud Well-Architected Framework provides principles and recommendations to help you design, deploy, and manage reliable workloads in Google Cloud.

This document is intended for cloud architects, developers, platform engineers, administrators, and site reliability engineers.

Reliability is a system's ability to consistently perform its intended functions within the defined conditions and maintain uninterrupted service. Best practices for reliability include redundancy, fault-tolerant design, monitoring, and automated recovery processes.

As a part of reliability, resilience is the system's ability to withstand and recover from failures or unexpected disruptions, while maintaining performance. Google Cloud features, like multi-regional deployments, automated backups, and disaster recovery solutions, can help you improve your system's resilience.

Reliability is important to your cloud strategy for many reasons, including the following:

  • Minimal downtime: Downtime can lead to lost revenue, decreased productivity, and damage to reputation. Resilient architectures can help ensure that systems can continue to function during failures or recover efficiently from failures.
  • Enhanced user experience: Users expect seamless interactions with technology. Resilient systems can help maintain consistent performance and availability, and they provide reliable service even during high demand or unexpected issues.
  • Data integrity: Failures can cause data loss or data corruption. Resilient systems implement mechanisms such as backups, redundancy, and replication to protect data and ensure that it remains accurate and accessible.
  • Business continuity: Your business relies on technology for critical operations. Resilient architectures can help ensure continuity after a catastrophic failure, which enables business functions to continue without significant interruptions and supports a swift recovery.
  • Compliance: Many industries have regulatory requirements for system availability and data protection. Resilient architectures can help you to meet these standards by ensuring systems remain operational and secure.
  • Lower long-term costs: Resilient architectures require upfront investment, but resiliency can help to reduce costs over time by preventing expensive downtime, avoiding reactive fixes, and enabling more efficient resource use.

Organizational mindset

To make your systems reliable, you need a plan and an established strategy. This strategy must include education and the authority to prioritize reliability alongside other initiatives.

Set a clear expectation that the entire organization is responsible for reliability, including development, product management, operations, platform engineering, and site reliability engineering (SRE). Even the business-focused groups, like marketing and sales, can influence reliability.

Every team must understand the reliability targets and risks of their applications, and the teams must be accountable for meeting those targets. When reliability work conflicts with regular product feature development, prioritize and escalate the conflict accordingly.

Plan and manage reliability holistically, across all your functions and teams. Consider setting up a Cloud Center of Excellence (CCoE) that includes a reliability pillar. For more information, see Optimize your organization's cloud journey with a Cloud Center of Excellence.

Focus areas for reliability

The activities that you perform to design, deploy, and manage a reliable system can be categorized in the following focus areas. Each of the reliability principles and recommendations in this pillar is relevant to one of these focus areas.

  • Scoping: To understand your system, conduct a detailed analysis of its architecture. You need to understand the components, how they work and interact, how data and actions flow through the system, and what could go wrong. Identify potential failures, bottlenecks, and risks, which helps you to take actions to mitigate those issues.
  • Observation: To help prevent system failures, implement comprehensive and continuous observation and monitoring. Through this observation, you can understand trends and identify potential problems proactively.
  • Response: Even with careful planning and controls, failures can still occur. To reduce the impact of failures, respond appropriately and recover efficiently. Automated responses can also help reduce the impact of failures.
  • Learning: To help prevent failures from recurring, learn from each experience, and take appropriate actions.

Core principles

The recommendations in the reliability pillar of the Well-Architected Framework are mapped to the following core principles:

  • Define reliability based on user-experience goals
  • Set realistic targets for reliability
  • Build highly available systems through resource redundancy
  • Take advantage of horizontal scalability
  • Detect potential failures by using observability
  • Design for graceful degradation
  • Perform testing for recovery from failures
  • Perform testing for recovery from data loss
  • Conduct thorough postmortems

Define reliability based on user-experience goals

This principle in the reliability pillar of the Google Cloud Well-Architected Framework helps you to assess your users' experience, and then map the findings to reliability goals and metrics.

This principle is relevant to the scoping focus area of reliability.

Principle overview

Observability tools provide large amounts of data, but not all of that data directly relates to user impact. For example, you might observe high CPU usage, slow server operations, or even crashed tasks. However, if these issues don't affect the user experience, then they don't constitute an outage.

To measure the user experience, you need to distinguish between internal system behavior and user-facing problems. Focus on metrics like the success ratio of user requests. Don't rely solely on server-centric metrics, like CPU usage, which can lead to misleading conclusions about your service's reliability. True reliability means that users can consistently and effectively use your application or service.

Recommendations

To help you measure user experience effectively, consider the recommendations in the following sections.

Measure user experience

To truly understand your service's reliability, prioritize metrics that reflect your users' actual experience. For example, measure the users' query success ratio, application latency, and error rates.

Ideally, collect this data directly from the user's device or browser. If this direct data collection isn't feasible, shift your measurement point progressively further away from the user in the system. For example, you can use the load balancer or frontend service as the measurement point. This approach helps you identify and address issues before those issues can significantly impact your users.
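
For example, an availability SLI is often computed as the ratio of successful user requests to total requests over a measurement window. The following is a minimal sketch in Python; the request counts are hypothetical and would typically come from load balancer or frontend logs and metrics.

```python
def availability_sli(success_count: int, total_count: int) -> float:
    """Fraction of user requests that succeeded in the measurement window."""
    if total_count == 0:
        return 1.0  # no traffic means nothing failed
    return success_count / total_count

# Hypothetical counts taken from load balancer logs for a 5-minute window.
sli = availability_sli(success_count=99_870, total_count=100_000)
print(f"Availability SLI: {sli:.4%}")  # 99.8700%
```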

Analyze user journeys

To understand how users interact with your system, you can use tracing tools like Cloud Trace. By following a user's journey through your application, you can find bottlenecks and latency issues that might degrade the user's experience. Cloud Trace captures detailed performance data for each hop in your service architecture. This data helps you identify and address performance issues more efficiently, which can lead to a more reliable and satisfying user experience.
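
As an illustration, the following sketch instruments a request path with OpenTelemetry and exports the spans to Cloud Trace. It assumes that the opentelemetry-sdk and opentelemetry-exporter-gcp-trace packages are installed; the span names are hypothetical.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter

# Send spans from this service to Cloud Trace.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(CloudTraceSpanExporter())
)
tracer = trace.get_tracer(__name__)

# Wrap a user-facing operation and one of its downstream calls in spans,
# so that the full journey is visible as a single trace.
with tracer.start_as_current_span("checkout"):
    with tracer.start_as_current_span("query-inventory"):
        pass  # call the inventory service here
```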

Set realistic targets for reliability

This principle in the reliability pillar of the Google Cloud Well-Architected Framework helps you define reliability goals that are technically feasible for your workloads in Google Cloud.

This principle is relevant to the scoping focus area of reliability.

Principle overview

Design your systems to be just reliable enough for user happiness. It might seem counterintuitive, but a goal of 100% reliability is often not the most effective strategy. Higher reliability might result in a significantly higher cost, both in terms of financial investment and potential limitations on innovation. If users are already happy with the current level of service, then efforts to further increase happiness might yield a low return on investment. Instead, you can better spend resources elsewhere.

You need to determine the level of reliability at which your users are happy, and the point where the cost of incremental improvements begins to outweigh the benefits. When you determine this level of sufficient reliability, you can allocate resources strategically and focus on features and improvements that deliver greater value to your users.

Recommendations

To set realistic reliability targets, consider the recommendations in the following subsections.

Accept some failure and prioritize components

Aim for high availability such as 99.99% uptime, but don't set a target of 100% uptime. Acknowledge that some failures are inevitable.

The gap between 100% uptime and a 99.99% target is the allowance for failure. This gap is often called the error budget. The error budget can help you take risks and innovate, which is fundamental for any business that wants to stay competitive.
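
For example, a 99.99% availability target over a rolling 30-day window leaves an error budget of roughly 4.3 minutes of downtime, as the following calculation shows:

```python
# Error budget for a 99.99% availability target over a rolling 30-day window.
slo_target = 0.9999
window_minutes = 30 * 24 * 60          # 43,200 minutes in 30 days

error_budget_minutes = window_minutes * (1 - slo_target)
print(f"Allowed downtime per 30 days: {error_budget_minutes:.2f} minutes")  # 4.32
```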

Prioritize the reliability of the most critical components in the system. Accept that less critical components can have a higher tolerance for failure.

Balance reliability and cost

To determine the optimal reliability level for your system, conduct thorough cost-benefit analyses.

Consider factors like system requirements, the consequences of failures, and your organization's risk tolerance for the specific application. Remember to consider your disaster recovery metrics, such as the recovery time objective (RTO) and recovery point objective (RPO). Decide what level of reliability is acceptable within the budget and other constraints.

Look for ways to improve efficiency and reduce costs without compromising essential reliability features.

Build highly available systems through resource redundancy

This principle in the reliability pillar of the Google Cloud Well-Architected Framework provides recommendations to plan, build, and manage resource redundancy, which can help you to avoid failures.

This principle is relevant to the scoping focus area of reliability.

Principle overview

After you decide the level of reliability that you need, you must design your systems to avoid any single points of failure. Every critical component in the system must be replicated across multiple machines, zones, and regions. For example, a critical database can't be located in only one region, and a metadata server can't be deployed in only a single zone or region. In those examples, if the sole zone or region has an outage, the system has a global outage.

Recommendations

To build redundant systems, consider the recommendations in the following subsections.

Identify failure domains and replicate services

Map out your system's failure domains, from individual VMs to regions, and design for redundancy across the failure domains.

To ensure high availability, distribute and replicate your services and applications across multiple zones and regions. Configure the system for automatic failover to make sure that the services and applications continue to be available in the event of zone or region outages.

For examples of multi-zone and multi-region architectures, see Design reliable infrastructure for your workloads in Google Cloud.

Detect and address issues promptly

Continuously track the status of your failure domains to detect and address issues promptly.

You can monitor the current status of Google Cloud services in all regions by using the Google Cloud Service Health dashboard. You can also view incidents relevant to your project by using Personalized Service Health. You can use load balancers to detect resource health and automatically route traffic to healthy backends. For more information, see Health checks overview.

Test failover scenarios

Like a fire drill, regularly simulate failures to validate the effectiveness of your replication and failover strategies.

For more information, see Simulate a zone outage for a regional MIG and Simulate a zone failure in GKE regional clusters.

Take advantage of horizontal scalability

This principle in the reliability pillar of the Google Cloud Well-Architected Framework provides recommendations to help you use horizontal scalability. By using horizontal scalability, you can help ensure that your workloads in Google Cloud can scale efficiently and maintain performance.

This principle is relevant to the scoping focus area of reliability.

Principle overview

Re-architect your system to a horizontal architecture. To accommodate growth in traffic or data, you can add more resources. You can also remove resources when they're not in use.

To understand the value of horizontal scaling, consider the limitations of vertical scaling.

A common scenario for vertical scaling is to use a MySQL database as the primary database with critical data. As database usage increases, more RAM and CPU are required. Eventually, the database reaches the memory limit of the host machine and needs to be upgraded. This process might need to be repeated several times. The problem is that there are hard limits on how much a database can grow. VM sizes are not unlimited, so the database can reach a point where it's no longer possible to add more resources.

Even if resources were unlimited, a large VM can become a single point of failure. Any problem with the primary database VM can cause error responses or cause a system-wide outage that affects all users. Avoid single points of failure, as described in Build highly available systems through resource redundancy.

Besides these scaling limits, vertical scaling tends to be more expensive. The cost can increase exponentially as machines with greater amounts of compute power and memory are acquired.

Horizontal scaling, by contrast, can cost less. The potential for horizontal scaling is virtually unlimited in a system that's designed to scale.

Recommendations

To transition from a single VM architecture to a horizontal multiple-machine architecture, you need to plan carefully and use the right tools. To help you achieve horizontal scaling, consider the recommendations in the following subsections.

Use managed services

Managed services remove the need to manually manage horizontal scaling. For example, with Compute Engine managed instance groups (MIGs), you can add or remove VMs to scale your application horizontally. For containerized applications, Cloud Run is a serverless platform that can automatically scale your stateless containers based on incoming traffic.

Promote modular design

Modular components and clear interfaces help you scale individual components as needed, instead of scaling the entire application. For more information, see Promote modular design in the performance optimization pillar.

Implement a stateless design

Design applications to be stateless, which means that they don't store data locally. Statelessness lets you add or remove instances without worrying about data consistency.
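
As an illustrative sketch, the following request handler keeps session state in a shared store (for example, Memorystore for Redis) instead of in instance memory, so any replica can serve any request. Flask and the redis client are assumptions chosen for brevity; the REDIS_HOST variable and endpoint names are hypothetical.

```python
import json
import os
import uuid

import redis
from flask import Flask, request

app = Flask(__name__)

# Session state lives in a shared store, not in the memory or disk of any
# single instance, so instances can be added or removed freely.
store = redis.Redis(host=os.environ.get("REDIS_HOST", "localhost"), port=6379)

@app.post("/cart/add")
def add_to_cart():
    session_id = request.headers.get("X-Session-Id") or str(uuid.uuid4())
    cart = json.loads(store.get(session_id) or "[]")
    cart.append(request.json["item"])
    store.setex(session_id, 3600, json.dumps(cart))  # expire after 1 hour
    return {"session_id": session_id, "items": cart}
```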

Detect potential failures by using observability

This principle in the reliability pillar of the Google Cloud Well-Architected Framework provides recommendations to help you proactively identify areas where errors and failures might occur.

This principle is relevant to the observation focus area of reliability.

Principle overview

To maintain and improve the reliability of your workloads in Google Cloud, you need to implement effective observability by using metrics, logs, and traces.

  • Metrics are numerical measurements of activities that you want to track for your application at specific time intervals. For example, you might want to track technical metrics like request rate and error rate, which can be used as service-level indicators (SLIs). You might also need to track application-specific business metrics like orders placed and payments received.
  • Logs are time-stamped records of discrete events that occur within an application or system. The event could be a failure, an error, or a change in state. Logs might include metrics, and you can also use logs for SLIs.
  • A trace represents the journey of a single user or transaction through a number of separate applications or the components of an application. For example, these components could be microservices. Traces help you to track what components were used in the journeys, where bottlenecks exist, and how long the journeys took.

Metrics, logs, and traces help you monitor your system continuously. Comprehensive monitoring helps you find out where and why errors occurred. You can also detect potential failures before errors occur.

Recommendations

To detect potential failures efficiently, consider the recommendations in the following subsections.

Gain comprehensive insights

To track key metrics like response times and error rates, use Cloud Monitoring and Cloud Logging. These tools also help you to ensure that the metrics consistently meet the needs of your workload.

To make data-driven decisions, analyze default service metrics to understand component dependencies and their impact on overall workload performance.

To customize your monitoring strategy, create and publish your own metrics by using the Google Cloud SDK.
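
For example, the following sketch writes a custom business metric by using the Cloud Monitoring client library for Python. The project ID and the custom orders_placed metric type are placeholders.

```python
import time

from google.cloud import monitoring_v3

project_id = "my-project"  # placeholder project ID

client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{project_id}"

series = monitoring_v3.TimeSeries()
# Hypothetical business metric for orders placed.
series.metric.type = "custom.googleapis.com/orders_placed"
series.resource.type = "global"

now = time.time()
seconds = int(now)
nanos = int((now - seconds) * 10**9)
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": seconds, "nanos": nanos}}
)
point = monitoring_v3.Point({"interval": interval, "value": {"int64_value": 42}})
series.points = [point]

# Write one data point for the custom metric.
client.create_time_series(name=project_name, time_series=[series])
```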

Perform proactive troubleshooting

Implement robust error handling and enable logging across all of the components of your workloads in Google Cloud. Activate logs like Cloud Storage access logs and VPC Flow Logs.

When you configure logging, consider the associated costs. To control logging costs, you can configure exclusion filters on the log sinks to exclude certain logs from being stored.

Optimize resource utilization

Monitor CPU consumption, network I/O metrics, and disk I/O metrics to detect under-provisioned and over-provisioned resources in services like GKE, Compute Engine, and Dataproc. For a complete list of supported services, see Cloud Monitoring overview.

Prioritize alerts

For alerts, focus on critical metrics, set appropriate thresholds to minimize alert fatigue, and ensure timely responses to significant issues. This targeted approach lets you proactively maintain workload reliability. For more information, see Alerting overview.
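
As a sketch, the following creates an alerting policy with the Cloud Monitoring client library for Python. The metric filter, threshold, and display names are examples only; adapt them to the SLIs that matter for your workload.

```python
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

project_id = "my-project"  # placeholder project ID

client = monitoring_v3.AlertPolicyServiceClient()

# Example condition: the rate of load balancer 5xx responses stays above a
# threshold for 5 minutes.
condition = monitoring_v3.AlertPolicy.Condition(
    display_name="High rate of 5xx responses",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        filter=(
            'metric.type = "loadbalancing.googleapis.com/https/request_count" '
            'AND metric.labels.response_code_class = "500"'
        ),
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=10,
        duration=duration_pb2.Duration(seconds=300),
        aggregations=[
            monitoring_v3.Aggregation(
                alignment_period=duration_pb2.Duration(seconds=60),
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_RATE,
            )
        ],
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="Serving errors (user-facing)",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[condition],
)

client.create_alert_policy(name=f"projects/{project_id}", alert_policy=policy)
```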

Design for graceful degradation

This principle in the reliability pillar of the Google Cloud Well-Architected Framework provides recommendations to help you to design your Google Cloud workloads to fail gracefully.

This principle is relevant to the response focus area of reliability.

Principle overview

Graceful degradation is a design approach where a system that experiences a high load continues to function, possibly with reduced performance or accuracy. Graceful degradation ensures continued availability of the system and prevents complete failure, even if the system's work isn't optimal. When the load returns to a manageable level, the system resumes full functionality.

For example, during periods of high load, Google Search prioritizes results from higher-ranked web pages, potentially sacrificing some accuracy. When the load decreases, Google Search recomputes the search results.

Recommendations

To design your systems for graceful degradation, consider the recommendations in the following subsections.

Implement throttling

Ensure that your replicas can independently handle overloads and can throttle incoming requests during high-traffic scenarios. This approach helps you to prevent cascading failures that are caused by shifts in excess traffic between zones.

Use tools like Apigee to control the rate of API requests during high-traffic times. You can configure policy rules to reflect how you want to scale back requests.
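
Apigee rate-limiting policies are configured in Apigee itself rather than in application code. As an illustration of the underlying idea, the following is a minimal token-bucket sketch that a replica could use to throttle its own inbound requests; the process function is a hypothetical request handler.

```python
import threading
import time

class TokenBucket:
    """Allows `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            # Refill tokens based on elapsed time, up to the bucket capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

def process(request):
    """Hypothetical downstream handler for accepted requests."""
    return "OK", 200

# Each replica throttles its own inbound traffic.
bucket = TokenBucket(rate=100, capacity=200)

def handle_request(request):
    if not bucket.allow():
        return "Too Many Requests", 429  # callers should back off and retry later
    return process(request)
```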

Drop excess requests early

Configure your systems to drop excess requests at the frontend layer to protect backend components. Dropping some requests prevents global failures and enables the system to recover more gracefully. With this approach, some users might experience errors. However, you can minimize the impact of outages, in contrast to an approach like circuit-breaking, where all traffic is dropped during an overload.

Handle partial errors and retries

Build your applications to handle partial errors and retries seamlessly. This design helps to ensure that as much traffic as possible is served during high-load scenarios.
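
For example, a client can retry transient failures with exponential backoff and jitter so that partial errors don't turn into user-visible outages. The following is a minimal sketch; TransientError stands in for whatever retryable error type your client library raises.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for whatever retryable error type your client library raises."""

def call_with_retries(operation, max_attempts=5, base_delay=0.5, max_delay=8.0):
    """Retry a call that can fail transiently, with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and surface the failure to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(delay + random.uniform(0, delay))  # jitter avoids retry storms
```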

Test overload scenarios

To validate that the throttle and request-drop mechanisms work effectively, regularly simulate overload conditions in your system. Testing helps ensure that your system is prepared for real-world traffic surges.

Monitor traffic spikes

Use analytics and monitoring tools to predict and respond to traffic surges before they escalate into overloads. Early detection and response can help maintain service availability during high-demand periods.

Perform testing for recovery from failures

This principle in the reliability pillar of the Google Cloud Well-Architected Framework provides recommendations to help you design and run tests for recovery in the event of failures.

This principle is relevant to the learning focus area of reliability.

Principle overview

To be sure that your system can recover from failures, you must periodically run tests that include regional failovers, release rollbacks, and data restoration from backups.

This testing helps you to practice responses to events that pose major risks to reliability, such as the outage of an entire region. This testing also helps you verify that your system behaves as intended during a disruption.

In the unlikely event of an entire region going down, you need to fail over all traffic to another region. During normal operation of your workload, when data is modified, it needs to be synchronized from the primary region to the failover region. You need to verify that the replicated data is always current, so that users don't experience data loss or session breakage. The load balancing system must also be able to shift traffic to the failover region at any time without service interruptions. To minimize downtime after a regional outage, operations engineers must also be able to manually and efficiently shift user traffic away from a region in as little time as possible. This operation is sometimes called draining a region: you stop the inbound traffic to the region and move all of the traffic elsewhere.

Recommendations

When you design and run tests for failure recovery, consider the recommendations in the following subsections.

Define the testing objectives and scope

Clearly define what you want to achieve from the testing. For example, your objectives can include the following:

  • Validate the recovery time objective (RTO) and the recovery point objective (RPO). For details, see Basics of DR planning.
  • Assess system resilience and fault tolerance under various failure scenarios.
  • Test the effectiveness of automated failover mechanisms.

Decide which components, services, or regions are in the testing scope. The scope can include specific application tiers like the frontend, backend, and database, or it can include specific Google Cloud resources like Cloud SQL instances or GKE clusters. The scope must also specify any external dependencies, such as third-party APIs or cloud interconnections.

Prepare the environment for testing

Choose an appropriate environment, preferably a staging or sandbox environment that replicates your production setup. If you conduct the test in production, ensure that you have safety measures ready, like automated monitoring and manual rollback procedures.

Create a backup plan. Take snapshots or backups of critical databases and services to prevent data loss during the test. Ensure that your team is prepared to do manual interventions if the automated failover mechanisms fail.

To prevent test disruptions, ensure that your IAM roles, policies, and failover configurations are correctly set up. Verify that the necessary permissions are in place for the test tools and scripts.

Inform stakeholders, including operations, DevOps, and application owners, about the test schedule, scope, and potential impact. Provide stakeholders with an estimated timeline and the expected behaviors during the test.

Simulate failure scenarios

Plan and execute failures by using tools like Chaos Monkey. You can use custom scripts to simulate failures of critical services, such as shutting down a primary node in a multi-zone GKE cluster or disabling a Cloud SQL instance. You can also use scripts to simulate a region-wide network outage by using firewall rules or API restrictions, based on your test scope. Gradually escalate the failure scenarios to observe system behavior under various conditions.
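
As a simple example, the following sketch uses the Compute Engine client library for Python to stop a single VM during a controlled drill, so that you can observe whether health checks and failover behave as expected. The project, zone, and instance names are hypothetical; run such scripts only against environments that are within your test scope.

```python
from google.cloud import compute_v1

# Hypothetical project, zone, and instance names for the drill.
PROJECT = "my-project"
ZONE = "us-central1-a"
INSTANCE = "primary-db-node"

def simulate_vm_failure():
    """Stop one VM and then observe health checks, alerts, and failover behavior."""
    client = compute_v1.InstancesClient()
    operation = client.stop(project=PROJECT, zone=ZONE, instance=INSTANCE)
    operation.result()  # wait for the stop to complete before observing the system
    print(f"Stopped {INSTANCE}; watch dashboards and failover behavior now.")

if __name__ == "__main__":
    simulate_vm_failure()
```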

Introduce load testing alongside failure scenarios to replicate real-world usage during outages. Test cascading failure impacts, such as how frontend systems behave when backend services are unavailable.

To validate configuration changes and to assess the system's resilience against human errors, test scenarios that involve misconfigurations. For example, run tests with incorrect DNS failover settings or incorrect IAM permissions.

Monitor system behavior

Monitor how load balancers, health checks, and other mechanisms reroute traffic. Use Google Cloud tools like Cloud Monitoring and Cloud Logging to capture metrics and events during the test.

Observe changes in latency, error rates, and throughput during and after the failure simulation, and monitor the overall performance impact. Identify any degradation or inconsistencies in the user experience.

Ensure that logs are generated and alerts are triggered for key events, such as service outages or failovers. Use this data to verify the effectiveness of your alerting and incident response systems.

Verify recovery against your RTO and RPO

Measure how long it takes for the system to resume normal operations after a failure, and then compare this data with the defined RTO and document any gaps.

Ensure that data integrity and availability align with the RPO. To test database consistency, compare snapshots or backups of the database before and after a failure.

Evaluate service restoration and confirm that all services are restored to a functional state with minimal user disruption.

Document and analyze results

Document each test step, failure scenario, and corresponding system behavior. Include timestamps, logs, and metrics for detailed analyses.

Highlight bottlenecks, single points of failure, or unexpected behaviors observed during the test. To help prioritize fixes, categorize issues by severity and impact.

Suggest improvements to the system architecture, failover mechanisms, or monitoring setups. Based on test findings, update any relevant failover policies and playbooks. Present a postmortem report to stakeholders. The report should summarize the outcomes, lessons learned, and next steps. For more information, see Conduct thorough postmortems.

Iterate and improve

To validate ongoing reliability and resilience, plan periodic testing (for example, quarterly).

Run tests under different scenarios, including infrastructure changes, software updates, and increased traffic loads.

Automate failover tests by using CI/CD pipelines to integrate reliability testing into your development lifecycle.

During the postmortem, use feedback from stakeholders and end users to improve the test process and system resilience.

Perform testing for recovery from data loss

This principle in the reliability pillar of the Google Cloud Well-Architected Framework provides recommendations to help you design and run tests for recovery from data loss.

This principle is relevant to the learning focus area of reliability.

Principle overview

To ensure that your system can recover from situations where data is lost or corrupted, you need to run tests for those scenarios. Instances of data loss might be caused by a software bug or some type of natural disaster. After such events, you need to restore data from backups and bring all of the services back up again by using the freshly restored data.

We recommend that you use three criteria to judge the success or failure of this type of recovery test: data integrity, recovery time objective (RTO), and recovery point objective (RPO). For details about the RTO and RPO metrics, see Basics of DR planning.

The goal of data restoration testing is to periodically verify that your organization can continue to meet business continuity requirements. Besides measuring RTO and RPO, a data restoration test must include testing of the entire application stack and all the critical infrastructure services with the restored data. This is necessary to confirm that the entire deployed application works correctly in the test environment.

Recommendations

When you design and run tests for recovering from data loss, consider the recommendations in the following subsections.

Verify backup consistency and test restoration processes

You need to verify that your backups contain consistent and usable snapshots of data that you can restore to immediately bring applications back into service. To validate data integrity, set up automated consistency checks to run after each backup.
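
A checksum comparison is one simple, automatable integrity check that can run after each backup. It verifies that the backup file wasn't corrupted, although a full test restore (described next) is still needed to confirm that the data is usable. The following is a minimal sketch; the backup path and recorded checksum are placeholders.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file so that large backups don't need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_backup(backup_file: Path, recorded_checksum: str) -> bool:
    """Compare the backup's checksum with the value recorded when it was created."""
    return sha256_of(backup_file) == recorded_checksum

# Example: run after each backup job and alert if verification fails.
# The path and expected checksum below are placeholders.
is_intact = verify_backup(Path("/backups/db-2025-02-14.dump"),
                          recorded_checksum="<expected sha256>")
```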

To test backups, restore them in a non-production environment. To ensure your backups can be restored efficiently and that the restored data meets application requirements, regularly simulate data recovery scenarios. Document the steps for data restoration, and train your teams to execute the steps effectively during a failure.

Schedule regular and frequent backups

To minimize data loss during restoration and to meet RPO targets, it's essential to have regularly scheduled backups. Establish a backup frequency that aligns with your RPO. For example, if your RPO is 15 minutes, schedule backups to run at least every 15 minutes. Optimize the backup intervals to reduce the risk of data loss.

Use Google Cloud tools like Cloud Storage, Cloud SQL automated backups, or Spanner backups to schedule and manage backups. For critical applications, use near-continuous backup solutions like point-in-time recovery (PITR) for Cloud SQL or incremental backups for large datasets.
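
Where a managed backup feature isn't available, a scheduled job can copy backups to Cloud Storage under timestamped object names. The following is a minimal sketch that uses the Cloud Storage client library for Python; the bucket name and file path are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import Path

from google.cloud import storage

def upload_backup(local_path: str, bucket_name: str) -> str:
    """Upload a backup file to Cloud Storage under a timestamped object name."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    blob = bucket.blob(f"backups/{timestamp}/{Path(local_path).name}")
    blob.upload_from_filename(local_path)
    return blob.name

# Hypothetical: schedule this job (for example, with Cloud Scheduler) at an
# interval that matches your RPO, such as every 15 minutes for a 15-minute RPO.
object_name = upload_backup("/backups/db.dump", bucket_name="my-backup-bucket")
```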

Define and monitor RPO

Set a clear RPO based on your business needs, and monitor adherence to the RPO. If backup intervals exceed the defined RPO, use Cloud Monitoring to set up alerts.

Monitor backup health

Use Google Cloud Backup and DR service or similar tools to track the health of your backups and confirm that they are stored in secure and reliable locations. Ensure that the backups are replicated across multiple regions for added resilience.

Plan for scenarios beyond backup

Combine backups with disaster recovery strategies like active-active failover setups or cross-region replication for improved recovery time in extreme cases. For more information, see Disaster recovery planning guide.

Conduct thorough postmortems

This principle in the reliability pillar of the Google Cloud Well-Architected Framework provides recommendations to help you conduct effective postmortems after failures and incidents.

This principle is relevant to the learning focus area of reliability.

Principle overview

A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve the incident, the root causes, and the follow-up actions to prevent the incident from recurring. The goal of a postmortem is to learn from mistakes and not assign blame.

The following diagram shows the workflow of a postmortem:

The workflow of a postmortem includes the following steps:

  • Create postmortem
  • Capture the facts
  • Identify and analyze the root causes
  • Plan for the future
  • Execute the plan

Conduct postmortem analyses after major events and non-major events like the following:

  • User-visible downtimes or degradations beyond a certain threshold.
  • Data losses of any kind.
  • Interventions from on-call engineers, such as a release rollback or rerouting of traffic.
  • Resolution times above a defined threshold.
  • Monitoring failures, which usually imply manual incident discovery.

Recommendations

Define postmortem criteria before an incident occurs so that everyone knows when a postmortem is necessary.

To conduct effective postmortems, consider the recommendations in the following subsections.

Conduct blameless postmortems

Effective postmortems focus on processes, tools, and technologies, and don't place blame on individuals or teams. The purpose of a postmortem analysis is to improve your technology and how you operate it in the future, not to determine who is at fault. Everyone makes mistakes. The goal should be to analyze the mistakes and learn from them.

The following examples show the difference between feedback that assigns blame and blameless feedback:

  • Feedback that assigns blame: "We need to rewrite the entire complicated backend system! It's been breaking weekly for the last three quarters and I'm sure we're all tired of fixing things piecemeal. Seriously, if I get paged one more time I'll rewrite it myself…"
  • Blameless feedback: "An action item to rewrite the entire backend system might actually prevent these pages from continuing to happen. The maintenance manual for this version is quite long and really difficult to be fully trained up on. I'm sure our future on-call engineers will thank us!"

Make the postmortem report readable by all the intended audiences

For each piece of information that you plan to include in the report, assess whether that information is important and necessary to help the audience understand what happened. You can move supplementary data and explanations to an appendix of the report. Reviewers who need more information can request it.

Avoid complex or over-engineered solutions

Before you start to explore solutions for a problem, evaluate the importance of the problem and the likelihood of a recurrence. Adding complexity to the system to solve problems that are unlikely to occur again can lead to increased instability.

Share the postmortem as widely as possible

To ensure that issues don't remain unresolved, publish the outcome of the postmortem to a wide audience and get support from management. The value of a postmortem is proportional to the learning that occurs after the postmortem. When more people learn from incidents, the likelihood of similar failures recurring is reduced.