Reliability principles

This document in the Google Cloud Architecture Framework explains some of the core principles for running reliable services on a cloud platform. These principles establish a common understanding for the other sections of the Architecture Framework, which show how Google Cloud products and features support reliable services.

Key terminology

The following terms are used throughout the Architecture Framework reliability category. Understanding them is essential to running reliable services.

Service level indicator (SLI)

A service level indicator (SLI) is a carefully defined quantitative measure of some aspect of the level of service that is being provided. It is a metric, not a target.

Service level objective (SLO)

A service level objective (SLO) specifies a target level for the reliability of your service. The SLO is a target value for an SLI. When the SLI is at or better than this value, the service is considered to be "reliable enough." Because SLOs are key to making data-driven decisions about reliability, they are the focal point of site reliability engineering (SRE) practices.
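The relationship between an SLI and its SLO can be sketched as a simple comparison. This is a minimal illustration, assuming a "higher is better" SLI such as a success ratio; the function name `meets_slo` and the sample values are hypothetical.

```python
def meets_slo(sli: float, slo: float) -> bool:
    """Return True when the measured SLI is at or better than the SLO target.

    Assumes a "higher is better" SLI, such as a success ratio between 0 and 1.
    """
    return sli >= slo


SLO_TARGET = 0.999  # hypothetical availability SLO of 99.9%

print(meets_slo(0.9995, SLO_TARGET))  # True: the service is "reliable enough"
print(meets_slo(0.9980, SLO_TARGET))  # False: the SLI is below the target
```

For a "lower is better" SLI, such as latency, the comparison would be reversed.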

Error budget

An error budget is calculated as 100% – SLO over a period of time. Error budgets tell you if your system has been more or less reliable than is needed over a certain time window, and how many minutes of downtime are allowed during that period.

For example, if your availability SLO is 99.9%, your error budget over a 30-day period is (1 - 0.999) ✕ 30 days ✕ 24 hours ✕ 60 minutes = 43.2 minutes. The error budget for a system is consumed, or burned, whenever the system is unavailable. Using the previous example, if the system has had 10 minutes of downtime in the past 30 days and started the 30-day period with the full budget of 43.2 minutes unutilized, then the remaining error budget is reduced to 33.2 minutes.
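The arithmetic in this example can be reproduced with a short sketch. The function names are hypothetical, and the code assumes the budget is tracked purely as minutes of downtime.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total error budget in minutes: (1 - SLO) x days x 24 hours x 60 minutes."""
    return (1.0 - slo) * window_days * 24 * 60


def remaining_budget_minutes(
    slo: float, downtime_minutes: float, window_days: int = 30
) -> float:
    """Error budget left after subtracting the downtime already consumed."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


print(f"{error_budget_minutes(0.999):.1f}")          # 43.2
print(f"{remaining_budget_minutes(0.999, 10):.1f}")  # 33.2
```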

We recommend using a rolling window of 30 days when computing your total error budget and the error budget burn rate.
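One way to track a rolling window is sketched below. The class `RollingErrorBudget` is hypothetical and assumes downtime is recorded once per day. It uses a common definition of burn rate: observed budget consumption divided by the consumption that would exhaust the budget exactly at the end of the window, so a burn rate of 1.0 means the budget lasts exactly one window.

```python
from collections import deque


class RollingErrorBudget:
    """Tracks error budget consumption over a rolling window of days."""

    def __init__(self, slo: float, window_days: int = 30) -> None:
        self.window_days = window_days
        self.budget_minutes = (1.0 - slo) * window_days * 24 * 60
        # A bounded deque drops the oldest day once the window is full.
        self.daily_downtime: deque[float] = deque(maxlen=window_days)

    def record_day(self, downtime_minutes: float) -> None:
        self.daily_downtime.append(downtime_minutes)

    def consumed_minutes(self) -> float:
        return sum(self.daily_downtime)

    def remaining_minutes(self) -> float:
        return self.budget_minutes - self.consumed_minutes()

    def burn_rate(self) -> float:
        """Average daily downtime relative to the allowed daily downtime."""
        days = len(self.daily_downtime)
        if days == 0:
            return 0.0
        allowed_per_day = self.budget_minutes / self.window_days
        return (self.consumed_minutes() / days) / allowed_per_day


tracker = RollingErrorBudget(slo=0.999)
for minutes in (0.0, 10.0, 0.0):  # hypothetical downtime for three days
    tracker.record_day(minutes)

print(f"remaining: {tracker.remaining_minutes():.1f} minutes")
print(f"burn rate: {tracker.burn_rate():.2f}")
```

A burn rate above 1.0 indicates that the budget would run out before the window ends if the current pace continued.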

Service level agreement (SLA)

A service level agreement (SLA) is an explicit or implicit contract with your users that includes consequences if you meet, or miss, the SLOs referenced in the contract.

Core principles

Google's approach to reliability is based on the following core principles.

Reliability is your top feature

New product features are sometimes your top priority in the short term. However, reliability is your top product feature in the long term, because if the product is too slow or is unavailable over a long period of time, your users might leave, making other product features irrelevant.

Reliability is defined by the user

For user-facing workloads, measure the user experience. The user must be happy with how your service performs. For example, measure the success ratio of user requests, not just server metrics like CPU usage.
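As an illustration, a success-ratio SLI can be computed from request counts. This is a minimal sketch; the `RequestCounts` type and the sample numbers are hypothetical, and treating a window with no traffic as fully successful is an assumption that might not suit every service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestCounts:
    """Hypothetical request counters for one measurement window."""

    total: int       # all valid user requests received
    successful: int  # requests that were served successfully


def availability_sli(counts: RequestCounts) -> float:
    """Return the fraction of user requests that succeeded (0.0 to 1.0)."""
    if counts.total == 0:
        # Assumption: with no traffic, no user experienced a failure.
        return 1.0
    return counts.successful / counts.total


window = RequestCounts(total=200_000, successful=199_840)
print(f"availability SLI: {availability_sli(window):.4f}")
```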

For batch and streaming workloads, you might need to measure key performance indicators (KPIs) for data throughput, such as rows scanned per time window, instead of server metrics such as disk usage. Throughput KPIs can help ensure a daily or quarterly report required by the user finishes on time.
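A throughput KPI of this kind might be used as follows. The sketch assumes that throughput stays constant for the rest of the job, which is a simplification; the function name, row counts, and deadline are hypothetical.

```python
def projected_finish_minutes(
    rows_remaining: int, rows_scanned: int, window_minutes: float
) -> float:
    """Estimate minutes until a batch job finishes at the current throughput.

    rows_scanned is the number of rows processed during the last
    window_minutes; throughput is assumed to stay constant.
    """
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    rows_per_minute = rows_scanned / window_minutes
    if rows_per_minute <= 0:
        return float("inf")  # no progress: the job never finishes at this rate
    return rows_remaining / rows_per_minute


# Hypothetical report job: 6 million rows left, 500,000 rows scanned in 5 minutes.
eta = projected_finish_minutes(6_000_000, 500_000, 5.0)
deadline_minutes = 90.0  # hypothetical time left before the report is due
print("on track" if eta <= deadline_minutes else "at risk")
```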

100% reliability is the wrong target

Your systems should be reliable enough that users are happy, but not so reliable that the investment is unjustified. Define SLOs that set the reliability threshold you want, then use error budgets to manage the appropriate rate of change.

Apply the design and operational principles in this framework to a product only if the SLO for that product or application justifies the cost.

Reliability and rapid innovation are complementary

Use error budgets to achieve a balance between system stability and developer agility. The following guidance helps you determine when to move fast or slow:

  • When an adequate error budget is available, you can innovate rapidly and improve the product or add product features.
  • When the error budget is diminished, slow down and focus on reliability features.
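The guidance above can be expressed as a simple policy check. This is a sketch only: the 25% threshold that separates an "adequate" budget from a "diminished" one is a hypothetical value, and the function name is illustrative. Each team would set its own policy.

```python
def release_guidance(
    remaining_minutes: float,
    total_minutes: float,
    diminished_threshold: float = 0.25,  # hypothetical cutoff, not a standard
) -> str:
    """Suggest whether to move fast or slow based on the remaining error budget."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    fraction_remaining = remaining_minutes / total_minutes
    if fraction_remaining > diminished_threshold:
        return "innovate"
    return "focus on reliability"


print(release_guidance(remaining_minutes=33.2, total_minutes=43.2))
print(release_guidance(remaining_minutes=5.0, total_minutes=43.2))
```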

Design and operational principles

To maximize system reliability, apply the following design and operational principles. Each of these principles is discussed in detail in the rest of the Architecture Framework reliability category.

Define your reliability goals

The best practices covered in this section of the Architecture Framework include the following:

  • Choose appropriate SLIs.
  • Set SLOs based on the user experience.
  • Iteratively improve SLOs.
  • Use strict internal SLOs.
  • Use error budgets to manage development velocity.

For more information, see Define your reliability goals in the Architecture Framework reliability category.

Build observability into your infrastructure and applications

The following design principle is covered in this section of the Architecture Framework:

  • Instrument your code to maximize observability.

For more information, see Build observability into your infrastructure and applications in the Architecture Framework reliability category.

Design for scale and high availability

The following design principles are covered in this section of the Architecture Framework:

  • Create redundancy for higher availability.
  • Replicate data across regions for disaster recovery.
  • Design a multi-region architecture for resilience to regional outages.
  • Eliminate scalability bottlenecks.
  • Degrade service levels gracefully when overloaded.
  • Prevent and mitigate traffic spikes.
  • Sanitize and validate inputs.
  • Fail safe in a way that preserves system function.
  • Design API calls and operational commands to be retryable.
  • Identify and manage system dependencies.
  • Minimize critical dependencies.
  • Ensure that every change can be rolled back.
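As one illustration of the retryable-calls principle in the list above, the following sketch retries an operation with exponential backoff and jitter. It is safe only when the operation is idempotent, meaning that repeating it has the same effect as running it once. The `TransientError` type, the function name, and the default delays are hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call an idempotent operation, retrying transient failures with backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff with full jitter avoids synchronized retries.
            max_delay = base_delay_s * (2 ** (attempt - 1))
            sleep(random.uniform(0, max_delay))
            attempt += 1


attempts = {"count": 0}


def flaky_lookup() -> str:
    """Hypothetical operation that fails twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("temporary failure")
    return "ok"


print(call_with_retries(flaky_lookup, base_delay_s=0.01))
```

The jitter spreads retries from many clients over time, which also helps to avoid creating a traffic spike after an outage.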

For more information, see Design for scale and high availability in the Architecture Framework reliability category.

Create reliable operational processes and tools

The following operational principles are covered in this section of the Architecture Framework:

  • Choose good names for applications and services.
  • Implement progressive rollouts with canary testing procedures.
  • Spread out traffic for timed promotions and launches.
  • Automate the build, test, and deployment process.
  • Defend against operator error.
  • Test failure recovery procedures.
  • Conduct disaster recovery tests.
  • Practice chaos engineering.

For more information, see Create reliable operational processes and tools in the Architecture Framework reliability category.

Build efficient alerts

The following operational principles are covered in this section of the Architecture Framework:

  • Optimize alert delays.
  • Alert on symptoms, not causes.
  • Alert on outliers, not averages.
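To see why averages can hide problems, consider the following sketch. The latency samples and the 1,000 ms alert threshold are hypothetical, and the percentile uses the simple nearest-rank method rather than any particular monitoring product's calculation.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Return the nearest-rank percentile (pct between 1 and 100) of values."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 98 fast requests and 2 very slow ones.
latencies_ms = [100.0] * 98 + [5000.0] * 2
ALERT_THRESHOLD_MS = 1000.0  # hypothetical alerting threshold

mean_ms = sum(latencies_ms) / len(latencies_ms)
p99_ms = percentile(latencies_ms, 99)

print(f"mean: {mean_ms} ms, alert: {mean_ms > ALERT_THRESHOLD_MS}")
print(f"p99:  {p99_ms} ms, alert: {p99_ms > ALERT_THRESHOLD_MS}")
```

In this sample, the mean stays well below the threshold even though some users waited five seconds, while the 99th percentile exposes the slow requests.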

For more information, see Build efficient alerts in the Architecture Framework reliability category.

Build a collaborative incident management process

The following operational principles are covered in this section of the Architecture Framework:

  • Assign clear service ownership.
  • Reduce time to detect (TTD) with well-tuned alerts.
  • Reduce time to mitigate (TTM) with incident management plans and training.
  • Design dashboard layouts and content to minimize TTM.
  • Document diagnostic procedures and mitigation for known outage scenarios.
  • Use blameless postmortems to learn from outages and prevent recurrences.

For more information, see Build a collaborative incident management process in the Architecture Framework reliability category.

What's next

Explore other categories in the Architecture Framework such as system design, operational excellence, and security, privacy, and compliance.