Reliability

This section of the architecture framework describes how to apply technical and procedural requirements to architect and operate reliable services on Google Cloud.

Reliability is the most important feature of any application, because if the application is not reliable, users will eventually leave, and the other features won't matter.

  • An application must have measurable reliability goals, and deviations must be promptly corrected.
  • The application must be architected for scalability, high availability, and automated change management.
  • The application must be self-healing where possible, and it must be instrumented for observability.
  • The operational procedures used to run the application must impose minimal manual work and cognitive load on operators, while ensuring rapid mitigation of failures.

Strategies

Use these strategies to achieve reliability.

Reliability is defined by the user. For user-facing workloads, measure the user experience, for example the query success ratio, rather than just server metrics such as CPU usage. For batch and streaming workloads, you might need to measure key performance indicators (KPIs), such as rows scanned per time window, to ensure that a quarterly report is on track to finish on time, rather than just server metrics such as disk usage.

Use sufficient reliability. Your systems should be reliable enough that users are happy, but not excessively reliable such that the investment is unjustified. Define Service Level Objectives (SLOs) that set the reliability threshold, and use error budgets to manage the rate of change. Apply the additional principles listed below only if the SLO justifies the cost.

Create redundancy. Systems with high reliability needs must have no single points of failure, and their resources must be replicated across multiple failure domains. A failure domain is a pool of resources that can fail independently, such as a VM, zone, or region.

Include horizontal scalability. Ensure that every component of your system can accommodate growth in traffic or data by adding more resources.

Ensure overload tolerance. Design services to degrade gracefully under load.

Include rollback capability. Any change an operator makes to a service must have a well-defined method to undo it—that is, roll back the change.

Prevent traffic spikes. Don't synchronize requests across clients. Too many clients sending traffic at the same instant causes traffic spikes that can in the worst case cause cascading failures.

Test failure recovery. If you haven't recently tested your operational procedures to recover from failures, the procedures probably won't work when you need them. Items to test periodically include regional failover, rolling back a release, and restoring data from backups.

Detect failure. There is a tradeoff between alerting too soon and burning out the operation team versus alerting too late and having extended service outages. The delay before notifying operators about outages (also known as TTD: time to detect) must be tuned for this tradeoff.

Make incremental changes. Instantaneous global changes to service binaries or configuration are inherently risky. We recommend that you roll out changes gradually, with "canary testing" to detect bugs in the early stages of a rollout where their impact on users is minimal.

Have a coordinated emergency response. Design operational practices to minimize the duration of outages (also known as TTM: time to mitigate) while taking into account the customer experience and the well-being of the operators. This approach requires advance formalization of response procedures with well-defined roles and communication channels.

Instrument systems for observability. Systems must be sufficiently well instrumented to enable rapid triaging, troubleshooting, and diagnosis of problems to minimize TTM.

Document and automate emergency responses. In an emergency, people have difficulty defining what needs to be done and performing complex tasks. Therefore, preplan emergency actions, document them, and ideally automate them.

Perform capacity management. Forecast traffic and provision resources in advance of peak traffic events.

Reduce toil. Toil is manual and repetitive work with no enduring value, and it increases as the service grows. Continually aim to reduce or eliminate toil. Otherwise, operational work will eventually overwhelm operators, leaving little room for growth.

Best practices

Follow these best practices to help achieve reliability.

  • Define your reliability goals using Service Level Objectives (SLOs) and error budgets.
  • Build observability into your infrastructure and applications.
  • Design for scale and high availability.
  • Build flexible and automated deployment capabilities.
  • Build efficient alerting.
  • Build a collaborative process for incident management.

Define your reliability goals

We recommend that you measure your existing customer experience and your customers' tolerance for errors and mistakes, and that you establish reliability goals based on those measurements. For instance, an overall system uptime goal of 100% over an infinite amount of time can't be achieved, and it isn't meaningful if the data that the user expects isn't there.

Set SLOs based on the user experience. Measure reliability metrics as close to the user as possible. If possible, instrument the mobile or web client. If that's not possible, instrument the load balancer. Measuring reliability at the server should be the last resort. Set the SLO high enough that the user is not unhappy, and no higher.

Because of network connectivity or other transient client-side issues, your customers might not even notice brief reliability issues caused by your application.

We strongly recommend that you aim for a target lower than 100% for uptime and other vital metrics, but close to it. This target allows you to deliver software faster and at high quality. It's also true that in many cases, given a properly designed system, you can achieve higher availability by reducing the pace and volume of changes.

In other words, achievable reliability goals that are tuned to the customer experience tend to help define the maximum pace and scope of changes (that is, feature velocity) that customers can tolerate.

If you can't measure the existing customer experience and define goals around it, we recommend a competitive benchmark analysis. In the absence of comparable competition, measure the customer experience even if you can't define goals yet, for example, system availability or the rate of meaningful, successful transactions from the customer's perspective. You can correlate these measures with business metrics or KPIs such as volume of orders (retail) or the volume and severity of customer support calls and tickets. Over time, you can use such correlation exercises to arrive at a reasonable threshold of customer happiness, that is, an SLO.

SLIs, SLOs, and SLAs

A Service Level Indicator (SLI) is a quantitative measure of some aspect of the level of service that is being provided. It is a metric, not a target.

Service level objectives (SLOs) specify a target level for the reliability of your service. Because SLOs are key to making data-driven decisions about reliability, they're at the core of SRE practices. The SLO is a value for an SLI, and when the SLI is at or above this value, the service is considered to be "reliable enough."

Error budgets are calculated as (100% – SLO) over a period of time. They tell you if your system is more or less reliable than is needed over a certain time window. We generally recommend using a rolling window of 30 days.
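To make the arithmetic concrete, here is a minimal Python sketch (with made-up traffic numbers) that turns an SLO and a 30-day window into an allowed number of failures and the fraction of the budget already consumed:

```python
# Minimal error-budget arithmetic for a request-based SLO.
# All numbers below are hypothetical examples.

SLO = 0.999                 # 99.9% of requests must succeed
WINDOW_DAYS = 30            # rolling window recommended above

total_requests = 10_000_000 # requests observed in the window
failed_requests = 4_200     # requests that violated the SLI

error_budget_ratio = 1 - SLO                        # 0.1% of requests may fail
allowed_failures = error_budget_ratio * total_requests
budget_consumed = failed_requests / allowed_failures

print(f"Allowed failures in {WINDOW_DAYS} days: {allowed_failures:.0f}")
print(f"Error budget consumed: {budget_consumed:.0%}")
# With these numbers: 10,000 allowed failures, 42% of the budget consumed.
```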

Service level agreements (SLAs) are an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.

It's a good practice to have stricter internal SLOs than external SLAs. The rationale for this approach is that SLA violations tend to require issuing a financial credit or customer refund, and you want to address problems well before they have financial impact. We recommend attaching stricter internal SLOs to a blameless post-mortem process with incident reviews.

If the application has a multi-tenant architecture, typical of SaaS applications used by multiple independent customers, be sure to capture SLIs at a per-tenant level. If you measure SLIs only at a global aggregate level, your monitoring will be unable to flag critical problems affecting individual customers or a minority of customers. Design the application to include a tenant identifier in each user request and propagate that identifier through each layer of the stack. Propagating this identifier lets your monitoring system aggregate statistics at the per-tenant level at every layer or microservice along the request path.
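As a rough illustration of that propagation, the following Python sketch shows a hypothetical request handler that reads a tenant identifier from a request header and attaches it as a label when it records an SLI event. The header name and helper function are assumptions for the example, not a prescribed API:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
TENANT_HEADER = "x-tenant-id"   # hypothetical header name

def record_sli_event(tenant_id: str, success: bool, latency_ms: float) -> None:
    # In a real system this would export a labeled metric; here we just log it.
    logging.info(
        "sli_event tenant=%s success=%s latency_ms=%.1f",
        tenant_id, success, latency_ms,
    )

def handle_request(headers: dict, latency_ms: float, ok: bool) -> None:
    # Extract the tenant identifier once at the edge...
    tenant_id = headers.get(TENANT_HEADER, "unknown")
    # ...and propagate it to every layer that records metrics.
    record_sli_event(tenant_id, ok, latency_ms)

handle_request({"x-tenant-id": "tenant-42"}, latency_ms=120.0, ok=True)
```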

Error budgets

Use error budgets to manage development velocity. When the error budget is not yet consumed, continue to launch new features quickly. When the error budget is close to zero, freeze or slow down service changes and invest engineering resources in reliability features.

Google Cloud minimizes the effort of setting up SLOs and error budgets with service monitoring. This product offers a UI for configuring SLOs manually, an API for programmatic setup of SLOs, and built-in dashboards for tracking the error budget burn rate.

Example SLIs

For serving systems, the following SLIs are typical:

  • Availability tells you the fraction of the time that a service is usable. It is often defined in terms of the fraction of well-formed requests that succeed—for example, 99%.
  • Latency tells you how quickly a certain percentage of requests can be fulfilled. It is often defined in terms of a percentile other than 50th—for example, 99th percentile at 300 ms.
  • Quality tells you how good a certain response is. The definition of quality is often service specific, indicating the extent to which the content of the response to a request varies from the ideal response content. It could be binary (good or bad) or expressed on a scale from 0% to 100%.
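To make these definitions concrete, the following Python sketch computes an availability SLI and a 99th-percentile latency SLI from a small in-memory list of request records; in practice the same calculation would run in your monitoring backend over real traffic:

```python
import math

# Each record is (succeeded, latency_ms); the data here is illustrative.
requests = [(True, 120), (True, 95), (False, 40), (True, 310), (True, 180)]

# Availability SLI: fraction of well-formed requests that succeed.
availability = sum(1 for ok, _ in requests if ok) / len(requests)

# Latency SLI: 99th-percentile latency (nearest-rank method).
latencies = sorted(latency for _, latency in requests)
p99 = latencies[math.ceil(0.99 * len(latencies)) - 1]

print(f"availability = {availability:.1%}, p99 latency = {p99} ms")
```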

For data processing systems, the following SLIs are typical:

  • Coverage tells you the fraction of data that has been processed—for example, 99.9%.
  • Correctness tells you the fraction of responses deemed to be correct—for example, 99.99%.
  • Freshness tells you how fresh the source data or the aggregated responses are; fresher is generally better. For example, data is no more than 20 minutes old.
  • Throughput tells you how much data is being processed—for example, 500 MiB/sec or even 1000 RPS.

For storage systems, the following SLIs are common:

  • Durability tells you how likely it is that data written to the system can be retrieved in the future—for example, 99.9999%. Any permanent data loss incident reduces the durability metric.
  • Throughput and Latency (as described previously).

Design questions

  • Are you measuring the user experience of application reliability?
  • Are the client applications designed to capture and report reliability metrics?
  • Is the system architecture designed with specific reliability and scalability goals in mind?
  • For multi-tenant systems, do user requests include a tenant identifier, and is that identifier propagated through each layer of the software stack?

Recommendations

  • Define and measure customer-centric SLIs.
  • Define a customer-centric error budget that's stricter than your external SLA with consequences for violations—for example, production freezes.
  • Set up latency SLIs to capture outlier values (that is, the 90th or 99th percentile) to detect the slowest responses.
  • Review SLOs at least annually.

Build observability into your infrastructure and applications

Observability includes monitoring, logging, tracing, profiling, debugging, and similar systems.

Instrument your code to maximize observability. Write log entries and trace entries, and export monitoring metrics with debugging and troubleshooting in mind, prioritizing by the most likely or frequent failure modes of the system. Periodically audit and prune your monitoring, deleting unused or useless dashboards, alerts, tracing, and logging to eliminate clutter.
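One common way to make log entries easy to triage is to emit them as structured JSON with a severity and a request identifier. The sketch below shows the general shape using only the Python standard library; the field names are illustrative, not a required schema:

```python
import json
import sys
import time

def log_event(severity: str, message: str, **fields) -> None:
    # Structured entries are easier to filter and aggregate than free text.
    entry = {"timestamp": time.time(), "severity": severity, "message": message}
    entry.update(fields)
    print(json.dumps(entry), file=sys.stderr)

log_event("ERROR", "backend call failed",
          request_id="req-123", backend="payments", latency_ms=870)
```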

Monitoring is at the base of the service reliability hierarchy. Without proper monitoring, you cannot tell if an application is working in the first place.

A well-designed system aims to have the right amount of observability starting with its development phase. Don't wait until an application is in production to start observing it.

Google Cloud's operations suite delivers real-time monitoring, logging across Google Cloud and AWS, and tracing, profiling, and debugging. It can also monitor a service mesh that uses Istio, as well as App Engine services, through Cloud Monitoring.

Over-engineering monitoring and over-alerting are common anti-patterns. Avoid them by proactively deleting time series, dashboards, and alerts that no one looks at or that rarely fire during the initial external launch stages. The same applies to log entries that are rarely scanned.

Evaluate sending all or a sample of application events to a cloud data warehouse such as BigQuery. Doing so lets you run arbitrary queries at lower cost instead of having to design your monitoring exactly right up front. It also decouples reporting from monitoring: anyone can build reports with Google Data Studio or Looker.
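As a minimal sketch of that approach, assuming the google-cloud-bigquery client library and a pre-created dataset and table (the project, table, and field names below are placeholders):

```python
from google.cloud import bigquery  # assumes google-cloud-bigquery is installed

client = bigquery.Client()
table_id = "my-project.app_events.events"  # placeholder dataset and table

rows = [
    {"event": "checkout_failed", "tenant_id": "tenant-42", "latency_ms": 870},
]

# Streaming insert; returns a list of per-row errors (empty on success).
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("BigQuery insert errors:", errors)
```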

Design questions

  • Does the design review process have standards for observability that guide design reviews and code reviews?
  • Are the metrics exported by the application adequate for troubleshooting outages?
  • Are application log entries sufficiently detailed and relevant to be useful for debugging?

Recommendations

  • Implement monitoring early on before initiating a migration or while building a new application before its first production deployment.
  • Disambiguate between application issues versus underlying cloud issues—for example, use Transparent SLI and Google Cloud Status Dashboard.
  • Define an observability strategy that goes beyond monitoring and includes tracing, profiling, and debugging.
  • Regularly cleanup observability artifacts that aren't being used or aren't useful, including unactionable alerts.
  • Send application events (that is, high-cardinality metrics) to a data warehouse system such as BigQuery.

Design for scale and high availability

Design a multi-region architecture with failover

If your service needs to be up even when an entire region is down, then design it to use pools of compute resources spread across different regions, with automatic failover when a region goes down. Eliminate single points of failure, such as a single-region master database that can cause a global outage when it is unreachable.

Eliminate scalability bottlenecks

Identify system components that can't grow beyond the resource limits of a single VM or a single zone. Some applications are designed for vertical scaling, where more CPU cores, memory, or network bandwidth are needed on a single VM to handle increased load. Such applications have hard limits on their scalability and often require manual reconfiguration to handle growth. Redesign these components to scale horizontally through sharding (partitioning across VMs or zones), so that growth in traffic or usage is handled by adding more shards. Shards should use standard VM types that can be added automatically to handle increases in per-shard load. As an alternative to redesign, consider replacing these components with managed services that are designed to scale horizontally with no user action.
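As a rough illustration of sharding by key, the sketch below hashes an entity key onto one of N shards, so growth is handled by increasing the shard count rather than by resizing a single VM. The hash choice and shard names are assumptions for the example:

```python
import hashlib

SHARDS = [f"shard-{i}" for i in range(8)]   # e.g., one backend pool per shard

def shard_for(key: str) -> str:
    # A stable hash spreads keys evenly across shards, so growth is handled
    # by increasing the shard count (ideally with consistent hashing to
    # limit how much data moves when the count changes).
    digest = hashlib.sha256(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("customer-1234"))
```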

Degrade service levels gracefully

Design your services to detect overload and return lower quality responses to the user or partially drop traffic rather than failing completely under overload. For example, a service can respond to user requests with static web pages while temporarily disabling dynamic behavior that is more expensive, or it can allow read-only operations while temporarily disabling data updates.
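A minimal sketch of this pattern, with a placeholder load signal and threshold: above the threshold, the handler returns a cheaper static response instead of the full dynamic one.

```python
OVERLOAD_THRESHOLD = 0.85   # placeholder: fraction of capacity in use

def handle_request(current_load: float, request: dict) -> dict:
    if current_load > OVERLOAD_THRESHOLD:
        # Degrade gracefully: cheap, cacheable response instead of an error.
        return {"status": 200, "body": "static fallback page", "degraded": True}
    # Normal, more expensive dynamic path.
    return {"status": 200, "body": f"dynamic content for {request['user']}",
            "degraded": False}

print(handle_request(0.9, {"user": "alice"}))
print(handle_request(0.4, {"user": "alice"}))
```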

Implement exponential backoff with jitter

When mobile clients encounter an error response from your service, they must retry after a random delay. If they get repeated errors, they should wait exponentially longer before retrying, in addition to adding random time offsets (jitter) to each retry operation. This prevents large groups of clients from generating instantaneous traffic spikes after cellular network failures, because these spikes can potentially crash your servers.
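A minimal client-side sketch of exponential backoff with full jitter; the base delay, cap, and attempt count are illustrative values, not recommended settings:

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry `operation` with exponentially growing, randomized delays."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential cap,
            # so clients don't retry in lockstep after a shared failure.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))

# Example usage with a hypothetical flaky operation:
# call_with_backoff(lambda: flaky_rpc())
```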

Predict peak traffic events and plan for them

If your system experiences known periods of peak traffic, such as Black Friday for retailers, invest time in preparing for such events to avoid significant loss of traffic and revenue. Forecast the size of the traffic spike, add a buffer, and ensure that your system has sufficient compute capacity to handle the spike. Load test the system with the expected mix of user requests to ensure that its estimated load-handling capacity matches the actual capacity. Run exercises where your Ops team conducts simulated outage drills, rehearsing their response procedures and exercising the collaborative cross-team incident management procedures discussed below.
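The provisioning arithmetic itself is simple. The sketch below, with made-up numbers, combines a traffic forecast, a safety buffer, and a load-tested per-instance capacity into a target instance count:

```python
import math

forecast_peak_rps = 120_000      # forecast peak requests per second (example)
buffer = 0.30                    # 30% headroom on top of the forecast
per_instance_rps = 800           # capacity measured via load testing

required_rps = forecast_peak_rps * (1 + buffer)
instances_needed = math.ceil(required_rps / per_instance_rps)

print(f"Provision {instances_needed} instances for {required_rps:,.0f} RPS")
# 120,000 * 1.3 = 156,000 RPS -> 195 instances at 800 RPS each.
```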

Conduct disaster recovery testing

Don't wait for a disaster to strike; periodically test and verify your disaster recovery procedures and processes. You might also be planning an architecture for high availability (HA). It doesn't entirely overlap with disaster recovery (DR), but it's often necessary to take HA into account when you're thinking about Recovery Time Objective (RTO) and Recovery Point Objective (RPO) values. HA helps to ensure an agreed level of operational performance, usually uptime, for a higher than normal period. When you run production workloads on Google Cloud, you might use a globally distributed system so that if something goes wrong in one region, the application continues to provide service even if it's less widely available. In essence, that application invokes its DR plan.

Design questions

  • Can the application scale out by adding more VMs, with no architectural changes?
  • Is each component in the architecture horizontally scalable, through sharding or otherwise?
  • Are client applications designed to avoid synchronizing requests across clients?
  • Can the application handle failure of an entire cloud region without having a global outage?
  • Are user requests evenly distributed across shards and regions?
  • Can the application detect when it is overloaded and change its behavior to prevent an outage?

Recommendations

  • Implement exponential backoff with randomization in the error retry logic of client applications.
  • Implement a multi-region architecture with automatic failover for high availability.
  • Use load balancing to distribute user requests across shards and regions.
  • Design the application to degrade gracefully under overload, serving partial responses or providing limited functionality rather than failing completely.
  • Establish a recurring data-driven process for capacity planning, using load tests and traffic forecasts to drive provisioning of resources.
  • Establish disaster recovery procedures and test them periodically.

Build flexible and automated deployment capabilities

Ensure that every change can be rolled back

If there is no well-defined way to undo certain types of changes, change the design of the service to support rollback, and test the rollback processes periodically. Rollback can be costly to implement for mobile applications, and we suggest that developers apply Firebase Remote Config to make feature rollback easier.

Spread out traffic for timed promotions and launches

For promotional events such as sales that start at a precise time—for example, midnight—and incentivize many users to connect to the service simultaneously, design client code to spread the traffic over a few seconds by adding random delays before initiating requests. This prevents instantaneous traffic spikes that could crash your servers at the scheduled start time.
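A minimal client-side sketch of that idea: each client waits a random delay within a short window before sending its first request, so the aggregate load ramps up instead of arriving as a single spike. The window length is an example value:

```python
import random
import time

SPREAD_WINDOW_SECONDS = 10.0   # example: spread the stampede over 10 seconds

def start_promotion_request(send_request):
    # Each client independently delays by a random offset, so the aggregate
    # load ramps up over the window instead of arriving in one spike.
    time.sleep(random.uniform(0, SPREAD_WINDOW_SECONDS))
    return send_request()

# Example usage with a hypothetical request function:
# start_promotion_request(lambda: http_get("https://example.com/sale"))
```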

Implement progressive rollouts with canary testing

Roll out new versions of executables and configuration changes incrementally, starting with a small scope such as a few VMs in a zone. Your goal is to catch bugs when only a small portion of user traffic is affected, before rolling the change out globally. Set up a "canary testing" system that is aware of such changes and does A/B comparison of the metrics of the changed servers with the remaining servers, flagging anomalous behavior. The canary system should alert operators to problems, and might even automatically halt rollouts. After the change passes canary testing, propagate it to larger scopes gradually, such as to a full zone, then to a second zone, allowing time for the changed system to handle progressively larger volumes of user traffic to expose any latent bugs.
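The comparison at the heart of canary analysis can be sketched as follows: compare an SLI from the canary population against the same SLI from the baseline population, and halt the rollout if the canary is meaningfully worse. Real canary systems use statistical tests; the fixed tolerance here is only illustrative:

```python
def canary_passes(canary_error_rate: float,
                  baseline_error_rate: float,
                  tolerance: float = 0.002) -> bool:
    """Return False if the canary's error rate is meaningfully worse."""
    return canary_error_rate <= baseline_error_rate + tolerance

# Example: canary at 0.9% errors vs. baseline at 0.4% -> halt the rollout.
if not canary_passes(canary_error_rate=0.009, baseline_error_rate=0.004):
    print("Halt rollout and alert the release owner")
```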

Automate build, test, and deploy

Eliminate release toil by using CI/CD pipelines to build automated testing into your releases. Perform automated integration testing and deployment.

Automation is useful but not a panacea. It comes with a fair share of maintenance costs and risks to reliability beyond its initial development and setup costs.

We recommend that you start by inventorying and assessing the cost of toil on the teams managing your systems. Make this a continuous process that's initiated before you invest in customized automation to extend what's already provided by Google Cloud services and partners. You can often tweak Google Cloud's own automation—for example, Compute Engine's autoscaling algorithm.

We consider toil to be manual, repetitive, automatable, reactive work that tends to lack enduring value and grows at least as fast as its source. For details, see the SRE book chapter Eliminating Toil.

Google-provided configurable automation and customized automation can help eliminate toil in many of the operational areas that our customers manage.

We recommend that you eliminate toil first and, as a second step, use Google-provided configurable automation as much as possible to reduce the toil that remains. The third step, which can proceed in parallel with the first two, is to evaluate building or buying other solutions if the cost of toil stays high—for example, more than 50% of the time for any team managing your production systems. When building or buying solutions, consider integration, security, privacy, and compliance costs.

If you come across a Google Cloud product or service that only partially satisfies your technical needs in the realm of automating or eliminating manual workflows, consider filing a feature request through your Google Cloud account representative. It might be a priority for more of our customers or already a part of our roadmap, and if so, knowing the feature's priority and timeline helps you to better assess the trade-offs of building your own solution versus waiting to use it as a Google Cloud feature.

Design questions

  • Is the change process for executables and configurations automated?
  • Is change automation designed to enable fast rollback, for every possible change?
  • For changes that cannot be rolled back, such as schema changes, is there a design review process to ensure forward and backward compatibility between current or former binary versions and data schemas?
  • Are system configuration files sharded, such that config changes can be rolled out incrementally rather than globally?

Recommendations

  • Set up canary testing of new releases and configurations.
  • Define what's toil for your teams and regularly assess its cost.
  • Eliminate unnecessary toil/workflows before developing custom automation.
  • Use existing automation already available through Google Cloud services by tweaking the default configuration or crafting one if a default isn't provided.
  • Evaluate building (or buying) custom automation if the maintenance cost and risks for service reliability and security are worth it. We'd also recommend evaluating well-maintained open source software.

Build efficient alerting

Optimize alerting delay

Tune the configured delay before the monitoring system notifies humans of a problem to minimize TTD while maximizing signal versus noise. Use the error budget consumption rate to derive the optimal alerting configuration.
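For example, a widely used approach is burn-rate alerting: page when the recent error rate implies that the error budget would be exhausted far too quickly. The numbers below are illustrative:

```python
SLO = 0.999
WINDOW_HOURS = 30 * 24           # 30-day error budget window

# Burn rate = (observed error ratio) / (allowed error ratio).
observed_error_ratio = 0.015     # 1.5% of requests failing in the last hour
allowed_error_ratio = 1 - SLO    # 0.1%
burn_rate = observed_error_ratio / allowed_error_ratio

# At this rate, the whole 30-day budget lasts WINDOW_HOURS / burn_rate hours.
hours_to_exhaustion = WINDOW_HOURS / burn_rate
print(f"burn rate = {burn_rate:.0f}x, budget gone in {hours_to_exhaustion:.0f}h")

# A typical fast-burn page might fire when the burn rate exceeds roughly 14
# (budget gone in about two days); slower burns can go to a ticket queue.
if burn_rate >= 14:
    print("Page the on-call engineer")
```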

Alert on symptoms, not causes

Trigger alerts based on direct impact to user experience—that is, noncompliance with global or per-customer SLOs. Do not alert on every possible underlying root cause of a failure, especially when the impact is limited to a single replica. A well-designed distributed system recovers seamlessly from single-replica failures.

Build a collaborative incident management process

It's inevitable that your well-designed system will eventually fail its SLOs. Keep in mind that in the absence of an SLO, your customers will still loosely define what the acceptable service level is based on their past experience and will escalate to your technical support or similar group, regardless of what's in your SLA.

To properly serve your customers, we strongly recommend that you establish and regularly exercise an incident management plan. The plan can be a single-page checklist of about 10 items. This process helps your team reduce Mean Time to Detect (MTTD) and Mean Time to Mitigate (MTTM). We use MTTM rather than MTTR because MTTR is ambiguous: the "R" is often read as "repair" or "recovery," meaning a full fix rather than a mitigation. MTTM explicitly means mitigation as opposed to a full fix.

A well-designed system where operations are excellent will increase the Mean Time Between Failures (MTBF). See the Systems Design and Operational Excellence sections for details.

It's also important to establish both a blameless postmortem culture and an incident review process. "Blameless" means your team is required to evaluate and document what went wrong in an objective manner without finger pointing. Mistakes are treated as learning opportunities, not cause for criticism. Always aim to make the system more resilient so that it can recover quickly from human error, or even better, detect and prevent human error.

Reduce MTTD

A prerequisite to reducing MTTD is to implement the recommendations described in the "Define your reliability goals" and "Build observability into your infrastructure and applications" sections, for example, disambiguating between application issues and underlying cloud issues.

A well-tuned set of SLIs alerts your team at the right time without causing alert overload. For guidance, see Tune up your SLI metrics: CRE life lessons.

Reduce MTTM

To reduce MTTM, have a properly documented and well-exercised incident management plan. In addition, have readily available data on what's changed.

Example incident management plan

  • Production issues have been detected (alert, page) or escalated to me.
  • Should I delegate in the first place? Yes, if you and your team can't resolve this.
  • Is this a privacy or security breach? If yes, then delegate to the privacy/security team.
  • Is this an emergency or are SLO(s) at risk? If in doubt, treat it as an emergency.
  • Should I involve more people? Yes, if it is impacting more than X% of customers or if it takes more than Y minutes to resolve. If in doubt, always involve more people, especially during business hours.
    • Define a primary communications channel—for example, IRC, Hangouts Chat, or Slack.
    • Delegate previously defined roles—for example:
      • Incident commander - responsible for overall coordination.
      • Communications lead - responsible for handling internal and external communications.
      • Operations lead - responsible for mitigating the issue.
    • Define when the incident is over. This might require acknowledgment from a Support representative.
  • Collaborate on the postmortem.
  • Attend recurring postmortem incident review meeting to discuss and staff action items.

Graphs

Here is a non-exhaustive list of graphs to consider. Incident responders should be able to glance at them in a single view.

  • Service level indicator(s)—for example, successful requests divided by total.
  • Configuration and/or binary rollouts.
  • Requests per second to the system.
  • Errors per second returned by the system.
  • Requests per second from the system to its dependencies.
  • Errors per second from the system to its dependencies.

We also commonly see request and response size, query cost, thread pools (to look for a regression induced by pool exhaustion), and JVM metrics being graphed (where applicable).

Test a few scenarios for the placement of these graphs. You can also apply machine learning, that is, anomaly detection techniques, to surface the right subset of these graphs.

Finally, as discussed earlier, another common approach for fast mitigation is to design systems and changes that can be easily rolled back.

Recommendations

  • Establish and train your teams on an incident management plan.
  • Implement "Observability" section recommendations to reduce MTTD.
  • Build a "what's changed" dashboard that you can glance at during incidents.
  • Document query snippets or build a Data Studio dashboard.
  • Evaluate Firebase Remote Config to mitigate rollout issues related to mobile applications.
  • Implement "Disaster Recovery" recommendations to decrease MTTM for a subset of your incidents.
  • Design for and test configuration and binary rollbacks.
  • Implement "Systems design" and "Disaster Recovery" (testing) sections recommendations in order to increase MTBF.
