This document in the Google Cloud Architecture Framework provides best practices to define appropriate ways to measure the customer experience of your services so you can run reliable services. You learn how to iterate on the service level objectives (SLOs) you define, and use error budgets to know when reliability might suffer if you release additional updates.
Choose appropriate SLIs
It's important to choose appropriate service level indicators (SLIs) to fully understand how your service performs. For example, if your application has a multi-tenant architecture that is typical of SaaS applications used by multiple independent customers, capture SLIs at a per-tenant level. If your SLIs are measured only at a global aggregate level, you might miss critical problems in your application that affect a single important customer or a minority of customers. Instead, design your application to include a tenant identifier in each user request, then propagate that identifier through each layer of the stack. This identifier lets your monitoring system aggregate statistics at the per-tenant level at every layer or microservice along the request path.
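The following minimal sketch shows why per-tenant aggregation matters. It assumes a hypothetical request log of `(tenant_id, succeeded)` pairs; the tenant names and counts are illustrative only and don't come from any real system.

```python
from collections import defaultdict

# Hypothetical request log: (tenant_id, succeeded). Values are illustrative only.
requests = (
    [("tenant-a", True)] * 980
    + [("tenant-b", True)] * 10
    + [("tenant-b", False)] * 10
)


def availability_by_tenant(records):
    """Return {tenant_id: fraction of that tenant's requests that succeeded}."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for tenant_id, succeeded in records:
        totals[tenant_id] += 1
        if succeeded:
            successes[tenant_id] += 1
    return {tenant: successes[tenant] / totals[tenant] for tenant in totals}


# The global aggregate hides the tenant whose requests fail half the time.
global_availability = sum(ok for _, ok in requests) / len(requests)
per_tenant = availability_by_tenant(requests)

print(f"global: {global_availability:.1%}")
for tenant, value in sorted(per_tenant.items()):
    print(f"{tenant}: {value:.1%}")
```

In this made-up data set, the global figure looks healthy while one small tenant sees half of its requests fail, which is the kind of problem a global aggregate can hide.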
The type of service you run also determines what SLIs to monitor, as shown in the following examples.
Serving systems
The following SLIs are typical in systems that serve data:
- Availability tells you the fraction of the time that a service is usable. It's often defined in terms of the fraction of well-formed requests that succeed, such as 99%.
- Latency tells you how quickly a certain percentage of requests can be fulfilled. It's often defined in terms of a percentile other than the 50th, such as "99th percentile at 300 ms".
- Quality tells you how good a certain response is. The definition of quality is often service-specific, and indicates the extent to which the content of the response to a request varies from the ideal response content. The response quality could be binary (good or bad) or expressed on a scale from 0% to 100%.
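As a rough illustration of the availability and latency SLIs described above, the following sketch computes both from in-memory samples. The record fields (`well_formed`, `succeeded`) and the nearest-rank percentile method are assumptions made for this example; a monitoring system might compute percentiles differently.

```python
import math


def availability(records):
    """Fraction of well-formed requests that succeeded, or None if there are none."""
    well_formed = [r for r in records if r["well_formed"]]
    if not well_formed:
        return None
    return sum(1 for r in well_formed if r["succeeded"]) / len(well_formed)


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile requires at least one sample")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Illustrative data: 100 latency samples from 1 ms to 100 ms.
latencies_ms = list(range(1, 101))
records = [
    {"well_formed": True, "succeeded": True},
    {"well_formed": True, "succeeded": False},
    {"well_formed": False, "succeeded": False},  # malformed, excluded from the SLI
]

p99 = percentile(latencies_ms, 99)
print("p99 latency within 300 ms target:", p99 <= 300)
print("availability:", availability(records))
```

Malformed requests are excluded from the denominator here because the availability definition above counts only well-formed requests.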
Data processing systems
The following SLIs are typical in systems that process data:
- Coverage tells you the fraction of data that has been processed, such as 99.9%.
- Correctness tells you the fraction of output data deemed to be correct, such as 99.99%.
- Freshness tells you how fresh the source data or the aggregated output data is, such as data that was updated within the last 20 minutes. Typically, the more recently updated, the better.
- Throughput tells you how much data is being processed, such as 500 MiB/sec or even 1000 requests per second (RPS).
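The data processing SLIs above reduce to simple ratios and time differences. The following sketch is one possible way to express coverage, freshness, and throughput; the function names, the sample values, and the treatment of empty input are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def coverage(processed_records, total_records):
    """Fraction of input records that were processed."""
    if total_records == 0:
        return 1.0  # Assumption: an empty input counts as fully covered.
    return processed_records / total_records


def freshness(last_updated, now):
    """Age of the most recent output, as a timedelta."""
    return now - last_updated


def throughput_mib_per_sec(bytes_processed, elapsed_seconds):
    """Average processing rate in MiB per second."""
    return bytes_processed / (1024 * 1024) / elapsed_seconds


# Illustrative values only.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_run = now - timedelta(minutes=12)

print("coverage target met:", coverage(999_000, 1_000_000) >= 0.999)
print("freshness target met:", freshness(last_run, now) <= timedelta(minutes=20))
print("throughput (MiB/sec):", throughput_mib_per_sec(500 * 1024 * 1024 * 60, 60))
```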
Storage systems
The following SLIs are typical in systems that store data:
- Durability tells you how likely it is that data written to the system can be retrieved in the future, such as 99.9999%. Any permanent data loss incident reduces the durability metric.
- Throughput and latency are also common SLIs for storage systems.
Choose SLIs and set SLOs based on the user experience
One of the core principles in this Architecture Framework section is that reliability is defined by the user. Measure reliability metrics as close to the user as possible, such as the following options:
- If possible, instrument the mobile or web client.
  - For example, use Firebase performance monitoring to gain insight into the performance characteristics of your iOS, Android, and web apps.
- If that's not possible, instrument the load balancer.
  - For example, use Cloud Monitoring for external Application Load Balancer logging and monitoring.
- A measure of reliability at the server should be the last option.
  - For example, monitor a Compute Engine instance with Cloud Monitoring.
Set your SLO just high enough that almost all users are happy with your service, and no higher. Because of network connectivity or other transient client-side issues, your customers might not notice brief reliability issues in your application. That tolerance lets you lower your SLO.
For uptime and other vital metrics, aim for a target lower than 100% but close to it. Service owners should objectively assess the minimum level of service performance and availability that would make most users happy, not just set targets based on external contractual levels.
The rate at which you make changes affects your system's reliability. However, the ability to make frequent, small changes helps you deliver features faster and with higher quality. Achievable reliability goals tuned to the customer experience help define the maximum pace and scope of changes (feature velocity) that customers can tolerate.
If you can't measure the customer experience and define goals around it, you can run a competitive benchmark analysis. If there's no comparable competition, measure the customer experience, even if you can't define goals yet. For example, measure system availability or the rate of meaningful and successful transactions to the customer. You can correlate this data with business metrics or KPIs such as the volume of orders in retail or the volume of customer support calls and tickets and their severity. Over a period of time, you can use such correlation exercises to get to a reasonable threshold of customer happiness. This threshold is your SLO.
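One simple way to run the correlation exercise described above is to compute a Pearson correlation coefficient between an SLI and a business metric over matching time periods. The following sketch assumes hypothetical weekly availability figures and support ticket counts; the numbers are invented for illustration, and a real analysis would need far more data points.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical weekly measurements (illustrative values only).
weekly_availability = [0.9995, 0.9990, 0.9970, 0.9999, 0.9950]
weekly_support_tickets = [12, 15, 40, 9, 70]

r = pearson(weekly_availability, weekly_support_tickets)
print(f"correlation between availability and ticket volume: {r:.2f}")
```

A strongly negative coefficient suggests that support tickets rise as availability falls. The availability level below which ticket volume climbs noticeably is one candidate for the SLO threshold.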
Iteratively improve SLOs
SLOs shouldn't be set in stone. Revisit SLOs quarterly, or at least annually, and confirm that they continue to accurately reflect user happiness and correlate well with service outages. Make sure that they cover current business needs and new critical user journeys. Revise and augment your SLOs as needed after these periodic reviews.
Use strict internal SLOs
It's a good practice to have stricter internal SLOs than external SLAs. As SLA violations tend to require issuing a financial credit or customer refunds, you want to address problems before they have financial impact.
We recommend that you use these stricter internal SLOs with a blameless postmortem process and incident reviews. For more information, see Build a collaborative incident management process in the Architecture Center reliability category.
Use error budgets to manage development velocity
Error budgets tell you if your system is more or less reliable than is needed over a certain time window. Error budgets are calculated as 100% – SLO over a period of time, such as 30 days.
When you have capacity left in your error budget, you can continue to launch improvements or new features quickly. When the error budget is close to zero, freeze or slow down service changes and invest engineering resources to improve reliability features.
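The error budget arithmetic above can be sketched as follows. The function names and the 10% slow-down threshold in `release_policy` are assumptions chosen for this example, not values prescribed by the framework.

```python
def allowed_downtime_minutes(slo, window_days=30):
    """Error budget (100% - SLO) expressed as minutes of downtime in the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


def release_policy(remaining, slow_down_threshold=0.1):
    """Map the remaining budget to an action. The threshold is a hypothetical choice."""
    if remaining <= 0:
        return "freeze"
    if remaining < slow_down_threshold:
        return "slow down"
    return "ship"


# A 99.9% SLO over 30 days allows about 43.2 minutes of downtime.
print(f"{allowed_downtime_minutes(0.999):.1f} minutes")

# 400 failures out of 1,000,000 requests spends about 40% of the budget.
remaining = budget_remaining(0.999, 1_000_000, 400)
print(f"budget remaining: {remaining:.0%}, action: {release_policy(remaining)}")
```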
Google Cloud's operations suite includes SLO monitoring to minimize the effort of setting up SLOs and error budgets. The operations suite includes a graphical user interface to help you to configure SLOs manually, an API for programmatic setup of SLOs, and built-in dashboards to track the error budget burn rate. For more information, see how to create an SLO.
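To make the idea of a burn rate concrete, the following sketch computes it by hand as the observed error rate divided by the error rate that the SLO allows. It doesn't call the SLO monitoring API; the function names are hypothetical, and the exhaustion estimate assumes a full budget at the start of the window and a constant burn rate.

```python
def burn_rate(slo, total_requests, failed_requests):
    """Observed error rate divided by the error rate that the SLO allows.

    A value of 1.0 spends the budget exactly over the compliance period;
    values above 1.0 exhaust it early.
    """
    allowed_error_rate = 1.0 - slo
    if allowed_error_rate <= 0:
        raise ValueError("an SLO of 100% leaves no error budget to burn")
    if total_requests == 0:
        return 0.0
    return (failed_requests / total_requests) / allowed_error_rate


def hours_until_exhausted(rate, window_days=30):
    """Hours until a full budget is spent, assuming the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


# 20 failures in 10,000 requests against a 99.9% SLO: the budget burns at about 2x.
rate = burn_rate(0.999, 10_000, 20)
print(f"burn rate: {rate:.1f}")
print(f"hours until the budget is exhausted: {hours_until_exhausted(rate):.0f}")
```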
To apply the guidance in the Architecture Framework to your own environment, follow these recommendations:
- Define and measure customer-centric SLIs, such as the availability or latency of the service.
- Define a customer-centric error budget that's stricter than your external SLA. Include consequences for violations, such as production freezes.
- Set up latency SLIs to capture outlier values, such as 90th or 99th percentile, to detect the slowest responses.
- Review SLOs at least annually and confirm that they correlate well with user happiness and service outages.
Learn more about how to define your reliability goals with the following resources:
- Build observability into your infrastructure and application (next document in this series)
- Coursera - SRE: Measuring and Managing Reliability
- SRE book chapter SLOs and SRE workbook to implement SLOs
- Tune up your SLI metrics
Explore other categories in the Architecture Framework such as system design, operational excellence, and security, privacy, and compliance.