This document in the Google Cloud Architecture Framework describes how the user experience determines reliability and how to choose appropriate service level objectives (SLOs) to meet that level of reliability. This document builds on the concepts defined in Components of SLOs.
The culture of site reliability engineering (SRE) values reliable services and customer happiness (or customer satisfaction). Without a defined service level and a method to gather metrics, it's difficult (if not impossible) to determine where and how much to invest in improvements.
The overriding metric that you use to measure service level is the service level objective (SLO). An SLO is made up of the following values:
- A service level indicator (SLI): A metric of a specific aspect of your service as described in Choose your SLIs.
- Duration: The window in which the SLI is measured. This window can be calendar-based or rolling.
- A target: The value (or range of values) that the SLI should meet over the given duration for the service to be considered healthy. For example, the ratio of good events to total events that you expect your service to meet, such as 99.9%.
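Taken together, these values can be sketched in a few lines of code. The helper below is a hypothetical illustration (the function name and the numbers are made up, not from any monitoring product); it compares a ratio-based SLI against a target over one measurement window:

```python
def slo_met(good_events: int, total_events: int, target: float) -> bool:
    """Return True if the ratio of good events to total events meets the target."""
    if total_events == 0:
        return True  # no events in the window; conventionally treated as compliant
    sli = good_events / total_events
    return sli >= target

# 999 good requests out of 1,000 in the window is a 99.9% SLI.
print(slo_met(good_events=999, total_events=1000, target=0.999))  # True
print(slo_met(good_events=990, total_events=1000, target=0.999))  # False
```

In practice, the SLI would be computed by your monitoring system over the chosen duration rather than from raw counters like this.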
Choosing the right SLOs for your service is a process. You start by defining the user journeys that define reliability and ultimately your SLOs. The SLOs that you choose need to measure the entire system while also balancing the needs of feature development against operational stability. After you've chosen your SLOs, you need to both iteratively improve upon them and manage them by using error budgets.
Define your user journeys
Your SLIs and SLOs are ideally based on critical user journeys (CUJs). A CUJ considers user goals and how your service helps users accomplish those goals. You define a CUJ without considering service boundaries. When a CUJ is met, the customer is happy, which is an indication of a successful service.
Customer happiness (or dissatisfaction) dictates the level of reliability you need, and reliability is the most critical feature of any service.
Therefore, set your SLO just high enough that most users are happy with your service, and no higher. Just as 100% availability is not the right goal, adding more "nines" to your SLOs quickly becomes expensive and might not even matter to the customer.
For uptime and other vital metrics, aim for a target lower than 100%, but close to it. Assess the minimum level of service performance and availability required. Don't set targets based on external contractual levels.
Use CUJs to develop SLOs
Choose your company's most important CUJs, and follow these steps to develop SLOs:
- Choose an SLI specification (such as availability or freshness).
- Decide how to implement the SLI specification.
- Ensure that your plan covers all CUJs.
- Set SLOs based on previous performance or business needs.
CUJs shouldn't be constrained to a single service, or to a single development team or organization. Your service might depend on dozens of other services, each of which you might expect to operate at 99.5% availability. However, if you don't track end-to-end (whole-system) performance, running a reliable service is challenging.
Define target and duration
Defining target and duration (see the previous definition of an SLO) can be difficult. One way to begin the process is to identify your SLIs and chart them over time. Remember, an SLO doesn't have to be perfect from the start. Iterate on your SLO to ensure that it aligns with customer happiness and meets your business needs.
As you track SLO compliance during events such as deployments, outages, and daily traffic patterns, you'll gain insights about the target. These insights will make it more apparent what is good, bad, or tolerable for your targets and durations.
Feature development, code improvements, hardware upgrades, and other maintenance tasks can help make your service more reliable. The ability to make these frequent, small changes helps you deliver features faster and with higher quality. However, the rate at which your service changes also affects reliability. Achievable reliability goals define a pace and scope of change (called feature velocity) that customers can tolerate and benefit from.
If you can't measure the customer experience and define goals around it, you can turn to outside sources and benchmark analysis. If there's no comparable benchmark, measure the customer experience, even if you can't define goals yet. Over time, you can get to a reasonable threshold of customer happiness. This threshold is your SLO.
Understand the entire system
Your service may exist in a long line of services with both upstream and downstream processing. Measuring performance of a distributed system in a piecemeal manner (service by service) doesn't accurately reflect your customer's experience and might cause an overly sensitive interpretation.
Instead, you should measure performance against the SLO at the frontend of the process to understand what users experience. The user is not concerned about a component failure that causes a query to fail if the query is automatically and successfully retried.
If there are shared internal services in place, each service can measure performance separately against the associated SLO, with user-facing services acting as their customers. Handle these SLOs separately.
It's possible to build a highly available service (for example, 99.99%) on top of a less-available service (for example, 99.9%) by using resilience factors such as smart retries, caching, and queueing. Anyone with a working knowledge of statistics should be able to read and understand your SLO without needing to understand your underlying service or organizational layout (as described in Conway's law).
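As a back-of-the-envelope illustration of why retries help, assume that failures of separate attempts are independent (a simplifying assumption; real failures are often correlated, so treat this as an upper bound rather than a guarantee):

```python
def availability_with_retries(dependency_availability: float, attempts: int) -> float:
    """Effective availability when each attempt fails independently.

    The call only fails overall if every attempt fails, so the combined
    failure probability is the per-attempt failure probability raised to
    the number of attempts.
    """
    failure_probability = 1.0 - dependency_availability
    return 1.0 - failure_probability ** attempts

# A 99.9% dependency retried once (two attempts total) behaves like a
# "six nines" dependency, under the independence assumption.
print(round(availability_with_retries(0.999, 2), 6))  # 0.999999
```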
Choose the correct SLOs
There is a natural tension between product development speed and operational stability. The more you change your system, the more likely it will break. Monitoring and observability tools are critical to operational stability as you increase feature velocity. Such tools are known as application performance management (APM) tools, and can also be used to set SLOs.
When defined correctly, an SLO helps teams make data-driven operational decisions that increase development velocity without sacrificing stability. The SLO can also align development and operations teams around a single agreed-upon objective. Sharing a single objective alleviates the natural tension mentioned previously: the development team's goal to create and iterate on products, and the operations team's goal to maintain system integrity.
Use this document and other reliability documents in the Architecture Framework to understand and develop SLOs. Once you have read and understood these articles, move to more detailed information about SLOs (and other SRE practices) in The SRE Book and The SRE Workbook.
Use strict internal SLOs
It's a good practice to have stricter internal SLOs than external SLAs. Because SLA violations tend to require a financial credit or customer refund, you want to address problems before they have a financial impact.
We recommend using these stricter internal SLOs with a blameless retrospective process and incident review. For more information, see Build a collaborative incident management process.
Iteratively improve SLOs
SLOs shouldn't be set in stone. Revisit your SLOs periodically (quarterly, or at least annually) and confirm that they accurately reflect user happiness and correlate with service outages. Ensure that they cover current business needs and any new critical user journeys. Revise and augment your SLOs as needed after these reviews.
Use error budgets to manage development velocity
Error budgets show whether your service is more or less reliable than needed for a specific time window. An error budget is calculated as 100% minus the SLO over a period of time, such as 30 days.
When you have capacity left in your error budget, you can continue to launch improvements or new features quickly. When the error budget is close to zero, slow down or freeze service changes and invest engineering resources to improve reliability features.
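For example, a 99.9% SLO over a 30-day window leaves an error budget of 0.1% of that window, about 43.2 minutes. The sketch below follows the "100% minus SLO" definition above; the function names are hypothetical, not from any SLO monitoring API:

```python
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day window

def error_budget_minutes(slo: float) -> float:
    """Allowed downtime in minutes for the window at the given SLO."""
    return (1.0 - slo) * WINDOW_MINUTES

def budget_remaining(slo: float, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (0.0 means frozen)."""
    budget = error_budget_minutes(slo)
    return max(0.0, 1.0 - downtime_minutes / budget)

print(round(error_budget_minutes(0.999), 1))   # 43.2 minutes per 30 days
print(round(budget_remaining(0.999, 21.6), 2)) # 0.5 -> half the budget is spent
```

When `budget_remaining` approaches zero, that's the signal to slow or freeze changes and invest in reliability instead.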
Google Cloud Observability includes SLO monitoring to minimize the effort of setting up SLOs and error budgets. The operations suite includes a graphical user interface to help you to configure SLOs manually, an API for programmatic setup of SLOs, and built-in dashboards to track the error budget burn rate. For more information, see Creating an SLO.
Summary of SLO recommendations
- Define and measure customer-centric SLIs, such as the availability or latency of the service.
- Define a customer-centric error budget that's stricter than your external SLA. Include consequences for violations like production freezes.
- Set latency SLIs at high percentiles, such as the 90th or 99th percentile, to capture outlier values and detect the slowest responses.
- Review SLOs at least annually and confirm that they correlate well with user happiness and service outages.
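The latency recommendation above can be illustrated with a small sketch. This nearest-rank percentile helper is hypothetical, not from any monitoring library; production systems typically compute percentiles from latency histograms or distribution metrics:

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# One slow outlier among mostly fast responses (latencies in milliseconds).
latencies_ms = [12, 15, 14, 13, 250, 16, 15, 14, 13, 12]
print(percentile(latencies_ms, 50))  # 14 - the median hides the outlier
print(percentile(latencies_ms, 99))  # 250 - the p99 surfaces it
```

This is why a mean or median latency SLI can look healthy while a meaningful fraction of users see very slow responses.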
What's next
- Read Choose your SLIs.
- Check out Coursera - SRE: Measuring and Managing Reliability.
- Learn more about SLOs in the SRE book chapter Service Level Objectives and the SRE workbook chapter Implementing SLOs.
- Explore recommendations in other pillars of the Architecture Framework.
- For more reference architectures, diagrams, and best practices, explore the Cloud Architecture Center.