This document in the Google Cloud Architecture Framework defines the key concepts needed to understand and create service level objectives (SLOs).
At their core, SLOs reflect the reliability goals of the service you provide your users. It's important to include input from all critical stakeholders when defining these objectives. Many different groups and management levels have a deep interest in your service. These include business owners, product owners, executives, engineers, support staff, operations, sales, and any other teams associated with your service.
There are as many ways to obtain stakeholder input as there are different reliability objectives to choose. How you ultimately choose your objectives is up to you and your organization based on requirements, stakeholders, and other factors. While this process is out of scope for this guide, a simple approach is to create a shared document that describes your SLOs and how you developed them. Your team can iterate on the document as it implements and continues to improve the SLOs over time.
The following sections define the various components of SLOs.
Service level
A service level is a measurement of how well a service performs its expected work for the user. This metric can be described in terms of user happiness and measured by various methods that depend on the unique characteristics of the service, its user base, and user expectations. In this guide, we associate performance with the system's reliability.
Example service level: Our users expect the service to be available and fast.
Service level indicator
A service level indicator (SLI) is a gauge of user happiness that can be measured quantitatively. An indicator is similar to a line on a graph that changes over time as the service improves or degrades. To evaluate a service level, choose an indicator that represents some aspect of user happiness. Availability is a common SLI.
Example SLI: The number of successful requests in the last 10 minutes divided by the number of all valid requests in the same timeframe.
The SLI in the example is specific and well-defined, and expressed as a numerical value. That value reflects how available the service is. By consistently tracking this SLI over time, a team can determine the overall availability of its service.
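The ratio in the example SLI can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name `availability_sli`, the request counts, and the choice to report an empty window as fully available are all assumptions made for the example.

```python
def availability_sli(successful_requests: int, valid_requests: int) -> float:
    """Return the fraction of valid requests that succeeded in one window."""
    if successful_requests < 0 or valid_requests < 0:
        raise ValueError("request counts must be non-negative")
    if successful_requests > valid_requests:
        raise ValueError("successful requests cannot exceed valid requests")
    if valid_requests == 0:
        # Assumption: a window with no valid traffic counts as fully available.
        return 1.0
    return successful_requests / valid_requests


# Hypothetical counts for a single 10-minute window.
window_sli = availability_sli(successful_requests=9_985, valid_requests=10_000)
print(f"Availability SLI: {window_sli:.2%}")
```

Recomputing this value for each window and plotting it over time gives the line on a graph that the SLI definition describes.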
For more information about choosing your SLIs, see Choose your SLIs.
Service level objective
The service level objective (SLO) is the target range that you expect the service to achieve as measured by the SLI. The following example uses response time, or speed of the service, as the SLI.
Example SLO: Service response is faster than 400 milliseconds (ms) for 95% of all valid requests measured over 14 days.
In the example SLO, the SLI is the number of requests faster than 400 ms divided by the number of valid requests. This percentage is tracked over 14 days. The objective is for 95% of all valid requests to meet that threshold. That is, if the end result (the percentage of requests that meet the criteria) is at least 95%, you've met your SLO for the service.
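As a rough sketch, the check for the example SLO could look like the following Python. The function names, the sample latencies, and the decision to treat a result of exactly 95% as meeting the objective are assumptions for illustration. A real measurement would cover every valid request in the 14-day window rather than a short list.

```python
from collections.abc import Sequence


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float = 400.0) -> float:
    """Return the fraction of valid requests answered faster than the threshold."""
    if not latencies_ms:
        raise ValueError("need at least one valid request to compute the SLI")
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms)


def slo_met(sli: float, target: float = 0.95) -> bool:
    """Return True when the measured SLI reaches the target."""
    # Assumption: hitting the target exactly counts as meeting the SLO.
    return sli >= target


# Hypothetical response times in milliseconds; one of 20 is slower than 400 ms.
sample_latencies_ms = [120, 85, 390, 210, 150, 95, 300, 275, 60, 180,
                       230, 340, 110, 399, 410, 70, 200, 160, 250, 130]

sli = latency_sli(sample_latencies_ms)
print(f"SLI: {sli:.1%}, SLO met: {slo_met(sli)}")
```

Keeping the SLI calculation separate from the SLO comparison mirrors the distinction in the text: the SLI is the measurement, and the SLO is the target applied to it.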
To recap, the SLI is some measurement (such as speed, availability, or success) of your service. The SLO is the expectation that a specific share of those measurements (the percentage) meets or exceeds some predetermined level or range. Anything below the expected level means that you've failed to provide your users with a reliable service in a specific area of performance.
For more information about choosing your SLOs, see Choose your SLOs.
Service level agreement
The service level agreement (SLA) is the contract between you, the service provider, and your customers. It lists the SLOs that customers are promised and will ultimately expect. The SLA also specifies what happens if an SLO is not met. A broken SLO might require the service provider to refund money or provide discounted services. For more critical services, it might result in legal action or punitive damages.
SLAs are not heavily discussed in this guide. They are mentioned to augment your understanding of SLOs, SLIs, and the user.
Error budget
The final value to understand when discussing SLOs is the percentage or number of negative events your service can withstand before violating the SLO. This number, called the error budget, defines the number of errors your business can expect and tolerate.
To demonstrate, use availability as the SLI (represented by a percentage). The number of "nines" in the percentage, typically three or more, indicates the precision to which you want to measure that SLI. In other words, the number of nines expresses the availability percentage.
Consider an SLO of three nines, or 99.9%. Subtracting the SLO value from 100% leaves a 0.1% error budget. When discussing availability, a 0.1% budget is slightly less than nine hours a year (0.1% of 8,760 hours is about 8.76 hours) during which the service is unavailable. Adding another nine drastically reduces the error budget. An availability of 99.99% (four nines) allows less than an hour of service downtime a year (about 53 minutes).
That downtime includes requests that fail, server downtime by fault (crashes or software bugs) or by design (upgrades or testing), human error, accidents, and many other causes.
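The error budget arithmetic for availability can be reproduced with a short Python sketch. The function names are hypothetical, and the calculation assumes a 365-day year (8,760 hours) and ignores leap years.

```python
HOURS_PER_YEAR = 365 * 24  # Assumption: non-leap year, 8,760 hours.


def error_budget_percent(slo_percent: float) -> float:
    """Return the error budget as a percentage: 100% minus the SLO."""
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("SLO must be a percentage between 0 and 100")
    return 100.0 - slo_percent


def allowed_downtime_hours(slo_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of unavailability the budget allows over the period."""
    return period_hours * error_budget_percent(slo_percent) / 100.0


for slo in (99.9, 99.99):
    hours = allowed_downtime_hours(slo)
    print(f"SLO {slo}%: about {hours:.2f} hours "
          f"({hours * 60:.1f} minutes) of downtime per year")
```

Passing a different `period_hours` value, such as `30 * 24` for a 30-day window, gives the budget for a shorter measurement period.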
What's next
- Read Choose your SLOs.
- Explore recommendations in other pillars of the Architecture Framework.
- For more reference architectures, diagrams, and best practices, explore the Cloud Architecture Center.