Adopt SLOs

This document defines several service level objectives (SLOs) that are useful for different types of common service workloads. This document is the second of two parts. Part 1, Define SLOs, introduces SLOs, shows how they are derived from service level indicators (SLIs), and describes what makes a good SLO.

The State of DevOps reports identified capabilities that drive software delivery performance. These two documents help you put those capabilities into practice.

What to measure

Regardless of your domain, many services share common features and can use generic SLOs. The following discussion about generic SLOs is organized by service type and provides detailed explanations of SLIs that apply to each SLO.

Request-driven services

A request-driven service receives a request from a client (another service or a user), performs some computation, possibly sends network requests to a backend, and then returns a response to the client. Request-driven services are most often measured by availability and latency SLIs.

Availability as an SLI

The SLI for availability indicates whether the service is working. The SLI for availability is defined as follows:

The proportion of valid requests served successfully.

You first have to define valid. Some basic definitions might be "not zero-length" or "adheres to a client-server protocol," but it is up to a service owner to define what they mean by valid. A common method to gauge validity is to use an HTTP (or RPC) response code. For example, we often consider HTTP 500 errors to be server errors that count against an SLO, while 400 errors are client errors that do not.

After you decide what to measure, you need to examine every response code returned by your system to ensure that the application uses those codes properly and consistently. When using error codes for SLOs, it's important to ask whether a code is an accurate indicator of your users' experience of your service. For example, if a user attempts to order an item that is out of stock, does the site break and return an error message, or does the site suggest similar products? For use with SLOs, error codes need to be tied to users' expectations.

Developers can also misuse error codes. In the case where a user asks for a product that is temporarily out of stock, a developer might mistakenly program an error to be returned. However, the system is functioning correctly, so the response needs to be returned as a success, even though the user could not purchase the item they wanted. Of course, the owners of this service need to know that a product is out of stock, but the inability to make a sale is not an error from the customer's perspective and should not count against an SLO. However, if the service cannot connect to the database to determine whether the item is in stock, that is an error that counts against your error budget.
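To make the distinction concrete, here is a minimal sketch in Python (using hypothetical request records rather than a real logging format) that computes an availability SLI from response codes, counting 5xx responses as bad events and 4xx responses as valid but not bad:

# Availability SLI sketch: the proportion of valid requests served successfully.
# The request records below are hypothetical sample data.
requests = [
    {"status": 200},
    {"status": 404},  # client error: valid, but does not count against the SLO
    {"status": 503},  # server error: counts against the SLO
    {"status": 200},
]

valid = requests                                 # here, every request is treated as valid
good = [r for r in valid if r["status"] < 500]   # 5xx responses are the bad events

availability = len(good) / len(valid)
print(f"Availability SLI: {availability:.2%}")   # 75.00% for this sample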

Your service might be more complex. For example, perhaps your service handles asynchronous requests or provides a long-running process for customers. In these cases, you might expose availability in another way. However, we recommend that you still represent availability as the proportion of valid requests that are successful. You might define availability as the number of minutes that a customer's workload is running as requested. (This approach is sometimes referred to as the "good minutes" method of measuring availability.) In the case of a virtual machine, you could measure availability in terms of the proportion of minutes after an initial request for a VM that the VM is accessible through SSH.
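A minimal sketch of that "good minutes" approach, assuming a hypothetical per-minute probe that records whether the VM answered SSH during that minute:

# "Good minutes" availability sketch: the proportion of minutes after the
# initial VM request during which the VM was reachable over SSH.
ssh_reachable_by_minute = [True, True, False, True, True, True]  # one hypothetical probe result per minute

good_minutes = sum(ssh_reachable_by_minute)
availability = good_minutes / len(ssh_reachable_by_minute)
print(f"Good-minutes availability: {availability:.2%}")  # 83.33% for this sample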

Latency as an SLI

The SLI for latency (sometimes called speed) indicates whether the service is fast enough. The SLI for latency is defined similarly to availability:

The proportion of valid requests served faster than a threshold.

You can measure latency by calculating the difference between when a timer starts and when it stops for a given request type. The key is a user's perception of latency. A common pitfall is to be too precise in measuring latency. In reality, users cannot distinguish between a 100-millisecond (ms) and a 300-ms refresh and might accept any point between 300 ms and 1000 ms.

Instead, it's a good idea to develop activity-centric metrics that keep the user in focus, for example, in the following processes:

  • Interactive: 1000 ms for the time that a user waits for a result after clicking an element.
  • Write: 1500 ms for changing an underlying distributed system. While this length of time is considered slow for a system, users tend to accept it. We recommend that you explicitly distinguish between writes and reads in your metrics.
  • Background: 5000 ms for an action that is not user-visible, like a periodic refresh of data or other asynchronous requests.

Latency is commonly measured as a distribution (see Choosing an SLI in Part 1 of this series). Given a distribution, you can measure various percentiles. For example, you might measure the number of requests that are slower than the historical 99th percentile. In this case, we consider good events to be events that are faster than this threshold, which was set by examining the historical distribution. You can also set this threshold based on product requirements. You can even set multiple latency SLOs, for example, typical latency versus tail latency.

We recommend that you do not use only the average (or median) latency as your SLI. Discovering that the median latency is too slow means that half your users are already unhappy. In other words, you can have bad latency for days before you discover a real threat to your long-term error budget. Therefore, we recommend that you define your SLO for tail latency (95th percentile) and for median latency (50th percentile).

In the ACM article Metrics That Matter, Benjamin Treynor Sloss writes the following:

"A good practical rule of thumb ... is that the 99th-percentile latency should be no more than three to five times the median latency."

Treynor Sloss continues:

"We find the 50th-, 95th-, and 99th-percentile latency measures for a service are each individually valuable, and we will ideally set SLOs around each of them."

A good model to follow is to determine your latency thresholds based on historical percentiles, then measure how many requests fall into each bucket. For more details, see the section on latency alerts later in this document.
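As a sketch of that model, the following Python snippet (with hypothetical latency samples in milliseconds) derives thresholds from historical percentiles and then measures what proportion of current requests beat each one:

import statistics

# Hypothetical historical latency distribution, in milliseconds.
historical_ms = [120, 135, 150, 160, 180, 200, 240, 300, 450, 900, 1500]

# Derive thresholds from the historical 50th, 95th, and 99th percentiles.
cuts = statistics.quantiles(historical_ms, n=100)
thresholds = {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Hypothetical latencies observed in the current measurement window.
current_ms = [110, 140, 170, 210, 260, 320, 500, 1600]

# For each bucket, the SLI is the proportion of requests faster than the threshold.
for name, threshold in thresholds.items():
    good = sum(1 for latency in current_ms if latency < threshold)
    print(f"{name}: {good / len(current_ms):.0%} of requests faster than {threshold:.0f} ms")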

Quality as an SLI

Quality is a helpful SLI for complex services that are designed to fail gracefully by degrading when dependencies are slow or unavailable. The SLI for quality is defined as follows:

The proportion of valid requests served without degradation of service.

For example, a web page might load its main content from one datastore and load ancillary, optional assets from 100 other services and datastores. If one optional service is out of service or too slow, the page can still be rendered without the ancillary elements. By measuring the number of requests that are served a degraded response (that is, a response missing at least one backend service's response), you can report the ratio of requests that were bad. You might even track how many responses to the user were missing a response from a single backend, or were missing responses from multiple backends.
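A minimal sketch of such a quality SLI, assuming each response record lists which (hypothetical) backends failed to contribute to it:

from collections import Counter

# Quality SLI sketch: the proportion of valid requests served without degradation,
# where "degraded" means at least one backend response was missing.
# The response records below are hypothetical sample data.
responses = [
    {"missing_backends": []},
    {"missing_backends": ["ads"]},             # degraded: one backend missing
    {"missing_backends": []},
    {"missing_backends": ["ads", "reviews"]},  # degraded: two backends missing
]

undegraded = [r for r in responses if not r["missing_backends"]]
quality = len(undegraded) / len(responses)
print(f"Quality SLI: {quality:.0%} of responses served without degradation")

# Optionally, track how many responses were missing one backend versus several.
print(Counter(len(r["missing_backends"]) for r in responses))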

Data processing services

Some services are not built to respond to user requests but instead consume data from an input, process that data, and generate an output. How these services perform at intermediate steps is not as important as the final result. With services like these, your strongest SLIs are freshness, coverage, correctness, and throughput, not latency and availability.

Freshness as an SLI

The SLI for freshness is defined as follows:

The proportion of valid data updated more recently than a threshold.

In batch processing systems, for example, freshness can be measured as the time elapsed since a processing run completed successfully for a given output. In more complex or real-time processing systems, you might track the age of the most-recent record processed in a pipeline.

For example, consider an online game that generates map tiles in real time. Users might not notice how quickly map tiles are created, but they might notice when map data is missing or is not fresh.

Or, consider a system that reads records from an in-stock tracking system to generate the message "X items in stock" for an ecommerce website. You might define the SLI for freshness as follows:

The percentage of views that used stock information that was refreshed within the last minute.

You can also use a metric for serving non-fresh data to inform the SLI for quality.
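A minimal sketch of that stock-information example, assuming each page view records a hypothetical timestamp for the stock data it used:

import datetime

# Freshness SLI sketch: the percentage of views that used stock information
# refreshed within the last minute. All timestamps below are hypothetical.
now = datetime.datetime(2024, 1, 1, 12, 0, 0)
freshness_threshold = datetime.timedelta(minutes=1)

views = [
    {"stock_refreshed_at": datetime.datetime(2024, 1, 1, 11, 59, 30)},  # fresh
    {"stock_refreshed_at": datetime.datetime(2024, 1, 1, 11, 58, 0)},   # stale
    {"stock_refreshed_at": datetime.datetime(2024, 1, 1, 11, 59, 50)},  # fresh
]

fresh = [v for v in views if now - v["stock_refreshed_at"] <= freshness_threshold]
freshness_sli = len(fresh) / len(views)
print(f"Freshness SLI: {freshness_sli:.0%} of views used data refreshed within 1 minute")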

Coverage as an SLI

The SLI for coverage is defined as follows:

The proportion of valid data processed successfully.

To define coverage, you first determine whether to accept an input as valid or to skip it. For example, if an input record is corrupted or zero-length and cannot be processed, you might consider that record as invalid for measuring your system.

Next, you count the number of your valid records. You might do this step with a simple count() method or another method. This number is your total record count.

Finally, to generate your SLI for coverage, you count the number of records that processed successfully and compare that number against the total valid record count.
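Putting those steps together, here is a minimal sketch with a hypothetical record format:

# Coverage SLI sketch: the proportion of valid records processed successfully.
# The records below are hypothetical sample data.
records = [
    {"payload": "a", "processed_ok": True},
    {"payload": "",  "processed_ok": False},  # zero-length: treated as invalid and skipped
    {"payload": "b", "processed_ok": True},
    {"payload": "c", "processed_ok": False},  # valid, but processing failed
]

# Step 1: decide which records are valid (here, skip zero-length payloads).
valid = [r for r in records if r["payload"]]

# Steps 2 and 3: count the valid records that processed successfully and compare.
processed = [r for r in valid if r["processed_ok"]]
coverage = len(processed) / len(valid)
print(f"Coverage SLI: {coverage:.0%}")  # 67% for this sample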

Correctness as an SLI

The SLI for correctness is defined as follows:

The proportion of valid data that produced correct output.

In some cases, there are methods of determining the correctness of an output that can be used to validate the processing of the output. For example, a system that rotates or colorizes an image should never produce a zero-byte image, or an image with a length or width of zero. It is important to separate this validation logic from the processing logic itself.

One method of measuring a correctness SLI is to use known-good test input data, which is data that has a known correct output. The input data needs to be representative of user data. In other cases, it is possible that a mathematical or logical check might be made against the output, like in the preceding example of rotating an image. Another example might be a billing system that determines if a transaction is valid by checking whether the difference between the balance before the transaction and the balance after the transaction matches the value of the transaction itself.
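A minimal sketch of the billing example, with a hypothetical transaction record format and the validation logic kept separate from any processing logic:

# Correctness SLI sketch: the proportion of valid transactions whose output
# passes a logical check. The transaction records below are hypothetical.
transactions = [
    {"balance_before": 100, "amount": -30, "balance_after": 70},
    {"balance_before": 70,  "amount": 50,  "balance_after": 120},
    {"balance_before": 120, "amount": -20, "balance_after": 90},  # incorrect output
]

def is_correct(txn):
    # Validation logic only: the balance change must match the transaction amount.
    return txn["balance_after"] - txn["balance_before"] == txn["amount"]

correct = [t for t in transactions if is_correct(t)]
correctness_sli = len(correct) / len(transactions)
print(f"Correctness SLI: {correctness_sli:.0%}")  # 67% for this sample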

Throughput as an SLI

The SLI for throughput is defined as follows:

The proportion of time where the data processing rate was faster than a threshold.

In a data processing system, throughput is often more representative of user happiness than, for example, a single latency measurement for a given piece of work. If the size of each input varies dramatically, it might not make sense to measure how long each element takes to finish, as long as the job as a whole progresses at an acceptable rate.

Bytes per second is a common way to measure the amount of work it takes to process data regardless of the size of a dataset. But any metric that roughly scales linearly with respect to the cost of processing can work.

It might be worthwhile to partition your data processing systems based upon expected throughput rates, or implement a quality of service system to ensure that high-priority inputs are handled and low-priority inputs are queued. Either way, measuring throughput as defined in this section can help you determine if your system is working as expected.
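A minimal sketch of such a throughput SLI, assuming you sample a hypothetical processing rate (bytes per second) once per time window:

# Throughput SLI sketch: the proportion of time windows in which the data
# processing rate exceeded a threshold. The sampled rates are hypothetical.
threshold_bytes_per_second = 10 * 1024 * 1024  # assumed target: 10 MiB/s

sampled_rates = [12e6, 15e6, 8e6, 11e6, 14e6, 9e6]  # one sample per window

good_windows = sum(1 for rate in sampled_rates if rate > threshold_bytes_per_second)
throughput_sli = good_windows / len(sampled_rates)
print(f"Throughput SLI: {throughput_sli:.0%} of windows above the threshold")  # 67%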

Scheduled execution services

For services that need to perform an action at a regular interval, such as Kubernetes cron jobs, you can measure skew and execution duration. The following is a sample scheduled Kubernetes cron job:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "0 * * * *"  # start at minute zero of every hour
  # jobTemplate (the pod spec that does the work) is omitted for brevity

Skew as an SLI

As an SLI, skew is defined as follows:

The proportion of executions that start within an acceptable window of the expected start time.

Skew measures the time difference between when a job is scheduled to start and when it does start. For example, if the preceding Kubernetes cron job, which is set up to start at minute zero of every hour, starts at three minutes past the hour, then the skew is three minutes. When a job runs early, you have a negative skew.

You can measure skew as a distribution over time, with corresponding acceptable ranges that define good skew. To determine the SLI, you compare the number of runs that fell within the acceptable range against the total number of runs.
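A minimal sketch, using hypothetical per-execution skew values (the difference in seconds between actual and scheduled start, negative when the job started early):

# Skew SLI sketch: the proportion of executions that started within an
# acceptable window of the expected start time. The values are hypothetical.
acceptable_skew_seconds = 120  # assumed "good" window: within two minutes

observed_skew_seconds = [15, 45, 180, -10, 60, 300]  # one entry per execution

good_runs = sum(1 for skew in observed_skew_seconds if abs(skew) <= acceptable_skew_seconds)
skew_sli = good_runs / len(observed_skew_seconds)
print(f"Skew SLI: {skew_sli:.0%} of executions started within the window")  # 67%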

Execution duration as an SLI

As an SLI, execution duration is defined as follows:

The proportion of executions that complete within the acceptable duration window.

Execution duration is the time a job takes to complete. For a given execution, a common failure mode is for actual duration to exceed scheduled duration.

One interesting case is how to apply this SLI to catch a never-ending job. Because these jobs don't finish, you need to record the time spent on a given job instead of waiting for a job to complete. This approach provides an accurate distribution of how long work takes to complete, even in worst-case scenarios.

As with skew, you can track execution duration as a distribution and define acceptable upper and lower bounds for good events.
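A minimal sketch, with hypothetical job records; for jobs that have not finished (including never-ending ones), it records the time spent so far rather than waiting for completion:

import time

# Execution duration SLI sketch: the proportion of executions that complete
# within an acceptable duration window. Job records below are hypothetical.
acceptable_duration_seconds = 600  # assumed upper bound: 10 minutes

now = time.time()
jobs = [
    {"started": now - 1200, "finished": now - 900},   # ran for 300 s: good
    {"started": now - 2000, "finished": now - 1100},  # ran for 900 s: bad
    {"started": now - 800,  "finished": None},        # still running after 800 s: bad
]

def duration_seconds(job):
    # For unfinished jobs, measure the time spent so far instead of waiting.
    end = job["finished"] if job["finished"] is not None else now
    return end - job["started"]

good = sum(1 for job in jobs if duration_seconds(job) <= acceptable_duration_seconds)
duration_sli = good / len(jobs)
print(f"Duration SLI: {duration_sli:.0%} of executions within {acceptable_duration_seconds} s")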

Types of metrics for other systems

Many other workloads have their own metrics that you can use to generate SLIs and SLOs. Consider the following examples:

  • Storage systems: durability, throughput, time to first byte, blob availability
  • Media/video: client playback continuity, time to start playback, transcode graph execution completeness
  • Gaming: time to match active players, time to generate a map

How to measure

After you know what you're measuring, you can decide how to take the measurement. You can gather your SLIs in several ways.

Server-side logging

Method to generate SLIs

Processing server-side logs of requests or processed data.

Considerations

Advantages:

  • Existing logs can be reprocessed to backfill historical SLI records.
  • Cross-service session identifiers can reconstruct complex user journeys across multiple services.

Disadvantages:

  • Requests that do not arrive at the server are not recorded.
  • Requests that cause a server to crash might not be recorded.
  • The time needed to process logs can result in stale SLIs, which might be inadequate for an operational response.
  • Writing code to process logs can be an error-prone, time-consuming task.


Application server metrics

Method to generate SLIs

Exporting SLI metrics from the code that serves requests from users or processes their data.

Considerations

Advantage:

  • Adding new metrics to code is typically fast and inexpensive.

Disadvantages:

  • Requests that do not arrive at application servers are not recorded.
  • Multi-service requests might be hard to track.

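As an illustration of this approach, here is a minimal sketch that exports good-event and valid-event counters from a request handler. It assumes the Python prometheus_client library and a hypothetical render_home_page function; any metrics library with counters would work the same way.

from prometheus_client import Counter, start_http_server

# Counters for SLI events: every valid request, and every request served successfully.
VALID_REQUESTS = Counter("home_requests_valid_total", "Valid home page requests")
GOOD_REQUESTS = Counter("home_requests_good_total", "Home page requests served successfully")

def render_home_page(request):
    # Hypothetical application logic; a real handler would build the page here.
    return "<html>home</html>"

def handle_request(request):
    VALID_REQUESTS.inc()
    response = render_home_page(request)
    GOOD_REQUESTS.inc()  # reached only if rendering succeeded
    return response

if __name__ == "__main__":
    start_http_server(9090)  # expose the counters on /metrics for scraping

A monitoring system can then compute the SLI as the ratio of the two counters' rates over the SLO window.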

Frontend infrastructure metrics

Method to generate SLIs

Using metrics from the load-balancing infrastructure (for example, Google Cloud's global Layer 7 load balancer).

Considerations

Advantages:

  • Metrics and historical data often already exist, thus reducing the engineering effort to get started.
  • Measurements are taken at the point nearest the customer yet still within the serving infrastructure.

Disadvantages:

  • Not viable for data processing SLIs.
  • Can only approximate multi-request user journeys.


Synthetic clients or data

Method to generate SLIs

Building a client that sends fabricated requests at regular intervals and validates the responses. For data processing pipelines, creating synthetic known-good input data and validating outputs.

Considerations

Advantages:

  • Measures all steps of a multi-request user journey.
  • Sending requests from outside your infrastructure captures more of the overall request path in the SLI.

Disadvantages:

  • Approximates user experience with synthetic requests, which might be misleading (both false positives and false negatives).
  • Covering all corner cases is hard and can devolve into integration testing.
  • High reliability targets require frequent probing for accurate measurement.
  • Probe traffic can drown out real traffic.

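As an illustration, here is a minimal synthetic-client sketch in Python; the target URL, probing interval, and health check are all hypothetical:

import time
import urllib.request

# Synthetic client (prober) sketch: send a fabricated request at a regular
# interval, validate the response, and keep a running availability ratio.
TARGET_URL = "https://example.com/healthz"  # hypothetical probe target
PROBE_INTERVAL_SECONDS = 60

def probe_once():
    # One synthetic request; treat any exception or non-200 response as a bad event.
    try:
        with urllib.request.urlopen(TARGET_URL, timeout=5) as response:
            return response.status == 200 and len(response.read()) > 0
    except Exception:
        return False

valid_probes = 0
good_probes = 0
while True:
    valid_probes += 1
    if probe_once():
        good_probes += 1
    print(f"Synthetic availability so far: {good_probes / valid_probes:.2%}")
    time.sleep(PROBE_INTERVAL_SECONDS)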

Client instrumentation

Method to generate SLIs

Adding observability features to the client that the user interacts with, and logging events back to your serving infrastructure that tracks SLIs.

Considerations

Advantages:

  • Provides the most accurate measure of user experience.
  • Can quantify reliability of third parties, for example, CDN or payments providers.

Disadvantages:

  • Client log ingestion and processing latency make these SLIs unsuitable for triggering an operational response.
  • SLI measurements contain a number of highly variable factors that are potentially outside of your direct control.
  • Building instrumentation into the client can involve lots of engineering work.


Choose a measurement method

Ideally, choose a measurement method that most closely aligns with your customer's experience of your service and that demands the least effort on your part. To achieve this ideal, you might need to use a combination of the methods described in the preceding sections. Here is a suggested approach that you can implement over time, listed in order of increasing effort:

  1. Using application server exports and infrastructure metrics. Typically, you can access these metrics immediately, and they quickly provide value. Some APM tools include built-in SLO tooling.
  2. Using client instrumentation. Because legacy systems typically lack built-in, end-user client instrumentation, setting up instrumentation might require a significant investment. However, if you use an APM suite or frontend framework that provides client instrumentation, you can quickly gain insight into your customer's happiness.
  3. Using logs processing. If you cannot implement server exports or client instrumentation but logs exist, you might find logs processing to be your best value. Another approach is to combine exports and logs processing, using exports as an immediate source for some SLIs (such as immediate availability) and logs processing for long-term signals (such as the slow-burn alerts discussed in the SLOs and alerts guide).
  4. Implementing synthetic testing. After you have a basic understanding of how your customers use your service, you can test your service level directly. For example, you can seed test accounts with known-good data and query for it. This testing can help highlight failure modes that aren't easily observed, such as in the case of low-traffic services.

Set your objectives

One of the best ways to set objectives is to create a shared document that describes your SLOs and how you developed them. Your team can iterate on the document as it implements and iterates on the SLOs over time.

We recommend that business owners, product owners, and executives review this document. Those stakeholders can offer insights about service expectations and your product's reliability tradeoffs.

For your company's most important critical user journeys (CUJs), here is a template for developing an SLO:

  1. Choose an SLI specification (for example, availability or freshness).
  2. Define how to implement the SLI specification.
  3. Read through your plan to ensure that your CUJs are covered.
  4. Set SLOs based on past performance or business needs.

CUJs should not be constrained to a single service, nor to a single development team or organization. If your users depend on hundreds of microservices that each operate at 99.5% availability, yet nobody tracks end-to-end availability, your customers are likely not happy.

Suppose that you have a query that depends on five services that work in sequence: a load balancer, a frontend, a mixer, a backend, and a database.

If each component has a 99.5% availability, the worst-case user-facing availability is as follows:

99.5% * 99.5% * 99.5% * 99.5% * 99.5% = 97.52%

This is the worst-case user-facing availability because the overall system fails if any one of the five services fails. This would only be true if all layers of the stack must always be immediately available to handle each user request, without any resilience factors such as intermediate retries, caches, or queues. A system with such tight coupling between services is a bad design and defies the microservices model.

Simply measuring performance against the SLO of a distributed system in this piecemeal manner (service by service) doesn't accurately reflect your customer's experience and might result in an overly sensitive interpretation.

Instead, you should measure performance against the SLO at the frontend to understand what users experience. The user does not care if a component service fails, causing a query to be automatically and successfully retried, if the user's query still succeeds. If you have shared internal services, these services can separately measure performance against their SLOs, with the user-facing services acting as their customers. You should handle these SLOs separately from each other.

It is possible to build a highly available service (for example, 99.99%) on top of a less-available service (for example, 99.9%) by using resilience factors such as smart retries, caching, and queueing.

As a general rule, anyone with a working knowledge of statistics should be able to read and understand your SLO without understanding your underlying service or organizational layout.

Example SLO worksheet

When you develop your SLO, remember to do the following:

  • Make sure that your SLIs specify an event, a success criterion, and where and how you record success or failure.
  • Define the SLI specification in terms of the proportion of events that are good.
  • Make sure that your SLO specifies both a target level and a measurement window.
  • Describe the advantages and disadvantages of your approach so that interested parties understand the tradeoffs and subtleties involved.

For example, consider the following SLO worksheet.

CUJ: Home page load

SLI type: Latency

SLI specification: Proportion of home page requests served in less than 100 ms

SLI implementations:

  • Proportion of home page requests served in less than 100 ms as measured from the latency column of the server log. (Disadvantage: This measurement misses requests that fail to reach the backend.)
  • Proportion of home page requests served in less than 100 ms as measured by probers that execute JavaScript in a browser running in a virtual machine. (Advantages and disadvantages: This measurement catches errors when requests cannot reach the network but might miss issues affecting only a subset of users.)

SLO: 99% of home page requests in the past 28 days served in less than 100 ms

What's next?