Adopting SLOs

This document defines several service level objectives (SLOs) that are useful for different types of common service workloads. It is the second of two parts. Part 1, Defining SLOs, introduces SLOs, shows how SLOs are derived from service level indicators (SLIs), and describes what makes a good SLO.

The State of DevOps reports identified capabilities that drive software delivery performance. These two documents will help you develop several of those capabilities.

What to measure

Regardless of your domain, many services share common features and can use generic SLOs. The following discussion about generic SLOs is organized by service type and provides detailed explanations of SLIs that apply to each SLO.

Request-driven services

A request-driven service receives a request from a client (another service or a user), performs some computation, possibly sends network requests to a backend, and then returns a response to the client. Request-driven services are most often measured by availability and latency SLIs.

Availability as an SLI

The SLI for availability indicates whether the service is working. The SLI for availability is defined as follows:

The proportion of valid requests served successfully.

You first have to define valid. Some basic definitions might be "not zero-length" or "adheres to a client-server protocol," but it is up to a service owner to define what they mean by valid. A common method to gauge validity is to use an HTTP (or RPC) response code. For example, we often consider HTTP 500 errors to be server errors that count against an SLO, while 400 errors are client errors that do not.

After you decide what to measure, you need to examine every response code returned by your system to ensure that the application uses those codes properly and consistently. When using error codes for SLOs, it's important to ask whether a code is an accurate indicator of your users' experience of your service. For example, if a user attempts to order an item that is out of stock, does the site break and return an error message, or does the site suggest similar products? For use with SLOs, error codes need to be tied to users' expectations.

Developers can also misuse errors. In the case where a user asks for a product that is temporarily out of stock, a developer might mistakenly program an error to be returned. However, the system is actually functioning correctly and not in error, so the response needs to be returned as a success, even though the user could not purchase the item they wanted. Of course, the owners of this service need to know that a product is out of stock, but the inability to make a sale is not an error from the customer's perspective and should not count against an SLO. However, if the service cannot connect to the database to determine whether the item is in stock, that is an error that counts against your error budget.
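
As a rough illustration, the following minimal Python sketch computes an availability SLI from HTTP response codes. It assumes the simple classification described above, where 4xx responses are excluded as invalid client errors and 5xx responses count as bad events; adapt the rules to your own definition of valid and successful requests.

def availability_sli(response_codes):
    """Proportion of valid requests served successfully.

    Assumption: 4xx responses are client errors and are excluded as
    invalid; 5xx responses are server errors that count against the SLO.
    """
    valid = [code for code in response_codes if not 400 <= code <= 499]
    if not valid:
        return None  # no valid requests in this window
    good = sum(1 for code in valid if code < 500)
    return good / len(valid)

# Example: 8 successes, 1 client error (excluded), 1 server error.
codes = [200, 200, 204, 200, 404, 200, 500, 200, 200, 200]
print(availability_sli(codes))  # 8 good out of 9 valid -> ~0.889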

Your service might be more complex. For example, perhaps your service handles asynchronous requests or provides a long-running process for customers. In these cases, you might expose availability in another way. However, we recommend that you still represent availability as the proportion of valid requests that are successful. You might define availability as the number of minutes that a customer's workload is running as requested. (This approach is sometimes referred to as the "good minutes" method of measuring availability.) In the case of a virtual machine, you could measure availability in terms of the proportion of minutes after an initial request for a VM that the VM is accessible through SSH.

Latency as an SLI

The SLI for latency (sometimes called speed) indicates whether the service is fast enough. The SLI for latency is defined similarly to availability:

The proportion of valid requests served faster than a threshold.

You can measure latency by calculating the difference between when a timer starts and when it stops for a given request type. The key is a user's perception of latency. A common pitfall is to be too precise in measuring latency. In reality, users cannot distinguish between a 100-millisecond (ms) and a 300-ms refresh and might accept any point between 300 ms and 1000 ms.

Instead, it's a good idea to develop activity-centric metrics that keep the user in focus. For example, you might set thresholds like the following:

  • Interactive: 1000 ms for the time that a user waits for a result after clicking an element.
  • Write: 1500 ms for changing an underlying distributed system. While this length of time is considered slow for a system, users tend to accept it. We recommend that you explicitly distinguish between writes and reads in your metrics.
  • Background: 5000 ms for an action that is not user-visible, like a periodic refresh of data or other asynchronous requests.

Latency is commonly measured as a distribution (see Choosing an SLI in Part 1 of this series). Given a distribution, you can measure various percentiles. For example, you might measure the number of requests that are slower than the historical 99th percentile. In this case, we consider good events to be events that are faster than this threshold, which was set by examining the historical distribution. You can also set this threshold based on product requirements. You can even set multiple latency SLOs, for example typical latency versus tail latency.

We recommend that you do not use only the average (or median) latency as your SLI. Discovering that the median latency is too slow means that half your users are already unhappy. In other words, you can have bad latency for days before you discover a real threat to your long-term error budget. Therefore, we recommend that you define your SLO for tail latency (95th percentile) and for median latency (50th percentile).

In the ACM article Metrics That Matter, Benjamin Treynor Sloss writes the following:

"A good practical rule of thumb ... is that the 99th-percentile latency should be no more than three to five times the median latency."

Treynor Sloss continues:

"We find the 50th-, 95th-, and 99th-percentile latency measures for a service are each individually valuable, and we will ideally set SLOs around each of them."

A good model to follow is to determine your latency thresholds based on historical percentiles, then measure how many requests fall into each bucket. For more details, see the section on latency alerts later in this document.
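
A minimal sketch of that model, assuming you have raw per-request latencies in milliseconds; the sample data and percentile choices here are illustrative.

import statistics

def latency_thresholds(historical_ms, percentiles=(50, 95, 99)):
    """Derive latency thresholds from the historical distribution."""
    # quantiles(n=100) returns the 1st through 99th percentile cut points.
    cuts = statistics.quantiles(historical_ms, n=100)
    return {p: cuts[p - 1] for p in percentiles}

def latency_sli(latencies_ms, threshold_ms):
    """Proportion of requests served faster than the threshold."""
    return sum(1 for ms in latencies_ms if ms < threshold_ms) / len(latencies_ms)

historical = [120, 180, 200, 250, 300, 320, 400, 800, 950, 2500]
thresholds = latency_thresholds(historical)
current = [100, 220, 340, 280, 1200]
for percentile, threshold in thresholds.items():
    print(f"p{percentile} threshold {threshold:.0f} ms: "
          f"{latency_sli(current, threshold):.0%} of requests are faster")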

Quality as an SLI

Quality is a helpful SLI for complex services that are designed to fail gracefully by degrading when dependencies are slow or unavailable. The SLI for quality is defined as follows:

The proportion of valid requests served without degradation of service.

For example, a web page might load its main content from one datastore and load ancillary, optional assets from 100 other services and datastores. If one optional service is out of service or too slow, the page can still be rendered without the ancillary elements. By measuring the number of requests that are served a degraded response (that is, a response missing at least one backend service's response), you can report the ratio of requests that were bad. You might even track how many responses to the user were missing a response from a single backend, or were missing responses from multiple backends.
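
As an illustrative sketch, suppose each response records which optional backends failed to contribute; the quality SLI is then the share of responses that are missing nothing. The response structure here is an assumption.

def quality_sli(responses):
    """Proportion of valid requests served without degradation.

    Assumption: each response is a dict with a 'missing_backends' list
    naming the optional backends that did not contribute.
    """
    undegraded = sum(1 for r in responses if not r["missing_backends"])
    return undegraded / len(responses)

responses = [
    {"missing_backends": []},
    {"missing_backends": ["ads"]},          # degraded: one backend missing
    {"missing_backends": []},
    {"missing_backends": ["ads", "recs"]},  # degraded: two backends missing
]
print(quality_sli(responses))  # 2 of 4 responses undegraded -> 0.5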

Data processing services

Some services are not built to respond to user requests but instead consume data from an input, process that data, and generate an output. How these services perform at intermediate steps is not as important as the final result. With services like these, your strongest SLIs are freshness, coverage, correctness, and throughput, not latency and availability.

Freshness as an SLI

The SLI for freshness is defined as follows:

The proportion of valid data updated more recently than a threshold.

In batch processing systems, for example, freshness can be measured as the time elapsed since a processing run completed successfully for a given output. In more complex or real-time processing systems, you might track the age of the most-recent record processed in a pipeline.

For example, consider an online game that generates map tiles in real time. Users might not notice how quickly map tiles are created, but they might notice when map data is missing or is not fresh.

Or, consider a system that reads records from an in-stock tracking system to generate the message "X items in stock" for an ecommerce website. You might define the SLI for freshness as follows:

The percentage of views that used stock information that was refreshed within the last minute.

You can also use a metric for serving non-fresh data to inform the SLI for quality.
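
A minimal sketch of that SLI, assuming each page view records the age, in seconds, of the stock data it used.

def freshness_sli(view_data_ages_seconds, threshold_seconds=60):
    """Proportion of views that used data refreshed within the threshold."""
    fresh = sum(1 for age in view_data_ages_seconds if age <= threshold_seconds)
    return fresh / len(view_data_ages_seconds)

# Age, in seconds, of the stock record behind each page view.
ages = [5, 20, 45, 75, 30, 120]
print(freshness_sli(ages))  # 4 of 6 views used data under a minute old -> ~0.67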

Coverage as an SLI

The SLI for coverage is defined as follows:

The proportion of valid data processed successfully.

To define coverage, you first determine whether to accept an input as valid or to skip it. For example, if an input record is corrupted or zero-length and cannot be processed, you might consider that record as invalid for measuring your system.

Next, you count the number of your valid records. You might do this step with a simple count() method or another method. This number is your total record count.

Finally, to generate your SLI for coverage, you count the number of records that processed successfully and compare that number against the total valid record count.
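
A minimal sketch of these three steps, assuming each record carries flags for validity and processing outcome.

def coverage_sli(records):
    """Proportion of valid records processed successfully.

    Assumption: each record is a dict with 'valid' and 'processed' flags;
    corrupted or zero-length inputs are marked invalid upstream.
    """
    valid = [r for r in records if r["valid"]]
    if not valid:
        return None  # nothing valid to process in this window
    processed = sum(1 for r in valid if r["processed"])
    return processed / len(valid)

records = [
    {"valid": True, "processed": True},
    {"valid": True, "processed": False},   # valid but skipped: hurts coverage
    {"valid": False, "processed": False},  # corrupted input: excluded
    {"valid": True, "processed": True},
]
print(coverage_sli(records))  # 2 of 3 valid records processed -> ~0.67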

Correctness as an SLI

The SLI for correctness is defined as follows:

The proportion of valid data that produced correct output.

In some cases, there are methods of determining the correctness of an output that can be used to validate the processing of the output. For example, a system that rotates or colorizes an image should never produce a zero-byte image, or an image with a length or width of zero. It is important to separate this validation logic from the processing logic itself.

One method of measuring a correctness SLI is to use known-good test input data, which is data that has a known correct output. The input data needs to be representative of user data. In other cases, it is possible that a mathematical or logical check might be made against the output, like in the preceding example of rotating an image. Another example might be a billing system that determines if a transaction is valid by checking whether the difference between the balance before the transaction and the balance after the transaction matches the value of the transaction itself.
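
As an illustrative sketch, the balance check from the billing example can be applied across a batch of transactions; the record fields are assumptions.

def transaction_is_correct(txn):
    """Balance invariant: balance_after - balance_before must equal amount."""
    return txn["balance_after"] - txn["balance_before"] == txn["amount"]

def correctness_sli(transactions):
    """Proportion of valid transactions that produced correct output."""
    correct = sum(1 for t in transactions if transaction_is_correct(t))
    return correct / len(transactions)

transactions = [
    {"balance_before": 100, "balance_after": 150, "amount": 50},
    {"balance_before": 150, "balance_after": 120, "amount": -30},
    {"balance_before": 120, "balance_after": 120, "amount": 10},  # incorrect output
]
print(correctness_sli(transactions))  # 2 of 3 correct -> ~0.67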

Throughput as an SLI

The SLI for throughput is defined as follows:

The proportion of time where the data processing rate was faster than a threshold.

In a data processing system, throughput is often more representative of user happiness than a single latency measurement for a given piece of work. For example, if the size of each input varies dramatically, comparing how long individual elements take to finish might not be meaningful, as long as the job as a whole progresses at an acceptable rate.

Bytes per second is a common way to measure the amount of work it takes to process data regardless of the size of a dataset. But any metric that roughly scales linearly with respect to the cost of processing can work.

It might be worthwhile to partition your data processing systems based upon expected throughput rates, or implement a quality of service system to ensure that high-priority inputs are handled and low-priority inputs are queued. Either way, measuring throughput as defined in this section can help you determine if your system is working as expected.
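
A minimal sketch of this SLI, assuming throughput is sampled as bytes processed per fixed interval (for example, once a minute); the threshold is illustrative.

def throughput_sli(bytes_per_interval, interval_seconds, threshold_bytes_per_second):
    """Proportion of time where the data processing rate met the threshold."""
    good_intervals = sum(
        1 for processed in bytes_per_interval
        if processed / interval_seconds >= threshold_bytes_per_second
    )
    return good_intervals / len(bytes_per_interval)

# Bytes processed in each one-minute sample.
samples = [720_000_000, 680_000_000, 150_000_000, 700_000_000]
print(throughput_sli(samples, interval_seconds=60,
                     threshold_bytes_per_second=10_000_000))
# 3 of 4 minutes were at or above 10 MB/s -> 0.75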

Scheduled execution services

For services that need to perform an action at a regular interval, such as Kubernetes cron jobs, you can measure skew and execution duration. The following is a sample scheduled Kubernetes cron job:

apiVersion: batch/v1  # CronJob is stable in batch/v1; batch/v1beta1 is deprecated
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "0 * * * *"  # run at minute zero of every hour
  # jobTemplate omitted for brevity

Skew as an SLI

As an SLI, skew is defined as follows:

The proportion of executions that start within an acceptable window of the expected start time.

Skew measures the time difference between when a job is scheduled to start and when it does start. For example, if the preceding Kubernetes cron job, which is set up to start at minute zero of every hour, starts at three minutes past the hour, then the skew is three minutes. When a job runs early, you have a negative skew.

You can measure skew as a distribution over time, with corresponding acceptable ranges that define good skew. To determine the SLI, compare the number of runs that fall within an acceptable range against the total number of runs.
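
A minimal sketch of that comparison, assuming you record the scheduled and actual start time of each run; the five-minute window is illustrative.

from datetime import datetime, timedelta

def skew_sli(runs, max_skew=timedelta(minutes=5)):
    """Proportion of executions that start within the acceptable window.

    Assumption: each run is a (scheduled_start, actual_start) pair; early
    starts produce negative skew, so the absolute value is compared.
    """
    good = sum(1 for scheduled, actual in runs if abs(actual - scheduled) <= max_skew)
    return good / len(runs)

runs = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 3)),   # 3 min late: good
    (datetime(2024, 1, 1, 11, 0), datetime(2024, 1, 1, 11, 12)),  # 12 min late: bad
    (datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 11, 58)),  # 2 min early: good
]
print(skew_sli(runs))  # 2 of 3 runs within a 5-minute window -> ~0.67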

Execution duration as an SLI

As an SLI, execution duration is defined as follows:

The proportion of executions that complete within the acceptable duration window.

Execution duration is the time a job takes to complete. For a given execution, a common failure mode is for actual duration to exceed scheduled duration.

One interesting case is how to apply this SLI to catch a never-ending job. Because these jobs don't finish, you need to record the time spent on a given job instead of waiting for a job to complete. This approach provides an accurate distribution of how long work takes to complete, even in worst-case scenarios.

As with skew, you can track execution duration as a distribution and define acceptable upper and lower bounds for good events.
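
A minimal sketch that also handles the never-ending case, assuming each job reports its start time and, once finished, its end time; jobs that are still running are measured against the current time instead of being ignored.

from datetime import datetime, timedelta

def duration_sli(jobs, max_duration=timedelta(minutes=30), now=None):
    """Proportion of executions that complete within the acceptable duration.

    Assumption: each job is a (start, end) pair where end is None for a job
    that is still running. Running jobs are measured against the current
    time, so a never-ending job eventually becomes a bad event.
    """
    now = now or datetime.now()
    good = sum(1 for start, end in jobs if (end or now) - start <= max_duration)
    return good / len(jobs)

now = datetime(2024, 1, 1, 13, 0)
jobs = [
    (datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 12, 20)),  # done in 20 min: good
    (datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 12, 50)),  # done in 50 min: bad
    (datetime(2024, 1, 1, 10, 0), None),                          # still running after 3 hours: bad
]
print(duration_sli(jobs, now=now))  # 1 of 3 within 30 minutes -> ~0.33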

Types of metrics for other systems

Many other workloads have their own metrics that you can use to generate SLIs and SLOs. Consider the following examples:

  • Storage systems: durability, throughput, time to first byte, blob availability
  • Media/video: client playback continuity, time to start playback, transcode graph execution completeness
  • Gaming: time to match active players, time to generate a map

How to measure

After you know what you're measuring, you can decide how to take the measurement. You can gather your SLIs in several ways.

Server-side logging

Method to generate SLIs

Processing server-side logs of requests or processed data.

Considerations

Advantages:

  • Existing logs can be reprocessed to backfill historical SLI records.
  • Cross-service session identifiers can reconstruct complex user journeys across multiple services.

Disadvantages:

  • Requests that do not arrive at the server are not recorded.
  • Requests that cause a server to crash might not be recorded.
  • Length of time to process logs can result in stale SLIs, which might be inadequate data for an operational response.
  • Writing code to process logs can be an error-prone, time-consuming task.

Application server metrics

Method to generate SLIs

Exporting SLI metrics from the code that serves requests from users or processes their data.

Considerations

Advantage:

  • Adding new metrics to code is typically fast and inexpensive.

Disadvantages:

  • Requests that do not arrive at application servers are not recorded.
  • Multi-service requests might be hard to track.

Frontend infrastructure metrics

Method to generate SLIs

Utilizing metrics from the load-balancing infrastructure (for example, Google Cloud's global Layer 7 load balancer).

Considerations

Advantages:

  • Metrics and historical data often already exist, thus reducing the engineering effort to get started.
  • Measurements are taken at the point nearest the customer yet still within the serving infrastructure.

Disadvantages:

  • Not viable for data processing SLIs.
  • Can only approximate multi-request user journeys.

Synthetic clients or data

Method to generate SLIs

Building a client that sends fabricated requests at regular intervals and validates the responses. For data processing pipelines, creating synthetic known-good input data and validating outputs.

Considerations

Advantages:

  • Measures all steps of a multi-request user journey.
  • Sending requests from outside your infrastructure captures more of the overall request path in the SLI.

Disadvantages:

  • Approximates user experience with synthetic requests, which might be misleading (both false positives and false negatives).
  • Covering all corner cases is hard and can devolve into integration testing.
  • High reliability targets require frequent probing for accurate measurement.
  • Probe traffic can drown out real traffic.

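For illustration, here is a minimal Python prober along these lines. The endpoint URL, probe interval, and validation rule are placeholders, and a real prober would export its results as metrics rather than print them.

import time
import urllib.request

PROBE_URL = "https://example.com/healthz"  # placeholder endpoint
INTERVAL_SECONDS = 60

def probe_once(url=PROBE_URL, timeout=10):
    """Send one synthetic request and return (is_good, latency_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            ok = response.status == 200  # validation rule: a 200 response is good
    except Exception:
        ok = False  # HTTP errors, network failures, and timeouts count as bad events
    return ok, time.monotonic() - start

def run_prober():
    while True:
        ok, latency = probe_once()
        print(f"good={ok} latency={latency:.3f}s")
        time.sleep(INTERVAL_SECONDS)

if __name__ == "__main__":
    run_prober()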

Client instrumentation

Method to generate SLIs

Adding observability features to the client that the user interacts with, and logging events back to your serving infrastructure that tracks SLIs.

Considerations

Advantages:

  • Provides the most accurate measure of user experience.
  • Can quantify reliability of third parties, for example, CDN or payments providers.

Disadvantages:

  • Client log ingestion and processing latency make these SLIs unsuitable for triggering an operational response.
  • SLI measurements will contain a number of highly variable factors potentially outside of direct control.
  • Building instrumentation into the client can involve lots of engineering work.

Choosing a measurement method

Ideally, choose a measurement method that most closely aligns with your customer's experience of your service and that demands the least effort on your part. To achieve this ideal, you might need to use a combination of the methods described earlier. Here is a suggested approach that you can implement over time, listed in order of increasing effort:

  1. Using application server exports and infrastructure metrics. Typically, you can access these metrics immediately, and they quickly provide value. Some APM tools include built-in SLO tooling.
  2. Using client instrumentation. Because legacy systems typically lack built-in, end-user client instrumentation, setting up instrumentation might require a significant investment. However, if you use an APM suite or frontend framework that provides client instrumentation, you can quickly gain insight into your customer's happiness.
  3. Using logs processing. If you cannot implement server exports or client instrumentation but logs exist, you might find logs processing to be your best value. Another approach is to combine exports and logs processing, using exports as an immediate source for some SLIs (such as immediate availability) and logs processing for long-term signals (such as slow-burn alerts discussed later in the section on alerting).
  4. Implementing synthetic testing. After you have a basic understanding of how your customers use your service, you can test your service level. For example, you can seed test accounts with known-good data and query for it. This testing can help highlight failure modes that aren't easily observed, such as in the case of low-traffic services.

Setting your objectives

One of the best ways to set objectives is to create a shared document that describes your SLOs and how you developed them. Your team can iterate on the document as it implements and iterates on the SLOs over time.

We recommend that business owners, product owners, and executives review this document. Those stakeholders can offer insights about service expectations and your product's reliability tradeoffs.

For your company's most important critical user journeys (CUJs), here is a template for developing an SLO:

  1. Choose an SLI specification (for example, availability or freshness).
  2. Define how to implement the SLI specification.
  3. Read through your plan to ensure that your CUJs are covered.
  4. Set SLOs based on past performance or business needs.

CUJs should not be constrained to a single service, nor to a single development team or organization. If your users depend on hundreds of microservices that operate at 99.5% yet nobody tracks end-to-end availability, your customer is likely not happy.

Suppose that you have a query that depends on five services that work in sequence: a load balancer, a frontend, a mixer, a backend, and a database.

If each component has a 99.5% availability, the worst-case user-facing availability is as follows:

99.5% * 99.5% * 99.5% * 99.5% * 99.5% = 97.52%

This is the worst-case user-facing availability because the overall system fails if any one of the five services fails. This worst case holds only if every layer of the stack must always be immediately available to handle each user request, with no resilience factors such as intermediate retries, caches, or queues. A system with such tight coupling between services is a poor design that defies the microservices model.

Simply measuring performance against the SLO of a distributed system in this piecemeal manner (service by service) doesn't accurately reflect your customer's experience and might result in an overly sensitive interpretation.

Instead, you should measure performance against the SLO at the frontend to understand what users experience. The user does not care if a component service fails, causing a query to be automatically and successfully retried, if the user's query still succeeds. If you have shared internal services, these services can separately measure performance against their SLOs, with the user-facing services acting as their customers. You should handle these SLOs separately from each other.

It is possible to build a highly available service (for example, 99.99%) on top of a less-available service (for example, 99.9%) by using resilience factors such as smart retries, caching, and queueing.

As a general rule, anyone with a working knowledge of statistics should be able to read and understand your SLO without understanding your underlying service or organizational layout.

Example SLO worksheet

When you develop your SLO, remember to do the following:

  • Make sure that your SLIs specify an event, a success criterion, and where and how you record success or failure.
  • Define the SLI specification in terms of the proportion of events that are good.
  • Make sure that your SLO specifies both a target level and a measurement window.
  • Describe the advantages and disadvantages of your approach so that interested parties understand the tradeoffs and subtleties involved.

For example, consider the following SLO worksheet.

CUJ: Home page load

SLI type: Latency

SLI specification: Proportion of home page requests served in less than 100 ms

SLI implementations:

  • Proportion of home page requests served in less than 100 ms as measured from the latency column of the server log. (Disadvantage: This measurement misses requests that fail to reach the backend.)
  • Proportion of home page requests served in less than 100 ms as measured by probers that execute JavaScript in a browser running in a virtual machine. (Advantages and disadvantages: This measurement catches errors when requests cannot reach the network but might miss issues affecting only a subset of users.)

SLO: 99% of home page requests in the past 28 days served in less than 100 ms

SLOs and alerts

A mistaken approach to introducing a new observability system like SLOs is to use the system to completely replace an old system. Rather, you should see SLOs as a complementary system. For example, instead of deleting your existing alerts, we recommend that you run them in parallel with the SLO alerts introduced here. This approach lets you discover which legacy alerts are predictive of SLO alerts, which alerts fire in parallel with your SLO alerts, and which alerts never fire.

A tenet of SRE is to alert based on symptoms, not on causes. SLOs are, by their very nature, measurements of symptoms. As you adopt SLO alerts, you might find that the symptom alert fires alongside other alerts. If you discover that your legacy, cause-based alerts fire without any corresponding SLO or symptom alerts, those alerts are good candidates to be turned off entirely, turned into ticketing alerts, or simply logged for later reference.

For more information on this topic, see SRE Workbook, Chapter 5.

SLO burn rate

An SLO's burn rate is a measurement of how quickly an outage exposes users to errors and depletes the error budget. By measuring your burn rate, you can determine the time until a service violates its SLO. Alerting based on the SLO burn rate is a valuable approach. Remember that your SLO is based on a duration, which might be quite long (weeks or even months). However, the goal is to quickly detect a condition that results in an SLO violation before that violation actually occurs.

The following table shows the time it takes to exceed an objective if 100% of requests are failing for the given interval, assuming queries per second (QPS) is constant. For example, if you have a 99.9% SLO measured over 30 days, you can withstand 43.2 minutes of full downtime during those 30 days. That downtime can occur all at once or be spread over several incidents.

Objective 90 days 30 days 7 days 1 day
90% 9 days 3 days 16.8 hours 2.4 hours
99% 21.6 hours 7.2 hours 1.7 hours 14.4 minutes
99.9% 2.2 hours 43.2 minutes 10.1 minutes 1.4 minutes
99.99% 13 minutes 4.3 minutes 1 minute 8.6 seconds
99.999% 1.3 minutes 25.9 seconds 6 seconds 0.9 seconds
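
The values in this table follow directly from the objective and the window: the allowable full-outage time is the window multiplied by the error budget, that is, by one minus the objective. A quick sketch that reproduces a row:

from datetime import timedelta

def allowable_downtime(objective, window):
    """Length of a 100% outage that exactly exhausts the error budget."""
    return window * (1 - objective)

# A 99.9% objective over 30 days allows about 43.2 minutes of full downtime.
print(allowable_downtime(0.999, timedelta(days=30)))  # 0:43:12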

In practice, you cannot afford any 100%-outage incidents if you want to achieve high-success percentages. However, many distributed systems can partially fail or degrade gracefully. Even in those partial-failure cases, you still want to know whether a human needs to step in, and SLO alerts give you a way to determine that.

When to alert

An important question is when to act based on your SLO burn rate. As a rule, if you will exhaust your error budget in 24 hours, the time to page someone to fix an issue is now.

Measuring the rate of failure isn't always straightforward. A series of small errors might look terrifying in the moment but turn out to be short-lived, with an inconsequential impact on your SLO. Conversely, a system that is slightly broken for a long time can accumulate errors that add up to an SLO violation.

Ideally, your team will react to these signals so that you spend almost all of your error budget (but not exceed it) for a given time period. If you spend too much, you violate your SLO. If you spend too little, you're not taking enough risk or possibly burning out your on-call team.

You need a way to determine when a system is broken enough that a human should intervene. The following sections discuss some approaches to that question.

Fast burns

One type of SLO burn is a fast burn, which consumes your error budget quickly and demands that you intervene to avoid an SLO violation.

Suppose your service operates normally at 1000 queries per second (QPS), and you want to maintain 99% availability as measured over a seven-day week. Your error budget is about 6 million allowable errors (out of about 600 million requests). If you have 24 hours before your error budget is exhausted, for example, that gives you a limit of about 70 errors per second, or 252,000 errors in one hour. These parameters are based on the general rule that pageable incidents should consume at least 1% of the quarterly error budget.

You can choose to detect this rate of errors before that one hour has elapsed. For example, after observing 15 minutes of a 70-error-per-second rate, you might decide to page the on-call engineer, as the following diagram shows.

[Diagram: an error rate of 70 errors per second sustained for 15 minutes triggers a page to the on-call engineer]

Ideally, the problem is solved before you expend one hour of your 24-hour budget. Choosing to detect this rate in a shorter window (for example, one minute) is likely to be too error-prone. If your target mean time to detect (MTTD) is shorter than 15 minutes, this number can be adjusted.
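
For illustration, the arithmetic behind this fast-burn example can be sketched as follows; the QPS, objective, window, and 15-minute detection period are taken from the scenario above.

QPS = 1000
WINDOW_SECONDS = 7 * 24 * 3600             # one-week SLO window
SLO = 0.99                                 # availability objective

total_requests = QPS * WINDOW_SECONDS      # ~604.8 million requests per week
error_budget = total_requests * (1 - SLO)  # ~6.05 million allowable errors

# If the whole budget must last at least 24 hours, this is the error rate
# that exhausts it in exactly one day.
fast_burn_rate = error_budget / (24 * 3600)          # ~70 errors per second

# Page once that rate has been sustained for 15 minutes.
ALERT_WINDOW_SECONDS = 15 * 60
errors_before_page = fast_burn_rate * ALERT_WINDOW_SECONDS  # ~63,000 errors

print(round(fast_burn_rate), round(errors_before_page))     # 70 63000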

Slow burns

Another type of burn rate is a slow burn. Suppose you introduce a bug that will exhaust your weekly error budget by day five or six, or your monthly budget by week two. What is the best response?

In this case, you might introduce a slow SLO burn alert that lets you know you're on course to consume your entire error budget before the end of the alerting window. Of course, that alert might return many false positives. For example, there might often be a condition where errors occur briefly but at a rate that would quickly consume your error budget. In these cases, the condition is a false positive because it lasts only a short time and does not threaten your error budget in the long term. Remember, the goal is not to eliminate all sources of error; it is to stay within the acceptable range to not exceed your error budget. You want to avoid alerting a human to intervene for events that are not legitimately threatening your error budget.

We recommend that you notify a ticket queue (as opposed to paging or emailing) for slow-burn events. Slow-burn events are not emergencies but do require human attention before the budget expires. These alerts should not be emails to a team list, which quickly become a nuisance to be ignored. Tickets should be trackable, assignable, and transferrable. Teams should develop reports for ticket load, closure rates, actionability, and duplicates. Excessive, unactionable tickets are a great example of toil.

Using SLO alerts skillfully can take time and depend on your team's culture and expectations. Remember that you can fine-tune your SLO alerts over time. You can also have multiple alert methods, with varying alert windows, depending on your needs.

Latency alerts

In addition to availability alerts, you can also have latency alerts. With latency SLOs, you're measuring the percent of requests that are not meeting a latency target. By using this model, you can utilize the same alerting model that you use to detect fast or slow burns of your error budget.

As noted earlier, if you rely only on a median latency SLO, fully half your requests can be out of SLO. In other words, your users can suffer bad latency for days before you detect the impact on your long-term error budget. Instead, services should define tail latency objectives and typical latency objectives. We suggest using the historical 90th percentile to define typical and the 99th percentile for tail. After you set these targets, you can define SLOs based on the number of requests you expect to land in each latency category and how many are too slow. This approach is the same concept as an error budget and should be treated the same. Thus, you might end up with a statement like "90% of requests will be handled within typical latency and 99.9% within tail latency targets." These targets ensure that most users experience your typical latency and still let you track how many requests are slower than your tail latency targets.

Some services might have widely varying expected runtimes. For example, you might have dramatically different performance expectations for reading from a datastore system versus writing to it. Instead of enumerating every possible expectation, you can introduce runtime performance buckets, as the following tables show. This approach presumes that these types of requests are identifiable and pre-categorized into each bucket. You should not expect to categorize requests on the fly.

User-facing website
Bucket Expected maximum runtime
Read 1 second
Write / update 3 seconds
Data processing systems
Bucket Expected maximum runtime
Small 10 seconds
Medium 1 minute
Large 5 minutes
Giant 1 hour
Enormous 8 hours

By measuring the system as it is today, you can understand how long these requests typically take to run. As an example, consider a system for processing video uploads. If the video is very long, the processing time should be expected to take longer. We can use the length of the video in seconds to categorize this work into a bucket, as the following table shows. The table records the number of requests per bucket as well as various percentiles for runtime distribution over the course of a week.

Video length Number of requests measured in one week 10% 90% 99.95%
Small 0 - - -
Medium 1.9 million 864 milliseconds 17 seconds 86 seconds
Large 25 million 1.8 seconds 52 seconds 9.6 minutes
Giant 4.3 million 2 seconds 43 seconds 23.8 minutes
Enormous 81,000 36 seconds 1.2 minutes 41 minutes

From such analysis, you can derive a few parameters for alerting:

  • fast_typical: At most, 10% of requests are faster than this time. If too many requests are faster than this time, your targets might be wrong, or something about your system might have changed.
  • slow_typical: At least 90% of requests are faster than this time. This limit drives your main latency SLO. This parameter indicates whether most of the requests are fast enough.
  • slow_tail: At least 99.95% of requests are faster than this time. This limit ensures that there aren't too many slow requests.
  • deadline: The point at which a user RPC or background processing times out and fails (a limit typically already hard-coded into the system). These requests aren't counted as slow; they fail with an error and count against your availability SLO instead.

A guideline in defining buckets is to keep a bucket's fast_typical, slow_typical, and slow_tail within an order of magnitude of each other. This guideline ensures that you don't have too broad of a bucket. We recommend that you don't attempt to prevent overlap or gaps between the buckets.

Bucket fast_typical slow_typical slow_tail deadline
Small 100 milliseconds 1 second 10 seconds 30 seconds
Medium 600 milliseconds 6 seconds 60 seconds (1 minute) 300 seconds
Large 3 seconds 30 seconds 300 seconds (5 minutes) 10 minutes
Giant 30 seconds 6 minutes 60 minutes (1 hour) 3 hours
Enormous 5 minutes 50 minutes 500 minutes (8 hours) 12 hours

This results in a rule like api.method: SMALL => [1s, 10s]. In this case, the SLO tracking system would see a request, determine its bucket (perhaps by analyzing its method name or URI and comparing the name to a lookup table), then update the statistic based on the runtime of that request. If a SMALL request took 700 milliseconds, it is within the slow_typical target. If it took 3 seconds, it is within slow_tail. If it took 22 seconds, it is beyond slow_tail, but not yet an error.
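
For illustration, a minimal sketch of that lookup-and-classify step; the bucket table mirrors the Small and Medium rows above, and the method-to-bucket mapping is a hypothetical placeholder.

BUCKETS = {
    # bucket: (fast_typical, slow_typical, slow_tail, deadline) in seconds
    "SMALL":  (0.1, 1, 10, 30),
    "MEDIUM": (0.6, 6, 60, 300),
}

METHOD_TO_BUCKET = {"api.get_user": "SMALL"}  # placeholder lookup table

def classify(method, runtime_seconds):
    """Classify a request's runtime against its bucket's latency targets."""
    fast_typical, slow_typical, slow_tail, deadline = BUCKETS[METHOD_TO_BUCKET[method]]
    if runtime_seconds < fast_typical:
        return "faster than fast_typical (verify your targets)"
    if runtime_seconds <= slow_typical:
        return "within slow_typical"
    if runtime_seconds <= slow_tail:
        return "within slow_tail"
    if runtime_seconds <= deadline:
        return "beyond slow_tail, not yet an error"
    return "deadline exceeded (counts against availability)"

print(classify("api.get_user", 0.7))  # within slow_typical
print(classify("api.get_user", 3))    # within slow_tail
print(classify("api.get_user", 22))   # beyond slow_tail, not yet an error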

In terms of user happiness, you can think of missing tail latency as equivalent to being unavailable. (That is, the response is so slow that it should be considered a failure.) Due to this, we suggest using the same percentage that you use for availability, for example:

99.95% of all requests are satisfied within 10 seconds.

What you consider typical latency is up to you. Some teams within Google consider 90% to be a good target. This is related to your analysis and how you chose durations for slow_typical. For example:

90% of all requests are handled within 1 second.

Suggested alerts

Given these guidelines, the following is a suggested baseline set of SLO alerts:

  • Availability (fast burn), typical latency, and tail latency: 1-hour measurement window; less than 24 hours to SLO violation; action: page someone.
  • Availability (slow burn), typical latency (slow burn), and tail latency (slow burn): 7-day measurement window; greater than 24 hours to SLO violation; action: create a ticket.

SLO alerting is a skill that can take time to develop. The durations in this section are suggestions; you can adjust these according to your own needs and level of precision. Tying your alerts to the measurement window or error budget expenditure might be helpful, or you might add another layer of alerting between fast burns and slow burns.
