Measure your SLOs

Last reviewed 2024-03-29 UTC

This document in the Google Cloud Architecture Framework builds on the concepts defined in Components of service level objectives by exploring what to measure, and how to measure it, for common service workloads.

Decide what to measure

Regardless of your domain, many services share common features and can use generic SLOs. This section discusses generic SLOs for different service types and provides detailed explanations of the SLIs that apply to each SLO.

Each of the following subsections identifies a particular service type and provides a short description of that service. Then, listed under each service type are possible SLIs, a definition of the indicator, and other information related to the SLI.

Request-driven services

This service type receives a request from a client (a user or another service), performs some computation, possibly sends network requests to a backend, and then returns a response to the client.

Availability as an SLI

Availability is the proportion of valid requests that are served successfully. The following list covers information to consider when using availability as an SLI:

  • As a service owner, you decide what counts as a valid request. Common definitions include requests that are not zero-length or that adhere to a client-server protocol. One method to gauge validity is to review HTTP (or RPC) response codes. For example, HTTP 5xx codes are server-related errors that count against an SLO, while 4xx codes are client errors that don't count (see the sketch after this list).
  • Each response code returned by your service must be examined to ensure that the application uses those codes properly and consistently. Is the code an accurate indicator of your users' experience of the service? For example, how does an ecommerce site respond when a user attempts to order an item that is out of stock? Does the request fail and return an error message? Does the site suggest similar products? Error codes must be tied to user expectations.
  • Developers can inadvertently misuse errors. Using the out-of-stock scenario from the previous bullet, a developer might mistakenly return an error. However, the system is working properly and not in error. The code needs to return a success, even though the user couldn't purchase the item. While service owners should be notified about the low inventory, the inability to make a sale isn't an error from the customer's perspective and doesn't count against an SLO.
  • An example of a more complex service is one that handles asynchronous requests or provides a long-running process for customers. While you can expose availability in another way, we recommend representing availability as the proportion of valid requests that are successful. Availability can also be defined as the number of minutes a customer's workload is running (sometimes referred to as the good minutes approach).
  • Consider a service offering virtual machines (VMs). You could measure availability in terms of the number of minutes after an initial request that the VM is available to the user.
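
A minimal sketch of this SLI in Python, assuming each request record carries an HTTP status code; the field names and sample data are illustrative, not part of any particular monitoring API:

  # Availability SLI: proportion of valid requests that were served successfully.
  # Assumption: 4xx responses are invalid (client errors), 5xx responses count
  # against the SLO, everything else is a good response.

  def availability_sli(requests):
      valid = [r for r in requests if not 400 <= r["status"] < 500]
      if not valid:
          return 1.0  # no valid traffic in the window; treat as fully available
      good = [r for r in valid if r["status"] < 500]
      return len(good) / len(valid)

  requests = [{"status": 200}, {"status": 404}, {"status": 503}, {"status": 200}]
  print(f"Availability: {availability_sli(requests):.2%}")  # 2 good / 3 valid ≈ 66.67%

In practice you would compute the same ratio from counters in your monitoring system rather than from an in-memory list.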

Latency as an SLI

Latency (or speed) is the proportion of valid requests that are served faster than a threshold. Thus, latency indicates service quickness, and can be measured by calculating the difference between the start and stop times for a given request type. Remember, this is the user's perception of latency, and service owners commonly measure this value too precisely. In reality, users can't distinguish between a 100 millisecond (ms) and a 300 ms refresh, and might even accept responses between 300 ms and 1000 ms as normal. For more information, see the RAIL model.

Develop activity-centric metrics that focus on the user. The following are some examples of such metrics:

  • Interactive: A user waits 1000 ms for a result after clicking an element.
  • Write: A change to an underlying distributed system takes 1500 ms. While this length of time is considered slow, users tend to accept it. We recommend that you explicitly distinguish between writes and reads in your metrics.
  • Background: Actions that are not user-visible, like a periodic refresh of data or other asynchronous requests, take no more than 5000 ms to complete.

As mentioned in Choose your SLIs, latency is commonly measured as a distribution. Given a distribution, you can measure various percentiles. For example, you might measure the number of requests that are slower than the historical 99th percentile. Events faster than this threshold are considered good; slower requests are considered not so good. You set this threshold based on product requirements. You can even set multiple latency SLOs, for example, typical latency versus tail latency.

Don't only use the average (or median) latency as your SLI. If the median latency is too slow, half your users are already unhappy. Also, your service can experience bad latency for days before you discover the problem. Therefore, define your SLO for tail latency (95th percentile) and for median latency (50th percentile).

In the ACM article Metrics That Matter, Benjamin Treynor Sloss writes the following:

  • "A good practical rule of thumb ... is that the 99th-percentile latency should be no more than three to five times the median latency."
  • "We find the 50th-, 95th-, and 99th-percentile latency measures for a service are each individually valuable, and we will ideally set SLOs around each of them."

Determine your latency thresholds based on historical percentiles, then measure how many requests fall into each bucket. This approach is a good model to follow.
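
As a sketch of this percentile-and-bucket approach, the following Python snippet computes the 50th, 95th, and 99th percentile latencies and the share of requests faster than two thresholds; the sample values and thresholds are assumptions, not recommendations:

  # Latency SLI: proportion of valid requests served faster than a threshold.
  import statistics

  def latency_percentiles(latencies_ms):
      """Return the 50th, 95th, and 99th percentile latencies in milliseconds."""
      cuts = statistics.quantiles(latencies_ms, n=100)
      return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

  def latency_sli(latencies_ms, threshold_ms):
      """Proportion of requests served faster than threshold_ms."""
      good = sum(1 for latency in latencies_ms if latency < threshold_ms)
      return good / len(latencies_ms)

  latencies = [120, 180, 250, 300, 90, 1100, 220, 400, 150, 2800]
  print(latency_percentiles(latencies))
  print(f"Typical latency SLI (<1000 ms): {latency_sli(latencies, 1000):.0%}")
  print(f"Tail latency SLI (<3000 ms): {latency_sli(latencies, 3000):.0%}")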

Quality as an SLI

Quality is the proportion of valid requests that are served without degradation of service. As an example, quality is a helpful SLI for complex services that are designed to fail gracefully. To illustrate, consider a web page that loads its main content from one datastore and loads optional assets from 100 other services and datastores. If one of the optional elements is out of service or too slow, the page is still rendered without the optional elements. In this scenario, you can use SLIs to do the following:

  • Measure the number of requests that receive a degraded response (a response missing a reply from at least one backend service) and report the ratio of bad requests, as in the sketch after this list.
  • Track the number of responses that are missing a reply from a single backend, or from multiple backends.
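
A minimal sketch of the first measurement, assuming each response records which optional backends failed to reply; the field names and data are illustrative:

  # Quality SLI: proportion of valid requests served without degradation.
  # A response is degraded if it is missing a reply from at least one backend.

  def quality_sli(responses):
      good = sum(1 for r in responses if not r["missing_backends"])
      return good / len(responses)

  responses = [
      {"missing_backends": []},                           # full response
      {"missing_backends": ["ads", "recommendations"]},   # degraded: two backends missing
      {"missing_backends": []},
      {"missing_backends": ["ads"]},                      # degraded: one backend missing
  ]
  print(f"Quality: {quality_sli(responses):.0%}")  # 2 of 4 responses were not degraded

The same per-response record also supports the second measurement: bucket responses by how many backend replies are missing.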

Data processing services

These services consume data from an input, process that data, and generate an output. Service performance at intermediate steps is not as important as the final result. The strongest SLIs are freshness, coverage, correctness, and throughput. Latency and availability are less useful.

Freshness as an SLI

Freshness is the proportion of valid data updated more recently than a threshold. The following list provides some examples of using freshness as an SLI:

  • In a batch system, freshness is measured as the time elapsed since the most recent successfully completed processing run for a given output.
  • In real-time processing or more complex systems, freshness tracks the age of the most-recent record processed in a pipeline.
  • In an online game that generates map tiles in real time, users might not notice how quickly map tiles are created, but they might notice when map data is missing or is not fresh. In this case, freshness tracks missing or stale data.
  • In a service that reads records from a tracking system to generate the message "X items in stock" for an ecommerce website, a freshness SLI could be defined as the percentage of requests that are using recently refreshed (within the last minute) stock information.
  • You can also use a metric that tracks how often non-fresh data is served to inform the SLI for quality.
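
As a sketch of the ecommerce example above, the following snippet computes the proportion of requests that used stock information refreshed within the last minute; the threshold and field names are assumptions:

  # Freshness SLI: proportion of requests served with recently refreshed data.
  FRESHNESS_THRESHOLD_S = 60  # "refreshed within the last minute"

  def freshness_sli(reads):
      fresh = sum(1 for r in reads if r["data_age_seconds"] <= FRESHNESS_THRESHOLD_S)
      return fresh / len(reads)

  reads = [{"data_age_seconds": 12}, {"data_age_seconds": 45},
           {"data_age_seconds": 300}, {"data_age_seconds": 5}]
  print(f"Freshness: {freshness_sli(reads):.0%}")  # 3 of 4 reads used fresh data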

Coverage as an SLI

Coverage is the proportion of valid data processed successfully.

To define coverage, follow these steps:

  1. Determine whether to accept an input as valid or skip it. For example, if an input record is corrupted or zero-length and cannot be processed, consider that record invalid for the metric.
  2. Count the number of valid records. This count can be accomplished with a simple count() method and represents your total valid record count.
  3. Finally, count the number of records that were processed successfully and compare that number against the total valid record count. This value is your SLI for coverage (see the sketch after these steps).
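
A sketch of these three steps, with illustrative record fields; a corrupted or zero-length payload marks a record as invalid:

  # Coverage SLI: proportion of valid records processed successfully.

  def coverage_sli(records):
      valid = [r for r in records if r["payload"]]         # step 1: skip invalid inputs
      total_valid = len(valid)                             # step 2: total valid count
      processed = sum(1 for r in valid if r["processed"])  # step 3: successfully processed
      return processed / total_valid if total_valid else 1.0

  records = [
      {"payload": b"row-1", "processed": True},
      {"payload": b"",      "processed": False},  # zero-length: not a valid input
      {"payload": b"row-3", "processed": True},
      {"payload": b"row-4", "processed": False},  # valid input that failed processing
  ]
  print(f"Coverage: {coverage_sli(records):.0%}")  # 2 of 3 valid records processed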

Correctness as an SLI

Correctness is the proportion of valid data that produced correct output. Consider the following points when using correctness as an SLI:

  • In some cases, the methods to determine the correctness of an output are used to validate the output processing. For example, a system that rotates or colorizes an image should never produce a zero-byte image, or an image with a length or width of zero. It is important to separate this validation logic from the processing logic itself.
  • One method of measuring a correctness SLI is to use known-good test input data. This type of data is data that produces a known correct output. Remember, input data must be representative of user data.
  • In other cases, it's possible to use a mathematical or logical check against the output, as in the preceding example of rotating an image.
  • Lastly, consider a billing system that determines transaction validity by checking the difference between the balance before and after a transaction. If this matches the value of the transaction itself, it's a valid transaction.
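
A sketch of the known-good test data approach; rotate_image is a stand-in for your real processing logic, and the test cases are illustrative:

  # Correctness SLI: proportion of known-good inputs that produce the expected output.

  def rotate_image(image):
      """Placeholder for the system under test."""
      return image[::-1]  # illustrative transformation only

  def correctness_sli(test_cases):
      correct = sum(1 for inp, expected in test_cases if rotate_image(inp) == expected)
      return correct / len(test_cases)

  # Known-good (input, expected output) pairs, representative of user data.
  test_cases = [([1, 2, 3], [3, 2, 1]), ([4, 5], [5, 4]), ([6], [6])]
  print(f"Correctness: {correctness_sli(test_cases):.0%}")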

Throughput as an SLI

Throughput is the proportion of time where the data processing rate was faster than the threshold. Here are some points to consider when using throughput as an SLI:

  • Throughput in a data processing system is often more representative of user happiness than a single latency measurement for a given operation. If the size of each input varies dramatically, it might not make sense to compare the time each element takes to finish (if all jobs progress at an acceptable rate).
  • Bytes per second is a common way to measure the amount of work it takes to process data regardless of the size of the dataset. But any metric that roughly scales linearly with respect to the cost of processing will work.
  • It might be worthwhile to partition your data processing systems based upon expected throughput rates, or to implement a quality of service system that ensures high-priority inputs are handled and low-priority inputs are queued. Either way, measuring throughput as defined in this section helps determine whether your system is working within the SLO (see the sketch after this list).
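
A sketch of this SLI, assuming throughput is sampled in fixed windows and measured in bytes per second; the 50 MB/s threshold and window data are illustrative:

  # Throughput SLI: proportion of time windows where the processing rate
  # exceeded a threshold.
  THRESHOLD_BYTES_PER_S = 50_000_000  # assumed 50 MB/s target

  def throughput_sli(windows):
      good = sum(1 for w in windows
                 if w["bytes"] / w["seconds"] >= THRESHOLD_BYTES_PER_S)
      return good / len(windows)

  windows = [
      {"bytes": 3_600_000_000, "seconds": 60},  # 60 MB/s: good
      {"bytes": 1_800_000_000, "seconds": 60},  # 30 MB/s: below threshold
      {"bytes": 4_200_000_000, "seconds": 60},  # 70 MB/s: good
  ]
  print(f"Throughput SLI: {throughput_sli(windows):.0%}")  # 2 of 3 windows were good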

Scheduled execution services

For services that need to perform an action at a regular interval (such as Kubernetes cron jobs), measure skew and execution duration. The following is a sample scheduled Kubernetes cron job:

  apiVersion: batch/v1beta1
  kind: CronJob
  metadata:
    name: hello
  spec:
    schedule: "0 * * * *"

Skew as an SLI

Skew is the proportion of executions that start within an acceptable window of the expected start time. When using skew, consider the following:

  • Skew measures the time difference between when a job is scheduled to start and its actual start time. Consider the preceding Kubernetes cron job example. If it's set to start at minute zero of every hour, but starts at three minutes past the hour, the skew is three minutes. When a job runs early, you have a negative skew.
  • You can measure skew as a distribution over time, with corresponding acceptable ranges that define good skew. To determine the SLI, compare the number of runs that were within a good range.
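
A sketch of this SLI for the hourly cron job shown earlier, assuming each run records its scheduled and actual start times; the two-minute window is an illustrative choice:

  # Skew SLI: proportion of executions that start within an acceptable window
  # of the expected start time.
  from datetime import datetime, timedelta

  ACCEPTABLE_SKEW = timedelta(minutes=2)  # assumed acceptable window, early or late

  def skew_sli(runs):
      good = sum(1 for r in runs
                 if abs(r["actual"] - r["scheduled"]) <= ACCEPTABLE_SKEW)
      return good / len(runs)

  runs = [
      {"scheduled": datetime(2024, 3, 29, 10, 0), "actual": datetime(2024, 3, 29, 10, 1)},
      {"scheduled": datetime(2024, 3, 29, 11, 0), "actual": datetime(2024, 3, 29, 11, 3)},
      {"scheduled": datetime(2024, 3, 29, 12, 0), "actual": datetime(2024, 3, 29, 12, 0, 30)},
  ]
  print(f"Skew SLI: {skew_sli(runs):.0%}")  # 2 of 3 runs started within the window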

Execution duration as an SLI

Execution duration is the proportion of executions that complete within the acceptable duration window. The following covers important concepts related to using execution duration:

  • Execution duration is the time a job takes to complete. For a given execution, a common failure mode is when actual duration exceeds scheduled duration.
  • An interesting case is applying this SLI to a never-ending job. Because these jobs don't finish, record the time spent on a given job instead of waiting for a job to complete. This approach provides an accurate distribution of how long work takes to complete, even in worst-case scenarios.
  • As with skew, you can track execution duration as a distribution and define acceptable upper and lower bounds for good events.
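
A sketch of this SLI, with illustrative lower and upper bounds on acceptable run duration:

  # Execution duration SLI: proportion of executions that complete within the
  # acceptable duration window.
  MIN_DURATION_S = 10    # suspiciously fast runs might indicate skipped work
  MAX_DURATION_S = 1800  # runs longer than 30 minutes overrun the schedule

  def duration_sli(durations_s):
      good = sum(1 for d in durations_s if MIN_DURATION_S <= d <= MAX_DURATION_S)
      return good / len(durations_s)

  durations = [620, 540, 2400, 3, 700]  # seconds per run
  print(f"Execution duration SLI: {duration_sli(durations):.0%}")  # 3 of 5 runs were good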

Metrics for other system types

Many other workloads have their own metrics to generate SLIs and SLOs. Consider the following examples:

  • Storage systems: Durability, throughput, time to first byte, blob availability.
  • Media/video: Client playback continuity, time to start playback, transcode graph execution completeness.
  • Gaming: Time to match active players, time to generate a map.

Decide how to measure

The previous section covered what you're measuring. After you have determined what to measure, you can decide how to measure. You can collect SLI metrics in several ways. The following sections identify various measurement methods, provide a brief description of the method, list the method's advantages and disadvantages, and identify appropriate implementation tools for the method.

Server-side logging

The server-side logging method involves processing server-side logs of requests or processed data.

Advantages:

  • Reprocess existing logs to backfill historical SLI records.
  • Use cross-service session identifiers to reconstruct complex user journeys across multiple services.

Disadvantages:

  • Requests that don't arrive at the server are not recorded.
  • Requests that cause a server to crash might not be recorded.
  • Length of time to process logs can cause stale SLIs, which might be inadequate data for an operational response.
  • Writing code to process logs can be an error-prone, time-consuming task.
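
As a sketch of this method, the following snippet derives an availability SLI from access-log lines by extracting response codes; the log format, regular expression, and sample lines are assumptions about what your logs look like:

  # Log-processing sketch: count response codes in server access logs.
  import re

  STATUS_RE = re.compile(r'"\s(\d{3})\s')  # status code follows the quoted request line

  def availability_from_logs(log_lines):
      total = good = 0
      for line in log_lines:
          match = STATUS_RE.search(line)
          if not match:
              continue
          status = int(match.group(1))
          if 400 <= status < 500:
              continue  # client errors are not valid requests
          total += 1
          if status < 500:
              good += 1
      return good / total if total else 1.0

  logs = [
      '10.0.0.1 - - [29/Mar/2024:10:00:01] "GET /api/items HTTP/1.1" 200 512',
      '10.0.0.2 - - [29/Mar/2024:10:00:02] "GET /api/items HTTP/1.1" 503 0',
      '10.0.0.3 - - [29/Mar/2024:10:00:03] "GET /missing HTTP/1.1" 404 0',
  ]
  print(f"Availability from logs: {availability_from_logs(logs):.0%}")  # 1 of 2 valid requests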

Application server metrics

The application server metrics method involves exporting SLI metrics from the code that serves user requests or processes their data.

Advantages:

  • Adding new metrics to code is typically fast and inexpensive.

Disadvantages:

  • Requests that don't reach the application servers are not recorded.
  • Multi-service requests could be hard to track.
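
As one possible sketch of this method, the following snippet uses the open source prometheus_client library (an assumption; any metrics exporter works) to export counters from which an availability SLI can later be computed as good / valid; the metric names and latency target are illustrative:

  # Export SLI-relevant counters directly from request-handling code.
  import time
  from prometheus_client import Counter, start_http_server

  VALID_REQUESTS = Counter("app_valid_requests_total", "Valid requests received")
  GOOD_REQUESTS = Counter("app_good_requests_total",
                          "Valid requests served successfully within the latency target")

  LATENCY_TARGET_S = 1.0  # assumed threshold for a "good" request

  def handle_request(process):
      """Wrap the real request handler and record the counters."""
      VALID_REQUESTS.inc()
      start = time.monotonic()
      try:
          response = process()
      except Exception:
          return None  # failed request: counted as valid but not good
      if time.monotonic() - start <= LATENCY_TARGET_S:
          GOOD_REQUESTS.inc()
      return response

  start_http_server(8000)       # expose /metrics for scraping
  handle_request(lambda: "ok")  # example invocation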

Frontend infrastructure metrics

The frontend infrastructure metrics method uses metrics from the load-balancing infrastructure (such as a global Layer 7 load balancer in Google Cloud).

Advantages:

  • Metrics and historical data often already exist, reducing the engineering effort to get started.
  • Measurements are taken at the point nearest the customer yet still within the serving infrastructure.

Disadvantages:

  • Isn't viable for data processing SLIs.
  • Can only approximate multi-request user journeys.

Synthetic clients or data

The synthetic clients or data method involves using clients that send fabricated requests at regular intervals and validate the responses.

Advantages:

  • Measures all steps of a multi-request user journey.
  • Sending requests from outside your infrastructure captures more of the overall request path in the SLI.

Disadvantages:

  • Approximating user experience with synthetic requests might be misleading (both false positives and false negatives).
  • Covering all corner cases is hard and can devolve into integration testing.
  • High reliability targets require frequent probing for accurate measurement.
  • Probe traffic can drown out real traffic.
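
A minimal sketch of a synthetic prober using only the Python standard library; the probe URL, interval, and response check are placeholders for your own service:

  # Send a fabricated request at a regular interval and validate the response.
  import time
  import urllib.request

  PROBE_URL = "https://example.com/healthz"  # hypothetical probe endpoint
  PROBE_INTERVAL_S = 60

  def probe_once():
      """Return True if the synthetic request succeeds and looks valid."""
      try:
          with urllib.request.urlopen(PROBE_URL, timeout=10) as resp:
              return resp.status == 200 and b"ok" in resp.read()
      except Exception:
          return False

  def probe_success_ratio(iterations=3):
      results = []
      for _ in range(iterations):
          results.append(probe_once())
          time.sleep(PROBE_INTERVAL_S)
      return sum(results) / len(results)  # proportion of successful probes

  print(f"Probe success ratio: {probe_success_ratio():.0%}")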

Client instrumentation

The client instrumentation method involves adding observability features to the client that the user interacts with, and logging events from the client back to your serving infrastructure so that you can track SLIs.

Advantages:

  • Provides the most accurate measure of user experience.
  • Can quantify reliability of third parties, for example, CDN or payments providers.

Disadvantages:

  • Client log ingestion and processing latency make these SLIs unsuitable for triggering an operational response.
  • SLI measurements contain a number of highly variable factors potentially outside of direct control.
  • Building instrumentation into the client can involve lots of engineering work.

Choose a measurement method

After you have decided what to measure and how to measure it, your next step is to choose the measurement method that most closely aligns with your customers' experience of your service and demands the least effort on your part. To achieve this ideal, you might need to use a combination of the methods described in the previous sections. The following are suggested approaches that you can implement over time, listed in order of increasing effort:

  • Use application server exports and infrastructure metrics. Typically, you can access these metrics immediately, and they quickly provide value. Some APM tools include built-in SLO tooling.
  • Use client instrumentation. Because legacy systems typically lack built-in, end-user client instrumentation, setting up instrumentation might require a significant investment. However, if you use an APM suite or frontend framework that provides client instrumentation, you can quickly gain insight into your customer's happiness.
  • Use logs processing. If you can't implement server exports or client instrumentation (previous bullets) but logs do exist, logs processing might be your best approach. Another method is to combine exports and logs processing: use exports as an immediate source for some SLIs (such as immediate availability) and logs processing for long-term signals (such as the slow-burn alerts discussed in the SLOs and alerting guide).
  • Implement synthetic testing. After you have a basic understanding of how your customers use your service, you can test your service with synthetic requests. For example, you can seed test accounts with known-good data and query for it. This approach can help highlight failure modes that aren't readily observed, such as in low-traffic services.

What's next?