Service level objectives overview

Service level objectives (SLOs) are a core tool in the Google service monitoring toolkit. SLOs give you a concise, low-noise signal of the overall health of your services. Anthos Service Mesh lets you set SLOs for your services, and then monitor and alert on those services in terms of their SLOs.

To monitor the health of a service, you need to understand which behaviors matter for that service and how to measure and evaluate those behaviors. A service level indicator (SLI) is a quantitative measure of some aspect of the service. Typical SLIs are:

  • Latency: How long it takes to return a response to a request, usually measured in milliseconds (ms). Latency is typically reported as an aggregate: the raw measurements are collected over a period of time and summarized as percentiles. Anthos Service Mesh displays a Latency graph on the Metrics page for each of your services. The graph shows latency over time, which can help you determine a latency threshold (upper bound) for a service.
  • Availability: The fraction of requests to which a service responds successfully, typically expressed as the ratio of successful responses to total responses. The Error rate graph on the Metrics page can help you determine the availability of each service.
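
As a rough illustration of the two SLIs above, the following Python sketch computes a latency percentile and an availability ratio from a list of request records. The Request type, the function names, and the sample data are hypothetical and exist only for this example; they are not part of Anthos Service Mesh, which computes these values from service telemetry for you. The percentile uses the nearest-rank method, which is one of several common definitions.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one request (not an Anthos Service Mesh type)."""
    latency_ms: float
    ok: bool  # True if the service responded successfully


def latency_percentile(requests, pct):
    """Return the pct-th latency percentile in ms, using the nearest-rank method."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    ordered = sorted(r.latency_ms for r in requests)
    # Nearest rank: the smallest value with at least pct percent of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def availability(requests):
    """Return the ratio of successful responses to total responses."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    return sum(1 for r in requests if r.ok) / len(requests)


# Made-up sample: ten requests, one of which was slow and failed.
latencies = [120, 80, 310, 95, 150, 200, 90, 110, 450, 100]
sample = [Request(latency_ms=v, ok=(v != 450)) for v in latencies]

p50 = latency_percentile(sample, 50)
p90 = latency_percentile(sample, 90)
avail = availability(sample)
print(f"p50={p50} ms, p90={p90} ms, availability={avail:.0%}")
```

Both functions assume that the caller has already selected the requests that fall inside the measurement period of interest.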

An SLO is a target value for a service level that is measured by an SLI. An SLO can be represented as: SLI ≤ upper_bound or SLI ≥ lower_bound. SLOs are measurable goals for performance over a period of time. For example, you might have requirements like the following for some of your services:

  • At most 5 percent of requests can have a latency greater than 300 ms over a rolling 30-day period.
  • The system must have 99 percent availability measured over a calendar week.
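
The two example requirements above can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the function names and the sample numbers are hypothetical, and the caller is assumed to have already selected the data for the relevant compliance period (the rolling 30 days or the calendar week). It is not how Anthos Service Mesh evaluates SLOs internally.

```python
def fraction_slower_than(latencies_ms, threshold_ms):
    """Return the fraction of requests whose latency exceeds threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests in the compliance period")
    slow = sum(1 for v in latencies_ms if v > threshold_ms)
    return slow / len(latencies_ms)


def meets_latency_slo(latencies_ms, threshold_ms=300.0, max_slow_fraction=0.05):
    """SLO of the form SLI <= upper_bound: at most 5 percent of requests over 300 ms."""
    return fraction_slower_than(latencies_ms, threshold_ms) <= max_slow_fraction


def meets_availability_slo(successful, total, target=0.99):
    """SLO of the form SLI >= lower_bound: at least 99 percent successful responses."""
    if total <= 0:
        raise ValueError("no requests in the compliance period")
    return successful / total >= target


# Made-up data: 100 requests, 4 of them slower than 300 ms.
period_latencies = [120.0] * 96 + [450.0] * 4
print(meets_latency_slo(period_latencies))    # 4 percent slow, within the 5 percent limit
print(meets_availability_slo(995, 1000))      # 99.5 percent available, meets the 99 percent target
print(meets_availability_slo(985, 1000))      # 98.5 percent available, misses the target
```

The latency check is written against the fraction of slow requests, which is equivalent to requiring that the 95th percentile latency stay at or below 300 ms.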

On the Health page, you can set and view SLOs for your services based on their telemetry data. You can then create alerts in Stackdriver Monitoring that warn you if a service isn't performing as expected.

What's next