Concepts in service monitoring

Service monitoring and the Service Monitoring API help you manage your services like Google manages its own services. The core concepts of service monitoring include the following:

  • Selecting metrics that act as service-level indicators (SLIs).
  • Using the SLIs to set service-level objectives (SLOs), which are target values for the SLIs.
  • Using the error budget implied by the SLO to mitigate risk in your service.

This page introduces these concepts and describes some of the things to consider when designing an SLO. The other pages in this section put these concepts into practice.

Terminology

Service monitoring has a set of core concepts, which are introduced here:

  • Service-level indicator (SLI): a measurement of performance.
  • Service-level objective (SLO): a statement of desired performance.
  • Error budget: a measurement of the difference between actual and desired performance.

Service-level indicators

Stackdriver Monitoring collects metrics that measure the performance of the service infrastructure. Examples of performance metrics include the following:

  • Request count: for example, the number of HTTP requests per minute that result in 2xx or 5xx responses.
  • Response latencies: for example, the latency for HTTP 2xx responses.

The performance metrics are automatically identified for a set of known service types: App Engine, Istio on Google Kubernetes Engine, and Cloud Endpoints. You can also define your own service type and select performance metrics for it.

The performance metrics are the basis of the SLIs for your service. An SLI describes the performance of some aspect of your service. For services on App Engine, Istio on Google Kubernetes Engine, and Cloud Endpoints, useful SLIs are already known. For example, if your service has request-count or response-latencies metrics, standard SLIs can be derived from those metrics by creating ratios as follows:

  • An availability SLI is the ratio of the number of successful responses to the number of all responses.
  • A latency SLI is the ratio of the number of calls below a latency threshold to the number of all calls.
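
For example, here is a minimal sketch in Python of how these two ratios might be computed from raw counts; the numbers are hypothetical:

    # Hypothetical counts collected over some measurement period.
    total_requests = 10_000
    successful_responses = 9_950   # e.g., HTTP 2xx responses
    fast_responses = 9_800         # responses under the latency threshold

    # Availability SLI: successful responses / all responses.
    availability_sli = successful_responses / total_requests   # 0.995

    # Latency SLI: calls below the latency threshold / all calls.
    latency_sli = fast_responses / total_requests               # 0.98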

You can also set up service-specific SLIs for some other measure of what “good performance” means. These generally fall into two categories:

  • Request-based SLIs, where good service is measured by counting atomic units of service, like the number of successful HTTP requests.
  • Windows-based SLIs, where good service is measured by counting the number of time periods, or windows, during which performance meets a goodness criterion, like response latency below a given threshold.

These are described in more detail in Compliance in request- and windows-based SLOs.

Service-level objectives

An SLO is a target value for an SLI, measured over a period of time. The available SLIs are determined by the service, and you specify SLOs based on the SLIs. The SLO defines what qualifies as good service. An SLO is built on the following kinds of information:

  • An SLI, which measures the performance of the service.
  • A performance goal, which specifies the desired level of performance.
  • A time period, called the compliance period, for measuring how the SLI compares to the performance goal.
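
As an illustration, these three pieces map onto the Service Monitoring API's ServiceLevelObjective resource. The sketch below (a Python dict mirroring the JSON request body) is a minimal example assuming a request-based SLI; the display name and filter strings are hypothetical placeholders:

    # A sketch of an SLO definition for the Service Monitoring API's
    # ServiceLevelObjective resource. The metric filters are hypothetical.
    slo = {
        "displayName": "99% of requests are good over a rolling 30 days",
        "serviceLevelIndicator": {                 # the SLI
            "requestBased": {
                "goodTotalRatio": {
                    "goodServiceFilter": "<filter for good requests>",
                    "totalServiceFilter": "<filter for all requests>",
                }
            }
        },
        "goal": 0.99,                              # the performance goal
        "rollingPeriod": "2592000s",               # the compliance period: 30 days
    }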

For example, you might have requirements like these:

  • Latency can exceed 300ms in only 5 percent of the requests over a rolling 30-day period.
  • The system must have 99% availability measured over a calendar week.

Requirements like these can provide the basis for SLOs. See Designing and using SLOs for guidance on setting good SLOs.

Changes in SLO compliance can also indicate the onset of failures. Monitoring these changes might give you enough warning to fix a problem before it cascades, so alerting policies are typically used to monitor SLO compliance. For more information, see Alerting on your error budget.

A useful SLO is less than 100%, because the SLO determines your error budget. SLOs are typically described as a “number of nines”: 99% (2 nines), 99.9% (3 nines), and so forth. The highest value you can set is 99.9%, but you can use any lower value that is appropriate for your service.

Error budgets

An SLO specifies the degree to which a service must perform during a compliance period. What's left over in the compliance period becomes the error budget. The error budget quantifies the degree to which a service can fail to perform during the compliance period and still meet the SLO.

An error budget is defined as 100% − SLO%. If your SLO target is 99.99%, your error budget is 0.01% within the compliance period. A service required to meet a 100% SLO has no error budget. Setting such an SLO is a bad practice.

Error budgets let you track how many bad individual events (like requests) are allowed to occur during the remainder of your compliance period before you violate the SLO. You can use the error budget to help you manage maintenance tasks like deployment of new versions. If the error budget is nearly depleted, then taking risky actions like pushing new updates might result in your violating an SLO.

For example, if your SLO is for 85% of requests to be good in a 7-day rolling period, then your error budget allows 15% of the requests to be bad. If you get an average of, say, 60,480 requests in a week, your error budget is 15% of that total, or 9,072 requests that are permitted to be bad. Your starting error budget is 9,072 requests, and as bad requests occur, the error budget is consumed until it reaches 0.
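
The arithmetic from that example, as a short sketch (the count of bad requests is hypothetical):

    # 85% SLO over a 7-day rolling period, with 60,480 requests in the week.
    slo_goal = 0.85
    total_requests = 60_480

    # The error budget is the fraction of requests allowed to be bad.
    error_budget_fraction = 1.0 - slo_goal                          # 0.15
    error_budget_requests = error_budget_fraction * total_requests  # 9,072

    # Each bad request consumes one unit of the budget.
    bad_requests_so_far = 2_500                                     # hypothetical
    remaining_budget = error_budget_requests - bad_requests_so_far  # 6,572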

Designing and using SLOs

What makes a good SLO? What are things to consider in making the choices? This section provides an overview of some of the general concepts behind designing and using SLOs. This topic is covered in much more detail in Site Reliability Engineering: How Google Runs Production Systems, in the chapter on SLOs.

SLOs define the target performance you want from your service. In general, SLOs should be no higher than necessary or meaningful. If your users cannot tell the difference between 99% availability and 99.9% availability of your service, use the lower value as the SLO. The higher value will be more expensive to meet, and it won't make a difference to your users.

SLOs are typically more stringent than public or contractual commitments. You want an SLO to be tighter than a public commitment so that, if something happens that causes violation of the SLO, you are aware of the problem and fixing it before the commitment or contract is violated. Violating a commitment or contract may have reputational, financial, or legal implications. An SLO is part of an early-warning system to prevent that from happening.

Compliance periods

There are two types of compliance periods for SLOs:

  • Calendar-based periods (from date to date)
  • Rolling periods (from n days ago to now, where n ranges from 1 to 30 days)

Calendar-based compliance periods

Compliance periods can be set to calendar periods like a week or a month. The compliance period and error budget reset on well-known calendar boundaries. For the possible values, see CalendarPeriod.

With a calendar period, you get a performance score at the end of the period. Measured against the performance threshold, the performance score lets you know whether your service was compliant. When you use a calendar period, you get a compliance rating only once per compliance period, even though you can see the performance throughout the period. But the end-of-period score gives you an easy-to-read value that maps easily to your customer billing periods (if you have external paying customers).

Like months on a calendar, monthly compliance periods vary in the number of days they cover.

Rolling window-based compliance periods

You can also measure compliance over a rolling period, so that you are always evaluating, for example, the last 30 days. This means that the oldest data in the previous calculation drops out of the current calculation and is replaced by new data.

With a rolling window, you get more compliance measurements: each day, you get a measure of compliance for the last 30 days, rather than one measure per month. Services can move between compliance and noncompliance from day to day as old data points are dropped and new ones are added.
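
A minimal sketch of that mechanism, using hypothetical per-day request counts; each new day evicts the oldest day from the 30-day window before compliance is recomputed:

    from collections import deque

    # Per-day (good_requests, total_requests) counts for the last 30 days.
    # With maxlen=30, appending a 31st day automatically evicts the oldest.
    window = deque(maxlen=30)

    def record_day(good, total):
        window.append((good, total))

    def rolling_compliance():
        good = sum(g for g, _ in window)
        total = sum(t for _, t in window)
        return good / total if total else None

    # Hypothetical traffic: each call shifts the window forward one day,
    # so the service can move between compliance and noncompliance daily.
    record_day(good=9_900, total=10_000)
    print(rolling_compliance())   # 0.99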

Compliance in request- and windows-based SLOs

Determining whether or not an SLO is in compliance depends on two factors:

  • How the compliance period is determined. This is discussed in Compliance periods.
  • The type of SLO. There are two types of SLOs:
    • Request-based SLOs
    • Windows-based SLOs

Compliance is the ratio of good events to total events, measured over the compliance period. The type of SLO determines what constitutes an “event”.

If your SLO is 99.9%, then you meet it when your compliance is at least 99.9%. The maximum possible compliance is 100%.

Request-based SLOs

A request-based SLO is based on an SLI that is defined as the ratio of the number of good requests to the total number of requests. A request-based SLO is met when that ratio meets or exceeds the goal for the compliance period.

For example, consider this request-based SLO: “Latency is below 100ms for at least 95% of requests.” A good request is one with a response time less than 100ms, so the measure of compliance is the fraction of requests with response times under 100ms. The service is compliant if this fraction is at least 0.95.
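
A sketch of that compliance check, assuming you already have the per-request latencies (the values here are hypothetical):

    # Hypothetical per-request response times, in milliseconds.
    latencies_ms = [42, 87, 95, 120, 64, 99, 310, 78]

    # A good request has a response time under 100 ms.
    good_requests = sum(1 for ms in latencies_ms if ms < 100)
    compliance = good_requests / len(latencies_ms)   # 0.75 here

    # The request-based SLO is met if the ratio is at least the goal.
    slo_met = compliance >= 0.95                     # False here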

Request-based SLOs give you a sense of what percentage of work your service did properly over the entire compliance period, no matter how the load was distributed throughout the compliance period.

Windows-based SLOs

A windows-based SLO is based on an SLI defined as the ratio of the number of measurement intervals that meet some goodness criterion to the total number of intervals. A windows-based SLO is met when that ratio meets or exceeds the goal for the compliance period.

For example, consider this SLO: “The 95th percentile latency metric is less than 100ms for at least 99% of 10-minute windows”. A good measurement period is a 10-minute span in which 95% of the requests have latency under 100ms. The measure of compliance is the fraction of such good periods. The service is compliant if this fraction is at least 0.99.

For another example, suppose you configure your compliance period to be a rolling 30 days, the measurement interval to be a minute, and the SLO goal to be 99%. To meet this SLO, your service must have 42,768 “good” intervals out of 43,200 minutes (99% of the number of minutes in 30 days).
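
Both examples reduce to the same arithmetic; here is a sketch with hypothetical per-window values:

    # 99% of 1-minute windows must be good over a rolling 30 days.
    windows_in_period = 30 * 24 * 60                   # 43,200 windows
    required_good = round(0.99 * windows_in_period)    # 42,768 good windows

    # A window is good if its goodness criterion holds; here, the window's
    # 95th-percentile latency (hypothetical values) is under 100 ms.
    p95_by_window_ms = [80, 92, 101, 97, 88]
    good_windows = sum(1 for p95 in p95_by_window_ms if p95 < 100)
    compliance = good_windows / len(p95_by_window_ms)  # 0.8 here
    slo_met = compliance >= 0.99                       # False here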

A windows-based SLO gives you an idea of what percentage of the time your customers found the service to be working well or poorly. This type of SLO can hide the effects of “bursty” behavior: a measurement interval that failed every one of its calls counts against the SLO as much as a measurement interval that had one error too many. Also, an interval with a low number of calls counts against the SLO as much as an interval with heavy activity.

Trajectory of error budgets

The error budget is the difference between 100% good service and your SLO, the desired level of good service. The difference between them is your wiggle room.

In general, an error budget starts as a maximum value and drops over time, triggering an SLO violation when the error budget drops below 0.

There are a couple of notable exceptions to this pattern:

  • If you have a request-based SLO measured over a calendar compliance period, and the service has increased activity over the compliance period, the remaining error budget can actually rise.

    How is that possible? The SLO system can't know in advance how much activity the service will have in a compliance period, so it extrapolates: it divides the number of calls received so far by the time elapsed since the start of the compliance period, and multiplies that rate by the length of the period to project the total expected traffic.

    As the activity rate goes up, the expected traffic for the period also goes up, and as a result, the error budget rises (see the sketch after this list).

  • If you are measuring an SLO over a rolling compliance period, you are effectively always at the end of a compliance period. Rather than starting from scratch, old data points are continuously dropped and new data points are continuously added.

    If a period of poor compliance rolls out of the compliance window, and the new data replacing it is compliant, the error budget goes up. At any point in time, an error budget ≥ 0 indicates a compliant rolling SLO window, and an error budget < 0 indicates a noncompliant one.
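
A sketch of the extrapolation described in the first case, with hypothetical numbers:

    # Request-based SLO over a calendar period: the expected traffic for
    # the period is extrapolated from the rate observed so far.
    slo_goal = 0.99
    period_days = 30
    days_elapsed = 10                     # a third of the way through
    calls_so_far = 200_000
    bad_calls_so_far = 1_200

    # Extrapolated total traffic: observed rate * period length.
    projected_total = calls_so_far / days_elapsed * period_days   # 600,000

    # Remaining budget in calls; if the activity rate rises, so does
    # projected_total, and the remaining budget can go up.
    remaining_budget = (1 - slo_goal) * projected_total - bad_calls_so_far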

Monitoring your error budget

You can create alerting policies to warn you if your error budget is being consumed at a faster-than-desired rate. See Alerting on your error budget for more information.
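
The underlying idea is a burn rate: compare the fraction of the error budget consumed so far to the fraction of the compliance period that has elapsed. A minimal sketch of the concept (this illustrates the arithmetic only, not a specific API):

    # Hypothetical budget-consumption check.
    budget_consumed_fraction = 0.40   # 40% of the error budget is spent
    period_elapsed_fraction = 0.20    # 20% of the compliance period has passed

    # A burn rate above 1.0 means the budget will be exhausted before the
    # compliance period ends if consumption continues at this pace.
    burn_rate = budget_consumed_fraction / period_elapsed_fraction   # 2.0
    should_alert = burn_rate > 1.0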

