Alerting on your burn rate

You can create alerting policies on your service-level objectives (SLOs) to let you know if you are in danger of violating an SLO. You select the SLO you want to monitor, and then set up a condition that triggers an alert if the condition is violated. The condition is typically expressed by selecting a threshold value that constitutes a violation, and a period of time for which the violation is permitted. If the threshold is exceeded for more than the allowable period of time, the alerting policy is triggered.

This page describes how to set up an alerting policy on the burn rate of your error budget. It does not cover alerting policies in detail; it assumes you already know the basic concepts of conditions and notification channels. For general information about alerting policies and how to create them, see Using alerting policies.

Burn rate of error budget

Your error budget for a compliance period is 100% − SLO%. If your SLO goal is 95%, then your error budget is 5% of your performance. Burn rate tells you how fast you are consuming the error budget. For more information, see Error budgets.
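
For illustration (these numbers are not from this page), consider a request-based SLO with a 95% goal and 1,000,000 eligible requests in the compliance period:

  error budget      = 100% − 95% = 5%
  allowed failures  = 5% × 1,000,000 requests = 50,000 requests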

The burn-rate metric is retrieved by the time-series selector select_slo_burn_rate. A burn-rate alerting policy notifies you if your error budget is being depleted faster than usual.

There are other time-series selectors; see Retrieving SLO data for more information. You can create alerting policies that use some of these other time-series selectors, but you must create them by using the Stackdriver Monitoring API.

Overview: creating an alerting policy on an SLO

Creating an alerting policy for an SLO is very similar to creating an alerting policy for any other metric, and the steps are the same whether you use the API or the UI.

This section reviews the general steps, and the following sections describe them in more detail.

To create an alerting policy for an SLO, you take the following steps:

  1. Identify the SLO you want to base the alerting policy on.

  2. Construct a condition for your alerting policy that uses the chosen SLO. This is where you use the time-series selector to retrieve the data for the SLO. For example, if you use the time-series selector for burn rate, the retrieved data will reflect the burn rate of the error budget for the chosen SLO.

    This condition is also where you specify the threshold and duration of violations of the SLO before triggering an alert. For example, you want the burn rate to be some amount over the desired rate for some period of time before triggering an alert. The value for “some amount over” is the condition's threshold, and the value for “some period of time” is the condition's duration.

  3. Identify or create a notification channel to use in your alerting policy.

  4. Provide documentation that explains to users what triggered the alerting policy.

  5. Assemble these pieces into an invocation to create an alerting policy; a skeletal example follows this list.
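
Assembled, these pieces map onto the fields of the request body used later in this page: the condition from step 2 goes in conditions, the channels from step 3 in notificationChannels, and the text from step 4 in documentation. A bare skeleton, with the details elided, looks like this:

  {
    "displayName": "...",
    "combiner": "AND",
    "conditions": [ ... ],
    "notificationChannels": [ ... ],
    "documentation": { ... }
  }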

For general information about alerting policies and how to create them, see Using alerting policies.

Alerting policies and lookback periods

When you retrieve the SLO data with a time-series selector, you specify an identifier for the SLO and a lookback period. The lookback period determines how far back in time to retrieve data; it also establishes the baseline for what counts as normal consumption over that span of time.

If you want to alert on the per-day consumption of the error budget, choose a lookback period of 24 hours. To alert on the per-hour consumption, choose a lookback period of an hour.
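
For example, the lookback period is the second argument to the select_slo_burn_rate selector described later in this page. The first filter below matches the 1-hour example used later; the 24-hour variant assumes that the minute-based lookback format simply scales (1440 minutes = 24 hours):

  "filter": "select_slo_burn_rate(\"projects/${PROJECT}/services/${SERVICE_ID}/serviceLevelObjectives/${SLO_ID}\", \"60m\")"

  "filter": "select_slo_burn_rate(\"projects/${PROJECT}/services/${SERVICE_ID}/serviceLevelObjectives/${SLO_ID}\", \"1440m\")"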

Types of error-budget alerts

When setting up alerting policies to monitor your error budget, it's a good idea to set up two related alerting policies:

  • Fast-burn alert, which warns you of a sudden, large change in consumption that, if uncorrected, will exhaust your error budget very soon. “At this rate, we'll burn through the whole month's error budget in two days!”

    For a condition like this, a shorter lookback duration makes sense. You want to know if a potentially disastrous condition has emerged and persisted, even briefly. If it is truly disastrous, you don't want to wait very long to notice it.

    The threshold for rate of consumption you alert on here is much higher than the baseline ideal for the lookback period.

  • Slow-burn alert, which warns you of a rate of consumption that, if not altered, exhausts your error budget before the end of the compliance period. This type of condition is less urgent than a fast-burn condition. “We are slightly above where we'd like to be at this point in the month, but we aren't in big trouble yet.”

    For a slow-burn condition, a longer lookback duration makes sense, to smooth out variations in shorter-term consumption.

    The threshold for rate of consumption you alert on here is higher than the baseline ideal for the lookback period, but not by much. A policy that paired such a modest threshold with a short lookback period could generate too many alerts, even if the longer-term consumption evens out. But if consumption stays even a little too high over a longer period, it eventually consumes all of your error budget. Example conditions for both alert types are sketched after this list.
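
The following condition fragments sketch how the two policies might differ. The structure matches the conditionThreshold example shown later in this page, but the specific lookback periods and threshold values (10 and 2) are illustrative choices, not recommendations made by this page. The first fragment is a fast-burn condition; the second is a slow-burn condition:

  "conditionThreshold": {
    "filter": "select_slo_burn_rate(\"projects/${PROJECT}/services/${SERVICE_ID}/serviceLevelObjectives/${SLO_ID}\", \"60m\")",
    "comparison": "COMPARISON_GT",
    "thresholdValue": 10
  }

  "conditionThreshold": {
    "filter": "select_slo_burn_rate(\"projects/${PROJECT}/services/${SERVICE_ID}/serviceLevelObjectives/${SLO_ID}\", \"1440m\")",
    "comparison": "COMPARISON_GT",
    "thresholdValue": 2
  }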

Creating an SLO alert: API

Alerting policies for the burn rate of your error budget are based on the time-series selector select_slo_burn_rate, described in Retrieving SLO data. There are other time-series selectors, and you can use some of them as the basis for alerting policies.

You create alerting policies by using the alertPolicies.create method. The general use of this method is documented in Managing alerting policies.

Alerting policies for SLOs are similar to other alerting policies in many ways: they are alerting policies with a metric-threshold condition. They differ from other alerting policies in a very specific way: the filter in the MetricThreshold specification of the condition uses a time-series selector instead of a pair of metric and monitored-resource types.

Conditions for SLO-based alerting policies

An alerting policy must have at least one condition. For an SLO-based condition, use a MetricThreshold-type condition.

A metric-threshold condition can contain two pairs of time-series configurations: filter and aggregations, and, for building ratios, denominatorFilter and denominatorAggregations. Because SLO data is not retrieved by using the standard monitoring filters, of these fields only filter is used in a condition for an SLO.

A condition for an SLO does set the comparison, thresholdValue, duration, and trigger fields.

This example creates a condition that is violated when the burn rate exceeds 2 times the normal rate. The structure looks like this:

  "conditions": [
    {
      "displayName":"SLO burn rate alert for ${SLO_ID} exceeds 2",
      "conditionThreshold": {
        "filter": [TO_BE_DETERMINED],
        "comparison":"COMPARISON_GT",
        "thresholdValue": 2,
        "duration": {
          "seconds":"0",
        },
      },
    }
  ],

To set the filter field, you need the resource name of a specific SLO. This will be a value of the form projects/${PROJECT}/services/${SERVICE_ID}/serviceLevelObjectives/${SLO_ID}. For information on finding the SLO ID, see Listing SLOs.
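
If you don't know the SLO ID, one way to find it (a sketch, assuming the ${ACCESS_TOKEN}, ${PROJECT_ID}, and ${SERVICE_ID} variables used elsewhere on this page are already set) is to list the SLOs defined for the service; each entry's name field is the resource name you need:

  curl --http1.1 --header "Authorization: Bearer ${ACCESS_TOKEN}" \
    https://monitoring.googleapis.com/v3/projects/${PROJECT_ID}/services/${SERVICE_ID}/serviceLevelObjectives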

To create an alert on burn rate, use the time-series selector select_slo_burn_rate. This selector takes two arguments: the resource name of the target SLO and the lookback period. For more information, see select_slo_burn_rate.

For example, the following filter gets the burn rate of the target SLO with a 1-hour lookback period:

"filter":"select_slo_burn_rate(\"projects/${PROJECT}/services/${SERVICE_ID}/serviceLevelObjectives/${SLO_ID}\", \"60m\")"

The rest of the alerting policy

To complete the alerting policy, specify values for the remaining fields:

  • displayName: A description of the alerting policy.
  • combiner: Describes the logic for combining conditions. This policy has only one condition, so either AND or OR works.
  • notificationChannels: An array of existing notification channels to use when the alerting policy is triggered. For information on finding and creating notification channels, see Notification channels; an example of listing channels follows this list.
  • documentation: Information that is sent when the condition is violated to help recipients diagnose the problem. For details, see Documentation.
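
If you need to look up channel names, you can list the existing notification channels (a sketch, assuming the ${ACCESS_TOKEN} and ${PROJECT_ID} variables used elsewhere on this page). Each channel's name field, of the form projects/${PROJECT_ID}/notificationChannels/CHANNEL_ID, is the value to put in the notificationChannels array:

  curl --http1.1 --header "Authorization: Bearer ${ACCESS_TOKEN}" \
    https://monitoring.googleapis.com/v3/projects/${PROJECT_ID}/notificationChannels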

Creating the alerting policy

The following example uses the API to create a burn-rate alerting policy. For information about listing, modifying, and deleting alerting policies, see Managing alerting policies by API.

Protocol

To create the alerting policy by using curl, send a POST message to the https://monitoring.googleapis.com/v3/projects/${PROJECT_ID}/alertPolicies endpoint, and provide the alerting policy in the request body. The JSON in the request body describes an alerting policy that uses a threshold condition based on the select_slo_burn_rate time-series selector with a one-hour lookback period.

  1. Create a variable to hold the request body:

    CREATE_ALERT_POST_BODY=$(cat <<EOF
    {
      "displayName":"SLO burn-rate alert for ${SLO_ID} with a threshold of 2",
      "combiner":"AND",
      "conditions": [
        {
          "displayName":"SLO burn rate alert for ${SLO_ID} exceeds 2",
          "conditionThreshold": {
            "filter":"select_slo_burn_rate(\"projects/${PROJECT}/services/${SERVICE_ID}/serviceLevelObjectives/${SLO_ID}\", \"60m\")",
            "comparison":"COMPARISON_GT",
            "thresholdValue": 2,
            "duration": {
              "seconds":"0"
            }
          }
        }
      ],
      "notificationChannels": ["${NOTIFICATION_CHANNEL}"],
      "documentation": {
         "content": "SLO burn for the past 60m exceeded twice the normal budget burn rate.",
         "mime_type": "text/markdown"
      }
    }
    EOF
    )
    
  2. Post the request to the endpoint:

    curl --http1.1 --header "Authorization: Bearer ${ACCESS_TOKEN}" --header "Content-Type: application/json" -X POST -d "${CREATE_ALERT_POST_BODY}" https://monitoring.googleapis.com/v3/projects/${PROJECT_ID}/alertPolicies
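
To confirm that the policy was created, you can list the alerting policies in the project (a sketch using the same variables as the preceding steps) and look for the new policy's displayName in the response:

  curl --http1.1 --header "Authorization: Bearer ${ACCESS_TOKEN}" \
    https://monitoring.googleapis.com/v3/projects/${PROJECT_ID}/alertPolicies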
    

Creating SLO-based alerting policies with the UI

From the Create New Alerting Policy page, you can configure the SLO Burn Rate condition type. The target metric is displayed as burn rate. To create a burn-rate condition, do the following:

  1. Select the service you want to monitor.
  2. Select the SLO whose burn rate you want to monitor.
  3. Set a lookback duration, the period over which you want to estimate the burn rate.
  4. Set the threshold at which you want the alerting policy to be triggered.

Then complete the policy definition with any notification channels or documentation you want to provide. See Alerting policies: using the console for more information. For information on SLO-based alerting policies in Anthos Service Mesh, see Anthos Service Mesh documentation: Creating an alerting policy for an SLO.
