Add severity levels to an alerting policy

This document describes how you can add user-defined labels to an alerting policy, and how you can use those labels to help you manage your incidents and notifications.

About labels

Labels are key-value pairs that are associated with time series, alerting policies, and incidents. There are metric labels, resource labels, and user-defined labels. Metric and resource labels contain specific information about the metric being collected or the resource against which the metric is written. In contrast, user-defined labels are those labels that you create and that record information specific to your needs.

When time-series data is written, labels are attached to the data to record information about that data. For example, the labels on a time series might identify a virtual machine (VM), a zone, a Google Cloud project, and a device type.

You can add user-defined labels to alerting policies and to incidents:

  • A label on an alerting policy has a static value. For an example that illustrates how to add a label with the value of critical, see Create static severity levels.

    You can add these labels to an alerting policy when you use the Google Cloud console or the Cloud Monitoring API:

    • Google Cloud console: To add policy labels, use the preview alerting interface. The policy labels options are included on the page where you configure notification channels and documentation. For more information, see Create an alerting policy.

    • Cloud Monitoring API: To add policy labels, use the userLabels field of the AlertPolicy object. For more information, see Managing alerting policies by API. For a sketch that sets this field by using the client library, see the example after this list.

    Label keys must start with a lowercase letter. Both label keys and label values can contain only lowercase letters, digits, underscores, and dashes.

  • A label on an incident can have its value set dynamically. That is, the value of the time-series data can determine the label value. For an example, see Create dynamic severity levels using Monitoring Query Language (MQL).

    You can define these labels when you specify the condition of an alerting policy with MQL.
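
For example, the following sketch adds a severity label with the value critical to an existing alerting policy by using the Cloud Monitoring API client library for Python. The PROJECT_ID and POLICY_ID values are placeholders, and the google-cloud-monitoring package is assumed to be installed.

# A minimal sketch: add a static severity label to an existing alerting policy.
# PROJECT_ID and POLICY_ID are placeholders for your own values.
from google.cloud import monitoring_v3
from google.protobuf import field_mask_pb2

client = monitoring_v3.AlertPolicyServiceClient()

# Fetch the policy that you want to label.
policy = client.get_alert_policy(
    name="projects/PROJECT_ID/alertPolicies/POLICY_ID"
)

# user_labels corresponds to the userLabels field of the AlertPolicy resource.
# Keys must start with a lowercase letter; keys and values can contain only
# lowercase letters, digits, underscores, and dashes.
policy.user_labels["severity"] = "critical"

# Write back only the user_labels field.
updated = client.update_alert_policy(
    alert_policy=policy,
    update_mask=field_mask_pb2.FieldMask(paths=["user_labels"]),
)
print(f"Updated {updated.name}: {dict(updated.user_labels)}")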

Because labels are included in notifications, you can add labels that indicate the severity of an incident and then use those labels to decide which alerts to investigate first.

View labels in notifications

You can view the labels of an alerting policy or an incident on the details page of an incident, on the details page of an alerting policy, and in some notifications:

  • In email notifications, the labels that you add to the policy are listed in the Policy labels section, while the labels that you add to an incident are listed in the Metric labels section.

  • In PagerDuty, Webhooks, and Pub/Sub notifications, the labels you add to an alerting policy or incident are included in the JSON data. Alerting policy labels are listed in the policy_user_labels field of the JSON structure:

    "policy_user_labels": {
      "severity": "critical",
    }
    

    Incident labels are included in the metric field of the JSON structure:

    "metric": {
      "type" : "compute.googleapis.com/instance/cpu/utilization"
      "displayName": "CPU Utilization",
      "labels": {
        "instance_name": "some_instance_name",
        "severity": "critical"
      },
    }
    

    As the previous example shows, the metric field lists the metric type, the display name for the metric, the metric labels, and any user-defined labels added to the incident.
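
If you handle these notifications programmatically, you can read the severity from either set of labels. The following sketch assumes that the notification has already been decoded into a Python dictionary that contains the fields shown previously; the exact envelope around these fields differs by notification channel.

# A sketch: read the severity from a decoded notification payload.
# It assumes the payload is a dict that contains the fields shown above;
# how you obtain and decode the payload depends on the notification channel.
def get_severity(payload: dict, default: str = "info") -> str:
    # Prefer a label set dynamically on the incident, then fall back to a
    # static label defined on the alerting policy.
    incident_severity = payload.get("metric", {}).get("labels", {}).get("severity")
    policy_severity = payload.get("policy_user_labels", {}).get("severity")
    return incident_severity or policy_severity or default


# Example that uses the fragments shown earlier in this section.
payload = {
    "policy_user_labels": {"severity": "critical"},
    "metric": {
        "type": "compute.googleapis.com/instance/cpu/utilization",
        "displayName": "CPU Utilization",
        "labels": {"instance_name": "some_instance_name", "severity": "critical"},
    },
}
print(get_severity(payload))  # critical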

Example: report severity level with labels

This section provides two examples that illustrate how you can use labels to include severity information in a notification.

Assume that you want to be notified when the CPU utilization of a VM exceeds a threshold, and you want the following severity information included with the notification:

  • critical: CPU utilization is at least 90%.
  • warning: CPU utilization is at least 80% but less than 90%.
  • info: CPU utilization is at least 70% but less than 80%.

Create static severity levels

You create three alerting policies. For each policy, you configure its condition to trigger when the CPU utilization is higher than a threshold. Also, for each policy, you add a severity label whose value determines the threshold of the policy.

You create three policies because labels on alerting policies have static values. Therefore, you must create one policy for each value of the label. A sketch that creates all three policies by using the client library follows this list.

  • Policy A: The condition triggers when the CPU utilization is at least 90%. The severity label is set to critical:

    "userLabels": {
       "severity": "critical",
    }
    
  • Policy B: The condition triggers when the CPU utilization is at least 80%. The severity label is set to warning:

    "userLabels": {
       "severity": "warning",
    }
    
  • Policy C: The condition triggers when the CPU utilization is at least 70%. The severity label is set to info:

    "userLabels": {
       "severity": "info",
    }
    

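For reference, the following sketch creates the three policies by using the Cloud Monitoring API client library for Python. The project ID, display names, filter, and aggregation settings are illustrative assumptions; adjust them to match your environment.

# A sketch: create the three static-severity policies described above.
# The project ID, display names, filter, and aggregation settings are
# illustrative placeholders.
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

client = monitoring_v3.AlertPolicyServiceClient()
project_name = "projects/PROJECT_ID"

# (CPU-utilization threshold as a fraction, severity label value)
levels = [(0.9, "critical"), (0.8, "warning"), (0.7, "info")]

for threshold, severity in levels:
    policy = monitoring_v3.AlertPolicy(
        display_name=f"VM CPU utilization - {severity}",
        combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
        conditions=[
            monitoring_v3.AlertPolicy.Condition(
                display_name=f"CPU utilization above {int(round(threshold * 100))}%",
                condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                    filter=(
                        'resource.type = "gce_instance" AND '
                        'metric.type = "compute.googleapis.com/instance/cpu/utilization"'
                    ),
                    # COMPARISON_GT triggers when utilization is above the threshold.
                    comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                    threshold_value=threshold,
                    duration=duration_pb2.Duration(seconds=300),
                    aggregations=[
                        monitoring_v3.Aggregation(
                            alignment_period=duration_pb2.Duration(seconds=300),
                            per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_MEAN,
                        )
                    ],
                ),
            )
        ],
        # The policy label is static: its value is fixed for this policy.
        user_labels={"severity": severity},
    )
    created = client.create_alert_policy(name=project_name, alert_policy=policy)
    print(f"Created {created.name} with severity={severity}")

Because the label value is part of each policy definition, changing a severity level means updating or recreating the corresponding policy.
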
For this example, when the CPU utilization is at least 90%, you receive three notifications, one from each policy. You can use the value of the severity label to determine which incident to investigate first.

When the CPU utilization falls to a value less than 70%, Monitoring automatically closes all open incidents. For information about incident closure, see Closing incidents.

Create dynamic severity levels using MQL

When you create labels on incidents, you can use one policy and let the value of the data determine the value of the label included in the notification. That is, you don't need to create a different alerting policy for each value that your label can have.

For example, consider the following MQL query, which adds a Severity label:

fetch gce_instance
| metric 'compute.googleapis.com/instance/cpu/utilization'
| group_by sliding(5m), [value_utilization_mean: mean(value.utilization)]
| map
   add[
     Severity:
       if(val() >= 90 '%', 'CRITICAL',
         if(val() >= 80 '%' && val() < 90 '%', 'WARNING', 'INFO'))]
| condition val() >= 70 '%'

The following figure illustrates how alerting policies with MQL conditions process the time-series data that they monitor:

Illustration of how alerting policies process their monitored time series.

The policy handler processes the CPU utilization data and outputs a time series that indicates when the condition is triggered. In the previous example, the condition is triggered when the CPU utilization is at least 70%. For each input time series, the policy handler can generate one of four time series:

  • "Good": The condition isn't triggered. This time series has the same labels as the input time series. It doesn't have a severity label.

  • "Critical": The condition is triggered because the CPU utilization is at least 90%. The output time series has the same labels as the "Good" time series plus a severity label with the value of "CRITICAL".

  • "Warning": The condition is triggered because the CPU utilization is at least 80% but less than 90%. The output time series has the same labels as the "Good" time series plus a severity label with the value of "WARNING".

  • "Info": The condition is triggered because the CPU utilization is at least 70% but less than 80%. The output time series has the same labels as the "Good" time series plus a severity label with the value of "INFO".

The time-series data generated by the policy handler is the input to the incident manager, which determines when incidents are created and closed. To determine when to close an incident, the incident manager uses the values of the duration, evaluationMissingData, and autoClose fields.

Best practices

When you create labels whose values are set dynamically, do the following to ensure that at most one incident is open at a time (a configuration sketch follows this list):

  • In the MetricThreshold object, override the default values for the following fields:

    • duration field: Set to a non-zero value.
    • evaluationMissingData field: Set so that incidents are closed when data stops arriving. When you use the Cloud Monitoring API, set this field to EVALUATION_MISSING_DATA_INACTIVE. When you use the Google Cloud console, set the field to "Missing data points treated as values that don't violate the policy condition".
  • In the AlertStrategy object, set the autoClose field to its minimum value of 30 minutes. When you use the Cloud Monitoring API, set this field to 30m.
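
For example, the following sketch, which uses the Cloud Monitoring API client library for Python, creates a policy with the MQL condition from the previous section and applies these recommendations. The MQL condition object, MonitoringQueryLanguageCondition, exposes the same duration and evaluationMissingData fields; the project ID and display names are illustrative placeholders.

# A sketch: an alerting policy that uses the MQL condition from the previous
# example and applies the recommended duration, evaluationMissingData, and
# autoClose settings. The project ID and display names are placeholders.
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

MQL_QUERY = """
fetch gce_instance
| metric 'compute.googleapis.com/instance/cpu/utilization'
| group_by sliding(5m), [value_utilization_mean: mean(value.utilization)]
| map
    add[
      Severity:
        if(val() >= 90 '%', 'CRITICAL',
          if(val() >= 80 '%' && val() < 90 '%', 'WARNING', 'INFO'))]
| condition val() >= 70 '%'
"""

client = monitoring_v3.AlertPolicyServiceClient()

policy = monitoring_v3.AlertPolicy(
    display_name="VM CPU utilization with dynamic severity",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="CPU utilization is at least 70%",
            condition_monitoring_query_language=monitoring_v3.AlertPolicy.Condition.MonitoringQueryLanguageCondition(
                query=MQL_QUERY,
                # Recommended: a non-zero duration.
                duration=duration_pb2.Duration(seconds=300),
                # Recommended: close incidents when data stops arriving.
                evaluation_missing_data=monitoring_v3.AlertPolicy.Condition.EvaluationMissingData.EVALUATION_MISSING_DATA_INACTIVE,
            ),
        )
    ],
    # Recommended: the minimum autoClose value of 30 minutes (1800 seconds).
    alert_strategy=monitoring_v3.AlertPolicy.AlertStrategy(
        auto_close=duration_pb2.Duration(seconds=1800),
    ),
)

created = client.create_alert_policy(name="projects/PROJECT_ID", alert_policy=policy)
print(f"Created {created.name}")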

For more information, see Partial metric data.

Incident flow

Suppose that the CPU utilization measurements are less than 70% when the alerting policy is created. The following sequence illustrates how incidents are opened and closed:

  1. Because the CPU utilization measurements are less than 70%, the policy handler generates the "Good" time series and no incidents are opened.

  2. Next, assume the CPU utilization rises to 93%. The policy handler stops generating the "Good" time-series data and starts generating data for the "Critical" time series.

    The incident manager sees a new time series that is triggering the condition, the "Critical" time series, and it creates an incident. The notification includes the severity label with a value of CRITICAL.

  3. Assume the CPU utilization falls to 75%. The policy handler stops generating the "Critical" time series and starts generating the "Info" time series.

    The incident manager sees a new time series that is triggering the condition, the "Info" time series, and it opens an incident. The notification includes the severity label with a value of INFO.

    The incident manager sees that no data is arriving for the "Critical" time series and that an incident is open for that time series. Because the policy is configured to close incidents when data stops arriving, the incident manager closes the incident associated with the "Critical" time series. Therefore, only the incident whose severity label has a value of INFO remains open.

  4. Finally, assume that the CPU utilization falls to 45%. This value is less than all thresholds, so the policy handler stops generating the "Info" time series and starts generating the "Good" time series.

    The incident manager sees that no data is arriving for the "Info" time series and that an incident is open for that time series. Because the policy is using the recommended settings, the incident is closed.

If you don't use the recommended value for the evaluationMissingData field, then when data stops arriving, open incidents aren't closed immediately. The result is that you might see multiple open incidents for the same input time series. For more information, see Partial metric data.

What's next