Annotate alerts with labels

This document describes how you can manage your notifications and incidents by adding user-defined labels to an alerting policy. Because user-defined labels are included in notifications, labels that indicate the severity of an incident give you information that can help you prioritize alerts for investigation.

About labels

Labels are key-value pairs that are associated with time series, alerting policies, and incidents. There are metric labels, resource labels, and user-defined labels. Metric and resource labels contain specific information about the metric being collected or the resource against which the metric is written. In contrast, user-defined labels are labels that you create to record information specific to your needs.

When time-series data is written, labels are attached to the data to record information about that data. For example, the labels on a time series might identify a virtual machine (VM), a zone, a Google Cloud project, and a device type.
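
For example, the following fragment sketches how such labels might appear on a CPU utilization time series written for a Compute Engine VM, in the JSON form used by the Cloud Monitoring API; the label values are placeholders:

"metric": {
  "type": "compute.googleapis.com/instance/cpu/utilization",
  "labels": {
    "instance_name": "some_instance_name"
  }
},
"resource": {
  "type": "gce_instance",
  "labels": {
    "project_id": "example-project",
    "zone": "us-central1-a",
    "instance_id": "1234567890123456789"
  }
}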

You can add user labels to alerting policies and to incidents:

  • To add labels to an alerting policy, use the Google Cloud console or the Cloud Monitoring API.

  • To add labels to an incident, configure the condition of the alerting policy so that the label is attached to the time series that the condition monitors. For an example that uses MQL, see Example: create dynamic severity levels using labels and MQL.
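
For example, when you use the Cloud Monitoring API, you set the userLabels field of the AlertPolicy resource. The following fragment is a minimal sketch; the display name and the label key-value pair are illustrative:

"displayName": "High CPU utilization",
"userLabels": {
  "severity": "critical"
}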

View labels in notifications

You can view the labels of an alerting policy or an incident on the incident details page, on the alerting policy details page, and in some notifications:

  • In email notifications, the labels that you add to the policy are listed in the Policy labels section, while the labels that you add to an incident are listed in the Metric labels section.

  • In PagerDuty, Webhooks, and Pub/Sub notifications, the labels you add to an alerting policy or incident are included in the JSON data. Alerting policy labels are listed in the policy_user_labels field of the JSON structure:

    "policy_user_labels": {
      "severity": "critical",
    }
    

    Incident labels are included in the metric field of the JSON structure:

    "metric": {
      "type" : "compute.googleapis.com/instance/cpu/utilization"
      "displayName": "CPU Utilization",
      "labels": {
        "instance_name": "some_instance_name",
        "severity": "critical"
      },
    }
    

    As previously shown, the metric field lists a metric type, the display name for the metric, the metric labels, and any user-defined labels added to the incident.

Example: create dynamic severity levels using labels and MQL

You can use MQL to configure a label so that its value changes dynamically based on the time-series data. For example, suppose you want your incidents to have a severity label whose value depends on the value of the monitored CPU utilization metric. The following MQL query configures such a label:

fetch gce_instance
| metric 'compute.googleapis.com/instance/cpu/utilization'
| group_by sliding(5m), [value_utilization_mean: mean(value.utilization)]
| map
    add[
      severity:
        if(val() >= 90 '%', 'CRITICAL',
          if(val() >= 80 '%', 'WARNING',
            if(val() >= 70 '%', 'INFO', 'GOOD')))
    ]
| condition val() >= 70 '%'
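
When you use the Cloud Monitoring API, a query like this one becomes the value of the query field of a MonitoringQueryLanguageCondition. The following fragment is a minimal sketch of such a condition; the display name and duration are illustrative, and the query value is abbreviated here (it would contain the full query shown previously):

"conditions": [
  {
    "displayName": "CPU utilization with dynamic severity",
    "conditionMonitoringQueryLanguage": {
      "query": "fetch gce_instance | metric 'compute.googleapis.com/instance/cpu/utilization' | ... | condition val() >= 70 '%'",
      "duration": "300s"
    }
  }
]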

The following figure illustrates how alerting policies that use MQL queries process the time-series data they monitor:

Illustration of how alerting policies process their monitored time series.

The policy handler processes the CPU utilization data and outputs a time series that indicates when the condition is triggered. In the previous example, the condition is triggered when the CPU utilization is at least 70%. For each input time series, the policy handler can generate one of four time series:

  • "GOOD" (condition not triggered): This time series has the same labels as the input time series. It doesn't have a severity label.

  • "CRITICAL" (condition triggered): The CPU utilization is at least 90%. The output time series has the same labels as the "GOOD" time series plus a severity label with the value of "CRITICAL".

  • "WARNING" (condition triggered): The CPU utilization is at least 80% but less than 90%. The output time series has the same labels as the "GOOD" time series plus a severity label with the value of "WARNING".

  • "INFO" (condition triggered): The CPU utilization is at least 70% but less than 80%. The output time series has the same labels as the "GOOD" time series plus a severity label with the value of "INFO".

The time-series data generated by the policy handler is the input to the incident manager, which determines when incidents are created and closed. To determine when to close an incident, the incident manager uses the values of the duration, evaluationMissingData, and autoClose fields.

Best practices

To ensure that at most one incident is open at a time when you create labels whose values are set dynamically, do the following:

  • In the MetricThreshold object, override the default values for the following fields:

    • duration field: Set to a non-zero value.
    • evaluationMissingData field: Set so that incidents are closed when data stops arriving. When you use the Cloud Monitoring API, set this field to EVALUATION_MISSING_DATA_INACTIVE. When you use the Google Cloud console, set the field to "Missing data points treated as values that don't violate the policy condition".
  • In the AlertStrategy object, set the autoClose field to its minimum value of 30 minutes. When you use the Cloud Monitoring API, set this field to 1800s.
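
For example, for the MQL-based policy described in the previous section, the condition is a MonitoringQueryLanguageCondition, which also has duration and evaluationMissingData fields. The following fragment is a sketch of these recommendations; the duration value is illustrative and the query value is abbreviated:

"conditions": [
  {
    "conditionMonitoringQueryLanguage": {
      "query": "fetch gce_instance | metric 'compute.googleapis.com/instance/cpu/utilization' | ... | condition val() >= 70 '%'",
      "duration": "300s",
      "evaluationMissingData": "EVALUATION_MISSING_DATA_INACTIVE"
    }
  }
],
"alertStrategy": {
  "autoClose": "1800s"
}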

For more information, see Partial metric data.

Incident flow

Suppose that the CPU utilization measurements are less than 70% when the alerting policy is created. The following sequence illustrates how incidents are opened and closed:

  1. Because the CPU utilization measurements are less than 70%, the policy handler generates the "GOOD" time series and no incidents are opened.

  2. Next, assume the CPU utilization rises to 93%. The policy handler stops generating the "GOOD" time-series data and starts generating data for the "CRITICAL" time series.

    The incident manager sees a new time series that is triggering the condition, the "CRITICAL" time series, and it creates an incident. The notification includes the severity label with a value of CRITICAL.

  3. Assume the CPU utilization falls to 75%. The policy handler stops generating the "CRITICAL" time series and starts generating the "INFO" time series.

    The incident manager sees a new time series that is triggering the condition, the "INFO" time series, and it opens an incident. The notification includes the severity label with a value of INFO.

    The incident manager sees that no data is arriving for the "CRITICAL" time series and that an incident is open for that time series. Because the policy is configured to close incidents when data stops arriving, the incident manager closes the incident associated with the "CRITICAL" time series. Therefore, only the incident whose severity label has a value of INFO remains open.

  4. Finally, assume that the CPU utilization falls to 45%. This value is less than all thresholds, so the policy handler stops generating the "INFO" time series and starts generating the "GOOD" time series.

    The incident manager sees that no data is arriving for the "INFO" time series and that an incident is open for that time series. Because the policy is using the recommended settings, the incident is closed.

If you don't use the recommended value for the evaluationMissingData field, then when data stops arriving, open incidents aren't closed immediately. The result is that you might see multiple open incidents for the same input time series. For more information, see Partial metric data.

What's next