Troubleshooting alerting policies

This page explains why some alerting policies might behave differently than intended, and offers possible remedies for those situations.

For information about the variables that can affect an alerting policy, such as the choice of duration window, see Alerting behavior.

Disk-utilization policy creates unexpected incidents

You created an alerting policy to monitor the "used" capacity of the disks in your system. This policy monitors the metric agent.googleapis.com/disk/percent_used. You expect to be notified only when the utilization of any physical disk exceeds the threshold you set in the condition. However, this policy is creating incidents when the disk utilization of every physical disk is less than the threshold.

A known cause of unexpected incidents for these policies is that the conditions aren't restricted to monitoring physical disks. Instead, these policies monitor all disks, including virtual disks such as loopback devices. If a virtual disk is constructed such that its utilization is 100%, then that disk causes the policy to create an incident.

For example, consider the following output of the Linux df command, which shows the disk space available on mounted file systems, for one system:

$ df
/dev/root     9983232  2337708  7629140   24%  /
devtmpfs      2524080        0  2524080    0%  /dev
tmpfs         2528080        0  2528080    0%  /dev/shm
...
/dev/sda15     106858     3934   102924    4%  /boot/efi
/dev/loop0      56704    56704        0  100%  /snap/core18/1885
/dev/loop1     129536   129536        0  100%  /snap/google-cloud-sdk/150
...

For this system, a disk-utilization alerting policy should be configured to filter out the time series for the loopback devices /dev/loop0 and /dev/loop1.
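
One way to exclude these time series, if you define the condition with MQL, is to filter on the metric's device label. The following is a minimal sketch; it assumes that the loopback devices are reported with device values that match the pattern loop.*, so adjust the filter to match the device values on your system:

fetch gce_instance::agent.googleapis.com/disk/percent_used
| filter metric.device !~ 'loop.*'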

Uptime policy doesn't create expected alerts

You want to be notified if a virtual machine (VM) reboots or shuts down, so you create an alerting policy that monitors the metric compute.googleapis.com/instance/uptime. You configure the condition to generate an incident when there is no metric data, and you don't define the condition by using Monitoring Query Language (MQL).¹ However, you aren't notified when the VM reboots or shuts down.

This alerting policy only monitors time series for Compute Engine VM instances that are in the RUNNING state. Time series for VMs that are in any other state, such as STOPPED or DELETED, are filtered out before the condition is evaluated. Because of this behavior, you can't use an alerting policy with a metric-absence alerting condition to determine if a VM instance is running. For information on VM instance states, see VM instance life cycle.

To resolve this problem, create an alerting policy to monitor an uptime check. To create an uptime check, your VM must have an external IP address.

If your VM doesn't have an external IP address, then you can create an alerting policy with an MQL-defined condition that notifies you when the VM has been shut down. Because MQL-defined conditions don't pre-filter time-series data based on the state of the VM instance, you can use them to detect the absence of data from VMs that have been shut down.

Consider the following MQL condition, which monitors the compute.googleapis.com/instance/cpu/utilization metric:

fetch gce_instance::compute.googleapis.com/instance/cpu/utilization
| absent_for 3m

If a VM monitored by this condition is shut down, then three minutes later, an incident is generated and notifications are sent. The absent_for value must be at least three minutes.
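
If you want the condition to apply only to specific VMs, one option is to add a filter before the absent_for operation. The following sketch is a possible variation; INSTANCE_ID is a placeholder for the numeric ID of the VM that you want to monitor:

fetch gce_instance::compute.googleapis.com/instance/cpu/utilization
| filter resource.instance_id == 'INSTANCE_ID'
| absent_for 3m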

For more information about MQL, see Alerting policies with MQL.

¹ MQL is an expressive text-based language that can be used with Cloud Monitoring API calls and in the Google Cloud Console. To configure a condition with MQL when you use the Cloud Console, you must select Query Editor.

Common causes for anomalous incidents

You created an alerting policy and the policy appears to prematurely or incorrectly create incidents.

If there is a gap in the data, particularly for alerting policies with metric-absence or “less than” threshold conditions, then an incident can be created that appears to be anomalous. Determining whether a gap exists in the data might not be easy. Sometimes the gap is obscured, and sometimes it is automatically corrected:

  • In charts, for example, gaps might be obscured because the values for missing data are interpolated. Even when several minutes of data are missing, the chart draws a line across the gap for visual continuity. Such a gap in the underlying data might be enough for an alerting policy to create an incident.

  • Points in logs-based metrics can arrive late and be backfilled for up to 10 minutes in the past. The backfill behavior effectively corrects the gap: the gap is filled in when the data finally arrives. Thus, a gap in a logs-based metric that is no longer visible might still have caused an alerting policy to create an incident.

Metric-absence and “less than” threshold conditions are evaluated in real time, with a small query delay. The status of the condition can change between the time it is evaluated and the time the corresponding incident is visible in Monitoring.

To require multiple measurements before an incident is created, set the duration field of a condition to more than double the metric's sampling period. For example, if a metric is sampled every 60 seconds, then set the duration to at least 3 minutes. If you set the duration field to most recent value, or equivalently to 0 seconds, then a single measurement can cause an incident to be created.

Multi-condition policy creates multiple notifications

You created an alerting policy that contains multiple conditions, and you joined those conditions with a logical AND. You expect to get one notification and have one incident created when all conditions are met. However, you receive multiple notifications and see that multiple incidents are created.

When an alerting policy that contains multiple AND-joined conditions triggers, the policy creates an incident and sends a notification for each time series that causes a condition to be met. For example, if you have a policy with two conditions and each condition monitors one time series, then two incidents are opened and you receive two notifications.

In this situation, you can't configure Cloud Monitoring to create a single incident and send a single notification.

For more information, see Notifications per incident.

Unable to view incident details due to a permission error

You navigate to the Incidents page in the Google Cloud Console and select an incident to view. You expect the details page to open. However, the details page fails to open and a "Permission denied" message is displayed.

To resolve this situation, ensure that your Identity and Access Management (IAM) role is roles/monitoring.viewer or one that includes all permissions of that role. For example, the roles roles/monitoring.editor and roles/monitoring.admin include all permissions of the viewer role.

Custom roles can't grant the permission required to view incident details.

Unable to manually close an incident

You received a notification of an incident on your system. You go to the incident details page and click Close incident. You expect the incident to be closed; however, you receive the following error message:

Unable to close incident with active conditions.

You can close an incident only when no observations have arrived in the most recent alerting period. The alerting period, which typically has a default value of 5 minutes, is defined as part of the alerting policy condition and is configurable. The previous error message indicates that data has been received within the alerting period.

The following error occurs when an incident can't be closed due to an internal error:

Unable to close incident. Please try again in a few minutes.

When you see the previous error message, you can retry the close operation or let Monitoring automatically close the incident.

For more information, see Managing incidents.

Webhook notifications fail when configured for a private endpoint

You configure a webhook notification for a private endpoint and expect to be notified when incidents occur. However, you don't receive any notifications.

You can't use webhooks for notifications unless the endpoint is public.

To resolve this situation, use Pub/Sub notifications combined with a pull subscription to that notification topic.

When you configure a Pub/Sub notification channel, incident notifications are sent to a Pub/Sub queue that has Identity and Access Management controls. Any service that can query for, or listen to, a Pub/Sub topic can consume these notifications. For example, applications running on App Engine, Cloud Run, or Compute Engine virtual machines can consume these notifications.

If you use a pull subscription, then your application sends requests to Google that wait for messages to arrive. These subscriptions require access to Google, but they don't require firewall rules or inbound access.
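
As an illustration, the following Python sketch uses the Pub/Sub client library to pull incident notifications from such a subscription. The project and subscription IDs are placeholders, and the sketch assumes that a Pub/Sub notification channel already publishes to the topic that the subscription is attached to:

from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

# Placeholder values; replace with your project and subscription IDs.
project_id = "my-project"
subscription_id = "monitoring-notifications-sub"

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    # Each message body is a JSON document that describes the incident.
    print(f"Received notification: {message.data.decode('utf-8')}")
    message.ack()

# Open a streaming pull; the client sends requests to Google and waits for
# notifications to arrive, so no inbound firewall rules are needed.
streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)

with subscriber:
    try:
        streaming_pull_future.result(timeout=300)
    except TimeoutError:
        streaming_pull_future.cancel()
        streaming_pull_future.result()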