Troubleshooting alerting policies

This page explains why some alerting policies might behave differently than intended, and offers possible remedies for those situations.

For information about the variables that can affect an alerting policy, such as the choice of duration window, see Alerting behavior.

Disk-utilization policy creates unexpected incidents

You created an alerting policy to monitor the "used" capacity of the disks in your system. This policy monitors the metric agent.googleapis.com/disk/percent_used. You expect to be notified only when the utilization of any physical disk exceeds the threshold you set in the condition. However, this policy is creating incidents when the disk utilization of every physical disk is less than the threshold.

A known cause of unexpected incidents for these policies is that the conditions aren't restricted to monitoring physical disks. Instead, these policies monitor all disks, including virtual disks such as loopback devices. If a virtual disk is constructed such that its utilization is 100%, then that disk causes the policy to create an incident.

For example, consider the following output of the Linux df command, which shows the disk space available on mounted file systems, for one system:

$ df
/dev/root     9983232  2337708  7629140   24%  /
devtmpfs      2524080        0  2524080    0%  /dev
tmpfs         2528080        0  2528080    0%  /dev/shm
...
/dev/sda15     106858     3934   102924    4%  /boot/efi
/dev/loop0      56704    56704        0  100%  /snap/core18/1885
/dev/loop1     129536   129536        0  100%  /snap/google-cloud-sdk/150
...

For this system, a disk-utilization alerting policy should be configured to filter out the time series for the loopback devices /dev/loop0 and /dev/loop1.
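
One way to exclude these time series, if you define the condition with MQL, is to filter on the metric's device label. The following is a minimal sketch; it assumes that the loopback devices are reported with device values that match the pattern loop.*, so adjust the filter to match the device values on your system:

fetch gce_instance::agent.googleapis.com/disk/percent_used
| filter metric.device !~ 'loop.*'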

Uptime policy doesn't create expected alerts

You want to be notified if a virtual machine (VM) reboots or shuts down, so you create an alerting policy that monitors the metric compute.googleapis.com/instance/uptime. You configure the condition to generate an incident when there is no metric data, and you don't define the condition by using Monitoring Query Language (MQL).¹ However, you aren't notified when the VM reboots or shuts down.

This alerting policy only monitors time series for Compute Engine VM instances that are in the RUNNING state. Time series for VMs that are in any other state, such as STOPPED or DELETED, are filtered out before the condition is evaluated. Because of this behavior, you can't use an alerting policy with a metric-absence alerting condition to determine if a VM instance is running. For information on VM instance states, see VM instance life cycle.

To resolve this problem, create an alerting policy to monitor an uptime check. To create an uptime check, your VM must have an external IP address.

If your VM doesn't have an external IP address, then you can create an alerting policy with an MQL-defined condition that notifies you when the VM has been shut down. Because MQL-defined conditions don't pre-filter time-series data based on the state of the VM instance, you can use them to detect the absence of data from VMs that have been shut down.

Consider the following MQL condition, which monitors the compute.googleapis.com/instance/cpu/utilization metric:

fetch gce_instance::compute.googleapis.com/instance/cpu/utilization
| absent_for 3m

If a VM monitored by this condition is shut down, then three minutes later, an incident is generated and notifications are sent. The absent_for value must be at least three minutes.
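
If you want the condition to apply only to specific VMs, one option is to add a filter before the absent_for operation. The following sketch is a possible variation; INSTANCE_ID is a placeholder for the numeric ID of the VM that you want to monitor:

fetch gce_instance::compute.googleapis.com/instance/cpu/utilization
| filter resource.instance_id == 'INSTANCE_ID'
| absent_for 3m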

For more information about MQL, see Alerting policies with MQL.

¹ MQL is an expressive text-based language that can be used with Cloud Monitoring API calls and in the Google Cloud Console. To configure a condition with MQL when you use the Cloud Console, you must select Query Editor.

Common causes for anomalous incidents

You created an alerting policy and the policy appears to prematurely or incorrectly create incidents.

If there is a gap in the data, particularly for alerting policies with metric-absence or “less than” threshold conditions, then an incident can be created that appears to be anomalous. Determining whether a gap exists in the data might not be easy. Sometimes the gap is obscured, and sometimes it is automatically corrected:

  • In charts, for example, gaps might be obscured because the values for missing data are interpolated. Even when several minutes of data are missing, the chart draws a line across the gap for visual continuity. Such a gap in the underlying data might be enough for an alerting policy to create an incident.

  • Points in logs-based metrics can arrive late and be backfilled for up to 10 minutes in the past. The backfill behavior effectively corrects the gap: the gap is filled in when the data finally arrives. Thus, a gap in a logs-based metric that is no longer visible might still have caused an alerting policy to create an incident.

Metric-absence and “less than” threshold conditions are evaluated in real time, with a small query delay. The status of the condition can change between the time it is evaluated and the time the corresponding incident is visible in Monitoring.

To require multiple measurements before an incident is created, set the duration field of a condition to more than double the metric's sampling period. For example, if a metric is sampled every 60 seconds, then set the duration to at least 3 minutes. If you set the duration field to most recent value, or equivalently to 0 seconds, then a single measurement can cause an incident to be created.

Multi-condition policy creates multiple notifications

You created an alerting policy that contains multiple conditions, and you joined those conditions with a logical AND. You expect to get one notification and have one incident created when all conditions are met. However, you receive multiple notifications and see that multiple incidents are created.

When an alerting policy that contains multiple AND-joined conditions triggers, the policy creates an incident and sends a notification for each time series that causes a condition to be met. For example, if you have a policy with two conditions and each condition monitors one time series, then two incidents are opened and you receive two notifications.

In this situation, you can't configure Cloud Monitoring to create a single incident and send a single notification.

For more information, see Notifications per incident.

Unable to view incident details due to a permission error

You navigate to the Incidents page in the Google Cloud Console and select an incident to view. You expect the details page to open. However, the details page fails to open and a "Permission denied" message is displayed.

To resolve this situation, ensure that your Identity and Access Management (IAM) role is roles/monitoring.viewer or one that includes all permissions of that role. For example, the roles roles/monitoring.editor and roles/monitoring.admin include all permissions of the viewer role.

Custom roles can't grant the permission required to view incident details.

Unable to manually close an incident

You received a notification of an incident on your system. You go to the incident details page and click Close incident. You expect the incident to be closed; however, you receive the following error message:

Unable to close incident with active conditions.

You can close an incident only when no observations have arrived in the most recent alerting period. The alerting period, which typically has a default value of 5 minutes, is defined as part of the alerting policy condition and is configurable. The previous error message indicates that data has been received within the alerting period.

The following error occurs when an incident can't be closed due to an internal error:

Unable to close incident. Please try again in a few minutes.

When you see the previous error message, you can retry the close operation or let Monitoring automatically close the incident.

For more information, see Managing incidents.

Webhook notifications fail when configured for a private endpoint

You configure a webhook notification for a private endpoint and expect to be notified when incidents occur. However, you don't receive any notifications.

You can't use webhooks for notifications unless the endpoint is public.

To resolve this situation, use Pub/Sub notifications combined with a pull subscription to that notification topic.

When you configure a Pub/Sub notification channel, incident notifications are sent to a Pub/Sub queue that has Identity and Access Management controls. Any service that can query for, or listen to, a Pub/Sub topic can consume these notifications. For example, applications running on App Engine, Cloud Run, or Compute Engine virtual machines can consume these notifications.

If you use a pull subscription, then your application sends requests to Google that wait for messages to arrive. These subscriptions require access to Google, but they don't require firewall rules or inbound access.
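
As an illustration, the following Python sketch uses the Pub/Sub client library to pull incident notifications from such a subscription. The project and subscription IDs are placeholders, and the sketch assumes that a Pub/Sub notification channel already publishes to the topic that the subscription is attached to:

from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

# Placeholder values; replace with your project and subscription IDs.
project_id = "my-project"
subscription_id = "monitoring-notifications-sub"

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    # Each message body is a JSON document that describes the incident.
    print(f"Received notification: {message.data.decode('utf-8')}")
    message.ack()

# Open a streaming pull; the client sends requests to Google and waits for
# notifications to arrive, so no inbound firewall rules are needed.
streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)

with subscriber:
    try:
        streaming_pull_future.result(timeout=300)
    except TimeoutError:
        streaming_pull_future.cancel()
        streaming_pull_future.result()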