Troubleshooting alerting policies

This page explains why some alerting policies might behave differently than intended, and offers possible remedies for those situations.

For information about how alerting policies are affected by the metrics and resources monitored by conditions, the duration window for conditions, and when incidents are created, see Alerting behavior.

Disk-utilization policy creates unexpected incidents

You created an alerting policy to monitor the "used" capacity of the disks in your system. This policy monitors the metric agent.googleapis.com/disk/percent_used. You expect to be notified only when the utilization of a physical disk exceeds the threshold set in the condition; however, the policy creates incidents even when the utilization of every physical disk is below the threshold.

A known cause of unexpected incidents for these policies is that the condition isn't restricted to monitoring physical disks. Instead, the policy monitors all disks, including virtual disks such as loopback devices. If a virtual disk is constructed such that its utilization is 100%, then the condition is met and the policy creates an incident.

For example, consider the following output of the Linux df command, which reports the disk space available on the mounted file systems of one system:

$ df
Filesystem  1K-blocks     Used    Avail  Use%  Mounted on
/dev/root     9983232  2337708  7629140   24%  /
devtmpfs      2524080        0  2524080    0%  /dev
tmpfs         2528080        0  2528080    0%  /dev/shm
...
/dev/sda15     106858     3934   102924    4%  /boot/efi
/dev/loop0      56704    56704        0  100%  /snap/core18/1885
/dev/loop1     129536   129536        0  100%  /snap/google-cloud-sdk/150
...

For this system, a disk-utilization alerting policy should be configured to filter out the time series for the loopback devices /dev/loop0 and /dev/loop1.
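
For example, if you define the condition by using MQL, then you can exclude these devices by filtering on the device label of the metric. The following query is only a sketch: the loop.* regular expression, the 90% threshold, and the 1-minute alignment window are examples that you might need to adjust for your system:

fetch gce_instance::agent.googleapis.com/disk/percent_used
| filter metric.state == 'used' && !(metric.device =~ 'loop.*')
| group_by 1m, [value_percent_used_mean: mean(value.percent_used)]
| every 1m
| condition val() > 90 '%'

Because the group_by operation doesn't drop any labels, each remaining disk is still evaluated as a separate time series.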

Uptime policy doesn't create expected alerts

You want to be notified if a virtual machine (VM) reboots or shuts down, so you create an alerting policy that monitors the metric compute.googleapis.com/instance/uptime. You configure the condition to generate an incident when there is no metric data, and you don't define the condition by using Monitoring Query Language (MQL)1. However, you aren't notified when the VM reboots or shuts down.

This alerting policy monitors only time series for Compute Engine VM instances that are in the RUNNING state. Time series for VMs in any other state, such as STOPPED or DELETED, are filtered out before the condition is evaluated. As a result, you can't use an alerting policy with a metric-absence condition to determine whether a VM instance is running. For information about VM instance states, see VM instance life cycle.

To resolve this problem, create an uptime check that is monitored by an alerting policy.
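
For example, one way to alert on the results of the uptime check is to monitor the metric monitoring.googleapis.com/uptime_check/check_passed by using MQL. The following query is only a sketch: it assumes that the uptime check targets a Compute Engine VM instance, and the check ID my-check-id, the 5-minute window, and the 0.5 threshold are placeholders that you need to adjust:

fetch gce_instance::monitoring.googleapis.com/uptime_check/check_passed
| filter metric.check_id == 'my-check-id'
| group_by 5m, [value_check_passed_fraction_true: fraction_true(value.check_passed)]
| every 1m
| condition val() < 0.5

With this form, an incident is created when fewer than half of the check results in the window are successful.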

If you can't create an uptime check to monitor a VM because the VM doesn't have an external IP address, then you can create an alerting policy with an MQL condition that notifies you when the VM has been shut down. MQL-defined conditions don't pre-filter time-series data based on the state of the VM instance, so you can use them to detect the absence of data from VMs that have been shut down.

Consider the following MQL condition, which monitors the metric compute.googleapis.com/instance/cpu/utilization:

fetch gce_instance::compute.googleapis.com/instance/cpu/utilization
| absent_for 3m

If a VM is running and then it is shut down, then an incident is created and notifications are sent three minutes after the shutdown. Note that the absent_for value must be at least three minutes.

For more information about MQL, see Alerting policies with MQL.

1: MQL is an expressive, text-based language that you can use with Cloud Monitoring API calls and in the Google Cloud Console. To configure a condition with MQL in the Cloud Console, select Query Editor.

Common causes for anomalous incidents

You created an alerting policy, and the policy appears to create incidents prematurely or incorrectly.

If there is a gap in the data, then an alerting policy, particularly one with a metric-absence condition or a “less than” threshold condition, can create an incident that appears to be anomalous. Determining whether a gap exists in the data isn't always easy, because sometimes the gap is obscured and sometimes it is corrected automatically:

  • In charts, for example, gaps might be obscured because missing values are interpolated. Even when several minutes of data are missing, the chart connects the points on either side of the gap for visual continuity. Such a gap in the underlying data might be enough for an alerting policy to create an incident.

  • Points in logs-based metrics can arrive late and be backfilled by up to 10 minutes. The backfill effectively corrects the gap: it is filled in when the late data finally arrives. Therefore, a gap in a logs-based metric that you can no longer see might still have caused an alerting policy to create an incident.

Metric-absence and “less than” threshold conditions are evaluated in real time, with a small query delay. The status of the condition can change between the time it is evaluated and the time the corresponding incident is visible in Monitoring.
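
To illustrate the kind of condition that is sensitive to such gaps, the following MQL sketch uses a “less than” threshold on CPU utilization; the metric, the 1-minute window, and the 10% threshold are examples only:

fetch gce_instance::compute.googleapis.com/instance/cpu/utilization
| group_by 1m, [value_utilization_mean: mean(value.utilization)]
| every 1m
| condition val() < 0.1 '10^2.%'

If several minutes of data are missing when this condition is evaluated, the resulting incident can appear anomalous even though the chart later looks continuous.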