This page explains why some alerting policies might behave differently than intended, and offers possible remedies for those situations.
For information about how alerting policies are affected by the metrics and resources monitored by conditions, the duration window for conditions, and when incidents are created, see Alerting behavior.
Disk-utilization policy creates unexpected incidents
You created an alerting policy to monitor the "used" capacity of the disks in your system. This policy monitors the metric agent.googleapis.com/disk/percent_used.
You expect to be notified only when the utilization of any physical disk
exceeds the threshold you set in the condition; however, this
policy is creating incidents when the disk utilization of every
physical disk is below the threshold.
A known cause of unexpected incidents for these policies is that the conditions aren't restricted to monitoring physical disks. Instead, these policies monitor all disks, including virtual disks such as loopback devices. If a virtual disk is constructed such that its utilization is 100%, then the policy creates an incident for that disk.
For example, consider the following output of the Linux df command, which shows the disk space available on mounted filesystems, for one system:
$ df
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/root        9983232 2337708   7629140  24% /
devtmpfs         2524080       0   2524080   0% /dev
tmpfs            2528080       0   2528080   0% /dev/shm
...
/dev/sda15        106858    3934    102924   4% /boot/efi
/dev/loop0         56704   56704         0 100% /snap/core18/1885
/dev/loop1        129536  129536         0 100% /snap/google-cloud-sdk/150
...
For this system, a disk-utilization alerting policy should be configured to filter out the time series for the loopback devices /dev/loop0 and /dev/loop1.
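For example, if you define the condition with MQL, then you can exclude the loopback devices by filtering on the metric's device label. The following query is only a sketch: it assumes the metric's device and state labels, and the 90% threshold and the loop.* device pattern are placeholders that you would adjust for your system.
# Sketch only: alert when the "used" portion of any non-loopback disk exceeds 90%.
fetch gce_instance::agent.googleapis.com/disk/percent_used
| filter metric.state == 'used' && !(metric.device =~ 'loop.*')
| every 1m
| condition val() > 90 '%'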
Uptime policy doesn't create expected alerts
You want to be notified if a virtual machine (VM) reboots or shuts down, so you
create an alerting policy that monitors the metric compute.googleapis.com/instance/uptime.
You create and configure the condition to generate an incident when there is no metric data. You don't define the condition by using Monitoring Query Language (MQL).¹
You aren't notified when the VM reboots or shuts down.
This alerting policy only monitors time series for Compute Engine VM instances that are in the RUNNING state. Time series for VMs that are in any other state, such as STOPPED or DELETED, are filtered out before the condition is evaluated. This means that you can't use an alerting policy with a metric-absence condition to determine whether a VM instance is running. For information about VM instance states, see VM instance life cycle.
To resolve this problem, create an uptime check, and then create an alerting policy that monitors that check.
When you can't create an uptime check to monitor a VM because the VM doesn't have an external IP address, you can create an alerting policy with an MQL condition that notifies you when the VM has been shut down. Because MQL-defined conditions don't pre-filter time-series data based on the state of the VM instance, you can use them to detect the absence of data from VMs that have been shut down.
Consider the following MQL condition, which monitors the metric compute.googleapis.com/instance/cpu/utilization:
fetch gce_instance::compute.googleapis.com/instance/cpu/utilization
| absent_for 3m
If a VM that is running is shut down, then three minutes after the shutdown, an incident is created and notifications are sent. Note that the absent_for value must be at least three minutes.
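If you want to be notified about only specific VMs, then you can add a filter before the absent_for operation. The following query is a sketch: my-important-vm is a placeholder name, and the filter relies on the metric's instance_name label.
# Sketch only: detect missing CPU data from a single VM; 'my-important-vm' is a placeholder.
fetch gce_instance::compute.googleapis.com/instance/cpu/utilization
| filter metric.instance_name == 'my-important-vm'
| absent_for 3m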
For more information about MQL, see Alerting policies with MQL.
¹ MQL is an expressive, text-based language that can be used with Cloud Monitoring API calls and in the Google Cloud Console. To configure a condition with MQL when you use the Google Cloud Console, you must select Query Editor.
Common causes for anomalous incidents
You created an alerting policy, and the policy appears to create incidents prematurely or incorrectly.
If there is a gap in the data, particularly for alerting policies with metric-absence or "less than" threshold conditions, then an incident that appears to be anomalous can be created. Determining whether a gap exists in the data might not be easy. Sometimes the gap is obscured, and sometimes it is automatically corrected:
In charts, for example, gaps might be obscured because the values for missing data are interpolated. Even when several minutes of data are missing, the chart draws a line across the gap for visual continuity. Such a gap in the underlying data might still be enough for an alerting policy to create an incident.
Points in logs-based metrics can arrive late and be backfilled for up to 10 minutes in the past. The backfill effectively corrects the gap: it is filled in when the data finally arrives. As a result, a gap in a logs-based metric that is no longer visible might have caused an alerting policy to create an incident.
Metric-absence and “less than” threshold conditions are evaluated in real time, with a small query delay. The status of the condition can change between the time it is evaluated and the time the corresponding incident is visible in Monitoring.