This page explains why some alerting policies might behave differently than intended, and offers possible remedies for those situations.
For information about the variables that can affect an alerting policy, such as the choice of duration window, see Alerting behavior.
Disk-utilization policy creates unexpected incidents
You created an alerting policy to monitor the "used" capacity of the disks
in your system. This policy monitors the metric
You expect to be notified only when the utilization of any physical disk
exceeds the threshold you set in the condition. However, this
policy is creating incidents when the disk utilization of every
physical disk is less than the threshold.
A known cause of unexpected incidents for these policies is that the conditions aren't restricted to monitoring physical disks. Instead, these policies monitor all disks, including virtual disks such as loopback devices. If a virtual disk is constructed such that its utilization is 100%, then the policy creates an incident for that disk.
For example, consider the following output of the Linux df command,
which shows the disk space available on mounted file systems, for one
system:
$ df
/dev/root     9983232  2337708  7629140   24%  /
devtmpfs      2524080        0  2524080    0%  /dev
tmpfs         2528080        0  2528080    0%  /dev/shm
...
/dev/sda15     106858     3934   102924    4%  /boot/efi
/dev/loop0      56704    56704        0  100%  /snap/core18/1885
/dev/loop1     129536   129536        0  100%  /snap/google-cloud-sdk/150
...
For this system, a disk-utilization alerting policy should be configured to
filter out the time series for the
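The effect of the loopback devices can be sketched in a few lines of Python. This is only an illustration of the reasoning, not how Cloud Monitoring evaluates a condition: the function names, the 90% threshold, and the `/dev/loop` prefix filter are assumptions chosen for the example, and the sample rows are copied from the df output above.

```python
# Sample rows copied from the df output above (elided rows omitted).
SAMPLE_DF = """\
/dev/root 9983232 2337708 7629140 24% /
devtmpfs 2524080 0 2524080 0% /dev
tmpfs 2528080 0 2528080 0% /dev/shm
/dev/sda15 106858 3934 102924 4% /boot/efi
/dev/loop0 56704 56704 0 100% /snap/core18/1885
/dev/loop1 129536 129536 0 100% /snap/google-cloud-sdk/150
"""


def parse_df(output):
    """Parse df-style rows into dicts; skip lines that aren't data rows."""
    rows = []
    for line in output.strip().splitlines():
        parts = line.split()
        # A data row has six columns and a percentage in the fifth one.
        if len(parts) != 6 or not parts[4].endswith("%"):
            continue
        rows.append({
            "device": parts[0],
            "use_pct": int(parts[4].rstrip("%")),
            "mount": parts[5],
        })
    return rows


def devices_over_threshold(rows, threshold_pct, exclude_prefixes=()):
    """Return devices at or above the threshold, minus excluded prefixes."""
    excluded = tuple(exclude_prefixes)
    return [
        row["device"]
        for row in rows
        if row["use_pct"] >= threshold_pct
        and not row["device"].startswith(excluded)
    ]


rows = parse_df(SAMPLE_DF)
# Without a filter, only the loopback devices exceed the threshold.
unfiltered = devices_over_threshold(rows, 90)
# With the loopback devices excluded, nothing exceeds the threshold.
filtered = devices_over_threshold(rows, 90, exclude_prefixes=("/dev/loop",))
```

In this sample, every device that crosses the hypothetical threshold is a loopback device, which matches the symptom described above: incidents are created even though no physical disk is near the threshold.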
Uptime policy doesn't create expected alerts
You want to be notified if a virtual machine (VM) reboots or shuts down, so you
create an alerting policy that monitors the metric
You create and configure the condition to generate an incident when there
is no metric data. You don't define the condition by using
Monitoring Query Language (MQL) [1].
However, you aren't notified when the VM reboots or shuts down.
This alerting policy only monitors time series for Compute Engine VM instances
that are in the
RUNNING state. Time series for VMs that are in any other
state, such as
DELETED, are filtered out before
the condition is evaluated. Because of this behavior, you can't use an alerting policy
with a metric-absence alerting condition to determine if a VM instance
is running. For information on VM instance states, see
VM instance life cycle.
To resolve this problem, create an alerting policy to monitor an uptime check. To create an uptime check, your VM must have an external IP address.
If your VM doesn't have an external IP address, then you can create an alerting policy with MQL that notifies you the VM has been shut down. MQL-defined conditions don't pre-filter time-series data based on the state of the VM instance. Because MQL doesn't filter data by VM states, you can use it to detect the absence of data from VMs that have been shut down.
Consider the following MQL condition, which monitors the
compute.googleapis.com/instance/cpu/utilization metric:
fetch gce_instance::compute.googleapis.com/instance/cpu/utilization
| absent_for 3m
If a VM monitored by this condition is shut down,
then three minutes later, an incident is generated and
notifications are sent. The
absent_for value must be at
least three minutes.
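If you generate conditions like this one from a script, a small helper can enforce the three-minute minimum before the query is submitted. The following Python sketch only builds the query text shown above. The function name and the constant are hypothetical, and the helper doesn't call any Cloud Monitoring API.

```python
# Minimum absent_for value stated in the text above, in minutes.
MIN_ABSENCE_MINUTES = 3


def build_absence_query(resource_type, metric_type, absence_minutes):
    """Return MQL text that fires when the metric is absent for N minutes."""
    if absence_minutes < MIN_ABSENCE_MINUTES:
        raise ValueError(
            f"absent_for must be at least {MIN_ABSENCE_MINUTES} minutes, "
            f"got {absence_minutes}"
        )
    return (
        f"fetch {resource_type}::{metric_type}\n"
        f"| absent_for {absence_minutes}m"
    )


query = build_absence_query(
    "gce_instance",
    "compute.googleapis.com/instance/cpu/utilization",
    3,
)
```

Rejecting short values locally is a design choice for this sketch: it surfaces the constraint at the point where the query is assembled rather than when the policy is saved.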
For more information about MQL, see Alerting policies with MQL.
[1] MQL is an expressive text-based language that can be used with Cloud Monitoring API calls and in the Google Cloud Console. To configure a condition with MQL when you use the Cloud Console, you must select Query Editor.
Common causes for anomalous incidents
You created an alerting policy and the policy appears to prematurely or incorrectly create incidents.
There are different reasons why you might receive notification of incidents that appear to be incorrect:
If there is a gap in data, particularly for those alerting policies with metric-absence or “less than” threshold conditions, then an incident can be created that appears to be anomalous. Determining if a gap exists in data might not be easy. Sometimes the gap is obscured, and sometimes it is automatically corrected:
In charts, for example, gaps might be obscured because the values for missing data are interpolated. Even when several minutes of data are missing, the chart connects missing points for visual continuity. Such a gap in the underlying data might be enough for an alerting policy to create an incident.
Points in logs-based metrics can arrive late and be backfilled, for up to 10 minutes in the past. The backfill behavior effectively corrects the gap; the gap is filled in when the data finally arrives. Thus, a gap in a logs-based metric that can no longer be seen might have caused an alerting policy to create an incident.
Metric-absence and “less than” threshold conditions are evaluated in real time, with a small query delay. The status of the condition can change between the time it is evaluated and the time the corresponding incident is visible in Monitoring.
Conditions that are configured to create an incident on a single measurement can result in incidents that appear to be premature or incorrect. To prevent this situation, ensure that multiple measurements are required before an incident is created by setting the duration field of a condition to more than double the metric's sampling period.
For example, if a metric is sampled every 60 seconds, then set the duration to at least 3 minutes. If you set the duration field to most recent value, or equivalently to 0 seconds, then a single measurement can cause an incident to be created.
When the condition of an alerting policy is edited, it can take several minutes for the change to propagate through the alerting infrastructure. During this time period, you might receive notification of incidents that met the original alerting policy conditions.
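The duration guidance above (more than double the sampling period) reduces to simple arithmetic. The following Python sketch is one way to check a proposed duration. The function names are hypothetical, and rounding up to whole minutes is an assumption made here so that the result matches the 60-second example in the text.

```python
def requires_multiple_measurements(duration_s, sampling_period_s):
    """True if the duration is more than double the sampling period."""
    return duration_s > 2 * sampling_period_s


def smallest_safe_duration_minutes(sampling_period_s):
    """Smallest whole number of minutes strictly greater than 2x the period."""
    return (2 * sampling_period_s) // 60 + 1


# A metric sampled every 60 seconds needs a duration of at least 3 minutes.
recommended = smallest_safe_duration_minutes(60)
# A duration of 0 seconds (most recent value) lets one measurement fire.
single_sample_risk = not requires_multiple_measurements(0, 60)
```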
Multi-condition policy creates multiple notifications
You created an alerting policy that contains multiple conditions, and you joined
those conditions with a logical
AND. You expect to get one notification and
have one incident created when all conditions are met. However, you
receive multiple notifications and see that multiple incidents are created.
When an alerting policy contains multiple conditions that are joined
by a logical
AND, if that policy triggers, then for each time series that
results in a condition being met, the policy sends a notification and creates
an incident. For example, if you have a policy with two conditions and each
condition is monitoring one time series, then two incidents are opened
and you receive two notifications.
You can't configure Cloud Monitoring to create a single incident and send a single notification.
For more information, see Notifications per incident.
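The incident count described above can be modeled with a short Python sketch. This is a simplified model of the documented behavior, not the service's implementation. The function name and the input shape (a mapping from condition name to the time series that currently meet it) are assumptions made for illustration.

```python
def incidents_for_and_policy(met_series_by_condition):
    """Model incidents for a policy whose conditions are joined by AND.

    The policy triggers only when every condition is met by at least one
    time series. When it triggers, one incident is opened for each time
    series that results in a condition being met.
    """
    if not met_series_by_condition:
        return []
    if not all(met_series_by_condition.values()):
        return []  # At least one condition isn't met, so nothing triggers.
    return [
        (condition, series)
        for condition, series in met_series_by_condition.items()
        for series in series
    ]


# Two conditions, each monitoring one time series that meets its condition.
incidents = incidents_for_and_policy({
    "cpu_high": ["vm-1/cpu"],
    "memory_high": ["vm-1/memory"],
})
```

In this model the example from the text produces two incidents, one per time series, even though the policy triggered only once.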
Unable to view incident details due to a permission error
You navigate to the incidents page in the Google Cloud Console and select an incident to view. You expect to have the details page open. However, the details page fails to open and a "Permission denied" message is displayed.
To resolve this situation, ensure that your Identity and Access Management (IAM)
role is roles/monitoring.viewer or one that includes all permissions
of that role. For example, the role
roles/monitoring.admin includes all permissions of the viewer role.
Custom roles can't grant the permission required to view incident details.
Unable to manually close an incident
You received a notification of an incident on your system. You go to the incident details page and click Close incident. You expect the incident to be closed; however, you receive the error message:
Unable to close incident with active conditions.
You can only close an incident when no observations arrive in the most recent alerting period. The alerting period, which typically has a default value of 5 minutes, is defined as part of the alerting policy condition and is configurable. The previous error message indicates that data has been received within the alerting period.
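The close rule can be expressed as a simple time comparison. The following Python sketch assumes you know when the most recent observation arrived. The function name is hypothetical, the 5-minute default mirrors the typical value mentioned above, and treating an observation exactly one period old as outside the window is an assumption.

```python
from datetime import datetime, timedelta

# Typical default alerting period mentioned in the text above.
DEFAULT_ALERTING_PERIOD = timedelta(minutes=5)


def can_close_incident(last_observation, now,
                       alerting_period=DEFAULT_ALERTING_PERIOD):
    """True if no observation arrived within the most recent period."""
    return now - last_observation >= alerting_period


now = datetime(2021, 1, 1, 12, 0, 0)
# Data arrived 2 minutes ago: the close request would be rejected.
recent = can_close_incident(now - timedelta(minutes=2), now)
# Data last arrived 10 minutes ago: the incident can be closed.
stale = can_close_incident(now - timedelta(minutes=10), now)
```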
The following error occurs when an incident can't be closed due to an internal error:
Unable to close incident. Please try again in a few minutes.
When you see the previous error message, you can retry the close operation or let Monitoring automatically close the incident.
For more information, see Managing incidents.
Webhook notifications fail
You configure a webhook notification channel and expect to be notified when incidents occur. You don't receive any notifications.
You can't use webhooks for notifications unless the endpoint is public. If your endpoint isn't public, consider using a Pub/Sub notification channel instead.
When you configure a Pub/Sub notification channel, incident notifications are sent to a Pub/Sub queue that has Identity and Access Management controls. Any service that can query for, or listen to, a Pub/Sub topic can consume these notifications. For example, applications running on App Engine, Cloud Run, or Compute Engine virtual machines can consume these notifications.
If you use a pull subscription, then your subscriber sends a request to Google and waits for a message to arrive. These subscriptions require access to Google, but they don't require firewall rules or inbound access.
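As a rough illustration of the pull approach, the following Python sketch pulls a batch of messages and extracts a few fields from each one. It assumes the google-cloud-pubsub client library is installed and that the project and subscription IDs you pass in already exist. The payload field names (`incident`, `incident_id`, `policy_name`, `state`) are assumptions about the notification format, so verify them against a real message before relying on them.

```python
import json


def summarize_notification(data):
    """Extract a few fields from a JSON notification payload (bytes).

    The field names used here are assumed, not confirmed by this page.
    """
    payload = json.loads(data.decode("utf-8"))
    incident = payload.get("incident", {})
    return {
        key: incident.get(key)
        for key in ("incident_id", "policy_name", "state")
    }


def pull_once(project_id, subscription_id, max_messages=10):
    """Pull one batch of notifications, summarize them, then acknowledge."""
    # Imported here so the parsing helper works without the client library.
    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    path = subscriber.subscription_path(project_id, subscription_id)
    response = subscriber.pull(
        request={"subscription": path, "max_messages": max_messages}
    )
    summaries = [
        summarize_notification(received.message.data)
        for received in response.received_messages
    ]
    ack_ids = [received.ack_id for received in response.received_messages]
    if ack_ids:
        # Acknowledge so the same messages aren't delivered again.
        subscriber.acknowledge(
            request={"subscription": path, "ack_ids": ack_ids}
        )
    return summaries


# Parsing a hand-written payload; no network access is needed for this part.
example = json.dumps(
    {"incident": {"incident_id": "abc", "policy_name": "disk", "state": "open"}}
).encode("utf-8")
summary = summarize_notification(example)
```

Because the subscriber initiates the request, this pattern works from hosts that have outbound access to Google but accept no inbound connections.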
To identify why the delivery failed, examine your Cloud Logging log entries for failure information.
For example, you can search for log entries for the notification channel resource by using the Logs Explorer, with a filter like the following: