This page explains why some alerting policies might behave differently than intended, and offers possible remedies for those situations.
For information about the variables that can affect an alerting policy, such as the choice of duration window, see Behavior of metric-based alerting policies.
Disk-utilization policy creates unexpected incidents
You created an alerting policy to monitor the "used" capacity of the disks in your system. This policy monitors the metric agent.googleapis.com/disk/percent_used.
You expect to be notified only when the utilization of any physical disk
exceeds the threshold you set in the condition. However, this
policy is creating incidents when the disk utilization of every
physical disk is less than the threshold.
A known cause of unexpected incidents for these policies is that the conditions aren't restricted to monitoring physical disks. Instead, these policies monitor all disks, including virtual disks such as loopback devices. If a virtual disk is constructed such that its utilization is always 100%, then the policy creates an incident for that disk.
For example, consider the following output of the Linux df command, which shows the disk space available on mounted file systems, for one system:
$ df
/dev/root      9983232 2337708 7629140  24% /
devtmpfs       2524080       0 2524080   0% /dev
tmpfs          2528080       0 2528080   0% /dev/shm
...
/dev/sda15      106858    3934  102924   4% /boot/efi
/dev/loop0       56704   56704       0 100% /snap/core18/1885
/dev/loop1      129536  129536       0 100% /snap/google-cloud-sdk/150
...
For this system, a disk-utilization alerting policy should be configured to filter out the time series for the loopback devices /dev/loop0 and /dev/loop1. For example, you might add the filter device !=~ ^/dev/loop.*, which excludes all time series whose device label matches the regular expression ^/dev/loop.*.
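For illustration, the following MQL condition is a minimal sketch of that exclusion; the 90% threshold is illustrative, and depending on how you define "used" capacity you might also filter on the metric's state label:
fetch gce_instance::agent.googleapis.com/disk/percent_used
# Drop time series whose device label matches the loopback pattern.
| filter not re_full_match(metric.device, '^/dev/loop.*')
| condition val() > 90 '%'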
Uptime policy doesn't create expected alerts
You want to be notified if a virtual machine (VM) reboots or shuts down, so you create an alerting policy that monitors the metric compute.googleapis.com/instance/uptime.
You create and configure the condition to generate an incident when there
is no metric data. You don't define the condition by using
Monitoring Query Language (MQL)1.
However, you aren't notified when the VM reboots or shuts down.
This alerting policy only monitors time series for Compute Engine VM instances that are in the RUNNING state. Time series for VMs that are in any other state, such as STOPPED or DELETED, are filtered out before the condition is evaluated. Because of this behavior, you can't use an alerting policy with a metric-absence condition to determine if a VM instance is running. For information about VM instance states, see VM instance life cycle.
To resolve this problem, create an alerting policy to monitor an uptime check. For private endpoints, use private uptime checks.
A possible alternative to alerting on uptime checks is to use alerting policies that monitor the absence of data. We strongly recommend alerting on uptime checks instead of absence of data: absence alerts can generate false positives if there are transient issues with the availability of Monitoring data.
However, if using uptime checks isn't possible, you can create an alerting policy with an MQL condition that notifies you when the VM has been shut down. MQL-defined conditions don't pre-filter time-series data based on the state of the VM instance, so you can use them to detect the absence of data from VMs that have been shut down.
Consider the following MQL condition, which monitors the compute.googleapis.com/instance/cpu/utilization metric:
fetch gce_instance::compute.googleapis.com/instance/cpu/utilization
| absent_for 3m
If a VM monitored by this condition is shut down,
then three minutes later, an incident is generated and
notifications are sent. The absent_for
value must be at
least three minutes.
For more information about MQL, see Alerting policies with MQL.
1: MQL is an expressive text-based language that can be used with Cloud Monitoring API calls and in the Google Cloud console. To configure a condition with MQL when you use the Google Cloud console, you must select Query Editor.
Common causes for anomalous incidents
You created an alerting policy and the policy appears to prematurely or incorrectly create incidents.
There are different reasons why you might receive notification of incidents that appear to be incorrect:
If there is a gap in data, particularly for those alerting policies with metric-absence or “less than” threshold conditions, then an incident can be created that appears to be anomalous. Determining if a gap exists in data might not be easy. Sometimes the gap is obscured, and sometimes it is automatically corrected:
In charts, for example, gaps might be obscured because the values for missing data are interpolated. Even when several minutes of data are missing, the chart connects the points on either side of the gap for visual continuity. Such a gap in the underlying data might be enough for an alerting policy to create an incident.
Points in log-based metrics can arrive late and be backfilled, for up to 10 minutes in the past. The backfill behavior effectively corrects the gap; the gap is filled in when the data finally arrives. Thus, a gap in a log-based metric that can no longer be seen might have caused an alerting policy to create an incident.
Metric-absence and “less than” threshold conditions are evaluated in real time, with a small query delay. The status of the condition can change between the time it is evaluated and the time the corresponding incident is visible in Monitoring.
Conditions that are configured to create an incident based on a single measurement can result in incidents that appear to be premature or incorrect. To prevent this situation, ensure that multiple measurements are required before an incident is created by setting the duration field of a condition to more than double the metric's sampling period.
For example, if a metric is sampled every 60 seconds, then set the duration to at least 3 minutes. If you set the duration field to most recent value, or equivalently to 0 seconds, then a single measurement can cause an incident to be created.
When the condition of an alerting policy is edited, it can take several minutes for the change to propagate through the alerting infrastructure. During this time period, you might receive notification of incidents that met the original alerting policy conditions.
When time-series data arrive, it can take up to a minute for the data to propagate through the entire alerting infrastructure. When the alignment period is set to one minute or to the most recent sample, the propagation latency might make it appear that the alerting policy is triggering incorrectly. To reduce the possibility of this situation, use an alignment period of at least five minutes.
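To illustrate the duration and alignment-period recommendations, the following is a sketch of the conditionThreshold section of an alerting-policy configuration; the metric, threshold value, and aligner are illustrative:
conditionThreshold:
  filter: 'metric.type="compute.googleapis.com/instance/cpu/utilization" AND resource.type="gce_instance"'
  comparison: COMPARISON_GT
  thresholdValue: 0.9
  duration: 180s            # At least double the 60-second sampling period, so multiple measurements are required.
  aggregations:
  - alignmentPeriod: 300s   # An alignment period of at least five minutes reduces apparent false triggers.
    perSeriesAligner: ALIGN_MEAN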
Incident isn't closed when data stops arriving
You follow the guidance in Partial metric data and configure an alerting policy to close incidents when data stops arriving. In some cases, data stops arriving but an open incident isn't automatically closed.
If the underlying resource being monitored by an alerting policy contains
the metadata.system_labels.state
label, and if that policy isn't written
with the Monitoring Query Language, then Monitoring can determine the state
of the resource. If the state of a resource is known to be disabled, then
Monitoring doesn't automatically close incidents when data
stops arriving. However, you can close these incidents manually.
Multi-condition policy creates multiple notifications
You created an alerting policy that contains multiple conditions, and you joined those conditions with a logical AND. You expect to get one notification and have one incident created when all conditions are met. However, you receive multiple notifications and see that multiple incidents are created.
When an alerting policy contains multiple conditions that are joined by a logical AND, if that policy triggers, then for each time series that results in a condition being met, the policy sends a notification and creates an incident. For example, if you have a policy with two conditions and each condition is monitoring one time series, then two incidents are opened and you receive two notifications.
You can't configure Cloud Monitoring to create a single incident and send a single notification.
For more information, see Notifications per incident.
Unable to view incident details due to a permission error
You navigate to the incidents page in the Google Cloud console and select an incident to view. You expect to have the details page open. However, the details page fails to open and a "Permission denied" message is displayed.
To resolve this situation, ensure that your Identity and Access Management (IAM) role is roles/monitoring.viewer or one that includes all permissions of that role. For example, the roles roles/monitoring.editor and roles/monitoring.admin include all permissions of the viewer role. Custom roles can't grant the permission required to view incident details.
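For example, one way to grant the viewer role at the project level is with the following Google Cloud CLI command; PROJECT_ID and USER_EMAIL are placeholders:
# Grant roles/monitoring.viewer to a user on the project.
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:USER_EMAIL" \
  --role="roles/monitoring.viewer"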
Incident isn't created when condition is met
You created an alerting policy that has one condition. The alert chart shows that the monitored data violates the condition, but you didn't receive a notification and an incident wasn't created.
If any of the following criteria are true after the condition of the alerting policy is met, then Cloud Monitoring doesn't open an incident:
- The alerting policy is snoozed.
- The alerting policy is disabled.
- The alerting policy has reached the maximum number of incidents that it can open simultaneously.
- The state of the resource that the alerting policy monitors is known to be disabled. Monitoring can determine the state of a resource when the resource contains the metadata.system_labels.state label and when the alerting policy isn't written with the Monitoring Query Language.
Incident details list wrong project
You receive a notification of an alert, and the condition summary lists the Google Cloud project in which the alert was created, that is, the scoping project. However, you expect the incident to list the name of the Google Cloud project that stores the time series that is causing the incident.
The aggregation options specified in the condition of an alerting policy determine the Google Cloud project that is referenced in a notification:
When the aggregation options eliminate the label that stores the project ID, the incident information lists the scoping project. For example, if you group the data only by zone, then after grouping, the label that stores the project ID is removed.
When the aggregation options preserve the label that stores the project ID, the incident notifications include the name of the Google Cloud project that stores the time series that is causing the incident. To preserve the project ID label, either don't group the time series, or include the project_id label in the grouping field.
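For example, the aggregation settings of such a condition might resemble the following sketch; the aligner, reducer, and the zone grouping label are illustrative:
aggregations:
- alignmentPeriod: 300s
  perSeriesAligner: ALIGN_MEAN
  crossSeriesReducer: REDUCE_MEAN
  groupByFields:
  - resource.label.project_id   # Keeping this label lets incidents reference the project that stores the time series.
  - resource.label.zone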
Unable to manually close an incident
You received a notification of an incident on your system. You go to the incident details page and click Close incident. You expect the incident to be closed; however, you receive the error message:
Unable to close incident with active conditions.
You can only close an incident when no observations arrive in the most recent alerting period. The alerting period, which typically has a default value of 5 minutes, is defined as part of the alerting policy condition and is configurable. The previous error message indicates that data has been received within the alerting period.
The following error occurs when an incident can't be closed due to an internal error:
Unable to close incident. Please try again in a few minutes.
When you see the previous error message, you can retry the close operation or let Monitoring automatically close the incident.
For more information, see Managing incidents.
Notifications aren't received
You configure notification channels and expect to be notified when incidents occur. You don't receive any notifications.
For information about how to resolve issues with webhook and Pub/Sub notifications, see the Webhook notifications aren't received and Pub/Sub notifications aren't received sections of this document.
To gather information about the cause of the failure, do the following:
In the Google Cloud console, go to the Logs Explorer page:
Select the appropriate Google Cloud project.
Query the logs for notification channel events:
- Expand the Log name menu, and select notification_channel_events.
- Expand the Severity menu and select Error.
- Optional: To select a custom time range, use the time-range selector.
- Click Run query.
The previous steps create the following query:
logName="projects/PROJECT_ID/logs/monitoring.googleapis.com%2Fnotification_channel_events" severity=ERROR
Failure information is typically included in the summary line and in the jsonPayload field. For example, when a gateway error occurs, the summary line includes "failed with 502 Bad Gateway".
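If you prefer to query from the command line, the following Google Cloud CLI invocation is one way to run the same query; replace PROJECT_ID with your project ID, and adjust the limit as needed:
# List recent notification-channel error events for the project.
gcloud logging read \
  'logName="projects/PROJECT_ID/logs/monitoring.googleapis.com%2Fnotification_channel_events" AND severity=ERROR' \
  --project=PROJECT_ID --limit=20 --format=json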
No new data after changes to metric definitions
You change the definition of a user-defined metric, for example, by modifying the filter you used in a log-based metric, and the alerting policy doesn't reflect the change you made to the metric definition.
To resolve this problem, force the alerting policy to update by editing the display name of the policy.
Webhook notifications sent to Google Chat aren't received
You configure a webhook notification channel in Cloud Monitoring and then
configure the webhook to send to Google Chat. However, you aren't receiving
notifications or you are receiving 400 Bad Request
errors.
To resolve this problem, configure a Pub/Sub notification channel in Cloud Monitoring, and then configure a Cloud Run service that converts the Pub/Sub messages into the form Google Chat expects and delivers the notification to Google Chat. For an example of this configuration, see Creating custom notifications with Cloud Monitoring and Cloud Run.
Webhook notifications aren't received
You configure a webhook notification channel and expect to be notified when incidents occur. You don't receive any notifications.
Private endpoint
You can't use webhooks for notifications unless the endpoint is public.
To resolve this situation, use Pub/Sub notifications combined with a pull subscription to that notification topic.
When you configure a Pub/Sub notification channel, incident notifications are sent to a Pub/Sub queue that has Identity and Access Management controls. Any service that can query for, or listen to, a Pub/Sub topic can consume these notifications. For example, applications running on App Engine, Cloud Run, or Compute Engine virtual machines can consume these notifications.
If you use a pull subscription, then a request is sent to Google that waits for a message to arrive. These subscriptions require access to Google but they don't require rules for firewalls or inbound access.
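As a sketch, the following Google Cloud CLI commands create a topic and a pull subscription and then pull pending messages; the topic and subscription names are illustrative placeholders:
gcloud pubsub topics create monitoring-notifications
gcloud pubsub subscriptions create monitoring-notifications-pull \
  --topic=monitoring-notifications
# Pulling requires only outbound access to Google; no inbound firewall rules are needed.
gcloud pubsub subscriptions pull monitoring-notifications-pull --auto-ack --limit=5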
Public endpoint
To identify why the delivery failed, examine your Cloud Logging log entries for failure information.
For example, you can search for log entries for the notification channel resource by using the Logs Explorer, with a filter like the following:
resource.type="stackdriver_notification_channel"
Pub/Sub notifications aren't received
You configure a Pub/Sub notification channel but you don't receive any alert notifications.
To resolve this condition, try the following:
Ensure that the notifications service account exists. Notifications aren't sent when the service account has been deleted.
To verify that the service account exists, do the following:
In the Google Cloud console, go to the IAM page:
Search for a service account that has the following naming convention:
service-PROJECT_NUMBER@gcp-sa-monitoring-notification.iam.gserviceaccount.com
If this service account isn't listed, then select Include Google-provided role grants.
To create a notifications service account, do the following:
In the Google Cloud console, select Monitoring.
Click Alerting and then click Edit notification channels.
In the Pub/Sub section, click Add new.
The Created Pub/Sub Channel dialog displays the name of the service account that Monitoring created.
Click Cancel.
Grant the service account permissions to publish your Pub/Sub topics as described in Authorize service account.
Ensure that the notifications service account has been authorized to send notifications for the Pub/Sub topics of interest.
To view the permissions for a service account, you can use the Google Cloud console or the Google Cloud CLI:
- The IAM page in the Google Cloud console lists the roles for each service account.
- The Pub/Sub Topics page in the Google Cloud console lists each topic. When you select a topic, the Permissions tab lists the roles granted to service accounts.
To list all service accounts and their roles, run the following Google Cloud CLI command:
gcloud projects get-iam-policy PROJECT_ID
The following is a partial response for this command:
- members:
  - serviceAccount:service-PROJECT_NUMBER@gcp-sa-monitoring-notification.iam.gserviceaccount.com
  role: roles/monitoring.notificationServiceAgent
- members:
  [...]
  role: roles/owner
- members:
  - serviceAccount:service-PROJECT_NUMBER@gcp-sa-monitoring-notification.iam.gserviceaccount.com
  role: roles/pubsub.publisher
The command response includes only roles; it doesn't include per-topic authorization.
To list the IAM bindings for a specific topic, run the following command:
gcloud pubsub topics get-iam-policy TOPIC
The following is a sample response for this command:
bindings:
- members:
  - serviceAccount:service-PROJECT_NUMBER@gcp-sa-monitoring-notification.iam.gserviceaccount.com
  role: roles/pubsub.publisher
etag: BwXPRb5WDPI=
version: 1
For information about how to authorize the notifications service account, see Authorize service account.
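For reference, one way to grant that authorization on a specific topic is with the following Google Cloud CLI command; TOPIC and PROJECT_NUMBER are placeholders:
# Allow the notifications service account to publish to the topic.
gcloud pubsub topics add-iam-policy-binding TOPIC \
  --member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-monitoring-notification.iam.gserviceaccount.com" \
  --role="roles/pubsub.publisher"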