Troubleshooting alerting policies

This page explains why some alerting policies might behave differently than intended, and offers possible remedies for those situations.

For information about the variables that can affect an alerting policy, such as the choice of duration window, see Behavior of metric-based alerting policies.

Variable for a metric label is null

You create an alerting policy and add a variable for a metric label to the documentation section. You expect the notifications to show the value of the variable; however, the value is set to null.

To resolve this situation, try the following:

  • Ensure that the aggregation settings for the alerting policy preserve the label that you want to display.

    For example, assume that you create an alerting policy that monitors the disk bytes written by VM instances. You want the documentation to list the device that is causing the notification, so you add to the documentation field the following: device: ${metric.label.device}.

    You must also ensure that your aggregation settings preserve the value of the device label. You can preserve this label by setting the aggregation function to none or by ensuring that the grouping selections include device, as shown in the example that follows this list.

  • Verify the syntax and applicability of the variable. For syntax information, see Annotate alerts with user-defined documentation.

    For example, the variable log.extracted_label.KEY is only supported for log-based alerts. This variable always renders as null when an alerting policy monitors a metric, even a log-based metric.
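
For example, the following partial Cloud Monitoring API representation of the disk-bytes policy described in the first item pairs the documentation variable with a groupByFields entry that preserves the device label; the metric, threshold, and alignment values are illustrative assumptions:

{
  "displayName": "Disk write bytes by device",
  "documentation": {
    "content": "device: ${metric.label.device}",
    "mimeType": "text/markdown"
  },
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "Per-device write throughput",
      "conditionThreshold": {
        "filter": "resource.type = \"gce_instance\" AND metric.type = \"agent.googleapis.com/disk/write_bytes_count\"",
        "aggregations": [
          {
            "alignmentPeriod": "300s",
            "perSeriesAligner": "ALIGN_RATE",
            "crossSeriesReducer": "REDUCE_SUM",
            "groupByFields": ["metric.label.device"]
          }
        ],
        "comparison": "COMPARISON_GT",
        "thresholdValue": 100000000,
        "duration": "300s"
      }
    }
  ]
}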

Disk-utilization policy creates unexpected incidents

You created an alerting policy to monitor the "used" capacity of the disks in your system. This policy monitors the metric agent.googleapis.com/disk/percent_used. You expect to be notified only when the utilization of any physical disk exceeds the threshold you set in the condition. However, this policy is creating incidents when the disk utilization of every physical disk is less than the threshold.

A known cause of unexpected incidents for these policies is that the conditions aren't restricted to monitoring physical disks. Instead, these policies monitor all disks, including virtual disks such as loopback devices. If a virtual disk is constructed such that its utilization is 100%, then the policy creates an incident.

For example, consider the following output of the Linux df command, which shows the disk space available on mounted file systems, for one system:

$ df
Filesystem    1K-blocks     Used  Available  Use%  Mounted on
/dev/root       9983232  2337708    7629140   24%  /
devtmpfs        2524080        0    2524080    0%  /dev
tmpfs           2528080        0    2528080    0%  /dev/shm
...
/dev/sda15       106858     3934     102924    4%  /boot/efi
/dev/loop0        56704    56704          0  100%  /snap/core18/1885
/dev/loop1       129536   129536        0  100%  /snap/google-cloud-sdk/150
...

For this system, a disk-utilization alerting policy should be configured to filter out the time series for the loopback devices /dev/loop0 and /dev/loop1. For example, you might add the filter device !=~ ^/dev/loop.*, which excludes any time series whose device label matches the regular expression ^/dev/loop.*.
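
If you define the condition by using Monitoring Query Language (MQL) instead, a rough sketch of the same exclusion follows; the device and state label names and the regular-expression match are assumptions to adjust for your environment:

fetch gce_instance::agent.googleapis.com/disk/percent_used
| filter metric.state == 'used' && !(metric.device =~ '/dev/loop.*')
| group_by 5m, [value_percent_used_mean: mean(value.percent_used)]
| condition val() > 90 '%'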

Uptime policy doesn't create expected alerts

You want to be notified if a virtual machine (VM) reboots or shuts down, so you create an alerting policy that monitors the metric compute.googleapis.com/instance/uptime. You configure the condition to generate an incident when there is no metric data, and you don't define the condition by using Monitoring Query Language (MQL).1 However, you aren't notified when the VM reboots or shuts down.

This alerting policy only monitors time series for Compute Engine VM instances that are in the RUNNING state. Time series for VMs that are in any other state, such as STOPPED or DELETED, are filtered out before the condition is evaluated. Because of this behavior, you can't use an alerting policy with a metric-absence alerting condition to determine if a VM instance is running. For information on VM instance states, see VM instance life cycle.

To resolve this problem, create an alerting policy to monitor an uptime check. For private endpoints, use private uptime checks.

A possible alternative to alerting on uptime checks is to use alerting policies that monitor the absence of data. We strongly recommend alerting on uptime checks instead of absence of data: absence alerts can generate false positives if there are transient issues with the availability of Monitoring data.

However, if using uptime checks isn't possible, then you can create an alerting policy with an MQL condition that notifies you when a VM has been shut down. MQL-defined conditions don't pre-filter time-series data based on the state of the VM instance, so you can use them to detect the absence of data from VMs that have been shut down.

Consider the following MQL condition, which monitors the compute.googleapis.com/instance/cpu/utilization metric:

fetch gce_instance::compute.googleapis.com/instance/cpu/utilization
| absent_for 3m

If a VM monitored by this condition is shut down, then three minutes later, an incident is generated and notifications are sent. The absent_for value must be at least three minutes.

For more information about MQL, see Alerting policies with MQL.

1: MQL is an expressive text-based language that can be used with Cloud Monitoring API calls and in the Google Cloud console. To configure a condition with MQL when you use the Google Cloud console, you must use the code editor.

Request count policy doesn't create expected alerts

You want to monitor the number of completed requests. You created an alerting policy that monitors the metric serviceruntime.googleapis.com/api/request_count, but you aren't notified when the number of requests exceeds the threshold you configured.

The maximum alignment period for the request count metric is 7 hours 30 minutes.

To resolve this issue, check the value of the alignment period in your alerting policy. If the value is greater than the maximum for this metric, then reduce the alignment period so that it is no more than 7 hours 30 minutes.
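
Because 7 hours 30 minutes corresponds to 27,000 seconds, the alignmentPeriod value in the API representation of the condition must be "27000s" or less. For example, an aggregation like the following sketch stays within the limit; the aligner shown is illustrative:

"aggregations": [
  {
    "alignmentPeriod": "3600s",
    "perSeriesAligner": "ALIGN_SUM"
  }
]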

Common causes for anomalous incidents

You created an alerting policy and the policy appears to prematurely or incorrectly create incidents.

There are different reasons why you might receive notification of incidents that appear to be incorrect:

  • If there is a gap in data, particularly for those alerting policies with metric-absence or “less than” threshold conditions, then an incident can be created that appears to be anomalous. Determining if a gap exists in data might be difficult. Sometimes the gap is obscured, and sometimes it is automatically corrected:

    • In charts, for example, gaps might be obscured because the values for missing data are interpolated. Even when several minutes of data are missing, the chart connects the points on either side of the gap for visual continuity. Such a gap in the underlying data might be enough for an alerting policy to create an incident.

    • Points in log-based metrics can arrive late and be backfilled, for up to 10 minutes in the past. The backfill behavior effectively corrects the gap; the gap is filled in when the data finally arrives. Thus, a gap in a log-based metric that can no longer be seen might have caused an alerting policy to create an incident.

  • Metric-absence and “less than” threshold conditions are evaluated in real time, with a small query delay. The status of the condition can change between the time it is evaluated and the time the corresponding incident is visible in Monitoring.

  • Conditions that are configured to create an incident based on a single measurement can result in incidents that appear to be premature or incorrect. To prevent this situation, require multiple measurements before an incident is created by setting the duration field of the condition to more than double the metric's sampling period.

    For example, if a metric is sampled every 60 seconds, then set the duration to at least 3 minutes; an example condition is shown after this list. If you set the duration field to most recent value, or equivalently to 0 seconds, then a single measurement can cause an incident to be created.

  • When the condition of an alerting policy is edited, it can take several minutes for the change to propagate through the alerting infrastructure. During this time period, you might receive notification of incidents that met the original alerting policy conditions.

  • When time-series data arrive, it can take up to a minute for the data to propagate through the entire alerting infrastructure. When the alignment period is set to one minute or to the most recent sample, the propagation latency might make it appear that the alerting policy is triggering incorrectly. To reduce the possibility of this situation, use an alignment period of at least five minutes.
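
For example, the following partial condition, expressed in Cloud Monitoring API JSON form with an illustrative threshold, requires a metric that is sampled every 60 seconds to violate the threshold for three minutes before an incident is created:

"conditionThreshold": {
  "comparison": "COMPARISON_GT",
  "thresholdValue": 0.9,
  "duration": "180s",
  "trigger": { "count": 1 }
}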

Incident isn't closed when data stops arriving

You follow the guidance in Partial metric data and configure an alerting policy to close incidents when data stops arriving. In some cases, data stops arriving but an open incident isn't automatically closed.

If the underlying resource being monitored by an alerting policy contains the metadata.system_labels.state label, and if that policy isn't written with the Monitoring Query Language, then Monitoring can determine the state of the resource. If the state of a resource is known to be disabled, then Monitoring doesn't automatically close incidents when data stops arriving. However, you can close these incidents manually.

Multi-condition policy creates multiple notifications

You created an alerting policy that contains multiple conditions, and you joined those conditions with a logical AND. You expect to get one notification and have one incident created when all conditions are met. However, you receive multiple notifications and see that multiple incidents are created.

When an alerting policy contains multiple conditions that are joined by a logical AND, if that policy triggers, then for each time series that results in a condition being met, the policy sends a notification and creates an incident. For example, if you have a policy with two conditions and each condition is monitoring one time series, then two incidents are opened and you receive two notifications.

You can't configure Cloud Monitoring to create a single incident and send a single notification.

For more information, see Notifications per incident.

Unable to view incident details due to a permission error

You navigate to the incidents page in the Google Cloud console and select an incident to view. You expect the details page to open. However, the details page fails to open and a "Permission denied" message is displayed.

To view all incident details except metric data, ensure that you have the Identity and Access Management (IAM) roles of Monitoring Cloud Console Incident Viewer (roles/monitoring.cloudConsoleIncidentViewer) and Stackdriver Accounts Viewer (roles/stackdriver.accounts.viewer).

To view all incident details, including the metric data, and to be able to acknowledge or close incidents, ensure that you have the IAM roles of Monitoring Viewer (roles/monitoring.viewer) and Monitoring Cloud Console Incident Editor (roles/monitoring.cloudConsoleIncidentEditor).

Custom roles can't grant the permission required to view incident details.
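
For example, you might grant the first pair of roles by using the Google Cloud CLI; replace PROJECT_ID and USER_EMAIL with your own values:

# Grant the roles needed to view incident details, except metric data.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/monitoring.cloudConsoleIncidentViewer"

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/stackdriver.accounts.viewer"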

Incident isn't created when condition is met

You created an alerting policy that has one condition. The alert chart shows that the monitored data violates the condition, but you didn't receive a notification and an incident wasn't created.

If any of the following are true when the condition of the alerting policy is met, then Cloud Monitoring doesn't create an incident:

  • The alerting policy is snoozed.
  • The alerting policy is disabled.
  • The alerting policy has reached the maximum number of incidents that it can open simultaneously.
  • The state of the resource that the alerting policy monitors is known to be disabled. Monitoring can determine the state of a resource when the resource contains the metadata.system_labels.state label and when the alerting policy isn't written with the Monitoring Query Language.

Incident details list wrong project

You receive a notification for an alert, and the condition summary lists the Google Cloud project in which the alerting policy was created, that is, the scoping project. However, you expect the incident to list the name of the Google Cloud project that stores the time series that caused the incident.

The aggregation options specified in the condition of an alerting policy determine the Google Cloud project that is referenced in a notification:

  • When the aggregation options eliminate the label that stores the project ID, the incident information lists the scoping project. For example, if you group the data only by zone, then after grouping, the label that stores the project ID is removed.

  • When the aggregation options preserve the label that stores the project ID, the incident notifications include the name of the Google Cloud project that stores the time series that caused the incident. To preserve the project ID label, either don't group the time series or ensure that the grouping fields include the project_id label.
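
    For example, an aggregation like the following sketch preserves the project ID label; the exact group-by field paths are assumptions to adapt to your metric and resource type:

    "aggregations": [
      {
        "alignmentPeriod": "300s",
        "perSeriesAligner": "ALIGN_MEAN",
        "crossSeriesReducer": "REDUCE_MEAN",
        "groupByFields": ["resource.label.project_id", "resource.label.zone"]
      }
    ]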

Unable to manually close an incident

You received a notification of an incident on your system. You go to the incident details page and click Close incident. You expect the incident to be closed; however, you receive the error message:

Unable to close incident with active conditions.

You can only close an incident when no observations arrive in the most recent alerting period. The alerting period, which typically has a default value of 5 minutes, is defined as part of the alerting policy condition and is configurable. The previous error message indicates that data has been received within the alerting period.

The following error occurs when an incident can't be closed due to an internal error:

Unable to close incident. Please try again in a few minutes.

When you see the previous error message, you can retry the close operation or let Monitoring automatically close the incident.

For more information, see Managing incidents.

Notifications aren't received

You configure notification channels and expect to be notified when incidents occur. You don't receive any notifications.

For information about how to resolve issues with webhook and Pub/Sub notifications, see the Webhook notifications aren't received and Pub/Sub notifications aren't received sections of this document.

To gather information about the cause of the failure, do the following:

  1. In the navigation panel of the Google Cloud console, select Logging, and then select Logs Explorer:

    Go to Logs Explorer

  2. Select the appropriate Google Cloud project.
  3. Query the logs for notification channel events:

    1. Expand the Log name menu, and select notification_channel_events.
    2. Expand the Severity menu and select Error.
    3. Optional: To select a custom time range, use the time-range selector.
    4. Click Run query.

    The previous steps create the following query:

    logName="projects/PROJECT_ID/logs/monitoring.googleapis.com%2Fnotification_channel_events"
    severity=ERROR
    

    Failure information is typically included in the summary line and in the jsonPayload field. For example, when a gateway error occurs, the summary line includes "failed with 502 Bad Gateway".
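
    You can also run an equivalent query with the Google Cloud CLI; replace PROJECT_ID with your project ID:

    # Read recent error entries for notification channel events.
    gcloud logging read \
        'logName="projects/PROJECT_ID/logs/monitoring.googleapis.com%2Fnotification_channel_events" AND severity=ERROR' \
        --project=PROJECT_ID --limit=20 --format=json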

No new data after changes to metric definitions

You change the definition of a user-defined metric, for example, by modifying the filter that you used in a log-based metric, but the alerting policy doesn't reflect the change that you made to the metric definition.

To resolve this problem, force the alerting policy to update by editing the display name of the policy.

Webhook notifications sent to Google Chat aren't received

You configure a webhook notification channel in Cloud Monitoring and then configure the webhook to send to Google Chat. However, you aren't receiving notifications or you are receiving 400 Bad Request errors.

To resolve this problem, configure a Pub/Sub notification channel in Cloud Monitoring, and then configure a Cloud Run service to convert the Pub/Sub messages into the format that Google Chat expects and to deliver the notifications to Google Chat. For an example of this configuration, see Creating custom notifications with Cloud Monitoring and Cloud Run.

Webhook notifications aren't received

You configure a webhook notification channel and expect to be notified when incidents occur. You don't receive any notifications.

Private endpoint

You can't use webhooks for notifications unless the endpoint is public.

To resolve this situation, use Pub/Sub notifications combined with a pull subscription to that notification topic.

When you configure a Pub/Sub notification channel, incident notifications are sent to a Pub/Sub topic that is protected by Identity and Access Management controls. Any service that can query for, or listen to, a Pub/Sub topic can consume these notifications. For example, applications running on App Engine, Cloud Run, or Compute Engine virtual machines can consume these notifications.

If you use a pull subscription, then your client sends a request to Google and waits for messages to arrive. Pull subscriptions require outbound access to Google, but they don't require firewall rules that allow inbound access.
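
For example, you might create a pull subscription on the notification topic and pull messages from it by using the Google Cloud CLI; TOPIC and SUBSCRIPTION are placeholder names:

# Create a pull subscription on the topic that receives incident notifications.
gcloud pubsub subscriptions create SUBSCRIPTION --topic=TOPIC

# Pull and acknowledge any pending notification messages.
gcloud pubsub subscriptions pull SUBSCRIPTION --auto-ack --limit=5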

Public endpoint

To identify why the delivery failed, examine your Cloud Logging log entries for failure information.

For example, you can search for log entries for the notification channel resource by using the Logs Explorer, with a filter like the following:

resource.type="stackdriver_notification_channel"

Pub/Sub notifications aren't received

You configure a Pub/Sub notification channel but you don't receive any alert notifications.

To resolve this condition, try the following:

  • Ensure that the notifications service account exists. Notifications aren't sent when the service account has been deleted.

    To verify that the service account exists, do the following:

    1. In the navigation panel of the Google Cloud console, select IAM:

      Go to IAM

    2. Search for a service account that has the following naming convention:

      service-PROJECT_NUMBER@gcp-sa-monitoring-notification.iam.gserviceaccount.com

      If this service account isn't listed, then select Include Google-provided role grants.

    To create a notifications service account, do the following:

    1. In the navigation panel of the Google Cloud console, select Monitoring, and then select  Alerting:

      Go to Alerting

    2. Click Edit notification channels.
    3. In the Pub/Sub section, click Add new.

      The Created Pub/Sub Channel dialog displays the name of the service account that Monitoring created.

    4. Click Cancel.

    5. Grant the service account permission to publish to your Pub/Sub topics, as described in Authorize service account.

  • Ensure that the notifications service account has been authorized to send notifications for the Pub/Sub topics of interest.

    To view the permissions for a service account, you can use the Google Cloud console or the Google Cloud CLI:

    • The IAM page in the Google Cloud console lists the roles for each service account.
    • The Pub/Sub Topics page in the Google Cloud console lists each topic. When you select a topic, the Permissions tab lists the roles granted to service accounts.
    • To list all service accounts and their roles, run the following Google Cloud CLI command:

      gcloud projects get-iam-policy PROJECT_ID
      

      The following is a partial response for this command:

          - members:
            - serviceAccount:service-PROJECT_NUMBER@gcp-sa-monitoring-notification.iam.gserviceaccount.com
            role: roles/monitoring.notificationServiceAgent
          - members:
            [...]
            role: roles/owner
          - members:
            - serviceAccount:service-PROJECT_NUMBER@gcp-sa-monitoring-notification.iam.gserviceaccount.com
            role: roles/pubsub.publisher
      

      The command response includes only project-level roles; it doesn't include per-topic authorization.

    • To list the IAM bindings for a specific topic, run the following command:

      gcloud pubsub topics get-iam-policy TOPIC
      

      The following is a sample response for this command:

          bindings:
          - members:
            - serviceAccount:service-PROJECT_NUMBER@gcp-sa-monitoring-notification.iam.gserviceaccount.com
            role: roles/pubsub.publisher
          etag: BwXPRb5WDPI=
          version: 1
      

    For information about how to authorize the notifications service account, see Authorize service account.
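
    For example, you might grant the Pub/Sub Publisher role on a topic to this service account by using the Google Cloud CLI; replace TOPIC and PROJECT_NUMBER with your own values:

      # Allow the Monitoring notifications service account to publish to the topic.
      gcloud pubsub topics add-iam-policy-binding TOPIC \
          --member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-monitoring-notification.iam.gserviceaccount.com" \
          --role="roles/pubsub.publisher"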