Alerting behavior

Alerting policies exist in dynamic and complex environments, so using them effectively requires an understanding of some of the variables that can affect their behavior. The metrics and resources monitored by conditions, the duration windows for conditions, and the notification channels can each have an effect.

This document provides some additional information to help you understand the behavior of your metric-based alerting policies. This content does not apply to log-based alerting policies. For information about log-based alerting policies, see Monitoring your logs.

The alignment period and the duration

The alignment period and the duration window are two fields that you set when specifying a condition for an alerting policy. This section provides a brief illustration of the meaning of these fields.

Alignment period

The alignment period is a look-back interval from a particular point in time. For example, if the alignment period is five minutes, then at 1:00 PM, the alignment period contains the samples received between 12:55 PM and 1:00 PM. At 1:01 PM, the alignment period slides one minute and contains the samples received between 12:56 PM and 1:01 PM.

To illustrate the effect of the alignment period on a condition in an alerting policy, consider a condition that is monitoring a metric with a sampling period of one minute. Assume that the alignment period is set to five minutes and that the aligner is set to sum. Finally, assume that the condition is met when the aligned value of the time series is greater than two for a duration of three minutes, and that the condition is evaluated every minute.

The following figure illustrates several sequential evaluations of the condition:

Figure illustrating the effect of the alignment period.

Each row in the figure illustrates a single evaluation of the condition. The time series data is shown: points within the alignment period are blue; older points are gray. Each row displays the aligned value and indicates whether that value is above the threshold of two. For the row labeled start, the aligned value computes to 1, which is below the threshold. At the next evaluation, the sum of the samples in the alignment period is 2. On the third evaluation, the sum is 3, which is greater than the threshold, so the condition evaluator starts the duration window timer.
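The sliding-window behavior described above can be sketched in a few lines of Python. This is a simplified model, not the Cloud Monitoring implementation, and the sample values are hypothetical, chosen to match the walkthrough of the figure:

```python
def aligned_value(samples, now, alignment_period=5):
    """Apply the 'sum' aligner over the look-back alignment period.

    samples: list of (minute, value) pairs; now: the current minute.
    The window covers the alignment_period minutes ending at 'now'.
    """
    return sum(v for t, v in samples if now - alignment_period < t <= now)

# Hypothetical one-sample-per-minute time series.
samples = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1)]

print(aligned_value(samples, 4))  # start: 1, below the threshold of 2
print(aligned_value(samples, 5))  # start + 1m: 2, still not above 2
print(aligned_value(samples, 6))  # start + 2m: 3, above the threshold
```

As the window slides forward one minute per evaluation, old samples drop out and new ones enter, so the aligned value can cross the threshold without any single sample changing dramatically.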

Duration window

You use the duration, or the duration window, to prevent a condition from being met due to a single measurement. In the Google Cloud Console, use the following fields to configure the duration:

  • Legacy interface: The For field of the alerting policy Configuration pane.
  • Preview interface: The Time above threshold (or Time below threshold) field in the Configure trigger step.

The previous figure illustrated three evaluations of the condition. At the time start + 2m, the aligned value was above the threshold; however, the condition isn't met because the duration window is set to three minutes. The following figure illustrates the results for the next evaluations of the condition:

Figure illustrating the effect of the duration window.

As illustrated, even though the aligned value is above the threshold at time start + 2m, the condition isn't met until the aligned value is above the threshold for three minutes. That occurs at time start + 5m.

For clarity, the previous example omitted the possibility of combining the aligned data points from multiple time series into a single measurement. It is that measurement that is compared to the threshold to determine if the condition is met.

A condition resets its duration window each time a measurement doesn't satisfy the condition. This behavior is illustrated in the following example:

Example: This policy specifies a five-minute duration window.

If HTTP response latency is higher than two seconds,
and if the latency is above that threshold for five minutes,
open an incident and send email to your support team.

The following sequence illustrates how the duration window affects the evaluation of the condition:

  1. The HTTP latency is below two seconds.
  2. For the next three consecutive minutes, HTTP latency is above two seconds.
  3. In the next measurement, the latency falls below two seconds, so the condition resets the duration window.
  4. For the next five consecutive minutes, HTTP latency is above two seconds, so the condition is met and the policy is triggered.

Set the duration window to be long enough to minimize false positives, but short enough to ensure that incidents are opened in a timely manner.
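The reset behavior in the sequence above can be modeled as follows. This is a simplified sketch of the evaluator, not the actual implementation, and the latency values are hypothetical:

```python
def minute_condition_met(latencies_s, threshold_s=2.0, duration_min=5):
    """Return the first minute at which the condition is met, or None.

    The duration timer resets whenever a measurement is at or below
    the threshold.
    """
    minutes_above = 0
    for minute, latency in enumerate(latencies_s, start=1):
        minutes_above = minutes_above + 1 if latency > threshold_s else 0
        if minutes_above >= duration_min:
            return minute
    return None

# 1 minute below, 3 minutes above, 1 minute below (timer resets),
# then 5 minutes above: the condition is met at minute 10.
latencies = [1.0] + [3.0] * 3 + [1.0] + [3.0] * 5
print(minute_condition_met(latencies))  # 10
```

Note that the three minutes above the threshold before the reset don't count toward the five-minute requirement; only an unbroken run of measurements above the threshold satisfies the condition.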

Selecting the alignment period and duration window

Alerting policy conditions are evaluated at a fixed frequency. The choices that you make for the alignment period and the duration window don't impact how often the condition is evaluated.

The previous figure illustrates that the alignment period determines how many data samples are combined with the aligner. If you choose a long period, many samples are combined. If you choose a short period, it's possible that only one data point is in the interval. In contrast, the duration window specifies how long the aligned values must be above the threshold before the condition is met. If the duration window is set to 0, then a single aligned value being above the threshold means the condition is met.

Policies with multiple conditions

An alerting policy can contain up to 6 conditions.

If you use the Cloud Monitoring API, or if your alerting policy has multiple conditions, then you must specify when the individual conditions are met. When a condition is met, an event is created and an incident might be opened. To configure how multiple conditions are combined, do one of the following:

  • In the Google Cloud Console, use one of the following:

    • Legacy interface: Policy triggers field.
    • Preview interface: Multi-condition trigger step.
  • In the Cloud Monitoring API, use the combiner field.

The following list gives each setting in the Cloud Console, the equivalent combiner value in the Cloud Monitoring API, and a description of the setting:

  • Any condition is met (API value: OR): An incident is opened if any resource causes any of the conditions to be met.
  • All conditions are met even for different resources for each condition (API value: AND; the default): An incident is opened if each condition is met by at least one resource, even if a different resource causes those conditions to be met.
  • All conditions are met (API value: AND_WITH_MATCHING_RESOURCE): An incident is opened only if the same resource causes each condition to be met. This setting is the most stringent combining choice.

In this context, the term met means that the condition's configuration evaluates to true. For example, if the configuration is Any time series is above 10 for 5 minutes, then when this statement evaluates to true, the condition is met.

Example

Consider a Google Cloud project that contains two VM instances, vm1 and vm2. Also, assume that you create an alerting policy with 2 conditions:

  • The condition named CPU usage is too high monitors the CPU usage of the instances. This condition is met when the CPU usage of any instance is above 100ms/s for 1 minute.
  • The condition named Excessive utilization monitors the CPU utilization of the instances. This condition is met when the CPU utilization of any instance is above 60% for 1 minute.

Initially, assume that both conditions evaluate to false.

Next, assume that the CPU usage of vm1 exceeds 100ms/s for 1 minute. This causes the condition CPU usage is too high to be met. If the conditions are combined with Any condition is met, then an incident is created because a condition is met. If the conditions are combined with All conditions are met or All conditions are met even for different resources for each condition, then an incident isn't created. These combiner choices require that both conditions be met.

Next, assume that the CPU usage of vm1 continues to be above 100ms/s and that the CPU utilization of vm2 exceeds 60% for 1 minute. The result is that both conditions are met. The following describes what occurs based on how the conditions are combined:

  • Any condition is met: An incident is created when a resource causes a condition to be met. In this example, vm2 causes the condition Excessive utilization to be met.

    As a side note, if vm2 caused the condition CPU usage is too high to be met, that also results in an incident being created. This is because vm1 causing the condition CPU usage is too high to be met and vm2 causing the condition CPU usage is too high to be met are distinct events.

  • All conditions are met even for different resources for each condition: An incident is created because both conditions are met.

  • All conditions are met: An incident isn't created because this combiner requires that the same resource cause all conditions to be met. In this example, no incident is created because vm1 causes CPU usage is too high to be met while vm2 causes Excessive utilization to be met.
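The three combiner behaviors in this example can be summarized with a small sketch. This is an illustrative model, not the Cloud Monitoring implementation; the condition and resource names come from the example above:

```python
def incident_opens(met, combiner):
    """Decide whether an incident opens.

    met maps each condition name to the set of resources meeting it.
    """
    if combiner == "OR":
        # Any condition met by any resource.
        return any(bool(resources) for resources in met.values())
    if combiner == "AND":
        # Every condition met, possibly by different resources.
        return all(bool(resources) for resources in met.values())
    if combiner == "AND_WITH_MATCHING_RESOURCE":
        # Every condition met by at least one common resource.
        return bool(set.intersection(*met.values()))
    raise ValueError(f"unknown combiner: {combiner}")

# vm1 meets "CPU usage is too high"; vm2 meets "Excessive utilization".
met = {"CPU usage is too high": {"vm1"},
       "Excessive utilization": {"vm2"}}

print(incident_opens(met, "OR"))                          # True
print(incident_opens(met, "AND"))                         # True
print(incident_opens(met, "AND_WITH_MATCHING_RESOURCE"))  # False
```

Only AND_WITH_MATCHING_RESOURCE returns False here, because no single VM meets both conditions.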

Disabled alerting policies

Alerting policies can be temporarily paused and restarted by disabling and enabling the policy. For example, if you have an alerting policy that notifies you when a process is down for more than 5 minutes, you can disable the alerting policy when you take the process down for upgrade or other maintenance.

Disabling an alerting policy prevents the policy from triggering or closing incidents, but it doesn't stop Cloud Monitoring from evaluating the policy conditions and recording the results. If you disable an alerting policy and want its open incidents to be closed, then silence the incidents. For information on that process, see Silencing incidents.

Suppose the monitored process is down for 20 minutes for maintenance. If you restart the process and immediately re-enable the alerting policy, the policy recognizes that the process hasn't been up for the last five minutes and opens an incident.

When a disabled policy is re-enabled, Monitoring examines the values of all conditions over the most recent duration window, which might include data taken before, during, and after the paused interval. Policies can trigger immediately after resuming them, even with large duration windows.

Partial metric data

If measurements are missing (for example, if there are no HTTP requests for a couple of minutes), the policy uses the last recorded value to evaluate conditions.

Example

  1. A condition specifies HTTP latency of two seconds or higher for five consecutive minutes.
  2. For three consecutive minutes, HTTP latency is three seconds.
  3. For two consecutive minutes, there are no HTTP requests. In this case, a condition carries forward the last measurement (three seconds) for these two minutes.
  4. After a total of five minutes the policy triggers, even though there has been no data for the last two minutes.
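The carry-forward behavior in this example can be sketched as follows. This is a simplified model, not the Cloud Monitoring implementation; `None` stands in for a minute with no measurements:

```python
def fill_gaps(series):
    """Carry the last recorded value forward over missing minutes (None)."""
    filled, last = [], None
    for value in series:
        last = value if value is not None else last
        filled.append(last)
    return filled

# Three minutes of 3-second latency, then two minutes with no requests.
latency = [3.0, 3.0, 3.0, None, None]
print(fill_gaps(latency))  # [3.0, 3.0, 3.0, 3.0, 3.0]

# All five evaluated values are at or above two seconds, so a
# "latency >= 2 s for five minutes" condition is met.
print(all(v >= 2.0 for v in fill_gaps(latency)))  # True
```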

Missing or delayed metric data can result in policies not alerting and incidents not closing. Delays in data arriving from third-party cloud providers can be as high as 30 minutes, with 5-15 minute delays being the most common. A lengthy delay—longer than the duration window—can cause conditions to enter an "unknown" state. When the data finally arrives, Cloud Monitoring might have lost some of the recent history of the conditions. Later inspection of the time-series data might not reveal this problem because there is no evidence of delays once the data arrives.

You can minimize these problems by doing any of the following:

  • Contact your third-party cloud provider to see if there is a way to reduce metric collection latency.
  • Use longer duration windows in your conditions. This has the disadvantage of making your alerting policies less responsive.
  • Choose metrics that have a lower collection delay:

    • Monitoring agent metrics, especially when the agent is running on VM instances in third-party clouds.
    • Custom metrics, when you write their data directly to Cloud Monitoring.
    • Logs-based metrics, if logs collection is not delayed.

For more information, see Monitoring Agent Overview, Using Custom Metrics, and Logs-based Metrics.

Incidents per policy

An alerting policy can apply to many resources, and a problem affecting all resources can trigger the policy and open incidents for each resource. An incident is opened for each time series that results in a condition being met.

To prevent overloading the system, the number of incidents that a single policy can open simultaneously is limited to 5000.

For example, if a policy applies to 2000 (or 20,000) Compute Engine instances, and each instance causes the alerting conditions to be met, then only 5000 incidents are opened. Any remaining conditions that are met are ignored until some of the open incidents for that policy close.

Notifications per incident

By default, a notification is sent out when a time series causes a condition to be met. You might receive multiple notifications if any of the following are true:

  • A condition is monitoring multiple time series.

  • A policy contains multiple conditions:

    • All conditions are met: When all conditions are met, then for each time series that results in a condition being met, the policy sends a notification and creates an incident. For example, if you have a policy with two conditions and each condition is monitoring one time series, then when the policy is triggered, you receive two notifications and you see two incidents.

    • Any condition is met: The policy sends a notification each time a new combination of conditions is met. For example, assume that ConditionA is met, that an incident opens, and that a notification is sent. If the incident is still open when a subsequent measurement meets both ConditionA and ConditionB, then another notification is sent.

If you create an alerting policy by using the Cloud Monitoring API, then you are notified when the condition is met and when the condition stops being met. If you create the policy by using the Google Cloud Console, then the default behavior is to send a notification only when the condition is met. If you want a notification when the condition stops being met, then select Notify on incident closure in the notifications section.

Notification latency

Notification latency is the delay from the time a problem first starts until the time a policy is triggered.

The following events and settings contribute to the overall notification latency:

  • Metric collection delay: The time Cloud Monitoring needs to collect metric values. For Google Cloud values, this is typically negligible. For AWS CloudWatch metrics, this can be several minutes. For uptime checks, this can be an average of two minutes (from the end of the duration window).

  • Duration window: The window configured for the condition. Note that conditions are only met if a condition is true throughout the duration window. For example, if you specify a five-minute window, the notification is delayed at least five minutes from when the event first occurs.

  • Time for notification to arrive: Notification channels such as email and SMS can themselves experience network or other latencies unrelated to the content being delivered, sometimes approaching minutes. On some channels, such as SMS and Slack, there is no guarantee that messages are delivered.
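These contributions are roughly additive, so you can estimate a lower bound on end-to-end latency by summing them. The figures below are illustrative assumptions, not measured values:

```python
# Rough lower bound on notification latency, in minutes.
# All three figures are hypothetical examples.
collection_delay = 2   # e.g., metric ingestion delay
duration_window = 5    # the condition's configured duration
delivery_delay = 1     # notification channel delivery time

total_latency = collection_delay + duration_window + delivery_delay
print(total_latency)  # 8
```

In this scenario, a problem that starts at 1:00 PM wouldn't produce a notification until roughly 1:08 PM at the earliest.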

What's next