Behavior of metric-based alerting policies

This document describes how alignment periods and retest windows determine when a condition is met, how alerting policies combine multiple conditions, and how alerting policies replace missing data points. It also describes the maximum number of open incidents for a policy, the number of notifications per incident, and what causes notification delays.

This content does not apply to log-based alerting policies. For information about log-based alerting policies, see Monitoring your logs.

Alignment periods and retest windows

Cloud Monitoring evaluates the alignment period and retest window when determining whether the condition of an alerting policy has been met.

Alignment period

Before time-series data is monitored by an alerting policy, it must be regularized so that the alerting policy has regularly spaced data to evaluate. The regularization process is called alignment.

Alignment involves two steps:

  • Dividing the time series into regular time intervals, also called bucketing the data. The interval is the alignment period.

  • Computing a single value for the points in the alignment period. You choose how that single point is computed; you might sum all the values, or compute their average, or use the maximum. The function that combines the data points is called the aligner. The result of the combination is called the aligned value.

    For more information about alignment, see Alignment: within-series regularization.

For example, when the alignment period is five minutes, at 1:00 PM, the alignment period contains the samples received between 12:55 PM and 1:00 PM. At 1:01 PM, the alignment period slides one minute and contains the samples received between 12:56 PM and 1:01 PM.
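To make the aligner concrete, the following pure-Python sketch (the sample timestamps and values are hypothetical, not taken from this document) computes the aligned value for a five-minute alignment period with a sum aligner:

```python
from datetime import datetime, timedelta

# Hypothetical samples for a metric with a one-minute sampling period: (timestamp, value).
samples = [
    (datetime(2024, 1, 1, 12, 55), 1.0),
    (datetime(2024, 1, 1, 12, 57), 0.5),
    (datetime(2024, 1, 1, 12, 59), 2.0),
    (datetime(2024, 1, 1, 13, 0), 1.5),
]

def aligned_value(samples, end, alignment_period=timedelta(minutes=5), aligner=sum):
    """Apply the aligner to every sample in the alignment period that ends at `end`."""
    window = [value for ts, value in samples if end - alignment_period <= ts <= end]
    return aligner(window) if window else None

# At 1:00 PM the alignment period covers 12:55 PM through 1:00 PM.
print(aligned_value(samples, end=datetime(2024, 1, 1, 13, 0)))   # 5.0
# At 1:01 PM the window slides forward one minute and covers 12:56 PM through 1:01 PM.
print(aligned_value(samples, end=datetime(2024, 1, 1, 13, 1)))   # 4.0
```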

Monitoring configures an alignment period as follows:

Google Cloud console

You configure the alignment period by choosing a value for the following fields on the Alert conditions page:

  • Rolling window: Specifies the range of time to evaluate.
  • Rolling window function: Specifies the mathematical function to perform on the window of data points.

For more information about available functions, see Aligner in the API reference. Some of the aligner functions both align the data and convert it from one metric kind or type to another. For a detailed explanation, see Kinds, types, and conversions.

API

You configure the alignment period by setting the aggregations.alignmentPeriod and aggregations.perSeriesAligner fields in the MetricThreshold and MetricAbsence structures.

For more information about available functions, see Aligner in the API reference. Some of the aligner functions both align the data and convert it from one metric kind or type to another. For a detailed explanation, see Kinds, types, and conversions.
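As an illustration only, a minimal sketch that assumes the google-cloud-monitoring Python client (the metric filter and threshold value are placeholders, not values from this document) might configure these two fields like this:

```python
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

# A metric-threshold condition with a five-minute alignment period and a sum
# aligner; the metric filter and threshold value are illustrative placeholders.
condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Example threshold condition",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        filter=(
            'metric.type="compute.googleapis.com/instance/cpu/utilization" '
            'AND resource.type="gce_instance"'
        ),
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=0.6,
        aggregations=[
            monitoring_v3.Aggregation(
                alignment_period=duration_pb2.Duration(seconds=300),             # alignmentPeriod
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_SUM,  # perSeriesAligner
            )
        ],
    ),
)
```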

To illustrate the effect of the alignment period on a condition in an alerting policy, consider a metric-threshold condition that is monitoring a metric with a sampling period of one minute. Assume that the alignment period is set to five minutes and that the aligner is set to sum. Also, assume that the condition is met when the aligned value of the time series is greater than two for at least three minutes, and that the condition is evaluated every minute. In this example, the alignment period is five minutes, and the retest window, which is described in the next section, is three minutes. The following figure illustrates several sequential evaluations of the condition:

Figure illustrating the effect of the alignment period on the retest window/duration.

Each row in the figure illustrates a single evaluation of the condition. The time series data is shown. The points in the alignment period are shown with blue dots; older dots are black. Each row displays the aligned value and whether this value is greater than the threshold of two. For the row labeled start, the aligned value computes to one, which is less than the threshold. At the next evaluation, the sum of the samples in the alignment period is two. On the third evaluation, the sum is three, and because this value is greater than the threshold, a timer for the retest window is started.

Retest windows

The condition of an alerting policy has a retest window, which prevents the condition from being met due to a single measurement or forecast. For example, assume that the retest window of a condition is set to 15 minutes. The following describes the behavior of the condition based on its type:

  • Metric-threshold conditions are met when, for a single time series, every aligned measurement in a 15-minute interval violates the threshold.
  • Metric-absence conditions are met when no data arrives for a time series in a 15-minute interval.
  • Forecast conditions are met when every forecast produced during a 15-minute window predicts that the time series will violate the threshold within the forecast window.

For policies with one condition, an incident is opened and notifications are sent when the condition is met. These incidents stay open while the condition continues to be met.

Google Cloud console

You configure the retest window by using the Retest window field in the Configure alert trigger step.

API

You configure the retest window by setting the field called duration in the MetricThreshold and MetricAbsence structures.
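Under the same assumption about the google-cloud-monitoring Python client, a minimal sketch of a condition with a 15-minute retest window (the filter and threshold are placeholders) looks like this:

```python
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

# A 15-minute retest window, expressed through the duration field; the filter
# and threshold are illustrative placeholders.
condition_threshold = monitoring_v3.AlertPolicy.Condition.MetricThreshold(
    filter=(
        'metric.type="compute.googleapis.com/instance/cpu/utilization" '
        'AND resource.type="gce_instance"'
    ),
    comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
    threshold_value=0.6,
    duration=duration_pb2.Duration(seconds=900),  # retest window: 15 minutes
)
```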

The previous figure illustrated three evaluations of a metric-threshold condition. At the time start + 2 minutes, the aligned value is greater than the threshold; however, the condition isn't met because the retest window is set to three minutes. The following figure illustrates the results for the next evaluations of the condition:

Figure illustrating the effect of the retest window.

Even though the aligned value is greater than the threshold at time start + 2 minutes, the condition isn't met until the aligned value is greater than the threshold for three minutes. That event occurs at time start + 5 minutes.

A condition resets its retest window each time a measurement or forecast doesn't satisfy the condition. This behavior is illustrated in the following example:

Example: This alerting policy contains one metric-threshold condition that specifies a five-minute retest window.

If HTTP response latency is greater than two seconds,
and if the latency is greater than the threshold for five minutes,
then open an incident and email your support team.

The following sequence illustrates how the retest window affects the evaluation of the condition:

  1. The HTTP latency is less than two seconds.
  2. For the next three consecutive minutes, HTTP latency is greater than two seconds.
  3. In the next measurement, the latency is less than two seconds, so the condition resets the retest window.
  4. For the next five consecutive minutes, HTTP latency is greater than two seconds, so the condition is met.

    Because the alerting policy has one condition, Monitoring sends notifications when the condition is met.
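The following pure-Python sketch is a simplified simulation of this reset behavior, not the actual evaluation logic; the latency values are hypothetical:

```python
from datetime import timedelta

RETEST_WINDOW = timedelta(minutes=5)
THRESHOLD_SECONDS = 2.0

def minute_condition_is_met(measurements, period=timedelta(minutes=1)):
    """Given hypothetical (minute, latency) samples, one per evaluation period,
    return the minute at which the condition is met, or None."""
    violating_for = timedelta(0)
    for minute, latency in measurements:
        if latency > THRESHOLD_SECONDS:
            violating_for += period
        else:
            violating_for = timedelta(0)  # a non-violating sample resets the retest window
        if violating_for >= RETEST_WINDOW:
            return minute
    return None

# Three violating minutes, one good minute (reset), then five violating minutes.
latencies = [1.5, 2.5, 2.5, 2.5, 1.0, 2.5, 2.5, 2.5, 2.5, 2.5]
print(minute_condition_is_met(list(enumerate(latencies, start=1))))  # met at minute 10
```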

Set the retest window to be long enough to minimize false positives, but short enough to ensure that incidents are opened in a timely manner.

Best practices for setting the alignment period and retest window

The alignment period determines how many samples are combined with the aligner:

  • The minimum value of the alignment period for a metric type is the sampling period of that metric type. For example, if the metric type is sampled every 300 seconds, then the alignment period should be at least 300 seconds. To combine 5 samples, set the alignment period to 5 * 300 seconds, that is, 1500 seconds.

  • The maximum value of the alignment period is 24 hours less the ingestion delay of the metric type. For example, if the ingestion delay for a metric is 6 hours, then the maximum value of the alignment period is 18 hours.

Use the retest window to specify the responsiveness of the alert. For example, if you set the retest window to 20 minutes for a metric-absence condition, then there must be no data for 20 minutes before the condition is met. For a more responsive alerting policy, set the retest window to a smaller value. For metric-threshold conditions, to have the most responsive alerting policy, set the retest window to zero. With a retest window of zero, a single aligned value that violates the threshold causes the condition to be met.

Alerting policy conditions are evaluated at a fixed frequency. The choices that you make for the alignment period and the retest window don't determine how often the condition is evaluated.

Policies with multiple conditions

An alerting policy can contain up to 6 conditions.

If you are using the Cloud Monitoring API or if your alerting policy has multiple conditions, then you must specify when an incident is opened. To configure how multiple conditions are combined, do one of the following:

Google Cloud console

You configure combiner options in the Multi-condition trigger step.

API

You configure combiner options with the combiner field of the AlertPolicy structure.

The following list gives each setting in the Google Cloud console, the equivalent combiner value in the Cloud Monitoring API, and a description of the setting:

  • Any condition is met (API value: OR): An incident is opened if any resource causes any of the conditions to be met.

  • All conditions are met even for different resources for each condition (API value: AND; this is the default): An incident is opened if each condition is met, even if a different resource causes those conditions to be met.

  • All conditions are met (API value: AND_WITH_MATCHING_RESOURCE): An incident is opened if the same resource causes each condition to be met. This setting is the most stringent combining choice.

In this context, the term met means that the condition's configuration evaluates to true. For example, if the configuration is Any time series is greater than 10 for 5 minutes, then when this statement evaluates to true, the condition is met.

Example

Consider a Google Cloud project that contains two VM instances, vm1 and vm2. Also, assume that you create an alerting policy with 2 conditions:

  • The condition named CPU usage is too high monitors the CPU usage of the instances. This condition is met when the CPU usage of any instance is greater than 100ms/s for 1 minute.
  • The condition named Excessive utilization monitors the CPU utilization of the instances. This condition is met when the CPU utilization of any instance is greater than 60% for 1 minute.

Initially, assume that both conditions evaluate to false.

Next, assume that the CPU usage of vm1 exceeds 100ms/s for 1 minute. Because the CPU usage is greater than the threshold for one minute, the condition CPU usage is too high is met. If the conditions are combined with Any condition is met, then an incident is created because a condition is met. If the conditions are combined with All conditions are met or All conditions are met even for different resources for each condition, then an incident isn't created. These combiner choices require that both conditions be met.

Next, assume that the CPU usage of vm1 remains greater than 100ms/s and that the CPU utilization of vm2 exceeds 60% for 1 minute. The result is that both conditions are met. The following describes what occurs based on how the conditions are combined:

  • Any condition is met: An incident is created when a resource causes a condition to be met. In this example, vm2 causes the condition Excessive utilization to be met.

    If vm2 also causes the condition CPU usage is too high to be met, then another incident is created, because vm1 causing that condition to be met and vm2 causing it to be met are distinct events.

  • All conditions are met even for different resources for each condition: An incident is created because both conditions are met.

  • All conditions are met: An incident isn't created because this combiner requires that the same resource cause all conditions to be met. In this example, no incident is created because vm1 causes CPU usage is too high to be met while vm2 causes Excessive utilization to be met.
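As a sketch only, assuming the google-cloud-monitoring Python client, a two-condition policy similar to this example might be created as follows; the metric filters, thresholds, aligners, and project ID are illustrative placeholders:

```python
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

def make_threshold_condition(display_name, metric_filter, threshold, aligner):
    """Illustrative helper: a metric-threshold condition with a one-minute
    alignment period, a one-minute retest window, and the given aligner."""
    return monitoring_v3.AlertPolicy.Condition(
        display_name=display_name,
        condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
            filter=metric_filter,
            comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
            threshold_value=threshold,
            duration=duration_pb2.Duration(seconds=60),  # retest window
            aggregations=[
                monitoring_v3.Aggregation(
                    alignment_period=duration_pb2.Duration(seconds=60),
                    per_series_aligner=aligner,
                )
            ],
        ),
    )

policy = monitoring_v3.AlertPolicy(
    display_name="CPU usage and utilization",
    # "All conditions are met" in the console maps to AND_WITH_MATCHING_RESOURCE.
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.AND_WITH_MATCHING_RESOURCE,
    conditions=[
        make_threshold_condition(
            "CPU usage is too high",
            'metric.type="compute.googleapis.com/instance/cpu/usage_time" AND resource.type="gce_instance"',
            0.1,  # 100 ms/s expressed as 0.1 s/s; illustrative only
            monitoring_v3.Aggregation.Aligner.ALIGN_RATE,
        ),
        make_threshold_condition(
            "Excessive utilization",
            'metric.type="compute.googleapis.com/instance/cpu/utilization" AND resource.type="gce_instance"',
            0.6,  # 60% utilization
            monitoring_v3.Aggregation.Aligner.ALIGN_MEAN,
        ),
    ],
)

client = monitoring_v3.AlertPolicyServiceClient()
client.create_alert_policy(name="projects/PROJECT_ID", alert_policy=policy)
```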

Partial metric data

When time series data stops arriving or when data is delayed, Monitoring classifies the data as missing. Missing data can prevent incidents from closing. Delays in data arriving from third-party cloud providers can be as high as 30 minutes, with 5-15 minute delays being the most common. A lengthy delay—longer than the retest window—can cause conditions to enter an "unknown" state. When the data finally arrives, Monitoring might have lost some of the recent history of the conditions. Later inspection of the time-series data might not reveal this problem because there is no evidence of delays once the data arrives.

Google Cloud console

You can configure how Monitoring evaluates a metric-threshold condition when data stops arriving. For example, when an incident is open and an expected measurement doesn't arrive, do you want Monitoring to leave the incident open or to close it immediately? Similarly, when data stops arriving and no incident is open, do you want an incident to be opened? Lastly, how long should an incident stay open after data stops arriving?

There are two configurable fields that specify how Monitoring evaluates metric-threshold conditions when data stops arriving:

  • To configure how Monitoring determines the replacement value for missing data, use the Evaluation of missing data field, which you set in the Condition trigger step. This field is disabled when the retest window is set to No retest.

    The retest window is the field called duration in the Cloud Monitoring API.

  • To configure how long Monitoring waits before closing an open incident after data stops arriving, use the Incident autoclose duration field. You set the auto-close duration in the Notification step. The default auto-close duration is seven days.

The following describes the different options for the missing data field:

  • Missing data empty
    Summary: Open incidents stay open. New incidents aren't opened.
    Details: For conditions that are met, the condition continues to be met when data stops arriving. If an incident is open for this condition, then the incident stays open. When an incident is open and no data arrives, the auto-close timer starts after a delay of at least 15 minutes. If the timer expires, then the incident is closed.
    For conditions that aren't met, the condition continues to not be met when data stops arriving.

  • Missing data points treated as values that violate the policy condition
    Summary: Open incidents stay open. New incidents can be opened.
    Details: For conditions that are met, the condition continues to be met when data stops arriving. If an incident is open for this condition, then the incident stays open. When an incident is open and no data arrives for the auto-close duration plus 24 hours, the incident is closed.
    For conditions that aren't met, this setting causes the metric-threshold condition to behave like a metric-absence condition. If data doesn't arrive in the time specified by the retest window, then the condition is evaluated as met. For an alerting policy with one condition, the condition being met results in an incident being opened.

  • Missing data points treated as values that don't violate the policy condition
    Summary: Open incidents are closed. New incidents aren't opened.
    Details: For conditions that are met, the condition stops being met when data stops arriving. If an incident is open for this condition, then the incident is closed.
    For conditions that aren't met, the condition continues to not be met when data stops arriving.

API

You can configure how Monitoring evaluates a metric-threshold condition when data stops arriving. For example, when an incident is open and an expected measurement doesn't arrive, do you want Monitoring to leave the incident open or to close it immediately? Similarly, when data stops arriving and no incident is open, do you want an incident to be opened? Lastly, how long should an incident stay open after data stops arriving?

There are two configurable fields that specify how Monitoring evaluates metric-threshold conditions when data stops arriving:

  • To configure how Monitoring determines the replacement value for missing data, use the evaluationMissingData field of the MetricThreshold structure. This field is ignored when the duration field is zero.

  • To configure how long Monitoring waits before closing an open incident after data stops arriving, use the autoClose field in the AlertStrategy structure.
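For illustration, a minimal sketch that assumes the google-cloud-monitoring Python client and uses placeholder values sets both fields like this:

```python
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

# Treat missing data as non-violating; the filter, threshold, and durations
# are illustrative placeholders.
condition_threshold = monitoring_v3.AlertPolicy.Condition.MetricThreshold(
    filter=(
        'metric.type="compute.googleapis.com/instance/cpu/utilization" '
        'AND resource.type="gce_instance"'
    ),
    comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
    threshold_value=0.6,
    duration=duration_pb2.Duration(seconds=300),  # must be nonzero, or the next field is ignored
    evaluation_missing_data=(
        monitoring_v3.AlertPolicy.Condition.EvaluationMissingData.EVALUATION_MISSING_DATA_INACTIVE
    ),
)

# Close open incidents 30 minutes after data stops arriving (the default is seven days).
alert_strategy = monitoring_v3.AlertPolicy.AlertStrategy(
    auto_close=duration_pb2.Duration(seconds=1800),
)
```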

The following describes the different options for the missing data field:

  • EVALUATION_MISSING_DATA_UNSPECIFIED
    Summary: Open incidents stay open. New incidents aren't opened.
    Details: For conditions that are met, the condition continues to be met when data stops arriving. If an incident is open for this condition, then the incident stays open. When an incident is open and no data arrives, the auto-close timer starts after a delay of at least 15 minutes. If the timer expires, then the incident is closed.
    For conditions that aren't met, the condition continues to not be met when data stops arriving.

  • EVALUATION_MISSING_DATA_ACTIVE
    Summary: Open incidents stay open. New incidents can be opened.
    Details: For conditions that are met, the condition continues to be met when data stops arriving. If an incident is open for this condition, then the incident stays open. When an incident is open and no data arrives for the auto-close duration plus 24 hours, the incident is closed.
    For conditions that aren't met, this setting causes the metric-threshold condition to behave like a metric-absence condition. If data doesn't arrive in the time specified by the duration field, then the condition is evaluated as met. For an alerting policy with one condition, the condition being met results in an incident being opened.

  • EVALUATION_MISSING_DATA_INACTIVE
    Summary: Open incidents are closed. New incidents aren't opened.
    Details: For conditions that are met, the condition stops being met when data stops arriving. If an incident is open for this condition, then the incident is closed.
    For conditions that aren't met, the condition continues to not be met when data stops arriving.

You can minimize problems due to missing data by doing any of the following:

  • Contact your third-party cloud provider to identify ways to reduce metric collection latency.
  • Use longer retest windows in your conditions. Using a longer retest window has the disadvantage of making your alerting policies less responsive.
  • Choose metrics that have a lower collection delay:

    • Monitoring agent metrics, especially when the agent is running on VM instances in third-party clouds.
    • Custom metrics, when you write their data directly to Monitoring.
    • Log-based metrics, if collection of log entries isn't delayed.

For more information, see Monitoring agent overview, User-defined metrics overview, and Log-based metrics.

When Monitoring sends notifications and creates incidents

Cloud Monitoring sends a notification when a time series causes a condition to be met. The notification is sent to all notification channels. You can't restrict a notification to a specific channel or to a subset of your policy's channels.

If you configure repeated notifications, then the same notification is re-sent to specific notification channels for your alerting policy.
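In the API, repeated notifications are part of the alerting policy's alert strategy. The following sketch assumes the notificationChannelStrategy field of the google-cloud-monitoring Python client; the channel resource name and interval are placeholders:

```python
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

# Re-send the notification to one channel every 30 minutes while the incident
# stays open; the channel resource name is a placeholder.
alert_strategy = monitoring_v3.AlertPolicy.AlertStrategy(
    notification_channel_strategy=[
        monitoring_v3.AlertPolicy.AlertStrategy.NotificationChannelStrategy(
            notification_channel_names=[
                "projects/PROJECT_ID/notificationChannels/CHANNEL_ID"
            ],
            renotify_interval=duration_pb2.Duration(seconds=1800),
        )
    ],
)
```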

You might receive multiple unique notifications related to one alerting policy when any of the following are true:

  • A condition is monitoring multiple time series.

  • A policy contains multiple conditions. In this case, the notifications you receive depend on the value of the alerting policy's multi-condition trigger:

    • All conditions are met: When all conditions are met, for each time series that results in a condition being met, the alerting policy sends a notification and creates an incident.

      You can't configure Cloud Monitoring to create only one incident and send only one notification when the alerting policy contains multiple conditions.

    • Any condition is met: The alerting policy sends a notification when a time series causes the condition to be met.

    For more information, see Policies with multiple conditions.

Alerting policies created by using the Cloud Monitoring API also notify you when the condition is met and when the condition stops being met. Alerting policies created by using the Google Cloud console don't send a notification when the condition stops being met unless you've enabled that behavior.

When Monitoring doesn't send notifications or create incidents

In the following situations, Monitoring doesn't create incidents or send notifications when the conditions of an alerting policy are met:

  • The alerting policy is disabled.
  • The alerting policy is snoozed.
  • Monitoring has reached the limit for the maximum number of open incidents.

Disabled alerting policies

Monitoring doesn't create incidents or send notifications for disabled alerting policies. However, Monitoring continues to evaluate a disabled alerting policy's conditions.

When you enable a disabled policy, Monitoring evaluates the values of all conditions over the most recent retest window. The most recent retest window might include data taken before, during, and after the policy was enabled. The conditions of a disabled policy can be met immediately after resuming the policy, even with large retest windows.

For example, suppose you have an alerting policy with a five-minute retest window that monitors a specific process, and suppose you disable this policy. The following week, the process goes down, and because the alerting policy is disabled you aren't notified. If you restart the process and enable the alerting policy immediately, then Monitoring recognizes that the process hasn't been up for the last five minutes and opens an incident.

The incidents related to a disabled alerting policy remain open until the policy's auto-close duration expires.

Snoozed alerting policies

Monitoring doesn't send notifications or create incidents for an alerting policy that is snoozed. We recommend snoozing an alerting policy when you want to prevent it from sending notifications for a short interval. For example, before you perform maintenance on a virtual machine (VM), you might create a snooze and add the alerting policies that monitor the instance to the snooze criteria.
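As a sketch, assuming the Snooze endpoint of the google-cloud-monitoring Python client (the project and policy IDs are placeholders), such a snooze might be created like this:

```python
import time
from google.cloud import monitoring_v3

client = monitoring_v3.SnoozeServiceClient()

now = int(time.time())
snooze = monitoring_v3.Snooze(
    display_name="VM maintenance window",
    # Snooze the alerting policies that monitor the instance for the next two hours.
    criteria=monitoring_v3.Snooze.Criteria(
        policies=["projects/PROJECT_ID/alertPolicies/POLICY_ID"]
    ),
    interval=monitoring_v3.TimeInterval(
        {"start_time": {"seconds": now}, "end_time": {"seconds": now + 2 * 3600}}
    ),
)
client.create_snooze(parent="projects/PROJECT_ID", snooze=snooze)
```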

When you snooze an alerting policy, Monitoring closes all open incidents related to the policy. Monitoring can open new incidents after the snooze expires. For information, see Snooze notifications and alerts.

Limits of notifications and open incidents

An alerting policy can apply to many resources, and a problem affecting all resources can cause the alerting policy to open incidents for each resource. An incident is opened for each time series that results in a condition being met.

To prevent overloading the system, the number of incidents that a single policy can open simultaneously is limited to 1,000.

For example, consider a policy that applies to 2,000 Compute Engine instances, and each instance causes the alerting conditions to be met. Monitoring limits the number of open incidents to 1,000. Any remaining conditions that are met are ignored until some of the open incidents for that policy close.

As a result of this limit, a single notification channel can receive up to 1,000 notifications at one time. If your alerting policy has multiple notification channels, then this limit applies to each notification channel independently.

Latency

Latency refers to the delay between when Monitoring samples a metric and when the metric data point becomes visible as time series data. The latency affects when notifications are sent. For example, if a monitored metric has a latency of up to 180 seconds, then Monitoring won't create an incident for up to 180 seconds after the alerting policy condition evaluates to true. For more information, see Latency of metric data.

The following events and settings contribute to the latency:

  • Metric collection delay: The time Monitoring needs to collect metric values. For Google Cloud metrics, most values aren't visible for 60 seconds after collection; however, the delay depends on the metric. Alerting policy computations add a further delay of 60 to 90 seconds. For AWS CloudWatch metrics, the visibility delay can be several minutes. For uptime checks, the delay averages two minutes (measured from the end of the retest window).

  • Retest window: The window configured for the condition. A condition is met only when it is true throughout the retest window. For example, a retest window of five minutes delays notification by at least five minutes from when the event first occurs.

  • Time for notification to arrive: Notification channels, such as email and SMS, may experience network or other latencies (unrelated to what's being delivered), sometimes approaching minutes. On some channels—such as SMS and Slack—there is no guarantee that the messages are delivered.

What's next