This document describes how the alignment period and duration settings determine when a condition triggers, how alerting policies combine multiple conditions, and how alerting policies replace missing data points. It also describes the maximum number of open incidents for a policy, the number of notifications per incident, and what causes notification delays.
This content does not apply to log-based alerting policies. For information about log-based alerting policies, see Monitoring your logs.
Alignment period and duration settings
The alignment period and the duration window are two fields that you set when specifying a condition for an alerting policy. This section provides a brief illustration of the meaning of these fields.
Alignment period
The alignment period is a look-back interval from a particular point in time; the aligner is the function that combines the points in the look-back interval into an aligned value. For example, when the alignment period is five minutes, at 1:00 PM the alignment period contains the samples received between 12:55 PM and 1:00 PM. At 1:01 PM, the alignment period slides forward one minute and contains the samples received between 12:56 PM and 1:01 PM.
Google Cloud console
You configure the alignment fields with the Rolling window and Rolling window function menus that are part of the New condition dialog.
For more information about the available functions,
see Aligner
in the API reference. Some of the aligner
functions both align the data
and convert it from one metric kind or type to another. For a detailed
explanation, see Kinds, types, and conversions.
API
You configure the alignment fields by setting the
aggregations.alignmentPeriod
and aggregations.perSeriesAligner
fields
in the MetricThreshold
and
MetricAbsence
structures.
For more information about the available functions,
see Aligner
in the API reference. Some of the aligner
functions both align the data
and convert it from one metric kind or type to another. For a detailed
explanation, see Kinds, types, and conversions.
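For example, the following fragment is a minimal sketch of the aggregations portion of a condition, assuming a five-minute alignment period and the ALIGN_SUM aligner; the values are illustrative only:

```json
"aggregations": [
  {
    "alignmentPeriod": "300s",
    "perSeriesAligner": "ALIGN_SUM"
  }
]
```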
To illustrate the effect of the alignment period on a condition in an alerting policy, consider a metric-threshold condition that is monitoring a metric with a sampling period of one minute. Assume that the alignment period is set to five minutes and that the aligner is set to sum. Also, assume that the condition triggers when the aligned value of the time series is greater than two for at least three minutes, and that the condition is evaluated every minute. The following figure illustrates several sequential evaluations of the condition:
Each row in the figure illustrates a single evaluation of the condition. The time series data is shown: the points in the alignment period are shown as blue dots, and older points are shown as black dots. Each row displays the aligned value and whether that value is greater than the threshold of two. For the row labeled start, the aligned value computes to one, which is less than the threshold. At the next evaluation, the sum of the samples in the alignment period is two. On the third evaluation, the sum is three, and because this value is greater than the threshold, a timer for the duration is started.
Duration window
You use the duration, or the duration window, to prevent a condition from triggering due to a single measurement or forecast. For example, assume that the duration field for a condition is set to 15 minutes. The following describes the behavior of the condition based on its type:
- Metric-threshold conditions trigger when, for a single time series, every aligned measurement in a 15-minute interval violates the threshold.
- Metric-absence conditions trigger when no data arrives for a time series in a 15-minute interval.
- Forecast conditions trigger when every forecast produced during a 15-minute window predicts that the time series will violate the threshold within the forecast window.
For policies with one condition, an incident is opened and notifications are sent when the condition triggers. These incidents stay open while the condition continues to be met.
Google Cloud console
You configure the duration window by using the Retest window field in the Configure trigger step.
API
You configure the duration window by setting the
duration
field in the MetricThreshold
and
MetricAbsence
structures.
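For example, a minimal sketch of a conditionThreshold that combines the aggregation settings shown earlier with a three-minute duration window might look like the following; the filter value is a placeholder and the numbers are illustrative:

```json
"conditionThreshold": {
  "filter": "FILTER",
  "aggregations": [
    {
      "alignmentPeriod": "300s",
      "perSeriesAligner": "ALIGN_SUM"
    }
  ],
  "comparison": "COMPARISON_GT",
  "thresholdValue": 2,
  "duration": "180s"
}
```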
The previous figure illustrated three evaluations of a metric-threshold condition. At the time start + 2 minutes, the aligned value is greater than the threshold; however, the condition doesn't trigger because the duration window is set to three minutes. The following figure illustrates the results for the next evaluations of the condition:
Even though the aligned value is greater than the threshold at time start + 2 minutes, the condition doesn't trigger until the aligned value is greater than the threshold for three minutes. That event occurs at time start + 5 minutes.
A condition resets its duration window each time a measurement or forecast doesn't satisfy the condition. This behavior is illustrated in the following example:
Example: This alerting policy contains one metric-threshold condition that specifies a five-minute duration window.
If HTTP response latency is greater than two seconds,
and if the latency is greater than the threshold for five minutes,
then open an incident and send email to your support team.
The following sequence illustrates how the duration window affects the evaluation of the condition:
- The HTTP latency is less than two seconds.
- For the next three consecutive minutes, HTTP latency is greater than two seconds.
- In the next measurement, the latency is less than two seconds, so the condition resets the duration window.
- For the next five consecutive minutes, HTTP latency is greater than two seconds, so the condition triggers.
Because the policy has one condition, an incident is opened and notifications are sent when the condition triggers.
Set the duration window to be long enough to minimize false positives, but short enough to ensure that incidents are opened in a timely manner.
Select the alignment period and duration window
Always set the alignment period, which determines how many samples are combined with the aligner, to be at least as long as the sampling period. To combine five samples, set the alignment period to be five times the sampling period.
Use the duration window to specify the responsiveness of the alert. For example, if you set the duration window to 20 minutes for a metric-absence condition, then there must be no data for 20 minutes before the condition triggers. If you want a more responsive alert, then set the duration to a smaller value. For metric-threshold conditions, to have the most responsive alert, set the duration to zero. A single aligned value can cause these types of conditions to trigger.
Alerting policy conditions are evaluated at a fixed frequency. The choices that you make for the alignment period and the duration window don't determine how often the condition is evaluated.
Policies with multiple conditions
An alerting policy can contain up to 6 conditions.
If you are using the Cloud Monitoring API or if your alerting policy has multiple conditions, then you must specify when an incident is opened. To configure how multiple conditions are combined, do one of the following:
Google Cloud console
You configure combiner options in the Multi-condition trigger step.
API
You configure combiner options with the combiner
field of the
AlertPolicy
structure.
This table lists the settings in the Google Cloud console, the equivalent value in the Cloud Monitoring API, and a description of each setting:
| Google Cloud console "Policy triggers" value | Cloud Monitoring API combiner value | Meaning |
|---|---|---|
| Any condition is met | OR | An incident is opened if any resource causes any of the conditions to be met. |
| All conditions are met even for different resources for each condition (default) | AND | An incident is opened if each condition is met, even if a different resource causes those conditions to be met. |
| All conditions are met | AND_WITH_MATCHING_RESOURCE | An incident is opened if the same resource causes each condition to be met. This setting is the most stringent combining choice. |
In this context, the term met means that the condition's configuration evaluates to true. For example, if the configuration is "Any time series is greater than 10 for 5 minutes", then when this statement evaluates to true, the condition is met.
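In the API, the combiner is set on the AlertPolicy itself. The following is a minimal sketch; the condition bodies are omitted, and the display names match the conditions used in the example that follows:

```json
{
  "displayName": "CPU usage and utilization",
  "combiner": "AND",
  "conditions": [
    { "displayName": "CPU usage is too high" },
    { "displayName": "Excessive utilization" }
  ]
}
```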
Example
Consider a Google Cloud project that contains two VM instances, vm1 and vm2. Also, assume that you create an alerting policy with 2 conditions:
- The condition named CPU usage is too high monitors the CPU usage of the instances. This condition is met when the CPU usage of any instance is greater than 100ms/s for 1 minute.
- The condition named Excessive utilization monitors the CPU utilization of the instances. This condition is met when the CPU utilization of any instance is greater than 60% for 1 minute.
Initially, assume that both conditions evaluate to false.
Next, assume that the CPU usage of vm1 exceeds 100ms/s for 1 minute. Because
the CPU usage is greater than the threshold for one minute, the condition
CPU usage is too high
is met. If the
conditions are combined with Any condition is met, then an incident
is created because a condition is met. If the conditions are combined with
All conditions are met or
All conditions are met even for different resources for each condition,
then an incident isn't created. These combiner choices require that both
conditions be met.
Next, assume that the CPU usage of vm1 remains greater than 100ms/s and that the CPU utilization of vm2 exceeds 60% for 1 minute. The result is that both conditions are met. The following describes what occurs based on how the conditions are combined:
- Any condition is met: An incident is created when a resource causes a condition to be met. In this example, vm2 causes the condition Excessive utilization to be met. If vm2 causes the condition CPU usage is too high to be met, then that also results in an incident being created, because vm1 and vm2 causing the condition CPU usage is too high to be met are distinct events.
- All conditions are met even for different resources for each condition: An incident is created because both conditions are met.
- All conditions are met: An incident isn't created because this combiner requires that the same resource cause all conditions to be met. In this example, no incident is created because vm1 causes CPU usage is too high to be met while vm2 causes Excessive utilization to be met.
Partial metric data
When time series data stops arriving or when data is delayed, Monitoring classifies the data as missing; missing data can result in policies not alerting and incidents not closing. Delays in data arriving from third-party cloud providers can be as high as 30 minutes, with 5-15 minute delays being the most common. A lengthy delay—longer than the duration window—can cause conditions to enter an "unknown" state. When the data finally arrives, Monitoring might have lost some of the recent history of the conditions. Later inspection of the time-series data might not reveal this problem because there is no evidence of delays once the data arrives.
Google Cloud console
You can configure how Monitoring evaluates a metric-threshold condition when data stops arriving. For example, when an incident is open and an expected measurement doesn't arrive, do you want Monitoring to leave the incident open or to close it immediately? Similarly, when data stops arriving and no incident is open, do you want an incident to be opened? Lastly, how long should an incident stay open after data stops arriving?
There are two configurable fields that specify how Monitoring evaluates metric-threshold conditions when data stops arriving:
- To configure how Monitoring determines the replacement value for missing data, use the Evaluation of missing data field, which you set in the Condition trigger step. This field is disabled when the retest window is set to No retest.
- To configure how long Monitoring waits before closing an open incident after data stops arriving, use the Incident autoclose duration field. You set the auto-close duration in the Notification step. The default auto-close duration is seven days.
The following describes the different options for the missing data field:
Google Cloud console "Evaluation of missing data" field |
Summary | Details |
---|---|---|
Missing data empty | Open incidents stay open. New incidents aren't opened. |
For conditions that are met, the condition continues to be met when data stops arriving. If an incident is open for this condition, then the incident stays open. When an incident is open and no data arrives, the auto-close timer starts after a delay of at least 15 minutes. If the timer expires, then the incident is closed. For conditions that aren't met, the condition continues to not be met when data stops arriving. |
Missing data points treated as values that violate the policy condition | Open incidents stay open. New incidents can be opened. |
For conditions that are met, the condition continues to be met when data stops arriving. If an incident is open for this condition, then the incident stays open. When an incident is open and no data arrives for the auto-close duration plus 24 hours, the incident is closed. For conditions that aren't met, this setting causes the
metric-threshold condition to behave like a |
Missing data points treated as values that don't violate the policy condition | Open incidents are closed. New incidents aren't opened. |
For conditions that are met, the condition stops being met when data stops arriving. If an incident is open for this condition, then the incident is closed. For conditions that aren't met, the condition continues to not be met when data stops arriving. |
API
You can configure how Monitoring evaluates a metric-threshold condition when data stops arriving. For example, when an incident is open and an expected measurement doesn't arrive, do you want Monitoring to leave the incident open or to close it immediately? Similarly, when data stops arriving and no incident is open, do you want an incident to be opened? Lastly, how long should an incident stay open after data stops arriving?
There are two configurable fields that specify how Monitoring evaluates metric-threshold conditions when data stops arriving:
- To configure how Monitoring determines the replacement value for missing data, use the evaluationMissingData field of the MetricThreshold structure. This field is ignored when the duration field is zero.
- To configure how long Monitoring waits before closing an open incident after data stops arriving, use the autoClose field in the AlertStrategy structure.
The following describes the different options for the missing data field:
| API evaluationMissingData field | Summary | Details |
|---|---|---|
| EVALUATION_MISSING_DATA_UNSPECIFIED | Open incidents stay open. New incidents aren't opened. | For conditions that are met, the condition continues to be met when data stops arriving. If an incident is open for this condition, then the incident stays open. When an incident is open and no data arrives, the auto-close timer starts after a delay of at least 15 minutes. If the timer expires, then the incident is closed. For conditions that aren't met, the condition continues to not be met when data stops arriving. |
| EVALUATION_MISSING_DATA_ACTIVE | Open incidents stay open. New incidents can be opened. | For conditions that are met, the condition continues to be met when data stops arriving. If an incident is open for this condition, then the incident stays open. When an incident is open and no data arrives for the auto-close duration plus 24 hours, the incident is closed. For conditions that aren't met, this setting causes the metric-threshold condition to behave like a metric-absence condition. |
| EVALUATION_MISSING_DATA_INACTIVE | Open incidents are closed. New incidents aren't opened. | For conditions that are met, the condition stops being met when data stops arriving. If an incident is open for this condition, then the incident is closed. For conditions that aren't met, the condition continues to not be met when data stops arriving. |
You can minimize problems due to missing data by doing any of the following:
- Contact your third-party cloud provider to identify ways to reduce metric collection latency.
- Use longer duration windows in your conditions. Using a longer duration window has the disadvantage of making your alerting policies less responsive.
- Choose metrics that have a lower collection delay:
- Monitoring agent metrics, especially when the agent is running on VM instances in third-party clouds.
- Custom metrics, when you write their data directly to Cloud Monitoring.
- Log-based metrics, if logs collection is not delayed.
For more information, see Monitoring agent overview, User-defined metrics overview, and log-based metrics.
Notifications and incidents per policy
To prevent Monitoring from creating incidents and sending notifications for an alerting policy, one option is to disable the policy. You can also create a snooze and include the policy in the snooze criteria. When the snooze is active, the policy doesn't create incidents or send notifications.
When policies are enabled and don't match the criteria of an active snooze, incidents can be created and notifications can be sent. This section describes the limits on the number of open incidents per policy and when you might see multiple notifications for the same incident.
Number of open incidents per policy
An alerting policy can apply to many resources, and a problem affecting all resources can trigger the policy and open incidents for each resource. An incident is opened for each time series that results in a condition being met.
To prevent overloading the system, the number of incidents that a single policy can open simultaneously is limited to 1,000.
For example, consider a policy that applies to 2,000 (or 20,000) Compute Engine instances, where each instance causes the alerting conditions to be met. Monitoring limits the number of open incidents to 1,000. Any remaining conditions that are met are ignored until some of the open incidents for that policy close.
Number of notifications per policy
By default, a notification is sent out when a time series causes a condition to trigger. You might receive multiple notifications when any of the following are true:
- A condition is monitoring multiple time series.
- A policy contains multiple conditions:
  - All conditions are met: When all conditions trigger, then for each time series that results in a condition triggering, the policy sends a notification and creates an incident. For example, assume that you have a policy with two conditions and each condition is monitoring one time series. When all conditions trigger, you receive two notifications and you see two incidents.
  - Any condition is met: The policy sends a notification each time a condition triggers. For example, assume that you have a policy with two conditions and each condition is monitoring one time series. Assume that the first condition triggers. This causes an incident to be opened and a notification to be sent. If the incident is still open when a subsequent measurement causes the second condition to trigger, then another notification is sent.
Alerting policies created by using the Cloud Monitoring API notify you when the condition triggers and when the condition stops being met. By default, alerting policies created by using the Google Cloud console notify you when an incident is opened. They don't notify you when an incident is closed. You can enable notifications on incident closure.
Notifications for disabled alerting policies
When you disable an alerting policy, the policy continues to evaluate its conditions. However, incidents aren't created and notifications aren't sent.
When you enable a policy that was disabled, Monitoring examines the values of all conditions over the most recent duration window. That window might include data collected before, during, and after the period in which the policy was disabled. A policy can trigger immediately after you re-enable it, even when the duration window is large.
For example, suppose that you have an alerting policy that monitors a specific process and that you disable this policy. The following week, the process goes down; because the policy is disabled, you aren't notified. If you restart the process and immediately re-enable the alerting policy, then Monitoring recognizes that the process wasn't up for the most recent duration window, for example the last five minutes, and opens an incident.
When you disable an alerting policy, its incidents remain open unless you silence them. Silencing an incident closes all open incidents for the same condition. For information on that process, see Silencing incidents.
Notifications for policies that match the criteria of an active snooze
When you want to prevent an alerting policy from sending notifications for short intervals, we recommend that you create a snooze instead of disabling the policy. For example, before you perform maintenance on a virtual machine (VM), you might create a snooze and add to the snooze criteria the alerting policies that monitor the instance.
When a condition of an alerting policy triggers and that policy matches the criteria of an active snooze, then no incident is created and no notification is sent. When the snooze expires, the policy can create incidents and send notifications.
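For example, a snooze created through the API lists the alerting policies that it applies to and the interval during which notifications are suppressed. The following is a minimal sketch; the display name, project, policy ID, and timestamps are placeholders:

```json
{
  "displayName": "VM maintenance window",
  "interval": {
    "startTime": "2024-05-01T17:00:00Z",
    "endTime": "2024-05-01T19:00:00Z"
  },
  "criteria": {
    "policies": [
      "projects/PROJECT_ID/alertPolicies/POLICY_ID"
    ]
  }
}
```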
Send repeated notifications
To remind notification recipients of their open and acknowledged incidents, set up repeated notifications. This feature is useful for alerting policies that monitor critical resources that, when exhausted, can cause a service to fail. For example, you could set up repeated notifications for an alerting policy that monitors the amount of free disk space.
By default, an alerting policy sends one notification to each notification channel when an incident is opened. However, you can change the default behavior and configure an alerting policy to resend notifications to all or some of the alerting policy notification channels. These repeated notifications are sent for incidents with a status of Open or Acknowledged. The interval of these notifications must be at least 30 minutes and no more than 24 hours, expressed in seconds.
Google Cloud console
You can't configure repeated notifications in the Google Cloud console. Use the Google Cloud CLI or API instead.
API
Add to your AlertStrategy
object at least one
NotificationChannelStrategy
object.
A NotificationChannelStrategy
object has two fields:
- renotifyInterval: The amount of time, in seconds, between repeated notifications. You can change the value of the renotifyInterval field at any time. If an incident related to the alerting policy is open when you change the interval, then the policy sends out another notification for the incident and then restarts the interval period.
- notificationChannelNames: An array of notification channel resource names, which are strings in the format projects/PROJECT_ID/notificationChannels/CHANNEL_ID. These notification channels receive the repeated notifications at the intervals defined by the renotifyInterval value. The channel ID in the resource name is a numeric value. For information about how to retrieve the channel ID, see List notification channels in a project.
For example, the following JSON sample shows an alert strategy configured to send repeated notifications every 1800 seconds (30 minutes) to the listed notification channel:
"alertStrategy": { "notificationChannelStrategy": [ { "notificationChannelNames": [ "projects/PROJECT_ID/notificationChannels/CHANNEL_ID" ], "renotifyInterval": "1800s" } ] }
For more information about creating alerting policies with the API, see Create alerting policies by API.
To stop repeated notifications for a period of time, disable the alerting policy or create a snooze for it. To completely stop repeated notifications, edit the alerting policy by using the API and remove the NotificationChannelStrategy object.
Notification latency
Notification latency is the delay from the time a problem first starts until the time a policy is triggered and a notification is sent.
The following events and settings contribute to the overall notification latency:
- Metric collection delay: The time Cloud Monitoring needs to collect metric values. For Google Cloud metrics, most values aren't visible for 60 seconds after collection; however, the delay depends on the metric. Alerting policy computations add a further delay of 60 to 90 seconds. For AWS CloudWatch metrics, the visibility delay can be several minutes. For uptime checks, this delay can average two minutes (from the end of the duration window).
- Duration window: The window configured for the condition. A condition is met only when it is true throughout the duration window. For example, a duration window setting of five minutes delays the notification by at least five minutes from when the event first occurs.
- Time for notification to arrive: Notification channels such as email and SMS can themselves experience network or other latencies (unrelated to what's being delivered), sometimes approaching minutes. On some channels, such as SMS and Slack, there is no guarantee that the messages are delivered.
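For example, under illustrative assumptions of a 60-second metric collection delay, a 90-second alerting computation delay, and a five-minute duration window, a notification is sent no sooner than roughly 60 + 90 + 300 = 450 seconds, or 7.5 minutes, after the problem starts, plus any delivery latency added by the notification channel.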
What's next
For information about how to create an alerting policy, see the following documents:
For an assortment of alerting policies, see Sample policies.