An alerting policy is represented in the Cloud Monitoring API
by an AlertPolicy
object,
which describes a set of conditions indicating a potentially
unhealthy status in your system.
This page describes how the Monitoring API represents alerting policies and discusses the types of conditions the Monitoring API provides for alerting policies.
Structure of an alerting policy
The AlertPolicy
structure defines the components of an
alerting policy. When you create a policy, either by using the Google Cloud console
or the Monitoring API, you specify values for the following
AlertPolicy
fields:
- displayName: A descriptive label for the policy.
- documentation: Any information provided to help responders.
- userLabels: Any user-defined labels attached to the policy. For information about using labels with alerting, see Add severity levels to an alerting policy.
- conditions[]: An array of Condition structures.
- combiner: A logical operator that determines how to handle multiple conditions.
- notificationChannels[]: An array of resource names, each identifying a NotificationChannel.
- alertStrategy: Specifies how quickly Monitoring closes incidents when data stops arriving.
There are other fields you might use, depending on the conditions you create.
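For illustration, the following minimal sketch shows these fields in the JSON representation of an AlertPolicy, expressed as a Python dictionary. The project ID, notification channel ID, and the metric monitored by the single threshold condition are placeholders, not values taken from this page.

```python
# Minimal sketch of an AlertPolicy body (JSON representation) as a Python dict.
# The project ID, channel ID, and metric filter below are placeholders.
PROJECT_ID = "my-project"  # placeholder project ID

alert_policy = {
    "displayName": "High CPU utilization",
    "documentation": {
        "content": "CPU utilization has been above 90% for 5 minutes.",
        "mimeType": "text/markdown",
    },
    "userLabels": {"severity": "warning"},
    "combiner": "OR",  # how multiple conditions are combined
    "conditions": [
        {
            "displayName": "CPU above 90%",
            "conditionThreshold": {
                "filter": (
                    'resource.type = "gce_instance" AND '
                    'metric.type = "compute.googleapis.com/instance/cpu/utilization"'
                ),
                "comparison": "COMPARISON_GT",
                "thresholdValue": 0.9,
                "duration": "300s",
            },
        }
    ],
    "notificationChannels": [
        f"projects/{PROJECT_ID}/notificationChannels/CHANNEL_ID"  # placeholder
    ],
    "alertStrategy": {
        "autoClose": "1800s",  # close incidents 30 minutes after data stops arriving
    },
}
```

The name field is omitted from this sketch because Monitoring assigns it when the policy is created, as described in the following paragraphs.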
By default, alerting policies created by using the Monitoring API send notifications when a condition for triggering the policy is met and when the condition stops being met. You can't change this behavior by using the Monitoring API, but you can turn off notifications about incident closure by editing the policy in the Google Cloud console. To turn off incident-closure notifications, clear the Notify on incident closure option in the notifications section and save the edited policy.
When you create or modify the alerting policy, Monitoring sets
other fields as well, including the name
field. The value of the name
field is the resource name for the alerting policy, which identifies the
policy. The resource name has the following form:
projects/PROJECT_ID/alertPolicies/POLICY_ID
The conditions are the most variable part of an alerting policy.
Types of conditions in the API
The Cloud Monitoring API supports a variety of condition types in the
Condition
structure. There are multiple condition
types for metric-based alerting policies, and one for log-based alerting
policies. The following sections describe the available condition types.
Conditions for metric-based alerting policies
To create an alerting policy that monitors metric data, including log-based metrics, you can use the following condition types:
Filter-based metric conditions
The MetricAbsence
and MetricThreshold
conditions use
Monitoring filters to select the time-series data
to monitor. Other fields in the condition structure specify how to filter,
group, and aggregate the data. For more information on these concepts, see
Filtering and aggregation: manipulating time series.
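As a sketch of how those fields fit together, the following fragment is the value of a condition's conditionThreshold field; the metric type and grouping label are illustrative. The filter selects the time series, and the aggregation settings align, group, and reduce them before the comparison is applied.

```python
# Illustrative sketch of a filter-based threshold condition (MetricThreshold).
# The filter selects time series; aggregations align, group, and reduce them.
condition_threshold = {
    "filter": (
        'resource.type = "gce_instance" AND '
        'metric.type = "compute.googleapis.com/instance/cpu/utilization"'
    ),
    "aggregations": [
        {
            "alignmentPeriod": "300s",            # 5-minute alignment buckets
            "perSeriesAligner": "ALIGN_MEAN",     # mean value per bucket, per series
            "crossSeriesReducer": "REDUCE_MEAN",  # then average across series
            "groupByFields": ["resource.label.zone"],  # one output series per zone
        }
    ],
    "comparison": "COMPARISON_GT",
    "thresholdValue": 0.9,
    "duration": "300s",  # the threshold must be violated for 5 minutes
}
```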
If you use the MetricAbsence
condition type, then you can create a condition that triggers only when all
of the time series are absent: use the aggregations field to aggregate the
time series into a single time series. For more information, see the
MetricAbsence reference in the API documentation.
A metric-absence alerting policy requires that some data has been written previously; for more information, see Metric absence condition.
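For example, the following sketch shows the value of a condition's conditionAbsent field; the metric type is illustrative. Because the cross-series reducer collapses all matching time series into one, the condition is met only when every series stops reporting data.

```python
# Illustrative sketch of a metric-absence condition (MetricAbsence).
# REDUCE_SUM aggregates all matching time series into a single series,
# so the condition triggers only when all of them are absent.
condition_absent = {
    "filter": (
        'resource.type = "gce_instance" AND '
        'metric.type = "compute.googleapis.com/instance/cpu/utilization"'
    ),
    "aggregations": [
        {
            "alignmentPeriod": "300s",
            "perSeriesAligner": "ALIGN_MEAN",
            "crossSeriesReducer": "REDUCE_SUM",
            # No groupByFields: everything is reduced to one time series.
        }
    ],
    "duration": "600s",  # data must be absent for 10 minutes
}
```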
If you want to create an alert based on a forecast, then use the
MetricThreshold
condition type and set the forecastOptions
field. When this field is omitted, the measured data is compared to a
threshold. When this field is set, the predicted data is compared to a
threshold instead.
For more information, see Forecast conditions.
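As a hedged sketch, a forecast condition reuses the MetricThreshold fields shown earlier and adds forecastOptions; the metric type and forecast horizon below are illustrative.

```python
# Illustrative sketch of a forecast condition: a MetricThreshold with
# forecastOptions set, so predicted values are compared to the threshold.
condition_threshold_forecast = {
    "filter": (
        'resource.type = "gce_instance" AND '
        'metric.type = "agent.googleapis.com/disk/percent_used"'
    ),
    "comparison": "COMPARISON_GT",
    "thresholdValue": 90,
    "duration": "300s",
    "forecastOptions": {
        "forecastHorizon": "3600s",  # alert if a violation is predicted within 1 hour
    },
}
```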
MQL-based metric conditions
The MonitoringQueryLanguageCondition
condition uses Monitoring Query Language (MQL) to
select and manipulate the time-series data to monitor. You can create alerting
policies that compare values against a threshold or test for the absence
of values with this condition type.
If you use a MonitoringQueryLanguageCondition
condition, it must be the only
condition in your alerting policy. For more information, see
Alerting policies with MQL.
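As a hedged sketch, the value of a condition's conditionMonitoringQueryLanguage field pairs an MQL query with a duration. The query below is illustrative; it alerts when CPU utilization exceeds 90%.

```python
# Illustrative sketch of an MQL-based condition (MonitoringQueryLanguageCondition).
condition_mql = {
    "query": (
        "fetch gce_instance::compute.googleapis.com/instance/cpu/utilization\n"
        "| every 1m\n"
        "| condition val() > 0.90 '10^2.%'"
    ),
    "duration": "300s",
}
```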
Conditions for alerting on ratios
You can create metric-threshold alerting policies to monitor the
ratio of two metrics. You can create these policies by using either
the MetricThreshold
or MonitoringQueryLanguageCondition
condition type.
You can also use MQL directly in the Google Cloud console. However, you can't
create or manage ratio-based conditions by using the graphical interface for
creating threshold conditions.
We recommend using MQL to create ratio-based alerting policies.
MQL lets you build more powerful and flexible queries than you can
build by using the MetricThreshold
condition type and
Monitoring filters.
For example, with a MonitoringQueryLanguageCondition
condition, you can
compute the ratio of a gauge metric to a delta metric. For examples, see
MQL alerting-policy examples.
If you use the MetricThreshold
condition, the numerator and denominator
of the ratio must have the same MetricKind
.
For a list of metrics and their properties, see Metric lists.
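As a hedged sketch of a filter-based ratio, the MetricThreshold value below uses the denominatorFilter and denominatorAggregations fields in addition to filter and aggregations. The custom.googleapis.com/rpc_count metric and its status label are hypothetical; because both filters select the same metric type, the numerator and denominator necessarily have the same MetricKind.

```python
# Illustrative sketch of a ratio condition built with MetricThreshold.
# Numerator: RPCs whose (hypothetical) "status" label is not "OK".
# Denominator: all RPCs. The metric type is hypothetical.
shared_aggregation = {
    "alignmentPeriod": "300s",
    "perSeriesAligner": "ALIGN_DELTA",
    "crossSeriesReducer": "REDUCE_SUM",
}

ratio_condition = {
    "filter": (
        'metric.type = "custom.googleapis.com/rpc_count" AND '
        'metric.labels.status != "OK"'
    ),
    "aggregations": [shared_aggregation],
    "denominatorFilter": 'metric.type = "custom.googleapis.com/rpc_count"',
    "denominatorAggregations": [shared_aggregation],
    "comparison": "COMPARISON_GT",
    "thresholdValue": 0.1,  # alert when more than 10% of RPCs fail
    "duration": "300s",
}
```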
In general, it is best to compute ratios based on time series collected for a single metric type, by using label values. A ratio computed over two different metric types is subject to anomalies due to different sampling periods and alignment windows.
For example, suppose that you have two different metric types, an RPC total count and an RPC error count, and you want to compute the ratio of error-count RPCs over total RPCs. The unsuccessful RPCs are counted in the time series of both metric types. Therefore, there is a chance that, when you align the time series, an unsuccessful RPC doesn't appear in the same alignment interval for both time series. This difference can happen for several reasons, including the following:
- Because there are two different time series recording the same event, there are two underlying counter values implementing the collection, and they aren't updated atomically.
- The sampling rates might differ. When the time series are aligned to a common period, the counts for a single event might appear in adjacent alignment intervals in the time series for the different metrics.
The difference in the number of values in corresponding alignment intervals can
lead to nonsensical error/total
ratio values like 1/0 or 2/1.
Ratios of larger numbers are less likely to result in nonsensical values. You can get larger numbers by aggregation, either by using an alignment window that is longer than the sampling period, or by grouping data for certain labels. These techniques minimize the effect of small differences in the number of points in a given interval. That is, a two-point disparity is more significant when the expected number of points in an interval is 3 than when the expected number is 300.
If you are using built-in metric types, then you might have no choice but to compute ratios across metric types to get the value you need.
If you are designing custom metrics that might count the same thing—like RPCs returning error status—in two different metrics, consider instead a single metric, which includes each count only once. For example, suppose that you are counting RPCs and you want to track the ratio of unsuccessful RPCs to all RPCs. To solve this problem, create a single metric type to count RPCs, and use a label to record the status of the invocation, including the "OK" status. Then each status value, error or "OK", is recorded by updating a single counter for that case.
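As a hedged sketch of that design, using the google-cloud-monitoring Python client, the following code defines one counter-style custom metric with a status label. The metric type custom.googleapis.com/rpc_count, its status label, and the project ID are hypothetical.

```python
# Sketch: one custom metric for all RPCs, with a "status" label, instead of
# separate "total" and "error" metrics. Metric type and label are hypothetical.
from google.api import label_pb2, metric_pb2
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"  # placeholder project

descriptor = metric_pb2.MetricDescriptor(
    type="custom.googleapis.com/rpc_count",  # hypothetical metric type
    metric_kind=metric_pb2.MetricDescriptor.MetricKind.CUMULATIVE,
    value_type=metric_pb2.MetricDescriptor.ValueType.INT64,
    description="Count of RPCs, labeled by completion status.",
    labels=[
        label_pb2.LabelDescriptor(
            key="status",
            value_type=label_pb2.LabelDescriptor.ValueType.STRING,
            description='RPC status, for example "OK" or an error code.',
        )
    ],
)

# Each RPC increments exactly one counter: the time series whose "status"
# label matches its outcome. Ratios are then computed over label values.
client.create_metric_descriptor(name=project_name, metric_descriptor=descriptor)
```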
Condition for log-based alerting policies
To create a log-based alerting policy, which notifies you when a message
matching your filter appears in your log entries, use the
LogMatch
condition type. If you use a LogMatch
condition, it must be the only condition in your alerting policy.
Don't try to use the LogMatch
condition type in conjunction with log-based
metrics. Alerting policies that monitor log-based metrics are metric-based
policies. For more information about choosing between alerting policies that
monitor log-based metrics or log entries, see
Monitoring your logs.
The alerting policies used in the examples in the Managing alerting policies document are metric-based alerting policies, although the principles are the same for log-based alerting policies. For information specific to log-based alerting policies, see Create a log-based alert (Monitoring API) in the Cloud Logging documentation.