An alerting policy is represented in the Cloud Monitoring API by an
AlertPolicy object, which describes a set of conditions indicating a
potentially unhealthy status in your system.
This page describes how the Monitoring API represents alerting policies and discusses the types of conditions the Monitoring API provides for alerting policies.
Structure of an alerting policy
The AlertPolicy structure defines the components of an
alerting policy. When you create a policy, by using either the Google Cloud Console
or the Monitoring API, you specify values for the following fields:
displayName: A descriptive label for the policy.
documentation: Any information provided to help responders.
conditions: An array of Condition structures.
combiner: A logical operator that determines how to handle multiple conditions.
notificationChannels: An array of resource names, each identifying a notification channel.
There are other fields you might use, depending on the conditions you create.
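The fields listed above can be pictured as a JSON document. The following is a minimal sketch, not a definitive policy: the project ID, channel ID, metric, and threshold are hypothetical, and the nested conditionThreshold and documentation sub-fields are taken from the API reference rather than from this page.

```python
import json

# Hypothetical values throughout; only the top-level field names
# (displayName, documentation, conditions, combiner, notificationChannels)
# come from the list above.
policy = {
    "displayName": "High CPU utilization",
    "documentation": {
        "content": "CPU utilization exceeded 80%. Check for runaway processes.",
        "mimeType": "text/markdown",
    },
    "conditions": [
        {
            "displayName": "CPU above 80% for 5 minutes",
            "conditionThreshold": {
                "filter": (
                    'metric.type = "compute.googleapis.com/instance/cpu/utilization" '
                    'AND resource.type = "gce_instance"'
                ),
                "comparison": "COMPARISON_GT",
                "thresholdValue": 0.8,
                "duration": "300s",
            },
        }
    ],
    # With a single condition the combiner has little effect; OR is used here.
    "combiner": "OR",
    "notificationChannels": [
        "projects/example-project/notificationChannels/1234567890",
    ],
}

# Serialize to the JSON form that would be sent in a request body.
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```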
By default, alerting policies created by using the Monitoring API send notifications when a condition for triggering the policy is met and when the condition stops being met. You can't change this behavior by using the Monitoring API, but you can turn off notifications about incident closure by editing the policy in the Google Cloud Console. To turn off incident-closure notifications, clear the Notify on incident closure option in the notifications section and save the edited policy.
When you create or modify the alerting policy, Monitoring sets
other fields as well, including the
name field. The value of the name
field is the resource name for the alerting policy, which identifies the
policy. The resource name has the following form:
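As an illustration of working with the name field, the sketch below assumes the resource name looks like projects/PROJECT_ID/alertPolicies/POLICY_ID. That pattern is an assumption based on the API reference, not something stated on this page, and the helper name parse_policy_name is hypothetical.

```python
import re

# Assumed pattern: projects/PROJECT_ID/alertPolicies/POLICY_ID
_POLICY_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)/alertPolicies/(?P<policy>[^/]+)$"
)


def parse_policy_name(name: str) -> tuple[str, str]:
    """Split an alerting-policy resource name into (project, policy ID)."""
    match = _POLICY_NAME.match(name)
    if match is None:
        raise ValueError(f"not an alerting-policy resource name: {name!r}")
    return match["project"], match["policy"]


print(parse_policy_name("projects/example-project/alertPolicies/12345"))
```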
The conditions are the most variable part of an alerting policy.
Types of conditions in the API
The Cloud Monitoring API supports a variety of condition types in the
Condition structure. There are multiple condition
types for metric-based alerting policies, and one for log-based alerting
policies. The following sections describe the available condition types.
Conditions for metric-based alerting policies
To create an alerting policy that monitors metric data, including logs-based metrics, you can use the following condition types:
Filter-based metric conditions
MetricThreshold conditions use
Monitoring filters to select the time-series data
to monitor. Other fields in the condition structure specify how to filter,
group, and aggregate the data. For more information on these concepts, see
Filtering and aggregation: manipulating time series.
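To make the roles of the filter and the aggregation fields concrete, here is a minimal sketch of a helper that assembles a MetricThreshold condition. The helper name threshold_condition and all of its default values are hypothetical; the field names (filter, aggregations, comparison, thresholdValue, duration) are taken from the API reference, not from this page.

```python
def threshold_condition(display_name, metric_type, resource_type,
                        threshold, group_by=()):
    """Build a dict shaped like a MetricThreshold condition (illustrative)."""
    return {
        "displayName": display_name,
        "conditionThreshold": {
            # The Monitoring filter selects which time series to monitor.
            "filter": (
                f'metric.type = "{metric_type}" '
                f'AND resource.type = "{resource_type}"'
            ),
            # Aggregation: align each series, then combine series per group.
            "aggregations": [
                {
                    "alignmentPeriod": "60s",
                    "perSeriesAligner": "ALIGN_MEAN",
                    "crossSeriesReducer": "REDUCE_MEAN",
                    "groupByFields": list(group_by),
                }
            ],
            "comparison": "COMPARISON_GT",
            "thresholdValue": threshold,
            "duration": "300s",
        },
    }


condition = threshold_condition(
    "CPU above 80%",
    "compute.googleapis.com/instance/cpu/utilization",
    "gce_instance",
    0.8,
    group_by=["resource.label.zone"],
)
print(condition["conditionThreshold"]["filter"])
```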
If you use the
MetricAbsence condition type, you can create a condition
that triggers only when all of the time series are absent by aggregating
the time series into a single time series. For more information, see the
MetricAbsence reference in the API documentation.
A metric-absence alerting policy requires that some data has been written previously; for more information, see Metric absence condition.
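The following sketch shows one way a MetricAbsence condition might collapse every matching time series into a single series, so that the condition is met only when all of them stop reporting. The metric type custom.googleapis.com/heartbeat is hypothetical, and the conditionAbsent field names are taken from the API reference rather than from this page.

```python
# Illustrative MetricAbsence condition; all values are hypothetical.
absence_condition = {
    "displayName": "No heartbeat from any instance for 10 minutes",
    "conditionAbsent": {
        "filter": (
            'metric.type = "custom.googleapis.com/heartbeat" '
            'AND resource.type = "gce_instance"'
        ),
        "aggregations": [
            {
                "alignmentPeriod": "300s",
                "perSeriesAligner": "ALIGN_MEAN",
                # A reducer with no groupByFields combines every matching
                # series into one, so data from any instance counts as
                # "present".
                "crossSeriesReducer": "REDUCE_MEAN",
                "groupByFields": [],
            }
        ],
        "duration": "600s",
    },
}

print(absence_condition["conditionAbsent"]["aggregations"][0])
```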
MQL-based metric conditions
The MonitoringQueryLanguageCondition condition uses Monitoring Query Language (MQL) to
select and manipulate the time-series data to monitor. You can create alerting
policies that compare values against a threshold or test for the absence
of values with this condition type.
If you use a
MonitoringQueryLanguageCondition condition, it must be the only
condition in your alerting policy. For more information, see
Alerting policies with MQL.
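The single-condition rule can be checked before a policy is submitted. The sketch below is a hypothetical client-side check, not part of the API: the function name check_mql_is_only_condition is invented, the conditionMonitoringQueryLanguage field name comes from the API reference, and the MQL query is illustrative only.

```python
def check_mql_is_only_condition(policy: dict) -> None:
    """Raise ValueError if an MQL condition shares a policy with others."""
    conditions = policy.get("conditions", [])
    has_mql = any("conditionMonitoringQueryLanguage" in c for c in conditions)
    if has_mql and len(conditions) != 1:
        raise ValueError(
            "an MQL condition must be the only condition in the policy"
        )


# Illustrative MQL query; treat the exact syntax as an assumption.
mql_query = """\
fetch gce_instance::compute.googleapis.com/instance/cpu/utilization
| group_by sliding(5m), mean(val())
| condition val() > 0.8
"""

mql_policy = {
    "displayName": "High CPU (MQL)",
    "combiner": "OR",
    "conditions": [
        {
            "displayName": "CPU above 80%",
            "conditionMonitoringQueryLanguage": {
                "query": mql_query,
                "duration": "300s",
            },
        }
    ],
}

check_mql_is_only_condition(mql_policy)  # passes silently
```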
Conditions for alerting on ratios
You can create metric-threshold alerting policies to monitor the
ratio of two metrics. You can create these policies by using either
the MetricThreshold or the MonitoringQueryLanguageCondition condition type.
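For the filter-based route, a ratio condition is sketched below under the assumption that the MetricThreshold structure accepts a denominatorFilter and denominatorAggregations alongside the usual filter and aggregations; those two field names come from the API reference, not from this page. The numerator and denominator both select the same metric type and differ only by a label, and all values are hypothetical.

```python
RESPONSE_COUNT = "appengine.googleapis.com/http/server/response_count"

# Shared aggregation: per-interval deltas, summed across all series.
sum_of_deltas = {
    "alignmentPeriod": "300s",
    "perSeriesAligner": "ALIGN_DELTA",
    "crossSeriesReducer": "REDUCE_SUM",
}

ratio_condition = {
    "displayName": "5xx responses exceed 50% of all responses",
    "conditionThreshold": {
        # Numerator: only responses whose code is in the 500 range.
        "filter": (
            f'metric.type = "{RESPONSE_COUNT}" '
            'AND resource.type = "gae_app" '
            'AND metric.label.response_code >= 500 '
            'AND metric.label.response_code < 600'
        ),
        "aggregations": [dict(sum_of_deltas)],
        # Denominator: every response of the same metric type.
        "denominatorFilter": (
            f'metric.type = "{RESPONSE_COUNT}" '
            'AND resource.type = "gae_app"'
        ),
        "denominatorAggregations": [dict(sum_of_deltas)],
        "comparison": "COMPARISON_GT",
        "thresholdValue": 0.5,
        "duration": "0s",
    },
}

print(sorted(ratio_condition["conditionThreshold"]))
```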
You can also use MQL directly in the Google Cloud Console. You can't create
or manage ratio-based conditions by using the graphical interface for creating
conditions.
We recommend using MQL to create ratio-based alerting policies.
MQL lets you build more powerful and flexible queries than you can
build by using the
MetricThreshold condition type and Monitoring filters.
For example, with a
MonitoringQueryLanguageCondition condition, you can
compute the ratio of a gauge metric to a delta metric. For examples, see
MQL alerting-policy examples.
In general, it is best to compute ratios based on time series collected for a single metric type, by using label values. A ratio computed over two different metric types is subject to an edge effect.
For example, suppose that you have two different metric types, an RPC total count and an RPC error count, and you want to compute the ratio of error-count RPCs over total RPCs. The unsuccessful RPCs are counted in time series of both metric types, so there is a chance that, when you align the time series, an unsuccessful RPC can appear in one alignment interval in the total count but in a different alignment interval for the error count. This difference can happen for several reasons, including the following:
- Because there are two different time series recording the same event, there are two underlying counter values implementing the collection, and they won't be updated atomically.
- The sampling rates might differ. When the time series are aligned to a common period, counts for a single event might appear in adjacent alignment intervals in the time series for the different metrics.
The difference in the number of values in corresponding alignment intervals can
lead to nonsensical
error/total ratio values like 1/0 or 2/1.
The edge effect is typically less for ratios between larger numbers. You can get larger numbers by aggregation, either by using an alignment window that is longer than the sampling period, or by grouping together data for certain labels. These techniques minimize the effect of small differences in the number of points in a given interval; a two-point disparity is more significant if the expected number of points in an interval is three than if the expected number is 300.
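A small simulation can make the edge effect concrete. In this sketch, a single failed RPC is recorded by the error counter at 59.9 s and by the total counter at 60.1 s; the timestamps and helper names are hypothetical. With a 60-second alignment period the two records fall into different intervals, and with a 300-second period they fall into the same one.

```python
from collections import Counter


def counts_per_interval(timestamps, period):
    """Count events in each alignment interval of `period` seconds."""
    return Counter(int(t // period) for t in timestamps)


def error_total_pairs(error_times, total_times, period):
    """Return {interval index: (error count, total count)}."""
    errors = counts_per_interval(error_times, period)
    totals = counts_per_interval(total_times, period)
    intervals = sorted(set(errors) | set(totals))
    return {i: (errors[i], totals[i]) for i in intervals}


# The same failed RPC, recorded at slightly different times by two counters.
error_times = [59.9]
total_times = [60.1]

# 60 s period: the event straddles an interval boundary, giving 1/0 then 0/1.
print(error_total_pairs(error_times, total_times, period=60))
# → {0: (1, 0), 1: (0, 1)}

# 300 s period: both records land in the same interval, giving 1/1.
print(error_total_pairs(error_times, total_times, period=300))
# → {0: (1, 1)}
```

A longer period does not remove interval boundaries; it only makes a boundary disparity small relative to the counts in each interval.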
If you are using built-in metric types, then you might have no choice but to compute ratios across metric types to get the value you need.
If you are designing custom metrics that might count the same thing (like RPCs returning error status) in two different metrics, consider instead a single metric, which includes each count only once. For example, if you are counting RPCs and you want to track the ratio of unsuccessful RPCs to all RPCs, create a single metric type to count RPCs, and use a label to record the status of the invocation, including the "OK" status. Then each status value, error or "OK", is recorded by updating a single counter for that case.
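The single-metric design can be sketched with an in-memory counter keyed by a status label. This is an illustration of the counting scheme only, not of any client library; the names rpc_count, record_rpc, and error_ratio are hypothetical.

```python
from collections import Counter

# One logical metric; the status label selects which counter is updated.
rpc_count = Counter()


def record_rpc(status):
    """Record one RPC by updating exactly one counter."""
    rpc_count[status] += 1


def error_ratio(counts):
    """Ratio of non-OK RPCs to all RPCs, or None if nothing was recorded."""
    total = sum(counts.values())
    if total == 0:
        return None
    return (total - counts["OK"]) / total


for status in ["OK", "OK", "UNAVAILABLE", "OK", "DEADLINE_EXCEEDED"]:
    record_rpc(status)

print(error_ratio(rpc_count))  # → 0.4
```

Because every RPC increments exactly one counter, the numerator and denominator are derived from the same set of updates and cannot disagree about whether an event was counted.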
Condition for log-based alerting policies
To create a log-based alerting policy, which notifies you when a message
matching your filter appears in your log entries, use the
LogMatch condition type. If you use a LogMatch
condition, it must be the only condition in your alerting policy.
Don't try to use the
LogMatch condition type in conjunction with logs-based
metrics. Alerting policies that monitor logs-based metrics are metric-based
policies. For more information about choosing between alerting policies that
monitor logs-based metrics or log entries, see
Monitoring your logs.
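A log-based policy might be shaped like the sketch below. The conditionMatchedLog field name and the alertStrategy rate limit are taken from the API reference and should be treated as assumptions here; the log filter and all other values are hypothetical. The policy holds exactly one condition, as required for LogMatch.

```python
# Illustrative log-based alerting policy; all values are hypothetical.
log_policy = {
    "displayName": "Errors in instance logs",
    "combiner": "OR",
    "conditions": [
        {
            "displayName": "Log entry with severity ERROR or higher",
            "conditionMatchedLog": {
                # A Logging filter over log entries, not a Monitoring
                # filter over time series.
                "filter": 'severity >= ERROR AND resource.type = "gce_instance"',
            },
        }
    ],
    # Assumption: log-based policies limit how often notifications are sent.
    "alertStrategy": {
        "notificationRateLimit": {"period": "300s"},
    },
}

# The LogMatch condition is the only condition in the policy.
assert len(log_policy["conditions"]) == 1
print(log_policy["conditions"][0]["conditionMatchedLog"]["filter"])
```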
The alerting policies used in the examples in the Managing alerting policies document are metric-based alerting policies, although the principles are the same for log-based alerting policies. For information specific to log-based alerting policies, see Create a log-based alert (Monitoring API) in the Cloud Logging documentation.