Alerting policies in the Monitoring API

An alerting policy is represented in the Cloud Monitoring API by an AlertPolicy object, which describes a set of conditions indicating a potentially unhealthy status in your system.

This page describes how the Monitoring API represents alerting policies and discusses the types of conditions the Monitoring API provides for alerting policies.

Structure of an alerting policy

The AlertPolicy structure defines the components of an alerting policy. When you create a policy, either by using the Google Cloud Console or the Monitoring API, you specify values for the following AlertPolicy fields:

  • displayName: A descriptive label for the policy.
  • documentation: Any information provided to help responders.
  • conditions[]: An array of Condition structures.
  • combiner: A logical operator that determines how to handle multiple conditions.
  • notificationChannels[]: An array of resource names, each identifying a NotificationChannel.

There are other fields you might use, depending on the conditions you create.
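
For example, the JSON representation of a simple alerting policy that sets these fields might look like the following sketch. The metric filter, threshold, documentation text, and channel ID are placeholders for illustration:

{
  "displayName": "High CPU utilization",
  "documentation": {
    "content": "CPU utilization has exceeded 80% for 5 minutes. See the runbook for remediation steps.",
    "mimeType": "text/markdown"
  },
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "CPU utilization above 80%",
      "conditionThreshold": {
        "filter": "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" AND resource.type=\"gce_instance\"",
        "comparison": "COMPARISON_GT",
        "thresholdValue": 0.8,
        "duration": "300s"
      }
    }
  ],
  "notificationChannels": [
    "projects/PROJECT_ID/notificationChannels/CHANNEL_ID"
  ]
}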

By default, alerting policies created by using the Monitoring API send notifications when a condition for triggering the policy is met and when the condition stops being met. You can't change this behavior by using the Monitoring API, but you can turn off notifications about incident closure by editing the policy in the Google Cloud Console. To turn off incident-closure notifications, clear the Notify on incident closure option in the notifications section and save the edited policy.

When you create or modify the alerting policy, Monitoring sets other fields as well, including the name field. The value of the name field is the resource name for the alerting policy, which identifies the policy. The resource name has the following form:

projects/PROJECT_ID/alertPolicies/POLICY_ID
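
For example, you can retrieve a specific policy by passing its resource name to the alertPolicies.get method (a sketch; the IDs are placeholders):

GET https://monitoring.googleapis.com/v3/projects/PROJECT_ID/alertPolicies/POLICY_ID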

The conditions are the most variable part of an alerting policy.

Types of conditions in the API

The Cloud Monitoring API supports a variety of condition types in the Condition structure. There are multiple condition types for metric-based alerting policies, and one for log-based alerting policies. The following sections describe the available condition types.
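
In the JSON representation of a Condition, the condition type is determined by which one of the type-specific fields you populate:

  • conditionThreshold: a MetricThreshold condition.
  • conditionAbsent: a MetricAbsence condition.
  • conditionMonitoringQueryLanguage: a MonitoringQueryLanguageCondition condition.
  • conditionMatchedLog: a LogMatch condition.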

Conditions for metric-based alerting policies

To create an alerting policy that monitors metric data, including logs-based metrics, you can use the following condition types:

Filter-based metric conditions

The MetricAbsence and MetricThreshold conditions use Monitoring filters to select the time-series data to monitor. Other fields in the condition structure specify how to filter, group, and aggregate the data. For more information on these concepts, see Filtering and aggregation: manipulating time series.
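
For example, a conditionThreshold sketch that filters for a metric and then aligns, groups, and reduces the selected time series might look like the following; the metric, grouping, and threshold are placeholders for illustration:

"conditionThreshold": {
  "filter": "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" AND resource.type=\"gce_instance\"",
  "aggregations": [
    {
      "alignmentPeriod": "300s",
      "perSeriesAligner": "ALIGN_MEAN",
      "crossSeriesReducer": "REDUCE_MEAN",
      "groupByFields": ["resource.label.zone"]
    }
  ],
  "comparison": "COMPARISON_GT",
  "thresholdValue": 0.8,
  "duration": "300s",
  "trigger": { "count": 1 }
}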

If you use the MetricAbsence condition type, you can create a condition that triggers only when all of the time series are absent: use the aggregations field to combine the time series into a single time series. For details, see the MetricAbsence reference in the API documentation.

A metric-absence alerting policy requires that some data has been written previously; for more information, see Metric absence condition.
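
For example, the following conditionAbsent sketch uses a cross-series reducer to aggregate all of the matching time series into a single series, so the condition triggers only when every time series stops reporting data. The metric type shown is a hypothetical custom metric:

"conditionAbsent": {
  "filter": "metric.type=\"custom.googleapis.com/heartbeat\" AND resource.type=\"gce_instance\"",
  "aggregations": [
    {
      "alignmentPeriod": "300s",
      "perSeriesAligner": "ALIGN_COUNT",
      "crossSeriesReducer": "REDUCE_SUM"
    }
  ],
  "duration": "1200s"
}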

MQL-based metric conditions

The MonitoringQueryLanguageCondition condition uses Monitoring Query Language (MQL) to select and manipulate the time-series data to monitor. You can create alerting policies that compare values against a threshold or test for the absence of values with this condition type. If you use a MonitoringQueryLanguageCondition condition, it must be the only condition in your alerting policy. For more information, see Alerting policies with MQL.
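
For example, a MonitoringQueryLanguageCondition sketch might look like the following; the MQL query is illustrative only:

"conditionMonitoringQueryLanguage": {
  "query": "fetch gce_instance::compute.googleapis.com/instance/cpu/utilization | every 1m | condition val() > 0.80 '10^2.%'",
  "duration": "300s",
  "trigger": { "count": 1 }
}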

Conditions for alerting on ratios

You can create metric-threshold alerting policies to monitor the ratio of two metrics. You can create these policies by using either the MetricThreshold or MonitoringQueryLanguageCondition condition type. You can also use MQL directly in the Google Cloud Console. However, you can't create or manage ratio-based conditions by using the graphical interface for creating threshold conditions.

We recommend using MQL to create ratio-based alerting policies. MQL lets you build more powerful and flexible queries than you can build by using the MetricThreshold condition type and Monitoring filters. For example, with a MonitoringQueryLanguageCondition condition, you can compute the ratio of a gauge metric to a delta metric. For examples, see MQL alerting-policy examples.

If you use the MetricThreshold condition, the numerator and denominator of the ratio must have the same MetricKind. For a list of metrics and their properties, see Metric lists.
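
With the MetricThreshold condition type, a ratio is expressed by supplying a second filter, and optionally a second set of aggregations, for the denominator in the denominatorFilter and denominatorAggregations fields. The following sketch assumes two hypothetical custom DELTA metrics, an RPC error count and an RPC total count, and triggers when errors exceed 10 percent of the total:

"conditionThreshold": {
  "filter": "metric.type=\"custom.googleapis.com/rpc_error_count\" AND resource.type=\"gce_instance\"",
  "aggregations": [
    { "alignmentPeriod": "300s", "perSeriesAligner": "ALIGN_DELTA", "crossSeriesReducer": "REDUCE_SUM" }
  ],
  "denominatorFilter": "metric.type=\"custom.googleapis.com/rpc_total_count\" AND resource.type=\"gce_instance\"",
  "denominatorAggregations": [
    { "alignmentPeriod": "300s", "perSeriesAligner": "ALIGN_DELTA", "crossSeriesReducer": "REDUCE_SUM" }
  ],
  "comparison": "COMPARISON_GT",
  "thresholdValue": 0.1,
  "duration": "300s"
}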

In general, it is best to compute ratios based on time series collected for a single metric type, by using label values. A ratio computed over two different metric types is subject to an edge effect.

For example, suppose that you have two different metric types, an RPC total count and an RPC error count, and you want to compute the ratio of unsuccessful RPCs to total RPCs. The unsuccessful RPCs are counted in the time series of both metric types, so there is a chance that, when you align the time series, an unsuccessful RPC appears in one alignment interval for the total count but in a different alignment interval for the error count. This difference can happen for several reasons, including the following:

  • Because there are two different time series recording the same event, there are two underlying counter values implementing the collection, and they won't be updated atomically.
  • The sampling rates might differ. When the time series are aligned to a common period, counts for a single event might appear in adjacent alignment intervals in the time series for the different metrics.

The difference in the number of values in corresponding alignment intervals can lead to nonsensical error/total ratio values like 1/0 or 2/1.

The edge effect is typically smaller when the ratio is computed from larger counts. You can get larger counts by aggregation, either by using an alignment window that is longer than the sampling period or by grouping together data for certain labels. These techniques minimize the effect of small differences in the number of points in a given interval; a two-point disparity is more significant if the expected number of points in an interval is 3 than if the expected number is 300.
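
For example, the following aggregations fragment (a sketch) uses a 30-minute alignment period and sums across all of the selected time series, so each point in the ratio is computed from larger counts:

"aggregations": [
  {
    "alignmentPeriod": "1800s",
    "perSeriesAligner": "ALIGN_DELTA",
    "crossSeriesReducer": "REDUCE_SUM"
  }
]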

If you are using built-in metric types, then you might have no choice but to compute ratios across metric types to get the value you need.

If you are designing custom metrics that might count the same thing (like RPCs returning error status) in two different metrics, consider instead a single metric that includes each count only once. For example, if you are counting RPCs and you want to track the ratio of unsuccessful RPCs to all RPCs, create a single metric type to count RPCs, and use a label to record the status of the invocation, including the "OK" status. Then each status value, error or "OK", is recorded by updating a single counter for that case.
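
With such a metric, both the numerator and the denominator of the ratio come from the same metric type, and a label selects the error subset. The following sketch revisits the earlier ratio example, this time using a hypothetical custom.googleapis.com/rpc_count metric with a status label:

"conditionThreshold": {
  "filter": "metric.type=\"custom.googleapis.com/rpc_count\" AND metric.labels.status != \"OK\" AND resource.type=\"gce_instance\"",
  "aggregations": [
    { "alignmentPeriod": "300s", "perSeriesAligner": "ALIGN_DELTA", "crossSeriesReducer": "REDUCE_SUM" }
  ],
  "denominatorFilter": "metric.type=\"custom.googleapis.com/rpc_count\" AND resource.type=\"gce_instance\"",
  "denominatorAggregations": [
    { "alignmentPeriod": "300s", "perSeriesAligner": "ALIGN_DELTA", "crossSeriesReducer": "REDUCE_SUM" }
  ],
  "comparison": "COMPARISON_GT",
  "thresholdValue": 0.1,
  "duration": "300s"
}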

Condition for log-based alerting policies

To create a log-based alerting policy, which notifies you when a message matching your filter appears in your log entries, use the LogMatch condition type. If you use a LogMatch condition, it must be the only condition in your alerting policy.
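
For example, a log-based alerting policy might look like the following sketch. The log filter and display names are illustrative; log-based policies also use the alertStrategy field's notificationRateLimit to control how frequently notifications are sent:

{
  "displayName": "Errors in application logs",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "Matching log entries",
      "conditionMatchedLog": {
        "filter": "resource.type=\"gce_instance\" AND severity>=ERROR AND textPayload:\"payment failed\"",
        "labelExtractors": {
          "instance": "EXTRACT(resource.labels.instance_id)"
        }
      }
    }
  ],
  "alertStrategy": {
    "notificationRateLimit": { "period": "300s" }
  },
  "notificationChannels": [
    "projects/PROJECT_ID/notificationChannels/CHANNEL_ID"
  ]
}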

Don't try to use the LogMatch condition type in conjunction with logs-based metrics. Alerting policies that monitor logs-based metrics are metric-based policies. For more information about choosing between alerting policies that monitor logs-based metrics or log entries, see Monitoring your logs.

The alerting policies used in the examples in the Managing alerting policies document are metric-based alerting policies, although the principles are the same for log-based alerting policies. For information specific to log-based alerting policies, see Create a log-based alert (Monitoring API) in the Cloud Logging documentation.