An alerting policy is represented in the Cloud Monitoring API
by an AlertPolicy
object,
which describes a set of conditions indicating a potentially
unhealthy status in your system.
This document describes the following:
- How the Monitoring API represents alerting policies.
- The types of conditions the Monitoring API provides for alerting policies.
- How to create an alerting policy by using the Google Cloud CLI or client libraries.
Structure of an alerting policy
The AlertPolicy
structure defines the components of an
alerting policy. When you create a policy, you specify values for the
following AlertPolicy
fields:
- displayName: A descriptive label for the policy.
- documentation: We recommend that you use this field to provide information that helps incident responders. For more information, see Annotate notifications with user-defined documentation.
- userLabels: Any user-defined labels attached to the policy. For information about using labels with alerting, see Annotate incidents with labels.
- conditions[]: An array of Condition structures.
- combiner: A logical operator that determines how to handle multiple conditions.
- notificationChannels[]: An array of resource names, each identifying a NotificationChannel.
- alertStrategy: Specifies the following:
  - How quickly Monitoring closes incidents when data stops arriving.
  - For metric-based alerting policies, whether Monitoring sends a notification when an incident is closed.
  - For metric-based alerting policies, whether repeated notifications are enabled, and the interval between those notifications. For more information, see Configure repeated notifications for metric-based alerting policies.
You can also specify the severity
field when you use the Cloud Monitoring API
and the Google Cloud console. This field lets you define the severity level of
incidents. If you don't specify a severity,
then Cloud Monitoring sets the alerting policy severity to No Severity.
There are other fields you might use, depending on the conditions you create.
When an alerting policy contains one condition, a notification is sent when that condition is met. For information about notifications when alerting policies contain multiple conditions, see Policies with multiple conditions and Number of notifications per policy.
When you create or modify the alerting policy, Monitoring sets
other fields as well, including the name
field. The value of the name
field is the resource name for the alerting policy, which identifies the
policy. The resource name has the following form:
projects/PROJECT_ID/alertPolicies/POLICY_ID
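To illustrate how the fields described in this section fit together, the following is a minimal JSON sketch of an AlertPolicy with a single metric-threshold condition. The display name, filter, threshold, and notification channel are illustrative placeholders, not values used elsewhere in this document:
{
  "displayName": "Example policy: high CPU utilization",
  "documentation": {
    "content": "Steps for responders to follow when this policy creates an incident.",
    "mimeType": "text/markdown"
  },
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "CPU utilization above 90%",
      "conditionThreshold": {
        "filter": "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" AND resource.type=\"gce_instance\"",
        "comparison": "COMPARISON_GT",
        "thresholdValue": 0.9,
        "duration": "300s"
      }
    }
  ],
  "notificationChannels": [
    "projects/PROJECT_ID/notificationChannels/CHANNEL_ID"
  ]
}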
Types of conditions in the API
The Cloud Monitoring API supports a variety of condition types in the
Condition
structure. There are multiple condition
types for metric-based alerting policies, and one for log-based alerting
policies. The following sections describe the available condition types.
Conditions for metric-based alerting policies
To create an alerting policy that monitors metric data, including log-based metrics, you can use the following condition types:
Filter-based metric conditions
The MetricAbsence
and MetricThreshold
conditions use
Monitoring filters to select the time-series data
to monitor. Other fields in the condition structure specify how to filter,
group, and aggregate the data. For more information on these concepts, see
Filtering and aggregation: manipulating time series.
If you use the MetricAbsence
condition type, then you can create a condition
that is met only when all of the time series are absent. This condition uses
the aggregations
parameter to aggregate multiple time series into a single
time series. For more information, see
the MetricAbsence
reference in the API documentation.
A metric-absence alerting policy requires that some data has been written previously; for more information, see Create metric-absence alerting policies.
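As an illustration, the following sketch shows the JSON representation of a MetricAbsence condition, conditionAbsent, that is met when no CPU-utilization data arrives for 20 minutes. The filter and aggregation values are placeholders:
"conditionAbsent": {
  "filter": "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" AND resource.type=\"gce_instance\"",
  "aggregations": [
    {
      "alignmentPeriod": "300s",
      "perSeriesAligner": "ALIGN_MEAN",
      "crossSeriesReducer": "REDUCE_MEAN"
    }
  ],
  "duration": "1200s"
}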
If you want to get notified based on a forecasted value, then configure
your alerting policy to use the
MetricThreshold
condition type and to set the forecastOptions
field. When
this field isn't set, then the measured data is compared to a threshold.
However, when this field is set, then predicted data is compared to a
threshold. For more information, see
Create forecasted metric-value alerting policies.
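As an illustration, the following conditionThreshold sketch sets the forecastOptions field so that the condition is met when the monitored time series is predicted to violate the threshold within the forecast horizon. The metric, threshold, and horizon values are placeholders:
"conditionThreshold": {
  "filter": "metric.type=\"agent.googleapis.com/disk/percent_used\" AND resource.type=\"gce_instance\"",
  "comparison": "COMPARISON_GT",
  "thresholdValue": 95,
  "duration": "300s",
  "forecastOptions": {
    "forecastHorizon": "3600s"
  }
}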
MQL-based metric conditions
The MonitoringQueryLanguageCondition
condition uses Monitoring Query Language (MQL) to
select and manipulate the time-series data to monitor. You can create alerting
policies that compare values against a threshold or test for the absence
of values with this condition type.
If you use a MonitoringQueryLanguageCondition
condition, it must be the only
condition in your alerting policy. For more information, see
Alerting policies with MQL.
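As an illustration, a MonitoringQueryLanguageCondition is represented in JSON by the conditionMonitoringQueryLanguage field. The following sketch is illustrative only; adapt the MQL query to your own metric and threshold:
"conditionMonitoringQueryLanguage": {
  "query": "fetch gce_instance::compute.googleapis.com/instance/cpu/utilization | group_by 5m, [value_utilization_mean: mean(value.utilization)] | every 5m | condition val() > 0.9 '10^2.%'",
  "duration": "300s",
  "trigger": {
    "count": 1
  }
}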
PromQL-based metric conditions
The PrometheusQueryLanguageCondition
condition uses Prometheus Query Language (PromQL)
queries to select and manipulate time-series data to monitor.
Your condition can compute a ratio of metrics,
evaluate metric comparisons, and more.
If you use a PrometheusQueryLanguageCondition
condition, it must be the only
condition in your alerting policy. For more information, see
Alerting policies with PromQL.
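As an illustration, a PrometheusQueryLanguageCondition is represented in JSON by the conditionPrometheusQueryLanguage field. The following sketch is illustrative only; the PromQL query and values are placeholders:
"conditionPrometheusQueryLanguage": {
  "query": "avg_over_time(compute_googleapis_com:instance_cpu_utilization{monitored_resource=\"gce_instance\"}[5m]) > 0.9",
  "duration": "300s",
  "evaluationInterval": "30s"
}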
Conditions for alerting on ratios
You can create metric-threshold alerting policies to monitor the
ratio of two metrics. You can create these policies by using either
the MetricThreshold
or MonitoringQueryLanguageCondition
condition type.
You can also use MQL directly in the Google Cloud console. However, you can't create
or manage ratio-based conditions by using the graphical interface for creating
threshold conditions.
We recommend using MQL to create ratio-based alerting policies.
MQL lets you build more powerful and flexible queries than you can
build by using the MetricThreshold condition type and
Monitoring filters.
For example, with a MonitoringQueryLanguageCondition
condition, you can
compute the ratio of a gauge metric to a delta metric. For examples, see
MQL alerting-policy examples.
If you use the MetricThreshold
condition, the numerator and denominator
of the ratio must have the same MetricKind.
For a list of metrics and their properties, see Metric lists.
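If you do use the MetricThreshold condition type for a ratio, you set the denominatorFilter and denominatorAggregations fields in addition to the usual filter and aggregations fields. The following sketch is illustrative only; the metric type, label comparison, and threshold are placeholders:
"conditionThreshold": {
  "filter": "metric.label.response_code>=\"500\" AND metric.type=\"serviceruntime.googleapis.com/api/request_count\" AND resource.type=\"consumed_api\"",
  "aggregations": [
    {
      "alignmentPeriod": "300s",
      "perSeriesAligner": "ALIGN_DELTA",
      "crossSeriesReducer": "REDUCE_SUM"
    }
  ],
  "denominatorFilter": "metric.type=\"serviceruntime.googleapis.com/api/request_count\" AND resource.type=\"consumed_api\"",
  "denominatorAggregations": [
    {
      "alignmentPeriod": "300s",
      "perSeriesAligner": "ALIGN_DELTA",
      "crossSeriesReducer": "REDUCE_SUM"
    }
  ],
  "comparison": "COMPARISON_GT",
  "thresholdValue": 0.1,
  "duration": "0s"
}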
In general, it is best to compute ratios based on time series collected for a single metric type, by using label values. A ratio computed over two different metric types is subject to anomalies due to different sampling periods and alignment windows.
For example, suppose that you have two different metric types, an RPC total count and an RPC error count, and you want to compute the ratio of error-count RPCs over total RPCs. The unsuccessful RPCs are counted in the time series of both metric types. Therefore, there is a chance that, when you align the time series, an unsuccessful RPC doesn't appear in the same alignment interval for both time series. This difference can happen for several reasons, including the following:
- Because there are two different time series recording the same event, there are two underlying counter values implementing the collection, and they aren't updated atomically.
- The sampling rates might differ. When the time series are aligned to a common period, the counts for a single event might appear in adjacent alignment intervals in the time series for the different metrics.
The difference in the number of values in corresponding alignment intervals can
lead to nonsensical error/total
ratio values like 1/0 or 2/1.
Ratios of larger numbers are less likely to result in nonsensical values. You can get larger numbers by aggregation, either by using an alignment window that is longer than the sampling period, or by grouping data for certain labels. These techniques minimize the effect of small differences in the number of points in a given interval. That is, a two-point disparity is more significant when the expected number of points in an interval is 3 than when the expected number is 300.
If you are using built-in metric types, then you might have no choice but to compute ratios across metric types to get the value you need.
If you are designing custom metrics that might count the same thing—like RPCs returning error status—in two different metrics, consider instead a single metric, which includes each count only once. For example, suppose that you are counting RPCs and you want to track the ratio of unsuccessful RPCs to all RPCs. To solve this problem, create a single metric type to count RPCs, and use a label to record the status of the invocation, including the "OK" status. Then each status value, error or "OK", is recorded by updating a single counter for that case.
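As an illustrative sketch only, such a metric might be defined by a metric descriptor like the following; the metric type and label name are placeholders, not a prescribed schema:
{
  "type": "custom.googleapis.com/rpc/request_count",
  "metricKind": "DELTA",
  "valueType": "INT64",
  "labels": [
    {
      "key": "status",
      "valueType": "STRING",
      "description": "Status of the RPC, for example OK or an error code."
    }
  ]
}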
Condition for log-based alerting policies
To create a log-based alerting policy, which notifies you when a message
matching your filter appears in your log entries, use the
LogMatch
condition type. If you use a LogMatch
condition, it must be the only condition in your alerting policy.
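As an illustration, a LogMatch condition is represented in JSON by the conditionMatchedLog field. The following sketch is illustrative only; the log filter and label extractor are placeholders:
"conditionMatchedLog": {
  "filter": "resource.type=\"gce_instance\" AND severity>=ERROR AND textPayload:\"connection refused\"",
  "labelExtractors": {
    "instance": "EXTRACT(resource.labels.instance_id)"
  }
}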
Don't try to use the LogMatch
condition type in conjunction with log-based
metrics. Alerting policies that monitor log-based metrics are metric-based
policies. For more information about choosing between alerting policies that
monitor log-based metrics or log entries, see
Monitoring your logs.
The alerting policies used in the examples in the Manage alerting policies by API document are metric-based alerting policies, although the principles are the same for log-based alerting policies. For information specific to log-based alerting policies, see Create a log-based alerting policy by using the Monitoring API in the Cloud Logging documentation.
Before you begin
Before writing code against the API, you should:
- Be familiar with the general concepts and terminology used with alerting policies; see Alerting overview for more information.
- Ensure that the Cloud Monitoring API is enabled for use; see Enabling the API for more information.
- If you plan to use client libraries, then install the libraries for the languages that you want to use; see Client Libraries for details. Currently, API support for alerting is available only for C#, Go, Java, Node.js, and Python.
- If you plan to use the Google Cloud CLI, then install it. However, if you use Cloud Shell, then the Google Cloud CLI is already installed. Examples using the gcloud interface are also provided here. The gcloud examples all assume that the current project has already been set as the target (gcloud config set project [PROJECT_ID]), so invocations omit the explicit --project flag. The ID of the current project in the examples is a-gcp-project.
- To get the permissions that you need to create and modify alerting policies by using the Cloud Monitoring API, ask your administrator to grant you the Monitoring AlertPolicy Editor (roles/monitoring.alertPolicyEditor) IAM role on your project. For more information about granting roles, see Manage access to projects, folders, and organizations. You might also be able to get the required permissions through custom roles or other predefined roles. For detailed information about IAM roles for Monitoring, see Control access with Identity and Access Management.
Design your application to single-thread Cloud Monitoring API calls that modify the state of an alerting policy in a Google Cloud project. For example, single-thread API calls that create, update, or delete an alerting policy.
Create an alerting policy
To create an alerting policy in a project, use the
alertPolicies.create
method. For information about how to invoke this
method, its parameters, and the response data, see the reference page
alertPolicies.create.
You can create policies from JSON or YAML files.
The Google Cloud CLI accepts these files as arguments, and
you can programmatically read JSON files, convert them to AlertPolicy
objects, and create policies from them
by using the alertPolicies.create
method. If you
have a Prometheus JSON or YAML configuration file with an alerting rule, then
the gcloud CLI can migrate it to a Cloud Monitoring alerting
policy with a PromQL condition. For more information, see
Migrate alerting rules and receivers from Prometheus.
Each alerting policy belongs to a scoping project of a metrics scope. Each
project can contain up to 500 policies.
For API calls, you must provide a “project ID”; use the
ID of the scoping project of a metrics scope as the value. In these examples,
the ID of the scoping project of a metrics scope is a-gcp-project.
The following samples illustrate the creation of alerting policies, but they don't describe how to create a JSON or YAML file that describes an alerting policy. Instead, the samples assume that a JSON-formatted file exists and they illustrate how to issue the API call. For example JSON files, see Sample policies. For general information about monitoring ratios of metrics, see Ratios of metrics.
gcloud
To create an alerting policy in a project, use the gcloud alpha monitoring
policies create
command. The following example creates an alerting policy in
a-gcp-project
from the rising-cpu-usage.json
file:
gcloud alpha monitoring policies create --policy-from-file="rising-cpu-usage.json"
If successful, this command returns the name of the new policy, for example:
Created alert policy [projects/a-gcp-project/alertPolicies/12669073143329903307].
The rising-cpu-usage.json file contains the JSON for a policy with
the display name “High CPU rate of change”. For details about this policy, see
Rate-of-change policy.
See the
gcloud alpha monitoring policies create
reference for more information.
C#
To authenticate to Monitoring, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Go
To authenticate to Monitoring, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Java
To authenticate to Monitoring, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Node.js
To authenticate to Monitoring, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
PHP
To authenticate to Monitoring, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Python
To authenticate to Monitoring, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
The created AlertPolicy object will have additional fields. The policy itself will have name, creationRecord, and mutationRecord fields. Additionally, each condition in the policy is also given a name. These fields cannot be modified externally, so there is no need to set them when creating a policy. None of the JSON examples used for creating policies include them, but if policies created from them are retrieved after creation, the fields will be present.
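For example, a policy retrieved after creation might include server-populated fields like the following; the policy ID, timestamps, and principal shown are placeholders:
"name": "projects/a-gcp-project/alertPolicies/12669073143329903307",
"creationRecord": {
  "mutateTime": "2025-01-01T00:00:00.000000Z",
  "mutatedBy": "user@example.com"
},
"mutationRecord": {
  "mutateTime": "2025-01-01T00:00:00.000000Z",
  "mutatedBy": "user@example.com"
}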
Configure repeated notifications for metric-based alerting policies
By default, a metric-based alerting policy sends one notification to each notification channel when an incident is opened. However, you can change the default behavior and configure an alerting policy to resend notifications to all or some of the notification channels for your alerting policy. These repeated notifications are sent for incidents with a status of Open or Acknowledged. The interval between these notifications must be at least 30 minutes and no more than 24 hours, expressed in seconds.
To configure repeated notifications, add to the alerting policy's configuration
an AlertStrategy
object that contains at least one
NotificationChannelStrategy
object.
A NotificationChannelStrategy
object has two fields:
- renotifyInterval: The interval, in seconds, between repeated notifications. If you change the value of the renotifyInterval field when an incident for the alerting policy is open, then the following happens:
  - The alerting policy sends out another notification for the incident.
  - The alerting policy restarts the interval period.
- notificationChannelNames: An array of notification channel resource names, which are strings in the format projects/PROJECT_ID/notificationChannels/CHANNEL_ID, where CHANNEL_ID is a numeric value. For information about how to retrieve the channel ID, see List notification channels in a project.
For example, the following JSON sample shows an alert strategy configured to send repeated notifications every 1800 seconds (30 minutes) to one notification channel:
"alertStrategy": { "notificationChannelStrategy": [ { "notificationChannelNames": [ "projects/PROJECT_ID/notificationChannels/CHANNEL_ID" ], "renotifyInterval": "1800s" } ] }
To temporarily stop repeated notifications, create a snooze. To
prevent repeated notifications, edit the alerting policy by using the API and
remove the NotificationChannelStrategy
object.