Alerting policies with a Monitoring Query Language (MQL)-based condition let you configure your alerting environment for many possible use cases. Certain configurations are available only through the use of MQL queries.
This document describes several use cases and sample queries for deploying alerting policies with an MQL-based condition in a production environment.
Alert on dynamic thresholds
You can use an MQL query to configure an alerting policy that triggers alerts based on a threshold that varies over time, such as by day of the week. This configuration isn't supported in alerting policy conditions without MQL queries.
For example, you have an MQL query that sends an alert if the CPU utilization of a Compute Engine instance exceeds 95%:
fetch gce_instance :: compute.googleapis.com/instance/cpu/utilization | align | every 30s | condition utilization > 95'%'
However, you want to set a lower utilization threshold, such as 85%, for weekends, to account for longer response times from your support team. In this case, you could configure your query with a value column that contains the alerting threshold:
fetch gce_instance :: compute.googleapis.com/instance/cpu/utilization | align | every 30s | value add [day_of_week: end().timestamp_to_string('%w').string_to_int64] | value [utilization, is_weekend: day_of_week = 0 || day_of_week = 6] | value [utilization, max_allowed_utilization: if(is_weekend, 85'%', 95'%')] | condition utilization > scale(max_allowed_utilization)
The value operations do the following:
- value add [day_of_week: end().timestamp_to_string('%w').string_to_int64] adds a value column whose value is a number between 0 and 6, where 0 is Sunday and 6 is Saturday.
- value [utilization, is_weekend: day_of_week = 0 || day_of_week = 6] replaces your day number with a boolean that indicates whether the data point was on a weekend or a weekday.
- value [utilization, max_allowed_utilization: if(is_weekend, 85'%', 95'%')] replaces the boolean with a threshold that varies depending on the value of is_weekend.
The condition, condition utilization > scale(max_allowed_utilization), compares the two value columns.
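The threshold selection in the query above can be reproduced outside of MQL to sanity-check the weekday logic. The following is a minimal Python sketch, not part of Cloud Monitoring: the function names `max_allowed_utilization` and `should_alert` are hypothetical, utilization is assumed to be a fraction between 0 and 1, and timestamps are assumed to be in UTC. Python's `%w` format code uses the same numbering as the query, with 0 for Sunday and 6 for Saturday.

```python
from datetime import datetime, timezone

WEEKDAY_THRESHOLD = 0.95  # 95%, as in the query
WEEKEND_THRESHOLD = 0.85  # 85%, as in the query


def max_allowed_utilization(window_end: datetime) -> float:
    """Return the threshold that applies at the end of the alignment window."""
    # strftime('%w') yields '0' (Sunday) through '6' (Saturday).
    day_of_week = int(window_end.strftime("%w"))
    is_weekend = day_of_week == 0 or day_of_week == 6
    return WEEKEND_THRESHOLD if is_weekend else WEEKDAY_THRESHOLD


def should_alert(utilization: float, window_end: datetime) -> bool:
    """Compare a utilization fraction against the day-dependent threshold."""
    return utilization > max_allowed_utilization(window_end)


# Hypothetical data points: the same 90% utilization on two different days.
saturday = datetime(2024, 1, 6, 12, 0, tzinfo=timezone.utc)
wednesday = datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc)
print(should_alert(0.90, saturday))   # above the 85% weekend threshold
print(should_alert(0.90, wednesday))  # below the 95% weekday threshold
```

In this sketch, a utilization of 90% counts as a violation on the Saturday but not on the Wednesday, which is the behavior the dynamic threshold is meant to produce.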
For an example of an alerting policy with an MQL-based condition that configures incident severity levels based on dynamic criteria, see Create dynamic severity levels using MQL.
Alert on thresholds based on rate of change
You can configure alerting policy MQL queries to evaluate thresholds based on the rate of change for a metric. For example, you want to evaluate the rate of 5xx errors for each value of resource.method in your API requests, where the rate is expressed in requests per second. If the rate is greater than 0.05 error responses per second, which is the threshold set in the following query, then Cloud Monitoring sends an alert:
fetch consumed_api | metric 'serviceruntime.googleapis.com/api/request_count' | filter (metric.response_code_class == '5xx') | align rate(10m) | every 30s | group_by [resource.method], [value_request_count_mean: mean(value.request_count)] | condition val() > 0.05'1/s'
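As a rough illustration of what this query computes, the following Python sketch converts per-time-series error counts into per-second rates, averages them for each method, and applies the 0.05 per second threshold from the condition. It is a simplified model under stated assumptions: the function names and the sample data are hypothetical, and each sample is assumed to hold the number of 5xx responses for one time series over a single 10-minute (600-second) window.

```python
from collections import defaultdict


def error_rate_per_method(samples, window_seconds):
    """Average the per-second error rate across time series for each method.

    samples: iterable of (method, series_id, error_count) tuples, where
    error_count is the number of 5xx responses seen in one window.
    """
    rates_by_method = defaultdict(list)
    for method, _series_id, error_count in samples:
        rates_by_method[method].append(error_count / window_seconds)
    return {
        method: sum(rates) / len(rates)
        for method, rates in rates_by_method.items()
    }


def methods_over_threshold(samples, window_seconds=600, threshold=0.05):
    """Return the methods whose mean error rate exceeds the threshold."""
    rates = error_rate_per_method(samples, window_seconds)
    return sorted(m for m, rate in rates.items() if rate > threshold)


# Hypothetical 10-minute error counts for three time series.
samples = [
    ("GetThing", "series-a", 60),    # 0.10 errors/s
    ("GetThing", "series-b", 30),    # 0.05 errors/s
    ("ListThings", "series-a", 12),  # 0.02 errors/s
]
print(methods_over_threshold(samples))
```

Here the mean rate for the hypothetical GetThing method is 0.075 errors per second, which is above the threshold, while ListThings stays below it.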
You can create rate-of-change alerting policies without using MQL:
- For an example that uses the Google Cloud console, see Monitor a rate of change.
- For an example that uses the Cloud Monitoring API, see Rate of change policy.
Alert on ratio-based thresholds
Your alerting policy can use an MQL query to evaluate ratios derived by joining two metrics and then dividing the value columns. For example, you want to query the ratio of read bytes compared to write bytes for each of your Compute Engine instances. If the ratio is greater than 3/5, or 60%, then Cloud Monitoring sends an alert:
{ fetch gce_instance :: compute.googleapis.com/instance/disk/read_bytes_count; fetch gce_instance :: compute.googleapis.com/instance/disk/write_bytes_count } | every 30s | join | value val(0) / val(1) | condition val() > 0.6
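The join-then-divide pattern can be sketched in Python as follows. This is an illustrative model only: the function names and instance names are hypothetical, each dictionary is assumed to map an instance to its byte count for the same aligned window, and skipping instances with zero write bytes is a choice made for this sketch rather than documented MQL behavior.

```python
def read_write_ratios(read_bytes, write_bytes):
    """Pair read and write values per instance, then divide reads by writes.

    Only instances present in both inputs produce a ratio, which mirrors
    an inner join on the instance labels.
    """
    ratios = {}
    for instance, reads in read_bytes.items():
        writes = write_bytes.get(instance)
        if writes is None or writes == 0:
            # No matching series, or a zero denominator: skipped in this sketch.
            continue
        ratios[instance] = reads / writes
    return ratios


def instances_over_ratio(read_bytes, write_bytes, threshold=0.6):
    """Return instances whose read/write ratio exceeds the threshold."""
    ratios = read_write_ratios(read_bytes, write_bytes)
    return sorted(name for name, ratio in ratios.items() if ratio > threshold)


# Hypothetical byte counts for one aligned window.
read_bytes = {"vm-1": 900, "vm-2": 100, "vm-3": 50}
write_bytes = {"vm-1": 1000, "vm-2": 1000}
print(instances_over_ratio(read_bytes, write_bytes))
```

With this sample data, vm-1 has a ratio of 0.9 and is reported, vm-2 has a ratio of 0.1 and is not, and vm-3 is dropped because it has no matching write series.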
You can also query the ratio of aggregated values. For example, you want to compute the average CPU usage time per core across your Compute Engine instances. If the ratio is greater than 3/5, or 60%, then Cloud Monitoring sends an alert. In this example, you must also include a cast_units function to align the units of measurement.
{ fetch gce_instance :: compute.googleapis.com/instance/cpu/usage_time | group_by [], .sum; fetch gce_instance :: compute.googleapis.com/instance/cpu/reserved_cores | group_by [], .sum | cast_units('s{CPU}') } | every 30s | ratio | condition val() > 0.6
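The difference from the per-instance ratio is that both inputs are summed across all instances before the division, so the result is a single value. The following Python sketch shows that order of operations with made-up, unitless numbers. The function name `aggregated_ratio` is hypothetical, and the sketch doesn't model units or alignment, which the query handles with cast_units and every.

```python
def aggregated_ratio(numerators, denominators):
    """Sum each input across all instances, then divide the two totals.

    Returns None when the denominator total is zero, which is a choice
    made for this sketch.
    """
    total_denominator = sum(denominators)
    if total_denominator == 0:
        return None
    return sum(numerators) / total_denominator


# Hypothetical per-instance values for one aligned window.
cpu_usage = [1.5, 2.5, 3.0]  # stands in for cpu/usage_time per instance
reserved_cores = [2, 4, 4]   # stands in for cpu/reserved_cores per instance

ratio = aggregated_ratio(cpu_usage, reserved_cores)
print(ratio, ratio > 0.6)
```

Note that summing first and then dividing generally gives a different result than averaging the per-instance ratios, because instances with more cores carry more weight in the aggregated form.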
You can create ratio-based alerting policies without using MQL:
- For an example that uses the Google Cloud console, see Compute ratios.
- For an example that uses the Cloud Monitoring API, see Metric ratio.