This document describes strategies you can use to reduce costs for alerting.
Consolidate alerting policies to operate over more resources
Because of the $1.50-per-condition cost, it is more cost-effective to use one alerting policy to monitor multiple resources than it is to use one alerting policy to monitor each resource. Consider the following examples:
Example 1
Data
- 100 VMs
- Each VM emits one metric, metric_name
- metric_name has one label, which has 10 values
- One alert condition
- Condition aggregates to the VM level
- 30-second execution period
- Condition cost: 1 condition * $1.50 per month = $1.50 per month
- Time series cost: 100 time series returned per period * 86,400 periods per month = 8.6 million time series returned per month * $0.35 per million time series = $3.02 per month
- Total cost: $4.52 per month
Example 2
Data
- 100 VMs
- Each VM emits one metric, metric_name
- metric_name has one label, which has 10 values
- 100 conditions
- Each condition is filtered and aggregated to one VM
- 30-second execution period
- Condition cost: 100 conditions * $1.50 per month = $150 per month
- Time series cost: 100 conditions * 1 time series returned per condition per period * 86,400 periods per month = 8.6 million time series returned per month * $0.35 per million time series = $3.02 per month
- Total cost: $153.02 per month
In both examples, you monitor the same number of resources. However, Example 2 uses 100 alerting policies, while Example 1 uses only one alerting policy. As a result, Example 1 is almost $150 cheaper per month.
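As a rough sketch of the consolidated approach, the following PromQL queries assume a hypothetical gauge metric named metric_name with an instance label, a placeholder VM name, and an arbitrary threshold of 0.8; they illustrate the structure of the two approaches rather than the exact conditions in the examples:

```promql
# One condition that covers every VM (Example 1): grouping by the
# "instance" label returns one time series per VM each execution period.
avg by (instance) (metric_name) > 0.8

# The per-VM alternative (Example 2): each of the 100 conditions filters
# to a single VM, so the $1.50 per-condition charge applies 100 times.
avg(metric_name{instance="vm-001"}) > 0.8
```

Both approaches return 100 time series per execution period in total, so the time series charge is the same; only the number of conditions, and therefore the per-condition charge, differs.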
Aggregate to only the level that you need to alert on
Aggregating to finer levels of granularity results in higher costs than aggregating to coarser levels of granularity. For example, aggregating to the Google Cloud project level is cheaper than aggregating to the cluster level, and aggregating to the cluster level is cheaper than aggregating to the cluster and namespace level.
Consider the following examples:
Example 1
Data
- 100 VMs
- Each VM emits one metric, metric_name
- metric_name has one label, which has 10 values
- One alert condition
- Condition aggregates to the VM level
- 30-second execution period
- Condition cost: 1 condition * $1.50 per month = $1.50 per month
- Time series cost: 100 time series returned per period * 86,400 periods per month = 8.6 million time series returned per month * $0.35 per million time series = $3.02 per month
- Total cost: $4.52 per month
Example 4
Data
- 100 VMs, where each VM belongs to one service
- Five total services
- Each VM emits one metric, metric_name
- metric_name has one label, which has 10 values
- One condition
- Condition aggregates to the service level
- 30-second execution period
- Condition cost: 1 condition * $1.50 per month = $1.50 per month
- Time series cost: 5 time series returned per period * 86,400 periods per month = 432,000 time series returned per month * $0.35 per million time series = $0.15 per month
- Total cost: $1.65 per month
Example 5
Data
- 100 VMs
- Each VM emits one metric, metric_name
- metric_name has 100 labels with 1,000 values each
- One condition
- Condition aggregates to the VM level
- 30-second execution period
- Condition cost: 1 condition * $1.50 per month = $1.50 per month
- Time series cost: 100 time series returned per period * 86,400 periods per month = 8.6 million time series returned per month * $0.35 per million time series = $3.02 per month
- Total cost: $4.52 per month
Compare Example 1 to Example 4: Both examples operate over the same underlying data and have a single alerting policy. However, because the alerting policy in Example 4 aggregates to the service, it is less expensive than the alerting policy in Example 1, which aggregates more granularly to the VM.
In addition, compare Example 1 to Example 5: In this case, the metric cardinality in Example 5 is 10,000 times higher than the metric cardinality in Example 1. However, because the alerting policies in Example 1 and Example 5 both aggregate to the VM, and because the number of VMs is the same in both examples, the two examples are equivalent in price.
When you configure your alerting policies, choose aggregation levels that work best for your use case. For example, if you care about alerting on CPU utilization, then you might want to aggregate to the VM and CPU level. If you care about alerting on latency by endpoint, then you might want to aggregate to the endpoint level.
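As an illustrative sketch, assuming the metric carries hypothetical service and instance labels and using an arbitrary threshold of 0.8, the grouping clause in the condition query is what determines how many time series each execution period returns:

```promql
# Aggregate to the service level (as in Example 4): at most 5 time
# series are returned per execution period.
avg by (service) (metric_name) > 0.8

# Aggregate to the VM level (as in Example 1): at most 100 time series
# are returned per execution period.
avg by (instance) (metric_name) > 0.8
```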
Don't alert on raw, unaggregated data
Monitoring uses a dimensional metrics system, where any metric has total cardinality equal to the number of resources monitored multiplied by the number of label combinations on that metric. For example, if you have 100 VMs emitting a metric, and that metric has 10 labels with 10 values each, then your total cardinality is 100 * 10 * 10 = 10,000.
As a result of how cardinality scales, alerting on raw data can be extremely expensive. In the previous example, you have 10,000 time series returned for each execution period. However, if you aggregate to the VM, then you have only 100 time series returned per execution period, regardless of the label cardinality of the underlying data.
Alerting on raw data also puts you at risk for increased time series when your metrics receive new labels. In the previous example, if a user adds a new label to your metric, then your total cardinality increases to 100 * 11 * 10 = 11,000 time series. In this case, your number of returned time series increases by 1,000 each execution period even though your alerting policy is unchanged. If you instead aggregate to the VM, then, despite the increased underlying cardinality, you still have only 100 time series returned.
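As a minimal sketch, again assuming the hypothetical metric_name metric, an instance label, and an arbitrary threshold of 0.8, the difference is whether the condition keeps every label combination or collapses them per VM:

```promql
# Raw, unaggregated condition: every label combination is returned, up
# to 10,000 time series per execution period in the example above, and
# that number grows whenever a new label value appears.
metric_name > 0.8

# Aggregated to the VM: at most one time series per VM (100 total) is
# returned, regardless of the label cardinality of the underlying data.
max by (instance) (metric_name) > 0.8
```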
Filter out unnecessary responses
Configure your conditions to evaluate only data that's necessary for your alerting needs. If you wouldn't take action to fix something, then exclude it from your alerting policies. For example, you probably don't need to alert on an intern's development VM.
To reduce unnecessary alerts and costs, you can filter out time series that aren't important. You can use Google Cloud metadata labels to tag assets with categories and then filter out the unneeded metadata categories.
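For example, a label filter in the condition's query can drop time series that you never intend to act on. In the following sketch, the environment label and its value are hypothetical stand-ins for whatever categorization you apply to your assets:

```promql
# Evaluate only production VMs; development and staging time series are
# never returned, so they don't count toward the time series charge.
avg by (instance) (metric_name{environment="production"}) > 0.8
```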
Use top-streams operators to reduce the number of time series returned
If your condition uses a PromQL or MQL query, then you can use a top-streams operator to select only a fixed number of the time series with the highest values. For example, a topk(metric, 5) clause in a PromQL query limits the number of time series returned to five in each execution period.
Limiting results to the top N time series might cause missing data and faulty alerts. For example:
- If more than N time series violate your threshold, then you will miss data outside the top N time series.
- If a violating time series occurs outside the top N time series, then your incidents might auto-close despite the excluded time series still violating the threshold.
- Your condition queries might not show you important context such as baseline time series that are functioning as intended.
To mitigate such risks, choose large values for N and use the top-streams operator only in alerting policies that evaluate many time series, such as alerts for individual Kubernetes containers.
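As a hedged sketch of that pattern, the following query assumes a hypothetical per-container metric with a container label and an arbitrary threshold of 0.8; topk keeps only the five highest-valued time series before the threshold is applied:

```promql
# Alert on, at most, the five containers with the highest values of the
# metric; all other time series are excluded from the condition's results
# in each execution period.
topk(5, max by (container) (metric_name)) > 0.8
```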
Increase the length of the execution period (PromQL only)
If your condition uses a PromQL query, then you can modify the length of your execution period by setting the evaluationInterval field in the condition.
Longer evaluation intervals result in fewer time series returned per month; for example, a condition query with a 15-second interval runs twice as often as a query with a 30-second interval, and a query with a 1-minute interval runs half as often as a query with a 30-second interval.
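For example, using the data from Example 1, lengthening the interval from 30 seconds to one minute halves the number of execution periods in a 30-day month and therefore halves the time series charge:

```latex
\begin{aligned}
\text{30-second interval: } & 100 \times 86{,}400 = 8.64\text{M series/month}
  \;\Rightarrow\; 8.64 \times \$0.35 \approx \$3.02 \\
\text{1-minute interval: } & 100 \times 43{,}200 = 4.32\text{M series/month}
  \;\Rightarrow\; 4.32 \times \$0.35 \approx \$1.51
\end{aligned}
```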