Evaluation of rules and alerts with self-deployed collection

This document describes a configuration for rule and alert evaluation in a Managed Service for Prometheus deployment that uses self-deployed collection.

The following diagram illustrates a deployment that uses multiple clusters in two Google Cloud projects and includes both rule and alert evaluation:

A deployment for rule and alert evaluation that uses self-deployed collection.

To set up and use a deployment like the one in the diagram, note the following:

  • Rules are installed within each Managed Service for Prometheus collection server, just as they are when using standard Prometheus. Rule evaluation executes against the data stored locally on each server. Servers are configured to retain data long enough to cover the lookback period of all rules, which is typically no more than 1 hour. Rule results are written to Monarch after evaluation. For a sample rule file and server configuration, see the sketch that follows this list.

  • A Prometheus Alertmanager instance is manually deployed in each cluster. Prometheus servers are configured, by editing the alertmanager_config field of the configuration file, to send fired alerts to their local Alertmanager instance. Silences, acknowledgements, and incident-management workflows are typically handled in a third-party tool such as PagerDuty.

    You can centralize alert management across multiple clusters into a single Alertmanager by using a Kubernetes Endpoints resource; a minimal sketch of this pattern also follows this list.

  • A single cluster running inside Google Cloud is designated as the global rule evaluation cluster for a metrics scope. The standalone rule evaluator is deployed in that cluster, and rules are installed using the standard Prometheus rule-file format.

    The standalone rule evaluator is configured to use scoping_project_A, which contains Projects 1 and 2. Rules executed against scoping_project_A automatically fan out to Projects 1 and 2. The underlying service account must be granted the Monitoring Viewer role for scoping_project_A; a sample gcloud grant follows this list.

    The rule evaluator is configured to send alerts to the local Prometheus Alertmanager by using the alertmanager_config field of the configuration file.
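The following is a minimal sketch of this per-server setup. The metric names, thresholds, file paths, and the Alertmanager address are assumptions; the rule file uses the standard Prometheus rule-file format, and the server fragment uses the standard upstream Prometheus alerting section, which your deployment's configuration file might expose under a different name such as alertmanager_config.

    # rules.yaml -- standard Prometheus rule-file format (hypothetical metric names).
    groups:
      - name: example-rules
        interval: 30s
        rules:
          # Recording rule: preserves the project_id, location, cluster, and namespace labels.
          - record: namespace:http_requests:rate5m
            expr: >
              sum by (project_id, location, cluster, namespace)
              (rate(http_requests_total[5m]))
          # Alerting rule: fired alerts are sent to the Alertmanager configured below.
          - alert: HighErrorRatio
            expr: >
              sum by (project_id, location, cluster, namespace)
              (rate(http_requests_errors_total[5m]))
              /
              sum by (project_id, location, cluster, namespace)
              (rate(http_requests_total[5m])) > 0.05
            for: 15m
            labels:
              severity: warning

    # prometheus.yml (fragment) -- load the rule files and send fired alerts
    # to the cluster-local Alertmanager; the Service name and port are assumptions.
    rule_files:
      - /etc/prometheus/rules/*.yaml
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['alertmanager.monitoring.svc:9093']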
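One way to implement the centralized option is a selector-less Service paired with a manually managed Endpoints resource, which gives Prometheus servers in one cluster a stable in-cluster address for an Alertmanager that runs elsewhere. This is only a sketch; the name, namespace, IP address, and port are assumptions.

    apiVersion: v1
    kind: Service
    metadata:
      name: central-alertmanager
      namespace: monitoring
    spec:
      ports:
        - port: 9093
    ---
    apiVersion: v1
    kind: Endpoints
    metadata:
      name: central-alertmanager   # must match the Service name
      namespace: monitoring
    subsets:
      - addresses:
          - ip: 10.0.0.10          # reachable address of the central Alertmanager
        ports:
          - port: 9093

Prometheus servers in this cluster can then send alerts to central-alertmanager.monitoring.svc:9093.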

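As a sketch, granting the rule evaluator's service account read access to the metrics scope might look like the following; the service account name is hypothetical, and scoping_project_A is the placeholder scoping project from the diagram.

    # Grant Monitoring Viewer on the scoping project (hypothetical service account).
    gcloud projects add-iam-policy-binding scoping_project_A \
      --member="serviceAccount:rule-evaluator@scoping_project_A.iam.gserviceaccount.com" \
      --role="roles/monitoring.viewer"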
Using a self-deployed global rule evaluator can have unexpected effects, depending on whether you preserve or aggregate away the project_id, location, cluster, and namespace labels in your rules:

  • If your rules preserve the project_id label (by using a by(project_id) clause), then rule results are written back to Monarch using the original project_id value of the underlying time series; see the rule sketch that follows this list.

    In this scenario, you must ensure that the underlying service account has the Monitoring Metric Writer role for each monitored project in scoping_project_A; see the gcloud sketch that follows this list. If you add a new monitored project to scoping_project_A, then you must also manually grant that role to the service account for the new project.

  • If your rules do not preserve the project_id label (by not using a by(project_id) clause), then rule results are written back to Monarch using the project_id value of the cluster where the global rule evaluator is running.

    In this scenario, you do not need to further modify the underlying service account.

  • If your rules preserve the location label (by using a by(location) clause), then rule results are written back to Monarch using each original Google Cloud region from which the underlying time series originated.

    If your rules do not preserve the location label, then data is written back to the location of the cluster where the global rule evaluator is running.
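For the project_id-preserving case, the per-project write grant might look like the following sketch; the service account name is hypothetical, and the command must be repeated for every monitored project in scoping_project_A (Project 1 and Project 2 in the diagram).

    # Grant Monitoring Metric Writer on one monitored project; repeat per project.
    gcloud projects add-iam-policy-binding project_1 \
      --member="serviceAccount:rule-evaluator@scoping_project_A.iam.gserviceaccount.com" \
      --role="roles/monitoring.metricWriter"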

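The following rule-file sketch contrasts the two behaviors by using a hypothetical metric. The first rule preserves project_id, location, cluster, and namespace, so results are written back under the original project and region of each underlying time series; the second aggregates all labels away, so results are written under the project and region of the cluster that runs the global rule evaluator.

    groups:
      - name: label-handling
        rules:
          # Preserves project_id, location, cluster, and namespace.
          - record: namespace:http_requests:rate5m
            expr: >
              sum by (project_id, location, cluster, namespace)
              (rate(http_requests_total[5m]))
          # Aggregates all labels away: results carry the rule evaluator's
          # project_id and location, and dropping cluster and namespace can
          # hurt query performance and run into cardinality limits.
          - record: global:http_requests:rate5m
            expr: sum(rate(http_requests_total[5m]))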
We strongly recommend preserving the cluster and namespace labels in rule evaluation results whenever possible; removing these labels can cause query performance to decline and can cause you to run into cardinality limits.