Managed rule evaluation and alerting

Google Cloud Managed Service for Prometheus supports Prometheus-compatible rule evaluation and alerting. This document describes how to set up managed rule evaluation.

Rule evaluation

Managed Service for Prometheus provides a rule-evaluator component that lets you safely write rules in the context of a global Prometheus backend without interfering with other users' data in larger organizations. The component is automatically deployed as part of managed collection when running on Kubernetes clusters.

Rules

The managed rule-evaluator uses the Rules resource to configure recording and alerting rules. The following is an example Rules resource:

apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  namespace: gmp-test
  name: example-rules
spec:
  groups:
  - name: example
    interval: 30s
    rules:
    - record: job:up:sum
      expr: sum without(instance) (up)
    - alert: AlwaysFiring
      expr: vector(1)

The format of the .spec.groups element is identical to the upstream Prometheus rule_group array. Alerting and recording rules defined in Rules are scoped to the project_id, cluster, and namespace of the resource. For example, the job:up:sum rule in the preceding resource effectively queries sum without(instance) (up{project_id="test-project", cluster="test-cluster", namespace="gmp-test"}). This scoping guarantees that alerting and recording rules don't accidentally evaluate metrics from applications you might not even know about.

To apply the example rules to your cluster, run the following command:

kubectl apply -n gmp-test -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.4.3-gke.0/examples/rules.yaml

After a few minutes, the metric job:up:sum becomes available. The alert AlwaysFiring also starts firing. For information about how to send alerts to an Alertmanager, see Alertmanager configuration.
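
To confirm that the cluster accepted the resource, you can inspect it with kubectl; the following is a minimal sketch that assumes the example resource name used above:

# Show the applied Rules resource; the fully qualified name avoids clashes with other "rules" kinds.
kubectl -n gmp-test get rules.monitoring.googleapis.com example-rules -o yaml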

The ClusterRules and GlobalRules resources provide the same interface as the Rules resource, but they apply the rules to wider scopes. ClusterRules select data by using the project_id and cluster labels, and GlobalRules select all data in the queried metrics scope without restricting labels.

For reference documentation about all the Managed Service for Prometheus custom resources, see the prometheus-engine/doc/api reference.

Converting from Prometheus rules to Rules

The Rules resource provides an interface compatible with Prometheus rules, so you have a seamless migration path for incorporating existing rules into managed rule evaluation. You can include your existing rules within a Rules resource. For example, the following is a Prometheus rule:

groups:
- name: example
  interval: 30s
  rules:
  - record: job:up:sum
    expr: sum without(instance) (up)
  - alert: AlwaysFiring
    expr: vector(1)

The corresponding Rules resource, with the original Prometheus rule group nested unchanged under the spec field, follows:

apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  namespace: gmp-test
  name: example-rules
spec:
  groups:
  - name: example
    interval: 30s
    rules:
    - record: job:up:sum
      expr: sum without(instance) (up)
    - alert: AlwaysFiring
      expr: vector(1)

ClusterRules

You can use the ClusterRules resource to configure recording and alerting rules that can evaluate all time series sent to Managed Service for Prometheus from all namespaces in a particular cluster. The spec is identical to that of Rules. The previous example Prometheus rule becomes the following ClusterRules resource:

apiVersion: monitoring.googleapis.com/v1
kind: ClusterRules
metadata:
  name: example-clusterrules
spec:
  groups:
  - name: example
    interval: 30s
    rules:
    - record: job:up:sum
      expr: sum without(instance) (up)
    - alert: AlwaysFiring
      expr: vector(1)

We recommend that you use ClusterRules resources only on horizontal metrics, such as those produced by a service mesh. For metrics of individual deployments, use Rules resources to ensure that the evaluation doesn't include unintended data.

GlobalRules

You can use the GlobalRules resource to configure recording and alerting rules that can evaluate all time series sent to Managed Service for Prometheus across all projects within a metrics scope. The spec is identical to that of Rules. The previous example Prometheus rule becomes the following GlobalRules resource:

apiVersion: monitoring.googleapis.com/v1
kind: GlobalRules
metadata:
  name: example-globalrules
spec:
  groups:
  - name: example
    interval: 30s
    rules:
    - record: job:up:sum
      expr: sum without(instance) (up)
    - alert: AlwaysFiring
      expr: vector(1)

We recommend that you use GlobalRules only for those rare use cases where an alert might need data across all clusters at once. For metrics of individual deployments, use Rules or ClusterRules resources for higher reliability and to ensure that the evaluation doesn't include unintended data.

We strongly recommend preserving the cluster and namespace labels in rule evaluation results unless the purpose of the rule is to aggregate away those labels. Otherwise, query performance might decline and you might encounter cardinality limits. Removing both labels is strongly discouraged.
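
For example, the following hypothetical ClusterRules resource aggregates away only the instance label, so the cluster and namespace labels are preserved in the recorded series; the resource name is illustrative:

apiVersion: monitoring.googleapis.com/v1
kind: ClusterRules
metadata:
  name: example-keep-labels
spec:
  groups:
  - name: example
    interval: 30s
    rules:
    # Keeps cluster and namespace; only instance is aggregated away.
    - record: job:up:sum
      expr: sum without(instance) (up)
    # By contrast, an expression like sum by(job) (up) would drop both
    # labels from the result and is discouraged.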

Multi-project and global rule evaluation

When deployed on Google Kubernetes Engine, the rule evaluator uses the Google Cloud project associated with the cluster, which it detects automatically. To evaluate rules that span projects, you must configure the rule evaluator that executes the GlobalRules resource to use a project with a multi-project metrics scope. You can do this in two ways:

  • Place your GlobalRules resource in a project that has a multi-project metrics scope.
  • Set the queryProjectID field within the OperatorConfig to use a project with a multi-project metrics scope, as shown in the sketch after this list.
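
The following is a minimal sketch of the second option; SCOPING_PROJECT_ID is a placeholder for a project whose multi-project metrics scope contains the projects you want to query:

apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  namespace: gmp-public
  name: config
rules:
  # Evaluate rules against the metrics scope of this project.
  queryProjectID: SCOPING_PROJECT_ID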

You must also update the permissions of the service account used by the rule evaluator (which is usually the default service account on the node) so the service account can read from the scoping project and write to all monitored projects in the metrics scope.
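
A minimal sketch of those IAM bindings follows; SCOPING_PROJECT_ID, MONITORED_PROJECT_ID, and SERVICE_ACCOUNT_EMAIL are placeholders for your own values, and the write binding must be repeated for every monitored project:

# Allow the rule evaluator to read (query) through the scoping project.
gcloud projects add-iam-policy-binding SCOPING_PROJECT_ID \
  --member=serviceAccount:SERVICE_ACCOUNT_EMAIL \
  --role=roles/monitoring.viewer

# Allow the rule evaluator to write rule results to a monitored project.
gcloud projects add-iam-policy-binding MONITORED_PROJECT_ID \
  --member=serviceAccount:SERVICE_ACCOUNT_EMAIL \
  --role=roles/monitoring.metricWriter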

If your metrics scope contains all your projects, then your rules evaluate globally. For more information, see Metrics scopes.

Alerting using Cloud Monitoring metrics

You can configure the rule evaluator to alert on Google Cloud system metrics using PromQL. For instructions on how to create a valid query, see PromQL for Cloud Monitoring metrics.
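
As an illustration only, the following hypothetical GlobalRules resource alerts on the system metric kubernetes.io/container/restart_count, assuming that metric maps to the PromQL name kubernetes_io:container_restart_count under the naming rules described in that document:

apiVersion: monitoring.googleapis.com/v1
kind: GlobalRules
metadata:
  name: example-system-metric-alert
spec:
  groups:
  - name: restart-alerts
    interval: 30s
    rules:
    # Fires when any container has restarted within the last 10 minutes.
    - alert: ContainerRestarting
      expr: rate(kubernetes_io:container_restart_count[10m]) > 0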

Configuring rules and alerts using Terraform

You can automate the creation and management of Rules, ClusterRules, and GlobalRules resources by using the kubernetes_manifest Terraform resource type or the kubectl_manifest Terraform resource type, either of which lets you specify arbitrary custom resources.
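
For example, a minimal kubernetes_manifest sketch might load the example Rules manifest from a local file; the file path and resource label here are hypothetical:

# Applies a Rules manifest stored alongside the Terraform configuration.
resource "kubernetes_manifest" "example_rules" {
  manifest = yamldecode(file("${path.module}/rules.yaml"))
}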

For general information about using Google Cloud with Terraform, see Terraform with Google Cloud.

Provide credentials explicitly

When running on GKE, the rule-evaluator automatically retrieves credentials from the environment based on the Compute Engine default service account or the Workload Identity setup.

In non-GKE Kubernetes clusters, credentials must be explicitly provided through the OperatorConfig resource in the gmp-public namespace.

  1. Set the context to your target project:

    gcloud config set project PROJECT_ID
    
  2. Create a service account:

    gcloud iam service-accounts create gmp-test-sa
    

  3. Grant the required permissions to the service account:

    gcloud projects add-iam-policy-binding PROJECT_ID \
      --member=serviceAccount:gmp-test-sa@PROJECT_ID.iam.gserviceaccount.com \
      --role=roles/monitoring.viewer \
    && \
    gcloud projects add-iam-policy-binding PROJECT_ID \
      --member=serviceAccount:gmp-test-sa@PROJECT_ID.iam.gserviceaccount.com \
      --role=roles/monitoring.metricWriter
    

  4. Create and download a key for the service account:

    gcloud iam service-accounts keys create gmp-test-sa-key.json \
      --iam-account=gmp-test-sa@PROJECT_ID.iam.gserviceaccount.com
    
  5. Add the key file as a secret to your non-GKE cluster:

    kubectl -n gmp-public create secret generic gmp-test-sa \
      --from-file=key.json=gmp-test-sa-key.json
    

  6. Open the OperatorConfig resource for editing:

    kubectl -n gmp-public edit operatorconfig config
    

  7. Add the following rules section to the resource:

    apiVersion: monitoring.googleapis.com/v1
    kind: OperatorConfig
    metadata:
      namespace: gmp-public
      name: config
    rules:
      credentials:
        name: gmp-test-sa
        key: key.json
    
    Make sure that you also add these credentials to the collection section so that managed collection works; a combined sketch follows these steps.

  8. Save the file and close the editor. After the change is applied, the pods are re-created and start authenticating to the metric backend with the given service account.
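
Putting the two settings together, an OperatorConfig that uses the same secret for both collection and rule evaluation might look like the following sketch:

apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  namespace: gmp-public
  name: config
collection:
  # Same key so that managed collection also authenticates outside GKE.
  credentials:
    name: gmp-test-sa
    key: key.json
rules:
  credentials:
    name: gmp-test-sa
    key: key.json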

Alertmanager configuration

You can use the OperatorConfig resource to configure the managed rule-evaluator to send alerts to a self-deployed Prometheus Alertmanager. To configure the rule-evaluator, do the following:

  1. Open the OperatorConfig resource for editing:

    kubectl -n gmp-public edit operatorconfig config
    
  2. Configure the resource to send alerts to your Alertmanager service:

    apiVersion: monitoring.googleapis.com/v1
    kind: OperatorConfig
    metadata:
      namespace: gmp-public
      name: config
    rules:
      alerting:
        alertmanagers:
        - name: SERVICE_NAME
          namespace: SERVICE_NAMESPACE
          port: PORT_NAME
    

If your Alertmanager is located in a different cluster than your rule-evaluator, you might need to set up an Endpoints resource. For example, if your OperatorConfig specifies that Alertmanager endpoints can be found in the Endpoints object ns=alertmanager/name=alertmanager, you can manually or programmatically create this object and populate it with reachable IPs from the other cluster. The AlertmanagerEndpoints configuration section provides options for authorization configuration if necessary.
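
For example, a manually created Endpoints object for that name might look like the following sketch; the IP address and port values are hypothetical, and the port name must match the PORT_NAME referenced in your OperatorConfig:

apiVersion: v1
kind: Endpoints
metadata:
  namespace: alertmanager
  name: alertmanager
subsets:
- addresses:
  # Reachable Alertmanager IPs in the other cluster.
  - ip: 10.0.0.5
  ports:
  - name: web
    port: 9093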