Self-deployed rule evaluation and alerting

Google Cloud Managed Service for Prometheus supports Prometheus-compatible rule evaluation and alerting. This document describes how to set up self-deployed rule evaluation, including the standalone rule-evaluator component.

You only need to follow these instructions if you want to execute rules and alerts against the global datastore.

Rule evaluation for self-deployed collection

After you have deployed Managed Service for Prometheus, you can continue to evaluate rules locally in each deployed instance by using the rule_files field of your Prometheus configuration file. However, the maximum query window for the rules is constrained by how long the server keeps local data.

Most rules execute only over the last few minutes of data, so running rules on each local server is often a valid strategy. In that case, no further setup is necessary.

However, sometimes it's useful to be able to evaluate rules against the global metric backend, for example, when all data for a rule is not co-located on a given Prometheus instance. For these cases, Managed Service for Prometheus also provides a rule-evaluator component.

Before you begin

This section describes the configuration needed for the tasks described in this document.

Configure your environment

To avoid repeatedly entering your project ID or cluster name, perform the following configuration:

  • Configure the command-line tools as follows:

    • Configure the gcloud CLI to refer to the ID of your Google Cloud project:

      gcloud config set project PROJECT_ID
      
    • Configure the kubectl CLI to use your cluster:

      kubectl config set-cluster CLUSTER_NAME
      

    For more information about these tools, see the following:

Set up a namespace

Create the NAMESPACE_NAME Kubernetes namespace for resources you create as part of the example application:

kubectl create ns NAMESPACE_NAME

Verify service account credentials

You can skip this section if your Kubernetes cluster has Workload Identity Federation for GKE enabled.

When running on GKE, Managed Service for Prometheus automatically retrieves credentials from the environment based on the Compute Engine default service account. The default service account has the necessary permissions, monitoring.metricWriter and monitoring.viewer, by default. If you don't use Workload Identity Federation for GKE, and you have previously removed either of those roles from the default node service account, you will have to re-add those missing permissions before continuing.

If you are not running on GKE, see Provide credentials explicitly.

Configure a service account for Workload Identity Federation for GKE

You can skip this section if your Kubernetes cluster does not have Workload Identity Federation for GKE enabled.

Managed Service for Prometheus captures metric data by using the Cloud Monitoring API. If your cluster is using Workload Identity Federation for GKE, you must grant your Kubernetes service account permission to the Monitoring API. This section describes the following:

Create and bind the service account

This step appears in several places in the Managed Service for Prometheus documentation. If you have already performed this step as part of a prior task, then you don't need to repeat it. Skip ahead to Authorize the service account.

The following command sequence creates the gmp-test-sa service account and binds it to the default Kubernetes service account in the NAMESPACE_NAME namespace:

gcloud config set project PROJECT_ID \
&&
gcloud iam service-accounts create gmp-test-sa \
&&
gcloud iam service-accounts add-iam-policy-binding \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE_NAME/default]" \
  gmp-test-sa@PROJECT_ID.iam.gserviceaccount.com \
&&
kubectl annotate serviceaccount \
  --namespace NAMESPACE_NAME \
  default \
  iam.gke.io/gcp-service-account=gmp-test-sa@PROJECT_ID.iam.gserviceaccount.com

If you are using a different GKE namespace or service account, adjust the commands appropriately.

Authorize the service account

Groups of related permissions are collected into roles, and you grant the roles to a principal, in this example, the Google Cloud service account. For more information about Monitoring roles, see Access control.

The following command grants the Google Cloud service account, gmp-test-sa, the Monitoring API roles it needs to read and write metric data.

If you have already granted the Google Cloud service account a specific role as part of prior task, then you don't need to do it again.

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member=serviceAccount:gmp-test-sa@PROJECT_ID.iam.gserviceaccount.com \
  --role=roles/monitoring.viewer \
&& \
gcloud projects add-iam-policy-binding PROJECT_ID\
  --member=serviceAccount:gmp-test-sa@PROJECT_ID.iam.gserviceaccount.com \
  --role=roles/monitoring.metricWriter

Debug your Workload Identity Federation for GKE configuration

If you are having trouble getting Workload Identity Federation for GKE to work, see the documentation for verifying your Workload Identity Federation for GKE setup and the Workload Identity Federation for GKE troubleshooting guide.

As typos and partial copy-pastes are the most common sources of errors when configuring Workload Identity Federation for GKE, we strongly recommend using the editable variables and clickable copy-paste icons embedded in the code samples in these instructions.

Workload Identity Federation for GKE in production environments

The example described in this document binds the Google Cloud service account to the default Kubernetes service account and gives the Google Cloud service account all necessary permissions to use the Monitoring API.

In a production environment, you might want to use a finer-grained approach, with a service account for each component, each with minimal permissions. For more information on configuring service accounts for workload-identity management, see Using Workload Identity Federation for GKE.

Deploy the standalone rule evaluator

The Managed Service for Prometheus rule evaluator evaluates Prometheus alerting and recording rules against the Managed Service for Prometheus HTTP API and writes the results back to Monarch. It accepts the same configuration-file format and rule-file format as Prometheus. The flags are mostly identical, as well.

  1. Create an example deployment of the rule evaluator that is pre-configured to evaluate an alerting and a recording rule:

    kubectl apply -n NAMESPACE_NAME -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.13.0/manifests/rule-evaluator.yaml
    
  2. Verify that the pods for the rule-evaluator deployed successfully:

    kubectl -n NAMESPACE_NAME get pod
    

    If the deployment was successful, then you see output similar to the following:

    NAME                              READY   STATUS    RESTARTS   AGE
    ...
    rule-evaluator-64475b696c-95z29   2/2     Running   0          1m
    

After you verify that the rule-evaluator deployed successfully, you can make adjustments to the installed manifests to do the following:

  • Add your custom rules files.
  • Configure the rule-evaluator to send alerts to a self-deployed Prometheus Alertmanager by using the alertmanager_config field of the configuration file.

If your Alertmanager is located in a different cluster than your rule-evaluator, then you might need to set up an Endpoints resource. For example, if your OperatorConfig specifies that Alertmanager endpoints can be found in Endpoints object ns=alertmanager/name=alertmanager, then you can manually or programmatically create this object yourself and populate it with reachable IPs from the other cluster.

Provide credentials explicitly

When running on GKE, the rule-evaluator automatically retrieves credentials from the environment based on the node's service account or the Workload Identity Federation for GKE setup. In non-GKE Kubernetes clusters, credentials must be explicitly provided to the rule-evaluator by using flags or the GOOGLE_APPLICATION_CREDENTIALS environment variable.

  1. Set the context to your target project:

    gcloud config set project PROJECT_ID
    
  2. Create a service account:

    gcloud iam service-accounts create gmp-test-sa
    

    This step creates the service account that you might have already created in the Workload Identity Federation for GKE instructions.

  3. Grant the required permissions to the service account:

    gcloud projects add-iam-policy-binding PROJECT_ID \
      --member=serviceAccount:gmp-test-sa@PROJECT_ID.iam.gserviceaccount.com \
      --role=roles/monitoring.viewer \
    && \
    gcloud projects add-iam-policy-binding PROJECT_ID\
      --member=serviceAccount:gmp-test-sa@PROJECT_ID.iam.gserviceaccount.com \
      --role=roles/monitoring.metricWriter
    

  4. Create and download a key for the service account:

    gcloud iam service-accounts keys create gmp-test-sa-key.json \
      --iam-account=gmp-test-sa@PROJECT_ID.iam.gserviceaccount.com
    
  5. Add the key file as a secret to your non-GKE cluster:

    kubectl -n NAMESPACE_NAME create secret generic gmp-test-sa \
      --from-file=key.json=gmp-test-sa-key.json
    

  6. Open the rule-evaluator Deployment resource for editing:

    kubectl -n NAMESPACE_NAME edit deploy rule-evaluator
    
    1. Add the text shown in bold to the resource:

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        namespace: NAMESPACE_NAME
        name: rule-evaluator
      spec:
        template
          containers:
          - name: evaluator
            args:
            - --query.credentials-file=/gmp/key.json
            - --export.credentials-file=/gmp/key.json
      ...
            volumeMounts:
            - name: gmp-sa
              mountPath: /gmp
              readOnly: true
      ...
          volumes:
          - name: gmp-sa
            secret:
              secretName: gmp-test-sa
      ...
      

    2. Save the file and close the editor. After the change is applied, the pods are re-created and start authenticating to the metric backend with the given service account.

    Alternatively, instead of using the flags set in this example, you can set the key-file path by using the GOOGLE_APPLICATION_CREDENTIALS environment variable.

    Multi-project and global rule evaluation

    We recommend that you run one instance of the rule evaluator in each Google Cloud project and region rather than running one instance that evaluates against many projects and regions. However, we do support multi-project rule evaluation for scenarios that require it.

    When deployed on Google Kubernetes Engine, the rule evaluator uses the Google Cloud project associated with the cluster, which it automatically detects. To evaluate rules that span projects, you can override the queried project by using the --query.project-id flag and specifying a project with a multi-project metrics scope. If your metrics scope contains all your projects, then your rules evaluate globally. For more information, see Metrics scopes.

    You must also update the permissions of the service account used by the rule evaluator so the service account can read from the scoping project and write to all monitored projects in the metrics scope.

    Preserve labels when writing rules

    For data the evaluator writes back to Managed Service for Prometheus, the evaluator supports the same --export.* flags and external_labels-based configuration as the Managed Service for Prometheus server binary. We strongly recommend that you write rules so that the project_id, location, cluster, and namespace labels are preserved appropriately for their aggregation level, otherwise query performance might decline and you might encounter cardinality limits.

    The project_id or location labels are mandatory. If these labels are missing, then the values in rule-evaluation results are set based on the configuration of the rule evaluator. Missing cluster or namespace labels are not given values.

    Self-observability

    The rule-evaluator emits Prometheus metrics on a configurable port using the --web.listen-address flag.

    For example, if the pod rule-evaluator-64475b696c-95z29 is exposing these metrics on port 9092, the metrics can be viewed manually by using kubectl:

    # Port forward the metrics endpoint.
    kubectl port-forward rule-evaluator-64475b696c-95z29 9092
    # Then query in a separate terminal.
    curl localhost:9092/metrics
    

    You can configure your Prometheus stack to collect these so you have visibility to the performance of the rule-evaluator.

    High-availability deployments

    The rule evaluator can run in a highly available setup by following the same approach as documented for the Prometheus server.

    Alerting using Cloud Monitoring metrics

    You can configure the rule evaluator to alert on Google Cloud system metrics using PromQL. For instructions on how to create a valid query, see PromQL for Cloud Monitoring metrics.