Google Cloud Managed Service for Prometheus supports Prometheus-compatible rule evaluation and alerting. This document describes how to set up managed rule evaluation.
Rule evaluation
Managed Service for Prometheus provides a rule-evaluator component that allows you to safely write rules in the context of a global Prometheus backend, preventing you from interfering with other users' data in larger organizations. The component is automatically deployed as part of managed collection when running on Kubernetes clusters.
You can write rules and alerts on both Managed Service for Prometheus metrics and Cloud Monitoring metrics. You need to use the GlobalRules resource when writing rules for Cloud Monitoring metrics.
Rules
The managed rule-evaluator uses the Rules resource to configure recording and alerting rules. The following is an example Rules resource:
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  namespace: NAMESPACE_NAME
  name: example-rules
spec:
  groups:
  - name: example
    interval: 30s
    rules:
    - record: job:up:sum
      expr: sum without(instance) (up)
    - alert: AlwaysFiring
      expr: vector(1)
The format of the .spec.groups element is identical to the upstream Prometheus rule_group array. Alerting and recording rules defined in Rules are scoped to the project_id, cluster, and namespace of the resource. For example, the job:up:sum rule in the above resource effectively queries sum without(instance) (up{project_id="test-project", cluster="test-cluster", namespace="NAMESPACE_NAME"}). This guarantee ensures that alerting or recording rules do not accidentally evaluate metrics from applications you may not even know about.
To apply the example rules to your cluster, run the following command:
kubectl apply -n NAMESPACE_NAME -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.13.0/examples/rules.yaml
After a few minutes, the metric job:up:sum becomes available. The alert AlwaysFiring also starts firing. For information about how to send alerts to an Alertmanager, see Alertmanager configuration.
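The AlwaysFiring alert fires unconditionally, so it is mainly useful for checking that the pipeline works. As a rough sketch of a more realistic alerting rule, the following Rules resource uses the standard Prometheus rule fields for, labels, and annotations, which should be accepted because .spec.groups follows the upstream rule_group format. The resource name, the alert name TargetDown, the job label value example-app, and the severity label are hypothetical and are not part of the original example.

```yaml
# Illustrative sketch only: names and thresholds are assumptions.
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  namespace: NAMESPACE_NAME
  name: example-app-alerts          # hypothetical resource name
spec:
  groups:
  - name: example-app
    interval: 30s
    rules:
    # Fires when any target of the (assumed) job "example-app" has been
    # down for five minutes. Because this is a Rules resource, the query
    # only sees series from this project, cluster, and namespace.
    - alert: TargetDown
      expr: up{job="example-app"} == 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Target {{ $labels.instance }} of job {{ $labels.job }} is down"
```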
The ClusterRules and GlobalRules resources provide the same interface as the Rules resource, but they apply the rules to wider scopes. ClusterRules select data by using the project_id and cluster labels, and GlobalRules select all data in the queried metrics scope without restricting labels.
For reference documentation about all the Managed Service for Prometheus custom resources, see the prometheus-engine/doc/api reference.
Converting from Prometheus rules to Rules
The Rules resource provides a compatible interface to Prometheus rules to provide a seamless migration path for incorporating existing rules into managed rule evaluation. You can include your existing rules in a Rules resource. For example, the following is a Prometheus rule:
groups:
- name: example
  interval: 30s
  rules:
  - record: job:up:sum
    expr: sum without(instance) (up)
  - alert: AlwaysFiring
    expr: vector(1)
The corresponding Rules resource follows. The original Prometheus rule groups appear unchanged under the spec field:
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  namespace: NAMESPACE_NAME
  name: example-rules
spec:
  groups:
  - name: example
    interval: 30s
    rules:
    - record: job:up:sum
      expr: sum without(instance) (up)
    - alert: AlwaysFiring
      expr: vector(1)
ClusterRules
You can use the ClusterRules resource to configure recording and alerting rules that can evaluate all time series sent to Managed Service for Prometheus from all namespaces in a particular cluster. The spec is identical to that of Rules. The previous example Prometheus rule becomes the following ClusterRules resource:
apiVersion: monitoring.googleapis.com/v1
kind: ClusterRules
metadata:
  name: example-clusterrules
spec:
  groups:
  - name: example
    interval: 30s
    rules:
    - record: job:up:sum
      expr: sum without(instance) (up)
    - alert: AlwaysFiring
      expr: vector(1)
We recommend that you use ClusterRules resources only on horizontal metrics, such as those produced by a service mesh. For metrics of individual deployments, use Rules resources to ensure that the evaluation doesn't include unintended data.
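As an illustration of a horizontal metric, the following sketch aggregates request rates across every namespace in the cluster. It assumes that an Istio service mesh is installed and that its istio_requests_total counter is being collected; the resource, group, and recorded metric names are hypothetical.

```yaml
# Illustrative sketch only: assumes Istio metrics are being collected.
apiVersion: monitoring.googleapis.com/v1
kind: ClusterRules
metadata:
  name: example-mesh-rules            # hypothetical resource name
spec:
  groups:
  - name: mesh
    interval: 30s
    rules:
    # Per-namespace request rate over the last five minutes. Keeping the
    # namespace label in the result lets later queries filter by it.
    - record: namespace:istio_requests:rate5m
      expr: sum by (namespace) (rate(istio_requests_total[5m]))
```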
GlobalRules
You can use the GlobalRules resource to configure recording and alerting rules that can evaluate all time series sent to Managed Service for Prometheus across all projects within a metrics scope. The spec is identical to that of Rules. The previous example Prometheus rule becomes the following GlobalRules resource:
apiVersion: monitoring.googleapis.com/v1
kind: GlobalRules
metadata:
  name: example-globalrules
spec:
  groups:
  - name: example
    interval: 30s
    rules:
    - record: job:up:sum
      expr: sum without(instance) (up)
    - alert: AlwaysFiring
      expr: vector(1)
Because Cloud Monitoring metrics are not scoped to a namespace or cluster, you must use the GlobalRules resource when writing rules or alerts for Cloud Monitoring metrics. Using GlobalRules is also required when alerting on Google Kubernetes Engine system metrics.
If your rule does not preserve the project_id or location labels, they default to the values of the cluster.
For Managed Service for Prometheus metrics, we recommend that you use GlobalRules only for those rare use cases where an alert might need data across all clusters at once. For metrics of individual deployments, use Rules or ClusterRules resources for higher reliability and to ensure that the evaluation doesn't include unintended data. We strongly recommend preserving the cluster and namespace labels in rule evaluation results unless the purpose of the rule is to aggregate away those labels. Otherwise, query performance might decline and you might encounter cardinality limits. Removing both labels is strongly discouraged.
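To make the label recommendation concrete, the following sketch shows a GlobalRules recording rule that aggregates away instance but keeps project_id, location, cluster, and namespace in the result. The resource, group, and recorded metric names are hypothetical.

```yaml
# Illustrative sketch only: names are assumptions.
apiVersion: monitoring.googleapis.com/v1
kind: GlobalRules
metadata:
  name: example-global-up             # hypothetical resource name
spec:
  groups:
  - name: global-availability
    interval: 30s
    rules:
    # "sum by (...)" lists the labels to keep, so each output series still
    # identifies the project, location, cluster, and namespace it came from.
    - record: namespace:up:sum
      expr: sum by (project_id, location, cluster, namespace) (up)
```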
Multi-project and global rule evaluation
When deployed on Google Kubernetes Engine, the rule evaluator uses the Google Cloud project associated with the cluster, which the rule evaluator automatically detects. To evaluate rules that span projects, you must configure the rule evaluator that executes the GlobalRules resource to use a project with a multi-project metrics scope. You can do this in two ways:
- Place your GlobalRules resource in a project that has a multi-project metrics scope.
- Set the queryProjectID field within the OperatorConfig to use a project with a multi-project metrics scope.
You must also update the permissions of the service account used by the rule evaluator (which is usually the default service account on the node) so the service account can read from the scoping project and write to all monitored projects in the metrics scope.
If your metrics scope contains all your projects, then your rules evaluate globally. For more information, see Metrics scopes.
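The queryProjectID option might look like the following sketch. The placement of the field under the rules section is an assumption based on the prometheus-engine API reference, and SCOPING_PROJECT_ID is a hypothetical placeholder for the project that hosts the multi-project metrics scope. Check the API reference for your installed version before applying it.

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  namespace: gmp-public
  name: config
rules:
  # Assumed field location; SCOPING_PROJECT_ID is a placeholder for the
  # project whose metrics scope spans the projects you want to query.
  queryProjectID: SCOPING_PROJECT_ID
```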
Alerting using Cloud Monitoring metrics
You can use the GlobalRules resource to alert on Google Cloud system metrics using PromQL. For instructions on how to create a valid query, see PromQL for Cloud Monitoring metrics.
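As a rough illustration, the following GlobalRules resource alerts on a GKE system metric. It assumes that the Cloud Monitoring metric kubernetes.io/container/cpu/limit_utilization is exposed to PromQL as kubernetes_io:container_cpu_limit_utilization; verify the exact name against the PromQL for Cloud Monitoring metrics documentation. The resource name, alert name, threshold, and duration are hypothetical.

```yaml
# Illustrative sketch only: the metric-name mapping and the threshold
# are assumptions to verify before use.
apiVersion: monitoring.googleapis.com/v1
kind: GlobalRules
metadata:
  name: example-system-metric-alerts   # hypothetical resource name
spec:
  groups:
  - name: gke-system
    interval: 60s
    rules:
    # Fires when a container has used more than 90% of its CPU limit
    # continuously for ten minutes.
    - alert: ContainerCpuNearLimit
      expr: kubernetes_io:container_cpu_limit_utilization > 0.9
      for: 10m
      labels:
        severity: warning
```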
Configuring rules and alerts using Terraform
You can automate the creation and management of Rules, ClusterRules, and GlobalRules resources by using the kubernetes_manifest Terraform resource type or the kubectl_manifest Terraform resource type, either of which lets you specify arbitrary custom resources.
For general information about using Google Cloud with Terraform, see Terraform with Google Cloud.
Provide credentials explicitly
When running on GKE, the rule-evaluator automatically retrieves credentials from the environment based on the node's service account. In non-GKE Kubernetes clusters, credentials must be explicitly provided through the OperatorConfig resource in the gmp-public namespace.
Set the context to your target project:
gcloud config set project PROJECT_ID
Create a service account:
gcloud iam service-accounts create gmp-test-sa
Grant the required permissions to the service account:
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member=serviceAccount:gmp-test-sa@PROJECT_ID.iam.gserviceaccount.com \
  --role=roles/monitoring.viewer \
&& \
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member=serviceAccount:gmp-test-sa@PROJECT_ID.iam.gserviceaccount.com \
  --role=roles/monitoring.metricWriter
Create and download a key for the service account:
gcloud iam service-accounts keys create gmp-test-sa-key.json \
  --iam-account=gmp-test-sa@PROJECT_ID.iam.gserviceaccount.com
Add the key file as a secret to your non-GKE cluster:
kubectl -n gmp-public create secret generic gmp-test-sa \
  --from-file=key.json=gmp-test-sa-key.json
Open the OperatorConfig resource for editing:
kubectl -n gmp-public edit operatorconfig config
Add the rules section shown in the following example to the resource:
apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  namespace: gmp-public
  name: config
rules:
  credentials:
    name: gmp-test-sa
    key: key.json
Make sure you also add these credentials to the collection section so that managed collection works.
Save the file and close the editor. After the change is applied, the pods are re-created and start authenticating to the metric backend with the given service account.
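For illustration, an OperatorConfig that supplies the same Secret to both managed collection and the rule-evaluator might look like the following sketch. It assumes that the collection section accepts the same credentials structure as the rules section; confirm this against the API reference for your version.

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  namespace: gmp-public
  name: config
# Assumed to mirror the rules.credentials structure.
collection:
  credentials:
    name: gmp-test-sa
    key: key.json
rules:
  credentials:
    name: gmp-test-sa
    key: key.json
```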
Scaling rule-evaluation
The rule-evaluator runs as a single replica Deployment with fixed resource requests and limits. You might notice the workload experiences disruptions, such as being OOMKilled when evaluating a high number of rules. To mitigate this, you can deploy a VerticalPodAutoscaler to vertically scale the deployment. First, ensure that Vertical Pod Autoscaling is enabled on your Kubernetes cluster. Then apply a VerticalPodAutoscaler resource such as the following:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: rule-evaluator
  namespace: gmp-system
spec:
  resourcePolicy:
    containerPolicies:
    - containerName: evaluator
      controlledResources:
      - memory
      maxAllowed:
        memory: 4Gi
      minAllowed:
        memory: 16Mi
      mode: Auto
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rule-evaluator
  updatePolicy:
    updateMode: Auto
You can verify the autoscaler is working by checking the status of the autoscaler:
kubectl get vpa --namespace gmp-system rule-evaluator
If the autoscaler is working, then it reports that it calculated the resource recommendations for the workload in the "PROVIDED" column:
NAME             MODE   CPU   MEM        PROVIDED   AGE
rule-evaluator   Auto   2m    11534336   True       30m
Compress configurations
If you have many Rules resources, you might run out of ConfigMap space. To fix this, enable gzip compression in your OperatorConfig resource:
apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  namespace: gmp-public
  name: config
features:
  config:
    compression: gzip
Alertmanager configuration
You can use the OperatorConfig resource to configure the managed rule-evaluator to send alerts to a Prometheus Alertmanager. You can send alerts to the automatically-deployed managed Alertmanager in addition to any self-deployed Alertmanagers.
Managed Alertmanager
Managed Service for Prometheus deploys a managed instance of Alertmanager, to which the rule evaluators are automatically configured to forward alerts. By default, this configuration is set with a specifically named Kubernetes Secret containing an Alertmanager config file.
To enable and configure reporting to the deployed Alertmanager instance, do the following:
Create a local config file containing your Alertmanager settings (see sample config templates):
touch alertmanager.yaml
Update the file with your desired Alertmanager settings and create a Secret named alertmanager in the gmp-public namespace:
kubectl create secret generic alertmanager \
  -n gmp-public \
  --from-file=alertmanager.yaml
After a few moments, Managed Service for Prometheus picks up the new config Secret and enables the managed Alertmanager with your settings.
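As a starting point for the contents of alertmanager.yaml, the following is a minimal sketch in the standard upstream Alertmanager configuration format. The receiver name, grouping labels, timings, and webhook URL are all hypothetical; replace them with settings that match your own notification targets.

```yaml
# alertmanager.yaml: minimal illustrative config, not a recommended setup.
route:
  receiver: default-webhook            # hypothetical receiver name
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  repeat_interval: 4h
receivers:
- name: default-webhook
  webhook_configs:
  # Hypothetical endpoint; replace with your own notification target.
  - url: http://alert-receiver.example.com/hook
```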
Customizing the config Secret name
The managed Alertmanager also supports custom Secret names for loading the config. This capability is useful when you have multiple config Secrets and you want your Alertmanager instance to switch between the corresponding configs. For example, you might want to change the alert notification channels based on rotating on-call shifts, or you might want to swap in an experimental Alertmanager config to test a new alerting route.
To specify a non-default Secret name by using the OperatorConfig resource, do the following:
Create a Secret from your local Alertmanager config file:
kubectl create secret generic SECRET_NAME \
  -n gmp-public \
  --from-file=FILE_NAME
Open the OperatorConfig resource for editing:
kubectl -n gmp-public edit operatorconfig config
To enable the managed Alertmanager reporting, edit the resource by modifying the managedAlertmanager section as shown in the following example:
apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  namespace: gmp-public
  name: config
managedAlertmanager:
  configSecret:
    name: SECRET_NAME
    key: FILE_NAME
If you need to change the Alertmanager configuration later, update the Secret you created earlier.
Customizing the external URL
You can configure the external URL for the managed Alertmanager so that alert notifications can provide a callback link to your alerting UI. This is equivalent to using upstream Prometheus Alertmanager's --web.external-url flag:
apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  namespace: gmp-public
  name: config
managedAlertmanager:
  externalURL: EXTERNAL_URL
Self-deployed Alertmanager
To configure the rule-evaluator for a self-deployed Alertmanager, do the following:
Open the OperatorConfig resource for editing:
kubectl -n gmp-public edit operatorconfig config
Configure the resource to send alerts to your Alertmanager service:
apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  namespace: gmp-public
  name: config
rules:
  alerting:
    alertmanagers:
    - name: SERVICE_NAME
      namespace: SERVICE_NAMESPACE
      port: PORT_NAME
If your Alertmanager is located in a different cluster than your rule-evaluator, you might need to set up an Endpoints resource. For example, if your OperatorConfig says that Alertmanager endpoints can be found in the Endpoints object ns=alertmanager/name=alertmanager, you can manually or programmatically create this object yourself and populate it with reachable IPs from the other cluster. The AlertmanagerEndpoints configuration section provides options for authorization configuration if necessary.
Conserving resources when idle
When no Rules, ClusterRules, or GlobalRules resources are configured, GKE scales the rule-evaluator and Alertmanager deployments to zero to conserve cluster resources for customers who don't use managed rules or alerts. These deployments automatically scale up when you apply a new Rules resource. You can force them to scale up by applying a Rules resource that doesn't do anything.
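One possible shape for such a do-nothing Rules resource is sketched below. This is an assumption rather than a documented recipe: it relies on a recording rule whose expression selects a metric name that is presumed not to exist, so evaluation returns no samples. The resource, group, rule, and metric names are all hypothetical.

```yaml
# Illustrative sketch only: a placeholder resource intended to trigger
# scale-up without producing any recorded data.
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  namespace: NAMESPACE_NAME
  name: noop-rules                     # hypothetical resource name
spec:
  groups:
  - name: noop
    interval: 1h
    rules:
    # The selector names a metric that is assumed not to exist, so the
    # rule evaluates to an empty result and records nothing.
    - record: noop:placeholder
      expr: nonexistent_metric_for_scale_up
```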