Google Cloud Managed Service for Prometheus supports Prometheus-compatible rule evaluation. This document describes how to set up managed rule evaluation.
Rule evaluation
Managed Service for Prometheus provides a rule-evaluator component that allows you to safely write rules in the context of a global Prometheus backend, preventing you from interfering with other users' data in larger organizations. The component is automatically deployed as part of managed collection when running on Kubernetes clusters.
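To verify that the component is running, you can check for its Deployment. The following is a quick check, assuming the default gmp-system namespace and the rule-evaluator Deployment name used by managed collection:

kubectl get deploy -n gmp-system rule-evaluator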
Rules
The managed rule-evaluator uses the Rules resource to configure recording and alerting rules. The following is an example Rules resource:
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  namespace: gmp-test
  name: example-rules
spec:
  groups:
  - name: example
    interval: 30s
    rules:
    - record: job:up:sum
      expr: sum without(instance) (up)
    - alert: AlwaysFiring
      expr: vector(1)
The format of the .spec.groups element is identical to the upstream Prometheus rule_group array. Alerting and recording rules defined in Rules are scoped to the project_id, cluster, and namespace of the resource. For example, the job:up:sum rule in the above resource effectively queries sum without(instance) (up{project_id="test-project", cluster="test-cluster", namespace="gmp-test"}).
This guarantee ensures that alerting or recording rules do not accidentally
evaluate metrics from applications you may not even know about.
To apply the example rules to your cluster, run the following command:
kubectl apply -n gmp-test -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.4.0/examples/rules.yaml
After a few minutes, the metric job:up:sum becomes available. The alert AlwaysFiring also starts firing. For information about how to send alerts to an Alertmanager, see Alertmanager configuration.
The ClusterRules and GlobalRules resources provide the same interface as the Rules resource, but they apply the rules to wider scopes. ClusterRules select data by using the project_id and cluster labels, and GlobalRules select all data in the queried metrics scope without restricting labels.
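For example, if the job:up:sum rule above were defined in a ClusterRules resource in the same illustrative project and cluster, it would effectively query the following:

sum without(instance) (up{project_id="test-project", cluster="test-cluster"})

Defined in a GlobalRules resource, the same rule would evaluate sum without(instance) (up) over all data in the queried metrics scope.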
For reference documentation about all the Managed Service for Prometheus custom resources, see the prometheus-engine/doc/api reference.
Converting from Prometheus rules to Rules
The Rules resource provides a Prometheus-compatible interface, offering a seamless migration path for incorporating existing rules into managed rule evaluation. You can include your existing rules in a Rules resource. For example, the following is a Prometheus rule:
groups:
- name: example
  interval: 30s
  rules:
  - record: job:up:sum
    expr: sum without(instance) (up)
  - alert: AlwaysFiring
    expr: vector(1)
The corresponding Rules resource, with the original Prometheus rule embedded under the spec.groups field, follows:
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  namespace: gmp-test
  name: example-rules
spec:
  groups:
  - name: example
    interval: 30s
    rules:
    - record: job:up:sum
      expr: sum without(instance) (up)
    - alert: AlwaysFiring
      expr: vector(1)
ClusterRules
You can use the ClusterRules resource to configure recording and alerting rules that can evaluate all time series sent to Managed Service for Prometheus from all namespaces in a particular cluster. The spec is identical to that of Rules. The previous example Prometheus rule becomes the following ClusterRules resource:
apiVersion: monitoring.googleapis.com/v1
kind: ClusterRules
metadata:
  name: example-clusterrules
spec:
  groups:
  - name: example
    interval: 30s
    rules:
    - record: job:up:sum
      expr: sum without(instance) (up)
    - alert: AlwaysFiring
      expr: vector(1)
We recommend that you use ClusterRules resources only on horizontal metrics, such as those produced by a service mesh. For metrics of individual deployments, use Rules resources to ensure that the evaluation doesn't include unintended data.
GlobalRules
You can use the GlobalRules resource to configure recording and alerting rules that can evaluate all time series sent to Managed Service for Prometheus across all projects within a metrics scope. The spec is identical to that of Rules. The previous example Prometheus rule becomes the following GlobalRules resource:
apiVersion: monitoring.googleapis.com/v1
kind: GlobalRules
metadata:
  name: example-globalrules
spec:
  groups:
  - name: example
    interval: 30s
    rules:
    - record: job:up:sum
      expr: sum without(instance) (up)
    - alert: AlwaysFiring
      expr: vector(1)
We recommend that you use GlobalRules only for those rare use cases where an alert might need data across all clusters at once. For metrics of individual deployments, use Rules or ClusterRules resources for higher reliability and to ensure that the evaluation doesn't include unintended data.
We strongly recommend preserving the cluster and namespace labels in rule evaluation results unless the purpose of the rule is to aggregate away those labels; otherwise, query performance might decline and you might encounter cardinality limits. Removing both labels is strongly discouraged.
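For example, aggregating with sum without(instance) preserves the cluster and namespace labels in the result, while an unqualified sum() removes every label, including those two:

# Preserves cluster and namespace; only instance is aggregated away.
sum without(instance) (up)

# Aggregates away all labels, including cluster and namespace; discouraged.
sum(up)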
Multi-project and global rule evaluation
When deployed on Google Kubernetes Engine, the rule-evaluator automatically detects and uses the Google Cloud project associated with the cluster. To evaluate rules that span projects, you must configure the rule-evaluator that executes the GlobalRules resource to use a project with a multi-project metrics scope. You can do this in two ways:
- Place your GlobalRules resource in a project that has a multi-project metrics scope.
- Set the queryProjectID field within the OperatorConfig to use a project with a multi-project metrics scope, as shown in the sketch after this list.
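The following is a minimal sketch of the second option. It assumes a hypothetical scoping project SCOPING_PROJECT_ID that has a multi-project metrics scope; the only addition to the OperatorConfig is the queryProjectID field under rules:

apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  namespace: gmp-public
  name: config
rules:
  queryProjectID: SCOPING_PROJECT_ID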
You must also update the permissions of the service account used by the rule evaluator (which is usually the default service account on the node) so the service account can read from the scoping project and write to all monitored projects in the metrics scope.
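For example, granting read access on a hypothetical scoping project SCOPING_PROJECT_ID might look like the following, where SA_EMAIL is a placeholder for the rule-evaluator's service account; each monitored project additionally needs the roles/monitoring.metricWriter binding shown later in this document:

gcloud projects add-iam-policy-binding SCOPING_PROJECT_ID \
  --member=serviceAccount:SA_EMAIL \
  --role=roles/monitoring.viewer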
If your metrics scope contains all your projects, then your rules evaluate globally. For more information, see Metrics scopes.
Provide credentials explicitly
When running on GKE, the rule-evaluator automatically retrieves credentials from the environment based on the Compute Engine default service account or the Workload Identity setup.
In non-GKE Kubernetes clusters, credentials must be explicitly provided through the OperatorConfig resource in the gmp-public namespace.

Create a service account:
gcloud iam service-accounts create gmp-test-sa
Grant the required permissions to the service account:
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member=serviceAccount:gmp-test-sa@PROJECT_ID.iam.gserviceaccount.com \
  --role=roles/monitoring.viewer \
&& \
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member=serviceAccount:gmp-test-sa@PROJECT_ID.iam.gserviceaccount.com \
  --role=roles/monitoring.metricWriter
Create and download a key for the service account:
gcloud iam service-accounts keys create gmp-test-sa-key.json \
  --iam-account=gmp-test-sa@PROJECT_ID.iam.gserviceaccount.com
Add the key file as a secret to your non-GKE cluster:
kubectl -n gmp-public create secret generic gmp-test-sa \
  --from-file=key.json=gmp-test-sa-key.json
Open the OperatorConfig resource for editing:
kubectl -n gmp-public edit operatorconfig config
Add the rules.credentials section shown in the following example to the resource:
apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  namespace: gmp-public
  name: config
rules:
  credentials:
    name: gmp-test-sa
    key: key.json
Save the file and close the editor. After the change is applied, the pods are re-created and start authenticating to the metric backend with the given service account.
Alertmanager configuration
You can use the OperatorConfig resource to configure the managed rule-evaluator to send alerts to a self-deployed Prometheus Alertmanager. To configure the rule-evaluator, do the following:
Open the OperatorConfig resource for editing:
kubectl -n gmp-public edit operatorconfig config
Configure the resource to send alerts to your Alertmanager service:
apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  namespace: gmp-public
  name: config
rules:
  alerting:
    alertmanagers:
    - name: SERVICE_NAME
      namespace: SERVICE_NAMESPACE
      port: PORT_NAME
If your Alertmanager is located in a different cluster than your rule-evaluator, you might need to set up an Endpoints resource. For example, if your OperatorConfig says that Alertmanager endpoints can be found in the Endpoints object ns=alertmanager/name=alertmanager, you can manually or programmatically create this object yourself and populate it with reachable IPs from the other cluster. The AlertmanagerEndpoints configuration section provides options for authorization configuration if necessary.
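The following is a minimal sketch of such an Endpoints object for the ns=alertmanager/name=alertmanager example, with an illustrative IP address and port; the port name or number must match the port that your Alertmanager configuration references:

apiVersion: v1
kind: Endpoints
metadata:
  namespace: alertmanager
  name: alertmanager
subsets:
- addresses:
  - ip: 10.0.0.10
  ports:
  - name: web
    port: 9093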