This page shows how to create alerting policies for Google Distributed Cloud clusters.
Before you begin
You must have the following permissions to create alerting policies:
- monitoring.alertPolicies.create
- monitoring.alertPolicies.delete
- monitoring.alertPolicies.update
You'll have these permissions if you have any one of the following roles:
- monitoring.alertPolicyEditor
- monitoring.editor
- Project editor
- Project owner
To check your roles, go to the IAM page in the Google Cloud console.
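As a rough aid, the role list above can be turned into a small check. This is a minimal sketch, not an official tool: it assumes the roles named on this page correspond to the IAM role IDs `roles/monitoring.alertPolicyEditor`, `roles/monitoring.editor`, `roles/editor`, and `roles/owner`, and that you have copied a principal's granted roles from the IAM page into a list. The function name and the sample input are hypothetical.

```python
# Roles that this page lists as sufficient, written as IAM role IDs
# (assumed mapping; confirm the IDs on the IAM page).
SUFFICIENT_ROLES = {
    "roles/monitoring.alertPolicyEditor",
    "roles/monitoring.editor",
    "roles/editor",  # Project editor
    "roles/owner",   # Project owner
}


def sufficient_roles_held(granted_roles):
    """Return the sorted subset of granted roles that this page lists as sufficient."""
    return sorted(SUFFICIENT_ROLES.intersection(granted_roles))


# Hypothetical input: roles copied from the IAM page for one principal.
granted = ["roles/viewer", "roles/monitoring.editor"]
print(sufficient_roles_held(granted))
```

An empty result suggests that none of the listed roles is held, although the required permissions could still come from a custom role.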
Creating a policy: admin cluster API server down
In this exercise, you create an alerting policy for Kubernetes API servers of admin clusters. With this policy in place, you can arrange to be notified whenever the API server of an admin cluster goes down.
Download the policy configuration file: admin-cluster-apiserver-down.json.
Create the policy:
gcloud alpha monitoring policies create --policy-from-file=POLICY_CONFIG
Replace POLICY_CONFIG with the path of the configuration file you just downloaded.
View your alerting policies:
Console
In the Google Cloud console, go to the Monitoring page.
On the left, select Alerting.
Under Policies, you can see a list of your alerting policies.
In the list, select GKE on-prem admin cluster API server down (critical) to see details about your new policy. Under Conditions, you can see a description of the policy. For example:
Policy violates when ANY condition is met
Anthos On-Prem Admin Cluster API Server is up
Violates when: Any kubernetes.io/anthos/up stream is absent for greater than 5 minutes
gcloud
gcloud alpha monitoring policies list
The output shows detailed information about the policy. For example:
combiner: OR
conditions:
- conditionAbsent:
    aggregations:
    - alignmentPeriod: 60s
      crossSeriesReducer: REDUCE_SUM
      groupByFields:
      - resource.label.project_id
      - resource.label.location
      - resource.label.cluster_name
      perSeriesAligner: ALIGN_MEAN
    duration: 300s
    filter: resource.type="k8s_container" AND metric.type="kubernetes.io/anthos/up" AND resource.label."container_name"=monitoring.regex.full_match("kube-apiserver")
    trigger:
      count: 1
...
displayName: GKE on-prem admin cluster API server down (critical)
enabled: true
...
name: projects/xxxxxx/alertPolicies/12331540576820203183
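To make the shape of a policy configuration file more concrete, the following sketch rebuilds a JSON document from the fields shown in the example output. It is an approximation under stated assumptions: fields elided with `...` are left out, the condition display name is taken from the console description earlier on this page, and the result has not been checked against the downloadable file, which remains the source of truth.

```python
import json

# Reconstructed from the example output on this page. Elided fields ("...")
# are omitted; the condition displayName is assumed from the console text.
policy = {
    "displayName": "GKE on-prem admin cluster API server down (critical)",
    "combiner": "OR",
    "enabled": True,
    "conditions": [
        {
            "displayName": "Anthos On-Prem Admin Cluster API Server is up",
            "conditionAbsent": {
                "filter": (
                    'resource.type="k8s_container" AND '
                    'metric.type="kubernetes.io/anthos/up" AND '
                    'resource.label."container_name"='
                    'monitoring.regex.full_match("kube-apiserver")'
                ),
                # Alert when the metric has been absent for 5 minutes.
                "duration": "300s",
                "aggregations": [
                    {
                        "alignmentPeriod": "60s",
                        "perSeriesAligner": "ALIGN_MEAN",
                        "crossSeriesReducer": "REDUCE_SUM",
                        "groupByFields": [
                            "resource.label.project_id",
                            "resource.label.location",
                            "resource.label.cluster_name",
                        ],
                    }
                ],
                "trigger": {"count": 1},
            },
        }
    ],
}

# Serialize the policy as it might appear in a configuration file.
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```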
Creating additional alerting policies
This section provides descriptions and configuration files for a set of recommended alerting policies.
To create a policy, follow the same steps that you used in the preceding exercise:
Click the link in the right column to download the configuration file.
Run gcloud alpha monitoring policies create to create the policy.
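If you download several configuration files, the two steps above can be repeated in a loop. The following is a minimal sketch that assumes the files named in the admin cluster table are in the current directory; the helper names are hypothetical, and the default dry-run mode only prints the commands instead of running them.

```python
import subprocess

# File names taken from the admin cluster control plane table on this page.
ADMIN_CONTROL_PLANE_CONFIGS = [
    "admin-cluster-apiserver-down.json",
    "admin-cluster-scheduler-down.json",
    "admin-cluster-controller-manager-down.json",
    "admin-cluster-etcd-down.json",
]


def build_create_command(config_path):
    """Return the gcloud argument list that creates a policy from one file."""
    return [
        "gcloud", "alpha", "monitoring", "policies", "create",
        f"--policy-from-file={config_path}",
    ]


def create_policies(config_paths, dry_run=True):
    """Print (dry run) or execute one create command per configuration file."""
    commands = [build_create_command(path) for path in config_paths]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # Raises CalledProcessError if gcloud exits with a non-zero status.
            subprocess.run(command, check=True)
    return commands


# Dry run by default: shows the commands without calling gcloud.
create_policies(ADMIN_CONTROL_PLANE_CONFIGS)
```

Passing `dry_run=False` runs the commands and requires an installed and authenticated gcloud CLI.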
Admin cluster control plane components availability
Alert name | Description | Alerting policy definition in Cloud Monitoring |
---|---|---|
GKE on-prem admin cluster API server down (critical) | Admin cluster API server has disappeared from metrics target discovery | admin-cluster-apiserver-down.json |
GKE on-prem admin cluster scheduler down (critical) | Admin cluster scheduler has disappeared from metrics target discovery | admin-cluster-scheduler-down.json |
GKE on-prem admin cluster controller manager down (critical) | Admin cluster controller manager has disappeared from metrics target discovery | admin-cluster-controller-manager-down.json |
GKE on-prem admin cluster etcd down (critical) | Admin cluster etcd has disappeared from metrics target discovery | admin-cluster-etcd-down.json |
User cluster control plane components availability
The user cluster control plane alerts are based on metrics. For most cluster metrics, the cluster_name field is the name of the cluster itself. But for user cluster control plane metrics, the cluster_name field is the name of the admin cluster, and the namespace_name field is the name of the user cluster.
You can see this in a screenshot under Create a control plane status dashboard.
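The label swap described above matters when you write your own filters. The sketch below shows how a filter that targets one user cluster's control plane might be assembled: `cluster_name` carries the admin cluster name and `namespace_name` carries the user cluster name. The function name and cluster names are hypothetical, and the clause layout is modeled on the admin cluster example filter on this page rather than on the downloadable user cluster policies, which may be written differently.

```python
def user_cluster_control_plane_filter(admin_cluster, user_cluster, container_name):
    """Build a Monitoring filter for one user cluster control plane container."""
    clauses = [
        'resource.type="k8s_container"',
        'metric.type="kubernetes.io/anthos/up"',
        # For user cluster control plane metrics, cluster_name is the ADMIN cluster.
        f'resource.label."cluster_name"="{admin_cluster}"',
        # The user cluster name appears in namespace_name instead.
        f'resource.label."namespace_name"="{user_cluster}"',
        f'resource.label."container_name"=monitoring.regex.full_match("{container_name}")',
    ]
    return " AND ".join(clauses)


# Hypothetical cluster names, for illustration only.
example_filter = user_cluster_control_plane_filter(
    "my-admin-cluster", "my-user-cluster", "kube-apiserver"
)
print(example_filter)
```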
Alert name | Description | Alerting policy definition in Cloud Monitoring |
---|---|---|
GKE on-prem user cluster API server down (critical) | User cluster API server has disappeared from metrics target discovery | user-cluster-apiserver-down.json |
GKE on-prem user cluster scheduler down (critical) | User cluster scheduler has disappeared from metrics target discovery | user-cluster-scheduler-down.json |
GKE on-prem user cluster controller manager down (critical) | User cluster controller manager has disappeared from metrics target discovery | user-cluster-controller-manager-down.json |
GKE on-prem user cluster etcd down (critical) | User cluster etcd has disappeared from metrics target discovery | user-cluster-etcd-down.json |
Kubernetes system
Alert name | Description | Alerting policy definition in Cloud Monitoring |
---|---|---|
GKE on-prem pod crash looping (critical) | Pod is in a crash loop status | pod-crash-looping.json |
GKE on-prem pod not ready for more than one hour (critical) | Pod is in a non-ready state for more than one hour | pod-not-ready-1h.json |
GKE on-prem persistent volume high usage (critical) | Persistent volume claimed is expected to fill up | persistent-volume-usage-high.json |
GKE on-prem node not ready for more than one hour (critical) | Node is in a non-ready state for more than one hour | node-not-ready-1h.json |
Kubernetes performance
Alert name | Description | Alerting policy definition in Cloud Monitoring |
---|---|---|
GKE on-prem admin cluster API server error count ratio exceeds 10 percent (critical) | Admin cluster API server is returning errors for more than 10% of requests | admin-cluster-apiserver-error-ratio-10-percent.json |
GKE on-prem admin cluster API server error count ratio exceeds 5 percent (warning) | Admin cluster API server is returning errors for more than 5% of requests | admin-cluster-apiserver-error-ratio-5-percent.json |
GKE on-prem user cluster API server error count ratio exceeds 10 percent (critical) | User cluster API server is returning errors for more than 10% of requests | user-cluster-apiserver-error-ratio-10-percent.json |
GKE on-prem user cluster API server error count ratio exceeds 5 percent (warning) | User cluster API server is returning errors for more than 5% of requests | user-cluster-apiserver-error-ratio-5-percent.json |
Getting notified
After you create an alerting policy, you can define one or more notification channels for the policy. There are several kinds of notification channels. For example, you could be notified by email, a Slack channel, or a mobile app. You can choose the channels that suit your needs.
For instructions about how to configure notification channels, see Managing notification channels.
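In the Cloud Monitoring API, an alerting policy refers to its channels through a `notificationChannels` list of channel resource names. As a hedged illustration of that relationship, the sketch below adds a channel name to a policy configuration held as a Python dictionary. The helper name is hypothetical, `PROJECT_ID` and `CHANNEL_ID` are placeholders, and the minimal policy dictionary stands in for a full configuration file.

```python
import copy


def with_notification_channels(policy, channel_names):
    """Return a copy of the policy with the given channels added, without duplicates."""
    updated = copy.deepcopy(policy)
    existing = updated.get("notificationChannels", [])
    # dict.fromkeys keeps the first occurrence of each name and preserves order.
    updated["notificationChannels"] = list(dict.fromkeys(existing + list(channel_names)))
    return updated


# Minimal stand-in for a policy configuration; real files contain more fields.
base_policy = {"displayName": "GKE on-prem admin cluster API server down (critical)"}

# Placeholder resource name; substitute the values of an existing channel.
channel = "projects/PROJECT_ID/notificationChannels/CHANNEL_ID"

policy_with_channel = with_notification_channels(base_policy, [channel])
print(policy_with_channel["notificationChannels"])
```

The function returns a copy, so the original configuration stays unchanged and can be reused for policies that notify other channels.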