
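Create alerting policies

This page describes how to create metric-based alerting policies for Google Distributed Cloud clusters. Downloadable samples are provided to help you set up alerting policies for common scenarios. For more information about metric-based alerting policies, see Create metric-threshold alerting policies in the Google Cloud Observability documentation.

Before you begin

You must have the following permissions to create alerting policies: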

monitoring.alertPolicies.create
monitoring.alertPolicies.delete
monitoring.alertPolicies.update

Any one of the following roles grants these permissions:

monitoring.alertPolicyEditor
monitoring.editor
Project Editor
Project Owner

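If you want to create log-based alerting policies by using the Google Cloud CLI, you must also have the serviceusage.serviceUsageConsumer role. For instructions on setting up log-based alerting policies, see Configure log-based alerts in the Google Cloud Observability documentation.

To check your roles, go to the IAM page in the Google Cloud console.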
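
The same check can be done from the command line. The following is a minimal sketch, assuming the gcloud CLI is installed and authenticated. The function name `list_member_roles`, the project ID, and the user in the example are hypothetical and are not part of the original instructions.

```shell
# list_member_roles PROJECT_ID MEMBER
# Prints the roles bound to MEMBER (for example user:alice@example.com)
# in the IAM policy attached to the project itself. Roles inherited from
# a folder or organization do not appear in this output.
list_member_roles() {
  local project_id="$1"
  local member="$2"
  gcloud projects get-iam-policy "${project_id}" \
    --flatten="bindings[].members" \
    --filter="bindings.members:${member}" \
    --format="value(bindings.role)"
}

# Example (hypothetical project and user):
# list_member_roles my-project user:alice@example.com
```

Compare the printed roles against the list above to confirm that at least one of them is present.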

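Create an example policy: API server unavailable

In this exercise, you create an alerting policy for the Kubernetes API server of your clusters. With this policy in place, you can arrange to be notified whenever the API server of a cluster is unavailable.

Download the policy configuration file: apiserver-unavailable.json.
Create the policy: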
```
gcloud alpha monitoring policies create --policy-from-file=POLICY_CONFIG
```
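Replace POLICY_CONFIG with the path of the configuration file that you just downloaded.

View your alerting policies:

Console

In the Google Cloud console, go to the Monitoring page.

Go to Monitoring
On the left, select Alerting.
Under Policies, you can see a list of your alerting policies.

In the list, select Anthos cluster API server unavailable (critical) to see details about your new policy. Under Conditions, you can see a description of the policy. For example: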
```
Policy violates when ANY condition is met
Anthos cluster API server uptime is absent for 5m
```

gcloud

gcloud alpha monitoring policies list

The output shows details about the policy. For example:

combiner: OR
conditions:
- conditionAbsent:
    aggregations:
    - alignmentPeriod: 60s
      crossSeriesReducer: REDUCE_MEAN
      groupByFields:
      - resource.label.project_id
      - resource.label.location
      - resource.label.cluster_name
      - resource.label.namespace_name
      - resource.label.container_name
      - resource.label.pod_name
      perSeriesAligner: ALIGN_MAX
    duration: 300s
    filter: resource.type = "k8s_container" AND metric.type = "kubernetes.io/anthos/container/uptime"
      AND resource.label."container_name"=monitoring.regex.full_match("kube-apiserver")
    trigger:
      count: 1
  displayName: Anthos cluster API server uptime is absent for 5m
  name: projects/…/alertPolicies/…/conditions/…
displayName: Anthos cluster API server unavailable (critical)
enabled: true
mutationRecord:
  mutateTime: …
  mutatedBy: …
name: projects/…/alertPolicies/…
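
When a project holds many policies, the full listing can be long. As a sketch, and assuming that the standard gcloud `--filter` and `--format` list flags behave for this command as they do for other list commands, the following hypothetical helper `find_policy_by_display_name` prints only the resource names of the matching policies.

```shell
# find_policy_by_display_name DISPLAY_NAME
# Prints the resource name of each alerting policy whose displayName
# matches DISPLAY_NAME.
find_policy_by_display_name() {
  local display_name="$1"
  gcloud alpha monitoring policies list \
    --filter="displayName=\"${display_name}\"" \
    --format="value(name)"
}

# Example, using the display name from the sample output:
# find_policy_by_display_name "Anthos cluster API server unavailable (critical)"
```

An empty result suggests that the policy was not created in the currently selected project.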

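Create additional alerting policies

This section provides descriptions and configuration files for a set of recommended alerting policies.

To create a policy, follow the same steps that you used in the preceding exercise:

To download the configuration file, click the link in the right column.
(Optional) Tune the conditions to better fit your specific needs. For example, you can add filters that restrict the policy to a subset of your clusters, or adjust the thresholds to balance noise against criticality.
To create the policy, run gcloud alpha monitoring policies create.

You can use the following script to download and install all of the alerting policy samples described in this document: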

# 1. Create a directory named alert_samples:

mkdir alert_samples && cd alert_samples
declare -a alerts=("apiserver-unavailable.json" "controller-manager-unavailable.json" "scheduler-unavailable.json" \
  "pod-crash-looping.json" "pod-not-ready-1h.json" "container-cpu-usage-high-reaching-limit.json" \
  "container-memory-usage-high-reaching-limit.json" "persistent-volume-usage-high.json" "node-cpu-usage-high.json" \
  "node-disk-usage-high.json" "node-memory-usage-high.json" "node-not-ready-1h.json" "apiserver-error-ratio-high.json" \
  "etcd-leader-changes-or-proposal-failures-frequent.json" "etcd-server-not-in-quorum.yaml" "etcd-storage-usage-high.json")

# 2. Download all alert samples into the alert_samples/ directory:

for x in "${alerts[@]}"
do
  wget "https://cloud.google.com/anthos/clusters/docs/bare-metal/latest/samples/${x}"
done

# 3. (optional) Uncomment and provide your project ID to set the default project
# for gcloud commands:

# gcloud config set project <PROJECT_ID>

# 4. Create alert policies for each of the downloaded samples:

for x in "${alerts[@]}"
do
  gcloud alpha monitoring policies create --policy-from-file="${x}"
done
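
The loop in step 4 keeps going without comment if one policy fails to create. One possible variation, sketched here with a hypothetical function name `create_policies`, records which files failed and returns a non-zero status so that a calling script can notice the failure. It relies only on the `gcloud alpha monitoring policies create --policy-from-file` command already used in this document.

```shell
# create_policies FILE...
# Creates one alerting policy per file, continues past failures, and
# reports the files that could not be turned into a policy.
create_policies() {
  local failed=""
  local f
  for f in "$@"; do
    if ! gcloud alpha monitoring policies create --policy-from-file="${f}"; then
      failed="${failed} ${f}"
    fi
  done
  if [ -n "${failed}" ]; then
    echo "Failed to create policies from:${failed}" >&2
    return 1
  fi
  return 0
}

# Example, run from the alert_samples directory in place of step 4:
# create_policies "${alerts[@]}"
```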

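Control plane components availability

Alert name	Description	Alerting policy definition in Cloud Monitoring
API server unavailable (critical)	The API server uptime metric is unavailable	apiserver-unavailable.json
Scheduler unavailable (critical)	The scheduler uptime metric is unavailable	scheduler-unavailable.json
Controller manager unavailable (critical)	The controller manager uptime metric is unavailable	controller-manager-unavailable.json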

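Kubernetes system

Alert name	Description	Alerting policy definition in Cloud Monitoring
Pod crash looping (warning)	A Pod keeps restarting and might be in a crash loop	pod-crash-looping.json
Pod not ready for more than one hour (critical)	A Pod has been in a non-ready state for more than one hour	pod-not-ready-1h.json
Container CPU usage exceeds 80 percent (warning)	Container CPU usage is over 80% of the limit	container-cpu-usage-high-reaching-limit.json
Container memory usage exceeds 85 percent (warning)	Container memory usage is over 85% of the limit	container-memory-usage-high-reaching-limit.json
Persistent volume high usage (critical)	A claimed persistent volume has less than 3% free space	persistent-volume-usage-high.json
Node CPU usage exceeds 80 percent (warning)	Node CPU usage is over 80% of the total allocatable CPU for 5 minutes	node-cpu-usage-high.json
Node disk usage exceeds 85 percent (warning)	Less than 15% of space is free on a disk mount point for 10 minutes	node-disk-usage-high.json
Node memory usage exceeds 80 percent (warning)	Node memory usage is over 80% of the total allocatable memory for 5 minutes	node-memory-usage-high.json
Node not ready for more than one hour (critical)	A node has been in a non-ready state for more than one hour	node-not-ready-1h.json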

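Kubernetes performance

Alert name	Description	Alerting policy definition in Cloud Monitoring
API server error ratio exceeds 20 percent (critical)	For a given verb, the API server returns 5xx or 429 errors on more than 20% of requests for 15 minutes	apiserver-error-ratio-high.json
ETCD leader change or proposal failure too frequent (warning)	`etcd` leader changes or proposal failures happen too frequently	etcd-leader-changes-or-proposal-failures-frequent.json
ETCD server is not in quorum (critical)	No `etcd` server proposals have been committed for 5 minutes, so quorum might have been lost	etcd-server-not-in-quorum.yaml
ETCD storage exceeds 90 percent limit (warning)	`etcd` storage usage is over 90% of the limit	etcd-storage-usage-high.json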

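Alerting policies using PromQL

Queries in alerting policies can also be expressed in PromQL instead of MQL. For example, a PromQL version of the API server error ratio exceeds 20 percent (critical) policy is available to download: apiserver-error-ratio-high-promql.json.

For more information, see the Use Managed Service for Prometheus documentation for Google Distributed Cloud and the Alerting policies with PromQL documentation for Cloud Monitoring.

Getting notified

After you create an alerting policy, you can define one or more notification channels for it. Several channel types are available. For example, you can be notified by email, in a Slack channel, or through a mobile app. Choose the channels that suit your needs.

For instructions on how to configure notification channels, see Managing notification channels.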
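
As a rough sketch of a command-line route, the following assumes that the `gcloud beta monitoring channels create` command and the `--add-notification-channels` flag of `gcloud alpha monitoring policies update` are available in your gcloud version; check the `--help` output of both commands before relying on them. The function names, the channel display name, the email address, and the policy name in the example are all hypothetical.

```shell
# create_email_channel DISPLAY_NAME EMAIL
# Creates an email notification channel and prints its resource name.
create_email_channel() {
  local display_name="$1"
  local email="$2"
  gcloud beta monitoring channels create \
    --display-name="${display_name}" \
    --type=email \
    --channel-labels="email_address=${email}" \
    --format="value(name)"
}

# attach_channel POLICY_NAME CHANNEL_NAME
# Adds an existing notification channel to an existing alerting policy.
attach_channel() {
  local policy_name="$1"
  local channel_name="$2"
  gcloud alpha monitoring policies update "${policy_name}" \
    --add-notification-channels="${channel_name}"
}

# Example (hypothetical values):
# channel="$(create_email_channel "On-call email" oncall@example.com)"
# attach_channel "projects/my-project/alertPolicies/1234567890" "${channel}"
```

Keeping channel creation and attachment separate lets one channel be reused across several of the sample policies.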