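Learning path: Scalable applications - Monitoring with Prometheus

This set of tutorials is for IT administrators and Operators who want to deploy, run, and manage modern application environments that run on Google Kubernetes Engine (GKE) Enterprise edition. As you progress through this set of tutorials, you learn how to configure monitoring and alerts, scale workloads, and simulate failure, all using the Cymbal Bank sample microservices application.

Create a cluster and deploy a sample application
Monitor with Google Cloud Managed Service for Prometheus (this tutorial)
Scale workloads
Simulate a failure
Centralize change management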

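Overview and objectives

The Cymbal Bank sample application used in this set of tutorials is made up of a number of microservices that all run in the GKE cluster. A problem with any of these services could result in a bad experience for the bank's customers, such as not being able to access the bank's application. The sooner you learn about a problem with a service, the sooner you can start to troubleshoot and resolve it.

In this tutorial, you learn how to monitor workloads in a GKE cluster using Google Cloud Managed Service for Prometheus and Cloud Monitoring. You learn how to complete the following tasks:

Create a Slack webhook for Alertmanager.
Configure Prometheus to monitor the status of a sample microservices-based application.
Simulate an outage and review the alerts sent using the Slack webhook.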

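Costs

Enabling GKE Enterprise and deploying the Cymbal Bank sample application for this series of tutorials means that you incur per-cluster charges for GKE Enterprise on Google Cloud, as listed on our Pricing page, until you disable GKE Enterprise or delete the project.

You're also responsible for other Google Cloud costs incurred while running the Cymbal Bank sample application, such as charges for Compute Engine VMs and Cloud Monitoring.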

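Before you begin

To learn how to monitor your workloads, you must complete the first tutorial to create a GKE cluster that uses Autopilot and deploy the Cymbal Bank sample microservices-based application.

We recommend that you complete this set of tutorials for Cymbal Bank in order. As you progress through the set of tutorials, you learn new skills and use additional Google Cloud products and services.

To show an example of how a GKE Autopilot cluster can use Google Cloud Managed Service for Prometheus to generate messages to a communication platform, this tutorial uses Slack. In your own production deployments, you can use your organization's preferred communication tool to process and deliver messages when your GKE cluster has an issue.

Join a Slack workspace, either by registering with your email address or by using an invitation sent by a workspace admin.

Note: If you aren't an admin of your Slack workspace, you might need approval from a workspace admin before you can deploy your application to the workspace.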

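Create a Slack application

A key part of setting up monitoring is making sure that you're notified when actionable events such as an outage occur. A common pattern is to send notifications to a communication tool such as Slack, which this tutorial uses. Slack provides a webhooks feature that lets external applications, like your production deployments, generate messages. You can use other communication tools in your organization to process and deliver messages when your GKE cluster has an issue.

GKE clusters that use Autopilot include a Google Cloud Managed Service for Prometheus instance. This instance can generate alerts when something happens to your applications. These alerts can then use a Slack webhook to send a message to your Slack workspace so that you receive prompt notifications when there's a problem.

To set up Slack notifications based on alerts generated by Prometheus, you must create a Slack application, activate Incoming Webhooks for the application, and install the application to a Slack workspace.

Sign in to Slack using your workspace name and your Slack account credentials.
Create a new Slack app:
1. In the Create an app dialog, click From scratch.
2. Specify an App Name and choose your Slack workspace.
3. Click Create App.
4. Under Add features and functionality, click Incoming Webhooks.
5. Click the Activate Incoming Webhooks toggle.
6. In the Webhook URLs for Your Workspace section, click Add New Webhook to Workspace.
7. On the authorization page that opens, select a channel to receive notifications.
8. Click Allow.
9. A webhook for your Slack application is displayed in the Webhook URLs for Your Workspace section. Save the URL for later.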
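
Before you wire the webhook into Alertmanager, it can help to confirm that it accepts messages. The following is a minimal sketch, assuming that curl is installed and that the URL you saved in the last step is exported as SLACK_WEBHOOK_URL. The function names build_payload and send_test_message are illustrative and aren't part of the tutorial repository.

```shell
# Build the minimal JSON body that a Slack incoming webhook accepts:
# {"text": "..."}.
# Assumption: the message contains no double quotes or backslashes,
# because this sketch does not JSON-escape its input.
build_payload() {
  printf '{"text":"%s"}' "$1"
}

# POST a test message to the webhook URL held in SLACK_WEBHOOK_URL.
# Returns non-zero without calling curl if the variable is unset or empty.
send_test_message() {
  if [ -z "${SLACK_WEBHOOK_URL:-}" ]; then
    echo "SLACK_WEBHOOK_URL is not set" >&2
    return 1
  fi
  curl -sS -X POST \
    -H 'Content-type: application/json' \
    --data "$(build_payload "$1")" \
    "$SLACK_WEBHOOK_URL"
}
```

For example, run send_test_message "Test from the monitoring tutorial". If the webhook is active, the message should appear in the channel that you selected.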

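Configure Alertmanager

In Prometheus, Alertmanager processes the monitoring events that your deployments generate. Alertmanager can skip duplicate events, group related events, and send notifications, such as through a Slack webhook. This section shows you how to configure Alertmanager to use your new Slack webhook. The next section of this tutorial, Configure Prometheus, covers how to specify the way that Alertmanager processes the events to send.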
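
This tutorial doesn't show the contents of alertmanager.yaml. As a rough illustration of how grouping and a Slack receiver are commonly expressed in an Alertmanager configuration, the following sketch writes a simplified, hypothetical configuration to a scratch file. The receiver name, channel, timing values, and file path are assumptions, and the file in the repository might look different.

```shell
# Write an illustrative Alertmanager configuration to a scratch file.
# This is NOT the repository's extras/prometheus/gmp/alertmanager.yaml.
SKETCH_FILE="${TMPDIR:-/tmp}/alertmanager-sketch.yaml"

cat > "$SKETCH_FILE" <<'EOF'
route:
  receiver: slack-notifications   # hypothetical receiver name
  group_by: [alertname]           # group related alerts into one notification
  group_wait: 30s                 # delay before the first notification for a group
  repeat_interval: 4h             # how often to re-send while an alert keeps firing
receivers:
- name: slack-notifications
  slack_configs:
  - api_url: SLACK_WEBHOOK_URL    # placeholder for the webhook URL (assumed position)
    channel: '#alerts'            # hypothetical channel
    send_resolved: true           # also notify when an alert resolves
EOF

echo "Wrote $SKETCH_FILE"
```

The route block controls grouping and timing, and the receivers block says where notifications go. Setting send_resolved to true is what produces a follow-up message when an alert clears.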

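To configure Alertmanager to use your Slack webhook, complete the following steps:

Change directories to the Git repository that includes all the sample manifests for Cymbal Bank from the previous tutorial: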
```
cd ~/bank-of-anthos/
```
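If needed, change the directory location to where you previously cloned the repository.
Update the Alertmanager sample YAML manifest with the webhook URL of your Slack application: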
```
sed -i "s@SLACK_WEBHOOK_URL@SLACK_WEBHOOK_URL@g" "extras/prometheus/gmp/alertmanager.yaml"
```
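In the replacement half of the expression (the second SLACK_WEBHOOK_URL), substitute the webhook URL from the previous section. Leave the first occurrence as is, because it's the placeholder text that sed searches for in the manifest.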
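
The sed expression above uses @ as its delimiter so that the slashes in the webhook URL don't need escaping. To see the substitution in isolation, the following sketch runs it against a scratch file. The helper name replace_webhook_placeholder, the scratch file, and the URL are made up for illustration.

```shell
# Replace every SLACK_WEBHOOK_URL placeholder in a file with a real URL.
# '@' is the sed delimiter, so slashes in the URL need no escaping.
# Assumption: the URL itself contains no '@' or '&' characters.
# The result is written to a temporary file and moved back, which avoids
# relying on 'sed -i' (its syntax differs between GNU and BSD sed).
replace_webhook_placeholder() (
  url="$1"
  file="$2"
  tmp="$(mktemp)"
  sed "s@SLACK_WEBHOOK_URL@${url}@g" "$file" > "$tmp" && mv "$tmp" "$file"
)

# Demonstration on a scratch file with a made-up URL.
demo_file="$(mktemp)"
printf 'api_url: SLACK_WEBHOOK_URL\n' > "$demo_file"
replace_webhook_placeholder "https://hooks.example.com/services/T000/B000/XXXX" "$demo_file"
cat "$demo_file"
```

After the call, the scratch file no longer contains the placeholder text. You can run the same kind of check with grep against the real manifest before you create the Secret.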
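To use a unique Slack webhook URL dynamically without changes to the application code, you can use a Kubernetes Secret. The application code reads the value of this Secret. In more complex applications, this ability lets you change or rotate values for security or compliance reasons.

Create a Kubernetes Secret for Alertmanager using the sample YAML manifest that contains the Slack webhook URL: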
```
kubectl create secret generic alertmanager \
  -n gmp-public \
  --from-file=extras/prometheus/gmp/alertmanager.yaml
```
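
To confirm that the Secret was created, you can look it up by name. The following is a small sketch, assuming that kubectl is pointed at the tutorial cluster. The helper name secret_exists is hypothetical.

```shell
# Return success if the named Secret exists in the given namespace.
# $1: Secret name, $2: namespace.
secret_exists() {
  kubectl get secret "$1" -n "$2" >/dev/null 2>&1
}
```

For example, secret_exists alertmanager gmp-public && echo "Secret found". If the Secret already exists from an earlier attempt, running kubectl create secret again reports an AlreadyExists error instead of updating it.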
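Prometheus can use exporters to get metrics from applications without code changes. The Prometheus blackbox exporter lets you probe endpoints such as HTTP or HTTPS, which works well when you don't want to, or can't, expose the inner workings of your application to Prometheus.

Deploy the Prometheus blackbox exporter to your cluster: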
```
kubectl apply -f extras/prometheus/gmp/blackbox-exporter.yaml
```

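Configure Prometheus

After you have configured Alertmanager to use your Slack webhook, you need to tell Prometheus what to monitor in Cymbal Bank, and what kinds of events you want Alertmanager to notify you about using the Slack webhook.

In the Cymbal Bank sample application that you use in these tutorials, there are various microservices that run in the GKE cluster. One problem you probably want to know about as soon as possible is if one of the Cymbal Bank services has stopped responding normally to requests, which could mean that your customers can't access the application. You can configure Prometheus to respond to events based on your organization's policies.

Probes

You can configure Prometheus probes for the resources that you want to monitor. These probes can generate alerts based on the response that the probe receives. In the Cymbal Bank sample application, you can use HTTP probes that check for 200-level response codes from the Services. An HTTP 200-level response indicates that the Service is running correctly and can respond to requests. If there's a problem and the probe doesn't receive the expected response, you can define Prometheus rules that generate alerts for Alertmanager to process and take further action on.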
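
Each probe is ultimately an HTTP request to the blackbox exporter's /probe endpoint, with the address to check passed as target and the check type passed as module. The following sketch builds that URL so that you can try a probe by hand. The helper name probe_url is hypothetical, and the example assumes that the exporter listens on port 9115, which is the upstream default and might differ in your deployment.

```shell
# Build the URL for a manual request to the blackbox exporter's /probe endpoint.
# $1: exporter address (host:port), $2: target to probe, $3: module name.
# Assumption: the target needs no URL encoding (plain host:port/path).
probe_url() {
  printf 'http://%s/probe?target=%s&module=%s' "$1" "$2" "$3"
}

# Example use, after forwarding the exporter port in another terminal with
#   kubectl port-forward deployment/blackbox-exporter 9115:9115
# (assumed port):
#   curl -s "$(probe_url localhost:9115 frontend:80 http_2xx)" | grep probe_success
```

The probe_success metric in the response is 1 when the check passes and 0 when it fails.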

Create some Prometheus probes to monitor the HTTP status of the various microservices of the Cymbal Bank sample application. Review the following sample manifest:

```
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: frontend-probe
  labels:
    app.kubernetes.io/name: frontend-probe
spec:
  selector:
    matchLabels:
      app: blackbox-exporter
  endpoints:
  - port: metrics
    path: /probe
    params:
      target: [frontend:80]
      module: [http_2xx]
    timeout: 30s
    interval: 60s
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: userservice-probe
  labels:
    app.kubernetes.io/name: userservice-probe
spec:
  selector:
    matchLabels:
      app: blackbox-exporter
  endpoints:
  - port: metrics
    path: /probe
    params:
      target: [userservice:8080/ready]
      module: [http_2xx]
    timeout: 30s
    interval: 60s
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: balancereader-probe
  labels:
    app.kubernetes.io/name: balancereader-probe
spec:
  selector:
    matchLabels:
      app: blackbox-exporter
  endpoints:
  - port: metrics
    path: /probe
    params:
      target: [balancereader:8080/ready]
      module: [http_2xx]
    timeout: 30s
    interval: 60s
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: contacts-probe
  labels:
    app.kubernetes.io/name: contacts-probe
spec:
  selector:
    matchLabels:
      app: blackbox-exporter
  endpoints:
  - port: metrics
    path: /probe
    params:
      target: [contacts:8080/ready]
      module: [http_2xx]
    timeout: 30s
    interval: 60s
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: ledgerwriter-probe
  labels:
    app.kubernetes.io/name: ledgerwriter-probe
spec:
  selector:
    matchLabels:
      app: blackbox-exporter
  endpoints:
  - port: metrics
    path: /probe
    params:
      target: [ledgerwriter:8080/ready]
      module: [http_2xx]
    timeout: 30s
    interval: 60s
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: transactionhistory-probe
  labels:
    app.kubernetes.io/name: transactionhistory-probe
spec:
  selector:
    matchLabels:
      app: blackbox-exporter
  endpoints:
  - port: metrics
    path: /probe
    params:
      target: [transactionhistory:8080/ready]
      module: [http_2xx]
    timeout: 30s
    interval: 60s
```

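As shown in this manifest file, it's a best practice for each PodMonitoring Prometheus liveness probe to monitor each Deployment separately.

To create the Prometheus liveness probes, apply the manifest to your cluster: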
```
kubectl apply -f extras/prometheus/gmp/probes.yaml
```
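
After you apply the manifest, you can check that all six PodMonitoring resources exist. The following is a minimal sketch, assuming that kubectl targets the namespace where you applied probes.yaml. The PROBES variable and the check_probes helper are hypothetical names.

```shell
# Names of the PodMonitoring resources defined in probes.yaml.
PROBES="frontend-probe userservice-probe balancereader-probe contacts-probe ledgerwriter-probe transactionhistory-probe"

# Print each expected probe that is missing from the cluster.
# Returns non-zero if at least one probe is missing.
check_probes() {
  missing=0
  for name in $PROBES; do
    if ! kubectl get podmonitoring "$name" >/dev/null 2>&1; then
      echo "missing: $name"
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
}
```

Run check_probes && echo "All probes are present" to get a quick confirmation.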

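Rules

Prometheus needs to know what you want to do based on the response that the probes you created in the previous steps receive. You define this response using Prometheus rules.

In this tutorial, you create Prometheus rules that generate alerts based on the response to the liveness probe. Alertmanager then processes the output of these rules to generate notifications using the Slack webhook.

Create rules that generate events based on the response to the liveness probes. Review the following sample manifest: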

```
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
---
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: uptime-rule
spec:
  groups:
  - name: Micro services uptime
    interval: 60s
    rules:
    - alert: BalancereaderUnavailable
      expr: probe_success{job="balancereader-probe"} == 0
      for: 1m
      annotations:
        summary: Balance Reader Service is unavailable
        description: Check Balance Reader pods and its logs
      labels:
        severity: 'critical'
    - alert: ContactsUnavailable
      expr: probe_success{job="contacts-probe"} == 0
      for: 1m
      annotations:
        summary: Contacts Service is unavailable
        description: Check Contacts pods and its logs
      labels:
        severity: 'warning'
    - alert: FrontendUnavailable
      expr: probe_success{job="frontend-probe"} == 0
      for: 1m
      annotations:
        summary: Frontend Service is unavailable
        description: Check Frontend pods and its logs
      labels:
        severity: 'critical'
    - alert: LedgerwriterUnavailable
      expr: probe_success{job="ledgerwriter-probe"} == 0
      for: 1m
      annotations:
        summary: Ledger Writer Service is unavailable
        description: Check Ledger Writer pods and its logs
      labels:
        severity: 'critical'
    - alert: TransactionhistoryUnavailable
      expr: probe_success{job="transactionhistory-probe"} == 0
      for: 1m
      annotations:
        summary: Transaction History Service is unavailable
        description: Check Transaction History pods and its logs
      labels:
        severity: 'critical'
    - alert: UserserviceUnavailable
      expr: probe_success{job="userservice-probe"} == 0
      for: 1m
      annotations:
        summary: User Service is unavailable
        description: Check User Service pods and its logs
      labels:
        severity: 'critical'
```

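This manifest describes a Rules resource and includes the following fields:

spec.groups.[*].name: the name of the rule group.
spec.groups.[*].interval: how often the rules in the group are evaluated.
spec.groups.[*].rules[*].alert: the name of the alert.
spec.groups.[*].rules[*].expr: the PromQL expression to evaluate.
spec.groups.[*].rules[*].for: how long the expression must keep returning a result before the alert is considered firing.
spec.groups.[*].rules[*].annotations: a list of annotations to add to each alert. This is only valid for alerting rules.
spec.groups.[*].rules[*].labels: the labels to add or overwrite.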
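
Every rule in the manifest follows the same pattern: the job label in the expression matches the name of the corresponding PodMonitoring resource in the probes manifest. As a small illustration of that convention, the following hypothetical helper prints the expression for a given service name.

```shell
# Print the PromQL expression that the uptime rules use for one service.
# $1: service name as used in the probe names, for example "contacts".
uptime_expr() {
  printf 'probe_success{job="%s-probe"} == 0' "$1"
}

uptime_expr contacts
echo
```

The expression returns a result only while the probe is failing, which is why an alert built on it stays silent when the service is healthy.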

To create the rules, apply the manifest to your cluster:
```
kubectl apply -f extras/prometheus/gmp/rules.yaml
```

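Simulate an outage

To make sure that your Prometheus probes, rules, and Alertmanager configuration are correct, test that alerts and notifications are sent when there's a problem. If you don't test this flow, you might not realize that your production services are having an outage when something goes wrong.

To simulate an outage of one of the microservices, scale the contacts Deployment to zero. With zero instances of the service, the Cymbal Bank sample application can't read contact information for customers: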
```
kubectl scale deployment contacts --replicas 0
```
GKE might take up to 5 minutes to scale down the Deployment.

Check the status of the Deployments in your cluster and verify that the contacts Deployment has scaled down correctly:
```
kubectl get deployments
```
In the following example output, the contacts Deployment has successfully scaled down to 0 instances:
```
NAME                 READY   UP-TO-DATE   AVAILABLE   AGE
balancereader        1/1     1            1           17m
blackbox-exporter    1/1     1            1           5m7s
contacts             0/0     0            0           17m
frontend             1/1     1            1           17m
ledgerwriter         1/1     1            1           17m
loadgenerator        1/1     1            1           17m
transactionhistory   1/1     1            1           17m
userservice          1/1     1            1           17m
```
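
Rather than re-running kubectl get deployments by hand, you could poll until the Deployment reaches the replica count that you expect. The following is a sketch of one way to do that, assuming that kubectl targets the namespace that contains the Deployment. The helper name wait_for_replicas, the 5-second poll interval, and the default timeout are illustrative choices.

```shell
# Poll a Deployment until its ready replica count equals the wanted count.
# $1: Deployment name, $2: wanted replicas, $3: timeout in seconds (default 300).
wait_for_replicas() {
  wfr_deadline=$(( $(date +%s) + ${3:-300} ))
  while :; do
    # .status.readyReplicas is omitted from the object when it is zero,
    # so an empty result is treated as 0.
    wfr_current="$(kubectl get deployment "$1" -o jsonpath='{.status.readyReplicas}')"
    if [ "${wfr_current:-0}" -eq "$2" ]; then
      return 0
    fi
    if [ "$(date +%s)" -ge "$wfr_deadline" ]; then
      echo "timed out waiting for $1 to reach $2 ready replicas" >&2
      return 1
    fi
    sleep 5
  done
}
```

For example, wait_for_replicas contacts 0 returns once no contacts Pods are ready. The same helper works with a count of 1 when you scale the Deployment back up.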

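After the contacts Deployment has scaled down to zero, the Prometheus probe for it no longer receives a successful HTTP response and reports an error. Because the ContactsUnavailable rule sets for: 1m, the failure must persist for a minute before the alert fires and is passed to Alertmanager for processing.

Check your Slack workspace channel for an outage notification message with text similar to the following example: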
```
[FIRING:1] ContactsUnavailable
Severity: Warning :warning:
Summary: Contacts Service is unavailable
Namespace: default
Check Contacts pods and its logs
```
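In a real outage scenario, you'd start troubleshooting and restoring services after you receive the notification in Slack. For this tutorial, simulate this and restore the contacts Deployment by scaling the replicas back up: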
```
kubectl scale deployment contacts --replicas 1
```
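It might take up to 5 minutes for the Deployment to scale and for the Prometheus probe to receive an HTTP 200 response. You can check the status of the Deployments with the kubectl get deployments command.

When a healthy response to the Prometheus probe is received, Alertmanager clears the event. You should see an alert resolution notification message in your Slack workspace channel, similar to the following example: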
```
[RESOLVED] ContactsUnavailable
Severity: Warning :warning:
Summary: Contacts Service is unavailable
Namespace: default
Check Contacts pods and its logs
```

Clean up