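Set up application observability with Prometheus on GKE

This tutorial shows how to set up liveness probes for application microservices deployed to Google Kubernetes Engine (GKE) by using open source Prometheus.

This tutorial uses open source Prometheus. However, every GKE Autopilot cluster automatically deploys Managed Service for Prometheus, Google Cloud's fully managed, multi-cloud, cross-project solution for Prometheus metrics. With Managed Service for Prometheus, you can globally monitor and alert on your workloads by using Prometheus, without having to manually manage and operate Prometheus at scale.

You can also use open source tools such as Grafana to visualize the metrics that Prometheus collects.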

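Prepare the environment

In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud.

Set the default project and region for the gcloud CLI: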
```
gcloud config set project PROJECT_ID
gcloud config set compute/region CONTROL_PLANE_LOCATION
```
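Replace the following:
- PROJECT_ID: your Google Cloud project ID.
- CONTROL_PLANE_LOCATION: the Compute Engine region of the cluster's control plane. This tutorial uses us-central1. In general, choose a region that is close to you.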

Clone the sample repository used in this tutorial:
```
git clone https://github.com/GoogleCloudPlatform/bank-of-anthos.git
cd bank-of-anthos/
```

Create the cluster:
```
gcloud container clusters create-auto CLUSTER_NAME \
    --release-channel=CHANNEL_NAME \
    --location=CONTROL_PLANE_LOCATION
```
Replace the following:
- CLUSTER_NAME: a name for the new cluster.
- CHANNEL_NAME: the name of a release channel.

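Deploy Prometheus

Install Prometheus by using the sample Helm chart:
```
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install tutorial bitnami/kube-prometheus \
    --version 8.2.2 \
    --values extras/prometheus/oss/values.yaml \
    --wait
```
This command installs Prometheus with the following components:
- Prometheus Operator: a popular way to deploy and configure open source Prometheus.
- Alertmanager: handles alerts sent by the Prometheus server and routes them to applications such as Slack.
- Blackbox exporter: lets Prometheus probe endpoints over HTTP, HTTPS, DNS, TCP, ICMP, and gRPC.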

Deploy Bank of Anthos

Deploy the Bank of Anthos sample application:
```
kubectl apply -f extras/jwt/jwt-secret.yaml
kubectl apply -f kubernetes-manifests
```

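Slack notifications

To set up Slack notifications, you must create a Slack app, activate incoming webhooks for the app, and install the app to a Slack workspace.

Create the Slack app

Join a Slack workspace, either by registering with your email address or by using an invitation sent by a workspace admin.

Note: If you aren't an admin of the Slack workspace, you might need approval from a workspace admin before you can deploy your app to the workspace.

Sign in to Slack by using your workspace name and your Slack account credentials.
Create a new Slack app:
1. In the Create an app dialog, click From scratch.
2. Specify an App Name and choose your Slack workspace.
3. Click Create App.
4. Under Add features and functionality, click Incoming Webhooks.
5. Click the Activate Incoming Webhooks toggle.
6. In the Webhook URLs for Your Workspace section, click Add New Webhook to Workspace.
7. On the authorization page that opens, select a channel to receive notifications.
8. Click Allow.
9. A webhook for your Slack app is displayed in the Webhook URLs for Your Workspace section. Save the URL for later.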

Configure Alertmanager

Create a Kubernetes Secret to store the webhook URL, and then apply the Alertmanager configuration:
```
kubectl create secret generic alertmanager-slack-webhook --from-literal webhookURL=SLACK_WEBHOOK_URL
kubectl apply -f extras/prometheus/oss/alertmanagerconfig.yaml
```
Replace SLACK_WEBHOOK_URL with the webhook URL from the previous section.
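
Before relying on Alertmanager, it can help to confirm that the webhook itself accepts messages. The following sketch posts a test message to a Slack incoming webhook with curl. It assumes that SLACK_WEBHOOK_URL is exported in your shell. The slack_payload and send_test_message functions are hypothetical helpers that aren't part of the tutorial repository, and the payload builder does no JSON escaping, so keep the test text free of quotes and backslashes.

```shell
#!/usr/bin/env bash
# Build the minimal JSON body that a Slack incoming webhook accepts.
# Note: no JSON escaping is done, so avoid quotes and backslashes in the text.
slack_payload() {
  printf '{"text":"%s"}' "$1"
}

# Post a test message to the webhook.
# Assumption: SLACK_WEBHOOK_URL is set in the environment.
send_test_message() {
  curl --silent --show-error --fail \
    --request POST \
    --header 'Content-type: application/json' \
    --data "$(slack_payload "$1")" \
    "${SLACK_WEBHOOK_URL:?Set SLACK_WEBHOOK_URL first}"
}

# Example (not run here):
# send_test_message "Test message from the Prometheus tutorial"
```

If the webhook is valid, the message appears in the channel that you selected when you created the webhook.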

Configure Prometheus

Review the following manifest:

```
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
---
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: frontend-probe
spec:
  jobName: frontend
  prober:
    url: tutorial-kube-prometheus-blackbox-exporter:19115
    path: /probe
  module: http_2xx
  interval: 60s
  scrapeTimeout: 30s
  targets:
    staticConfig:
      labels:
        app: bank-of-anthos
      static:
        - frontend:80
---
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: userservice-probe
spec:
  jobName: userservice
  prober:
    url: tutorial-kube-prometheus-blackbox-exporter:19115
    path: /probe
  module: http_2xx
  interval: 60s
  scrapeTimeout: 30s
  targets:
    staticConfig:
      labels:
        app: bank-of-anthos
      static:
        - userservice:8080/ready
---
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: balancereader-probe
spec:
  jobName: balancereader
  prober:
    url: tutorial-kube-prometheus-blackbox-exporter:19115
    path: /probe
  module: http_2xx
  interval: 60s
  scrapeTimeout: 30s
  targets:
    staticConfig:
      labels:
        app: bank-of-anthos
      static:
        - balancereader:8080/ready
---
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: contacts-probe
spec:
  jobName: contacts
  prober:
    url: tutorial-kube-prometheus-blackbox-exporter:19115
    path: /probe
  module: http_2xx
  interval: 60s
  scrapeTimeout: 30s
  targets:
    staticConfig:
      labels:
        app: bank-of-anthos
      static:
        - contacts:8080/ready
---
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: ledgerwriter-probe
spec:
  jobName: ledgerwriter
  prober:
    url: tutorial-kube-prometheus-blackbox-exporter:19115
    path: /probe
  module: http_2xx
  interval: 60s
  scrapeTimeout: 30s
  targets:
    staticConfig:
      labels:
        app: bank-of-anthos
      static:
        - ledgerwriter:8080/ready
---
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: transactionhistory-probe
spec:
  jobName: transactionhistory
  prober:
    url: tutorial-kube-prometheus-blackbox-exporter:19115
    path: /probe
  module: http_2xx
  interval: 60s
  scrapeTimeout: 30s
  targets:
    staticConfig:
      labels:
        app: bank-of-anthos
      static:
        - transactionhistory:8080/ready
```

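This manifest describes Prometheus liveness probes and includes the following fields:
- spec.jobName: the job name assigned to the scraped metrics.
- spec.prober.url: the Service URL of the blackbox exporter, including the blackbox exporter's default port as defined in the Helm chart.
- spec.prober.path: the metrics collection path.
- spec.targets.staticConfig.labels: the labels assigned to all metrics scraped from the targets.
- spec.targets.staticConfig.static: the list of hosts to probe.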
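
To see how these fields fit together: for each entry in static, Prometheus scrapes the blackbox exporter at spec.prober.url and spec.prober.path, and passes the module and the target as query parameters. The following shell sketch only assembles that request URL for illustration. The probe_url function is a hypothetical helper that isn't part of the tutorial repository, and a real request URL-encodes the parameter values.

```shell
#!/usr/bin/env bash
# Values copied from the Probe manifest.
PROBER_URL="tutorial-kube-prometheus-blackbox-exporter:19115"  # spec.prober.url
PROBER_PATH="/probe"                                           # spec.prober.path
MODULE="http_2xx"                                              # spec.module

# Hypothetical helper: print the blackbox exporter URL for one target.
# $1 = one entry from spec.targets.staticConfig.static
probe_url() {
  printf 'http://%s%s?module=%s&target=%s\n' \
    "$PROBER_URL" "$PROBER_PATH" "$MODULE" "$1"
}

probe_url "frontend:80"
probe_url "contacts:8080/ready"
```

Requesting such a URL from a Pod inside the cluster returns the metrics for that probe, including probe_success, which the alerting rules in this tutorial evaluate.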

Apply the manifest to your cluster:
```
kubectl apply -f extras/prometheus/oss/probes.yaml
```
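
The six Probe resources in probes.yaml differ only in their name and target. As an illustration of that pattern, and not as a replacement for the file in the repository, the following sketch renders an equivalent Probe for a given Service by using a shell here-document. The render_probe function is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a Probe manifest for one Service.
# $1 = job name (also used as the Probe name prefix)
# $2 = target in host:port[/path] form
render_probe() {
  cat <<EOF
---
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: $1-probe
spec:
  jobName: $1
  prober:
    url: tutorial-kube-prometheus-blackbox-exporter:19115
    path: /probe
  module: http_2xx
  interval: 60s
  scrapeTimeout: 30s
  targets:
    staticConfig:
      labels:
        app: bank-of-anthos
      static:
        - $2
EOF
}

# The frontend is probed on port 80; the other services expose /ready on 8080.
render_probe frontend frontend:80
for svc in userservice balancereader contacts ledgerwriter transactionhistory; do
  render_probe "${svc}" "${svc}:8080/ready"
done
```

Piping the output to `kubectl apply -f -` would create the same kind of resources as the file in the repository.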

Review the following manifest:

```
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: uptime-rule
spec:
  groups:
  - name: Micro services uptime
    interval: 60s
    rules:
    - alert: BalancereaderUnavaiable
      expr: probe_success{app="bank-of-anthos",job="balancereader"} == 0
      for: 1m
      annotations:
        summary: Balance Reader Service is unavailable
        description: Check Balance Reader pods and it's logs
      labels:
        severity: 'critical'
    - alert: ContactsUnavaiable
      expr: probe_success{app="bank-of-anthos",job="contacts"} == 0
      for: 1m
      annotations:
        summary: Contacs Service is unavailable
        description: Check Contacs pods and it's logs
      labels:
        severity: 'warning'
    - alert: FrontendUnavaiable
      expr: probe_success{app="bank-of-anthos",job="frontend"} == 0
      for: 1m
      annotations:
        summary: Frontend Service is unavailable
        description: Check Frontend pods and it's logs
      labels:
        severity: 'critical'
    - alert: LedgerwriterUnavaiable
      expr: probe_success{app="bank-of-anthos",job="ledgerwriter"} == 0
      for: 1m
      annotations:
        summary: Ledger Writer Service is unavailable
        description: Check Ledger Writer pods and it's logs
      labels:
        severity: 'critical'
    - alert: TransactionhistoryUnavaiable
      expr: probe_success{app="bank-of-anthos",job="transactionhistory"} == 0
      for: 1m
      annotations:
        summary: Transaction History Service is unavailable
        description: Check Transaction History pods and it's logs
      labels:
        severity: 'critical'
    - alert: UserserviceUnavaiable
      expr: probe_success{app="bank-of-anthos",job="userservice"} == 0
      for: 1m
      annotations:
        summary: User Service is unavailable
        description: Check User Service pods and it's logs
      labels:
        severity: 'critical'
```

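This manifest describes a PrometheusRule and includes the following fields:
- spec.groups[*].name: the name of the rule group.
- spec.groups[*].interval: how often the rules in the group are evaluated.
- spec.groups[*].rules[*].alert: the name of the alert.
- spec.groups[*].rules[*].expr: the PromQL expression to evaluate.
- spec.groups[*].rules[*].for: how long the expression must keep returning results before the alert is considered firing.
- spec.groups[*].rules[*].annotations: a list of annotations to add to each alert. This field is valid only for alerting rules.
- spec.groups[*].rules[*].labels: the labels to add or overwrite.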
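
The interaction between interval and for can be illustrated with a small simulation. This is a simplified model of the pending-then-firing behavior, assuming one evaluation per 60-second interval and a for value of one minute, as in the manifest. The alert_states function and the sample values are hypothetical and aren't part of Prometheus.

```shell
#!/usr/bin/env bash
INTERVAL=60   # seconds between rule evaluations (spec.groups[*].interval)
FOR=60        # seconds the condition must hold (rules[*].for: 1m)

# Print the alert state after each evaluation, given one probe_success
# sample per evaluation (1 = probe succeeded, 0 = probe failed).
alert_states() {
  local active_for=-1 sample   # -1 means the alert condition is not active
  for sample in "$@"; do
    if [ "$sample" -eq 0 ]; then
      # The expression probe_success == 0 returns a result.
      if [ "$active_for" -lt 0 ]; then
        active_for=0
      else
        active_for=$((active_for + INTERVAL))
      fi
      if [ "$active_for" -ge "$FOR" ]; then
        echo firing
      else
        echo pending
      fi
    else
      # The expression returns nothing, so the alert resets.
      active_for=-1
      echo inactive
    fi
  done
}

# One healthy sample, two failed samples, then recovery.
# Prints inactive, pending, firing, inactive (one per line).
alert_states 1 0 0 1
```

In this model, a single failed probe only moves the alert to pending. The alert fires, and Alertmanager is notified, only if the next evaluation also fails.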

Apply the manifest to your cluster:
```
kubectl apply -f extras/prometheus/oss/rules.yaml
```

Simulate an outage

Simulate an outage by scaling the contacts Deployment to zero:
```
kubectl scale deployment contacts --replicas 0
```
You should see a notification message in your Slack workspace channel. GKE might take up to 5 minutes to scale the Deployment.
Restore the contacts Deployment:
```
kubectl scale deployment contacts --replicas 1
```
You should see an alert resolution notification message in your Slack workspace channel. GKE might take up to 5 minutes to scale the Deployment.
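
Because scaling can take several minutes, you might prefer to poll for readiness instead of watching the channel. The following sketch is a generic retry helper. The wait_until function is a hypothetical name, and the commented kubectl command is only one possible readiness check.

```shell
#!/usr/bin/env bash
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (not run here): poll every 10 seconds, up to 30 times (about
# 5 minutes), until the contacts Deployment reports one ready replica.
# wait_until 30 10 sh -c \
#   '[ "$(kubectl get deployment contacts -o jsonpath="{.status.readyReplicas}")" = "1" ]'
```

The helper returns a nonzero status if the command never succeeds, so a script can stop instead of waiting indefinitely for a notification.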