Set up application observability on GKE using Prometheus


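This tutorial shows how to use open source Prometheus to set up liveness probes for application microservices deployed to Google Kubernetes Engine (GKE).

This tutorial uses open source Prometheus. However, every GKE Autopilot cluster automatically deploys Managed Service for Prometheus, Google Cloud's fully managed, multi-cloud, cross-project solution for Prometheus metrics. With Managed Service for Prometheus, you can monitor and alert on your workloads globally using Prometheus, without having to manually manage and operate Prometheus at scale.

You can also use open source tools such as Grafana to visualize the metrics that Prometheus collects.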

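Objectives

  • Create a cluster.
  • Deploy Prometheus.
  • Deploy the sample application, Bank of Anthos.
  • Configure Prometheus liveness probes.
  • Configure Prometheus alerts.
  • Configure Alertmanager to get notified in a Slack channel.
  • Simulate an outage to test Prometheus.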

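Costs

This document uses billable components of Google Cloud.

To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

When you finish the tasks described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.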

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, click Create project to begin creating a new Google Cloud project.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Enable the GKE API.

    Enable the API

  5. Install Helm.

Set up your environment

    In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud.

    1. Set the default project and region:

      gcloud config set project PROJECT_ID
      gcloud config set compute/region COMPUTE_REGION
      

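      Replace the following:

      • PROJECT_ID: your Google Cloud project ID.
      • COMPUTE_REGION: the Compute Engine region for the cluster. This tutorial uses us-central1. Typically, you want a region that is close to you.
    2. Clone the sample repository used in this tutorial: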

      git clone https://github.com/GoogleCloudPlatform/bank-of-anthos.git
      cd bank-of-anthos/
      
    3. Create a cluster:

      gcloud container clusters create-auto CLUSTER_NAME \
          --release-channel=CHANNEL_NAME \
          --region=COMPUTE_REGION
      

      Replace the following:

      • CLUSTER_NAME: a name for the new cluster.
      • CHANNEL_NAME: the name of a release channel.

    Deploy Prometheus

    Install Prometheus using the example Helm chart:

    helm repo add bitnami https://charts.bitnami.com/bitnami
    helm install tutorial bitnami/kube-prometheus \
        --version 8.2.2 \
        --values extras/prometheus/oss/values.yaml \
        --wait
    

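    This command installs Prometheus with the following components:

    • Prometheus Operator: a popular way to deploy and configure open source Prometheus.
    • Alertmanager: handles alerts sent by the Prometheus server and routes them to applications such as Slack.
    • Blackbox exporter: lets Prometheus probe endpoints over HTTP, HTTPS, DNS, TCP, ICMP, and gRPC.

    Deploy Bank of Anthos

    Deploy the Bank of Anthos sample application: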

    kubectl apply -f extras/jwt/jwt-secret.yaml
    kubectl apply -f kubernetes-manifests
    

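    Slack notifications

    To set up Slack notifications, you must create a Slack application, activate incoming webhooks for the application, and install the application to a Slack workspace.

    Create the Slack application

    1. Join a Slack workspace, either by registering with your email address or by using an invitation sent by a workspace admin.

    2. Sign in to Slack using your workspace name and your Slack account credentials.

    3. Create a new Slack app:

      1. In the Create an app dialog, click From scratch.
      2. Specify an app name and choose your Slack workspace.
      3. Click Create App.
      4. Under Add features and functionality, click Incoming Webhooks.
      5. Click the Activate Incoming Webhooks toggle.
      6. In the Webhook URLs for Your Workspace section, click Add New Webhook to Workspace.
      7. On the authorization page that opens, select a channel to receive notifications.
      8. Click Allow.
      9. A webhook for your Slack application is displayed in the Webhook URLs for Your Workspace section. Save the URL for later.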
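
Before you hand the webhook to Alertmanager, it can help to confirm that the URL accepts messages. The following is a minimal sketch, not part of the tutorial's repository: the helper names slack_payload and send_slack_test are hypothetical, and the payload builder assumes that the message contains no double quotes or backslashes, because it does no JSON escaping.

```shell
# Hypothetical helper: build the JSON body for a Slack incoming webhook.
# Assumption: the message has no double quotes or backslashes (no JSON escaping).
slack_payload() {
  printf '{"text":"%s"}' "$1"
}

# Hypothetical helper: POST a test message to the webhook URL given as $1.
send_slack_test() {
  local webhook_url="$1" message="$2"
  curl -sS -X POST -H 'Content-Type: application/json' \
    --data "$(slack_payload "$message")" "$webhook_url"
}

# Show the request body without sending anything.
slack_payload "Test message from the GKE tutorial"
echo
```

To send a real message, call send_slack_test with the saved webhook URL as the first argument and the text as the second. If the webhook is active, the message should appear in the channel that you selected.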

    Configure Alertmanager

    Create a Kubernetes Secret to store the webhook URL, and then apply the Alertmanager configuration:

    kubectl create secret generic alertmanager-slack-webhook --from-literal webhookURL=SLACK_WEBHOOK_URL
    kubectl apply -f extras/prometheus/oss/alertmanagerconfig.yaml
    

    Replace SLACK_WEBHOOK_URL with the webhook URL from the previous section.
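
Kubernetes stores Secret values base64-encoded, which is an encoding and not encryption, so anyone who can read the Secret can recover the webhook URL. The sketch below round-trips a placeholder value to show what that encoding looks like. The sample URL is made up for illustration and is not a real webhook.

```shell
# Placeholder value for illustration only; not a real webhook URL.
sample_url='https://hooks.slack.com/services/EXAMPLE'

# Encode the value the way it appears under .data in a Secret.
encoded="$(printf '%s' "$sample_url" | base64 | tr -d '\n')"

# Decode it again to recover the original value.
decoded="$(printf '%s' "$encoded" | base64 --decode)"

echo "encoded: $encoded"
echo "decoded: $decoded"
```

To inspect the value stored in the cluster, one option is to run kubectl get secret alertmanager-slack-webhook -o jsonpath='{.data.webhookURL}' and pipe the result to base64 --decode.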

    Configure Prometheus

    1. Review the following manifest:

      # Copyright 2023 Google LLC
      #
      # Licensed under the Apache License, Version 2.0 (the "License");
      # you may not use this file except in compliance with the License.
      # You may obtain a copy of the License at
      #
      #      http://www.apache.org/licenses/LICENSE-2.0
      #
      # Unless required by applicable law or agreed to in writing, software
      # distributed under the License is distributed on an "AS IS" BASIS,
      # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      # See the License for the specific language governing permissions and
      # limitations under the License.
      ---
      apiVersion: monitoring.coreos.com/v1
      kind: Probe
      metadata:
        name: frontend-probe
      spec:
        jobName: frontend
        prober:
          url: tutorial-kube-prometheus-blackbox-exporter:19115
          path: /probe
        module: http_2xx
        interval: 60s
        scrapeTimeout: 30s
        targets:
          staticConfig:
            labels:
              app: bank-of-anthos
            static:
              - frontend:80
      ---
      apiVersion: monitoring.coreos.com/v1
      kind: Probe
      metadata:
        name: userservice-probe
      spec:
        jobName: userservice
        prober:
          url: tutorial-kube-prometheus-blackbox-exporter:19115
          path: /probe
        module: http_2xx
        interval: 60s
        scrapeTimeout: 30s
        targets:
          staticConfig:
            labels:
              app: bank-of-anthos
            static:
              - userservice:8080/ready
      ---
      apiVersion: monitoring.coreos.com/v1
      kind: Probe
      metadata:
        name: balancereader-probe
      spec:
        jobName: balancereader
        prober:
          url: tutorial-kube-prometheus-blackbox-exporter:19115
          path: /probe
        module: http_2xx
        interval: 60s
        scrapeTimeout: 30s
        targets:
          staticConfig:
            labels:
              app: bank-of-anthos
            static:
              - balancereader:8080/ready
      ---
      apiVersion: monitoring.coreos.com/v1
      kind: Probe
      metadata:
        name: contacts-probe
      spec:
        jobName: contacts
        prober:
          url: tutorial-kube-prometheus-blackbox-exporter:19115
          path: /probe
        module: http_2xx
        interval: 60s
        scrapeTimeout: 30s
        targets:
          staticConfig:
            labels:
              app: bank-of-anthos
            static:
              - contacts:8080/ready
      ---
      apiVersion: monitoring.coreos.com/v1
      kind: Probe
      metadata:
        name: ledgerwriter-probe
      spec:
        jobName: ledgerwriter
        prober:
          url: tutorial-kube-prometheus-blackbox-exporter:19115
          path: /probe
        module: http_2xx
        interval: 60s
        scrapeTimeout: 30s
        targets:
          staticConfig:
            labels:
              app: bank-of-anthos
            static:
              - ledgerwriter:8080/ready
      ---
      apiVersion: monitoring.coreos.com/v1
      kind: Probe
      metadata:
        name: transactionhistory-probe
      spec:
        jobName: transactionhistory
        prober:
          url: tutorial-kube-prometheus-blackbox-exporter:19115
          path: /probe
        module: http_2xx
        interval: 60s
        scrapeTimeout: 30s
        targets:
          staticConfig:
            labels:
              app: bank-of-anthos
            static:
              - transactionhistory:8080/ready
      

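      This manifest describes Prometheus liveness probes and includes the following fields:

      • spec.jobName: the job name assigned to the scraped metrics.
      • spec.prober.url: the Service URL of the blackbox exporter. This includes the default port of the blackbox exporter defined in the Helm chart.
      • spec.prober.path: the metrics collection path.
      • spec.targets.staticConfig.labels: the labels assigned to all metrics scraped from the targets.
      • spec.targets.staticConfig.static: the list of hosts to probe.
    2. Apply the manifest to your cluster: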

      kubectl apply -f extras/prometheus/oss/probes.yaml
      
    3. Review the following manifest:

      ---
      apiVersion: monitoring.coreos.com/v1
      kind: PrometheusRule
      metadata:
        name: uptime-rule
      spec:
        groups:
        - name: Micro services uptime
          interval: 60s
          rules:
          - alert: BalancereaderUnavaiable
            expr: probe_success{app="bank-of-anthos",job="balancereader"} == 0
            for: 1m
            annotations:
              summary: Balance Reader Service is unavailable
              description: Check Balance Reader pods and it's logs
            labels:
              severity: 'critical'
          - alert: ContactsUnavaiable
            expr: probe_success{app="bank-of-anthos",job="contacts"} == 0
            for: 1m
            annotations:
              summary: Contacs Service is unavailable
              description: Check Contacs pods and it's logs
            labels:
              severity: 'warning'
          - alert: FrontendUnavaiable
            expr: probe_success{app="bank-of-anthos",job="frontend"} == 0
            for: 1m
            annotations:
              summary: Frontend Service is unavailable
              description: Check Frontend pods and it's logs
            labels:
              severity: 'critical'
          - alert: LedgerwriterUnavaiable
            expr: probe_success{app="bank-of-anthos",job="ledgerwriter"} == 0
            for: 1m
            annotations:
              summary: Ledger Writer Service is unavailable
              description: Check Ledger Writer pods and it's logs
            labels:
              severity: 'critical'
          - alert: TransactionhistoryUnavaiable
            expr: probe_success{app="bank-of-anthos",job="transactionhistory"} == 0
            for: 1m
            annotations:
              summary: Transaction History Service is unavailable
              description: Check Transaction History pods and it's logs
            labels:
              severity: 'critical'
          - alert: UserserviceUnavaiable
            expr: probe_success{app="bank-of-anthos",job="userservice"} == 0
            for: 1m
            annotations:
              summary: User Service is unavailable
              description: Check User Service pods and it's logs
            labels:
              severity: 'critical'
      

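      This manifest describes a PrometheusRule and includes the following fields:

      • spec.groups.[*].name: the name of the rule group.
      • spec.groups.[*].interval: how often the rules in the group are evaluated.
      • spec.groups.[*].rules[*].alert: the name of the alert.
      • spec.groups.[*].rules[*].expr: the PromQL expression to evaluate.
      • spec.groups.[*].rules[*].for: how long the expression must keep returning results before the alert is considered firing.
      • spec.groups.[*].rules[*].annotations: a list of annotations to add to each alert. This is valid only for alerting rules.
      • spec.groups.[*].rules[*].labels: the labels to add or overwrite.
    4. Apply the manifest to your cluster: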

      kubectl apply -f extras/prometheus/oss/rules.yaml
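
The Probe fields described earlier map onto a single HTTP request to the blackbox exporter, with the module and the target passed as query parameters. The sketch below prints that request URL so that you can try a probe by hand. The helper name probe_url is hypothetical, and localhost:19115 assumes that you forward the exporter's port to your machine yourself.

```shell
# Hypothetical helper: print the probe URL built from the Probe fields.
#   $1 = prober address (spec.prober.url, or a local port-forward)
#   $2 = module (spec.module)
#   $3 = target (an entry from spec.targets.staticConfig.static)
# The /probe path matches spec.prober.path in the manifest.
probe_url() {
  printf 'http://%s/probe?module=%s&target=%s\n' "$1" "$2" "$3"
}

# URL for the userservice probe, as seen through a local port-forward.
probe_url "localhost:19115" "http_2xx" "userservice:8080/ready"
```

With kubectl port-forward svc/tutorial-kube-prometheus-blackbox-exporter 19115:19115 running in a second terminal (the Service name is taken from spec.prober.url and is an assumption about your install), passing the printed URL to curl -s should return the probe metrics. A probe_success value of 1 means the target answered as the http_2xx module expects.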
      

    Simulate an outage

    1. Simulate an outage by scaling the contacts Deployment to zero:

      kubectl scale deployment contacts --replicas 0
      

      You should see a notification message in your Slack workspace channel. GKE might take up to 5 minutes to scale the Deployment.

    2. Restore the contacts Deployment:

      kubectl scale deployment contacts --replicas 1
      

      You should see an alert resolution notification message in your Slack workspace channel. GKE might take up to 5 minutes to scale the Deployment.
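
Part of the delay between scaling the Deployment and the Slack message comes from how the alerting rule is evaluated. The following is a simplified model, not Prometheus code: it assumes one probe_success sample per rule evaluation, and it uses the 60-second interval and the 1-minute for value from the rule manifest. The function name alert_states is hypothetical.

```shell
# Simplified model of an alerting rule with a "for" duration.
#   $1 = evaluation interval in seconds
#   $2 = "for" duration in seconds
#   remaining arguments = probe_success samples (1 = up, 0 = down), one per evaluation
alert_states() {
  local interval="$1" hold="$2"
  shift 2
  local active_at=-1 t=0 s
  for s in "$@"; do
    if [ "$s" -eq 0 ]; then
      # The expression probe_success == 0 matches: start or continue the alert.
      if [ "$active_at" -lt 0 ]; then
        active_at=$t
      fi
      if [ $((t - active_at)) -ge "$hold" ]; then
        echo "t=${t}s firing"
      else
        echo "t=${t}s pending"
      fi
    else
      # The probe succeeded again: the alert is no longer active.
      active_at=-1
      echo "t=${t}s inactive"
    fi
    t=$((t + interval))
  done
}

# One healthy sample, three failed samples, then recovery.
alert_states 60 60 1 0 0 0 1
```

In this model the alert is pending at the first failed evaluation (t=60s), firing one interval later (t=120s), and inactive again once the probe succeeds (t=240s). Probe timing, scrape timeouts, and Alertmanager's own grouping delays add to this in a real cluster.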

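    Clean up

    To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

    Delete the project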

    1. In the Google Cloud console, go to the Manage resources page.

      Go to Manage resources

    2. In the project list, select the project that you want to delete, and then click Delete.
    3. In the dialog, type the project ID, and then click Shut down to delete the project.

    Delete the individual resources

    1. Delete the Kubernetes resources:

      kubectl delete -f kubernetes-manifests
      
    2. Uninstall Prometheus:

      helm uninstall tutorial
      
    3. Delete the GKE cluster:

      gcloud container clusters delete CLUSTER_NAME --quiet
      

    What's next