Deploy a highly available PostgreSQL database on GKE

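PostgreSQL is an open source object-relational database known for reliability and data integrity. It is ACID-compliant and supports foreign keys, joins, views, triggers, and stored procedures.

This document is intended for database administrators, cloud architects, and operations professionals who want to deploy a highly available PostgreSQL topology on Google Kubernetes Engine (GKE).

Objectives

In this tutorial, you learn how to:

Create a regional GKE cluster by using Terraform.
Deploy a highly available PostgreSQL database.
Set up monitoring for the PostgreSQL application.
Perform PostgreSQL database and GKE cluster upgrades.
Simulate cluster disruption and PostgreSQL replica failover.
Perform backup and restore of the PostgreSQL database.

Architecture

This section describes the architecture of the solution that you build in this tutorial.

You provision two GKE clusters in different regions: a primary cluster and a backup cluster. In this tutorial, the primary cluster is in the us-central1 region and the backup cluster is in the us-west1 region. This architecture lets you provision a highly available PostgreSQL database and test disaster recovery, as described later in this tutorial.

For the source cluster, you use a Helm chart (bitnami/postgresql-ha) to set up a highly available PostgreSQL cluster.

**Figure 1**: Example architecture of a highly available PostgreSQL cluster.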

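Costs

In this document, you use the following billable components of Google Cloud:

To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

When you finish the tasks described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.

Before you begin

Set up your project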

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, click Create project to begin creating a new Google Cloud project.

Roles required to create a project

To create a project, you need the Project Creator (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the Google Kubernetes Engine, Backup for GKE, Artifact Registry, Compute Engine, and IAM APIs.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the APIs

Set up roles

Make sure that you have the following role or roles on the project: roles/storage.objectViewer, roles/logging.logWriter, roles/artifactregistry.admin, roles/container.clusterAdmin, roles/container.serviceAgent, roles/serviceusage.serviceUsageAdmin, roles/iam.serviceAccountAdmin

Check for the roles

In the Google Cloud console, go to the IAM page.
Go to IAM
Select the project.
In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.

Grant the roles

In the Google Cloud console, go to the IAM page.
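Go to IAM
Select the project.
Click Grant access.
In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
In the Select a role list, select a role.
To grant additional roles, click Add another role and add each additional role.
Click Save.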

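Set up your environment

In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software that you need for this tutorial, including Docker, kubectl, the gcloud CLI, Helm, and Terraform.

To set up your environment with Cloud Shell, follow these steps:

Launch a Cloud Shell session from the Google Cloud console by clicking Activate Cloud Shell. This launches a session in the bottom pane of the Google Cloud console.

Set environment variables.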

export PROJECT_ID=PROJECT_ID
export SOURCE_CLUSTER=cluster-db1
export REGION=us-central1

Replace the following value:

PROJECT_ID: your Google Cloud project ID.

Set the default project for the gcloud CLI.
```
gcloud config set project PROJECT_ID
```
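Later commands expand $PROJECT_ID, $SOURCE_CLUSTER, and $REGION, so an unset variable tends to surface only as a confusing CLI error. The following is a minimal bash sketch of a guard that you could run first. The helper name require_vars is hypothetical and is not part of the sample repository.

```shell
#!/usr/bin/env bash
# require_vars: hypothetical helper, not part of the sample repository.
# Returns non-zero if any of the named variables is unset or empty.
require_vars() {
  local missing=0 name
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name.
    if [ -z "${!name:-}" ]; then
      echo "missing environment variable: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}
```

For example, `require_vars PROJECT_ID SOURCE_CLUSTER REGION && echo "environment looks complete"` prints the confirmation only when all three variables are set.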

Clone the code repository.

git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples

Change to the working directory.

cd kubernetes-engine-samples/databases/gke-stateful-postgres

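Create your cluster infrastructure

In this section, you run a Terraform script to create a custom Virtual Private Cloud (VPC), an Artifact Registry repository to store PostgreSQL images, and two regional GKE clusters. One cluster is deployed in us-central1, and the second cluster, used for backup, is deployed in us-west1.

To create the clusters, run the following commands:

Autopilot

In Cloud Shell, run the following commands: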

terraform -chdir=terraform/gke-autopilot init
terraform -chdir=terraform/gke-autopilot apply -var project_id=$PROJECT_ID

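When prompted, type yes.

Understand the Terraform configuration

The Terraform configuration files create the following resources to deploy your infrastructure:

Create an Artifact Registry repository to store the Docker images.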

resource "google_artifact_registry_repository" "main" {
  location      = "us"
  repository_id = "main"
  format        = "DOCKER"
  project       = var.project_id
}

Create the VPC network and subnets for the VMs' network interfaces.

module "gcp-network" {
  source  = "terraform-google-modules/network/google"
  version = "< 8.0.0"

  project_id   = var.project_id
  network_name = "vpc-gke-postgresql"

  subnets = [
    {
      subnet_name           = "snet-gke-postgresql-us-central1"
      subnet_ip             = "10.0.0.0/17"
      subnet_region         = "us-central1"
      subnet_private_access = true
    },
    {
      subnet_name           = "snet-gke-postgresql-us-west1"
      subnet_ip             = "10.0.128.0/17"
      subnet_region         = "us-west1"
      subnet_private_access = true
    },
  ]

  secondary_ranges = {
    ("snet-gke-postgresql-us-central1") = [
      {
        range_name    = "ip-range-pods-db1"
        ip_cidr_range = "192.168.0.0/18"
      },
      {
        range_name    = "ip-range-svc-db1"
        ip_cidr_range = "192.168.64.0/18"
      },
    ],
    ("snet-gke-postgresql-us-west1") = [
      {
        range_name    = "ip-range-pods-db2"
        ip_cidr_range = "192.168.128.0/18"
      },
      {
        range_name    = "ip-range-svc-db2"
        ip_cidr_range = "192.168.192.0/18"
      },
    ]
  }
}

output "network_name" {
  value = module.gcp-network.network_name
}

output "primary_subnet_name" {
  value = module.gcp-network.subnets_names[0]
}

output "secondary_subnet_name" {
  value = module.gcp-network.subnets_names[1]
}

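Create the primary GKE cluster.

Terraform creates a private cluster in the us-central1 region and enables Backup for GKE for disaster recovery and Managed Service for Prometheus for cluster monitoring.

Managed Service for Prometheus is supported only on Autopilot clusters that run GKE version 1.25 or later.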

module "gke-db1-autopilot" {
  source                          = "../modules/beta-autopilot-private-cluster"
  project_id                      = var.project_id
  name                            = "cluster-db1"
  kubernetes_version              = "1.25" # Will be ignored if use "REGULAR" release_channel
  region                          = "us-central1"
  regional                        = true
  zones                           = ["us-central1-a", "us-central1-b", "us-central1-c"]
  network                         = module.network.network_name
  subnetwork                      = module.network.primary_subnet_name
  ip_range_pods                   = "ip-range-pods-db1"
  ip_range_services               = "ip-range-svc-db1"
  horizontal_pod_autoscaling      = true
  release_channel                 = "RAPID" # Default version is 1.22 in REGULAR. GMP on Autopilot requires V1.25 via var.kubernetes_version
  enable_vertical_pod_autoscaling = true
  enable_private_endpoint         = false
  enable_private_nodes            = true
  master_ipv4_cidr_block          = "172.16.0.0/28"
  create_service_account          = false
}

Create a backup cluster in the us-west1 region for disaster recovery.

module "gke-db2-autopilot" {
  source                          = "../modules/beta-autopilot-private-cluster"
  project_id                      = var.project_id
  name                            = "cluster-db2"
  kubernetes_version              = "1.25" # Will be ignored if use "REGULAR" release_channel
  region                          = "us-west1"
  regional                        = true
  zones                           = ["us-west1-a", "us-west1-b", "us-west1-c"]
  network                         = module.network.network_name
  subnetwork                      = module.network.secondary_subnet_name
  ip_range_pods                   = "ip-range-pods-db2"
  ip_range_services               = "ip-range-svc-db2"
  horizontal_pod_autoscaling      = true
  release_channel                 = "RAPID" # Default version is 1.22 in REGULAR. GMP on Autopilot requires V1.25 via var.kubernetes_version
  enable_vertical_pod_autoscaling = true
  enable_private_endpoint         = false
  enable_private_nodes            = true
  master_ipv4_cidr_block          = "172.16.0.16/28"
  create_service_account          = false
}

Standard

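In Cloud Shell, run the following commands:

terraform -chdir=terraform/gke-standard init
terraform -chdir=terraform/gke-standard apply -var project_id=$PROJECT_ID

When prompted, type yes.

Understand the Terraform configuration

The Terraform configuration files create the following resources to deploy your infrastructure:

Create an Artifact Registry repository to store the Docker images.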

resource "google_artifact_registry_repository" "main" {
  location      = "us"
  repository_id = "main"
  format        = "DOCKER"
  project       = var.project_id
}
resource "google_artifact_registry_repository_iam_binding" "binding" {
  provider   = google-beta
  project    = google_artifact_registry_repository.main.project
  location   = google_artifact_registry_repository.main.location
  repository = google_artifact_registry_repository.main.name
  role       = "roles/artifactregistry.reader"
  members = [
    "serviceAccount:${module.gke-db1.service_account}",
  ]
}

Create the VPC network and subnets for the VMs' network interfaces.

module "gcp-network" {
  source  = "terraform-google-modules/network/google"
  version = "< 8.0.0"

  project_id   = var.project_id
  network_name = "vpc-gke-postgresql"

  subnets = [
    {
      subnet_name           = "snet-gke-postgresql-us-central1"
      subnet_ip             = "10.0.0.0/17"
      subnet_region         = "us-central1"
      subnet_private_access = true
    },
    {
      subnet_name           = "snet-gke-postgresql-us-west1"
      subnet_ip             = "10.0.128.0/17"
      subnet_region         = "us-west1"
      subnet_private_access = true
    },
  ]

  secondary_ranges = {
    ("snet-gke-postgresql-us-central1") = [
      {
        range_name    = "ip-range-pods-db1"
        ip_cidr_range = "192.168.0.0/18"
      },
      {
        range_name    = "ip-range-svc-db1"
        ip_cidr_range = "192.168.64.0/18"
      },
    ],
    ("snet-gke-postgresql-us-west1") = [
      {
        range_name    = "ip-range-pods-db2"
        ip_cidr_range = "192.168.128.0/18"
      },
      {
        range_name    = "ip-range-svc-db2"
        ip_cidr_range = "192.168.192.0/18"
      },
    ]
  }
}

output "network_name" {
  value = module.gcp-network.network_name
}

output "primary_subnet_name" {
  value = module.gcp-network.subnets_names[0]
}

output "secondary_subnet_name" {
  value = module.gcp-network.subnets_names[1]
}

Create the primary GKE cluster.

Terraform creates a private cluster in the us-central1 region and enables Backup for GKE for disaster recovery and Managed Service for Prometheus for cluster monitoring.

module "gke-db1" {
  source                   = "../modules/beta-private-cluster"
  project_id               = var.project_id
  name                     = "cluster-db1"
  regional                 = true
  region                   = "us-central1"
  network                  = module.network.network_name
  subnetwork               = module.network.primary_subnet_name
  ip_range_pods            = "ip-range-pods-db1"
  ip_range_services        = "ip-range-svc-db1"
  create_service_account   = true
  enable_private_endpoint  = false
  enable_private_nodes     = true
  master_ipv4_cidr_block   = "172.16.0.0/28"
  network_policy           = true
  cluster_autoscaling = {
    "autoscaling_profile": "OPTIMIZE_UTILIZATION",
    "enabled" : true,
    "gpu_resources" : [],
    "min_cpu_cores" : 36,
    "min_memory_gb" : 144,
    "max_cpu_cores" : 48,
    "max_memory_gb" : 192,
  }
  monitoring_enable_managed_prometheus = true
  gke_backup_agent_config = true

  node_pools = [
    {
      name            = "pool-sys"
      autoscaling     = true
      min_count       = 1
      max_count       = 3
      max_surge       = 1
      max_unavailable = 0
      machine_type    = "e2-standard-4"
      node_locations  = "us-central1-a,us-central1-b,us-central1-c"
      auto_repair     = true
    },
    {
      name            = "pool-db"
      autoscaling     = true
      max_surge       = 1
      max_unavailable = 0
      machine_type    = "e2-standard-8"
      node_locations  = "us-central1-a,us-central1-b,us-central1-c"
      auto_repair     = true
    },
  ]
  node_pools_labels = {
    all = {}
    pool-db = {
      "app.stateful/component" = "postgresql"
    }
    pool-sys = {
      "app.stateful/component" = "postgresql-pgpool"
    }
  }
  node_pools_taints = {
    all = []
    pool-db = [
      {
        key    = "app.stateful/component"
        value  = "postgresql"
        effect = "NO_SCHEDULE"
      },
    ],
    pool-sys = [
      {
        key    = "app.stateful/component"
        value  = "postgresql-pgpool"
        effect = "NO_SCHEDULE"
      },
    ],
  }
  gce_pd_csi_driver = true
}

Create a backup cluster in the us-west1 region for disaster recovery.

module "gke-db2" {
  source                   = "../modules/beta-private-cluster"
  project_id               = var.project_id
  name                     = "cluster-db2"
  regional                 = true
  region                   = "us-west1"
  network                  = module.network.network_name
  subnetwork               = module.network.secondary_subnet_name
  ip_range_pods            = "ip-range-pods-db2"
  ip_range_services        = "ip-range-svc-db2"
  create_service_account   = false
  service_account          = module.gke-db1.service_account
  enable_private_endpoint  = false
  enable_private_nodes     = true
  master_ipv4_cidr_block   = "172.16.0.16/28"
  network_policy           = true
  cluster_autoscaling = {
    "autoscaling_profile": "OPTIMIZE_UTILIZATION",
    "enabled" : true,
    "gpu_resources" : [],
    "min_cpu_cores" : 10,
    "min_memory_gb" : 144,
    "max_cpu_cores" : 48,
    "max_memory_gb" : 192,
  }
  monitoring_enable_managed_prometheus = true
  gke_backup_agent_config = true
  node_pools = [
    {
      name            = "pool-sys"
      autoscaling     = true
      min_count       = 1
      max_count       = 3
      max_surge       = 1
      max_unavailable = 0
      machine_type    = "e2-standard-4"
      node_locations  = "us-west1-a,us-west1-b,us-west1-c"
      auto_repair     = true
    },
    {
      name            = "pool-db"
      autoscaling     = true
      max_surge       = 1
      max_unavailable = 0
      machine_type    = "e2-standard-8"
      node_locations  = "us-west1-a,us-west1-b,us-west1-c"
      auto_repair     = true
    },
  ]
  node_pools_labels = {
    all = {}
    pool-db = {
      "app.stateful/component" = "postgresql"
    }
    pool-sys = {
      "app.stateful/component" = "postgresql-pgpool"
    }
  }
  node_pools_taints = {
    all = []
    pool-db = [
      {
        key    = "app.stateful/component"
        value  = "postgresql"
        effect = "NO_SCHEDULE"
      },
    ],
    pool-sys = [
      {
        key    = "app.stateful/component"
        value  = "postgresql-pgpool"
        effect = "NO_SCHEDULE"
      },
    ],
  }
  gce_pd_csi_driver = true
}

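Deploy PostgreSQL on your cluster

In this section, you use a Helm chart to deploy a PostgreSQL database instance to run on GKE.

Install PostgreSQL

To install PostgreSQL on your cluster, follow these steps.

Configure Docker access.

gcloud auth configure-docker us-docker.pkg.dev

Populate Artifact Registry with the required PostgreSQL Docker images.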
```
./scripts/gcr.sh bitnami/postgresql-repmgr 15.1.0-debian-11-r0
./scripts/gcr.sh bitnami/postgres-exporter 0.11.1-debian-11-r27
./scripts/gcr.sh bitnami/pgpool 4.3.3-debian-11-r28
```
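The script pushes the following Bitnami images to Artifact Registry for Helm to install:
- postgresql-repmgr: This PostgreSQL cluster solution includes the PostgreSQL replication manager (repmgr), an open source tool for managing replication and failover on PostgreSQL clusters.
- postgres-exporter: PostgreSQL Exporter gathers PostgreSQL metrics for consumption by Prometheus.
- pgpool: Pgpool-II is the PostgreSQL proxy. It provides connection pooling and load balancing.

Verify that the correct images are stored in the repository.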

gcloud artifacts docker images list us-docker.pkg.dev/$PROJECT_ID/main \
    --format="flattened(package)"

The output is similar to the following:

---
image: us-docker.pkg.dev/[PROJECT_ID]/main/bitnami/pgpool
---
image: us-docker.pkg.dev/[PROJECT_ID]/main/bitnami/postgres-exporter
---
image: us-docker.pkg.dev/[PROJECT_ID]/main/bitnami/postgresql-repmgr
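
Instead of checking the listing by eye, you could pipe it through a small filter. The following bash sketch assumes the three image names used in this tutorial; check_images is a hypothetical helper and is not part of the sample repository.

```shell
#!/usr/bin/env bash
# check_images: hypothetical helper, not part of the sample repository.
# Reads an image listing on stdin and reports any expected image that is absent.
check_images() {
  local listing missing=0 image
  listing="$(cat)"
  for image in bitnami/pgpool bitnami/postgres-exporter bitnami/postgresql-repmgr; do
    # -F matches the name as a fixed string rather than a regular expression.
    if ! printf '%s\n' "${listing}" | grep -qF -- "/${image}"; then
      echo "missing image: ${image}" >&2
      missing=1
    fi
  done
  return "${missing}"
}
```

To use it, append `| check_images` to the `gcloud artifacts docker images list` command shown above. A non-zero exit status means that at least one image was not found.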

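Configure kubectl command-line access to the primary cluster.

gcloud container clusters get-credentials $SOURCE_CLUSTER \
--location=$REGION --project=$PROJECT_ID

Create a namespace.

export NAMESPACE=postgresql
kubectl create namespace $NAMESPACE

If you are deploying to an Autopilot cluster, configure node provisioning across three zones. If you are deploying to a Standard cluster, you can skip this step.

By default, Autopilot provisions resources in only two zones. The Deployment defined in prepareforha.yaml ensures that Autopilot provisions nodes across three zones in your cluster by setting the following values:

replicas: 3
podAntiAffinity with requiredDuringSchedulingIgnoredDuringExecution and topologyKey: "topology.kubernetes.io/zone"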

kubectl -n $NAMESPACE apply -f scripts/prepareforha.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prepare-three-zone-ha
  labels:
    app: prepare-three-zone-ha
    app.kubernetes.io/name: postgresql-ha
spec:
  replicas: 3
  selector:
    matchLabels:
      app: prepare-three-zone-ha
      app.kubernetes.io/name: postgresql-ha
  template:
    metadata:
      labels:
        app: prepare-three-zone-ha
        app.kubernetes.io/name: postgresql-ha
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - prepare-three-zone-ha
            topologyKey: "topology.kubernetes.io/zone"
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: cloud.google.com/compute-class
                operator: In
                values:
                - "Scale-Out"
            weight: 1
      nodeSelector:
        app.stateful/component: postgresql
      tolerations:
      - effect: NoSchedule
        key: app.stateful/component
        operator: Equal
        value: postgresql
      containers:
      - name: prepare-three-zone-ha
        image: busybox:latest
        command:
            - "/bin/sh"
            - "-c"
            - "while true; do sleep 3600; done"
        resources:
          limits:
            cpu: "500m"
            ephemeral-storage: "10Mi"
            memory: "0.5Gi"
          requests:
            cpu: "500m"
            ephemeral-storage: "10Mi"
            memory: "0.5Gi"
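
To check that nodes now exist in three zones, you could count the distinct values of the zone label. The following bash sketch assumes that the input comes from `kubectl get nodes -L topology.kubernetes.io/zone --no-headers`, where the zone is the last column. The helper name count_zones is hypothetical.

```shell
#!/usr/bin/env bash
# count_zones: hypothetical helper, not part of the sample repository.
# Reads node rows on stdin and prints the number of distinct values in the
# last column, which is assumed to hold the topology.kubernetes.io/zone label.
count_zones() {
  awk 'NF { zones[$NF] = 1 } END { n = 0; for (z in zones) n++; print n }'
}
```

For example, `kubectl get nodes -L topology.kubernetes.io/zone --no-headers | count_zones` should print 3 after Autopilot has scaled out. The count covers every node in the cluster, not only the nodes that run the prepare-three-zone-ha Pods.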

Update the Helm dependencies.

cd helm/postgresql-bootstrap
helm dependency update

Inspect and verify the charts that Helm will install.

helm -n postgresql template postgresql . \
  --set global.imageRegistry="us-docker.pkg.dev/$PROJECT_ID/main"

Install the Helm chart.

helm -n postgresql upgrade --install postgresql . \
    --set global.imageRegistry="us-docker.pkg.dev/$PROJECT_ID/main"

The output is similar to the following:

NAMESPACE: postgresql
STATUS: deployed
REVISION: 1
TEST SUITE: None

Verify that the PostgreSQL replicas are running.

kubectl get all -n $NAMESPACE

The output is similar to the following:

NAME                                                          READY   STATUS    RESTARTS   AGE
pod/postgresql-postgresql-bootstrap-pgpool-75664444cb-dkl24   1/1     Running   0          8m39s
pod/postgresql-postgresql-ha-pgpool-6d86bf9b58-ff2bg          1/1     Running   0          8m39s
pod/postgresql-postgresql-ha-postgresql-0                     2/2     Running   0          8m39s
pod/postgresql-postgresql-ha-postgresql-1                     2/2     Running   0          8m39s
pod/postgresql-postgresql-ha-postgresql-2                     2/2     Running   0          8m38s

NAME                                                   TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)    AGE
service/postgresql-postgresql-ha-pgpool                ClusterIP   192.168.99.236    <none>        5432/TCP   8m39s
service/postgresql-postgresql-ha-postgresql            ClusterIP   192.168.90.20     <none>        5432/TCP   8m39s
service/postgresql-postgresql-ha-postgresql-headless   ClusterIP   None              <none>        5432/TCP   8m39s
service/postgresql-postgresql-ha-postgresql-metrics    ClusterIP   192.168.127.198   <none>        9187/TCP   8m39s

NAME                                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/postgresql-postgresql-bootstrap-pgpool   1/1     1            1           8m39s
deployment.apps/postgresql-postgresql-ha-pgpool          1/1     1            1           8m39s

NAME                                                                DESIRED   CURRENT   READY   AGE
replicaset.apps/postgresql-postgresql-bootstrap-pgpool-75664444cb   1         1         1       8m39s
replicaset.apps/postgresql-postgresql-ha-pgpool-6d86bf9b58          1         1         1       8m39s

NAME                                                   READY   AGE
statefulset.apps/postgresql-postgresql-ha-postgresql   3/3     8m39s
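
For scripted checks, the READY value of the StatefulSet (for example 3/3) can be compared programmatically. The following is a minimal bash sketch; all_ready is a hypothetical helper, and the jsonpath expression in the comment is one possible way to produce its input.

```shell
#!/usr/bin/env bash
# all_ready: hypothetical helper, not part of the sample repository.
# Succeeds only when a "ready/desired" string such as "3/3" has equal,
# non-empty, non-zero halves.
#
# One possible way to produce the input:
#   kubectl get statefulset postgresql-postgresql-ha-postgresql -n "$NAMESPACE" \
#     -o jsonpath='{.status.readyReplicas}/{.spec.replicas}'
all_ready() {
  local ready="${1%%/*}" desired="${1##*/}"
  [ -n "${ready}" ] && [ "${ready}" = "${desired}" ] && [ "${desired}" != "0" ]
}
```

An empty left half (for example /3) is treated as not ready, because Kubernetes omits readyReplicas when no replica is ready.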

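Create a test dataset

In this section, you create a database and a table with sample values. The database serves as a test dataset for the failover process that you test later in this tutorial.

Connect to your PostgreSQL instance.

cd ../../
./scripts/launch-client.sh

The output is similar to the following: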

Launching Pod pg-client in the namespace postgresql ...
pod/pg-client created
waiting for the Pod to be ready
Copying script files to the target Pod pg-client ...
Pod: pg-client is healthy

Start a shell session.

kubectl exec -it pg-client -n postgresql -- /bin/bash

Create a database and a table, and then insert some test rows.

psql -h $HOST_PGPOOL -U postgres -a -q -f /tmp/scripts/generate-db.sql

Verify the number of rows in each table.
```
psql -h $HOST_PGPOOL -U postgres -a -q -f /tmp/scripts/count-rows.sql
```
The output is similar to the following:
```
select COUNT(*) from tb01;
 count
--------
 300000
(1 row)

select COUNT(*) from tb02;
 count
--------
 300000
(1 row)
```
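Tip: You can also use pgbench to create dummy data. However, to make it easier to distinguish query traffic, we recommend that you use the provided script to create the database and tables that are queried during the read/write tests.

Generate test data.

export DB=postgres
pgbench -i -h $HOST_PGPOOL -U postgres $DB -s 50

The output is similar to the following: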

dropping old tables...
creating tables...
generating data (client-side)...
5000000 of 5000000 tuples (100%) done (elapsed 29.85 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 36.86 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 31.10 s, vacuum 1.88 s, primary keys 3.86 s).

Exit the postgres client Pod.
```
exit
```

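Monitor PostgreSQL

In this section, you view metrics and set up alerts for your PostgreSQL instance. You use Google Cloud Managed Service for Prometheus to perform monitoring and alerting.

View metrics

Your PostgreSQL deployment includes a postgresql-exporter sidecar container. This container exposes a /metrics endpoint. Google Cloud Managed Service for Prometheus is configured to monitor the PostgreSQL Pods on this endpoint. You can view these metrics through Google Cloud console dashboards.

The Google Cloud console provides a few ways to create and save dashboard configurations:

Creation and export: You can create dashboards directly in the Google Cloud console, then export and store them in a code repository. To do this, open the JSON editor in the dashboard toolbar and download the dashboard JSON file.
Storage and import: You can import a dashboard from a JSON file by clicking +Create Dashboard and uploading the dashboard's JSON content through the JSON editor menu.

To visualize data from your PostgreSQL application and GKE cluster, follow these steps:

Create the following dashboards: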

cd monitoring
gcloud monitoring dashboards create \
        --config-from-file=dashboard/postgresql-overview.json \
        --project=$PROJECT_ID
gcloud monitoring dashboards create \
        --config-from-file dashboard/gke-postgresql.json \
        --project $PROJECT_ID

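In the Google Cloud console, go to the Cloud Monitoring Dashboards page.
Select Custom from the dashboard list. The following dashboards appear:
- PostgreSQL Overview: Displays metrics from the PostgreSQL application, including database uptime, database size, and transaction latency.
- GKE PostgreSQL Cluster: Displays metrics from the GKE cluster that PostgreSQL runs on, including CPU usage, memory usage, and volume utilization.
Click each link to examine the generated dashboards.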

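Set up alerts

Alerting gives you timely awareness of problems in your applications so that you can resolve them quickly. You can create an alerting policy to specify the circumstances under which you want to be alerted and how you want to be notified. You can also create notification channels that let you select where alerts are sent.

In this section, you use Terraform to configure the following example alerts:

db_max_transaction: Monitors the maximum lag of transactions, in seconds. An alert is triggered if the value is greater than 10.
db_node_up: Monitors the status of the database Pods. A value of 0 means that a Pod is down, which triggers an alert.

To set up alerts, follow these steps:

Configure alerts with Terraform.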

EMAIL=YOUR_EMAIL
cd alerting/terraform
terraform init
terraform plan -var project_id=$PROJECT_ID -var email_address=$EMAIL
terraform apply -var project_id=$PROJECT_ID -var email_address=$EMAIL

Replace the following value:

YOUR_EMAIL: your email address.

The output is similar to the following:

Apply complete! Resources: 3 added, 0 changed, 0 destroyed.

Connect to the client Pod.

cd ../../../
kubectl exec -it --namespace postgresql pg-client -- /bin/bash

Generate a load test to test the db_max_transaction alert.

pgbench -i -h $HOST_PGPOOL -U postgres -s 200 postgres

The output is similar to the following:

dropping old tables...
creating tables...
generating data (client-side)...
20000000 of 20000000 tuples (100%) done (elapsed 163.22 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 191.30 s (drop tables 0.14 s, create tables 0.01 s, client-side generate 165.62 s, vacuum 4.52 s, primary keys 21.00 s).

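The alert triggers and sends an email to YOUR_EMAIL with a subject line that starts with "[ALERT] Max Lag of transaction".

In the Google Cloud console, go to the Alert Policy page.

Go to Alert Policy
Select db_max_transaction from the listed policies. In the chart, you should see a spike from the load test that exceeds the 10-second threshold for the Prometheus metric pg_stat_activity_max_tx_duration/gauge.
Exit the postgres client Pod.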
```
exit
```

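Manage PostgreSQL and GKE upgrades

Version updates for both PostgreSQL and Kubernetes are released on a regular schedule. Follow operational best practices to update your software environment regularly. By default, GKE manages cluster and node pool upgrades for you.

Upgrade PostgreSQL

This section shows how to perform a version upgrade for PostgreSQL. In this tutorial, you use a rolling update strategy to upgrade your Pods, so that at no point are all of the Pods down.

To perform a version upgrade, follow these steps:

Push an updated version of the postgresql-repmgr image to Artifact Registry. Define the new version (for example, postgresql-repmgr 15.1.0-debian-11-r1).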

NEW_IMAGE=us-docker.pkg.dev/$PROJECT_ID/main/bitnami/postgresql-repmgr:15.1.0-debian-11-r1
./scripts/gcr.sh bitnami/postgresql-repmgr 15.1.0-debian-11-r1

Trigger a rolling update by using kubectl.

kubectl set image statefulset -n postgresql postgresql-postgresql-ha-postgresql postgresql=$NEW_IMAGE
kubectl rollout restart statefulsets -n postgresql postgresql-postgresql-ha-postgresql
kubectl rollout status statefulset -n postgresql postgresql-postgresql-ha-postgresql

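You see the StatefulSet complete a rolling update, starting with the highest ordinal replica and proceeding to the lowest.

The output is similar to the following: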

Waiting for 1 pods to be ready...
waiting for statefulset rolling update to complete 1 pods at revision postgresql-postgresql-ha-postgresql-5c566ccf49...
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
waiting for statefulset rolling update to complete 2 pods at revision postgresql-postgresql-ha-postgresql-5c566ccf49...
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
statefulset rolling update complete 3 pods at revision postgresql-postgresql-ha-postgresql-5c566ccf49...
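
After the rollout completes, you might want to confirm that every database Pod actually runs the new image. The following bash sketch filters "pod image" pairs read from stdin. The helper name stale_pods is hypothetical, and the jsonpath query in the comment is one possible way to produce the input; it assumes that the database container is named postgresql, as in the kubectl set image command above.

```shell
#!/usr/bin/env bash
# stale_pods: hypothetical helper, not part of the sample repository.
# Reads "pod-name image" pairs on stdin, prints the name of each Pod whose
# image differs from the expected one, and exits non-zero if any was found.
# Rows without an image (Pods that have no matching container) are skipped.
#
# One possible way to produce the input:
#   kubectl get pods -n postgresql -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.containers[?(@.name=="postgresql")].image}{"\n"}{end}'
stale_pods() {
  local expected="$1"
  awk -v want="${expected}" '
    NF == 2 && $2 != want { print $1; stale = 1 }
    END { exit (stale + 0) }
  '
}
```

Piping the query output into `stale_pods "$NEW_IMAGE"` prints nothing and exits with status 0 when all database Pods have been updated.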

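Plan for GKE upgrades on Standard clusters

This section applies if you are running Standard clusters. You can take proactive steps and set configurations to mitigate risk and make cluster upgrades smoother when you run stateful services, including the following:

Follow GKE best practices for upgrading clusters. Choose an appropriate upgrade strategy to ensure that upgrades happen during the maintenance window:
- Choose surge upgrades if cost optimization is important and your workloads can tolerate a graceful shutdown in less than 60 minutes.
- Choose blue-green upgrades if your workloads are less tolerant of disruptions and a temporary cost increase due to higher resource usage is acceptable.
To learn more, see Upgrade a cluster running a stateful workload.
Use the Recommender service to check for deprecation insights and recommendations to avoid service interruptions.
Use maintenance windows to ensure that upgrades happen when you intend them to. Before the maintenance window, make sure that your database backups succeeded.
Before allowing traffic to the upgraded nodes, use readiness and liveness probes to make sure that they are ready to accept traffic.
Create probes that assess whether replication is in sync before accepting traffic. You can do this with custom scripts, depending on the complexity and scale of your database.

Verify database availability during Standard cluster upgrades

This section applies if you are running Standard clusters. To verify PostgreSQL availability during upgrades, the general process is to generate traffic against the PostgreSQL database while the upgrade is in progress. Then, use pgbench to check that the database can handle a baseline level of traffic during the upgrade, compared to when the database is fully available.

Connect to your PostgreSQL instance.

./scripts/launch-client.sh

The output is similar to the following: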

Launching Pod pg-client in the namespace postgresql ...
pod/pg-client created
waiting for the Pod to be ready
Copying script files to the target Pod pg-client ...
Pod: pg-client is healthy

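In Cloud Shell, open a shell in the client Pod.

kubectl exec -it -n postgresql pg-client -- /bin/bash

Initialize pgbench.

pgbench -i -h $HOST_PGPOOL -U postgres postgres

Use the following command to get baseline results for confirming that your PostgreSQL application stays highly available during the upgrade window. To get a baseline result, test with multiple connections across multiple jobs (threads) for 30 seconds.

pgbench -h $HOST_PGPOOL -U postgres postgres -c10 -j4 -T 30 -R 200

The output is similar to the following: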

pgbench (14.5)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 4
duration: 30 s
number of transactions actually processed: 5980
latency average = 7.613 ms
latency stddev = 2.898 ms
rate limit schedule lag: avg 0.256 (max 36.613) ms
initial connection time = 397.804 ms
tps = 201.955497 (without initial connection time)

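To ensure availability during upgrades, you can generate some load against your database and confirm that the PostgreSQL application provides a consistent response rate during the upgrade. To perform this test, generate some traffic against the database by using the pgbench command. The following command runs pgbench for one hour, targeting 200 TPS (transactions per second), and lists the request rate every 2 seconds.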
```
pgbench -h $HOST_PGPOOL -U postgres postgres --client=10 --jobs=4 --rate=200 --time=3600 --progress=2 --select-only
```
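Where:
- --client: Number of clients simulated, that is, the number of concurrent database sessions.
- --jobs: Number of worker threads within pgbench. Using more than one thread can be helpful on multi-CPU machines. Clients are distributed as evenly as possible among the available threads. The default is 1.
- --rate: The target rate, in transactions per second.
- --time: The duration of the test, in seconds.
- --progress: Show a progress report at the given interval, in seconds (every 2 seconds in this command).
- --select-only: Run the built-in select-only script, so the test issues read-only queries.
The output is similar to the following: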
```
pgbench (14.5)
starting vacuum...end.
progress: 5.0 s, 354.8 tps, lat 25.222 ms stddev 15.038
progress: 10.0 s, 393.8 tps, lat 25.396 ms stddev 16.459
progress: 15.0 s, 412.8 tps, lat 24.216 ms stddev 14.548
progress: 20.0 s, 405.0 tps, lat 24.656 ms stddev 14.066
```
In the Google Cloud console, navigate back to the PostgreSQL Overview dashboard in Cloud Monitoring. Notice the spike on the Connection per DB and Connection per Pod graphs.
Exit the client Pod.
```
exit
```

Delete the client Pod.

kubectl delete pod -n postgresql pg-client
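
To spot availability dips in the pgbench progress report without reading every line, you could filter the output for intervals that fall below a chosen rate. The following bash sketch assumes the "progress: ... tps" line format shown in the sample output above. The helper name low_tps_intervals is hypothetical, and the threshold is your own choice.

```shell
#!/usr/bin/env bash
# low_tps_intervals: hypothetical helper, not part of the sample repository.
# Reads pgbench --progress output on stdin and prints every progress line
# whose reported tps is below the given minimum.
low_tps_intervals() {
  local min_tps="$1"
  # Fields: $1="progress:", $2=elapsed, $3="s,", $4=tps, $5="tps,"
  awk -v min="${min_tps}" '$1 == "progress:" && ($4 + 0) < (min + 0) { print }'
}
```

For example, saving the pgbench output to a file and running `low_tps_intervals 180 < pgbench.log` lists the reporting intervals in which throughput dropped below 180 TPS. The file name pgbench.log is only illustrative.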

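Simulate a PostgreSQL service disruption

In this section, you simulate a service disruption in one of the PostgreSQL replicas by stopping the replication manager service. This prevents the Pod from serving traffic to its peer replicas and causes its liveness probes to fail.

Open a new Cloud Shell session and configure kubectl command-line access to the primary cluster.

gcloud container clusters get-credentials $SOURCE_CLUSTER \
--location=$REGION --project=$PROJECT_ID

View the PostgreSQL events emitted in Kubernetes.

kubectl get events -n postgresql --field-selector=involvedObject.name=postgresql-postgresql-ha-postgresql-0 --watch

In the earlier Cloud Shell session, simulate a service failure by stopping PostgreSQL repmgr.

Attach your session to the database container.

kubectl exec -it -n $NAMESPACE postgresql-postgresql-ha-postgresql-0 -c postgresql -- /bin/bash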

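Stop the service by using repmgr. The command passes --checkpoint, which issues a CHECKPOINT before the stop, and omits the --dry-run argument so that the stop is actually performed.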

export ENTRY='/opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh'
export RCONF='/opt/bitnami/repmgr/conf/repmgr.conf'
$ENTRY repmgr -f $RCONF node service --action=stop --checkpoint

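The liveness probe configured for the PostgreSQL container starts to fail within five seconds. The probe repeats every ten seconds until the failure threshold of six failures is reached. After the failureThreshold value is reached, the container is restarted. You can configure these parameters to decrease the liveness probe tolerance and tune the SLO requirements of your deployment.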
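
As a rough worked example of this timing, the window between the first failed probe and the restart is bounded by the probe period multiplied by the failure threshold. The following bash sketch uses the values stated above (a 10-second period and a threshold of 6). It ignores probe timeouts and the restart itself, so treat the result as an approximation. The helper name restart_window_seconds is hypothetical.

```shell
#!/usr/bin/env bash
# restart_window_seconds: hypothetical helper, not part of the sample repository.
# Approximates the upper bound, in seconds, on how long a failing container
# keeps running before the kubelet restarts it: periodSeconds * failureThreshold.
restart_window_seconds() {
  local period_seconds="$1" failure_threshold="$2"
  echo $(( period_seconds * failure_threshold ))
}

restart_window_seconds 10 6   # prints 60
```

Lowering either value shortens the window, at the cost of restarting the container on shorter transient failures.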

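In the event stream, you see the Pod's liveness and readiness probes fail, along with a message that the container needs to be restarted. The output is similar to the following: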

0s          Normal    Killing                pod/postgresql-postgresql-ha-postgresql-0   Container postgresql failed liveness probe, will be restarted
0s          Warning   Unhealthy              pod/postgresql-postgresql-ha-postgresql-0   Readiness probe failed: psql: error: connection to server at "127.0.0.1", port 5432 failed: Connection refused...
0s          Normal    Pulled                 pod/postgresql-postgresql-ha-postgresql-0   Container image "us-docker.pkg.dev/psch-gke-dev/main/bitnami/postgresql-repmgr:14.5.0-debian-11-r10" already present on machine
0s          Normal    Created                pod/postgresql-postgresql-ha-postgresql-0   Created container postgresql
0s          Normal    Started                pod/postgresql-postgresql-ha-postgresql-0   Started container postgresql

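Prepare for disaster recovery

To ensure that your production workloads remain available during a service-interrupting event, you should prepare a disaster recovery (DR) plan. To learn more about DR planning, see the Disaster recovery planning guide.

Disaster recovery for Kubernetes can be implemented in two phases:

Backup involves creating a point-in-time snapshot of your state or data before a service-interrupting event occurs.
Recovery involves restoring your state or data from a backup copy after a disaster occurs.

To back up and restore your workloads on GKE clusters, you can use Backup for GKE. You can enable this service on new and existing clusters. Doing so deploys a Backup for GKE agent that runs in your clusters; the agent is responsible for capturing configuration and volume backup data and for orchestrating recovery.

Backups and restores can be scoped to an entire cluster, a namespace, or an application (defined by selectors such as matchLabels).

Example PostgreSQL backup and restore scenario

The example in this section shows how to perform a backup and restore operation at the application scope by using the ProtectedApplication custom resource.

The following diagram shows the component resources in the ProtectedApplication: a StatefulSet that represents the postgresql-ha application and a Deployment of pgpool, which use the same label (app.kubernetes.io/name: postgresql-ha).

**Figure 2**: Example backup and recovery solution for a highly available PostgreSQL cluster.

To prepare to back up and restore your PostgreSQL workload, follow these steps:

Set up the environment variables. In this example, you use a ProtectedApplication to back up the PostgreSQL workload and its volumes from the source GKE cluster (us-central1), and then restore them to another GKE cluster in a different region (us-west1).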

export SOURCE_CLUSTER=cluster-db1
export TARGET_CLUSTER=cluster-db2
export REGION=us-central1
export DR_REGION=us-west1
export NAME_PREFIX=g-db-protected-app
export BACKUP_PLAN_NAME=$NAME_PREFIX-bkp-plan-01
export BACKUP_NAME=bkp-$BACKUP_PLAN_NAME
export RESTORE_PLAN_NAME=$NAME_PREFIX-rest-plan-01
export RESTORE_NAME=rest-$RESTORE_PLAN_NAME

Verify that Backup for GKE is enabled on your cluster. It should already be enabled as part of the Terraform setup that you performed earlier.
```
gcloud container clusters describe $SOURCE_CLUSTER \
    --project=$PROJECT_ID  \
    --location=$REGION \
    --format='value(addonsConfig.gkeBackupAgentConfig)'
```
If Backup for GKE is enabled, the output of the command shows enabled=True.
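
In a script, the same check can gate the backup steps that follow. The following bash sketch assumes that the describe command prints a string containing enabled=True, as described above. The helper name backup_agent_enabled is hypothetical.

```shell
#!/usr/bin/env bash
# backup_agent_enabled: hypothetical helper, not part of the sample repository.
# Succeeds when the given text (the output of the describe command) reports
# that the Backup for GKE agent is enabled.
#
# Possible usage, reusing the command from this tutorial:
#   status="$(gcloud container clusters describe "$SOURCE_CLUSTER" \
#     --project="$PROJECT_ID" --location="$REGION" \
#     --format='value(addonsConfig.gkeBackupAgentConfig)')"
#   backup_agent_enabled "$status" || echo "Backup for GKE is not enabled" >&2
backup_agent_enabled() {
  case "$1" in
    *enabled=True*|*enabled=true*) return 0 ;;
    *) return 1 ;;
  esac
}
```

An empty output, which is what you would see when the add-on configuration is absent, is treated as not enabled.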

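Set up a backup plan and perform a restore

Backup for GKE lets you create a backup plan as a cron job. A backup plan contains a backup configuration that includes the source cluster, the selection of workloads to back up, and the region in which backup artifacts produced under this plan are stored.

To perform a backup and restore, follow these steps:

Verify the status of the ProtectedApplication on cluster-db1.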

kubectl get ProtectedApplication -A

The output is similar to the following:

NAMESPACE    NAME            READY TO BACKUP
postgresql   postgresql-ha   true

Create a backup plan for the ProtectedApplication.

export NAMESPACE=postgresql
export PROTECTED_APP=$(kubectl get ProtectedApplication -n $NAMESPACE --no-headers | awk '{ print $1 }')

gcloud beta container backup-restore backup-plans create $BACKUP_PLAN_NAME \
  --project=$PROJECT_ID \
  --location=$DR_REGION \
  --cluster=projects/$PROJECT_ID/locations/$REGION/clusters/$SOURCE_CLUSTER \
  --selected-applications=$NAMESPACE/$PROTECTED_APP \
  --include-secrets \
  --include-volume-data \
  --cron-schedule="0 3 * * *" \
  --backup-retain-days=7 \
  --backup-delete-lock-days=0

Manually create a backup.

gcloud beta container backup-restore backups create $BACKUP_NAME \
--project=$PROJECT_ID \
--location=$DR_REGION \
--backup-plan=$BACKUP_PLAN_NAME \
--wait-for-completion
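After the command returns, you might want to confirm the state of the backup before you build a restore plan on top of it. The following sketch assumes that the backup resource exposes a `state` field whose successful value is `SUCCEEDED`; the function name `backup_succeeded` is hypothetical, and it reads the environment variables that you exported earlier.

```shell
# Print the state of a backup and succeed only if it is SUCCEEDED.
# Uses PROJECT_ID, DR_REGION, and BACKUP_PLAN_NAME from the environment.
backup_succeeded() {
  local backup="$1" state
  state=$(gcloud beta container backup-restore backups describe "$backup" \
    --project="$PROJECT_ID" \
    --location="$DR_REGION" \
    --backup-plan="$BACKUP_PLAN_NAME" \
    --format='value(state)') || return 2
  echo "Backup $backup is in state: $state"
  [ "$state" = "SUCCEEDED" ]
}

# Example:
#   backup_succeeded "$BACKUP_NAME" || echo "Do not restore from this backup yet"
```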

Set up a restore plan.

gcloud beta container backup-restore restore-plans create $RESTORE_PLAN_NAME \
  --project=$PROJECT_ID \
  --location=$DR_REGION \
  --backup-plan=projects/$PROJECT_ID/locations/$DR_REGION/backupPlans/$BACKUP_PLAN_NAME \
  --cluster=projects/$PROJECT_ID/locations/$DR_REGION/clusters/$TARGET_CLUSTER \
  --cluster-resource-conflict-policy=use-existing-version \
  --namespaced-resource-restore-mode=delete-and-restore \
  --volume-data-restore-policy=restore-volume-data-from-backup \
  --selected-applications=$NAMESPACE/$PROTECTED_APP \
  --cluster-resource-scope-selected-group-kinds="storage.k8s.io/StorageClass","scheduling.k8s.io/PriorityClass"
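The backup and restore commands mix two locations: the source cluster lives in `$REGION`, while the backup plan, the restore plan, and the target cluster all live in `$DR_REGION`. As an illustrative sketch, the following helpers (hypothetical names) build the fully qualified resource paths that the commands on this page pass to `--cluster`, `--backup-plan`, and `--backup`, which can make the region used in each path easier to see.

```shell
# Build fully qualified resource paths.
# Arguments: project ID, location, resource name (plus backup name for backup_path).
cluster_path() {
  printf 'projects/%s/locations/%s/clusters/%s\n' "$1" "$2" "$3"
}

backup_plan_path() {
  printf 'projects/%s/locations/%s/backupPlans/%s\n' "$1" "$2" "$3"
}

backup_path() {
  printf '%s/backups/%s\n' "$(backup_plan_path "$1" "$2" "$3")" "$4"
}

# Example:
#   cluster_path "$PROJECT_ID" "$REGION" "$SOURCE_CLUSTER"        # source cluster
#   cluster_path "$PROJECT_ID" "$DR_REGION" "$TARGET_CLUSTER"     # target cluster
#   backup_path "$PROJECT_ID" "$DR_REGION" "$BACKUP_PLAN_NAME" "$BACKUP_NAME"
```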

Restore from the backup.

gcloud beta container backup-restore restores create $RESTORE_NAME \
  --project=$PROJECT_ID \
  --location=$DR_REGION \
  --restore-plan=$RESTORE_PLAN_NAME \
  --backup=projects/$PROJECT_ID/locations/$DR_REGION/backupPlans/$BACKUP_PLAN_NAME/backups/$BACKUP_NAME \
  --wait-for-completion

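Verify that your cluster is restored

To verify that the restored cluster contains all the expected Pod, PersistentVolume, and StorageClass resources, follow these steps:

Configure kubectl command-line access to the backup cluster, cluster-db2.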

gcloud container clusters get-credentials $TARGET_CLUSTER --location $DR_REGION --project $PROJECT_ID

Verify that the StatefulSet is ready with 3/3 Pods.

kubectl get all -n $NAMESPACE

The output is similar to the following:

NAME                                                   READY   STATUS    RESTARTS        AGE
pod/postgresql-postgresql-ha-pgpool-778798b5bd-k2q4b   1/1     Running   0               4m49s
pod/postgresql-postgresql-ha-postgresql-0              2/2     Running   2 (4m13s ago)   4m49s
pod/postgresql-postgresql-ha-postgresql-1              2/2     Running   0               4m49s
pod/postgresql-postgresql-ha-postgresql-2              2/2     Running   0               4m49s

NAME                                                   TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)    AGE
service/postgresql-postgresql-ha-pgpool                ClusterIP   192.168.241.46    <none>        5432/TCP   4m49s
service/postgresql-postgresql-ha-postgresql            ClusterIP   192.168.220.20    <none>        5432/TCP   4m49s
service/postgresql-postgresql-ha-postgresql-headless   ClusterIP   None              <none>        5432/TCP   4m49s
service/postgresql-postgresql-ha-postgresql-metrics    ClusterIP   192.168.226.235   <none>        9187/TCP   4m49s

NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/postgresql-postgresql-ha-pgpool   1/1     1            1           4m49s

NAME                                                         DESIRED   CURRENT   READY   AGE
replicaset.apps/postgresql-postgresql-ha-pgpool-778798b5bd   1         1         1       4m49s

NAME                                                   READY   AGE
statefulset.apps/postgresql-postgresql-ha-postgresql   3/3     4m49s
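Instead of reading the READY column by eye, you can compare the ready and desired replica counts directly. This is a minimal sketch with a hypothetical function name, `statefulset_ready`; the StatefulSet name in the example comes from the sample output above.

```shell
# Succeed only when every desired replica of the StatefulSet is ready.
# Arguments: StatefulSet name, namespace.
statefulset_ready() {
  local name="$1" namespace="$2" counts ready desired
  counts=$(kubectl get statefulset "$name" -n "$namespace" \
    -o jsonpath='{.status.readyReplicas}/{.spec.replicas}') || return 2
  ready=${counts%/*}     # text before the slash (empty if no replica is ready)
  desired=${counts#*/}   # text after the slash
  echo "StatefulSet $name: ${ready:-0}/${desired} replicas ready"
  [ -n "$desired" ] && [ "${ready:-0}" = "$desired" ]
}

# Example:
#   statefulset_ready postgresql-postgresql-ha-postgresql "$NAMESPACE"
```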

Verify that all Pods in the postgresql namespace are running.

kubectl get pods -n $NAMESPACE

The output is similar to the following:

postgresql-postgresql-ha-pgpool-569d7b8dfc-2f9zx   1/1     Running   0          7m56s
postgresql-postgresql-ha-postgresql-0              2/2     Running   0          7m56s
postgresql-postgresql-ha-postgresql-1              2/2     Running   0          7m56s
postgresql-postgresql-ha-postgresql-2              2/2     Running   0          7m56s
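To script the same check, you can list only the Pods whose STATUS column is not Running. The sketch below uses a hypothetical function name, `pods_not_running`, and relies on STATUS being the third column of the default `kubectl get pods` output; an empty result means every Pod is running.

```shell
# Print the names of Pods in the namespace whose status is not "Running".
# Argument: namespace.
pods_not_running() {
  kubectl get pods -n "$1" --no-headers | awk '$3 != "Running" { print $1 }'
}

# Example:
#   if [ -z "$(pods_not_running "$NAMESPACE")" ]; then
#     echo "All Pods are running"
#   fi
```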

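Verify the PersistentVolumes and StorageClass. During the restore process, Backup for GKE creates a proxy class in the target workload to replace the StorageClass that was provisioned in the source workload (gce-pd-gkebackup-dn in the sample output).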

kubectl get pvc -n $NAMESPACE

The output is similar to the following:

NAME                                         STATUS   VOLUME                 CAPACITY   ACCESS MODES   STORAGECLASS          AGE
data-postgresql-postgresql-ha-postgresql-0   Bound    pvc-be91c361e9303f96   8Gi        RWO            gce-pd-gkebackup-dn   10m
data-postgresql-postgresql-ha-postgresql-1   Bound    pvc-6523044f8ce927d3   8Gi        RWO            gce-pd-gkebackup-dn   10m
data-postgresql-postgresql-ha-postgresql-2   Bound    pvc-c9e71a99ccb99a4c   8Gi        RWO            gce-pd-gkebackup-dn   10m

Validate that the data is restored as expected

To validate that the data is restored as expected, follow these steps:

Connect to your PostgreSQL instance.

./scripts/launch-client.sh
kubectl exec -it pg-client -n postgresql -- /bin/bash

Verify the number of rows in each table.
```
psql -h $HOST_PGPOOL -U postgres -a -q -f /tmp/scripts/count-rows.sql
select COUNT(*) from tb01;
```
You should see a result similar to the data that you wrote earlier in Create a test dataset. The output is similar to the following:
```
300000
(1 row)
```
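While you are still inside the client Pod, the comparison against the expected row count can also be automated. The following sketch uses a hypothetical function name, `row_count_matches`, and assumes that the client Pod already supplies the password for the postgres user (for example through `PGPASSWORD`), as the psql command above implies. The table name tb01 and the count 300000 come from the sample output.

```shell
# Succeed only when the table holds the expected number of rows.
# Arguments: table name, expected row count. Uses HOST_PGPOOL from the environment.
row_count_matches() {
  local table="$1" expected="$2" actual
  # -t prints tuples only and -A disables alignment, so the output is a bare number.
  actual=$(psql -h "$HOST_PGPOOL" -U postgres -t -A \
    -c "select COUNT(*) from ${table};") || return 2
  echo "Table ${table}: expected ${expected}, found ${actual}"
  [ "$actual" = "$expected" ]
}

# Example:
#   row_count_matches tb01 300000
```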
Exit the client Pod.
```
exit
```

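Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project

The easiest way to avoid billing is to delete the project that you created for this tutorial.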

In the Google Cloud console, go to the Manage resources page.

In the project list, select the project that you want to delete, and then click Delete.

In the dialog, type the project ID, and then click Shut down to delete the project.

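What's next

Learn the best practices for deploying databases on GKE.
Explore Persistent Volumes in more detail.
See an example of how to use Pgpool-II for streaming replication with high-availability PostgreSQL clusters.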