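Check cluster connectivity with nettest

GKE on Bare Metal nettest identifies connectivity problems among the Kubernetes objects in your cluster, such as Pods, nodes, Services, and some external targets. nettest doesn't check connections from external targets to Pods, nodes, or Services. This document describes how to deploy and run nettest with one of the manifests (nettest.yaml or nettest_rhel.yaml) in the anthos-samples GitHub repository. If you run GKE on Bare Metal on Red Hat Enterprise Linux (RHEL) or CentOS, use nettest_rhel.yaml. If you run GKE on Bare Metal on Ubuntu, use nettest.yaml.

This document also describes how to interpret the logs that nettest generates so that you can identify connectivity problems in your cluster.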

About `nettest`

The nettest diagnostic tool consists of the following Kubernetes objects. Each object is specified in the nettest YAML manifest file.

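cloudprober: a DaemonSet and Service that collect network connectivity status, such as error rate and latency.
echoserver: a DaemonSet and Service that respond to cloudprober so that it can gather network connectivity metrics.
nettest: a Pod that contains the prometheus and nettest containers.
- prometheus collects metrics from cloudprober.
- nettest queries prometheus and displays the network test results in its log.
nettest-engine: a ConfigMap that configures the nettest container in the nettest Pod.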

The manifest also specifies a nettest namespace and a dedicated ServiceAccount (along with a ClusterRole and ClusterRoleBinding) to isolate nettest from other cluster resources.

Run nettest

Run the command for your operating system to deploy nettest. The tests run automatically when the nettest Pod starts and take about five minutes to complete.

For Ubuntu:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/anthos-samples/main/anthos-bm-utils/abm-nettest/nettest.yaml

For RHEL or CentOS:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/anthos-samples/main/anthos-bm-utils/abm-nettest/nettest_rhel.yaml
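
If you script the deployment across clusters with different node operating systems, the choice between the two manifests can be captured in a small helper. The following is a minimal sketch: the function name nettest_manifest_url and the OS identifiers it accepts (ubuntu, rhel, centos) are assumptions made for illustration and are not part of nettest.

```shell
# Hypothetical helper: print the manifest URL for a given node operating system.
# The accepted identifiers (ubuntu, rhel, centos) are an assumption of this sketch.
nettest_manifest_url() {
  base="https://raw.githubusercontent.com/GoogleCloudPlatform/anthos-samples/main/anthos-bm-utils/abm-nettest"
  case "$1" in
    ubuntu)      echo "$base/nettest.yaml" ;;
    rhel|centos) echo "$base/nettest_rhel.yaml" ;;
    *)           echo "unsupported operating system: $1" >&2; return 1 ;;
  esac
}
```

For example, `kubectl apply -f "$(nettest_manifest_url ubuntu)"` would apply the same manifest as the Ubuntu command above.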

Get test results

After the tests finish (about five minutes after you deploy the nettest manifest), run the following command to view the nettest results:

kubectl -n nettest logs nettest -c nettest

While nettest is running, it writes messages like the following to stdout:

I0413 03:33:04.879141       1 collectorui.go:130] Listening on ":8999"
I0413 03:33:04.879258       1 prometheus.go:172] Running prometheus controller
E0413 03:33:04.879628       1 prometheus.go:178] Prometheus controller: failed to
retries probers: Get "http://127.0.0.1:9090/api/v1/targets": dial tcp 127.0.0.1:9090:
connect: connection refused

If nettest runs successfully and identifies no connectivity failures, you see the following log entry:

I0211 21:58:34.689290       1 validate_metrics.go:78] Metric validation passed!

If nettest finds connectivity problems, it writes log entries like the following:

E0211 06:40:11.948634       1 collector.go:65] Engine error: step validateMetrics failed:
"Error rate in percentage": probe from "10.200.0.3" to "172.26.115.210:80" has value 100.000000,
threshold is 1.000000
"Error rate in percentage": probe from "10.200.0.3" to "172.26.27.229:80" has value 100.000000,
threshold is 1.000000
"Error rate in percentage": probe from "192.168.3.248" to "echoserver-hostnetwork_10.200.0.2_8080"
has value 2.007046, threshold is 1.000000

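Although the default threshold is one percent (1.000000), you can safely ignore error rates of up to five percent. For example, in the preceding output, the connection from IP address 192.168.3.248 to echoserver-hostnetwork_10.200.0.2_8080 has an error rate of about 2% (2.007046). This is an example of a reported connectivity problem that you can ignore.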
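
To apply the five percent guideline without reading every entry, you can filter the log before reviewing it. The following is a minimal sketch, not part of nettest: the function name filter_nettest_errors is hypothetical, and the pattern assumes that entries follow the format shown in the preceding output. It joins wrapped lines, extracts each probe entry, and prints only the entries whose value exceeds the cutoff.

```shell
# Hypothetical helper: read nettest log text on stdin and print only the
# probe entries whose error rate exceeds a cutoff (default: 5 percent).
# Assumes entries look like:
#   probe from "SRC" to "DST" has value N, threshold is T
filter_nettest_errors() {
  cutoff="${1:-5}"
  # Join wrapped lines so an entry split across lines still matches.
  tr '\n' ' ' |
    grep -oE 'probe from "[^"]+" to "[^"]+" +has value [0-9.]+' |
    # The last field of each extracted entry is the error rate.
    awk -v cutoff="$cutoff" '{ if ($NF + 0 > cutoff + 0) print }'
}
```

For example, `kubectl -n nettest logs nettest -c nettest | filter_nettest_errors 5` would list only the failures above five percent. An empty result means no entry exceeded the cutoff; it does not by itself confirm that the test run has finished.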

Interpret test results

When nettest finishes and finds connectivity problems, the nettest Pod log contains entries like the following:

"Error rate in percentage": probe from {src} to {dst} has value 100.000000, threshold is 1.000000

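Here, {src} and {dst} can each be one of the following:

echoserver Pod IP: a connection to or from a Pod on a node.
Node IP: a connection to or from a node.
Service IP: details are described later in this section.

In addition, {dst} can be:

google.com: an external connection.
dns: a connection through DNS to the non-hostNetwork Service, that is, echoserver-non-hostnetwork.nettest.svc.cluster.local.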

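Details about Service IPs are in the probe entries in the log, as in the following example. The example probe shows that 172.26.27.229:80 is the address of service-clusterip. There are two probes with this targets value, one for Pods (pod-service-clusterip) and one for nodes (node-service-clusterip).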
```
probe {
  name: "node-service-clusterip"
  …
  targets {
    host_names: "172.26.27.229:80"
  }
```
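
To map a failing {dst} address back to the probes that target it, you can search the probe entries for that address. The following is a minimal sketch, not part of nettest: the function name probes_for_target is hypothetical, and it assumes that the probe entries are laid out as in the preceding example, with each field on its own line and name appearing before host_names.

```shell
# Hypothetical helper: read nettest log text on stdin and print the name of
# every probe whose host_names value equals the given target address.
# Assumes the layout shown above: one field per line, name before host_names.
probes_for_target() {
  awk -v target="$1" '
    # Remember the most recent probe name, with the quotes stripped.
    $1 == "name:"       { gsub(/"/, "", $2); name = $2 }
    # Print that name when the target address matches.
    $1 == "host_names:" { gsub(/"/, "", $2); if ($2 == target) print name }
  '
}
```

For example, `kubectl -n nettest logs nettest -c nettest | probes_for_target "172.26.27.229:80"` would print the names of the probes that target that address.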

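Verify the fix

After you resolve all reported connectivity problems, delete the nettest Pod and reapply the nettest manifest to rerun the connectivity tests.

For example, to rerun nettest on Ubuntu, run the following commands: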

kubectl -n nettest delete pod nettest
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/anthos-samples/main/anthos-bm-utils/abm-nettest/nettest.yaml

Clean up `nettest`

When you finish testing, run the following commands to remove all nettest resources:

kubectl delete namespace nettest
kubectl delete clusterroles nettest:nettest
kubectl delete clusterrolebindings nettest:nettest