Version 1.16. This version is no longer supported. For information about how to upgrade to version 1.28, see Upgrade clusters in the latest documentation. For more information about supported and unsupported versions, see the Versioning page in the latest documentation.

Check cluster connectivity with nettest

nettest in GKE on Bare Metal identifies connectivity issues among the Kubernetes objects in your clusters, such as Pods, Nodes, Services, and some external targets. nettest doesn't check connections from external targets to Pods, Nodes, or Services. This document describes how to deploy and run nettest by using one of the manifests (nettest.yaml or nettest_rhel.yaml) in the anthos-samples GitHub repository. If you run GKE on Bare Metal on Red Hat Enterprise Linux (RHEL) or CentOS, use nettest_rhel.yaml. If you run GKE on Bare Metal on Ubuntu, use nettest.yaml.

This document also describes how to interpret the logs that nettest generates so that you can identify connectivity issues with your cluster.

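About `nettest`

The nettest diagnostic tool consists of the following Kubernetes objects. Each object is specified in the nettest YAML manifest file.

- cloudprober: a DaemonSet and Service used to collect network connectivity status, such as error rate and latency.
- echoserver: a DaemonSet and Service used to respond to cloudprober, which provides the network connectivity metrics.
- nettest: a Pod that contains the prometheus and nettest containers.
  - prometheus collects metrics from cloudprober.
  - nettest queries prometheus and displays the network test results in its log.
- nettest-engine: a ConfigMap that configures the nettest container in the nettest Pod.

The manifest also specifies a nettest namespace and a dedicated ServiceAccount (along with a ClusterRole and ClusterRoleBinding) to isolate nettest from other cluster resources.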

Run nettest

Run the following command for your operating system to deploy nettest. When the nettest Pod starts, the tests run automatically. The tests take about five minutes to complete.

For Ubuntu:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/anthos-samples/main/anthos-bm-utils/abm-nettest/nettest.yaml

For RHEL or CentOS:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/anthos-samples/main/anthos-bm-utils/abm-nettest/nettest_rhel.yaml

Get the test results

After the tests finish (about five minutes after you deploy the nettest manifest), run the following command to view the nettest results:

kubectl -n nettest logs nettest -c nettest

While nettest is running, it sends messages like the following to stdout:

I0413 03:33:04.879141       1 collectorui.go:130] Listening on ":8999"
I0413 03:33:04.879258       1 prometheus.go:172] Running prometheus controller
E0413 03:33:04.879628       1 prometheus.go:178] Prometheus controller: failed to
retries probers: Get "http://127.0.0.1:9090/api/v1/targets": dial tcp 127.0.0.1:9090:
connect: connection refused

When nettest completes successfully without identifying any connectivity errors, the log contains the following entry:

I0211 21:58:34.689290       1 validate_metrics.go:78] Metric validation passed!
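If you want to script the check for the success entry shown above, a minimal sketch might look like the following. The function name `nettest_passed` is hypothetical and not part of nettest; the sketch only assumes that the success entry contains the text `Metric validation passed!`, as in the sample log line.

```shell
# Hypothetical helper (not part of nettest): reads log text on stdin and
# exits with status 0 only if the success entry is present.
nettest_passed() {
  grep -qF 'Metric validation passed!'
}

# Usage (requires access to the cluster):
#   if kubectl -n nettest logs nettest -c nettest | nettest_passed; then
#     echo "nettest: no connectivity issues reported"
#   else
#     echo "nettest: not finished yet, or issues were reported"
#   fi
```

A non-zero exit status does not distinguish between a test that is still running and a test that reported issues, so read the full log in that case.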

If nettest detects connectivity issues, it writes log entries like the following:

E0211 06:40:11.948634       1 collector.go:65] Engine error: step validateMetrics failed:
"Error rate in percentage": probe from "10.200.0.3" to "172.26.115.210:80" has value 100.000000,
threshold is 1.000000
"Error rate in percentage": probe from "10.200.0.3" to "172.26.27.229:80" has value 100.000000,
threshold is 1.000000
"Error rate in percentage": probe from "192.168.3.248" to "echoserver-hostnetwork_10.200.0.2_8080"
has value 2.007046, threshold is 1.000000

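Although the default threshold is 1% (1.000000), you can safely ignore error rates of 5% or less. In the preceding example, the error rate for the connection from IP address 192.168.3.248 to echoserver-hostnetwork_10.200.0.2_8080 is about 2% (2.007046). This is an example of a reported connectivity issue that you can ignore.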
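To separate the entries that matter from the ones that fall under the 5% guideline, you could filter the log by error rate. The following is a sketch only: the function name `failing_probes` is hypothetical, and it assumes that the entries follow the `probe from "..." to "..." has value N` wording shown in the sample log. Because an entry can wrap across lines, as in the sample, the sketch joins the lines before matching.

```shell
# Hypothetical helper (not part of nettest): reads nettest log text on stdin
# and prints the probes whose error rate exceeds the given percentage.
# The threshold defaults to 5, following the guidance above.
failing_probes() {
  threshold="${1:-5}"
  # Join wrapped lines, pull out each "probe from ... has value N" fragment,
  # then keep only the fragments whose value is above the threshold.
  tr '\n' ' ' |
    grep -oE 'probe from "[^"]+" to "[^"]+" has value [0-9.]+' |
    awk -v t="$threshold" '{ if ($NF + 0 > t + 0) print $3, "->", $5, $NF "%" }'
}

# Usage (requires access to the cluster):
#   kubectl -n nettest logs nettest -c nettest | failing_probes 5
```

With the sample log above as input, this prints the two probes at 100% and omits the probe at about 2%.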

Interpret the test results

When nettest finishes and has found connectivity issues, the nettest Pod logs contain entries like the following:

"Error rate in percentage": probe from {src} to {dst} has value 100.000000, threshold is 1.000000

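Here, {src} and {dst} can each be one of the following:

- echoserver Pod IP: a connection to or from a Pod on a node.
- Node IP: a connection to or from a node.
- Service IP (for details, see the explanation that follows these lists).

In addition, {dst} can be one of the following:

- google.com: an external connection.
- dns: a connection through DNS to the non-hostNetwork Service, that is, echoserver-non-hostnetwork.nettest.svc.cluster.local.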
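To work out which category a reported IP address belongs to, you can look it up in the standard `kubectl get` listings. The following is a rough sketch, not an official procedure: the function name `whose_ip` is hypothetical, and it assumes that the echoserver Pods and the nettest Services live in the nettest namespace that the manifest creates.

```shell
# Hypothetical helper (not part of nettest): prints the node, Pod, or Service
# listing lines that contain the given IP address as a whole word.
whose_ip() {
  ip="$1"
  kubectl get nodes -o wide | grep -wF -- "$ip" | sed 's/^/node: /'
  kubectl -n nettest get pods -o wide | grep -wF -- "$ip" | sed 's/^/pod: /'
  kubectl -n nettest get services | grep -wF -- "$ip" | sed 's/^/service: /'
}

# Usage (requires access to the cluster):
#   whose_ip 172.26.27.229
```

Pods that use the host network share their node's IP address, so an address can appear under both the node and pod prefixes.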

Details for a Service IP are in the JSON-like probe entries, as in the following example. This probe example shows that 172.26.27.229:80 is the address of service-clusterip. Two probes have this targets value: one for Pods (pod-service-clusterip) and one for Nodes (node-service-clusterip).
```
probe {
  name: "node-service-clusterip"
  …
  targets {
    host_names: "172.26.27.229:80"
  }
}
```
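When several probes are involved, it can help to list each probe name next to its target. The following sketch does that for probe text in the form shown above. The function name `probe_targets` is hypothetical, and the sketch assumes that each `name:` and `host_names:` field sits on its own line, as in the example; where you obtain the probe text is left to you.

```shell
# Hypothetical helper (not part of nettest): reads probe entries on stdin and
# prints one "probe-name target" pair per host_names field.
probe_targets() {
  awk '
    $1 == "name:"       { gsub(/"/, "", $2); name = $2 }
    $1 == "host_names:" { gsub(/"/, "", $2); print name, $2 }
  '
}
```

For the example probe, the output is `node-service-clusterip 172.26.27.229:80`, which you can then match against the {src} and {dst} values in the error entries.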

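Verify the fix

After you address all reported connectivity issues, delete the nettest Pod and reapply the nettest manifest to run the connectivity tests again.

For example, to rerun nettest on Ubuntu, run the following commands: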

kubectl -n nettest delete pod nettest
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/anthos-samples/main/anthos-bm-utils/abm-nettest/nettest.yaml

Clean up `nettest`

When you finish testing, run the following commands to remove all nettest resources:

kubectl delete namespace nettest
kubectl delete clusterroles nettest:nettest
kubectl delete clusterrolebindings nettest:nettest