Diagnose cluster issues

The gkectl tool provides two commands for troubleshooting cluster issues: gkectl diagnose cluster and gkectl diagnose snapshot. Both commands work with admin clusters and user clusters. This document shows how to use the gkectl diagnose command to diagnose cluster issues.

For information about how to use the gkectl diagnose snapshot command to create snapshots that help Cloud Customer Care diagnose issues, see Create snapshots to diagnose clusters.

If you need additional assistance, reach out to Cloud Customer Care.

`gkectl diagnose cluster`

This command performs health checks on your cluster and reports errors. It runs health checks on the following components:

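vCenter
- Credentials
- DRS
- Anti-affinity groups
- Version
- Datacenter
- Datastore
- Resource pool
- Folder
- Network
Load balancer (F5, Seesaw, or manual)
User cluster and node pools
Cluster objects
Konnectivity server readiness of the user cluster
Machine objects and the corresponding cluster nodes
Pods in the kube-system and gke-system namespaces
Control plane
vSphere persistent volumes in the cluster
User and admin cluster vCPU (virtual CPU) and memory contention signals
User and admin cluster ESXi preconfigured host CPU usage and memory usage alarms
Time of day (TOD)
Node network policy for a cluster with Dataplane V2 enabled
Overall healthiness of the Dataplane V2 node agent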

Diagnose an admin cluster

To diagnose an admin cluster, specify the path to the admin cluster kubeconfig file:

gkectl diagnose cluster --kubeconfig=ADMIN_CLUSTER_KUBECONFIG

Replace ADMIN_CLUSTER_KUBECONFIG with the path of your admin cluster kubeconfig file.

The following example output is returned from the gkectl diagnose cluster command:

Preparing for the diagnose tool...
Diagnosing the cluster......DONE

- Validation Category: Admin Cluster Connectivity
Checking VMs TOD (availability)...SUCCESS
Checking Konnectivity Server (readiness)...SUCCESS

- Validation Category: Admin Cluster F5 BIG-IP
Checking f5 (credentials, partition)...SUCCESS

- Validation Category: Admin Cluster VCenter
Checking Credentials...SUCCESS
Checking DRS enabled...SUCCESS
Checking Hosts for AntiAffinityGroups...SUCCESS
Checking Version...SUCCESS
Checking Datacenter...SUCCESS
Checking Datastore...SUCCESS
Checking Resource pool...SUCCESS
Checking Folder...SUCCESS
Checking Network...SUCCESS

- Validation Category: Admin Cluster
Checking cluster object...SUCCESS
Checking machine deployment...SUCCESS
Checking machineset...SUCCESS
Checking machine objects...SUCCESS
Checking kube-system pods...SUCCESS
Checking anthos-identity-service pods...SUCCESS
Checking storage...SUCCESS
Checking resource...SUCCESS
Checking virtual machine resource contention...SUCCESS
Checking host resource contention...SUCCESS
All validation results were SUCCESS.
Cluster is healthy!

If there's an issue with a virtual IP address (VIP) in the target cluster, use the --config flag to provide the admin cluster configuration file and get more debugging information.

gkectl diagnose cluster --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config CLUSTER_CONFIG

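Replace CLUSTER_CONFIG with the path of the admin or user cluster configuration file.

The following example output shows that the gkectl diagnose cluster command can now connect to the cluster correctly and check for issues: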

Failed to access the api server via LB VIP "...": ...
Try to use the admin master IP instead of problematic VIP...
Reading config with version "[CONFIG_VERSION]"
Finding the admin master VM...
Fetching the VMs in the resource pool "[RESOURCE_POOL_NAME]"...
Found the "[ADMIN_MASTER_VM_NAME]" is the admin master VM.
Diagnosing admin|user cluster "[TARGET_CLUSTER_NAME]"...
...

Diagnose a user cluster

To diagnose a user cluster, you must specify the user cluster name. If you need to get the name of a user cluster, run the following command:

kubectl get cluster --kubeconfig=USER_CLUSTER_KUBECONFIG

Replace USER_CLUSTER_KUBECONFIG with the path of the user cluster kubeconfig file.

Specify the admin cluster kubeconfig file and the name of the user cluster as follows:

gkectl diagnose cluster --kubeconfig=ADMIN_CLUSTER_KUBECONFIG \
    --cluster-name=USER_CLUSTER_NAME

Replace USER_CLUSTER_NAME with the name of the user cluster.

The following example output is returned from the gkectl diagnose cluster command:

Preparing for the diagnose tool...
Diagnosing the cluster......DONE

Diagnose result is saved successfully in <DIAGNOSE_REPORT_JSON_FILE>

- Validation Category: User Cluster Connectivity
Checking Node Network Policy...SUCCESS
Checking VMs TOD (availability)...SUCCESS
Checking Dataplane-V2...Success

- Validation Category: User Cluster F5 BIG-IP
Checking f5 (credentials, partition)...SUCCESS

- Validation Category: User Cluster VCenter
Checking Credentials...SUCCESS
Checking DRS enabled...SUCCESS
Checking Hosts for AntiAffinityGroups...SUCCESS
Checking VSphere CSI Driver...SUCCESS
Checking Version...SUCCESS
Checking Datacenter...SUCCESS
Checking Datastore...SUCCESS
Checking Resource pool...SUCCESS
Checking Folder...SUCCESS
Checking Network...SUCCESS

- Validation Category: User Cluster
Checking user cluster and node pools...SUCCESS
Checking cluster object...SUCCESS
Checking machine deployment...SUCCESS
Checking machineset...SUCCESS
Checking machine objects...SUCCESS
Checking control plane pods...SUCCESS
Checking kube-system pods...SUCCESS
Checking gke-system pods...SUCCESS
Checking gke-connect pods...SUCCESS
Checking anthos-identity-service pods...SUCCESS
Checking storage...SUCCESS
Checking resource...SUCCESS
Checking virtual machine resource contention...SUCCESS
Checking host resource contention...SUCCESS
All validation results were SUCCESS.
Cluster is healthy!
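
The name lookup and the diagnose call above can be chained in a small shell function. The following is a minimal sketch, not part of gkectl: the function name diagnose_user_cluster is hypothetical, the -o jsonpath output option is an assumption added to the kubectl get cluster command shown earlier, and the sketch assumes that the lookup returns exactly one cluster object.

```shell
# Hypothetical helper: look up the user cluster name, then diagnose it.
# $1: path of the admin cluster kubeconfig file
# $2: path of the user cluster kubeconfig file
diagnose_user_cluster() {
  admin_kubeconfig="$1"
  user_kubeconfig="$2"

  # Assumption: the first returned cluster object is the user cluster.
  cluster_name="$(kubectl get cluster --kubeconfig="$user_kubeconfig" \
    -o jsonpath='{.items[0].metadata.name}')" || return 1

  if [ -z "$cluster_name" ]; then
    echo "No cluster object found" >&2
    return 1
  fi

  # Same command as in the section above, with the looked-up name.
  gkectl diagnose cluster --kubeconfig="$admin_kubeconfig" \
    --cluster-name="$cluster_name"
}
```

For example, diagnose_user_cluster ADMIN_CLUSTER_KUBECONFIG USER_CLUSTER_KUBECONFIG runs both steps with the same placeholders used above.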

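Diagnose virtual machine status

If an issue arises with virtual machine creation, run gkectl diagnose cluster to obtain a diagnosis of the virtual machine status.

The output is similar to the following: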


- Validation Category: Cluster Healthiness
Checking cluster object...SUCCESS
Checking machine deployment...SUCCESS
Checking machineset...SUCCESS
Checking machine objects...SUCCESS
Checking machine VMs...FAILURE
    Reason: 1 machine VMs error(s).
    Unhealthy Resources:
    Machine [NODE_NAME]: The VM's UUID "420fbe5c-4c8b-705a-8a05-ec636406f60" does not match the machine object's providerID "420fbe5c-4c8b-705a-8a05-ec636406f60e".
    Debug Information:
    null
...
Exit with error:
Cluster is unhealthy!
Run gkectl diagnose cluster automatically in gkectl diagnose snapshot
Public page https://cloud.google.com/anthos/clusters/docs/on-prem/latest/diagnose#overview_diagnose_snapshot
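
When the output is long, it can help to show only the failed checks. The following is a minimal sketch that assumes the output has been saved to a file and that it follows the layout shown above: a line of the form "Checking ...FAILURE" followed by indented detail lines. The function name failed_checks is hypothetical.

```shell
# Print each failed check and its indented detail lines.
# Reads saved gkectl diagnose cluster output on standard input.
failed_checks() {
  awk '
    # A failed check starts a block that we want to print.
    /^Checking .*\.\.\.FAILURE$/ { in_failure = 1; print; next }
    # Indented lines right after a failed check are its details.
    /^[ \t]+[^ \t]/ && in_failure { print; next }
    # Any other line ends the current block.
    { in_failure = 0 }
  '
}
```

For example, failed_checks < diagnose-output.txt prints the "Checking machine VMs...FAILURE" line and its Reason, Unhealthy Resources, and Debug Information details. The file name diagnose-output.txt is illustrative.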

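Troubleshoot

The following table outlines some possible resolutions for issues with running the gkectl diagnose cluster command:

Issue	Possible causes	Resolution
The Kubernetes API server is not reachable, either for the admin cluster or for user clusters.	Check the out-of-box (OOB) memory latency graphs for virtual machine health. Ideally, memory latency is close to zero. Memory contention can also increase CPU contention, and the CPU readiness graphs might spike as swapping increases.	Increase physical memory. For other options, see the VMware troubleshooting suggestions.
Node pool creation times out.	VMDK read and write latency is high. Check the VM health OOB graphs for virtual disk read and write latency. According to VMware, a total latency greater than 20 ms indicates a problem.	See the VMware solutions for disk performance problems.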

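`BundleUnexpectedDiff` errors

Kubernetes Cluster API resources managed by a GKE on VMware bundle might be modified accidentally, which can cause system components to fail or cluster upgrades and updates to fail.

In GKE on VMware version 1.13 and later, the onprem-user-cluster-controller periodically checks the status of objects and reports any unexpected differences from the desired state through logs and events. These objects include the user cluster control plane and add-ons such as Services and DaemonSets.

The following example output shows an unexpected difference event: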

 Type     Reason                 Age    From                              Message
 ----     ------                 ----   ----                              -------
 Warning  BundleUnexpectedDiff   13m    onpremusercluster/ci-bundle-diff  Detected unexpected difference of user control plane objects: [ConfigMap/istio], please check onprem-user-cluster-controller logs for more details.

The following example output shows logs generated by the onprem-user-cluster-controller:

2022-08-06T02:54:42.701352295Z W0806 02:54:42.701252       1 update.go:206] Detected unexpected difference of user addon object(ConfigMap/istio), Diff:   map[string]string{
2022-08-06T02:54:42.701376406Z -    "mesh": (
2022-08-06T02:54:42.701381190Z -        """
2022-08-06T02:54:42.701385438Z -        defaultConfig:
2022-08-06T02:54:42.701389350Z -          discoveryAddress: istiod.gke-system.svc:15012
...
2022-08-06T02:54:42.701449954Z -        """
2022-08-06T02:54:42.701453099Z -    ),
2022-08-06T02:54:42.701456286Z -    "meshNetworks": "networks: {}",
2022-08-06T02:54:42.701459304Z +    "test-key":     "test-data",
2022-08-06T02:54:42.701462434Z   }

The events and logs don't block cluster operations. Objects that have unexpected differences from their desired state are overwritten in the next cluster upgrade.
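
To see at a glance which objects were reported, you can extract their names from saved controller logs. The following is a minimal sketch that assumes the log lines follow the "Detected unexpected difference of ... object(Kind/name)" pattern shown in the log example above. The function name unexpected_diff_objects is hypothetical, and how you collect the logs is left open.

```shell
# List the unique Kind/name pairs reported in onprem-user-cluster-controller
# logs. Reads saved log lines on standard input.
unexpected_diff_objects() {
  # Capture the text between the parentheses that follow the message prefix,
  # then remove duplicate entries.
  sed -n 's/.*Detected unexpected difference of [^(]*(\([^)]*\)).*/\1/p' |
    sort -u
}
```

Run against the log example above, this prints ConfigMap/istio. The event message uses a different layout (a bracketed list), which this sketch does not parse.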

What's next