Diagnose cluster issues

The gkectl tool has two commands for troubleshooting issues with clusters: gkectl diagnose cluster and gkectl diagnose snapshot. The commands work with both admin and user clusters. This document shows how to use the gkectl diagnose cluster command to diagnose issues in your clusters.

For more information about how to use the gkectl diagnose snapshot command to create snapshots that can help Cloud Customer Care diagnose issues, see Create snapshots to diagnose clusters.

If you need additional assistance, reach out to Cloud Customer Care.

gkectl diagnose cluster

This command performs health checks on your cluster and reports errors. The command runs health checks on the following components:

  • vCenter
    • Credential
    • DRS
    • Anti-affinity groups
    • Version
    • Datacenter
    • Datastore
    • ResourcePool
    • Folder
    • Network
  • Load balancer (F5, Seesaw, or Manual)
  • User cluster and node pools
  • Cluster objects
  • Konnectivity server readiness of the user cluster
  • Machine objects and the corresponding cluster nodes
  • Pods in the kube-system and gke-system namespaces
  • Control plane
  • vSphere persistent volumes in the cluster
  • User and admin cluster vCPU (virtual CPU) and memory contention signals
  • User and admin cluster ESXi preconfigured Host CPU Usage and Memory Usage alarms
  • Time of day (TOD)
  • Node network policy for a cluster with Dataplane V2 enabled
  • Overall health of the Dataplane V2 node agent

Diagnose an admin cluster

To diagnose an admin cluster, specify the path to your admin cluster kubeconfig file:

gkectl diagnose cluster --kubeconfig=ADMIN_CLUSTER_KUBECONFIG

Replace ADMIN_CLUSTER_KUBECONFIG with the path of your admin cluster kubeconfig file.
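For example, if the kubeconfig file generated during installation is at /home/ubuntu/admin-cluster/kubeconfig (a hypothetical path), the invocation looks like the following:

gkectl diagnose cluster --kubeconfig=/home/ubuntu/admin-cluster/kubeconfig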

The following example output is returned from the gkectl diagnose cluster command:

Preparing for the diagnose tool...
Diagnosing the cluster......DONE

- Validation Category: Admin Cluster Connectivity
Checking VMs TOD (availability)...SUCCESS
Checking Konnectivity Server (readiness)...SUCCESS

- Validation Category: Admin Cluster F5 BIG-IP
Checking f5 (credentials, partition)...SUCCESS

- Validation Category: Admin Cluster VCenter
Checking Credentials...SUCCESS
Checking DRS enabled...SUCCESS
Checking Hosts for AntiAffinityGroups...SUCCESS
Checking Version...SUCCESS
Checking Datacenter...SUCCESS
Checking Datastore...SUCCESS
Checking Resource pool...SUCCESS
Checking Folder...SUCCESS
Checking Network...SUCCESS

- Validation Category: Admin Cluster
Checking cluster object...SUCCESS
Checking machine deployment...SUCCESS
Checking machineset...SUCCESS
Checking machine objects...SUCCESS
Checking kube-system pods...SUCCESS
Checking anthos-identity-service pods...SUCCESS
Checking storage...SUCCESS
Checking resource...SUCCESS
Checking virtual machine resource contention...SUCCESS
Checking host resource contention...SUCCESS
All validation results were SUCCESS.
Cluster is healthy!

If there's an issue with a virtual IP address (VIP) in the target cluster, use the --config flag to pass the cluster configuration file, which gives the command more debugging information:

gkectl diagnose cluster --kubeconfig=ADMIN_CLUSTER_KUBECONFIG --config=CLUSTER_CONFIG

Replace CLUSTER_CONFIG with the path of the admin or user cluster configuration file.

The following example output shows how the gkectl diagnose cluster command bypasses the problematic VIP, connects by using the admin control-plane node IP address instead, and checks the cluster for issues:

Failed to access the api server via LB VIP "...": ...
Try to use the admin master IP instead of problematic VIP...
Reading config with version "[CONFIG_VERSION]"
Finding the admin master VM...
Fetching the VMs in the resource pool "[RESOURCE_POOL_NAME]"...
Found the "[ADMIN_MASTER_VM_NAME]" is the admin master VM.
Diagnosing admin|user cluster "[TARGET_CLUSTER_NAME]"...
...

Diagnose a user cluster

To diagnose a user cluster, you must specify the user cluster name. If you need to get the name of a user cluster, run the following command:

kubectl get cluster --kubeconfig=USER_CLUSTER_KUBECONFIG

Replace USER_CLUSTER_KUBECONFIG with the path of the user cluster kubeconfig file.
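The output lists the user clusters that the kubeconfig file can reach, similar to the following (the cluster name and age shown here are hypothetical):

NAME              AGE
my-user-cluster   67d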

Specify the name of the user cluster along with the admin cluster kubeconfig file, as follows:

gkectl diagnose cluster --kubeconfig=ADMIN_CLUSTER_KUBECONFIG \
    --cluster-name=USER_CLUSTER_NAME

Replace USER_CLUSTER_NAME with the name of the user cluster.

The following example output is returned from the gkectl diagnose cluster command:

Preparing for the diagnose tool...
Diagnosing the cluster......DONE

Diagnose result is saved successfully in <DIAGNOSE_REPORT_JSON_FILE>

- Validation Category: User Cluster Connectivity
Checking Node Network Policy...SUCCESS
Checking VMs TOD (availability)...SUCCESS
Checking Dataplane-V2...Success

- Validation Category: User Cluster F5 BIG-IP
Checking f5 (credentials, partition)...SUCCESS

- Validation Category: User Cluster VCenter
Checking Credentials...SUCCESS
Checking DRS enabled...SUCCESS
Checking Hosts for AntiAffinityGroups...SUCCESS
Checking VSphere CSI Driver...SUCCESS
Checking Version...SUCCESS
Checking Datacenter...SUCCESS
Checking Datastore...SUCCESS
Checking Resource pool...SUCCESS
Checking Folder...SUCCESS
Checking Network...SUCCESS

- Validation Category: User Cluster
Checking user cluster and node pools...SUCCESS
Checking cluster object...SUCCESS
Checking machine deployment...SUCCESS
Checking machineset...SUCCESS
Checking machine objects...SUCCESS
Checking control plane pods...SUCCESS
Checking kube-system pods...SUCCESS
Checking gke-system pods...SUCCESS
Checking gke-connect pods...SUCCESS
Checking anthos-identity-service pods...SUCCESS
Checking storage...SUCCESS
Checking resource...SUCCESS
Checking virtual machine resource contention...SUCCESS
Checking host resource contention...SUCCESS
All validation results were SUCCESS.
Cluster is healthy!
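The diagnose report noted near the top of the output is saved as a JSON file. The report's schema isn't documented here, but you can pretty-print the file for inspection with jq; the file name below is a hypothetical placeholder:

jq . diagnose-report.json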

Diagnose virtual machine status

If an issue arises with virtual machine creation, run gkectl diagnose cluster to obtain a diagnosis of the virtual machine status.

The output is similar to the following:

- Validation Category: Cluster Healthiness
Checking cluster object...SUCCESS
Checking machine deployment...SUCCESS
Checking machineset...SUCCESS
Checking machine objects...SUCCESS
Checking machine VMs...FAILURE
    Reason: 1 machine VMs error(s).
    Unhealthy Resources:
    Machine [NODE_NAME]: The VM's UUID "420fbe5c-4c8b-705a-8a05-ec636406f60" does not match the machine object's providerID "420fbe5c-4c8b-705a-8a05-ec636406f60e".
    Debug Information:
    null
...
Exit with error:
Cluster is unhealthy!
Run gkectl diagnose cluster automatically in gkectl diagnose snapshot
Public page https://cloud.google.com/anthos/clusters/docs/on-prem/latest/diagnose#overview_diagnose_snapshot
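When the check reports a UUID mismatch like the one above, you can compare the two values yourself. The following sketch is an assumption-based example rather than a documented procedure: it assumes kubectl access to the cluster that holds the Machine objects and a configured govc client, and NODE_NAME is a placeholder:

# Read the providerID recorded on the Machine object (NODE_NAME is a placeholder).
kubectl get machine NODE_NAME --kubeconfig=ADMIN_CLUSTER_KUBECONFIG \
    -o jsonpath='{.spec.providerID}'

# Read the UUID that vSphere reports for the backing VM.
govc vm.info NODE_NAME | grep -i uuid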

Troubleshoot

The following table outlines some possible resolutions for problems with running the gkectl diagnose cluster command:

Issue: Kubernetes API server is not reachable, either for the admin cluster or for user clusters.
Possible causes: Memory contention on the node VMs. Check the virtual machine health OOB (out-of-box) memory latency graphs, which ideally show memory latency around zero. Memory contention can also increase CPU contention, and the CPU readiness graphs might show a spike because swapping is involved.
Resolution: Increase physical memory. For other options, see the VMware troubleshooting suggestions.

Issue: Node pool creation times out.
Possible causes: High VMDK read/write latency. Check the VM health OOB graphs for virtual disk read and write latency. According to VMware, a total latency greater than 20 ms indicates a problem.
Resolution: See the VMware solutions for disk performance problems.
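If you prefer the command line to the vSphere performance graphs, you can sample similar counters with govc. This is a sketch only: it assumes govc is installed and configured (GOVC_URL and credentials set), VM_NAME is a placeholder, and counter names can vary by vSphere version:

# Sample recent memory latency for a node VM (ideally near zero).
govc metric.sample -n 12 VM_NAME mem.latency.average

# Sample virtual disk read/write latency (VMware flags >20 ms as a problem).
govc metric.sample -n 12 VM_NAME virtualDisk.totalReadLatency.average virtualDisk.totalWriteLatency.average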

BundleUnexpectedDiff error

A Kubernetes Cluster API resource managed by a GKE on VMware bundle might be accidentally modified, which can cause system components to fail, or cause a cluster upgrade or update to fail.

In GKE on VMware version 1.13 and later, the onprem-user-cluster-controller periodically checks the status of objects and reports any unexpected differences from the desired state through logs and events. These objects include the user cluster control plane and add-ons such as Services and DaemonSets.

The following example output shows an unexpected difference event:

 Type     Reason                 Age    From                              Message
 ----     ------                 ----   ----                              -------
 Warning  BundleUnexpectedDiff   13m    onpremusercluster/ci-bundle-diff  Detected unexpected difference of user control plane objects: [ConfigMap/istio], please check onprem-user-cluster-controller logs for more details.
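To find these events yourself, you can filter on the event reason. The following command is a sketch that assumes the events surface in the admin cluster:

kubectl get events --kubeconfig=ADMIN_CLUSTER_KUBECONFIG \
    --all-namespaces --field-selector reason=BundleUnexpectedDiff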

The following example output shows logs generated by the onprem-user-cluster-controller:

2022-08-06T02:54:42.701352295Z W0806 02:54:42.701252       1 update.go:206] Detected unexpected difference of user addon object(ConfigMap/istio), Diff:   map[string]string{
2022-08-06T02:54:42.701376406Z -    "mesh": (
2022-08-06T02:54:42.701381190Z -        """
2022-08-06T02:54:42.701385438Z -        defaultConfig:
2022-08-06T02:54:42.701389350Z -          discoveryAddress: istiod.gke-system.svc:15012
...
2022-08-06T02:54:42.701449954Z -        """
2022-08-06T02:54:42.701453099Z -    ),
2022-08-06T02:54:42.701456286Z -    "meshNetworks": "networks: {}",
2022-08-06T02:54:42.701459304Z +    "test-key":     "test-data",
2022-08-06T02:54:42.701462434Z   }
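To retrieve these logs yourself, a command like the following can work; it assumes the controller runs as a Deployment named onprem-user-cluster-controller in the kube-system namespace of the admin cluster, so verify the name and namespace in your environment:

kubectl logs --kubeconfig=ADMIN_CLUSTER_KUBECONFIG \
    -n kube-system deployment/onprem-user-cluster-controller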

The events and logs won't block cluster operation. Objects that have unexpected differences from their desired state are overwritten in the next cluster upgrade.

What's next

If you need additional assistance, reach out to Cloud Customer Care.