The gkectl tool has two commands for troubleshooting issues with clusters: gkectl diagnose cluster and gkectl diagnose snapshot. The commands work with both admin and user clusters. This document shows how to use the gkectl diagnose command to diagnose issues in your clusters.
For more information about how to use the gkectl diagnose snapshot command to create snapshots that can help Cloud Customer Care diagnose issues, see Create snapshots to diagnose clusters.
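For quick reference, here is a minimal sketch of the two invocations; the kubeconfig value is a placeholder, and the exact flags you need depend on your environment:

# Run health checks against a cluster and report any errors.
gkectl diagnose cluster --kubeconfig=ADMIN_CLUSTER_KUBECONFIG

# Collect cluster state into a snapshot that Cloud Customer Care can review.
gkectl diagnose snapshot --kubeconfig=ADMIN_CLUSTER_KUBECONFIG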
gkectl diagnose cluster
This command performs health checks on your cluster and reports errors. The command runs health checks on the following components:
- vCenter
  - Credential
  - DRS
  - Anti-affinity groups
  - Network
  - Version
  - Datacenter
  - Datastore
  - ResourcePool
  - Folder
  - Network
- Load balancer (F5, Seesaw, or Manual)
- User cluster and node pools
- Cluster objects
- Konnectivity server readiness of the user cluster
- Machine objects and the corresponding cluster nodes
- Pods in the kube-system and gke-system namespaces
- Control plane
- vSphere persistent volumes in the cluster
- User and admin cluster vCPU (virtual CPU) and memory contention signals
- User and admin cluster ESXi preconfigured Host CPU Usage and Memory Usage alarms
- Time of day (TOD)
- Node network policy for a cluster with Dataplane V2 enabled
- Overall healthiness of the Dataplane V2 node agent
Diagnose an admin cluster
To diagnose an admin cluster, specify the path to your admin cluster kubeconfig file:
gkectl diagnose cluster --kubeconfig=ADMIN_CLUSTER_KUBECONFIG
Replace ADMIN_CLUSTER_KUBECONFIG with the path of your admin cluster kubeconfig file.
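For example, if your admin cluster kubeconfig file is stored at /home/ubuntu/admin-kubeconfig (a hypothetical path used only for illustration), the command looks like this:

gkectl diagnose cluster --kubeconfig=/home/ubuntu/admin-kubeconfig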
The following example output is returned from the gkectl diagnose cluster command:
Preparing for the diagnose tool...
Diagnosing the cluster......DONE
- Validation Category: Admin Cluster Connectivity
Checking VMs TOD (availability)...SUCCESS
Checking Konnectivity Server (readiness)...SUCCESS
- Validation Category: Admin Cluster F5 BIG-IP
Checking f5 (credentials, partition)...SUCCESS
- Validation Category: Admin Cluster VCenter
Checking Credentials...SUCCESS
Checking DRS enabled...SUCCESS
Checking Hosts for AntiAffinityGroups...SUCCESS
Checking Version...SUCCESS
Checking Datacenter...SUCCESS
Checking Datastore...SUCCESS
Checking Resource pool...SUCCESS
Checking Folder...SUCCESS
Checking Network...SUCCESS
- Validation Category: Admin Cluster
Checking cluster object...SUCCESS
Checking machine deployment...SUCCESS
Checking machineset...SUCCESS
Checking machine objects...SUCCESS
Checking kube-system pods...SUCCESS
Checking anthos-identity-service pods...SUCCESS
Checking storage...SUCCESS
Checking resource...SUCCESS
Checking virtual machine resource contention...SUCCESS
Checking host resource contention...SUCCESS
All validation results were SUCCESS.
Cluster is healthy!
If there's an issue with a virtual IP address (VIP) in the target cluster, use the --config flag to pass the admin cluster configuration file and get more debugging information:
gkectl diagnose cluster --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config CLUSTER_CONFIG
Replace CLUSTER_CONFIG with the path of the admin or user cluster configuration file.
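For example, assuming hypothetical file names admin-kubeconfig and admin-cluster.yaml in the current directory, the command looks like this:

gkectl diagnose cluster --kubeconfig admin-kubeconfig --config admin-cluster.yaml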
The following example output shows that the gkectl diagnose cluster command can now correctly connect to the cluster and check for issues:
Failed to access the api server via LB VIP "...": ...
Try to use the admin master IP instead of problematic VIP...
Reading config with version "[CONFIG_VERSION]"
Finding the admin master VM...
Fetching the VMs in the resource pool "[RESOURCE_POOL_NAME]"...
Found the "[ADMIN_MASTER_VM_NAME]" is the admin master VM.
Diagnosing admin|user cluster "[TARGET_CLUSTER_NAME]"...
...
Diagnose a user cluster
To diagnose a user cluster, you must specify the user cluster name. If you need to get the name of a user cluster, run the following command:
kubectl get cluster --kubeconfig=USER_CLUSTER_KUBECONFIG
Replace USER_CLUSTER_KUBECONFIG with the path of the user cluster kubeconfig file.
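For example, with a hypothetical kubeconfig path, the command and a representative response might look like the following (the exact columns can vary by version):

kubectl get cluster --kubeconfig=/home/ubuntu/user-cluster-kubeconfig

NAME              AGE
my-user-cluster   27d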
Specify the name of the user cluster along with the admin cluster kubeconfig file, as follows:
gkectl diagnose cluster --kubeconfig=ADMIN_CLUSTER_KUBECONFIG \
    --cluster-name=USER_CLUSTER_NAME
Replace USER_CLUSTER_NAME with the name of the user cluster.
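For example, if the earlier kubectl get cluster command returned a user cluster named my-user-cluster (a hypothetical name), you would run:

gkectl diagnose cluster --kubeconfig=ADMIN_CLUSTER_KUBECONFIG \
    --cluster-name=my-user-cluster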
The following example output is returned from the gkectl diagnose cluster command:
Preparing for the diagnose tool...
Diagnosing the cluster......DONE
Diagnose result is saved successfully in <DIAGNOSE_REPORT_JSON_FILE>
- Validation Category: User Cluster Connectivity
Checking Node Network Policy...SUCCESS
Checking VMs TOD (availability)...SUCCESS
Checking Dataplane-V2...Success
- Validation Category: User Cluster F5 BIG-IP
Checking f5 (credentials, partition)...SUCCESS
- Validation Category: User Cluster VCenter
Checking Credentials...SUCCESS
Checking DRS enabled...SUCCESS
Checking Hosts for AntiAffinityGroups...SUCCESS
Checking VSphere CSI Driver...SUCCESS
Checking Version...SUCCESS
Checking Datacenter...SUCCESS
Checking Datastore...SUCCESS
Checking Resource pool...SUCCESS
Checking Folder...SUCCESS
Checking Network...SUCCESS
- Validation Category: User Cluster
Checking user cluster and node pools...SUCCESS
Checking cluster object...SUCCESS
Checking machine deployment...SUCCESS
Checking machineset...SUCCESS
Checking machine objects...SUCCESS
Checking control plane pods...SUCCESS
Checking kube-system pods...SUCCESS
Checking gke-system pods...SUCCESS
Checking gke-connect pods...SUCCESS
Checking anthos-identity-service pods...SUCCESS
Checking storage...SUCCESS
Checking resource...SUCCESS
Checking virtual machine resource contention...SUCCESS
Checking host resource contention...SUCCESS
All validation results were SUCCESS.
Cluster is healthy!
Diagnose virtual machine status
If an issue arises with virtual machine creation, run gkectl diagnose cluster to obtain a diagnosis of the virtual machine status.
The output is similar to the following:
- Validation Category: Cluster Healthiness
Checking cluster object...SUCCESS
Checking machine deployment...SUCCESS
Checking machineset...SUCCESS
Checking machine objects...SUCCESS
Checking machine VMs...FAILURE
Reason: 1 machine VMs error(s).
Unhealthy Resources:
Machine [NODE_NAME]: The VM's UUID "420fbe5c-4c8b-705a-8a05-ec636406f60" does not match the machine object's providerID "420fbe5c-4c8b-705a-8a05-ec636406f60e".
Debug Information:
null
...
Exit with error:
Cluster is unhealthy!
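If the check reports a machine VM failure like the one above, one way to start investigating is to compare the Machine objects with what gkectl reported. The following sketch uses standard kubectl commands; the kubeconfig and node name are placeholders, and the cluster that owns the Machine objects depends on your version and topology:

# List the Machine objects known to the cluster.
kubectl get machines --kubeconfig=ADMIN_CLUSTER_KUBECONFIG -o wide

# Inspect one Machine, including its providerID, to compare against the
# VM UUID reported by gkectl diagnose cluster.
kubectl describe machine NODE_NAME --kubeconfig=ADMIN_CLUSTER_KUBECONFIG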
Run gkectl diagnose cluster automatically in gkectl diagnose snapshot
When you run the gkectl diagnose snapshot command, gkectl diagnose cluster runs automatically as part of creating the snapshot. For more information, see https://cloud.google.com/anthos/clusters/docs/on-prem/latest/diagnose#overview_diagnose_snapshot.
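A typical snapshot invocation looks like the following sketch; the flags shown are placeholders, and the linked page describes the full set of options:

gkectl diagnose snapshot --kubeconfig=ADMIN_CLUSTER_KUBECONFIG \
    --cluster-name=USER_CLUSTER_NAME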
Troubleshoot
The following table outlines some possible resolutions for problems with running the gkectl diagnose cluster command:
Issue | Possible causes | Resolution |
---|---|---|
Kubernetes API server is not reachable, either for the admin cluster, or for user clusters. | Check the virtual machine health OOB (out-of-box) memory latency graphs, which ideally should have a memory latency around zero. Memory contention can also increase CPU contention, and the CPU readiness graphs might have a spike as there will be swapping involved. | Increase physical memory. For other options, see VMware troubleshooting suggestions. |
Nodepool creation times out. | VMDK high read/write latency. Check VM health OOB for virtual disk read and write latency. According to VMware, a total latency greater than 20ms indicates a problem. | See VMware solutions for disk performance problems. |
BundleUnexpectedDiff error
A Kubernetes Cluster API resource managed by a Google Distributed Cloud bundle might be accidentally modified, which can cause system components to fail, or can cause a cluster upgrade or update to fail.
In Google Distributed Cloud version 1.13 and later, the onprem-user-cluster-controller periodically checks the status of objects and reports any unexpected differences from the desired state through logs and events. These objects include the user cluster control plane and add-ons such as Services and DaemonSets.
The following example output shows an unexpected difference event:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BundleUnexpectedDiff 13m onpremusercluster/ci-bundle-diff Detected unexpected difference of user control plane objects: [ConfigMap/istio], please check onprem-user-cluster-controller logs for more details.
The following example output shows logs generated by the onprem-user-cluster-controller:
2022-08-06T02:54:42.701352295Z W0806 02:54:42.701252 1 update.go:206] Detected unexpected difference of user addon object(ConfigMap/istio), Diff: map[string]string{
2022-08-06T02:54:42.701376406Z - "mesh": (
2022-08-06T02:54:42.701381190Z - """
2022-08-06T02:54:42.701385438Z - defaultConfig:
2022-08-06T02:54:42.701389350Z - discoveryAddress: istiod.gke-system.svc:15012
...
2022-08-06T02:54:42.701449954Z - """
2022-08-06T02:54:42.701453099Z - ),
2022-08-06T02:54:42.701456286Z - "meshNetworks": "networks: {}",
2022-08-06T02:54:42.701459304Z + "test-key": "test-data",
2022-08-06T02:54:42.701462434Z }
The events and logs won't block cluster operation. Objects that have unexpected differences from their desired state are overwritten in the next cluster upgrade.
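To look for these warnings yourself, you can query events and read the controller logs with standard kubectl commands. The following sketch assumes the onprem-user-cluster-controller runs in the admin cluster in a namespace named after the user cluster; both the namespace and the controller Deployment name are assumptions, so adjust them for your setup:

# Search all namespaces in the admin cluster for BundleUnexpectedDiff events.
kubectl get events -A --kubeconfig=ADMIN_CLUSTER_KUBECONFIG | grep BundleUnexpectedDiff

# Read the controller logs for details of the detected difference
# (namespace and Deployment name are assumptions).
kubectl logs deployment/onprem-user-cluster-controller -n USER_CLUSTER_NAME \
    --kubeconfig=ADMIN_CLUSTER_KUBECONFIG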