This page explains how to use the gkectl
command-line interface (CLI) tool to
diagnose issues in your GKE on-prem clusters.
Overview
The gkectl tool has two commands for troubleshooting issues with clusters: gkectl diagnose cluster and gkectl diagnose snapshot. The commands work with both admin and user clusters.
gkectl diagnose cluster
Performs health checks on your GKE on-prem cluster and reports errors. The checks cover the following components:
- Cluster objects
- Machine objects and the corresponding cluster nodes
- Pods in the kube-system and gke-system namespaces
- User control plane if the target cluster is a user cluster
- vSphere persistent volumes in the cluster
gkectl diagnose snapshot
Compresses a cluster's status, configurations, and logs into a tarball file. Specifically, the default configuration of the command captures the following information about your cluster:
- Kubernetes version
- Status of Kubernetes resources in the kube-system and gke-system namespaces: cluster, machine, nodes, Services, Endpoints, ConfigMaps, ReplicaSets, CronJobs, Pods, and the owners of those Pods, including Deployments, DaemonSets, and StatefulSets
- Status of the user control plane if the target cluster is a user cluster (the user cluster's control plane runs in the admin cluster)
- Details about each node configuration including IP addresses, iptables rules, mount points, file system, network connections, and running processes
- vSphere information including VM objects and their Events based on Resource Pool. Also Datacenter, Cluster, Network, and Datastore objects associated with VMs
- F5 BIG-IP load balancer information including virtual server, virtual address, pool, node, and monitor
- Logs from the gkectl diagnose snapshot command
- Optionally, the GKE on-prem configuration file used to install and upgrade the cluster
Credentials, including vSphere and F5 credentials, are removed before the tarball is created.
Diagnosing clusters
You can run gkectl diagnose cluster to look for common issues with your cluster.
Diagnosing an admin cluster
You can diagnose an admin cluster by passing in its name or by only passing in its kubeconfig.
Using admin cluster kubeconfig
Passing in the admin cluster's kubeconfig causes gkectl
to automatically
choose the admin cluster:
gkectl diagnose cluster --kubeconfig=[ADMIN_CLUSTER_KUBECONFIG]
Using admin cluster name
To get the admin cluster's name, run the following command:
kubectl get cluster --kubeconfig=[ADMIN_CLUSTER_KUBECONFIG]
Then, pass in the admin cluster name to gkectl diagnose cluster:
gkectl diagnose cluster --kubeconfig=[ADMIN_CLUSTER_KUBECONFIG] \
    --cluster-name=[ADMIN_CLUSTER_NAME]
If your admin cluster is functioning properly, gkectl diagnose cluster
returns the following output:
Diagnosing admin cluster "[ADMIN_CLUSTER_NAME]"...
Checking cluster object...PASS
Checking machine objects...PASS
Checking kube-system pods...PASS
Checking storage...PASS
Cluster is healthy.
Diagnosing a user cluster
To diagnose a user cluster, first get the user cluster's name:
kubectl get cluster --kubeconfig=[USER_CLUSTER_KUBECONFIG]
Then, pass in the admin cluster's kubeconfig and the user cluster's name:
gkectl diagnose cluster --kubeconfig=[ADMIN_CLUSTER_KUBECONFIG] \
    --cluster-name=[USER_CLUSTER_NAME]
If your user cluster is functioning properly, gkectl diagnose cluster
returns the following output:
Diagnosing user cluster "[USER_CLUSTER_NAME]"...
Checking cluster object...PASS
Checking control plane pods...PASS
Checking machine objects...PASS
Checking other kube-system pods...PASS
Checking storage...PASS
Cluster is healthy.
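If you run the diagnosis from a script, you might want to act on the result automatically. The following is a minimal sketch, not part of gkectl: the function names diagnose_is_healthy and diagnose_failed_checks are hypothetical, and the sketch assumes that the summary line and the per-check lines look like the sample output above. It does not rely on the exit code of gkectl, which this page does not document.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads `gkectl diagnose cluster` output on stdin and
# succeeds only if the "Cluster is healthy." summary line is present.
# Assumption: the summary text matches the sample output on this page.
diagnose_is_healthy() {
  grep -q 'Cluster is healthy\.'
}

# Hypothetical helper: reads the same output on stdin and prints every
# "Checking ..." line that does not end in PASS. Prints nothing (and still
# succeeds) when all checks passed.
diagnose_failed_checks() {
  grep 'Checking ' | grep -v 'PASS[[:space:]]*$' || true
}

# Example use (requires gkectl and a real kubeconfig, so it is left as a comment).
# Redirecting stderr is an assumption in case the report is not written to stdout:
#   gkectl diagnose cluster --kubeconfig=[ADMIN_CLUSTER_KUBECONFIG] 2>&1 | diagnose_is_healthy \
#     || echo "Cluster needs attention; consider taking a snapshot."
```

Matching on the printed text keeps the sketch independent of undocumented behavior, at the cost of breaking if the wording of the output changes between versions.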
Capturing cluster state
If gkectl diagnose cluster
finds errors, you should capture the cluster's
state and provide the information to Google. You can do so using the
gkectl diagnose snapshot
command.
gkectl diagnose snapshot has an optional flag, --seed-config. In addition to collecting information about the cluster, this flag collects the GKE on-prem configuration file that was used to create or upgrade the cluster.
Capturing admin cluster state
To capture an admin cluster's state, run the following command, where
--seed-config
is optional:
gkectl diagnose snapshot --kubeconfig=[ADMIN_CLUSTER_KUBECONFIG] [--seed-config]
The output includes a list of files and the name of a tarball file:
Taking snapshot of admin cluster "[ADMIN_CLUSTER_NAME]"...
Using default snapshot configuration...
Setting up "[ADMIN_CLUSTER_NAME]" ssh key file...DONE
Taking snapshots...
commands/kubectl_get_pods_-o_yaml_--kubeconfig_...env.default.kubeconfig_--namespace_kube-system
commands/kubectl_get_deployments_-o_yaml_--kubeconfig_...env.default.kubeconfig_--namespace_kube-system
commands/kubectl_get_daemonsets_-o_yaml_--kubeconfig_...env.default.kubeconfig_--namespace_kube-system
...
nodes/[ADMIN_CLUSTER_NODE]/commands/journalctl_-u_kubelet
nodes/[ADMIN_CLUSTER_NODE]/files/var/log/startup.log
...
Snapshot succeeded. Output saved in [TARBALL_FILE_NAME].tar.gz.
To extract the tarball file to a directory, run the following command:
tar -zxf [TARBALL_FILE_NAME] --directory [EXTRACTION_DIRECTORY_NAME]
To look at the list of files produced by the snapshot, run the following commands:
cd [EXTRACTION_DIRECTORY_NAME]/[EXTRACTED_SNAPSHOT_DIRECTORY]
ls kubectlCommands
ls nodes/[NODE_NAME]/commands
ls nodes/[NODE_NAME]/files
To see the details of a particular operation, open one of the files.
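The extraction and listing steps can be combined into one small function. This is only a sketch under stated assumptions: extract_snapshot is a hypothetical name, and the layout of the extracted directory is whatever the snapshot contains. Note that tar expects the target of --directory to exist already, so the sketch creates it first.

```shell
#!/usr/bin/env bash

# Hypothetical helper: extract a snapshot tarball and list the files it holds.
# Usage: extract_snapshot TARBALL EXTRACTION_DIRECTORY
extract_snapshot() {
  local tarball="$1" dest="$2"

  # tar --directory fails if the directory does not exist, so create it first.
  mkdir -p "$dest" || return 1

  # Same tar invocation as in the text above.
  tar -zxf "$tarball" --directory "$dest" || return 1

  # Print every extracted file, relative to the extraction directory.
  (cd "$dest" && find . -type f | sort)
}

# Example use (placeholders as in the text, so it is left as a comment):
#   extract_snapshot [TARBALL_FILE_NAME] [EXTRACTION_DIRECTORY_NAME]
```

Listing with find instead of several ls calls avoids having to know the node names in advance.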
Specifying the SSH key for the admin cluster
When you get a snapshot of the admin cluster, gkectl
finds the private SSH key
for the admin cluster automatically. You can also specify the key explicitly by
using the --admin-ssh-key-path
parameter.
Follow the instructions for Using SSH to connect to a cluster node to download the SSH keys.
Then, in your gkectl diagnose snapshot command, set --admin-ssh-key-path to your decoded key file path:
gkectl diagnose snapshot --kubeconfig=[ADMIN_CLUSTER_KUBECONFIG] \
    --admin-ssh-key-path=[PATH_TO_DECODED_KEY]
Capturing user cluster state
To capture a user cluster's state, run the following command:
gkectl diagnose snapshot --kubeconfig=[ADMIN_CLUSTER_KUBECONFIG] \
    --cluster-name=[USER_CLUSTER_NAME]
The output includes a list of files and the name of a tarball file:
Taking snapshot of user cluster "[USER_CLUSTER_NAME]"...
Using default snapshot configuration...
Setting up "[USER_CLUSTER_NAME]" ssh key file...DONE
commands/kubectl_get_pods_-o_yaml_--kubeconfig_...env.default.kubeconfig_--namespace_user
commands/kubectl_get_deployments_-o_yaml_--kubeconfig_...env.default.kubeconfig_--namespace_user
commands/kubectl_get_daemonsets_-o_yaml_--kubeconfig_...env.default.kubeconfig_--namespace_user
...
commands/kubectl_get_pods_-o_yaml_--kubeconfig_.tmp.user-kubeconfig-851213064_--namespace_kube-system
commands/kubectl_get_deployments_-o_yaml_--kubeconfig_.tmp.user-kubeconfig-851213064_--namespace_kube-system
commands/kubectl_get_daemonsets_-o_yaml_--kubeconfig_.tmp.user-kubeconfig-851213064_--namespace_kube-system
...
nodes/[USER_CLUSTER_NODE]/commands/journalctl_-u_kubelet
nodes/[USER_CLUSTER_NODE]/files/var/log/startup.log
...
Snapshot succeeded. Output saved in [FILENAME].tar.gz.
Snapshot scenarios
The gkectl diagnose snapshot command supports four scenarios. To specify a scenario, use the --scenario flag. The following list shows the possible values:
- system: (default) Collect a snapshot for the system namespaces: kube-system and gke-system.
- system-with-logs: Collect a system snapshot with logs.
- all: Collect a snapshot for all namespaces.
- all-with-logs: Collect an all snapshot with logs.
You can use each of the four scenarios with an admin cluster or a user cluster, so there are eight possible combinations. The following examples show some of the possibilities.
To create a snapshot of the admin cluster using the system
scenario:
gkectl diagnose snapshot \
    --kubeconfig=[ADMIN_CLUSTER_KUBECONFIG] \
    --scenario=system
To create a snapshot of a user cluster using the system-with-logs scenario:
gkectl diagnose snapshot \
    --kubeconfig=[ADMIN_CLUSTER_KUBECONFIG] \
    --cluster-name=[USER_CLUSTER_NAME] \
    --scenario=system-with-logs
To create a snapshot of a user cluster using the all
scenario:
gkectl diagnose snapshot \
    --kubeconfig=[ADMIN_CLUSTER_KUBECONFIG] \
    --cluster-name=[USER_CLUSTER_NAME] \
    --scenario=all
To create a snapshot of the admin cluster using the all-with-logs
scenario:
gkectl diagnose snapshot \
    --kubeconfig=[ADMIN_CLUSTER_KUBECONFIG] \
    --scenario=all-with-logs
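Because the combinations differ only in the scenario value and in whether a cluster name is passed, a wrapper can assemble the command and reject misspelled scenarios before anything runs. The function below is an illustrative sketch: snapshot_cmd is a hypothetical name, and it only prints the command rather than running it. It uses only the flags and scenario values listed above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print (do not run) a gkectl diagnose snapshot command.
# Usage: snapshot_cmd ADMIN_KUBECONFIG [SCENARIO] [USER_CLUSTER_NAME]
snapshot_cmd() {
  local kubeconfig="$1" scenario="${2:-system}" cluster="${3:-}"

  # Accept only the four scenario values documented above.
  case "$scenario" in
    system|system-with-logs|all|all-with-logs) ;;
    *) echo "unknown scenario: $scenario" >&2; return 1 ;;
  esac

  local cmd="gkectl diagnose snapshot --kubeconfig=$kubeconfig"

  # Without a cluster name the command targets the admin cluster,
  # as in the admin cluster examples above.
  if [ -n "$cluster" ]; then
    cmd="$cmd --cluster-name=$cluster"
  fi

  echo "$cmd --scenario=$scenario"
}

# Example use:
#   snapshot_cmd [ADMIN_CLUSTER_KUBECONFIG] all-with-logs
#   snapshot_cmd [ADMIN_CLUSTER_KUBECONFIG] system-with-logs [USER_CLUSTER_NAME]
```

Printing the command first makes it easy to review before running it against a live cluster.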
Performing a dry run for a snapshot
You can use the --dry-run
flag to show the actions to be taken and the
snapshot configuration.
To perform a dry run on your admin cluster, enter the following command:
gkectl diagnose snapshot --kubeconfig=[ADMIN_CLUSTER_KUBECONFIG] \
    --cluster-name=[ADMIN_CLUSTER_NAME] \
    --dry-run
To perform a dry run on a user cluster, enter the following command:
gkectl diagnose snapshot --kubeconfig=[ADMIN_CLUSTER_KUBECONFIG] \
    --cluster-name=[USER_CLUSTER_NAME] \
    --dry-run
Using a snapshot configuration
If the four scenarios don't meet your needs, you can create a customized
snapshot by passing in a snapshot configuration file using the
--snapshot-config
flag:
gkectl diagnose snapshot --kubeconfig=[ADMIN_CLUSTER_KUBECONFIG] \
    --cluster-name=[USER_CLUSTER_NAME] \
    --snapshot-config=[SNAPSHOT_CONFIG_FILE]
Generating a snapshot configuration
You can generate a snapshot configuration for a given scenario by passing in the --scenario and --dry-run flags. For example, to see the snapshot configuration for the default scenario (system) of a user cluster, enter the following command:
gkectl diagnose snapshot \
    --kubeconfig=[ADMIN_CLUSTER_KUBECONFIG] \
    --cluster-name=[USER_CLUSTER_NAME] \
    --scenario=system --dry-run
The output is similar to the following:
numOfParallelThreads: 10
excludeWords:
- password
kubectlCommands:
- commands:
  - kubectl get clusters -o wide
  - kubectl get machines -o wide
  - kubectl get clusters -o yaml
  - kubectl get machines -o yaml
  - kubectl describe clusters
  - kubectl describe machines
  namespaces:
  - default
- commands:
  - kubectl version
  - kubectl cluster-info
  - kubectl get nodes -o wide
  - kubectl get nodes -o yaml
  - kubectl describe nodes
  namespaces: []
- commands:
  - kubectl get pods -o wide
  - kubectl get deployments -o wide
  - kubectl get daemonsets -o wide
  - kubectl get statefulsets -o wide
  - kubectl get replicasets -o wide
  - kubectl get services -o wide
  - kubectl get jobs -o wide
  - kubectl get cronjobs -o wide
  - kubectl get endpoints -o wide
  - kubectl get configmaps -o wide
  - kubectl get pods -o yaml
  - kubectl get deployments -o yaml
  - kubectl get daemonsets -o yaml
  - kubectl get statefulsets -o yaml
  - kubectl get replicasets -o yaml
  - kubectl get services -o yaml
  - kubectl get jobs -o yaml
  - kubectl get cronjobs -o yaml
  - kubectl get endpoints -o yaml
  - kubectl get configmaps -o yaml
  - kubectl describe pods
  - kubectl describe deployments
  - kubectl describe daemonsets
  - kubectl describe statefulsets
  - kubectl describe replicasets
  - kubectl describe services
  - kubectl describe jobs
  - kubectl describe cronjobs
  - kubectl describe endpoints
  - kubectl describe configmaps
  namespaces:
  - kube-system
  - gke-system
  - gke-connect.*
prometheusRequests: []
nodeCommands:
- nodes: []
  commands:
  - uptime
  - df --all --inodes
  - ip addr
  - sudo iptables-save --counters
  - mount
  - ip route list table all
  - top -bn1
  - sudo docker ps -a
  - ps -edF
  - ps -eo pid,tid,ppid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:14,comm,args,cgroup
  - sudo conntrack --count
nodeFiles:
- nodes: []
  files:
  - /proc/sys/fs/file-nr
  - /proc/sys/net/nf_conntrack_max
seesawCommands: []
seesawFiles: []
nodeCollectors:
- nodes: []
f5:
  enabled: true
vCenter:
  enabled: true
- numOfParallelThreads: Number of parallel threads used to take snapshots.
- excludeWords: List of words to be excluded from the snapshot (case insensitive). Lines containing these words are removed from snapshot results. "password" is always excluded, whether or not you specify it.
- kubectlCommands: List of kubectl commands to run. The results are saved. The commands run against the corresponding namespaces. For kubectl logs commands, all Pods and containers in the corresponding namespaces are added automatically. Regular expressions are supported for specifying namespaces. If you do not specify a namespace, the default namespace is assumed.
- nodeCommands: List of commands to run on the corresponding nodes. The results are saved. When nodes are not specified, all nodes in the target cluster are considered.
- nodeFiles: List of files to be collected from the corresponding nodes. The files are saved. When nodes are not specified, all nodes in the target cluster are considered.
- seesawCommands: List of commands to run to collect Seesaw load balancer information. The results are saved if the cluster is using the Seesaw load balancer.
- seesawFiles: List of files to be collected for the Seesaw load balancer.
- nodeCollectors: A collector running for Cilium nodes to collect eBPF information.
- f5: A flag to enable the collecting of information related to the F5 BIG-IP load balancer.
- vCenter: A flag to enable the collecting of information related to vCenter.
- prometheusRequests: List of Prometheus requests. The results are saved.
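As an illustration of a customized configuration, the sketch below writes a reduced configuration file that keeps every top-level field from the generated output but trims the command lists and turns off F5 collection. Treat it as a starting point only: write_snapshot_config is a hypothetical helper name, the chosen commands are an arbitrary subset of the ones shown in the generated output, and it is an assumption that setting enabled to false under f5 disables that collector. Comparing your file against the output of --dry-run for your version is the safer way to confirm the accepted format.

```shell
#!/usr/bin/env bash

# Hypothetical helper: write a trimmed snapshot configuration to the given path.
# Every field name and command below is copied from the generated
# configuration shown above; only the selection is illustrative.
write_snapshot_config() {
  local out="$1"
  cat > "$out" <<'EOF'
numOfParallelThreads: 10
excludeWords:
- password
kubectlCommands:
- commands:
  - kubectl get pods -o wide
  - kubectl describe pods
  namespaces:
  - kube-system
  - gke-system
prometheusRequests: []
nodeCommands:
- nodes: []
  commands:
  - uptime
  - df --all --inodes
nodeFiles:
- nodes: []
  files:
  - /proc/sys/fs/file-nr
seesawCommands: []
seesawFiles: []
nodeCollectors:
- nodes: []
f5:
  enabled: false   # assumption: false turns off F5 BIG-IP collection
vCenter:
  enabled: true
EOF
}

# Example use (the gkectl call needs a real cluster, so it is left as a comment):
#   write_snapshot_config snapshot-config.yaml
#   gkectl diagnose snapshot --kubeconfig=[ADMIN_CLUSTER_KUBECONFIG] \
#       --cluster-name=[USER_CLUSTER_NAME] \
#       --snapshot-config=snapshot-config.yaml
```

The quoted heredoc delimiter ('EOF') keeps the shell from expanding anything inside the file, so the YAML is written exactly as typed.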
Known issues
Version 1.1.2-gke.0: path resolves to multiple datacenters
Refer to GKE on-prem release notes.
Versions 1.1.x: Volume not attached to machine
Refer to GKE on-prem release notes.