This page shows you how to investigate issues with cluster creation and upgrade in Google Distributed Cloud (software only) for VMware.
If you need additional assistance, reach out to Cloud Customer Care.
Installation issues
The following sections might help you to troubleshoot issues with installing Google Distributed Cloud.
Use the bootstrap cluster to debug issues
During installation, Google Distributed Cloud creates a temporary bootstrap cluster. After a successful installation, Google Distributed Cloud deletes the bootstrap cluster, leaving you with your admin cluster and user cluster. Generally, you should have no reason to interact with the bootstrap cluster. However, if you encounter issues during installation, you can use the bootstrap cluster logs to help you debug the problem.
If you pass --cleanup-external-cluster=false to gkectl create cluster, then the bootstrap cluster doesn't get deleted, and you can use the bootstrap cluster to debug installation issues.
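For example, a command similar to the following sketch keeps the bootstrap cluster around for debugging. The --config path is a placeholder for your own cluster configuration file, and additional flags (such as --kubeconfig) might be required depending on the cluster type and your gkectl version; check gkectl create cluster --help for the exact flags:
gkectl create cluster --config CLUSTER_CONFIG_FILE \
    --cleanup-external-cluster=false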
Examine the bootstrap cluster's logs
Find the names of Pods running in the kube-system namespace:
kubectl --kubeconfig /home/ubuntu/.kube/kind-config-gkectl get pods -n kube-system
View the logs for a Pod:
kubectl --kubeconfig /home/ubuntu/.kube/kind-config-gkectl -n kube-system logs POD_NAME
Replace POD_NAME with the name of the Pod that you want to view.
To get the logs from the bootstrap cluster directly, run the following command during cluster creation, update, and upgrade:
docker exec -it gkectl-control-plane bash
This command opens a terminal inside the gkectl-control-plane container that runs in the bootstrap cluster.
To inspect the kubelet and containerd logs, use the following commands and look for errors or warnings in the output:
systemctl status -l kubelet
journalctl --utc -u kubelet
systemctl status -l containerd
journalctl --utc -u containerd
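For example, to narrow the kubelet journal down to likely problems, you can pipe the output through grep; the pattern shown here is only a suggestion:
journalctl --utc -u kubelet | grep -iE "error|warn|fail"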
Examine the snapshot of the bootstrap cluster
If you attempt to create or upgrade an admin cluster, and that operation fails,
Google Distributed Cloud takes an external snapshot of the bootstrap cluster.
This snapshot of the bootstrap cluster is similar to the snapshot taken
by running the gkectl diagnose snapshot
command on the
admin cluster, but the process triggers automatically. The bootstrap cluster
snapshot contains important debugging information for the admin cluster creation
and upgrade process. You can
provide this snapshot to Cloud Customer Care if
needed.
The external snapshot includes Pod logs from the
onprem-admin-cluster-controller
that you can view to debug cluster creation or
upgrade issues. The logs are stored in a separate file, for example:
kubectl_logs_onprem-admin-cluster-controller-6767f6597-nws8g_ \
--container_onprem-admin-cluster-controller_ \
--kubeconfig_.home.ubuntu..kube.kind-config-gkectl_\
--namespace_kube-system
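To inspect the snapshot, you can list and extract its contents with standard tar commands. This is a minimal sketch that assumes the snapshot was saved as a gzipped tar archive in your working directory; the file name below is a placeholder, because the actual name includes a timestamp:
# List the files in the snapshot, then extract it for inspection.
tar -tzf SNAPSHOT_FILE.tar.gz
mkdir -p /tmp/bootstrap-snapshot
tar -xzf SNAPSHOT_FILE.tar.gz -C /tmp/bootstrap-snapshot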
VM doesn't start after admin control plane starts
If a VM fails to start after the admin control plane has started, you can investigate the issue by inspecting the logs from the Cluster API controllers Pod in the admin cluster:
Find the name of the Cluster API controllers Pod:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG --namespace kube-system \
    get pods | grep clusterapi-controllers
View logs from the vsphere-controller-manager. Start by specifying the Pod, but no container:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG --namespace kube-system \
    logs POD_NAME
The output tells you that you must specify a container, and it gives you the names of the containers in the Pod. For example:
... a container name must be specified ..., choose one of: [clusterapi-controller-manager vsphere-controller-manager rbac-proxy]
Choose a container, and view its logs:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG --namespace kube-system \
    logs POD_NAME --container CONTAINER_NAME
Sufficient number of IP addresses allocated, but Machine fails to register with cluster
This issue can occur if there is an IP address conflict. For example, an IP address you specified for a machine is being used for a load balancer.
To resolve this issue, update your cluster IP block file so that the machine addresses do not conflict with addresses specified in your cluster configuration file or your Seesaw IP block file.
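For reference, a machine entry in the IP block file looks similar to the following minimal sketch; the addresses and hostnames are examples only. Verify that none of these addresses are also assigned to a control-plane VIP, an ingress VIP, or a Seesaw VM:
blocks:
  - netmask: 255.255.252.0
    gateway: 172.16.23.254
    ips:
      - ip: 172.16.20.21    # must not overlap with any load balancer address
        hostname: user-node-1
      - ip: 172.16.20.22
        hostname: user-node-2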
Kind cluster does not get deleted
When you create an admin cluster, a kind cluster (also called the bootstrap cluster) gets created as part of the process. When the admin cluster operation is complete, the kind cluster gets deleted automatically.
If you see the error message Failed to delete the kind cluster, you can perform the following steps, on your admin workstation, to manually delete the kind cluster:
Get the kind container ID:
docker inspect --format="{{.Id}}" gkectl-control-plane
Get the containerd-shim process ID:
sudo ps awx | grep containerd-shim | grep CONTAINER_ID | awk '{print $1}'
Kill the process:
sudo kill -9 PROCESS_ID
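If you prefer, the three steps can be combined into a short shell sequence. This is a minimal sketch that assumes exactly one matching containerd-shim process:
# Look up the kind container ID, find its containerd-shim PID, and kill it.
CONTAINER_ID=$(docker inspect --format="{{.Id}}" gkectl-control-plane)
PROCESS_ID=$(sudo ps awx | grep containerd-shim | grep "${CONTAINER_ID}" | awk '{print $1}')
sudo kill -9 "${PROCESS_ID}"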
Cluster upgrade issues
The following sections provide tips on how to resolve problems that you might encounter during a cluster upgrade.
Roll back a node pool after an upgrade
If you upgrade a user cluster and then discover an issue with the cluster nodes, you can roll back selected node pools to the previous version.
Rolling back selected node pools is supported for Ubuntu and COS node pools, but not for Windows node pools.
The version of a node pool can be the same or one minor version older than the version of the user cluster control plane. For example, if the control plane is at version 1.14, then the node pools can be at version 1.14 or 1.13.
View available node pool versions
Suppose you recently upgraded your user cluster worker nodes and control plane from version 1.13.1-gke.35 to version 1.14.0, and you discover an issue with the upgraded worker nodes. So you decide to roll back one or more node pools to the version you were previously running: 1.13.1-gke.35. Before you can begin the rollback, you need to verify that the previous version is available for rollback.
To view available versions, run the following command:
gkectl version --cluster-name USER_CLUSTER_NAME \
--kubeconfig ADMIN_CLUSTER_KUBECONFIG
The output shows the current version and the previous version for each node pool. For example:
user cluster version: 1.14.0-gke.x
node pools:
- pool-1:
- version: 1.14.0-gke.x
- previous version: 1.13.1-gke.35
- pool-2:
- version: 1.14.0-gke.x
- previous version: 1.13.1-gke.35
available node pool versions:
- 1.13.1-gke.35
- 1.14.0-gke.x
Roll back node pool version
You can roll back a node pool's version one node pool at a time, or you can roll back several node pools in a single step.
To roll back a node pool version, complete the following steps:
In your user cluster configuration file, in one or more node pools, set the value of gkeOnPremVersion to the previous version. The following example shows you how to roll back to version 1.13.1-gke.35:
nodePools:
- name: pool-1
  cpus: 4
  memoryMB: 8192
  replicas: 3
  gkeOnPremVersion: 1.13.1-gke.35
...
Update the cluster to roll back the node pool(s):
gkectl update cluster --config USER_CLUSTER_CONFIG \
    --kubeconfig ADMIN_CLUSTER_KUBECONFIG
Verify that the rollback was successful:
gkectl version --cluster-name USER_CLUSTER_NAME \
    --kubeconfig ADMIN_CLUSTER_KUBECONFIG
The following output shows that pool-1 was rolled back to version 1.13.1-gke.35:
user cluster version: 1.14.0-gke.x
node pools:
- pool-1:
  - version: 1.13.1-gke.35
  - previous version: 1.14.0-gke.x
- pool-2:
  - version: 1.14.0-gke.x
  - previous version: 1.13.1-gke.35
available node pool versions:
- 1.13.1-gke.35
- 1.14.0-gke.x
Upgrade to a new patch version
You can upgrade all node pools and the control plane to a new patch version. This might be helpful if you rolled back to a previous version and want to upgrade to a version that includes a fix.
To upgrade to a new version, complete the following steps:
Make the following changes in your user cluster configuration file:
Set the value of gkeOnPremVersion to a new patch version. This example uses 1.14.1-gke.x.
For each node pool, remove the gkeOnPremVersion field, or set it to the empty string. When no version is specified for a node pool, the version for the node pool defaults to the version specified for the cluster.
These changes look similar to the following example:
gkeOnPremVersion: 1.14.1-gke.x
nodePools:
- name: pool-1
  cpus: 4
  memoryMB: 8192
  replicas: 3
  gkeOnPremVersion: ""
- name: pool-2
  cpus: 8
  memoryMB: 8192
  replicas: 2
  gkeOnPremVersion: ""
Run gkectl prepare and gkectl upgrade cluster as described in Upgrading Google Distributed Cloud.
Verify the new cluster version, and see the versions that are available for rollback:
gkectl version --cluster-name USER_CLUSTER_NAME \
    --kubeconfig ADMIN_CLUSTER_KUBECONFIG
The output is similar to the following:
user cluster version: 1.14.1-gke.y
node pools:
- pool-1:
  - version: 1.14.1-gke.y
  - previous version: 1.13.1-gke.35
- pool-2:
  - version: 1.14.1-gke.y
  - previous version: 1.13.1-gke.35
available node pool versions:
- 1.13.1-gke.35
- 1.14.0-gke.x
- 1.14.1-gke.y
Health checks are run automatically when cluster upgrade fails
If you attempt to upgrade an admin or user cluster, and that operation fails,
Google Distributed Cloud automatically runs the gkectl diagnose cluster
command on the cluster.
To skip the automatic diagnosis, pass the --skip-diagnose-cluster flag to gkectl upgrade.
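For example, a user cluster upgrade that skips the automatic diagnosis might look like the following sketch. The flags other than --skip-diagnose-cluster are assumed to match the standard upgrade command; check gkectl upgrade cluster --help if your version differs:
gkectl upgrade cluster --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
    --config USER_CLUSTER_CONFIG --skip-diagnose-cluster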
Upgrade process becomes stuck
Behind the scenes, Google Distributed Cloud uses the Kubernetes drain command during an upgrade. This drain procedure can be blocked by a Deployment with only one replica that has a PodDisruptionBudget (PDB) created for it with minAvailable: 1.
Starting with Google Distributed Cloud version 1.13, you can check for these failures through Kubernetes Pod events.
Find the names of the Machines:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get machines --all-namespaces
Check for errors using the kubectl describe machine command:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG describe machine MACHINE_NAME
The output is similar to the following:
Events:
  Type     Reason              Age    From                Message
  ----     ------              ----   ----                -------
  Warning  PodEvictionTooLong  3m49s  machine-controller  Waiting too long(12m10.284294321s) for pod(default/test-deployment-669b85c4cc-z8pz7) eviction.
Optional: For a more detailed analysis of the machine object status, run gkectl diagnose cluster.
The output is similar to the following:
...
Checking machineset...SUCCESS
Checking machine objects...FAILURE
    Reason: 1 machine objects error(s).
    Unhealthy Resources:
    Pod test-deployment-669b85c4cc-7zjpq: Pod cannot be evicted successfully. There is 1 related PDB.
...
Checking all poddisruptionbudgets...FAILURE
    Reason: 1 pod disruption budget error(s).
    Unhealthy Resources:
    PodDisruptionBudget test-pdb: default/test-pdb might be configured incorrectly, the total replicas(3) should be larger than spec.MinAvailable(3).
...
Some validation results were FAILURE or UNKNOWN. Check report above.
To resolve this issue, save the PDB, and remove it from the cluster before attempting the upgrade. You can then add the PDB back after the upgrade is complete.
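For example, using the test-pdb PodDisruptionBudget from the output above (your PDB name and namespace will differ), the save-remove-restore cycle might look like the following sketch:
# Save a copy of the PDB manifest, then delete it so the drain can proceed.
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get pdb test-pdb -n default -o yaml > test-pdb.yaml
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG delete pdb test-pdb -n default

# After the upgrade completes, re-create the PDB. You might need to remove
# server-generated fields such as resourceVersion and uid from test-pdb.yaml first.
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG apply -f test-pdb.yaml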
Remove unsupported changes to unblock upgrade
When upgrading clusters to version 1.16 or earlier, changes to most fields are silently ignored during the upgrade, meaning that these changes don't take effect during or after the upgrade.
When upgrading user clusters to version 1.28 or later, all changes made in the configuration file are validated, and an error is returned for unsupported changes instead of silently ignoring them. This validation applies to user clusters only. When upgrading admin clusters, changes to most fields are silently ignored and won't take effect after the upgrade.
For example, if you attempt to disable node auto-repair when upgrading a user cluster to 1.28, the upgrade fails with the following error message:
failed to generate desired create config: failed to generate desired OnPremUserCluster from seed config: failed to apply validating webhook to OnPremUserCluster: the following changes on immutable fields are forbidden during upgrade: (diff: -before, +after):
v1alpha1.OnPremUserClusterSpec{
... // 20 identical fields
UsageMetering: nil,
CloudAuditLogging: &{ProjectID: "syllogi-partner-testing", ClusterLocation: "us-central1", ServiceAccountKey: &{KubernetesSecret: &{Name: "user-cluster-creds", KeyName: "cloud-audit-logging-service-account-key"}}},
- AutoRepair: &v1alpha1.AutoRepairConfig{Enabled: true},
+ AutoRepair: &v1alpha1.AutoRepairConfig{},
CARotation: &{Generated: &{CAVersion: 1}},
KSASigningKeyRotation: &{Generated: &{KSASigningKeyVersion: 1}},
... // 8 identical fields
}
If you need to bypass this error, there are the following workarounds:
- Revert the attempted change, and then rerun the upgrade. For example, in the previous scenario, you would revert the changes made to the AutoRepair config and then rerun gkectl upgrade.
- Alternatively, you can generate configuration files that match the current state of the cluster by running gkectl get-config, update the gkeOnPremVersion fields for the cluster and the node pools in the configuration file, and then rerun gkectl upgrade.
Debug F5 BIG-IP issues with the internal kubeconfig file
After an installation, Google Distributed Cloud generates a kubeconfig file
named internal-cluster-kubeconfig-debug
in the home directory of your admin
workstation. This kubeconfig file is identical to your admin cluster's
kubeconfig file, except that it points directly to the admin cluster's control
plane node, where the Kubernetes API server runs. You can use the
internal-cluster-kubeconfig-debug
file to debug F5 BIG-IP issues.
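For example, you can point kubectl at this file to query the API server directly, bypassing the load balancer. This is a minimal sketch that assumes the file is still in your home directory:
kubectl --kubeconfig ~/internal-cluster-kubeconfig-debug get nodes
kubectl --kubeconfig ~/internal-cluster-kubeconfig-debug get pods -n kube-system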
Debug issues with vSphere
You can use govc to investigate issues with vSphere. For example, you can confirm permissions and access for your vCenter user accounts, and you can collect vSphere logs.
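The following sketch shows one typical way to connect govc to vCenter and run a few read-only checks. The environment variable values are placeholders, and the exact checks you need depend on the issue you're investigating:
# Point govc at your vCenter server (placeholder values shown).
export GOVC_URL=https://VCENTER_SERVER_ADDRESS
export GOVC_USERNAME=VCENTER_USERNAME
export GOVC_PASSWORD=VCENTER_PASSWORD
export GOVC_INSECURE=true   # only if vCenter uses a certificate that govc doesn't trust

# Confirm connectivity and credentials, then inspect the inventory and permissions.
govc about
govc ls
govc permissions.ls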
Re-create missing user cluster kubeconfig file
You might want to re-create a user cluster kubeconfig file in the following situations:
- You attempted to create a user cluster, the creation operation failed, and you want its user cluster kubeconfig file.
- The user cluster kubeconfig file is missing, for example, after being deleted.
To generate a new kubeconfig file for your user cluster, complete the following steps:
Define environment variables:
Begin by setting the following environment variables with the appropriate values:
USER_CONTROLPLANEVIP=USER_CONTROLPLANEVIP
USER_CLUSTER_NAME=USER_CLUSTER_NAME
ADMIN_CLUSTER_KUBECONFIG=PATH_TO_ADMIN_KUBECONFIG
KUBECONFIG_SECRET_NAME=$(kubectl --kubeconfig $ADMIN_CLUSTER_KUBECONFIG get secrets -n $USER_CLUSTER_NAME | grep ^admin | cut -d' ' -f1 | head -n 1)
Replace the following:
ADMIN_CLUSTER_KUBECONFIG: the path of the kubeconfig file for your admin cluster.
USER_CONTROLPLANEVIP: the controlPlaneVIP of the user cluster. This can be retrieved from the user cluster manifest file.
USER_CLUSTER_NAME: the name of your user cluster.
Generate the kubeconfig file:
Run the following command to create the new kubeconfig file:
kubectl --kubeconfig $ADMIN_CLUSTER_KUBECONFIG get secrets $KUBECONFIG_SECRET_NAME \
    -n $USER_CLUSTER_NAME -o jsonpath='{.data.admin\.conf}' | base64 -d | \
    sed -r "s/ kube-apiserver.*local\./${USER_CONTROLPLANEVIP}/" \
    > USER_CLUSTER_KUBECONFIG
Replace the following:
USER_CLUSTER_KUBECONFIG: the name of the new kubeconfig file for your user cluster.
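To confirm that the new file works, you can run a simple read-only command against the user cluster, for example:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get nodes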
What's next
- If you need additional assistance, reach out to Cloud Customer Care.