Troubleshooting cluster creation and upgrade

This page shows you how to investigate issues with cluster creation, upgrade, and resizing in GKE on VMware.

Default logging behavior for gkectl and gkeadm

For gkectl and gkeadm, it is sufficient to use the default logging settings:

  • For gkectl, the default log file is /home/ubuntu/.config/gke-on-prem/logs/gkectl-$(date).log, and the file is symlinked with the logs/gkectl-$(date).log file in the local directory where you run gkectl.

  • For gkeadm, the default log file is logs/gkeadm-$(date).log in the local directory where you run gkeadm.

  • The default -v5 verbosity level covers all the log entries needed by the support team.

  • The log file includes the command executed and the failure message.

We recommend that you send the log file to the support team when you need help.

Specifying non-default locations for log files

To specify a non-default location for the gkectl log file, use the --log_file flag. The log file that you specify is not symlinked to the local directory.

To specify a non-default location for the gkeadm log file, use the --log_file flag.
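
For example, a minimal sketch (the commands shown, the config file names, and the log paths are placeholders; adjust them for your environment and GKE on VMware version):

# Write the gkectl log to a custom location.
gkectl check-config --config CONFIG_FILE --log_file /tmp/gkectl-check-config.log

# Write the gkeadm log to a custom location.
gkeadm create admin-workstation --config ADMIN_WS_CONFIG_FILE --log_file /tmp/gkeadm-create.log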

Locating Cluster API logs in the admin cluster

If a VM fails to start after the admin control plane has started, you can investigate the issue by inspecting the logs from the Cluster API controllers Pod in the admin cluster.

  1. Find the name of the Cluster API controllers Pod:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG --namespace kube-system \
        get pods | grep clusterapi-controllers
    
  2. View logs from the vsphere-controller-manager. Start by specifying the Pod, but no container:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG --namespace kube-system \
        logs POD_NAME
    

    The output tells you that you must specify a container, and it gives you the names of the containers in the Pod. For example:

    ... a container name must be specified ...,
    choose one of: [clusterapi-controller-manager vsphere-controller-manager rbac-proxy]
    

    Choose a container, and view its logs:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG --namespace kube-system \
        logs POD_NAME --container CONTAINER_NAME
    

Using govc to resolve issues with vSphere

You can use govc to investigate issues with vSphere. For example, you can confirm permissions and access for your vCenter user accounts, and you can collect vSphere logs.
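
For example, the following govc commands can help confirm connectivity and permissions. This is an illustrative sketch; it assumes you have set the usual govc environment variables (GOVC_URL, GOVC_USERNAME, GOVC_PASSWORD, and, for self-signed certificates, GOVC_INSECURE=true), and the inventory paths are placeholders:

# Confirm that you can reach vCenter and authenticate as your vCenter user.
govc about

# List the inventory to confirm that the user can see the relevant objects.
govc ls
govc ls /DATACENTER/vm

# Show the permissions granted on an inventory object.
govc permissions.ls /DATACENTER/host/CLUSTER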

Debugging using the bootstrap cluster's logs

During installation, GKE on VMware creates a temporary bootstrap cluster. After a successful installation, GKE on VMware deletes the bootstrap cluster, leaving you with your admin cluster and user cluster. Generally, you should have no reason to interact with the bootstrap cluster.

If you pass --cleanup-external-cluster=false to gkectl create cluster, then the bootstrap cluster is not deleted, and you can use the bootstrap cluster's logs to debug installation issues.
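
For example, a sketch of a create command that preserves the bootstrap cluster (CONFIG_FILE is a placeholder for your cluster configuration file; adjust the command for your GKE on VMware version):

gkectl create cluster --config CONFIG_FILE --cleanup-external-cluster=false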

  1. Find the names of Pods running in the kube-system namespace:

    kubectl --kubeconfig /home/ubuntu/.kube/kind-config-gkectl get pods -n kube-system
    
  2. View the logs for a Pod:

    kubectl --kubeconfig /home/ubuntu/.kube/kind-config-gkectl -n kube-system logs POD_NAME
    

Changing the vCenter certificate

If you are running a vCenter server in evaluation or default setup mode, and it has a generated TLS certificate, this certificate might change over time. If the certificate has changed, you need to let your running clusters know about the new certificate:

  1. Retrieve the new vCenter cert and unzip it:

    curl -k -o certs.zip https://VCENTER_IP_ADDRESS/certs/download.zip
    unzip certs.zip
    

    The -k flag tells curl to skip TLS certificate verification. This avoids certificate errors when you access vCenter.

  2. Save the Linux certificate to a file named vcenter-ca.pem:

    cat certs/lin/*.0 > vcenter-ca.pem
    
  3. In your admin cluster configuration file, set vCenter.caCertPath to the path of your new vcenter-ca.pem file.

  4. Use SSH to connect to the control-plane node of your admin cluster.

    On the node, replace the content of /etc/vsphere/certificate/ca.crt with the content of vcenter-ca.pem.

    Exit your SSH connection.
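
    A minimal sketch of this step, assuming an SSH user named ubuntu and an admin control-plane node reachable at ADMIN_MASTER_IP (both are placeholders; use the SSH user and key configured for your cluster nodes):

    # Copy the new CA cert to the admin control-plane node.
    scp vcenter-ca.pem ubuntu@ADMIN_MASTER_IP:/tmp/vcenter-ca.pem

    # On the node, overwrite the existing CA cert, then exit.
    ssh ubuntu@ADMIN_MASTER_IP
    sudo cp /tmp/vcenter-ca.pem /etc/vsphere/certificate/ca.crt
    exit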

  5. Delete the vsphere-ca-certificate ConfigMaps. There is one in the kube-system namespace and one in each user cluster namespace. For example:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG delete configmap vsphere-ca-certificate \
        --namespace kube-system
    ...
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG delete configmap vsphere-ca-certificate \
        --namespace user-cluster1
    
  6. Create new ConfigMaps with the new cert. For example:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG create configmap \
        --namespace kube-system --dry-run=client vsphere-ca-certificate --from-file=ca.crt=vcenter-ca.pem \
        --output yaml | kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG apply -f -
    ...
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG create configmap \
        --namespace user-cluster1 --dry-run=client vsphere-ca-certificate --from-file=ca.crt=vcenter-ca.pem \
        --output yaml | kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG apply -f -
    
  7. Restart the containers that run the static Pods on the admin control plane:

    • SSH into the admin master VM.
    • Run docker restart for the static Pod containers (see the sketch below).
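
    A minimal sketch of that restart, assuming a Docker-based node runtime (the grep filter and CONTAINER_ID are illustrative):

    # On the admin master VM, find the static Pod containers and restart them.
    sudo docker ps | grep -E 'kube-apiserver|kube-controller-manager'
    sudo docker restart CONTAINER_ID
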
  8. Next, update the CA certificate data in the admin create-config secret.

    • Get the secret and decode the data.cfg value from the output:

      kubectl --kubeconfig ADMIN_KUBECONFIG get secret create-config -n kube-system \
          -o jsonpath='{.data.cfg}' | base64 -d > admin-create-config.yaml

    • Compare the value of admincluster.spec.cluster.vsphere.cacertdata in admin-create-config.yaml with the new vCenter CA cert.
    • If the two values are different, edit admin-create-config.yaml and replace the value of admincluster.spec.cluster.vsphere.cacertdata with the new vCenter CA cert.
    • Encode the updated file:

      cat admin-create-config.yaml | base64 -w0 > admin-create-config.b64

    • Edit the create-config secret and replace the data.cfg value with the encoded value:

      kubectl --kubeconfig ADMIN_KUBECONFIG edit secrets -n kube-system create-config
  9. Now update the CA certificate data in the create-config secrets for your user clusters.

    Edit the create-config secret and replace the data.cfg value with the base64-encoded value that you created in the previous step. For example:

    kubectl --kubeconfig ADMIN_KUBECONFIG edit secrets -n user-cluster-1 create-config
    
  10. Delete the following Pods in the user clusters.

    • clusterapi-controllers
    • kube-controller-manager
    • kube-apiserver
    • vsphere-csi-controller
    • vsphere-metrics-exporter

    To get the names of the Pods, run:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get pods --all-namespaces | grep clusterapi
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get pods --all-namespaces | grep kube-controller-manager
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get pods --all-namespaces | grep kube-apiserver
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get pods --all-namespaces | grep vsphere-csi-controller
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get pods --all-namespaces | grep vsphere-metrics-exporter
    
  11. Delete the Pods that you found in the preceding step:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG --namespace NAMESPACE \
        delete pod POD_NAME
    

When the Pods restart, they will use the new certificate.

Debugging F5 BIG-IP issues using the internal kubeconfig file

After an installation, GKE on VMware generates a kubeconfig file named internal-cluster-kubeconfig-debug in the home directory of your admin workstation. This kubeconfig file is identical to your admin cluster's kubeconfig file, except that it points directly to the admin cluster's control plane node, where the Kubernetes API server runs. You can use the internal-cluster-kubeconfig-debug file to debug F5 BIG-IP issues.
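
For example (illustrative), you can point kubectl at this file to query the Kubernetes API server directly, bypassing the F5 BIG-IP virtual IP:

kubectl --kubeconfig ~/internal-cluster-kubeconfig-debug get nodes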

Resizing a user cluster fails

If resizing a user cluster fails:

  1. Find the names of the MachineDeployments and the Machines:

    kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get machinedeployments --all-namespaces
    kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get machines --all-namespaces
    
  2. Describe a MachineDeployment to view its status and events:

    kubectl --kubeconfig USER_CLUSTER_KUBECONFIG describe machinedeployment MACHINE_DEPLOYMENT_NAME
    
  3. Check for errors on newly created Machines:

    kubectl --kubeconfig USER_CLUSTER_KUBECONFIG describe machine MACHINE_NAME
    

No addresses can be allocated for cluster resize

This issue occurs if there are not enough IP addresses available to resize a user cluster.

kubectl describe machine displays the following error:

Events:
Type     Reason  Age                From                    Message
----     ------  ----               ----                    -------
Warning  Failed  9s (x13 over 56s)  machineipam-controller  ipam: no addresses can be allocated

To resolve this issue, allocate more IP addresses for the cluster. Then, delete the affected Machine:

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG delete machine MACHINE_NAME

GKE on VMware creates a new Machine and assigns it one of the newly available IP addresses.

Sufficient number of IP addresses allocated, but Machine fails to register with cluster

This issue can occur if there is an IP address conflict. For example, an IP address you specified for a machine is being used for a load balancer.

To resolve this issue, update your cluster IP block file so that the machine addresses do not conflict with addresses specified in your cluster configuration file or your Seesaw IP block file.
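
For example, a quick way to spot overlaps is to compare the addresses in the two files. This is an illustrative sketch; user-ipblock.yaml and seesaw-ipblock.yaml are placeholder file names:

# Print any IP addresses that appear in both files (a conflict you need to fix).
comm -12 <(grep -hoE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' user-ipblock.yaml | sort -u) \
         <(grep -hoE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' seesaw-ipblock.yaml | sort -u)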

Snapshot is created automatically when admin cluster creation or upgrade fails

If you attempt to create or upgrade an admin cluster and the operation fails, GKE on VMware takes an external snapshot of the bootstrap cluster, which is the transient cluster used to create or upgrade the admin cluster. This snapshot is similar to the snapshot taken by running the gkectl diagnose snapshot command on the admin cluster, but it is triggered automatically. The snapshot contains important debugging information for the admin cluster creation and upgrade process, and you can provide it to Google Cloud Support if needed.
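
For reference, the snapshot that you take manually on an existing admin cluster looks like this (illustrative):

gkectl diagnose snapshot --kubeconfig ADMIN_CLUSTER_KUBECONFIG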

Upgrade process becomes stuck

Behind the scenes, GKE on VMware uses the Kubernetes drain command during an upgrade. The drain can be blocked by a Deployment that has only one replica and a PodDisruptionBudget (PDB) with minAvailable: 1.

In that case, save the PDB, and remove it from the cluster before attempting the upgrade. You can then add the PDB back after the upgrade is complete.
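
For example, a minimal sketch (PDB_NAME and NAMESPACE are placeholders for the blocking PodDisruptionBudget and its namespace):

# Save the PDB so that you can restore it later, then remove it before upgrading.
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get pdb PDB_NAME -n NAMESPACE -o yaml > pdb-backup.yaml
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG delete pdb PDB_NAME -n NAMESPACE

# After the upgrade completes, add the PDB back.
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG apply -f pdb-backup.yaml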