Version 1.5. This version is no longer supported. For more information see the version support policy.

Troubleshooting

The following sections describe issues you might encounter while using GKE on-prem, and how to resolve them.

Before you begin

Check the following sections before you begin troubleshooting an issue.

Diagnosing cluster issues using `gkectl`

Use gkectl diagnose commands to identify cluster issues and share cluster information with Google. See Diagnosing cluster issues.

Default logging behavior

For gkectl and gkeadm it is sufficient to use the default logging settings:

By default, log entries are saved as follows:
- For gkectl, the default log file is /home/ubuntu/.config/gke-on-prem/logs/gkectl-$(date).log, and the file is symlinked with the logs/gkectl-$(date).log file in the local directory where you run gkectl.
- For gkeadm, the default log file is logs/gkeadm-$(date).log in the local directory where you run gkeadm.
All log entries are saved in the log file, even if they are not printed in the terminal (when --alsologtostderr is false).
The -v5 verbosity level (default) covers all the log entries needed by the support team.
The log file also contains the command executed and the failure message.

We recommend that you send the log file to the support team when you need help.
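
When you collect a log for the support team, the most recent file is usually the one you want. The following is a minimal sketch, assuming the default logs/ directory described above and file names that begin with gkectl- or gkeadm-. The latest_log helper is hypothetical and is not part of either tool.

```shell
# latest_log DIR PREFIX
# Prints the most recently modified file in DIR whose name starts with
# PREFIX and ends in .log. Prints nothing if no such file exists.
# (Hypothetical helper; not part of gkectl or gkeadm.)
latest_log() {
  local dir="$1" prefix="$2"
  # ls -t sorts by modification time, newest first.
  ls -t "$dir"/"$prefix"*.log 2>/dev/null | head -n 1
}

# Example: newest gkectl log in the directory where you ran gkectl.
# latest_log logs gkectl-
```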

Specifying a non-default location for the log file

To specify a non-default location for the gkectl log file, use the --log_file flag. The log file that you specify will not be symlinked with the local directory.

To specify a non-default location for the gkeadm log file, use the --log_file flag.

Locating Cluster API logs in the admin cluster

If a VM fails to start after the admin control plane has started, you can try debugging this by inspecting the Cluster API controllers' logs in the admin cluster:

Find the name of the Cluster API controllers Pod in the kube-system namespace, where [ADMIN_CLUSTER_KUBECONFIG] is the path to the admin cluster's kubeconfig file:
```
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] -n kube-system get pods | grep clusterapi-controllers
```
Open the Pod's logs, where [POD_NAME] is the name of the Pod. Optionally, use grep or a similar tool to search for errors:
```
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] -n kube-system logs [POD_NAME] vsphere-controller-manager
```
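
The two steps above can be combined into one helper. This is only a sketch: the function names are hypothetical, the search terms "error" and "failed" are an assumption about what is worth looking for, and clusterapi_errors needs kubectl access to the admin cluster.

```shell
# filter_errors: reads log lines on stdin and prints only those that
# mention "error" or "failed" (case-insensitive). Hypothetical helper.
filter_errors() {
  grep -i -E 'error|failed' || true
}

# clusterapi_errors ADMIN_CLUSTER_KUBECONFIG
# Finds the clusterapi-controllers Pod and filters the logs of its
# vsphere-controller-manager container.
clusterapi_errors() {
  local kubeconfig="$1" pod
  pod=$(kubectl --kubeconfig "$kubeconfig" -n kube-system get pods -o name \
        | grep clusterapi-controllers | head -n 1)
  kubectl --kubeconfig "$kubeconfig" -n kube-system logs "$pod" \
    -c vsphere-controller-manager | filter_errors
}

# Example:
# clusterapi_errors [ADMIN_CLUSTER_KUBECONFIG]
```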

Installation

Debugging F5 BIG-IP issues using the admin cluster control plane node's kubeconfig

After an installation, GKE on-prem generates a kubeconfig file in the home directory of your admin workstation named internal-cluster-kubeconfig-debug. This kubeconfig file is identical to your admin cluster's kubeconfig, except that it points directly at the admin cluster's control plane node, where the admin control plane runs. You can use the internal-cluster-kubeconfig-debug file to debug F5 BIG-IP issues.

`gkectl check-config` validation fails: can't find F5 BIG-IP partitions

Symptoms: Validation fails because F5 BIG-IP partitions can't be found, even though they exist.
Potential causes: An issue with the F5 BIG-IP API can cause validation to fail.
Resolution: Try running gkectl check-config again.

`gkectl prepare --validate-attestations` fails: could not validate build attestation

Symptoms

Running gkectl prepare with the optional --validate-attestations flag returns the following error:

could not validate build attestation for gcr.io/gke-on-prem-release/.../...: VIOLATES_POLICY

Potential causes

An attestation might not exist for the affected image(s).

Resolution

Try downloading and deploying the admin workstation OVA again, as instructed in Creating an admin workstation. If the issue persists, reach out to Google for assistance.

Debugging using the bootstrap cluster's logs

During installation, GKE on-prem creates a temporary bootstrap cluster. After a successful installation, GKE on-prem deletes the bootstrap cluster, leaving you with your admin cluster and user cluster. Generally, you should have no reason to interact with this cluster.

If something goes wrong during an installation, and you did pass --cleanup-external-cluster=false to gkectl create cluster, you might find it useful to debug using the bootstrap cluster's logs. You can find the Pod, and then get its logs:

kubectl --kubeconfig /home/ubuntu/.kube/kind-config-gkectl get pods -n kube-system

kubectl --kubeconfig /home/ubuntu/.kube/kind-config-gkectl -n kube-system logs [POD_NAME]

Authentication plugin for GKE Enterprise

Keeping the `gcloud anthos auth` CLI up-to-date

Many common issues can be avoided by checking that the components of your gcloud anthos auth installation are up-to-date with the fixes from the latest version release.

There are two pieces that have to be verified since the gcloud anthos auth command has logic in the gcloud core component and a separately packaged anthos-auth component.

The gcloud core component.
- To update the gcloud core component, run
```
gcloud components update
```
- Verify that your gcloud installation is not out-of-date by running the following command and checking that the date printed is within the last 12 days.
```
gcloud version
```
The anthos-auth component.
- To update the anthos-auth component, run
```
gcloud components install anthos-auth
```
- Verify that your anthos-auth installation is not out-of-date by running the following command and verifying that the release is version v1.1.3 or above.
```
gcloud anthos auth version
```

`apt-get` Installations of `gcloud`

To check if your gcloud installation is managed via apt-get, run the gcloud components update command and check for the following error.

$ gcloud components update
ERROR: (gcloud.components.update)
You cannot perform this action because the gcloud CLI component manager
is disabled for this installation. You can run the following command
to achieve the same result for this installation:
...

Issue: You cannot perform this action because the gcloud CLI component manager is disabled for this installation:

For installations that are managed via apt-get, running the gcloud components commands above will not directly work and will result in an error message similar to the one reproduced above. However, running the gcloud components update and gcloud components install anthos-auth commands will print out the specific apt-get commands that you can execute to update the installation.

Failure running `gkectl create-login-config`

Issue 1:

Symptoms

When running gkectl create-login-config, you encounter the following error:

Error getting clientconfig using [user_cluster_kubeconfig]

Potential causes

This error means either the kubeconfig file passed to gkectl create-login-config is not for a user cluster or the ClientConfig CRD did not come up during cluster creation.

Resolution

Run the following command to see if the ClientConfig CRD is in the cluster:

kubectl --kubeconfig [user_cluster_kubeconfig] get clientconfig default -n kube-public

Issue 2:

Symptoms

When running gkectl create-login-config, you encounter the following error:

error merging with file [merge_file] because [merge_file] contains a
  cluster with the same name as the one read from [kubeconfig]. Please write to
  a new output file

Potential causes

Each login configuration file must contain unique cluster names. If you are seeing this error, the file you are writing login config data to contains a cluster name that already exists in the destination file.

Resolution

Write to a new --output file. Note the following:

If --output is not provided, the login config data will be written to a file called kubectl-anthos-config.yaml in the current directory by default.
If --output already exists, the command will try to merge the new login config to --output.
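
One way to make sure you always write to a new file is to pick an unused file name before running the command. The sketch below assumes output files end in .yaml. The next_free_path helper is hypothetical and is not part of gkectl.

```shell
# next_free_path PATH
# Prints PATH if nothing exists there; otherwise inserts -1, -2, ... before
# the .yaml extension until it finds an unused name. Hypothetical helper.
next_free_path() {
  local path="$1" base="${1%.yaml}" n=1
  while [ -e "$path" ]; do
    path="${base}-${n}.yaml"
    n=$((n + 1))
  done
  printf '%s\n' "$path"
}

# Example (requires gkectl and a user cluster kubeconfig):
# gkectl create-login-config --kubeconfig [user_cluster_kubeconfig] \
#   --output "$(next_free_path kubectl-anthos-config.yaml)"
```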

Failure running `gcloud anthos auth login`

Issue 1:

Symptoms: Running login using the auth plugin and the generated login config YAML file fails.
Potential causes: There might be an error in the OIDC configuration details.
Resolution: Verify OIDC client registration with your administrator.

Issue 2:

Symptoms: When a proxy is configured for HTTPS traffic, running the gcloud anthos auth login command fails with proxyconnect tcp in the error message. An example of the type of message you might see is proxyconnect tcp: tls: first record does not look like a TLS handshake.
Potential causes: There might be an error in the https_proxy or HTTPS_PROXY environment variable configurations. If https:// is specified in the environment variables, the Go HTTP client libraries might fail if the proxy is configured to handle HTTPS connections using other protocols such as SOCKS5.
Resolution: Modify the https_proxy and HTTPS_PROXY environment variables to omit the https:// prefix. On Windows, modify the system environment variables. For example, change the value of the https_proxy environment variable from https://webproxy.example.com:8000 to webproxy.example.com:8000.
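
On Linux or macOS, the change for the current shell session can be sketched as follows. The strip_https_scheme function name is hypothetical. To make the change permanent, edit wherever the variables are set (for example, your shell profile).

```shell
# strip_https_scheme VALUE
# Removes a leading "https://" from VALUE; other values pass through unchanged.
strip_https_scheme() {
  printf '%s\n' "${1#https://}"
}

# Apply the fix to whichever of the two variables is set in this shell.
if [ -n "${https_proxy:-}" ]; then
  export https_proxy="$(strip_https_scheme "$https_proxy")"
fi
if [ -n "${HTTPS_PROXY:-}" ]; then
  export HTTPS_PROXY="$(strip_https_scheme "$HTTPS_PROXY")"
fi
```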

Failure using kubeconfig generated by `gcloud anthos auth login` to access cluster

Symptoms

"Unauthorized" Error

If there is an `Unauthorized` error when using the kubeconfig generated by gcloud anthos auth login to access the cluster, this means that the apiserver is unable to authorize the user.

Potential causes

Either the appropriate RBACs are missing or incorrect or there is an error in the OIDC configuration for the cluster.

Resolution

Try the following steps to resolve the issue:

Parse the id-token from kubeconfig.

In the kubeconfig file that was generated by the login command, copy the id-token:
```
kind: Config
…
users:
- name: …
  user:
    auth-provider:
      config:
        id-token: [id-token]
        …
```
Follow the steps to install jwt-cli and run:
```
jwt [id-token]
```
Verify OIDC configuration.

The oidc section of config.yaml, which was used to create the cluster, contains the fields group and username, which set the --oidc-group-claim and --oidc-username-claim flags in the apiserver. When the apiserver is presented with the token, it looks for that group-claim and username-claim and verifies that the corresponding group or user has the correct permissions.

Verify that the claims set for group and user in the oidc section of config.yaml are present in the id-token.
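
If jwt-cli is not available, the token payload can also be inspected with standard tools. The following is a sketch only: it assumes base64 -d is available (as used elsewhere on this page), it does not verify the token's signature, and the jwt_payload and has_claim names are hypothetical.

```shell
# jwt_payload TOKEN
# Prints the decoded JSON payload (the second dot-separated segment) of a JWT.
# The signature is NOT verified; use this for inspection only.
jwt_payload() {
  local seg
  # Take the payload segment and convert base64url to standard base64.
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the padding that base64url encoding strips.
  while [ $(( ${#seg} % 4 )) -ne 0 ]; do seg="${seg}="; done
  printf '%s' "$seg" | base64 -d
}

# has_claim TOKEN CLAIM
# Succeeds if the payload contains a key named CLAIM (simple text match).
has_claim() {
  jwt_payload "$1" | grep -q "\"$2\"[[:space:]]*:"
}

# Example, using the claim names from the oidc section of config.yaml:
# has_claim "[id-token]" unique_name && has_claim "[id-token]" group
```
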
Check RBACs that were applied.

Verify that there is an RBAC with the correct permissions for either the user specified by the username-claim or one of the groups listed under the group-claim from the previous step. The name of the user or group in the RBAC should be prefixed with the usernameprefix or groupprefix that was provided in the oidc section of config.yaml.

Note that if usernameprefix was left blank, and username is a value other than email, the prefix will default to issuerurl#. To disable username prefixes, usernameprefix should be set to -.

For more information about user and group prefixes, see Populating the oidc spec.

Note that the Kubernetes API server currently treats a backslash as an escape character. If the name of the user or group contains \\, the API server reads it as a single \ when parsing the id_token. The RBAC applied for this user or group should therefore contain only a single backslash; otherwise you might see an Unauthorized error.

Example:

config.yaml:
```
oidc:
    issuerurl:
    …
    username: "unique_name"
    usernameprefix: "-"
    group: "group"
    groupprefix: "oidc:"
    ...
```
id_token:
```
{
  ...
  "email": "cluster-developer@example.com",
  "unique_name": "EXAMPLE\\cluster-developer",
  "group": [
    "Domain Users",
    "EXAMPLE\\developers"
],
  ...
}
```
The following RBACs would grant this user the permissions of the pod-reader ClusterRole (note the single backslash in the name field instead of a double backslash):

Group RBAC:
```
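apiVersion: rbac.authorization.k8s.io/v1
# Assumes a cluster-wide binding; use a RoleBinding with a namespace to scope it.
kind: ClusterRoleBinding
metadata:
  name: example-binding
subjects:
- kind: Group
  # Single quotes keep the backslash literal in YAML.
  name: 'oidc:EXAMPLE\developers'
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io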
```
User RBAC:
```
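apiVersion: rbac.authorization.k8s.io/v1
# Assumes a cluster-wide binding; use a RoleBinding with a namespace to scope it.
kind: ClusterRoleBinding
metadata:
  name: example-binding
subjects:
- kind: User
  # Single quotes keep the backslash literal in YAML.
  name: 'EXAMPLE\cluster-developer'
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io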
```
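
The prefix and backslash rules from this example can be sketched as a small helper that derives the subject name from a claim value as it appears in the token. The rbac_subject_name function is hypothetical. It treats a prefix of - as "no prefix", which is documented above for usernameprefix, and it does not handle the issuerurl# default that applies when usernameprefix is left blank.

```shell
# rbac_subject_name PREFIX CLAIM_VALUE
# PREFIX is the usernameprefix or groupprefix value; "-" means no prefix.
# CLAIM_VALUE is the claim as written in the JSON token, where each
# backslash appears as "\\". Prints the name to use in the RBAC subject.
rbac_subject_name() {
  local prefix="$1" value="$2"
  [ "$prefix" = "-" ] && prefix=""
  # Collapse each JSON-escaped "\\" into a single "\".
  value=$(printf '%s' "$value" | sed 's/\\\\/\\/g')
  printf '%s%s\n' "$prefix" "$value"
}

rbac_subject_name 'oidc:' 'EXAMPLE\\developers'     # oidc:EXAMPLE\developers
rbac_subject_name '-' 'EXAMPLE\\cluster-developer'  # EXAMPLE\cluster-developer
```
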
Check API Server logs.

If the OIDC plugin configured in the kube apiserver does not start up correctly, the API server will return an "Unauthorized" error when presented with the id-token. To see if there were any issues with the OIDC plugin in the API server, run:
```
kubectl --kubeconfig=[admin_cluster_kubeconfig] logs statefulset/kube-apiserver -n [user_cluster_name]
```

Symptoms

Unable to connect to the server: Get {DISCOVERY_ENDPOINT}: x509: certificate signed by unknown authority

Potential causes

The refresh token in the kubeconfig expired.

Resolution

Run the login command again.

Google Cloud console login

The following are common errors that might occur while using Google Cloud console to try to log in:

Login redirects to page with "URL not found" error

Symptoms

After you try to log in, Google Cloud console redirects to a page that shows a "URL not found" error.

Potential causes

Google Cloud console is not able to reach the GKE on-prem identity provider.

Resolution

Try the following steps to resolve the issue:

Set useHTTPProxy to true

If the IDP is not reachable over the public internet, you need to enable the OIDC HTTP proxy to log in via Google Cloud console. In the oidc section of config.yaml, set usehttpproxy to true. If you have already created a cluster and want to turn on the proxy, you can edit the ClientConfig CRD directly. Run `kubectl edit clientconfig default -n kube-public` and change useHTTPProxy to true.

useHTTPProxy is already set to true

If the HTTP proxy is enabled and you are still seeing this error, there might have been an issue with the proxy starting up. To get the logs of the proxy, run `kubectl logs deployment/clientconfig-operator -n kube-system`. Note that even if your IDP has a well-known CA, the capath field in the oidc section of config.yaml must be provided for the HTTP proxy to start.

IDP prompts for consent

If the authorization server prompts for consent, and you have not included the extraparam prompt=consent, then you might see this error. Run `kubectl edit clientconfig default -n kube-public`, add prompt=consent to extraparams, and try logging in again.

RBACs are misconfigured

If you have not done so already, try authenticating using the Authentication Plugin for Anthos. If you also see an authorization error when logging in with the plugin, follow the troubleshooting steps to resolve the issue with the plugin, and then try logging in via Google Cloud console again.

Try logging out and logging back in

In some cases, if some settings are changed on the storage service, you might need to log out explicitly. Go to the cluster details page, click Log out, and try logging back in.

Admin workstation

`openssl` can't validate admin workstation OVA

Symptoms: Running openssl dgst against the admin workstation OVA file doesn't return Verified OK.
Potential causes: An issue is present in the OVA file that prevents successful validation.
Resolution: Try downloading and deploying the admin workstation OVA again, as instructed in Download the admin workstation OVA. If the issue persists, reach out to Google for assistance.

Connect

Unable to register a user cluster

If you encounter issues with registering user clusters, reach out to Google for assistance.

Cluster created during alpha was deregistered

Refer to Registering a user cluster in the Connect documentation.

You might also choose to delete and recreate the cluster.

Storage

Volume fails to attach

Symptoms

The output of gkectl diagnose cluster looks like the following:

Checking cluster object...PASS
Checking machine objects...PASS
Checking control plane pods...PASS
Checking gke-connect pods...PASS
Checking kube-system pods...PASS
Checking gke-system pods...PASS
Checking storage...FAIL
    PersistentVolume pvc-776459c3-d350-11e9-9db8-e297f465bc84: virtual disk "[datastore_nfs] kubevols/kubernetes-dynamic-pvc-776459c3-d350-11e9-9db8-e297f465bc84.vmdk" IS attached to machine "gsl-test-user-9b46dbf9b-9wdj7" but IS NOT listed in the Node.Status
1 storage errors

One or more Pods is stuck in ContainerCreating state with a warning like the following:

Events:
  Type     Reason              Age               From                     Message
  ----     ------              ----              ----                     -------
  Warning  FailedAttachVolume  6s (x6 over 31s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-776459c3-d350-11e9-9db8-e297f465bc84" : Failed to add disk 'scsi0:6'.

Potential causes

If a virtual disk is attached to the wrong virtual machine, it may be due to issue #32727 in Kubernetes 1.12.

Resolution

If a virtual disk is attached to the wrong virtual machine, you might need to manually detach it:

Drain the node. See Safely draining a node. You might want to include the --ignore-daemonsets and --delete-local-data flags in your kubectl drain command.
Power off the VM.
Edit the VM's hardware config in vCenter to remove the volume.
Power on the VM.
Uncordon the node.
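
The kubectl portions of these steps can be sketched as follows. The VM power and hardware changes still happen in vCenter and are not scripted here. The run, drain_node, and uncordon_node helpers are hypothetical. Setting DRY_RUN=1 prints each command without executing it, which can be useful for review before touching a node.

```shell
# run CMD...
# Prints the command, then executes it unless DRY_RUN=1.
run() {
  echo "+ $*"
  [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

# drain_node NODE_NAME: first step, before powering off the VM.
drain_node() {
  run kubectl drain "$1" --ignore-daemonsets --delete-local-data
}

# uncordon_node NODE_NAME: last step, after the VM is powered back on.
uncordon_node() {
  run kubectl uncordon "$1"
}

# Example:
# DRY_RUN=1 drain_node [NODE_NAME]
```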

Volume is lost

Symptoms

The output of gkectl diagnose cluster looks like the following:

Checking cluster object...PASS
Checking machine objects...PASS
Checking control plane pods...PASS
Checking gke-connect pods...PASS
Checking kube-system pods...PASS
Checking gke-system pods...PASS
Checking storage...FAIL
    PersistentVolume pvc-52161704-d350-11e9-9db8-e297f465bc84: virtual disk "[datastore_nfs] kubevols/kubernetes-dynamic-pvc-52161704-d350-11e9-9db8-e297f465bc84.vmdk" IS NOT found
1 storage errors

One or more Pods is stuck in ContainerCreating state with a warning like the following:

Events:
  Type     Reason              Age                   From                                    Message
  ----     ------              ----                  ----                                    -------
  Warning  FailedAttachVolume  71s (x28 over 42m)    attachdetach-controller                 AttachVolume.Attach failed for volume "pvc-52161704-d350-11e9-9db8-e297f465bc84" : File []/vmfs/volumes/43416d29-03095e58/kubevols/
  kubernetes-dynamic-pvc-52161704-d350-11e9-9db8-e297f465bc84.vmdk was not found

Potential causes

If you see a "not found" error related to your VMDK file, it is likely that the virtual disk was permanently deleted. This can happen if an operator manually deletes a virtual disk or the virtual machine it is attached to. To prevent this, manage your virtual machines as described in Resizing a user cluster and Upgrading clusters.

Resolution

If a virtual disk was permanently deleted, you might need to manually clean up related Kubernetes resources:

Delete the PVC that referenced the PV by running kubectl delete pvc [PVC_NAME].
Delete the Pod that referenced the PVC by running kubectl delete pod [POD_NAME].
Delete the Pod again by repeating the previous step. (Yes, really. See Kubernetes issue 74374.)

vSphere CSI Volume fails to detach

Symptoms

If you find pods stuck in the ContainerCreating phase with FailedAttachVolume warnings, it could be due to a failed detach on a different node.

Run the following command to find CSI detach errors:

kubectl get volumeattachments -o=custom-columns=NAME:metadata.name,DETACH_ERROR:status.detachError.message

The output should look like the following:

NAME                                                                   DETACH_ERROR
csi-0e80d9be14dc09a49e1997cc17fc69dd8ce58254bd48d0d8e26a554d930a91e5   rpc error: code = Internal desc = QueryVolume failed for volumeID: "57549b5d-0ad3-48a9-aeca-42e64a773469". ServerFaultCode: NoPermission
csi-164d56e3286e954befdf0f5a82d59031dbfd50709c927a0e6ccf21d1fa60192d   
csi-8d9c3d0439f413fa9e176c63f5cc92bd67a33a1b76919d42c20347d52c57435c   
csi-e40d65005bc64c45735e91d7f7e54b2481a2bd41f5df7cc219a2c03608e8e7a8
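
When there are many attachments, it can help to show only the rows that report an error. The filter below is a sketch: the detach_errors name is hypothetical, and it assumes an empty DETACH_ERROR column is printed either as blank or as <none>, depending on the kubectl version.

```shell
# detach_errors
# Reads the table printed by the command above on stdin and prints only
# the rows whose DETACH_ERROR column is non-empty.
detach_errors() {
  # NR > 1 skips the header row; NF > 1 means text follows the name.
  awk 'NR > 1 && NF > 1 && $2 != "<none>"'
}

# Example (requires cluster access):
# kubectl get volumeattachments \
#   -o=custom-columns=NAME:metadata.name,DETACH_ERROR:status.detachError.message \
#   | detach_errors
```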

Potential causes

The CNS > Searchable privilege has not been granted to the vSphere user.

Resolution

Add the CNS > Searchable privilege to your vCenter user account. The detach operation automatically retries until it succeeds.

Upgrades

About downtime during upgrades

| Resource | Description |
| --- | --- |
| Admin cluster | When an admin cluster is down, user cluster control planes and workloads on user clusters continue to run, unless they were affected by a failure that caused the downtime. |
| User cluster control plane | Typically, you should expect no noticeable downtime to user cluster control planes. However, long-running connections to the Kubernetes API server might break and would need to be re-established. In those cases, the API caller should retry until it establishes a connection. In the worst case, there can be up to one minute of downtime during an upgrade. Note: If user cluster nodes are unable to reach the user control plane during the upgrade, new workloads are not scheduled to the cluster. Existing workloads are unaffected. |
| User cluster nodes | If an upgrade requires a change to user cluster nodes, GKE on-prem recreates the nodes in a rolling fashion, and reschedules Pods running on these nodes. You can prevent impact to your workloads by configuring appropriate PodDisruptionBudgets and anti-affinity rules. |

Renewal of certificates might be required before an admin cluster upgrade

Before you begin the admin cluster upgrade process, you should make sure that your admin cluster certificates are currently valid, and renew these certificates if they are not.

Admin cluster certificate renewal process

Make sure that OpenSSL is installed on the admin workstation before you begin.
Set the KUBECONFIG variable:
```
KUBECONFIG=ABSOLUTE_PATH_ADMIN_CLUSTER_KUBECONFIG
```
Replace ABSOLUTE_PATH_ADMIN_CLUSTER_KUBECONFIG with the absolute path to the admin cluster kubeconfig file.

Get the IP address and SSH keys for the admin master node:

kubectl --kubeconfig "${KUBECONFIG}" get secrets -n kube-system sshkeys \
-o jsonpath='{.data.vsphere_tmp}' | base64 -d > \
~/.ssh/admin-cluster.key && chmod 600 ~/.ssh/admin-cluster.key

export MASTER_NODE_IP=$(kubectl --kubeconfig "${KUBECONFIG}" get nodes -o \
jsonpath='{.items[*].status.addresses[?(@.type=="ExternalIP")].address}' \
--selector='node-role.kubernetes.io/master')

Check if the certificates are expired:
```
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}" \
"sudo kubeadm alpha certs check-expiration"
```
If the certificates are expired, you must renew them before upgrading the admin cluster.

Back up old certificates:

This is an optional, but recommended, step.

# ssh into admin master
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}"

# on admin master
sudo tar -czvf backup.tar.gz /etc/kubernetes
logout

# on admin workstation
sudo scp -i ~/.ssh/admin-cluster.key \
ubuntu@"${MASTER_NODE_IP}":/home/ubuntu/backup.tar.gz .

Renew the certificates with kubeadm:

 # ssh into admin master
 ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}"
 # on admin master
 sudo kubeadm alpha certs renew all

Restart the admin master node:

  # on admin master
  cd /etc/kubernetes
  sudo mkdir tempdir
  sudo mv manifests/*.yaml tempdir/
  sleep 5
  echo "remove pods"
  # make sure the kubelet detects the change and removes those Pods;
  # wait until the result of this command is empty
  sudo docker ps | grep kube-apiserver

  # move the manifests back so that the kubelet starts those Pods again
  echo "start pods again"
  sudo mv tempdir/*.yaml manifests/
  sleep 30
  # make sure the kubelet has started those Pods again;
  # this command should show some results
  sudo docker ps | grep -e kube-apiserver -e kube-controller-manager -e kube-scheduler -e etcd

  # clean up
  sudo rm -rf tempdir

  logout

Because the admin cluster kubeconfig file also expires if the admin certificates expire, you should back up this file before expiration.
- Back up the admin cluster kubeconfig file:
```
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}" \
"sudo cat /etc/kubernetes/admin.conf" > new_admin.conf
vi "${KUBECONFIG}"
```
- Replace client-certificate-data and client-key-data in kubeconfig with client-certificate-data and client-key-data in the new_admin.conf file that you created.

You must validate the renewed certificates, and validate the certificate of kube-apiserver.

Check certificates expiration:

ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}" \
"sudo kubeadm alpha certs check-expiration"

Check certificate of kube-apiserver:

# Get the IP address of kube-apiserver
cat $KUBECONFIG | grep server
# Get the current kube-apiserver certificate
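# replace the placeholders with the address and port from the server line above
openssl s_client -showcerts -connect [APISERVER_IP]:[APISERVER_PORT] \
| sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' \
> current-kube-apiserver.crt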
# check expiration date of this cert
openssl x509 -in current-kube-apiserver.crt -noout -enddate
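
To turn the printed expiration date into a number that is easier to act on, the notAfter= line can be converted to days remaining. This is a sketch that assumes GNU date (for the -d option). The days_until_expiry helper is hypothetical, and the optional second argument exists only so that the current time can be fixed for testing.

```shell
# days_until_expiry ENDDATE_LINE [NOW_EPOCH]
# ENDDATE_LINE is the "notAfter=..." line printed by openssl x509 -enddate.
# Prints the days remaining, truncated toward zero (negative once expired).
# Assumes GNU date, which supports -d.
days_until_expiry() {
  local end_epoch now_epoch
  end_epoch=$(date -u -d "${1#notAfter=}" +%s) || return 1
  now_epoch="${2:-$(date -u +%s)}"
  echo $(( (end_epoch - now_epoch) / 86400 ))
}

# Example:
# days_until_expiry "$(openssl x509 -in current-kube-apiserver.crt -noout -enddate)"
```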

Resizing user clusters

Resizing a user cluster fails

Symptoms

A resize operation on a user cluster fails.

Potential causes

Several factors could cause resize operations to fail.

Resolution

If a resize fails, follow these steps:

Check the cluster's MachineDeployment status to see if there are any events or error messages:
```
kubectl describe machinedeployments [MACHINE_DEPLOYMENT_NAME]
```
Check if there are errors on the newly-created Machines:
```
kubectl describe machine [MACHINE_NAME]
```

Error: "no addresses can be allocated"

Symptoms

After resizing a user cluster, kubectl describe machine [MACHINE_NAME] displays the following error:

Events:
   Type     Reason  Age                From                    Message
   ----     ------  ----               ----                    -------
   Warning  Failed  9s (x13 over 56s)  machineipam-controller  ipam: no addresses can be allocated

Potential causes

There aren't enough IP addresses available for the user cluster.

Resolution

Allocate more IP addresses for the cluster. Then, delete the affected Machine:

kubectl delete machine [MACHINE_NAME]

If the cluster is configured correctly, a replacement Machine is created with an IP address.

Sufficient number of IP addresses allocated, but Machine fails to register with cluster

Symptoms: Network has enough addresses allocated but the Machine still fails to register with the user cluster.
Potential causes: There might be an IP conflict. The IP might be taken by another Machine or by your load balancer.
Resolution: Check that the affected Machine's IP address is not taken. If there is a conflict, you need to resolve the conflict in your environment.

vSphere

Debugging with `govc`

If you encounter issues specific to vSphere, you can use govc to troubleshoot. For example, you can easily confirm permissions and access for your vCenter user accounts and collect vSphere logs.

Changing vCenter Certificate

If you are running a vCenter server in evaluation or default setup mode, and it has a generated TLS certificate, this certificate might change over time. If the certificate has changed, you need to let your running cluster(s) know about the new certificate:

Retrieve the new vCenter cert and save to a file:

true | openssl s_client -connect [VCENTER_IP_ADDRESS]:443 -showcerts 2>/dev/null | sed -ne '/-BEGIN/,/-END/p' > vcenter.pem

Now, for each cluster, delete the ConfigMap containing the vSphere and vCenter certificate, and create a new ConfigMap with the new cert. For example:

kubectl --kubeconfig kubeconfig delete configmap vsphere-ca-certificate -n kube-system

kubectl --kubeconfig kubeconfig delete configmap vsphere-ca-certificate -n user-cluster1

kubectl --kubeconfig kubeconfig create configmap -n user-cluster1 --dry-run vsphere-ca-certificate --from-file=ca.crt=vcenter.pem  -o yaml  | kubectl --kubeconfig kubeconfig apply -f -

kubectl --kubeconfig kubeconfig create configmap -n kube-system --dry-run vsphere-ca-certificate --from-file=ca.crt=vcenter.pem  -o yaml  | kubectl --kubeconfig kubeconfig apply -f -

Delete the clusterapi-controllers Pod for each cluster. When the Pod restarts, it begins using the new certificate. For example:

kubectl --kubeconfig kubeconfig -n kube-system get pods

kubectl --kubeconfig kubeconfig -n kube-system delete pod clusterapi-controllers-...

Miscellaneous

Terraform vSphere provider session limit

GKE on-prem uses Terraform's vSphere provider to bring up VMs in your vSphere environment. The provider's session limit is 1000 sessions. The current implementation doesn't close active sessions after use. You might encounter 503 errors if you have too many sessions running.

Sessions are automatically closed after 300 seconds.

Symptoms

If you have too many sessions running, you might encounter the following error:

Error connecting to CIS REST endpoint: Login failed: body:
  {"type":"com.vmware.vapi.std.errors.service_unavailable","value":
  {"messages":[{"args":["1000","1000"],"default_message":"Sessions count is
  limited to 1000. Existing sessions are 1000.",
  "id":"com.vmware.vapi.endpoint.failedToLoginMaxSessionCountReached"}]}},
  status: 503 Service Unavailable

Potential causes

There are too many Terraform provider sessions running in your environment.

Resolution

Currently, this is working as intended. Sessions are automatically closed after 300 seconds. For more information, refer to GitHub issue #618.
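
Because sessions expire on their own, one option is to wait and rerun the command that hit the 503 error. The following retry helper is a generic sketch of that approach and is not part of GKE on-prem. The attempt count and delay are assumptions that you should tune for your environment.

```shell
# retry ATTEMPTS DELAY_SECONDS CMD...
# Runs CMD until it succeeds or ATTEMPTS is exhausted, sleeping
# DELAY_SECONDS between tries. Returns the exit status of the last try.
retry() {
  local attempts="$1" delay="$2" n=1 status=0
  shift 2
  while :; do
    "$@" && return 0
    status=$?
    [ "$n" -ge "$attempts" ] && return "$status"
    echo "attempt $n failed; retrying in ${delay}s" >&2
    sleep "$delay"
    n=$((n + 1))
  done
}

# Example: wait out the 300-second session timeout between tries.
# retry 3 300 [COMMAND_THAT_RETURNED_503]
```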

Using a proxy for Docker: `oauth2: cannot fetch token`

Symptoms

While using a proxy, you encounter the following error:

oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: proxyconnect tcp: tls: oversized record received with length 20527

Potential causes

You might have provided an HTTPS proxy instead of HTTP.

Resolution

In your Docker configuration, change the proxy address to http:// instead of https://.

Verifying that licenses are valid

Remember to verify that your licenses are valid, especially if you are using trial licenses. You might encounter unexpected failures if your F5, ESXi host, or vCenter licenses have expired.
