The following sections describe issues you might encounter while using GKE On-Prem, and how to resolve them.
Before you begin
Check the following sections before you begin troubleshooting an issue.
Diagnosing cluster issues using gkectl
Use gkectl diagnose commands to identify cluster issues and share cluster information with Google. See Diagnosing cluster issues.
Default logging behavior
For gkectl and gkeadm, it is sufficient to use the default logging settings:
- By default, log entries are saved as follows:
  - For gkectl, the default log file is /home/ubuntu/.config/gke-on-prem/logs/gkectl-$(date).log, and the file is symlinked with the logs/gkectl-$(date).log file in the local directory where you run gkectl.
  - For gkeadm, the default log file is logs/gkeadm-$(date).log in the local directory where you run gkeadm.
- All log entries are saved in the log file, even if they are not printed in the terminal (when --alsologtostderr is false).
- The -v5 verbosity level (default) covers all the log entries needed by the support team.
- The log file also contains the command executed and the failure message.
We recommend that you send the log file to the support team when you need help.
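To locate the file to send, a small helper like the following can pick the most recently modified log in a directory. This is a minimal sketch: the function name latest_log is hypothetical, and the glob <tool>-*.log is an assumption about how the $(date) part of the file name expands.

```shell
#!/bin/sh
# latest_log DIR TOOL
# Print the most recently modified log file for TOOL ("gkectl" or "gkeadm")
# in DIR. Assumes log files match the pattern TOOL-*.log.
latest_log() {
  # ls -t sorts by modification time, newest first; head keeps the first entry.
  # Prints nothing if no file matches.
  ls -t "$1"/"$2"-*.log 2>/dev/null | head -n 1
}

# Example usage, run from the directory where you ran gkectl:
#   latest_log ./logs gkectl
```

The helper only reads file names and timestamps, so it is safe to run at any time.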
Specifying a non-default location for the log file
To specify a non-default location for the gkectl log file, use the --log_file flag. The log file that you specify will not be symlinked with the local directory.
To specify a non-default location for the gkeadm log file, use the --log_file flag.
Locating Cluster API logs in the admin cluster
If a VM fails to start after the admin control plane has started, you can try debugging this by inspecting the Cluster API controllers' logs in the admin cluster:
Find the name of the Cluster API controllers Pod in the kube-system namespace, where [ADMIN_CLUSTER_KUBECONFIG] is the path to the admin cluster's kubeconfig file:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] -n kube-system get pods | grep clusterapi-controllers
Open the Pod's logs, where [POD_NAME] is the name of the Pod. Optionally, use grep or a similar tool to search for errors:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] -n kube-system logs [POD_NAME] vsphere-controller-manager
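The two steps above can be chained so that the Pod name does not have to be copied by hand. The following is a sketch under stated assumptions: the helper names clusterapi_pod_name and filter_errors are hypothetical, and the search term "error" is only a starting point for what to look for in the logs.

```shell
#!/bin/sh
# Read `kubectl get pods` output on stdin and print the name (first column)
# of the first Pod whose name contains "clusterapi-controllers".
clusterapi_pod_name() {
  awk '$1 ~ /clusterapi-controllers/ { print $1; exit }'
}

# Read log text on stdin and keep only lines that mention "error",
# case-insensitively. Succeeds even when no line matches.
filter_errors() {
  grep -i 'error' || true
}

# Example usage:
#   POD_NAME="$(kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] -n kube-system get pods | clusterapi_pod_name)"
#   kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] -n kube-system logs "$POD_NAME" vsphere-controller-manager | filter_errors
```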
Installation
Debugging F5 BIG-IP issues using the admin cluster control plane node's kubeconfig
After an installation, GKE On-Prem generates a kubeconfig file named internal-cluster-kubeconfig-debug in the home directory of your admin workstation. This kubeconfig file is identical to your admin cluster's kubeconfig, except that it points directly at the admin cluster's control plane node, where the admin control plane runs. You can use the internal-cluster-kubeconfig-debug file to debug F5 BIG-IP issues.
gkectl check-config validation fails: can't find F5 BIG-IP partitions
- Symptoms
Validation fails because F5 BIG-IP partitions can't be found, even though they exist.
- Potential causes
An issue with the F5 BIG-IP API can cause validation to fail.
- Resolution
Try running gkectl check-config again.
gkectl prepare --validate-attestations fails: could not validate build attestation
- Symptoms
Running gkectl prepare with the optional --validate-attestations flag returns the following error:
could not validate build attestation for gcr.io/gke-on-prem-release/.../...: VIOLATES_POLICY
- Potential causes
An attestation might not exist for the affected image(s).
- Resolution
Try downloading and deploying the admin workstation OVA again, as instructed in Creating an admin workstation. If the issue persists, reach out to Google for assistance.
Debugging using the bootstrap cluster's logs
During installation, GKE On-Prem creates a temporary bootstrap cluster. After a successful installation, GKE On-Prem deletes the bootstrap cluster, leaving you with your admin cluster and user cluster. Generally, you should have no reason to interact with this cluster.
If something goes wrong during an installation, and you passed --cleanup-external-cluster=false to gkectl create cluster, you might find it useful to debug using the bootstrap cluster's logs. You can find the Pod, and then get its logs:
kubectl --kubeconfig /home/ubuntu/.kube/kind-config-gkectl get pods -n kube-system
kubectl --kubeconfig /home/ubuntu/.kube/kind-config-gkectl -n kube-system logs [POD_NAME]
Admin workstation
openssl can't validate admin workstation OVA
- Symptoms
Running openssl dgst against the admin workstation OVA file doesn't return Verified OK.
- Potential causes
An issue is present in the OVA file that prevents successful validation.
- Resolution
Try downloading and deploying the admin workstation OVA again, as instructed in Download the admin workstation OVA. If the issue persists, reach out to Google for assistance.
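When you re-download the OVA, the check for Verified OK can be scripted. The sketch below assumes that you already have the public key and the signature file from the download instructions. The function name verify_ova and the file names in the usage comment are placeholders, not names taken from this guide.

```shell
#!/bin/sh
# verify_ova PUBLIC_KEY_FILE SIGNATURE_FILE OVA_FILE
# Returns 0 only if `openssl dgst -verify` prints "Verified OK".
verify_ova() {
  # Capture stdout; any failure (bad signature, missing file) leaves a
  # different string or an empty one in $result.
  result="$(openssl dgst -verify "$1" -signature "$2" "$3" 2>/dev/null || true)"
  [ "$result" = "Verified OK" ]
}

# Example usage (all three file names are placeholders):
#   if verify_ova key.pem admin-workstation.ova.sig admin-workstation.ova; then
#     echo "OVA verified"
#   else
#     echo "Validation failed; download the OVA again" >&2
#   fi
```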
Connect
Unable to register a user cluster
If you encounter issues with registering user clusters, reach out to Google for assistance.
Cluster created during alpha was deregistered
Refer to Registering a user cluster in the Connect documentation.
You might also choose to delete and recreate the cluster.
Upgrades
About downtime during upgrades
Resource | Description |
---|---|
Admin cluster | When an admin cluster is down, user cluster control planes and workloads on user clusters continue to run, unless they were affected by a failure that caused the downtime. |
User cluster control plane | Typically, you should expect no noticeable downtime to user cluster control planes. However, long-running connections to the Kubernetes API server might break and would need to be re-established. In those cases, the API caller should retry until it establishes a connection. In the worst case, there can be up to one minute of downtime during an upgrade. |
User cluster nodes | If an upgrade requires a change to user cluster nodes, GKE On-Prem recreates the nodes in a rolling fashion, and reschedules Pods running on these nodes. You can prevent impact to your workloads by configuring appropriate PodDisruptionBudgets and anti-affinity rules. |
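The table notes that an API caller should retry until it re-establishes a connection. For scripts that call the Kubernetes API server during an upgrade, a simple retry wrapper is one way to do that. This is a minimal sketch: the retry function, the attempt count, and the delay are illustrative choices, and [USER_CLUSTER_KUBECONFIG] is a placeholder for the path to your user cluster's kubeconfig file.

```shell
#!/bin/sh
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS attempts have been made.
# Returns 0 on success, 1 if every attempt failed.
retry() {
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 0
}

# Example usage: retry for roughly one minute (12 attempts, 5 seconds apart).
#   retry 12 5 kubectl --kubeconfig [USER_CLUSTER_KUBECONFIG] get nodes
```

A fixed delay keeps the sketch short. Callers that run at scale might prefer an increasing delay between attempts.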
Resizing user clusters
Resizing a user cluster fails
- Symptoms
A resize operation on a user cluster fails.
- Potential causes
Several factors could cause resize operations to fail.
- Resolution
If a resize fails, follow these steps:
Check the cluster's MachineDeployment status to see if there are any events or error messages:
kubectl describe machinedeployments [MACHINE_DEPLOYMENT_NAME]
Check if there are errors on the newly-created Machines:
kubectl describe machine [MACHINE_NAME]
Error: "no addresses can be allocated"
- Symptoms
After resizing a user cluster, kubectl describe machine [MACHINE_NAME] displays the following error:
Events:
  Type     Reason  Age                From                    Message
  ----     ------  ----               ----                    -------
  Warning  Failed  9s (x13 over 56s)  machineipam-controller  ipam: no addresses can be allocated
- Potential causes
There aren't enough IP addresses available for the user cluster.
- Resolution
Allocate more IP addresses for the cluster. Then, delete the affected Machine:
kubectl delete machine [MACHINE_NAME]
If the cluster is configured correctly, a replacement Machine is created with an IP address.
Sufficient number of IP addresses allocated, but Machine fails to register with cluster
- Symptoms
The network has enough addresses allocated, but the Machine still fails to register with the user cluster.
- Potential causes
There might be an IP conflict. The IP might be taken by another Machine or by your load balancer.
- Resolution
Check that the affected Machine's IP address is not taken. If there is a conflict, you need to resolve the conflict in your environment.
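If you can export the addresses in use to a plain list, a quick duplicate check can point to a conflict. The following is only a sketch: how you collect the addresses depends on your environment, and both the duplicate_ips function and the ips.txt file are hypothetical.

```shell
#!/bin/sh
# Read IP addresses on stdin, one per line, and print each address that
# appears more than once. Prints nothing when every address is unique.
duplicate_ips() {
  sort | uniq -d
}

# Example usage, assuming ips.txt lists every address assigned to Machines
# and to your load balancer:
#   duplicate_ips < ips.txt
```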
Miscellaneous
Terraform vSphere provider session limit
GKE On-Prem uses Terraform's vSphere provider to bring up VMs in your vSphere environment. The provider's session limit is 1000 sessions. The current implementation doesn't close active sessions after use. You might encounter 503 errors if you have too many sessions running.
Sessions are automatically closed after 300 seconds.
- Symptoms
If you have too many sessions running, you might encounter the following error:
Error connecting to CIS REST endpoint: Login failed: body: {"type":"com.vmware.vapi.std.errors.service_unavailable","value": {"messages":[{"args":["1000","1000"],"default_message":"Sessions count is limited to 1000. Existing sessions are 1000.", "id":"com.vmware.vapi.endpoint.failedToLoginMaxSessionCountReached"}]}}, status: 503 Service Unavailable
- Potential causes
There are too many Terraform provider sessions running in your environment.
- Resolution
Currently, this is working as intended. Sessions are automatically closed after 300 seconds. For more information, refer to GitHub issue #618.
Using a proxy for Docker: oauth2: cannot fetch token
- Symptoms
While using a proxy, you encounter the following error:
oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: proxyconnect tcp: tls: oversized record received with length 20527
- Potential causes
You might have provided an HTTPS proxy instead of an HTTP proxy.
- Resolution
In your Docker configuration, change the proxy address to http:// instead of https://.
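As an illustration of the change, the helper below rewrites a proxy address that starts with https:// so that it starts with http://. It is a sketch only: the function name to_http_proxy and the address in the usage comment are made up, and you still need to put the resulting value into your own Docker proxy configuration.

```shell
#!/bin/sh
# to_http_proxy ADDRESS
# Print ADDRESS with a leading "https://" replaced by "http://".
# Addresses that do not start with "https://" are printed unchanged.
to_http_proxy() {
  printf '%s\n' "$1" | sed 's#^https://#http://#'
}

# Example usage (proxy.example.com:3128 is a made-up address):
#   to_http_proxy "https://proxy.example.com:3128"
```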
Verifying that licenses are valid
Remember to verify that your licenses are valid, especially if you are using trial licenses. You might encounter unexpected failures if your F5, ESXi host, or vCenter licenses have expired.