Troubleshooting Anthos clusters on Azure
Use these troubleshooting steps if you run into problems creating or using Anthos clusters on Azure.
Cluster creation failures
When you make a request to create a cluster, Anthos clusters on Azure first runs a set of pre-flight tests to verify the request. If the cluster creation fails, it can be either because one of these pre-flight tests failed or because a step in the cluster creation process itself didn't complete.
If a pre-flight test fails, the request doesn't create any resources and
returns information about the error to you directly. For example, if you try to
create a cluster with the name
invalid%%%name, the pre-flight test for a valid
cluster name fails and the request returns the following error:
ERROR: (gcloud.container.azure.clusters.create) INVALID_ARGUMENT: must be between 1-63 characters, valid characters are /[a-z][0-9]-/, should start with a letter, and end with a letter or a number: "invalid%%%name", field: azure_cluster_id
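The naming rule quoted in this error can be checked locally before you send the request. The following is a minimal sketch in POSIX shell; the function name is hypothetical, and the rule is inferred only from the error text above, so the service's own pre-flight test remains authoritative.

```shell
# Hypothetical helper: approximates the cluster-name rule described in the
# error message (1-63 characters; lowercase letters, digits, and hyphens;
# starts with a letter; ends with a letter or a digit).
# The body runs in a subshell so the locale change does not leak out.
is_valid_cluster_name() (
  LC_ALL=C
  name=$1
  case $name in
    '' | *[!a-z0-9-]*) exit 1 ;;  # empty, or contains a forbidden character
    [!a-z]*) exit 1 ;;            # does not start with a letter
    *-) exit 1 ;;                 # ends with a hyphen
  esac
  [ "${#name}" -le 63 ]           # length limit
)

# Usage:
#   is_valid_cluster_name "my-cluster-1" && echo "name looks valid"
```

A check like this only saves a round trip; a name that passes can still be rejected by the service for other reasons.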
Cluster creation can also fail after the pre-flight tests have passed. This can
happen several minutes after cluster creation has begun, after Anthos clusters on Azure
has created resources in Google Cloud and Azure. In this case, an
Azure cluster resource exists in your Google Cloud project, and its state reflects the failure.
To get details about the failure, run:
gcloud container azure clusters describe CLUSTER_NAME \
    --location GOOGLE_CLOUD_LOCATION \
    --format "value(state, errors)"
Replace the following:
- CLUSTER_NAME: the name of the cluster whose state you're querying
- GOOGLE_CLOUD_LOCATION: the name of the Google Cloud region that manages this Azure cluster
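When you script around the describe command, it can help to map the reported state to a next step. The sketch below is illustrative only: the function name is hypothetical, and the state names (RUNNING, PROVISIONING, RECONCILING, STOPPING, ERROR, DEGRADED) are an assumption based on the cluster API's state enum rather than something this page lists.

```shell
# Hypothetical helper: classifies a cluster state string into a coarse
# category. The state names below are assumed, not taken from this page.
classify_cluster_state() {
  case $1 in
    RUNNING) echo "ok" ;;
    PROVISIONING | RECONCILING | STOPPING) echo "in-progress" ;;
    ERROR | DEGRADED) echo "failed" ;;
    *) echo "unknown" ;;
  esac
}

# Usage (requires gcloud and real values for the placeholders):
#   state=$(gcloud container azure clusters describe CLUSTER_NAME \
#       --location GOOGLE_CLOUD_LOCATION --format "value(state)")
#   classify_cluster_state "$state"
```

If the category is "failed", the errors field and the related operation are the places to look next.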
Alternatively, you can get details about the creation failure by describing the
Operation resource associated with the create cluster API call:
gcloud container azure operations describe OPERATION_ID
Replace OPERATION_ID with the ID of the operation that created the cluster. If you don't have the operation ID of your cluster creation request, you can fetch it with the following command:
gcloud container azure operations list \
    --location GOOGLE_CLOUD_LOCATION
Use the timestamp or related information to identify the cluster creation operation of interest.
Cluster update failures
When you update a cluster, just as when you create a new cluster, Anthos clusters on Azure first runs a set of pre-flight tests to verify the request. If the cluster update fails, it can be either because one of these pre-flight tests failed or because a step in the cluster update process itself didn't complete.
If a pre-flight test fails, the request doesn't update any resources and
returns information about the error to you directly. For example, if you try to
update a cluster to use an SSH key pair named
test_ec2_keypair, the pre-flight test tries to fetch the key pair and fails, and the request
returns the following error:
ERROR: (gcloud.container.azure.clusters.update) INVALID_ARGUMENT: key pair "test_ec2_keypair" not found, field: azure_cluster.control_plane.ssh_config.ec2_key_pair
Cluster updates can also fail after the pre-flight tests have passed. This can
happen several minutes after the cluster update has begun. In this case, the state of your Azure
cluster resource in your Google Cloud project reflects the failure.
To get details about the failure and the related operation, follow the steps described in cluster creation failures.
Cannot connect to cluster with kubectl
This section gives some hints for diagnosing issues with connecting to your
cluster with the
kubectl command-line tool.
Server doesn't have a resource
Errors such as
error: the server doesn't have a resource type "services" can
happen when a cluster has no running node pools, or Connect gateway cannot
connect to a node pool. To check the status of your node pools, run the
following command:
gcloud container azure node-pools list \
    --cluster-name CLUSTER_NAME \
    --location LOCATION
Replace the following:
CLUSTER_NAME: your cluster's name
LOCATION: the Google Cloud location that manages your cluster
The output includes the status of your cluster's node pools. If no node pool is listed, create a node pool.
Troubleshooting Connect gateway
The following error occurs when your username does not have administrator access to your cluster:
Error from server (Forbidden): users "firstname.lastname@example.org" is forbidden: User "system:serviceaccount:gke-connect:connect-agent-sa" cannot impersonate resource "users" in API group "" at the cluster scope
You can configure additional users by passing the
--admin-users flag when you create a cluster.
If you use Connect gateway and can't connect to your cluster, try the following steps:
Get the authorized users for your cluster.
gcloud container azure clusters describe CLUSTER_NAME \
    --format 'value(authorization.admin_users)'
Replace CLUSTER_NAME with your cluster's name.
The output includes the usernames with administrator access to the cluster.
Get the username currently authenticated with the Google Cloud CLI.
gcloud config get-value account
The output includes the account authenticated with the Google Cloud CLI. If the output of
gcloud container azure clusters describe and
gcloud config get-value account don't match, run
gcloud auth login and authenticate as the username with administrative access to the cluster.
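The comparison in the steps above can be scripted. This is a rough sketch in POSIX shell: the function name is hypothetical, and because the exact layout of the describe output is not shown here, the check only looks for the account as a whole word anywhere in that output. Treat a match as a hint rather than proof.

```shell
# Hypothetical helper: reads the admin-user output of the describe command
# on stdin and succeeds if the given account appears in it as a whole word.
# Heuristic only: it does not parse the output, so an address that is a
# prefix of a longer domain (for example, one followed by ".au") also matches.
account_is_admin() {
  [ -n "$1" ] || return 1   # an empty account would match every line
  grep -qFw -- "$1"
}

# Usage (requires gcloud and real values for the placeholders):
#   account=$(gcloud config get-value account)
#   gcloud container azure clusters describe CLUSTER_NAME \
#       --format 'value(authorization.admin_users)' |
#     account_is_admin "$account" || echo "run: gcloud auth login"
```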
kubectl command stops responding
If the kubectl command is unresponsive or times out, the most common
reason is that you have not yet created a node pool.
By default, Anthos clusters on Azure generates
kubeconfig files that use
Connect gateway as an internet-reachable endpoint. For this to work, the
gke-connect-agent Deployment needs to be running in a node pool on the cluster.
For more diagnostic information in this case, run the following command:
kubectl cluster-info -v=9
If there are no running node pools, you will see requests to
connectgateway.googleapis.com fail with a 404
cannot find active connections for cluster error.
kubectl exec, attach, and port-forward commands fail
The kubectl exec, kubectl attach, and
kubectl port-forward commands might
fail with the message
error: unable to upgrade connection when using
Connect gateway. This is a limitation when using Connect gateway as your
Kubernetes API Server endpoint.
To work around this, use a
kubeconfig that specifies the cluster's
private endpoint. For instructions on accessing the cluster through its
private endpoint, see
Configure cluster access for kubectl.
Generic kubectl troubleshooting
If you use Connect gateway:
Ensure you have enabled Connect gateway in your Google Cloud project:
gcloud services enable connectgateway.googleapis.com
Ensure you have at least one Linux node pool running.
Ensure the gke-connect-agent is running. See troubleshooting connect for details.
Kubernetes 1.22 deprecates and replaces several APIs. If you've upgraded your cluster to version 1.22 or later, any calls your application makes to one of the deprecated APIs will fail.
Upgrade your application to replace the deprecated API calls with their newer counterparts.
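One way to find affected manifests before or after such an upgrade is to search them for API versions that Kubernetes 1.22 no longer serves. The sketch below assumes your manifests are stored as .yaml or .yml files under one directory; the function name is hypothetical, and the list of API groups is partial, so check the Kubernetes deprecation guide for the full set.

```shell
# Hypothetical helper: prints "file:line:text" for each apiVersion line that
# uses a v1beta1 API group no longer served as of Kubernetes 1.22.
# Partial list only. The exit status follows find and grep, so it is not a
# reliable found/not-found signal; check whether the output is empty instead.
find_removed_apis() {
  api_groups='networking\.k8s\.io|apiextensions\.k8s\.io|admissionregistration\.k8s\.io|rbac\.authorization\.k8s\.io|certificates\.k8s\.io|coordination\.k8s\.io|scheduling\.k8s\.io'
  pattern="^[[:space:]]*apiVersion:[[:space:]]*[\"']?(extensions|$api_groups)/v1beta1[\"']?[[:space:]]*\$"
  find "${1:-.}" -type f \( -name '*.yaml' -o -name '*.yml' \) \
    -exec grep -nHE "$pattern" {} +
}

# Usage:
#   find_removed_apis ./manifests
```

This only inspects files on disk. Calls that an application makes through a client library need to be reviewed in the application's own code.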
Unreachable clusters detected error in UI
Some UI surfaces in the Google Cloud console have a problem connecting to version
1.25.4-gke.1300 clusters, in particular the cluster list
for Anthos Service Mesh.
This problem results in a warning that the cluster is unreachable, even though you can log in and interact with it from other pages.
This was a regression due to the removal of the
connect-agent-impersonate-admin-users ClusterRoleBinding in these two cluster versions.
A workaround is to add the needed permissions manually by applying this YAML:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: connect-agent-impersonate-admin-users
rules:
- apiGroups:
  - ""
  resourceNames:
  - ADMIN_USER1
  - ADMIN_USER2
  resources:
  - users
  verbs:
  - impersonate
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: connect-agent-impersonate-admin-users
roleRef:
  kind: ClusterRole
  name: connect-agent-impersonate-admin-users
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
  name: connect-agent-sa
  namespace: gke-connect
Replace ADMIN_USER1 and ADMIN_USER2 with your cluster's admin user accounts (email addresses). This
example assumes two admin users.
To view the list of admin users configured for the cluster:
gcloud container azure clusters describe CLUSTER_NAME \
    --location GOOGLE_CLOUD_LOCATION \
    --format "value(authorization.adminUsers)"
This ClusterRole is automatically overwritten when you upgrade to a newer cluster version.
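If your cluster has a different number of admin users, the workaround manifest can be generated instead of edited by hand. The following sketch is one way to do that in POSIX shell; the function name is hypothetical, the resource names are copied from the manifest above, and it assumes each admin user is a plain email address that needs no YAML quoting.

```shell
# Hypothetical helper: prints the workaround ClusterRole and
# ClusterRoleBinding with one resourceNames entry per argument.
render_impersonate_role() {
  if [ "$#" -eq 0 ]; then
    echo "usage: render_impersonate_role ADMIN_USER..." >&2
    return 1
  fi
  cat <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: connect-agent-impersonate-admin-users
rules:
- apiGroups:
  - ""
  resourceNames:
EOF
  # One list entry per admin user passed on the command line.
  for user in "$@"; do
    printf '  - %s\n' "$user"
  done
  cat <<'EOF'
  resources:
  - users
  verbs:
  - impersonate
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: connect-agent-impersonate-admin-users
roleRef:
  kind: ClusterRole
  name: connect-agent-impersonate-admin-users
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
  name: connect-agent-sa
  namespace: gke-connect
EOF
}

# Usage (requires kubectl access to the cluster):
#   render_impersonate_role alice@example.com bob@example.com | kubectl apply -f -
```

Review the generated output before applying it, since it grants impersonation rights for every user you pass in.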