Anthos clusters on AWS common error messages

Use these troubleshooting steps if you run into problems creating or using Anthos clusters on AWS (GKE on AWS).

Cluster creation failures

When you make a request to create a cluster, Anthos clusters on AWS first runs a set of pre-flight tests to verify the request. If the cluster creation fails, it can be either because one of these pre-flight tests failed or because a step in the cluster creation process itself didn't complete.

If a pre-flight test fails, Anthos clusters on AWS doesn't create any resources and returns information about the error to you directly. For example, if you try to create a cluster with the name invalid%%%name, the pre-flight test for a valid cluster name fails and the request returns the following error:

ERROR: (gcloud.container.aws.clusters.create) INVALID_ARGUMENT: must be
between 1-63 characters, valid characters are /[a-z][0-9]-/, should start with a
letter, and end with a letter or a number: "invalid%%%name",
field: aws_cluster_id

Cluster creation can also fail after the pre-flight tests have passed. This can happen several minutes after cluster creation has begun, after Anthos clusters on AWS has created resources in Google Cloud and AWS. In this case, the cluster resource in your Google Cloud project will have its state set to ERROR.

To get details about the failure, run:

gcloud container aws clusters describe CLUSTER_NAME \
  --location GOOGLE_CLOUD_LOCATION \
  --format "value(state, errors)"

Replace:

  • CLUSTER_NAME with the name of the cluster whose state you're querying
  • GOOGLE_CLOUD_LOCATION with the name of the Google Cloud region that manages this AWS cluster

Alternatively, you can get details about the creation failure by describing the Operation resource associated with the create cluster API call.

gcloud container aws operations describe OPERATION_ID \
  --location GOOGLE_CLOUD_LOCATION

Replace OPERATION_ID with the ID of the operation that created the cluster. If you don't have the operation ID of your cluster creation request, you can fetch it with the following command:

gcloud container aws operations list \
  --location GOOGLE_CLOUD_LOCATION

Use the timestamp or related information to identify the cluster creation operation of interest.
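
To help identify the right operation, you can sort the list by creation time and print only the fields of interest. The following is a sketch that uses the generic gcloud list flags --sort-by and --format; adjust the fields to your needs:

gcloud container aws operations list \
  --location GOOGLE_CLOUD_LOCATION \
  --sort-by "~metadata.createTime" \
  --format "table(name.basename(), metadata.target.basename(), metadata.createTime, done)"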

For example, if your cluster creation failed because an AWS IAM role lacked sufficient permissions, the command and its output resemble the following example:

gcloud container aws operations describe b6a3d042-8c30-4524-9a99-6ffcdc24b370 \
  --location GOOGLE_CLOUD_LOCATION

The response resembles the following:

done: true
error:
  code: 9
  message: 'could not set deregistration_delay timeout for the target group: AccessDenied
    User: arn:aws:sts::0123456789:assumed-role/foo-1p-dev-oneplatform/multicloud-service-agent
    is not authorized to perform: elasticloadbalancing:ModifyTargetGroupAttributes
    on resource: arn:aws:elasticloadbalancing:us-west-2:0123456789:targetgroup/gke-4nrk57tlyjva-cp-tcp443/74b57728e7a3d5b9
    because no identity-based policy allows the elasticloadbalancing:ModifyTargetGroupAttributes
    action'
metadata:
  '@type': type.googleapis.com/google.cloud.gkemulticloud.v1.OperationMetadata
  createTime: '2021-12-02T17:47:31.516995Z'
  endTime: '2021-12-02T18:03:12.590148Z'
  statusDetail: Cluster is being deployed
  target: projects/123456789/locations/us-west1/awsClusters/aws-prod1
name: projects/123456789/locations/us-west1/operations/b6a3d042-8c30-4524-9a99-6ffcdc24b370

Cluster creation or operation fails with an authorization error

An error showing an authorization failure usually indicates that one of the two AWS IAM roles you specified in the cluster creation command was created incorrectly. For example, if the API role didn't include the elasticloadbalancing:ModifyTargetGroupAttributes permission, cluster creation fails with an error message resembling the following:

ERROR: (gcloud.container.aws.clusters.create) could not set
deregistration_delay timeout for the target group: AccessDenied User:
arn:aws:sts::0123456789:assumed-role/cloudshell-user-dev-api-role/multicloud-
service-agent is not authorized to perform:
elasticloadbalancing:ModifyTargetGroupAttributes on resource:
arn:aws:elasticloadbalancing:us-east-1:0123456789:targetgroup/gke-u6au6c65e4iq-
cp-tcp443/be4c0f8d872bb60e because no identity-based policy allows the
elasticloadbalancing:ModifyTargetGroupAttributes action

Even if a cluster appears to have been created successfully, an incorrectly specified IAM role might cause failures later during cluster operation, such as when using commands like kubectl logs.

To resolve such authorization errors, confirm that the policies associated with the two IAM roles you specified during cluster creation are correct. Specifically, ensure that they match the descriptions in Create AWS IAM roles, then delete and re-create the cluster. The individual role descriptions are available in API Role and Control plane Role.
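
To review what a role currently allows, you can list its attached and inline policies with the AWS CLI. The following is a minimal sketch; API_ROLE_NAME and POLICY_NAME are placeholders for the role you created and one of its policies:

# List the managed policies attached to the role.
aws iam list-attached-role-policies --role-name API_ROLE_NAME

# List the inline policies on the role, then print one of them.
aws iam list-role-policies --role-name API_ROLE_NAME
aws iam get-role-policy --role-name API_ROLE_NAME --policy-name POLICY_NAME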

Cluster creation or operation fails at the health checking stage

Sometimes cluster creation fails during health checking, with an Operation status that resembles the following:

done: true
error:
  code: 4
  message: Operation failed
metadata:
  '@type': type.googleapis.com/google.cloud.gkemulticloud.v1.OperationMetadata
  createTime: '2022-06-29T18:26:39.739574Z'
  endTime: '2022-06-29T18:54:45.632136Z'
  errorDetail: Operation failed
  statusDetail: Health-checking cluster
  target: projects/123456789/locations/us-west1/awsClusters/aws-prod1
name: projects/123456789/locations/us-west1/operations/8a7a3b7f-242d-4fff-b518-f361d41c6597

This can happen because of missing or incorrectly specified IAM roles. You can use AWS CloudTrail to surface IAM issues; a sample query sketch follows the examples below.

For example:

  • If the API role didn't include the kms:GenerateDataKeyWithoutPlaintext permission for the control plane main volume KMS key, you'll see the following events:

    "eventName": "AttachVolume",
    "errorCode": "Client.InvalidVolume.NotFound",
    "errorMessage": "The volume 'vol-0ff75940ce333aebb' does not exist.",
    

    and

    "errorCode": "AccessDenied",
    "errorMessage": "User: arn:aws:sts::0123456789:assumed-role/foo-1p-dev-oneplatform/multicloud-service-agent is not authorized to perform: kms:GenerateDataKeyWithoutPlaintext on resource: arn:aws:kms:us-west1:0123456789:key/57a61a45-d9c1-4038-9021-8eb08ba339ba because no identity-based policy allows the kms:GenerateDataKeyWithoutPlaintext action",
    
  • If the Control Plane role didn't include the kms:CreateGrant permission for the control plane main volume KMS key, you'll see the following events:

    "eventName": "AttachVolume",
    "errorCode": "Client.CustomerKeyHasBeenRevoked",
    "errorMessage": "Volume vol-0d022beb769c8e33b cannot be attached. The encrypted volume was unable to access the KMS key.",
    

    and

    "errorCode": "AccessDenied",
    "errorMessage": "User: arn:aws:sts::0123456789:assumed-role/foo-controlplane/i-0a11fae03eb0b08c1 is not authorized to perform: kms:CreateGrant on resource: arn:aws:kms:us-west1:0123456789:key/57a61a45-d9c1-4038-9021-8eb08ba339ba because no identity-based policy allows the kms:CreateGrant action",
    
  • If you didn't grant the service-linked role named AWSServiceRoleForAutoScaling the kms:CreateGrant permission to use the control plane root volume KMS key, you'll see the following events:

    "errorCode": "AccessDenied",
    "errorMessage": "User: arn:aws:sts::0123456789:assumed-role/AWSServiceRoleForAutoScaling/AutoScaling is not authorized to perform: kms:CreateGrant on resource: arn:aws:kms:us-west1:0123456789:key/c77a3a26-bc91-4434-bac0-0aa963cb0c31 because no identity-based policy allows the kms:CreateGrant action",
    
  • If you didn't grant the service-linked role named AWSServiceRoleForAutoScaling the kms:GenerateDataKeyWithoutPlaintext permission to use the control plane root volume KMS key, you'll see the following events:

    "errorCode": "AccessDenied",
    "errorMessage": "User: arn:aws:sts::0123456789:assumed-role/AWSServiceRoleForAutoScaling/AutoScaling is not authorized to perform: kms:GenerateDataKeyWithoutPlaintext on resource: arn:aws:kms:us-west1:0123456789:key/c77a3a26-bc91-4434-bac0-0aa963cb0c31 because no identity-based policy allows the kms:CreateGrant action",
    

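One way to surface these CloudTrail events is the AWS CLI's lookup-events command. The following is a minimal sketch rather than part of the standard procedure; the event name and time window are examples that you should adjust to match your failed operation, and jq is used only to pretty-print the denied events:

# Look up AttachVolume events around the time of the failure and keep
# only those that carry an error code such as AccessDenied.
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=AttachVolume \
  --start-time "2022-06-29T18:00:00Z" \
  --end-time "2022-06-29T19:00:00Z" \
  --output json | jq '.Events[].CloudTrailEvent | fromjson | select(.errorCode != null)'
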
Waiting for nodes to join the cluster

If you receive the following error when creating a node pool, check whether your VPC includes an associated secondary IPv4 CIDR block.

errorDetail: Operation failed
statusDetail: Waiting for nodes to join the cluster (0 out of 1 are ready)

To fix this issue, create a security group that includes all the CIDR blocks and add that group to your cluster. For more information, see Node pools in VPC Secondary CIDR blocks.
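
To check whether your VPC has a secondary CIDR block, you can inspect its CIDR block associations with the AWS CLI. A minimal sketch, where VPC_ID is the ID of the VPC that hosts your cluster:

# More than one entry in the output means a secondary CIDR block is associated.
aws ec2 describe-vpcs \
  --vpc-ids VPC_ID \
  --query "Vpcs[].CidrBlockAssociationSet[].CidrBlock"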

Get an instance's system log

If a control plane or node pool instance doesn't start, you can inspect its system log. To inspect the system log, do the following:

  1. Open the AWS EC2 Instance console.
  2. Click Instances.
  3. Find the instance by name. Anthos clusters on AWS typically creates instances named CLUSTER-NAME-cp for control plane nodes or CLUSTER-NAME-np for node pool nodes.
  4. Choose Actions -> Monitor and Troubleshoot -> Get System Log. The instance's system log appears.
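
Alternatively, you can fetch the same system log with the AWS CLI. A minimal sketch, where INSTANCE_ID is the ID of the instance you located in the console:

aws ec2 get-console-output \
  --instance-id INSTANCE_ID \
  --output text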

Cluster update failures

When you update a cluster, just as when you create a new cluster, Anthos clusters on AWS first runs a set of pre-flight tests to verify the request. If the cluster update fails, it can be either because one of these pre-flight tests failed or because a step in the cluster update process itself didn't complete.

If a pre-flight test fails, Anthos clusters on AWS doesn't update any resources and returns information about the error to you directly. For example, if you try to update a cluster to use an SSH key pair named test_ec2_keypair, the pre-flight test tries to fetch the EC2 key pair, fails, and the request returns the following error:

ERROR: (gcloud.container.aws.clusters.update) INVALID_ARGUMENT: key pair
"test_ec2_keypair" not found,
field: aws_cluster.control_plane.ssh_config.ec2_key_pair

Cluster updates can also fail after the pre-flight tests have passed. This can happen several minutes after the cluster update has begun, and the cluster resource in your Google Cloud project will have its state set to DEGRADED.

To get details about the failure and the related operation, follow the steps described in cluster creation failures.

Cluster update fails when updating control plane tags

The AWS update API now supports updating control plane tags. To update tags, you need a cluster with Kubernetes version 1.24 or higher. You must also make sure your AWS IAM role has the appropriate permissions as listed on the update cluster page for updating control plane tags.

An error showing an authorization failure like the one below usually indicates that an IAM permission is missing. For example, if the API role didn't include the ec2:DeleteTags permission, the cluster update for tags may fail with an error message resembling the following (the <encoded_auth_failure_message> is redacted for brevity):

ERROR: (gcloud.container.aws.clusters.update) could not delete tags:
UnauthorizedOperation You are not authorized to perform this operation.
Encoded authorization failure message: <encoded_auth_failure_message>

To debug the encoded failure message, send a request to the AWS STS decode-authorization-message API, as shown below:

aws sts decode-authorization-message \
  --encoded-message <encoded_auth_failure_message> \
  --query DecodedMessage \
  --output text | jq '.' | less

The resulting output will resemble the following:

...
"principal": {
  "id": "AROAXMEL2SCNPG6RCJ72B:iam-session",
  "arn": "arn:aws:sts::1234567890:assumed-role/iam_role/iam-session"
},
"action": "ec2:DeleteTags",
"resource": "arn:aws:ec2:us-west-2:1234567890:security-group-rule/sgr-00bdbaef24a92df62",
...

The response above indicates that you could not perform the ec2:DeleteTags action on the EC2 security group rule resource of the AWS cluster. Update your API Role accordingly and resend the update API request to update the control plane tags.

Cannot connect to cluster with kubectl

This section gives some hints for diagnosing issues with connecting to your cluster with the kubectl command-line tool.

Server doesn't have a resource

Errors such as error: the server doesn't have a resource type "services" can happen when a cluster has no running node pools, or Connect gateway cannot connect to a node pool. To check the status of your node pools, run the following command:

gcloud container aws node-pools list \
    --cluster-name CLUSTER_NAME \
    --location LOCATION

Replace the following:

  • CLUSTER_NAME: your cluster's name
  • LOCATION: the Google Cloud location that manages your cluster

The output includes the status of your cluster's node pools. If you don't have a node pool listed, Create a node pool.
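
If a node pool is listed but isn't healthy, you can get more detail in the same way as for clusters by describing it. A sketch, where NODE_POOL_NAME is the name shown in the list output:

gcloud container aws node-pools describe NODE_POOL_NAME \
    --cluster-name CLUSTER_NAME \
    --location LOCATION \
    --format "value(state, errors)"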

Troubleshooting Connect gateway

The following error occurs when your username does not have administrator access to your cluster:

Error from server (Forbidden): users "administrator@example.com" is forbidden:
User "system:serviceaccount:gke-connect:connect-agent-sa" cannot impersonate
resource "users" in API group "" at the cluster scope

You can configure additional users by passing the --admin-users flag when you create a cluster.
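
Depending on your version of the Google Cloud CLI, you may also be able to change the admin users on an existing cluster. The following is a hedged sketch; verify that the update command in your gcloud version supports the --admin-users flag before relying on it:

# Replaces the cluster's list of admin users.
gcloud container aws clusters update CLUSTER_NAME \
  --location GOOGLE_CLOUD_LOCATION \
  --admin-users administrator@example.com,secondadmin@example.com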

If you use Connect gateway and can't connect to your cluster, try the following steps:

  1. Get the authorized users for your cluster.

    gcloud container aws clusters describe CLUSTER_NAME \
      --location GOOGLE_CLOUD_LOCATION \
      --format 'value(authorization.admin_users)'
    

    Replace CLUSTER_NAME with your cluster's name and GOOGLE_CLOUD_LOCATION with the Google Cloud location that manages your cluster.

    The output includes the usernames with administrator access to the cluster. For example:

    {'username': 'administrator@example.com'}
    
  2. Get the username currently authenticated with the Google Cloud CLI.

    gcloud config get-value account
    

    The output includes the account authenticated with the Google Cloud CLI. If the outputs of gcloud container aws clusters describe and gcloud config get-value account don't match, run gcloud auth login and authenticate as the username with administrative access to the cluster.

kubectl command stops responding

If your cluster runs a Kubernetes version earlier than 1.25 and the kubectl command is unresponsive or times out, the most common reason is that you have not yet created a node pool. By default, Anthos clusters on AWS generates kubeconfig files that use Connect gateway as an internet-reachable endpoint. For this to work, the gke-connect-agent Deployment needs to be running in a node pool on the cluster.

For more diagnostic information in this case, run the following command:

kubectl cluster-info -v=9

If there are no running node pools, you will see requests to connectgateway.googleapis.com fail with a 404 cannot find active connections for cluster error.

For clusters with a Kubernetes version of 1.25 or later, the gke-connect-agent runs on the control plane, and a node pool is not required. If the kubectl command is unresponsive, check the control plane component logs with Cloud Logging.

kubectl exec, attach, and port-forward commands fail

The kubectl exec, kubectl attach, and kubectl port-forward commands might fail with the message error: unable to upgrade connection when using Connect gateway. This is a limitation when using Connect gateway as your Kubernetes API Server endpoint.

To work around this, use a kubeconfig that specifies the cluster's private endpoint. For instructions on accessing the cluster through its private endpoint, see Configure cluster access for kubectl.

kubectl logs fails with remote error: tls: internal error

This might happen when the Control Plane API Role is missing a permission. For instance, this can happen if your AWS role is missing the ec2:DescribeDhcpOptions permission. In this case, certificate signing requests from nodes can't be approved, and the worker nodes will lack a valid certificate.

To determine if this is the problem, you can check if there are pending Certificate Signing Requests that have not been approved with this command:

kubectl get csr

To resolve this, verify that your AWS role matches the requirements in the documentation.
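
The following is a minimal sketch for inspecting the pending requests in more detail. The manual approval step is optional and assumes that, after fixing the role, you have confirmed the CSR belongs to one of your own nodes:

# Show only kubelet serving CSRs, which are the ones involved here.
kubectl get csr --field-selector spec.signerName=kubernetes.io/kubelet-serving

# Optionally approve a specific pending CSR once the IAM role is fixed.
kubectl certificate approve CSR_NAME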

Generic kubectl troubleshooting

If you use Connect gateway:

  • Ensure you have enabled Connect gateway in your Google Cloud project:

    gcloud services enable connectgateway.googleapis.com
    
  • For clusters with a Kubernetes version earlier than 1.25, ensure that you have at least one Linux node pool running and that the gke-connect-agent is running. See troubleshooting connect for details.

  • For clusters with a Kubernetes version of 1.25 or later, check the gke-connect-agent logs with Cloud Logging.

Kubernetes Service (LoadBalancer) or Kubernetes Ingress don't work

If your AWS Elastic Load Balancers (ELB/NLB/ALB) were created but aren't operating as you expected, this might be due to problems with subnet tagging. For more information, see Load balancer subnets.
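
The tags that each subnet needs are listed in Load balancer subnets. To see which tags a subnet currently carries, use the following minimal sketch with the AWS CLI, where SUBNET_ID is the subnet you want to inspect:

aws ec2 describe-subnets \
  --subnet-ids SUBNET_ID \
  --query "Subnets[].Tags"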

API errors

Kubernetes 1.22 deprecates and replaces several APIs. If you've upgraded your cluster to version 1.22 or later, any calls your application makes to one of the deprecated APIs will fail.

Solution

Upgrade your application to replace the deprecated API calls with their newer counterparts.
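
One way to check whether any clients are still calling deprecated APIs is the API server's apiserver_requested_deprecated_apis metric. A minimal sketch, assuming your user can read the raw metrics endpoint:

# Emits one series per deprecated group/version/resource that has been
# requested since the API server started.
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis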

Cannot delete cluster

FAILED_PRECONDITION: Could not assume role

If you receive an error similar to the following, your Anthos Multi-Cloud API role might not exist:

ERROR: (gcloud.container.aws.clusters.delete) FAILED_PRECONDITION: Could not
assume role
"arn:aws:iam::ACCOUNT_NUMBER:role/gke123456-anthos-api-role"
through service account
"service-123456789@gcp-sa-gkemulticloud.iam.gserviceaccount.com".
Please make sure the role has a trust policy allowing the GCP service agent to
assume it: WebIdentityErr failed to retrieve credentials

To fix the problem, follow the steps at Create Anthos Multi-Cloud API role. When you re-create the role with the same name and permissions, you can re-try the command.
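
To confirm that the role exists and that its trust policy allows the Multi-Cloud service agent to assume it, you can inspect the role with the AWS CLI. A minimal sketch, where API_ROLE_NAME is the role name from the error message:

# Print the role's trust (assume-role) policy document.
aws iam get-role \
  --role-name API_ROLE_NAME \
  --query "Role.AssumeRolePolicyDocument"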

Troubleshoot Arm workloads

Pods on Arm nodes crashing

The following issue occurs when you deploy a Pod on an Arm node, but the container image isn't built for Arm architecture.

To identify the issue, complete the following tasks:

  1. Get the status of your Pods:

    kubectl get pods
    
  2. Get the logs for the crashing Pod:

    kubectl logs POD_NAME
    

    Replace POD_NAME with the name of the crashing Pod.

    The error message in your Pod logs is similar to the following:

    exec ./hello-app: exec format error
    

To resolve this issue, ensure that your container image supports Arm architecture. As a best practice, build multiple architecture images.
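
For example, if you build your images with Docker, Buildx can produce a multi-architecture image in one step. A minimal sketch, assuming a Dockerfile in the current directory and a registry you can push to; IMAGE and TAG are placeholders:

# Build and push a manifest that includes both amd64 and arm64 variants.
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --tag IMAGE:TAG \
  --push .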

Unreachable clusters detected error in UI

Some UI surfaces in the Google Cloud console, in particular the cluster list for Anthos Service Mesh, have a problem connecting to clusters running version 1.25.5-gke.1500 or 1.25.4-gke.1300.

This problem results in a warning that the cluster is unreachable, even though you can log in and interact with it from other pages.

This was a regression due to the removal of the gateway-impersonate ClusterRoleBinding in these two cluster versions.

A workaround is to add the needed permissions manually by applying this YAML:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: connect-agent-impersonate-admin-users
rules:
- apiGroups:
  - ""
  resourceNames:
  - ADMIN_USER1
  - ADMIN_USER2
  resources:
  - users
  verbs:
  - impersonate
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: connect-agent-impersonate-admin-users
roleRef:
  kind: ClusterRole
  name: connect-agent-impersonate-admin-users
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
  name: connect-agent-sa
  namespace: gke-connect

Replace ADMIN_USER1 and ADMIN_USER2 with your cluster's admin user accounts (email addresses). This example assumes two admin users; add or remove entries to match the admin users configured for your cluster.
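
After filling in the admin users, save the manifest to a file and apply it to the cluster, for example (the file name is arbitrary):

kubectl apply -f connect-agent-impersonate.yaml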

To view the list of admin users configured for the cluster:

gcloud container aws clusters describe CLUSTER_NAME \
  --location GOOGLE_CLOUD_LOCATION \
  --format "value(authorization.adminUsers)"

This ClusterRole is automatically overwritten when you upgrade to a newer cluster version.