Resolving managed Anthos Service Mesh issues

This document explains common Anthos Service Mesh problems and how to resolve them, such as when a pod is injected with istiod.istio-system, or when the installation tool generates errors such as HTTP 400 status codes and cluster membership errors.

If you need additional assistance troubleshooting Anthos Service Mesh, see Getting support.

Revision(s) reporting as unhealthy error

You may see a generic Revision(s) reporting unhealthy error if managed Anthos Service Mesh does not have the required Google-managed Service Account with Anthos Service Mesh Service Agent Identity and Access Management (IAM) role bindings. Generally, this occurs when the permission for Anthos Service Mesh Service Agent role is revoked via Terraform, Puppet, or CI/CD reconfiguration.

The steps required to troubleshoot this error depend on whether you are using the Google Cloud console or the Google Cloud CLI.

Google Cloud console

In the Google Cloud console, go to IAM & Admin > IAM.
Select Include Google-provided role grants.
Review the Principal list.

If you see the managed Service Account with the required IAM role in the list, then it is configured correctly.

Note: The format of the managed Service Account name is service-${PROJECT_NUMBER}@gcp-sa-servicemesh.iam.gserviceaccount.com, and the name of the required IAM role for Anthos Service Mesh is Anthos Service Mesh Service Agent (roles/anthosservicemesh.serviceAgent).

If you don't see the managed Service Account with the required IAM role in the list, then the Anthos Service Mesh Service Agent IAM role binding does not exist for the managed Service Account.
Grant Anthos Service Mesh Service Agent (roles/anthosservicemesh.serviceAgent) IAM role bindings to the Anthos Service Mesh managed Service Account in the Google Cloud console.

Note: Ensure that you do not have any automated tooling that will revert this change. If the error recurs, then update any relevant configurations or allow-lists.

Google Cloud CLI

In the Google Cloud CLI, run the following command to check if the required IAM role is configured:

gcloud projects get-iam-policy PROJECT_ID  \
--flatten="bindings[].members" \
--filter="bindings.members:serviceAccount:service-PROJECT_NUMBER@gcp-sa-servicemesh.iam.gserviceaccount.com AND bindings.role:roles/anthosservicemesh.serviceAgent" \
--format='table(bindings.role)'

Review the ROLE list.

If you see any roles in the list, then it is configured correctly.

If you don't see any roles in the list, then the required role binding is missing from the managed Service Account.
Run the following command to assign appropriate IAM role bindings to the Anthos Service Mesh managed Service Account:
```
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-servicemesh.iam.gserviceaccount.com" \
--role="roles/anthosservicemesh.serviceAgent"
```
Note: Ensure that you do not have any automated tooling that will revert this change. If the error recurs, then update any relevant configurations or allow-lists.
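
The check and the grant above can be combined into one idempotent step. The following is a minimal sketch, not an official tool: the function names asm_service_agent_member and ensure_asm_service_agent are hypothetical, and the sketch assumes that gcloud is installed and authenticated with permission to read and modify the project IAM policy.

```shell
#!/usr/bin/env bash

# Build the IAM member string for the managed Service Account from a
# project number, following the name format given in this document.
asm_service_agent_member() {
  local project_number="$1"
  printf 'serviceAccount:service-%s@gcp-sa-servicemesh.iam.gserviceaccount.com\n' \
    "$project_number"
}

# Hypothetical helper: add the role binding only when the check returns
# no role for the managed Service Account.
ensure_asm_service_agent() {
  local project_id="$1" project_number="$2" member bound
  member="$(asm_service_agent_member "$project_number")"
  bound="$(gcloud projects get-iam-policy "$project_id" \
    --flatten="bindings[].members" \
    --filter="bindings.members:${member} AND bindings.role:roles/anthosservicemesh.serviceAgent" \
    --format="value(bindings.role)")"
  if [ -z "$bound" ]; then
    gcloud projects add-iam-policy-binding "$project_id" \
      --member="$member" \
      --role="roles/anthosservicemesh.serviceAgent"
  fi
}

# Usage (placeholders as in the commands above):
#   ensure_asm_service_agent PROJECT_ID PROJECT_NUMBER
```

Running the helper on a schedule does not replace fixing the automation that revokes the binding; it only restores the binding after the fact.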

Pod is injected with istiod.istio-system

A pod can be injected by istiod.istio-system if you did not replace the istio-injection: enabled label.

In addition, verify the mutating webhooks configuration by using the following command:

kubectl get mutatingwebhookconfiguration

# Example output (may also include istio-sidecar-injector):
# ...
# istiod-asm-managed
# ...

kubectl get mutatingwebhookconfiguration istio-sidecar-injector -o yaml

# Run debug commands
export T=$(echo '{"kind":"TokenRequest","apiVersion":"authentication.k8s.io/v1","spec":{"audiences":["istio-ca"], "expirationSeconds":2592000}}' | kubectl create --raw /api/v1/namespaces/default/serviceaccounts/default/token -f - | jq -j '.status.token')

export INJECT_URL=$(kubectl get mutatingwebhookconfiguration istiod-asmca -o json | jq -r '.webhooks[0].clientConfig.url')
export ISTIOD_ADDR=$(echo "$INJECT_URL" | sed 's/\/inject.*//')

curl -v -H "Authorization: Bearer $T" "$ISTIOD_ADDR/debug/configz"
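
The debug commands derive the istiod address by removing everything from /inject onward in the webhook URL. The following sketch isolates that step so it can be checked on its own. The function name istiod_addr_from_inject_url is hypothetical, and the guard assumes that jq prints null when the webhook has no clientConfig.url (for example, when it uses a service reference instead).

```shell
#!/usr/bin/env bash

# Strip the /inject path (and anything after it) from a webhook URL.
# Fails when the URL is empty or the literal string "null".
istiod_addr_from_inject_url() {
  local inject_url="$1"
  if [ -z "$inject_url" ] || [ "$inject_url" = "null" ]; then
    echo "no inject URL found in the webhook configuration" >&2
    return 1
  fi
  printf '%s\n' "$inject_url" | sed 's/\/inject.*//'
}

# Usage with the variables from the debug commands:
#   ISTIOD_ADDR="$(istiod_addr_from_inject_url "$INJECT_URL")" || exit 1
```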

The install tool generates HTTP 400 errors

The installation tool might generate HTTP 400 errors like the following:

HealthCheckContainerError, message: Cloud Run error: Container failed to start.
Failed to start and then listen on the port defined by the PORT environment
variable. Logs for this revision might contain more information.

The error can occur if you did not enable Workload Identity on your Kubernetes cluster, which you can do by using the following command:

export CLUSTER_NAME=...
export PROJECT_ID=...
export LOCATION=...
gcloud container clusters update $CLUSTER_NAME --zone $LOCATION \
    --workload-pool=$PROJECT_ID.svc.id.goog
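
Before updating the cluster, you might want to confirm whether a workload pool is already set. The following is a sketch under stated assumptions: the function name workload_pool_matches is hypothetical, and the usage comment assumes a zonal cluster and reads the workloadIdentityConfig.workloadPool field from gcloud container clusters describe.

```shell
#!/usr/bin/env bash

# Succeed only when the pool reported by the cluster has the form
# PROJECT_ID.svc.id.goog for the given project.
workload_pool_matches() {
  local actual_pool="$1" project_id="$2"
  [ "$actual_pool" = "${project_id}.svc.id.goog" ]
}

# Usage (an empty pool means Workload Identity is not enabled):
#   pool="$(gcloud container clusters describe "$CLUSTER_NAME" \
#     --zone "$LOCATION" \
#     --format="value(workloadIdentityConfig.workloadPool)")"
#   workload_pool_matches "$pool" "$PROJECT_ID" || echo "Workload Identity needs attention"
```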

Managed data plane state

The following command displays the state of the managed data plane:

gcloud container fleet mesh describe --project PROJECT_ID

The following table lists all possible managed data plane states:

State	Code	Description
`ACTIVE`	`OK`	The managed data plane is running normally.
`DISABLED`	`DISABLED`	The managed data plane will be in this state if no namespace or revision is configured to use it. Follow the instructions to enable managed Anthos Service Mesh via the fleet API, or enable the managed data plane after provisioning managed Anthos Service Mesh with `asmcli`. Note that the managed data plane status reporting is only available if you enabled the managed data plane by annotating a namespace or revision. Annotating individual pods causes those pods to be managed but with a feature state of `DISABLED` if no namespaces or revisions are annotated.
`FAILED_PRECONDITION`	`MANAGED_CONTROL_PLANE_REQUIRED`	The managed data plane requires an active managed Anthos Service Mesh control plane.
`PROVISIONING`	`PROVISIONING`	The managed data plane is being provisioned. If this state persists for more than 10 minutes, an error has likely occurred and you should contact Support.
`STALLED`	`INTERNAL_ERROR`	The managed data plane is blocked from operating due to an internal error condition. If the issue persists, contact Support.
`NEEDS_ATTENTION`	`UPGRADE_FAILURES`	The managed data plane requires manual intervention to return the service to its normal state. For more information about how to resolve this issue, see `NEEDS_ATTENTION` state.

`NEEDS_ATTENTION` state

If the gcloud container fleet mesh describe command shows that the managed data plane is in the NEEDS_ATTENTION state with the code UPGRADE_FAILURES, then the managed data plane has failed to upgrade certain workloads. The managed data plane service labels these workloads with dataplane-upgrade: failed for further analysis. You must restart the proxies manually to upgrade them. To get the list of pods that require attention, run the following command:

kubectl get pods --all-namespaces -l dataplane-upgrade=failed
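
One possible way to restart the affected proxies is to restart the workloads that own the listed pods. The sketch below only extracts the affected namespaces from the pod list; the function name namespaces_with_failed_upgrades is hypothetical. The usage comment assumes that the workloads are Deployments and that restarting every Deployment in an affected namespace is acceptable, which might not hold in your environment.

```shell
#!/usr/bin/env bash

# Read `kubectl get pods --all-namespaces --no-headers` output on stdin
# (namespace in the first column) and print each namespace once.
namespaces_with_failed_upgrades() {
  awk 'NF { print $1 }' | sort -u
}

# Usage (restarts all Deployments in each affected namespace):
#   kubectl get pods --all-namespaces -l dataplane-upgrade=failed --no-headers \
#     | namespaces_with_failed_upgrades \
#     | while read -r ns; do kubectl rollout restart deployment -n "$ns"; done
```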

Cluster membership error (No identity provider specified)

The installation tool might fail with cluster membership errors like the following:

asmcli: [ERROR]: Cluster has memberships.hub.gke.io CRD but no identity
provider specified. Please ensure that an identity provider is available for the
registered cluster.

The error can occur if GKE Workload Identity was not enabled before you registered the cluster. You can re-register the cluster on the command line by using the gcloud container fleet memberships register --enable-workload-identity command.

Check the managed control plane status

To check the managed control plane status, run gcloud container fleet mesh describe --project FLEET_PROJECT_ID.

In the response, the membershipStates[].servicemesh.controlPlaneManagement.details field might explain the specific error.

If you need more details, then check the ControlPlaneRevision custom resource in the cluster, which is updated when the managed control plane is provisioned or fails provisioning.

To inspect the status of the resource, run the following command, replacing NAME with the value that corresponds to your channel: asm-managed, asm-managed-stable, or asm-managed-rapid.

kubectl describe controlplanerevision NAME -n istio-system

The output is similar to:

    Name:         asm-managed

    …

    Status:
      Conditions:
        Last Transition Time:  2021-08-05T18:56:32Z
        Message:               The provisioning process has completed successfully
        Reason:                Provisioned
        Status:                True
        Type:                  Reconciled
        Last Transition Time:  2021-08-05T18:56:32Z
        Message:               Provisioning has finished
        Reason:                ProvisioningFinished
        Status:                True
        Type:                  ProvisioningFinished
        Last Transition Time:  2021-08-05T18:56:32Z
        Message:               Provisioning has not stalled
        Reason:                NotStalled
        Status:                False
        Type:                  Stalled

The Reconciled condition indicates whether the managed control plane is running correctly. If its status is True, the control plane is running successfully. The Stalled condition indicates whether the managed control plane provisioning process has encountered an error. If the status of Stalled is True, the Message field contains more information about the specific error. See Stalled codes for more information about possible errors.
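
The two conditions can also be read with a kubectl JSONPath filter and reduced to a one-word summary. The following is an illustrative sketch: the function name summarize_cpr_conditions and the three summary words are hypothetical, and the usage comment assumes the asm-managed revision.

```shell
#!/usr/bin/env bash

# Map the status values of the Reconciled and Stalled conditions
# ("True" or "False") to a short summary.
summarize_cpr_conditions() {
  local reconciled="$1" stalled="$2"
  if [ "$stalled" = "True" ]; then
    echo "stalled"          # check the Message field of the Stalled condition
  elif [ "$reconciled" = "True" ]; then
    echo "healthy"
  else
    echo "not-reconciled"   # provisioning has not completed successfully
  fi
}

# Usage:
#   reconciled="$(kubectl get controlplanerevision asm-managed -n istio-system \
#     -o jsonpath='{.status.conditions[?(@.type=="Reconciled")].status}')"
#   stalled="$(kubectl get controlplanerevision asm-managed -n istio-system \
#     -o jsonpath='{.status.conditions[?(@.type=="Stalled")].status}')"
#   summarize_cpr_conditions "$reconciled" "$stalled"
```

Stalled is checked first because an error in provisioning is the more actionable result of the two.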

ControlPlaneRevision Stalled Codes

There are multiple reasons why the Stalled condition could become true in the ControlPlaneRevision's status.

Reason	Message	Description
PreconditionFailed	Only GKE memberships are supported but ${CLUSTER_NAME} is not a GKE cluster.	The current cluster does not appear to be a GKE cluster. Managed control plane only works on GKE clusters.
	Unsupported ControlPlaneRevision name: ${NAME}	The name of the ControlPlaneRevision must be one of the following: asm-managed, asm-managed-rapid, or asm-managed-stable.
	Unsupported ControlPlaneRevision namespace: ${NAMESPACE}	The namespace of the ControlPlaneRevision must be `istio-system`.
	Unsupported channel ${CHANNEL} for ControlPlaneRevision with name${NAME}. Expected ${OTHER_CHANNEL}	The name of the ControlPlaneRevision must match its channel as follows: asm-managed -> regular, asm-managed-rapid -> rapid, asm-managed-stable -> stable.
	Channel must not be omitted or blank	`Channel` is a required field on the ControlPlaneRevision. It is missing or blank on the custom resource.
	Unsupported control plane revision type: ${TYPE}	`managed_service` is the only allowed value for the ControlPlaneRevisionType field.
	Unsupported Kubernetes version: ${VERSION}	Kubernetes versions 1.15+ are supported.
	Workload identity is not enabled	Please enable workload identity on your cluster.
	Unsupported workload pool: ${POOL}	The workload pool must be of the form `${PROJECT_ID}.svc.id.goog`.
	Cluster project and environ project do not match	Clusters must be part of the same project in which they are registered to the fleet.
ProvisioningFailed	An error occurred updating cluster resources	Google was unable to update your in-cluster resources such as CRDs and webhooks.
	MutatingWebhookConfiguration "istiod-asm-managed" contains a webhook with URL of ${EXISTING_URL} but expected ${EXPECTED_URL}	Google will not overwrite existing webhooks to avoid breaking your installation. Update this manually if it is desired behavior.
	ValidatingWebhookConfiguration ${NAME} contains a webhook with URL of ${EXISTING_URL} but expected ${EXPECTED_URL}	Google will not overwrite existing webhooks to avoid breaking your installation. Update this manually if it is desired behavior.
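
The name-to-channel rule from the table can be written as a small lookup, which is useful when validating a ControlPlaneRevision manifest before applying it. This is a sketch only; the function name expected_channel_for_revision is hypothetical.

```shell
#!/usr/bin/env bash

# Print the channel that a ControlPlaneRevision name must use.
# Fails for names that the table lists as unsupported.
expected_channel_for_revision() {
  case "$1" in
    asm-managed)        echo "regular" ;;
    asm-managed-rapid)  echo "rapid" ;;
    asm-managed-stable) echo "stable" ;;
    *)
      echo "unsupported ControlPlaneRevision name: $1" >&2
      return 1
      ;;
  esac
}

# Usage:
#   expected_channel_for_revision asm-managed-rapid
```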

Managed Anthos Service Mesh is unable to connect to the GKE cluster

Between June 2022 and September 2022, Google completed security work related to Authorized Networks, Cloud Run and Cloud Functions on Google Kubernetes Engine (GKE). Projects that previously used managed Anthos Service Mesh but stopped using it before the migration do not have the API required for the communication between Cloud Run and GKE.

In this scenario, managed Anthos Service Mesh provisioning will fail and Cloud Logging will display the following error message:

Connect Gateway API has not been used in project [PROJECT_NUMBER] before or it is disabled.
Enable it by visiting https://console.developers.google.com/apis/api/connectgateway.googleapis.com/overview?project=[PROJECT_NUMBER] then retry.
If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.

Filter this message using the following query:

resource.type="istio_control_plane"
resource.labels.project_id=[PROJECT_ID]
resource.labels.location=[REGION]
severity=ERROR
jsonPayload.message=~"Connect Gateway API has not been used in project"
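
The same query can be run from the command line with gcloud logging read. The sketch below builds the filter as a single line joined with AND, which is equivalent to the multi-line form. The function name connect_gateway_error_filter is hypothetical, and the usage comment assumes that you have permission to read logs in the project.

```shell
#!/usr/bin/env bash

# Build the Cloud Logging filter for the Connect Gateway error as one line.
connect_gateway_error_filter() {
  local project_id="$1" region="$2"
  printf '%s AND %s AND %s AND %s AND %s\n' \
    'resource.type="istio_control_plane"' \
    "resource.labels.project_id=\"${project_id}\"" \
    "resource.labels.location=\"${region}\"" \
    'severity=ERROR' \
    'jsonPayload.message=~"Connect Gateway API has not been used in project"'
}

# Usage:
#   gcloud logging read "$(connect_gateway_error_filter PROJECT_ID REGION)" \
#     --project PROJECT_ID --limit 10
```

If the query returns the error, the message itself points to the remedy: enable connectgateway.googleapis.com for the project, for example with gcloud services enable, and then retry.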

While the API is disabled, sidecar injection and deployment of any Anthos Service Mesh-related Kubernetes custom resources also fail, and Cloud Logging displays the following warning message:

Error creating: Internal error occurred: failed calling webhook
"rev.namespace.sidecar-injector.istio.io": failed to call webhook: an error on
the server ("unknown") has prevented the request from succeeding.