Troubleshooting observability issues

This page describes how to troubleshoot issues on the Anthos Service Mesh pages in the Google Cloud console.

Services list is missing a particular service

If a service is missing from the list, validate that a Kubernetes service configuration exists in your cluster.

To get a list of all Kubernetes services:

kubectl get services --all-namespaces

To get a list of Kubernetes services in a specific namespace:

kubectl get services -n YOUR_NAMESPACE

Missing or incorrect telemetry data for services

By default, Cloud Monitoring and Cloud Logging are enabled in your Google Cloud project when you install Anthos Service Mesh. To report telemetry data, each sidecar proxy that is injected into your service pods calls the Cloud Monitoring API and the Cloud Logging API. After deploying workloads, it takes about one or two minutes for telemetry data to be displayed in the Google Cloud console. Anthos Service Mesh automatically keeps the service dashboards up to date:

  • For metrics, the sidecar proxies call the Cloud Monitoring API approximately every minute.

  • To update the Topology graph, the sidecar proxies send incremental reports approximately every minute and full reports about every ten minutes.

  • For logging, the sidecar proxies call the Cloud Logging API approximately every ten seconds.

  • For tracing, you have to enable Cloud Trace. Traces are reported according to the sampling frequency that you have configured (typically, one out of every 100 requests).

Metrics are displayed only for HTTP services on the Anthos Service Mesh Metrics page. If you don't see any metrics, check the following:

Verify sidecar proxies have been injected

Check that all the pods in the namespace for your application's services have sidecar proxies injected:

kubectl get pods -n YOUR_NAMESPACE

In the following example output from the previous command, notice that the READY column indicates there are two containers for each of your workloads: the primary container and the container for the sidecar proxy.

NAME                    READY   STATUS    RESTARTS   AGE
YOUR_WORKLOAD           2/2     Running   0          20s
...
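
The READY check described above can be scripted when a namespace has many pods. The following is a minimal sketch, not part of Anthos Service Mesh: the function name pods_missing_sidecar is hypothetical, and it assumes the default column layout of kubectl get pods (NAME first, READY second). It only counts containers, so a pod that runs two application containers without a sidecar would not be flagged.

```shell
# Reads `kubectl get pods` output on stdin and prints the name of every pod
# whose READY column reports fewer than two containers in total, which
# suggests that no sidecar proxy was injected.
# Assumption: default column layout (NAME, READY, STATUS, RESTARTS, AGE).
pods_missing_sidecar() {
  awk '$2 ~ /^[0-9]+\/[0-9]+$/ {
    split($2, ready, "/")            # READY looks like "1/1" or "2/2"
    if (ready[2] + 0 < 2) print $1   # total container count is below two
  }'
}
```

For example, against a live cluster:

    kubectl get pods -n YOUR_NAMESPACE | pods_missing_sidecar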

If you don't see two containers, check if your namespace has the istio-injection=enabled label, which indicates that automatic sidecar injection is enabled:

kubectl get ns --show-labels

Example output:

NAME              STATUS   AGE   LABELS
default           Active   35m   istio-injection=enabled
istio-system      Active   34m   istio-injection=disabled,istio-operator-managed=Reconcile

  • If you don't see the istio-injection=enabled label, run the following command to enable automatic sidecar injection:

    kubectl label namespace YOUR_NAMESPACE istio-injection=enabled --overwrite
  • If you see the istio-injection=enabled label, and you installed Anthos Service Mesh on an existing Google Kubernetes Engine cluster that had workloads on it, you need to restart any running pods to have the sidecar proxy injected or updated with the current Anthos Service Mesh version. See Updating sidecars for existing pods for more information.
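
To spot unlabeled namespaces without reading the LABELS column by eye, the label check can be sketched as a small filter. The function name namespaces_without_injection is hypothetical, and the sketch assumes the four-column layout shown in the example output (NAME, STATUS, AGE, LABELS). It checks only for the istio-injection=enabled label described on this page.

```shell
# Reads `kubectl get ns --show-labels` output on stdin and prints every
# namespace whose LABELS column does not contain istio-injection=enabled.
# Assumption: LABELS is the fourth column, with labels separated by commas.
namespaces_without_injection() {
  awk 'NR > 1 {
    n = split($4, labels, ",")
    enabled = 0
    for (i = 1; i <= n; i++) {
      if (labels[i] == "istio-injection=enabled") enabled = 1
    }
    if (!enabled) print $1
  }'
}
```

For example:

    kubectl get ns --show-labels | namespaces_without_injection

System namespaces such as istio-system are expected to appear in the result; only the namespaces that host your application's services need the label.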

Verify the Kubernetes service port names

Check the Kubernetes service port names to verify that Anthos Service Mesh considers the service an HTTP service. To be included in Anthos Service Mesh, service ports must be named, and the name must include the port's protocol, for example:

apiVersion: v1
kind: Service
metadata:
  name: ratings
  labels:
    app: ratings
    service: ratings
spec:
  ports:
  - port: 9080
    name: http

The service port name can include a suffix in the following syntax: name: protocol[-suffix] where the square brackets indicate an optional suffix that must start with a dash, for example:

apiVersion: v1
kind: Service
metadata:
  name: myservice
spec:
  ports:
  - port: 3306
    name: mysql
  - port: 80
    name: http-web

For metrics to be displayed in the Google Cloud console, the service ports must be named with one of the following protocols: http, http2, or grpc. Service ports named with the https protocol are treated as tcp, and metrics aren't displayed for those services.
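
The naming rule above can be expressed as a short check. This is an illustrative sketch only: the function name port_reports_metrics is hypothetical, and it encodes just the protocol[-suffix] syntax and the three protocols listed above, not the full protocol selection logic of the mesh.

```shell
# Succeeds (exit status 0) if a service port name follows protocol[-suffix]
# with a protocol for which metrics are displayed: http, http2, or grpc.
# Any other name, including https and unnamed ports, fails the check.
port_reports_metrics() {
  case "$1" in
    http|http-*|http2|http2-*|grpc|grpc-*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, port_reports_metrics http-web succeeds, while port_reports_metrics https and port_reports_metrics mysql both fail.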

Verify the required APIs are enabled

Anthos Service Mesh requires several APIs for reporting and displaying telemetry. You can check which APIs are enabled for your project with the gcloud services list command, or you can enable all the required APIs to make sure that none are missing.

  1. Set the default project for the Google Cloud CLI:

    gcloud config set project YOUR_PROJECT_ID
  2. Enable all the required APIs:

    gcloud services enable \
       container.googleapis.com \
       compute.googleapis.com \
       monitoring.googleapis.com \
       logging.googleapis.com \
       meshca.googleapis.com \
       meshtelemetry.googleapis.com \
       meshconfig.googleapis.com \
       iamcredentials.googleapis.com \
       anthos.googleapis.com
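
If you prefer to see which APIs are missing before enabling anything, the comparison can be sketched as follows. The function name missing_required_apis is hypothetical; the list of required APIs is copied from the command above, and the function expects one enabled API name per line on stdin.

```shell
# Reads the names of enabled APIs on stdin (one per line) and prints each
# required API that is not among them. Prints nothing if all are enabled.
missing_required_apis() {
  enabled=$(cat)
  for api in \
    container.googleapis.com \
    compute.googleapis.com \
    monitoring.googleapis.com \
    logging.googleapis.com \
    meshca.googleapis.com \
    meshtelemetry.googleapis.com \
    meshconfig.googleapis.com \
    iamcredentials.googleapis.com \
    anthos.googleapis.com
  do
    # -x matches whole lines only, -F treats the name as a fixed string
    printf '%s\n' "$enabled" | grep -qxF "$api" || printf '%s\n' "$api"
  done
}
```

One way to feed it, assuming that the config.name field holds the API name in the output of gcloud services list:

    gcloud services list --enabled --format="value(config.name)" | missing_required_apis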

Verify that the ASM Mesh Data Plane Service Account exists

  1. In the Google Cloud console, open the IAM page:

    Open the IAM page

  2. Select your project.

  3. In the members list, look for a service account with the name ASM Mesh Data Plane Service Account.

  4. If the service account is missing, create it:

    curl --request POST \
      --header "Authorization: Bearer $(gcloud auth print-access-token)" \
      --data '' \
      https://meshconfig.googleapis.com/v1alpha1/projects/YOUR_PROJECT_ID:initialize

Verify your workloads are running

  1. In the Google Cloud console, open the GKE Workloads page:

    Open GKE Workloads

  2. Select your project.

  3. Add a filter for your cluster. Verify that all workloads for your application and for Anthos Service Mesh are up and running with status OK. A workload might fail because of limited resources, such as CPU or memory, in which case you need to add resources to your cluster. See Resizing a cluster for more information.

Verify your application is serving requests

Verify that your application is actually serving requests. The QPS can be low, but the application must be receiving traffic.

The topology graph is empty

If the topology graph doesn't show your services and displays the error message "No data available to graph. Check your filters and try again", check the following:

Verify the mesh ID

Verify that the cluster has the correct mesh_id label:

  1. Get the project number, which is a unique number that is automatically generated when you create your project.

  2. Make sure that the cluster has a mesh_id label in the following format: mesh_id: proj-PROJECT_NUMBER

    Fix the mesh_id label if it is missing or incorrect. For more information, see Adding or updating labels for existing clusters.

  3. Set the following environment variables:

    • Set the project ID:

      export PROJECT_ID=YOUR_PROJECT_ID
    • Set the project number:

      export PROJECT_NUMBER=YOUR_PROJECT_NUMBER
    • Set the cluster name:

      export CLUSTER_NAME=YOUR_CLUSTER_NAME
    • Set the CLUSTER_LOCATION to either your cluster zone or cluster region:

      export CLUSTER_LOCATION=YOUR_ZONE_OR_REGION
    • Set the workload pool:

      export WORKLOAD_POOL=${PROJECT_ID}.svc.id.goog
    • Set the mesh ID:

      export MESH_ID="proj-${PROJECT_NUMBER}"
  4. Redeploy Anthos Service Mesh with the same options that you used previously.

  5. Restart any workloads that were running on your cluster before you installed Anthos Service Mesh so that their sidecar proxies are updated to the current Anthos Service Mesh version:

    kubectl rollout restart deployment YOUR_DEPLOYMENT -n YOUR_NAMESPACE
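
The label format from step 2 can be verified with a small helper before you redeploy. This is a sketch under stated assumptions: the function name mesh_id_matches is hypothetical, and you supply the current label value and the project number yourself.

```shell
# Succeeds if the mesh_id label value equals proj-PROJECT_NUMBER.
# $1: the cluster's current mesh_id label value
# $2: the project number (digits only)
# Returns 2 if the project number is empty or not numeric.
mesh_id_matches() {
  case "$2" in
    ''|*[!0-9]*) echo "project number must be numeric" >&2; return 2 ;;
  esac
  [ "$1" = "proj-$2" ]
}
```

For example, mesh_id_matches "proj-${PROJECT_NUMBER}" "${PROJECT_NUMBER}" succeeds, while an empty or mismatched label value fails. The project number can be read with the following command; reading the label from the cluster description (for instance, from its resourceLabels field) is an assumption to confirm against your gcloud version.

    gcloud projects describe "${PROJECT_ID}" --format="value(projectNumber)"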