Version 1.7

Resolving observability and telemetry issues in Anthos Service Mesh

This section explains common Anthos Service Mesh problems and how to resolve them. If you need additional assistance, see Getting support.

In Anthos Service Mesh telemetry, the Envoy proxies periodically call Google Cloud's operations suite APIs to report telemetry data. The type of API call determines its frequency:

  • Logging: every ~10 seconds
  • Metrics: every ~1 minute
  • Edges (Context API/Topology view): incremental reports every ~1 minute, with full reports every ~10 minutes.
  • Traces: determined by the sampling frequency you configure (typically, one out of every 100 requests).

The telemetry dashboards gather data from both Confluence and Google Cloud's operations suite to display the various service-focused dashboards.

Services dashboard is missing a service

The dashboard only displays HTTP(S)/gRPC services. If your service should be in the list, verify that Anthos Service Mesh telemetry identifies it as an HTTP service.

If your service remains missing, verify that a Kubernetes service configuration exists in your cluster.

Review the list of all Kubernetes services:

kubectl get services --all-namespaces

Review the list of Kubernetes services in a specific namespace:

kubectl get services -n YOUR_NAMESPACE

Missing/Incorrect metrics for services

If there are missing or incorrect metrics for services in the dashboard, see the following sections for potential resolutions.

Verify Sidecar proxies exist and have been injected properly

The namespace might not have a label for automatic injection, or manual injection has failed. Confirm that the pods in the namespace have at least two containers and that one of those containers is the istio-proxy container:

kubectl -n YOUR_NAMESPACE get pods
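
In a namespace with many pods, checking container names by eye is error-prone. The following is a minimal sketch, not part of Anthos Service Mesh, that prints only the pods whose container list has no istio-proxy entry. The function name pods_missing_sidecar is hypothetical, and the sketch assumes kubectl is already configured for your cluster.

```shell
# Hypothetical helper: list pods in a namespace that have no istio-proxy container.
# Usage: pods_missing_sidecar YOUR_NAMESPACE
pods_missing_sidecar() {
  kubectl get pods -n "$1" --no-headers \
    -o custom-columns='NAME:.metadata.name,CONTAINERS:.spec.containers[*].name' |
  awk '{
    # Column 2 is a comma-separated list of container names.
    n = split($2, names, ",")
    found = 0
    for (i = 1; i <= n; i++) if (names[i] == "istio-proxy") found = 1
    if (!found) print $1
  }'
}
```

An empty result from pods_missing_sidecar YOUR_NAMESPACE suggests that every pod in the namespace has a sidecar.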

Verify that telemetry configuration exists

Telemetry is configured through EnvoyFilters in the istio-system namespace. Without that configuration, Anthos Service Mesh will not report data to Google Cloud's operations suite.

Verify that the Google Cloud's operations suite and metadata exchange configurations exist:

kubectl -n istio-system get envoyfilter

The expected output looks similar to the following:

NAME                        AGE
metadata-exchange-1.4       13d
metadata-exchange-1.5       13d
stackdriver-filter-1.4      13d
stackdriver-filter-1.5      13d
...
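
The same check can be scripted. This is a rough sketch that assumes the filter names begin with metadata-exchange and stackdriver-filter, as in the sample output above; the version suffixes in your cluster may differ. The function name check_telemetry_filters is hypothetical.

```shell
# Hypothetical helper: report whether both telemetry EnvoyFilter families exist.
# Assumes the name prefixes shown in the sample output.
check_telemetry_filters() {
  filter_names=$(kubectl -n istio-system get envoyfilter -o name)
  filter_status=0
  for prefix in metadata-exchange stackdriver-filter; do
    if printf '%s\n' "$filter_names" | grep -q "/${prefix}"; then
      echo "found: ${prefix}"
    else
      echo "missing: ${prefix}"
      filter_status=1
    fi
  done
  # Non-zero exit status means at least one family is missing.
  return "$filter_status"
}
```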

To further confirm that the Google Cloud's operations suite filter is properly configured, gather a configuration dump from each proxy:

kubectl exec YOUR_POD_NAME -n YOUR_NAMESPACE -c istio-proxy -- curl localhost:15000/config_dump

In the output of the previous command, look for the Google Cloud's operations suite filter, which looks like the following:

"config": {
    "root_id": "stackdriver_inbound",
    "vm_config": {
        "vm_id": "stackdriver_inbound",
        "runtime": "envoy.wasm.runtime.null",
        "code": {
            "local": {
                "inline_string": "envoy.wasm.null.stackdriver"
            }
        }
    },
    "configuration": "{....}"
}
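
Because a configuration dump is long, searching it is easier than reading it. The sketch below wraps the dump command and looks for the envoy.wasm.null.stackdriver string from the sample above. The function name has_stackdriver_filter is hypothetical, and the sketch assumes that curl is available in the istio-proxy container.

```shell
# Hypothetical helper: succeed if a proxy's config dump mentions the Stackdriver filter.
# Usage: has_stackdriver_filter YOUR_POD_NAME YOUR_NAMESPACE
has_stackdriver_filter() {
  kubectl exec "$1" -n "$2" -c istio-proxy -- \
    curl -s localhost:15000/config_dump |
  grep -qF 'envoy.wasm.null.stackdriver'
}
```

For example, has_stackdriver_filter YOUR_POD_NAME YOUR_NAMESPACE && echo "filter present" prints a message only when the string is found.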

Verify that Anthos Service Mesh identifies an HTTP service

Metrics do not appear in the user interface unless the service port of the Kubernetes service has a name that begins with http (for example, http or http-web). Confirm that the ports of the service are named correctly.
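
To audit port names across a namespace, something like the following sketch might help. It flags every service port whose name does not begin with http or grpc. Treating grpc as acceptable is an assumption based on the dashboard listing HTTP(S)/gRPC services, and the prefix test is a simplification of the actual protocol detection. The function name flag_non_http_ports is hypothetical.

```shell
# Hypothetical helper: flag service ports whose names lack an http or grpc prefix.
# Usage: flag_non_http_ports YOUR_NAMESPACE
flag_non_http_ports() {
  kubectl get services -n "$1" --no-headers \
    -o custom-columns='NAME:.metadata.name,PORTS:.spec.ports[*].name' |
  awk '{
    # Column 2 is a comma-separated list of port names for the service.
    n = split($2, ports, ",")
    for (i = 1; i <= n; i++)
      if (ports[i] !~ /^(http|grpc)/)
        print $1 ": port \"" ports[i] "\" has no http/grpc prefix"
  }'
}
```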

Verify the Cloud Monitoring API is enabled for the project

The Cloud Monitoring API is enabled by default. Confirm that it is still enabled in the APIs & Services dashboard in the Google Cloud Console.
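
If you prefer the command line, a check along these lines may work. It is a sketch that assumes the gcloud CLI is installed and authenticated, and that the API appears in the enabled services list as monitoring.googleapis.com. The function name monitoring_api_enabled is hypothetical.

```shell
# Hypothetical helper: succeed if the Cloud Monitoring API is enabled for a project.
# Usage: monitoring_api_enabled YOUR_PROJECT_ID
monitoring_api_enabled() {
  gcloud services list --enabled --project "$1" --format='value(config.name)' |
  grep -qxF 'monitoring.googleapis.com'
}
```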

Verify no errors reporting to the Cloud Monitoring API

In the Google Cloud Console APIs & Services dashboard, open the Traffic By Response Code graph.

If the graph shows error responses, they might warrant further investigation. In particular, look for a large number of 429 responses, which indicate a potential quota issue. See the next section for troubleshooting steps.

Verify correct quota for the Cloud Monitoring API

In the Google Cloud Console, open the IAM & Admin menu and verify there is a Quotas option. You can access this page directly using the URL:

https://console.cloud.google.com/iam-admin/quotas?project=YOUR_PROJECT_ID

This page shows the full set of quotas for the project, where you can search for Cloud Monitoring API.

Verify no error logs in Envoy proxies

Review the logs for the proxy in question, searching for error message instances:

kubectl -n YOUR_NAMESPACE logs YOUR_POD_NAME -c istio-proxy

You can ignore warning messages like the following, which are normal:

[warning][filter] [src/envoy/http/authn/http_filter_factory.cc:83]
mTLS PERMISSIVE mode is used, connection can be either plaintext or TLS,
and client cert can be omitted. Please consider to upgrade to mTLS STRICT mode
for more secure configuration that only allows TLS connection with client cert.
See https://istio.io/docs/tasks/security/mtls-migration/

[warning][config]
[bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91]
gRPC config stream closed: 13
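
To skip past the expected warnings, you can filter the log by severity. This sketch assumes the level appears in square brackets, as in the warning samples above, and keeps only lines tagged error or critical. The function name proxy_errors is hypothetical.

```shell
# Hypothetical helper: print only error and critical lines from a sidecar's log.
# Usage: proxy_errors YOUR_POD_NAME YOUR_NAMESPACE
proxy_errors() {
  kubectl -n "$2" logs "$1" -c istio-proxy |
  grep -E '\[(error|critical)\]'
}
```

The function exits with a non-zero status when no matching lines are found, which is the desired outcome here.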

Missing or incorrect telemetry data for services

By default, Cloud Monitoring and Cloud Logging are enabled in your Cloud project when you install Anthos Service Mesh. To report telemetry data, each sidecar proxy that is injected into your service pods calls the Cloud Monitoring API and the Cloud Logging API. After deploying workloads, it takes about one or two minutes for telemetry data to be displayed in the Cloud Console. Anthos Service Mesh automatically keeps the service dashboards up to date:

  • For metrics, the sidecar proxies call the Cloud Monitoring API approximately every minute.
  • To update the Topology graph, the sidecar proxies send incremental reports approximately every minute and full reports about every ten minutes.
  • For logging, the sidecar proxies call the Cloud Logging API approximately every ten seconds.
  • For tracing, you have to enable Cloud Trace. Traces are reported according to the sampling frequency that you have configured (typically, one out of every 100 requests).

Metrics are displayed only for HTTP services on the Anthos Service Mesh Metrics page. If you don't see any metrics, verify that all the pods in the namespace for your application's services have sidecar proxies injected:

kubectl get pods -n YOUR_NAMESPACE

In the output, confirm that the READY column shows two containers for each of your workloads (for example, 2/2): the primary container and the container for the sidecar proxy.
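
The READY column can also be checked mechanically. The following sketch lists pods that report fewer than two containers or whose containers are not all ready. The function name pods_without_ready_sidecar is hypothetical, and the sketch assumes the default kubectl column layout, in which READY is the second column.

```shell
# Hypothetical helper: list pods with fewer than two containers or with containers not ready.
# Usage: pods_without_ready_sidecar YOUR_NAMESPACE
pods_without_ready_sidecar() {
  kubectl get pods -n "$1" --no-headers |
  awk '{
    # READY is the second column, formatted as ready/total (for example 2/2).
    split($2, ready, "/")
    if (ready[2] + 0 < 2 || ready[1] + 0 != ready[2] + 0) print $1, $2
  }'
}
```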