Resolving workload startup issues in Cloud Service Mesh

This document explains common Cloud Service Mesh problems and how to resolve them. If you need additional assistance, see Getting support.

Connection Refused when reaching a Cloud Service Mesh endpoint

You might intermittently see connection refused (ECONNREFUSED) errors when workloads in your clusters communicate with endpoints such as Memorystore Redis, Cloud SQL, or any other external service that your application workload needs to reach.

This can occur when your application workload starts faster than the istio-proxy (Envoy) container and immediately tries to reach an external endpoint. At this stage, istio-init (an initContainer) has already run, so iptables rules are in place that redirect all outgoing traffic to Envoy. Because istio-proxy is not ready yet, the rules redirect traffic to a sidecar proxy that has not started, and the application gets the ECONNREFUSED error.
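
The mechanism can be reproduced outside the mesh. The following is a minimal sketch, not part of the product: it connects to a local port before and after a listener exists, with the listener standing in for Envoy. The helper name try_connect and the use of a temporary local port are assumptions made for illustration only.

```python
import socket


def try_connect(port: int) -> str:
    """Attempt a TCP connection to 127.0.0.1:port and report the outcome."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=2):
            return "connected"
    except ConnectionRefusedError:
        return "ECONNREFUSED"


# Reserve a free port, then release it so that nothing is listening on it.
probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
probe.bind(("127.0.0.1", 0))
port = probe.getsockname()[1]
probe.close()

# No listener yet: comparable to traffic redirected to a proxy that has not started.
before = try_connect(port)

# Start a listener on the same port, standing in for a proxy that has started.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("127.0.0.1", port))
listener.listen(1)

after = try_connect(port)
listener.close()

print(before, after)
```

The first attempt is refused because no process accepts connections on the port, which is the same condition the application hits when iptables redirects its traffic to a proxy that is not listening yet.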

The following steps detail how to check if this is the error you are experiencing:

  1. Check the Stackdriver logs to identify which pods had the problem.

    The following example shows a typical error message:

    Error: failed to create connection to feature-store redis, err=dial tcp 192.168.9.16:19209: connect: connection refused
    [ioredis] Unhandled error event: Error: connect ECONNREFUSED
    
  2. Search for an occurrence of the problem with the following filter. If you are using legacy Stackdriver, use resource.type="container" instead.

    resource.type="k8s_container"
    textPayload:"$ERROR_MESSAGE$"
    
  3. Expand the latest occurrence to obtain the name of the pod and then make note of the pod_name under resource.labels.

  4. Obtain the first occurrence of the issue for that pod:

    resource.type="k8s_container"
    resource.labels.pod_name="$POD_NAME$"
    

    Example output:

    E 2020-03-31T10:41:15.552128897Z
    post-feature-service post-feature-service-v1-67d56cdd-g7fvb failed to create
    connection to feature-store redis, err=dial tcp 192.168.9.16:19209: connect:
    connection refused post-feature-service post-feature-service-v1-67d56cdd-g7fvb
    
  5. Make note of the timestamp of the first error for this pod.

  6. Use the following filter to see the pod startup events.

    resource.type="k8s_container"
    resource.labels.pod_name="$POD_NAME$"
    

    Example output:

    I 2020-03-31T10:41:15Z spec.containers{istio-proxy} Container image "docker.io/istio/proxyv2:1.3.3" already present on machine  spec.containers{istio-proxy}
    I 2020-03-31T10:41:15Z spec.containers{istio-proxy} Created container  spec.containers{istio-proxy}
    I 2020-03-31T10:41:15Z spec.containers{istio-proxy} Started container  spec.containers{istio-proxy}
    I 2020-03-31T10:41:15Z spec.containers{APP-CONTAINER-NAME} Created container  spec.containers{APP-CONTAINER-NAME}
    W 2020-03-31T10:41:17Z spec.containers{istio-proxy} Readiness probe failed: HTTP probe failed with statuscode: 503  spec.containers{istio-proxy}
    W 2020-03-31T10:41:26Z spec.containers{istio-proxy} Readiness probe failed: HTTP probe failed with statuscode: 503  spec.containers{istio-proxy}
    W 2020-03-31T10:41:28Z spec.containers{istio-proxy} Readiness probe failed: HTTP probe failed with statuscode: 503  spec.containers{istio-proxy}
    W 2020-03-31T10:41:31Z spec.containers{istio-proxy} Readiness probe failed: HTTP probe failed with statuscode: 503  spec.containers{istio-proxy}
    W 2020-03-31T10:41:58Z spec.containers{istio-proxy} Readiness probe failed: HTTP probe failed with statuscode: 503  spec.containers{istio-proxy}
    
  7. Use the timestamps of errors and istio-proxy startup events to confirm the errors are happening when Envoy is not ready.

    If the errors occur while the istio-proxy container is not yet ready, connection refused errors are expected. In the preceding example, the pod tried to connect to Redis at 2020-03-31T10:41:15.552128897Z, but istio-proxy was still failing readiness probes at 2020-03-31T10:41:58Z.

    Even though the istio-proxy container started first, it might not have become ready before the app tried to connect to the external endpoint.

    If this is the problem you are experiencing, then continue through the following troubleshooting steps.

  8. Annotate the config at the pod level so that the application container is held until the proxy starts. This setting is only available at the pod level, not at a global level.

    annotations:
      proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'
    
  9. Modify the application code so that it checks if Envoy is ready before it tries to make any other requests to external services. For example, on application start, initiate a loop that makes requests to the istio-proxy health endpoint and only continues once a 200 is obtained. The istio-proxy health endpoint is as follows:

    http://localhost:15020/healthz/ready
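
The readiness loop described in the last step could look like the following minimal sketch. The function name wait_for_proxy and the timeout and interval defaults are assumptions chosen for illustration; only the health endpoint URL comes from this document.

```python
import time
import urllib.error
import urllib.request

# istio-proxy health endpoint given in the step above.
READY_URL = "http://localhost:15020/healthz/ready"


def wait_for_proxy(url: str = READY_URL,
                   timeout_s: float = 60.0,
                   interval_s: float = 1.0) -> bool:
    """Poll url until it returns HTTP 200.

    Returns True once a 200 is obtained, or False if timeout_s elapses first.
    The default timeout and interval are illustrative assumptions.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Proxy not listening yet, or it answered with an error status
            # such as 503 (urllib raises HTTPError for those).
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

An application would call wait_for_proxy() once at startup, before opening connections to Redis, Cloud SQL, or other external services, and exit or retry if it returns False.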
    

Race condition during sidecar injection between Vault and Cloud Service Mesh

When you use Vault for secrets management, Vault sometimes injects its sidecar before Istio does, which causes Pods to get stuck in Init status. When this happens, newly created Pods remain stuck in Init status after you restart a deployment or deploy a new one. For example:

E 2020-03-31T10:41:15.552128897Z
post-feature-service post-feature-service-v1-67d56cdd-g7fvb failed to create
connection to feature-store redis, err=dial tcp 192.168.9.16:19209: connect:
connection refused post-feature-service post-feature-service-v1-67d56cdd-g7fvb

This issue is caused by a race condition: both Istio and Vault inject a sidecar, and Istio must be the last to do so. The Istio proxy is not running while init containers run. The Istio init container sets up iptables rules that redirect all traffic to the proxy; because the proxy is not running yet, those rules redirect to nothing and block all traffic. This is why the Istio init container must be last, so that the proxy is up and running immediately after the iptables rules are set up. Unfortunately, the injection order is not deterministic, so startup breaks when Istio is injected first.

To work around this condition, exclude the Vault IP address from redirection so that traffic to Vault is not sent to the Envoy proxy, which is not ready yet and would otherwise block the communication. To do this, add the excludeOutboundIPRanges annotation.

For managed Cloud Service Mesh, this is only possible at Deployment or Pod level under spec.template.metadata.annotations, for example:

apiVersion: apps/v1
kind: Deployment
...
...
...
spec:
  template:
    metadata:
      annotations:
        traffic.sidecar.istio.io/excludeOutboundIPRanges:

For in-cluster Cloud Service Mesh, you can also set it globally with an IstioOperator under spec.values.global.proxy.excludeIPRanges, for example:

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      proxy:
        excludeIPRanges: ""

After adding the annotation, restart your workloads.
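
For the Deployment-level case, the annotation can also be applied as a patch instead of editing the manifest by hand. The following is a minimal sketch that builds a JSON merge patch for spec.template.metadata.annotations. The function name exclude_ranges_patch is hypothetical, and the address 10.0.0.10/32 is a placeholder, not a real Vault address; substitute the range that your Vault server uses.

```python
import json
from typing import Iterable

# Annotation key from the Deployment example above.
ANNOTATION = "traffic.sidecar.istio.io/excludeOutboundIPRanges"


def exclude_ranges_patch(cidrs: Iterable[str]) -> str:
    """Return a JSON merge patch that sets the annotation on the pod template.

    The annotation value is a comma-separated list of CIDR ranges.
    """
    patch = {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {ANNOTATION: ",".join(cidrs)}
                }
            }
        }
    }
    return json.dumps(patch)


# Placeholder range for illustration only (assumption, not a real Vault IP).
print(exclude_ranges_patch(["10.0.0.10/32"]))
```

The printed JSON could be passed to kubectl patch deployment DEPLOYMENT_NAME --type merge -p, where DEPLOYMENT_NAME is a placeholder for your own Deployment. Because the patch changes the pod template, it is expected to trigger a rollout of that Deployment.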