Troubleshooting deployments that use Envoy

This guide provides information to help you resolve Traffic Director configuration issues. For information on how to use the Client Status Discovery Service (CSDS) API to help you investigate issues with Traffic Director, see Understanding Traffic Director client status.

Envoy log locations

To troubleshoot some issues, you need to examine the Envoy proxy logs.

In Google Kubernetes Engine, the Envoy proxies run alongside the application Pods, so any errors appear in the application Pod logs when you filter by the istio-proxy container. If the cluster has workload logging enabled, you can view the errors in Cloud Logging.

Here is a possible filter:

resource.type="k8s_container"
resource.labels.project_id="PROJECT-NAME"
resource.labels.location="CLUSTER-ZONE"
resource.labels.cluster_name="CLUSTER-NAME"
resource.labels.namespace_name="WORKLOAD-NAMESPACE"
labels.k8s-pod/app="WORKLOAD-NAME"
resource.labels.container_name="istio-proxy"

If workload logging is not enabled on the cluster, you can view the errors by using a command such as the following:

kubectl logs $(kubectl get po -l app=WORKLOAD-NAME -o=jsonpath='{.items[0].metadata.name}') -c istio-proxy --tail 50 #NOTE: This assumes the default namespace.
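
If the workload runs in a namespace other than default, pass the namespace explicitly. The following is a minimal variant of the previous command, where WORKLOAD-NAMESPACE is a placeholder for your namespace:

kubectl logs -n WORKLOAD-NAMESPACE $(kubectl get po -n WORKLOAD-NAMESPACE -l app=WORKLOAD-NAME -o=jsonpath='{.items[0].metadata.name}') -c istio-proxy --tail 50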

You can also see the logs for all Envoy proxies running in all clusters and all workloads by using the following filter:

resource.type="k8s_container"
resource.labels.container_name="istio-proxy"

With Compute Engine and manual deployment, define the LOG_DIR variable before you run the run.sh script in the setup guide.

For example: LOG_DIR='/var/log/envoy/'

By default, the errors are displayed in /var/log/envoy/envoy.err.log.

If you did not perform any additional configuration to export these logs to Logging, the errors are visible only if you SSH to the instance and view the file.

If you use automatic Envoy deployment, you can SSH to the instance to obtain the log file. The file path is likely to be the same as the path above.
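
For example, the following is one way to check the error log without opening an interactive session. It assumes the default log path shown above; adjust the path if you set LOG_DIR to a different directory:

gcloud compute ssh INSTANCE_NAME --zone=ZONE --command="sudo tail -n 50 /var/log/envoy/envoy.err.log"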

Proxies do not connect to Traffic Director

If your proxies do not connect to Traffic Director, do the following:

  • Check the Envoy proxy logs for any errors connecting to trafficdirector.googleapis.com.
  • If you have set up netfilter (via iptables) to redirect all traffic to the Envoy proxy, make sure that the user (UID) that runs the proxy is excluded from redirection. Otherwise, traffic loops back to the proxy indefinitely.
  • Make sure that you enabled the Traffic Director API for the project. Under APIs & services for your project, look for errors for the Traffic Director API.
  • Confirm that the API access scope of the VM allows full access to the Google Cloud APIs. You can do this by specifying --scopes=https://www.googleapis.com/auth/cloud-platform when you create the VM.
  • Confirm that the service account has the correct permissions. For more information, read Enabling the service account to access the Traffic Director API.
  • Confirm that you can access trafficdirector.googleapis.com:443 from the VM. If you cannot, possible causes include a firewall that blocks outbound access to trafficdirector.googleapis.com on TCP port 443, or DNS resolution issues for the trafficdirector.googleapis.com hostname. See the checks sketched after this list.
  • If you're using Envoy for the sidecar proxy, confirm that the Envoy version is release 1.9.1 or later.
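
The following checks cover the last two bullets. They are a sketch that assumes you can SSH to the VM; an HTTP-level error response from curl still confirms network connectivity, and the envoy binary path can vary by deployment:

# Confirm DNS resolution and TLS reachability of the Traffic Director endpoint.
curl -v --max-time 10 https://trafficdirector.googleapis.com:443

# Confirm that the Envoy version is 1.9.1 or later.
envoy --version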

Service configured with Traffic Director is not reachable

If a service configured with Traffic Director is not reachable, confirm that the sidecar proxy is running and able to connect to Traffic Director.

If you are using Envoy as the sidecar proxy, you can confirm this by running the following commands.

  1. From the command line, confirm that the Envoy process is running.

    ps aux | grep envoy
    
  2. Inspect Envoy's runtime configuration to confirm that Traffic Director configured dynamic resources. You can see the configuration by running this command (a quick filter for the output is sketched at the end of this section):

    curl http://localhost:15000/config_dump
    
  3. Ensure that traffic interception for the sidecar proxy is set up correctly.

For the redirect setup with iptables, run the iptables command and then grep the output to ensure that your rules are there:

sudo iptables -t nat -S | grep ISTIO

The following is an example of the output for iptables intercepting the VIP 10.0.0.1/32, and forwarding it to an Envoy proxy running on port 15001 as UID 1006:

-N ISTIO_IN_REDIRECT
-N ISTIO_OUTPUT
-N ISTIO_REDIRECT
-A OUTPUT -p tcp -j ISTIO_OUTPUT
-A ISTIO_IN_REDIRECT -p tcp -j REDIRECT --to-ports 15001
-A ISTIO_OUTPUT -m owner --uid-owner 1006 -j RETURN
-A ISTIO_OUTPUT -d 127.0.0.1/32 -j RETURN
-A ISTIO_OUTPUT -d 10.0.0.1/32 -j ISTIO_REDIRECT
-A ISTIO_OUTPUT -j RETURN

If the VM instance is created through the GCP Console, some IPv6-related kernel modules are not installed and available until after a restart, which causes iptables to fail because of missing dependencies. In this case, restart the VM and rerun the setup process, which should resolve the problem. A Compute Engine VM that you created by using gcloud commands is not expected to have this problem.
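
As a quick follow-up to step 2 above, you can filter the configuration dump instead of reading the whole output. This is only a sketch; the exact JSON field names depend on your Envoy version, but the listener examples later in this guide use dynamic_active_listeners:

curl -s http://localhost:15000/config_dump | grep -c dynamic_active_listeners

A count of zero suggests that Envoy has not received dynamic configuration from Traffic Director.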

Service stops being reachable when Envoy access logging is configured

If you used TRAFFICDIRECTOR_ACCESS_LOG_PATH to configure an Envoy access log as described in Configuring additional attributes for sidecar proxies, make sure that the system user that runs the Envoy proxy has permission to write to the specified access log location.

If the necessary permissions are missing, listeners are not programmed on the proxy. You can detect this by checking for the following error message in the Envoy proxy log:

gRPC config for type.googleapis.com/envoy.api.v2.Listener rejected:
Error adding/updating listener(s) TRAFFICDIRECTOR_INTERCEPTION_LISTENER:
unable to open file '/var/log/envoy.log': Permission denied

To solve the problem, change the permissions on the chosen access log file so that the Envoy user can write to it.
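
For example, assuming that Envoy runs as a dedicated system user named envoy and that the access log path is the one from the error above, the following sketch creates the file and makes it writable by that user. Substitute your actual user and path:

sudo touch /var/log/envoy.log
sudo chown envoy /var/log/envoy.log
sudo chmod u+w /var/log/envoy.log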

Applications are unable to connect to services not configured in Traffic Director

Make sure that you set up traffic interception only for the IP addresses of the services that are configured in Traffic Director. If all traffic is intercepted, connections to services that are not configured in Traffic Director are silently discarded by the sidecar proxy.

Traffic is looping within a node or a node crashes

If netfilter (iptables) is set up to intercept all traffic, ensure that the user (UID) that is used to run the sidecar proxy is excluded from traffic interception. Otherwise, traffic sent by the sidecar proxy is looped back to the proxy indefinitely. The sidecar proxy process might crash as a result. In the reference configuration, the netfilter rules do not intercept traffic from the proxy user.
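
As a sketch, assuming the ISTIO_OUTPUT chain from the earlier example and a proxy user with UID 1006, the following rule inserted at the top of the chain excludes the proxy's own traffic from interception:

sudo iptables -t nat -I ISTIO_OUTPUT 1 -m owner --uid-owner 1006 -j RETURN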

Traffic Director behavior when most endpoints are unhealthy

For better reliability, when 99% of endpoints are unhealthy, Traffic Director configures the data plane to disregard the health status of the endpoints and to balance traffic among all of the endpoints, because it is possible that the serving port is still functional.

Error messages in the Envoy logs indicating a configuration problem

If you are having difficulty with your Traffic Director configuration, you might see any of these error messages in the Envoy logs:

  • warning envoy config StreamAggregatedResources gRPC config stream closed: 5, Traffic Director configuration was not found for network "VPC_NAME" in project "PROJECT_NUMBER".
  • warning envoy upstream StreamLoadStats gRPC config stream closed: 5, Traffic Director configuration was not found for network "VPC_NAME" in project "PROJECT_NUMBER".
  • warning envoy config StreamAggregatedResources gRPC config stream closed: 5, Requested entity was not found.
  • warning envoy upstream StreamLoadStats gRPC config stream closed: 5, Requested entity was not found.
  • Traffic Director configuration was not found.

This error message generally indicates that Envoy is requesting configuration from Traffic Director but no matching configuration can be found. When Envoy connects to Traffic Director, it presents a VPC network name (for example, my-network). Traffic Director then looks for forwarding rules that (1) have the INTERNAL_SELF_MANAGED load balancing scheme, and (2) reference the same network name (for example, my-network).

  1. Make sure that there is a forwarding rule in your network that has the load balancing scheme INTERNAL_SELF_MANAGED. Note the forwarding rule's VPC network. A command sketch for this check follows this list.

  2. If you're using Traffic Director with automated Envoy deployments on Compute Engine, ensure that the value provided to the --service-proxy:network flag matches the forwarding rule's network name.

  3. If you're using Traffic Director with manual Envoy deployments on Compute Engine, check the Envoy bootstrap file:

    1. Check the value of the TRAFFICDIRECTOR_NETWORK_NAME variable and ensure that it matches the forwarding rule's network name.
    2. Make sure that the project number is set in the TRAFFICDIRECTOR_GCP_PROJECT_NUMBER variable in the Envoy bootstrap file.
  4. If you're deploying on GKE and you are using the auto-injector, make sure that the project number and network name are configured correctly, according to the directions in Traffic Director setup for Google Kubernetes Engine Pods with automatic Envoy injection.
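
To check step 1 from the command line, you can list the forwarding rules that use the INTERNAL_SELF_MANAGED scheme and note their networks. This is a sketch; adjust the output fields as needed:

gcloud compute forwarding-rules list \
    --filter="loadBalancingScheme=INTERNAL_SELF_MANAGED" \
    --format="table(name,network,loadBalancingScheme)"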

Troubleshooting automated Envoy deployments for Compute Engine

This section provides instructions for troubleshooting automated Envoy deployments.

Communication channels for troubleshooting

The Envoy and VM bootstrapping processes and further lifecycle management operations can fail for many reasons, including temporary connectivity issues, broken repositories, bugs in bootstrapping scripts and on-VM agents, and unexpected user actions.

Google Cloud provides communications channels that you can use to help you understand the bootstrapping process and the current state of the components that reside on your VMs.

Virtual serial port output logging

A VM's operating system, BIOS, and other system-level entities typically write output to the serial ports, and the output is useful for troubleshooting system crashes, failed boot-ups, start-up issues, and shutdown issues.

The Compute Engine bootstrapping agents log all of the actions that they perform, together with system events, to serial port 1. This starts with basic package installation and continues through retrieving data from the instance's metadata server, iptables configuration, and Envoy installation status.

The on-VM agent logs the Envoy process health status, newly discovered Traffic Director services, and any other information that might be useful when you investigate issues with VMs.

Cloud Monitoring logging

Data exposed in serial port output is also logged to Monitoring, which uses the Golang library and exports the logs to a separate log to reduce noise. Note that this is an instance-level log, so you might find service proxy logs on the same page as the other instance logs.

VM guest attributes

Guest attributes are a specific type of custom metadata that your applications can write to while running on your instance. Any application or user on your instances can read and write data to these guest attribute metadata values.

Compute Engine Envoy bootstrap scripts and on-VM agents expose attributes with information about the bootstrapping process and current status of Envoy. All guest attributes are exposed in the gce-service-proxy namespace:

gcloud compute instances get-guest-attributes INSTANCE_NAME \
    --query-path=gce-service-proxy/ --zone ZONE

If you find any issues, we recommend that you check the values of the bootstrap-status and bootstrap-last-failure guest attributes. Any bootstrap-status value other than FINISHED indicates that the Envoy environment is not configured yet. The value of bootstrap-last-failure might indicate what the problem is.
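
You can also query a single attribute directly rather than the whole namespace. For example, the following sketch reads only the bootstrap-status attribute described above:

gcloud compute instances get-guest-attributes INSTANCE_NAME \
    --query-path=gce-service-proxy/bootstrap-status --zone=ZONE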

Unable to reach Traffic Director service from a VM created using a service-proxy-enabled instance template

Follow these steps to correct this problem.

  1. The installation of service proxy components on the VM might not have completed or might have failed.

    Use the following command to determine whether all components are properly installed.

    gcloud compute instances get-guest-attributes INSTANCE_NAME \
        --query-path=gce-service-proxy/ --zone=ZONE
    

    The bootstrap-status guest attribute is set to one of the following:

    • [none] indicates that installation has not started yet. The VM might still be booting up. Check the status again in a few minutes.
    • IN PROGRESS indicates that the installation and configuration of the service proxy components are not yet complete. Repeat the status check for updates on the process.
    • FAILED indicates that the installation or configuration of a component failed. Check the error message by querying the gce-service-proxy/bootstrap-last-failure attribute.
    • FINISHED indicates that the installation and configuration process finished without any errors. Use the instructions below to verify that traffic interception and the Envoy proxy are configured correctly.
  2. Traffic interception on the VM is not configured correctly for Traffic Director-based services.

    Log in to the VM and check the iptables configuration:

    gcloud compute ssh INSTANCE_NAME --zone=ZONE
    sudo iptables -L -t nat
    

    Examine the chain SERVICE_PROXY_SERVICE_CIDRS for SERVICE_PROXY_REDIRECT entries such as these:

    Chain SERVICE_PROXY_SERVICE_CIDRS (1 references)
    target           prot opt source              destination  ...
    SERVICE_PROXY_REDIRECT  all  --  anywhere             10.7.240.0/20
    

    For each service, there should be a matching IP address or CIDR in the destination column. If there is no entry for the VIP, there is a problem with populating the Envoy proxy configuration from Traffic Director, or the on-VM agent failed.

  3. The Envoy proxies haven't received their configuration from Traffic Director yet.

    Log in to the VM and check the Envoy proxy configuration:

    gcloud compute ssh INSTANCE_NAME --zone=ZONE
    sudo curl localhost:15000/config_dump
    

    Examine the listener configuration received from Traffic Director. For example:

    "dynamic_active_listeners": [
      ...
      "filter_chains": [{
        "filter_chain_match": {
          "prefix_ranges": [{
            "address_prefix": "10.7.240.20",
            "prefix_len": 32
          }],
          "destination_port": 80
        },
      ...
        "route_config_name": "URL_MAP/[PROJECT_NUMBER].td-routing-rule-1"
      ...
    ]
    

    address_prefix is the VIP of a Traffic Director service. It points to the URL map called td-routing-rule-1. Check whether the service you would like to connect to is already included in the listener configuration.

  4. The on-VM agent is not running.

    The on-VM agent automatically configures traffic interception when new Traffic Director services are created. If the agent is not running, all traffic to new services goes directly to their VIPs, bypasses the Envoy proxy, and times out.

    To verify the status of the on-VM agent, run the following command:

    gcloud compute instances get-guest-attributes INSTANCE_NAME \
        --query-path=gce-service-proxy/ --zone=ZONE
    
  5. You can examine the attributes of the on-VM agent. The value of the agent-heartbeat attribute shows the time when the agent last performed an action or check. If the value is more than five minutes old, the agent is stuck, and you should recreate the VM by using the gcloud compute instance-groups managed recreate-instances command.

  6. The agent-last-failure attribute exposes the last error that occurred in the agent. This might be a transient issue that the agent resolves by its next check (for example, if the error is Cannot reach the Traffic Director API server), or it might be a permanent error. Wait a few minutes and then recheck the error. These checks are sketched after this list.
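
The checks referenced in steps 5 and 6 can be run as follows. This is a sketch; MIG_NAME is a placeholder for the name of your managed instance group:

gcloud compute instances get-guest-attributes INSTANCE_NAME \
    --query-path=gce-service-proxy/agent-heartbeat --zone=ZONE
gcloud compute instances get-guest-attributes INSTANCE_NAME \
    --query-path=gce-service-proxy/agent-last-failure --zone=ZONE
gcloud compute instance-groups managed recreate-instances MIG_NAME \
    --instances=INSTANCE_NAME --zone=ZONE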

Inbound traffic interception is configured to the workload port, but you cannot connect to the port from outside the VM

Follow these steps to correct this problem.

  1. The installation of service proxy components on the VM might not have completed or might have failed.

    Use the following command to determine whether all components are properly installed.

    gcloud compute instances get-guest-attributes INSTANCE_NAME \
        --query-path=gce-service-proxy/ --zone=ZONE
    

    The bootstrap-status guest attribute is set to one of the following:

    • [none] indicates that installation has not started yet. The VM might still be booting up. Check the status again in a few minutes.
    • IN PROGRESS indicates that the installation and configuration of the service proxy components are not yet complete. Repeat the status check for updates on the process.
    • FAILED indicates that the installation or configuration of a component failed. Check the error message by querying the gce-service-proxy/bootstrap-last-failure attribute.
    • FINISHED indicates that the installation and configuration process finished without any errors. Use the instructions below to verify that traffic interception and the Envoy proxy are configured correctly.
  2. Traffic interception on the VM is not configured correctly for inbound traffic.

    Log in to the VM and check the iptables configuration:

    gcloud compute ssh INSTANCE_NAME --zone=ZONE
    sudo iptables -L -t nat
    

    Examine the chain SERVICE_PROXY_INBOUND for SERVICE_PROXY_IN_REDIRECT entries such as:

    Chain SERVICE_PROXY_INBOUND (1 references)
    target     prot opt source               destination
    ...
    SERVICE_PROXY_IN_REDIRECT  tcp  --  anywhere  anywhere  tcp dpt:mysql
    

    For each port that is defined in service-proxy:serving-ports there should be a matching port in the destination column. If there is no entry for the port, all inbound traffic goes to this port directly, bypassing the Envoy proxy.

    Verify that no other rules drop traffic to this port, or drop traffic to all ports except one specific port.

  3. The Envoy proxies haven't received their configuration for the inbound port from Traffic Director yet.

    Log in to the VM and check the Envoy proxy configuration:

    gcloud compute ssh INSTANCE_NAME --zone=ZONE
    sudo curl localhost:15000/config_dump
    

    Look for the inbound listener configuration received from Traffic Director:

    "dynamic_active_listeners": [
      ...
      "filter_chains": [{
        "filter_chain_match": {
          "prefix_ranges": [{
            "address_prefix": "10.0.0.1",
            "prefix_len": 32
          }],
          "destination_port": 80
        },
      ...
        "route_config_name": "inbound|default_inbound_config-80"
      ...
    ]
    

    A route_config_name that starts with inbound indicates a special service that is created for inbound traffic interception. Check whether the port that you want to connect to is already included in the listener configuration under destination_port. A quick filter for this check follows this list.
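
To run the quick filter referenced above, you can extract only the inbound route configuration names from the configuration dump; the name pattern matches the example output:

sudo curl -s localhost:15000/config_dump | grep -o 'inbound|default_inbound_config-[0-9]*'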

Troubleshooting automated deployment with GKE Pods

Use these instructions to help you solve problems when you use automated Envoy deployment with GKE Pods.

Pods not starting up after you enable automatic Envoy injection

In some circumstances, application Pods might not start correctly. This can happen when you use a private GKE cluster with restrictive firewall rules.

If you want to use Traffic Director with a private GKE cluster, you must create an additional firewall rule for the sidecar injector webhook. Follow the instructions in this guide to create a firewall rule that allows the GKE control plane to reach the Pods on TCP port 9443.
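
The linked guide is authoritative, but the rule generally looks like the following sketch. MASTER_IPV4_CIDR (the control plane's IP range) and NODE_NETWORK_TAG are placeholders that you must replace with your cluster's values:

gcloud compute firewall-rules create allow-sidecar-injector-webhook \
    --network=NETWORK_NAME \
    --source-ranges=MASTER_IPV4_CIDR \
    --target-tags=NODE_NETWORK_TAG \
    --allow=tcp:9443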

You may observe this issue when creating a standalone pod or when a deployment tries to create pods.

When creating a standalone pod (for example, using kubectl apply or kubectl run), the kubectl CLI may return an error message like the following:

Error from server (InternalError): Internal error occurred: failed calling webhook "sidecar-injector.istio.io": Post https://istio-sidecar-injector.istio-control.svc:443/inject?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

When creating pods from a deployment, you may encounter the following symptoms:

  • kubectl get pods shows no pods associated with the deployment.
  • kubectl get events --all-namespaces shows an error message like the following:
Warning  FailedCreate  15s   replicaset-controller  Error creating: Internal error occurred: failed calling webhook "sidecar-injector.istio.io": Post https://istio-sidecar-injector.istio-control.svc:443/inject?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

When following the setup guide, you may first encounter this issue during the Deploying a sample client and verifying injection step. After running kubectl create -f demo/client_sample.yaml, run kubectl get deploy busybox and you will see 0/1 READY pods. You can also find the error by describing the replicaset associated with the deployment by issuing the kubectl describe rs -l run=client command.

Connection refused after you verify the configuration

When you set up Traffic Director with automatic Envoy injection, you might receive a connection refused error when you try to verify the configuration. The cause could be one of the following:

  • The value of discoveryAddress in the specs/01-configmap.yaml file is not correct. The value should be trafficdirector.googleapis.com:443.
  • The value for the VPC network in the specs/01-configmap.yaml file is not correct.
  • The value for the Traffic Director project in the specs/01-configmap.yaml file is not correct.
  • The value of discoveryAddress is wrong in the Pod.
  • The Istio sidecar injector is running rather than the Traffic Director sidecar injector.

You can see a sample of the specs/01-configmap.yaml file in Configuring the sidecar injector. If the specs/01-configmap.yaml file does not contain correct values, Envoy cannot obtain the correct configuration from Traffic Director. To fix this, examine the specs/01-configmap.yaml file and make sure that the values are correct, and then recreate the auto-injector.

Make sure to check the value of discoveryAddress in the specs/01-configmap.yaml file and in the Pod. In the Pod, the value is set by the sidecar injector. To check the value of discoveryAddress in the Pod, run this command:

kubectl get po $BUSYBOX_POD -o yaml|grep -Po '\"discoveryAddress\":\"[^,]*\"'

You should see output similar to this:

"discoveryAddress":"trafficdirector.googleapis.com:443"

What's next