Traffic Director setup for external backends

This document provides instructions for configuring external backends for Traffic Director by using internet fully qualified domain name (FQDN) network endpoint groups (NEGs). It assumes that you have intermediate to advanced familiarity with the following:

This setup guide provides you with basic instructions for the following:

  • Configuring Traffic Director to use an internet NEG and unauthenticated TLS for outbound traffic
  • Routing traffic to a Cloud Run service from your service mesh

Before you begin

Review the Traffic Director with internet network endpoint groups overview.

For the purposes of this guide, the example configurations assume the following:

  • All relevant Compute Engine resources, such as middle proxies, Traffic Director resources, Cloud DNS zones, and hybrid connectivity, are attached to the default Virtual Private Cloud (VPC) network.
  • The service example.com:443 is running in your on-premises infrastructure. The domain example.com is served by three endpoints, 10.0.0.100, 10.0.0.101, and 10.0.0.102. Routes exist that ensure connectivity from the Envoy proxies to these remote endpoints.

The resulting deployment is similar to the following diagram.

Example setup with an internet NEG.

Traffic routing with an internet NEG and TLS with SNI

After you configure Traffic Director with an internet NEG by using the FQDN and TLS for outbound traffic, the example deployment behaves as illustrated in the following diagram and description of the traffic.

How traffic is routed in the example.

The steps in the following legend correspond to the numbering in the previous diagram.

  • Step 0: Envoy receives the FQDN backend configuration from Traffic Director through xDS.
  • Step 0: Envoy, running in the VM, continuously queries DNS for the configured FQDN.
  • Step 1: The user application initiates a request.
  • Step 2: The request carries the Layer 7 parameters, such as the Host header and port, that are used for routing.
  • Step 3: The Envoy proxy intercepts the request. The example assumes that you are using 0.0.0.0 as the forwarding rule virtual IP address (VIP). When 0.0.0.0 is the VIP, Envoy intercepts all requests. Request routing is based only on Layer 7 parameters, regardless of the destination IP address in the original request generated by the application.
  • Step 4: Envoy selects a healthy remote endpoint and performs a TLS handshake with the SNI obtained from the client TLS policy.
  • Step 5: Envoy proxies the request to the remote endpoint.
Although it is not shown in the diagram, if health checks are configured, Envoy continuously health checks the remote endpoints and routes requests only to the healthy ones.

Set up hybrid connectivity

This document also assumes that hybrid connectivity is already established:

  • Hybrid connectivity between the VPC network and on-premises services or a third-party public cloud is established with Cloud VPN or Cloud Interconnect.
  • VPC firewall rules and routes are correctly configured to establish bi-directional reachability from Envoy to private service endpoints and, optionally, to an on-premises DNS server.
  • For a successful regional HA failover scenario, global dynamic routing is enabled. For more details, see dynamic routing mode.

Set up Cloud DNS configuration

Use the following commands to set up a Cloud DNS private zone for the domain (FQDN) example.com that has A records pointing to the endpoints 10.0.0.100, 10.0.0.101, and 10.0.0.102.

gcloud

  1. Create a DNS managed private zone and attach it to the default network:

    gcloud dns managed-zones create example-zone \
        --description=test \
        --dns-name=example.com \
        --networks=default \
        --visibility=private
    
  2. Add DNS records to the private zone:

    gcloud dns record-sets transaction start \
        --zone=example-zone
    
    gcloud dns record-sets transaction add 10.0.0.100 10.0.0.101 10.0.0.102 \
        --name=example.com \
        --ttl=300 \
        --type=A \
        --zone=example-zone
    
    gcloud dns record-sets transaction execute \
        --zone=example-zone
    
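To confirm the DNS configuration, you can list the record sets in the zone and, from a VM attached to the default network, check that the FQDN resolves to the expected endpoints:

    gcloud dns record-sets list \
        --zone=example-zone

    # Run from a VM on the default network; requires the dig utility.
    dig +short example.com
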

Configure Traffic Director with an internet FQDN NEG

In this section, you configure Traffic Director with an internet FQDN NEG.

Create the NEG, health check, and backend service

gcloud

  1. Create the internet NEG:

    gcloud compute network-endpoint-groups create on-prem-service-a-neg \
        --global \
        --network-endpoint-type INTERNET_FQDN_PORT
    
  2. Add the FQDN:Port endpoint to the internet NEG:

    gcloud compute network-endpoint-groups update on-prem-service-a-neg \
        --global \
        --add-endpoint=fqdn=example.com,port=443
    
  3. Create a global health check:

    gcloud compute health-checks create http service-a-http-health-check \
        --global
    
  4. Create a global backend service called on-prem-service-a and associate the health check with it:

    gcloud compute backend-services create on-prem-service-a \
        --global \
        --load-balancing-scheme=INTERNAL_SELF_MANAGED \
        --health-checks service-a-http-health-check
    
  5. Add the NEG called on-prem-service-a-neg as the backend of the backend service:

    gcloud compute backend-services add-backend on-prem-service-a \
        --global \
        --global-network-endpoint-group \
        --network-endpoint-group on-prem-service-a-neg
    
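To confirm the configuration, you can describe the backend service and check the health status that it reports for the NEG endpoints:

    gcloud compute backend-services describe on-prem-service-a \
        --global

    gcloud compute backend-services get-health on-prem-service-a \
        --global
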

Create a routing rule map

The URL map, target HTTP proxy, and forwarding rule constitute a routing rule map, which provides routing information for traffic in your mesh.

This URL map contains a rule that routes all HTTP traffic to on-prem-service-a.

gcloud

  1. Create the URL map:

    gcloud compute url-maps create td-url-map \
        --default-service on-prem-service-a
    
  2. Create the target HTTP proxy and associate the URL map with the target proxy:

    gcloud compute target-http-proxies create td-proxy \
        --url-map td-url-map
    
  3. Create the global forwarding rule by using the IP address 0.0.0.0. This is a special IP address that causes your data plane to ignore the destination IP address and route requests based only on the request's HTTP parameters.

    gcloud compute forwarding-rules create td-forwarding-rule \
        --global \
        --load-balancing-scheme=INTERNAL_SELF_MANAGED \
        --address=0.0.0.0 \
        --target-http-proxy=td-proxy \
        --ports=443 \
        --network=default
    
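As a quick check after the routing rule map is in place, you can send a request from a VM in the mesh whose outbound traffic is intercepted by an Envoy sidecar. Because the VIP is 0.0.0.0, Envoy matches the request on its Host header and port rather than on the destination IP address. This sketch assumes plain HTTP, before TLS is configured:

    # Run from a mesh VM with a Traffic Director-managed Envoy sidecar.
    curl -v http://example.com:443/
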

Configure unauthenticated TLS and HTTPS

Optionally, if you want to configure unauthenticated HTTPS between your Envoy proxies and your on-premises services, use these instructions. These instructions also demonstrate how to configure SNI in the TLS handshake.

A client TLS policy specifies the client identity and authentication mechanism when a client sends outbound requests to a particular service. A client TLS policy is attached to a backend service resource by using the securitySettings field.

gcloud

  1. Create and import the client TLS policy; set the SNI to the FQDN that you configured in the NEG:

    cat << EOF > client_unauthenticated_tls_policy.yaml
    name: "client_unauthenticated_tls_policy"
    sni: "example.com"
    EOF
    
    gcloud beta network-security client-tls-policies import client_unauthenticated_tls_policy \
        --source=client_unauthenticated_tls_policy.yaml \
        --location=global
    
  2. If you configured an HTTP health check with the backend service in the previous section, detach the health check from the backend service:

    gcloud compute backend-services update on-prem-service-a \
        --global \
        --no-health-checks
    
  3. Create an HTTPS health check:

    gcloud compute health-checks create https service-a-https-health-check \
        --global
    
  4. Attach the client TLS policy to the backend service that you created previously; this enforces unauthenticated HTTPS on all outbound requests from the client to this backend service:

    gcloud compute backend-services export on-prem-service-a \
        --global \
        --destination=on-prem-service-a.yaml
    
    cat << EOF >> on-prem-service-a.yaml
    securitySettings:
      clientTlsPolicy: projects/${PROJECT_ID}/locations/global/clientTlsPolicies/client_unauthenticated_tls_policy
    healthChecks:
      - projects/${PROJECT_ID}/global/healthChecks/service-a-https-health-check
    EOF
    
    gcloud compute backend-services import on-prem-service-a \
        --global \
        --source=on-prem-service-a.yaml
    
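To verify the result, you can confirm that the client TLS policy has the expected SNI and, if a remote endpoint is reachable from where you run the command, check its TLS handshake directly. The openssl command is an illustration that sends the same SNI that Envoy sends:

    gcloud beta network-security client-tls-policies describe \
        client_unauthenticated_tls_policy \
        --location=global

    # Check the TLS handshake against one of the example endpoints.
    openssl s_client -connect 10.0.0.100:443 -servername example.com
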

You can use internet FQDN NEGs to route traffic to any service that is reachable through FQDN—for example, DNS resolvable external and internal services or Cloud Run services.

Migrate from an IP:Port NEG to an FQDN:Port NEG

A NON_GCP_PRIVATE_IP_PORT NEG requires you to program service endpoints into the NEG as static IP:port pairs, whereas an INTERNET_FQDN_PORT NEG lets the endpoints be resolved dynamically by using DNS. You can migrate to the internet NEG by setting up DNS records for your on-premises service endpoints and configuring Traffic Director as described in the following steps:

  1. Set up DNS records for your FQDN.
  2. Create a new internet NEG with the FQDN.
  3. Create a new backend service with the internet NEG that you created in step 2 as its backend. Associate the same health check that you used with the hybrid connectivity NEG backend service with the new backend service. Verify that the new remote endpoints are healthy.
  4. Update your routing rule map to reference the new backend service by replacing the old backend that includes the hybrid connectivity NEG.
  5. If you want zero downtime during live migration in a production deployment, you can use weight-based traffic splitting. Initially, configure your new backend service to receive only a small percentage of traffic, for example, 5%. Use the instructions for setting up traffic splitting.
  6. Verify that the new remote endpoints are serving traffic correctly.
  7. If you are using weight-based traffic splitting, configure the new backend service to receive 100% of traffic. This step drains the old backend service.
  8. After you verify that the new backends are serving traffic without any issues, delete the old backend service.
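For the weight-based splitting in step 5, the weights can be expressed in the URL map. The following sketch, which you can import with gcloud compute url-maps import, assumes that the old hybrid connectivity backend service is named on-prem-service-a-hybrid; adjust the names to match your deployment:

    # Sends 95% of traffic to the old backend service and 5% to the new one.
    defaultService: global/backendServices/on-prem-service-a-hybrid
    hostRules:
    - hosts:
      - example.com
      pathMatcher: matcher1
    name: td-url-map
    pathMatchers:
    - defaultRouteAction:
        weightedBackendServices:
        - backendService: global/backendServices/on-prem-service-a-hybrid
          weight: 95
        - backendService: global/backendServices/on-prem-service-a
          weight: 5
      name: matcher1
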

Troubleshooting

To resolve deployment issues, use the instructions in this section. If your issues are not resolved with this information, see Troubleshooting deployments that use Envoy.

An on-premises endpoint is not receiving traffic

If an endpoint is not receiving traffic, make sure that it is passing health checks, and that DNS queries from the Envoy client return its IP address consistently.

Envoy uses strict_dns mode to manage connections. It load balances traffic across all resolved endpoints that are healthy. The order in which endpoints are resolved does not matter in strict_dns mode, but Envoy drains traffic to any endpoint that is no longer present in the list of returned IP addresses.

HTTP host header does not match with FQDN when the request reaches my on-premises server

Consider an example in which the domain abc.com resolves to 10.0.0.1, which is the forwarding rule's IP address, and the domain xyz.com resolves to 10.0.0.100, which is your on-premises service endpoint. You want to send traffic to domain xyz.com, which is configured in your NEG.

It's possible that the application in Compute Engine or GKE sets the HTTP Host header to abc.com (Host: abc.com), which gets carried forward to the on-premises endpoint. If you are using HTTPS, Envoy sets the SNI to xyz.com during the TLS handshake. Envoy obtains the SNI from the client TLS policy resource.

If this conflict is causing issues in processing or routing the request when it reaches the on-premises endpoint, as a workaround, you can rewrite the Host header to xyz.com (Host: xyz.com). This can be done either in Traffic Director by using URL rewrite or on the remote endpoint if it has header rewrite capability.

Another, less complex workaround is to set the Host header to xyz.com (Host: xyz.com) and use the special address 0.0.0.0 as the forwarding rule's IP address.
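If you choose the Traffic Director-side rewrite, it can be expressed as a route action in the URL map. The following sketch rewrites the Host header before Envoy proxies the request; the names other than the rewrite target are examples:

    defaultService: global/backendServices/on-prem-service-a
    name: td-url-map
    hostRules:
    - hosts:
      - abc.com
      pathMatcher: matcher1
    pathMatchers:
    - name: matcher1
      defaultService: global/backendServices/on-prem-service-a
      routeRules:
      - priority: 0
        matchRules:
        - prefixMatch: /
        service: global/backendServices/on-prem-service-a
        routeAction:
          urlRewrite:
            hostRewrite: xyz.com
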

Envoy returns many 5xx errors

If Envoy returns an excessive number of 5xx errors, do the following:

  • Check the Envoy logs to determine whether the response originates from the on-premises backend and to identify the reason for the 5xx error.
  • Make sure that DNS queries are successful, and there are no SERVFAIL or NXDOMAIN errors.
  • Make sure that all the remote endpoints are passing health checks.
  • If health checks are not configured, make sure that all endpoints are reachable from Envoy. Check your firewall rules and routes on the Google Cloud side as well as on the on-premises side.
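If you have shell access to the VM that runs Envoy, the Envoy admin interface can help with these checks. This sketch assumes that the admin interface listens on localhost port 15000, which can differ in your deployment:

    # Per-cluster endpoint status, including health and the IP
    # addresses produced by strict_dns resolution.
    curl -s localhost:15000/clusters
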

Cannot reach external services over the public internet from the service mesh

You can send traffic to services located on the public internet by using FQDN backends in Traffic Director. You must first establish internet connectivity between Envoy clients and the external service. If you are getting a 502 error during connections to the external service, do the following:

  • Make sure that you have the correct routes, specifically the default route 0.0.0.0/0, and firewall rules configured.
  • Make sure that DNS queries are successful, and there are no SERVFAIL or NXDOMAIN errors.
  • If the Envoy proxy is running on a Compute Engine VM that doesn't have an external IP address or in a private GKE cluster, you need to configure Cloud NAT or another means to establish outbound internet connectivity.
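If outbound internet connectivity is the missing piece, a minimal Cloud NAT configuration for the default network looks like the following; the router name, NAT name, and region are examples:

    gcloud compute routers create nat-router \
        --network=default \
        --region=us-central1

    gcloud compute routers nats create nat-config \
        --router=nat-router \
        --region=us-central1 \
        --auto-allocate-nat-external-ips \
        --nat-all-subnet-ip-ranges
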

If the errors persist, or if you are getting other 5xx errors, check the Envoy logs to narrow down the source of the errors.

Cannot reach serverless services from the service mesh

You can send traffic to serverless (Cloud Run, Cloud Functions, and App Engine) services by using FQDN backends in Traffic Director. If the Envoy proxy is running on a Compute Engine VM that doesn't have an external IP address or in a private GKE cluster, you must configure Private Google Access on the subnet to be able to access Google APIs and services.
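Private Google Access is enabled for each subnet. The following command assumes that the Envoy clients run in the default subnet in us-central1:

    gcloud compute networks subnets update default \
        --region=us-central1 \
        --enable-private-ip-google-access
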

What's next