Troubleshooting Internal TCP/UDP Load Balancing

This guide describes how to troubleshoot configuration issues for a Google Cloud internal TCP/UDP load balancer.

Overview

The types of issues discussed in this guide include the following:

  • Load balancer setup issues
  • Connectivity issues
  • Performance issues
  • Backend failover issues
  • Load balancer as next-hop issues

Before you begin

Before investigating issues, familiarize yourself with the documentation pages that cover general connectivity, failover, and using a load balancer as a next hop.

Backends have incompatible balancing modes

When creating a load balancer, you might see the error:

Validation failed for instance group INSTANCE_GROUP:

backend services 1 and 2 point to the same instance group
but the backends have incompatible balancing_mode. Values should be the same.

This error occurs when you use the same instance group as a backend in two different load balancers, and the two backend services don't specify compatible balancing modes.

For more information, see the following:

Troubleshooting general connectivity issues

If you can't connect to your internal TCP/UDP load balancer, check for the following common issues.

Verify firewall rules

  • Ensure that ingress allow firewall rules are defined to permit health checks to backend VMs.
  • Ensure that ingress allow firewall rules allow traffic to the backend VMs from clients.
  • Ensure that relevant firewall rules exist to allow traffic to reach the backend VMs on the ports being used by the load balancer.
  • If you're using target tags for the firewall rules, make sure that the load balancer's backend VMs are tagged appropriately.

To learn how to configure firewall rules required by your internal TCP/UDP load balancer, see Configuring firewall rules.
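For example, an ingress rule like the following sketch permits Google Cloud health check probes, which originate from the documented ranges 130.211.0.0/22 and 35.191.0.0/16. The rule name, network, port, and target tag are placeholders for your own values:

```shell
# Allow health check probes from Google Cloud's documented probe ranges.
# NETWORK_NAME, the port, and the target tag are placeholders.
gcloud compute firewall-rules create allow-health-checks \
    --network=NETWORK_NAME \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:80 \
    --source-ranges=130.211.0.0/22,35.191.0.0/16 \
    --target-tags=allow-health-checks
```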

Verify that the Guest environment is running on the backend VM

If you can connect to a healthy backend VM, but cannot connect to the load balancer, it might be that the Guest environment (formerly, the Windows Guest Environment or Linux Guest Environment) on the VM is either not running or is unable to communicate with the metadata server (metadata.google.internal, 169.254.169.254).

Check for the following:

  • Ensure that the Guest environment is installed and running on the backend VM.
  • Ensure that the firewall rules within the guest operating system of the backend VM (iptables or Windows Firewall) don't block access to the metadata server.
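From inside the backend VM, you can check whether the metadata server is reachable. This diagnostic sketch only works when run on a Compute Engine VM:

```shell
# Query the metadata server from inside the VM; a working Guest environment
# needs this endpoint to be reachable.
curl -H "Metadata-Flavor: Google" \
    "http://metadata.google.internal/computeMetadata/v1/instance/name"

# On Linux, check whether guest OS firewall rules mention the metadata
# server address and could be blocking it.
sudo iptables -L OUTPUT -n | grep 169.254.169.254
```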

Verify that backend VMs accept packets sent to the load balancer

Each backend VM must be configured to accept packets sent to the load balancer. That is, the destination of packets delivered to the backend VMs is the IP address of the load balancer. Under most circumstances, this is done with a local route.

For VMs created from Google Cloud images, the Guest agent installs the local route for the load balancer's IP address. Google Kubernetes Engine instances based on Container-Optimized OS implement this by using iptables instead.

On a Linux backend VM, you can verify the presence of the local route by running the following command. Replace LOAD_BALANCER_IP with the load balancer's IP address:

sudo ip route list table local | grep LOAD_BALANCER_IP
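If the route is missing, the Guest agent normally installs it automatically. As a temporary diagnostic step, you can add it manually with a command like the following sketch; the interface name eth0 is an assumption for your VM:

```shell
# Manually install the local route that the Guest agent would normally
# create. Replace LOAD_BALANCER_IP with the load balancer's IP address;
# proto 66 marks the route the way the Guest agent does.
sudo ip route add to local LOAD_BALANCER_IP/32 dev eth0 proto 66
```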

Verify service IP address and port binding on the backend VMs

Packets sent to an internal TCP/UDP load balancer arrive at backend VMs with the destination IP address of the load balancer itself. This type of load balancer is not a proxy, and this is expected behavior.

The software running on the backend VM must be doing the following:

  • Listening on (bound to) the load balancer's IP address or any IP address (0.0.0.0 or ::)
  • Listening on (bound to) a port that's included in the load balancer's forwarding rule

To test this, connect to a backend VM using either SSH or RDP. Then perform the following tests using either curl, telnet, or a similar tool:

  • Attempt to reach the service by contacting it using the internal IP address of the backend VM itself, 127.0.0.1, or localhost.
  • Attempt to reach the service by contacting it using the IP address of the load balancer's forwarding rule.
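For example, with curl. BACKEND_VM_IP, FORWARDING_RULE_IP, and PORT are placeholders for your own values:

```shell
# Test 1: contact the service on the backend VM's own addresses.
curl http://localhost:PORT/
curl http://BACKEND_VM_IP:PORT/

# Test 2: contact the service on the forwarding rule's IP address.
# If test 1 succeeds but test 2 fails from the VM itself, the software is
# likely bound to a specific IP address rather than 0.0.0.0.
curl http://FORWARDING_RULE_IP:PORT/
```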

Check if the client VM is in the same region as the load balancer

If the client connecting to the load balancer is in another region, make sure that global access is enabled.

Verify that health check traffic can reach backend VMs

To verify that health check traffic reaches your backend VMs, enable health check logging and search for successful log entries.

You can also verify that load balancer functionality is healthy by viewing the "Healthy" state for the backend.

If there are no healthy instances in the backend, make sure that the appropriate health check is configured and each VM in the backend is listening on the configured health check ports.

From a client in the same VPC network, run the following command to verify that the backend VM is listening on a specific TCP port:

telnet SERVER_IP_ADDRESS PORT

Replace the following:

  • SERVER_IP_ADDRESS: The IP address of the backend VM.
  • PORT: The port that you configured for your health check. By default, the health check port is 80.

Alternatively, you can use SSH to connect to the backend VM and run the following command:

curl localhost:PORT

Again, replace PORT with the port that you configured for your health check.

Another way to perform this test is to run the following command:

netstat -npal | grep LISTEN

In the output, check for the following:

  • LB_IP_ADDRESS:PORT
  • 0.0.0.0:PORT
  • :::PORT

This does not determine whether routing is set up correctly to respond to the load balancer's IP address. That's a separate problem with a similar symptom. For routing, run the ip route list table local command and verify that the load balancer's IP address is listed, as described in Verify that backend VMs accept packets sent to the load balancer.

Troubleshooting performance issues

If you are noticing performance issues and increased latency, check for the following common issues.

Verify server functionality

If all of the backend servers are responding to health checks, verify that requests from the client are working properly when issued on the server directly. For example, if the client is sending HTTP requests to the server through the load balancer and there is no response or the response is substantially slower than normal, issue the same HTTP request on each of the backend servers and observe the behavior.

If any of the individual backend servers are not behaving correctly when the request is issued from within the server itself, you can conclude that the server application stack is not working properly. You can focus further troubleshooting on the application itself. If all of the servers are behaving correctly, the next step is to look at the client side and the network.

Verify network connectivity and latency

If all of the backend servers are responding to requests properly, verify network latency. From a client VM, run a continuous ping to each of the servers, as follows:

ping SERVER_IP_ADDRESS

This test shows the built-in network latency and whether the network is dropping packets. In some cases, firewall rules might be blocking ICMP traffic. If so, this test fails to produce any result. Verify with your firewall rules administrator whether this is the case.

If the ping command shows significantly higher latency than normal or significant packet loss, open a Google Cloud support case to investigate further.

Identify problematic client-server combinations

If the network ping test suggests low latency and no packet loss, the next step is to identify which specific client-server combination, if any, produces problematic results. You can do this by reducing the number of backend servers by half until the number of servers reaches 1, while simultaneously reproducing the problematic behavior (for example, high latency or no responses).
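The halving procedure can be sketched as follows. This toy script only illustrates the search; reproduce_issue is a hypothetical stand-in for re-running your problematic request against the reduced backend set:

```shell
#!/usr/bin/env bash
# Toy stand-in: pretend the issue reproduces whenever vm-3 is in the set.
# In practice, you would shrink the instance group and re-test the client.
reproduce_issue() { [[ " $* " == *" vm-3 "* ]]; }

backends=(vm-1 vm-2 vm-3 vm-4 vm-5 vm-6 vm-7 vm-8)
while (( ${#backends[@]} > 1 )); do
  mid=$(( ${#backends[@]} / 2 ))
  first_half=("${backends[@]:0:mid}")
  if reproduce_issue "${first_half[@]}"; then
    backends=("${first_half[@]}")      # issue is in the first half
  else
    backends=("${backends[@]:mid}")    # issue is in the second half
  fi
done
echo "Problematic backend: ${backends[0]}"
```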

If you identify one or more problematic client-server combinations, perform traffic capture and analysis.

If no problematic client-server combination is identified, skip to performance testing.

Perform traffic capture and analysis

If you identify a specific problematic client-server combination, you can use packet capture to pinpoint the part of the communication that causes the delay or breakage. You can capture packets with tcpdump as follows:

  1. Install tcpdump on the server.
  2. Start tcpdump capture on the server.
  3. From a client, issue a sample request, such as the following:

    curl URL
    
  4. Analyze the tcpdump output to identify the problem.
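A capture filtered to the problematic client might look like the following sketch; CLIENT_IP and the port are placeholders for your own values:

```shell
# On the server: capture traffic from the problematic client to a file.
sudo tcpdump -i any -w /tmp/capture.pcap host CLIENT_IP and port 80

# Afterwards, inspect the capture (or open it in Wireshark).
sudo tcpdump -nn -r /tmp/capture.pcap | head
```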

Do performance testing

If you don't identify any problematic client-server combinations and aggregate performance of all clients and servers together is lower than expected, consider the following tests:

  1. One client and one server, without load balancing.
  2. One client and one server, with load balancing.

    Result: The combination of results from tests [1] and [2] identifies whether the load balancer is causing the issue.

  3. One client and multiple servers, with load balancing.

    Result: Identify the performance limit of one client.

  4. Multiple clients and one server, with load balancing.

    Result: Identify the performance limit of one server.

  5. Multiple clients and multiple servers, without load balancing.

    Result: Identify the performance limit of the network.

When running a stress test with multiple clients and servers, client or server resources (CPU, memory, I/O) might become bottlenecks and reduce the aggregate results. Degraded aggregate results can happen even if each client and server is behaving correctly.

Troubleshooting Shared VPC issues

If you are using Shared VPC and you cannot create a new internal TCP/UDP load balancer in a particular subnet, an organization policy might be the cause. In the organization policy, add the subnet to the list of allowed subnets or contact your organization administrator. For more information, refer to the constraints/compute.restrictSharedVpcSubnetworks constraint.
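To inspect the current policy, you can run a command like the following sketch. ORG_ID is a placeholder, and this assumes you have permission to read the organization policy:

```shell
# Show the current list of allowed Shared VPC subnets for the constraint.
gcloud resource-manager org-policies describe \
    compute.restrictSharedVpcSubnetworks \
    --organization=ORG_ID
```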

Troubleshooting failover issues

If you've configured failover for an internal TCP/UDP load balancer, the following sections describe the issues that can occur.

Connectivity

  • Make sure that you've designated at least one failover backend.
  • Verify your failover policy settings:
    • Failover ratio
    • Dropping traffic when all backend VMs are unhealthy
    • Disabling connection draining on failover

Issues with managed instance groups and failover

  • Symptom: The active pool is changing back and forth (flapping) between the primary and failover backends.
  • Possible reason: Using managed instance groups with autoscaling and failover might cause the active pool to repeatedly fail over and fail back between the primary and failover backends. Google Cloud doesn't prevent you from configuring failover with managed instance groups, because your deployment might benefit from this setup.

Disable connection draining restriction for failover groups

Disabling connection draining only works if the backend service is set up with protocol TCP.

The following error message appears if you create a backend service with UDP while connection draining is disabled:

gcloud compute backend-services create my-failover-bs \
    --global-health-checks \
    --load-balancing-scheme=internal \
    --health-checks=my-tcp-health-check \
    --region=us-central1 \
    --no-connection-drain-on-failover \
    --drop-traffic-if-unhealthy \
    --failover-ratio=0.5 \
    --protocol=UDP
ERROR: (gcloud.compute.backend-services.create) Invalid value for
[--protocol]: can only specify --connection-drain-on-failover if the protocol is
TCP.
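To avoid the error, either keep connection draining enabled for UDP, or use TCP when you need to disable draining on failover. For example, the same sketch command succeeds with the protocol changed to TCP:

```shell
# Same backend service as above, but with --protocol=TCP so that
# --no-connection-drain-on-failover is accepted.
gcloud compute backend-services create my-failover-bs \
    --global-health-checks \
    --load-balancing-scheme=internal \
    --health-checks=my-tcp-health-check \
    --region=us-central1 \
    --no-connection-drain-on-failover \
    --drop-traffic-if-unhealthy \
    --failover-ratio=0.5 \
    --protocol=TCP
```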

Traffic is sent to unexpected backend VMs

First, check whether the client VM is also a backend VM of the load balancer. If it is, it's expected behavior that connections sent to the IP address of the load balancer's forwarding rule are always answered by the backend VM itself. For more information, refer to testing connections from a single client and sending requests from load-balanced VMs.

If the client VM is not a backend VM of the load balancer:

  • For requests from a single client, refer to testing connections from a single client so that you understand the limitations of this method.

  • Ensure that you have configured ingress allow firewall rules to allow health checks.

  • For a failover configuration, make sure that you understand how membership in the active pool works, and when Google Cloud performs failover and failback. Inspect your load balancer's configuration:

    • Use the Cloud console to check for the number of healthy backend VMs in each backend instance group. The Cloud console also shows you which VMs are in the active pool.

    • Make sure that your load balancer's failover ratio is set appropriately. For example, if you have ten primary VMs and a failover ratio set to 0.2, this means Google Cloud performs a failover when fewer than two (10 × 0.2 = 2) primary VMs are healthy. A failover ratio of 0.0 has a special meaning: Google Cloud performs a failover when no primary VMs are healthy.
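The threshold arithmetic can be checked as follows. This is a sketch; remember that a ratio of 0.0 is the special case where failover happens only when no primary VMs are healthy:

```shell
# Failover is triggered when healthy primary VMs < TOTAL * RATIO.
TOTAL=10
RATIO=0.2
THRESHOLD=$(awk -v t="$TOTAL" -v r="$RATIO" 'BEGIN { print t * r }')
echo "Failover when fewer than $THRESHOLD of $TOTAL primary VMs are healthy"
```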

Existing connections are terminated during failover or failback

Edit your backend service's failover policy. Ensure that connection draining on failover is enabled.

Troubleshooting load balancer as next hop

When you set an internal TCP/UDP load balancer to be a next hop of a custom static route, the following issues might occur:

Connectivity

  • If you cannot ping an IP address in the destination range of a route whose next hop is a forwarding rule for an internal TCP/UDP load balancer, note that a route using this type of next hop might not process ICMP traffic depending on when the route was created. If the route was created before May 15, 2021, it only processes TCP and UDP traffic until August 16, 2021. Starting August 16, 2021, all routes will automatically forward all protocol traffic (TCP, UDP, and ICMP) regardless of when a route was created. If you don't want to wait until then, you can enable ping functionality now by creating new routes and deleting the old ones.

  • When using an internal TCP/UDP load balancer as a next hop for a custom static route, all traffic is delivered to the load balancer's healthy backend VMs, regardless of the protocol configured for the load balancer's internal backend service, and regardless of the port or ports configured on the load balancer's internal forwarding rule.

  • Ensure that you have created ingress allow firewall rules that correctly identify sources of traffic that should be delivered to backend VMs via the custom static route's next hop. Packets that arrive on backend VMs preserve their source IP addresses, even when delivered by way of a custom static route.

Invalid value for destination range

The destination range of a custom static route can't be more specific than any subnet route in your VPC network. You might receive the following error message when creating a custom static route:

Invalid value for field 'resource.destRange': [ROUTE_DESTINATION].
[ROUTE_DESTINATION] hides the address space of the network .... Cannot change
the routing of packets destined for the network.
  • You cannot create a custom static route with a destination that exactly matches or is more specific (with a longer mask) than a subnet route. Refer to applicability and order for further information.

  • If packets go to an unexpected destination, remove other routes in your VPC network with more specific destinations. Review the routing order to understand Google Cloud route selection.
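For example, if your VPC network has a subnet with primary range 10.1.2.0/24 (all names and ranges here are hypothetical), a route whose destination falls inside that range is rejected, while a non-overlapping destination is accepted:

```shell
# Fails: 10.1.2.0/28 is more specific than the 10.1.2.0/24 subnet route.
gcloud compute routes create bad-route \
    --network=NETWORK_NAME \
    --destination-range=10.1.2.0/28 \
    --next-hop-ilb=FORWARDING_RULE_NAME \
    --next-hop-ilb-region=us-central1

# Works: 10.100.0.0/16 does not hide any subnet route in this example.
gcloud compute routes create good-route \
    --network=NETWORK_NAME \
    --destination-range=10.100.0.0/16 \
    --next-hop-ilb=FORWARDING_RULE_NAME \
    --next-hop-ilb-region=us-central1
```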

What's next