Optimizing application latency with load balancing

This document discusses load balancing options and shows how your choice of a specific load balancer on Google Cloud affects end-to-end latency.

Options for load balancing

Depending on the type of traffic sent to your application, you have several options for external load balancing. The following table summarizes your options:

Option Description Traffic flow Scope
HTTP(S) Load Balancing Supports HTTP(S) traffic and advanced features, such as URL mapping and SSL offloading
TCP Proxy Load Balancing or SSL Proxy Load Balancing for non-HTTP traffic on specific ports.
The TCP or SSL (TLS) session is terminated at Google Front Ends (GFEs), at Google's network edge, and traffic is proxied to the backends. Global
Network Load Balancing Allows TCP/UDP traffic using any port to pass through the load balancer. Delivered using Google's Maglev technology to distribute the traffic to the backends. Regional

Because the internal load balancers and Traffic Director don't support user-facing traffic, they are out of scope for this article.

This article's measurements use the Premium Tier in Network Service Tiers because global load balancing requires this service tier.

Measuring latency

When accessing a website that is hosted in us-central1, a user in Germany uses the following methods to test latency:

  • Ping: While ICMP ping is a common way to measure server reachability, ICMP ping doesn't measure end-user latency. For more information, see Additional latency effects of HTTP(S) Load Balancing.
  • Curl: Curl measures Time To First Byte (TTFB). Issue a curl command repeatedly to the server.

When comparing results, be aware that latency on fiber links is constrained by the distance and the speed of light in fiber, which is roughly 200,000 km per second (or 124,724 miles per second).

The distance between Frankfurt, Germany and Council Bluffs, Iowa (the us-central1 region), is roughly 7,500 km. With straight fiber between the locations, round-trip latency is the following:

7,500 km * 2 / 200,000 km/s * 1000 ms/s = 75 milliseconds (ms)

Fiber optic cable doesn't follow a straight path between the user and the data center. Light on the fiber cable passes through active and passive equipment along its path. An observed latency of approximately 1.5 times the ideal, or 112.5 ms, indicates a near-ideal configuration.

Comparing latency

This section compares load balancing in the following configurations:

  • No load balancing
  • Network Load Balancing
  • HTTP(S) Load Balancing or TCP Proxy Load Balancing

In this scenario, the application consists of a regional managed instance group of HTTP web servers. Because the application relies on low-latency calls to a central database, the web servers must be hosted in one location. The application is deployed in the us-central1 region, and users are distributed across the globe. The latency that the user in Germany observes in this scenario illustrates what users worldwide might experience.

Latency scenario diagram (click to enlarge)
Latency scenario diagram (click to enlarge)

No load balancing

When a user makes an HTTP request, unless load balancing is configured, the traffic flows directly from the user’s network to the virtual machine (VM) hosted on Compute Engine. For Premium Tier, traffic then enters Google's network at an edge point of presence (PoP) close to the user's location. For Standard Tier, the user traffic enters Google's network at a PoP close to the destination region. For more information, see the Network Service Tiers documentation.

Architecture with no load balancing (click to enlarge)
Architecture with no load balancing (click to enlarge)

The following table shows the results when the user in Germany tested latency of a system with no load balancing:

Method Result Minimum latency
Ping the VM IP address (Response is directly from web server)
  ping -c 5 compute-engine-vm
  
  PING compute-engine-vm (xxx.xxx.xxx.xxx) 56(84) bytes of data.
  64 bytes from compute-engine-vm (xxx.xxx.xxx.xxx): icmp_seq=1 ttl=56 time=111 ms
  64 bytes from compute-engine-vm (xxx.xxx.xxx.xxx): icmp_seq=2 ttl=56 time=110 ms
  [...]
  --- compute-engine-vm ping statistics ---
  5 packets transmitted, 5 received, 0% packet loss, time 4004ms
  rtt min/avg/max/mdev = 110.818/110.944/111.265/0.451 ms
  
110 ms
TTFB
  for ((i=0; i < 500; i++)); do curl -w  /
  "%{time_starttransfer}\n" -o /dev/null -s compute-engine-vm; done
  
  0.230
  0.230
  0.231
  0.231
  0.230
  [...]
  0.232
  0.231
  0.231
  
230 ms

The TTFB latency is stable, as shown in the following graph of the first 500 requests:

Latency to VM in ms graph (click to enlarge)
Latency to VM in ms graph (click to enlarge)

When pinging the VM IP address, the response comes directly from the web server. The response time from the web server is minimal compared to the network latency (TTFB). This is because a new TCP connection is opened for every HTTP request. An initial three-way handshake is needed before the HTTP response is sent, as shown in the following diagram. Therefore, the observed latency is close to double the ping latency.

Client-server HTTP request diagram (click to enlarge)
Client-server HTTP request diagram (click to enlarge)

Network Load Balancing

With a network load balancer, user requests still enter the Google network at the closest edge PoP (in Premium Tier). In the region where the project's VMs are located, traffic flows first through a network load balancer. It is then forwarded without changes to the target backend VM. The network load balancer distributes traffic based on a stable hashing algorithm. The algorithm uses a combination of source and destination port, IP address, and protocol. The VMs listen to the load balancer IP and accept the traffic unaltered.

Architecture with Network Load Balancing (click to enlarge)
Architecture with Network Load Balancing (click to enlarge)

The following table shows the results when the user in Germany tested latency for the network-load-balancing option.

Method Result Minimum latency
Ping the network load balancer
  ping -c 5 net-lb
  
  PING net-lb (xxx.xxx.xxx.xxx) 56(84) bytes of data.
  64 bytes from net-lb (xxx.xxx.xxx.xxx): icmp_seq=1 ttl=44 time=110 ms
  64 bytes from net-lb (xxx.xxx.xxx.xxx): icmp_seq=2 ttl=44 time=110 ms
  [...]
  64 bytes from net-lb (xxx.xxx.xxx.xxx): icmp_seq=5 ttl=44 time=110 ms
  --- net-lb ping statistics ---
  5 packets transmitted, 5 received, 0% packet loss, time 4007ms
  rtt min/avg/max/mdev = 110.658/110.705/110.756/0.299 ms
  
110 ms
TTFB
 for ((i=0; i < 500; i++)); do curl -w /
    "%{time_starttransfer}\n" -o /dev/null -s net-lb
 
 0.231
 0.232
 0.230
 0.230
 0.232
 [...]
 0.232
 0.231
 
230 ms

Because load balancing takes place within a region and traffic is only forwarded, there is no significant latency impact compared with having no load balancer.

External load balancing

With HTTP(S) Load Balancing, GFEs proxy traffic. These GFEs are at the edge of Google's global network. The GFE terminates the TCP session and connects to a backend in the closest region that can serve the traffic.

HTTP(S) Load Balancing scenario diagram (click to enlarge)
HTTP(S) Load Balancing scenario diagram (click to enlarge)

The following table shows the results when the user in Germany tested latency for the HTTP-load-balancing option.

Method Result Minimum latency
Ping the HTTP(S) Load Balancing
 ping -c 5 http-lb
 
 PING http-lb (xxx.xxx.xxx.xxx) 56(84) bytes of data.
 64 bytes from http-lb (xxx.xxx.xxx.xxx): icmp_seq=1 ttl=56 time=1.22 ms
 64 bytes from http-lb (xxx.xxx.xxx.xxx): icmp_seq=2 ttl=56 time=1.20 ms
 64 bytes from http-lb (xxx.xxx.xxx.xxx): icmp_seq=3 ttl=56 time=1.16 ms
 64 bytes from http-lb (xxx.xxx.xxx.xxx): icmp_seq=4 ttl=56 time=1.17 ms
 64 bytes from http-lb (xxx.xxx.xxx.xxx): icmp_seq=5 ttl=56 time=1.20 ms
 --- http-lb ping statistics ---
 5 packets transmitted, 5 received, 0% packet loss, time 4005ms
 rtt min/avg/max/mdev = 1.163/1.195/1.229/0.039 ms
 
1 ms
TTFB
 for ((i=0; i < 500; i++)); do curl -w /
    "%{time_starttransfer}\n" -o /dev/null -s http-lb; done
 
 0.309
 0.230
 0.229
 0.233
 0.230
 [...]
 0.123
 0.124
 0.126
 
123 ms

The results for HTTP(S) Load Balancing are significantly different. When pinging the HTTP(S) Load Balancing, the round-trip latency is slightly over 1 ms. This result represents latency to the closest GFE, which is located in the same city as the user. This result doesn't reflect the actual latency that the user experiences when trying to access the application that is hosted in the us-central1 region. Experiments using protocols (ICMP) that differ from your application communication protocol (HTTP) can be misleading.

When measuring TTFB, the initial requests show similar response latency. Some requests achieve the lower minimum latency of 123 ms, as shown in the following graph:

Latency to HTTP(S) Load Balancing in ms graph (click to enlarge)
Latency to HTTP(S) Load Balancing in ms graph (click to enlarge)

Two round trips between the client and the VM take more than 123 ms even with straight fiber. The latency is lower because GFEs proxy the traffic. GFEs maintain persistent connections to the backend VMs. Therefore, only the first request from a specific GFE to a specific backend requires a three-way handshake.

Initial HTTP request through GFE diagram (click to enlarge)
Initial HTTP request through GFE diagram (click to enlarge)

Each location has multiple GFEs. The latency graph shows multiple, fluctuating spikes the first time that traffic reaches each GFE-backend pair. The GFE must then establish a new connection to that backend. These spikes reflect differing request hashes. Subsequent requests show lower latency.

Subsequent HTTP request through GFE diagram (click to enlarge)
Subsequent HTTP request through GFE diagram (click to enlarge)

These scenarios demonstrate the reduced latency that users can experience in a production environment. The following table summarizes the results:

Option Ping TTFB
No load balancing 110 ms to the web server 230 ms
Network Load Balancing 110 ms to the in-region network load balancer 230 ms
HTTP(S) Load Balancing 1 ms to the closest GFE 123 ms

When a healthy application is serving users in a specific region, GFEs in that region maintain a persistent connection open to all serving backends. Because of this, users in that region notice reduced latency on their first HTTP request if users are far from the application backend. If users are near the application backend, the users don't notice latency improvement.

For subsequent requests, such as clicking a page link, there is no latency improvement because modern browsers maintain a persistent connection to the service. This differs from a curl command issued from the command line.

Additional latency effects of HTTP(S) Load Balancing

Additional observable effects with HTTP(S) Load Balancing depend on traffic patterns.

  • HTTP(S) Load Balancing has less latency for complex assets than Network Load Balancing because fewer round trips are needed before a response completes. For example, when the user in Germany measures latency over the same connection by repeatedly downloading a 10 MB file, the average latency for Network Load Balancing is 1911 ms compared to 1341 ms with HTTP(S) Load Balancing. This saves approximately 5 round trips per request. Persistent connections between GFEs and serving backends reduce the effects of TCP Slow Start.

  • HTTP(S) Load Balancing significantly reduces the additional latency for a TLS handshake (typically 1-2 extra roundtrips) because HTTP(S) Load Balancing uses SSL offloading, and only the latency to the edge PoP is relevant. For the user in Germany, the minimum observed latency is 201 ms using HTTP(S) Load Balancing, versus 525 ms using HTTP(S) through the Network Load Balancing.

  • The HTTP(S) Load Balancing allows an automatic upgrade of the user-facing session to HTTP/2, which can reduce the number of packets needed, by using improvements in binary protocol, header compression, and connection multiplexing. This can reduce the observed latency even more than that observed by switching to HTTP(S) Load Balancing. HTTP/2 is supported with current browsers that use SSL/TLS. For the user in Germany, minimum latency decreased further from 201 ms to 145 ms when using HTTP/2 instead of HTTPS.

Optimizing HTTP(S) Load Balancing

You can optimize latency for your application by using the external HTTP(S) load balancer as follows:

  • If some of the traffic you serve is cacheable, you can integrate with Cloud CDN. Cloud CDN reduces latency by serving assets directly at Google's network edge. Cloud CDN also makes use of the TCP and HTTP optimizations from HTTP/2 mentioned in the Additional latency effects of HTTP(S) Load Balancing section.

  • You can use any CDN partner with Google Cloud. By using one of Google's CDN Interconnect partners, you benefit from discounted egress costs.

  • If content is static, you can reduce the load on the web servers by serving content directly from Cloud Storage through the HTTP(S) Load Balancing. This option combines with the Cloud CDN.

  • Deploying your web servers in multiple regions close to your users can reduce latency because HTTP(S) Load Balancing, SSL Proxy Load Balancing, and TCP Proxy Load Balancing automatically direct users to the closest region. However, if your application is partly centralized, design it to decrease the number of inter-regional round trips.

  • To reduce latency inside your applications, examine any remote procedure calls (RPCs) that communicate between VMs. This latency typically occurs when applications communicate between tiers or services. Tools such as Cloud Trace can help you decrease latency caused by application-serving requests.

  • Because TCP Proxy Load Balancing and SSL Proxy Load Balancing are based on GFEs, the effect on latency is the same as observed with HTTP(S) Load Balancing. Because HTTP(S) Load Balancing has more features than TCP Proxy Load Balancing and SSL Proxy Load Balancing, we recommend using HTTP(S) Load Balancing for HTTP(S) traffic.

Next steps

We recommend that you deploy your application close to the majority of your users. For more information about the different load balancing options in Google Cloud, see the following documents: