Optimizing Application Latency with Load Balancing

This document discusses load balancing options and shows how your choice of a specific load balancer on Google Cloud Platform (GCP) affects end-to-end latency.

Options for load balancing

Depending on the type of traffic sent to your application, you have several options for load balancing. The following table summarizes your options:

Option | Description | Traffic flow | Scope
HTTP, HTTPS, TCP, and SSL load balancing | Supports HTTP(S) traffic and advanced features, such as URL mapping and SSL offloading. Supports TCP proxy or SSL proxy for non-HTTP traffic on specific ports. | TCP or SSL (TLS) session is terminated on Google Front Ends (GFEs) at Google's network edge and traffic is proxied to the backends. | Global
Network load balancing | Allows any TCP/UDP traffic through any port to pass through the load balancer. | Delivered using Google's Maglev technology to distribute the traffic to the backends. | Regional

Because the internal load balancer does not support user-facing traffic, it is out of scope for this article.

Measuring latency

When accessing a website hosted in us-central1, a user in Germany used the following methods to test latency:

  • Ping: While this is a common way to measure server reachability, ICMP ping does not give a good indication of end-user latency. You can read an explanation in the Additional latency effects of HTTP(S) load balancing section.
  • Time To First Byte (TTFB): A good way to measure time to first HTTP response is to issue a curl command repeatedly to the server to get a response from the web server.
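
The TTFB method can also be sketched in Python. This is a minimal illustration, not the measurement tool used in this article: the function names `measure_ttfb` and `min_ttfb_ms` are hypothetical, and the timer stops when the response headers arrive. The curl commands in this article print `%{time_total}` instead, which also includes the time to transfer the body. Like repeated curl invocations, each call opens a new TCP connection.

```python
import http.client
import time


def measure_ttfb(host, port=80, path="/", timeout=10.0):
    """Return seconds from opening a new TCP connection until the
    response status line and headers have been received."""
    start = time.perf_counter()
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        # request() connects (TCP three-way handshake) and sends the GET.
        conn.request("GET", path)
        # getresponse() returns once the status line and headers are read.
        response = conn.getresponse()
        elapsed = time.perf_counter() - start
        # Drain the body so the connection can be closed cleanly.
        response.read()
    finally:
        conn.close()
    return elapsed


def min_ttfb_ms(host, samples=5, **kwargs):
    """Repeat the measurement and return the minimum, in milliseconds."""
    return min(measure_ttfb(host, **kwargs) for _ in range(samples)) * 1000.0
```

For example, `min_ttfb_ms("gce-vm", samples=500)` would mirror the 500-request loops shown later, assuming `gce-vm` resolves to the server under test.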

When comparing results, be aware that latency on fiber links is constrained mainly by the distance and the speed of light in fiber, which is roughly 200,000 km/s (or 124,274 miles/s).

The distance between Frankfurt, Germany, and Council Bluffs, Iowa, which is the location of the us-central1 region, is roughly 7,500 km. With perfectly straight fiber between the locations, round-trip latency would be:

7,500 km * 2 / 200,000 km/s * 1000 ms/s = 75 milliseconds (ms)

In reality, fiber optic cable doesn't follow an ideal path between the user and the data center, and light on the fiber cable passes through active and passive equipment along its path. An observed latency of approximately 1.5 times the ideal, or 112.5 ms, would indicate a near-ideal configuration.
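
The arithmetic above can be reproduced with a short sketch. The function names are hypothetical, and the 1.5 path factor is simply the rule of thumb quoted in the text, not a measured constant.

```python
FIBER_SPEED_KM_PER_S = 200_000  # approximate speed of light in fiber (from the text)


def ideal_rtt_ms(distance_km, speed_km_per_s=FIBER_SPEED_KM_PER_S):
    """Round-trip time over perfectly straight fiber, in milliseconds."""
    return distance_km * 2 * 1000 / speed_km_per_s


def near_ideal_rtt_ms(distance_km, path_factor=1.5):
    """Round-trip time after applying the ~1.5x real-world path factor."""
    return ideal_rtt_ms(distance_km) * path_factor


# Frankfurt to Council Bluffs, roughly 7,500 km
print(ideal_rtt_ms(7500))       # 75.0
print(near_ideal_rtt_ms(7500))  # 112.5
```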

Comparing latency

This section compares load balancing in the following configurations:

  • No load balancing
  • Network load balancing
  • HTTP load balancing or TCP Proxy

In this scenario, the application consists of a regional managed instance group of HTTP web servers. Because the application relies on low-latency calls to a central database, the web servers must be hosted in one location. The application is deployed in the us-central1 region, and users are distributed across the globe. The latency that the user in Germany observes in this scenario illustrates what users worldwide might experience.

Latency scenario diagram

No load balancing

When a user makes an HTTP request without load balancing, the traffic flows directly from the user's network to the virtual machine (VM) hosted on Google Compute Engine. The traffic enters Google's network at an edge point of presence (POP) close to the user's location.

Architecture with no load balancing

The following table shows the results when the user in Germany tested latency of a system with no load balancing:

Method: Ping the VM IP address (the response comes directly from the web server)
Result:
    [user@germany ~]$ ping -c 5 gce-vm
    PING gce-vm (xxx.xxx.xxx.xxx) 56(84) bytes of data.
    64 bytes from gce-vm (xxx.xxx.xxx.xxx): icmp_seq=1 ttl=56 time=111 ms
    64 bytes from gce-vm (xxx.xxx.xxx.xxx): icmp_seq=2 ttl=56 time=110 ms
    [...]
    --- gce-vm ping statistics ---
    5 packets transmitted, 5 received, 0% packet loss, time 4004ms
    rtt min/avg/max/mdev = 110.818/110.944/111.265/0.451 ms
Minimum latency: 110 ms

Method: TTFB
Result:
    [user@germany ~]$ for ((i=0;i<500;i++)); do curl -w \
        "%{time_total}\n" -o /dev/null -s gce-vm; done
    0.230
    0.230
    0.231
    0.231
    0.230
    [...]
    0.232
    0.231
    0.231
Minimum latency: 230 ms

The TTFB latency is very stable, as shown in the following graph of the first 500 requests:

Latency to VM in ms graph

When pinging the VM IP address, the response comes directly from the web server. The time the web server itself consumes is minimal compared to the network latency. The TTFB is roughly double the ping latency because a new TCP connection is opened for every HTTP request, and the initial three-way handshake must complete before the HTTP request can be sent and answered, as shown in the following diagram.

Client-server HTTP request diagram
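
As a rough model of this behavior, the TTFB over a new connection can be estimated as one round trip for the TCP handshake plus one round trip for the request and response. The function name and the `server_time_ms` parameter are hypothetical, and the model ignores DNS lookups and any other overhead.

```python
def ttfb_new_connection_ms(rtt_ms, server_time_ms=0.0):
    """Estimate TTFB when every request opens a fresh TCP connection."""
    handshake = rtt_ms                          # SYN out, SYN-ACK back
    request_response = rtt_ms + server_time_ms  # request out, first byte back
    return handshake + request_response


# With the 110 ms ping latency measured from Germany:
print(ttfb_new_connection_ms(110))  # 220.0
```

The estimate of 220 ms is close to the observed 230 ms. The source does not break down the remaining difference.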

Network load balancing

With a network load balancer, user requests still enter the Google network at the closest edge POP. In the region where the project's VMs are located, traffic flows first through a Maglev load balancer and then is forwarded without changes to the target backend VM. The Maglev load balancer distributes traffic based on a stable hashing algorithm, which uses a combination of source and destination port, IP address, and protocol. The VMs listen to the load balancer IP and accept the traffic unaltered.

Architecture with network load balancing
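
The idea of picking a backend from a hash of the connection's addresses, ports, and protocol can be illustrated as follows. This is a deliberately simplified sketch that uses a plain modulo over a SHA-256 digest. It is not Maglev's actual algorithm, which uses consistent hashing so that few connections move when the set of backends changes. The function name and the backend names are hypothetical.

```python
import hashlib


def pick_backend(src_ip, src_port, dst_ip, dst_port, protocol, backends):
    """Map a connection 5-tuple to one backend, deterministically."""
    key = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}|{protocol}".encode()
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an integer and reduce it to an index.
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


backends = ["vm-a", "vm-b", "vm-c"]
first = pick_backend("198.51.100.7", 40312, "203.0.113.10", 80, "tcp", backends)
again = pick_backend("198.51.100.7", 40312, "203.0.113.10", 80, "tcp", backends)
print(first == again)  # True: the same connection always maps to the same VM
```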

The following table shows the results when the user in Germany tested latency for the network-load-balancing option:

Method: Ping the network load balancer
Result:
    [user@germany ~]$ ping -c 5 net-lb
    PING net-lb (xxx.xxx.xxx.xxx) 56(84) bytes of data.
    64 bytes from net-lb (xxx.xxx.xxx.xxx): icmp_seq=1 ttl=44 time=110 ms
    64 bytes from net-lb (xxx.xxx.xxx.xxx): icmp_seq=2 ttl=44 time=110 ms
    [...]
    64 bytes from net-lb (xxx.xxx.xxx.xxx): icmp_seq=5 ttl=44 time=110 ms
    --- net-lb ping statistics ---
    5 packets transmitted, 5 received, 0% packet loss, time 4007ms
    rtt min/avg/max/mdev = 110.658/110.705/110.756/0.299 ms
Minimum latency: 110 ms

Method: TTFB
Result:
    [user@germany ~]$ for ((i=0;i<500;i++)); do curl -w \
        "%{time_total}\n" -o /dev/null -s net-lb; done
    0.231
    0.232
    0.230
    0.230
    0.232
    [...]
    0.232
    0.231
Minimum latency: 230 ms

Because load balancing happens in-region and traffic is merely forwarded, there is no significant latency impact compared with the no-load-balancer option.

HTTP(S)/TCP/SSL proxy load balancing

With HTTP load balancing, traffic is proxied through GFEs, which are typically located at the edge of Google's global network. The GFE terminates the TCP session and connects to a backend in the closest region that has capacity to serve the traffic.

HTTP load balancing scenario diagram

The following table shows the results when the user in Germany tested latency for the HTTP-load-balancing option:

Method: Ping the HTTP load balancer
Result:
    [user@germany ~]$ ping -c 5 http-lb
    PING http-lb (xxx.xxx.xxx.xxx) 56(84) bytes of data.
    64 bytes from http-lb (xxx.xxx.xxx.xxx): icmp_seq=1 ttl=56 time=1.22 ms
    64 bytes from http-lb (xxx.xxx.xxx.xxx): icmp_seq=2 ttl=56 time=1.20 ms
    64 bytes from http-lb (xxx.xxx.xxx.xxx): icmp_seq=3 ttl=56 time=1.16 ms
    64 bytes from http-lb (xxx.xxx.xxx.xxx): icmp_seq=4 ttl=56 time=1.17 ms
    64 bytes from http-lb (xxx.xxx.xxx.xxx): icmp_seq=5 ttl=56 time=1.20 ms
    --- http-lb ping statistics ---
    5 packets transmitted, 5 received, 0% packet loss, time 4005ms
    rtt min/avg/max/mdev = 1.163/1.195/1.229/0.039 ms
Minimum latency: 1 ms

Method: TTFB
Result:
    [user@germany ~]$ for ((i=0;i<500;i++)); do curl -w \
        "%{time_total}\n" -o /dev/null -s http-lb; done
    0.309
    0.230
    0.229
    0.233
    0.230
    [...]
    0.123
    0.124
    0.126
Minimum latency: 123 ms

The results for HTTP load balancing are significantly different. When pinging the HTTP load balancer, the round-trip latency is just over 1 ms. However, this result represents latency to the closest GFE, which in this case is located in the same city as the user. It says nothing about the latency the user actually experiences when accessing the application hosted in the us-central1 region. Experiments that use a protocol (ICMP) different from your application's communication protocol (HTTP) can therefore be misleading.

When measuring TTFB, the initial requests show roughly the same response latency as in the previous configurations (about 230 ms). Over the course of the test, later requests reach the lower minimum latency of 123 ms, as shown in the following graph:

Latency to HTTP load balancer in ms graph

However, two round trips between the client and the VM would take more than 123 ms even with perfectly straight fiber (2 * 75 ms = 150 ms). The reason for the lower latency is that traffic is proxied through GFEs, which keep persistent connections to the backend VMs open. Therefore, only the first request from a specific GFE to a specific backend needs a three-way handshake.

Initial HTTP request via GFE diagram

There are multiple GFEs in each location. The latency graph shows multiple fluctuating spikes early on, as requests hash to different GFEs and reach each GFE-backend pair for the first time. After all GFEs have been reached, subsequent requests show the lower latency.

Subsequent HTTP request via GFE diagram
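
The difference between the initial and subsequent requests can be approximated with a simple model. This is a sketch under stated assumptions: the function name is hypothetical, server processing time is ignored, and the GFE-to-backend round trip is assumed to be about the same as the 110 ms ping latency measured directly to the VM.

```python
def ttfb_via_proxy_ms(client_to_gfe_rtt_ms, gfe_to_backend_rtt_ms,
                      backend_connection_open):
    """Estimate TTFB when a GFE proxies the request to the backend."""
    # The client always performs a handshake with the GFE and then sends
    # its request to it: two round trips on the short client-to-GFE leg.
    client_side = 2 * client_to_gfe_rtt_ms
    # The GFE forwards the request and receives the response: one round trip.
    backend_side = gfe_to_backend_rtt_ms
    if not backend_connection_open:
        # First request on this GFE-backend pair: the GFE must complete
        # its own three-way handshake with the backend first.
        backend_side += gfe_to_backend_rtt_ms
    return client_side + backend_side


cold = ttfb_via_proxy_ms(1.2, 110, backend_connection_open=False)
warm = ttfb_via_proxy_ms(1.2, 110, backend_connection_open=True)
print(f"first request: {cold:.1f} ms, later requests: {warm:.1f} ms")
```

Under these assumptions the model gives roughly 222 ms for a first request and roughly 112 ms once the connection is open, which is in line with the observed 230 ms and 123 ms.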

These scenarios demonstrate the reduced latency that users can experience in a production environment. The following table summarizes the results:

Option | Ping | TTFB
No load balancing | 110 ms to the web server | 230 ms
Network load balancing | 110 ms to the in-region network load balancer | 230 ms
HTTP load balancing | 1 ms to the closest GFE | 123 ms

When a healthy application is serving users in a specific region regularly, all GFEs in that region should generally have a persistent connection open to all serving backends. As such, users in that region will notice significantly reduced latency on their first HTTP request if they are far from the application backend. If users are near the application backend, no latency improvement is observed because of their proximity.

For subsequent requests, such as clicking a page link, no latency improvement is observed because modern browsers already keep a persistent connection to the service to be reused, as opposed to a curl command issued from the command line.

Additional latency effects of HTTP(S) load balancing

There are some additional observable effects with HTTP(S) load balancing that depend on traffic patterns.

  • HTTP(S) load balancing has lower latency for complex assets than network load balancing because fewer round trips are needed before a response completes. For example, when the user in Germany measured latency over the same connection by repeatedly downloading a 10 MB file, the average latency for network load balancing was 1911 ms compared to 1341 ms with HTTP load balancing, saving approximately 5 round trips per request. This reduction is because persistent connections between GFEs and serving backends reduce the effects of TCP Slow Start.

  • HTTP(S) load balancing significantly reduces the additional latency for a TLS handshake (typically 1-2 extra roundtrips). This reduction is because HTTP(S) uses SSL offloading, and only the latency to the edge POP is relevant. For the user in Germany, the minimum observed latency is 201 ms using HTTP(S) load balancing versus 525 ms using HTTP(S) through the network load balancer.

  • The HTTP(S) load balancer also allows an automatic upgrade of the user-facing session to HTTP/2, which can reduce the number of packets needed by using improvements in binary protocol, header compression, and connection multiplexing. This can reduce observed latency even more than that observed by switching to HTTP load balancing alone. HTTP/2 is used only in conjunction with current browsers using SSL/TLS. For our user in Germany, minimum latency decreased further from 201 ms to 145 ms when using HTTP/2 instead of plain HTTPS.
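
The figures in this list can be converted into approximate round trips with a one-line calculation. This sketch assumes that one round trip equals the 110 ms ping latency measured earlier between the user in Germany and us-central1. The function name is hypothetical.

```python
def round_trips_saved(slower_ms, faster_ms, rtt_ms=110):
    """Express a latency difference as a number of client-to-region round trips."""
    return (slower_ms - faster_ms) / rtt_ms


# 10 MB download: network load balancing vs. HTTP load balancing
print(round(round_trips_saved(1911, 1341), 1))  # 5.2
# TLS handshake: via network load balancer vs. HTTP(S) load balancing
print(round(round_trips_saved(525, 201), 1))    # 2.9
```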

Optimizing HTTP(S) load balancing

You can optimize latency for your application by using the HTTP(S) load balancer as follows:

  • If some of the traffic you serve is cacheable, you can integrate with Google Cloud CDN. Cloud CDN reduces latency by serving assets directly at Google's network edge. Cloud CDN also makes use of the TCP and HTTP optimizations (HTTP/2) mentioned in the Additional latency effects of HTTP(S) load balancing section.

  • You can use any CDN partner with GCP. By using one of Google's CDN interconnect partners, you benefit from discounted egress costs.

  • If content is static, you can reduce the load on the web servers by serving content directly from Google Cloud Storage through the HTTP(S) load balancer. This option combines seamlessly with the CDN options mentioned previously.

  • Deploying your web servers in multiple regions close to your users can reduce latency because HTTP(S), SSL Proxy, and TCP Proxy load balancing automatically direct users to the closest region. However, if your application is partly centralized, design it to minimize inter-regional round trips.

  • To reduce latency inside your applications, examine any remote procedure calls (RPCs) that communicate between VMs. This latency typically occurs when applications communicate between tiers or services. Tools such as Stackdriver Trace can help to minimize latency caused by application-serving requests.

  • Because TCP and SSL proxy are also based on GFE, the effect on latency is the same as observed with HTTP load balancing. Because HTTP(S) load balancing has more features than TCP/SSL proxy, we recommend always using HTTP(S) load balancing for HTTP(S) traffic.

Next steps

When architecting your application on GCP, we recommend that you deploy your application so that it is close to the majority of your users and choose the best configuration for your use. For more information about the different capabilities for load balancing on GCP, see the following pages:
