TCP optimization for network performance and resiliency

This page outlines methods to calculate the correct settings for decreasing the latency of your TCP connections in Google Cloud and hybrid scenarios. This page also helps you understand ways to improve connection latency between processes within Google Cloud.

Modern microservices architecture advocates that developers build small services, each with a single responsibility. The services should communicate using TCP or UDP, based on the reliability expectations of the system. It's therefore critical for microservices-based systems to communicate with reliability and low latency.

Google Cloud provides both reliability and low latency by providing a global network, which means that your application users can also go global. Having a global network means that you create a Virtual Private Cloud (VPC) network that spans regions and zones. Applications can connect to each other across regions and zones without ever leaving the Google Cloud network.

Applications that have been written for a traditional data center environment can exhibit slow performance when they're moved to a hybrid cloud environment, that is, when some of the application components run in a corporate data center and others run in the cloud. Slow performance can be the result of a number of factors. This page focuses on round-trip latencies and how latency affects TCP performance in applications that move a considerable amount of data over any part of the network.

Optimize TCP performance

The problem: latency and TCP behavior

TCP uses a windowing mechanism to prevent a fast sender from overrunning a slow receiver. The receiver advertises how much data the sender should send before the sender must wait for a window update from the receiver. As a result, if a receiving application can't receive data on the connection, there's a limit to how much data can be queued waiting for the application.

The TCP window allows efficient use of memory on the sending and receiving systems. As the receiving application consumes data, window updates are sent to the sender. The fastest that the window update can happen is in one round trip, which leads to the following formula for one of the limits to the bulk transfer performance of a TCP connection:

Throughput <= window size / round-trip time (RTT) latency

In the original design for TCP, this window has a maximum size of 65535 bytes (64 KiB - 1). This was the maximum amount of data that the sender could send before the sender received a window update in order to allow more data to be sent.

Changes in TCP since its introduction

Since TCP was introduced, some key features have changed:

Typical network speeds have increased by four orders of magnitude.
Typical memory in a system has increased by four orders of magnitude.

The result of the first change is that the original TCP window sizes led to an inefficient use of network resources. A sender would send a window's worth of data at the best speed possible under network conditions, and then sit idle for a considerable length of time while waiting for the TCP window update. The result of the second change is that senders and receivers can use more memory for networking to address the limitation exposed by the first change.

The following diagram illustrates this interchange.

[Diagram: the sender sends only 64 KB of data and then spends a very long time waiting before it gets back a window update.]

The sender can't fully utilize the network, because it's waiting for the TCP window update before sending additional data.

Sending more data at a time

The solution is to send more data at a time. As the bandwidth of the network increases, more data can fit into the pipe (network), and as the pipe gets longer, it takes longer to acknowledge the receipt of the data. This relationship is known as the bandwidth-delay product (BDP). This is calculated as the bandwidth multiplied by the round-trip time (RTT), resulting in a value that specifies the optimal number of bits to send in order to fill the pipe. The formula is this:

BDP (bits) = bandwidth (bits/second) * RTT (seconds)

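The computed BDP is used as the TCP window size for optimization. Because TCP window sizes are expressed in bytes, divide the BDP in bits by 8 to get the window size.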

For example, imagine that you have a 10 Gbps network with an RTT of 30 milliseconds. For the window size, use the value of the original TCP window size (65535 bytes). This value doesn't come close to taking advantage of the bandwidth capability. The maximum TCP performance possible on this link is as follows:

(65535 bytes * 8 bits/byte) = bandwidth * 0.030 second
bandwidth = (65535 bytes * 8 bits/byte) / 0.030 second
bandwidth = 524280 bits / 0.030 second
bandwidth = 17476000 bits / second

To state it another way, these values result in throughput that's a bit more than 17 Mbps, which is a small fraction of the network's 10 Gbps capability.
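
The arithmetic above can be reproduced with a short shell sketch. The function name `max_throughput_bps` is hypothetical and is used only for illustration. It assumes a window size given in bytes and an RTT given in whole milliseconds, and it uses integer arithmetic.

```shell
# Upper bound on bulk TCP throughput: window size / RTT.
# max_throughput_bps is a hypothetical helper name, not part of any tool.
# $1 = window size in bytes, $2 = RTT in milliseconds.
max_throughput_bps() {
    # bytes * 8 gives bits; multiplying by 1000 converts the
    # millisecond RTT into seconds before dividing.
    echo $(( $1 * 8 * 1000 / $2 ))
}

# Original 65535-byte window on a path with a 30 ms RTT.
max_throughput_bps 65535 30    # prints 17476000 (bits per second)
```

Halving the RTT in the second argument doubles the result, which shows why the same window performs much better inside a region than across continents.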

The solution: TCP window-size scaling

To resolve the performance limitations imposed by the original design of the TCP window size, extensions to the TCP protocol were introduced that allow the window size to be scaled to much larger values. Window scaling supports windows up to 1,073,725,440 bytes, or almost 1 GiB. This feature is specified in RFC 7323 as the TCP window scale option.

The window scale extensions expand the definition of the TCP window to use 30 bits, and then use an implicit scale factor to carry this 30-bit value in the 16-bit window field of the TCP header. To see if the feature is enabled on Linux-based systems, use the following command:

sudo sysctl net.ipv4.tcp_window_scaling

A return value of 1 indicates that the option is enabled. All Google Cloud Linux virtual machines have this feature enabled by default. If the feature is disabled, you can enable it by using the following command:

sudo sysctl -w net.ipv4.tcp_window_scaling=1
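
To make the scale factor described in this section concrete, the following sketch computes the smallest shift count that lets the 16-bit window field express a given window size. The helper name `window_scale_shift` is hypothetical. The sketch illustrates the arithmetic only; it isn't how a kernel chooses its scale factor, which depends on the socket buffer sizes in effect.

```shell
# window_scale_shift is a hypothetical helper: it prints the smallest shift
# count (0 to 14, the range that the window scale option allows) for which
# the scaled 16-bit window, 65535 << shift, covers the requested window.
# $1 = desired window size in bytes.
window_scale_shift() {
    shift_count=0
    while [ "$shift_count" -lt 14 ] &&
          [ $(( 65535 << shift_count )) -lt "$1" ]; do
        shift_count=$(( shift_count + 1 ))
    done
    echo "$shift_count"
}

window_scale_shift 37500000    # prints 10 for a 37.5 MB window
echo $(( 65535 << 14 ))        # prints 1073725440, the largest scaled window
```

A shift count of 10 means that each unit in the 16-bit window field represents 1024 bytes.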

Throughput with a larger window size

You can use the previous example to show the benefit of having window scaling. As before, assume a 10 Gbps network with 30-millisecond latency, and then compute a new window size using this formula:

(link speed in bits/second * latency in seconds) / 8 bits/byte = window size in bytes

If you plug in the example numbers, you get this:

(10 Gbps * (30 ms / 1000 ms per second)) / 8 bits/byte = window size
(10000 Mbps * 0.030 second) / 8 bits/byte = 37.5 MB

Increasing the TCP window size to about 37.5 MB raises the theoretical limit of TCP bulk transfer performance to a value approaching the network capability. Many other factors can still limit performance, including system overhead, average packet size, and the number of other flows sharing the link, but a larger window removes most of the limit imposed by the original 65535-byte window.
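
As a minimal sketch of the window-size calculation, the following function computes the BDP in bytes. The name `bdp_bytes` is hypothetical, and the function assumes bandwidth in bits per second and RTT in whole milliseconds.

```shell
# bdp_bytes is a hypothetical helper: bandwidth-delay product in bytes.
# $1 = bandwidth in bits per second, $2 = RTT in milliseconds.
bdp_bytes() {
    # bits/second * ms / 1000 gives bits in flight; / 8 converts to bytes.
    echo $(( $1 * $2 / 1000 / 8 ))
}

# 10 Gbps link with a 30 ms RTT, as in the example above.
bdp_bytes 10000000000 30    # prints 37500000 (37.5 MB)
```

You can substitute the bandwidth and RTT of your own path to get a starting value for the buffer tunables.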

Setting Linux tunables to change TCP window size

In Linux, the TCP window size is affected by the following sysctl(8) tunables:

net.core.rmem_max
net.core.wmem_max
net.ipv4.tcp_rmem
net.ipv4.tcp_wmem

The first two tunables affect the maximum TCP window size for applications that attempt to control the TCP window size directly, by limiting the applications' request to no more than those values. The last two tunables affect the TCP window size for applications that let Linux auto-tuning do the work.

The optimal window-size value depends on your specific circumstances, but one starting point is the largest BDP (bandwidth-delay product) for the path or paths over which you expect the system to send data. In that case, set the tunables by using the following steps:

Make sure that you have root privileges.
Get the current buffer settings. Save these settings in case you want to roll back these changes.
```
sudo sysctl -a | grep mem
```
Set a shell variable to the new TCP window size, in bytes, that you want to use:
```
MaxExpectedPathBDP=8388608
```
Set the maximum OS receive buffer size for all types of connections:
```
sudo sysctl -w net.core.rmem_max=$MaxExpectedPathBDP
```
Set the maximum OS send buffer size for all types of connections:
```
sudo sysctl -w net.core.wmem_max=$MaxExpectedPathBDP
```
Set the TCP receive memory buffer (tcp_rmem) settings:
```
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 $MaxExpectedPathBDP"
```
The tcp_rmem setting takes three values:
- The minimum receive buffer size that can be allocated for a TCP socket. In this example, the value is 4096 bytes.
- The default receive buffer size, which also overrides the /proc/sys/net/core/rmem_default value used by other protocols. In the example, the value is 87380 bytes.
- The maximum receive buffer size that can be allocated for a TCP socket. In the example, this is set to the value that you set earlier (8388608 bytes).
Set the TCP send memory buffer (tcp_wmem) settings:
```
sudo sysctl -w net.ipv4.tcp_wmem="4096 16384 $MaxExpectedPathBDP"
```
The tcp_wmem setting takes three values:
- The minimum TCP send buffer space available for a single TCP socket.
- The default buffer space allowed for a single TCP socket.
- The maximum TCP send buffer space.
Set the tunables so that subsequent connections use the values you specified:
```
sudo sysctl -w net.ipv4.route.flush=1
```
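
After you apply the steps above, you might want to confirm that the maximum field of a buffer setting covers your target. The following is a minimal sketch; the helper name `tcp_max_covers` is hypothetical, and the literal values are the ones used in the steps.

```shell
# tcp_max_covers is a hypothetical helper. It takes a "min default max"
# triple (the format of net.ipv4.tcp_rmem and net.ipv4.tcp_wmem) and a
# target size in bytes, and succeeds when the max field is at least the
# target.
tcp_max_covers() {
    # $1 is intentionally unquoted so that the triple splits into fields.
    set -- $1 "$2"
    [ "$3" -ge "$4" ]
}

if tcp_max_covers "4096 87380 8388608" 8388608; then
    echo "maximum buffer size covers the target"
fi

# On a live Linux system, pass the current value instead, for example:
#   tcp_max_covers "$(sysctl -n net.ipv4.tcp_rmem)" "$MaxExpectedPathBDP"
```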

To persist these settings across reboots, append the settings that you applied previously to the /etc/sysctl.conf file:

sudo bash -c 'cat << EOF >> /etc/sysctl.conf
net.core.rmem_max=8388608
net.core.wmem_max=8388608
net.ipv4.tcp_rmem=4096 87380 8388608
net.ipv4.tcp_wmem=4096 16384 8388608
net.ipv4.route.flush=1
EOF'

Testing RTT with an updated window size

When TCP has a large enough window size to use the BDP, the picture changes, as shown in the following diagram:

[Diagram: the sender sends a large quantity of data at a time and spends very little time waiting for a window update.]

The TCP window size can always be adapted based on the resources available to the process involved and the TCP algorithm in use. As the diagram shows, window scaling lets a connection go well beyond the 64 KiB window size defined in the original TCP specification.

You can test this yourself. First, make sure that you've made TCP window size changes to your local computer and to a remote computer by setting the tunables on both machines. Then run the following commands:

dd if=/dev/urandom of=sample.txt bs=1M count=1024 iflag=fullblock
scp sample.txt your_username@remotehost.com:/some/remote/directory

The first command creates a 1 GiB sample.txt file that contains random data. The second command copies that file from your local machine to a remote machine.

Note the scp command output on the console, which displays the transfer rate. You should see a sizable difference in the results from before and after the TCP window size changes.
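
For a rough sense of scale, the following sketch estimates how long the 1 GiB file would take at the two theoretical limits computed earlier on this page (a 65535-byte window and a filled 10 Gbps link, both with a 30 ms RTT). The helper name `transfer_seconds` is hypothetical, and the figures are arithmetic bounds rather than measurements. Your scp results depend on your actual path, encryption overhead, and other factors.

```shell
# transfer_seconds is a hypothetical helper: time to move a file at a given
# throughput, ignoring protocol overhead, encryption cost, and slow start.
# $1 = file size in bytes, $2 = throughput in bits per second.
transfer_seconds() {
    awk -v bytes="$1" -v bps="$2" \
        'BEGIN { printf "%.1f\n", bytes * 8 / bps }'
}

transfer_seconds 1073741824 17476000       # 65535-byte window: about 491.5 s
transfer_seconds 1073741824 10000000000    # filled 10 Gbps link: about 0.9 s
```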

Optimize TCP network resiliency by using Protective ReRoute

Protective ReRoute (PRR) is a host-based technique to route packets around faults in a multipath network. Google deploys PRR fleetwide, and it supplements standard network resiliency mechanisms to improve the overall availability of the network.

PRR is available to Google Cloud customers in two modes: hypervisor mode and guest mode. All Google Cloud customers automatically get hypervisor mode PRR with no actions required.

Hypervisor mode PRR protects the vast majority of traffic, with certain exceptions:

If an instance has a huge traffic fan-out (for example, if an instance is actively sending packets to thousands of cross-region instances at the same time), not all packets are protected.
Hypervisor mode PRR protects key segments of the network path but isn't fully end-to-end.
Hypervisor mode PRR reacts within single-digit seconds.

Guest mode PRR can be used for critical applications that are particularly sensitive to short-duration network events, have large fan-out patterns, are highly sensitive to packet loss, or require the fastest possible network recovery time (at RTT timescale). You can adopt guest mode PRR by meeting the following conditions:

The compute instance must run Linux with kernel 4.20 or later. To check the Linux kernel version, run the uname -r command within the compute instance.
The following sysctl must be enabled: /proc/sys/net/ipv6/auto_flowlabels. This sysctl is typically enabled by default.
If used with the VirtIO network interface, then only IPv6 network traffic is protected.
If used with the gVNIC network interface, then both IPv4 and IPv6 network traffic are protected.
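
The first two conditions above can be checked from inside the instance with a short sketch like the following. The helper name `kernel_at_least` is hypothetical, and the version parsing assumes that `uname -r` output starts with `major.minor`. The sketch only prints the flow label setting; a value of 0 means that automatic flow labels are disabled.

```shell
# kernel_at_least is a hypothetical helper: it succeeds when a kernel
# release string is at least the given major.minor version.
# $1 = release string (for example, from uname -r), $2 = major, $3 = minor.
kernel_at_least() {
    version=$1; want_major=$2; want_minor=$3
    major=${version%%.*}          # text before the first dot
    rest=${version#*.}            # text after the first dot
    minor=${rest%%.*}             # text before the next dot
    minor=${minor%%[!0-9]*}       # drop suffixes such as "-rc1"
    minor=${minor:-0}
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

if kernel_at_least "$(uname -r)" 4 20; then
    echo "kernel version meets the 4.20 requirement"
else
    echo "kernel is older than 4.20"
fi

# The flow label sysctl exists only on Linux hosts with IPv6 support.
flowlabels=/proc/sys/net/ipv6/auto_flowlabels
if [ -r "$flowlabels" ]; then
    echo "auto_flowlabels = $(cat "$flowlabels")"
fi
```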

What's next

Read the blog post on 5 steps to better Google Cloud networking performance.
Learn about Global Networking Products.
Read more about Networking Tiers on Google Cloud.
Learn how to benchmark network performance.