Optimizing GPU performance

You can use the following options to improve the performance of GPUs on virtual machine (VM) instances:

Disabling autoboost

When you use the autoboost feature with NVIDIA® K80 GPUs, the system automatically adjusts clock speeds to find the optimal rate for a given application. However, constantly adjusting clock speeds can also lead to some reduction in the performance of your GPUs. For more information about autoboost, see Increase Performance with GPU Boost and K80 Autoboost.

We recommend that you disable autoboost when running NVIDIA® K80 GPUs on Compute Engine.

To disable autoboost on instances with NVIDIA® K80 GPUs attached, run the following command:

sudo nvidia-smi --auto-boost-default=DISABLED

The output is similar to the following:

All done.
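If the same startup script runs on VMs with different GPU models, the command can be limited to K80 devices. The following is a minimal sketch, not part of the original instructions. The helper name `is_k80` is hypothetical, and the sketch assumes that the name reported by `nvidia-smi --query-gpu=name` contains the string `K80` and that all GPUs on the VM are the same model.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a GPU name string refers to a K80.
is_k80() {
  case "$1" in
    *K80*) return 0 ;;
    *)     return 1 ;;
  esac
}

# Only act when the NVIDIA driver tools are installed on this VM.
if command -v nvidia-smi >/dev/null 2>&1; then
  # Name of the first GPU (assumption: every GPU on the VM is the same model).
  gpu_name=$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1)
  if is_k80 "$gpu_name"; then
    sudo nvidia-smi --auto-boost-default=DISABLED
  else
    echo "Skipping autoboost change: GPU is '$gpu_name', not a K80."
  fi
fi
```

On a VM without the NVIDIA driver installed, the script does nothing.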

Setting GPU clock speed to the maximum frequency

To set GPU clock speed to the maximum frequency on instances with NVIDIA® K80 GPUs attached, run the following command:

sudo nvidia-smi --applications-clocks=2505,875
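The two values passed to `--applications-clocks` are the memory clock and the graphics clock, in MHz. As a hedged sketch, the highest supported pair can be read from the driver instead of being hard-coded. The helper name `max_clock_pair` is hypothetical, and the sketch assumes that the query prints one `memory, graphics` pair per line.

```shell
#!/bin/sh
# Hypothetical helper: reads "memory, graphics" clock pairs (MHz), one per line,
# and prints the pair with the highest memory clock, then highest graphics clock.
max_clock_pair() {
  tr -d ' ' | sort -t, -k1,1nr -k2,2nr | head -n 1
}

# Only query the driver when the NVIDIA tools are installed on this VM.
if command -v nvidia-smi >/dev/null 2>&1; then
  pair=$(nvidia-smi --query-supported-clocks=memory,graphics \
           --format=csv,noheader,nounits | max_clock_pair)
  echo "Highest supported application clocks: $pair"
  # To apply them: sudo nvidia-smi --applications-clocks="$pair"
fi
```

The sketch only prints the pair. Applying it is left as a commented step so that the values can be reviewed first.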

Using network bandwidths of up to 100 Gbps

Creating VMs that use higher bandwidths

You can use higher network bandwidths to improve the performance of distributed workloads on VMs running on Compute Engine that use NVIDIA® A100, T4, or V100 GPUs.

For more information about the network bandwidths that are supported for your GPU instances, see Network bandwidths and GPUs.

To create a VM with attached GPUs and a network bandwidth of up to 100 Gbps:

  1. Review the minimum CPU, GPU, and memory configuration required to get the maximum bandwidth available.
  2. Create your VM with attached A100, T4, or V100 GPUs. For more information, see Creating VMs with attached GPUs. This GPU VM must also have GPU drivers installed and Google Virtual NIC (gVNIC) enabled.

    Alternatively, you can create a VM using any GPU-supported image from the Deep Learning VM (DLVM) image project. All GPU-supported DLVM images have the GPU driver, ML software, and gVNIC preinstalled. For a list of DLVM images, see Choosing an image.

    Example

    For example, to create a VM that has a maximum bandwidth of 100 Gbps, has eight V100 GPUs attached, and uses the tf-latest-gpu DLVM image, run the following command:

    gcloud compute instances create VM_NAME \
       --project=PROJECT_ID \
       --zone=ZONE \
       --custom-cpu=96 \
       --custom-memory=624 \
       --image-project=deeplearning-platform-release \
       --image-family=tf-latest-gpu \
       --accelerator=type=nvidia-tesla-v100,count=8 \
       --maintenance-policy=TERMINATE \
       --metadata="install-nvidia-driver=True" \
       --boot-disk-size=200GB \
       --network-interface=nic-type=GVNIC
    

    Example

    For example, to create a VM that has a maximum bandwidth of 100 Gbps, has eight A100 GPUs attached, and uses the tf-latest-gpu DLVM image, run the following command:

    gcloud compute instances create VM_NAME \
       --project=PROJECT_ID \
       --zone=ZONE \
       --machine-type=a2-highgpu-8g \
       --maintenance-policy=TERMINATE --restart-on-failure \
       --image-family=tf-latest-gpu \
       --image-project=deeplearning-platform-release \
       --boot-disk-size=200GB \
       --network-interface=nic-type=GVNIC \
       --metadata="install-nvidia-driver=True,proxy-mode=project_editors" \
       --scopes=https://www.googleapis.com/auth/cloud-platform
    

    Replace the following:

    • VM_NAME: the name of your VM
    • PROJECT_ID: your project ID
    • ZONE: the zone for the VM. This zone must support the specified GPU type. For more information about zones, see GPU regions and zones availability.
  3. After you create the VM, you can verify the network bandwidth.
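Before measuring bandwidth, it can help to confirm that the VM was created with gVNIC. The following is a sketch under stated assumptions, not a documented procedure. The helper names `nic_is_gvnic` and `check_vm_nic` are hypothetical, and the sketch assumes that the first network interface is the one of interest and that the instance resource reports its NIC type in the `networkInterfaces[0].nicType` field.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the reported NIC type is gVNIC.
nic_is_gvnic() {
  [ "$1" = "GVNIC" ]
}

# Hypothetical helper: check_vm_nic VM_NAME ZONE PROJECT_ID
# Reads the NIC type of the first network interface and reports the result.
check_vm_nic() {
  nic_type=$(gcloud compute instances describe "$1" \
    --zone="$2" --project="$3" \
    --format="value(networkInterfaces[0].nicType)")
  if nic_is_gvnic "$nic_type"; then
    echo "$1 uses gVNIC"
  else
    echo "$1 reports NIC type '${nic_type:-unset}'; expected GVNIC" >&2
    return 1
  fi
}

# Example call, using the same placeholders as the commands above:
#   check_vm_nic VM_NAME ZONE PROJECT_ID
```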

Checking network bandwidth

When working with high bandwidth GPUs, you can use a network traffic tool, such as iperf2, to measure the networking bandwidth.

To check bandwidth speeds, you need at least two VMs that have attached GPUs and can both support the bandwidth speed that you are testing. For recommended minimum VM configurations for specific bandwidths, see VM configurations.

Use iPerf to perform the benchmark on Debian-based systems.

  1. Create two VMs that can support the required bandwidth speeds.

  2. Once both VMs are running, use SSH to connect to one of the VMs.

    gcloud compute ssh VM_NAME \
        --project=PROJECT_ID
    

    Replace the following:

    • VM_NAME: the name of the first VM
    • PROJECT_ID: your project ID
  3. On the first VM, complete the following steps:

    1. Install iPerf.

      sudo apt-get update && sudo apt-get install iperf
      
    2. Get the internal IP address for this VM. Note the address; you need it when you run the test from the second VM.

      ip a
      
    3. Start up the iPerf server.

      iperf -s
      

      This starts up a server listening for connections in order to perform the benchmark. Leave this running for the duration of the test.

  4. From a new client terminal, connect to the second VM using SSH.

    gcloud compute ssh VM_NAME \
       --project=PROJECT_ID
    

    Replace the following:

    • VM_NAME: the name of the second VM
    • PROJECT_ID: your project ID
  5. On the second VM, complete the following steps:

    1. Install iPerf.

      sudo apt-get update && sudo apt-get install iperf
      
    2. Run the iPerf test and specify the first VM's IP address as the target.

      iperf -t 30 -c internal_ip_of_instance_1 -P 16
      

      This runs a 30-second test over 16 parallel streams and reports the measured throughput. If iPerf can't reach the other VM, you might need to adjust the network or firewall settings on the VMs or in the Cloud Console.
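When the test runs with several parallel streams, the figure of interest is the aggregate rate rather than the per-stream rates. The following sketch extracts it from saved client output. The helper name `iperf_sum_rate` is hypothetical, and the sketch assumes that the iPerf client prints a summary line that starts with `[SUM]` and ends with the rate followed by its unit.

```shell
#!/bin/sh
# Hypothetical helper: reads iPerf client output on stdin and prints the rate
# and unit taken from the last line that starts with [SUM].
iperf_sum_rate() {
  awk '/^\[SUM\]/ { rate = $(NF-1) " " $NF } END { if (rate != "") print rate }'
}

# Example: save the client output on the second VM, then extract the aggregate.
#   iperf -t 30 -c internal_ip_of_instance_1 -P 16 | tee iperf.log
#   iperf_sum_rate < iperf.log
```

If the output has no `[SUM]` line, for example in a single-stream run, the helper prints nothing.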

When you use the maximum available bandwidth of 100 Gbps, keep the following considerations in mind:

  • Due to header overheads for protocols such as Ethernet, IP, and TCP on the virtualization stack, the throughput, as measured by netperf, saturates at around 90 Gbps.

    TCP is able to achieve the 100-Gbps network speed. Other protocols, such as UDP, are currently slower.

  • Due to factors such as protocol overhead and network congestion, end-to-end performance of data streams might be slightly lower than 100 Gbps.

  • You need to use multiple TCP streams to achieve maximum bandwidth between VM instances. Google recommends 4–16 streams. At 16 flows you'll frequently max out the throughput. Depending on your application and software stack, you might need to adjust settings or your code to set up multiple streams.

  • The 100-Gbps network bandwidth can only be achieved unidirectionally. You can expect the sum of TX + RX to be roughly 100 Gbps.
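To see how throughput changes with the number of TCP streams, the iPerf client can be run at several stream counts within the recommended range. The following sketch only prints the commands so that they can be reviewed first. The helper name `iperf_sweep_cmds` and the choice of 4, 8, and 16 streams are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: prints one iPerf client command per stream count.
# $1 is the internal IP address of the VM that runs the iPerf server.
iperf_sweep_cmds() {
  server_ip=$1
  for streams in 4 8 16; do
    echo "iperf -t 30 -c $server_ip -P $streams"
  done
}

# Print the commands, then run them one at a time on the client VM.
iperf_sweep_cmds internal_ip_of_instance_1
```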

Using higher network bandwidth speeds with Fast Socket

NVIDIA Collective Communications Library (NCCL) is used by deep learning frameworks such as TensorFlow, PyTorch, and Horovod for multi-GPU and multi-node training.

Fast Socket is a Google proprietary network transport for NCCL. On Compute Engine, Fast Socket improves NCCL performance on 100 Gbps networks by reducing the contention between multiple TCP connections. For more information about working with NCCL, see the NCCL user guide.

Current evaluation shows that Fast Socket improves all-reduce throughput by 30%–60% depending on the message size.

To set up a Fast Socket environment, you can use either a Deep Learning VM image or a Compute Engine public image.

Using Deep Learning VM images

To set up Fast Socket, you can use a Deep Learning VM. Deep Learning VM images have the GPU driver, ML software, Fast Socket, and gVNIC preinstalled.

These images include the following:

  • tf-latest-gpu-debian-10
  • tf-latest-gpu-ubuntu-1804

V100 Example

For example, to create a Debian 10 VM that has a maximum bandwidth of 100 Gbps, has eight V100 GPUs attached, and uses a Deep Learning VM image with Fast Socket, run the following command:

gcloud compute instances create VM_NAME \
    --project=PROJECT_ID \
    --zone=ZONE \
    --custom-cpu=96 \
    --custom-memory=624 \
    --image-project=deeplearning-platform-release \
    --image-family=tf-latest-gpu-debian-10 \
    --accelerator=type=nvidia-tesla-v100,count=8 \
    --maintenance-policy=TERMINATE \
    --metadata="install-nvidia-driver=True" \
    --network-interface=nic-type=GVNIC \
    --boot-disk-size=200GB

A100 Example

For example, to create an Ubuntu 18.04 VM that has a maximum bandwidth of 100 Gbps, has eight A100 GPUs attached, and uses a Deep Learning VM image with Fast Socket, run the following command:

gcloud compute instances create VM_NAME \
    --project=PROJECT_ID \
    --zone=ZONE \
    --machine-type=a2-highgpu-8g \
    --maintenance-policy=TERMINATE --restart-on-failure \
    --image-family=tf-latest-gpu-ubuntu-1804 \
    --image-project=deeplearning-platform-release \
    --boot-disk-size=200GB \
    --network-interface=nic-type=GVNIC \
    --metadata="install-nvidia-driver=True,proxy-mode=project_editors" \
    --scopes=https://www.googleapis.com/auth/cloud-platform

Replace the following:

  • VM_NAME: the name of your VM.
  • PROJECT_ID: your project ID.
  • ZONE: the zone for the VM. This zone must support the specified GPU type. For more information about zones, see GPU regions and zones availability.

After you set up the environment, you can verify that Fast Socket is enabled.

Using Compute Engine public images

To set up Fast Socket, you can use a Compute Engine public image. To use a Compute Engine public image, complete the following steps:

  1. Choose a Compute Engine public image that supports gVNIC.

  2. Use this image to create your VM with attached A100, T4, or V100 GPUs. For more information, see Creating VMs with attached GPUs.

  3. Install GPU drivers. For more information, see Installing GPU drivers.

  4. Install Fast Socket. For instructions, see Manually installing Fast Socket.

  5. Verify that Fast Socket is enabled. For instructions, see Verifying that Fast Socket is enabled.

Manually installing Fast Socket

Before you install Fast Socket on a Linux VM, you need to install NCCL. For detailed instructions, see the NVIDIA NCCL documentation.

CentOS/RHEL

To download and install Fast Socket on a CentOS or RHEL VM, complete the following steps:

  1. Add the package repository and import public keys.

    sudo tee /etc/yum.repos.d/google-fast-socket.repo << EOM
    [google-fast-socket]
    name=Fast Socket Transport for NCCL
    baseurl=https://packages.cloud.google.com/yum/repos/google-fast-socket
    enabled=1
    gpgcheck=0
    repo_gpgcheck=0
    gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg
          https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
    EOM
    
  2. Install Fast Socket.

    sudo yum install google-fast-socket
    
  3. Verify that Fast Socket is enabled.

SLES

To download and install Fast Socket on an SLES VM, complete the following steps:

  1. Add the package repository.

    sudo zypper addrepo https://packages.cloud.google.com/yum/repos/google-fast-socket google-fast-socket
    
  2. Add repository keys.

    sudo rpm --import https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
    
  3. Install Fast Socket.

    sudo zypper install google-fast-socket
    
  4. Verify that Fast Socket is enabled.

Debian/Ubuntu

To download and install Fast Socket on a Debian or Ubuntu VM, complete the following steps:

  1. Add the package repository.

    echo "deb https://packages.cloud.google.com/apt google-fast-socket main" | sudo tee /etc/apt/sources.list.d/google-fast-socket.list
    
  2. Add repository keys.

    curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
    
  3. Install Fast Socket.

    sudo apt update && sudo apt install google-fast-socket
    
  4. Verify that Fast Socket is enabled.
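The three procedures differ only in the package manager they use. As a hedged sketch for a setup script that runs on more than one distribution, the install command can be selected from the `ID` field in `/etc/os-release`. The helper name `fast_socket_install_cmd` is hypothetical, and the sketch assumes that the package repository and keys were already added as described above.

```shell
#!/bin/sh
# Hypothetical helper: maps an os-release ID to the install command shown above.
fast_socket_install_cmd() {
  case "$1" in
    centos|rhel)   echo "sudo yum install google-fast-socket" ;;
    sles)          echo "sudo zypper install google-fast-socket" ;;
    debian|ubuntu) echo "sudo apt update && sudo apt install google-fast-socket" ;;
    *)             echo "unsupported distribution: $1" >&2; return 1 ;;
  esac
}

# Read the distribution ID from /etc/os-release when the file is present,
# then print (not run) the matching install command.
if [ -r /etc/os-release ]; then
  os_id=$(. /etc/os-release && echo "$ID")
  fast_socket_install_cmd "$os_id" || true
fi
```

The sketch prints the command instead of running it, so nothing is installed until you run the printed command yourself.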

Verifying that Fast Socket is enabled

On your VM, complete the following steps:

  1. Locate the NCCL home directory.

    sudo ldconfig -p | grep nccl

    For example, on a DLVM image, you get the following output:

    libnccl.so.2 (libc6,x86-64) => /usr/local/nccl2/lib/libnccl.so.2
    libnccl.so (libc6,x86-64) => /usr/local/nccl2/lib/libnccl.so
    libnccl-net.so (libc6,x86-64) => /usr/local/nccl2/lib/libnccl-net.so

    This shows that the NCCL home directory is /usr/local/nccl2.

  2. Check that NCCL loads the Fast Socket plugin. To check, you need the NCCL tests package. To download and build the package, run the following command:

    git clone https://github.com/NVIDIA/nccl-tests.git && \
    cd nccl-tests && make NCCL_HOME=NCCL_HOME_DIRECTORY

    Replace NCCL_HOME_DIRECTORY with the NCCL home directory.

  3. From the nccl-tests directory, run the all_reduce_perf process:

    NCCL_DEBUG=INFO build/all_reduce_perf

    If Fast Socket is enabled, the "FastSocket plugin initialized" message appears in the output log, as in the following example:

    # nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1
    #
    # Using devices
    #   Rank  0 Pid  63324 on fast-socket-gpu device  0 [0x00] Tesla V100-SXM2-16GB
    .....
    fast-socket-gpu:63324:63324 [0] NCCL INFO NET/FastSocket : Flow placement enabled.
    fast-socket-gpu:63324:63324 [0] NCCL INFO NET/FastSocket : queue skip: 0
    fast-socket-gpu:63324:63324 [0] NCCL INFO NET/FastSocket : Using [0]ens12:10.240.0.24
    fast-socket-gpu:63324:63324 [0] NCCL INFO NET/FastSocket plugin initialized
    ......
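The manual checks above can be sketched as two small filters. The helper names `nccl_home_from_ldconfig` and `fast_socket_enabled` are hypothetical. The first assumes that the library is installed as `<home>/lib/libnccl.so.2`, as in the DLVM example, and the second assumes that the log contains the exact message shown in the sample output.

```shell
#!/bin/sh
# Hypothetical helper: reads `ldconfig -p` output on stdin and prints the NCCL
# home directory, assuming the library lives in <home>/lib/libnccl.so.2.
nccl_home_from_ldconfig() {
  awk '$1 == "libnccl.so.2" { print $NF; exit }' | sed 's|/lib/libnccl\.so\.2$||'
}

# Hypothetical helper: succeeds if an NCCL debug log on stdin contains the
# Fast Socket initialization message.
fast_socket_enabled() {
  grep -q 'NET/FastSocket plugin initialized'
}

# Example usage on the VM, from the nccl-tests directory:
#   ldconfig -p | nccl_home_from_ldconfig
#   NCCL_DEBUG=INFO build/all_reduce_perf > nccl.log 2>&1
#   fast_socket_enabled < nccl.log && echo "Fast Socket is enabled"
```

Writing the log to a file first lets the test finish before the log is searched.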
    

What's next?