Create an A3 VM with GPUDirect-TCPX enabled

Linux

The accelerator-optimized machine family is designed by Google Cloud to deliver the needed performance and efficiency for GPU accelerated workloads such as artificial intelligence (AI), machine learning (ML), and high performance computing (HPC).

The A3 accelerator-optimized machine series has 208 vCPUs and up to 1872 GB of memory. Each A3 machine type has eight NVIDIA H100 GPUs attached, each with 80 GB of GPU memory. These VMs can get up to 1,000 Gbps of network bandwidth, which makes them ideal for large transformer-based language models, databases, and high performance computing (HPC).

When working with a3-highgpu-8g or a3-edgegpu-8g VMs, you can use GPUDirect-TCPX to achieve the lowest possible latency between applications and the network. GPUDirect-TCPX is a custom, remote direct memory access (RDMA) networking stack that increases the network performance of your A3 VMs by allowing data packet payloads to transfer directly from GPU memory to the network interface without having to go through the CPU and system memory. A3 VMs can use GPUDirect-TCPX combined with Google Virtual NIC (gVNIC) to deliver the highest throughput between VMs in a cluster when compared to the A2 or G2 accelerator-optimized machine types.

This document shows you how to create an a3-highgpu-8g or a3-edgegpu-8g VM that runs Container-Optimized OS. It also shows how to enable GPUDirect-TCPX on the VM, and how to set up and test the improved GPU network performance.

Overview

To test network performance with GPUDirect-TCPX, complete the following steps:

Set up one or more Virtual Private Cloud (VPC) networks and set the MTU to 8244 to enable jumbo frames.
Create your GPU VMs by using the cos-105-lts or later Container-Optimized OS image.
On each VM, install the GPU drivers.
On each VM, give the network interface cards (NICs) access to the GPU.
Run an NCCL test.

Set up jumbo frame MTU networks

a3-highgpu-8g and a3-edgegpu-8g VMs have five physical NICs. To get the best performance from the physical NICs, create five Virtual Private Cloud networks and set the MTU of each to 8244.

Create management network, subnet, and firewall rule

Complete the following steps to set up the management network:

Create the management network by using the networks create command:

gcloud compute networks create NETWORK_NAME_PREFIX-mgmt-net \
  --project=PROJECT_ID \
  --subnet-mode=custom \
  --mtu=8244

Create the management subnet by using the networks subnets create command:

gcloud compute networks subnets create NETWORK_NAME_PREFIX-mgmt-sub \
  --project=PROJECT_ID \
  --network=NETWORK_NAME_PREFIX-mgmt-net \
  --region=REGION \
  --range=192.168.0.0/24

Create firewall rules by using the firewall-rules create command.

Create a firewall rule for the management network.

gcloud compute firewall-rules create NETWORK_NAME_PREFIX-mgmt-internal \
 --project=PROJECT_ID \
 --network=NETWORK_NAME_PREFIX-mgmt-net \
 --action=ALLOW \
 --rules=tcp:0-65535,udp:0-65535,icmp \
 --source-ranges=192.168.0.0/16

Create the tcp:22 firewall rule to limit which source IP addresses can connect to your VM by using SSH.

gcloud compute firewall-rules create NETWORK_NAME_PREFIX-mgmt-external-ssh \
 --project=PROJECT_ID \
 --network=NETWORK_NAME_PREFIX-mgmt-net \
 --action=ALLOW \
 --rules=tcp:22 \
 --source-ranges=SSH_SOURCE_IP_RANGE

Create the icmp firewall rule that can be used to check for data transmission issues in the network.

gcloud compute firewall-rules create NETWORK_NAME_PREFIX-mgmt-external-ping \
 --project=PROJECT_ID \
 --network=NETWORK_NAME_PREFIX-mgmt-net \
 --action=ALLOW \
 --rules=icmp \
 --source-ranges=0.0.0.0/0

Replace the following:

NETWORK_NAME_PREFIX: the name prefix to use for the Virtual Private Cloud networks and subnets.
PROJECT_ID: your project ID.
REGION: the region where you want to create the networks.
SSH_SOURCE_IP_RANGE: an IP address range in CIDR format. This specifies which source IP addresses can connect to your VM by using SSH.

Create data networks, subnets, and firewall rule

Use the following command to create four data networks, each with its own subnet and firewall rule.

for N in $(seq 1 4); do
  gcloud compute networks create NETWORK_NAME_PREFIX-data-net-$N \
      --project=PROJECT_ID \
      --subnet-mode=custom \
      --mtu=8244

  gcloud compute networks subnets create NETWORK_NAME_PREFIX-data-sub-$N \
      --project=PROJECT_ID \
      --network=NETWORK_NAME_PREFIX-data-net-$N \
      --region=REGION \
      --range=192.168.$N.0/24

  gcloud compute firewall-rules create NETWORK_NAME_PREFIX-data-internal-$N \
      --project=PROJECT_ID \
      --network=NETWORK_NAME_PREFIX-data-net-$N \
      --action=ALLOW \
      --rules=tcp:0-65535,udp:0-65535,icmp \
      --source-ranges=192.168.0.0/16
done

For more information about how to create Virtual Private Cloud networks, see Create and verify a jumbo frame MTU network.
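After the management and data networks exist, it can be useful to confirm that each one reports the expected MTU. The following is a minimal sketch, not part of the official procedure. The function names `a3_network_names` and `a3_show_mtus` are hypothetical helpers introduced here for illustration, and the sketch assumes an authenticated gcloud CLI and the naming scheme used in the preceding commands.

```shell
#!/bin/bash
# Hypothetical helper: print the five network names that the commands
# in this document derive from a name prefix (one management network
# and four data networks).
a3_network_names() {
  local prefix="$1" n
  echo "${prefix}-mgmt-net"
  for n in 1 2 3 4; do
    echo "${prefix}-data-net-${n}"
  done
}

# Hypothetical helper: print each network name followed by its MTU.
# This part calls gcloud, so it needs an authenticated gcloud CLI.
a3_show_mtus() {
  local prefix="$1" project="$2" net
  for net in $(a3_network_names "$prefix"); do
    printf '%s\t' "$net"
    gcloud compute networks describe "$net" \
      --project="$project" \
      --format="value(mtu)"
  done
}

# Example (assumed values): a3_show_mtus NETWORK_NAME_PREFIX PROJECT_ID
```

Each line of output from `a3_show_mtus` should end in 8244 if the networks were created with the commands in this section.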

Create your GPU VMs

To test network performance with GPUDirect-TCPX, you need to create at least two A3 VMs.

Create each VM by using the cos-105-lts or later Container-Optimized OS image and specifying the jumbo frame MTU networks that were created in the previous step.

The VMs must also use the Google Virtual NIC (gVNIC) network interface. For A3 VMs, gVNIC version 1.4.0rc3 or later is required. This driver version is available on Container-Optimized OS.

The first virtual NIC is used as the primary NIC for general networking and storage. Each of the other four virtual NICs is NUMA-aligned with two of the eight GPUs on the same PCIe switch.

gcloud compute instances create VM_NAME \
  --project=PROJECT_ID \
  --zone=ZONE \
  --machine-type=MACHINE_TYPE \
  --maintenance-policy=TERMINATE --restart-on-failure \
  --image-family=cos-105-lts \
  --image-project=cos-cloud \
  --boot-disk-size=${BOOT_DISK_SZ:-50} \
  --metadata=cos-update-strategy=update_disabled \
  --scopes=https://www.googleapis.com/auth/cloud-platform \
  --network-interface=nic-type=GVNIC,network=NETWORK_NAME_PREFIX-mgmt-net,subnet=NETWORK_NAME_PREFIX-mgmt-sub \
  --network-interface=nic-type=GVNIC,network=NETWORK_NAME_PREFIX-data-net-1,subnet=NETWORK_NAME_PREFIX-data-sub-1,no-address \
  --network-interface=nic-type=GVNIC,network=NETWORK_NAME_PREFIX-data-net-2,subnet=NETWORK_NAME_PREFIX-data-sub-2,no-address \
  --network-interface=nic-type=GVNIC,network=NETWORK_NAME_PREFIX-data-net-3,subnet=NETWORK_NAME_PREFIX-data-sub-3,no-address \
  --network-interface=nic-type=GVNIC,network=NETWORK_NAME_PREFIX-data-net-4,subnet=NETWORK_NAME_PREFIX-data-sub-4,no-address

Replace the following:

VM_NAME: the name of your VM.
PROJECT_ID: your project ID.
ZONE: a zone that supports your machine type.
MACHINE_TYPE: the machine type for the VM. Specify either a3-highgpu-8g or a3-edgegpu-8g.
NETWORK_NAME_PREFIX: the name prefix to use for the Virtual Private Cloud networks and subnets.
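The five `--network-interface` flags in the create command follow a regular pattern, so they can be generated rather than typed by hand. The following sketch illustrates one way to do that. The function name `a3_nic_flags` is a hypothetical helper, and the sketch assumes the network and subnet names created earlier in this document.

```shell
#!/bin/bash
# Hypothetical helper: print one --network-interface flag per line.
# The first NIC attaches to the management network; the other four
# attach to the data networks without an external address.
a3_nic_flags() {
  local prefix="$1" n
  echo "--network-interface=nic-type=GVNIC,network=${prefix}-mgmt-net,subnet=${prefix}-mgmt-sub"
  for n in 1 2 3 4; do
    echo "--network-interface=nic-type=GVNIC,network=${prefix}-data-net-${n},subnet=${prefix}-data-sub-${n},no-address"
  done
}

# Example (bash 4 or later): collect the flags into an array, then pass
# "${NIC_FLAGS[@]}" to gcloud compute instances create together with
# the other flags from the command in this section.
# mapfile -t NIC_FLAGS < <(a3_nic_flags NETWORK_NAME_PREFIX)
```

Generating the flags keeps the order of the NICs fixed, which matters because the first interface is the primary NIC.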

Install GPU drivers

On each A3 VM, complete the following steps.

Install the NVIDIA GPU drivers by running the following command:
```
sudo cos-extensions install gpu -- --version=latest
```

Remount the driver installation path with execute permission by running the following commands:

sudo mount --bind /var/lib/nvidia /var/lib/nvidia
sudo mount -o remount,exec /var/lib/nvidia
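Because later steps pass /dev/nvidia0 through /dev/nvidia7 to containers, a quick sanity check is to count the GPU device nodes after the driver installation. The following is a rough sketch. The function name `a3_count_gpu_devices` is a hypothetical helper, and the optional directory argument is an assumption added only so that the function can be exercised without real devices.

```shell
#!/bin/bash
# Hypothetical helper: count NVIDIA GPU device nodes such as nvidia0
# and nvidia1 in a directory. Nodes like nvidiactl and nvidia-uvm are
# not counted because the pattern requires a digit after "nvidia".
a3_count_gpu_devices() {
  local dir="${1:-/dev}" count=0 node
  for node in "$dir"/nvidia[0-9]*; do
    if [ -e "$node" ]; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}

# Example: on an A3 VM the expected count is 8.
# a3_count_gpu_devices
```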

Give the NICs access to the GPUs

On each A3 VM, give the NICs access to the GPUs by completing the following steps:

Configure the registry.
- If you are using Container Registry, run the following command:
```
docker-credential-gcr configure-docker
```
- If you are using Artifact Registry, run the following command:
```
docker-credential-gcr configure-docker --registries us-docker.pkg.dev
```

Configure the receive data path manager. A management service, GPUDirect-TCPX Receive Data Path Manager, needs to run alongside the applications that use GPUDirect-TCPX. To start the service on each Container-Optimized OS VM, run the following command:

docker run --pull=always --rm \
  --name receive-datapath-manager \
  --detach \
  --privileged \
  --cap-add=NET_ADMIN --network=host \
  --volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 \
  --device /dev/nvidia0:/dev/nvidia0 \
  --device /dev/nvidia1:/dev/nvidia1 \
  --device /dev/nvidia2:/dev/nvidia2 \
  --device /dev/nvidia3:/dev/nvidia3 \
  --device /dev/nvidia4:/dev/nvidia4 \
  --device /dev/nvidia5:/dev/nvidia5 \
  --device /dev/nvidia6:/dev/nvidia6 \
  --device /dev/nvidia7:/dev/nvidia7 \
  --device /dev/nvidia-uvm:/dev/nvidia-uvm \
  --device /dev/nvidiactl:/dev/nvidiactl \
  --env LD_LIBRARY_PATH=/usr/local/nvidia/lib64 \
  --volume /run/tcpx:/run/tcpx \
  --entrypoint /tcpgpudmarxd/build/app/tcpgpudmarxd \
us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd \
  --gpu_nic_preset a3vm --gpu_shmem_type fd --uds_path "/run/tcpx" --setup_param "--verbose 128 2 0"

Verify that the receive-datapath-manager container started.

docker container logs --follow receive-datapath-manager

The output should resemble the following:

I0000 00:00:1687813309.406064       1 rx_rule_manager.cc:174] Rx Rule Manager server(s) started...

To stop viewing the logs, press Ctrl+C.
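When you set up several VMs, a non-interactive check can be more convenient than following the logs. The following sketch searches log text for the startup message shown in the sample output. The function name `a3_rxdm_started` is a hypothetical helper, and the sketch assumes that the message text matches the sample output exactly.

```shell
#!/bin/bash
# Hypothetical helper: read log text on standard input and succeed
# only if it contains the Rx Rule Manager startup message.
a3_rxdm_started() {
  grep -q -F "Rx Rule Manager server(s) started"
}

# Example (requires Docker and the running container):
# docker container logs receive-datapath-manager 2>&1 | a3_rxdm_started \
#   && echo "receive-datapath-manager is up"
```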

Install the iptables rule.

sudo iptables -I INPUT -p tcp -m tcp -j ACCEPT
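Running the preceding command more than once inserts duplicate rules. As a hedged sketch, the rule can be added only when it is missing by checking first with the `-C` option of iptables. The function name `a3_allow_tcp_input` and the `IPTABLES_CMD` variable are hypothetical names introduced here for illustration.

```shell
#!/bin/bash
# Hypothetical helper: insert the TCP accept rule only if it is not
# already present. IPTABLES_CMD is an assumed override that defaults
# to "sudo iptables".
a3_allow_tcp_input() {
  local ipt="${IPTABLES_CMD:-sudo iptables}"
  # -C checks whether a matching rule exists without changing anything.
  if $ipt -C INPUT -p tcp -m tcp -j ACCEPT 2>/dev/null; then
    echo "rule already present"
  else
    $ipt -I INPUT -p tcp -m tcp -j ACCEPT && echo "rule inserted"
  fi
}

# Example: a3_allow_tcp_input
```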

Configure the NVIDIA Collective Communications Library (NCCL) and GPUDirect-TCPX plugin.

A specific combination of NCCL library version and GPUDirect-TCPX plugin binary is required to use NCCL with GPUDirect-TCPX support. Google Cloud provides packages that meet this requirement.

To install the Google Cloud package, run the following command:
```
docker run --rm -v /var/lib:/var/lib us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx install --install-nccl
sudo mount --bind /var/lib/tcpx /var/lib/tcpx
sudo mount -o remount,exec /var/lib/tcpx
```
If these commands are successful, the libnccl-net.so and libnccl.so files are placed in the /var/lib/tcpx/lib64 directory.
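To confirm the result of the installation, you can check that both library files exist. The following is a minimal sketch. The function name `a3_check_tcpx_libs` is a hypothetical helper, and the optional directory argument is an assumption that defaults to the directory named above.

```shell
#!/bin/bash
# Hypothetical helper: report any missing NCCL or plugin library and
# return a non-zero status if at least one file is absent.
a3_check_tcpx_libs() {
  local dir="${1:-/var/lib/tcpx/lib64}" lib missing=0
  for lib in libnccl.so libnccl-net.so; do
    if [ ! -e "$dir/$lib" ]; then
      echo "missing: $dir/$lib" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: a3_check_tcpx_libs && echo "NCCL and plugin libraries found"
```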

Run tests

On each A3 VM, run an NCCL test by completing the following steps:

Define the run_tcpx_container function, which starts the container.

#!/bin/bash

function run_tcpx_container() {
docker run \
  -u 0 --network=host \
  --cap-add=IPC_LOCK \
  --userns=host \
  --volume /run/tcpx:/tmp \
  --volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 \
  --volume /var/lib/tcpx/lib64:/usr/local/tcpx/lib64 \
  --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
  --device /dev/nvidia0:/dev/nvidia0 \
  --device /dev/nvidia1:/dev/nvidia1 \
  --device /dev/nvidia2:/dev/nvidia2 \
  --device /dev/nvidia3:/dev/nvidia3 \
  --device /dev/nvidia4:/dev/nvidia4 \
  --device /dev/nvidia5:/dev/nvidia5 \
  --device /dev/nvidia6:/dev/nvidia6 \
  --device /dev/nvidia7:/dev/nvidia7 \
  --device /dev/nvidia-uvm:/dev/nvidia-uvm \
  --device /dev/nvidiactl:/dev/nvidiactl \
  --env LD_LIBRARY_PATH=/usr/local/nvidia/lib64:/usr/local/tcpx/lib64 \
  "$@"
}

The preceding function does the following:

Mounts NVIDIA devices from /dev into the container
Sets the network namespace of the container to the host
Sets the user namespace of the container to the host
Adds CAP_IPC_LOCK to the capabilities of the container
Mounts /run/tcpx of the host to /tmp of the container
Mounts the installation path of NCCL and the GPUDirect-TCPX NCCL plugin into the container and adds the mounted path to LD_LIBRARY_PATH
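The eight `--device` flags for the GPUs differ only in the device index, so they can also be generated in a loop. The following sketch shows one possible approach. The function name `a3_gpu_device_flags` is a hypothetical helper and is not part of Docker or of the GPUDirect-TCPX packages.

```shell
#!/bin/bash
# Hypothetical helper: print the docker --device arguments for GPU
# device nodes 0 through count-1, one argument per line so that the
# output can be read into an array.
a3_gpu_device_flags() {
  local count="${1:-8}" i=0
  while [ "$i" -lt "$count" ]; do
    echo "--device"
    echo "/dev/nvidia${i}:/dev/nvidia${i}"
    i=$((i + 1))
  done
}

# Example (bash 4 or later): build the array, then pass
# "${GPU_FLAGS[@]}" to docker run in place of the eight --device lines.
# mapfile -t GPU_FLAGS < <(a3_gpu_device_flags 8)
```

The /dev/nvidia-uvm and /dev/nvidiactl devices do not follow the numbered pattern, so they still need to be passed explicitly.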

After you start the container, applications that use NCCL can run from inside the container. For example, to run the run-allgather test, complete the following steps:
1. On each A3 VM, run the following:
```
run_tcpx_container -it --rm us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx shell
```
2. On one VM, run the following commands:
  1. Set up the connection between the VMs. Replace VM-0 and VM-1 with the names of each VM.
```
/scripts/init_ssh.sh VM-0 VM-1
pushd /scripts && /scripts/gen_hostfiles.sh VM-0 VM-1; popd
```
    This creates a /scripts/hostfiles2 directory on each VM.
  2. Run the script.
```
/scripts/run-allgather.sh 8 eth1,eth2,eth3,eth4 1M 512M 2
```
    The run-allgather script takes about two minutes to run. At the end of the logs, you'll see the all-gather results.
    
    If the following line appears in your NCCL logs, GPUDirect-TCPX was initialized successfully.
```
NCCL INFO NET/GPUDirectTCPX ver. 3.1.1.
```
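If you save the test output to a file, the plugin version can be extracted from the log instead of being located by eye. The following is a rough sketch. The function name `a3_tcpx_plugin_version` is a hypothetical helper, and the sketch assumes that the log line has the format shown above with a version that contains at least one dot.

```shell
#!/bin/bash
# Hypothetical helper: read NCCL log text on standard input and print
# the version from the first "NET/GPUDirectTCPX ver." line, without
# the trailing period. Prints nothing if no such line is found.
a3_tcpx_plugin_version() {
  sed -n 's/.*NET\/GPUDirectTCPX ver\. \([0-9][0-9.]*[0-9]\).*/\1/p' | head -n 1
}

# Example (nccl.log is an assumed file name for the saved output):
# a3_tcpx_plugin_version < nccl.log
```

An empty result suggests that the plugin was not loaded, in which case it is worth rechecking the library mount and LD_LIBRARY_PATH settings described earlier.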