Improve workload efficiency using NCCL Fast Socket


This page shows you how to use the NVIDIA Collective Communication Library (NCCL) Fast Socket plugin to run more efficient workloads on your Google Kubernetes Engine (GKE) clusters.

Autopilot clusters don't support NCCL Fast Socket.

Before you begin

Before you start, make sure you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.

Limitations

Requirements

  • Your node pools must have gVNIC enabled to use NCCL Fast Socket.
  • GKE nodes must use a Container-Optimized OS node image.
  • Your clusters must be running GKE version 1.25.2-gke.1700 or later.
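The version requirement above can be checked in a script before you create node pools. The following is a minimal sketch, assuming version strings follow the MAJOR.MINOR.PATCH-gke.BUILD form shown above. The function name meets_min_version is hypothetical and is not part of the gcloud CLI.

```shell
#!/bin/sh
# Hypothetical helper (not a gcloud feature): succeeds when a GKE version
# string is at least the minimum required for NCCL Fast Socket.
# Assumes versions look like MAJOR.MINOR.PATCH-gke.BUILD.
meets_min_version() {
  [ -n "$1" ] || return 2                   # no version given
  printf '%s %s\n' "$1" "${2:-1.25.2-gke.1700}" | awk '
    {
      # Split each version on "." and "-", giving: 1 25 2 gke 1700
      n = split($1, have, /[.-]/)
      m = split($2, min, /[.-]/)
      if (n != 5 || m != 5) exit 2          # unexpected format
      for (i = 1; i <= 5; i++) {
        if (i == 4) continue                # skip the literal "gke" field
        if (have[i] + 0 > min[i] + 0) exit 0
        if (have[i] + 0 < min[i] + 0) exit 1
      }
      exit 0                                # versions are equal
    }'
}

# Example usage against a live cluster (requires gcloud and credentials):
# version=$(gcloud container clusters describe CLUSTER_NAME \
#     --region=COMPUTE_REGION --format="value(currentMasterVersion)")
# meets_min_version "$version" && echo "Version OK for NCCL Fast Socket"
```

The comparison is numeric per field, so 1.25.10-gke.200 is treated as newer than 1.25.2-gke.1700, which a plain string comparison would get wrong.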

Create a cluster

Create a new cluster:

gcloud container clusters create CLUSTER_NAME \
    --cluster-version=VERSION \
    --region=COMPUTE_REGION

Replace the following:

  • CLUSTER_NAME: the name of the new cluster.
  • VERSION: the GKE version, which must be 1.25.2-gke.1700 or later. You can also use the --release-channel flag to select a release channel. The release channel must have a default version of 1.25.2-gke.1700 or later.
  • COMPUTE_REGION: the Compute Engine region for the new cluster. For zonal clusters, use --zone=COMPUTE_ZONE.

Enable NCCL Fast Socket

Create a node pool that uses the NCCL Fast Socket plugin. You can also update an existing node pool using gcloud container node-pools update.

gcloud container node-pools create NODEPOOL_NAME \
    --accelerator type=ACCELERATOR_TYPE,count=ACCELERATOR_COUNT \
    --machine-type=MACHINE_TYPE \
    --cluster=CLUSTER_NAME \
    --enable-fast-socket \
    --enable-gvnic

Replace the following:

  • NODEPOOL_NAME: the name of the new node pool.
  • CLUSTER_NAME: the name of the cluster.
  • ACCELERATOR_TYPE: the GPU type. Can be one of the following:
    • nvidia-tesla-k80
    • nvidia-tesla-p100
    • nvidia-tesla-p4
    • nvidia-tesla-v100
    • nvidia-tesla-t4
    • nvidia-tesla-a100
    • nvidia-a100-80gb
    • nvidia-l4
  • ACCELERATOR_COUNT: the number of GPUs per node.
  • MACHINE_TYPE: the type of machine you want to use. NCCL Fast Socket is not supported on memory-optimized machine types.
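Because memory-optimized machine types are not supported, a script that creates node pools might reject them up front. The following is a rough sketch. It assumes that memory-optimized machine type names start with m1-, m2-, m3-, m4-, or x4-, so check that list against the current Compute Engine machine family documentation. The function name is_supported_machine_type is hypothetical.

```shell
#!/bin/sh
# Hypothetical preflight check (not a gcloud feature).
# Assumption: memory-optimized machine type names start with
# m1-, m2-, m3-, m4-, or x4-. Adjust the patterns if that list changes.
is_supported_machine_type() {
  case "$1" in
    "")            return 2 ;;   # no machine type given
    m[1-4]-*|x4-*) return 1 ;;   # looks memory-optimized: not supported
    *)             return 0 ;;   # anything else passes this check
  esac
}

# Example usage when enabling the plugin on an existing node pool
# (requires gcloud and credentials; gVNIC must already be enabled):
# if is_supported_machine_type "MACHINE_TYPE"; then
#   gcloud container node-pools update NODEPOOL_NAME \
#       --cluster=CLUSTER_NAME \
#       --enable-fast-socket
# fi
```

This check only filters on the machine type name. It does not confirm that the chosen GPU type is available on that machine type.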

Install NVIDIA GPU device drivers

Follow the instructions in Installing NVIDIA GPU device drivers to install the required NVIDIA device drivers on your nodes.

Verify that NCCL Fast Socket is enabled

To verify that NCCL Fast Socket is enabled, list the Pods in the kube-system namespace:

kubectl get pods -n kube-system

The output is similar to the following:

NAME                             READY   STATUS    RESTARTS   AGE
nccl-fastsocket-installer-qvfdw  2/2     Running   0          10m
nccl-fastsocket-installer-rtjs4  2/2     Running   0          10m
nccl-fastsocket-installer-tm294  2/2     Running   0          10m

In this output, the number of nccl-fastsocket-installer Pods should equal the number of nodes in the node pool.
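The comparison between installer Pods and nodes can also be scripted. The following is a minimal sketch, assuming the installer Pod names start with nccl-fastsocket-installer- as in the sample output. The function name count_running_installers is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get pods -n kube-system` output on
# stdin and prints how many NCCL Fast Socket installer Pods are Running.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
count_running_installers() {
  awk '$1 ~ /^nccl-fastsocket-installer-/ && $3 == "Running" { n++ }
       END { print n + 0 }'
}

# Example usage against a live cluster (requires kubectl and credentials):
# pods=$(kubectl get pods -n kube-system --no-headers | count_running_installers)
# nodes=$(kubectl get nodes -l cloud.google.com/gke-nodepool=NODEPOOL_NAME \
#     --no-headers | wc -l | tr -d ' ')
# [ "$pods" -eq "$nodes" ] && echo "Installer is running on every node"
```

The node count uses the cloud.google.com/gke-nodepool node label to select only the nodes in the node pool you are checking.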

Disable NCCL Fast Socket

To disable NCCL Fast Socket, run the following command:

gcloud container node-pools update NODEPOOL_NAME \
    --cluster=CLUSTER_NAME \
    --no-enable-fast-socket

Disabling the feature doesn't remove the plugin from existing nodes, which still have it installed. You must manually resize the node pool to migrate workloads to new nodes.
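After you run the update, you might want to confirm the node pool setting from a script. The following is a sketch only. It assumes that the node pool description exposes the setting as config.fastSocket.enabled and that gcloud prints it as True when set, so verify the field name against your own describe output. The function name fast_socket_state is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: normalizes the value printed by a describe command
# into "enabled" or "disabled". An empty or missing value is treated as
# disabled (assumption: the field is absent when the feature is off).
fast_socket_state() {
  case "$1" in
    True|true) echo enabled ;;
    *)         echo disabled ;;
  esac
}

# Example usage against a live node pool (requires gcloud and credentials;
# the config.fastSocket.enabled field path is an assumption):
# value=$(gcloud container node-pools describe NODEPOOL_NAME \
#     --cluster=CLUSTER_NAME \
#     --format="value(config.fastSocket.enabled)")
# echo "NCCL Fast Socket is $(fast_socket_state "$value")"
```

This reports only the node pool configuration. It does not tell you whether individual existing nodes still have the plugin installed.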

Troubleshooting

To troubleshoot gVNIC, see Troubleshooting Google Virtual NIC.
