Improve workload efficiency using NCCL Fast Socket


This page shows you how to use the NVIDIA Collective Communication Library (NCCL) Fast Socket plugin to run more efficient workloads on your Google Kubernetes Engine (GKE) clusters.

Before you begin

Before you start, make sure you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.

Limitations

Requirements

GKE Autopilot:

  • GKE Autopilot clusters must be running 1.30.2-gke.1023000 or later.

For details, see Creating an Autopilot cluster.

GKE Standard:

  • Your node pools must have gVNIC enabled to use NCCL Fast Socket.
  • GKE nodes must use a Container-Optimized OS node image.
  • Your clusters must be running GKE version 1.25.2-gke.1700 or later.

For details, see Creating a regional cluster.
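The version requirements above can be checked from the command line. The following is a minimal sketch, not an official tool: the helper names `version_at_least` and `check_cluster_version` are illustrative, the comparison assumes GNU `sort -V` is available, and it assumes the cluster's control plane version (`currentMasterVersion`) is the version you want to compare.

```shell
# Minimum GKE versions taken from the requirements above.
MIN_STANDARD_VERSION="1.25.2-gke.1700"
MIN_AUTOPILOT_VERSION="1.30.2-gke.1023000"

# Succeeds when version $1 is equal to or later than version $2.
# Assumption: GNU `sort -V` orders GKE version strings correctly.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Illustrative helper: reads the control plane version of a cluster and
# compares it with a minimum version. Requires gcloud and cluster access.
check_cluster_version() {
  cluster="$1"
  location="$2"
  minimum="$3"
  current=$(gcloud container clusters describe "$cluster" \
      --location="$location" \
      --format="value(currentMasterVersion)")
  if version_at_least "$current" "$minimum"; then
    echo "OK: ${cluster} runs ${current}"
  else
    echo "Too old: ${cluster} runs ${current}, need ${minimum} or later"
    return 1
  fi
}

# Example call (CLUSTER_NAME and LOCATION are placeholders):
#   check_cluster_version CLUSTER_NAME LOCATION "$MIN_STANDARD_VERSION"
```

For an Autopilot cluster, pass `"$MIN_AUTOPILOT_VERSION"` as the third argument instead.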

Enable NCCL Fast Socket in Standard clusters

This section shows you how to enable the NCCL Fast Socket plugin in GKE Standard node pools. If you use GKE Autopilot clusters, GKE automatically enables the plugin when you request NCCL Fast Socket in your workloads. For instructions, see the NCCL Fast Socket in Autopilot section.

For Standard clusters, create a node pool that uses the NCCL Fast Socket plugin. You can also update an existing node pool using gcloud container node-pools update.

gcloud container node-pools create NODEPOOL_NAME \
    --accelerator type=ACCELERATOR_TYPE,count=ACCELERATOR_COUNT \
    --machine-type=MACHINE_TYPE \
    --cluster=CLUSTER_NAME \
    --enable-fast-socket \
    --enable-gvnic

Replace the following:

  • NODEPOOL_NAME: the name of the new node pool.
  • CLUSTER_NAME: the name of the cluster.
  • ACCELERATOR_TYPE: the type of GPU accelerator that you use. For example, nvidia-tesla-t4.
  • ACCELERATOR_COUNT: the number of GPUs per node.
  • MACHINE_TYPE: the type of machine you want to use. NCCL Fast Socket is not supported on memory-optimized machine types.
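For an existing node pool, the update path mentioned earlier might look like the following sketch. It assumes that gVNIC is already enabled on the node pool and that `--enable-fast-socket` is accepted by `gcloud container node-pools update` as the counterpart of the `--no-enable-fast-socket` flag used later on this page. The `enable_fast_socket` wrapper and the `GCLOUD` variable are illustrative names, not part of the gcloud CLI.

```shell
# Assumption: set GCLOUD=echo to print the command instead of running it.
GCLOUD="${GCLOUD:-gcloud}"

# Illustrative wrapper: enable NCCL Fast Socket on an existing node pool.
# $1 is the node pool name, $2 is the cluster name.
enable_fast_socket() {
  "$GCLOUD" container node-pools update "$1" \
      --cluster="$2" \
      --enable-fast-socket
}

# Example call (NODEPOOL_NAME and CLUSTER_NAME are placeholders):
#   enable_fast_socket NODEPOOL_NAME CLUSTER_NAME
```

Running with `GCLOUD=echo` first lets you review the exact command before it changes the node pool.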

Install NVIDIA GPU device drivers

In Autopilot, GPU device drivers are automatically installed.

For Standard clusters, follow the instructions in Installing NVIDIA GPU device drivers to install the required NVIDIA device drivers on your nodes.

NCCL Fast Socket in Autopilot

In Autopilot clusters, you request NCCL Fast Socket in your workloads by using the cloud.google.com/gke-nccl-fastsocket node selector. When you request NCCL Fast Socket in a workload, GKE enables gVNIC and NCCL Fast Socket on nodes that GKE provisions for the workload. You can use NCCL Fast Socket with any GPU type that Autopilot supports.

The following Pod manifest requests NCCL Fast Socket:

apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: GPU_TYPE
    cloud.google.com/gke-nccl-fastsocket: "true"
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/gpu: GPU_QUANTITY

Replace the following:

  • GPU_TYPE: the type of GPU hardware. Allowed values are the following:
    • nvidia-h100-mega-80gb: NVIDIA H100 Mega (80GB)
    • nvidia-h100-80gb: NVIDIA H100 (80GB)
    • nvidia-a100-80gb: NVIDIA A100 (80GB)
    • nvidia-tesla-a100: NVIDIA A100 (40GB)
    • nvidia-l4: NVIDIA L4
    • nvidia-tesla-t4: NVIDIA T4
  • GPU_QUANTITY: the number of GPUs to allocate to the container.
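One way to fill in the placeholders and deploy the manifest is to generate it from a small shell function and pipe it to `kubectl apply`. This is a sketch under assumptions: the function name `render_fastsocket_pod` is illustrative, and the manifest body is the example above with the two placeholders substituted.

```shell
# Illustrative helper: print the example Pod manifest with the GPU type ($1)
# and GPU quantity ($2) filled in.
render_fastsocket_pod() {
  gpu_type="$1"
  gpu_quantity="$2"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: ${gpu_type}
    cloud.google.com/gke-nccl-fastsocket: "true"
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/gpu: ${gpu_quantity}
EOF
}

# Example: request one NVIDIA L4 GPU and deploy the Pod (needs cluster access):
#   render_fastsocket_pod nvidia-l4 1 | kubectl apply -f -
```

Piping to `kubectl apply -f -` avoids keeping a separate file per GPU type.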

Verify that NCCL Fast Socket is enabled

To verify that NCCL Fast Socket is enabled, list the Pods in the kube-system namespace:

kubectl get pods -n kube-system

The output is similar to the following:

NAME                             READY   STATUS    RESTARTS   AGE
nccl-fastsocket-installer-qvfdw  2/2     Running   0          10m
nccl-fastsocket-installer-rtjs4  2/2     Running   0          10m
nccl-fastsocket-installer-tm294  2/2     Running   0          10m

In this output, the number of Pods should be equal to the number of nodes in the node pool.
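The comparison between installer Pods and nodes can be scripted. The following sketch assumes that only one node pool in the cluster has NCCL Fast Socket enabled and that nodes carry the `cloud.google.com/gke-nodepool` label. The function names `count_installer_pods` and `check_fast_socket_pods` are illustrative.

```shell
# Count Running installer Pods in `kubectl get pods` output read from stdin.
count_installer_pods() {
  awk '$1 ~ /^nccl-fastsocket-installer-/ && $3 == "Running" { n++ }
       END { print n + 0 }'
}

# Illustrative helper: compare the installer Pod count with the number of
# nodes in one node pool ($1). Requires kubectl and cluster access.
check_fast_socket_pods() {
  nodepool="$1"
  pods=$(kubectl get pods -n kube-system --no-headers | count_installer_pods)
  nodes=$(kubectl get nodes -l "cloud.google.com/gke-nodepool=${nodepool}" \
      --no-headers | wc -l | tr -d ' ')
  if [ "$pods" -eq "$nodes" ]; then
    echo "OK: ${pods} installer Pods for ${nodes} nodes"
  else
    echo "Mismatch: ${pods} installer Pods for ${nodes} nodes"
    return 1
  fi
}

# Example call (NODEPOOL_NAME is a placeholder):
#   check_fast_socket_pods NODEPOOL_NAME
```

If several node pools use the plugin, the cluster-wide Pod count is larger than the node count of any single pool, so the check would need to sum the nodes across those pools.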

Disable NCCL Fast Socket

In GKE Autopilot clusters, the NCCL Fast Socket plugin is disabled by default. To disable the plugin for an existing workload, redeploy the workload without the cloud.google.com/gke-nccl-fastsocket node selector.

To disable NCCL Fast Socket for a node pool in Standard clusters, run the following command:

gcloud container node-pools update NODEPOOL_NAME \
    --cluster=CLUSTER_NAME \
    --no-enable-fast-socket

Existing nodes keep the plugin installed after this update. To run workloads on nodes without the plugin, manually resize the node pool so that the workloads migrate to new nodes.

Troubleshooting

To troubleshoot gVNIC, see Troubleshooting Google Virtual NIC.

What's next