This page shows you how to use the NVIDIA Collective Communication Library (NCCL) Fast Socket plugin to run more efficient workloads on your Google Kubernetes Engine (GKE) clusters.
Autopilot clusters don't support NCCL Fast Socket.
Before you begin
Before you start, make sure you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
Limitations
- Compute Engine limitations apply.
- gVNIC limitations apply.
- NCCL Fast Socket is only supported on node pools that have hardware accelerators enabled.
Requirements
- Your node pools must have gVNIC enabled to use NCCL Fast Socket.
- GKE nodes must use a Container-Optimized OS node image.
- Your clusters must be running GKE version 1.25.2-gke.1700 or later.
Create a cluster
Create a new cluster:
gcloud container clusters create CLUSTER_NAME \
--cluster-version=VERSION \
--region=COMPUTE_REGION
Replace the following:
- CLUSTER_NAME: the name of the new cluster.
- VERSION: the GKE version, which must be 1.25.2-gke.1700 or later. You can also use the --release-channel flag to select a release channel. The release channel must have a default version of 1.25.2-gke.1700 or later.
- COMPUTE_REGION: the Compute Engine region for the new cluster. For zonal clusters, use --zone=COMPUTE_ZONE.
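If you plan to reuse an existing cluster instead of creating one, it can help to confirm that the cluster meets the 1.25.2-gke.1700 minimum first. The following is a minimal sketch, not an official tool: the helper names gke_version_at_least and check_cluster_version are hypothetical, and the sketch assumes that versions follow the X.Y.Z-gke.N pattern and that the control plane version reported by gcloud container clusters describe is the one that matters.

```shell
# Hypothetical helper: succeeds (exit 0) if GKE version $1 is at least $2.
# Assumes both versions look like X.Y.Z-gke.N, for example 1.25.2-gke.1700.
gke_version_at_least() {
  printf '%s %s\n' "$1" "$2" | awk '{
    # Turn "1.25.2-gke.1700" into "1.25.2.1700" so all four parts split on dots.
    gsub(/-gke\./, ".", $1); gsub(/-gke\./, ".", $2)
    split($1, a, "."); split($2, b, ".")
    # Compare major, minor, patch, and GKE build numerically, left to right.
    for (i = 1; i <= 4; i++) {
      if (a[i] + 0 > b[i] + 0) exit 0
      if (a[i] + 0 < b[i] + 0) exit 1
    }
    exit 0
  }'
}

# Hypothetical wrapper: $1 is the cluster name, $2 is the Compute Engine region.
check_cluster_version() {
  current=$(gcloud container clusters describe "$1" --region="$2" \
    --format="value(currentMasterVersion)") || return 1
  if gke_version_at_least "$current" "1.25.2-gke.1700"; then
    echo "OK: $1 runs $current"
  else
    echo "Too old for NCCL Fast Socket: $1 runs $current" >&2
    return 1
  fi
}
```

For example, run check_cluster_version CLUSTER_NAME COMPUTE_REGION after sourcing the functions. For a zonal cluster, you would replace --region with --zone in the wrapper.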
Enable NCCL Fast Socket
You can enable NCCL Fast Socket on an existing node pool by using gcloud container node-pools update. To create a new node pool that uses the NCCL Fast Socket plugin, run the following command:
gcloud container node-pools create NODEPOOL_NAME \
--accelerator type=ACCELERATOR_TYPE,count=ACCELERATOR_COUNT \
--machine-type=MACHINE_TYPE \
--cluster=CLUSTER_NAME \
--enable-fast-socket \
--enable-gvnic
Replace the following:
- NODEPOOL_NAME: the name of the new node pool.
- CLUSTER_NAME: the name of the cluster.
- ACCELERATOR_TYPE: the type of GPU accelerator that you use. For example, nvidia-tesla-t4.
- ACCELERATOR_COUNT: the number of GPUs per node.
- MACHINE_TYPE: the type of machine you want to use. NCCL Fast Socket is not supported on memory-optimized machine types.
Install NVIDIA GPU device drivers
Follow the instructions in Installing NVIDIA GPU device drivers to install the required NVIDIA device drivers on your nodes.
Verify that NCCL Fast Socket is enabled
To verify that NCCL Fast Socket is enabled, list the Pods in the kube-system namespace:
kubectl get pods -n kube-system
The output is similar to the following:
NAME                              READY   STATUS    RESTARTS   AGE
nccl-fastsocket-installer-qvfdw   2/2     Running   0          10m
nccl-fastsocket-installer-rtjs4   2/2     Running   0          10m
nccl-fastsocket-installer-tm294   2/2     Running   0          10m
In this output, the number of nccl-fastsocket-installer Pods should equal the number of nodes in the node pool.
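The comparison between installer Pods and nodes can be scripted. The following is a rough sketch under stated assumptions: the function names are hypothetical, the Pod name prefix is taken from the sample output above, and nodes are selected by the cloud.google.com/gke-nodepool label that GKE sets on nodes. If more than one node pool in the cluster has NCCL Fast Socket enabled, the Pod count covers all of them, so the numbers would not match a single node pool.

```shell
# Hypothetical helper: reads "kubectl get pods" output on stdin and prints
# how many nccl-fastsocket-installer Pods are in the Running state.
count_running_installers() {
  awk '$1 ~ /^nccl-fastsocket-installer-/ && $3 == "Running" { n++ }
       END { print n + 0 }'
}

# Hypothetical wrapper: $1 is the node pool name.
# Succeeds only if the Running installer count equals the node count.
compare_installers_to_nodes() {
  pods=$(kubectl get pods -n kube-system --no-headers | count_running_installers)
  nodes=$(kubectl get nodes -l cloud.google.com/gke-nodepool="$1" --no-headers \
    | wc -l | tr -d ' ')
  echo "installer Pods running: $pods, nodes in $1: $nodes"
  [ "$pods" -eq "$nodes" ]
}
```

For example, compare_installers_to_nodes NODEPOOL_NAME prints both counts and returns a non-zero status if they differ.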
Disable NCCL Fast Socket
To disable NCCL Fast Socket, run the following command:
gcloud container node-pools update NODEPOOL_NAME \
--cluster=CLUSTER_NAME \
--no-enable-fast-socket
This command doesn't remove the plugin from existing nodes, which still have it installed. You must manually resize the node pool to migrate workloads to new nodes.
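One possible way to do the resize is to scale the node pool to zero and then back to its previous size, so that every node is recreated. This is only a sketch of that interpretation, not a documented procedure: the run and recreate_nodes helpers are hypothetical, scaling to zero evicts every workload on the node pool, and for regional clusters --num-nodes is counted per zone. Setting DRY_RUN=1 prints the commands for review instead of running them.

```shell
# Hypothetical helper: print the command when DRY_RUN=1; otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: recreate all nodes in a node pool.
# Arguments: cluster name, node pool name, region, node count to restore.
recreate_nodes() {
  # Remove every existing node (and the plugin installed on it).
  run gcloud container clusters resize "$1" \
    --node-pool="$2" --region="$3" --num-nodes=0 --quiet || return 1
  # Bring the node pool back to the requested size with fresh nodes.
  run gcloud container clusters resize "$1" \
    --node-pool="$2" --region="$3" --num-nodes="$4" --quiet
}
```

For example, DRY_RUN=1 recreate_nodes CLUSTER_NAME NODEPOOL_NAME COMPUTE_REGION 3 shows the two resize commands without changing the cluster. For zonal clusters, you would replace --region with --zone.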
Troubleshooting
To troubleshoot gVNIC, see Troubleshooting Google Virtual NIC.
What's next
- Use network policy logging to record when connections to Pods are allowed or denied by your cluster's network policies.