This page shows you how to use the NVIDIA Collective Communication Library (NCCL) Fast Socket plugin to run more efficient workloads on your Google Kubernetes Engine (GKE) clusters.
Before you begin
Before you start, make sure you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
Limitations
- Compute Engine limitations apply.
- gVNIC limitations apply.
- NCCL Fast Socket is only supported on node pools that have hardware accelerators enabled.
Requirements
GKE Autopilot:
- GKE Autopilot clusters must be running 1.30.2-gke.1023000 or later.
For details, see Creating an Autopilot cluster.
GKE Standard:
- Your node pools must have gVNIC enabled to use NCCL Fast Socket.
- GKE nodes must use a Container-Optimized OS node image.
- Your clusters must be running GKE version 1.25.2-gke.1700 or later.
For details, see Creating a regional cluster.
Enable NCCL Fast Socket in Standard clusters
This section shows you how to enable the NCCL Fast Socket plugin in GKE Standard node pools. If you use GKE Autopilot clusters, GKE automatically enables the plugin when you request NCCL Fast Socket in your workloads. For instructions, see the NCCL Fast Socket in Autopilot section.
For Standard clusters, create a node pool that uses the NCCL Fast Socket plugin. You can also update an existing node pool by using the gcloud container node-pools update command.
gcloud container node-pools create NODEPOOL_NAME \
--accelerator type=ACCELERATOR_TYPE,count=ACCELERATOR_COUNT \
--machine-type=MACHINE_TYPE \
--cluster=CLUSTER_NAME \
--enable-fast-socket \
--enable-gvnic
Replace the following:
- NODEPOOL_NAME: the name of the new node pool.
- CLUSTER_NAME: the name of the cluster.
- ACCELERATOR_TYPE: the type of GPU accelerator that you use. For example, nvidia-tesla-t4.
- ACCELERATOR_COUNT: the number of GPUs per node.
- MACHINE_TYPE: the type of machine you want to use. NCCL Fast Socket is not supported on memory-optimized machine types.
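For an existing GPU node pool, the update path might look like the following sketch. It assumes that the node pool already has gVNIC enabled and GPUs attached, that a default location is configured for the gcloud CLI, and that --enable-fast-socket is accepted by the update command as the counterpart of the --no-enable-fast-socket flag used in the Disable NCCL Fast Socket section. The enable_fast_socket function name is hypothetical.

```shell
# Hypothetical helper: enable NCCL Fast Socket on an existing node pool.
# Assumes gVNIC and GPUs are already configured on the node pool.
enable_fast_socket() {
  nodepool="$1"   # NODEPOOL_NAME
  cluster="$2"    # CLUSTER_NAME
  gcloud container node-pools update "$nodepool" \
    --cluster="$cluster" \
    --enable-fast-socket
}

# Usage: enable_fast_socket NODEPOOL_NAME CLUSTER_NAME
```

Wrapping the command in a function only keeps the two placeholders in one place; running the gcloud command directly is equivalent.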
Install NVIDIA GPU device drivers
In Autopilot, GPU device drivers are automatically installed.
For Standard clusters, follow the instructions in Installing NVIDIA GPU device drivers to install the required NVIDIA device drivers on your nodes.
NCCL Fast Socket in Autopilot
In Autopilot clusters, you request NCCL Fast Socket in your workloads by using the cloud.google.com/gke-nccl-fastsocket node selector.
When you request NCCL Fast Socket in a workload, GKE enables gVNIC and NCCL Fast Socket on nodes that GKE provisions for the workload.
You can use NCCL Fast Socket with any GPU type that Autopilot supports.
The following Pod manifest requests NCCL Fast Socket:
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: GPU_TYPE
    cloud.google.com/gke-nccl-fastsocket: "true"
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/gpu: GPU_QUANTITY
Replace the following:
- GPU_TYPE: the type of GPU hardware. Allowed values are the following:
  - nvidia-h100-mega-80gb: NVIDIA H100 Mega (80GB)
  - nvidia-h100-80gb: NVIDIA H100 (80GB)
  - nvidia-a100-80gb: NVIDIA A100 (80GB)
  - nvidia-tesla-a100: NVIDIA A100 (40GB)
  - nvidia-l4: NVIDIA L4
  - nvidia-tesla-t4: NVIDIA T4
- GPU_QUANTITY: the number of GPUs to allocate to the container.
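As a rough sketch, the manifest could be applied and the Pod's node looked up as follows. The file name my-gpu-pod.yaml, the deploy_and_find_node function name, and the 15-minute timeout are assumptions for illustration; Autopilot might need several minutes to provision a GPU node before the Pod becomes ready.

```shell
# Hypothetical helper: apply a Pod manifest, wait until the Pod is Ready,
# then print the name of the node that the Pod was scheduled on.
deploy_and_find_node() {
  manifest="$1"   # path to the manifest file (name is an assumption)
  pod="$2"        # Pod name from metadata.name
  kubectl apply -f "$manifest" &&
    kubectl wait --for=condition=Ready "pod/$pod" --timeout=15m &&
    kubectl get pod "$pod" -o jsonpath='{.spec.nodeName}'
}

# Usage: deploy_and_find_node my-gpu-pod.yaml my-gpu-pod
```

The printed node name can then be used to confirm that an nccl-fastsocket-installer Pod is running on that node.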
Verify that NCCL Fast Socket is enabled
To verify that NCCL Fast Socket is enabled, view the Pods in the kube-system namespace:
kubectl get pods -n kube-system
The output is similar to the following:
NAME                              READY   STATUS    RESTARTS   AGE
nccl-fastsocket-installer-qvfdw   2/2     Running   0          10m
nccl-fastsocket-installer-rtjs4   2/2     Running   0          10m
nccl-fastsocket-installer-tm294   2/2     Running   0          10m
In this output, the number of Pods should be equal to the number of nodes in the node pool.
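The comparison between installer Pods and nodes can be scripted. The following is a minimal sketch for a Standard node pool; the two helper function names are hypothetical, and the node count relies on the cloud.google.com/gke-nodepool label that GKE sets on nodes. It counts installer Pods by name prefix only and does not check their status.

```shell
# Count installer Pods in `kubectl get pods --no-headers` output read from stdin.
# `|| true` keeps the exit status at 0 when grep finds no matches.
count_installer_pods() {
  grep -c '^nccl-fastsocket-installer-' || true
}

# Count non-empty lines read from stdin (one node per line).
count_lines() {
  grep -c . || true
}

# Usage (NODEPOOL_NAME is the placeholder used earlier on this page):
#   pods=$(kubectl get pods -n kube-system --no-headers | count_installer_pods)
#   nodes=$(kubectl get nodes -l cloud.google.com/gke-nodepool=NODEPOOL_NAME --no-headers | count_lines)
#   [ "$pods" -eq "$nodes" ] && echo "match" || echo "mismatch: $pods Pods, $nodes nodes"
```

If the cluster has more than one node pool with NCCL Fast Socket enabled, the Pod count covers all of them, so compare it against the total number of nodes in those pools.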
Disable NCCL Fast Socket
In GKE Autopilot clusters, the NCCL Fast Socket plugin is disabled by default. To disable the plugin on an existing workload, redeploy the workload without the NCCL Fast Socket node selector.
To disable NCCL Fast Socket for a node pool in Standard clusters, run the following command:
gcloud container node-pools update NODEPOOL_NAME \
--cluster=CLUSTER_NAME \
--no-enable-fast-socket
Existing nodes still have the plugin installed. You must manually resize the node pool to migrate workloads to new nodes.
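One possible way to do the resize is to scale the node pool to zero and back, so that every node is recreated. This is a sketch, not a recommended procedure: it evicts all workloads on the node pool while it runs, it assumes that cluster autoscaling does not interfere, and the recreate_nodes function name is hypothetical. For regional clusters, --num-nodes is the number of nodes per zone.

```shell
# Hypothetical helper: recreate every node in a node pool by scaling it
# to zero and then back to the given size.
# Warning: all workloads on the node pool are evicted while this runs.
recreate_nodes() {
  cluster="$1"    # CLUSTER_NAME
  nodepool="$2"   # NODEPOOL_NAME
  size="$3"       # node count to restore afterwards
  gcloud container clusters resize "$cluster" \
    --node-pool="$nodepool" --num-nodes=0 --quiet &&
    gcloud container clusters resize "$cluster" \
      --node-pool="$nodepool" --num-nodes="$size" --quiet
}

# Usage: recreate_nodes CLUSTER_NAME NODEPOOL_NAME 3
```

The --quiet flag skips the interactive confirmation prompt; omit it to confirm each resize manually.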
Troubleshooting
To troubleshoot gVNIC, see Troubleshooting Google Virtual NIC.
What's next
- Use network policy logging to record when connections to Pods are allowed or denied by your cluster's network policies.