This page helps you decide when to use the NVIDIA GPU Operator and shows you how to enable the NVIDIA GPU Operator on GKE.
Overview
Operators are Kubernetes software extensions that allow users to create custom resources that manage applications and their components. You can use operators to automate complex tasks beyond what Kubernetes itself provides, such as deploying and upgrading applications.
The NVIDIA GPU Operator is a Kubernetes operator that provides a common infrastructure and API for deploying, configuring, and managing software components needed to provision NVIDIA GPUs in a Kubernetes cluster. The NVIDIA GPU Operator provides you with a consistent experience, simplifies GPU resource management, and streamlines the integration of GPU-accelerated workloads into Kubernetes.
Why use the NVIDIA GPU Operator?
We recommend using GKE GPU management for your GPU nodes, because GKE fully manages the GPU node lifecycle. To get started with using GKE to manage your GPU nodes, see Run GPUs in Standard node pools.
Alternatively, the NVIDIA GPU Operator might be a suitable option if you're looking for a consistent experience across multiple cloud service providers, if you're already using the NVIDIA GPU Operator, or if you're using software that depends on the NVIDIA GPU Operator.
For more considerations when deciding between these options, refer to Manage the GPU stack through GKE or the NVIDIA GPU Operator on GKE.
Limitations
The NVIDIA GPU Operator is supported on both Container-Optimized OS (COS) and Ubuntu node images with the following limitations:
- The NVIDIA GPU Operator is supported on GKE with GPU Operator version 24.6.0 and later.
- The NVIDIA GPU Operator is not supported on Autopilot clusters.
- The NVIDIA GPU Operator is not supported on Windows node images.
- The NVIDIA GPU Operator is not managed by GKE. To upgrade the NVIDIA GPU Operator, refer to the NVIDIA documentation.
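Because GKE doesn't manage the Operator, upgrades typically go through Helm. The following is a minimal sketch, assuming you installed the chart with Helm as described later on this page; find the generated release name with helm list -n gpu-operator, and always follow the NVIDIA upgrade documentation for version-specific steps and values:
helm repo update \
  && helm upgrade RELEASE_NAME nvidia/gpu-operator -n gpu-operator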
Before you begin
Before you start, make sure you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
- Make sure you meet the requirements in Run GPUs in Standard node pools.
Verify that you have Helm installed in your development environment. Helm comes pre-installed on Cloud Shell.
While there is no specific Helm version requirement, you can use the following command to verify that you have Helm installed.
helm version
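If Helm is installed, the output is similar to the following (the version and build details will differ):
version.BuildInfo{Version:"v3.x.y", GitCommit:"…", GitTreeState:"clean", GoVersion:"go1.x.y"}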
If the output is similar to Command 'helm' not found, then you can install the Helm CLI by running this command:
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
  && chmod 700 get_helm.sh \
  && ./get_helm.sh
Create and set up the GPU node pool
To create and set up the GPU node pool, follow these steps:
Create a GPU node pool by following the instructions on how to Create a GPU node pool with the following modifications:
- Set gpu-driver-version=disabled to skip automatic GPU driver installation, since it isn't supported when using the NVIDIA GPU Operator.
- Set --node-labels="gke-no-default-nvidia-gpu-device-plugin=true" to disable the GKE-managed GPU device plugin DaemonSet.
Run the following command and append other flags for GPU node pool creation as needed:
gcloud container node-pools create POOL_NAME \
  --accelerator type=GPU_TYPE,count=AMOUNT,gpu-driver-version=disabled \
  --node-labels="gke-no-default-nvidia-gpu-device-plugin=true"
Replace the following:
- POOL_NAME: the name you chose for the node pool.
- GPU_TYPE: the type of GPU accelerator that you want to use. For example, nvidia-h100-80gb.
- AMOUNT: the number of GPUs to attach to nodes in the node pool.
For example, the following command creates a GKE node pool, a3nodepool, with H100 GPUs in the zonal cluster a3-cluster. In this example, the GKE GPU device plugin DaemonSet and automatic driver installation are disabled.
gcloud container node-pools create a3nodepool \
  --region=us-central1 --cluster=a3-cluster \
  --node-locations=us-central1-a \
  --accelerator=type=nvidia-h100-80gb,count=8,gpu-driver-version=disabled \
  --machine-type=a3-highgpu-8g \
  --node-labels="gke-no-default-nvidia-gpu-device-plugin=true" \
  --num-nodes=1
Get the authentication credentials for the cluster by running the following command:
USE_GKE_GCLOUD_AUTH_PLUGIN=True \
gcloud container clusters get-credentials CLUSTER_NAME [--zone COMPUTE_ZONE] [--region COMPUTE_REGION]
Replace the following:
- CLUSTER_NAME: the name of the cluster containing your node pool.
- COMPUTE_REGION or COMPUTE_ZONE: specify the cluster's region or zone based on whether your cluster is a regional or zonal cluster, respectively.
The output is similar to the following:
Fetching cluster endpoint and auth data.
kubeconfig entry generated for CLUSTER_NAME.
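Continuing the node pool example above, fetching credentials for a3-cluster might look like the following (this assumes the cluster was created in the us-central1 region, as in that example):
gcloud container clusters get-credentials a3-cluster --region us-central1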
(Optional) Verify that you can connect to the cluster.
kubectl get nodes -o wide
You should see a list of all your nodes running in this cluster.
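If you want to list only the GPU nodes, you can filter on the cloud.google.com/gke-accelerator node label that GKE applies to nodes in GPU node pools. This is an optional check, not part of the required procedure:
kubectl get nodes -l cloud.google.com/gke-accelerator -o wide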
Create the namespace gpu-operator for the NVIDIA GPU Operator by running this command:
kubectl create ns gpu-operator
The output is similar to the following:
namespace/gpu-operator created
Create a resource quota in the gpu-operator namespace by running this command:
kubectl apply -n gpu-operator -f - << EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-operator-quota
spec:
  hard:
    pods: 100
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - system-node-critical
      - system-cluster-critical
EOF
The output is similar to the following:
resourcequota/gpu-operator-quota created
View the resource quota for the gpu-operator namespace:
kubectl get -n gpu-operator resourcequota gpu-operator-quota
The output is similar to the following:
NAME                 AGE     REQUEST       LIMIT
gpu-operator-quota   2m27s   pods: 0/100
Manually install the drivers on your Container-Optimized OS or Ubuntu nodes. For detailed instructions, refer to Manually install NVIDIA GPU drivers.
If using COS, run the following commands to deploy the installation DaemonSet and install the default GPU driver version:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
If using Ubuntu, the installation DaemonSet that you deploy depends on the GPU type and on the GKE node version as described in the Ubuntu section of the instructions.
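As an illustration of the command shape only, the Ubuntu deployment mirrors the COS one. The path below is an assumption based on the COS manifest location, and DAEMONSET_FILE is a placeholder for the manifest name that the Ubuntu instructions list for your GPU type and GKE node version:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/DAEMONSET_FILE.yaml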
Verify the GPU driver version by running this command:
kubectl logs -l k8s-app=nvidia-driver-installer \
  -c "nvidia-driver-installer" --tail=-1 -n kube-system
If GPU driver installation is successful, the output is similar to the following:
I0716 03:17:38.863927    6293 cache.go:66] DRIVER_VERSION=535.183.01
…
I0716 03:17:38.863955    6293 installer.go:58] Verifying GPU driver installation
I0716 03:17:41.534387    6293 install.go:543] Finished installing the drivers.
Install the NVIDIA GPU Operator
This section shows how to install the NVIDIA GPU Operator using Helm. To learn more, refer to NVIDIA's documentation on installing the NVIDIA GPU Operator.
Add the NVIDIA Helm repository:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
  && helm repo update
Install the NVIDIA GPU Operator using Helm with the following configuration options:
- Make sure the GPU Operator version is 24.6.0 or later.
- Configure the driver install path in the GPU Operator with hostPaths.driverInstallDir=/home/kubernetes/bin/nvidia.
- Set the toolkit install path toolkit.installDir=/home/kubernetes/bin/nvidia for both COS and Ubuntu. In COS, the /home directory is writable and serves as a stateful location for storing the NVIDIA runtime binaries. To learn more, refer to the COS Disks and file system overview.
- Enable the Container Device Interface (CDI) in the GPU Operator with cdi.enabled=true and cdi.default=true, because legacy mode is unsupported. CDI is required for both COS and Ubuntu on GKE.
helm install --wait --generate-name \
  -n gpu-operator \
  nvidia/gpu-operator \
  --set hostPaths.driverInstallDir=/home/kubernetes/bin/nvidia \
  --set toolkit.installDir=/home/kubernetes/bin/nvidia \
  --set cdi.enabled=true \
  --set cdi.default=true \
  --set driver.enabled=false
To learn more about these settings, refer to the Common Chart Customization Options and Common Deployment Scenarios in the NVIDIA documentation.
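Optionally, you can confirm that the Helm release deployed. This quick check isn't part of NVIDIA's documented procedure; it uses standard Helm commands:
helm list -n gpu-operator
The release created with --generate-name should appear with a STATUS of deployed.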
Verify that the NVIDIA GPU Operator is successfully installed.
To check that the GPU Operator operands are running correctly, run the following command:
kubectl get pods -n gpu-operator
The output looks similar to the following:
NAME                                                          READY   STATUS      RESTARTS   AGE
gpu-operator-5c7cf8b4f6-bx4rg                                 1/1     Running     0          11m
gpu-operator-node-feature-discovery-gc-79d6d968bb-g7gv9       1/1     Running     0          11m
gpu-operator-node-feature-discovery-master-6d9f8d497c-thhlz   1/1     Running     0          11m
gpu-operator-node-feature-discovery-worker-wn79l              1/1     Running     0          11m
gpu-feature-discovery-fs9gw                                   1/1     Running     0          8m14s
gpu-operator-node-feature-discovery-worker-bdqnv              1/1     Running     0          9m5s
nvidia-container-toolkit-daemonset-vr8fv                      1/1     Running     0          8m15s
nvidia-cuda-validator-4nljj                                   0/1     Completed   0          2m24s
nvidia-dcgm-exporter-4mjvh                                    1/1     Running     0          8m15s
nvidia-device-plugin-daemonset-jfbcj                          1/1     Running     0          8m15s
nvidia-mig-manager-kzncr                                      1/1     Running     0          2m5s
nvidia-operator-validator-fcrr6                               1/1     Running     0          8m15s
To check that the GPU count is configured correctly in the node's 'Allocatable' field, run the following command:
kubectl describe node GPU_NODE_NAME | grep Allocatable -A7
Replace GPU_NODE_NAME with the name of the node that has GPUs.
The output is similar to the following:
Allocatable:
  cpu:                11900m
  ephemeral-storage:  47060071478
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             80403000Ki
  nvidia.com/gpu:     1           # showing the correct count of GPUs associated with the node
  pods:               110
To check that a GPU workload runs correctly, you can use the cuda-vectoradd tool:
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: vectoradd
    image: nvidia/samples:vectoradd-cuda11.2.1
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
Then, run the following command:
kubectl logs cuda-vectoradd
The output is similar to the following:
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
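Optionally, delete the test Pod when you're done. This cleanup step isn't part of the procedure above:
kubectl delete pod cuda-vectoradd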
What's next
- Learn how to run GPUs in Standard node pools.
- Learn about GPU sharing strategies for GKE.
- Learn best practices for autoscaling LLM inference workloads with GPUs on GKE.
- Explore the NVIDIA GPU Operator documentation.