This document explains how to set up your Google Kubernetes Engine (GKE) infrastructure to support dynamic resource allocation (DRA). The setup steps include creating node pools that use GPUs or TPUs, and installing DRA drivers in your cluster. This document is intended for platform administrators who want to reduce the complexity and overhead of setting up infrastructure with specialized hardware devices.
Limitations
- Node auto-provisioning isn't supported.
- Autopilot clusters don't support DRA.
- Automatic GPU driver installation isn't supported with DRA.
- You can't use the following GPU sharing features:
  - Time-sharing GPUs
  - Multi-instance GPUs
  - Multi-Process Service (MPS)
- For TPUs, you must enable the `v1beta1` and `v1beta2` versions of the DRA API kinds. This limitation doesn't apply to GPUs, which can use the `v1` API versions.
Requirements
To use DRA, your GKE cluster must run version 1.34 or later.
You should also be familiar with the requirements and limitations of the type of hardware that you want to use, such as GPUs or TPUs.
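To check which version your cluster control plane runs, you can use a command like the following, where `CLUSTER_NAME` and `CONTROL_PLANE_LOCATION` are your cluster's name and control plane location:

```
gcloud container clusters describe CLUSTER_NAME \
    --location=CONTROL_PLANE_LOCATION \
    --format="value(currentMasterVersion)"
```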
Before you begin
Before you start, make sure that you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the `gcloud components update` command. Earlier gcloud CLI versions might not support running the commands in this document.
- Have a GKE Standard cluster that runs version 1.34 or later. You can also create a regional cluster.
- If you're not using Cloud Shell, install the Helm CLI:

  ```
  curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
  chmod 700 get_helm.sh
  ./get_helm.sh
  ```

- To use DRA for TPUs, enable the `v1beta1` and `v1beta2` versions of the DRA API kinds:

  ```
  gcloud container clusters update CLUSTER_NAME \
      --location=CONTROL_PLANE_LOCATION \
      --enable-kubernetes-unstable-apis="resource.k8s.io/v1beta1/deviceclasses,resource.k8s.io/v1beta1/resourceclaims,resource.k8s.io/v1beta1/resourceclaimtemplates,resource.k8s.io/v1beta1/resourceslices,resource.k8s.io/v1beta2/deviceclasses,resource.k8s.io/v1beta2/resourceclaims,resource.k8s.io/v1beta2/resourceclaimtemplates,resource.k8s.io/v1beta2/resourceslices"
  ```
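After the update completes, you can optionally confirm that the cluster serves the DRA API kinds, for example:

```
kubectl api-resources --api-group=resource.k8s.io
```

The output should include the `deviceclasses`, `resourceclaims`, `resourceclaimtemplates`, and `resourceslices` resources.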
Create a GKE node pool with GPUs or TPUs
On GKE, you can use DRA with both GPUs and TPUs. The node pool configuration settings—such as machine type, accelerator type, count, node operating system, and node locations—depend on your requirements. To create a node pool that supports DRA, select one of the following options:
GPU
To use DRA for GPUs, you must do the following when you create the node pool:
- Disable automatic GPU driver installation by specifying the `gpu-driver-version=disabled` option in the `--accelerator` flag when you configure GPUs for a node pool.
- Disable the GPU device plugin by adding the `gke-no-default-nvidia-gpu-device-plugin=true` node label.
- Let the DRA driver DaemonSet run on the nodes by adding the `nvidia.com/gpu.present=true` node label.
To create a GPU node pool for DRA, follow these steps:
1. Create a node pool with the required hardware. The following example creates a node pool that has a `g2-standard-24` instance on Container-Optimized OS with two L4 GPUs:

   ```
   gcloud container node-pools create NODEPOOL_NAME \
       --cluster=CLUSTER_NAME \
       --location=CONTROL_PLANE_LOCATION \
       --machine-type "g2-standard-24" \
       --accelerator "type=nvidia-l4,count=2,gpu-driver-version=disabled" \
       --num-nodes "1" \
       --node-labels=gke-no-default-nvidia-gpu-device-plugin=true,nvidia.com/gpu.present=true
   ```

   Replace the following:

   - `NODEPOOL_NAME`: the name for your node pool.
   - `CLUSTER_NAME`: the name of your cluster.
   - `CONTROL_PLANE_LOCATION`: the region or zone of the cluster control plane, such as `us-central1` or `us-central1-a`.
2. Manually install the GPU drivers on your Container-Optimized OS or Ubuntu nodes. For detailed instructions, refer to Manually install NVIDIA GPU drivers.
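As an optional check before you install the DRA driver, you can confirm that the new nodes carry the labels from the earlier step; this is a quick sanity check, not a required part of the setup:

```
kubectl get nodes -l nvidia.com/gpu.present=true,gke-no-default-nvidia-gpu-device-plugin=true
```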
TPU
To use DRA for TPUs, you must disable the TPU device plugin by adding the `gke-no-default-tpu-device-plugin=true` node label. The following example creates a TPU Trillium node pool with DRA support; it also sets the `gke-no-default-tpu-dra-plugin=true` label, which disables the default TPU DRA plugin:
```
gcloud container node-pools create NODEPOOL_NAME \
    --cluster=CLUSTER_NAME \
    --num-nodes=1 \
    --location=CONTROL_PLANE_LOCATION \
    --node-labels="gke-no-default-tpu-device-plugin=true,gke-no-default-tpu-dra-plugin=true" \
    --machine-type=ct6e-standard-8t
```
Replace the following:

- `NODEPOOL_NAME`: the name for your node pool.
- `CLUSTER_NAME`: the name of your cluster.
- `CONTROL_PLANE_LOCATION`: the region or zone of the cluster control plane, such as `us-central1` or `us-central1-a`.
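As an optional check, you can confirm that the TPU nodes carry the labels that disable the default plugins:

```
kubectl get nodes -l gke-no-default-tpu-device-plugin=true,gke-no-default-tpu-dra-plugin=true
```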
Install DRA drivers
GPU
1. Add and update the Helm repository that contains the NVIDIA DRA driver chart:

   ```
   helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
       && helm repo update
   ```

2. Install version `25.3.2` of the NVIDIA DRA driver:

   ```
   helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
       --version="25.3.2" --create-namespace --namespace=nvidia-dra-driver-gpu \
       --set nvidiaDriverRoot="/home/kubernetes/bin/nvidia/" \
       --set gpuResourcesEnabledOverride=true \
       --set resources.computeDomains.enabled=false \
       --set kubeletPlugin.priorityClassName="" \
       --set kubeletPlugin.tolerations[0].key=nvidia.com/gpu \
       --set kubeletPlugin.tolerations[0].operator=Exists \
       --set kubeletPlugin.tolerations[0].effect=NoSchedule
   ```

   For Ubuntu nodes, use the `nvidiaDriverRoot="/opt/nvidia"` directory path.
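To confirm that the release deployed successfully, you can list the Helm releases in the driver's namespace:

```
helm list -n nvidia-dra-driver-gpu
```

The output should show the `nvidia-dra-driver-gpu` release with a `deployed` status.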
TPU
1. Clone the `ai-on-gke` repository to access the Helm charts that contain the DRA drivers for GPUs and TPUs:

   ```
   git clone https://github.com/ai-on-gke/common-infra.git
   ```

2. Navigate to the directory that contains the charts:

   ```
   cd common-infra/common/charts
   ```

3. Install the TPU DRA driver:

   ```
   ./tpu-dra-driver/install-tpu-dra-driver.sh
   ```
Verify that your infrastructure is ready for DRA
To verify that your DRA driver Pods are running, select one of the following options:
GPU
```
kubectl get pods -n nvidia-dra-driver-gpu
```

The output is similar to the following:

```
NAME                                         READY   STATUS    RESTARTS   AGE
nvidia-dra-driver-gpu-kubelet-plugin-52cdm   1/1     Running   0          46s
```

TPU
```
kubectl get pods -n tpu-dra-driver
```

The output is similar to the following:

```
NAME                                 READY   STATUS    RESTARTS   AGE
tpu-dra-driver-kubeletplugin-h6m57   1/1     Running   0          30s
```

Confirm that the `ResourceSlice` lists the hardware devices that you added:

```
kubectl get resourceslices -o yaml
```

If you used the example in the previous section, the output is similar to the following, depending on whether you configured GPUs or TPUs:
GPU
```
apiVersion: v1
items:
- apiVersion: resource.k8s.io/v1
  kind: ResourceSlice
  metadata:
    # Multiple lines are omitted here.
  spec:
    devices:
    - attributes:
        architecture:
          string: Ada Lovelace
        brand:
          string: Nvidia
        cudaComputeCapability:
          version: 8.9.0
        cudaDriverVersion:
          version: 13.0.0
        driverVersion:
          version: 580.65.6
        index:
          int: 0
        minor:
          int: 0
        pcieBusID:
          string: "0000:00:03.0"
        productName:
          string: NVIDIA L4
        resource.kubernetes.io/pcieRoot:
          string: pci0000:00
        type:
          string: gpu
        uuid:
          string: GPU-ccc19e5e-e3cd-f911-65c8-89bcef084e3f
      capacity:
        memory:
          value: 23034Mi
      name: gpu-0
    - attributes:
        architecture:
          string: Ada Lovelace
        brand:
          string: Nvidia
        cudaComputeCapability:
          version: 8.9.0
        cudaDriverVersion:
          version: 13.0.0
        driverVersion:
          version: 580.65.6
        index:
          int: 1
        minor:
          int: 1
        pcieBusID:
          string: "0000:00:04.0"
        productName:
          string: NVIDIA L4
        resource.kubernetes.io/pcieRoot:
          string: pci0000:00
        type:
          string: gpu
        uuid:
          string: GPU-f783198d-42f9-7cef-9ea1-bb10578df978
      capacity:
        memory:
          value: 23034Mi
      name: gpu-1
    driver: gpu.nvidia.com
    nodeName: gke-cluster-1-dra-gpu-pool-b56c4961-7vnm
    pool:
      generation: 1
      name: gke-cluster-1-dra-gpu-pool-b56c4961-7vnm
      resourceSliceCount: 1
kind: List
metadata:
  resourceVersion: ""
```

TPU
```
apiVersion: v1
items:
- apiVersion: resource.k8s.io/v1beta1
  kind: ResourceSlice
  metadata:
    # lines omitted for clarity
  spec:
    devices:
    - basic:
        attributes:
          index:
            int: 0
          tpuGen:
            string: v6e
          uuid:
            string: tpu-54de4859-dd8d-f67e-6f91-cf904d965454
      name: "0"
    - basic:
        attributes:
          index:
            int: 1
          tpuGen:
            string: v6e
          uuid:
            string: tpu-54de4859-dd8d-f67e-6f91-cf904d965454
      name: "1"
    - basic:
        attributes:
          index:
            int: 2
          tpuGen:
            string: v6e
          uuid:
            string: tpu-54de4859-dd8d-f67e-6f91-cf904d965454
      name: "2"
    - basic:
        attributes:
          index:
            int: 3
          tpuGen:
            string: v6e
          uuid:
            string: tpu-54de4859-dd8d-f67e-6f91-cf904d965454
      name: "3"
    driver: tpu.google.com
    nodeName: gke-tpu-b4d4b61b-fwbg
    pool:
      generation: 1
      name: gke-tpu-b4d4b61b-fwbg
      resourceSliceCount: 1
kind: List
metadata:
  resourceVersion: ""
```
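With the drivers installed and the devices published in `ResourceSlice` objects, workloads can consume the hardware through resource claims. The following is a minimal, illustrative sketch that assumes the `gpu.nvidia.com` DeviceClass installed by the NVIDIA DRA driver; the `single-gpu` and `gpu-consumer` names are hypothetical:

```
# Illustrative sketch: request one GPU through DRA.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu  # hypothetical name
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com  # DeviceClass provided by the NVIDIA DRA driver
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-consumer  # hypothetical name
spec:
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
  containers:
  - name: app
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi", "-L"]
    resources:
      claims:
      - name: gpu  # binds the container to the claim defined above
```

For end-to-end workload examples, refer to the GKE documentation about deploying DRA workloads.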