Prepare GKE infrastructure for DRA workloads

This page explains how to set up your Google Kubernetes Engine (GKE) infrastructure to support dynamic resource allocation (DRA). The setup steps include creating node pools that use GPUs or TPUs, and installing DRA drivers in your cluster.

This page is intended for platform administrators who want to reduce the complexity and overhead of setting up infrastructure with specialized hardware devices.

About DRA

DRA is a built-in Kubernetes feature that lets you flexibly request, allocate, and share hardware in your cluster among Pods and containers. For more information, see About dynamic resource allocation.

Limitations

- Node auto-provisioning isn't supported.
- Autopilot clusters don't support DRA.
- Automatic GPU driver installation isn't supported with DRA.
- You can't use the following GPU sharing features:
  - Time-sharing GPUs
  - Multi-instance GPUs
  - Multi-process Service (MPS)

Requirements

To use DRA, your GKE cluster must run 1.32.1-gke.1489001 or later.
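
If you want to check this requirement from a script, the following is a minimal sketch that compares two GKE version strings field by field. The function name version_at_least is illustrative and not part of any GKE tooling, and the sketch assumes that both strings use the MAJOR.MINOR.PATCH-gke.BUILD form shown above.

```shell
# Sketch: succeed if GKE version $1 is at least version $2.
# Assumes both use the MAJOR.MINOR.PATCH-gke.BUILD form; returns 2 otherwise.
version_at_least() {
  # Normalize "1.32.1-gke.1489001" to "1 32 1 1489001".
  have=$(printf '%s\n' "$1" | sed 's/-gke\./ /; s/\./ /g')
  want=$(printf '%s\n' "$2" | sed 's/-gke\./ /; s/\./ /g')
  # Each version must split into exactly four fields (word splitting is intended).
  set -- $have
  [ "$#" -eq 4 ] || return 2
  set -- $have $want
  [ "$#" -eq 8 ] || return 2
  # $1..$4 hold the first version, $5..$8 hold the second.
  if [ "$1" -ne "$5" ]; then [ "$1" -gt "$5" ]; return; fi
  if [ "$2" -ne "$6" ]; then [ "$2" -gt "$6" ]; return; fi
  if [ "$3" -ne "$7" ]; then [ "$3" -gt "$7" ]; return; fi
  [ "$4" -ge "$8" ]
}
```

One way to use it is to read the control plane version with gcloud container clusters describe CLUSTER_NAME --location=CONTROL_PLANE_LOCATION --format="value(currentMasterVersion)" and pass the result as the first argument, with 1.32.1-gke.1489001 as the second.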

You should also be familiar with the GKE requirements and limitations for the type of hardware that you want to use, either GPUs or TPUs.

Before you begin

Before you start, make sure that you have performed the following tasks:

Enable the Google Kubernetes Engine API.

If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
Note: For existing gcloud CLI installations, make sure to set the compute/region property. If you use primarily zonal clusters, set the compute/zone instead. By setting a default location, you can avoid errors in the gcloud CLI like the following: One of [--zone, --region] must be supplied: Please specify location. You might need to specify the location in certain commands if the location of your cluster differs from the default that you set.

Have a GKE Standard cluster that runs version 1.32.1-gke.1489001 or later. If you don't have a cluster, you can create a regional cluster.

If you're not using Cloud Shell, install the Helm CLI:

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
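
Before you continue, you might want to confirm that the command-line tools this page uses are on your PATH. The following is a minimal sketch; the function name require_tools is hypothetical and only illustrates the check.

```shell
# Sketch: report any command-line tools that are missing from PATH.
# Prints each missing tool to stderr and fails if at least one is missing.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
}
```

For example, require_tools gcloud kubectl helm succeeds only when all three tools are installed.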

Enable the DRA beta APIs in your cluster

gcloud container clusters update CLUSTER_NAME \
    --location=CONTROL_PLANE_LOCATION \
    --enable-kubernetes-unstable-apis="resource.k8s.io/v1beta1/deviceclasses,resource.k8s.io/v1beta1/resourceclaims,resource.k8s.io/v1beta1/resourceclaimtemplates,resource.k8s.io/v1beta1/resourceslices"

Replace the following:

CLUSTER_NAME: the name of your cluster.
CONTROL_PLANE_LOCATION: the region or zone of the cluster control plane, such as us-central1 or us-central1-a.
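
The value of the --enable-kubernetes-unstable-apis flag is long and easy to mistype. As a sketch, you could build it from the four resource names instead; the function name dra_beta_apis is an illustrative assumption.

```shell
# Sketch: build the comma-separated value for --enable-kubernetes-unstable-apis
# from the four DRA resource names used in the command above.
dra_beta_apis() {
  list=""
  for resource in deviceclasses resourceclaims resourceclaimtemplates resourceslices; do
    # Add a comma only when the list already has an entry.
    list="${list:+$list,}resource.k8s.io/v1beta1/$resource"
  done
  printf '%s\n' "$list"
}
```

You can then pass --enable-kubernetes-unstable-apis="$(dra_beta_apis)" to the update command. After the update finishes, running kubectl api-resources --api-group=resource.k8s.io should list the DRA resource types if the APIs are enabled.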

Create a GKE node pool with GPUs or TPUs

On GKE, you can use DRA with both GPUs and TPUs. The node pool configuration settings, such as machine type, accelerator type, count, node operating system, and node locations, depend on your requirements.

GPU

To use DRA for GPUs, you must do the following when you create the node pool:

Disable automatic GPU driver installation by setting gpu-driver-version=disabled in the --accelerator flag.
Disable the GPU device plugin by adding the gke-no-default-nvidia-gpu-device-plugin=true node label.
Let the DRA driver DaemonSet run on the nodes by adding the nvidia.com/gpu.present=true node label.

To create a GPU node pool for DRA, follow these steps:

Create a node pool with the required hardware. The following example creates a node pool that has g2-standard-24 instances on Container-Optimized OS with two L4 GPUs.

gcloud container node-pools create NODEPOOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=CONTROL_PLANE_LOCATION \
    --machine-type=g2-standard-24 \
    --accelerator="type=nvidia-l4,count=2,gpu-driver-version=disabled" \
    --num-nodes=1 \
    --node-labels=gke-no-default-nvidia-gpu-device-plugin=true,nvidia.com/gpu.present=true

Replace the following:

NODEPOOL_NAME: the name for your node pool.
CLUSTER_NAME: the name of your cluster.
CONTROL_PLANE_LOCATION: the region or zone of the cluster control plane, such as us-central1 or us-central1-a.

Manually install the drivers on your Container-Optimized OS or Ubuntu nodes. For detailed instructions, refer to Manually install NVIDIA GPU drivers.

TPU

To use DRA for TPUs, you must disable the TPU device plugin by adding the gke-no-default-tpu-device-plugin=true node label.

Create a node pool that uses TPUs. The following example creates a TPU Trillium node pool:

gcloud container node-pools create NODEPOOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=CONTROL_PLANE_LOCATION \
    --num-nodes=1 \
    --node-labels="gke-no-default-tpu-device-plugin=true,gke-no-default-tpu-dra-plugin=true" \
    --machine-type=ct6e-standard-8t

Replace the following:

NODEPOOL_NAME: the name for your node pool.
CLUSTER_NAME: the name of your cluster.
CONTROL_PLANE_LOCATION: the region or zone of the cluster control plane, such as us-central1 or us-central1-a.
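
Because the node labels in this section are required for DRA, it can help to check a label list before you pass it to --node-labels. The following is a minimal sketch; missing_labels is a hypothetical helper, not a gcloud or kubectl feature.

```shell
# Sketch: print each required label that is absent from a comma-separated label list.
# $1: labels you plan to set, such as "a=1,b=2"; remaining arguments: required labels.
missing_labels() {
  actual=",$1,"
  shift
  for label in "$@"; do
    case "$actual" in
      *",$label,"*) ;;                  # label present, nothing to report
      *) printf '%s\n' "$label" ;;      # label missing
    esac
  done
}
```

For a GPU node pool, you could call missing_labels "$LABELS" gke-no-default-nvidia-gpu-device-plugin=true nvidia.com/gpu.present=true, where LABELS is an assumed variable that holds your label list. Empty output means that every required label is present. After the node pool is created, kubectl get nodes -l followed by a label such as nvidia.com/gpu.present=true lists the nodes that carry that label.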

Install DRA drivers

GPU

Add the Helm repository that contains the NVIDIA DRA driver chart, and then update your local repository index:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update

Install the NVIDIA DRA driver with version 25.3.0-rc.4:

helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
    --version="25.3.0-rc.4" \
    --create-namespace \
    --namespace nvidia-dra-driver-gpu \
    --set nvidiaDriverRoot="/home/kubernetes/bin/nvidia/" \
    --set gpuResourcesEnabledOverride=true \
    --set resources.computeDomains.enabled=false \
    --set kubeletPlugin.priorityClassName="" \
    --set 'kubeletPlugin.tolerations[0].key=nvidia.com/gpu' \
    --set 'kubeletPlugin.tolerations[0].operator=Exists' \
    --set 'kubeletPlugin.tolerations[0].effect=NoSchedule'

For Ubuntu nodes, set nvidiaDriverRoot="/opt/nvidia" instead.
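
If you prefer a values file to a long list of --set flags, the following sketch writes the same settings as YAML and selects the driver root by node operating system. The function name write_dra_values and its "cos" and "ubuntu" arguments are illustrative assumptions. The keys mirror the --set flags above and haven't been checked against the chart's full schema.

```shell
# Sketch: write a Helm values file equivalent to the --set flags above.
# $1: output path; $2: node OS, "cos" (default) or "ubuntu".
write_dra_values() {
  case "${2:-cos}" in
    ubuntu) driver_root="/opt/nvidia" ;;
    *)      driver_root="/home/kubernetes/bin/nvidia/" ;;
  esac
  # The heredoc expands $driver_root; everything else is literal YAML.
  cat > "$1" <<EOF
nvidiaDriverRoot: "$driver_root"
gpuResourcesEnabledOverride: true
resources:
  computeDomains:
    enabled: false
kubeletPlugin:
  priorityClassName: ""
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
}
```

After you run write_dra_values values.yaml, you can pass -f values.yaml to helm install in place of the --set flags. A values file also avoids having to quote keys such as tolerations[0] on the command line.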

TPU

You can install DRA drivers for TPUs with the provided Helm chart. To get access to the Helm charts, complete the following steps:

Clone the ai-on-gke repository to access the Helm charts that contain the DRA drivers for GPUs and TPUs:
```
git clone https://github.com/ai-on-gke/common-infra.git
```
Navigate to the directory that contains the charts:
```
cd common-infra/common/charts
```

Install the TPU DRA driver:

./tpu-dra-driver/install-tpu-dra-driver.sh

Verify that your infrastructure is ready for DRA

Verify that the DRA driver Pod is running.

GPU

kubectl get pods -n nvidia-dra-driver-gpu

The output is similar to the following:

NAME                                         READY   STATUS    RESTARTS   AGE
nvidia-dra-driver-gpu-kubelet-plugin-52cdm   1/1     Running   0          46s

TPU

kubectl get pods -n tpu-dra-driver

The output is similar to the following:

NAME                                         READY   STATUS    RESTARTS   AGE
tpu-dra-driver-kubeletplugin-h6m57           1/1     Running   0          30s
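
To automate this check, you could pipe the Pod listing into a small filter. The following is a minimal sketch that assumes the default kubectl get pods table layout shown above; all_pods_running is a hypothetical helper name.

```shell
# Sketch: succeed only if every Pod row read from stdin is Running and fully ready.
# Expects the default table columns: NAME READY STATUS RESTARTS AGE.
all_pods_running() {
  awk '
    NR == 1 || NF == 0 { next }   # skip the header row and blank lines
    {
      seen++
      split($2, ready, "/")       # READY looks like 1/1
      if ($3 != "Running" || ready[1] != ready[2]) bad++
    }
    END { if (seen > 0 && bad == 0) exit 0; exit 1 }
  '
}
```

For example, kubectl get pods -n nvidia-dra-driver-gpu | all_pods_running && echo "driver ready" prints the message only when at least one Pod is listed and all listed Pods are ready. As an alternative, kubectl wait --for=condition=Ready pods --all -n nvidia-dra-driver-gpu --timeout=120s blocks until the Pods are ready or the timeout expires.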

Confirm that the ResourceSlice lists the hardware devices that you added:

kubectl get resourceslices -o yaml

If you used the example in the previous section, the ResourceSlice resembles the following, depending on the type of hardware you used:

GPU

The following example output is for a g2-standard-24 node with two L4 GPUs.

apiVersion: v1
items:
- apiVersion: resource.k8s.io/v1beta1
  kind: ResourceSlice
  metadata:
    # lines omitted for clarity
  spec:
    devices:
    - basic:
        attributes:
          architecture:
            string: Ada Lovelace
          brand:
            string: Nvidia
          cudaComputeCapability:
            version: 8.9.0
          cudaDriverVersion:
            version: 12.9.0
          driverVersion:
            version: 575.57.8
          index:
            int: 0
          minor:
            int: 0
          productName:
            string: NVIDIA L4
          type:
            string: gpu
          uuid:
            string: GPU-4d403095-4294-6ddd-66fd-cfe5778ef56e
        capacity:
          memory:
            value: 23034Mi
      name: gpu-0
    - basic:
        attributes:
          architecture:
            string: Ada Lovelace
          brand:
            string: Nvidia
          cudaComputeCapability:
            version: 8.9.0
          cudaDriverVersion:
            version: 12.9.0
          driverVersion:
            version: 575.57.8
          index:
            int: 1
          minor:
            int: 1
          productName:
            string: NVIDIA L4
          type:
            string: gpu
          uuid:
            string: GPU-cc326645-f91d-d013-1c2f-486827c58e50
        capacity:
          memory:
            value: 23034Mi
      name: gpu-1
    driver: gpu.nvidia.com
    nodeName: gke-cluster-gpu-pool-9b10ff37-mf70
    pool:
      generation: 1
      name: gke-cluster-gpu-pool-9b10ff37-mf70
      resourceSliceCount: 1
kind: List
metadata:
  resourceVersion: ""

TPU

apiVersion: v1
items:
- apiVersion: resource.k8s.io/v1beta1
  kind: ResourceSlice
  metadata:
    # lines omitted for clarity
  spec:
    devices:
    - basic:
        attributes:
          index:
            int: 0
          tpuGen:
            string: v6e
          uuid:
            string: tpu-54de4859-dd8d-f67e-6f91-cf904d965454
      name: "0"
    - basic:
        attributes:
          index:
            int: 1
          tpuGen:
            string: v6e
          uuid:
            string: tpu-54de4859-dd8d-f67e-6f91-cf904d965454
      name: "1"
    - basic:
        attributes:
          index:
            int: 2
          tpuGen:
            string: v6e
          uuid:
            string: tpu-54de4859-dd8d-f67e-6f91-cf904d965454
      name: "2"
    - basic:
        attributes:
          index:
            int: 3
          tpuGen:
            string: v6e
          uuid:
            string: tpu-54de4859-dd8d-f67e-6f91-cf904d965454
      name: "3"
    driver: tpu.google.com
    nodeName: gke-tpu-b4d4b61b-fwbg
    pool:
      generation: 1
      name: gke-tpu-b4d4b61b-fwbg
      resourceSliceCount: 1
kind: List
metadata:
  resourceVersion: ""
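
For a quick sanity check, you can count the devices that the ResourceSlices advertise instead of reading the full YAML. The following sketch is deliberately simple and somewhat fragile: it assumes the v1beta1 layout shown above, where each device entry starts with a "- basic:" line. The function name count_devices is hypothetical.

```shell
# Sketch: count device entries in ResourceSlice YAML read from stdin.
# Assumes the v1beta1 layout, where each device starts with a "- basic:" line.
count_devices() {
  # grep -c prints 0 and exits non-zero when nothing matches, so ignore its status.
  grep -c -e '^ *- basic: *$' || true
}
```

With the example node pools from this page, kubectl get resourceslices -o yaml | count_devices would be expected to print 2 for the GPU node and 4 for the TPU node, matching the example output above.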

What's next

Deploy your workloads with DRA