This page describes how to deploy GPU container workloads on the
Google Distributed Cloud (GDC) Sandbox AI Optimized SKU.
Deploy GPU container workloads
The GDC Sandbox AI Optimized SKU includes four NVIDIA H100 80GB HBM3 GPUs within
the org-infra cluster. These GPUs are accessible using the resource name
nvidia.com/gpu-pod-NVIDIA_H100_80GB_HBM3. This section describes how to update
a container configuration to use these GPUs.
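As an optional sanity check (not a step from this guide, and assuming your kubeconfig already points at the org infrastructure cluster), you can confirm that the cluster nodes advertise this resource name:

    kubectl describe nodes --kubeconfig ${KUBECONFIG} | grep nvidia.com/gpu-pod-NVIDIA_H100_80GB_HBM3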
The GPUs in the GDC Sandbox AI Optimized SKU are associated with a pre-configured
project, "sandbox-gpu-project". You must deploy your container in this project
to use the GPUs.
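Note: For GPU workloads, there is no direct access to HaaS (Harbor as a Service). You must be able to fetch images from the Google Cloud Artifact Registry or the internet.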
Before you begin
To run commands against the org infrastructure cluster, make sure that you
have the kubeconfig of the org-1-infra cluster, as described in
Work with clusters:
Configure and authenticate with the gdcloud command line, generate the
kubeconfig file for the org infrastructure cluster, and assign its path to
the environment variable KUBECONFIG.
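For example, after generating the kubeconfig file, you can export its path so that subsequent kubectl commands use it. The file name below is only a placeholder; substitute the path that gdcloud generated for you:

    export KUBECONFIG="$HOME/org-1-infra-cluster-kubeconfig"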
To run the workloads, you must have the sandbox-gpu-admin role assigned.
By default, the role is assigned to the platform-admin user. You can
assign the role to other users by signing in as the platform-admin and
running the following command:
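    kubectl --kubeconfig ${KUBECONFIG} create rolebinding ${NAME} --role=sandbox-gpu-admin \
      --user=${USER} --namespace=sandbox-gpu-project

Configure a container to use GPU resources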
Add the .containers.resources.requests and .containers.resources.limits
fields to your container specification to request GPUs for the workload. All
containers within the sandbox-gpu-project can request up to a total of 4
GPUs across the entire project. The following example requests one GPU as
part of the container specification.
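    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx-deployment
      namespace: sandbox-gpu-project
      labels:
        app: nginx
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          containers:
          - name: nginx
            image: nginx:latest
            resources:
              requests:
                nvidia.com/gpu-pod-NVIDIA_H100_80GB_HBM3: 1
              limits:
                nvidia.com/gpu-pod-NVIDIA_H100_80GB_HBM3: 1

Note: If you are using GDC Sandbox AI Optimized with A100 GPUs, the GPUs are
accessible using the resource name nvidia.com/gpu-pod-NVIDIA_A100_SXM4_80GB.
Substitute this resource name for nvidia.com/gpu-pod-NVIDIA_H100_80GB_HBM3 in
the configuration file.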
Containers also require additional permissions to access GPUs. For each
container that requests GPUs, add the following permissions to your
container spec:
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-09-04 UTC."],[],[],null,["# Deploy GPU container workloads\n\nThis page describes how to deploy GPU container workloads on the\nGoogle Distributed Cloud (GDC) Sandbox AI Optimized SKU.\n\nDeploy GPU container workloads\n------------------------------\n\nThe GDC Sandbox AI Optimized SKU includes four NVIDIA H100 80GB HBM3 GPUs within\nthe org-infra cluster. These GPUs are accessible using the resource name\n`nvidia.com/gpu-pod-NVIDIA_H100_80GB_HBM3`. This section describes how to update\na container configuration to use these GPUS.\n\nThe GPUs in GDC Sandbox AI Optimized SKU are associated with a pre-configured\nproject, \"**sandbox-gpu-project**\". You must deploy your container using this\nproject in order to make use of the GPUs.\n| **Note:** For GPU workloads, there is no direct access to HaaS (Harbor as a Service). You must be able to fetch images from the Google Cloud Artifact Registry or internet.\n\n### Before you begin\n\n- To run commands against the org infrastructure cluster, make sure that you\n have the kubeconfig of the `org-1-infra` cluster, as described in\n [Work with clusters](/distributed-cloud/sandbox/latest/clusters#org-infra-cluster):\n\n - Configure and authenticate with the `gdcloud` command line, and\n - generate the kubeconfig file for the org infrastructure cluster, and assign its path to the environment variable `KUBECONFIG`.\n- To run the workloads, you must have the `sandbox-gpu-admin` role assigned.\n By default, the role is assigned to the `platform-admin` user. You can\n assign the role to other users by signing in as the `platform-admin` and\n running the following command:\n\n kubectl --kubeconfig ${KUBECONFIG} create rolebinding ${NAME} --role=sandbox-gpu-admin \\\n --user=${USER} --namespace=sandbox-gpu-project\n\n### Configure a container to use GPU resources\n\n1. Add the `.containers.resources.requests` and `.containers.resources.limits`\n fields to your container specification to request GPUs for the workload. All\n containers within the sandbox-gpu-project can request up to a total of 4\n GPUs across the entire project. The following example requests one GPU as\n part of the container specification.\n\n apiVersion: apps/v1\n kind: Deployment\n metadata:\n name: nginx-deployment\n namespace: sandbox-gpu-project\n labels:\n app: nginx\n spec:\n replicas: 1\n selector:\n matchLabels:\n app: nginx\n template:\n metadata:\n labels:\n app: nginx\n spec:\n containers:\n - name: nginx\n image: nginx:latest\n resources:\n requests:\n nvidia.com/gpu-pod-NVIDIA_H100_80GB_HBM3: 1\n limits:\n nvidia.com/gpu-pod-NVIDIA_H100_80GB_HBM3: 1\n\n| **Note:** If you are using GDC Sandbox AI Optimized with A100 GPUs, the GPUs are accessible using the resource name `nvidia.com/gpu-pod-NVIDIA_A100_SXM4_80GB`. Substitute this resource name for `nvidia.com/gpu-pod-NVIDIA_H100_80GB_HBM3` in the configuration file.\n\n1. Containers also require additional permissions to access GPUs. For each\n container that requests GPUs, add the following permissions to your\n container spec:\n\n securityContext:\n seLinuxOptions:\n type: unconfined_t\n\n2. 
Apply your container manifest file:\n\n kubectl apply -f ${CONTAINER_MANIFEST_FILE_PATH} \\\n -n sandbox-gpu-project \\\n --kubeconfig ${KUBECONFIG}"]]