Manage GPU container workloads

You can enable and manage graphics processing unit (GPU) resources on your containers. For example, you might prefer running artificial intelligence (AI) and machine learning (ML) notebooks in a GPU environment. GPU support is enabled by default in Google Distributed Cloud (GDC) air-gapped appliance.

Before you begin

To deploy GPUs to your containers, you must have the following:

  • The Namespace Admin role (namespace-admin) to deploy GPU workloads in your project namespace.

  • The kubeconfig path for the bare metal Kubernetes cluster. Sign in and generate the kubeconfig file if you don't have one. A convenience sketch for pointing kubectl at the file follows this list.

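If you already have the kubeconfig file, one optional convenience (not required by the steps on this page) is to export its path in your shell so that kubectl picks it up without the --kubeconfig flag. This sketch assumes CLUSTER_KUBECONFIG is the path to your generated kubeconfig file:

    export KUBECONFIG=CLUSTER_KUBECONFIG
    kubectl get nodes
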
Configure a container to use GPU resources

To use GPUs in a container, complete the following steps:

  1. Confirm that your Kubernetes cluster nodes support your GPU resource allocation:

    kubectl describe nodes NODE_NAME
    

    Replace NODE_NAME with the name of the node that manages the GPUs you want to inspect.

    The relevant output is similar to the following snippet:

    Capacity:
      nvidia.com/gpu-pod-NVIDIA_A100_80GB_PCIE: 1
    Allocatable:
      nvidia.com/gpu-pod-NVIDIA_A100_80GB_PCIE: 1
    
  2. Add the .containers.resources.requests and .containers.resources.limits fields to your container spec. Since your Kubernetes cluster is preconfigured with GPU machines, the configuration is the same for all workloads:

     ...
     containers:
     - name: CONTAINER_NAME
       image: CONTAINER_IMAGE
       resources:
         requests:
           nvidia.com/gpu-pod-NVIDIA_A100_80GB_PCIE: 1
         limits:
           nvidia.com/gpu-pod-NVIDIA_A100_80GB_PCIE: 1
     ...
    

    Replace the following:

    • CONTAINER_NAME: the name of the container.
    • CONTAINER_IMAGE: the container image to access the GPU machines. You must include the container registry path and version of the image, such as REGISTRY_PATH/hello-app:1.0.
  3. Containers also require additional permissions to access GPUs. For each container that requests GPUs, add the following permissions to your container spec (a complete example manifest that combines this with the resource fields from the previous step appears after this procedure):

    ...
    securityContext:
      seLinuxOptions:
        type: unconfined_t
    ...
    
  4. Apply your container manifest file:

    kubectl apply -f CONTAINER_MANIFEST_FILE \
        -n NAMESPACE \
        --kubeconfig CLUSTER_KUBECONFIG
    

    Replace the following:

    • CONTAINER_MANIFEST_FILE: the YAML file for your container workload custom resource.
    • NAMESPACE: the project namespace in which to deploy the container workloads.
    • CLUSTER_KUBECONFIG: the kubeconfig file for the bare metal Kubernetes cluster to which you're deploying container workloads.
  5. Verify that your pods are running and using the GPUs:

    kubectl get pods -n NAMESPACE \
        --kubeconfig CLUSTER_KUBECONFIG | grep CONTAINER_NAME
    

    To confirm the GPU requests and limits, describe the pod. The relevant output is similar to the following snippet:

    Port:           80/TCP
    Host Port:      0/TCP
    State:          Running
    Ready:          True
    Restart Count:  0
    Limits:
      nvidia.com/gpu-pod-NVIDIA_A100_80GB_PCIE:  1
    Requests:
      nvidia.com/gpu-pod-NVIDIA_A100_80GB_PCIE:  1
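
As an optional extra check, if your container image includes the NVIDIA utilities (for example, an image built on a CUDA base image), you can run nvidia-smi inside the pod to confirm that the GPU is visible to the container. The pod name below is a placeholder for the pod created by your workload:

    kubectl exec -it POD_NAME \
        -n NAMESPACE \
        --kubeconfig CLUSTER_KUBECONFIG \
        -- nvidia-smi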
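
For reference, the following is a minimal, self-contained Pod manifest that combines the GPU resource fields from step 2 with the securityContext from step 3. The pod name is a placeholder, and CONTAINER_NAME, CONTAINER_IMAGE, and NAMESPACE have the same meanings as in the steps above:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-example-pod
      namespace: NAMESPACE
    spec:
      containers:
      - name: CONTAINER_NAME
        image: CONTAINER_IMAGE
        securityContext:
          seLinuxOptions:
            type: unconfined_t
        resources:
          requests:
            nvidia.com/gpu-pod-NVIDIA_A100_80GB_PCIE: 1
          limits:
            nvidia.com/gpu-pod-NVIDIA_A100_80GB_PCIE: 1

You can apply this manifest with the same kubectl apply command shown in step 4.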