Manage GPU container workloads

You can enable and manage graphics processing unit (GPU) resources on your containers. For example, you might prefer running artificial intelligence (AI) and machine learning (ML) notebooks in a GPU environment. To run GPU container workloads, you must have a user cluster that supports GPU devices. GPU support is enabled by default for user clusters that have GPU machines provisioned for them.

Before you begin

To deploy GPUs to your containers, you must have the following:

  • A user cluster with a GPU machine class. See the supported GPU cards section for the GPU options you can configure for your cluster machines.

  • The User Cluster Node Viewer role (user-cluster-node-viewer) to check GPUs, and the Namespace Admin role (namespace-admin) to deploy GPU workloads. A quick way to confirm this access is sketched after this list.

  • The org admin cluster kubeconfig path. Sign in and generate the kubeconfig file if you don't have one.

  • The user cluster name. Ask your Platform Administrator for this information if you don't have it.

  • The user cluster kubeconfig path. Sign in and generate the kubeconfig file if you don't have one.
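
The following is a minimal sketch of how you might confirm the required access before continuing. It assumes a hypothetical project namespace named my-project; substitute your own namespace and kubeconfig path:

    # Confirm you can read nodes in the user cluster (User Cluster Node Viewer role).
    kubectl auth can-i get nodes \
        --kubeconfig USER_CLUSTER_KUBECONFIG

    # Confirm you can deploy workloads in your project namespace (Namespace Admin role).
    # "my-project" is a hypothetical namespace name.
    kubectl auth can-i create pods --namespace my-project \
        --kubeconfig USER_CLUSTER_KUBECONFIG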

Configure a container to use GPU resources

GPUs are automatically partitioned using NVIDIA Multi-Instance GPU (MIG) technology, which lets each node provide seven GPU partitions that containers in the cluster can use.
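
For example, the following command lists each node's allocatable resources, where the MIG partitions appear as nvidia.com/mig-... entries. The exact resource name depends on your machine class, as described in the steps that follow:

    kubectl get nodes \
        -o custom-columns='NODE:.metadata.name,ALLOCATABLE:.status.allocatable' \
        --kubeconfig USER_CLUSTER_KUBECONFIG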

To use these GPUs in a container, complete the following steps:

  1. Verify your user cluster has node pools that support GPUs:

    kubectl describe nodepoolclaims -n USER_CLUSTER_NAME \
        --kubeconfig ORG_ADMIN_CLUSTER_KUBECONFIG
    

    The relevant output is similar to the following snippet:

    Spec:
      Machine Class Name:  a2-ultragpu-1g-gdc
      Node Count:          2
    
  2. Add the .containers.resources.requests and .containers.resources.limits fields to your container spec. The resource name depends on your machine class:

    Machine class        Resource name
    a2-highgpu-1g-gdc    nvidia.com/mig-1g.5gb-NVIDIA_A100_PCIE_40GB
    a2-ultragpu-1g-gdc   nvidia.com/mig-1g.10gb-NVIDIA_A100_80GB_PCIE

    For example, the following container spec requests three partitions of a GPU from an a2-ultragpu-1g-gdc node:

     ...
     containers:
     - name: my-container
       image: "my-image"
       resources:
         requests:
           nvidia.com/mig-1g.10gb-NVIDIA_A100_80GB_PCIE: 3
         limits:
           nvidia.com/mig-1g.10gb-NVIDIA_A100_80GB_PCIE: 3
     ...
    
  3. Containers also require additional permissions to access GPUs. For each container that requests GPUs, add the following securityContext to its container spec (a complete example manifest that combines these settings with the resource requests from the previous step follows these steps):

    ...
    securityContext:
      seLinuxOptions:
        type: unconfined_t
    ...
    
  4. Apply your container manifest file:

    kubectl apply -f CONTAINER_MANIFEST_FILE \
        --kubeconfig USER_CLUSTER_KUBECONFIG
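
The following is a minimal sketch of a complete Pod manifest that combines the resource requests from step 2 with the securityContext from step 3. The pod name and namespace are hypothetical placeholders, and the resource name assumes the a2-ultragpu-1g-gdc machine class:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-example        # hypothetical pod name
      namespace: my-project    # hypothetical project namespace
    spec:
      containers:
      - name: my-container
        image: "my-image"
        resources:
          requests:
            nvidia.com/mig-1g.10gb-NVIDIA_A100_80GB_PCIE: 3
          limits:
            nvidia.com/mig-1g.10gb-NVIDIA_A100_80GB_PCIE: 3
        securityContext:
          seLinuxOptions:
            type: unconfined_t

You can then deploy it with the command from step 4, for example kubectl apply -f gpu-example.yaml --kubeconfig USER_CLUSTER_KUBECONFIG, where gpu-example.yaml is a hypothetical file name for this manifest.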
    

Check GPU resource allocation

  • To check your GPU resource allocation, use the following command:

    kubectl describe nodes NODE_NAME \
        --kubeconfig USER_CLUSTER_KUBECONFIG
    

    Replace NODE_NAME with the name of the node that manages the GPUs you want to inspect.

    The relevant output is similar to the following snippet:

    Capacity:
      nvidia.com/mig-1g.10gb-NVIDIA_A100_80GB_PCIE: 7
    Allocatable:
      nvidia.com/mig-1g.10gb-NVIDIA_A100_80GB_PCIE: 7
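
    To see how much of that capacity a specific workload consumes, you can also query a pod's GPU requests directly. The following is a minimal sketch that reuses the hypothetical gpu-example pod and my-project namespace from the earlier example:

    kubectl get pod gpu-example --namespace my-project \
        -o jsonpath='{.spec.containers[*].resources.requests}' \
        --kubeconfig USER_CLUSTER_KUBECONFIG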