Deploy GPU workloads in Autopilot


This page shows you how to request hardware accelerators (GPUs) in your Google Kubernetes Engine (GKE) Autopilot workloads.

Autopilot provides the specialized Accelerator compute class to run GPU Pods. With this compute class, GKE places a single Pod on each GPU node, which gives Pods access to advanced capabilities on the virtual machine (VM). You can optionally run GPU Pods without selecting the Accelerator compute class. To learn more about the benefits of the Accelerator compute class, see When to use specific compute classes.

Pricing

Autopilot bills you differently depending on whether you request the Accelerator compute class to run your GPU workloads.

Use the Accelerator compute class?

  • Yes
    • Pricing: You're billed for the Compute Engine hardware that runs your GPU workloads, plus an Autopilot premium for automatic node management and scalability. For details, see Autopilot mode pricing.
    • Compatibility with GKE capabilities: Compatible with the following:
  • No
    • Pricing: You're billed based on the GPU Pod resource requests. For details, see the "GPU Pods" section in Kubernetes Engine pricing.
    • Compatibility with GKE capabilities: Compatible with the following:

Before you begin

Before you start, make sure you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.

Limitations

  • You can't use time-sharing GPUs or multi-instance GPUs with Autopilot.
  • GPU availability depends on the Google Cloud region of your Autopilot cluster, and your GPU quota. To find a GPU model by region or zone, see GPU regions and zones availability.
  • If you explicitly request a specific existing GPU node for your Pod, the Pod must consume all the GPU resources on the node. For example, if the existing node has 8 GPUs and your Pod's containers request a total of 4 GPUs, Autopilot rejects the Pod.
  • For NVIDIA A100 (80GB) GPUs, you're charged a fixed price for the Local SSDs attached to the nodes, regardless of whether your Pods use that capacity.

Request GPUs in your containers

To request GPU resources for your containers, add the following fields to your Pod specification. Depending on your workload requirements, you can optionally omit the cloud.google.com/compute-class: "Accelerator" field.

apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  nodeSelector:
    cloud.google.com/compute-class: "Accelerator"
    cloud.google.com/gke-accelerator: GPU_TYPE
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/gpu: GPU_QUANTITY

Replace the following:

  • GPU_TYPE: the type of GPU hardware. Allowed values are the following:
    • nvidia-a100-80gb: NVIDIA A100 (80GB)
    • nvidia-tesla-a100: NVIDIA A100 (40GB)
    • nvidia-l4: NVIDIA L4
    • nvidia-tesla-t4: NVIDIA T4
  • GPU_QUANTITY: the number of GPUs to allocate to the container. Must be a supported GPU quantity for the GPU type you selected.

You must specify both the GPU type and the GPU quantity in your Pod specification. If you omit either of these values, Autopilot rejects your Pod.
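
As a minimal sketch of this rule, the following Python helper builds the same Pod manifest as a dictionary and refuses to build one when the GPU type or quantity is missing. The `build_gpu_pod` function and its parameters are hypothetical names chosen for illustration; they are not part of any Google Cloud or Kubernetes library.

```python
import json


def build_gpu_pod(name, image, gpu_type, gpu_quantity, use_accelerator_class=True):
    """Return a Pod manifest (as a dict) that requests GPUs on Autopilot.

    Hypothetical helper for illustration; it mirrors the YAML example above.
    """
    # Autopilot rejects Pods that omit either value, so fail early here too.
    if not gpu_type:
        raise ValueError("gpu_type is required")
    if not isinstance(gpu_quantity, int) or gpu_quantity < 1:
        raise ValueError("gpu_quantity must be a positive integer")

    node_selector = {"cloud.google.com/gke-accelerator": gpu_type}
    if use_accelerator_class:
        # Optional: leave this selector out to run without the Accelerator compute class.
        node_selector["cloud.google.com/compute-class"] = "Accelerator"

    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "nodeSelector": node_selector,
            "containers": [
                {
                    "name": f"{name}-container",
                    "image": image,
                    "resources": {"limits": {"nvidia.com/gpu": gpu_quantity}},
                }
            ],
        },
    }


pod = build_gpu_pod(
    "my-gpu-pod", "nvidia/cuda:11.0.3-runtime-ubuntu20.04", "nvidia-l4", 1
)
# Print the manifest as JSON so that it can be saved to a file and reviewed.
print(json.dumps(pod, indent=2))
```

Passing `use_accelerator_class=False` produces the variant of the manifest that omits the compute class selector.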

CPU and memory requests for Autopilot GPU Pods

When defining your GPU Pods, you should also request CPU and memory resources so that your containers perform as expected. Autopilot enforces specific CPU and memory minimums, maximums, and defaults based on the GPU type and quantity. For details, refer to Resource requests in Autopilot.

You can also request ephemeral storage in Pods that need short-lived storage. The maximum available ephemeral storage and the type of storage hardware used depend on the type and quantity of GPUs that the Pod requests.

Your Pod specification should look similar to the following example, which requests four T4 GPUs:

apiVersion: v1
kind: Pod
metadata:
  name: t4-pod
spec:
  nodeSelector:
    cloud.google.com/compute-class: "Accelerator"
    cloud.google.com/gke-accelerator: "nvidia-tesla-t4"
  containers:
  - name: t4-container-1
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/gpu: 3
        cpu: "54"
        memory: "54Gi"
      requests:
        cpu: "54"
        memory: "54Gi"
  - name: t4-container-2
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/gpu: 1
        cpu: "18"
        memory: "18Gi"
      requests:
        cpu: "18"
        memory: "18Gi"
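
The example splits the four GPUs across two containers. As a rough sketch, the following Python snippet sums the `nvidia.com/gpu` limits of every container in a Pod specification to confirm the Pod-level total. The `total_gpus` helper is hypothetical, and the dictionary is a trimmed copy of the manifest above.

```python
def total_gpus(pod_spec):
    """Sum the nvidia.com/gpu limits across all containers in a Pod spec dict."""
    total = 0
    for container in pod_spec.get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        # Containers without a GPU limit contribute zero.
        total += int(limits.get("nvidia.com/gpu", 0))
    return total


# Trimmed copy of the t4-pod example: only the GPU limits are kept.
t4_pod_spec = {
    "containers": [
        {"name": "t4-container-1", "resources": {"limits": {"nvidia.com/gpu": 3}}},
        {"name": "t4-container-2", "resources": {"limits": {"nvidia.com/gpu": 1}}},
    ]
}

print(total_gpus(t4_pod_spec))  # prints 4
```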

Verify GPU allocation

To check that a deployed GPU workload has the requested GPUs, run the following command:

kubectl get node NODE_NAME -o yaml

Replace NODE_NAME with the name of the node on which the Pod was scheduled.

The output is similar to the following:


apiVersion: v1
kind: Node
metadata:
...
  labels:
    ...
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
    cloud.google.com/gke-accelerator-count: "1"
    cloud.google.com/machine-family: custom-48
    ...
...
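
If you want to check these labels programmatically, the following sketch reads them from a node object such as the one that `kubectl get node NODE_NAME -o json` returns. The `accelerator_labels` helper is a hypothetical name, and the sample JSON is illustrative data that mirrors the labels shown above.

```python
import json

# Trimmed, illustrative node object that mirrors the labels shown above.
SAMPLE_NODE_JSON = """
{
  "kind": "Node",
  "metadata": {
    "labels": {
      "cloud.google.com/gke-accelerator": "nvidia-tesla-t4",
      "cloud.google.com/gke-accelerator-count": "1"
    }
  }
}
"""


def accelerator_labels(node):
    """Return (gpu_type, gpu_count) from a node object, or (None, 0) if unlabeled."""
    labels = node.get("metadata", {}).get("labels", {})
    gpu_type = labels.get("cloud.google.com/gke-accelerator")
    # Label values are strings, so convert the count to an integer.
    gpu_count = int(labels.get("cloud.google.com/gke-accelerator-count", "0"))
    return gpu_type, gpu_count


node = json.loads(SAMPLE_NODE_JSON)
print(accelerator_labels(node))  # prints ('nvidia-tesla-t4', 1)
```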

How GPU allocation works in Autopilot

After you request a GPU type and a quantity for the containers in a Pod and deploy the Pod, the following happens:

  1. If no allocatable GPU node exists, Autopilot provisions a new GPU node to schedule the Pod. Autopilot automatically installs the NVIDIA drivers that the hardware requires.
  2. Autopilot adds node taints to the GPU node and adds the corresponding tolerations to the Pod. This prevents GKE from scheduling other Pods on the GPU node.

Autopilot places exactly one GPU Pod on each GPU node, as well as any GKE-managed workloads that run on all nodes, and any DaemonSets that you configure to tolerate all node taints.

Run DaemonSets on every node

You might want to run DaemonSets on every node, even nodes with applied taints. For example, some logging and monitoring agents must run on every node in the cluster. You can configure those DaemonSets to ignore node taints so that GKE places those workloads on every node.

To run DaemonSets on every node in your cluster, including your GPU nodes, add the following toleration to your specification:

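apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: logging-agent
spec:
  # A DaemonSet requires a selector that matches the Pod template labels.
  # The "name: logging-agent" label is an illustrative choice.
  selector:
    matchLabels:
      name: logging-agent
  template:
    metadata:
      labels:
        name: logging-agent
    spec:
      # An empty key with the Exists operator tolerates every node taint.
      tolerations:
      - key: ""
        operator: "Exists"
        effect: ""
      containers:
      - name: logging-agent-v1
        image: IMAGE_PATH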

To run DaemonSets on specific GPU nodes in your cluster, add the following to your specification:

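apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: logging-agent
spec:
  # A DaemonSet requires a selector that matches the Pod template labels.
  # The "name: logging-agent" label is an illustrative choice.
  selector:
    matchLabels:
      name: logging-agent
  template:
    metadata:
      labels:
        name: logging-agent
    spec:
      # Restrict the DaemonSet to nodes that have the selected GPU type.
      nodeSelector:
        cloud.google.com/gke-accelerator: "GPU_TYPE"
      # An empty key with the Exists operator tolerates every node taint.
      tolerations:
      - key: ""
        operator: "Exists"
        effect: ""
      containers:
      - name: logging-agent-v1
        image: IMAGE_PATH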

Replace GPU_TYPE with the type of GPU in your target nodes. Can be one of the following:

  • nvidia-a100-80gb: NVIDIA A100 (80GB)
  • nvidia-tesla-a100: NVIDIA A100 (40GB)
  • nvidia-l4: NVIDIA L4
  • nvidia-tesla-t4: NVIDIA T4
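
The toleration in both DaemonSet manifests uses an empty key with the `Exists` operator, which Kubernetes treats as matching every taint. The following Python function is a simplified sketch of that matching rule, written to show why the toleration lets the DaemonSet land on tainted GPU nodes. It is not the scheduler's actual code, and the taint key `example.com/gpu-node` is a made-up placeholder rather than the taint that Autopilot applies.

```python
def tolerates(toleration, taint):
    """Return True if the toleration matches the taint (simplified Kubernetes rules)."""
    operator = toleration.get("operator") or "Equal"  # Equal is the default operator
    key = toleration.get("key", "")
    effect = toleration.get("effect", "")

    # An empty effect matches every effect; otherwise the effects must be equal.
    if effect and effect != taint.get("effect"):
        return False
    # An empty key is only valid with Exists and matches every taint key.
    if not key:
        return operator == "Exists"
    if key != taint.get("key"):
        return False
    if operator == "Exists":
        return True
    # With Equal, the values must also be the same.
    return toleration.get("value", "") == taint.get("value", "")


# Hypothetical taint; the real taint key on Autopilot GPU nodes may differ.
gpu_taint = {"key": "example.com/gpu-node", "value": "present", "effect": "NoSchedule"}

tolerate_everything = {"key": "", "operator": "Exists", "effect": ""}
tolerate_other_key = {"key": "example.com/other", "operator": "Exists"}

print(tolerates(tolerate_everything, gpu_taint))  # prints True
print(tolerates(tolerate_other_key, gpu_taint))   # prints False
```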

GPU use cases in Autopilot

You can allocate GPUs to containers in Autopilot Pods to run workloads such as the following:

  • Machine learning (ML) inference
  • ML training
  • Rendering

Supported GPU quantities

When you request GPUs in your Pod specification, you must use the following quantities based on the GPU type:

GPU quantities

GPU type              GPU_TYPE value       Supported quantities
NVIDIA L4             nvidia-l4            1, 2, 4, 8
NVIDIA T4             nvidia-tesla-t4      1, 2, 4
NVIDIA A100 (40GB)    nvidia-tesla-a100    1, 2, 4, 8, 16
NVIDIA A100 (80GB)    nvidia-a100-80gb     1, 2, 4, 8

If you request a GPU quantity that isn't supported for that type, Autopilot rejects your Pod.
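
To catch an unsupported quantity before you deploy, you could encode the quantities listed above in a small lookup. The following Python sketch does that; the `SUPPORTED_GPU_QUANTITIES` mapping copies the values from this page, and the `is_supported` helper is a hypothetical name used for illustration.

```python
# Supported quantities per GPU type, copied from the list on this page.
SUPPORTED_GPU_QUANTITIES = {
    "nvidia-l4": (1, 2, 4, 8),
    "nvidia-tesla-t4": (1, 2, 4),
    "nvidia-tesla-a100": (1, 2, 4, 8, 16),
    "nvidia-a100-80gb": (1, 2, 4, 8),
}


def is_supported(gpu_type, gpu_quantity):
    """Return True if the quantity is listed for the GPU type, otherwise False."""
    # Unknown GPU types have no supported quantities.
    return gpu_quantity in SUPPORTED_GPU_QUANTITIES.get(gpu_type, ())


print(is_supported("nvidia-tesla-t4", 4))  # prints True
print(is_supported("nvidia-tesla-t4", 8))  # prints False
```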

Monitor GPU nodes

If your GKE cluster has system metrics enabled, then the following metrics are available in Cloud Monitoring to monitor your GPU workload performance:

  • Duty Cycle (container/accelerator/duty_cycle): Percentage of time over the past sample period (10 seconds) during which the accelerator was actively processing. The value is between 1 and 100.
  • Memory Usage (container/accelerator/memory_used): Amount of accelerator memory allocated in bytes.
  • Memory Capacity (container/accelerator/memory_total): Total accelerator memory in bytes.
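
The two memory metrics are most useful together. As a small worked example, the following Python sketch turns a `memory_used` sample and a `memory_total` sample into a utilization percentage. The helper name and the sample values are illustrative assumptions, not real measurements.

```python
def memory_utilization_percent(memory_used_bytes, memory_total_bytes):
    """Return accelerator memory utilization as a percentage of total memory."""
    if memory_total_bytes <= 0:
        raise ValueError("memory_total_bytes must be positive")
    return 100.0 * memory_used_bytes / memory_total_bytes


GIB = 1024 ** 3

# Illustrative sample: 4 GiB allocated out of 16 GiB of accelerator memory.
print(memory_utilization_percent(4 * GIB, 16 * GIB))  # prints 25.0
```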

You can use predefined dashboards to monitor your clusters with GPU nodes. For more information, see View observability metrics. For general information about monitoring your clusters and their resources, refer to Observability for GKE.

View usage metrics for workloads

You can view your workload GPU usage metrics from the Workloads dashboard in the Google Cloud console.

To view your workload GPU usage, perform the following steps:

  1. Go to the Workloads page in the Google Cloud console.
  2. Select a workload.

The Workloads dashboard displays charts for GPU memory usage and capacity, and GPU duty cycle.

View NVIDIA Data Center GPU Manager (DCGM) metrics

You can collect and visualize NVIDIA DCGM metrics by using Google Cloud Managed Service for Prometheus. For Standard clusters, you must install the NVIDIA drivers. For Autopilot clusters, GKE installs the drivers.

For instructions on how to deploy DCGM and the Prometheus DCGM exporter, see NVIDIA Data Center GPU Manager (DCGM) in Google Cloud's operations suite documentation.

What's next