About GPU sharing strategies in GKE


This page explains and compares the GPU sharing strategies available in Google Kubernetes Engine (GKE). This page assumes that you're familiar with Kubernetes concepts such as Pods, nodes, deployments, and namespaces; and GKE concepts such as node pools, autoscaling, and node auto-provisioning.

How GPU requests work in Kubernetes

Kubernetes enables workloads to precisely request the resource amounts they need to function. Although you can request fractional CPU units for workloads, you can't request fractional GPU units. Pod manifests must request GPU resources in integers, which means that an entire physical GPU is allocated to one container even if the container only needs a fraction of the resources to function correctly. This is inefficient and can be costly, especially when you're running multiple workloads with similar low GPU requirements. We recommend that you use GPU sharing strategies to improve GPU utilization when your workloads don't use all the GPU resources.
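For example, a minimal Pod manifest that requests one whole GPU looks like the following sketch. The Pod name, container name, and image are placeholders; nvidia.com/gpu is the GPU resource name that also appears in the error message later on this page. The key point is that the request must be an integer, so this container is allocated an entire physical GPU.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-example         # placeholder name
spec:
  containers:
  - name: gpu-container     # placeholder name
    image: IMAGE            # replace with your container image
    resources:
      limits:
        nvidia.com/gpu: 1   # whole GPUs only; fractional requests such as 0.5 are not valid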

What are GPU sharing strategies?

GPU sharing strategies allow multiple containers to efficiently use your attached GPUs and save running costs. GKE provides the following GPU sharing strategies:

  • Multi-instance GPU: GKE divides a single supported GPU into up to seven slices. Each slice can be allocated to one container on the node independently, for a maximum of seven containers per GPU. Multi-instance GPU provides hardware isolation between the workloads, and consistent and predictable Quality of Service (QoS) for all containers running on the GPU.
  • GPU time-sharing: GKE uses the built-in time-sharing capability provided by the NVIDIA GPU and the software stack. Starting with the Pascal architecture, NVIDIA GPUs support instruction-level preemption. When doing context switching between processes running on a GPU, instruction-level preemption ensures that every process gets a fair timeslice. GPU time-sharing provides software-level isolation between the workloads in terms of address space isolation, performance isolation, and error isolation.
  • NVIDIA MPS: GKE uses NVIDIA's Multi-Process Service (MPS). NVIDIA MPS is an alternative, binary-compatible implementation of the CUDA API that's designed to transparently enable cooperative multi-process CUDA workloads to run concurrently on a single GPU device. NVIDIA MPS provides software-level isolation in terms of resource limits (active thread percentage and pinned device memory).

Which GPU sharing strategy to use

The following comparison summarizes the characteristics of the available GPU sharing strategies:

General

  • Multi-instance GPU: Parallel GPU sharing among containers.
  • GPU time-sharing: Rapid context switching.
  • NVIDIA MPS: Parallel GPU sharing among containers.

Isolation

  • Multi-instance GPU: A single GPU is divided into up to seven slices, and each container on the same physical GPU has dedicated compute, memory, and bandwidth. Therefore, a container in a partition has predictable throughput and latency even when other containers saturate other partitions.
  • GPU time-sharing: Each container accesses the full capacity of the underlying physical GPU through context switching between processes running on the GPU. However, time-sharing provides no memory limit enforcement between shared jobs, and the rapid context switching for shared access may introduce overhead.
  • NVIDIA MPS: NVIDIA MPS has limited resource isolation, but gains more flexibility in other dimensions, for example GPU types and maximum shared units, which simplify resource allocation.

Suitable for these workloads

  • Multi-instance GPU: Recommended for workloads that run in parallel and need a certain level of resiliency and QoS. For example, when running AI inference workloads, multi-instance GPU allows multiple inference queries to run simultaneously for quick responses, without slowing each other down.
  • GPU time-sharing: Recommended for bursty and interactive workloads that have idle periods. These workloads are not cost-effective with a fully dedicated GPU. By using time-sharing, workloads get quick access to the GPU when they are in active phases. GPU time-sharing is optimal for scenarios where full isolation and continuous GPU access might not be necessary, for example, when multiple users test or prototype workloads without idling costly GPUs. Workloads that use time-sharing need to tolerate certain performance and latency compromises.
  • NVIDIA MPS: Recommended for batch processing of small jobs because MPS maximizes the throughput and concurrent use of a GPU. MPS allows batch jobs to process efficiently in parallel for small to medium sized workloads. NVIDIA MPS is optimal for cooperative processes acting as a single application, for example, MPI jobs with inter-MPI rank parallelism. With these jobs, each small CUDA process (typically an MPI rank) can run concurrently on the GPU to fully saturate the whole GPU. Workloads that use CUDA MPS need to tolerate the memory protection and error containment limitations.

Monitoring

  • Multi-instance GPU: GPU utilization metrics are not available for multi-instance GPUs.
  • GPU time-sharing: Use Cloud Monitoring to monitor the performance of your GPU time-sharing nodes. To learn more about the available metrics, see Monitor GPU time-sharing or NVIDIA MPS nodes.
  • NVIDIA MPS: Use Cloud Monitoring to monitor the performance of your NVIDIA MPS nodes. To learn more about the available metrics, see Monitor GPU time-sharing or NVIDIA MPS nodes.

Request shared GPUs in workloads

  • Multi-instance GPU: Run multi-instance GPUs
  • GPU time-sharing: Run GPUs with time-sharing
  • NVIDIA MPS: Run GPUs with NVIDIA MPS

If you want to maximize your GPU utilization, you can combine the GPU sharing strategies to use either time-sharing or NVIDIA MPS for each multi-instance GPU partition. You can then run multiple containers on each partition, with those containers sharing access to the resources on that partition. We recommend that you use any of the following combinations:

  • Multi-instance GPU and GPU time-sharing.
  • Multi-instance GPU and NVIDIA MPS.
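For example, a workload that targets a multi-instance GPU partition that is also shared through time-sharing would combine the node labels described in the following sections, as in this sketch of a Pod spec fragment. The values are illustrative: 1g.5gb is one valid partition size for NVIDIA A100 (40GB) GPUs, and the strategy and client-count values must match your node pool configuration.

spec:
  nodeSelector:
    cloud.google.com/gke-gpu-partition-size: 1g.5gb            # example partition size; valid sizes depend on the GPU model
    cloud.google.com/gke-gpu-sharing-strategy: TIME_SHARING    # or MPS; use the value set on your nodes
    cloud.google.com/gke-max-shared-clients-per-gpu: "3"       # example client count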

How the GPU sharing strategies work

You can specify the maximum number of containers allowed to share a physical GPU. On Autopilot clusters, this is configured in your workload specification. On Standard clusters, this is configured when you create a new node pool with GPUs attached. Every GPU in the node pool is shared based on the setting you specify at the node pool level.

The following sections explain the scheduling behavior and operation of each GPU sharing strategy.

Multi-instance GPU

You can request multi-instance GPU in workloads by specifying the cloud.google.com/gke-gpu-partition-size label in the Pod spec nodeSelector field, under spec: nodeSelector.

GKE schedules workloads in available nodes that match these labels. If there are no available nodes, GKE uses autoscaling and node auto-provisioning to create new nodes or node pools that match this label.
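For example, the following sketch selects nodes with a specific partition size. The value 1g.5gb is only an example (one of the partition sizes supported on NVIDIA A100 40GB GPUs); the Pod name, container name, and image are placeholders.

apiVersion: v1
kind: Pod
metadata:
  name: mig-example         # placeholder name
spec:
  nodeSelector:
    cloud.google.com/gke-gpu-partition-size: 1g.5gb   # example partition size; valid sizes depend on the GPU model
  containers:
  - name: gpu-container     # placeholder name
    image: IMAGE            # replace with your container image
    resources:
      limits:
        nvidia.com/gpu: 1   # each multi-instance GPU partition is requested as one GPU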

GPU time-sharing or NVIDIA MPS

You can request GPU time-sharing or NVIDIA MPS in workloads by specifying the following labels in the Pod spec nodeSelector field, under spec: nodeSelector.

  • cloud.google.com/gke-max-shared-clients-per-gpu: Select nodes that allow a specific number of clients to share the underlying GPU.
  • cloud.google.com/gke-gpu-sharing-strategy: Select nodes that use the time-sharing or NVIDIA MPS strategy for GPUs.
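For example, the following sketch requests a node that shares each physical GPU with up to three clients by using GPU time-sharing. The label values are illustrative: the client count must match a value configured on your node pools, and the strategy value (TIME_SHARING or MPS, as listed later on this page) must match the value set on your nodes. Note that the container requests exactly one nvidia.com/gpu, as explained later on this page.

apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-example  # placeholder name
spec:
  nodeSelector:
    cloud.google.com/gke-max-shared-clients-per-gpu: "3"       # example client count
    cloud.google.com/gke-gpu-sharing-strategy: TIME_SHARING    # or MPS; use the value set on your nodes
  containers:
  - name: gpu-container     # placeholder name
    image: IMAGE            # replace with your container image
    resources:
      limits:
        nvidia.com/gpu: 1   # at most one shared GPU per container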

The following list describes how scheduling behavior changes based on the combination of node labels that you specify in your manifests.

Both cloud.google.com/gke-max-shared-clients-per-gpu and cloud.google.com/gke-gpu-sharing-strategy

GKE schedules workloads in available nodes that match both labels. If there are no available nodes, GKE uses autoscaling and node auto-provisioning to create new nodes or node pools that match both labels.

Only cloud.google.com/gke-max-shared-clients-per-gpu

  • Autopilot: GKE rejects the workload.
  • Standard: GKE schedules workloads in available nodes that match the label. If there are no available nodes, GKE uses autoscaling and node auto-provisioning to create new nodes or node pools that match the label. By default, auto-provisioned nodes are given the following label and value for each strategy:
      • GPU time-sharing: cloud.google.com/gke-gpu-sharing-strategy=TIME_SHARING
      • NVIDIA MPS: cloud.google.com/gke-gpu-sharing-strategy=MPS

Only cloud.google.com/gke-gpu-sharing-strategy

  • Autopilot: GKE rejects the workload.
  • Standard: GKE schedules workloads in available nodes that use the specified sharing strategy:
      • If there are multiple shared node pools with different values for cloud.google.com/gke-max-shared-clients-per-gpu, the workload can be scheduled on any available node.
      • If there are no available nodes in any node pool, the cluster autoscaler scales up the node pool with the lowest value for cloud.google.com/gke-max-shared-clients-per-gpu.
      • If all node pools are at capacity, node auto-provisioning creates a new node pool with a default value of cloud.google.com/gke-max-shared-clients-per-gpu=2.

The GPU request process that you complete is the same for the GPU time-sharing and NVIDIA MPS strategies. If you're developing GPU applications that run on GPU time-sharing or NVIDIA MPS, you can only request one GPU for each container. GKE rejects requests for more than one GPU in a container to avoid unexpected behavior, and because the number of GPU time-sharing or NVIDIA MPS units requested is not a measure of the compute power available to the container.

The following list shows you what to expect when you request specific quantities of GPUs with the GPU time-sharing or NVIDIA MPS strategy.

One GPU time-sharing or NVIDIA MPS per container

GKE allows the request, whether the node has one physical GPU or multiple physical GPUs.

More than one GPU time-sharing per container

GKE rejects the request. This behavior includes requesting more than one multi-instance GPU instance in a container, because each GPU instance is considered to be a discrete physical GPU.

More than one NVIDIA MPS per container

Based on the number of physical GPUs in the node, GKE does the following:

  • GKE allows the request when the node has only one physical GPU.
  • GKE rejects the request when the node has multiple physical GPUs, such as when you request more than one multi-instance GPU instance in a container. This rejection happens because each GPU instance is considered to be a discrete physical GPU.
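For example, a container that requests more than one shared GPU, as in the following hypothetical fragment, is rejected on a GPU time-sharing node:

resources:
  limits:
    nvidia.com/gpu: 2   # more than one shared GPU per container; GKE rejects this request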

If GKE rejects the workload, you see an error message similar to the following:

status:
  message: 'Pod Allocate failed due to rpc error: code = Unknown desc = [invalid request
    for sharing GPU (time-sharing), at most 1 nvidia.com/gpu can be requested on GPU nodes], which is unexpected'
  phase: Failed
  reason: UnexpectedAdmissionError

Monitor GPU time-sharing or NVIDIA MPS nodes

Use Cloud Monitoring to monitor the performance of your GPU time-sharing or NVIDIA MPS nodes. GKE sends metrics for each GPU node to Cloud Monitoring. These metrics are different from the metrics for regular GPU nodes that don't use GPU time-sharing or NVIDIA MPS.

You can check the following metrics for each GPU time-sharing or NVIDIA MPS node in Cloud Monitoring:

  • Duty cycle (node/accelerator/duty_cycle): Percentage of time over the last sample period (10 seconds) during which the GPU node was actively processing. Ranges from 1% to 100%.
  • Memory usage (node/accelerator/memory_used): Amount of accelerator memory allocated in bytes for each GPU node.
  • Memory capacity (node/accelerator/memory_total): Total accelerator memory in bytes for each GPU node.

These GPU time-sharing or NVIDIA MPS node metrics apply at the node level (node/accelerator/), and the metrics for regular physical GPUs apply at the container level (container/accelerator/). The metrics for regular physical GPUs are not collected for containers scheduled on a GPU that uses GPU time-sharing or NVIDIA MPS.

What's next