[[["易于理解","easyToUnderstand","thumb-up"],["解决了我的问题","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["很难理解","hardToUnderstand","thumb-down"],["信息或示例代码不正确","incorrectInformationOrSampleCode","thumb-down"],["没有我需要的信息/示例","missingTheInformationSamplesINeed","thumb-down"],["翻译问题","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["最后更新时间 (UTC):2025-07-16。"],[],[],null,["# About GPU sharing strategies in GKE\n\n[Autopilot](/kubernetes-engine/docs/concepts/autopilot-overview) [Standard](/kubernetes-engine/docs/concepts/choose-cluster-mode)\n\n*** ** * ** ***\n\nThis page explains the characteristics and best types of workloads for each GPU\nsharing strategy available in Google Kubernetes Engine (GKE), such as\nmulti-instance GPUs, GPU time-sharing, and NVIDIA MPS. GPU sharing helps you to\nminimize underutilized capacity in your cluster and to provide workloads with\njust enough capacity to complete tasks.\n\nThis page is for Platform admins and operators and for\nData and AI specialists who want to run GPU-based workloads that consume GPU\ncapacity as efficiently as possible. To learn more about common roles that we\nreference in Google Cloud content, see\n[Common GKE user roles and tasks](/kubernetes-engine/enterprise/docs/concepts/roles-tasks).\n\nBefore reading this page, ensure that you're familiar with the following\nconcepts:\n\n- Kubernetes concepts, such as Pods, nodes, deployments, and namespaces.\n- GKE concepts, such as [node pools](/kubernetes-engine/docs/concepts/node-pools), [autoscaling](/kubernetes-engine/docs/concepts/cluster-autoscaler), and [node auto-provisioning](/kubernetes-engine/docs/how-to/node-auto-provisioning).\n\nHow GPU requests work in Kubernetes\n-----------------------------------\n\nKubernetes enables workloads to request precisely the resource amounts they need\nto function. Although you can request fractional *CPU* units for workloads, you\ncan't request fractional *GPU* units. Pod manifests must request GPU resources in\nintegers, which means that an entire physical GPU is allocated to one container\neven if the container only needs a fraction of the resources to function\ncorrectly. This is inefficient and can be costly, especially when you're running\nmultiple workloads with similar low GPU requirements.\n**Best practice** :\n\nUse GPU sharing strategies to improve GPU utilization when your workloads don't need all of the GPU resources.\n\nWhat are GPU sharing strategies?\n--------------------------------\n\nGPU sharing strategies allow multiple containers to efficiently use your\nattached GPUs and save running costs. GKE provides the following\nGPU sharing strategies:\n\n- **Multi-instance GPU**: GKE divides a single supported GPU in up to seven slices. Each slice can be allocated to one container on the node independently, for a maximum of seven containers per GPU. Multi-instance GPU provides hardware isolation between the workloads, plus consistent and predictable Quality of Service (QoS) for all containers running on the GPU.\n- **GPU time-sharing** : GKE uses the built-in timesharing ability provided by the NVIDIA GPU and the software stack. Starting with the [Pascal architecture](https://www.nvidia.com/en-us/data-center/pascal-gpu-architecture/), NVIDIA GPUs support instruction level preemption. When doing context switching between processes running on a GPU, instruction-level preemption ensures every process gets a fair timeslice. 
What are GPU sharing strategies?
--------------------------------

GPU sharing strategies allow multiple containers to efficiently use your
attached GPUs and save running costs. GKE provides the following
GPU sharing strategies:

- **Multi-instance GPU**: GKE divides a single supported GPU into up to seven slices. Each slice can be allocated to one container on the node independently, for a maximum of seven containers per GPU. Multi-instance GPU provides hardware isolation between the workloads, plus consistent and predictable Quality of Service (QoS) for all containers running on the GPU.
- **GPU time-sharing**: GKE uses the built-in time-sharing ability provided by the NVIDIA GPU and the software stack. Starting with the [Pascal architecture](https://www.nvidia.com/en-us/data-center/pascal-gpu-architecture/), NVIDIA GPUs support instruction-level preemption. When doing context switching between processes running on a GPU, instruction-level preemption ensures that every process gets a fair timeslice. GPU time-sharing provides software-level isolation between the workloads in terms of address space isolation, performance isolation, and error isolation.
- **NVIDIA MPS**: GKE uses [NVIDIA's Multi-Process Service (MPS)](https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf). NVIDIA MPS is an alternative, binary-compatible implementation of the CUDA API designed to transparently enable co-operative multi-process CUDA workloads to run concurrently on a single GPU device. NVIDIA MPS provides software-level isolation in terms of resource limits ([active thread percentage](https://docs.nvidia.com/deploy/mps/index.html#topic_5_2_5) and [pinned device memory](https://docs.nvidia.com/deploy/mps/index.html#topic_5_2_7)).

Which GPU sharing strategy to use
---------------------------------

The following table summarizes and compares the characteristics of the available
GPU sharing strategies.

**Best practice**:

To maximize your GPU utilization, combine GPU sharing strategies. For each
multi-instance GPU partition, use either time-sharing or NVIDIA MPS. You can
then run multiple containers on each partition, with those containers sharing
access to the resources on that partition. We recommend that you use either of
the following combinations:

- Multi-instance GPU and GPU time-sharing.
- Multi-instance GPU and NVIDIA MPS.

### How the GPU sharing strategies work

You can specify the maximum number of containers allowed to share a physical GPU:

- On Autopilot clusters, this is configured in your workload specification.
- On Standard clusters, this is configured when you create a new node pool with GPUs attached. Every GPU in the node pool is shared based on the setting you specify at the node pool level.

The following sections explain the scheduling behavior and operation of each GPU
sharing strategy.

#### Multi-instance GPU

You can request multi-instance GPUs in workloads by specifying the
`cloud.google.com/gke-gpu-partition-size` label in the Pod spec
`nodeSelector` field, under `spec: nodeSelector`, as shown in the example
manifests later on this page.

GKE schedules workloads to appropriate available nodes by matching these
labels. If there are no appropriate available nodes, GKE uses autoscaling
and node auto-provisioning to create new nodes or node pools that match this
label.

#### GPU time-sharing or NVIDIA MPS

You can request GPU time-sharing or NVIDIA MPS in workloads by specifying the
following labels in the Pod spec `nodeSelector` field, under `spec: nodeSelector`:

- `cloud.google.com/gke-max-shared-clients-per-gpu`: Select nodes that allow a specific number of clients to share the underlying GPU.
- `cloud.google.com/gke-gpu-sharing-strategy`: Select nodes that use the time-sharing or NVIDIA MPS strategy for GPUs.

The following table describes how scheduling behavior changes based on the
combination of node labels that you specify in your manifests.

The GPU request process that you complete is the same for the GPU time-sharing
and NVIDIA MPS strategies.

If you're developing GPU applications that run on GPU time-sharing or
NVIDIA MPS, you can only request one GPU for each container.
GKE rejects a request for more than one GPU in a container to avoid unexpected
behavior. In addition, the number of GPUs requested with time-sharing and
NVIDIA MPS is not a measure of the compute power available to the container.
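As a concrete illustration of these node selectors, the following sketch shows
two hypothetical Pod manifests: one that targets a multi-instance GPU partition
and one that targets a node configured for GPU time-sharing. The Pod names,
container image, partition size, and maximum client count are assumptions for
illustration only; they must match how your node pool is actually configured:

    # Hypothetical Pod that requests a slice of a multi-instance GPU.
    apiVersion: v1
    kind: Pod
    metadata:
      name: mig-example                                      # hypothetical name
    spec:
      nodeSelector:
        cloud.google.com/gke-gpu-partition-size: 1g.5gb      # assumed partition size
      containers:
      - name: cuda-container
        image: nvidia/cuda:12.2.2-base-ubuntu22.04           # assumed image
        command: ["sleep", "infinity"]
        resources:
          limits:
            nvidia.com/gpu: 1
    ---
    # Hypothetical Pod that requests GPU time-sharing. For NVIDIA MPS, the
    # sharing strategy label would select the MPS strategy instead.
    apiVersion: v1
    kind: Pod
    metadata:
      name: time-sharing-example                             # hypothetical name
    spec:
      nodeSelector:
        cloud.google.com/gke-gpu-sharing-strategy: time-sharing
        cloud.google.com/gke-max-shared-clients-per-gpu: "2" # assumed client count
      containers:
      - name: cuda-container
        image: nvidia/cuda:12.2.2-base-ubuntu22.04           # assumed image
        command: ["sleep", "infinity"]
        resources:
          limits:
            nvidia.com/gpu: 1    # at most one GPU per container when sharing

Both Pods request `nvidia.com/gpu: 1`, which is the maximum allowed per
container with GPU time-sharing or NVIDIA MPS.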
The following table shows you what to expect when you request specific
quantities of GPUs.

If GKE rejects the workload, you see an error message similar
to the following:

    status:
      message: 'Pod Allocate failed due to rpc error: code = Unknown desc = [invalid request
        for sharing GPU (time-sharing), at most 1 nvidia.com/gpu can be requested on GPU nodes], which is unexpected'
      phase: Failed
      reason: UnexpectedAdmissionError

Monitor GPU time-sharing or NVIDIA MPS nodes
--------------------------------------------

Use [Cloud Monitoring](/monitoring/docs) to monitor the performance of your
GPU time-sharing or NVIDIA MPS nodes. GKE sends metrics for each
GPU node to Cloud Monitoring. These GPU time-sharing or NVIDIA MPS node metrics
apply at the node level (`node/accelerator/`).

You can check the following metrics for each GPU time-sharing or NVIDIA MPS node
in Cloud Monitoring:

- **Duty cycle (`node/accelerator/duty_cycle`)**: Percentage of time over the last sample period (10 seconds) during which the GPU node was actively processing. Ranges from 1% to 100%.
- **Memory usage (`node/accelerator/memory_used`)**: Amount of accelerator memory allocated in bytes for each GPU node.
- **Memory capacity (`node/accelerator/memory_total`)**: Total accelerator memory in bytes for each GPU node.

These metrics are *different* from the
[metrics for regular GPUs](/kubernetes-engine/docs/how-to/gpus#monitoring) that
don't use time-sharing or NVIDIA MPS. The metrics for
[regular physical GPUs](/kubernetes-engine/docs/concepts/gpus#monitoring)
apply at the container level (`container/accelerator`) and are *not* collected
for containers scheduled on a GPU that uses GPU time-sharing or NVIDIA MPS.

What's next
-----------

- Learn how to [share GPUs with multiple workloads using GPU time-sharing](/kubernetes-engine/docs/how-to/timesharing-gpus).
- Learn how to [share GPUs with multiple workloads using NVIDIA MPS](/kubernetes-engine/docs/how-to/nvidia-mps-gpus).
- Learn how to [run multi-instance GPUs](/kubernetes-engine/docs/how-to/gpus-multi).
- Learn more about [GPUs](/kubernetes-engine/docs/how-to/gpus).
- For more information about compute preemption for the NVIDIA GPU, refer to the [NVIDIA Pascal Tuning Guide](https://docs.nvidia.com/cuda/pascal-tuning-guide/index.html#preemption).