Define a Kubernetes Pod failure policy and a Pod backoff failure limit to handle retriable and non-retriable failures in your Jobs. Doing so avoids unnecessary Pod retries and Job failures caused by Pod disruptions, as well as an unnecessary increase in cluster resource consumption. For example, you can configure for preemption, API-initiated eviction, or taint-based eviction, where Pods that don't have a toleration for the NoExecute taint effect are evicted. Learn how to handle retriable and non-retriable Pod failures with a Pod failure policy.
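As a sketch, a Job that combines a backoff limit with a Pod failure policy might look like the following; the Job name, container name, and exit code are illustrative:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job  # illustrative name
spec:
  backoffLimit: 6  # maximum retries for retriable failures
  podFailurePolicy:
    rules:
    # Don't count Pod disruptions (preemption, API-initiated
    # eviction, or taint-based eviction) against backoffLimit.
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
    # Fail the whole Job immediately on a non-retriable exit code.
    - action: FailJob
      onExitCodes:
        containerName: main  # illustrative container name
        operator: In
        values: [42]
  template:
    spec:
      restartPolicy: Never  # required when using podFailurePolicy
      containers:
      - name: main
        image: busybox
        command: ["sh", "-c", "exit 0"]
```

With this policy, a Pod evicted through a disruption doesn't consume a retry, while a Pod exiting with code 42 fails the Job without retrying.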
Manage multiple Jobs as a unit
Use the JobSet API to manage multiple Jobs as a unit, addressing workload patterns such as one driver (or coordinator) and multiple workers (for example, MPIJob), while setting Job defaults that align with common patterns based on your use case. For example, you can create Indexed Jobs by default, create a headless Service for predictable fully qualified domain names (FQDNs) of the Pods, and set an associated Pod failure policy.
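Assuming the JobSet CRD (jobset.x-k8s.io) is installed in the cluster, a minimal driver-plus-workers sketch might look like the following; names, replica counts, and images are illustrative. By default, JobSet creates the child Jobs as Indexed Jobs and creates a headless Service so the Pods get predictable FQDNs:

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: example-jobset  # illustrative name
spec:
  replicatedJobs:
  - name: driver        # one coordinator Job
    replicas: 1
    template:
      spec:
        completions: 1
        parallelism: 1
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: driver
              image: busybox
              command: ["sh", "-c", "echo coordinating"]
  - name: workers       # one Job running several worker Pods
    replicas: 1
    template:
      spec:
        completions: 4
        parallelism: 4
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: busybox
              command: ["sh", "-c", "echo working"]
```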
Extend the run time of Pods that can't tolerate restarts
Set the Kubernetes cluster-autoscaler.kubernetes.io/safe-to-evict annotation to false in the Pod spec. The cluster autoscaler respects the eviction rules set on Pods. If a node contains a Pod with the cluster-autoscaler.kubernetes.io/safe-to-evict annotation set to false, the autoscaler can't delete that node.
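For example, a Pod that should not be interrupted by node scale-down could carry the annotation as follows; the Pod name and workload are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: long-running-task  # illustrative name
  annotations:
    # Tell the cluster autoscaler not to evict this Pod,
    # which prevents scale-down of the node it runs on.
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  restartPolicy: Never
  containers:
  - name: main
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
```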
# Best practices for running batch workloads on GKE

Autopilot | Standard

*Last updated: 2025-09-01 (UTC).*

This page introduces the best practices for building and optimizing batch processing platforms with Google Kubernetes Engine (GKE), including best practices for:

- Architecture
- Job management
- Multi-tenancy
- Security
- Queueing
- Storage
- Performance
- Cost efficiency
- Monitoring

GKE provides a powerful framework for orchestrating batch workloads such as data processing, [training machine learning models](/blog/products/ai-machine-learning/build-a-ml-platform-with-kubeflow-and-ray-on-gke), [running scientific simulations](/blog/products/containers-kubernetes/gke-gpu-sharing-helps-scientists-quest-for-neutrinos), and other [high performance computing workloads](https://www.pgs.com/company/newsroom/news/industry-insights--hpc-in-the-cloud/).

These best practices are intended for platform administrators, cloud architects, and operations professionals interested in deploying batch workloads in GKE. The [Reference Architecture: Batch Processing Platform on GKE](https://github.com/ai-on-gke/batch-reference-architecture) showcases many of the best practices discussed in this guide, and can be deployed in your own Google Cloud project.

How batch workloads work
------------------------

A batch workload is a group of tasks that run to completion without user intervention.
To define tasks, you use the Kubernetes [Jobs](https://kubernetes.io/docs/concepts/workloads/controllers/job) resource. A batch platform receives the Jobs and queues them in the order they are received. The queue in the batch platform applies processing logic such as priority, quota, and allocatable resources. By queueing and customizing the batch processing parameters, Kubernetes lets you optimize the use of available resources, minimize the idle time for scheduled Jobs, and maximize cost savings. The following diagram shows the GKE components that can be part of a batch platform.

![Diagram of the GKE components that can be part of a batch platform](/static/kubernetes-engine/images/batch-process.svg)

Batch platform management
-------------------------

Traditionally, batch platforms have two main user personas, developers and platform administrators:

- A developer submits a Job specifying the program, the data to be processed, and the requirements for the Job. The developer then receives confirmation of the Job submission and a unique identifier. When the Job is complete, the developer gets a notification along with any output or results of the Job.
- A platform administrator manages and delivers an efficient and reliable batch processing platform to the developers.

A batch processing platform must meet the following requirements:

- The platform resources are properly provisioned to ensure that Jobs run with little to no user intervention required.
- The platform resources are configured according to the organization's security and observability best practices.
- The platform resources are used as efficiently as possible.
- In case of resource contention, the most important work gets done first.

### Prepare the batch platform architecture in GKE

A GKE environment consists of nodes, which are Compute Engine virtual machines (VMs), grouped together to form a cluster.

The following table lists the key recommendations when planning and designing your batch platform architecture:

### Manage the Job lifecycle

In Kubernetes, you run your workloads in a set of [*Pods*](https://kubernetes.io/docs/concepts/workloads/pods/). Pods are groups of one or more containers with shared storage and network resources. Pods are defined by a Kubernetes specification.

A Job creates one or more Pods and continually tries to run them until a specified number of Pods successfully terminate. As Pods complete, the Job tracks the successful completions. When a specified number of successful completions is reached, the Job is complete.

The following table lists the key recommendations when designing and managing Jobs:

Manage multi-tenancy
--------------------

GKE cluster [multi-tenancy](/kubernetes-engine/docs/concepts/multitenancy-overview) is an approach to managing GKE resources shared by different users or workloads, known as *tenants*, in a single organization.
The management of GKE resources might follow criteria such as tenant isolation, [quotas and limit ranges](/kubernetes-engine/quotas), or cost allocation.

![Diagram of multi-tenancy in a GKE cluster](/static/kubernetes-engine/images/enterprise-multitenancy.svg)

The following table lists the key recommendations when managing multi-tenancy:

### Control access to the batch platform

GKE lets you finely tune the access permissions of the workloads running on the cluster.

The following table lists the key recommendations when managing access and security:

#### Queueing and fair sharing

To control resource consumption, you can assign resource quota limits for each tenant, queue incoming Jobs, and process Jobs in the order they were received.

The following table lists the key recommendations when managing queueing and fair sharing among batch workloads:

### Optimize storage, performance, and cost efficiency

The efficient use of your GKE compute and [storage](/kubernetes-engine/docs/concepts/storage-overview) resources can reduce costs. One strategy is to right-size and configure your compute instances to align with your batch processing needs without sacrificing performance.

The following table lists the key recommendations when designing and managing storage and optimizing performance:

### Monitor clusters

GKE is integrated with observability and logging tools that help you monitor the reliability and efficiency of your cluster. The following table lists the key recommendations when enabling and using GKE observability tools:

What's next
-----------

- Learn how to [Deploy a batch system using Kueue](/kubernetes-engine/docs/tutorials/kueue-intro)
- See the [Best practices for running cost-optimized Kubernetes applications on GKE](/architecture/best-practices-for-running-cost-effective-kubernetes-applications-on-gke#make_sure_your_application_can_grow_vertically_and_horizontally)