[[["容易理解","easyToUnderstand","thumb-up"],["確實解決了我的問題","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["難以理解","hardToUnderstand","thumb-down"],["資訊或程式碼範例有誤","incorrectInformationOrSampleCode","thumb-down"],["缺少我需要的資訊/範例","missingTheInformationSamplesINeed","thumb-down"],["翻譯問題","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["上次更新時間:2025-09-04 (世界標準時間)。"],[],[],null,["# TPU collection scheduling for inference workloads\n=================================================\n\nTrillium (v6e) includes a feature called \"collection scheduling\" that lets\nyou group a set of TPU slices, single or multi-host, intended to serve replicas\nof the same model. This feature is available for both Cloud TPU and GKE configurations.\n\nThis document is about using collection scheduling with the\nCloud TPU API. See the\n[GKE documentation](/kubernetes-engine/docs/concepts/tpus#collection-scheduling)\nfor more information about using collection scheduling with GKE.\n\nBy creating a collection for\nyour inference workload,\nGoogle Cloud limits and streamlines\ninterruptions to the operations of inference workloads.\nThis is useful for inference workloads where high availability\nis a concern. Google Cloud ensures high availability\nfor the collection to manage incoming traffic. A portion of\nslices within a collection is always available to handle incoming traffic.\n\nEach TPU slice in a collection will have the same accelerator type and topology.\n| **Note:** Collection scheduling only applies to v6e.\n\n### Create a collection from the Cloud TPU API\n\nWhen you request a queued resource using the Cloud TPU API,\nyou use the `--workload-type=AVAILABILITY-OPTIMIZED` flag to create a\ncollection. This flag indicates to the Cloud TPU infrastructure that it is\nmeant to be used for availability-focused workloads.\n\nThe following command provisions a collection using the\nCloud TPU API: \n\n```bash\ngcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \\\n --project=PROJECT_ID \\\n --zone=ZONE \\\n --accelerator-type=ACCELERATOR_TYPE \\\n --runtime-version=RUNTIME_VERSION \\\n --node-count=NODE_COUNT \\\n --node-prefix=NODE_PREFIX \\\n --workload-type=AVAILABILITY-OPTIMIZED\n```\n\nThe `--node-count` flag specifies the number of slices you want in your\nqueued resource. This creates a collection of TPU slices.\n\nOptional: The `--node-prefix` flag specifies a prefix for the slice names.\n| **Note:** The supported accelerator types are described in [v6e supported configurations](/tpu/docs/v6e#configurations)."]]