TPU collection scheduling for inference workloads
Trillium (v6e) includes a feature called "collection scheduling" that lets
you group a set of single-host or multi-host TPU slices that are intended to serve
replicas of the same model. This feature is available for both Cloud TPU and GKE configurations.
This document describes how to use collection scheduling with the
Cloud TPU API. For more information about using collection scheduling
with GKE, see the GKE documentation.
When you create a collection for your inference workload,
Google Cloud limits and streamlines interruptions to that workload.
This is useful for inference workloads where high availability
is a concern. Google Cloud ensures that a portion of the
slices within a collection is always available to handle incoming traffic.
Each TPU slice in a collection has the same accelerator type and topology.
Note: Collection scheduling only applies to v6e.
Create a collection from the Cloud TPU API
When you request a queued resource using the Cloud TPU API,
you use the --workload-type=AVAILABILITY-OPTIMIZED flag to create a
collection. This flag indicates to the Cloud TPU infrastructure that the
queued resource is meant to be used for availability-focused workloads.
The following command provisions a collection using the
Cloud TPU API:
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-09-04 UTC."],[],[],null,["# TPU collection scheduling for inference workloads\n=================================================\n\nTrillium (v6e) includes a feature called \"collection scheduling\" that lets\nyou group a set of TPU slices, single or multi-host, intended to serve replicas\nof the same model. This feature is available for both Cloud TPU and GKE configurations.\n\nThis document is about using collection scheduling with the\nCloud TPU API. See the\n[GKE documentation](/kubernetes-engine/docs/concepts/tpus#collection-scheduling)\nfor more information about using collection scheduling with GKE.\n\nBy creating a collection for\nyour inference workload,\nGoogle Cloud limits and streamlines\ninterruptions to the operations of inference workloads.\nThis is useful for inference workloads where high availability\nis a concern. Google Cloud ensures high availability\nfor the collection to manage incoming traffic. A portion of\nslices within a collection is always available to handle incoming traffic.\n\nEach TPU slice in a collection will have the same accelerator type and topology.\n| **Note:** Collection scheduling only applies to v6e.\n\n### Create a collection from the Cloud TPU API\n\nWhen you request a queued resource using the Cloud TPU API,\nyou use the `--workload-type=AVAILABILITY-OPTIMIZED` flag to create a\ncollection. This flag indicates to the Cloud TPU infrastructure that it is\nmeant to be used for availability-focused workloads.\n\nThe following command provisions a collection using the\nCloud TPU API: \n\n```bash\ngcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \\\n --project=PROJECT_ID \\\n --zone=ZONE \\\n --accelerator-type=ACCELERATOR_TYPE \\\n --runtime-version=RUNTIME_VERSION \\\n --node-count=NODE_COUNT \\\n --node-prefix=NODE_PREFIX \\\n --workload-type=AVAILABILITY-OPTIMIZED\n```\n\nThe `--node-count` flag specifies the number of slices you want in your\nqueued resource. This creates a collection of TPU slices.\n\nOptional: The `--node-prefix` flag specifies a prefix for the slice names.\n| **Note:** The supported accelerator types are described in [v6e supported configurations](/tpu/docs/v6e#configurations)."]]