TPU collection scheduling for inference workloads
Trillium (v6e) includes a feature called "collection scheduling" that lets you group a set of TPU slices, single-host or multi-host, that are intended to serve replicas of the same model. This feature is available for both Cloud TPU and GKE configurations.
This document describes how to use collection scheduling with the Cloud TPU API. For information about using collection scheduling with GKE, see the GKE documentation.
When you create a collection for your inference workload, Google Cloud limits and streamlines interruptions to the workload's operation. This is useful for inference workloads, where high availability is a concern. Google Cloud ensures high availability for the collection: a portion of the slices within a collection is always available to handle incoming traffic.
All TPU slices in a collection have the same accelerator type and topology.
Collection scheduling only applies to v6e.
Create a collection from the Cloud TPU API
When you request a queued resource using the Cloud TPU API, you use the --workload-type=AVAILABILITY-OPTIMIZED flag to create a collection. This flag indicates to the Cloud TPU infrastructure that the queued resource is meant to be used for availability-focused workloads.
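The provisioning command that follows uses several environment variables. A minimal setup might look like the following sketch; the values are placeholders, so substitute your own project, zone, accelerator type, and slice count.

export PROJECT_ID=your-project-id   # Google Cloud project that owns the TPUs (placeholder)
export ZONE=us-east5-b              # example zone; use a zone where v6e is available
export ACCELERATOR_TYPE=v6e-16      # example v6e accelerator type for each slice
export NODE_COUNT=2                 # example number of slices in the collection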
The following command provisions a collection using the Cloud TPU API:
gcloud alpha compute tpus queued-resources create serving-QR \
    --project=${PROJECT_ID} \
    --zone=${ZONE} \
    --accelerator-type=${ACCELERATOR_TYPE} \
    --node-count=${NODE_COUNT} \
    --node-prefix="servingTPU" \
    --workload-type=AVAILABILITY-OPTIMIZED
The --node-count flag specifies the number of slices you want in your queued resource. This creates a collection of TPU slices.
Optional: The --node-prefix flag specifies a prefix for the slice names.
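After you create the queued resource, you can check its state and see the slices it contains. The following commands are standard queued-resource and TPU VM listing commands; the node names mentioned in the comments (for example, servingTPU-0) are an assumption based on the --node-prefix value and may differ in your output.

# Check the provisioning state of the queued resource (for example ACCEPTED, PROVISIONING, or ACTIVE)
gcloud alpha compute tpus queued-resources describe serving-QR \
    --project=${PROJECT_ID} \
    --zone=${ZONE}

# List the TPU nodes in the zone; the slices created for the collection are
# expected to include the --node-prefix value in their names (assumption),
# for example servingTPU-0 and servingTPU-1.
gcloud compute tpus tpu-vm list \
    --project=${PROJECT_ID} \
    --zone=${ZONE}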