TPU collection scheduling for inference workloads

Trillium (v6e) includes a feature called "collection scheduling" that lets you group a set of TPU slices, single-host or multi-host, that are intended to serve replicas of the same model. This feature is available for both Cloud TPU and GKE configurations.

This document describes how to use collection scheduling with the Cloud TPU API. See the GKE documentation for more information about using collection scheduling with GKE.

When you create a collection for your inference workload, Google Cloud limits and streamlines interruptions to the workload's operation. This is useful for inference workloads where high availability is a concern. Google Cloud ensures high availability for the collection: a portion of the slices within a collection is always available to handle incoming traffic.

All TPU slices in a collection have the same accelerator type and topology.

Collection scheduling only applies to v6e.

Create a collection from the Cloud TPU API

When you request a queued resource using the Cloud TPU API, use the --workload-type=AVAILABILITY-OPTIMIZED flag to create a collection. This flag indicates to the Cloud TPU infrastructure that the queued resource is meant to be used for availability-focused workloads.
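
The command in the next step references several environment variables. The following is a minimal sketch of setting them; the values shown are hypothetical, so replace them with your own project, zone, and v6e configuration:

export PROJECT_ID=your-project-id
export ZONE=us-east5-b
export ACCELERATOR_TYPE=v6e-16
export NODE_COUNT=2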

The following command provisions a collection using the Cloud TPU API:

gcloud alpha compute tpus queued-resources create serving-QR \
   --project=${PROJECT_ID} \
   --zone=${ZONE} \
   --accelerator-type ${ACCELERATOR_TYPE} \
   --node-count ${NODE_COUNT} \
   --node-prefix "servingTPU" \
   --workload-type=AVAILABILITY-OPTIMIZED

The --node-count flag specifies the number of TPU slices that you want in your queued resource. Together, these slices form a collection.

Optional: The --node-prefix flag specifies a prefix for the slice names.
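
After you create the queued resource, you can check its status with the describe command. The following is a sketch that assumes the queued resource name serving-QR and the environment variables used in the create command above:

gcloud alpha compute tpus queued-resources describe serving-QR \
   --project=${PROJECT_ID} \
   --zone=${ZONE}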