TPU types and zones

Overview

When you create TPU nodes to handle your machine learning workloads, you must select a TPU type. The TPU type defines the TPU version, the number of TPU cores, and the amount of TPU memory that is available for your machine learning workload.

For example, the v2-8 TPU type defines a TPU node with 8 TPU v2 cores and 64 GiB of total TPU memory. The v3-2048 TPU type defines a TPU node with 2048 TPU v3 cores and 32 TiB of total TPU memory.
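As these examples imply, total TPU memory scales with core count: each v2 core provides 8 GiB of memory (64 GiB / 8 cores) and each v3 core provides 16 GiB (128 GiB / 8 cores). A minimal sketch of that arithmetic, assuming the per-core figures derived from the tables in this document:

```python
# Per-core TPU memory in GiB, derived from the type tables below
# (8 GiB per v2 core, 16 GiB per v3 core).
GIB_PER_CORE = {"v2": 8, "v3": 16}

def total_memory_gib(tpu_type: str) -> int:
    """Return total TPU memory in GiB for a type string like 'v3-2048'."""
    version, cores = tpu_type.split("-")
    return GIB_PER_CORE[version] * int(cores)

print(total_memory_gib("v2-8"))     # -> 64 (GiB)
print(total_memory_gib("v3-2048"))  # -> 32768 (GiB, i.e. 32 TiB)
```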

To learn about the hardware differences between TPU versions and configurations, read the System Architecture documentation.

To see pricing for each TPU type in each region, see the Pricing page.

You can change the TPU type to another TPU type that has the same number of cores (for example, from v2-8 to v3-8) and run your training script without code changes. However, if you change to a TPU type with a larger or smaller number of cores, you will need to perform significant tuning and optimization. For more information, see Training on TPU Pods.

TPU types and zones

The main differences between each TPU type are price, performance, memory capacity, and zonal availability.

Google Cloud Platform uses regions, subdivided into zones, to define the geographic location of physical computing resources. For example, the us-central1 region denotes a region near the geographic center of the United States that has the following zones: us-central1-a, us-central1-b, us-central1-c, and us-central1-f. When you create a TPU node, you specify the zone in which you want to create it. See the Compute Engine Global, regional, and zonal resources document for more information about regional and zonal resources.

You can create your configuration with the following TPU types:

US

TPU type (v2)   TPU v2 cores   Total TPU memory   Available zones
v2-8            8              64 GiB             us-central1-b, us-central1-c, us-central1-f
v2-32           32             256 GiB            us-central1-a
v2-128          128            1 TiB              us-central1-a
v2-256          256            2 TiB              us-central1-a
v2-512          512            4 TiB              us-central1-a

TPU type (v3)   TPU v3 cores   Total TPU memory   Available zones
v3-8            8              128 GiB            us-central1-a, us-central1-b, us-central1-f

Europe

TPU type (v2)   TPU v2 cores   Total TPU memory   Available zones
v2-8            8              64 GiB             europe-west4-a
v2-32           32             256 GiB            europe-west4-a
v2-128          128            1 TiB              europe-west4-a
v2-256          256            2 TiB              europe-west4-a
v2-512          512            4 TiB              europe-west4-a

TPU type (v3)   TPU v3 cores   Total TPU memory   Available zones
v3-8            8              128 GiB            europe-west4-a
v3-32           32             512 GiB            europe-west4-a
v3-64           64             1 TiB              europe-west4-a
v3-128          128            2 TiB              europe-west4-a
v3-256          256            4 TiB              europe-west4-a
v3-512          512            8 TiB              europe-west4-a
v3-1024         1024           16 TiB             europe-west4-a
v3-2048         2048           32 TiB             europe-west4-a

Asia Pacific

TPU type (v2)   TPU v2 cores   Total TPU memory   Available zones
v2-8            8              64 GiB             asia-east1-c

TPU types with higher numbers of cores are available only in limited quantities. TPU types with lower core counts are more likely to be available.

Calculating price and performance tradeoffs

To decide which TPU type you want to use, you can do experiments using a Cloud TPU tutorial to train a model that is similar to your application.

Run the tutorial for 5% to 10% of the number of steps in your full training run, on both a v2-8 and a v3-8 TPU type. The result tells you how long it takes to run that number of steps for that model on each TPU type.

Because performance on TPU types scales linearly, if you know how long it takes to run a task on a v2-8 or v3-8 TPU type, you can estimate how much you can reduce task time by running your model on a larger TPU type with more cores.

For example, if a v2-8 TPU type takes 60 minutes to run 10,000 steps, a v2-32 node, which has four times as many cores, should take approximately 15 minutes to perform the same task.
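The linear-scaling estimate described above can be sketched as a small helper. This is an illustration of the reasoning, not a tool from Cloud TPU; the function names are hypothetical:

```python
def cores(tpu_type: str) -> int:
    """Number of TPU cores encoded in a type string like 'v2-32'."""
    return int(tpu_type.split("-")[1])

def estimated_minutes(base_type: str, base_minutes: float, target_type: str) -> float:
    """Estimate run time on target_type, assuming time scales linearly
    (inversely) with core count within the same TPU version."""
    return base_minutes * cores(base_type) / cores(target_type)

# 10,000 steps take 60 minutes on a v2-8; estimate the same task on a v2-32.
print(estimated_minutes("v2-8", 60, "v2-32"))  # -> 15.0
```

Remember that this is only an estimate: as the document notes, moving to a TPU type with a different core count also requires tuning and optimization work before you see that scaling in practice.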

When you know the approximate training time for your model on a few different TPU types, you can weigh the VM/TPU cost against training time to help you decide your best price/performance tradeoff.

To determine the difference in cost between the different TPU types for Cloud TPU and the associated Compute Engine VM, see the TPU pricing page.
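One way to weigh that tradeoff is to multiply each TPU type's hourly rate by your measured training time. The rates and times below are hypothetical placeholders; substitute real figures from the TPU pricing page and your own experiments:

```python
# Hypothetical hourly rates (USD) -- replace with values from the pricing page.
HOURLY_RATE_USD = {"v2-8": 4.50, "v3-8": 8.00}
# Training times (hours) measured in your own tutorial experiments.
TRAINING_HOURS = {"v2-8": 10.0, "v3-8": 6.0}

for tpu_type in HOURLY_RATE_USD:
    cost = HOURLY_RATE_USD[tpu_type] * TRAINING_HOURS[tpu_type]
    print(f"{tpu_type}: {TRAINING_HOURS[tpu_type]} h at "
          f"${HOURLY_RATE_USD[tpu_type]}/h -> ${cost:.2f} total")
```

With these placeholder numbers, the faster v3-8 run costs more in total; with different rates or times the conclusion could reverse, which is exactly the tradeoff to evaluate.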

Specifying the TPU type

Regardless of which framework you are using (TensorFlow, PyTorch, or JAX), you specify a TPU type with the accelerator-type parameter when you launch a TPU. The command you use depends on whether you are using TPU VMs or TPU Nodes. Example commands are shown below.

TPU VM

$ gcloud alpha compute tpus tpu-vm create tpu-name \
--zone=zone \
--accelerator-type=v3-8 \
--version=v2-alpha

Command flag descriptions

zone
The zone where you plan to create your Cloud TPU.
accelerator-type
The type of the Cloud TPU to create.
version
The Cloud TPU runtime version.

TPU Node

$ gcloud compute tpus execution-groups create \
--name=tpu-name \
--zone=zone \
--tf-version=2.5.0 \
--machine-type=n1-standard-1 \
--accelerator-type=v3-8

Command flag descriptions

name
The name of the Cloud TPU to create.
zone
The zone where you plan to create your Cloud TPU.
tf-version
TensorFlow only. The version of TensorFlow that the gcloud command installs on your VM.
machine-type
The machine type of the Compute Engine VM to create.
accelerator-type
The type of the Cloud TPU to create.
image-family
PyTorch only. Set to torch-xla.
image-project
PyTorch only. Set to ml-images.
boot-disk-size
The size of the boot disk of the VM.
scopes
PyTorch only. Set to https://www.googleapis.com/auth/cloud-platform.

For more information on the gcloud command, see the gcloud Reference.

What's next