Overview
When you create TPU nodes to handle your machine learning workloads, you must select a TPU type. The TPU type defines the TPU version, the number of TPU cores, and the amount of TPU memory that is available for your machine learning workload.
For example, the `v2-8` TPU type defines a TPU node with 8 TPU v2 cores and 64 GiB of total TPU memory. The `v3-2048` TPU type defines a TPU node with 2048 TPU v3 cores and 32 TiB of total TPU memory.
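The type name itself encodes these values: the prefix is the TPU version and the number after the hyphen is the core count. As an illustrative sketch only (this helper is hypothetical, not part of any Google Cloud library), you could decode a type name like this, with per-core memory inferred from the tables below:

```python
# Hypothetical helper, not a Google Cloud API. Per-core memory is
# inferred from the tables in this document: 8 GiB per v2 core,
# 16 GiB per v3 core.
GIB_PER_CORE = {"v2": 8, "v3": 16}

def describe_tpu_type(tpu_type: str) -> dict:
    """Split a TPU type name such as 'v2-8' into version, cores, memory."""
    version, cores_str = tpu_type.split("-")
    cores = int(cores_str)
    return {
        "version": version,
        "cores": cores,
        "total_memory_gib": cores * GIB_PER_CORE[version],
    }

print(describe_tpu_type("v3-2048"))
# {'version': 'v3', 'cores': 2048, 'total_memory_gib': 32768}  (32 TiB)
```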
To learn about the hardware differences between TPU versions and configurations, read the System Architecture documentation.
To see pricing for each TPU type in each region, see the Pricing page.
A model that runs on one TPU type can run with no TensorFlow code changes on another TPU type. For example, `v2-8` code can run without changes on a `v3-8`. However, scaling from a `v2-8` or `v3-8` to a larger TPU type, such as a `v2-32` or `v3-128`, requires significant tuning and optimization.
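This portability comes from resolving the TPU at run time rather than hard-coding its topology. A minimal TensorFlow 2.x sketch (assuming TF 2.3 or later; `tpu_name` is a placeholder for your node's name):

```python
import tensorflow as tf

# Resolve the named TPU node; nothing here depends on whether it is a
# v2-8, v3-8, or larger type. 'tpu_name' is a placeholder.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="tpu_name")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Model code is identical across TPU types; the strategy shards
    # work across however many cores the resolved node has.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer="sgd", loss="mse")
```

The same script runs unchanged on a larger type, but, as noted above, you still need to retune settings such as the global batch size; `strategy.num_replicas_in_sync` reports how many cores the resolved node provides.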
TPU types and zones
The main differences between TPU types are price, performance, memory capacity, and zonal availability.
Google Cloud Platform uses regions, subdivided into zones, to define the geographic location of physical computing resources. For example, the `us-central1` region denotes a region near the geographic center of the United States that has the following zones: `us-central1-a`, `us-central1-b`, `us-central1-c`, and `us-central1-f`. When you create a TPU node, you specify the zone in which you want to create it. See the Compute Engine Global, regional, and zonal resources document for more information about regional and zonal resources.
You can configure your TPU nodes with the following TPU types:
US
TPU type (v2) | TPU v2 cores | Total TPU memory | Zones
---|---|---|---
v2-8 | 8 | 64 GiB | us-central1-b, us-central1-c, us-central1-f
v2-32 | 32 | 256 GiB | us-central1-a
v2-128 | 128 | 1 TiB | us-central1-a
v2-256 | 256 | 2 TiB | us-central1-a
v2-512 | 512 | 4 TiB | us-central1-a

TPU type (v3) | TPU v3 cores | Total TPU memory | Zones
---|---|---|---
v3-8 | 8 | 128 GiB | us-central1-a, us-central1-b, us-central1-f
Europe
TPU type (v2) | TPU v2 cores | Total TPU memory | Zones
---|---|---|---
v2-8 | 8 | 64 GiB | europe-west4-a
v2-32 | 32 | 256 GiB | europe-west4-a
v2-128 | 128 | 1 TiB | europe-west4-a
v2-256 | 256 | 2 TiB | europe-west4-a
v2-512 | 512 | 4 TiB | europe-west4-a

TPU type (v3) | TPU v3 cores | Total TPU memory | Zones
---|---|---|---
v3-8 | 8 | 128 GiB | europe-west4-a
v3-32 | 32 | 512 GiB | europe-west4-a
v3-64 | 64 | 1 TiB | europe-west4-a
v3-128 | 128 | 2 TiB | europe-west4-a
v3-256 | 256 | 4 TiB | europe-west4-a
v3-512 | 512 | 8 TiB | europe-west4-a
v3-1024 | 1024 | 16 TiB | europe-west4-a
v3-2048 | 2048 | 32 TiB | europe-west4-a
Asia Pacific
TPU type (v2) | TPU v2 cores | Total TPU memory | Zones
---|---|---|---
v2-8 | 8 | 64 GiB | asia-east1-c
TPU types with higher numbers of cores are available only in limited quantities. TPU types with lower core counts are more likely to be available.
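If you pick zones programmatically, the tables above reduce to a simple lookup. The sketch below hand-copies a few rows for illustration; zone availability changes over time, so treat the current tables as the source of truth:

```python
# Hand-copied from the tables above for illustration; availability
# changes over time, so do not treat this dictionary as authoritative.
TPU_ZONES = {
    "v2-8": ["us-central1-b", "us-central1-c", "us-central1-f",
             "europe-west4-a", "asia-east1-c"],
    "v3-8": ["us-central1-a", "us-central1-b", "us-central1-f",
             "europe-west4-a"],
    "v2-32": ["us-central1-a", "europe-west4-a"],
    "v3-2048": ["europe-west4-a"],
}

def zones_for(tpu_type: str) -> list:
    """Return the zones where a TPU type is listed as available."""
    return TPU_ZONES.get(tpu_type, [])

print(zones_for("v3-8"))
```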
Calculating price and performance tradeoffs
To decide which TPU type you want to use, you can run experiments using a Cloud TPU tutorial to train a model that is similar to your application.
Run the tutorial for 5-10% of the number of steps you will use to run the full training on a `v2-8` and on a `v3-8` TPU type. The result tells you how long it takes to run that number of steps for that model on each TPU type.
Because performance on TPU types scales linearly, if you know how long it takes to run a task on a `v2-8` or `v3-8` TPU type, you can estimate how much you can reduce task time by running your model on a larger TPU type with more cores.
For example, if a `v2-8` TPU type takes 60 minutes to run 10,000 steps, a `v2-32` node should take approximately 15 minutes to perform the same task.
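Under that linear-scaling assumption, the estimate is just the baseline time scaled by the ratio of core counts:

```python
def estimated_minutes(baseline_minutes: float,
                      baseline_cores: int,
                      target_cores: int) -> float:
    """Estimate task time on a larger TPU type, assuming linear scaling."""
    return baseline_minutes * baseline_cores / target_cores

# 60 minutes for 10,000 steps on a v2-8 (8 cores) implies roughly
# 15 minutes on a v2-32 (32 cores).
print(estimated_minutes(60, 8, 32))  # 15.0
```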
To determine the difference in cost within your region between the different TPU types for Cloud TPU and the associated Compute Engine VM, see the TPU pricing page. When you know the approximate training time for your model on a few different TPU types, you can weigh the VM/TPU cost against training time to help you decide your best price/performance tradeoff.
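One way to make that comparison concrete is to multiply each candidate's hourly rate by its estimated training time. The rates in this sketch are placeholders only; use the per-region rates from the pricing page:

```python
def run_cost_usd(hourly_rate_usd: float, minutes: float) -> float:
    """Cost of one training run at a given hourly accelerator rate."""
    return hourly_rate_usd * minutes / 60.0

# Placeholder (hourly rate in USD, estimated minutes) pairs; look up
# real rates on the Cloud TPU pricing page for your region.
candidates = {"v2-8": (4.50, 60.0), "v2-32": (24.00, 15.0)}
for tpu_type, (rate, minutes) in candidates.items():
    print(f"{tpu_type}: ${run_cost_usd(rate, minutes):.2f} per run")
```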
Specifying the TPU type
You specify a TPU type when you create a TPU node, using either of the following methods:
gcloud command
- Use the `gcloud compute tpus execution-groups` command:

```
$ gcloud compute tpus execution-groups create \
  --name=tpu_name \
  --zone=zone \
  --tf-version=tensorflow_version \
  --machine-type=n1-standard-8 \
  --accelerator-type=v3-8
```
Command flag descriptions

`name`
- The name of the Cloud TPU to create.

`zone`
- The zone where you plan to create your Cloud TPU.

`tf-version`
- The version of TensorFlow that `gcloud` installs on the VM.

`machine-type`
- The machine type of the Compute Engine VM to create.

`accelerator-type`
- The type of the Cloud TPU to create.
Cloud Console
- From the left navigation menu, select Compute Engine > TPUs.
- On the TPUs screen, click Create TPU node. This brings up a configuration page for your TPU.
- Under TPU type, select one of the supported TPU types.
- Click the Create button.
What's next
- Learn more about TPU architecture in the System Architecture page.
- See When to use TPUs to learn about the types of models that are well suited to Cloud TPU.
- If you plan to run on Kubernetes or ML Engine, see Deciding on a TPU service.