TPU configurations
TPU v5e configurations
Cloud TPU v5e is a combined training and inference (serving) product. To
differentiate between a training and an inference environment, use the
AcceleratorType or AcceleratorConfig flags with the TPU API, or the
--machine-type flag when creating a GKE node pool.
Training jobs are optimized for throughput and availability, while serving jobs
are optimized for latency. As a result, a training job on TPUs provisioned for
serving could have lower availability, and similarly, a serving job executed on
TPUs provisioned for training could have higher latency.
You use AcceleratorType to specify the number of TensorCores you want to use.
You specify the AcceleratorType when creating a TPU using the gcloud CLI or the
Google Cloud console. The value you specify for AcceleratorType is a string
with the format v$VERSION_NUMBER-$CHIP_COUNT.
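For example, a command like the following creates a single-host v5e TPU with 8
chips using AcceleratorType. The tpu-name and zone values are placeholders, and
the software version shown (tpu-ubuntu2204-base, the base image discussed later
in this document) is only one possible choice; use the version that matches
your framework:
$ gcloud compute tpus tpu-vm create tpu-name \
  --zone=zone \
  --accelerator-type=v5litepod-8 \
  --version=tpu-ubuntu2204-base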
You can also use AcceleratorConfig to specify the number of TensorCores you
want to use. However, because there are no custom 2D topology variants for TPU
v5e, there is no difference between using AcceleratorConfig and
AcceleratorType.
To configure a TPU v5e using AcceleratorConfig, use the --type and --topology
flags. Set --type to the TPU version you want to use and --topology to the
physical arrangement of the TPU chips in the slice. The value you specify for
--topology is a string with the format AxB, where A and B are the chip counts
in each direction. For example, a 4x8 topology is a 32-chip slice with 4 chips
in one direction and 8 in the other.
The following 2D slice shapes are supported for v5e:
Topology | Number of TPU chips | Number of Hosts |
---|---|---|
1x1 | 1 | 1/8 |
2x2 | 4 | 1/2 |
2x4 | 8 | 1 |
4x4 | 16 | 2 |
4x8 | 32 | 4 |
8x8 | 64 | 8 |
8x16 | 128 | 16 |
16x16 | 256 | 32 |
Each TPU VM in a v5e TPU slice contains 1, 4, or 8 chips. In 4-chip and smaller
slices, all TPU chips share the same non-uniform memory access (NUMA) node.
For 8-chip v5e TPU VMs, CPU-TPU communication is more efficient within NUMA
partitions; for example, CPU0-Chip0 communication is faster than CPU0-Chip4
communication.
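If this matters for your workload, one general Linux technique is to pin the
training process to a single NUMA node with numactl. This is a sketch only; the
script name is a placeholder, and the right node binding depends on which chips
your process drives:
$ sudo apt-get update && sudo apt-get install -y numactl
$ numactl --cpunodebind=0 --membind=0 python3 your-training-script.py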
Cloud TPU v5e types for serving
Single-host serving is supported for up to 8 v5e chips. The supported configurations are 1x1, 2x2, and 2x4 slices, which have 1, 4, and 8 chips respectively.
To provision TPUs for a serving job, use one of the following accelerator types in your CLI or API TPU creation request:
AcceleratorType (TPU API) | Machine type (GKE API) |
---|---|
v5litepod-1 | ct5lp-hightpu-1t |
v5litepod-4 | ct5lp-hightpu-4t |
v5litepod-8 | ct5lp-hightpu-8t |
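In GKE, the machine type is passed when you create the node pool. The following
is a minimal sketch; the node pool, cluster, and zone names are placeholders,
and your cluster may require additional TPU-related or networking flags:
$ gcloud container node-pools create v5e-serving-pool \
  --cluster=cluster-name \
  --zone=zone \
  --machine-type=ct5lp-hightpu-4t \
  --num-nodes=1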
Serving on more than 8 v5e chips, also called multi-host serving, is supported using Sax. For more information, see Large Language Model Serving.
Cloud TPU v5e types for training
Training is supported for up to 256 chips.
To provision TPUs for a v5e training job, use one of the following accelerator types in your CLI or API TPU creation request:
AcceleratorType (TPU API) | Machine type (GKE API) | Topology |
---|---|---|
v5litepod-16 | ct5lp-hightpu-4t | 4x4 |
v5litepod-32 | ct5lp-hightpu-4t | 4x8 |
v5litepod-64 | ct5lp-hightpu-4t | 8x8 |
v5litepod-128 | ct5lp-hightpu-4t | 8x16 |
v5litepod-256 | ct5lp-hightpu-4t | 16x16 |
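For example, a command along these lines would provision a 32-chip v5e training
slice through the TPU API. As before, tpu-name and zone are placeholders and
the software version shown is only illustrative:
$ gcloud compute tpus tpu-vm create tpu-name \
  --zone=zone \
  --accelerator-type=v5litepod-32 \
  --version=tpu-ubuntu2204-base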
v5e TPU VM type comparison:
VM Type | n2d-48-24-v5lite-tpu | n2d-192-112-v5lite-tpu | n2d-384-224-v5lite-tpu |
---|---|---|---|
# of v5e chips | 1 | 4 | 8 |
# of vCPUs | 24 | 112 | 224 |
RAM (GB) | 48 | 192 | 384 |
# of NUMA Nodes | 1 | 1 | 2 |
Applies to | v5litepod-1 | v5litepod-4 | v5litepod-8 |
Disruption | High | Medium | Low |
To make space for workloads that require more chips, schedulers may preempt VMs with fewer chips. So 8-chip VMs are likely to preempt 1-chip and 4-chip VMs.
TPU v4 configurations
A TPU v4 Pod is composed of 4096 chips interconnected with reconfigurable
high-speed links. TPU v4's flexible networking lets you connect the chips in a
same-sized Pod slice in multiple ways. When you create a TPU Pod slice, you
specify the TPU version and the number of TPU resources you require. When you
create a TPU v4 Pod slice, you can specify its type and size in one of two
ways: AcceleratorType and AcceleratorConfig.
Using AcceleratorType
Use AcceleratorType when you are not specifying a topology. To configure v4
TPUs using AcceleratorType, use the --accelerator-type flag when creating your
TPU Pod slice. Set --accelerator-type to a string that contains the TPU version
and the number of TensorCores you want to use. For example, to create a v4 Pod
slice with 32 TensorCores, you would use --accelerator-type=v4-32.
The following command creates a v4 TPU Pod slice with 512 TensorCores using
the --accelerator-type flag:
$ gcloud compute tpus tpu-vm create tpu-name \
  --zone=zone \
  --accelerator-type=v4-512 \
  --version=tpu-vm-tf-2.15.0-pod-pjrt
The number after the TPU version (v4) specifies the number of TensorCores.
There are two TensorCores in a v4 TPU chip, so the number of TPU chips would be
512/2 = 256.
Using AcceleratorConfig
Use AcceleratorConfig when you want to customize the physical topology of your
TPU slice. This is generally required for performance tuning with Pod slices
greater than 256 chips.
To configure v4 TPUs using AcceleratorConfig, use the --type and --topology
flags. Set --type to the TPU version you want to use and --topology to the
physical arrangement of the TPU chips in the Pod slice. You specify a TPU
topology using a 3-tuple, AxBxC, where A<=B<=C and A, B, and C are either all
<= 4 or all integer multiples of 4. The values A, B, and C are the chip counts
in each of the three dimensions. For example, to create a v4 Pod slice with 16
chips, you would set --type=v4 and --topology=2x2x4.
The following command creates a v4 TPU Pod slice with 128 TPU chips arranged in
a 4x4x8 array:
$ gcloud compute tpus tpu-vm create tpu-name \
  --zone=zone \
  --type=v4 \
  --topology=4x4x8 \
  --version=tpu-vm-tf-2.15.0-pod-pjrt
Topologies where 2A=B=C or 2A=2B=C also have topology variants optimized for all-to-all communication, for example, 4×4×8, 8×8×16, and 12×12×24. These are known as twisted tori topologies.
The following illustrations show some common TPU v4 topologies.
Larger Pod slices can be built from one or more 4x4x4 "cubes" of chips.
Twisted Tori topologies
Some v4 3D torus slice shapes have the option to use what is known as a twisted torus topology. For example, two v4 cubes can be arranged as a 4x4x8 slice or a 4x4x8_twisted slice. Twisted topologies offer significantly higher bisection bandwidth. Increased bisection bandwidth is useful for workloads that use global communication patterns. Twisted topologies can improve performance for most models, with large TPU embedding workloads benefiting the most.
For workloads that use data parallelism as the only parallelism strategy, twisted topologies might perform slightly better. For LLMs, performance using a twisted topology can vary depending on the type of parallelism (DP, MP, etc.). Best practice is to train your LLM with and without a twisted topology to determine which provides the best performance for your model. Some experiments on the FSDP MaxText model have seen 1-2 MFU improvements using a twisted topology.
The primary benefit of twisted topologies is that they transform an asymmetric torus topology (for example, 4×4×8) into a closely related symmetric topology. The symmetric topology has many benefits:
- Improved load balancing
- Higher bisection bandwidth
- Shorter packet routes
These benefits ultimately translate into improved performance for many global communication patterns.
The TPU software supports twisted tori on slices where the size of each dimension is either equal to or twice the size of the smallest dimension. For example, 4x4x8, 4x8x8, or 12x12x24.
As an example, consider this 4×2 torus topology with TPUs labeled with their (X,Y) coordinates in the slice:
The edges in this topology graph are shown as undirected edges for clarity. In practice, each edge is a bidirectional connection between TPUs. We refer to the edges between one side of this grid and the opposite side as wrap-around edges, as noted in the diagram.
By twisting this topology, we end up with a completely symmetric 4×2 twisted torus topology:
All that has changed between this diagram and the previous one is the Y wrap-around edges. Instead of connecting to another TPU with the same X coordinate, they have been shifted to connect to the TPU with coordinate X+2 mod 4.
The same idea generalizes to different dimension sizes and different numbers of dimensions. The resulting network is symmetric, as long as each dimension is equal to or twice the size of the smallest dimension.
See Using AcceleratorConfig for details about how to specify a twisted torus configuration when creating a Cloud TPU.
The following table shows the supported twisted topologies and a theoretical increase in bisection bandwidth with them versus untwisted topologies.
Twisted Topology | Theoretical increase in bisection bandwidth versus a non-twisted torus |
---|---|
4x4x8_twisted | ~70% |
8x8x16_twisted | ~70% |
12x12x24_twisted | ~70% |
4x8x8_twisted | ~40% |
8x16x16_twisted | ~40% |
TPU v4 Topology variants
Some topologies containing the same number of chips can be arranged in different ways. For example, a TPU Pod slice with 512 chips (1024 TensorCores) can be configured using the following topologies: 4x4x32, 4x8x16, or 8x8x8. A TPU Pod slice with 2048 chips (4096 TensorCores) offers even more topology options: 4x4x128, 4x8x64, 4x16x32, and 8x16x16.
The default topology associated with a given chip count is the one that's most similar to a cube (see v4 Topology). This shape is likely the best choice for data-parallel ML training. Other topologies can be useful for workloads with multiple kinds of parallelism (for example, model and data parallelism, or spatial partitioning of a simulation). These workloads perform best if the topology is matched to the parallelism used. For example, placing 4-way model parallelism on the X dimension and 256-way data parallelism on the Y and Z dimensions matches a 4x16x16 topology.
Models with multiple dimensions of parallelism perform best with their parallelism dimensions mapped to TPU topology dimensions. These are usually data+model parallel Large Language Models (LLMs). For example, for a TPU v4 Pod slice with topology 8x16x16, the TPU topology dimensions are 8, 16, and 16. It is more performant to use 8-way or 16-way model parallelism (mapped to one of the physical TPU topology dimensions). 4-way model parallelism would be suboptimal with this topology, because it is not aligned with any of the TPU topology dimensions, but it would be optimal with a 4x16x32 topology on the same number of chips.
TPU v4 configurations consist of two groups: those with topologies smaller than 64 chips (small topologies), and those with topologies of 64 chips or more (large topologies).
Small v4 topologies
Cloud TPU supports the following TPU v4 Pod slices smaller than 64 chips (a 4x4x4 cube). You can create these small v4 topologies using either their TensorCore-based name (for example, v4-32) or their topology (for example, 2x2x4):
Name (based on TensorCore count) | Number of chips | Topology |
---|---|---|
v4-8 | 4 | 2x2x1 |
v4-16 | 8 | 2x2x2 |
v4-32 | 16 | 2x2x4 |
v4-64 | 32 | 2x4x4 |
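For example, either of the following commands creates a v4-32 slice; the second
form uses the --type and --topology flags described earlier. The tpu-name and
zone values are placeholders, and the software version is only illustrative:
$ gcloud compute tpus tpu-vm create tpu-name \
  --zone=zone \
  --accelerator-type=v4-32 \
  --version=tpu-vm-tf-2.15.0-pod-pjrt
$ gcloud compute tpus tpu-vm create tpu-name \
  --zone=zone \
  --type=v4 \
  --topology=2x2x4 \
  --version=tpu-vm-tf-2.15.0-pod-pjrt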
Large v4 topologies
TPU v4 Pod slices are available in increments of 64 chips, with shapes that are
multiples of 4 on all three dimensions. The dimensions must also be in
increasing order. Several examples are shown in the following table. A few of
these topologies are "custom" topologies that can only be created using the
--type and --topology flags because there is more than one way to arrange the
chips.
The following command creates a v4 TPU Pod slice with 512 TPU chips arranged in
an 8x8x8 array:
$ gcloud compute tpus tpu-vm create tpu-name \
  --zone=zone \
  --type=v4 \
  --topology=8x8x8 \
  --version=tpu-vm-tf-2.15.0-pod-pjrt
You can create a v4 TPU Pod slice with the same number of TensorCores using
--accelerator-type:
$ gcloud compute tpus tpu-vm create tpu-name \
  --zone=zone \
  --accelerator-type=v4-1024 \
  --version=tpu-vm-tf-2.15.0-pod-pjrt
Name (based on TensorCore count) | Number of chips | Topology |
---|---|---|
v4-128 | 64 | 4x4x4 |
v4-256 | 128 | 4x4x8 |
v4-512 | 256 | 4x8x8 |
N/A - must use the --type and --topology flags | 256 | 4x4x16 |
v4-1024 | 512 | 8x8x8 |
v4-1536 | 768 | 8x8x12 |
v4-2048 | 1024 | 8x8x16 |
N/A - must use the --type and --topology flags | 1024 | 4x16x16 |
v4-4096 | 2048 | 8x16x16 |
… | … | … |
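For example, a command like the following creates one of the custom topologies
from the table (4x4x16, 256 chips), which has no TensorCore-based name. The
tpu-name and zone values are placeholders and the software version is only
illustrative:
$ gcloud compute tpus tpu-vm create tpu-name \
  --zone=zone \
  --type=v4 \
  --topology=4x4x16 \
  --version=tpu-vm-tf-2.15.0-pod-pjrt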
TPU v3 configurations
A TPU v3 Pod is composed of 1024 chips interconnected with high-speed links. To
create a TPU v3 device or Pod slice, use the --accelerator-type flag for the
gcloud compute tpus tpu-vm command. You specify the accelerator type by
specifying the TPU version and the number of TPU cores. For a single v3 TPU,
use --accelerator-type=v3-8. For a v3 Pod slice with 128 TensorCores, use
--accelerator-type=v3-128.
The following command shows how to create a v3 TPU Pod slice with 128 TensorCores:
$ gcloud compute tpus tpu-vm create tpu-name \
  --zone=zone \
  --accelerator-type=v3-128 \
  --version=tpu-vm-tf-2.15.0-pod-pjrt
The following table lists the supported v3 TPU types:
TPU version | Support ends |
---|---|
v3-8 | (End date not yet set) |
v3-32 | (End date not yet set) |
v3-128 | (End date not yet set) |
v3-256 | (End date not yet set) |
v3-512 | (End date not yet set) |
v3-1024 | (End date not yet set) |
v3-2048 | (End date not yet set) |
For more information about managing TPUs, see Manage TPUs. For more information about the different versions of Cloud TPU, see System architecture.
TPU v2 configurations
A TPU v2 Pod is composed of 512 chips interconnected with reconfigurable
high-speed links. To create a TPU v2 Pod slice, use the --accelerator-type flag
for the gcloud compute tpus tpu-vm command. You specify the accelerator type by
specifying the TPU version and the number of TPU cores. For a single v2 TPU,
use --accelerator-type=v2-8. For a v2 Pod slice with 128 TensorCores, use
--accelerator-type=v2-128.
The following command shows how to create a v2 TPU Pod slice with 128 TensorCores:
$ gcloud compute tpus tpu-vm create tpu-name \
  --zone=zone \
  --accelerator-type=v2-128 \
  --version=tpu-vm-tf-2.15.0-pod-pjrt
For more information about managing TPUs, see Manage TPUs. For more information about the different versions of Cloud TPU, see System architecture.
The following table lists the supported v2 TPU types:
TPU version | Support ends |
---|---|
v2-8 | (End date not yet set) |
v2-32 | (End date not yet set) |
v2-128 | (End date not yet set) |
v2-256 | (End date not yet set) |
v2-512 | (End date not yet set) |
TPU type compatibility
You can change the TPU type to another TPU type that has the same number of
TensorCores or chips (for example, v3-128 and v4-128) and run your training
script without code changes. However, if you change to a TPU type with a larger
or smaller number of TensorCores, you will need to perform significant tuning
and optimization. For more information, see Training on TPU Pods.
TPU VM software versions
This section describes the TPU software versions you should use for a TPU with the TPU VM architecture. For the TPU Node architecture, see TPU Node software versions.
TPU software versions are available for TensorFlow, PyTorch, and JAX frameworks.
TensorFlow
For TensorFlow version 2.15.0, use the TPU software version that matches the
version of TensorFlow with which your model was written. You must also specify
either the stream executor (SE) runtime or the PJRT runtime. For example, if
you are using TensorFlow 2.15.0 with the PJRT runtime, use the
tpu-vm-tf-2.15.0-pjrt TPU software version.
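For instance, a create command that selects the PJRT runtime for TensorFlow
2.15.0 might look like the following; to use stream executor instead, you would
swap in tpu-vm-tf-2.15.0-se. The tpu-name, zone, and accelerator type are
placeholders:
$ gcloud compute tpus tpu-vm create tpu-name \
  --zone=zone \
  --accelerator-type=v3-8 \
  --version=tpu-vm-tf-2.15.0-pjrt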
PJRT features automatic device memory defragmentation and simplifies the integration of hardware with frameworks. For more information about PJRT, see PJRT: Simplifying ML Hardware and Framework Integration on the Google Open Source Blog.
We are working to migrate all features of TPU v2, v3, and v4 to the PJRT runtime. The following table describes which features are currently supported on PJRT or stream executor.
Accelerator | Feature | Supported on PJRT | Supported on stream executor |
---|---|---|---|
TPU v2-v4 | Dense compute (no TPU embedding API) | Yes | Yes |
TPU v2-v4 | Dense compute API + TPU embedding API | No | Yes |
TPU v2-v4 | tf.summary/tf.print with soft device placement | No | Yes |
TPU v5e | Dense compute (no TPU embedding API) | Yes | No |
TPU v5e | TPU embedding API | N/A - TPU v5e doesn't support TPU embedding API | N/A |
TensorFlow versions 2.14.0 and earlier only support stream executor.
Use the TPU software version that matches the version of TensorFlow with which
your model was written. For example, if you are using TensorFlow 2.14.0, use
the tpu-vm-tf-2.14.0 TPU software version.
The currently supported TensorFlow TPU VM software versions are:
- tpu-vm-tf-2.15.0-pjrt
- tpu-vm-tf-2.15.0-se
- tpu-vm-tf-2.14.0
- tpu-vm-tf-2.13.1
- tpu-vm-tf-2.13.0
- tpu-vm-tf-2.12.1
- tpu-vm-tf-2.12.0
- tpu-vm-tf-2.11.1
- tpu-vm-tf-2.11.0
- tpu-vm-tf-2.10.1
- tpu-vm-tf-2.10.0
- tpu-vm-tf-2.9.3
- tpu-vm-tf-2.9.1
- tpu-vm-tf-2.8.4
- tpu-vm-tf-2.8.3
- tpu-vm-tf-2.8.0
- tpu-vm-tf-2.7.4
- tpu-vm-tf-2.7.3
If you are using a Pod slice, append -pod after the TensorFlow version number.
For example, tpu-vm-tf-2.15.0-pod-pjrt.
For more information on TensorFlow patch versions, see Supported TensorFlow patch versions.
TPU v4 with TensorFlow versions 2.10.0 and earlier
If you are training a model on TPU v4 with TensorFlow 2.10.0 or earlier, use
one of the v4-specific software versions shown in the following table. If the
TensorFlow version you're using is not shown in the table, follow the guidance
in the TensorFlow section.
TensorFlow version | TPU software version |
---|---|
2.10.0 | tpu-vm-tf-2.10.0-v4, tpu-vm-tf-2.10.0-pod-v4 |
2.9.3 | tpu-vm-tf-2.9.3-v4, tpu-vm-tf-2.9.3-pod-v4 |
2.9.2 | tpu-vm-tf-2.9.2-v4, tpu-vm-tf-2.9.2-pod-v4 |
2.9.1 | tpu-vm-tf-2.9.1-v4, tpu-vm-tf-2.9.1-pod-v4 |
Libtpu versions
TPU VMs are created with TensorFlow and the corresponding libtpu library
preinstalled. If you are creating your own VM image, specify the following
TensorFlow TPU software versions and corresponding libtpu versions:
TensorFlow version | libtpu.so version |
---|---|
2.15.0 | 1.9.0 |
2.14.0 | 1.8.0 |
2.13.1 | 1.7.1 |
2.13.0 | 1.7.0 |
2.12.1 | 1.6.1 |
2.12.0 | 1.6.0 |
2.11.1 | 1.5.1 |
2.11.0 | 1.5.0 |
2.10.1 | 1.4.1 |
2.10.0 | 1.4.0 |
2.9.3 | 1.3.2 |
2.9.1 | 1.3.0 |
2.8.3 | 1.2.3 |
2.8.* | 1.2.0 |
2.7.3 | 1.1.2 |
PyTorch
Use the TPU software version that matches the version of PyTorch with which
your model was written. For example, if you are using PyTorch 1.13 and TPU v2
or v3, use the tpu-vm-pt-1.13 TPU software version. If you are using TPU v4,
use the tpu-vm-v4-pt-1.13 TPU software version. The same TPU software version
is used for TPU Pods (for example, v2-32, v3-128, v4-32). The currently
supported TPU software versions are:
TPU v2/v3:
- tpu-vm-pt-2.0 (pytorch-2.0)
- tpu-vm-pt-1.13 (pytorch-1.13)
- tpu-vm-pt-1.12 (pytorch-1.12)
- tpu-vm-pt-1.11 (pytorch-1.11)
- tpu-vm-pt-1.10 (pytorch-1.10)
- v2-alpha (pytorch-1.8.1)
TPU v4:
- tpu-vm-v4-pt-2.0 (pytorch-2.0)
- tpu-vm-v4-pt-1.13 (pytorch-1.13)
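As an illustration, the following sketch creates a v3-8 TPU VM with the PyTorch
2.0 software version listed above; tpu-name and zone are placeholders:
$ gcloud compute tpus tpu-vm create tpu-name \
  --zone=zone \
  --accelerator-type=v3-8 \
  --version=tpu-vm-pt-2.0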
When you create a TPU VM, the latest version of PyTorch is preinstalled on the TPU VM. The correct version of libtpu.so is automatically installed when you install PyTorch.
To change the current PyTorch software version, see Changing PyTorch version.
JAX
You must manually install JAX on your TPU VM, because there is no JAX-specific TPU software version. For all TPU versions, use tpu-ubuntu2204-base. The correct version of libtpu.so is automatically installed when you install JAX.
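As a sketch, creating a TPU VM for JAX and installing JAX on it might look like
the following. The tpu-name and zone values are placeholders, v4-8 is just an
example accelerator type, and the pip command reflects the commonly documented
JAX TPU install at the time of writing; check the JAX documentation for the
current instructions:
$ gcloud compute tpus tpu-vm create tpu-name \
  --zone=zone \
  --accelerator-type=v4-8 \
  --version=tpu-ubuntu2204-base
$ gcloud compute tpus tpu-vm ssh tpu-name --zone=zone \
  --command='pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html'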
TPU Node software versions
This section describes the TPU software versions you should use for a TPU with the TPU Node architecture. For the TPU VM architecture, see TPU VM software versions.
TPU software versions are available for TensorFlow, PyTorch, and JAX frameworks.
TensorFlow
Use the TPU software version that matches the version of TensorFlow with which
your model was written. For example, if you are using TensorFlow 2.12.0, use
the 2.12.0 TPU software version. The TensorFlow-specific TPU software versions
are:
- 2.12.1
- 2.12.0
- 2.11.1
- 2.11.0
- 2.10.1
- 2.10.0
- 2.9.3
- 2.9.1
- 2.8.4
- 2.8.2
- 2.7.3
For more information on TensorFlow patch versions, see Supported TensorFlow patch versions.
When you create a TPU Node, the latest version of TensorFlow is preinstalled on the TPU Node.
PyTorch
Use the TPU software version that matches the version of PyTorch with which
your model was written. For example, if you are using PyTorch 1.9, use the
pytorch-1.9 software version.
The PyTorch specific TPU software versions are:
- pytorch-2.0
- pytorch-1.13
- pytorch-1.12
- pytorch-1.11
- pytorch-1.10
- pytorch-1.9
- pytorch-1.8
- pytorch-1.7
- pytorch-1.6
- pytorch-nightly
When you create a TPU Node, the latest version of PyTorch is preinstalled on the TPU Node.
JAX
You must manually install JAX on your TPU VM, so there is no pre-installed JAX-specific TPU software version. You can use any of the software versions listed for TensorFlow.
What's next
- Learn more about TPU architecture in the System Architecture page.
- See When to use TPUs to learn about the types of models that are well suited to Cloud TPU.