
TPU types and topologies

Overview

When you create a TPU configuration, you need to specify a TPU type and size, or a TPU type and a topology (a logical chip arrangement). There are two ways to do this in a gcloud create operation:

  • Specify a single accelerator-type string that combines the TPU version (v2, v3, or v4) and the number of TensorCores being requested.
  • Specify the TPU type parameter along with the topology flag.

See TPU types for examples of creating TPUs using either the accelerator-type string or the type and topology flags.

Additional information about the different Cloud TPU versions and topologies can be found in the System architecture document.

TPU types

You use the 'accelerator-type' flag of the gcloud command to specify a TPU configuration. A TPU configuration is composed of a TPU version and the number of TensorCores. For example, the following gcloud command creates a v3 TPU with 128 TensorCores:

    $ gcloud compute tpus tpu-vm create tpu-name \
      --zone=zone \
      --accelerator-type=v3-128 \
      --version=tpu-vm-tf-2.11.0

You can also specify the TPU configuration as the number of TPU chips and their logical arrangement (the topology). For more information, see Chip-topology-based TPU configurations.

TensorCore-based TPU configurations

For all TPU types, the version is followed by the number of TensorCores (e.g., 8, 32, 128). For example, --accelerator-type=v2-8 specifies a TPU v2 with 8 TensorCores and v3-1024 specifies a v3 TPU with 1024 TensorCores (a slice of a v3 Pod).
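
For instance, the following command requests a single-host v2 TPU with 8 TensorCores; as in the earlier example, tpu-name, zone, and the runtime version are placeholders to substitute with values for your own setup:

    $ gcloud compute tpus tpu-vm create tpu-name \
      --zone=zone \
      --accelerator-type=v2-8 \
      --version=tpu-vm-tf-2.11.0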

Chip-topology-based TPU configurations

v4 configurations are specified with a TPU version and a topology. The topology is specified with a 3-tuple that describes how the chips are arranged. For example, a 32-TensorCore configuration can be represented in chips as 2x2x4 (16 chips, each with 2 TensorCores). The 3-tuple specifies only the topology; a complete configuration also requires a TPU version.

The representation of a v4 topology is AxBxC where A<=B<=C and A,B,C are either all <=4 or all integer multiples of 4. The values A, B, and C are the chip counts of the three dimensions. Topologies where 2A=B=C or 2A=2B=C also have variants optimized for all-to-all communication, for example, 4×4×8, 8×8×16, and 12×12×24.
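
As a quick sanity check, the following minimal bash sketch (not part of the gcloud CLI; the topology string shown is just an example) computes the chip and TensorCore counts for a v4 topology and verifies the shape rules described above:

    # Plain-bash sketch: derive chip/TensorCore counts for a v4 topology string
    # and check the AxBxC rules (A<=B<=C; all values <=4 or all multiples of 4).
    TOPOLOGY="4x4x8"                  # example topology; substitute any AxBxC string
    IFS='x' read -r A B C <<< "$TOPOLOGY"

    CHIPS=$(( A * B * C ))
    TENSORCORES=$(( 2 * CHIPS ))      # each v4 chip has two TensorCores
    echo "${TOPOLOGY}: ${CHIPS} chips, ${TENSORCORES} TensorCores"

    if (( A <= B && B <= C )) && { (( C <= 4 )) || (( A % 4 == 0 && B % 4 == 0 && C % 4 == 0 )); }; then
      echo "${TOPOLOGY} satisfies the shape rules"
    else
      echo "${TOPOLOGY} does not satisfy the shape rules"
    fi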

For more information on v4 topologies, see TPU v4 configurations.

Creating chip-topology-based configurations (v4)

With v4, you see devices rather than TensorCores, one device per chip. To specify chip-topology-based configurations, v4 introduced a 3-dimensional topology format (for example, 4x4x4) along with a --topology flag for the gcloud create TPU operation. The tables shown under Topology variants include examples of other supported chip topologies.

You can specify a v4 type and topology using the type and topology flags to the gcloud create operation. For example, the following gcloud command creates a v4 TPU with 64 chips in a 4x4x4 cube topology.

  $ gcloud compute tpus tpu-vm create tpu-name \
    --zone=zone \
    --type=v4 \
    --topology=4x4x4 \
    --version=tpu-vm-tf-2.11.0

The supported value for --type is v4.

Topology variants

The standard topology associated with a given TensorCore or chip count is the one most similar to a cube (see Topology shapes). This shape is likely to be the best choice for data-parallel ML training. Other topologies can be useful for workloads with multiple kinds of parallelism (for example, model and data parallelism, or spatial partitioning of a simulation). These workloads perform best if the slice shape is matched to the parallelism used. For example, placing 4-way model parallelism on the X axis and 256-way data parallelism on the Y and Z dimensions matches a 4x16x16 topology.
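
For example, a slice with the 4x16x16 topology described above is one of the "custom" shapes listed later in this document, so it is requested with the type and topology flags; tpu-name, zone, and the runtime version are placeholders:

    $ gcloud compute tpus tpu-vm create tpu-name \
      --zone=zone \
      --type=v4 \
      --topology=4x16x16 \
      --version=tpu-vm-tf-2.11.0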

Models with multiple dimensions of parallelism perform best with their parallelism dimensions mapped to torus dimensions. These are usually data+model parallel Large Language Models (LLMs). For example, for a TPU v4 Pod slice with topology 8x16x16, the torus dimensions are 8, 16, and 16. It is more performant to use 8-way or 16-way model parallelism (mapped to one of the physical torus dimensions). 4-way model parallelism would be suboptimal with this topology, since it is not aligned with any of the torus dimensions, but it would be optimal with a 4x16x32 topology on the same number of chips.

Small v4 topologies

Cloud TPU supports the following TPU v4 slices smaller than 64 chips, a 4x4x4 cube. You can create these small v4 topologies using either their TensorCore-based name (for example, v4-32), or their topology (for example, 2x2x4):

Name (based on TensorCore count)  Number of chips  Topology
v4-8                              4                2x2x1
v4-16                             8                2x2x2
v4-32                             16               2x2x4
v4-64                             32               2x4x4
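
For example, per the table above, the following two commands request the same 16-chip slice, first by its TensorCore-based name and then by its topology; tpu-name, zone, and the runtime version are placeholders:

    $ gcloud compute tpus tpu-vm create tpu-name \
      --zone=zone \
      --accelerator-type=v4-32 \
      --version=tpu-vm-tf-2.11.0

    $ gcloud compute tpus tpu-vm create tpu-name \
      --zone=zone \
      --type=v4 \
      --topology=2x2x4 \
      --version=tpu-vm-tf-2.11.0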

Large v4 topologies

TPU v4 slices are available in increments of 64 chips, with shapes that are multiples of 4 on all three dimensions. The dimensions must also be in non-decreasing order (A<=B<=C). Several examples are shown in the following table. A few of these topologies are "custom" topologies that can only be launched using the Topology API because they have the same number of chips as a more commonly used named topology.

Name (based on TensorCore count)  Number of chips  Topology
v4-128                            64               4x4x4
v4-256                            128              4x4x8
v4-512                            256              4x8x8
See Topology API                  256              4x4x16
v4-1024                           512              8x8x8
v4-1536                           768              8x8x12
v4-2048                           1024             8x8x16
See Topology API                  1024             4x16x16
v4-4096                           2048             8x16x16

Topology API

To create a Cloud TPU Pod slice with a custom topology, use the gcloud TPU API as follows:

     $ gcloud compute tpus tpu-vm create tpu-name \
        --zone=us-central2-b \
        --subnetwork=tpusubnet \
        --type=v4 \
        --topology=4x4x16 \
        --version=runtime-version

Topology shapes

The following image shows three possible topology shapes and their associated topology flag settings.

[Image: three topology shapes and their associated --topology flag settings]

TPU type compatibility

You can change the TPU type to another TPU type that has the same number of TensorCores or chips (for example, v3-128 and v4-128) and run your training script without code changes. However, if you change to a TPU type with a larger or smaller number of chips or TensorCores, you will need to perform significant tuning and optimization. For more information, see Training on TPU Pods.

What's next