TPU v6e
This document describes the architecture and supported configurations of Cloud TPU v6e (Trillium).
Trillium is Cloud TPU's latest generation AI accelerator. On all technical surfaces, such as the API and logs, and throughout this document, Trillium will be referred to as v6e.
With a 256-chip footprint per Pod, v6e shares many similarities with v5e. This system is optimized to be the highest value product for transformer, text-to-image, and convolutional neural network (CNN) training, fine-tuning, and serving.
System architecture
Each v6e chip contains one TensorCore. Each TensorCore has 4 matrix-multiply units (MXU), a vector unit, and a scalar unit. The following table shows the key specifications and their values for TPU v6e compared to TPU v5e.
Specification | v5e | v6e |
---|---|---|
Performance/total cost of ownership (TCO) (expected) | 0.65x | 1 |
Peak compute per chip (bf16) | 197 TFLOPs | 918 TFLOPs |
Peak compute per chip (Int8) | 393 TOPs | 1836 TOPs |
HBM capacity per chip | 16 GB | 32 GB |
HBM bandwidth per chip | 819 GBps | 1640 GBps |
Inter-chip interconnect (ICI) bandwidth | 1600 Gbps | 3584 Gbps |
ICI ports per chip | 4 | 4 |
DRAM per host | 512 GiB | 1536 GiB |
Chips per host | 8 | 8 |
TPU Pod size | 256 chips | 256 chips |
Interconnect topology | 2D torus | 2D torus |
BF16 peak compute per Pod | 50.63 PFLOPs | 234.9 PFLOPs |
All-reduce bandwidth per Pod | 51.2 TB/s | 102.4 TB/s |
Bisection bandwidth per Pod | 1.6 TB/s | 3.2 TB/s |
Per-host NIC configuration | 2 x 100 Gbps NIC | 4 x 200 Gbps NIC |
Data center network bandwidth per Pod | 6.4 Tbps | 25.6 Tbps |
Special features | - | SparseCore |
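To make the generational differences concrete, the per-chip figures in the table can be turned into v6e-to-v5e ratios. The following is a minimal Python sketch. The dictionary keys and the helper name are illustrative only and aren't part of any TPU API.

```python
# Per-chip figures copied from the comparison table above.
V5E = {"bf16_tflops": 197, "int8_tops": 393, "hbm_gb": 16,
       "hbm_gbps": 819, "ici_gbps": 1600}
V6E = {"bf16_tflops": 918, "int8_tops": 1836, "hbm_gb": 32,
       "hbm_gbps": 1640, "ici_gbps": 3584}


def v6e_over_v5e(spec: str) -> float:
    """Return the v6e/v5e ratio for one specification (hypothetical helper)."""
    return V6E[spec] / V5E[spec]


# Print one ratio per specification, rounded to two decimals.
for spec in V5E:
    print(f"{spec}: {v6e_over_v5e(spec):.2f}x")
```

With the table's values, peak bf16 compute grows by about 4.7x per chip, HBM capacity and HBM bandwidth roughly double, and ICI bandwidth grows by 2.24x.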
Supported configurations
TPU v6e supports training for up to 256 v6e chips and single-host inference for up to 8 chips.
The following table shows the 2D slice shapes that are supported for v6e:
Topology | TPU chips | Hosts | VMs | Accelerator type (TPU API) | Machine type (GKE API) | Scope | Supports inference? |
---|---|---|---|---|---|---|---|
1x1 | 1 | 1/8 | 1 | v6e-1 | ct6e-standard-1t | Sub-host | Yes |
2x2 | 4 | 1/2 | 1 | v6e-4 | ct6e-standard-4t | Sub-host | Yes |
2x4 | 8 | 1 | 1 | v6e-8 | ct6e-standard-8t | Single-host | Yes |
2x4 | 8 | 1 | 2 | - | ct6e-standard-4t | Single-host | No |
4x4 | 16 | 2 | 4 | v6e-16 | ct6e-standard-4t | Multi-host | No |
4x8 | 32 | 4 | 8 | v6e-32 | ct6e-standard-4t | Multi-host | No |
8x8 | 64 | 8 | 16 | v6e-64 | ct6e-standard-4t | Multi-host | No |
8x16 | 128 | 16 | 32 | v6e-128 | ct6e-standard-4t | Multi-host | No |
16x16 | 256 | 32 | 64 | v6e-256 | ct6e-standard-4t | Multi-host | No |
Slices with 8 chips (v6e-8) attached to a single VM are optimized for inference, allowing all 8 chips to be used in a single serving workload.
For information about the number of VMs for each topology, see VM types.
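The relationship between a topology string and the other columns of the table can be sketched in a few lines of Python. This is an illustration under stated assumptions, not an official tool: describe_topology is a hypothetical helper, it doesn't check that a topology is one of the supported shapes, and it models only the TPU API rows, so 2x4 maps to the single 8-chip VM rather than the two-VM GKE variant.

```python
from fractions import Fraction

CHIPS_PER_HOST = 8          # from the specification table
CHIPS_PER_HALF_HOST_VM = 4  # multi-host slices are built from 4-chip VMs


def describe_topology(topology: str) -> dict:
    """Derive chip, host, and VM counts for a topology string such as "4x8"."""
    rows, cols = (int(n) for n in topology.split("x"))
    chips = rows * cols
    hosts = Fraction(chips, CHIPS_PER_HOST)
    # Slices of up to one host fit in a single VM; larger slices use 4-chip VMs.
    vms = 1 if chips <= CHIPS_PER_HOST else chips // CHIPS_PER_HALF_HOST_VM
    return {
        "chips": chips,
        "hosts": hosts,
        "vms": vms,
        "accelerator_type": f"v6e-{chips}",
        "multi_host": hosts > 1,
    }


print(describe_topology("4x8"))
```

Using Fraction keeps the sub-host rows exact (1/8 and 1/2 of a host) instead of rounding them to floating-point values.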
VM types
Each TPU v6e VM can contain 1, 4, or 8 chips. VMs with 4 or fewer chips sit on a single non-uniform memory access (NUMA) node, while the 8-chip VM spans two NUMA nodes. For more information about NUMA nodes, see Non-uniform memory access on Wikipedia.
v6e slices are created using half-host VMs, each with 4 TPU chips. There are two exceptions to this rule:
v6e-1: A VM with only a single chip, primarily intended for testing.
v6e-8: A full-host VM that has been optimized for an inference use case with all 8 chips attached to a single VM.
The following table shows a comparison of TPU v6e VM types:
VM type | Number of vCPUs per VM | RAM (GB) per VM | Number of NUMA nodes per VM |
---|---|---|---|
1-chip VM | 44 | 176 | 1 |
4-chip VM | 180 | 720 | 1 |
8-chip VM | 180 | 1440 | 2 |
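Because multi-host slices are assembled from identical VMs, the host-side resources of a slice follow from the per-VM figures. The sketch below totals vCPUs and RAM for a slice. The VM_TYPES mapping and the slice_host_resources helper are hypothetical names introduced only for this illustration.

```python
# Per-VM figures copied from the VM types table above, keyed by chips per VM.
VM_TYPES = {
    1: {"vcpus": 44, "ram_gb": 176, "numa_nodes": 1},
    4: {"vcpus": 180, "ram_gb": 720, "numa_nodes": 1},
    8: {"vcpus": 180, "ram_gb": 1440, "numa_nodes": 2},
}


def slice_host_resources(chips_per_vm: int, num_vms: int) -> dict:
    """Total host-side vCPUs and RAM (GB) across all VMs in a slice."""
    vm = VM_TYPES[chips_per_vm]
    return {"vcpus": vm["vcpus"] * num_vms, "ram_gb": vm["ram_gb"] * num_vms}


# A v6e-32 slice is built from eight 4-chip VMs.
print(slice_host_resources(chips_per_vm=4, num_vms=8))
```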
Specify v6e configuration
When you allocate a TPU v6e slice using the TPU API, you specify its size and shape using either the AcceleratorType or AcceleratorConfig parameter.
If you're using GKE, use the --machine-type flag to specify a machine type that supports the TPU you want to use. For more information, see Deploy TPU workloads in GKE Standard in the GKE documentation.
Use AcceleratorType
When you allocate TPU resources, you use AcceleratorType to specify the number of TensorCores in a slice. The value you specify for AcceleratorType is a string with the format v$VERSION-$TENSORCORE_COUNT. For example, v6e-8 specifies a v6e TPU slice with 8 TensorCores.
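The string format described above can be split back into its two parts with a small parser. The following Python sketch is an assumption-laden illustration: parse_accelerator_type is a hypothetical helper, and the regular expression is one reasonable reading of the v$VERSION-$TENSORCORE_COUNT format rather than a validation rule taken from the TPU API.

```python
import re

# Assumed shape: "v", a version such as "6e", a hyphen, then a positive count.
_ACCELERATOR_TYPE = re.compile(r"^v(?P<version>[0-9]+[a-z]*)-(?P<count>[0-9]+)$")


def parse_accelerator_type(value: str) -> tuple[str, int]:
    """Split an AcceleratorType string into (TPU version, TensorCore count)."""
    match = _ACCELERATOR_TYPE.match(value)
    if match is None:
        raise ValueError(f"not an AcceleratorType string: {value!r}")
    return "v" + match.group("version"), int(match.group("count"))


print(parse_accelerator_type("v6e-8"))
```

Because each v6e chip contains one TensorCore, the count returned for a v6e slice is also its chip count.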
The following example shows how to create a TPU v6e slice with 32 TensorCores using AcceleratorType:
gcloud
$ gcloud compute tpus tpu-vm create tpu-name \
    --zone=zone \
    --accelerator-type=v6e-32 \
    --version=v2-alpha-tpuv6e
Console
In the Google Cloud console, go to the TPUs page.
Click Create TPU.
In the Name field, enter a name for your TPU.
In the Zone box, select the zone where you want to create the TPU.
In the TPU type box, select v6e-32.
In the TPU software version box, select v2-alpha-tpuv6e. When creating a Cloud TPU VM, the TPU software version specifies the version of the TPU runtime to install. For more information, see TPU VM images.
Click the Enable queueing toggle.
In the Queued resource name field, enter a name for your queued resource request.
Click Create.
Use AcceleratorConfig
You can also use AcceleratorConfig to specify the number of TensorCores you want to use. However, because there are no custom 2D topology variants for TPU v6e, there is no difference between using AcceleratorConfig and AcceleratorType.
To configure a TPU v6e using AcceleratorConfig, use the --type and --topology flags. Set --type to the TPU version you want to use and --topology to the physical arrangement of the TPU chips in the slice. The value you specify for --topology is a string with the format AxB, where A and B are the chip counts in each direction. As with AcceleratorType, the --version flag selects the TPU software version.
The following example shows how to create a TPU v6e slice with 32 TensorCores using AcceleratorConfig, arranged in a 4x8 topology:
$ gcloud compute tpus tpu-vm create tpu-name \
    --zone=zone \
    --type=v6e \
    --topology=4x8 \
    --version=v2-alpha-tpuv6e
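Since both forms describe the same v6e slice, a script can generate either command from a single topology string. The sketch below only assembles and prints the command lines and doesn't invoke gcloud. create_args is a hypothetical helper, it uses only the flags that appear in the two example commands in this document, and tpu-name and zone remain placeholders.

```python
import shlex

RUNTIME_VERSION = "v2-alpha-tpuv6e"  # software version used in this document's examples


def create_args(name: str, zone: str, topology: str, use_config: bool) -> list[str]:
    """Build the gcloud argument list for a v6e slice with the given topology."""
    rows, cols = (int(n) for n in topology.split("x"))
    base = ["gcloud", "compute", "tpus", "tpu-vm", "create", name, f"--zone={zone}"]
    if use_config:
        # AcceleratorConfig form: TPU version plus chip arrangement.
        shape = ["--type=v6e", f"--topology={topology}"]
    else:
        # AcceleratorType form: one TensorCore per v6e chip, so count = rows * cols.
        shape = [f"--accelerator-type=v6e-{rows * cols}"]
    return base + shape + [f"--version={RUNTIME_VERSION}"]


print(shlex.join(create_args("tpu-name", "zone", "4x8", use_config=True)))
print(shlex.join(create_args("tpu-name", "zone", "4x8", use_config=False)))
```

Keeping the arguments as a list until the final shlex.join avoids quoting mistakes if a name or zone ever contains characters that the shell treats specially.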