Cloud TPU performance guide

Your first step when troubleshooting TPU performance is to profile your model. For more information on capturing a performance profile, see Profiling your model on Cloud TPU.

TPU model performance

This section describes general issues that can reduce model performance and how you can address them.

  1. Model is input bound

    TPUs perform calculations very fast. To keep the TPU from sitting idle, make sure a steady stream of data is loaded onto it. How you do this depends on how you load and preprocess your dataset. For example, you can read data files in parallel using the num_parallel_reads parameter.

  2. Batch size is too small because of sharding (splitting batches across cores)

    The TPU runtime splits a batch across all 8 cores of a TPU device (for example v2-8 or v3-8). If you specify a global batch size of 128, each core receives a batch size of 16 (128 / 8).

    For optimum memory usage, use the largest batch size that fits into TPU memory. Each TPU core uses two-dimensional 8 x 128 vector registers for processing matrix multiplications. In general, your batch size should be evenly divisible by 8 or 128.
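The sharding arithmetic above can be sketched in plain Python. The helper name `per_core_batch_size` and the default of 8 cores are illustrative assumptions, not part of any TPU API; the function only reproduces the division described above and flags global batch sizes that do not split evenly.

```python
def per_core_batch_size(global_batch_size: int, num_cores: int = 8) -> int:
    """Return the batch size each core receives (hypothetical helper).

    Assumes the global batch is split evenly across cores, as described
    for a v2-8 or v3-8 device.
    """
    if global_batch_size % num_cores != 0:
        raise ValueError(
            f"global batch size {global_batch_size} does not split evenly "
            f"across {num_cores} cores"
        )
    return global_batch_size // num_cores


if __name__ == "__main__":
    print(per_core_batch_size(128))   # 128 / 8 = 16 examples per core
    print(per_core_batch_size(1024))  # 1024 / 8 = 128 examples per core
```

A per-core result that is a multiple of 8 or 128 lines up with the register shape mentioned above.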

XLA compiler optimizations

XLA is a compiler for machine learning that can produce binaries for TPUs, CPUs, GPUs, and other platforms. While XLA is part of the standard TensorFlow code base, it can also be used with PyTorch and JAX models. Models for Cloud TPU are translated to an XLA graph, which XLA then compiles to a TPU executable. For more information about XLA, see XLA: Optimizing Compiler for Machine Learning.


Padding

To use TPU memory efficiently, structure your data so that it can be tiled into 128 x 8 chunks. When the data for a matrix computation does not fill an entire 128 x 8 chunk, the XLA compiler pads tensors. There are two drawbacks to padding:

  1. Padded tensors under-utilize the TPU core.
  2. Padding increases the amount of on-chip memory storage required for a tensor and can lead to an out-of-memory error.

While padding is automatically performed by the XLA compiler when necessary, you can determine the amount of padding performed using the memory viewer tool. You can avoid padding by picking tensor dimensions that are well suited for TPU.
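To get a rough feel for the cost of padding, the following sketch rounds the last two dimensions of a tensor up to whole tiles and reports how much of the padded buffer is wasted. This is a simplified model, not XLA's actual layout algorithm: the function names, the choice to tile only the last two dimensions, and the fixed 8 by 128 tile orientation are all assumptions made for illustration. The memory viewer tool remains the authoritative source for real padding figures.

```python
import math


def round_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple using integer arithmetic."""
    return ((value + multiple - 1) // multiple) * multiple


def padded_shape(shape, tile=(8, 128)):
    """Pad the last two dimensions of shape up to whole tiles.

    Assumption: the second-to-last dimension maps to the 8-wide side of
    the tile and the last dimension to the 128-wide side. Real XLA
    layouts may differ.
    """
    *leading, rows, cols = shape
    return (*leading, round_up(rows, tile[0]), round_up(cols, tile[1]))


def padding_overhead(shape, tile=(8, 128)) -> float:
    """Fraction of the padded buffer that holds padding rather than data."""
    original = math.prod(shape)
    padded = math.prod(padded_shape(shape, tile))
    return 1.0 - original / padded


if __name__ == "__main__":
    print(padded_shape((16, 100)))      # (16, 128)
    print(padding_overhead((16, 100)))  # 0.21875
    print(padding_overhead((64, 128)))  # 0.0, already tile-aligned
```

Under this model, a 16 x 100 tensor wastes roughly 22% of its buffer, while a 64 x 128 tensor wastes none.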

Tensor dimensions

The XLA compiler rounds up the sizes of tensors stored in TPU high-bandwidth memory (HBM) to perform computations more efficiently. This padding happens transparently at the hardware level and does not affect results. However, in certain cases the padding can result in significantly increased memory use and execution time.

The TPU runtime lays out tensors in memory to maximize computational efficiency and minimize padding. To keep memory overhead low, at least one of the following must be true:

  1. The total batch size is a multiple of 64 (8 per TPU core), and feature dimension sizes are a multiple of 128.

  2. The total batch size is a multiple of 1024 (128 per TPU core), and feature dimension sizes are a multiple of 8.

Using a batch size of 1024 and feature dimensions that are a multiple of 128 results in the best efficiency, although this may not be possible for all models.
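These sizing rules can be checked before launching a job. The sketch below is a hypothetical helper (the name `is_tpu_friendly` and the 8-core default are assumptions); it only encodes the two rules from this section for a given global batch size and feature dimension.

```python
def is_tpu_friendly(global_batch: int, feature_dim: int, num_cores: int = 8) -> bool:
    """Return True if either sizing rule from this section holds.

    Rule 1: per-core batch is a multiple of 8 and feature_dim of 128.
    Rule 2: per-core batch is a multiple of 128 and feature_dim of 8.
    """
    if global_batch % num_cores != 0:
        return False  # the batch cannot be split evenly across cores
    per_core = global_batch // num_cores
    rule_1 = per_core % 8 == 0 and feature_dim % 128 == 0
    rule_2 = per_core % 128 == 0 and feature_dim % 8 == 0
    return rule_1 or rule_2


if __name__ == "__main__":
    print(is_tpu_friendly(64, 256))    # True: rule 1
    print(is_tpu_friendly(1024, 72))   # True: rule 2
    print(is_tpu_friendly(64, 72))     # False: neither rule holds
```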


Fusion

Fusion is a general technique the XLA compiler uses to optimize programs. A fused operation combines multiple constituent operations so that they execute together as a single operation.

For example, consider the following series of operations:

    tmp = tf.add(x, y)
    result = tf.multiply(tmp, z)

This code is roughly equivalent to the following pseudocode:

    for (i = 0; i < element_count; i++) {
      tmp[i] = x[i] + y[i];
    }

    for (i = 0; i < element_count; i++) {
      result[i] = tmp[i] * z[i];
    }

With fusion, both operations happen in a single pass over the arrays:

    for (i = 0; i < element_count; i++) {
      result[i] = (x[i] + y[i]) * z[i];
    }

In this example, the number of memory round trips is reduced and XLA does not need to allocate any space for 'tmp'.
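The same idea can be shown in runnable form. The following pure-Python sketch is only an analogy for what XLA does on the TPU: the function names are made up for illustration, and Python lists stand in for device buffers. It shows that the fused loop produces the same values without ever allocating the intermediate buffer.

```python
def add_then_multiply_unfused(x, y, z):
    """Two passes over the data, with a full-size intermediate buffer."""
    tmp = [xi + yi for xi, yi in zip(x, y)]      # first loop: writes tmp
    return [ti * zi for ti, zi in zip(tmp, z)]   # second loop: reads tmp back


def add_then_multiply_fused(x, y, z):
    """One pass over the data, with no intermediate buffer."""
    return [(xi + yi) * zi for xi, yi, zi in zip(x, y, z)]


if __name__ == "__main__":
    x, y, z = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [2.0, 2.0, 2.0]
    assert add_then_multiply_unfused(x, y, z) == add_then_multiply_fused(x, y, z)
    print(add_then_multiply_fused(x, y, z))  # [10.0, 14.0, 18.0]
```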

Fusion is a critical optimization and benefits the Cloud TPU in several ways:

  • It reduces memory transfers by removing the need to store intermediate results in main memory, which is slow.
  • It allows greater utilization of hardware units that would otherwise sit idle.
  • It can reduce the memory utilization of a model as fewer buffers need to be live at the same time.


Broadcasting

Broadcasting implicitly occurs when two tensors with different, but compatible, shapes are combined.

For example, tf.add(vector, matrix) requires the vector to be broadcast to the shape of the matrix. The result of the operation has the same shape as the matrix. For more details, see the guide to broadcasting arrays.

While broadcasts can often be fused with their consumers, forcing a broadcast may result in poor performance and increased memory usage.

In the following example, the broadcast implicit in the addition of a vector and a matrix cannot be fused with the argmax, which results in a materialized broadcast:

    tf.argmax(tf.add(vector, zero_matrix), axis=0)
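
To make the cost of a materialized broadcast concrete, the sketch below contrasts two pure-Python ways of adding a vector to every row of a matrix. It is an analogy only: the function names are hypothetical, nested lists stand in for tensors, and nothing here reflects how XLA decides whether a broadcast can be fused. The first version builds a matrix-sized copy of the vector before adding (a materialized broadcast); the second reads the vector in place.

```python
def add_with_materialized_broadcast(vector, matrix):
    """Expand the vector to the matrix shape first, then add elementwise."""
    # The broadcast buffer holds len(matrix) * len(vector) elements.
    broadcast = [list(vector) for _ in matrix]
    return [
        [b + m for b, m in zip(b_row, m_row)]
        for b_row, m_row in zip(broadcast, matrix)
    ]


def add_with_fused_broadcast(vector, matrix):
    """Add the vector to each row directly, with no expanded copy."""
    return [[v + m for v, m in zip(vector, row)] for row in matrix]


if __name__ == "__main__":
    vector = [1, 2, 3]
    matrix = [[10, 20, 30], [40, 50, 60]]
    print(add_with_fused_broadcast(vector, matrix))
    # [[11, 22, 33], [41, 52, 63]]
```

Both functions return the same result; the difference is the extra matrix-sized buffer the first one allocates.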