Training on TPU slices

TPUs are designed to be scaled out to a TPU Pod. A TPU Pod is a collection of TPU devices connected by dedicated high-speed network interfaces, which lets you distribute the processing load across multiple TPUs. Each TPU board is connected to a high-performance CPU-based host machine that handles tasks such as loading and preprocessing data. To take full advantage of larger numbers of TPUs, you must tune several training task parameters.

The setup for training with TPU Pods is different for each framework. Use the following links to see detailed information about training on Pods with each framework:

JAX
PyTorch

The following sections explain some common issues, changes you need to make in your models, and best practices to reduce or avoid Pod failures.

Scaling batch size and train steps

To achieve linear scaling on larger TPU types, keep the per-core batch size the same.

For example, if you use a batch size of 1024 on a v6e-8, use a batch size of 4096 (4 * 1024) on a v6e-32. This fully utilizes the TPU hardware. You can use smaller batch sizes, but your training won't scale linearly if you do so.
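The arithmetic above can be sketched as a small helper. This is a minimal illustration, not part of any TPU library: the function names are hypothetical, and it assumes the numeric suffix of the TPU type (the 8 in v6e-8) counts the same kind of unit for both types, which holds when you compare two types from the same TPU generation.

```python
def units_in_tpu_type(tpu_type: str) -> int:
    """Return the numeric suffix of a TPU type string such as 'v6e-8'.

    Assumption: the suffix counts the same unit (chips or cores) for every
    type being compared, so only the ratio between two suffixes matters.
    """
    return int(tpu_type.rsplit("-", 1)[1])


def scaled_global_batch_size(base_batch_size: int, base_type: str, target_type: str) -> int:
    """Scale the global batch size so the per-unit batch size stays constant."""
    base_units = units_in_tpu_type(base_type)
    target_units = units_in_tpu_type(target_type)
    if base_batch_size % base_units != 0:
        raise ValueError("base batch size must divide evenly across the base TPU type")
    per_unit_batch = base_batch_size // base_units
    return per_unit_batch * target_units


# Same numbers as the example above: 1024 on a v6e-8 becomes 4096 on a v6e-32.
print(scaled_global_batch_size(1024, "v6e-8", "v6e-32"))
```

Holding the per-unit batch size fixed keeps each core doing the same amount of work per step, which is what allows throughput to scale with the size of the slice.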

Some models include a train_steps flag where one step corresponds to processing a single batch of data. When you increase the batch size, scale down the number of training steps so that the total number of training examples remains the same.

For example, with a batch size of 1000 and 100 steps, training processes 100,000 examples. If you move to 4 workers with an effective batch size of 4000, reduce the number of steps to 25 to process the same 100,000 examples. If your model uses an epochs flag, you don't need to scale the number of steps, because an epoch is a full pass over the dataset regardless of batch size.
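The step adjustment can be written as a short calculation. The following is a sketch with a hypothetical function name; it rounds up when the new batch size does not divide the example count evenly, which is one reasonable choice rather than a documented rule.

```python
def scaled_train_steps(base_steps: int, base_batch_size: int, new_batch_size: int) -> int:
    """Return the step count that processes the same number of examples.

    Rounds up so that at least as many examples are processed as before.
    """
    if new_batch_size <= 0:
        raise ValueError("new_batch_size must be positive")
    total_examples = base_steps * base_batch_size
    # Ceiling division using integer arithmetic only.
    return -(-total_examples // new_batch_size)


# Same numbers as the example above: 100 steps at 1000 becomes 25 steps at 4000.
print(scaled_train_steps(100, 1000, 4000))
```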

Larger batch sizes can change the convergence behavior of the model, so you might also need to tune some hyperparameters, such as the learning rate.

Using regional Cloud Storage buckets in the same region as the TPU Pod

In general, the best practice for TPU training is to always use resources in the same region. Resource region is particularly important when using TPU Pods because the data transfer rates are higher when your Cloud Storage bucket and TPU are in the same region.

Ensure you are using a regional Cloud Storage bucket in the same region as the TPU for training datasets and checkpoints.
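One way to guard against a mismatch is to compare the bucket location with the region of the TPU zone before launching a job. The sketch below uses plain string handling only; the function names and the example zone and location values are illustrative assumptions, and it assumes you have already looked up the bucket location and TPU zone through your usual tooling.

```python
def zone_to_region(zone: str) -> str:
    """Derive the region from a zone name, for example 'us-east5-b' -> 'us-east5'."""
    region, sep, suffix = zone.rpartition("-")
    if not sep or not region or not suffix:
        raise ValueError(f"unexpected zone format: {zone!r}")
    return region


def bucket_is_colocated(bucket_location: str, tpu_zone: str) -> bool:
    """Return True when a regional bucket is in the same region as the TPU zone.

    The comparison ignores case because bucket locations are commonly
    reported in upper case (for example 'US-EAST5').
    """
    return bucket_location.lower() == zone_to_region(tpu_zone).lower()


# Illustrative values only.
print(bucket_is_colocated("US-EAST5", "us-east5-b"))  # regional bucket, same region
print(bucket_is_colocated("US", "us-east5-b"))        # multi-region bucket, not a match
```

A multi-region location such as US fails this check by design, since the recommendation above is a regional bucket in the TPU's own region.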

Workflow best practices for development on TPU Pods

When developing a new TPU workload, it's usually best to start on the smallest TPU type and progressively move to larger ones. Start with a small TPU type (for example, v6e-8) and do the following:

Test your workload for expected behavior.
Test and validate performance using the performance tools.

Once your workload is functional and reaches your performance targets, scale up to a larger TPU type such as a v6e-32. Gradually and iteratively increase the TPU size while validating scalability (functionality and performance) until you reach the TPU size that you want.