The bfloat16 numerical format

Using reduced-precision floating-point numbers is a common way to decrease time to convergence without losing accuracy. TPUs use the bfloat16 number format when performing matrix operations: multiplications are performed on bfloat16 values, and accumulations are performed on IEEE float32 values.
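This split can be simulated in plain Python. The sketch below rounds operands to bfloat16 by truncating the low 16 bits of their float32 representation (hardware typically uses round-to-nearest-even; truncation is used here only for simplicity), then accumulates the products in float32. The function names are illustrative, not part of any TPU API.

```python
import struct

def to_bfloat16(x: float) -> float:
    """Round a value to bfloat16 by truncating the low 16 bits of its
    float32 encoding (a simplification; hardware rounds to nearest even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0] & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def dot_bf16_acc_f32(a, b):
    """Multiply bfloat16-rounded operands; accumulate products in float32."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_bfloat16(x) * to_bfloat16(y)
        # Round the running sum back to float32 after each addition.
        acc = struct.unpack("<f", struct.pack("<f", acc + prod))[0]
    return acc

print(dot_bf16_acc_f32([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0 (all values exact here)
```

Small integers survive the rounding exactly, so the dot product above is exact; a value such as 0.1 is rounded to the nearest 7-bit mantissa instead.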

bfloat16 is a custom 16-bit floating-point format for machine learning composed of one sign bit, eight exponent bits, and seven mantissa bits. The following diagram shows the bit layouts of three floating-point formats: float32 (IEEE single precision), float16 (IEEE half precision), and bfloat16.

[Diagram: bit layouts of float32 (1 sign, 8 exponent, 23 mantissa bits), float16 (1 sign, 5 exponent, 10 mantissa bits), and bfloat16 (1 sign, 8 exponent, 7 mantissa bits)]

The dynamic ranges of bfloat16 and float32 are equivalent. However, bfloat16 takes up half the memory space. For more information about bfloat16 performance, see A Study of BFLOAT16 for Deep Learning Training.
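The sign/exponent/mantissa split can be inspected directly. The helper below (an illustrative sketch, not a TPU API) decodes the three bfloat16 fields from a value's float32 bit pattern, and contrasts the dynamic range with float16 using Python's built-in half-precision `struct` format:

```python
import struct

def bfloat16_fields(x: float):
    """Decode the sign (1), exponent (8), and mantissa (7) bits of bfloat16,
    taken as the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0] >> 16
    return bits >> 15, (bits >> 7) & 0xFF, bits & 0x7F

# 3.140625 = 1.1001001b * 2^1 is exactly representable in 7 mantissa bits.
print(bfloat16_fields(3.140625))  # (0, 128, 73)

# bfloat16 shares float32's 8-bit exponent, so 1e38 is representable...
print(bfloat16_fields(1e38))
# ...while float16's 5-bit exponent overflows (max finite value: 65504).
try:
    struct.pack("<e", 1e38)
except (OverflowError, struct.error):
    print("float16 overflow")
```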

Choosing bfloat16

The Google hardware team chose bfloat16 for Cloud TPUs to improve hardware efficiency while maintaining the ability to train deep learning models accurately, all with minimal switching costs from float32. The physical size of a hardware multiplier scales with the square of the mantissa width. With fewer mantissa bits than float16, a bfloat16 multiplier is about half the size in silicon of a typical float16 multiplier, and eight times smaller than a float32 multiplier.

Neural networks are more sensitive to the size of the exponent than the size of the mantissa. To ensure identical behavior for underflows, overflows, and NaNs, bfloat16 has the same exponent size as float32. bfloat16 handles denormals differently from float32: it flushes them to zero. Unlike float16, which typically requires special handling such as loss scaling, bfloat16 is a drop-in replacement for float32 when training and running deep neural networks.
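The flush-to-zero behavior can be simulated in software. In the sketch below (illustrative only; the real flush happens in TPU hardware), any value whose bfloat16 exponent field is zero with a nonzero mantissa is replaced by zero, while normal values round-trip unchanged:

```python
import struct

def bfloat16_flush_denormals(x: float) -> float:
    """Simulate TPU-style flush-to-zero: bfloat16 denormals become 0."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0] >> 16
    exponent = (bits >> 7) & 0xFF
    mantissa = bits & 0x7F
    if exponent == 0 and mantissa != 0:
        # Denormal: preserve the sign, flush the magnitude to zero.
        return -0.0 if bits >> 15 else 0.0
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

print(bfloat16_flush_denormals(1e-40))     # below 2^-126, flushed -> 0.0
print(bfloat16_flush_denormals(1.5))       # normal value preserved -> 1.5
print(bfloat16_flush_denormals(2**-126))   # smallest normal, preserved
```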

Mixed-precision training

Most computations within a deep neural network can accomplish a task with the same accuracy using lower-precision values. Some models can even reach a higher accuracy with lower-precision values.

When you program Cloud TPUs, the XLA compiler automatically converts values between float32 and bfloat16.

Model portability

The values of parameters and activations in a model can be stored in 32-bit format because the TPU hardware can automatically cast these values to bfloat16. Checkpoints obtained from a model trained on Cloud TPUs can be deployed on other hardware platforms (for example, inference or fine-tuning on CPUs or GPUs) without extensive manual conversions.

Improving performance with bfloat16

While automatic format conversion in TPUs lets you avoid thinking about numerical precision, you can achieve further performance improvements by explicitly casting values to bfloat16. There are two reasons to do so:

  1. Storing values in bfloat16 format saves on-chip memory, enabling Cloud TPUs to train larger models or use larger batch sizes.

  2. Some operations are memory-bandwidth-bound, which means the amount of time it takes to load data from memory can slow down the overall time spent performing the computation. Storing operands and outputs of those ops in bfloat16 format reduces the amount of data that must be transferred, improving overall speed.
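The memory saving in point 1 is easy to verify. In the sketch below (assuming JAX is installed; other frameworks with a bfloat16 dtype behave the same way), an explicit cast halves the number of bytes a tensor occupies:

```python
import jax.numpy as jnp

# A float32 tensor and its explicitly cast bfloat16 copy.
x = jnp.zeros((1024, 1024), dtype=jnp.float32)
x_bf16 = x.astype(jnp.bfloat16)

print(x.nbytes)       # 4194304 (4 bytes per element)
print(x_bf16.nbytes)  # 2097152 (2 bytes per element)
```

Halving the bytes per element also halves the data moved for memory-bandwidth-bound ops, which is the source of the speedup in point 2.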

To get started, we recommend getting some hands-on experience with one of the bfloat16-enabled reference models that have been optimized for Cloud TPUs. After that, our performance guide, profiling tools guide, and troubleshooting guide provide in-depth technical information to help you create and optimize machine learning models on your own.