The bfloat16 numerical format
Using reduced-precision floating point numbers is a common method used to
decrease time to convergence without losing accuracy. TPUs use the
number format when performing matrix operations. Matrix multiplication operations
are performed on
bfloat16 values and accumulations are performed on IEEE
bfloat16 is a custom 16-bit floating point format for machine learning that is
composed of one sign bit, eight exponent bits, and seven mantissa bits. The
following diagram shows the internals of three floating point formats:
float32: IEEE single-precision,
float16: IEEE half-precision, and
The dynamic range of
float32 are equivalent. However,
takes up half the memory space. For more information about
see A Study of BFLOAT16 for Deep Learning Training.
The Google hardware team chose
bfloat16 for Cloud TPUs to improve hardware
efficiency while maintaining the ability to train deep learning models accurately,
all with minimal switching costs from
float32. The physical size of a hardware
multiplier scales with the square of the mantissa width. With fewer mantissa
bfloat16 multipliers are about half the size in silicon of a
FP16 multiplier, and they are eight times smaller than an
Neural networks are more sensitive to the size of the exponent than the size of
the mantissa. To ensure identical behavior for underflows, overflows, and NaNs,
bfloat16 has the same exponent size as
bfloat16 handles denormals
float32, it flushes them to zero. Unlike
float16, which typically
requires special handling like loss scaling,
bfloat16 is a drop-in replacement
float32 when training and running deep neural networks.
Most computations within a deep neural network can accomplish a task with the same accuracy using a lower-precision values. Some models can even reach a higher accuracy with lower-precision values.
When programming Cloud TPUs, the XLA compiler automatically converts values between
The values of parameters and activations in a model can be stored in 32-bit
format because the TPU hardware can automatically cast these values to
Checkpoints obtained from a model trained on Cloud TPUs can be deployed on other
hardware platforms (for example, inference or fine-tuning on CPUs or GPUs)
without extensive manual conversions.
Improving performance with bfloat16
While automatic format conversion in TPUs lets you avoid thinking about
numerical precision, further performance improvements can be achieved by
explicitly casting values to
bfloat16. There are two reasons for explicitly
casting values to
Storing values in
bfloat16format saves on-chip memory, enabling Cloud TPUs to train larger models or use larger batch sizes.
Some operations are memory-bandwidth-bound, which means the amount of time it takes to load data from memory can slow down the overall time spent performing the computation. Storing operands and outputs of those ops in
bfloat16format reduces the amount of data that must be transferred, improving overall speed.
To get started, we recommend getting some hands-on experience with one of the
bfloat16-enabled reference models
that have been optimized for Cloud TPUs. After that,
our performance guide, profiling tools guide,
and troubleshooting guide provide
in-depth technical information to help you create and optimize machine learning
models on your own.