The bfloat16 numerical format
Using reduced-precision floating point numbers is a common method used to
decrease time to convergence without losing accuracy. TPUs use the bfloat16
number format when performing matrix operations. Matrix multiplication
operations are performed on bfloat16 values and accumulations are performed on
IEEE float32 values.
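To make this concrete, here is a minimal JAX sketch (illustrative, not taken from this documentation) of the same idea at the framework level: the operands of a matrix multiplication are bfloat16, and float32 accumulation is requested through preferred_element_type.

```python
import jax
import jax.numpy as jnp

# bfloat16 operands, as the TPU matrix units consume them.
key = jax.random.PRNGKey(0)
a = jax.random.normal(key, (128, 256), dtype=jnp.bfloat16)
b = jax.random.normal(key, (256, 64), dtype=jnp.bfloat16)

# Ask XLA to accumulate the products in float32.
c = jax.lax.dot(a, b, preferred_element_type=jnp.float32)
print(c.dtype)  # float32
```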
bfloat16 is a custom 16-bit floating point format for machine learning that is
composed of one sign bit, eight exponent bits, and seven mantissa bits. The
following diagram shows the internals of three floating point formats: float32
(IEEE single-precision), float16 (IEEE half-precision), and bfloat16.
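As a quick illustration of the bit layout (a sketch using NumPy and the bfloat16 scalar type that JAX re-exports; not part of the original text), the value 3.140625 is exactly representable in both formats, so its bfloat16 encoding is simply the top 16 bits of its float32 encoding:

```python
import numpy as np
import jax.numpy as jnp

x = np.float32(3.140625)

# float32: 1 sign bit | 8 exponent bits | 23 mantissa bits
bits32 = int(np.asarray(x).view(np.uint32))
print(f"float32  bits: {bits32:032b}")

# bfloat16: 1 sign bit | 8 exponent bits | 7 mantissa bits
# Same sign and exponent fields as float32, with the mantissa shortened.
bits16 = int(np.asarray(jnp.bfloat16(x)).view(np.uint16))
print(f"bfloat16 bits: {bits16:016b}")
```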
bfloat16 and float32 have the same dynamic range. However, bfloat16 takes up
half the memory space. For more information about bfloat16 performance, see
A Study of BFLOAT16 for Deep Learning Training.
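For example (an illustrative sketch, not from the text above), a value that is routine for float32 overflows float16 but fits comfortably in bfloat16:

```python
import jax.numpy as jnp

big = 1e30  # well within float32's range, far beyond float16's max (~65504)
print(jnp.float16(big))   # inf: float16 overflows
print(jnp.bfloat16(big))  # ~1e30: bfloat16 shares float32's exponent range
```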
Choosing bfloat16
The Google hardware team chose bfloat16 for Cloud TPUs to improve hardware
efficiency while maintaining the ability to train deep learning models
accurately, all with minimal switching costs from float32. The physical size of
a hardware multiplier scales with the square of the mantissa width. With fewer
mantissa bits than FP16, the bfloat16 multipliers are about half the size in
silicon of a typical FP16 multiplier, and they are eight times smaller than a
float32 multiplier.
Neural networks are more sensitive to the size of the exponent than the size of
the mantissa. To ensure identical behavior for underflows, overflows, and NaNs,
bfloat16 has the same exponent size as float32. bfloat16 handles denormals
differently from float32: it flushes them to zero. Unlike float16, which
typically requires special handling like loss scaling, bfloat16 is a drop-in
replacement for float32 when training and running deep neural networks.
Mixed-precision training
Most computations within a deep neural network can accomplish a task with the
same accuracy using lower-precision values. Some models can even reach a higher
accuracy with lower-precision values.
When programming Cloud TPUs, the XLA compiler automatically converts values
between float32 and bfloat16.
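In JAX, for instance (an illustrative sketch; the default behavior described here is a property of the framework and the TPU backend, not something stated in this document), matrix multiplications over float32 arrays may be performed in bfloat16 on TPU by default, and the precision can be raised explicitly when needed:

```python
import jax
import jax.numpy as jnp

x = jnp.ones((1024, 1024), dtype=jnp.float32)

# Default on TPU: XLA may perform the multiplication in bfloat16
# while keeping float32 inputs and outputs.
y = jnp.matmul(x, x)

# Request full float32 multiplication when extra precision matters.
with jax.default_matmul_precision("float32"):
    y_full = jnp.matmul(x, x)
```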
Model portability
The values of parameters and activations in a model can be stored in 32-bit
format because the TPU hardware can automatically cast these values to
bfloat16. Checkpoints obtained from a model trained on Cloud TPUs can be
deployed on other hardware platforms (for example, inference or fine-tuning on
CPUs or GPUs) without extensive manual conversions.
Improving performance with bfloat16
While automatic format conversion in TPUs lets you avoid thinking about
numerical precision, further performance improvements can be achieved by
explicitly casting values to bfloat16. There are two reasons for explicitly
casting values to bfloat16:

- Storing values in bfloat16 format saves on-chip memory, enabling Cloud TPUs
  to train larger models or use larger batch sizes.
- Some operations are memory-bandwidth-bound, which means the amount of time it
  takes to load data from memory can slow down the overall time spent
  performing the computation. Storing operands and outputs of those ops in
  bfloat16 format reduces the amount of data that must be transferred,
  improving overall speed.
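For example, a minimal sketch (illustrative; the array shape is arbitrary) of explicitly storing a large activation tensor in bfloat16, halving both its memory footprint and the bytes moved for bandwidth-bound ops:

```python
import jax.numpy as jnp

activations = jnp.ones((8, 1024, 1024), dtype=jnp.float32)
print(activations.nbytes)       # 33554432 bytes in float32

activations_bf16 = activations.astype(jnp.bfloat16)
print(activations_bf16.nbytes)  # 16777216 bytes: half the storage and transfer
```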
To get started, we recommend getting some hands-on experience with one of the
bfloat16-enabled reference models that have been optimized for Cloud TPUs.
After that, our performance guide, profiling tools guide, and troubleshooting
guide provide in-depth technical information to help you create and optimize
machine learning models on your own.