Stay organized with collections
Save and categorize content based on your preferences.
Improve your model's performance with bfloat16
By default, TPUs perform
matrix multiplication operations with bfloat16
values and accumulations with IEEE float32
values. Using reduced-precision floating point numbers decreases time to
convergence without losing accuracy.
The dynamic range of bfloat16 and float32 are equivalent. However, bfloat16
uses half of the memory space. For more information about bfloat16 performance,
see A Study of BFLOAT16 for Deep Learning Training.
Use bfloat16 explicitly
While automatic format conversion in TPUs lets you avoid thinking about numerical
precision, you can achieve performance improvements by explicitly casting values
to bfloat16. There are two reasons to explicitly cast values to bfloat16:
Storing values in bfloat16 format saves on-chip memory, enabling Cloud TPUs
to train larger models or use larger batch sizes.
Some operations are memory-bandwidth-bound, which means the amount of time it
takes to load data from memory can slow down the overall time spent performing
the computation. Storing operands and outputs of those operations in bfloat16
format reduces the amount of data that must be transferred, improving overall
speed.
The format conversion from float32 to bfloat16 is automatically inserted by the
XLA compiler. On TPU, the rounding scheme in the conversion is
round to nearest even
and overflow to inf. Also, the bfloat16 on Cloud TPU does not support
subnormals, so all subnormals are flushed to zero during the conversion.
Special values, such as NaN and inf, are preserved in the conversion.
The format conversion from bfloat16 to float32 is also automatically inserted
by the XLA compiler. Since float32 can represent all exact values in bfloat16,
the conversion pads 16 zeros in the mantissa bits. Special values are
preserved in the conversion.
Checkpoints obtained from a model trained on Cloud TPUs can be deployed on other
hardware platforms (for example, inference or fine-tuning on CPUs or GPUs)
without extensive manual conversions.
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-09-04 UTC."],[],[],null,["# Improve your model's performance with bfloat16\n==============================================\n\nBy default, TPUs perform\nmatrix multiplication operations with [`bfloat16`](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format)\nvalues and accumulations with IEEE [`float32`](https://en.wikipedia.org/wiki/Single-precision_floating-point_format)\nvalues. Using reduced-precision floating point numbers decreases time to\nconvergence without losing accuracy.\n\nThe dynamic range of `bfloat16` and `float32` are equivalent. However, `bfloat16`\nuses half of the memory space. For more information about `bfloat16` performance,\nsee [A Study of BFLOAT16 for Deep Learning Training](https://arxiv.org/abs/1905.12322).\n\nUse bfloat16 explicitly\n-----------------------\n\nWhile automatic format conversion in TPUs lets you avoid thinking about numerical\nprecision, you can achieve performance improvements by explicitly casting values\nto `bfloat16`. There are two reasons to explicitly cast values to `bfloat16`:\n\n1. Storing values in `bfloat16` format saves on-chip memory, enabling Cloud TPUs\n to train larger models or use larger batch sizes.\n\n2. Some operations are memory-bandwidth-bound, which means the amount of time it\n takes to load data from memory can slow down the overall time spent performing\n the computation. Storing operands and outputs of those operations in `bfloat16`\n format reduces the amount of data that must be transferred, improving overall\n speed.\n\nTo get started, we recommend getting some experience with one of the\n[Cloud TPU reference models](/tpu/docs/tutorials/supported-models). After\nthat, the [profiling tools guide](/tpu/docs/profile-tpu-vm), and\n[troubleshooting guide](/tpu/docs/troubleshooting/troubleshooting) provide\nin-depth technical information to help you create and optimize machine learning\nmodels on your own.\n\n### Format conversion details\n\nThe format conversion from `float32` to `bfloat16` is automatically inserted by the\nXLA compiler. On TPU, the rounding scheme in the conversion is\n[round to nearest even](https://en.wikipedia.org/wiki/IEEE_754#Roundings_to_nearest)\nand overflow to `inf`. Also, the `bfloat16` on Cloud TPU does not support\nsubnormals, so all subnormals are flushed to zero during the conversion.\nSpecial values, such as `NaN` and `inf`, are preserved in the conversion.\n\nThe format conversion from `bfloat16` to `float32` is also automatically inserted\nby the XLA compiler. Since `float32` can represent all exact values in `bfloat16`,\nthe conversion pads 16 zeros in the mantissa bits. Special values are\npreserved in the conversion.\n\nCheckpoints obtained from a model trained on Cloud TPUs can be deployed on other\nhardware platforms (for example, inference or fine-tuning on CPUs or GPUs)\nwithout extensive manual conversions."]]