The software version of the framework running on your TPU must match the version running on your local VM. This software version can now be switched on a running Cloud TPU, without deleting and recreating the TPU. This also enables configuring the Cloud TPU with specific nightly versions of software frameworks. It is still recommended to select a supported version of these frameworks.
The recommended way to switch versions is to use the cloud-tpu-client Python library.
Example usage for TensorFlow:
import tensorflow as tf
from cloud_tpu_client import Client
c = Client()
c.configure_tpu_version(tf.__version__, restart_type='ifNeeded')
This configures the Cloud TPU to match the TensorFlow version running on your local VM. This works for official releases as well as dated nightly builds.
The restart_type parameter of the configure_tpu_version API defines the TPU restart behavior when switching versions. Options are 'always' (the default) and 'ifNeeded'.
'always' can be used to fix a TPU that, for example, has status UNHEALTHY_TENSORFLOW or is returning Out of Memory (OOM) errors due to resources leaked by a previous run. When this option is set, the TPU is restarted even if no new framework version is installed.
'ifNeeded' can be useful because it does not restart the runtime if it is already on the right version, so it will not add any significant startup time to a training script. When this option is set, the TPU is only restarted if it does not have the correct framework version installed.
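The choice between the two options can be sketched as a small helper. This is only an illustration and not part of cloud-tpu-client: the name choose_restart_type and its parameters are hypothetical, and how you obtain the TPU status or detect an OOM error is left to the caller.

```python
from typing import Optional

# Status named in the text above as one that a forced restart can fix.
# Treating it as the only such status is an assumption of this sketch.
RESTART_FIXABLE_STATUSES = {"UNHEALTHY_TENSORFLOW"}


def choose_restart_type(tpu_status: Optional[str] = None,
                        saw_oom: bool = False) -> str:
    """Pick a restart_type value for configure_tpu_version (hypothetical helper).

    Returns 'always' when a restart is wanted to clear bad state,
    otherwise 'ifNeeded' so a TPU already on the right version is left running.
    """
    if saw_oom or tpu_status in RESTART_FIXABLE_STATUSES:
        return "always"
    return "ifNeeded"
```

The returned string can then be passed as the restart_type argument of configure_tpu_version.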
The library communicates directly with the Cloud TPU, so this code must be run in a VM in the same network. It is recommended to run it as part of the same script as the rest of your model code.
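One way to run the version switch as part of your model code is to wrap it in a setup function that is called before any training starts. The sketch below makes several assumptions: the function name sync_tpu_version and its client parameter are hypothetical, the cloud-tpu-client package is assumed to be installed on the VM, and only the Client(), configure_tpu_version() and wait_for_healthy() calls that appear in this document are used.

```python
def sync_tpu_version(version, restart_type="ifNeeded", client=None):
    """Configure the Cloud TPU to run `version` and wait until it is healthy.

    Hypothetical helper: `client` lets callers pass an existing Client
    (or a stand-in for testing); by default a new Client is created.
    """
    if restart_type not in ("always", "ifNeeded"):
        raise ValueError("restart_type must be 'always' or 'ifNeeded'")

    if client is None:
        # Imported lazily so this module loads even where the
        # cloud-tpu-client package is not installed.
        from cloud_tpu_client import Client
        client = Client()

    # Switch the TPU software version; with 'ifNeeded' the runtime is only
    # restarted when it is not already on the requested version.
    client.configure_tpu_version(version, restart_type=restart_type)
    # Block until the TPU is usable again before building the model.
    client.wait_for_healthy()
    return client


# Example call at the top of a TensorFlow training script (not run here):
#   import tensorflow as tf
#   sync_tpu_version(tf.__version__)
```

Calling the helper first means the rest of the script can assume the TPU already matches the local framework version.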
Additional software options
TensorFlow includes a tf.__version__ string, which is the simplest way to configure the correct version. Other software options include:
- PyTorch -
- Jax -
For example, to configure a TPU to run with the latest nightly build of PyTorch:
from cloud_tpu_client import Client
c = Client()
c.configure_tpu_version('pytorch-nightly', restart_type='ifNeeded')
c.wait_for_healthy()