Using Preemptible TPUs

A preemptible TPU is a Cloud TPU node that you can create and run at a much lower price than normal nodes. However, Cloud TPU may terminate (preempt) these nodes if it requires access to the resources for another purpose.

Creating a preemptible TPU

You can use the ctpu command or the gcloud command to create a preemptible TPU:

ctpu

To create a preemptible TPU, use the same ctpu up command as when creating a normal TPU, but add the --preemptible flag:

$ ctpu up --preemptible

Note that you need to specify the --preemptible flag each time you run ctpu up for the preemptible TPU. The command line flags and their default values apply to each command invocation individually.

gcloud

If you need a custom setup, you may decide to use the gcloud command instead of ctpu to create and manage your TPU resources. (For more about custom setups, see the setup guide.) To create a preemptible TPU, use the same gcloud compute tpus create command as when creating a normal TPU, but add the --preemptible flag:

$ gcloud compute tpus create demo-tpu \
  --range=10.240.1.0/29 \
  --version=1.9 \
  --preemptible
  

where:

  • demo-tpu is a name for identifying the TPU that you're creating.
  • --range specifies the IP address range, in CIDR notation, for the created Cloud TPU resource. It can be any /29 block within 10.240.*.*. For this example, use 10.240.1.0/29.
  • --version specifies the TensorFlow version to use with the TPU.
  • --preemptible allows Cloud TPU to preempt the TPU.
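If you script TPU creation, a small helper can build the same gcloud compute tpus create invocation and check the --range value before the CLI is called. The following is a sketch, not part of ctpu or gcloud. The function name build_tpu_create_cmd is hypothetical, and the helper assumes the range rule stated above (a /29 block inside 10.240.*.*). It only builds the argument list, which you could then pass to subprocess.run.

```python
import ipaddress

# Assumption taken from the --range description above: the block must be a /29
# inside 10.240.*.* (that is, inside 10.240.0.0/16).
ALLOWED_PARENT = ipaddress.ip_network("10.240.0.0/16")


def build_tpu_create_cmd(name, cidr, version, preemptible=True):
    """Return the argv list for `gcloud compute tpus create` (hypothetical helper)."""
    # ip_network() raises ValueError for malformed input or host bits set (e.g. 10.240.1.1/29).
    network = ipaddress.ip_network(cidr)
    if network.version != 4 or network.prefixlen != 29 or not network.subnet_of(ALLOWED_PARENT):
        raise ValueError(f"--range must be a /29 block inside {ALLOWED_PARENT}, got {cidr!r}")

    cmd = [
        "gcloud", "compute", "tpus", "create", name,
        f"--range={network}",
        f"--version={version}",
    ]
    if preemptible:
        # The only difference from the command for a normal TPU.
        cmd.append("--preemptible")
    return cmd


if __name__ == "__main__":
    print(" ".join(build_tpu_create_cmd("demo-tpu", "10.240.1.0/29", "1.9")))
    # To actually create the TPU (requires an installed, authenticated gcloud):
    # subprocess.run(build_tpu_create_cmd("demo-tpu", "10.240.1.0/29", "1.9"), check=True)
```

Keeping the flags in one function makes it harder to drop --preemptible by accident on a later run.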

The preemptible status of the TPU is independent of any preemptible status of your VM instance. See the discussion of preemptible VMs and TPUs below.

Pricing and quota for preemptible TPUs

Pricing for preemptible TPUs is significantly lower than for normal TPUs. For details, see the pricing page.

Quota for preemptible TPUs is generally higher, and is separate from the quota for normal TPUs. See the quota page.

Preemptible VMs and preemptible TPUs

As described in the quickstart guide, you need a Compute Engine virtual machine (VM) in order to connect to a TPU. Note that the preemptible status of the TPU is independent of the preemptible status of the VM. You can define your TPU as preemptible and the VM as not preemptible, or the other way round. You can also define them both as preemptible.

The most likely combination is a preemptible TPU and a non-preemptible VM. Note the following points:

  • The charges for the VM are likely to be low in relation to the charges for the TPU. The VM charges depend on the machine type you use. See the pricing page for a simple example of the relative costs.
  • Cloud TPU does not coordinate preemption of the VM and the TPU. If you define them both as preemptible, the VM and the TPU can be preempted at different times.
  • If Compute Engine preempts your VM, you are still charged for the TPU (unless the TPU is itself preempted). Note that the TPU is idle while the VM is preempted.
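A monitoring script can check for the billing situation in the last point. If the VM is no longer RUNNING while the TPU is still READY, you are paying for a TPU that is doing no work. The following is a sketch under assumptions. The helper name is hypothetical. It takes status strings you have already fetched, for example with gcloud compute instances describe and gcloud compute tpus list. It also relies on a preempted Compute Engine VM reporting the status TERMINATED.

```python
def tpu_billed_while_idle(vm_status: str, tpu_state: str) -> bool:
    """Return True when the TPU is up (and billed) but its VM is not running.

    Hypothetical helper. Assumes Compute Engine status strings such as RUNNING
    or TERMINATED (a preempted VM ends up TERMINATED) and Cloud TPU states
    such as READY or STOPPING.
    """
    vm_running = vm_status.strip().upper() == "RUNNING"
    tpu_ready = tpu_state.strip().upper() == "READY"
    return tpu_ready and not vm_running


if __name__ == "__main__":
    print(tpu_billed_while_idle("TERMINATED", "READY"))  # True: VM preempted, TPU still up
    print(tpu_billed_while_idle("RUNNING", "READY"))     # False: normal operation
```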

You can use the ctpu command or the gcloud command to define a preemptible VM:

ctpu

To create a preemptible VM, use the same ctpu up command as when creating a TPU with a normal VM, but add the --preemptible-vm flag:

$ ctpu up --preemptible-vm

Note that you need to specify the --preemptible-vm flag each time you run ctpu up for the preemptible VM. The command line flags and their default values apply to each command invocation individually.
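ctpu applies flags only to the invocation they appear on. If a wrapper script runs ctpu up more than once, it can build the argument list in one place so that neither preemptibility flag is forgotten. The following is a minimal sketch. The function ctpu_up_args is hypothetical, and it uses only the two flags described on this page.

```python
def ctpu_up_args(preemptible_tpu=False, preemptible_vm=False):
    """Build the argv for `ctpu up` (hypothetical helper).

    ctpu does not carry flags over between runs, so every call must
    repeat the preemptibility flags you want.
    """
    args = ["ctpu", "up"]
    if preemptible_tpu:
        args.append("--preemptible")     # preemptible Cloud TPU
    if preemptible_vm:
        args.append("--preemptible-vm")  # preemptible Compute Engine VM
    return args


# Keep the choices in one place and reuse them for every invocation.
SETTINGS = {"preemptible_tpu": True, "preemptible_vm": False}

if __name__ == "__main__":
    print(" ".join(ctpu_up_args(**SETTINGS)))  # ctpu up --preemptible
```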

gcloud

If you need a custom setup, you may decide to use the gcloud command instead of ctpu to create and manage your TPU resources. (For more about custom setups, see the setup guide.) To create a preemptible VM, use the same gcloud compute instances create command as when creating a normal VM, but add the --preemptible flag:

$ gcloud compute instances create tpu-demo-vm \
  --machine-type=n1-standard-2 \
  --image-project=ml-images \
  --image-family=tf-1-9 \
  --scopes=cloud-platform \
  --preemptible

where:

  • tpu-demo-vm is a name for identifying the VM instance that you're creating.
  • --machine-type=n1-standard-2 is a standard machine type with 2 virtual CPUs and 7.5 GB of memory. See the available machine types.
  • --image-project=ml-images is a shared collection of images that makes the tf-1-9 image available for your use.
  • --image-family=tf-1-9 selects the latest image in the tf-1-9 image family, which includes the required pip package for TensorFlow.
  • --scopes=cloud-platform allows the VM to access GCP APIs.
  • --preemptible allows Compute Engine to preempt the VM instance.

See the Compute Engine documentation on creating preemptible VM instances.

Detecting if a TPU has been preempted

You can use the ctpu command or the gcloud command to check whether the Cloud TPU service has preempted your TPU:

ctpu

Check the status of your TPUs:

$ ctpu status

The above command prints the details of the TPUs you've created. The printed value for TPU Preemptible indicates whether the TPU is preemptible or not. If the printed TPU State is READY, the TPU has not been preempted. If the TPU has been preempted, the state changes from READY to another state.

gcloud

List your available TPUs:

(vm)$ gcloud compute tpus list

The above command prints the details of the TPUs you've created. If the printed STATUS is READY, the TPU has not been preempted. If the TPU has been preempted, the status changes from READY to another status. For example:

NAME       ZONE           ACCELERATOR_TYPE  NETWORK_ENDPOINT   NETWORK  RANGE          STATUS
demo-tpu   us-central1-b  v2-8              10.240.1.2:8470    default  10.240.1.0/29  STOPPING
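To check for preemption from a script, you can parse the table shown above and report any TPU whose STATUS is not READY. The following is a minimal sketch. It assumes the default column layout in the example, with the header on the first line and no spaces inside any value. The name tpus_not_ready is hypothetical.

```python
def tpus_not_ready(listing: str) -> dict:
    """Map TPU name -> STATUS for every row whose STATUS is not READY.

    Parses the table printed by `gcloud compute tpus list`, assuming the first
    non-empty line is the header and no cell is empty or contains spaces.
    """
    lines = [line for line in listing.strip().splitlines() if line.strip()]
    header = lines[0].split()
    name_col = header.index("NAME")
    status_col = header.index("STATUS")
    result = {}
    for row in lines[1:]:
        fields = row.split()
        if fields[status_col] != "READY":
            result[fields[name_col]] = fields[status_col]
    return result


# The example output shown above.
SAMPLE = """\
NAME       ZONE           ACCELERATOR_TYPE  NETWORK_ENDPOINT   NETWORK  RANGE          STATUS
demo-tpu   us-central1-b  v2-8              10.240.1.2:8470    default  10.240.1.0/29  STOPPING
"""

if __name__ == "__main__":
    print(tpus_not_ready(SAMPLE))  # {'demo-tpu': 'STOPPING'}
```

Table layouts can change, and an empty cell breaks whitespace splitting. For production scripts, machine-readable output such as gcloud's --format=json is sturdier than parsing the table.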

Detecting if a VM instance has been preempted

To check whether the VM instance has been preempted, use the gcloud compute operations list command to get a list of recent system operations:

$ gcloud compute operations list

An operation type of compute.instances.preempted indicates that the VM instance was preempted. For more details, see the Compute Engine guide.
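To narrow the output to preemption events, you can ask gcloud to filter on the operation type and print only the affected resource. The following is a sketch under assumptions. It relies on the Compute Engine operation fields operationType and targetLink and on gcloud's global --filter and --format flags. The helper names are hypothetical. The parsing function is separate from the subprocess call so that you can try it without gcloud installed.

```python
import subprocess

# Assumed gcloud invocation: keep only preemption operations, print each target URL.
PREEMPTION_QUERY = [
    "gcloud", "compute", "operations", "list",
    "--filter=operationType=compute.instances.preempted",
    "--format=value(targetLink)",
]


def preempted_instances(output: str) -> list:
    """Extract instance names from output with one targetLink URL per line."""
    names = []
    for line in output.splitlines():
        line = line.strip()
        if line:
            # A targetLink ends in .../zones/<zone>/instances/<instance-name>
            names.append(line.rstrip("/").rsplit("/", 1)[-1])
    return names


def list_preempted_instances() -> list:
    """Run the query (requires an installed, authenticated gcloud)."""
    result = subprocess.run(PREEMPTION_QUERY, capture_output=True, text=True, check=True)
    return preempted_instances(result.stdout)


if __name__ == "__main__":
    # "my-project" is a hypothetical project ID used only for this offline demo.
    sample = ("https://www.googleapis.com/compute/v1/projects/my-project"
              "/zones/us-central1-b/instances/tpu-demo-vm\n")
    print(preempted_instances(sample))  # ['tpu-demo-vm']
```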

Designing your machine learning application to run on preemptible TPUs

Make sure your application is resilient to restarts of the VM and TPU: save model checkpoints regularly, and configure your application to restore the most recent checkpoint on restart.
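The save-regularly, restore-latest pattern can be sketched in plain Python. This sketch does not use TPUEstimator or the tf.train API. It writes JSON files to a hypothetical local checkpoint directory to show the control flow only. Real TensorFlow checkpoints for Cloud TPU are usually written to Cloud Storage.

```python
import json
import os
import tempfile


def save_checkpoint(ckpt_dir, step, state):
    """Write a checkpoint for `step`, then point the 'latest' marker at it."""
    os.makedirs(ckpt_dir, exist_ok=True)
    name = f"ckpt-{step}.json"
    with open(os.path.join(ckpt_dir, name), "w") as f:
        json.dump({"step": step, "state": state}, f)
    # Update the marker only after the checkpoint is fully written, via an atomic
    # rename, so a preemption mid-save leaves the previous checkpoint in effect.
    tmp = os.path.join(ckpt_dir, "latest.tmp")
    with open(tmp, "w") as f:
        f.write(name)
    os.replace(tmp, os.path.join(ckpt_dir, "latest"))


def restore_latest(ckpt_dir):
    """Return (step, state) from the most recent checkpoint, or (0, None) if there is none."""
    marker = os.path.join(ckpt_dir, "latest")
    if not os.path.exists(marker):
        return 0, None
    with open(marker) as f:
        name = f.read().strip()
    with open(os.path.join(ckpt_dir, name)) as f:
        data = json.load(f)
    return data["step"], data["state"]


def train(ckpt_dir, total_steps, save_every=10, stop_at=None):
    """Toy training loop that resumes from the latest checkpoint.

    `stop_at` simulates a preemption: the loop exits at that step without saving.
    """
    step, state = restore_latest(ckpt_dir)
    if state is None:
        state = {"loss_sum": 0.0}
    while step < total_steps:
        if step == stop_at:
            return step
        state["loss_sum"] += 1.0 / (step + 1)  # stand-in for one real training step
        step += 1
        if step % save_every == 0:
            save_checkpoint(ckpt_dir, step, state)
    save_checkpoint(ckpt_dir, step, state)
    return step


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(train(d, total_steps=35, stop_at=23))  # 23: "preempted" after the step-20 checkpoint
        print(restore_latest(d)[0])                  # 20
        print(train(d, total_steps=35))              # 35: resumed from step 20 and finished
```

The work done after the last checkpoint (steps 21 to 23 here) is lost and redone. This is the trade-off the checkpoint interval controls: saving more often costs more I/O but loses less work per preemption.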

The TensorFlow TPUEstimator API takes care of saving and restoring model checkpoints for you. If you use TPUEstimator, you don't need to write your own code to save or restore checkpoints when your TPU or VM restarts. Read more about using the TPUEstimator with Cloud TPU.

Best practice is to use the TPUEstimator with Cloud TPU as described above. However, if you want to write the checkpoint saving and restoration functionality into your model yourself, see the checkpointing resources in the TensorFlow tf.train module.
