Cloud TPU VM User's Guide

This guide describes how to set up a Google Cloud Platform project for using Cloud TPU VMs. It describes the commands for using Cloud TPU VMs and solutions to common issues you may encounter when starting to use Cloud TPU VMs.

Cloud TPU VMs run on the TPU host machine (the machine connected to the Cloud TPU device) and offer significantly better performance and usability when working with TPUs.

If you are new to Cloud TPUs, check out the TPU beginner's guide.

The Cloud TPU VM preview introduces a new Cloud TPU architecture. The existing architecture is called TPU Nodes. For more information about the Cloud TPU Node and Cloud TPU VM architectures, see System Architecture.

Set up a GCP Project

After installing the Google Cloud SDK, install the alpha components for gcloud using the following command:

$ gcloud components install alpha

For more information about gcloud components, see Managing SDK Components.

Prepare a GCP Project

  1. Sign in to your Google Account. If you don't already have one, sign up for a new account.
  2. In the Google Cloud Console, select or create a Cloud project from the project selector page.
  3. Make sure billing is enabled for your project.
  4. Set your project ID using gcloud in Cloud Shell. The project ID is the ID shown for your project in the Cloud Console.

$ gcloud config set project project-id

Enable the Cloud TPU API

Enable the Cloud TPU API using the following gcloud command in Cloud Shell. (You may also enable it from the Google Cloud Console.)

$ gcloud services enable tpu.googleapis.com

Configure the gcloud command

Run the following commands to configure gcloud to use your Google account and GCP project.

$ gcloud config set account your-email-account
$ gcloud config set project your-project
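The setup commands from the sections above can be collected in one place. The following is a minimal sketch, not an official tool: the function names are hypothetical, and the helper only builds and prints the commands unless you explicitly ask it to run them.

```python
import shlex
import subprocess


def setup_commands(account, project_id):
    """Return the gcloud setup commands from this guide as argument lists."""
    return [
        ["gcloud", "components", "install", "alpha"],
        ["gcloud", "config", "set", "account", account],
        ["gcloud", "config", "set", "project", project_id],
        ["gcloud", "services", "enable", "tpu.googleapis.com"],
    ]


def run_setup(account, project_id, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in setup_commands(account, project_id):
        print("$ " + shlex.join(argv))
        if not dry_run:
            # check=True stops at the first failing command.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Placeholder values, as in the commands above.
    run_setup("your-email-account", "your-project")
```

Keeping each command as an argument list avoids shell quoting problems if an account or project ID contains unusual characters.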

Managing TPUs

You can manage Cloud TPU VMs using gcloud or curl. For more information, see Managing Cloud TPUs.

Creating a Cloud TPU VM with gcloud

$ gcloud alpha compute tpus tpu-vm create tpu-name \
  --zone=zone \
  --accelerator-type=v3-8 \
  --version=v2-alpha

Required fields

zone
The zone where you plan to create your Cloud TPU.
accelerator-type
The type of the Cloud TPU to create.
version
The Cloud TPU runtime version.
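If you create TPU VMs from a script, the create command can be assembled from its required fields. This is a sketch under stated assumptions: the function name is hypothetical, and the defaults simply repeat the example values shown above (v3-8 and v2-alpha).

```python
import shlex


def tpu_vm_create_command(tpu_name, zone, accelerator_type="v3-8", version="v2-alpha"):
    """Build the `gcloud alpha compute tpus tpu-vm create` command as a list."""
    for label, value in (("tpu_name", tpu_name), ("zone", zone),
                         ("accelerator_type", accelerator_type), ("version", version)):
        if not value:
            raise ValueError(f"{label} must not be empty")
    return [
        "gcloud", "alpha", "compute", "tpus", "tpu-vm", "create", tpu_name,
        f"--zone={zone}",
        f"--accelerator-type={accelerator_type}",
        f"--version={version}",
    ]


if __name__ == "__main__":
    # Prints the command so it can be reviewed before running it.
    print(shlex.join(tpu_vm_create_command("tpu-name", "us-central1-a")))
```

Pass the returned list to `subprocess.run(..., check=True)` to execute it.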

Creating a Cloud TPU VM with curl

$ curl -X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"accelerator_type": "v2-8", "runtime_version": "[runtime_version]", "network_config": {"enable_external_ips": true}}' \
  "https://tpu.googleapis.com/v2alpha1/projects/[project]/locations/[zone]/nodes?node_id=[node_name]"

Required fields

runtime_version
The runtime version you wish to use.
project
The name of your enrolled GCP project.
zone
The zone where you are creating your Cloud TPU.
node_name
The name of the TPU VM you are creating.
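The same request can be prepared in Python with only the standard library. This is a minimal sketch: the function name is hypothetical, the endpoint and body fields are copied from the curl command above, and the access token is assumed to come from `gcloud auth print-access-token`.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://tpu.googleapis.com/v2alpha1"


def build_create_request(project, zone, node_name, runtime_version,
                         access_token, accelerator_type="v2-8"):
    """Build (but do not send) the POST request that creates a TPU VM."""
    query = urllib.parse.urlencode({"node_id": node_name})
    url = f"{API_ROOT}/projects/{project}/locations/{zone}/nodes?{query}"
    body = {
        "accelerator_type": accelerator_type,
        "runtime_version": runtime_version,
        "network_config": {"enable_external_ips": True},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send the request, pass the result to `urllib.request.urlopen`. Building the request separately makes it easy to inspect the URL and body first.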

Connecting to a Cloud TPU VM

  1. (Optional) Set up a firewall rule for SSH.

The default network comes preconfigured to allow SSH access to all VMs. If you don't use the default network, or the default network settings were edited, you may need to explicitly enable SSH access by adding a firewall rule:

$ gcloud compute firewall-rules create --network=network allow-ssh --allow=tcp:22

  2. SSH into the TPU VM.

$ gcloud alpha compute tpus tpu-vm ssh tpu-name --zone zone --project project-id

Required fields

tpu-name
The name of the TPU VM to which you are connecting.
zone
The zone where your Cloud TPU was created. Currently europe-west4-a and us-central1-a are supported.
project-id
Your GCP project ID.

Optional fields

user
The username used to authenticate when connecting to the Cloud TPU VM over SSH. Specify it as a user@ prefix on the TPU name, for example: my-email-account@tpu-node-1.
worker
For Cloud TPU Pods, you can choose which worker VM to SSH into. The default is worker 0, the first VM associated with the TPU Pod.
ssh-key-file
The path to the SSH key file. By default, this is ~/.ssh/google_compute_engine.
internal-ip
Connect to the TPU VMs using an internal IP address. For this connection to work, you must configure your networks and firewall to allow SSH connections to the internal IP address of the TPU VM to which you want to connect.
command
A command to run on the TPU VM. The command is run on the target TPU VM and then exits.
tunnel-through-iap
Tunnel the SSH connection through Cloud Identity-Aware Proxy for TCP forwarding. To learn more, see [Overview of TCP forwarding](https://cloud.google.com/iap/docs/tcp-forwarding-overview).

To SSH into other TPU VMs associated with the TPU Pod, append --worker ${WORKER_NUMBER} to the command, where WORKER_NUMBER is a 0-based index.
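The required and optional fields above can be combined programmatically. The sketch below uses a hypothetical function name and assumes each optional field maps to a flag of the same name (for example, worker becomes --worker); check `gcloud alpha compute tpus tpu-vm ssh --help` before relying on it.

```python
def tpu_vm_ssh_command(tpu_name, zone, project_id, user=None, worker=None,
                       command=None, ssh_key_file=None,
                       internal_ip=False, tunnel_through_iap=False):
    """Build the `gcloud alpha compute tpus tpu-vm ssh` command as a list."""
    # The optional user is given as a user@ prefix on the TPU name.
    target = f"{user}@{tpu_name}" if user else tpu_name
    argv = ["gcloud", "alpha", "compute", "tpus", "tpu-vm", "ssh", target,
            "--zone", zone, "--project", project_id]
    if worker is not None:
        if worker < 0:
            raise ValueError("worker is a 0-based index and cannot be negative")
        argv += ["--worker", str(worker)]
    if ssh_key_file:
        argv += ["--ssh-key-file", ssh_key_file]
    if internal_ip:
        argv.append("--internal-ip")
    if tunnel_through_iap:
        argv.append("--tunnel-through-iap")
    if command:
        argv += ["--command", command]
    return argv
```

For example, `tpu_vm_ssh_command("tpu-name", "us-central1-a", "project-id", worker=1, command="hostname")` targets the second worker VM of a TPU Pod and runs a single command there.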

Listing your Cloud TPU resources

You can list all of your Cloud TPUs in a specified zone.

$ gcloud alpha compute tpus tpu-vm list --zone=zone

Required fields

zone
The zone whose Cloud TPUs you want to list.

This command lists the Cloud TPU resources in the specified zone. If no resources are currently set up, the output will just show dashes for the VM and TPU.

Retrieving information about your Cloud TPU

You can retrieve information about a specific Cloud TPU using the following command.

$ gcloud alpha compute tpus tpu-vm describe tpu-name \
  --zone=zone

Required fields

tpu-name
The name of the Cloud TPU to describe.
zone
The zone where your Cloud TPU was created.
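When scripting against the describe command, it can help to request machine-readable output with gcloud's general `--format=json` flag and pick out a few fields. The sketch below is illustrative only: the function name is hypothetical, and the field names `name`, `state`, and `acceleratorType` are assumptions about the output that you should confirm against a real response.

```python
import json


def summarize_tpu(describe_json):
    """Extract a short summary from `describe --format=json` output."""
    node = json.loads(describe_json)
    # Assumed shape: the full resource name ends with the TPU name.
    full_name = node.get("name", "")
    return {
        "name": full_name.rsplit("/", 1)[-1],
        "state": node.get("state", "UNKNOWN"),
        "accelerator_type": node.get("acceleratorType", "UNKNOWN"),
    }
```

Missing fields fall back to "UNKNOWN" rather than raising, so the helper tolerates output that differs from the assumed shape.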

Stopping your Cloud TPU resources

You can stop a single Cloud TPU using the following command. You cannot stop a TPU Pod.

$ gcloud alpha compute tpus tpu-vm stop tpu-name \
  --zone=zone

Required fields

tpu-name
The name of the Cloud TPU to stop.
zone
The zone where you created your Cloud TPU.

Starting your Cloud TPU resources

If your Cloud TPU has been stopped, you can restart it using the following command.

$ gcloud alpha compute tpus tpu-vm start tpu-name \
  --zone=zone

Required fields

tpu-name
The name of the Cloud TPU to start.
zone
The zone where the Cloud TPU was created.

Deleting your VM and Cloud TPU resources

You can delete your Cloud TPU when you are done using it.

$ gcloud alpha compute tpus tpu-vm delete tpu-name \
  --zone=zone

Required fields

tpu-name
The name of the Cloud TPU to delete.
zone
The zone where your Cloud TPU was created.
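The stop, start, and delete commands share the same shape, so a script can build them with one helper. This is a sketch with hypothetical names; the `is_pod` argument is an assumption added here to reflect the rule above that a TPU Pod cannot be stopped.

```python
LIFECYCLE_ACTIONS = ("stop", "start", "delete")


def tpu_vm_lifecycle_command(action, tpu_name, zone, is_pod=False):
    """Build a stop, start, or delete command for a Cloud TPU VM."""
    if action not in LIFECYCLE_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    if action == "stop" and is_pod:
        # This guide states that TPU Pods cannot be stopped.
        raise ValueError("a TPU Pod cannot be stopped")
    return ["gcloud", "alpha", "compute", "tpus", "tpu-vm", action,
            tpu_name, f"--zone={zone}"]
```

Rejecting the unsupported case before calling gcloud gives a clearer error than waiting for the command to fail.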

Capturing performance metrics

You can capture a performance profile using a command line script or using TensorBoard. For instructions on installing TensorBoard, see TensorBoard setup.

For TensorFlow models, you can capture profile data automatically by using the standard TensorFlow profiling callback method.

To manually capture profile data for TensorFlow models, use the following command on your TPU VM:

$ python3 -c "import tensorflow as tf; tf.profiler.experimental.client.trace('grpc://localhost:port', 'gs://model-dir', 1000)"

To capture profile data for PyTorch models using the command line, use the following command on your TPU VM:

$ python3 -c "import torch_xla.debug.profiler as xp; xp.trace('localhost:port', '/tmp/tb', 1000)"

For information about how to capture profile data for JAX models, see Profiling JAX programs.
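The two one-line commands above differ mainly in the profiler address format and the module they call. The sketch below wraps both in a small script; the function names are hypothetical, and the port is whatever port your profiler server listens on. The framework imports happen inside the capture function so the address helper works without either library installed.

```python
def profiler_service_address(port, framework):
    """Return the profiler address in the form each command above uses."""
    if not 0 < int(port) < 65536:
        raise ValueError("port must be between 1 and 65535")
    if framework == "tensorflow":
        return f"grpc://localhost:{port}"
    if framework == "pytorch":
        return f"localhost:{port}"
    raise ValueError(f"unsupported framework: {framework!r}")


def capture_profile(port, logdir, framework, duration_ms=1000):
    """Capture a profile by making the same call as the one-liners above."""
    address = profiler_service_address(port, framework)
    if framework == "tensorflow":
        import tensorflow as tf
        tf.profiler.experimental.client.trace(address, logdir, duration_ms)
    else:
        import torch_xla.debug.profiler as xp
        xp.trace(address, logdir, duration_ms)
```

The log directory can be a Cloud Storage path such as gs://model-dir for TensorFlow or a local path such as /tmp/tb for PyTorch, as in the commands above.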

Viewing profile data

  1. Open a Cloud Shell.
  2. Make sure you have installed TensorBoard.
  3. Run TensorBoard.
  4. In Cloud Shell, click the Web Preview button, select Change port, and enter 6006.
  5. Click profile. An overview page is displayed.
  6. Navigate to trace viewer under tools.
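The steps above do not show the command for running TensorBoard. A minimal sketch, assuming the profile data was written to a log directory such as /tmp/tb and that TensorBoard should listen on port 6006 to match the Web Preview step; the function name is hypothetical.

```python
import shlex


def tensorboard_command(logdir, port=6006):
    """Build a TensorBoard command that serves logdir on the given port."""
    if not logdir:
        raise ValueError("logdir must not be empty")
    return ["tensorboard", f"--logdir={logdir}", f"--port={port}"]


if __name__ == "__main__":
    # Prints the command so it can be copied into Cloud Shell.
    print(shlex.join(tensorboard_command("/tmp/tb")))
```

The port passed here must match the port entered in the Web Preview dialog.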

Request more TPU quota

Cloud TPU allocates default TPU quota for your project. If you need more, see Requesting additional quota.

Generating and viewing profile information

You can generate profile information and use TensorBoard to visualize training metrics.

gcloud setup troubleshooting

Problem

gcloud components update displays the following error message:

ERROR: (gcloud.components.update)
You cannot perform this action because the Cloud SDK component manager
is disabled for this installation.

Solution

To use gcloud with TPU VMs, you need a gcloud installation that is not managed through a package manager. Follow these steps to remove the packaged version and install gcloud from the downloadable archive:

sudo apt-get remove google-cloud-sdk
curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-311.0.0-linux-x86_64.tar.gz
tar -xzf google-cloud-sdk-311.0.0-linux-x86_64.tar.gz
./google-cloud-sdk/install.sh
source ~/.bashrc

Problem

Running any command beginning with gcloud alpha compute tpus tpu-vm displays the following information:

ERROR: (gcloud.alpha.compute.tpus) Invalid choice: 'tpu-vm'.

Solution

This happens when the component repository has not been properly updated. To verify this, run gcloud --version. The first line of the output should be "Google Cloud SDK HEAD"; if the output is different, the update did not take place. If this happens, try updating the gcloud components with the following command.

gcloud components update
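The version check described above can be automated. This is a sketch with hypothetical function names; it only compares the first line of the `gcloud --version` output against the expected string quoted in this guide.

```python
import subprocess

EXPECTED_FIRST_LINE = "Google Cloud SDK HEAD"


def components_updated(version_output):
    """Return True if the first output line matches the expected string."""
    lines = version_output.strip().splitlines()
    return bool(lines) and lines[0].strip() == EXPECTED_FIRST_LINE


def check_installed_gcloud():
    """Run `gcloud --version` and check its first line."""
    result = subprocess.run(["gcloud", "--version"],
                            capture_output=True, text=True, check=True)
    return components_updated(result.stdout)
```

If `check_installed_gcloud()` returns False, the component update did not take place.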

If you are still getting the same error, try reinstalling gcloud with the following command:

gcloud components reinstall

Problem

The gcloud alpha compute tpus tpu-vm ssh ${TPU_NAME} --zone ${ZONE} command displays the following error message:

Waiting for SSH key to propagate.
ssh: connect to host 34.91.136.59 port 22: Connection timed out
ssh: connect to host 34.91.136.59 port 22: Connection timed out
ssh: connect to host 34.91.136.59 port 22: Connection timed out
ERROR: (gcloud.alpha.compute.tpus.tpu-vm.ssh) Could not SSH into the instance.  It is possible that your SSH key has not propagated to the instance yet. Try running this command again.  If you still cannot connect, verify that the firewall and instance are set to accept ssh traffic.

Solution

Something may be wrong with the SSH key propagation. Try moving the automatically-generated keys to a backup location to force gcloud to recreate them:

mv ~/.ssh/google_compute_engine ~/.ssh/old-google_compute_engine
mv ~/.ssh/google_compute_engine.pub ~/.ssh/old-google_compute_engine.pub
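The same backup can be done from Python with a guard against overwriting an earlier backup. This is a sketch with a hypothetical function name; the file names and the old- prefix mirror the mv commands above.

```python
from pathlib import Path

KEY_NAMES = ("google_compute_engine", "google_compute_engine.pub")


def back_up_gcloud_ssh_keys(ssh_dir=None, prefix="old-"):
    """Rename the gcloud-generated SSH keys and return the new paths."""
    ssh_dir = Path(ssh_dir) if ssh_dir is not None else Path.home() / ".ssh"
    moved = []
    for name in KEY_NAMES:
        source = ssh_dir / name
        target = ssh_dir / f"{prefix}{name}"
        if not source.exists():
            continue  # Nothing to back up for this key.
        if target.exists():
            # Refuse to overwrite a backup made by an earlier run.
            raise FileExistsError(f"backup already exists: {target}")
        source.rename(target)
        moved.append(target)
    return moved
```

Unlike plain mv, the helper skips keys that are missing and stops rather than replacing an existing backup.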

FAQ

Can I use V1Alpha1 and V1 APIs to manage Cloud TPU VMs?
Get and List calls are allowed, but mutations are available only in the V2Alpha1 API version.
How do I know whether the TPUs are using Cloud TPU VMs?
Make a GetNode call on the TPU and check the APIVersion field. For a Cloud TPU VM, it is set to V2_ALPHA1.