Cloud TPU VM user's guide
This guide describes how to set up a Google Cloud project for using Cloud TPU VMs. It describes the commands for using Cloud TPU VMs and solutions to common issues you may encounter when starting to use Cloud TPU VMs.
Cloud TPU VMs run on the TPU host machine (the machine connected to the Cloud TPU device) and offer significantly better performance and usability when working with TPUs.
If you are new to Cloud TPUs, check out the Cloud TPU quickstarts.
If you plan to run on a Cloud TPU Pod with TPU VM, refer to Training on TPU Pods.
Cloud TPU VM introduced a new Cloud TPU architecture. For more information about the Cloud TPU architectures, see System Architecture.
Set up a Google Cloud Project
After installing the Google Cloud CLI, install the gcloud components using the following command:
$ gcloud components install
For more information about gcloud components, see Managing Google Cloud CLI Components.
Prepare a Google Cloud Project
Sign in to your Google Account. If you don't already have one, sign up for a new account.
In the Google Cloud console, select or create a Cloud project from the project selector page. Make sure billing is enabled for your project. Set your project ID using gcloud in the Cloud Shell. The project ID is the name of your project shown in the Google Cloud console.
$ gcloud config set project project-id
Enable the Cloud TPU API
Enable the Cloud TPU API using the following gcloud command in Cloud Shell. (You may also enable it from the Google Cloud console.)
$ gcloud services enable tpu.googleapis.com
Configure the gcloud command
Run the following commands to configure gcloud to use your Google Cloud project and install components needed for the TPU VM preview.
$ gcloud config set account your-email-account
$ gcloud config set project your-project
Managing TPUs
You can manage Cloud TPU VMs using gcloud or curl. For more information, see Managing Cloud TPUs.
Creating a Cloud TPU VM with gcloud
$ gcloud compute tpus tpu-vm create tpu-name \
--zone=zone \
--accelerator-type=v3-8 \
--version=tpu-vm-tf-2.11.0
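Required fields
tpu-name
- The name of the TPU VM you are creating.
zone
- The zone where you are creating your Cloud TPU.
accelerator-type
- The type of Cloud TPU to create.
version
- The runtime version you wish to use.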
Optional flag
shielded-secure-boot
- Specifies that the TPU instances are created with secure boot enabled. This implicitly makes them Shielded VM instances. See What is shielded VM? for more details.
When creating a TPU VM, you can specify a startup script using the --metadata startup-script flag. For example:
$ gcloud compute tpus tpu-vm create tpu-name \
--zone=zone \
--accelerator-type=v3-8 \
--version=tpu-vm-tf-2.11.0 \
--metadata startup-script=your-script
The startup script runs when the TPU VM is provisioned and again whenever the TPU VM is restarted because of a maintenance event.
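If you create TPU VMs from a script, the command above can be assembled programmatically. The following is a minimal sketch, not an official client: the function name build_create_command and its parameters are hypothetical, and it only uses the flags shown in this guide (--zone, --accelerator-type, --version, --shielded-secure-boot, and --metadata startup-script).

```python
def build_create_command(tpu_name, zone, accelerator_type="v3-8",
                         version="tpu-vm-tf-2.11.0",
                         startup_script=None, secure_boot=False):
    """Return the argv list for `gcloud compute tpus tpu-vm create`.

    Hypothetical helper: it only mirrors the flags shown in this guide.
    """
    argv = [
        "gcloud", "compute", "tpus", "tpu-vm", "create", tpu_name,
        f"--zone={zone}",
        f"--accelerator-type={accelerator_type}",
        f"--version={version}",
    ]
    if secure_boot:
        # Creates the TPU VM with secure boot enabled (Shielded VM).
        argv.append("--shielded-secure-boot")
    if startup_script is not None:
        # The script text is passed inline as the startup-script metadata value.
        argv += ["--metadata", f"startup-script={startup_script}"]
    return argv


if __name__ == "__main__":
    # Print the command instead of running it, so nothing is created by accident.
    print(" ".join(build_create_command("tpu-name", "zone",
                                        startup_script="your-script")))
```

To actually run the command, the returned list could be passed to subprocess.run(argv, check=True). Passing a list avoids shell quoting problems, but gcloud may still treat commas inside a --metadata value as separators, so a script containing commas might need different handling.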
Creating a Cloud TPU VM with curl
$ curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-d "{accelerator_type: 'v2-8', \
runtime_version:'tpu-vm-tf-2.11.0', \
network_config: {enable_external_ips: true}, \
shielded_instance_config: { enable_secure_boot: true }}" \
https://tpu.googleapis.com/v2/projects/project-id/locations/us-central1-b/nodes?node_id=node_name
Required fields
runtime_version
- The runtime version you wish to use.
project
- The name of your enrolled Google Cloud project.
zone
- The zone where you are creating your Cloud TPU.
node_name
- The name of the TPU VM you are creating.
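The URL and request body of the curl call above can also be built in Python, which makes the quoting easier to get right. This is a sketch under stated assumptions: the function name build_create_request is hypothetical, the field names are copied from the curl example, and the request is only constructed, not sent.

```python
import json
from urllib.parse import quote, urlencode

# Base URL taken from the curl example in this guide.
TPU_API_ROOT = "https://tpu.googleapis.com/v2"


def build_create_request(project_id, zone, node_name,
                         accelerator_type="v2-8",
                         runtime_version="tpu-vm-tf-2.11.0",
                         secure_boot=True):
    """Return (url, body) for the node-creation POST request.

    Hypothetical helper: it reproduces the fields of the curl example
    as strict JSON and does not contact the API.
    """
    url = (
        f"{TPU_API_ROOT}/projects/{quote(project_id, safe='')}"
        f"/locations/{quote(zone, safe='')}/nodes?"
        + urlencode({"node_id": node_name})
    )
    body = {
        "accelerator_type": accelerator_type,
        "runtime_version": runtime_version,
        "network_config": {"enable_external_ips": True},
        "shielded_instance_config": {"enable_secure_boot": secure_boot},
    }
    return url, json.dumps(body)


if __name__ == "__main__":
    url, body = build_create_request("project-id", "us-central1-b", "node_name")
    print(url)
    print(body)
```

Sending the request would additionally need the Authorization and Content-Type headers from the curl example, with the bearer token obtained from gcloud auth print-access-token.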
Connecting to a Cloud TPU VM
- (optional) Set up a firewall for SSH
The default network comes preconfigured to allow SSH access to all VMs. If you don't use the default network, or the default network settings were edited, you may need to explicitly enable SSH access by adding a firewall rule:
$ gcloud compute firewall-rules create allow-ssh --network=network --allow=tcp:22
- SSH into the TPU VM
$ gcloud compute tpus tpu-vm ssh tpu-name --zone zone --project project-id
Required fields
tpu-name
- The name of the TPU VM to which you are connecting.
zone
- The zone where your Cloud TPU was created.
project-id
- Your Google Cloud project ID.
Optional fields
user
- You can choose the username used to authenticate when connecting to the Cloud TPU VM over SSH by adding a $USER@ prefix to the TPU name, for example: my-email-account@tpu-node-1.
worker
- For Cloud TPU Pods, you can choose which worker VM to SSH into. The default is worker 0, the first VM associated with the TPU Pod.
ssh-key-file
- The path to the SSH key file. By default, this is ~/.ssh/google_compute_engine.
internal-ip
- Connect to the TPU VMs using an internal IP address. For this connection to work, you must configure your networks and firewall to allow SSH connections to the internal IP address of the TPU VM to which you want to connect.
command
- A command to run on the TPU VM. The command is run on the target TPU VM and then exits.
tunnel-through-iap
- Tunnel the SSH connection through Cloud Identity-Aware Proxy for TCP forwarding. To learn more, see Overview of TCP forwarding.
To SSH into other TPU VMs associated with the TPU Pod, append --worker ${WORKER_NUMBER} to the command, where WORKER_NUMBER is a 0-based index.
Listing your Cloud TPU resources
You can list all of your Cloud TPUs in a specified zone.
$ gcloud compute tpus tpu-vm list --zone=zone
Required fields
zone
- The zone whose Cloud TPUs you want to list.
This command lists the Cloud TPU resources in the specified zone. If no resources are currently set up, the output will just show dashes for the VM and TPU.
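For scripting, the list command can be run with gcloud's global --format=json flag and its output parsed. The following sketch makes assumptions that are not stated in this guide: that each entry carries name, acceleratorType, and state fields, and that name is a full resource path ending in the TPU name. The function name summarize_tpus is hypothetical, and you should check the fields against the output you actually get.

```python
import json


def summarize_tpus(list_output_json):
    """Summarize the JSON text printed by
    `gcloud compute tpus tpu-vm list --zone=zone --format=json`.

    The field names (name, acceleratorType, state) are assumptions
    about the output and are read defensively with .get().
    """
    text = list_output_json.strip()
    nodes = json.loads(text) if text else []
    summary = []
    for node in nodes:
        full_name = node.get("name", "")
        summary.append({
            # Keep only the last path segment, e.g. ".../nodes/my-tpu" -> "my-tpu".
            "name": full_name.rsplit("/", 1)[-1],
            "accelerator_type": node.get("acceleratorType"),
            "state": node.get("state"),
        })
    return summary


if __name__ == "__main__":
    # Illustrative input only; this is not captured gcloud output.
    sample = json.dumps([{
        "name": "projects/project-id/locations/zone/nodes/tpu-name",
        "acceleratorType": "v3-8",
        "state": "READY",
    }])
    print(summarize_tpus(sample))
```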
Retrieving information about your Cloud TPU
You can retrieve information about a specific Cloud TPU using the following command.
$ gcloud compute tpus tpu-vm describe tpu-name \
--zone=zone
Required fields
tpu-name
- The name of the Cloud TPU to describe.
zone
- The zone where your Cloud TPU was created.
Stopping your Cloud TPU resources
You can stop a single Cloud TPU using the following command. You cannot stop a TPU Pod.
$ gcloud compute tpus tpu-vm stop tpu-name \
--zone=zone
Required fields
tpu-name
- The name of the Cloud TPU to stop.
zone
- The zone where you created your Cloud TPU.
Starting your Cloud TPU resources
If your Cloud TPU has been stopped, you can restart it using the following command.
$ gcloud compute tpus tpu-vm start tpu-name --zone zone
Command flag descriptions
tpu-name
- The name of the Cloud TPU to start.
zone
- The zone where the Cloud TPU was created.
Deleting your VM and Cloud TPU resources
You can delete your Cloud TPU when you are done using it.
$ gcloud compute tpus tpu-vm delete tpu-name \
--zone=zone
Required fields
tpu-name
- The name of the Cloud TPU to delete.
zone
- The zone where your Cloud TPU was created.
Capturing performance metrics
You can capture a performance profile using a command line script or using TensorBoard. For instructions on installing TensorBoard, see TensorBoard setup.
For TensorFlow models, you can capture profile data automatically by using the standard TensorFlow profiling callback method.
To manually capture profile data for TensorFlow models, use the following command on your TPU VM:
$ python3 -c "import tensorflow as tf; tf.profiler.experimental.client.trace('grpc://localhost:port', 'gs://model-dir', 1000)"
To capture profile data for PyTorch models using the command line, use the following command on your TPU VM:
$ python3 -c "import torch_xla.debug.profiler as xp; xp.trace('localhost:port', '/tmp/tb', 1000)"
For information about how to capture profile data for JAX models, see Profiling JAX programs.
Viewing profile data
- Open a Cloud Shell
- Make sure you have installed TensorBoard
- Run TensorBoard
- From Cloud Shell, click the Web Preview button, select Change port, and type 6006.
- Click profile. An overview page is displayed.
- Navigate to trace viewer under tools.
Request More TPU quota
Cloud TPU allocates default TPU quota for your project. If you need more, see Requesting additional quota.
Generating and viewing profile information
You can generate profile information and use TensorBoard to visualize training metrics.
gcloud setup troubleshooting
- Problem
gcloud components update displays the following error message:
ERROR: (gcloud.components.update) You cannot perform this action because the gcloud CLI component manager is disabled for this installation.
- Solution
To use gcloud with TPU VM, you will need to use a gcloud installation that is not managed through a package manager. Follow these steps to install gcloud from source code:
$ sudo apt-get remove google-cloud-sdk
$ curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-311.0.0-linux-x86_64.tar.gz
$ tar -xzf google-cloud-sdk-311.0.0-linux-x86_64.tar.gz
$ ./google-cloud-sdk/install.sh
$ source ~/.bashrc
- Problem
Running any command beginning with gcloud compute tpus tpu-vm displays the following error message:
ERROR: (gcloud.compute.tpus) Invalid choice: 'tpu-vm'.
- Solution
This happens when the component repository has not been properly updated. To verify this, run gcloud --version. The first line of the output should be "Google Cloud CLI HEAD"; if the output is different, the update did not take place. If this happens, try updating the gcloud components with the following command:
$ gcloud components update
If you are still getting the same error, try reinstalling gcloud with the following command:
$ gcloud components reinstall
- Problem
The gcloud compute tpus tpu-vm ssh ${TPU_NAME} --zone ${ZONE} command displays the following error message:
Waiting for SSH key to propagate.
ssh: connect to host 34.91.136.59 port 22: Connection timed out
ssh: connect to host 34.91.136.59 port 22: Connection timed out
ssh: connect to host 34.91.136.59 port 22: Connection timed out
ERROR: (gcloud.compute.tpus.tpu-vm.ssh) Could not SSH into the instance. It is possible that your SSH key has not propagated to the instance yet. Try running this command again. If you still cannot connect, verify that the firewall and instance are set to accept ssh traffic.
- Solution
Something may be wrong with the SSH key propagation. Try moving the automatically-generated keys to a backup location to force gcloud to recreate them:
$ mv ~/.ssh/google_compute_engine ~/.ssh/old-google_compute_engine
$ mv ~/.ssh/google_compute_engine.pub ~/.ssh/old-google_compute_engine.pub
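The same key backup can be scripted. This is a minimal sketch of the two moves described in the solution, written so that a missing key file is skipped rather than raising an error. The function name back_up_gcloud_ssh_keys and the ssh_dir parameter are hypothetical; the file names and the old- prefix come from the solution above.

```python
from pathlib import Path


def back_up_gcloud_ssh_keys(ssh_dir=None):
    """Move the gcloud-generated SSH key pair to old-* backup names.

    Returns the list of backup paths that were created. Key files that
    do not exist are skipped.
    """
    ssh_dir = Path(ssh_dir) if ssh_dir else Path.home() / ".ssh"
    moved = []
    for name in ("google_compute_engine", "google_compute_engine.pub"):
        src = ssh_dir / name
        if src.exists():
            dst = ssh_dir / f"old-{name}"
            # Like mv on Linux, this replaces an existing backup with the same name.
            src.rename(dst)
            moved.append(dst)
    return moved


if __name__ == "__main__":
    for path in back_up_gcloud_ssh_keys():
        print(f"moved key to {path}")
```

After the keys are moved, running the gcloud compute tpus tpu-vm ssh command again should cause gcloud to generate a new key pair.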
FAQ
- Can I use V1Alpha1 and V1 APIs to manage Cloud TPU VMs?
- Get/List is allowed, but mutations are only available in V2Alpha1 API Version.
- How do I know whether the TPUs are using Cloud TPU VMs?
- Make a GetNode call on the TPU, set the APIVersion field to V2_ALPHA1.