Run a calculation on a Cloud TPU VM using PyTorch

This quickstart shows you how to create a Cloud TPU, install PyTorch, and run a simple calculation on a Cloud TPU. For a more in-depth tutorial showing you how to train a model on a Cloud TPU, see one of the Cloud TPU PyTorch Tutorials.

Before you begin

Before you follow this quickstart, you must create a Google Cloud Platform account, install the Google Cloud CLI, and configure the gcloud command. For more information, see Set up an account and a Cloud TPU project.

Create a Cloud TPU with gcloud

Launch a Compute Engine VM and Cloud TPU using the gcloud command. The command you use depends on whether you are using a TPU VM or a TPU Node. For more information on the two VM architectures, see System Architecture. For more information on the gcloud command, see the gcloud Reference.

TPU VM

To create a TPU VM in the default user project, network, and compute/zone, run:

$ gcloud compute tpus tpu-vm create tpu-name \
--zone=us-central2-b \
--accelerator-type=v3-8 \
--version=tpu-vm-pt-2.0

Command flag descriptions

zone
The zone where you plan to create your Cloud TPU.
accelerator-type
The type of the Cloud TPU to create.
version
The Cloud TPU software version.

When creating your TPU, you can pass the optional --network and --subnetwork flags to specify the network and subnetwork to use. If you do not want to use the default network, you must pass the --network flag. The --subnetwork flag is optional and can be used to specify a default subnetwork for whatever network you are using (default or user-specified). See the gcloud API reference page for details on these flags.
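
For example, a command of the following form creates a TPU VM on a user-specified network. This is a minimal sketch: the network name my-network and subnetwork name my-subnetwork are placeholders for resources that would already exist in your project.

$ gcloud compute tpus tpu-vm create tpu-name \
--zone=us-central2-b \
--accelerator-type=v3-8 \
--version=tpu-vm-pt-2.0 \
--network=my-network \
--subnetwork=my-subnetwork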

TPU Node

  1. When creating a TPU Node for PyTorch, you first create a Compute Engine VM instance.

    gcloud compute instances create tpu-name \
    --zone=us-central2-b  \
    --machine-type=n1-standard-16  \
    --image-family=torch-xla \
    --image-project=ml-images  \
    --boot-disk-size=200GB \
    --scopes=https://www.googleapis.com/auth/cloud-platform

    Command flag descriptions

    zone
    The zone where you plan to create your Compute Engine VM and Cloud TPU.
    machine-type
    The machine type of the Compute Engine VM to create.
    image-family
    The image family of the VM image to use.
    image-project
    The project that owns the VM image.
    boot-disk-size
    The size of the boot disk in GB of the VM created by the gcloud command.
    scopes
    The access scopes granted to the VM.
  2. Next, create the TPU instance.

    gcloud compute tpus create tpu-name \
    --zone=us-central2-b \
    --network=default \
    --version=pytorch-1.11 \
    --accelerator-type=v3-8

    Command flag descriptions

    name
    The name of the Cloud TPU to create.
    zone
    The zone where you plan to create your Cloud TPU.
    network
    The network to which the TPU is connected.
    version
    The version of PyTorch gcloud installs on the VM.
    accelerator-type
    The type of the Cloud TPU to create.

Connect to your Cloud TPU VM

TPU VM

$ gcloud compute tpus tpu-vm ssh tpu-name \
  --zone=us-central2-b

TPU Node

$ gcloud compute ssh tpu-name --zone=us-central2-b

Set TPU runtime configuration

TPU VM

Configure the Torch-XLA environment.

There are two PyTorch/XLA runtime options: PJRT and XRT. We recommend you use PJRT unless you have a reason to use XRT. To learn more about the different runtime configurations for PyTorch/XLA, see the PJRT runtime documentation.

PJRT

(vm) $ export PJRT_DEVICE=TPU

XRT (Legacy)

(vm) $ export XRT_TPU_CONFIG="localservice;0;localhost:51011"
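
As an optional sanity check (not part of the original steps, and applicable to either runtime option), you can confirm that PyTorch/XLA can see the TPU device from Python:

(vm) $ python3 -c "import torch_xla.core.xla_model as xm; print(xm.xla_device())"

The command prints the XLA device, for example xla:0.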

TPU Node

  1. Find the IP address of the TPU Node.

    (vm)$ gcloud compute tpus describe \
    tpu-name \
    --zone=us-central2-b
  2. Configure the Torch-XLA environment. Make sure to replace your-tpu-ip-address with the IP address of your TPU.

    (vm)$ conda activate torch-xla-1.11
    (vm)$ export TPU_IP_ADDRESS=your-tpu-ip-address
    (vm)$ export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
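
    As an optional check (not part of the original steps), you can confirm that PyTorch/XLA can reach the TPU through the XRT configuration:

    (vm)$ python3 -c "import torch_xla.core.xla_model as xm; print(xm.xla_device())"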

Perform a simple calculation

  1. Create a file named tpu-test.py in the current directory and copy and paste the following script into it.

    import torch
    import torch_xla.core.xla_model as xm

    # Acquire the XLA device that represents the TPU.
    dev = xm.xla_device()

    # Create two random 3x3 tensors directly on the TPU and add them.
    t1 = torch.randn(3, 3, device=dev)
    t2 = torch.randn(3, 3, device=dev)
    print(t1 + t2)
    
  2. Run the script:

      (vm)$ python3 tpu-test.py

    Output from the script shows the result of the computation:

    tensor([[-0.2121,  1.5589, -0.6951],
            [-0.7886, -0.2022,  0.9242],
            [ 0.8555, -1.8698,  1.4333]], device='xla:1')
    
    The end of the script output might also include messages similar to the following, which you can safely ignore:

    OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
    OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
    
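
If you want to experiment further, the following sketch extends the script above (it is not part of the official quickstart): it multiplies two matrices on the TPU and copies the result back to the CPU as a regular tensor.

import torch
import torch_xla.core.xla_model as xm

# Acquire the XLA device that represents the TPU.
dev = xm.xla_device()

# Run a matrix multiplication on the TPU.
a = torch.randn(128, 128, device=dev)
b = torch.randn(128, 128, device=dev)
c = torch.matmul(a, b)

# Copy the result back to the CPU as a regular torch.Tensor and print its shape.
print(c.cpu().shape)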

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.

  1. Disconnect from the Compute Engine instance, if you have not already done so:

    (vm)$ exit
    

    Your prompt should now be username@projectname, showing you are in the Cloud Shell.

  2. Delete your Cloud TPU.

    TPU VM

    $ gcloud compute tpus tpu-vm delete tpu-name \
    --zone=us-central2-b

    TPU Node

    $ gcloud compute tpus execution-groups delete tpu-name \
    --zone=us-central2-b

    The output of this command should confirm that your TPU has been deleted.
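
    Optionally, you can confirm the deletion by listing the TPUs remaining in the zone; the output should no longer include your TPU. The command below is the TPU VM form of the check.

    $ gcloud compute tpus tpu-vm list --zone=us-central2-b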

What's next

Read more about Cloud TPU VMs: