Cloud TPU v4 user's guide

This guide describes how to set up and use a Google Cloud Platform project with the Cloud TPU v4 release and the TPU VM architecture. If you are new to Cloud TPUs, you can learn about them in the TPU System architecture documentation and the TPU beginner's guide.

Cloud TPU VMs run on a TPU host machine (the machine connected to the Cloud TPU device) and offer significantly better performance and usability than TPU Nodes when working with TPUs. The architectural differences between TPU VMs and TPU Nodes are described in the System architecture document.

This guide describes the commands used to set up and run Cloud TPU v4 applications using TPU VMs with TensorFlow, PyTorch, and JAX. It also describes solutions to common issues you might encounter when starting to use Cloud TPU v4.

This document uses gcloud commands to perform many of the tasks needed to use Cloud TPU v4. For more information on the gcloud API, see the gcloud reference.

Concepts and terminology

Quota

For v4 there is only one quota type for all TPU slice sizes. This is different from how quota works for v2 and v3 TPU types where there is a quota for a single-device TPU and a different quota for Pod TPUs. For Cloud TPU v4 slices, the default quota is 0. You must contact sales to have a quota allocated.

Accelerator type

TPU v4 has a wider range of accelerator types than previous generations. See the AcceleratorConfig section for more information.

The available accelerator types will change throughout the Preview.

Cores, devices, and chips

  • Per TPU board, there are 4 chips and 8 cores (2 cores per chip)
  • XLA combines the resources of each chip into a single virtual core. Therefore, when training, you see 4 virtual cores per device rather than 8 physical cores, as the sketch below illustrates.
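
As a quick check of this mapping, the following sketch (assuming JAX is already installed on a v4-8 TPU VM, as described in the JAX section later in this guide) shows how the four chips on the board surface as four devices:

# Minimal sketch, run on a v4-8 TPU VM with JAX installed (see Basic JAX setup
# below). Each of the 4 chips appears as one device (a single virtual core per
# chip), so device_count() reports 4 rather than 8 physical cores.
import jax

print(jax.device_count())        # expected: 4 on a v4-8
print(jax.local_device_count())  # expected: 4 (all chips are attached to this host)
print(jax.devices())             # one TPU device entry per chip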

Known Issues

  • Currently, you must use gcloud commands to create a TPU v4; they cannot be created through the console UI.
  • A subnet must be manually created in us-central2 before using a TPU v4. See Custom Network Resources for more information.
  • By default, projects have a low IN_USE_ADDRESSES limit, so this quota needs to be increased before creating large slices. Contact your sales representative to increase this quota. Small increases (for example, 8 to 16 for us-central2-b) should be approved automatically.

Set up and prepare a GCP project

To use v4 Cloud TPUs, you must first prepare a GCP project. Use the following steps to do that.

  1. Set up a GCP Project

    1. Sign in to your Google Account. If you don't already have a Google account, sign up for a new account.
    2. In the Cloud console, select or create a Cloud project from the project selector page.
    3. Make sure billing is enabled for your project.
    4. Set your project ID using the gcloud CLI. The project ID is the name of your project shown on the Cloud console.

      $ gcloud config set project project-ID
      
  2. Enable TPU API

    Enable the TPU API using the following gcloud command in Cloud Shell. (You can also enable it from the Google Cloud console.)

    $ gcloud services enable tpu.googleapis.com 
    
  3. Create a TPU service account using the following command:

    $ gcloud alpha compute tpus tpu-vm service-identity create --zone=us-central2-b
    
  4. Create a subnet for TPUs

    $ gcloud compute networks subnets create tpusubnet \
    --network=default \
    --range=10.12.0.0/20 \
    --region=us-central2 \
    --enable-private-ip-google-access
    
  5. Request allowlist

    Contact your account team to request access.

TPU setup

After setting up your project, create a Cloud TPU using the following steps.

  1. Set up Cloud TPU variables

    $ export TPU_NAME=my_tpu_vm
    $ export ZONE=us-central2-b
    $ export ACCELERATOR_TYPE=v4-8
    $ export RUNTIME_VERSION=tpu-vm-tf-2.9.1-v4
    
    • TPU_NAME: A user-assigned name of the TPU node.
    • ZONE: The location of the TPU node. Currently, only us-central2-b is supported.
    • ACCELERATOR_TYPE: See the topology section for the supported accelerator types.
    • RUNTIME_VERSION
      • If you are using JAX, use tpu-vm-v4-base
      • If you are using PyTorch, use v4-alpha
      • If you are using TensorFlow on a v4-8 TPU, use tpu-vm-tf-2.9.1-v4.
      • If you are using TensorFlow on a larger Pod slice, use tpu-vm-tf-2.9.1-pod-v4.

    The following commands show two ways to create a TPU VM.

    • Create a TPU VM using gcloud:
    $ gcloud alpha compute tpus tpu-vm create ${TPU_NAME} \
    --zone us-central2-b \
    --accelerator-type ${ACCELERATOR_TYPE} \
    --version ${RUNTIME_VERSION} \
    --subnetwork=tpusubnet
    

    Required flags

    tpu-name
    The name of the TPU VM you are creating.
    zone
    The zone where you are creating your Cloud TPU.
    accelerator-type
    See the topology section for the supported accelerator types.
    version
    The runtime version you wish to use.
    subnetwork
    The subnet you created previously.

    Optional flags

    preemptible
    Create a preemptible TPU. It may be preempted to free up resources. See preemptible TPUs for more details.
    PROJECT_ID
    The project you are using to set up the TPU.
    enable_external_ips
    When set to true, add access configs to the TPU VMs when the TPU is created. Refer to Private Google Access for more information.
    • Create a TPU VM using curl:
    curl -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      -d "{accelerator_config: {topology:'2x2x1', type:'V4_POD'}, runtime_version:'v2-nightly', network_config: {enable_external_ips: true}}" \
      "https://tpu.googleapis.com/v2alpha1/projects/[PROJECT]/locations/us-central2-b/nodes?node_id="
    
  2. SSH into the TPU VM

    $ gcloud alpha compute tpus tpu-vm ssh ${TPU_NAME} --zone=${ZONE}
    

Train ResNet with TensorFlow

You can train any TPU-compatible model with TensorFlow on a v4 Pod slice. This section shows how to train ResNet on a TPU.

Train ResNet on a v4-8 TPU

Complete the instructions in Set up and prepare a GCP project and TPU setup to set up a v4-8 slice. Then, run the following command on your TPU VM to train ResNet:

export PYTHONPATH=/usr/share/tpu/tensorflow/resnet50_keras
python3 /usr/share/tpu/tensorflow/resnet50_keras/resnet50.py --tpu=local --data=gs://cloud-tpu-test-datasets/fake_imagenet

Train ResNet on a TPU Pod slice

To train ResNet on a v4 Pod slice, you must first create a TPU v4 Pod slice. To do this, follow the instructions in TPU setup, specify a Pod type (for example, v4-32 or V4_POD-2x2x4) as the accelerator type or accelerator-config, and specify the Pod runtime version (tpu-vm-tf-2.9.1-pod-v4).

Export a required environment variable

SSH to any of the TPU workers (for example, worker 0) and export the following environment variable:

export TPU_LOAD_LIBRARY=0

Run the following commands to train the model. Substitute the TPU name you have chosen into the TPU_NAME variable.

export PYTHONPATH=/usr/share/tpu/tensorflow/resnet50_keras
export TPU_NAME=tpu-name

python3 /usr/share/tpu/tensorflow/resnet50_keras/resnet50.py --tpu=${TPU_NAME} --data=gs://cloud-tpu-test-datasets/fake_imagenet

You can check the logs of the TPU worker with:

sudo docker logs tpu-runtime

For other TF 2.x examples, you can follow TPU VM tutorials in the Cloud TPU documentation, for example, BERT on TF 2.x.

Train ML workloads with PyTorch / XLA

This section describes how to run a simple calculation using a v4-8 TPU with PyTorch / XLA. The sections that follow extend this use case to train ResNet on either a v4-8 TPU or a larger v4 Pod slice.

Set XRT TPU device configuration:

export XRT_TPU_CONFIG="localservice;0;localhost:51011"

# Set the environment variable for the number of visible devices
export TPU_NUM_DEVICES=4

# Allow libtpu to be loaded by multiple processes
export ALLOW_MULTIPLE_LIBTPU_LOAD=1

For models that have sizable, frequent allocations, memory allocation using tcmalloc significantly improves performance compared to the default malloc implementation. Therefore, tcmalloc is the default malloc used on TPU VMs. However, depending on your workload (for example, DLRM, which has very large allocations for its embedding tables), tcmalloc might cause a slowdown. In this case, you can switch back to the standard malloc by unsetting the following variable:

unset LD_PRELOAD
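
Because tcmalloc is wired in through LD_PRELOAD, you can confirm which allocator new processes will pick up by inspecting that variable. A minimal sketch, assuming the runtime image points LD_PRELOAD at a tcmalloc library:

# Minimal check: tcmalloc is injected via LD_PRELOAD on TPU VMs, so if the
# variable is unset (or does not mention tcmalloc) the standard malloc is used.
import os

preload = os.environ.get("LD_PRELOAD", "")
if "tcmalloc" in preload:
    print("tcmalloc is preloaded:", preload)
else:
    print("LD_PRELOAD does not point at tcmalloc; the standard malloc is used")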

Perform a simple calculation

Start the Python 3 interpreter:

python3

import torch
import torch_xla.core.xla_model as xm
dev = xm.xla_device()

This generates the following messages:

2021-11-24 01:11:31.092645: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2021-11-24 01:11:31.092706: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey

Copy and paste the following lines to continue:

t1 = torch.randn(3,3,device=dev)
t2 = torch.randn(3,3,device=dev)
print(t1 + t2)

This generates the following output:

tensor([[-0.3689, -1.1727,  0.6910],
        [ 0.0431,  1.0158,  1.6740],
        [-0.8026,  2.5862, -1.5649]], device='xla:1')

Use exit() or Ctrl-D (EOF) to exit the Python interpreter.

The OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op error message is expected and will be addressed in a future release.

Train ResNet on a v4-8 TPU with PyTorch / XLA

A v4-8 is the smallest supported v4 configuration. This section shows how to train ResNet on a v4-8; the next section shows how to train it on a larger v4 Pod slice.

Follow the instructions for setting up a TPU VM and perform the following steps on the TPU VM.

Export environment variables

After you create the v4-8 TPU and SSH into the TPU VM, export the following variables to the TPU VM:

(tpuvm):$ export TPU_NAME=tpu-name
(tpuvm):$ export ZONE=zone
(tpuvm):$ export XRT_TPU_CONFIG='localservice;0;localhost:51011'
(tpuvm):$ export TPU_NUM_DEVICES=4

Set the runtime version of PyTorch to 1.10

(tpuvm):$ cd /usr/share/ 
(tpuvm):$ sudo git clone -b release/1.10 --recursive https://github.com/pytorch/pytorch 
(tpuvm):$ cd pytorch/
(tpuvm):$ sudo git clone -b r1.10 --recursive https://github.com/pytorch/xla.git 
(tpuvm):$ cd xla/
(tpuvm):$ yes | sudo pip3 uninstall torch_xla 
(tpuvm):$ yes | sudo pip3 uninstall torch 
(tpuvm):$ yes | sudo pip3 uninstall torch_vision
(tpuvm):$ sudo pip3 install torch==1.10.0 
(tpuvm):$ sudo pip3 install torchvision==0.11.1 
(tpuvm):$ sudo pip3 install https://storage.googleapis.com/tpu-pytorch/wheels/tpuvm/torch_xla-1.10-cp38-cp38-linux_x86_64.whl 
(tpuvm):$ sudo mv /usr/lib/libtpu.so /tmp 
(tpuvm):$ sudo /snap/bin/gsutil cp gs://tpu-pytorch/v4_wheel/110/libtpu.so /lib/libtpu.so

Run the training

(tpuvm):$ python /usr/share/pytorch/xla/test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1

The training takes approximately 7 minutes to run and generates output similar to the following:

Epoch 1 test end 17:04:13, Accuracy=100.00
Max Accuracy: 100.00%

Train ResNet on a larger v4 Pod slice with PyTorch / XLA

The previous section specified a v4-8 configuration. This section specifies a v4-32 configuration.

Perform the following steps:

  1. Export TPU configuration variables

    $ export TPU_NAME=tpu-name
    $ export ZONE=zone
    $ export PROJECT_ID=project-id
    
  2. Create a v4-Pod with a startup script

    $ gcloud alpha compute tpus tpu-vm create ${TPU_NAME} \
      --zone ${ZONE} \
      --accelerator-type v4-32 \
      --project ${PROJECT_ID} \
      --version v4-alpha-tpuv4 \
      --metadata startup-script='#! /bin/bash 
      ulimit -S -c unlimited > /dev/null 2>&1 
      echo "CLOUDSDK_PYTHON=/usr/bin/python3" | sudo tee -a /etc/environment 
      sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.8 100
      cd /usr/share/ 
      sudo git clone -b release/1.10 --recursive https://github.com/pytorch/pytorch 
      cd pytorch/ 
      sudo git clone -b r1.10 --recursive https://github.com/pytorch/xla.git 
      cd xla/ 
      sudo /snap/bin/gsutil cp gs://tpu-pytorch/v4-libtpu-debug/v4.diff . 
      sudo git apply v4.diff
      sudo /snap/bin/gsutil cp gs://tpu-pytorch/v4-libtpu-debug/version.py torch_xla/ 
      yes | sudo pip3 uninstall torch_xla 
      yes | sudo pip3 uninstall torch_xla 
      yes | sudo pip3 uninstall torch 
      yes | sudo pip3 uninstall torch 
      yes | sudo pip3 uninstall torch_vision 
      sudo pip3 install torch==1.10.0 
      sudo pip3 install torchvision==0.11.1 
      sudo pip3 install https://storage.googleapis.com/tpu-pytorch/wheels/tpuvm/torch_xla-1.10-cp38-cp38-linux_x86_64.whl 
      sudo mv /usr/lib/libtpu.so /tmp 
      sudo /snap/bin/gsutil cp gs://tpu-pytorch/v4_wheel/110/libtpu.so /lib/libtpu.so'
    
  3. SSH to the TPU VM

    $ gcloud alpha compute tpus tpu-vm ssh ${TPU_NAME} \
    --zone ${ZONE} \
    --project ${PROJECT_ID}
    
  4. Configure SSH for Pod use

    $ gcloud compute config-ssh
    

    This command asks for a passphrase; press Enter twice for an empty passphrase.

  5. Run the training script

    $ export TPU_NAME=tpu-name
    
    $ cd /usr/share/pytorch/xla
    
    $ python3 -m torch_xla.distributed.xla_dist \
      --tpu=${TPU_NAME} \
      --restart-tpuvm-pod-server \
      -- python3 /usr/share/pytorch/xla/test/test_train_mp_imagenet.py \
      --fake_data \
      --model=resnet50 \
      --num_epochs=1 2>&1 | tee ~/logs.txt
    

    The training takes approximately 3 minutes to run and generates output similar to:

    Epoch 1 test end 15:50:56, Accuracy=100.00
    Max Accuracy: 100.00%
    

Set up and train ML workloads on JAX

To train ML workloads on a TPU v4 with JAX, create a TPU Pod slice. To do this, follow the instructions in TPU setup, specify a Pod type (for example, v4-32) as the accelerator type, and specify the Pod runtime version (tpu-vm-v4-base).

Basic JAX setup

Install JAX and jaxlib on a Cloud TPU VM:

sudo pip uninstall jax jaxlib libtpu-nightly libtpu -y
pip3 install -U pip
pip install 'jax[tpu]' -f https://storage.googleapis.com/jax-releases/libtpu_releases.html

Then start Python and verify that JAX can see the TPU devices:

python3
Python 3.6.9 (default, Jul 17 2020, 12:50:27)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import jax
>>> jax.device_count()
4
>>> jax.numpy.add(1, 1)
DeviceArray(2, dtype=int32)

At this point, you're ready to run any JAX code you please! The flax examples are a great place to start running standard ML models in JAX. For instance, to train a basic MNIST convolutional model:

# Run flax mnist example (optional)
pip install --user tensorflow-datasets==3.1.0 ml_collections clu
git clone https://github.com/google/flax.git
pip install --user -e flax
cd flax/examples/mnist
mkdir /tmp/mnist
python3 main.py --workdir=/tmp/mnist --config=configs/default.py --config.learning_rate=0.05 --config.num_epochs=5

JAX training on TPU Pods

TPU Pod slices give you access to even more networked TPU cores (see the System Architecture documentation for more information on what Pods are). The main difference when running JAX code on Pods is that a Pod includes multiple host machines. In general, you should run your JAX program on each host in the Pod, using jax.pmap to perform cross-Pod computation and communication. See the pmap documentation for more details, especially the "Multi-host platforms" section.

Pod training setup

This section shows how to set up a v4-16 Pod slice and run a small program on each Pod host.

  1. If you haven't already done so for your project, create a TPU service account using the following command:

    $ gcloud alpha compute tpus tpu-vm service-identity create --zone=us-central2-b
    
  2. If you haven't already done so for your project, create a subnet for TPUs

    $ gcloud compute networks subnets create tpusubnet \
    --network=default \
    --range=10.12.0.0/20 \
    --region=us-central2 \
    --enable-private-ip-google-access
    
  3. Set up Cloud TPU variables

    $ export TPU_NAME=gm-jax-pod
    $ export ZONE=us-central2-b
    $ export ACCELERATOR_TYPE=v4-16
    $ export RUNTIME_VERSION=tpu-vm-v4-base
    
  4. Create a TPU VM using gcloud

    $ gcloud alpha compute tpus tpu-vm create ${TPU_NAME} \
    --zone us-central2-b \
    --accelerator-type ${ACCELERATOR_TYPE} \
    --version ${RUNTIME_VERSION} \
    --subnetwork=tpusubnet
    
  5. Set up a firewall for SSH

    The default network comes preconfigured to allow SSH access to all VMs. If you don't use the default network, or the default network was edited, you may need to explicitly enable SSH access by adding a firewall-rule:

    $ gcloud compute firewall-rules create allow-ssh \
    --network=NETWORK \
    --allow=tcp:22
    
  6. Install JAX into the TPU VM

    $ gcloud alpha compute tpus tpu-vm ssh ${TPU_NAME} \
    --zone=${ZONE} --worker=all --command="pip install \
    --upgrade 'jax[tpu]>0.3.0' -f https://storage.googleapis.com/jax-releases/libtpu_releases.html"
    
  7. Set up a local example file with commands to run on the Pod.

    # The following code snippet will be run on all TPU hosts
    import jax
    
    # The total number of TPU chips in the Pod
    device_count = jax.device_count()
    
    # The number of TPU chips attached to this host
    local_device_count = jax.local_device_count()
    
    # The psum is performed over all mapped devices across the Pod
    xs = jax.numpy.ones(jax.local_device_count())
    r = jax.pmap(lambda x: jax.lax.psum(x, 'i'), axis_name='i')(xs)
    
    # Print from a single host to avoid duplicated output
    if jax.process_index() == 0:
        print('global device count:', jax.device_count())
        print('local device count:', jax.local_device_count())
        print('pmap result:', r)
    
  8. Copy the example file to all Pod hosts

    $ gcloud alpha compute tpus tpu-vm scp example.py gm-jax-pod: --worker=all --zone=us-central2-b
    
  9. Launch the example.py program on each host (TPU worker) in the Pod:

    gcloud alpha compute tpus tpu-vm ssh gm-jax-pod \
    --zone us-central2-b --worker=all --command "python3 example.py"
    

    Running example.py on all the Pod hosts generates the following output. A v4-16 slice has 8 chips split across 2 host VMs, so each host reports 4 local devices:

    SSH: Attempting to connect to worker 0...
    SSH: Attempting to connect to worker 1...
    global device count: 8
    local device count: 4
    pmap result: [8. 8. 8. 8.]
    

Manage TPUs

You can manage TPUs with the gcloud CLI or with curl calls.

Create TPUs

TPU setup shows an example of how to create a TPU v4. See the gcloud API documentation for details on the gcloud create command.
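
If you script TPU management, the same create request shown with curl in TPU setup can also be issued from Python. This is a sketch, assuming the requests library is installed and gcloud is authenticated; the project, zone, node ID, and runtime version values are placeholders you must replace:

# Sketch: create a v4-8 (2x2x1 topology) node via the REST API, mirroring the
# curl example in TPU setup. Placeholders: project, zone, node_id, runtime.
import subprocess
import requests

project = "your-project-id"
zone = "us-central2-b"
node_id = "my_tpu_vm"

# Reuse the gcloud access token, as the curl examples in this guide do.
token = subprocess.run(
    ["gcloud", "auth", "print-access-token"],
    capture_output=True, text=True, check=True,
).stdout.strip()

body = {
    "accelerator_config": {"topology": "2x2x1", "type": "V4_POD"},
    "runtime_version": "tpu-vm-v4-base",
    "network_config": {"enable_external_ips": True},
}
url = (f"https://tpu.googleapis.com/v2alpha1/"
       f"projects/{project}/locations/{zone}/nodes?node_id={node_id}")
response = requests.post(url, json=body,
                         headers={"Authorization": f"Bearer {token}"})
response.raise_for_status()
print(response.json())  # the API returns a long-running operation for the create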

Get TPU details

You can get the details of a node through TPU API requests.

Using the gcloud CLI:

$ gcloud alpha compute tpus tpu-vm describe ${TPU_NAME} \
 --zone ${ZONE} \
 --project ${PROJECT_ID}

Using a curl call:

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" https://tpu.googleapis.com/v2alpha1/projects/[PROJECT]/locations/[ZONE]/nodes/[NODE_NAME]

The response body contains an instance of Node.

List TPUs

You can get a list of Cloud TPUs through TPU API requests.

Using the gcloud CLI:

$ gcloud alpha compute tpus tpu-vm list \
--zone ${ZONE} \
--project ${PROJECT_ID}

Using a curl call:

curl -H "Authorization: Bearer $(gcloud auth print-access-token)"
https://tpu.googleapis.com/v2alpha1/projects/[PROJECT]/locations/[ZONE]/nodes/

Delete TPUs

You can delete Cloud TPUs you have created through TPU API requests.

Using the gcloud CLI:

$ gcloud alpha compute tpus tpu-vm delete ${TPU_NAME} \
--zone ${ZONE} \
--project ${PROJECT_ID}

Using a curl call:

curl -X DELETE -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json" https://tpu.googleapis.com/v2alpha1/projects/[PROJECT]/locations/[ZONE]/nodes/[NODE_NAME]

Access TPU VMs with SSH

  1. (Optional) Set up a firewall for SSH

    The default network comes preconfigured to allow SSH access to all VMs. If you don't use the default network, or the default network was edited, you may need to explicitly enable SSH access by adding a firewall-rule:

    $ gcloud compute firewall-rules create allow-ssh \
    --network=NETWORK \
    --allow=tcp:22
    
  2. SSH into the TPU VMs

    $ gcloud alpha compute tpus tpu-vm ssh ${TPU_NAME} \
    --zone ${ZONE} \
    --project ${PROJECT_ID}
    

    Required fields

    • TPU_NAME: Name of the TPU node.
    • ZONE: The location of the TPU node. Currently, only us-central2-b is supported.
    • PROJECT_ID: The project you created above.

    See the gcloud API documentation for a list of optional fields.

Use TPUs with Advanced Configs

Custom Network Resources

When you create the TPU, you can choose to specify the network and/or a subnetwork. You can do this either by submitting a gcloud command or a curl call.

To specify the network or subnetwork in the gcloud CLI, use:

--network [NETWORK] --subnetwork [SUBNETWORK]

To specify the network or subnetwork in a curl call, use:

network_config: {network: '[NETWORK]', subnet: '[SUBNETWORK]', enable_external_ips: true}

Network

You can optionally specify the network to use for the TPU. If not specified, the default network is used.

Valid network formats:

https://www.googleapis.com/compute/{version}/projects/{proj-id}/global/networks/{network}
compute/{version}/projects/{proj-id}/global/networks/{network}
compute/{version}/projects/{proj-##}/global/networks/{network}
projects/{proj-id}/global/networks/{network}
projects/{proj-##}/global/networks/{network}
global/networks/{network}
{network}

Subnetwork

You can optionally specify a subnetwork for the TPU to use. The specified subnetwork must be in the same region as the zone where the TPU runs.

Valid Formats:

https://www.googleapis.com/compute/{version}/projects/{proj-id}/regions/{region}/subnetworks/{subnetwork}
compute/{version}/projects/{proj-id}/regions/{region}/subnetworks/{subnetwork}
compute/{version}/projects/{proj-##}/regions/{region}/subnetworks/{subnetwork}
projects/{proj-id}/regions/{region}/subnetworks/{subnetwork}
projects/{proj-##}/regions/{region}/subnetworks/{subnetwork}
regions/{region}/subnetworks/{subnetwork}
{subnetwork}
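
For illustration only, here is a hypothetical helper (not part of the TPU API) that expands the most common short forms above into fully qualified URLs; the function names and the choice of compute API version are assumptions:

# Hypothetical helper: expand common short network/subnetwork forms listed
# above into fully qualified URLs. Illustrative only; not a Google API.
COMPUTE_PREFIX = "https://www.googleapis.com/compute/v1"

def expand_network(network, project):
    if network.startswith("https://"):
        return network
    if network.startswith("projects/"):
        return f"{COMPUTE_PREFIX}/{network}"
    if network.startswith("global/networks/"):
        return f"{COMPUTE_PREFIX}/projects/{project}/{network}"
    # Bare network name, e.g. "default"
    return f"{COMPUTE_PREFIX}/projects/{project}/global/networks/{network}"

def expand_subnetwork(subnetwork, project, region):
    if subnetwork.startswith("https://"):
        return subnetwork
    if subnetwork.startswith("projects/"):
        return f"{COMPUTE_PREFIX}/{subnetwork}"
    if subnetwork.startswith("regions/"):
        return f"{COMPUTE_PREFIX}/projects/{project}/{subnetwork}"
    # Bare subnetwork name, e.g. "tpusubnet"
    return f"{COMPUTE_PREFIX}/projects/{project}/regions/{region}/subnetworks/{subnetwork}"

print(expand_subnetwork("tpusubnet", "your-project-id", "us-central2"))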

Private Google Access

In order to SSH into the TPU VMs, you need to either add access configs for the TPU VMs or turn on Private Google Access for the subnetwork to which the TPU VMs are connected.

To add access configs, enable_external_ips must be set. When you create a TPU, enable_external_ips is set by default. If you want to opt out, specify the following flag:

--internal-ips

Or use a curl call:

network_config: {enable_external_ips: false}

After you have configured Private Google Access, connect to the VM via SSH.

Custom Service Account

Each TPU VM has an associated service account it uses to make API requests on your behalf. TPU VMs use this service account to access files on Cloud Storage and access other services. See Service Accounts to learn more about them.

When you create a TPU node, you can choose to specify a custom service account for the TPU VM identities. By default, the Compute Engine default service account is used. A custom service account must be in the same project that you use to create the TPU. Use the following commands to specify a custom service account.

Specify in the gcloud CLI:

--service-account=[SERVICE_ACCOUNT]

Specify using curl:

service_account: {email: '[SERVICE_ACCOUNT]'}

To use a custom service account, you need to authorize the service account for your Google Cloud Storage buckets. See Connecting to Cloud Storage Buckets for instructions.

Custom VM SSH methods

To use custom VM SSH methods, follow the instructions in Access TPU VMs with SSH.

Accelerator Configuration

During the v4 Preview, support will be added for an AcceleratorConfig flag, which enables you to specify TPU configurations in more flexible and expanded ways.

With TPU v4, types and topologies can be specified using one of two gcloud flags when creating a TPU: AcceleratorType or AcceleratorConfig. AcceleratorType is the current flag, consisting of a TPU type followed by the number of cores. For example, v3-128 specifies a TPU v3 with 128 cores. The v4 AcceleratorConfig feature offers more configuration options. See the Topology section for more information on AcceleratorConfig topology options. AcceleratorType and AcceleratorConfig will both be supported for all TPU versions (v2, v3, v4).

V4 TPU types

The v4 type must be V4_POD.

V4 Topology

The v4 topology is specified in chips (unlike AcceleratorType, which uses cores), and a v4 topology has three dimensions, for example, 4x4x4. Note that there are two cores per chip. Refer to the System Architecture document for details on the possible v4 topologies.

Large slices can be built from one or more 4x4x4 "cubes" of chips.
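
Because AcceleratorType counts cores while a topology counts chips, it can help to spell out the arithmetic. The following sketch (the function name is illustrative, not an API) converts a chip topology string into chip and core counts:

# Illustrative only: convert a v4 chip topology (e.g. "2x2x4") into chip and
# core counts, using the 2-cores-per-chip rule described above.
import math

def topology_to_counts(topology):
    dims = [int(d) for d in topology.split("x")]
    chips = math.prod(dims)
    cores = chips * 2            # 2 TensorCores per v4 chip
    return {"topology": topology, "chips": chips, "cores": cores,
            "equivalent_accelerator_type": f"v4-{cores}"}

print(topology_to_counts("2x2x1"))  # 4 chips,  8 cores   -> v4-8 (one board)
print(topology_to_counts("2x2x4"))  # 16 chips, 32 cores  -> v4-32
print(topology_to_counts("4x4x4"))  # 64 chips, 128 cores -> one full cube, v4-128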

Specify the AcceleratorType

Specify the AcceleratorType using the gcloud CLI:

$ gcloud alpha compute tpus tpu-vm create ${TPU_NAME} \
--zone ${ZONE} \
--project ${PROJECT_ID}  \
--accelerator-type ${ACCELERATOR_TYPE} \
--version ${RUNTIME_VERSION}

Other Information

Request More TPU quota

The default quota allocation for Cloud TPU v4 is zero for all projects. Request quota by contacting your sales representative, following the instructions in the quota policy.

Troubleshooting

["gcloud auth login" cannot open browser]
When running

$ gcloud auth login

gcloud attempts to open a browser window over SSH and prints a link that leads to an invalid localhost URL. Use the --no-launch-browser flag instead:

$ gcloud auth login --no-launch-browser

[Cannot SSH to TPU VM]
When running

$ gcloud alpha compute tpus tpu-vm ssh ${TPU_NAME} --zone ${ZONE}

Example error message:

Waiting for SSH key to propagate.
ssh: connect to host 34.91.136.59 port 22: Connection timed out
ssh: connect to host 34.91.136.59 port 22: Connection timed out
ssh: connect to host 34.91.136.59 port 22: Connection timed out
ERROR: (gcloud.alpha.compute.tpus.tpu-vm.ssh) Could not SSH into the instance.
It is possible that your SSH key has not propagated to the instance yet.
Try running this command again.  If you still cannot connect, verify that
the firewall and instance are set to
accept SSH traffic.

Something might be wrong with the SSH key propagation. Try moving the automatically-generated keys to a backup location to force gcloud to recreate them:

mv ~/.ssh/google_compute_engine ~/.ssh/old-google_compute_engine
mv ~/.ssh/google_compute_engine.pub ~/.ssh/old-google_compute_engine.pub

Clean up

You should delete your TPUs when they are no longer needed. Follow the TPU deletion instructions to delete your TPU.

Q/A

JAX jobs on Pods
We're working on tools and recommendations for orchestrating JAX jobs on Pods, but we'd also like it to be possible for users to bring their own if their lab already uses a multi-machine job scheduler or cluster manager (e.g., SLURM or Kubernetes).

Can I use V1Alpha1 and V1 APIs to manage direct-access Cloud TPU VMs?
Get/List is allowed, but mutations are only available in the V2Alpha1 API Version.

How do I request help?
Contact cloud-tpu-support@google.com and include the following information:

  • Your GCP project ID
  • Your TPU node name, if one exists
  • Other information you want to provide