Enabling TPUs

This page explains how to enable and disable Cloud TPU support in your existing Google Kubernetes Engine (GKE) clusters without migrating your workloads.

Before you begin

Before you start, make sure you have performed the following tasks:

Set up default gcloud settings using one of the following methods:

  • Using gcloud init, if you want to be walked through setting defaults.
  • Using gcloud config, to individually set your project ID, zone, and region.

Using gcloud init

If you receive the error One of [--zone, --region] must be supplied: Please specify location, complete this section.

  1. Run gcloud init and follow the directions:

    gcloud init

    If you are using SSH on a remote server, use the --console-only flag to prevent the command from launching a browser:

    gcloud init --console-only
  2. Follow the instructions to authorize gcloud to use your Google Cloud account.
  3. Create a new configuration or select an existing one.
  4. Choose a Google Cloud project.
  5. Choose a default Compute Engine zone.

Using gcloud config

  • Set your default project ID:
    gcloud config set project [PROJECT_ID]
  • If you are working with zonal clusters, set your default compute zone:
    gcloud config set compute/zone [COMPUTE_ZONE]
  • If you are working with regional clusters, set your default compute region:
    gcloud config set compute/region [COMPUTE_REGION]
  • Update gcloud to the latest version:
    gcloud components update

Enable the following APIs on the Google Cloud Console:

  • Google Kubernetes Engine API
  • Cloud TPU API
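Alternatively, you can enable both APIs from the command line; container.googleapis.com and tpu.googleapis.com are the service names for these two APIs:

gcloud services enable container.googleapis.com tpu.googleapis.com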

Requirements and limitations

  • Your cluster must be a VPC-native cluster.
  • Your GKE cluster version must be 1.13.4-gke.5 or later.
  • Your GKE cluster and node pools must be in a zone where Cloud TPU is available. Additionally, the Cloud Storage buckets that hold your training data and models must be in the same region as your GKE cluster. A quick way to check the first two requirements is sketched after the zone tables below.

    The following zones are available:

    US

    TPU type (v2)   TPU v2 cores   Total TPU memory   Zone(s)
    v2-8            8              64 GiB             us-central1-b, us-central1-c, us-central1-f
    v2-32           32             256 GiB            us-central1-a
    v2-128          128            1 TiB              us-central1-a
    v2-256          256            2 TiB              us-central1-a
    v2-512          512            4 TiB              us-central1-a

    TPU type (v3)   TPU v3 cores   Total TPU memory   Zone(s)
    v3-8            8              128 GiB            us-central1-a, us-central1-b, us-central1-f

    Europe

    TPU type (v2)   TPU v2 cores   Total TPU memory   Zone(s)
    v2-8            8              64 GiB             europe-west4-a
    v2-32           32             256 GiB            europe-west4-a
    v2-128          128            1 TiB              europe-west4-a
    v2-256          256            2 TiB              europe-west4-a
    v2-512          512            4 TiB              europe-west4-a

    TPU type (v3)   TPU v3 cores   Total TPU memory   Zone(s)
    v3-8            8              128 GiB            europe-west4-a
    v3-32           32             512 GiB            europe-west4-a
    v3-64           64             1 TiB              europe-west4-a
    v3-128          128            2 TiB              europe-west4-a
    v3-256          256            4 TiB              europe-west4-a
    v3-512          512            8 TiB              europe-west4-a
    v3-1024         1024           16 TiB             europe-west4-a
    v3-2048         2048           32 TiB             europe-west4-a

    Asia Pacific

    TPU type (v2)   TPU v2 cores   Total TPU memory   Zone(s)
    v2-8            8              64 GiB             asia-east1-c
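
    To check the first two requirements from the command line, you can read the relevant fields back from your cluster. This is a minimal sketch; it assumes the standard ipAllocationPolicy.useIpAliases and currentMasterVersion fields of the clusters describe output, where [CLUSTER_NAME] is the name of your cluster:

    # Prints "True" for VPC-native clusters (alias IP ranges enabled):
    gcloud container clusters describe [CLUSTER_NAME] \
        --format="value(ipAllocationPolicy.useIpAliases)"

    # Prints the cluster version, which must be 1.13.4-gke.5 or later:
    gcloud container clusters describe [CLUSTER_NAME] \
        --format="value(currentMasterVersion)"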

Enabling Cloud TPU

gcloud

Enable Cloud TPU support in your cluster:

gcloud beta container clusters update [CLUSTER_NAME] --enable-tpu

where [CLUSTER_NAME] is the name of your cluster.

Update the kubeconfig entry:

gcloud container clusters get-credentials [CLUSTER_NAME]

where [CLUSTER_NAME] is the name of your cluster.
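To confirm that TPU support is now enabled, you can read the cluster's TPU setting back. This is a sketch that assumes the enableTpu field name from the beta clusters describe output:

gcloud beta container clusters describe [CLUSTER_NAME] \
  --format="value(enableTpu)"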

Setting a custom CIDR range

By default, GKE allocates a /20 CIDR block (4,096 addresses) for the TPUs provisioned by the cluster.

gcloud

To specify a custom CIDR range for the TPUs:

gcloud beta container clusters update [CLUSTER_NAME] \
  --enable-tpu \
  --tpu-ipv4-cidr 10.100.0.0/20

where 10.100.0.0/20 is your custom CIDR range.
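To verify which range the cluster is actually using, you can read the TPU CIDR block back. This is a sketch that assumes the tpuIpv4CidrBlock field name from the beta clusters describe output:

gcloud beta container clusters describe [CLUSTER_NAME] \
  --format="value(tpuIpv4CidrBlock)"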

Viewing operations

Enabling Cloud TPU support starts an update operation. For zonal clusters this operation takes around 5 minutes; for regional clusters it takes roughly 15 minutes, depending on the cluster's region.

gcloud

To list every running and completed operation in your cluster, run:

gcloud container operations list

To get more information about a specific operation, run:

gcloud container operations describe [OPERATION_ID]

where [OPERATION_ID] is the ID of the operation.
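Instead of polling the list, you can block until a specific operation finishes by using the wait command:

gcloud container operations wait [OPERATION_ID]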

Disabling Cloud TPU

gcloud

Verify that none of your workloads are using Cloud TPU:

kubectl get tpu
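Note that kubectl get tpu only queries the current namespace. To be thorough, you can check every namespace with the standard --all-namespaces flag (this assumes the tpu resource is namespaced, which may depend on your cluster version):

kubectl get tpu --all-namespaces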

Disable Cloud TPU support in your cluster:

gcloud beta container clusters update [CLUSTER_NAME] --no-enable-tpu

where [CLUSTER_NAME] is the name of your cluster.

For zonal clusters this operation takes around 5 minutes; for regional clusters it takes roughly 15 minutes, depending on the cluster's region.

Once the operation completes with no errors, you can verify that the TPUs provisioned by the cluster have been removed:

gcloud compute tpus list

The names of TPUs created by GKE have the format: gke-[CLUSTER_NAME]-[CLUSTER_ID]-tpu-[TPU_ID]

If any TPUs appear, you can manually delete them by running:

gcloud compute tpus delete gke-[CLUSTER_NAME]-[CLUSTER_ID]-tpu-[TPU_ID]

where:

  • [CLUSTER_NAME] is your cluster's name.
  • [CLUSTER_ID] is your cluster's ID.
  • [TPU_ID] is the TPU's ID.
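
Note that gcloud compute tpus delete acts on a single zone; if you have not set a default compute zone, pass it explicitly with the --zone flag. Here [ZONE] is a placeholder for the zone of the TPU node:

gcloud compute tpus delete gke-[CLUSTER_NAME]-[CLUSTER_ID]-tpu-[TPU_ID] \
  --zone=[ZONE]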

What's next?