Setting up Cloud TPU on GKE

This page is a quick guide to setting up Cloud TPU with Google Kubernetes Engine (GKE). If you're looking for a detailed walkthrough, follow the tutorial, which shows you how to train the TensorFlow ResNet-50 model using Cloud TPU and GKE.

Why use GKE?

GKE offers Kubernetes clusters as a managed service.

  • Easier setup and management: When you use Cloud TPU, you need a Compute Engine VM to run your workload, and a Classless Inter-Domain Routing (CIDR) block for Cloud TPU. GKE sets up and manages the VM and the CIDR block for you.

  • Optimized cost: GKE scales your VMs and Cloud TPU nodes automatically based on workloads and traffic. You only pay for Cloud TPU and the VM when you run workloads on them.

  • Flexible usage: It's a one-line change in your Pod spec to request a different hardware accelerator (CPU, GPU, or TPU):

    kind: Pod
    spec:
      containers:
      - name: example-container
        resources:
          limits:
            cloud-tpus.google.com/v2: 8
            # See the line above for TPU, or below for CPU / GPU.
            # cpu: 2
            # nvidia.com/gpu: 1
    
  • Scalability: GKE provides APIs (Job and Deployment) that can easily scale to hundreds of Pods and Cloud TPU nodes.

  • Fault tolerance: GKE's Job API, combined with the TensorFlow checkpoint mechanism, provides run-to-completion semantics. If failures occur on the VM instances or Cloud TPU nodes, your training jobs automatically rerun and resume from the latest checkpoint, as in the sketch below.
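
    A minimal sketch of this pattern (the Job name and image are placeholders, not part of this guide): a Kubernetes Job with a backoffLimit creates replacement Pods on failure, and the training script resumes from its latest checkpoint.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: example-training-job                     # placeholder name
    spec:
      backoffLimit: 4                                # create up to 4 replacement Pods on failure
      template:
        spec:
          restartPolicy: Never                       # let the Job controller create a fresh Pod
          containers:
          - name: example-container
            image: gcr.io/YOUR-PROJECT/your-model:v1 # placeholder image
            # On restart, the training script reloads the latest TensorFlow
            # checkpoint from its model directory and continues training.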

Requirements and limitations

Note the following when defining your configuration:

  • You must use GKE version 1.10.4-gke.2 or above. Specify the version when creating your cluster by adding the --cluster-version parameter to the gcloud beta container clusters create command, as described below. For more information about versions, see the SDK documentation.
  • You must use TensorFlow 1.11 or above. You should specify the TensorFlow version used by the Cloud TPU in your Kubernetes Pod spec, as described below.
  • You must create your GKE cluster and node pools in a zone where Cloud TPU is available. The following zones are available:

    US

    Cloud TPU v2 and Preemptible v2    us-central1-b, us-central1-c, us-central1-f (TFRC program only)
    Cloud TPU v3 and Preemptible v3    us-central1-a, us-central1-b, us-central1-f (TFRC program only)
    Cloud TPU v2 Pod (alpha)           us-central1-a

    Europe

    Cloud TPU v2 and Preemptible v2    europe-west4-a
    Cloud TPU v3 and Preemptible v3    europe-west4-a
    Cloud TPU v2 Pod (alpha)           europe-west4-a

    Asia Pacific

    Cloud TPU v2 and Preemptible v2    asia-east1-c
  • Each container can request at most one Cloud TPU, but multiple containers in a Pod can request a Cloud TPU each.
  • Cluster Autoscaler supports Cloud TPU on GKE 1.11.4-gke.12 and above; a sketch of enabling it follows this list.
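
    As a sketch only (the autoscaling bounds are illustrative, and YOUR-CLUSTER is a placeholder; pick a cluster version at or above 1.11.4-gke.12), autoscaling can be enabled when you create the cluster:

    $ gcloud beta container clusters create YOUR-CLUSTER \
      --cluster-version=1.11 \
      --scopes=cloud-platform \
      --enable-ip-alias \
      --enable-tpu \
      --enable-autoscaling --min-nodes=1 --max-nodes=5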

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a Google Cloud Platform project.

    Go to the Manage resources page

  3. Make sure that billing is enabled for your Google Cloud Platform project.

    Learn how to enable billing

  4. When you use Cloud TPU with GKE, your project uses billable components of Google Cloud Platform. Check the Cloud TPU pricing and the GKE pricing to estimate your costs, and follow the instructions to clean up resources when you've finished with them.

  5. Enable the following APIs on the GCP Console: the Google Kubernetes Engine API and the Cloud TPU API.

Create a Cloud Storage bucket

You need a Cloud Storage bucket to store the results of training your machine learning model.

  1. Go to the Cloud Storage page on the GCP Console.

    Go to the Cloud Storage page

  2. Create a new bucket, specifying the following options:

    • A unique name of your choosing.
    • Default storage class: Regional
    • Location: us-central1
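
If you prefer the command line, a bucket with these settings can also be created with gsutil (the bucket name below is a placeholder; replace it with a globally unique name):

    $ gsutil mb -c regional -l us-central1 gs://YOUR-UNIQUE-BUCKET-NAME/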

Authorize Cloud TPU to access your Cloud Storage bucket

You need to give your Cloud TPU read/write access to your Cloud Storage objects. To do that, you must grant the required access to the service account used by the Cloud TPU. Follow the guide to grant access to your storage bucket.
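
As a rough sketch only (the recommended role is described in the linked guide; the project number and bucket name below are placeholders), granting the Cloud TPU service account access with gsutil looks like this:

    # The Cloud TPU service account typically has the form
    # service-YOUR_PROJECT_NUMBER@cloud-tpu.iam.gserviceaccount.com.
    # roles/storage.objectAdmin grants read/write access to objects; check
    # the linked guide for the role it recommends.
    $ gsutil iam ch \
        serviceAccount:service-YOUR_PROJECT_NUMBER@cloud-tpu.iam.gserviceaccount.com:roles/storage.objectAdmin \
        gs://YOUR-BUCKET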

Create a GKE cluster with Cloud TPU support

You can create a GKE cluster on the GCP Console or using the gcloud command-line tool. Select an option below to see the relevant instructions:

console

Follow these instructions to create a GKE cluster with Cloud TPU support:

  1. Go to the GKE page on the GCP Console.

    Go to the GKE page

  2. Click Create cluster.

  3. Specify a Name for your cluster. The name must be unique within the project and zone.

  4. For the Location type, select zonal and then specify the zone where you plan to use a Cloud TPU resource from the Zone pulldown menu. For example, select the us-central1-b zone.

    Cloud TPU is available in the following zones:

    US

    Cloud TPU v2 and Preemptible v2    us-central1-b, us-central1-c, us-central1-f (TFRC program only)
    Cloud TPU v3 and Preemptible v3    us-central1-a, us-central1-b, us-central1-f (TFRC program only)
    Cloud TPU v2 Pod (alpha)           us-central1-a

    Europe

    Cloud TPU v2 and Preemptible v2    europe-west4-a
    Cloud TPU v3 and Preemptible v3    europe-west4-a
    Cloud TPU v2 Pod (alpha)           europe-west4-a

    Asia Pacific

    Cloud TPU v2 and Preemptible v2    asia-east1-c
  5. Ensure that the Master version is set to 1.10.4-gke.2 or later, to allow support for Cloud TPU.

  6. Under Node pools, click Advanced edit.

  7. On the Advanced edit page, under Access scopes, click Allow full access to all Cloud APIs. This ensures that all nodes in the cluster have access to your Cloud Storage bucket. The cluster and the storage bucket must be in the same project for this to work. Note that Pods by default inherit the scopes of the nodes to which they are deployed. If you want to limit access on a per-Pod basis, see the GKE guide to authenticating with service accounts. Click Save to return to the Create a Kubernetes cluster page.

  8. On the Create a Kubernetes cluster page, scroll down to the bottom of the page and click Advanced options.

  9. On the Advanced options page, enable VPC-native (using alias IP).

  10. Enable Cloud TPU (beta).

  11. Configure the remaining options for your cluster as desired. You can leave them at their default values.

  12. Click Create.

  13. Connect to the cluster. You can do this by selecting your cluster from the Console Kubernetes clusters page and clicking the CONNECT button. This displays the gcloud command to run in Cloud Shell to connect.
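
    The command shown by the CONNECT button is typically of the following form (the cluster name, zone, and project are placeholders matching your own cluster):

    $ gcloud container clusters get-credentials YOUR-CLUSTER \
      --zone us-central1-b --project YOUR-CLOUD-PROJECT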

gcloud

Follow the instructions below to set up your environment and create a GKE cluster with Cloud TPU support, using the gcloud command-line tool:

  1. Install the gcloud beta components, which you need for running the Beta version of GKE with Cloud TPU:

    $ gcloud components install beta
  2. Specify your GCP project:

    $ gcloud config set project YOUR-CLOUD-PROJECT
    

    where YOUR-CLOUD-PROJECT is the name of your GCP project.

  3. Specify the zone where you plan to use a Cloud TPU resource. For this example, use the us-central1-b zone:

    $ gcloud config set compute/zone us-central1-b
    

    Cloud TPU is available in the following zones:

    US

    Cloud TPU v2 and Preemptible v2    us-central1-b, us-central1-c, us-central1-f (TFRC program only)
    Cloud TPU v3 and Preemptible v3    us-central1-a, us-central1-b, us-central1-f (TFRC program only)
    Cloud TPU v2 Pod (alpha)           us-central1-a

    Europe

    Cloud TPU v2 and Preemptible v2    europe-west4-a
    Cloud TPU v3 and Preemptible v3    europe-west4-a
    Cloud TPU v2 Pod (alpha)           europe-west4-a

    Asia Pacific

    Cloud TPU v2 and Preemptible v2    asia-east1-c
  4. Use the gcloud beta container clusters create command to create a cluster on GKE with support for Cloud TPU. In the command below, replace YOUR-CLUSTER with a cluster name of your choice:

    $ gcloud beta container clusters create YOUR-CLUSTER \
      --cluster-version=1.10 \
      --scopes=cloud-platform \
      --enable-ip-alias \
      --enable-tpu
    

    In the above command:

    • --cluster-version=1.10 indicates that the cluster will use the latest Kubernetes 1.10 release. You must use version 1.10.4-gke.2 or later.
    • --scopes=cloud-platform ensures that all nodes in the cluster have access to your Cloud Storage bucket in the GCP project defined as YOUR-CLOUD-PROJECT above. The cluster and the storage bucket must be in the same project for this to work. Note that the Pods by default inherit the scopes of the nodes to which they are deployed. Therefore, --scopes=cloud-platform gives all Pods running in the cluster the cloud-platform scope. If you want to limit the access on a per Pod basis, see the GKE guide to authenticating with service accounts.
    • --enable-ip-alias indicates that the cluster uses alias IP ranges. This is required for using Cloud TPU on GKE.
    • --enable-tpu indicates that the cluster must support Cloud TPU.
    • --tpu-ipv4-cidr (optional, not specified above) specifies the CIDR range to use for Cloud TPU. The value takes the form IP/20, such as 10.100.0.0/20. If you do not specify this flag, a /20 CIDR range is automatically allocated and assigned; an example with this flag is shown below.
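
    For example, a sketch of the same command with an explicit CIDR range (the range shown is only an illustration):

    $ gcloud beta container clusters create YOUR-CLUSTER \
      --cluster-version=1.10 \
      --scopes=cloud-platform \
      --enable-ip-alias \
      --enable-tpu \
      --tpu-ipv4-cidr=10.100.0.0/20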

Build and containerize your model in Docker image

You can use either an official TPU model that has been containerized in Docker images, or build and containerize your own model.

  • Use the official TPU models

    The latest official TPU models are containerized in Docker images. These Docker images were built with a Dockerfile. The following sections use the official TPU model Docker images.

  • Build your own models

    If you want to build your own model to run on Cloud TPU on GKE, you can use Cloud TPU in TensorFlow with tf.contrib.cluster_resolver.TPUClusterResolver and tf.contrib.tpu.TPUEstimator, as follows:

    # The tpu_names, zone, and project arguments must be omitted from
    # TPUClusterResolver, as GKE automatically uses the Cloud TPU created
    # for your job when the job runs.

    import tensorflow as tf

    tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver()

    run_config = tf.contrib.tpu.RunConfig(
      cluster=tpu_cluster_resolver,
      # ... other RunConfig arguments ...
    )
    estimator = tf.contrib.tpu.TPUEstimator(
      use_tpu=True,
      config=run_config,
      # ... other TPUEstimator arguments ...
    )

For more information, see the TensorFlow documentation on how to use Cloud TPUs in TensorFlow.

You can follow steps 1 and 2 in Deploying a containerized web application to containerize your model in a Docker image and push it to the Google Container Registry.
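
At a high level (the image name and tag below are placeholders, not part of this guide), those two steps amount to building the image and pushing it to Container Registry:

    # Build the image from a Dockerfile in the current directory.
    $ docker build -t gcr.io/YOUR-CLOUD-PROJECT/your-model:v1 .

    # Let Docker authenticate to Container Registry, then push the image.
    $ gcloud auth configure-docker
    $ docker push gcr.io/YOUR-CLOUD-PROJECT/your-model:v1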

Request a Cloud TPU in your Kubernetes Pod spec

In your Pod spec:

  • Use the following Pod annotation to specify the TensorFlow version that the Cloud TPU nodes use:

    tf-version.cloud-tpus.google.com: "x.y"
    

    Where x.y is the TensorFlow version supported by the Cloud TPU. You must use TensorFlow 1.11 or above. All Cloud TPU instances created for a Pod must use the same TensorFlow version. You must build your models in your containers using the same TensorFlow version. See the supported versions.

  • Specify the Cloud TPU resource in the limits section under the resource field in the container spec.

    Note that the unit of the Cloud TPU resource is the number of Cloud TPU cores. The following table lists all the valid resource requests.

    Resource request                           Cloud TPU type                                Required GKE version
    cloud-tpus.google.com/v2: 8                A Cloud TPU v2 device (8 cores)               1.10.4-gke.2 or above
    cloud-tpus.google.com/v2: 32               A v2-32 Cloud TPU Pod (32 cores) (alpha)      1.10.7-gke.6 or above
    cloud-tpus.google.com/v2: 128              A v2-128 Cloud TPU Pod (128 cores) (alpha)    1.10.7-gke.6 or above
    cloud-tpus.google.com/v2: 256              A v2-256 Cloud TPU Pod (256 cores) (alpha)    1.10.7-gke.6 or above
    cloud-tpus.google.com/v2: 512              A v2-512 Cloud TPU Pod (512 cores) (alpha)    1.10.7-gke.6 or above
    cloud-tpus.google.com/preemptible-v2: 8    A Preemptible Cloud TPU v2 device (8 cores)   1.10.6-gke.1 or above
    cloud-tpus.google.com/v3: 8                A Cloud TPU v3 device (8 cores)               1.10.7-gke.6 or above
    cloud-tpus.google.com/preemptible-v3: 8    A Preemptible Cloud TPU v3 device (8 cores)   1.10.7-gke.6 or above

    For more information on specifying resources and limits in the Pod spec, see the Kubernetes documentation.

For example, this Job spec requests one Preemptible Cloud TPU v2 device with TensorFlow 1.13:

apiVersion: batch/v1
kind: Job
metadata:
  name: resnet-tpu
spec:
  template:
    metadata:
      annotations:
        # The Cloud TPUs that will be created for this Job must support
        # TensorFlow 1.13. This version MUST match
        # the TensorFlow version that your model is built on.
        tf-version.cloud-tpus.google.com: "1.13"
    spec:
      restartPolicy: Never
      containers:
      - name: resnet-tpu
        # The official TensorFlow 1.13 TPU model image built from
        # https://github.com/tensorflow/tpu/blob/r1.13/tools/docker/Dockerfile.
        image: gcr.io/tensorflow/tpu-models:r1.13
        command:
        - python
        - /tensorflow_tpu_models/models/official/resnet/resnet_main.py
        - --data_dir=gs://cloud-tpu-test-datasets/fake_imagenet
        - --model_dir=gs://<my-model-bucket>/resnet
        env:
        # Point PYTHONPATH to the top level models folder
        - name: PYTHONPATH
          value: "/tensorflow_tpu_models/models"
        resources:
          limits:
            # Request a single v2-8 Preemptible Cloud TPU device to train the model.
            # A single v2-8 Preemptible Cloud TPU device consists of 4 chips, each of which
            # has 2 cores, so there are 8 cores in total.
            cloud-tpus.google.com/preemptible-v2: 8

Create the Job

Follow these steps to create the Job in the GKE cluster:

  1. Create a file named example-job.yaml containing the Job spec shown above.

  2. Run the Job:

    $ kubectl create -f example-job.yaml
    job "resnet-tpu" created

    This command creates the job that automatically schedules the Pod.

  3. Verify that the Pod has been scheduled and Cloud TPU nodes have been provisioned. A Pod requesting Cloud TPU nodes can be pending for 5 minutes before running. You will see output similar to the following until the Pod is scheduled.

    $ kubectl get pods -w
    
    NAME               READY     STATUS    RESTARTS   AGE
    resnet-tpu-cmvlf   0/1       Pending   0          1m
    

    After 5 minutes, you should see something like this:

    NAME               READY     STATUS    RESTARTS   AGE
    resnet-tpu-cmvlf   1/1       Running   0          6m
    
    The lifetime of Cloud TPU nodes is bound to the Pods that request them. The Cloud TPU is created on demand when the Pod is scheduled, and recycled when the Pod is deleted.
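
    Optionally, you can also confirm that the underlying Cloud TPU node exists by listing the TPUs in the cluster's zone (output will vary with your setup):

    $ gcloud compute tpus list --zone=us-central1-b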

Get the Cloud TPU logs

Follow these steps to get the logs of the Cloud TPU instances used by your Kubernetes Pods.

  1. Go to the GKE page on the GCP Console.

    Go to the GKE page

  2. Click Workloads.

  3. Find your Deployment/Job and click it.

  4. Find the Pod under Managed pods and click it.

  5. Find the Container under Containers and click it.

  6. Click Stackdriver logs to see the logs of the Cloud TPU used by this Container.
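
These steps show the logs of the Cloud TPU itself. To follow your own container's output from the command line instead, you can use kubectl directly (the Pod name below is the example Pod from this guide; substitute your own):

    $ kubectl logs resnet-tpu-cmvlf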

Use TensorBoard for visualizing metrics and analyzing performance

TensorBoard is a suite of tools designed to present TensorFlow data visually. TensorBoard can help identify bottlenecks in processing and suggest ways to improve performance.

TPU Profiler is a TensorBoard plugin for capturing a profile on an individual Cloud TPU which can be visualized on TensorBoard. The Cloud TPU tool selector will become available under the Profile tab on the TensorBoard menu bar only after you have collected trace information from a running TensorFlow model using TPU Profiler.

Run TensorBoard in a GKE cluster

Follow these steps to run TensorBoard in the GKE cluster:

  1. Download the TensorBoard Deployment YAML configuration.

    wget https://raw.githubusercontent.com/tensorflow/tpu/r1.13/tools/kubernetes/tensorboard_k8s.yaml
  2. Change the environment variable MODEL_BUCKET in the file to the Cloud Storage location where your model and the TensorFlow events exist.

  3. Run TensorBoard in a GKE cluster.

    kubectl apply -f tensorboard_k8s.yaml
    
  4. Get the EXTERNAL_IP of the TensorBoard service. Note that a load balancer will be created to route requests to TensorBoard, which incurs additional cost.

    kubectl get service tensorboard-service
    
  5. Go to http://EXTERNAL_IP:6006 in your browser.

Run TPU Profiler to capture trace information

Follow these steps to run TPU Profiler in the GKE cluster:

  1. Download the TPU Profiler Job YAML configuration.

    wget https://raw.githubusercontent.com/tensorflow/tpu/r1.13/tools/kubernetes/tpu_profiler_k8s.yaml
  2. Change the environment variable TPU_NAME in the file to the name of the Cloud TPU you want to profile. The TPU name appears in the Cloud Console on the Compute Engine > TPUs page, in the following format:

    gke-[CLUSTER-NAME]-[CLUSTER-ID]-tpu-[TPU-ID]
    
    For example:
    gke-demo-cluster-25cee208-tpu-4b90f4c5

  3. Change the environment variable MODEL_BUCKET in the file to the Cloud Storage location where your model and the TensorFlow events exist.

  4. Run TPU Profiler in a GKE cluster.

    kubectl create -f tpu_profiler_k8s.yaml
    
  5. Refresh your browser to see the tracing data under the Profile tab on TensorBoard.

Clean up

When you've finished with Cloud TPU on GKE, clean up the resources to avoid incurring extra charges to your Google Cloud Platform account.

console

Delete your GKE cluster:

  1. Go to the GKE page on the GCP Console.

    Go to the GKE page

  2. Select the checkbox next to the cluster that you want to delete.

  3. Click Delete.

When you've finished examining the data, delete the Cloud Storage bucket that you created during this tutorial:

  1. Go to the Cloud Storage page on the GCP Console.

    Go to the Cloud Storage page

  2. Select the checkbox next to the bucket that you want to delete.

  3. Click Delete.

See the Cloud Storage pricing guide for free storage limits and other pricing information.

gcloud

If you haven't set the project and zone for this session, do so now. See the instructions earlier in this guide. Then follow this cleanup procedure:

  1. Run the following command to delete your GKE cluster, replacing YOUR-CLUSTER with your cluster name, and YOUR-PROJECT with your GCP project name:

    $ gcloud container clusters delete YOUR-CLUSTER --project=YOUR-PROJECT
    
  2. When you've finished examining the data, use the gsutil command to delete the Cloud Storage bucket that you created during this tutorial. Replace YOUR-BUCKET with the name of your Cloud Storage bucket:

    $ gsutil rm -r gs://YOUR-BUCKET
    

    See the Cloud Storage pricing guide for free storage limits and other pricing information.

What's next

  • Work through the tutorial to train the TensorFlow ResNet-50 model on Cloud TPU and GKE.
  • Run more models and dataset retrieval jobs using one of the available Job specs.
