Running Cloud TPU applications on GKE

This guide describes how to set up a GKE cluster with Cloud TPU support and run Cloud TPU training applications on it.

If you're looking for a detailed walkthrough, follow the tutorial, which shows you how to train the TensorFlow ResNet-50 model using Cloud TPU and GKE.

Benefits of running Cloud TPU applications on GKE

You can configure Cloud TPU training applications to run in containers within GKE Pods. Doing so gives you the following benefits:

  • Easier setup and management: When you use Cloud TPU, you need a Compute Engine VM to run your workload, and a Classless Inter-Domain Routing (CIDR) block for Cloud TPU. GKE sets up the VM and the CIDR block and manages the VM for you.

  • Optimized cost: You only pay for the TPU while the Job is active. GKE automatically creates and deletes TPUs according to a Job's resource requirements.

  • Flexible usage: It's a one-line change in your Pod spec to request a different hardware accelerator (CPU, GPU, or TPU):

    apiVersion: v1
    kind: Pod
    spec:
      containers:
      - name: example-container
        image: tensorflow/tensorflow:2.3.0
        resources:
          limits:
            # Request a Cloud TPU v2 device (8 cores). To use CPU or GPU
            # instead, replace the line below with one of the commented lines.
            cloud-tpus.google.com/v2: 8
            # cpu: 2
            # nvidia.com/gpu: 1
    
  • Scalability: GKE provides APIs (Job and Deployment) that can easily scale to hundreds of GKE Pods and Cloud TPU nodes.

  • Fault tolerance: GKE's Job API, along with the TensorFlow checkpoint mechanism, provides run-to-completion semantics. If failures occur on the VM instances or Cloud TPU nodes, your training jobs automatically rerun, reading the latest state from the checkpoint.

Cloud TPU and GKE configuration requirements and limitations

Note the following when defining your GKE configuration:

  • You must use GKE version 1.13.4-gke.5 or later. You can specify the version by adding the --cluster-version parameter to the gcloud container clusters create command, as described below. For more information about versions, see the SDK documentation. A command-line sketch for checking the versions available in your zone follows this list.
  • You must use TensorFlow 1.15.5 or later. You should specify the TensorFlow version used by the Cloud TPU in your Kubernetes Pod spec, as described below.
  • You must create your GKE cluster and node pools in a zone where Cloud TPU is available. You must also create the Cloud Storage buckets to hold your training data and models in the same region as your GKE cluster. The following zones are available:

    US

    TPU type (v2)   TPU v2 cores   Total TPU memory   Zone(s)
    v2-8            8              64 GiB             us-central1-b, us-central1-c, us-central1-f
    v2-32           32             256 GiB            us-central1-a
    v2-128          128            1 TiB              us-central1-a
    v2-256          256            2 TiB              us-central1-a
    v2-512          512            4 TiB              us-central1-a

    TPU type (v3)   TPU v3 cores   Total TPU memory   Zone(s)
    v3-8            8              128 GiB            us-central1-a, us-central1-b, us-central1-f

    Europe

    TPU type (v2)   TPU v2 cores   Total TPU memory   Zone(s)
    v2-8            8              64 GiB             europe-west4-a
    v2-32           32             256 GiB            europe-west4-a
    v2-128          128            1 TiB              europe-west4-a
    v2-256          256            2 TiB              europe-west4-a
    v2-512          512            4 TiB              europe-west4-a

    TPU type (v3)   TPU v3 cores   Total TPU memory   Zone(s)
    v3-8            8              128 GiB            europe-west4-a
    v3-32           32             512 GiB            europe-west4-a
    v3-64           64             1 TiB              europe-west4-a
    v3-128          128            2 TiB              europe-west4-a
    v3-256          256            4 TiB              europe-west4-a
    v3-512          512            8 TiB              europe-west4-a
    v3-1024         1024           16 TiB             europe-west4-a
    v3-2048         2048           32 TiB             europe-west4-a

    Asia Pacific

    TPU type (v2)   TPU v2 cores   Total TPU memory   Zone(s)
    v2-8            8              64 GiB             asia-east1-c
  • Each container can request at most one Cloud TPU, but multiple containers in a Pod can each request a Cloud TPU.
  • Cluster Autoscaler supports Cloud TPU on GKE 1.13.4-gke.5 or later.
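
To check which GKE versions are available in your zone before you create the cluster (see the version requirement above), one option is the following gcloud sketch, assuming the us-central1-b zone used elsewhere in this guide:

    $ gcloud container get-server-config --zone us-central1-b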

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. When you use Cloud TPU with GKE, your project uses billable components of Google Cloud. Check Cloud TPU pricing and GKE pricing to estimate your costs, and follow the instructions to clean up resources when you've finished with them.

  5. Enable the following APIs in the Cloud Console: the Kubernetes Engine API and the Cloud TPU API.
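
    If you prefer the command line, you can enable these services with gcloud instead (a sketch equivalent to the Console flow; these are the standard service names for GKE and Cloud TPU):

    gcloud services enable container.googleapis.com tpu.googleapis.com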

Create a Service Account and a Cloud Storage bucket

You need a Cloud Storage bucket to store the results of training your machine learning model.

  1. Open a Cloud Shell window.

    Open Cloud Shell

  2. Create a variable for your project's ID.

    export PROJECT_ID=project-id
    
  3. Use the following gcloud command to create a Service Account for the Cloud TPU project.

    gcloud beta services identity create --service tpu.googleapis.com --project $PROJECT_ID
    

    The command returns a Cloud TPU Service Account with the following format:

    service-PROJECT_NUMBER@cloud-tpu.iam.gserviceaccount.com
    
  4. Create a new bucket, specifying the following options:

    • A unique name of your choosing.
    • Location type: region
    • Location: us-central1
    • Default storage class: Standard
    • Access control: fine-grained

    Before using the storage bucket, you must authorize the Cloud TPU Service Account to access the bucket by setting fine-grained ACLs for it, as described in the next section.
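
    For example, a minimal gsutil sketch that creates a bucket with the options above (replace bucket-name with the unique name you chose):

    gsutil mb -p $PROJECT_ID -l us-central1 -c standard -b off gs://bucket-name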

Authorize Cloud TPU to access your Cloud Storage bucket

You need to give your Cloud TPU read/write access to your Cloud Storage objects. To do that, you must grant the required access to the service account used by the Cloud TPU. Follow the guide to grant access to your storage bucket.
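
One way to grant this access from the command line is to give the Cloud TPU Service Account the Storage Object Admin role on your bucket (a sketch of one option; the linked guide describes the Console flow and finer-grained ACL choices). Replace PROJECT_NUMBER and bucket-name with your values:

    gsutil iam ch serviceAccount:service-PROJECT_NUMBER@cloud-tpu.iam.gserviceaccount.com:roles/storage.objectAdmin gs://bucket-name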

Creating a new cluster with Cloud TPU support

You can create a cluster with Cloud TPU support by using the Cloud Console or the gcloud tool.

Select an option below to see the relevant instructions:

Console

Follow these instructions to create a GKE cluster with Cloud TPU support:

  1. Go to the GKE page on the Cloud Console.

    Go to the GKE page

  2. Click Create cluster.

  3. Specify a Name for your cluster. The name must be unique within the project and zone.

  4. For the Location type, select zonal and then select the desired zone where you plan to use a Cloud TPU resource. For this example, select the us-central1-b zone.

  5. Ensure that the Master version is set to 1.13.4-gke.5 or later, to allow support for Cloud TPU.

  6. From the navigation pane, under the node pool you want to configure, click Security.

  7. Select Allow full access to all Cloud APIs. This ensures that all nodes in the cluster have access to your Cloud Storage bucket. The cluster and the storage bucket must be in the same project for this to work. By default, Kubernetes Pods inherit the scopes of the nodes to which they are deployed. If you want to limit the access on a per Pod basis, see the GKE guide to authenticating with service accounts.

  8. From the navigation pane, under Cluster, click Networking.

  9. Select Enable VPC-native traffic routing (uses alias IP). You need to create a VPC network if one does not already exist for the current project.

  10. From the navigation pane, under Cluster, click Features.

  11. Select Enable Cloud TPU.

  12. Configure the remaining options for your cluster as desired. You can leave the options at their default values.

  13. Click Create.

  14. Connect to the cluster. You can do this by selecting your cluster from the Console Kubernetes clusters page and clicking the Connect button. This displays the gcloud command to run in a Cloud shell to connect.
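
    The command that the Console displays is typically of the following form (a sketch; your cluster name, zone, and project will differ):

    gcloud container clusters get-credentials cluster-name --zone us-central1-b --project project-name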

gcloud

Follow the instructions below to set up your environment and create a GKE cluster with Cloud TPU support, using the gcloud tool:

  1. Install the gcloud components, which you need for running GKE with Cloud TPU:

    $ gcloud components install kubectl 
  2. Configure gcloud with your Google Cloud project ID:

    $ gcloud config set project project-id
    

    Replace project-id with the ID of your Google Cloud project.

    The first time you run this command in a new Cloud Shell VM, an Authorize Cloud Shell page is displayed. Click Authorize at the bottom of the page to allow gcloud to make GCP API calls with your credentials.

  3. Configure gcloud with the zone where you plan to use a Cloud TPU resource. For this tutorial, use the us-central1-b zone:

    $ gcloud config set compute/zone us-central1-b
    
  4. Use the gcloud container clusters create command to create a cluster on GKE with support for Cloud TPU. In the following command, replace cluster-name with a cluster name of your choice:

    $ gcloud container clusters create cluster-name \
      --cluster-version=1.16 \
      --scopes=cloud-platform \
      --enable-ip-alias \
      --enable-tpu
    

    Command flag descriptions

    cluster-version
    Indicates that the cluster will use the latest Kubernetes 1.16 release. You must use version 1.13.4-gke.5 or later.
    scopes
    Ensures that all nodes in the cluster have access to your Cloud Storage bucket. The cluster and the storage bucket must be in the same project for this to work. Note that the Kubernetes Pods by default inherit the scopes of the nodes to which they are deployed. Therefore, scopes=cloud-platform gives all Kubernetes Pods running in the cluster the cloud-platform scope. If you want to limit the access on a per Pod basis, see the GKE guide to authenticating with service accounts.
    enable-ip-alias
    Indicates that the cluster uses alias IP ranges. This is required for using Cloud TPU on GKE.
    enable-tpu
    Indicates that the cluster must support Cloud TPU.
    tpu-ipv4-cidr (optional, not specified above)
    Indicates the CIDR range to use for Cloud TPU. Specify the IP_RANGE in the form of IP/20, such as 10.100.0.0/20. If you do not specify this flag, a /20 size CIDR range is automatically allocated and assigned.

    When the cluster has been created, you should see a message similar to the following:

    NAME             LOCATION       MASTER_VERSION    MASTER_IP     MACHINE_TYPE   NODE_VERSION      NUM_NODES  STATUS
    cluster-resnet  us-central1-b  1.16.15-gke.4901  34.71.245.25  n1-standard-1  1.16.15-gke.4901  3          RUNNING
    

View cluster operations

Enabling Cloud TPU support starts an update operation. For zonal clusters this operation takes around 5 minutes, and for regional clusters, this operation takes roughly 15 minutes, depending on the cluster's region.

To list every running and completed operation in your cluster, run the following command:

   $ gcloud container operations list
   

To get more information about a specific operation, run the following command:

   $ gcloud container operations describe operation-id
   

Replace operation-id with the ID of the specific operation.

Request a Cloud TPU in your Kubernetes Pod spec

In your Kubernetes Pod spec:

  • Use the following Pod annotation to specify the TensorFlow version that the Cloud TPU nodes use:

    tf-version.cloud-tpus.google.com: "x.y"
    

    Where x.y is the TensorFlow version supported by the Cloud TPU. You must use TensorFlow 1.15.5 or above. All Cloud TPU instances created for a Kubernetes Pod must use the same TensorFlow version. You must build your models in your containers using the same TensorFlow version. See the supported versions.

  • Specify the Cloud TPU resource in the limits section under the resources field in the container spec.

    Note that the unit of the Cloud TPU resource is the number of Cloud TPU cores. The following table lists all the valid resource requests.

    If you intend to use a Cloud TPU Pod, you must request quota; the default quota for Cloud TPU Pods is zero.

    Resource request                          Cloud TPU type                               Required GKE version
    cloud-tpus.google.com/v2: 8               A Cloud TPU v2 device (8 cores)              1.10.4-gke.2 or later
    cloud-tpus.google.com/preemptible-v2: 8   A Preemptible Cloud TPU v2 device (8 cores)  1.10.6-gke.1 or later
    cloud-tpus.google.com/v3: 8               A Cloud TPU v3 device (8 cores)              1.10.7-gke.6 or later
    cloud-tpus.google.com/preemptible-v3: 8   A Preemptible Cloud TPU v3 device (8 cores)  1.10.7-gke.6 or later
    cloud-tpus.google.com/v2: 32              A v2-32 Cloud TPU Pod (32 cores) (beta)      1.10.7-gke.6 or later
    cloud-tpus.google.com/v2: 128             A v2-128 Cloud TPU Pod (128 cores) (beta)    1.10.7-gke.6 or later
    cloud-tpus.google.com/v2: 256             A v2-256 Cloud TPU Pod (256 cores) (beta)    1.10.7-gke.6 or later
    cloud-tpus.google.com/v2: 512             A v2-512 Cloud TPU Pod (512 cores) (beta)    1.10.7-gke.6 or later
    cloud-tpus.google.com/v3: 32              A v3-32 Cloud TPU Pod (32 cores) (beta)      1.10.7-gke.6 or later
    cloud-tpus.google.com/v3: 64              A v3-64 Cloud TPU Pod (64 cores) (beta)      1.10.7-gke.6 or later
    cloud-tpus.google.com/v3: 128             A v3-128 Cloud TPU Pod (128 cores) (beta)    1.10.7-gke.6 or later
    cloud-tpus.google.com/v3: 256             A v3-256 Cloud TPU Pod (256 cores) (beta)    1.10.7-gke.6 or later
    cloud-tpus.google.com/v3: 512             A v3-512 Cloud TPU Pod (512 cores) (beta)    1.10.7-gke.6 or later
    cloud-tpus.google.com/v3: 1024            A v3-1024 Cloud TPU Pod (1024 cores) (beta)  1.10.7-gke.6 or later
    cloud-tpus.google.com/v3: 2048            A v3-2048 Cloud TPU Pod (2048 cores) (beta)  1.10.7-gke.6 or later

    For more information on specifying resources and limits in the Pod spec, see the Kubernetes documentation.

The sample Job spec shown below requests one Preemptible Cloud TPU v2 device with TensorFlow 2.3. It also starts a TensorBoard process.

The lifetime of Cloud TPU nodes is bound to the Kubernetes Pods that request them. The Cloud TPU is created on demand when the Kubernetes Pod is scheduled, and recycled when the Kubernetes Pod is deleted.

apiVersion: batch/v1
kind: Job
metadata:
  name: resnet-tpu
spec:
  template:
    metadata:
      annotations:
        # The Cloud TPUs that will be created for this Job will support
        # TensorFlow 2.3. This version MUST match the
        # TensorFlow version that your model is built on.
        tf-version.cloud-tpus.google.com: "2.3"
    spec:
      restartPolicy: Never
      containers:
      - name: resnet-tpu
        # The official TensorFlow 2.3.0 image.
        # https://hub.docker.com/r/tensorflow/tensorflow
        image: tensorflow/tensorflow:2.3.0
        command:
        - bash
        - -c
        - |
          pip install tf-models-official==2.3.0
          python3 -m official.vision.image_classification.resnet.resnet_ctl_imagenet_main \
            --tpu=$(KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS) \
            --distribution_strategy=tpu \
            --steps_per_loop=500 \
            --log_steps=500 \
            --use_synthetic_data=true \
            --dtype=fp32 \
            --enable_tensorboard=true \
            --train_epochs=90 \
            --epochs_between_evals=1 \
            --batch_size=1024 \
            --model_dir=gs://bucket-name/resnet
        resources:
          limits:
            # Request a single Preemptible v2-8 Cloud TPU device to train the
            # model. A single v2-8 Cloud TPU device consists of 4 chips, each of
            # which has 2 cores, so there are 8 cores in total.
            cloud-tpus.google.com/preemptible-v2: 8
      - name: tensorboard
        image: tensorflow/tensorflow:2.3.0
        command:
        - bash
        - -c
        - |
          pip install tensorboard-plugin-profile==2.3.0 cloud-tpu-client
          tensorboard --logdir=gs://bucket-name/resnet --port=6006
        ports:
        - containerPort: 6006

Creating the Job

Follow these steps to create the Job in the GKE cluster using kubectl:

  1. Using a text editor, create a Job spec, example-job.yaml, and copy/paste in the Job spec shown above. Be sure to replace the bucket-name variable in the --model_dir parameter and in the tensorboard command with the name of your storage bucket.

  2. Run the Job:

    $ kubectl create -f example-job.yaml
    
    job "resnet-tpu" created

    This command creates the job that automatically schedules the Pod.

  3. Verify that the Pod has been scheduled and Cloud TPU nodes have been provisioned. A Pod requesting Cloud TPU nodes can be pending for 5 minutes before running. You will see output similar to the following until the Pod is scheduled.

    $ kubectl get pods -w
    
    NAME               READY     STATUS    RESTARTS   AGE
    resnet-tpu-cmvlf   0/1       Pending   0          1m
    

    After 5 minutes, you should see something like this:

    NAME               READY     STATUS    RESTARTS   AGE
    resnet-tpu-cmvlf   1/1       Running   0          6m
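
    Once the Pod is Running, you can stream the training logs from the resnet-tpu container (a convenience; substitute the Pod name from your own output):

    $ kubectl logs -f resnet-tpu-cmvlf -c resnet-tpu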
    

View the Cloud TPU status and logs

Follow these steps to verify the status and view the logs of the Cloud TPU instances used by your Kubernetes Pods.

  1. Go to the GKE page on the Cloud Console.

    Go to the GKE page

  2. On the left-hand navigation bar, click Workloads.

  3. Select your Job. This takes you to a page that includes a heading Managed Pods.

  4. Under Managed Pods, select your Kubernetes Pod. This takes you to a page that includes a heading Containers.

    Under Containers, a list of Containers is displayed. The list includes all Cloud TPU instances. For each Container, the following information is displayed:

    • The run status
    • A link to the Container logs

Use TensorBoard for visualizing metrics and analyzing performance

TensorBoard is a suite of tools designed to present TensorFlow data visually. TensorBoard can help identify bottlenecks in processing and suggest ways to improve performance.

TPU Profiler is a TensorBoard plugin for capturing a profile of an individual Cloud TPU or a Cloud TPU Pod, which you can then visualize in TensorBoard. The Cloud TPU tool selector becomes available under the Profile tab on the TensorBoard menu bar after you have collected trace information from a running TensorFlow model using TPU Profiler.

Run TensorBoard in a GKE cluster

Follow these steps to run TensorBoard in the GKE cluster:

  1. Follow the steps in View the Cloud TPU status and logs to verify that the TensorBoard instance is running in a container.

  2. Port-forward to the TensorBoard Kubernetes Pod:

    $ kubectl port-forward pod/resnet-tpu-pod-id 6006
    

    where pod-id is the suffix of your GKE Pod name shown on the Console at: Kubernetes Engine > Workloads > Managed Pods. For example: resnet-tpu-wxskc.
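
    If you'd rather look up the Pod name from the command line, a sketch that uses the job-name label the Job controller adds automatically:

    $ kubectl get pods -l job-name=resnet-tpu -o jsonpath='{.items[0].metadata.name}'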

  3. On the bar at the top right-hand side of the Cloud Shell, click the Web preview button and open port 6006 to view the TensorBoard output. The TensorBoard UI will appear as a tab in your browser.

  4. Select PROFILE from the dropdown menu on the top right side of the TensorBoard page.

  5. Click on the CAPTURE PROFILE button on the PROFILE page.

  6. In the popup menu, select the TPU name address type and enter the TPU name. The TPU name appears in the Cloud Console on the Compute Engine > TPUs page, in the following format:

    gke-cluster-name-cluster-id-tpu-tpu-id
    
    For example:
    gke-demo-cluster-25cee208-tpu-4b90f4c5
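
    You can also list the TPU names (and confirm this format) from the command line, assuming the us-central1-b zone used earlier in this guide:

    $ gcloud compute tpus list --zone=us-central1-b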

  7. Select the CAPTURE button on the pop-up menu when you're ready to begin profiling, and wait a few seconds for the profile to complete.

  8. Refresh your browser to see the tracing data under the PROFILE tab on TensorBoard.

For more information on how to capture and interpret profiles, see the TensorFlow profiler guide.

Build and containerize your model in a Docker image

You can use either an official TPU model that has been containerized in Docker images, or build and containerize your own model.

  • Use the official TPU models

    The latest official TPU models are containerized in Docker images. These Docker images were built with a Dockerfile.

  • Build your own model

    The Colab notebooks in the TensorFlow documentation provide examples of how to build your own model.

    If you choose to build your own model, use the following steps to containerize the model in a Docker image and push it to the Google Container Registry.

    1. Enable the Cloud Build and Container Registry APIs in the Cloud Console.

    2. Create a Dockerfile containing the following lines.

      FROM tensorflow/tensorflow:2.3.0
      
      RUN pip install tf-models-official==2.3.0 tensorboard-plugin-profile==2.3.0 cloud-tpu-client
      
    3. Run the following command in the same directory as the Dockerfile to build and tag the Docker image. Replace my-project with your project ID and my-image with a name for your image.

      gcloud builds submit . -t gcr.io/my-project/my-image
      

      The gcr.io prefix refers to Container Registry, where the image is hosted. Running this command builds the image with Cloud Build and pushes it to Container Registry.

    4. Verify that the build succeeded and that the image was pushed to Container Registry:

       gcloud container images list-tags gcr.io/my-project/my-image
      
      The output lists the digest, tags, and upload timestamp of the image you just built.
      
    5. Update the Job spec to use the Docker image. Replace my-project and my-image with your project ID and the image name you defined. Also replace bucket-name with the name of a bucket you will use to store the training output of your model.

        image: gcr.io/my-project/my-image
        command:
          - python3
          - -m
          - official.vision.image_classification.resnet.resnet_ctl_imagenet_main
          - --tpu=$(KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS)
          - --distribution_strategy=tpu
          - --steps_per_loop=500
          - --log_steps=500
          - --use_synthetic_data=true
          - --dtype=fp32
          - --enable_tensorboard=true
          - --train_epochs=90
          - --epochs_between_evals=1
          - --batch_size=1024
          - --model_dir=gs://bucket-name/resnet-output
      
    6. Create and run the job as you would with an official TPU model.

Enabling Cloud TPU support on an existing cluster

To enable Cloud TPU support on an existing GKE cluster, perform the following steps in the gcloud command-line tool:

  1. Enable Cloud TPU support:

    gcloud beta container clusters update cluster-name --enable-tpu
    

    Replace cluster-name with the name of your cluster.

  2. Update the kubeconfig entry:

    gcloud container clusters get-credentials cluster-name
    

Setting a custom CIDR range

By default, GKE allocates a CIDR block with the size of /20 for the TPUs provisioned by the cluster. You can specify a custom CIDR range for the Cloud TPU by running the following command:

gcloud beta container clusters update cluster-name \
  --enable-tpu \
  --tpu-ipv4-cidr 10.100.0.0/20

Replace the following:

  • cluster-name: the name of your existing cluster.
  • 10.100.0.0/20: your custom CIDR range.

Disabling Cloud TPU in a cluster

To disable Cloud TPU support on an existing GKE cluster, perform the following steps in the gcloud command-line tool:

  1. Verify that none of your workloads are using Cloud TPU:

    $ kubectl get tpu
    
  2. Disable Cloud TPU support in your cluster:

    $ gcloud beta container clusters update cluster-name --no-enable-tpu
    

    Replace cluster-name with the name of your cluster.

    For zonal clusters this operation takes around 5 minutes, and for regional clusters, this operation takes roughly 15 minutes, depending on the cluster's region.

  3. Once the operation completes with no errors, you can verify that the TPUs provisioned by the cluster have been removed:

    $ gcloud compute tpus list
    

    The names of the TPUs created by Cloud TPU have the following format:

    gke-cluster-name-cluster-id-tpu-tpu-id
    

    Replace the following:

    • cluster-name: the name of your existing cluster.
    • cluster-id: the ID of your existing cluster.
    • tpu-id: the ID of the Cloud TPU.

    If any TPUs appear, you can manually delete them by running:

    $ gcloud compute tpus delete gke-cluster-name-cluster-id-tpu-tpu-id
    

Clean up

When you've finished with Cloud TPU on GKE, clean up the resources to avoid incurring extra charges to your Cloud Billing account.

Console

Delete your GKE cluster:

  1. Go to the GKE page on the Cloud Console.

    Go to the GKE page

  2. Select the checkbox next to the cluster that you want to delete.

  3. Click Delete.

When you've finished examining the data, delete the Cloud Storage bucket that you created:

  1. Go to the Cloud Storage page on the Cloud Console.

    Go to the Cloud Storage page

  2. Select the checkbox next to the bucket that you want to delete.

  3. Click Delete.

See the Cloud Storage pricing guide for free storage limits and other pricing information.

gcloud

If you haven't set the project and zone for this session, do so now. See the instructions earlier in this guide. Then follow this cleanup procedure:

  1. Run the following command to delete your GKE cluster, replacing cluster-name with your cluster name, and project-name with your Google Cloud project name:

    $ gcloud container clusters delete cluster-name --project=project-name
    
  2. When you've finished examining the data, use the gsutil command to delete the Cloud Storage bucket that you created. Replace bucket-name with the name of your Cloud Storage bucket:

    $ gsutil rm -r gs://bucket-name
    

    See the Cloud Storage pricing guide for free storage limits and other pricing information.

What's next

  • Work through the tutorial to train the TensorFlow ResNet-50 model on Cloud TPU and GKE.
  • Run more models and dataset retrieval jobs using one of the following Job specs:
  • Download and preprocess the COCO dataset on GKE.
  • Download and preprocess ImageNet on GKE.
  • Train AmoebaNet-D using Cloud TPU and GKE.
  • Train Inception v3 using Cloud TPU and GKE.
  • Train RetinaNet using Cloud TPU and GKE.