Run Cloud TPU applications on GKE

This guide describes how to create a GKE cluster with Cloud TPU support and run a Cloud TPU training application in a GKE Pod.

For more information about TPU VM architectures, see System Architecture. This guide can only be used with the TPU Nodes architecture.

Benefits of running Cloud TPU applications on GKE

Cloud TPU training applications can be configured to run in GKE containers within GKE Pods. Running them this way provides the following benefits:

  • Improved workflow setup and management: GKE manages the TPU lifecycle. Once Cloud TPU initialization and training are set up with GKE, your workloads can be repeated and managed by GKE, including Job failure recovery.

  • Optimized cost: You only pay for the TPU while the Job is active. GKE automatically creates and deletes TPUs according to a Pod's resource requirements.

  • Flexible usage: It's a small change in your Pod spec to request a different hardware accelerator (CPU, GPU, or TPU):

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-tpu
      annotations:
        # The Cloud TPUs that will be created for this Job will support
        # TensorFlow 2.12.1. This version MUST match the
        # TensorFlow version that your model is built on.
        tf-version.cloud-tpus.google.com: "2.12.1"
    spec:
      containers:
      - name: example-container
        resources:
          limits:
            cloud-tpus.google.com/v2: 8
            # See the line above for TPU, or below for CPU / GPU.
            # cpu: 2
            # nvidia.com/gpu: 1
    
  • Scalability: GKE provides APIs (Job and Deployment) that can scale to hundreds of GKE Pods and TPU Nodes.

  • Fault tolerance: GKE's Job API, along with the TensorFlow checkpoint mechanism, provides run-to-completion semantics. If failures occur on the VM instances or Cloud TPU nodes, your training Jobs automatically rerun and read the latest state from the checkpoint (see the sketch after this list).
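
To make the last two benefits concrete, a TPU Pod template can be wrapped in a Kubernetes Job. The following is a minimal sketch, not a complete training setup: the train.py entry point and the Cloud Storage checkpoint path are hypothetical placeholders, and the annotation and resource limit follow the Pod spec shown later in this guide.

apiVersion: batch/v1
kind: Job
metadata:
  name: gke-tpu-job
spec:
  # Retry the Pod a limited number of times if it fails before completing.
  backoffLimit: 4
  template:
    metadata:
      annotations:
        # Must match the TensorFlow version that your model is built on.
        tf-version.cloud-tpus.google.com: "2.12.1"
    spec:
      restartPolicy: Never
      containers:
      - name: gke-tpu-trainer
        image: tensorflow/tensorflow:2.12.1
        # Hypothetical entry point: your training script should write and
        # restore checkpoints (for example, in a Cloud Storage bucket) so
        # that a rerun resumes from the latest saved state.
        command: ["python", "/app/train.py", "--checkpoint_dir=gs://your-bucket/checkpoints"]
        resources:
          limits:
            cloud-tpus.google.com/v2: 8

You create such a Job with kubectl create -f, in the same way as the Pod example in the Creating the Job section.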

Cloud TPU and GKE configuration requirements and limitations

Note the following when defining your GKE configuration:

  • Cloud TPU is not supported in Windows Server node pools.
  • You must create your GKE cluster and node pools in a zone where Cloud TPU is available. You must also create the Cloud Storage buckets to hold your training data and models in the same region as your GKE cluster. See the types and zones document for a list of the available zones.
  • You must use RFC 1918 compliant IP addresses for your GKE clusters. For more information, see GKE Networking.
  • Each container can request at most one Cloud TPU, but multiple containers in a Pod can each request their own Cloud TPU (see the sketch after this list).
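
To illustrate the last point, the following is a minimal sketch (the container names are hypothetical) of a Pod in which two containers each request their own Cloud TPU:

apiVersion: v1
kind: Pod
metadata:
  name: multi-tpu-pod
  annotations:
    # Must match the TensorFlow version that your models are built on.
    tf-version.cloud-tpus.google.com: "2.12.1"
spec:
  restartPolicy: Never
  containers:
  - name: worker-a
    image: tensorflow/tensorflow:2.12.1
    resources:
      limits:
        # One Cloud TPU v2 device (8 cores) for this container.
        cloud-tpus.google.com/v2: 8
  - name: worker-b
    image: tensorflow/tensorflow:2.12.1
    resources:
      limits:
        # A separate Cloud TPU v2 device (8 cores) for this container.
        cloud-tpus.google.com/v2: 8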

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. When you use Cloud TPU with GKE, your project uses billable components of Google Cloud. Check Cloud TPU pricing and GKE pricing to estimate your costs, and follow the instructions to clean up resources when you've finished with them.

  5. In the Google Cloud console, enable the GKE and Cloud TPU APIs.

Create a new cluster with Cloud TPU support

Use the following instructions to set up your environment and create a GKE cluster with Cloud TPU support, using the gcloud CLI:

  1. Install the kubectl gcloud component, which you need for running GKE with Cloud TPU:

    $ gcloud components install kubectl 
  2. Configure gcloud with your Google Cloud project ID:

    $ gcloud config set project project-name
    

    Replace project-name with the name of your Google Cloud project.

    The first time you run this command in a new Cloud Shell VM, an Authorize Cloud Shell page is displayed. Click Authorize at the bottom of the page to allow gcloud to make Google Cloud API calls with your credentials.

  3. Configure gcloud with the zone where you plan to use a Cloud TPU resource. This example uses us-central1-b, but you can use a TPU in any supported zone.

    $ gcloud config set compute/zone us-central1-b
    
  4. Use the gcloud container clusters create command to create a cluster on GKE with support for Cloud TPU.

    $ gcloud container clusters create cluster-name \
      --release-channel=stable \
      --scopes=cloud-platform \
      --enable-ip-alias \
      --enable-tpu
    

    Command flag descriptions

    release-channel
    Release channels provide a way to manage automatic upgrades for your clusters. When you create a new cluster, you can choose its release channel. Your cluster will only be upgraded to versions offered in that channel.
    scopes
    Ensures that all nodes in the cluster have access to your Cloud Storage bucket. The cluster and the storage bucket must be in the same project for this to work. Note that the Kubernetes Pods by default inherit the scopes of the nodes to which they are deployed. Therefore, scopes=cloud-platform gives all Kubernetes Pods running in the cluster the cloud-platform scope. If you want to limit the access on a per Pod basis, see the GKE guide to authenticating with Service Accounts.
    enable-ip-alias
    Indicates that the cluster uses alias IP ranges. This is required for using Cloud TPU on GKE.
    enable-tpu
    Indicates that the cluster must support Cloud TPU.
    tpu-ipv4-cidr (optional, not specified above)
    Indicates the CIDR range to use for Cloud TPU. Specify the IP_RANGE in the form of IP/20, such as 10.100.0.0/20. If you don't specify this flag, a /20 size CIDR range is automatically allocated and assigned.

When the cluster has been created, you should see a message similar to the following:

NAME             LOCATION       MASTER_VERSION    MASTER_IP     MACHINE_TYPE   NODE_VERSION      NUM_NODES  STATUS
cluster-name  us-central1-b  1.16.15-gke.4901  34.71.245.25  n1-standard-1  1.16.15-gke.4901  3          RUNNING

Request a Cloud TPU in your Kubernetes Pod spec

In your Kubernetes Pod spec:

  • You must build your models in your containers using the same TensorFlow version that you specify in the tf-version.cloud-tpus.google.com annotation. See the supported versions.

  • Specify the Cloud TPU resource in the limits section under the resource field in the container spec.

    Note that the unit of the Cloud TPU resource is the number of Cloud TPU cores. The following table lists examples of valid resource requests. See TPU types and zones for a complete list of valid TPU resources.

    If you intend to use a Cloud TPU Pod, you must request quota; the default quota for Cloud TPU Pods is zero.

    Resource Request                             Cloud TPU type
    cloud-tpus.google.com/v2: 8                  A Cloud TPU v2 device (8 cores)
    cloud-tpus.google.com/preemptible-v2: 8      A Preemptible Cloud TPU v2 device (8 cores)
    cloud-tpus.google.com/v3: 8                  A Cloud TPU v3 device (8 cores)
    cloud-tpus.google.com/preemptible-v3: 8      A Preemptible Cloud TPU v3 device (8 cores)
    cloud-tpus.google.com/v2: 32                 A v2-32 Cloud TPU Pod (32 cores)
    cloud-tpus.google.com/v3: 32                 A v3-32 Cloud TPU Pod (32 cores)

    For more information on specifying resources and limits in the Pod spec, see the Kubernetes documentation.

The following sample Pod spec requests one Preemptible Cloud TPU v2-8 device with TensorFlow 2.12.1.

The lifetime of Cloud TPU nodes is bound to the Kubernetes Pods that request them. The Cloud TPU is created on demand when the Kubernetes Pod is scheduled, and recycled when the Kubernetes Pod is deleted.

apiVersion: v1
kind: Pod
metadata:
  name: gke-tpu-pod
  annotations:
     # The Cloud TPUs that will be created for this Job will support
     # TensorFlow 2.12.1. This version MUST match the
     # TensorFlow version that your model is built on.
     tf-version.cloud-tpus.google.com: "2.12.1"
spec:
  restartPolicy: Never
  containers:
  - name: gke-tpu-container
    # The official TensorFlow 2.12.1 image.
    # https://hub.docker.com/r/tensorflow/tensorflow
    image: tensorflow/tensorflow:2.12.1
    command:
    - python
    - -c
    - |
      import tensorflow as tf
      print("Tensorflow version " + tf.__version__)

      tpu = tf.distribute.cluster_resolver.TPUClusterResolver('$(KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS)')
      print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])

      tf.config.experimental_connect_to_cluster(tpu)
      tf.tpu.experimental.initialize_tpu_system(tpu)
      strategy = tf.distribute.TPUStrategy(tpu)

      @tf.function
      def add_fn(x,y):
          z = x + y
          return z

      x = tf.constant(1.)
      y = tf.constant(1.)
      z = strategy.run(add_fn, args=(x,y))
      print(z)
    resources:
      limits:
        # Request a single Preemptible v2-8 Cloud TPU device to train the model.
        cloud-tpus.google.com/preemptible-v2: 8

Creating the Job

Follow these steps to create the Job in the GKE cluster:

  1. Using a text editor, create a Pod spec file, example-job.yaml, and copy and paste the Pod spec shown previously into it.

  2. Run the Job:

    $ kubectl create -f example-job.yaml
    
    pod "gke-tpu-pod" created

    This command creates the Pod, which GKE schedules automatically.

  3. Verify that the GKE Pod has been scheduled and Cloud TPU nodes have been provisioned. A GKE Pod requesting Cloud TPU nodes can be pending for 5 minutes before running. You will see output similar to the following until the GKE Pod is scheduled.

    $ kubectl get pods -w
    
    NAME          READY     STATUS    RESTARTS   AGE
    gke-tpu-pod   0/1       Pending   0          1m
    

    After approximately 5 minutes, you should see something like this:

    NAME          READY     STATUS              RESTARTS   AGE
    gke-tpu-pod   0/1       Pending             0          21s
    gke-tpu-pod   0/1       Pending             0          2m18s
    gke-tpu-pod   0/1       Pending             0          2m18s
    gke-tpu-pod   0/1       ContainerCreating   0          2m18s
    gke-tpu-pod   1/1       Running             0          2m48s
    gke-tpu-pod   0/1       Completed           0          3m8s
    

    Press Ctrl-C to exit the 'kubectl get' command.

    You can print log information and retrieve more detailed information about each GKE Pod using the following kubectl commands. For example, to see the log output for your GKE Pod, use:

    $ kubectl logs gke-tpu-pod

    You should see output similar to the following:

    2021-09-24 18:55:25.400699: I tensorflow/core/platform/cpu_feature_guard.cc:142]
    This TensorFlow binary is optimized with oneAPI Deep Neural Network Library
    (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    2021-09-24 18:55:25.405947: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272]
    Initialize GrpcChannelCache for job worker -> {0 -> 10.0.16.2:8470}
    2021-09-24 18:55:25.406058: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272]
    Initialize GrpcChannelCache for job localhost -> {0 -> localhost:32769}
    2021-09-24 18:55:28.091729: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272]
    Initialize GrpcChannelCache for job worker -> {0 -> 10.0.16.2:8470}
    2021-09-24 18:55:28.091896: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272]
    Initialize GrpcChannelCache for job localhost -> {0 -> localhost:32769}
    2021-09-24 18:55:28.092579: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:427]
    Started server with target: grpc://localhost:32769
    Tensorflow version 2.12.1
    Running on TPU  ['10.0.16.2:8470']
    PerReplica:{
      0: tf.Tensor(2.0, shape=(), dtype=float32),
      1: tf.Tensor(2.0, shape=(), dtype=float32),
      2: tf.Tensor(2.0, shape=(), dtype=float32),
      3: tf.Tensor(2.0, shape=(), dtype=float32),
      4: tf.Tensor(2.0, shape=(), dtype=float32),
      5: tf.Tensor(2.0, shape=(), dtype=float32),
      6: tf.Tensor(2.0, shape=(), dtype=float32),
      7: tf.Tensor(2.0, shape=(), dtype=float32)
    }
    

    To see a full description of the GKE Pod, use:

    $ kubectl describe pod gke-tpu-pod
    

    See Application Introspection and Debugging for more details.

Build and containerize your model in a Docker image

Refer to build and containerize your own model for details on this process.
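
Once you have built and pushed your image, reference it from the Pod spec in place of the stock TensorFlow image. The following is a minimal sketch; the image path and the /app/train.py entry point are hypothetical placeholders for your own image and training script.

apiVersion: v1
kind: Pod
metadata:
  name: custom-model-pod
  annotations:
    # Must match the TensorFlow version that your image is built on.
    tf-version.cloud-tpus.google.com: "2.12.1"
spec:
  restartPolicy: Never
  containers:
  - name: custom-model-container
    # Hypothetical image path; replace it with the image that you built and
    # pushed to Artifact Registry or Container Registry.
    image: gcr.io/your-project/your-model:latest
    command: ["python", "/app/train.py"]
    resources:
      limits:
        cloud-tpus.google.com/v2: 8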

Enable Cloud TPU support on an existing cluster

To enable Cloud TPU support on an existing GKE cluster, perform the following steps in the Google Cloud CLI:

  1. Enable Cloud TPU support:

    gcloud beta container clusters update cluster-name --enable-tpu
    

    Replace cluster-name with the name of your cluster.

  2. Update the kubeconfig entry:

    gcloud container clusters get-credentials cluster-name
    

Setting a custom CIDR range

By default, GKE allocates a /20 CIDR block for the TPUs provisioned by the cluster. You can specify a custom CIDR range for the Cloud TPU by running the following command:

gcloud beta container clusters update cluster-name \
  --enable-tpu \
  --tpu-ipv4-cidr 10.100.0.0/20

Replace the following:

  • cluster-name: the name of your existing cluster.
  • 10.100.0.0/20: your custom CIDR range.

Using existing CIDR ranges with Shared VPC

Follow the guide on TPU in GKE clusters using a Shared VPC to verify the correct configuration for your Shared VPC.

Disabling Cloud TPU in a cluster

To disable Cloud TPU support on an existing GKE cluster, perform the following steps in the Google Cloud CLI:

  1. Verify that none of your workloads are using Cloud TPU:

    $ kubectl get tpu
    
  2. Disable Cloud TPU support in your cluster:

    $ gcloud beta container clusters update cluster-name --no-enable-tpu
    

    Replace cluster-name with the name of your cluster.

    For zonal clusters this operation takes around 5 minutes; for regional clusters it takes roughly 15 minutes, depending on the cluster's region.

  3. Once the operation completes with no errors, you can verify that the TPUs provisioned by the cluster have been removed:

    $ gcloud compute tpus list
    

    The names of the TPUs created by Cloud TPU have the following format:

    gke-cluster-name-cluster-id-tpu-tpu-id
    

    Replace the following:

    • cluster-name: the name of your existing cluster.
    • cluster-id: the ID of your existing cluster.
    • tpu-id: the ID of the Cloud TPU.

    If any TPUs appear, you can manually delete them by running:

    $ gcloud compute tpus delete gke-cluster-name-cluster-id-tpu-tpu-id
    

Clean up

When you've finished with Cloud TPU on GKE, clean up the resources to avoid incurring extra charges to your Cloud Billing account.

  1. Run the following command to delete your GKE cluster, replacing cluster-name with your cluster name, and project-name with your Google Cloud project name:

    $ gcloud container clusters delete cluster-name \
    --project=project-name --zone=us-central1-b
    
  2. When you've finished examining the data, use the gcloud CLI command to delete the Cloud Storage bucket that you created. Replace bucket-name with the name of your Cloud Storage bucket:

    $ gcloud storage rm gs://bucket-name --recursive