Training ResNet with Cloud TPU and GKE

This tutorial shows you how to train the TensorFlow ResNet-50 model on Cloud TPU and GKE.

In summary, the tutorial leads you through the following steps to run the model, using a fake dataset provided for testing purposes:

  • Create a Cloud Storage bucket to hold your model output.
  • Create a GKE cluster to manage your Cloud TPU resources.
  • Download a Kubernetes Job spec describing the resources needed to train ResNet-50 with TensorFlow on a Cloud TPU.
  • Run the Job in your GKE cluster to start training the model.
  • Check the logs and the model output.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the Manage resources page

  3. Make sure that billing is enabled for your project.

    Learn how to enable billing

  4. When you use Cloud TPU with GKE, your project uses billable components of Google Cloud Platform. Check the Cloud TPU pricing and the GKE pricing to estimate your costs, and follow the instructions to clean up resources when you've finished with them.

  5. Enable the following APIs on the GCP Console:

Choose a shell and install your command-line tools if necessary

You can use either Cloud Shell or your local shell to complete this tutorial. Cloud Shell comes preinstalled with the gcloud and kubectl command-line tools. gcloud is the command-line interface for GCP, and kubectl provides the command-line interface for running commands against Kubernetes clusters.

If you prefer using your local shell, you must install the gcloud and kubectl command-line tools in your environment.

Cloud Shell

To launch Cloud Shell, perform the following steps:

  1. Go to the Google Cloud Platform Console.

  2. Click the Activate Cloud Shell button at the top-right corner of the console.

A Cloud Shell session opens inside a frame at the bottom of the console. Use this shell to run gcloud and kubectl commands.

Local Shell

To install gcloud and kubectl, perform the following steps:

  1. Install the Google Cloud SDK, which includes the gcloud command-line tool.

  2. Install the kubectl command-line tool by running the following command:

    $ gcloud components install kubectl

  3. Install the gcloud beta components, which you need for running the Beta version of GKE with Cloud TPU.

    $ gcloud components install beta
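
Whether you installed the tools locally or use Cloud Shell, you can confirm that gcloud and kubectl are available before continuing. The following commands only print version information and make no changes:

    $ gcloud version
    $ kubectl version --client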

Set up your data and your storage bucket

You can train your model using either the fake dataset provided with this tutorial or the full ImageNet data. Either way, you need to set up a Cloud Storage bucket as described below.

Use the fake data set or the ImageNet data

The instructions below assume that you want to use a randomly generated fake dataset to test the model. Alternatively, you can use the full ImageNet dataset.

The fake dataset is at this location on Cloud Storage:

gs://cloud-tpu-test-datasets/fake_imagenet

Note that the fake dataset is useful only for understanding how to use a Cloud TPU and for validating end-to-end performance. The accuracy numbers and the saved model will not be meaningful.
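
If you want to confirm that you can reach the dataset from your shell, you can list its contents with gsutil. This is a read-only check and incurs no training cost:

    $ gsutil ls gs://cloud-tpu-test-datasets/fake_imagenet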

Create a Cloud Storage bucket

You need a Cloud Storage bucket to store the results of training your machine learning model. If you decide to use real training data rather than the fake dataset provided with this tutorial, you can store the data in the same bucket.

  1. Go to the Cloud Storage page on the GCP Console.

    Go to the Cloud Storage page

  2. Create a new bucket, specifying the following options:

    • A unique name of your choosing.
    • Default storage class: Regional
    • Location: us-central1
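
If you prefer the command line, you can create a bucket with the same options using gsutil. This is a sketch; replace YOUR-BUCKET with the unique name you chose (the Regional storage class and us-central1 location match the console settings above):

    $ gsutil mb -c regional -l us-central1 gs://YOUR-BUCKET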

Authorize Cloud TPU to access your Cloud Storage bucket

You need to give your Cloud TPU read/write access to your Cloud Storage objects. To do that, you must grant the required access to the service account used by the Cloud TPU. Follow the guide to grant access to your storage bucket.
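
As an alternative to the console flow in that guide, you can add the binding from the command line. This is a minimal sketch that makes two assumptions: that your Cloud TPU service account has the default form service-PROJECT_NUMBER@cloud-tpu.iam.gserviceaccount.com (you can look up the project number with gcloud projects describe YOUR-CLOUD-PROJECT), and that granting roles/storage.objectAdmin on the bucket is acceptable for your project; the linked guide describes the recommended roles:

    # Assumed service-account format and role; verify against the storage-access guide.
    $ gsutil iam ch \
        serviceAccount:service-PROJECT_NUMBER@cloud-tpu.iam.gserviceaccount.com:roles/storage.objectAdmin \
        gs://YOUR-BUCKET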

Create a cluster on GKE

You can create a GKE cluster by using the GCP Console or the gcloud command-line tool. Select an option below to see the relevant instructions:

console

Follow these instructions to create a GKE cluster with Cloud TPU support:

  1. Go to the GKE page on the GCP Console.

    Go to the GKE page

  2. Click Create cluster.

  3. Specify a Name for your cluster. The name must be unique within the project and zone. For example, tpu-models-cluster.
  4. Specify the Zone where you plan to use a Cloud TPU resource. For example, select the us-central1-b zone.

    Cloud TPU is available in the following zones:

    US

    • Cloud TPU v2 and Preemptible v2: us-central1-b, us-central1-c, us-central1-f (TFRC program only)
    • Cloud TPU v3 (beta) and Preemptible v3 (beta): us-central1-b, us-central1-f (TFRC program only)

    Europe

    • Cloud TPU v2 and Preemptible v2: europe-west4-a
    • Cloud TPU v3 (beta) and Preemptible v3 (beta): europe-west4-a

    Asia Pacific

    • Cloud TPU v2 and Preemptible v2: asia-east1-c

  5. Ensure that the Cluster Version is set to 1.10.4-gke.2 or later, which is required for Cloud TPU support.

  6. Scroll down to the bottom of the page and click More.
  7. Enable VPC-native (using alias IP).
  8. Enable Cloud TPU (beta).
  9. Set Access scopes to Allow full access to all Cloud APIs. This ensures that all nodes in the cluster have access to your Cloud Storage bucket. The cluster and the storage bucket must be in the same project for this to work. Note that Pods by default inherit the scopes of the nodes to which they are deployed. If you want to limit access on a per-Pod basis, see the GKE guide to authenticating with service accounts.
  10. Configure the remaining options for your cluster as desired. You can leave them at their default values.
  11. Click Create.

gcloud

Follow the instructions below to set up your environment and create a GKE cluster with Cloud TPU support, using the gcloud command-line tool:

  1. Specify your GCP project:

    $ gcloud config set project YOUR-CLOUD-PROJECT
    

    where YOUR-CLOUD-PROJECT is the name of your GCP project.

  2. Specify the zone where you plan to use a Cloud TPU resource. For this example, use the us-central1-b zone:

    $ gcloud config set compute/zone us-central1-b
    

    Cloud TPU is available in the following zones:

    US

    • Cloud TPU v2 and Preemptible v2: us-central1-b, us-central1-c, us-central1-f (TFRC program only)
    • Cloud TPU v3 (beta) and Preemptible v3 (beta): us-central1-b, us-central1-f (TFRC program only)

    Europe

    • Cloud TPU v2 and Preemptible v2: europe-west4-a
    • Cloud TPU v3 (beta) and Preemptible v3 (beta): europe-west4-a

    Asia Pacific

    • Cloud TPU v2 and Preemptible v2: asia-east1-c

  3. Use the gcloud beta container clusters command to create a cluster on GKE with support for Cloud TPU. Note that the GKE cluster and its node pools must be created in a zone where Cloud TPU is available, as listed in the previous step. The following command creates a cluster named tpu-models-cluster:

    $ gcloud beta container clusters create tpu-models-cluster \
      --cluster-version=1.10 \
      --scopes=cloud-platform \
      --enable-ip-alias \
      --enable-tpu
    

    In the above command:

    • --cluster-version=1.10 indicates that the cluster will use the latest Kubernetes 1.10 release. You must use version 1.10.4-gke.2 or later.
    • --scopes=cloud-platform ensures that all nodes in the cluster have access to your Cloud Storage bucket in the GCP project you set as YOUR-CLOUD-PROJECT above. The cluster and the storage bucket must be in the same project for this to work. Note that Pods by default inherit the scopes of the nodes to which they are deployed. Therefore, --scopes=cloud-platform gives all Pods running in the cluster the cloud-platform scope. If you want to limit access on a per-Pod basis, see the GKE guide to authenticating with service accounts.
    • --enable-ip-alias indicates that the cluster uses alias IP ranges. This is required for using Cloud TPU on GKE.
    • --enable-tpu indicates that the cluster must support Cloud TPU.

    When the command has finished running, a confirmation message similar to this appears:

      kubeconfig entry generated for tpu-models-cluster.
      NAME                LOCATION       MASTER_VERSION  MASTER_IP     MACHINE_TYPE    NODE_VERSION  NUM_NODES  STATUS
      tpu-models-cluster  us-central1-b  1.10.4-gke.2    35.232.204.86  n1-standard-2  1.10.4-gke.2  3          RUNNING
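
    The same command generates the kubeconfig entry shown above, so kubectl should already point at the new cluster. As a quick sanity check, list the nodes (if kubectl is not yet configured for this cluster, run gcloud container clusters get-credentials tpu-models-cluster first):

    $ kubectl get nodes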
    

Run the ResNet-50 model

Everything is now in place for you to run the ResNet-50 model using Cloud TPU and GKE.

  1. Create a Job spec in a file named resnet_k8s.yaml:

    • Download or copy the prepared Job spec from GitHub.
    • In the Job spec, change <my-model-bucket> to the name of the Cloud Storage bucket you created earlier.

    Note that the Job spec references the TensorFlow TPU models that are available in a Docker container at gcr.io/tensorflow/tpu-models. (That's a location on Container Registry.)
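
    If you downloaded the spec to your working directory, one way to substitute your bucket name is with sed. This sketch assumes GNU sed (on macOS, use sed -i '' instead) and that YOUR-BUCKET is the bucket you created earlier:

    $ sed -i 's/<my-model-bucket>/YOUR-BUCKET/g' resnet_k8s.yaml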

  2. Create the Job in the GKE cluster:

    $ kubectl create -f resnet_k8s.yaml
    job "resnet-tpu" created

  3. Wait for the Job to be scheduled.

    $ kubectl get pods -w
    
    NAME               READY     STATUS    RESTARTS   AGE
    resnet-tpu-cmvlf   0/1       Pending   0          1m
    

    The lifetime of Cloud TPU nodes is bound to the Pods that request them. The Cloud TPU is created on demand when the Pod is scheduled, and recycled when the Pod is deleted.

    Pod scheduling takes about 5 minutes to finish. After that, you should see something like this:

    NAME               READY     STATUS    RESTARTS   AGE
    resnet-tpu-cmvlf   1/1       Running   0          6m
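
    If the Pod stays Pending for much longer than that, inspect the scheduling events. Substitute your own Pod name, which will differ from the example shown here:

    $ kubectl describe pod resnet-tpu-cmvlf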
    
  4. Check the Pod logs to see how the job is doing:

    $ kubectl logs resnet-tpu-cmvlf

    You can also check the output on the GKE Workloads dashboard on the GCP Console.
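
    If you prefer to follow the logs as new entries arrive, kubectl can stream them:

    $ kubectl logs -f resnet-tpu-cmvlf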

    Note that it takes a while for the first entry to appear in the logs. You can expect to see something like this:

    I0622 18:14:31.617954 140178400511808 tf_logging.py:116] Calling model_fn.
    I0622 18:14:40.449557 140178400511808 tf_logging.py:116] Create CheckpointSaverHook.
    I0622 18:14:40.697138 140178400511808 tf_logging.py:116] Done calling model_fn.
    I0622 18:14:44.004508 140178400511808 tf_logging.py:116] TPU job name worker
    I0622 18:14:45.254548 140178400511808 tf_logging.py:116] Graph was finalized.
    I0622 18:14:48.346483 140178400511808 tf_logging.py:116] Running local_init_op.
    I0622 18:14:48.506665 140178400511808 tf_logging.py:116] Done running local_init_op.
    I0622 18:14:49.135080 140178400511808 tf_logging.py:116] Init TPU system
    I0622 18:15:00.188153 140178400511808 tf_logging.py:116] Start infeed thread controller
    I0622 18:15:00.188635 140177578452736 tf_logging.py:116] Starting infeed thread controller.
    I0622 18:15:00.188838 140178400511808 tf_logging.py:116] Start outfeed thread controller
    I0622 18:15:00.189151 140177570060032 tf_logging.py:116] Starting outfeed thread controller.
    I0622 18:15:07.316534 140178400511808 tf_logging.py:116] Enqueue next (100) batch(es) of data to infeed.
    I0622 18:15:07.316904 140178400511808 tf_logging.py:116] Dequeue next (100) batch(es) of data from outfeed.
    I0622 18:16:13.881397 140178400511808 tf_logging.py:116] Saving checkpoints for 100 into gs://<my-model-bucket>/resnet/model.ckpt.
    I0622 18:16:21.147114 140178400511808 tf_logging.py:116] loss = 1.589756, step = 0
    I0622 18:16:21.148168 140178400511808 tf_logging.py:116] loss = 1.589756, step = 0
    I0622 18:16:21.150870 140178400511808 tf_logging.py:116] Enqueue next (100) batch(es) of data to infeed.
    I0622 18:16:21.151168 140178400511808 tf_logging.py:116] Dequeue next (100) batch(es) of data from outfeed.
    I0622 18:17:00.739207 140178400511808 tf_logging.py:116] Enqueue next (100) batch(es) of data to infeed.
    I0622 18:17:00.739809 140178400511808 tf_logging.py:116] Dequeue next (100) batch(es) of data from outfeed.
    I0622 18:17:36.598773 140178400511808 tf_logging.py:116] global_step/sec: 2.65061
    I0622 18:17:37.040504 140178400511808 tf_logging.py:116] examples/sec: 2698.56
    I0622 18:17:37.041333 140178400511808 tf_logging.py:116] loss = 2.63023, step = 200 (75.893 sec)
    
  5. View the trained model at gs://<my-model-bucket>/resnet/model.ckpt. You can see your buckets on the Cloud Storage browser page on the GCP Console.
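
    To confirm that the checkpoint files were written, you can list the output directory with gsutil, replacing <my-model-bucket> with your bucket name:

    $ gsutil ls gs://<my-model-bucket>/resnet/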

Clean up

When you've finished with Cloud TPU on GKE, clean up the resources to avoid incurring extra charges to your Google Cloud Platform account.

console

Delete your GKE cluster:

  1. Go to the GKE page on the GCP Console.

    Go to the GKE page

  2. Select the checkbox next to the cluster that you want to delete.

  3. Click Delete.

When you've finished examining the data, delete the Cloud Storage bucket that you created during this tutorial:

  1. Go to the Cloud Storage page on the GCP Console.

    Go to the Cloud Storage page

  2. Select the checkbox next to the bucket that you want to delete.

  3. Click Delete.

See the Cloud Storage pricing guide for free storage limits and other pricing information.

gcloud

If you haven't set the project and zone for this session, do so now. See the instructions earlier in this guide. Then follow this cleanup procedure:

  1. Run the following command to delete your GKE cluster, tpu-models-cluster, replacing YOUR-PROJECT with your GCP project name:

    $ gcloud container clusters delete tpu-models-cluster --project=YOUR-PROJECT
    

  2. When you've finished examining the data, use the gsutil command to delete the Cloud Storage bucket you created during this tutorial. Replace YOUR-BUCKET with the name of your Cloud Storage bucket:

    $ gsutil rm -r gs://YOUR-BUCKET
    

See the Cloud Storage pricing guide for free storage limits and other pricing information.
