Training Mask R-CNN with Cloud TPU and GKE

This tutorial shows you how to train the Mask RCNN model on Cloud TPU and GKE.

Objectives

  • Create a Cloud Storage bucket to hold your dataset and model output.
  • Create a GKE cluster to manage your Cloud TPU resources.
  • Download a Kubernetes job spec describing the resources needed to train the Mask RCNN model with TensorFlow on a Cloud TPU.
  • Run the job in your GKE cluster to start training the model.
  • Check the logs and the model output.

Costs

This tutorial uses billable components of Google Cloud Platform, including:

  • Compute Engine
  • Cloud TPU
  • Cloud Storage

Use the pricing calculator to generate a cost estimate based on your projected usage. New GCP users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the Project selector page

  3. Make sure that billing is enabled for your Google Cloud Platform project.

    Learn how to enable billing

  4. When you use Cloud TPU with GKE, your project uses billable components of Google Cloud Platform. Check the Cloud TPU pricing and the GKE pricing to estimate your costs, and follow the instructions to clean up resources when you've finished with them.

  5. Enable the APIs required for this tutorial on the GCP Console, or enable them from the command line as shown below.
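
If you prefer the command line, you can enable the relevant APIs from Cloud Shell instead. This is a minimal sketch; it assumes the Compute Engine, GKE, and Cloud TPU APIs are the ones this tutorial needs:

    # Enable the Compute Engine, GKE, and Cloud TPU APIs for the current project.
    $ gcloud services enable compute.googleapis.com \
        container.googleapis.com \
        tpu.googleapis.com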

Create a Cloud Storage bucket

You need a Cloud Storage bucket to store the data you use to train your model and the training results. The section that follows shows how to give the Cloud TPU service account access to the bucket. If you want finer-grain permissions, review the access level permissions.

  1. Go to the Cloud Storage page on the GCP Console.

    Go to the Cloud Storage page

  2. Create a new bucket, specifying the following options:

    • A unique name of your choosing.
    • Default storage class: Regional
    • Location: us-central1
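
Alternatively, you can create the bucket from Cloud Shell with gsutil. A minimal sketch, assuming YOUR-BUCKET is the unique name you chose:

    # Create a regional bucket in us-central1 in your project.
    $ gsutil mb -p YOUR-CLOUD-PROJECT -c regional -l us-central1 gs://YOUR-BUCKET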

Authorize Cloud TPU to access your Cloud Storage bucket

You need to give your Cloud TPU read/write access to your Cloud Storage objects. To do that, you must grant the required access to the service account used by the Cloud TPU. Follow the guide to grant access to your storage bucket.
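
For example, you can grant that access with gsutil from Cloud Shell. This is a sketch only; it assumes the standard Cloud TPU service account format (service-PROJECT_NUMBER@cloud-tpu.iam.gserviceaccount.com) and the Storage Object Admin role, so check the linked guide for the exact roles it recommends:

    # Give the Cloud TPU service account read/write access to your bucket.
    # Replace PROJECT_NUMBER and YOUR-BUCKET with your own values.
    $ gsutil iam ch \
        serviceAccount:service-PROJECT_NUMBER@cloud-tpu.iam.gserviceaccount.com:roles/storage.objectAdmin \
        gs://YOUR-BUCKET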

Create a cluster on GKE

Follow the instructions below to set up your environment and create a GKE cluster with Cloud TPU support, using the gcloud command-line tool.

  1. Open a Cloud Shell window.

    Open Cloud Shell

  2. Specify your GCP project:

    $ gcloud config set project YOUR-CLOUD-PROJECT
    

    where YOUR-CLOUD-PROJECT is the name of your GCP project.

  3. Specify the zone where you plan to use a Cloud TPU resource. For this example, use the us-central1-b zone:

    $ gcloud config set compute/zone us-central1-b
    

    Cloud TPU is available in the following zones:

    US

    TPU type (v2)   TPU v2 cores   Total TPU memory   Available zones
    v2-8            8              64 GiB             us-central1-a, us-central1-b,
                                                      us-central1-c, us-central1-f (TFRC only)
    v2-32 (Beta)    32             256 GiB            us-central1-a
    v2-128 (Beta)   128            1 TiB              us-central1-a
    v2-256 (Beta)   256            2 TiB              us-central1-a
    v2-512 (Beta)   512            4 TiB              us-central1-a

    TPU type (v3)   TPU v3 cores   Total TPU memory   Available zones
    v3-8            8              128 GiB            us-central1-a, us-central1-b,
                                                      us-central1-f (TFRC only)

    Europe

    TPU type (v2)   TPU v2 cores   Total TPU memory   Available zones
    v2-8            8              64 GiB             europe-west4-a
    v2-32 (Beta)    32             256 GiB            europe-west4-a
    v2-128 (Beta)   128            1 TiB              europe-west4-a
    v2-256 (Beta)   256            2 TiB              europe-west4-a
    v2-512 (Beta)   512            4 TiB              europe-west4-a

    TPU type (v3)   TPU v3 cores   Total TPU memory   Available zones
    v3-8            8              128 GiB            europe-west4-a
    v3-32 (Beta)    32             512 GiB            europe-west4-a
    v3-64 (Beta)    64             1 TiB              europe-west4-a
    v3-128 (Beta)   128            2 TiB              europe-west4-a
    v3-256 (Beta)   256            4 TiB              europe-west4-a
    v3-512 (Beta)   512            8 TiB              europe-west4-a
    v3-1024 (Beta)  1024           16 TiB             europe-west4-a
    v3-2048 (Beta)  2048           32 TiB             europe-west4-a

    Asia Pacific

    TPU type (v2)   TPU v2 cores   Total TPU memory   Available zones
    v2-8            8              64 GiB             asia-east1-c
  4. Use the gcloud container clusters command to create a cluster on GKE with support for Cloud TPU. Note that the GKE cluster and its node pools must be created in a zone where Cloud TPU is available, as listed in the tables above. The following command creates a cluster named tpu-models-cluster:

    $ gcloud container clusters create tpu-models-cluster \
      --cluster-version=1.13 \
      --scopes=cloud-platform \
      --enable-ip-alias \
      --enable-tpu
    

    In the above command:

    • --cluster-version=1.13 indicates that the cluster will use the latest Kubernetes 1.13 release. You must use version 1.13.4-gke.5 or later.
    • --scopes=cloud-platform ensures that all nodes in the cluster have access to your Cloud Storage bucket in the GCP project you set as YOUR-CLOUD-PROJECT above. The cluster and the storage bucket must be in the same project. Note that Pods by default inherit the scopes of the nodes to which they are deployed. This flag gives all Pods running in the cluster the cloud-platform scope. If you want to limit access on a per-Pod basis, see the GKE guide to authenticating with service accounts.
    • --enable-ip-alias indicates that the cluster uses alias IP ranges. This is required for using Cloud TPU on GKE.
    • --enable-tpu indicates that the cluster must support Cloud TPU.

    When the command has finished running, a confirmation message appears, similar to this:

    kubeconfig entry generated for tpu-models-cluster.
    NAME                LOCATION       MASTER_VERSION   MASTER_IP      MACHINE_TYPE   NODE_VERSION   NUM_NODES  STATUS
    tpu-models-cluster  us-central1-b  1.13.6-gke.5     35.232.204.86  n1-standard-2  1.13.6-gke.5   3          RUNNING
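
The cluster creation command also generates a kubeconfig entry, so kubectl should already point at the new cluster. You can verify this before creating any jobs, for example:

    # List the cluster's nodes; each node should report a Ready status
    # (the example cluster above has three nodes).
    $ kubectl get nodes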
    

Process the training data

Your next task is to create a job. In Google Kubernetes Engine, a job is a controller object that represents a finite task. The first job you need to create downloads and processes the COCO dataset used to train the Mask RCNN model.

  1. In your shell environment, create a file named download_and_preprocess_coco_k8s.yaml as shown below. Alternatively, you can download this file from GitHub.

    # Download and preprocess the COCO dataset.
    #
    # Instructions:
    #   1. Follow the instructions on https://cloud.google.com/tpu/docs/kubernetes-engine-setup
    #      to create a Kubernetes Engine cluster. The Job must be running at least
    #      on a n1-standard-4 machine.
    #   2. Change the environment variable DATA_BUCKET below to the path of the
    #      Google Cloud Storage bucket where you want to store the training data.
    #   3. Run `kubectl create -f download_and_preprocess_coco_k8s.yaml`.
    
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: download-and-preprocess-coco
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: download-and-preprocess-coco
            # The official TensorFlow 1.13 TPU model image built from https://github.com/tensorflow/tpu/blob/r1.13/tools/docker/Dockerfile.
            image: gcr.io/tensorflow/tpu-models:r1.13
            command:
              - /bin/bash
              - -c
              - >
                cd /tensorflow_tpu_models/tools/datasets &&
                bash download_and_preprocess_coco.sh /scratch-dir &&
                gsutil -m cp /scratch-dir/*.tfrecord ${DATA_BUCKET}/coco &&
                gsutil cp /scratch-dir/raw-data/annotations/*.json ${DATA_BUCKET}/coco
            env:
              # [REQUIRED] Must specify the Google Cloud Storage location where the
              # COCO dataset will be stored.
            - name: DATA_BUCKET
              value: "gs://<my-data-bucket>/data/coco"
            volumeMounts:
            - mountPath: /scratch-dir
              name: scratch-volume
          volumes:
          - name: scratch-volume
            persistentVolumeClaim:
              claimName: scratch-disk-coco
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: scratch-disk-coco
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 100Gi
    
  2. Locate the environment variable named DATA_BUCKET. Update its value to the Cloud Storage path where you want the COCO data stored.

  3. Run the following command to create the job in your cluster:

    kubectl create -f download_and_preprocess_coco_k8s.yaml
    

Jobs can take a few minutes to start. You can view the status of the jobs running on your cluster by using the following command:

kubectl get pods -w

This command returns information about the pods running in the cluster, including the pod created by your job. For example:

NAME                                 READY     STATUS      RESTARTS   AGE
download-and-preprocess-coco-xmx9s   0/1       Pending     0          10s

The -w flag instructs the command to watch for any changes in a pod's status. After a few minutes, the status for your job should update to Running.

If you encounter any issues with your job, you can delete it and try again. To do so, run the kubectl delete -f download_and_preprocess_coco_k8s.yaml command before running the kubectl create command again.
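
Once the pod reaches the Running state, you can follow the job's progress and confirm that it finishes; a short sketch using the job name from the spec above:

    # Stream the preprocessing job's log output.
    $ kubectl logs -f job/download-and-preprocess-coco

    # When the job shows as completed, the TFRecord files and annotation JSON
    # should appear under the coco/ subdirectory of the DATA_BUCKET path you
    # configured (you can confirm this with gsutil ls).
    $ kubectl get jobs download-and-preprocess-coco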

Run the Mask RCNN model

Everything is now in place for you to run the Mask RCNN model using Cloud TPU and GKE.

  1. In your shell environment, create a file named mask_rcnn_k8s.yaml as shown below. Alternatively, you can download this file from GitHub.

    # Train Mask-RCNN with Coco dataset using Cloud TPU and Google Kubernetes Engine.
    #
    # [Training Data]
    #   Download and preprocess the COCO dataset using https://github.com/tensorflow/tpu/blob/r1.13/tools/datasets/download_and_preprocess_coco_k8s.yaml
    #   if you don't already have the data.
    #
    # [Instructions]
    #   1. Follow the instructions on https://cloud.google.com/tpu/docs/kubernetes-engine-setup
    #      to create a Kubernetes Engine cluster.
    #   2. Change the environment variable MODEL_BUCKET in the Job spec to the
    #      Google Cloud Storage location where you want to store the output model.
    #   3. Run `kubectl create -f mask_rcnn_k8s.yaml`.
    
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: mask-rcnn-gke-tpu
    spec:
      template:
        metadata:
          annotations:
            # The Cloud TPUs that will be created for this Job must support
            # TensorFlow 1.13. This version MUST match the TensorFlow version that
            # your model is built on.
            tf-version.cloud-tpus.google.com: "1.13"
        spec:
          restartPolicy: Never
          containers:
          - name: mask-rcnn-gke-tpu
            # The official TensorFlow 1.13 TPU model image built from https://github.com/tensorflow/tpu/blob/r1.13/tools/docker/Dockerfile.
            image: gcr.io/tensorflow/tpu-models:r1.13
            command:
            - /bin/sh
            - -c
            - >
                DEBIAN_FRONTEND=noninteractive apt-get update &&
                DEBIAN_FRONTEND=noninteractive apt-get install -y python-dev python-tk libsm6 libxrender1 libxrender-dev libgtk2.0-dev libxext6 libglib2.0 &&
                pip install Cython matplotlib opencv-python-headless &&
                pip install 'git+https://github.com/cocodataset/cocoapi#egg=pycocotools&subdirectory=PythonAPI' &&
                python /tensorflow_tpu_models/models/official/mask_rcnn/mask_rcnn_main.py
                --model_dir=${MODEL_BUCKET}
                --params_override=iterations_per_loop=500,resnet_checkpoint=${RESNET_CHECKPOINT},resnet_depth=50,precision=bfloat16,train_batch_size=64,eval_batch_size=8,training_file_pattern=${DATA_BUCKET}/train-*,validation_file_pattern=${DATA_BUCKET}/val-*,val_json_file=${DATA_BUCKET}/instances_val2017.json,total_steps=22500
            env:
              # The Google Cloud Storage location to store dataset.
            - name: DATA_BUCKET
              value: "gs://<my-data-bucket>"
            - name: MODEL_BUCKET
              value: "gs://<my-model-bucket>/mask_rcnn"
            - name: RESNET_CHECKPOINT
              value: "gs://cloud-tpu-artifacts/resnet/resnet-nhwc-2018-02-07/model.ckpt-112603"
    
    
              # Point PYTHONPATH to the top level models folder
            - name: PYTHONPATH
              value: "/tensorflow_tpu_models/models"
            resources:
              limits:
                # Request a single v3-8 Cloud TPU device to train the model.
                # A single v3-8 Cloud TPU device consists of 4 chips, each of which
                # has 2 cores, so there are 8 cores in total.
                cloud-tpus.google.com/v3: 8
    
  2. Locate the environment variable named DATA_BUCKET. Update its value to the Cloud Storage path that contains the preprocessed COCO data produced by the previous job.

  3. Locate the environment variable named MODEL_BUCKET. Update its value to the Cloud Storage path where you want the script to save the training output.

  4. Run the following command to create the job in your cluster.

    kubectl create -f mask_rcnn_k8s.yaml
    

    When you run this command, the following confirmation message appears:

    job "mask-rcnn-gke-tpu" created

As mentioned in the Process the training data section, it can take a few minutes for a job to start. View the status of the jobs running on your cluster by using the following command:

kubectl get pods -w

View training status

You can track the training status by using the kubectl command-line utility.

  1. Run the following command to get the status of the job:

    kubectl get pods
    

    During the training process, the status of the pod should be Running.

  2. To get additional information on the training process, run the following command:

    $ kubectl logs POD-NAME

    where POD-NAME is the full name of the pod created by the job, for example, mask-rcnn-gke-tpu-abcd.

    You can also check the output on the GKE Workloads dashboard on the GCP Console.

    Note that it takes a while for the first entry to appear in the logs. You can expect to see something like this:

    I0622 18:14:31.617954 140178400511808 tf_logging.py:116] Calling model_fn.
    I0622 18:14:40.449557 140178400511808 tf_logging.py:116] Create CheckpointSaverHook.
    I0622 18:14:40.697138 140178400511808 tf_logging.py:116] Done calling model_fn.
    I0622 18:14:44.004508 140178400511808 tf_logging.py:116] TPU job name worker
    I0622 18:14:45.254548 140178400511808 tf_logging.py:116] Graph was finalized.
    I0622 18:14:48.346483 140178400511808 tf_logging.py:116] Running local_init_op.
    I0622 18:14:48.506665 140178400511808 tf_logging.py:116] Done running local_init_op.
    I0622 18:14:49.135080 140178400511808 tf_logging.py:116] Init TPU system
    I0622 18:15:00.188153 140178400511808 tf_logging.py:116] Start infeed thread controller
    I0622 18:15:00.188635 140177578452736 tf_logging.py:116] Starting infeed thread controller.
    I0622 18:15:00.188838 140178400511808 tf_logging.py:116] Start outfeed thread controller
    I0622 18:15:00.189151 140177570060032 tf_logging.py:116] Starting outfeed thread controller.
    I0622 18:15:07.316534 140178400511808 tf_logging.py:116] Enqueue next (100) batch(es) of data to infeed.
    I0622 18:15:07.316904 140178400511808 tf_logging.py:116] Dequeue next (100) batch(es) of data from outfeed.
    I0622 18:16:13.881397 140178400511808 tf_logging.py:116] Saving checkpoints for 100 into gs://<my-model-bucket>/mask_rcnn/model.ckpt.
    I0622 18:16:21.147114 140178400511808 tf_logging.py:116] loss = 1.589756, step = 0
    I0622 18:16:21.148168 140178400511808 tf_logging.py:116] loss = 1.589756, step = 0
    I0622 18:16:21.150870 140178400511808 tf_logging.py:116] Enqueue next (100) batch(es) of data to infeed.
    I0622 18:16:21.151168 140178400511808 tf_logging.py:116] Dequeue next (100) batch(es) of data from outfeed.
    I0622 18:17:00.739207 140178400511808 tf_logging.py:116] Enqueue next (100) batch(es) of data to infeed.
    I0622 18:17:00.739809 140178400511808 tf_logging.py:116] Dequeue next (100) batch(es) of data from outfeed.
    I0622 18:17:36.598773 140178400511808 tf_logging.py:116] global_step/sec: 2.65061
    I0622 18:17:37.040504 140178400511808 tf_logging.py:116] examples/sec: 2698.56
    I0622 18:17:37.041333 140178400511808 tf_logging.py:116] loss = 2.63023, step = 200 (75.893 sec)
    
  3. View the trained model checkpoints (the model.ckpt-* files) in the output location you set as MODEL_BUCKET:

    gsutil ls -r gs://<my-model-bucket>/mask_rcnn
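
You can also inspect training curves with TensorBoard by pointing it at the model directory. A minimal sketch, assuming TensorBoard is available in your environment and can read from Cloud Storage:

    # Launch TensorBoard against the training output directory.
    $ tensorboard --logdir=gs://<my-model-bucket>/mask_rcnn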

Clean up

When you've finished with Cloud TPU on GKE, clean up the resources to avoid incurring extra charges to your Google Cloud Platform account.

If you haven't set the project and zone for this session, do so now. See the instructions earlier in this guide. Then follow this cleanup procedure:

  1. Run the following command to delete your GKE cluster, tpu-models-cluster, replacing YOUR-PROJECT with your GCP project name:

    $ gcloud container clusters delete tpu-models-cluster --project=YOUR-PROJECT
    
  2. When you've finished examining the data, use the gsutil command to delete the Cloud Storage bucket you created during this tutorial. Replace YOUR-BUCKET with the name of your Cloud Storage bucket:

    $ gsutil rm -r gs://YOUR-BUCKET
    

    See the Cloud Storage pricing guide for free storage limits and other pricing information.
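
To confirm that the resources are gone, you can list your remaining clusters and buckets; for example:

    # Neither command should still show the tutorial's cluster or bucket.
    $ gcloud container clusters list
    $ gsutil ls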
