Training Mask RCNN on Cloud TPU

This tutorial shows you how to train the Mask RCNN model on Cloud TPU.

Overview

This tutorial demonstrates how to run the Mask RCNN model using Cloud TPU. The Mask RCNN model is a deep neural network that addresses one of the more difficult vision challenges: instance segmentation, the task of detecting and distinguishing multiple objects within a single image.

To address instance segmentation, the Mask RCNN model generates bounding boxes and segmentation masks for each instance of an object in the image. The model is based on the Feature Pyramid Network (FPN) and a ResNet50 backbone.

This tutorial uses tf.contrib.tpu.TPUEstimator to train the model. The TPU Estimator API is a high-level TensorFlow API and is the recommended way to build and run a machine learning model on Cloud TPU. The API simplifies the model development process by hiding most of the low-level implementation, which makes it easier to switch between TPU and other platforms such as GPU or CPU.

Before you begin

Before starting this tutorial, check that your Google Cloud Platform project is correctly set up.

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a Google Cloud Platform project.

    Go to the Manage resources page

  3. Make sure that billing is enabled for your Google Cloud Platform project.

    Learn how to enable billing

  4. Verify that you have sufficient quota to use either TPU devices or Pods.

Set up your resources

This section provides information on setting up the Cloud Storage bucket, VM, and Cloud TPU resources for this tutorial.

Create a Cloud Storage bucket

You need a Cloud Storage bucket to store the data you use to train your model and the training results. The ctpu up tool used in this tutorial sets up default permissions for the Cloud TPU service account. If you want finer-grained permissions, review the access level permissions.

The bucket you create must reside in the same region as your virtual machine (VM) and your Cloud TPU device or Cloud TPU slice (multiple TPU devices).

  1. Go to the Cloud Storage page on the GCP Console.

    Go to the Cloud Storage page

  2. Create a new bucket, specifying the following options:

    • A unique name of your choosing.
    • Default storage class: Regional
    • Location: If you want to use a Cloud TPU device, accept the default presented. If you want to use a Cloud TPU Pod slice, you must specify a region where Cloud TPU Pods are available.

Use the ctpu tool

This section demonstrates how to use the Cloud TPU provisioning tool (ctpu) to create and manage Cloud TPU project resources. The resources consist of a virtual machine (VM) and a Cloud TPU resource that share the same name. These resources must reside in the same region/zone as the bucket you just created.

You can also set up your VM and TPU resources using gcloud commands or through the Cloud Console. For more information, see Managing VM and TPU Resources.

Run ctpu up to create resources

  1. Open a Cloud Shell window.

    Open Cloud Shell

  2. Run ctpu up, specifying the flags shown for either a Cloud TPU device or Pod slice. Refer to the CTPU Reference for flag options and descriptions.

  3. Set up either a Cloud TPU device or a Pod slice:

TPU Device

Set up a Cloud TPU device:

$ ctpu up --machine-type n1-standard-8 --tpu-size v3-8

The following configuration message appears:

ctpu will use the following configuration:

Name: [your TPU's name]
Zone: [your project's zone]
GCP Project: [your project's name]
TensorFlow Version: 1.13
VM:
  Machine Type: [your machine type]
  Disk Size: [your disk size]
  Preemptible: [true or false]
Cloud TPU:
  Size: [your TPU size]
  Preemptible: [true or false]

OK to create your Cloud TPU resources with the above configuration? [Yn]:

Press y to create your Cloud TPU resources.

TPU Pod

Set up a VM and a Cloud TPU Pod slice in the zone you are working in:

$ ctpu up --zone=us-central1-a --tpu-size=v2-128 --machine-type n1-standard-8

The following configuration message appears:

ctpu will use the following configuration:

Name: [your TPU's name]
Zone: [your project's zone]
GCP Project: [your project's name]
TensorFlow Version: 1.13
VM:
  Machine Type: [your machine type]
  Disk Size: [your disk size]
  Preemptible: [true or false]
Cloud TPU:
  Size: [your TPU size]
  Preemptible: [true or false]

OK to create your Cloud TPU resources with the above configuration? [Yn]:

Press y to create your Cloud TPU resources.

The ctpu up command creates a virtual machine (VM) and your Cloud TPU resource.

From this point on, a prefix of (vm)$ means you should run the command on the Compute Engine VM instance.

Verify your Compute Engine VM

When the ctpu up command has finished executing, verify that your shell prompt is username@tpuname, which shows you are logged into your Compute Engine VM.
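Beyond checking the prompt, you can also print the user and host names directly; this is a generic shell check, not something specific to ctpu:

```shell
# Print user@host; on the Compute Engine VM this should match the
# username@tpuname pair shown in your shell prompt.
echo "$(whoami)@$(hostname)"
```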

Install extra packages

The Mask RCNN training application requires several extra packages. Install them now:

(vm)$ sudo apt-get install -y python-tk && \
pip install Cython matplotlib opencv-python-headless pyyaml Pillow && \
pip install 'git+https://github.com/cocodataset/cocoapi#egg=pycocotools&subdirectory=PythonAPI'

Update the keepalive values of your VM connection

This tutorial requires a long-lived connection to the Compute Engine instance. To ensure you aren't disconnected from the instance, run the following command:

(vm)$ sudo /sbin/sysctl \
       -w net.ipv4.tcp_keepalive_time=60 \
       net.ipv4.tcp_keepalive_intvl=60 \
       net.ipv4.tcp_keepalive_probes=5
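To confirm the new values took effect, you can read them back from the kernel's /proc interface (these are the standard Linux paths for the sysctls set above):

```shell
# Read back the keepalive settings. Once the sysctl command above has
# run, these should print 60, 60, and 5 respectively.
cat /proc/sys/net/ipv4/tcp_keepalive_time
cat /proc/sys/net/ipv4/tcp_keepalive_intvl
cat /proc/sys/net/ipv4/tcp_keepalive_probes
```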

Verify the Mask RCNN model files

Run the following command to get the latest Mask RCNN model files.

(vm)$ git clone https://github.com/tensorflow/tpu/

Define parameter values

Next, you need to define several parameter values. You use these parameters to train and evaluate your model.

The variables you need to set are:

  • STORAGE_BUCKET. This is the name of the Cloud Storage bucket that you created in the Create a Cloud Storage bucket section.
  • GCS_MODEL_DIR. This is the directory that contains the model files. This tutorial uses a folder within the Cloud Storage bucket. You do not have to create this folder beforehand. The script creates the folder if it does not already exist.
  • CHECKPOINT. This variable specifies a pre-trained checkpoint. The Mask RCNN model requires a pre-trained image classification model, such as Resnet, to be used as a backbone network. This tutorial uses a pre-trained checkpoint created with the ResNet demonstration model. You can also train your own ResNet model and specify a checkpoint from your ResNet model directory.
  • BACKBONE (TPU device only). The backbone network for Mask RCNN. In this tutorial, you'll set the backbone to resnet50.
  • PATH_GCS_MASKRCNN. This is the path in your Cloud Storage bucket where you want to store your model training artifacts. As with the GCS_MODEL_DIR variable, this tutorial uses a folder in the Cloud Storage bucket.

To define these variables, you can use the export command to create multiple bash variables and use them in a configuration string. If you're using a single TPU device, you can set these variables using a config.yaml file.

TPU Device

Using bash variables:

(vm)$ export STORAGE_BUCKET=gs://[YOUR_BUCKET_NAME]
    
(vm)$ export GCS_MODEL_DIR=${STORAGE_BUCKET}/mask-rcnn-model; \
export CHECKPOINT=gs://cloud-tpu-artifacts/resnet/resnet-nhwc-2018-02-07/model.ckpt-112603; \
export BACKBONE='resnet50'; \
export PATH_GCS_MASKRCNN=${STORAGE_BUCKET}/coco
    

Using a config.yaml file:

checkpoint: gs://cloud-tpu-artifacts/resnet/resnet-nhwc-2018-02-07/model.ckpt-112603
backbone: 'resnet50'
use_bfloat16: True
train_batch_size: 64
eval_batch_size: 8
training_file_pattern: gs://[YOUR_BUCKET_NAME]/train-*
validation_file_pattern: gs://[YOUR_BUCKET_NAME]/val-*
val_json_file: gs://[YOUR_BUCKET_NAME]/instances_val2017.json
total_steps: 22500
      

TPU Pod

(vm)$ export STORAGE_BUCKET=gs://[YOUR_BUCKET_NAME]
    
(vm)$ export GCS_MODEL_DIR=${STORAGE_BUCKET}/mask-rcnn-model; \
export CHECKPOINT=gs://cloud-tpu-artifacts/resnet/resnet-nhwc-2018-02-07/model.ckpt-112603; \
export PATH_GCS_MASKRCNN=${STORAGE_BUCKET}/coco
    

Prepare your data

The Mask RCNN model trains using COCO. COCO is a large-scale object detection, segmentation, and captioning dataset. In this step, you convert this dataset into a set of TFRecords (*.tfrecord) that the training application can use. To convert the dataset, you use the tpu/tools/datasets/download_and_preprocess_coco.sh script.

(vm)$ cd tpu/tools/datasets && \
bash download_and_preprocess_coco.sh ./data/dir/coco
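The conversion script writes sharded TFRecord files whose names match the train-* and val-* patterns used later in the training_file_pattern and validation_file_pattern settings. The snippet below only mimics those names with empty files to show what the glob patterns match; the shard counts are illustrative:

```shell
# Illustrative only: create empty files that mimic the sharded TFRecord
# naming, then show which files the train-* glob picks up.
demo_dir=$(mktemp -d)
touch "${demo_dir}/train-00000-of-00256.tfrecord"
touch "${demo_dir}/train-00001-of-00256.tfrecord"
touch "${demo_dir}/val-00000-of-00032.tfrecord"
ls "${demo_dir}"/train-*
```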

Copy your data to your Cloud Storage bucket

After you convert the data into TFRecords, copy them to your Cloud Storage bucket using the gsutil command.

(vm)$ gsutil -m cp ./data/dir/coco/*.tfrecord ${STORAGE_BUCKET}/coco && \
gsutil cp ./data/dir/coco/raw-data/annotations/*.json ${STORAGE_BUCKET}/coco

Note that you must also copy the annotation files. These files are used to validate the model's performance.

Run the training and evaluation

You are now ready to run the model on the preprocessed COCO data. To run the model, you use the mask_rcnn_main.py script.

TPU Device

Using a TPU device, you can run both the training and evaluation at the same time, using either bash variables or a config.yaml file.

Using bash variables:

(vm)$ python ~/tpu/models/official/mask_rcnn/mask_rcnn_main.py \
    --use_tpu=True \
    --tpu=${TPU_NAME:?} \
    --model_dir=${GCS_MODEL_DIR:?} \
    --num_cores=8 \
    --mode="train_and_eval" \
    --config="checkpoint=${CHECKPOINT},backbone=${BACKBONE},use_bfloat16=true,train_batch_size=64,eval_batch_size=8,training_file_pattern=${PATH_GCS_MASKRCNN:?}/train-*,validation_file_pattern=${PATH_GCS_MASKRCNN:?}/val-*,val_json_file=${PATH_GCS_MASKRCNN:?}/instances_val2017.json,total_steps=22500"

Using a config.yaml file:

(vm)$ python ~/tpu/models/official/mask_rcnn/mask_rcnn_main.py \
--use_tpu=True \
--tpu=${TPU_NAME:?} \
--model_dir="gs://[YOUR_BUCKET_NAME]/mask-rcnn-model" \
--config="[PATH_TO_CONFIG_FILE]" \
--mode="train_and_eval"
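The --config flag takes a single comma-separated string of key=value overrides. Because the string is long, it can help to assemble it from your exported variables and echo it before launching training. The values below are placeholders for illustration; substitute your own bucket name:

```shell
# Placeholder values for illustration only.
export STORAGE_BUCKET=gs://my-bucket
export CHECKPOINT=gs://cloud-tpu-artifacts/resnet/resnet-nhwc-2018-02-07/model.ckpt-112603
export BACKBONE=resnet50
export PATH_GCS_MASKRCNN=${STORAGE_BUCKET}/coco

# Assemble the override string in the same form it is passed to --config,
# then print one key=value pair per line so each entry is easy to check.
CONFIG="checkpoint=${CHECKPOINT},backbone=${BACKBONE},use_bfloat16=true,train_batch_size=64,eval_batch_size=8,training_file_pattern=${PATH_GCS_MASKRCNN}/train-*,validation_file_pattern=${PATH_GCS_MASKRCNN}/val-*,val_json_file=${PATH_GCS_MASKRCNN}/instances_val2017.json,total_steps=22500"
echo "${CONFIG}" | tr ',' '\n'
```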

TPU Pod

When using a Cloud TPU Pod, you first train the model on the Pod, and then launch a single TPU device to run the evaluation.

  1. Start the training script.

    (vm)$ python ~/tpu/models/official/mask_rcnn/mask_rcnn_main.py  \
    --tpu=${TPU_NAME} \
    --config=${CONFIG_SCRIPT} \
    --iterations_per_loop=500 \
    --mode=train  \
    --model_dir=${GCS_MODEL_DIR} \
    --num_cores=128  \
    --use_tpu
    
  2. After the training completes, exit the TPU pod.

    (vm)$ exit
  3. Launch a new TPU device, mask-rcnn-eval.

    $ ctpu up --tpu-size=v3-8 --name=mask-rcnn-eval --zone=us-central1-a
  4. Create your environment variables.

    (vm)$ export STORAGE_BUCKET=gs://[YOUR_BUCKET_NAME]
    (vm)$ export GCS_MODEL_DIR=${STORAGE_BUCKET}/mask-rcnn-model; \
    export CHECKPOINT=gs://cloud-tpu-artifacts/resnet/resnet-nhwc-2018-02-07/model.ckpt-112603; \
    export PATH_GCS_MASKRCNN=${STORAGE_BUCKET}/coco
  5. Clone the Mask RCNN repository.

    (vm)$ git clone https://github.com/tensorflow/tpu/
  6. Install the packages necessary to run the evaluation.

    (vm)$ sudo apt-get install -y python-tk && \
    pip install Cython matplotlib opencv-python-headless pyyaml && \
    pip install 'git+https://github.com/cocodataset/cocoapi#egg=pycocotools&subdirectory=PythonAPI'
    
  7. Start the evaluation.

    (vm)$ python ~/tpu/models/official/mask_rcnn/mask_rcnn_main.py \
    --tpu=mask-rcnn-eval \
    --config="use_bfloat16=True,init_learning_rate=0.28,learning_rate_levels=[0.028, 0.0028, 0.00028],learning_rate_steps=[6000, 8000, 10000],momentum=0.95,num_batch_norm_group=1,num_steps_per_eval=500,global_gradient_clip_ratio=0.02,checkpoint=${CHECKPOINT},total_steps=11250,train_batch_size=512,training_file_pattern=${PATH_GCS_MASKRCNN}/train-*,val_json_file=${PATH_GCS_MASKRCNN}/instances_val2017.json,validation_file_pattern=${PATH_GCS_MASKRCNN}/val-*,warmup_steps=1864" \
    --iterations_per_loop=500 \
    --mode=eval \
    --model_dir=${GCS_MODEL_DIR} \
    --use_tpu

Clean up

To avoid incurring charges to your GCP account for the resources used in this tutorial:

  1. Disconnect from the Compute Engine VM:

    (vm)$ exit
    

    Your prompt should now be user@projectname, showing you are in the Cloud Shell.

  2. In your Cloud Shell, run ctpu delete with the --zone flag you used when you set up the Cloud TPU to delete your Compute Engine VM and your Cloud TPU:

    $ ctpu delete [optional: --zone]
    
  3. Run ctpu status to make sure you have no instances allocated to avoid unnecessary charges for TPU usage. The deletion might take several minutes. A response like the one below indicates there are no more allocated instances:

    2018/04/28 16:16:23 WARNING: Setting zone to "us-central1-b"
    No instances currently exist.
            Compute Engine VM:     --
            Cloud TPU:             --
    
  4. Run gsutil as shown, replacing YOUR-BUCKET-NAME with the name of the Cloud Storage bucket you created for this tutorial:

    $ gsutil rm -r gs://YOUR-BUCKET-NAME
    

What's next

Learn how to run TensorBoard to visualize and analyze program performance. For more information, see Setting up TensorBoard.
