Training Mask RCNN on Cloud TPU

Overview

This tutorial demonstrates how to run the Mask RCNN model using Cloud TPU with the COCO dataset.

Mask RCNN is a deep neural network designed to address object detection and image segmentation, two of the more difficult computer vision challenges.

The Mask RCNN model generates bounding boxes and segmentation masks for each instance of an object in the image. The model is based on the Feature Pyramid Network (FPN) and a ResNet50 backbone.
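
For orientation, the FPN pattern can be sketched in a few lines. The following is a minimal, hypothetical illustration in TensorFlow 1.x, not the tutorial's implementation: lateral 1x1 convolutions project the backbone feature maps to a common depth, a top-down pathway upsamples and adds them, and 3x3 convolutions smooth the merged maps.

    import tensorflow as tf

    def fpn(backbone_features, depth=256):
        # backbone_features: ResNet feature maps [C2, C3, C4, C5], finest first.
        # Project every level to a common channel depth with 1x1 convolutions.
        laterals = [tf.layers.conv2d(c, depth, 1) for c in backbone_features]
        # Top-down pathway: start at the coarsest level and work downward,
        # upsampling and adding the corresponding lateral connection.
        pyramid = [laterals[-1]]
        for lateral in reversed(laterals[:-1]):
            upsampled = tf.image.resize_nearest_neighbor(
                pyramid[0], tf.shape(lateral)[1:3])
            pyramid.insert(0, lateral + upsampled)
        # Smooth each merged map with a 3x3 convolution.
        return [tf.layers.conv2d(p, depth, 3, padding='same') for p in pyramid]

The resulting pyramid levels feed the region proposal network and the box and mask heads at multiple scales.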

This tutorial uses tf.contrib.tpu.TPUEstimator to train the model. The TPUEstimator API is a high-level TensorFlow API and is the recommended way to build and run a machine learning model on Cloud TPU. The API simplifies the model development process by hiding most of the low-level implementation, which makes it easier to switch between TPU and other platforms such as GPU or CPU.
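
As a rough illustration of the TPUEstimator pattern (a minimal, self-contained sketch, not the tutorial's code; the real model_fn and input_fn live in mask_rcnn_main.py), a program wires a model_fn and an input_fn into an estimator that runs on a TPU or falls back to CPU/GPU:

    import tensorflow as tf  # TensorFlow 1.x

    USE_TPU = True  # set to False to run the same code on CPU/GPU

    def input_fn(params):
        # TPUEstimator passes the per-core batch size in params['batch_size'].
        batch = params['batch_size']
        ds = tf.data.Dataset.from_tensor_slices(
            (tf.random.uniform([1024, 8]), tf.random.uniform([1024, 1])))
        return ds.repeat().batch(batch, drop_remainder=True)

    def model_fn(features, labels, mode, params):
        # A trivial stand-in network; Mask RCNN builds its model here instead.
        preds = tf.layers.dense(features, 1)
        loss = tf.losses.mean_squared_error(labels, preds)
        opt = tf.train.GradientDescentOptimizer(0.01)
        if USE_TPU:
            # Aggregate gradients across TPU cores.
            opt = tf.contrib.tpu.CrossShardOptimizer(opt)
        train_op = opt.minimize(loss, global_step=tf.train.get_global_step())
        return tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=loss,
                                               train_op=train_op)

    resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu='your-tpu-name')
    run_config = tf.contrib.tpu.RunConfig(
        cluster=resolver,
        model_dir='gs://your-bucket-name/sketch',  # hypothetical path
        tpu_config=tf.contrib.tpu.TPUConfig(iterations_per_loop=500))
    estimator = tf.contrib.tpu.TPUEstimator(
        model_fn=model_fn, config=run_config,
        use_tpu=USE_TPU, train_batch_size=64)
    estimator.train(input_fn=input_fn, max_steps=1000)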

Before You Begin

Before starting this tutorial, check that your Google Cloud Platform project is correctly set up.

If you plan to train on a TPU Pod slice, review Training on TPU Pods to understand parameter changes required for Pod slices.

Objectives

  • Create a Cloud Storage bucket to hold your dataset and model output
  • Prepare the COCO dataset
  • Set up a Compute Engine VM and Cloud TPU node for training and evaluation
  • Run training and evaluation on a single Cloud TPU or a Cloud TPU Pod

Costs

This tutorial uses billable components of Google Cloud Platform, including:

  • Compute Engine
  • Cloud TPU
  • Cloud Storage

Use the pricing calculator to generate a cost estimate based on your projected usage. New GCP users might be eligible for a free trial.

Set up your resources

This section provides information on setting up the Cloud Storage bucket, Compute Engine VM, and Cloud TPU resources that this tutorial uses.

  1. Open a Cloud Shell window.

  2. Create a variable for your project's name.

        export PROJECT_NAME=your-project-name
    
  3. Configure the gcloud command-line tool to use the project where you want to create the Cloud TPU.

        gcloud config set project ${PROJECT_NAME}
    
  4. Create a Cloud Storage bucket using the following command:

        gsutil mb -p ${PROJECT_NAME} -c standard -l europe-west4 -b on gs://your-bucket-name
    

    This Cloud Storage bucket stores the data you use to train your model and the training results.

    Your Compute Engine VM, your Cloud TPU node, and your Cloud Storage bucket must all be located in the same region.

  5. Launch a Compute Engine VM using the ctpu up command.

    $ ctpu up --vm-only --machine-type=n1-standard-8 --zone=europe-west4-a
    

    After running ctpu up, a message appears showing your proposed configuration and you are prompted to confirm the configuration:

    OK to create your Cloud TPU resources with the above configuration? [Yn]:
  6. Press y to create your Compute Engine VM.

    When the ctpu up command has finished executing, verify that your shell prompt has changed from username@projectname to username@tpuname. This change shows that you are now logged into your Compute Engine VM.

    If you are not connected to the Compute Engine instance, you can do so by running the following command:

        gcloud compute ssh username --zone=europe-west4-a
    

As you continue these instructions, run each command that begins with (vm)$ in your VM session window.

Prepare the data

  1. Run the download_and_preprocess_coco.sh script to convert the COCO dataset into a set of TFRecords (*.tfrecord) that the training application expects.

    (vm)$ cd /usr/share/tpu/tools/datasets && \
          sudo bash /usr/share/tpu/tools/datasets/download_and_preprocess_coco.sh ./data/dir/coco
    

    This installs the required libraries and then runs the preprocessing script. It outputs a number of *.tfrecord files in your local data directory. The COCO download and conversion script takes approximately 1 hour to complete.

  2. Copy the data to your Cloud Storage bucket.

    After you convert the data into TFRecords, copy them from local storage to your Cloud Storage bucket using the gsutil command. You must also copy the annotation files, which help validate the model's performance. The commands below first set the STORAGE_BUCKET variable, which is also used later in this tutorial.

    (vm)$ export STORAGE_BUCKET=gs://your-bucket-name
    (vm)$ gsutil -m cp ./data/dir/coco/*.tfrecord ${STORAGE_BUCKET}/coco && \
          gsutil cp ./data/dir/coco/raw-data/annotations/*.json ${STORAGE_BUCKET}/coco
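
    Optionally, you can sanity-check the upload. The following sketch is an addition to this guide, not one of the tutorial scripts; the shard filename is hypothetical, so run gsutil ls gs://your-bucket-name/coco first to find a real one. TensorFlow 1.x reads gs:// paths natively:

        import tensorflow as tf

        # Hypothetical shard name; substitute a filename from your bucket.
        path = 'gs://your-bucket-name/coco/train-00000-of-00256.tfrecord'
        # Count the serialized examples in the shard.
        count = sum(1 for _ in tf.python_io.tf_record_iterator(path))
        print('%s contains %d records' % (path, count))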

Set up and start the Cloud TPU

The training runs for 22,500 steps and takes approximately 5 hours on a v2-8 TPU node and approximately 3.5 hours on a v3-8 TPU node.

  1. Run the following command to create your Cloud TPU.

    (vm)$ ctpu up --tpu-only --tpu-size=v3-8 --zone=europe-west4-a
    
    Parameter   Description
    tpu-size    Specifies the size of the Cloud TPU. This tutorial uses a v3-8 TPU size for single-device training and evaluation.
    zone        The zone where you plan to create your Cloud TPU. This should be the same zone you used for the Compute Engine VM. For example, europe-west4-a.

    After running ctpu up to start the Cloud TPU, a message appears showing your proposed configuration, including the Compute Engine VM. You are prompted to confirm the configuration:

    OK to create your Cloud TPU resources with the above configuration? [Yn]:
  2. Press y to create your Cloud TPU. Creation takes several minutes.

Install extra packages

The Mask RCNN training application requires several extra packages. Install them now:

    (vm)$ sudo apt-get install -y python-tk && \
          pip install --user Cython matplotlib opencv-python-headless pyyaml Pillow && \
          pip install --user 'git+https://github.com/cocodataset/cocoapi#egg=pycocotools&subdirectory=PythonAPI' && \
          pip install --user -U gast==0.2.2

Update the keepalive values of your VM connection

This tutorial requires a long-lived connection to the Compute Engine instance. To ensure you aren't disconnected from the instance, run the following command:

(vm)$ sudo /sbin/sysctl \
       -w net.ipv4.tcp_keepalive_time=120 \
       net.ipv4.tcp_keepalive_intvl=120 \
       net.ipv4.tcp_keepalive_probes=5

Define parameter values

Next, you need to define several parameter values. You use these parameters to train and evaluate your model.

The variables you need to set are described in the following table:

Parameter           Description
STORAGE_BUCKET      The name of the Cloud Storage bucket that you created in the Set up your resources section.
TPU_NAME            The name of the Compute Engine VM and the Cloud TPU, which must be the same. Since the Compute Engine VM name defaulted to your username, set the Cloud TPU name to the same value.
ACCELERATOR_TYPE    The accelerator version and the number of cores you want to use, for example, v2-128 (128 cores). In this tutorial, you use a v3-8 TPU type when training on a single TPU device.
GCS_MODEL_DIR       The directory that contains the model files. This tutorial uses a folder within the Cloud Storage bucket. You do not have to create this folder beforehand; the script creates it if it does not already exist.
CHECKPOINT          A pre-trained checkpoint. The Mask RCNN model requires a pre-trained image classification model, such as ResNet, to be used as a backbone network. This tutorial uses a pre-trained checkpoint created with the ResNet demonstration model. You can also train your own ResNet model and specify a checkpoint from your ResNet model directory.
PATH_GCS_MASKRCNN   The address of the Cloud Storage folder where you want to store your model training artifacts. As with the GCS_MODEL_DIR variable, this tutorial uses a folder in the Cloud Storage bucket.

Use the export command to set these variables:

(vm)$ export STORAGE_BUCKET=gs://your-bucket-name
(vm)$ export TPU_NAME=username
(vm)$ export ACCELERATOR_TYPE=v3-8
(vm)$ export GCS_MODEL_DIR=${STORAGE_BUCKET}/mask-rcnn-model
(vm)$ export CHECKPOINT=gs://cloud-tpu-artifacts/resnet/resnet-nhwc-2018-02-07/model.ckpt-112603
(vm)$ export PATH_GCS_MASKRCNN=${STORAGE_BUCKET}/coco

Run the training and evaluation

You are now ready to run the model on the preprocessed COCO data. To run the model, you use the mask_rcnn_main.py script.

  1. First, add the top-level /models folder to the Python path with the command:

    (vm)$ export PYTHONPATH="$PYTHONPATH:/usr/share/tpu/models"
    
  2. Run the following command to run both the training and evaluation.

    (vm)$ cd /usr/share/ && python tpu/models/official/mask_rcnn/mask_rcnn_main.py \
        --use_tpu=True \
        --tpu=${TPU_NAME:?} \
        --model_dir=${GCS_MODEL_DIR:?} \
        --num_cores=8 \
        --mode="train_and_eval" \
        --config_file="/usr/share/tpu/models/official/mask_rcnn/configs/cloud/${ACCELERATOR_TYPE}.yaml" \
        --params_override="checkpoint=${CHECKPOINT},training_file_pattern=${PATH_GCS_MASKRCNN:?}/train-*,validation_file_pattern=${PATH_GCS_MASKRCNN:?}/val-*,val_json_file=${PATH_GCS_MASKRCNN:?}/instances_val2017.json"
    
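
While the job runs, checkpoints and event files accumulate in ${GCS_MODEL_DIR}. If you want to confirm progress from another shell on the VM, here is a minimal sketch (an addition to this guide, assuming the same TensorFlow 1.x install):

    import os
    import tensorflow as tf

    # Reads the same GCS_MODEL_DIR the training command writes to.
    model_dir = os.environ['GCS_MODEL_DIR']
    print('Latest checkpoint:', tf.train.latest_checkpoint(model_dir))

You can also point TensorBoard at the same directory to watch the loss curves.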

From here, you can either conclude this tutorial and clean up your GCP resources, or you can further explore running the model on a Cloud TPU Pod.

Scaling your model with Cloud TPU Pods

You can get results faster by scaling your model with Cloud TPU Pods. The fully supported Mask RCNN model can work with the following Pod slices:

  • v2-32
  • v3-32

When working with Cloud TPU Pods, you first train the model using a Pod, then use a single Cloud TPU device to evaluate the model.

Training with Cloud TPU Pods

The training runs for 11,250 steps and takes approximately 2 hours on a v3-32 TPU node.

  1. Delete the Cloud TPU resource you created for training the model on a single Cloud TPU device.

    (vm)$ ctpu delete --tpu-only --zone=europe-west4-a
  2. Run the ctpu up command, using the tpu-size parameter to specify the Pod slice you want to use. For example, the following command uses a v3-32 Pod slice.

    (vm)$ ctpu up --tpu-only --tpu-size=v3-32 --zone=europe-west4-a
    
  3. Install the extra packages needed by Mask RCNN.

    (vm)$ sudo apt-get install -y python-tk && \
          pip install --user Cython matplotlib opencv-python-headless pyyaml Pillow && \
          pip install --user 'git+https://github.com/cocodataset/cocoapi#egg=pycocotools&subdirectory=PythonAPI' && \
          pip install --user -U gast==0.2.2
    
  4. Update the keepalive values of your VM connection.

    This tutorial requires a long-lived connection to the Compute Engine instance. To ensure you aren't disconnected from the instance, run the following command:

    (vm)$ sudo /sbin/sysctl \
       -w net.ipv4.tcp_keepalive_time=120 \
       net.ipv4.tcp_keepalive_intvl=120 \
       net.ipv4.tcp_keepalive_probes=5
    
  5. Define the variables you need for training on a Pod. Use the export command to create multiple bash variables and use them in a configuration string.

    (vm)$ export STORAGE_BUCKET=gs://your-bucket-name
    (vm)$ export TPU_NAME=username
    (vm)$ export ACCELERATOR_TYPE=v3-32
    
    (vm)$ export GCS_MODEL_DIR=${STORAGE_BUCKET}/mask-rcnn-pod
    (vm)$ export CHECKPOINT=gs://cloud-tpu-artifacts/resnet/resnet-nhwc-2018-02-07/model.ckpt-112603
    (vm)$ export PATH_GCS_MASKRCNN=${STORAGE_BUCKET}/coco
    
  6. Add the top-level /models folder to the Python path.

    (vm)$ export PYTHONPATH="$PYTHONPATH:/usr/share/tpu/models"
    
  7. Start the training script.

    (vm)$ cd /usr/share/ && python tpu/models/official/mask_rcnn/mask_rcnn_main.py \
          --use_tpu=True \
          --tpu=${TPU_NAME:?} \
          --iterations_per_loop=500 \
          --model_dir=${GCS_MODEL_DIR} \
          --num_cores=32 \
          --mode=train \
          --config_file="/usr/share/tpu/models/official/mask_rcnn/configs/cloud/${ACCELERATOR_TYPE}.yaml" \
          --params_override="checkpoint=${CHECKPOINT},training_file_pattern=${PATH_GCS_MASKRCNN:?}/train-*,validation_file_pattern=${PATH_GCS_MASKRCNN:?}/val-*,val_json_file=${PATH_GCS_MASKRCNN:?}/instances_val2017.json"
    

Evaluating the model

In this step, you use a single Cloud TPU node to evaluate the model trained above against the COCO dataset. The evaluation takes about 10 minutes.

  1. Delete the Cloud TPU resource you created to train the model on a Pod.

    (vm)$ ctpu delete --tpu-only --zone=europe-west4-a
  2. Launch a new TPU device to run evaluation.

    (vm)$ ctpu up --tpu-only --tpu-size=v3-8 --zone=europe-west4-a
    
  3. Install the packages necessary to run the evaluation.

    (vm)$ sudo apt-get install -y python-tk && \
          pip install --user Cython matplotlib opencv-python-headless pyyaml Pillow && \
          pip install --user 'git+https://github.com/cocodataset/cocoapi#egg=pycocotools&subdirectory=PythonAPI' && \
          pip install --user -U gast==0.2.2
    
  4. Create your environment variables.

      (vm)$ export STORAGE_BUCKET=gs://your-bucket-name
      (vm)$ export TPU_NAME=username
      (vm)$ export ACCELERATOR_TYPE=v3-8
      
      (vm)$ export GCS_MODEL_DIR=${STORAGE_BUCKET}/mask-rcnn-pod
      (vm)$ export CHECKPOINT=gs://cloud-tpu-artifacts/resnet/resnet-nhwc-2018-02-07/model.ckpt-112603
      (vm)$ export PATH_GCS_MASKRCNN=${STORAGE_BUCKET}/coco
       
  5. Add the top-level /models folder to the Python path.

    (vm)$ export PYTHONPATH="$PYTHONPATH:/usr/share/tpu/models"
    
  6. Start the evaluation.

    (vm)$ cd /usr/share/ && python tpu/models/official/mask_rcnn/mask_rcnn_main.py \
          --use_tpu=True \
          --tpu=${TPU_NAME:?} \
          --iterations_per_loop=500 \
          --mode=eval \
          --model_dir=${GCS_MODEL_DIR} \
          --config_file="/usr/share/tpu/models/official/mask_rcnn/configs/cloud/${ACCELERATOR_TYPE}.yaml" \
          --params_override="checkpoint=${CHECKPOINT},training_file_pattern=${PATH_GCS_MASKRCNN}/train-*,val_json_file=${PATH_GCS_MASKRCNN}/instances_val2017.json,validation_file_pattern=${PATH_GCS_MASKRCNN}/val-*,init_learning_rate=0.28,learning_rate_levels=[0.028, 0.0028, 0.00028],learning_rate_steps=[6000, 8000, 10000],momentum=0.95,num_batch_norm_group=1,num_steps_per_eval=500,global_gradient_clip_ratio=0.02,total_steps=11250,train_batch_size=512,warmup_steps=1864"
    
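
The learning-rate values in params_override describe a stepwise schedule with linear warmup, matching the Pod training run that produced the checkpoint being evaluated. As a rough illustration of how such a schedule maps a global step to a learning rate (the authoritative logic lives in the Mask RCNN model code, so treat this as a sketch):

    def learning_rate(step, init_lr=0.28,
                      levels=(0.028, 0.0028, 0.00028),
                      boundaries=(6000, 8000, 10000),
                      warmup_steps=1864):
        # Linear warmup from 0 up to the initial learning rate.
        if step < warmup_steps:
            return init_lr * step / warmup_steps
        # After warmup, drop to the matching level at each boundary.
        lr = init_lr
        for boundary, level in zip(boundaries, levels):
            if step >= boundary:
                lr = level
        return lr

    for step in (0, 1864, 5000, 6000, 8000, 11000):
        print(step, learning_rate(step))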

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial:

Clean up the Compute Engine VM instance and Cloud TPU resources.

  1. Disconnect from the Compute Engine instance, if you have not already done so:

    (vm)$ exit
    

    Your prompt should now be username@projectname, showing you are in the Cloud Shell.

  2. In your Cloud Shell, run ctpu delete with the --zone flag you used when you set up the Cloud TPU to delete your Compute Engine VM and your Cloud TPU:

    $ ctpu delete --zone=europe-west4-a
    
  3. Run the following command to verify the Compute Engine VM and Cloud TPU have been shut down:

    $ ctpu status --zone=europe-west4-a
    

    The deletion might take several minutes. A response like the one below indicates there are no more allocated instances:

    2018/04/28 16:16:23 WARNING: Setting zone to "europe-west4-a"
    No instances currently exist.
            Compute Engine VM:     --
            Cloud TPU:             --
    
  4. Run gsutil as shown, replacing your-bucket-name with the name of the Cloud Storage bucket you created for this tutorial:

    $ gsutil rm -r gs://your-bucket-name
    
