Running Deeplab-v3 on Cloud TPU

This tutorial shows you how to train the Deeplab-v3 model on Cloud TPU.

This model is an image semantic segmentation model. Image semantic segmentation models identify and localize objects in a single image by assigning a class label to every pixel. This type of model is frequently used in machine learning applications such as autonomous driving, geospatial image processing, and medical imaging.

In this tutorial, you'll train the model on the PASCAL VOC 2012 dataset. For more information on this dataset, see The PASCAL Visual Object Classes Homepage.

Before you begin

Before starting this tutorial, check that your Google Cloud Platform project is correctly set up.

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a Google Cloud Platform project.

    Go to the Manage resources page

  3. Make sure that billing is enabled for your Google Cloud Platform project.

    Learn how to enable billing

  4. This walkthrough uses billable components of Google Cloud Platform. Check the Cloud TPU pricing page to estimate your costs. Be sure to clean up resources you create when you've finished with them to avoid unnecessary charges.

Set up your resources

This section provides information on setting up the Cloud Storage bucket, Compute Engine VM, and Cloud TPU resources for this tutorial.

Create a Cloud Storage bucket

You need a Cloud Storage bucket to store the data you use to train your model and the training results. The ctpu up tool used in this tutorial sets up default permissions for the Cloud TPU service account. If you want finer-grained permissions, review the access level permissions.

The bucket you create must reside in the same region as your virtual machine (VM) and your Cloud TPU device or Cloud TPU slice (multiple TPU devices).

  1. Go to the Cloud Storage page on the GCP Console.

    Go to the Cloud Storage page

  2. Create a new bucket, specifying the following options:

    • A unique name of your choosing.
    • Default storage class: Regional
    • Location: If you want to use a Cloud TPU device, accept the default presented. If you want to use a Cloud TPU Pod slice, you must specify a region where Cloud TPU Pods are available.
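
If you prefer the command line, you can also create the bucket with gsutil from any environment where the Cloud SDK is installed. The region and bucket name below are only examples; substitute your own values:

$ gsutil mb -c regional -l us-central1 gs://YOUR-BUCKET-NAME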

Use the ctpu tool

This section demonstrates how to use the Cloud TPU provisioning tool (ctpu) to create and manage Cloud TPU project resources. The resources consist of a virtual machine (VM) and a Cloud TPU resource that share the same name. These resources must reside in the same region/zone as the bucket you just created.

You can also set up your VM and TPU resources using gcloud commands or through the Cloud Console. See the managing VM and TPU resources page to learn all the ways you can set up and manage your Compute Engine VM and Cloud TPU resources.

Run ctpu up to create resources

  1. Open a Cloud Shell window.

    Open Cloud Shell

  2. Run ctpu up specifying the flags shown for either a Cloud TPU device or Pod slice. Refer to CTPU Reference for flag options and descriptions.

  3. Set up a Cloud TPU device:

    $ ctpu up 

    The following configuration message appears:

    ctpu will use the following configuration:
    
    Name: [your TPU's name]
    Zone: [your project's zone]
    GCP Project: [your project's name]
    TensorFlow Version: 1.13
    VM:
     Machine Type: [your machine type]
     Disk Size: [your disk size]
     Preemptible: [true or false]
    Cloud TPU:
     Size: [your TPU size]
     Preemptible: [true or false]
    
    OK to create your Cloud TPU resources with the above configuration? [Yn]:
    

    Press y to create your Cloud TPU resources.

The ctpu up command creates a Compute Engine virtual machine (VM) and a Cloud TPU, both with the same name.
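
If you want to override any of the defaults shown in the configuration message, you can pass flags to ctpu up. For example, to provision in a specific zone with a non-default name, you could run something like the following (the values shown are only illustrative; see the CTPU Reference for the full list of flags):

$ ctpu up --zone=us-central1-b --name=deeplab-tutorial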

From this point on, a prefix of (vm)$ means you should run the command on the Compute Engine VM instance.

Verify your Compute Engine VM

When the ctpu up command has finished executing, verify that your shell prompt has changed from username@project to username@tpuname. This change shows that you are now logged into your Compute Engine VM.

Install additional packages

For this model, you need to install the following additional packages on your Compute Engine instance:

  • python-pil
  • python-numpy
  • jupyter
  • matplotlib
  • PrettyTable

Run the following commands to install these packages:

(vm)$ sudo apt-get install python-pil python-numpy && \
pip install --user jupyter && \
pip install --user matplotlib && \
pip install --user PrettyTable
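
Optionally, you can run a quick sanity check to confirm the packages import correctly (note that the PrettyTable package is imported as prettytable):

(vm)$ python -c "import PIL, numpy, matplotlib, prettytable"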

Clone the TensorFlow models and tpu repositories

Run the following commands to clone the TensorFlow models and tpu repositories to your Compute Engine instance:

(vm)$ git clone https://github.com/tensorflow/models.git && \
git clone https://github.com/tensorflow/tpu.git

Add tensorflow/models/research/ to PYTHONPATH

Next, add the ~/models/research and ~/models/research/slim directories to your PYTHONPATH environment variable:

(vm)$ cd ~/models/research && \
export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim
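
Note that export only applies to your current shell session; if you disconnect from the VM and log back in, run the command again. You can confirm both paths were added by echoing the variable:

(vm)$ echo $PYTHONPATH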

Download and convert the PASCAL VOC 2012 dataset

This model uses the PASCAL VOC 2012 dataset for training and evaluation. Run the following script to download the dataset and convert it to TensorFlow's TFRecord format:

(vm)$ cd ~/ && \
bash ~/models/research/deeplab/datasets/download_and_convert_voc2012.sh
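
The script writes the converted TFRecord files under ~/models/research/deeplab/datasets/pascal_voc_seg/tfrecord; this is the directory you upload to Cloud Storage later in this tutorial. You can list it to confirm the conversion succeeded:

(vm)$ ls ~/models/research/deeplab/datasets/pascal_voc_seg/tfrecord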

Download the pretrained checkpoint

In this step, you download a modified ResNet-101 pretrained checkpoint. To start, download the checkpoint:

(vm)$ wget http://download.tensorflow.org/models/resnet_v1_101_2018_05_04.tar.gz

Then, extract the contents of the tar file:

(vm)$ tar -vxf resnet_v1_101_2018_05_04.tar.gz
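
The archive unpacks into a resnet_v1_101 directory in your home directory; this is the directory you upload to Cloud Storage in the next step. You can list it to confirm the checkpoint files are present:

(vm)$ ls ~/resnet_v1_101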

Upload data to your Cloud Storage bucket

At this point, you can upload the converted dataset and the pretrained checkpoint to the Cloud Storage bucket you created earlier. Be sure to change YOUR-BUCKET-NAME to the name of your Cloud Storage bucket:

(vm)$ gsutil -m cp -r ~/models/research/deeplab/datasets/pascal_voc_seg/tfrecord gs://YOUR-BUCKET-NAME && \
gsutil -m cp -r ~/resnet_v1_101 gs://YOUR-BUCKET-NAME
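
To confirm both uploads completed, you can list the contents of your bucket:

(vm)$ gsutil ls gs://YOUR-BUCKET-NAME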

Train the model

You're now ready to train the model. Be sure to change YOUR-BUCKET-NAME to the name of your Cloud Storage bucket.

(vm)$ python ~/tpu/models/experimental/deeplab/main.py \
--mode='train' \
--num_shards=8 \
--alsologtostderr=true \
--model_dir=gs://YOUR-BUCKET-NAME \
--dataset_dir=gs://YOUR-BUCKET-NAME/tfrecord \
--init_checkpoint=gs://YOUR-BUCKET-NAME/resnet_v1_101/model.ckpt \
--model_variant=resnet_v1_101_beta \
--image_pyramid=1. \
--aspp_with_separable_conv=false \
--multi_grid=1 \
--multi_grid=2 \
--multi_grid=4 \
--decoder_use_separable_conv=false \
--train_split='train'
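
Training takes some time. If you want to watch progress, one option is to point TensorBoard at your model directory from a separate shell on the VM (this assumes TensorBoard is available in your environment; it is installed alongside TensorFlow):

(vm)$ tensorboard --logdir=gs://YOUR-BUCKET-NAME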

Evaluate the model

When the training completes, you can evaluate the model. To do so, change the --mode flag from train to eval:

(vm)$ python ~/tpu/models/experimental/deeplab/main.py \
--mode='eval' \
--num_shards=8 \
--alsologtostderr=true \
--model_dir=gs://YOUR-BUCKET-NAME \
--dataset_dir=gs://YOUR-BUCKET-NAME/tfrecord \
--init_checkpoint=gs://YOUR-BUCKET-NAME/resnet_v1_101/model.ckpt \
--model_variant=resnet_v1_101_beta \
--image_pyramid=1. \
--aspp_with_separable_conv=false \
--multi_grid=1 \
--multi_grid=2 \
--multi_grid=4 \
--decoder_use_separable_conv=false \
--train_split='train'

Clean up

To avoid incurring charges to your GCP account for the resources used in this topic:

  1. Disconnect from the Compute Engine VM:

    (vm)$ exit
    

    Your prompt should now be username@project, showing you are in the Cloud Shell.

  2. In your Cloud Shell, run ctpu delete with the --zone flag you used when you set up the Cloud TPU to delete your Compute Engine VM and your Cloud TPU:

    $ ctpu delete [optional: --zone]
    
  3. The deletion might take several minutes. Run ctpu status to make sure you have no instances allocated and avoid unnecessary charges for TPU usage. A response like the one below indicates there are no more allocated instances:

    2018/04/28 16:16:23 WARNING: Setting zone to "us-central1-b"
    No instances currently exist.
            Compute Engine VM:     --
            Cloud TPU:             --
    
  4. Run gsutil as shown, replacing YOUR-BUCKET-NAME with the name of the Cloud Storage bucket you created for this tutorial:

    $ gsutil rm -r gs://YOUR-BUCKET-NAME
    
