What’s in an image: fast, accurate image segmentation with Cloud TPUs
Hardware Engineer, Cloud TPU
Google designed Cloud TPUs from the ground up to accelerate cutting-edge machine learning (ML) applications, from image recognition, to language modeling, to reinforcement learning. And now, we’ve made it even easier for you to use Cloud TPUs for image segmentation—the process of identifying and labeling regions of an image based on the objects or textures they contain—by releasing high-performance TPU implementations of two state-of-the-art segmentation models, Mask R-CNN and DeepLab v3+ as open source code. Below, you can find performance and cost metrics for both models that can help you choose the right model and TPU configuration for your business or product needs.
A brief introduction to image segmentation
Image segmentation is the process of labeling regions in an image, often down to the pixel level. There are two common types of image segmentation:
Instance segmentation: This process gives each individual instance of one or multiple object classes a distinct label. In a family photo containing several people, this type of model would automatically highlight each person with a different color.
Semantic segmentation: This process labels each pixel of an image according to the class of object or texture it represents. For example, pixels in an image of a city street scene might be labeled as “pavement,” “sidewalk,” “building,” “pedestrian,” or “vehicle.”
Autonomous driving, geospatial image processing, and medical imaging, among other applications, typically require both of these types of segmentation. And image segmentation is even an exciting new enabler for certain photo and video editing processes, including bokeh and background removal!
High performance, high accuracy, and low cost
When you choose to work with image segmentation models, you’ll want to consider a number of factors: your accuracy target, the total training time to reach this accuracy, the cost of each training run, and more. To jump-start your analysis, we have trained Mask R-CNN and DeepLab v3+ on standard image segmentation datasets and collected many of these metrics in the tables below.
Instance segmentation using Mask R-CNN
Figure 1: Mask R-CNN training performance and accuracy, measured on the COCO dataset
Semantic segmentation using DeepLab v3+
Figure 2: DeepLab v3+ training performance and accuracy, measured on the PASCAL VOC 2012 dataset
As you can see above, Cloud TPUs can help you train state-of-the-art image segmentation models with ease, and you’ll often reach usable accuracy very quickly. At the time we wrote this blog post, the first two Mask R-CNN training runs and both of the DeepLab v3+ runs in the tables above cost less than $50 using the on-demand Cloud TPU devices that are now generally available.
By providing these open source image segmentation models and optimizing them for a range of Cloud TPU configurations, we aim to enable ML researchers, ML engineers, app developers, students, and many others to train their own models quickly and affordably to meet a wide range of real-world image segmentation needs.
A closer look at Mask R-CNN and DeepLab v3+
In order to achieve the image segmentation performance described above, you’ll need to use a combination of extremely fast hardware and well-optimized software. In the following sections, you can find more details on each model’s implementation.
Mask R-CNN is a two-stage instance segmentation model that can be used to localize multiple objects in an image down to the pixel level. The first stage of the model extracts features (distinctive patterns) from an input image to generate region proposals that are likely to contain objects of interest. The second stage refines and filters those region proposals, predicts the class of every high-confidence object, and generates a pixel-level mask for each object.
Figure 3: An image from Wikipedia with an overlay of Mask R-CNN instance segmentation results.
In the Mask R-CNN table above, we explored various trade-offs between training time and accuracy. The accuracy you wish to achieve as you train Mask R-CNN will vary by application: for some, training speed might be your top priority, whereas for others, you’ll prioritize around training to the highest possible accuracy, even if more training time and associated costs are needed to reach that accuracy threshold.
The training time your model will require depends on both the number of training epochs and your chosen TPU hardware configuration. When training for 12 epochs, Mask R-CNN training on the COCO dataset typically surpasses an object detection “box accuracy” of 37 mAP (“mean Average Precision”). While this accuracy threshold may be considered usable for many applications, we also report training results using 24 and 48 epochs across various Cloud TPU configurations to help you evaluate the current accuracy-speed trade off and choose an option that works best for your application. All the numbers in the tables above were collected using TensorFlow version 1.13. While we expect your results to be similar to ours, your results may vary.
Here are some high-level conclusions from our Mask R-CNN training trials:
If budget is your top priority, a single Cloud TPU v2 device (v2-8) should serve you well. With a Cloud TPU v2, our Mask R-CNN implementation trains overnight to an accuracy point of more than 37 mAP for less than $50. With a preemptible Cloud TPU device, that cost can drop to less than $20.
Alternatively, if you choose a Cloud TPU v3 device (v3-8), you should benefit from a speedup of up to 1.7x over a Cloud TPU v2 device—without any code changes.
Cloud TPU Pods enable even faster training at larger scale. Using just 1/16th of a Cloud TPU v3 Pod, Mask R-CNN trains to the highest accuracy tier in the table in under two hours.
Google’s DeepLab v3+, a fast and accurate semantic segmentation model, makes it easy to label regions in images. For example, a photo editing application might use DeepLab v3+ to automatically select all of the pixels of sky above the mountains in a landscape photograph.
Last year, we announced the initial open source release of DeepLab v3+, which as of writing is still the most recent version of DeepLab. The DeepLab v3+ implementation featured above includes optimizations that target Cloud TPU.
Figure 4: Semantic segmentation results using DeepLab v3+ [image from the DeepLab v3 paper]
We trained DeepLab v3+ on the PASCAL VOC 2012 dataset using TensorFlow version 1.13 on both Cloud TPU v2 and Cloud TPU v3 hardware. Using a single Cloud TPU v2 device (v2-8), DeepLab v3+ training completes in about 8 hours and costs less than $40 (less than $15 using preemptible Cloud TPUs). Cloud TPU v3 offers twice the memory (128 GB) and more than twice the peak compute (420 teraflops), enabling a speedup of about 1.7x without any code changes.
Getting started—in a sandbox, or in your own project
It’s easy to start experimenting with both the models above by using a free Cloud TPU in Colab right in your browser:
You can also get started with these image segmentation models in your own Google Cloud projects by following these tutorials:
If you’re new to Cloud TPUs, you can get familiar with the platform by following our quickstart guide, and you can also request access to Cloud TPU v2 Pods—available in alpha today. For more guidance on determining whether you should use an individual Cloud TPU or an entire Cloud TPU Pod, check out our comparison documentation here.
Many thanks to the Googlers who contributed to this post, including Zak Stone, Pengchong Jin, Shawn Wang, Chiachen Chou, David Shevitz, Barrett Williams, Liang-Chieh Chen, Yukun Zhu, Yeqing Li, Wes Wahlin, Pete Voss, Sharon Maher, Tom Nguyen, Xiaodan Song, Adam Kerin, and Ruoxin Sang.