AI & Machine Learning

Better scalability with Cloud TPU pods and TensorFlow 2.1


Cloud TPU Pods have gained recognition recently for setting performance records in both training and inference. These custom-built AI supercomputers are now generally available to help all types of enterprises solve their biggest AI challenges. 

“We've been greatly impressed with the speed and scale of Google TPU while making use of it for a variety of internal NLP tasks,” said Seobok Jang, AI Development Infra Lead at LG. “It helped us minimize tedious training time for our unique language models based on BERT, thus making it remarkably productive. Overall, the utilization of TPU was an excellent choice especially while training complex and time consuming language models.”

Not only are Cloud TPUs now more widely available, they are increasingly easy to use. For example, the latest TensorFlow 2.1 release includes support for Cloud TPUs using Keras, offering both high-level and low-level APIs. This makes it possible to leverage petaflops of TPU compute that’s optimized for deep learning with the same user-friendly APIs familiar to the large community of Keras users. (The TensorFlow 2.x series of releases will also continue to support the older TPUEstimator API.)

In this post, we’ll walk through how to use Keras to train on Cloud TPUs at small scale, demonstrate how to scale up to training on Cloud TPU Pods, and showcase a few additional examples and new features.

A single Cloud TPU v3 device (left) with 420 teraflops and 128 GB HBM, and a Cloud TPU v3 Pod (right) with 100+ petaflops and 32 TB HBM connected via a 2-D toroidal mesh network.

Train on a single Cloud TPU device

You can use almost identical code whether you’re training on a single Cloud TPU device or across a large Cloud TPU Pod slice for increased performance. Here, we show how to train a model using Keras on a single Cloud TPU v3 device.

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=FLAGS.tpu)
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

with strategy.scope():
  # Build a Keras model and its optimizer inside the strategy scope.
  model = build_model()
  optimizer = build_optimizer()  # e.g. a user-defined helper returning tf.keras.optimizers.SGD(...)

# Distribute the dataset.
train_dataset = strategy.experimental_distribute_datasets_from_function(
    imagenet_train.input_fn)
train_iterator = iter(train_dataset)

@tf.function
def train_step(iterator):
  def step_fn(inputs):
    images, labels = inputs
    with tf.GradientTape() as tape:
      predictions = model(images, training=True)
      loss = compute_loss(predictions, labels)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

  strategy.experimental_run_v2(step_fn, args=(next(iterator),))

train_step(train_iterator)

Scale up to Cloud TPU Pods

You only need minimal code changes to scale jobs from a single Cloud TPU (four chips) to a full Cloud TPU Pod (1,024 chips). In the example above, you need to set FLAGS.tpu to your Cloud TPU Pod instance name when creating the TPUClusterResolver. To use Cloud TPU Pod slices effectively, you may also need to scale the batch size and number of training steps in your configuration.
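One common recipe, sketched below, is to hold the per-replica batch size fixed so the global batch size grows linearly with the number of replicas, then shrink the step count so the total number of examples processed stays constant. The replica counts and batch size here are illustrative assumptions, not tuned values, and the helper name is hypothetical:

```python
# Hypothetical scaling helper: keeps the per-replica batch size fixed and
# holds the total number of training examples processed constant.
def scale_config(per_replica_batch, base_steps, base_replicas, new_replicas):
    """Scale global batch size and step count when moving to a larger slice."""
    base_global_batch = per_replica_batch * base_replicas
    new_global_batch = per_replica_batch * new_replicas
    # Keep examples-seen constant: steps shrink as the global batch grows.
    new_steps = base_steps * base_global_batch // new_global_batch
    return new_global_batch, new_steps

# Example: moving from 8 replicas to a 32-replica Pod slice.
global_batch, steps = scale_config(
    per_replica_batch=128, base_steps=112000, base_replicas=8, new_replicas=32)
```

Note that simply growing the batch size can also change optimization dynamics, so learning-rate schedules often need retuning alongside these two knobs.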

TensorFlow Model Garden examples

TensorFlow 2.1 includes example code for training a diverse set of models with Keras on TPUs, as well as full backward compatibility for Cloud TPU models written using TPUEstimator in TensorFlow 1.15. At the time of writing, Keras implementations for BERT, Transformer, MNIST, ResNet-50, and RetinaNet are included in the TensorFlow Model Garden GitHub repo, and a larger set of models with tutorials is available via the official Cloud TPU documentation.

The TensorFlow Model Garden includes Keras examples with user-implemented “custom training loops” as well as Keras examples using higher-level model.compile and model.fit APIs. Writing your own training loop, as shown in this blog post, provides more power and flexibility, and is often a higher-performance choice when working with Cloud TPUs.
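For comparison with the custom training loop above, a minimal compile/fit sketch is shown below. The model architecture and data are placeholders rather than a Model Garden example, and the default distribution strategy is used so the sketch runs anywhere; on a Cloud TPU you would instead build a TPUStrategy from a TPUClusterResolver as shown earlier:

```python
import numpy as np
import tensorflow as tf

# On Cloud TPU you would use tf.distribute.experimental.TPUStrategy(resolver);
# the default strategy is used here so the sketch also runs on CPU/GPU.
strategy = tf.distribute.get_strategy()

with strategy.scope():
    # Placeholder model: a tiny classifier, not one of the Model Garden examples.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(3, activation='softmax'),
    ])
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'])

# Synthetic data stands in for a real input pipeline.
x = np.random.rand(32, 10).astype('float32')
y = np.random.randint(0, 3, size=(32,))
history = model.fit(x, y, batch_size=8, epochs=1, verbose=0)
```

The compile/fit path handles the training loop, metrics, and checkpointing callbacks for you; the custom-loop path trades that convenience for fine-grained control over each step.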

Additional features

TensorFlow 2.1 makes working with Cloud TPUs even easier by adding support for the following features.

Automatic handling of unsupported ops

In common cases, unsupported ops can now be automatically handled when porting models to Cloud TPUs. Adding tf.config.set_soft_device_placement(True) to TensorFlow code (as shown below) will cause any ops that aren’t supported on Cloud TPUs to be detected and placed on the host CPUs. This means that custom tf.summary usage in model functions, tf.print with string types unsupported on Cloud TPUs, and others will now just work.

# Define a layer with a summary op.
class CustomLayer(tf.keras.layers.Layer):
  """A pass-through layer that only records values to summary."""

  def call(self, x):
    tf.summary.histogram('custom_histogram_summary', x)
    return x

def get_model():
  """Returns a Keras model instance with summary ops."""
  model = tf.keras.models.Sequential()

  ...

  model.add(CustomLayer())
  return model

...

# Enable soft device placement.
tf.config.set_soft_device_placement(True)
strategy = tf.distribute.experimental.TPUStrategy(..)
with strategy.scope():
  # Define a TensorBoard callback and set the directory to which
  # summary values will be saved.
  tensorboard_callback = tf.keras.callbacks.TensorBoard(..)
  model.compile(..)
  model.fit(..)

Improved support for dynamic shapes

Working with dynamic shapes on Cloud TPUs is also easier in TensorFlow 2.1. TensorFlow 1.x Cloud TPU training requires specifying static per-replica and global batch sizes, for example by setting drop_remainder=True in the input dataset. TensorFlow 2.1 no longer requires this step. Even if the last partial batch is not even across replicas or some replicas have no data, the training job will run and complete as expected.

Using ops with dynamic output dimensions and slicing with dynamic indexes is also now supported on Cloud TPUs.

# Dynamic batch sizes:
dataset = tf.data.Dataset.from_tensor_slices([5., 6., 7.])
dataset = dataset.batch(2)  # No need to drop the last partial batch.

...

# Dynamic output dimensions:
tensor = tf.constant([[1, 2], [3, 4], [5, 6]])
mask = [True, False, True]
tf.boolean_mask(tensor, mask)  # The output has a dynamic shape.

...

# Dynamic indexes:
index = tf.random.uniform([], minval=0, maxval=3, dtype=tf.int32)
sliced = tensor[index]

Mixed precision

The Keras mixed precision API now supports Cloud TPUs, and it can significantly increase performance in many applications. The example code below shows how to enable bfloat16 mixed precision on Cloud TPUs with Keras. Check out the mixed precision tutorial for more information.

policy = tf.keras.mixed_precision.experimental.Policy('mixed_bfloat16')
tf.keras.mixed_precision.experimental.set_policy(policy)

Get started

To quickly try out Cloud TPU on TensorFlow 2.1, check out this free codelab, Keras and modern convnets on TPUs. For the first time, you can also experiment with Cloud TPUs in a Kaggle competition, which includes starter material to build a model that identifies flowers. Once you’re ready to accelerate your AI workloads on Cloud TPU Pods, learn more about reservations and pricing on the product page. Stay tuned for additional posts about getting started on Cloud TPUs and TensorFlow 2.1 in the coming weeks.