AI & Machine Learning

Train fast on TPU, serve flexibly on GPU: switch your ML infrastructure to suit your needs


When developing machine learning models, fast iteration and short training times are of utmost importance. To reach higher levels of accuracy, you or your data science team may need to run tens or hundreds of training iterations to explore different options.

A growing number of organizations use Tensor Processing Units (Cloud TPUs) to train complex models due to their ability to reduce the training time from days to hours (roughly a 10X reduction) and the training costs from thousands of dollars to tens of dollars (roughly a 100X reduction). You can then deploy your trained models to CPUs, GPUs, or TPUs to make predictions at serving time. In some applications for which response latency is critical—e.g., robotics or self-driving cars—you might need to make additional optimizations. For example, many data scientists frequently use NVIDIA’s TensorRT to improve inference speed on GPUs. In this post, we walk through training and serving an object detection model and demonstrate how TensorFlow’s comprehensive and flexible feature set can be used to perform each step, regardless of which hardware platform you choose.

A TensorFlow model consists of many operations (ops) that are responsible for training and making predictions, for example, telling us whether a person is crossing the street. Most TensorFlow ops are platform-agnostic and can run on CPU, GPU, or TPU. In fact, if you implement your model using TPUEstimator, you can run it on a Cloud TPU by simply setting the use_tpu flag to True, and run it on a CPU or GPU by setting the flag to False.

NVIDIA has developed TensorRT (an inference optimization library) for high-performance inference on GPUs. TensorFlow (TF) now includes a TensorRT integration (TF-TRT) module that can convert TensorFlow ops in your model to TensorRT ops. With this integration, you can train your model on TPUs and then use TF-TRT to convert the trained model to a GPU-optimized one for serving. In the following example, we will train a state-of-the-art object detection model, RetinaNet, on a Cloud TPU, convert it to a TensorRT-optimized version, and run predictions on a GPU.

Train and save a model

You can use the following instructions for any TPU model, but in this guide, we choose as our example the TensorFlow TPU RetinaNet model. Accordingly, you can start by following this tutorial to train a RetinaNet model on Cloud TPU. Feel free to skip the section titled "Evaluate the model while you train (optional)".¹

For the RetinaNet model that you just trained, if you look inside the model directory (${MODEL_DIR} in the tutorial) in Cloud Storage you’ll see multiple model checkpoints. Note that checkpoints may be dependent on the architecture used to train a model and are not suitable for porting the model to a different architecture.

TensorFlow offers another model format, SavedModel, that you can use to save and restore your model independent of the code that generated it. A SavedModel is language-neutral and contains everything you need (graph, variables, and metadata) to port your model from TPU to GPU or CPU.

Inside the model directory, you should find a timestamped subdirectory (in Unix epoch time format, for example, 1546300800 for 2019-01-01 00:00:00 GMT) that contains the exported SavedModel. Specifically, your subdirectory contains the following files:

  • saved_model.pb
  • variables/variables.data-00000-of-00001
  • variables/variables.index
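As a quick sanity check, you can locate the newest timestamped export with a few lines of standard Python. This helper is our own sketch; the function name, and the assumption that export folders are named with epoch seconds, are ours, matching the layout above:

```python
import datetime
import os


def latest_export(model_dir):
  """Return the path of the most recent timestamped SavedModel export."""
  # Exported SavedModel folders are named with Unix epoch seconds.
  exports = [d for d in os.listdir(model_dir) if d.isdigit()]
  if not exports:
    raise ValueError('no timestamped exports under %s' % model_dir)
  newest = max(exports, key=int)
  when = datetime.datetime.fromtimestamp(int(newest), datetime.timezone.utc)
  print('latest export: %s (written %s)' % (newest, when.isoformat()))
  return os.path.join(model_dir, newest)
```

Pointing this at ${MODEL_DIR} (after copying it locally, or via a Cloud Storage client) gives you the directory to pass to the serving and conversion steps below.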

The training script stores your model graph as saved_model.pb in protocol buffer (protobuf) format, and stores the variables in the aptly named variables subdirectory. Generating a SavedModel involves two steps: first, define a serving_input_receiver_fn, and then export a SavedModel.

At serving time, the serving input receiver function ingests inference requests and prepares them for the model, just as at training time the input function input_fn ingests the training data and prepares them for the model. In the case of RetinaNet, the following code defines the serving input receiver function:

  def serving_input_fn(image_size):
    """Input function for SavedModels and TF serving."""

    def _decode_and_crop(img_bytes):
      img = tf.image.decode_jpeg(img_bytes)
      img = tf.image.resize_image_with_crop_or_pad(img, image_size, image_size)
      img = tf.image.convert_image_dtype(img, tf.float32)
      return img

    image_bytes_list = tf.placeholder(shape=[None], dtype=tf.string)
    images = tf.map_fn(
        _decode_and_crop, image_bytes_list, back_prop=False, dtype=tf.float32)
    images = tf.reshape(images, [-1, image_size, image_size, 3])
    return tf.estimator.export.TensorServingInputReceiver(
        features=images,
        receiver_tensors={'image_bytes': image_bytes_list})

The serving_input_fn returns a tf.estimator.export.TensorServingInputReceiver object that takes the inference requests as receiver_tensors and the tensors consumed by the model as features. When the script returns a TensorServingInputReceiver, it’s telling TensorFlow everything it needs to know in order to construct a server. The features argument describes the features that will be fed to our model; in this case, features is simply the batch of images to run our detector on. receiver_tensors specifies the inputs to our server: since we want our server to take JPEG-encoded images, there is a tf.placeholder for an array of strings. We decode each string into an image, crop or pad it to the correct size, and return the resulting image tensor.
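At serving time, a client therefore sends a batch of raw JPEG byte strings, one per image, to fill that placeholder. A minimal sketch of assembling such a batch from files on disk (the helper name and the file-based workflow are our own illustration):

```python
def load_jpeg_batch(paths):
  """Read JPEG files into the list-of-byte-strings format that the
  image_bytes placeholder of the serving input receiver expects."""
  batch = []
  for path in paths:
    with open(path, 'rb') as f:
      batch.append(f.read())
  return batch
```

The resulting list can be passed directly as the value for the image_bytes input when calling the exported model.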

To export a SavedModel, call the export_saved_model method on your estimator, as shown in the following code snippet:


  eval_estimator.export_saved_model(
    export_dir_base=FLAGS.model_dir,
    serving_input_receiver_fn=lambda: serving_input_fn(hparams.image_size))

Running export_saved_model generates a SavedModel directory under your FLAGS.model_dir directory. The SavedModel exported from TPUEstimator contains information on how to serve your model on CPU, GPU, and TPU architectures.

Inference

You can take the SavedModel that you trained on a TPU and load it on CPUs, GPUs, or TPUs to run predictions. The following lines of code restore the model and run inference.

  from tensorflow.python.saved_model import loader
  from tensorflow.python.saved_model import tag_constants

  with tf.Session() as sess:
    loader.load(sess, [tag_constants.SERVING], model_dir)
    sess.run(model_outputs, feed_dict={model_input: [input_image_batch]})

model_dir is your model directory where the SavedModel is stored. loader.load returns a MetaGraphDef protocol buffer loaded in the provided session. model_outputs is the list of model outputs you’d like to predict, model_input is the name of the placeholder that receives the input data, and input_image_batch is the input data.²

With TensorFlow, you can very easily train and save a model on one platform (like TPU) and load and serve it on another platform (like GPU or CPU). You can choose from different Google Cloud Platform services such as Cloud Machine Learning Engine, Kubernetes Engine, or Compute Engine to serve your models. In the remainder of this post you’ll learn how to optimize the SavedModel using TF-TRT, which is a common process if you plan to serve your model on one or more GPUs.

TensorRT optimization

While you can use the SavedModel exported earlier to serve predictions on GPUs directly, NVIDIA’s TensorRT allows you to get improved performance from your model by using some advanced GPU features. To use TensorRT, you’ll need a virtual machine (VM) with a GPU and NVIDIA drivers. Google Cloud’s Deep Learning VMs are ideal for this case, because they have everything you need pre-installed.

Follow these instructions to create a Deep Learning VM instance with one or more GPUs on Compute Engine. Select the checkbox "Install NVIDIA GPU driver automatically on first startup?" and choose a "Framework" (for example, "Intel optimized TensorFlow 1.12" at the time of writing this post) that comes with the most recent versions of CUDA and TensorRT that satisfy the dependencies of GPU-enabled TensorFlow and the TF-TRT module. After your VM is initialized and booted, you can log into it remotely by clicking the SSH button next to its name on the Compute Engine page in Cloud Console, or by using the gcloud compute ssh command. Install the dependencies (recent versions of TensorFlow include TF-TRT by default) and clone the TensorFlow TPU GitHub repository.³

Now run tpu/models/official/retinanet/retinanet_tensorrt.py and provide the location of the SavedModel as an argument:


  $ python tpu/models/official/retinanet/retinanet_tensorrt.py \
   --saved_model_dir=${SAVED_MODEL_DIR} \
   --number=10

In the preceding code snippet, SAVED_MODEL_DIR is the path where the SavedModel is stored (on Cloud Storage or local disk). This step converts the original SavedModel to a new, GPU-optimized SavedModel and prints out the prediction latency for the two models.

If you look inside the model directory, you can see that retinanet_tensorrt.py has converted the original SavedModel to a TensorRT-optimized SavedModel and stored it in a new folder ending in _trt. This conversion is performed with the following call:

  tensorflow.contrib.tensorrt.create_inference_graph(
      input_graph_def=None,
      outputs=None,
      input_saved_model_dir=original_model_dir,
      output_saved_model_dir=tensorrt_model_dir)

In the new SavedModel, eligible TensorFlow ops have been replaced by their GPU-optimized TensorRT implementations. During conversion, the script also converts all variables to constants and writes them into saved_model.pb, so the variables folder is empty. The TF-TRT module has implementations for the majority of TensorFlow ops. Some ops, such as the control flow ops Enter, Exit, Merge, and Switch, have no TRT implementation and therefore stay unchanged in the new SavedModel, but their effect on prediction latency is negligible.
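One way to see this replacement is to tally the op types in the converted graph. The sketch below works on a plain list of op-type strings, such as you could collect with [node.op for node in graph_def.node]; the helper name and the exact set of control flow ops listed are our assumptions:

```python
from collections import Counter

# Control flow ops that TF-TRT leaves as regular TensorFlow ops.
CONTROL_FLOW_OPS = {'Enter', 'Exit', 'Merge', 'Switch', 'NextIteration'}


def summarize_ops(op_types):
  """Count TensorRT engine ops vs. untouched control flow ops."""
  counts = Counter(op_types)
  return {
      'trt_engines': counts.get('TRTEngineOp', 0),
      'control_flow': sum(counts[op] for op in CONTROL_FLOW_OPS),
      'total': len(op_types),
  }
```

In a successfully converted graph you would expect one or more TRTEngineOp nodes (each standing in for a fused subgraph) alongside a small remainder of unconverted ops.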

Another way to convert the SavedModel to its TensorRT inference graph is the saved_model_cli tool, invoked with the following command:


  $ nvidia-docker run -v $MY_DIR:$MY_DIR -it tensorflow/tensorflow:nightly-gpu \
   /usr/local/bin/saved_model_cli convert \
   --dir $MY_DIR/${SAVED_MODEL_DIR} \
   --output_dir $MY_DIR/${SAVED_MODEL_DIR}_trt \
   --tag_set serve tensorrt

In the preceding command, MY_DIR is the shared filesystem directory and SAVED_MODEL_DIR is the directory inside the shared filesystem directory where the SavedModel is stored.

retinanet_tensorrt.py also loads and runs the two models before and after conversion and prints the prediction latency. As we expect, the converted model has lower latency. Note that for inference, the first prediction often takes longer than subsequent predictions. This is due to startup overhead and, for TPUs, the time taken to compile the TPU program via XLA. In our example, we skip the time taken by the first inference step and average the remaining steps from the second iteration onward.
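The latency measurement described above can be sketched in a few lines; predict_fn stands in for a single sess.run prediction call, and the helper name is our own:

```python
import time


def average_latency(predict_fn, num_runs=10):
  """Time num_runs predictions, discarding the first (warm-up) run."""
  latencies = []
  for _ in range(num_runs):
    start = time.perf_counter()
    predict_fn()
    latencies.append(time.perf_counter() - start)
  # The first run includes one-time startup (and, on TPU, XLA compilation)
  # cost, so average only the steady-state runs.
  steady = latencies[1:]
  return sum(steady) / len(steady)
```

Measuring both the original and the _trt SavedModel with the same predict_fn wrapper gives a like-for-like comparison of serving latency.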

You can apply these steps to other models to easily port them to a different architecture and optimize their performance. The TensorFlow and TPU GitHub repositories contain a diverse collection of models that you can try out for your application, including another state-of-the-art object detection model, Mask R-CNN. If you’re interested in trying out TPUs to see what they can offer in terms of training and serving times, try this Colab and quickstart.



1. You can skip the training step altogether by using the pre-trained checkpoints, which are stored in Cloud Storage under gs://cloud-tpu-checkpoints/retinanet-model.
2. Use loader.load(sess, [tag_constants.SERVING], saved_model_dir).signature_def to load the model and return the signature_def, which contains the model input(s) and output(s). sess is the Session object here.
3. Alternatively, you can use a Docker image with a recent version of TensorFlow and the dependencies.