
Running TensorFlow inference workloads at scale with TensorRT 5 and NVIDIA T4 GPUs

Today, we announced that Google Compute Engine now offers machine types with NVIDIA T4 GPUs, to accelerate a variety of cloud workloads, including high-performance computing, deep learning training and inference, broader machine learning (ML) workloads, data analytics, and graphics rendering.

In addition to its GPU hardware, NVIDIA also offers tools to help developers make the best use of their infrastructure. NVIDIA TensorRT is a software platform for high-performance deep learning inference, the stage in the machine learning process where a trained model is used, typically in a live runtime environment, to recognize, process, and classify results. The library includes a deep learning inference data type (quantization) optimizer, a model conversion process, and a runtime that delivers low latency and high throughput. TensorRT-based applications perform up to 40 times faster [1] than CPU-only platforms during inference. With TensorRT, you can optimize neural network models trained in most major frameworks, calibrate for lower precision with high accuracy, and finally deploy to a variety of environments. These might include hyperscale data centers, embedded systems, or automotive product platforms.

In this blog post, we’ll show you how to run deep learning inference on large-scale workloads with NVIDIA TensorRT 5 running on Compute Engine VMs configured with our Cloud Deep Learning VM image and NVIDIA T4 GPUs.


This tutorial shows you how to set up a multi-zone cluster for running an inference workload on an autoscaling group that scales to meet changing GPU utilization demands, and covers the following steps:

  • Preparing a model using a pre-trained graph (ResNet)

  • Benchmarking the inference speed for a model with different optimization modes

  • Converting a custom model to TensorRT format

  • Setting up a multi-zone cluster that is:

    • Built on Deep Learning VMs preinstalled with TensorFlow, TensorFlow Serving, and TensorRT 5.

    • Configured to auto-scale based on GPU utilization.

    • Configured for load-balancing.

    • Firewall enabled.  

  • Running an inference workload in the multi-zone cluster.

Here’s a high-level architectural perspective for this setup:

[Figure: high-level architectural perspective]

Preparing and optimizing the model with TensorRT

In this section, we will create a VM instance to run the model, and then download a model from the TensorFlow official models catalog.

Create a new Deep Learning Virtual Machine instance

Create the VM instance:

export IMAGE_FAMILY="tf-1-12-cu100"
export ZONE="us-central1-b"
export INSTANCE_NAME="model-prep"
gcloud compute instances create $INSTANCE_NAME \
       --zone=$ZONE \
       --image-family=$IMAGE_FAMILY \
       --machine-type=n1-standard-8 \
       --image-project=deeplearning-platform-release \
       --maintenance-policy=TERMINATE \
       --accelerator="type=nvidia-tesla-t4,count=1"

If the command is successful, you should see a message that looks like this:

  Created [].
model-prep  us-central1-b  n1-standard-8       RUNNING


  • You can create this instance in any available zone that supports T4 GPUs.

  • A single GPU is enough to compare the different TensorRT optimization modes.

Download a ResNet model pre-trained graph

This tutorial uses the ResNet model, which was trained on the ImageNet dataset and is available in TensorFlow. To download the ResNet model to your VM instance, run the following command:

  wget -q

Verify that the model was downloaded correctly:

  ls -al
-rw-r--r--  1 root staff  98M Apr 16  2018 resnetv2_imagenet_frozen_graph.pb

Save the location of your ResNet model in the $WORKDIR variable:

  export WORKDIR=<model_location>

Benchmarking the model

TensorRT can speed up inference workloads by leveraging fast linear algebra libraries and hand-tuned kernels, but the most significant speed-up comes from the quantization process. Model quantization is the process by which you reduce the precision of a model's weights. For example, if a model's weights are initially FP32, you can reduce the precision to FP16 or INT8 with the goal of improving runtime performance. It’s important to pick the right balance between speed (lower-precision weights) and the accuracy of the model. Luckily, TensorFlow includes functionality that does exactly this, measuring accuracy against speed, along with other metrics such as throughput, latency, node conversion rates, and total training time. TensorRT supports two modes: TensorFlow+TensorRT and TensorRT native. In this example, we use the first option.

Note: This test is limited to image recognition models at the moment. However, it should not be too hard to implement a custom test based on this code.
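To make the precision trade-off more concrete, here is a minimal sketch of what reducing weight precision does to a few values. It is not how TensorRT calibrates a model: the weights are hypothetical, and the INT8 step uses simple symmetric linear scaling chosen only for illustration.

```python
import struct


def to_fp16(value):
    """Round-trip a float through IEEE 754 half precision (FP16)."""
    return struct.unpack("e", struct.pack("e", value))[0]


def quantize_int8(values):
    """Symmetric linear quantization onto the integer range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [code * scale for code in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.1234567, -0.9876543, 0.0004321, 0.5]

fp16_weights = [to_fp16(w) for w in weights]
codes, scale = quantize_int8(weights)
int8_weights = dequantize_int8(codes, scale)

for original, half, int8 in zip(weights, fp16_weights, int8_weights):
    print(f"original={original:+.7f}  fp16={half:+.7f}  int8={int8:+.7f}")
```

In this sketch, the FP16 values stay close to the originals, while the INT8 values can only land on multiples of the scale, so very small weights may collapse to zero. That is one reason INT8 calls for a closer look at accuracy.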

Set up the ResNet model

To set up the model, run the following commands:

  git clone
cd models
git checkout f0e10716160cd048618ccdd4b6e18336223a172f
touch research/
touch research/tensorrt/
cp research/tensorrt/labellist.json .
cp research/tensorrt/image.jpg ..

This test requires a frozen graph from the ResNet model (the same one that we downloaded before), as well as arguments for the different quantization modes that we want to test.

Run the test

The following command runs the test in each of the selected optimization modes:

python -m research.tensorrt.tensorrt \
--frozen_graph=$WORKDIR/resnetv2_imagenet_frozen_graph.pb \
--image_file=$WORKDIR/image.jpg \
--native --fp32 --fp16 --int8

This command will take some time to finish.


  • $WORKDIR is the directory in which you downloaded the ResNet model.

  • The --native, --fp32, --fp16, and --int8 arguments select the optimization modes to test: the native TensorFlow graph, plus the FP32, FP16, and INT8 TensorRT precisions.

Review the results

When the test completes, you will see a comparison of the inference results for each optimization mode.

Precision:  native [u'seashore, coast, seacoast, sea-coast', u'promontory, headland, head, foreland', u'breakwater, groin, groyne, mole, bulwark, seawall, jetty', u'lakeside, lakeshore', u'grey whale, gray whale, devilfish, Eschrichtius gibbosus, Eschrichtius robustus']
Precision:  FP32 [u'seashore, coast, seacoast, sea-coast', u'promontory, headland, head, foreland', u'breakwater, groin, groyne, mole, bulwark, seawall, jetty', u'lakeside, lakeshore', u'sandbar, sand bar']
Precision:  FP16 [u'seashore, coast, seacoast, sea-coast', u'promontory, headland, head, foreland', u'breakwater, groin, groyne, mole, bulwark, seawall, jetty', u'lakeside, lakeshore', u'sandbar, sand bar']
Precision:  INT8 [u'seashore, coast, seacoast, sea-coast', u'promontory, headland, head, foreland', u'breakwater, groin, groyne, mole, bulwark, seawall, jetty', u'grey whale, gray whale, devilfish, Eschrichtius gibbosus, Eschrichtius robustus', u'lakeside, lakeshore']

To see the full results, run the following command:

  cat $WORKDIR/log.txt







From the above results, you can see that the FP32 and FP16 predictions are identical. This means that if you are content working with TensorRT, you can start using FP16 right away. INT8, on the other hand, shows slightly worse accuracy, so using it requires understanding the accuracy-versus-performance tradeoffs for your models.
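If you want to compare the predictions programmatically rather than by eye, a small sketch like the following may help. The label lists are copied from the output above and shortened to the first synonym of each class. The helper names are hypothetical and are not part of the test script.

```python
# Top-5 labels per mode, copied from the output above (first synonym only).
top5 = {
    "native": ["seashore", "promontory", "breakwater", "lakeside", "grey whale"],
    "FP32": ["seashore", "promontory", "breakwater", "lakeside", "sandbar"],
    "FP16": ["seashore", "promontory", "breakwater", "lakeside", "sandbar"],
    "INT8": ["seashore", "promontory", "breakwater", "grey whale", "lakeside"],
}


def same_ranking(mode_a, mode_b):
    """True when both modes return the same labels in the same order."""
    return top5[mode_a] == top5[mode_b]


def shared_labels(mode_a, mode_b):
    """Number of labels that appear in the top 5 of both modes."""
    return len(set(top5[mode_a]) & set(top5[mode_b]))


for mode in ("FP16", "INT8"):
    print(
        f"FP32 vs {mode}: same ranking={same_ranking('FP32', mode)}, "
        f"shared labels={shared_labels('FP32', mode)}/5"
    )
```

A single image is only a sanity check. To judge accuracy properly, you would run the same comparison over a labeled validation set.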

In addition, you can observe that when you run the model with TensorRT 5:

  • Using FP32 optimization improves throughput by 40% (440 vs 314). At the same time it decreases latency by ~30%, making it 0.28 ms instead of 0.40 ms.

  • Using FP16 optimization rather than the native TF graph increases throughput by 214% (from 314 to 988 fps). At the same time, latency decreases to 0.12 ms (roughly a 3x decrease).

  • Using INT8, the last result displayed above, we observed a speedup of 385% (from 314 to 1524) with the latency decreasing to 0.08 ms.


  • The above results do not include latency for image pre-processing or HTTP requests. In production systems, inference speed may not be a bottleneck at all, and you will need to account for all of the factors mentioned to measure your end-to-end inference speed.
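The percentages above follow directly from the measured numbers. As a quick check, this sketch recomputes them from the throughput and latency figures quoted in the text. The FP16 latency is read as 0.12 ms, consistent with the roughly 3x decrease mentioned above.

```python
# Throughput (fps) and latency (ms) as quoted in the text above.
throughput_fps = {"native": 314, "FP32": 440, "FP16": 988, "INT8": 1524}
latency_ms = {"native": 0.40, "FP32": 0.28, "FP16": 0.12, "INT8": 0.08}


def percent_increase(baseline, value):
    """Relative increase of value over baseline, in percent."""
    return (value / baseline - 1.0) * 100.0


def percent_decrease(baseline, value):
    """Relative decrease of value below baseline, in percent."""
    return (1.0 - value / baseline) * 100.0


for mode in ("FP32", "FP16", "INT8"):
    gain = percent_increase(throughput_fps["native"], throughput_fps[mode])
    drop = percent_decrease(latency_ms["native"], latency_ms[mode])
    print(f"{mode}: throughput +{gain:.1f}%, latency -{drop:.1f}%")
```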

Now, let’s pick a precision mode to deploy: in this case, INT8.

Converting a custom model to TensorRT

Download and extract ResNet model

To convert a custom model to a TensorRT graph, you will need a SavedModel. To download a ResNet SavedModel, which we will then convert to INT8, run the following command:

SavedModels are generated to accept either tensor or JPG inputs, and with channels_first (NCHW) or channels_last (NHWC) convolutions. NCHW is generally better for GPUs, while NHWC is generally better for CPUs. In this case, we are downloading a model that can handle JPG inputs.

tar -xzvf resnet_v2_fp32_savedmodel_NCHW_jpg
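The NCHW and NHWC names describe how the same tensor is laid out in memory. The following sketch is an illustration only. It computes where one pixel's channel values land in a flat buffer under each layout, and the 224x224 RGB image size is an assumption.

```python
def nchw_offset(shape, n, c, h, w):
    """Flat offset of element (n, c, h, w) in channels_first (NCHW) storage."""
    _, channels, height, width = shape
    return ((n * channels + c) * height + h) * width + w


def nhwc_offset(shape, n, c, h, w):
    """Flat offset of the same element in channels_last (NHWC) storage."""
    _, channels, height, width = shape
    return ((n * height + h) * width + w) * channels + c


# Logical shape as (N, C, H, W); one 224x224 RGB image is assumed here.
shape = (1, 3, 224, 224)

# Offsets of the three channel values of the top-left pixel.
nchw = [nchw_offset(shape, 0, c, 0, 0) for c in range(3)]
nhwc = [nhwc_offset(shape, 0, c, 0, 0) for c in range(3)]
print("NCHW offsets:", nchw)
print("NHWC offsets:", nhwc)
```

In NCHW, each channel is stored as a contiguous plane, so the three values of one pixel are a full plane apart. In NHWC, they sit next to each other.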

Convert the model to a TensorRT graph with TFTools

Now we can convert this model to its corresponding TensorRT graph with a simple tool:

  git clone
cd dlvm/tools/
python ./ \
      --input_model_dir=$WORKDIR/resnet_v2_fp32_savedmodel_NCHW_jpg/1538687370 \
      --output_model_dir=$WORKDIR/resnet_v2_int8_NCHW/00001 \
      --batch_size=128 \

You now have an INT8 model in your $WORKDIR/resnet_v2_int8_NCHW/00001 directory.

To ensure that everything is set up properly, try running an inference test.

  tensorflow_model_server --model_base_path=$WORKDIR/resnet_v2_int8_NCHW/ --rest_api_port=8888

Upload the model to Cloud Storage

You’ll need to run this step so that the model can be served from the multi-zone cluster that we will set up in the next section. To upload the model, complete the following steps:

1. Archive the model.

  tar -zcvf model.tar.gz ./resnet_v2_int8_NCHW/

2. Upload the archive to Cloud Storage, where $GCS_PATH is your destination path.

gsutil cp model.tar.gz $GCS_PATH

If needed, you can obtain an INT8 precision variant of the frozen graph from Cloud Storage at this URL:

  gsutil cp gs://cloud-samples-data/dlvm/t4/model.tar.gz .

Setting up a multi-zone cluster

Create the cluster

Now that we have a model in Cloud Storage, let’s create a cluster.

Create an instance template

An instance template is a useful way to create new instances. Here’s how:

export INSTANCE_TEMPLATE_NAME="tf-inference-template"
export IMAGE_FAMILY="tf-1-12-cu100"
export PROJECT_NAME="your-project-id"  # replace with your project ID

gcloud beta compute --project=$PROJECT_NAME instance-templates create $INSTANCE_TEMPLATE_NAME \
     --machine-type=n1-standard-16 \
     --maintenance-policy=TERMINATE \
     --accelerator=type=nvidia-tesla-t4,count=4 \
     --min-cpu-platform=Intel\ Skylake \
     --tags=http-server,https-server \
     --image-family=$IMAGE_FAMILY \
     --image-project=deeplearning-platform-release \
     --boot-disk-size=100GB \
     --boot-disk-type=pd-ssd \
     --boot-disk-device-name=$INSTANCE_TEMPLATE_NAME \
     --metadata startup-script-url=gs://cloud-samples-data/dlvm/t4/


  • This instance template includes a startup script that is specified by the metadata parameter.

  • The startup script runs during instance creation on every instance that uses this template, and performs the following steps:

    • Installs NVIDIA drivers on each new instance. Without NVIDIA drivers, inference will not work.

    • Installs a monitoring agent that monitors GPU usage on the instance

    • Downloads the model

    • Starts the inference service

  • The startup script then runs the inference service, which contains the inference logic. For this example, I have created a very small Python file based on the TFServe package.

  • To view the startup script, see

Create a managed instance group

You’ll need to set up a managed instance group, to allow you to run multiple instances in specific zones. The instances are created based on the instance template generated in the previous step.

export INSTANCE_GROUP_NAME="deeplearning-instance-group"
export INSTANCE_TEMPLATE_NAME="tf-inference-template"
gcloud compute instance-groups managed create $INSTANCE_GROUP_NAME \
   --template $INSTANCE_TEMPLATE_NAME \
   --base-instance-name deeplearning-instances \
   --size 2 \
   --zones us-central1-a,us-central1-b


  • INSTANCE_TEMPLATE_NAME is the name of the instance template that you created in the previous step.

  • You can create these instances in any available zones that support T4 GPUs. Ensure that you have available GPU quota in each zone.

  • Creating the instances takes some time. You can watch the progress with the following command:

export INSTANCE_GROUP_NAME="deeplearning-instance-group"
gcloud compute instance-groups managed list-instances $INSTANCE_GROUP_NAME --region us-central1
NAME                         ZONE           STATUS   ACTION  LAST_ERROR
deeplearning-instances-nrhq  us-central1-a  STAGING  CREATING
deeplearning-instances-gp2b  us-central1-b  STAGING  CREATING

Once the managed instance group is created, you should see output that resembles the following:

  gcloud compute instance-groups managed list-instances $INSTANCE_GROUP_NAME --region us-central1
NAME                         ZONE           STATUS   ACTION  LAST_ERROR
deeplearning-instances-nrhq  us-central1-a  RUNNING  NONE
deeplearning-instances-gp2b  us-central1-b  RUNNING  NONE
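If you would rather script this check than re-run the command by hand, you could parse the table shown above. This is a minimal sketch that assumes the column layout of the sample output. The function names are hypothetical, and in practice gcloud's --format=json output is likely to be more robust to parse.

```python
def parse_instances(table_text):
    """Parse the list-instances table into a list of dicts."""
    lines = [line for line in table_text.strip().splitlines() if line.strip()]
    rows = []
    for line in lines[1:]:  # the first line is the column header
        fields = line.split()
        rows.append({"name": fields[0], "zone": fields[1], "status": fields[2]})
    return rows


def all_running(rows):
    """True when there is at least one instance and every one is RUNNING."""
    return bool(rows) and all(row["status"] == "RUNNING" for row in rows)


# Sample text copied from the output shown above.
staging_output = """\
NAME                         ZONE           STATUS   ACTION  LAST_ERROR
deeplearning-instances-nrhq  us-central1-a  STAGING  CREATING
deeplearning-instances-gp2b  us-central1-b  STAGING  CREATING
"""

running_output = """\
NAME                         ZONE           STATUS   ACTION  LAST_ERROR
deeplearning-instances-nrhq  us-central1-a  RUNNING  NONE
deeplearning-instances-gp2b  us-central1-b  RUNNING  NONE
"""

print(all_running(parse_instances(staging_output)))
print(all_running(parse_instances(running_output)))
```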

Confirm metrics in Stackdriver

1. Open Stackdriver’s Metrics Explorer.

2. Search for gpu_utilization (Stackdriver > Resources > Metrics Explorer).


3. If data is coming in, you should see something like this:


Enable auto-scaling

Now, you’ll need to enable auto-scaling for your managed instance group.

  export INSTANCE_GROUP_NAME="deeplearning-instance-group"

gcloud compute instance-groups managed set-autoscaling $INSTANCE_GROUP_NAME \
   --custom-metric-utilization,utilization-target-type=GAUGE,utilization-target=85 \
   --max-num-replicas 4 \
   --cool-down-period 360 \
   --region us-central1


  • The is the full path to our metric.

  • We are using a target level of 85. This means that whenever GPU utilization reaches 85%, the platform will create a new instance in our group.
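To build intuition for how a utilization target drives the group size, here is a simplified model. It is an assumption made for illustration, not the actual Compute Engine autoscaler algorithm: it scales the group proportionally so that average utilization would fall back to the target, then clamps the result. The minimum size of 2 is also an assumption, while the maximum of 4 matches --max-num-replicas above.

```python
import math


def recommended_size(current_size, average_utilization, target, min_size, max_size):
    """Proportional scaling toward a utilization target (simplified model)."""
    if current_size <= 0:
        return min_size
    desired = math.ceil(current_size * average_utilization / target)
    return max(min_size, min(max_size, desired))


# Two instances pinned at 100% GPU utilization with a target of 85.
print(recommended_size(2, 100, 85, min_size=2, max_size=4))

# Load drops to 40%: the group shrinks, but not below the minimum.
print(recommended_size(2, 40, 85, min_size=2, max_size=4))
```

In practice, the cool-down period also matters, since newly created instances need time to start before their metrics are taken into account.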

Test auto-scaling

To test auto-scaling, perform the following steps:

1. SSH to the instances. See Connecting to Instances for more details.

2. Use the gpu-burn tool to load your GPU to 100% utilization for 600 seconds:

git clone
cd ml-on-gcp/third_party/gpu-burn
make
./gpu_burn 600 > /dev/null &


  • During the make process, you may see some warnings; you can ignore them.

  • You can monitor the GPU usage information with a refresh interval of 5 seconds:

  nvidia-smi -l 5

3. You can observe the autoscaling in Stackdriver, one instance at a time.

[Figure: autoscaling in Stackdriver]

4. Go to the Instance Groups page in the Google Cloud Console.

5. Click on the deeplearning-instance-group managed instance group.

6. Click on the Monitoring tab.

[Figure: metrics monitoring]

At this point your auto-scaling logic should be trying to spin up as many instances as possible to reduce the load. And that is exactly what is happening:

[Figure: metrics monitoring]

At this point you can safely stop any instances that are loaded by the gpu-burn tool and watch the cluster scale down.

Set up a load balancer

Let's revisit what we have so far:

  • A trained model, optimized with TensorRT 5 (using INT8 quantization)

  • A managed instance group. These instances have auto-scaling enabled based on GPU utilization.

Now you can create a load balancer in front of the instances.

Create health checks

Health checks are used to determine if a particular host on our backend can serve the traffic.

  export HEALTH_CHECK_NAME="http-basic-check"

gcloud compute health-checks create http $HEALTH_CHECK_NAME \
   --request-path /v1/models/default \
   --port 8888
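Conceptually, the health check above sends an HTTP request to /v1/models/default on port 8888 and treats a successful response as healthy. If you want to probe an instance by hand in the same way, a sketch along these lines could work. The function names are hypothetical, and treating only HTTP 200 as healthy is an assumption.

```python
import urllib.error
import urllib.request


def health_check_url(host, port=8888, path="/v1/models/default"):
    """Build the URL that the health check above requests."""
    return f"http://{host}:{port}{path}"


def is_healthy(host, timeout=5.0):
    """Return True if the model server answers the status request with HTTP 200."""
    try:
        with urllib.request.urlopen(health_check_url(host), timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

For example, calling is_healthy with the internal IP address of an instance in the group should return True once the inference service has started.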

Create an inference forwarder

Configure the named ports of the instance group so that the load balancer can forward inference requests, sent via port 80, to the inference service that is served via port 8888.

  export INSTANCE_GROUP_NAME="deeplearning-instance-group"

gcloud compute instance-groups set-named-ports $INSTANCE_GROUP_NAME \
   --named-ports http:8888 \
   --region us-central1

Create a backend service

Create a backend service that has an instance group and a health check.

First, create the backend service with the health check:

export HEALTH_CHECK_NAME="http-basic-check"
export WEB_BACKED_SERVICE_NAME="tensorflow-backend"

gcloud compute backend-services create $WEB_BACKED_SERVICE_NAME \
   --protocol HTTP \
   --health-checks $HEALTH_CHECK_NAME \
   --global

Then, add the instance group to the new backend service:

export INSTANCE_GROUP_NAME="deeplearning-instance-group"
export WEB_BACKED_SERVICE_NAME="tensorflow-backend"

gcloud compute backend-services add-backend $WEB_BACKED_SERVICE_NAME \
   --balancing-mode UTILIZATION \
   --max-utilization 0.8 \
   --capacity-scaler 1 \
   --instance-group $INSTANCE_GROUP_NAME \
   --instance-group-region us-central1 \
   --global

Set up the forwarding URL

The load balancer needs to know which URL can be forwarded to the backend services.

  export WEB_BACKED_SERVICE_NAME="tensorflow-backend"
export WEB_MAP_NAME="map-all"

gcloud compute url-maps create $WEB_MAP_NAME \
   --default-service $WEB_BACKED_SERVICE_NAME

Create the load balancer

  export WEB_MAP_NAME="map-all"
export LB_NAME="tf-lb"

gcloud compute target-http-proxies create $LB_NAME \
   --url-map $WEB_MAP_NAME

Add an external IP address to the load balancer:

export IP4_NAME="lb-ip4"

gcloud compute addresses create $IP4_NAME \
   --ip-version=IPV4 \
   --global

Find the allocated IP address:

  gcloud compute addresses list

Set up the forwarding rule that tells GCP to forward all requests from the public IP to the load balancer:

  export IP=$(gcloud compute addresses list | grep ${IP4_NAME} | awk '{print $2}')
export LB_NAME="tf-lb"
export FORWARDING_RULE="lb-fwd-rule"

gcloud compute forwarding-rules create $FORWARDING_RULE \
   --address $IP \
   --global \
   --target-http-proxy $LB_NAME \
   --ports 80

After creating the global forwarding rules, it can take several minutes for your configuration to propagate.

Enable the firewall

You need to create firewall rules for your project; otherwise, it will be impossible to connect to your VM instances from the external internet. To create firewall rules for your instances, run the following commands:

  gcloud compute firewall-rules create www-firewall-80 \
    --target-tags http-server --allow tcp:80

gcloud compute firewall-rules create www-firewall-8888 \
    --target-tags http-server --allow tcp:8888

Running inference

You can use the following Python script to convert images to a format that can be uploaded to the server.

import base64

INPUT_FILE = 'image.jpg'
OUTPUT_FILE = '/tmp/out.json'

# Open the image and convert it to base64
with open(INPUT_FILE, 'rb') as f:
  jpeg_bytes = base64.b64encode(f.read()).decode('utf-8')

# Build the JSON request body expected by the server
predict_request = '{"instances" : [{"b64": "%s"}]}' % jpeg_bytes

# Write JSON to file
with open(OUTPUT_FILE, 'w') as f:
  f.write(predict_request)

Finally, run the inference request:

  curl -X POST $IP/v1/models/default:predict -d @/tmp/out.json
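If you prefer to send the request from Python instead of curl, the same call can be sketched with the standard library. The function names and the timeout value are assumptions. The URL path and the request body format mirror the curl command and the conversion script above.

```python
import base64
import json
import urllib.request


def build_predict_body(jpeg_bytes):
    """Encode raw JPEG bytes into the JSON body used by the predict endpoint."""
    encoded = base64.b64encode(jpeg_bytes).decode("utf-8")
    return json.dumps({"instances": [{"b64": encoded}]}).encode("utf-8")


def predict(ip, jpeg_bytes, timeout=30.0):
    """POST an image to the load balancer and return the decoded JSON response."""
    request = urllib.request.Request(
        f"http://{ip}/v1/models/default:predict",
        data=build_predict_body(jpeg_bytes),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

To use it, read image.jpg in binary mode and pass the bytes to predict together with the load balancer's IP address.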

That’s it!

Toward TensorFlow inference bliss

Running ML inference workloads with TensorFlow has come a long way. Together, NVIDIA T4 GPUs and NVIDIA's TensorRT framework make running inference workloads a relatively trivial task, and with T4 GPUs available on Google Cloud, you can spin them up and down on demand. If you have feedback on this post, please reach out to us.

Acknowledgements: Viacheslav Kovalevskyi, Software Engineer, Gonzalo Gasca Meza, Developer Programs Engineer, Yaboo Oyabu, Machine Learning Specialist and Karthik Ramasamy, Software Engineer contributed to this post.

1. Inference benchmarks show ResNet training times to be 27x faster, and GNMT times to be 36x faster