Running a TensorFlow inference at scale using TensorRT 5 and NVIDIA T4 GPUs

This tutorial discusses how to run inference at scale using NVIDIA TensorRT 5 and T4 GPUs. NVIDIA TensorRT™ is a platform for high-performance deep-learning inference. It includes an inference optimizer and runtime that deliver low latency and high throughput for deep-learning inference applications.

In the tutorial, you set up a multi-zone cluster for running inference, with autoscaling based on GPU utilization.


This tutorial provides the following:

  • A reference architecture for implementing a scalable machine learning inference system on Google Cloud that's suitable for a development environment. Your infrastructure and security needs might vary, so you can adjust the configurations described in this tutorial accordingly.
  • A GitHub repository that contains scripts that you use in the tutorial to install the TensorFlow model and other required components.
  • Instructions for how to quantize the TensorFlow model using TensorRT, how to deploy scripts, and how to deploy the reference architecture.
  • Instructions for how to configure Cloud Load Balancing.

When you complete this tutorial, you'll have a pre-trained quantized model in Cloud Storage and a managed Compute Engine instance group, spread across two zones and fronted by Cloud Load Balancing, serving web traffic. This architecture is illustrated in the following diagram.

Architecture used in this tutorial


The overall approach in this tutorial is as follows:

  • Start with a pre-trained graph.
  • Optimize the model with TensorRT, and see how much faster the model is with different optimizations.
  • After finalizing the model, create a cluster based on Compute Engine Deep Learning VM instances, which come with TensorFlow, TensorFlow Serving, and TensorRT 5 preinstalled.


Costs

This tutorial uses the following billable components of Google Cloud:

  • Compute Engine
  • Persistent Disk
  • Cloud Storage
  • Networking

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

Before you begin

  1. In the Google Cloud console, go to the project selector page.

    Go to project selector

  2. Select or create a Google Cloud project.

  3. Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.

  4. Enable the Compute Engine and Cloud Logging APIs.

    Enable the APIs

  5. Ensure that you have sufficient GPU quota to create VMs.

When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Clean up.

Preparing your environment

In this section, you make default settings for values that are used throughout the tutorial, like region and zone. The tutorial uses us-central1 as the default region and us-central1-a as the default zone.

You also create a file with all of the settings, so that you can load the variables automatically if you need to reopen Cloud Shell and reinitialize the settings.

  1. Open Cloud Shell:

    Open Cloud Shell

  2. Set the default region and zone:

    gcloud compute project-info add-metadata \
        --metadata google-compute-default-region=us-central1,google-compute-default-zone=us-central1-a
  3. Reinitialize the shell:

    gcloud init --console-only
  4. Press 1 for the first three questions, and then enter the project ID for the last question.

Optimizing the model with TensorRT

  1. In Cloud Shell, create an instance that you can use for the model preparation:

    export IMAGE_FAMILY="tf-latest-cu100"
    export INSTANCE_NAME="model-prep"
    gcloud compute instances create $INSTANCE_NAME \
        --image-family=$IMAGE_FAMILY \
        --machine-type=n1-standard-8 \
        --image-project=deeplearning-platform-release \
        --maintenance-policy=TERMINATE \
        --accelerator="type=nvidia-tesla-t4,count=1"

    One GPU is more than enough to compare the different TensorRT optimization modes and to get a sense of how fast a single GPU can be.

  2. After the VM instance is created, use ssh to connect to the VM.

  3. In the instance, download the resnetv2 model from the official TensorFlow repository to test TensorRT optimization:

    wget -q

TensorRT alone can speed up inference, but additional improvement comes from quantization. Linear model quantization converts the model's weights and activations from floating point to integer. For example, if the initial weights of the model are FP32 (32-bit floating point), you can reduce the precision to INT8. But quantization isn't free: reducing the storage representation can slightly reduce the accuracy of the model. Going from FP32 to FP16, however, is practically free in accuracy terms.
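To make the trade-off concrete, here is a toy sketch of linear (symmetric) quantization — illustrative only, not the TensorRT implementation. It maps FP32 weights onto the INT8 range and measures the round-trip error:

```python
import numpy as np

# Toy FP32 "weights", e.g. a small convolution kernel.
w = np.random.RandomState(0).uniform(-1.0, 1.0, 1000).astype(np.float32)

# Symmetric linear quantization: map [-max|w|, +max|w|] onto [-127, 127].
scale = np.abs(w).max() / 127.0
w_int8 = np.round(w / scale).astype(np.int8)       # 4x smaller than FP32
w_restored = w_int8.astype(np.float32) * scale     # dequantize

# The reconstruction error is bounded by scale/2: small, but not zero.
max_err = np.abs(w - w_restored).max()
print(f"storage: {w.nbytes} -> {w_int8.nbytes} bytes, max error: {max_err:.6f}")
```

The same idea is applied per layer in practice; FP32 to FP16 is gentler because FP16 still covers the typical dynamic range of weights.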

How do you pick the right compromise between speed (weights precision) and accuracy? Code that measures accuracy against speed and other metrics already exists for this purpose. The test is limited to image-recognition models, but it isn't difficult to implement a custom test based on this code.
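If you do roll your own test, only two measurements per model variant matter: top-1 accuracy on a labeled set and throughput. A minimal harness might look like the following sketch (the helper names are hypothetical, not code from the repository):

```python
import time

def benchmark(predict, samples):
    """Measure top-1 accuracy and throughput of a predict(image) -> label callable.

    samples is a list of (image, true_label) pairs.
    """
    start = time.perf_counter()
    correct = sum(1 for image, label in samples if predict(image) == label)
    elapsed = time.perf_counter() - start
    return {"accuracy": correct / len(samples), "fps": len(samples) / elapsed}

# Stand-in "model" that classifies integers by parity, just to show the shape.
samples = [(i, i % 2) for i in range(1000)]
print(benchmark(lambda image: image % 2, samples))
```

Running such a harness once per graph variant (native, FP32, FP16, INT8) yields the same accuracy-versus-speed comparison that the repository's test produces.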

  1. In Cloud Shell, download the test code:

    git clone
    cd models
    git checkout f0e10716160cd048618ccdd4b6e18336223a172f
    touch research/
    touch research/tensorrt/
    cp research/tensorrt/labellist.json .
    cp research/tensorrt/image.jpg .
  2. Prepare the test for execution:

    python -m research.tensorrt.tensorrt \
        --frozen_graph=$HOME/resnetv2_imagenet_frozen_graph.pb \
        --image_file=$HOME/models/image.jpg \
        --native --fp32 --fp16 --int8

    The test requires a frozen graph (the resnetv2 model that you downloaded earlier) and arguments for the different quantization modes that you want to test. This command takes some time to finish.

    When execution finishes, the resulting output is a comparison of the inference results for the different versions of the graph:

    comparing the inference result for a different version of the graph

    The results from FP32 and FP16 are identical, showing the same accuracy, which means that if you're okay with TensorRT optimization, you can start using FP16 right away. INT8, in contrast, shows slightly less accurate results.

  3. Show the accuracy numbers:

    cat $HOME/log.txt

    This command produces the following output:

    inference at scale log

TensorRT 5 shows the following results, all compared to native:

  • For FP32, throughput improved by ~34%, from 319.1 fps to 428.2 fps.
  • For FP16, throughput improved by ~207%, from 319.1 fps to 979.6 fps.
  • For INT8, throughput improved by ~376%, from 319.1 fps to 1519.5 fps.
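These percentages are simply the throughput ratios relative to native, which you can verify with a quick calculation:

```python
native = 319.1  # native TensorFlow throughput, in fps

# Improvement over native for each TensorRT precision mode.
for mode, fps in [("FP32", 428.2), ("FP16", 979.6), ("INT8", 1519.5)]:
    gain = (fps / native - 1) * 100
    print(f"{mode}: +{gain:.0f}% vs. native")
# FP32: +34%, FP16: +207%, INT8: +376%
```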

What you can learn from these results:

  • Going from native to TensorRT introduces a small accuracy cost. If you're okay with that small price, you can probably go directly to FP16.
  • INT8 is very fast, but the accuracy cost is noticeably higher.

The next section uses the INT8 model.

Converting a custom model to TensorRT

In order to convert a model to a TensorRT graph, you need a SavedModel.

  1. In Cloud Shell, extract the pre-trained SavedModel:

    tar -xzvf resnet_v2_fp32_savedmodel_NCHW.tar.gz
  2. Create a Python script that converts the frozen model to a TensorRT graph:

    cat <<EOF >
    import tensorflow.contrib.tensorrt as trt
    import argparse

    parser = argparse.ArgumentParser(description="Converts TF SavedModel to the TensorRT enabled graph.")
    parser.add_argument("--input_model_dir", required=True)
    parser.add_argument("--output_model_dir", required=True)
    parser.add_argument("--batch_size", type=int, required=True)
    parser.add_argument("--precision_mode", choices=["FP32", "FP16", "INT8"], required=True)
    args = parser.parse_args()

    trt.create_inference_graph(
        None, None, max_batch_size=args.batch_size,
        precision_mode=args.precision_mode,
        input_saved_model_dir=args.input_model_dir,
        output_saved_model_dir=args.output_model_dir)
    EOF
  3. Convert the model to the TensorRT graph:

    python ./ \
        --input_model_dir=$HOME/resnet_v2_fp32_savedmodel_NCHW/1538687196 \
        --output_model_dir=$HOME/resnet_v2_int8_NCHW/00001 \
        --batch_size=128 \
        --precision_mode=INT8

    You now have an INT8 model in the folder $HOME/resnet_v2_int8_NCHW/00001.

  4. Start the model server to make sure that everything is working:

    tensorflow_model_server --model_base_path=$HOME/resnet_v2_int8_NCHW/ --rest_api_port=8888
  5. To verify that it is working, send the following sample input:

    curl -X POST localhost:8888/v1/models/default:predict -d '{"instances": [[[[1,1,1]]]]}'
  6. If you see results from this curl command, exit the inference run by pressing Ctrl+C.

  7. To use the optimized model from your cluster, upload the model to Cloud Storage, replacing [GCS_PATH] with your Cloud Storage bucket name:

    tar -zcvf model.tar.gz ./resnet_v2_int8_NCHW/
    export GCS_PATH=[GCS_PATH]
    gsutil cp model.tar.gz $GCS_PATH

    The next time you want to use this model, you don't have to repeat this whole process. Instead, you can use the INT8 model that's in your Cloud Storage bucket.
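As an aside, the curl sanity check from step 5 can also be scripted. The sketch below posts the same request with Python's standard library; it assumes a TensorFlow Serving REST endpoint like the one started earlier, and the predict helper is illustrative, not part of the tutorial's scripts:

```python
import json
import urllib.request

def predict(host, instances, model="default"):
    """POST a TensorFlow Serving REST :predict request and return the parsed reply."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    req = urllib.request.Request(
        f"http://{host}/v1/models/{model}:predict",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Same trivial input as the curl check; run against the server on port 8888:
# predict("localhost:8888", [[[[1, 1, 1]]]])
```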


Setting up a cluster

Now that you have a model in Cloud Storage, you can create a cluster. The first step is to create a VM template. The cluster uses the VM template to create new instances.

  1. In Cloud Shell, download the code you need to set up a cluster:

    git clone
  2. Create the VM template, replacing [PROJECT_NAME] with your project name:

    export PROJECT_NAME="[PROJECT_NAME]"
    export INSTANCE_TEMPLATE_NAME="tf-inference-template"
    export IMAGE_FAMILY="tf-latest-cu100"
    gcloud beta compute --project=$PROJECT_NAME instance-templates create $INSTANCE_TEMPLATE_NAME \
        --machine-type=n1-standard-16 \
        --maintenance-policy=TERMINATE \
        --accelerator=type=nvidia-tesla-t4,count=4 \
        --min-cpu-platform=Intel\ Skylake \
        --tags=http-server,https-server \
        --image-family=$IMAGE_FAMILY \
        --image-project=deeplearning-platform-release \
        --boot-disk-size=100GB \
        --boot-disk-type=pd-ssd \
        --boot-disk-device-name=$INSTANCE_TEMPLATE_NAME \
        --metadata startup-script-url=gs://solutions-public-assets/tensorrt-t4-gpu/

    The metadata parameter specifies a startup script that's installed on every instance created by the VM template. The startup script performs the following procedures when the VM instance starts:

    • Installs NVIDIA drivers.
    • Installs a monitoring agent to monitor the GPU usage.
    • Downloads the model.
    • Starts the inference service.

    When the template is ready, you can create the managed instance group. The group doesn't have autoscaling or a health check yet; it only guarantees that two instances are running in the specified zones.

  3. Create the managed instance group:

    export INSTANCE_GROUP_NAME="deeplearning-instance-group"  # any name you choose
    gcloud compute instance-groups managed create $INSTANCE_GROUP_NAME \
        --template $INSTANCE_TEMPLATE_NAME \
        --base-instance-name deeplearning-instances \
        --size 2 \
        --zones us-central1-a,us-central1-b

    The INSTANCE_TEMPLATE_NAME value is the name of the instance template that you created in an earlier step. Pick zones based on the availability of GPUs (not all GPUs are available in all zones) and based on your quotas.

    Creating the group takes some time.

  4. Watch the progress by running the following command:

    gcloud compute instance-groups managed list-instances $INSTANCE_GROUP_NAME --region us-central1

    The output looks something like this:

    while the group is being created

    When creation is finished, you get something like this:

    after creating the group

  5. Open the Monitoring page in the console.

    Go to the Monitoring page

  6. Ensure that you're in the correct project workspace, which is shown in the top-left corner. If you're visiting this page for the first time, you must create a new workspace.

  7. In the Metrics Explorer page, for Resource type, select GCE VM Instance, and for Metrics, select custom/gpu_utilization:

    Metrics Explorer page

    If data is coming in, you see something like this:

    Metrics graph showing zero usage

  8. In Cloud Shell, enable autoscaling for your group:

    gcloud compute instance-groups managed set-autoscaling $INSTANCE_GROUP_NAME \
        --custom-metric-utilization metric=custom.googleapis.com/gpu_utilization,utilization-target-type=GAUGE,utilization-target=85 \
        --max-num-replicas 4 \
        --cool-down-period 360 \
        --region us-central1

    The important part here is the metric path, custom.googleapis.com/gpu_utilization, which is the full path to your metric. Also, because you specified a target level of 85, whenever average GPU utilization reaches 85%, the autoscaler creates a new instance in your group.
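To build intuition for the target value, a target-based autoscaler sizes the group roughly proportionally. The following is a simplified model of that decision, not the exact Google Cloud algorithm:

```python
import math

def recommended_size(current_size, avg_gpu_utilization, target=85, max_replicas=4):
    """Simplified proportional sizing: grow until average utilization <= target."""
    needed = math.ceil(current_size * avg_gpu_utilization / target)
    return min(max(needed, 1), max_replicas)

# Two instances pegged at 100% GPU -> grow to 3; back at 50% -> shrink toward 2.
print(recommended_size(2, 100))  # 3
print(recommended_size(3, 50))   # 2
```

This is why a target of 85 leaves headroom: with the target at 100, the group would never scale up until the GPUs were already saturated.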

Testing autoscaling

In order to test autoscaling, which you set up in the previous section, you must do the following:

  • Use SSH to connect to one of the deep-learning GPU instances.
  • Load all GPUs to 100%.
  • Observe as your autoscaling group scales up by creating one more instance.

  1. Connect to the instance through ssh.
  2. Load your GPU to 100% for 600 seconds:

    git clone
    cd tensorflow-inference-tensorrt5-t4-gpu
    git submodule update --init --recursive
    cd third_party/gpu-burn
    ./gpu_burn 600 > /dev/null &

    In the console, notice the activity on the Metrics Explorer page:

    activity spike in Metrics Explorer page

  3. Wait for the autoscaler to add a second instance to the group.

  4. Go to GCE Compute Page > Instance Groups > Monitoring and observe the activity:

    higher activity spike in monitoring

    At this point, your autoscaler is trying to spin up as many instances as possible to reduce the load (without any luck, because the synthetic load never drops). And that's what's happening:

    spinning up lots of instances

  5. When the synthetic load ends (gpu_burn stops after 600 seconds), watch the activity scale back down.

Take a look at what you have:

  • A trained model, optimized with TensorRT 5 (INT8)
  • A managed group of deep-learning instances
  • Autoscaling based on GPU utilization

Creating a load balancer

The final step is to create a load balancer in front of the instances.

  1. In Cloud Shell, create health checks to determine if a particular host on your backend can serve the traffic:

    export HEALTH_CHECK_NAME="http-basic-check"  # any name you choose
    gcloud compute health-checks create http $HEALTH_CHECK_NAME \
        --request-path /v1/models/default \
        --port 8888
  2. Configure the named ports of the instance group so that the load balancer can forward inference requests through port 80 to the inference service that's served through port 8888:

    gcloud compute instance-groups set-named-ports $INSTANCE_GROUP_NAME \
        --named-ports http:8888 \
        --region us-central1
  3. Create a backend service:

    export WEB_BACKED_SERVICE_NAME="tensorflow-backend"
    gcloud compute backend-services create $WEB_BACKED_SERVICE_NAME \
        --protocol HTTP \
        --health-checks $HEALTH_CHECK_NAME \
        --global

    Effectively, a backend service is an instance group with a health check attached.

  4. Add your instance group to the new backend service:

    gcloud compute backend-services add-backend $WEB_BACKED_SERVICE_NAME \
        --balancing-mode UTILIZATION \
        --max-utilization 0.8 \
        --capacity-scaler 1 \
        --instance-group $INSTANCE_GROUP_NAME \
        --instance-group-region us-central1 \
        --global
  5. Tell the load balancer which URL to forward to the backend service:

    export WEB_MAP_NAME="map-all"
    gcloud compute url-maps create $WEB_MAP_NAME \
        --default-service $WEB_BACKED_SERVICE_NAME
  6. Create the load balancer:

    export LB_NAME="tf-lb"
    gcloud compute target-http-proxies create $LB_NAME \
        --url-map $WEB_MAP_NAME
  7. Create an external IP address for your load balancer:

    export IP4_NAME="lb-ip4"
    gcloud compute addresses create $IP4_NAME \
        --ip-version=IPV4 \
        --global
  8. Verify that the IP address has been allocated:

    gcloud compute addresses list
  9. Create the forwarding rule that Google Cloud uses to forward all requests from the public IP address to the load balancer:

    export IP=$(gcloud compute addresses list | grep ${IP4_NAME} | awk '{print $2}')
    export FORWARDING_RULE="lb-fwd-rule"
    gcloud compute forwarding-rules create $FORWARDING_RULE \
        --address $IP \
        --global \
        --target-http-proxy $LB_NAME \
        --ports 80

    After you create the global forwarding rule, it can take several minutes for your configuration to propagate.

  10. To allow external connections to your instances, create firewall rules on the project:

    gcloud compute firewall-rules create www-firewall-80 \
        --target-tags http-server --allow tcp:80
    gcloud compute firewall-rules create www-firewall-8888 \
        --target-tags http-server --allow tcp:8888
  11. Convert the image to a format that can be sent to the server:

    cat <<EOF >
    from PIL import Image
    import numpy as np
    import json
    import codecs

    # Scale pixel values to [0, 1].
    img = np.asarray(Image.open("image.jpg").resize((240, 240))).astype(float) / 255
    result = {
        "instances": [img.tolist()]
    }
    print(json.dump(result, codecs.open('/tmp/out.json', 'w', encoding='utf-8'), separators=(',', ':'), sort_keys=True, indent=4))
    EOF
  12. Run an inference:

    wget -O image.jpg
    curl -X POST $IP/v1/models/default:predict -d @/tmp/out.json

    If the inference works correctly, the result is something like this:

    Successful result of running an inference

Clean up

After you finish the tutorial, you can clean up the resources that you created so that they stop using quota and incurring charges. The following sections describe how to delete or turn off these resources.

  1. In the console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next