Building a scalable TensorFlow inference system using Triton Inference Server and Tesla T4

This tutorial shows how to build a scalable TensorFlow inference system that uses NVIDIA Tesla T4 and Triton Inference Server (formerly called TensorRT Inference Server, or TRTIS). For an architectural overview of the system and to review terminology used throughout this series, see part 1 of this series. To learn how to measure performance and tune the system, see part 3 of this series.

Objectives

  • Download a pretrained ResNet-50 model, and use TensorFlow integration with TensorRT (TF-TRT) to apply optimizations.
  • Build an inference server system for the ResNet-50 model by using Triton.
  • Build a monitoring system for Triton by using Prometheus and Grafana.
  • Build a load testing tool by using Locust.

Costs

In addition to the NVIDIA T4 GPU, this tutorial uses the following billable components of Google Cloud:

To generate a cost estimate based on your projected usage, use the pricing calculator.

When you finish this tutorial, don't delete the resources you created. You need these resources in part 3 of this series.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.

  4. Enable the GKE API.

    Enable the API

Preparing a ResNet-50 model

In this tutorial, you serve a ResNet-50 model through the inference system. In the next section, you download a pretrained version of this model. If you plan to use that pretrained model, you can skip this section. If you prefer to build the same model yourself, follow the steps in this section.

You train a ResNet-50 model and export it in the SavedModel format by following Training ResNet on Cloud TPU and making the following changes:

  1. Replace the serving input function image_serving_input_fn in /usr/share/tpu/models/official/resnet/imagenet_input.py with the following function:

    def image_serving_input_fn():
      """Serving input fn for raw images."""
    
      # The shape of the input tensor is changed to NHWC.
      input_tensor = tf.placeholder(
          shape=[None, 224, 224, 3],
          dtype=tf.float32,
          name='input_tensor')
    
      # For simplicity, pass the raw tensor through without preprocessing.
      images = input_tensor
    
      return tf.estimator.export.TensorServingInputReceiver(
          features=images, receiver_tensors=input_tensor)
    

    Because Triton cannot efficiently handle the base64 string format as input data, you must change the format of the input tensor from a base64 string to NHWC (batch, height, width, channel). Note that the first dimension of the shape argument must be None. If you set this dimension to a constant number, the model won't be correctly quantized with INT8 in a later step. (A sketch of the corresponding client-side preprocessing follows these steps.)

  2. Follow the step Run the ResNet-50 model with fake_imagenet instead of using the full ImageNet dataset, because the procedure takes a few days to complete with the full dataset. The model trained on the fake dataset is sufficient for the purposes of this tutorial.

  3. Modify the number of training steps and the number of iterations per loop in the configuration file to reduce the training time. For example, you might change /usr/share/tpu/models/official/resnet/configs/cloud/v2-8.yaml as follows:

    train_steps: 100
    train_batch_size: 1024
    eval_batch_size: 1024
    iterations_per_loop: 100
    skip_host_call: True
    num_cores: 8
    
  4. When you run the training script, specify the export path by using the --export_dir option. This option tells the script to export the trained model in the SavedModel format:

    export PYTHONPATH="$PYTHONPATH:/usr/share/tpu/models"
    python /usr/share/tpu/models/official/resnet/resnet_main.py \
        --tpu=${TPU_NAME} \
        --mode=train \
        --data_dir=gs://cloud-tpu-test-datasets/fake_imagenet/ \
        --model_dir=${MODEL_DIR} \
        --export_dir=${MODEL_DIR}/export \
        --config_file=/usr/share/tpu/models/official/resnet/configs/cloud/v2-8.yaml
    

    In this example, when the training script finishes successfully, the trained model is exported in the SavedModel format under ${MODEL_DIR}/export in Cloud Storage.

In the sections that follow, the model is assumed to have been trained with the fake dataset at gs://cloud-tpu-test-datasets/fake_imagenet/ and exported to gs://solutions-public-assets/tftrt-tutorial/resnet/export/1584366419/. If you built your own model, change this URI in the following steps to match your setup.
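
Because the serving input function above accepts a dense NHWC float32 tensor rather than a base64 string, clients must convert images into that format before sending requests. The following is a minimal sketch of that preprocessing, assuming Pillow and NumPy are installed; the file name is only an example, and any normalization that the trained model expects (for example, pixel scaling) is omitted here.

    import numpy as np
    from PIL import Image

    def preprocess(path):
        """Convert an image file into a (1, 224, 224, 3) float32 NHWC array."""
        image = Image.open(path).convert('RGB').resize((224, 224))
        array = np.asarray(image, dtype=np.float32)  # shape (224, 224, 3)
        return np.expand_dims(array, axis=0)         # add the batch dimension

    batch = preprocess('cat.jpg')  # 'cat.jpg' is a placeholder file name
    print(batch.shape)             # (1, 224, 224, 3)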

Building optimized models with TF-TRT

In this section, you create a working environment and use TF-TRT to optimize and quantize the pretrained model. You use this working environment in the sections that follow.

Create a working environment

You create your working environment by creating a Compute Engine instance using the Deep Learning VM Image. You optimize and quantize the ResNet-50 model with TensorRT on this instance.

  1. Open Cloud Shell.

    Open Cloud Shell

  2. Deploy an instance, replacing PROJECT_ID with the project ID of the Cloud project that you created earlier:

    gcloud config set project PROJECT_ID
    gcloud config set compute/zone us-west1-b
    gcloud compute instances create working-vm \
        --scopes cloud-platform \
        --image-family common-cu101 \
        --image-project deeplearning-platform-release \
        --machine-type n1-standard-8 \
        --min-cpu-platform="Intel Skylake" \
        --accelerator=type=nvidia-tesla-t4,count=1 \
        --boot-disk-size=200GB \
        --maintenance-policy=TERMINATE \
        --metadata="install-nvidia-driver=True"
    

    This command launches a Compute Engine instance that uses an NVIDIA Tesla T4 GPU. On first boot, the instance automatically installs an NVIDIA GPU driver that is compatible with TensorRT 5.1.5.

Create model files with different optimizations

You apply the following optimizations to the original ResNet-50 model by using TF-TRT:

  • Graph optimization
  • Conversion to FP16 in addition to the graph optimization
  • Quantization with INT8 in addition to the graph optimization

These optimizations are explained in the Performance tuning section in part 1 of this series.
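
The trt-optimizer container that you build in the following steps wraps this TF-TRT conversion. As a rough illustration of what happens under the hood, here is a minimal sketch that uses the TF-TRT converter API from TensorFlow 1.x; it is not the tutorial repository's actual script, and the directory paths follow the layout used later in this section.

    # Minimal sketch, assuming TensorFlow 1.14+ built with TensorRT support
    # (for example, the nvcr.io/nvidia/tensorflow:19.05-py3 image).
    from tensorflow.python.compiler.tensorrt import trt_convert as trt

    converter = trt.TrtGraphConverter(
        input_saved_model_dir='models/resnet/original/00001',
        precision_mode='FP16',  # one of 'FP32', 'FP16', or 'INT8'
        max_batch_size=64)
    converter.convert()         # rewrites supported subgraphs as TensorRT ops
    converter.save('models/resnet/FP16/00001')

    # INT8 additionally requires a calibration pass over representative input
    # data (converter.calibrate(...)) before the converted model is saved.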

  1. In the Google Cloud console, go to Compute Engine > VM instances.

    Go to VM Instances

    You see the instance that you created in the previous section.

  2. Click SSH to open the terminal console of the instance. You use this terminal to run the commands in this tutorial.

  3. In the terminal, clone the repository that you need for this tutorial and change the current directory:

    cd $HOME
    git clone https://github.com/GoogleCloudPlatform/gke-tensorflow-inference-system-tutorial
    cd gke-tensorflow-inference-system-tutorial/server
    
  4. Download a pretrained ResNet-50 model (or copy the model that you built) to a local directory:

    mkdir -p models/resnet/original/00001
    gsutil cp -R gs://solutions-public-assets/tftrt-tutorial/resnet/export/1584366419/* models/resnet/original/00001
    
  5. Build a container image that contains optimization tools for TF-TRT:

    docker build ./ -t trt-optimizer
    docker image list
    

    The last command shows image IDs. Copy the image ID that has the repository name trt-optimizer. In the following example, the image ID is 3fa16b1b864c.

    REPOSITORY                     TAG                 IMAGE ID            CREATED              SIZE
    trt-optimizer                  latest              3fa16b1b864c        About a minute ago   6.96GB
    nvcr.io/nvidia/tensorflow      19.05-py3           01c8c4b0d7ff        2 months ago         6.96GB
    
  6. Apply the optimizations to the original model, replacing IMAGE-ID with the image ID that you copied in the previous step:

    export IMAGE_ID=IMAGE-ID
    
    nvidia-docker run --rm \
        -v `pwd`/models/:/workspace/models ${IMAGE_ID} \
        --input-model-dir='models/resnet/original/00001' \
        --output-dir='models/resnet' \
        --precision-mode='FP32' \
        --batch-size=64
    
    nvidia-docker run --rm \
        -v `pwd`/models/:/workspace/models ${IMAGE_ID} \
        --input-model-dir='models/resnet/original/00001' \
        --output-dir='models/resnet' \
        --precision-mode='FP16' \
        --batch-size=64
    
    nvidia-docker run --rm \
        -v `pwd`/models/:/workspace/models ${IMAGE_ID} \
        --input-model-dir='models/resnet/original/00001' \
        --output-dir='models/resnet' \
        --precision-mode='INT8' \
        --batch-size=64 \
        --calib-image-dir='gs://cloud-tpu-test-datasets/fake_imagenet/' \
        --calibration-epochs=10
    

    The three preceding commands correspond to the three optimizations: graph optimization, conversion to FP16, and quantization with INT8. Quantization with INT8 requires an additional process called calibration. For that process, you must provide training data by specifying the --calib-image-dir option in the last of the three commands. Use the same training data that you used to train the original model. The calibration process takes just over 5 minutes.

    When the commands complete, optimized model binaries are stored under the directory ./models/resnet. The structure of the directory is the following:

    models
    └── resnet
        ├── FP16
        │   └── 00001
        │       ├── saved_model.pb
        │       └── variables
        ├── FP32
        │   └── 00001
        │       ├── saved_model.pb
        │       └── variables
        ├── INT8
        │   └── 00001
        │       ├── saved_model.pb
        │       └── variables
        └── original
            └── 00001
                ├── saved_model.pb
                └── variables
                    ├── variables.data-00000-of-00001
                    └── variables.index
    

The following table summarizes the relationship between directories and optimizations.

Directory Optimization
original Original model (no optimization with TF-TRT)
FP32 Graph optimization
FP16 Conversion to FP16 in addition to the graph optimization
INT8 Quantization with INT8 in addition to the graph optimization

Deploying an inference server

In this section, you deploy Triton servers that serve five models. First, you upload the model binaries that you created in the previous section to Cloud Storage. Then you create a GKE cluster and deploy Triton servers on the cluster.

Upload the model binary

  1. Upload the model binaries to a storage bucket, replacing PROJECT_ID with the project ID of your Google Cloud project:

    export PROJECT_ID=PROJECT_ID
    export BUCKET_NAME=${PROJECT_ID}-models
    
    mkdir -p original/1/model/
    cp -r models/resnet/original/00001/* original/1/model/
    cp original/config.pbtxt original/1/model/
    cp original/imagenet1k_labels.txt original/1/model/
    
    mkdir -p tftrt_fp32/1/model/
    cp -r models/resnet/FP32/00001/* tftrt_fp32/1/model/
    cp tftrt_fp32/config.pbtxt tftrt_fp32/1/model/
    cp tftrt_fp32/imagenet1k_labels.txt tftrt_fp32/1/model/
    
    mkdir -p tftrt_fp16/1/model/
    cp -r models/resnet/FP16/00001/* tftrt_fp16/1/model/
    cp tftrt_fp16/config.pbtxt tftrt_fp16/1/model/
    cp tftrt_fp16/imagenet1k_labels.txt tftrt_fp16/1/model/
    
    mkdir -p tftrt_int8/1/model/
    cp -r models/resnet/INT8/00001/* tftrt_int8/1/model/
    cp tftrt_int8/config.pbtxt tftrt_int8/1/model/
    cp tftrt_int8/imagenet1k_labels.txt tftrt_int8/1/model/
    
    mkdir -p tftrt_int8_bs16_count4/1/model/
    cp -r models/resnet/INT8/00001/* tftrt_int8_bs16_count4/1/model/
    cp tftrt_int8_bs16_count4/config.pbtxt tftrt_int8_bs16_count4/1/model/
    cp tftrt_int8_bs16_count4/imagenet1k_labels.txt tftrt_int8_bs16_count4/1/model/
    
    gsutil mb gs://${BUCKET_NAME}
    gsutil -m cp -R original tftrt_fp32 tftrt_fp16 tftrt_int8 tftrt_int8_bs16_count4 \
        gs://${BUCKET_NAME}/resnet/
    

    In this step, you uploaded a configuration file config.pbtxt for each model in addition to the model binaries. For example, the following shows the content of original/1/model/config.pbtxt:

    name: "original"
    platform: "tensorflow_savedmodel"
    max_batch_size: 64
    input {
        name: "input"
        data_type: TYPE_FP32
        format: FORMAT_NHWC
        dims: [ 224, 224, 3 ]
    }
    output {
        name: "probabilities"
        data_type: TYPE_FP32
        dims: 1000
        label_filename: "imagenet1k_labels.txt"
    }
    default_model_filename: "model"
    instance_group [
      {
        count: 1
        kind: KIND_GPU
      }
    ]
    dynamic_batching {
      preferred_batch_size: [ 64 ]
      max_queue_delay_microseconds: 20000
    }
    

Note that the following tuning parameters are specified in this file. The batch size and the number of instance groups are explained in the Performance tuning section in part 1 of this series.

  • Model name
  • Input tensor name and output tensor name
  • GPU allocation to each model
  • Batch size and number of instance groups

The following table summarizes the five models that you deploy in this section.

Model name Optimization
original Original model (no optimization with TF-TRT)
tftrt_fp32 Graph optimization (batch size=64, instance groups=1)
tftrt_fp16 Conversion to FP16 in addition to the graph optimization (batch size=64, instance groups=1)
tftrt_int8 Quantization with INT8 in addition to the graph optimization (batch size=64, instance groups=1)
tftrt_int8_bs16_count4 Quantization with INT8 in addition to the graph optimization (batch size=16, instance groups=4)
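
The input and output tensor names in each config.pbtxt (input and probabilities in the example above) must match the signature of the exported SavedModel. If you built your own model, you can inspect that signature before writing the configuration. The following is a minimal sketch, assuming a TensorFlow 1.x environment (for example, inside the trt-optimizer container).

    import tensorflow as tf

    # Load the SavedModel and print its default serving signature.
    with tf.Session(graph=tf.Graph()) as sess:
        meta_graph = tf.saved_model.loader.load(
            sess, ['serve'], 'models/resnet/original/00001')
        signature = meta_graph.signature_def['serving_default']
        print(signature.inputs)   # tensor names, dtypes, and shapes of the inputs
        print(signature.outputs)  # tensor names, dtypes, and shapes of the outputs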

Deploy inference servers by using Triton

  1. Create a GKE cluster by using compute nodes with NVIDIA Tesla T4:

    gcloud auth login
    gcloud config set compute/zone us-west1-b
    gcloud container clusters create tensorrt-cluster \
        --num-nodes=20
    gcloud container node-pools create t4-pool \
        --num-nodes=1 \
        --machine-type=n1-standard-8 \
        --cluster=tensorrt-cluster \
        --accelerator type=nvidia-tesla-t4,count=1
    

    These commands create a GKE cluster that has 20 nodes, and they add a GPU node pool named t4-pool. The GPU node pool consists of a single n1-standard-8 instance with an NVIDIA Tesla T4 GPU. The number of GPU instances must be equal to or larger than the number of inference server Pods, because an NVIDIA Tesla T4 GPU cannot be shared by multiple Pods on the same instance. The --num-nodes option in the gcloud container node-pools create command specifies the number of instances in the GPU node pool.

  2. Show the cluster information:

    gcloud container clusters list
    

    The output is similar to the following:

    NAME              LOCATION    MASTER_VERSION  MASTER_IP      MACHINE_TYPE   NODE_VERSION    NUM_NODES  STATUS
    tensorrt-cluster  us-west1-b  1.14.10-gke.17  XX.XX.XX.XX    n1-standard-1  1.14.10-gke.17  21         RUNNING
    
  3. Show the node pool information:

    gcloud container node-pools list --cluster tensorrt-cluster
    

    The output is similar to the following:

    NAME          MACHINE_TYPE   DISK_SIZE_GB  NODE_VERSION
    default-pool  n1-standard-1  100           1.14.10-gke.17
    t4-pool       n1-standard-8  100           1.14.10-gke.17
    
  4. Install the NVIDIA GPU driver by deploying a DaemonSet workload:

    sudo apt-get install google-cloud-sdk-gke-gcloud-auth-plugin
    export USE_GKE_GCLOUD_AUTH_PLUGIN=True
    gcloud container clusters get-credentials tensorrt-cluster
    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
    

    These commands configure access to the GKE cluster and deploy a DaemonSet that installs the NVIDIA GPU driver on the nodes in the GPU node pool. The driver is also installed automatically when you add a new node to the GPU node pool.

  5. Deploy inference servers on the cluster:

    sed -i.bak "s/YOUR-BUCKET-NAME/${PROJECT_ID}-models/" trtis_deploy.yaml
    kubectl create -f trtis_service.yaml
    kubectl create -f trtis_deploy.yaml
    
  6. Wait a few minutes until services become available.

  7. Get the clusterIP address of Triton and store it in the environment variable that you use in the steps that follow:

    export TRITON_IP=$(kubectl get svc inference-server \
      -o "jsonpath={.spec['clusterIP']}")
    echo ${TRITON_IP}
    

At this point, the inference server is serving the five ResNet-50 models that you uploaded to Cloud Storage, which are built from the model files that you created in the section Create model files with different optimizations. Clients can specify which model to use when they send inference requests.
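
Before you set up monitoring and load testing, you can optionally verify that the server is healthy and that the models are loaded. The following is a minimal sketch; it assumes the v1 HTTP API of the r19.05 release and that the inference-server service exposes Triton's default HTTP port 8000, and because TRITON_IP is a cluster-internal address, it must run from inside the cluster (for example, from a Pod).

    import os
    import requests

    # Assumptions: port 8000 is exposed by the service, and the r19.05 (v1 API)
    # endpoints /api/health/ready and /api/status are available.
    triton_ip = os.environ['TRITON_IP']

    ready = requests.get(f'http://{triton_ip}:8000/api/health/ready')
    print('ready:', ready.status_code)  # 200 when the server can accept requests

    status = requests.get(f'http://{triton_ip}:8000/api/status')
    print(status.text)  # lists the loaded models and their states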

Deploy monitoring servers with Prometheus and Grafana

  1. Deploy Prometheus servers on the cluster:

    sed -i.bak "s/CLUSTER-IP/${TRITON_IP}/" prometheus-configmap.yml
    kubectl create namespace monitoring
    kubectl apply -f prometheus-service.yml -n monitoring
    kubectl create -f clusterRole.yml
    kubectl create -f prometheus-configmap.yml -n monitoring
    kubectl create -f prometheus-deployment.yml -n monitoring
    
  2. Get the endpoint URL of the Prometheus service. Copy the endpoint because you use it to configure Grafana in the steps that follow.

    ip_port=$(kubectl get svc prometheus-service \
      -o "jsonpath={.spec['clusterIP']}:{.spec['ports'][0]['port']}" -n monitoring)
    echo "http://${ip_port}"
    
  3. Deploy Grafana servers on the cluster:

    kubectl create -f grafana-service.yml -n monitoring
    kubectl create -f grafana-deployment.yml -n monitoring
    
  4. Wait a few minutes until all services become available.

  5. Get the endpoint URL of the Grafana service.

    ip_port=$(kubectl get svc grafana-service \
      -o "jsonpath={.status['loadBalancer']['ingress'][0]['ip']}:{.spec['ports'][0]['port']}" -n monitoring)
    echo "http://${ip_port}"
    
  6. Open this URL from a web browser and log in with the default user ID and password (admin and admin). You are asked to change the default password.

  7. Click the Add your first data source icon, and from the Time series databases list, select Prometheus.

  8. In the Settings tab, in the URL field, set the endpoint URL of the Prometheus service. The endpoint URL is the one you noted in step 2.

    Endpoint URL of the Prometheus service.

  9. Click Save and Test, and then click the Grafana icon to return to the home screen.

  10. Click the Create your first dashboard icon, and then click Add new panel to add a monitoring metric.

  11. In the Query tab, for Metric, enter nv_gpu_utilization.

    Set a metric for monitoring GPU utilization.

  12. In Panel options, for Title, enter GPU Utilization. Then, click the left arrow.

    Set the panel title for GPU utilization.

    You see the graph for GPU utilization.

    Graph for GPU utilization.

  13. Click the Add panel icon, click Add new panel, and then repeat the steps from step 11 to add a graph for the metric nv_gpu_memory_used_bytes with the title GPU Memory Used.
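
The values in these panels come from Triton's Prometheus metrics endpoint, which the Prometheus server scrapes. If the panels stay empty, you can check that the metrics are being exported. The following is a minimal sketch; it assumes that the inference-server service exposes Triton's default metrics port 8002 at /metrics and, like the earlier status check, it must run from inside the cluster.

    import os
    import requests

    # Assumption: Triton's default metrics port 8002 and /metrics path are exposed.
    triton_ip = os.environ['TRITON_IP']
    metrics = requests.get(f'http://{triton_ip}:8002/metrics').text

    # Print only the GPU metrics that the Grafana panels use.
    for line in metrics.splitlines():
        if line.startswith(('nv_gpu_utilization', 'nv_gpu_memory_used_bytes')):
            print(line)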

Deploying a load testing tool

In this section, you deploy the Locust load testing tool on GKE and generate workload to measure performance of the inference servers.

  1. Build a Docker image that contains Triton client libraries, and then upload it to Container Registry:

    cd ../client
    git clone https://github.com/triton-inference-server/server
    cd server
    git checkout r19.05
    sed -i.bak "s/bootstrap.pypa.io\/get-pip.py/bootstrap.pypa.io\/pip\/2.7\/get-pip.py/" Dockerfile.client
    docker build -t tritonserver_client -f Dockerfile.client .
    gcloud auth configure-docker
    docker tag tritonserver_client \
        gcr.io/${PROJECT_ID}/tritonserver_client
    docker push gcr.io/${PROJECT_ID}/tritonserver_client
    

    The build process takes a little over 5 minutes.

  2. Build a Docker image to generate testing workload, and then upload it to Container Registry:

    cd ..
    sed -i.bak "s/YOUR-PROJECT-ID/${PROJECT_ID}/" Dockerfile
    docker build -t locust_tester -f Dockerfile .
    docker tag locust_tester gcr.io/${PROJECT_ID}/locust_tester
    docker push gcr.io/${PROJECT_ID}/locust_tester
    

    This image is built from the image that you created in the previous step.

  3. Deploy the Locust files service_master.yaml and deployment_master.yaml:

    sed -i.bak "s/YOUR-PROJECT-ID/${PROJECT_ID}/" deployment_master.yaml
    sed -i.bak "s/CLUSTER-IP-TRTIS/${TRITON_IP}/" deployment_master.yaml
    
    kubectl create namespace locust
    kubectl create configmap locust-config \
        --from-literal model=original \
        --from-literal saddr=${TRITON_IP} \
        --from-literal rps=10 \
        -n locust
    
    kubectl apply -f service_master.yaml -n locust
    kubectl apply -f deployment_master.yaml -n locust
    

    The ConfigMap resource specifies the machine learning model that clients send inference requests to.

  4. Wait a few minutes until services become available.

  5. Get the clusterIP address of locust-master and store that address in an environment variable:

    export LOCUST_MASTER_IP=$(kubectl get svc locust-master -n locust \
        -o "jsonpath={.spec['clusterIP']}")
    echo ${LOCUST_MASTER_IP}
    
  6. Deploy the Locust client:

    sed -i.bak "s/YOUR-PROJECT-ID/${PROJECT_ID}/" deployment_slave.yaml
    sed -i.bak "s/CLUSTER-IP-LOCUST-MASTER/${LOCUST_MASTER_IP}/" deployment_slave.yaml
    kubectl apply -f deployment_slave.yaml -n locust
    

    These commands deploy 10 Locust client Pods that you can use to generate testing workloads. If you can't generate enough requests with the current number of clients, you can change the number of Pods by using the following command:

    kubectl scale deployment/locust-slave --replicas=20 -n locust
    

    If the default cluster doesn't have enough capacity to increase the number of replicas, we recommend that you increase the number of nodes in the GKE cluster.

  7. Copy the URL of the Locust console, and then open this URL in a web browser:

    export LOCUST_IP=$(kubectl get svc locust-master -n locust \
         -o "jsonpath={.status.loadBalancer.ingress[0].ip}")
    echo "http://${LOCUST_IP}:8089"
    

    You see the following console. You can generate testing workloads from this console.

    Locust console used for generating testing workloads.

You have completed building the inference server system. You can check the running Pods:

  1. Check the inference server Pod:

    kubectl get pods
    

    The output is similar to the following:

    NAME                                READY   STATUS    RESTARTS   AGE
    inference-server-67786cddb4-qrw6r   1/1     Running   0          83m
    
  2. Check the Locust Pods:

    kubectl get pods -n locust
    

    The output is similar to the following:

    NAME                                READY   STATUS    RESTARTS   AGE
    locust-master-75f6f6d4bc-ttllr      1/1     Running   0          10m
    locust-slave-76ddb664d9-8275p       1/1     Running   0          2m36s
    locust-slave-76ddb664d9-f45ww       1/1     Running   0          2m36s
    locust-slave-76ddb664d9-q95z9       1/1     Running   0          2m36s
    
  3. Check the monitoring Pods:

    kubectl get pods -n monitoring
    

    The output is similar to the following:

    NAME                                     READY   STATUS    RESTARTS   AGE
    grafana-deployment-644bbcb84-k6t7v       1/1     Running   0          79m
    prometheus-deployment-544b9b9f98-hl7q8   1/1     Running   0          81m
    

In the next part of this series, you use this inference server system to learn how various optimizations improve performance and how to interpret those optimizations.

What's next