Measuring and tuning performance of a TensorFlow inference system using Triton Inference Server and Tesla T4

This tutorial shows how to measure the performance of the TensorFlow inference system that you created in part 2 of this series, and to apply parameter tuning to improve system throughput. The tutorial is not intended to provide the performance data of a particular system. Instead, it offers general guidance on the performance measurement process.

The concrete values of performance metrics such as "Total Requests per Second (RPS)" and "Response Time" differ depending on the trained model, software versions, and hardware configuration.

Objectives

  • Define the performance objective and metrics.
  • Measure baseline performance.
  • Perform graph optimization.
  • Measure FP16 conversion.
  • Measure INT8 quantization.
  • Adjust the number of instances.

Costs

In addition to NVIDIA T4 GPUs, this tutorial uses billable components of Google Cloud, including the GKE cluster and the Compute Engine resources that host the inference system and the load testing tool.

When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Clean up.

Before you begin

Before starting this tutorial, you need to finish building the inference system by following part 2 of this series. To work through this tutorial, you use the SSH terminal and the Locust console that you prepared in part 2, along with the Grafana dashboard for monitoring GPU utilization.

In the SSH terminal, set the current directory to the client subdirectory:

cd $HOME/gke-tensorflow-inference-system-tutorial/client

In this tutorial, you run all commands from this directory.

Defining the performance objective

When you measure the performance of an inference system, you must define the performance objective and appropriate performance metrics according to the use case of the system. For simplicity, this tutorial assumes the following performance objectives:

  • 95% of requests receive responses within 100 ms.
  • Total throughput (requests per second) improves without breaking the previous objective.

Using these assumptions, you measure the throughput of the following ResNet-50 models with different optimizations. When a client sends inference requests, it specifies the model using the model name in this table. You also apply parameter tuning to improve the throughput of the model.

Model name               Optimization
original                 Original model (no optimization with TF-TRT)
tftrt_fp32               Graph optimization (batch size: 64, instance groups: 1)
tftrt_fp16               Conversion to FP16 in addition to the graph optimization (batch size: 64, instance groups: 1)
tftrt_int8               Quantization with INT8 in addition to the graph optimization (batch size: 64, instance groups: 1)
tftrt_int8_bs16_count4   Quantization with INT8 in addition to the graph optimization (batch size: 16, instance groups: 4)
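
As noted above, a client selects a model by its name. For reference, the following minimal Python sketch shows how a request targets a specific model through the Triton client library. It assumes that the Triton gRPC port 8001 is reachable at the address stored in ${TRITON_IP}; the tensor name and shape are placeholders that you would look up with get_model_metadata for the actual ResNet-50 model.

    import numpy as np
    import tritonclient.grpc as grpcclient

    # Connect to the Triton gRPC endpoint (8001 is Triton's default gRPC port).
    client = grpcclient.InferenceServerClient(url="<TRITON_IP>:8001")

    # The model is selected by model_name, for example "tftrt_fp32".
    # The tensor name "input" and the shape are placeholders; use
    # client.get_model_metadata("tftrt_fp32") to look up the real values.
    image = np.zeros((1, 224, 224, 3), dtype=np.float32)
    inputs = [grpcclient.InferInput("input", list(image.shape), "FP32")]
    inputs[0].set_data_from_numpy(image)

    result = client.infer(model_name="tftrt_fp32", inputs=inputs)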

Measuring baseline performance

You start by measuring the performance of the original model, which is not optimized with TF-TRT, and you use it as a baseline. You compare the performance of the other models with this baseline in order to quantitatively evaluate the performance improvement. When you deployed Locust, it was already configured to send requests for the original model.

  1. Open the Locust console and confirm that the number of clients (referred to as slaves) is 10. If the number is less than 10, the clients are still starting up. In that case, wait a few minutes until the number reaches 10.
  2. Set Number of users to simulate to 3000, and set Hatch rate to 5.
  3. Click Start swarming to increase the number of simulated users by 5 per second until it reaches 3000.

    Starting a new Locust swarm.

  4. Click Charts to show the following charts.

    Locust charts showing Total Requests per Second and Response Times (ms).

    Observe that as the Total Requests per Second value increases linearly, the Response Times (ms) value increases accordingly.

  5. When the 95th percentile of the Response Times value exceeds 100 ms, click Stop to stop the simulation. If you hold the pointer over the graph, you can check the number of requests per second that corresponds to the point when the 95th percentile of the Response Times value exceeded 100 ms.

    In the following screenshot, the number of requests per second is 253.1.

    Graph showing 253.1 requests per second.

    We recommend that you repeat this measurement several times and take an average to account for fluctuation.

  6. Restart Locust:

    kubectl delete -f deployment_master.yaml -n locust
    kubectl delete -f deployment_slave.yaml -n locust
    kubectl apply -f deployment_master.yaml -n locust
    kubectl apply -f deployment_slave.yaml -n locust
    
  7. Go back to step 1 to repeat the measurement.

Optimizing graphs

In this section, you measure the performance of the model tftrt_fp32, which is optimized by using TF-TRT for graph optimization. This is a common optimization that is compatible with most NVIDIA GPUs.
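
The tftrt_* models were already built in part 2. As a reminder of what the graph optimization step looks like, the following sketch shows a minimal TF-TRT conversion; the SavedModel paths are placeholders, and the exact arguments can vary with your TensorFlow version. The FP16 and INT8 models in the next sections use the same converter with a different precision_mode.

    from tensorflow.python.compiler.tensorrt import trt_convert as trt

    # Placeholder paths: point these at the ResNet-50 SavedModel from part 2.
    INPUT_SAVED_MODEL = "resnet/original/1/model.savedmodel"
    OUTPUT_SAVED_MODEL = "resnet/tftrt_fp32/1/model.savedmodel"

    # Graph optimization only (tftrt_fp32): TensorRT rewrites the supported
    # subgraphs while keeping FP32 precision. Use "FP16" for tftrt_fp16.
    params = trt.TrtConversionParams(precision_mode="FP32")
    converter = trt.TrtGraphConverterV2(
        input_saved_model_dir=INPUT_SAVED_MODEL,
        conversion_params=params)
    converter.convert()
    converter.save(OUTPUT_SAVED_MODEL)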

  1. Restart the load testing tool. Use the configmap resource to specify the model as tftrt_fp32.

    kubectl delete configmap locust-config -n locust
    kubectl create configmap locust-config \
        --from-literal model=tftrt_fp32 \
        --from-literal saddr=${TRITON_IP} \
        --from-literal rps=10 -n locust
    kubectl delete -f deployment_master.yaml -n locust
    kubectl delete -f deployment_slave.yaml -n locust
    kubectl apply -f deployment_master.yaml -n locust
    kubectl apply -f deployment_slave.yaml -n locust
    
  2. Restart the Triton server:

    kubectl scale deployment/inference-server --replicas=0
    kubectl scale deployment/inference-server --replicas=1
    

    Wait a few minutes until the server processes become ready. To confirm readiness from the client side, see the sketch after this list.

  3. Repeat the performance measurement that you took in the previous section.

    In the following screenshots, the number of requests per second is 381.

    Graph showing response time with requests per second of 381.

    Graph showing 381 requests per second.

    These images show the performance improvement of the TF-TRT graph optimization.
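
After you restart the Triton server, you can confirm from the client side that the server and the model are ready before you start a new measurement. The following sketch assumes that the Triton gRPC port 8001 is reachable at the address stored in ${TRITON_IP}, which is the same address that the Locust clients use.

    import tritonclient.grpc as grpcclient

    # Replace <TRITON_IP> with the address that the Locust clients use.
    client = grpcclient.InferenceServerClient(url="<TRITON_IP>:8001")

    # Both checks should return True before you click Start swarming.
    print("server ready:", client.is_server_ready())
    print("model ready:", client.is_model_ready("tftrt_fp32"))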

Converting to FP16

In this section, you measure the performance of the model tftrt_fp16 that is optimized with TF-TRT for graph optimization and FP16 conversion. This is an optimization available for NVIDIA Tesla T4.

  1. Restart the load testing tool. Use the configmap resource to specify the model as tftrt_fp16.

    kubectl delete configmap locust-config -n locust
    kubectl create configmap locust-config \
        --from-literal model=tftrt_fp16 \
        --from-literal saddr=${TRITON_IP} \
        --from-literal rps=10 -n locust
    kubectl delete -f deployment_master.yaml -n locust
    kubectl delete -f deployment_slave.yaml -n locust
    kubectl apply -f deployment_master.yaml -n locust
    kubectl apply -f deployment_slave.yaml -n locust
    
  2. Restart the Triton server:

    kubectl scale deployment/inference-server --replicas=0
    kubectl scale deployment/inference-server --replicas=1
    

    Wait a few minutes until the server processes become ready.

  3. Repeat the performance measurement that you took in the previous section. In the following example, the number of requests per second is 1072.5.

    Graph showing response time with 1072.5 requests per second.

    Graph showing 1072.5 requests per second.

    These images show the performance improvement of the FP16 conversion in addition to the TF-TRT graph optimization.

Quantizing with INT8

In this section, you measure the performance of the model tftrt_int8 that is optimized with TF-TRT for graph optimization and INT8 quantization. This optimization is available for NVIDIA Tesla T4.
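
Unlike FP16 conversion, INT8 quantization requires a calibration step: the converter runs representative input data through the model to determine the dynamic range of each tensor. The following sketch shows only the calibration-specific part of the TF-TRT API; the calibration data and the SavedModel paths are placeholders.

    import numpy as np
    from tensorflow.python.compiler.tensorrt import trt_convert as trt

    def calibration_input_fn():
        # Placeholder calibration data: in practice, yield batches of real,
        # preprocessed images that are representative of the inference traffic.
        for _ in range(8):
            yield (np.random.random((16, 224, 224, 3)).astype(np.float32),)

    params = trt.TrtConversionParams(precision_mode="INT8", use_calibration=True)
    converter = trt.TrtGraphConverterV2(
        input_saved_model_dir="resnet/original/1/model.savedmodel",  # placeholder
        conversion_params=params)
    converter.convert(calibration_input_fn=calibration_input_fn)
    converter.save("resnet/tftrt_int8/1/model.savedmodel")  # placeholder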

  1. Restart the load testing tool. Use the configmap resource to specify the model as tftrt_int8.

    kubectl delete configmap locust-config -n locust
    kubectl create configmap locust-config \
        --from-literal model=tftrt_int8 \
        --from-literal saddr=${TRITON_IP} \
        --from-literal rps=10 -n locust
    kubectl delete -f deployment_master.yaml -n locust
    kubectl delete -f deployment_slave.yaml -n locust
    kubectl apply -f deployment_master.yaml -n locust
    kubectl apply -f deployment_slave.yaml -n locust
    
  2. Restart the Triton server:

    kubectl scale deployment/inference-server --replicas=0
    kubectl scale deployment/inference-server --replicas=1
    

    Wait a few minutes until the server processes become ready.

  3. Repeat the performance measurement that you took in the previous section.

    In the following screenshots, the number of requests per second is 1085.4.

    Graph showing response time with 1085.4 requests per second.

    Graph showing 1085.4 requests per second.

    This result is almost the same as for the FP16 conversion, so you don't observe an advantage to using INT8 quantization here. In theory, the NVIDIA Tesla T4 GPU can handle INT8 quantization models faster than FP16 conversion models, so in this case there might be a bottleneck other than GPU performance. You can confirm this from the following GPU utilization data on the Grafana dashboard. Notice that utilization is less than 40%, which means that the model cannot fully use the GPU's processing capacity.

    Graph showing GPU utilization of less than 40%.

    As the next section shows, you might be able to ease this bottleneck by increasing the number of instance groups. For example, increase the number of instance groups from 1 to 4, and decrease the batch size from 64 to 16. This approach keeps the number of requests that a single GPU can process concurrently at 64 (4 instances × batch size 16).

Adjusting the number of instances

In this section, you measure the performance of the model tftrt_int8_bs16_count4. This model has the same structure as tftrt_int8, but you change the batch size and number of instance groups as described at the end of the previous section.
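
To check how the batch size and the number of instance groups are set for this model, you can query the configuration that the Triton server loaded. The following sketch assumes that the Triton gRPC port 8001 is reachable at the address stored in ${TRITON_IP}.

    import tritonclient.grpc as grpcclient

    client = grpcclient.InferenceServerClient(url="<TRITON_IP>:8001")

    # For tftrt_int8_bs16_count4, the returned configuration should show
    # max_batch_size: 16 and an instance_group entry with count: 4.
    print(client.get_model_config("tftrt_int8_bs16_count4"))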

  1. Restart Locust. Use the configmap resource to specify the model as tftrt_int8_bs16_count4. At the same time, increase the number of Locust client Pods so that they generate enough load to measure the performance limit of the model.

    kubectl delete configmap locust-config -n locust
    kubectl create configmap locust-config \
        --from-literal model=tftrt_int8_bs16_count4 \
        --from-literal saddr=${TRITON_IP} \
        --from-literal rps=10 -n locust
    kubectl delete -f deployment_master.yaml -n locust
    kubectl delete -f deployment_slave.yaml -n locust
    kubectl apply -f deployment_master.yaml -n locust
    kubectl apply -f deployment_slave.yaml -n locust
    kubectl scale deployment/locust-slave --replicas=20 -n locust
    
  2. Restart the Triton server:

    kubectl scale deployment/inference-server --replicas=0
    kubectl scale deployment/inference-server --replicas=1
    

    Wait a few minutes until the server processes become ready.

  3. Repeat the performance measurement that you took in the previous section. In this case, set Hatch rate to 15, because with a Hatch rate of 5 it takes a long time to reach the performance limit. In the following example, the number of requests per second is 2236.6.

    Graph showing response time with 2236.6 requests per second.

    Graph showing 2236.6 requests per second.

    By adjusting the number of instances, you roughly double the requests per second. Notice that the GPU utilization has reached about 75% on the Grafana dashboard.

    Graph showing GPU utilization of 75%.

Scaling with multiple nodes

In this tutorial, you measured the performance of a single inference server Pod. Because the inference processes run independently on different Pods in a shared-nothing manner, you can assume that the total throughput scales linearly with the number of Pods when you scale out to multiple nodes. This assumption applies as long as there are no bottlenecks, such as the network bandwidth between clients and inference servers.

However, it's important to understand how inference requests are balanced among multiple inference servers. Triton uses the gRPC protocol to establish a TCP connection between a client and a server. Because Triton reuses the established connection for sending multiple inference requests, requests from a single client are always sent to the same server. To distribute requests across multiple servers, you must use multiple clients.
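
The following sketch illustrates this behavior with the Python client library. A single client object opens one gRPC channel, and every call to infer reuses that TCP connection, so a Kubernetes Service that balances traffic per connection sends all of these requests to the same server Pod. The tensor name and shape are placeholders.

    import numpy as np
    import tritonclient.grpc as grpcclient

    # One client object corresponds to one gRPC channel, that is, one TCP connection.
    client = grpcclient.InferenceServerClient(url="<TRITON_IP>:8001")

    image = np.zeros((1, 224, 224, 3), dtype=np.float32)
    inputs = [grpcclient.InferInput("input", list(image.shape), "FP32")]  # placeholder tensor name
    inputs[0].set_data_from_numpy(image)

    # All of these requests travel over the same connection and therefore
    # reach the same inference server Pod. To spread load across Pods, run
    # multiple client processes, as the Locust client Pods in this tutorial do.
    for _ in range(100):
        client.infer(model_name="original", inputs=inputs)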

Clean up

Delete the project

  1. In the console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next