Deploy a scalable TensorFlow inference system

Last reviewed 2023-11-02 UTC

This document shows you how to deploy the reference architecture that is described in Scalable TensorFlow inference system.

This series is intended for developers who are familiar with Google Kubernetes Engine and machine learning (ML) frameworks, including TensorFlow and NVIDIA TensorRT.

After you complete this deployment, see Measure and tune performance of a TensorFlow inference system.

Architecture

The following diagram shows the architecture of the inference system.

Architecture of the inference system.

Cloud Load Balancing sends the request traffic to the closest GKE cluster. Each node in the cluster runs one Pod. In each Pod, a Triton Inference Server provides an inference service (serving ResNet-50 models), and an NVIDIA T4 GPU accelerates inference. Monitoring servers on the cluster collect metrics data on GPU utilization and memory usage.

For details, see Scalable TensorFlow inference system.
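
After you complete the deployment, you can confirm this one-Pod-per-GPU-node layout yourself. The following sketch lists the GPU nodes by using the accelerator label that GKE applies automatically, and then shows which node each Pod is scheduled on:

    # List the nodes that carry an NVIDIA T4 GPU (GKE applies this label automatically).
    kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-tesla-t4

    # Show which node each inference server Pod is scheduled on.
    kubectl get pods -o wide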

Objectives

  • Download a pretrained ResNet-50 model, and use TensorFlow integration with TensorRT (TF-TRT) to apply optimizations
  • Serve a ResNet-50 model from an NVIDIA Triton Inference Server
  • Build a monitoring system for Triton by using Prometheus and Grafana
  • Build a load testing tool by using Locust

Costs

In addition to the NVIDIA T4 GPU, this deployment uses the following billable components of Google Cloud:

  • Compute Engine
  • Google Kubernetes Engine (GKE)
  • Cloud Storage
  • Container Registry

To generate a cost estimate based on your projected usage, use the pricing calculator.

When you finish this deployment, don't delete the resources you created. You need these resources when you measure and tune the deployment.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the GKE API.

    Enable the API
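
Alternatively, you can enable the GKE API from the command line. The following command is a sketch that assumes your project is already selected in your gcloud configuration:

    # Enable the Kubernetes Engine API for the current project.
    gcloud services enable container.googleapis.com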

Build optimized models with TF-TRT

In this section, you create a working environment and optimize the pretrained model.

The pretrained model was trained with the fake dataset at gs://cloud-tpu-test-datasets/fake_imagenet/. A copy of the pretrained model is available in the Cloud Storage location gs://solutions-public-assets/tftrt-tutorial/resnet/export/1584366419/.
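
Optionally, before you create the working environment, you can verify that the Cloud Storage copy contains a TensorFlow SavedModel (a saved_model.pb file and a variables directory):

    # Inspect the pretrained model copy in Cloud Storage.
    gsutil ls -r gs://solutions-public-assets/tftrt-tutorial/resnet/export/1584366419/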

Create a working environment

For your working environment, you create a Compute Engine instance by using Deep Learning VM Images. You optimize and quantize the ResNet-50 model with TensorRT on this instance.

  1. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

  2. Deploy an instance named working-vm:

    gcloud config set project PROJECT_ID
    gcloud config set compute/zone us-west1-b
    gcloud compute instances create working-vm \
        --scopes cloud-platform \
        --image-family common-cu113 \
        --image-project deeplearning-platform-release \
        --machine-type n1-standard-8 \
        --min-cpu-platform="Intel Skylake" \
        --accelerator=type=nvidia-tesla-t4,count=1 \
        --boot-disk-size=200GB \
        --maintenance-policy=TERMINATE \
        --metadata="install-nvidia-driver=True"
    

    Replace PROJECT_ID with the ID of the Google Cloud project that you created earlier.

    This command launches a Compute Engine instance that uses an NVIDIA T4 GPU. On first boot, the instance automatically installs the NVIDIA GPU driver that is compatible with TensorRT 5.1.5.
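
The driver installation can take a few minutes after the instance boots. Optionally, you can confirm that the driver is installed and that the T4 GPU is visible before you continue, for example by running nvidia-smi over SSH from Cloud Shell:

    # Check the GPU and driver status on the working instance.
    gcloud compute ssh working-vm --zone us-west1-b --command "nvidia-smi"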

Create model files with different optimizations

In this section, you apply the following optimizations to the original ResNet-50 model by using TF-TRT:

  • Graph optimization
  • Conversion to FP16 with the graph optimization
  • Quantization with INT8 with the graph optimization

For details about these optimizations, see Performance optimization.

  1. In the Google Cloud console, go to Compute Engine > VM instances.

    Go to VM Instances

    You see the working-vm instance that you created earlier.

  2. To open the terminal console of the instance, click SSH.

    You use this terminal to run the rest of the commands in this document.

  3. In the terminal, clone the required repository and change the current directory:

    cd $HOME
    git clone https://github.com/GoogleCloudPlatform/gke-tensorflow-inference-system-tutorial
    cd gke-tensorflow-inference-system-tutorial/server
    
  4. Download the pretrained ResNet-50 model to a local directory:

    mkdir -p models/resnet/original/00001
    gsutil cp -R gs://solutions-public-assets/tftrt-tutorial/resnet/export/1584366419/* models/resnet/original/00001
    
  5. Build a container image that contains optimization tools for TF-TRT:

    docker build ./ -t trt-optimizer
    docker image list
    

    The last command shows a table of repositories.

  6. In the table, in the row for the trt-optimizer repository, copy the image ID.

  7. Apply the optimizations (graph optimization, conversion to FP16, and quantization with INT8) to the original model:

    export IMAGE_ID=IMAGE_ID
    
    nvidia-docker run --rm \
        -v `pwd`/models/:/workspace/models ${IMAGE_ID} \
        --input-model-dir='models/resnet/original/00001' \
        --output-dir='models/resnet' \
        --precision-mode='FP32' \
        --batch-size=64
    
    nvidia-docker run --rm \
        -v `pwd`/models/:/workspace/models ${IMAGE_ID} \
        --input-model-dir='models/resnet/original/00001' \
        --output-dir='models/resnet' \
        --precision-mode='FP16' \
        --batch-size=64
    
    nvidia-docker run --rm \
        -v `pwd`/models/:/workspace/models ${IMAGE_ID} \
        --input-model-dir='models/resnet/original/00001' \
        --output-dir='models/resnet' \
        --precision-mode='INT8' \
        --batch-size=64 \
        --calib-image-dir='gs://cloud-tpu-test-datasets/fake_imagenet/' \
        --calibration-epochs=10
    

    Replace IMAGE_ID with the image ID for trt-optimizer that you copied in the previous step.

    The --calib-image-dir option specifies the location of the training data that was used to train the pretrained model. The same data is used for calibration of the INT8 quantization. The calibration process can take about 5 minutes.

    When the commands finish running, the optimized models are saved under ./models/resnet, and the last line of output is similar to the following:

    INFO:tensorflow:SavedModel written to: models/resnet/INT8/00001/saved_model.pb
    

    The directory structure is similar to the following:

    models
    └── resnet
        ├── FP16
        │   └── 00001
        │       ├── saved_model.pb
        │       └── variables
        ├── FP32
        │   └── 00001
        │       ├── saved_model.pb
        │       └── variables
        ├── INT8
        │   └── 00001
        │       ├── saved_model.pb
        │       └── variables
        └── original
            └── 00001
                ├── saved_model.pb
                └── variables
                    ├── variables.data-00000-of-00001
                    └── variables.index
    

The following table summarizes the relationship between directories and optimizations.

Directory   Optimization
FP16        Conversion to FP16 in addition to the graph optimization
FP32        Graph optimization
INT8        Quantization with INT8 in addition to the graph optimization
original    Original model (no optimization with TF-TRT)
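
Optionally, you can inspect any of the generated SavedModels to check the input and output tensor names and shapes, which must match the config.pbtxt files that you use in the next section. The following sketch uses TensorFlow's saved_model_cli tool and assumes the standard serve tag and serving_default signature; it also assumes that TensorFlow is installed in the environment where you run it (for example, a Python environment on working-vm or the trt-optimizer container):

    # Show the serving signature of the FP16-optimized model.
    saved_model_cli show \
        --dir models/resnet/FP16/00001 \
        --tag_set serve \
        --signature_def serving_default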

Deploy an inference server

In this section, you deploy Triton servers with five models. First, you upload the model binaries that you created in the previous section to Cloud Storage. Then, you create a GKE cluster and deploy Triton servers on the cluster.

Upload the model binaries

  • In the SSH terminal, upload the model binaries and config.pbtxt configuration files to a storage bucket:

    export PROJECT_ID=PROJECT_ID
    export BUCKET_NAME=${PROJECT_ID}-models
    
    mkdir -p original/1/model/
    cp -r models/resnet/original/00001/* original/1/model/
    cp original/config.pbtxt original/1/model/
    cp original/imagenet1k_labels.txt original/1/model/
    
    mkdir -p tftrt_fp32/1/model/
    cp -r models/resnet/FP32/00001/* tftrt_fp32/1/model/
    cp tftrt_fp32/config.pbtxt tftrt_fp32/1/model/
    cp tftrt_fp32/imagenet1k_labels.txt tftrt_fp32/1/model/
    
    mkdir -p tftrt_fp16/1/model/
    cp -r models/resnet/FP16/00001/* tftrt_fp16/1/model/
    cp tftrt_fp16/config.pbtxt tftrt_fp16/1/model/
    cp tftrt_fp16/imagenet1k_labels.txt tftrt_fp16/1/model/
    
    mkdir -p tftrt_int8/1/model/
    cp -r models/resnet/INT8/00001/* tftrt_int8/1/model/
    cp tftrt_int8/config.pbtxt tftrt_int8/1/model/
    cp tftrt_int8/imagenet1k_labels.txt tftrt_int8/1/model/
    
    mkdir -p tftrt_int8_bs16_count4/1/model/
    cp -r models/resnet/INT8/00001/* tftrt_int8_bs16_count4/1/model/
    cp tftrt_int8_bs16_count4/config.pbtxt tftrt_int8_bs16_count4/1/model/
    cp tftrt_int8_bs16_count4/imagenet1k_labels.txt tftrt_int8_bs16_count4/1/model/
    
    gsutil mb gs://${BUCKET_NAME}
    gsutil -m cp -R original tftrt_fp32 tftrt_fp16 tftrt_int8 tftrt_int8_bs16_count4 \
        gs://${BUCKET_NAME}/resnet/
    

    Replace PROJECT_ID with the ID of the Google Cloud project that you created earlier.

    The following tuning parameters are specified in the config.pbtxt files:

    • Model name
    • Input tensor name and output tensor name
    • GPU allocation to each model
    • Batch size and number of instance groups

    As an example, the original/1/model/config.pbtxt file contains the following content:

    name: "original"
    platform: "tensorflow_savedmodel"
    max_batch_size: 64
    input {
        name: "input"
        data_type: TYPE_FP32
        format: FORMAT_NHWC
        dims: [ 224, 224, 3 ]
    }
    output {
        name: "probabilities"
        data_type: TYPE_FP32
        dims: 1000
        label_filename: "imagenet1k_labels.txt"
    }
    default_model_filename: "model"
    instance_group [
      {
        count: 1
        kind: KIND_GPU
      }
    ]
    dynamic_batching {
      preferred_batch_size: [ 64 ]
      max_queue_delay_microseconds: 20000
    }
    

For details on batch size and number of instance groups, see Performance optimization.

The following table summarizes the five models that you deployed in this section.

Model name               Optimization
original                 Original model (no optimization with TF-TRT)
tftrt_fp32               Graph optimization (batch size=64, instance groups=1)
tftrt_fp16               Conversion to FP16 in addition to the graph optimization (batch size=64, instance groups=1)
tftrt_int8               Quantization with INT8 in addition to the graph optimization (batch size=64, instance groups=1)
tftrt_int8_bs16_count4   Quantization with INT8 in addition to the graph optimization (batch size=16, instance groups=4)
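
The five model directories differ mainly in their config.pbtxt tuning parameters. For example, to see how tftrt_int8_bs16_count4 differs from the original configuration, you can compare the two files in the SSH terminal; based on the table above, expect differences at least in the model name, the preferred batch size (16 instead of 64), and the instance group count (4 instead of 1):

    # Compare the tuning parameters of the two model configurations.
    diff original/config.pbtxt tftrt_int8_bs16_count4/config.pbtxt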

Deploy inference servers by using Triton

  1. In the SSH terminal, install and configure the authentication package, which manages GKE clusters:

    export USE_GKE_GCLOUD_AUTH_PLUGIN=True
    sudo apt-get install google-cloud-sdk-gke-gcloud-auth-plugin
    
  2. Create a GKE cluster and a GPU node pool with compute nodes that use an NVIDIA T4 GPU:

    gcloud auth login
    gcloud config set compute/zone us-west1-b
    gcloud container clusters create tensorrt-cluster \
        --num-nodes=20
    gcloud container node-pools create t4-gpu-pool \
        --num-nodes=1 \
        --machine-type=n1-standard-8 \
        --cluster=tensorrt-cluster \
        --accelerator type=nvidia-tesla-t4,count=1
    

    The --num-nodes flag specifies 20 instances for the GKE cluster and one instance for the GPU node pool t4-gpu-pool.

    The GPU node pool consists of a single n1-standard-8 instance with an NVIDIA T4 GPU. The number of GPU instances should be equal to or larger than the number of inference server pods, because the NVIDIA T4 GPU cannot be shared by multiple pods on the same instance.

  3. Show the cluster information:

    gcloud container clusters list
    

    The output is similar to the following:

    NAME              LOCATION    MASTER_VERSION  MASTER_IP      MACHINE_TYPE   NODE_VERSION    NUM_NODES  STATUS
    tensorrt-cluster  us-west1-b  1.14.10-gke.17  XX.XX.XX.XX    n1-standard-1  1.14.10-gke.17  21         RUNNING
    
  4. Show the node pool information:

    gcloud container node-pools list --cluster tensorrt-cluster
    

    The output is similar to the following:

    NAME          MACHINE_TYPE   DISK_SIZE_GB  NODE_VERSION
    default-pool  n1-standard-1  100           1.14.10-gke.17
    t4-gpu-pool   n1-standard-8  100           1.14.10-gke.17
    
  5. Get the cluster credentials, and then deploy the NVIDIA driver installer DaemonSet:

    gcloud container clusters get-credentials tensorrt-cluster
    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
    

    The kubectl apply command deploys a DaemonSet that loads the NVIDIA GPU driver on the nodes in the GPU node pool. The DaemonSet also automatically loads the driver when you add a new node to the GPU node pool.

  6. Deploy inference servers on the cluster:

    sed -i.bak "s/YOUR-BUCKET-NAME/${PROJECT_ID}-models/" trtis_deploy.yaml
    kubectl create -f trtis_service.yaml
    kubectl create -f trtis_deploy.yaml
    
  7. Wait a few minutes until services become available.

  8. Get the clusterIP address of Triton and store it in an environment variable:

    export TRITON_IP=$(kubectl get svc inference-server \
      -o "jsonpath={.spec['clusterIP']}")
    echo ${TRITON_IP}
    

At this point, the inference server is serving the five model configurations listed in the preceding table, which are built from the ResNet-50 model files that you created in the section Create model files with different optimizations. Clients can specify which model to use when they send inference requests.
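
Optionally, you can confirm that the GPU node is schedulable and that Triton responds before you continue. The following sketch assumes that the inference-server service exposes Triton's default HTTP port 8000 and the v1 status endpoint /api/status; adjust these values if trtis_service.yaml maps them differently:

    # Confirm that the GPU node pool advertises an allocatable NVIDIA GPU.
    kubectl describe nodes -l cloud.google.com/gke-nodepool=t4-gpu-pool | grep -i "nvidia.com/gpu"

    # Query the Triton status endpoint from a temporary Pod inside the cluster.
    kubectl run triton-check --rm -it --restart=Never --image=curlimages/curl -- \
        curl -s http://${TRITON_IP}:8000/api/status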

Deploy monitoring servers with Prometheus and Grafana

  1. In the SSH terminal, deploy Prometheus servers on the cluster:

    sed -i.bak "s/CLUSTER-IP/${TRITON_IP}/" prometheus-configmap.yml
    kubectl create namespace monitoring
    kubectl apply -f prometheus-service.yml -n monitoring
    kubectl create -f clusterRole.yml
    kubectl create -f prometheus-configmap.yml -n monitoring
    kubectl create -f prometheus-deployment.yml -n monitoring
    
  2. Get the endpoint URL of the Prometheus service.

    ip_port=$(kubectl get svc prometheus-service \
      -o "jsonpath={.spec['clusterIP']}:{.spec['ports'][0]['port']}" -n monitoring)
    echo "http://${ip_port}"
    

    Make a note of the Prometheus endpoint URL, because you use it to configure Grafana later.

  3. Deploy Grafana servers on the cluster:

    kubectl create -f grafana-service.yml -n monitoring
    kubectl create -f grafana-deployment.yml -n monitoring
    
  4. Wait a few minutes until all services become available.

  5. Get the endpoint URL of the Grafana service.

    ip_port=$(kubectl get svc grafana-service \
      -o "jsonpath={.status['loadBalancer']['ingress'][0]['ip']}:{.spec['ports'][0]['port']}" -n monitoring)
    echo "http://${ip_port}"
    

    Make a note of the Grafana endpoint URL to use in the next step.

  6. In a web browser, go to the Grafana URL that you noted in the preceding step.

  7. Sign in with the default user ID and password (admin and admin). When prompted, change the default password.

  8. Click Add your first data source, and in the Time series databases list, select Prometheus.

  9. In the Settings tab, in the URL field, enter the Prometheus endpoint URL that you noted earlier.

  10. Click Save and Test, and then return to the home screen.

  11. Add a monitoring metric for nv_gpu_utilization:

    1. Click Create your first dashboard, and then click Add visualization.
    2. In the Data source list, select Prometheus.
    3. In the Query tab, in the Metric field, enter nv_gpu_utilization.

    4. In the Panel options section, in the Title field, enter GPU Utilization, and then click Apply.

      The page displays a panel for GPU utilization.

  12. Add a monitoring metric for nv_gpu_memory_used_bytes:

    1. Click Add, and select Visualization.
    2. In the Query tab, in the Metric field, enter nv_gpu_memory_used_bytes.

    3. In the Panel options section, in the Title field, enter GPU Memory Used, and then click Save.

  13. To add the dashboard, in the Save dashboard panel, click Save.

    You see the graphs for GPU Utilization and GPU Memory Used.
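
To confirm that Prometheus is scraping the Triton metrics that you just charted, you can also query the Prometheus HTTP API directly. The following sketch reuses the ip_port value that you captured earlier in the SSH terminal and runs curl from a temporary Pod in the monitoring namespace; an empty result usually means that the metrics haven't been scraped yet:

    # Query Prometheus for the GPU utilization metric exported by Triton.
    kubectl run prom-check --rm -it --restart=Never --image=curlimages/curl -n monitoring -- \
        curl -s "http://${ip_port}/api/v1/query?query=nv_gpu_utilization"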

Deploy a load testing tool

In this section, you deploy the Locust load testing tool on GKE, and generate workloads to measure the performance of the inference servers.

  1. In the SSH terminal, build a Docker image that contains Triton client libraries, and upload it to Container Registry:

    cd ../client
    git clone https://github.com/triton-inference-server/server
    cd server
    git checkout r19.05
    sed -i.bak "s/bootstrap.pypa.io\/get-pip.py/bootstrap.pypa.io\/pip\/2.7\/get-pip.py/" Dockerfile.client
    docker build -t tritonserver_client -f Dockerfile.client .
    gcloud auth configure-docker
    docker tag tritonserver_client \
        gcr.io/${PROJECT_ID}/tritonserver_client
    docker push gcr.io/${PROJECT_ID}/tritonserver_client
    

    The build process can take about 5 minutes. When the process is complete, a command prompt appears in the SSH terminal.

  2. When the build process is finished, build a Docker image to generate testing workload, and upload it to Container Registry:

    cd ..
    sed -i.bak "s/YOUR-PROJECT-ID/${PROJECT_ID}/" Dockerfile
    docker build -t locust_tester -f Dockerfile .
    docker tag locust_tester gcr.io/${PROJECT_ID}/locust_tester
    docker push gcr.io/${PROJECT_ID}/locust_tester
    

    Don't manually change or replace YOUR-PROJECT-ID in these commands; the sed command substitutes your project ID into the Dockerfile for you.

    This image is built from the image that you created in the previous step.

  3. Deploy the Locust files service_master.yaml and deployment_master.yaml:

    sed -i.bak "s/YOUR-PROJECT-ID/${PROJECT_ID}/" deployment_master.yaml
    sed -i.bak "s/CLUSTER-IP-TRTIS/${TRITON_IP}/" deployment_master.yaml
    
    kubectl create namespace locust
    kubectl create configmap locust-config --from-literal model=original --from-literal saddr=${TRITON_IP} --from-literal rps=10 -n locust
    
    kubectl apply -f service_master.yaml -n locust
    kubectl apply -f deployment_master.yaml -n locust
    

    The locust-config ConfigMap specifies the machine learning model that the clients send inference requests to; the sketch after this procedure shows how to inspect or change that setting.

  4. Wait a few minutes until services become available.

  5. Get the clusterIP address of the locust-master client, and store that address in an environment variable:

    export LOCUST_MASTER_IP=$(kubectl get svc locust-master -n locust \
        -o "jsonpath={.spec['clusterIP']}")
    echo ${LOCUST_MASTER_IP}
    
  6. Deploy the Locust client:

    sed -i.bak "s/YOUR-PROJECT-ID/${PROJECT_ID}/" deployment_slave.yaml
    sed -i.bak "s/CLUSTER-IP-LOCUST-MASTER/${LOCUST_MASTER_IP}/" deployment_slave.yaml
    kubectl apply -f deployment_slave.yaml -n locust
    

    These commands deploy 10 Locust client Pods that you can use to generate testing workloads. If you can't generate enough requests with the current number of clients, you can change the number of Pods by using the following command:

    kubectl scale deployment/locust-slave --replicas=20 -n locust
    

    If the cluster's default node pool doesn't have enough capacity to run the increased number of replicas, we recommend that you increase the number of nodes in the GKE cluster.

  7. Copy the URL of the Locust console, and then open this URL in a web browser:

    export LOCUST_IP=$(kubectl get svc locust-master -n locust \
         -o "jsonpath={.status.loadBalancer.ingress[0].ip}")
    echo "http://${LOCUST_IP}:8089"
    

    The Locust console opens, and you can generate testing workloads from it.
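
The locust-config ConfigMap that you created earlier controls which model the clients target (model), the Triton address (saddr), and the request rate (rps). The following sketch shows how to inspect that configuration and, as an example, point the clients at the tftrt_fp16 model instead. If the Pods read the configuration at startup (for example, as environment variables), you might need to re-create the Locust Pods for a change to take effect:

    # Inspect the current load-test configuration.
    kubectl get configmap locust-config -n locust -o yaml

    # Target the tftrt_fp16 model instead of the original model.
    kubectl patch configmap locust-config -n locust --type merge \
        -p '{"data":{"model":"tftrt_fp16"}}'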

Check the running Pods

To ensure that the components are deployed successfully, check that the Pods are running.

  1. In the SSH terminal, check the inference server Pod:

    kubectl get pods
    

    The output is similar to the following:

    NAME                                READY   STATUS    RESTARTS   AGE
    inference-server-67786cddb4-qrw6r   1/1     Running   0          83m
    

    If you don't get the expected output, ensure that you've completed the steps in Deploy inference servers by using Triton.

  2. Check the Locust Pods:

    kubectl get pods -n locust
    

    The output is similar to the following:

    NAME                                READY   STATUS    RESTARTS   AGE
    locust-master-75f6f6d4bc-ttllr      1/1     Running   0          10m
    locust-slave-76ddb664d9-8275p       1/1     Running   0          2m36s
    locust-slave-76ddb664d9-f45ww       1/1     Running   0          2m36s
    locust-slave-76ddb664d9-q95z9       1/1     Running   0          2m36s
    

    If you don't get the expected output, ensure that you've completed the steps in Deploy a load testing tool.

  3. Check the monitoring Pods:

    kubectl get pods -n monitoring
    

    The output is similar to the following:

    NAME                                     READY   STATUS    RESTARTS   AGE
    grafana-deployment-644bbcb84-k6t7v       1/1     Running   0          79m
    prometheus-deployment-544b9b9f98-hl7q8   1/1     Running   0          81m
    

    If you don't get the expected output, ensure that you've completed the steps in Deploy monitoring servers with Prometheus and Grafana.

In the next part of this series, you use this inference server system to learn how the various optimizations improve performance and how to interpret the measurement results. For next steps, see Measure and tune performance of a TensorFlow inference system.

What's next