This document describes how to deploy the reference architecture that is described in Scalable TensorFlow inference system.
This series is intended for developers who are familiar with Google Kubernetes Engine and machine learning (ML) frameworks, including TensorFlow and NVIDIA TensorRT.
After you complete this deployment, see Measure and tune performance of a TensorFlow inference system.
Architecture
The following diagram shows the architecture of the inference system.
Cloud Load Balancing sends the request traffic to the closest GKE cluster. Each node in the cluster runs a Pod. In each Pod, a Triton Inference Server provides an inference service (to serve ResNet-50 models), and an NVIDIA T4 GPU accelerates inference. Monitoring servers on the cluster collect metrics data on GPU utilization and memory usage.
For details, see Scalable TensorFlow inference system.
Objectives
- Download a pretrained ResNet-50 model, and use TensorFlow integration with TensorRT (TF-TRT) to apply optimizations
- Serve a ResNet-50 model from an NVIDIA Triton Inference Server
- Build a monitoring system for Triton by using Prometheus and Grafana
- Build a load testing tool by using Locust
Costs
In addition to the NVIDIA T4 GPU, this deployment uses billable components of Google Cloud, including Compute Engine, Google Kubernetes Engine (GKE), Cloud Storage, and Container Registry.
To generate a cost estimate based on your projected usage, use the pricing calculator.
When you finish this deployment, don't delete the resources you created. You need these resources when you measure and tune the deployment.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the GKE API.
Build optimized models with TF-TRT
In this section, you create a working environment and optimize the pretrained model.
The pretrained model uses the fake dataset at gs://cloud-tpu-test-datasets/fake_imagenet/. There is also a copy of the pretrained model in the Cloud Storage location gs://solutions-public-assets/tftrt-tutorial/resnet/export/1584366419/.
Create a working environment
For your working environment, you create a Compute Engine instance by using Deep Learning VM Images. You optimize and quantize the ResNet-50 model with TensorRT on this instance.
In the Google Cloud console, activate Cloud Shell.
Deploy an instance named working-vm:

```
gcloud config set project PROJECT_ID
gcloud config set compute/zone us-west1-b

gcloud compute instances create working-vm \
    --scopes cloud-platform \
    --image-family common-cu113 \
    --image-project deeplearning-platform-release \
    --machine-type n1-standard-8 \
    --min-cpu-platform="Intel Skylake" \
    --accelerator=type=nvidia-tesla-t4,count=1 \
    --boot-disk-size=200GB \
    --maintenance-policy=TERMINATE \
    --metadata="install-nvidia-driver=True"
```
Replace PROJECT_ID with the ID of the Google Cloud project that you created earlier.

This command launches a Compute Engine instance that uses an NVIDIA T4 GPU. On first boot, the instance automatically installs the NVIDIA GPU driver that is compatible with TensorRT 5.1.5.
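The driver installation runs in the background on first boot and can take a few minutes. Optionally, you can confirm that the GPU and driver are ready before you continue. The following is a minimal check that assumes the us-west1-b zone configured above:

```
# Run nvidia-smi on the instance; it reports the T4 GPU once the driver is installed.
gcloud compute ssh working-vm --zone us-west1-b --command "nvidia-smi"
```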
Create model files with different optimizations
In this section, you apply the following optimizations to the original ResNet-50 model by using TF-TRT:
- Graph optimization
- Conversion to FP16 with the graph optimization
- Quantization with INT8 with the graph optimization
For details about these optimizations, see Performance optimization.
In the Google Cloud console, go to Compute Engine > VM instances.
You see the working-vm instance that you created earlier. To open the terminal console of the instance, click SSH.
You use this terminal to run the rest of the commands in this document.
In the terminal, clone the required repository and change the current directory:
```
cd $HOME
git clone https://github.com/GoogleCloudPlatform/gke-tensorflow-inference-system-tutorial
cd gke-tensorflow-inference-system-tutorial/server
```
Download the pretrained ResNet-50 model to a local directory:
```
mkdir -p models/resnet/original/00001
gcloud storage cp gs://solutions-public-assets/tftrt-tutorial/resnet/export/1584366419/* \
    models/resnet/original/00001 --recursive
```
Build a container image that contains optimization tools for TF-TRT:
```
docker build ./ -t trt-optimizer
docker image list
```
The last command shows a table of repositories.
In the table, in the row for the trt-optimizer repository, copy the image ID.

Apply the optimizations (graph optimization, conversion to FP16, and quantization with INT8) to the original model:
```
export IMAGE_ID=IMAGE_ID

nvidia-docker run --rm \
    -v `pwd`/models/:/workspace/models ${IMAGE_ID} \
    --input-model-dir='models/resnet/original/00001' \
    --output-dir='models/resnet' \
    --precision-mode='FP32' \
    --batch-size=64

nvidia-docker run --rm \
    -v `pwd`/models/:/workspace/models ${IMAGE_ID} \
    --input-model-dir='models/resnet/original/00001' \
    --output-dir='models/resnet' \
    --precision-mode='FP16' \
    --batch-size=64

nvidia-docker run --rm \
    -v `pwd`/models/:/workspace/models ${IMAGE_ID} \
    --input-model-dir='models/resnet/original/00001' \
    --output-dir='models/resnet' \
    --precision-mode='INT8' \
    --batch-size=64 \
    --calib-image-dir='gs://cloud-tpu-test-datasets/fake_imagenet/' \
    --calibration-epochs=10
```
Replace IMAGE_ID with the image ID for trt-optimizer that you copied in the previous step.

The --calib-image-dir option specifies the location of the training data that was used for the pretrained model. The same training data is used for calibration for INT8 quantization. The calibration process can take about 5 minutes.

When the commands finish running, the last output line is similar to the following, and the optimized models are saved in ./models/resnet:

```
INFO:tensorflow:SavedModel written to: models/resnet/INT8/00001/saved_model.pb
```
The directory structure is similar to the following:
```
models
└── resnet
    ├── FP16
    │   └── 00001
    │       ├── saved_model.pb
    │       └── variables
    ├── FP32
    │   └── 00001
    │       ├── saved_model.pb
    │       └── variables
    ├── INT8
    │   └── 00001
    │       ├── saved_model.pb
    │       └── variables
    └── original
        └── 00001
            ├── saved_model.pb
            └── variables
                ├── variables.data-00000-of-00001
                └── variables.index
```
The following table summarizes the relationship between directories and optimizations.
| Directory | Optimization |
|---|---|
| FP16 | Conversion to FP16 in addition to the graph optimization |
| FP32 | Graph optimization |
| INT8 | Quantization with INT8 in addition to the graph optimization |
| original | Original model (no optimization with TF-TRT) |
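Optionally, before you package the models for Triton, you can inspect a SavedModel's signature to confirm the input and output tensor names that the config.pbtxt files in the next section refer to. This sketch uses the saved_model_cli tool that ships with TensorFlow on the Deep Learning VM, and assumes the default serve tag-set and serving_default signature:

```
# Print the input and output tensors of the INT8-optimized SavedModel.
saved_model_cli show \
    --dir models/resnet/INT8/00001 \
    --tag_set serve \
    --signature_def serving_default
```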
Deploy an inference server
In this section, you deploy Triton servers with five models. First, you upload the model binary that you created in the previous section to Cloud Storage. Then, you create a GKE cluster and deploy Triton servers on the cluster.
Upload the model binary
In the SSH terminal, upload the model binaries and config.pbtxt configuration files to a storage bucket:

```
export PROJECT_ID=PROJECT_ID
export BUCKET_NAME=${PROJECT_ID}-models

mkdir -p original/1/model/
cp -r models/resnet/original/00001/* original/1/model/
cp original/config.pbtxt original/1/model/
cp original/imagenet1k_labels.txt original/1/model/

mkdir -p tftrt_fp32/1/model/
cp -r models/resnet/FP32/00001/* tftrt_fp32/1/model/
cp tftrt_fp32/config.pbtxt tftrt_fp32/1/model/
cp tftrt_fp32/imagenet1k_labels.txt tftrt_fp32/1/model/

mkdir -p tftrt_fp16/1/model/
cp -r models/resnet/FP16/00001/* tftrt_fp16/1/model/
cp tftrt_fp16/config.pbtxt tftrt_fp16/1/model/
cp tftrt_fp16/imagenet1k_labels.txt tftrt_fp16/1/model/

mkdir -p tftrt_int8/1/model/
cp -r models/resnet/INT8/00001/* tftrt_int8/1/model/
cp tftrt_int8/config.pbtxt tftrt_int8/1/model/
cp tftrt_int8/imagenet1k_labels.txt tftrt_int8/1/model/

mkdir -p tftrt_int8_bs16_count4/1/model/
cp -r models/resnet/INT8/00001/* tftrt_int8_bs16_count4/1/model/
cp tftrt_int8_bs16_count4/config.pbtxt tftrt_int8_bs16_count4/1/model/
cp tftrt_int8_bs16_count4/imagenet1k_labels.txt tftrt_int8_bs16_count4/1/model/

gcloud storage buckets create gs://${BUCKET_NAME}
gcloud storage cp original tftrt_fp32 tftrt_fp16 tftrt_int8 tftrt_int8_bs16_count4 \
    gs://${BUCKET_NAME}/resnet/ --recursive
```
Replace PROJECT_ID with the ID of the Google Cloud project that you created earlier.

The following tuning parameters are specified in the config.pbtxt files:

- Model name
- Input tensor name and output tensor name
- GPU allocation to each model
- Batch size and number of instance groups
As an example, the original/1/model/config.pbtxt file contains the following content:

```
name: "original"
platform: "tensorflow_savedmodel"
max_batch_size: 64
input {
    name: "input"
    data_type: TYPE_FP32
    format: FORMAT_NHWC
    dims: [ 224, 224, 3 ]
}
output {
    name: "probabilities"
    data_type: TYPE_FP32
    dims: 1000
    label_filename: "imagenet1k_labels.txt"
}
default_model_filename: "model"
instance_group [
    {
        count: 1
        kind: KIND_GPU
    }
]
dynamic_batching {
    preferred_batch_size: [ 64 ]
    max_queue_delay_microseconds: 20000
}
```
For details on batch size and number of instance groups, see Performance optimization.
The following table summarizes the five models that you deployed in this section.
| Model name | Optimization |
|---|---|
| original | Original model (no optimization with TF-TRT) |
| tftrt_fp32 | Graph optimization (batch size=64, instance groups=1) |
| tftrt_fp16 | Conversion to FP16 in addition to the graph optimization (batch size=64, instance groups=1) |
| tftrt_int8 | Quantization with INT8 in addition to the graph optimization (batch size=64, instance groups=1) |
| tftrt_int8_bs16_count4 | Quantization with INT8 in addition to the graph optimization (batch size=16, instance groups=4) |
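To compare the tuning parameters across variants after the upload, you can print any of the uploaded configuration files. For example, the following optional check displays the configuration of the tftrt_int8_bs16_count4 model, where you should see the smaller batch size and larger instance group count:

```
# Print the Triton model configuration stored in the bucket.
gcloud storage cat gs://${BUCKET_NAME}/resnet/tftrt_int8_bs16_count4/config.pbtxt
```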
Deploy inference servers by using Triton
In the SSH terminal, install and configure the authentication package, which manages GKE clusters:
```
export USE_GKE_GCLOUD_AUTH_PLUGIN=True
sudo apt-get install google-cloud-sdk-gke-gcloud-auth-plugin
```
Create a GKE cluster and a GPU node pool with compute nodes that use an NVIDIA T4 GPU:
```
gcloud auth login
gcloud config set compute/zone us-west1-b

gcloud container clusters create tensorrt-cluster \
    --num-nodes=20

gcloud container node-pools create t4-gpu-pool \
    --num-nodes=1 \
    --machine-type=n1-standard-8 \
    --cluster=tensorrt-cluster \
    --accelerator type=nvidia-tesla-t4,count=1
```
The --num-nodes flag specifies 20 instances for the GKE cluster and one instance for the GPU node pool t4-gpu-pool.

The GPU node pool consists of a single n1-standard-8 instance with an NVIDIA T4 GPU. The number of GPU instances should be equal to or larger than the number of inference server Pods, because the NVIDIA T4 GPU cannot be shared by multiple Pods on the same instance.

Show the cluster information:
gcloud container clusters list
The output is similar to the following:
```
NAME              LOCATION    MASTER_VERSION  MASTER_IP    MACHINE_TYPE   NODE_VERSION    NUM_NODES  STATUS
tensorrt-cluster  us-west1-b  1.14.10-gke.17  XX.XX.XX.XX  n1-standard-1  1.14.10-gke.17  21         RUNNING
```
Show the node pool information:
gcloud container node-pools list --cluster tensorrt-cluster
The output is similar to the following:
```
NAME          MACHINE_TYPE   DISK_SIZE_GB  NODE_VERSION
default-pool  n1-standard-1  100           1.14.10-gke.17
t4-gpu-pool   n1-standard-8  100           1.14.10-gke.17
```
Enable the daemonSet workload:

```
gcloud container clusters get-credentials tensorrt-cluster

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
```
This command loads the NVIDIA GPU driver on the nodes in the GPU node pool. It also automatically loads the driver when you add a new node to the GPU node pool.
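Optionally, you can verify that the driver installer has finished and that the GPU is schedulable by checking the allocatable resources on the GPU nodes. This sketch relies on the cloud.google.com/gke-accelerator label that GKE applies to accelerator node pools; each GPU node should report one allocatable nvidia.com/gpu resource:

```
# List GPU nodes and the number of GPUs that each node exposes to the scheduler.
kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-tesla-t4 \
    -o custom-columns="NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```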
Deploy inference servers on the cluster:
sed -i.bak "s/YOUR-BUCKET-NAME/${PROJECT_ID}-models/" trtis_deploy.yaml kubectl create -f trtis_service.yaml kubectl create -f trtis_deploy.yaml
Wait a few minutes until services become available.
Get the clusterIP address of Triton and store it in an environment variable:

```
export TRITON_IP=$(kubectl get svc inference-server \
    -o "jsonpath={.spec['clusterIP']}")
echo ${TRITON_IP}
```
At this point, the inference server is serving the ResNet-50 models that you created in the section Create model files with different optimizations. Clients can specify the model to use when they send inference requests.
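Optionally, you can confirm that the service is reachable from inside the cluster. The port and health-check path in the following sketch are assumptions based on the legacy TensorRT Inference Server HTTP API (the same generation as the r19.05 client used later in this document); check your trtis_service.yaml if the port differs:

```
# Show the ports that the inference-server Service exposes.
kubectl get svc inference-server

# From a temporary Pod inside the cluster, check readiness on the assumed HTTP port (8000).
kubectl run curl-check --rm -it --restart=Never --image=curlimages/curl -- \
    curl -s -o /dev/null -w "%{http_code}\n" http://${TRITON_IP}:8000/api/health/ready
```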
Deploy monitoring servers with Prometheus and Grafana
In the SSH terminal, deploy Prometheus servers on the cluster:
sed -i.bak "s/CLUSTER-IP/${TRITON_IP}/" prometheus-configmap.yml kubectl create namespace monitoring kubectl apply -f prometheus-service.yml -n monitoring kubectl create -f clusterRole.yml kubectl create -f prometheus-configmap.yml -n monitoring kubectl create -f prometheus-deployment.yml -n monitoring
Get the endpoint URL of the Prometheus service:

```
ip_port=$(kubectl get svc prometheus-service \
    -o "jsonpath={.spec['clusterIP']}:{.spec['ports'][0]['port']}" -n monitoring)
echo "http://${ip_port}"
```
Make a note of the Prometheus endpoint URL, because you use it to configure Grafana later.
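Because the Prometheus service uses a cluster-internal IP address, you can optionally verify from inside the cluster that the Triton GPU metrics are being scraped. This sketch queries the standard Prometheus HTTP API for the nv_gpu_utilization metric that you chart in Grafana later:

```
# Query Prometheus from a temporary Pod; a non-empty result means scraping works.
kubectl run prom-check --rm -it --restart=Never --image=curlimages/curl -- \
    curl -s "http://${ip_port}/api/v1/query?query=nv_gpu_utilization"
```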
Deploy Grafana servers on the cluster:
```
kubectl create -f grafana-service.yml -n monitoring
kubectl create -f grafana-deployment.yml -n monitoring
```
Wait a few minutes until all services become available.
Get the endpoint URL of the Grafana service:

```
ip_port=$(kubectl get svc grafana-service \
    -o "jsonpath={.status['loadBalancer']['ingress'][0]['ip']}:{.spec['ports'][0]['port']}" -n monitoring)
echo "http://${ip_port}"
```
Make a note of the Grafana endpoint URL to use in the next step.
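Because the Grafana service is exposed through an external load balancer IP address, you can optionally confirm that it is responding before you open it in the browser. Grafana's /api/health endpoint does not require authentication:

```
# A JSON response indicates that Grafana is up and its database is reachable.
curl -s "http://${ip_port}/api/health"
```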
In a web browser, go to the Grafana URL that you noted in the preceding step.
Sign in with the default user ID and password (admin and admin). When you are prompted, change the default password.

Click Add your first data source, and in the Time series databases list, select Prometheus.
In the Settings tab, in the URL field, enter the Prometheus endpoint URL that you noted earlier.
Click Save and Test, and then return to the home screen.
Add a monitoring metric for nv_gpu_utilization:

- Click Create your first dashboard, and then click Add visualization.
- In the Data source list, select Prometheus.
- In the Query tab, in the Metric field, enter nv_gpu_utilization.
- In the Panel options section, in the Title field, enter GPU Utilization, and then click Apply.

The page displays a panel for GPU utilization.
Add a monitoring metric for nv_gpu_memory_used_bytes:

- Click Add, and select Visualization.
- In the Query tab, in the Metric field, enter nv_gpu_memory_used_bytes.
- In the Panel options section, in the Title field, enter GPU Memory Used, and then click Save.
To add the dashboard, in the Save dashboard panel, click Save.
You see the graphs for GPU Utilization and GPU Memory Used.
Deploy a load testing tool
In this section, you deploy the Locust load testing tool on GKE, and generate workloads to measure the performance of the inference servers.
In the SSH terminal, build a Docker image that contains Triton client libraries, and upload it to Container Registry:
```
cd ../client
git clone https://github.com/triton-inference-server/server
cd server
git checkout r19.05
sed -i.bak "s/bootstrap.pypa.io\/get-pip.py/bootstrap.pypa.io\/pip\/2.7\/get-pip.py/" Dockerfile.client

docker build -t tritonserver_client -f Dockerfile.client .

gcloud auth configure-docker
docker tag tritonserver_client \
    gcr.io/${PROJECT_ID}/tritonserver_client
docker push gcr.io/${PROJECT_ID}/tritonserver_client
```
The build process can take about 5 minutes. When the process is complete, a command prompt appears in the SSH terminal.
When the build process is finished, build a Docker image to generate testing workload, and upload it to Container Registry:
```
cd ..
sed -i.bak "s/YOUR-PROJECT-ID/${PROJECT_ID}/" Dockerfile

docker build -t locust_tester -f Dockerfile .
docker tag locust_tester gcr.io/${PROJECT_ID}/locust_tester
docker push gcr.io/${PROJECT_ID}/locust_tester
```
Don't change or replace YOUR-PROJECT-ID in the commands.

This image is built from the image that you created in the previous step.
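Optionally, confirm that both images were pushed to Container Registry before you deploy them. You should see tritonserver_client and locust_tester in the output:

```
# List the image repositories in the project's Container Registry.
gcloud container images list --repository=gcr.io/${PROJECT_ID}
```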
Deploy the Locust files service_master.yaml and deployment_master.yaml:

```
sed -i.bak "s/YOUR-PROJECT-ID/${PROJECT_ID}/" deployment_master.yaml
sed -i.bak "s/CLUSTER-IP-TRTIS/${TRITON_IP}/" deployment_master.yaml

kubectl create namespace locust
kubectl create configmap locust-config \
    --from-literal model=original \
    --from-literal saddr=${TRITON_IP} \
    --from-literal rps=10 -n locust

kubectl apply -f service_master.yaml -n locust
kubectl apply -f deployment_master.yaml -n locust
```
The configmap resource is used to specify the machine learning model to which clients send requests for inference. A sketch at the end of this section shows one way to point the clients at a different model.

Wait a few minutes until services become available.
Get the clusterIP address of the locust-master client, and store that address in an environment variable:

```
export LOCUST_MASTER_IP=$(kubectl get svc locust-master -n locust \
    -o "jsonpath={.spec['clusterIP']}")
echo ${LOCUST_MASTER_IP}
```
Deploy the Locust client:
sed -i.bak "s/YOUR-PROJECT-ID/${PROJECT_ID}/" deployment_slave.yaml sed -i.bak "s/CLUSTER-IP-LOCUST-MASTER/${LOCUST_MASTER_IP}/" deployment_slave.yaml kubectl apply -f deployment_slave.yaml -n locust
These commands deploy 10 Locust client Pods that you can use to generate testing workloads. If you can't generate enough requests with the current number of clients, you can change the number of Pods by using the following command:
kubectl scale deployment/locust-slave --replicas=20 -n locust
If the default cluster doesn't have enough capacity to increase the number of replicas, we recommend that you increase the number of nodes in the GKE cluster.
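For example, the following sketch resizes the cluster's default node pool; adjust the node count to your capacity needs:

```
# Resize the default node pool of the tensorrt-cluster GKE cluster.
gcloud container clusters resize tensorrt-cluster \
    --node-pool default-pool \
    --num-nodes 30
```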
Copy the URL of the Locust console, and then open this URL in a web browser:
```
export LOCUST_IP=$(kubectl get svc locust-master -n locust \
    -o "jsonpath={.status.loadBalancer.ingress[0].ip}")
echo "http://${LOCUST_IP}:8089"
```
The Locust console opens, and you can generate testing workloads from it.
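As noted earlier, the locust-config configmap determines which model the clients send requests to. If you later want to direct test traffic at a different variant (for example, tftrt_fp16), one approach is to patch the configmap and restart the Locust Deployments so that they pick up the new value. This is a sketch that assumes the clients read the configmap only at startup:

```
# Change the target model in the configmap.
kubectl patch configmap locust-config -n locust \
    --type merge -p '{"data":{"model":"tftrt_fp16"}}'

# Restart the Locust Pods so that they re-read the updated configuration.
kubectl rollout restart deployment/locust-master -n locust
kubectl rollout restart deployment/locust-slave -n locust
```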
Check the running Pods
To ensure that the components are deployed successfully, check that the Pods are running.
In the SSH terminal, check the inference server Pod:
kubectl get pods
The output is similar to the following:
```
NAME                                READY   STATUS    RESTARTS   AGE
inference-server-67786cddb4-qrw6r   1/1     Running   0          83m
```
If you don't get the expected output, ensure that you've completed the steps in Deploy inference servers by using Triton.
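To further confirm that the server loaded all of the model variants, you can check its logs. This assumes that the Deployment defined in trtis_deploy.yaml is named inference-server, which matches the Pod name shown above:

```
# Show the inference server logs, which report each model as it is loaded.
kubectl logs deployment/inference-server
```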
Check the Locust Pods:
kubectl get pods -n locust
The output is similar to the following:
```
NAME                             READY   STATUS    RESTARTS   AGE
locust-master-75f6f6d4bc-ttllr   1/1     Running   0          10m
locust-slave-76ddb664d9-8275p    1/1     Running   0          2m36s
locust-slave-76ddb664d9-f45ww    1/1     Running   0          2m36s
locust-slave-76ddb664d9-q95z9    1/1     Running   0          2m36s
```
If you don't get the expected output, ensure that you've completed the steps in Deploy a load testing tool.
Check the monitoring Pods:
kubectl get pods -n monitoring
The output is similar to the following:
```
NAME                                     READY   STATUS    RESTARTS   AGE
grafana-deployment-644bbcb84-k6t7v       1/1     Running   0          79m
prometheus-deployment-544b9b9f98-hl7q8   1/1     Running   0          81m
```
If you don't get the expected output, ensure that you've completed the steps in Deploy monitoring servers with Prometheus and Grafana.
In the next part of this series, you use this inference server system to learn how various optimizations improve performance and how to interpret those optimizations. For next steps, see Measure and tune performance of a TensorFlow inference system.
What's next
- Learn more about Google Kubernetes Engine (GKE).
- Learn more about Cloud Load Balancing.
- For more reference architectures, diagrams, and best practices, explore the Cloud Architecture Center.