Deploy a scalable TensorFlow inference system

Last reviewed 2023-11-02 UTC

This document describes how you deploy the reference architecture described in Scalable TensorFlow inference system.

This series is intended for developers who are familiar with Google Kubernetes Engine and machine learning (ML) frameworks, including TensorFlow and NVIDIA TensorRT.

After you complete this deployment, see Measure and tune performance of a TensorFlow inference system.

Architecture

The following diagram shows the architecture of the inference system.


Cloud Load Balancing sends request traffic to the closest GKE cluster. The cluster contains one Pod for each node. In each Pod, a Triton Inference Server provides an inference service that serves ResNet-50 models, and an NVIDIA T4 GPU accelerates inference. Monitoring servers on the cluster collect GPU utilization and memory usage metrics.

For details, see Scalable TensorFlow inference system.

Objectives

  • Download a pretrained ResNet-50 model, and use TensorFlow integration with TensorRT (TF-TRT) to apply optimizations
  • Serve a ResNet-50 model from an NVIDIA Triton Inference Server
  • Build a monitoring system for Triton by using Prometheus and Grafana
  • Build a load testing tool by using Locust

Costs

In addition to the NVIDIA T4 GPUs, this deployment uses the following billable components of Google Cloud:

  • Compute Engine
  • Google Kubernetes Engine (GKE)
  • Cloud Storage
  • Container Registry

To generate a cost estimate based on your projected usage, use the pricing calculator.

When you finish this deployment, don't delete the resources you created. You need these resources when you measure and tune the deployment.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the GKE API.

    Enable the API

Build optimized models with TF-TRT

In this section, you create a working environment and optimize the pretrained model.

The pretrained model was trained on the fake dataset at gs://cloud-tpu-test-datasets/fake_imagenet/. A copy of the pretrained model is available in the Cloud Storage location gs://solutions-public-assets/tftrt-tutorial/resnet/export/1584366419/.
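
Optionally, before you continue, you can confirm that the pretrained model is accessible from your environment by listing its files (a quick check; it assumes that the Google Cloud CLI is already authenticated):

    gcloud storage ls gs://solutions-public-assets/tftrt-tutorial/resnet/export/1584366419/

The listing should include a saved_model.pb file and a variables/ directory.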

Create a working environment

For your working environment, you create a Compute Engine instance by using Deep Learning VM Images. You optimize and quantize the ResNet-50 model with TensorRT on this instance.

  1. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

  2. Deploy an instance named working-vm:

    gcloud config set project PROJECT_ID
    gcloud config set compute/zone us-west1-b
    gcloud compute instances create working-vm \
        --scopes cloud-platform \
        --image-family common-cu113 \
        --image-project deeplearning-platform-release \
        --machine-type n1-standard-8 \
        --min-cpu-platform="Intel Skylake" \
        --accelerator=type=nvidia-tesla-t4,count=1 \
        --boot-disk-size=200GB \
        --maintenance-policy=TERMINATE \
        --metadata="install-nvidia-driver=True"
    

    Replace PROJECT_ID with the ID of the Google Cloud project that you created earlier.

    This command creates a Compute Engine instance that has an NVIDIA T4 GPU. On first boot, the instance automatically installs an NVIDIA GPU driver that is compatible with TensorRT 5.1.5.
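
    Driver installation runs in the background and can take a few minutes after the instance starts. To verify that the GPU and its driver are ready before you continue, you can run nvidia-smi on the instance (a quick check; if the command isn't found yet, wait a few minutes and retry):

    gcloud compute ssh working-vm --zone us-west1-b --command "nvidia-smi"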

Create model files with different optimizations

In this section, you apply the following optimizations to the original ResNet-50 model by using TF-TRT:

  • Graph optimization
  • Conversion to FP16 with the graph optimization
  • Quantization with INT8 with the graph optimization

For details about these optimizations, see Performance optimization.

  1. In the Google Cloud console, go to Compute Engine > VM instances.

    Go to VM Instances

    You see the working-vm instance that you created earlier.

  2. To open the terminal console of the instance, click SSH.

    You use this terminal to run the rest of the commands in this document.

  3. In the terminal, clone the required repository and change the current directory:

    cd $HOME
    git clone https://github.com/GoogleCloudPlatform/gke-tensorflow-inference-system-tutorial
    cd gke-tensorflow-inference-system-tutorial/server
    
  4. Download the pretrained ResNet-50 model to a local directory:

    mkdir -p models/resnet/original/00001
    gcloud storage cp gs://solutions-public-assets/tftrt-tutorial/resnet/export/1584366419/* models/resnet/original/00001 --recursive
    
  5. Build a container image that contains optimization tools for TF-TRT:

    docker build ./ -t trt-optimizer
    docker image list
    

    The last command shows a table of repositories.

  6. In the table, in the row for the trt-optimizer repository, copy the image ID.

  7. Apply the optimizations (graph optimization, conversion to FP16, and quantization with INT8) to the original model:

    export IMAGE_ID=IMAGE_ID
    
    nvidia-docker run --rm \
        -v `pwd`/models/:/workspace/models ${IMAGE_ID} \
        --input-model-dir='models/resnet/original/00001' \
        --output-dir='models/resnet' \
        --precision-mode='FP32' \
        --batch-size=64
    
    nvidia-docker run --rm \
        -v `pwd`/models/:/workspace/models ${IMAGE_ID} \
        --input-model-dir='models/resnet/original/00001' \
        --output-dir='models/resnet' \
        --precision-mode='FP16' \
        --batch-size=64
    
    nvidia-docker run --rm \
        -v `pwd`/models/:/workspace/models ${IMAGE_ID} \
        --input-model-dir='models/resnet/original/00001' \
        --output-dir='models/resnet' \
        --precision-mode='INT8' \
        --batch-size=64 \
        --calib-image-dir='gs://cloud-tpu-test-datasets/fake_imagenet/' \
        --calibration-epochs=10
    

    Replace IMAGE_ID with the image ID for trt-optimizer that you copied in the previous step.

    The --calib-image-dir option specifies the location of the training data that was used to train the pretrained model. The same data is used for calibration during INT8 quantization. The calibration process can take about 5 minutes.

    When the commands finish running, the optimized models are saved under ./models/resnet, and the last output line is similar to the following:

    INFO:tensorflow:SavedModel written to: models/resnet/INT8/00001/saved_model.pb
    

    The directory structure is similar to the following:

    models
    └── resnet
        ├── FP16
        │   └── 00001
        │       ├── saved_model.pb
        │       └── variables
        ├── FP32
        │   └── 00001
        │       ├── saved_model.pb
        │       └── variables
        ├── INT8
        │   └── 00001
        │       ├── saved_model.pb
        │       └── variables
        └── original
            └── 00001
                ├── saved_model.pb
                └── variables
                    ├── variables.data-00000-of-00001
                    └── variables.index
    

The following table summarizes the relationship between directories and optimizations.

Directory    Optimization
FP16         Conversion to FP16 in addition to the graph optimization
FP32         Graph optimization
INT8         Quantization with INT8 in addition to the graph optimization
original     Original model (no optimization with TF-TRT)
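
Before you upload the models, you can optionally spot-check one of them by inspecting its serving signature with the saved_model_cli tool that ships with TensorFlow on the Deep Learning VM (a sketch; the serve tag set and serving_default signature are the TensorFlow Serving defaults and are assumptions here):

    saved_model_cli show --dir models/resnet/INT8/00001 \
        --tag_set serve --signature_def serving_default

The output lists the input and output tensor names and shapes, which you can compare against the input and output entries in the config.pbtxt files that you use in the next section.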

Deploy an inference server

In this section, you deploy Triton servers with five models. First, you upload the model binary that you created in the previous section to Cloud Storage. Then, you create a GKE cluster and deploy Triton servers on the cluster.

Upload the model binary

  • In the SSH terminal, upload the model binaries and config.pbtxt configuration files to a storage bucket:

    export PROJECT_ID=PROJECT_ID
    export BUCKET_NAME=${PROJECT_ID}-models
    
    mkdir -p original/1/model/
    cp -r models/resnet/original/00001/* original/1/model/
    cp original/config.pbtxt original/1/model/
    cp original/imagenet1k_labels.txt original/1/model/
    
    mkdir -p tftrt_fp32/1/model/
    cp -r models/resnet/FP32/00001/* tftrt_fp32/1/model/
    cp tftrt_fp32/config.pbtxt tftrt_fp32/1/model/
    cp tftrt_fp32/imagenet1k_labels.txt tftrt_fp32/1/model/
    
    mkdir -p tftrt_fp16/1/model/
    cp -r models/resnet/FP16/00001/* tftrt_fp16/1/model/
    cp tftrt_fp16/config.pbtxt tftrt_fp16/1/model/
    cp tftrt_fp16/imagenet1k_labels.txt tftrt_fp16/1/model/
    
    mkdir -p tftrt_int8/1/model/
    cp -r models/resnet/INT8/00001/* tftrt_int8/1/model/
    cp tftrt_int8/config.pbtxt tftrt_int8/1/model/
    cp tftrt_int8/imagenet1k_labels.txt tftrt_int8/1/model/
    
    mkdir -p tftrt_int8_bs16_count4/1/model/
    cp -r models/resnet/INT8/00001/* tftrt_int8_bs16_count4/1/model/
    cp tftrt_int8_bs16_count4/config.pbtxt tftrt_int8_bs16_count4/1/model/
    cp tftrt_int8_bs16_count4/imagenet1k_labels.txt tftrt_int8_bs16_count4/1/model/
    
    gcloud storage buckets create gs://${BUCKET_NAME}
    gcloud storage cp original tftrt_fp32 tftrt_fp16 tftrt_int8 tftrt_int8_bs16_count4 \
        gs://${BUCKET_NAME}/resnet/ --recursive
    

    Replace PROJECT_ID with the ID of the Google Cloud project that you created earlier.

    The following tuning parameters are specified in the config.pbtxt files:

    • Model name
    • Input tensor name and output tensor name
    • GPU allocation to each model
    • Batch size and number of instance groups

    As an example, the original/1/model/config.pbtxt file contains the following content:

    name: "original"
    platform: "tensorflow_savedmodel"
    max_batch_size: 64
    input {
        name: "input"
        data_type: TYPE_FP32
        format: FORMAT_NHWC
        dims: [ 224, 224, 3 ]
    }
    output {
        name: "probabilities"
        data_type: TYPE_FP32
        dims: 1000
        label_filename: "imagenet1k_labels.txt"
    }
    default_model_filename: "model"
    instance_group [
      {
        count: 1
        kind: KIND_GPU
      }
    ]
    dynamic_batching {
      preferred_batch_size: [ 64 ]
      max_queue_delay_microseconds: 20000
    }
    

For details on batch size and number of instance groups, see Performance optimization.

The following table summarizes the five models that you deployed in this section.

Model name               Optimization
original                 Original model (no optimization with TF-TRT)
tftrt_fp32               Graph optimization (batch size=64, instance groups=1)
tftrt_fp16               Conversion to FP16 in addition to the graph optimization (batch size=64, instance groups=1)
tftrt_int8               Quantization with INT8 in addition to the graph optimization (batch size=64, instance groups=1)
tftrt_int8_bs16_count4   Quantization with INT8 in addition to the graph optimization (batch size=16, instance groups=4)
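
To confirm that all five model directories were uploaded, you can list the model repository in the bucket (assuming that the BUCKET_NAME variable from the upload step is still set in your terminal):

    gcloud storage ls gs://${BUCKET_NAME}/resnet/

The output should show the five prefixes original/, tftrt_fp32/, tftrt_fp16/, tftrt_int8/, and tftrt_int8_bs16_count4/.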

Deploy inference servers by using Triton

  1. In the SSH terminal, install the gke-gcloud-auth-plugin authentication plugin, which kubectl uses to authenticate to GKE clusters:

    export USE_GKE_GCLOUD_AUTH_PLUGIN=True
    sudo apt-get install google-cloud-sdk-gke-gcloud-auth-plugin
    
  2. Create a GKE cluster and a GPU node pool with compute nodes that use an NVIDIA T4 GPU:

    gcloud auth login
    gcloud config set compute/zone us-west1-b
    gcloud container clusters create tensorrt-cluster \
        --num-nodes=20
    gcloud container node-pools create t4-gpu-pool \
        --num-nodes=1 \
        --machine-type=n1-standard-8 \
        --cluster=tensorrt-cluster \
        --accelerator type=nvidia-tesla-t4,count=1
    

    The --num-nodes flag specifies 20 instances for the GKE cluster and one instance for the GPU node pool t4-gpu-pool.

    The GPU node pool consists of a single n1-standard-8 instance with an NVIDIA T4 GPU. The number of GPU instances should be equal to or larger than the number of inference server pods, because the NVIDIA T4 GPU cannot be shared by multiple pods on the same instance.

  3. Show the cluster information:

    gcloud container clusters list
    

    The output is similar to the following:

    NAME              LOCATION    MASTER_VERSION  MASTER_IP      MACHINE_TYPE   NODE_VERSION    NUM_NODES  STATUS
    tensorrt-cluster  us-west1-b  1.14.10-gke.17  XX.XX.XX.XX    n1-standard-1  1.14.10-gke.17  21         RUNNING
    
  4. Show the node pool information:

    gcloud container node-pools list --cluster tensorrt-cluster
    

    The output is similar to the following:

    NAME          MACHINE_TYPE   DISK_SIZE_GB  NODE_VERSION
    default-pool  n1-standard-1  100           1.14.10-gke.17
    t4-gpu-pool   n1-standard-8  100           1.14.10-gke.17
    
  5. Enable the DaemonSet workload:

    gcloud container clusters get-credentials tensorrt-cluster
    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
    

    This command installs the NVIDIA GPU driver on the nodes in the GPU node pool. The driver is also installed automatically on any new node that you add to the GPU node pool.

  6. Deploy inference servers on the cluster:

    sed -i.bak "s/YOUR-BUCKET-NAME/${PROJECT_ID}-models/" trtis_deploy.yaml
    kubectl create -f trtis_service.yaml
    kubectl create -f trtis_deploy.yaml
    
  7. Wait a few minutes until services become available.

  8. Get the clusterIP address of Triton and store it in an environment variable:

    export TRITON_IP=$(kubectl get svc inference-server \
      -o "jsonpath={.spec['clusterIP']}")
    echo ${TRITON_IP}
    

At this point, the inference server serves the five ResNet-50 models that you uploaded in Upload the model binary, which are built from the model files that you created in Create model files with different optimizations. Clients can specify which model to use when they send inference requests.
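
To confirm that the server is ready before you continue, you can call its health endpoint from inside the cluster, because the service exposes only a clusterIP address. The following sketch runs a temporary curl Pod; the curlimages/curl image, port 8000 (the Triton HTTP default), and the /api/health/ready path of the older TRTIS HTTP API are assumptions, and newer Triton releases expose /v2/health/ready instead:

    kubectl run triton-check --rm -it --restart=Never --command \
        --image=curlimages/curl -- \
        curl -s -o /dev/null -w "%{http_code}\n" http://${TRITON_IP}:8000/api/health/ready

An HTTP 200 response indicates that the server is ready to accept inference requests.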

Deploy monitoring servers with Prometheus and Grafana

  1. In the SSH terminal, deploy Prometheus servers on the cluster:

    sed -i.bak "s/CLUSTER-IP/${TRITON_IP}/" prometheus-configmap.yml
    kubectl create namespace monitoring
    kubectl apply -f prometheus-service.yml -n monitoring
    kubectl create -f clusterRole.yml
    kubectl create -f prometheus-configmap.yml -n monitoring
    kubectl create -f prometheus-deployment.yml -n monitoring
    
  2. Get the endpoint URL of the Prometheus service.

    ip_port=$(kubectl get svc prometheus-service \
      -o "jsonpath={.spec['clusterIP']}:{.spec['ports'][0]['port']}" -n monitoring)
    echo "http://${ip_port}"
    

    Make a note of the Prometheus endpoint URL, because you use it to configure Grafana later.

  3. Deploy Grafana servers on the cluster:

    kubectl create -f grafana-service.yml -n monitoring
    kubectl create -f grafana-deployment.yml -n monitoring
    
  4. Wait a few minutes until all services become available.

  5. Get the endpoint URL of the Grafana service.

    ip_port=$(kubectl get svc grafana-service \
      -o "jsonpath={.status['loadBalancer']['ingress'][0]['ip']}:{.spec['ports'][0]['port']}" -n monitoring)
    echo "http://${ip_port}"
    

    Make a note of the Grafana endpoint URL to use in the next step.

  6. In a web browser, go to the Grafana URL that you noted in the preceding step.

  7. Sign in with the default user ID and password (admin and admin). When prompted, change the default password.

  8. Click Add your first data source, and in the Time series databases list, select Prometheus.

  9. In the Settings tab, in the URL field, enter the Prometheus endpoint URL that you noted earlier.

  10. Click Save and Test, and then return to the home screen.

  11. Add a monitoring metric for nv_gpu_utilization:

    1. Click Create your first dashboard, and then click Add visualization.
    2. In the Data source list, select Prometheus.
    3. In the Query tab, in the Metric field, enter nv_gpu_utilization.

    4. In the Panel options section, in the Title field, enter GPU Utilization, and then click Apply.

      The page displays a panel for GPU utilization.

  12. Add a monitoring metric for nv_gpu_memory_used_bytes:

    1. Click Add, and select Visualization.
    2. In the Query tab, in the Metric field, enter nv_gpu_memory_used_bytes.

    3. In the Panel options section, in the Title field, enter GPU Memory Used, and then click Save.

  13. To add the dashboard, in the Save dashboard panel, click Save.

    You see the graphs for GPU Utilization and GPU Memory Used.
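
If the Grafana panels stay empty, you can query Prometheus directly to confirm that it is scraping the Triton metrics. The following sketch re-reads the Prometheus endpoint and queries the nv_gpu_utilization metric from a temporary curl Pod (the curlimages/curl image is an assumption):

    prom_endpoint=$(kubectl get svc prometheus-service \
        -o "jsonpath={.spec['clusterIP']}:{.spec['ports'][0]['port']}" -n monitoring)
    kubectl run prom-check --rm -it --restart=Never --command \
        --image=curlimages/curl -- \
        curl -s "http://${prom_endpoint}/api/v1/query?query=nv_gpu_utilization"

The JSON response should contain at least one result for the metric; an empty result list suggests that Prometheus isn't scraping the Triton metrics endpoint.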

Deploy a load testing tool

In this section, you deploy the Locust load testing tool on GKE and generate a workload to measure the performance of the inference servers.

  1. In the SSH terminal, build a Docker image that contains Triton client libraries, and upload it to Container Registry:

    cd ../client
    git clone https://github.com/triton-inference-server/server
    cd server
    git checkout r19.05
    sed -i.bak "s/bootstrap.pypa.io\/get-pip.py/bootstrap.pypa.io\/pip\/2.7\/get-pip.py/" Dockerfile.client
    docker build -t tritonserver_client -f Dockerfile.client .
    gcloud auth configure-docker
    docker tag tritonserver_client \
        gcr.io/${PROJECT_ID}/tritonserver_client
    docker push gcr.io/${PROJECT_ID}/tritonserver_client
    

    The build process can take about 5 minutes. When the process is complete, a command prompt appears in the SSH terminal.

  2. When the build process is finished, build a Docker image to generate testing workload, and upload it to Container Registry:

    cd ..
    sed -i.bak "s/YOUR-PROJECT-ID/${PROJECT_ID}/" Dockerfile
    docker build -t locust_tester -f Dockerfile .
    docker tag locust_tester gcr.io/${PROJECT_ID}/locust_tester
    docker push gcr.io/${PROJECT_ID}/locust_tester
    

    Don't change or replace YOUR-PROJECT-ID in the commands.

    This image is built from the image that you created in the previous step.

  3. Deploy the Locust files service_master.yaml and deployment_master.yaml:

    sed -i.bak "s/YOUR-PROJECT-ID/${PROJECT_ID}/" deployment_master.yaml
    sed -i.bak "s/CLUSTER-IP-TRTIS/${TRITON_IP}/" deployment_master.yaml
    
    kubectl create namespace locust
    kubectl create configmap locust-config --from-literal model=original --from-literal saddr=${TRITON_IP} --from-literal rps=10 -n locust
    
    kubectl apply -f service_master.yaml -n locust
    kubectl apply -f deployment_master.yaml -n locust
    

    The configmap resource specifies the machine learning model that the Locust clients send inference requests to. To target a different model later, you can update this ConfigMap, as shown in the example after this procedure.

  4. Wait a few minutes until services become available.

  5. Get the clusterIP address of the locust-master client, and store that address in an environment variable:

    export LOCUST_MASTER_IP=$(kubectl get svc locust-master -n locust \
        -o "jsonpath={.spec['clusterIP']}")
    echo ${LOCUST_MASTER_IP}
    
  6. Deploy the Locust client:

    sed -i.bak "s/YOUR-PROJECT-ID/${PROJECT_ID}/" deployment_slave.yaml
    sed -i.bak "s/CLUSTER-IP-LOCUST-MASTER/${LOCUST_MASTER_IP}/" deployment_slave.yaml
    kubectl apply -f deployment_slave.yaml -n locust
    

    These commands deploy 10 Locust client Pods that you can use to generate testing workloads. If you can't generate enough requests with the current number of clients, you can change the number of Pods by using the following command:

    kubectl scale deployment/locust-slave --replicas=20 -n locust
    

    If the cluster's default node pool doesn't have enough capacity to run the additional replicas, we recommend that you increase the number of nodes in the GKE cluster.

  7. Copy the URL of the Locust console, and then open this URL in a web browser:

    export LOCUST_IP=$(kubectl get svc locust-master -n locust \
         -o "jsonpath={.status.loadBalancer.ingress[0].ip}")
    echo "http://${LOCUST_IP}:8089"
    

    The Locust console opens, and you can generate testing workloads from it.
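
If you later want to point the load-testing clients at a different model, you can update the locust-config ConfigMap. The following sketch retargets the clients at the tftrt_fp16 model; depending on how the Locust deployment consumes the ConfigMap, you might also need to restart the Locust Pods for the change to take effect:

    kubectl create configmap locust-config \
        --from-literal model=tftrt_fp16 \
        --from-literal saddr=${TRITON_IP} \
        --from-literal rps=10 \
        -n locust --dry-run=client -o yaml | kubectl apply -f -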

Check the running Pods

To ensure that the components are deployed successfully, check that the Pods are running.

  1. In the SSH terminal, check the inference server Pod:

    kubectl get pods
    

    The output is similar to the following:

    NAME                                READY   STATUS    RESTARTS   AGE
    inference-server-67786cddb4-qrw6r   1/1     Running   0          83m
    

    If you don't get the expected output, ensure that you've completed the steps in Deploy inference servers by using Triton.

  2. Check the Locust Pods:

    kubectl get pods -n locust
    

    The output is similar to the following:

    NAME                                READY   STATUS    RESTARTS   AGE
    locust-master-75f6f6d4bc-ttllr      1/1     Running   0          10m
    locust-slave-76ddb664d9-8275p       1/1     Running   0          2m36s
    locust-slave-76ddb664d9-f45ww       1/1     Running   0          2m36s
    locust-slave-76ddb664d9-q95z9       1/1     Running   0          2m36s
    

    If you don't get the expected output, ensure that you've completed the steps in Deploy a load testing tool.

  3. Check the monitoring Pods:

    kubectl get pods -n monitoring
    

    The output is similar to the following:

    NAME                                     READY   STATUS    RESTARTS   AGE
    grafana-deployment-644bbcb84-k6t7v       1/1     Running   0          79m
    prometheus-deployment-544b9b9f98-hl7q8   1/1     Running   0          81m
    

    If you don't get the expected output, ensure that you've completed the steps in Deploy monitoring servers with Prometheus and Grafana.
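
If any Pod is stuck in a state other than Running, you can describe the corresponding Deployment and check its logs to find the cause. For example, for the inference server (assuming that the Deployment is named inference-server, as the Pod name prefix suggests):

    kubectl describe deployment inference-server
    kubectl logs deployment/inference-server --tail=50

Typical causes include a Pod that can't be scheduled because no GPU node is available, or a model repository path in Cloud Storage that the container can't read.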

In the next part of this series, you use this inference server system to learn how various optimizations improve performance and how to interpret those optimizations. For next steps, see Measure and tune performance of a TensorFlow inference system.

What's next