Keep up with the latest announcements from Google Cloud Next '21. Click here.

Developers & Practitioners

PyTorch on Google Cloud: How to deploy PyTorch models on Vertex AI

This article is the next step in the series of PyTorch on Google Cloud using Vertex AI. In the preceding article, we fine-tuned a Hugging Face Transformers model for a sentiment classification task using PyTorch on Vertex Training service. In this post, we show how to deploy a PyTorch model on the Vertex Prediction service for serving predictions from trained model artifacts. 

Now let’s walk through the deployment of a Pytorch model using TorchServe as a custom container by deploying the model artifacts to a Vertex Endpoint. You can find the accompanying code for this blog post on the GitHub repository and the Jupyter Notebook.

Deploying a PyTorch Model on Vertex Prediction Service

Vertex Prediction service is Google Cloud's managed model serving platform. As a managed service, the platform handles infrastructure setup, maintenance, and management. Vertex Prediction supports both CPU and GPU inferencing and offers a selection of n1-standard machine shapes in Compute Engine, letting you customize the scale unit to fit your requirements. Vertex Prediction service is the most effective way to deploy your models to serve predictions for the following reasons:

  • Simple: Vertex Prediction service simplifies model service with pre-built containers for prediction that requires you to only specify where you store your model artifacts. 
  • Flexible: With custom containers, Vertex Prediction offers flexibility by lowering the abstraction level so that you can choose whichever ML framework, model server, preprocessing, and post-processing that you need.
  • Assistive: Built-in tooling to track performance of models and explain or understand predictions.

TorchServe is the recommended framework to deploy PyTorch models in production. TorchServe’s CLI makes it easy to deploy a PyTorch model locally or can be packaged as a container that can be scaled out by the Vertex Prediction service. The custom container capability of Vertex Prediction provides a flexible way to define the environment where the TorchServe model server is run. 

In this blog post, we deploy a container running a TorchServe model server on the Vertex Prediction service to serve predictions from a fine-tuned transformer model from Hugging Face for the sentiment classification task. You can then send input requests with text to a Vertex Endpoint to classify sentiment as positive or negative.

or negative

Figure 1. Serving with custom containers on Vertex Prediction service

Following are the steps to deploy a PyTorch model on Vertex Prediction:

  1. Download the trained model artifacts.
  2. Package the trained model artifacts including default or custom handlers by creating an archive file using the Torch Model Archiver tool.
  3. Build a custom container (Docker) compatible with the Vertex Prediction service to serve the model using TorchServe.
  4. Upload the model with the custom container image as a Vertex Model resource.
  5. Create a Vertex Endpoint and deploy the model resource to the endpoint to serve predictions.

1. Download the trained model artifacts

Model artifacts are created by the training application code that are required to serve predictions. TorchServe expects model artifacts to be in either a saved model binary (.bin) format or a traced model (.pth or .pt) format. In the previous post, we trained a Hugging Face Transformer model on the Vertex Training service and saved the model as a model binary (.bin) by calling the .save_model() method and then saved the model artifacts to a Cloud Storage bucket.

  trainer.save_model('./cls')

Based on the training job name, you can get the location of model artifacts from Vertex Training using the Cloud Console or gcloud ai custom-jobs describe command and then download the artifacts from the Cloud Storage bucket.

  # set job name
JOB_NAME="[my-training-job-name]" # <-- change job name


# get job id
JOB_ID=$(gcloud beta ai custom-jobs list --region=$REGION --filter="displayName:"$JOB_NAME --format="get(name)")


# get model artifacts directory location set when running the training job
GCS_MODEL_ARTIFACTS_URI=$(gcloud beta ai custom-jobs describe $JOB_ID --region=$REGION --format="get(jobSpec.baseOutputDirectory.outputUriPrefix)")


# download model artifacts from GCS to a local directory
gsutil -m cp -r $GCS_MODEL_ARTIFACTS_URI/ ./

2. Create a custom model handler to handle prediction requests

TorchServe uses a base handler module to pre-process the input before being fed to the model or post-process the model output before sending the prediction response. TorchServe provides default handlers for common use cases such as image classification, object detection, segmentation and text classification. For the sentiment analysis task, we will create a custom handler because the input text needs to be tokenized using the same tokenizer used at the training time to avoid the training-serving skew

The custom handler presented here does the following:

  • Pre-process the input text  before sending it to the model for inference using the same Hugging Face Transformers Tokenizer class used during training
  • Invoke the model for inference
  • Post-process output from the model before sending back a response
  class TransformersClassifierHandler(BaseHandler):
    """
    The handler takes an input string and returns the classification text 
    based on the serialized transformers checkpoint.
    """
    def __init__(self):
        super(TransformersClassifierHandler, self).__init__()
        self.initialized = False

    def initialize(self, ctx):
        """ Loads the model.pt file and initializes the model object.
        Instantiates Tokenizer for preprocessor to use
        Loads labels to name mapping file for post-processing inference response
        """
        self.manifest = ctx.manifest

        properties = ctx.system_properties
        model_dir = properties.get("model_dir")
        self.device = torch.device("cuda:" + str(properties.get("gpu_id")) if torch.cuda.is_available() else "cpu")

        # Read model serialize/pt file
        serialized_file = self.manifest["model"]["serializedFile"]
        model_pt_path = os.path.join(model_dir, serialized_file)
        if not os.path.isfile(model_pt_path):
            raise RuntimeError("Missing the model.pt or pytorch_model.bin file")
        
        # Load model
        self.model = AutoModelForSequenceClassification.from_pretrained(model_dir)
        self.model.to(self.device)
        self.model.eval()
        logger.debug('Transformer model from path {0} loaded successfully'.format(model_dir))
        
        # Ensure to use the same tokenizer used during training
        self.tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

        # Read the mapping file, index to object name
        mapping_file_path = os.path.join(model_dir, "index_to_name.json")

        if os.path.isfile(mapping_file_path):
            with open(mapping_file_path) as f:
                self.mapping = json.load(f)
        else:
            logger.warning('Missing the index_to_name.json file. Inference output will not include class name.')

        self.initialized = True
 def preprocess(self, data):
        """ Preprocessing input request by tokenizing
            Extend with your own preprocessing steps as needed
        """
        text = data[0].get("data")
        if text is None:
            text = data[0].get("body")
        sentences = text.decode('utf-8')
        logger.info("Received text: '%s'", sentences)

        # Tokenize the texts
        tokenizer_args = ((sentences,))
        inputs = self.tokenizer(*tokenizer_args,
                                padding='max_length',
                                max_length=128,
                                truncation=True,
                                return_tensors = "pt")
        return inputs

    def inference(self, inputs):
        """ Predict the class of a text using a trained transformer model.
        """
        prediction = self.model(inputs['input_ids'].to(self.device))[0].argmax().item()

        if self.mapping:
            prediction = self.mapping[str(prediction)]

        logger.info("Model predicted: '%s'", prediction)
        return [prediction]

    def postprocess(self, inference_output):
        return inference_output

3. Create custom container image with TorchServe to serve predictions

When deploying a PyTorch model on the Vertex Prediction service, you must use a custom container image that runs a HTTP server, such as TorchServe in this case. The custom container image must meet the requirements to be compatible with the Vertex Prediction service. We create a Dockerfile with TorchServe as the base image that meets custom container image requirements and performs the following steps:

  • Install dependencies required for the custom handler to process the model inference requests. For e.g. transformers package in the use case.
  • Copy trained model artifacts to /home/model-server/ directory of the container image. We assume model artifacts are available when the image is built. In the notebook, we download the trained model artifacts from the Cloud Storage bucket saved as part of hyperparameter tuning trials.
  • Add the custom handler script to /home/model-server/ directory of the container image.
  • Create /home/model-server/config.properties to define the serving configuration such as health check and prediction listener ports
  • Run the Torch Model Archiver tool to create a model archive file from the files copied into the image /home/model-server/. The model archive is saved in the /home/model-server/model-store/ with name same as <model-name>.mar
  • Launch Torchserve HTTP server to enable serving of the model referencing the configuration properties and the model archive file
  FROM pytorch/torchserve:latest-cpu

# install dependencies
RUN pip3 install transformers

# copy model artifacts, custom handler and other dependencies
COPY ./custom_text_handler.py /home/model-server/
COPY ./index_to_name.json /home/model-server/
COPY ./model/$APP_NAME/ /home/model-server/

# create torchserve configuration file
USER root
RUN printf "\nservice_envelope=json" >> /home/model-server/config.properties
RUN printf "\ninference_address=http://0.0.0.0:7080" >> /home/model-server/config.properties
RUN printf "\nmanagement_address=http://0.0.0.0:7081" >> /home/model-server/config.properties
USER model-server

# expose health and prediction listener ports from the image
EXPOSE 7080
EXPOSE 7081

# create model archive file packaging model artifacts and dependencies
RUN torch-model-archiver -f \
  --model-name=$APP_NAME \
  --version=1.0 \
  --serialized-file=/home/model-server/pytorch_model.bin \
  --handler=/home/model-server/custom_text_handler.py \
  --extra-files "/home/model-server/config.json,/home/model-server/tokenizer.json,/home/model-server/training_args.bin,/home/model-server/tokenizer_config.json,/home/model-server/special_tokens_map.json,/home/model-server/vocab.txt,/home/model-server/index_to_name.json" \
  --export-path=/home/model-server/model-store

# run Torchserve HTTP serve to respond to prediction requests
CMD ["torchserve", \
     "--start", \
     "--ts-config=/home/model-server/config.properties", \
     "--models", \
     "$APP_NAME=$APP_NAME.mar", \
     "--model-store", \
     "/home/model-server/model-store"]

Let’s understand the functionality of TorchServe and Torch Model Archiver tools in these steps.

Torch Model Archiver

Torchserve provides a model archive utility to package a PyTorch model for deployment and the resulting model archive file is used by torchserve at serving time. Following is the torch-model-archiver command added in Dockerfile to generate a model archive file for the text classification model:

  torch-model-archiver -f \
  --model-name=$APP_NAME \
  --version=1.0 \
  --serialized-file=/home/model-server/pytorch_model.bin \
  --handler=/home/model-server/custom_text_handler.py \
  --extra-files "/home/model-server/config.json,/home/model-server/tokenizer.json,/home/model-server/training_args.bin,/home/model-server/tokenizer_config.json,/home/model-server/special_tokens_map.json,/home/model-server/vocab.txt,/home/model-server/index_to_name.json" \
  --export-path=/home/model-server/model-store
  • Model Binary (--serialized-file parameter): Model binary is the serialized Pytorch model that can either be the saved model binary (.bin) file or a traced model (.pth) file generated using TorchScript - Torch Just In Time (JIT) compiler. In this example we will use the saved model binary generated in the previous post by fine-tuning a pre-trained Hugging Face Transformer model.

    • NOTE: JIT compiler trace may have some device-dependent operations in the output. So it is often a good practice to generate the trace in the same environment where the model will be deployed.

  • Model Handler (--handler parameter): Model handler can be TorchServe's default handlers or path to a python file to handle custom TorchServe inference logic that can pre-process model inputs or post-process model outputs. We defined a custom handler script in the previous section of this post.

  • Extra files (--extra-files parameter): Extra files allow you to package additional files referenced by the model handler. For example, a few of the files referred in the command are:

    • index_to_name.json: In the custom handler defined earlier, the post-processing step uses an index-to-name JSON file to map prediction target indexes to human-readable labels

    • config.json: Required for AutoModelForSequenceClassification.from_pretrained method to load the model

    • vocab.txt: vocab files used by the tokenizer

TorchServe

TorchServe wraps PyTorch models into a set of REST APIs served by a HTTP web server. Adding the torchserve command to the CMD or ENTRYPOINT of the custom container launches this server. In this article we will only explore prediction and health check APIs. The Explainable AI API for PyTorch models on Vertex endpoints is currently supported only for tabular data.

TorchServe Config (--ts-config parameter):  TorchServe config allows you to customize the inference address and management ports. We also configure service_envelop field to json to indicate the expected input format for TorchServe. Refer to TorchServe documentation to configure other parameters. We create a config.properties file and pass it as TorchServe config.

  inference_address=http://0.0.0.0:7080
management_address=http://0.0.0.0:7081
service_envelope=json
  • Model Store (--model-store parameter): Model store location from where local or default models can be loaded

  • Model Archive (--models parameter):  Models to be loaded by TorchServe using [model_name=]model_location format. Model location is the model archive file in the model store.

4. Build and push the custom container image

  CUSTOM_PREDICTOR_IMAGE_URI = f"gcr.io/{PROJECT_ID}/pytorch_predict_{APP_NAME}"
docker build --tag=$CUSTOM_PREDICTOR_IMAGE_URI ./predictor

Before pushing the image to the Container Registry, you can test the docker image locally by sending input requests to a local TorchServe deployment running inside docker.

  • To run the container image as a container locally, run the following command:

  # run docker container to start local TorchServe deployment
docker run -t -d --rm -p 7080:7080 --name=local_bert_classifier $CUSTOM_PREDICTOR_IMAGE_URI
# delay to allow the model to be loaded in torchserve (takes a few seconds)
sleep 20
  • To send the container's server a health check, run the following command:

  cat > ./predictor/instances.json <<END
{ 
   "instances": [
     { 
       "data": {
         "b64": "$(echo 'Take away the CGI and the A-list cast and you end up with a film with less punch.' | base64 --wrap=0)"
       }
     }
   ]
}
END

curl -s -X POST \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @./predictor/instances.json \
  http://localhost:7080/predictions/$APP_NAME/

This request uses a test sentence. If successful, the server returns the prediction in the following format:

  {"predictions": ["Negative"]}
  • After the response is verified, it confirms that the custom handler, model packaging and torchserve config are working as expected. You can stop the TorchServe local server by stopping the  container.

  docker stop local_bert_classifier

Now push the custom container image to the Container Registry, which will be deployed to the Vertex Endpoint in the next step.

  docker push $CUSTOM_PREDICTOR_IMAGE_URI

NOTE: You can also build and push the custom container image to the Artifact Registry repository instead of the Container Registry repository.

5. Deploying the serving container to Vertex Endpoint 

We have packaged the model and built the serving container image. The next step is to deploy it to a Vertex Endpoint. A model must be deployed to an endpoint before it can be used to serve online predictions. Deploying a model associates physical resources with the model so it can serve online predictions with low latency. We use Vertex SDK for Python to upload the model and deploy it to an endpoint. Following steps are applicable to any model trained either on Vertex Training service or elsewhere such as on-prem.

Upload model

We upload the model artifacts to Vertex AI and create a Model resource for the deployment. In this example the artifact is the serving container image URI. Notice that the predict and health routes (mandatory routes) and container port(s) are also specified at this step.

  from google.cloud import aiplatform

VERSION = 1
model_display_name = f"{APP_NAME}-v{VERSION}"
model_description = "PyTorch based text classifier with custom container"

MODEL_NAME = APP_NAME
health_route = "/ping"
predict_route = f"/predictions/{MODEL_NAME}"
serving_container_ports = [7080]

model = aiplatform.Model.upload(
    display_name=model_display_name,
    description=model_description,
    serving_container_image_uri=CUSTOM_PREDICTOR_IMAGE_URI,
    serving_container_predict_route=predict_route,
    serving_container_health_route=health_route,
    serving_container_ports=serving_container_ports,
)

model.wait()

print(model.display_name)
print(model.resource_name)

After the model is uploaded, you can view the model in the Models page on the Google Cloud Console under the Vertex AI section.

Figure 2

Figure 2. Models page on Google Cloud console under the Vertex AI section

Create endpoint

Create a service endpoint to deploy one or more models. An endpoint provides a service URL where the prediction requests are sent. You can skip this step if you are deploying the model to an existing endpoint.

  endpoint_display_name = f"{APP_NAME}-endpoint"
endpoint = aiplatform.Endpoint.create(display_name=endpoint_display_name)

After the endpoint is created, you can view the endpoint in the Endpoints page on the Google Cloud Console under the Vertex AI section.

Figure 3

Figure 3. Endpoints page on Google Cloud console under the Vertex AI section

Deploy the model to endpoint

The final step is deploying the model to an endpoint. The deploy method provides the interface to specify the endpoint where the model is deployed and compute parameters including machine type, scaling minimum and maximum replica counts, and traffic split.

  traffic_percentage = 100
machine_type = "n1-standard-4"
deployed_model_display_name = model_display_name
min_replica_count = 1
max_replica_count = 3
sync = True

model.deploy(
    endpoint=endpoint,
    deployed_model_display_name=deployed_model_display_name,
    machine_type=machine_type,
    traffic_percentage=traffic_percentage,
    sync=sync,
)

model.wait()

After deploying the model to the endpoint, you can manage and monitor the deployed models from the Endpoints page on the Google Cloud Console under the Vertex AI section.

Figure 4

Figure 4. Manage and monitor models deployed on Endpoint from Google Cloud console under the Vertex AI section

Test the deployment

Now that the model is deployed, we can use the endpoint.predict() method to send base64 encoded text to the prediction request and get the predicted sentiment in response.

  import base64
input_text = b"Jaw dropping visual effects and action! One of the best I have seen to date."

print(f"Input text: \n\t{input_text.decode('utf-8')}\n")
b64_encoded = base64.b64encode(input_text)
instance = [{
    "data": {
        "b64": str(b64_encoded.decode('utf-8'))
    }
}]


prediction = endpoint.predict(instances=instance)
print(f"Prediction response: \n\t{prediction}")

Alternatively, you can also call the Vertex Endpoint to make predictions using the gcloud beta ai endpoints predict command. Refer to the Jupyter Notebook for complete code.

Cleaning up the environment

After you are done experimenting, you can either stop or delete the Notebooks instance. Delete the Notebook instance to prevent any further charges. If you want to save your work, you can choose to stop the instance instead

To clean up all Google Cloud resources created in this post and the previous post, you can delete the individual resources created:

  • Training Jobs

  • Model

  • Endpoint

  • Cloud Storage Bucket

  • Container Images

Follow the Cleaning Up section in the Jupyter Notebook to delete the individual resources.

What’s next?

Continuing from the training and hyperparameter tuning of the PyTorch based text classification model on Vertex AI, we showed deployment of the PyTorch model on Vertex Prediction service. We deployed a custom container running a TorchServe model server on the Vertex Prediction service to serve predictions from the trained model artifacts. As the next steps, you can work through this example on Vertex AI or perhaps deploy one of your own PyTorch models.

References

In the next article of this series, we will show how you can orchestrate a machine learning workflow using Vertex Pipelines to tie together the individual steps which we have seen so far, i.e. training, hyperparameter tuning and deployment of a PyTorch model. This will lay the foundation for CI/CD (Continuous Integration / Continuous Delivery) for machine learning models on the Google Cloud platform.

Stay tuned. Thank you for reading! Have a question or want to chat? Find authors here - Rajesh [Twitter | LinkedIn] and Vaibhav [LinkedIn].


Thanks to Karl Weinmeister and Jordan Totten for helping and reviewing the post.