Simplify model serving with custom prediction routines on Vertex AI
The data received at serving time is rarely in the format your model expects. Numerical columns need to be normalized, features created, image bytes decoded, and input values validated. Transforming the data can be as important as the prediction itself. That’s why we’re excited to announce custom prediction routines on Vertex AI, which simplify the process of writing preprocessing and postprocessing code.
With custom prediction routines, you provide your data transformations as Python code, and behind the scenes the Vertex AI SDK builds a custom container that you can test locally and deploy to the cloud.
Understanding custom prediction routines
The Vertex AI pre-built containers handle prediction requests by performing the prediction operation of the machine learning framework. Prior to custom prediction routines, if you wanted to preprocess the input before the prediction is performed, or postprocess the model’s prediction before returning the result, you would need to build a custom container from scratch.
Building a custom serving container requires writing an HTTP server that wraps the trained model, translates HTTP requests into model inputs, and translates model outputs into responses. You can see an example here showing how to build a model server with FastAPI.
With custom prediction routines, Vertex AI provides the serving-related components for you, so that you can focus on your model and data transformations.
The predictor class is responsible for the ML-related logic in a prediction request: loading the model, getting predictions, and applying custom preprocessing and postprocessing. To write custom prediction logic, you’ll subclass the Vertex AI Predictor interface. In most cases, customizing the predictor is all you’ll require, but check out this notebook if you’d like to see an example of customizing the request handler.
One example predictor implementation is the reusable Sklearn predictor. A class of that size is all the code you would need to write in order to build this custom model server.
A predictor implements four methods:
Load: Loads in the model artifacts, and any optional preprocessing artifacts such as an encoder you saved to a pickle file.
Preprocess: Performs the logic to preprocess the input data before the prediction is made. By default, the preprocess method receives a dictionary which contains all the data in the request body after it has been deserialized from JSON.
Predict: Performs the prediction, which will look something like model.predict(instances), depending on what framework you’re using.
Postprocess: Postprocesses the prediction results before returning them to the end user. By default, the output of the postprocess method will be serialized into a JSON object and returned as the response body.
You can customize as many of the above methods as your use case requires. To customize, all you need to do is subclass the predictor and save your new custom predictor to a Python file.
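The four methods above can be sketched in plain Python. This is a minimal stand-in and not the Vertex AI Predictor interface itself: `SketchPredictor`, `SumModel`, `handle_request`, and the bucket path are hypothetical names chosen for illustration, and the request and response shapes (`instances`, `predictions`) are assumed.

```python
import json


class SumModel:
    """Hypothetical stand-in for a trained model: sums each input row."""

    def predict(self, instances):
        return [sum(row) for row in instances]


class SketchPredictor:
    """Plain-Python stand-in that mirrors the four predictor methods."""

    def load(self, artifacts_uri: str) -> None:
        # A real predictor would read model files from artifacts_uri.
        self._model = SumModel()

    def preprocess(self, prediction_input: dict) -> list:
        # Receives the request body after JSON deserialization.
        return prediction_input["instances"]

    def predict(self, instances: list) -> list:
        return self._model.predict(instances)

    def postprocess(self, prediction_results: list) -> dict:
        # The returned dict is serialized to JSON as the response body.
        return {"predictions": prediction_results}


def handle_request(predictor: SketchPredictor, body: str) -> str:
    """Roughly what the serving layer does for each prediction request."""
    prediction_input = json.loads(body)
    instances = predictor.preprocess(prediction_input)
    results = predictor.predict(instances)
    return json.dumps(predictor.postprocess(results))


predictor = SketchPredictor()
predictor.load("gs://hypothetical-bucket/model")  # called once at startup
response = handle_request(predictor, '{"instances": [[1, 2], [3, 4]]}')
```

The load step runs once, while the other three methods run in order for every request, which is why expensive setup belongs in load.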
Let’s take a deeper look at how you might customize each one of these methods.
The load method is where you load in any artifacts from Cloud Storage. This includes the model, but can also include custom preprocessors.
For example, let’s say you wrote a preprocessor to scale numerical features, and stored it as a pickle file called preprocessor.pkl in Cloud Storage.
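A preprocessor of this kind might look like the sketch below. The class name `MySimpleScaler`, its `fit` and `transform` methods, and the training rows are all assumptions made for illustration. The sketch writes the pickle to a temporary local directory, and copying the file to Cloud Storage is left out.

```python
import os
import pickle
import tempfile
from statistics import mean, pstdev


class MySimpleScaler:
    """Hypothetical preprocessor that standardizes every numerical column."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self._means = [mean(col) for col in columns]
        # Fall back to 1.0 for constant columns so transform never divides by zero.
        self._stds = [pstdev(col) or 1.0 for col in columns]
        return self

    def transform(self, rows):
        return [
            [(value - m) / s for value, m, s in zip(row, self._means, self._stds)]
            for row in rows
        ]


training_rows = [[1.0, 10.0], [3.0, 30.0]]
scaler = MySimpleScaler().fit(training_rows)

# Serialize the fitted preprocessor to preprocessor.pkl.
pickle_path = os.path.join(tempfile.mkdtemp(), "preprocessor.pkl")
with open(pickle_path, "wb") as f:
    pickle.dump(scaler, f)

# Reading it back yields an object that scales new rows the same way.
with open(pickle_path, "rb") as f:
    restored = pickle.load(f)
```

Unpickling a custom class requires the class definition to be importable wherever the file is loaded, so the module that defines the preprocessor needs to travel with the predictor code.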
When customizing the predictor, you would write a load method to read the pickle file, where artifacts_uri is the Cloud Storage path to your model and preprocessing artifacts.
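A load method along these lines could be sketched as follows. To keep the example runnable without cloud access, it assumes `artifacts_uri` is already a local directory; a real predictor would first copy the artifacts from Cloud Storage to local disk. The file name `model.pkl`, the class name, and the placeholder dictionaries are hypothetical.

```python
import os
import pickle
import tempfile


class LoadSketchPredictor:
    """Stand-in predictor showing only the load step."""

    def load(self, artifacts_uri: str) -> None:
        # Assumption: artifacts_uri is a local directory. For a gs:// path,
        # the files would have to be downloaded before they can be opened.
        with open(os.path.join(artifacts_uri, "model.pkl"), "rb") as f:
            self._model = pickle.load(f)
        with open(os.path.join(artifacts_uri, "preprocessor.pkl"), "rb") as f:
            self._preprocessor = pickle.load(f)


# Demo: write two placeholder artifacts, then load them back.
artifacts_dir = tempfile.mkdtemp()
placeholders = {
    "model.pkl": {"weights": [0.5, 2.0]},
    "preprocessor.pkl": {"means": [2.0, 20.0], "stds": [1.0, 10.0]},
}
for name, obj in placeholders.items():
    with open(os.path.join(artifacts_dir, name), "wb") as f:
        pickle.dump(obj, f)

predictor = LoadSketchPredictor()
predictor.load(artifacts_dir)
```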
The preprocess method is where you write the logic to perform any preprocessing needed for your serving data. It can be as simple as applying the preprocessor you loaded in the load method.
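As a rough sketch of that simple case, the preprocess method pulls the instances out of the request dictionary and hands them to the loaded preprocessor. `FixedScaler` and `PreprocessSketchPredictor` are hypothetical stand-ins, and the `instances` key is an assumed request shape.

```python
class FixedScaler:
    """Hypothetical fitted preprocessor with known column statistics."""

    def __init__(self, means, stds):
        self._means = means
        self._stds = stds

    def transform(self, rows):
        return [
            [(value - m) / s for value, m, s in zip(row, self._means, self._stds)]
            for row in rows
        ]


class PreprocessSketchPredictor:
    """Stand-in predictor showing only the preprocess step."""

    def __init__(self, preprocessor):
        # In a full predictor this attribute would be set in load().
        self._preprocessor = preprocessor

    def preprocess(self, prediction_input: dict) -> list:
        instances = prediction_input["instances"]
        return self._preprocessor.transform(instances)


predictor = PreprocessSketchPredictor(FixedScaler([2.0, 20.0], [1.0, 10.0]))
scaled = predictor.preprocess({"instances": [[3.0, 30.0], [2.0, 25.0]]})
```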
Instead of loading in a preprocessor, you might write the preprocessing directly in the preprocess method. For example, you might need to check that your inputs are in the format you expect. Let’s say your model expects the feature at index 3 to be a string in its abbreviated form. You want to check that at serving time the value for that feature is abbreviated.
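One way to write such a check is sketched here. The mapping of full names to abbreviations is invented for illustration, as is the choice to convert known full names and reject anything else; your own feature and error handling will differ.

```python
# Hypothetical mapping from the full form to the abbreviation the model expects.
ABBREVIATIONS = {"California": "CA", "New York": "NY", "Texas": "TX"}


def preprocess(prediction_input: dict) -> list:
    """Ensure the feature at index 3 is abbreviated in every instance."""
    valid = set(ABBREVIATIONS.values())
    checked = []
    for row in prediction_input["instances"]:
        row = list(row)  # copy so the caller's data is not mutated
        value = row[3]
        if value not in valid:
            if value not in ABBREVIATIONS:
                raise ValueError(f"Unrecognized value at index 3: {value!r}")
            row[3] = ABBREVIATIONS[value]  # convert the full name
        checked.append(row)
    return checked


cleaned = preprocess(
    {"instances": [[1.5, 2.0, 0.3, "Texas"], [0.7, 1.1, 0.9, "NY"]]}
)
```

Rejecting unknown values with an error surfaces bad requests early instead of letting the model score an input it was never trained on.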
There are numerous other ways you could customize the preprocessing logic. You might need to tokenize text for a language model, generate new features, or load data from an external source.
The predict method usually just calls model.predict, and generally doesn't need to be customized unless you're building your predictor from scratch instead of with a reusable predictor.
Sometimes the model prediction is only the first step. After you get a prediction from the model, you might need to transform it in the postprocess method to make it valuable to the end user. This might be something as simple as converting the numerical class label returned by the model to the string label.
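A label conversion like that might be sketched as follows. The class names are hypothetical, and the `predictions` key is an assumed response shape.

```python
# Hypothetical class names, indexed by the numerical label the model returns.
CLASS_LABELS = ["setosa", "versicolor", "virginica"]


def postprocess(prediction_results: list) -> dict:
    """Replace each numerical class label with its string label."""
    labels = [CLASS_LABELS[int(index)] for index in prediction_results]
    return {"predictions": labels}


response = postprocess([0, 2, 1])
```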
Or you could implement additional business logic. For example, you might want to return a prediction only if the model’s confidence is above a certain threshold. If it’s below, you want the input sent to a human for review instead.
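A thresholding rule of this kind could be sketched as below. It assumes the predict step returns per-class probabilities rather than labels, and the threshold value and the `needs_human_review` field are invented for illustration.

```python
CONFIDENCE_THRESHOLD = 0.8  # hypothetical cutoff


def postprocess(probabilities: list) -> dict:
    """Return a label only when the top class probability meets the cutoff."""
    predictions = []
    for class_probs in probabilities:
        confidence = max(class_probs)
        confident = confidence >= CONFIDENCE_THRESHOLD
        predictions.append(
            {
                # Index of the most likely class, or None when unsure.
                "label": class_probs.index(confidence) if confident else None,
                "confidence": confidence,
                "needs_human_review": not confident,
            }
        )
    return {"predictions": predictions}


response = postprocess([[0.05, 0.95], [0.55, 0.45]])
```

A downstream system can then route any item flagged for review to a person instead of acting on the model output.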
Just like with preprocessing, there are numerous ways you can postprocess your data with custom prediction routines. You might need to detokenize text for a language model, convert the model output into a more readable format for the end user, or even call a Vertex AI Matching Engine index endpoint to search for data with a similar embedding.
When you’ve written your predictor, save the class to a Python file. You can then build your image with the Vertex AI SDK, pointing it at LOCAL_SOURCE_DIR, a local directory that contains the Python file where you saved your custom predictor.
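The build step might look something like the sketch below. It relies on the SDK's `LocalModel.build_cpr_model` call as we understand it, so check the SDK reference for the current signature. The helper names `image_uri` and `build_image`, the project and repository values, and the presence of a `requirements.txt` file in the source directory are assumptions. Running `build_image` requires the SDK and Docker to be installed.

```python
import os


def image_uri(region: str, project_id: str, repository: str, image: str) -> str:
    """Compose an Artifact Registry image URI for the serving container."""
    return f"{region}-docker.pkg.dev/{project_id}/{repository}/{image}"


def build_image(local_source_dir: str, predictor_class, output_image_uri: str):
    """Build the custom container from the directory holding the predictor."""
    # Imported lazily so the rest of this sketch runs without the SDK.
    from google.cloud.aiplatform.prediction import LocalModel

    return LocalModel.build_cpr_model(
        local_source_dir,
        output_image_uri,
        predictor=predictor_class,
        # Assumption: dependencies are listed in requirements.txt.
        requirements_path=os.path.join(local_source_dir, "requirements.txt"),
    )


# Hypothetical region, project, repository, and image names.
uri = image_uri("us-central1", "my-project", "my-repo", "my-cpr-image")
```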
Once the image is built, you can test it by deploying it to a local endpoint, then calling the predict method and passing in the request data. You’ll set artifact_uri to the path in Cloud Storage where you’ve saved your model and any artifacts needed for preprocessing or postprocessing. You can also use a local path for testing.
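A local test could be sketched as follows. The helper names, the sample instance, and the request file name are hypothetical. `local_model` is assumed to be the object returned by the build step, and the `deploy_to_local_endpoint` and `predict` arguments reflect our reading of the SDK, so verify them against the reference before relying on them.

```python
import json
import os
import tempfile


def write_request_file(instances: list, path: str) -> str:
    """Write a request body in the {"instances": [...]} shape."""
    with open(path, "w") as f:
        json.dump({"instances": instances}, f)
    return path


def predict_locally(local_model, artifact_uri: str, request_file: str):
    """Serve the image in a local container and send one request to it."""
    # The context manager stops the container on exit. A Cloud Storage
    # artifact_uri would also need credentials to be supplied.
    with local_model.deploy_to_local_endpoint(
        artifact_uri=artifact_uri
    ) as local_endpoint:
        return local_endpoint.predict(
            request_file=request_file,
            headers={"Content-Type": "application/json"},
        )


request_path = write_request_file(
    [[0.3, 1.2, 5.0]], os.path.join(tempfile.mkdtemp(), "instances.json")
)
```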
Deploy to Vertex AI
After testing the model locally to confirm that the predictions work as expected, the next steps are to push the image to Artifact Registry, import the model to the Vertex AI Model Registry, and optionally deploy it to an endpoint if you want online predictions.
When the model has been uploaded to Vertex AI and deployed, you’ll be able to see it in the model registry. You can then make prediction requests like you would with any other model you have deployed on Vertex AI.
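Put together, the push, import, deploy, and predict steps might be sketched like this. The function names and the default machine type are assumptions, and `local_model` is again the object produced by the build step. The SDK calls (`push_image`, `Model.upload`, `deploy`, `predict`) are used as we understand them; running them requires the SDK, a configured Google Cloud project, and an existing Artifact Registry repository.

```python
def push_and_deploy(
    local_model,
    display_name: str,
    artifact_uri: str,
    machine_type: str = "n1-standard-4",  # hypothetical default
):
    """Push the image, import the model, and deploy it to an endpoint."""
    # Imported lazily so this sketch can be loaded without the SDK.
    from google.cloud import aiplatform

    local_model.push_image()  # push the container to Artifact Registry
    model = aiplatform.Model.upload(
        local_model=local_model,
        display_name=display_name,
        artifact_uri=artifact_uri,
    )
    # Deployment is only needed for online predictions.
    return model.deploy(machine_type=machine_type)


def predict_online(endpoint, instances: list) -> list:
    """Send instances to a deployed endpoint and return the predictions."""
    return endpoint.predict(instances=instances).predictions
```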
You now know the basics of how to use custom prediction routines to add powerful customization to your serving workflows without having to worry about model servers or building Docker containers. To get hands-on experience with an end-to-end example, check out this codelab. It’s time to start writing some custom prediction code of your own!