How to automatically scale your machine learning predictions
Karl Weinmeister
Head of Cloud Product DevRel
Quanjie Lin
Software Engineer
Historically, one of the biggest challenges in the data science field has been that many models don't make it past the experimental stage. As the field has matured, we've seen MLOps processes and tooling emerge that have increased project velocity and reproducibility. While we still have a ways to go, more models than ever before are crossing the finish line into production.
That leads to the next question for data scientists: how will my model scale in production? In this blog post, we will discuss how to use a managed prediction service, Google Cloud’s AI Platform Prediction, to address the challenges of scaling inference workloads.
Inference Workloads
In a machine learning project, there are two primary workloads: training and inference. Training is the process of building a model by learning from data samples, and inference is the process of using that model to make a prediction with new data.
Typically, training workloads are not only long-running, but also sporadic. If you're using a feed-forward neural network, a training workload will include multiple forward and backward passes through the data, updating weights and biases to minimize errors. In some cases, the model created from this process will be used in production for quite some time, and in others, new training workloads might be triggered frequently to retrain the model with new data.
On the other hand, an inference workload consists of a high volume of smaller transactions. An inference operation essentially is a forward pass through a neural network: starting with the inputs, perform matrix multiplication through each layer and produce an output. The workload characteristics will be highly correlated with how the inference is used in a production application. For example, in an e-commerce site, each request to the product catalog could trigger an inference operation to provide product recommendations, and the traffic served will peak and lull with the e-commerce traffic.
Balancing Cost and Latency
The primary challenge for inference workloads is balancing cost with latency. It's a common requirement for production workloads to have latency < 100 milliseconds for a smooth user experience. On top of that, application usage can be spiky and unpredictable, but the latency requirements don't go away during times of intense use.
To ensure that latency requirements are always met, it might be tempting to provision an abundance of nodes. The downside of overprovisioning is that many nodes will not be fully utilized, leading to unnecessarily high costs.
On the other hand, underprovisioning will reduce cost but lead to missing latency targets due to servers being overloaded. Even worse, users may experience errors if timeouts or dropped packets occur.
It gets even trickier when we consider that many organizations are using machine learning in multiple applications. Each application has a different usage profile, and each application might be using a different model with unique performance characteristics. For example, in this paper, Facebook describes the diverse resource requirements of models they are serving for natural language, recommendation, and computer vision.
AI Platform Prediction Service
The AI Platform Prediction service allows you to easily host your trained machine learning models in the cloud and automatically scale them. Your users can make predictions using the hosted models with input data. The service supports both online prediction, when timely inference is required, and batch prediction, for processing large jobs in bulk.
To deploy your trained model, you start by creating a "model", which is essentially a package for related model artifacts. Within that model, you then create a "version", which consists of the model file and configuration options such as the machine type, framework, region, scaling, and more. You can even use a custom container with the service for more control over the framework, data processing, and dependencies.
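For example, assuming the MODEL and REGION environment variables used in the examples later in this post, the model resource itself can be created with a single gcloud command before you add versions to it (check the documentation for the full set of available flags):
gcloud ai-platform models create ${MODEL} --region ${REGION}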
To make predictions with the service, you can use the REST API, command line, or a client library. For online prediction, you specify the project, model, and version, and then pass in a formatted set of instances as described in the documentation.
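As a quick illustration, a hypothetical instances.json file containing one JSON instance per line could be sent to a deployed version from the command line (the exact flags, such as --region for regional endpoints, are described in the documentation):
gcloud ai-platform predict --model ${MODEL} --version v1 --region ${REGION} \
  --json-instances instances.json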
Introduction to scaling options
When defining a version, you can specify the number of prediction nodes to use with the manualScaling.nodes option. When you set the number of nodes manually, that many nodes stay running at all times, whether or not they are serving predictions. To adjust this number, you create a new model version with a different configuration.
You can also configure the service to automatically scale. The service will increase nodes as traffic increases, and remove them as it decreases. Auto-scaling can be turned on with the autoScaling.minNodes option. You can also set a maximum number of nodes with autoScaling.maxNodes. These settings are key to improving utilization and reducing costs, enabling the number of nodes to adjust within the constraints that you specify.
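For reference, here is a rough sketch of how these two options appear in the version body sent to the service. The option names (manualScaling.nodes, autoScaling.minNodes, autoScaling.maxNodes) come from the documentation referenced above; confirm their exact placement in the CreateVersion API reference.
Manual scaling:
"manualScaling": {
  "nodes": 2
}
Auto-scaling:
"autoScaling": {
  "minNodes": 1,
  "maxNodes": 3
}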
Continuous availability across zones can be achieved with multi-zone scaling, to address potential outages in one of the zones. Nodes will be distributed across zones in the specified region automatically when using auto-scaling with at least 1 node or manual scaling with at least 2 nodes.
GPU Support
When defining a model version, you specify a machine type and, optionally, a GPU accelerator. Each virtual machine instance can offload operations to the attached GPU, which can significantly improve performance. For more information on supported GPUs in Google Cloud, see this blog post: Reduce costs and increase throughput with NVIDIA T4s, P100s, V100s.
The AI Platform Prediction service has recently introduced GPU support for the auto-scaling feature. The service will look at both CPU and GPU utilization to determine if scaling up or down is required.
How does auto-scaling work?
The online prediction service scales the number of nodes it uses to maximize the number of requests it can handle without introducing too much latency. To do that, the service:
Allocates some nodes (the number can be configured by setting the minNodes option on your model version) the first time you request predictions.
Automatically scales up the model version’s deployment as soon as you need it (traffic goes up).
Automatically scales it back down to save cost when you don’t (traffic goes down).
Keeps at least a minimum number of nodes (configured with the minNodes option on your model version) ready to handle requests even when there are none to handle.
Today, the prediction service supports auto-scaling based on two metrics: CPU utilization and GPU duty cycle. Both metrics are measured by taking the average utilization of each model. You can specify the target value for each metric in the CreateVersion API (see the examples below). Once the actual metric deviates from the target for a certain amount of time, the node count adjusts up or down to match.
How to enable CPU auto-scaling in a new model
Below is an example of creating a version with auto-scaling based on a CPU metric. In this example, the CPU usage target is set to 60% with the minimum nodes set to 1 and maximum nodes set to 3. Once the real CPU usage exceeds 60%, the node count will increase (to a maximum of 3). Once the real CPU usage goes below 60% for a certain amount of time, the node count will decrease (to a minimum of 1). If no target value is set for a metric, it will be set to the default value of 60%.
REGION=us-central1
gcloud example:
gcloud beta ai-platform versions create v1 --model ${MODEL} --region ${REGION} \
--accelerator=count=1,type=nvidia-tesla-t4 \
--metric-targets cpu-usage=60 \
--min-nodes 1 --max-nodes 3 \
--runtime-version 2.3 --origin gs://<your model path> --machine-type n1-standard-4 --framework tensorflow
curl example:
curl -k -H Content-Type:application/json -H "Authorization: Bearer $(gcloud auth print-access-token)" https://$REGION-ml.googleapis.com/v1/projects/$PROJECT/models/${MODEL}/versions -d@./version.json
version.json
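The contents of version.json are not shown above; a minimal sketch of what it might look like for this configuration follows. The field names, in particular autoScaling.metrics and the CPU_USAGE metric name, are assumptions based on the gcloud flags above, so confirm them against the CreateVersion API reference before using them.
{
  "name": "v1",
  "deploymentUri": "gs://<your model path>",
  "machineType": "n1-standard-4",
  "framework": "TENSORFLOW",
  "runtimeVersion": "2.3",
  "acceleratorConfig": {
    "count": 1,
    "type": "NVIDIA_TESLA_T4"
  },
  "autoScaling": {
    "minNodes": 1,
    "maxNodes": 3,
    "metrics": [
      {
        "name": "CPU_USAGE",
        "target": 60
      }
    ]
  }
}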
Using GPUs
Today, the online prediction service supports GPU-based prediction, which can significantly accelerate inference. Previously, users needed to manually specify the number of GPUs for each model. This configuration had several limitations:
To give an accurate estimate of the number of GPUs needed, users had to know the maximum throughput a single GPU could handle on a given machine type.
The traffic pattern for models may change over time, so the original GPU number may not be optimal. For example, high traffic volume may cause resources to be exhausted, leading to timeouts and dropped requests, while low traffic volume may lead to idle resources and increased costs.
To address these limitations, the AI Platform Prediction service has introduced GPU-based auto-scaling.
Below is an example of creating a version with auto-scaling based on both GPU and CPU metrics. In this example, the CPU usage target is set to 50%, the GPU duty cycle target to 60%, minimum nodes to 1, and maximum nodes to 3. When the real CPU usage exceeds 50% or the GPU duty cycle exceeds 60% for a certain amount of time, the node count will increase (to a maximum of 3). When the real CPU usage stays below 50% and the GPU duty cycle stays below 60% for a certain amount of time, the node count will decrease (to a minimum of 1). If no target value is set for a metric, it will be set to the default value of 60%. acceleratorConfig.count is the number of GPUs per node.
REGION=us-central1
gcloud example:
gcloud beta ai-platform versions create v1 --model ${MODEL} --region ${REGION} \
--accelerator=count=1,type=nvidia-tesla-t4 \
--metric-targets cpu-usage=50 --metric-targets gpu-duty-cycle=60 \
--min-nodes 1 --max-nodes 3 \
--runtime-version 2.3 --origin gs://<your model path> --machine-type n1-standard-4 --framework tensorflow
curl example:
curl -k -H Content-Type:application/json -H "Authorization: Bearer $(gcloud auth print-access-token)" https://$REGION-ml.googleapis.com/v1/projects/$PROJECT/models/${MODEL}/versions -d@./version.json
version.json
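As in the CPU example, version.json is not shown above; a sketch for this GPU configuration might look like the following. The field names, including the GPU_DUTY_CYCLE metric name, are assumptions to verify against the CreateVersion API reference.
{
  "name": "v1",
  "deploymentUri": "gs://<your model path>",
  "machineType": "n1-standard-4",
  "framework": "TENSORFLOW",
  "runtimeVersion": "2.3",
  "acceleratorConfig": {
    "count": 1,
    "type": "NVIDIA_TESLA_T4"
  },
  "autoScaling": {
    "minNodes": 1,
    "maxNodes": 3,
    "metrics": [
      {
        "name": "CPU_USAGE",
        "target": 50
      },
      {
        "name": "GPU_DUTY_CYCLE",
        "target": 60
      }
    ]
  }
}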
Considerations when using automatic scaling
Automatic scaling for online prediction can help you serve varying rates of prediction requests while minimizing costs. However, it is not ideal for all situations. The service may not be able to bring nodes online fast enough to keep up with large spikes in request traffic. If you've configured the service to use GPUs, also keep in mind that provisioning new GPU nodes takes much longer than provisioning CPU nodes. If your traffic regularly has steep spikes, and if reliably low latency is important to your application, you may want to consider setting a lower utilization target so new nodes spin up earlier, setting minNodes to a sufficiently high value, or using manual scaling.
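As an illustration only, the same flags used in the earlier examples could be combined to scale up earlier and keep a larger floor of nodes for a spiky workload (the specific values below are hypothetical and should come from your own load tests):
gcloud beta ai-platform versions create v2 --model ${MODEL} --region ${REGION} \
  --metric-targets cpu-usage=40 \
  --min-nodes 5 --max-nodes 10 \
  --runtime-version 2.3 --origin gs://<your model path> --machine-type n1-standard-4 --framework tensorflow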
It is recommended to load test your model before putting it into production. Load testing can help you tune the minimum number of nodes and the target values to ensure your model can scale to your expected load. Note that the minimum number of nodes must be at least 2 for the model version to be covered by the AI Platform Training and Prediction SLA.
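As a very rough sketch of a command-line smoke test (a dedicated load-testing tool will give far more realistic results), you could replay the REST call shown earlier in parallel and watch how latency and node count respond. Here, instances.json is a hypothetical file containing a prediction request body with an "instances" list:
# Send 1000 prediction requests, 20 at a time, printing the HTTP status and latency of each
seq 1000 | xargs -n1 -P20 -I{} curl -s -o /dev/null -w "%{http_code} %{time_total}\n" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -d @./instances.json \
  https://$REGION-ml.googleapis.com/v1/projects/$PROJECT/models/${MODEL}/versions/v1:predict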
The AI Platform Prediction Service has default quotas enabled for service requests, such as the number of predictions within a given period, as well as CPU and GPU resource utilization. You can find more details on the specific limits in the documentation. If you need to update these limits, you can apply for a quota increase online or through your support channel.
Wrapping up
In this blog post, we've shown how the AI Platform Prediction service can simply and cost-effectively scale to match your workloads. You can now configure auto-scaling for GPUs to accelerate inference without overprovisioning.
If you'd like to try out the service yourself, we have a sample notebook which demonstrates how to deploy a model and configure auto-scaling settings. The AI Platform Prediction documentation also provides a thorough walkthrough of how to use the service and its configuration options.