This article explores common architectures on Google Cloud for providing predictions from machine learning models, as well as techniques for minimizing the prediction serving latency of ML systems. An ML model is useful only if it's deployed and ready to make predictions, but building a suitable ML serving system requires the following:
- Knowing whether you need to provide predictions in real time or offline.
- Balancing between model predictive performance and prediction latency.
- Managing the input features required by the model in a low read-latency lookup store.
- Knowing whether at least some of the predictions can be precomputed offline to be served online.
Prediction with machine learning
After you train, evaluate, and tune a machine learning (ML) model, the model is deployed to production to serve predictions. An ML model can provide predictions in two ways:
- Offline prediction. This is when your ML model is used in a batch scoring job for a large number of data points, where predictions are not required in real-time serving. In offline recommendations, for example, you only use historical information about customer-item interactions to make the prediction, without any need for online information. Offline recommendations are usually performed in retention campaigns for (inactive) customers with high propensity to churn, in promotion campaigns, and so on.
- Online prediction. This is when your ML system is used to serve real-time predictions, based on online requests from the operational systems and apps. In contrast to offline prediction, in online recommendations you need the current context of the customer who's using your application, along with historical information, to make the prediction. This context includes information such as datetime, page views, funnels, items viewed, items in the basket, items removed from the basket, customer device, and device location.
Offline (batch) prediction
In offline prediction, you don't invoke your model after you receive a data point. Instead, you collect the data points in your data lake or data warehouse, and the predictions are produced for all the data points at once, in a batch prediction job. The batch prediction job is part of a scheduled batch Extract, Transform, Load (ETL) process, which might be executed daily, weekly, or even monthly. Typical use cases for batch prediction include:
- Demand forecasting: estimating the demand for products by store on a daily basis for stock and intake optimization.
- Segmentation analysis: identifying your customer segments, emerging segments, and customers who are migrating across segments each month.
- Sentiment analysis and topic mining: identifying the overall sentiment regarding your products or services every week by analyzing customer feedback, and extracting trending topics in your market from social media.
Figure 1 shows a typical high-level architecture on Google Cloud for performing offline batch prediction.
Assume that your data source for scoring is in BigQuery as an enterprise data warehouse. In batch prediction:
- You export your data to Cloud Storage, for example by using the bq command-line tool. Alternatively, you can use Dataflow, which is suitable if you need to preprocess your data in a batch pipeline.
- You submit an AI Platform batch prediction job that uses your trained TensorFlow model in Cloud Storage to perform scoring on the preprocessed data.
- The outputs of the batch prediction job are imported from Cloud Storage into a data warehouse in BigQuery, or into a departmental data mart (for campaigns, inventory, and so on) hosted in a database such as Cloud SQL. This lets you make business decisions that are specific to each function.
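The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the AI Platform or BigQuery API: `export_rows`, `score_batch`, and `toy_model` are hypothetical stand-ins for the warehouse export, the batch prediction job, and the trained model.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, float]

def export_rows(warehouse: List[Row]) -> List[Row]:
    """Stand-in for exporting warehouse rows to files in object storage."""
    return [dict(row) for row in warehouse]

def score_batch(rows: Iterable[Row], model: Callable[[Row], float]) -> List[Row]:
    """Stand-in for a batch prediction job: score every row in one pass."""
    return [{**row, "prediction": model(row)} for row in rows]

def toy_model(row: Row) -> float:
    # Hypothetical model: forecast demand as a fixed uplift on last week's sales.
    return round(row["last_week_sales"] * 1.1, 2)

# Hypothetical warehouse contents.
warehouse = [
    {"store_id": 1, "last_week_sales": 100.0},
    {"store_id": 2, "last_week_sales": 250.0},
]

# Export, score everything at once, then hand the scored rows to the importer.
scored = score_batch(export_rows(warehouse), toy_model)
```

The point of the sketch is that the model is never invoked per request: all rows are scored together, and the scored rows are what gets loaded back into the warehouse or data mart.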
Online (real-time) prediction
In online prediction, the model usually receives a single data point from the caller, and is expected to provide a prediction for this data point in (near) real time. Typical use cases for online prediction include:
- Predictive maintenance: synchronously predicting whether a particular machine part will fail in the next N minutes, given the sensor's real-time data.
- Real-time bidding (RTB): synchronously recommending an ad and optimizing a bid when receiving a bid request. This information is then used to return an ad reference in real time.
- Predictive maintenance: asynchronously predicting whether a particular machine part will fail in the next N minutes, given the averages of the sensor's data in the past 30 minutes.
- Estimating asynchronously how long a food delivery will take based on the average delivery time in an area in the past 30 minutes, the ingredients, and real-time traffic information.
Real-time predictions can be delivered to the consumers (users, apps, systems, dashboards, and so on) in several ways:
- Synchronously. The request for prediction and the response (the prediction) are performed in sequence between the caller and the ML model service. That is, the caller waits until it receives the prediction from the ML service before performing the subsequent steps.
- Asynchronously. Predictions or their subsequent actions, based on streaming event data, are delivered to the consumer independently of the request for prediction. This includes:
- Push. The model generates predictions and pushes them to the caller or consumer as a notification. An example is fraud detection, when you want to notify other systems to take action when a potentially fraudulent transaction is identified.
- Poll. The model generates predictions and stores them in a low read-latency database. The caller or consumer periodically polls the database for available predictions. An example is targeted marketing, where the system checks the propensity scores predicted in real time for active customers in order to decide whether to send an email with a promotion or a retention offer.
Figure 2 shows a simple synchronous architecture on Google Cloud for online (real-time) prediction.
In real-time prediction:
- Your online application sends HTTP requests to your ML model, which is deployed as a microservice and which exposes a REST API for prediction. You can deploy your model as an HTTP endpoint using AI Platform.
- The caller application receives the response as soon as your model produces the prediction.
You might need to implement logic in your app backend as part of the system. For example, the app backend might need to perform preprocessing of the input data point before you invoke the ML model. Or the app backend might need to perform post-processing on the output prediction before sending the response back to the caller. This type of backend processing is an ML gateway, which acts as a wrapper for your ML model or models and your preprocessing and post-processing logic.
The architecture of an ML gateway uses managed services such as App Engine and AI Platform to take advantage of features such as autoscaling and load balancing through a secure endpoint. Based on your requirements, you can use other technologies such as Google Kubernetes Engine (GKE) or Kubeflow for additional customization.
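As a rough sketch of the ML gateway idea, the following Python wraps a model call with preprocessing and post-processing. The function names, the request fields (`basket_items`, `is_mobile`), and the scoring formula are all hypothetical; in a real system `predict` would call the deployed model's endpoint.

```python
from typing import Callable, Dict, List

def make_gateway(
    preprocess: Callable[[Dict], List[float]],
    predict: Callable[[List[float]], float],
    postprocess: Callable[[float], Dict],
) -> Callable[[Dict], Dict]:
    """Wrap a model call with preprocessing and post-processing steps."""
    def handle(request: Dict) -> Dict:
        features = preprocess(request)
        score = predict(features)
        return postprocess(score)
    return handle

# Hypothetical stand-ins for real logic and for the deployed model endpoint.
def preprocess(request: Dict) -> List[float]:
    return [request["basket_items"] / 10.0, float(request["is_mobile"])]

def predict(features: List[float]) -> float:
    return min(1.0, 0.5 * features[0] + 0.25 * features[1])

def postprocess(score: float) -> Dict:
    return {"score": score, "show_offer": score >= 0.5}

gateway = make_gateway(preprocess, predict, postprocess)
response = gateway({"basket_items": 8, "is_mobile": True})
```

Keeping the three steps as separate callables means that the preprocessing and post-processing logic can change without redeploying the model, and the reverse.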
Figure 3 shows another common architecture for online predictions that are made asynchronously using a messaging and stream processing pipeline. This type of architecture is useful when you need to compute some features dynamically before making a prediction.
Instead of calling the deployed model's REST API directly, the system's flow works like this:
- Your online application sends events to Pub/Sub, a fully managed real-time messaging service. These events are consumed in real time and processed by a Dataflow stream processing pipeline.
- The processing pipeline invokes the model for prediction and sends the predictions to another Pub/Sub topic.
- The Pub/Sub messages containing predictions, recommendations, and so on are then pushed back to your online application or consumed by a downstream system for monitoring or real-time decision making.
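The asynchronous flow above can be imitated locally with two in-memory queues. This is only a sketch of the message flow: the queues stand in for Pub/Sub topics, `run_pipeline` stands in for the Dataflow pipeline, and `toy_model` is a hypothetical model. None of these names come from a Google Cloud client library.

```python
import queue
from typing import Callable, Dict

events: "queue.Queue[Dict]" = queue.Queue()       # stand-in for the input topic
predictions: "queue.Queue[Dict]" = queue.Queue()  # stand-in for the output topic

def run_pipeline(model: Callable[[Dict], float]) -> None:
    """Drain the event queue, score each event, and publish the result."""
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            break
        predictions.put({"event_id": event["event_id"], "score": model(event)})

def toy_model(event: Dict) -> float:
    # Hypothetical model: flag large transactions as risky.
    return 1.0 if event["amount"] > 1000 else 0.0

# The application publishes events and does not wait for a response.
events.put({"event_id": "a", "amount": 50})
events.put({"event_id": "b", "amount": 5000})

run_pipeline(toy_model)

# A downstream consumer reads the predictions independently of the requests.
results = [predictions.get_nowait() for _ in range(predictions.qsize())]
```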
Making and serving predictions
For real-time use cases, minimizing the latency of serving predictions is important, because the expected action should happen immediately. You can usually improve serving latency at two levels:
- The model level, where you minimize the time your model takes to make a prediction when it's invoked with a data point. This includes building smaller models, as well as using accelerators for serving your models.
- The serving level, where you minimize the time your system takes to serve the prediction when it receives the request. This includes storing your input features in a low read-latency lookup data store, precomputing predictions in an offline batch-scoring job, and caching the predictions.
Optimizing the model for low-latency prediction might not by itself guarantee a reduction in the overall latency of your serving system when you deploy your model in production. At the serving level, you're usually faced with other challenges, which include being able to fetch input features quickly enough from your backend data stores, as well as computing and maintaining real-time values to be used as input features by the model. In addition, optimizing the model for faster prediction might not be relevant if you precompute these predictions and cache them for online serving.
The following sections provide more detail on these approaches.
Optimizing models for serving
To optimize the ML model for low-latency prediction, you can try the following:
- Using smaller model sizes by reducing the number of input features and/or reducing the model complexity. Examples include reducing hidden units in neural networks, levels in decision trees, and the number of trees in boosted trees.
- Removing unused, redundant, or irrelevant parts of the model for serving. This step is usually needed as you promote your model from training mode to prediction mode.
For example, consider the following characteristics of neural networks (NN):
- The more layers there are, and the more units per layer you have, the more capable the model is of capturing complex relationships in the data, and hence, the better its predictive effectiveness is. (Assuming that you address overfitting carefully.) However, the bigger the model, the more time it takes to produce a prediction.
- On the other hand, using a smaller NN model (fewer layers, units, and input features) reduces prediction latency. But the model might not reach the predictive effectiveness of a bigger one.
So there's a tradeoff between the model's predictive effectiveness and its prediction latency. And depending on the use case, you need to decide on two metrics:
- The model's optimizing metric, which reflects the model's predictive effectiveness, like accuracy, precision, mean square error, and so on. The better the value of this metric, the better the model.
- The model's satisficing metric, which reflects an operational constraint that the model needs to satisfy, such as prediction latency. You set a latency threshold to a particular value, such as 200 milliseconds, and any model that doesn't meet the threshold is not accepted. Another example of a satisficing metric is the size of the model, if you plan on deploying your model to low-spec hardware (like mobile and embedded devices).
After you've chosen your optimizing metric and set the thresholds for your satisficing metrics, you can start increasing your model's complexity to improve its predictive power until you hit the prediction latency threshold.
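One way to picture the two metrics is as a filter followed by a maximization. The sketch below assumes you have already measured accuracy and latency for a few candidate models; the candidate names and numbers are invented for illustration.

```python
from typing import List, NamedTuple, Optional

class Candidate(NamedTuple):
    name: str
    accuracy: float     # optimizing metric: higher is better
    latency_ms: float   # satisficing metric: must stay under the threshold

def select_model(candidates: List[Candidate], max_latency_ms: float) -> Optional[Candidate]:
    """Return the most accurate candidate that meets the latency threshold."""
    eligible = [c for c in candidates if c.latency_ms <= max_latency_ms]
    return max(eligible, key=lambda c: c.accuracy, default=None)

# Hypothetical measurements for three model sizes.
candidates = [
    Candidate("small", accuracy=0.86, latency_ms=40.0),
    Candidate("medium", accuracy=0.90, latency_ms=150.0),
    Candidate("large", accuracy=0.93, latency_ms=420.0),
]

best = select_model(candidates, max_latency_ms=200.0)
```

With a 200-millisecond threshold, the large model is rejected even though it is the most accurate, and the medium model is selected.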
Besides optimizing your model by balancing optimizing and satisficing metrics, you can use accelerators in the serving infrastructure to help improve the response time of the model:
- GPUs are optimized for parallel throughput. Several GPU types are available across Google Cloud, including in Compute Engine, GKE, and AI Platform. For an in-depth look, see Running TensorFlow inference workloads at scale with TensorRT 5 and NVIDIA T4 GPUs.
- Cloud TPUs, a Google-built technology, are optimized for machine learning workloads. TPUs come as pods that can be used by Google Cloud products such as Compute Engine and AI Platform. Typically, you use TPUs only when you have large deep learning models and large batch sizes.
In addition, optimizing the saved model before deploying it (for example, by stripping unused parts) can reduce prediction latency. If you're training a TensorFlow model, we recommend that you optimize the SavedModel using the Graph Transformation Tools. For more information, see the Optimizing TensorFlow Models for Serving blog post.
Managing input-features lookup
For an ML model to provide a prediction when given a data point, the data point must include all of the input features that the model expects. The expected features are the ones that are used to train the model. For example, if you train a model to estimate the price of a house given its size, location, age, number of rooms, and orientation, the trained model requires values for those features as inputs in order to provide an estimated price. However, in many use cases, the caller of your model does not provide these input feature values; instead, they're read in real time from a data store.
There are two types of input features that are fetched in real time to invoke the model for prediction:
- Static reference features. These feature values are static or slowly changing attributes of the entity for which a prediction is needed.
- Dynamic real-time features. These feature values are dynamically captured and computed based on real-time events.
In practice, online prediction use cases include a mix of user-supplied features, static reference features, and real-time computed features.
Handling static reference features
Static features include descriptive attributes, such as customer demographic information. They also include summary attributes, such as customer purchase behavior indicators, like recency, frequency, and purchase total. Static reference data like this is relevant for the following use cases:
- Predicting the propensity of the customer to respond to a given service promotion, based on the customer demographic and historical purchase information, in order to display an ad.
- Estimating the price of a house based on the location of the house, including schools, shopping, and transportation, plus mean prices in the area.
- Recommending similar products given the attributes of the products that a customer is currently viewing.
These types of features are usually available in a data warehouse or in a master data management (MDM) system. When the ML gateway receives a prediction request for a specific entity, it needs to fetch the features related to that entity and pass them as inputs to the model for online prediction.
Analytical data stores such as BigQuery are not engineered for low-latency singleton read operations, where the result is a single row with many columns. An example of a query like this is "Select 100 columns from several tables for a specific customer ID." Thus, these types of static reference features are collected, prepared, and stored in a NoSQL database that is optimized for singleton lookup operations, such as Datastore. (Choosing the right NoSQL database for feature lookup is discussed under Choosing a NoSQL database later in this document.)
Figure 4 shows a high-level architecture for storing and serving reference data as input features for model prediction.
In this architecture:
- Your application receives an entity identifier (like movie_id) for which a prediction is required.
- The application looks up the reference data for the entity from Datastore, which contains the attributes that are needed by the model.
- When the application receives the entity attributes, it invokes the deployed AI Platform model with values of the fetched attributes as input features.
- The received prediction from the model is post-processed and returned to the client application.
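The lookup-then-predict flow can be sketched as follows. The dictionary is a hypothetical in-memory stand-in for the lookup store, and the feature names, `FEATURE_ORDER`, and `toy_model` are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical in-memory stand-in for the lookup store, keyed by entity ID.
feature_store: Dict[str, Dict[str, float]] = {
    "movie_42": {"avg_rating": 4.0, "release_year": 2010.0},
}
FEATURE_ORDER = ["avg_rating", "release_year"]  # order the model expects

def predict_for_entity(
    entity_id: str, model: Callable[[List[float]], float]
) -> Optional[float]:
    """Fetch the entity's features, then invoke the model with them."""
    attributes = feature_store.get(entity_id)
    if attributes is None:
        return None  # unknown entity: the caller decides how to handle it
    features = [attributes[name] for name in FEATURE_ORDER]
    return model(features)

def toy_model(features: List[float]) -> float:
    # Hypothetical model standing in for the deployed endpoint.
    return features[0] / 5.0

score = predict_for_entity("movie_42", toy_model)
```

The caller supplies only the entity ID. Everything else the model needs comes from the single-key read, which is why the read latency of the store matters so much.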
We refer to these types of features as static because their values do not change in real time. Instead, the values are usually updated in a batch. As shown in Figure 4, you can implement a reference data preparation process using Dataflow. This process runs on a schedule to do the following:
- Read entity-related data from the data warehouse (BigQuery) and possibly from the data lake (Cloud Storage).
- Process this data by joining different feeds, selecting the required records, computing some metrics (such as recency, frequency, and intensity for each entity), and performing any feature engineering required by the model.
- Store the created features in Datastore to be used as a lookup in real time.
If the features you create for your business entities are relevant to many use cases and are useful to several ML models, you can treat such an entity-ID/features dictionary as an enterprise-wide, centralized, discoverable feature store. Your organization might approach this in the following way:
- Authoring the features. One team is responsible for creating, publishing, and maintaining a feature set for an entity in the feature store—for example, demographic information for customers.
- Discovering the features. Other teams consume these features from the feature store in their various ML models. To support discoverability, a central repository is used to describe the metadata that's related to the entities and their features.
For more details on building feature stores, see Feast: an open source feature store for machine learning.
Handling dynamic real-time features
Real-time features are computed on the fly, typically in an event-stream processing pipeline. The difference between real-time features and the batch approach illustrated in Figure 3 is that for real-time features, you need a list of aggregated values for a particular window (fixed, sliding, or session) in a certain period of time, and not an overall aggregation of values within that period of time. Dynamic real-time data is relevant in use cases like the following:
- Predicting whether an engine will fail in the next hour, given real-time sensor data such as maximum, minimum, and average temperature; pressure; and vibration for each minute in the last half-hour.
- Recommending the next news article to read based on the list of last N viewed articles by the user during the current session.
- Estimating how long food delivery will take based on the list of incoming orders, as well as related data such as how many outstanding orders per minute have been made in the past hour.
For use cases like these, you can use a Dataflow streaming pipeline. To create dynamic features, the pipeline captures and aggregates (sum, count, mean, max, last, and so on) the events in real time, and stores the aggregates in a low-latency read/write database. To produce predictions, the pipeline fetches the dynamically created features (a series of aggregate values) from the database, and uses them as input features to the model. Bigtable is a good option for a low-latency read/write database for feature values.
Figure 5 shows a high-level architecture of a stream processing pipeline.
In this architecture:
- Real-time events are consumed from Pub/Sub by the Dataflow streaming pipeline, using time-window aggregations.
- The real-time computed aggregates are accumulated and maintained in Bigtable.
- The values in Bigtable are used as input features to invoke AI Platform models for prediction.
- The produced predictions are stored in Datastore to be consumed by downstream systems, or published to Pub/Sub topics to be post-processed and pushed back to the caller as real-time notifications, ads, and so on.
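To make the idea of time-window aggregation concrete, here is a small sketch of fixed one-minute windows in plain Python. It is not Dataflow or Apache Beam code, and it ignores concerns such as late or out-of-order events; the readings are invented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

WINDOW_SECONDS = 60

def aggregate_by_window(
    events: List[Tuple[int, float]]
) -> Dict[int, Dict[str, float]]:
    """Group (timestamp_seconds, value) events into fixed one-minute windows."""
    windows: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in events:
        window_start = timestamp - (timestamp % WINDOW_SECONDS)
        windows[window_start].append(value)
    return {
        start: {"min": min(vals), "max": max(vals), "mean": sum(vals) / len(vals)}
        for start, vals in sorted(windows.items())
    }

# Hypothetical temperature readings: (seconds since start, degrees).
readings = [(5, 70.0), (30, 74.0), (65, 80.0), (110, 90.0), (119, 85.0)]

features = aggregate_by_window(readings)
```

The result is one set of aggregates per window rather than a single overall aggregate, which is the series of values that the model receives as input features.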
Precomputing and caching predictions
Another approach to improve online prediction latency is to precompute predictions in an offline batch scoring job, and store them in a low read-latency data store like Memorystore or Datastore for online serving. In these cases, the client (mobile, web, data pipeline worker, backend, or frontend API) doesn't call the model for online prediction. Instead, the client fetches the precomputed predictions from the data store, assuming that the prediction exists. The process works like this:
- Data is ingested, processed, and stored in a key-value store.
- A trained and deployed model runs a batch prediction job on the prepared data to produce predictions for the input data. Each prediction is identified by a key.
- A data pipeline exports the predictions referenced by a key to a low-latency data store that's optimized for singleton reads.
- A client sends a prediction request referenced by a unique key.
- The ML gateway reads from the data store using the entry key and returns the corresponding prediction.
- The client receives the prediction response.
Figure 6 shows this flow.
Prediction requests can use two categories of lookup keys:
Specific entity. The predictions relate to a single entity based on an ID, like a customer, a product, a location, or a device. Use cases for specific-entity predictions include:
- Preparing promotions, recommendations, or offers for a unique user ID at the beginning of the user's session.
- Finding items (movies, articles, songs, and so on) that are similar to popular ones in order to produce related-item recommendations.
Specific combination of input features. The predictions are for entities that are unique in time, and that cannot be based on a single entity ID. Instead, a combination of feature values defines the context for an entity. The key that represents that context is the unique combination of those feature values. Predictions are usually computed for frequent feature-value combinations. Use cases for specific feature combinations include:
- Maximizing a bid price in real-time bidding for an incoming bid request. The bid request has a unique ID in time, but the request context (which might include a combination of website category, user taxonomy, and ad performance) can reoccur.
- Predicting whether an anonymous or a new customer might buy something on your website based on a combination of location, number of products in the cart, and UTM data.
Handling predictions by entities
When you precompute predictions for specific entities, you might face the following situations:
You have a manageable number of entities (low cardinality), which makes precomputing predictions for all entities possible. In that case, you probably want to precompute predictions for all of the entities and save them in a key-value store. An example is forecasting daily sales by store when you have hundreds or just a few thousand stores.
You have too many entities (high cardinality), which makes it challenging to precompute predictions in a limited amount of time. An example is forecasting daily sales by item when you have hundreds of thousands or millions of items. In that case, you can use a hybrid approach, where you precompute predictions for the top N entities, such as for the most active customers or the most viewed products. You can then use the model directly for online prediction for the rest of the long-tail entities.
Figure 7 illustrates an architecture for this flow.
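The hybrid approach amounts to a lookup with a fallback. In this sketch, the dictionary stands in for the low-latency store that holds predictions for the top N entities, and `online_model` is a hypothetical stand-in for a call to the deployed model.

```python
from typing import Callable, Dict, Tuple

def make_hybrid_predictor(
    precomputed: Dict[str, float],
    online_model: Callable[[str], float],
) -> Callable[[str], Tuple[float, str]]:
    """Serve from the precomputed store when possible, else call the model."""
    def predict(entity_id: str) -> Tuple[float, str]:
        if entity_id in precomputed:
            return precomputed[entity_id], "precomputed"
        return online_model(entity_id), "online"
    return predict

# Hypothetical store holding predictions for the top-N entities only.
precomputed = {"item_1": 120.0, "item_2": 95.0}

def online_model(entity_id: str) -> float:
    # Hypothetical stand-in for a call to the deployed model.
    return 10.0

predict = make_hybrid_predictor(precomputed, online_model)
```

Returning the source alongside the value is optional, but it makes it easy to monitor what share of traffic is served from the precomputed store and to adjust N accordingly.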
In cases where you have many entities (such as millions of songs) and the ML task can be tackled as similarity matching (recommending similar songs), you can do the following:
- Precompute the representation embeddings of the items in an offline ML process.
- Group similar items, using a hierarchical clustering method, to create an index that's stored in a low read-latency data store.
- Retrieve the items most similar to a given one (for example, the item currently being viewed) using the index.
- Update the index in a regular offline process with new items and item-user interaction data.
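The retrieval step can be illustrated with cosine similarity over precomputed embeddings. Note the simplification: this sketch scans every item instead of using the clustering-based index described above, so it only shows what "most similar" means and would not scale to millions of items. The embeddings are invented.

```python
import math
from typing import Dict, List

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def most_similar(item_id: str, embeddings: Dict[str, List[float]], k: int) -> List[str]:
    """Return the k items whose embeddings are closest to the given item."""
    query = embeddings[item_id]
    others = [other for other in embeddings if other != item_id]
    others.sort(key=lambda other: cosine(query, embeddings[other]), reverse=True)
    return others[:k]

# Hypothetical precomputed two-dimensional embeddings.
embeddings = {
    "song_a": [1.0, 0.0],
    "song_b": [0.9, 0.1],
    "song_c": [0.0, 1.0],
    "song_d": [0.7, 0.7],
}

similar = most_similar("song_a", embeddings, k=2)
```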
Handling predictions for combinations of feature values
If you decide to precompute predictions not for a specific entity ID but for a combination of input feature values, you need the following:
- Key. A hashed combination of all possible input features. Combinations are not permutations, which means that you need to consistently build the key by using the features in the same order.
- Value. The prediction.
A simple example is to predict whether an anonymous or a new customer will buy something on your website. At a high level, you can start with features like those listed in the following table.
| Feature name | Description | Example cardinality |
| | Can vary between 5 and 9 | 6 |
| | yes or no | 2 |
| | Custom to your store inventory | 10 |
You then do the following:
- Decide the order of features when building your key string. For example, europe_yes_female_sport is the same prediction as europe_female_sport_yes, but the two keys result in a different hash value.
- Compute predictions offline for all 120 possible combinations (6 x 2 x 10 in the example).
Store the predictions in a low read-latency database using the following:
- The key, which could be something like
- The value of the prediction, such as
Figure 8 shows this flow.
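The key-building and precomputation steps can be sketched as follows. The feature names and values are hypothetical placeholders chosen only to match the example cardinalities (6, 2, and 10). The use of SHA-256 is an assumption, and `toy_model` stands in for the batch scoring job.

```python
import hashlib
from itertools import product
from typing import Dict, List

# Hypothetical feature names and values; cardinalities match the example (6, 2, 10).
FEATURE_VALUES: Dict[str, List[str]] = {
    "region": [f"region_{i}" for i in range(6)],
    "has_cart_items": ["yes", "no"],
    "category": [f"category_{i}" for i in range(10)],
}
KEY_ORDER = ["region", "has_cart_items", "category"]  # fixed order for every key

def build_key(features: Dict[str, str]) -> str:
    """Join feature values in a fixed order and hash the result."""
    raw = "_".join(features[name] for name in KEY_ORDER)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def toy_model(features: Dict[str, str]) -> float:
    # Hypothetical model standing in for the batch scoring job.
    return 0.75 if features["has_cart_items"] == "yes" else 0.25

# Precompute a prediction for every combination: 6 x 2 x 10 = 120 records.
store: Dict[str, float] = {}
for values in product(*(FEATURE_VALUES[name] for name in KEY_ORDER)):
    features = dict(zip(KEY_ORDER, values))
    store[build_key(features)] = toy_model(features)
```

Because `build_key` always reads the features in `KEY_ORDER`, a request that supplies the same values in a different order still resolves to the same record.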
This data store design would be tall (many records) and narrow (few columns). Although precomputing can bring latency advantages, you must also take the following into consideration:
- Creating all possible combinations of features can lead to millions or billions of records, depending on your data. Assuming that all features make sense as is and you need all combinations, storing them requires a scalable data store that can still serve single records with minimum latency.
- Continuous values such as average_price cannot be used as part of the possible combinations, because the possibilities could be infinite. Thus, you need to bucketize them appropriately, keeping in mind that it would impact your model effectiveness. The bucket size can be a hyperparameter.
- Some categorical features such as postal_codes can have a high cardinality, and if they're combined can create billions of possible combinations. You must decide whether each combination is relevant, or whether you could decrease the cardinality by using lower postcode resolution or using hash bucketing, without losing too much of your model effectiveness.
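Bucketizing a continuous value can be as simple as a binary search over sorted boundaries. The boundaries below are invented for illustration; in practice you would tune them, since the bucket size affects model effectiveness.

```python
from bisect import bisect_right
from typing import List

def bucketize(value: float, boundaries: List[float]) -> int:
    """Return the index of the bucket that contains the value.

    Bucket i covers boundaries[i-1] <= value < boundaries[i]. Values below
    the first boundary fall in bucket 0, and values at or above the last
    boundary fall in the final bucket.
    """
    return bisect_right(boundaries, value)

# Hypothetical boundaries for average_price; the bucket count is tunable.
PRICE_BOUNDARIES = [10.0, 50.0, 200.0]

buckets = [bucketize(p, PRICE_BOUNDARIES) for p in (5.0, 10.0, 49.99, 1000.0)]
```

Three boundaries produce four buckets, so this feature contributes a factor of 4 to the number of combinations instead of an unbounded number of values.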
Choosing a NoSQL database
Google Cloud provides several data stores to handle your combination of latency, load, throughput, and size requirements.
Figure 9 shows options for managed data stores in Google Cloud. Feature lookups as well as precomputed predictions should use a store that's optimized for singleton reads or to read a limited number of records in milliseconds.
Due to their low read-latency capabilities, the most relevant managed data stores for strategies described in this article are the following:
Memorystore. Memorystore is a managed in-memory database. When you use its Redis offering, you can store intermediate data for submillisecond read access. Keys are binary-safe strings, and values can be of different data structures. Typical use cases for Memorystore are:
- User-feature lookup in real-time bidding that requires submillisecond retrieval time.
- Media and gaming applications that use precomputed predictions.
- Storing intermediate data for a real-time data pipeline for creating input features.
Datastore. Datastore is a fully-managed, scalable NoSQL document database built for automatic scaling, high performance, and ease of application development. Data objects in Datastore are known as entities. An entity has one or more named properties, in which you store the feature values required by your model or models. A typical use case for Datastore is a product recommendation system in an e-commerce site that's based on information about logged-in users.
Bigtable. Bigtable is a massively scalable NoSQL database service engineered for high throughput and for low-latency workloads. It can handle petabytes of data, with millions of reads and writes per second at a latency that's on the order of milliseconds. The data is structured as a sorted key-value map. Bigtable scales linearly with the number of nodes. For more details, see Understanding Bigtable performance. Typical use cases for Bigtable are:
- Fraud detection that leverages dynamically aggregated values. Applications in Fintech and Adtech are usually subject to heavy reads and writes.
- Ad prediction that leverages dynamically aggregated values over all ad requests and historical data.
- Booking recommendation based on the overall customer base's recent bookings.
In most of those cases, data in those stores has:
- A unique record key that refers to a user ID, machine ID, or other unique string that represents the item that the system needs to provide a prediction for.
- A feature value that can represent several data points, such as maximum temperature and average pressure in the last minute in an IoT context, or the count of bookings and page views in the last 10 minutes for a travel booking context. How the value is stored depends on the data store you choose.
To summarize, if you need:
- Submillisecond retrieval latency on a limited amount of quickly changing data, retrieved by a few thousand clients, use Memorystore.
- Millisecond retrieval latency on slowly changing data where storage scales automatically, use Datastore.
- Millisecond retrieval latency on dynamically changing data, using a store that can scale linearly with heavy reads and writes, use Bigtable.
What's next
- Take the 5-course Coursera specialization on Machine Learning with TensorFlow on Google Cloud.
- Learn about best practices for ML engineering in Rules of ML.
- Explore reference architectures, diagrams, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.