Online versus batch prediction

AI Platform Prediction provides two ways to get predictions from trained models: online prediction (sometimes called HTTP prediction) and batch prediction. In both cases, you pass input data to a cloud-hosted machine-learning model and get inferences for each data instance. The differences are shown in the following table:

| Online prediction | Batch prediction |
| --- | --- |
| Optimized to minimize the latency of serving predictions. | Optimized to handle a high volume of instances in a job and to run more complex models. |
| Can process one or more instances per request. | Can process one or more instances per request. |
| Predictions returned in the response message. | Predictions written to output files in a Cloud Storage location that you specify. |
| Input data passed directly as a JSON string. | Input data passed indirectly as one or more URIs of files in Cloud Storage locations. |
| Returns as soon as possible. | Asynchronous request. |
| Accounts with the following IAM roles can request online predictions: | Accounts with the following IAM roles can request batch predictions: |
| Runs on the runtime version and in the region selected when you deploy the model. | Can run in any available region, using runtime version 2.1 or earlier, though you should generally run with the defaults for deployed model versions. |
| Runs models deployed to AI Platform Prediction. | Runs models deployed to AI Platform Prediction or models stored in accessible Cloud Storage locations. |
| Configurable to use various types of virtual machines for prediction nodes. | If running a model deployed to AI Platform Prediction, must use the mls1-c1-m2 machine type. |
| Can serve predictions from a TensorFlow SavedModel or a custom prediction routine (beta), as well as scikit-learn and XGBoost models. | Can serve predictions from a TensorFlow SavedModel. |
| $0.045147 to $0.151962 per node hour (Americas). Price depends on machine type selection. | $0.0791205 per node hour (Americas). |
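
For concreteness, the following is a minimal sketch of an online prediction request using the Google API Client Library for Python. The project, model, and instance values are placeholders; substitute your own.

```python
import googleapiclient.discovery

def predict_json(project, model, instances, version=None):
    """Send an online prediction request and return the predictions."""
    service = googleapiclient.discovery.build('ml', 'v1')
    name = 'projects/{}/models/{}'.format(project, model)
    if version is not None:
        name += '/versions/{}'.format(version)

    # Input data is passed directly in the request body as JSON;
    # predictions come back in the response message.
    response = service.projects().predict(
        name=name,
        body={'instances': instances}
    ).execute()

    if 'error' in response:
        raise RuntimeError(response['error'])
    return response['predictions']

# Example call with placeholder values:
# predict_json('my-project', 'my_model', [{'input': [1.0, 2.0, 3.0]}])
```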

The needs of your application dictate the type of prediction you should use.

  • You should generally use online prediction when you are making requests in response to application input or in other situations where timely inference is needed.

  • Batch prediction is ideal for processing accumulated data when you don't need immediate results: for example, a periodic job that gets predictions for all of the data collected since the last job. A job-submission sketch follows this list.
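
As a sketch of the batch flow, the job below reads newline-delimited JSON instances from Cloud Storage and writes predictions back to Cloud Storage. The bucket, project, model, and job names are hypothetical, and the exact predictionInput fields should be checked against the jobs API reference.

```python
import googleapiclient.discovery

ml = googleapiclient.discovery.build('ml', 'v1')

# All names below are placeholders for your own project, model, and bucket.
job_body = {
    'jobId': 'my_batch_prediction_001',
    'predictionInput': {
        'dataFormat': 'TEXT',  # newline-delimited JSON instances
        'inputPaths': ['gs://my-bucket/inputs/instances*.json'],
        'outputPath': 'gs://my-bucket/predictions/',
        'region': 'us-east1',
        'modelName': 'projects/my-project/models/my_model',
    },
}

# Submitting the job returns immediately; the job itself runs asynchronously.
response = ml.projects().jobs().create(
    parent='projects/my-project',
    body=job_body,
).execute()
print(response['state'])  # typically QUEUED right after submission
```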

You should also factor potential differences in prediction costs into your decision.

Batch prediction latency

If you use a simple model and a small set of input instances, you'll find a considerable difference in how long identical prediction requests take to complete with online versus batch prediction. A batch job might take several minutes to complete predictions that an online request returns almost instantly. This is a side effect of the different infrastructure that the two methods use: AI Platform Prediction allocates and initializes resources for a batch prediction job when you send the request, whereas online prediction is typically ready to process requests at the time they arrive.
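
Because a batch job is asynchronous, you typically poll its state (or wait for a notification) rather than block on the request. A minimal polling sketch, again with placeholder names:

```python
import time
import googleapiclient.discovery

ml = googleapiclient.discovery.build('ml', 'v1')
job_name = 'projects/my-project/jobs/my_batch_prediction_001'  # placeholder

# Poll until the job reaches a terminal state.
while True:
    job = ml.projects().jobs().get(name=job_name).execute()
    if job['state'] in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(60)  # batch jobs can take minutes even for small inputs

print(job['state'])
# On success, prediction files appear under the job's outputPath.
```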

What's next

Read the prediction overview for more information about serving predictions.

Or, skip to making online predictions or making batch predictions.