Online predictions are synchronous requests made to a model endpoint. Use online predictions when you are making requests in response to application input or in situations that require timely inference.
You must deploy a model to an endpoint before that model can be used to serve online predictions. Deploying a model associates physical resources with the model so it can serve online predictions with low latency.
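As a rough illustration of what an online prediction request looks like once a model is deployed, the sketch below builds the URL and JSON body for the Vertex AI REST `endpoints.predict` call. The project, location, and endpoint IDs are placeholders, and the instance fields are invented for illustration; only the overall URL shape and the `{"instances": [...]}` body follow the public REST API.

```python
# Sketch of a Vertex AI online prediction request (REST form).
# All IDs below are placeholders, not real resources.

def build_predict_request(project, location, endpoint_id,
                          instances, parameters=None):
    """Return the (url, body) pair for an endpoints.predict call."""
    url = (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project}/locations/{location}/"
        f"endpoints/{endpoint_id}:predict"
    )
    body = {"instances": instances}
    if parameters is not None:
        body["parameters"] = parameters
    return url, body

# Hypothetical request against a deployed endpoint.
url, body = build_predict_request(
    "my-project", "us-central1", "1234567890",
    instances=[{"feature_a": 1.0, "feature_b": "x"}],
)
```

Because the request is synchronous, the response carries the predictions directly, which is what makes this path suitable for low-latency, per-request inference.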
You can deploy more than one model to an endpoint, and you can deploy a model to more than one endpoint. For more information about options and use cases for deploying models, see Considerations for deploying models.
To learn how to deploy an AutoML model, see the Get predictions from AutoML models section of this page and select the page that's relevant to your model.
To learn how to deploy a custom trained model, see Get predictions from a custom trained model.
Batch predictions are asynchronous requests. You request batch predictions directly from the model resource without needing to deploy the model to an endpoint. Use batch predictions when you don't require an immediate response and want to process accumulated data by using a single request.
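To make the contrast concrete, the sketch below assembles the JSON body for creating a batch prediction job via the REST `batchPredictionJobs` resource: the job references the model directly (no endpoint), reads accumulated instances from Cloud Storage, and writes results back asynchronously. The model ID, bucket URIs, and display name are placeholders; `jsonl` is one of the supported instance formats.

```python
# Sketch of a Vertex AI batch prediction job body (REST form).
# Model ID and gs:// paths are placeholders, not real resources.

def build_batch_prediction_job(project, location, model_id,
                               gcs_input_uri, gcs_output_prefix,
                               display_name="my-batch-job"):
    return {
        "displayName": display_name,
        # The job points at the model resource itself, not an endpoint.
        "model": f"projects/{project}/locations/{location}/models/{model_id}",
        "inputConfig": {
            "instancesFormat": "jsonl",
            "gcsSource": {"uris": [gcs_input_uri]},
        },
        "outputConfig": {
            "predictionsFormat": "jsonl",
            "gcsDestination": {"outputUriPrefix": gcs_output_prefix},
        },
    }

job = build_batch_prediction_job(
    "my-project", "us-central1", "987",
    "gs://my-bucket/input.jsonl", "gs://my-bucket/output/",
)
```

Submitting such a job returns immediately; you poll the job state and read the predictions from the output location when it completes.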
Get predictions from AutoML models
You can get online or batch predictions from AutoML models by using the Google Cloud console or the Vertex AI API. The instructions vary slightly depending on your data type and model objective:
Learn how to get predictions from the following types of image AutoML models:
Learn how to get predictions from the following types of tabular AutoML models:
Learn how to get predictions from the following types of text AutoML models:
Learn how to get predictions from the following types of video AutoML models:
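As one data-type-specific example of how the request body differs, AutoML image models expect each instance to carry base64-encoded image bytes in a `content` field. The sketch below builds such a body; the image bytes are fake placeholder data, and the `confidenceThreshold` and `maxPredictions` parameters shown are the ones used by image classification models.

```python
import base64

# Sketch of an online prediction body for an AutoML image model.
# The image bytes here are fake placeholder data.

def build_image_instance(image_bytes):
    """Wrap raw image bytes as a base64-encoded AutoML image instance."""
    return {"content": base64.b64encode(image_bytes).decode("utf-8")}

fake_image = b"\x89PNG-placeholder-bytes"
body = {
    "instances": [build_image_instance(fake_image)],
    "parameters": {"confidenceThreshold": 0.5, "maxPredictions": 5},
}
```

Other data types use different instance shapes (for example, tabular models take a dictionary of feature values), which is why the per-type pages linked above each document their own request format.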
Get predictions from custom trained models
The instructions for getting online and batch predictions from your custom trained model are the same, regardless of your data type or model objective.
For details, see Get predictions from a custom trained model.