Get predictions from a custom trained model

A prediction is the output of a trained machine learning model. This page provides an overview of the workflow for getting predictions from your models on Vertex AI.

Vertex AI offers two methods for getting predictions:

  • Online predictions are synchronous requests made to a model endpoint. Before sending a request, you must first deploy the model resource to an endpoint. This associates compute resources with the model so that it can serve online predictions with low latency. Use online predictions when you are making requests in response to application input or in situations that require timely inference.

  • Batch predictions are asynchronous requests. You request a BatchPredictionJob directly from the model resource without needing to deploy the model to an endpoint. Use batch predictions when you don't require an immediate response and want to process accumulated data by using a single request. Both approaches are sketched in the example that follows this list.
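The sketch below contrasts the two approaches using the Vertex AI SDK for Python. It is a minimal example, not a production setup: the project ID, region, model resource name, Cloud Storage paths, machine type, and instance payload are all placeholders, and the exact request format depends on your model.

```python
from google.cloud import aiplatform

# Placeholder project and region.
aiplatform.init(project="your-project-id", location="us-central1")

# Placeholder model resource name (a numeric model ID also works).
model = aiplatform.Model(
    "projects/your-project-id/locations/us-central1/models/1234567890"
)

# Online predictions: deploy the model to an endpoint first, then send
# synchronous requests to it. The instance shape depends on your model.
endpoint = model.deploy(machine_type="n1-standard-4")
response = endpoint.predict(instances=[{"feature_a": 1.0, "feature_b": 2.0}])
print(response.predictions)

# Batch predictions: request a BatchPredictionJob directly from the model
# resource; no endpoint is needed, and results are written asynchronously
# to the destination you specify.
batch_job = model.batch_predict(
    job_display_name="example-batch-job",
    gcs_source="gs://your-bucket/input.jsonl",
    gcs_destination_prefix="gs://your-bucket/output/",
    machine_type="n1-standard-4",
)
batch_job.wait()
```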

Test your model locally

Before getting predictions, it's useful to deploy your model to a local endpoint during the development and testing phase. This lets you both iterate more quickly and test your model without deploying it to an online endpoint or incurring prediction costs. Local deployment is intended for local development and testing, not for production deployments.

To deploy a model locally, use the Vertex AI SDK for Python and deploy a LocalModel to a LocalEndpoint. For a demonstration, see this notebook.
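Here is a minimal sketch of that flow, assuming a custom serving container image; the image URI, model artifact location, and request file are placeholders, and the exact arguments depend on how your serving container is built.

```python
from google.cloud.aiplatform.prediction import LocalModel

# Placeholder serving container image.
local_model = LocalModel(
    serving_container_image_uri="us-docker.pkg.dev/your-project/your-repo/your-image",
)

# deploy_to_local_endpoint() runs the serving container on your machine.
# Used as a context manager, it tears the container down when the block exits.
with local_model.deploy_to_local_endpoint(
    artifact_uri="gs://your-bucket/model-artifacts/",
) as local_endpoint:
    # Check that the container reports healthy before sending requests.
    health = local_endpoint.run_health_check()

    # Send a test request; instances.json is a placeholder request body.
    prediction = local_endpoint.predict(
        request_file="instances.json",
        headers={"Content-Type": "application/json"},
    )
    print(prediction.content)
```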

Even if your client is not written in Python, you can still use the Vertex AI SDK for Python to launch the container and server so that you can test requests from your client.
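For example, the sketch below assumes the host_port argument of deploy_to_local_endpoint so that the container is reachable on a known local port; while the block is running, a client written in any language can send HTTP requests to it. The port, image URI, artifact location, and predict route are placeholders.

```python
from google.cloud.aiplatform.prediction import LocalModel

local_model = LocalModel(
    serving_container_image_uri="us-docker.pkg.dev/your-project/your-repo/your-image",
)

with local_model.deploy_to_local_endpoint(
    artifact_uri="gs://your-bucket/model-artifacts/",
    host_port=8080,  # Assumed port so external clients know where to connect.
) as local_endpoint:
    # While this block is running, a client in any language can send requests
    # to the container, for example (adjust the path to your container's
    # predict route):
    #   curl -X POST -H "Content-Type: application/json" \
    #     -d @instances.json http://localhost:8080/predict
    input("Serving locally on port 8080; press Enter to shut down...")
```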

Get predictions from custom trained models

To get predictions, you must first import your model. After it's imported, it becomes a model resource that is visible in Vertex AI Model Registry.
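A minimal sketch of importing a model with the Vertex AI SDK for Python follows; the display name, artifact location, and serving container image shown here are placeholders.

```python
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

# Model.upload() imports the model. The returned Model resource appears in
# Vertex AI Model Registry and can then be deployed to an endpoint or used
# for batch prediction jobs.
model = aiplatform.Model.upload(
    display_name="my-custom-model",
    artifact_uri="gs://your-bucket/model-artifacts/",
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"
    ),
)
print(model.resource_name)
```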

Then, read the documentation for getting online predictions and getting batch predictions to learn how to request them.

What's next