Considerations for deploying models

This page describes the deployment process and some common deployment scenarios with their associated use cases.

What happens when you deploy a model

When you deploy a model to an endpoint, you associate physical (machine) resources with that model so that it can serve online predictions. Online predictions have low-latency requirements; provisioning resources for the model in advance reduces that latency.

The model's training type (AutoML or custom-trained) and, for AutoML models, its data type determine the kinds of physical resources available to it. After you choose the resources for a model deployment, you cannot update them; to change them, you must create a new deployment.
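
As a rough sketch, deploying a custom-trained model with the Vertex AI SDK for Python (google-cloud-aiplatform) might look like the following; the project, location, model ID, and machine settings are placeholders:

from google.cloud import aiplatform

# Placeholder project and location; substitute your own.
aiplatform.init(project="my-project", location="us-central1")

# Placeholder ID of an already-uploaded custom-trained model.
model = aiplatform.Model("MODEL_ID")
endpoint = aiplatform.Endpoint.create(display_name="my-endpoint")

# The machine resources chosen here cannot be updated later;
# changing them requires creating a new deployment.
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name="my-model",
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=2,
)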

The endpoint resource provides the service endpoint (URL) you use to request the prediction. For example:

https://us-central1-aiplatform.googleapis.com/v1/projects/{project}/locations/{location}/endpoints/{endpoint}:predict
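
For example, a minimal online prediction request with the Vertex AI SDK for Python might look like the following; the endpoint ID and the instance format are placeholders and depend on your model:

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Placeholder endpoint ID; the SDK builds the service URL shown above for you.
endpoint = aiplatform.Endpoint("ENDPOINT_ID")

# The shape of each instance depends on the input the model expects.
response = endpoint.predict(instances=[{"feature_a": 1.0, "feature_b": "x"}])
print(response.predictions)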

Reasons to deploy more than one model to the same endpoint

Deploying two models to the same endpoint lets you gradually replace one model with the other. For example, suppose you are using a model and find a way to increase its accuracy with new training data. You don't want to update your application to point to a new endpoint URL, and you don't want to introduce a sudden change in behavior. You can add the new model to the same endpoint so that it serves a small percentage of traffic, and then gradually increase the new model's share of the traffic split until it serves 100% of the traffic, as sketched below.
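
As a rough sketch with the Vertex AI SDK for Python, you might deploy the new model to the existing endpoint with a small traffic percentage; the IDs and machine type are placeholders:

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

endpoint = aiplatform.Endpoint("ENDPOINT_ID")   # endpoint already serving the old model
new_model = aiplatform.Model("NEW_MODEL_ID")    # retrained model

# Send 10% of traffic to the new model; the remaining 90% stays with the
# previously deployed model. Increase the split as confidence grows.
new_model.deploy(
    endpoint=endpoint,
    deployed_model_display_name="my-model-v2",
    machine_type="n1-standard-4",
    traffic_percentage=10,
)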

Because the resources are associated with the model rather than with the endpoint, you can deploy models of different types to the same endpoint. However, the best practice is to deploy models of only one type (AutoML text, AutoML tabular, custom-trained, and so on) to a given endpoint, because that configuration is easier to manage.

Reasons to deploy a model to more than one endpoint

You might want to deploy your models with different resources for different application environments, such as testing and production. You might also want to support different SLOs for your prediction requests. Perhaps one of your applications has much higher performance needs than the others; in that case, you can deploy the model to a higher-performance endpoint with more machine resources. To optimize costs for your other applications, deploy the same model to a lower-performance endpoint with fewer machine resources.
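
To illustrate, a sketch with the Vertex AI SDK for Python might deploy the same model to two endpoints with different machine resources; the model ID, machine types, and replica counts are placeholders:

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model("MODEL_ID")

# Production endpoint: larger machines and more replicas for stricter SLOs.
prod_endpoint = aiplatform.Endpoint.create(display_name="prod-endpoint")
model.deploy(
    endpoint=prod_endpoint,
    machine_type="n1-standard-8",
    min_replica_count=2,
    max_replica_count=10,
)

# Test endpoint: a single small machine to keep costs down.
test_endpoint = aiplatform.Endpoint.create(display_name="test-endpoint")
model.deploy(
    endpoint=test_endpoint,
    machine_type="n1-standard-2",
    min_replica_count=1,
    max_replica_count=1,
)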