You must deploy a model to an endpoint before that model can be used to serve online predictions. Deploying a model associates physical resources with the model so it can serve online predictions with low latency.
To be deployable, the model must be visible in Vertex AI Model Registry. For information about Model Registry, including how to import model artifacts or create them directly in Model Registry, see Introduction to Vertex AI Model Registry.
You can deploy multiple models to an endpoint, or you can deploy the same model to multiple endpoints. For more information about options and use cases for deploying models, see Reasons to deploy more than one model to the same endpoint.
Deploy a model to an endpoint
Use one of the following methods to deploy a model:
Google Cloud console
In the Google Cloud console, in the Vertex AI section, go to the Models page.
Click the name and version ID of the model you want to deploy to open its details page.
Select the Deploy & Test tab.
If your model is already deployed to any endpoints, they are listed in the Deploy your model section.
Click Deploy to endpoint.
To deploy your model to a new endpoint, select Create new endpoint, and provide a name for the new endpoint. To deploy your model to an existing endpoint, select Add to existing endpoint, and select the endpoint from the drop-down list. You can deploy multiple models to an endpoint, or you can deploy the same model to multiple endpoints.
If you deploy your model to an existing endpoint that has one or more models deployed to it, you must update the Traffic split percentage for the model you are deploying and the already deployed models so that all of the percentages add up to 100%.
If you're deploying your model to a new endpoint, accept 100 for the Traffic split. Otherwise, adjust the traffic split values for all models on the endpoint so they add up to 100.
Enter the Minimum number of compute nodes you want to provide for your model.
This is the number of nodes that need to be available to the model at all times.
You are charged for all nodes in use, whether they are handling prediction load or standing by as minimum nodes, even when there is no prediction traffic. See the pricing page.
The number of compute nodes can increase if needed to handle prediction traffic, but it will never go higher than the maximum number of nodes.
To use autoscaling, enter the Maximum number of compute nodes you want Vertex AI to scale up to.
Select your Machine type.
Larger machine resources increase your prediction performance and increase costs. Compare the available machine types.
Select an Accelerator type and an Accelerator count.
This option displays only if you enabled accelerator use when you imported or created the model.
For the accelerator count, refer to the GPU table to check for valid numbers of GPUs that you can use with each CPU machine type. The accelerator count refers to the number of accelerators per node, not the total number of accelerators in your deployment.
If you want to use a custom service account for the deployment, select a service account in the Service account drop-down box.
Learn how to change the default settings for prediction logging.
Click Done for your model, and when all the Traffic split percentages are correct, click Continue.
The region where your model deploys is displayed. This must be the region where you created your model.
Click Deploy to deploy your model to the endpoint.
API
When you deploy a model using the Vertex AI API, you complete the following steps:
Create an endpoint
If you are deploying a model to an existing endpoint, you can skip this step and go to Get the endpoint ID. To try the dedicated endpoint Preview, skip to Create a dedicated endpoint.
gcloud
The following example uses the gcloud ai endpoints create command:
gcloud ai endpoints create \
--region=LOCATION_ID \
--display-name=ENDPOINT_NAME
Replace the following:
- LOCATION_ID: The region where you are using Vertex AI.
- ENDPOINT_NAME: The display name for the endpoint.
The Google Cloud CLI tool might take a few seconds to create the endpoint.
REST
Before using any of the request data, make the following replacements:
- LOCATION_ID: Your region.
- PROJECT_ID: Your project ID.
- ENDPOINT_NAME: The display name for the endpoint.
HTTP method and URL:
POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints
Request JSON body:
{ "display_name": "ENDPOINT_NAME" }
You should receive a JSON response similar to the following:
{ "name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/endpoints/ENDPOINT_ID/operations/OPERATION_ID", "metadata": { "@type": "type.googleapis.com/google.cloud.aiplatform.v1.CreateEndpointOperationMetadata", "genericMetadata": { "createTime": "2020-11-05T17:45:42.812656Z", "updateTime": "2020-11-05T17:45:42.812656Z" } } }
"done":
true
.
Terraform
The following sample uses the google_vertex_ai_endpoint Terraform resource to create an endpoint.
To learn how to apply or remove a Terraform configuration, see Basic Terraform commands.
Java
Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Node.js
Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
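For reference, a minimal sketch with the Vertex AI SDK for Python might look like the following; the project, region, and endpoint name are placeholder values:

from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="LOCATION_ID")

# Create the endpoint; the call returns once the endpoint resource exists.
endpoint = aiplatform.Endpoint.create(display_name="ENDPOINT_NAME")
print(endpoint.resource_name)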
Create a dedicated endpoint
If you are deploying a model to an existing endpoint, you can skip this step.
A dedicated endpoint is a faster, more stable endpoint with support for larger payload sizes and longer request timeouts.
To use a dedicated endpoint during Preview, you need to enable it explicitly.
REST
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-d '{"display_name": "ENDPOINT_NAME", "dedicatedEndpointEnabled": true}' \
https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints
Replace the following:
- ENDPOINT_NAME: The display name for the endpoint.
- LOCATION_ID: The region where you are using Vertex AI.
- PROJECT_ID: The project ID for your Google Cloud project.
Python
endpoint = aiplatform.Endpoint.create(
display_name="ENDPOINT_NAME",
dedicated_endpoint_enabled=True,
)
Replace the following:
- ENDPOINT_NAME: The display name for the endpoint.
Get the endpoint ID
You need the endpoint ID to deploy the model.
gcloud
The following example uses the gcloud ai endpoints list command:
gcloud ai endpoints list \
--region=LOCATION_ID \
--filter=display_name=ENDPOINT_NAME
Replace the following:
- LOCATION_ID: The region where you are using Vertex AI.
- ENDPOINT_NAME: The display name for the endpoint.
Note the number that appears in the ENDPOINT_ID column. Use this ID in the following step.
REST
Before using any of the request data, make the following replacements:
- LOCATION_ID: The region where you are using Vertex AI.
- PROJECT_ID: Your project ID.
- ENDPOINT_NAME: The display name for the endpoint.
HTTP method and URL:
GET https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints?filter=display_name=ENDPOINT_NAME
You should receive a JSON response similar to the following:
{ "endpoints": [ { "name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/endpoints/ENDPOINT_ID", "displayName": "ENDPOINT_NAME", "etag": "AMEw9yPz5pf4PwBHbRWOGh0PcAxUdjbdX2Jm3QO_amguy3DbZGP5Oi_YUKRywIE-BtLx", "createTime": "2020-04-17T18:31:11.585169Z", "updateTime": "2020-04-17T18:35:08.568959Z" } ] }
Deploy the model
Select the tab below for your language or environment:
gcloud
The following examples use the gcloud ai endpoints deploy-model command.
The following example deploys a Model to an Endpoint without using GPUs to accelerate prediction serving and without splitting traffic between multiple DeployedModel resources:
Before using any of the command data below, make the following replacements:
- ENDPOINT_ID: The ID for the endpoint.
- LOCATION_ID: The region where you are using Vertex AI.
- MODEL_ID: The ID for the model to be deployed.
- DEPLOYED_MODEL_NAME: A name for the DeployedModel. You can also use the display name of the Model for the DeployedModel.
- MIN_REPLICA_COUNT: The minimum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to the maximum number of nodes and never fewer than this number of nodes.
- MAX_REPLICA_COUNT: The maximum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to this number of nodes and never fewer than the minimum number of nodes. If you omit the --max-replica-count flag, the maximum number of nodes is set to the value of --min-replica-count.
Execute the gcloud ai endpoints deploy-model command:
Linux, macOS, or Cloud Shell
gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=LOCATION_ID \
  --model=MODEL_ID \
  --display-name=DEPLOYED_MODEL_NAME \
  --min-replica-count=MIN_REPLICA_COUNT \
  --max-replica-count=MAX_REPLICA_COUNT \
  --traffic-split=0=100
Windows (PowerShell)
gcloud ai endpoints deploy-model ENDPOINT_ID `
  --region=LOCATION_ID `
  --model=MODEL_ID `
  --display-name=DEPLOYED_MODEL_NAME `
  --min-replica-count=MIN_REPLICA_COUNT `
  --max-replica-count=MAX_REPLICA_COUNT `
  --traffic-split=0=100
Windows (cmd.exe)
gcloud ai endpoints deploy-model ENDPOINT_ID ^
  --region=LOCATION_ID ^
  --model=MODEL_ID ^
  --display-name=DEPLOYED_MODEL_NAME ^
  --min-replica-count=MIN_REPLICA_COUNT ^
  --max-replica-count=MAX_REPLICA_COUNT ^
  --traffic-split=0=100
Splitting traffic
The --traffic-split=0=100 flag in the preceding examples sends 100% of prediction traffic that the Endpoint receives to the new DeployedModel, which is represented by the temporary ID 0. If your Endpoint already has other DeployedModel resources, then you can split traffic between the new DeployedModel and the old ones.
For example, to send 20% of traffic to the new DeployedModel and 80% to an older one, run the following command.
Before using any of the command data below, make the following replacements:
- OLD_DEPLOYED_MODEL_ID: The ID of the existing DeployedModel.
Execute the gcloud ai endpoints deploy-model command:
Linux, macOS, or Cloud Shell
gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=LOCATION_ID \
  --model=MODEL_ID \
  --display-name=DEPLOYED_MODEL_NAME \
  --min-replica-count=MIN_REPLICA_COUNT \
  --max-replica-count=MAX_REPLICA_COUNT \
  --traffic-split=0=20,OLD_DEPLOYED_MODEL_ID=80
Windows (PowerShell)
gcloud ai endpoints deploy-model ENDPOINT_ID `
  --region=LOCATION_ID `
  --model=MODEL_ID `
  --display-name=DEPLOYED_MODEL_NAME `
  --min-replica-count=MIN_REPLICA_COUNT `
  --max-replica-count=MAX_REPLICA_COUNT `
  --traffic-split=0=20,OLD_DEPLOYED_MODEL_ID=80
Windows (cmd.exe)
gcloud ai endpoints deploy-model ENDPOINT_ID ^
  --region=LOCATION_ID ^
  --model=MODEL_ID ^
  --display-name=DEPLOYED_MODEL_NAME ^
  --min-replica-count=MIN_REPLICA_COUNT ^
  --max-replica-count=MAX_REPLICA_COUNT ^
  --traffic-split=0=20,OLD_DEPLOYED_MODEL_ID=80
REST
Deploy the model.
Before using any of the request data, make the following replacements:
- LOCATION_ID: The region where you are using Vertex AI.
- PROJECT_ID: Your project ID.
- ENDPOINT_ID: The ID for the endpoint.
- MODEL_ID: The ID for the model to be deployed.
- DEPLOYED_MODEL_NAME: A name for the DeployedModel. You can also use the display name of the Model for the DeployedModel.
- MACHINE_TYPE: Optional. The machine resources used for each node of this deployment. The default setting is n1-standard-2. Learn more about machine types.
- ACCELERATOR_TYPE: The type of accelerator to be attached to the machine. Optional if ACCELERATOR_COUNT is not specified or is zero. Not recommended for AutoML models or custom-trained models that use non-GPU images. Learn more.
- ACCELERATOR_COUNT: The number of accelerators for each replica to use. Optional. Should be zero or unspecified for AutoML models or custom-trained models that are using non-GPU images.
- MIN_REPLICA_COUNT: The minimum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to the maximum number of nodes and never fewer than this number of nodes. This value must be greater than or equal to 1.
- MAX_REPLICA_COUNT: The maximum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to this number of nodes and never fewer than the minimum number of nodes.
- TRAFFIC_SPLIT_THIS_MODEL: The percentage of the prediction traffic to this endpoint to be routed to the model being deployed with this operation. Defaults to 100. All traffic percentages must add up to 100. Learn more about traffic splits.
- DEPLOYED_MODEL_ID_N: Optional. If other models are deployed to this endpoint, you must update their traffic split percentages so that all percentages add up to 100.
- TRAFFIC_SPLIT_MODEL_N: The traffic split percentage value for the deployed model id key.
- PROJECT_NUMBER: Your project's automatically generated project number
HTTP method and URL:
POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel
Request JSON body:
{ "deployedModel": { "model": "projects/PROJECT/locations/us-central1/models/MODEL_ID", "displayName": "DEPLOYED_MODEL_NAME", "dedicatedResources": { "machineSpec": { "machineType": "MACHINE_TYPE", "acceleratorType": "ACCELERATOR_TYPE", "acceleratorCount": "ACCELERATOR_COUNT" }, "minReplicaCount": MIN_REPLICA_COUNT, "maxReplicaCount": MAX_REPLICA_COUNT }, }, "trafficSplit": { "0": TRAFFIC_SPLIT_THIS_MODEL, "DEPLOYED_MODEL_ID_1": TRAFFIC_SPLIT_MODEL_1, "DEPLOYED_MODEL_ID_2": TRAFFIC_SPLIT_MODEL_2 }, }
You should receive a JSON response similar to the following:
{ "name": "projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID/operations/OPERATION_ID", "metadata": { "@type": "type.googleapis.com/google.cloud.aiplatform.v1.DeployModelOperationMetadata", "genericMetadata": { "createTime": "2020-10-19T17:53:16.502088Z", "updateTime": "2020-10-19T17:53:16.502088Z" } } }
Java
Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
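As a point of reference, a minimal deployment sketch with the Vertex AI SDK for Python might look like the following; the resource IDs, machine type, and replica counts are placeholder values:

from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="LOCATION_ID")

model = aiplatform.Model("MODEL_ID")
endpoint = aiplatform.Endpoint("ENDPOINT_ID")

# Deploy the model with dedicated resources and send it 100% of traffic.
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name="DEPLOYED_MODEL_NAME",
    machine_type="n1-standard-2",
    min_replica_count=1,
    max_replica_count=1,
    traffic_percentage=100,
)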
Node.js
Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Learn how to change the default settings for prediction logging.
Get operation status
Some requests start long-running operations that require time to complete. These requests return an operation name, which you can use to view the operation's status or cancel the operation. Vertex AI provides helper methods to make calls against long-running operations. For more information, see Working with long-running operations.
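For example, when you create an endpoint through the v1 GAPIC client for Python, the call returns an operation object that you can inspect or block on; a minimal sketch with placeholder values:

from google.cloud import aiplatform_v1

client = aiplatform_v1.EndpointServiceClient(
    client_options={"api_endpoint": "LOCATION_ID-aiplatform.googleapis.com"}
)

# create_endpoint starts a long-running operation.
operation = client.create_endpoint(
    parent="projects/PROJECT_ID/locations/LOCATION_ID",
    endpoint=aiplatform_v1.Endpoint(display_name="ENDPOINT_NAME"),
)

print(operation.done())              # check status without blocking
endpoint = operation.result(timeout=300)  # block until the operation completes
print(endpoint.name)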
Limitations
- If you have VPC Service Controls enabled, your deployed model's container won't have access to the internet.
Configure model deployment
During model deployment, you make the following important decisions about how to run online prediction:
| Resource created | Setting specified at resource creation |
|---|---|
| Endpoint | Location in which to run predictions |
| Model | Container to use (ModelContainerSpec) |
| DeployedModel | Machines to use for online prediction |
You can't update the settings listed above after the initial creation of the model or endpoint, and you can't override them in the online prediction request. If you need to change these settings, you must redeploy your model.
What happens when you deploy a model
When you deploy a model to an endpoint, you associate physical (machine) resources with that model so it can serve online predictions. Online predictions have low latency requirements. Providing resources to the model in advance reduces latency.
The model's training type (AutoML or custom) and, for AutoML models, the data type determine the kinds of physical resources available to the model. After model deployment, you can mutate some of those resources without creating a new deployment.
The endpoint resource provides the service endpoint (URL) you use to request the prediction. For example:
https://us-central1-aiplatform.googleapis.com/v1/projects/{project}/locations/{location}/endpoints/{endpoint}:predict
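After the model is deployed, you can send prediction requests to this endpoint. For illustration, a minimal sketch with the Vertex AI SDK for Python, assuming a placeholder instance payload that matches your model's expected input format:

from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="LOCATION_ID")

endpoint = aiplatform.Endpoint("ENDPOINT_ID")

# The structure of each instance depends on the model you deployed.
response = endpoint.predict(instances=[{"feature_1": 1.0, "feature_2": "a"}])
print(response.predictions)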
Reasons to deploy more than one model to the same endpoint
Deploying two models to the same endpoint lets you gradually replace one model with the other. For example, suppose you are using a model, and find a way to increase the accuracy of that model with new training data. However, you don't want to update your application to point to a new endpoint URL, and you don't want to create sudden changes in your application. You can add the new model to the same endpoint, serving a small percentage of traffic, and gradually increase the traffic split for the new model until it is serving 100% of the traffic.
Because the resources are associated with the model rather than the endpoint, you could deploy models of different types to the same endpoint. However, the best practice is to deploy models of a specific type (for example, AutoML tabular, custom-trained) to an endpoint. This configuration is easier to manage.
Reasons to deploy a model to more than one endpoint
You might want to deploy your models with different resources for different application environments, such as testing and production. You might also want to support different SLOs for your prediction requests. Perhaps one of your applications has much higher performance needs than the others. In this case, you can deploy that model to a higher-performance endpoint with more machine resources. To optimize costs, you can also deploy the model to a lower-performance endpoint with fewer machine resources.
Scaling behavior
When you deploy a model for online prediction as a DeployedModel, you can configure prediction nodes to automatically scale. To do this, set dedicatedResources.maxReplicaCount to a greater value than dedicatedResources.minReplicaCount.
When you configure a DeployedModel, you must set dedicatedResources.minReplicaCount to at least 1. In other words, you cannot configure the DeployedModel to scale to 0 prediction nodes when it is unused.
Target utilization and configuration
By default, if you deploy a model without dedicated GPU resources, Vertex AI automatically scales the number of replicas up or down so that CPU usage matches the default 60% target value.
By default, if you deploy a model with dedicated GPU resources (if machineSpec.accelerator_count is greater than 0), Vertex AI automatically scales the number of replicas up or down so that CPU or GPU usage, whichever is higher, matches the default 60% target value. Therefore, if your prediction throughput causes high GPU usage but not high CPU usage, Vertex AI scales up, and CPU utilization remains very low, which is visible in monitoring. Conversely, if your custom container underutilizes the GPU but has an unrelated process that raises CPU utilization above 60%, Vertex AI scales up, even if this isn't needed to achieve your QPS and latency targets.
You can override the default threshold metric and target by specifying autoscalingMetricSpecs.
Note that if your deployment is configured to scale based only on CPU usage, it won't scale up even if GPU usage is high.
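In the Vertex AI SDK for Python, these autoscaling targets can be set at deploy time; a minimal sketch, assuming placeholder resource IDs and a GPU machine configuration:

from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="LOCATION_ID")

model = aiplatform.Model("MODEL_ID")
endpoint = aiplatform.Endpoint("ENDPOINT_ID")

# Override the default 60% targets for CPU utilization and GPU duty cycle.
model.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=4,
    autoscaling_target_cpu_utilization=70,
    autoscaling_target_accelerator_duty_cycle=70,
)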
Manage resource usage
You can monitor your endpoint to track metrics like CPU and Accelerator usage, number of requests, latency, and the current and target number of replicas. This information can help you understand your endpoint's resource usage and scaling behavior.
Keep in mind that each replica runs only a single container. This means that if a prediction container can't fully use the selected compute resource (for example, single-threaded code on a multi-core machine, or a custom model that calls another service as part of making the prediction), your nodes may not scale up.
For example, if you are using FastAPI, or any model server that has a configurable number of workers or threads, there are many cases where having more than one worker can increase resource utilization, which improves the ability for the service to automatically scale the number of replicas.
We generally recommend starting with one worker or thread per core. If you notice that CPU utilization is low, especially under high load, or your model isn't scaling up because CPU utilization is low, then increase the number of workers. On the other hand, if you notice that utilization is too high and your latencies increase more than expected under load, try using fewer workers. If you are already using only a single worker, try using a smaller machine type.
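For example, if you serve a custom container built on FastAPI and Uvicorn, the worker count is one knob to tune. The following is a minimal sketch, assuming a hypothetical app.main:app application module and using the machine's core count as the starting point:

import os
import uvicorn

if __name__ == "__main__":
    # Start with one worker per core and adjust based on observed CPU utilization.
    uvicorn.run(
        "app.main:app",  # hypothetical FastAPI application module
        host="0.0.0.0",
        port=int(os.environ.get("AIP_HTTP_PORT", 8080)),
        workers=os.cpu_count() or 1,
    )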
Scaling behavior and lag
Vertex AI adjusts the number of replicas every 15 seconds, using data from the previous 5-minute window. For each 15-second cycle, the system measures the server utilization and generates a target number of replicas based on the following formula:
target # of replicas = Ceil(current # of replicas * (current utilization / target utilization))
For example, if you have two replicas that are being utilized at 100%, the target is 4:
4 = Ceil(3.33) = Ceil(2 * (100% / 60%))
As another example, if you have 10 replicas and utilization drops to 1%, the target is 1:
1 = Ceil(.167) = Ceil(10 * (1% / 60%))
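For illustration, the target calculation can be written as a one-line helper; this is a sketch of the published formula, not Vertex AI's internal implementation:

import math

def target_replicas(current_replicas: int, current_util: float, target_util: float = 0.60) -> int:
    """Ceil(current replicas * current utilization / target utilization)."""
    return math.ceil(current_replicas * (current_util / target_util))

print(target_replicas(2, 1.00))   # 4
print(target_replicas(10, 0.01))  # 1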
At the end of each 15-second cycle, the system adjusts the number of replicas to match the highest target value from the previous 5-minute window. Because the highest target value is chosen, your endpoint won't scale down if there is a spike in utilization during that 5-minute window, even if overall utilization is very low. On the other hand, if the system needs to scale up, it does so within 15 seconds, because the highest target value is chosen instead of the average.
Keep in mind that even after Vertex AI adjusts the number of replicas, it takes time to start up or turn down the replicas. Thus there is an additional delay before the endpoint can adjust to the traffic. The main factors that contribute to this time include the following:
- The time to provision and start the Compute Engine VMs
- The time to download the container from the registry
- The time to load the model from storage
The best way to understand the real-world scaling behavior of your model is to run a load test and optimize the characteristics that matter for your model and your use case. If the autoscaler isn't scaling up fast enough for your application, provision enough min_replicas to handle your expected baseline traffic.
Update the scaling configuration
If you specified either DedicatedResources or AutomaticResources when you deployed the model, you can update the scaling configuration without redeploying the model by calling mutateDeployedModel.
For example, the following request updates max_replica and autoscaling_metric_specs, and disables container logging.
{
"deployedModel": {
"id": "2464520679043629056",
"dedicatedResources": {
"maxReplicaCount": 9,
"autoscalingMetricSpecs": [
{
"metricName": "aiplatform.googleapis.com/prediction/online/cpu/utilization",
"target": 50
}
]
},
"disableContainerLogging": true
},
"update_mask": {
"paths": [
"dedicated_resources.max_replica_count",
"dedicated_resources.autoscaling_metric_specs",
"disable_container_logging"
]
}
}
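One way to send a request like this is through the v1 GAPIC client for Python, assuming your installed version exposes mutate_deployed_model; a minimal sketch with placeholder values:

from google.cloud import aiplatform_v1
from google.protobuf import field_mask_pb2

client = aiplatform_v1.EndpointServiceClient(
    client_options={"api_endpoint": "LOCATION_ID-aiplatform.googleapis.com"}
)

deployed_model = aiplatform_v1.DeployedModel(
    id="2464520679043629056",
    dedicated_resources=aiplatform_v1.DedicatedResources(
        max_replica_count=9,
        autoscaling_metric_specs=[
            aiplatform_v1.AutoscalingMetricSpec(
                metric_name="aiplatform.googleapis.com/prediction/online/cpu/utilization",
                target=50,
            )
        ],
    ),
    disable_container_logging=True,
)

# Only the fields named in the update mask are changed.
operation = client.mutate_deployed_model(
    endpoint="projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID",
    deployed_model=deployed_model,
    update_mask=field_mask_pb2.FieldMask(paths=[
        "dedicated_resources.max_replica_count",
        "dedicated_resources.autoscaling_metric_specs",
        "disable_container_logging",
    ]),
)
operation.result()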
Usage notes:
- You can't change the machine type or switch from DedicatedResources to AutomaticResources, or the other way around. The only scaling configuration fields you can change are min_replica, max_replica, and AutoscalingMetricSpec (DedicatedResources only).
- You must list every field you need to update in updateMask. Unlisted fields are ignored.
- The DeployedModel must be in a DEPLOYED state. There can be at most one active mutate operation per deployed model.
- mutateDeployedModel also lets you enable or disable container logging. For more information, see Online prediction logging.
Undeploy a model and delete the endpoint
Use one of the following methods to undeploy a model and delete the endpoint.
Google Cloud console
Undeploy the model as follows:
In the Google Cloud console, in the Vertex AI section, go to the Endpoints page.
Click the name of the endpoint from which you want to undeploy the model to open its details page.
On the row for your model, click Actions, and then click Undeploy model from endpoint.
In the Undeploy model from endpoint dialog, click Undeploy.
To delete additional models, repeat the preceding steps.
Optional: Delete the online prediction endpoint as follows:
In the Google Cloud console, in the Vertex AI section, go to the Online prediction page.
Select the endpoint.
To delete the endpoint, click Actions, and then click Delete endpoint.
gcloud
List the endpoint IDs for all endpoints in your project:
gcloud ai endpoints list \
  --project=PROJECT_ID \
  --region=LOCATION_ID
Replace PROJECT_ID with your project ID and LOCATION_ID with the region where you are using Vertex AI.
List the model IDs for the models that are deployed to an endpoint:
gcloud ai endpoints describe ENDPOINT_ID \
  --project=PROJECT_ID \
  --region=LOCATION_ID
Replace ENDPOINT_ID with the endpoint ID.
Undeploy a model from the endpoint:
gcloud ai endpoints undeploy-model ENDPOINT_ID \
  --project=PROJECT_ID \
  --region=LOCATION_ID \
  --deployed-model-id=DEPLOYED_MODEL_ID
Replace DEPLOYED_MODEL_ID with the deployed model's ID from the output of the previous command.
Optional: Delete the online prediction endpoint:
gcloud ai endpoints delete ENDPOINT_ID \
  --project=PROJECT_ID \
  --region=LOCATION_ID
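If you're using the Vertex AI SDK for Python instead, a minimal sketch for the same cleanup, assuming placeholder IDs:

from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="LOCATION_ID")

endpoint = aiplatform.Endpoint("ENDPOINT_ID")

# Undeploy a single model by its deployed model ID, or undeploy everything.
endpoint.undeploy(deployed_model_id="DEPLOYED_MODEL_ID")
# endpoint.undeploy_all()

# Optional: delete the endpoint. force=True undeploys any remaining models first.
endpoint.delete(force=True)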
What's next
- Learn how to get an online prediction.
- Learn about private endpoints.