You must deploy a model to an endpoint before that model can be used to serve online predictions. Deploying a model associates physical resources with the model so it can serve online predictions with low latency.
You can deploy more than one model to an endpoint, and you can deploy a model to more than one endpoint. For more information about options and use cases for deploying models, see Reasons to deploy more than one model to the same endpoint below.
Deploy a model to an endpoint
Use one of the following methods to deploy a model:
Google Cloud console
- In the Google Cloud console, in the Vertex AI section, go to the Models page.
- Click the name and version ID of the model you want to deploy to open its details page.
- Select the Deploy & Test tab.
  If your model is already deployed to any endpoints, they are listed in the Deploy your model section.
- Click Deploy to endpoint.
- To deploy your model to a new endpoint, select Create new endpoint and provide a name for the new endpoint. To deploy your model to an existing endpoint, select Add to existing endpoint and select the endpoint from the drop-down list.
  You can add more than one model to an endpoint, and you can add a model to more than one endpoint. Learn more.
  If you deploy your model to an existing endpoint that has one or more models deployed to it, you must update the Traffic split percentage for the model you are deploying and the already deployed models so that all of the percentages add up to 100%.
- If you're deploying your model to a new endpoint, accept 100 for the Traffic split. Otherwise, adjust the traffic split values for all models on the endpoint so that they add up to 100.
- Enter the Minimum number of compute nodes you want to provide for your model.
  This is the number of nodes available to this model at all times. You are charged for the nodes used, whether they handle prediction load or stand by as minimum nodes, even without prediction traffic. See the pricing page.
- To use autoscaling, enter the Maximum number of compute nodes you want Vertex AI to scale up to.
  The number of compute nodes can increase if needed to handle prediction traffic, but it never exceeds the maximum number of nodes.
- Select your Machine type.
  Larger machine resources increase your prediction performance and also increase costs. Compare the available machine types.
- Select an Accelerator type and an Accelerator count.
  This option appears only if you enabled accelerator use when you imported or created the model.
  For the accelerator count, refer to the GPU table to check the valid numbers of GPUs that you can use with each CPU machine type. The accelerator count refers to the number of accelerators per node, not the total number of accelerators in your deployment.
- If you want to use a custom service account for the deployment, select a service account in the Service account drop-down list.
- Learn how to change the default settings for prediction logging.
- Click Done for your model, and when all the Traffic split percentages are correct, click Continue.
  The region where your model deploys is displayed. This must be the region where you created your model.
- Click Deploy to deploy your model to the endpoint.
API
When you deploy a model using the Vertex AI API, you complete the following steps:
Create an endpoint
If you are deploying a model to an existing endpoint, you can skip this step.
gcloud
The following example uses the gcloud ai endpoints create
command:
gcloud ai endpoints create \
--region=LOCATION_ID \
--display-name=ENDPOINT_NAME
Replace the following:
- LOCATION_ID: The region where you are using Vertex AI.
- ENDPOINT_NAME: The display name for the endpoint.
The Google Cloud CLI tool might take a few seconds to create the endpoint.
REST
Before using any of the request data, make the following replacements:
- LOCATION_ID: Your region.
- PROJECT_ID: Your project ID.
- ENDPOINT_NAME: The display name for the endpoint.
HTTP method and URL:
POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints
Request JSON body:
{ "display_name": "ENDPOINT_NAME" }
You should receive a JSON response similar to the following:
{ "name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/endpoints/ENDPOINT_ID/operations/OPERATION_ID", "metadata": { "@type": "type.googleapis.com/google.cloud.aiplatform.v1.CreateEndpointOperationMetadata", "genericMetadata": { "createTime": "2020-11-05T17:45:42.812656Z", "updateTime": "2020-11-05T17:45:42.812656Z" } } }
"done":
true
.
Java
To learn how to install and use the client library for Vertex AI, see Vertex AI client libraries. For more information, see the Vertex AI Java API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Node.js
To learn how to install and use the client library for Vertex AI, see Vertex AI client libraries. For more information, see the Vertex AI Node.js API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Python
To learn how to install and use the client library for Vertex AI, see Vertex AI client libraries. For more information, see the Vertex AI Python API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
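As a minimal sketch (not an official sample), creating an endpoint with the Vertex AI SDK for Python might look like the following; the project ID, region, and endpoint name are placeholders:

from google.cloud import aiplatform

# Placeholder values; replace with your project ID, region, and endpoint name.
PROJECT_ID = "PROJECT_ID"
LOCATION_ID = "us-central1"
ENDPOINT_NAME = "ENDPOINT_NAME"

# Initialize the SDK for your project and region.
aiplatform.init(project=PROJECT_ID, location=LOCATION_ID)

# Create the endpoint and wait for the long-running operation to finish.
endpoint = aiplatform.Endpoint.create(display_name=ENDPOINT_NAME)
print(endpoint.resource_name)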
Retrieve the endpoint ID
You need the endpoint ID to deploy the model.
gcloud
The following example uses the gcloud ai endpoints list
command:
gcloud ai endpoints list \
--region=LOCATION_ID \
--filter=display_name=ENDPOINT_NAME
Replace the following:
- LOCATION_ID: The region where you are using Vertex AI.
- ENDPOINT_NAME: The display name for the endpoint.
Note the number that appears in the ENDPOINT_ID
column. Use this ID in the
following step.
REST
Before using any of the request data, make the following replacements:
- LOCATION_ID: The region where you are using Vertex AI.
- PROJECT_ID: Your project ID.
- ENDPOINT_NAME: The display name for the endpoint.
HTTP method and URL:
GET https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints?filter=display_name=ENDPOINT_NAME
You should receive a JSON response similar to the following:
{ "endpoints": [ { "name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/endpoints/ENDPOINT_ID", "displayName": "ENDPOINT_NAME", "etag": "AMEw9yPz5pf4PwBHbRWOGh0PcAxUdjbdX2Jm3QO_amguy3DbZGP5Oi_YUKRywIE-BtLx", "createTime": "2020-04-17T18:31:11.585169Z", "updateTime": "2020-04-17T18:35:08.568959Z" } ] }
Deploy the model
Select the tab below for your language or environment:
gcloud
The following examples use the gcloud ai endpoints deploy-model command.

The following example deploys a Model to an Endpoint without using GPUs to accelerate prediction serving and without splitting traffic between multiple DeployedModel resources:
Before using any of the command data below, make the following replacements:
- ENDPOINT_ID: The ID for the endpoint.
- LOCATION_ID: The region where you are using Vertex AI.
- MODEL_ID: The ID for the model to be deployed.
- DEPLOYED_MODEL_NAME: A name for the DeployedModel. You can use the display name of the Model for the DeployedModel as well.
- MIN_REPLICA_COUNT: The minimum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to the maximum number of nodes and never fewer than this number of nodes.
- MAX_REPLICA_COUNT: The maximum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to this number of nodes and never fewer than the minimum number of nodes. If you omit the --max-replica-count flag, then the maximum number of nodes is set to the value of --min-replica-count.
Execute the gcloud ai endpoints deploy-model command:
Linux, macOS, or Cloud Shell
gcloud ai endpoints deploy-model ENDPOINT_ID \
--region=LOCATION_ID \
--model=MODEL_ID \
--display-name=DEPLOYED_MODEL_NAME \
--min-replica-count=MIN_REPLICA_COUNT \
--max-replica-count=MAX_REPLICA_COUNT \
--traffic-split=0=100
Windows (PowerShell)
gcloud ai endpoints deploy-model ENDPOINT_ID `
--region=LOCATION_ID `
--model=MODEL_ID `
--display-name=DEPLOYED_MODEL_NAME `
--min-replica-count=MIN_REPLICA_COUNT `
--max-replica-count=MAX_REPLICA_COUNT `
--traffic-split=0=100
Windows (cmd.exe)
gcloud ai endpoints deploy-model ENDPOINT_ID^
--region=LOCATION_ID ^
--model=MODEL_ID ^
--display-name=DEPLOYED_MODEL_NAME ^
--min-replica-count=MIN_REPLICA_COUNT ^
--max-replica-count=MAX_REPLICA_COUNT ^
--traffic-split=0=100
Splitting traffic
The --traffic-split=0=100 flag in the preceding examples sends 100% of prediction traffic that the Endpoint receives to the new DeployedModel, which is represented by the temporary ID 0. If your Endpoint already has other DeployedModel resources, then you can split traffic between the new DeployedModel and the old ones.

For example, to send 20% of traffic to the new DeployedModel and 80% to an older one, run the following command.

Before using any of the command data below, make the following replacements:
- OLD_DEPLOYED_MODEL_ID: the ID of the existing DeployedModel.
Execute the gcloud ai endpoints deploy-model command:
Linux, macOS, or Cloud Shell
gcloud ai endpoints deploy-model ENDPOINT_ID \
--region=LOCATION_ID \
--model=MODEL_ID \
--display-name=DEPLOYED_MODEL_NAME \
--min-replica-count=MIN_REPLICA_COUNT \
--max-replica-count=MAX_REPLICA_COUNT \
--traffic-split=0=20,OLD_DEPLOYED_MODEL_ID=80
Windows (PowerShell)
gcloud ai endpoints deploy-model ENDPOINT_ID `
--region=LOCATION_ID `
--model=MODEL_ID `
--display-name=DEPLOYED_MODEL_NAME `
--min-replica-count=MIN_REPLICA_COUNT `
--max-replica-count=MAX_REPLICA_COUNT `
--traffic-split=0=20,OLD_DEPLOYED_MODEL_ID=80
Windows (cmd.exe)
gcloud ai endpoints deploy-model ENDPOINT_ID^
--region=LOCATION_ID ^
--model=MODEL_ID ^
--display-name=DEPLOYED_MODEL_NAME ^
--min-replica-count=MIN_REPLICA_COUNT ^
--max-replica-count=MAX_REPLICA_COUNT ^
--traffic-split=0=20,OLD_DEPLOYED_MODEL_ID=80
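For comparison, the same 20/80 split can be expressed with the Vertex AI SDK for Python. The following is a sketch rather than an official sample; the machine type and replica counts are placeholder choices, and the key "0" refers to the model being deployed in this call:

from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="us-central1")

endpoint = aiplatform.Endpoint("ENDPOINT_ID")
model = aiplatform.Model("MODEL_ID")

# "0" is the model being deployed in this call; the other key is the ID of a
# DeployedModel that is already on the endpoint.
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name="DEPLOYED_MODEL_NAME",
    machine_type="n1-standard-2",
    min_replica_count=1,
    max_replica_count=2,
    traffic_split={"0": 20, "OLD_DEPLOYED_MODEL_ID": 80},
)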
REST
Deploy the model.
Before using any of the request data, make the following replacements:
- LOCATION_ID: The region where you are using Vertex AI.
- PROJECT_ID: Your project ID.
- ENDPOINT_ID: The ID for the endpoint.
- MODEL_ID: The ID for the model to be deployed.
- DEPLOYED_MODEL_NAME: A name for the DeployedModel. You can use the display name of the Model for the DeployedModel as well.
- MACHINE_TYPE: Optional. The machine resources used for each node of this deployment. Its default setting is n1-standard-2. Learn more about machine types.
- ACCELERATOR_TYPE: The type of accelerator to be attached to the machine. Optional if ACCELERATOR_COUNT is not specified or is zero. Not recommended for AutoML models or custom-trained models that are using non-GPU images. Learn more.
- ACCELERATOR_COUNT: The number of accelerators for each replica to use. Optional. Should be zero or unspecified for AutoML models or custom-trained models that are using non-GPU images.
- MIN_REPLICA_COUNT: The minimum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to the maximum number of nodes and never fewer than this number of nodes. This value must be greater than or equal to 1.
- MAX_REPLICA_COUNT: The maximum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to this number of nodes and never fewer than the minimum number of nodes.
- TRAFFIC_SPLIT_THIS_MODEL: The percentage of the prediction traffic to this endpoint to be routed to the model being deployed with this operation. Defaults to 100. All traffic percentages must add up to 100. Learn more about traffic splits.
- DEPLOYED_MODEL_ID_N: Optional. If other models are deployed to this endpoint, you must update their traffic split percentages so that all percentages add up to 100.
- TRAFFIC_SPLIT_MODEL_N: The traffic split percentage value for the deployed model id key.
- PROJECT_NUMBER: The project number for your project.
HTTP method and URL:
POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel
Request JSON body:
{ "deployedModel": { "model": "projects/PROJECT/locations/us-central1/models/MODEL_ID", "displayName": "DEPLOYED_MODEL_NAME", "dedicatedResources": { "machineSpec": { "machineType": "MACHINE_TYPE", "acceleratorType": "ACCELERATOR_TYPE", "acceleratorCount": "ACCELERATOR_COUNT" }, "minReplicaCount": MIN_REPLICA_COUNT, "maxReplicaCount": MAX_REPLICA_COUNT }, }, "trafficSplit": { "0": TRAFFIC_SPLIT_THIS_MODEL, "DEPLOYED_MODEL_ID_1": TRAFFIC_SPLIT_MODEL_1, "DEPLOYED_MODEL_ID_2": TRAFFIC_SPLIT_MODEL_2 }, }
You should receive a JSON response similar to the following:
{ "name": "projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID/operations/OPERATION_ID", "metadata": { "@type": "type.googleapis.com/google.cloud.aiplatform.v1.DeployModelOperationMetadata", "genericMetadata": { "createTime": "2020-10-19T17:53:16.502088Z", "updateTime": "2020-10-19T17:53:16.502088Z" } } }
Java
To learn how to install and use the client library for Vertex AI, see Vertex AI client libraries. For more information, see the Vertex AI Java API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Python
To learn how to install and use the client library for Vertex AI, see Vertex AI client libraries. For more information, see the Vertex AI Python API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
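A rough sketch of the same deployment with the Vertex AI SDK for Python follows. It is not an official sample; the machine type, accelerator, and replica counts shown are placeholder choices:

from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="us-central1")

model = aiplatform.Model("MODEL_ID")
endpoint = aiplatform.Endpoint("ENDPOINT_ID")

# Deploy the model with dedicated resources; autoscaling is enabled when
# max_replica_count is greater than min_replica_count.
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name="DEPLOYED_MODEL_NAME",
    machine_type="n1-standard-2",
    # Omit the accelerator arguments for models that don't use GPUs.
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=2,
    traffic_percentage=100,
)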
Node.js
To learn how to install and use the client library for Vertex AI, see Vertex AI client libraries. For more information, see the Vertex AI Node.js API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Learn how to change the default settings for prediction logging.
Get operation status
Some requests start long-running operations that require time to complete. These requests return an operation name, which you can use to view the operation's status or cancel the operation. Vertex AI provides helper methods to make calls against long-running operations. For more information, see Working with long-running operations.
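For example, a minimal sketch that polls one of the operation names returned above over REST, using Application Default Credentials; the operation name and region are placeholders:

from google.auth import default
from google.auth.transport.requests import AuthorizedSession

credentials, _ = default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
session = AuthorizedSession(credentials)

# OPERATION_NAME is the "name" field returned by the create or deploy call, e.g.
# projects/PROJECT_NUMBER/locations/LOCATION_ID/endpoints/ENDPOINT_ID/operations/OPERATION_ID
operation_name = "OPERATION_NAME"
location = "us-central1"  # must match the LOCATION_ID in the operation name

resp = session.get(f"https://{location}-aiplatform.googleapis.com/v1/{operation_name}")
resp.raise_for_status()
print(resp.json().get("done", False))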
Configure model deployment
During model deployment, you make the following important decisions about how to run online prediction:
Resource created | Setting specified at resource creation
---|---
Endpoint | Location in which to run predictions
Model | Container to use (ModelContainerSpec)
DeployedModel | Machines to use for online prediction
You can't update the settings listed above after the initial creation of the model or endpoint, and you can't override them in the online prediction request. If you need to change these settings, you must redeploy your model.
What happens when you deploy a model
When you deploy a model to an endpoint, you associate physical (machine) resources with that model so it can serve online predictions. Online predictions have low latency requirements. Providing resources to the model in advance reduces latency.
The model's training type (AutoML or custom) and, for AutoML models, the data type determine the kinds of physical resources available to the model. After model deployment, you can mutate some of those resources without creating a new deployment.
The endpoint resource provides the service endpoint (URL) you use to request the prediction. For example:
https://us-central1-aiplatform.googleapis.com/v1/projects/{project}/locations/{location}/endpoints/{endpoint}:predict
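Once the model is deployed, a request against that service endpoint could look like the following sketch with the Vertex AI SDK for Python; the instance payload is hypothetical and depends entirely on your model's expected input format:

from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="us-central1")
endpoint = aiplatform.Endpoint("ENDPOINT_ID")

# The structure of each instance depends on your model; this payload is only a
# placeholder.
response = endpoint.predict(instances=[{"feature_1": 1.0, "feature_2": "value"}])
print(response.predictions)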
Reasons to deploy more than one model to the same endpoint
Deploying two models to the same endpoint lets you gradually replace one model with the other. For example, suppose you are using a model, and find a way to increase the accuracy of that model with new training data. However, you don't want to update your application to point to a new endpoint URL, and you don't want to create sudden changes in your application. You can add the new model to the same endpoint, serving a small percentage of traffic, and gradually increase the traffic split for the new model until it is serving 100% of the traffic.
Because the resources are associated with the model rather than the endpoint, you could deploy models of different types to the same endpoint. However, the best practice is to deploy models of a specific type (for example, AutoML text, AutoML tabular, custom-trained) to an endpoint. This configuration is easier to manage.
Reasons to deploy a model to more than one endpoint
You might want to deploy your models with different resources for different application environments, such as testing and production. You might also want to support different SLOs for your prediction requests. Perhaps one of your applications has much higher performance needs than the others. In this case, you can deploy that model to a higher-performance endpoint with more machine resources. To optimize costs, you can also deploy the model to a lower-performance endpoint with fewer machine resources.
Scaling behavior
When you deploy a Model for online prediction as a DeployedModel, you can configure prediction nodes to automatically scale. To do this, set dedicatedResources.maxReplicaCount to a greater value than dedicatedResources.minReplicaCount.

When you configure a DeployedModel, you must set dedicatedResources.minReplicaCount to at least 1. In other words, you cannot configure the DeployedModel to scale to 0 prediction nodes when it is unused.

The prediction nodes for batch prediction do not automatically scale. Vertex AI uses BatchDedicatedResources.startingReplicaCount and ignores BatchDedicatedResources.maxReplicaCount.
Target utilization and configuration
By default, if you deploy a model without dedicated GPU resources, Vertex AI automatically scales the number of replicas up or down so that CPU usage matches the default 60% target value.
By default, if you deploy a model with dedicated GPU resources (if machineSpec.accelerator_count is above 0), Vertex AI automatically scales the number of replicas up or down so that CPU or GPU usage, whichever is higher, matches the default 60% target value. Therefore, if your prediction throughput causes high GPU usage but not high CPU usage, Vertex AI scales up, and the CPU utilization will be very low, which is visible in monitoring. Conversely, if your custom container underutilizes the GPU but has an unrelated process that brings CPU utilization above 60%, Vertex AI scales up, even if this wasn't needed to achieve QPS and latency targets.

You can override the default threshold metric and target by specifying autoscalingMetricSpecs.
Note that if your deployment is configured to scale based only on CPU usage, it
will not scale up even if GPU usage is high.
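If you deploy with the Vertex AI SDK for Python, the following sketch shows one way to raise those targets; the autoscaling_target_* parameter names are assumptions to verify against the current SDK reference:

from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="us-central1")
model = aiplatform.Model("MODEL_ID")

# Raise the CPU target from the default 60% to 70%, and target a 70% GPU duty
# cycle for the attached accelerator.
model.deploy(
    endpoint=aiplatform.Endpoint("ENDPOINT_ID"),
    machine_type="n1-standard-4",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=4,
    autoscaling_target_cpu_utilization=70,
    autoscaling_target_accelerator_duty_cycle=70,
)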
Manage resource usage
You can monitor your endpoint to track metrics like CPU and Accelerator usage, number of requests, latency, and the current and target number of replicas. This information can help you understand your endpoint's resource usage and scaling behavior.
Keep in mind that each replica runs only a single container. This means that if a prediction container can't fully utilize the selected compute resource, such as single-threaded code running on a multi-core machine, or a custom model that calls another service as part of making the prediction, your nodes might not scale up.
For example, if you are using FastAPI, or any model server that has a configurable number of workers or threads, there are many cases where having more than one worker can increase resource utilization, which improves the ability of the service to automatically scale the number of replicas.
We generally recommend starting with one worker or thread per core. If you notice that CPU utilization is low, especially under high load, or your model is not scaling up because CPU utilization is low, then increase the number of workers. On the other hand, if you notice that utilization is too high and your latencies increase more than expected under load, then try using fewer workers. If you are already using only a single worker, try using a smaller machine type.
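For example, if your custom container serves a FastAPI app with uvicorn, the worker count is one place to apply this guidance. The following is a sketch only; the "main:app" module path and the NUM_WORKERS environment variable are hypothetical:

# server.py (hypothetical): start with one worker per core, then tune the
# count based on observed CPU utilization under load.
import os

import uvicorn

if __name__ == "__main__":
    workers = int(os.environ.get("NUM_WORKERS", os.cpu_count() or 1))
    uvicorn.run("main:app", host="0.0.0.0", port=8080, workers=workers)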
Scaling behavior and lag
Vertex AI adjusts the number of replicas every 15 seconds using data from the previous 5-minute window. For each 15-second cycle, the system measures the server utilization and generates a target number of replicas based on the following formula:
target # of replicas = Ceil(current # of replicas * (current utilization / target utilization))
For example, if you currently have two replicas that are being utilized at 100%, the target is 4:
4 = Ceil(3.33) = Ceil(2 * (100% / 60%))
As another example, if you currently have 10 replicas and utilization drops to 1%, the target is 1:
1 = Ceil(.167) = Ceil(10 * (1% / 60%))
At the end of each 15-second cycle, the system adjusts the number of replicas to match the highest target value from the previous 5-minute window. Notice that because the highest target value is chosen, your endpoint will not scale down if there is a spike in utilization during that 5-minute window, even if overall utilization is very low. On the other hand, if the system needs to be scaled up, it does so within 15 seconds, because the highest target value is chosen instead of the average.
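The following sketch restates that calculation in Python, including the choice of the highest per-cycle target over the trailing window; the sample window values are hypothetical:

import math

def target_replicas(current_replicas: int, current_utilization: float,
                    target_utilization: float = 0.60) -> int:
    """Per-cycle target from the formula above (utilizations as fractions)."""
    return math.ceil(current_replicas * (current_utilization / target_utilization))

# Examples from the text: 2 replicas at 100% -> 4, 10 replicas at 1% -> 1.
print(target_replicas(2, 1.00))   # 4
print(target_replicas(10, 0.01))  # 1

# The replica count applied at the end of a cycle is the highest target seen
# over the previous 5-minute window (twenty 15-second cycles).
window_targets = [4, 3, 2, 1]  # hypothetical per-cycle targets
print(max(window_targets))     # 4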
Keep in mind that even after Vertex AI adjusts the number of replicas, it takes time to start up or turn down the replicas. Thus there is an additional delay before the endpoint can adjust to the traffic. The main factors that contribute to this time include the following:
- the time to provision and start the Compute Engine VMs
- the time to download the container from the registry
- the time to load the model from storage
The best way to understand the real world scaling behavior of your model is to
run a load test and optimize the characteristics that matter for your model and
your use case. If the autoscaler is not scaling up fast enough for your
application, provision enough min_replicas
to handle your expected baseline
traffic.
Update the scaling configuration
If you specified either DedicatedResources or AutomaticResources when you deployed the model, you can update the scaling configuration without redeploying the model by calling mutateDeployedModel.

For example, the following request updates max_replica and autoscaling_metric_specs, and disables container logging.
{
"deployedModel": {
"id": "2464520679043629056",
"dedicatedResources": {
"maxReplicaCount": 9,
"autoscalingMetricSpecs": [
{
"metricName": "aiplatform.googleapis.com/prediction/online/cpu/utilization",
"target": 50
}
]
},
"disableContainerLogging": true
},
"update_mask": {
"paths": [
"dedicated_resources.max_replica_count",
"dedicated_resources.autoscaling_metric_specs",
"disable_container_logging"
]
}
}
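For reference, the same update expressed against the v1 Python client might look roughly like the following sketch; the request construction (client, types, and field names) should be verified against the mutateDeployedModel reference:

from google.cloud import aiplatform_v1
from google.protobuf import field_mask_pb2

client = aiplatform_v1.EndpointServiceClient(
    client_options={"api_endpoint": "us-central1-aiplatform.googleapis.com"}
)

request = aiplatform_v1.MutateDeployedModelRequest(
    endpoint="projects/PROJECT_ID/locations/us-central1/endpoints/ENDPOINT_ID",
    deployed_model=aiplatform_v1.DeployedModel(
        id="2464520679043629056",
        dedicated_resources=aiplatform_v1.DedicatedResources(
            max_replica_count=9,
            autoscaling_metric_specs=[
                aiplatform_v1.AutoscalingMetricSpec(
                    metric_name="aiplatform.googleapis.com/prediction/online/cpu/utilization",
                    target=50,
                )
            ],
        ),
        disable_container_logging=True,
    ),
    update_mask=field_mask_pb2.FieldMask(
        paths=[
            "dedicated_resources.max_replica_count",
            "dedicated_resources.autoscaling_metric_specs",
            "disable_container_logging",
        ]
    ),
)

# mutate_deployed_model returns a long-running operation.
operation = client.mutate_deployed_model(request=request)
print(operation.result())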
Usage notes:
- You cannot change the machine type or switch from DedicatedResources to AutomaticResources or the other way around. The only scaling configuration fields yo