Introduction
A Vertex AI model is deployed to its own virtual machine (VM) instance by default. Vertex AI offers the capability to co-host models on the same VM, which enables the following benefits:
- Resource sharing across multiple deployments.
- Cost-effective model serving.
- Improved utilization of memory and computational resources.
This guide describes how to share resources across multiple deployments on Vertex AI.
Overview
Model co-hosting support introduces the concept of a DeploymentResourcePool, which groups model deployments that share resources within a single VM. Multiple endpoints can be deployed on the same VM within a DeploymentResourcePool. Each endpoint has one or more deployed models. The deployed models for a given endpoint can be grouped under the same DeploymentResourcePool or under different ones.
In the following example, you have four models and two endpoints:

Model_A, Model_B, and Model_C are deployed to Endpoint_1 with traffic routed to all of them. Model_D is deployed to Endpoint_2, which receives 100% of the traffic for that endpoint.
Instead of having each model assigned to a separate VM, you can group the models in one of the following ways:
- Group Model_A and Model_B to share a VM, which makes them part of DeploymentResourcePool_X.
- Group Model_C and Model_D (currently not in the same endpoint) to share a VM, which makes them part of DeploymentResourcePool_Y.
Different Deployment Resource Pools can't share a VM.
Considerations
There is no fixed upper limit on the number of models that can be deployed to a single Deployment Resource Pool; the practical limit depends on the chosen VM shape, model sizes, and traffic patterns. Co-hosting works well when you have many deployed models with sparse traffic, such that assigning a dedicated machine to each deployed model would not use resources effectively.
You can deploy models to the same Deployment Resource Pool concurrently. However, there is a limit of 20 concurrent deployment requests at any given time.
CPU utilization increases while a model is being deployed. The increased CPU utilization can raise latency for existing traffic or trigger autoscaling. For the best experience, avoid sending high traffic to a Deployment Resource Pool while you are deploying a new model to it.
Existing traffic to a Deployment Resource Pool is not affected when you undeploy a model from it; no impact on CPU utilization or latency of existing traffic is expected while a model is being undeployed.
An empty Deployment Resource Pool doesn't consume your resource quota. Resources are provisioned to a Deployment Resource Pool when the first model is deployed and released when the last model is undeployed.
Models in a single Deployment Resource Pool are not isolated from each other in terms of resources such as CPU and memory. If one model takes up most of the resources, it can trigger autoscaling for the pool.
Limitations
The following limitations exist when deploying models with resource sharing enabled:
- This feature is only supported for TensorFlow model deployments that use the Vertex AI prebuilt TensorFlow containers for prediction. Other model frameworks and custom containers are not yet supported.
- Only custom trained or imported models are supported at this time. AutoML models are not yet supported.
- Only models with the same container image (including framework version) of Vertex AI TensorFlow prebuilt containers for prediction can be deployed in the same Deployment Resource Pool.
- The following features are not yet supported: custom service accounts, container logging, Vertex Explainable AI, VPC Service Controls, and private endpoints.
Supported regions:
Americas
- us-central1
- us-east1
- us-east4
- us-west1
Europe
- europe-west1
Asia Pacific
- asia-northeast1
- asia-southeast1
Deploy a model
To deploy a model to a DeploymentResourcePool, complete the following steps:
- Create a Deployment Resource Pool if needed.
- Create an Endpoint if needed.
- Retrieve the Endpoint ID.
- Deploy the model to the Endpoint in the Deployment Resource Pool.
Create a Deployment Resource Pool
If you are deploying a model to an existing DeploymentResourcePool, skip this step. Otherwise, use CreateDeploymentResourcePool to create a resource pool.
Cloud Console
In the Google Cloud console, go to the Vertex AI Deployment Resource Pools page.
Click Create and fill out the form.
REST
Before using any of the request data, make the following replacements:
- LOCATION_ID: The region where you are using Vertex AI.
- PROJECT_ID: Your project ID.
- MACHINE_TYPE: Optional. The machine resources used for each node of this deployment. Its default setting is n1-standard-2. Learn more about machine types.
- ACCELERATOR_TYPE: The type of accelerator to be attached to the machine. Optional if ACCELERATOR_COUNT is not specified or is zero. Not recommended for AutoML models or custom-trained models that are using non-GPU images. Learn more.
- ACCELERATOR_COUNT: The number of accelerators for each replica to use. Optional. Should be zero or unspecified for AutoML models or custom-trained models that are using non-GPU images.
- MIN_REPLICA_COUNT: The minimum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to the maximum number of nodes and never fewer than this number of nodes. This value must be greater than or equal to 1.
- MAX_REPLICA_COUNT: The maximum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to this number of nodes and never fewer than the minimum number of nodes.
- DEPLOYMENT_RESOURCE_POOL_ID: A name for your DeploymentResourcePool. The maximum length is 63 characters, and valid characters are /^[a-z]([a-z0-9-]{0,61}[a-z0-9])?$/.
HTTP method and URL:
POST https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/deploymentResourcePools
Request JSON body:
{ "deploymentResourcePool":{ "dedicatedResources":{ "machineSpec":{ "machineType":"MACHINE_TYPE", "acceleratorType":"ACCELERATOR_TYPE", "acceleratorCount":"ACCELERATOR_COUNT" }, "minReplicaCount":MIN_REPLICA_COUNT, "maxReplicaCount":MAX_REPLICA_COUNT } }, "deploymentResourcePoolId":"DEPLOYMENT_RESOURCE_POOL_ID" }
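To send your request, you can use a tool such as curl. The following is a minimal sketch; it assumes the request body above is saved as request.json and that you are authenticated with the gcloud CLI:

# Create the DeploymentResourcePool (the access token comes from your gcloud credentials)
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/deploymentResourcePools"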
You should receive a JSON response similar to the following:
{ "name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/deploymentResourcePools/DEPLOYMENT_RESOURCE_POOL_ID/operations/OPERATION_ID", "metadata": { "@type": "type.googleapis.com/google.cloud.aiplatform.v1beta1.CreateDeploymentResourcePoolOperationMetadata", "genericMetadata": { "createTime": "2022-06-15T05:48:06.383592Z", "updateTime": "2022-06-15T05:48:06.383592Z" } } }
You can poll for the status of the operation until the response includes "done": true.
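For example, you can poll by sending a GET request to the operation name returned in the response (a sketch; the path is the name field copied verbatim from the response above):

# Check the status of the long-running operation
curl \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_NUMBER/locations/LOCATION_ID/deploymentResourcePools/DEPLOYMENT_RESOURCE_POOL_ID/operations/OPERATION_ID"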
Create Endpoint
Follow these instructions to create an Endpoint. This step is the same as a single-model deployment.
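As one option, you can create an endpoint with the gcloud CLI (a sketch; my-cohosting-endpoint is a hypothetical display name):

# Create a new endpoint in the target region
gcloud ai endpoints create \
  --project=PROJECT_ID \
  --region=LOCATION_ID \
  --display-name=my-cohosting-endpoint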
Retrieve Endpoint ID
Follow these instructions to retrieve the Endpoint ID. This step is the same as a single-model deployment.
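For example, with the gcloud CLI you can list the endpoints in a region and read the ID from the output (a sketch; the filter matches the hypothetical display name from the previous step):

# List endpoints; the ENDPOINT_ID appears in the output
gcloud ai endpoints list \
  --project=PROJECT_ID \
  --region=LOCATION_ID \
  --filter=display_name=my-cohosting-endpoint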
Deploy model in a Deployment Resource Pool
After you create a DeploymentResourcePool and an Endpoint, you are ready to deploy by using the DeployModel API method. This process is similar to a single-model deployment: to use an existing DeploymentResourcePool, specify the shared_resources field of DeployModel with the resource name of the DeploymentResourcePool that you are deploying to.
Cloud Console
In the Google Cloud console, go to the Vertex AI Model Registry page.
Find your model and click Deploy to endpoint.
Under Model settings, select Deploy to a shared deployment resource pool.
REST
Before using any of the request data, make the following replacements:
- LOCATION_ID: The region where you are using Vertex AI.
- PROJECT: Your project ID.
- ENDPOINT_ID: The ID for the endpoint.
- MODEL_ID: The ID for the model to be deployed.
- DEPLOYED_MODEL_NAME: A name for the DeployedModel. You can also use the display name of the Model for the DeployedModel.
- DEPLOYMENT_RESOURCE_POOL_ID: A name for your DeploymentResourcePool. The maximum length is 63 characters, and valid characters are /^[a-z]([a-z0-9-]{0,61}[a-z0-9])?$/.
- TRAFFIC_SPLIT_THIS_MODEL: The percentage of the prediction traffic to this endpoint to be routed to the model being deployed with this operation. Defaults to 100. All traffic percentages must add up to 100. Learn more about traffic splits.
- DEPLOYED_MODEL_ID_N: Optional. If other models are deployed to this endpoint, you must update their traffic split percentages so that all percentages add up to 100.
- TRAFFIC_SPLIT_MODEL_N: The traffic split percentage value for the deployed model id key.
- PROJECT_NUMBER: The project number for your project.
HTTP method and URL:
POST https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel
Request JSON body:
{ "deployedModel": { "model": "projects/PROJECT/locations/us-central1/models/MODEL_ID", "displayName": "DEPLOYED_MODEL_NAME", "sharedResources":"projects/PROJECT/locations/us-central1/deploymentResourcePools/DEPLOYMENT_RESOURCE_POOL_ID" }, "trafficSplit": { "0": TRAFFIC_SPLIT_THIS_MODEL, "DEPLOYED_MODEL_ID_1": TRAFFIC_SPLIT_MODEL_1, "DEPLOYED_MODEL_ID_2": TRAFFIC_SPLIT_MODEL_2 }, }
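To send your request, you can again use curl (a sketch; it assumes the request body above is saved as request.json):

# Deploy the model into the shared Deployment Resource Pool
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel"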
You should receive a JSON response similar to the following:
{ "name": "projects/PROJECT_NUMBER/locations/LOCATION/endpoints/ENDPOINT_ID/operations/OPERATION_ID", "metadata": { "@type": "type.googleapis.com/google.cloud.aiplatform.v1beta1.DeployModelOperationMetadata", "genericMetadata": { "createTime": "2022-06-19T17:53:16.502088Z", "updateTime": "2022-06-19T17:53:16.502088Z" } } }
To deploy multiple models to the same Deployment Resource Pool, repeat the preceding request with different models, specifying the same sharedResources value.
Get predictions
You can send prediction requests to a model in a DeploymentResourcePool as you would to any other model deployed on Vertex AI.
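For example, an online prediction request looks like the following (a sketch; the instances payload shown is a hypothetical numeric input and depends on your model's input schema):

# Send an online prediction request to the endpoint
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"instances": [[1.0, 2.0, 3.0]]}' \
  "https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT/locations/LOCATION_ID/endpoints/ENDPOINT_ID:predict"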