Share resources across deployments

Introduction

By default, a Vertex AI model is deployed to its own virtual machine (VM) instance. Vertex AI also supports co-hosting models on the same VM, which provides the following benefits:

  • Resource sharing across multiple deployments.
  • Cost-effective model serving.
  • Improved utilization of memory and computational resources.

This guide describes how to share resources across multiple deployments on Vertex AI.

Overview

Model co-hosting introduces the concept of a DeploymentResourcePool, which groups model deployments that share resources within a single VM. Multiple endpoints can be deployed on the same VM within a DeploymentResourcePool. Each endpoint has one or more deployed models, and the deployed models under a given endpoint can belong to the same DeploymentResourcePool or to different ones.

In the following example, you have four models and two endpoints:

Co-hosting models from multiple Endpoints

Model_A, Model_B, and Model_C are deployed to Endpoint_1 with traffic routed to all of them. Model_D is deployed to Endpoint_2, which receives 100% of the traffic for that endpoint. Instead of having each model assigned to a separate VM, you can group the models in one of the following ways:

  • Group Model_A and Model_B to share a VM, which makes them a part of DeploymentResourcePool_X.
  • Group Model_C and Model_D (currently not in the same endpoint) to share a VM, which makes them a part of DeploymentResourcePool_Y.

Different Deployment Resource Pools can't share a VM.

Considerations

There is no fixed upper limit on the number of models that can be deployed to a single Deployment Resource Pool; the practical limit depends on the chosen VM shape, model sizes, and traffic patterns. Co-hosting works well when you have many deployed models with sparse traffic, where assigning a dedicated machine to each deployed model would not use resources effectively.

You can deploy models to the same Deployment Resource Pool concurrently. However, there is a limit of 20 concurrent deployment requests at any given time.

CPU utilization increases while a model is being deployed, which can increase latency for existing traffic or trigger autoscaling. For the best experience, avoid sending high traffic to a Deployment Resource Pool while deploying a new model to it.

Existing traffic to a Deployment Resource Pool is not affected when you undeploy a model from it; no impact on CPU utilization or latency is expected while a model is being undeployed.

An empty Deployment Resource Pool doesn't consume your resource quota. Resources are provisioned to a Deployment Resource Pool when the first model is deployed and released when the last model is undeployed.

Models in a single Deployment Resource Pool are not isolated from each other in terms of resources such as CPU and memory. If one model takes up most of the resources, it triggers autoscaling.

Limitations

The following limitations exist when deploying models with resource sharing enabled:

Deploy a model

To deploy a model to a DeploymentResourcePool, complete the following steps:

  1. Create a Deployment Resource Pool if needed.
  2. Create an Endpoint if needed.
  3. Retrieve the Endpoint ID.
  4. Deploy the model to the Endpoint in the Deployment Resource Pool.

Create a Deployment Resource Pool

If you are deploying a model to an existing DeploymentResourcePool, skip this step.

Use CreateDeploymentResourcePool to create a resource pool.

Cloud Console

  1. In the Google Cloud console, go to the Vertex AI Deployment Resource Pools page.

    Go to Deployment Resource Pools

  2. Click Create and fill out the form.

REST

Before using any of the request data, make the following replacements:

  • LOCATION_ID: The region where you are using Vertex AI.
  • PROJECT_ID: Your project ID.
  • MACHINE_TYPE: Optional. The machine resources used for each node of this deployment. Its default setting is n1-standard-2. Learn more about machine types.
  • ACCELERATOR_TYPE: The type of accelerator to be attached to the machine. Optional if ACCELERATOR_COUNT is not specified or is zero. Not recommended for AutoML models or custom-trained models that are using non-GPU images. Learn more.
  • ACCELERATOR_COUNT: The number of accelerators for each replica to use. Optional. Should be zero or unspecified for AutoML models or custom-trained models that are using non-GPU images.
  • MIN_REPLICA_COUNT: The minimum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to the maximum number of nodes and never fewer than this number of nodes. This value must be greater than or equal to 1.
  • MAX_REPLICA_COUNT: The maximum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to this number of nodes and never fewer than the minimum number of nodes.
  • DEPLOYMENT_RESOURCE_POOL_ID: A name for your DeploymentResourcePool. The maximum length is 63 characters, and valid characters are /^[a-z]([a-z0-9-]{0,61}[a-z0-9])?$/.

HTTP method and URL:

POST https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/deploymentResourcePools

Request JSON body:

{
  "deploymentResourcePool":{
    "dedicatedResources":{
      "machineSpec":{
        "machineType":"MACHINE_TYPE",
        "acceleratorType":"ACCELERATOR_TYPE",
        "acceleratorCount":"ACCELERATOR_COUNT"
      },
      "minReplicaCount":MIN_REPLICA_COUNT, 
      "maxReplicaCount":MAX_REPLICA_COUNT
    }
  },
  "deploymentResourcePoolId":"DEPLOYMENT_RESOURCE_POOL_ID"
}

To send your request, you can use any HTTP client, such as curl.
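
A minimal example using curl (a sketch that assumes the request body above is saved in a file named request.json and that the gcloud CLI is installed for authentication):

# request.json contains the request JSON body shown above
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/deploymentResourcePools"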

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/deploymentResourcePools/DEPLOYMENT_RESOURCE_POOL_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1beta1.CreateDeploymentResourcePoolOperationMetadata",
    "genericMetadata": {
      "createTime": "2022-06-15T05:48:06.383592Z",
      "updateTime": "2022-06-15T05:48:06.383592Z"
    }
  }
}

You can poll for the status of the operation until the response includes "done": true.
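For example, you can issue a GET request against the operation name returned in the response (a sketch; the path mirrors the name field above, and the gcloud CLI is assumed for authentication):

# Poll the long-running operation until it reports "done": true
curl -X GET \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_NUMBER/locations/LOCATION_ID/deploymentResourcePools/DEPLOYMENT_RESOURCE_POOL_ID/operations/OPERATION_ID"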

Create Endpoint

Follow these instructions to create an Endpoint. This step is the same as a single-model deployment.
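As a minimal sketch, you can also create an endpoint through the REST API; ENDPOINT_DISPLAY_NAME is a hypothetical display name of your choosing:

# Creates an endpoint with the given display name
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d '{"display_name": "ENDPOINT_DISPLAY_NAME"}' \
  "https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints"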

Retrieve Endpoint ID

Follow these instructions to retrieve the Endpoint ID. This step is the same as a single-model deployment.
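For example, if the gcloud CLI is installed, one way to look up the ID is to list endpoints filtered by display name (a sketch; ENDPOINT_DISPLAY_NAME is the display name you chose when creating the endpoint). The endpoint ID appears in the output:

# Lists endpoints whose display name matches, including their IDs
gcloud ai endpoints list \
  --region=LOCATION_ID \
  --filter=display_name=ENDPOINT_DISPLAY_NAME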

Deploy model in a Deployment Resource Pool

After you create a DeploymentResourcePool and an Endpoint, you are ready to deploy using the DeployModel API method. The process is similar to a single-model deployment, except that you set the shared_resources field of DeployModel to the resource name of the DeploymentResourcePool that you are deploying to.

Cloud Console

  1. In the Google Cloud console, go to the Vertex AI Model Registry page.

    Go to Model Registry

  2. Find your model and click Deploy to endpoint.

  3. Under Model settings, select Deploy to a shared deployment resource pool.

REST

Before using any of the request data, make the following replacements:

  • LOCATION_ID: The region where you are using Vertex AI.
  • PROJECT_ID: Your project ID.
  • ENDPOINT_ID: The ID for the endpoint.
  • MODEL_ID: The ID for the model to be deployed.
  • DEPLOYED_MODEL_NAME: A name for the DeployedModel. You can use the display name of the Model for the DeployedModel as well.
  • DEPLOYMENT_RESOURCE_POOL_ID: A name for your DeploymentResourcePool. The maximum length is 63 characters, and valid characters are /^[a-z]([a-z0-9-]{0,61}[a-z0-9])?$/.
  • TRAFFIC_SPLIT_THIS_MODEL: The percentage of the prediction traffic to this endpoint to be routed to the model being deployed with this operation. Defaults to 100. All traffic percentages must add up to 100. Learn more about traffic splits.
  • DEPLOYED_MODEL_ID_N: Optional. The ID of another model that is already deployed to this endpoint. If other models are deployed to this endpoint, you must update their traffic split percentages so that all percentages add up to 100.
  • TRAFFIC_SPLIT_MODEL_N: The traffic split percentage value for the deployed model id key.
  • PROJECT_NUMBER: Your project's automatically generated project number.

HTTP method and URL:

POST https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel

Request JSON body:

{
  "deployedModel": {
    "model": "projects/PROJECT_ID/locations/LOCATION_ID/models/MODEL_ID",
    "displayName": "DEPLOYED_MODEL_NAME",
    "sharedResources": "projects/PROJECT_ID/locations/LOCATION_ID/deploymentResourcePools/DEPLOYMENT_RESOURCE_POOL_ID"
  },
  "trafficSplit": {
    "0": TRAFFIC_SPLIT_THIS_MODEL,
    "DEPLOYED_MODEL_ID_1": TRAFFIC_SPLIT_MODEL_1,
    "DEPLOYED_MODEL_ID_2": TRAFFIC_SPLIT_MODEL_2
  }
}

To send your request, you can use any HTTP client, such as curl.
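
A minimal example using curl (a sketch that assumes the request body above is saved in a file named request.json and that the gcloud CLI is installed for authentication):

# request.json contains the request JSON body shown above
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel"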

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_NUMBER/locations/LOCATION/endpoints/ENDPOINT_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1beta1.DeployModelOperationMetadata",
    "genericMetadata": {
      "createTime": "2022-06-19T17:53:16.502088Z",
      "updateTime": "2022-06-19T17:53:16.502088Z"
    }
  }
}

To deploy multiple models to the same Deployment Resource Pool, repeat this request for each model, specifying the same sharedResources value.

Get predictions

You can send prediction requests to a model in a DeploymentResourcePool as you would to any other model deployed on Vertex AI.
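
For example, a minimal online prediction request with curl might look like the following sketch; instances.json is a hypothetical file containing a body such as {"instances": [ ... ]}, where the shape of each instance depends on your model's expected input:

# Sends an online prediction request to the endpoint
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @instances.json \
  "https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:predict"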