Deploy a model to an endpoint

You must deploy a model to an endpoint before that model can be used to serve online predictions. Deploying a model associates physical resources with the model so it can serve online predictions with low latency.

To be deployable, the model must be visible in Vertex AI Model Registry. For information about Model Registry, including how to import model artifacts or create them directly in Model Registry, see Introduction to Vertex AI Model Registry.

You can deploy multiple models to an endpoint, or you can deploy the same model to multiple endpoints. For more information about options and use cases for deploying models, see Reasons to deploy more than one model to the same endpoint.

Deploy a model to an endpoint

Use one of the following methods to deploy a model:

Google Cloud console

In the Google Cloud console, in the Vertex AI section, go to the Models page.

Go to the Models page
Click the name and version ID of the model you want to deploy to open its details page.
Select the Deploy & Test tab.

If your model is already deployed to any endpoints, they are listed in the Deploy your model section.
Click Deploy to endpoint.
To deploy your model to a new endpoint, select Create new endpoint, and provide a name for the new endpoint. To deploy your model to an existing endpoint, select Add to existing endpoint, and select the endpoint from the drop-down list.

You can deploy multiple models to an endpoint, or you can deploy the same model to multiple endpoints.
If you deploy your model to an existing endpoint that has one or more models deployed to it, you must update the Traffic split percentage for the model you are deploying and the already deployed models so that all of the percentages add up to 100%.
If you're deploying your model to a new endpoint, accept 100 for the Traffic split. Otherwise, adjust the traffic split values for all models on the endpoint so they add up to 100.
Enter the Minimum number of compute nodes you want to provide for your model.

This is the number of nodes that need to be available to the model at all times.

You are charged for the nodes used, whether to handle prediction load or for standby (minimum) nodes, even without prediction traffic. See the pricing page.

The number of compute nodes can increase if needed to handle prediction traffic, but it will never go higher than the maximum number of nodes.
To use autoscaling, enter the Maximum number of compute nodes you want Vertex AI to scale up to.
Select your Machine type.

Larger machine resources increase your prediction performance and increase costs. Compare the available machine types.
Select an Accelerator type and an Accelerator count.

If you enabled accelerator use when you imported or created the model, this option displays.

For the accelerator count, refer to the GPU table to check for valid numbers of GPUs that you can use with each CPU machine type. The accelerator count refers to the number of accelerators per node, not the total number of accelerators in your deployment.
If you want to use a custom service account for the deployment, select a service account in the Service account drop-down box.
Learn how to change the default settings for prediction logging.
Click Done for your model, and when all the Traffic split percentages are correct, click Continue.
The region where your model deploys is displayed. This must be the region where you created your model.
Click Deploy to deploy your model to the endpoint.

API

When you deploy a model using the Vertex AI API, you complete the following steps:

Create an endpoint if needed.
Get the endpoint ID.
Deploy the model to the endpoint.

Create an endpoint

If you are deploying a model to an existing endpoint, you can skip this step and go to Get the endpoint ID. To try the dedicated endpoint Preview, skip to Create a dedicated endpoint.

gcloud

The following example uses the gcloud ai endpoints create command:

gcloud ai endpoints create \
  --region=LOCATION_ID \
  --display-name=ENDPOINT_NAME

Replace the following:

LOCATION_ID: The region where you are using Vertex AI.
ENDPOINT_NAME: The display name for the endpoint.

The Google Cloud CLI tool might take a few seconds to create the endpoint.

REST

Before using any of the request data, make the following replacements:

LOCATION_ID: Your region.
PROJECT_ID: Your project ID.
ENDPOINT_NAME: The display name for the endpoint.

HTTP method and URL:

POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints

Request JSON body:

{
  "display_name": "ENDPOINT_NAME"
}

To send your request, expand one of these options:

curl (Linux, macOS, or Cloud Shell)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login , or by using Cloud Shell, which automatically logs you into the gcloud CLI . You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints"

PowerShell (Windows)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login . You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/endpoints/ENDPOINT_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1.CreateEndpointOperationMetadata",
    "genericMetadata": {
      "createTime": "2020-11-05T17:45:42.812656Z",
      "updateTime": "2020-11-05T17:45:42.812656Z"
    }
  }
}

You can poll for the status of the operation until the response includes

"done":
true

Terraform

The following sample uses the google_vertex_ai_endpoint Terraform resource to create an endpoint.

To learn how to apply or remove a Terraform configuration, see Basic Terraform commands.

# Endpoint name must be unique for the project
resource "random_id" "endpoint_id" {
  byte_length = 4
}

resource "google_vertex_ai_endpoint" "default" {
  name         = substr(random_id.endpoint_id.dec, 0, 10)
  display_name = "sample-endpoint"
  description  = "A sample Vertex AI endpoint"
  location     = "us-central1"
  labels = {
    label-one = "value-one"
  }
}

Java

Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.aiplatform.v1.CreateEndpointOperationMetadata;
import com.google.cloud.aiplatform.v1.Endpoint;
import com.google.cloud.aiplatform.v1.EndpointServiceClient;
import com.google.cloud.aiplatform.v1.EndpointServiceSettings;
import com.google.cloud.aiplatform.v1.LocationName;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CreateEndpointSample {

  public static void main(String[] args)
      throws IOException, InterruptedException, ExecutionException, TimeoutException {
    // TODO(developer): Replace these variables before running the sample.
    String project = "YOUR_PROJECT_ID";
    String endpointDisplayName = "YOUR_ENDPOINT_DISPLAY_NAME";
    createEndpointSample(project, endpointDisplayName);
  }

  static void createEndpointSample(String project, String endpointDisplayName)
      throws IOException, InterruptedException, ExecutionException, TimeoutException {
    EndpointServiceSettings endpointServiceSettings =
        EndpointServiceSettings.newBuilder()
            .setEndpoint("us-central1-aiplatform.googleapis.com:443")
            .build();

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (EndpointServiceClient endpointServiceClient =
        EndpointServiceClient.create(endpointServiceSettings)) {
      String location = "us-central1";
      LocationName locationName = LocationName.of(project, location);
      Endpoint endpoint = Endpoint.newBuilder().setDisplayName(endpointDisplayName).build();

      OperationFuture<Endpoint, CreateEndpointOperationMetadata> endpointFuture =
          endpointServiceClient.createEndpointAsync(locationName, endpoint);
      System.out.format("Operation name: %s\n", endpointFuture.getInitialFuture().get().getName());
      System.out.println("Waiting for operation to finish...");
      Endpoint endpointResponse = endpointFuture.get(300, TimeUnit.SECONDS);

      System.out.println("Create Endpoint Response");
      System.out.format("Name: %s\n", endpointResponse.getName());
      System.out.format("Display Name: %s\n", endpointResponse.getDisplayName());
      System.out.format("Description: %s\n", endpointResponse.getDescription());
      System.out.format("Labels: %s\n", endpointResponse.getLabelsMap());
      System.out.format("Create Time: %s\n", endpointResponse.getCreateTime());
      System.out.format("Update Time: %s\n", endpointResponse.getUpdateTime());
    }
  }
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

/**
 * TODO(developer): Uncomment these variables before running the sample.\
 * (Not necessary if passing values as arguments)
 */

// const endpointDisplayName = 'YOUR_ENDPOINT_DISPLAY_NAME';
// const project = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION';

// Imports the Google Cloud Endpoint Service Client library
const {EndpointServiceClient} = require('@google-cloud/aiplatform');

// Specifies the location of the api endpoint
const clientOptions = {
  apiEndpoint: 'us-central1-aiplatform.googleapis.com',
};

// Instantiates a client
const endpointServiceClient = new EndpointServiceClient(clientOptions);

async function createEndpoint() {
  // Configure the parent resource
  const parent = `projects/${project}/locations/${location}`;
  const endpoint = {
    displayName: endpointDisplayName,
  };
  const request = {
    parent,
    endpoint,
  };

  // Get and print out a list of all the endpoints for this resource
  const [response] = await endpointServiceClient.createEndpoint(request);
  console.log(`Long running operation : ${response.name}`);

  // Wait for operation to complete
  await response.promise();
  const result = response.result;

  console.log('Create endpoint response');
  console.log(`\tName : ${result.name}`);
  console.log(`\tDisplay name : ${result.displayName}`);
  console.log(`\tDescription : ${result.description}`);
  console.log(`\tLabels : ${JSON.stringify(result.labels)}`);
  console.log(`\tCreate time : ${JSON.stringify(result.createTime)}`);
  console.log(`\tUpdate time : ${JSON.stringify(result.updateTime)}`);
}
createEndpoint();

Vertex AI SDK for Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Vertex AI SDK for Python API reference documentation.

def create_endpoint_sample(
    project: str,
    display_name: str,
    location: str,
):
    aiplatform.init(project=project, location=location)

    endpoint = aiplatform.Endpoint.create(
        display_name=display_name,
        project=project,
        location=location,
    )

    print(endpoint.display_name)
    print(endpoint.resource_name)
    return endpoint

Create a dedicated endpoint

If you are deploying a model to an existing endpoint, you can skip this step.

A dedicated endpoint is a faster, more stable endpoint with support for larger payload sizes and longer request timeouts.

REST

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-d '{"display_name": "ENDPOINT_NAME", "dedicatedEndpointEnabled": true}' \
https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints

Replace the following:

ENDPOINT_NAME: The display name for the endpoint.
LOCATION_ID: The region where you are using Vertex AI.
PROJECT_ID: The project ID for your Google Cloud project.

Python

endpoint = aiplatform.Endpoint.create(
  display_name="ENDPOINT_NAME",
  dedicated_endpoint_enabled=True,
)

Replace the following:

ENDPOINT_NAME: The display name for the endpoint.

Get the endpoint ID

You need the endpoint ID to deploy the model.

gcloud

The following example uses the gcloud ai endpoints list command:

gcloud ai endpoints list \
  --region=LOCATION_ID \
  --filter=display_name=ENDPOINT_NAME

Replace the following:

LOCATION_ID: The region where you are using Vertex AI.
ENDPOINT_NAME: The display name for the endpoint.

Note the number that appears in the ENDPOINT_ID column. Use this ID in the following step.

REST

Before using any of the request data, make the following replacements:

LOCATION_ID: The region where you are using Vertex AI.
PROJECT_ID: Your project ID.
ENDPOINT_NAME: The display name for the endpoint.

HTTP method and URL:

GET https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints?filter=display_name=ENDPOINT_NAME

To send your request, expand one of these options:

curl (Linux, macOS, or Cloud Shell)

Execute the following command:

curl -X GET \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     "https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints?filter=display_name=ENDPOINT_NAME"

PowerShell (Windows)

Execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method GET `
    -Headers $headers `
    -Uri "https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints?filter=display_name=ENDPOINT_NAME" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "endpoints": [
    {
      "name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/endpoints/ENDPOINT_ID",
      "displayName": "ENDPOINT_NAME",
      "etag": "AMEw9yPz5pf4PwBHbRWOGh0PcAxUdjbdX2Jm3QO_amguy3DbZGP5Oi_YUKRywIE-BtLx",
      "createTime": "2020-04-17T18:31:11.585169Z",
      "updateTime": "2020-04-17T18:35:08.568959Z"
    }
  ]
}

Note the ENDPOINT_ID.

Deploy the model

Select the tab below for your language or environment:

gcloud

The following examples use the gcloud ai endpoints deploy-model command.

The following example deploys a Model to an Endpoint without using GPUs to accelerate prediction serving and without splitting traffic between multiple DeployedModel resources:

Before using any of the command data below, make the following replacements:

ENDPOINT_ID: The ID for the endpoint.
LOCATION_ID: The region where you are using Vertex AI.
MODEL_ID: The ID for the model to be deployed.
DEPLOYED_MODEL_NAME: A name for the DeployedModel. You can use the display name of the Model for the DeployedModel as well.
MIN_REPLICA_COUNT: The minimum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to the maximum number of nodes and never fewer than this number of nodes.
MAX_REPLICA_COUNT: The maximum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to this number of nodes and never fewer than the minimum number of nodes. If you omit the --max-replica-count flag, then maximum number of nodes is set to the value of --min-replica-count.

Execute the gcloud ai endpoints deploy-model command:

Linux, macOS, or Cloud Shell

gcloud ai endpoints deploy-model ENDPOINT_ID\
  --region=LOCATION_ID \
  --model=MODEL_ID \
  --display-name=DEPLOYED_MODEL_NAME \
  --min-replica-count=MIN_REPLICA_COUNT \
  --max-replica-count=MAX_REPLICA_COUNT \
  --traffic-split=0=100

Windows (PowerShell)

gcloud ai endpoints deploy-model ENDPOINT_ID`
  --region=LOCATION_ID `
  --model=MODEL_ID `
  --display-name=DEPLOYED_MODEL_NAME `
  --min-replica-count=MIN_REPLICA_COUNT `
  --max-replica-count=MAX_REPLICA_COUNT `
  --traffic-split=0=100

Windows (cmd.exe)

gcloud ai endpoints deploy-model ENDPOINT_ID^
  --region=LOCATION_ID ^
  --model=MODEL_ID ^
  --display-name=DEPLOYED_MODEL_NAME ^
  --min-replica-count=MIN_REPLICA_COUNT ^
  --max-replica-count=MAX_REPLICA_COUNT ^
  --traffic-split=0=100

Splitting traffic

The --traffic-split=0=100 flag in the preceding examples sends 100% of prediction traffic that the Endpoint receives to the new DeployedModel, which is represented by the temporary ID 0. If your Endpoint already has other DeployedModel resources, then you can split traffic between the new DeployedModel and the old ones. For example, to send 20% of traffic to the new DeployedModel and 80% to an older one, run the following command.

Before using any of the command data below, make the following replacements:

OLD_DEPLOYED_MODEL_ID: the ID of the existing DeployedModel.

Execute the gcloud ai endpoints deploy-model command:

Linux, macOS, or Cloud Shell

gcloud ai endpoints deploy-model ENDPOINT_ID\
  --region=LOCATION_ID \
  --model=MODEL_ID \
  --display-name=DEPLOYED_MODEL_NAME \ 
  --min-replica-count=MIN_REPLICA_COUNT \
  --max-replica-count=MAX_REPLICA_COUNT \
  --traffic-split=0=20,OLD_DEPLOYED_MODEL_ID=80

Windows (PowerShell)

gcloud ai endpoints deploy-model ENDPOINT_ID`
  --region=LOCATION_ID `
  --model=MODEL_ID `
  --display-name=DEPLOYED_MODEL_NAME \ 
  --min-replica-count=MIN_REPLICA_COUNT `
  --max-replica-count=MAX_REPLICA_COUNT `
  --traffic-split=0=20,OLD_DEPLOYED_MODEL_ID=80

Windows (cmd.exe)

gcloud ai endpoints deploy-model ENDPOINT_ID^
  --region=LOCATION_ID ^
  --model=MODEL_ID ^
  --display-name=DEPLOYED_MODEL_NAME \ 
  --min-replica-count=MIN_REPLICA_COUNT ^
  --max-replica-count=MAX_REPLICA_COUNT ^
  --traffic-split=0=20,OLD_DEPLOYED_MODEL_ID=80

REST

Deploy the model.

Before using any of the request data, make the following replacements:

LOCATION_ID: The region where you are using Vertex AI.
PROJECT_ID: Your project ID.
ENDPOINT_ID: The ID for the endpoint.
MODEL_ID: The ID for the model to be deployed.
DEPLOYED_MODEL_NAME: A name for the DeployedModel. You can use the display name of the Model for the DeployedModel as well.
MACHINE_TYPE: Optional. The machine resources used for each node of this deployment. Its default setting is n1-standard-2. Learn more about machine types.
ACCELERATOR_TYPE: The type of accelerator to be attached to the machine. Optional if ACCELERATOR_COUNT is not specified or is zero. Not recommended for AutoML models or custom-trained models that are using non-GPU images. Learn more.
ACCELERATOR_COUNT: The number of accelerators for each replica to use. Optional. Should be zero or unspecified for AutoML models or custom-trained models that are using non-GPU images.
MIN_REPLICA_COUNT: The minimum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to the maximum number of nodes and never fewer than this number of nodes. This value must be greater than or equal to 1.
MAX_REPLICA_COUNT: The maximum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to this number of nodes and never fewer than the minimum number of nodes.
REQUIRED_REPLICA_COUNT: Optional. The required number of nodes for this deployment to be marked as successful. Must be greater than or equal to 1 and fewer than or equal to the minimum number of nodes. If not specified, the default value is the minimum number of nodes.
TRAFFIC_SPLIT_THIS_MODEL: The percentage of the prediction traffic to this endpoint to be routed to the model being deployed with this operation. Defaults to 100. All traffic percentages must add up to 100. Learn more about traffic splits.
DEPLOYED_MODEL_ID_N: Optional. If other models are deployed to this endpoint, you must update their traffic split percentages so that all percentages add up to 100.
TRAFFIC_SPLIT_MODEL_N: The traffic split percentage value for the deployed model id key.
PROJECT_NUMBER: Your project's automatically generated project number

HTTP method and URL:

POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel

Request JSON body:

{
  "deployedModel": {
    "model": "projects/PROJECT/locations/us-central1/models/MODEL_ID",
    "displayName": "DEPLOYED_MODEL_NAME",
    "dedicatedResources": {
       "machineSpec": {
         "machineType": "MACHINE_TYPE",
         "acceleratorType": "ACCELERATOR_TYPE",
         "acceleratorCount": "ACCELERATOR_COUNT"
       },
       "minReplicaCount": MIN_REPLICA_COUNT,
       "maxReplicaCount": MAX_REPLICA_COUNT,
       "requiredReplicaCount": REQUIRED_REPLICA_COUNT
     },
  },
  "trafficSplit": {
    "0": TRAFFIC_SPLIT_THIS_MODEL,
    "DEPLOYED_MODEL_ID_1": TRAFFIC_SPLIT_MODEL_1,
    "DEPLOYED_MODEL_ID_2": TRAFFIC_SPLIT_MODEL_2
  },
}

To send your request, expand one of these options:

curl (Linux, macOS, or Cloud Shell)

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel"

PowerShell (Windows)

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1.DeployModelOperationMetadata",
    "genericMetadata": {
      "createTime": "2020-10-19T17:53:16.502088Z",
      "updateTime": "2020-10-19T17:53:16.502088Z"
    }
  }
}

Java

Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.aiplatform.v1.DedicatedResources;
import com.google.cloud.aiplatform.v1.DeployModelOperationMetadata;
import com.google.cloud.aiplatform.v1.DeployModelResponse;
import com.google.cloud.aiplatform.v1.DeployedModel;
import com.google.cloud.aiplatform.v1.EndpointName;
import com.google.cloud.aiplatform.v1.EndpointServiceClient;
import com.google.cloud.aiplatform.v1.EndpointServiceSettings;
import com.google.cloud.aiplatform.v1.MachineSpec;
import com.google.cloud.aiplatform.v1.ModelName;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutionException;

public class DeployModelCustomTrainedModelSample {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    String project = "PROJECT";
    String endpointId = "ENDPOINT_ID";
    String modelName = "MODEL_NAME";
    String deployedModelDisplayName = "DEPLOYED_MODEL_DISPLAY_NAME";
    deployModelCustomTrainedModelSample(project, endpointId, modelName, deployedModelDisplayName);
  }

  static void deployModelCustomTrainedModelSample(
      String project, String endpointId, String model, String deployedModelDisplayName)
      throws IOException, ExecutionException, InterruptedException {
    EndpointServiceSettings settings =
        EndpointServiceSettings.newBuilder()
            .setEndpoint("us-central1-aiplatform.googleapis.com:443")
            .build();
    String location = "us-central1";

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (EndpointServiceClient client = EndpointServiceClient.create(settings)) {
      MachineSpec machineSpec = MachineSpec.newBuilder().setMachineType("n1-standard-2").build();
      DedicatedResources dedicatedResources =
          DedicatedResources.newBuilder().setMinReplicaCount(1).setMachineSpec(machineSpec).build();

      String modelName = ModelName.of(project, location, model).toString();
      DeployedModel deployedModel =
          DeployedModel.newBuilder()
              .setModel(modelName)
              .setDisplayName(deployedModelDisplayName)
              // `dedicated_resources` must be used for non-AutoML models
              .setDedicatedResources(dedicatedResources)
              .build();
      // key '0' assigns traffic for the newly deployed model
      // Traffic percentage values must add up to 100
      // Leave dictionary empty if endpoint should not accept any traffic
      Map<String, Integer> trafficSplit = new HashMap<>();
      trafficSplit.put("0", 100);
      EndpointName endpoint = EndpointName.of(project, location, endpointId);
      OperationFuture<DeployModelResponse, DeployModelOperationMetadata> response =
          client.deployModelAsync(endpoint, deployedModel, trafficSplit);

      // You can use OperationFuture.getInitialFuture to get a future representing the initial
      // response to the request, which contains information while the operation is in progress.
      System.out.format("Operation name: %s\n", response.getInitialFuture().get().getName());

      // OperationFuture.get() will block until the operation is finished.
      DeployModelResponse deployModelResponse = response.get();
      System.out.format("deployModelResponse: %s\n", deployModelResponse);
    }
  }
}

Vertex AI SDK for Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Vertex AI SDK for Python API reference documentation.

def deploy_model_with_dedicated_resources_sample(
    project,
    location,
    model_name: str,
    machine_type: str,
    endpoint: Optional[aiplatform.Endpoint] = None,
    deployed_model_display_name: Optional[str] = None,
    traffic_percentage: Optional[int] = 0,
    traffic_split: Optional[Dict[str, int]] = None,
    min_replica_count: int = 1,
    max_replica_count: int = 1,
    accelerator_type: Optional[str] = None,
    accelerator_count: Optional[int] = None,
    explanation_metadata: Optional[explain.ExplanationMetadata] = None,
    explanation_parameters: Optional[explain.ExplanationParameters] = None,
    metadata: Optional[Sequence[Tuple[str, str]]] = (),
    sync: bool = True,
):
    """
    model_name: A fully-qualified model resource name or model ID.
          Example: "projects/123/locations/us-central1/models/456" or
          "456" when project and location are initialized or passed.
    """

    aiplatform.init(project=project, location=location)

    model = aiplatform.Model(model_name=model_name)

    # The explanation_metadata and explanation_parameters should only be
    # provided for a custom trained model and not an AutoML model.
    model.deploy(
        endpoint=endpoint,
        deployed_model_display_name=deployed_model_display_name,
        traffic_percentage=traffic_percentage,
        traffic_split=traffic_split,
        machine_type=machine_type,
        min_replica_count=min_replica_count,
        max_replica_count=max_replica_count,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        explanation_metadata=explanation_metadata,
        explanation_parameters=explanation_parameters,
        metadata=metadata,
        sync=sync,
    )

    model.wait()

    print(model.display_name)
    print(model.resource_name)
    return model

Node.js

Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

const automl = require('@google-cloud/automl');
const client = new automl.v1beta1.AutoMlClient();

/**
 * Demonstrates using the AutoML client to create a model.
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const projectId = '[PROJECT_ID]' e.g., "my-gcloud-project";
// const computeRegion = '[REGION_NAME]' e.g., "us-central1";
// const datasetId = '[DATASET_ID]' e.g., "TBL2246891593778855936";
// const tableId = '[TABLE_ID]' e.g., "1991013247762825216";
// const columnId = '[COLUMN_ID]' e.g., "773141392279994368";
// const modelName = '[MODEL_NAME]' e.g., "testModel";
// const trainBudget = '[TRAIN_BUDGET]' e.g., "1000",
// `Train budget in milli node hours`;

// A resource that represents Google Cloud Platform location.
const projectLocation = client.locationPath(projectId, computeRegion);

// Get the full path of the column.
const columnSpecId = client.columnSpecPath(
  projectId,
  computeRegion,
  datasetId,
  tableId,
  columnId
);

// Set target column to train the model.
const targetColumnSpec = {name: columnSpecId};

// Set tables model metadata.
const tablesModelMetadata = {
  targetColumnSpec: targetColumnSpec,
  trainBudgetMilliNodeHours: trainBudget,
};

// Set datasetId, model name and model metadata for the dataset.
const myModel = {
  datasetId: datasetId,
  displayName: modelName,
  tablesModelMetadata: tablesModelMetadata,
};

// Create a model with the model metadata in the region.
client
  .createModel({parent: projectLocation, model: myModel})
  .then(responses => {
    const initialApiResponse = responses[1];
    console.log(`Training operation name: ${initialApiResponse.name}`);
    console.log('Training started...');
  })
  .catch(err => {
    console.error(err);
  });

Learn how to change the default settings for prediction logging.

Get operation status

Some requests start long-running operations that require time to complete. These requests return an operation name, which you can use to view the operation's status or cancel the operation. Vertex AI provides helper methods to make calls against long-running operations. For more information, see Working with long-running operations.

Limitations

If you have VPC Service Controls enabled, your deployed model's container won't have access to the internet.

Configure model deployment

During model deployment, you make the following important decisions about how to run online prediction:

Resource created	Setting specified at resource creation
Endpoint	Location in which to run predictions
Model	Container to use (`ModelContainerSpec`)
DeployedModel	Machines to use for online prediction

You can't update these settings listed after the initial creation of the model or endpoint, and you can't override them in the online prediction request. If you need to change these settings, you must redeploy your model.

What happens when you deploy a model

When you deploy a model to an endpoint, you associate physical (machine) resources with that model so it can serve online predictions. Online predictions have low latency requirements. Providing resources to the model in advance reduces latency.

The model's training type (AutoML or custom) and (AutoML) data type determine the kinds of physical resources available to the model. After model deployment, you can mutate some of those resources without creating a new deployment.

The endpoint resource provides the service endpoint (URL) you use to request the prediction. For example:

https://us-central1-aiplatform.googleapis.com/v1/projects/{project}/locations/{location}/endpoints/{endpoint}:predict

Reasons to deploy more than one model to the same endpoint

Deploying two models to the same endpoint lets you gradually replace one model with the other. For example, suppose you are using a model, and find a way to increase the accuracy of that model with new training data. However, you don't want to update your application to point to a new endpoint URL, and you don't want to create sudden changes in your application. You can add the new model to the same endpoint, serving a small percentage of traffic, and gradually increase the traffic split for the new model until it is serving 100% of the traffic.

Because the resources are associated with the model rather than the endpoint, you could deploy models of different types to the same endpoint. However, the best practice is to deploy models of a specific type (for example, AutoML tabular, custom-trained) to an endpoint. This configuration is easier to manage.

Reasons to deploy a model to more than one endpoint

You might want to deploy your models with different resources for different application environments, such as testing and production. You might also want to support different SLOs for your prediction requests. Perhaps one of your applications has much higher performance needs than the others. In this case, you can deploy that model to a higher-performance endpoint with more machine resources. To optimize costs, you can also deploy the model to a lower-performance endpoint with fewer machine resources.

Scaling behavior

When you deploy a model for online prediction as a DeployedModel, you can configure prediction nodes to automatically scale. To do this, set dedicatedResources.maxReplicaCount to a greater value than dedicatedResources.minReplicaCount.

When you configure a DeployedModel, you must set dedicatedResources.minReplicaCount to at least 1. In other words, you cannot configure the DeployedModel to scale to 0 prediction nodes when it is unused.

By default, the deployment operation is only considered successful if the number of prediction nodes reaches dedicatedResources.minReplicaCount before the deployment request timeout value. Otherwise, the deployment is marked as failed, and the underlying resources are released.

Partially successful deployment and mutation

You can modify the default deployment behavior by setting dedicatedResources.requiredReplicaCount to a value that is less than dedicatedResources.minReplicaCount. In this case, when the number of prediction nodes reaches dedicatedResources.requiredReplicaCount, the deployment operation is marked as successful, even though it is not yet complete. Deployment continues until dedicatedResources.minReplicaCount is reached. If dedicatedResources.minReplicaCount isn't reached before the deployment request time, the operation is still successful, but an error message for the failed replicas is returned in DeployedModel.status.message.

Quota for Custom model serving is calculated based on your deployed model's real-time usage of compute resources. If the sum of maxReplicaCount for all the deployments in your project is more than your project's quota, some deployments may fail to autoscale due to quota being exhausted.

Endpoints are scaled up and down per machine, but quota is calculated per CPU or GPU. For example, if your model is deployed to a2-highgpu-2g machine type, each active replica will count as 24 CPUs and 2 GPUs against your project's quota. For more information, see Quota and limits.

The prediction nodes for batch prediction don't automatically scale. Vertex AI uses BatchDedicatedResources.startingReplicaCount and ignores BatchDedicatedResources.maxReplicaCount.

Target utilization and configuration

By default, if you deploy a model without dedicated GPU resources, Vertex AI automatically scales the number of replicas up or down so that CPU usage matches the default 60% target value.

By default, if you deploy a model with dedicated GPU resources (if machineSpec.accelerator_count is greater than 0), Vertex AI will automatically scale the number of replicas up or down so that the CPU or GPU usage, whichever is higher, matches the default 60% target value. Therefore, if your prediction throughput is causing high GPU usage, but not high CPU usage, Vertex AI will scale up, and the CPU utilization will be very low, which will be visible in monitoring. Conversely, if your custom container is underutilizing the GPU, but has an unrelated process that raise CPU utilization higher than 60%, Vertex AI will scale up, even if this may not have been needed to achieve QPS and latency targets.

You can override the default threshold metric and target by specifying autoscalingMetricSpecs. Note that if your deployment is configured to scale based only on CPU usage, it won't scale up even if GPU usage is high.

Manage resource usage

You can monitor your endpoint to track metrics like CPU and Accelerator usage, number of requests, latency, and the current and target number of replicas. This information can help you understand your endpoint's resource usage and scaling behavior.

Keep in mind that each replica runs only a single container. This means that if a prediction container can't fully use the selected compute resource, such as single threaded code for a multi-core machine, or a custom model that calls another service as part of making the prediction, your nodes may not scale up.

For example, if you are using FastAPI, or any model server that has a configurable number of workers or threads, there are many cases where having more than one worker can increase resource utilization, which improves the ability for the service to automatically scale the number of replicas.

We generally recommend starting with one worker or thread per core. If you notice that CPU utilization is low, especially under high load, or your model isn't scaling up because CPU utilization is low, then increase the number of workers. On the other hand, if you notice that utilization is too high and your latencies increase more than expected under load, try using fewer workers. If you are already using only a single worker, try using a smaller machine type.

Scaling behavior and lag

Vertex AI adjusts the number of replicas every 15 seconds using data from the previous 5 minutes window. For each 15 second cycle, the system measures the server utilization and generates a target number of replicas based on the following formula:

target # of replicas = Ceil(current # of replicas * (current utilization / target utilization))

For example, if you have two replicas that are being utilized at 100%, the target is 4:

4 = Ceil(3.33) = Ceil(2 * (100% / 60%))

Another example, if you have 10 replicas and utilization drops to 1%, the target is 1:

1 = Ceil(.167) = Ceil(10 * (1% / 60%))

At the end of each 15 second cycle, the system adjusts the number of replicas to match the highest target value from the previous 5 minutes window. Notice that because the highest target value is chosen, your endpoint won't scale down if there is a spike in utilization during that 5 minute window, even if overall utilization is very low. On the other hand, if the system needs to be scaled up, it will do that within 15 seconds since the highest target value is chosen instead of the average.

Keep in mind that even after Vertex AI adjusts the number of replicas, it takes time to start up or turn down the replicas. Thus there is an additional delay before the endpoint can adjust to the traffic. The main factors that contribute to this time include the following:

the time to provision and start the Compute Engine VMs
the time to download the container from the registry
the time to load the model from storage

The best way to understand the real world scaling behavior of your model is to run a load test and optimize the characteristics that matter for your model and your use case. If the autoscaler isn't scaling up fast enough for your application, provision enough min_replicas to handle your expected baseline traffic.

Update the scaling configuration

If you specified either DedicatedResources or AutomaticResources when you deployed the model, you can update the scaling configuration without redeploying the model by calling mutateDeployedModel.

For example, the following request updates max_replica, autoscaling_metric_specs, and disables container logging.

{
  "deployedModel": {
    "id": "2464520679043629056",
    "dedicatedResources": {
      "maxReplicaCount": 9,
      "autoscalingMetricSpecs": [
        {
          "metricName": "aiplatform.googleapis.com/prediction/online/cpu/utilization",
          "target": 50
        }
      ]
    },
    "disableContainerLogging": true
  },
  "update_mask": {
    "paths": [
      "dedicated_resources.max_replica_count",
      "dedicated_resources.autoscaling_metric_specs",
      "disable_container_logging"
    ]
  }
}

Usage notes:

You can't change the machine type or switch from DedicatedResources to AutomaticResources or the other way around. The only scaling configuration fields you can change are: min_replica, max_replica, required_replica, and AutoscalingMetricSpec (DedicatedResources only).
You must list every field you need to update in updateMask. Unlisted fields are ignored.
The DeployedModel must be in a DEPLOYED state. There can be at most one active mutate operation per deployed model.
mutateDeployedModel also lets you enable or disable container logging. For more information, see Online prediction logging.

Use a rolling deployment to replace a deployed model

In a rolling deployment, a deployed model is replaced with a new version of the same model. The new model reuses the compute resources from the previous one.

In the rolling deployment request, the traffic split and dedicatedResources values are the same as for the previous deployment. After the rolling deployment completes, the traffic split is updated to show that all of the traffic from the previous DeployedModel has migrated to the new deployment.

Other configurable fields in DeployedModel (such as serviceAccount, disableContainerLogging, and enableAccessLogging) are set to the same values as for the previous DeployedModel by default. However, you can optionally specify new values for these fields.

When a model is deployed using a rolling deployment, a new DeployedModel is created. The new DeployedModel receives a new ID that is different from that of the previous one. It also receives a new revisionNumber value in the rolloutOptions field.

If there are multiple rolling deployments targeting the same backing resources, the DeployedModel with the highest revisionNumber is treated as the intended final state.

As the rolling deployment progresses, all the existing replicas for the previous DeployedModel are replaced with replicas of the new DeployedModel. This happens quickly, and replicas are updated whenever the deployment has enough available replicas or enough surge capacity to bring up additional replicas.

Additionally, as the rolling deployment progresses, the traffic for the old DeployedModel is gradually migrated to the new DeployedModel. The traffic is load-balanced in proportion to the number of ready-to-serve replicas of each DeployedModel.

If the rolling deployment's new replicas never become ready because their health route consistently returns a non-200 response code, traffic isn't sent to those unready replicas. In this case, the rolling deployment eventually fails, and the replicas are reverted to the previous DeployedModel.

Start a rolling deployment

To start a rolling deployment, include the rolloutOptions field in the model deployment request as shown in the following example.

REST

Before using any of the request data, make the following replacements:

LOCATION_ID: The region where you are using Vertex AI.
PROJECT_ID: Your project ID.
ENDPOINT_ID: The ID for the endpoint.
MODEL_ID: The ID for the model to be deployed.
PREVIOUS_DEPLOYED_MODEL: The DeployedModel ID of a model on the same endpoint. This specifies the DeployedModel whose backing resources are to be reused. You can call GetEndpoint to get a list of deployed models on an endpoint along with their numeric IDs.
MAX_UNAVAILABLE_REPLICAS: The number of model replicas that can be taken down during the rolling deployment.
MAX_SURGE_REPLICAS: The number of additional model replicas that can be brought up during the rolling deployment. If this is set to zero, then only the existing capacity is used.

HTTP method and URL:

POST https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel

Request JSON body:

{
  "deployedModel": {
    "model": "projects/PROJECT_ID/locations/LOCATION_ID/models/MODEL_ID",
    "rolloutOptions": {
      "previousDeployedModel": "PREVIOUS_DEPLOYED_MODEL",
      "maxUnavailableReplicas": "MAX_UNAVAILABLE_REPLICAS",
      "maxSurgeReplicas": "MAX_SURGE_REPLICAS"
    }
  }
}

To send your request, expand one of these options:

curl (Linux, macOS, or Cloud Shell)

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel"

PowerShell (Windows)

Save the request body in a file named request.json, and execute the following command:

$headers = @{  }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel" | Select-Object -Expand Content

You should receive a successful status code (2xx) and an empty response.

If desired, you can replace maxSurgeReplicas and maxUnavailableReplicas, or both, with percentage values, as shown in the following example.

REST

Before using any of the request data, make the following replacements:

MAX_UNAVAILABLE_PERCENTAGE: The percentage of model replicas that can be taken down during the rolling deployment.
MAX_SURGE_PERCENTAGE: The percentage of additional model replicas that can be brought up during the rolling deployment. If this is set to zero, then only the existing capacity is used.

HTTP method and URL:

POST https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel

Request JSON body:

{
  "deployedModel": {
    "model": "projects/PROJECT/locations/LOCATION_ID/models/MODEL_ID",
    "rolloutOptions": {
      "previousDeployedModel": "PREVIOUS_DEPLOYED_MODEL",
      "maxUnavailablePercentage": "MAX_UNAVAILABLE_PERCENTAGE",
      "maxSurgePercentage": "MAX_SURGE_PERCENTAGE"
    }
  }
}

To send your request, expand one of these options:

curl (Linux, macOS, or Cloud Shell)

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel"

PowerShell (Windows)

Save the request body in a file named request.json, and execute the following command:

$headers = @{  }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel" | Select-Object -Expand Content

You should receive a successful status code (2xx) and an empty response.

Roll back a rolling deployment

To roll back a rolling deployment, start a new rolling deployment of the previous model, using the ongoing rolling deployment's DeployedModel ID as the previousDeployedModel.

To get the DeployedModel ID for an ongoing deployment, set the parameter allDeploymentStates=true in the call to GetEndpoint, as shown in the following example.

REST

Before using any of the request data, make the following replacements:

LOCATION_ID: The region where you are using Vertex AI.
PROJECT_ID: Your project ID.
ENDPOINT_ID: The ID for the endpoint.

HTTP method and URL:

GET https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID?allDeploymentStates=true

To send your request, expand one of these options:

curl (Linux, macOS, or Cloud Shell)

Execute the following command:

curl -X GET \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     "https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID?allDeploymentStates=true"

PowerShell (Windows)

Execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method GET `
    -Headers $headers `
    -Uri "https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID?allDeploymentStates=true" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID",
  "displayName": "rolling-deployments-endpoint",
  "deployedModels": [
    {
      "id": "2718281828459045",
      "model": "projects/PROJECT_ID/locations/LOCATION_ID/models/MODEL_ID@1",
      "displayName": "rd-test-model",
      "createTime": "2024-09-11T21:37:48.522692Z",
      "dedicatedResources": {
        "machineSpec": {
          "machineType": "e2-standard-2"
        },
        "minReplicaCount": 5,
        "maxReplicaCount": 5
      },
      "modelVersionId": "1",
      "state": "BEING_DEPLOYED"
    }
  ],
  "etag": "AMEw9yMs3TdZMn8CUg-3DY3wS74bkIaTDQhqJ7-Ld_Zp7wgT8gsEfJlrCOyg67lr9dwn",
  "createTime": "2024-09-11T21:22:36.588538Z",
  "updateTime": "2024-09-11T21:27:28.563579Z",
  "dedicatedEndpointEnabled": true,
  "dedicatedEndpointDns": "ENDPOINT_ID.LOCATION_ID-PROJECT_ID.prediction.vertexai.goog"
}

Constraints and limitations

The previous DeployedModel must be on the same endpoint as the new DeployedModel.
You can't create multiple rolling deployments with the same previousDeployedModel.
You can't create rolling deployments on top of a DeployedModel that isn't fully deployed. Exception: If previousDeployedModel is itself an in-progress rolling deployment, then a new rolling deployment can be created on top of it. This allows for rolling back deployments that start to fail.
Previous models don't automatically undeploy after a rolling deployment completes successfully. You can undeploy the model manually.
For rolling deployments on endpoints that aren't dedicated endpoints, the predictRoute and healthRoute for the new model must be the same as for the previous model.
Rolling deployments aren't compatible with model cohosting.
Rolling deployments can't be used for models that require online explanations.

Undeploy a model and delete the endpoint

Use one of the following methods to undeploy a model and delete the endpoint.

Google Cloud console

Undeploy the model as follows:
1. In the Google Cloud console, in the Vertex AI section, go to the Endpoints page.
  
  Go to the Endpoints page
2. Click the name and version ID of the model you want to undeploy to open its details page.
3. On the row for your model, click Actions, and then click Undeploy model from endpoint.
4. In the Undeploy model from endpoint dialog, click Undeploy.
5. To delete additional models, repeat the preceding steps.
Optional: Delete the online prediction endpoint as follows:
1. In the Google Cloud console, in the Vertex AI section, go to the Online prediction page.
  
  Go to Online prediction
2. Select the endpoint.
3. To delete the endpoint, click Actions, and then click Delete endpoint.

gcloud

List the endpoint IDs for all endpoints in your project:
```
gcloud ai endpoints list \
    --project=PROJECT_ID \
    --region=LOCATION_ID
```
Replace PROJECT_ID with your project name and LOCATION_ID with the region where you are using Vertex AI.

List the model IDs for the models that are deployed to an endpoint:

gcloud ai endpoints describe ENDPOINT_ID \
    --project=PROJECT_ID \
    --region=LOCATION_ID

Replace ENDPOINT_ID with the endpoint ID.

Undeploy a model from the endpoint:

gcloud ai endpoints undeploy-model ENDPOINT_ID \
    --project=PROJECT_ID \
    --region=LOCATION_ID \
    --deployed-model-id=DEPLOYED_MODEL_ID

Replace DEPLOYED_MODEL_ID with the model ID.

Optional: Delete the online prediction endpoint:

gcloud ai endpoints delete ENDPOINT_ID \
    --project=PROJECT_ID \
    --region=LOCATION_ID

What's next

Learn how to get an online prediction.
Learn about private endpoints.