Deploy a model to an endpoint

You must deploy a model to an endpoint before that model can be used to serve online predictions. Deploying a model associates physical resources with the model so it can serve online predictions with low latency.

You can deploy more than one model to an endpoint, and you can deploy a model to more than one endpoint. For more information about options and use cases for deploying models, see Reasons to deploy more than one model to the same endpoint below.

Deploy a model to an endpoint

Use one of the following methods to deploy a model:

Google Cloud console

  1. In the Google Cloud console, in the Vertex AI section, go to the Models page.

    Go to the Models page

  2. Click the name and version ID of the model you want to deploy to open its details page.

  3. Select the Deploy & Test tab.

    If your model is already deployed to any endpoints, they are listed in the Deploy your model section.

  4. Click Deploy to endpoint.

  5. To deploy your model to a new endpoint, select Create new endpoint, and provide a name for the new endpoint. To deploy your model to an existing endpoint, select Add to existing endpoint, and select the endpoint from the drop-down list.

    You can add more than one model to an endpoint, and you can add a model to more than one endpoint.

  6. If you deploy your model to an existing endpoint that has one or more models deployed to it, you must update the Traffic split percentage for the model you are deploying and the already deployed models so that all of the percentages add up to 100%.

  7. If you're deploying your model to a new endpoint, accept 100 for the Traffic split. Otherwise, adjust the traffic split values for all models on the endpoint so they add up to 100.

  8. Enter the Minimum number of compute nodes you want to provide for your model.

    This is the number of nodes available to this model at all times.

    You are charged for the nodes used, whether to handle prediction load or for standby (minimum) nodes, even without prediction traffic. See the pricing page.

    The number of compute nodes can increase if needed to handle prediction traffic, but it will never go higher than the maximum number of nodes.

  9. To use autoscaling, enter the Maximum number of compute nodes you want Vertex AI to scale up to.

  10. Select your Machine type.

    Larger machine resources increase your prediction performance and increase costs. Compare the available machine types.

  11. Select an Accelerator type and an Accelerator count.

    If you enabled accelerator use when you imported or created the model, this option displays.

    For the accelerator count, refer to the GPU table to check for valid numbers of GPUs that you can use with each CPU machine type. The accelerator count refers to the number of accelerators per node, not the total number of accelerators in your deployment.

  12. If you want to use a custom service account for the deployment, select a service account in the Service account drop-down box.

  13. Learn how to change the default settings for prediction logging.

  14. Click Done for your model, and when all the Traffic split percentages are correct, click Continue.

    The region where your model deploys is displayed. This must be the region where you created your model.

  15. Click Deploy to deploy your model to the endpoint.

API

When you deploy a model using the Vertex AI API, you complete the following steps:

  1. Create an endpoint if needed.
  2. Get the endpoint ID.
  3. Deploy the model to the endpoint.

Create an endpoint

If you are deploying a model to an existing endpoint, you can skip this step.

gcloud

The following example uses the gcloud ai endpoints create command:

gcloud ai endpoints create \
  --region=LOCATION_ID \
  --display-name=ENDPOINT_NAME

Replace the following:

  • LOCATION_ID: The region where you are using Vertex AI.
  • ENDPOINT_NAME: The display name for the endpoint.

The Google Cloud CLI tool might take a few seconds to create the endpoint.

REST

Before using any of the request data, make the following replacements:

  • LOCATION_ID: Your region.
  • PROJECT_ID: Your project ID.
  • ENDPOINT_NAME: The display name for the endpoint.

HTTP method and URL:

POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints

Request JSON body:

{
  "display_name": "ENDPOINT_NAME"
}

To send your request, expand one of these options:

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/endpoints/ENDPOINT_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1.CreateEndpointOperationMetadata",
    "genericMetadata": {
      "createTime": "2020-11-05T17:45:42.812656Z",
      "updateTime": "2020-11-05T17:45:42.812656Z"
    }
  }
}
You can poll for the status of the operation until the response includes "done": true.

Terraform

The following sample uses the google_vertex_ai_endpoint Terraform resource to create an endpoint.

To learn how to apply or remove a Terraform configuration, see Basic Terraform commands.

# Endpoint name must be unique for the project
resource "random_id" "endpoint_id" {
  byte_length = 4
}

resource "google_vertex_ai_endpoint" "default" {
  name         = substr(random_id.endpoint_id.dec, 0, 10)
  display_name = "sample-endpoint"
  description  = "A sample Vertex AI endpoint"
  location     = "us-central1"
  labels = {
    label-one = "value-one"
  }
}

Java

Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.aiplatform.v1.CreateEndpointOperationMetadata;
import com.google.cloud.aiplatform.v1.Endpoint;
import com.google.cloud.aiplatform.v1.EndpointServiceClient;
import com.google.cloud.aiplatform.v1.EndpointServiceSettings;
import com.google.cloud.aiplatform.v1.LocationName;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CreateEndpointSample {

  public static void main(String[] args)
      throws IOException, InterruptedException, ExecutionException, TimeoutException {
    // TODO(developer): Replace these variables before running the sample.
    String project = "YOUR_PROJECT_ID";
    String endpointDisplayName = "YOUR_ENDPOINT_DISPLAY_NAME";
    createEndpointSample(project, endpointDisplayName);
  }

  static void createEndpointSample(String project, String endpointDisplayName)
      throws IOException, InterruptedException, ExecutionException, TimeoutException {
    EndpointServiceSettings endpointServiceSettings =
        EndpointServiceSettings.newBuilder()
            .setEndpoint("us-central1-aiplatform.googleapis.com:443")
            .build();

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (EndpointServiceClient endpointServiceClient =
        EndpointServiceClient.create(endpointServiceSettings)) {
      String location = "us-central1";
      LocationName locationName = LocationName.of(project, location);
      Endpoint endpoint = Endpoint.newBuilder().setDisplayName(endpointDisplayName).build();

      OperationFuture<Endpoint, CreateEndpointOperationMetadata> endpointFuture =
          endpointServiceClient.createEndpointAsync(locationName, endpoint);
      System.out.format("Operation name: %s\n", endpointFuture.getInitialFuture().get().getName());
      System.out.println("Waiting for operation to finish...");
      Endpoint endpointResponse = endpointFuture.get(300, TimeUnit.SECONDS);

      System.out.println("Create Endpoint Response");
      System.out.format("Name: %s\n", endpointResponse.getName());
      System.out.format("Display Name: %s\n", endpointResponse.getDisplayName());
      System.out.format("Description: %s\n", endpointResponse.getDescription());
      System.out.format("Labels: %s\n", endpointResponse.getLabelsMap());
      System.out.format("Create Time: %s\n", endpointResponse.getCreateTime());
      System.out.format("Update Time: %s\n", endpointResponse.getUpdateTime());
    }
  }
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

/**
 * TODO(developer): Uncomment these variables before running the sample.\
 * (Not necessary if passing values as arguments)
 */

// const endpointDisplayName = 'YOUR_ENDPOINT_DISPLAY_NAME';
// const project = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION';

// Imports the Google Cloud Endpoint Service Client library
const {EndpointServiceClient} = require('@google-cloud/aiplatform');

// Specifies the location of the api endpoint
const clientOptions = {
  apiEndpoint: 'us-central1-aiplatform.googleapis.com',
};

// Instantiates a client
const endpointServiceClient = new EndpointServiceClient(clientOptions);

async function createEndpoint() {
  // Configure the parent resource
  const parent = `projects/${project}/locations/${location}`;
  const endpoint = {
    displayName: endpointDisplayName,
  };
  const request = {
    parent,
    endpoint,
  };

  // Get and print out a list of all the endpoints for this resource
  const [response] = await endpointServiceClient.createEndpoint(request);
  console.log(`Long running operation : ${response.name}`);

  // Wait for operation to complete
  await response.promise();
  const result = response.result;

  console.log('Create endpoint response');
  console.log(`\tName : ${result.name}`);
  console.log(`\tDisplay name : ${result.displayName}`);
  console.log(`\tDescription : ${result.description}`);
  console.log(`\tLabels : ${JSON.stringify(result.labels)}`);
  console.log(`\tCreate time : ${JSON.stringify(result.createTime)}`);
  console.log(`\tUpdate time : ${JSON.stringify(result.updateTime)}`);
}
createEndpoint();

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

def create_endpoint_sample(
    project: str,
    display_name: str,
    location: str,
):
    aiplatform.init(project=project, location=location)

    endpoint = aiplatform.Endpoint.create(
        display_name=display_name,
        project=project,
        location=location,
    )

    print(endpoint.display_name)
    print(endpoint.resource_name)
    return endpoint

Retrieve the endpoint ID

You need the endpoint ID to deploy the model.

gcloud

The following example uses the gcloud ai endpoints list command:

gcloud ai endpoints list \
  --region=LOCATION_ID \
  --filter=display_name=ENDPOINT_NAME

Replace the following:

  • LOCATION_ID: The region where you are using Vertex AI.
  • ENDPOINT_NAME: The display name for the endpoint.

Note the number that appears in the ENDPOINT_ID column. Use this ID in the following step.

REST

Before using any of the request data, make the following replacements:

  • LOCATION_ID: The region where you are using Vertex AI.
  • PROJECT_ID: Your project ID.
  • ENDPOINT_NAME: The display name for the endpoint.

HTTP method and URL:

GET https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints?filter=display_name=ENDPOINT_NAME

To send your request, expand one of these options:

You should receive a JSON response similar to the following:

{
  "endpoints": [
    {
      "name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/endpoints/ENDPOINT_ID",
      "displayName": "ENDPOINT_NAME",
      "etag": "AMEw9yPz5pf4PwBHbRWOGh0PcAxUdjbdX2Jm3QO_amguy3DbZGP5Oi_YUKRywIE-BtLx",
      "createTime": "2020-04-17T18:31:11.585169Z",
      "updateTime": "2020-04-17T18:35:08.568959Z"
    }
  ]
}
Note the ENDPOINT_ID.

Deploy the model

Select the tab below for your language or environment:

gcloud

The following examples use the gcloud ai endpoints deploy-model command.

The following example deploys a Model to an Endpoint without using GPUs to accelerate prediction serving and without splitting traffic between multiple DeployedModel resources:

Before using any of the command data below, make the following replacements:

  • ENDPOINT_ID: The ID for the endpoint.
  • LOCATION_ID: The region where you are using Vertex AI.
  • MODEL_ID: The ID for the model to be deployed.
  • DEPLOYED_MODEL_NAME: A name for the DeployedModel. You can use the display name of the Model for the DeployedModel as well.
  • MIN_REPLICA_COUNT: The minimum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to the maximum number of nodes and never fewer than this number of nodes.
  • MAX_REPLICA_COUNT: The maximum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to this number of nodes and never fewer than the minimum number of nodes. If you omit the --max-replica-count flag, then maximum number of nodes is set to the value of --min-replica-count.

Execute the gcloud ai endpoints deploy-model command:

Linux, macOS, or Cloud Shell

gcloud ai endpoints deploy-model ENDPOINT_ID\
  --region=LOCATION_ID \
  --model=MODEL_ID \
  --display-name=DEPLOYED_MODEL_NAME \
  --min-replica-count=MIN_REPLICA_COUNT \
  --max-replica-count=MAX_REPLICA_COUNT \
  --traffic-split=0=100

Windows (PowerShell)

gcloud ai endpoints deploy-model ENDPOINT_ID`
  --region=LOCATION_ID `
  --model=MODEL_ID `
  --display-name=DEPLOYED_MODEL_NAME `
  --min-replica-count=MIN_REPLICA_COUNT `
  --max-replica-count=MAX_REPLICA_COUNT `
  --traffic-split=0=100

Windows (cmd.exe)

gcloud ai endpoints deploy-model ENDPOINT_ID^
  --region=LOCATION_ID ^
  --model=MODEL_ID ^
  --display-name=DEPLOYED_MODEL_NAME ^
  --min-replica-count=MIN_REPLICA_COUNT ^
  --max-replica-count=MAX_REPLICA_COUNT ^
  --traffic-split=0=100
 

Splitting traffic

The --traffic-split=0=100 flag in the preceding examples sends 100% of prediction traffic that the Endpoint receives to the new DeployedModel, which is represented by the temporary ID 0. If your Endpoint already has other DeployedModel resources, then you can split traffic between the new DeployedModel and the old ones. For example, to send 20% of traffic to the new DeployedModel and 80% to an older one, run the following command.

Before using any of the command data below, make the following replacements:

  • OLD_DEPLOYED_MODEL_ID: the ID of the existing DeployedModel.

Execute the gcloud ai endpoints deploy-model command:

Linux, macOS, or Cloud Shell

gcloud ai endpoints deploy-model ENDPOINT_ID\
  --region=LOCATION_ID \
  --model=MODEL_ID \
  --display-name=DEPLOYED_MODEL_NAME \ 
  --min-replica-count=MIN_REPLICA_COUNT \
  --max-replica-count=MAX_REPLICA_COUNT \
  --traffic-split=0=20,OLD_DEPLOYED_MODEL_ID=80

Windows (PowerShell)

gcloud ai endpoints deploy-model ENDPOINT_ID`
  --region=LOCATION_ID `
  --model=MODEL_ID `
  --display-name=DEPLOYED_MODEL_NAME \ 
  --min-replica-count=MIN_REPLICA_COUNT `
  --max-replica-count=MAX_REPLICA_COUNT `
  --traffic-split=0=20,OLD_DEPLOYED_MODEL_ID=80

Windows (cmd.exe)

gcloud ai endpoints deploy-model ENDPOINT_ID^
  --region=LOCATION_ID ^
  --model=MODEL_ID ^
  --display-name=DEPLOYED_MODEL_NAME \ 
  --min-replica-count=MIN_REPLICA_COUNT ^
  --max-replica-count=MAX_REPLICA_COUNT ^
  --traffic-split=0=20,OLD_DEPLOYED_MODEL_ID=80
 

REST

Deploy the model.

Before using any of the request data, make the following replacements:

  • LOCATION_ID: The region where you are using Vertex AI.
  • PROJECT_ID: Your project ID.
  • ENDPOINT_ID: The ID for the endpoint.
  • MODEL_ID: The ID for the model to be deployed.
  • DEPLOYED_MODEL_NAME: A name for the DeployedModel. You can use the display name of the Model for the DeployedModel as well.
  • MACHINE_TYPE: Optional. The machine resources used for each node of this deployment. Its default setting is n1-standard-2. Learn more about machine types.
  • ACCELERATOR_TYPE: The type of accelerator to be attached to the machine. Optional if ACCELERATOR_COUNT is not specified or is zero. Not recommended for AutoML models or custom-trained models that are using non-GPU images. Learn more.
  • ACCELERATOR_COUNT: The number of accelerators for each replica to use. Optional. Should be zero or unspecified for AutoML models or custom-trained models that are using non-GPU images.
  • MIN_REPLICA_COUNT: The minimum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to the maximum number of nodes and never fewer than this number of nodes. This value must be greater than or equal to 1.
  • MAX_REPLICA_COUNT: The maximum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to this number of nodes and never fewer than the minimum number of nodes.
  • TRAFFIC_SPLIT_THIS_MODEL: The percentage of the prediction traffic to this endpoint to be routed to the model being deployed with this operation. Defaults to 100. All traffic percentages must add up to 100. Learn more about traffic splits.
  • DEPLOYED_MODEL_ID_N: Optional. If other models are deployed to this endpoint, you must update their traffic split percentages so that all percentages add up to 100.
  • TRAFFIC_SPLIT_MODEL_N: The traffic split percentage value for the deployed model id key.
  • PROJECT_NUMBER: Your project's automatically generated project number

HTTP method and URL:

POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel

Request JSON body:

{
  "deployedModel": {
    "model": "projects/PROJECT/locations/us-central1/models/MODEL_ID",
    "displayName": "DEPLOYED_MODEL_NAME",
    "dedicatedResources": {
       "machineSpec": {
         "machineType": "MACHINE_TYPE",
         "acceleratorType": "ACCELERATOR_TYPE",
         "acceleratorCount": "ACCELERATOR_COUNT"
       },
       "minReplicaCount": MIN_REPLICA_COUNT,
       "maxReplicaCount": MAX_REPLICA_COUNT
     },
  },
  "trafficSplit": {
    "0": TRAFFIC_SPLIT_THIS_MODEL,
    "DEPLOYED_MODEL_ID_1": TRAFFIC_SPLIT_MODEL_1,
    "DEPLOYED_MODEL_ID_2": TRAFFIC_SPLIT_MODEL_2
  },
}

To send your request, expand one of these options:

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1.DeployModelOperationMetadata",
    "genericMetadata": {
      "createTime": "2020-10-19T17:53:16.502088Z",
      "updateTime": "2020-10-19T17:53:16.502088Z"
    }
  }
}

Java

Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.aiplatform.v1.DedicatedResources;
import com.google.cloud.aiplatform.v1.DeployModelOperationMetadata;
import com.google.cloud.aiplatform.v1.DeployModelResponse;
import com.google.cloud.aiplatform.v1.DeployedModel;
import com.google.cloud.aiplatform.v1.EndpointName;
import com.google.cloud.aiplatform.v1.EndpointServiceClient;
import com.google.cloud.aiplatform.v1.EndpointServiceSettings;
import com.google.cloud.aiplatform.v1.MachineSpec;
import com.google.cloud.aiplatform.v1.ModelName;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutionException;

public class DeployModelCustomTrainedModelSample {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    String project = "PROJECT";
    String endpointId = "ENDPOINT_ID";
    String modelName = "MODEL_NAME";
    String deployedModelDisplayName = "DEPLOYED_MODEL_DISPLAY_NAME";
    deployModelCustomTrainedModelSample(project, endpointId, modelName, deployedModelDisplayName);
  }

  static void deployModelCustomTrainedModelSample(
      String project, String endpointId, String model, String deployedModelDisplayName)
      throws IOException, ExecutionException, InterruptedException {
    EndpointServiceSettings settings =
        EndpointServiceSettings.newBuilder()
            .setEndpoint("us-central1-aiplatform.googleapis.com:443")
            .build();
    String location = "us-central1";

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (EndpointServiceClient client = EndpointServiceClient.create(settings)) {
      MachineSpec machineSpec = MachineSpec.newBuilder().setMachineType("n1-standard-2").build();
      DedicatedResources dedicatedResources =
          DedicatedResources.newBuilder().setMinReplicaCount(1).setMachineSpec(machineSpec).build();

      String modelName = ModelName.of(project, location, model).toString();
      DeployedModel deployedModel =
          DeployedModel.newBuilder()
              .setModel(modelName)
              .setDisplayName(deployedModelDisplayName)
              // `dedicated_resources` must be used for non-AutoML models
              .setDedicatedResources(dedicatedResources)
              .build();
      // key '0' assigns traffic for the newly deployed model
      // Traffic percentage values must add up to 100
      // Leave dictionary empty if endpoint should not accept any traffic
      Map<String, Integer> trafficSplit = new HashMap<>();
      trafficSplit.put("0", 100);
      EndpointName endpoint = EndpointName.of(project, location, endpointId);
      OperationFuture<DeployModelResponse, DeployModelOperationMetadata> response =
          client.deployModelAsync(endpoint, deployedModel, trafficSplit);

      // You can use OperationFuture.getInitialFuture to get a future representing the initial
      // response to the request, which contains information while the operation is in progress.
      System.out.format("Operation name: %s\n", response.getInitialFuture().get().getName());

      // OperationFuture.get() will block until the operation is finished.
      DeployModelResponse deployModelResponse = response.get();
      System.out.format("deployModelResponse: %s\n", deployModelResponse);
    }
  }
}

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

def deploy_model_with_dedicated_resources_sample(
    project,
    location,
    model_name: str,
    machine_type: str,
    endpoint: Optional[aiplatform.Endpoint] = None,
    deployed_model_display_name: Optional[str] = None,
    traffic_percentage: Optional[int] = 0,
    traffic_split: Optional[Dict[str, int]] = None,
    min_replica_count: int = 1,
    max_replica_count: int = 1,
    accelerator_type: Optional[str] = None,
    accelerator_count: Optional[int] = None,
    explanation_metadata: Optional[explain.ExplanationMetadata] = None,
    explanation_parameters: Optional[explain.ExplanationParameters] = None,
    metadata: Optional[Sequence[Tuple[str, str]]] = (),
    sync: bool = True,
):
    """
    model_name: A fully-qualified model resource name or model ID.
          Example: "projects/123/locations/us-central1/models/456" or
          "456" when project and location are initialized or passed.
    """

    aiplatform.init(project=project, location=location)

    model = aiplatform.Model(model_name=model_name)

    # The explanation_metadata and explanation_parameters should only be
    # provided for a custom trained model and not an AutoML model.
    model.deploy(
        endpoint=endpoint,
        deployed_model_display_name=deployed_model_display_name,
        traffic_percentage=traffic_percentage,
        traffic_split=traffic_split,
        machine_type=machine_type,
        min_replica_count=min_replica_count,
        max_replica_count=max_replica_count,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        explanation_metadata=explanation_metadata,
        explanation_parameters=explanation_parameters,
        metadata=metadata,
        sync=sync,
    )

    model.wait()

    print(model.display_name)
    print(model.resource_name)
    return model

Node.js

Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

const automl = require('@google-cloud/automl');
const client = new automl.v1beta1.AutoMlClient();

/**
 * Demonstrates using the AutoML client to create a model.
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const projectId = '[PROJECT_ID]' e.g., "my-gcloud-project";
// const computeRegion = '[REGION_NAME]' e.g., "us-central1";
// const datasetId = '[DATASET_ID]' e.g., "TBL2246891593778855936";
// const tableId = '[TABLE_ID]' e.g., "1991013247762825216";
// const columnId = '[COLUMN_ID]' e.g., "773141392279994368";
// const modelName = '[MODEL_NAME]' e.g., "testModel";
// const trainBudget = '[TRAIN_BUDGET]' e.g., "1000",
// `Train budget in milli node hours`;

// A resource that represents Google Cloud Platform location.
const projectLocation = client.locationPath(projectId, computeRegion);

// Get the full path of the column.
const columnSpecId = client.columnSpecPath(
  projectId,
  computeRegion,
  datasetId,
  tableId,
  columnId
);

// Set target column to train the model.
const targetColumnSpec = {name: columnSpecId};

// Set tables model metadata.
const tablesModelMetadata = {
  targetColumnSpec: targetColumnSpec,
  trainBudgetMilliNodeHours: trainBudget,
};

// Set datasetId, model name and model metadata for the dataset.
const myModel = {
  datasetId: datasetId,
  displayName: modelName,
  tablesModelMetadata: tablesModelMetadata,
};

// Create a model with the model metadata in the region.
client
  .createModel({parent: projectLocation, model: myModel})
  .then(responses => {
    const initialApiResponse = responses[1];
    console.log(`Training operation name: ${initialApiResponse.name}`);
    console.log('Training started...');
  })
  .catch(err => {
    console.error(err);
  });

Learn how to change the default settings for prediction logging.

Get operation status

Some requests start long-running operations that require time to complete. These requests return an operation name, which you can use to view the operation's status or cancel the operation. Vertex AI provides helper methods to make calls against long-running operations. For more information, see Working with long-running operations.

Limitations

  • If you have VPC Service Controls enabled, your deployed model's container won't have access to the internet.

Configure model deployment

During model deployment, you make the following important decisions about how to run online prediction:

Resource created Setting specified at resource creation
Endpoint Location in which to run predictions
Model Container to use (ModelContainerSpec)
DeployedModel Machines to use for online prediction

You can't update the settings listed above after the initial creation of the model or endpoint, and you can't override them in the online prediction request. If you need to change these settings, you must redeploy your model.

What happens when you deploy a model

When you deploy a model to an endpoint, you associate physical (machine) resources with that model so it can serve online predictions. Online predictions have low latency requirements. Providing resources to the model in advance reduces latency.

The model's training type (AutoML or custom) and (AutoML) data type determine the kinds of physical resources available to the model. After model deployment, you can mutate some of those resources without creating a new deployment.

The endpoint resource provides the service endpoint (URL) you use to request the prediction. For example:

https://us-central1-aiplatform.googleapis.com/v1/projects/{project}/locations/{location}/endpoints/{endpoint}:predict

Reasons to deploy more than one model to the same endpoint

Deploying two models to the same endpoint lets you gradually replace one model with the other. For example, suppose you are using a model, and find a way to increase the accuracy of that model with new training data. However, you don't want to update your application to point to a new endpoint URL, and you don't want to create sudden changes in your application. You can add the new model to the same endpoint, serving a small percentage of traffic, and gradually increase the traffic split for the new model until it is serving 100% of the traffic.

Because the resources are associated with the model rather than the endpoint, you could deploy models of different types to the same endpoint. However, the best practice is to deploy models of a specific type (for example, AutoML text, AutoML tabular, custom-trained) to an endpoint. This configuration is easier to manage.

Reasons to deploy a model to more than one endpoint

You might want to deploy your models with different resources for different application environments, such as testing and production. You might also want to support different SLOs for your prediction requests. Perhaps one of your applications has much higher performance needs than the others. In this case, you can deploy that model to a higher-performance endpoint with more machine resources. To optimize costs, you can also deploy the model to a lower-performance endpoint with fewer machine resources.

Scaling behavior

When you deploy a Model for online prediction as a DeployedModel, you can configure prediction nodes to automatically scale. To do this, set dedicatedResources.maxReplicaCount to a greater value than dedicatedResources.minReplicaCount.

When you configure a DeployedModel, you must set dedicatedResources.minReplicaCount to at least 1. In other words, you cannot configure the DeployedModel to scale to 0 prediction nodes when it is unused.

Target utilization and configuration

By default, if you deploy a model without dedicated GPU resources, Vertex AI automatically scales the number of replicas up or down so that CPU usage matches the default 60% target value.

By default, if you deploy a model with dedicated GPU resources (if machineSpec.accelerator_count is above 0), Vertex AI will automatically scale the number of replicas up or down so that the CPU or GPU usage, whichever is higher, matches the default 60% target value. Therefore, if your prediction throughput is causing high GPU usage, but not high CPU usage, Vertex AI will scale up, and the CPU utilization will be very low, which will be visible in monitoring. Conversely, if your custom container is underutilizing the GPU, but has an unrelated process that bring CPU utilization above 60%, Vertex AI will scale up, even if this may not have been needed to achieve QPS and latency targets.

You can override the default threshold metric and target by specifying autoscalingMetricSpecs. Note that if your deployment is configured to scale based only on CPU usage, it will not scale up even if GPU usage is high.

Manage resource usage

You can monitor your endpoint to track metrics like CPU and Accelerator usage, number of requests, latency, and the current and target number of replicas. This information can help you understand your endpoint's resource usage and scaling behavior.

Keep in mind that each replica runs only a single container. This means that if a prediction container cannot fully utilize the selected compute resource, such as single threaded code for a multi-core machine, or a custom model that calls another service as part of making the prediction, your nodes may not scale up.

For example, if you are using FastAPI, or any model server that has a configurable number of workers or threads, there are many cases where having more than one worker can increase resource utilization, which improves the ability for the service to automatically scale the number of replicas.

We generally recommend starting with one worker or thread per core. If you notice that CPU utilization is low, especially under high load, or your model is not scaling up because CPU utilization is low, then increase the number of workers. On the other hand, if you notice that utilization is too high and your latencies increase more than expected under load, then try using fewer workers. If you are already using only a single worker, try using a smaller machine type.

Scaling behavior and lag

Vertex AI adjusts the number of replicas every 15 seconds using data from the previous 5 minutes window. For each 15 second cycle, the system measures the server utilization and generates a target number of replicas based on the following formula:

target # of replicas = Ceil(current # of replicas * (current utilization / target utilization))

For example, if you currently have two replicas that are being utilized at 100%, the target is 4:

4 = Ceil(3.33) = Ceil(2 * (100% / 60%))

Another example, if you currently have 10 replicas and utilization drops to 1%, the target is 1:

1 = Ceil(.167) = Ceil(10 * (1% / 60%))

At the end of each 15 second cycle, the system adjusts the number of replicas to match the highest target value from the previous 5 minutes window. Notice that because the highest target value is chosen, your endpoint will not scale down if there is a spike in utilization during that 5 minute window, even if overall utilization is very low. On the other hand, if the system needs to be scaled up, it will do that within 15 seconds since the highest target value is chosen instead of the average.

Keep in mind that even after Vertex AI adjusts the number of replicas, it takes time to start up or turn down the replicas. Thus there is an additional delay before the endpoint can adjust to the traffic. The main factors that contribute to this time include the following:

  • the time to provision and start the Compute Engine VMs
  • the time to download the container from the registry
  • the time to load the model from storage

The best way to understand the real world scaling behavior of your model is to run a load test and optimize the characteristics that matter for your model and your use case. If the autoscaler is not scaling up fast enough for your application, provision enough min_replicas to handle your expected baseline traffic.

Update the scaling configuration

If you specified either DedicatedResources or AutomaticResources when you deployed the model, you can update the scaling configuration without redeploying the model by calling mutateDeployedModel.

For example, the following request updates max_replica, autoscaling_metric_specs, and disables container logging.

{
  "deployedModel": {
    "id": "2464520679043629056",
    "dedicatedResources": {
      "maxReplicaCount": 9,
      "autoscalingMetricSpecs": [
        {
          "metricName": "aiplatform.googleapis.com/prediction/online/cpu/utilization",
          "target": 50
        }
      ]
    },
    "disableContainerLogging": true
  },
  "update_mask": {
    "paths": [
      "dedicated_resources.max_replica_count",
      "dedicated_resources.autoscaling_metric_specs",
      "disable_container_logging"
    ]
  }
}

Usage notes:

  • You cannot change the machine type or switch from DedicatedResources to AutomaticResources or the other way around. The only scaling configuration fields you can change are: min_replica, max_replica, and AutoscalingMetricSpec (DedicatedResources only).
  • You must list every field you wish to update in updateMask. Unlisted fields are ignored.
  • The DeployedModel must be in a DEPLOYED state. There can be at most one active mutate operation per deployed model.
  • mutateDeployedModel also allows you to enable or disable container logging. For more information, see Online prediction logging.

What's next