Get online predictions and explanations

This page shows you how to get online (real-time) predictions and explanations from your tabular classification or regression models using the Google Cloud console or the Vertex AI API.

An online prediction is a synchronous request as opposed to a batch prediction, which is an asynchronous request. Use online predictions when you are making requests in response to application input or in other situations where you require timely inference.

You must deploy a model to an endpoint before that model can be used to serve online predictions. Deploying a model associates physical resources with the model so it can serve online predictions with low latency.

The topics covered are:

  1. Deploy a model to an endpoint
  2. Get an online prediction using your deployed model
  3. Get an online explanation using your deployed model

Before you begin

Before you can get online predictions, you must first train a classification or regression model and evaluate it for accuracy.

Deploy a model to an endpoint

You can deploy more than one model to an endpoint, and you can deploy a model to more than one endpoint. For more information about options and use cases for deploying models, see About deploying models.

Use one of the following methods to deploy a model:

Google Cloud console

  1. In the Google Cloud console, in the Vertex AI section, go to the Models page.

    Go to the Models page

  2. Click the name of the model you want to deploy to open its details page.

  3. Select the Deploy & Test tab.

    If your model is already deployed to any endpoints, they are listed in the Deploy your model section.

  4. Click Deploy to endpoint.

  5. In the Define your endpoint page, configure as follows:

    1. You can choose to deploy your model to a new endpoint or an existing endpoint.

      • To deploy your model to a new endpoint, select Create new endpoint and provide a name for the new endpoint.
      • To deploy your model to an existing endpoint, select Add to existing endpoint and select the endpoint from the drop-down list.
      • You can add more than one model to an endpoint, and you can add a model to more than one endpoint. Learn more.
    2. Click Continue.

  6. In the Model settings page, configure as follows:

    1. If you're deploying your model to a new endpoint, accept 100 for the Traffic split. If you're deploying your model to an existing endpoint that has one or more models deployed to it, you must update the Traffic split percentage for the model you are deploying and the already deployed models so that all of the percentages add up to 100%.

    2. Enter the Minimum number of compute nodes you want to provide for your model.

      This is the number of nodes available to this model at all times. You are charged for the nodes used, whether to handle prediction load or for standby (minimum) nodes, even without prediction traffic. See the pricing page.

    3. Select your Machine type.

      Larger machine resources will increase your prediction performance and increase costs.

    4. Learn how to change the default settings for prediction logging.

    5. Click Continue

  7. In the Model monitoring page, click Continue.

  8. In the Monitoring objectives page, configure as follows:

    1. Enter the location of your training data.
    2. Enter the name of the target column.
  9. Click Deploy to deploy your model to the endpoint.

API

When you deploy a model using the Vertex AI API, you complete the following steps:

  1. Create an endpoint if needed.
  2. Get the endpoint ID.
  3. Deploy the model to the endpoint.

Create an endpoint

If you are deploying a model to an existing endpoint, you can skip this step.

gcloud

The following example uses the gcloud ai endpoints create command:

  gcloud ai endpoints create \
    --region=LOCATION \
    --display-name=ENDPOINT_NAME

Replace the following:

  • LOCATION_ID: The region where you are using Vertex AI.
  • ENDPOINT_NAME: The display name for the endpoint.

    The Google Cloud CLI tool might take a few seconds to create the endpoint.

REST

Before using any of the request data, make the following replacements:

  • LOCATION_ID: Your region.
  • PROJECT_ID: Your project ID.
  • ENDPOINT_NAME: The display name for the endpoint.

HTTP method and URL:

POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints

Request JSON body:

{
  "display_name": "ENDPOINT_NAME"
}

To send your request, expand one of these options:

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/endpoints/ENDPOINT_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1.CreateEndpointOperationMetadata",
    "genericMetadata": {
      "createTime": "2020-11-05T17:45:42.812656Z",
      "updateTime": "2020-11-05T17:45:42.812656Z"
    }
  }
}
You can poll for the status of the operation until the response includes "done": true.

Java

Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.aiplatform.v1.CreateEndpointOperationMetadata;
import com.google.cloud.aiplatform.v1.Endpoint;
import com.google.cloud.aiplatform.v1.EndpointServiceClient;
import com.google.cloud.aiplatform.v1.EndpointServiceSettings;
import com.google.cloud.aiplatform.v1.LocationName;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CreateEndpointSample {

  public static void main(String[] args)
      throws IOException, InterruptedException, ExecutionException, TimeoutException {
    // TODO(developer): Replace these variables before running the sample.
    String project = "YOUR_PROJECT_ID";
    String endpointDisplayName = "YOUR_ENDPOINT_DISPLAY_NAME";
    createEndpointSample(project, endpointDisplayName);
  }

  static void createEndpointSample(String project, String endpointDisplayName)
      throws IOException, InterruptedException, ExecutionException, TimeoutException {
    EndpointServiceSettings endpointServiceSettings =
        EndpointServiceSettings.newBuilder()
            .setEndpoint("us-central1-aiplatform.googleapis.com:443")
            .build();

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (EndpointServiceClient endpointServiceClient =
        EndpointServiceClient.create(endpointServiceSettings)) {
      String location = "us-central1";
      LocationName locationName = LocationName.of(project, location);
      Endpoint endpoint = Endpoint.newBuilder().setDisplayName(endpointDisplayName).build();

      OperationFuture<Endpoint, CreateEndpointOperationMetadata> endpointFuture =
          endpointServiceClient.createEndpointAsync(locationName, endpoint);
      System.out.format("Operation name: %s\n", endpointFuture.getInitialFuture().get().getName());
      System.out.println("Waiting for operation to finish...");
      Endpoint endpointResponse = endpointFuture.get(300, TimeUnit.SECONDS);

      System.out.println("Create Endpoint Response");
      System.out.format("Name: %s\n", endpointResponse.getName());
      System.out.format("Display Name: %s\n", endpointResponse.getDisplayName());
      System.out.format("Description: %s\n", endpointResponse.getDescription());
      System.out.format("Labels: %s\n", endpointResponse.getLabelsMap());
      System.out.format("Create Time: %s\n", endpointResponse.getCreateTime());
      System.out.format("Update Time: %s\n", endpointResponse.getUpdateTime());
    }
  }
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

/**
 * TODO(developer): Uncomment these variables before running the sample.\
 * (Not necessary if passing values as arguments)
 */

// const endpointDisplayName = 'YOUR_ENDPOINT_DISPLAY_NAME';
// const project = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION';

// Imports the Google Cloud Endpoint Service Client library
const {EndpointServiceClient} = require('@google-cloud/aiplatform');

// Specifies the location of the api endpoint
const clientOptions = {
  apiEndpoint: 'us-central1-aiplatform.googleapis.com',
};

// Instantiates a client
const endpointServiceClient = new EndpointServiceClient(clientOptions);

async function createEndpoint() {
  // Configure the parent resource
  const parent = `projects/${project}/locations/${location}`;
  const endpoint = {
    displayName: endpointDisplayName,
  };
  const request = {
    parent,
    endpoint,
  };

  // Get and print out a list of all the endpoints for this resource
  const [response] = await endpointServiceClient.createEndpoint(request);
  console.log(`Long running operation : ${response.name}`);

  // Wait for operation to complete
  await response.promise();
  const result = response.result;

  console.log('Create endpoint response');
  console.log(`\tName : ${result.name}`);
  console.log(`\tDisplay name : ${result.displayName}`);
  console.log(`\tDescription : ${result.description}`);
  console.log(`\tLabels : ${JSON.stringify(result.labels)}`);
  console.log(`\tCreate time : ${JSON.stringify(result.createTime)}`);
  console.log(`\tUpdate time : ${JSON.stringify(result.updateTime)}`);
}
createEndpoint();

Vertex AI SDK for Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Vertex AI SDK for Python API reference documentation.

def create_endpoint_sample(
    project: str,
    display_name: str,
    location: str,
):
    aiplatform.init(project=project, location=location)

    endpoint = aiplatform.Endpoint.create(
        display_name=display_name,
        project=project,
        location=location,
    )

    print(endpoint.display_name)
    print(endpoint.resource_name)
    return endpoint

Get the endpoint ID

You need the endpoint ID to deploy the model.

gcloud

The following example uses the gcloud ai endpoints list command:

  gcloud ai endpoints list \
    --region=LOCATION \
    --filter=display_name=ENDPOINT_NAME

Replace the following:

  • LOCATION_ID: The region where you are using Vertex AI.
  • ENDPOINT_NAME: The display name for the endpoint.

    Note the number that appears in the ENDPOINT_ID column. Use this ID in the following step.

REST

Before using any of the request data, make the following replacements:

  • LOCATION_ID: The region where you are using Vertex AI.
  • PROJECT_ID: Your project ID.
  • ENDPOINT_NAME: The display name for the endpoint.

HTTP method and URL:

GET https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints?filter=display_name=ENDPOINT_NAME

To send your request, expand one of these options:

You should receive a JSON response similar to the following:

{
  "endpoints": [
    {
      "name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/endpoints/ENDPOINT_ID",
      "displayName": "ENDPOINT_NAME",
      "etag": "AMEw9yPz5pf4PwBHbRWOGh0PcAxUdjbdX2Jm3QO_amguy3DbZGP5Oi_YUKRywIE-BtLx",
      "createTime": "2020-04-17T18:31:11.585169Z",
      "updateTime": "2020-04-17T18:35:08.568959Z"
    }
  ]
}
Note the ENDPOINT_ID.

Deploy the model

Select the tab below for your language or environment:

gcloud

The following examples use the gcloud ai endpoints deploy-model command.

The following example deploys a Model to an Endpoint without using GPUs to accelerate prediction serving and without splitting traffic between multiple DeployedModel resources:

Before using any of the command data below, make the following replacements:

  • ENDPOINT_ID: The ID for the endpoint.
  • LOCATION_ID: The region where you are using Vertex AI.
  • MODEL_ID: The ID for the model to be deployed.
  • DEPLOYED_MODEL_NAME: A name for the DeployedModel. You can use the display name of the Model for the DeployedModel as well.
  • MACHINE_TYPE: Optional. The machine resources used for each node of this deployment. Its default setting is n1-standard-2. Learn more about machine types.
  • MIN_REPLICA_COUNT: The minimum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to the maximum number of nodes and never fewer than this number of nodes. This value must be greater than or equal to 1. If the --min-replica-count flag is omitted, the value defaults to 1.
  • MAX_REPLICA_COUNT: The maximum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to this number of nodes and never fewer than the minimum number of nodes. If you omit the --max-replica-count flag, then maximum number of nodes is set to the value of --min-replica-count.

Execute the gcloud ai endpoints deploy-model command:

Linux, macOS, or Cloud Shell

gcloud ai endpoints deploy-model ENDPOINT_ID\
  --region=LOCATION_ID \
  --model=MODEL_ID \
  --display-name=DEPLOYED_MODEL_NAME \
  --machine-type=MACHINE_TYPE \
  --min-replica-count=MIN_REPLICA_COUNT \
  --max-replica-count=MAX_REPLICA_COUNT \
  --traffic-split=0=100

Windows (PowerShell)

gcloud ai endpoints deploy-model ENDPOINT_ID`
  --region=LOCATION_ID `
  --model=MODEL_ID `
  --display-name=DEPLOYED_MODEL_NAME `
  --machine-type=MACHINE_TYPE `
  --min-replica-count=MIN_REPLICA_COUNT `
  --max-replica-count=MAX_REPLICA_COUNT `
  --traffic-split=0=100

Windows (cmd.exe)

gcloud ai endpoints deploy-model ENDPOINT_ID^
  --region=LOCATION_ID ^
  --model=MODEL_ID ^
  --display-name=DEPLOYED_MODEL_NAME ^
  --machine-type=MACHINE_TYPE ^
  --min-replica-count=MIN_REPLICA_COUNT ^
  --max-replica-count=MAX_REPLICA_COUNT ^
  --traffic-split=0=100
 

Splitting traffic

The --traffic-split=0=100 flag in the preceding examples sends 100% of prediction traffic that the Endpoint receives to the new DeployedModel, which is represented by the temporary ID 0. If your Endpoint already has other DeployedModel resources, then you can split traffic between the new DeployedModel and the old ones. For example, to send 20% of traffic to the new DeployedModel and 80% to an older one, run the following command.

Before using any of the command data below, make the following replacements:

  • OLD_DEPLOYED_MODEL_ID: the ID of the existing DeployedModel.

Execute the gcloud ai endpoints deploy-model command:

Linux, macOS, or Cloud Shell

gcloud ai endpoints deploy-model ENDPOINT_ID\
  --region=LOCATION_ID \
  --model=MODEL_ID \
  --display-name=DEPLOYED_MODEL_NAME \ 
  --machine-type=MACHINE_TYPE \
  --min-replica-count=MIN_REPLICA_COUNT \
  --max-replica-count=MAX_REPLICA_COUNT \
  --traffic-split=0=20,OLD_DEPLOYED_MODEL_ID=80

Windows (PowerShell)

gcloud ai endpoints deploy-model ENDPOINT_ID`
  --region=LOCATION_ID `
  --model=MODEL_ID `
  --display-name=DEPLOYED_MODEL_NAME \ 
  --machine-type=MACHINE_TYPE `
  --min-replica-count=MIN_REPLICA_COUNT `
  --max-replica-count=MAX_REPLICA_COUNT `
  --traffic-split=0=20,OLD_DEPLOYED_MODEL_ID=80

Windows (cmd.exe)

gcloud ai endpoints deploy-model ENDPOINT_ID^
  --region=LOCATION_ID ^
  --model=MODEL_ID ^
  --display-name=DEPLOYED_MODEL_NAME \ 
  --machine-type=MACHINE_TYPE ^
  --min-replica-count=MIN_REPLICA_COUNT ^
  --max-replica-count=MAX_REPLICA_COUNT ^
  --traffic-split=0=20,OLD_DEPLOYED_MODEL_ID=80
 

REST

You use the endpoints.predict method to request an online prediction.

Deploy the model.

Before using any of the request data, make the following replacements:

  • LOCATION_ID: The region where you are using Vertex AI.
  • PROJECT_ID: Your project ID.
  • ENDPOINT_ID: The ID for the endpoint.
  • MODEL_ID: The ID for the model to be deployed.
  • DEPLOYED_MODEL_NAME: A name for the DeployedModel. You can use the display name of the Model for the DeployedModel as well.
  • MACHINE_TYPE: Optional. The machine resources used for each node of this deployment. Its default setting is n1-standard-2. Learn more about machine types.
  • ACCELERATOR_TYPE: The type of accelerator to be attached to the machine. Optional if ACCELERATOR_COUNT is not specified or is zero. Not recommended for AutoML models or custom-trained models that are using non-GPU images. Learn more.
  • ACCELERATOR_COUNT: The number of accelerators for each replica to use. Optional. Should be zero or unspecified for AutoML models or custom-trained models that are using non-GPU images.
  • MIN_REPLICA_COUNT: The minimum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to the maximum number of nodes and never fewer than this number of nodes. This value must be greater than or equal to 1.
  • MAX_REPLICA_COUNT: The maximum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to this number of nodes and never fewer than the minimum number of nodes.
  • TRAFFIC_SPLIT_THIS_MODEL: The percentage of the prediction traffic to this endpoint to be routed to the model being deployed with this operation. Defaults to 100. All traffic percentages must add up to 100. Learn more about traffic splits.
  • DEPLOYED_MODEL_ID_N: Optional. If other models are deployed to this endpoint, you must update their traffic split percentages so that all percentages add up to 100.
  • TRAFFIC_SPLIT_MODEL_N: The traffic split percentage value for the deployed model id key.
  • PROJECT_NUMBER: Your project's automatically generated project number

HTTP method and URL:

POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel

Request JSON body:

{
  "deployedModel": {
    "model": "projects/PROJECT/locations/us-central1/models/MODEL_ID",
    "displayName": "DEPLOYED_MODEL_NAME",
    "dedicatedResources": {
       "machineSpec": {
         "machineType": "MACHINE_TYPE",
         "acceleratorType": "ACCELERATOR_TYPE",
         "acceleratorCount": "ACCELERATOR_COUNT"
       },
       "minReplicaCount": MIN_REPLICA_COUNT,
       "maxReplicaCount": MAX_REPLICA_COUNT
     },
  },
  "trafficSplit": {
    "0": TRAFFIC_SPLIT_THIS_MODEL,
    "DEPLOYED_MODEL_ID_1": TRAFFIC_SPLIT_MODEL_1,
    "DEPLOYED_MODEL_ID_2": TRAFFIC_SPLIT_MODEL_2
  },
}

To send your request, expand one of these options:

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1.DeployModelOperationMetadata",
    "genericMetadata": {
      "createTime": "2020-10-19T17:53:16.502088Z",
      "updateTime": "2020-10-19T17:53:16.502088Z"
    }
  }
}

Java

Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.aiplatform.v1.DedicatedResources;
import com.google.cloud.aiplatform.v1.DeployModelOperationMetadata;
import com.google.cloud.aiplatform.v1.DeployModelResponse;
import com.google.cloud.aiplatform.v1.DeployedModel;
import com.google.cloud.aiplatform.v1.EndpointName;
import com.google.cloud.aiplatform.v1.EndpointServiceClient;
import com.google.cloud.aiplatform.v1.EndpointServiceSettings;
import com.google.cloud.aiplatform.v1.MachineSpec;
import com.google.cloud.aiplatform.v1.ModelName;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutionException;

public class DeployModelCustomTrainedModelSample {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    String project = "PROJECT";
    String endpointId = "ENDPOINT_ID";
    String modelName = "MODEL_NAME";
    String deployedModelDisplayName = "DEPLOYED_MODEL_DISPLAY_NAME";
    deployModelCustomTrainedModelSample(project, endpointId, modelName, deployedModelDisplayName);
  }

  static void deployModelCustomTrainedModelSample(
      String project, String endpointId, String model, String deployedModelDisplayName)
      throws IOException, ExecutionException, InterruptedException {
    EndpointServiceSettings settings =
        EndpointServiceSettings.newBuilder()
            .setEndpoint("us-central1-aiplatform.googleapis.com:443")
            .build();
    String location = "us-central1";

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (EndpointServiceClient client = EndpointServiceClient.create(settings)) {
      MachineSpec machineSpec = MachineSpec.newBuilder().setMachineType("n1-standard-2").build();
      DedicatedResources dedicatedResources =
          DedicatedResources.newBuilder().setMinReplicaCount(1).setMachineSpec(machineSpec).build();

      String modelName = ModelName.of(project, location, model).toString();
      DeployedModel deployedModel =
          DeployedModel.newBuilder()
              .setModel(modelName)
              .setDisplayName(deployedModelDisplayName)
              // `dedicated_resources` must be used for non-AutoML models
              .setDedicatedResources(dedicatedResources)
              .build();
      // key '0' assigns traffic for the newly deployed model
      // Traffic percentage values must add up to 100
      // Leave dictionary empty if endpoint should not accept any traffic
      Map<String, Integer> trafficSplit = new HashMap<>();
      trafficSplit.put("0", 100);
      EndpointName endpoint = EndpointName.of(project, location, endpointId);
      OperationFuture<DeployModelResponse, DeployModelOperationMetadata> response =
          client.deployModelAsync(endpoint, deployedModel, trafficSplit);

      // You can use OperationFuture.getInitialFuture to get a future representing the initial
      // response to the request, which contains information while the operation is in progress.
      System.out.format("Operation name: %s\n", response.getInitialFuture().get().getName());

      // OperationFuture.get() will block until the operation is finished.
      DeployModelResponse deployModelResponse = response.get();
      System.out.format("deployModelResponse: %s\n", deployModelResponse);
    }
  }
}

Vertex AI SDK for Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Vertex AI SDK for Python API reference documentation.

def deploy_model_with_dedicated_resources_sample(
    project,
    location,
    model_name: str,
    machine_type: str,
    endpoint: Optional[aiplatform.Endpoint] = None,
    deployed_model_display_name: Optional[str] = None,
    traffic_percentage: Optional[int] = 0,
    traffic_split: Optional[Dict[str, int]] = None,
    min_replica_count: int = 1,
    max_replica_count: int = 1,
    accelerator_type: Optional[str] = None,
    accelerator_count: Optional[int] = None,
    explanation_metadata: Optional[explain.ExplanationMetadata] = None,
    explanation_parameters: Optional[explain.ExplanationParameters] = None,
    metadata: Optional[Sequence[Tuple[str, str]]] = (),
    sync: bool = True,
):
    """
    model_name: A fully-qualified model resource name or model ID.
          Example: "projects/123/locations/us-central1/models/456" or
          "456" when project and location are initialized or passed.
    """

    aiplatform.init(project=project, location=location)

    model = aiplatform.Model(model_name=model_name)

    # The explanation_metadata and explanation_parameters should only be
    # provided for a custom trained model and not an AutoML model.
    model.deploy(
        endpoint=endpoint,
        deployed_model_display_name=deployed_model_display_name,
        traffic_percentage=traffic_percentage,
        traffic_split=traffic_split,
        machine_type=machine_type,
        min_replica_count=min_replica_count,
        max_replica_count=max_replica_count,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        explanation_metadata=explanation_metadata,
        explanation_parameters=explanation_parameters,
        metadata=metadata,
        sync=sync,
    )

    model.wait()

    print(model.display_name)
    print(model.resource_name)
    return model

Node.js

Before trying this sample, follow the Node.js setup instructions in the