Use private endpoints for online prediction

Using private endpoints to serve online predictions with Vertex AI provides a low-latency, secure connection to the Vertex AI online prediction service. This guide shows how to configure private endpoints on Vertex AI by using VPC Network Peering to peer your network with the Vertex AI online prediction service.

Overview

Before you serve online predictions with private endpoints, you must configure private services access to create peering connections between your network and Vertex AI. If you have already set this up, you can use your existing peering connections.

This guide covers the following tasks:

  • Verifying the status of your existing peering connections.
  • Creating a private endpoint.
  • Deploying a model to a private endpoint.
  • Sending a prediction to a private endpoint.
  • Cleaning up resources.

Check the status of existing peering connections

If you have existing peering connections that you use with Vertex AI, you can list them to check their status:

gcloud compute networks peerings list --network NETWORK_NAME

Verify that the state of each peering connection is ACTIVE. Learn more about active peering connections.
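The exact columns depend on your gcloud version, but the output resembles the following. A peering created through private services access is typically named servicenetworking-googleapis-com; the other values shown here are hypothetical:

```
NAME                              NETWORK  PEER_PROJECT      STATE   STATE_DETAILS
servicenetworking-googleapis-com  my-vpc   td1234567890abcd  ACTIVE  [2023-01-01T00:00:00.000-00:00]: Connected.
```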

Create a private endpoint

To create a private endpoint, add the --network flag when you create an endpoint using the gcloud command-line tool:

gcloud beta ai endpoints create \
  --display-name=ENDPOINT_DISPLAY_NAME \
  --network=NETWORK_NAME \
  --region=REGION

Replace NETWORK_NAME with the fully qualified network name:

projects/PROJECT_NUMBER/global/networks/NETWORK_NAME

If you create the endpoint without specifying a network, then you create a regular public endpoint.
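As an illustration, with hypothetical values (project number 123456789012, network my-vpc), you can assemble the fully qualified network name in the shell before passing it to the --network flag:

```shell
# Hypothetical values; substitute your own. You can look up your project
# number with: gcloud projects describe PROJECT_ID --format="value(projectNumber)"
PROJECT_NUMBER=123456789012
NETWORK_NAME=my-vpc

# Fully qualified network name expected by the --network flag
NETWORK="projects/${PROJECT_NUMBER}/global/networks/${NETWORK_NAME}"
echo "${NETWORK}"
```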

Limitations of private endpoints

Note the following limitations for private endpoints:

  • Private endpoints currently do not support traffic splitting. As a workaround, you can split traffic manually by deploying your model to multiple private endpoints and dividing requests among the resulting prediction URLs.
  • To enable access logging on a private endpoint, contact vertex-ai-feedback@google.com.
  • You can use only one network for all private endpoints in a Google Cloud project. If you want to change to another network, contact vertex-ai-feedback@google.com.

Monitor private endpoints

To monitor your private endpoints, you can query the following two metrics:

  • aiplatform.googleapis.com/prediction/online/private/response_count

    The number of prediction responses. You can filter this metric by deployed_model_id or HTTP response code.

  • aiplatform.googleapis.com/prediction/online/private/prediction_latencies

    The latency of the prediction request in milliseconds. This metric is reported only for successful requests, and you can filter it by deployed_model_id.

Learn how to select, query, and display these metrics in Metrics Explorer.
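For example, in Metrics Explorer or through the Monitoring API's time-series filter syntax, you can select the response count for a single deployed model with a filter like the following. The deployed_model_id value is hypothetical, and this assumes the ID is exposed as a metric label, as the filtering described above suggests:

```
metric.type = "aiplatform.googleapis.com/prediction/online/private/response_count"
metric.labels.deployed_model_id = "1234567890"
```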

Deploy a model

You can deploy a model that you have already uploaded, or upload and deploy a new one. To upload a new model, use gcloud ai models upload. For more information, see Import models to Vertex AI.

  1. To deploy a model to a private endpoint, see the guide to deploy models. Aside from traffic splitting and access logging, you can use any of the other options available for deploying custom-trained models. Refer to the limitations of private endpoints to learn how they differ from public endpoints.

  2. After you deploy the model, you can get the prediction URI from the metadata of your private endpoint.

    1. If you have the display name of your private endpoint, run this command to get the endpoint ID:

      ENDPOINT_ID=$(gcloud ai endpoints list \
                --region=REGION \
                --filter=displayName:ENDPOINT_DISPLAY_NAME \
                --format="value(ENDPOINT_ID.scope())")
      

      Otherwise, to view the endpoint ID and display name for all of your endpoints, run the following command:

      gcloud ai endpoints list --region=REGION
      
    2. Finally, to get the prediction URI, run the following command:

      gcloud beta ai endpoints describe ENDPOINT_ID \
        --region=REGION \
        --format="value(deployedModels.privateEndpoints.predictHttpUri)"
      

Private prediction URI format

The prediction URI for a private endpoint differs from the URI for a public endpoint:

http://aiplatform-HASH_ID.googleapis.com/v1/models/DEPLOYED_MODEL_HASH_ID:predict

  • HASH_ID is a hash ID in your prediction URL that contains six alphanumeric characters.
  • DEPLOYED_MODEL_HASH_ID is a hash ID for your deployed model that contains 21 characters (two sets of 10 alphanumeric characters connected by a hyphen).

Note that your prediction URI might have a slightly different format if you are using a custom container.
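As an illustration with hypothetical hash values (a1b2c3 for the endpoint hash, abcde12345-fghij67890 for the deployed model hash), the URI can be assembled like this:

```shell
# Hypothetical hash values, for illustration only. Use the values from your
# own prediction URL, obtained with "gcloud beta ai endpoints describe".
HASH_ID=a1b2c3
DEPLOYED_MODEL_HASH_ID=abcde12345-fghij67890

# Private prediction URI in the format shown above
PREDICT_URI="http://aiplatform-${HASH_ID}.googleapis.com/v1/models/${DEPLOYED_MODEL_HASH_ID}:predict"
echo "${PREDICT_URI}"
```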

Send a prediction to a private endpoint

  1. Create a Compute Engine instance in your VPC network. Make sure to create the instance in the same VPC network that you have peered with Vertex AI.

  2. SSH into your Compute Engine instance, and install your prediction client, if applicable. Otherwise, you can use cURL.

  3. To send a prediction request, use the prediction URI that you obtained when you deployed the model. In this example, you send the request from your prediction client on the Compute Engine instance in the same VPC network:

    curl -X POST -d @PATH_TO_JSON_FILE http://aiplatform-HASH_ID.googleapis.com/v1/models/DEPLOYED_MODEL_HASH_ID:predict
    

    In this sample request:

    • PATH_TO_JSON_FILE is the path to your prediction request, saved as a JSON file. For example, example-request.json.
    • HASH_ID is a hash ID in your prediction URL that contains six alphanumeric characters.
    • DEPLOYED_MODEL_HASH_ID is a hash ID for your deployed model that contains 21 characters (two sets of 10 alphanumeric characters connected by a hyphen).

    Note that your prediction URI might have a slightly different format if you are using a custom container.
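The request body saved at PATH_TO_JSON_FILE uses the standard Vertex AI prediction format: an instances array whose element shape depends on your model's input signature. A minimal hypothetical example-request.json:

```json
{
  "instances": [
    [1.0, 2.0, 3.0]
  ]
}
```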

Clean up resources

You can undeploy models and delete private endpoints the same way as with public models and endpoints. However, each Google Cloud project can use only one network for private endpoints, and this restriction remains even after you delete those resources. If you need to switch to a different network, contact vertex-ai-feedback@google.com.

Example: Test private endpoints in Shared VPC

This example uses two Google Cloud projects with a Shared VPC network:

  • The host project hosts the Shared VPC network.
  • The client project hosts a Compute Engine instance where you run a prediction client, such as cURL or your own REST client, to send prediction requests.

When you create the Compute Engine instance in the client project, it must be in the custom subnet of the host project's Shared VPC network, and in the same region where you deploy the model.

  1. Create the peering connections for private services access in the host project. Run gcloud services vpc-peerings connect:

    gcloud services vpc-peerings connect \
      --service=servicenetworking.googleapis.com \
      --network=HOST_SHARED_VPC_NAME \
      --ranges=PREDICTION_RESERVED_RANGE_NAME \
      --project=HOST_PROJECT_ID
    
  2. Create the endpoint in the client project, using the host project's network name. Run gcloud beta ai endpoints create:

    gcloud beta ai endpoints create \
      --display-name=ENDPOINT_DISPLAY_NAME \
      --network=HOST_SHARED_VPC_NAME \
      --region=REGION \
      --project=CLIENT_PROJECT_ID
    
  3. Send prediction requests, using the prediction client within the client project.