Use reservations with prediction

To ensure that VM resources are available when your prediction jobs need them, you can use Compute Engine reservations. Reservations provide a high level of assurance in obtaining capacity for Compute Engine resources. To learn more, see Reservations of Compute Engine zonal resources.

Overview

Your Vertex AI custom training and prediction jobs can consume Compute Engine reservations. Your reservation must specify an A2 or A3 machine type. If resources from those reservations are eligible for committed use discounts (CUDs), then you get those resources at the discounted prices when your VMs consume the reservations. See CUDs for your reserved resources.

Limitations and requirements

Consider the following limitations and requirements when using Compute Engine reservations with Vertex AI:

Vertex AI can only consume reservations with the following machine series:
- A2
- A3
Using Compute Engine reservations with Vertex AI is only supported for custom training and prediction.
Ensure that sufficient quota is available for your Vertex AI jobs. See Additional quota requirements for shared reservations.
To support regular updates of your Vertex AI deployments, we recommend increasing your reservation's VM count by at least 1 additional VM for each concurrent deployment.
Ensure that your organization policy constraints allow shared reservations. See Allow and restrict projects from creating and modifying shared reservations.
To use a reservation, its VM instance properties must exactly match your Vertex AI workload. For example, if a Vertex AI workload uses an a2-megagpu-16g machine type, the reservation's machine type must also be a2-megagpu-16g. See Requirements.
The following services and capabilities aren't supported when using Compute Engine reservations with Vertex AI prediction:
- Customer managed encryption keys (CMEK)
- Federal Risk and Authorization Management Program (FedRAMP) compliance
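The machine series and exact-match requirements above can be sketched as a small pre-check. This is only an illustration: the function names are hypothetical, and it compares the machine type alone, whereas a real match also depends on the other VM instance properties of the reservation.

```python
SUPPORTED_SERIES = ("a2", "a3")  # machine series listed in the limitations above


def machine_series(machine_type: str) -> str:
    """Return the series prefix of a machine type, e.g. 'a2' for 'a2-megagpu-16g'."""
    return machine_type.split("-", 1)[0].lower()


def can_consume(workload_machine_type: str, reservation_machine_type: str) -> bool:
    """Hypothetical pre-check: supported series plus an exact machine type match."""
    if machine_series(workload_machine_type) not in SUPPORTED_SERIES:
        return False
    return workload_machine_type == reservation_machine_type


print(can_consume("a2-megagpu-16g", "a2-megagpu-16g"))  # True
print(can_consume("a2-megagpu-16g", "a2-highgpu-8g"))   # False: types differ
print(can_consume("n1-standard-2", "n1-standard-2"))    # False: unsupported series
```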

Billing

When you use Compute Engine reservations, you're billed for the following:

Compute Engine pricing for the Compute Engine resources, including any applicable committed use discounts (CUDs). See Compute Engine pricing.
Vertex AI prediction management fees in addition to your infrastructure usage. See Prediction pricing.

Before you begin

Review the requirements and restrictions for reservations.
Review the quota requirements and restrictions for shared reservations.

Allow a reservation to be consumed

Before consuming a reservation of A2 or A3 VMs, you must set its sharing policy to allow Vertex AI to consume the reservation. To do so, use one of the following methods:

Allow consumption while creating a reservation
Allow consumption in an existing reservation

Allow consumption while creating a reservation

While creating a single-project or shared reservation of A2 or A3 VMs, you can allow Vertex AI to consume the reservation as follows:

If you're using the Google Cloud console, then, in the Google Cloud services section, select Share reservation.
If you're using the Google Cloud CLI, then include the --reservation-sharing-policy flag set to ALLOW_ALL.
If you're using the REST API, then include the serviceShareType field set to ALLOW_ALL.
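For scripted setups, the gcloud CLI option above might be assembled as follows. This is a sketch under assumptions: the helper name and the sample values are hypothetical, the --zone, --machine-type, and --vm-count flags are the usual ones for gcloud compute reservations create, and depending on your gcloud version the sharing policy flag may require a specific release track.

```python
import shlex


def reservation_create_command(name, zone, machine_type, vm_count):
    """Build a gcloud argument list for a reservation that Vertex AI may consume."""
    return [
        "gcloud", "compute", "reservations", "create", name,
        f"--zone={zone}",
        f"--machine-type={machine_type}",
        f"--vm-count={vm_count}",
        "--reservation-sharing-policy=ALLOW_ALL",  # flag named in the text above
    ]


# Sample values are placeholders, not recommendations.
cmd = reservation_create_command("my-reservation", "us-central1-a", "a2-megagpu-16g", 3)
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting problems.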

Allow consumption in an existing reservation

To allow Vertex AI to consume an existing reservation of A2 or A3 VMs, see Modify the sharing policy of a reservation.

Get predictions by using a reservation

To consume a Compute Engine reservation when you deploy a model to get predictions, you can use the REST API or the Vertex AI SDK for Python.

REST

Before using any of the request data, make the following replacements:

LOCATION_ID: The region where you are using Vertex AI.
PROJECT_ID: Your project ID.
ENDPOINT_ID: The ID for the endpoint.
MODEL_ID: The ID for the model to be deployed.
DEPLOYED_MODEL_NAME: A name for the DeployedModel. You can use the display name of the Model for the DeployedModel as well.
MACHINE_TYPE: Optional. The machine resources used for each node of this deployment. Its default setting is n1-standard-2. To consume a reservation, specify the A2 or A3 machine type that matches the reservation. Learn more about machine types.
RESERVATION_AFFINITY_TYPE: Must be ANY_RESERVATION, SPECIFIC_RESERVATION, or NO_RESERVATION.
- ANY_RESERVATION means that the VMs of your deployment can automatically consume any reservation with matching properties.
- SPECIFIC_RESERVATION means that the VMs of your deployment can consume only a reservation that the VMs specifically target by name.
- NO_RESERVATION means that the VMs of your deployment can't consume any reservation. Specifying NO_RESERVATION has the same effect as omitting a reservation affinity specification.
RESERVATION_NAME: The name of your reservation.
MIN_REPLICA_COUNT: The minimum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to the maximum number of nodes and never fewer than this number of nodes. This value must be greater than or equal to 1.
MAX_REPLICA_COUNT: The maximum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to this number of nodes and never fewer than the minimum number of nodes.
TRAFFIC_SPLIT_THIS_MODEL: The percentage of the prediction traffic to this endpoint to be routed to the model being deployed with this operation. Defaults to 100. All traffic percentages must add up to 100. Learn more about traffic splits.
DEPLOYED_MODEL_ID_N: Optional. If other models are deployed to this endpoint, you must update their traffic split percentages so that all percentages add up to 100.
TRAFFIC_SPLIT_MODEL_N: The traffic split percentage value for the deployed model id key.
PROJECT_NUMBER: Your project's automatically generated project number.
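The numeric constraints in the list above (a minimum replica count of at least 1, a maximum no lower than the minimum, and traffic percentages that add up to 100) can be checked before sending a request. The following is a minimal sketch; the function name and the sample deployed model ID are hypothetical.

```python
def validate_deployment(min_replicas: int, max_replicas: int, traffic_split: dict) -> None:
    """Raise ValueError if the values break the constraints described above."""
    if min_replicas < 1:
        raise ValueError("MIN_REPLICA_COUNT must be greater than or equal to 1")
    if max_replicas < min_replicas:
        raise ValueError("MAX_REPLICA_COUNT must not be below MIN_REPLICA_COUNT")
    total = sum(traffic_split.values())
    if total != 100:
        raise ValueError(f"traffic percentages add up to {total}, expected 100")


# "0" stands for the model being deployed; "1234567890" is a made-up
# ID for a model that is already deployed to the endpoint.
validate_deployment(1, 3, {"0": 80, "1234567890": 20})
```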

HTTP method and URL:

POST https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel

Request JSON body:

{
  "deployedModel": {
    "model": "projects/PROJECT_ID/locations/LOCATION_ID/models/MODEL_ID",
    "displayName": "DEPLOYED_MODEL_NAME",
    "dedicatedResources": {
      "machineSpec": {
        "machineType": "MACHINE_TYPE",
        "reservationAffinity": {
          "reservationAffinityType": "RESERVATION_AFFINITY_TYPE",
          "key": "compute.googleapis.com/reservation-name",
          "values": [
            "projects/PROJECT_ID/zones/ZONE/reservations/RESERVATION_NAME"
          ]
        }
      },
      "minReplicaCount": MIN_REPLICA_COUNT,
      "maxReplicaCount": MAX_REPLICA_COUNT
    }
  },
  "trafficSplit": {
    "0": TRAFFIC_SPLIT_THIS_MODEL,
    "DEPLOYED_MODEL_ID_1": TRAFFIC_SPLIT_MODEL_1,
    "DEPLOYED_MODEL_ID_2": TRAFFIC_SPLIT_MODEL_2
  }
}
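If you prefer to generate the request body rather than edit it by hand, a sketch along these lines might help. The helper name and all sample values are hypothetical. It assumes that key and values are needed only for SPECIFIC_RESERVATION, which mirrors the Python samples later on this page.

```python
import json

RESERVATION_KEY = "compute.googleapis.com/reservation-name"


def build_deploy_request(model, display_name, machine_type, affinity_type,
                         min_replicas, max_replicas, traffic_split,
                         reservation=None):
    """Assemble a deployModel request body as a plain dict (illustrative helper)."""
    affinity = {"reservationAffinityType": affinity_type}
    if affinity_type == "SPECIFIC_RESERVATION":
        if not reservation:
            raise ValueError("SPECIFIC_RESERVATION needs a reservation resource name")
        affinity["key"] = RESERVATION_KEY
        affinity["values"] = [reservation]
    return {
        "deployedModel": {
            "model": model,
            "displayName": display_name,
            "dedicatedResources": {
                "machineSpec": {
                    "machineType": machine_type,
                    "reservationAffinity": affinity,
                },
                "minReplicaCount": min_replicas,
                "maxReplicaCount": max_replicas,
            },
        },
        "trafficSplit": traffic_split,
    }


body = build_deploy_request(
    model="projects/my-project/locations/us-central1/models/1234567890",
    display_name="my-deployed-model",
    machine_type="a2-megagpu-16g",
    affinity_type="SPECIFIC_RESERVATION",
    min_replicas=1,
    max_replicas=2,
    traffic_split={"0": 100},
    reservation="projects/my-project/zones/us-central1-a/reservations/my-reservation",
)
print(json.dumps(body, indent=2))
```

Serializing with json.dumps guarantees syntactically valid JSON; the printed text can be saved as request.json for the commands that follow.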

To send your request, expand one of these options:

curl (Linux, macOS, or Cloud Shell)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel"

PowerShell (Windows)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1beta1.DeployModelOperationMetadata",
    "genericMetadata": {
      "createTime": "2020-10-19T17:53:16.502088Z",
      "updateTime": "2020-10-19T17:53:16.502088Z"
    }
  }
}
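The name field in the response identifies the long-running deployment operation. As a rough illustration, it can be split into its components like this; the helper and the sample IDs are hypothetical, and the sketch assumes the name follows the layout shown in the response above.

```python
EXPECTED_KEYS = ["projects", "locations", "endpoints", "operations"]


def parse_operation_name(name: str) -> dict:
    """Split 'projects/P/locations/L/endpoints/E/operations/O' into a dict."""
    parts = name.split("/")
    keys, values = parts[0::2], parts[1::2]
    if len(parts) != 8 or keys != EXPECTED_KEYS:
        raise ValueError(f"unexpected operation name: {name!r}")
    return dict(zip(keys, values))


info = parse_operation_name(
    "projects/my-project/locations/us-central1/endpoints/123/operations/456"
)
print(info["operations"])  # 456
```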

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Vertex AI SDK for Python API reference documentation.

To use a specific reservation:

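from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="LOCATION_ID")

# Look up the existing endpoint and the model to deploy.
endpoint5 = aiplatform.Endpoint("ENDPOINT_ID")
model = aiplatform.Model("MODEL_ID")

# Deploy onto VMs that come only from the named reservation.
endpoint5.deploy(
    model=model,
    deployed_model_display_name=DEPLOYED_NAME,
    traffic_split=TRAFFIC_SPLIT,
    machine_type="MACHINE_TYPE",
    accelerator_type="ACCELERATOR_TYPE",
    accelerator_count=ACCELERATOR_COUNT,
    reservation_affinity_type="SPECIFIC_RESERVATION",
    reservation_affinity_key="compute.googleapis.com/reservation-name",
    reservation_affinity_values=["projects/PROJECT_ID/zones/ZONE/reservations/RESERVATION_NAME"],
    min_replica_count=MIN_REPLICA_COUNT,
    max_replica_count=MAX_REPLICA_COUNT,
    sync=True
)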

To use any reservation:

endpoint5.deploy(
    model=model,
    deployed_model_display_name=DEPLOYED_NAME,
    traffic_split=TRAFFIC_SPLIT,
    machine_type="MACHINE_TYPE",
    accelerator_type="ACCELERATOR_TYPE",
    accelerator_count=ACCELERATOR_COUNT,
    reservation_affinity_type="ANY_RESERVATION",
    min_replica_count=MIN_REPLICA_COUNT,
    max_replica_count=MAX_REPLICA_COUNT,
    sync=True
)
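The two calls above differ only in their reservation arguments, so those arguments can be produced by one small helper. This is a sketch rather than part of the SDK: the function name is hypothetical, and the argument names and values are taken from the samples above.

```python
def reservation_kwargs(project_id=None, zone=None, reservation_name=None):
    """Return the reservation-related keyword arguments for a deploy() call."""
    if reservation_name is None:
        # No name given: let the deployment consume any matching reservation.
        return {"reservation_affinity_type": "ANY_RESERVATION"}
    return {
        "reservation_affinity_type": "SPECIFIC_RESERVATION",
        "reservation_affinity_key": "compute.googleapis.com/reservation-name",
        "reservation_affinity_values": [
            f"projects/{project_id}/zones/{zone}/reservations/{reservation_name}"
        ],
    }


specific = reservation_kwargs("my-project", "us-central1-a", "my-reservation")
print(specific["reservation_affinity_values"][0])
# Intended use: endpoint5.deploy(model=model, ..., **specific)
```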

What's next

Learn more about reservations of Compute Engine zonal resources.
Use reservations with Vertex AI training.