Configure compute resources for prediction

Vertex AI allocates nodes to handle online and batch predictions. When you deploy a custom-trained model or AutoML tabular model to an Endpoint resource to serve online predictions or when you request batch predictions, you can customize the type of virtual machine that the prediction service uses for these nodes. You can optionally configure prediction nodes to use GPUs.

Machine types differ in a few ways:

  • Number of virtual CPUs (vCPUs) per node
  • Amount of memory per node
  • Pricing

By selecting a machine type with more computing resources, you can serve predictions with lower latency or handle more prediction requests at the same time.

Where to specify compute resources

If you want to use a custom-trained model or an AutoML tabular model to serve online predictions, you must specify a machine type when you deploy the Model resource as a DeployedModel to an Endpoint. For other types of AutoML models, Vertex AI configures the machine types automatically.

Specify the machine type (and, optionally, GPU configuration) in the dedicatedResources.machineSpec field of your DeployedModel.
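
For example, the following minimal sketch uses the Vertex AI SDK for Python; when you pass machine_type to Model.deploy, the SDK populates dedicatedResources.machineSpec on your behalf. The project, region, model ID, and display name are illustrative placeholders:

from google.cloud import aiplatform

# Placeholder project, region, and model ID -- replace with your own values.
aiplatform.init(project="my-project", location="us-central1")
model = aiplatform.Model(model_name="1234567890")

# machine_type is written into dedicatedResources.machineSpec.machineType
# of the resulting DeployedModel.
endpoint = model.deploy(
    deployed_model_display_name="my-deployed-model",
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=1,
)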

If you want to get batch predictions from a custom-trained model or an AutoML tabular model, you must specify a machine type when you create a BatchPredictionJob resource. Specify the machine type (and, optionally, GPU configuration) in the dedicatedResources.machineSpec field of your BatchPredictionJob.
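
As an illustration, here is a similar sketch for batch prediction with the Vertex AI SDK for Python; the model ID and Cloud Storage paths are placeholders, and passing machine_type causes the SDK to populate dedicatedResources on the BatchPredictionJob:

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")
model = aiplatform.Model(model_name="1234567890")  # placeholder model ID

# machine_type and starting_replica_count are written into
# BatchPredictionJob.dedicatedResources (batch prediction nodes do not
# autoscale; see the Scaling section below).
batch_job = model.batch_predict(
    job_display_name="my-batch-job",
    instances_format="jsonl",
    gcs_source="gs://my-bucket/input.jsonl",          # placeholder input
    gcs_destination_prefix="gs://my-bucket/output/",  # placeholder output
    machine_type="n1-standard-8",
    starting_replica_count=2,
    sync=False,
)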

Machine types

The following table compares the available machine types for serving predictions from custom-trained models and AutoML tabular models:

E2 Series

Name vCPUs Memory (GB)
e2-standard-2 2 8
e2-standard-4 4 16
e2-standard-8 8 32
e2-standard-16 16 64
e2-standard-32 32 128
e2-highmem-2 2 16
e2-highmem-4 4 32
e2-highmem-8 8 64
e2-highmem-16 16 128
e2-highcpu-2 2 2
e2-highcpu-4 4 4
e2-highcpu-8 8 8
e2-highcpu-16 16 16
e2-highcpu-32 32 32

N1 Series

Name vCPUs Memory (GB)
n1-standard-2 2 7.5
n1-standard-4 4 15
n1-standard-8 8 30
n1-standard-16 16 60
n1-standard-32 32 120
n1-highmem-2 2 13
n1-highmem-4 4 26
n1-highmem-8 8 52
n1-highmem-16 16 104
n1-highmem-32 32 208
n1-highcpu-4 4 3.6
n1-highcpu-8 8 7.2
n1-highcpu-16 16 14.4
n1-highcpu-32 32 28.8

N2 Series

Name vCPUs Memory (GB)
n2-standard-2 2 8
n2-standard-4 4 16
n2-standard-8 8 32
n2-standard-16 16 64
n2-standard-32 32 128
n2-standard-48 48 192
n2-standard-64 64 256
n2-standard-80 80 320
n2-standard-96 96 384
n2-standard-128 128 512
n2-highmem-2 2 16
n2-highmem-4 4 32
n2-highmem-8 8 64
n2-highmem-16 16 128
n2-highmem-32 32 256
n2-highmem-48 48 384
n2-highmem-64 64 512
n2-highmem-80 80 640
n2-highmem-96 96 768
n2-highmem-128 128 864
n2-highcpu-2 2 2
n2-highcpu-4 4 4
n2-highcpu-8 8 8
n2-highcpu-16 16 16
n2-highcpu-32 32 32
n2-highcpu-48 48 48
n2-highcpu-64 64 64
n2-highcpu-80 80 80
n2-highcpu-96 96 96

N2D Series

Name vCPUs Memory (GB)
n2d-standard-2 2 8
n2d-standard-4 4 16
n2d-standard-8 8 32
n2d-standard-16 16 64
n2d-standard-32 32 128
n2d-standard-48 48 192
n2d-standard-64 64 256
n2d-standard-80 80 320
n2d-standard-96 96 384
n2d-standard-128 128 512
n2d-standard-224 224 896
n2d-highmem-2 2 16
n2d-highmem-4 4 32
n2d-highmem-8 8 64
n2d-highmem-16 16 128
n2d-highmem-32 32 256
n2d-highmem-48 48 384
n2d-highmem-64 64 512
n2d-highmem-80 80 640
n2d-highmem-96 96 768
n2d-highcpu-2 2 2
n2d-highcpu-4 4 4
n2d-highcpu-8 8 8
n2d-highcpu-16 16 16
n2d-highcpu-32 32 32
n2d-highcpu-48 48 48
n2d-highcpu-64 64 64
n2d-highcpu-80 80 80
n2d-highcpu-96 96 96
n2d-highcpu-128 128 128
n2d-highcpu-224 224 224

C2 Series

Name vCPUs Memory (GB)
c2-standard-4 4 16
c2-standard-8 8 32
c2-standard-16 16 64
c2-standard-30 30 120
c2-standard-60 60 240

C2D Series

Name vCPUs Memory (GB)
c2d-standard-2 2 8
c2d-standard-4 4 16
c2d-standard-8 8 32
c2d-standard-16 16 64
c2d-standard-32 32 128
c2d-standard-56 56 224
c2d-standard-112 112 448
c2d-highcpu-2 2 4
c2d-highcpu-4 4 8
c2d-highcpu-8 8 16
c2d-highcpu-16 16 32
c2d-highcpu-32 32 64
c2d-highcpu-56 56 112
c2d-highcpu-112 112 224
c2d-highmem-2 2 16
c2d-highmem-4 4 32
c2d-highmem-8 8 64
c2d-highmem-16 16 128
c2d-highmem-32 32 256
c2d-highmem-56 56 448
c2d-highmem-112 112 896

A2 Series

Name vCPUs Memory (GB) GPUs (A100 40GB)
a2-highgpu-1g 12 85 1
a2-highgpu-2g 24 170 2
a2-highgpu-4g 48 340 4
a2-highgpu-8g 96 680 8
a2-megagpu-16g 96 1360 16

G2 Series

Name vCPUs Memory (GB) GPUs (NVIDIA L4)
g2-standard-4 4 16 1
g2-standard-8 8 32 1
g2-standard-12 12 48 1
g2-standard-16 16 64 1
g2-standard-24 24 96 2
g2-standard-32 32 128 1
g2-standard-48 48 192 4
g2-standard-96 96 384 8

Learn about pricing for each machine type. For detailed specifications of these machine types, see the Compute Engine documentation on machine types.

Find the ideal machine type

To find the ideal machine type for your use case, we recommend loading your model on multiple machine types and measuring characteristics such as latency, cost, concurrency, and throughput.

One way to do this is to run this notebook on multiple machine types and compare the results to find the one that works best for you.

Vertex AI reserves approximately 1 vCPU on each replica for running system processes. This means that running the notebook on a single-core machine type is comparable to using a 2-core machine type for serving predictions.

When considering prediction costs, remember that although larger machines cost more, they can lower overall cost because fewer replicas are required to serve the same workload. This is particularly evident for GPUs, which tend to cost more per hour, but can both provide lower latency and cost less overall.

GPUs

Some configurations, such as the A2 and G2 series, have a fixed number of GPUs built-in.

Other configurations, such as the N1 series, let you optionally add GPUs to accelerate each prediction node. To use GPUs, you must account for several requirements:

  • You can only use GPUs when your Model resource is based on a TensorFlow SavedModel, or when you use a custom container that has been designed to take advantage of GPUs. You can't use GPUs for scikit-learn or XGBoost models.
  • The availability of each type of GPU varies depending on which region you use for your model. Learn which types of GPUs are available in which regions.
  • You can only use one type of GPU for your DeployedModel resource or BatchPredictionJob, and there are limitations on the number of GPUs you can add depending on which machine type you are using. The following table describes these limitations.

The following table shows the GPUs available for online prediction and how many of each type of GPU you can use with each Compute Engine machine type:

Valid numbers of GPUs for each machine type
Machine type | NVIDIA Tesla K80 | NVIDIA Tesla P100 | NVIDIA Tesla V100 | NVIDIA Tesla P4 | NVIDIA Tesla T4
n1-standard-2 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4
n1-standard-4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4
n1-standard-8 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4
n1-standard-16 | 2, 4, 8 | 1, 2, 4 | 2, 4, 8 | 1, 2, 4 | 1, 2, 4
n1-standard-32 | 4, 8 | 2, 4 | 4, 8 | 2, 4 | 2, 4
n1-highmem-2 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4
n1-highmem-4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4
n1-highmem-8 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4
n1-highmem-16 | 2, 4, 8 | 1, 2, 4 | 2, 4, 8 | 1, 2, 4 | 1, 2, 4
n1-highmem-32 | 4, 8 | 2, 4 | 4, 8 | 2, 4 | 2, 4
n1-highcpu-2 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4
n1-highcpu-4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4
n1-highcpu-8 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4
n1-highcpu-16 | 2, 4, 8 | 1, 2, 4 | 2, 4, 8 | 1, 2, 4 | 1, 2, 4
n1-highcpu-32 | 4, 8 | 2, 4 | 4, 8 | 2, 4 | 2, 4

GPUs are optional and incur additional costs.
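
For example, here is a minimal sketch of attaching GPUs to an N1 machine type with the Vertex AI SDK for Python. The project, model ID, and display name are placeholders, and the accelerator count must be one of the valid values from the table above:

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")
model = aiplatform.Model(model_name="1234567890")  # placeholder model ID

# Attach two NVIDIA T4 GPUs to each n1-standard-8 prediction node; these
# map to dedicatedResources.machineSpec.acceleratorType and
# acceleratorCount on the DeployedModel.
endpoint = model.deploy(
    deployed_model_display_name="gpu-deployment",
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=2,
    min_replica_count=1,
    max_replica_count=1,
)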

Scaling

When you deploy a Model for online prediction as a DeployedModel, you can configure prediction nodes to automatically scale. To do this, set dedicatedResources.maxReplicaCount to a greater value than dedicatedResources.minReplicaCount.

When you configure a DeployedModel, you must set dedicatedResources.minReplicaCount to at least 1. In other words, you cannot configure the DeployedModel to scale to 0 prediction nodes when it is unused.
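
For example, a minimal sketch with the Vertex AI SDK for Python (placeholder project and model ID) that keeps at least one replica running and allows scaling out to five:

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")
model = aiplatform.Model(model_name="1234567890")  # placeholder model ID

# min_replica_count must be at least 1; setting max_replica_count higher
# than min_replica_count enables autoscaling between the two values.
endpoint = model.deploy(
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=5,
)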

The prediction nodes for batch prediction do not automatically scale. Vertex AI uses BatchDedicatedResources.startingReplicaCount and ignores BatchDedicatedResources.maxReplicaCount.

Scaling behavior

Target utilization and configuration

By default, if you deploy a model without dedicated GPU resources, Vertex AI automatically scales the number of replicas up or down so that CPU usage matches the default 60% target value.

By default, if you deploy a model with dedicated GPU resources (that is, machineSpec.accelerator_count is above 0), Vertex AI automatically scales the number of replicas up or down so that CPU or GPU usage, whichever is higher, matches the default 60% target value. Therefore, if your prediction throughput causes high GPU usage but not high CPU usage, Vertex AI scales up, and CPU utilization remains very low, which is visible in monitoring. Conversely, if your custom container underutilizes the GPU but has an unrelated process that brings CPU utilization above 60%, Vertex AI scales up, even if this is not needed to meet your QPS and latency targets.

You can override the default threshold metric and target by specifying autoscalingMetricSpecs. Note that if your deployment is configured to scale based only on CPU usage, it will not scale up even if GPU usage is high.
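
For example, recent versions of the Vertex AI SDK for Python expose these targets as deploy-time parameters that are translated into autoscalingMetricSpecs; the sketch below uses placeholder values and assumes a GPU-backed deployment:

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")
model = aiplatform.Model(model_name="1234567890")  # placeholder model ID

# Target 70% CPU utilization and 70% GPU duty cycle for autoscaling
# decisions, instead of the default 60%.
endpoint = model.deploy(
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=4,
    autoscaling_target_cpu_utilization=70,
    autoscaling_target_accelerator_duty_cycle=70,
)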

Manage resource usage

You can monitor your endpoint to track metrics like CPU and Accelerator usage, number of requests, latency, as well as the current and target number of replicas. This information can help you understand your endpoint's resource usage and scaling behavior.

Keep in mind that each replica runs only a single container. This means that if a prediction container cannot fully utilize the selected compute resource, such as single-threaded code on a multi-core machine, or a custom model that calls another service as part of making the prediction, your nodes may not scale up.

For example, if you are using FastAPI, or any model server that has a configurable number of workers or threads, there are many cases where having more than one worker can increase resource utilization, which improves the service's ability to automatically scale the number of replicas.

We generally recommend starting with one worker or thread per core. If you notice that CPU utilization is low, especially under high load, or your model is not scaling up because CPU utilization is low, then increase the number of workers. On the other hand, if you notice that utilization is too high and your latencies increase more than expected under load, then try using fewer workers. If you are already using only a single worker, try using a smaller machine type.
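
As a hypothetical illustration for a FastAPI-based custom container, the entrypoint below starts uvicorn with one worker per core; the file name main.py, the app object name, and the port fallback are assumptions for this sketch (Vertex AI sets the AIP_HTTP_PORT environment variable for custom containers):

# main.py -- hypothetical entrypoint for a FastAPI custom container
import multiprocessing
import os

import uvicorn

if __name__ == "__main__":
    # Start with one worker per core, then tune the count based on
    # observed CPU utilization and latency under load.
    workers = multiprocessing.cpu_count()
    uvicorn.run(
        "main:app",  # assumes a FastAPI app object named `app` in main.py
        host="0.0.0.0",
        port=int(os.environ.get("AIP_HTTP_PORT", "8080")),
        workers=workers,
    )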

Scaling behavior and lag

Vertex AI adjusts the number of replicas every 5 minutes. Within each 5-minute window, the system measures server utilization every 15 seconds and generates a target number of replicas based on the following formula:

target # of replicas = Ceil(current # of replicas * (current utilization / target utilization))

For example, if you currently have two replicas that are being utilized at 100%, the target is 4:

4 = Ceil(3.33) = Ceil(2 * (100% / 60%))

As another example, if you currently have 10 replicas and utilization drops to 1%, the target is 1:

1 = Ceil(.167) = Ceil(10 * (1% / 60%))

At the end of the 5-minute window, the system adjusts the number of replicas to match the highest target value from that window. Notice that because the highest target value is chosen, your endpoint will not scale down if there is a spike in utilization during that 5-minute window, even if overall utilization is very low.

Keep in mind that even after Vertex AI adjusts the number of replicas, it takes time to start up or turn down the replicas. Thus there is an additional delay before the endpoint can adjust to the traffic. The main factors that contribute to this time are:

  • The time to provision and start the Compute Engine VMs
  • The time to download the container from the registry
  • The time to load the model from storage

The best way to understand the real world scaling behavior of your model is to run a load test and optimize the characteristics that matter for your model and your use case. If the autoscaler is not scaling up fast enough for your application, provision enough min_replicas to handle your expected baseline traffic.

Update the scaling configuration

If you specified either DedicatedResources or AutomaticResources when you deployed the model, you can update the scaling configuration without redeploying the model by calling mutateDeployedModel.

For example, the following request updates max_replica and autoscaling_metric_specs, and disables container logging.

{
  "deployedModel": {
    "id": "2464520679043629056",
    "dedicatedResources": {
      "maxReplicaCount": 9,
      "autoscalingMetricSpecs": [
        {
          "metricName": "aiplatform.googleapis.com/prediction/online/cpu/utilization",
          "target": 50
        }
      ]
    },
    "disableContainerLogging": true
  },
  "update_mask": {
    "paths": [
      "dedicated_resources.max_replica_count",
      "dedicated_resources.autoscaling_metric_specs",
      "disable_container_logging"
    ]
  }
}

Usage notes:

  • You cannot change the machine type or switch from DedicatedResources to AutomaticResources, or vice versa. The only scaling configuration fields you can change are min_replica, max_replica, and AutoscalingMetricSpec (DedicatedResources only).
  • You must list every field you wish to update in updateMask. Any field that is not listed is ignored.
  • The DeployedModel must be in a DEPLOYED state. There can be, at most, one active mutate operation per deployed model.
  • mutateDeployedModel also allows you to enable or disable container logging. For more information, see Online prediction logging.
