Configuring compute resources for prediction

Vertex AI allocates nodes to handle online and batch predictions. When you deploy a custom-trained model or an AutoML tabular model to an Endpoint resource to serve online predictions, or when you request batch predictions, you can customize the type of virtual machine that the prediction service uses for these nodes. You can optionally configure prediction nodes to use GPUs.

Machine types differ in a few ways:

  • Number of virtual CPUs (vCPUs) per node
  • Amount of memory per node
  • Pricing

By selecting a machine type with more computing resources, you can serve predictions with lower latency or handle more prediction requests at the same time.

Where to specify compute resources

If you want to use a custom-trained model or an AutoML tabular model to serve online predictions, you must specify a machine type when you deploy the Model resource as a DeployedModel to an Endpoint. For other types of AutoML models, Vertex AI configures the machine types automatically.

Specify the machine type (and, optionally, GPU configuration) in the dedicatedResources.machineSpec field of your DeployedModel.
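
For example, here is a minimal sketch using the Vertex AI SDK for Python; the project, region, and model ID are placeholders, and n1-standard-4 is just one of the machine types listed below:

    from google.cloud import aiplatform

    # Placeholder project, region, and model ID; replace with your own values.
    aiplatform.init(project="my-project", location="us-central1")
    model = aiplatform.Model("MODEL_ID")

    # machine_type populates dedicatedResources.machineSpec.machineType
    # on the resulting DeployedModel.
    endpoint = model.deploy(
        machine_type="n1-standard-4",
        min_replica_count=1,
        max_replica_count=1,
    )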

Learn how to deploy each model type in the deployment guide for that model type.

If you want to get batch predictions from a custom-trained model or an AutoML tabular model, you must specify a machine type when you create a BatchPredictionJob resource. Specify the machine type (and, optionally, GPU configuration) in the dedicatedResources.machineSpec field of your BatchPredictionJob.
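
For example, here is a minimal sketch using the Vertex AI SDK for Python; the project, model ID, Cloud Storage paths, and job display name are placeholders:

    from google.cloud import aiplatform

    # Placeholder project, region, model ID, and Cloud Storage paths.
    aiplatform.init(project="my-project", location="us-central1")
    model = aiplatform.Model("MODEL_ID")

    # machine_type populates dedicatedResources.machineSpec.machineType
    # on the BatchPredictionJob.
    batch_job = model.batch_predict(
        job_display_name="example-batch-job",
        gcs_source="gs://my-bucket/input.jsonl",
        gcs_destination_prefix="gs://my-bucket/output/",
        machine_type="n1-standard-4",
        starting_replica_count=1,
        max_replica_count=1,
    )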

Machine types

The following table compares the available machine types for serving predictions from custom-trained models and AutoML tabular models:

Name              vCPUs   Memory (GB)
n1-standard-2     2       7.5
n1-standard-4     4       15
n1-standard-8     8       30
n1-standard-16    16      60
n1-standard-32    32      120
n1-highmem-2      2       13
n1-highmem-4      4       26
n1-highmem-8      8       52
n1-highmem-16     16      104
n1-highmem-32     32      208
n1-highcpu-4      4       3.6
n1-highcpu-8      8       7.2
n1-highcpu-16     16      14.4
n1-highcpu-32     32      28.8

Learn about pricing for each machine type. For detailed specifications of these machine types, see the Compute Engine documentation about machine types.

GPUs

For some configurations, you can optionally add GPUs to accelerate each prediction node. To use GPUs, you must account for several requirements:

  • You can only use GPUs when your Model resource is based on a TensorFlow SavedModel, or when you use a custom container that has been designed to take advantage of GPUs. You cannot use GPUs for scikit-learn or XGBoost models.
  • The availability of each type of GPU varies depending on which region you use for your model. Learn which types of GPUs are available in which regions.
  • You can use only one type of GPU for your DeployedModel resource or BatchPredictionJob, and there are limits on the number of GPUs you can add depending on which machine type you use.

The following table shows the GPUs available for online prediction and how many of each type of GPU you can use with each Compute Engine machine type:

Valid numbers of GPUs for each machine type:

Machine type      NVIDIA Tesla K80   NVIDIA Tesla P100   NVIDIA Tesla V100   NVIDIA Tesla P4   NVIDIA Tesla T4
n1-standard-2     1, 2, 4, 8         1, 2, 4             1, 2, 4, 8          1, 2, 4           1, 2, 4
n1-standard-4     1, 2, 4, 8         1, 2, 4             1, 2, 4, 8          1, 2, 4           1, 2, 4
n1-standard-8     1, 2, 4, 8         1, 2, 4             1, 2, 4, 8          1, 2, 4           1, 2, 4
n1-standard-16    2, 4, 8            1, 2, 4             2, 4, 8             1, 2, 4           1, 2, 4
n1-standard-32    4, 8               2, 4                4, 8                2, 4              2, 4
n1-highmem-2      1, 2, 4, 8         1, 2, 4             1, 2, 4, 8          1, 2, 4           1, 2, 4
n1-highmem-4      1, 2, 4, 8         1, 2, 4             1, 2, 4, 8          1, 2, 4           1, 2, 4
n1-highmem-8      1, 2, 4, 8         1, 2, 4             1, 2, 4, 8          1, 2, 4           1, 2, 4
n1-highmem-16     2, 4, 8            1, 2, 4             2, 4, 8             1, 2, 4           1, 2, 4
n1-highmem-32     4, 8               2, 4                4, 8                2, 4              2, 4
n1-highcpu-2      1, 2, 4, 8         1, 2, 4             1, 2, 4, 8          1, 2, 4           1, 2, 4
n1-highcpu-4      1, 2, 4, 8         1, 2, 4             1, 2, 4, 8          1, 2, 4           1, 2, 4
n1-highcpu-8      1, 2, 4, 8         1, 2, 4             1, 2, 4, 8          1, 2, 4           1, 2, 4
n1-highcpu-16     2, 4, 8            1, 2, 4             2, 4, 8             1, 2, 4           1, 2, 4
n1-highcpu-32     4, 8               2, 4                4, 8                2, 4              2, 4

GPUs are optional and incur additional costs.
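
For example, here is a minimal sketch that attaches GPUs during deployment using the Vertex AI SDK for Python; the project and model ID are placeholders, and the n1-standard-8 with two NVIDIA Tesla T4 GPUs pairing is one valid combination from the table above:

    from google.cloud import aiplatform

    # Placeholder project, region, and model ID.
    aiplatform.init(project="my-project", location="us-central1")
    model = aiplatform.Model("MODEL_ID")

    # accelerator_type and accelerator_count populate
    # dedicatedResources.machineSpec.acceleratorType and .acceleratorCount.
    # Two T4 GPUs per n1-standard-8 node is a valid count per the table above.
    endpoint = model.deploy(
        machine_type="n1-standard-8",
        accelerator_type="NVIDIA_TESLA_T4",
        accelerator_count=2,
        min_replica_count=1,
        max_replica_count=1,
    )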

Scaling

When you deploy a Model as a DeployedModel or create a BatchPredictionJob, you can configure prediction nodes to scale automatically. To do this, set dedicatedResources.maxReplicaCount to a value greater than dedicatedResources.minReplicaCount (for a DeployedModel) or dedicatedResources.startingReplicaCount (for a BatchPredictionJob).
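
For example, here is a minimal autoscaling sketch using the Vertex AI SDK for Python; the project, model ID, and replica counts are placeholder values:

    from google.cloud import aiplatform

    # Placeholder project, region, and model ID.
    aiplatform.init(project="my-project", location="us-central1")
    model = aiplatform.Model("MODEL_ID")

    # Because max_replica_count is greater than min_replica_count, Vertex AI
    # autoscales this deployment between 1 and 5 prediction nodes.
    endpoint = model.deploy(
        machine_type="n1-standard-4",
        min_replica_count=1,
        max_replica_count=5,
    )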

If you use this type of configuration, Vertex AI automatically scales your DeployedModel or BatchPredictionJob to use more prediction nodes when the CPU usage of your existing nodes gets high. Vertex AI scales your nodes based on CPU usage even if you have configured your prediction nodes to use GPUs. Therefore, if your prediction throughput causes high GPU usage but not high CPU usage, your nodes might not scale as you expect.

What's next