Vertex AI allocates nodes to handle online and batch predictions.
When you deploy a custom-trained model or AutoML tabular model to an Endpoint
resource to serve online predictions or when
you request batch predictions, you can
customize the type of virtual machine that the prediction service uses for
these nodes. You can optionally configure prediction nodes to use GPUs.
Machine types differ in a few ways:
- Number of virtual CPUs (vCPUs) per node
- Amount of memory per node
- Pricing
By selecting a machine type with more computing resources, you can serve predictions with lower latency or handle more prediction requests at the same time.
Where to specify compute resources
If you want to use a custom-trained model or an AutoML tabular model to serve
online predictions, you must specify a machine type when you deploy the Model
resource as a DeployedModel to an Endpoint. For other types of AutoML models,
Vertex AI configures the machine types automatically.

Specify the machine type (and, optionally, GPU configuration) in the
dedicatedResources.machineSpec field of your DeployedModel.
Learn how to deploy each model type:
- Deploy an AutoML tabular model in Google Cloud console
- Deploy a custom-trained model in Google Cloud console
- Deploy a custom-trained model using client libraries
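For example, if you deploy with the Vertex AI SDK for Python, you can pass the machine type directly to the deploy call. The following is a minimal sketch, not a complete recipe: the project, region, model ID, and display name are placeholders, and the exact arguments can vary by SDK version.

```python
from google.cloud import aiplatform

# Placeholder project, region, and model ID -- replace with your own values.
aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model("projects/my-project/locations/us-central1/models/1234567890")

# machine_type, min_replica_count, and max_replica_count populate
# dedicatedResources on the resulting DeployedModel.
endpoint = model.deploy(
    deployed_model_display_name="my-deployed-model",
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=1,
)
```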
If you want to get batch predictions from a custom-trained model or an AutoML
tabular model, you must specify a machine type when you create a
BatchPredictionJob resource. Specify the machine type (and, optionally, GPU
configuration) in the dedicatedResources.machineSpec field of your
BatchPredictionJob.
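With the Vertex AI SDK for Python, a batch prediction request that sets the machine type might look like the following sketch. The Cloud Storage paths and model ID are placeholders; verify the arguments against your installed SDK version.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")
model = aiplatform.Model("projects/my-project/locations/us-central1/models/1234567890")

# machine_type (and the optional accelerator arguments) populate
# dedicatedResources.machineSpec on the BatchPredictionJob.
batch_job = model.batch_predict(
    job_display_name="my-batch-prediction",
    gcs_source="gs://my-bucket/input.jsonl",
    gcs_destination_prefix="gs://my-bucket/output/",
    machine_type="n1-standard-8",
    starting_replica_count=2,
    max_replica_count=10,
)
```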
Machine types
The following table compares the available machine types for serving predictions from custom-trained models and AutoML tabular models:
E2 Series
Name | vCPUs | Memory (GB) |
---|---|---|
e2-standard-2 | 2 | 8 |
e2-standard-4 | 4 | 16 |
e2-standard-8 | 8 | 32 |
e2-standard-16 | 16 | 64 |
e2-standard-32 | 32 | 128 |
e2-highmem-2 | 2 | 16 |
e2-highmem-4 | 4 | 32 |
e2-highmem-8 | 8 | 64 |
e2-highmem-16 | 16 | 128 |
e2-highcpu-2 | 2 | 2 |
e2-highcpu-4 | 4 | 4 |
e2-highcpu-8 | 8 | 8 |
e2-highcpu-16 | 16 | 16 |
e2-highcpu-32 | 32 | 32 |
N1 Series
Name | vCPUs | Memory (GB) |
---|---|---|
n1-standard-2 | 2 | 7.5 |
n1-standard-4 | 4 | 15 |
n1-standard-8 | 8 | 30 |
n1-standard-16 | 16 | 60 |
n1-standard-32 | 32 | 120 |
n1-highmem-2 | 2 | 13 |
n1-highmem-4 | 4 | 26 |
n1-highmem-8 | 8 | 52 |
n1-highmem-16 | 16 | 104 |
n1-highmem-32 | 32 | 208 |
n1-highcpu-4 | 4 | 3.6 |
n1-highcpu-8 | 8 | 7.2 |
n1-highcpu-16 | 16 | 14.4 |
n1-highcpu-32 | 32 | 28.8 |
A2 Series
Name | vCPUs | Memory (GB) | GPUs (A100 40GB) |
---|---|---|---|
a2-highgpu-1g | 12 | 85 | 1 |
a2-highgpu-2g | 24 | 170 | 2 |
a2-highgpu-4g | 48 | 340 | 4 |
a2-highgpu-8g | 96 | 680 | 8 |
a2-megagpu-16g | 96 | 1360 | 16 |
Learn about pricing for each machine type. Read more about the detailed specifications of these machine types in the Compute Engine documentation about machine types.
GPUs
Some configurations, such as the A2 series, have a fixed number of GPUs built-in.
Other configurations, such as the N1 series, let you optionally add GPUs to accelerate each prediction node. To use GPUs, you must account for several requirements:
- You can only use GPUs when your Model resource is based on a TensorFlow SavedModel, or when you use a custom container that has been designed to take advantage of GPUs. You cannot use GPUs for scikit-learn or XGBoost models.
- The availability of each type of GPU varies depending on which region you use for your model. Learn which types of GPUs are available in which regions.
- You can only use one type of GPU for your DeployedModel resource or BatchPredictionJob, and there are limitations on the number of GPUs you can add depending on which machine type you are using. The following table describes these limitations.
The following table shows the GPUs available for online prediction and how many of each type of GPU you can use with each Compute Engine machine type:
Valid numbers of GPUs for each machine type:

Machine type | NVIDIA Tesla K80 | NVIDIA Tesla P100 | NVIDIA Tesla V100 | NVIDIA Tesla P4 | NVIDIA Tesla T4 |
---|---|---|---|---|---|
n1-standard-2 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 |
n1-standard-4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 |
n1-standard-8 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 |
n1-standard-16 | 2, 4, 8 | 1, 2, 4 | 2, 4, 8 | 1, 2, 4 | 1, 2, 4 |
n1-standard-32 | 4, 8 | 2, 4 | 4, 8 | 2, 4 | 2, 4 |
n1-highmem-2 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 |
n1-highmem-4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 |
n1-highmem-8 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 |
n1-highmem-16 | 2, 4, 8 | 1, 2, 4 | 2, 4, 8 | 1, 2, 4 | 1, 2, 4 |
n1-highmem-32 | 4, 8 | 2, 4 | 4, 8 | 2, 4 | 2, 4 |
n1-highcpu-2 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 |
n1-highcpu-4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 |
n1-highcpu-8 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 |
n1-highcpu-16 | 2, 4, 8 | 1, 2, 4 | 2, 4, 8 | 1, 2, 4 | 1, 2, 4 |
n1-highcpu-32 | 4, 8 | 2, 4 | 4, 8 | 2, 4 | 2, 4 |
GPUs are optional and incur additional costs.
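As an illustrative sketch with the Vertex AI SDK for Python, attaching one NVIDIA Tesla T4 to each n1-standard-8 prediction node (a valid combination from the table above) might look like the following; the model ID and replica counts are placeholders.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")
model = aiplatform.Model("projects/my-project/locations/us-central1/models/1234567890")

# One NVIDIA Tesla T4 per n1-standard-8 node -- a valid combination
# from the table above.
endpoint = model.deploy(
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=2,
)
```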
Scaling
When you deploy a Model for online prediction as a DeployedModel, you can
configure prediction nodes to automatically scale. To do this, set
dedicatedResources.maxReplicaCount to a greater value than
dedicatedResources.minReplicaCount.

When you configure a DeployedModel, you must set
dedicatedResources.minReplicaCount to at least 1. In other words, you cannot
configure the DeployedModel to scale to 0 prediction nodes when it is unused.
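In the underlying v1 API types, these settings live on DedicatedResources. The following minimal sketch (with placeholder values) shows a configuration that can autoscale between 1 and 5 nodes:

```python
from google.cloud.aiplatform_v1.types import DedicatedResources, MachineSpec

# maxReplicaCount greater than minReplicaCount enables autoscaling;
# minReplicaCount must be at least 1.
dedicated_resources = DedicatedResources(
    machine_spec=MachineSpec(machine_type="n1-standard-4"),
    min_replica_count=1,
    max_replica_count=5,
)
```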
The prediction nodes for batch prediction do not automatically scale.
Vertex AI uses BatchDedicatedResources.startingReplicaCount and ignores
BatchDedicatedResources.maxReplicaCount.
Scaling behavior
If you use an autoscaling configuration, Vertex AI automatically scales your
DeployedModel to use more prediction nodes when the CPU usage of your existing
nodes gets high. Vertex AI scales your nodes based on CPU usage even if you
have configured your prediction nodes to use GPUs; therefore, if your
prediction throughput is causing high GPU usage but not high CPU usage, your
nodes might not scale as you expect, because autoscaling with
autoscalingMetricSpecs scales to the most utilized resource. If you deploy a
model with GPU dedicated resources and do not set autoscalingMetricSpecs,
Vertex AI scales the number of prediction nodes so that CPU and GPU are
utilized at 60% (the default value of the metric spec). Therefore, if your
prediction throughput causes high GPU usage but not high CPU usage,
Vertex AI scales up and the CPU utilization will be very low, which is
visible in monitoring. If your custom container underutilizes the GPU but has
an unrelated process that brings CPU utilization above 60%, Vertex AI scales
up, even if this is not needed to achieve your QPS and latency targets.
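If you do set autoscalingMetricSpecs explicitly, the configuration might look like the following sketch using the v1 API types; the duty-cycle metric name and the 70% target shown here are illustrative values, so check them against the current API reference before relying on them.

```python
from google.cloud.aiplatform_v1.types import (
    AcceleratorType,
    AutoscalingMetricSpec,
    DedicatedResources,
    MachineSpec,
)

# Target 70% GPU duty cycle instead of relying on the 60% default.
dedicated_resources = DedicatedResources(
    machine_spec=MachineSpec(
        machine_type="n1-standard-8",
        accelerator_type=AcceleratorType.NVIDIA_TESLA_T4,
        accelerator_count=1,
    ),
    min_replica_count=1,
    max_replica_count=4,
    autoscaling_metric_specs=[
        AutoscalingMetricSpec(
            metric_name="aiplatform.googleapis.com/prediction/online/accelerator/duty_cycle",
            target=70,
        )
    ],
)
```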
In all circumstances, custom containers are deployed with one container per machine. This means that if a custom prediction container cannot fully utilize the selected compute resource, such as single-threaded code on a multi-core machine, or a custom model that calls another service as part of making the prediction, your nodes might not scale up. For a model without a GPU assigned, CPU utilization is the only metric evaluated for scaling. For a model with a GPU assigned, CPU utilization and accelerator duty cycle percentages are both evaluated, so any bottleneck in a model that does not saturate one of those resources can cause the model to underscale. Anything that saturates one of those metrics, without affecting the other, causes the model to scale up.
Finding the ideal machine type
To determine the ideal machine type for a custom prediction container from a cost perspective, deploy that container as a Docker container directly to a Compute Engine instance, then benchmark the instance by sending prediction calls until the VM reaches 90% or higher CPU utilization. Do this multiple times for different machine types, and determine the QPS per cost per hour of each machine type. You can re-run these experiments while benchmarking latency to find the ideal cost per QPS for your latency targets with your specific custom prediction container.

For example, if a custom container runs a Python web server process that can only effectively use one core, and that process calls a multithreaded ML model (such as most implementations of XGBoost), then as QPS increases the web server starts to block XGBoost, because every XGBoost prediction waits on the web server process. If this container is deployed to a 2- or 4-core machine shape, it hits its QPS limit while CPU utilization is high, so when the container is deployed to Vertex AI it autoscales effectively. If the same container is deployed to a 32- or 64-core machine shape, it hits its QPS limit when the single-threaded web server becomes the bottleneck, but the overall CPU utilization stays low, so the model does not scale up, even though scaling up would increase the overall QPS.
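A rough benchmarking sketch along these lines is shown below. It assumes the container is already running locally and serving on http://localhost:8080/predict; the port, route, payload shape, request count, and concurrency level are all placeholders for your container's actual serving configuration.

```python
import concurrent.futures
import json
import time
import urllib.request

# Placeholder endpoint for a locally running prediction container.
URL = "http://localhost:8080/predict"
PAYLOAD = json.dumps({"instances": [[1.0, 2.0, 3.0]]}).encode("utf-8")
N_REQUESTS = 2000
CONCURRENCY = 32


def one_request(_):
    """Send a single prediction request and discard the response body."""
    req = urllib.request.Request(
        URL, data=PAYLOAD, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()


start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    # Materialize the iterator so all requests complete (and errors surface).
    list(pool.map(one_request, range(N_REQUESTS)))
elapsed = time.time() - start

qps = N_REQUESTS / elapsed
print(f"Sustained ~{qps:.1f} QPS over {elapsed:.1f}s")
# Divide QPS by the machine type's hourly price to compare QPS per cost per
# hour, and watch CPU utilization on the VM (for example with `top`) while
# the test runs.
```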
What's next
- Deploy an AutoML tabular model in Google Cloud console
- Deploy a custom-trained model in Google Cloud console
- Deploy a custom-trained model using client libraries
- Get batch predictions