Configure compute resources for inference

Vertex AI allocates nodes to handle online and batch inferences. When you deploy a custom-trained model or AutoML model to an Endpoint resource to serve online inferences or when you request batch inferences, you can customize the type of virtual machine that the inference service uses for these nodes. You can optionally configure inference nodes to use GPUs.

Machine types differ in a few ways:

Number of virtual CPUs (vCPUs) per node
Amount of memory per node
Pricing

By selecting a machine type with more computing resources, you can serve inferences with lower latency or handle more inference requests at the same time.

Manage cost and availability

To help manage costs or ensure availability of VM resources, Vertex AI provides the following:

To make sure that VM resources are available when your inference jobs need them, you can use Compute Engine reservations. Reservations provide a high level of assurance in obtaining capacity for Compute Engine resources. For more information, see Use reservations with inference.
To reduce the cost of running your inference jobs, you can use Spot VMs. Spot VMs are virtual machine (VM) instances that are excess Compute Engine capacity. Spot VMs have significant discounts, but Compute Engine might preemptively stop or delete Spot VMs to reclaim the capacity at any time. For more information, see Use Spot VMs with inference.

Where to specify compute resources

Online inference

If you want to use a custom-trained model or an AutoML tabular model to serve online inferences, you must specify a machine type when you deploy the Model resource as a DeployedModel to an Endpoint. For other types of AutoML models, Vertex AI configures the machine types automatically.

Specify the machine type (and, optionally, GPU configuration) in the dedicatedResources.machineSpec field of your DeployedModel.

Learn how to deploy each model type:

Batch inference

If you want to get batch inferences from a custom-trained model or an AutoML tabular model, you must specify a machine type when you create a BatchPredictionJob resource. Specify the machine type (and, optionally, GPU configuration) in the dedicatedResources.machineSpec field of your BatchPredictionJob.

Machine types

The following table compares the available machine types for serving inferences from custom-trained models and AutoML tabular models:

E2 Series

Name	vCPUs	Memory (GB)
`e2-standard-2`	2	8
`e2-standard-4`	4	16
`e2-standard-8`	8	32
`e2-standard-16`	16	64
`e2-standard-32`	32	128
`e2-highmem-2`	2	16
`e2-highmem-4`	4	32
`e2-highmem-8`	8	64
`e2-highmem-16`	16	128
`e2-highcpu-2`	2	2
`e2-highcpu-4`	4	4
`e2-highcpu-8`	8	8
`e2-highcpu-16`	16	16
`e2-highcpu-32`	32	32

N1 Series

Name	vCPUs	Memory (GB)
`n1-standard-2`	2	7.5
`n1-standard-4`	4	15
`n1-standard-8`	8	30
`n1-standard-16`	16	60
`n1-standard-32`	32	120
`n1-highmem-2`	2	13
`n1-highmem-4`	4	26
`n1-highmem-8`	8	52
`n1-highmem-16`	16	104
`n1-highmem-32`	32	208
`n1-highcpu-4`	4	3.6
`n1-highcpu-8`	8	7.2
`n1-highcpu-16`	16	14.4
`n1-highcpu-32`	32	28.8

N2 Series

Name	vCPUs	Memory (GB)
`n2-standard-2`	2	8
`n2-standard-4`	4	16
`n2-standard-8`	8	32
`n2-standard-16`	16	64
`n2-standard-32`	32	128
`n2-standard-48`	48	192
`n2-standard-64`	64	256
`n2-standard-80`	80	320
`n2-standard-96`	96	384
`n2-standard-128`	128	512
`n2-highmem-2`	2	16
`n2-highmem-4`	4	32
`n2-highmem-8`	8	64
`n2-highmem-16`	16	128
`n2-highmem-32`	32	256
`n2-highmem-48`	48	384
`n2-highmem-64`	64	512
`n2-highmem-80`	80	640
`n2-highmem-96`	96	768
`n2-highmem-128`	128	864
`n2-highcpu-2`	2	2
`n2-highcpu-4`	4	4
`n2-highcpu-8`	8	8
`n2-highcpu-16`	16	16
`n2-highcpu-32`	32	32
`n2-highcpu-48`	48	48
`n2-highcpu-64`	64	64
`n2-highcpu-80`	80	80
`n2-highcpu-96`	96	96

N2D Series

Name	vCPUs	Memory (GB)
`n2d-standard-2`	2	8
`n2d-standard-4`	4	16
`n2d-standard-8`	8	32
`n2d-standard-16`	16	64
`n2d-standard-32`	32	128
`n2d-standard-48`	48	192
`n2d-standard-64`	64	256
`n2d-standard-80`	80	320
`n2d-standard-96`	96	384
`n2d-standard-128`	128	512
`n2d-standard-224`	224	896
`n2d-highmem-2`	2	16
`n2d-highmem-4`	4	32
`n2d-highmem-8`	8	64
`n2d-highmem-16`	16	128
`n2d-highmem-32`	32	256
`n2d-highmem-48`	48	384
`n2d-highmem-64`	64	512
`n2d-highmem-80`	80	640
`n2d-highmem-96`	96	768
`n2d-highcpu-2`	2	2
`n2d-highcpu-4`	4	4
`n2d-highcpu-8`	8	8
`n2d-highcpu-16`	16	16
`n2d-highcpu-32`	32	32
`n2d-highcpu-48`	48	48
`n2d-highcpu-64`	64	64
`n2d-highcpu-80`	80	80
`n2d-highcpu-96`	96	96
`n2d-highcpu-128`	128	128
`n2d-highcpu-224`	224	224

C2 Series

Name	vCPUs	Memory (GB)
`c2-standard-4`	4	16
`c2-standard-8`	8	32
`c2-standard-16`	16	64
`c2-standard-30`	30	120
`c2-standard-60`	60	240

C2D Series

Name	vCPUs	Memory (GB)
`c2d-standard-2`	2	8
`c2d-standard-4`	4	16
`c2d-standard-8`	8	32
`c2d-standard-16`	16	64
`c2d-standard-32`	32	128
`c2d-standard-56`	56	224
`c2d-standard-112`	112	448
`c2d-highcpu-2`	2	4
`c2d-highcpu-4`	4	8
`c2d-highcpu-8`	8	16
`c2d-highcpu-16`	16	32
`c2d-highcpu-32`	32	64
`c2d-highcpu-56`	56	112
`c2d-highcpu-112`	112	224
`c2d-highmem-2`	2	16
`c2d-highmem-4`	4	32
`c2d-highmem-8`	8	64
`c2d-highmem-16`	16	128
`c2d-highmem-32`	32	256
`c2d-highmem-56`	56	448
`c2d-highmem-112`	112	896

C3 Series

Name	vCPUs	Memory (GB)
`c3-highcpu-4`	4	8
`c3-highcpu-8`	8	16
`c3-highcpu-22`	22	44
`c3-highcpu-44`	44	88
`c3-highcpu-88`	88	176
`c3-highcpu-176`	176	352

A2 Series

Name	vCPUs	Memory (GB)	GPUs (NVIDIA A100)
`a2-highgpu-1g`	12	85	1 (A100 40GB)
`a2-highgpu-2g`	24	170	2 (A100 40GB)
`a2-highgpu-4g`	48	340	4 (A100 40GB)
`a2-highgpu-8g`	96	680	8 (A100 40GB)
`a2-megagpu-16g`	96	1360	16 (A100 40GB)
`a2-ultragpu-1g`	12	170	1 (A100 80GB)
`a2-ultragpu-2g`	24	340	2 (A100 80GB)
`a2-ultragpu-4g`	48	680	4 (A100 80GB)
`a2-ultragpu-8g`	96	1360	8 (A100 80GB)

A3 Series

Name	vCPUs	Memory (GB)	GPUs (NVIDIA H100 or H200)
`a3-highgpu-1g`	26	234	1 (H100 80GB)
`a3-highgpu-2g`	52	468	2 (H100 80GB)
`a3-highgpu-4g`	104	936	4 (H100 80GB)
`a3-highgpu-8g`	208	1872	8 (H100 80GB)
`a3-edgegpu-8g`	208	1872	8 (H100 80GB)
`a3-ultragpu-8g`	224	2952	8 (H200 141GB)

A4X Series

Name	vCPUs	Memory (GB)	GPUs (NVIDIA GB200)
`a4x-highgpu-4g`	140	884	4

G2 Series

Name	vCPUs	Memory (GB)	GPUs (NVIDIA L4)
`g2-standard-4`	4	16	1
`g2-standard-8`	8	32	1
`g2-standard-12`	12	48	1
`g2-standard-16`	16	64	1
`g2-standard-24`	24	96	2
`g2-standard-32`	32	128	1
`g2-standard-48`	48	192	4
`g2-standard-96`	96	384	8

Learn about pricing for each machine type. Read more about the detailed specifications of these machine types in the Compute Engine documentation about machine types.

Find the ideal machine type

Online inference

To find the ideal machine type for your use case, we recommend loading your model on multiple machine types and measuring characteristics such as the latency, cost, concurrency, and throughput.

One way to do this is to run this notebook on multiple machine types and compare the results to find the one that works best for you.

Vertex AI reserves approximately 1 vCPU on each replica for running system processes. This means that running the notebook on a single core machine type would be comparable to using a 2-core machine type for serving inferences.

When considering inference costs, remember that although larger machines cost more, they can lower overall cost because fewer replicas are required to serve the same workload. This is particularly evident for GPUs, which tend to cost more per hour, but can both provide lower latency and cost less overall.

Batch inference

For more information, see Choose machine type and replica count.

Optional GPU accelerators

Some configurations, such as the A2 series and G2 series, have a fixed number of GPUs built-in.

The A4X (a4x-highgpu-4g) series requires a minimum replica count of 18. This machine is purchased per rack, and has a minimum of 18 VMs.

Other configurations, such as the N1 series, let you optionally add GPUs to accelerate each inference node.

To add optional GPU accelerators, you must account for several requirements:

You can only use GPUs when your Model resource is based on a TensorFlow SavedModel, or when you use a custom container that has been designed to take advantage of GPUs. You can't use GPUs for scikit-learn or XGBoost models.
The availability of each type of GPU varies depending on which region you use for your model. Learn which types of GPUs are available in which regions.
You can only use one type of GPU for your DeployedModel resource or BatchPredictionJob, and there are limitations on the number of GPUs you can add depending on which machine type you are using. The following table describes these limitations.

The following table shows the optional GPUs that are available for online inference and how many of each type of GPU you can use with each Compute Engine machine type:

Valid numbers of GPUs for each machine type
Machine type	NVIDIA Tesla P100	NVIDIA Tesla V100	NVIDIA Tesla P4	NVIDIA Tesla T4
`n1-standard-2`	1, 2, 4	1, 2, 4, 8	1, 2, 4	1, 2, 4
`n1-standard-4`	1, 2, 4	1, 2, 4, 8	1, 2, 4	1, 2, 4
`n1-standard-8`	1, 2, 4	1, 2, 4, 8	1, 2, 4	1, 2, 4
`n1-standard-16`	1, 2, 4	2, 4, 8	1, 2, 4	1, 2, 4
`n1-standard-32`	2, 4	4, 8	2, 4	2, 4
`n1-highmem-2`	1, 2, 4	1, 2, 4, 8	1, 2, 4	1, 2, 4
`n1-highmem-4`	1, 2, 4	1, 2, 4, 8	1, 2, 4	1, 2, 4
`n1-highmem-8`	1, 2, 4	1, 2, 4, 8	1, 2, 4	1, 2, 4
`n1-highmem-16`	1, 2, 4	2, 4, 8	1, 2, 4	1, 2, 4
`n1-highmem-32`	2, 4	4, 8	2, 4	2, 4
`n1-highcpu-2`	1, 2, 4	1, 2, 4, 8	1, 2, 4	1, 2, 4
`n1-highcpu-4`	1, 2, 4	1, 2, 4, 8	1, 2, 4	1, 2, 4
`n1-highcpu-8`	1, 2, 4	1, 2, 4, 8	1, 2, 4	1, 2, 4
`n1-highcpu-16`	1, 2, 4	2, 4, 8	1, 2, 4	1, 2, 4
`n1-highcpu-32`	2, 4	4, 8	2, 4	2, 4

Optional GPUs incur additional costs.

Configure compute resources for inference Stay organized with collections Save and categorize content based on your preferences.

Manage cost and availability

Where to specify compute resources

Online inference

Batch inference

Machine types

E2 Series

N1 Series

N2 Series

N2D Series

C2 Series

C2D Series

C3 Series

A2 Series

A3 Series

A4X Series

G2 Series

Find the ideal machine type

Online inference

Batch inference

Optional GPU accelerators

What's next

Configure compute resources for inference