AI Platform Prediction allocates nodes to handle online prediction requests sent to a model version. When you deploy a model version, you can customize the type of virtual machine that AI Platform Prediction uses for these nodes.
Machine types differ in several ways:
- Number of virtual CPUs (vCPUs) per node
- Amount of memory per node
- Support for GPUs, which you can add to some machine types
- Support for certain AI Platform Prediction features
- Pricing
- Service level agreement (SLA) coverage
By selecting a machine type with more computing resources, you can serve predictions with lower latency or handle more prediction requests at the same time.
Available machine types
Compute Engine (N1) machine types and the mls1-c1-m2 machine type are generally available for online prediction. The mls1-c4-m2 machine type is available in beta.
The following table compares the available machine types:
Name | Availability | vCPUs | Memory (GB) | Supports GPUs? | ML framework support | Max model size
---|---|---|---|---|---|---
mls1-c1-m2 (default on global endpoint) | Generally available | 1 | 2 | No | TensorFlow, XGBoost, scikit-learn (including pipelines with custom code), custom prediction routines | 500 MB
mls1-c4-m2 | Beta | 4 | 2 | No | TensorFlow, XGBoost, scikit-learn (including pipelines with custom code), custom prediction routines | 500 MB
n1-standard-2 (default on regional endpoints) | Generally available | 2 | 7.5 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-standard-4 | Generally available | 4 | 15 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-standard-8 | Generally available | 8 | 30 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-standard-16 | Generally available | 16 | 60 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-standard-32 | Generally available | 32 | 120 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highmem-2 | Generally available | 2 | 13 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highmem-4 | Generally available | 4 | 26 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highmem-8 | Generally available | 8 | 52 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highmem-16 | Generally available | 16 | 104 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highmem-32 | Generally available | 32 | 208 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highcpu-2 | Generally available | 2 | 1.8 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highcpu-4 | Generally available | 4 | 3.6 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highcpu-8 | Generally available | 8 | 7.2 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highcpu-16 | Generally available | 16 | 14.4 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highcpu-32 | Generally available | 32 | 28.8 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
Learn about pricing for each machine type. Read more about the detailed specifications of Compute Engine (N1) machine types in the Compute Engine documentation.
Specifying a machine type
You can specify a machine type when you create a model version. If you don't specify a machine type, your model version defaults to n1-standard-2 if you are using a regional endpoint and mls1-c1-m2 if you are using the global endpoint.
The following instructions highlight how to specify a machine type when you create a model version. They use the n1-standard-4 machine type as an example. To learn about the full process of creating a model version, read the guide to deploying models.
Google Cloud console
On the Create version page, open the Machine type drop-down list and select Standard > n1-standard-4.
gcloud
After you have uploaded your model artifacts to Cloud Storage and created a model resource, you can create a model version that uses the n1-standard-4 machine type:
gcloud ai-platform versions create VERSION_NAME \
--model MODEL_NAME \
--origin gs://model-directory-uri \
--runtime-version 2.11 \
--python-version 3.7 \
--framework ML_FRAMEWORK_NAME \
--region us-central1 \
--machine-type n1-standard-4
Python
This example uses the Google API Client Library for Python. Before you run the following code sample, you must set up authentication.
After you have uploaded your model artifacts to Cloud Storage and created a model resource, send a request to your model's projects.models.versions.create method and specify the machineType field in your request body:
from google.api_core.client_options import ClientOptions
from googleapiclient import discovery

# Build the client against the us-central1 regional endpoint.
endpoint = 'https://us-central1-ml.googleapis.com'
client_options = ClientOptions(api_endpoint=endpoint)
ml = discovery.build('ml', 'v1', client_options=client_options)

# Version configuration, including the machine type for each prediction node.
request_dict = {
    'name': 'VERSION_NAME',
    'deploymentUri': 'gs://model-directory-uri',
    'runtimeVersion': '2.11',
    'pythonVersion': '3.7',
    'framework': 'ML_FRAMEWORK_NAME',
    'machineType': 'n1-standard-4'
}

request = ml.projects().models().versions().create(
    parent='projects/PROJECT_NAME/models/MODEL_NAME',
    body=request_dict
)
response = request.execute()
Using GPUs for online prediction
For some configurations, you can optionally add GPUs to accelerate each prediction node. To use GPUs, you must account for several requirements:
- You can only use GPUs with Compute Engine (N1) machine types. Legacy (MLS1) machine types do not support GPUs.
- You can only use GPUs when you deploy a TensorFlow SavedModel. You cannot use GPUs for scikit-learn or XGBoost models.
- The availability of each type of GPU varies depending on which region you use for your model. Learn which types of GPUs are available in which regions.
- You can only use one type of GPU for your model version, and the number of GPUs that you can add depends on which machine type you use.
The following table shows the GPUs available for online prediction and how many of each type of GPU you can use with each Compute Engine (N1) machine type:
Machine type | NVIDIA Tesla K80 | NVIDIA Tesla P4 | NVIDIA Tesla P100 | NVIDIA Tesla T4 | NVIDIA Tesla V100
---|---|---|---|---|---
n1-standard-2 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-standard-4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-standard-8 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-standard-16 | 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 2, 4, 8
n1-standard-32 | 4, 8 | 2, 4 | 2, 4 | 2, 4 | 4, 8
n1-highmem-2 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-highmem-4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-highmem-8 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-highmem-16 | 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 2, 4, 8
n1-highmem-32 | 4, 8 | 2, 4 | 2, 4 | 2, 4 | 4, 8
n1-highcpu-2 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-highcpu-4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-highcpu-8 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-highcpu-16 | 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 2, 4, 8
n1-highcpu-32 | 4, 8 | 2, 4 | 2, 4 | 2, 4 | 4, 8
GPUs are optional and incur additional costs.
Specifying GPUs
Specify GPUs when you create a model version. AI Platform Prediction allocates the number and type of GPU that you specify for each prediction node. You can automatically scale (preview) or manually scale (GA) the prediction nodes, but the number of GPUs that each node uses is fixed when you create the model version. Unless you have an advanced use case, we recommend that you configure one GPU on each prediction node; in other words, set the accelerator count to 1.
The following instructions show how to specify GPUs for online prediction by creating a model version that runs on at least two prediction nodes at all times. Each node uses the n1-standard-4 machine type and one NVIDIA Tesla T4 GPU.
The examples assume that you have already uploaded a TensorFlow SavedModel to Cloud Storage and created a model resource in a region that supports GPUs.
Google Cloud console
Follow the guide to creating a model version. On the Create version page, specify the following options:
- In the Scaling drop-down list, select Auto scaling.
- In the Minimum number of nodes field, enter 2.
- In the Machine type drop-down list, select Standard > n1-standard-4.
- In the Accelerator type drop-down list, select NVIDIA_TESLA_T4.
- In the Accelerator count drop-down list, select 1.
gcloud
Use the gcloud CLI to create a model version. In this example, the version runs on n1-standard-4 prediction nodes that each use one NVIDIA Tesla T4 GPU. AI Platform Prediction automatically scales the number of prediction nodes between 2 and 4, depending on GPU usage at any given time. The example uses the us-central1 regional endpoint:
gcloud beta ai-platform versions create VERSION_NAME \
--model MODEL_NAME \
--origin gs://model-directory-uri \
--runtime-version 2.11 \
--python-version 3.7 \
--framework tensorflow \
--region us-central1 \
--machine-type n1-standard-4 \
--accelerator count=1,type=nvidia-tesla-t4 \
--min-nodes 2 \
--max-nodes 4 \
--metric-targets gpu-duty-cycle=60
Note that the accelerator name is specified in lowercase with hyphens between words.
Python
This example uses the Google API Client Library for Python. Before you run the following code sample, you must set up authentication. The example uses the us-central1 regional endpoint.
Send a request to your model's projects.models.versions.create method and specify the machineType, acceleratorConfig, and autoScaling fields in your request body:
from google.api_core.client_options import ClientOptions
from googleapiclient import discovery

# Build the client against the us-central1 regional endpoint.
endpoint = 'https://us-central1-ml.googleapis.com'
client_options = ClientOptions(api_endpoint=endpoint)
ml = discovery.build('ml', 'v1', client_options=client_options)

# Version configuration: one NVIDIA Tesla T4 GPU per n1-standard-4 node,
# autoscaling between 2 and 4 nodes based on GPU duty cycle.
request_dict = {
    'name': 'VERSION_NAME',
    'deploymentUri': 'gs://model-directory-uri',
    'runtimeVersion': '2.11',
    'pythonVersion': '3.7',
    'framework': 'TENSORFLOW',
    'machineType': 'n1-standard-4',
    'acceleratorConfig': {
        'count': 1,
        'type': 'NVIDIA_TESLA_T4'
    },
    'autoScaling': {
        'minNodes': 2,
        'maxNodes': 4,
        'metrics': [
            {
                'name': 'GPU_DUTY_CYCLE',
                'target': 60
            }
        ]
    }
}

request = ml.projects().models().versions().create(
    parent='projects/PROJECT_NAME/models/MODEL_NAME',
    body=request_dict
)
response = request.execute()
Note that the accelerator name is specified in uppercase with underscores between words.
Differences between machine types
Besides providing different amounts of computing resources, machine types also vary in their support for certain AI Platform Prediction features. The following table provides an overview of the differences between Compute Engine (N1) machine types and legacy (MLS1) machine types:
Feature | Compute Engine (N1) machine types | Legacy (MLS1) machine types
---|---|---
Regions | All regional endpoint regions | All global endpoint regions
Types of ML artifacts | TensorFlow SavedModel, scikit-learn, and XGBoost model artifacts (no custom code) | All supported artifact types, including scikit-learn pipelines with custom code and custom prediction routines
Runtime versions | 1.11 or later | All available AI Platform runtime versions
Custom container support | Yes | No
Max model size | 10 GB | 500 MB
Auto scaling | Minimum nodes = 1 | Minimum nodes = 0
Manual scaling | Can update number of nodes | Cannot update number of nodes after creating model version
GPU support | Yes (TensorFlow only) | No
AI Explanations support | Yes (TensorFlow only) | No
VPC Service Controls support | Yes | No
SLA coverage for generally available machine types | Yes, in some cases | Yes
The following sections provide detailed explanations about the differences between machine types.
Regional availability
Compute Engine (N1) machine types are available when you deploy your model on a regional endpoint. When you use a Compute Engine (N1) machine type, you cannot deploy your model to the global endpoint.
When you scale a model version that uses Compute Engine (N1) machine types to two or more prediction nodes, the nodes run in multiple zones within the same region. This ensures continuous availability if there is an outage in one of the zones. Learn more in the scaling section of this document.
Note that GPU availability for Compute Engine (N1) machine types also varies by region.
Legacy (MLS1) machine types are available on the global endpoint in many regions. Legacy (MLS1) machine types are not available on regional endpoints.
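For reference, here is a minimal sketch (using the same Google API Client Library for Python as the examples in this document) of how the endpoint you build the client against determines whether you are working with a regional endpoint or the global endpoint; us-central1 is just one example region:
from google.api_core.client_options import ClientOptions
from googleapiclient import discovery

# Regional endpoint (REGION-ml.googleapis.com): required for versions that use
# Compute Engine (N1) machine types.
regional_options = ClientOptions(api_endpoint='https://us-central1-ml.googleapis.com')
ml_regional = discovery.build('ml', 'v1', client_options=regional_options)

# Global endpoint (ml.googleapis.com, the client library default): used by
# versions that use legacy (MLS1) machine types.
ml_global = discovery.build('ml', 'v1')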
Batch prediction support
Model versions that use the mls1-c1-m2 machine type support batch prediction. Model versions that use other machine types do not support batch prediction.
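As a hedged illustration, the following sketch submits a batch prediction job against a version that uses the mls1-c1-m2 machine type, using the projects.jobs.create method of the same client library as the earlier examples; the job ID, Cloud Storage paths, and region are hypothetical placeholders:
from googleapiclient import discovery

ml = discovery.build('ml', 'v1')  # batch prediction jobs use the global endpoint

job_body = {
    'jobId': 'BATCH_JOB_NAME',  # placeholder job ID
    'predictionInput': {
        # Version that uses the mls1-c1-m2 machine type.
        'versionName': 'projects/PROJECT_NAME/models/MODEL_NAME/versions/VERSION_NAME',
        'dataFormat': 'JSON',  # newline-delimited JSON input instances
        'inputPaths': ['gs://input-bucket/instances-*'],  # placeholder input files
        'outputPath': 'gs://output-bucket/predictions',   # placeholder output directory
        'region': 'us-central1'
    }
}

request = ml.projects().jobs().create(
    parent='projects/PROJECT_NAME',
    body=job_body
)
response = request.execute()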
ML framework support
If you use one of the Compute Engine (N1) machine types, you can create your model version with all of the model artifacts described in the Exporting models for prediction guide, except for two:
- You cannot use a scikit-learn pipeline with custom code.
- You cannot use a custom prediction routine.
For legacy (MLS1) machine types, you can use any type of model artifact that AI Platform Prediction supports, including a scikit-learn pipeline with custom code or a custom prediction routine.
Runtime version support
If you use a Compute Engine (N1) machine type, you must use runtime version 1.11 or later for your model version.
If you use a legacy (MLS1) machine type, you can use any available AI Platform runtime version.
Custom container support
To use a custom container to serve online predictions, you must use a Compute Engine (N1) machine type.
Max model size
The model artifacts that you provide when you create a model version must have a total file size less than 500 MB if you use a legacy (MLS1) machine type. The total file size can be up to 10 GB if you use a Compute Engine (N1) machine type.
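If you want to verify that your exported artifacts fit under these limits before you deploy, one possible check (a sketch that assumes the google-cloud-storage library and a hypothetical bucket and model directory) is to sum the object sizes under the model's Cloud Storage prefix:
from google.cloud import storage

def total_model_size_bytes(bucket_name, prefix):
    """Sums the sizes of all objects under gs://bucket_name/prefix."""
    client = storage.Client()
    return sum(blob.size for blob in client.list_blobs(bucket_name, prefix=prefix))

# Hypothetical model directory; compare the result against the 500 MB (MLS1)
# or 10 GB (Compute Engine N1) limit.
size_gb = total_model_size_bytes('model-bucket', 'model-directory') / (1024 ** 3)
print(f'Total model size: {size_gb:.2f} GB')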
Logging predictions
For Compute Engine (N1) machine types, console logging is in preview. For legacy (MLS1) machine types, console logging is generally available.
Scaling prediction nodes
Automatic scaling and manual scaling of prediction nodes have different constraints depending on whether you use a Compute Engine (N1) machine type or a legacy (MLS1) machine type.
Automatic scaling
If you use a Compute Engine (N1) machine type with automatic scaling, your model version must always have at least one node running. In other words, the version's autoScaling.minNodes field defaults to 1 and cannot be less than 1. If you set autoScaling.minNodes to 2 or greater, then prediction nodes run in multiple zones within the same region. This ensures continuous availability if there is an outage in one of the zones.
If you allocate more vCPUs or RAM than your machine learning model needs, autoscaling might not work properly, which can degrade model performance. Experiment with different machine types for your model to make sure you are not over-provisioning compute resources.
If you use a legacy (MLS1) machine type, your model version can scale to zero nodes when it doesn't receive traffic: autoScaling.minNodes can be set to 0, and it is set to 0 by default. Scaling to zero can reduce costs when your model version is not receiving prediction requests. However, it can also lead to latency or errors during any periods when AI Platform Prediction is allocating a new node to handle requests after a period with zero nodes. Learn more about scaling to zero.
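As a minimal sketch of that configuration (same client library as the earlier examples; the version details are placeholders), a legacy (MLS1) version that may scale to zero simply leaves autoScaling.minNodes at 0, or sets it explicitly:
from googleapiclient import discovery

ml = discovery.build('ml', 'v1')  # legacy (MLS1) machine types use the global endpoint

request_dict = {
    'name': 'VERSION_NAME',
    'deploymentUri': 'gs://model-directory-uri',
    'runtimeVersion': '2.11',
    'pythonVersion': '3.7',
    'framework': 'ML_FRAMEWORK_NAME',
    'machineType': 'mls1-c1-m2',
    # minNodes defaults to 0 for legacy (MLS1) machine types, so the version can
    # scale to zero nodes when it receives no traffic.
    'autoScaling': {'minNodes': 0}
}

request = ml.projects().models().versions().create(
    parent='projects/PROJECT_NAME/models/MODEL_NAME',
    body=request_dict
)
response = request.execute()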
Manual scaling
If you use a Compute Engine (N1) machine type with manual scaling, you can update the number of prediction nodes running at any time by using the projects.models.versions.patch API method. If you set the manualScaling.nodes field to 2 or greater, then prediction nodes run in multiple zones within the same region. This ensures continuous availability if there is an outage in one of the zones.
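As a hedged sketch of such an update (same client library as the earlier examples; the version name and node count are placeholders), you can patch the manualScaling.nodes field with an update mask:
from google.api_core.client_options import ClientOptions
from googleapiclient import discovery

client_options = ClientOptions(api_endpoint='https://us-central1-ml.googleapis.com')
ml = discovery.build('ml', 'v1', client_options=client_options)

# Change the number of manually scaled prediction nodes for an existing version.
request = ml.projects().models().versions().patch(
    name='projects/PROJECT_NAME/models/MODEL_NAME/versions/VERSION_NAME',
    body={'manualScaling': {'nodes': 3}},
    updateMask='manualScaling.nodes'
)
response = request.execute()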
If you use a legacy (MLS1) machine type with manual scaling, you cannot update the number of prediction nodes after you create the model version. If you want to change the number of nodes, you must delete the version and create a new one.
VPC Service Controls support
If you use VPC Service Controls to protect AI Platform Prediction, then you cannot create versions that use legacy (MLS1) machine types. You must use Compute Engine (N1) machine types.