Choosing a machine type for online prediction

AI Platform Prediction allocates nodes to handle online prediction requests sent to a model version. When you deploy a model version, you can customize the type of virtual machine that AI Platform Prediction uses for these nodes.

Machine types differ in several ways: how many vCPUs and how much memory they provide, whether they support GPUs, how they are priced, and which AI Platform Prediction features they support. By selecting a machine type with more computing resources, you can serve predictions with lower latency or handle more prediction requests at the same time.

Available machine types

Compute Engine (N1) machine types and the mls1-c1-m2 machine type are generally available for online prediction. The mls1-c4-m2 machine type is available in beta.

The following table compares the available machine types:

Name | Availability | vCPUs | Memory (GB) | Supports GPUs? | ML framework support | Max model size
mls1-c1-m2 (default on the global endpoint) | Generally available | 1 | 2 | No | TensorFlow, XGBoost, scikit-learn (including pipelines with custom code), custom prediction routines | 500 MB
mls1-c4-m2 | Beta | 4 | 2 | No | TensorFlow, XGBoost, scikit-learn (including pipelines with custom code), custom prediction routines | 500 MB
n1-standard-2 (default on regional endpoints) | Generally available | 2 | 7.5 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-standard-4 | Generally available | 4 | 15 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-standard-8 | Generally available | 8 | 30 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-standard-16 | Generally available | 16 | 60 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-standard-32 | Generally available | 32 | 120 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highmem-2 | Generally available | 2 | 13 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highmem-4 | Generally available | 4 | 26 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highmem-8 | Generally available | 8 | 52 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highmem-16 | Generally available | 16 | 104 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highmem-32 | Generally available | 32 | 208 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highcpu-2 | Generally available | 2 | 1.8 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highcpu-4 | Generally available | 4 | 3.6 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highcpu-8 | Generally available | 8 | 7.2 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highcpu-16 | Generally available | 16 | 14.4 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highcpu-32 | Generally available | 32 | 28.8 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB

Learn about pricing for each machine type. Read more about the detailed specifications of Compute Engine (N1) machine types in the Compute Engine documentation.

Specifying a machine type

You specify a machine type when you create a model version. If you don't specify one, your model version defaults to n1-standard-2 on a regional endpoint and to mls1-c1-m2 on the global endpoint.

The following instructions highlight how to specify a machine type when you create a model version. They use the n1-standard-4 machine type as an example. To learn about the full process of creating a model version, read the guide to deploying models.

Google Cloud console

On the Create version page, open the Machine type drop-down list and select Standard > n1-standard-4.

gcloud

After you have uploaded your model artifacts to Cloud Storage and created a model resource, you can create a model version that uses the n1-standard-4 machine type:

gcloud ai-platform versions create VERSION_NAME \
  --model MODEL_NAME \
  --origin gs://model-directory-uri \
  --runtime-version 2.11 \
  --python-version 3.7 \
  --framework ML_FRAMEWORK_NAME \
  --region us-central1 \
  --machine-type n1-standard-4

Python

This example uses the Google API Client Library for Python. Before you run the following code sample, you must set up authentication.

After you have uploaded your model artifacts to Cloud Storage and created a model resource, send a request to your model's projects.models.versions.create method and specify the machineType field in your request body:

from google.api_core.client_options import ClientOptions
from googleapiclient import discovery

endpoint = 'https://us-central1-ml.googleapis.com'
client_options = ClientOptions(api_endpoint=endpoint)
ml = discovery.build('ml', 'v1', client_options=client_options)

request_dict = {
    'name': 'VERSION_NAME',
    'deploymentUri': 'gs://model-directory-uri',
    'runtimeVersion': '2.11',
    'pythonVersion': '3.7',
    'framework': 'ML_FRAMEWORK_NAME',
    'machineType': 'n1-standard-4'
}
request = ml.projects().models().versions().create(
    parent='projects/PROJECT_NAME/models/MODEL_NAME',
    body=request_dict
)
response = request.execute()
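
The create call returns a long-running operation rather than a finished version. A minimal polling sketch, assuming the ml client and the response object from the example above:

import time

operation_name = response['name']
while True:
    operation = ml.projects().operations().get(name=operation_name).execute()
    if operation.get('done'):
        break
    time.sleep(30)  # deploying a model version can take several minutes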

Using GPUs for online prediction

For some configurations, you can optionally add GPUs to accelerate each prediction node. To use GPUs, you must account for several requirements:

  • You can only use GPUs with Compute Engine (N1) machine types. Legacy (MLS1) machine types do not support GPUs.
  • You can only use GPUs when you deploy a TensorFlow SavedModel. You cannot use GPUs for scikit-learn or XGBoost models.
  • The availability of each type of GPU varies depending on which region you use for your model. Learn which types of GPUs are available in which regions.
  • You can only use one type of GPU for your model version, and there are limits on the number of GPUs you can add depending on which machine type you use.

The following table shows the GPUs available for online prediction and how many of each type of GPU you can use with each Compute Engine machine type:

Valid numbers of GPUs for each machine type
Machine type | NVIDIA Tesla K80 | NVIDIA Tesla P4 | NVIDIA Tesla P100 | NVIDIA Tesla T4 | NVIDIA Tesla V100
n1-standard-2 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-standard-4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-standard-8 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-standard-16 | 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 2, 4, 8
n1-standard-32 | 4, 8 | 2, 4 | 2, 4 | 2, 4 | 4, 8
n1-highmem-2 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-highmem-4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-highmem-8 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-highmem-16 | 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 2, 4, 8
n1-highmem-32 | 4, 8 | 2, 4 | 2, 4 | 2, 4 | 4, 8
n1-highcpu-2 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-highcpu-4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-highcpu-8 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-highcpu-16 | 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 2, 4, 8
n1-highcpu-32 | 4, 8 | 2, 4 | 2, 4 | 2, 4 | 4, 8

GPUs are optional and incur additional costs.

Specifying GPUs

Specify GPUs when you create a model version. AI Platform Prediction allocates the number and type of GPU that you specify for each prediction node. You can automatically scale (preview) or manually scale (GA) the prediction nodes, but the number of GPUs that each node uses is fixed when you create the model version. Unless you have an advanced use case, we recommend that you configure one GPU on each prediction node; in other words, set the accelerator count to 1.

The following instructions show how to specify GPUs for online prediction by creating a model version that runs on at least two prediction nodes at any time. Each node uses the n1-standard-4 machine type and one NVIDIA Tesla T4 GPU.

The examples assume that you have already uploaded a TensorFlow SavedModel to Cloud Storage and created a model resource in a region that supports GPUs.

Google Cloud console

Follow the guide to creating a model version. On the Create version page, specify the following options:

  1. In the Scaling drop-down list, select Auto scaling.
  2. In the Minimum number of nodes field, enter 2.
  3. In the Machine type drop-down list, select Standard > n1-standard-4.
  4. In the Accelerator type drop-down list, select NVIDIA_TESLA_T4.
  5. In the Accelerator count drop-down list, select 1.

gcloud

Use the gcloud CLI to create a model version. In this example, the version runs on n1-standard-4 prediction nodes that each use one NVIDIA Tesla T4 GPU. AI Platform Prediction automatically scales the number of prediction nodes to a number between 2 and 4, depending on GPU usage at any given time. The example uses the us-central1 regional endpoint:

gcloud beta ai-platform versions create VERSION_NAME \
  --model MODEL_NAME \
  --origin gs://model-directory-uri \
  --runtime-version 2.11 \
  --python-version 3.7 \
  --framework tensorflow \
  --region us-central1 \
  --machine-type n1-standard-4 \
  --accelerator count=1,type=nvidia-tesla-t4 \
  --min-nodes 2 \
  --max-nodes 4 \
  --metric-targets gpu-duty-cycle=60

Note that the accelerator name is specified in lowercase with hyphens between words.

Python

This example uses the Google API Client Library for Python. Before you run the following code sample, you must set up authentication.

The example uses the us-central1 regional endpoint.

Send a request to your model's projects.models.versions.create method and specify the machineType, acceleratorConfig, and autoScaling fields in your request body:

from google.api_core.client_options import ClientOptions
from googleapiclient import discovery

endpoint = 'https://us-central1-ml.googleapis.com'
client_options = ClientOptions(api_endpoint=endpoint)
ml = discovery.build('ml', 'v1', client_options=client_options)

request_dict = {
    'name': 'VERSION_NAME',
    'deploymentUri': 'gs://model-directory-uri',
    'runtimeVersion': '2.11',
    'pythonVersion': '3.7',
    'framework': 'TENSORFLOW',
    'machineType': 'n1-standard-4',
    'acceleratorConfig': {
      'count': 1,
      'type': 'NVIDIA_TESLA_T4'
    },
    'autoScaling': {
      'minNodes': 2,
      'maxNodes': 4,
      'metrics': [
        {
          'name': 'GPU_DUTY_CYCLE',
          'target': 60
        }
      ]
    }
}
request = ml.projects().models().versions().create(
    parent='projects/PROJECT_NAME/models/MODEL_NAME',
    body=request_dict
)
response = request.execute()

Note that the accelerator name is specified in uppercase with underscores between words.
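
If you want a fixed number of prediction nodes instead of automatic scaling, set manualScaling in place of autoScaling before sending the create request. A minimal sketch, assuming the request_dict from the example above (a version can specify autoScaling or manualScaling, but not both):

del request_dict['autoScaling']
request_dict['manualScaling'] = {'nodes': 2}  # run exactly two prediction nodes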

Differences between machine types

Besides providing different amounts of computing resources, machine types also vary in their support for certain AI Platform Prediction features. The following table provides an overview of the differences between Compute Engine (N1) machine types and legacy (MLS1) machine types:

Feature | Compute Engine (N1) machine types | Legacy (MLS1) machine types
Regions | All regional endpoint regions | All global endpoint regions
Types of ML artifacts | TensorFlow SavedModels, scikit-learn models, XGBoost models | TensorFlow SavedModels, scikit-learn models (including pipelines with custom code), XGBoost models, custom prediction routines
Runtime versions | 1.11 or later | All available AI Platform runtime versions
Custom container support | Yes | No
Max model size | 10 GB | 500 MB
Auto scaling | Minimum nodes = 1 | Minimum nodes = 0
Manual scaling | Can update number of nodes | Cannot update number of nodes after creating a model version
GPU support | Yes (TensorFlow only) | No
AI Explanations support | Yes (TensorFlow only) | No
VPC Service Controls support | Yes | No
SLA coverage for generally available machine types | Yes, in some cases | Yes

The following sections provide detailed explanations about the differences between machine types.

Regional availability

Compute Engine (N1) machine types are available when you deploy your model on a regional endpoint. When you use a Compute Engine (N1) machine type, you cannot deploy your model to the global endpoint.

When you scale a model version that uses Compute Engine (N1) machine types to two or more prediction nodes, the nodes run in multiple zones within the same region. This ensures continuous availability if there is an outage in one of the zones. Learn more in the scaling section of this document.

Note that GPU availability for Compute Engine (N1) machine types also varies by region.

Legacy (MLS1) machine types are available on the global endpoint in many regions. Legacy (MLS1) machine types are not available on regional endpoints.

Batch prediction support

Model versions that use the mls1-c1-m2 machine type support batch prediction. Model versions that use other machine types do not support batch prediction.
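
For example, you can run batch prediction against such a version by sending a request to the projects.jobs.create method. The following is a minimal sketch using the Google API Client Library for Python; the job ID, Cloud Storage paths, and the JSON data format are illustrative assumptions, so confirm them against the API reference for your inputs:

from googleapiclient import discovery

# Batch prediction runs on the global endpoint, so no regional api_endpoint is set.
ml = discovery.build('ml', 'v1')

job_body = {
    'jobId': 'my_batch_prediction_job',  # hypothetical job ID
    'predictionInput': {
        'versionName': 'projects/PROJECT_NAME/models/MODEL_NAME/versions/VERSION_NAME',
        'dataFormat': 'JSON',                              # newline-delimited JSON instances (assumption)
        'inputPaths': ['gs://input-bucket/instances*'],    # hypothetical input files
        'outputPath': 'gs://output-bucket/predictions/',   # hypothetical output location
        'region': 'us-central1'
    }
}
request = ml.projects().jobs().create(
    parent='projects/PROJECT_NAME',
    body=job_body
)
response = request.execute()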

ML framework support

If you use one of the Compute Engine (N1) machine types, you can create your model version with all of the model artifacts described in the Exporting models for prediction guide, except for two: scikit-learn pipelines with custom code and custom prediction routines.

For legacy (MLS1) machine types, you can use any type of model artifact that AI Platform Prediction supports, including a scikit-learn pipeline with custom code or a custom prediction routine.

Runtime version support

If you use a Compute Engine (N1) machine type, you must use runtime version 1.11 or later for your model version.

If you use a legacy (MLS1) machine type, you can use any available AI Platform runtime version.

Custom container support

To use a custom container to serve online predictions, you must use a Compute Engine (N1) machine type.

Max model size

The model artifacts that you provide when you create a model version must have a total file size less than 500 MB if you use a legacy (MLS1) machine type. The total file size can be up to 10 GB if you use a Compute Engine (N1) machine type.
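
One way to check the total size of your model artifacts before deploying is to sum the object sizes in the model directory. A minimal sketch using the google-cloud-storage client library; the bucket name and prefix are placeholders:

from google.cloud import storage

client = storage.Client()
blobs = client.list_blobs('my-bucket', prefix='model-directory/')  # hypothetical model location
total_bytes = sum(blob.size for blob in blobs)
print(f'Total model size: {total_bytes / (1024 ** 2):.1f} MiB')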

Logging predictions

For Compute Engine (N1) machine types, console logging is in preview. For legacy (MLS1) machine types, console logging is generally available.
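
You turn on console logging when you create the model version. As a sketch, assuming the request_dict from the earlier Python examples, you would add a boolean field to the request body; confirm the field name against the projects.models.versions resource reference:

request_dict['onlinePredictionConsoleLogging'] = True  # stream stderr and stdout from prediction nodes to Cloud Logging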

Scaling prediction nodes

Automatic scaling and manual scaling of prediction nodes are subject to different constraints depending on whether you use a Compute Engine (N1) machine type or a legacy (MLS1) machine type.

Automatic scaling

If you use a Compute Engine (N1) machine type with automatic scaling, your model version must always have at least one node running. In other words, the version's autoScaling.minNodes field defaults to 1 and cannot be less than 1. If you set autoScaling.minNodes to 2 or greater, then prediction nodes run in multiple zones within the same region. This ensures continuous availability if there is an outage in one of the zones.
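
For example, a CPU-only Compute Engine (N1) version might use an autoScaling block like the following in its create request body; the node counts are illustrative, and request_dict refers to the earlier Python examples:

request_dict['autoScaling'] = {
    'minNodes': 2,  # must be at least 1 for Compute Engine (N1) machine types; 2 or more spreads nodes across zones
    'maxNodes': 5
}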

Note that if you allocate more vCPUs or memory than your machine learning model needs, autoscaling might not work properly, which can degrade prediction performance. Experiment with different machine types to make sure you are not overprovisioning compute resources for your model.

If you use a legacy (MLS1) machine type, your model version can scale to zero nodes when it doesn't receive traffic: autoScaling.minNodes can be set to 0, and it is set to 0 by default. Scaling to zero can reduce costs when your model version is not receiving prediction requests. However, it can also lead to latency or errors during any periods when AI Platform Prediction is allocating a new node to handle requests after a period with zero nodes. Learn more about scaling to zero.

Manual scaling

If you use a Compute Engine (N1) machine type with manual scaling, you can update the number of prediction nodes running at any time by using the projects.models.versions.patch API method. If you set the manualScaling.nodes field to 2 or greater, then prediction nodes run in multiple zones within the same region. This ensures continuous availability if there is an outage in one of the zones.
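
A minimal sketch of such an update using the Google API Client Library for Python, assuming a version that was created with manual scaling on the us-central1 regional endpoint:

from google.api_core.client_options import ClientOptions
from googleapiclient import discovery

endpoint = 'https://us-central1-ml.googleapis.com'
ml = discovery.build('ml', 'v1', client_options=ClientOptions(api_endpoint=endpoint))

request = ml.projects().models().versions().patch(
    name='projects/PROJECT_NAME/models/MODEL_NAME/versions/VERSION_NAME',
    body={'manualScaling': {'nodes': 3}},  # new fixed node count
    updateMask='manualScaling.nodes'
)
response = request.execute()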

If you use a legacy (MLS1) machine type with manual scaling, you cannot update the number of prediction nodes after you create the model version. If you want to change the number of nodes, you must delete the version and create a new one.

VPC Service Controls support

If you use VPC Service Controls to protect AI Platform Prediction, then you cannot create versions that use legacy (MLS1) machine types. You must use Compute Engine (N1) machine types.