Choosing a machine type for online prediction

AI Platform Prediction allocates nodes to handle online prediction requests sent to a model version. When you deploy a model version, you can customize the type of virtual machine that AI Platform Prediction uses for these nodes.

Machine types differ in several ways: how many vCPUs and how much memory they provide, whether they support GPUs, how they are priced, and which AI Platform Prediction features they support. By selecting a machine type with more computing resources, you can serve predictions with lower latency or handle more prediction requests at the same time.

Available machine types

Compute Engine (N1) machine types and the mls1-c1-m2 machine type are generally available for online prediction. The mls1-c4-m2 machine type is available in beta.

The following table compares the available machine types:

Name | Availability | vCPUs | Memory (GB) | Supports GPUs? | ML framework support | Max model size
mls1-c1-m2 (default on the global endpoint) | Generally available | 1 | 2 | No | TensorFlow, XGBoost, scikit-learn (including pipelines with custom code), custom prediction routines | 500 MB
mls1-c4-m2 | Beta | 4 | 2 | No | TensorFlow, XGBoost, scikit-learn (including pipelines with custom code), custom prediction routines | 500 MB
n1-standard-2 (default on regional endpoints) | Generally available | 2 | 7.5 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-standard-4 | Generally available | 4 | 15 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-standard-8 | Generally available | 8 | 30 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-standard-16 | Generally available | 16 | 60 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-standard-32 | Generally available | 32 | 120 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highmem-2 | Generally available | 2 | 13 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highmem-4 | Generally available | 4 | 26 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highmem-8 | Generally available | 8 | 52 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highmem-16 | Generally available | 16 | 104 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highmem-32 | Generally available | 32 | 208 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highcpu-2 | Generally available | 2 | 1.8 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highcpu-4 | Generally available | 4 | 3.6 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highcpu-8 | Generally available | 8 | 7.2 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highcpu-16 | Generally available | 16 | 14.4 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB
n1-highcpu-32 | Generally available | 32 | 28.8 | Yes | TensorFlow, XGBoost, and scikit-learn | 10 GB

Learn about pricing for each machine type. Read more about the detailed specifications of Compute Engine (N1) machine types in the Compute Engine documentation.

Specifying a machine type

You specify a machine type when you create a model version. If you don't specify one, your model version defaults to n1-standard-2 on a regional endpoint and to mls1-c1-m2 on the global endpoint.

The following instructions highlight how to specify a machine type when you create a model version. They use the n1-standard-4 machine type as an example. To learn about the full process of creating a model version, read the guide to deploying models.

Google Cloud console

On the Create version page, open the Machine type drop-down list and select Standard > n1-standard-4.

gcloud

After you have uploaded your model artifacts to Cloud Storage and created a model resource, you can create a model version that uses the n1-standard-4 machine type:

gcloud ai-platform versions create VERSION_NAME \
  --model MODEL_NAME \
  --origin gs://model-directory-uri \
  --runtime-version 2.11 \
  --python-version 3.7 \
  --framework ML_FRAMEWORK_NAME \
  --region us-central1 \
  --machine-type n1-standard-4

Python

This example uses the Google API Client Library for Python. Before you run the following code sample, you must set up authentication.

After you have uploaded your model artifacts to Cloud Storage and created a model resource, send a request to your model's projects.models.versions.create method and specify the machineType field in your request body:

from google.api_core.client_options import ClientOptions
from googleapiclient import discovery

endpoint = 'https://us-central1-ml.googleapis.com'
client_options = ClientOptions(api_endpoint=endpoint)
ml = discovery.build('ml', 'v1', client_options=client_options)

request_dict = {
    'name': 'VERSION_NAME',
    'deploymentUri': 'gs://model-directory-uri',
    'runtimeVersion': '2.11',
    'pythonVersion': '3.7',
    'framework': 'ML_FRAMEWORK_NAME',
    'machineType': 'n1-standard-4'
}
request = ml.projects().models().versions().create(
    parent='projects/PROJECT_NAME/models/MODEL_NAME',
    body=request_dict
)
response = request.execute()
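
The create call returns a long-running operation rather than a finished version. A minimal polling sketch, assuming the ml client and the response object from the example above:

import time

operation_name = response['name']
while True:
    operation = ml.projects().operations().get(name=operation_name).execute()
    if operation.get('done'):
        break
    time.sleep(30)  # deploying a model version can take several minutes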

Using GPUs for online prediction

For some configurations, you can optionally add GPUs to accelerate each prediction node. To use GPUs, you must account for several requirements:

  • You can only use GPUs with Compute Engine (N1) machine types. Legacy (MLS1) machine types do not support GPUs.
  • You can only use GPUs when you deploy a TensorFlow SavedModel. You cannot use GPUs for scikit-learn or XGBoost models.
  • The availability of each type of GPU varies depending on which region you use for your model. Learn which types of GPUs are available in which regions.
  • You can only use one type of GPU for your model version, and there are limits on the number of GPUs you can add depending on which machine type you use.

The following table shows the GPUs available for online prediction and how many of each type of GPU you can use with each Compute Engine machine type:

Valid numbers of GPUs for each machine type
Machine type | NVIDIA Tesla K80 | NVIDIA Tesla P4 | NVIDIA Tesla P100 | NVIDIA Tesla T4 | NVIDIA Tesla V100
n1-standard-2 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-standard-4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-standard-8 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-standard-16 | 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 2, 4, 8
n1-standard-32 | 4, 8 | 2, 4 | 2, 4 | 2, 4 | 4, 8
n1-highmem-2 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-highmem-4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-highmem-8 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-highmem-16 | 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 2, 4, 8
n1-highmem-32 | 4, 8 | 2, 4 | 2, 4 | 2, 4 | 4, 8
n1-highcpu-2 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-highcpu-4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-highcpu-8 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8
n1-highcpu-16 | 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 2, 4, 8
n1-highcpu-32 | 4, 8 | 2, 4 | 2, 4 | 2, 4 | 4, 8

GPUs are optional and incur additional costs.

Specifying GPUs

Specify GPUs when you create a model version. AI Platform Prediction allocates the number and type of GPU that you specify for each prediction node. You can automatically scale (preview) or manually scale (GA) the prediction nodes, but the number of GPUs that each node uses is fixed when you create the model version. Unless you have an advanced use case, we recommend that you configure one GPU on each prediction node; in other words, set the accelerator count to 1.

The following instructions show how to specify GPUs for online prediction by creating a model version that runs on at least two prediction nodes at any time. Each node uses the n1-standard-4 machine type and one NVIDIA Tesla T4 GPU.

The examples assume that you have already uploaded a TensorFlow SavedModel to Cloud Storage and created a model resource in a region that supports GPUs.

Google Cloud console

Follow the guide to creating a model version. On the Create version page, specify the following options:

  1. In the Scaling drop-down list, select Auto scaling.
  2. In the Minimum number of nodes field, enter 2.
  3. In the Machine type drop-down list, select Standard > n1-standard-4.
  4. In the Accelerator type drop-down list, select NVIDIA_TESLA_T4.
  5. In the Accelerator count drop-down list, select 1.

gcloud

Use the gcloud CLI to create a model version. In this example, the version runs on n1-standard-4 prediction nodes that each use one NVIDIA Tesla T4 GPU. AI Platform Prediction automatically scales the number of prediction nodes to a number between 2 and 4, depending on GPU usage at any given time. The example uses the us-central1 regional endpoint:

gcloud beta ai-platform versions create VERSION_NAME \
  --model MODEL_NAME \
  --origin gs://model-directory-uri \
  --runtime-version 2.11 \
  --python-version 3.7 \
  --framework tensorflow \
  --region us-central1 \
  --machine-type n1-standard-4 \
  --accelerator count=1,type=nvidia-tesla-t4 \
  --min-nodes 2 \
  --max-nodes 4 \
  --metric-targets gpu-duty-cycle=60

Note that the accelerator name is specified in lowercase with hyphens between words.

Python

This example uses the Google API Client Library for Python. Before you run the following code sample, you must set up authentication.

The example uses the us-central1 regional endpoint.

Send a request to your model's projects.models.versions.create method and specify the machineType, acceleratorConfig, and autoScaling fields in your request body:

from google.api_core.client_options import ClientOptions
from googleapiclient import discovery

endpoint = 'https://us-central1-ml.googleapis.com'
client_options = ClientOptions(api_endpoint=endpoint)
ml = discovery.build('ml', 'v1', client_options=client_options)

request_dict = {
    'name': 'VERSION_NAME',
    'deploymentUri': 'gs://model-directory-uri',
    'runtimeVersion': '2.11',
    'pythonVersion': '3.7',
    'framework': 'TENSORFLOW',
    'machineType': 'n1-standard-4',
    'acceleratorConfig': {
      'count': 1,
      'type': 'NVIDIA_TESLA_T4'
    },
    'autoScaling': {
      'minNodes': 2,
      'maxNodes': 4,
      'metrics': [
        {
          'name': 'GPU_DUTY_CYCLE',
          'target': 60
        }
      ]
    }
}
request = ml.projects().models().versions().create(
    parent='projects/PROJECT_NAME/models/MODEL_NAME',
    body=request_dict
)
response = request.execute()

Note that the accelerator name is specified in uppercase with underscores between words.
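
If you want a fixed number of prediction nodes instead of automatic scaling, set manualScaling in place of autoScaling before sending the create request. A minimal sketch, assuming the request_dict from the example above (a version can specify autoScaling or manualScaling, but not both):

del request_dict['autoScaling']
request_dict['manualScaling'] = {'nodes': 2}  # run exactly two prediction nodes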

Differences between machine types

Besides providing different amounts of computing resources, machine types also vary in their support for certain AI Platform Prediction features. The following table provides an overview of the differences between Compute Engine (N1) machine types and legacy (MLS1) machine types:

Feature | Compute Engine (N1) machine types | Legacy (MLS1) machine types
Regions | All regional endpoint regions | All global endpoint regions
Types of ML artifacts | TensorFlow SavedModels, scikit-learn models, XGBoost models | TensorFlow SavedModels, scikit-learn models (including pipelines with custom code), XGBoost models, custom prediction routines
Runtime versions | 1.11 or later | All available AI Platform runtime versions
Custom container support | Yes | No
Max model size | 10 GB | 500 MB
Auto scaling | Minimum nodes = 1 | Minimum nodes = 0
Manual scaling | Can update number of nodes | Cannot update number of nodes after creating a model version
GPU support | Yes (TensorFlow only) | No
AI Explanations support | Yes (TensorFlow only) | No
VPC Service Controls support | Yes | No
SLA coverage for generally available machine types | Yes, in some cases | Yes

The following sections provide detailed explanations about the differences between machine types.

Regional availability

Compute Engine (N1) machine types are available when you deploy your model on a regional endpoint. When you use a Compute Engine (N1) machine type, you cannot deploy your model to the global endpoint.

When you scale a model version that uses Compute Engine (N1) machine types to two or more prediction nodes, the nodes run in multiple zones within the same region. This ensures continuous availability if there is an outage in one of the zones. Learn more in the scaling section of this document.

Note that GPU availability for Compute Engine (N1) machine types also varies by region.

Legacy (MLS1) machine types are available on the global endpoint in many regions. Legacy (MLS1) machine types are not available on regional endpoints.

Batch prediction support

Model versions that use the mls1-c1-m2 machine type support batch prediction. Model versions that use other machine types do not support batch prediction.
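
For example, you can run batch prediction against such a version by sending a request to the projects.jobs.create method. The following is a minimal sketch using the Google API Client Library for Python; the job ID, Cloud Storage paths, and the JSON data format are illustrative assumptions, so confirm them against the API reference for your inputs:

from googleapiclient import discovery

# Batch prediction runs on the global endpoint, so no regional api_endpoint is set.
ml = discovery.build('ml', 'v1')

job_body = {
    'jobId': 'my_batch_prediction_job',  # hypothetical job ID
    'predictionInput': {
        'versionName': 'projects/PROJECT_NAME/models/MODEL_NAME/versions/VERSION_NAME',
        'dataFormat': 'JSON',                              # newline-delimited JSON instances (assumption)
        'inputPaths': ['gs://input-bucket/instances*'],    # hypothetical input files
        'outputPath': 'gs://output-bucket/predictions/',   # hypothetical output location
        'region': 'us-central1'
    }
}
request = ml.projects().jobs().create(
    parent='projects/PROJECT_NAME',
    body=job_body
)
response = request.execute()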

ML framework support

If you use one of the Compute Engine (N1) machine types, you can create your model version with all of the model artifacts described in the Exporting models for prediction guide, except for two: scikit-learn pipelines with custom code and custom prediction routines.

For legacy (MLS1) machine types, you can use any type of model artifact that AI Platform Prediction supports, including a scikit-learn pipeline with custom code or a custom prediction routine.

Runtime version support

If you use a Compute Engine (N1) machine type, you must use runtime version 1.11 or later for your model version.

If you use a legacy (MLS1) machine type, you can use any available AI Platform runtime version.

Custom container support

To use a custom container to serve online predictions, you must use a Compute Engine (N1) machine type.

Max model size

The model artifacts that you provide when you create a model version must have a total file size less than 500 MB if you use a legacy (MLS1) machine type. The total file size can be up to 10 GB if you use a Compute Engine (N1) machine type.
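
One way to check the total size of your model artifacts before deploying is to sum the object sizes in the model directory. A minimal sketch using the google-cloud-storage client library; the bucket name and prefix are placeholders:

from google.cloud import storage

client = storage.Client()
blobs = client.list_blobs('my-bucket', prefix='model-directory/')  # hypothetical model location
total_bytes = sum(blob.size for blob in blobs)
print(f'Total model size: {total_bytes / (1024 ** 2):.1f} MiB')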

Logging predictions

For Compute Engine (N1) machine types, console logging is in preview. For legacy (MLS1) machine types, console logging is generally available.
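
You turn on console logging when you create the model version. As a sketch, assuming the request_dict from the earlier Python examples, you would add a boolean field to the request body; confirm the field name against the projects.models.versions resource reference:

request_dict['onlinePredictionConsoleLogging'] = True  # stream stderr and stdout from prediction nodes to Cloud Logging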

Scaling prediction nodes

Automatic scaling and manual scaling of prediction nodes are subject to different constraints depending on whether you use a Compute Engine (N1) machine type or a legacy (MLS1) machine type.

Automatic scaling

If you use a Compute Engine (N1) machine type with automatic scaling, your model version must always have at least one node running. In other words, the version's autoScaling.minNodes field defaults to 1 and cannot be less than 1. If you set autoScaling.minNodes to 2 or greater, then prediction nodes run in multiple zones within the same region. This ensures continuous availability if there is an outage in one of the zones.
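
For example, a CPU-only Compute Engine (N1) version might use an autoScaling block like the following in its create request body; the node counts are illustrative, and request_dict refers to the earlier Python examples:

request_dict['autoScaling'] = {
    'minNodes': 2,  # must be at least 1 for Compute Engine (N1) machine types; 2 or more spreads nodes across zones
    'maxNodes': 5
}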

Note that if you allocate more vCPUs or memory than your machine learning model needs, autoscaling might not work properly, which can degrade prediction performance. Experiment with different machine types to make sure you are not overprovisioning compute resources for your model.

If you use a legacy (MLS1) machine type, your model version can scale to zero nodes when it doesn't receive traffic: autoScaling.minNodes can be set to 0, and it is set to 0 by default. Scaling to zero can reduce costs when your model version is not receiving prediction requests. However, it can also lead to latency or errors during any periods when AI Platform Prediction is allocating a new node to handle requests after a period with zero nodes. Learn more about scaling to zero.

Manual scaling

If you use a Compute Engine (N1) machine type with manual scaling, you can update the number of prediction nodes running at any time by using the projects.models.versions.patch API method. If you set the manualScaling.nodes field to 2 or greater, then prediction nodes run in multiple zones within the same region. This ensures continuous availability if there is an outage in one of the zones.
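
A minimal sketch of such an update using the Google API Client Library for Python, assuming a version that was created with manual scaling on the us-central1 regional endpoint:

from google.api_core.client_options import ClientOptions
from googleapiclient import discovery

endpoint = 'https://us-central1-ml.googleapis.com'
ml = discovery.build('ml', 'v1', client_options=ClientOptions(api_endpoint=endpoint))

request = ml.projects().models().versions().patch(
    name='projects/PROJECT_NAME/models/MODEL_NAME/versions/VERSION_NAME',
    body={'manualScaling': {'nodes': 3}},  # new fixed node count
    updateMask='manualScaling.nodes'
)
response = request.execute()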

If you use a legacy (MLS1) machine type with manual scaling, you cannot update the number of prediction nodes after you create the model version. If you want to change the number of nodes, you must delete the version and create a new one.

VPC Service Controls support

If you use VPC Service Controls to protect AI Platform Prediction, then you cannot create versions that use legacy (MLS1) machine types. You must use Compute Engine (N1) machine types.