Choosing a machine type for online prediction

AI Platform Prediction allocates nodes to handle online prediction requests sent to a model version. When you deploy a model version, you can customize the type of virtual machine that AI Platform Prediction uses for these nodes.

Machine types differ in several ways:

  • Number of virtual CPUs (vCPUs)
  • Amount of memory
  • Support for GPUs
  • Supported ML frameworks and model artifacts
  • Maximum model size
  • Pricing

By selecting a machine type with more computing resources, you can serve predictions with lower latency or handle more prediction requests at the same time.

Available machine types

Compute Engine (N1) machine types and the mls1-c1-m2 machine type are generally available for online prediction. The mls1-c4-m2 machine type is available in beta.

The following table compares the available machine types:

| Name | Availability | vCPUs | Memory (GB) | Supports GPUs? | ML framework support | Max model size |
| --- | --- | --- | --- | --- | --- | --- |
| mls1-c1-m2 (default) | Generally available | 1 | 2 | No | TensorFlow, XGBoost, scikit-learn (including pipelines with custom code), custom prediction routines | 500 MB |
| mls1-c4-m2 | Beta | 4 | 2 | No | TensorFlow, XGBoost, scikit-learn (including pipelines with custom code), custom prediction routines | 500 MB |
| n1-standard-2 | Generally available | 2 | 7.5 | Yes | TensorFlow, XGBoost, and scikit-learn | 2 GB |
| n1-standard-4 | Generally available | 4 | 15 | Yes | TensorFlow, XGBoost, and scikit-learn | 2 GB |
| n1-standard-8 | Generally available | 8 | 30 | Yes | TensorFlow, XGBoost, and scikit-learn | 2 GB |
| n1-standard-16 | Generally available | 16 | 60 | Yes | TensorFlow, XGBoost, and scikit-learn | 2 GB |
| n1-standard-32 | Generally available | 32 | 120 | Yes | TensorFlow, XGBoost, and scikit-learn | 2 GB |
| n1-highmem-2 | Generally available | 2 | 13 | Yes | TensorFlow, XGBoost, and scikit-learn | 2 GB |
| n1-highmem-4 | Generally available | 4 | 26 | Yes | TensorFlow, XGBoost, and scikit-learn | 2 GB |
| n1-highmem-8 | Generally available | 8 | 52 | Yes | TensorFlow, XGBoost, and scikit-learn | 2 GB |
| n1-highmem-16 | Generally available | 16 | 104 | Yes | TensorFlow, XGBoost, and scikit-learn | 2 GB |
| n1-highmem-32 | Generally available | 32 | 208 | Yes | TensorFlow, XGBoost, and scikit-learn | 2 GB |
| n1-highcpu-2 | Generally available | 2 | 1.8 | Yes | TensorFlow, XGBoost, and scikit-learn | 2 GB |
| n1-highcpu-4 | Generally available | 4 | 3.6 | Yes | TensorFlow, XGBoost, and scikit-learn | 2 GB |
| n1-highcpu-8 | Generally available | 8 | 7.2 | Yes | TensorFlow, XGBoost, and scikit-learn | 2 GB |
| n1-highcpu-16 | Generally available | 16 | 14.4 | Yes | TensorFlow, XGBoost, and scikit-learn | 2 GB |
| n1-highcpu-32 | Generally available | 32 | 28.8 | Yes | TensorFlow, XGBoost, and scikit-learn | 2 GB |

Learn about pricing for each machine type. Read more about the detailed specifications of Compute Engine (N1) machine types in the Compute Engine documentation.

Specifying a machine type

You can specify a machine type choice when you create a model version. If you don't specify a machine type, your model version defaults to using mls1-c1-m2 for its nodes.

The following instructions highlight how to specify a machine type when you create a model version. They use the n1-standard-4 machine type as an example. To learn about the full process of creating a model version, read the guide to deploying models.

Cloud Console

On the Create version page, open the Machine type drop-down list and select Standard > n1-standard-4.

gcloud

After you have uploaded your model artifacts to Cloud Storage and created a model resource, you can create a model version that uses the n1-standard-4 machine type:

gcloud ai-platform versions create version_name \
  --model model_name \
  --origin gs://model-directory-uri \
  --runtime-version 2.2 \
  --python-version 3.7 \
  --framework ml-framework-name \
  --machine-type n1-standard-4

Python

This example uses the Google APIs Client Library for Python. Before you run the following code sample, you must set up authentication.

After you have uploaded your model artifacts to Cloud Storage and created a model resource, send a request to your model's projects.models.versions.create method and specify the machineType field in your request body:

from googleapiclient import discovery

# Build a client for the AI Platform Training and Prediction API.
ml = discovery.build('ml', 'v1')

# Describe the new version, including the machine type for its nodes.
request_dict = {
    'name': 'version_name',
    'deploymentUri': 'gs://model-directory-uri',
    'runtimeVersion': '2.2',
    'pythonVersion': '3.7',
    'framework': 'ML_FRAMEWORK_NAME',
    'machineType': 'n1-standard-4'
}

# Create the version under the parent model resource.
request = ml.projects().models().versions().create(
    parent='projects/project-name/models/model_name',
    body=request_dict
)
response = request.execute()
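
The create call returns a long-running operation rather than the finished version. Continuing from the example above, the following is a minimal sketch of waiting for the deployment to complete, assuming the operation name is returned in the response's name field:

import time

# versions.create returns a long-running operation; poll until it is done.
op_name = response['name']
while True:
    op = ml.projects().operations().get(name=op_name).execute()
    if op.get('done'):
        break
    time.sleep(30)

# A failed deployment surfaces an error on the finished operation.
if 'error' in op:
    raise RuntimeError(op['error'])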

Using GPUs for online prediction

For some configurations, you can optionally add GPUs to accelerate each prediction node. To use GPUs, you must account for several requirements:

  • You can only use GPUs with Compute Engine (N1) machine types. Legacy (MLS1) machine types do not support GPUs.
  • You can only use GPUs when you deploy a TensorFlow SavedModel. You cannot use GPUs for scikit-learn or XGBoost models.
  • The availability of each type of GPU varies depending on which region you use for your model. Learn which types of GPUs are available in which regions.
  • You can only use one type of GPU for your model version, and there are limitations on the number of GPUs you can add depending on which machine type you are using. The following table describes these limitations.

The following table shows the GPUs available for online prediction and how many of each type of GPU you can use with each Compute Engine machine type:

Valid numbers of GPUs for each machine type

| Machine type | NVIDIA Tesla K80 | NVIDIA Tesla P4 | NVIDIA Tesla P100 | NVIDIA Tesla T4 | NVIDIA Tesla V100 |
| --- | --- | --- | --- | --- | --- |
| n1-standard-2 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8 |
| n1-standard-4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8 |
| n1-standard-8 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8 |
| n1-standard-16 | 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 2, 4, 8 |
| n1-standard-32 | 4, 8 | 2, 4 | 2, 4 | 2, 4 | 4, 8 |
| n1-highmem-2 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8 |
| n1-highmem-4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8 |
| n1-highmem-8 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8 |
| n1-highmem-16 | 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 2, 4, 8 |
| n1-highmem-32 | 4, 8 | 2, 4 | 2, 4 | 2, 4 | 4, 8 |
| n1-highcpu-2 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8 |
| n1-highcpu-4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8 |
| n1-highcpu-8 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8 |
| n1-highcpu-16 | 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 2, 4, 8 |
| n1-highcpu-32 | 4, 8 | 2, 4 | 2, 4 | 2, 4 | 4, 8 |

GPUs are optional and incur additional costs.

Specifying GPUs

Specify GPUs when you create a model version. AI Platform Prediction allocates the number and type of GPU that you specify for each prediction node. When you use GPUs, you must manually scale your version's prediction nodes: you can later change how many nodes are running, but you cannot currently use automatic scaling with GPUs.

The following instructions show how to specify GPUs for online prediction by creating a model version that runs on two prediction nodes. Each node uses the n1-standard-4 machine type with one NVIDIA Tesla T4 GPU.

The examples assume that you have already uploaded a TensorFlow SavedModel to Cloud Storage and created a model resource in a region that supports GPUs.

Cloud Console

Follow the guide to creating a model version. On the Create version page, specify the following options:

  1. In the Machine type drop-down list, select Standard > n1-standard-4.
  2. In the Accelerator type drop-down list, select NVIDIA_TESLA_T4.
  3. In the Accelerator count drop-down list, select 1.
  4. In the Scaling drop-down list, select Manual scaling.
  5. In the Number of nodes field, enter 2.

gcloud

First, create a YAML configuration file (conventionally named config.yaml) to configure manual scaling for your version. AI Platform Prediction uses automatic scaling by default, and you cannot override this with command-line flags. The following example configuration file specifies that you want your model version to use two prediction nodes:

config.yaml

manualScaling:
  nodes: 2

You can optionally provide additional version parameters in the configuration file, but note that any command-line flags that you provide to the gcloud tool override parameters in the file.

Next, use the gcloud tool and your configuration file to create a model version. In this example, the version runs on two n1-standard-4 prediction nodes, each using one NVIDIA Tesla T4 GPU. The example uses the us-central1 regional endpoint:

gcloud ai-platform versions create version_name \
  --model model_name \
  --origin gs://model-directory-uri \
  --runtime-version 2.2 \
  --python-version 3.7 \
  --framework tensorflow \
  --region us-central1 \
  --machine-type n1-standard-4 \
  --accelerator count=1,type=nvidia-tesla-t4 \
  --config config.yaml

Note that the accelerator name is specified in lowercase with hyphens between words.

Python

This example uses the Google APIs Client Library for Python. Before you run the following code sample, you must set up authentication.

The example uses the us-central1 regional endpoint.

Send a request to your model's projects.models.versions.create method and specify the machineType, acceleratorConfig, and manualScaling fields in your request body:

from google.api_core.client_options import ClientOptions
from googleapiclient import discovery

# Direct requests to the us-central1 regional endpoint.
endpoint = 'https://us-central1-ml.googleapis.com'
client_options = ClientOptions(api_endpoint=endpoint)
ml = discovery.build('ml', 'v1', client_options=client_options)

# Describe the new version: machine type, GPU configuration, and
# manual scaling (required when using GPUs).
request_dict = {
    'name': 'version_name',
    'deploymentUri': 'gs://model-directory-uri',
    'runtimeVersion': '2.2',
    'pythonVersion': '3.7',
    'framework': 'TENSORFLOW',
    'machineType': 'n1-standard-4',
    'acceleratorConfig': {
        'count': 1,
        'type': 'NVIDIA_TESLA_T4'
    },
    'manualScaling': {
        'nodes': 2
    }
}

# Create the version under the parent model resource.
request = ml.projects().models().versions().create(
    parent='projects/project-name/models/model_name',
    body=request_dict
)
response = request.execute()

Note that the accelerator name is specified in uppercase with underscores between words.
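
After the deployment finishes, you can read the version back to confirm its configuration. The following is a minimal sketch that reuses the regional client and placeholder names from the example above:

# Read the version back to confirm machine type and accelerators.
version = ml.projects().models().versions().get(
    name='projects/project-name/models/model_name/versions/version_name'
).execute()

print(version['machineType'])        # e.g. 'n1-standard-4'
print(version['acceleratorConfig'])  # e.g. {'count': '1', 'type': 'NVIDIA_TESLA_T4'}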

Differences between machine types

Besides providing different amounts of computing resources, machine types also vary in their support for certain AI Platform Prediction features. The following table provides an overview of the differences between Compute Engine (N1) machine types and legacy (MLS1) machine types:

| | Compute Engine (N1) machine types | Legacy (MLS1) machine types |
| --- | --- | --- |
| Regions | All regional endpoint regions | All global endpoint regions |
| Types of ML artifacts | TensorFlow, XGBoost, and scikit-learn models (no custom code) | All supported artifacts, including scikit-learn pipelines with custom code and custom prediction routines |
| Runtime versions | 1.11 or later | All available AI Platform runtime versions |
| Max model size | 2 GB | 500 MB |
| Logging | No stream logging | All types of logging |
| Automatic scaling | Minimum nodes = 1 | Minimum nodes = 0 |
| Manual scaling | Can update number of nodes | Cannot update number of nodes after creating model version |
| GPU support | Yes (TensorFlow only) | No |
| AI Explanations support | Yes (TensorFlow only) | No |
| VPC Service Controls support | Yes | No |
| SLA coverage for generally available machine types | Yes, in some cases | Yes |

The following sections provide detailed explanations about the differences between machine types.

Regional availability

Compute Engine (N1) machine types are available when you deploy your model on a regional endpoint. Currently, they are available on the us-central1, europe-west4, and asia-east1 regional endpoints. When you use a Compute Engine (N1) machine type, you cannot deploy your model to the global endpoint.

When you scale a model version that uses Compute Engine (N1) machine types to two or more prediction nodes, the nodes run in multiple zones within the same region. This ensures continuous availability if there is an outage in one of the zones. Learn more in the scaling section of this document.

Note that GPU availability for Compute Engine (N1) machine types also varies by region.

Legacy (MLS1) machine types are available on the global endpoint in many regions. Legacy (MLS1) machine types are not available on regional endpoints.

Batch prediction support

Model versions that use the mls1-c1-m2 machine type support batch prediction. Model versions that use other machine types do not support batch prediction.

ML framework support

If you use one of the Compute Engine (N1) machine types, you can create your model version with all of the model artifacts described in the guide to exporting models for prediction, except for two: scikit-learn pipelines with custom code and custom prediction routines.

For legacy (MLS1) machine types, you can use any type of model artifact that AI Platform Prediction supports, including a scikit-learn pipeline with custom code or a custom prediction routine.

Runtime version support

If you use a Compute Engine (N1) machine type, you must use runtime version 1.11 or later for your model version.

If you use a legacy (MLS1) machine type, you can use any available AI Platform runtime version.

Max model size

The model artifacts that you provide when you create a model version must have a total file size less than 500 MB if you use a legacy (MLS1) machine type. The total file size can be up to 2 GB if you use a Compute Engine (N1) machine type.
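
To check whether your exported artifacts fit within these limits before you deploy, you can total their file sizes in Cloud Storage. The following is a minimal sketch using the google-cloud-storage client library (an assumption; any listing tool works), with hypothetical bucket and prefix names:

from google.cloud import storage

# Sum the sizes of all files under the model directory.
client = storage.Client()
total_bytes = sum(
    blob.size
    for blob in client.list_blobs('my-bucket', prefix='model-directory/')
)

# Limits: 500 MB for legacy (MLS1), 2 GB for Compute Engine (N1).
print(f'Total artifact size: {total_bytes / 1024**2:.1f} MB')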

Logging predictions

Compute Engine (N1) machine types do not support stream logging of prediction nodes' stderr and stdout streams.

Legacy (MLS1) machine types support all types of online prediction logging.
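
Online prediction logging is configured on the model resource rather than on individual versions. The following is a minimal sketch of creating a model with both access logging and stream (console) logging enabled, assuming the Google APIs client from the earlier examples:

# Logging is enabled when you create the model; stream (console) logging
# only takes effect for versions on legacy (MLS1) machine types.
model_body = {
    'name': 'model_name',
    'onlinePredictionLogging': True,         # access logs
    'onlinePredictionConsoleLogging': True   # stderr/stdout stream logs
}
request = ml.projects().models().create(
    parent='projects/project-name',
    body=model_body
)
response = request.execute()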

Scaling prediction nodes

Automatic scaling and manual scaling of prediction nodes have different constraints depending on whether you use a Compute Engine (N1) machine type or a legacy (MLS1) machine type.

Automatic scaling

If you use a Compute Engine (N1) machine type with automatic scaling, your model version must always have at least one node running. In other words, the version's autoScaling.minNodes field defaults to 1 and cannot be less than 1. If you set autoScaling.minNodes to 2 or greater, then prediction nodes run in multiple zones within the same region. This ensures continuous availability if there is an outage in one of the zones.
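
For example, to keep at least two nodes running (and spread them across zones), set the field in the version body when you create it. The following is a minimal sketch of the relevant request body, reusing the placeholder names from the earlier examples:

# With a Compute Engine (N1) machine type, minNodes defaults to 1 and
# cannot be less than 1; a value of 2 or more spreads nodes across zones.
request_dict = {
    'name': 'version_name',
    'deploymentUri': 'gs://model-directory-uri',
    'runtimeVersion': '2.2',
    'pythonVersion': '3.7',
    'framework': 'TENSORFLOW',
    'machineType': 'n1-standard-4',
    'autoScaling': {
        'minNodes': 2
    }
}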

Note that if you allocate more vCPUs or RAM than your machine learning model needs, autoscaling might not work properly, which can degrade model performance. Experiment with different machine types to make sure you are not overprovisioning compute resources for your model.

If you use GPUs for your model version, you cannot use automatic scaling. You must use manual scaling.

If you use a legacy (MLS1) machine type, your model version can scale to zero nodes when it doesn't receive traffic. (autoScaling.minNodes can be set to 0, and it is set to 0 by default.)

Manual scaling

If you use a Compute Engine (N1) machine type with manual scaling, you can update the number of prediction nodes running at any time by using the projects.models.versions.patch API method. If you set the manualScaling.nodes field to 2 or greater, then prediction nodes run in multiple zones within the same region. This ensures continuous availability if there is an outage in one of the zones.
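
For example, the following is a minimal sketch of updating an existing version to three nodes with the Python client from earlier (the updateMask value is an assumption based on the field path):

# Update the node count on a version that uses manual scaling.
request = ml.projects().models().versions().patch(
    name='projects/project-name/models/model_name/versions/version_name',
    body={'manualScaling': {'nodes': 3}},
    updateMask='manualScaling.nodes'
)
response = request.execute()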

If you use a legacy (MLS1) machine type with manual scaling, you cannot update the number of prediction nodes after you create the model version. If you want to change the number of nodes, you must delete the version and create a new one.

VPC Service Controls support

If you use VPC Service Controls to protect AI Platform Prediction, then you cannot create versions that use legacy (MLS1) machine types. You must use Compute Engine (N1) machine types.