AI Platform Prediction: Custom container concepts

This document discusses serving machine learning (ML) models on Google Cloud using custom containers in AI Platform Prediction.

AI Platform Prediction is Google Cloud's managed ML model serving platform. As a managed service, the platform handles infrastructure setup, maintenance, and management. AI Platform Prediction supports both CPU and GPU inferencing and offers a selection of n1-standard node shapes in Compute Engine, letting you customize the scale unit to fit your requirements.

The AI Platform Prediction service simplifies model serving by requiring you to specify only where you store your model artifacts. However, with this simplicity comes reduced flexibility. For example, you can only serve frameworks that are built into the service. With custom containers, AI Platform Prediction lowers the abstraction level so that you can choose whichever framework, model server, and preprocessing or post-processing logic you need.

This document is part of a series for ML engineers and architects who design, build, and maintain a highly performant model serving platform on Google Cloud.

This document discusses model serving choices with a focus on custom containers in AI Platform Prediction. The associated document focuses on integrating Triton Inference Server with custom containers in AI Platform Prediction.

When to use custom containers with AI Platform Prediction

Google Cloud offers several options for serving ML models. To help you choose the platform that best fits your prediction needs, the following table compares common choices against various requirements:

| Prediction needs | AI Platform Prediction Basic | AI Platform Prediction custom containers | Google Kubernetes Engine (GKE) | Dataflow (used as a prediction pipeline) |
| --- | --- | --- | --- | --- |
| A managed model prediction service | Yes | Yes | No | Yes |
| Low-latency synchronous prediction | Yes | Yes | Yes | No |
| Asynchronous streaming analytics | No | No | No | Yes |
| Batch prediction | Yes | Yes | Depending on the model server | Yes |
| Built-in support for frameworks | TensorFlow, scikit-learn, XGBoost | None, user containerized | None, user containerized | User loaded |
| Support for third-party model servers | No | User containerized | User containerized | Not applicable |
| Node machine type choices | mls1 and n1-standard | n1-standard | Any supported by GKE | Any supported by Dataflow |
| GPU support | Yes | Yes | Yes | Yes |
| Custom processing external to the model server | Custom prediction routines | Yes | Yes | Not applicable |

The following sections provide more details to help you evaluate your options.

Activation

It's important to consider how a model is consumed after it's served. Applications such as transaction risk scoring and ad serving typically call for synchronous online prediction and are sensitive to latency. Customer lifetime value (CLV) predictions and recommendations are typically processed more effectively in large, periodic batches. Predictive maintenance use cases generally stream telemetry asynchronously. In these cases, you use a model to identify failure conditions, and the consumer of the predictions can be an aggregating application that isn't the sender of the telemetry. Your choice of a serving platform depends on how your model is consumed.

Compute support

AI Platform Prediction Basic supports both mls1 and n1-standard machine types, whereas custom containers in AI Platform Prediction support n1-standard machine types. For best performance, you should consider only the n1-standard machine types.

If you're comfortable with managing a prediction platform at the infrastructure level, and your applications are stateless and designed to retry lost requests, you can use GKE with preemptible nodes. Preemptible nodes can help you reduce compute costs. However, be aware that preemptible nodes last at most 24 hours. If you plan to use preemptible nodes in production, deploy only workloads that are tolerant of nodes disappearing. Offline, restartable jobs with checkpointing are ideal for preemptible nodes.
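
For example, you can add a preemptible node pool to an existing GKE cluster with a command like the following. This is a sketch; the pool name, cluster name, machine type, and node count are placeholders:

gcloud container node-pools create preemptible-pool \
    --cluster=CLUSTER_NAME \
    --preemptible \
    --machine-type=n1-standard-4 \
    --num-nodes=3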

Third-party model servers

If you need a third-party model server such as NVIDIA Triton Inference Server or TorchServe, you should choose a prediction platform that can serve directly from a container or that can load custom model servers.

Built-in support for frameworks

AI Platform Prediction Basic is a managed prediction service that directly supports TensorFlow, XGBoost, and scikit-learn. We recommend AI Platform Prediction Basic if the supported frameworks and framework versions can directly serve your model.
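
For example, an XGBoost model can be deployed to AI Platform Prediction Basic with gcloud. This is a sketch; the region, bucket path, runtime version, and resource names are placeholders:

gcloud ai-platform models create MODEL_NAME --regions=us-central1
gcloud ai-platform versions create VERSION_NAME \
    --model=MODEL_NAME \
    --origin=gs://BUCKET/model-directory/ \
    --runtime-version=RUNTIME_VERSION \
    --framework=xgboost \
    --python-version=3.7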

Preprocessing and post-processing of data

When you deploy a model for prediction, the shape of the incoming data is rarely the same as the shape of the data that was used to train the model. This difference, which is known as training-serving skew, requires that you preprocess your data before the model can process prediction requests. Predicted results also commonly require some form of post-processing. If possible, we recommend that you add your processing operations to your model. Frameworks such as TensorFlow let you add operations to the serving graph, which not only simplifies your environment but also increases its performance because you're reducing hops.

Sometimes you might be unable to incorporate processing into the model graph. For example, you might be using preprocessing with XGBoost, or your preprocessing might be a model that's trained in PyTorch while your classifier is a model that's trained in TensorFlow. Although custom prediction routines in AI Platform Prediction let you inject this kind of preprocessing code, we recommend that you use custom containers for performance reasons. Custom prediction routines not only add an extra hop, but the extra node is also the less-performant mls1 machine type.

Architecture for custom containers in AI Platform Prediction

The following diagram illustrates the core components for custom containers in AI Platform Prediction:

Core components include the container image, any preprocessing or post-processing code, the model, and the model store.

The following sections explain the important concepts in the preceding diagram.

Base image

You provide a custom container image (A) that's either the base image of a model server or an image that's derived from a model server's base image. In the example provided in AI Platform Prediction: Direct model server setup for NVIDIA Triton Inference Server, the model server is Triton Inference Server.
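
For example, you might pull a Triton Inference Server release from NVIDIA's container registry and push it to your project's Container Registry so that AI Platform Prediction can use it. The image tags below are illustrative; gcr.io/my_project/triton:20.06 matches the sample version specification later in this document:

docker pull nvcr.io/nvidia/tritonserver:20.06-py3
docker tag nvcr.io/nvidia/tritonserver:20.06-py3 gcr.io/my_project/triton:20.06
docker push gcr.io/my_project/triton:20.06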

Preprocessing and post-processing code

If any special code (B) is required, such as custom routing or processing, you use a new Dockerfile to incorporate those artifacts into the container. The preprocessing code listens for incoming requests, applies preprocessing, and then passes the preprocessed request to the model server. Any post-processing code runs on the model server's response and returns the results to the client.
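
A minimal Dockerfile sketch for this step might look like the following. The listener.py file, its location, and the entrypoint are assumptions for illustration only:

# Hypothetical Dockerfile; listener.py and its dependencies are assumptions.
FROM gcr.io/my_project/triton:20.06

# Add the custom preprocessing and post-processing code (the listener).
COPY listener.py /opt/listener/listener.py

# Start the listener instead of the model server directly. The listener
# launches the model server, forwards preprocessed requests to it, and
# post-processes the responses before returning them to the client.
ENTRYPOINT ["python3", "/opt/listener/listener.py"]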

Model store

In choosing the location where the model is persisted, you have two options:

  • Specify the path to model artifacts in Cloud Storage. If your model server can directly read from Cloud Storage, this option (C) is preferable. It lets you deploy a new model version without the need to create a new container (see the example after this list).
  • Copy the model artifacts directly into the container. This option (D) is for model servers that can't directly read from Cloud Storage. This option requires a new container each time that you deploy a new model version.
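
For the first option (C), you stage the model artifacts in Cloud Storage before you create the model version, for example with gsutil. The local model_repository directory is a placeholder, and the bucket matches the deployment_uri in the sample specification later in this document:

gsutil -m rsync -r ./model_repository gs://model-artifacts-source-location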

Model

AI Platform Prediction uses an architectural paradigm that's based on models and model versions, where the model is a logical grouping of model versions. A typical use for this paradigm is continuous training. In this case, applications that are designed for a particular model use that model but adopt newer model versions when the model is retrained over time.

Model version

A model version in AI Platform Prediction is an instance of a model that's trained with a specific dataset, configuration, and hyperparameters. In AI Platform Prediction, model versions are treated as immutable. If you need to update a model (for example, you retrained the model with a new dataset), you need to create a new model version and then roll it out. Model versions help to simplify canary rollouts. Because model versions exist within a model, you must create the model before you create the model version.
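
For example, after you create and validate a new model version, you can promote it so that requests that don't specify a version are routed to it. This is a sketch; the version and model names are placeholders:

gcloud ai-platform versions set-default NEW_VERSION_NAME \
    --model=MODEL_NAME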

Custom container

A custom container (E) is an instance of the model version. You create the custom container from the custom container image.

Inference client

After you configure the model server, the inference client (F) communicates with AI Platform Prediction.
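
For example, the inference client can send a prediction request to the model version's REST endpoint. The body format in request.json depends on the model server that's inside the container (Triton Inference Server, for example, expects its v2 inference protocol); request.json and the resource names are placeholders:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @request.json \
    https://ml.googleapis.com/v1/projects/PROJECT_ID/models/MODEL_NAME/versions/VERSION_NAME:predict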

Custom container patterns

For this series, we refer to two general architectural patterns for custom containers: the direct model server pattern, and the model server with listener pattern. The listener pattern can include custom preprocessing and post-processing. To help you choose the model serving pattern that best fits your needs, the following table describes some use cases that each pattern supports:

| Prediction needs | Direct model server | Model server with listener |
| --- | --- | --- |
| Simplified deployment | Yes | No |
| Lowest-possible latency | Yes | Depends on processing |
| Single-model support | Yes | Yes |
| Multiple-model stacking | No | Yes |
| Complex preprocessing and post-processing that's external to the model server | No | Yes |
| Customizable or multiple routing options | No | Yes |

The following sections describe the two container patterns.

Direct model server

The direct model server pattern offers direct access from AI Platform to the model server without any intermediary connections, as the following diagram shows:

The model server is stored in a custom container and is accessed directly.

This pattern is useful when you need the following:

  • One prediction route. Depending on the model server, supporting more than one model might require multiple routes. If you serve only one model, or if you can manage your model server by using only one predict route, the direct model server pattern is your best choice. That pattern is generally the easiest to configure.
  • Low latency. Direct communication between AI Platform and the model server eliminates any potential delays caused by an intermediary.
  • The ability to encapsulate preprocessing and post-processing in the model. Frameworks such as TensorFlow let you add preprocessing and post-processing operations to the compute graph. In general, this approach is the most performant way to handle preprocessing and post-processing because you minimize memory marshaling across processes. Some model servers, such as Triton Inference Server, offer ensemble modeling, where one model's output can be the input for another model. This capability allows for upstream preprocessing and downstream post-processing within the model server.
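
To illustrate the ensemble capability that's described in the preceding list, the following config.pbtxt sketch for Triton Inference Server chains a hypothetical preprocessing model into a classifier. The model names, tensor names, data types, and dimensions are assumptions for illustration only:

name: "preprocess_then_classify"
platform: "ensemble"
input [
  { name: "RAW_INPUT"  data_type: TYPE_STRING  dims: [ 1 ] }
]
output [
  { name: "CLASS_PROB"  data_type: TYPE_FP32  dims: [ 10 ] }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocess"
      model_version: -1
      input_map { key: "RAW" value: "RAW_INPUT" }
      output_map { key: "TENSOR" value: "preprocessed_tensor" }
    },
    {
      model_name: "classifier"
      model_version: -1
      input_map { key: "INPUT" value: "preprocessed_tensor" }
      output_map { key: "PROB" value: "CLASS_PROB" }
    }
  ]
}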

Model server with listener

In this pattern, you set up a listener between AI Platform and the model server, as the following diagram shows:

A custom container stores the model server and a listener that contains any preprocessing or post-processing code.

This pattern is useful when you need to handle the following:

  • Complex routing or multi-model stacking. Some model servers support multiple models in the same instance and require separate routes to access each model. To multiplex through the single predict route, a listener is usually required, with destination route information embedded elsewhere in the request (such as in a header).
  • Complex preprocessing and post-processing. If a framework doesn't let you embed preprocessing and post-processing logic in the model, that processing must run outside of the inference process. There might also be scenarios where a preprocessing step needs to make an external call to another service. These patterns become difficult or impossible to include in a graph.
  • Inference format conversion. To support the different formats across multiple model servers, you can use a listener to do format conversion.

It's important to understand that the listener is the intermediary between the model server and AI Platform and thus adds latency to the process. The impact of this latency is greater on a light model than on a heavy model. For example, if the inference time for a heavy deep learning model is 50 milliseconds (ms), and you add 20 ms of preprocessing, that 20 ms is a 40% increase. If you add the same 20 ms to an inference time of 10 ms for a light XGBoost model, that 20 ms is a 200% increase.

Planning a model and model version

AI Platform Prediction organizes model artifacts in a project-model-version hierarchy. For example, a company can have multiple Google Cloud projects. Each project can have multiple models. Each model can have one or more model versions. This structure looks like the following:

  • Example Organization
    • SalesOps project
      • CustomerPropensity model
        • v20201131 model version
        • v20201107 model version
        • v20201009 model version
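
If you want to inspect this hierarchy, you can list models and model versions with gcloud; for example (the model name comes from the example hierarchy above, and the project ID is a placeholder):

gcloud ai-platform models list --project=PROJECT_ID
gcloud ai-platform versions list --model=CustomerPropensity --project=PROJECT_ID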

For more information on models and model versions, see Prediction overview.

Before you create a model version, you must create a model. The following is the format for the REST request to create a model:

POST -d "{'name': 'MODEL_NAME'}" \
    ENDPOINT/projects/PROJECT_ID/models/

This request contains the following values:

  • MODEL_NAME: the name of the model that you're creating—for example, buyer_recommendation
  • ENDPOINT: the AI Platform service at https://ml.googleapis.com/v1
  • PROJECT_ID: the Google Cloud project where you want to create this model
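
For example, you can issue this request with curl, using gcloud for authentication:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{"name": "buyer_recommendation"}' \
    https://ml.googleapis.com/v1/projects/PROJECT_ID/models/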

You can create subsequent model versions in this model with the following POST request:

POST -d @version_spec.json \
    ENDPOINT/projects/PROJECT_ID/models/MODEL_NAME/versions
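
As with model creation, you can issue this request with curl; for example:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @version_spec.json \
    https://ml.googleapis.com/v1/projects/PROJECT_ID/models/MODEL_NAME/versions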

For more information, see the REST commands in the AI Platform Prediction reference.

The following is a sample version_spec.json file for Triton Inference Server:

{
  "name": "v3",
  "deployment_uri": "gs://model-artifacts-source-location",
  "container": {
    "image": "gcr.io/my_project/triton:20.06",
    "args": ["tritonserver",
             "--model-repository=$AIP_STORAGE_URI"
    ],
    "env": [
    ],
    "ports": [
      { "containerPort": 8000 }
    ]
  },
  "routes": {
    "predict": "/v2/models/$AIP_MODEL_NAME/infer",
    "health": "/v2/models/$AIP_MODEL_NAME"
  },
  "machine_type": "n1-standard4",
  "acceleratorConfig": {
    "count":1,
    "type":"nvidia-tesla-t4"
  },
  "autoScale": {
    "minNodes":1
  }
}

The following table describes the elements in the preceding JSON specification:

| Element (* = required) | Description |
| --- | --- |
| name* | The model version name. |
| deployment_uri | The location of the model artifacts in Cloud Storage. When a model version is created, AI Platform copies the artifacts from this location and makes them available to the container through the AIP_STORAGE_URI environment variable. |
| container:image* | The path to the container image. |
| container:args | The command and arguments that are used when the container starts. |
| container:env | The environment variables. These user-supplied environment variables are made available to the container at runtime. |
| container:ports* | The network port that AI Platform uses to communicate with the container. If a port isn't specified, the default port is 8080. |
| routes:predict | The container endpoint route that AI Platform uses to forward prediction requests to the container. If this element is absent, the route defaults to /v1/models/$AIP_MODEL_NAME/version/$AIP_VERSION_NAME:predict. |
| routes:health | The container endpoint route that AI Platform uses to check the health of the model version container. Traffic is directed to an instance of the model version only after the container returns a 200 HTTP status code to an HTTP GET request on this route. If this element isn't specified, the route defaults to /v1/models/$AIP_MODEL_NAME/version/$AIP_VERSION_NAME. |
| machine_type* | The virtual machine type for the inference node. |
| acceleratorConfig:count | If a GPU is required, the number of GPUs for each node. |
| acceleratorConfig:type | If a GPU is required, the type of GPU. |
| autoScaling:minNodes | The minimum number of nodes in autoscaling mode. This element can't be used with the manualScaling element. |
| manualScaling:nodes | The node count in manual scaling mode. This element can't be used with the autoScaling element. |

The environment variables that have the AIP_ prefix in version_spec.json exist only after the model version container is created. From the previous sample version_spec.json file, consider the following:

  • $AIP_STORAGE_URI. When you create a model version, AI Platform copies the model artifacts from deployment_uri. When the container comes online, it looks to $AIP_STORAGE_URI for its model assets. In the sample, we define the flag for the Triton Inference Server model repository as --model-repository=$AIP_STORAGE_URI.
  • $AIP_MODEL_NAME. Model servers such as Triton Inference Server and TorchServe can serve more than one model simultaneously and expect the request path to specify which model should run inference for the request. The value of this environment variable is derived from the MODEL_NAME value when the model version is created.

The following table describes some common AIP_ environment variables that are available to the container:

| Environment variable | Description |
| --- | --- |
| AIP_PROJECT_NUMBER | The project number of the project where the model is created. |
| AIP_MODEL_NAME | The name of the model in which model versions are created. The name is taken from the MODEL_NAME value in the POST request that creates the model. |
| AIP_VERSION_NAME | The name of the model version. The name is taken from the name value in the JSON specification that you use to create the model version. |
| AIP_HTTP_PORT | The target port of the listener in the custom container that AI Platform communicates with. The listener might be the model server itself or an intermediary, such as a data preprocessing step, that's upstream of the model server. The port is taken from container:ports in the JSON specification. |
| AIP_PREDICT_ROUTE | The route that AI Platform forwards prediction requests to when it communicates with the custom container. The route is taken from routes:predict in the JSON specification. |
| AIP_HEALTH_ROUTE | The route that AI Platform uses to check the health of the custom container. The route is taken from routes:health in the JSON specification. |
| AIP_STORAGE_URI | The location to which the model assets are copied from deployment_uri in the JSON specification. This environment variable defines where the model server looks for the model artifacts. |

What's next