Prediction overview

You can host your trained machine learning models in the cloud and use AI Platform Prediction to infer target values for new data. This page discusses model hosting and prediction and introduces considerations you should keep in mind for your projects.

How it works

AI Platform Prediction manages computing resources in the cloud to run your models. You can request predictions from your models and get predicted target values for them. Here is the process to get set up to make predictions in the cloud:

  1. You export your model as artifacts that you can deploy to AI Platform Prediction.

  2. You create a model resource in AI Platform Prediction and then create a model version from your saved model.

    If you're deploying a custom prediction routine, you also provide the code to run at prediction time.

  3. You format your input data for prediction and request either online prediction or batch prediction.

  4. When you use online prediction, the service runs your saved model and returns the requested predictions as the response message for the call.

    • Your model version is deployed in the region you specified when you created the model.
    • Although it is not guaranteed, a model version that you use regularly is generally kept ready to run.

    When you use batch prediction, which is only supported for TensorFlow models, the process is a little more involved:

    1. The prediction service allocates resources to run your job. This includes one or more prediction nodes.

    2. The service restores your TensorFlow graph on each allocated node.

    3. The prediction service distributes your input data across the allocated nodes.

    4. Each node runs your graph and saves the predictions to a Cloud Storage location that you specify.

    5. When all of your input data is processed, the service shuts down your job and releases the resources it allocated for it.

Model deployment

AI Platform Prediction can host your models so that you can get predictions from them in the cloud. The process of hosting a saved model is called deployment. The prediction service manages the infrastructure needed to run your model at scale, and makes it available for online and batch prediction requests. This section describes model deployment.

About models and versions

AI Platform Prediction organizes your trained models using resources called models and versions. A model is a machine learning solution. For example, you might create a model called census to contain all of your work on a U.S. census machine learning model. The entity you create, named census, is a container for actual implementations of the machine-learning model, which are called versions.

Developing a machine-learning model is an iterative process. For that reason, the AI Platform Prediction resource paradigm is set up with the assumption that you'll be making multiple versions of each machine learning model. This terminology can be confusing because an AI Platform Prediction model resource is not actually a machine-learning model on its own. In AI Platform Prediction, a model is a container for the versions of the machine learning model.

What's in a version?

The "model" that you deploy to AI Platform Prediction as a model version consists of one or more artifacts produced by training with one of the hosted frameworks: TensorFlow, scikit-learn, or XGBoost. It doesn't matter whether you trained your model in the cloud using AI Platform Training or elsewhere.

You can also deploy a custom prediction routine (beta), in which case you can provide the model version with additional training artifacts and code for handling prediction requests.

Variations between versions

The versions you create for any given model resource are arbitrary; you can use the same model resource even if you completely change the machine-learning model between versions. A model is an organizational tool that you can use however it makes sense for your situation.

It is common, especially after you have a version in production, to keep the inputs and outputs the same between model versions. This enables you to switch versions without needing to change other application structure you may have built around your model. It also makes it easy to test new versions with existing data.

Default version

Every model with at least one version has a default version; the default is set when the first version is created. If you request predictions specifying just a model name, AI Platform Prediction uses the default version for that model.

Note that the only time the service automatically sets the default version is when you create the very first one. You can manually make any subsequent version the default by calling projects.models.versions.setDefault, which is also exposed as gcloud ai-platform versions set-default and as an option in the Versions list of the Model details page on the Google Cloud console (open the details page by clicking your model in the list on the Models page). This enables you to, for example, use a stable default version to serve predictions in production while testing newer versions without creating a dedicated model resource for testing.
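For example, the following sketch uses the Google API Python client (google-api-python-client) to make an existing version the default; the project, model, and version names are placeholders.

from googleapiclient import discovery

# Build a client for the AI Platform Training and Prediction API.
service = discovery.build("ml", "v1")

# Fully qualified name of the version to promote (placeholder values).
name = "projects/PROJECT_ID/models/census/versions/v2"

# CLI equivalent: gcloud ai-platform versions set-default v2 --model=census
request = service.projects().models().versions().setDefault(name=name, body={})
print(request.execute())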

Naming models and versions

Model and version names must:

  • Contain only (case-sensitive) mixed-case letters, numbers, and underscores.
  • Begin with a letter.
  • Contain 128 or fewer characters.
  • Be unique within a given project (for models) or model (for versions).

There are no rules for names beyond those technical requirements, but here are some best practices:

  • Model names should be descriptive and distinctive—you may need to pick them out of lists of many names in logs or reports.
  • Version names are best kept short and simple. It's easier to identify 'v1' in a list of resources than '2017_01_29T13_54_58', for example.

Model and version limits

To learn how many models and versions you can create in one Google Cloud project, see Resource quotas.

Model deployment parameters

AI Platform Prediction needs some information to create your model version. You also have some options you can configure. This section describes the parameters of both types. These parameters are set in the Version object or, for convenience, added as flags in the gcloud ai-platform versions create command.

Version name
A name for the new version that is unique among the names of other versions of the model.
Description
You can provide a description for your version. At present, the description is returned only when you get the version information with the API; neither the Google Cloud CLI nor the Google Cloud console displays the description.
Deployment URI
You must provide the URI of the Cloud Storage location where your SavedModel is stored. AI Platform Prediction pulls the model from this location and deploys it. This parameter is called --origin in the gcloud ai-platform versions create command. If you deploy a custom prediction routine (beta), you may provide the URI of a Cloud Storage directory containing any artifacts your model version uses for prediction, not just a SavedModel.
Runtime version
AI Platform Prediction uses the latest stable runtime version to deploy your model version unless you specify a different supported one. The runtime version primarily determines the version of TensorFlow that the prediction service uses to run your model. When you run a batch prediction job, you have the option of overriding the assigned runtime version. Online prediction always uses the runtime version set when the model version is deployed.
Manual scaling

You can specify the number of prediction nodes to keep running for your model version. See the section on scaling for more information.

Staging bucket

If you are using the Google Cloud CLI to deploy your model, you can use a SavedModel on your local computer. The tool stages it in the Cloud Storage location you specify before deploying it to AI Platform Prediction.
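To illustrate how these parameters fit together, here is a minimal sketch of creating a version through the Google API Python client; the bucket path, runtime version, Python version, and resource names are placeholders, so substitute values supported in your project.

from googleapiclient import discovery

service = discovery.build("ml", "v1")
parent = "projects/PROJECT_ID/models/census"

version_body = {
    "name": "v1",                                        # version name
    "description": "First production candidate",         # optional description
    "deploymentUri": "gs://YOUR_BUCKET/census/model/",   # --origin in gcloud
    "runtimeVersion": "2.11",                            # runtime version
    "framework": "TENSORFLOW",
    "pythonVersion": "3.7",
}

# versions.create returns a long-running operation that completes when the
# version finishes deploying.
operation = service.projects().models().versions().create(
    parent=parent, body=version_body).execute()
print(operation["name"])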

Graph changes for prediction

You may have included TensorFlow Ops in your computation graph that were useful primarily in the context of training. Once you've trained your model, you can remove those ops from your graph before exporting your final version.

Much of the advice given on the training application development page is aimed at improving the prediction experience. In some cases, those are changes that you make to your model when the bulk of your training is done and you're ready to start deploying versions.

Getting predictions

You can send new data to your deployed model versions to get predictions. The following sections describe important prediction considerations.

Online prediction versus batch prediction

Learn about the differences between online and batch predictions.

Understanding prediction nodes and resource allocation

AI Platform Prediction measures the amount of processing you consume for prediction in node hours. This section describes these nodes and how they are allocated for the different types of prediction.

It's easiest to think of a node as a virtual machine (VM), even though they are implemented with a different mechanism than a traditional VM. Each node is provisioned with a set amount of processing power and memory. It also has an operating system image and a set configuration of software needed to run your model to get predictions.

Both online and batch prediction run with distributed processing, so a given request or job can use multiple nodes simultaneously. You are charged for total node usage by the minute, using an hourly rate. For example, running two nodes for ten minutes is charged the same as running one node for twenty minutes. Online and batch prediction allocate nodes differently, which can have a substantial effect on what you will be charged.

Node allocation for batch prediction

The batch prediction service scales the number of nodes it uses, to minimize the amount of elapsed time your job takes. To do that, the service:

  • Allocates some nodes to handle your job when you start it.

  • Scales the number of nodes during the job in an attempt to optimize efficiency. Each node takes time to get started, so the service tries to allocate just enough of them so that the startup time is countered by the reduction in elapsed time.

  • Shuts down the nodes as soon as your job is done.

You can affect the scaling of a batch prediction job by specifying a maximum number of nodes to use. You generally want as many nodes as the service will use, but node usage is subject to the AI Platform Prediction quota policy. You may want to limit the number of nodes allocated to a given job, especially if you share your project with others and potentially run jobs (both training and prediction) concurrently.
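As a sketch of how a job might cap its node usage, the following uses projects.jobs.create from the Google API Python client; all names, paths, the region, and the dataFormat value shown are placeholders or assumptions, so check the PredictionInput reference for the exact values your setup needs.

from googleapiclient import discovery

service = discovery.build("ml", "v1")
parent = "projects/PROJECT_ID"

job_body = {
    "jobId": "census_batch_predict_001",
    "predictionInput": {
        "modelName": "projects/PROJECT_ID/models/census",
        "dataFormat": "JSON",                              # newline-delimited JSON instances
        "inputPaths": ["gs://YOUR_BUCKET/inputs/instances-*.json"],
        "outputPath": "gs://YOUR_BUCKET/outputs/",
        "region": "us-central1",
        "maxWorkerCount": "10",                            # cap on allocated prediction nodes
    },
}

response = service.projects().jobs().create(parent=parent, body=job_body).execute()
print(response)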

Node allocation for online prediction

The online prediction service scales the number of nodes it uses, to maximize the number of requests it can handle without introducing too much latency. To do that, the service:

  • Allocates some nodes the first time you request predictions after a long pause in requests.

  • Scales the number of nodes in response to request traffic, adding nodes when traffic increases, and removing them when there are fewer requests.

  • Keeps at least one node ready for a period of several minutes, even when there are no requests to handle. The ready state ensures that the service can serve each prediction promptly.

  • Scales down to zero after your model version goes several minutes without a prediction request.

After the service has scaled down to zero, or when there is a sudden spike in traffic, it can take time (seconds to minutes) to initialize nodes to serve requests. The initialization time depends on your model version size, so client-side timeouts can cause requests to be dropped, or latency to increase, until the new nodes are initialized.

To ensure prompt serving at all times, you can specify a minimum number of nodes that the service should keep ready, by setting the minNodes option on your model version. This setting can increase cost, as you pay for the nodes even when no predictions are served.
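For example, the autoScaling.minNodes field of the Version resource might look like the following fragment (values are illustrative):

version_body = {
    "name": "v3",
    "deploymentUri": "gs://YOUR_BUCKET/census/model/",
    "runtimeVersion": "2.11",
    "autoScaling": {"minNodes": 1},  # keep at least one node warm; billed even when idle
}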

Limitations of automatic scaling

AI Platform Prediction automatic scaling for online prediction can help you serve varying rates of prediction requests while minimizing costs. However, it is not ideal for all situations. The service may not be able to bring nodes online fast enough to keep up with large spikes of request traffic. If your traffic regularly has steep spikes, and if reliably low latency is important to your application, you may want to consider manual scaling.

Using manual scaling

You can affect the scaling of online prediction for a model version by specifying a number of nodes to keep running regardless of traffic. Setting the number of nodes manually stops the service from scaling, which means that the number of nodes you specify is always ready, and you are charged for those nodes continuously. Avoid this approach unless the rate of requests your model receives inherently fluctuates faster than automatic scaling can keep up with. You set the number of nodes to use by setting manualScaling in the Version object you pass to projects.models.versions.create.
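A comparable Version fragment for manual scaling might look like this (illustrative values; manualScaling and autoScaling are mutually exclusive):

version_body = {
    "name": "v3_manual",
    "deploymentUri": "gs://YOUR_BUCKET/census/model/",
    "runtimeVersion": "2.11",
    "manualScaling": {"nodes": 2},  # exactly two nodes run (and are billed) at all times
}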

Multi-zone scaling

If your version uses a Compute Engine (N1) machine type and you set autoScaling.minNodes or manualScaling.nodes to 2 or greater (depending on whether you are using automatic or manual scaling), then your prediction nodes run in multiple zones in the same region. This ensures continuous availability if there is an outage in one of the zones.

Prediction input data

The data you use for getting predictions is new data that takes the same form as the data you used for training. Online and batch prediction both use the same data (the features of your model), but they require different formats depending on which type of prediction and which interface you use. These formats are summarized in the following table, and described in more detail in the sections below:

Prediction type and interface | Supported input format
Batch with API call           | Text file with JSON instance strings or TFRecord file (may be compressed)
Batch with gcloud CLI         | Text file with JSON instance strings or TFRecord file (may be compressed)
Online with API call          | JSON request message
Online with gcloud CLI        | Text file with JSON instance strings or CSV file

Instances JSON strings

The basic format for both online and batch prediction is a list of instance data tensors. These can be either plain lists of values or members of a JSON object, depending on how you configured your inputs in your training application:

This example shows an input tensor and an instance key:

{"values": [1, 2, 3, 4], "key": 1}

The makeup of the JSON string can be complex as long as it follows these rules:

  • The top level of instance data must be a JSON object—a dictionary of key/value pairs.

  • Individual values in an instance object can be strings, numbers, or lists. You cannot embed JSON objects.

  • Lists must contain only items of the same type (including other lists). You may not mix string and numerical values.

The following string (formatted for readability) shows an object containing a label and an image, where the image is a 3-dimensional array of 8-bit integers:

{
  "tag": "beach",
  "image": [
    [
      [138, 30, 66],
      [130, 20, 56],
      ...
    ],
    [
      [126, 38, 61],
      [122, 24, 57],
      ...
    ],
        ...
  ]
}

If your model only takes a single input, you don't need to wrap it in a JSON object. For example, if you send a single tensor (vector in this case) with four values you don't need to format it like this:

{"values": [1, 2, 3, 4]}

You can just format each instance as a list:

[1, 2, 3, 4]
Binary data in prediction input

Binary data can't be formatted as the UTF-8 encoded strings that JSON supports. If you have binary data in your inputs, you must use base64 encoding to represent it. The following special formatting is required:

  • Your encoded string must be formatted as a JSON object with a single key named b64. The following Python example encodes a buffer of raw JPEG data using the base64 library to make an instance (a complete, runnable sketch follows this list):

    {"image_bytes": {"b64": base64.b64encode(jpeg_data).decode("utf-8")}}

  • In your TensorFlow model code, you must name the aliases for your input and output tensors so that they end with '_bytes'.
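Here is a minimal, runnable sketch of the pattern above; the file path, the image_bytes alias, and the key field are placeholders that must match your model's serving signature.

import base64
import json

# Read raw binary image data.
with open("sample.jpg", "rb") as f:
    jpeg_data = f.read()

instance = {
    # The input alias ends with "_bytes", and the base64-encoded payload is
    # wrapped in an object whose single key is "b64".
    "image_bytes": {"b64": base64.b64encode(jpeg_data).decode("utf-8")},
    "key": 0,
}

print(json.dumps(instance))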

Online prediction input data

You pass input instances for online prediction as the message body for the predict request. For formatting of the request and response body, see the details of the prediction request.

In brief: Make each instance an item in a list, and name the list member instances. So, the simple data instance JSON example given above becomes:

{"instances": [{"values": [1, 2, 3, 4], "key": 1}]}

When you use gcloud ai-platform predict to request online predictions, you pass in a file with the same format you use for batch prediction.
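As a sketch, the same request made through the Google API Python client looks like the following; the project and model names and the instance contents are placeholders.

from googleapiclient import discovery

service = discovery.build("ml", "v1")

# Append "/versions/VERSION_NAME" to pin a specific version instead of the default.
name = "projects/PROJECT_ID/models/census"

body = {"instances": [{"values": [1, 2, 3, 4], "key": 1}]}

response = service.projects().predict(name=name, body=body).execute()

if "error" in response:
    raise RuntimeError(response["error"])
print(response["predictions"])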

Batch prediction input data

You provide input data for batch prediction in one or more text files containing rows of JSON instance data as described above. An input file contains no column headers or other formatting beyond the simple JSON syntax.

{"image": [0.0, 0.0, ... ], "key": 0}
{"image": [0.0, 0.0, ... ], "key": 1}
{"image": [0.0, 0.0, ... ], "key": 2}

Instance keys

AI Platform Prediction runs your batch prediction jobs using distributed processing. This means that your data is distributed among an arbitrary cluster of virtual machines, and is processed in an unpredictable order. To be able to match the returned predictions with your input instances, you must have instance keys defined. An instance key is a value that every instance has that is unique among the instances in a set of data. The simplest key is an index number.

You should pass the keys through your graph unaltered in your training application. If your data doesn't already have instance keys, you can add them as part of your data preprocessing.
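For instance, a preprocessing step that adds index-based keys while writing a batch input file could look like this minimal sketch (the "image" feature name and file name are illustrative):

import json

features = [[0.0, 0.0, 1.0], [0.5, 0.25, 0.0], [1.0, 0.75, 0.5]]

with open("instances.json", "w") as f:
    for i, image in enumerate(features):
        # The key travels through the graph unaltered, so each returned
        # prediction can be matched back to its input instance.
        f.write(json.dumps({"image": image, "key": i}) + "\n")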

Runtime versions

As new versions of AI Platform Prediction are released, it is possible that models developed against older versions will become obsolete. This is particularly pertinent if you arrive at an effective model version that remains unchanged for a long period. You should review the AI Platform Prediction versioning policy and make sure that you understand the AI Platform Prediction runtime version that you use to train your model versions.

Runtime versions and predictions

You must specify a supported AI Platform Prediction runtime version when you create a model version. Doing so establishes the default runtime version for that model version.

You can specify a runtime version to use when you start a batch prediction job. This is to accommodate getting predictions using a model that is not deployed on AI Platform Prediction. For a deployed model, use the model's default runtime version in your job request. Using a different runtime version is likely to cause unexpected errors.

You cannot request online predictions from models outside of AI Platform Prediction, so there is no option to override the default runtime version in your request.

The default runtime version set for a model version cannot be changed. To specify a different runtime version for a model version, deploy a new version using the same training artifacts that you used initially.

Regions and predictions

Google Cloud uses regions, subdivided into zones, to define the geographic location of physical computing resources. When you deploy a model for prediction using AI Platform Prediction, you specify the default region that you want the prediction to run in.

When you start a batch prediction job, you can specify a region to run the job in, overriding the default region. Online predictions are always served from the default region specified for the model.

To see the available regions for AI Platform Prediction services, including model training and online/batch prediction, read the guide to regions.

Prediction logging

Batch prediction generates job logs that you can view in Cloud Logging. You can also get logs for online prediction requests if you configure your model to generate them when you create it. Note that you must specify this option when you create your model resource in AI Platform Prediction; either all versions of a model generate logs for online predictions, or none of them do.

You can set online prediction logging for a model by setting onlinePredictionLogging to true (True in Python) in the Model resource you use when creating your model with projects.models.create. If you use the Google Cloud CLI to create your model, include the --enable-logging flag when you run gcloud ai-platform models create.
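A minimal sketch of that API call, assuming the Google API Python client and placeholder project, model, and region names:

from googleapiclient import discovery

service = discovery.build("ml", "v1")
parent = "projects/PROJECT_ID"

model_body = {
    "name": "census",
    "regions": ["us-central1"],
    "onlinePredictionLogging": True,  # applies to all versions of this model
}

response = service.projects().models().create(parent=parent, body=model_body).execute()
print(response)

# CLI equivalent: gcloud ai-platform models create census --regions=us-central1 --enable-logging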

Getting predictions from undeployed models

You can request batch prediction using a model that you haven't deployed to the AI Platform Prediction service. Instead of specifying a model or version name, you can use the URI of a Cloud Storage location where the model you want to run is stored.

Because an undeployed model doesn't have an established default runtime version, you must explicitly set it in your job request.

In all other ways, a batch prediction job using an undeployed model behaves as any other batch job.
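For example, the predictionInput portion of such a job request might look like the following fragment; the Cloud Storage paths, region, runtime version, and dataFormat value are placeholders or assumptions.

prediction_input = {
    "uri": "gs://YOUR_BUCKET/census/model/",     # SavedModel directory in Cloud Storage
    "runtimeVersion": "2.11",                    # required: an undeployed model has no default
    "dataFormat": "JSON",
    "inputPaths": ["gs://YOUR_BUCKET/inputs/instances-*.json"],
    "outputPath": "gs://YOUR_BUCKET/outputs/",
    "region": "us-central1",
}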

Model testing

You can use AI Platform Prediction to host models that are in production, but you can also use it to test your models. Traditionally, model testing is the step before you prepare to deploy a machine learning solution. The purpose of a test pass is to exercise your model in an environment that's as close as possible to the way it will be used in real-world situations.

Remember that you can have multiple versions of a model concurrently deployed on the service. That means you can have multiple revisions of your model in testing at once if you need to. It also makes it easy to have a production version of the model deployed while testing the next revision. As with so much of developing machine learning applications, the availability of fresh data is often a limiting factor. You should develop strategies to split the data you have and collect new data to use for testing.

What's next