Prediction Basics

You can host your trained machine learning models in the cloud and use the Cloud ML Engine prediction service to infer target values for new data. This page discusses model hosting and prediction and introduces considerations you should keep in mind for your projects.

How it works

The Cloud ML Engine prediction service manages computing resources in the cloud to run your models. You can request predictions from your models and get predicted target values for them. Here is the process to get set up to make predictions in the cloud:

  1. You export your model using SavedModel as part of your training application.

  2. You create a model resource in Cloud ML Engine and then create a model version from your saved model.

  3. You format your input data for prediction and request either online prediction or batch prediction.

  4. When you use online prediction, the service runs your saved model and returns the requested predictions as the response message for the call.

    • Your model version is deployed in the region you specified when you created the model.
    • Although it is not guaranteed, a model version that you use regularly is generally kept ready to run.

    When you use batch prediction, the process is a little more involved:

    1. The prediction service allocates resources to run your job. This includes one or more prediction nodes.

    2. The service restores your TensorFlow graph on each allocated node.

    3. The prediction service distributes your input data across the allocated nodes.

    4. Each node runs your graph and saves the predictions to a Google Cloud Storage location that you specify.

    5. When all of your input data is processed, the service shuts down your job and releases the resources it allocated for it.

Model deployment

Cloud ML Engine can host your models so that you can get predictions from them in the cloud. The process of hosting a saved model is called deployment. The prediction service manages the infrastructure needed to run your model at scale, and makes it available for online and batch prediction requests. This section describes model deployment.

About models and versions

Cloud ML Engine organizes your trained models using resources called models and versions. A model is a machine learning solution. For example, you might create a model called census to contain all of your work on a U.S. census machine learning model. The entity you create, named census, is a container for actual implementations of the machine-learning model, which are called versions.

Developing a machine-learning model is an iterative process. For that reason, the Cloud ML Engine resource paradigm is set up with the assumption that you'll be making multiple versions of each machine learning model. This terminology can be confusing because a Cloud ML Engine model resource is not actually a machine-learning model on its own. In Cloud ML Engine a model is a container for the versions of the machine learning model.

What's in a version?

The "model" that you deploy to Cloud ML Engine as a model version is a TensorFlow SavedModel. You export a SavedModel in your trainer. It doesn't matter whether you trained your model in the cloud using Cloud ML Engine or elsewhere as long as you have a SavedModel with the serving signature set to serving_default.

Variations between versions

The versions you create for any given model resource are arbitrary; you can use the same model resource even if you completely change the machine-learning model between versions. A model is an organizational tool that you can use however it makes sense for your situation.

It is common, especially after you have a version in production, to keep the inputs and outputs the same between model versions. This enables you to switch versions without needing to change other application structure you may have built around your model. It also makes it easy to test new versions with existing data.

Default version

Every model with at least one version has a default version; the default is set when the first version is created. If you request predictions specifying just a model name, Cloud ML Engine uses the default version for that model.

Note that the only time the service automatically sets the default version is when you create the very first one. You can manually make any subsequent version the default by calling projects.models.versions.setDefault. This method is also exposed as gcloud ml-engine versions set-default and as an option in the Versions list of the Model details page on Google Cloud Platform Console (to reach the Model details page, click your model in the models list on the Models page). This enables you to, for example, use a stable default version to serve predictions in production while testing newer versions without creating a dedicated model resource for testing.
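
The default-version behavior described above can be modeled in a few lines of Python. This is a minimal sketch, not the service's implementation: the ModelResource class, its method names, and the gs://example-bucket paths are hypothetical and exist only for illustration.

```python
class ModelResource:
    """Hypothetical stand-in for a model resource: a named container of versions."""

    def __init__(self, name):
        self.name = name
        self.versions = {}   # version name -> Cloud Storage URI of the SavedModel
        self.default = None  # name of the default version, if any

    def create_version(self, version_name, deployment_uri):
        if version_name in self.versions:
            raise ValueError("version names must be unique within a model")
        self.versions[version_name] = deployment_uri
        # Only the very first version becomes the default automatically.
        if self.default is None:
            self.default = version_name

    def set_default(self, version_name):
        if version_name not in self.versions:
            raise KeyError(version_name)
        self.default = version_name

    def resolve(self, version_name=None):
        """Return the version a prediction request would use."""
        chosen = version_name or self.default
        if chosen not in self.versions:
            raise KeyError(chosen)
        return chosen


census = ModelResource("census")
census.create_version("v1", "gs://example-bucket/census/v1/")
census.create_version("v2", "gs://example-bucket/census/v2/")
print(census.resolve())   # v1 is still the default after v2 is created
census.set_default("v2")
print(census.resolve())   # now v2
```

A request that names a version explicitly, such as census.resolve("v1"), bypasses the default, which is what makes it possible to test a new version alongside the one in production.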

Naming models and versions

Model and version names must:

  • Contain only (case-sensitive) mixed-case letters, numbers, and underscores.
  • Begin with a letter.
  • Contain 128 or fewer characters.
  • Be unique within a given project (for models) or model (for versions).
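
The naming rules above can be checked before you call the service. The following is a minimal sketch; the function name is_valid_name is hypothetical, and it assumes that "letters" means the ASCII letters A to Z in either case.

```python
import re

# Letters, digits, and underscores only; must begin with a letter;
# 128 characters at most (1 leading letter + up to 127 more).
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,127}")


def is_valid_name(name, existing_names=()):
    """Check a candidate model or version name against the listed rules."""
    if not _NAME_PATTERN.fullmatch(name):
        return False
    # Names are case-sensitive, so uniqueness is an exact comparison.
    return name not in existing_names


print(is_valid_name("census_v1"))                 # True
print(is_valid_name("1census"))                   # False: begins with a digit
print(is_valid_name("v1", existing_names={"v1"})) # False: not unique
```

Pass the names of the project's models (or of the model's versions) as existing_names to cover the uniqueness rule.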

There are no rules for names beyond those technical requirements, but here are some best practices:

  • Model names should be descriptive and distinctive—you may need to pick them out of lists of many names in logs or reports.
  • Version names are best kept short and simple. It's easier to identify 'v1' in a list of resources than '2017_01_29T13_54_58', for example.

Model and version limits

The Cloud ML Engine quota policy sets a limit of 100 models per project and limits the total number of versions (combined between all models) to 200.

Model deployment parameters

Cloud ML Engine needs some information to create your model version. You also have some options you can configure. This section describes the parameters of both types. These parameters are defined in the Version object or added for convenience in the gcloud ml-engine versions create command.

Version name

A name for the new version that is unique among the names of other versions of the model.

Description

You can provide a description for your version. At present the description is returned only when you get the version information with the API; neither the gcloud command-line tool nor Google Cloud Platform Console displays it.

Deployment URI

You must provide the URI of the Cloud Storage location where your SavedModel is stored. Cloud ML Engine pulls the model from this location and deploys it. This parameter is called --origin in the gcloud ml-engine versions create command.

Runtime version

Cloud ML Engine uses the latest stable runtime version to deploy your model version unless you specify a different supported one. The runtime version primarily determines the version of TensorFlow that the prediction service uses to run your model. When you run a batch prediction job you have the option of overriding the assigned runtime version. Online prediction always uses the runtime version set when the model version is deployed.

Manual scaling

You can specify the number of prediction nodes to keep running for your model version. See the section on scaling for more information.

Staging bucket

If you are using the gcloud command-line tool to deploy your model, you can use a SavedModel on your local computer. The tool stages it in the Cloud Storage location you specify before deploying it to Cloud ML Engine.
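
As an illustration of how these parameters fit together, the sketch below assembles a request body shaped like the Version object. Treat it as an assumption-laden example: build_version_body is a hypothetical helper, the bucket path is made up, and the field names deploymentUri, runtimeVersion, and manualScaling.nodes reflect my reading of the Version object and should be checked against the API reference before use.

```python
import json


def build_version_body(name, deployment_uri, description=None,
                       runtime_version=None, manual_nodes=None):
    """Assemble a Version-style body, leaving out options that are not set."""
    body = {"name": name, "deploymentUri": deployment_uri}
    if description is not None:
        body["description"] = description
    if runtime_version is not None:
        # Omitted means the service picks its default runtime version.
        body["runtimeVersion"] = runtime_version
    if manual_nodes is not None:
        # Fixed number of prediction nodes kept running for this version.
        body["manualScaling"] = {"nodes": manual_nodes}
    return body


body = build_version_body("v1", "gs://example-bucket/census/v1/", manual_nodes=2)
print(json.dumps(body, indent=2))
```

Leaving optional fields out of the body, rather than sending empty values, lets the service apply its own defaults.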

Graph changes for prediction

You may have included TensorFlow ops in your computation graph that were useful primarily in the context of training. Once you've trained your model, you can remove those ops from your graph before exporting your final version.

Much of the advice given in the training application development page is aimed at the prediction experience. In some cases those are changes that you make to your model when the bulk of your training is done and you're ready to start deploying versions.

Getting predictions

You can send new data to your deployed model versions to get predictions. The following sections describe important prediction considerations.

Online prediction versus batch prediction

Cloud ML Engine provides two ways to get predictions from trained models: online prediction (sometimes called HTTP prediction), and batch prediction. In both cases, you pass input data to a cloud-hosted machine-learning model and get inferences for each data instance. The differences are shown in the following table:

Online prediction | Batch prediction
Optimized to minimize the latency of serving predictions. | Optimized to handle a high volume of instances in a job and to run more complex models.
Can process one or more instances per request. | Can process one or more instances per request.
Predictions returned in the response message. | Predictions written to output files in a Cloud Storage location that you specify.
Input data passed directly as a JSON string. | Input data passed indirectly as one or more URIs of files in Cloud Storage locations.
Returns as soon as possible. | Asynchronous request.
Anyone with Viewer access to the project can request. | Must be a project Editor to run.
Runs on the runtime version and in the region selected when you deploy the model. | Can run in any available region, using any available runtime version, though you should run with the defaults for deployed model versions.
Runs models deployed to Cloud ML Engine. | Runs models deployed to Cloud ML Engine or models stored in accessible Google Cloud Storage locations.

The needs of your application dictate the type of prediction you should use. You should generally use online prediction when you are making requests in response to application input or in other situations where timely inference is needed. Batch prediction is ideal for processing accumulated data when you don't need immediate results, for example in a periodic job that gets predictions for all data collected since the last job. You should also weigh the potential differences in prediction costs when you make your decision.

Batch prediction latency

If you use a simple model and a small set of input instances, you'll find that there is a considerable difference between how long it takes to finish identical prediction requests using online versus batch prediction. It might take a batch job several minutes to complete predictions that are returned almost instantly by an online request. This is a side-effect of the different infrastructure used by the two methods of prediction. Cloud ML Engine allocates and initializes resources for a batch prediction job when you send the request. Online prediction is typically ready to process at the time of request.

Understanding prediction nodes and resource allocation

Cloud ML Engine measures the amount of processing you consume for prediction in node hours. This section describes these nodes and how they are allocated for the different types of prediction.

It's easiest to think of a node as a virtual machine (VM), even though nodes are implemented with a different mechanism than a traditional VM. Each node is provisioned with a set amount of processing power and memory. It also has an operating system image and a set configuration of software needed to run your model to get predictions.

Both online and batch prediction run your model with distributed processing, so a given request or job can use multiple nodes simultaneously. You are charged for total node usage by the minute, using an hourly rate. For example, running two nodes for ten minutes is charged the same as running one node for twenty minutes. Online and batch prediction allocate nodes differently, which can have a substantial effect on what you will be charged.
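
The node-hour arithmetic can be written out directly. This is a small sketch of the calculation only; the function names are hypothetical and the rate used below is a made-up number, not a published price.

```python
def node_hours(nodes, minutes):
    """Node hours consumed by `nodes` nodes that each run for `minutes` minutes."""
    return nodes * minutes / 60.0


def charge(nodes, minutes, hourly_rate):
    """Charge for that usage at a given price per node hour."""
    return node_hours(nodes, minutes) * hourly_rate


HYPOTHETICAL_RATE = 0.50  # made-up price per node hour, for illustration only

# Two nodes for ten minutes consume the same node hours as one node for twenty.
print(node_hours(2, 10) == node_hours(1, 20))
print(charge(2, 30, HYPOTHETICAL_RATE))
```

Because the charge depends only on the product of nodes and time, adding nodes shortens a job without changing its cost, as long as the total work stays the same.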

Node allocation for batch prediction

The batch prediction service scales the number of nodes it uses to minimize the amount of elapsed time your job takes. To do that, the service:

  • Allocates some nodes to handle your job when you start it.

  • Scales the number of nodes during the job in an attempt to optimize efficiency. Each node takes time to get started, so the service tries to allocate just enough of them so that the startup time is countered by the reduction in elapsed time.

  • Shuts down the nodes as soon as your job is done.

You can affect the scaling of a batch prediction job by specifying a maximum number of nodes to use. You generally want as many nodes as the service will use, but node usage is subject to the Cloud ML Engine quota policy. You may want to limit the number of nodes allocated to a given job, especially if you share your project with others and potentially run jobs (both training and prediction) concurrently.

Node allocation for online prediction

The online prediction service scales the number of nodes it uses to maximize the number of requests it can handle without introducing too much latency. To do that, the service:

  • Allocates some nodes the first time you request predictions after a long pause in requests.

  • Scales the number of nodes in response to request traffic, adding nodes when traffic increases, and removing them when there are fewer requests.

  • Keeps at least one node ready for several minutes after the most recent request, so that new requests can be handled promptly. The service scales down to zero only when your model version goes several minutes without a prediction request.

The service keeps your model in a ready state as long as you have a steady stream of requests. In this way each prediction can be served promptly. However, it can take a long time—tens of seconds, maybe as much as a few minutes—to initialize nodes to serve a request when the service has scaled down to zero.

Limitations of automatic scaling

Cloud ML Engine automatic scaling for online prediction can help you serve varying rates of prediction requests while minimizing costs. However, it is not ideal for all situations. The service may not be able to bring nodes online fast enough to keep up with large spikes of request traffic. If your traffic regularly has steep spikes, and if reliably low latency is important to your application, you may want to consider manual scaling.

Using manual scaling

You can affect the scaling of online prediction for a model version by specifying a number of nodes to keep running regardless of traffic. Setting the number of nodes manually actually stops the service from scaling, which means that the number of nodes you specify will always be ready, and you will be charged continuously for them. You should avoid this unless the number of requests your model receives inherently fluctuates faster than the automatic scaling can keep up. You set the number of nodes to use by setting manualScaling in the Version object you pass to projects.models.versions.create.

Prediction input data

The data you use for getting predictions is new data that takes the same form as the data you used for training. Online and batch prediction both use the same data (the features of your model), but they require different formats depending on which type of prediction and which interface you use. These formats are summarized in the following table, and described in more detail in the sections below:

Prediction type and interface | Supported input format
Batch with API call | Text file with JSON instance strings or TFRecords file (may be compressed)
Batch with gcloud tool | Text file with JSON instance strings or TFRecords file (may be compressed)
Online with API call | JSON request message
Online with gcloud tool | Text file with JSON instance strings or CSV file

Instances JSON strings

The basic format for both online and batch prediction is a list of instance data tensors. These can be either plain lists of values or members of a JSON object, depending on how you configured your inputs in your training application.

This example shows an input tensor and an instance key:

{"values": [1, 2, 3, 4], "key": 1}

The makeup of the JSON string can be complex as long as it follows these rules:

  • The top level of instance data must be a JSON object—a dictionary of key/value pairs.

  • Individual values in an instance object can be strings, numbers, or lists. You cannot embed JSON objects.

  • Lists must contain only items of the same type (including other lists). You may not mix string and numerical values.
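
The three rules above can be checked locally before you send data. The following is a minimal sketch with hypothetical function names; it covers instances that are JSON objects and does not handle the bare-list form allowed for single-input models. Python booleans pass as numbers here because bool is a subclass of int.

```python
def kind(value):
    """Classify a value as a string, number, or list; reject anything else."""
    if isinstance(value, str):
        return "string"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, list):
        inner = {kind(item) for item in value}
        if len(inner) > 1:
            raise ValueError("list mixes item types")
        # Carry the item kind along so [[1], ["a"]] is also seen as mixed.
        return ("list", inner.pop() if inner else None)
    # Dictionaries land here, so embedded JSON objects are rejected.
    raise ValueError("unsupported value type: %s" % type(value).__name__)


def validate_instance(instance):
    """Return True, or raise ValueError if the instance breaks a rule."""
    if not isinstance(instance, dict):
        raise ValueError("top level of an instance must be a JSON object")
    for value in instance.values():
        kind(value)
    return True


print(validate_instance({"values": [1, 2, 3, 4], "key": 1}))
```

Running the check while you prepare input files surfaces formatting mistakes before a batch job has been started.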

The following string (formatted for readability) shows an object containing a label and an image, where the image is a 3-dimensional array of 8-bit integers:

{
  "tag": "beach",
  "image": [
    [
      [138, 30, 66],
      [130, 20, 56],
      ...
    ],
    [
      [126, 38, 61],
      [122, 24, 57],
      ...
    ],
    ...
  ]
}

If your model only takes a single input, you don't need to wrap it in a JSON object. For example, if you send a single tensor (vector in this case) with four values you don't need to format it like this:

{"values": [1, 2, 3, 4]}

You can just format each instance as a list:

[1, 2, 3, 4]

Binary data in prediction input

Binary data can't be formatted as the UTF-8 encoded strings that JSON supports. If you have binary data in your inputs, you must use base64 encoding to represent it. The following special formatting is required:

  • Your encoded string must be formatted as a JSON object with a single key named b64. The following Python example encodes a buffer of raw JPEG data using the base64 library (you must import base64) to make an instance. In Python 3, b64encode returns bytes, so the result is decoded to a string before it is placed in the JSON object:

    {"image_bytes": {"b64": base64.b64encode(jpeg_data).decode("utf-8")}}

  • In your TensorFlow model code (in your training application), you must name the input/output aliases for that value so that they end with '_bytes'.
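
A complete, runnable version of the encoding step might look like the following. It is a sketch under stated assumptions: make_binary_instance is a hypothetical helper, and the byte string stands in for real JPEG data.

```python
import base64
import json


def make_binary_instance(raw_bytes, alias="image_bytes"):
    """Wrap raw bytes as {"<alias>": {"b64": "<base64 text>"}}."""
    if not alias.endswith("_bytes"):
        raise ValueError("alias for binary data must end with '_bytes'")
    # b64encode returns bytes; decode to str so json.dumps can serialize it.
    encoded = base64.b64encode(raw_bytes).decode("utf-8")
    return {alias: {"b64": encoded}}


jpeg_data = b"\xff\xd8\xff\xe0 placeholder, not a real image"
print(json.dumps(make_binary_instance(jpeg_data)))
```

The alias check mirrors the naming requirement in the list above, so a mismatch between the instance key and the model's input alias is caught early.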

Online prediction input data

You pass input instances for online prediction as the message body for the projects.predict call. Make each instance an item in a list, and name the list member instances.

So the simple data instance JSON example given above becomes:

{"instances": [{"values": [1, 2, 3, 4], "key": 1}]}
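
Building that message body in Python is a one-line wrap around your instances. The sketch below only constructs the JSON string; sending it (authentication, HTTP client) is out of scope, and build_predict_body is a hypothetical helper name.

```python
import json


def build_predict_body(instances):
    """Return the JSON message body: one object with an "instances" list."""
    return json.dumps({"instances": list(instances)})


body = build_predict_body([{"values": [1, 2, 3, 4], "key": 1}])
print(body)
```

Passing several instances in the same list lets one request return several predictions.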

Batch prediction input data

You provide input data for batch prediction in one or more text files containing rows of JSON instance data as described above. An input file contains no column headers or other formatting beyond the simple JSON syntax.

{"image": [0.0, 0.0, ... ], "key": 0}
{"image": [0.0, 0.0, ... ], "key": 1}
{"image": [0.0, 0.0, ... ], "key": 2}
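
A file in this format can be produced by writing one serialized instance per line. This is a minimal sketch: write_instances is a hypothetical helper, and an in-memory buffer stands in for a real file that you would later copy to Cloud Storage.

```python
import io
import json


def write_instances(instances, stream):
    """Write one JSON instance per line, with no header row."""
    count = 0
    for instance in instances:
        stream.write(json.dumps(instance) + "\n")
        count += 1
    return count


instances = [{"image": [0.0, 0.0], "key": k} for k in range(3)]
buffer = io.StringIO()  # stands in for a file opened with open(path, "w")
write_instances(instances, buffer)
print(buffer.getvalue(), end="")
```

Serializing each instance with json.dumps keeps every record on a single line, because the default output contains no newlines.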

Instance keys

Cloud ML Engine runs your batch prediction jobs using distributed processing. This means that your data is distributed among an arbitrary cluster of virtual machines, and is processed in an unpredictable order. To be able to match the returned predictions with your input instances, you must have instance keys defined. An instance key is a value that every instance has that is unique among the instances in a set of data. The simplest key is an index number.

You should pass the keys through your graph unaltered in your training application. If your data doesn't already have instance keys, you can add them as part of your data preprocessing.
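
Once the keys come back with the predictions, matching outputs to inputs is a dictionary lookup. The sketch below assumes that each prediction is a JSON object on its own line and that it carries the key under the same name as the input; the "score" field and the join_by_key helper are made up for illustration.

```python
import json


def join_by_key(input_lines, prediction_lines, key_name="key"):
    """Pair each input instance with its prediction using the shared key."""
    predictions = {}
    for line in prediction_lines:
        record = json.loads(line)
        predictions[record[key_name]] = record
    pairs = []
    for line in input_lines:
        instance = json.loads(line)
        # None marks an instance for which no prediction was found.
        pairs.append((instance, predictions.get(instance[key_name])))
    return pairs


inputs = ['{"values": [1, 2], "key": 0}', '{"values": [3, 4], "key": 1}']
# Predictions arrive in arbitrary order.
outputs = ['{"key": 1, "score": 0.9}', '{"key": 0, "score": 0.1}']
for instance, prediction in join_by_key(inputs, outputs):
    print(instance["key"], prediction["score"])
```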

Runtime versions

As new versions of Cloud ML Engine are released, it is possible that models developed against older versions will become obsolete. This is particularly pertinent if you arrive at an effective model version that remains unchanged for a long period. You should review the Cloud ML Engine versioning policy and make sure that you understand the Cloud ML Engine runtime version that you use to train your model versions.

Runtime versions and predictions

You can specify a supported Cloud ML Engine runtime version when you create a model version. Doing so establishes the model version's default setting. If you don't specify one explicitly, Cloud ML Engine creates your version using the current default runtime version (typically the most recent stable version).

You can specify a runtime version to use when you start a batch prediction job. This is to accommodate getting predictions using a model that is not deployed on Cloud ML Engine. You should never use a different runtime version than the default for a deployed model. Doing so is likely to cause unexpected errors.

You cannot request online predictions from models outside of Cloud ML Engine, so there is no option to override the default runtime version in your request.

The default runtime version set for a model version cannot be changed. To specify a different runtime version for a model version, deploy a new version using the same training artifacts that you used initially.

Regions and predictions

Google Cloud Platform uses zones and regions to define the geographic locations of physical computing resources. Cloud ML Engine uses regions to designate its processing. When you deploy a model for prediction, you specify the default region that you want prediction to run in.

When you start a batch prediction job, you can specify a region to run the job in, overriding the default region. Online predictions are always served from the region set when the model was created.

Prediction logging

Batch prediction generates job logs that you can view on Stackdriver Logging. You can also get logs for online prediction requests if you configure your model to generate them when you create it. Note that you must specify this option when you create your model resource in Cloud ML Engine; either all versions of a model generate logs for online predictions, or none of them do.

You can set online prediction logging for a model by setting onlinePredictionLogging to true (True in Python) in the Model resource you use when creating your model with projects.models.create. If you use the gcloud command-line tool to create your model, include the --enable-logging flag when you run gcloud ml-engine models create.
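
For the API route, the Model resource is a small dictionary. A minimal sketch follows, assuming the resource also takes a name field; build_model_body is a hypothetical helper.

```python
import json


def build_model_body(name, enable_online_logging=False):
    """Assemble a Model-style request body for model creation."""
    body = {"name": name}
    if enable_online_logging:
        # Applies to every version of the model and must be set at creation.
        body["onlinePredictionLogging"] = True
    return body


print(json.dumps(build_model_body("census", enable_online_logging=True)))
```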

Getting predictions from undeployed models

You can request batch prediction using a model that you haven't deployed to the Cloud ML Engine service. Instead of specifying a model or version name, you can use the URI of a Google Cloud Storage location where the model you want to run is stored.

Because an undeployed model doesn't have an established default runtime version, you should explicitly set it in your job request. If you don't, Cloud ML Engine will use the latest stable runtime version.

In all other ways, a batch prediction job using an undeployed model behaves as any other batch job.
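
A job request for an undeployed model could be assembled as in the sketch below. Everything specific here is an assumption to verify against the API reference: the field names (jobId, predictionInput, dataFormat, inputPaths, outputPath, region, uri, runtimeVersion, maxWorkerCount), the "TEXT" data format value, and the example paths, region, and runtime version. build_batch_job is a hypothetical helper.

```python
def build_batch_job(job_id, input_paths, output_path, region,
                    model_uri, runtime_version,
                    data_format="TEXT", max_workers=None):
    """Assemble a batch prediction job body that points at a stored model."""
    if not runtime_version:
        # An undeployed model has no default, so require an explicit choice.
        raise ValueError("set the runtime version explicitly")
    prediction_input = {
        "dataFormat": data_format,
        "inputPaths": list(input_paths),
        "outputPath": output_path,
        "region": region,
        "uri": model_uri,  # used instead of a model or version name
        "runtimeVersion": runtime_version,
    }
    if max_workers is not None:
        # Optional cap on the number of nodes the job may use.
        prediction_input["maxWorkerCount"] = max_workers
    return {"jobId": job_id, "predictionInput": prediction_input}


job = build_batch_job(
    "census_batch_1",
    ["gs://example-bucket/inputs/*"],
    "gs://example-bucket/outputs/",
    "us-central1",
    "gs://example-bucket/census/export/",
    "1.2",
)
print(sorted(job["predictionInput"]))
```

Requiring the runtime version in the helper follows the advice above to set it explicitly rather than rely on the latest stable version.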

Model testing

You can use the Cloud ML Engine prediction service to host your models that are in production, but you can also use it to test your models. Traditionally, model testing is the step before preparing to deploy a machine learning solution. The purpose of a test pass is to test your model in an environment that's as close as possible to the way that it will be used in real-world situations.

Remember that you can have multiple versions of a model concurrently deployed on the service. That means you can have multiple revisions of your model in testing at once if you need to. It also makes it easy to have a production version of the model deployed while testing the next revision. As with so much of developing machine learning applications, the availability of fresh data is often a limiting factor. You should develop strategies to split the data you have and collect new data to use for testing.
