Getting online predictions

AI Platform Prediction online prediction is a service optimized to run your data through hosted models with as little latency as possible. You send small batches of data to the service and it returns your predictions in the response.

Learn about online versus batch prediction or read an overview of prediction concepts.

Before you begin

In order to request predictions, you must first export your trained model and deploy it by creating a model resource and a model version, as described in the following sections.

Online prediction is available in certain regions. Moreover, different machine types are available in each region. To learn about the availability of online prediction in each region, read the guide to regions.

Creating models and versions

You make the following important decisions about how to run online prediction when creating the model and version resources:

Resource created   Decision specified at resource creation
Model              Region in which to run predictions
Model              Whether to enable online prediction logging
Version            Runtime version to use
Version            Python version to use
Version            Machine type to use for online prediction

You can't update the settings listed above after the initial creation of the model or version. If you need to change these settings, create a new model or version resource with the new settings and redeploy your model.

Machine types available for online prediction

When you create a version, you can choose what type of virtual machine AI Platform Prediction uses for online prediction nodes. Learn more about machine types.

Requesting logs for online prediction requests

By default, the AI Platform Prediction service does not log information about requests, because logging incurs cost. Online prediction at a high rate of queries per second (QPS) can produce a substantial number of logs, which are subject to Cloud Logging pricing or BigQuery pricing.
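As a rough sense of scale (a back-of-the-envelope sketch, assuming one access-log entry per prediction request), sustained traffic adds up quickly:

```python
# Back-of-the-envelope estimate of access-log volume, assuming one log
# entry per prediction request (sampling would reduce this).
qps = 100  # sustained queries per second
seconds_per_day = 60 * 60 * 24
entries_per_day = qps * seconds_per_day
print(entries_per_day)  # 8640000 log entries per day at 100 QPS
```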

If you want to enable online prediction logging, you must configure it when you create a model resource or when you create a model version resource, depending on which type of logging you want to enable. There are three types of logging, which you can enable independently:

  • Access logging, which logs information like timestamp and latency for each request to Cloud Logging.

    You can enable access logging when you create a model resource.

  • Console logging, which logs the stderr and stdout streams from your prediction nodes to Cloud Logging, and can be useful for debugging. This type of logging is in preview for Compute Engine (N1) machine types and is generally available for legacy (MLS1) machine types.

    You can enable console logging when you create a model resource.

  • Request-response logging, which logs a sample of online prediction requests and responses to a BigQuery table. This type of logging is in beta.

    You can enable request-response logging by creating a model version resource, then updating that version.


To enable access logging, include the --enable-logging flag when you create your model with the gcloud ai-platform models create command. For example:

gcloud ai-platform models create MODEL_NAME \
  --region=us-central1 \
  --enable-logging

To enable console logging (preview), use the gcloud beta component and include the --enable-console-logging flag. For example:

gcloud components install beta

gcloud beta ai-platform models create MODEL_NAME \
  --region=us-central1 \
  --enable-console-logging

You cannot currently enable request-response logging (beta) by using the gcloud CLI. You can only enable this type of logging when you send a projects.models.versions.patch request to the REST API.


To enable access logging, set the onlinePredictionLogging field to true in the Model resource when creating your model with projects.models.create.

To enable console logging (beta), set the onlinePredictionConsoleLogging field to true in the Model resource.

Request-response logging

Unlike the other types of logging, you can't enable request-response logging when you create a model. Instead, you can enable it by using the projects.models.versions.patch method on an existing model version. (You must first create the model version using the Google Cloud console, the gcloud CLI, or the REST API.)

To enable request-response logging, populate the requestLoggingConfig field of the Version resource with the following entries:

  • samplingPercentage: a number between 0 and 1 defining the fraction of requests to log. For example, set this value to 1 to log all requests, or to 0.1 to log 10% of requests.
  • bigqueryTableName: the fully qualified name (PROJECT_ID.DATASET_NAME.TABLE_NAME) of the BigQuery table where you want to log requests and responses. The table must already exist with the following schema:

    Field name       Type       Mode
    model            STRING     REQUIRED
    model_version    STRING     REQUIRED
    time             TIMESTAMP  REQUIRED
    raw_data         STRING     REQUIRED
    raw_prediction   STRING     NULLABLE
    groundtruth      STRING     NULLABLE

    Learn how to create a BigQuery table.
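Putting the two fields together, the body of the projects.models.versions.patch request might be assembled like this (a sketch: the table name is a placeholder, and the request must also carry an update mask naming the field being changed):

```python
import json

# Placeholder table name; the BigQuery table must already exist with the
# expected schema.
table = "PROJECT_ID.DATASET_NAME.TABLE_NAME"

# Body for a projects.models.versions.patch request that enables
# request-response logging on an existing version, sampling 10% of traffic.
body = {
    "requestLoggingConfig": {
        "samplingPercentage": 0.1,
        "bigqueryTableName": table,
    }
}

# The patch request should set updateMask=requestLoggingConfig so that only
# this field is modified on the version.
update_mask = "requestLoggingConfig"

print(json.dumps(body))
```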

Inspect models with the What-If Tool

You can use the What-If Tool (WIT) within notebook environments to inspect AI Platform Prediction models through an interactive dashboard. The What-If Tool integrates with TensorBoard, Jupyter notebooks, Colab notebooks, and JupyterHub. It is also preinstalled on Vertex AI Workbench user-managed notebooks TensorFlow instances.

Learn how to use the What-If Tool with AI Platform.

Formatting your input for online prediction

Formatting instances as JSON strings

The basic format for online prediction is a list of data instances. These can be either plain lists of values or members of a JSON object, depending on how you configured your inputs in your training application. TensorFlow models and custom prediction routines can accept more complex inputs, while most scikit-learn and XGBoost models expect a list of numbers as input.

This example shows an input tensor and an instance key to a TensorFlow model:

{"values": [1, 2, 3, 4], "key": 1}

The makeup of the JSON string can be complex as long as it follows these rules:

  • The top level of instance data must be a JSON object: a dictionary of key/value pairs.

  • Individual values in an instance object can be strings, numbers, or lists. You cannot embed JSON objects.

  • Lists must contain only items of the same type (including other lists). You may not mix string and numerical values.

You pass input instances for online prediction as the message body for the projects.predict call. Learn more about the request body's formatting requirements.
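As an illustration of these rules, the request body for two of the TensorFlow-style instances shown above can be assembled and serialized like this:

```python
import json

# Two instances for a model with an input tensor "values" and an instance
# key "key", following the rules above: each instance is a top-level JSON
# object whose values are strings, numbers, or homogeneous lists.
instances = [
    {"values": [1, 2, 3, 4], "key": 1},
    {"values": [5, 6, 7, 8], "key": 2},
]

# projects.predict expects the list wrapped in an "instances" field.
request_body = json.dumps({"instances": instances})
print(request_body)
```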


You can format your input in two different ways, depending on how you plan to send your prediction request. We recommend that you use the gcloud ai-platform predict command's --json-request flag. Alternatively, you can use the --json-instances flag with newline-delimited JSON data.

For --json-request

Make each instance an item in a JSON array, and provide the array as the instances field of a JSON file. For example:


{"instances": [
  {"values": [1, 2, 3, 4], "key": 1},
  {"values": [5, 6, 7, 8], "key": 2}
]}

For --json-instances

Ensure that your input file is a newline-delimited JSON file, with each instance as a JSON object, one instance per line. For example:


{"values": [1, 2, 3, 4], "key": 1}
{"values": [5, 6, 7, 8], "key": 2}


Make each instance an item in a JSON array, and provide the array as the instances field of a JSON object. For example:

{"instances": [
  {"values": [1, 2, 3, 4], "key": 1},
  {"values": [5, 6, 7, 8], "key": 2}
]}

Binary data in prediction input

Binary data cannot be formatted as the UTF-8 encoded strings that JSON supports. If you have binary data in your inputs, you must use base64 encoding to represent it. The following special formatting is required:

  • Your encoded string must be formatted as a JSON object with a single key named b64. The following Python 2.7 example encodes a buffer of raw JPEG data using the base64 library to make an instance:

    {"image_bytes": {"b64": base64.b64encode(jpeg_data)}}

    In Python 3, base64 encoding outputs a byte sequence. You must convert this to a string to make it JSON serializable:

    {'image_bytes': {'b64': base64.b64encode(jpeg_data).decode()}}
  • In your TensorFlow model code, you must name the aliases for your binary input and output tensors so that they end with '_bytes'.
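For example, a raw JPEG buffer can be wrapped for a model whose binary input alias is image_bytes (a Python 3 sketch; jpeg_data here is a placeholder for bytes read from a real file):

```python
import base64
import json

# Placeholder binary payload; in practice, read this from a JPEG file.
jpeg_data = b"\xff\xd8\xff\xe0 not a real image"

# The encoded string must sit in an object with the single key "b64".
# In Python 3, b64encode returns bytes, so decode() makes it JSON-serializable.
instance = {"image_bytes": {"b64": base64.b64encode(jpeg_data).decode()}}

print(json.dumps({"instances": [instance]}))
```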

Requesting predictions

Request an online prediction by sending your input data instances as a JSON string in a predict request. For formatting of the request and response body, see the details of the prediction request.

If you don't specify a model version, your prediction request uses the default version of the model.


  1. Create environment variables to hold the parameters, including a version value if you decide to specify a particular model version:

    MODEL_NAME="YOUR_MODEL_NAME"
    INPUT_DATA_FILE="instances.json"
    VERSION_NAME="YOUR_VERSION_NAME"

  2. Use gcloud ai-platform predict to send instances to a deployed model. Note that --version is optional.

    gcloud ai-platform predict \
      --model=$MODEL_NAME \
      --version=$VERSION_NAME \
      --json-request=$INPUT_DATA_FILE \
      --region=REGION

    Replace REGION with the region of the regional endpoint where you created your model. If you created your model on the global endpoint, omit the --region flag.

  3. The gcloud tool parses the response and prints the predictions to your terminal in a human-readable format. You can specify a different output format, such as JSON or CSV, by using the --format flag with your predict command. See available output formats.


You can use the Google API Client Library for Python to call the AI Platform Training and Prediction API without manually constructing HTTP requests. Before you run the following code sample, you must set up authentication.

import googleapiclient.discovery

# Create the AI Platform service object.
# To authenticate set the environment variable
# GOOGLE_APPLICATION_CREDENTIALS=<path_to_service_account_file>
service = googleapiclient.discovery.build("ml", "v1")


def predict_json(project, model, instances, version=None):
    """Send json data to a deployed model for prediction.

    Args:
        project (str): project where the AI Platform Model is deployed.
        model (str): model name.
        instances ([Mapping[str: Any]]): Keys should be the names of Tensors
            your deployed model expects as inputs. Values should be datatypes
            convertible to Tensors, or (potentially nested) lists of datatypes
            convertible to tensors.
        version (str): version of the model to target.

    Returns:
        Mapping[str: any]: dictionary of prediction results defined by the
            model.
    """
    name = f"projects/{project}/models/{model}"

    if version is not None:
        name += f"/versions/{version}"

    response = (
        service.projects().predict(name=name, body={"instances": instances}).execute()
    )

    if "error" in response:
        raise RuntimeError(response["error"])

    return response["predictions"]


You can use the Google API Client Library for Java to call the AI Platform Training and Prediction API without manually constructing HTTP requests. Before you run the following code sample, you must set up authentication.

/*
 * Copyright 2017 Google Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.http.FileContent;
import com.google.api.client.http.GenericUrl;
import com.google.api.client.http.HttpContent;
import com.google.api.client.http.HttpRequest;
import com.google.api.client.http.HttpRequestFactory;
import com.google.api.client.http.HttpTransport;
import com.google.api.client.http.UriTemplate;
import com.google.api.client.json.JsonFactory;
import com.google.api.client.json.gson.GsonFactory;
import com.google.api.services.discovery.Discovery;
import com.google.api.services.discovery.model.JsonSchema;
import com.google.api.services.discovery.model.RestDescription;
import com.google.api.services.discovery.model.RestMethod;
import com.google.auth.http.HttpCredentialsAdapter;
import com.google.auth.oauth2.GoogleCredentials;
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Sample code for sending an online prediction request to Cloud Machine Learning Engine. */

public class OnlinePredictionSample {
  public static void main(String[] args) throws Exception {
    HttpTransport httpTransport = GoogleNetHttpTransport.newTrustedTransport();
    JsonFactory jsonFactory = GsonFactory.getDefaultInstance();
    Discovery discovery = new Discovery.Builder(httpTransport, jsonFactory, null).build();

    RestDescription api = discovery.apis().getRest("ml", "v1").execute();
    RestMethod method = api.getResources().get("projects").getMethods().get("predict");

    JsonSchema param = new JsonSchema();
    String projectId = "YOUR_PROJECT_ID";
    // You should have already deployed a model and a version.
    // For reference, see
    String modelId = "YOUR_MODEL_ID";
    String versionId = "YOUR_VERSION_ID";
    param.set(
        "name", String.format("projects/%s/models/%s/versions/%s", projectId, modelId, versionId));

    GenericUrl url =
        new GenericUrl(UriTemplate.expand(api.getBaseUrl() + method.getPath(), param, true));

    String contentType = "application/json";
    File requestBodyFile = new File("input.txt");
    HttpContent content = new FileContent(contentType, requestBodyFile);

    List<String> scopes = new ArrayList<>();
    scopes.add("https://www.googleapis.com/auth/cloud-platform");

    GoogleCredentials credential = GoogleCredentials.getApplicationDefault().createScoped(scopes);
    HttpRequestFactory requestFactory =
        httpTransport.createRequestFactory(new HttpCredentialsAdapter(credential));
    HttpRequest request = requestFactory.buildRequest(method.getHttpMethod(), url, content);

    String response = request.execute().parseAsString();
    System.out.println(response);
  }
}

Troubleshooting online prediction

Common errors in online prediction include the following:

  • Out of memory errors
  • Input data is formatted incorrectly
  • A single online prediction request must contain no more than 1.5 MB of data. Requests created using the gcloud CLI can handle no more than 100 instances per file. To get predictions for more instances at the same time, use batch prediction.

If you encounter out-of-memory errors, try reducing your model size before deploying it to AI Platform Prediction.
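A quick client-side guard for the request-size limit might look like this (a sketch; the service enforces the real limits server-side):

```python
import json

MAX_REQUEST_BYTES = int(1.5 * 1024 * 1024)  # 1.5 MB online prediction limit


def check_payload(instances):
    """Raise ValueError if the serialized request would exceed 1.5 MB."""
    payload = json.dumps({"instances": instances})
    size = len(payload.encode("utf-8"))
    if size > MAX_REQUEST_BYTES:
        raise ValueError(
            f"Request is {size} bytes (limit {MAX_REQUEST_BYTES}); "
            "use batch prediction instead."
        )
    return payload


# A small request passes the check.
check_payload([{"values": [1, 2, 3, 4], "key": 1}])
```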

See more details on troubleshooting online prediction.

What's next