AI Platform Prediction online prediction is a service optimized to run your data through hosted models with as little latency as possible. You send small batches of data to the service and it returns your predictions in the response.
Learn about online versus batch prediction or read an overview of prediction concepts.
Before you begin
In order to request predictions, you must first:
Export your trained model as one or more model artifacts that can be deployed to AI Platform Prediction.
Deploy your trained model to AI Platform Prediction by creating a model resource and version.
Regions
Online prediction is available in certain regions. Moreover, different machine types are available in each region. To learn about the availability of online prediction in each region, read the guide to regions.
Creating models and versions
You make the following important decisions about how to run online prediction when creating the model and version resources:
Resource created | Decision specified at resource creation
---|---
Model | Region in which to run predictions
Model | Enable online prediction logging
Version | Runtime version to use
Version | Python version to use
Version | Machine type to use for online prediction
You can't update the settings listed above after the initial creation of the model or version. If you need to change these settings, create a new model or version resource with the new settings and redeploy your model.
Machine types available for online prediction
When you create a version, you can choose what type of virtual machine AI Platform Prediction uses for online prediction nodes. Learn more about machine types.
Requesting logs for online prediction requests
By default, the AI Platform Prediction service does not log information about prediction requests, because the logs incur cost. Online prediction at a high rate of queries per second (QPS) can produce a substantial number of logs, which are subject to Cloud Logging pricing or BigQuery pricing.
If you want to enable online prediction logging, you must configure it when you create a model resource or when you create a model version resource, depending on which type of logging you want to enable. There are three types of logging, which you can enable independently:
- Access logging, which logs information like the timestamp and latency of each request to Cloud Logging. You can enable access logging when you create a model resource.
- Console logging, which logs the stderr and stdout streams from your prediction nodes to Cloud Logging, and can be useful for debugging. This type of logging is in preview for Compute Engine (N1) machine types and is generally available for legacy (MLS1) machine types. You can enable console logging when you create a model resource.
- Request-response logging, which logs a sample of online prediction requests and responses to a BigQuery table. This type of logging is in beta. You can enable request-response logging by creating a model version resource and then updating that version.
gcloud
To enable access logging, include the --enable-logging flag when you create your model with the gcloud ai-platform models create command. For example:
gcloud ai-platform models create MODEL_NAME \
--region=us-central1 \
--enable-logging
To enable console logging (preview), use the gcloud beta component and include the --enable-console-logging flag. For example:
gcloud components install beta
gcloud beta ai-platform models create MODEL_NAME \
--region=us-central1 \
--enable-console-logging
You cannot currently enable request-response logging (beta) by using the gcloud CLI. You can only enable this type of logging by sending a projects.models.versions.patch request to the REST API.
REST API
To enable access logging, set onlinePredictionLogging to True in the Model resource when you create your model with projects.models.create.

To enable console logging (preview), set the onlinePredictionConsoleLogging field to True in the Model resource.
Request-response logging
Unlike the other types of logging, you can't enable request-response logging when you create a model. Instead, you enable it by using the projects.models.versions.patch method on an existing model version. (You must first create the model version by using the Google Cloud console, the gcloud CLI, or the REST API.)

To enable request-response logging, populate the requestLoggingConfig field of the Version resource with the following entries; a sketch of the corresponding patch call follows the schema:

- samplingPercentage: a number between 0 and 1 defining the fraction of requests to log. For example, set this value to 1 to log all requests, or to 0.1 to log 10% of requests.
- bigqueryTableName: the fully qualified name (PROJECT_ID.DATASET_NAME.TABLE_NAME) of the BigQuery table where you want to log requests and responses. The table must already exist with the following schema:

Field name | Type | Mode
---|---|---
model | STRING | REQUIRED
model_version | STRING | REQUIRED
time | TIMESTAMP | REQUIRED
raw_data | STRING | REQUIRED
raw_prediction | STRING | NULLABLE
groundtruth | STRING | NULLABLE
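For illustration, the following is a minimal sketch of such a patch call using the Google API Client Library for Python. The project, model, version, dataset, and table names are placeholders, and the sketch assumes that application default credentials are already configured:

# Minimal sketch (not an official sample): enable request-response logging on
# an existing model version by patching its requestLoggingConfig field.
import googleapiclient.discovery

service = googleapiclient.discovery.build('ml', 'v1')

# Placeholder resource name; replace with your own project, model, and version.
name = 'projects/PROJECT_ID/models/MODEL_NAME/versions/VERSION_NAME'

body = {
    'requestLoggingConfig': {
        'samplingPercentage': 0.1,  # log 10% of requests
        'bigqueryTableName': 'PROJECT_ID.DATASET_NAME.TABLE_NAME',
    }
}

response = service.projects().models().versions().patch(
    name=name,
    body=body,
    updateMask='requestLoggingConfig',
).execute()
print(response)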
Inspect models with the What-If Tool
You can use the What-If Tool (WIT) within notebook environments to inspect AI Platform Prediction models through an interactive dashboard. The What-If Tool integrates with TensorBoard, Jupyter notebooks, Colab notebooks, and JupyterHub. It is also preinstalled on Vertex AI Workbench user-managed notebooks TensorFlow instances.
Learn how to use the What-If Tool with AI Platform.
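As a rough sketch, assuming the witwidget package is installed in your notebook environment and examples holds a list of tf.train.Example protos (the project, model, and version names below are placeholders), pointing the What-If Tool at a deployed AI Platform Prediction model might look like this:

# Minimal sketch: configure the What-If Tool against a deployed model from a
# Jupyter notebook. Assumes `examples` is a list of tf.train.Example protos.
from witwidget.notebook.visualization import WitConfigBuilder, WitWidget

config_builder = WitConfigBuilder(examples).set_ai_platform_model(
    'PROJECT_ID', 'MODEL_NAME', 'VERSION_NAME')  # placeholder identifiers
WitWidget(config_builder, height=800)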
Formatting your input for online prediction
Formatting instances as JSON strings
The basic format for online prediction is a list of data instances. These can be either plain lists of values or members of a JSON object, depending on how you configured your inputs in your training application. TensorFlow models and custom prediction routines can accept more complex inputs, while most scikit-learn and XGBoost models expect a list of numbers as input.
This example shows an input tensor and an instance key to a TensorFlow model:
{"values": [1, 2, 3, 4], "key": 1}
The makeup of the JSON string can be complex as long as it follows these rules:
- The top level of instance data must be a JSON object: a dictionary of key/value pairs.
- Individual values in an instance object can be strings, numbers, or lists. You cannot embed JSON objects.
- Lists must contain only items of the same type (including other lists). You may not mix string and numerical values.
You pass input instances for online prediction as the message body for the projects.predict call. Learn more about the request body's formatting requirements.
gcloud
You can format your input in two different ways, depending on how you plan to send your prediction request. We recommend that you use the gcloud ai-platform predict command's --json-request flag. Alternatively, you can use the --json-instances flag with newline-delimited JSON data.
For --json-request
Make each instance an item in a JSON array, and provide the array as the instances field of a JSON file. For example:
instances.json
{"instances": [
{"values": [1, 2, 3, 4], "key": 1},
{"values": [5, 6, 7, 8], "key": 2}
]}
For --json-instances
Ensure that your input file is a newline-delimited JSON file, with each instance as a JSON object, one instance per line. For example:
instances.jsonl
{"values": [1, 2, 3, 4], "key": 1}
{"values": [5, 6, 7, 8], "key": 2}
REST API
Make each instance an item in a JSON array, and provide the array as the instances field of a JSON object. For example:
{"instances": [
{"values": [1, 2, 3, 4], "key": 1},
{"values": [5, 6, 7, 8], "key": 2}
]}
Binary data in prediction input
Binary data cannot be formatted as the UTF-8 encoded strings that JSON supports. If you have binary data in your inputs, you must use base64 encoding to represent it. The following special formatting is required:
- Your encoded string must be formatted as a JSON object with a single key named b64. The following Python 2.7 example encodes a buffer of raw JPEG data using the base64 library to make an instance (a complete Python 3 sketch follows this list):
{"image_bytes": {"b64": base64.b64encode(jpeg_data)}}
In Python 3, base64 encoding outputs a byte sequence. You must convert this to a string to make it JSON serializable:
{'image_bytes': {'b64': base64.b64encode(jpeg_data).decode()}}
- In your TensorFlow model code, you must name the aliases for your binary input and output tensors so that they end with '_bytes'.
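To make this concrete, here is a small Python 3 sketch that builds one such instance. The file name and the image_bytes alias are illustrative placeholders; the alias must match an input tensor alias in your model's serving signature:

# Minimal Python 3 sketch: build a prediction instance that carries binary
# image data as a base64-encoded string under the special "b64" key.
import base64
import json

# Placeholder input file; use whatever binary data your model expects.
with open('image.jpg', 'rb') as f:
    jpeg_data = f.read()

instance = {
    # "image_bytes" is an example alias; it must end with "_bytes".
    'image_bytes': {'b64': base64.b64encode(jpeg_data).decode()}
}

print(json.dumps({'instances': [instance]}))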
Requesting predictions
Request an online prediction by sending your input data instances as a JSON string in a predict request. For formatting of the request and response body, see the details of the prediction request.
If you don't specify a model version, your prediction request uses the default version of the model.
gcloud
Create environment variables to hold the parameters, including a version value if you decide to specify a particular model version:
MODEL_NAME="[YOUR-MODEL-NAME]" INPUT_DATA_FILE="instances.json" VERSION_NAME="[YOUR-VERSION-NAME]"
Use gcloud ai-platform predict to send instances to a deployed model. Note that --version is optional.

gcloud ai-platform predict \
--model=$MODEL_NAME \
--version=$VERSION_NAME \
--json-request=$INPUT_DATA_FILE \
--region=REGION

Replace REGION with the region of the regional endpoint where you created your model. If you created your model on the global endpoint, omit the --region flag.

The gcloud tool parses the response and prints the predictions to your terminal in a human-readable format. You can specify a different output format, such as JSON or CSV, by using the --format flag with your predict command. See the available output formats.
Python
You can use the Google API Client Library for Python to call the AI Platform Training and Prediction API without manually constructing HTTP requests. Before you run the following code sample, you must set up authentication.
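The following is a minimal sketch of such a call. The function name predict_json and the placeholder project, model, and version values are illustrative, and the sketch assumes that application default credentials are already set up:

# Minimal sketch: send an online prediction request with the Google API
# Client Library for Python.
import googleapiclient.discovery

def predict_json(project, model, instances, version=None):
    """Sends a JSON prediction request and returns the predictions."""
    service = googleapiclient.discovery.build('ml', 'v1')
    name = 'projects/{}/models/{}'.format(project, model)
    if version is not None:
        name += '/versions/{}'.format(version)

    response = service.projects().predict(
        name=name,
        body={'instances': instances}
    ).execute()

    if 'error' in response:
        raise RuntimeError(response['error'])
    return response['predictions']

# Example usage with placeholder values:
# predictions = predict_json('PROJECT_ID', 'MODEL_NAME',
#                            [{'values': [1, 2, 3, 4], 'key': 1}])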
Java
You can use the Google API Client Library for Java to call the AI Platform Training and Prediction API without manually constructing HTTP requests. Before you run your code, you must set up authentication.
Troubleshooting online prediction
Common errors in online prediction include the following:
- Out of memory errors
- Input data is formatted incorrectly
- A single online prediction request must contain no more than 1.5 MB of data. Requests created using the gcloud CLI can handle no more than 100 instances per file. To get predictions for more instances at the same time, use batch prediction.
Try reducing your model size before deploying it to AI Platform Prediction.
See more details on troubleshooting online prediction.
What's next
- Use batch prediction to get inferences asynchronously.
- Get more details about the prediction process.
- Troubleshoot problems that arise when you request online predictions.