This page shows you how to get online (real-time) predictions from your custom trained models using the Google Cloud console or the Vertex AI API.
Format your input for online prediction
This section shows how to format and encode your prediction input instances as JSON, which is required if you are using the predict or explain method. This isn't required if you are using the rawPredict method. For information on which method to choose, see Send a request to an endpoint.
If you're using the Vertex AI SDK for Python to send prediction requests, specify the list of instances without the instances field. For example, specify [ ["the","quick","brown"], ... ] instead of { "instances": [ ["the","quick","brown"], ... ] }.
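As a minimal sketch, a call with the Vertex AI SDK for Python might look like the following (the endpoint ID is a placeholder, and the instance format depends on your model):

from google.cloud import aiplatform

# Placeholder endpoint ID; the SDK resolves project and location from
# aiplatform.init() or from a full resource name.
endpoint = aiplatform.Endpoint("ENDPOINT_ID")

# Pass the bare list of instances; the SDK wraps it in the "instances"
# field for you.
response = endpoint.predict(instances=[["the", "quick", "brown"]])
print(response.predictions)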
If your model uses a custom container, your input must be formatted as JSON, and there is an additional parameters field that you can use for your container. Learn more about formatting prediction input with custom containers.
Format instances as JSON strings
The basic format for online prediction is a list of data instances. These can be either plain lists of values or members of a JSON object, depending on how you configured your inputs in your training application. TensorFlow models can accept more complex inputs, while most scikit-learn and XGBoost models expect a list of numbers as input.
This example shows an input tensor and an instance key to a TensorFlow model:
{"values": [1, 2, 3, 4], "key": 1}
The makeup of the JSON string can be complex as long as it follows these rules:
The top level of instance data must be a JSON object: a dictionary of key-value pairs.
Individual values in an instance object can be strings, numbers, or lists. You can't embed JSON objects.
Lists must contain only items of the same type (including other lists). You may not mix string and numerical values.
You pass input instances for online prediction as the message body for the projects.locations.endpoints.predict call. Learn more about the request body's formatting requirements.
Make each instance an item in a JSON array, and provide the array as the
instances
field of a JSON object. For example:
{"instances": [
{"values": [1, 2, 3, 4], "key": 1},
{"values": [5, 6, 7, 8], "key": 2}
]}
Encode binary data for prediction input
Binary data can't be formatted as the UTF-8 encoded strings that JSON supports. If you have binary data in your inputs, you must use base64 encoding to represent it. The following special formatting is required:
- Your encoded string must be formatted as a JSON object with a single key named b64. In Python 3, base64 encoding outputs a byte sequence. You must convert this to a string to make it JSON serializable:
{'image_bytes': {'b64': base64.b64encode(jpeg_data).decode()}}
- In your TensorFlow model code, you must name the aliases for your binary input and output tensors so that they end with '_bytes'.
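For example, a minimal Python sketch of this encoding, assuming a local JPEG file and an input tensor alias of image_bytes (both placeholders):

import base64
import json

# Hypothetical local file; any binary payload works the same way.
with open("image.jpeg", "rb") as f:
    jpeg_data = f.read()

# base64.b64encode returns bytes, so decode to a string to make the
# value JSON serializable.
instance = {"image_bytes": {"b64": base64.b64encode(jpeg_data).decode("utf-8")}}
print(json.dumps({"instances": [instance]}))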
Request and response examples
This section describes the format of the prediction request body and of the response body, with examples for TensorFlow, scikit-learn, and XGBoost.
Request body details
TensorFlow
The request body contains data with the following structure (JSON representation):
{
"instances": [
<value>|<simple/nested list>|<object>,
...
]
}
The instances[] object is required, and must contain the list of instances to get predictions for.
The structure of each element of the instances list is determined by your model's input definition. Instances can include named inputs (as objects) or can contain only unlabeled values.
Not all data includes named inputs. Some instances are simple JSON values (boolean, number, or string). However, instances are often lists of simple values, or complex nested lists.
Below are some examples of request bodies.
CSV data with each row encoded as a string value:
{"instances": ["1.0,true,\\"x\\"", "-2.0,false,\\"y\\""]}
Plain text:
{"instances": ["the quick brown fox", "the lazy dog"]}
Sentences encoded as lists of words (vectors of strings):
{ "instances": [ ["the","quick","brown"], ["the","lazy","dog"], ... ] }
Floating point scalar values:
{"instances": [0.0, 1.1, 2.2]}
Vectors of integers:
{ "instances": [ [0, 1, 2], [3, 4, 5], ... ] }
Tensors (in this case, two-dimensional tensors):
{ "instances": [ [ [0, 1, 2], [3, 4, 5] ], ... ] }
Images, which can be represented in different ways. In this encoding scheme, the first two dimensions represent the rows and columns of the image, and the third dimension contains lists (vectors) of the R, G, and B values for each pixel:
{ "instances": [ [ [ [138, 30, 66], [130, 20, 56], ... ], [ [126, 38, 61], [122, 24, 57], ... ], ... ], ... ] }
Data encoding
JSON strings must be encoded as UTF-8. To send binary data, you must base64-encode the data and mark it as binary. To mark a JSON string as binary, replace it with a JSON object with a single attribute named b64:
{"b64": "..."}
The following example shows two serialized tf.Example instances, requiring base64 encoding (fake data, for illustrative purposes only):
{"instances": [{"b64": "X5ad6u"}, {"b64": "IA9j4nx"}]}
The following example shows two JPEG image byte strings, requiring base64 encoding (fake data, for illustrative purposes only):
{"instances": [{"b64": "ASa8asdf"}, {"b64": "JLK7ljk3"}]}
Multiple input tensors
Some models have an underlying TensorFlow graph that accepts multiple input tensors. In this case, use the names of JSON name/value pairs to identify the input tensors.
For a graph with input tensor aliases "tag" (string) and "image" (base64-encoded string):
{ "instances": [ { "tag": "beach", "image": {"b64": "ASa8asdf"} }, { "tag": "car", "image": {"b64": "JLK7ljk3"} } ] }
For a graph with input tensor aliases "tag" (string) and "image" (3-dimensional array of 8-bit ints):
{ "instances": [ { "tag": "beach", "image": [ [ [138, 30, 66], [130, 20, 56], ... ], [ [126, 38, 61], [122, 24, 57], ... ], ... ] }, { "tag": "car", "image": [ [ [255, 0, 102], [255, 0, 97], ... ], [ [254, 1, 101], [254, 2, 93], ... ], ... ] }, ... ] }
scikit-learn
The request body contains data with the following structure (JSON representation):
{
"instances": [
<simple list>,
...
]
}
The instances[] object is required, and must contain the list of instances to get predictions for. In the following example, each input instance is a list of floats:
{
"instances": [
[0.0, 1.1, 2.2],
[3.3, 4.4, 5.5],
...
]
}
The dimension of input instances must match what your model expects. For example, if your model requires three features, then the length of each input instance must be 3.
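For example, a quick sanity check in Python before sending the request (the feature count of 3 is an assumption for illustration):

import json

n_features = 3  # what the deployed model expects (assumed here)
instances = [
    [0.0, 1.1, 2.2],
    [3.3, 4.4, 5.5],
]
# Every instance must supply exactly n_features values.
assert all(len(row) == n_features for row in instances)
print(json.dumps({"instances": instances}))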
XGBoost
The request body contains data with the following structure (JSON representation):
{
"instances": [
<simple list>,
...
]
}
The instances[] object is required, and must contain the list of instances to get predictions for. In the following example, each input instance is a list of floats:
{
"instances": [
[0.0, 1.1, 2.2],
[3.3, 4.4, 5.5],
...
]
}
The dimension of input instances must match what your model expects. For example, if your model requires three features, then the length of each input instance must be 3.
Vertex AI doesn't support sparse representation of input instances for XGBoost.
The online prediction service interprets zeros and NaNs differently. If the value of a feature is zero, use 0.0 in the corresponding input. If the value of a feature is missing, use "NaN" in the corresponding input.
The following example represents a prediction request with a single input instance, where the value of the first feature is 0.0, the value of the second feature is 1.1, and the value of the third feature is missing:
{"instances": [[0.0, 1.1, "NaN"]]}
PyTorch
If your model uses a PyTorch prebuilt container, TorchServe's default handlers expect each instance to be wrapped in a data field. For example:
{
  "instances": [
    { "data": <value> },
    { "data": <value> }
  ]
}
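For example, a short sketch of wrapping raw inputs this way (the input values are placeholders):

# Hypothetical raw inputs; TorchServe's default handlers expect each
# instance wrapped in a "data" field.
raw_inputs = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
request = {"instances": [{"data": value} for value in raw_inputs]}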
Response body details
If the call is successful, the response body contains one prediction entry per instance in the request body, given in the same order:
{
"predictions": [
{
object
}
],
"deployedModelId": string
}
If prediction fails for any instance, the response body contains no predictions. Instead, it contains a single error entry:
{
"error": string
}
The predictions[] object contains the list of predictions, one for each instance in the request.
On error, the error string contains a message describing the problem. The error is returned instead of a prediction list if an error occurred while processing any instance.
Even though there is one prediction per instance, the format of a prediction isn't directly related to the format of an instance. Predictions take whatever format is specified in the outputs collection defined in the model. The collection of predictions is returned in a JSON list. Each member of the list can be a simple value, a list, or a JSON object of any complexity. If your model has more than one output tensor, each prediction will be a JSON object containing a name-value pair for each output. The names identify the output aliases in the graph.
Response body examples
TensorFlow
The following examples show some possible responses:
- A simple set of predictions for three input instances, where each prediction is an integer value:
{"predictions": [5, 4, 3], "deployedModelId": "123456789012345678"}
- A more complex set of predictions, each containing two named values that correspond to output tensors, named label and scores respectively. The value of label is the predicted category ("car" or "beach") and scores contains a list of probabilities for that instance across the possible categories.
{
  "predictions": [
    { "label": "beach", "scores": [0.1, 0.9] },
    { "label": "car", "scores": [0.75, 0.25] }
  ],
  "deployedModelId": "123456789012345678"
}
- A response when there is an error processing an input instance:
{"error": "Divide by zero"}
scikit-learn
The following examples show some possible responses:
- A simple set of predictions for three input instances, where each prediction is an integer value:
{"predictions": [5, 4, 3], "deployedModelId": "123456789012345678"}
- A response when there is an error processing an input instance:
{"error": "Divide by zero"}
XGBoost
The following examples show some possible responses:
- A simple set of predictions for three input instances, where each prediction is an integer value:
{"predictions": [5, 4, 3], "deployedModelId": "123456789012345678"}
- A response when there is an error processing an input instance:
{"error": "Divide by zero"}
Send a request to an endpoint
There are three ways to send a request:
- Prediction request: sends a request to predict to get an online prediction.
- Raw prediction request: sends a request to rawPredict, which lets you use an arbitrary HTTP payload rather than following the guidelines described in the Format your input sections of this page. You might want to get raw predictions if:
  - You are using a custom container that receives requests and sends responses that differ from the guidelines.
  - You require lower latency. rawPredict skips the serialization steps and forwards the request directly to the prediction container.
  - You are serving predictions with NVIDIA Triton.
- Explanation request: sends a request to explain. If you have configured your Model for Vertex Explainable AI, then you can get online explanations. Online explanation requests have the same format as online prediction requests, and they return similar responses; the only difference is that online explanation responses include feature attributions as well as predictions.
Send an online prediction request
gcloud
The following example uses the gcloud ai endpoints predict command:
Write the following JSON object to a file in your local environment. The filename doesn't matter, but for this example, name the file request.json:
{ "instances": INSTANCES }
Replace the following:
- INSTANCES: A JSON array of instances that you want to get predictions for. The format of each instance depends on which inputs your trained ML model expects. For more information, see Formatting your input for online prediction.
Run the following command:
gcloud ai endpoints predict ENDPOINT_ID \
  --region=LOCATION_ID \
  --json-request=request.json
Replace the following:
- ENDPOINT_ID: The ID for the endpoint.
- LOCATION_ID: The region where you are using Vertex AI.
REST
Before using any of the request data, make the following replacements:
- LOCATION_ID: The region where you are using Vertex AI.
- PROJECT_ID: Your project ID
- ENDPOINT_ID: The ID for the endpoint.
- INSTANCES: A JSON array of instances that you want to get predictions for. The format of each instance depends on which inputs your trained ML model expects. For more information, see Formatting your input for online prediction.
HTTP method and URL:
POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:predict
Request JSON body:
{ "instances": INSTANCES }
To send your request, choose one of these options:
curl
Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:predict"
PowerShell
Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:predict" | Select-Object -Expand Content
In the response:
- PREDICTIONS: A JSON array of predictions, one for each instance that you included in the request body.
- DEPLOYED_MODEL_ID: The ID of the DeployedModel that served the predictions.
You should receive a JSON response similar to the following:
{ "predictions": PREDICTIONS, "deployedModelId": "DEPLOYED_MODEL_ID" }
Java
Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Node.js
Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
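As a minimal sketch, a prediction request with the Vertex AI SDK for Python might look like the following (project, location, endpoint ID, and the instance values are placeholders; the instance format depends on your model):

from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="LOCATION_ID")
endpoint = aiplatform.Endpoint("ENDPOINT_ID")

# Instance format depends on the inputs your trained model expects.
response = endpoint.predict(instances=[[0.0, 1.1, 2.2]])
print(response.predictions)
print(response.deployed_model_id)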
Send an online prediction request to a dedicated endpoint
Dedicated endpoints use a new URL path. You can retrieve this path from the dedicatedEndpointDns field in the REST API, or from Endpoint.dedicated_endpoint_dns in the Vertex AI SDK for Python. You can also construct the endpoint path manually using the following code:
f"https://ENDPOINT_ID.LOCATION_ID-PROJECT_NUMBER.prediction.vertexai.goog/v1/projects/PROJECT_NUMBER/locations/LOCATION_ID/endpoints/ENDPOINT_ID:predict"
Replace the following:
- ENDPOINT_ID: The ID for the endpoint.
- LOCATION_ID: The region where you are using Vertex AI.
- PROJECT_NUMBER: The project number. This is different from the project ID. You can find the project number on the project's Project Settings page in the Google Cloud console.
To send a prediction to a dedicated endpoint using the Vertex AI SDK for Python, set the use_dedicated_endpoint parameter to True:
endpoint.predict(instances=instances, use_dedicated_endpoint=True)
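Alternatively, if you call the dedicated endpoint over raw HTTP, one possible sketch using the URL format above (all identifiers and instance values are placeholders) is:

import google.auth
import google.auth.transport.requests
import requests

# Dedicated-endpoint URL built from the pattern shown above.
url = (
    "https://ENDPOINT_ID.LOCATION_ID-PROJECT_NUMBER.prediction.vertexai.goog"
    "/v1/projects/PROJECT_NUMBER/locations/LOCATION_ID/endpoints/ENDPOINT_ID:predict"
)

# Obtain an access token from Application Default Credentials.
credentials, _ = google.auth.default()
credentials.refresh(google.auth.transport.requests.Request())

response = requests.post(
    url,
    headers={"Authorization": f"Bearer {credentials.token}"},
    json={"instances": [[0.0, 1.1, 2.2]]},
)
print(response.json())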
Send an online raw prediction request
gcloud
The following examples use the gcloud ai endpoints raw-predict command:
- To request predictions with the JSON object in REQUEST specified on the command line:
gcloud ai endpoints raw-predict ENDPOINT_ID \
  --region=LOCATION_ID \
  --request=REQUEST
- To request predictions with an image stored in the file image.jpeg and the appropriate Content-Type header:
gcloud ai endpoints raw-predict ENDPOINT_ID \
  --region=LOCATION_ID \
  --http-headers=Content-Type=image/jpeg \
  --request=@image.jpeg
Replace the following:
- ENDPOINT_ID: The ID for the endpoint.
- LOCATION_ID: The region where you are using Vertex AI.
- REQUEST: The contents of the request that you want to get predictions for. The format of the request depends on what your custom container expects, which may not necessarily be a JSON object.
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
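As a minimal sketch, a raw prediction with the Vertex AI SDK for Python might look like the following (identifiers and payload are placeholders; the body can be any payload your container accepts):

import json
from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="LOCATION_ID")
endpoint = aiplatform.Endpoint("ENDPOINT_ID")

# raw_predict forwards the body to the container unchanged; JSON is
# used here only as an example payload.
response = endpoint.raw_predict(
    body=json.dumps({"instances": [[0.0, 1.1, 2.2]]}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(response.status_code)
print(response.text)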
The response includes the following HTTP headers:
- X-Vertex-AI-Endpoint-Id: ID of the Endpoint that served this prediction.
- X-Vertex-AI-Deployed-Model-Id: ID of the Endpoint's DeployedModel that served this prediction.
Send an online explanation request
gcloud
The following example uses the gcloud ai endpoints explain command:
Write the following JSON object to a file in your local environment. The filename doesn't matter, but for this example, name the file request.json:
{ "instances": INSTANCES }
Replace the following:
- INSTANCES: A JSON array of instances that you want to get predictions for. The format of each instance depends on which inputs your trained ML model expects. For more information, see Formatting your input for online prediction.
Run the following command:
gcloud ai endpoints explain ENDPOINT_ID \
  --region=LOCATION_ID \
  --json-request=request.json
Replace the following:
- ENDPOINT_ID: The ID for the endpoint.
- LOCATION_ID: The region where you are using Vertex AI.
Optionally, if you want to send an explanation request to a specific DeployedModel on the Endpoint, you can specify the --deployed-model-id flag:
gcloud ai endpoints explain ENDPOINT_ID \
  --region=LOCATION_ID \
  --deployed-model-id=DEPLOYED_MODEL_ID \
  --json-request=request.json
In addition to the placeholders described previously, replace the following:
- DEPLOYED_MODEL_ID: Optional. The ID of the deployed model for which you want to get explanations. The ID is included in the predict method's response. If you need to request explanations for a particular model and you have more than one model deployed to the same endpoint, you can use this ID to ensure that the explanations are returned for that particular model.
REST
Before using any of the request data, make the following replacements:
- LOCATION_ID: The region where you are using Vertex AI.
- PROJECT_ID: Your project ID
- ENDPOINT_ID: The ID for the endpoint.
- INSTANCES: A JSON array of instances that you want to get predictions for. The format of each instance depends on which inputs your trained ML model expects. For more information, see Formatting your input for online prediction.
HTTP method and URL:
POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:explain
Request JSON body:
{ "instances": INSTANCES }
To send your request, choose one of these options:
curl
Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:explain"
PowerShell
Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:explain" | Select-Object -Expand Content
In the response:
- PREDICTIONS: A JSON array of predictions, one for each instance that you included in the request body.
- EXPLANATIONS: A JSON array of explanations, one for each prediction.
- DEPLOYED_MODEL_ID: The ID of the DeployedModel that served the predictions.
You should receive a JSON response similar to the following:
{ "predictions": PREDICTIONS, "explanations": EXPLANATIONS, "deployedModelId": "DEPLOYED_MODEL_ID" }
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
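As a minimal sketch, an explanation request with the Vertex AI SDK for Python might look like the following (identifiers and instance values are placeholders; the deployed Model must be configured for Vertex Explainable AI):

from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="LOCATION_ID")
endpoint = aiplatform.Endpoint("ENDPOINT_ID")

# Returns predictions plus feature attributions for each instance.
response = endpoint.explain(instances=[[0.0, 1.1, 2.2]])
print(response.predictions)
print(response.explanations)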
What's next
- Learn about Online prediction logging.