Get an online prediction

Use online predictions when you are making requests in response to application input or in situations that require timely inference. This page shows you how to get online predictions from your custom trained models on Google Distributed Cloud Hosted (GDCH) using the Vertex AI Prediction API.

Before you begin

Before sending a request to make online predictions, perform the following steps:

  1. Deploy the model resource to an endpoint. This action associates compute resources with the model so that it can serve online predictions with low latency.
  2. Check the status of the DeployedModel custom resource (CR) of your model and ensure it is ready to accept prediction requests:

    kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG get -f DEPLOYED_MODEL_NAME.yaml -o jsonpath='{.status.primaryCondition}'
    

    Replace the following:

    • PREDICTION_CLUSTER_KUBECONFIG: the path to the kubeconfig file in the Prediction user cluster.
    • DEPLOYED_MODEL_NAME: the name of the DeployedModel definition file.

    The primary condition must show that the DeployedModel is ready.

    The following output shows a sample response:

    {"firstObservedTime":"2024-01-19T01:18:30Z","lastUpdateTime":"2024-01-19T01:35:47Z","message":"DeployedModel is ready", "observedGeneration":1, "reason":"Ready", "resourceName":"my-tf-model","type":"DeployedModel"}
    
  3. Check the status of the Endpoint CR and ensure it is ready to accept prediction requests:

    kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG get -f ENDPOINT_NAME.yaml -o jsonpath='{$.status.conditions[?(@.type == "EndpointReady")]}'
    

    Replace the following:

    • PREDICTION_CLUSTER_KUBECONFIG: the path to the kubeconfig file in the Prediction user cluster.
    • ENDPOINT_NAME: the name of the Endpoint definition file.

    The Status field of the EndpointReady condition must be in a True status.

    The following output shows a sample response:

    {"lastTransitionTime":"2024-01-19T05:12:26Z","message":"Endpoint Ready", "observedGeneration":1,"reason":"ResourceReady","status":"True","type":"EndpointReady"}%
    

Send a request to an endpoint

To make an online prediction request, perform the following steps:

  1. Create a JSON file with the request body details in the format required by the target container. For more information about the JSON representation with examples for TensorFlow, see Request body details.

  2. Show details of the Endpoint CR of your prediction model:

    kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG get endpoint PREDICTION_ENDPOINT -n PROJECT_NAMESPACE -o jsonpath='{.status.endpointFQDN}'
    

    Replace the following:

    • PREDICTION_CLUSTER_KUBECONFIG: the path to the kubeconfig file in the Prediction user cluster.
    • PREDICTION_ENDPOINT: the name of the endpoint.
    • PROJECT_NAMESPACE: the name of the prediction project namespace.
  3. The output must show the status field, which displays the endpoint fully-qualified domain name on the endpointFQDN field. Register this endpoint URL path to use it for your requests.

  4. Send an online prediction request to the endpoint URL path using HTTP or gRPC calls. For HTTP and gRPC request examples, try online predictions.

    Alternatively, you can use a Vertex AI Workbench JupyterLab notebook to build the request, execute it, and handle the response.

If successful, you receive a JSON response to your online prediction request. For more information about responses, see Response body details.

Request and response examples

This section describes the format of the prediction request body and of the response body, with examples for TensorFlow.

Request body details

The request body contains data with the following structure (JSON representation):

{
  "instances": [
    <value>|<simple/nested list>|<object>,
    ...
  ]
}

The instances[] object is required, and must contain the list of instances to get predictions for.

The structure of each element of the instances list is determined by your model's input definition. Instances can include named inputs (as objects) or can contain only unlabeled values.

Not all data includes named inputs. Some instances are simple JSON values (boolean, number, or string). However, instances are often lists of simple values, or complex nested lists.

The following are some examples of request bodies:

  • CSV data with each row encoded as a string value:

    {"instances": ["1.0,true,\\"x\\"", "-2.0,false,\\"y\\""]}
    
  • Plain text:

    {"instances": ["the quick brown fox", "the lazy dog"]}
    
  • Sentences encoded as lists of words (vectors of strings):

    {
    "instances": [
    ["the","quick","brown"],
    ["the","lazy","dog"],
    ...
    ]
    }
    
  • Floating point scalar values:

    {"instances": [0.0, 1.1, 2.2]}
    
  • Vectors of integers:

    {
    "instances": [
    [0, 1, 2],
    [3, 4, 5],
    ...
    ]
    }
    
  • Tensors (in this case, two-dimensional tensors):

    {
    "instances": [
    [
    [0, 1, 2],
    [3, 4, 5]
    ],
    ...
    ]
    }
    
  • Images, which can be represented in different ways. In this encoding scheme the first two dimensions represent the rows and columns of the image, and the third dimension contains lists (vectors) of the R, G, and B values for each pixel:

    {
    "instances": [
    [
    [
    [138, 30, 66],
    [130, 20, 56],
    ...
    ],
    [
    [126, 38, 61],
    [122, 24, 57],
    ...
    ],
    ...
    ],
    ...
    ]
    }
    

Data encoding

JSON strings must be encoded as UTF-8. To send binary data, you must base64-encode the data and mark it as binary. To mark a JSON string as binary, replace it with a JSON object with a single attribute named b64:

{"b64": "..."}

The following example shows two serialized tf.Examples instances, requiring base64 encoding (fake data, for illustrative purposes only):

{"instances": [{"b64": "X5ad6u"}, {"b64": "IA9j4nx"}]}

The following example shows two image byte strings, requiring base64 encoding (fake data, for illustrative purposes only):

{"instances": [{"b64": "ASa8asdf"}, {"b64": "JLK7ljk3"}]}

Multiple input tensors

Some models have an underlying TensorFlow graph that accepts multiple input tensors. In this case, use the names of JSON name-value pairs to identify the input tensors.

For a graph with input tensor aliases "tag" (string) and "image" (base64-encoded string):

{
"instances": [
{
"tag": "beach",
"image": {"b64": "ASa8asdf"}
},
{
"tag": "car",
"image": {"b64": "JLK7ljk3"}
}
]
}

For a graph with input tensor aliases "tag" (string) and "image" (3-dimensional array of 8-bit ints):

{
"instances": [
{
"tag": "beach",
"image": [
[
[138, 30, 66],
[130, 20, 56],
...
],
[
[126, 38, 61],
[122, 24, 57],
...
],
...
]
},
{
"tag": "car",
"image": [
[
[255, 0, 102],
[255, 0, 97],
...
],
[
[254, 1, 101],
[254, 2, 93],
...
],
...
]
},
...
]
}

Response body details

Responses are very similar to requests.

If the call is successful, the response body contains one prediction entry per instance in the request body, given in the same order:

{
  "predictions": [
    {
      object
    }
  ],
  "deployedModelId": string
}

If prediction fails for any instance, the response body contains no predictions. Instead, it contains a single error entry:

{
  "error": string
}

The predictions[] object contains the list of predictions, one for each instance in the request.

On error, the error string contains a message describing the problem. The error is returned instead of a prediction list if an error occurred while processing any instance.

Even though there is one prediction per instance, the format of a prediction is not directly related to the format of an instance. Predictions take whatever format is specified in the outputs collection defined in the model. The collection of predictions is returned in a JSON list. Each member of the list can be a simple value, a list, or a JSON object of any complexity. If your model has more than one output tensor, each prediction will be a JSON object containing a name-value pair for each output. The names identify the output aliases in the graph.

Response body examples

The following examples show some possible responses:

  • A simple set of predictions for three input instances, where each prediction is an integer value:

    {"predictions":
      [5, 4, 3],
      "deployedModelId": 123456789012345678
    }
    
  • A more complex set of predictions, each containing two named values that correspond to output tensors, named label and scores, respectively. The value of label is the predicted category ("car" or "beach") and scores contains a list of probabilities for that instance across the possible categories.

    {
      "predictions": [
        {
          "label": "beach",
          "scores": [0.1, 0.9]
        },
        {
          "label": "car",
          "scores": [0.75, 0.25]
        }
      ],
      "deployedModelId": 123456789012345678
    }
    
  • A response when there is an error processing an input instance:

    {"error": "Divide by zero"}