Batch predictions

This page describes how you can provide multiple rows of data to AutoML Tables at once, and receive a prediction for each row.

Introduction

After you have created (trained) a model, you can make an asynchronous request for a batch of predictions using the batchPredict method. You supply input data to the batchPredict method, in table format. Each row provides values for the features you trained the model to use. The batchPredict method sends that data to the model and returns predictions for each row of data.

The maximum lifespan for a custom model is two years. You must create and train a new model to continue making predictions after that amount of time.

Using curl

To make it more convenient to run the curl samples in this topic, set the following environment variable. Replace project-id with the ID of your GCP project.

export PROJECT_ID="project-id"

Requesting a batch prediction

For batch predictions, you specify a data source and a results destination in either a BigQuery table or a CSV file in Cloud Storage. You do not need to use the same format for the source and destination. For example, you could use BigQuery for the data source and a CSV file in Cloud Storage for the results destination. Use the appropriate steps from the two tasks below depending on your requirements.

Your data source must contain tabular data that includes all of the columns used to train the model. You can include columns that were not in the training data, or that were in the training data but excluded from use for training. These extra columns are included in the prediction output, but they are not used for generating the prediction.
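Before submitting a large job, it can help to confirm locally that the input covers every training column. A minimal sketch (the column names are hypothetical):

```python
def check_columns(input_columns, training_columns):
    """Return the training columns missing from the prediction input."""
    return sorted(set(training_columns) - set(input_columns))

# Extra columns in the input are allowed; they pass through to the output
# unchanged. Missing training columns, however, will cause errors.
missing = check_columns(
    input_columns=["age", "income", "zip_code", "extra_id"],
    training_columns=["age", "income", "zip_code"],
)
print(missing)  # → []
```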

Using BigQuery tables

The names of the columns and data types of your input data must match the data you used in your training data. The columns can be in a different order than the training data.

BigQuery table requirements

  • BigQuery data source tables must be no larger than 100 GB.
  • You must use a multi-regional BigQuery dataset in the US location.
  • If the table is in a different project, you must provide the BigQuery Data Editor role to the AutoML Tables service account in that project. Learn more.

Requesting the batch prediction

Console

  1. Go to the AutoML Tables page in the Google Cloud Platform Console.


  2. Select Models and open the model that you want to use.

  3. Select the Predict tab.

  4. Click Batch prediction.

  5. For Input dataset, select Table from BigQuery and provide the project, dataset, and table IDs for your data source.

  6. For Result, select BigQuery project and provide the project ID for your results destination.

  7. Click Send batch prediction to request the batch prediction.


curl command

You request batch predictions by using the models.batchPredict method.

The following example requests a prediction for a set of values.

  • Replace model-id with the ID of your model. The ID is the last element of the name of your model. For example, if the name of your model is projects/4321/locations/us-central1/models/TBL584, then the ID of your model is TBL584.

    AutoML Tables creates a new dataset in the destination project for the prediction results. The dataset name is the name of your model, prepended with "prediction_" and appended with the timestamp of when the prediction job started.

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{
        "inputConfig": {
            "bigquerySource": {
                "inputUri": "bq://project-id.dataset-id.table-id"
            }
        },
        "outputConfig": {
            "bigqueryDestination": {
                "outputUri": "bq://project-id"
            }
        }
    }' \
    https://automl.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/us-central1/models/model-id:batchPredict

To get status for this operation, use the operation ID returned in the response. Learn more.
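The status check is a GET request on the operation name returned in the batchPredict response. A minimal sketch of building that polling URL (the operation name shown is hypothetical):

```python
API_ROOT = "https://automl.googleapis.com/v1beta1/"

def operation_status_url(operation_name):
    # The operation name in the batchPredict response looks like
    # "projects/<project>/locations/us-central1/operations/<operation-id>".
    return API_ROOT + operation_name

# Hypothetical operation name for illustration:
print(operation_status_url(
    "projects/1234/locations/us-central1/operations/TBL5678"))
# → https://automl.googleapis.com/v1beta1/projects/1234/locations/us-central1/operations/TBL5678
```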

Java

/**
 * Demonstrates using the AutoML client to request prediction from automl tables using bigQuery.
 *
 * @param projectId the Id of the project.
 * @param computeRegion the Region name. (e.g., "us-central1")
 * @param modelId the Id of the model which will be used for prediction.
 * @param inputUri the BigQuery URI of the input table.
 * @param outputUriPrefix the BigQuery URI of the results destination project.
 * @throws IOException
 * @throws ExecutionException
 * @throws InterruptedException
 */
public static void batchPredictionUsingBqSourceAndBqDest(
    String projectId,
    String computeRegion,
    String modelId,
    String inputUri,
    String outputUriPrefix)
    throws IOException, InterruptedException, ExecutionException {

  // Create client for prediction service.
  PredictionServiceClient predictionClient = PredictionServiceClient.create();

  // Get full path of model.
  ModelName modelName = ModelName.of(projectId, computeRegion, modelId);

  // Set the Input URI.
  BigQuerySource.Builder bigQuerySource = BigQuerySource.newBuilder();
  bigQuerySource.setInputUri(inputUri);

  // Set the Batch Input Configuration.
  BatchPredictInputConfig batchInputConfig =
      BatchPredictInputConfig.newBuilder().setBigquerySource(bigQuerySource).build();

  // Set the Output URI.
  BigQueryDestination.Builder bigQueryDestination = BigQueryDestination.newBuilder();
  bigQueryDestination.setOutputUri(outputUriPrefix);

  // Set the Batch Output Configuration.
  BatchPredictOutputConfig batchOutputConfig =
      BatchPredictOutputConfig.newBuilder().setBigqueryDestination(bigQueryDestination).build();

  // Set the modelName, input and output config in the batch prediction.
  BatchPredictRequest batchRequest =
      BatchPredictRequest.newBuilder()
          .setInputConfig(batchInputConfig)
          .setOutputConfig(batchOutputConfig)
          .setName(modelName.toString())
          .build();

  // Get the latest state of a long-running operation.
  OperationFuture<BatchPredictResult, OperationMetadata> operation =
      predictionClient.batchPredictAsync(batchRequest);

  System.out.println(
      String.format("Operation name: %s", operation.getInitialFuture().get().getName()));
}

Node.js

const automl = require('@google-cloud/automl');

// Create client for prediction service.
const client = new automl.v1beta1.PredictionServiceClient();

/**
 * Demonstrates using the AutoML client to request prediction from
 * automl tables using bigQuery.
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const projectId = '[PROJECT_ID]' e.g., "my-gcloud-project";
// const computeRegion = '[REGION_NAME]' e.g., "us-central1";
// const modelId = '[MODEL_ID]' e.g., "TBL4704590352927948800";
// const inputUri = '[BIGQUERY_PATH]'
// e.g., "bq://<project_id>.<dataset_id>.<table_id>",
// `The Big Query URI containing the inputs`;
// const outputUri = '[BIGQUERY_PATH]' e.g., "bq://<project_id>",
// `The destination Big Query URI for storing outputs`;

// Get the full path of the model.
const modelFullId = client.modelPath(projectId, computeRegion, modelId);

// Get the Big Query input URI.
const inputConfig = {
  bigquerySource: {
    inputUri: inputUri,
  },
};

// Get the Big Query output URI.
const outputConfig = {
  bigqueryDestination: {
    outputUri: outputUri,
  },
};

// Get the latest state of long-running operation.
client
  .batchPredict({
    name: modelFullId,
    inputConfig: inputConfig,
    outputConfig: outputConfig,
  })
  .then(responses => {
    const operation = responses[1];
    console.log(`Operation name: ${operation.name}`);
  })
  .catch(err => {
    console.error(err);
  });

Python

# TODO(developer): Uncomment and set the following variables
# project_id = 'PROJECT_ID_HERE'
# compute_region = 'COMPUTE_REGION_HERE'
# model_id = 'MODEL_ID_HERE'
# input_path = 'gs://path/to/file.csv' or
#   'bq://project_id.dataset_id.table_id'
# output_path = 'gs://path' or 'bq://project_id'


from google.cloud import automl_v1beta1 as automl

automl_client = automl.AutoMlClient()

# Get the full path of the model.
model_full_id = automl_client.model_path(
    project_id, compute_region, model_id
)

# Create client for prediction service.
prediction_client = automl.PredictionServiceClient()

if input_path.startswith('bq'):
    input_config = {"bigquery_source": {"input_uri": input_path}}
else:
    # Get the multiple Google Cloud Storage URIs.
    input_uris = [uri.strip() for uri in input_path.split(",")]
    input_config = {"gcs_source": {"input_uris": input_uris}}

if output_path.startswith('bq'):
    output_config = {"bigquery_destination": {"output_uri": output_path}}
else:
    # Use the Google Cloud Storage destination prefix.
    output_config = {"gcs_destination": {"output_uri_prefix": output_path}}

# Query model
response = prediction_client.batch_predict(
    model_full_id, input_config, output_config)
print("Making batch prediction... ")
try:
    result = response.result()
except Exception:
    # Hides the Any to BatchPredictResult unpacking error; the
    # operation metadata printed below reports completion status.
    pass
print("Batch prediction complete.\n{}".format(response.metadata))

Using CSV files in Cloud Storage

The names of the columns and data types of your input data must match the data you used in your training data. The columns can be in a different order than the training data.

CSV file requirements

  • The first line of the data source must contain the name of the columns.
  • Each data source file must not be larger than 10 GB.

    You can include multiple files, up to a maximum amount of 100 GB.

  • The Cloud Storage bucket must be Regional, and must reside in the us-central1 region.

  • If the Cloud Storage bucket is in a different project than where you use AutoML Tables, you must provide the Storage Object Creator role to the AutoML Tables service account in that project. Learn more.
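If a single export exceeds the 10 GB per-file limit, you can split it into shards before uploading, repeating the header row in each shard as required above. A rough sketch using the standard csv module (file names and shard size are illustrative):

```python
import csv

def split_csv(path, rows_per_shard, out_prefix):
    """Split a CSV into shards of at most rows_per_shard data rows,
    repeating the header row in each shard. Returns the shard count."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        shard = 0
        count = 0
        out = None
        writer = None
        for row in reader:
            if writer is None or count == rows_per_shard:
                if out is not None:
                    out.close()
                shard += 1
                out = open("{}_{}.csv".format(out_prefix, shard),
                           "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)  # every shard keeps the header
                count = 0
            writer.writerow(row)
            count += 1
        if out is not None:
            out.close()
    return shard
```

In practice you would size rows_per_shard so each shard stays safely under the 10 GB limit, then upload the shards and pass all of their URIs as the input.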

Console

  1. Go to the AutoML Tables page in the Google Cloud Platform Console.


  2. Select Models and open the model that you want to use.

  3. Select the Predict tab.

  4. Click Batch prediction.

  5. For Input dataset, select CSVs from Cloud Storage and provide the bucket URI for your data source.

  6. For Result, select Cloud Storage bucket and provide the bucket URI for your destination bucket.

  7. Click Send batch prediction to request the batch prediction.


curl command

You request batch predictions by using the models.batchPredict method.

The following example requests a prediction for a set of values.

  • Replace model-id with the ID of your model. The ID is the last element of the name of your model. For example, if the name of your model is projects/4321/locations/us-central1/models/TBL584, then the ID of your model is TBL584.

    AutoML Tables creates a subfolder in the output path named with the following format: prediction-model_name-timestamp. The subfolder contains the prediction results. You must have write permissions to this path.

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{
        "inputConfig": {
            "gcsSource": {
                "inputUris": [
                    "gs://bucket-name/directory-name/object-name.csv"
                ]
            }
        },
        "outputConfig": {
            "gcsDestination": {
                "outputUriPrefix": "gs://bucket-name/directory-name"
            }
        }
    }' \
    https://automl.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/us-central1/models/model-id:batchPredict

To get status for this operation, use the operation ID returned in the response. Learn more.

Java

/**
 * Demonstrates using the AutoML client to request prediction from automl tables using GCS.
 *
 * @param projectId the Id of the project.
 * @param computeRegion the Region name. (e.g., "us-central1")
 * @param modelId the Id of the model which will be used for prediction.
 * @param inputUri the Google Cloud Storage URIs of the input CSV files.
 * @param outputUriPrefix the destination Google Cloud Storage URI prefix for the results.
 * @throws IOException
 * @throws ExecutionException
 * @throws InterruptedException
 */
public static void batchPredictionUsingGcsSourceAndGcsDest(
    String projectId,
    String computeRegion,
    String modelId,
    String inputUri,
    String outputUriPrefix)
    throws IOException, InterruptedException, ExecutionException {

  // Create client for prediction service.
  PredictionServiceClient predictionClient = PredictionServiceClient.create();

  // Get full path of model.
  ModelName modelName = ModelName.of(projectId, computeRegion, modelId);

  // Set the Input URI.
  GcsSource.Builder gcsSource = GcsSource.newBuilder();

  // Add multiple csv files.
  String[] inputUris = inputUri.split(",");
  for (String addInputUri : inputUris) {
    gcsSource.addInputUris(addInputUri);
  }

  // Set the Batch Input Configuration.
  BatchPredictInputConfig batchInputConfig =
      BatchPredictInputConfig.newBuilder().setGcsSource(gcsSource).build();

  // Set the Output URI.
  GcsDestination.Builder gcsDestination = GcsDestination.newBuilder();
  gcsDestination.setOutputUriPrefix(outputUriPrefix);

  // Set the Batch Output Configuration.
  BatchPredictOutputConfig batchOutputConfig =
      BatchPredictOutputConfig.newBuilder().setGcsDestination(gcsDestination).build();

  // Set the modelName, input and output config in the batch prediction.
  BatchPredictRequest batchRequest =
      BatchPredictRequest.newBuilder()
          .setInputConfig(batchInputConfig)
          .setOutputConfig(batchOutputConfig)
          .setName(modelName.toString())
          .build();

  // Get the latest state of a long-running operation.
  OperationFuture<BatchPredictResult, OperationMetadata> operation =
      predictionClient.batchPredictAsync(batchRequest);

  System.out.println(
      String.format("Operation name: %s", operation.getInitialFuture().get().getName()));
}

Node.js

const automl = require('@google-cloud/automl');

// Create client for prediction service.
const client = new automl.v1beta1.PredictionServiceClient();

/**
 * Demonstrates using the AutoML client to request prediction from
 * automl tables using GCS.
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const projectId = '[PROJECT_ID]' e.g., "my-gcloud-project";
// const computeRegion = '[REGION_NAME]' e.g., "us-central1";
// const modelId = '[MODEL_ID]' e.g., "TBL4704590352927948800";
// const inputUri = '[GCS_PATH]' e.g., "gs://<bucket-name>/<csv file>",
// `The Google Cloud Storage URI containing the inputs`;
// const outputUriPrefix = '[GCS_PATH]'
// e.g., "gs://<bucket-name>/<folder-name>",
// `The destination Google Cloud Storage URI for storing outputs`;

// Get the full path of the model.
const modelFullId = client.modelPath(projectId, computeRegion, modelId);

// Get the multiple Google Cloud Storage input URIs.
const inputUris = inputUri.split(',');
const inputConfig = {
  gcsSource: {
    inputUris: inputUris,
  },
};

// Get the Google Cloud Storage output URI.
const outputConfig = {
  gcsDestination: {
    outputUriPrefix: outputUriPrefix,
  },
};

// Get the latest state of long-running operation.
client
  .batchPredict({
    name: modelFullId,
    inputConfig: inputConfig,
    outputConfig: outputConfig,
  })
  .then(responses => {
    const operation = responses[1];
    console.log(`Operation name: ${operation.name}`);
  })
  .catch(err => {
    console.error(err);
  });

Python

# TODO(developer): Uncomment and set the following variables
# project_id = 'PROJECT_ID_HERE'
# compute_region = 'COMPUTE_REGION_HERE'
# model_id = 'MODEL_ID_HERE'
# input_path = 'gs://path/to/file.csv' or
#   'bq://project_id.dataset_id.table_id'
# output_path = 'gs://path' or 'bq://project_id'


from google.cloud import automl_v1beta1 as automl

automl_client = automl.AutoMlClient()

# Get the full path of the model.
model_full_id = automl_client.model_path(
    project_id, compute_region, model_id
)

# Create client for prediction service.
prediction_client = automl.PredictionServiceClient()

if input_path.startswith('bq'):
    input_config = {"bigquery_source": {"input_uri": input_path}}
else:
    # Get the multiple Google Cloud Storage URIs.
    input_uris = [uri.strip() for uri in input_path.split(",")]
    input_config = {"gcs_source": {"input_uris": input_uris}}

if output_path.startswith('bq'):
    output_config = {"bigquery_destination": {"output_uri": output_path}}
else:
    # Use the Google Cloud Storage destination prefix.
    output_config = {"gcs_destination": {"output_uri_prefix": output_path}}

# Query model
response = prediction_client.batch_predict(
    model_full_id, input_config, output_config)
print("Making batch prediction... ")
try:
    result = response.result()
except Exception:
    # Hides the Any to BatchPredictResult unpacking error; the
    # operation metadata printed below reports completion status.
    pass
print("Batch prediction complete.\n{}".format(response.metadata))

Retrieving your results

Retrieving prediction results in BigQuery

If you specified BigQuery as your output destination, the results of your batch prediction request are returned as a new dataset in the BigQuery project you specified. The dataset name is the name of your model, prepended with "prediction_" and appended with the timestamp of when the prediction job started. You can find the BigQuery dataset name in Recent predictions on the Batch prediction page of the Predict tab for your model.

The BigQuery dataset contains two tables: predictions and errors. The errors table has a row for every row in your prediction request for which AutoML Tables could not return a prediction (for example, if a non-nullable feature was null). The predictions table contains a row for every prediction returned.

In the predictions table, AutoML Tables returns your prediction data, and creates a new column for the prediction results by prepending "predicted_" onto your target column name. The prediction results column contains a nested BigQuery structure that contains the prediction results.

To retrieve the prediction results, you can use a query in the BigQuery console. The format of the query depends on your model type.

Binary classification:

SELECT predicted_<target-column-name>[OFFSET(0)].tables AS value_1,
predicted_<target-column-name>[OFFSET(1)].tables AS value_2
FROM <bq-dataset-name>.predictions

"value_1" and "value_2", are place markers, you can replace them with the target values or an equivalent.

Multi-class classification:

SELECT predicted_<target-column-name>[OFFSET(0)].tables AS value_1,
predicted_<target-column-name>[OFFSET(1)].tables AS value_2,
predicted_<target-column-name>[OFFSET(2)].tables AS value_3,
...
predicted_<target-column-name>[OFFSET(4)].tables AS value_5
FROM <bq-dataset-name>.predictions

"value_1", "value_2", and so on are place markers, you can replace them with the target values or an equivalent.

Regression:

SELECT predicted_<target-column-name>[OFFSET(0)].tables.value,
predicted_<target-column-name>[OFFSET(0)].tables.prediction_interval.start,
predicted_<target-column-name>[OFFSET(0)].tables.prediction_interval.end
FROM <bq-dataset-name>.predictions

Retrieving results in Cloud Storage

If you specified Cloud Storage as your output destination, the results of your batch prediction request are returned as CSV files in a new folder in the bucket you specified. The name of the folder is the name of your model, prepended with "prediction-" and appended with the timestamp of when the prediction job started. You can find the Cloud Storage folder name in Recent predictions at the bottom of the Batch prediction page of the Predict tab for your model.

The Cloud Storage folder contains two types of files: error files and prediction files. If the results are large, additional files are created.

The error files are named errors_1.csv, errors_2.csv, and so on. They contain a header row, and a row for every row in your prediction request for which AutoML Tables could not return a prediction.

The prediction files are named tables_1.csv, tables_2.csv, and so on. They contain a header row with the column names, and a row for every prediction returned.
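Once the result files are downloaded, the shards can be recombined locally. A minimal sketch (assumes the tables_*.csv files were copied from the bucket into a local folder):

```python
import csv
import glob
import os

def read_prediction_shards(folder):
    """Read all tables_*.csv prediction shards in a local folder into a
    single list of dict rows, keyed by the header column names."""
    rows = []
    for path in sorted(glob.glob(os.path.join(folder, "tables_*.csv"))):
        with open(path, newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows
```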

In the prediction files, AutoML Tables returns your prediction data, and creates one or more new columns for the prediction results, depending on your model type:

Classification:

For each potential value of your target column, a column named <target-column-name>_<value>_score is added to the results. This column contains the score, or confidence estimate, for that value.
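For example, to recover the top-scoring class from a prediction row, you can compare the score columns. A sketch (the target column "species", its values, and the row contents are hypothetical):

```python
def predicted_label(row, target_column):
    """Pick the value with the highest score from a prediction row.

    Score columns are named <target_column>_<value>_score, as described
    above; other columns in the row are ignored.
    """
    prefix = target_column + "_"
    suffix = "_score"
    scores = {
        col[len(prefix):-len(suffix)]: float(val)
        for col, val in row.items()
        if col.startswith(prefix) and col.endswith(suffix)
    }
    return max(scores, key=scores.get)

row = {"species_cat_score": "0.2", "species_dog_score": "0.8", "age": "3"}
print(predicted_label(row, "species"))  # → dog
```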

Regression:

The predicted value for that row is returned in a column named predicted_<target-column-name>. The prediction interval is not returned for CSV output.

Interpreting your results

How you interpret your results depends on the business problem you are solving and how your data is distributed.

Interpreting your results for classification models

Prediction results for classification models (binary and multi-class) return a probability score for each potential value of the target column. You must determine how you want to use the scores.

For example, to get a binary classification from the provided scores, you would identify a threshold value. If there are two classes, "A" and "B", you should classify the example as "A" if the score for "A" is greater than the chosen threshold, and "B" otherwise. For imbalanced datasets, the threshold might approach 100% or 0%.
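A minimal sketch of that decision rule (the class names and threshold values are hypothetical):

```python
def classify(score_a, threshold):
    """Binary decision: label "A" if class A's score clears the
    threshold, otherwise label "B"."""
    return "A" if score_a >= threshold else "B"

# With an imbalanced dataset, a high threshold may be appropriate:
print(classify(0.97, threshold=0.95))  # → A
print(classify(0.97, threshold=0.99))  # → B
```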

You can use the confusion matrix on the Evaluate page for your model in the GCP Console to see how changing the threshold changes the results from your training data. This might help you determine the best way to use the score values to interpret your prediction results.

Interpreting your results for regression models

For regression models, an expected value is returned, and for many problems, you can use that value directly. You can also use the prediction interval, if it is returned, and if a range makes sense for your business problem.

What's next

Learn more about long-running operations.
