Training models

This page describes how to use AutoML Tables to train a custom model based on your dataset. You must already have created a dataset and imported data into it.

Introduction

You create a custom model by training it using a prepared dataset. AutoML Tables uses the items from the dataset to train the model, test it, and evaluate its performance. You can review the results, adjust the training dataset as needed, and train a new model using the improved dataset.

As part of preparing to train a model, you update the schema information of the dataset. These schema updates affect any future model that uses that dataset. Models that have already begun training are unaffected.

Training a model can take several hours to complete. You can check training progress in the Google Cloud console, or by using the Cloud AutoML API.

Because AutoML Tables creates a new model each time you start training, your project may include numerous models. You can list the models in your project and delete those that you no longer need.

Models must be retrained every six months so that they can continue to serve predictions.

Training a model

Console

  1. If needed, open the Datasets page and click on the dataset you want to use.

    This opens the dataset in the Train tab.

  2. Select the target column for your model.

    This is the value that the model is trained to predict. Its data type determines whether the resulting model is a regression (Numeric) or a classification (Categorical) model. Learn more.

    If your target column has a data type of Categorical, it must have at least two and no more than 500 distinct values.

  3. Review Data type, Nullability, and the data statistics for each column in your dataset.

    You can click on individual columns to get more details about that column. Learn more about schema review.

  4. If you want to control your data split, click Edit additional parameters and specify a data split column or a Time column. Learn more.

  5. If you want to weight your training examples by the value of a column, click Edit additional parameters and specify the appropriate column. Learn more.

  6. Review the summary statistics and details to ensure that your data quality is what you expect, and that you have identified any columns that need to be excluded when you create your model.

    For more information, see Analyzing your training data.

  7. When you are satisfied with your dataset schema, click Train model at the top of the screen.

    When you make changes to your schema, AutoML Tables updates the summary statistics, which can take a few moments to complete. You do not need to wait for this process to complete before initiating model training.

  8. For Training budget, enter the maximum number of training hours for this model.

    The training budget must be between 1 and 72 hours. This is the maximum amount of training time you will be charged for.

    Suggested training time is related to the size of your training data. The table below shows suggested training time ranges by row count; a large number of columns will also increase training time.

    Rows                      Suggested training time
    Less than 100,000         1-3 hours
    100,000 - 1,000,000       1-6 hours
    1,000,000 - 10,000,000    1-12 hours
    More than 10,000,000      3-24 hours

    Model creation includes other tasks besides training, so the total time it takes to create your model is longer than the training time. For example, if you specify 2 training hours, it could still take 3 or more hours before the model is ready to deploy. You are charged only for actual training time.

    Learn more about training prices.

    If AutoML Tables detects that the model is no longer improving before the training budget is exhausted, it stops training. If you want to use the entire budgeted training time, open Advanced options and disable Early stopping.

  9. In the Input feature selection section, exclude any columns that you targeted for exclusion in the schema analysis step.

  10. If you do not want to use the default optimization objective, open Advanced options and select the metric you want AutoML Tables to optimize for when training your model. Learn more.

    Depending on the data type of your target column, there might be only one choice for Optimization objective.

  11. Click Train model to begin model training.

    Training a model can take several hours to complete depending on the size of the dataset and the training budget. You can close your browser window without affecting the training process.

    After the model is successfully trained, the Models tab shows high-level metrics for the model, such as precision and recall.

    For help with evaluating the quality of your model, see Evaluating models.
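
If you choose training budgets programmatically, the suggested training times from step 8 can be sketched as a small lookup. This is only an illustration: the function name `suggested_training_hours` is hypothetical, and the table does not say which bucket a row count that sits exactly on a boundary belongs to, so the sketch assumes it falls into the larger bucket.

```python
from typing import Tuple

# (exclusive upper bound on row count, (min hours, max hours)), taken from
# the suggested-training-time table in step 8.
_SUGGESTED_HOURS = [
    (100_000, (1, 3)),
    (1_000_000, (1, 6)),
    (10_000_000, (1, 12)),
]
_LARGEST_RANGE = (3, 24)  # largest bucket in the table


def suggested_training_hours(row_count: int) -> Tuple[int, int]:
    """Return the (min, max) suggested training hours for a row count.

    Assumption: a row count exactly on a boundary (for example 1,000,000)
    falls into the larger bucket; the table does not specify this.
    """
    if row_count < 0:
        raise ValueError("row_count must be non-negative")
    for upper_bound, hours in _SUGGESTED_HOURS:
        if row_count < upper_bound:
            return hours
    return _LARGEST_RANGE
```

The returned range is a starting point only; as noted above, a large number of columns also increases training time.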

REST

The following example shows how you can review and update your data schema before training your model.

If your resources are located in the EU region, use eu for location and use the eu-automl.googleapis.com endpoint. Otherwise, use us-central1 for location and the automl.googleapis.com endpoint. Learn more.

  1. After the import completes, list your table specifications to get your table ID.

    Before using any of the request data, make the following replacements:

    • endpoint: automl.googleapis.com for the global location, and eu-automl.googleapis.com for the EU region.
    • project-id: your Google Cloud project ID.
    • location: the location for the resource: us-central1 for Global or eu for the European Union.
    • dataset-id: the ID of the dataset. For example, TBL6543.

    HTTP method and URL:

    GET https://endpoint/v1beta1/projects/project-id/locations/location/datasets/dataset-id/tableSpecs/

    The table ID is the last segment of the name field in the response.

  2. List your column specifications.

    Before using any of the request data, make the following replacements:

    • endpoint: automl.googleapis.com for the global location, and eu-automl.googleapis.com for the EU region.
    • project-id: your Google Cloud project ID.
    • location: the location for the resource: us-central1 for Global or eu for the European Union.
    • dataset-id: the ID of the dataset. For example, TBL6543.
    • table-id: the ID of the table.

    HTTP method and URL:

    GET https://endpoint/v1beta1/projects/project-id/locations/location/datasets/dataset-id/tableSpecs/table-id/columnSpecs/

  3. Optionally, configure your target column.

    This is the value that the model is trained to predict. Its data type determines whether the resulting model is a regression (Numeric) or a classification (Categorical) model. Learn more.

    If your target column has a data type of Categorical, it must have at least two and no more than 500 distinct values.

    You can also specify the target column when you train the model. If you plan to do so, retain your table ID and desired target column ID for later use.

    Before using any of the request data, make the following replacements:

    • endpoint: automl.googleapis.com for the global location, and eu-automl.googleapis.com for the EU region.
    • project-id: your Google Cloud project ID.
    • location: the location for the resource: us-central1 for Global or eu for the European Union.
    • dataset-id: the ID of your dataset.
    • target-column-id: the ID of your target column.

    HTTP method and URL:

    PATCH https://endpoint/v1beta1/projects/project-id/locations/location/datasets/dataset-id

    Request JSON body:

    {
      "tablesDatasetMetadata": {
        "targetColumnSpecId": "target-column-id"
      }
    }
    

  4. Optionally, update the mlUseColumnSpecId field to specify your data split, and the weightColumnSpecId field to use a weight column.

    Before using any of the request data, make the following replacements:

    • endpoint: automl.googleapis.com for the global location, and eu-automl.googleapis.com for the EU region.
    • project-id: your Google Cloud project ID.
    • location: the location for the resource: us-central1 for Global or eu for the European Union.
    • dataset-id: the ID of your dataset.
    • split-column-id: the ID of your data split column.
    • weight-column-id: the ID of your weight column.

    HTTP method and URL:

    PATCH https://endpoint/v1beta1/projects/project-id/locations/location/datasets/dataset-id

    Request JSON body:

    {
      "tablesDatasetMetadata": {
        "mlUseColumnSpecId": "split-column-id",
        "weightColumnSpecId": "weight-column-id"
      }
    }
    

  5. Review your column stats to ensure that the dataType values are correct, and columns have the correct value for nullable.

    If a field is marked non-nullable, it had no null values in the training dataset. Make sure this will be true for your prediction data as well: if a value is not supplied for a non-nullable column at prediction time, a prediction error is returned for that row.

    Learn more about schema review.

  6. Review your data quality.

    Learn more about analyzing your training data.

  7. Train the model.

    Before using any of the request data, make the following replacements:

    • endpoint: automl.googleapis.com for the global location, and eu-automl.googleapis.com for the EU region.
    • project-id: your Google Cloud project ID.
    • location: the location for the resource: us-central1 for Global or eu for the European Union.
    • dataset-id: the dataset ID.
    • table-id: the table ID, used to set the target column.
    • target-column-id: the ID of the target column.
    • model-display-name: the display name for the new model.
    • optimization-objective: the metric to optimize (optional).

      See About model optimization objectives.

    • train-budget-milli-node-hours: the number of milli-node-hours for training. For example, 1000 = 1 hour.

      Suggested training time is related to the size of your training data. The table below shows suggested training time ranges by row count; a large number of columns will also increase training time.

      Rows                      Suggested training time
      Less than 100,000         1-3 hours
      100,000 - 1,000,000       1-6 hours
      1,000,000 - 10,000,000    1-12 hours
      More than 10,000,000      3-24 hours

      Model creation includes other tasks besides training, so the total time it takes to create your model is longer than the training time. For example, if you specify 2 training hours, it could still take 3 or more hours before the model is ready to deploy. You are charged only for actual training time.

      Learn more about training prices.

      If AutoML Tables detects that the model is no longer improving before the training budget is exhausted, it stops training. If you want to use the entire budgeted training time, set the disableEarlyStopping property on the tablesModelMetadata object to true.

    HTTP method and URL:

    POST https://endpoint/v1beta1/projects/project-id/locations/location/models/

    Request JSON body:

    {
      "datasetId": "dataset-id",
      "displayName": "model-display-name",
      "tablesModelMetadata": {
        "trainBudgetMilliNodeHours": "train-budget-milli-node-hours",
        "optimizationObjective": "optimization-objective",
        "targetColumnSpec": {
          "name": "projects/project-id/locations/location/datasets/dataset-id/tableSpecs/table-id/columnSpecs/target-column-id"
        }
      }
    }
    

    You should receive a JSON response similar to the following:

    {
      "name": "projects/292381/locations/us-central1/operations/TBL64984",
      "metadata": {
        "@type": "type.googleapis.com/google.cloud.automl.v1beta1.OperationMetadata",
        "createTime": "2019-12-30T22:12:03.014058Z",
        "updateTime": "2019-12-30T22:12:03.014058Z",
        "cancellable": true,
        "createModelDetails": {
          "modelDisplayName": "new_model1"
        },
        "worksOn": [
          "projects/292381/locations/us-central1/datasets/TBL3718"
        ],
        "state": "RUNNING"
      }
    }
    

    Training a model is a long-running operation. You can poll for the operation status or wait for the operation to return. Learn more.
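
As an illustration of how the pieces of the step 7 request fit together, the following sketch assembles the URL and JSON body with the Python standard library. It does not send the request or handle authentication. The function name `build_create_model_request` and its parameters are hypothetical; the endpoint, path, and field names are the ones shown above.

```python
import json
from typing import Optional, Tuple


def build_create_model_request(
    project_id: str,
    location: str,
    dataset_id: str,
    table_id: str,
    target_column_id: str,
    model_display_name: str,
    train_budget_milli_node_hours: int,
    optimization_objective: Optional[str] = None,
) -> Tuple[str, str]:
    """Return the (url, json_body) pair for the model-training POST request."""
    # The EU region has its own endpoint; other resources use the global one.
    if location == "eu":
        endpoint = "eu-automl.googleapis.com"
    else:
        endpoint = "automl.googleapis.com"
    parent = f"projects/{project_id}/locations/{location}"
    url = f"https://{endpoint}/v1beta1/{parent}/models/"

    metadata = {
        # Sent as a string, mirroring the request body shown in step 7.
        "trainBudgetMilliNodeHours": str(train_budget_milli_node_hours),
        "targetColumnSpec": {
            "name": (
                f"{parent}/datasets/{dataset_id}/tableSpecs/{table_id}"
                f"/columnSpecs/{target_column_id}"
            )
        },
    }
    # The optimization objective is optional; omit it to use the default.
    if optimization_objective is not None:
        metadata["optimizationObjective"] = optimization_objective

    body = {
        "datasetId": dataset_id,
        "displayName": model_display_name,
        "tablesModelMetadata": metadata,
    }
    return url, json.dumps(body, indent=2)
```

Building the body from a dictionary and serializing it with `json.dumps` avoids hand-written JSON mistakes such as trailing commas.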

Java

If your resources are located in the EU region, you must explicitly set the endpoint. Learn more.

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.automl.v1beta1.AutoMlClient;
import com.google.cloud.automl.v1beta1.ColumnSpec;
import com.google.cloud.automl.v1beta1.ColumnSpecName;
import com.google.cloud.automl.v1beta1.LocationName;
import com.google.cloud.automl.v1beta1.Model;
import com.google.cloud.automl.v1beta1.OperationMetadata;
import com.google.cloud.automl.v1beta1.TablesModelMetadata;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

class TablesCreateModel {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "YOUR_PROJECT_ID";
    String datasetId = "YOUR_DATASET_ID";
    String tableSpecId = "YOUR_TABLE_SPEC_ID";
    String columnSpecId = "YOUR_COLUMN_SPEC_ID";
    String displayName = "YOUR_MODEL_NAME";
    createModel(projectId, datasetId, tableSpecId, columnSpecId, displayName);
  }

  // Create a model
  static void createModel(
      String projectId,
      String datasetId,
      String tableSpecId,
      String columnSpecId,
      String displayName)
      throws IOException, ExecutionException, InterruptedException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (AutoMlClient client = AutoMlClient.create()) {
      // A resource that represents Google Cloud Platform location.
      LocationName projectLocation = LocationName.of(projectId, "us-central1");

      // Get the complete path of the column.
      ColumnSpecName columnSpecName =
          ColumnSpecName.of(projectId, "us-central1", datasetId, tableSpecId, columnSpecId);

      // Build the target column spec.
      ColumnSpec targetColumnSpec =
          ColumnSpec.newBuilder().setName(columnSpecName.toString()).build();

      // Set model metadata.
      TablesModelMetadata metadata =
          TablesModelMetadata.newBuilder()
              .setTargetColumnSpec(targetColumnSpec)
              .setTrainBudgetMilliNodeHours(24000)
              .build();

      Model model =
          Model.newBuilder()
              .setDisplayName(displayName)
              .setDatasetId(datasetId)
              .setTablesModelMetadata(metadata)
              .build();

      // Create a model with the model metadata in the region.
      OperationFuture<Model, OperationMetadata> future =
          client.createModelAsync(projectLocation, model);
      // OperationFuture.get() will block until the model is created, which may take several hours.
      // You can use OperationFuture.getInitialFuture to get a future representing the initial
      // response to the request, which contains information while the operation is in progress.
      System.out.format("Training operation name: %s%n", future.getInitialFuture().get().getName());
      System.out.println("Training started...");
    }
  }
}

Node.js

If your resources are located in the EU region, you must explicitly set the endpoint. Learn more.

const automl = require('@google-cloud/automl');
const client = new automl.v1beta1.AutoMlClient();

/**
 * Demonstrates using the AutoML client to create a model.
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const projectId = '[PROJECT_ID]'; // e.g., 'my-gcloud-project'
// const computeRegion = '[REGION_NAME]'; // e.g., 'us-central1'
// const datasetId = '[DATASET_ID]'; // e.g., 'TBL2246891593778855936'
// const tableId = '[TABLE_ID]'; // e.g., '1991013247762825216'
// const columnId = '[COLUMN_ID]'; // e.g., '773141392279994368'
// const modelName = '[MODEL_NAME]'; // e.g., 'testModel'
// const trainBudget = '[TRAIN_BUDGET]'; // train budget in milli node hours, e.g., '1000'

// A resource that represents Google Cloud Platform location.
const projectLocation = client.locationPath(projectId, computeRegion);

// Get the full path of the column.
const columnSpecId = client.columnSpecPath(
  projectId,
  computeRegion,
  datasetId,
  tableId,
  columnId
);

// Set target column to train the model.
const targetColumnSpec = {name: columnSpecId};

// Set tables model metadata.
const tablesModelMetadata = {
  targetColumnSpec: targetColumnSpec,
  trainBudgetMilliNodeHours: trainBudget,
};

// Set datasetId, model name and model metadata for the dataset.
const myModel = {
  datasetId: datasetId,
  displayName: modelName,
  tablesModelMetadata: tablesModelMetadata,
};

// Create a model with the model metadata in the region.
client
  .createModel({parent: projectLocation, model: myModel})
  .then(responses => {
    const initialApiResponse = responses[1];
    console.log(`Training operation name: ${initialApiResponse.name}`);
    console.log('Training started...');
  })
  .catch(err => {
    console.error(err);
  });

Python

The client library for AutoML Tables includes additional Python methods that simplify using the AutoML Tables API. These methods refer to datasets and models by name instead of ID. Your dataset and model names must be unique. For more information, see the Client reference.

If your resources are located in the EU region, you must explicitly set the endpoint. Learn more.

# TODO(developer): Uncomment and set the following variables
# project_id = 'PROJECT_ID_HERE'
# compute_region = 'COMPUTE_REGION_HERE'
# dataset_display_name = 'DATASET_DISPLAY_NAME_HERE'
# model_display_name = 'MODEL_DISPLAY_NAME_HERE'
# train_budget_milli_node_hours = 'TRAIN_BUDGET_MILLI_NODE_HOURS_HERE'
# include_column_spec_names = 'INCLUDE_COLUMN_SPEC_NAMES_HERE'
#    or None if unspecified
# exclude_column_spec_names = 'EXCLUDE_COLUMN_SPEC_NAMES_HERE'
#    or None if unspecified

from google.cloud import automl_v1beta1 as automl

client = automl.TablesClient(project=project_id, region=compute_region)

# Create a model with the model metadata in the region.
response = client.create_model(
    model_display_name,
    train_budget_milli_node_hours=train_budget_milli_node_hours,
    dataset_display_name=dataset_display_name,
    include_column_spec_names=include_column_spec_names,
    exclude_column_spec_names=exclude_column_spec_names,
)

print("Training model...")
print(f"Training operation name: {response.operation.name}")
print(f"Training completed: {response.result()}")
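
The training budget in the sample above is expressed in milli-node-hours, where 1000 equals 1 hour. A minimal helper for converting from hours might look like the following. The name `hours_to_milli_node_hours` is hypothetical, and the 1 to 72 hour check reuses the limit stated for the console; applying it to API requests is an assumption of this sketch.

```python
MILLI_NODE_HOURS_PER_HOUR = 1000  # 1000 milli-node-hours = 1 hour


def hours_to_milli_node_hours(hours: float) -> int:
    """Convert a training budget in hours to milli-node-hours.

    Assumption: the 1-72 hour range stated for the console's Training
    budget field also applies to budgets passed through the API.
    """
    if not 1 <= hours <= 72:
        raise ValueError("training budget must be between 1 and 72 hours")
    return round(hours * MILLI_NODE_HOURS_PER_HOUR)


# Example: a budget of two and a half hours.
train_budget_milli_node_hours = hours_to_milli_node_hours(2.5)
```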

Schema review

For each column, AutoML Tables infers the data type and nullability from the original data type (if the data was imported from BigQuery) and from the values in the column. You should check each column and make sure it looks correct.

Use the following list to review your schema:

  • Fields that contain free-form text should be Text.

    Text fields are separated into tokens by UnicodeScriptTokenizer, with individual tokens being used for model training. The UnicodeScriptTokenizer tokenizes text by whitespace, while also separating punctuation from text and different languages from each other.

  • If the value of a column is one of a finite set of values, it should probably be Categorical, regardless of the type of data used in the field.

    For example, you might have codes for colors: 1 = red, 2 = yellow, etc. You should make sure that such a field was designated as Categorical.

    An exception to this guidance is if the column contains multi-word strings. In this case, you should set it as a Text column, even if it has a low cardinality. AutoML Tables tokenizes Text columns, and might be able to derive prediction signal from the individual tokens or their order.

  • If a field is marked non-nullable, it had no null values in the training dataset. Make sure this will be true for your prediction data as well: if a value is not supplied for a non-nullable column at prediction time, a prediction error is returned for that row.
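
To catch the non-nullable problem before requesting predictions, you could run a check like the following over your prediction rows. This is a sketch under stated assumptions: rows are plain dictionaries keyed by column name, the helper name `find_missing_required_values` is hypothetical, and treating an empty string as missing is a choice made here, not documented AutoML Tables behavior.

```python
from typing import Any, Dict, Iterable, List, Set


def find_missing_required_values(
    rows: Iterable[Dict[str, Any]], non_nullable_columns: Set[str]
) -> List[Dict[str, Any]]:
    """Report, per row, the non-nullable columns that have no value.

    A value counts as missing if the key is absent, the value is None, or
    the value is an empty string (the last rule is an assumption).
    """
    problems = []
    for index, row in enumerate(rows):
        missing = sorted(
            column
            for column in non_nullable_columns
            if row.get(column) is None or row.get(column) == ""
        )
        if missing:
            problems.append({"row": index, "missing": missing})
    return problems


# Hypothetical prediction rows; "age" and "city" were non-nullable in training.
rows = [
    {"age": 34, "city": "Oslo"},
    {"age": None, "city": "Lima"},
    {"city": ""},
]
report = find_missing_required_values(rows, {"age", "city"})
```

An empty report means every row supplies a value for each non-nullable column.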

Analyzing your training data

  • If a column has a high percentage of missing values, make sure this is expected, and not due to a data collection issue.

  • Make sure the number of invalid values is relatively low or zero.

    Any row that contains one or more invalid values is automatically excluded from being used for model training.

  • If the number of distinct values for a Categorical column approaches the number of rows (for example, more than 90%), that column will not provide much training signal and should be excluded from training. ID columns should always be excluded.

  • If a column's Correlation with Target value is high, make sure that is expected, and not an indication of target leakage.

    If the column will be available when you request predictions, then it is probably a feature with strong explanatory power and can be included. However, features with high correlation are sometimes derived from the target or collected after the fact. Exclude these features from training: they are not available at prediction time, so a model trained on them is unusable in production.

    Correlation is calculated for categorical, numeric, and timestamp columns, using Cramér's V. For numeric columns, it is calculated using bucket counts generated from quantiles.
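
For intuition about the Correlation with Target statistic, the textbook form of Cramér's V between two categorical sequences can be sketched in plain Python as below. This is not AutoML Tables' implementation: whether the service applies bias correction, and exactly how it buckets numeric columns, is not stated here. Numeric values would need to be mapped to quantile buckets before being passed to this function.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def cramers_v(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Cramér's V between two categorical sequences of equal length.

    Returns a value in [0, 1]: 0 means no association, 1 a perfect one.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    smaller_dim = min(len(x_counts), len(y_counts))
    # V is undefined when either variable is constant; return 0.0 instead.
    if n == 0 or smaller_dim < 2:
        return 0.0

    # Chi-squared statistic over the full contingency table.
    observed = Counter(zip(xs, ys))
    chi2 = 0.0
    for x, x_count in x_counts.items():
        for y, y_count in y_counts.items():
            expected = x_count * y_count / n
            chi2 += (observed[(x, y)] - expected) ** 2 / expected

    return math.sqrt(chi2 / (n * (smaller_dim - 1)))
```

A value close to 1 for a feature column against the target is the situation described above: confirm that the feature is available at prediction time before keeping it.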

About model optimization objectives

The optimization objective impacts how your model is trained, and therefore how it performs in production. The table below provides some details about what kinds of problems each objective is best for:

Optimization objective  Problem type    API value                     Use this objective if you want to...
AUC ROC                 Classification  MAXIMIZE_AU_ROC               Distinguish between classes. Default value for binary classification.
Log loss                Classification  MINIMIZE_LOG_LOSS             Keep prediction probabilities as accurate as possible. Only supported objective for multi-class classification.
AUC PR                  Classification  MAXIMIZE_AU_PRC               Optimize results for predictions for the less common class.
Precision at Recall     Classification  MAXIMIZE_PRECISION_AT_RECALL  Optimize precision at a specific recall value.
Recall at Precision     Classification  MAXIMIZE_RECALL_AT_PRECISION  Optimize recall at a specific precision value.
RMSE                    Regression      MINIMIZE_RMSE                 Capture more extreme values accurately.
MAE                     Regression      MINIMIZE_MAE                  View extreme values as outliers with less impact on model.
RMSLE                   Regression      MINIMIZE_RMSLE                Penalize error on relative size rather than absolute value. Especially helpful when both predicted and actual values can be quite large.
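
To make the difference between the three regression objectives concrete, the sketch below computes each metric using its standard definition. The function names are illustrative only and the code is independent of AutoML Tables; RMSLE is taken as the root mean squared difference of log(1 + value), which assumes non-negative values.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error: large errors dominate the result."""
    squared = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error: every error counts in proportion to its size."""
    absolute = [abs(p - a) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


def rmsle(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared log error: penalizes relative, not absolute, error."""
    squared = [(math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Two perfect predictions and one that is off by 3.
actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 6.0]
scores = {
    "rmse": rmse(actual, predicted),
    "mae": mae(actual, predicted),
    "rmsle": rmsle(actual, predicted),
}
```

With a single large miss, RMSE comes out higher than MAE because squaring amplifies the outlier, which is why RMSE suits cases where extreme values matter and MAE suits cases where they should have less impact.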

What's next