This page describes how AutoML Tables enables you and your team to build high-performing models from your tabular data.
See our Known issues page for current known issues and how to avoid or recover from them.
AutoML Tables helps you create clean, effective training data by providing information about missing data, correlation, cardinality, and distribution for each of your features. And because there's no charge for importing your data and viewing information about it, you don't incur charges from AutoML Tables until you start training your model.
When you kick off training, AutoML Tables automatically performs common feature engineering tasks for you, including:
- Normalizing and bucketizing numeric features.
- Creating one-hot encodings and embeddings for categorical features.
- Performing basic processing for text features.
- Extracting date- and time-related features from Timestamp columns.
For more information, see Data preparation that AutoML Tables does for you.
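To make these transformations concrete, the following Python sketch shows simplified versions of several of them. It is illustrative only: the function names are hypothetical, embeddings and text processing are omitted, and nothing here is meant to describe the exact transformations AutoML Tables applies internally.

```python
import math
from datetime import datetime, timezone


def z_score(values):
    """Normalize numeric values to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # fall back to 1.0 for constant columns
    return [(v - mean) / std for v in values]


def bucketize(values, boundaries):
    """Map each value to a bucket index: the number of boundaries it reaches."""
    return [sum(v >= b for b in boundaries) for v in values]


def one_hot(values):
    """One-hot encode a categorical column using a sorted vocabulary."""
    vocab = sorted(set(values))
    return [[1 if v == term else 0 for term in vocab] for v in values]


def timestamp_features(unix_seconds):
    """Extract calendar features from a Unix timestamp, interpreted as UTC."""
    dt = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return {
        "year": dt.year,
        "month": dt.month,
        "day_of_week": dt.weekday(),  # Monday is 0
        "hour": dt.hour,
    }
```

For example, `bucketize([5, 15, 25], [10, 20])` places the three values in buckets 0, 1, and 2.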
Parallel model testing
When you kick off training for your model, AutoML Tables takes your dataset and starts training for multiple model architectures at the same time. This approach enables AutoML Tables to determine the best model architecture for your data quickly, without having to serially iterate over the many possible model architectures. The model architectures AutoML Tables tests include:
- Feedforward deep neural network
- Gradient-boosted decision tree
- Ensembles of various model architectures
As new model architectures come out of the research community, we will add those as well.
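The idea of fitting several candidates at once and keeping the one that scores best on the validation set can be sketched in a few lines of Python. This is a toy illustration, not the AutoML Tables implementation: the two candidate "architectures" (a constant predictor and a one-variable linear fit), the function names, and the choice of mean squared error as the validation metric are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_mean(train):
    """Candidate 1: always predict the mean target seen in training."""
    mean = sum(y for _, y in train) / len(train)
    return lambda x: mean


def fit_line(train):
    """Candidate 2: least-squares line through (x, y) pairs."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    slope = sum((x - mx) * (y - my) for x, y in train) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def mse(model, rows):
    """Mean squared error of a fitted model on (x, y) rows."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)


def pick_best(candidates, train, validation):
    """Fit every candidate concurrently and return the lowest-error name."""
    def fit_and_score(item):
        name, fit = item
        return name, mse(fit(train), validation)

    # Threads only illustrate the concurrent structure; real CPU-bound
    # training would more likely use separate processes or machines.
    with ThreadPoolExecutor() as pool:
        scores = dict(pool.map(fit_and_score, candidates.items()))
    return min(scores, key=scores.get), scores
```

Calling `pick_best({"mean": fit_mean, "line": fit_line}, train, validation)` returns the winning candidate's name together with every candidate's validation error, so the comparison remains inspectable.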
Model evaluation and final model creation
Using your training and validation sets, we determine the best model architecture for your data. We then train two more models, using the parameters and architecture we determined in the parallel testing phase:
- A model trained with your training and validation sets. We use your test set to provide the model evaluation on this model.
- A model trained with your training, validation, and test sets. This is the model that we provide to you to use to make predictions.
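The two-model step can be sketched as follows. This is a minimal illustration under stated assumptions: `fit_mean` and `mse` are hypothetical stand-ins for whatever architecture and metric the earlier search selected, and each split is a list of (x, y) pairs.

```python
def fit_mean(rows):
    """Stand-in for the selected architecture: predict the mean target."""
    mean = sum(y for _, y in rows) / len(rows)
    return lambda x: mean


def mse(model, rows):
    """Mean squared error of a fitted model on (x, y) rows."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)


def finalize(fit, score, train, validation, test):
    """Return the test-set evaluation and the model used for predictions."""
    # Model 1: fit on training + validation, evaluated on the held-out test set.
    evaluation_model = fit(train + validation)
    test_score = score(evaluation_model, test)

    # Model 2: fit on all three splits; this one serves predictions.
    serving_model = fit(train + validation + test)
    return test_score, serving_model
```

Note that the reported evaluation comes from the first model, because the second model has already seen the test rows and cannot be scored on them fairly.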
Choosing between AutoML Tables and BigQuery ML
You might want to use BigQuery ML if you are more focused on rapid experimentation or iteration with what data to include in the model and want to use simpler model types for this purpose (such as logistic regression).
You might want to work directly in the AutoML Tables interface if you have already finalized the data, and you:
- Are optimizing for model quality (accuracy, low RMSE, and so on) without needing to manually do feature engineering, model selection, ensembling, and so on.
- Are willing to wait longer to attain that model quality. AutoML Tables takes at least an hour to train a model, because it experiments with many modeling options. BigQuery ML potentially returns models in minutes because it sticks with the model architectures and parameter values and ranges you set.
- Have a wide variety of feature inputs (beyond numbers and classes) that would benefit from the additional automated feature engineering that AutoML Tables provides.
Model transparency and Cloud Logging
You can view the structure of your AutoML Tables model using Cloud Logging. In Logging, you can see the final model hyperparameters as well as the hyperparameters and objective values used during model validation.
For more information, see Logging.
We know that you need to be able to explain how your data relates to the final model, and to the predictions it makes. We provide you with two primary ways to gain insight into your model and how it operates: feature importance and test data export.
Feature importance, sometimes called feature attributions, enables you to see which features contributed the most to model training (model feature importance) and individual predictions (local feature importance).
Model feature importance
Model feature importance helps you ensure that the features that informed model training make sense for your data and business problem. Every feature with a high feature importance value should represent a valid signal for prediction, and should be one that you can consistently include in your prediction requests.
Model feature importance is calculated using permutation importance. For help with using model feature importance, see Evaluating models.
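As a rough illustration of the general permutation importance technique, the sketch below shuffles one feature column at a time and measures how much the model's error grows. It is a generic version, not the exact procedure AutoML Tables runs: the function name, the use of mean squared error, and the single shuffle per feature are assumptions made for the example.

```python
import random


def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Return the increase in mean squared error when each feature is shuffled."""
    rng = random.Random(seed)

    def mse(data):
        return sum((predict(r) - t) ** 2 for r, t in zip(data, targets)) / len(data)

    baseline_error = mse(rows)
    importances = []
    for j in range(n_features):
        # Shuffle column j across rows, breaking its link to the target.
        column = [r[j] for r in rows]
        rng.shuffle(column)
        shuffled = [list(r[:j]) + [v] + list(r[j + 1:]) for r, v in zip(rows, column)]
        importances.append(mse(shuffled) - baseline_error)
    return importances
```

A feature the model ignores gets an importance of zero, because shuffling it leaves every prediction unchanged; a feature the model relies on produces a larger error once shuffled.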
Local feature importance
Local feature importance gives you visibility into how the features in a specific prediction request informed the resulting prediction.
To arrive at each local feature importance value, first the baseline prediction score is calculated. Baseline values are computed from the training data, using the median value for numeric features and the mode for categorical features. The prediction generated from the baseline values is the baseline prediction score.
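The baseline construction described above can be sketched in Python as follows. The function name, the row representation (a list of dictionaries), and the column names in the usage example are hypothetical; only the median and mode rules come from the text.

```python
from statistics import median, mode


def baseline_row(training_rows, numeric_columns):
    """Build a baseline row: median for numeric columns, mode for the rest."""
    baseline = {}
    for column in training_rows[0]:
        values = [row[column] for row in training_rows]
        if column in numeric_columns:
            baseline[column] = median(values)
        else:
            baseline[column] = mode(values)
    return baseline
```

Passing the resulting row to the model yields the baseline prediction score. For instance, three training rows with ages 30, 40, and 50 and plans "basic", "pro", and "basic" produce the baseline `{"age": 40, "plan": "basic"}`.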
For classification models, the local feature importance tells you how much each feature added to or subtracted from the probability assigned to the class with the highest score, as compared with the baseline prediction score.
For regression models, the local feature importance for a prediction tells you how much each feature added to or subtracted from the result as compared with the baseline prediction score.
AutoML Tables computes local feature importance using the Sampled Shapley method. For more information about model explainability, see Introduction to AI Explanations.
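The sketch below shows one common way to approximate Shapley values by sampling feature orderings. It is a generic Monte Carlo version under stated assumptions, not the AutoML Tables implementation: the function name, the number of samples, and the list-based row format are all hypothetical. For a classification model, `predict` would return the probability of the class with the highest score; for a regression model, it would return the predicted value.

```python
import random


def sampled_shapley(predict, instance, baseline, n_samples=200, seed=0):
    """Approximate each feature's contribution relative to the baseline."""
    rng = random.Random(seed)
    n = len(instance)
    attributions = [0.0] * n
    for _ in range(n_samples):
        # Pick a random order in which to switch features from baseline
        # values to the values in the actual prediction request.
        order = list(range(n))
        rng.shuffle(order)
        current = list(baseline)
        previous_score = predict(current)
        for j in order:
            current[j] = instance[j]
            score = predict(current)
            # Credit feature j with the change it caused in this ordering.
            attributions[j] += score - previous_score
            previous_score = score
    return [a / n_samples for a in attributions]
```

Within each sampled ordering the per-feature changes telescope, so the attributions sum to the difference between the prediction for the instance and the baseline prediction score. This is why each value can be read as an amount added to or subtracted from the baseline.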
For help with using local feature importance, see Getting an online prediction.
Considerations for using local feature importance:
- Local feature importance results are available only for models trained on or after November 15, 2019.
- Each local feature importance value shows only how much the feature affected the prediction for that row. To understand the overall behavior of the model, use the model feature importance graph on the Evaluate tab.
- Local feature importance values are always relative to the baseline value. Make sure you reference the baseline value when you are evaluating your local feature importance results. The baseline value is available only from the Cloud Console.
- The local feature importance values depend entirely on the model and the data used to train it. They can reveal only the patterns the model found in the data, and can't detect any fundamental relationships in the data. So, a high feature importance value for a certain feature does not demonstrate a relationship between that feature and the target; it merely shows that the model is using the feature in its predictions.
- Feature importance values alone do not tell you if your model is fair, unbiased, or of sound quality. You should carefully evaluate your training dataset, procedure, and evaluation metrics, in addition to feature importance.
Test data export
You can export your test set, along with the predictions your model made. This capability gives you insight into how your model is performing on individual rows of your test data. Examining your test set and its results can help you understand what types of predictions your model performs poorly on, and might provide clues into how you can improve your data for a higher-quality model.
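One simple way to examine exported results is to group rows by a feature and compare error rates across groups. The following Python sketch assumes the exported rows have already been loaded as dictionaries; the function name and the column names in the usage example are hypothetical and do not reflect the actual export schema.

```python
from collections import defaultdict


def error_rate_by_group(rows, group_key, target_key, predicted_key):
    """Return the fraction of misclassified rows for each value of group_key."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        if row[target_key] != row[predicted_key]:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}
```

For example, `error_rate_by_group(rows, "region", "label", "predicted_label")` would show whether mistakes concentrate in one region, which might suggest collecting more or cleaner data for that segment.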