This page describes how the three sets (training, validation, and test) are used when you train an AutoML model, and the ways you can control how your data is split into the sets for AutoML models.
How AutoML uses data splits
AutoML uses data splits differently depending on the data type of the training data.
For image data sets, AutoML uses the training set to train the model, and the validation set to validate the results that the model returns during training. When training is complete, AutoML uses the test set to provide the final evaluation metrics.
For tabular models, the training process uses the following steps:
The training set is used to train models with different preprocessing, architecture, and hyperparameter option combinations. These models are evaluated on the validation set for quality, which guides the exploration of additional option combinations. The best parameters and architectures determined in the parallel tuning phase are used to train two ensemble models as described below.
Vertex AI trains an evaluation model, using the training and validation sets as training data. Vertex AI generates the final model evaluation metrics on this model, using the test set. This is the first time in the process that the test set is used. This approach ensures that the final evaluation metrics are an unbiased reflection of how well the final trained model will perform in production.
A model is trained with the training, validation, and test sets, to maximize the amount of training data. This model is the one that you use to request predictions.
For text data sets, the training and validation sets are used to try different preprocessing, architecture, and hyperparameter option combinations. These trials result in trained models that are then evaluated on the validation set for quality and to guide exploration of additional option combinations.
When more trials no longer lead to quality improvements, that version of the model is considered the final, best performing, trained model. Next, Vertex AI trains two more models, using the parameters and architecture determined in the parallel tuning phase:
A model trained with your training and validation sets.
Vertex AI generates the model evaluation metrics on this model, using your test set. This is the first time in the process that the test set is used. This approach ensures that the final evaluation metrics are an unbiased reflection of how well the final trained model will perform in production.
A model trained with your training, validation, and test sets.
This model is the one that you use to request predictions.
For video data sets, AutoML uses the training set to train the model, and then uses the test set to provide the final evaluation metrics. The validation set is not used for video data sets.
Data split options
You can let Vertex AI divide your data automatically; in many cases, that is the best approach. If you want to control how your data is split into sets, you have the following options:
- Using the default split
- Using the ml_use label
- Using the data filter (not supported for tabular data)
- Changing the percentages
- Using a time column (tabular data only)
You choose only one of these options; you make the choice when you train your model.
Some of these options require changes to the training data (for example, the
ml_use label or the Time column). Including data or labels for
data split options does not require you to use those options; you can still
choose another option when you train your model.
Using the default split
Vertex AI can automatically perform the data split for you. Your data is randomly split into the three sets by percentage, depending on your data type. This is the easiest way to split your data, and works well in most cases.
To use the default data split, accept the default in the Cloud Console, or leave the split field empty for the API.
Using the ml_use label
To use the ml_use label to control your data split, you must have set the
ml_use label on your data. See Setting a value for the ml_use label.
For AutoML data types other than tabular, when you train your
model, you specify Manual (Advanced) for the Data split in the
Cloud Console. If you are training using the Vertex AI API, you
use the FilterSplit object, specifying
aiplatform.googleapis.com/ml_use=training for the training filter,
aiplatform.googleapis.com/ml_use=validation for the validation filter, and
aiplatform.googleapis.com/ml_use=test for the test filter.
Any data items with an ml_use value are assigned to the specified set. Data
items that do not have ml_use set are excluded from the training process.
For the tabular data type, you use the Data split column.
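For the non-tabular case, the assignment rule described above (items with an ml_use value go to the named set, items without one are left out) can be sketched in plain Python. This is only an illustration of the rule, not Vertex AI code: the item dictionaries, the `assign_by_ml_use` helper, and the choice to exclude unrecognized values are all assumptions made for the example.

```python
ML_USE_KEY = "aiplatform.googleapis.com/ml_use"

def assign_by_ml_use(items):
    """Group item ids by their ml_use label (hypothetical helper).

    Items with no ml_use label are excluded. Treating an unrecognized
    value as excluded is an assumption of this sketch.
    """
    sets = {"training": [], "validation": [], "test": []}
    excluded = []
    for item in items:
        use = item.get("labels", {}).get(ML_USE_KEY)
        if use in sets:
            sets[use].append(item["id"])
        else:
            excluded.append(item["id"])
    return sets, excluded

# Hypothetical data items: "c" has no ml_use label, so it is left out.
items = [
    {"id": "a", "labels": {ML_USE_KEY: "training"}},
    {"id": "b", "labels": {ML_USE_KEY: "test"}},
    {"id": "c", "labels": {}},
]
sets, excluded = assign_by_ml_use(items)
print(sets, excluded)
```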
Using the data filter
You can use other labels (besides ml_use) and other fields to split your data by
using the FilterSplit object in the Vertex AI API. For example, you could set the
training filter to labels.flower=rose, the validation filter to
labels.flower=daisy, and the test filter to labels.flower=dahlia. This setting
would cause all data labeled as rose to be added to the training set, all data
labeled as daisy to be added to the validation set, and all data labeled as
dahlia to be added to the test set.
If you filter on multiple fields, a data item could match more than one filter. In this case, the training set takes precedence, followed by the validation set, followed by the test set. In other words, an item is put into the test set only if it matches the filter for the test set, but does not match for either the training or validation filters. If an item does not match the filters for any of the sets, it is excluded from training.
Do not split your data on categories related to what the model will predict; each of your sets should reflect the range of data the model will use to make predictions. (For example, you should not use the filters described previously for a model expected to categorize pictures by flower type.)
If you do not want a filter to match any items, set it to "-" (the minus sign).
The data filter is not supported for tabular data.
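The precedence rule for overlapping filters (training first, then validation, then test, with unmatched items excluded) can be illustrated with a small sketch. The `matches` function below is a deliberately tiny stand-in that understands only `labels.KEY=VALUE` and `-`; it is an assumption for this example and does not implement the real filter syntax.

```python
def matches(item_labels, filter_expr):
    """Stand-in matcher: supports only 'labels.KEY=VALUE' and '-' (hypothetical)."""
    if filter_expr == "-":
        return False  # the minus sign matches no items
    field, _, value = filter_expr.partition("=")
    if not field.startswith("labels."):
        return False
    key = field[len("labels."):]
    return item_labels.get(key) == value

def assign_split(item_labels, training_filter, validation_filter, test_filter):
    """Return the first matching set in precedence order, or None if excluded."""
    ordered = (
        ("training", training_filter),
        ("validation", validation_filter),
        ("test", test_filter),
    )
    for name, expr in ordered:
        if matches(item_labels, expr):
            return name
    return None

# The item below matches both the training and the test filter; training wins.
both = assign_split({"flower": "rose", "source": "lab"},
                    "labels.flower=rose", "-", "labels.source=lab")
# An item that matches no filter is excluded from training.
none = assign_split({"flower": "tulip"},
                    "labels.flower=rose", "labels.flower=daisy", "labels.flower=dahlia")
print(both, none)
```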
Changing the percentages
If you do not choose another way to control your data split, your data is randomly split into the sets according to the default percentages for your data type. You can change the percentages to any values that add up to 100 (for the Vertex AI API, you use fractions that add up to 1.0).
To change the percentages (fractions), you use the FractionSplit object to define your fractions. For image, text, and video data types, you can also use the Cloud Console to update your split percentages when you train your model.
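As a rough illustration of what a random split by fractions means, the sketch below shuffles a list and cuts it into three parts. It is not how Vertex AI performs the split internally; the `fraction_split` helper, the fixed seed, and the 0.7/0.2/0.1 fractions are arbitrary choices for the example, not documented defaults.

```python
import random

def fraction_split(items, training, validation, test, seed=0):
    """Randomly split items by fractions that must add up to 1.0 (hypothetical helper)."""
    if abs(training + validation + test - 1.0) > 1e-9:
        raise ValueError("fractions must add up to 1.0")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the example is repeatable
    n_train = round(len(shuffled) * training)
    n_val = round(len(shuffled) * validation)
    train_set = shuffled[:n_train]
    val_set = shuffled[n_train:n_train + n_val]
    test_set = shuffled[n_train + n_val:]  # whatever remains goes to test
    return train_set, val_set, test_set

train_set, val_set, test_set = fraction_split(range(100), 0.7, 0.2, 0.1)
print(len(train_set), len(val_set), len(test_set))
```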
Using a time column
For tabular training data, if your data is time-sensitive, you can designate one column as a Time column. Vertex AI uses the Time column to split your data, with the earliest of the rows used for training, the next rows for validation, and the latest rows for testing. If you are using the API, you can also specify the percentages of your data to use for the sets. For more information, see The Time column.
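The chronological idea (earliest rows for training, next rows for validation, latest rows for testing) can be sketched as follows. The `chronological_split` helper, the column name `t`, and the 0.8/0.1 fractions are assumptions for illustration only; they are not taken from the Vertex AI API.

```python
def chronological_split(rows, time_key, training=0.8, validation=0.1):
    """Sort rows by a time column, then cut them in order (hypothetical helper)."""
    ordered = sorted(rows, key=lambda row: row[time_key])
    n_train = round(len(ordered) * training)
    n_val = round(len(ordered) * validation)
    train_rows = ordered[:n_train]                # earliest rows
    val_rows = ordered[n_train:n_train + n_val]   # next rows
    test_rows = ordered[n_train + n_val:]         # latest rows
    return train_rows, val_rows, test_rows

# Ten rows in arbitrary order; ISO date strings sort chronologically.
rows = [{"t": f"2024-01-{d:02d}", "value": d} for d in (5, 1, 9, 3, 7, 2, 10, 4, 8, 6)]
train_rows, val_rows, test_rows = chronological_split(rows, "t")
print([r["value"] for r in test_rows])
```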
Setting a value for the ml_use label
To use the ml_use label to control your data split, you first set a value for
the ml_use label for your data items.
How you set the ml_use label for your data depends on your data type:
Setting ml_use for vision, video, and text data
You can set the
ml_use label for vision, video, and text data at data import
time (per data item or for the entire import file), or after data import by
using the Google Cloud Console.
Setting ml_use on individual data items at import time
You can set the
ml_use label on each data item by including a value for the
aiplatform.googleapis.com/ml_use field in your JSONL data, or setting the
value of the first column of the CSV file. See the information about
preparing data for your data type for more details.
If any of your data items are repeated in your data (if the same video, image,
or text snippet appears multiple times in your import file),
Vertex AI uses the ml_use value for the first data item it
encounters, and ignores any subsequent ml_use values. The first encountered
item is not necessarily the item that is nearer to the beginning of the upload
file.
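As a sketch of setting ml_use per item in JSONL, the snippet below builds one import line in Python. Only the aiplatform.googleapis.com/ml_use field name and its three values come from this page; the surrounding keys (`imageGcsUri`, `dataItemResourceLabels`), the `jsonl_line` helper, and the bucket path are assumptions, so check the data preparation documentation for your data type before relying on this layout.

```python
import json

ML_USE_KEY = "aiplatform.googleapis.com/ml_use"
ALLOWED = ("training", "validation", "test")

def jsonl_line(uri, ml_use):
    """Build one JSONL record carrying an ml_use value (hypothetical helper)."""
    if ml_use not in ALLOWED:
        raise ValueError(f"ml_use must be one of {ALLOWED}")
    record = {
        "imageGcsUri": uri,                              # assumed key name
        "dataItemResourceLabels": {ML_USE_KEY: ml_use},  # assumed key name
    }
    return json.dumps(record)

# Hypothetical Cloud Storage path, used only for illustration.
line = jsonl_line("gs://example-bucket/images/rose_001.jpg", "validation")
print(line)
```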
Setting ml_use for entire upload files
If your data can be sorted into different upload files by ml_use value, you
can set the ml_use value for the entire upload file by using the per-file
dropdown menu when you upload files using the Cloud Console, or by
setting the relevant map field in the datasets.import method.
If you set ml_use for an upload file, and the file also contains ml_use
values, the ml_use values in the file take precedence over the file-wide
value.
Setting ml_use after import
After you have uploaded your data, you can set or update the
ml_use value for
specific data items in the Cloud Console by selecting one or more items
in the list view and using the Assign ML use dropdown menu.
If you reupload a data file, even if the ml_use values have changed, it does not
update the ml_use value. You cannot update ml_use values after import by
using the Vertex AI API.
Setting ml_use for tabular data
To set ml_use per row of training data for tabular data, you provide a
data split column in your training data, and specify it when you train your
model. For more information, see
The data split column.