About data splits for AutoML models

This page describes how the three sets (training, validation, and test) are used when you train an AutoML model, and the ways you can control how your data is split into the sets for AutoML models.

How AutoML uses data splits

AutoML uses data splits differently depending on the data type of the training data.

Image

For image data sets, AutoML uses the training set to train the model, and the validation set to validate the results that the model returns during training. When training is complete, AutoML uses the test set to provide the final evaluation metrics.

Tabular

For tabular models, the training process uses the following steps:

  1. Model trials

    The training set is used to train models with different preprocessing, architecture, and hyperparameter option combinations. These models are evaluated on the validation set for quality, which guides the exploration of additional option combinations. The best parameters and architectures determined in the parallel tuning phase are used to train two ensemble models as described below.

  2. Model evaluation

    Vertex AI trains an evaluation model, using the training and validation sets as training data. Vertex AI generates the final model evaluation metrics on this model, using the test set. This is the first time in the process that the test set is used. This approach ensures that the final evaluation metrics are an unbiased reflection of how well the final trained model will perform in production.

  3. Serving model

    A model is trained with the training, validation, and test sets, to maximize the amount of training data. This model is the one that you use to request predictions.

Text

For text data sets, the training and validation sets are used to try different preprocessing, architecture, and hyperparameter option combinations. These trials produce trained models that are then evaluated on the validation set for quality, which guides the exploration of additional option combinations.

When additional trials no longer improve quality, that version of the model is considered the final, best-performing trained model. Next, Vertex AI trains two more models, using the parameters and architecture determined in the parallel tuning phase:

  1. A model trained with your training and validation sets.

    Vertex AI generates the model evaluation metrics on this model, using your test set. This is the first time in the process that the test set is used. This approach ensures that the final evaluation metrics are an unbiased reflection of how well the final trained model will perform in production.

  2. A model trained with your training, validation, and test sets.

    This model is the one that you use to request predictions.

Video

For video data sets, AutoML uses the training set to train the model, and then uses the test set to provide the final evaluation metrics. The validation set is not used for video data sets.

Data split options

You can let Vertex AI divide your data automatically; in many cases, that is the best approach. However, if you want to control how your data is split into sets, you have some options for how to do that.

You have the following options for splitting your data: the default split, the ml_use label, the data filter, custom split percentages, and, for tabular data, a time column. Each option is described in the sections that follow.

You choose only one of these options; you make the choice when you train your model.

Some of these options require changes to the training data (for example, the ml_use label or the Time column). Including the data or labels that a data split option requires does not commit you to using that option; you can still choose a different option when you train your model.

Using the default split

Vertex AI can automatically perform the data split for you. Your data is randomly split into the three sets by percentage, depending on your data type. This is the easiest way to split your data, and works well in most cases.

Set          Text   Image   Video   Tabular
Training     80     80      80      80
Validation   10     10      N/A     10
Test         10     10      20      10

To use the default data split, accept the default in the Cloud Console, or leave the split field empty for the API.
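
For example, the following sketch uses the Vertex AI SDK for Python (google-cloud-aiplatform) and omits all split-related arguments, so Vertex AI applies the default split for image data. The project, location, dataset ID, and display names are placeholders.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholder values

dataset = aiplatform.ImageDataset(
    "projects/my-project/locations/us-central1/datasets/1234567890"  # placeholder ID
)

job = aiplatform.AutoMLImageTrainingJob(
    display_name="flowers-default-split",
    prediction_type="classification",
)

# No fraction, filter, or predefined split arguments are passed, so the
# default 80/10/10 split for image data is applied.
model = job.run(
    dataset=dataset,
    budget_milli_node_hours=8000,
)
```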

Using the ml_use label

To use the ml_use label to control your data split, you must have set the ml_use label on your data. See Setting a value for the ml_use label.

For AutoML data types other than tabular, when you train your model, you specify Manual (Advanced) for the Data split in the Cloud Console. If you are training using the Vertex AI API, you use the FilterSplit object, specifying aiplatform.googleapis.com/ml_use=training for the training filter, aiplatform.googleapis.com/ml_use=validation for the validation filter, and aiplatform.googleapis.com/ml_use=test for the test filter.

Any data items with an ml_use value are assigned to the specified set. Data items that do not have ml_use set are excluded from the training process.

For the tabular data type, you use the Data split column.
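
For data types other than tabular, the API call might look like the following sketch, which uses the Vertex AI SDK for Python and assumes the training_filter_split, validation_filter_split, and test_filter_split arguments that recent SDK versions provide for the FilterSplit object, with the labels.-prefixed filter form used for data item labels. The project, location, dataset ID, and display names are placeholders.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholder values

dataset = aiplatform.ImageDataset(
    "projects/my-project/locations/us-central1/datasets/1234567890"  # placeholder ID
)

job = aiplatform.AutoMLImageTrainingJob(
    display_name="flowers-manual-split",
    prediction_type="classification",
)

# Items are routed to sets according to their ml_use label; items without
# an ml_use value are excluded from training.
model = job.run(
    dataset=dataset,
    training_filter_split="labels.aiplatform.googleapis.com/ml_use=training",
    validation_filter_split="labels.aiplatform.googleapis.com/ml_use=validation",
    test_filter_split="labels.aiplatform.googleapis.com/ml_use=test",
    budget_milli_node_hours=8000,
)
```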

Using the data filter

You can use other labels (besides ml_use) and other fields to split your data by using the FilterSplit object in the Vertex AI API. For example, you could set the trainingFilter to labels.flower=rose, the validationFilter to labels.flower=daisy, and the testFilter to labels.flower=dahlia. This setting adds all data labeled rose to the training set, all data labeled daisy to the validation set, and all data labeled dahlia to the test set.

If you filter on multiple fields, a data item could match more than one filter. In this case, the training set takes precedence, followed by the validation set, followed by the test set. In other words, an item is put into the test set only if it matches the test filter but matches neither the training nor the validation filter. If an item does not match any of the filters, it is excluded from training.

Do not base your data split on the categories that the model will predict; each set should reflect the full range of data that the model will use to make predictions. (For example, do not use the filters described previously for a model that is expected to categorize pictures by flower type.)

If you do not want a filter to match any items, set it to "-" (the minus sign).

The data filter is not supported for tabular data.
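
The flower example might look like the following sketch; the filter strings would be passed to the same filter-split arguments assumed in the previous sketch.

```python
# Hypothetical label filters matching the example above.
training_filter = "labels.flower=rose"      # rose-labeled items -> training set
validation_filter = "labels.flower=daisy"   # daisy-labeled items -> validation set
test_filter = "labels.flower=dahlia"        # dahlia-labeled items -> test set

# To make a filter match no items, set it to "-":
# validation_filter = "-"
```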

Changing the percentages

If you do not choose another way to control your data split, your data is randomly split into the sets according to the default percentages for your data type. You can change the percentages to any values that add up to 100 (for the Vertex AI API, you use fractions that add up to 1.0).

To change the percentages (fractions), you use the FractionSplit object to define your fractions. For image, text, and video data types, you can also use the Cloud Console to update your split percentages when you train your model.
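
For example, the following sketch uses the Vertex AI SDK for Python to train a text classification model with a 70/20/10 split instead of the default; the fractions must sum to 1.0, and the resource names are placeholders.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholder values

dataset = aiplatform.TextDataset(
    "projects/my-project/locations/us-central1/datasets/1234567890"  # placeholder ID
)

job = aiplatform.AutoMLTextTrainingJob(
    display_name="reviews-custom-fractions",
    prediction_type="classification",
)

model = job.run(
    dataset=dataset,
    # 70/20/10 instead of the 80/10/10 default for text; must sum to 1.0.
    training_fraction_split=0.7,
    validation_fraction_split=0.2,
    test_fraction_split=0.1,
)
```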

Using a time column

For tabular training data, if your data is time-sensitive, you can designate one column as a Time column. Vertex AI uses the Time column to split your data, with the earliest rows used for training, the next rows for validation, and the latest rows for testing. If you are using the API, you can also specify the percentages of your data to use for each set. For more information, see The Time column.
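
In the API, this might look like the following sketch, which assumes the timestamp_split_column_name argument on AutoMLTabularTrainingJob.run in the Vertex AI SDK for Python; the column names and resource names are hypothetical.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholder values

dataset = aiplatform.TabularDataset(
    "projects/my-project/locations/us-central1/datasets/1234567890"  # placeholder ID
)

job = aiplatform.AutoMLTabularTrainingJob(
    display_name="sales-time-split",
    optimization_prediction_type="regression",
)

model = job.run(
    dataset=dataset,
    target_column="weekly_sales",                    # hypothetical target column
    # Rows are ordered by this column: earliest rows train, latest rows test.
    timestamp_split_column_name="transaction_time",  # hypothetical Time column
    # Optional: override the chronological split percentages.
    training_fraction_split=0.8,
    validation_fraction_split=0.1,
    test_fraction_split=0.1,
)
```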

Setting a value for the ml_use label

To use the ml_use label to control your data split, you first set a value for the ml_use label for your data items.

How you set the ml_use label for your data depends on your data type:

Setting ml_use for vision, video, and text data

You can set the ml_use label for vision, video, and text data at data import time (per data item or for the entire import file), or after data import by using the Google Cloud Console.

Setting ml_use on individual data items at import time

You can set the ml_use label on each data item by including a value for the aiplatform.googleapis.com/ml_use field in your JSONL data, or setting the value of the first column of the CSV file. See the information about preparing data for your data type for more details.

If any of your data items are repeated in your import file (if the same video, image, or text snippet appears multiple times), Vertex AI uses the ml_use value of the first data item it encounters and ignores any later ml_use values. The first item encountered is not necessarily the one that appears earliest in the file.
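
For example, the following sketch writes a JSONL import file that sets ml_use per data item. The annotation fields shown are illustrative for single-label image classification; see the data preparation guide for your data type for the exact schema.

```python
import json

# Hypothetical data items; the uri, label, and ml_use values are placeholders.
items = [
    {"uri": "gs://my-bucket/roses/001.jpg", "label": "rose", "ml_use": "training"},
    {"uri": "gs://my-bucket/daisies/002.jpg", "label": "daisy", "ml_use": "validation"},
    {"uri": "gs://my-bucket/dahlias/003.jpg", "label": "dahlia", "ml_use": "test"},
]

with open("import.jsonl", "w") as f:
    for item in items:
        f.write(json.dumps({
            "imageGcsUri": item["uri"],
            "classificationAnnotation": {"displayName": item["label"]},
            # Per-item split assignment that Vertex AI reads at import time.
            "dataItemResourceLabels": {
                "aiplatform.googleapis.com/ml_use": item["ml_use"],
            },
        }) + "\n")
```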

Setting ml_use for entire upload files

If your data can be sorted into different upload files by ml_use value, you can set the ml_use value for the entire upload file by using the per-file dropdown menu when you upload files using the Cloud Console, or by using the dataItemLabels map field in the datasets.import method.

If you set ml_use for an upload file and individual items in that file also have ml_use values, the per-item values take precedence over the file-wide value.
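
For example, the following sketch uses the Vertex AI SDK for Python and assumes the data_item_labels argument on import_data (the SDK counterpart of the dataItemLabels map); the bucket, file, and dataset values are placeholders, and every item in the hypothetical import file is assigned to the test set.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholder values

dataset = aiplatform.ImageDataset(
    "projects/my-project/locations/us-central1/datasets/1234567890"  # placeholder ID
)

dataset.import_data(
    gcs_source="gs://my-bucket/test_items.jsonl",  # hypothetical import file
    import_schema_uri=aiplatform.schema.dataset.ioformat.image.single_label_classification,
    # Every item in this file is assigned to the test set, unless an item
    # carries its own ml_use value, which takes precedence.
    data_item_labels={"aiplatform.googleapis.com/ml_use": "test"},
)
```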

Setting ml_use after import

After you have uploaded your data, you can set or update the ml_use value for specific data items in the Cloud Console by selecting one or more items in the list view and using the Assign ML use dropdown menu.

Reuploading a data file does not update the ml_use values of existing data items, even if the values in the file have changed. You cannot update ml_use values after import by using the Vertex AI API.

Setting ml_use for tabular data

To specify ml_use per row of training data for tabular data, you provide a data split column in your training data, and specify it when you train your model. For more information, see The data split column.
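
In the API, this might look like the following sketch, which assumes the predefined_split_column_name argument on AutoMLTabularTrainingJob.run; the column names are hypothetical, and the values in the split column follow the conventions described in The data split column.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholder values

dataset = aiplatform.TabularDataset(
    "projects/my-project/locations/us-central1/datasets/1234567890"  # placeholder ID
)

job = aiplatform.AutoMLTabularTrainingJob(
    display_name="churn-manual-split",
    optimization_prediction_type="classification",
)

model = job.run(
    dataset=dataset,
    target_column="churned",                    # hypothetical target column
    # Column that holds the per-row set assignment (the data split column).
    predefined_split_column_name="data_split",  # hypothetical column name
)
```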