About data splits for AutoML models

This page describes how Vertex AI uses the training, validation, and test sets of your data to train an AutoML model and the ways you can control how your data is split among these three sets. AutoML uses data splits differently depending on the data type of the training data.

This page describes data splits for image, text, and video data. For information about data splits for tabular data, see Data splits for tabular data.

Image

For image datasets, AutoML uses the training set to train the model, and the validation set to validate the results that the model returns during training. When training is complete, AutoML uses the test set to provide the final evaluation metrics.

Text

For text datasets, Vertex AI uses the training and validation sets to try different combinations of preprocessing, architecture, and hyperparameter options. These trials produce trained models that are then evaluated on the validation set for quality and to guide exploration of additional option combinations.

When additional trials no longer lead to quality improvements, that version of the model is considered the final, best performing trained model. Next, Vertex AI trains two more models, using the parameters and architecture determined in the parallel tuning phase:

  1. A model trained with your training and validation sets.

    Vertex AI generates the model evaluation metrics on this model, using your test set. This is the first time in the process that the test set is used. This approach ensures that the final evaluation metrics are an unbiased reflection of how well the final trained model will perform in production.

  2. A model trained with your training, validation, and test sets.

    This model is the one that you use to request predictions.

Video

For video datasets, AutoML uses the training set to train the model, and then uses the test set to provide the final evaluation metrics. The validation set is not used for video datasets.

You can let Vertex AI divide your data automatically. Your data is randomly split into the three sets by percentage. This is the easiest way to split your data, and works well in most cases.

Set          Text    Image    Video
Training     80%     80%      80%
Validation   10%     10%      N/A
Test         10%     10%      20%

To use the default data split, accept the default in the Google Cloud console, or leave the split field empty for the API.
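For example, with the Vertex AI SDK for Python, omitting all split arguments from a training job's run call uses the default split (a minimal sketch, assuming job is an AutoML training job and dataset is an imported dataset, both created with the SDK; the display name is a placeholder):

# With no split arguments, Vertex AI applies the default data split.
model = job.run(
    dataset=dataset,
    model_display_name="default-split-model",  # hypothetical display name
)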

If you want to control how your data is split into sets, you have the following options:

  * Manual split: Use the ml_use label to assign individual data items to a set.
  * Data filter split: Use filters on data item labels or other fields to assign data items to sets.
  * Mathematical split: Specify the percentage of your data to assign to each set.

You choose only one of these options when you train your model. Some of these options require changes to the training data (for example, the ml_use label). Including data or labels for a data split option does not commit you to that option; you can still choose a different option when you train your model.

Manual split for unstructured data

The manual split is also known as "predefined split".

To control your data split with the ml_use label, you must first set the ml_use label on your data.

Set a value for the ml_use label

You can set the ml_use label for image, video, and text data at data import time (per data item or for the entire import file), or after data import by using the Google Cloud console.

Setting ml_use on individual data items at import time

You can set the ml_use label on each data item by including a value for the aiplatform.googleapis.com/ml_use field in your JSON Lines data, or by setting the value of the first column of your CSV import file. See the information about preparing data for your data type for more details.
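For example, for a single-label image classification import file, a per-item ml_use value might look like the following (a sketch only; the exact fields, such as imageGcsUri and dataItemResourceLabels, and the CSV column order depend on your data type, so check the data preparation page for your data type):

{"imageGcsUri": "gs://my-bucket/roses/img_001.jpg", "classificationAnnotation": {"displayName": "rose"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}

The equivalent CSV row, with the ml_use value in the first column:

training,gs://my-bucket/roses/img_001.jpg,rose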

If any of your data items are repeated in your data (if the same video, image, or text snippet appears multiple times in your import file), Vertex AI uses the ml_use value for the first data item it encounters and ignores any subsequent ml_use values. The first encountered item is not necessarily the item that appears closest to the beginning of the import file.

Setting ml_use for entire import files

If your data can be sorted into different import files by ml_use value, you can set the ml_use value for an entire import file by using the per-file drop-down menu when you import files in the Google Cloud console, or by using the dataItemLabels map field in the datasets.import method.
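With the Vertex AI SDK for Python, a file-wide value can be passed through the data_item_labels argument when you create a dataset and import data (a minimal sketch, assuming a single-label image classification import; the project, bucket path, and display name are placeholders):

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # hypothetical project

# Every item imported from this file receives ml_use=training.
dataset = aiplatform.ImageDataset.create(
    display_name="flowers",  # hypothetical display name
    gcs_source="gs://my-bucket/import/training_items.jsonl",  # hypothetical path
    import_schema_uri=aiplatform.schema.dataset.ioformat.image.single_label_classification,
    data_item_labels={"aiplatform.googleapis.com/ml_use": "training"},
)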

If you set ml_use for an import file and the file also contains per-item ml_use values, the per-item values in the file take precedence over the file-wide value.

Setting ml_use after import

After you have uploaded your data, you can set or update the ml_use value for specific data items in the Google Cloud console by selecting one or more items in the list view and using the Assign ML use drop-down menu.

Reimporting a data file does not update the ml_use values of existing data items, even if the ml_use values in the file have changed. You can't update ml_use values after import by using the Vertex AI API.

Use the ml_use label

When you train your model, you specify Manual (Advanced) for the Data split in the Google Cloud console. If you are training by using the Vertex AI API, you use the FilterSplit object, specifying labels.aiplatform.googleapis.com/ml_use=training for the training filter, labels.aiplatform.googleapis.com/ml_use=validation for the validation filter, and labels.aiplatform.googleapis.com/ml_use=test for the test filter. In the Vertex AI SDK for Python, these filters are exposed as arguments to the training job's run method. For example:

# job is an AutoML training job and dataset is an imported dataset,
# both created earlier with the Vertex AI SDK for Python.
model = job.run(
    dataset=dataset,
    model_display_name="my-automl-model",  # hypothetical display name
    training_filter_split="labels.aiplatform.googleapis.com/ml_use=training",
    validation_filter_split="labels.aiplatform.googleapis.com/ml_use=validation",
    test_filter_split="labels.aiplatform.googleapis.com/ml_use=test",
)

Any data items with an ml_use value are assigned to the specified set. Data items that don't have ml_use set are excluded from the training process.

Data filter split

You can use other labels (besides ml_use) and other fields to split your data by using the FilterSplit object in the Vertex AI API. For example, you could set the trainingFilter to labels.flower=rose, the validationFilter to labels.flower=daisy, and the testFilter to labels.flower=dahlia. This setting would add all data labeled rose to the training set, all data labeled daisy to the validation set, and all data labeled dahlia to the test set.

If you filter on multiple fields, a data item could match more than one filter. In this case, the training set takes precedence, followed by the validation set, followed by the test set. In other words, an item is put into the test set only if it matches the test filter but matches neither the training filter nor the validation filter. If an item does not match the filter for any of the sets, it is excluded from training.

Don't base your data split on categories that the model will be predicting; each of your sets must reflect the full range of data that the model will make predictions on. (For example, don't use the filters described previously for a model that is expected to categorize pictures by flower type.)

If you don't want a filter to match any items, set it to "-" (the minus sign).
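For example, with the Vertex AI SDK for Python, you might combine filters with "-" so that no items are assigned to the validation set (a sketch, assuming the job and dataset from the earlier example; the display name is a placeholder):

# "-" matches no items, so the validation set is left empty.
model = job.run(
    dataset=dataset,
    model_display_name="filter-split-model",  # hypothetical display name
    training_filter_split="labels.aiplatform.googleapis.com/ml_use=training",
    validation_filter_split="-",
    test_filter_split="labels.aiplatform.googleapis.com/ml_use=test",
)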

Mathematical split

The mathematical split is also known as "fraction split".

By default, your data is randomly split into the sets according to the default percentages for your data type. You can change the percentages to any values that add up to 100 (for the Vertex AI API, you use fractions that add up to 1.0).

To change the percentages (fractions), you use the FractionSplit object to define your fractions. For image, text, and video data types, you can also use the Google Cloud console to update your split percentages when you train your model.
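In the Vertex AI SDK for Python, the fractions are exposed as arguments to the training job's run method (a sketch, assuming the job and dataset from the earlier examples; the display name is a placeholder and the fractions must sum to 1.0):

# An 80/10/10 split expressed as fractions.
model = job.run(
    dataset=dataset,
    model_display_name="fraction-split-model",  # hypothetical display name
    training_fraction_split=0.8,
    validation_fraction_split=0.1,
    test_fraction_split=0.1,
)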