Prepare tabular training data

This page describes how to prepare your tabular data for use in a Vertex AI dataset. Vertex AI datasets can be used to train AutoML models or custom-trained models. The quality of your training data impacts the effectiveness of the models you create.

See also Best practices for creating tabular training data and Data types for tabular data.

Identify your problem

The first step in creating effective tabular training data is to make sure that your problem is well defined and will yield the prediction results you need. If you are new to machine learning, you should consider using AutoML to train your model. AutoML models can be used for the following objectives:

  • A regression model analyzes your tabular data and returns a numeric value. For example, you could train a model to estimate the value of a house.

  • A classification model analyzes your tabular data and returns a list of categories that describe the data. For example, you could train a model that predicts whether a customer will buy a subscription based on their purchase history.

  • A forecasting model (Preview) uses multiple rows of time-dependent tabular data from the past to predict a series of numeric values that extend into the future. For example, by forecasting future product demand, a retail organization could optimize its supply chain to reduce the chance of overstocking or selling out of that product.

Training data structure

The requirements for your training data depend on your model objective. Select your objective below:

Classification/Regression

Your training data must conform to the following basic requirements:

  • It must be 100 GB or smaller.

  • There must be at least two and no more than 1,000 columns.

    For datasets that train AutoML models, one column must be the target, and there must be at least one feature available to train the model. If the training data does not include the target column, Vertex AI cannot associate the training data with the desired result.

    Ideally, your training data has many more than two columns.

  • There must be at least 1,000 and no more than 100,000,000 rows.

    Depending on how many features your dataset has, 1,000 rows might not be enough to train a high-performing model. Learn more.

  • You must use the appropriate data format (wide or narrow) for your objective.

    For classification and regression models, wide format is generally best, with every row representing one training data item (product, person, and so on).

    Learn more about data formats.
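
If your data is small enough to load locally, you can sanity-check these limits before importing. Here is a minimal sketch using pandas; the file name training_data.csv and the column name target are placeholders for your own data:

import pandas as pd

# Load the training data (hypothetical file name).
df = pd.read_csv("training_data.csv")

n_rows, n_cols = df.shape
assert 2 <= n_cols <= 1_000, f"column count {n_cols} is outside 2-1,000"
assert 1_000 <= n_rows <= 100_000_000, f"row count {n_rows} is outside 1,000-100,000,000"

# Replace "target" with the name of your target column.
assert "target" in df.columns, "target column is missing"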

Forecasting

For forecasting models (Preview), your training data must conform to the following basic requirements:

  • It must be 100 GB or smaller.

  • There must be at least two and no more than 1,000 columns.

    For datasets that train AutoML models, one column must be the target, and there must be at least one feature available to train the model. If the training data does not include the target column, Vertex AI cannot associate the training data with the desired result.

    Ideally, your training data has many more than two columns.

  • There must be at least 1,000 and no more than 100,000,000 rows.

    Depending on how many features your dataset has, 1,000 rows might not be enough to train a high-performing model. Learn more.

  • Use narrow (sometimes called long) data format. In narrow format, each row represents the item specified by the time series identifier for a specific point in time, along with all of the data for that item at that point in time. A sketch that converts wide data to narrow format follows this list.

    Learn more about data formats.

  • You must specify a Time column, and it must have a value for every row.

    The Time column is required, and is used to place the observation represented by that row in time.

  • You must specify a time series identifier column, and it must have a value for every row.

    Learn more.

  • The interval between your training rows must be consistent.

    This is your data frequency; it will affect how the model is trained and the frequency of prediction results. Learn more.
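
If your historical data is currently in wide format (one row per item, one column per time period), you can reshape it into the required narrow format before import. Here is a minimal sketch using pandas; the column names sku_id, date, and sales are hypothetical:

import pandas as pd

# Wide format: one row per SKU, one column per date (hypothetical layout).
wide = pd.DataFrame({
    "sku_id": ["sku_id_1", "sku_id_2"],
    "2020-09-21": [10, 7],
    "2020-09-22": [23, 3],
})

# Narrow (long) format: one row per SKU per point in time.
narrow = wide.melt(id_vars="sku_id", var_name="date", value_name="sales")
narrow = narrow.sort_values(["sku_id", "date"])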

Preparing your import source

You can provide model training data to Vertex AI in two ways:

  • BigQuery tables or views
  • CSV files, in Cloud Storage or on your local computer

Which source you use depends on how your data is stored, and on the size and complexity of your data. If your dataset is small and you don't need complex data types, CSV might be easier. For larger datasets, or for data that includes arrays and structs, you must use BigQuery.

BigQuery

Your BigQuery table or view must conform to the BigQuery location requirements.

If your BigQuery table or view is in a different project than the project where you are creating your Vertex AI dataset, or your BigQuery table or view is backed by an external data source, you might need to add one or more roles to the Vertex AI Service Agent. See Role addition requirements for BigQuery.

You do not need to specify a schema for your BigQuery table. Vertex AI automatically infers the schema for your table when you import your data.

Your BigQuery URI (which specifies the location of your training data) must conform to the following format:

bq://<project_id>.<dataset_id>.<table_id>

The URI cannot contain any other special characters.
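
For example, here is a minimal sketch that creates a Vertex AI tabular dataset from a BigQuery table by using the Vertex AI SDK for Python; the project, location, and table identifiers are placeholders:

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# The BigQuery URI follows the bq://<project_id>.<dataset_id>.<table_id> format.
dataset = aiplatform.TabularDataset.create(
    display_name="my-tabular-dataset",
    bq_source="bq://my-project.my_dataset.my_table",
)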

For information about BigQuery data types and how they map into Vertex AI, see BigQuery tables. For more information about using BigQuery external data sources, see Introduction to external data sources.

CSV files

CSV files can be in Cloud Storage, or on your local computer. They must conform to the following requirements:

  • The first line of the first file must be a header, containing the names of the columns. If the first row of a subsequent file is the same as the header, it is also treated as a header; otherwise, it is treated as data.
  • Column names can include any alphanumeric character or an underscore (_). The column name cannot begin with an underscore.
  • Each file must not be larger than 10 GB.

    You can include multiple files, up to a combined maximum of 100 GB.

  • The delimiter must be a comma (",").

You do not need to specify a schema for your CSV data. Vertex AI automatically infers the schema for your table when you import your data, and uses the header row for column names.

For more information about CSV file format and data types, see CSV files.

If you are importing your data from Cloud Storage, it must be in a bucket that meets the Cloud Storage bucket requirements.

If you are importing your data from your local computer, you must have a Cloud Storage bucket that meets those same requirements.
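
If your CSV files are already in Cloud Storage, you can create the dataset programmatically. Here is a minimal sketch using the Vertex AI SDK for Python; the bucket and file names are placeholders:

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# You can list multiple CSV files, up to a combined maximum of 100 GB.
dataset = aiplatform.TabularDataset.create(
    display_name="my-tabular-dataset",
    gcs_source=[
        "gs://my-bucket/data/part-1.csv",
        "gs://my-bucket/data/part-2.csv",
    ],
)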

Controlling data splits

When you use a dataset to train an AutoML model, your data is divided into three splits: a training set, a validation set, and a test set. For more information about how the sets are used, see How AutoML uses data splits.

The ways you can control your data splits depend on your objective; the forecasting objective supports different options than the classification and regression objectives. Select the tab for your objective below:

Classification/Regression

By default, Vertex AI randomly selects 80% of your data rows for the training set, 10% for the validation set, and 10% for the test set. For datasets that do not change over time, that are relatively balanced, and that reflect the distribution of the data that will be used for predictions in production, the random selection algorithm is usually sufficient. The key goal is to ensure that your test set accurately represents the data the model will see in production, so that the evaluation metrics provide an accurate signal on how the model will perform on real-world data.

Here are some situations in which you might want to actively choose which rows are used in which data split:

  • You are not training a forecasting model, but your data is time-sensitive.

    In this case, you should use a Time column, or a manual split that results in the most recent data being used as the test set.

  • Your test data includes data from populations that will not be represented in production.

    For example, suppose you are training a model with purchase data from a number of stores. You know, however, that the model will be used primarily to make predictions for stores that are not in the training data. To ensure that the model can generalize to unseen stores, you should segregate your data sets by store. In other words, your test set should include only stores that are not in the validation set, and the validation set should include only stores that are not in the training set.

  • Your classes are imbalanced.

    If you have many more of one class than another in your training data, you might need to manually include more examples of the minority class in your test data. Vertex AI does not perform stratified sampling, so the test set could include too few or even zero examples of the minority class.

You can control which rows are selected for which split by using one of these approaches:

  • A data split column (manual split)
  • A Time column (chronological split)
  • Split percentages (random split)

You can specify only one of these options when you train a model.

Manual split

The data split column enables you to select specific rows to be used for training, validation, and testing. When you create your training data, you add a column that can contain one of the following (case sensitive) values:

  • TRAIN
  • VALIDATE
  • TEST
  • UNASSIGNED

The values in this column must be one of the following two combinations:

  • All of TRAIN, VALIDATE, and TEST
  • Only TEST and UNASSIGNED

Every row must have a value for this column; it cannot be the empty string.

For example, with all sets specified:

"TRAIN","John","Doe","555-55-5555"
"TEST","Jane","Doe","444-44-4444"
"TRAIN","Roger","Rogers","123-45-6789"
"VALIDATE","Sarah","Smith","333-33-3333"

With only the test set specified:

"UNASSIGNED","John","Doe","555-55-5555"
"TEST","Jane","Doe","444-44-4444"
"UNASSIGNED","Roger","Rogers","123-45-6789"
"UNASSIGNED","Sarah","Smith","333-33-3333"

The data split column can have any valid column name; its transformation type can be Categorical, Text, or Auto.

If the value of the data split column is UNASSIGNED, Vertex AI automatically assigns that row to the training or validation set.

When you train your model, you select a Manual data split and specify this column as the data split column.
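
For example, with the Vertex AI SDK for Python, the data split column is passed as predefined_split_column_name. This sketch assumes a classification dataset with a hypothetical target column named will_buy and a split column named split:

from google.cloud import aiplatform

job = aiplatform.AutoMLTabularTrainingJob(
    display_name="manual-split-job",
    optimization_prediction_type="classification",
)

model = job.run(
    dataset=dataset,                       # a TabularDataset created earlier
    target_column="will_buy",              # hypothetical target column
    predefined_split_column_name="split",  # the data split column
    budget_milli_node_hours=1000,
)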

Splitting your data chronologically

For regression and classification models, you can use the Time column to tell Vertex AI that time matters for your data (that is, the data is not randomly distributed over time). When you specify the Time column, Vertex AI uses the earliest rows for training, the next rows for validation, and the latest rows for testing.

For the classification and regression objectives, Vertex AI treats each row as an independent and identically distributed training example; setting the Time column does not change this. The Time column is used only to split the data set.

If you specify a Time column, you must include a value for the Time column for every row in your dataset. Make sure that the Time column has enough distinct values, so that the evaluation and test sets are non-empty. Usually, having at least 20 distinct values should be sufficient.

The data in the Time column must conform to one of the formats supported by the timestamp transformation. However, the Time column can have any supported transformation, because the transformation affects only how that column is used in training; transformations do not affect the data split.

When you specify a Time column by using the Vertex AI API, you can also change the percentages of the training data that get assigned to each set.

When you configure your training options for a new model, you select this column as the Time column.
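
For example, with the Vertex AI SDK for Python, the Time column is passed as timestamp_split_column_name, and the optional fraction arguments override the default 80/10/10 percentages. The column names are hypothetical, and the job object is the same kind shown in the manual split sketch:

model = job.run(
    dataset=dataset,                              # a TabularDataset created earlier
    target_column="will_buy",                     # hypothetical target column
    timestamp_split_column_name="purchase_time",  # the Time column
    training_fraction_split=0.8,                  # optional: override the defaults
    validation_fraction_split=0.1,
    test_fraction_split=0.1,
)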

Splitting your data using split percentages

By default, the percentages of training data used for the training, validation, and test sets are 80, 10, and 10, respectively. You can change the percentages when you train your model by using the TrainingPipeline.create method of the Vertex AI API.

When you change the percentages, the default (random) split method is used for your model.
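
With the Vertex AI SDK for Python, the equivalent of setting these percentages through TrainingPipeline.create is the three fraction arguments on the training job's run method, reusing the hypothetical job and dataset objects from the sketches above; the three fractions are expected to sum to 1.0:

model = job.run(
    dataset=dataset,
    target_column="will_buy",      # hypothetical target column
    training_fraction_split=0.7,   # the three fractions sum to 1.0
    validation_fraction_split=0.2,
    test_fraction_split=0.1,
)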

Forecasting

By default, Vertex AI uses a chronological split algorithm to separate your forecasting (Preview) data into the three data splits. In most cases, the default split is recommended. If you want to control which training data rows are used for which data set, use a manual split.

The default (chronological) data split works as follows:

  1. The training data is sorted by date.
  2. Using the predetermined set percentages (80/10/10), the time period covered by the training data is separated into three blocks, one for each data set.
  3. Empty rows are added to the beginning of each time series to enable the model to learn from rows that do not have enough history (context window). The number of added rows is the size of the context window set at training time.

  4. Using the forecast horizon size as set at training time, each row whose future data (forecast horizon) falls fully into one of the data sets is used for that set. (Rows whose forecast horizon straddles two sets are discarded to avoid data leakage.)

    Chronological split diagram
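
The context window and forecast horizon used in steps 3 and 4 are set when you train the model. Here is a minimal sketch using the Vertex AI SDK for Python; the column names, horizon, and window sizes are hypothetical:

from google.cloud import aiplatform

job = aiplatform.AutoMLForecastingTrainingJob(
    display_name="forecast-job",
    optimization_objective="minimize-rmse",
)

model = job.run(
    dataset=dataset,                           # a TimeSeriesDataset created earlier
    target_column="sales",                     # hypothetical target column
    time_column="date",                        # the Time column
    time_series_identifier_column="sku_id",    # the time series identifier column
    available_at_forecast_columns=["date"],
    unavailable_at_forecast_columns=["sales"],
    forecast_horizon=14,                       # predict 14 periods into the future
    context_window=14,                         # rows of history the model learns from
    data_granularity_unit="day",               # the consistent data frequency
    data_granularity_count=1,
    budget_milli_node_hours=1000,
)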

Manual split

The data split column enables you to select specific rows to be used for training, validation, and testing. When you create your training data, you add a column that can contain one of the following (case sensitive) values:

  • TRAIN
  • VALIDATE
  • TEST

Every row must have a value for this column; it cannot be the empty string.

For example:

"TRAIN","sku_id_1","2020-09-21","10"
"TEST","sku_id_1","2020-09-22","23"
"TRAIN","sku_id_2","2020-09-22","3"
"VALIDATE","sku_id_2","2020-09-23","45"

The data split column can have any valid column name; its transformation type can be Categorical, Text, or Auto.

When you train your model, you select a Manual data split and specify this column as the data split column.

For more information about creating a manual split for a forecasting model, see Best practices for splitting your time series data.
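
With the Vertex AI SDK for Python, the same predefined_split_column_name argument shown for classification and regression also applies to the forecasting job's run method. For example, reusing the hypothetical job and dataset from the chronological split sketch, and a split column named split:

model = job.run(
    dataset=dataset,
    target_column="sales",
    time_column="date",
    time_series_identifier_column="sku_id",
    available_at_forecast_columns=["date"],
    unavailable_at_forecast_columns=["sales"],
    forecast_horizon=14,
    context_window=14,
    data_granularity_unit="day",
    data_granularity_count=1,
    predefined_split_column_name="split",  # the data split column
    budget_milli_node_hours=1000,
)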

About using weights in your training data

By default, Vertex AI weights each row of your training data equally; no row is considered to be more important for training purposes than another.

Sometimes, you might want some rows to have more importance for training. For example, if you are using spending data, you might want the data associated with higher spenders to have a larger impact on the model. If missing a specific outcome is something you particularly want to avoid, then you can weight rows with that outcome more heavily.

You give rows a relative weight by adding a weight column to your dataset. The weight column must be a Numerical column. The weight value can range from 0 to 10,000. Higher values indicate that the row is more important when training the model. A weight of 0 causes the row to be ignored. If you include a weight column, it must contain a value for every row.

Later, when you train your model, you specify this column as the Weight column.

Custom weighting schemes are used only for training the model; they do not affect the test set used for model evaluation.
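
With the Vertex AI SDK for Python, the weight column is passed as weight_column on the training job's run method; the column names here are hypothetical:

model = job.run(
    dataset=dataset,
    target_column="will_buy",    # hypothetical target column
    weight_column="row_weight",  # hypothetical Numerical column, values 0-10,000
    budget_milli_node_hours=1000,
)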

Requirements for the target column

The requirements for the target column vary depending on your model objective. Select your objective below:

Classification/Regression

For regression and classification models, the target column must conform to the following requirements:

  • It must be either Categorical or Numerical.
  • If it is Categorical, it must have at least 2 and no more than 500 distinct values.

Forecasting

For forecasting models (Preview), the target column must be Numerical.

What's next