Best practices for creating tabular training data

Well-designed data increases the quality of the resulting machine learning model. Use the guidelines on this page to increase the quality of your training data and model.

The following topics are covered:

Best practices for all tabular data models
Best practices for tabular AutoML models
Best practices for tabular forecasting models
Best practices for data format

Best practices for all tabular data models

The following best practices apply whether you create an AutoML model or a custom-trained model that uses tabular data.

Avoid data leakage

Data leakage happens when your training data includes predictive information that is not available when you ask for an inference. Data leakage can cause your model to show excellent evaluation metrics, but perform poorly on real data.

For example, suppose you want to know how much ice cream your store will sell tomorrow. You can't include the target day's temperature in your training data, because you don't know the temperature (it hasn't happened yet). However, you can use the predicted temperature from the previous day, which you can include in the inference request.

Data leakage can also occur when you use the same data in multiple data splits. If you use time series data, make sure that data from the same date is only used in one of the three data splits.

Avoid training-serving skew

Training-serving skew happens when you generate your training data differently than you generate the data you use to request inferences.

For example, if you use an average value, and for training purposes you average over 10 days, but when you request inference, you average over the last month.

In general, review any difference between how you generate your training data and your serving data (the data you use to generate inferences) to prevent training-serving skew.

Training-serving skew and data distribution

Training-serving skew can also occur based on your data distribution in your training, validation, and testing data splits. There is frequently a difference between the data distribution that a model sees when it's deployed in production versus the data distribution of the dataset that a model is trained on. For example, in production, a model may be applied on an entirely different population of users than those seen during training, or the model may be used to make inferences 30 days after the final training data was recorded.

For best results, make sure that the distribution of the data splits used to create your model accurately reflects the difference between the training data, and the data that you make inferences on in your production environment. Vertex AI can produce non-monotonic inferences, and if the production data is sampled from a very different distribution than the training data, non-monotonic inferences are not very reliable.

Furthermore, the difference between production data and training data must be reflected in the difference between the validation data split and the training data split, and between the testing data split and the validation data split.

For example, if you plan on making inferences about user lifetime value (LTV) over the next 30 days, then make sure that the data in your validation data split is from 30 days after the data in your training data split, and that the data in your testing data split is from 30 days after your validation data split.

Similarly, if you want your model to be tuned to make generalized inferences about new users, then make sure that data from a specific user is only contained in a single split of your training data. For example, all of the rows that pertain to user1 are in the training data split, all of the rows that pertain to user2 are in the validation data split, and all of the rows that pertain to user3 are in the testing data split.

Provide a time signal

For classification and regression models, if the underlying pattern in your data is likely to shift over time (it is not randomly distributed in time), provide that information to Vertex AI. You can provide a time signal in several ways:

If each row of data has a timestamp, make sure that column is included, has a transformation type of Timestamp, and is set as the Time column when you train your model. This ordering is used to split the data, with the most recent data as the test data, and the earliest data as the training data. Learn more.
If your time column doesn't have many distinct values, use a manual split instead of using the Time column to split your data. Otherwise, you might not get enough rows in each dataset, which can cause training to fail.
If the time information is not contained in a single column, you can use a manual data split to use the most recent data as the test data, and the earliest data as the training data.

Make information explicit where needed

For some data primitives, you can improve model quality by engineering features.

For example, if your data includes longitude and latitude, these columns are treated as numeric, with no special calculations. If location or distance provides signal for your problem, engineer a feature that provides that information explicitly.

Some data types that might require feature engineering:

Longitude/Latitude
URLs
IP addresses
Email addresses
Phone numbers
Other geographic codes (for example, postal codes)

Include calculated or aggregated data in a row

Vertex AI uses only the input data in a single row to predict the target value for that row. If you have calculated or aggregated data from other rows or sources that would be valuable in determining the predicted value for a row, include that data with the source row. Be careful that your new column does not cause data leakage or training-serving skew.

For example, if you want to predict next week's demand for a product, improve the quality of the inference by including columns with the following values:

The total number of items in stock from the same category as the product.
The average price of items in stock from the same category as the product.
The number of days before a known holiday when the inference is requested.

In another example, if you want to predict whether a specific user will buy a product, improve the quality of the inference by including columns with the following values:

The average historic conversion rate or click-through rate for the specific user.
How many products are currently in the user's shopping cart.

Avoid bias

Make sure that your training data is representative of the entire universe of potential data that you make inferences for. For example, if you have customers that live all over the world, don't use training data from only one country.

Best practices for tabular AutoML models

The following best practices are for creating tabular training data for AutoML tabular models.

Represent null values appropriately

If you import from CSV, use empty strings to represent null values. From BigQuery, use the NULL value.

If your data uses special characters or numbers to represent null values, including zero, these values are misinterpreted, reducing model quality.

Avoid missing values where possible

Check your data for missing values, and correct them if possible. Otherwise, you can leave the value blank, and it is treated as a null value.

Use spaces to separate text

Vertex AI tokenizes text strings and can derive training signal from individual words. It uses spaces to separate words; words separated by other characters are treated as a single entity.

For example, if you provide the text "red/green/blue", it is not tokenized into "red", "green", and "blue". If those individual words might be important for training the model, transform the text to "red green blue" before including it in your training data.

Make sure your categorical features are accurate and clean

Data inconsistencies can cause categories to be incorrectly split. For example, if your data includes "Brown" and "brown", Vertex AI uses those values as separate categories, when you might have intended them to be the same. Misspellings can have a similar effect. Make sure you remove these kinds of inconsistencies from your categorical data before creating your training data.

Use extra care with imbalanced classes for classification models

If you have imbalanced classes (a classification problem with one or more outcomes that is seen rarely), review the following tips.

Provide sufficient training data for the minority class

Having too few rows of data for one class degrades model quality. If possible, provide at least 100 rows of data for every class.

Consider using a manual split

Vertex AI selects the rows for the test dataset randomly (but deterministically). For imbalanced classes, you could end up with a small number of the minority class in your test dataset, or even none, which causes training to fail.

If you have imbalanced classes, you might want to assign a manual split to make sure enough rows with the minority outcomes are included in every split.

Provide enough training data

If you don't provide enough training data, the resulting model might perform poorly. The more columns you use to train your model, the more data you need to provide.

Your dataset must always include at least 1,000 rows.

The following table provides some heuristics for how much training data to provide, depending on your objective.

Objective	Suggested minimum amount of training data
Classification	At least 10 times as many rows as you have columns.
Forecasting	At least 10 time series for every column used to train the model.
Regression	At least 50 times as many rows as the number of columns.

Leave all other preprocessing and transformations to Vertex AI

Unless otherwise noted, let Vertex AI do the feature engineering for you when you train an AutoML model. AutoML does best when it has access to your underlying data. For a list of all the transformations AutoML performs by transformation type, see Vertex AI transformations.

Best practices for tabular forecasting models

Training data for forecasting models has some special considerations.

Considerations for choosing the data granularity

When you train a forecasting model, you specify the data granularity, or the time interval between the training data rows. It can be hourly, daily, weekly, monthly, or yearly. In addition, it can be every 1, 5, 10, 15, or 30 minutes.

The data granularity must be consistent throughout the training data, and all batch inference data. If you specify a daily granularity, and there are 2 days between two training data rows, Vertex AI treats the interim day as missing data, which can degrade model performance. Multiple rows in the same time series with the same timestamp (as determined by the granularity) are considered a validation error at training time.

Generally, your data collection practices determine your data granularity.

How to find a good value for the context window

If you expect to have a lot of inference data that does not extend into the past (cold starts), begin by setting the context window to 0. Otherwise, a context window between the size of the forecast horizon and 10 times the size of the forecast horizon should work well.

You can try the following steps to find a good value for your data:

For the first training iteration, set the context window and the forecast horizon to the same value, and set your training budget to at least 6 hours.
Train the model again, with the same training budget, but double the size of the context window to 2 times the size of the forecast horizon.
If the evaluation metrics for the second model show substantial improvement, train the model again, increasing the context window to 5 times the size of the forecast horizon. Consider making a proportional increase to the training budget (if you trained for 10 hours in the first step, increase the training budget to 50 hours).
Continue increasing the context window until you are no longer seeing improved evaluation metrics or until you are satisfied with the results. Revert back to the lowest value of the context window that produced acceptable results.

Increasing the context window has the following effects:

Increases the training time

With a larger context window, the model uses more data points in training, causing the training time to increase.
Increases the required amount of history for inference data

Your inference data should provide as many historical datapoints as the value of the context window.

Best practices for data format

You can create your training data in either wide or narrow format. For regression and classification models, wide format is widely used and can be easier to assemble and review. For forecasting models using narrow format can help you avoid setting up unintentional connections between your data and your target (data leakage).

When you create training data to train a forecasting model, each row should represent a single observation on a single time series. You must have a column that represents your time series identifier (how the time series are distinguished from each other), and a column that represents the value that you'll be predicting (your target). Then every other value in the row that is used to train the model must be present at the time you request an inference for your target.

Consider the following (simplified and abbreviated) sample training data:

Date	Widget_1_Demand	Widget_2_Demand	Widget_3_Demand	Promo	Region
01/01/2019	112	241	0	0	CA
01/02/2019	141	219	0	1	CA
01/03/2019	149	244	0	0	CA
01/01/2019	52	0	43	0	IL
01/02/2019	81	0	26	1	IL
01/03/2019	89	0	86	0	IL

This table, in wide format, shows business data by date, but it wouldn't be usable for a forecasting model in its current form. There is no single target column, no time series ID column, and for any given date, you won't know the demand for the other widgets at the time of inference.

You could convert this table into this format:

Date	Product	Region_CA_Demand	Region_IL_Demand	Promo
01/01/2019	Widget_1	112	52	0
01/02/2019	Widget_1	141	81	1
01/03/2019	Widget_1	149	89	0
01/01/2019	Widget_2	241	0	0
01/02/2019	Widget_2	219	0	1
01/03/2019	Widget_2	244	0	0
01/01/2019	Widget_3	0	43	0
01/02/2019	Widget_3	0	26	1
01/03/2019	Widget_3	0	86	0

We now have a potential time series ID column, Product. However, this format could be used only to predict one of the regions, and the data for the other region would need to be known at the time of inference.

The solution is to convert to narrow format, so that each row represents a single observation. Any data that is independent of the time series is repeated for each row:

Date	Demand	ID	Promo
01/01/2019	112	Widget_1_CA	0
01/02/2019	141	Widget_1_CA	1
01/03/2019	149	Widget_1_CA	0
01/01/2019	52	Widget_1_IL	0
01/02/2019	81	Widget_1_IL	1
01/03/2019	89	Widget_1_IL	0
01/01/2019	241	Widget_2_CA	0
01/02/2019	219	Widget_2_CA	1
01/03/2019	244	Widget_2_CA	0
01/01/2019	0	Widget_2_IL	0
01/02/2019	0	Widget_2_IL	1
01/03/2019	0	Widget_2_IL	0
01/01/2019	0	Widget_3_CA	0
01/02/2019	0	Widget_3_CA	1
01/03/2019	0	Widget_3_CA	0
01/01/2019	43	Widget_3_IL	0
01/02/2019	26	Widget_3_IL	1
01/03/2019	86	Widget_3_IL	0

Now we have a time series identifier (ID), a target column (Demand), and a Time column (Date). In addition, each row is based on a single observation, which can be used to predict the target value. The Promo column is used as a feature to train the model.

In reality, you will have many more rows and many more columns than these examples. But you must follow the guidelines here to structure your data to avoid data leakage.