Preprocessing data for tabular built-in algorithms

This page provides a general overview of how AI Platform Training preprocesses your data for training with tabular built-in algorithms. Additionally, it explains the requirements and limitations for your input data.

Tabular built-in algorithms

Built-in algorithms that accept tabular data (numerical and categorical data) provide automatic preprocessing.

For specific details on how preprocessing works for a particular tabular built-in algorithm, see the guide for that algorithm.

The distributed version of the XGBoost algorithm does not support automatic preprocessing.

Format input data

Your input data must be a CSV file with UTF-8 encoding.

You must prepare your input CSV file to meet the following requirements:

  • Remove the header row. The header row contains the labels for each column. If you leave it in, it is submitted as a data instance along with the rest of the training data.
  • Ensure that the target column is the first column. The target column contains the value that you are trying to predict. For a classification algorithm, every value in the target column is a class or category. For a regression algorithm, every value in the target column is numerical.
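
The two requirements above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a tool provided by AI Platform Training. The function name `prepare_csv`, the column names, and the sample data are all hypothetical.

```python
import csv
import io


def prepare_csv(raw_text, target_name):
    """Return CSV text with the header removed and the target column first.

    Hypothetical helper: assumes the first row of raw_text is a header
    that contains target_name.
    """
    rows = list(csv.reader(io.StringIO(raw_text)))
    header, data = rows[0], rows[1:]
    target_idx = header.index(target_name)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in data:
        # Move the target value to the front; keep the other columns in order.
        writer.writerow([row[target_idx]] + row[:target_idx] + row[target_idx + 1:])
    return out.getvalue()


# Made-up sample data with the target ("label") in the last column.
raw = "age,income,label\n34,52000,yes\n51,61000,no\n"
prepared = prepare_csv(raw, "label")
print(prepared)
```

When saving the result to a file, open it with an explicit encoding, for example `open("train.csv", "w", encoding="utf-8", newline="")`, so that the file is UTF-8 regardless of the platform default.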

How preprocessing works

Automatic preprocessing works for categorical and numerical data. The preprocessing routine first analyzes and then transforms your data.

Analysis

First, AI Platform Training analyzes the dataset column by column. For each column, AI Platform Training automatically detects its data type, identifies how the column should be treated for data transformation, and computes some statistics for the data in the column. The training job captures the results of this analysis in the metadata.json file, which is included with other training artifacts in your Cloud Storage bucket.

  • Type: The column can be either numerical or categorical.
  • Treatment: The algorithm identifies how to treat each column. Columns can be treated as constants or row identifiers. Categorical columns can also be tagged for identity or vocabulary, based on whether the categorical values are integers or strings. A column with a large number of categories gets a hashing treatment to calculate a smaller, more manageable number of categories.
  • Statistics: Statistics are calculated to help transform the features in each column, based on the column's type and treatment.
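
To make the analysis step more concrete, the following sketch assigns a type, a treatment, and some statistics to a single column. It is an illustration under stated assumptions, not the service's actual logic: the function `analyze_column`, the `HASH_THRESHOLD` cutoff, the `"default"` treatment label, and the shape of the returned dictionary are all hypothetical and do not reflect the real contents of metadata.json. The sketch also does not attempt to detect row identifiers or integer-valued categorical columns.

```python
from statistics import mean

HASH_THRESHOLD = 100  # hypothetical cutoff; the real value is not stated here


def is_number(value):
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def analyze_column(values):
    """Return a hypothetical type/treatment/statistics record for one column.

    Empty strings are treated as missing values.
    """
    present = [v for v in values if v != ""]
    distinct = set(present)
    is_numerical = bool(present) and all(is_number(v) for v in present)
    col_type = "numerical" if is_numerical else "categorical"

    stats = {}
    if len(distinct) <= 1:
        treatment = "constant"
    elif col_type == "numerical":
        treatment = "default"  # placeholder label, not a name used by the service
        stats["mean"] = mean(float(v) for v in present)
    elif len(distinct) > HASH_THRESHOLD:
        treatment = "hashing"
    else:
        treatment = "vocabulary"
        stats["vocabulary"] = sorted(distinct)
    return {"type": col_type, "treatment": treatment, "stats": stats}


numeric_info = analyze_column(["0.5", "1.5", ""])
category_info = analyze_column(["fruits", "dairy", "fruits"])
constant_info = analyze_column(["x", "x"])
print(numeric_info, category_info, constant_info, sep="\n")
```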

Transformation

After the initial analysis of the dataset is complete, AI Platform Training transforms your data based on the types, treatments and statistics applied to your dataset. AI Platform Training does transformations in the following order:

  1. Splits validation and test datasets off from the training dataset, if you specify the split percentages.
  2. Removes rows that have more than 10% of features missing.
  3. Fills in missing values. The mean is used for numerical columns. For XGBoost, zeroes are used for categorical columns.
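
Steps 2 and 3 can be sketched as follows. This is a simplified illustration, not the service's implementation. It assumes that missing values are represented as `None`, that every value in a row counts toward the 10% threshold, and that categorical values are already integer-encoded, as in the XGBoost case. The function names and sample rows are hypothetical, and the optional split in step 1 is omitted.

```python
def drop_sparse_rows(rows):
    """Keep only rows in which at most 10% of the values are missing."""
    kept = []
    for row in rows:
        missing = sum(value is None for value in row)
        # Integer comparison avoids floating-point issues at exactly 10%.
        if 10 * missing <= len(row):
            kept.append(row)
    return kept


def fill_missing(rows, numerical_cols, categorical_cols):
    """Fill numerical gaps with the column mean and categorical gaps with 0."""
    means = {}
    for col in numerical_cols:
        present = [row[col] for row in rows if row[col] is not None]
        if present:
            means[col] = sum(present) / len(present)

    filled = []
    for row in rows:
        new_row = list(row)
        for col in numerical_cols:
            if new_row[col] is None and col in means:
                new_row[col] = means[col]
        for col in categorical_cols:
            if new_row[col] is None:
                new_row[col] = 0
        filled.append(new_row)
    return filled


# Made-up rows with 10 values each: column 1 is numerical, column 8 is categorical.
rows = [
    [1, 0.25, 3, 4, 5, 6, 7, 8, 2, 1.0],
    [0, 0.50, 3, 4, 5, 6, 7, 8, None, None],  # 20% missing: removed
    [1, None, 3, 4, 5, 6, 7, 8, 1, 2.0],      # 10% missing: kept, mean is filled in
    [0, 0.75, 3, 4, 5, 6, 7, 8, None, 3.0],   # 10% missing: kept, 0 is filled in
]
kept = drop_sparse_rows(rows)
filled = fill_missing(kept, numerical_cols=[1], categorical_cols=[8])
```

Note that the sketch computes the column mean only from the rows that survive step 2. Whether the service does the same is not stated here.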

Example transformations

Rows with more than 10% of their values missing are removed. In the following examples, assume that each row has 10 values. Each example row is truncated for simplicity.

| Row issue | Original values | Transformed values | Explanation |
|---|---|---|---|
| Example row with no missing values | [3, 0.45, ..., 'fruits', 0, 1] | [3, 0.45, ..., 1, 0, 0, 0, 1] | The string 'fruits' is transformed to the values "1, 0, 0" in one-hot encoding. For TensorFlow-based algorithms, this happens in the TensorFlow graph. For XGBoost, AI Platform Training does this transformation. |
| Too many missing values | [3, 0.45, ..., 'fruits', __, __] | Row is removed | More than 10% of values in the row are missing. |
| Missing numerical value | [3, 0.45, ..., 'fruits', 0, __] | [3, 0.45, ..., 1, 0, 0, 0, 0.54] | The mean value for the column replaces the missing numerical value. In this example, the mean is 0.54. The string 'fruits' is transformed to the values "1, 0, 0" in one-hot encoding. For TensorFlow-based algorithms, this happens in the TensorFlow graph. For XGBoost, AI Platform Training does this transformation. |
| Missing categorical value | [3, 0.45, ..., __, 0, 1] | [3, 0.45, ..., 0, 0, 0, 0, 1] | The missing categorical value is transformed to the values "0, 0, 0" in one-hot encoding. For TensorFlow-based algorithms, this happens in the TensorFlow graph. For XGBoost, AI Platform Training does this transformation. |
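
The one-hot encoding in the examples above can be reproduced with a short sketch. The three-term vocabulary is hypothetical: only 'fruits' appears in the examples, and the other two terms are placeholders chosen so that the encoding has three positions. The helper names are also illustrative, and mapping every value outside the vocabulary to all zeros is a simplification of this sketch.

```python
# Hypothetical vocabulary; only 'fruits' comes from the examples above.
VOCABULARY = ["fruits", "vegetables", "dairy"]


def one_hot(value, vocabulary=VOCABULARY):
    """Return a one-hot list; a missing value (None) maps to all zeros."""
    return [1 if value == term else 0 for term in vocabulary]


def transform_row(head, category, tail):
    """Replace the categorical value with its one-hot encoding."""
    return head + one_hot(category) + tail


print(transform_row([3, 0.45], "fruits", [0, 1]))  # [3, 0.45, 1, 0, 0, 0, 1]
print(transform_row([3, 0.45], None, [0, 1]))      # [3, 0.45, 0, 0, 0, 0, 1]
```

Each row is shown without its truncated middle values, so the printed lists correspond to the "no missing values" and "missing categorical value" examples.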

There are further differences in the transformation process, depending on which ML framework the built-in algorithm is based on. For the TensorFlow-based built-in algorithms (linear learner, wide and deep), the column treatments correspond directly to which feature columns are created in the TensorFlow model. AI Platform Training simply assigns feature columns for the TensorFlow Estimator model, and then the data transformation becomes part of the preprocessing that occurs within the TensorFlow Estimator model.

Otherwise, as with XGBoost, AI Platform Training applies column treatments and performs the data transformations directly.
