This page provides a general overview of how AI Platform preprocesses your data for training with built-in algorithms. Additionally, it explains the requirements and limitations for your input data.
Format input data
Your input data must be a CSV file with UTF-8 encoding.
You must prepare your input CSV file to meet the following requirements:
- Remove the header row. The header row contains the labels for each column; removing it prevents the labels from being submitted as part of the training data.
- Ensure that the target column is the first column. The target column contains the value that you are trying to predict. For a classification algorithm, each value in the target column is a class or category. For a regression algorithm, each value in the target column is numerical.
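The steps above can be sketched with pandas; the column names here (a hypothetical `price` target) are illustrative only.

```python
import pandas as pd

# Illustrative data with a hypothetical "price" target column.
df = pd.DataFrame({
    "rooms": [3, 2, 4],
    "city": ["austin", "boston", "austin"],
    "price": [250000, 340000, 410000],  # the value to predict
})

# Move the target column to the first position.
cols = ["price"] + [c for c in df.columns if c != "price"]
df = df[cols]

# Write a UTF-8 CSV without the header row (and without the index).
df.to_csv("training.csv", header=False, index=False, encoding="utf-8")
```

Each row of the resulting file then starts with the target value, followed by the features.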
How preprocessing works
Automatic preprocessing works for categorical and numerical data. The preprocessing routine first analyzes and then transforms your data.
First, AI Platform analyzes the dataset column by column. For each
column, AI Platform automatically detects its data type, identifies
how the column should be treated for data transformation, and computes some
statistics for the data in the column.
The training job captures the results of this analysis in a file that is included with other training artifacts in your Cloud Storage bucket. For each column, the analysis records the following:
- Type: The column can be either numerical or categorical.
- Treatment: The algorithm identifies how to treat each column. Columns can be treated as constants or row identifiers. Categorical columns can also be tagged for identity or vocabulary, based on whether the categorical values are integers or strings. A column with a large number of categories receives a hashing treatment, which reduces the categories to a smaller, more manageable number.
- Statistics: Statistics are calculated to help transform the features in each column, based on the column's type and treatment.
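The per-column analysis can be sketched as follows. This is an illustrative approximation, not AI Platform's actual implementation; the threshold and treatment names are assumptions.

```python
def analyze_column(values, hash_threshold=1000):
    """Return (type, treatment, stats) for one column of values; a rough sketch."""
    distinct = set(values)
    if all(isinstance(v, float) for v in values):
        # Continuous values: numerical column; the mean feeds later imputation.
        return "numerical", "numerical", {"mean": sum(values) / len(values)}
    # Otherwise treat the column as categorical.
    if len(distinct) == 1:
        treatment = "constant"        # a single repeated value
    elif len(distinct) == len(values):
        treatment = "row_identifier"  # unique per row, e.g. an ID column
    elif len(distinct) > hash_threshold:
        treatment = "hashing"         # many categories: hash into fewer buckets
    elif all(isinstance(v, int) for v in values):
        treatment = "identity"        # integer categories
    else:
        treatment = "vocabulary"      # string categories
    return "categorical", treatment, {"num_categories": len(distinct)}
```

For example, `analyze_column([1.0, 2.0, 2.0, 3.0])` yields a numerical column with mean 2.0, while `analyze_column(["a", "b", "a", "b"])` yields a categorical column with a vocabulary treatment.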
After the initial analysis of the dataset is complete, AI Platform transforms your data based on the types, treatments, and statistics applied to your dataset. AI Platform performs transformations in the following order:
- Splits the dataset into training, validation, and test datasets, if you specify split percentages.
- Removes rows that have more than 10% of features missing.
- Fills in missing values. The column mean is used for numerical columns. For XGBoost, zeros are used for categorical columns.
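The row-removal and fill steps can be sketched as follows, using plain Python lists with `None` marking a missing value. The 10% threshold comes from the steps above; everything else here is illustrative.

```python
def drop_and_fill(rows, numerical_cols):
    """Drop rows with more than 10% missing values, then fill the rest.

    `rows` is a list of equal-length lists; None marks a missing value.
    `numerical_cols` is a set of indexes of numerical columns (mean-filled);
    all other columns are treated as categorical (zero-filled, XGBoost-style).
    """
    # 1. Remove rows with more than 10% of features missing.
    kept = [r for r in rows
            if sum(v is None for v in r) / len(r) <= 0.10]

    # 2. Column means over the kept rows, for numerical imputation.
    means = {}
    for c in numerical_cols:
        present = [r[c] for r in kept if r[c] is not None]
        means[c] = sum(present) / len(present)

    # 3. Fill: mean for numerical columns, zero for categorical ones.
    return [
        [(means[i] if i in numerical_cols else 0) if v is None else v
         for i, v in enumerate(r)]
        for r in kept
    ]

rows = [
    [1.0] * 10,                # complete row: kept as-is
    [1.0] * 8 + [None, None],  # 20% missing: removed
    [2.0] * 9 + [None],        # 10% missing: kept, last value mean-filled
]
cleaned = drop_and_fill(rows, numerical_cols=set(range(10)))
```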
Rows with more than 10% missing values are removed. In the following examples, assume each row has 10 values. The example rows are truncated for simplicity.
| Row issue | Original values | Transformed values | Explanation |
|---|---|---|---|
| Example row with no missing values | [3, 0.45, ..., 'fruits', 0, 1] | [3, 0.45, ..., 1, 0, 0, 0, 1] | The string 'fruits' is transformed to the values "1, 0, 0" in one-hot encoding. For TensorFlow-based algorithms, this happens in the TensorFlow graph. For XGBoost, AI Platform does this transformation. |
| Too many missing values | [3, 0.45, ..., 'fruits', \_\_, \_\_] | Row is removed | More than 10% of values in the row are missing. |
| Missing numerical value | [3, 0.45, ..., 'fruits', 0, \_\_] | [3, 0.45, ..., 1, 0, 0, 0, 0.54] | The missing numerical value is replaced with the column mean (0.54 in this example). |
| Missing categorical value | [3, 0.45, ..., \_\_, 0, 1] | [3, 0.45, ..., 0, 0, 0, 0, 1] | The missing categorical value is one-hot encoded as all zeros (0, 0, 0). |
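The one-hot step shown in the examples above can be sketched as follows; the three-word vocabulary is an assumption for illustration.

```python
def one_hot(value, vocabulary):
    """Encode `value` over `vocabulary`; a missing value becomes all zeros."""
    return [1 if value == v else 0 for v in vocabulary]

vocab = ["fruits", "vegetables", "grains"]  # assumed vocabulary
encoded = one_hot("fruits", vocab)   # [1, 0, 0]
missing = one_hot(None, vocab)       # [0, 0, 0] for a missing categorical value
```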
There are further differences in the transformation process, depending on which ML framework the built-in algorithm is based on. For the TensorFlow-based built-in algorithms (linear learner, wide and deep), the column treatments correspond directly to which feature columns are created in the TensorFlow model. AI Platform simply assigns feature columns for the TensorFlow Estimator model, and then the data transformation becomes part of the preprocessing that occurs within the TensorFlow Estimator model.
For other frameworks, such as XGBoost, AI Platform applies the column treatments and performs the data transformations directly.
For specific details on how preprocessing works for each built-in algorithm, see its corresponding guide:
- Get started with one of the built-in algorithms.