Data types and transformations for tabular data

This page describes how Vertex AI works with different types of tabular data for AutoML models.

Introduction to transformations in Vertex AI

Tabular data is made up of many rows of data. Each row has the same features, or columns. Each feature has a source data type, which depends on the source (BigQuery or a CSV file in Cloud Storage). When you use the data to train a model, Vertex AI examines the source data type and feature values, and infers how it will use that feature in model training. This is called the transformation for that feature. If needed, you can specify a different supported transformation for any feature.

For forecasting models (Preview), you also need to provide the feature type for all of your columns.

Vertex AI transformations

Categorical

Applying a categorical transformation causes that feature to represent values at a nominal level: the values differ only by name, with no inherent order. You can use numbers to represent categorical values, but the values have no numeric relationship with each other. That is, a categorical 1 is not "greater" than a categorical 0.

Here are some examples of categorical values:

  • Boolean - true, false.
  • Country - "USA", "Canada", "China", and so on.
  • HTTP status code - "200", "404", "500", and so on.

Categorical values are case-sensitive; spelling variations are treated as different categories (for example, "Color" and "Colour" are not combined).

When you train a model with a feature with a categorical transformation, Vertex AI applies the following data transformations to the feature, and uses any that provide signal for training:

  • The categorical string as is--no change to case, punctuation, spelling, tense, and so on.
  • Convert the category name to a dictionary lookup index and generate an embedding for each index.
  • Categories that appear fewer than 5 times in the training dataset are treated as the "unknown" category. The "unknown" category gets its own special lookup index and resulting embedding.
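
As a rough sketch of the dictionary-lookup rule above, assuming index 0 is reserved for the "unknown" category (the index layout, threshold parameter, and helper names are illustrative, not Vertex AI internals):

```python
from collections import Counter

def build_category_index(values, min_count=5):
    """Map each category seen at least min_count times to an index.

    Index 0 is reserved for the "unknown" category, which also absorbs
    rare categories (fewer than min_count occurrences). Illustrative
    only; the real service would also learn an embedding per index.
    """
    counts = Counter(values)
    index = {"<unknown>": 0}
    for category, count in counts.items():
        if count >= min_count:
            index[category] = len(index)
    return index

def lookup(index, value):
    # Case-sensitive, exact-match lookup; unseen or rare values map to 0.
    return index.get(value, 0)
```

Note that because the lookup is case-sensitive and exact, "USA" and "usa" would get different indexes, matching the behavior described above.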

Text

A text transformation causes the feature to be used as free-form text, typically consisting of multiple text tokens.

Here are some examples of text values:

  • "The quick brown fox"
  • "This restaurant is the best! The food is delicious"

For forecasting models, the text transformation is not supported for covariate features.

When you train a model with a feature with a text transformation, Vertex AI applies the following data transformations to the feature, and uses any that provide signal for training:

  • The text as is--no change to case, punctuation, spelling, tense, and so on.
  • Tokenize text to words and generate 1-grams and 2-grams from words. Convert each n-gram to a dictionary lookup index and generate an embedding for each index. Combine the embedding of all elements into a single embedding using the mean.

    Tokenization is based on Unicode script boundaries.

  • Missing values get their own lookup index and resulting embedding.
  • Stop-words receive no special treatment and are not removed.
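
The tokenization and n-gram step above can be sketched roughly as follows. The whitespace split is a simplification (Vertex AI tokenizes on Unicode script boundaries), and the function is an illustration rather than the service's actual implementation:

```python
def ngrams(text):
    """Split text into tokens and emit 1-grams and 2-grams.

    A whitespace split stands in for Unicode-script-boundary
    tokenization. Each returned n-gram would then be mapped to a
    dictionary lookup index and an embedding, and the embeddings
    averaged into a single vector, as described above.
    """
    tokens = text.split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams
```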

Numeric

A numeric transformation causes a column to be used as an ordinal or quantitative number. These numbers can be compared. That is, two distinct numbers can be less than or greater than one another.

Leading or trailing whitespace is trimmed.

The following table shows all compatible formats for a numeric transformation:

Format Examples Notes
Numeric string "101", "101.5" The period character (".") is the only valid decimal delimiter. "101,5" and "100,000" are not valid numeric strings.
Scientific notation "1.12345E+11", "1.12345e+11" See note for numeric strings regarding decimal delimiters.
Not a number "NAN", "nan", "+NAN" Case is ignored. Prepended plus ("+") or minus ("-") characters are ignored. Interpreted as NULL value.
Infinity "INF", "+inf" Case is ignored. Prepended plus ("+") or minus ("-") characters are ignored. Interpreted as NULL value.

If a value in a column with a numeric transformation does not conform to one of these formats, then either the entire row is excluded from training, or the value is treated as null. You choose between these outcomes when you select the numeric transformation.
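
The format rules in the table can be sketched with Python's float parsing, which happens to accept the same decimal, scientific-notation, NAN, and INF forms. This is an illustration of the rules, not the service's parser:

```python
import math

def parse_numeric(value, invalid_is_null=True):
    """Parse a numeric string per the table above.

    NAN/INF strings (any case, optional sign) become None (null).
    Comma-delimited strings like "101,5" are invalid; depending on
    invalid_is_null they become None or raise, mirroring the
    treat-as-null vs. exclude-the-row choice. Illustrative only.
    """
    text = value.strip()  # leading/trailing whitespace is trimmed
    try:
        number = float(text)  # accepts "101.5", "1.12345e+11", "+nan", "inf"
    except ValueError:
        if invalid_is_null:
            return None
        raise
    if math.isnan(number) or math.isinf(number):
        return None  # NaN and infinity are interpreted as NULL
    return number
```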

When you train a model with a feature with a numeric transformation, Vertex AI applies the following data transformations to the feature, and uses any that provide signal for training:

  • The value converted to float32.
  • The z_score of the value.
  • A bucket index of the value based on quantiles. Bucket size is 100.
  • log(value+1) when the value is greater than or equal to 0. Otherwise, this transformation is not applied and the value is considered a missing value.
  • z_score of log(value+1) when the value is greater than or equal to 0. Otherwise, this transformation is not applied and the value is considered a missing value.
  • A boolean value that indicates whether the value is null.
  • Rows with invalid numerical inputs (for example, a string that cannot be parsed to float32) are not included for training and prediction.
  • Extreme/outlier values are not given any special treatment.
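
A few of these derived signals can be sketched as follows. The rank-based quantile method and the population standard deviation are assumptions for illustration; the doc states only the bucket count (100) and the log(value+1) guard:

```python
import math

def numeric_features(values):
    """Compute some of the derived signals listed above for a column.

    Returns the raw value, z-score, 100-quantile bucket index, and
    log(value+1) (None for negative values, which the doc treats as
    missing). Illustrative only.
    """
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5 or 1.0
    ordered = sorted(values)

    def bucket(v, buckets=100):
        # Rank-based quantile bucket index in [0, buckets - 1].
        rank = sum(1 for w in ordered if w <= v)
        return min(buckets - 1, (rank - 1) * buckets // n)

    rows = []
    for v in values:
        log1p = math.log(v + 1) if v >= 0 else None  # negative -> missing
        rows.append({
            "value": v,
            "z_score": (v - mean) / std,
            "bucket": bucket(v),
            "log1p": log1p,
        })
    return rows
```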

Timestamp

A Timestamp transformation causes a feature to be used as a point in time, represented either as a civil time (with an optional time zone) or a Unix timestamp. Only features with a Timestamp transformation can be used for the Time column.

If a time zone is not specified with the civil time, it defaults to UTC.

The following table shows all compatible timestamp string formats:

Format Example Notes
%E4Y-%m-%d "2017-01-30" See the Abseil documentation for a description of this format.
%E4Y/%m/%d "2017/01/30"
%Y/%m/%d %H:%M:%E*S "2017/01/30 23:59:58"
%d-%m-%E4Y "30-11-2018"
%d/%m/%E4Y "30/11/2018"
%d-%B-%E4Y "30-November-2018"
%Y-%m-%dT%H:%M:%E*S%Ez "2019-05-17T23:56:09.05Z" RFC 3339
Unix timestamp string in seconds "1541194447" Only for times between 01/Jan/1990 and 01/Jan/2030.
Unix timestamp string in milliseconds "1541194447000"
Unix timestamp string in microseconds "1541194447000000"
Unix timestamp string in nanoseconds "1541194447000000000"

If a value in a column with a timestamp transformation does not conform to one of these formats, then either the entire row is excluded from training, or the value is treated as null. You choose between these outcomes when you select the timestamp transformation.
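
A rough Python equivalent of this parsing behavior, using strptime patterns in place of the Abseil %E4Y/%E*S specifiers. This sketch covers the civil-time formats and Unix seconds only, and returns None for the treat-as-null choice:

```python
from datetime import datetime, timezone

CIVIL_FORMATS = ["%Y-%m-%d", "%Y/%m/%d", "%Y/%m/%d %H:%M:%S",
                 "%d-%m-%Y", "%d/%m/%Y", "%d-%B-%Y"]

def parse_timestamp(text):
    """Try the civil-time formats from the table, then Unix seconds.

    Civil times without a zone default to UTC, as described above.
    Unix timestamps are accepted only between 01/Jan/1990 and
    01/Jan/2030, per the table. Illustrative only.
    """
    for fmt in CIVIL_FORMATS:
        try:
            return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            pass
    if text.isdigit():  # Unix timestamp string in seconds
        seconds = int(text)
        start = datetime(1990, 1, 1, tzinfo=timezone.utc).timestamp()
        end = datetime(2030, 1, 1, tzinfo=timezone.utc).timestamp()
        if start <= seconds <= end:
            return datetime.fromtimestamp(seconds, tz=timezone.utc)
    return None  # unparseable: treat as null
```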

When you train a model with a feature with a timestamp transformation, Vertex AI applies the following data transformations to the feature, and uses any that provide signal for training:

  • Apply the transformations for Numerical columns.
  • Determine the year, month, day, and weekday. Treat each value from the timestamp as a Categorical column.
  • Invalid numerical values (for example, values that fall outside of a typical timestamp range, or are extreme values) receive no special treatment and are not removed.
  • Rows with invalid timestamp inputs (for example, an invalid timestamp string) are not included for training and prediction.
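
The categorical breakdown in the second bullet can be sketched like this (the string conversion and field names are illustrative; each part would then go through the categorical transformation):

```python
from datetime import datetime, timezone

def timestamp_categorical_parts(ts):
    """Break a timestamp into the categorical parts named above.

    The numeric transformations would also be applied to the raw
    timestamp value separately. Illustrative only.
    """
    return {
        "year": str(ts.year),
        "month": str(ts.month),
        "day": str(ts.day),
        "weekday": str(ts.weekday()),  # 0 = Monday in Python
    }
```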

Compound data types

Sometimes you need to include data that is made up of multiple data primitives, such as an array or a struct. Compound data types are available only by using BigQuery as a data source, and are not supported for forecasting models (Preview).

Struct

A struct can be used to represent a group of labeled fields. A struct has a list of field names, each associated with a data type. The list of fields and their data types must be the same for all struct values in a column.

Here are some examples of structs:

  • Blood pressure - {"timestamp": 1535761416, "systolic": 110, "diastolic": 70}
  • Product - {"name": "iPhone", "price": 1000}

You use the BigQuery STRUCT data type to represent structs.

Struct values are automatically flattened into fields. Vertex AI applies the data transformation to the flattened fields according to their transformation type.
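
A minimal sketch of struct flattening, assuming dotted field names (the doc says only that structs are flattened into fields, not what naming convention Vertex AI uses):

```python
def flatten_struct(column_name, struct):
    """Flatten one struct value into individually named fields.

    Nested structs are flattened recursively. Each flattened field
    would then get the transformation appropriate to its type.
    The dotted naming is a hypothetical convention for illustration.
    """
    flat = {}
    for field, value in struct.items():
        key = f"{column_name}.{field}"
        if isinstance(value, dict):
            flat.update(flatten_struct(key, value))
        else:
            flat[key] = value
    return flat
```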

Array

An array can be used to represent a list of values. The contained values must accept the same transformation type. You can include structs in arrays; all of the structs in the array must have the same structure.

Vertex AI treats the position of items in an array as representing relative weight: items that appear later in the array are weighted more heavily than items towards the beginning.

Here are some examples of arrays:

  • Product categories:

    ["Clothing", "Women", "Dress", ...]

  • Most recent purchases:

    ["iPhone", "Laptop", "Suitcase", ...]

  • User records:

    [{"name": "Joelle", "ID": 4093}, {"name": "Chloe", "ID": 2047}, {"name": "Neko", "ID": 3432}, ...]

You use the BigQuery ARRAY data type to represent arrays.

The data transformation Vertex AI applies depends on the transformation type applied to the array:

Array type Transformation
Numerical array
  • All transformations for Numerical types applied to the average of the last N items where N = {1, 2, 4, 8, all}. So the items most heavily emphasized are the ones towards the end of the array, not the beginning.
  • The average of empty arrays is treated as zero.
Categorical array
  • For each element in the array of the last N items where N = {1, 2, 4, 8, all}, convert the category name to a dictionary lookup index and generate an embedding for each index. Combine the embedding of all elements into a single embedding using the mean.
  • Empty arrays are treated as an embedding of zeroes.
Text array
  • Concatenate all text values in the array into a single text value using a space (" ") as a delimiter, and then treat the result as a single text value. Apply the transformations for Text columns.
  • Empty arrays are treated as an embedding of zeroes.
Timestamp array
  • Apply the transformations for Numerical columns to the average of the last N items of the array. N = {1, 2, 4, 8, all}. This means that the items most heavily emphasized are the ones towards the end of the array.
Struct array
  • The structs in the array are flattened into individual fields, and assembled into arrays by field. The array transformation (as described in this table) is applied, according to the field type for that array.
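
The last-N averaging used for numerical (and timestamp) arrays can be sketched as follows; the exact window handling and output shape are assumptions for illustration:

```python
def last_n_averages(values, sizes=(1, 2, 4, 8, None)):
    """Average the last N items for each window size (None = all).

    Because later items fall inside more of the windows, they carry
    more weight overall, matching the rule above. An empty array
    averages to zero, per the table. Illustrative only.
    """
    result = {}
    for n in sizes:
        window = values if n is None else values[-n:]
        key = "all" if n is None else str(n)
        result[key] = sum(window) / len(window) if window else 0.0
    return result
```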

Column name format

When you create your schema for BigQuery or your header row for CSV, you name your columns (features) in your training data. Column names can include any alphanumeric character or an underscore (_). The column name cannot begin with an underscore.
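
The naming rule can be checked with a simple pattern. This is a sketch of the stated rule only; the service may enforce additional limits (such as length) that aren't covered here:

```python
import re

# Any alphanumeric character or underscore, not starting with an underscore.
COLUMN_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_]*$")

def is_valid_column_name(name):
    """Check a column name against the rule stated above."""
    return bool(COLUMN_NAME.match(name))
```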

BigQuery tables

Supported BigQuery data types

Before creating your BigQuery table, you should know which BigQuery data types are supported and how they map to Vertex AI data types.

BigQuery data type Supported for import? JSON data type Supported transformations
INT64 Y String Numeric, Categorical, Timestamp
NUMERIC, BIGNUMERIC Y Number Numeric, Categorical
FLOAT64 Y Number Numeric, Categorical
BOOL Y Boolean Categorical
STRING Y String Text, Categorical, Numeric, Timestamp
BYTES N N/A N/A
DATE Y String Timestamp, Categorical
DATETIME Y String Timestamp, Categorical, Text
GEOGRAPHY N N/A N/A
TIME Y String Categorical
TIMESTAMP Y String Timestamp, Categorical, Numeric
Array Y Array A repeated version of the transformation used for a single instance of that data type. Timestamp conversions are not supported for array data.
STRUCT Y Object Struct

CSV files

Supported transformations

All CSV data is in string format. For models trained on training data in CSV files in Cloud Storage, batch prediction data must also be provided in CSV format in Cloud Storage.

The following transformations are available for string data:

  • Text
  • Categorical
  • Numeric
  • Timestamp

CSV format

Vertex AI uses the RFC 4180 CSV format. Commas (",") must be used as the delimiter.

Feature type and availability at forecast

Every column used for training a forecasting model (Preview) must have a type: attribute or covariate. Covariates are further designated as available or unavailable at forecast time.

Series type Available at forecast time Description Examples API fields
Attribute Available An attribute is a static feature that does not change with time. Item color, product description. time_series_attribute_columns
Covariate Available An exogenous variable that is expected to change over time. A covariate available at forecast time is a leading indicator. You must provide prediction data for this column for each point in the forecast horizon. Holidays, planned promotions or events. available_at_forecast_columns
Covariate Not available A covariate not available at forecast time. You don't need to provide values for these features when creating a forecast. Actual weather. unavailable_at_forecast_columns

How missing or null values are handled

How missing values are handled depends on your model objective and the transformation applied for that feature.

Classification and Regression

For classification and regression models, null values result in an embedding for categorical and text transformations; for other transformations, the null value is left as null.

Forecasting

For forecasting models (Preview), null values are imputed from the surrounding data. (There is no option to leave a null value as null.) If you would prefer to control the way null values are imputed, you can impute them explicitly. The best values to use might depend on your data and your business problem.

Missing rows (for example, no row for a specific date, with a data granularity of daily) are allowed, but Vertex AI does not impute values for the missing data. Because missing rows can decrease model quality, you should avoid missing rows where possible. For example, if a row is missing because sales quantity for that day was zero, add a row for that day and explicitly set sales data to 0.
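
Adding explicit rows for missing dates, as recommended above, might look like this for a daily series. The dict-based representation and the zero default are illustrative, not a Vertex AI API:

```python
from datetime import date, timedelta

def fill_missing_days(rows, default_sales=0):
    """Insert explicit zero-sales rows for missing dates.

    rows maps date -> sales. Returns a dense daily series between the
    first and last observed date, so the model never sees a gap.
    """
    if not rows:
        return {}
    start, end = min(rows), max(rows)
    dense = {}
    day = start
    while day <= end:
        dense[day] = rows.get(day, default_sales)
        day += timedelta(days=1)
    return dense
```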

What values are treated as null values

When training an AutoML tabular model, Vertex AI treats the following values as null values:

  • A BigQuery NULL value.

  • NaN or infinite numeric values.

  • An empty string. Vertex AI does not trim spaces from strings. That is, " " is not considered a null value.

  • A string that can be converted to NaN or an infinite numeric value.

    • For "NAN": ignore case, with an optional plus or minus prepended.
    • For "INF": ignore case, with an optional plus or minus prepended.
  • Missing values.

  • Values in a column with a Numeric or Timestamp transformation that are not in a valid format for the column's transformation. In this case, if you specified that the row with the invalid value should be used in training, the invalid value is considered to be null.
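
The rules above can be collected into a single check. This sketch mirrors the list for string and numeric values; BigQuery-specific types and per-transformation format validation are omitted:

```python
def is_treated_as_null(value):
    """Decide whether a value counts as null per the list above.

    Covers missing values, empty strings, NaN/infinite numbers, and
    strings convertible to NaN or infinity. Spaces are NOT trimmed,
    so " " is not null. Illustrative only.
    """
    if value is None:
        return True  # BigQuery NULL / missing value
    if isinstance(value, float):
        return value != value or abs(value) == float("inf")
    if isinstance(value, str):
        if value == "":
            return True  # empty string (but " " is not null)
        try:
            number = float(value)
        except ValueError:
            return False
        return number != number or abs(number) == float("inf")
    return False
```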

What's next