Data types

This page describes the data types you can import into an AutoML Tables dataset, and how those types map to BigQuery or CSV.

Introduction

When you import training data, AutoML Tables suggests a data type for each column based on the native type in the input data and the values in that column. The data type matters because it affects how the column is used in training the model. After importing your data, review each column to confirm that the data type AutoML Tables chose is correct for your data.

When you create the model, the dataset is converted to a list of Row objects, and the Row object format has its own data types. If you use online predictions, you must convert your data to this format.

AutoML Tables data types

Categorical

A categorical value represents membership in a category, that is, a value at the nominal level of measurement. Categorical values differ only by name and have no order. You can use numbers to represent categorical values, but the values have no numeric relationship with each other: a categorical 1 is not "greater" than a categorical 0.

Here are some examples of categorical values:

  • Boolean - true, false.
  • Country - "USA", "Canada", "China", and so on.
  • HTTP status code - "200", "404", "500", and so on.

Categorical values are case-sensitive; spelling variations are treated as different categories (for example, "Color" and "Colour" are not combined).
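
Because categories are matched exactly, it can help to look for values that differ only by case before you import. The following is a minimal sketch of such a check, not part of AutoML Tables; the function name `case_variants` is hypothetical, and the check cannot detect spelling variations such as "Color" and "Colour".

```python
from collections import defaultdict


def case_variants(values):
    """Group values that differ only by letter case.

    Returns a dict mapping the case-folded form to the sorted list of
    distinct spellings, for groups with more than one spelling.
    """
    groups = defaultdict(set)
    for value in values:
        groups[value.casefold()].add(value)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


# Example: "USA" and "usa" would become two separate categories.
print(case_variants(["USA", "usa", "Canada"]))
```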

Text

A text value represents free-form text, typically made up of text tokens.

Here are some examples of text values:

  • "The quick brown fox"
  • "This restaurant is the best! The food is delicious"

Text fields are parsed into tokens by whitespace for model training.
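
As a rough illustration of whitespace tokenization, the sketch below splits a string on runs of whitespace. It only approximates the behavior described above; the exact tokenizer AutoML Tables uses is not specified here, and `whitespace_tokens` is a hypothetical helper.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Split text on runs of whitespace; punctuation stays attached to words."""
    return text.split()


print(whitespace_tokens("This restaurant is the best! The food is delicious"))
```

Note that under this scheme "best!" is a single token, so punctuation and casing in your text affect which tokens the model sees.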

Numeric

A numeric value represents an ordinal or quantitative number. These numbers can be compared. That is, two distinct numbers can be less than or greater than one another.

AutoML Tables interprets any compatible string as numeric. Leading or trailing whitespace is trimmed.

The following table shows all compatible formats for the numeric data type:

| Format | Examples | Notes |
|---|---|---|
| Numeric string | "101", "101.5" | The period character (".") is the only valid decimal delimiter. "101,5" and "100,000" are not valid numeric strings. |
| Scientific notation | "1.12345E+11", "1.12345e+11" | See the note for numeric strings regarding decimal delimiters. |
| Not a number | "NAN", "nan", "+NAN" | Case is ignored. Prepended plus ("+") or minus ("-") characters are ignored. Interpreted as a NULL value. |
| Infinity | "INF", "+inf" | Case is ignored. Prepended plus ("+") or minus ("-") characters are ignored. Interpreted as a NULL value. |
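
To check your data against these rules before import, you could use something like the sketch below. It is an approximation written from the table above, not the parser AutoML Tables uses; the function name `parse_numeric` and the exact regular expressions are assumptions.

```python
import re
from typing import Optional

# Plain or scientific notation, with "." as the only decimal delimiter.
_NUMERIC = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")
# NaN and infinity in any case, with an optional sign (treated as NULL).
_NULL_LIKE = re.compile(r"[+-]?(nan|inf)", re.IGNORECASE)


def parse_numeric(raw: str) -> Optional[float]:
    """Return a float, None for NaN/infinity, or raise ValueError."""
    text = raw.strip()  # leading and trailing whitespace is trimmed
    if _NULL_LIKE.fullmatch(text):
        return None
    if _NUMERIC.fullmatch(text):
        return float(text)
    raise ValueError(f"not a compatible numeric string: {raw!r}")


print(parse_numeric(" 101.5 "))
print(parse_numeric("+NAN"))
```

Strings such as "101,5" or "100,000" raise `ValueError` in this sketch, which mirrors the note that a comma is not a valid delimiter.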

Timestamp

A Timestamp value represents a point in time, expressed either as a civil time with a time zone or as a Unix timestamp. Only features of type Timestamp can be used for the Time column.

If the civil time does not specify a time zone, UTC is assumed.

The following table shows all compatible timestring formats:

| Format | Example | Notes |
|---|---|---|
| %E4Y-%m-%d | "2017-01-30" | See the Abseil documentation for a description of this format. |
| %E4Y/%m/%d | "2017/01/30" | |
| %Y/%m/%d %H:%M:%E*S | "2017/01/30 23:59:58" | |
| %d-%m-%E4Y | "30-11-2018" | |
| %d/%m/%E4Y | "30/11/2018" | |
| %d-%B-%E4Y | "30-November-2018" | |
| %Y-%m-%dT%H:%M:%E*S%Ez | "2019-05-17T23:56:09.05Z" | RFC 3339 |
| Unix timestamp string in seconds | "1541194447" | Only for times between 01/Jan/1990 and 01/Jan/2030. |
| Unix timestamp string in milliseconds | "1541194447000" | |
| Unix timestamp string in microseconds | "1541194447000000" | |
| Unix timestamp string in nanoseconds | "1541194447000000000" | |
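
The sketch below shows one way to check that your time strings fit the formats above, using Python's `datetime.strptime`. The Python directives are approximate stand-ins for the Abseil ones (`%Y` for `%E4Y`, `%S` or `%S.%f` for `%E*S`, `%z` for `%Ez`). The `unix_unit` parameter is a hypothetical convenience: how AutoML Tables tells the Unix timestamp units apart is not described here, so the caller states the unit explicitly.

```python
from datetime import datetime, timedelta, timezone

# Python approximations of the civil-time formats in the table above.
# "%B" matches English month names only under an English or C locale.
_CIVIL_FORMATS = [
    "%Y-%m-%d",
    "%Y/%m/%d",
    "%Y/%m/%d %H:%M:%S",
    "%Y/%m/%d %H:%M:%S.%f",
    "%d-%m-%Y",
    "%d/%m/%Y",
    "%d-%B-%Y",
    "%Y-%m-%dT%H:%M:%S%z",
    "%Y-%m-%dT%H:%M:%S.%f%z",
]
_UNITS_PER_SECOND = {"s": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}


def parse_timestamp(raw: str, unix_unit: str = "s") -> datetime:
    """Parse a civil time or Unix timestamp string into an aware datetime."""
    if raw.isdigit():
        per_second = _UNITS_PER_SECOND[unix_unit]
        whole, frac = divmod(int(raw), per_second)
        micros = frac * 10**6 // per_second  # sub-microsecond detail is dropped
        base = datetime.fromtimestamp(whole, tz=timezone.utc)
        return base + timedelta(microseconds=micros)
    for fmt in _CIVIL_FORMATS:
        try:
            parsed = datetime.strptime(raw, fmt)
        except ValueError:
            continue
        if parsed.tzinfo is None:
            # No time zone given: default to UTC, as described above.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed
    raise ValueError(f"unrecognized time string: {raw!r}")


print(parse_timestamp("2019-05-17T23:56:09.05Z"))
print(parse_timestamp("1541194447000", unix_unit="ms"))
```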

Struct

A struct can be used to represent a group of labeled fields. A struct has a list of field names, each associated with a data type. The list of fields and their data types must be the same for all struct values in a column.

Here are some examples of structs:

  • Blood pressure - {"timestamp": 1535761416, "systolic": 110, "diastolic": 70}
  • Product - {"name": "iPhone", "price": 1000}

You use the BigQuery STRUCT data type to represent structs.

Array

An array can be used to represent a list of values. The contained values must be of the same data type. You can include compound data types (structs) in an array; all of the structs must have the same structure.

AutoML Tables treats position in an array as relative weight: items that appear later in the array are weighted more heavily than items that appear towards the beginning.

Here are some examples of arrays:

  • Product categories:

    ["Clothing", "Women", "Dress", ...]

  • Most recent purchases:

    ["iPhone", "Laptop", "Suitcase", ...]

  • User records:

    [{"name": "Joelle", "ID": 4093}, {"name": "Chloe", "ID": 2047}, {"name": "Neko", "ID": 3432}, ...]

You use the BigQuery ARRAY data type to represent arrays.

Column name format

When you create your schema for BigQuery or your header row for CSV, you name your columns (features) in your training data. Column names can include any alphanumeric character or an underscore (_). The column name cannot begin with an underscore.
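
The naming rule above can be checked with a short regular expression. This is a minimal sketch that assumes "alphanumeric" means ASCII letters and digits; `is_valid_column_name` is a hypothetical helper, not an AutoML Tables function.

```python
import re

# First character: letter or digit. Remaining characters: letters, digits, "_".
_COLUMN_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9_]*")


def is_valid_column_name(name: str) -> bool:
    """Return True if the name uses only allowed characters and
    does not begin with an underscore."""
    return _COLUMN_NAME.fullmatch(name) is not None


for candidate in ["user_id", "_id", "price-usd"]:
    print(candidate, is_valid_column_name(candidate))
```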

BigQuery tables

Supported data types

Before creating your BigQuery table, you should know which BigQuery data types are supported and how they map to AutoML Tables data types.

| BigQuery data type | Supported for import? | AutoML Tables data types |
|---|---|---|
| INT64 | Yes | Numeric, Categorical |
| NUMERIC | Yes | Numeric, Categorical |
| FLOAT64 | Yes | Numeric, Categorical |
| BOOL | Yes | Categorical |
| STRING | Yes | Text, Categorical, Numeric |
| BYTES | No | |
| DATE | Yes | Timestamp, Categorical |
| DATETIME | Yes | Timestamp, Categorical |
| GEOGRAPHY | No | |
| TIME | Yes | Categorical |
| TIMESTAMP | Yes | Timestamp, Categorical |
| ARRAY | Yes | Array |
| STRUCT | Yes | Struct |
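
If you want to check a table schema against this mapping before importing, the table can be encoded as a lookup. The sketch below is illustrative only: the dictionary mirrors the table above, while `unsupported_columns` and the simplified schema shape (column name mapped to a bare BigQuery type name) are assumptions.

```python
# BigQuery type -> AutoML Tables types it can map to (empty = not importable).
BIGQUERY_TO_AUTOML = {
    "INT64": ("Numeric", "Categorical"),
    "NUMERIC": ("Numeric", "Categorical"),
    "FLOAT64": ("Numeric", "Categorical"),
    "BOOL": ("Categorical",),
    "STRING": ("Text", "Categorical", "Numeric"),
    "BYTES": (),
    "DATE": ("Timestamp", "Categorical"),
    "DATETIME": ("Timestamp", "Categorical"),
    "GEOGRAPHY": (),
    "TIME": ("Categorical",),
    "TIMESTAMP": ("Timestamp", "Categorical"),
    "ARRAY": ("Array",),
    "STRUCT": ("Struct",),
}


def unsupported_columns(schema: dict[str, str]) -> list[str]:
    """Return the names of columns whose BigQuery type cannot be imported."""
    return [
        name
        for name, bq_type in schema.items()
        if not BIGQUERY_TO_AUTOML.get(bq_type.upper(), ())
    ]


# Hypothetical schema: column name -> BigQuery type name.
print(unsupported_columns({"age": "INT64", "photo": "BYTES", "region": "GEOGRAPHY"}))
```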

CSV files

Supported data types

All CSV data is imported as strings. You can use the following AutoML Tables data types when you import using CSV:

  • Text
  • Categorical
  • Numeric
  • Timestamp

CSV format

AutoML Tables uses the RFC 4180 CSV format.
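
One way to produce compatible files is Python's standard `csv` module, whose default dialect quotes fields that contain commas or quotes, doubles embedded quotes, and ends records with CRLF, in line with RFC 4180. The sketch below is an example under that assumption; the column names and values are made up for illustration.

```python
import csv
import io


def rows_to_csv(header, rows) -> str:
    """Serialize a header and rows using the csv module's default dialect."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)  # comma-delimited, minimal quoting, CRLF endings
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical data: the review contains both a comma and double quotes.
text = rows_to_csv(["id", "review"], [[1, 'Said "best", twice']])
print(text)
```

When writing to a file instead of a string, open it with `newline=""` so that Python does not translate the line endings a second time.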

Row object format

When you request an online prediction, you must present the prediction data as a JSON representation of a Row object. The table below shows the acceptable data formats for every AutoML Tables data type. You can choose the data format that is easiest for you to provide.

| AutoML Tables data type | Row object data type | Formats |
|---|---|---|
| Categorical | bool_value | true, false |
| Categorical | string_value | "42", "blue", "2014-01-31", "2014-01-31 13:14:15.123456789", "21:02:42.118039", "1553040000" (Unix timestamp) |
| Numeric | string_value | "42.3" |
| Numeric | number_value | 42.3 |
| Text | string_value | "The quick brown fox" |
| Timestamp | string_value | "2014-01-31", "2014-01-31 13:14:15.123456789", "1553040000" (Unix timestamp) |
| Array | list_value | ["dog", "cat", "fish"] |
| Struct | struct_value | {"field1": "ABC", "field2": 100} |
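
As an illustration, the sketch below assembles one row of values in the formats listed above and serializes it as JSON. The `payload`/`row`/`values` envelope is an assumption about the online prediction request body and should be checked against the prediction API reference; the column order and values are hypothetical.

```python
import json


def build_row_payload(values) -> dict:
    """Wrap a list of column values in an assumed online-prediction envelope."""
    return {"payload": {"row": {"values": list(values)}}}


# One value per column, in a hypothetical column order.
values = [
    True,                                # Categorical as a boolean
    "blue",                              # Categorical as a string
    42.3,                                # Numeric as a number
    "The quick brown fox",               # Text
    "2014-01-31",                        # Timestamp as a string
    ["dog", "cat", "fish"],              # Array
    {"field1": "ABC", "field2": 100},    # Struct
]

body = json.dumps(build_row_payload(values))
print(body)
```

In JSON, each value's type determines how it is read: a bare `42.3` is a number, while `"42.3"` is a string, and both are acceptable for a Numeric column.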

What's next