This page describes the data types you can import into an AutoML Tables dataset, and how those types map to BigQuery and CSV source data.
Introduction
When you import training data, AutoML Tables suggests a data type for each column based on the native type in the input data and the values in that column. The data type of the column is important, because it affects how that column is used in training the model. After importing your data, you review each column to ensure that the data type AutoML Tables chose is the correct one for your data.
When you create the model, the dataset is converted to a list of Row objects, which have their own data types. If you use online predictions, you must convert your data to this format.
AutoML Tables data types
Categorical
A categorical value represents a member of a category, that is, a value at the nominal level. The values differ only by name and have no order. You can use numbers to represent categorical values, but the values have no numeric relationship with each other. That is, a categorical 1 is not "greater" than a categorical 0.
Here are some examples of categorical values:
- Boolean - true, false.
- Country - "USA", "Canada", "China", and so on.
- HTTP status code - "200", "404", "500", and so on.
Categorical values are case-sensitive; spelling variations are treated as different categories (for example, "Color" and "Colour" are not combined).
Text
A text value represents free-form text, typically composed of text tokens.
Here are some examples of text values:
"The quick brown fox"
"This restaurant is the best! The food is delicious"
Text fields are parsed into tokens by whitespace for model training.
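As a rough illustration of whitespace tokenization, the sketch below splits one of the example strings with Python's built-in `str.split()`. The helper name `whitespace_tokens` is hypothetical, and this is not AutoML Tables' actual tokenizer; it only shows what "parsed into tokens by whitespace" means for a typical value.

```python
def whitespace_tokens(text: str) -> list[str]:
    # str.split() with no argument splits on any run of whitespace
    # and ignores leading and trailing whitespace.
    return text.split()


tokens = whitespace_tokens("This restaurant is the best! The food is delicious")
print(tokens)
```

Note that punctuation stays attached to its neighboring word under this scheme, so "best!" is a single token.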
Numeric
A numeric value represents an ordinal or quantitative number. These numbers can be compared. That is, two distinct numbers can be less than or greater than one another.
AutoML Tables interprets any compatible string as numeric. Leading or trailing whitespace is trimmed.
The following table shows all compatible formats for the numeric data type:
Format | Examples | Notes |
Numeric string | "101", "101.5" | The period character (".") is the only valid decimal delimiter. "101,5" and "100,000" are not valid numeric strings. |
Scientific notation | "1.12345E+11", "1.12345e+11" | See note for numeric strings regarding decimal delimiters. |
Not a number | "NAN", "nan", "+NAN" | Case is ignored. Prepended plus ("+") or minus ("-") characters are ignored. Interpreted as NULL value. |
Infinity | "INF", "+inf" | Case is ignored. Prepended plus ("+") or minus ("-") characters are ignored. Interpreted as NULL value. |
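The rules in this table can be approximated with a small pre-check that you run on your own data before import. The following is a minimal sketch, not the parser AutoML Tables uses: the function name `parse_numeric` and both regular expressions are assumptions made for illustration, and `None` stands in for a NULL value.

```python
import re
from typing import Optional

# Assumed pattern: optional sign, digits with "." as the only decimal
# delimiter, and an optional exponent for scientific notation.
_NUMERIC = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")
# NaN and infinity in any case, with an optional leading sign.
_NULL_LIKE = re.compile(r"[+-]?(nan|inf)", re.IGNORECASE)


def parse_numeric(raw: str) -> Optional[float]:
    """Return a float, None for NaN/Inf (treated as NULL), or raise ValueError."""
    value = raw.strip()  # leading and trailing whitespace is trimmed
    if _NULL_LIKE.fullmatch(value):
        return None
    if _NUMERIC.fullmatch(value):
        return float(value)
    raise ValueError(f"not a valid numeric string: {raw!r}")


for sample in [" 101.5 ", "1.12345E+11", "+NAN", "inf"]:
    print(repr(sample), "->", parse_numeric(sample))
```

Strings such as "101,5" and "100,000" raise `ValueError` here, which matches the note that a comma is not a valid delimiter.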
Timestamp
A Timestamp value represents a point in time, expressed either as a civil time with a time zone or as a Unix timestamp. Only features of type Timestamp can be used for the Time column.
If a time zone is not specified with the civil time, it will default to UTC.
The following table shows all compatible time string formats:
Format | Example | Notes |
%E4Y-%m-%d | "2017-01-30" | See the Abseil documentation for a description of this format. |
%E4Y/%m/%d | "2017/01/30" | |
%Y/%m/%d %H:%M:%E*S | "2017/01/30 23:59:58" | |
%d-%m-%E4Y | "30-11-2018" | |
%d/%m/%E4Y | "30/11/2018" | |
%d-%B-%E4Y | "30-November-2018" | |
%Y-%m-%dT%H:%M:%E*S%Ez | "2019-05-17T23:56:09.05Z" | RFC 3339 |
Unix timestamp string in seconds | "1541194447" | Only for times between 01/Jan/1990 and 01/Jan/2030. |
Unix timestamp string in milliseconds | "1541194447000" | |
Unix timestamp string in microseconds | "1541194447000000" | |
Unix timestamp string in nanoseconds | "1541194447000000000" | |
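To check civil-time strings locally before import, you can try each format in turn and fall back to UTC when no time zone is given. The sketch below is an approximation under stated assumptions: it uses Python's `strptime` directives in place of the Abseil ones (`%Y` for `%E4Y`, plain `%S` for `%E*S`, so fractional seconds are not handled), covers only the formats without a zone offset, and uses the hypothetical name `parse_civil_time`. It is not how AutoML Tables itself parses timestamps.

```python
from datetime import datetime, timezone

# Python strptime approximations of the zone-less civil-time formats.
# %B matches full English month names in the default "C" locale.
_FORMATS = [
    "%Y-%m-%d",
    "%Y/%m/%d",
    "%Y/%m/%d %H:%M:%S",
    "%d-%m-%Y",
    "%d/%m/%Y",
    "%d-%B-%Y",
]


def parse_civil_time(value: str) -> datetime:
    """Parse a civil-time string, defaulting to UTC when no zone is given."""
    for fmt in _FORMATS:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format did not match, try the next one
        return parsed.replace(tzinfo=timezone.utc)
    raise ValueError(f"unrecognized time string: {value!r}")


for sample in ["2017-01-30", "2017/01/30 23:59:58", "30-November-2018"]:
    print(sample, "->", parse_civil_time(sample).isoformat())
```

Trying the year-first formats before the day-first ones keeps the two families from being confused, because `%Y` requires exactly four digits.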
Struct
A struct can be used to represent a group of labeled fields. A struct has a list of field names, each associated with a data type. The list of fields and their data types must be the same for all struct values in a column.
Here are some examples of structs:
- Blood pressure -
{"timestamp": 1535761416, "systolic": 110, "diastolic": 70}
- Product -
{"name": "iPhone", "price": 1000}
You use the BigQuery STRUCT data type to represent structs.
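The requirement that every struct in a column share the same fields and data types can be checked before import. The following is a minimal sketch that models structs as Python dictionaries; the helper names `struct_signature` and `same_struct_shape` are hypothetical, and comparing Python types is only a stand-in for comparing BigQuery field types.

```python
def struct_signature(value: dict) -> dict[str, type]:
    # Map each field name to the Python type of its value.
    return {name: type(field) for name, field in value.items()}


def same_struct_shape(column: list[dict]) -> bool:
    """Return True if every struct has the same field names and types."""
    if not column:
        return True
    expected = struct_signature(column[0])
    return all(struct_signature(value) == expected for value in column[1:])


readings = [
    {"timestamp": 1535761416, "systolic": 110, "diastolic": 70},
    {"timestamp": 1535761500, "systolic": 120, "diastolic": 80},
]
print(same_struct_shape(readings))
```

A reading that omits a field, or that stores a value as a string in one row and an integer in another, makes the check return `False`.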
Array
An array can be used to represent a list of values. The contained values must be of the same data type. You can include compound data types (structs) in an array; all of the structs must have the same structure.
AutoML Tables treats position in an array as relative weight. In other words, items that appear later in the array are weighted more heavily than items that appear towards the beginning.
Here are some examples of arrays:
Product categories:
["Clothing", "Women", "Dress", ...]
Most recent purchases:
["iPhone", "Laptop", "Suitcase", ...]
User records:
[{"name": "Joelle", "ID": 4093}, {"name": "Chloe", "ID": 2047}, {"name": "Neko", "ID": 3432}, ...]
You use the BigQuery ARRAY data type to represent arrays.
Column name format
When you create your schema for BigQuery or your header row for CSV, you name your columns (features) in your training data. Column names can include any alphanumeric character or an underscore (_). The column name cannot begin with an underscore.
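The naming rules above can be expressed as a single regular expression. This is a sketch under one assumption: "alphanumeric" is read as ASCII letters and digits. The function name `is_valid_column_name` is hypothetical, and the check encodes only the two rules stated here.

```python
import re

# First character: a letter or digit (not an underscore).
# Remaining characters: letters, digits, or underscores.
_COLUMN_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9_]*")


def is_valid_column_name(name: str) -> bool:
    return _COLUMN_NAME.fullmatch(name) is not None


for candidate in ["user_id", "_id", "total price", "price-usd"]:
    print(candidate, "->", is_valid_column_name(candidate))
```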
BigQuery tables
Supported data types
Before creating your BigQuery table, you should know which BigQuery data types are supported and how they map to AutoML Tables data types.
BigQuery data type | Supported for import? | AutoML Tables data types |
INT64 | Y | Numeric, Categorical |
NUMERIC | Y | Numeric, Categorical |
FLOAT64 | Y | Numeric, Categorical |
BOOL | Y | Categorical |
STRING | Y | Text, Categorical, Numeric |
BYTES | N | |
DATE | Y | Timestamp, Categorical |
DATETIME | Y | Timestamp, Categorical |
GEOGRAPHY | N | |
TIME | Y | Categorical |
TIMESTAMP | Y | Timestamp, Categorical |
ARRAY | Y | Array |
STRUCT | Y | Struct |
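If you want to pre-check a BigQuery schema in code, the table above translates directly into a lookup. The sketch below copies the mapping from the table; the names `BIGQUERY_TO_AUTOML` and `allowed_automl_types` are hypothetical, and an empty list marks a type that is not supported for import.

```python
# Mapping copied from the table above; an empty list means "not supported".
BIGQUERY_TO_AUTOML = {
    "INT64": ["Numeric", "Categorical"],
    "NUMERIC": ["Numeric", "Categorical"],
    "FLOAT64": ["Numeric", "Categorical"],
    "BOOL": ["Categorical"],
    "STRING": ["Text", "Categorical", "Numeric"],
    "BYTES": [],
    "DATE": ["Timestamp", "Categorical"],
    "DATETIME": ["Timestamp", "Categorical"],
    "GEOGRAPHY": [],
    "TIME": ["Categorical"],
    "TIMESTAMP": ["Timestamp", "Categorical"],
    "ARRAY": ["Array"],
    "STRUCT": ["Struct"],
}


def allowed_automl_types(bigquery_type: str) -> list[str]:
    """Return the AutoML Tables types a BigQuery column type can map to."""
    return BIGQUERY_TO_AUTOML[bigquery_type.upper()]


print(allowed_automl_types("STRING"))
print(allowed_automl_types("BYTES"))
```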
CSV files
Supported data types
All CSV data is imported as strings. You can use the following AutoML Tables data types when you import using CSV:
- Text
- Categorical
- Numeric
- Timestamp
CSV format
AutoML Tables uses the RFC 4180 CSV format.
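One way to produce RFC 4180-style output is Python's standard `csv` module, whose default dialect quotes fields that contain commas or quotes, doubles embedded quotes, and ends records with CRLF. The sketch below writes a small, made-up dataset to an in-memory buffer; the column names and values are illustrative only.

```python
import csv
import io

rows = [
    ["review", "rating", "visited"],
    ['The food is "delicious", really', "4.5", "2017-01-30"],
]

# newline="" leaves the writer's own "\r\n" line endings untouched.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)  # default dialect: minimal quoting, CRLF endings
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Reading `csv_text` back with `csv.reader` returns the original rows, which is a quick way to confirm that the quoting survived the round trip.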
Row object format
When you request an online prediction, you must present the prediction data as a JSON representation of a Row object. The table below shows the acceptable data formats for every AutoML Tables data type. You can choose the data format that is easiest for you to provide.
AutoML Tables data type | Row object data type | Formats |
Categorical | bool_type | true, false |
Categorical | string_value | "42", "blue", "2014-01-31", "2014-01-31 13:14:15.123456789", "21:02:42.118039", "1553040000" (UNIX timestamp) |
Numeric | string_value | "42.3" |
Numeric | number_value | 42.3 |
Text | string_value | "The quick brown fox" |
Timestamp | string_value | "2014-01-31", "2014-01-31 13:14:15.123456789", "1553040000" (UNIX timestamp) |
Array | list_value | ["dog", "cat", "fish"] |
Struct | struct_value | {"field1": "ABC", "field2": 100} |
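As an illustration of these formats, the sketch below assembles the values for one row and serializes them to JSON. The column order is made up, and the `payload`/`row`/`values` envelope is an assumption about the request shape rather than something this page specifies; check the prediction API reference for the exact request body.

```python
import json

# One value per feature column, in a hypothetical column order.
row_values = [
    "blue",                            # Categorical as string_value
    42.3,                              # Numeric as number_value
    "The quick brown fox",             # Text as string_value
    "2014-01-31",                      # Timestamp as string_value
    ["dog", "cat", "fish"],            # Array as list_value
    {"field1": "ABC", "field2": 100},  # Struct as struct_value
]

# Assumed envelope; verify against the API reference before use.
request_body = {"payload": {"row": {"values": row_values}}}

print(json.dumps(request_body, indent=2))
```

Because Numeric accepts either a string or a number, you could also send "42.3" in place of 42.3, whichever is easier for your client to produce.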
What's next
- Learn more about BigQuery data types
- Learn how to prepare your data for import into AutoML Tables