Data types

This page describes the data types you can import into an AutoML Tables dataset, and how those types map to BigQuery or CSV.

Introduction

When you import training data, AutoML Tables suggests a data type for each column based on the native type in the input data and the values in that column. The data type of a column is important because it affects how that column is used in training the model. After importing your data, review each column to confirm that the data type AutoML Tables chose is correct for your data.

When you create the model, the dataset is converted to a list of Row objects, which have their own data types. If you use online predictions, you must convert your data to this format.

AutoML Tables data types

Categorical

A categorical value represents membership in a category, that is, a value at the nominal level of measurement. Categorical values differ only by name and have no order. You can use numbers to represent categorical values, but the values have no numeric relationship with each other. For example, a categorical 1 is not "greater" than a categorical 0.

Here are some examples of categorical values:

  • Boolean - true, false.
  • Country - "USA", "Canada", "China", and so on.
  • HTTP status code - "200", "404", "500", and so on.

Categorical values are case-sensitive; spelling variations are treated as different categories (for example, "Color" and "Colour" are not combined).

Text

A text value represents free-form text, typically composed of text tokens.

Here are some examples of text values:

  • "The quick brown fox"
  • "This restaurant is the best! The food is delicious"

Text fields are parsed into tokens by whitespace for model training.
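
As a rough illustration of whitespace tokenization, the sketch below splits a text value on runs of whitespace. It only shows the idea described above; the function name whitespace_tokens is hypothetical, and the tokenizer AutoML Tables actually uses may treat punctuation or casing differently.

```python
from typing import List


def whitespace_tokens(text: str) -> List[str]:
    """Split free-form text into tokens on runs of whitespace.

    Assumption: punctuation stays attached to the neighboring word,
    because only whitespace is used as a separator here.
    """
    return text.split()


print(whitespace_tokens("This restaurant is the best! The food is delicious"))
```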

Numeric

A numeric value represents an ordinal or quantitative number. These numbers can be compared. That is, two distinct numbers can be less than or greater than one another.

A numeric value can be a number, or a string that contains a valid number and can therefore be interpreted as numeric.

Here are some examples of numeric values:

  • 0
  • 1.1
  • "-10"
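
To check your own data before import, a test along the following lines can flag values that would not parse as numbers. This is a minimal sketch, not the service's own validation: the function name is_numeric is hypothetical, and rejecting booleans, NaN, and infinity is an assumption made here for illustration.

```python
import math


def is_numeric(value) -> bool:
    """Return True for a number or a string that contains a valid number."""
    if isinstance(value, bool):
        # Assumption: booleans are treated as Categorical, not Numeric.
        return False
    if isinstance(value, int):
        return True
    if isinstance(value, float):
        # Assumption: NaN and infinity are not useful numeric values.
        return math.isfinite(value)
    if isinstance(value, str):
        try:
            return math.isfinite(float(value.strip()))
        except ValueError:
            return False
    return False


for candidate in (0, 1.1, "-10", "blue"):
    print(repr(candidate), is_numeric(candidate))
```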

Timestamp

A Timestamp value represents a point in time, expressed either as a civil time with a time zone or as a Unix timestamp. Only features of type Timestamp can be used for the Time column.

If a time zone is not specified with the civil time, it will default to UTC. AutoML Tables supports a variety of common date time formats, including but not limited to the following examples:

  • "2018-01-30"
  • "2018/01/30"
  • "01/30/2018"
  • "30/01/2018"
  • "2018-01-30T23:59:58-0800"
  • "2018-01-30T23:59:58"
  • "2018-01-30T23:59"

AutoML Tables also supports Unix timestamps, in the form of seconds, milliseconds, microseconds, or nanoseconds since the Unix epoch.
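
If you want to confirm locally that your civil-time strings are unambiguous before importing them, a sketch like the following can help. It tries the example formats listed above in a fixed order and applies the UTC default when no time zone is given. The names CIVIL_TIME_FORMATS and parse_civil_time are hypothetical, the month-first preference for slash-separated dates is an assumption of this sketch, and Unix timestamps are not handled here.

```python
from datetime import datetime, timezone

# Formats taken from the examples above, tried in order.
# Assumption: "01/02/2018" is read month-first; day-first is only used
# when the month-first reading is impossible (for example "30/01/2018").
CIVIL_TIME_FORMATS = [
    "%Y-%m-%d",
    "%Y/%m/%d",
    "%m/%d/%Y",
    "%d/%m/%Y",
    "%Y-%m-%dT%H:%M:%S%z",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%dT%H:%M",
]


def parse_civil_time(text: str) -> datetime:
    """Parse a civil time string, defaulting to UTC when no zone is given."""
    for fmt in CIVIL_TIME_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed
    raise ValueError(f"unrecognized civil time: {text!r}")


print(parse_civil_time("2018-01-30T23:59:58-0800").isoformat())
print(parse_civil_time("2018-01-30T23:59").isoformat())
```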

Array

An array can be used to represent a list of values. The contained values must be of the same data type.

AutoML Tables treats the order of array items as a relative weight: items that appear earlier in the array are weighted more heavily than items that appear later.

Here are some examples of arrays:

  • Product categories - ["Clothing", "Women", "Dress"]
  • Most recent purchases - ["iPhone", "Laptop", "Suitcase"]

You use the BigQuery ARRAY data type to represent arrays.

Struct

A struct can be used to represent a group of labeled fields. A struct has a list of field names, each associated with a data type. The list of fields and their data types must be the same for all struct values in a column.

Here are some examples of structs:

  • Blood pressure - {"timestamp": 1535761416, "systolic": 110, "diastolic": 70}
  • Product - {"name": "iPhone", "price": 1000}

You use the BigQuery STRUCT data type to represent structs.
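
Because every struct in a column must share the same fields and field types, it can be useful to check your records before loading them. The following is a minimal sketch using Python dictionaries as stand-ins for struct values; the helper names struct_schema and same_schema are hypothetical, and comparing Python type names is only an approximation of BigQuery's type rules.

```python
from typing import Dict, Iterable


def struct_schema(value: dict) -> Dict[str, str]:
    """Map each field name to the name of its Python type."""
    return {name: type(field).__name__ for name, field in value.items()}


def same_schema(values: Iterable[dict]) -> bool:
    """Return True if all struct values share field names and types."""
    schemas = [struct_schema(value) for value in values]
    return all(schema == schemas[0] for schema in schemas[1:])


products = [
    {"name": "iPhone", "price": 1000},
    {"name": "Laptop", "price": 1500},
]
print(same_schema(products))
```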

Column name format

When you create your schema for BigQuery or your header row for CSV, you name your columns (features) in your training data. Column names can include any alphanumeric character or an underscore (_). The column name cannot begin with an underscore.
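
The naming rule above can be expressed as a regular expression. This sketch assumes that "alphanumeric" means ASCII letters and digits, which the text does not state explicitly; the name is_valid_column_name is hypothetical.

```python
import re

# One leading letter or digit, then any mix of letters, digits, underscores.
# Assumption: only ASCII alphanumerics are allowed.
COLUMN_NAME_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9_]*")


def is_valid_column_name(name: str) -> bool:
    """Check a column name against the rule described above."""
    return COLUMN_NAME_PATTERN.fullmatch(name) is not None


for name in ("user_id", "2nd_purchase", "_internal", "first name"):
    print(name, is_valid_column_name(name))
```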

BigQuery tables

Supported data types

Before creating your BigQuery table, you should know which BigQuery data types are supported and how they map to AutoML Tables data types.

BigQuery data type   Supported for import?   AutoML Tables data types
INT64                Y                       Numeric, Categorical
NUMERIC              Y                       Numeric, Categorical
FLOAT64              Y                       Numeric, Categorical
BOOL                 Y                       Categorical
STRING               Y                       Text, Categorical, Numeric
BYTES                N
DATE                 Y                       Timestamp, Categorical
DATETIME             Y                       Timestamp, Categorical
GEOGRAPHY            N
TIME                 Y                       Categorical
TIMESTAMP            Y                       Timestamp, Categorical
ARRAY                Y                       Array
STRUCT               Y                       Struct
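
If you generate BigQuery schemas programmatically, the table above can be kept as a lookup so unsupported columns are caught before import. This is an illustrative sketch only; the names BIGQUERY_TO_AUTOML, automl_types, and is_importable are hypothetical and not part of any AutoML Tables client library.

```python
# Transcribed from the table above; an empty tuple means "not supported".
BIGQUERY_TO_AUTOML = {
    "INT64": ("Numeric", "Categorical"),
    "NUMERIC": ("Numeric", "Categorical"),
    "FLOAT64": ("Numeric", "Categorical"),
    "BOOL": ("Categorical",),
    "STRING": ("Text", "Categorical", "Numeric"),
    "BYTES": (),
    "DATE": ("Timestamp", "Categorical"),
    "DATETIME": ("Timestamp", "Categorical"),
    "GEOGRAPHY": (),
    "TIME": ("Categorical",),
    "TIMESTAMP": ("Timestamp", "Categorical"),
    "ARRAY": ("Array",),
    "STRUCT": ("Struct",),
}


def automl_types(bigquery_type: str) -> tuple:
    """Return the AutoML Tables types a BigQuery type can map to."""
    return BIGQUERY_TO_AUTOML[bigquery_type.upper()]


def is_importable(bigquery_type: str) -> bool:
    """Return True if the BigQuery type is supported for import."""
    return bool(BIGQUERY_TO_AUTOML.get(bigquery_type.upper(), ()))


print(automl_types("STRING"))
print(is_importable("GEOGRAPHY"))
```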

CSV files

Supported data types

All CSV data is imported as strings. You can use the following AutoML Tables data types when you import using CSV:

  • Text
  • Categorical
  • Numeric
  • Timestamp

CSV format

AutoML Tables uses the RFC 4180 CSV format.
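
As one way to produce such a file, Python's standard csv module quotes fields that contain commas, quotes, or line breaks, and doubles embedded quotes, which is in line with RFC 4180. The sketch below writes a header row and data rows to a string; the function name rows_to_csv and the sample column names are hypothetical.

```python
import csv
import io
from typing import Iterable, Sequence


def rows_to_csv(header: Sequence[str], rows: Iterable[Sequence]) -> str:
    """Serialize a header and rows as CSV text with CRLF line endings."""
    buffer = io.StringIO(newline="")
    # Default quoting is minimal: only fields that need quotes get them.
    writer = csv.writer(buffer, lineterminator="\r\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


text = rows_to_csv(
    ["review", "rating"],
    [['He said "great", twice', 5], ["The quick brown fox", 4]],
)
print(text)
```

Every value ends up as text in the file, which matches the note above that all CSV data is imported as strings.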

Row object format

When you request an online prediction, you must present the prediction data as a JSON representation of a Row object. The table below shows the acceptable data formats for every AutoML Tables data type. You can choose the data format that is easiest for you to provide.

AutoML Tables data type   Row object data type   Formats
Categorical               bool_value             true, false
                          string_value           "42"
                                                 "blue"
                                                 "2014-01-31"
                                                 "2014-01-31 13:14:15.123456789"
                                                 "21:02:42.118039"
                                                 "1553040000" (UNIX timestamp)
                          number_value           42.3
Numeric                   string_value           "42.3"
                          number_value           42.3
Text                      string_value           "The quick brown fox"
Timestamp                 string_value           "2014-01-31"
                                                 "2014-01-31 13:14:15.123456789"
                                                 "1553040000" (UNIX timestamp)
Array                     list_value             ["dog", "cat", "fish"]
Struct                    struct_value           {"field1": "ABC", "field2": 100}
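
The following sketch shows how one row of prediction data might be assembled and serialized as JSON using the formats in the table. It is illustrative only: the helper build_row is hypothetical, the "values" key and the requirement that values follow your column order are assumptions to verify against the API reference, and the full request envelope is not shown.

```python
import json
from typing import Any, Dict, Sequence


def build_row(values: Sequence[Any]) -> Dict[str, list]:
    """Wrap column values in a dict shaped like a Row object.

    Assumption: the Row JSON carries its cells under a "values" key.
    """
    return {"values": list(values)}


row = build_row([
    "blue",                             # Categorical as a string
    42.3,                               # Numeric as a number
    "The quick brown fox",              # Text
    "2014-01-31",                       # Timestamp as a string
    ["dog", "cat", "fish"],             # Array
    {"field1": "ABC", "field2": 100},   # Struct
])
print(json.dumps(row, indent=2))
```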
