Training with built-in algorithms on AI Platform Training allows you to submit your dataset and train a model without writing any training code. This page explains how the built-in linear learner algorithm works, and how to use it.
Overview
This built-in algorithm does preprocessing and training:
- Preprocessing: AI Platform Training processes your mix of categorical and numerical data into an all numerical dataset in order to prepare it for training.
- Training: Using the dataset and the model parameters you supplied, AI Platform Training runs training using TensorFlow's Linear Estimator.
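For intuition, the sketch below shows roughly what fitting one of TensorFlow's canned linear estimators looks like. This is not the built-in algorithm's actual code; the column names, sample data, and hyperparameter values are assumptions chosen only for illustration.

```python
# Illustrative sketch only: roughly how a TensorFlow canned linear estimator
# is trained. Column names, sample data, and hyperparameters are assumptions
# for illustration, not values used by the built-in algorithm.
import tensorflow as tf

# Feature columns are normally derived from the preprocessing metadata.
feature_columns = [
    tf.feature_column.numeric_column('age'),
    tf.feature_column.categorical_column_with_vocabulary_list(
        'color', vocabulary_list=['red', 'green', 'blue']),
]

def input_fn():
    # A tiny in-memory stand-in for the preprocessed CSV data.
    features = {'age': [25.0, 32.0, 47.0], 'color': ['red', 'blue', 'green']}
    labels = [0, 1, 1]
    return tf.data.Dataset.from_tensor_slices((features, labels)).batch(2)

# tf.estimator.LinearRegressor is the analog for regression problems.
estimator = tf.estimator.LinearClassifier(feature_columns=feature_columns)
estimator.train(input_fn=input_fn, max_steps=1000)
```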
Limitations
The following features are not supported for training with the built-in linear learner algorithm:
- Multi-GPU training. Built-in algorithms use only one GPU at a time. To take full advantage of training with multiple GPUs on one machine, you must create a training application. Find more information about machine types.
- Training with TPUs. To train with TPUs, you must create a training application. Learn how to run a training job with TPUs.
- Distributed training. To run a distributed training job on AI Platform Training, you must create a training application.
Supported machine types
The following AI Platform Training scale tiers and machine types are supported:
- BASIC scale tier
- CUSTOM scale tier with any of the Compute Engine machine types supported by AI Platform Training
- CUSTOM scale tier with any of the following legacy machine types:
  - standard
  - large_model
  - complex_model_s
  - complex_model_m
  - complex_model_l
  - standard_gpu
  - standard_p100
  - standard_v100
  - large_model_v100
Format input data
Each row of a dataset represents one instance, and each column of a dataset represents a feature value. The target column represents the value you want to predict.
Prepare CSV file
Your input data must be a CSV file with UTF-8 encoding. If your training data consists only of categorical and numerical values, then you can use our preprocessing module to fill in missing numerical values, split the dataset, and remove rows with more than 10% of their values missing. Otherwise, you can run training without automatic preprocessing enabled.
You must prepare your input CSV file to meet the following requirements:
- Remove the header row. The header row contains the labels for each column. Remove the header row in order to avoid submitting it with the rest of the data instances as part of the training data.
- Ensure that the target column is the first column. The target column contains the value that you are trying to predict. For a classification algorithm, all values in the target column are a class or category. For a regression algorithm, all values in the target column are a numerical value.
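For example, a short pandas sketch along these lines could produce a compliant file. The input file name raw_data.csv, the output file name training_data.csv, and the target column name label are hypothetical:

```python
# Hypothetical example: prepare a raw CSV for the built-in linear learner.
# 'raw_data.csv', 'training_data.csv', and the 'label' column are assumptions.
import pandas as pd

df = pd.read_csv('raw_data.csv')

# Move the target column to the first position.
target = 'label'
df = df[[target] + [c for c in df.columns if c != target]]

# Write the file without the header row, as required by the algorithm.
df.to_csv('training_data.csv', index=False, header=False)
```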
Handle integer values
The meaning of integer values can be ambiguous, which makes columns of integer values problematic in automatic preprocessing. AI Platform Training automatically determines how to handle integer values. By default:
- If every integer value is unique, the column is treated as an instance key column.
- If there are only a few unique integer values, the column is treated as categorical.
- Otherwise, the values in the column are converted to float and treated as numerical.
To override these default determinations:
- If the data should be treated as numerical, convert all integer values in the column to floating point, for example {101.0, 102.0, 103.0}.
- If the data should be treated as categorical, prepend a non-numeric prefix to all integer values in the column, for example {code_101, code_102, code_103}.
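Both overrides are simple column transformations. A small pandas sketch, with a hypothetical file name and hypothetical column positions:

```python
# Hypothetical example: override how integer columns are interpreted.
import pandas as pd

df = pd.read_csv('training_data.csv', header=None)

# Treat column 3 as numerical: convert integers to floating point,
# e.g. 101 -> 101.0.
df[3] = df[3].astype(float)

# Treat column 5 as categorical: prepend a non-numeric prefix,
# e.g. 101 -> 'code_101'.
df[5] = 'code_' + df[5].astype(str)

df.to_csv('training_data.csv', index=False, header=False)
```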
Check Cloud Storage bucket permissions
To store your data, use a Cloud Storage bucket in the same Google Cloud project you're using to run AI Platform Training jobs. Otherwise, grant AI Platform Training access to the Cloud Storage bucket where your data is stored.
Submit a linear learner training job
This section explains how to submit a training job using the built-in linear learner algorithm.
You can find brief explanations of each hyperparameter within the Google Cloud console, and a more comprehensive explanation in the reference for the built-in linear learner algorithm.
Console
Go to the AI Platform Training Jobs page in the Google Cloud console.
Click the New training job button. From the options that display below, click Built-in algorithm training.
On the Create a new training job page, select linear learner and click Next.
To learn more about all the available parameters, follow the links in the Google Cloud console and refer to the built-in linear learner reference.
gcloud
Set environment variables for your job, filling in the placeholder values (such as BUCKET_NAME and MODEL_NAME) with your own values:

```bash
# Specify the name of the Cloud Storage bucket where you want your
# training outputs to be stored, and the Docker container for
# your built-in algorithm selection.
BUCKET_NAME='BUCKET_NAME'
IMAGE_URI='gcr.io/cloud-ml-algos/linear_learner_cpu:latest'

# Specify the Cloud Storage path to your training input data.
TRAINING_DATA="gs://${BUCKET_NAME}/YOUR_FILE_NAME.csv"

# Specify whether you are training a 'classification' or 'regression' model.
MODEL_TYPE='MODEL_TYPE'

DATE="$(date '+%Y%m%d_%H%M%S')"
MODEL_NAME='MODEL_NAME'
JOB_ID="${MODEL_NAME}_${DATE}"

JOB_DIR="gs://${BUCKET_NAME}/algorithm_training/${MODEL_NAME}/${DATE}"
```
Submit the training job using gcloud ai-platform jobs submit training. Adjust this generic example to work with your dataset:

```bash
gcloud ai-platform jobs submit training $JOB_ID \
  --master-image-uri=$IMAGE_URI --scale-tier=BASIC --job-dir=$JOB_DIR \
  -- \
  --preprocess --model_type=$MODEL_TYPE --batch_size=250 \
  --learning_rate=0.1 --max_steps=1000 \
  --training_data_path=$TRAINING_DATA
```
Monitor the status of your training job by viewing logs with gcloud. Refer to gcloud ai-platform jobs describe and gcloud ai-platform jobs stream-logs.

```bash
gcloud ai-platform jobs describe ${JOB_ID}
gcloud ai-platform jobs stream-logs ${JOB_ID}
```
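If you prefer to check the job status from Python instead of the gcloud CLI, a minimal sketch using the google-api-python-client library could look like the following. It assumes application default credentials are configured and reads the hypothetical PROJECT_ID and JOB_ID environment variables.

```python
# Sketch: poll an AI Platform Training job's state from Python.
# Assumes the google-api-python-client package and application default
# credentials; PROJECT_ID and JOB_ID are read from the environment.
import os
import time

from googleapiclient import discovery

project_id = os.environ['PROJECT_ID']
job_id = os.environ['JOB_ID']

ml = discovery.build('ml', 'v1')
job_name = 'projects/{}/jobs/{}'.format(project_id, job_id)

while True:
    job = ml.projects().jobs().get(name=job_name).execute()
    state = job.get('state')
    print('Job state:', state)
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(60)
```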
How preprocessing works
Automatic preprocessing works for categorical and numerical data. The preprocessing routine first analyzes and then transforms your data.
Analysis
First, AI Platform Training automatically detects the data type of each column, identifies how each column should be treated, and computes some statistics of the data in the column. This information is captured in the metadata.json file.
AI Platform Training analyzes the type of the target column to identify whether the given dataset is for regression or classification. If this analysis conflicts with your selection for model_type, it results in an error. In ambiguous cases, format your data clearly so that the target column is treated the way you intend.
Type: The column can be either numerical or categorical.
Treatment: AI Platform Training identifies how to treat each column as follows:
- If the column includes a single value in all the rows, it is treated as a constant.
- If the column is categorical, and includes unique values in all the rows, it is treated as a row_identifier.
- If the column is numerical with float values, or if it's numerical with integer values and it contains many unique values, then the column is treated as numerical.
- If the column is numerical with integer values, and it contains
few enough unique values, then the column is treated as
a categorical column where the integer values are the identity or
the vocabulary.
- A column is considered to have few unique values if the number of unique values in the column is less than 20% of the number of rows in the input dataset.
- If the column is categorical with high cardinality, then the column is treated with hashing, where the number of hash buckets equals the square root of the number of unique values in the column.
- A categorical column is considered to have high cardinality if the number of unique values is greater than the square root of the number of rows in the dataset.
- If the column is categorical, and the number of unique values is less than or equal to the square root of the number of rows in the dataset, then the column is treated as a normal categorical column with vocabulary.
Statistics: AI Platform Training computes the following statistics, based on the identified column type and treatment, to be used for transforming the column in a later stage.
- If the column is numerical, the mean and variance values are computed.
- If the column is categorical, and the treatment is identity or vocabulary, the distinct values are extracted from the column.
- If the column is categorical, and the treatment is hashing, the number of hash buckets is computed with respect to the cardinality of the column.
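The following standalone sketch is not AI Platform Training's implementation; it only restates the documented heuristics (the 20% rule for integer columns, and the square-root rules for high cardinality and for the number of hash buckets) as code to make the thresholds concrete:

```python
# Sketch of the column-analysis heuristics described above. This restates the
# documented rules for illustration; it is not the service's actual code.
import math

import pandas as pd

def analyze_column(col: pd.Series, num_rows: int) -> dict:
    unique = col.nunique()
    if unique == 1:
        # Single value in all rows: constant.
        return {'treatment': 'constant'}

    is_integer = pd.api.types.is_integer_dtype(col)
    is_float = pd.api.types.is_float_dtype(col)

    if is_integer and unique == num_rows:
        # Every integer value is unique: treated as instance keys.
        return {'treatment': 'row_identifier'}

    if is_float or (is_integer and unique >= 0.2 * num_rows):
        # Numerical: record mean and variance for later standardization.
        return {'treatment': 'numerical',
                'mean': float(col.mean()),
                'variance': float(col.var())}

    if is_integer:
        # Few unique integers (< 20% of rows): categorical with identity.
        return {'treatment': 'categorical_identity',
                'num_buckets': int(col.max()) + 1}

    # Remaining columns are categorical (string) columns.
    if unique == num_rows:
        return {'treatment': 'row_identifier'}
    if unique > math.sqrt(num_rows):
        # High cardinality: hashing, with sqrt(unique) hash buckets.
        return {'treatment': 'categorical_hash',
                'hash_buckets': int(math.sqrt(unique))}
    return {'treatment': 'categorical_vocabulary',
            'vocabulary': sorted(col.unique().tolist())}
```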
Transformation
After the initial analysis of the dataset is complete, AI Platform Training transforms your data based on the types, treatments, and statistics applied to your dataset. AI Platform Training performs transformations in the following order:
- Splits the training dataset into validation and test datasets if you specify the split percentages.
- Removes rows that have more than 10% of features missing.
- Fills in missing numerical values using the mean of the column.
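A minimal pandas sketch of the last two steps, written only to illustrate the 10% threshold and mean imputation described above:

```python
# Illustrative sketch of the row-removal and mean-imputation steps described
# above; it is not the service's actual preprocessing code.
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Remove rows with more than 10% of their values missing.
    max_missing = 0.10 * df.shape[1]
    df = df[df.isna().sum(axis=1) <= max_missing].copy()

    # Fill missing numerical values with the mean of their column.
    numeric_cols = df.select_dtypes(include='number').columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
    return df
```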
Example transformations
Rows with more than 10% of their values missing are removed. In the following examples, assume each row has 10 values. Each example row is truncated for simplicity.
Row issue | Original values | Transformed values | Explanation |
---|---|---|---|
Example row with no missing values | [3, 0.45, ..., 'fruits', 0, 1] | [3, 0.45, ..., 1, 0, 0, 0, 1] | The string 'fruits' is transformed to the values "1, 0, 0" in one-hot encoding. This happens later in the TensorFlow graph. |
Too many missing values | [3, 0.45, ..., 'fruits', __, __] | Row is removed | More than 10% of values in the row are missing. |
Missing numerical value | [3, 0.45, ..., 'fruits', 0, __] | [3, 0.45, ..., 1, 0, 0, 0, 0.54] | The missing numerical value is filled in with the mean of the column (0.54 in this example). |
Missing categorical value | [3, 0.45, ..., __, 0, 1] | [3, 0.45, ..., 0, 0, 0, 0, 1] | The missing categorical value is one-hot encoded as all zeros. |
Feature columns
During transformation, the columns themselves are not processed. Instead, AI Platform Training uses the metadata produced during analysis to create the corresponding feature columns:
Column type | Column treatment | Resulting feature column |
---|---|---|
Numerical | (All column treatment types) | tf.feature_column.numeric_column. The mean and variance values are used to standardize the values: new_value = (input_value - mean) / sqrt(variance) |
Categorical | Identity | tf.feature_column.categorical_column_with_identity |
Categorical | Vocabulary | tf.feature_column.categorical_column_with_vocabulary_list |
Categorical | Hashing | tf.feature_column.categorical_column_with_hash_bucket |
Categorical | Constant or Row identifier | Ignored. No feature column is created. |
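As a rough sketch, the mapping in the table could be expressed with the TensorFlow feature-column API as follows. The shape of the metadata dictionary is an assumption made for this example and is not the actual metadata.json schema:

```python
# Sketch: build TensorFlow feature columns from analysis metadata.
# The structure of the `meta` dict below is assumed for illustration and
# does not reflect the actual metadata.json schema.
import math

import tensorflow as tf

def build_feature_column(name: str, meta: dict):
    if meta['treatment'] == 'numerical':
        mean, variance = meta['mean'], meta['variance']
        # Standardize: new_value = (input_value - mean) / sqrt(variance).
        return tf.feature_column.numeric_column(
            name,
            normalizer_fn=lambda x: (x - mean) / math.sqrt(variance))
    if meta['treatment'] == 'categorical_identity':
        return tf.feature_column.categorical_column_with_identity(
            name, num_buckets=meta['num_buckets'])
    if meta['treatment'] == 'categorical_vocabulary':
        return tf.feature_column.categorical_column_with_vocabulary_list(
            name, vocabulary_list=meta['vocabulary'])
    if meta['treatment'] == 'categorical_hash':
        return tf.feature_column.categorical_column_with_hash_bucket(
            name, hash_bucket_size=meta['hash_buckets'])
    # Constant and row-identifier columns are ignored.
    return None
```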
After automatic preprocessing is complete, AI Platform Training uploads your processed dataset back to your Cloud Storage bucket at the directory you specified in the job request.
Further learning resources
- Learn more about large-scale linear models.
- Learn more about how linear models are built with the TensorFlow Estimator API.
- Learn more about TensorFlow feature columns.