Training using the built-in linear learner algorithm

Training with built-in algorithms on AI Platform Training allows you to submit your dataset and train a model without writing any training code. This page explains how the built-in linear learner algorithm works, and how to use it.

Overview

This built-in algorithm does preprocessing and training:

  1. Preprocessing: AI Platform Training converts your mix of categorical and numerical data into an all-numerical dataset to prepare it for training.
  2. Training: Using the dataset and the model parameters you supplied, AI Platform Training runs training using TensorFlow's Linear Estimator.

Limitations

The following features are not supported for training with the built-in linear learner algorithm:

Supported machine types

The following AI Platform Training scale tiers and machine types are supported:

Format input data

Each row of a dataset represents one instance, and each column of a dataset represents a feature value. The target column represents the value you want to predict.

Prepare CSV file

Your input data must be a CSV file with UTF-8 encoding. If your training data consists only of categorical and numerical values, you can use the preprocessing module to fill in missing numerical values, split the dataset, and remove rows with more than 10% of values missing. Otherwise, run training without automatic preprocessing enabled.

You must prepare your input CSV file to meet the following requirements:

  • Remove the header row. The header row contains the labels for each column. Remove the header row in order to avoid submitting it with the rest of the data instances as part of the training data.
  • Ensure that the target column is the first column. The target column contains the value that you are trying to predict. For a classification algorithm, all values in the target column are a class or category. For a regression algorithm, all values in the target column are a numerical value.
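
The two requirements above can be met with a few lines of Python. The following is a minimal sketch, not an official tool: the function name prepare_rows, the sample data, and the column name "label" are all hypothetical, and the sketch assumes the raw file still has its header row.

```python
import csv
import io


def prepare_rows(raw_csv, target_name):
    """Drop the header row and move the target column to the first position."""
    reader = csv.reader(io.StringIO(raw_csv))
    header = next(reader)  # the header is read here and never written back
    target_idx = header.index(target_name)
    prepared = []
    for row in reader:
        rest = row[:target_idx] + row[target_idx + 1:]
        prepared.append([row[target_idx]] + rest)
    return prepared


# Hypothetical sample: the target column "label" is last in the raw file.
raw = "age,city,label\n34,Paris,yes\n51,Lima,no\n"
rows = prepare_rows(raw, "label")

# Serialize back to headerless CSV text, ready to be saved as UTF-8.
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(rows)
prepared_csv = out.getvalue()
```

Here prepared_csv holds two data rows with the target value first and no header line.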

Handle integer values

The meaning of integer values can be ambiguous, which makes columns of integer values problematic in automatic preprocessing. AI Platform Training automatically determines how to handle integer values. By default:

  • If every integer value is unique, the column is treated as instance keys.
  • If there are only a few unique integer values, the column is treated as categorical.
  • Otherwise, the values in the column are converted to float and treated as numerical.

To override these default determinations:

  • If the data should be treated as numerical, convert all integer values in the column to floating point, for example {101.0, 102.0, 103.0}.
  • If the data should be treated as categorical, prepend a non-numeric prefix to all integer values in the column, for example {code_101, code_102, code_103}.
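
As an illustration of these two overrides, the sketch below rewrites a column of integers either as floats or as prefixed strings before the CSV file is written. The helper names and the default prefix "code_" are hypothetical; any non-numeric prefix should serve the same purpose.

```python
def force_numerical(values):
    """Render each integer as a float string, for example 101 -> '101.0'."""
    return [str(float(v)) for v in values]


def force_categorical(values, prefix="code_"):
    """Prepend a non-numeric prefix, for example 101 -> 'code_101'."""
    return [f"{prefix}{v}" for v in values]


codes = [101, 102, 103]
as_numerical = force_numerical(codes)
as_categorical = force_categorical(codes)
```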

Check Cloud Storage bucket permissions

To store your data, use a Cloud Storage bucket in the same Google Cloud project you're using to run AI Platform Training jobs. Otherwise, grant AI Platform Training access to the Cloud Storage bucket where your data is stored.

Submit a linear learner training job

This section explains how to submit a training job using the built-in linear learner algorithm.

You can find brief explanations of each hyperparameter within the Google Cloud console, and a more comprehensive explanation in the reference for the built-in linear learner algorithm.

Console

  1. Go to the AI Platform Training Jobs page in the Google Cloud console:

    AI Platform Training Jobs page

  2. Click the New training job button. From the options that display below, click Built-in algorithm training.

  3. On the Create a new training job page, select linear learner and click Next.

  4. To learn more about all the available parameters, follow the links in the Google Cloud console and refer to the built-in linear learner reference for more details.

gcloud

  1. Set environment variables for your job, replacing the uppercase placeholder values (such as BUCKET_NAME, YOUR_FILE_NAME, and MODEL_NAME) with your own values:

       # Specify the name of the Cloud Storage bucket where you want your
       # training outputs to be stored, and the Docker container for
       # your built-in algorithm selection.
       BUCKET_NAME='BUCKET_NAME'
       IMAGE_URI='gcr.io/cloud-ml-algos/linear_learner_cpu:latest'

       # Specify the Cloud Storage path to your training input data.
       # Double quotes are required so that the shell expands BUCKET_NAME.
       TRAINING_DATA="gs://${BUCKET_NAME}/YOUR_FILE_NAME.csv"

       # Set the learning task, 'classification' or 'regression', to match
       # your target column. This variable is used when submitting the job.
       MODEL_TYPE='classification'

       DATE="$(date '+%Y%m%d_%H%M%S')"
       MODEL_NAME='MODEL_NAME'
       JOB_ID="${MODEL_NAME}_${DATE}"

       JOB_DIR="gs://${BUCKET_NAME}/algorithm_training/${MODEL_NAME}/${DATE}"
    
  2. Submit the training job using gcloud ai-platform jobs submit training. Adjust this generic example to work with your dataset:

       gcloud ai-platform jobs submit training $JOB_ID \
          --master-image-uri=$IMAGE_URI --scale-tier=BASIC --job-dir=$JOB_DIR \
          -- \
          --preprocess --model_type=$MODEL_TYPE --batch_size=250 \
          --learning_rate=0.1 --max_steps=1000 \
          --training_data_path=$TRAINING_DATA
    
  3. Monitor the status of your training job by viewing logs with gcloud. Refer to gcloud ai-platform jobs describe and gcloud ai-platform jobs stream-logs.

       gcloud ai-platform jobs describe ${JOB_ID}
       gcloud ai-platform jobs stream-logs ${JOB_ID}
    

How preprocessing works

Automatic preprocessing works for categorical and numerical data. The preprocessing routine first analyzes and then transforms your data.

Analysis

First, AI Platform Training automatically detects the data type of each column, identifies how each column should be treated, and computes some statistics of the data in the column. This information is captured in the metadata.json file.

AI Platform Training analyzes the type of the target column to identify whether the given dataset is for regression or classification. If this analysis conflicts with your selection for the model_type, it results in an error. Be explicit about how the target column should be treated by formatting your data clearly in ambiguous cases.

  • Type: The column can be either numerical or categorical.

  • Treatment: AI Platform Training identifies how to treat each column as follows:

    • If the column includes a single value in all the rows, it is treated as a constant.
    • If the column is categorical and has a different value in every row, it is treated as a row_identifier.
    • If the column is numerical with float values, or if it's numerical with integer values and it contains many unique values, then the column is treated as numerical.
    • If the column is numerical with integer values, and it contains few enough unique values, then the column is treated as a categorical column where the integer values are the identity or the vocabulary.
      • A column is considered to have few unique values if the number of unique values in the column is less than 20% of the number of rows in the input dataset.
    • If the column is categorical with high cardinality, then the column is treated with hashing, where the number of hash buckets equals the square root of the number of unique values in the column.
      • A categorical column is considered to have high cardinality if the number of unique values is greater than the square root of the number of rows in the dataset.
    • If the column is categorical, and the number of unique values is less than or equal to the square root of the number of rows in the dataset, then the column is treated as a normal categorical column with vocabulary.
  • Statistics: AI Platform Training computes the following statistics, based on the identified column type and treatment, to be used for transforming the column in a later stage.

    • If the column is numerical, the mean and variance values are computed.
    • If the column is categorical, and the treatment is identity or vocabulary, the distinct values are extracted from the column.
    • If the column is categorical, and the treatment is hashing, the number of hash buckets is computed with respect to the cardinality of the column.
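
The treatment rules above can be read as a small decision procedure. The sketch below is one possible reading, not the service's implementation: the function names are hypothetical, type detection by attempting to parse each value is an assumption, and columns are assumed to contain no missing values.

```python
import math


def column_kind(values):
    """Classify raw CSV strings as 'integer', 'float', or 'categorical'."""
    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    return "categorical"


def column_treatment(values):
    """Apply the treatment rules to one column of raw string values."""
    n_rows = len(values)
    n_unique = len(set(values))
    kind = column_kind(values)

    if n_unique == 1:
        return "constant"
    if kind == "categorical":
        if n_unique == n_rows:
            return "row_identifier"
        if n_unique > math.sqrt(n_rows):  # high cardinality
            return "hashing"
        return "vocabulary"
    # Integers with few unique values (< 20% of rows) become categorical.
    if kind == "integer" and n_unique < 0.2 * n_rows:
        return "categorical (identity or vocabulary)"
    return "numerical"


constant = column_treatment(["a"] * 10)
row_id = column_treatment([f"id_{i}" for i in range(10)])
floats = column_treatment(["0.5", "1.5", "0.5", "2.0"])
few_ints = column_treatment(["1", "2"] * 10)
many_ints = column_treatment([str(i) for i in range(10)])
hashed = column_treatment(["a", "b", "c", "d", "e"] + ["a"] * 11)
vocab = column_treatment(["red", "blue"] * 8)
```

The row_identifier check runs before the cardinality check because a categorical column with a different value in every row would otherwise also count as high cardinality.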

Transformation

After the initial analysis of the dataset is complete, AI Platform Training transforms your data based on the types, treatments and statistics applied to your dataset. AI Platform Training does transformations in the following order:

  1. Splits the training dataset into validation and test datasets if you specify the split percentages.
  2. Removes rows that have more than 10% of features missing.
  3. Fills in missing numerical values using the mean of the column.
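
Steps 2 and 3 can be sketched in plain Python as follows. This is an illustration only: missing values are represented as None, the function names are hypothetical, and computing the mean over the rows that remain after removal is an assumption, since the page does not say which rows contribute to the mean.

```python
def drop_sparse_rows(rows, max_missing_fraction=0.10):
    """Keep rows whose share of missing (None) values is at most the limit."""
    return [
        row for row in rows
        if sum(v is None for v in row) / len(row) <= max_missing_fraction
    ]


def fill_with_column_mean(rows, numerical_columns):
    """Return copies of rows with None replaced by the column mean."""
    filled = [list(row) for row in rows]
    for col in numerical_columns:
        present = [row[col] for row in filled if row[col] is not None]
        mean = sum(present) / len(present)
        for row in filled:
            if row[col] is None:
                row[col] = mean
    return filled


# Hypothetical rows with 10 values each.
rows = [
    [3, 0.45, 1, 1, 1, 1, 1, 1, 0, 1.0],
    [3, 0.45, 1, 1, 1, 1, 1, 1, None, None],  # 20% missing: removed
    [3, 0.45, 1, 1, 1, 1, 1, 1, 0, None],     # 10% missing: kept, then filled
    [3, 0.45, 1, 1, 1, 1, 1, 1, 0, 3.0],
]
kept = drop_sparse_rows(rows)
filled = fill_with_column_mean(kept, numerical_columns=[9])
```

With these sample rows, the second row is dropped and the missing value in the third row is replaced by the mean of 1.0 and 3.0.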

Example transformations

Rows with more than 10% of their values missing are removed. In the following examples, assume the row has 10 values. Each example row is truncated for simplicity.

| Row issue | Original values | Transformed values | Explanation |
|---|---|---|---|
| Example row with no missing values | [3, 0.45, ..., 'fruits', 0, 1] | [3, 0.45, ..., 1, 0, 0, 0, 1] | The string 'fruits' is transformed to the values "1, 0, 0" in one-hot encoding. This happens later in the TensorFlow graph. |
| Too many missing values | [3, 0.45, ..., 'fruits', __, __] | Row is removed | More than 10% of values in the row are missing. |
| Missing numerical value | [3, 0.45, ..., 'fruits', 0, __] | [3, 0.45, ..., 1, 0, 0, 0, 0.54] | The mean value for the column replaces the missing numerical value. In this example, the mean is 0.54. The string 'fruits' is transformed to the values "1, 0, 0" in one-hot encoding. This happens later in the TensorFlow graph. |
| Missing categorical value | [3, 0.45, ..., __, 0, 1] | [3, 0.45, ..., 0, 0, 0, 0, 1] | The missing categorical value is transformed to the values "0, 0, 0" in one-hot encoding. This happens later in the TensorFlow graph. |
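
The one-hot encoding used in these examples can be illustrated outside TensorFlow with a short sketch. The three-term vocabulary below is hypothetical, chosen only so that 'fruits' maps to "1, 0, 0"; in the service this encoding happens later in the TensorFlow graph rather than in code like this.

```python
def one_hot(value, vocabulary):
    """Return a 0/1 list with a 1 at the position of value, if present."""
    return [1 if value == term else 0 for term in vocabulary]


vocabulary = ["fruits", "vegetables", "grains"]  # hypothetical vocabulary

encoded_fruits = one_hot("fruits", vocabulary)
encoded_missing = one_hot(None, vocabulary)  # a missing value matches no term
```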

Feature columns

During transformation, the columns are not processed. Instead, the metadata produced during analysis is passed to AI Platform Training to create the feature columns accordingly:

| Column type | Column treatment | Resulting feature column |
|---|---|---|
| Numerical | (All column treatment types) | tf.feature_column.numeric_column. The mean and variance values are used to standardize the values: new_value = (input_value - mean) / sqrt(variance) |
| Categorical | Identity | tf.feature_column.categorical_column_with_identity |
| Categorical | Vocabulary | tf.feature_column.categorical_column_with_vocabulary_list |
| Categorical | Hashing | tf.feature_column.categorical_column_with_hash_bucket |
| Categorical | Constant or Row identifier | Ignored. No feature column created. |
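
The standardization applied to numerical columns, new_value = (input_value - mean) / sqrt(variance), can be sketched with the Python standard library. The function name is hypothetical, and using the population variance is an assumption, because the page does not say whether population or sample variance is computed.

```python
import math
import statistics


def standardize(values):
    """Apply (input_value - mean) / sqrt(variance) to each value."""
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values)  # assumption: population variance
    return [(v - mean) / math.sqrt(variance) for v in values]


standardized = standardize([1.0, 3.0])
```

For the two sample values, the mean is 2.0 and the population variance is 1.0, so the results are centered on zero.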

After automatic preprocessing is complete, AI Platform Training uploads your processed dataset back to your Cloud Storage bucket at the directory you specified in the job request.

Further learning resources