Training using the built-in linear learner algorithm

Training with built-in algorithms on AI Platform allows you to submit your dataset and train a model without writing any training code. This page explains how the built-in linear learner algorithm works, and how to use it.

Overview

This built-in algorithm has two stages, preprocessing and training:

  1. Preprocessing: AI Platform processes your mix of categorical and numerical data into an all-numerical dataset to prepare it for training.
  2. Training: Using the dataset and the model parameters you supplied, AI Platform runs training using TensorFlow's Linear Estimator.

Limitations

The following features are not supported for training with the built-in linear learner algorithm:

Supported machine types

The following AI Platform scale tiers and machine types are supported:

  • BASIC
  • CUSTOM:
    • standard
    • large_model
    • complex_model_s
    • complex_model_m
    • complex_model_l
    • standard_gpu
    • standard_p100
    • standard_v100
    • large_model_v100

You can also use Compute Engine machine types (beta). For details, see the documentation on machine types.

Format input data

Each row of a dataset represents one instance, and each column of a dataset represents a feature value. The target column represents the value you want to predict.

Prepare CSV file

Your input data must be a CSV file with UTF-8 encoding. If your training data consists only of categorical and numerical values, you can use our preprocessing module to fill in missing numerical values, split the dataset, and remove rows that are missing more than 10% of their values. Otherwise, you can run training without automatic preprocessing enabled.

You must prepare your input CSV file to meet the following requirements:

  • Remove the header row. The header row contains the labels for each column. Remove the header row in order to avoid submitting it with the rest of the data instances as part of the training data.
  • Ensure that the target column is the first column. The target column contains the value that you are trying to predict. For a classification algorithm, all values in the target column are a class or category. For a regression algorithm, all values in the target column are a numerical value.
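The two requirements above can be sketched with Python's standard csv module. This is a minimal illustration; the column names and values are hypothetical:

```python
import csv
import io

# Hypothetical raw CSV with a header row; "income" is the target column.
raw = io.StringIO(
    "age,workclass,income\n"
    "39,State-gov,<=50K\n"
    "50,Self-emp,>50K\n"
)

reader = csv.reader(raw)
header = next(reader)             # drop the header row
target = header.index("income")   # locate the target column

prepared = []
for row in reader:
    # Move the target value to the front; keep the other columns in order.
    prepared.append([row[target]] + row[:target] + row[target + 1:])

print(prepared[0])  # ['<=50K', '39', 'State-gov']
```

Write the resulting rows back out with csv.writer, without a header, before uploading the file to Cloud Storage.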

Handle integer values

By default, columns of integer values are interpreted as categorical if they contain few enough unique values. For example, if a column in your dataset includes integer values such as {101, 102, 103}, AI Platform interprets these values as categories, much like {'high', 'medium', 'low'}.

To avoid this incorrect analysis, convert integers to floats when you intend the data to be numerical: {101.0, 102.0, 103.0}. To ensure that integers are interpreted as categorical, add a string prefix or suffix to each value: {code_101, code_102, code_103}.
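Forcing one interpretation or the other is a one-line transformation. A sketch, with hypothetical values:

```python
codes = [101, 102, 103]

# Convert to floats so the column is treated as numerical.
as_numeric = [float(v) for v in codes]

# Add a string prefix so the column is treated as categorical.
as_categorical = [f"code_{v}" for v in codes]

print(as_numeric)      # [101.0, 102.0, 103.0]
print(as_categorical)  # ['code_101', 'code_102', 'code_103']
```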

Submit a linear learner training job

This section explains how to submit a training job using the built-in linear learner algorithm.

You can find brief explanations of each hyperparameter within the Google Cloud Platform Console, and a more comprehensive explanation in the reference for the built-in linear learner algorithm.

Console

  1. Go to the AI Platform Jobs page in the Google Cloud Platform Console:

    AI Platform Jobs page

  2. Click the New training job button. From the options that display below, click Built-in model training.

  3. On the Create a new training job page, select Built-in linear learner and click Next.

  4. To learn more about all the available parameters, follow the links in the Google Cloud Platform Console and refer to the built-in linear learner reference for more details.

gcloud

  1. If you installed gcloud before starting this tutorial, make sure the gcloud beta component is installed and up to date:

       gcloud components install beta
    
  2. Set environment variables for your job, filling in [VALUES-IN-BRACKETS] with your own values:

       # Specify the name of the Cloud Storage bucket where you want your
       # training outputs to be stored, and the Docker container for
       # your built-in algorithm selection.
       BUCKET_NAME='[YOUR-BUCKET-NAME]'
       IMAGE_URI='gcr.io/cloud-ml-algos/linear_learner_cpu:latest'
    
       # Specify the Cloud Storage path to your training input data.
       TRAINING_DATA="gs://${BUCKET_NAME}/[YOUR_FILE_NAME].csv"
    
       DATASET_NAME='census'
       ALGORITHM='linear'
       MODEL_TYPE='classification'
    
       DATE=$(date '+%Y%m%d_%H%M%S')
       MODEL_NAME="${DATASET_NAME}_${ALGORITHM}_${MODEL_TYPE}"
       JOB_ID="${MODEL_NAME}_${DATE}"
    
       JOB_DIR="gs://${BUCKET_NAME}/algorithm_training/${MODEL_NAME}/${DATE}"
    
  3. Submit the training job using gcloud beta ai-platform jobs submit training:

       gcloud beta ai-platform jobs submit training $JOB_ID \
          --master-image-uri=$IMAGE_URI --scale-tier=BASIC --job-dir=$JOB_DIR \
          -- \
          --preprocess --model_type=$MODEL_TYPE --batch_size=250 \
          --learning_rate=0.1 --max_steps=1000 \
          --training_data_path=$TRAINING_DATA
    
  4. Monitor the status of your training job and view its logs with gcloud ai-platform jobs describe and gcloud ai-platform jobs stream-logs:

       gcloud ai-platform jobs describe ${JOB_ID}
       gcloud ai-platform jobs stream-logs ${JOB_ID}
    

How preprocessing works

Automatic preprocessing works for categorical and numerical data. The preprocessing routine first analyzes and then transforms your data.

Analysis

First, AI Platform automatically detects the data type of each column, identifies how each column should be treated, and computes some statistics of the data in the column. This information is captured in the metadata.json file.

AI Platform analyzes the type of the target column to identify whether the given dataset is for regression or classification. If this analysis conflicts with your selection for model_type, it results in an error. In ambiguous cases, format your data so that the intended treatment of the target column is explicit.

  • Type: The column can be either numerical or categorical.

  • Treatment: AI Platform identifies how to treat each column as follows:

    • If the column includes a single value in all the rows, it is treated as a constant.
    • If the column is categorical and contains a unique value in every row, it is treated as a row_identifier.
    • If the column is numerical with float values, or if it's numerical with integer values and it contains many unique values, then the column is treated as numerical.
    • If the column is numerical with integer values, and it contains few enough unique values, then the column is treated as a categorical column where the integer values are the identity or the vocabulary.
      • A column is considered to have few unique values if the number of unique values in the column is less than 20% of the number of rows in the input dataset.
    • If the column is categorical with high cardinality, then the column is treated with hashing, where the number of hash buckets is equal to the square root of the number of unique values in the column.
      • A categorical column is considered to have high cardinality if the number of unique values is greater than the square root of the number of rows in the dataset.
    • If the column is categorical, and the number of unique values is less than or equal to the square root of the number of rows in the dataset, then the column is treated as a normal categorical column with vocabulary.
  • Statistics: AI Platform computes the following statistics, based on the identified column type and treatment, to be used for transforming the column in a later stage.

    • If the column is numeric, the mean and variance values are computed.
    • If the column is categorical, and the treatment is identity or vocabulary, the distinct values are extracted from the column.
    • If the column is categorical, and the treatment is hashing, the number of hash buckets is computed with respect to the cardinality of the column.
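The analysis rules above can be approximated in a few lines of Python. This is a sketch of the logic as described, not AI Platform's actual implementation; the function names are hypothetical:

```python
import math

def column_treatment(values, is_numeric, is_integer):
    """Approximate the treatment rules described above."""
    n_rows = len(values)
    n_unique = len(set(values))
    if n_unique == 1:
        return "constant"
    if not is_numeric and n_unique == n_rows:
        return "row_identifier"
    if is_numeric:
        # Integer columns with few unique values (< 20% of rows)
        # are treated as categorical.
        if is_integer and n_unique < 0.2 * n_rows:
            return "categorical"
        return "numerical"
    # Categorical columns whose cardinality exceeds sqrt(n_rows) are hashed.
    if n_unique > math.sqrt(n_rows):
        return "hashing"
    return "vocabulary"

def numeric_statistics(values):
    """Mean and (population) variance, used later for standardization."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance
```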

Transformation

After the initial analysis of the dataset is complete, AI Platform transforms your data based on the types, treatments, and statistics identified for your dataset. AI Platform does the transformations in the following order:

  1. Splits validation and test datasets off from your training data, if you specify split percentages.
  2. Removes rows that are missing more than 10% of their values.
  3. Fills in missing numerical values with the column mean.
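Steps 2 and 3 above can be sketched as follows, with missing values represented as None. This is a simplification of the behavior described, not the actual preprocessing code:

```python
def preprocess(rows):
    """Drop rows missing more than 10% of their values, then impute
    missing numerical values with the column mean (a sketch)."""
    n_cols = len(rows[0])
    kept = [list(r) for r in rows
            if sum(v is None for v in r) <= 0.10 * n_cols]

    for j in range(n_cols):
        present = [r[j] for r in kept if r[j] is not None]
        # Impute only numerical (float) columns.
        if present and all(isinstance(v, float) for v in present):
            mean = sum(present) / len(present)
            for r in kept:
                if r[j] is None:
                    r[j] = mean
    return kept

rows = [
    [1.0] * 10,
    [3.0] * 9 + [None],      # 1 of 10 missing (10%): kept, value imputed
    [None] * 2 + [5.0] * 8,  # 2 of 10 missing (20%): row removed
]
cleaned = preprocess(rows)
print(len(cleaned))  # 2
```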

Example transformations

Rows missing more than 10% of their values are removed. In the following examples, assume each row has 10 values; each example row is truncated for simplicity.

  • No missing values
    Original: [3, 0.45, ..., 'fruits', 0, 1]
    Transformed: [3, 0.45, ..., 1, 0, 0, 0, 1]
    The string 'fruits' is transformed to the values "1, 0, 0" in one-hot encoding. This happens later in the TensorFlow graph.
  • Too many missing values
    Original: [3, 0.45, ..., 'fruits', __, __]
    Transformed: the row is removed, because more than 10% of its values are missing.
  • Missing numerical value
    Original: [3, 0.45, ..., 'fruits', 0, __]
    Transformed: [3, 0.45, ..., 1, 0, 0, 0, 0.54]
    The column mean (0.54 in this example) replaces the missing numerical value. The string 'fruits' is one-hot encoded as above.
  • Missing categorical value
    Original: [3, 0.45, ..., __, 0, 1]
    Transformed: [3, 0.45, ..., 0, 0, 0, 0, 1]
    The missing categorical value is transformed to the values "0, 0, 0" in one-hot encoding. This happens later in the TensorFlow graph.
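The one-hot step in the examples above can be sketched in a few lines; the vocabulary here is hypothetical:

```python
def one_hot(value, vocabulary):
    # A known category gets a 1 in its slot; a missing value maps to all zeros.
    return [1 if value == v else 0 for v in vocabulary]

vocab = ['fruits', 'vegetables', 'grains']
print(one_hot('fruits', vocab))  # [1, 0, 0]
print(one_hot(None, vocab))      # [0, 0, 0]
```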

Feature columns

Column values are not transformed directly during preprocessing. Instead, the metadata produced during analysis is passed to training, which creates the feature columns accordingly:

  • Numerical column, any treatment: tf.feature_column.numeric_column. The mean and variance values are used to standardize the values: new_value = (input_value - mean) / sqrt(variance)
  • Categorical column, identity treatment: tf.feature_column.categorical_column_with_identity
  • Categorical column, vocabulary treatment: tf.feature_column.categorical_column_with_vocabulary_list
  • Categorical column, hashing treatment: tf.feature_column.categorical_column_with_hash_bucket
  • Categorical column, constant or row identifier treatment: ignored; no feature column is created.
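The standardization formula for numerical columns can be checked with a couple of lines; the mean and variance values here are made up:

```python
import math

mean, variance = 4.0, 9.0   # statistics computed during analysis

def standardize(x):
    # new_value = (input_value - mean) / sqrt(variance)
    return (x - mean) / math.sqrt(variance)

print(standardize(10.0))  # (10.0 - 4.0) / 3.0 = 2.0
```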

After automatic preprocessing is complete, AI Platform uploads your processed dataset back to your Cloud Storage bucket, in the directory you specified in the job request.
