Getting started with the built-in linear learner algorithm

With built-in algorithms on AI Platform Training, you can submit your training data, select an algorithm, and let AI Platform Training handle the preprocessing and training for you, without writing any code for a training application.

Overview

In this tutorial, you train a linear learner model without writing any code. You submit the Census Income dataset to AI Platform Training for preprocessing and training, and then you deploy the model on AI Platform Prediction to get predictions. The resulting model predicts the probability that an individual's yearly income is greater than $50,000.

Before you begin

To complete this tutorial on the command line, use either Cloud Shell or any environment where the Google Cloud CLI is installed.

Complete the following steps to set up a Google Cloud account, enable the required APIs, and install and initialize the Google Cloud CLI:

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the AI Platform Training & Prediction and Compute Engine APIs.

    Enable the APIs

  5. Install the Google Cloud CLI.
  6. To initialize the gcloud CLI, run the following command:

    gcloud init

Setup

To use tabular built-in algorithms, you must remove the header row from your CSV file and move the target values to the first column. We have modified the original Census dataset for use with this tutorial, and hosted it in a public Cloud Storage bucket, gs://cloud-samples-data/ai-platform/census/algorithms/data/.
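
If you want to prepare your own CSV file in the same way, the following is a minimal sketch of that step using awk. It assumes the target is the last column and that no field contains a quoted comma. The file names and the three-column sample are hypothetical and do not reflect the real Census schema.

```shell
# Hypothetical three-column sample with a header row and the target last.
workdir="$(mktemp -d)"
cat > "$workdir/raw.csv" <<'EOF'
age,workclass,income_bracket
39,State-gov,<=50K
50,Self-emp-not-inc,>50K
EOF

# Skip the header row (NR > 1), then print the last field first,
# followed by the remaining fields in their original order.
awk -F',' 'BEGIN { OFS = "," }
NR > 1 {
  row = $NF
  for (i = 1; i < NF; i++) row = row OFS $i
  print row
}' "$workdir/raw.csv" > "$workdir/prepared.csv"

cat "$workdir/prepared.csv"
# <=50K,39,State-gov
# >50K,50,Self-emp-not-inc
```

For files with quoted fields or embedded commas, a CSV-aware tool is a safer choice than awk.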

Console

Before you start your training job, you need to copy the data from our public Cloud Storage bucket to your Cloud Storage bucket.

Copy the sample data to your Cloud Storage bucket

  1. First, download the training and testing data from our public Cloud Storage bucket.

    1. Navigate to our public Cloud Storage bucket:

      Get the sample data

    2. Download both test.csv and train.csv:

      1. Click the file name.

      2. From the Object details page, click Download. These files download to your local environment as ai-platform_census_algorithms_data_test.csv and ai-platform_census_algorithms_data_train.csv respectively.

  2. Next, upload the training and testing data to your Cloud Storage bucket.

    1. Navigate to the Browser page for your Cloud Storage bucket. Select your project from the Select a project drop-down list, or open it in a new tab:

      Cloud Storage Browser page

    2. Click the name of the bucket you want to use, or create a new bucket if you do not have one. (If you create a new bucket, make sure it is a regional bucket, and select the same region where you're running the AI Platform Training job.)

    3. (Optional) Click Create folder to create a folder for the files you upload. Enter a name for the folder (for example, "data") and click Create. Then, navigate to the new folder by clicking the folder name.

    4. Click Upload files to upload both the training and testing files, ai-platform_census_algorithms_data_train.csv and ai-platform_census_algorithms_data_test.csv, to your bucket.

Now that the data is copied to your bucket, you can start a training job by selecting the type of algorithm you want to use.

Select your algorithm

  1. Go to the AI Platform Training Jobs page in the Google Cloud console:

    AI Platform Training Jobs page

  2. Click the New training job button. From the options that display below, click Built-in algorithm training. The Create a new training job page displays.

  3. The training job creation is divided into four steps. The first step is Training algorithm. Select linear learner and click Next.

gcloud

Set up environment variables for your project ID, your Cloud Storage bucket, the Cloud Storage path to the training data, and your algorithm selection.

AI Platform Training built-in algorithms are in Docker containers hosted in Container Registry.

PROJECT_ID=YOUR_PROJECT_ID
BUCKET_NAME=YOUR_BUCKET_NAME
REGION="us-central1"
gcloud config set project $PROJECT_ID
gcloud config set compute/region $REGION

# Copy the training data into your Cloud Storage bucket, and set the path
# to your copy of the training data.
TRAINING_DATA_SOURCE=gs://cloud-samples-data/ai-platform/census/algorithms/data/train.csv
TRAINING_DATA_PATH=gs://$BUCKET_NAME/algorithms-demo/data/train.csv
gcloud storage cp $TRAINING_DATA_SOURCE $TRAINING_DATA_PATH

# Specify the Docker container URI specific to the algorithm.
IMAGE_URI="gcr.io/cloud-ml-algos/linear_learner_cpu:latest"
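
Later commands fail in confusing ways if one of these variables is empty, so it can help to check them before you continue. The following is a small sketch of such a check. The function name require_vars and the placeholder values are hypothetical and only serve the illustration.

```shell
# Report every listed variable that is unset or empty.
# Returns non-zero if at least one variable is missing.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Placeholder values for illustration only; use your own project and bucket.
PROJECT_ID="my-project"
BUCKET_NAME="my-bucket"
REGION="us-central1"

if require_vars PROJECT_ID BUCKET_NAME REGION; then
  echo "All required variables are set."
fi
```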

Submit a training job

To submit a job, you must specify some basic training arguments and some basic arguments related to the linear learner algorithm.

General arguments for the training job:

Training job arguments

  Argument            Description
  job-id              Unique ID for your training job. You can use this to find logs for the status of your training job after you submit it.
  job-dir             Cloud Storage path where AI Platform Training saves training files after completing a successful training job.
  scale-tier          Specifies machine types for training. Use BASIC to select a configuration of just one machine.
  master-image-uri    Container Registry URI used to specify which Docker container to use for the training job. Use the container for the built-in linear learner algorithm defined earlier as IMAGE_URI.
  region              Specify the available region in which to run your training job. For this tutorial, you can use the region us-central1.
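
The gcloud steps in this tutorial build the job ID from a model name and a timestamp. To our knowledge, AI Platform Training job IDs may contain only letters, digits, and underscores, and must start with a letter. Treat that rule as an assumption and confirm it against the current reference. The sketch below checks an ID against that assumed rule; the helper name is_valid_job_id is hypothetical.

```shell
# Return 0 if the ID starts with a letter and contains only
# letters, digits, and underscores (assumed naming rule).
is_valid_job_id() {
  case "$1" in
    "" | [!A-Za-z]* | *[!A-Za-z0-9_]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Build an ID the same way as the gcloud steps in this tutorial.
DATE="$(date '+%Y%m%d_%H%M%S')"
JOB_ID="census_linear_classification_${DATE}"

if is_valid_job_id "$JOB_ID"; then
  echo "Job ID looks valid: $JOB_ID"
else
  echo "Job ID does not match the assumed rule: $JOB_ID" >&2
fi
```

A hyphenated name such as "linear-example" would be rejected by this check, while "linear_example" passes.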

Arguments specific to the built-in linear learner algorithm:

Algorithm arguments

  Argument              Description
  preprocess            Boolean argument stating whether or not AI Platform Training should preprocess the data.
  model_type            Indicates the type of model to train: classification or regression.
  training_data_path    Cloud Storage location to the training data, which must be a CSV file.
  learning_rate         The learning rate used by the linear optimizer.
  max_steps             Number of steps to run the training for.
  batch_size            Number of examples to use per training step.

For a detailed list of all other linear learner algorithm flags, refer to the built-in linear learner reference.
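
To get a feel for how max_steps and batch_size interact, you can estimate how many examples a job processes in total. The following sketch does the arithmetic for the values used later in this tutorial (batch_size=250, max_steps=1000). NUM_ROWS is a hypothetical placeholder, not the real size of the Census training file, and the estimate ignores the split into training, validation, and test data that preprocessing performs.

```shell
BATCH_SIZE=250
MAX_STEPS=1000
# Hypothetical row count; replace it with the size of your own file,
# for example the output of: wc -l < train.csv
NUM_ROWS=30000

# Each step consumes one batch, so the job processes this many examples.
EXAMPLES_SEEN=$((BATCH_SIZE * MAX_STEPS))

# Integer division gives the number of complete passes over the data.
FULL_PASSES=$((EXAMPLES_SEEN / NUM_ROWS))

echo "Examples processed: $EXAMPLES_SEEN"
echo "Complete passes over $NUM_ROWS rows: $FULL_PASSES"
```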

Console

  1. Leave Enable automatic data preprocessing checked.

  2. For Training data path, click Browse. In the right panel, click the name of the bucket where you uploaded the training data, and navigate to your ai-platform_census_algorithms_data_train.csv file.

  3. Leave the fields for Validation data and Test data at their default settings.

  4. In Output directory, enter the path to your Cloud Storage bucket where you want AI Platform Training to store the outputs from your training job. You can fill in your Cloud Storage bucket path directly, or click the Browse button to select it.

    To keep things organized, create a new directory within your Cloud Storage bucket for this training job. You can do this within the Browse pane.

    Click Next.

  5. For Model type, select Classification.

  6. Leave all other fields on their default settings, and click Next.

  7. On the Job settings page:

    1. Enter a unique Job ID (such as "linear_example").
    2. Enter an available region (such as "us-central1").
    3. Select "BASIC" for the scale tier.

    Click Done to submit the training job.

gcloud

  1. Set up all the arguments for the training job and the algorithm, before using gcloud to submit the job:

    DATASET_NAME="census"
    ALGORITHM="linear"
    MODEL_TYPE="classification"
    MODEL_NAME="${DATASET_NAME}_${ALGORITHM}_${MODEL_TYPE}"
    
    # Give a unique name to your training job.
    DATE="$(date '+%Y%m%d_%H%M%S')"
    JOB_ID="${MODEL_NAME}_${DATE}"
    
    # Make sure you have access to this Cloud Storage bucket.
    JOB_DIR="gs://${BUCKET_NAME}/algorithms_training/${MODEL_NAME}/${DATE}"
    
  2. Submit the job:

    gcloud ai-platform jobs submit training $JOB_ID \
      --master-image-uri=$IMAGE_URI --scale-tier=BASIC --job-dir=$JOB_DIR \
      --region=$REGION \
      -- \
      --preprocess --model_type=$MODEL_TYPE --batch_size=250 \
      --learning_rate=0.1 --max_steps=1000 --training_data_path=$TRAINING_DATA_PATH
    

  3. After the job is submitted successfully, you can check its status and stream its logs using the following gcloud commands:

    gcloud ai-platform jobs describe $JOB_ID
    gcloud ai-platform jobs stream-logs $JOB_ID
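
If you would rather block a script until training finishes, one option is to poll the job state. The following sketch wraps gcloud ai-platform jobs describe in a hypothetical helper named wait_for_job. It assumes the job reports the terminal states SUCCEEDED, FAILED, and CANCELLED, and the default 60-second polling interval is an arbitrary choice.

```shell
# Poll a training job until it reaches a terminal state.
# Usage: wait_for_job JOB_ID [POLL_INTERVAL_SECONDS]
# Returns 0 on SUCCEEDED, 1 on FAILED or CANCELLED, 2 if the lookup fails.
wait_for_job() {
  job_id="$1"
  interval="${2:-60}"
  while true; do
    state="$(gcloud ai-platform jobs describe "$job_id" --format='value(state)')" || return 2
    echo "Job $job_id is in state $state"
    case "$state" in
      SUCCEEDED) return 0 ;;
      FAILED | CANCELLED) return 1 ;;
    esac
    sleep "$interval"
  done
}

# Example, once JOB_ID is set and the job has been submitted:
#   wait_for_job "$JOB_ID" && echo "Training finished."
```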
    

Understand your job directory

After the successful completion of a training job, AI Platform Training creates a trained model in your Cloud Storage bucket, along with some other artifacts. You can find the following directory structure within your JOB_DIR:

  • artifacts/
    • metadata.json
  • model/ (a TensorFlow SavedModel directory that also contains a deployment_config.yaml file)
    • saved_model.pb
    • deployment_config.yaml
  • processed_data/
    • test.csv
    • training.csv
    • validation.csv

The job directory also contains various model checkpoint files in an "experiment" directory.

Confirm that the directory structure in your JOB_DIR matches the structure listed above:

gcloud storage ls $JOB_DIR/* --all-versions
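
Instead of comparing the listing by eye, you could check for each expected file individually. The following sketch does that with a hypothetical helper named check_job_dir. It assumes that gcloud storage ls exits with a non-zero status when a path matches no object, and it only covers the files listed in this section.

```shell
# Check that each expected artifact exists under the given job directory.
# Usage: check_job_dir gs://BUCKET/path/to/job_dir
# Returns non-zero if any artifact is missing.
check_job_dir() {
  job_dir="$1"
  status=0
  for artifact in \
    artifacts/metadata.json \
    model/saved_model.pb \
    model/deployment_config.yaml \
    processed_data/test.csv \
    processed_data/training.csv \
    processed_data/validation.csv
  do
    if gcloud storage ls "$job_dir/$artifact" > /dev/null 2>&1; then
      echo "found    $artifact"
    else
      echo "MISSING  $artifact"
      status=1
    fi
  done
  return "$status"
}

# Example, after the training job has succeeded:
#   check_job_dir "$JOB_DIR"
```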

Deploy the trained model

AI Platform Prediction organizes your trained models using model and version resources. An AI Platform Prediction model is a container for the versions of your machine learning model.

To deploy a model, you create a model resource in AI Platform Prediction, create a version of that model, then use the model and version to request online predictions.

Learn more about how to deploy models to AI Platform Prediction.

Console

  1. On the Jobs page, you can find a list of all your training jobs. Click the name of the training job you just submitted ("linear_example" or the job name you used).

  2. On the Job details page, you can view the general progress of your job, or click View logs for a more detailed view of its progress.

  3. When the job is successful, the Deploy model button appears at the top. Click Deploy model.

  4. Select "Deploy as new model", and enter a model name, such as "linear_model". Next, click Confirm.

  5. On the Create version page, enter a version name, such as "v1", and leave all other fields at their default settings. Click Save.

  6. On the Model details page, your version name displays. The version takes a few minutes to create. When the version is ready, a checkmark icon appears by the version name.

  7. Click the version name ("v1") to navigate to the Version details page. In the next step of this tutorial, you send a prediction request.

gcloud

The training process with the built-in linear learner algorithm produces a file, deployment_config.yaml, that makes it easier to deploy your model on AI Platform Prediction for predictions.

  1. Copy the file to your local directory and view its contents:

    gcloud storage cp $JOB_DIR/model/deployment_config.yaml .
    cat deployment_config.yaml
    

    Your deployment_config.yaml file should appear similar to the following:

    deploymentUri: gs://YOUR_BUCKET_NAME/algorithms_training/census_linear_classification/20190227060114/model
    framework: TENSORFLOW
    labels:
      global_step: '1000'
      job_id: census_linear_classification_20190227060114
      accuracy: '86'
    runtimeVersion: '1.14'
    pythonVersion: '2.7'
    
  2. Create the model and version in AI Platform Prediction:

    MODEL_NAME="${DATASET_NAME}_${ALGORITHM}_${MODEL_TYPE}"
    gcloud ai-platform models create $MODEL_NAME --regions $REGION
    
    # Create a version of the model using the deployment_config.yaml file copied above.
    VERSION_NAME="v_${DATE}"
    
    gcloud ai-platform versions create $VERSION_NAME \
      --model $MODEL_NAME \
      --config deployment_config.yaml
    

    The version takes a few minutes to create.
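
To see what the --config file contributes to the version, you can read its top-level fields before you deploy. The following sketch recreates the sample deployment_config.yaml shown above in a temporary directory and extracts a few values with sed. The helper name yaml_value is hypothetical, and it only handles simple top-level "key: value" lines, not nested keys such as those under labels.

```shell
# Recreate the sample config from this tutorial in a temporary directory.
workdir="$(mktemp -d)"
cfg="$workdir/deployment_config.yaml"
cat > "$cfg" <<'EOF'
deploymentUri: gs://YOUR_BUCKET_NAME/algorithms_training/census_linear_classification/20190227060114/model
framework: TENSORFLOW
labels:
  global_step: '1000'
  job_id: census_linear_classification_20190227060114
  accuracy: '86'
runtimeVersion: '1.14'
pythonVersion: '2.7'
EOF

# Print the value of a top-level "key: value" line, without single quotes.
# Usage: yaml_value KEY FILE
yaml_value() {
  sed -n "s/^$1: *//p" "$2" | tr -d "'"
}

echo "Model location:  $(yaml_value deploymentUri "$cfg")"
echo "Framework:       $(yaml_value framework "$cfg")"
echo "Runtime version: $(yaml_value runtimeVersion "$cfg")"
echo "Python version:  $(yaml_value pythonVersion "$cfg")"
```

These fields correspond to settings you could otherwise pass to gcloud ai-platform versions create yourself, such as --origin, --framework, --runtime-version, and --python-version.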

Get online predictions

When requesting predictions, you need to make sure that your input data is formatted the same way as the training data. Before training, AI Platform Training preprocesses your data by transforming it into the corpus shown in metadata.json.

The linear learner model applies similar preprocessing to your input data before making predictions.

Console

  1. On the Version details page for "v1", the version you just created, you can send a sample prediction request.

    Select the Test & Use tab.

  2. Copy the following sample to the input field:

    {
      "instances": [
        {"csv_row": "44, Private, 160323, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 7688, 0, 40, United-States", "key": "dummy-key"}
      ]
    }
    
  3. Click Test.

    The sample prediction result has several fields. The classes list shows the predicted class, >50K:

    {
      "predictions": [
        {
          ...
          "classes": [
            ">50K"
          ],
          ...
        }
      ]
    }
    

    In this case, the deployed model predicts that the individual whose information you provided earns a salary greater than $50,000. (Since training is non-deterministic, your model's prediction may differ.)

gcloud

  1. Review the last few lines of metadata.json:

    gcloud storage cat $JOB_DIR/artifacts/metadata.json | tail
    

    The target_column.mapping object shows how the predicted classes will display in the prediction results:

      "target_algorithm": "TensorFlow",
      "target_column": {
        "mapping": {
          "0": "<=50K",
          "1": ">50K"
        },
        "num_category": 2,
        "type": "classification"
      }
    }
    
  2. Prepare the prediction input for one data instance. Note that you must provide each data instance as a JSON object with the following fields:

    • csv_row, a string containing a comma-separated row of features in the same format as the instances used during training.
    • key, a string identifier that is unique for each instance. This acts as an instance key that appears as part of the prediction output, so you can match each prediction to the corresponding input instance.

      This is necessary for batch prediction, because batch prediction processes input and saves output in an unpredictable order.

      For online prediction, which produces output in the same order as the input that you provide, instance keys are less crucial. This example only performs prediction on a single instance, so the value of the instance key does not matter.

    To send an online prediction request using the Google Cloud CLI, as in this example, write each instance to a row in a newline-delimited JSON file.

    Run the following commands in your terminal to create input for a single instance that you can send to AI Platform Prediction:

    # A sample record from the census dataset. Ground truth is >50K.
    RAW_DATA_POINT='44, Private, 160323, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 7688, 0, 40, United-States'
    
    # Create a prediction request file.
    echo "{\"csv_row\": \"$RAW_DATA_POINT\", \"key\": \"dummy-key\"}" > sample_input.json
    
    # Check the prediction request file.
    cat sample_input.json
    
  3. Send the prediction request:

    gcloud ai-platform predict \
      --model $MODEL_NAME \
      --version $VERSION_NAME \
      --json-instances sample_input.json \
      --format "value(predictions[0].classes[0])" \
      --signature-name "predict"
    

    This prediction output is filtered to show only the predicted class:

    >50K
    

Most likely, the prediction output is >50K. The deployed model predicts that the individual whose information you provided earns a salary greater than $50,000. (Since training is non-deterministic, your model's prediction may differ.)
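
To request predictions for more than one instance, the same csv_row and key fields apply, with one JSON object per line and a unique key for each. The following sketch builds such a newline-delimited file from a small file of feature rows. The second row is made up for illustration, the file names are hypothetical, and the loop assumes that no row contains double quotes or backslashes that would need JSON escaping.

```shell
# Feature rows without the target column, one instance per line.
# The first row is the sample from this tutorial; the second is made up.
workdir="$(mktemp -d)"
cat > "$workdir/rows.csv" <<'EOF'
44, Private, 160323, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 7688, 0, 40, United-States
25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States
EOF

# Write one JSON object per line, giving each instance a unique key.
n=0
: > "$workdir/batch_input.json"
while IFS= read -r row; do
  n=$((n + 1))
  printf '{"csv_row": "%s", "key": "row-%d"}\n' "$row" "$n" >> "$workdir/batch_input.json"
done < "$workdir/rows.csv"

cat "$workdir/batch_input.json"
```

You could then pass the resulting file to the --json-instances flag in the same way as sample_input.json, and use the keys to match each prediction to its input row.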

About the data

The Census Income dataset that this sample uses for training is hosted by the UC Irvine Machine Learning Repository.

Census data courtesy of: Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.

What's next