Training with scikit-learn on AI Platform

The AI Platform training service manages computing resources in the cloud to train your models. This page describes the process to train a scikit-learn model using AI Platform.

This tutorial trains a simple model to predict a person's income level based on the Census Income Data Set. You create a training application locally, upload it to Cloud Storage, and submit a training job. The AI Platform training service writes its output to your Cloud Storage bucket, and creates logs in Logging.

This content is also available on GitHub as a Jupyter notebook.

How to train your model on AI Platform

You can train your model on AI Platform in three steps:

  • Create your Python model file
    • Add code to download your data from Cloud Storage so that AI Platform can use it
    • Add code to export and save the model to Cloud Storage after AI Platform finishes training the model
  • Prepare a training application package
  • Submit the training job

Before you begin

Complete the following steps to set up a GCP account, activate the AI Platform API, and install and activate the Cloud SDK.

Set up your GCP project

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the project selector page

  3. Make sure that billing is enabled for your Google Cloud Platform project. Learn how to enable billing.

  4. Enable the AI Platform ("Cloud Machine Learning Engine") and Compute Engine APIs.

    Enable the APIs

  5. Install and initialize the Cloud SDK.

Set up your environment

Choose one of the options below to set up your environment locally on macOS or in a remote environment on Cloud Shell.

For macOS users, we recommend that you set up your environment using the MACOS tab below. Cloud Shell, shown on the CLOUD SHELL tab, is available on macOS, Linux, and Windows. Cloud Shell provides a quick way to try AI Platform, but isn’t suitable for ongoing development work.

macOS

  1. Check Python installation
    Confirm that you have Python installed and, if necessary, install it.

    python -V
  2. Check pip installation
    pip is Python’s package manager, included with current versions of Python. Check if you already have pip installed by running pip --version. If not, see how to install pip.

    You can upgrade pip using the following command:

    pip install -U pip

    See the pip documentation for more details.

  3. Install virtualenv
    virtualenv is a tool to create isolated Python environments. Check if you already have virtualenv installed by running virtualenv --version. If not, install virtualenv:

    pip install --user --upgrade virtualenv

    To create an isolated development environment for this guide, create a new virtual environment in virtualenv. For example, the following commands create and activate an environment named cmle-env:

    virtualenv cmle-env
    source cmle-env/bin/activate
  4. For the purposes of this tutorial, run the rest of the commands within your virtual environment.

    See more information about using virtualenv. To exit virtualenv, run deactivate.
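If you want to confirm from within Python that the virtual environment is actually active, the following pure-Python check works for both virtualenv and the standard-library venv (this helper is an illustration, not part of the Cloud SDK):

```python
# Quick check that the interpreter is running inside a virtual
# environment: sys.prefix differs from the base interpreter's prefix
# when a virtualenv (or venv) is active.
import sys

def in_virtualenv():
    # Old-style virtualenv sets sys.real_prefix; venv sets sys.base_prefix.
    base = getattr(sys, 'real_prefix', getattr(sys, 'base_prefix', sys.prefix))
    return sys.prefix != base

print(in_virtualenv())
```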

Cloud Shell

  1. Open the Google Cloud Platform Console.

    Google Cloud Platform Console

  2. Click the Activate Google Cloud Shell button at the top of the console window.

    Activate Google Cloud Shell

    A Cloud Shell session opens inside a new frame at the bottom of the console and displays a command-line prompt. It can take a few seconds for the shell session to be initialized.

    Cloud Shell session

    Your Cloud Shell session is ready to use.

  3. Configure the gcloud command-line tool to use your selected project.

    gcloud config set project [selected-project-id]

    where [selected-project-id] is your project ID. (Omit the enclosing brackets.)

Verify the Google Cloud SDK components

To verify that the Google Cloud SDK components are installed:

  1. List your models:

    gcloud ai-platform models list
  2. If you have not created any models before, the command returns an empty list:

    Listed 0 items.

    After you start creating models, you can see them listed by using this command.

  3. If you have installed gcloud previously, update gcloud:

    gcloud components update

Install frameworks

macOS

Within your virtual environment, run the following command to install the versions of scikit-learn and pandas used in AI Platform runtime version 1.14:

(cmle-env)$ pip install scikit-learn==0.20.2 pandas==0.24.0

By providing version numbers in the preceding command, you ensure that the dependencies in your virtual environment match the dependencies in the runtime version. This helps prevent unexpected behavior when your code runs on AI Platform.
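The same version-matching idea can be expressed as a small local check. The helper below is hypothetical (not part of the SDK); it compares installed package versions against the versions pinned by runtime version 1.14 in the pip command above:

```python
# Hypothetical helper: report packages whose installed version differs
# from the versions pinned by the AI Platform runtime (per the pip
# command above for runtime version 1.14).
def parse_version(v):
    """Convert a version string like '0.20.2' to a comparable tuple."""
    return tuple(int(part) for part in v.split('.'))

RUNTIME_DEPS = {'scikit-learn': '0.20.2', 'pandas': '0.24.0'}

def mismatched_packages(installed):
    """Return the packages whose installed version differs from the runtime's."""
    return [pkg for pkg, want in RUNTIME_DEPS.items()
            if parse_version(installed.get(pkg, '0')) != parse_version(want)]

print(mismatched_packages({'scikit-learn': '0.20.2', 'pandas': '0.24.0'}))  # []
```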

For more details, installation options, and troubleshooting information, refer to the installation instructions for scikit-learn and pandas.

Cloud Shell

Run the following command to install scikit-learn and pandas:

pip install --user scikit-learn pandas

For more details, installation options, and troubleshooting information, refer to the installation instructions for scikit-learn and pandas.

Set up your Cloud Storage bucket

You'll need a Cloud Storage bucket to store your training code and dependencies. For the purposes of this tutorial, it is easiest to use a dedicated Cloud Storage bucket in the same project you're using for AI Platform.

If you're using a bucket in a different project, you must ensure that your AI Platform service account can access your training code and dependencies in Cloud Storage. Without the appropriate permissions, your training job fails. See how to grant permissions for storage.

Make sure to use or set up a bucket in the same region you're using to run training jobs. See the available regions for AI Platform services.

This section shows you how to create a new bucket. You can use an existing bucket, but it must be in the same region where you plan on running AI Platform jobs. Additionally, if it is not part of the project you are using to run AI Platform, you must explicitly grant access to the AI Platform service accounts.

  1. Specify a name for your new bucket. The name must be unique across all buckets in Cloud Storage.

    BUCKET_NAME="your_bucket_name"

    For example, use your project name with -mlengine appended:

    PROJECT_ID=$(gcloud config list project --format "value(core.project)")
    BUCKET_NAME=${PROJECT_ID}-mlengine
  2. Check the bucket name that you chose:

    echo $BUCKET_NAME
  3. Select a region for your bucket and set a REGION environment variable.

    Use the same region where you plan on running AI Platform jobs. See the available regions for AI Platform services.

    For example, the following code creates REGION and sets it to us-central1:

    REGION=us-central1
  4. Create the new bucket:

    gsutil mb -l $REGION gs://$BUCKET_NAME
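Bucket creation fails if the name is invalid, so it can be handy to sanity-check the name locally first. The sketch below covers only a few of the Cloud Storage naming rules (lowercase letters, numbers, dashes, underscores, and dots; 3 to 63 characters; starting and ending with a letter or number) and is not the full rule set:

```python
# Partial sanity check of a Cloud Storage bucket name. This is a
# simplified sketch, not the complete set of Cloud Storage naming rules.
import re

def looks_like_valid_bucket_name(name):
    return re.fullmatch(r'[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]', name) is not None

print(looks_like_valid_bucket_name('my-project-mlengine'))  # True
print(looks_like_valid_bucket_name('My_Bucket'))  # False (uppercase)
```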

About the data

The Census Income Data Set that this sample uses for training is hosted by the UC Irvine Machine Learning Repository.

Census data courtesy of: Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://archive.ics.uci.edu/ml - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

For your convenience, we have hosted the data in a public Cloud Storage bucket: gs://cloud-samples-data/ai-platform/sklearn/census_data/, which you can download within your Python training file.

Create your Python model file

You can find all the training code for this section on GitHub: train.py.

The rest of this section provides an explanation of what the training code does.

Setup

Import the following libraries from the Python standard library, pandas, the Cloud Storage client library, and scikit-learn. Set a variable for the name of your Cloud Storage bucket.

import datetime
import pandas as pd

from google.cloud import storage

from sklearn.ensemble import RandomForestClassifier
from sklearn.externals import joblib
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer


# TODO: REPLACE 'YOUR_BUCKET_NAME' with your GCS Bucket name.
BUCKET_NAME = 'YOUR_BUCKET_NAME'

Download data from Cloud Storage

During the typical development process, you upload your own data to Cloud Storage so that AI Platform can access it. The data for this tutorial is hosted in a public bucket: gs://cloud-samples-data/ai-platform/sklearn/census_data/

The code below downloads the training dataset, adult.data. (Evaluation data is available in adult.test, but is not used in this tutorial.)

# Public bucket holding the census data
bucket = storage.Client().bucket('cloud-samples-data')

# Path to the data inside the public bucket
blob = bucket.blob('ai-platform/sklearn/census_data/adult.data')
# Download the data
blob.download_to_filename('adult.data')

Add your model code

The model training code does a few basic steps:

  • Define and load data
  • Convert categorical features to numerical features
  • Extract numerical features with a scikit-learn pipeline
  • Export and save the model to Cloud Storage

Define and load data

# Define the format of your input data including unused columns (These are the columns from the census data files)
COLUMNS = (
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'income-level'
)

# Categorical columns are columns that need to be turned into a numerical value to be used by scikit-learn
CATEGORICAL_COLUMNS = (
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country'
)


# Load the training census dataset
with open('./adult.data', 'r') as train_data:
    raw_training_data = pd.read_csv(train_data, header=None, names=COLUMNS)

# Remove the column we are trying to predict ('income-level') from our features list
# Convert the DataFrame to a list of lists
train_features = raw_training_data.drop('income-level', axis=1).values.tolist()
# Create our training labels list, converting the DataFrame to a list of lists
train_labels = (raw_training_data['income-level'] == ' >50K').values.tolist()
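Note that the comparison string ' >50K' carries a leading space: fields in the raw census CSV are space-padded, and the code compares the raw strings. A minimal illustration on hand-written label values:

```python
# The raw census CSV fields include a leading space, so the label
# comparison string must be ' >50K', not '>50K'.
raw_labels = [' <=50K', ' >50K', ' <=50K']
train_labels = [label == ' >50K' for label in raw_labels]
print(train_labels)  # [False, True, False]
```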

Convert categorical features to numerical features

# Since the census data set has categorical features, we need to convert
# them to numerical values. We'll use a list of pipelines to convert each
# categorical column and then use FeatureUnion to combine them before calling
# the RandomForestClassifier.
categorical_pipelines = []

# Each categorical column needs to be extracted individually and converted to a numerical value.
# To do this, each categorical column will use a pipeline that extracts one feature column via
# SelectKBest(k=1) and a LabelBinarizer() to convert the categorical value to a numerical one.
# A scores array (created below) will select and extract the feature column. The scores array is
# created by iterating over the COLUMNS and checking if it is a CATEGORICAL_COLUMN.
for i, col in enumerate(COLUMNS[:-1]):
    if col in CATEGORICAL_COLUMNS:
        # Create a scores array to get the individual categorical column.
        # Example:
        #  data = [39, 'State-gov', 77516, 'Bachelors', 13, 'Never-married', 'Adm-clerical',
        #         'Not-in-family', 'White', 'Male', 2174, 0, 40, 'United-States']
        #  scores = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
        #
        # Returns: [['State-gov']]
        # Build the scores array.
        scores = [0] * len(COLUMNS[:-1])
        # This column is the categorical column we want to extract.
        scores[i] = 1
        skb = SelectKBest(k=1)
        skb.scores_ = scores
        # Convert the categorical column to a numerical value
        lbn = LabelBinarizer()
        r = skb.transform(train_features)
        lbn.fit(r)
        # Create the pipeline to extract the categorical feature
        categorical_pipelines.append(
            ('categorical-{}'.format(i), Pipeline([
                ('SKB-{}'.format(i), skb),
                ('LBN-{}'.format(i), lbn)])))
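The scores-mask trick above can be illustrated without scikit-learn. This pure-Python sketch mimics what SelectKBest(k=1) does once scores_ is assigned by hand: a one-hot scores array picks out exactly one column (the function below is an illustration, not scikit-learn's implementation):

```python
# Pure-Python illustration of the scores mask: keep the k columns with
# the highest scores, mimicking SelectKBest with hand-assigned scores_.
def select_columns(row, scores, k):
    """Return the k entries of row whose scores rank highest (ties by index)."""
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    keep = sorted(ranked[:k])
    return [row[i] for i in keep]

row = [39, 'State-gov', 77516, 'Bachelors', 13, 'Never-married',
       'Adm-clerical', 'Not-in-family', 'White', 'Male', 2174, 0, 40,
       'United-States']
scores = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(select_columns(row, scores, k=1))  # ['State-gov']
```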

Extract numerical features with a scikit-learn pipeline

# Create pipeline to extract the numerical features
skb = SelectKBest(k=6)
# From COLUMNS use the features that are numerical
skb.scores_ = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0]
categorical_pipelines.append(('numerical', skb))

# Combine all the features using FeatureUnion
preprocess = FeatureUnion(categorical_pipelines)

# Create the classifier
classifier = RandomForestClassifier()

# Transform the features and fit them to the classifier
classifier.fit(preprocess.transform(train_features), train_labels)

# Create the overall model as a single pipeline
pipeline = Pipeline([
    ('union', preprocess),
    ('classifier', classifier)
])
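As a sanity check, the numerical scores array used above (with k=6) selects exactly the six numerical feature columns from COLUMNS, excluding the 'income-level' label. This can be verified in pure Python:

```python
# Verify that the numerical scores mask selects the six numerical
# columns from COLUMNS (excluding the 'income-level' label column).
COLUMNS = ('age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'sex',
           'capital-gain', 'capital-loss', 'hours-per-week',
           'native-country', 'income-level')
scores = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0]
numerical = [col for col, s in zip(COLUMNS[:-1], scores) if s == 1]
print(numerical)
# ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
```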

Export and save the model to Cloud Storage

If your Cloud Storage bucket is in the same project you're using for AI Platform, then AI Platform can read from and write to your bucket. If not, you need to make sure that the project you are using to run AI Platform can access your Cloud Storage bucket. See how to grant permissions for storage.

Make sure to name your model file model.pkl or model.joblib if you want to use it to request online predictions with AI Platform.

# Export the model to a file
model = 'model.joblib'
joblib.dump(pipeline, model)

# Upload the model to GCS
bucket = storage.Client().bucket(BUCKET_NAME)
blob = bucket.blob('{}/{}'.format(
    datetime.datetime.now().strftime('census_%Y%m%d_%H%M%S'),
    model))
blob.upload_from_filename(model)
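The destination object name built above is a timestamped prefix (which Cloud Storage treats like a folder) plus the model filename. A small sketch with a fixed timestamp shows the shape of the resulting path:

```python
# Sketch of the destination object name: a timestamped "folder" prefix
# plus the model filename. A fixed datetime is used here so the output
# is deterministic; the training code uses datetime.datetime.now().
import datetime

model = 'model.joblib'
prefix = datetime.datetime(2019, 7, 1, 12, 30, 5).strftime('census_%Y%m%d_%H%M%S')
object_name = '{}/{}'.format(prefix, model)
print(object_name)  # census_20190701_123005/model.joblib
```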

Verify model file upload to Cloud Storage (optional)

In the command line, view the contents of the destination model folder to verify that your model file has been uploaded to Cloud Storage. Set an environment variable (BUCKET_ID) for the name of your bucket, if you have not already done so.

gsutil ls gs://$BUCKET_ID/census_*

The output should appear similar to the following:

gs://[YOUR-BUCKET-ID]/census_[DATE]_[TIME]/model.joblib

Create training application package

The easiest (and recommended) way to create a training application package uses gcloud to package and upload the application when you submit your training job. This method allows you to create a very simple file structure with only two files. For this tutorial, the file structure of your training application package should appear similar to the following:

census_training/
    __init__.py
    train.py
  1. Create a directory locally:

    mkdir census_training
    
  2. Create a blank file named __init__.py:

    touch census_training/__init__.py
    
  3. Save your training code in one Python file, and save that file within your census_training directory. See the example code for train.py. You can use cURL to download and save the file:

    curl https://raw.githubusercontent.com/GoogleCloudPlatform/cloudml-samples/master/sklearn/notebooks/census_training/train.py > census_training/train.py
    

Learn more about packaging a training application.

Submit training job

In this section, you use gcloud ai-platform jobs submit training to submit your training job.

Specify training job parameters

Set the following environment variables for each parameter in your training job request:

  • PROJECT_ID - Use the PROJECT_ID that matches your Google Cloud Platform project.
  • BUCKET_ID - The name of your Cloud Storage bucket.
  • JOB_NAME - A name to use for the job (mixed-case letters, numbers, and underscores only, starting with a letter). In this case: census_training_$(date +"%Y%m%d_%H%M%S")
  • JOB_DIR - The path to a Cloud Storage location to use for your training job's output files. For example, gs://$BUCKET_ID/scikit_learn_job_dir.
  • TRAINING_PACKAGE_PATH - The local path to the root directory of your training application. In this case: ./census_training/.
  • MAIN_TRAINER_MODULE - Specifies which file the AI Platform training service should run. This is formatted as [YOUR_FOLDER_NAME.YOUR_PYTHON_FILE_NAME]. In this case, census_training.train.
  • REGION - The name of the region you're using to run your training job. Use one of the available regions for the AI Platform training service. Make sure your Cloud Storage bucket is in the same region.
  • RUNTIME_VERSION - You must specify an AI Platform runtime version that supports scikit-learn. In this example, 1.14.
  • PYTHON_VERSION - The Python version to use for the job. Python 3.5 is available with AI Platform runtime version 1.4 or greater. If you don't specify a Python version, the training service uses Python 2.7. For this tutorial, specify Python 2.7.
  • SCALE_TIER - A predefined cluster specification for machines to run your training job. In this case, BASIC. You can also use custom scale tiers to define your own cluster configuration for training.
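A malformed JOB_NAME causes the job submission to be rejected, so it can be worth checking the constraint described above locally (mixed-case letters, numbers, and underscores only, starting with a letter). This regex is an approximate sketch of that rule, not the service's official validator:

```python
# Approximate check of the JOB_NAME constraint: letters, numbers, and
# underscores only, and the name must start with a letter.
import re

def is_valid_job_name(name):
    return re.fullmatch(r'[A-Za-z][A-Za-z0-9_]*', name) is not None

print(is_valid_job_name('census_training_20190701_123005'))  # True
print(is_valid_job_name('2019_census'))  # False (starts with a digit)
```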

For your convenience, the environment variables for this tutorial are below. Replace [VALUES-IN-BRACKETS] with the appropriate values:

PROJECT_ID=[YOUR-PROJECT-ID]
BUCKET_ID=[YOUR-BUCKET-ID]
JOB_NAME=census_training_$(date +"%Y%m%d_%H%M%S")
JOB_DIR=gs://$BUCKET_ID/scikit_learn_job_dir
TRAINING_PACKAGE_PATH="[YOUR-LOCAL-PATH-TO-TRAINING-PACKAGE]/census_training/"
MAIN_TRAINER_MODULE=census_training.train
REGION=us-central1
RUNTIME_VERSION=1.14
PYTHON_VERSION=2.7
SCALE_TIER=BASIC

Submit the request:

gcloud ai-platform jobs submit training $JOB_NAME \
  --job-dir $JOB_DIR \
  --package-path $TRAINING_PACKAGE_PATH \
  --module-name $MAIN_TRAINER_MODULE \
  --region $REGION \
  --runtime-version=$RUNTIME_VERSION \
  --python-version=$PYTHON_VERSION \
  --scale-tier $SCALE_TIER

You should see output similar to the following:

Job [census_training_[DATE]_[TIME]] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe census_training_[DATE]_[TIME]

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs census_training_[DATE]_[TIME]
jobId: census_training_[DATE]_[TIME]
state: QUEUED

Viewing your training logs (optional)

AI Platform captures all stdout and stderr streams and logging statements. These logs are stored in Logging; they are visible both during and after execution.

To view the logs for your training job:

Console

  1. Open your AI Platform Jobs page.

    Open jobs in the GCP Console

  2. Select the name of the training job to inspect. This brings you to the Job details page for your selected training job.

  3. Within the job details, select the View logs link. This brings you to the Logging page where you can search and filter logs for your selected training job.

gcloud

You can view logs in your terminal with gcloud ai-platform jobs stream-logs.

gcloud ai-platform jobs stream-logs $JOB_NAME
