Train an ML model with scikit-learn and XGBoost

The AI Platform Training service manages computing resources in the cloud to train your models. This page describes the process to train a model with scikit-learn and XGBoost using AI Platform Training.

Overview

In this tutorial, you train a simple model to predict the species of flowers, using the Iris dataset. After you adjust your model training code to download data from Cloud Storage and upload your saved model file to Cloud Storage, you create a training application package and use it to run training on AI Platform Training.

How to train your model on AI Platform Training

After you complete the initial setup process, you can train your model on AI Platform Training in three steps:

  • Create your Python training module
    • Add code to download your data from Cloud Storage so that AI Platform Training can use it
    • Add code to export and save the model to Cloud Storage after AI Platform Training finishes training the model
  • Prepare a training application package
  • Submit the training job

The initial setup process includes creating a Google Cloud project, enabling billing and APIs, setting up a Cloud Storage bucket to use with AI Platform Training, and installing scikit-learn or XGBoost locally. If you already have everything set up and installed, skip to creating your model training code.

Before you begin

Complete the following steps to set up a GCP account, activate the AI Platform Training API, and install and activate the Cloud SDK.

Set up your GCP project

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the AI Platform Training & Prediction and Compute Engine APIs.

    Enable the APIs

  5. Install the Google Cloud CLI.
  6. To initialize the gcloud CLI, run the following command:

    gcloud init

Set up your environment

Choose one of the options below to set up your environment locally on macOS or in a remote environment on Cloud Shell.

For macOS users, we recommend that you set up your environment using the macOS instructions below. Cloud Shell, covered in the Cloud Shell instructions, is available on macOS, Linux, and Windows. Cloud Shell provides a quick way to try AI Platform Training, but isn't suitable for ongoing development work.

macOS

  1. Check Python installation
    Confirm that you have Python installed and, if necessary, install it.

    python -V
  2. Check pip installation
    pip is Python's package manager, included with current versions of Python. Check if you already have pip installed by running pip --version. If not, see how to install pip.

    You can upgrade pip using the following command:

    pip install -U pip

    See the pip documentation for more details.

  3. Install virtualenv
    virtualenv is a tool to create isolated Python environments. Check if you already have virtualenv installed by running virtualenv --version. If not, install virtualenv:

    pip install --user --upgrade virtualenv

To create an isolated development environment for this guide, create a new virtual environment in virtualenv. For example, the following commands create and activate an environment named aip-env:

    virtualenv aip-env
    source aip-env/bin/activate
  4. For the purposes of this tutorial, run the rest of the commands within your virtual environment.

    See more information about using virtualenv. To exit virtualenv, run deactivate.
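
If you're ever unsure whether a given Python process is actually using the virtual environment, you can check sys.prefix; a minimal sketch (a general Python check, not specific to AI Platform Training):

# Check whether this Python process is running inside a virtual environment.
# With virtualenv/venv, sys.prefix points at the environment directory and
# differs from the base installation prefix.
import sys

base = getattr(sys, 'base_prefix', sys.prefix)
if sys.prefix == base:
    print('Warning: not inside a virtual environment')
else:
    print('Virtual environment:', sys.prefix)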

Cloud Shell

  1. Open the Google Cloud console.

    Google Cloud console

  2. Click the Activate Google Cloud Shell button at the top of the console window.

    A Cloud Shell session opens inside a new frame at the bottom of the console and displays a command-line prompt. It can take a few seconds for the shell session to be initialized.

    Your Cloud Shell session is ready to use.

  3. Configure the gcloud command-line tool to use your selected project.

    gcloud config set project [selected-project-id]

    where [selected-project-id] is your project ID. (Omit the enclosing brackets.)

Install frameworks

macOS

Within your virtual environment, run the following command to install the versions of scikit-learn, XGBoost, and pandas used in AI Platform Training runtime version 2.11:

(aip-env)$ pip install scikit-learn==1.0.2 xgboost==1.6.2 pandas==1.3.5

By providing version numbers in the preceding command, you ensure that the dependencies in your virtual environment match the dependencies in the runtime version. This helps prevent unexpected behavior when your code runs on AI Platform Training.
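
To confirm that your local environment actually matches these pins, you can print the installed library versions; the same check works inside your training code, where the output lands in the job logs:

# Confirm installed library versions match the AI Platform Training
# runtime version you plan to target (2.11 in this tutorial).
import pandas
import sklearn
import xgboost

print('scikit-learn:', sklearn.__version__)
print('xgboost:', xgboost.__version__)
print('pandas:', pandas.__version__)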

For more details, installation options, and troubleshooting information, refer to the installation instructions for each framework.

Cloud Shell

Run the following command to install scikit-learn, XGBoost, and pandas:

pip install --user scikit-learn xgboost pandas

For more details, installation options, and troubleshooting information, refer to the installation instructions for each framework.

Set up your Cloud Storage bucket

You'll need a Cloud Storage bucket to store your training code and dependencies. For the purposes of this tutorial, it is easiest to use a dedicated Cloud Storage bucket in the same project you're using for AI Platform Training.

If you're using a bucket in a different project, you must ensure that your AI Platform Training service account can access your training code and dependencies in Cloud Storage. Without the appropriate permissions, your training job fails. See how to grant permissions for storage.

Make sure to use or set up a bucket in the same region you're using to run training jobs. See the available regions for AI Platform Training services.

This section shows you how to create a new bucket. You can use an existing bucket, but it must be in the same region where you plan on running AI Platform jobs. Additionally, if it is not part of the project you are using to run AI Platform Training, you must explicitly grant access to the AI Platform Training service accounts.
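
The exact procedure is covered in the permissions guide linked above, but as a rough sketch, you can grant the AI Platform Training service agent access with gsutil iam ch. The service account address, project number, and bucket name below are illustrative assumptions; confirm the correct account for your project in the permissions documentation before using this.

# Rough sketch (not the canonical procedure): grant the AI Platform Training
# service agent objectAdmin access on a bucket in another project.
# The service account format below is an assumption; verify it in the
# permissions guide.
import subprocess

project_number = '123456789012'    # hypothetical: number of the project running training
bucket_name = 'your-other-bucket'  # hypothetical: bucket in the other project
service_account = (
    'service-{}@cloud-ml.google.com.iam.gserviceaccount.com'.format(project_number))

subprocess.check_call([
    'gsutil', 'iam', 'ch',
    'serviceAccount:{}:roles/storage.objectAdmin'.format(service_account),
    'gs://{}'.format(bucket_name),
])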

  1. Specify a name for your new bucket. The name must be unique across all buckets in Cloud Storage.

    BUCKET_NAME="YOUR_BUCKET_NAME"

    For example, use your project ID with -aiplatform appended:

    PROJECT_ID=$(gcloud config list project --format "value(core.project)")
    BUCKET_NAME=${PROJECT_ID}-aiplatform
  2. Check the bucket name that you set.

    echo $BUCKET_NAME
  3. Select a region for your bucket and set a REGION environment variable.

    Use the same region where you plan on running AI Platform Training jobs. See the available regions for AI Platform Training services.

    For example, the following code creates REGION and sets it to us-central1:

    REGION=us-central1
  4. Create the new bucket:

    gcloud storage buckets create gs://$BUCKET_NAME --location=$REGION
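
To double-check that the bucket exists and landed in the intended region, you can list its metadata; a small sketch that shells out to gsutil through subprocess, mirroring how the training module in this tutorial calls gsutil:

# List bucket metadata (including the Location constraint) to confirm the
# bucket was created in the region you intend to use for training jobs.
import os
import subprocess

bucket_name = os.environ.get('BUCKET_NAME', 'your-bucket-name')
subprocess.check_call(['gsutil', 'ls', '-L', '-b', 'gs://%s' % bucket_name])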

Create your Python training module

Create a file, iris_training.py, that contains the code to train your model. This section provides an explanation of what each part of the training code does:

  • Setup and imports
  • Download the data from Cloud Storage
  • Load data into pandas
  • Train and save your model
  • Upload your saved model file to Cloud Storage

For your convenience, the full code for iris_training.py is hosted on GitHub so you can use it for this tutorial.

Setup

Import the following libraries from Python and scikit-learn or XGBoost. Set a variable for the name of your Cloud Storage bucket.

scikit-learn

import datetime
import os
import subprocess
import sys
import pandas as pd
from sklearn import svm
import joblib

# Fill in your Cloud Storage bucket name
BUCKET_NAME = '<YOUR_BUCKET_NAME>'

XGBoost

import datetime
import os
import subprocess
import sys
import pandas as pd
import xgboost as xgb

# Fill in your Cloud Storage bucket name
BUCKET_NAME = '<YOUR_BUCKET_NAME>'

Download data from Cloud Storage

During the typical development process, you upload your own data to Cloud Storage so that AI Platform Training can access it. The data for this tutorial is hosted in a public Cloud Storage bucket: gs://cloud-samples-data/ai-platform/iris/

The following code downloads the data files using gsutil. Because gsutil writes its status output to stderr, the calls redirect stderr to stdout so that the output appears in the job logs:

scikit-learn

iris_data_filename = 'iris_data.csv'
iris_target_filename = 'iris_target.csv'
data_dir = 'gs://cloud-samples-data/ai-platform/iris'

# gsutil outputs everything to stderr so we need to divert it to stdout.
subprocess.check_call(['gsutil', 'cp', os.path.join(data_dir,
                                                    iris_data_filename),
                       iris_data_filename], stderr=sys.stdout)
subprocess.check_call(['gsutil', 'cp', os.path.join(data_dir,
                                                    iris_target_filename),
                       iris_target_filename], stderr=sys.stdout)

XGBoost

iris_data_filename = 'iris_data.csv'
iris_target_filename = 'iris_target.csv'
data_dir = 'gs://cloud-samples-data/ai-platform/iris'

# gsutil outputs everything to stderr so we need to divert it to stdout.
subprocess.check_call(['gsutil', 'cp', os.path.join(data_dir,
                                                    iris_data_filename),
                       iris_data_filename], stderr=sys.stdout)
subprocess.check_call(['gsutil', 'cp', os.path.join(data_dir,
                                                    iris_target_filename),
                       iris_target_filename], stderr=sys.stdout)

Load data into pandas

Use pandas to load your data into NumPy arrays for training with scikit-learn or XGBoost.

scikit-learn

# Load data into pandas, then use `.values` to get NumPy arrays
iris_data = pd.read_csv(iris_data_filename).values
iris_target = pd.read_csv(iris_target_filename).values

# Convert one-column 2D array into 1D array for use with scikit-learn
iris_target = iris_target.reshape((iris_target.size,))

XGBoost

# Load data into pandas, then use `.values` to get NumPy arrays
iris_data = pd.read_csv(iris_data_filename).values
iris_target = pd.read_csv(iris_target_filename).values

# Convert one-column 2D array into 1D array for use with XGBoost
iris_target = iris_target.reshape((iris_target.size,))
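
Whichever framework you use, a quick sanity check on the loaded arrays can catch data problems before you spend time training; a minimal sketch (only dimensionality and alignment are asserted, since the exact shapes depend on the CSV contents):

# Sanity-check the arrays before training: features should be a 2D array and
# labels a 1D array with one entry per row of features.
assert iris_data.ndim == 2, 'expected a 2D feature array'
assert iris_target.ndim == 1, 'expected a 1D label array'
assert len(iris_data) == len(iris_target), 'features and labels must align'
print('Loaded %d examples with %d features each' % iris_data.shape)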

Train and save a model

Create a training module for AI Platform Training to run. In this example, the training module trains a model on the Iris training data (iris_data and iris_target) and saves your trained model by exporting it to a file. If you want to use AI Platform Prediction to get online predictions after training, you must name your model file according to the library you use to export it. See more about the naming requirements for your model file.

scikit-learn

Following the scikit-learn example on model persistence, you can train and export a model as shown below:

# Train the model
classifier = svm.SVC(gamma='auto', verbose=True)
classifier.fit(iris_data, iris_target)

# Export the classifier to a file
model_filename = 'model.joblib'
joblib.dump(classifier, model_filename)

Alternatively, you can export the model by using the pickle library:

import pickle
with open('model.pkl', 'wb') as model_file:
  pickle.dump(classifier, model_file)
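
Before uploading, you can optionally confirm that the exported file loads back and predicts; a quick sketch using the joblib export from above (the same idea works with the pickle variant):

# Optional sanity check: reload the exported model and run a prediction.
loaded_classifier = joblib.load(model_filename)
print(loaded_classifier.predict(iris_data[:5]))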

XGBoost

You can export the model by using the save_model method of the Booster object.

# Load data into DMatrix object
dtrain = xgb.DMatrix(iris_data, label=iris_target)

# Train XGBoost model
bst = xgb.train({}, dtrain, 20)

# Export the classifier to a file
model_filename = 'model.bst'
bst.save_model(model_filename)

Alternatively, you can export the model by using the pickle library:

import pickle
with open('model.pkl', 'wb') as model_file:
  pickle.dump(bst, model_file)
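
As with scikit-learn, you can optionally confirm that the exported Booster loads back and predicts before uploading; a quick sketch using the save_model export from above:

# Optional sanity check: reload the exported Booster and run a prediction.
loaded_bst = xgb.Booster()
loaded_bst.load_model(model_filename)
print(loaded_bst.predict(xgb.DMatrix(iris_data[:5])))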

Model file naming requirements

For online prediction, the saved model file that you upload to Cloud Storage must be named one of: model.pkl, model.joblib, or model.bst, depending on which library you used. This restriction ensures that AI Platform Prediction uses the same pattern to reconstruct the model on import as was used during export.

This requirement does not apply if you create a custom prediction routine (beta).

scikit-learn

Library used to export model    Correct model name
pickle                          model.pkl
joblib                          model.joblib

XGBoost

Library used to export model    Correct model name
pickle                          model.pkl
joblib                          model.joblib
xgboost.Booster                 model.bst
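
If your training code picks the export filename programmatically, you can encode the tables above directly; a small sketch (the mapping mirrors the tables, and the keys are just illustrative labels):

# Filenames AI Platform Prediction expects, keyed by export library.
REQUIRED_MODEL_FILENAMES = {
    'pickle': 'model.pkl',
    'joblib': 'model.joblib',
    'xgboost.Booster': 'model.bst',
}

model_filename = REQUIRED_MODEL_FILENAMES['joblib']  # e.g. when exporting with joblib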

For future iterations of your model, organize your Cloud Storage bucket so that each new model has a dedicated directory.

Upload your saved model to Cloud Storage

If you're using a Cloud Storage bucket outside of the Google Cloud project you're using to run AI Platform Training, make sure that AI Platform Training has access to your bucket.

scikit-learn

# Upload the saved model file to Cloud Storage
gcs_model_path = os.path.join('gs://', BUCKET_NAME,
    datetime.datetime.now().strftime('iris_%Y%m%d_%H%M%S'), model_filename)
subprocess.check_call(['gsutil', 'cp', model_filename, gcs_model_path],
    stderr=sys.stdout)

XGBoost

# Upload the saved model file to Cloud Storage
gcs_model_path = os.path.join('gs://', BUCKET_NAME,
    datetime.datetime.now().strftime('iris_%Y%m%d_%H%M%S'), model_filename)
subprocess.check_call(['gsutil', 'cp', model_filename, gcs_model_path],
    stderr=sys.stdout)

Create training application package

With iris_training.py created from the above snippets, create a training application package that includes iris_training.py as its main module.

The easiest (and recommended) way to create a training application package is to use gcloud to package and upload the application when you submit your training job. This method requires a very simple file structure with two files:

scikit-learn

For this tutorial, the file structure of your training application package should appear similar to the following:

iris_sklearn_trainer/
    __init__.py
    iris_training.py
  1. In the command line, create a directory locally:

    mkdir iris_sklearn_trainer
    
  2. Create an empty file named __init__.py:

    touch iris_sklearn_trainer/__init__.py
    
  3. Save your training code as iris_training.py within your iris_sklearn_trainer directory. Alternatively, use cURL to download and save the file from GitHub:

    curl https://raw.githubusercontent.com/GoogleCloudPlatform/cloudml-samples/master/sklearn/iris_training.py > iris_sklearn_trainer/iris_training.py
    

    View the full source code on GitHub.

  4. Confirm that your training application package is set up correctly:

    ls ./iris_sklearn_trainer
      __init__.py  iris_training.py
    

XGBoost

For this tutorial, the file structure of your training application package should appear similar to the following:

iris_xgboost_trainer/
    __init__.py
    iris_training.py
  1. In the command line, create a directory locally:

    mkdir iris_xgboost_trainer
    
  2. Create an empty file named __init__.py:

    touch iris_xgboost_trainer/__init__.py
    
  3. Save your training code as iris_training.py within your iris_xgboost_trainer directory. Alternatively, use cURL to download and save the file from GitHub:

    curl https://raw.githubusercontent.com/GoogleCloudPlatform/cloudml-samples/master/xgboost/iris_training.py > iris_xgboost_trainer/iris_training.py
    

    View the full source code on GitHub.

  4. Confirm that your training application package is set up correctly:

    ls ./iris_xgboost_trainer
      __init__.py  iris_training.py
    

Learn more about packaging a training application.

Run trainer locally

You can test your training application locally using the gcloud ai-platform local train command. This step is optional, but it is helpful for debugging purposes.

scikit-learn

In the command line, set the following environment variables:

TRAINING_PACKAGE_PATH="./iris_sklearn_trainer/"
MAIN_TRAINER_MODULE="iris_sklearn_trainer.iris_training"

Test your training job locally:

gcloud ai-platform local train \
  --package-path $TRAINING_PACKAGE_PATH \
  --module-name $MAIN_TRAINER_MODULE

XGBoost

In the command line, set the following environment variables:

TRAINING_PACKAGE_PATH="./iris_xgboost_trainer/"
MAIN_TRAINER_MODULE="iris_xgboost_trainer.iris_training"

Test your training job locally:

gcloud ai-platform local train \
  --package-path $TRAINING_PACKAGE_PATH \
  --module-name $MAIN_TRAINER_MODULE

Submit training job

In this section, you use gcloud ai-platform jobs submit training to submit your training job.

Specify training job parameters

Set the following environment variables for each parameter in your training job request:

  • BUCKET_NAME - The name of your Cloud Storage bucket.
  • JOB_NAME - A name to use for the job (mixed-case letters, numbers, and underscores only, starting with a letter). For example, iris_scikit_learn_$(date +"%Y%m%d_%H%M%S") or iris_xgboost_$(date +"%Y%m%d_%H%M%S").
  • JOB_DIR - The path to a Cloud Storage location to use for your training job's output files. For example, gs://$BUCKET_NAME/scikit_learn_job_dir or gs://$BUCKET_NAME/xgboost_job_dir.
  • TRAINING_PACKAGE_PATH - The local path to the root directory of your training application. For example, ./iris_sklearn_trainer/ or ./iris_xgboost_trainer/.
  • MAIN_TRAINER_MODULE - Specifies which file the AI Platform Training service should run. This is formatted as [YOUR_FOLDER_NAME.YOUR_PYTHON_FILE_NAME]. For example, iris_sklearn_trainer.iris_training or iris_xgboost_trainer.iris_training.
  • REGION - The name of the region you're using to run your training job. Use one of the available regions for AI Platform Training. Make sure your Cloud Storage bucket is in the same region.
  • RUNTIME_VERSION - An AI Platform Training runtime version that supports scikit-learn and XGBoost. In this example, 2.11.
  • PYTHON_VERSION - The Python version to use for the job. For this tutorial, specify Python 3.7.
  • SCALE_TIER - A predefined cluster specification for machines to run your training job. In this case, BASIC. You can also use custom scale tiers to define your own cluster configuration for training.

For your convenience, the environment variables for this tutorial are below.

scikit-learn

Replace [VALUES-IN-BRACKETS] with the appropriate values:

    BUCKET_NAME=[YOUR-BUCKET-NAME]
    JOB_NAME="iris_scikit_learn_$(date +"%Y%m%d_%H%M%S")"
    JOB_DIR=gs://$BUCKET_NAME/scikit_learn_job_dir
    TRAINING_PACKAGE_PATH="./iris_sklearn_trainer/"
    MAIN_TRAINER_MODULE="iris_sklearn_trainer.iris_training"
    REGION=us-central1
    RUNTIME_VERSION=2.11
    PYTHON_VERSION=3.7
    SCALE_TIER=BASIC

XGBoost

Replace [VALUES-IN-BRACKETS] with the appropriate values:

    BUCKET_NAME=[YOUR-BUCKET-NAME]
    JOB_NAME="iris_xgboost_$(date +"%Y%m%d_%H%M%S")"
    JOB_DIR=gs://$BUCKET_NAME/xgboost_job_dir
    TRAINING_PACKAGE_PATH="./iris_xgboost_trainer/"
    MAIN_TRAINER_MODULE="iris_xgboost_trainer.iris_training"
    REGION=us-central1
    RUNTIME_VERSION=2.11
    PYTHON_VERSION=3.7
    SCALE_TIER=BASIC

Submit the training job request:

gcloud ai-platform jobs submit training $JOB_NAME \
  --job-dir $JOB_DIR \
  --package-path $TRAINING_PACKAGE_PATH \
  --module-name $MAIN_TRAINER_MODULE \
  --region $REGION \
  --runtime-version=$RUNTIME_VERSION \
  --python-version=$PYTHON_VERSION \
  --scale-tier $SCALE_TIER

You should see output similar to the following:

Job [iris_scikit_learn_[DATE]_[TIME]] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe iris_scikit_learn_[DATE]_[TIME]

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs iris_scikit_learn_[DATE]_[TIME]

jobId: iris_scikit_learn_[DATE]_[TIME]
state: QUEUED
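
The job moves through states such as QUEUED, PREPARING, and RUNNING before reaching SUCCEEDED or FAILED. If you'd rather poll the state from a script than from the console, a minimal sketch using the gcloud CLI (this assumes gcloud is authenticated and on your PATH, and job_name is the value you used for $JOB_NAME):

# Poll the training job until it reaches a terminal state.
import subprocess
import time

job_name = 'your-job-name'  # the value you used for $JOB_NAME

while True:
    state = subprocess.check_output([
        'gcloud', 'ai-platform', 'jobs', 'describe', job_name,
        '--format=value(state)',
    ]).decode().strip()
    print('Job state:', state)
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(60)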

Viewing your training logs (optional)

AI Platform Training captures all stdout and stderr streams and logging statements. These logs are stored in Cloud Logging, where they are visible both during and after execution.

To view the logs for your training job:

Console

  1. Open your AI Platform Training Jobs page.

    Open jobs in the Google Cloud console

  2. Select the name of the training job to inspect. This brings you to the Job details page for your selected training job.

  3. Within the job details, select the View logs link. This brings you to the Logging page where you can search and filter logs for your selected training job.

gcloud

You can view logs in your terminal with gcloud ai-platform jobs stream-logs.

gcloud ai-platform jobs stream-logs $JOB_NAME

Verify your model file in Cloud Storage

View the contents of the destination model folder to verify that your saved model file has been uploaded to Cloud Storage.

gcloud storage ls gs://$BUCKET_NAME/iris_*

Example output:

gs://bucket-name/iris_20180518_123815/:
gs://bucket-name/iris_20180518_123815/model.joblib

What's next