Getting online predictions with XGBoost

This sample trains a simple model to predict a person's income level based on the Census Income Data Set. After you train and save the model locally, you deploy it to AI Platform and query it to get online predictions.

This content is also available on GitHub as a Jupyter notebook.

How to bring your model to AI Platform

You can bring your model to AI Platform to get predictions in five steps:

  • Save your model to a file
  • Upload the saved model to Cloud Storage
  • Create a model resource on AI Platform
  • Create a model version, linking your saved model
  • Make an online prediction

Before you begin

Complete the following steps to set up a GCP account, activate the AI Platform API, and install and activate the Cloud SDK.

Set up your GCP project

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. In the Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to the project selector page

  3. Make sure that billing is enabled for your Google Cloud project. Learn how to confirm billing is enabled for your project.

  4. Enable the AI Platform ("Cloud Machine Learning Engine") and Compute Engine APIs.

    Enable the APIs

  5. Install and initialize the Cloud SDK.

Set up your environment

Choose one of the options below to set up your environment locally on macOS or in a remote environment on Cloud Shell.

For macOS users, we recommend that you set up your environment using the MACOS tab below. Cloud Shell, shown on the CLOUD SHELL tab, is available on macOS, Linux, and Windows. Cloud Shell provides a quick way to try AI Platform, but isn’t suitable for ongoing development work.

macOS

  1. Check Python installation
    Confirm that you have Python installed and, if necessary, install it.

    python -V
  2. Check pip installation
    pip is Python’s package manager, included with current versions of Python. Check if you already have pip installed by running pip --version. If not, see how to install pip.

    You can upgrade pip using the following command:

    pip install -U pip

    See the pip documentation for more details.

  3. Install virtualenv
    virtualenv is a tool to create isolated Python environments. Check if you already have virtualenv installed by running virtualenv --version. If not, install virtualenv:

    pip install --user --upgrade virtualenv

    To create an isolated development environment for this guide, create a new virtual environment in virtualenv. For example, the following command activates an environment named cmle-env:

    virtualenv cmle-env
    source cmle-env/bin/activate
  4. For the purposes of this tutorial, run the rest of the commands within your virtual environment.

    See more information about using virtualenv. To exit virtualenv, run deactivate.

Cloud Shell

  1. Open the Google Cloud Console.

    Google Cloud Console

  2. Click the Activate Google Cloud Shell button at the top of the console window.

    Activate Google Cloud Shell

    A Cloud Shell session opens inside a new frame at the bottom of the console and displays a command-line prompt. It can take a few seconds for the shell session to be initialized.

    Cloud Shell session

    Your Cloud Shell session is ready to use.

  3. Configure the gcloud command-line tool to use your selected project.

    gcloud config set project [selected-project-id]

    where [selected-project-id] is your project ID. (Omit the enclosing brackets.)

Verify the Google Cloud SDK components

To verify that the Google Cloud SDK components are installed:

  1. List your models:

    gcloud ai-platform models list
  2. If you have not created any models before, the command returns an empty list:

    Listed 0 items.

    After you start creating models, you can see them listed by using this command.

  3. If you have installed gcloud previously, update gcloud:

    gcloud components update

Install frameworks

macOS

Within your virtual environment, run the following command to install the versions of scikit-learn, XGBoost, and pandas used in AI Platform runtime version 1.14:

(cmle-env)$ pip install scikit-learn==0.20.2 xgboost==0.81 pandas==0.24.0

By providing version numbers in the preceding command, you ensure that the dependencies in your virtual environment match the dependencies in the runtime version. This helps prevent unexpected behavior when your code runs on AI Platform.

For more details, installation options, and troubleshooting information, refer to the installation instructions for each framework:

Cloud Shell

Run the following command to install scikit-learn, XGBoost, and pandas:

pip install --user scikit-learn xgboost pandas

For more details, installation options, and troubleshooting information, refer to the installation instructions for each framework:

Download the data

The Census Income Data Set that this sample uses for training is hosted by the UC Irvine Machine Learning Repository. See About the data for more information.

  • Training file is adult.data
  • Evaluation file is adult.test

Train and save a model

To train and save a model, complete the following steps:

  1. Load the data into a pandas DataFrame to prepare it for use with XGBoost.
  2. Train a simple model in XGBoost.
  3. Save the model to a file that can be uploaded to AI Platform.

If you already have a trained model to upload, see how to export your model.

Load and transform data

In this step, you load the training and testing datasets into a pandas DataFrame and transform the categorical data into numeric features to prepare it for use with your model.

import json
import numpy as np
import os
import pandas as pd
import pickle
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder

# these are the column labels from the census data files
COLUMNS = (
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'income-level'
)

# categorical columns contain data that need to be turned into numerical
# values before being used by XGBoost
CATEGORICAL_COLUMNS = (
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country'
)

# load training set
with open('./census_data/adult.data', 'r') as train_data:
    raw_training_data = pd.read_csv(train_data, header=None, names=COLUMNS)
# remove column we are trying to predict ('income-level') from features list
train_features = raw_training_data.drop('income-level', axis=1)
# create training labels list
train_labels = (raw_training_data['income-level'] == ' >50K')


# load test set
with open('./census_data/adult.test', 'r') as test_data:
    raw_testing_data = pd.read_csv(test_data, names=COLUMNS, skiprows=1)
# remove column we are trying to predict ('income-level') from features list
test_features = raw_testing_data.drop('income-level', axis=1)
# create training labels list
test_labels = (raw_testing_data['income-level'] == ' >50K.')

# convert data in categorical columns to numerical values
encoders = {col:LabelEncoder() for col in CATEGORICAL_COLUMNS}
for col in CATEGORICAL_COLUMNS:
    train_features[col] = encoders[col].fit_transform(train_features[col])
for col in CATEGORICAL_COLUMNS:
    test_features[col] = encoders[col].fit_transform(test_features[col])

Train and export your model

To export your model, you can use the save_model method of the Booster object, or the Python pickle library.

# load data into DMatrix object
dtrain = xgb.DMatrix(train_features, train_labels)
dtest = xgb.DMatrix(test_features)

# train XGBoost model
bst = xgb.train({}, dtrain, 20)
bst.save_model('./model.bst')

Model file naming requirements

The saved model file that you upload to Cloud Storage must be named one of: model.pkl, model.joblib, or model.bst, depending on which library you used. This restriction ensures that AI Platform uses the same pattern to reconstruct the model on import as was used during export.

Library used to export model Correct model name
pickle model.pkl
joblib model.joblib
xgboost.Booster model.bst

For future iterations of your model, organize your Cloud Storage bucket so that each new model has a dedicated directory.

Store your model in Cloud Storage

For the purposes of this tutorial, it is easiest to use a dedicated Cloud Storage bucket in the same project you're using for AI Platform.

If you're using a bucket in a different project, you must ensure that your AI Platform service account can access your model in Cloud Storage. Without the appropriate permissions, your request to create an AI Platform model version fails. See more about granting permissions for storage.

Set up your Cloud Storage bucket

This section shows you how to create a new bucket. You can use an existing bucket, but it must be in the same region where you plan on running AI Platform jobs. Additionally, if it is not part of the project you are using to run AI Platform, you must explicitly grant access to the AI Platform service accounts.

  1. Specify a name for your new bucket. The name must be unique across all buckets in Cloud Storage.

    BUCKET_NAME="your_bucket_name"

    For example, use your project name with -mlengine appended:

    PROJECT_ID=$(gcloud config list project --format "value(core.project)")
    BUCKET_NAME=${PROJECT_ID}-mlengine
  2. Check the bucket name that you created.

    echo $BUCKET_NAME
  3. Select a region for your bucket and set a REGION environment variable.

    Use the same region where you plan on running AI Platform jobs. See the available regions for AI Platform services.

    For example, the following code creates REGION and sets it to us-central1:

    REGION=us-central1
  4. Create the new bucket:

    gsutil mb -l $REGION gs://$BUCKET_NAME

Upload the exported model file to Cloud Storage

Run the following command to upload the model you exported earlier in this tutorial, to your bucket in Cloud Storage:

gsutil cp ./model.bst gs://$BUCKET_NAME/model.bst

You can use the same Cloud Storage bucket for multiple model files. Each model file must be within its own directory inside the bucket.

Format data for prediction

Before you send an online prediction request, you must format your test data to prepare it for use by the AI Platform prediction service. Make sure that the format of your input instances matches what your model expects.

gcloud

Create an input.json file with each input instance on a separate line. The following example uses the first ten data instances in the test_features list that was defined in previous steps.

    [25, "Private", 226802, "11th", 7, "Never-married", "Machine-op-inspct", "Own-child", "Black", "Male", 0, 0, 40, "United-States"]
    [38, "Private", 89814, "HS-grad", 9, "Married-civ-spouse", "Farming-fishing", "Husband", "White", "Male", 0, 0, 50, "United-States"]
    [28, "Local-gov", 336951, "Assoc-acdm", 12, "Married-civ-spouse", "Protective-serv", "Husband", "White", "Male", 0, 0, 40, "United-States"]
    [44, "Private", 160323, "Some-college", 10, "Married-civ-spouse", "Machine-op-inspct", "Husband", "Black", "Male", 7688, 0, 40, "United-States"]
    [18, "?", 103497, "Some-college", 10, "Never-married", "?", "Own-child", "White", "Female", 0, 0, 30, "United-States"]
    [34, "Private", 198693, "10th", 6, "Never-married", "Other-service", "Not-in-family", "White", "Male", 0, 0, 30, "United-States"]
    [29, "?", 227026, "HS-grad", 9, "Never-married", "?", "Unmarried", "Black", "Male", 0, 0, 40, "United-States"]
    [63, "Self-emp-not-inc", 104626, "Prof-school", 15, "Married-civ-spouse", "Prof-specialty", "Husband", "White", "Male", 3103, 0, 32, "United-States"]
    [24, "Private", 369667, "Some-college", 10, "Never-married", "Other-service", "Unmarried", "White", "Female", 0, 0, 40, "United-States"]
    [55, "Private", 104996, "7th-8th", 4, "Married-civ-spouse", "Craft-repair", "Husband", "White", "Male", 0, 0, 10, "United-States"]

Note that the format of input instances needs to match what your model expects. In this example, the Census model requires 14 features, so your input must be a matrix of shape (num_instances, 14).

REST API

Create an input.json file formatted with each input instance on a separate line. The following example uses the first ten data instances in the test_features list that was defined in previous steps.

{
  "instances": [

    [25, "Private", 226802, "11th", 7, "Never-married", "Machine-op-inspct", "Own-child", "Black", "Male", 0, 0, 40, "United-States"],
    [38, "Private", 89814, "HS-grad", 9, "Married-civ-spouse", "Farming-fishing", "Husband", "White", "Male", 0, 0, 50, "United-States"],
    [28, "Local-gov", 336951, "Assoc-acdm", 12, "Married-civ-spouse", "Protective-serv", "Husband", "White", "Male", 0, 0, 40, "United-States"],
    [44, "Private", 160323, "Some-college", 10, "Married-civ-spouse", "Machine-op-inspct", "Husband", "Black", "Male", 7688, 0, 40, "United-States"],
    [18, "?", 103497, "Some-college", 10, "Never-married", "?", "Own-child", "White", "Female", 0, 0, 30, "United-States"],
    [34, "Private", 198693, "10th", 6, "Never-married", "Other-service", "Not-in-family", "White", "Male", 0, 0, 30, "United-States"],
    [29, "?", 227026, "HS-grad", 9, "Never-married", "?", "Unmarried", "Black", "Male", 0, 0, 40, "United-States"],
    [63, "Self-emp-not-inc", 104626, "Prof-school", 15, "Married-civ-spouse", "Prof-specialty", "Husband", "White", "Male", 3103, 0, 32, "United-States"],
    [24, "Private", 369667, "Some-college", 10, "Never-married", "Other-service", "Unmarried", "White", "Female", 0, 0, 40, "United-States"],
    [55, "Private", 104996, "7th-8th", 4, "Married-civ-spouse", "Craft-repair", "Husband", "White", "Male", 0, 0, 10, "United-States"]
  ]
}

Note that the format of input instances needs to match what your model expects. In this example, the Census model requires 14 features, so your input must be a matrix of shape (num_instances, 14).

See more information on formatting your input for online prediction.

Test your model with local predictions

You can use the gcloud ai-platform local predict command to test how your model serves predictions before you deploy it to AI Platform Prediction. The command uses dependencies in your local environment to perform prediction and returns results in the same format that gcloud ai-platform predict uses when it performs online predictions. Testing predictions locally can help you discover errors before you incur costs for online prediction requests.

For the --model-dir argument, specify a directory containing your exported machine learning model, either on your local machine or in Cloud Storage. For the --framework argument, specify tensorflow, scikit-learn, or xgboost. You cannot use the gcloud ai-platform local predict command with a custom prediction routine.

The following example shows how to perform local prediction:

gcloud ai-platform local predict --model-dir local-or-cloud-storage-path-to-model-directory/ \
  --json-instances local-path-to-prediction-input.json \
  --framework name-of-framework

Deploy models and versions

AI Platform organizes your trained models using model and version resources. An AI Platform model is a container for the versions of your machine learning model.

To deploy a model, you create a model resource in AI Platform, create a version of that model, then link the model version to the model file stored in Cloud Storage.

Create a model resource

AI Platform uses model resources to organize different versions of your model.

console

  1. Open the AI Platform models page in the Cloud Console:

    Open models in the Cloud Console

  2. If needed, create the model to add your new version to:

    1. Click the New Model button at the top of the Models page. This brings you to the Create model page.

    2. Enter a unique name for your model in the Model name box. Optionally, enter a description for your model in the Description field.

    3. Click Create.

    4. Verify that you have returned to the Models page, and that your new model appears in the list.

gcloud

Create a model resource for your model versions, filling in your desired name for your model without the enclosing brackets:

gcloud ai-platform models create "[YOUR-MODEL-NAME]"

REST API

  1. Format your request by placing the model object in the request body. At minimum, you must specify a name for your model. Fill in your desired name for your model without the enclosing brackets:

    {"name": "[YOUR-MODEL-NAME]"}
    
  2. Make your REST API call to the following path, replacing [VALUES_IN_BRACKETS] with the appropriate values:

    POST https://ml.googleapis.com/v1/projects/[YOUR-PROJECT-ID]/models/
    

    For example, you can make the following request using cURL:

    curl -X POST -H "Content-Type: application/json" \
      -d '{"name": "[YOUR-MODEL-NAME]"}' \
      -H "Authorization: Bearer `gcloud auth print-access-token`" \
      "https://ml.googleapis.com/v1/projects/[YOUR-PROJECT-ID]/models"
    

    You should see output similar to this:

    {
      "name": "projects/[YOUR-PROJECT-ID]/models/[YOUR-MODEL-NAME]",
      "regions": [
        "us-central1"
      ]
    }
    

See the AI Platform model API for more details.

Create a model version

Now you are ready to create a model version with the trained model you previously uploaded to Cloud Storage. When you create a version, specify the following parameters:

  • name: must be unique within the AI Platform model.
  • deploymentUri: the path to your model directory in Cloud Storage.

    • If you're deploying a TensorFlow model, this is a SavedModel directory.
    • If you're deploying a scikit-learn or XGBoost model, this is the directory containing your model.joblib, model.pkl, or model.bst file.
    • If you're deploying a custom prediction routine, this is the directory containing all your model artifacts. The total size of this directory must be 500 MB or less.
  • framework: TENSORFLOW, SCIKIT_LEARN, or XGBOOST.

  • runtimeVersion: a runtime version based on the dependencies your model needs. If you're deploying a scikit-learn model or an XGBoost model, this must be at least 1.4.

  • pythonVersion: must be set to "3.5" to be compatible with model files exported using Python 3. If not set, this defaults to "2.7".

  • machineType (optional): the type of virtual machine that AI Platform Prediction uses for the nodes that serve predictions. Learn more about machine types. If not set, this defaults to mls1-c1-m2.

See more information about each of these parameters in the AI Platform Training and Prediction API for a version resource.

See the full details for each runtime version.

console

  1. On the Models page, select the name of the model resource you would like to use to create your version. This brings you to the Model Details page.

    Open models in the Cloud Console

  2. Click the New Version button at the top of the Model Details page. This brings you to the Create version page.

  3. Enter your version name in the Name field. Optionally, enter a description for your version in the Description field.

  4. Enter the following information about how you trained your model in the corresponding dropdown boxes:

    • Select the Python version you used to train your model.
    • Select the Framework and Framework version.
    • Select the ML runtime version. Learn more about AI Platform runtime versions.
  5. Optionally, select a Machine type to run online prediction. This field defaults to "Single core CPU".

  6. In the Model URI field, enter the Cloud Storage bucket location where you uploaded your model file. You may use the Browse button to find the correct path.

    Make sure to specify the path to the directory containing the file, not the path to the model file itself. For example, use gs://your_bucket_name/model-dir/ instead of gs://your_bucket_name/model-dir/saved_model.pb or gs://your_bucket_name/model-dir/model.pkl.

  7. Select a Scaling option for online prediction deployment:

    • If you select "Auto scaling", the optional Minimum number of nodes field displays. You can enter the minimum number of nodes to keep running at all times, when the service has scaled down. This field defaults to 0.

    • If you select "Manual scaling", you must enter the Number of nodes you want to keep running at all times.

      Learn more about pricing for prediction costs.

  8. To finish creating your model version, click Save.

gcloud

  1. Set environment variables to store the path to the Cloud Storage directory where your model binary is located, your model name, your version name and your framework choice.

    When you create a version with the gcloud tool, you may provide the framework name in capital letters with underscores (for example, SCIKIT_LEARN) or in lowercase letters with hyphens (for example, scikit-learn). Both options lead to identical behavior.

    Replace [VALUES_IN_BRACKETS] with the appropriate values:

    MODEL_DIR="gs://your_bucket_name/"
    VERSION_NAME="[YOUR-VERSION-NAME]"
    MODEL_NAME="[YOUR-MODEL-NAME]"
    FRAMEWORK="[YOUR-FRAMEWORK_NAME]"
    

  2. Create the version:

    gcloud ai-platform versions create $VERSION_NAME \
      --model $MODEL_NAME \
      --origin $MODEL_DIR \
      --runtime-version=1.14 \
      --framework $FRAMEWORK \
      --python-version=3.5
    

    Creating the version takes a few minutes. When it is ready, you should see the following output:

    Creating version (this might take a few minutes)......done.

  3. Get information about your new version:

    gcloud ai-platform versions describe $VERSION_NAME \
      --model $MODEL_NAME
    

    You should see output similar to this:

    createTime: '2018-02-28T16:30:45Z'
    deploymentUri: gs://your_bucket_name
    framework: [YOUR-FRAMEWORK-NAME]
    machineType: mls1-c1-m2
    name: projects/[YOUR-PROJECT-ID]/models/[YOUR-MODEL-NAME]/versions/[YOUR-VERSION-NAME]
    pythonVersion: '3.5'
    runtimeVersion: '1.14'
    state: READY

REST API

  1. Format your request body to contain the version object. This example specifies the version name, deploymentUri, runtimeVersion and framework. Replace [VALUES_IN_BRACKETS] with the appropriate values:

      {
        "name": "[YOUR-VERSION-NAME]",
        "deploymentUri": "gs://your_bucket_name/"
        "runtimeVersion": "1.14"
        "framework": "[YOUR_FRAMEWORK_NAME]"
        "pythonVersion": "3.5"
      }
    
  2. Make your REST API call to the following path, replacing [VALUES_IN_BRACKETS] with the appropriate values:

      POST https://ml.googleapis.com/v1/projects/[YOUR-PROJECT-ID]/models/[YOUR-MODEL-NAME]/versions
    

    For example, you can make the following request using cURL:

        curl -X POST -H "Content-Type: application/json" \
          -d '{"name": "[YOUR-VERSION-NAME]", "deploymentUri": "gs://your_bucket_name/", "runtimeVersion": "1.14", "framework": "[YOUR_FRAMEWORK_NAME]", "pythonVersion": "3.5"}' \
          -H "Authorization: Bearer `gcloud auth print-access-token`" \
          "https://ml.googleapis.com/v1/projects/[YOUR-PROJECT-ID]/models/[YOUR-MODEL-NAME]/versions"
    

    Creating the version takes a few minutes. When it is ready, you should see output similar to this:

      {
        "name": "projects/[YOUR-PROJECT-ID]/operations/create_[YOUR-MODEL-NAME]_[YOUR-VERSION-NAME]-[TIMESTAMP]",
        "metadata": {
          "@type": "type.googleapis.com/google.cloud.ml.v1.OperationMetadata",
          "createTime": "2018-07-07T02:51:50Z",
          "operationType": "CREATE_VERSION",
          "modelName": "projects/[YOUR-PROJECT-ID]/models/[YOUR-MODEL-NAME]",
          "version": {
            "name": "projects/[YOUR-PROJECT-ID]/models/[YOUR-MODEL-NAME]/versions/[YOUR-VERSION-NAME]",
            "deploymentUri": "gs://your_bucket_name",
            "createTime": "2018-07-07T02:51:49Z",
            "runtimeVersion": "1.14",
            "framework": "[YOUR_FRAMEWORK_NAME]",
            "machineType": "mls1-c1-m2",
            "pythonVersion": "3.5"
          }
        }
      }
    

Send online prediction request

After you have successfully created a version, AI Platform starts a new server that is ready to serve prediction requests.

gcloud

  1. Set environment variables for your model name, version name, and the name of your input file. Replace [VALUES_IN_BRACKETS] with the appropriate values:

    MODEL_NAME="[YOUR-MODEL-NAME]"
    VERSION_NAME="[YOUR-VERSION-NAME]"
    INPUT_FILE="input.json"
    
  2. Send the prediction request:

    gcloud ai-platform predict --model $MODEL_NAME --version \
      $VERSION_NAME --json-instances $INPUT_FILE
    

Python

This sample uses the Python client library to send prediction requests for the entire Census dataset, and prints out the first ten results. See more information about how to use the Python Client Library.

Replace [VALUES_IN_BRACKETS] with the appropriate values:

import googleapiclient.discovery

# Fill in your PROJECT_ID, VERSION_NAME and MODEL_NAME before running
# this code.

PROJECT_ID = [YOUR PROJECT_ID HERE]
VERSION_NAME = [YOUR VERSION_NAME HERE]
MODEL_NAME = [YOUR MODEL_NAME HERE]

service = googleapiclient.discovery.build('ml', 'v1')
name = 'projects/{}/models/{}'.format(PROJECT_ID, MODEL_NAME)
name += '/versions/{}'.format(VERSION_NAME)

response = service.projects().predict(
    name=name,
    body={'instances': data}
).execute()

if 'error' in response:
  print (response['error'])
else:
  online_results = response['predictions']

For XGBoost, the results are floats, and they need to be converted to booleans at whichever threshold is appropriate for your model. For example:

# convert floats to booleans
converted_responses = [x > 0.5 for x in online_results]

See more information about prediction input parameters in the AI Platform API Predict Request details.

About the data

The Census Income Data Set that this sample uses for training is hosted by the UC Irvine Machine Learning Repository.

Census data courtesy of: Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://archive.ics.uci.edu/ml - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

What's next

Var denne siden nyttig? Si fra hva du synes:

Send tilbakemelding om ...

Trenger du hjelp? Gå til brukerstøttesiden vår.