This tutorial shows how to use AI Platform Prediction to deploy a scikit-learn pipeline that uses custom transformers.
scikit-learn pipelines allow you to compose multiple estimators. For example, you can use transformers to preprocess data and pass the transformed data to a classifier. scikit-learn provides many transformers in the sklearn package.
You can also use scikit-learn's FunctionTransformer or TransformerMixin class to create your own custom transformer. If you want to deploy a pipeline that uses custom transformers to AI Platform Prediction, you must provide that code to AI Platform Prediction as a source distribution package.
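As a quick, standalone illustration (not part of the tutorial's training package), the following sketch wraps a simple preprocessing function in FunctionTransformer ahead of a classifier. The clip_negatives function and the tiny X and y arrays are made-up placeholders:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression

def clip_negatives(X):
    # Hypothetical preprocessing step: replace negative values with zero.
    return np.clip(X, 0, None)

pipeline = Pipeline([
    ('clip', FunctionTransformer(clip_negatives, validate=False)),
    ('classifier', LogisticRegression()),
])

# Placeholder training data for illustration only.
X = np.array([[1.0, -2.0], [3.0, 4.0], [-1.0, 0.5]])
y = np.array([0, 1, 0])
pipeline.fit(X, y)
print(pipeline.predict(X))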
This tutorial presents a sample problem involving Census data to walk you through the following steps:
- Training a scikit-learn pipeline with custom transformers on AI Platform Training
- Deploying the trained pipeline and your custom code to AI Platform Prediction
- Serving prediction requests from that deployment
Dataset
This tutorial uses the United States Census Income Dataset provided by the UC Irvine Machine Learning Repository. This dataset contains information about people from a 1994 Census database, including age, education, marital status, occupation, and whether they make more than $50,000 a year.
The data used in this tutorial is available in a public Cloud Storage
bucket:
gs://cloud-samples-data/ai-platform/sklearn/census_data/
Objective
The goal is to train a scikit-learn pipeline that predicts whether a person makes more than $50,000 a year (target label) based on other Census information about the person (features).
This tutorial focuses more on using this model with AI Platform Prediction than on the design of the model itself. However, it's always important to think about potential problems and unintended consequences when building machine learning systems. See the Machine Learning Crash Course exercise about fairness to learn about sources of bias in the Census dataset, as well as machine learning fairness more generally.
Costs
This tutorial uses billable components of Google Cloud:
- AI Platform Training
- AI Platform Prediction
- Cloud Storage
Learn about AI Platform Training pricing, AI Platform Prediction pricing, and Cloud Storage pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.
Before you begin
You must do several things before you can train and deploy a model on AI Platform Prediction:
- Set up your local development environment.
- Set up a Google Cloud project with billing and the necessary APIs enabled.
- Create a Cloud Storage bucket to store your training package and your trained model.
Set up your local development environment
You need the following to complete this tutorial:
- Python 3
- virtualenv
- The Google Cloud SDK
The Google Cloud guide to Setting up a Python development environment provides detailed instructions for meeting these requirements. The following steps provide a condensed set of instructions:
- Install virtualenv and create a virtual environment that uses Python 3 (example commands below).
- Activate that environment.
- Complete the steps in the following section to install the Google Cloud SDK.
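For example, on macOS or Linux, the first two steps might look like the following. This is one possible set of commands, and the environment directory name env is arbitrary:

pip install virtualenv
virtualenv --python python3 env
source env/bin/activate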
Set up your Google Cloud project
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the AI Platform Training & Prediction and Compute Engine APIs.
- Install the Google Cloud CLI.
- To initialize the gcloud CLI, run the following command:
gcloud init
Authenticate your GCP account
To set up authentication, you need to create a service account key and set an environment variable for the file path to the service account key.
- Create a service account:
  - In the Google Cloud console, go to the Create service account page.
  - In the Service account name field, enter a name.
  - Optional: In the Service account description field, enter a description.
  - Click Create.
  - Click the Select a role field. Under All roles, select AI Platform > AI Platform Admin.
  - Click Add another role.
  - Click the Select a role field. Under All roles, select Storage > Storage Object Admin.
  - Click Done to create the service account. Do not close your browser window. You will use it in the next step.
- Create a service account key for authentication:
  - In the Google Cloud console, click the email address for the service account that you created.
  - Click Keys.
  - Click Add key, then Create new key.
  - Click Create. A JSON key file is downloaded to your computer.
  - Click Close.
- Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the file path of the JSON file that contains your service account key. This variable only applies to your current shell session, so if you open a new session, set the variable again.
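For example, in a bash shell (the key file path below is a placeholder for wherever your JSON key was downloaded):

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your-service-account-key.json"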
Create a Cloud Storage bucket
This tutorial uses Cloud Storage in several ways:
- When you submit a training job using the Cloud SDK, you upload a Python package containing your training code to a Cloud Storage bucket. AI Platform Training runs the code from this package. In this tutorial, AI Platform Training also saves the trained model that results from your job in the same bucket.
- To deploy your scikit-learn pipeline that uses custom code to AI Platform Prediction, you must upload the custom transformers that your pipeline uses to Cloud Storage.
- When you create the AI Platform Prediction version resource that serves predictions, you provide the trained scikit-learn pipeline and your custom code as Cloud Storage URIs.
Set the name of your Cloud Storage bucket as an environment variable. It must be unique across all Cloud Storage buckets:
BUCKET_NAME="your-bucket-name"
Select a region where AI Platform Training and AI Platform Prediction are available, and create another environment variable. For example:
REGION="us-central1"
Create your Cloud Storage bucket in this region and, later, use the same region for training and prediction. Run the following command to create the bucket if it doesn't already exist:
gcloud storage buckets create gs://$BUCKET_NAME --location=$REGION
Creating a training application and custom pipeline code
Create an application to train a scikit-learn pipeline with the Census data. In this tutorial, the training package also contains the custom code that the trained pipeline uses during prediction. This is a useful pattern, because pipelines are generally designed to use the same transformers during training and prediction.
Use the following steps to create a directory with three files inside that matches the following structure:
census_package/
__init__.py
my_pipeline.py
train.py
First, create the empty census_package/ directory:
mkdir census_package
Within census_package/, create a blank file named __init__.py:
touch ./census_package/__init__.py
This makes it possible to import census_package as a package in Python.
Create custom transformers
scikit-learn provides many transformers that you can use as part of a pipeline, but it also lets you define your own custom transformers. These transformers can even learn a saved state during training that gets used later during prediction.
Extend sklearn.base.TransformerMixin to define three transformers:
- PositionalSelector: Given a list of column indices C and a matrix M, this returns a matrix with the subset of M's columns indicated by C.
- StripString: Given a matrix of strings, this strips whitespace from each string.
- SimpleOneHotEncoder: A simple one-hot encoder that can be applied to a matrix of strings.
To do this, write the following code to a file named census_package/my_pipeline.py.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class PositionalSelector(BaseEstimator, TransformerMixin):
    def __init__(self, positions):
        self.positions = positions

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array(X)[:, self.positions]


class StripString(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        strip = np.vectorize(str.strip)
        return strip(np.array(X))


class SimpleOneHotEncoder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.values = []
        for c in range(X.shape[1]):
            Y = X[:, c]
            values = {v: i for i, v in enumerate(np.unique(Y))}
            self.values.append(values)
        return self

    def transform(self, X):
        X = np.array(X)
        matrices = []
        for c in range(X.shape[1]):
            Y = X[:, c]
            matrix = np.zeros(shape=(len(Y), len(self.values[c])), dtype=np.int8)
            for i, x in enumerate(Y):
                if x in self.values[c]:
                    matrix[i][self.values[c][x]] = 1
            matrices.append(matrix)
        res = np.concatenate(matrices, axis=1)
        return res
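To sanity-check these transformers before packaging them, you can optionally run them on a tiny matrix in a local Python session. This quick test is not part of the tutorial's files, and the sample values are made up; it assumes you run Python from the directory that contains census_package/:

import numpy as np
from census_package.my_pipeline import StripString, SimpleOneHotEncoder

# Made-up matrix of strings with stray whitespace.
X = np.array([[' a ', 'x'],
              ['b', ' y'],
              ['a', 'x ']])

stripped = StripString().fit_transform(X)   # whitespace removed
encoder = SimpleOneHotEncoder().fit(stripped)
print(encoder.transform(stripped))          # one one-hot column per distinct value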
Define pipeline and create training module
Next, create a training module to train your scikit-learn pipeline on Census data. Part of this code involves defining the pipeline.
This training module does several things:
- It downloads training data and loads it into a pandas DataFrame that can be used by scikit-learn.
- It defines the scikit-learn pipeline to train. This example only uses three numerical features ('age', 'education-num', and 'hours-per-week') and three categorical features ('workclass', 'marital-status', and 'relationship') from the input data. It transforms the numerical features using scikit-learn's built-in StandardScaler and transforms the categorical ones with the custom one-hot encoder you defined in my_pipeline.py. Then it combines the preprocessed data as input for a classifier.
- Finally, it exports the model using the version of joblib included in scikit-learn and saves it to your Cloud Storage bucket.
Write the following code to census_package/train.py:
import warnings
import argparse
from google.cloud import storage
import pandas as pd
import numpy as np
from sklearn.externals import joblib
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
import census_package.my_pipeline as mp

warnings.filterwarnings('ignore')


def download_data(bucket_name, gcs_path, local_path):
    bucket = storage.Client().bucket(bucket_name)
    blob = bucket.blob(gcs_path)
    blob.download_to_filename(local_path)


def upload_data(bucket_name, gcs_path, local_path):
    bucket = storage.Client().bucket(bucket_name)
    blob = bucket.blob(gcs_path)
    blob.upload_from_filename(local_path)


def get_features_target(local_path):
    strip = np.vectorize(str.strip)
    raw_df = pd.read_csv(local_path, header=None)
    target_index = len(raw_df.columns) - 1  # The last column, 'income-level', is the target
    features_df = raw_df.drop(target_index, axis=1)
    features = features_df.as_matrix()
    target = strip(raw_df[target_index].values)
    return features, target


def create_pipeline():
    # We want to use 3 numerical and 3 categorical features in this sample.
    # Numerical features: age, education-num, and hours-per-week
    # Categorical features: workclass, marital-status, and relationship
    numerical_indices = [0, 4, 12]   # age, education-num, and hours-per-week
    categorical_indices = [1, 5, 7]  # workclass, marital-status, and relationship

    p1 = make_pipeline(mp.PositionalSelector(categorical_indices), mp.StripString(), mp.SimpleOneHotEncoder())
    p2 = make_pipeline(mp.PositionalSelector(numerical_indices), StandardScaler())

    feats = FeatureUnion([
        ('categoricals', p1),
        ('numericals', p2),
    ])

    pipeline = Pipeline([
        ('pre', feats),
        ('estimator', GradientBoostingClassifier(max_depth=4, n_estimators=100))
    ])

    return pipeline


def get_bucket_path(gcs_uri):
    if not gcs_uri.startswith('gs://'):
        raise Exception('{} does not start with gs://'.format(gcs_uri))
    no_gs_uri = gcs_uri[len('gs://'):]
    first_slash_index = no_gs_uri.find('/')
    bucket_name = no_gs_uri[:first_slash_index]
    gcs_path = no_gs_uri[first_slash_index + 1:]
    return bucket_name, gcs_path


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--gcs_data_path', action="store", required=True)
    parser.add_argument('--gcs_model_path', action="store", required=True)
    arguments, others = parser.parse_known_args()

    local_path = '/tmp/adult.data'
    data_bucket, data_path = get_bucket_path(arguments.gcs_data_path)
    print('Downloading the data...')
    download_data(data_bucket, data_path, local_path)
    features, target = get_features_target(local_path)
    pipeline = create_pipeline()

    print('Training the model...')
    pipeline.fit(features, target)

    joblib.dump(pipeline, './model.joblib')

    model_bucket, model_path = get_bucket_path(arguments.gcs_model_path)
    upload_data(model_bucket, model_path, './model.joblib')
    print('Model was successfully uploaded.')
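Optionally, before submitting a cloud job, you can run the training module locally with gcloud ai-platform local train to catch errors early. This step is not required by the tutorial; it assumes your local environment has the Python dependencies installed and credentials with access to your bucket:

gcloud ai-platform local train \
  --package-path ./census_package \
  --module-name census_package.train \
  -- \
  --gcs_data_path gs://cloud-samples-data/ai-platform/census/data/adult.data.csv \
  --gcs_model_path gs://$BUCKET_NAME/custom_pipeline_tutorial/model/model.joblib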
Training the pipeline on AI Platform Training
Use gcloud to submit a training job to AI Platform Training. The following command packages your training application, uploads it to Cloud Storage, and tells AI Platform Training to run your training module.
The -- argument is a separator: the AI Platform Training service doesn't use arguments that follow the separator, but your training module can still access them.
gcloud ai-platform jobs submit training census_training_$(date +"%Y%m%d_%H%M%S") \
--job-dir gs://$BUCKET_NAME/custom_pipeline_tutorial/job \
--package-path ./census_package \
--module-name census_package.train \
--region $REGION \
--runtime-version 1.13 \
--python-version 3.5 \
--scale-tier BASIC \
--stream-logs \
-- \
--gcs_data_path gs://cloud-samples-data/ai-platform/census/data/adult.data.csv \
--gcs_model_path gs://$BUCKET_NAME/custom_pipeline_tutorial/model/model.joblib
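When the job finishes (the --stream-logs flag keeps the command attached until then), you can optionally confirm that the trained pipeline was exported to your bucket:

gcloud storage ls gs://$BUCKET_NAME/custom_pipeline_tutorial/model/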
Deploying the pipeline and serving predictions
To serve predictions from AI Platform Prediction, you must deploy a model resource and a version resource. The model helps you organize multiple deployments if you modify and train your pipeline multiple times. The version uses your trained model and custom code to serve predictions.
To deploy these resources, you need to provide two artifacts:
- A Cloud Storage directory containing your trained pipeline. The training job from the previous step created this file when it exported model.joblib to your bucket.
- A .tar.gz source distribution package in Cloud Storage containing any custom transformers your pipeline uses. Create this in the next step.
Package your custom transformers
If you deploy a version without providing the code from my_pipeline.py, AI Platform Prediction won't be able to import the custom transformers (for example, mp.SimpleOneHotEncoder) and it will be unable to serve predictions.
Create the following setup.py to define a source distribution package for your code:
import setuptools

setuptools.setup(
    name='census_package',
    packages=['census_package'],
    version="1.0",
)
Then run the following command to create dist/census_package-1.0.tar.gz:
python setup.py sdist --formats=gztar
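You can optionally inspect the tarball to confirm that it contains census_package/my_pipeline.py, since that is the module AI Platform Prediction needs to import:

tar -tzf dist/census_package-1.0.tar.gz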
Finally, upload this tarball to your Cloud Storage bucket:
gcloud storage cp ./dist/census_package-1.0.tar.gz gs://$BUCKET_NAME/custom_pipeline_tutorial/code/census_package-1.0.tar.gz
Create model and version resources
First, define model and version names:
MODEL_NAME='CensusPredictor'
VERSION_NAME='v1'
Then use the following command to create the model resource:
gcloud ai-platform models create $MODEL_NAME \
--regions $REGION
Finally, create the version resource by providing Cloud Storage paths to your model directory (the one that contains model.joblib) and your custom code (census_package-1.0.tar.gz):
gcloud components install beta
gcloud beta ai-platform versions create $VERSION_NAME --model $MODEL_NAME \
--origin gs://$BUCKET_NAME/custom_pipeline_tutorial/model/ \
--runtime-version 1.13 \
--python-version 3.5 \
--framework SCIKIT_LEARN \
--package-uris gs://$BUCKET_NAME/custom_pipeline_tutorial/code/census_package-1.0.tar.gz
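Creating the version may take a few minutes. You can optionally check its state with:

gcloud ai-platform versions describe $VERSION_NAME --model $MODEL_NAME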
Serving online predictions
Try out your deployment by sending an online prediction request. First, install the Google API Client Library for Python:
pip install --upgrade google-api-python-client
Then send two instances of Census data to your deployed version. In the following Python code, set PROJECT_ID, MODEL_NAME, and VERSION_NAME to your Google Cloud project ID and the model and version names you used earlier:
import googleapiclient.discovery

instances = [
    [39, 'State-gov', 77516, ' Bachelors . ', 13, 'Never-married', 'Adm-clerical', 'Not-in-family',
     'White', 'Male', 2174, 0, 40, 'United-States', '<=50K'],
    [50, 'Self-emp-not-inc', 83311, 'Bachelors', 13, 'Married-civ-spouse', 'Exec-managerial', 'Husband',
     'White', 'Male', 0, 0, 13, 'United-States', '<=50K']
]

service = googleapiclient.discovery.build('ml', 'v1')
name = 'projects/{}/models/{}/versions/{}'.format(PROJECT_ID, MODEL_NAME, VERSION_NAME)

response = service.projects().predict(
    name=name,
    body={'instances': instances}
).execute()

if 'error' in response:
    raise RuntimeError(response['error'])
else:
    print(response['predictions'])
The version passes the input data through the trained pipeline and returns the classifier's results: either <=50K or >50K for each instance, depending on its prediction for the person's income bracket.
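As an alternative to the Python client, you can also send a request with the gcloud CLI by writing each instance as one JSON list per line in a file. The file name instances.json below is arbitrary, and the instance values are copied from the example above:

cat > instances.json <<EOF
[39, "State-gov", 77516, "Bachelors", 13, "Never-married", "Adm-clerical", "Not-in-family", "White", "Male", 2174, 0, 40, "United-States", "<=50K"]
EOF
gcloud ai-platform predict --model $MODEL_NAME --version $VERSION_NAME --json-instances instances.json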
Cleaning up
To clean up all Google Cloud resources used in this project, you can delete the Google Cloud project you used for the tutorial.
Alternatively, you can clean up individual resources by running the following commands:
# Delete version resource
gcloud ai-platform versions delete $VERSION_NAME --quiet --model $MODEL_NAME
# Delete model resource
gcloud ai-platform models delete $MODEL_NAME --quiet
# Delete Cloud Storage objects that were created
gcloud storage rm gs://$BUCKET_NAME/custom_pipeline_tutorial --recursive
What's next
- Read more about how to use custom scikit-learn pipelines with AI Platform Prediction.
- Learn about creating a custom prediction routine (beta) for even more control over how AI Platform Prediction serves predictions.