This tutorial shows how to explore data and build a scikit-learn machine learning (ML) model on Google Cloud. The use case for this tutorial is a predictive, "propensity to buy" model for financial services.
Propensity models are widely used within the financial industry to analyze a prospective customer's inclination to make a purchase. Companies often use on-premises solutions that are inflexible and difficult to scale. This tutorial describes a flexible, serverless ML model on Google Cloud that can be deployed as a part of a business workflow.
The best practices described in this tutorial can be applied to a broad range of ML use cases, not just to financial services.
This article assumes that you're familiar with the technologies used in this tutorial, such as BigQuery, AI Platform, Pandas, and scikit-learn.
Objectives
- Learn best practices for exploring data and building a serverless ML scikit-learn model on Google Cloud by using BigQuery, AI Platform Notebooks, and AI Platform.
- Get a better understanding of how to use open source components like Pandas Profiling and Lime to get more insight into your data and model.
- Explore how to select the best model by using model comparison.
- Learn how to use hyperparameter tuning for scikit-learn.
- Deploy a scikit-learn model as a managed API on Google Cloud.
Architecture
The ML model for this tutorial uses the following Google Cloud components:
- BigQuery stores the training data.
- AI Platform Notebooks provides the notebook experience for execution of the profiling and training.
- AI Platform provides scalable training and prediction.
The following diagram illustrates the architecture for this model.
This architecture highlights several functional steps:
- Storing. You store the training data in BigQuery.
- Profiling. You query the data using BigQuery and load the data as a Pandas dataframe into an AI Platform Notebook. You then use Pandas for basic preprocessing.
- Modeling. You test multiple models using scikit-learn and select the one that performs the best. You then use Lime to explain the chosen predictor.
- Training. You package the model for training and prediction by using AI Platform.
This tutorial uses Pandas and scikit-learn because they provide an easier starting point than other approaches. This approach offers several advantages:
- Scalability. Google Cloud helps you with scaling training and predictions.
- Transparency. Lime and Pandas Profiling provide insights into the data and model.
- Flexibility. The open source models can be ported and re-used.
- Simplicity. Pandas and scikit-learn can produce quick results.
The training data for this tutorial is designed to fit in memory. If you want to build an ML model that can handle a larger dataset or want to do distributed training by using accelerators, see this article on distributed training with TensorFlow and AI Platform and the Dataflow documentation for preprocessing data.
Costs
This tutorial uses billable components of Google Cloud, including BigQuery, AI Platform, Compute Engine, and Cloud Storage.
To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.
When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Cleaning up.
Before you begin
- Select or create a Cloud project.
- Enable billing for your Cloud project.
- Enable the BigQuery, AI Platform, and Compute Engine APIs for that project.
Starting AI Platform Notebooks
In the following steps, you create an AI Platform Notebooks instance.
In the Cloud Console, go to the AI Platform Notebook instances page.
On the menu bar, click New Instance, and then select the TensorFlow 1.x framework.
Select Without GPUs.
In the New notebook instance dialog, review the following options:
- You can keep the default instance name or edit it.
- You can click Customize to select the region, zone, machine learning framework, machine type, GPU type, number of GPUs, boot disk type, and size. If you add a GPU, select the Install NVIDIA GPU driver automatically for me checkbox.
Click Create.
It takes a few minutes for AI Platform Notebooks to create the new instance.
Cloning the notebook from GitHub and setting up your notebook
Now that you have an AI Platform Notebooks instance, you can download the notebook file for this tutorial. This notebook contains all the pre-populated cells to analyze the dataset and then build an ML model.
In the Cloud Console, go to the AI Platform Notebook instances page.
For the instance you just created, click Open JupyterLab.
On the Launcher page, click Terminal.
In the terminal window, paste the following command, and then click Run:
git clone https://github.com/GoogleCloudPlatform/professional-services.git
The output is similar to the following:
Cloning into 'professional-services'...
remote: Enumerating objects: 24, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 5085 (delta 8), reused 13 (delta 4), pack-reused 5061
Receiving objects: 100% (5085/5085), 50.50 MiB | 19.47 MiB/s, done.
Resolving deltas: 100% (2672/2672), done.
After cloning is finished, a new folder called `professional-services` is displayed in an adjacent pane.

In the notebook file browser, select professional-services > examples > cloudml-bank-marketing, and then double-click `bank_marketing_classification_model.ipynb`.

In the notebook, run the code in cell 1 to ensure that the Lime package is installed. If you receive an error indicating that the Python package is missing, remove the first line of the code in cell 1, which has the command `install pandas-profiling`.

After the installation, click the Kernel tab, and then click Restart Kernel.
Unless you're running in Colab, skip cell 2.
In the notebook, run cell 3, which appears with a `[2]`.

Run the cell with the following command, replacing `PROJECT_ID` with the value of your Cloud project ID:

%env GOOGLE_CLOUD_PROJECT=PROJECT_ID
Skip to cell 7 and update the code:
import os
your_dataset = 'your_dataset'
your_table = 'your_table'
project_id = os.environ["GOOGLE_CLOUD_PROJECT"]
Replace `your_dataset` and `your_table` with any name that you choose. For details on naming the dataset, see BigQuery naming conventions.
After updating the code in cell 7, set the environment variables and create the BigQuery dataset and table:
!bq mk -d {project_id}:{your_dataset}
!bq mk -t {your_dataset}.{your_table}
The output is similar to the following:
Dataset 'your_project.your_dataset' successfully created
Table 'your_project.your_dataset.your_table' successfully created
Run the next cell to download the CSV file and save it locally as `data.csv`. This tutorial uses the UCI Bank Marketing Dataset.

!curl https://storage.googleapis.com/erwinh-public-data/bankingdata/bank-full.csv --output data.csv
Upload the `data.csv` file into your BigQuery table by running the cell:

!bq load --autodetect --source_format=CSV --field_delimiter ';' --skip_leading_rows=1 --replace {your_dataset}.{your_table} data.csv
The output is similar to the following:
Upload complete. Waiting on job … (2s) Current status: DONE
Getting data from BigQuery and creating a Pandas dataframe
To create a Pandas dataframe, you use the Google Cloud Client Libraries to fetch data from BigQuery, and then you load it into a Pandas dataframe. This built-in functionality of the library simplifies getting the data from BigQuery into a Pandas dataframe without writing additional code.
Run the cell containing this code:
client = bq.Client(project=project_id)
df = client.query('''
  SELECT * FROM `%s.%s`
''' % (your_dataset, your_table)).to_dataframe()
The preceding SQL statement returns data that is then used to construct a Pandas dataframe. The advantage of using an SQL statement is that you can change your SQL statement to take different data samples from BigQuery.
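For example, the following sketch (not part of the notebook) shows how the same query could be adjusted to pull an approximate 10% random sample of the table instead of every row. It reuses the `client`, `your_dataset`, and `your_table` objects defined above.

```python
# Illustrative variation only: sample roughly 10% of rows at query time so that
# exploration stays fast even if the table grows.
sample_df = client.query('''
    SELECT *
    FROM `%s.%s`
    WHERE RAND() < 0.1
''' % (your_dataset, your_table)).to_dataframe()

print(len(sample_df))
```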
Exploring the data by using Pandas Profiling
Run the cell containing this code:
import pandas_profiling as pp
pp.ProfileReport(df)
After the Pandas dataframe is loaded, you can explore the data by using Pandas Profiling. Pandas Profiling generates a report for exploratory analysis, with statistics that go beyond what the `df.describe()` function typically produces. For each column, you receive relevant summary statistics that are presented in an interactive HTML format.

In this case, the `y` column is the label and the other columns are available to use as feature columns. The analysis of these columns helps you select which features to use and whether any preprocessing is required before model training.
The following diagram illustrates a sample report from Pandas Profiling.
The sample report from Pandas Profiling includes a warning that the data in the `previous` column is skewed. The profiling report also indicates that for the `y` column, there are about 5,000 `True` examples and about 40,000 `Missing` or `False` examples.
Because the dataset is highly skewed, you need to ensure that when you split your data into training and test sets, the `True` examples aren't concentrated in only one of the two sets.
Handling skewed datasets
You can address the challenge of skewed data in two ways:
- By shuffling the dataset to avoid any form of pre-ordering.
- By using stratified sampling to ensure that your training and test datasets maintain a similar distribution of `y`. Approximately 12% of examples in each dataset should be `True` and the rest `False`.
The following code both shuffles and employs stratified sampling.
Run the cell containing this code:
from sklearn.model_selection import StratifiedShuffleSplit

# Here we apply a shuffle and stratified split to create a train and test set.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=40)
for train_index, test_index in split.split(df, df["y"]):
    strat_train_set = df.loc[train_index]
    strat_test_set = df.loc[test_index]
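To confirm that the stratified split behaved as expected, you can compare the label distribution across the full dataset and both splits. This quick check is not in the notebook, but it is a straightforward way to verify the roughly 12% positive rate mentioned above.

```python
# Sanity check (illustrative): the share of positive labels should be nearly
# identical in the full dataset, the training set, and the test set.
for name, subset in [('full', df),
                     ('train', strat_train_set),
                     ('test', strat_test_set)]:
    print(name, subset['y'].value_counts(normalize=True).round(3).to_dict())
```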
Preprocessing the data
Before you can create an ML model, you must format the data into a form that the model can process, a step that is called preprocessing.
To preprocess your data, follow these steps:
- For numeric columns, ensure that all values are between `0` and `1`. This step normalizes the columns so that if one column has a very large value, that value doesn't bias the results.
- Turn categorical values into numeric values by replacing each unique value in a column with an integer. For example, if the column `Color` has three unique strings (`red`, `yellow`, and `blue`), you replace the values with `0`, `1`, and `2`, respectively.
- Convert `True`/`False` values to `1`/`0` integers, respectively.
The following code implements all three approaches.
Run the cell containing this code:
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

def data_pipeline(df):
    """Normalizes and converts data and returns dataframe"""
    num_cols = df.select_dtypes(include=np.number).columns
    cat_cols = list(set(df.columns) - set(num_cols))

    # Normalize numeric data
    df[num_cols] = StandardScaler().fit_transform(df[num_cols])

    # Convert categorical variables to integers
    df[cat_cols] = df[cat_cols].apply(LabelEncoder().fit_transform)
    return df
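The feature-selection cells later in the notebook refer to `train_features_prepared` and `train_label`. As a hedged sketch of how those variables could be derived from the stratified split by using `data_pipeline` (the exact wiring in the notebook may differ, and the label handling here assumes the label is stored as yes/no or True/False values):

```python
# Illustrative only: split features from the label, then run the pipeline.
def to_binary_label(series):
    # Map yes/True to 1 and everything else to 0.
    return series.apply(lambda v: 1 if v in (True, 'yes') else 0)

train_label = to_binary_label(strat_train_set['y'])
train_features_prepared = data_pipeline(strat_train_set.drop(columns=['y']).copy())

test_label = to_binary_label(strat_test_set['y'])
test_features_prepared = data_pipeline(strat_test_set.drop(columns=['y']).copy())
```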
Another important preprocessing step is feature selection. Datasets can often contain features that might not be useful in predicting a label. By removing these features, you not only get a more accurate model but also save a lot of computation time.
You can use either of the following options for feature selection:
- The `SelectKBest` method. This method selects the top K features by using a scoring function (`f_classif` in this tutorial).
- A tree classifier. This option also determines the top K features.
For this tutorial, either option selects the top five features.
To use the `SelectKBest` method, run the cell containing this code:

from sklearn.feature_selection import SelectKBest, f_classif

predictors = train_features_prepared.columns

# Perform feature selection where `k` (5 in this case) indicates
# the number of features we wish to select.
selector = SelectKBest(f_classif, k=5)
selector.fit(train_features_prepared[predictors], train_label)
To use a tree classifier, run the cell containing this code:

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

predictors_tree = train_features_prepared.columns
selector_clf = ExtraTreesClassifier(n_estimators=50, random_state=0)
selector_clf.fit(train_features_prepared[predictors_tree], train_label)
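Either selector lets you read off which features made the cut. The following sketch is not part of the notebook, but it shows one way to list the top five features from each approach, assuming the `selector`, `selector_clf`, and `predictors` objects created in the cells above:

```python
import numpy as np

# Features chosen by SelectKBest (boolean mask over the columns).
kbest_features = np.array(predictors)[selector.get_support()]
print('SelectKBest:', list(kbest_features))

# Rank features by tree-based importance and keep the top five.
importances = selector_clf.feature_importances_
tree_features = np.array(predictors)[np.argsort(importances)[::-1][:5]]
print('ExtraTrees :', list(tree_features))
```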
Comparing and evaluating different machine learning models
It can be difficult to know which model is best for your use case and which hyperparameters to use.

To help you select the best model and hyperparameters, the notebook defines the `create_classifiers` function.
To create the `create_classifiers` function, run the cell containing this code:

def create_classifiers():
    """Create classifiers and specify hyper parameters"""
    log_params = [{'penalty': ['l1', 'l2'], 'C': np.logspace(0, 4, 10)}]
    knn_params = [{'n_neighbors': [3, 4, 5]}]
    svc_params = [{'kernel': ['linear', 'rbf'], 'probability': [True]}]
    tree_params = [{'criterion': ['gini', 'entropy']}]
    forest_params = {'n_estimators': [1, 5, 10]}
    mlp_params = {'activation': ['identity', 'logistic', 'tanh', 'relu']}
    ada_params = {'n_estimators': [1, 5, 10]}

    classifiers = [
        ['LogisticRegression', LogisticRegression(random_state=42), log_params],
        ['KNeighborsClassifier', KNeighborsClassifier(), knn_params],
        ['SVC', SVC(random_state=42), svc_params],
        ['DecisionTreeClassifier', DecisionTreeClassifier(random_state=42), tree_params],
        ['RandomForestClassifier', RandomForestClassifier(random_state=42), forest_params],
        ['MLPClassifier', MLPClassifier(random_state=42), mlp_params],
        ['AdaBoostClassifier', AdaBoostClassifier(random_state=42), ada_params],
    ]
    return classifiers
The `create_classifiers` function returns seven classifiers, each paired with a grid of candidate hyperparameters. The next few cells in this section of the notebook use this function to select the optimal configuration (hyperparameters) for each classifier.
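The notebook's own comparison cells do this work. As a hedged sketch of the general pattern, you could loop over the classifiers and tune each one with scikit-learn's `GridSearchCV`; the variable names `train_features_prepared`, `train_label`, and `predictors` come from the earlier preprocessing and feature-selection steps.

```python
from sklearn.model_selection import GridSearchCV

best_estimators = {}
for name, estimator, params in create_classifiers():
    # Cross-validated search over the small hyperparameter grid for this classifier.
    search = GridSearchCV(estimator, params, cv=3, scoring='roc_auc', n_jobs=-1)
    search.fit(train_features_prepared[predictors], train_label)
    best_estimators[name] = search.best_estimator_
    print(name, search.best_params_, round(search.best_score_, 3))
```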
This tutorial shows you two methods to select the best classifier:

- The first method is to build a table with a number of metrics for each classifier. This approach provides different metrics to consider, depending on your use case. For instance, the `Accuracy` metric might not be the best metric for your case.
- The second method is to generate a receiver operating characteristic (ROC) graph to further analyze each classifier, as shown in the following diagram.
Based on both sets of results, you might choose logistic regression as your model because it has the highest area under the curve (AUC). However, the "best" model can depend on your use case.
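As an illustration of how such a comparison can be produced (a sketch rather than the notebook's exact code, assuming the `best_estimators` dictionary and test split from the earlier sketches), you could score each tuned classifier on the held-out test set and plot its ROC curve:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

plt.figure()
for name, model in best_estimators.items():
    probs = model.predict_proba(test_features_prepared[predictors])[:, 1]
    preds = model.predict(test_features_prepared[predictors])
    auc = roc_auc_score(test_label, probs)
    print(name, 'accuracy=%.3f auc=%.3f' % (accuracy_score(test_label, preds), auc))

    fpr, tpr, _ = roc_curve(test_label, probs)
    plt.plot(fpr, tpr, label='%s (AUC=%.3f)' % (name, auc))

plt.plot([0, 1], [0, 1], linestyle='--')  # Chance line for reference.
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()
```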
Explaining the model
After you select your model, you need to better understand the performance of its predictions. For this, the tutorial uses the Python package Lime. With Lime, you can create an explanation instance and show the result.
Run the following notebook cell's code:
i = 106
exp = explainer.explain_instance(train[i], predict_fn)
pprint.pprint(exp.as_list())
fig = exp.as_pyplot_figure()
In Lime, you see the value (x-axis) for each of the top features (y-axis), as shown in the following diagram.
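The `explainer` and `predict_fn` objects used in that cell come from an earlier notebook cell. As a minimal sketch of how a tabular Lime explainer is typically set up for a scikit-learn classifier (the `chosen_model` name here is a placeholder for whichever tuned model you selected, not a variable from the notebook):

```python
from lime.lime_tabular import LimeTabularExplainer

# Lime works on a NumPy array of training rows plus a probability function.
train = train_features_prepared[predictors].values
predict_fn = lambda x: chosen_model.predict_proba(x)

explainer = LimeTabularExplainer(
    train,
    feature_names=list(predictors),
    class_names=['False', 'True'],
    discretize_continuous=True)
```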
Training and predicting with AI Platform
Up to this point in the tutorial, you have trained your models and made predictions locally. However, if you want to reduce training time or predict at scale, you can use AI Platform. AI Platform offers a managed service that can scale your training across many nodes and provides an API endpoint for deployed models.

In the next steps, you train the model by using AI Platform and then deploy the trained model so that you can use it for future predictions.
Using AI Platform to train your model and request predictions requires two steps:
- Submit a training job to AI Platform.
- Create a model in AI Platform.
Submit a training job
Set the following environment variables, replacing `gcs_bucket` with the name of your Cloud Storage bucket:

%env GCS_BUCKET=gcs_bucket
%env REGION=us-central1
%env LOCAL_DIRECTORY=./trainer/data
%env TRAINER_PACKAGE_PATH=./trainer
Submit the training job by running the cells that contain the following commands:
%%bash
JOBNAME=banking_$(date -u +%y%m%d_%H%M%S)
echo $JOBNAME
gcloud ai-platform jobs submit training model_training_$JOBNAME \
    --job-dir $GCS_BUCKET/$JOBNAME/output \
    --package-path trainer \
    --module-name trainer.task \
    --region $REGION \
    --runtime-version=1.9 \
    --python-version=3.5 \
    --scale-tier BASIC
These commands store your training dataset in a Cloud Storage bucket and create a directory to store your Python files.
To view the status of the job, go to AI Platform in the sidebar, and then select Jobs. It takes about eight minutes to finish training.
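The code that the job runs lives in the repository's `trainer` directory. In broad strokes, an AI Platform scikit-learn entry point trains the model, exports it as `model.joblib`, and copies the file to the job's Cloud Storage directory, roughly like the following sketch; the function name and arguments here are illustrative, not the repository's actual interface.

```python
import joblib
from google.cloud import storage
from sklearn.linear_model import LogisticRegression


def train_and_export(job_dir, features, labels):
    # Fit the chosen estimator on the training data.
    model = LogisticRegression(max_iter=1000)
    model.fit(features, labels)

    # AI Platform's scikit-learn runtime looks for a file named model.joblib.
    joblib.dump(model, 'model.joblib')

    # Copy the artifact into the job's Cloud Storage output directory.
    bucket_name, _, prefix = job_dir.replace('gs://', '', 1).partition('/')
    bucket = storage.Client().bucket(bucket_name)
    bucket.blob(prefix.rstrip('/') + '/model.joblib').upload_from_filename('model.joblib')
```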
Create a model in AI Platform
After you've finished training the model, you can request predictions from it.
Set configuration and environment variables
In the cell with the following code, replace the environment variable `gcs_bucket` to specify the Cloud Storage bucket used in the preceding training procedure:

gcs_bucket

This Cloud Storage bucket is used to stage intermediate output and to store the results.
Find the model directory and copy the name.

The model directory was stored in your Cloud Storage bucket during the earlier training step. The directory takes the form `model_YYYYMMDD_HHMMSS`.

In the cell with the following code, replace `model_directory` with the value of your model directory that you copied earlier:

model_directory
Create the model
In the notebook, run the cell with the following code:
! gcloud ai-platform models create $MODEL_NAME --regions=us-central1

! gcloud ai-platform versions create $VERSION_NAME \
    --model $MODEL_NAME --origin $MODEL_DIR \
    --runtime-version 1.9 --framework $FRAMEWORK \
    --python-version 3.5
These commands create a model called `cmle_model` and a version called `v1`. A version is a specific configuration for the given model that you're building. A model can have many different versions.
Use the model
After you create a model and its version, you can request predictions from the AI Platform prediction API. You can use the command line or the Python client library to make a prediction, as shown in the final few cells of the notebook.
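For example, a minimal sketch of an online prediction request through the Python client library might look like the following; it assumes the `project_id` variable from earlier in the notebook and reuses the preprocessed test split from the earlier sketches as the request payload.

```python
from googleapiclient import discovery

# Build a client for the AI Platform prediction API.
service = discovery.build('ml', 'v1')
name = 'projects/{}/models/{}/versions/{}'.format(project_id, 'cmle_model', 'v1')

# Each instance is one preprocessed feature row, in the same column order used for training.
instances = test_features_prepared[predictors].values[:2].tolist()

response = service.projects().predict(name=name, body={'instances': instances}).execute()
if 'error' in response:
    raise RuntimeError(response['error'])
print(response['predictions'])
```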
To use the deployed model for production as a part of a business process, you can take additional steps to develop a production-ready implementation. These steps, which are outside the scope of this tutorial, include the following:
- Evaluating and de-identifying any data (for more information, see Considerations for sensitive data within Machine Learning datasets)
- Building a repeatable data transformation and processing pipeline for training and predictions, such that training and predictions go through the same data transformations
- Building a repeatable deployment process to handle model retraining
- Setting appropriate Identity and Access Management (IAM) permissions on data access, data pipelines, and any deployed AI Platform resources
Cleaning up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, you can delete the Google Cloud project that you created for this tutorial.
- In the Cloud Console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- Review the sample code in the Professional Services repo on GitHub.
- Learn about other predictive forecasting solutions.
- Try out other Google Cloud features for yourself. Have a look at our tutorials.