Building a propensity model for financial services on Google Cloud

This tutorial shows how to explore data and build a scikit-learn machine learning (ML) model on Google Cloud. The use case for this tutorial is a predictive, "propensity to buy" model for financial services.

Propensity models are widely used within the financial industry to analyze a prospective customer's inclination to make a purchase. Companies often use on-premises solutions that are inflexible and difficult to scale. This tutorial describes a flexible, serverless ML model on Google Cloud that can be deployed as a part of a business workflow.

The best practices described in this tutorial can be applied to a broad range of ML use cases, not just to financial services.

This tutorial assumes that you're familiar with BigQuery, AI Platform, Notebooks, Pandas, and scikit-learn.

Objectives

  • Learn best practices for exploring data and building a serverless ML scikit-learn model on Google Cloud by using BigQuery, Notebooks, and AI Platform.
  • Get a better understanding of how to use open source components like Pandas Profiling and Lime to get more insight into your data and model.
  • Explore how to select the best model by using model comparison.
  • Learn how to use hyperparameter tuning for scikit-learn.
  • Deploy a scikit-learn model as a managed API on Google Cloud.

Architecture

The ML model in this tutorial uses the following Google Cloud components:

  • BigQuery stores the training data.
  • Notebooks provides the managed notebook environment for running the profiling and training code.
  • AI Platform provides scalable training and prediction.

The following diagram illustrates the architecture for this model.

Diagram that shows the architecture of the machine learning model used in this tutorial.

This architecture highlights several functional steps:

  • Storing. You store the training data in BigQuery.
  • Profiling. You query the data using BigQuery and load the data as a Pandas dataframe into an AI Platform Notebook. You then use Pandas for basic preprocessing.
  • Modeling. You test multiple models using scikit-learn and select the one that performs the best. You then use Lime to explain the chosen predictor.
  • Training. You package the model for training and prediction by using AI Platform.

This tutorial uses Pandas and scikit-learn because they provide an easy starting point. This approach offers several advantages:

  • Scalability. Google Cloud helps you with scaling training and predictions.
  • Transparency. Lime and Pandas Profiling provide insights into the data and model.
  • Flexibility. The open source models can be ported and re-used.
  • Simplicity. Pandas and scikit-learn can produce quick results.

The training data for this tutorial is designed to fit in memory. If you want to build an ML model that can handle a larger dataset or want to do distributed training by using accelerators, see this article on distributed training with TensorFlow and AI Platform and the Dataflow documentation for preprocessing data.

Costs

This tutorial uses the following billable components of Google Cloud:

  • BigQuery
  • AI Platform
  • Compute Engine (which runs the Notebooks instance)
  • Cloud Storage

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Cleaning up.

Before you begin

  1. Select or create a Cloud project.

    Go to the project selector page

  2. Enable billing for your Cloud project.

    Learn how to enable billing

  3. Enable the BigQuery, AI Platform, and Compute Engine APIs for that project.

    Enable the APIs

Starting Notebooks

In the following steps, you create a Notebooks instance.

  1. In the Cloud Console, go to the AI Platform Notebook instances page.

    Go to the Notebook instances page

  2. On the menu bar, click New Instance, and then select the TensorFlow 1.x framework.

  3. Select Without GPUs.

  4. In the New notebook instance dialog, review the following options:

    • You can keep the default instance name or edit it.
    • You can click Customize to select the region, zone, machine learning framework, machine type, GPU type, number of GPUs, boot disk type, and size. If you add a GPU, select the Install NVIDIA GPU driver automatically for me checkbox.
  5. Click Create.

    It takes a few minutes for Notebooks to create the new instance.

Cloning the notebook from GitHub and setting up your notebook

Now that you have a Notebooks instance, you can download the notebook file for this tutorial. The notebook contains pre-populated cells that analyze the dataset and build an ML model.

  1. In the Cloud Console, go to the AI Platform Notebook instances page.

    Go to the Notebook instances page

  2. For the instance you just created, click Open JupyterLab.

  3. On the Launcher page, click Terminal.

    Screenshot of Terminal button in the notebook instance.

  4. In the terminal window, paste the following command, and then click Run:

    git clone https://github.com/GoogleCloudPlatform/professional-services.git
    

    The output is similar to the following:

    Cloning into 'professional-services'...
    remote: Enumerating objects: 24, done.
    remote: Counting objects: 100% (24/24), done.
    remote: Compressing objects: 100% (20/20), done.
    remote: Total 5085 (delta 8), reused 13 (delta 4), pack-reused 5061
    Receiving objects: 100% (5085/5085), 50.50 MiB | 19.47 MiB/s, done.
    Resolving deltas: 100% (2672/2672), done.
    

    After cloning is finished, a new folder called professional-services is displayed in an adjacent pane.

  5. In the notebook file browser, go to professional-services > examples > cloudml-bank-marketing, and then double-click bank_marketing_classification_model.ipynb.

  6. In the notebook, run the code in cell 1 to ensure that the Lime package is installed.

    If you receive an error indicating that the Python package is missing, remove the first line of code in cell 1, which contains the command that installs pandas-profiling.

  7. After the installation, click the Kernel tab, and then click Restart Kernel.

  8. Unless you're running in Colab, skip cell 2.

  9. In the notebook, run cell 3, which appears with a [2].

  10. Run the cell with the following command, replacing PROJECT_ID with the value of your Cloud project ID:

    %env GOOGLE_CLOUD_PROJECT=PROJECT_ID
    
  11. Skip to cell 7 and update the code:

    import os
    your_dataset = 'your_dataset'
    your_table = 'your_table'
    project_id = os.environ["GOOGLE_CLOUD_PROJECT"]
    

    Replace your_dataset and your_table with any name that you choose.

    For details on naming the dataset, see BigQuery naming conventions.

  12. After you update and run the code in cell 7, create the BigQuery dataset and table by running the cell with the following commands:

    !bq mk -d {project_id}:{your_dataset}
    !bq mk -t {your_dataset}.{your_table}
    

    The output is similar to the following:

    Dataset 'your_project.your_dataset' successfully created
    Table 'your_project.your_dataset.your_table' successfully created
    
  13. Run the next cell to download the CSV file and save it locally as data.csv. This tutorial uses the UCI Bank Marketing Dataset.

    !curl https://storage.googleapis.com/erwinh-public-data/bankingdata/bank-full.csv --output data.csv
    
  14. Upload the data.csv file into your BigQuery table by running the cell:

    !bq load --autodetect --source_format=CSV --field_delimiter ';' --skip_leading_rows=1 --replace {your_dataset}.{your_table} data.csv
    

    The output is similar to the following:

    Upload complete.
    Waiting on job … (2s) Current status: DONE
    

Getting data from BigQuery and creating a Pandas dataframe

To create a Pandas dataframe, you use the BigQuery client library to fetch data from BigQuery and load it into a Pandas dataframe. The library's built-in to_dataframe() method gets the data from BigQuery into Pandas without requiring additional conversion code.

  • Run the cell containing this code:

    from google.cloud import bigquery as bq  # BigQuery client library (may already be imported in an earlier cell)
    client = bq.Client(project=project_id)
    df = client.query('''
      SELECT *
      FROM `%s.%s`
    ''' % (your_dataset, your_table)).to_dataframe()
    

The preceding SQL statement returns data that is then used to construct a Pandas dataframe. The advantage of using a SQL statement is that you can modify it to pull different samples of the data from BigQuery.
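
For example, the following sketch (which reuses the client, your_dataset, and your_table variables from the preceding cell) pulls an approximately 10% random sample instead of the full table; the 0.1 threshold is only an illustration:

    # Sketch: sample roughly 10% of the rows instead of reading the full table.
    sample_df = client.query('''
      SELECT *
      FROM `%s.%s`
      WHERE RAND() < 0.1
    ''' % (your_dataset, your_table)).to_dataframe()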

Exploring the data by using Pandas Profiling

  • Run the cell containing this code:

    import pandas_profiling as pp
    pp.ProfileReport(df)
    

After the Pandas dataframe is loaded, you can explore the data by using Pandas Profiling. Pandas Profiling generates a report for exploratory analysis, with statistics that go beyond what the df.describe() function typically produces. For each column, you receive relevant summary statistics presented in an interactive HTML format.
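
If you want to keep the report outside the notebook, you can also write it to an HTML file. The following is a minimal sketch that reuses the pp alias from the preceding cell; the exact to_file signature can vary between pandas-profiling versions:

    # Sketch: save the profiling report as a standalone HTML file.
    profile = pp.ProfileReport(df)
    profile.to_file("bank_marketing_profile.html")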

In this case, the y column is the label and the other columns are available to use as feature columns. The analysis of these columns helps you select which features to use and whether any preprocessing is required before model training.

The following diagram illustrates a sample report from Pandas Profiling.

Screenshot of Pandas Profiling report overview.

The sample report from Pandas Profiling includes a warning that the data in the previous column (one of the dataset's feature columns) is skewed. The profiling report also indicates that for the y column, there are about 5,000 True examples and about 40,000 Missing or "False" examples.

Screenshot of Pandas Profiling report data.

Because the dataset is highly skewed, you need to ensure that when you split your data into training and test sets, all the True examples don't end up either in the test set or in the training set.

Handling skewed datasets

You can address the challenge of skewed data in two ways:

  • By shuffling the dataset to avoid any form of pre-ordering.
  • By using stratified sampling to ensure that your test and training datasets maintain a similar distribution of y for both datasets. Approximately 12% of examples in each dataset should be True and the rest False.

The following code both shuffles the data and applies stratified sampling; a quick check of the resulting label distributions follows the code.

  • Run the cell containing this code:

    from sklearn.model_selection import StratifiedShuffleSplit
    
    #Here we apply a shuffle and stratified split to create a train and test set.
    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=40)
    for train_index, test_index in split.split(df, df["y"]):
        strat_train_set = df.loc[train_index]
        strat_test_set = df.loc[test_index]
    
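As a sanity check, you can compare the label distribution in the full dataset and in both splits; each should contain roughly the same share of True examples. This is a sketch rather than a cell from the notebook:

    # Sketch: verify that the stratified split preserves the label distribution.
    for name, subset in [('full', df), ('train', strat_train_set), ('test', strat_test_set)]:
        print(name, subset['y'].value_counts(normalize=True).to_dict())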

Preprocessing the data

Before you can create an ML model, you must format the data into a form that the model can process, a step that is called preprocessing.

To preprocess your data, follow these steps:

  1. For numeric columns, rescale the values so that all columns are on a comparable scale. This step normalizes the columns so that a column with very large values doesn't bias the results. (The following code uses StandardScaler, which standardizes each numeric column to zero mean and unit variance.)
  2. Turn categorical values into numeric values by replacing each unique value in a column with an integer. For example, if the column Color has three unique strings (red, yellow, and blue), you replace the values with 0, 1, and 2, respectively.
  3. Convert True/False values to 1/0 integers, respectively.

The following code implements all three approaches; a sketch of applying the resulting function to the training split follows the code.

  • Run the cell containing this code:

    # These imports may already be present in an earlier notebook cell.
    import numpy as np
    from sklearn.preprocessing import LabelEncoder, StandardScaler

    def data_pipeline(df):
        """Normalizes numeric columns, encodes categorical columns, and returns the dataframe."""
        num_cols = df.select_dtypes(include=np.number).columns
        cat_cols = list(set(df.columns) - set(num_cols))
        # Standardize numeric data so that large-valued columns don't dominate
        df[num_cols] = StandardScaler().fit_transform(df[num_cols])
        # Convert categorical variables to integers
        df[cat_cols] = df[cat_cols].apply(LabelEncoder().fit_transform)
        return df
    
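Later cells in the notebook work with prepared feature and label variables (for example, train_features_prepared and train_label). The notebook may build them slightly differently, but a minimal sketch of applying data_pipeline to the stratified training split looks like this:

    # Sketch: preprocess the training split and separate features from the label.
    train_prepared = data_pipeline(strat_train_set.copy())
    train_label = train_prepared['y']
    train_features_prepared = train_prepared.drop(columns=['y'])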

Another important preprocessing step is feature selection. Datasets often contain features that aren't useful for predicting the label. Removing these features can make the model more accurate and can save substantial computation time.

You can use either of the following options for feature selection:

  • The SelectKBest method. This method selects the top K features by using a scoring function (f_classif in this tutorial).
  • A tree classifier. This option also determines the top K features.

For this tutorial, either option selects the top five features; a sketch for inspecting which features were chosen follows the two examples.

  • To use the SelectKBest method, run the cell containing this code:

    from sklearn.feature_selection import SelectKBest, f_classif
    
    predictors = train_features_prepared.columns
    
    # Perform feature selection where `k` (5 in this case) indicates the number of features we wish to select
    selector = SelectKBest(f_classif, k=5)
    selector.fit(train_features_prepared[predictors], train_label)
    
  • To use a Tree Classifier, run the cell containing this code:

    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.feature_selection import SelectFromModel
    
    predictors_tree = train_features_prepared.columns
    
    selector_clf = ExtraTreesClassifier(n_estimators=50, random_state=0)
    selector_clf.fit(train_features_prepared[predictors_tree], train_label)
    
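With either selector, you can inspect which features were chosen. The following sketch assumes the selector and selector_clf objects fitted above; it prints the SelectKBest choices and the highest tree-based feature importances:

    # Sketch: inspect the features chosen by each selector.
    kbest_features = train_features_prepared[predictors].columns[selector.get_support()]
    print('SelectKBest picks:', list(kbest_features))

    importances = sorted(zip(selector_clf.feature_importances_, predictors_tree), reverse=True)
    print('Top tree-based features:', importances[:5])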

Comparing and evaluating different machine learning models

It can be difficult to know which model is best for your use case and which hyperparameters to use.

To help select the best model and hyperparameters, the create_classifiers function is defined below.

  • To create the create_classifiers function, run the cell containing this code:

    def create_classifiers():
        """Create classifiers and specify hyper parameters"""
    
        log_params = [{'penalty': ['l1', 'l2'], 'C': np.logspace(0, 4, 10)}]
    
        knn_params = [{'n_neighbors': [3, 4, 5]}]
    
        svc_params = [{'kernel': ['linear', 'rbf'], 'probability': [True]}]
    
        tree_params = [{'criterion': ['gini', 'entropy']}]
    
        forest_params = {'n_estimators': [1, 5, 10]}
    
        mlp_params = {'activation': [
                        'identity', 'logistic', 'tanh', 'relu'
                      ]}
    
        ada_params = {'n_estimators': [1, 5, 10]}
    
        classifiers = [
            ['LogisticRegression', LogisticRegression(random_state=42),
             log_params],
            ['KNeighborsClassifier', KNeighborsClassifier(), knn_params],
            ['SVC', SVC(random_state=42), svc_params],
            ['DecisionTreeClassifier',
             DecisionTreeClassifier(random_state=42), tree_params],
            ['RandomForestClassifier',
             RandomForestClassifier(random_state=42), forest_params],
            ['MLPClassifier', MLPClassifier(random_state=42), mlp_params],
            ['AdaBoostClassifier', AdaBoostClassifier(random_state=42),
             ada_params],
            ]
    
        return classifiers
    

The create_classifiers function returns seven classifiers, each paired with a set of hyperparameters to try. The next few cells in this section of the notebook use this function to select the optimal configuration (hyperparameters) for each classifier.
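
The notebook's cells do this search for you, but the general pattern looks like the following sketch, which loops over the classifier list with GridSearchCV; the cv and scoring settings and the prepared feature variables are illustrative assumptions:

    # Sketch: search each classifier's hyperparameter grid and keep the best result.
    from sklearn.model_selection import GridSearchCV

    best_models = {}
    for name, estimator, params in create_classifiers():
        search = GridSearchCV(estimator, params, cv=3, scoring='roc_auc')
        search.fit(train_features_prepared[predictors], train_label)
        best_models[name] = (search.best_score_, search.best_params_, search.best_estimator_)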

This tutorial shows you two methods to select the best classifier:

  • The first method is to build a table with a number of metrics for each classifier. Comparing several metrics is useful because the most relevant metric depends on your use case; for instance, with a skewed dataset like this one, the Accuracy metric alone might be misleading.

    Screenshot of classifiers table.

  • The second method is to generate a receiver operating characteristic (ROC) graph to further analyze each classifier, as shown in the following diagram.

    Screenshot of classifiers graph.

Looking at both the table and the ROC graph, you might choose logistic regression as your model because it has the highest area under the curve (AUC). However, the "best" model depends on your use case.
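
To confirm the choice numerically, you can compute the AUC of a fitted classifier on the held-out test set. This is a sketch; the chosen_model, test_features_prepared, and test_label names are placeholders for objects created in earlier notebook cells:

    # Sketch: score a fitted classifier by AUC on the test set.
    from sklearn.metrics import roc_auc_score

    test_probabilities = chosen_model.predict_proba(test_features_prepared)[:, 1]
    print('Test AUC:', roc_auc_score(test_label, test_probabilities))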

Explaining the model

After you select your model, you need a better understanding of how it arrives at its predictions. For this, the tutorial uses the Python package Lime. Using Lime, you can create an explanation instance and display the result.
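
The explainer and predict_fn objects used in the next cell are defined in earlier notebook cells. A minimal sketch of how they can be constructed with Lime's tabular explainer follows; the train array, feature_names list, and chosen_model are illustrative assumptions:

    import pprint
    from lime.lime_tabular import LimeTabularExplainer

    # Sketch: build a tabular explainer over the prepared training features and
    # wrap the chosen model's probability predictions for Lime to call.
    explainer = LimeTabularExplainer(
        train,                        # NumPy array of prepared training features
        feature_names=feature_names,  # list of feature column names
        class_names=['no', 'yes'],
        discretize_continuous=True)
    predict_fn = lambda x: chosen_model.predict_proba(x)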

  • Run the cell containing this code:

    i = 106
    exp = explainer.explain_instance(train[i], predict_fn)
    pprint.pprint(exp.as_list())
    fig = exp.as_pyplot_figure()
    

The Lime output shows the contribution (x-axis) of each of the top features (y-axis), as shown in the following diagram.

Screenshot of Lime explanation.

Training and predicting with AI Platform

Up to this point in the tutorial, you have trained your models and made predictions locally. However, if you want to reduce training time or predict at scale, you can use AI Platform. AI Platform offers a managed service that can scale your training across many nodes and provides an API endpoint for deployed models.

In the next steps, you submit a training job to AI Platform and then deploy the trained model so that you can request predictions from it.

Using AI Platform to train your model and request predictions requires two steps:

  1. Submit a training job to AI Platform.
  2. Create a model in AI Platform.

Submit a training job

  1. Set the following environment variables, replacing gcs_bucket with the name of your Cloud Storage bucket:

    %env GCS_BUCKET=gcs_bucket
    %env REGION=us-central1
    %env LOCAL_DIRECTORY=./trainer/data
    %env TRAINER_PACKAGE_PATH=./trainer
    
  2. Submit the training job by running the cells that contain the following commands:

    %%bash
    
    JOBNAME=banking_$(date -u +%y%m%d_%H%M%S)
    
    echo $JOBNAME
    
    gcloud ai-platform jobs submit training model_training_$JOBNAME \
            --job-dir $GCS_BUCKET/$JOBNAME/output \
            --package-path trainer \
            --module-name trainer.task \
            --region $REGION \
            --runtime-version=1.9 \
            --python-version=3.5 \
            --scale-tier BASIC
    

    These cells store your training dataset in the Cloud Storage bucket, create a local directory for your Python trainer files, and submit the training job to AI Platform.

    To view the status of the job, go to AI Platform in the sidebar, and then select Jobs. It takes about eight minutes to finish training.
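
    You can also check the job from the notebook itself by using the gcloud CLI, as in the following sketch. Replace JOB_ID with the full job name, which is model_training_ followed by the value echoed by the submission cell:

    !gcloud ai-platform jobs describe JOB_ID
    !gcloud ai-platform jobs stream-logs JOB_ID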

Create a model in AI Platform

After you've finished training the model, you can request predictions from it.

Set configuration and environment variables

  1. In the cell with the following placeholder, replace gcs_bucket with the name of the Cloud Storage bucket that you used in the preceding training procedure:

    gcs_bucket
    

    This Cloud Storage bucket is used to stage intermediate output and to store the results.

  2. Find the model directory and copy the name.

    The model directory was stored in your Cloud Storage bucket during the earlier training step. The directory takes the form model_YYYYMMDD_HHMMSS.

  3. In the cell with the following code, replace model_directory with the value of your model directory that you copied earlier:

    model_directory
    

Create the model

  • In the notebook, run the cell with the following code:

    ! gcloud ai-platform models create $MODEL_NAME --regions=us-central1
    
    ! gcloud ai-platform versions create $VERSION_NAME \
            --model $MODEL_NAME --origin $MODEL_DIR \
            --runtime-version 1.9 --framework $FRAMEWORK \
            --python-version 3.5
    

    These commands create a model called cmle_model and a version called v1. A version is a specific configuration of the model that you're building; a model can have many different versions.

Use the model

After you create a model and a version, you can request predictions from the AI Platform prediction API. You can use the command line or the Python client library to make a prediction, as shown in the final few cells of the notebook.
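
For example, here is a minimal sketch of an online prediction request made with the gcloud CLI from the notebook; the input.json file is an assumption and must contain instances formatted the way the deployed model expects:

    ! gcloud ai-platform predict \
            --model $MODEL_NAME \
            --version $VERSION_NAME \
            --json-instances input.json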

To use the deployed model for production as a part of a business process, you can take additional steps to develop a production-ready implementation. These steps, which are outside the scope of this tutorial, include the following:

  • Evaluating and de-identifying any data (for more information, see Considerations for sensitive data within Machine Learning datasets)
  • Building a repeatable data transformation and processing pipeline for training and predictions, such that training and predictions go through the same data transformations
  • Building a repeatable deployment process to handle model retraining
  • Setting appropriate Identity and Access Management (IAM) permissions on data access, data pipelines, and any deployed AI Platform resources

Cleaning up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, you can delete the Google Cloud project that you created for this tutorial.

  1. In the Cloud Console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next