Building a Serverless Machine Learning Model

This article explores how to build a custom machine learning (ML) task on Google Cloud Platform (GCP) in a serverless manner. The article also explains the benefits that AI Platform brings compared to the Perception APIs. The following topics are covered:

  • Understanding the steps for building an ML model
  • Improving model quality
  • Moving from a toy problem to a scalable environment

This article follows Architecture of a Serverless Machine Learning Model, where you learn how to use Firebase, Cloud Functions, the Natural Language API, and AI Platform to enrich helpdesk tickets in a serverless way.

Understanding the building steps

With GCP, there are two main ways to perform machine learning tasks:

  • Use available prebuilt and trained RESTful Perception APIs such as the Cloud Vision API or the Cloud Speech API. You use these pre-trained models for the sentiment analysis and auto-tagging mentioned in the first part of this solution.
  • Build or reuse models. You can train these models with your custom data using TensorFlow and AI Platform. This approach is preferred for predicting resolution time and priority, as discussed in the first part of the solution.

The following diagram shows the second approach, where you collect and store the dataset, organize the data, and design, train, and deploy the model.

Machine Learning tasks

Collecting and storing data

Data is key for generating a compelling model. The more data that's available, the better the chances of training a model that makes accurate predictions, assuming that the data is relevant and clean. See the Improving model quality section for more information.

Getting data might be easy in some cases, but it can be difficult when a dataset is limited to a few hundred examples. In the case of support tickets, you can expect to have a few hundred thousand historical examples, which should be enough to train a decent model.

If you are using a Customer Relationship Management (CRM) system or any other system that handles ticketing, you can likely export or run a query to access all tickets that were created since your product launched. The goal is to get access to the ticket fields. These fields usually contain information such as the ticket ID, customer ID, category, priority, resolution time, agent ID, customer product experience, and more. A common output of these exports is a set of CSV files:

t0,Patrick Blevins,12,3-Advanced,Performance,Request,4-Critical,P1,5
t1,Kaitlyn Ruiz,3,2-Experienced,Technical,Issue,1-Minor,P4,6
t2,Chelsea Martin,11,2-Experienced,Technical,Request,4-Critical,P1,2
t3,Richard Arnold,8,2-Experienced,Performance,Request,1-Minor,P3,5
...

When you collect data, keep a few things in mind:

  • If you are doing a binary classification, do not gather only positive examples. Gather negatives as well.
  • More generally, when possible, try to find a balanced number of examples for each class. Otherwise, you will have to compensate for the imbalance in your code.
  • Look for more examples of outliers. They are real-world cases, so the system should be able to predict them.

In the tutorial, historical data has already been gathered and is available as a single CSV file in Cloud Storage.

Preprocessing data

Preprocessing data is critical before training a model. The amount of work and the technologies to use will mostly be dictated by these two parameters:

  • Quality: If your data contains irrelevant fields, or if it needs cleaning or feature crossing, it is important that you address these issues before sending it for training.
  • Size: If your data is too big, or if it might become too big due to one-of-k encodings, a single instance might not be able to process it.

To learn more about preprocessing best practices, refer to the Moving from a toy problem to a scalable environment section.

Cloud Datalab is a great tool for preparing the data: it runs Jupyter notebooks in a managed environment, and you can run gcloud commands directly from its UI. Cloud Datalab comes with ML Workbench, a library that simplifies interactions with TensorFlow.

Cloud Datalab can also act as an interactive orchestrator:

  • If the data is small enough, you can leverage libraries like Pandas interactively.
  • If data does not fit in memory, you can use tools such as Apache Beam to process the data outside of Cloud Datalab. But you can still interactively use the results.

The following are some of the tasks you might want to perform (a minimal sketch of a few of them follows this list):

  • Filter out columns that won't be available as an input at prediction time. Example: agent ID
  • Make sure that your labels are correctly transformed and available. Example: delete empty values
  • Eliminate exceptions, or find more examples if they are not exceptions but recurrent events. Example: a category value that doesn't exist
  • Split your data for training, evaluating, and testing. Example: 80%, 10%, 10%
  • Transform inputs and labels into usable features. Example: resolution time = (closing time – creation time)
  • Remove duplicate rows
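
The following is a minimal pandas sketch of a few of these tasks. The file name and the column names (created_time, closed_time, agent_id) are hypothetical and would match your own export:

import pandas as pd

# Hypothetical export file and column names.
df = pd.read_csv('tickets.csv')

# Filter out columns that are not available at prediction time.
df = df.drop(columns=['agent_id'])

# Transform inputs into a usable label: resolution time in hours.
df['resolution_time'] = (
    pd.to_datetime(df['closed_time']) - pd.to_datetime(df['created_time'])
).dt.total_seconds() / 3600

# Delete rows with empty labels and remove duplicate rows.
df = df.dropna(subset=['resolution_time']).drop_duplicates()

# Split 80% / 10% / 10% for training, evaluation, and test.
train = df.sample(frac=0.8, random_state=42)
rest = df.drop(train.index)
evaluation = rest.sample(frac=0.5, random_state=42)
test = rest.drop(evaluation.index)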

The following table shows some of the given inputs:

Field name | Keep | Reason | Type
Ticket ID | NO | It is a unique value per ticket and will not help train the predictive algorithm. | N/A
Customer ID | MAYBE | Learn from recurrent customers. | Discrete
Category | YES | Can impact ticket complexity. | Categorical
Agent ID | NO | This value is not known when the user submits the ticket. (Note: It would be an interesting case to predict this value through a classification scheme.) | N/A
Years of product experience | YES | Can impact ticket complexity. | Continuous

After you have selected the input columns, you need to create at least two different datasets, preferably three, that do not overlap and are all representative of the data.

Training set
Represents about 80 percent of your total dataset. Your model uses it to tweak various parameters, called model weights, to get as close to the truth (that is, the label) as possible. The model does this by minimizing a loss function, which calculates how far the model's prediction is from the truth—this is the current error level. The best set of weights is used for the final predictive model.
Evaluation set
Usually between 10 and 20 percent of the dataset. This set prevents the model from overfitting. Overfitting results when a model performs well on the training set, generating only a small error, but struggles with new or unknown data. In other words, the model overfits itself to the data. Instead of training a model to pick out general features in a given type of data, an overtrained model learns only how to pick out specific features found in the training set.
Test set
Usually between 10 and 15 percent of the data. The test set validates the model's ability to generalize. Because an evaluation set is used during training, the model might have an implicit bias. To assess the model's effectiveness, you must measure its performance on a distinct test set.

Use one of the following techniques to divide the data:

  • Shuffle the data randomly.
  • Shuffle using a seeded random generator.
  • Use a hash of a column to decide whether an example goes into the training, evaluation, or test set.

Using a hash is the recommended approach because the split isn't affected when you fetch an updated version of the dataset. This approach also provides a consistent test set over time.
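
For example, assuming that each row carries a stable key such as the ticket ID, a minimal sketch of the hash-based split looks like this:

import hashlib

def assign_bucket(ticket_id):
    """Deterministically assigns a row to the training, evaluation, or test set."""
    # Hashing a stable key keeps the split consistent across dataset refreshes.
    bucket = int(hashlib.md5(ticket_id.encode('utf-8')).hexdigest(), 16) % 100
    if bucket < 80:
        return 'train'
    if bucket < 90:
        return 'eval'
    return 'test'

assign_bucket('t0')  # always returns the same set for a given ticket ID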

Designing the model

Resolution time and priority prediction are supervised machine learning tasks. This means that the input data is labeled. The relevant data fields fall into two categories:

Inputs
Ticket ID, seniority, experience, category, type, impact
Labels
Resolution time and priority

Inputs associated with labels make useful training examples in a supervised learning context. The labels represent the truth that the model must learn to discover. The goal of a predictive model is to use the inputs to predict unknown labels. To get there, the model trains on valid examples. In this use case, predictions involve:

Resolution time
If the helpdesk agent sees a high value, more resources will likely be needed to deal with an unknown problem.
Priority
Helps the agent prioritize tickets and resources in line with established process.

Designing a model requires framing the problem to decide which model should make predictions. This use case has two different supervised machine learning problems:

Regression
Resolution time is a continuous numeric value. It can be defined as a regression problem where the model predicts a continuous value.
Classification
Priority can have several values, such as P1, P2, P3, and P4. Multi-classification is a good way to approach this problem. The model outputs a probability for each priority class where all the probabilities add up to 1. After the probabilities are assigned, you can decide which priority you assign to which ticket. For example, if P1 = 10%, P2 = 20%, P3 = 60%, and P4 = 10%, assign the support ticket a P3 priority.

When leveraging TensorFlow and AI Platform, the model development process is similar for both problems. The same features feed one of the Estimator APIs. Switching from classification to regression is as simple as changing a couple of lines of code.
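
For example, with the TensorFlow 1.x canned Estimators, the two models differ mainly in which estimator you instantiate. The feature columns and layer sizes below are illustrative:

import tensorflow as tf

# Illustrative feature columns shared by both models.
feature_columns = [
    tf.feature_column.numeric_column('experience'),
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            'category', ['Technical', 'Performance'])),
]

# Regression: predict resolution time as a continuous value.
resolution_model = tf.estimator.DNNRegressor(
    feature_columns=feature_columns, hidden_units=[32, 8])

# Classification: predict one of the four priority classes (P1-P4).
priority_model = tf.estimator.DNNClassifier(
    feature_columns=feature_columns, hidden_units=[32, 8], n_classes=4)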

Now that the inputs are known and you have some working datasets, you need to convert the data into entities that can be used in the TensorFlow graph, including non-numeric values.

Field name | Type | Description
Category | Categorical | You know the full vocabulary (all possible values).
Product experience | Continuous | Numerical values that you want to keep. Each year of product experience is important.

ML Workbench provides a simple way to convert these inputs into entities that TensorFlow can use. ML Workbench uses TensorFlow feature column functions for each type of field.

features:
  ticketid:
    transform: key        # unique row identifier, not used as a training feature
  seniority:
    transform: identity   # numeric value used as-is
  category:
    transform: one_hot    # categorical value encoded as a one-hot vector
  [...]
  priority:
    transform: target     # the label that the model learns to predict

ML Workbench also offers a single function to analyze and prepare the data for your model:

%%ml analyze [--cloud]
output: [OUTPUT_ANALYSIS]
data: $[CREATED_DATASET]
features:
...

Generating the model

For the use case in this article, you have enough tickets to make a decent prediction but not enough to worry about data size, so you can use the local trainer. Note that with a simple --cloud parameter, you can switch from local to cloud training.

%%ml train [--cloud]
output: [OUTPUT_TRAIN]
analysis: [OUTPUT_ANALYSIS]
data: $[CREATED_DATASET]
model_args:
    model: dnn_regression
    max-steps: 2000
    hidden-layer-size1: 32
    hidden-layer-size2: 8
    train-batch-size: 100
    eval-batch-size: 100
    learning-rate: 0.001

To learn more about the hidden-layer-sizeX and max-steps parameters, see the Tuning hyperparameters section.

Deploying the model

Building a graph in TensorFlow and porting it to AI Platform for training are key steps in developing an application. But what's the point of a powerful prediction model that only a data scientist can use? The current use case is intended to provide real-time predictions to your support desk. To do so, the model needs to be accessible from Cloud Functions written in Node.js, even though the model itself was built and trained in Python.

AI Platform offers the option to deploy a model as a RESTful API that serves predictions at scale, whether you have one user or millions. You can deploy the model with a few simple commands, substituting your model name and bucket path for [MODEL_NAME] and [BUCKET]:

gcloud ml-engine models create [MODEL_NAME]
DEPLOYMENT_SOURCE=[BUCKET]
gcloud ml-engine versions create "version_name" --model [MODEL_NAME] --origin $DEPLOYMENT_SOURCE

The model is available for both online and offline prediction. In the next section, you learn how to call it by using Cloud Functions in a serverless manner.

Serverless enrichment

After the model is deployed, it is available through a RESTful API, which is one of the advantages of using AI Platform. The API makes the model available to all sorts of clients, including Cloud Functions. In the solution's use case, after the Firebase database records a ticket, it triggers a Cloud Function to enrich the data using both the Natural Language API and custom models. The following diagram illustrates this flow.

serverless enrichment

The following two examples call the Natural Language API and a custom model from Cloud Functions.

  • Natural Language API:

    const text = ticket.description;
    const document = language.document({content: text});
    
    document.detectSentiment()
     .then((results) => {
        const sentiment = results[1].documentSentiment;
        admin.database().ref(`/tickets/${key}/pred_sentiment`).set(sentiment.score);
     })
     .catch((err) => {
        console.error('ERROR detectSentiment:', err);
     });
  • Resolution-time regression model:

    ml.projects.predict({
      name: `projects/${MDL_PROJECT_NAME}/models/${RESOLUTION_TIME_MODEL_NAME}`,
      resource: {
        name: `projects/${MDL_PROJECT_NAME}/models/${RESOLUTION_TIME_MODEL_NAME}`,
        instances: [
          `${key},${ticket.seniority},${ticket.experience},${ticket.category},
          ${ticket.type},${ticket.impact}`
        ]
      }
    },
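    // The response callback is omitted here; it receives the predicted
    // resolution time and can write it back to the ticket in the database.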

Improving model quality

This section outlines some extra steps that you can take to help improve your models.

Data preparation

Before you start training a model, you need to prepare your data:

Collect examples
Whether your data has outliers or you have several classes to predict, make sure to collect examples for all cases.
Clean the data
Cleaning might be as simple as deleting duplicates to avoid having identical examples in both the training set and the evaluation set, or it might mean ensuring that all values make sense (no seniority years less than zero, for example).
Bring human insights
A machine learning model can perform better with the right features. Crossing features by using the TensorFlow crossed_column function is a best practice (see the sketch after this list). For example, to predict a taxi fare, crossing longitude and latitude could improve model quality. In the helpdesk case, you could cross seniority with product experience.
Create buckets of data
Functions like TensorFlow bucketized_column can help discretize inputs and improve performance, especially in extreme cases where you do not have many examples.
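
As a rough TensorFlow sketch of these last two ideas (the bucket boundaries and vocabulary below are illustrative, not taken from the actual dataset):

import tensorflow as tf

# Discretize a continuous input into buckets (illustrative boundaries).
experience = tf.feature_column.numeric_column('experience')
experience_buckets = tf.feature_column.bucketized_column(
    experience, boundaries=[1, 3, 5, 10])

# Bring human insight by crossing two related inputs, such as seniority
# and product experience in the helpdesk case.
seniority = tf.feature_column.categorical_column_with_vocabulary_list(
    'seniority', ['2-Experienced', '3-Advanced'])
seniority_x_experience = tf.feature_column.indicator_column(
    tf.feature_column.crossed_column(
        [seniority, experience_buckets], hash_bucket_size=100))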

Tuning hyperparameters

Some training parameter values are challenging to find but are key to building a successful model. The right combination can drastically increase model performance. Common values to tweak include:

  • The size of hidden layers (for a neural network)
  • The number of neurons (for a neural network)
  • The training steps
  • The size of the buckets for categorical inputs when you don't have the whole dictionary (example: origin city)

Any value that you pick could be a hyperparameter. There are several ways to find the right combination:

  • Do a grid search using "For" loops.

    You can set different values, loop through all combinations, and zero in on the best result (see the sketch after this list). Depending on the number of hyperparameters and the size of their ranges, this approach can be time consuming and computationally expensive.

  • The preferred approach is to tune hyperparameters using AI Platform. AI Platform tunes hyperparameters automatically based on a declarative YAML setup. The system jumps quickly to the best parameter combinations and stops before going through all the training steps, saving time, compute power, and money.
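
The following is a minimal sketch of the grid-search approach. The parameter ranges and the train_and_evaluate helper are hypothetical placeholders for your own training routine:

import itertools

# Hypothetical search space for two of the hyperparameters mentioned above.
hidden_layer_sizes = [8, 16, 32, 64]
learning_rates = [0.0001, 0.001, 0.01]

best_loss = float('inf')
best_params = None
for size, lr in itertools.product(hidden_layer_sizes, learning_rates):
    # train_and_evaluate is a hypothetical placeholder for your own training
    # routine; it should return the evaluation loss for this combination.
    loss = train_and_evaluate(hidden_layer_size=size, learning_rate=lr)
    if loss < best_loss:
        best_loss, best_params = loss, (size, lr)

print('Best parameters:', best_params)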

Increasing the number of examples

The more data a model trains with, and the more use cases it sees, the better its resulting predictions. However, training the model on millions of examples might prove difficult locally. One solution is to use ML Workbench, which can leverage:

  • TensorFlow to distribute training tasks and run steps in batches.
  • BigQuery to analyze the training data.
  • AI Platform to run training and prediction on a scalable, on-demand infrastructure.

The following diagram shows how more use cases and training data lead to better predictions.

data and use cases

Note that ML training is not an embarrassingly parallel problem the way MapReduce is. For example, gradient descent optimization requires model parameters to be shared across workers, typically through a parameter server. Your ML platform must support this type of approach.

Moving from a toy problem to a scalable environment

The previous section mentioned some best practices for improving model quality. Some of those tasks might seem easy to achieve with a sampled or limited dataset, but they can prove challenging when you use the large amounts of data needed to train performant models.

The following diagram outlines a scalable approach.

scalable approach

Exploring data using BigQuery and Cloud Datalab

Storing and querying data by using BigQuery is the recommended approach. BigQuery is a columnar database built for big data, allowing users to run ad hoc queries on terabytes of data in seconds.

Through Cloud Datalab and BigQuery, you can:

  • Explore and visualize terabytes of data by leveraging BigQuery from Cloud Datalab.
  • Filter out the inputs that might not have a strong impact on the prediction.
  • Create a sample set of representative data to start creating your model.
  • Optionally, split the data into training, evaluation, and test sets.

Preprocessing data using Cloud Dataprep and Cloud Dataflow

Inputs rarely come ready to use. They often require processing to become usable features, for example:

  • Cleaning up values such as None and Null, which become Unknown.
  • Splitting datasets for training, evaluation, and test.
  • Converting data to TFRecord files, a recommended format to use with TensorFlow.

Cloud Dataprep is a visual tool that can prepare and clean data at scale with limited programming overhead. Cloud Dataprep uses Apache Beam behind the scenes, but it saves a lot of boilerplate code thanks to its easy-to-use UI.

Apache Beam can run on Cloud Dataflow. Cloud Dataflow can help you develop and execute a wide range of data processing patterns. These patterns include extraction, transformation, and loading (ETL), batch computation, and continuous computation.
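
The following is a minimal Apache Beam sketch of that kind of preprocessing, assuming a simple in-memory stand-in for the source data and hypothetical field names. A real pipeline would read your exported data and add Cloud Dataflow pipeline options to run at scale:

import apache_beam as beam
import tensorflow as tf

def clean_and_encode(row):
    """Replaces missing values and serializes the row as a tf.train.Example."""
    category = row.get('category') or 'Unknown'   # None/Null becomes Unknown
    experience = float(row.get('experience') or 0)
    example = tf.train.Example(features=tf.train.Features(feature={
        'category': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[category.encode('utf-8')])),
        'experience': tf.train.Feature(
            float_list=tf.train.FloatList(value=[experience])),
    }))
    return example.SerializeToString()

with beam.Pipeline() as pipeline:
    (pipeline
     | 'ReadRows' >> beam.Create([{'category': None, 'experience': '3'}])
     | 'CleanAndEncode' >> beam.Map(clean_and_encode)
     | 'WriteTFRecord' >> beam.io.WriteToTFRecord('tickets-train'))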

Minimizing skew at scale with tf.Transform and Dataflow

Developers often write preprocessing code for training. Sometimes they adapt this code to run in a distributed environment, and then write similar, but separate, code to process incoming data at prediction time. This approach can cause two main problems:

  • Two codebases need to be maintained, possibly by different teams.
  • More importantly, it can create a skew between training and inference.

tf.Transform solves both problems by letting you write common code for both training and prediction. It can also use Apache Beam to run the code in a distributed environment.
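
A minimal sketch of such shared code, assuming a recent tf.Transform version and illustrative field names:

import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Preprocessing applied identically at training and at prediction time."""
    return {
        # Scale a numeric input and encode a categorical one.
        'experience_scaled': tft.scale_to_0_1(inputs['experience']),
        'category_id': tft.compute_and_apply_vocabulary(inputs['category']),
        # Pass the label through unchanged.
        'resolution_time': inputs['resolution_time'],
    }

The same preprocessing_fn is analyzed and applied by an Apache Beam pipeline at training time, and the resulting transformation is exported with the model so that serving inputs go through identical code.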

Leveraging TensorFlow with tf.Transform on Google Cloud Platform

While some of the features mentioned previously are part of TensorFlow, and therefore open source, AI Platform offers key advantages when running TensorFlow:

  • Running machine learning tasks in a serverless environment
  • Facilitating hyperparameter tuning
  • Hosting models as a RESTful API accessible from heterogeneous clients (not only Python)

The following diagram illustrates the approach to leveraging TensorFlow.

leveraging TensorFlow

Note that by leveraging tf.Transform, a TensorFlow library, you limit potential skew between training and prediction, because both can use the same code base.

Next steps
