This page describes a few example use cases for the Prediction API, and how to design training data for each.
Suppose you run a site that sells beer, wine, and cheese, and you want to predict whether a visitor will be interested in wine, given their purchase history. In this situation, you might create training data with three features:
- The number of times the customer has bought fancy dessert cheese.
- A value of 1 if the customer has ever bought wine, and 0 if not.
- The number of times the customer has bought beer.
A sample of training data for this problem might be encoded as follows:
The first instance (row) would then encode an example where a customer who likes wine has done the following:
- Bought fancy dessert cheese 5 times.
- Bought wine at least once in the past (value of 1).
- Bought beer once.
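The row described above can be sketched in code. This is a minimal illustration, not part of the Prediction API: the helper names `encode_row` and `to_csv` are hypothetical, and both the label string "wine" and the label-first column order are assumptions made for this example.

```python
import csv
import io


def encode_row(label, cheese_count, bought_wine, beer_count):
    """Build one training row: label first (assumed), then the three features."""
    return [label, cheese_count, 1 if bought_wine else 0, beer_count]


def to_csv(rows):
    """Serialize rows as CSV text, quoting strings and leaving numbers bare."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


# A customer who bought cheese 5 times, has bought wine, and bought beer once.
first_row = encode_row("wine", cheese_count=5, bought_wine=True, beer_count=1)
csv_text = to_csv([first_row])
print(csv_text)
```

Keeping the encoding in one function makes it easier to keep the feature order identical between training rows and later prediction requests.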
Note that some of the features are ambiguous: not everyone who buys fancy dessert cheese buys wine. Other features may be negatively correlated with buying wine (for example, beer drinkers might be somewhat less likely to buy wine). Both types of features are legitimate and useful for the prediction system.
Spam Comment Detection
Suppose you're trying to detect whether user-submitted comments on a web page are genuine comments or spam. This task is similar to spam email detection, but you have far fewer signals to go on. Typically, the only available data is the comment itself.
A sample of the training data for this problem might look something like:
"spam","this is a spam comment"
"not_spam","this is a regular comment"
To improve accuracy, you might add features such as the number of links in each comment or, if commenting requires users to log in, the user name.
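One way to derive such extra features is sketched below. The helper names `count_links` and `encode_comment`, the URL pattern, and the column order (label, comment, link count, user name) are all assumptions for illustration; a real link detector would likely need a more careful pattern.

```python
import csv
import io
import re

# Simplistic assumption: a "link" is anything starting with http:// or https://.
LINK_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def count_links(comment):
    """Count URL-like substrings in a comment."""
    return len(LINK_PATTERN.findall(comment))


def encode_comment(label, comment, user_name=""):
    """Build one training row: label, comment text, link count, user name."""
    return [label, comment, count_links(comment), user_name]


rows = [
    encode_comment("spam", "visit http://x.example", "bob"),
    encode_comment("not_spam", "this is a regular comment", "alice"),
]

buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n").writerows(rows)
print(buf.getvalue())
```

An empty string is used when no user name is available so that every row keeps the same number of columns.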