This document explains how to create a basic sentiment analysis model using the Google Prediction API. A sentiment analysis model is used to analyze a text string and classify it with one of the labels that you provide; for example, you could analyze a tweet to determine whether it is positive or negative, or analyze an email to determine whether it is happy, frustrated, or sad. You should read the Hello Prediction page before reading this introduction.
Step 1: Collect Data
You must train your sentiment model against examples of the type of data that you are going to see when you use your model. For example, if you are trying to determine the sentiment of tweets, you will need to obtain examples of tweets, such as:
- "Feeling kind of low...."
- "OMG! Just had a fabulous day!"
- "Eating eggplant. Why bother?"
You can either provide your own data or buy it. For a good model, you will need at least several hundred examples, if not thousands. All of the samples should be in the same language (it doesn't matter which language, as long as it's consistent and space-delimited).
Step 2: Label Your Data
Once you have collected your training samples, you will need to pre-classify each sample with a label. A label is a string that you think best describes that example, for example: "happy", "sad", "on the fence". So, to assign labels to the previous examples:
- "sad", "Feeling kind of low...."
- "excited", "OMG! Just had a fabulous day!"
- "bored", "Eating eggplant. Why bother?"
A few tips about labels:
- You can have up to 1,000 labels for a model, but you should only use as many labels as are useful to you, and you must have at least a few dozen examples for each label that you assign.
- Labels are just strings, so they can have spaces. However, you should put double quotes around any labels that have spaces, and you should escape any nested quotation marks using a \ mark. Example: "that\'s fine"
- Labels are case-sensitive. So "Happy" and "happy" will be seen as two separate labels by the training system. Best practice is to use lowercase for all labels, to avoid mix-ups.
- Each line can only have one label assigned, but you can apply multiple labels to one example by repeating an example and applying different labels to each one. For example:
- "excited", "OMG! Just had a fabulous day!"
- "annoying", "OMG! Just had a fabulous day!"
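The label tips above can be checked mechanically before training. The following is a minimal sketch, not part of the Prediction API: the helper names and the default threshold of 24 (a stand-in for "a few dozen") are assumptions made for illustration. It lowercases labels so that "Happy" and "happy" are counted together, and reports labels that have too few examples.

```python
from collections import Counter

def normalize_label(label):
    """Trim and lowercase a label so "Happy" and "happy" count as one label."""
    return label.strip().lower()

def label_counts(rows):
    """Count examples per normalized label; rows are (label, text) pairs."""
    return Counter(normalize_label(label) for label, _text in rows)

def underrepresented(rows, minimum=24):
    """Return labels with fewer than `minimum` examples (threshold is an assumption)."""
    counts = label_counts(rows)
    return sorted(label for label, count in counts.items() if count < minimum)

rows = [
    ("sad", "Feeling kind of low...."),
    ("excited", "OMG! Just had a fabulous day!"),
    ("Excited", "Best concert ever!"),
    ("bored", "Eating eggplant. Why bother?"),
]
print(label_counts(rows))
print(underrepresented(rows, minimum=2))
```

With a real training set, you would run the check with the default threshold and either collect more examples for the reported labels or drop those labels.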
Step 3: Prepare Your Data
The Google Prediction API takes training data formatted as a comma-separated values (CSV) file with one row per example. The format of this file is basically this:
label1, feature1, feature2, feature3, ...
label2, feature1, feature2, feature3, ...
...
In the previous example, each example had a single feature: a text string that is a tweet. So the file would look something like this:
"sad", "Feeling kind of low...."
"excited", "OMG! Just had a fabulous day!"
"bored", "Eating eggplant. Why bother?"
However, if you have more data that you think would help Google Prediction find some underlying patterns, it would be useful to include that information as well. For example, if you think that message length is meaningful (longer messages indicate happier tweets) or time of day (daytime tweets are happier than nighttime tweets), you could create additional features for that data. The following example shows the label, tweet text, message word count, and numeric version of the time of day for each tweet:
"sad", "Feeling kind of low....", 4, 18.30
"excited", "OMG! Just had a fabulous day!", 6, 9.10
"bored", "Eating eggplant. Why bother?", 4, 12.00
Note that the Google Prediction API doesn't take datetime values, so you must find an equivalent; here, times are specified as numbers. Also note that there should be a good correlation between all features and the label that you assign. See Designing a Good Model for more tips on creating good training data.
See Training Data File Format to learn all the details of the CSV training file.
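As an illustration of the row layout described in this step, here is a minimal sketch that derives the word count and a numeric time of day and writes the rows with Python's standard csv module. The helper names are hypothetical, and the hour.minute encoding (18:30 becomes 18.3) is an assumption based on the sample values shown in this step. The csv module escapes embedded double quotes by doubling them, which may differ from the backslash convention mentioned in Step 2, so check Training Data File Format before relying on it.

```python
import csv
import io

def time_to_number(hour, minute):
    """Encode a time of day as hour.minute, e.g. 18:30 -> 18.3 (assumed encoding)."""
    return round(hour + minute / 100, 2)

def build_row(label, text, hour, minute):
    """Return one training row: label, text, word count, numeric time of day."""
    word_count = len(text.split())
    return [label, text, word_count, time_to_number(hour, minute)]

def rows_to_csv(rows):
    """Serialize rows as CSV, quoting strings and leaving numbers bare."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()

examples = [
    ("sad", "Feeling kind of low....", 18, 30),
    ("excited", "OMG! Just had a fabulous day!", 9, 10),
    ("bored", "Eating eggplant. Why bother?", 12, 0),
]
training_csv = rows_to_csv(build_row(*example) for example in examples)
print(training_csv)
```

In practice you would write the same rows to a file on disk, so that the file can be uploaded in the next step.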
There are offline tools (such as Weka and R) that can help you build sample models and identify which features affect your model the most. Unfortunately, there are no guarantees when building a model for your problem; the best way to achieve optimal results is to build a few models with different feature sets and use the one that works best. As a general rule, though, including all the relevant data possible will produce the best results, because the algorithms are good at ignoring features that don't influence outcomes.
Step 4: Upload Data to Google Cloud Storage
Once you have your data available in CSV format, it is time to upload it to Google Cloud Storage. There are several ways to do this:
- With a web interface in the Google Cloud Platform Console
- With a command line tool: GSUtil
- By using the Google Cloud Storage API
Step 5: Train a Model with the Google Prediction API
Train your model using either the Google API Explorer or a client library. The method for training is prediction.trainedmodels.insert(), which takes the path to your training data in Google Cloud Storage. Note: do not prefix the data file path with "gs://". That syntax is exclusively for the GSUtil utility.
Note that training can take a while for very large training sets (with tens of thousands of rows).
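The path rule above is easy to get wrong when the same path is also used with GSUtil. The sketch below builds a request body for the training call and strips a leading "gs://" if one is present. The field names `id` and `storageDataLocation` are assumptions about the request format, and the model ID, bucket, and helper names are hypothetical, so verify them against the API reference before use. The code only builds a dictionary and does not call the API.

```python
def storage_location(path):
    """Return a bucket/object path without the gs:// prefix used by GSUtil."""
    prefix = "gs://"
    if path.startswith(prefix):
        path = path[len(prefix):]
    return path

def build_insert_body(model_id, data_path):
    """Build a request body for the training call (field names are assumptions)."""
    return {
        "id": model_id,
        "storageDataLocation": storage_location(data_path),
    }

# Hypothetical model ID and bucket, for illustration only.
body = build_insert_body("tweet-sentiment", "gs://my-bucket/tweets.csv")
print(body)
```

The resulting dictionary is what you would pass as the request body in the API Explorer or to a client library.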
Step 6: Make Predictions with the Google Prediction API in Your Application
Now that you have successfully built a model, it is time to actually make predictions! The output from a prediction call for a classification problem like sentiment analysis will include several important fields:
- outputLabel: The label that the API determined was most likely to apply to this sample.
- outputMulti: A detailed breakdown, per label, of how likely that label is to apply to this sample. (All scores in outputMulti will sum to 1.)
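One way to use these two fields is to accept the top label only when its score is high enough. The following is a minimal sketch: it assumes that outputMulti arrives as a list of objects with "label" and "score" keys and that scores may be encoded as strings. The sample response, the helper names, and the 0.6 threshold are made up for illustration.

```python
def ranked_scores(prediction):
    """Return (label, score) pairs from outputMulti, highest score first."""
    pairs = [(item["label"], float(item["score"]))
             for item in prediction["outputMulti"]]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

def confident_label(prediction, threshold=0.6):
    """Return the top label if its score reaches the threshold, else None."""
    label, score = ranked_scores(prediction)[0]
    return label if score >= threshold else None

# Hypothetical response, shaped according to the assumptions above.
sample = {
    "outputLabel": "excited",
    "outputMulti": [
        {"label": "sad", "score": "0.1"},
        {"label": "excited", "score": "0.7"},
        {"label": "bored", "score": "0.2"},
    ],
}
total = sum(score for _label, score in ranked_scores(sample))
print(confident_label(sample), round(total, 6))
```

Returning None for low-confidence predictions lets the application fall back to a default, such as treating the text as neutral or sending it for manual review.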
Step 7: Update Your Model with New Data
You can continue to improve your model by adding additional examples. There are two ways to add additional examples to your model:
- Add the new data to the original data file and retrain the model with another insert() call.
- Update your existing model using the update method; this adds new examples on the fly to an existing model. The downside of this option is that some classifiers cannot be updated, so it is possible that an updated classifier will have worse accuracy than the classifier trained on all the data in batch.
The Google Prediction API is a generic machine learning service and can be used to solve almost any regression or classification problem. If you have any questions, feel free to email our discussion forum at email@example.com.