Preparing your training data

To train your custom model, you must provide representative samples of the type of content you want AutoML Natural Language Sentiment Analysis to analyze, each labeled with a value indicating how positive the sentiment is within the content.

The sentiment score is an integer ranging from 0 (relatively negative) to a maximum value of your choice (positive). For example, if you want to identify whether the sentiment is negative, positive, or neutral, you would label the training data with sentiment scores of 0 (negative), 1 (neutral), and 2 (positive). The maximum sentiment score (sentiment_max) for that dataset is 2. If you want to capture more granularity with five levels of sentiment, you still label items with the most negative sentiment as 0 and use 4 for the most positive sentiment. The maximum sentiment score (sentiment_max) for that dataset is 4.

For best results, make sure your training data includes a balanced number of items with each sentiment score; having more examples for particular sentiment scores can introduce bias into the model.
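Before uploading, it can help to count how many items carry each sentiment score. The following is a minimal sketch, not part of the product: the helper names score_distribution and imbalance_ratio are hypothetical, and it assumes sentiment_max is simply the highest score that appears in your labels.

```python
from collections import Counter

def score_distribution(scores):
    """Return (sentiment_max, counts) for a list of integer sentiment scores."""
    if any(s < 0 for s in scores):
        raise ValueError("sentiment scores must be non-negative integers")
    # Assumption: the highest score used is the dataset's sentiment_max.
    sentiment_max = max(scores)
    tally = Counter(scores)
    # Report every level from 0 to sentiment_max, including unused ones.
    counts = {level: tally.get(level, 0) for level in range(sentiment_max + 1)}
    return sentiment_max, counts

def imbalance_ratio(counts):
    """Ratio of the largest to the smallest per-score count (inf if a level is empty)."""
    smallest = min(counts.values())
    largest = max(counts.values())
    return float("inf") if smallest == 0 else largest / smallest

scores = [0, 0, 1, 1, 2, 2, 2, 2]
sentiment_max, counts = score_distribution(scores)
ratio = imbalance_ratio(counts)
```

A ratio close to 1 suggests a balanced set; a large ratio, or an empty level, suggests that some scores need more examples.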

You provide your labeled training data in a comma-separated values (.csv) file. Each line in the file provides a training item and the sentiment score for that item. The .csv file can have any filename, must be UTF-8 encoded, and must end with a .csv extension. It must be stored in the Google Cloud Storage bucket associated with your project. The file has one row for each item in the set you are uploading, with these columns in each row:

  1. Which set to assign the content in this row to. This column is optional and can be one of these values:

    • TRAIN - Use the content to train the model.
    • VALIDATE - Use the content to validate the results that the model returns during training. (Validation data sets are also known as "dev" datasets.)
    • TEST - Use the content to verify the model's results after the model has been trained.

    If you do not include this column to specify a set for the content in each row, AutoML Natural Language Sentiment Analysis automatically places the row in one of the three sets to ensure that there is enough training, validation, and testing content. AutoML Natural Language Sentiment Analysis uses 80% of your content documents for training, 10% for validating, and 10% for testing.

    If you explicitly assign any items to the TRAIN, VALIDATE, or TEST sets, you must explicitly assign all items. AutoML Natural Language Sentiment Analysis automatically assigns items only when none of them have been explicitly assigned to a set.

  2. The content to be analyzed. This field contains the content as quoted in-line text or provides a path to a text (.txt) or compressed zip (.zip) file. If the document is in a Google Cloud Storage bucket, the path is its Google Cloud Storage URI.

  3. An integer indicating the sentiment value for the content. The sentiment value ranges from 0 (strongly negative) to the maximum sentiment score for the dataset (strongly positive), which can be at most 10.

For example, you might have the following in your .csv file:

TRAIN,gs://my-project-lcm/training-data/birthday.txt,5
VALIDATE,"The movie is surprising with plenty of unsettling plot twists.",3
TEST,gs://my-project-lcm/training-data/cubs.txt,2
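
Rows like these can be generated with Python's standard csv module, which adds quotes around in-line text that contains commas. This is a sketch only: write_training_csv is a hypothetical helper, and the second row's text is made up for illustration.

```python
import csv
import io

def write_training_csv(rows, out):
    """Write (set_name, content, score) rows in the three-column layout."""
    # The default quoting mode quotes only fields that need it,
    # such as text containing commas or quote characters.
    writer = csv.writer(out, lineterminator="\n")
    for set_name, content, score in rows:
        writer.writerow([set_name, content, int(score)])

rows = [
    ("TRAIN", "gs://my-project-lcm/training-data/birthday.txt", 5),
    ("VALIDATE", "Surprising, with plenty of unsettling plot twists.", 3),
]
buf = io.StringIO()
write_training_csv(rows, buf)
csv_text = buf.getvalue()
```

To write a real file instead of an in-memory buffer, pass a file opened with open(path, "w", encoding="utf-8", newline="") so that the output is UTF-8 encoded.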

For the best possible model, you should have an approximately equal number of training items for each sentiment value. We recommend providing at least 1000 items per sentiment value.

Common .csv errors

  • Using Unicode characters in labels. For example, Japanese characters are not supported.
  • Using spaces and non-alphanumeric characters in labels.
  • Empty lines.
  • Empty columns (lines with two successive commas).
  • Missing quotes around embedded text that includes commas.
  • Incorrect capitalization of Cloud Storage text paths.
  • Incorrect access control configured for your text files. Your service account should have read or greater access, or the files must be publicly readable.
  • References to non-text files, such as JPEG files. Likewise, files that are not text files but that have been renamed with a text extension will cause an error.
  • A text file URI that points to a bucket outside the current project. Only files in the project bucket can be accessed.
  • Non-CSV-formatted files.
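
Several of these errors can be caught locally before upload. The sketch below is a hypothetical pre-check, not the service's own validation: check_rows and its messages are illustrative names, and it assumes that a two-column row is one that omits the optional set column. It does not check access control, bucket ownership, or file contents.

```python
import csv
import io

VALID_SETS = {"TRAIN", "VALIDATE", "TEST"}

def check_rows(csv_text, sentiment_max):
    """Return a list of (line_number, message) problems found in csv_text."""
    problems = []
    assigned = unassigned = 0
    reader = csv.reader(io.StringIO(csv_text))
    for fields in reader:
        line_no = reader.line_num
        if not fields:
            problems.append((line_no, "empty line"))
            continue
        if len(fields) not in (2, 3):
            # Extra columns usually mean in-line text with unquoted commas.
            problems.append((line_no, "expected 2 or 3 columns"))
            continue
        if any(not f.strip() for f in fields):
            problems.append((line_no, "empty column"))
            continue
        if len(fields) == 3:
            assigned += 1
            if fields[0] not in VALID_SETS:
                problems.append((line_no, "unknown set name"))
        else:
            unassigned += 1
        content, score = fields[-2], fields[-1]
        if content.startswith("gs://") and not content.endswith((".txt", ".zip")):
            problems.append((line_no, "path is not a .txt or .zip file"))
        try:
            value = int(score)
        except ValueError:
            value = -1
        if not 0 <= value <= sentiment_max:
            problems.append((line_no, "score is not an integer in range"))
    # Sets must be assigned on every row or on none; line 0 marks a file-level problem.
    if assigned and unassigned:
        problems.append((0, "set column present on some rows only"))
    return problems

sample = (
    "TRAIN,gs://my-project-lcm/training-data/birthday.txt,5\n"
    "\n"
    "VALIDATE,The movie is surprising, with plot twists.,3\n"
    "TEST,gs://my-project-lcm/training-data/cubs.jpg,2\n"
    "TEST,,2\n"
)
problems = check_rows(sample, sentiment_max=5)
```

Here the sample deliberately contains an empty line, unquoted text with a comma, a reference to a JPEG file, and an empty column, so check_rows reports one problem for each of lines 2 through 5.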