Prepare text training data for sentiment analysis

This page describes how to prepare text data for use in a Vertex AI dataset to train a sentiment analysis model.

Sentiment analysis training data consists of documents, each associated with a sentiment value that indicates the sentiment of its content. For example, you might have tweets about a particular domain such as air travel. Each tweet is associated with a sentiment value that indicates whether the tweet is positive, negative, or neutral.

Data requirements

  • You must supply at least 10, but no more than 100,000, total training documents.
  • A sentiment value must be an integer from 0 to 10. The maximum sentiment value is your choice. For example, if you want to identify whether the sentiment is negative, positive, or neutral, you can label the training data with sentiment scores of 0 (negative), 1 (neutral), and 2 (positive). The maximum sentiment score for this dataset is 2. If you want to capture more granularity, such as five levels of sentiment, you can label documents from 0 (most negative) to 4 (most positive).
  • You must apply each sentiment value to at least 10 documents.
  • Sentiment score values must be consecutive integers starting from zero. If you have gaps in scores or don't start from zero, remap your scores to be consecutive integers starting from zero.
  • You can include documents inline or reference TXT files that are in Cloud Storage buckets.
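
The numeric requirements above can be checked before you upload anything. The following is a minimal sketch, not part of Vertex AI: the function names `check_requirements` and `remap_scores` are hypothetical, and the input is assumed to be a plain list holding one integer sentiment score per document.

```python
from collections import Counter

MIN_DOCS, MAX_DOCS = 10, 100_000   # total training documents
MIN_DOCS_PER_VALUE = 10            # documents required per sentiment value
MAX_SENTIMENT = 10                 # largest allowed sentiment value


def remap_scores(scores):
    """Map scores to consecutive integers starting from zero, keeping their order."""
    mapping = {old: new for new, old in enumerate(sorted(set(scores)))}
    return [mapping[s] for s in scores]


def check_requirements(scores):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if not MIN_DOCS <= len(scores) <= MAX_DOCS:
        problems.append(f"{len(scores)} documents; expected {MIN_DOCS} to {MAX_DOCS}")
    if not all(type(s) is int and 0 <= s <= MAX_SENTIMENT for s in scores):
        problems.append(f"every score must be an integer from 0 to {MAX_SENTIMENT}")
        return problems  # the remaining checks assume valid integers
    counts = Counter(scores)
    if sorted(counts) != list(range(len(counts))):
        problems.append("sentiment values are not consecutive integers starting from zero")
    for value, n in sorted(counts.items()):
        if n < MIN_DOCS_PER_VALUE:
            problems.append(f"sentiment value {value} has only {n} documents")
    return problems


# Hypothetical labels that use scores 1, 3, and 5 instead of 0, 1, and 2.
raw = [1] * 12 + [3] * 12 + [5] * 12
print(check_requirements(raw))
print(check_requirements(remap_scores(raw)))  # -> []
```

Remapping preserves the ordering of the original scores, so the lowest original score becomes 0. Remember to use the remapped maximum as the maximum sentiment score for the dataset.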

Best practices for text data used to train AutoML models

The following recommendations apply to datasets used to train AutoML models.

  • Provide at least 100 documents per sentiment value.
  • Use a balanced number of documents for each sentiment score. Having more examples for particular sentiment scores can introduce bias into the model.
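
One way to spot an unbalanced dataset is to count the documents per sentiment score. The sketch below is illustrative only: `balance_report` is a hypothetical helper, and the `max_ratio` threshold of 1.5 is an arbitrary assumption, not a value documented for AutoML.

```python
from collections import Counter

RECOMMENDED_MIN_PER_VALUE = 100  # recommended documents per sentiment value


def balance_report(scores, max_ratio=1.5):
    """Summarize how evenly documents are spread across sentiment scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    counts = Counter(scores)
    # Scores with fewer documents than the recommended minimum.
    under = sorted(v for v, n in counts.items() if n < RECOMMENDED_MIN_PER_VALUE)
    # Ratio between the most and least frequent score (assumed heuristic).
    ratio = max(counts.values()) / min(counts.values())
    return {
        "counts": dict(sorted(counts.items())),
        "under_recommended": under,
        "imbalanced": ratio > max_ratio,
    }


# Hypothetical dataset with too few documents for score 2.
report = balance_report([0] * 100 + [1] * 100 + [2] * 40)
print(report)
```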

Input files

Input files for sentiment analysis can be in JSON Lines or CSV format.

JSON Lines

The format, field names, and value types for JSON Lines files are determined by a schema file, which is a publicly accessible YAML file.

You can download the schema file for sentiment analysis from the following Cloud Storage location:
gs://google-cloud-aiplatform/schema/dataset/ioformat/text_sentiment_io_format_1.0.0.yaml

JSON Lines example

The following example shows how you might use the schema to create your own JSON Lines file. The example includes line breaks for readability. In your JSON Lines files, include a line break only after each document. The dataItemResourceLabels field is optional; you can use it to specify, for example, ml_use.

{
  "sentimentAnnotation": {
    "sentiment": number,
    "sentimentMax": number
  },
  "textContent": "inline_text",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
{
  "sentimentAnnotation": {
    "sentiment": number,
    "sentimentMax": number
  },
  "textGcsUri": "gcs_uri_to_file",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
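
Records in this shape can be generated with the standard `json` module, which writes each document on a single line. This is a minimal sketch under stated assumptions: `make_record` and `to_jsonl` are hypothetical helpers, and the sample text and the `gs://example-bucket/...` path are placeholders.

```python
import json


def make_record(sentiment, sentiment_max, text=None, gcs_uri=None, ml_use=None):
    """Build one document record with inline text or a Cloud Storage URI."""
    if (text is None) == (gcs_uri is None):
        raise ValueError("provide exactly one of text or gcs_uri")
    record = {
        "sentimentAnnotation": {"sentiment": sentiment, "sentimentMax": sentiment_max}
    }
    if text is not None:
        record["textContent"] = text
    else:
        record["textGcsUri"] = gcs_uri
    if ml_use is not None:  # optional label: training, test, or validation
        record["dataItemResourceLabels"] = {"aiplatform.googleapis.com/ml_use": ml_use}
    return record


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


# Placeholder documents on a 0 to 2 scale.
records = [
    make_record(2, 2, text="The crew was friendly\nand boarding was quick.", ml_use="training"),
    make_record(0, 2, gcs_uri="gs://example-bucket/reviews/001.txt"),
]
jsonl = to_jsonl(records)
print(jsonl, end="")
```

Because `json.dumps` escapes newline characters inside strings, multi-line inline text still occupies a single line in the output.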

CSV

Each line in a CSV file refers to a single document. The following example shows the general format of a valid CSV file. The ml_use column is optional.

  [ml_use],gcs_file_uri|"inline_text",sentiment,sentimentMax
  

The following snippet is an example of an input CSV file.

  test,gs://path_to_file,sentiment_value,sentiment_max_value
  test,"inline_text",sentiment_value,sentiment_max_value
  training,gs://path_to_file,sentiment_value,sentiment_max_value
  validation,gs://path_to_file,sentiment_value,sentiment_max_value
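
The following sketch builds rows in the format shown above. It rests on several assumptions: `csv_line` is a hypothetical helper, the ml_use column is always written, inline text is always wrapped in double quotes with embedded quotes doubled (standard CSV escaping, which the importer is assumed to accept), and any content that starts with `gs://` is treated as a Cloud Storage URI.

```python
def csv_line(ml_use, content, sentiment, sentiment_max):
    """Format one document as a CSV row: ml_use,content,sentiment,sentimentMax."""
    if content.startswith("gs://"):
        field = content  # Cloud Storage URI, written unquoted
    else:
        if "\n" in content or "\r" in content:
            # Each CSV line must hold exactly one document.
            raise ValueError("inline text must not contain line breaks")
        field = '"' + content.replace('"', '""') + '"'
    return f"{ml_use},{field},{sentiment},{sentiment_max}"


# Placeholder rows on a 0 to 4 scale.
rows = [
    ("test", "gs://example-bucket/reviews/001.txt", 2, 4),
    ("training", 'Great "service", thanks', 4, 4),
]
csv_text = "\n".join(csv_line(*row) for row in rows) + "\n"
print(csv_text, end="")
```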