This page describes how to prepare text data for use in a Vertex AI dataset to train a sentiment analysis model.
Sentiment analysis training data consists of documents that are each associated with a sentiment value indicating the sentiment of the content. For example, you might have tweets about a particular domain such as air travel. Each tweet is associated with a sentiment value that indicates whether the tweet is positive, negative, or neutral.
Data requirements
- You must supply at least 10, but no more than 100,000, total training documents.
- A sentiment value must be an integer from 0 to 10. The maximum sentiment value is your choice. For example, if you want to identify whether the sentiment is negative, positive, or neutral, you can label the training data with sentiment scores of 0 (negative), 1 (neutral), and 2 (positive). The maximum sentiment score for this dataset is 2. If you want to capture more granularity, such as five levels of sentiment, you can label documents from 0 (most negative) to 4 (most positive).
- You must apply each sentiment value to at least 10 documents.
- Sentiment score values must be consecutive integers starting from zero. If you have gaps in scores or don't start from zero, remap your scores to be consecutive integers starting from zero.
- You can include documents inline or reference TXT files that are in Cloud Storage buckets.
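The requirements above can be checked before upload. The following Python code is a minimal sketch, not part of Vertex AI: the function names `remap_scores` and `check_requirements` and the constants are hypothetical, and the input is assumed to be a plain list with one integer sentiment score per document.

```python
from collections import Counter

MIN_DOCS = 10            # minimum total training documents
MAX_DOCS = 100_000       # maximum total training documents
MIN_DOCS_PER_VALUE = 10  # minimum documents per sentiment value
MAX_SENTIMENT = 10       # largest allowed sentiment value


def remap_scores(scores):
    """Map integer scores to consecutive integers starting at 0, preserving order."""
    mapping = {old: new for new, old in enumerate(sorted(set(scores)))}
    return [mapping[s] for s in scores]


def check_requirements(scores):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if not MIN_DOCS <= len(scores) <= MAX_DOCS:
        problems.append(
            f"document count {len(scores)} is outside {MIN_DOCS}..{MAX_DOCS}"
        )
    # bool is a subclass of int in Python, so reject it explicitly.
    if any(isinstance(s, bool) or not isinstance(s, int) for s in scores):
        problems.append("sentiment values must be integers")
        return problems
    counts = Counter(scores)
    if counts and (min(counts) < 0 or max(counts) > MAX_SENTIMENT):
        problems.append(f"sentiment values must be from 0 to {MAX_SENTIMENT}")
    if sorted(counts) != list(range(len(counts))):
        problems.append(
            "sentiment values are not consecutive integers starting from zero"
        )
    for value, n in sorted(counts.items()):
        if n < MIN_DOCS_PER_VALUE:
            problems.append(f"sentiment value {value} has only {n} documents")
    return problems
```

For example, scores labeled 1, 3, and 5 fail the consecutive-integer check, while `remap_scores` turns them into 0, 1, and 2. After remapping, the maximum sentiment value for the dataset is the largest remapped score.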
Best practices for text data used to train AutoML models
The following recommendations apply to datasets used to train AutoML models.
- Provide at least 100 documents per sentiment value.
- Use a balanced number of documents for each sentiment score. Having more examples for particular sentiment scores can introduce bias into the model.
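One way to review a dataset against these recommendations is to count documents per sentiment score. The sketch below is illustrative only: `balance_report` is a hypothetical helper, and the `max_ratio` threshold of 1.5 is an arbitrary assumption, since the recommendations above do not define a numeric threshold for balance.

```python
from collections import Counter

RECOMMENDED_MIN_PER_VALUE = 100  # recommended documents per sentiment value


def balance_report(scores, max_ratio=1.5):
    """Summarize per-score counts; max_ratio is an assumed, adjustable threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    counts = Counter(scores)
    # Scores that have fewer documents than the recommended minimum.
    under = sorted(v for v, n in counts.items() if n < RECOMMENDED_MIN_PER_VALUE)
    # Ratio of the most common score's count to the least common score's count.
    ratio = max(counts.values()) / min(counts.values())
    return {
        "counts": dict(counts),
        "under_recommended": under,
        "imbalance_ratio": ratio,
        "balanced": ratio <= max_ratio,
    }
```

A ratio close to 1 means the scores are represented about equally. A large ratio suggests adding documents for the under-represented scores.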
Input files
Input file types for sentiment analysis can be JSON Lines or CSV.
JSON Lines
The format, field names, and value types for JSON Lines files are determined by a schema file, which is a publicly accessible YAML file.
You can download the schema file for sentiment analysis from the
following Cloud Storage location:
gs://google-cloud-aiplatform/schema/dataset/ioformat/text_sentiment_io_format_1.0.0.yaml
JSON Lines example
The following example shows how you might use the schema to create your own JSON Lines file. The example includes line breaks for readability. In your JSON Lines files, include line breaks only after each document. The dataItemResourceLabels field, which can specify ml_use for example, is optional.
{
  "sentimentAnnotation": {
    "sentiment": number,
    "sentimentMax": number
  },
  "textContent": "inline_text",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
{
  "sentimentAnnotation": {
    "sentiment": number,
    "sentimentMax": number
  },
  "textGcsUri": "gcs_uri_to_file",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
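Records in this layout can be generated with Python's standard `json` module, which also keeps each document on a single line. This is a minimal sketch: `to_jsonl_line` is a hypothetical helper, the field names come from the example above, and the sample text and the `gs://my-bucket/...` path are made up for illustration.

```python
import json


def to_jsonl_line(sentiment, sentiment_max, text=None, gcs_uri=None, ml_use=None):
    """Build one JSON Lines record from inline text or a Cloud Storage URI."""
    if (text is None) == (gcs_uri is None):
        raise ValueError("provide exactly one of text or gcs_uri")
    record = {
        "sentimentAnnotation": {
            "sentiment": sentiment,
            "sentimentMax": sentiment_max,
        }
    }
    if text is not None:
        record["textContent"] = text
    else:
        record["textGcsUri"] = gcs_uri
    if ml_use is not None:  # dataItemResourceLabels is optional
        record["dataItemResourceLabels"] = {
            "aiplatform.googleapis.com/ml_use": ml_use
        }
    # json.dumps escapes line breaks inside strings, so the record stays on one line.
    return json.dumps(record, ensure_ascii=False)


# Hypothetical sample data, for illustration only.
lines = [
    to_jsonl_line(2, 2, text="Great flight,\nfriendly crew", ml_use="training"),
    to_jsonl_line(0, 2, gcs_uri="gs://my-bucket/reviews/doc1.txt"),
]
jsonl = "\n".join(lines) + "\n"
```

Writing `jsonl` to a file yields one document per line, with line breaks only after each document.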
CSV
Each line in a CSV file refers to a single document. The following example shows the general format of a valid CSV file. The ml_use column is optional.
[ml_use],gcs_file_uri|"inline_text",sentiment,sentimentMax
The following snippet is an example of an input CSV file.
test,gs://path_to_file,sentiment_value,sentiment_max_value
test,"inline_text",sentiment_value,sentiment_max_value
training,gs://path_to_file,sentiment_value,sentiment_max_value
validation,gs://path_to_file,sentiment_value,sentiment_max_value
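Rows in this format can be assembled with a small helper. The sketch below rests on two assumptions that the format description above does not confirm: any content that does not start with `gs://` is treated as inline text, and double quotes inside inline text are escaped by doubling them, as in common CSV practice. `csv_row` is a hypothetical name, the sample values are made up, and inline text that contains line breaks is not handled.

```python
def csv_row(content, sentiment, sentiment_max, ml_use=None):
    """Format one CSV line as [ml_use],gcs_file_uri|"inline_text",sentiment,sentimentMax."""
    if content.startswith("gs://"):
        field = content  # Cloud Storage URI, written unquoted
    else:
        # Inline text: wrap in quotes and double any embedded quotes (assumed escaping).
        field = '"' + content.replace('"', '""') + '"'
    parts = [field, str(sentiment), str(sentiment_max)]
    if ml_use is not None:  # the ml_use column is optional
        parts.insert(0, ml_use)
    return ",".join(parts)


# Hypothetical sample rows, for illustration only.
rows = [
    csv_row("gs://my-bucket/reviews/doc1.txt", 0, 2, ml_use="test"),
    csv_row('He said "great", twice', 2, 2),
]
csv_text = "\n".join(rows) + "\n"
```

Quoting the inline text keeps commas inside a document from being read as column separators.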