This page describes how to prepare text data for use in a Vertex AI dataset. The format of text input data depends on the objective. For example, data preparation for text classification differs from data preparation for text sentiment analysis.
The following sections describe the data requirements, recommendations, and input files for each objective.
Single-label classification
For single-label classification, training data consists of documents and the classification category that applies to those documents. Single-label classification allows a document to be assigned only one label.
Data requirements
- You must supply at least 20, and no more than 1,000,000, training documents.
- You must supply at least 2, and no more than 5000, unique category labels.
- You must apply each label to at least 10 documents.
- For single-label classification, you apply exactly one label to each document.
- You can include documents inline or reference TXT files that are in Cloud Storage buckets.
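Before you import your data, you can check these limits programmatically. The following Python snippet is a minimal sketch, not part of any Vertex AI tooling; it assumes a hypothetical in-memory list named examples that pairs each document (inline text or Cloud Storage URI) with its label.

from collections import Counter

# Hypothetical in-memory representation: (document text or Cloud Storage URI, label) pairs.
examples = [
    ("gs://my-bucket/doc1.txt", "sports"),
    ("Inline review text ...", "entertainment"),
    # ... more examples ...
]

label_counts = Counter(label for _, label in examples)

if not 20 <= len(examples) <= 1_000_000:
    print(f"Document count {len(examples)} is outside the 20 to 1,000,000 range")
if not 2 <= len(label_counts) <= 5000:
    print(f"Unique label count {len(label_counts)} is outside the 2 to 5,000 range")
for label, count in label_counts.items():
    if count < 10:
        print(f"Label {label!r} appears on only {count} documents; at least 10 are required")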
Best practices for text data used to train AutoML models
The following recommendations apply to datasets used to train AutoML models.
- Use training data that is as varied as the data on which predictions will be made. Include different lengths of documents, documents authored by different people, documents that use different wording or style, and so on.
- Use documents that can be easily categorized by a human reader. AutoML models can't generally predict category labels that humans can't assign. So, if a human can't be trained to assign a label by reading a document, your model likely can't be trained to do it either.
- Provide as many training documents per label as possible. You can improve the confidence scores from your model by using more examples per label. Train a model using 50 examples per label and evaluate the results. Add more examples and retrain until you meet your accuracy targets, which might require hundreds or even 1000 examples per label.
- The model works best when there are at most 100 times more documents for the most common label than for the least common label. We recommend removing very low frequency labels.
- Consider including an out-of-domain label (for example, None_of_the_above) for documents that don't match any of your defined labels. For example, if you only labeled documents about arts and entertainment, but your dataset contains documents about other subjects, such as sports or technology, label the documents about other subjects as None_of_the_above. Without such a label, the trained model will attempt to assign all documents to one of the defined labels, even documents for which those labels are unsuitable.
- If you have a large number of documents that don't currently match your labels, filter them out so that your model doesn't skew predictions to an out-of-domain label. For example, you could have a filtering model that predicts whether a document fits within the current set of labels or is out of domain. After filtering, you would have another model that classifies in-domain documents only.
Input files
Single-label classification supports JSON Lines or CSV input files. You can specify only one label (annotation) for a given document. The following sections describe the input files and provide examples for each file type.
JSON Lines
The format, field names, and value types for JSON Lines files are determined by a schema file, which is a publicly accessible YAML file.
You can download the schema file for single-label classification from the
following Cloud Storage location:
gs://google-cloud-aiplatform/schema/dataset/ioformat/text_classification_single_label_io_format_1.0.0.yaml
JSON Lines example
The following example shows how you might use the schema to create your own JSON Lines file. The example includes line breaks for readability. In your JSON Lines files, include line breaks only after each document. The dataItemResourceLabels field is optional; it specifies, for example, the ml_use value.
{ "classificationAnnotation": { "displayName": "label" }, "textContent": "inline_text", "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "training|test|validation" } } { "classificationAnnotation": { "displayName": "label2" }, "textGcsUri": "gcs_uri_to_file", "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "training|test|validation" } }
CSV
Each line in a CSV file refers to a single document. The following example shows the general format of a valid CSV file. The ml_use column is optional.
[ml_use],gcs_file_uri|"inline_text",label
The following snippet is an example of an input CSV file.
test,gs://path_to_file,label1
test,"inline_text",label2
training,gs://path_to_file,label3
validation,gs://path_to_file,label1
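Because inline text can contain commas, it is safer to let a CSV library handle quoting than to concatenate strings yourself. The following Python snippet is a minimal sketch with placeholder file names, labels, and text; it relies only on the standard csv module.

import csv

# Hypothetical rows: (ml_use, Cloud Storage URI or inline text, label).
rows = [
    ("test", "gs://my-bucket/doc1.txt", "label1"),
    ("training", "Inline text, including commas, is quoted automatically.", "label2"),
]

with open("single_label.csv", "w", newline="") as f:
    # QUOTE_MINIMAL quotes a field only when it contains a delimiter or quote character.
    writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
    writer.writerows(rows)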
Multi-label classification
For multi-label classification, training data consists of documents and the classification categories that apply to those documents. Multi-label classification allows a document to be assigned one or more labels.
Data requirements
- You must supply at least 20, and no more than 1,000,000, training documents.
- You must supply at least 2, and no more than 5000, unique category labels.
- You must apply each label to at least 10 documents.
- For multi-label classification, you can apply one or multiple labels to a document.
- You can include documents inline or reference TXT files that are in Cloud Storage buckets.
Best practices for text data used to train AutoML models
The following recommendations apply to datasets used to train AutoML models.
- Use training data that is as varied as the data on which predictions will be made. Include different lengths of documents, documents authored by different people, documents that use different wording or style, and so on.
- Use documents that can be easily categorized by a human reader. AutoML models can't generally predict category labels that humans can't assign. So, if a human can't be trained to assign a label by reading a document, your model likely can't be trained to do it either.
- When using multi-label classification, apply all relevant labels to each document. For example, if you are labeling documents that provide details about pharmaceuticals, you might have a document that is labeled with Dosage and Side Effects.
- Provide as many training documents per label as possible. You can improve the confidence scores from your model by using more examples per label. Better confidence scores are especially helpful when your model returns multiple labels when it classifies a document. Train a model using 50 examples per label and evaluate the results. Add more examples and retrain until you meet your accuracy targets, which might require hundreds or even 1000 examples per label.
- The model works best when there are at most 100 times more documents for the most common label than for the least common label. We recommend removing very low frequency labels.
- Consider including an out-of-domain label (for example, None_of_the_above) for documents that don't match any of your defined labels. For example, if you only labeled documents about arts and entertainment, but your dataset contains documents about other subjects, such as sports or technology, label the documents about other subjects as None_of_the_above. Without such a label, the trained model will attempt to assign all documents to one of the defined labels, even documents for which those labels are unsuitable.
- If you have a large number of documents that don't currently match your labels, filter them out so that your model doesn't skew predictions to an out-of-domain label. For example, you could have a filtering model that predicts whether a document fits within the current set of labels or is out of domain. After filtering, you would have another model that classifies in-domain documents only.
Input files
Multi-label classification supports JSON Lines or CSV input files. You can specify more than one label (annotation) for a given document. The following sections describe the input files and provide examples for each file type.
JSON Lines
The format, field names, and value types for JSON Lines files are determined by a schema file, which is a publicly accessible YAML file.
You can download the schema file for multi-label classification from the
following Cloud Storage location:
gs://google-cloud-aiplatform/schema/dataset/ioformat/text_classification_multi_label_io_format_1.0.0.yaml
JSON Lines example
The following example shows how you might use the schema to create your own JSON Lines file. The example includes line breaks for readability. In your JSON Lines files, include line breaks only after each document. The dataItemResourceLabels field is optional; it specifies, for example, the ml_use value.
{ "classificationAnnotations": [{ "displayName": "label1" },{ "displayName": "label2" }], "textGcsUri": "gcs_uri_to_file", "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "training|test|validation" } } { "classificationAnnotations": [{ "displayName": "label2" },{ "displayName": "label3" }], "textContent": "inline_text", "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "training|test|validation" } }
CSV
Each line in a CSV file refers to a single document. The following example shows the general format of a valid CSV file. The ml_use column is optional.
[ml_use],gcs_file_uri|"inline_text",label1,label2,...
The following snippet is an example of an input CSV file.
test,gs://path_to_file,label1,label2
test,"inline_text",label3
training,gs://path_to_file,label1,label2,label3
validation,gs://path_to_file,label4,label5
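Because each document can have a different number of labels, the rows in a multi-label CSV file can have different lengths. The following Python snippet is a minimal sketch with placeholder values; a CSV writer handles the variable-length rows directly.

import csv

# Hypothetical rows: ml_use, Cloud Storage URI or inline text, then one column per label.
rows = [
    ["test", "gs://my-bucket/doc1.txt", "label1", "label2"],
    ["training", "Inline text example.", "label3"],
    ["validation", "gs://my-bucket/doc2.txt", "label1", "label2", "label3"],
]

with open("multi_label.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)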
Entity extraction
Entity extraction training data consists of documents that are annotated with the labels that identify the types of entities that you want your model to identify. For example, you might create an entity extraction model to identify specialized terminology in legal documents or patents. Annotations specify the locations of the entities that you're labeling and the labels themselves.
If you're annotating structured or semi-structured documents for a dataset used to train AutoML models, such as invoices or contracts, Vertex AI can consider an annotation's position on the page as a factor contributing to its proper label. For example, a real estate contract has both an acceptance date and a closing date. Vertex AI can learn to distinguish between the entities based on the spatial position of the annotation.
Data requirements
- You must supply at least 50, and no more than 100,000, training documents.
- You must supply at least 1, and no more than 100, unique labels to annotate entities that you want to extract.
- You can use a label to annotate between 1 and 10 words.
- Label names can be between 2 and 30 characters.
- You can include annotations in your JSON Lines files, or you can add annotations later by using the Google Cloud console after uploading documents.
- You can include documents inline or reference TXT files that are in Cloud Storage buckets.
Best practices for text data used to train AutoML models
The following recommendations apply to datasets used to train AutoML models.
- Use each label at least 200 times in your training dataset.
- Annotate every occurrence of entities that you want your model to identify.
Input files
Input files for entity extraction must be JSON Lines files. The format, field names, and value types for JSON Lines files are determined by a schema file, which is a publicly accessible YAML file.
You can download the schema file for entity extraction from the following
Cloud Storage location:
gs://google-cloud-aiplatform/schema/dataset/ioformat/text_extraction_io_format_1.0.0.yaml
The following example shows how you might use the schema to create your own JSON Lines file. The example includes line breaks for readability. In your JSON Lines files, include line breaks only after each document. The dataItemResourceLabels field is optional; it specifies, for example, the ml_use value.
{ "textSegmentAnnotations": [ { "startOffset":number, "endOffset":number, "displayName": "label" }, ... ], "textContent": "inline_text", "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "training|test|validation" } } { "textSegmentAnnotations": [ { "startOffset":number, "endOffset":number, "displayName": "label" }, ... ], "textGcsUri": "gcs_uri_to_file", "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "training|test|validation" } }
You can also annotate documents by using the Google Cloud console. Create a JSON Lines file with content only (without the textSegmentAnnotations field); documents are uploaded to Vertex AI without any annotations.
Sentiment analysis
Sentiment analysis training data consists of documents that are associated with a sentiment value that indicates the sentiment of the content. For example, you might have tweets about a particular domain such as air travel. Each tweet is associated with a sentiment value that indicates whether the tweet is positive, negative, or neutral.
Data requirements
- You must supply at least 10, but no more than 100,000, total training documents.
- A sentiment value must be an integer from 0 to 10. The maximum sentiment value is your choice. For example, if you want to identify whether the sentiment is negative, positive, or neutral, you can label the training data with sentiment scores of 0 (negative), 1 (neutral), and 2 (positive). The maximum sentiment score for this dataset is 2. If you want to capture more granularity, such as five levels of sentiment, you can label documents from 0 (most negative) to 4 (most positive).
- You must apply each sentiment value to at least 10 documents.
- Sentiment score values must be consecutive integers starting from zero. If you have gaps in scores or don't start from zero, remap your scores to be consecutive integers starting from zero, as shown in the sketch after this list.
- You can include documents inline or reference TXT files that are in Cloud Storage buckets.
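Remapping scores to consecutive integers starting from zero takes only a few lines. The following Python snippet is a minimal sketch that assumes your raw scores are in a hypothetical list; it sorts the distinct values and reindexes them from zero.

# Hypothetical raw sentiment scores with gaps (1, 3, 5 instead of 0, 1, 2).
raw_scores = [1, 3, 5, 3, 1, 5, 5]

# Map each distinct score, in ascending order, to a consecutive integer starting at zero.
remap = {value: index for index, value in enumerate(sorted(set(raw_scores)))}

scores = [remap[value] for value in raw_scores]
sentiment_max = max(scores)

print(scores)         # [0, 1, 2, 1, 0, 2, 2]
print(sentiment_max)  # 2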
Best practices for text data used to train AutoML models
The following recommendations apply to datasets used to train AutoML models.
- Provide at least 100 documents per sentiment value.
- Use a balanced number of documents for each sentiment score. Having more examples for particular sentiment scores can introduce bias into the model.
Input files
Input file types for sentiment analysis can be JSON Lines or CSV.
JSON Lines
The format, field names, and value types for JSON Lines files are determined by a schema file, which is a publicly accessible YAML file.
You can download the schema file for sentiment analysis from the
following Cloud Storage location:
gs://google-cloud-aiplatform/schema/dataset/ioformat/text_sentiment_io_format_1.0.0.yaml
JSON Lines example
The following example shows how you might use the schema to create your own JSON Lines file. The example includes line breaks for readability. In your JSON Lines files, include line breaks only after each document. The dataItemResourceLabels field is optional; it specifies, for example, the ml_use value.
{ "sentimentAnnotation": { "sentiment": number, "sentimentMax": number }, "textContent": "inline_text", "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "training|test|validation" } } { "sentimentAnnotation": { "sentiment": number, "sentimentMax": number }, "textGcsUri": "gcs_uri_to_file", "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "training|test|validation" } }
CSV
Each line in a CSV file refers to a single document. The following example shows the general format of a valid CSV file. The ml_use column is optional.
[ml_use],gcs_file_uri|"inline_text",sentiment,sentimentMax
The following snippet is an example of an input CSV file.
test,gs://path_to_file,sentiment_value,sentiment_max_value
test,"inline_text",sentiment_value,sentiment_max_value
training,gs://path_to_file,sentiment_value,sentiment_max_value
validation,gs://path_to_file,sentiment_value,sentiment_max_value
What's next
- Creating a dataset using the console or using the API.
- Creating an annotation set to use an existing dataset for a different objective.
- Requesting a data labeling job for your dataset.