Prepare text training data

This page describes how to prepare text data for use in a Vertex AI dataset. The format of text input data depends on the objective. For example, data preparation for text classification differs from data preparation for text sentiment analysis.

The following sections describe the data requirements, recommendations, and input files for each objective.

Single-label classification

For single-label classification, training data consists of documents and the classification category that applies to each document. Single-label classification allows a document to be assigned only one label.

Data requirements

  • You must supply at least 20, and no more than 1,000,000, training documents.
  • You must supply at least 2, and no more than 5000, unique category labels.
  • You must apply each label to at least 10 documents.
  • For single-label classification, you can apply only one label to a document.
  • You can include documents inline or reference TXT files that are in Cloud Storage buckets.

Best practices for text data used to train AutoML models

The following recommendations apply to datasets used to train AutoML models.

  • Use training data that is as varied as the data on which predictions will be made. Include different lengths of documents, documents authored by different people, documents that use different wording or style, and so on.
  • Use documents that can be easily categorized by a human reader. AutoML models can't generally predict category labels that humans can't assign. So, if a human can't be trained to assign a label by reading a document, your model likely can't be trained to do it either.
  • Provide as many training documents per label as possible. You can improve the confidence scores from your model by using more examples per label. Train a model using 50 examples per label and evaluate the results. Add more examples and retrain until you meet your accuracy targets, which might require hundreds or even 1000 examples per label.
  • The model works best when there are at most 100 times more documents for the most common label than for the least common label. We recommend removing very low frequency labels.
  • Consider including an out-of-domain label (for example, None_of_the_above) for documents that don't match any of your defined labels. For example, if you only labeled documents about arts and entertainment, but your dataset contains documents about other subjects, such as sports or technology, label the documents about other subjects as None_of_the_above. Without such a label, the trained model will attempt to assign all documents to one of the defined labels, even documents for which those labels are unsuitable.
  • If you have a large number of documents that don't currently match your labels, filter them out so that your model doesn't skew predictions to an out-of-domain label. For example, you could have a filtering model that predicts whether a document fits within the current set of labels or is out of domain. After filtering, you would have another model that classifies in-domain documents only.
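The frequency guidelines above (at least 10 documents per label, and at most a 100:1 ratio between the most and least common labels) are easy to verify before you build your input files. The following sketch shows one way to do it; `check_label_balance` is a hypothetical helper, not part of any Vertex AI library.

```python
from collections import Counter

def check_label_balance(labels, min_docs_per_label=10, max_ratio=100):
    """Check a flat list of per-document labels against the guidelines:
    each label applied to at least `min_docs_per_label` documents, and the
    most common label at most `max_ratio` times more frequent than the
    least common one."""
    counts = Counter(labels)
    rare = [lbl for lbl, n in counts.items() if n < min_docs_per_label]
    ratio = max(counts.values()) / min(counts.values())
    return {
        "labels_below_minimum": rare,
        "frequency_ratio": ratio,
        "within_ratio": ratio <= max_ratio,
    }

# Hypothetical dataset: "c" is both under-represented and far rarer than "a".
result = check_label_balance(["a"] * 1200 + ["b"] * 15 + ["c"] * 5)
```

Labels flagged in `labels_below_minimum`, or a ratio above the limit, suggest collecting more examples for rare labels or removing them.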

Input files

Single-label classification supports JSON Lines or CSV input files. You can specify only one label (annotation) for a given document. The following sections describe the input files and provide examples for each file type.

JSON Lines

The format, field names, and value types for JSON Lines files are determined by a schema file, which is a publicly accessible YAML file.

You can download the schema file for single-label classification from the following Cloud Storage location:
gs://google-cloud-aiplatform/schema/dataset/ioformat/text_classification_single_label_io_format_1.0.0.yaml

JSON Lines example

The following example shows how you might use the schema to create your own JSON Lines file. The example includes line breaks for readability. In your JSON Lines files, include line breaks only after each document. The dataItemResourceLabels field is optional and can specify values such as ml_use.



{
  "classificationAnnotation": {
    "displayName": "label"
  },
  "textContent": "inline_text",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
},
{
  "classificationAnnotation": {
    "displayName": "label2"
  },
  "textGcsUri": "gcs_uri_to_file",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
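A small script can generate this format from your own records. The sketch below writes a single-label JSON Lines file; the example documents, labels, and output filename are hypothetical placeholders.

```python
import json

# Hypothetical records; replace with your own documents and labels.
examples = [
    {"text": "The plot kept me guessing until the end.",
     "label": "arts_entertainment", "ml_use": "training"},
    {"text": "The team clinched the title in overtime.",
     "label": "sports", "ml_use": "test"},
]

with open("single_label.jsonl", "w") as f:
    for ex in examples:
        record = {
            "classificationAnnotation": {"displayName": ex["label"]},
            "textContent": ex["text"],
            "dataItemResourceLabels": {
                "aiplatform.googleapis.com/ml_use": ex["ml_use"]
            },
        }
        # json.dumps emits no internal newlines, so each document
        # occupies exactly one line, as JSON Lines requires.
        f.write(json.dumps(record) + "\n")
```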

CSV

Each line in a CSV file refers to a single document. The following example shows the general format of a valid CSV file. The ml_use column is optional.

[ml_use],gcs_file_uri|"inline_text",label

The following snippet is an example of an input CSV file.

test,gs://path_to_file,label1
test,"inline_text",label2
training,gs://path_to_file,label3
validation,gs://path_to_file,label1
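Because inline text can contain commas, it is safer to generate the CSV with a library that handles quoting than to concatenate strings. A minimal sketch, with hypothetical rows and bucket paths:

```python
import csv

# Hypothetical rows: (ml_use, document, label). A document is either a
# Cloud Storage URI or inline text; csv.writer quotes inline text that
# contains commas so it stays a single field.
rows = [
    ("test", "gs://my-bucket/docs/review1.txt", "label1"),
    ("test", "Great value, would buy again", "label2"),
    ("training", "gs://my-bucket/docs/review2.txt", "label3"),
]

with open("single_label.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```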

Multi-label classification

For multi-label classification, training data consists of documents and the classification categories that apply to those documents. Multi-label classification allows a document to be assigned one or more labels.

Data requirements

  • You must supply at least 20, and no more than 1,000,000, training documents.
  • You must supply at least 2, and no more than 5000, unique category labels.
  • You must apply each label to at least 10 documents.
  • For multi-label classification, you can apply one or multiple labels to a document.
  • You can include documents inline or reference TXT files that are in Cloud Storage buckets.

Best practices for text data used to train AutoML models

The following recommendations apply to datasets used to train AutoML models.

  • Use training data that is as varied as the data on which predictions will be made. Include different lengths of documents, documents authored by different people, documents that use different wording or style, and so on.
  • Use documents that can be easily categorized by a human reader. AutoML models can't generally predict category labels that humans can't assign. So, if a human can't be trained to assign a label by reading a document, your model likely can't be trained to do it either.
  • When using multi-label classification, apply all relevant labels to each document. For example, if you are labeling documents that provide details about pharmaceuticals, you might have a document that is labeled with Dosage and Side Effects.
  • Provide as many training documents per label as possible. You can improve the confidence scores from your model by using more examples per label. Better confidence scores are especially helpful when your model returns multiple labels when it classifies a document. Train a model using 50 examples per label and evaluate the results. Add more examples and retrain until you meet your accuracy targets, which might require hundreds or even 1000 examples per label.
  • The model works best when there are at most 100 times more documents for the most common label than for the least common label. We recommend removing very low frequency labels.
  • Consider including an out-of-domain label (for example, None_of_the_above) for documents that don't match any of your defined labels. For example, if you only labeled documents about arts and entertainment, but your dataset contains documents about other subjects, such as sports or technology, label the documents about other subjects as None_of_the_above. Without such a label, the trained model will attempt to assign all documents to one of the defined labels, even documents for which those labels are unsuitable.
  • If you have a large number of documents that don't currently match your labels, filter them out so that your model doesn't skew predictions to an out-of-domain label. For example, you could have a filtering model that predicts whether a document fits within the current set of labels or is out of domain. After filtering, you would have another model that classifies in-domain documents only.

Input files

Multi-label classification supports JSON Lines or CSV input files. You can specify more than one label (annotation) for a given document. The following sections describe the input files and provide examples for each file type.

JSON Lines

The format, field names, and value types for JSON Lines files are determined by a schema file, which is a publicly accessible YAML file.

You can download the schema file for multi-label classification from the following Cloud Storage location:
gs://google-cloud-aiplatform/schema/dataset/ioformat/text_classification_multi_label_io_format_1.0.0.yaml

JSON Lines example

The following example shows how you might use the schema to create your own JSON Lines file. The example includes line breaks for readability. In your JSON Lines files, include line breaks only after each document. The dataItemResourceLabels field is optional and can specify values such as ml_use.



{
  "classificationAnnotations": [{
    "displayName": "label1"
    },{
    "displayName": "label2"
  }],
  "textGcsUri": "gcs_uri_to_file",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
},
{
  "classificationAnnotations": [{
    "displayName": "label2"
    },{
    "displayName": "label3"
  }],
  "textContent": "inline_text",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
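The multi-label format differs from the single-label one mainly in that classificationAnnotations is an array. The sketch below writes a multi-label JSON Lines file; the documents, labels, and bucket paths are hypothetical placeholders.

```python
import json

# Hypothetical documents, each with one or more labels.
examples = [
    {"gcs_uri": "gs://my-bucket/docs/drug_leaflet.txt",
     "labels": ["Dosage", "Side Effects"], "ml_use": "training"},
    {"text": "Take two tablets daily with food.",
     "labels": ["Dosage"], "ml_use": "validation"},
]

with open("multi_label.jsonl", "w") as f:
    for ex in examples:
        record = {
            "classificationAnnotations": [
                {"displayName": lbl} for lbl in ex["labels"]
            ],
            "dataItemResourceLabels": {
                "aiplatform.googleapis.com/ml_use": ex["ml_use"]
            },
        }
        # Use textGcsUri for documents in Cloud Storage and
        # textContent for inline text.
        if "gcs_uri" in ex:
            record["textGcsUri"] = ex["gcs_uri"]
        else:
            record["textContent"] = ex["text"]
        f.write(json.dumps(record) + "\n")
```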

CSV

Each line in a CSV file refers to a single document. The following example shows the general format of a valid CSV file. The ml_use column is optional.

[ml_use],gcs_file_uri|"inline_text",label1,label2,...

The following snippet is an example of an input CSV file.

test,gs://path_to_file,label1,label2
test,"inline_text",label3
training,gs://path_to_file,label1,label2,label3
validation,gs://path_to_file,label4,label5

Entity extraction

Entity extraction training data consists of documents that are annotated with the labels that identify the types of entities that you want your model to identify. For example, you might create an entity extraction model to identify specialized terminology in legal documents or patents. Annotations specify the locations of the entities that you're labeling and the labels themselves.

If you're annotating structured or semi-structured documents for a dataset used to train AutoML models, such as invoices or contracts, Vertex AI can consider an annotation's position on the page as a factor contributing to its proper label. For example, a real estate contract has both an acceptance date and a closing date. Vertex AI can learn to distinguish between the entities based on the spatial position of the annotation.

Data requirements

  • You must supply at least 50, and no more than 100,000, training documents.
  • You must supply at least 1, and no more than 100, unique labels to annotate entities that you want to extract.
  • You can use a label to annotate between 1 and 10 words.
  • Label names can be between 2 and 30 characters.
  • You can include annotations in your JSON Lines files, or you can add annotations later by using the Google Cloud Console after uploading documents.
  • You can include documents inline or reference TXT files that are in Cloud Storage buckets.

Best practices for text data used to train AutoML models

The following recommendations apply to datasets used to train AutoML models.

  • Use each label at least 200 times in your training dataset.
  • Annotate every occurrence of entities that you want your model to identify.

Input files

Input file types for entity extraction must be JSON Lines. The format, field names, and value types for JSON Lines files are determined by a schema file, which is a publicly accessible YAML file.

You can download the schema file for entity extraction from the following Cloud Storage location:
gs://google-cloud-aiplatform/schema/dataset/ioformat/text_extraction_io_format_1.0.0.yaml

The following example shows how you might use the schema to create your own JSON Lines file. The example includes line breaks for readability. In your JSON Lines files, include line breaks only after each document. The dataItemResourceLabels field is optional and can specify values such as ml_use.



{
    "textSegmentAnnotations": [
      {
        "startOffset":number,
        "endOffset":number,
        "displayName": "label"
      },
      ...
    ],
    "textContent": "inline_text"|"textGcsUri": "gcs_uri_to_file",
    "dataItemResourceLabels": {
      "aiplatform.googleapis.com/ml_use": "training|test|validation"
    }
}
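Computing startOffset and endOffset by hand is error-prone, so it helps to locate each entity phrase programmatically. The sketch below uses character offsets into textContent and assumes each phrase occurs once in the document; confirm the exact offset semantics against the schema file. The `annotate` helper and the example entity are hypothetical.

```python
import json

def annotate(text, phrases):
    """Build textSegmentAnnotations for (phrase, label) pairs by locating
    each phrase in the text. A sketch: uses character offsets and assumes
    each phrase appears exactly once."""
    annotations = []
    for phrase, label in phrases:
        start = text.find(phrase)
        if start == -1:
            continue  # phrase not present; skip rather than guess
        annotations.append({
            "startOffset": start,
            "endOffset": start + len(phrase),
            "displayName": label,
        })
    return annotations

text = "The closing date is June 1, 2024."
record = {
    "textSegmentAnnotations": annotate(text, [("June 1, 2024", "closing_date")]),
    "textContent": text,
    "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"},
}
line = json.dumps(record)  # one JSON Lines record, ready to write
```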

You can also annotate documents by using the Google Cloud Console. Create a JSON Lines file with content only (without the textSegmentAnnotations field); documents are uploaded to Vertex AI without any annotations.

Sentiment analysis

Sentiment analysis training data consists of documents that are associated with a sentiment value that indicates the sentiment of the content. For example, you might have tweets about a particular domain such as air travel. Each tweet is associated with a sentiment value that indicates whether the tweet is positive, negative, or neutral.

Data requirements

  • You must supply at least 10, but no more than 100,000, total training documents.
  • A sentiment value must be an integer from 0 to 10. The maximum sentiment value is your choice. For example, if you want to identify whether the sentiment is negative, positive, or neutral, you can label the training data with sentiment scores of 0 (negative), 1 (neutral), and 2 (positive). The maximum sentiment score for this dataset is 2. If you want to capture more granularity, such as five levels of sentiment, you can label documents from 0 (most negative) to 4 (most positive).
  • You must apply each sentiment value to at least 10 documents.
  • Sentiment score values must be consecutive integers starting from zero. If you have gaps in scores or don't start from zero, remap your scores to be consecutive integers starting from zero.
  • You can include documents inline or reference TXT files that are in Cloud Storage buckets.
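If your existing scores have gaps or don't start from zero, the remapping required by the last-but-one rule above is a one-liner. A minimal sketch, with a hypothetical score list:

```python
def remap_scores(scores):
    """Remap arbitrary sentiment scores to consecutive integers starting
    from zero, preserving their order (e.g. 1, 3, 5 -> 0, 1, 2)."""
    mapping = {old: new for new, old in enumerate(sorted(set(scores)))}
    return [mapping[s] for s in scores], mapping

# Hypothetical scores with gaps: 1, 3, 5 become 0, 1, 2.
scores, mapping = remap_scores([1, 5, 3, 5, 1])
```

Apply the same mapping to every document so relative sentiment ordering is preserved, and use the largest remapped value as sentimentMax.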

Best practices for text data used to train AutoML models

The following recommendations apply to datasets used to train AutoML models.

  • Provide at least 100 documents per sentiment value.
  • Use a balanced number of documents for each sentiment score. Having more examples for particular sentiment scores can introduce bias into the model.

Input files

Input file types for sentiment analysis can be JSON Lines or CSV.

JSON Lines

The format, field names, and value types for JSON Lines files are determined by a schema file, which is a publicly accessible YAML file.

You can download the schema file for sentiment analysis from the following Cloud Storage location:
gs://google-cloud-aiplatform/schema/dataset/ioformat/text_sentiment_io_format_1.0.0.yaml

JSON Lines example

The following example shows how you might use the schema to create your own JSON Lines file. The example includes line breaks for readability. In your JSON Lines files, include line breaks only after each document. The dataItemResourceLabels field is optional and can specify values such as ml_use.



{
  "sentimentAnnotation": {
    "sentiment": number,
    "sentimentMax": number
  },
  "textContent": "inline_text",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
},
{
  "sentimentAnnotation": {
    "sentiment": number,
    "sentimentMax": number
  },
  "textGcsUri": "gcs_uri_to_file",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
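As with classification, a short script can generate sentiment records. The sketch below uses a hypothetical three-level scale (0 = negative, 1 = neutral, 2 = positive) and example tweets; note that sentimentMax must be the same for every document in the dataset.

```python
import json

SENTIMENT_MAX = 2  # hypothetical scale: 0 negative, 1 neutral, 2 positive

# Hypothetical tweets with scores on the scale above.
examples = [
    ("The flight left right on time, great service!", 2, "training"),
    ("Lost my luggage again. Never flying with them.", 0, "test"),
]

with open("sentiment.jsonl", "w") as f:
    for text, sentiment, ml_use in examples:
        record = {
            "sentimentAnnotation": {
                "sentiment": sentiment,
                "sentimentMax": SENTIMENT_MAX,
            },
            "textContent": text,
            "dataItemResourceLabels": {
                "aiplatform.googleapis.com/ml_use": ml_use
            },
        }
        f.write(json.dumps(record) + "\n")
```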

CSV

Each line in a CSV file refers to a single document. The following example shows the general format of a valid CSV file. The ml_use column is optional.

[ml_use],gcs_file_uri|"inline_text",sentiment,sentimentMax

The following snippet is an example of an input CSV file.

test,gs://path_to_file,sentiment_value,sentiment_max_value
test,"inline_text",sentiment_value,sentiment_max_value
training,gs://path_to_file,sentiment_value,sentiment_max_value
validation,gs://path_to_file,sentiment_value,sentiment_max_value

What's next