Prepare text training data for classification

This page describes how to prepare text data for use in a Vertex AI dataset to train single-label and multi-label classification models.

Single-label classification

For single-label classification, training data consists of documents and the classification category that applies to each document. Single-label classification allows a document to be assigned only one label.

Data requirements

  • You must supply at least 20, and no more than 1,000,000, training documents.
  • You must supply at least 2, and no more than 5000, unique category labels.
  • You must apply each label to at least 10 documents.
  • For single-label classification, you can apply only one label to each document.
  • You can include documents inline or reference TXT files that are in Cloud Storage buckets.
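The limits above can be checked locally before you import a dataset. The following is a minimal sketch, not part of Vertex AI: the function name check_single_label_dataset and its input shape (one label string per document) are assumptions made for illustration.

```python
from collections import Counter

# Limits copied from the requirements listed above.
MIN_DOCS, MAX_DOCS = 20, 1_000_000
MIN_LABELS, MAX_LABELS = 2, 5000
MIN_DOCS_PER_LABEL = 10


def check_single_label_dataset(labels):
    """Return a list of problems; an empty list means no violation was found.

    `labels` holds one label string per document (hypothetical input shape).
    """
    problems = []
    if not MIN_DOCS <= len(labels) <= MAX_DOCS:
        problems.append(
            f"document count {len(labels)} is outside {MIN_DOCS}..{MAX_DOCS}"
        )
    counts = Counter(labels)
    if not MIN_LABELS <= len(counts) <= MAX_LABELS:
        problems.append(
            f"unique label count {len(counts)} is outside {MIN_LABELS}..{MAX_LABELS}"
        )
    for label, n in sorted(counts.items()):
        if n < MIN_DOCS_PER_LABEL:
            problems.append(f"label {label!r} is applied to only {n} documents")
    return problems
```

Calling the function on a list with 15 documents labeled "a" and 5 labeled "b" reports one problem, because "b" falls below the 10-document minimum.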

Best practices for text data used to train AutoML models

The following recommendations apply to datasets used to train AutoML models.

  • Use training data that is as varied as the data on which predictions will be made. Include different lengths of documents, documents authored by different people, documents that use different wording or style, and so on.
  • Use documents that can be easily categorized by a human reader. AutoML models generally can't predict category labels that humans can't assign. So, if a human can't be trained to assign a label by reading a document, your model likely can't be trained to do it either.
  • Provide as many training documents per label as possible. You can improve the confidence scores from your model by using more examples per label. Train a model using 50 examples per label and evaluate the results. Add more examples and retrain until you meet your accuracy targets, which might require hundreds or even 1000 examples per label.
  • The model works best when there are at most 100 times more documents for the most common label than for the least common label. We recommend removing very low frequency labels.
  • Consider including an out-of-domain label (for example, None_of_the_above) for documents that don't match any of your defined labels. For example, if you only labeled documents about arts and entertainment, but your dataset contains documents about other subjects, such as sports or technology, label the documents about other subjects as None_of_the_above. Without such a label, the trained model will attempt to assign all documents to one of the defined labels, even documents for which those labels are unsuitable.
  • If you have a large number of documents that don't currently match your labels, filter them out so that your model doesn't skew predictions to an out-of-domain label. For example, you could have a filtering model that predicts whether a document fits within the current set of labels or is out of domain. After filtering, you would have another model that classifies in-domain documents only.
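The 100-to-1 guideline in the list above is straightforward to measure. The sketch below is one possible way to do it, assuming the labels are available as a plain list with one entry per document; imbalance_report is a hypothetical helper name, not a Vertex AI function.

```python
from collections import Counter

MAX_RATIO = 100  # guideline from the recommendations above


def imbalance_report(labels, max_ratio=MAX_RATIO):
    """Return (ratio, rare_labels) for a list with one label per document.

    ratio is the count of the most common label divided by the count of the
    least common one. rare_labels lists every label whose count is more than
    `max_ratio` times smaller than the most common label's count.
    """
    counts = Counter(labels)
    if not counts:
        return 0.0, []
    most_common = max(counts.values())
    ratio = most_common / min(counts.values())
    rare = sorted(
        label for label, n in counts.items() if most_common > max_ratio * n
    )
    return ratio, rare
```

Labels returned in rare_labels are candidates for removal or for collecting more examples.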

Input files

Single-label classification supports JSON Lines or CSV input files. You can specify only one label (annotation) for a given document. The following sections describe the input files and provide examples for each file type.

JSON Lines

The format, field names, and value types for JSON Lines files are determined by a schema file, which is a publicly accessible YAML file.

You can download the schema file for single-label classification from the following Cloud Storage location:
gs://google-cloud-aiplatform/schema/dataset/ioformat/text_classification_single_label_io_format_1.0.0.yaml

JSON Lines example

The following example shows how you might use the schema to create your own JSON Lines file. The example includes line breaks for readability. In your JSON Lines files, put each document on a single line and include line breaks only after each document. The dataItemResourceLabels field is optional. You can use it to specify, for example, ml_use, set to one of training, test, or validation.

{
  "classificationAnnotation": {
    "displayName": "label"
  },
  "textContent": "inline_text",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
{
  "classificationAnnotation": {
    "displayName": "label2"
  },
  "textGcsUri": "gcs_uri_to_file",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
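Because each document must occupy exactly one line, it is usually safer to generate the file with a JSON serializer than by hand. The sketch below uses only the field names shown in the example above; the helper name to_jsonl_line and its parameters are assumptions made for illustration.

```python
import json


def to_jsonl_line(label, text=None, gcs_uri=None, ml_use=None):
    """Serialize one single-label document as one line of JSON.

    Exactly one of `text` (inline content) or `gcs_uri` must be given.
    `ml_use` is optional, matching the optional dataItemResourceLabels field.
    """
    if (text is None) == (gcs_uri is None):
        raise ValueError("provide exactly one of text or gcs_uri")
    record = {"classificationAnnotation": {"displayName": label}}
    if text is not None:
        record["textContent"] = text
    else:
        record["textGcsUri"] = gcs_uri
    if ml_use is not None:
        record["dataItemResourceLabels"] = {
            "aiplatform.googleapis.com/ml_use": ml_use
        }
    # json.dumps escapes newlines inside strings, so the result is one line.
    return json.dumps(record, ensure_ascii=False)
```

To build a file, write each returned string followed by a single newline character.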

CSV

Each line in a CSV file refers to a single document. The following example shows the general format of a valid CSV file. The ml_use column is optional.

[ml_use],gcs_file_uri|"inline_text",label

The following snippet is an example of an input CSV file.

test,gs://path_to_file,label1
test,"inline_text",label2
training,gs://path_to_file,label3
validation,gs://path_to_file,label1
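To sanity-check a CSV file before importing it, you can read it back with a standard CSV parser. The following sketch assumes that a three-column row starts with ml_use and that any content value beginning with gs:// is a Cloud Storage URI; both rules, and the name parse_single_label_rows, are illustrative assumptions rather than documented import behavior.

```python
import csv
import io


def parse_single_label_rows(csv_text):
    """Parse single-label CSV text into a list of dictionaries.

    Each row is either `content,label` or `ml_use,content,label`.
    """
    documents = []
    for number, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) == 3:
            ml_use, content, label = row
        elif len(row) == 2:
            ml_use = None
            content, label = row
        else:
            raise ValueError(f"row {number}: expected 2 or 3 columns, got {len(row)}")
        # Assumption: values that start with gs:// are file references.
        kind = "gcs_uri" if content.startswith("gs://") else "inline_text"
        documents.append({"ml_use": ml_use, kind: content, "label": label})
    return documents
```

A row with more than three columns raises an error here, which catches a document that was accidentally given more than one label.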

Multi-label classification

For multi-label classification, training data consists of documents and the classification categories that apply to those documents. Multi-label classification allows a document to be assigned one or more labels.

Data requirements

  • You must supply at least 20, and no more than 1,000,000, training documents.
  • You must supply at least 2, and no more than 5000, unique category labels.
  • You must apply each label to at least 10 documents.
  • For multi-label classification, you can apply one or multiple labels to a document.
  • You can include documents inline or reference TXT files that are in Cloud Storage buckets.
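With multiple labels per document, one document counts toward the minimum for every label applied to it. The sketch below shows one way to find labels that fall short of the 10-document minimum; underrepresented_labels and its input shape (one collection of labels per document) are hypothetical.

```python
from collections import Counter

MIN_DOCS_PER_LABEL = 10  # from the requirements above


def underrepresented_labels(label_sets, minimum=MIN_DOCS_PER_LABEL):
    """Return {label: document_count} for labels on fewer than `minimum` documents.

    `label_sets` holds one collection of labels per document.
    """
    counts = Counter()
    for labels in label_sets:
        counts.update(set(labels))  # a repeated label on one document counts once
    return {label: n for label, n in counts.items() if n < minimum}
```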

Best practices for text data used to train AutoML models

The following recommendations apply to datasets used to train AutoML models.

  • Use training data that is as varied as the data on which predictions will be made. Include different lengths of documents, documents authored by different people, documents that use different wording or style, and so on.
  • Use documents that can be easily categorized by a human reader. AutoML models generally can't predict category labels that humans can't assign. So, if a human can't be trained to assign a label by reading a document, your model likely can't be trained to do it either.
  • When using multi-label classification, apply all relevant labels to each document. For example, if you are labeling documents that provide details about pharmaceuticals, you might have a document that is labeled with Dosage and Side Effects.
  • Provide as many training documents per label as possible. You can improve the confidence scores from your model by using more examples per label. Better confidence scores are especially helpful when your model returns multiple labels when it classifies a document. Train a model using 50 examples per label and evaluate the results. Add more examples and retrain until you meet your accuracy targets, which might require hundreds or even 1000 examples per label.
  • The model works best when there are at most 100 times more documents for the most common label than for the least common label. We recommend removing very low frequency labels.
  • Consider including an out-of-domain label (for example, None_of_the_above) for documents that don't match any of your defined labels. For example, if you only labeled documents about arts and entertainment, but your dataset contains documents about other subjects, such as sports or technology, label the documents about other subjects as None_of_the_above. Without such a label, the trained model will attempt to assign all documents to one of the defined labels, even documents for which those labels are unsuitable.
  • If you have a large number of documents that don't currently match your labels, filter them out so that your model doesn't skew predictions to an out-of-domain label. For example, you could have a filtering model that predicts whether a document fits within the current set of labels or is out of domain. After filtering, you would have another model that classifies in-domain documents only.

Input files

Multi-label classification supports JSON Lines or CSV input files. You can specify more than one label (annotation) for a given document. The following sections describe the input files and provide examples for each file type.

JSON Lines

The format, field names, and value types for JSON Lines files are determined by a schema file, which is a publicly accessible YAML file.

You can download the schema file for multi-label classification from the following Cloud Storage location:
gs://google-cloud-aiplatform/schema/dataset/ioformat/text_classification_multi_label_io_format_1.0.0.yaml

JSON Lines example

The following example shows how you might use the schema to create your own JSON Lines file. The example includes line breaks for readability. In your JSON Lines files, put each document on a single line and include line breaks only after each document. The dataItemResourceLabels field is optional. You can use it to specify, for example, ml_use, set to one of training, test, or validation.

{
  "classificationAnnotations": [{
    "displayName": "label1"
    },{
    "displayName": "label2"
  }],
  "textGcsUri": "gcs_uri_to_file",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
{
  "classificationAnnotations": [{
    "displayName": "label2"
    },{
    "displayName": "label3"
  }],
  "textContent": "inline_text",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
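Reading a finished file back is a quick way to confirm that every line is a complete JSON document with the labels you expect. The sketch below relies only on the field names shown in the example above; read_multi_label_jsonl is a hypothetical helper name.

```python
import json


def read_multi_label_jsonl(lines):
    """Yield (content, labels) pairs from an iterable of JSON Lines strings.

    `content` is the inline text or the Cloud Storage URI of the document.
    `labels` is the list of displayName values attached to it.
    """
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)  # fails if a document spans several lines
        content = record.get("textContent", record.get("textGcsUri"))
        if content is None:
            raise ValueError(f"line {number}: no textContent or textGcsUri")
        annotations = record.get("classificationAnnotations", [])
        yield content, [a["displayName"] for a in annotations]
```

Passing an open file object as `lines` works, because iterating over a text file yields one line at a time.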

CSV

Each line in a CSV file refers to a single document. The following example shows the general format of a valid CSV file. The ml_use column is optional.

[ml_use],gcs_file_uri|"inline_text",label1,label2,...

The following snippet is an example of an input CSV file.

test,gs://path_to_file,label1,label2
test,"inline_text",label3
training,gs://path_to_file,label1,label2,label3
validation,gs://path_to_file,label4,label5
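Rows in this format have a variable number of label columns, so a small formatter can help keep them consistent. The sketch below quotes inline text by doubling embedded quotation marks, the common CSV convention, and leaves URIs and labels bare as in the samples above. The helper name multi_label_csv_row, the rejection of labels that contain commas, quotation marks, or line breaks, and the check for at least one label are assumptions of this sketch.

```python
def multi_label_csv_row(labels, text=None, gcs_uri=None, ml_use=None):
    """Format one multi-label document as a CSV row (without a trailing newline).

    Exactly one of `text` (inline content) or `gcs_uri` must be given.
    """
    if (text is None) == (gcs_uri is None):
        raise ValueError("provide exactly one of text or gcs_uri")
    if not labels:
        raise ValueError("at least one label is required")
    for label in labels:
        if any(ch in label for ch in ',"\r\n'):
            raise ValueError(f"label {label!r} contains a reserved character")
    if text is None:
        content = gcs_uri
    else:
        # Quote inline text and double any embedded quotation marks.
        content = '"' + text.replace('"', '""') + '"'
    fields = ([ml_use] if ml_use else []) + [content] + list(labels)
    return ",".join(fields)
```

For example, multi_label_csv_row(["label1", "label2"], gcs_uri="gs://path_to_file", ml_use="test") produces the first row of the sample file above.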