This page describes how to prepare text data for use in a Vertex AI dataset to train single-label and multi-label classification models.
Single-label classification
For single-label classification, training data consists of documents and the classification category that applies to each document. Single-label classification allows a document to be assigned only one label.
Data requirements
- You must supply at least 20, and no more than 1,000,000, training documents.
- You must supply at least 2, and no more than 5000, unique category labels.
- You must apply each label to at least 10 documents.
- For single-label classification, you can apply only one label to a document. To apply multiple labels to a document, use multi-label classification.
- You can include documents inline or reference TXT files that are in Cloud Storage buckets.
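The limits above can be checked before import. The following is a minimal sketch, not part of Vertex AI: the function name check_requirements and its input (one label string per training document) are assumptions made for illustration.

```python
from collections import Counter

# Limits taken from the data requirements listed above.
MIN_DOCS, MAX_DOCS = 20, 1_000_000
MIN_LABELS, MAX_LABELS = 2, 5000
MIN_DOCS_PER_LABEL = 10


def check_requirements(labels):
    """Return a list of problems; an empty list means none were found.

    `labels` holds one label string per training document (hypothetical input).
    """
    problems = []
    counts = Counter(labels)
    if not MIN_DOCS <= len(labels) <= MAX_DOCS:
        problems.append(f"document count {len(labels)} is outside {MIN_DOCS}..{MAX_DOCS}")
    if not MIN_LABELS <= len(counts) <= MAX_LABELS:
        problems.append(f"label count {len(counts)} is outside {MIN_LABELS}..{MAX_LABELS}")
    for label, n in sorted(counts.items()):
        if n < MIN_DOCS_PER_LABEL:
            problems.append(f"label {label!r} has only {n} documents")
    return problems
```

Running a check like this on the label column of your data can surface under-represented labels before you start an import.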
Best practices for text data used to train AutoML models
The following recommendations apply to datasets used to train AutoML models.
- Use training data that is as varied as the data on which predictions will be made. Include different lengths of documents, documents authored by different people, documents that use different wording or style, and so on.
- Use documents that can be easily categorized by a human reader. AutoML models can't generally predict category labels that humans can't assign. So, if a human can't be trained to assign a label by reading a document, your model likely can't be trained to do it either.
- Provide as many training documents per label as possible. You can improve the confidence scores from your model by using more examples per label. Train a model using 50 examples per label and evaluate the results. Add more examples and retrain until you meet your accuracy targets, which might require hundreds or even 1000 examples per label.
- The model works best when there are at most 100 times more documents for the most common label than for the least common label. We recommend removing very low frequency labels.
- Consider including an out-of-domain label (for example, None_of_the_above) for documents that don't match any of your defined labels. For example, if you only labeled documents about arts and entertainment, but your dataset contains documents about other subjects, such as sports or technology, label the documents about other subjects as None_of_the_above. Without such a label, the trained model will attempt to assign all documents to one of the defined labels, even documents for which those labels are unsuitable.
- If you have a large number of documents that don't currently match your labels, filter them out so that your model doesn't skew predictions to an out-of-domain label. For example, you could have a filtering model that predicts whether a document fits within the current set of labels or is out of domain. After filtering, you would have another model that classifies in-domain documents only.
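The 100-to-1 guidance on label frequency can be checked with a short script. This is a sketch under stated assumptions: the function name imbalance_report and its return shape are hypothetical, and the input is a non-empty list with one label per document.

```python
from collections import Counter

MAX_RATIO = 100  # most common label vs. least common label, per the guidance above


def imbalance_report(labels, max_ratio=MAX_RATIO):
    """Report the largest-to-smallest label ratio and the labels that exceed it.

    Assumes `labels` is non-empty, with one label string per document.
    """
    counts = Counter(labels)
    largest = max(counts.values())
    smallest = min(counts.values())
    # Labels that the most common label outnumbers by more than max_ratio.
    rare = sorted(label for label, n in counts.items() if largest > max_ratio * n)
    return {"ratio": largest / smallest, "rare_labels": rare}
```

Labels listed in rare_labels are candidates for removal or for collecting more examples.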
Input files
Single-label classification supports JSON Lines or CSV input files. You can specify only one label (annotation) for a given document. The following sections describe the input files and provide examples for each file type.
JSON Lines
The format, field names, and value types for JSON Lines files are determined by a schema file, which is a publicly accessible YAML file.
You can download the schema file for single-label classification from the following Cloud Storage location:
gs://google-cloud-aiplatform/schema/dataset/ioformat/text_classification_single_label_io_format_1.0.0.yaml
JSON Lines example
The following example shows how you might use the schema to create your own JSON Lines file. The example includes line breaks for readability. In your JSON Lines files, include line breaks only after each document. The dataItemResourceLabels field is optional and specifies, for example, ml_use.
{
  "classificationAnnotation": {
    "displayName": "label"
  },
  "textContent": "inline_text",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
{
  "classificationAnnotation": {
    "displayName": "label2"
  },
  "textGcsUri": "gcs_uri_to_file",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
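Records in this shape can be generated rather than written by hand. The sketch below uses only the field names from the example above; the helper name to_jsonl_line is hypothetical, and treating any value that starts with gs:// as a Cloud Storage URI is an assumption.

```python
import json


def to_jsonl_line(text_or_uri, label, ml_use=None):
    """Build one JSON Lines record for a single-label document."""
    record = {"classificationAnnotation": {"displayName": label}}
    # Assumption: a value starting with gs:// is a Cloud Storage URI;
    # anything else is treated as inline text.
    if text_or_uri.startswith("gs://"):
        record["textGcsUri"] = text_or_uri
    else:
        record["textContent"] = text_or_uri
    if ml_use is not None:
        record["dataItemResourceLabels"] = {"aiplatform.googleapis.com/ml_use": ml_use}
    # json.dumps escapes embedded newlines, so each record stays on one line.
    return json.dumps(record)
```

Writing one such line per document, each followed by a newline, gives a file with line breaks only after each document.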
CSV
Each line in a CSV file refers to a single document. The following example shows the general format of a valid CSV file. The ml_use column is optional.
[ml_use],gcs_file_uri|"inline_text",label
The following snippet is an example of an input CSV file.
test,gs://path_to_file,label1
test,"inline_text",label2
training,gs://path_to_file,label3
validation,gs://path_to_file,label1
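A line in this format can be assembled with a small helper. This is a sketch, not a documented API: the name csv_line is hypothetical, doubling embedded quotation marks follows the common CSV convention and is assumed to be accepted here, and labels and URIs are assumed to contain no commas.

```python
def csv_line(text_or_uri, label, ml_use=None):
    """Build one CSV line in the form [ml_use],gcs_file_uri|"inline_text",label."""
    if text_or_uri.startswith("gs://"):
        # Cloud Storage URIs are written unquoted, as in the example above.
        content = text_or_uri
    else:
        # Inline text is wrapped in quotes; embedded quotes are doubled
        # (common CSV convention, assumed here).
        content = '"' + text_or_uri.replace('"', '""') + '"'
    fields = [content, label]
    if ml_use is not None:
        fields.insert(0, ml_use)
    return ",".join(fields)
```

For example, csv_line("gs://path_to_file", "label1", "test") produces the first line of the snippet above.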
Multi-label classification
For multi-label classification, training data consists of documents and the classification categories that apply to those documents. Multi-label classification allows a document to be assigned one or more labels.
Data requirements
- You must supply at least 20, and no more than 1,000,000, training documents.
- You must supply at least 2, and no more than 5000, unique category labels.
- You must apply each label to at least 10 documents.
- For multi-label classification, you can apply one or multiple labels to a document.
- You can include documents inline or reference TXT files that are in Cloud Storage buckets.
Best practices for text data used to train AutoML models
The following recommendations apply to datasets used to train AutoML models.
- Use training data that is as varied as the data on which predictions will be made. Include different lengths of documents, documents authored by different people, documents that use different wording or style, and so on.
- Use documents that can be easily categorized by a human reader. AutoML models can't generally predict category labels that humans can't assign. So, if a human can't be trained to assign a label by reading a document, your model likely can't be trained to do it either.
- When using multi-label classification, apply all relevant labels to each document. For example, if you are labeling documents that provide details about pharmaceuticals, you might have a document that is labeled with Dosage and Side Effects.
- Provide as many training documents per label as possible. You can improve the confidence scores from your model by using more examples per label. Better confidence scores are especially helpful when your model returns multiple labels when it classifies a document. Train a model using 50 examples per label and evaluate the results. Add more examples and retrain until you meet your accuracy targets, which might require hundreds or even 1000 examples per label.
- The model works best when there are at most 100 times more documents for the most common label than for the least common label. We recommend removing very low frequency labels.
- Consider including an out-of-domain label (for example, None_of_the_above) for documents that don't match any of your defined labels. For example, if you only labeled documents about arts and entertainment, but your dataset contains documents about other subjects, such as sports or technology, label the documents about other subjects as None_of_the_above. Without such a label, the trained model will attempt to assign all documents to one of the defined labels, even documents for which those labels are unsuitable.
- If you have a large number of documents that don't currently match your labels, filter them out so that your model doesn't skew predictions to an out-of-domain label. For example, you could have a filtering model that predicts whether a document fits within the current set of labels or is out of domain. After filtering, you would have another model that classifies in-domain documents only.
Input files
Multi-label classification supports JSON Lines or CSV input files. You can specify more than one label (annotation) for a given document. The following sections describe the input files and provide examples for each file type.
JSON Lines
The format, field names, and value types for JSON Lines files are determined by a schema file, which is a publicly accessible YAML file.
You can download the schema file for multi-label classification from the following Cloud Storage location:
gs://google-cloud-aiplatform/schema/dataset/ioformat/text_classification_multi_label_io_format_1.0.0.yaml
JSON Lines example
The following example shows how you might use the schema to create your own JSON Lines file. The example includes line breaks for readability. In your JSON Lines files, include line breaks only after each document. The dataItemResourceLabels field is optional and specifies, for example, ml_use.
{
  "classificationAnnotations": [
    { "displayName": "label1" },
    { "displayName": "label2" }
  ],
  "textGcsUri": "gcs_uri_to_file",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
{
  "classificationAnnotations": [
    { "displayName": "label2" },
    { "displayName": "label3" }
  ],
  "textContent": "inline_text",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
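Before importing, it can help to read a multi-label JSON Lines file back and confirm that each record has labels and content. The sketch below relies only on the field names shown in the example above; the function name read_multi_label_jsonl is hypothetical, and requiring either textContent or textGcsUri on every record is an assumption drawn from the example rather than from the schema file.

```python
import json

ML_USE_KEY = "aiplatform.googleapis.com/ml_use"


def read_multi_label_jsonl(lines):
    """Yield (content, labels, ml_use) for each non-empty JSON Lines record."""
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        labels = [a["displayName"] for a in record.get("classificationAnnotations", [])]
        if not labels:
            raise ValueError(f"line {number}: no classificationAnnotations")
        # Assumption: each record carries either inline text or a Cloud Storage URI.
        content = record.get("textContent") or record.get("textGcsUri")
        if content is None:
            raise ValueError(f"line {number}: no textContent or textGcsUri")
        ml_use = record.get("dataItemResourceLabels", {}).get(ML_USE_KEY)
        yield content, labels, ml_use
```

Because the function accepts any iterable of strings, an open file object can be passed directly.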
CSV
Each line in a CSV file refers to a single document. The following example shows the general format of a valid CSV file. The ml_use column is optional.
[ml_use],gcs_file_uri|"inline_text",label1,label2,...
The following snippet is an example of an input CSV file.
test,gs://path_to_file,label1,label2
test,"inline_text",label3
training,gs://path_to_file,label1,label2,label3
validation,gs://path_to_file,label4,label5
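Rows in this format have a variable number of columns, so a quick parse can confirm how each row will be read. The following is a sketch using Python's standard csv module. The function name parse_multi_label_csv is hypothetical, and it assumes that a first field equal to training, test, or validation is the optional ml_use column.

```python
import csv

ML_USE_VALUES = {"training", "test", "validation"}


def parse_multi_label_csv(lines):
    """Parse rows of the form [ml_use],gcs_file_uri|"inline_text",label1,label2,..."""
    rows = []
    for fields in csv.reader(lines):
        if not fields:
            continue  # skip blank lines
        # Assumption: a first field equal to a known ml_use value is the optional column.
        ml_use = fields.pop(0) if fields[0] in ML_USE_VALUES else None
        if len(fields) < 2:
            raise ValueError(f"row needs content and at least one label: {fields!r}")
        rows.append({"ml_use": ml_use, "content": fields[0], "labels": fields[1:]})
    return rows
```

Note that under this assumption, an inline document whose entire text is one of the three ml_use values would be misread when the ml_use column is omitted.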
Last updated 2024-10-31 UTC.