To train your custom model, you provide representative samples of the type
of documents you want to analyze, labeled in the way you want AutoML Natural Language
to label similar documents. The quality of your training data strongly impacts
the effectiveness of the model you create, and by extension, the quality of the
predictions returned from that model.
Collecting and labeling training documents
The first step is to collect a diverse set of training documents that reflects
the range of documents you want the custom model to handle. The preparation
steps for training documents differ depending on whether you're training a
model for classification, entity extraction, or sentiment analysis.
Classification
For classification models, your training data consists of documents and the
classification categories that apply to those documents.
Documents. You must supply at least 20 and no more than 1,000,000 training
documents containing the content used to train your custom model.
Documents can be in text, PDF, or TIFF format, or compressed into a ZIP file.
Category labels. You must supply at least 2 and no more than 5,000
unique labels. You must apply each label to at least 10 documents.
Providing quality training data
Try to make your training data as varied as the data on which
predictions will be made. Include different lengths of documents, documents
authored by different people, documents that use different wording or style,
and so on.
Use documents that can be easily categorized by a human reader.
AutoML Natural Language models can't generally predict labels that humans
can't assign. So, if a human can't be trained to assign a label by reading
a document, your model likely can't be trained to
do it either.
When using multi-label classification, apply all relevant labels to each document.
For example, if you are labeling documents that provide details about pharmaceuticals,
you might have labels for Dosage and Side Effects.
If a document includes both types of information, ensure that you apply both labels.
We recommend providing as many training documents per label as possible. The minimum
number of documents per label is 10. However, you can improve the
confidence scores from your model by using more examples per label.
Better confidence scores are especially helpful when your model returns
multiple labels when it classifies a document. Train a model using
50 examples per label and evaluate the results. Add more examples and retrain
until you meet your accuracy targets, which may require hundreds or even 1000
examples per label.
The model works best when there are at most 100 times more documents for the
most common label than for the least common label. We recommend removing
very low frequency labels.
Consider including a None_of_the_above label for documents
that don't match any of your defined labels. For example, if you only labeled
documents about arts and entertainment, but your dataset contains documents
about other subjects, such as sports or technology, label the documents
about other subjects as None_of_the_above. Without such a label,
the trained model will attempt to assign all documents to one of the defined labels, even
documents for which those labels are unsuitable.
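For example, rows in the import CSV file described later in this guide might pair on-topic and off-topic documents like this (the file names are illustrative):
TRAIN,gs://my-project-lcm/training-data/jazz_review.txt,Arts_Entertainment
TRAIN,gs://my-project-lcm/training-data/stock_report.txt,None_of_the_above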
You can use a label with a different name that has the same meaning as None_of_the_above.
See the next section if you have a lot of None_of_the_above content.
Dealing with "out of domain" documents
Suppose your long-term plan is to train a model that classifies corporate documents
based on their document type (invoice, business plan, policy document, non-disclosure
agreement, and so on). There are thousands of document types, but for testing purposes
you start by training a model that identifies 100 types, with plans to train more
comprehensive models in the future. During this early stage, most documents sent
for classification will be "out of domain" for the initial label set; that is, they
are document types outside of the initial 100 types. If you train a model with the
initial 100 labels and use it with all of your documents, the model will attempt to
classify the "out of domain" documents using one of the existing labels, making it less accurate.
In scenarios where you expect your set of labels to expand over time, we recommend
training two models using the initial smaller label set:
Classification model: A model that classifies documents into the current
set of labels
Filtering model: A model that predicts whether a document fits within the
current set of labels or is "out of domain"
Submit each document to the filtering model first, and send only "in domain"
documents to the classification model.
With the example described above, the classification model identifies the type of
a document and the filtering model makes a binary prediction about whether a
document belongs to any of the 100 types for which the classification model has
labels.
To train the filtering model, use the same set of documents you used for the
classification model, except label each document as "in domain" instead of using
a specific label from your set. Add an equivalent number of documents for which
the current label set is not appropriate, and label them as "out of domain."
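For illustration, here is a minimal Python sketch of the two-stage flow. The helper predict_top_label is hypothetical; it stands in for whichever prediction call you use (for example, a call to the AutoML Natural Language prediction API) and is assumed to return the highest-confidence label for a document.

def classify_with_filter(document_text, filter_model, classification_model, predict_top_label):
    """Two-stage prediction: filter "out of domain" documents, then classify the rest.

    predict_top_label(model, text) is a hypothetical helper that returns the
    highest-confidence label string predicted by the given model for text.
    """
    # Stage 1: the filtering model makes a binary "in domain" / "out of domain" call.
    if predict_top_label(filter_model, document_text) == "out_of_domain":
        return None  # skip documents the classification model was never trained on
    # Stage 2: only "in domain" documents reach the classification model.
    return predict_top_label(classification_model, document_text)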
Entity extraction
To train an entity extraction model, you provide representative samples of the type of content
you want to analyze, annotated with labels that identify the types of entities
you want AutoML Natural Language to identify.
You supply between 50 and 100,000 documents to use for training your custom model.
You use between one and 100 unique labels to annotate the entities you want the model
to learn to extract. Each annotation is a span of text and an associated label.
Label names can be between 2 and 30 characters, and can be used
to annotate between one and 10 words. We recommend using each label at least 200
times in your training data set.
If you are annotating a structured or semi-structured document type, such as invoices
or contracts, AutoML Natural Language can consider an annotation's position on the
page as a factor contributing to its proper label. For example, a real estate contract
has both an acceptance date and a closing date, and AutoML Natural Language can learn
to distinguish between the entities based on the spatial position of the annotation.
Formatting training documents
You upload training data to AutoML Natural Language as JSONL
files that contain the sample documents. Each line in the file is a single training
document, specified in one of two forms:
The full content of the document, between 10 and 10000 bytes long (UTF-8 encoded)
The URI of a PDF or TIFF file from a Cloud Storage bucket associated with your
project
Consideration of spatial position is available only for training documents in PDF format.
You can annotate the text documents in three ways:
Annotate the JSONL files directly before uploading them
Add annotations in the AutoML Natural Language UI after uploading unannotated documents
Combine the first two options by uploading annotated JSONL files and then modifying them in the UI
You can annotate PDF files only in the AutoML Natural Language UI.
JSONL documents
To help you create JSONL training files, AutoML Natural Language offers a
Python script that converts plain
text files into appropriately formatted JSONL files. See the comments in the script
for details.
Each document in the JSONL file has one of the following formats:
Each document must be one line in the JSONL file. The example below includes
line breaks for readability; you need to remove them in the JSONL file. For more
information, see http://jsonlines.org/.
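Based on the full example later in this section, the inline-text form of a training document looks approximately like this; the content string, offsets, and label name below are placeholders, and the PDF or TIFF form is described under "PDF or TIFF documents" later in this section:
{
  "text_snippet": {
    "content": "Text of the training document ..."
  },
  "annotations": [
    {
      "text_extraction": {
        "text_segment": {"start_offset": 0, "end_offset": 4}
      },
      "display_name": "LabelName"
    }
  ]
}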
Each text_extraction element identifies an annotation within the
text_snippet.content. It indicates the position of the annotated
text by specifying the number of characters from the start of text_snippet.content
to the beginning (start_offset) and the end (end_offset)
of the text; display_name is the label for the entity.
Both start_offset and end_offset are character
offsets rather than byte offsets. The character at the end_offset
is not included in the text segment. Refer to TextSegment
for more details. The text_extraction elements are optional;
you can omit them if you plan to annotate the document using the AutoML Natural Language
UI. Each annotation can cover up to ten tokens (words). Annotations cannot
overlap; the start_offset of an annotation cannot fall between
the start_offset and end_offset of another annotation in the same document.
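As a quick check of how the offsets work, here is a short Python sketch (the file name training.jsonl is illustrative) that prints each annotated span from a document like the example below; in that example, the first annotation (start_offset 62, end_offset 67) corresponds to the string "HNPCC" in the abstract's title.

import json

# The file name is illustrative; each line of the JSONL file is one training document.
with open("training.jsonl", encoding="utf-8") as f:
    document = json.loads(f.readline())

content = document["text_snippet"]["content"]
for annotation in document["annotations"]:
    segment = annotation["text_extraction"]["text_segment"]
    start, end = segment["start_offset"], segment["end_offset"]
    # Offsets are character offsets, and the character at end_offset is excluded,
    # so Python slicing reproduces the annotated span exactly.
    print(annotation["display_name"], repr(content[start:end]))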
For example, the following training document identifies the specific diseases mentioned
in an abstract from the NCBI corpus.
{
"annotations": [
{
"text_extraction": {
"text_segment": {
"end_offset": 67,
"start_offset": 62
}
},
"display_name": "Modifier"
},
{
"text_extraction": {
"text_segment": {
"end_offset": 158,
"start_offset": 141
}
},
"display_name": "SpecificDisease"
},
{
"text_extraction": {
"text_segment": {
"end_offset": 330,
"start_offset": 290
}
},
"display_name": "SpecificDisease"
},
{
"text_extraction": {
"text_segment": {
"end_offset": 337,
"start_offset": 332
}
},
"display_name": "SpecificDisease"
},
{
"text_extraction": {
"text_segment": {
"end_offset": 627,
"start_offset": 610
}
},
"display_name": "Modifier"
},
{
"text_extraction": {
"text_segment": {
"end_offset": 754,
"start_offset": 749
}
},
"display_name": "Modifier"
},
{
"text_extraction": {
"text_segment": {
"end_offset": 875,
"start_offset": 865
}
},
"display_name": "Modifier"
},
{
"text_extraction": {
"text_segment": {
"end_offset": 968,
"start_offset": 951
}
},
"display_name": "Modifier"
},
{
"text_extraction": {
"text_segment": {
"end_offset": 1553,
"start_offset": 1548
}
},
"display_name": "Modifier"
},
{
"text_extraction": {
"text_segment": {
"end_offset": 1652,
"start_offset": 1606
}
},
"display_name": "CompositeMention"
},
{
"text_extraction": {
"text_segment": {
"end_offset": 1833,
"start_offset": 1826
}
},
"display_name": "DiseaseClass"
},
{
"text_extraction": {
"text_segment": {
"end_offset": 1860,
"start_offset": 1843
}
},
"display_name": "SpecificDisease"
},
{
"text_extraction": {
"text_segment": {
"end_offset": 1930,
"start_offset": 1913
}
},
"display_name": "SpecificDisease"
},
{
"text_extraction": {
"text_segment": {
"end_offset": 2129,
"start_offset": 2111
}
},
"display_name": "SpecificDisease"
},
{
"text_extraction": {
"text_segment": {
"end_offset": 2188,
"start_offset": 2160
}
},
"display_name": "SpecificDisease"
},
{
"text_extraction": {
"text_segment": {
"end_offset": 2260,
"start_offset": 2243
}
},
"display_name": "Modifier"
},
{
"text_extraction": {
"text_segment": {
"end_offset": 2356,
"start_offset": 2339
}
},
"display_name": "Modifier"
}
],
"text_snippet": {
"content": "10051005\tA common MSH2 mutation in English and North American HNPCC families:
origin, phenotypic expression, and sex specific differences in colorectal cancer .\tThe
frequency , origin , and phenotypic expression of a germline MSH2 gene mutation previously
identified in seven kindreds with hereditary non-polyposis cancer syndrome (HNPCC) was
investigated . The mutation ( A-- > T at nt943 + 3 ) disrupts the 3 splice site of exon 5
leading to the deletion of this exon from MSH2 mRNA and represents the only frequent MSH2
mutation so far reported . Although this mutation was initially detected in four of 33
colorectal cancer families analysed from eastern England , more extensive analysis has
reduced the frequency to four of 52 ( 8 % ) English HNPCC kindreds analysed . In contrast ,
the MSH2 mutation was identified in 10 of 20 ( 50 % ) separately identified colorectal
families from Newfoundland . To investigate the origin of this mutation in colorectal cancer
families from England ( n = 4 ) , Newfoundland ( n = 10 ) , and the United States ( n = 3 ) ,
haplotype analysis using microsatellite markers linked to MSH2 was performed . Within the
English and US families there was little evidence for a recent common origin of the MSH2
splice site mutation in most families . In contrast , a common haplotype was identified
at the two flanking markers ( CA5 and D2S288 ) in eight of the Newfoundland families .
These findings suggested a founder effect within Newfoundland similar to that reported by
others for two MLH1 mutations in Finnish HNPCC families . We calculated age related risks
of all , colorectal , endometrial , and ovarian cancers in nt943 + 3 A-- > T MSH2 mutation
carriers ( n = 76 ) for all patients and for men and women separately . For both sexes combined ,
the penetrances at age 60 years for all cancers and for colorectal cancer were 0 . 86 and 0 . 57 ,
respectively . The risk of colorectal cancer was significantly higher ( p < 0.01 ) in males
than females ( 0 . 63 v 0 . 30 and 0 . 84 v 0 . 44 at ages 50 and 60 years , respectively ) .
For females there was a high risk of endometrial cancer ( 0 . 5 at age 60 years ) and premenopausal
ovarian cancer ( 0 . 2 at 50 years ) . These intersex differences in colorectal cancer risks
have implications for screening programmes and for attempts to identify colorectal cancer
susceptibility modifiers .\n "
}
}
A JSONL file can contain multiple training documents with this structure, one on each line of the file.
PDF or TIFF documents
To upload a PDF or TIFF file as a document, wrap the file path inside a JSONL document element:
Each document must be one line in the JSONL file. The example below includes
line breaks for readability; you need to remove them in the JSONL file. For more
information, see http://jsonlines.org/.
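For illustration, such a line might look like this. Only input_uris is named in this guide; the surrounding field names (document, input_config, gcs_source) and the file path are assumptions based on the AutoML import format, so check them against your own exported examples:
{
  "document": {
    "input_config": {
      "gcs_source": {
        "input_uris": ["gs://my-project-lcm/training-data/contract_001.pdf"]
      }
    }
  }
}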
The value of the input_uris element is the path to a PDF or TIFF file in a Cloud Storage
bucket associated with your project. The maximum size of the PDF or TIFF file is 2MB.
Sentiment analysis
To train a sentiment analysis model, you provide representative samples of the type
of content you want AutoML Natural Language to analyze, each labeled with a value
indicating how positive the sentiment is within the content.
The sentiment score is an integer ranging from 0 (relatively negative) to a maximum value of your
choice (positive). For example, if you want to identify whether the sentiment is negative,
positive, or neutral, you would label the training data with sentiment scores of 0 (negative),
1 (neutral), and 2 (positive). The Maximum sentiment score (sentiment_max) for the
dataset is 2. If you want to capture more granularity with five levels of sentiment, you still
label documents with the most negative sentiment as 0 and use 4 for the most positive sentiment.
The Maximum sentiment score (sentiment_max) for the dataset is 4.
Sentiment score values should be consecutive integers starting from zero. If
your scores have gaps or do not start from zero, remap them to consecutive
integers starting from zero.
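For example, here is a minimal Python sketch of such a remapping; the helper name remap_scores is illustrative:

def remap_scores(scores):
    """Remap arbitrary sentiment scores to consecutive integers starting at 0."""
    ordered = sorted(set(scores))  # unique scores, most negative first
    mapping = {old: new for new, old in enumerate(ordered)}
    return [mapping[score] for score in scores]

# Original labels 1, 3, 3, 5 become 0, 1, 1, 2; the dataset's sentiment_max is then 2.
print(remap_scores([1, 3, 3, 5]))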
For best results, make sure your training data includes a balanced number of documents
with each sentiment score; having more examples for particular sentiment scores can introduce
bias into the model. We recommend providing at least 100 documents per sentiment value.
Importing training documents
You import training data into AutoML Natural Language using a CSV file that lists the
documents and optionally includes their category labels or sentiment values.
AutoML Natural Language creates a dataset
from the listed documents.
Training vs. evaluation data
AutoML Natural Language divides your training documents into three sets for training
a model: a training set, a validation set, and a test set.
AutoML Natural Language uses the training set to build the model. The model
tries multiple algorithms and parameters while searching for patterns in the
training data. As the model identifies patterns, it uses the validation set to
test the algorithms and patterns. AutoML Natural Language chooses the best
performing algorithms and patterns from those identified during the training stage.
After identifying the best performing algorithms and patterns, AutoML Natural Language
applies them to the test set to test for error rate, quality, and accuracy.
By default, AutoML Natural Language splits your training data randomly into
the three sets:
80% of documents are used for training
10% of documents are used for validation (hyper-parameter tuning and/or to
decide when to stop training)
10% of documents are reserved for testing (not used during training)
If you'd like to specify which set each document in your training data should
belong to, you can explicitly assign documents to sets in the CSV file as
described in the next section.
Creating an import CSV file
Once you have collected all of your training documents, create a CSV file that
lists them all. The CSV file can have any filename, must be UTF-8
encoded, and must end with a .csv extension. It must be stored in the
Cloud Storage bucket associated with your project.
The CSV file has one row for each training document, with these columns in each
row:
Which set to assign the content in this row to. This column is optional
and can be one of these values:
TRAIN - Use the document to train the model.
VALIDATION - Use the document
to validate the results that the model returns during training.
TEST - Use the document
to verify the model's results after the model has been trained.
If you include values in this column to specify the sets, we recommend that
you identify at least 5% of your data for each category. Using less
than 5% of your data for training, validation, or testing can produce
unexpected results and ineffective models.
If you do not include values in this column, start each row with a comma to
indicate the empty first column. AutoML Natural Language automatically
divides your documents into three sets, using approximately 80% of your data
for training, 10% for validation, and 10% for testing (up to 10,000 documents
for validation and testing).
The content to be categorized. This column contains the Cloud Storage
URI for the document. Cloud Storage URIs are case-sensitive.
For classification and sentiment analysis, the document can be a text file,
PDF file, TIFF file, or ZIP file; for entity extraction, it is a JSONL file.
For classification and sentiment analysis, the value in this column can be
quoted in-line text rather than a Cloud Storage URI.
For classification datasets, you can optionally include a comma-separated
list of labels that identify how the document is categorized. Labels must
start with a letter and only contain letters, numbers, and underscores. You
can include up to 20 labels for each document.
For sentiment analysis datasets, you can optionally include an integer
indicating the sentiment value for the content. The sentiment value ranges
from 0 (strongly negative) up to the maximum sentiment score you defined for
the dataset, which can be at most 10 (strongly positive).
For example, the CSV file for a multi-label classification dataset might have:
TRAIN,gs://my-project-lcm/training-data/file1.txt,Sports,Basketball
VALIDATION,gs://my-project-lcm/training-data/ubuntu.zip,Computers,Software,Operating_Systems,Linux,Ubuntu
TRAIN,gs://news/documents/file2.txt,Sports,Baseball
TEST,"Miles Davis was an American jazz trumpeter, bandleader, and composer.",Arts_Entertainment,Music,Jazz
TRAIN,gs://my-project-lcm/training-data/astros.txt,Sports,Baseball
VALIDATION,gs://my-project-lcm/training-data/mariners.txt,Sports,Baseball
TEST,gs://my-project-lcm/training-data/cubs.txt,Sports,Baseball
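If you let AutoML Natural Language split the data automatically, each row starts with a comma to leave the first column empty. For example (the file names are illustrative):
,gs://my-project-lcm/training-data/file3.txt,Sports,Hockey
,gs://my-project-lcm/training-data/file4.txt,Arts_Entertainment,Movies
For a sentiment analysis dataset, the final column holds the sentiment value rather than category labels, for example:
TRAIN,"Great service and a friendly staff.",2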
Common .csv errors
Using Unicode characters in labels. For example, Japanese characters are not
supported.
Using spaces and non-alphanumeric characters in labels.
Empty lines.
Empty columns (lines with two successive commas).
Missing quotes around embedded text that includes commas.
Incorrect capitalization of Cloud Storage paths.
Incorrect access control configured for your documents. Your service
account should have read or greater access, or files
must be publicly-readable.
References to non-text files, such as JPEG files. Likewise,
files that are not text files but that have been
renamed with a text extension will cause an error.
The URI of a document points to a bucket that does not belong to the current project.
Only files in the project's bucket can be accessed.
Non-CSV-formatted files.
Creating an import ZIP file
For classification datasets, you can import training documents using a ZIP file.
Within the ZIP file, create one folder for each label or sentiment value, and
save each document within the folder corresponding to the label or value to apply
to that document. For example, the ZIP file for a model that classifies business
correspondence might have this structure:
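(The folder and file names below are illustrative; each top-level folder is a label.)
correspondence.zip
    invoices/
        invoice_2020_01.pdf
        invoice_2020_02.pdf
    business_plans/
        expansion_plan.txt
        marketing_plan.txt
    nondisclosure_agreements/
        nda_template.pdf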
AutoML Natural Language applies the folder names as labels to the documents in the folder.
For a sentiment analysis dataset, the folder names are the sentiment values:
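(Illustrative structure for a dataset whose sentiment scores range from 0 to 2; file names are placeholders.)
reviews.zip
    0/
        negative_review_01.txt
    1/
        mixed_review_01.txt
    2/
        positive_review_01.txt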