Preparing your training data

To train your custom model, you provide representative samples of the type of documents you want to analyze, labeled in the way you want AutoML Natural Language to label similar documents. The quality of your training data strongly impacts the effectiveness of the model you create, and by extension, the quality of the predictions returned from that model.

Collecting and labeling training documents

The first step is to collect a diverse set of training documents that reflects the range of documents you want the custom model to handle. The preparation steps for training documents differ depending on whether you're training a model for classification, entity extraction, or sentiment analysis.

Importing training documents

You import training data into AutoML Natural Language using a CSV file that lists the documents and optionally includes their category labels or sentiment values. AutoML Natural Language creates a dataset from the listed documents.

Training vs. evaluation data

AutoML Natural Language divides your training documents into three sets for training a model: a training set, a validation set, and a test set.

AutoML Natural Language uses the training set to build the model. The model tries multiple algorithms and parameters while searching for patterns in the training data. As the model identifies patterns, it uses the validation set to test the algorithms and patterns. AutoML Natural Language chooses the best performing algorithms and patterns from those identified during the training stage.

After identifying the best performing algorithms and patterns, AutoML Natural Language applies them to the test set to test for error rate, quality, and accuracy.

By default, AutoML Natural Language splits your training data randomly into the three sets:

  • 80% of documents are used for training
  • 10% of documents are used for validation (hyperparameter tuning and deciding when to stop training)
  • 10% of documents are reserved for testing (not used during training)

If you'd like to specify which set each document in your training data should belong to, you can explicitly assign documents to sets in the CSV file as described in the next section.
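If you choose to assign sets yourself, you can reproduce the default 80/10/10 split in a few lines before writing the CSV. The following is a minimal sketch (the function name and seed are illustrative, not part of AutoML Natural Language):

```python
import random

def assign_splits(documents, seed=42):
    """Shuffle documents and assign them to TRAIN/VALIDATION/TEST
    in roughly an 80/10/10 ratio, mirroring the default split."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    n = len(docs)
    n_val = max(1, n // 10)   # ~10% for validation
    n_test = max(1, n // 10)  # ~10% for testing
    n_train = n - n_val - n_test
    assignments = {}
    for i, doc in enumerate(docs):
        if i < n_train:
            assignments[doc] = "TRAIN"
        elif i < n_train + n_val:
            assignments[doc] = "VALIDATION"
        else:
            assignments[doc] = "TEST"
    return assignments
```

Fixing the random seed makes the assignment reproducible, so re-generating the CSV does not silently move documents between sets.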

Creating an import CSV file

Once you have collected all of your training documents, create a CSV file that lists them all. The CSV file can have any filename, must be UTF-8 encoded, and must end with a .csv extension. It must be stored in the Cloud Storage bucket associated with your project.

The CSV file has one row for each training document, with these columns in each row:

  1. Which set to assign the content in this row to. This column is optional and can be one of these values:

    • TRAIN - Use the document to train the model.
    • VALIDATION - Use the document to validate the results that the model returns during training.
    • TEST - Use the document to verify the model's results after the model has been trained.

    If you include values in this column to specify the sets, we recommend that you assign at least 5% of your data to each set. Using less than 5% of your data for training, validation, or testing can produce unexpected results and ineffective models.

    If you do not include values in this column, start each row with a comma to indicate the empty first column. AutoML Natural Language automatically divides your documents into three sets, using approximately 80% of your data for training, 10% for validation, and 10% for testing (up to 10,000 documents each for validation and testing).

  2. The content to be categorized. This column contains the Cloud Storage URI for the document. Cloud Storage URIs are case-sensitive.

    For classification and sentiment analysis, the document can be a text file, PDF file, TIFF file, or ZIP file; for entity extraction, it is a JSONL file.

    For classification and sentiment analysis, the value in this column can be quoted in-line text rather than a Cloud Storage URI.

  3. For classification datasets, you can optionally include a comma-separated list of labels that identify how the document is categorized. Labels must start with a letter and only contain letters, numbers, and underscores. You can include up to 20 labels for each document.

    For sentiment analysis datasets, you can optionally include an integer indicating the sentiment value for the content. The sentiment value ranges from 0 (strongly negative) to a maximum value of 10 (strongly positive).

For example, the CSV file for a multi-label classification dataset might have:

TRAIN,gs://my-project-lcm/training-data/file1.txt,Sports,Basketball
VALIDATION,gs://my-project-lcm/training-data/,Computers,Software,Operating_Systems,Linux,Ubuntu
TRAIN,gs://news/documents/file2.txt,Sports,Baseball
TEST,"Miles Davis was an American jazz trumpeter, bandleader, and composer.",Arts_Entertainment,Music,Jazz
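Rather than assembling rows by string concatenation, you can generate the import file with Python's csv module, which automatically quotes any field containing commas (such as embedded in-line text). A minimal sketch, with hypothetical rows and an assumed output filename of import.csv:

```python
import csv

# Hypothetical rows: (set, content, labels...). The content field may be a
# Cloud Storage URI or in-line text; csv.writer quotes any field that
# contains commas, so embedded text is emitted safely.
rows = [
    ("TRAIN", "gs://my-project-lcm/training-data/file1.txt",
     "Sports", "Basketball"),
    ("TEST", "Miles Davis was an American jazz trumpeter, bandleader, and composer.",
     "Arts_Entertainment", "Music", "Jazz"),
]

# UTF-8 encoding and newline="" follow the csv module's documented usage.
with open("import.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(rows)
```

This avoids the missing-quotes error described in the next section, since the writer handles quoting for you.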

Common .csv errors

  • Using Unicode characters in labels. For example, Japanese characters are not supported.
  • Using spaces and non-alphanumeric characters in labels.
  • Empty lines.
  • Empty columns (lines with two successive commas).
  • Missing quotes around embedded text that includes commas.
  • Incorrect capitalization of Cloud Storage paths.
  • Incorrect access control configured for your documents. Your service account needs at least read access to the documents, or the files must be publicly readable.
  • References to non-text files, such as JPEG files. Likewise, files that are not text files but that have been renamed with a text extension will cause an error.
  • Document URIs that point to a bucket outside the current project. Only files in the project's bucket can be accessed.
  • Non-CSV-formatted files.
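Several of these errors can be caught locally before import. The following sketch checks a few of them (empty lines, unknown set values, and invalid labels); the function name is illustrative, and the label rule encodes the letters/numbers/underscores requirement described above:

```python
import csv
import re

# An empty first column means "let AutoML choose the set".
VALID_SETS = {"TRAIN", "VALIDATION", "TEST", ""}
# A label must start with a letter and contain only letters,
# numbers, and underscores.
LABEL_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")

def check_import_csv(path):
    """Return a list of (line_number, message) problems found in an
    import CSV. A sketch covering a few of the common errors above."""
    problems = []
    with open(path, encoding="utf-8", newline="") as f:
        for lineno, row in enumerate(csv.reader(f), start=1):
            if not any(field.strip() for field in row):
                problems.append((lineno, "empty line"))
                continue
            if row[0] not in VALID_SETS:
                problems.append((lineno, "unknown set %r" % row[0]))
            for label in row[2:]:
                if not LABEL_RE.match(label):
                    problems.append((lineno, "invalid label %r" % label))
    return problems
```

Checks for unreadable Cloud Storage objects or wrong capitalization in paths would require calling the storage API, so they are out of scope for this local sketch.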

Creating an import ZIP file

For classification datasets, you can import training documents using a ZIP file. Within the ZIP file, create one folder for each label or sentiment value, and save each document in the folder corresponding to the label or value to apply to it. For example, a ZIP file for a model that classifies business correspondence would contain one top-level folder per category, with each letter saved in its category's folder.

AutoML Natural Language applies the folder names as labels to the documents in the folder. For a sentiment analysis dataset, the folder names are the sentiment values.
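A ZIP file with this layout can be built with Python's zipfile module. This is a minimal sketch: the function name, labels, filenames, and document contents are all hypothetical.

```python
import zipfile

def build_import_zip(zip_path, docs_by_label):
    """Build a classification import ZIP: one folder per label, with
    each document stored inside its label's folder. docs_by_label
    maps a label (or sentiment value) to {filename: file_contents}."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for label, docs in docs_by_label.items():
            for name, text in docs.items():
                # The folder component of the archive path becomes the label.
                zf.writestr("%s/%s" % (label, name), text)

# Hypothetical example: two categories of business correspondence.
build_import_zip("correspondence.zip", {
    "complaints": {"letter1.txt": "The product arrived damaged..."},
    "inquiries": {"letter2.txt": "Do you ship internationally?"},
})
```

For a sentiment analysis dataset, the top-level keys would instead be the sentiment values (for example, "0" through "10").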

What's next