Prepare text training data for entity extraction

This page describes how to prepare text data for use in a Vertex AI dataset to train an entity extraction model.

Entity extraction training data consists of documents annotated with labels that identify the types of entities you want your model to recognize. For example, you might create an entity extraction model to identify specialized terminology in legal documents or patents. Each annotation specifies the location of an entity in the document and the label that applies to it.

If you're annotating structured or semi-structured documents, such as invoices or contracts, for a dataset used to train AutoML models, Vertex AI can consider an annotation's position on the page as a factor contributing to its proper label. For example, a real estate contract has both an acceptance date and a closing date. Vertex AI can learn to distinguish between the two entities based on the spatial position of each annotation.

Data requirements

  • You must supply at least 50, and no more than 100,000, training documents.
  • You must supply at least 1, and no more than 100, unique labels to annotate entities that you want to extract.
  • You can use a label to annotate between 1 and 10 words.
  • Label names can be between 2 and 30 characters.
  • You can include annotations in your JSON Lines files, or you can add annotations later by using the Google Cloud console after uploading documents.
  • You can include documents inline or reference TXT files that are in Cloud Storage buckets.
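
The limits above can be checked before upload. The following is a minimal sketch, not an official validator: the function name check_requirements is hypothetical, and it assumes that offsets are character indexes with an exclusive endOffset, which this page does not state. Word counts are only checked for inline documents, because the text of documents referenced in Cloud Storage is not available locally.

```python
import json

# Limits taken from the data requirements listed above.
MIN_DOCS, MAX_DOCS = 50, 100_000
MIN_LABELS, MAX_LABELS = 1, 100
MIN_NAME_LEN, MAX_NAME_LEN = 2, 30
MIN_WORDS, MAX_WORDS = 1, 10


def check_requirements(jsonl_lines):
    """Return a list of problems found in an iterable of JSON Lines strings."""
    problems = []
    labels = set()
    docs = [json.loads(line) for line in jsonl_lines if line.strip()]

    if not MIN_DOCS <= len(docs) <= MAX_DOCS:
        problems.append(
            f"document count {len(docs)} is outside {MIN_DOCS} to {MAX_DOCS}"
        )

    for i, doc in enumerate(docs):
        text = doc.get("textContent")  # None when the document uses textGcsUri
        for ann in doc.get("textSegmentAnnotations", []):
            name = ann["displayName"]
            labels.add(name)
            if not MIN_NAME_LEN <= len(name) <= MAX_NAME_LEN:
                problems.append(
                    f"doc {i}: label name {name!r} is outside "
                    f"{MIN_NAME_LEN} to {MAX_NAME_LEN} characters"
                )
            if text is not None:
                # Assumption: character offsets, endOffset exclusive.
                words = text[ann["startOffset"]:ann["endOffset"]].split()
                if not MIN_WORDS <= len(words) <= MAX_WORDS:
                    problems.append(
                        f"doc {i}: annotation {name!r} covers {len(words)} words"
                    )

    if not MIN_LABELS <= len(labels) <= MAX_LABELS:
        problems.append(
            f"unique label count {len(labels)} is outside "
            f"{MIN_LABELS} to {MAX_LABELS}"
        )
    return problems
```

An empty result means that none of these checks failed. A file with no annotations reports a label-count problem, which you can ignore if you plan to add annotations in the Google Cloud console.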

Best practices for text data used to train AutoML models

The following recommendations apply to datasets used to train AutoML models.

  • Use each label at least 200 times in your training dataset.
  • Annotate every occurrence of entities that you want your model to identify.
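
One way to check the first recommendation is to count how often each label appears across the JSON Lines file. The following sketch uses only the textSegmentAnnotations and displayName fields shown on this page; the function name underused_labels is hypothetical.

```python
import json
from collections import Counter

RECOMMENDED_MIN_USES = 200  # from the recommendation above


def underused_labels(jsonl_lines, minimum=RECOMMENDED_MIN_USES):
    """Return {label: count} for labels used fewer than `minimum` times."""
    counts = Counter()
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines between documents
        doc = json.loads(line)
        for ann in doc.get("textSegmentAnnotations", []):
            counts[ann["displayName"]] += 1
    return {label: n for label, n in counts.items() if n < minimum}
```

Labels returned by this function are candidates for more annotated examples. The check cannot detect entities that were never annotated, so the second recommendation still requires manual review.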

Input files

Input files for entity extraction must be in JSON Lines format. The format, field names, and value types for JSON Lines files are determined by a schema file, which is a publicly accessible YAML file.

You can download the schema file for entity extraction from the following Cloud Storage location:
gs://google-cloud-aiplatform/schema/dataset/ioformat/text_extraction_io_format_1.0.0.yaml

The following example shows how you might use the schema to create your own JSON Lines file. It contains two documents: one with inline text (textContent) and one that references a file in Cloud Storage (textGcsUri). The example includes line breaks for readability. In your JSON Lines files, include line breaks only after each document. The dataItemResourceLabels field is optional; you can use it, for example, to specify ml_use.

{
    "textSegmentAnnotations": [
      {
        "startOffset":number,
        "endOffset":number,
        "displayName": "label"
      },
      ...
    ],
    "textContent": "inline_text",
    "dataItemResourceLabels": {
      "aiplatform.googleapis.com/ml_use": "training|test|validation"
    }
}
{
    "textSegmentAnnotations": [
      {
        "startOffset":number,
        "endOffset":number,
        "displayName": "label"
      },
      ...
    ],
    "textGcsUri": "gcs_uri_to_file",
    "dataItemResourceLabels": {
      "aiplatform.googleapis.com/ml_use": "training|test|validation"
    }
}
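
As an illustration, a record in this shape could be generated with Python's standard json module. This is a sketch under stated assumptions rather than an official tool: the helper make_record is hypothetical, it annotates only the first occurrence of each substring, and it assumes that offsets count characters and that endOffset is exclusive. Confirm both against the schema file before relying on them.

```python
import json


def make_record(text, entities, ml_use=None):
    """Build one JSON Lines record for an inline document.

    `entities` is a list of (substring, label) pairs. Only the first
    occurrence of each substring in `text` is annotated.
    """
    annotations = []
    for substring, label in entities:
        start = text.find(substring)
        if start == -1:
            raise ValueError(f"{substring!r} not found in text")
        annotations.append({
            "startOffset": start,
            # Assumption: endOffset is one past the last character.
            "endOffset": start + len(substring),
            "displayName": label,
        })

    record = {"textSegmentAnnotations": annotations, "textContent": text}
    if ml_use is not None:
        # Optional field; the template lists training, test, and validation.
        record["dataItemResourceLabels"] = {
            "aiplatform.googleapis.com/ml_use": ml_use
        }
    # json.dumps escapes newlines inside strings, so each record stays on one line.
    return json.dumps(record)
```

Writing each returned string followed by a single newline yields a file with one document per line, as required above.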

You can also annotate documents by using the Google Cloud console. To do so, create a JSON Lines file that contains only content (omit the textSegmentAnnotations field). The documents are then uploaded to Vertex AI without any annotations.
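
A content-only file can be produced in the same way. The sketch below is illustrative only: the function names are hypothetical, and any Cloud Storage URIs you pass in must point to your own TXT files.

```python
import json


def content_only_lines(inline_texts=(), gcs_uris=()):
    """Return JSON Lines strings that carry content but no annotations."""
    lines = [json.dumps({"textContent": text}) for text in inline_texts]
    lines += [json.dumps({"textGcsUri": uri}) for uri in gcs_uris]
    return lines


def write_jsonl(path, lines):
    """Write one document per line, with a line break only after each document."""
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
```

For example, write_jsonl("unlabeled.jsonl", content_only_lines(inline_texts=["First contract text"])) creates a one-document file; the file name here is arbitrary.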