Preparing your training data

To train your custom model, you provide representative samples of the type of texts you want to analyze, annotated with labels that identify the types of entities you want AutoML Natural Language Entity Extraction to identify in the text items.

Annotation list

You supply between 50 and 100,000 text items to use for training your custom model, and between one and 100 unique labels to annotate the entities you want the model to learn to extract. Each annotation is a span of text with an associated label. Label names must be between 2 and 30 characters long, and each annotation can cover between one and 10 words. We recommend using each label at least 200 times in your training data set.
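These limits can be checked before you upload. A minimal sketch follows; the in-memory representation (a list of `(text, annotations)` pairs, where each annotation is a `(label, word_count)` tuple) is an assumption for illustration, not an API type:

```python
# Sketch: check training-data limits before upload.
# The (text, [(label, word_count), ...]) structure is hypothetical.
from collections import Counter

def check_dataset(items):
    """items: list of (text, annotations); annotations is a list of
    (label, word_count) tuples. Returns a list of warning strings."""
    warnings = []
    if not (50 <= len(items) <= 100_000):
        warnings.append(f"item count {len(items)} outside 50-100,000")
    label_counts = Counter()
    for text, annotations in items:
        for label, word_count in annotations:
            label_counts[label] += 1
            if not (2 <= len(label) <= 30):
                warnings.append(f"label {label!r} not 2-30 characters")
            if not (1 <= word_count <= 10):
                warnings.append(f"annotation for {label!r} spans {word_count} words")
    if len(label_counts) > 100:
        warnings.append("more than 100 unique labels")
    for label, n in label_counts.items():
        if n < 200:
            warnings.append(f"label {label!r} used only {n} times (recommend >= 200)")
    return warnings
```

Running such a check locally catches limit violations before an import fails server-side.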

If you are annotating a structured or semi-structured document type, such as invoices or contracts, AutoML Natural Language Entity Extraction can consider an annotation's position on the page as a factor contributing to its proper label. For example, a real estate contract has both an acceptance date and a closing date, and AutoML Natural Language Entity Extraction can learn to distinguish between the entities based on the spatial position of the annotation.

Formatting training items

You upload training data to AutoML Natural Language Entity Extraction as JSONL files that contain the sample text items. Each line in the file is a single training text item, specified in one of two forms:

  • The full content of the text item, between 10 and 10,000 bytes (UTF-8 encoded)
  • The URI of a PDF file from a Google Cloud Storage bucket associated with your project

You can annotate the text items in three ways:

  • Annotate the JSONL files directly before uploading them
  • Add annotations in the AutoML Natural Language Entity Extraction UI after uploading unannotated text items
  • Request labeling from human labelers using the AI Platform Data Labeling Service

You can combine the first two options by uploading labeled JSONL files and modifying them in the UI.

JSONL text items

Each text item in the JSONL file has one of the following formats:

For unannotated text items:

{
  "text_snippet":
    {"content": string}
}

For annotated text items:

{
  "annotations": [
     {
      "text_extraction": {
         "text_segment": {
            "end_offset": number, "start_offset": number
          }
       },
       "display_name": string
     },
     {
       "text_extraction": {
          "text_segment": {
             "end_offset": number, "start_offset": number
           }
        },
        "display_name": string
     },
   ...
  ],
  "text_snippet":
    {"content": string}
}

Each text_extraction element identifies an annotation within the text_snippet.content. It indicates the position of the annotated text by specifying the number of characters from the start of text_snippet.content to the beginning (start_offset) and the end (end_offset) of the text; display_name is the label for the entity.
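One way to produce these items is to locate each entity span in the text programmatically and compute the offsets. The sketch below does this with `str.find`; the helper function and the sample entities are illustrative, not part of the product:

```python
# Sketch: build an annotated training item by locating entity spans in
# the content and computing character offsets, then emit a JSONL line.
import json

def make_item(content, entities):
    """entities: list of (entity_text, label) pairs occurring in content."""
    annotations = []
    for entity_text, label in entities:
        start = content.find(entity_text)
        if start == -1:
            raise ValueError(f"{entity_text!r} not found in content")
        annotations.append({
            "text_extraction": {
                "text_segment": {
                    "start_offset": start,
                    "end_offset": start + len(entity_text),
                }
            },
            "display_name": label,
        })
    return {"annotations": annotations, "text_snippet": {"content": content}}

item = make_item(
    "A common MSH2 mutation in English HNPCC families.",
    [("MSH2 mutation", "Modifier"), ("HNPCC", "SpecificDisease")])
jsonl_line = json.dumps(item)  # one training item per line in the JSONL file
```

Note that `str.find` returns only the first occurrence; text with repeated entity mentions needs a more careful span search.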

For example, the following training item identifies the specific diseases mentioned in an abstract from the NCBI corpus.

{
  "annotations": [
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 67,
          "start_offset": 62
        }
      },
      "display_name": "Modifier"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 158,
          "start_offset": 141
        }
      },
      "display_name": "SpecificDisease"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 330,
          "start_offset": 290
        }
      },
      "display_name": "SpecificDisease"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 337,
          "start_offset": 332
        }
      },
      "display_name": "SpecificDisease"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 627,
          "start_offset": 610
        }
      },
      "display_name": "Modifier"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 754,
          "start_offset": 749
        }
      },
      "display_name": "Modifier"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 875,
          "start_offset": 865
        }
      },
      "display_name": "Modifier"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 968,
          "start_offset": 951
        }
      },
      "display_name": "Modifier"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 1553,
          "start_offset": 1548
        }
      },
      "display_name": "Modifier"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 1652,
          "start_offset": 1606
        }
      },
      "display_name": "CompositeMention"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 1833,
          "start_offset": 1826
        }
      },
      "display_name": "DiseaseClass"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 1860,
          "start_offset": 1843
        }
      },
      "display_name": "SpecificDisease"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 1930,
          "start_offset": 1913
        }
      },
      "display_name": "SpecificDisease"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 2129,
          "start_offset": 2111
        }
      },
      "display_name": "SpecificDisease"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 2188,
          "start_offset": 2160
        }
      },
      "display_name": "SpecificDisease"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 2260,
          "start_offset": 2243
        }
      },
      "display_name": "Modifier"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 2356,
          "start_offset": 2339
        }
      },
      "display_name": "Modifier"
    }
  ],
  "text_snippet": {
    "content": "10051005\tA common MSH2 mutation in English and North American HNPCC families:
      origin, phenotypic expression, and sex specific differences in colorectal cancer .\tThe
      frequency , origin , and phenotypic expression of a germline MSH2 gene mutation previously
      identified in seven kindreds with hereditary non-polyposis cancer syndrome (HNPCC) was
      investigated . The mutation ( A-- > T at nt943 + 3 ) disrupts the 3 splice site of exon 5
      leading to the deletion of this exon from MSH2 mRNA and represents the only frequent MSH2
      mutation so far reported . Although this mutation was initially detected in four of 33
      colorectal cancer families analysed from eastern England , more extensive analysis has
      reduced the frequency to four of 52 ( 8 % ) English HNPCC kindreds analysed . In contrast ,
      the MSH2 mutation was identified in 10 of 20 ( 50 % ) separately identified colorectal
      families from Newfoundland . To investigate the origin of this mutation in colorectal cancer
      families from England ( n = 4 ) , Newfoundland ( n = 10 ) , and the United States ( n = 3 ) ,
      haplotype analysis using microsatellite markers linked to MSH2 was performed . Within the
      English and US families there was little evidence for a recent common origin of the MSH2
      splice site mutation in most families . In contrast , a common haplotype was identified
      at the two flanking markers ( CA5 and D2S288 ) in eight of the Newfoundland families .
      These findings suggested a founder effect within Newfoundland similar to that reported by
      others for two MLH1 mutations in Finnish HNPCC families . We calculated age related risks
      of all , colorectal , endometrial , and ovarian cancers in nt943 + 3 A-- > T MSH2 mutation
      carriers ( n = 76 ) for all patients and for men and women separately . For both sexes combined ,
      the penetrances at age 60 years for all cancers  and for colorectal cancer were 0 . 86 and 0 . 57 ,
      respectively . The risk of colorectal cancer was significantly higher ( p < 0 . 01 ) in males
      than females ( 0 . 63 v 0 . 30 and 0 . 84 v 0 . 44 at ages 50 and 60 years , respectively ) .
      For females there was a high risk of endometrial cancer ( 0 . 5 at age 60 years ) and premenopausal
      ovarian cancer ( 0 . 2 at 50 years ) . These intersex differences in colorectal cancer risks
      have implications for screening programmes and for attempts to identify colorectal cancer
      susceptibility modifiers .\n "
  }
}

A JSONL file can contain multiple training items with this structure, one on each line of the file.

PDF text items

To upload a PDF file as a text item, wrap its Cloud Storage path in a JSONL document element:

{
  "document": {
    "input_config": {
      "gcs_source": {
        "input_uris": [ "gs://cloud-ml-data/NL-entity/sample.pdf" ]
      }
    }
  }
}

The value of the input_uris element is the path to a PDF file in a Google Cloud Storage bucket associated with your project. The maximum PDF file size is 2 MB.
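If you have many PDFs, you can generate these entries programmatically. In this sketch the bucket and file names are placeholders; substitute paths from your own Cloud Storage bucket:

```python
# Sketch: generate JSONL lines for PDF text items.
# "my-bucket" and the file names below are placeholders.
import json

def pdf_item(gcs_uri):
    """Wrap a Cloud Storage PDF path in the JSONL document element."""
    return {
        "document": {
            "input_config": {
                "gcs_source": {"input_uris": [gcs_uri]}
            }
        }
    }

with open("pdf-items.jsonl", "w", encoding="utf-8") as f:
    for name in ["contract-001.pdf", "contract-002.pdf"]:
        f.write(json.dumps(pdf_item(f"gs://my-bucket/pdfs/{name}")) + "\n")
```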

Creating a CSV file for importing items

To import the training items into a dataset, you need to create a comma-separated values (CSV) file that has one row for each JSONL file you want to import. The CSV file can have any filename, must be UTF-8 encoded, and must end with a .csv extension. It must be stored in the Google Cloud Storage bucket associated with your project.

The columns in each row of the CSV are:

  1. Which set to assign the content in this row to. This optional column can have one of these values:

    • TRAIN - Use this text item to train the model.
    • VALIDATE - Use this text item to validate the results that the model returns during training. (Validation data sets are also known as "dev" datasets.)
    • TEST - Use this text item to verify the model's results after the model has been trained.

    If you do not include values in this column, AutoML Natural Language Entity Extraction automatically divides your text items into three sets, using approximately 80% of your data for training, 10% for validation, and 10% for testing (up to 10,000 items each for validation and testing).

    If you explicitly assign any items to the TRAIN, VALIDATE, or TEST sets, you must explicitly assign all items; AutoML Natural Language Entity Extraction assigns items automatically only when none of them have been explicitly assigned to a set.

    If you do not assign values in this column, each row in the CSV file must start with a comma to indicate the empty first column.

  2. The URL of the JSONL file. This column provides the Google Cloud Storage URI for a JSONL file in your Google Cloud Storage bucket.

For example, you might have the following in your .csv file:

TRAIN,gs://my-project-lcm/training-data/traindata.jsonl
VALIDATE,gs://my-project-lcm/training-data/validatedata.jsonl
TEST,gs://my-project-lcm/training-data/testdata.jsonl
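A file like the one above can be written with the standard csv module, which also guards against the formatting errors listed below. The bucket and JSONL paths here are placeholders:

```python
# Sketch: write the import CSV. The gs:// paths are placeholders for
# JSONL files already uploaded to your project's Cloud Storage bucket.
import csv

rows = [
    ("TRAIN", "gs://my-project-lcm/training-data/traindata.jsonl"),
    ("VALIDATE", "gs://my-project-lcm/training-data/validatedata.jsonl"),
    ("TEST", "gs://my-project-lcm/training-data/testdata.jsonl"),
]

with open("import.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
```

Remember that the finished file must be uploaded to the Cloud Storage bucket associated with your project before you import it.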

Common .csv errors

  • Incorrect order of columns.
  • Using spaces and non-alphanumeric characters in labels.
  • Empty lines.
  • Empty columns (lines with two successive commas).
  • Incorrect capitalization of Cloud Storage text paths.
  • Incorrect access control configured for your text files. Your service account should have read or greater access, or the files must be publicly readable.
  • References to non-text files, such as JPEG files. Likewise, files that are not text files but that have been renamed with a text extension will cause an error.
  • The URI of a text file points to a different bucket than the current project. Only files in the project bucket can be accessed.
  • Non-CSV-formatted files.