Preparing your training data

To train your custom model, you provide representative samples of the type of text you want to analyze, annotated with labels that mark the kinds of entities you want AutoML Natural Language Entity Extraction to find in your text items.

Annotation list

You supply between 50 and 100,000 text items to train your custom model. Each text item can be between 10 and 4000 bytes long (UTF-8 encoded). You use between one and 100 unique labels to annotate the locations of the entities you want the model to learn to extract.

Label names can be between 2 and 30 characters, and can be used to annotate between one and 10 words. We recommend using each label at least 200 times in your training data set.
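As a sanity check before uploading, you could scan your annotations against these limits. A minimal sketch in Python (the `check_labels` helper is hypothetical, not part of the AutoML API):

```python
from collections import Counter

def check_labels(annotations):
    """Check label names against the documented limits (2-30 characters)
    and warn when a label appears fewer than the recommended 200 times."""
    counts = Counter(a["display_name"] for a in annotations)
    problems = []
    for label, n in counts.items():
        if not (2 <= len(label) <= 30):
            problems.append(f"label {label!r} is not 2-30 characters")
        if n < 200:
            problems.append(f"label {label!r} used only {n} times (recommended: 200+)")
    return problems
```

Here `annotations` is a flat list of annotation dicts collected from all your training items.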

Text items as JSONL files

You upload training data to AutoML Natural Language Entity Extraction as JSONL files that contain the sample text items. Each line in the file is a single training text item. The line can contain the full content of the text item or provide the URI of a PDF file from a Google Cloud Storage bucket associated with your project.

You can label the location of entities in text items in two ways:

  • Annotate the JSONL files directly before uploading them
  • Add annotations in the AutoML Natural Language Entity Extraction UI after uploading unannotated JSONL text items

You can combine these two options by uploading labeled JSONL files and modifying them in the UI.

Annotated text items

Each training item in the JSONL file has this format:

{
  "annotations": [
     {
      "text_extraction": {
         "text_segment": {
            "end_offset": number, "start_offset": number
          }
       },
       "display_name": string
     },
     {
       "text_extraction": {
          "text_segment": {
             "end_offset": number, "start_offset": number
           }
        },
        "display_name": string
     },
   ...
  ],
  "text_snippet":
    {"content": string}
}

Each text_extraction element identifies an annotation within the text_snippet.content. It indicates the position of the annotated text by specifying the number of characters from the start of text_snippet.content to the beginning (start_offset) and the end (end_offset) of the text; display_name is the label for the entity.
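Computing offsets by hand is error-prone; a minimal sketch of building one annotation by locating the entity text within the snippet (the `make_annotation` helper is hypothetical):

```python
def make_annotation(content, entity_text, label):
    """Build a text_extraction annotation for the first occurrence of
    entity_text within content. Offsets count characters from the start
    of the snippet, as described above."""
    start = content.find(entity_text)
    if start == -1:
        raise ValueError(f"{entity_text!r} not found in content")
    return {
        "text_extraction": {
            "text_segment": {
                "start_offset": start,
                "end_offset": start + len(entity_text),
            }
        },
        "display_name": label,
    }
```

Using `str.find` only annotates the first occurrence; if the same entity appears multiple times, you would need to search from each previous match onward.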

For example, this training item identifies the specific diseases mentioned in an abstract from the NCBI corpus.

{"annotations": [{"text_extraction": {"text_segment": {"end_offset": 67, "start_offset": 62}}, "display_name": "Modifier"}, {"text_extraction": {"text_segment": {"end_offset": 158, "start_offset": 141}}, "display_name": "SpecificDisease"}, {"text_extraction": {"text_segment": {"end_offset": 330, "start_offset": 290}}, "display_name": "SpecificDisease"}, {"text_extraction": {"text_segment": {"end_offset": 337, "start_offset": 332}}, "display_name": "SpecificDisease"}, {"text_extraction": {"text_segment": {"end_offset": 627, "start_offset": 610}}, "display_name": "Modifier"}, {"text_extraction": {"text_segment": {"end_offset": 754, "start_offset": 749}}, "display_name": "Modifier"}, {"text_extraction": {"text_segment": {"end_offset": 875, "start_offset": 865}}, "display_name": "Modifier"}, {"text_extraction": {"text_segment": {"end_offset": 968, "start_offset": 951}}, "display_name": "Modifier"}, {"text_extraction": {"text_segment": {"end_offset": 1553, "start_offset": 1548}}, "display_name": "Modifier"}, {"text_extraction": {"text_segment": {"end_offset": 1652, "start_offset": 1606}}, "display_name": "CompositeMention"}, {"text_extraction": {"text_segment": {"end_offset": 1833, "start_offset": 1826}}, "display_name": "DiseaseClass"}, {"text_extraction": {"text_segment": {"end_offset": 1860, "start_offset": 1843}}, "display_name": "SpecificDisease"}, {"text_extraction": {"text_segment": {"end_offset": 1930, "start_offset": 1913}}, "display_name": "SpecificDisease"}, {"text_extraction": {"text_segment": {"end_offset": 2129, "start_offset": 2111}}, "display_name": "SpecificDisease"}, {"text_extraction": {"text_segment": {"end_offset": 2188, "start_offset": 2160}}, "display_name": "SpecificDisease"}, {"text_extraction": {"text_segment": {"end_offset": 2260, "start_offset": 2243}}, "display_name": "Modifier"}, {"text_extraction": {"text_segment": {"end_offset": 2356, "start_offset": 2339}}, "display_name": "Modifier"}], "text_snippet": {"content": "10051005\tA common 
MSH2 mutation in English and North American HNPCC families: origin, phenotypic expression, and sex specific differences in colorectal cancer .\tThe frequency , origin , and phenotypic expression of a germline MSH2 gene mutation previously identified in seven kindreds with hereditary non-polyposis cancer syndrome (HNPCC) was investigated . The mutation ( A-- > T at nt943 + 3 ) disrupts the 3 splice site of exon 5 leading to the deletion of this exon from MSH2 mRNA and represents the only frequent MSH2 mutation so far reported . Although this mutation was initially detected in four of 33 colorectal cancer families analysed from eastern England , more extensive analysis has reduced the frequency to four of 52 ( 8 % ) English HNPCC kindreds analysed . In contrast , the MSH2 mutation was identified in 10 of 20 ( 50 % ) separately identified colorectal families from Newfoundland . To investigate the origin of this mutation in colorectal cancer families from England ( n = 4 ) , Newfoundland ( n = 10 ) , and the United States ( n = 3 ) , haplotype analysis using microsatellite markers linked to MSH2 was performed . Within the English and US families there was little evidence for a recent common origin of the MSH2 splice site mutation in most families . In contrast , a common haplotype was identified at the two flanking markers ( CA5 and D2S288 ) in eight of the Newfoundland families . These findings suggested a founder effect within Newfoundland similar to that reported by others for two MLH1 mutations in Finnish HNPCC families . We calculated age related risks of all , colorectal , endometrial , and ovarian cancers in nt943 + 3 A-- > T MSH2 mutation carriers ( n = 76 ) for all patients and for men and women separately . For both sexes combined , the penetrances at age 60 years for all cancers  and for colorectal cancer were 0 . 86 and 0 . 57 , respectively . The risk of colorectal cancer was significantly higher ( p < 0 . 01 ) in males than females ( 0 . 63 v 0 . 
30 and 0 . 84 v 0 . 44 at ages 50 and 60 years , respectively ) . For females there was a high risk of endometrial cancer ( 0 . 5 at age 60 years ) and premenopausal ovarian cancer ( 0 . 2 at 50 years ) . These intersex differences in colorectal cancer risks have implications for screening programmes and for attempts to identify colorectal cancer susceptibility modifiers .\n "}}

A JSONL file can contain multiple training items with this structure, one on each line of the file.
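To produce such a file programmatically, you can serialize one training item per line. A minimal Python sketch (assuming your items are already dicts in the format shown above; the `write_jsonl` helper is hypothetical):

```python
import json

def write_jsonl(items, path):
    """Write training items (dicts in the annotated-item format above)
    to a JSONL file, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
```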

PDF text items

To upload a PDF file as a text item, you wrap the file path inside a JSONL document element:

{
  "document": {
    "input_config": {
      "gcs_source": {
        "input_uris": [ "gs://cloud-ml-data/NL-entity/sample.pdf" ]
      }
    }
  }
}

The value of the input_uris element is the path to a PDF file in a Google Cloud Storage bucket associated with your project. The maximum size of the PDF file is 2 MB.
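Generating document entries of this shape is straightforward; a minimal Python sketch (the `pdf_item` helper is hypothetical):

```python
def pdf_item(gcs_uri):
    """Return a JSONL training item that references a PDF stored in a
    Cloud Storage bucket, using the document format shown above."""
    return {
        "document": {
            "input_config": {
                "gcs_source": {"input_uris": [gcs_uri]}
            }
        }
    }
```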

Creating a CSV file for importing items

To import the training items into a dataset, you need to create a comma-separated values (CSV) file that has one row for each JSONL file you want to import. The CSV file can have any filename, must be UTF-8 encoded, and must end with a .csv extension. It must be stored in the Google Cloud Storage bucket associated with your project.

The columns in each row of the CSV are:

  1. Which set to assign the content in this row to. This optional column can have one of these values:

    • TRAIN - Use this text item to train the model.
    • VALIDATE - Use this text item to validate the results that the model returns during training. (Validation data sets are also known as "dev" datasets.)
    • TEST - Use this text item to verify the model's results after the model has been trained.

    If you do not include values in this column, AutoML Natural Language Entity Extraction automatically divides your text items into three sets, using approximately 80% of your data for training, 10% for validation, and 10% for testing (up to 10,000 items each for validation and testing).

    If you explicitly assign any items to the TRAIN, VALIDATE, or TEST sets, you must explicitly assign all items. AutoML Natural Language Entity Extraction automatically assigns items only when none of them have been explicitly assigned to a set.
    If you do not assign values in this column, each row in the CSV file must start with a comma to indicate the empty first column.

  2. The URI of the JSONL file. This column provides the Google Cloud Storage URI for a JSONL file in your Google Cloud Storage bucket.

For example, you might have the following in your .csv file:

TRAIN,gs://my-project-lcm/training-data/traindata.jsonl
VALIDATE,gs://my-project-lcm/training-data/validatedata.jsonl
TEST,gs://my-project-lcm/training-data/testdata.jsonl
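You can generate this file with the standard csv module; a minimal sketch (the `write_import_csv` helper is hypothetical). Passing an empty string for the set produces the leading comma described above:

```python
import csv

def write_import_csv(rows, path):
    """Write the import CSV, one (set, jsonl_uri) pair per row. Pass an
    empty string for the set to let the service split the data for you."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        for split, uri in rows:
            writer.writerow([split, uri])
```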

Common .csv errors

  • Incorrect order of columns.
  • Using spaces and non-alphanumeric characters in labels.
  • Empty lines.
  • Empty columns (lines with two successive commas).
  • Incorrect capitalization of Cloud Storage text paths.
  • Incorrect access control configured for your text files. Your service account should have read or greater access, or files must be publicly-readable.
  • References to non-text files, such as JPEG files. Likewise, files that are not text files but that have been renamed with a text extension will cause an error.
  • The URI of a text file points to a different bucket than the current project. Only files in the project bucket can be accessed.
  • Non-CSV-formatted files.
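A small lint pass can catch several of these errors before import; a minimal sketch that checks set names, column count, empty lines, and the gs:// prefix (the `lint_import_csv` helper is hypothetical and not exhaustive):

```python
import csv

VALID_SETS = {"TRAIN", "VALIDATE", "TEST", ""}

def lint_import_csv(path):
    """Flag some of the common errors listed above: empty lines, wrong
    column counts, unknown set names, and URIs that are not gs:// paths."""
    errors = []
    with open(path, encoding="utf-8", newline="") as f:
        for i, row in enumerate(csv.reader(f), start=1):
            if not row:
                errors.append(f"line {i}: empty line")
                continue
            if len(row) != 2:
                errors.append(f"line {i}: expected 2 columns, got {len(row)}")
                continue
            split, uri = row
            if split not in VALID_SETS:
                errors.append(f"line {i}: unknown set {split!r}")
            if not uri.startswith("gs://"):
                errors.append(f"line {i}: URI must start with gs://")
    return errors
```

This cannot verify access control or file contents; those errors surface only at import time.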