Entity extraction models for healthcare

AutoML Entity Extraction for Healthcare provides a starting point for you to train custom Healthcare Natural Language models.

Preparing your training data

To train an AutoML Entity Extraction for Healthcare model, you provide representative samples of the type of medical text that you want to analyze, annotated with labels for the types of entities that you want your custom model to extract. Keep the following requirements and recommendations in mind when compiling training data:

  • You must supply between 50 and 100,000 samples of medical text to train your custom model.
  • You can label the medical text with between one and 100 unique labels to annotate the entities that you want the model to learn to extract.
  • Each annotation is a span of text and an associated label.
  • Label names can be between two and 30 characters.
  • Each label can annotate between one and 10 words.
  • To train a model effectively, your training data set should use each label at least 200 times.
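The limits above can be checked before you upload anything. The following is a minimal sketch, not part of AutoML Natural Language: the function name check_training_set and the constant names are hypothetical, and it assumes each document has already been parsed into a Python dict in the annotated JSONL format described later in this section.

```python
from collections import Counter

# Limits quoted in the list above.
MIN_SAMPLES, MAX_SAMPLES = 50, 100_000
MIN_LABELS, MAX_LABELS = 1, 100
MIN_NAME_LEN, MAX_NAME_LEN = 2, 30
MIN_USES_PER_LABEL = 200  # recommendation, not a hard limit


def check_training_set(documents):
    """Return a list of human-readable problems found in `documents`.

    `documents` is a list of dicts; annotations are read from the
    optional "annotations" list and its "display_name" fields.
    """
    problems = []
    if not MIN_SAMPLES <= len(documents) <= MAX_SAMPLES:
        problems.append(
            f"sample count {len(documents)} is outside "
            f"{MIN_SAMPLES}-{MAX_SAMPLES}"
        )

    # Count how many times each label is used across all documents.
    uses = Counter(
        ann["display_name"]
        for doc in documents
        for ann in doc.get("annotations", [])
    )
    if not MIN_LABELS <= len(uses) <= MAX_LABELS:
        problems.append(
            f"unique label count {len(uses)} is outside "
            f"{MIN_LABELS}-{MAX_LABELS}"
        )
    for name, count in sorted(uses.items()):
        if not MIN_NAME_LEN <= len(name) <= MAX_NAME_LEN:
            problems.append(f"label name {name!r} has an invalid length")
        if count < MIN_USES_PER_LABEL:
            problems.append(
                f"label {name!r} is used {count} times; "
                f"at least {MIN_USES_PER_LABEL} is recommended"
            )
    return problems
```

An empty result means that none of these checks failed. The sketch does not check the per-annotation word limit, because how words are counted is not specified here.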

If you are annotating a structured or semi-structured document type, such as a medical invoice or a consent form, AutoML Natural Language can consider an annotation's position on the page as a factor contributing to its proper label.

Formatting training documents

Upload training data to AutoML Natural Language as JSONL files that contain the sample text and documents. Each line in the file is a single training document, specified in one of the following forms:

  • The full content of the document, between 10 and 10,000 bytes long (UTF-8 encoded)
  • The URI of a PDF file from a Cloud Storage bucket associated with your project
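Because the size limit for inline content is expressed in UTF-8 bytes rather than characters, text with non-ASCII characters reaches the limit sooner than its character count suggests. A minimal sketch of the check follows; the helper name content_size_ok is hypothetical.

```python
MIN_BYTES, MAX_BYTES = 10, 10_000  # limits quoted in the list above


def content_size_ok(content: str) -> bool:
    """Return True if the UTF-8 encoding of `content` is within the limits."""
    size = len(content.encode("utf-8"))
    return MIN_BYTES <= size <= MAX_BYTES


# "é" is one character but two bytes in UTF-8, so five of them
# already meet the 10-byte minimum.
print(content_size_ok("é" * 5))
```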

You can annotate text documents before you upload them, annotate unannotated documents in the AutoML Natural Language UI after you upload them, or use the UI to add annotations to documents that are already annotated.

JSONL documents

To help you create JSONL training files, AutoML Natural Language offers a Python script that converts plain text files into appropriately formatted JSONL files. See the comments in the script for details.
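The sketch below is not that script. It only illustrates the general idea of turning plain text into one JSONL line per document, using the unannotated format described in this section. The function names to_jsonl_line and write_jsonl are hypothetical.

```python
import json


def to_jsonl_line(text: str) -> str:
    """Serialize one unannotated document as a single JSONL line.

    json.dumps escapes any newlines inside `text`, so the result
    never spans more than one line.
    """
    return json.dumps({"text_snippet": {"content": text}}, ensure_ascii=False)


def write_jsonl(texts, path):
    """Write each string in `texts` as one line of a UTF-8 JSONL file."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(to_jsonl_line(text) + "\n")
```

For example, write_jsonl(["first note", "second note"], "train.jsonl") would produce a two-line file, where train.jsonl is a placeholder file name.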

Each document in the JSONL file has one of the following formats:

For unannotated documents:

{
  "text_snippet":
    {"content": string}
}

For annotated documents:

{
  "annotations": [
     {
      "text_extraction": {
         "text_segment": {
            "end_offset": number, "start_offset": number
          }
       },
       "display_name": string
     },
     {
       "text_extraction": {
          "text_segment": {
             "end_offset": number, "start_offset": number
           }
        },
        "display_name": string
     },
   ...
  ],
  "text_snippet":
    {"content": string}
}

In these formats:

  • Each text_extraction element identifies an annotation within the text_snippet.content. text_extraction indicates the position of the annotated text by specifying the number of characters from the start of text_snippet.content to the beginning (start_offset) and the end (end_offset) of the text.
  • display_name is the label for the entity.
  • start_offset and end_offset are character offsets, not byte offsets. The character at end_offset is not included in the annotated text.

For more information, see TextSegment.

The text_extraction elements are optional; you can omit them if you plan to annotate the document using the AutoML Natural Language UI. Each annotation can cover up to ten tokens, typically words. Annotations can't overlap; that is, the start_offset of an annotation can't fall between the start_offset and end_offset of another annotation in the same document.
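The offset rules can be sketched in a few lines of Python. This is an illustration only: the helper names make_annotation and find_overlaps are hypothetical, and the code assumes that a "character" offset corresponds to an index into a Python str (a Unicode code point), which this section does not confirm.

```python
def make_annotation(content: str, phrase: str, label: str, start_from: int = 0) -> dict:
    """Build an annotation for the first occurrence of `phrase` in `content`.

    end_offset is exclusive, so content[start_offset:end_offset] == phrase.
    Raises ValueError if `phrase` does not occur.
    """
    start = content.index(phrase, start_from)
    return {
        "text_extraction": {
            "text_segment": {
                "start_offset": start,
                "end_offset": start + len(phrase),
            }
        },
        "display_name": label,
    }


def find_overlaps(annotations):
    """Return neighboring (start, end) spans that overlap.

    Spans are sorted by start offset; if any two spans overlap, at least
    one neighboring pair does, so an empty result means no overlaps.
    """
    spans = sorted(
        (
            a["text_extraction"]["text_segment"]["start_offset"],
            a["text_extraction"]["text_segment"]["end_offset"],
        )
        for a in annotations
    )
    return [(p, q) for p, q in zip(spans, spans[1:]) if q[0] < p[1]]
```

Slicing the content with an annotation's two offsets is a quick way to confirm that the annotation covers the intended text before you upload the file.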

The following sample training document annotates the disease mentions in an abstract from the NCBI corpus. The sample includes line breaks for readability; in the JSONL file, each document must be on a single line.

{
  "annotations": [
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 67,
          "start_offset": 62
        }
      },
      "display_name": "Modifier"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 158,
          "start_offset": 141
        }
      },
      "display_name": "SpecificDisease"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 330,
          "start_offset": 290
        }
      },
      "display_name": "SpecificDisease"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 337,
          "start_offset": 332
        }
      },
      "display_name": "SpecificDisease"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 627,
          "start_offset": 610
        }
      },
      "display_name": "Modifier"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 754,
          "start_offset": 749
        }
      },
      "display_name": "Modifier"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 875,
          "start_offset": 865
        }
      },
      "display_name": "Modifier"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 968,
          "start_offset": 951
        }
      },
      "display_name": "Modifier"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 1553,
          "start_offset": 1548
        }
      },
      "display_name": "Modifier"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 1652,
          "start_offset": 1606
        }
      },
      "display_name": "CompositeMention"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 1833,
          "start_offset": 1826
        }
      },
      "display_name": "DiseaseClass"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 1860,
          "start_offset": 1843
        }
      },
      "display_name": "SpecificDisease"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 1930,
          "start_offset": 1913
        }
      },
      "display_name": "SpecificDisease"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 2129,
          "start_offset": 2111
        }
      },
      "display_name": "SpecificDisease"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 2188,
          "start_offset": 2160
        }
      },
      "display_name": "SpecificDisease"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 2260,
          "start_offset": 2243
        }
      },
      "display_name": "Modifier"
    },
    {
      "text_extraction": {
        "text_segment": {
          "end_offset": 2356,
          "start_offset": 2339
        }
      },
      "display_name": "Modifier"
    }
  ],
  "text_snippet": {
    "content": "10051005\tA common MSH2 mutation in English and North American HNPCC families:
      origin, phenotypic expression, and sex specific differences in colorectal cancer .\tThe
      frequency , origin , and phenotypic expression of a germline MSH2 gene mutation previously
      identified in seven kindreds with hereditary non-polyposis cancer syndrome (HNPCC) was
      investigated . The mutation ( A-- > T at nt943 + 3 ) disrupts the 3 splice site of exon 5
      leading to the deletion of this exon from MSH2 mRNA and represents the only frequent MSH2
      mutation so far reported . Although this mutation was initially detected in four of 33
      colorectal cancer families analysed from eastern England , more extensive analysis has
      reduced the frequency to four of 52 ( 8 % ) English HNPCC kindreds analysed . In contrast ,
      the MSH2 mutation was identified in 10 of 20 ( 50 % ) separately identified colorectal
      families from Newfoundland . To investigate the origin of this mutation in colorectal cancer
      families from England ( n = 4 ) , Newfoundland ( n = 10 ) , and the United States ( n = 3 ) ,
      haplotype analysis using microsatellite markers linked to MSH2 was performed . Within the
      English and US families there was little evidence for a recent common origin of the MSH2
      splice site mutation in most families . In contrast , a common haplotype was identified
      at the two flanking markers ( CA5 and D2S288 ) in eight of the Newfoundland families .
      These findings suggested a founder effect within Newfoundland similar to that reported by
      others for two MLH1 mutations in Finnish HNPCC families . We calculated age related risks
      of all , colorectal , endometrial , and ovarian cancers in nt943 + 3 A-- > T MSH2 mutation
      carriers ( n = 76 ) for all patients and for men and women separately . For both sexes combined ,
      the penetrances at age 60 years for all cancers  and for colorectal cancer were 0 . 86 and 0 . 57 ,
      respectively . The risk of colorectal cancer was significantly higher  in males
      than females ( 0 . 63 v 0 . 30 and 0 . 84 v 0 . 44 at ages 50 and 60 years , respectively ) .
      For females there was a high risk of endometrial cancer ( 0 . 5 at age 60 years ) and premenopausal
      ovarian cancer ( 0 . 2 at 50 years ) . These intersex differences in colorectal cancer risks
      have implications for screening programmes and for attempts to identify colorectal cancer
      susceptibility modifiers .\n "
  }
}

PDF documents

To upload a PDF file as a document, wrap the file path inside a JSONL document element as shown in the following sample. Each document must be one line in the JSONL file; the sample includes line breaks for readability, which you need to remove in the JSONL file. For more information, see jsonlines.org.

{
  "document": {
    "input_config": {
      "gcs_source": {
        "input_uris": [ "gs://cloud-ml-data/NL-entity/sample.pdf" ]
      }
    }
  }
}

The value of the input_uris element is the path to a PDF file in a Cloud Storage bucket associated with your project. The maximum size of the PDF file is 2 MB.
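A PDF entry can be generated as a single JSONL line instead of being written by hand. The following is a minimal sketch; the function name pdf_document_line is hypothetical, and the gs:// prefix check is only a basic sanity check. It does not verify that the file exists, that it belongs to your project, or that it is within the size limit.

```python
import json


def pdf_document_line(gcs_uri: str) -> str:
    """Return a single JSONL line that references a PDF in Cloud Storage."""
    if not gcs_uri.startswith("gs://"):
        raise ValueError(f"expected a Cloud Storage URI, got {gcs_uri!r}")
    doc = {
        "document": {
            "input_config": {
                "gcs_source": {"input_uris": [gcs_uri]}
            }
        }
    }
    # json.dumps without an indent argument emits no line breaks.
    return json.dumps(doc)


print(pdf_document_line("gs://cloud-ml-data/NL-entity/sample.pdf"))
```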