AutoML Entity Extraction for Healthcare provides a starting point for you to train custom Healthcare Natural Language models.
Preparing your training data
To train an AutoML Entity Extraction for Healthcare model, you provide representative samples of the type of medical text that you want to analyze, annotated with labels that identify the types of entities you want your custom model to identify. Consider the following recommendations when compiling training data:
- You must supply between 50 and 100,000 samples of medical text to train your custom model.
- You can label the medical text with between one and 100 unique labels to annotate the entities that you want the model to learn to extract.
- Each annotation is a span of text and an associated label.
- Label names can be between two and 30 characters.
- Each label can annotate between one and 10 words.
- To train a model effectively, your training data set should use each label at least 200 times.
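The limits above can be checked before upload. The following is a minimal sketch, not part of AutoML Natural Language: the function name check_training_set and its input shape (one list of label names per sample) are assumptions made for illustration.

```python
from collections import Counter

# Limits taken from the recommendations listed above.
MIN_SAMPLES, MAX_SAMPLES = 50, 100_000
MAX_LABELS = 100
MIN_NAME_LEN, MAX_NAME_LEN = 2, 30
MIN_USES_PER_LABEL = 200


def check_training_set(labels_per_sample):
    """Return a list of problems; an empty list means every check passed.

    labels_per_sample holds one list per text sample, containing the label
    name of each annotation in that sample (hypothetical input shape).
    """
    problems = []
    n = len(labels_per_sample)
    if not MIN_SAMPLES <= n <= MAX_SAMPLES:
        problems.append(f"sample count {n} is outside {MIN_SAMPLES}-{MAX_SAMPLES}")

    # Count how often each label is used across the whole data set.
    uses = Counter(label for labels in labels_per_sample for label in labels)
    if not 1 <= len(uses) <= MAX_LABELS:
        problems.append(f"unique label count {len(uses)} is outside 1-{MAX_LABELS}")

    for name, count in sorted(uses.items()):
        if not MIN_NAME_LEN <= len(name) <= MAX_NAME_LEN:
            problems.append(
                f"label {name!r} has a name length outside {MIN_NAME_LEN}-{MAX_NAME_LEN}"
            )
        if count < MIN_USES_PER_LABEL:
            problems.append(
                f"label {name!r} is used {count} times, fewer than {MIN_USES_PER_LABEL}"
            )
    return problems
```

The function returns messages instead of raising an error, so all problems in a data set can be reported in one pass.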
If you are annotating a structured or semi-structured document type, such as a medical invoice or a consent form, AutoML Natural Language can consider an annotation's position on the page as a factor contributing to its proper label.
Formatting training documents
To format training documents, upload training data to AutoML Natural Language as JSONL files that contain the sample text and documents. Each line in the file is a single training document, specified in one of the following forms:
- The full content of the document, between 10 and 10,000 bytes long (UTF-8 encoded)
- The URI of a PDF file from a Cloud Storage bucket associated with your project
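Because the size limit for inline content is stated in bytes of UTF-8 and not in characters, text with non-ASCII characters can exceed it sooner than its character count suggests. A small sketch of the check follows; the helper name content_size_ok is hypothetical.

```python
MIN_BYTES, MAX_BYTES = 10, 10_000  # inline content limits stated above


def content_size_ok(content: str) -> bool:
    """Return True if the UTF-8 encoding of content is within the limits."""
    size = len(content.encode("utf-8"))
    return MIN_BYTES <= size <= MAX_BYTES
```

For example, a string of 5,001 "é" characters encodes to 10,002 bytes and fails the check, even though it has far fewer than 10,000 characters.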
You can annotate text documents directly before uploading them, annotate them in the AutoML Natural Language UI after uploading unannotated documents, or add annotations to previously annotated documents in the UI.
JSONL documents
To help you create JSONL training files, AutoML Natural Language offers a Python script that converts plain text files into appropriately formatted JSONL files. See the comments in the script for details.
Each document in the JSONL file has one of the following formats:
For unannotated documents:
{ "text_snippet": {"content": string} }
For annotated documents:
{ "annotations": [ { "text_extraction": { "text_segment": { "end_offset": number, "start_offset": number } }, "display_name": string }, { "text_extraction": { "text_segment": { "end_offset": number, "start_offset": number } }, "display_name": string }, ... ], "text_snippet": {"content": string} }
In the sample JSONL files:
- Each text_extraction element identifies an annotation within the text_snippet.content.
- text_extraction indicates the position of the annotated text by specifying the number of characters from the start of text_snippet.content to the beginning (start_offset) and the end (end_offset) of the text.
- display_name is the label for the entity.
- start_offset and end_offset are character offsets, not byte offsets. The character at the end_offset is not included in the text. For more information, see TextSegment.
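As an illustration of how the offsets relate to the content, the following sketch builds an annotated document from a piece of text and a list of (phrase, label) pairs. The helper names annotate and to_jsonl_line are hypothetical, and the sketch assumes each phrase is annotated at its first occurrence only. Python string indices count Unicode code points; whether the service counts characters outside the Basic Multilingual Plane the same way is not stated here, so treat that as an assumption.

```python
import json


def annotate(content, spans):
    """Build an annotated training document.

    spans is a list of (phrase, label) pairs; each phrase is located at its
    first occurrence in content. Offsets are character (not byte) offsets,
    and end_offset is exclusive.
    """
    annotations = []
    for phrase, label in spans:
        start = content.find(phrase)
        if start == -1:
            raise ValueError(f"phrase not found in content: {phrase!r}")
        annotations.append({
            "text_extraction": {
                "text_segment": {
                    "start_offset": start,
                    "end_offset": start + len(phrase),
                }
            },
            "display_name": label,
        })
    return {"annotations": annotations, "text_snippet": {"content": content}}


def to_jsonl_line(document):
    """Serialize one document as a single JSONL line (no raw newlines)."""
    return json.dumps(document, ensure_ascii=False)
```

With this shape, content[start_offset:end_offset] returns exactly the annotated phrase, which is a convenient sanity check on hand-written offsets.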
The text_extraction elements are optional; you can omit them if you plan to annotate the document using the AutoML Natural Language UI. Each annotation can cover up to ten tokens, typically words. Annotations can't overlap; that is, the start_offset of an annotation can't be between the start_offset and end_offset of another annotation in the same document.
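The no-overlap rule can be checked mechanically. The sketch below uses a hypothetical helper, find_overlaps, that sorts the segments and reports every annotation that starts inside an earlier one. It treats end_offset as exclusive, so two annotations that merely touch (one ends where the next begins) are accepted; that reading is an assumption.

```python
def find_overlaps(annotations):
    """Return (earlier, later) pairs of (start, end) segments that overlap."""
    segments = sorted(
        (
            a["text_extraction"]["text_segment"]["start_offset"],
            a["text_extraction"]["text_segment"]["end_offset"],
        )
        for a in annotations
    )
    overlaps = []
    furthest = None  # segment with the largest end_offset seen so far
    for seg in segments:
        # A segment overlaps if it starts before an earlier segment has ended.
        if furthest is not None and seg[0] < furthest[1]:
            overlaps.append((furthest, seg))
        if furthest is None or seg[1] > furthest[1]:
            furthest = seg
    return overlaps
```

An empty result means the annotations in that document satisfy the rule as interpreted here.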
The following sample training document identifies the specific diseases mentioned in an abstract from the NCBI corpus:
{ "annotations": [ { "text_extraction": { "text_segment": { "end_offset": 67, "start_offset": 62 } }, "display_name": "Modifier" }, { "text_extraction": { "text_segment": { "end_offset": 158, "start_offset": 141 } }, "display_name": "SpecificDisease" }, { "text_extraction": { "text_segment": { "end_offset": 330, "start_offset": 290 } }, "display_name": "SpecificDisease" }, { "text_extraction": { "text_segment": { "end_offset": 337, "start_offset": 332 } }, "display_name": "SpecificDisease" }, { "text_extraction": { "text_segment": { "end_offset": 627, "start_offset": 610 } }, "display_name": "Modifier" }, { "text_extraction": { "text_segment": { "end_offset": 754, "start_offset": 749 } }, "display_name": "Modifier" }, { "text_extraction": { "text_segment": { "end_offset": 875, "start_offset": 865 } }, "display_name": "Modifier" }, { "text_extraction": { "text_segment": { "end_offset": 968, "start_offset": 951 } }, "display_name": "Modifier" }, { "text_extraction": { "text_segment": { "end_offset": 1553, "start_offset": 1548 } }, "display_name": "Modifier" }, { "text_extraction": { "text_segment": { "end_offset": 1652, "start_offset": 1606 } }, "display_name": "CompositeMention" }, { "text_extraction": { "text_segment": { "end_offset": 1833, "start_offset": 1826 } }, "display_name": "DiseaseClass" }, { "text_extraction": { "text_segment": { "end_offset": 1860, "start_offset": 1843 } }, "display_name": "SpecificDisease" }, { "text_extraction": { "text_segment": { "end_offset": 1930, "start_offset": 1913 } }, "display_name": "SpecificDisease" }, { "text_extraction": { "text_segment": { "end_offset": 2129, "start_offset": 2111 } }, "display_name": "SpecificDisease" }, { "text_extraction": { "text_segment": { "end_offset": 2188, "start_offset": 2160 } }, "display_name": "SpecificDisease" }, { "text_extraction": { "text_segment": { "end_offset": 2260, "start_offset": 2243 } }, "display_name": "Modifier" }, { "text_extraction": { "text_segment": { "end_offset": 2356, 
"start_offset": 2339 } }, "display_name": "Modifier" } ], "text_snippet": { "content": "10051005\tA common MSH2 mutation in English and North American HNPCC families: origin, phenotypic expression, and sex specific differences in colorectal cancer .\tThe frequency , origin , and phenotypic expression of a germline MSH2 gene mutation previously identified in seven kindreds with hereditary non-polyposis cancer syndrome (HNPCC) was investigated . The mutation ( A-- > T at nt943 + 3 ) disrupts the 3 splice site of exon 5 leading to the deletion of this exon from MSH2 mRNA and represents the only frequent MSH2 mutation so far reported . Although this mutation was initially detected in four of 33 colorectal cancer families analysed from eastern England , more extensive analysis has reduced the frequency to four of 52 ( 8 % ) English HNPCC kindreds analysed . In contrast , the MSH2 mutation was identified in 10 of 20 ( 50 % ) separately identified colorectal families from Newfoundland . To investigate the origin of this mutation in colorectal cancer families from England ( n = 4 ) , Newfoundland ( n = 10 ) , and the United States ( n = 3 ) , haplotype analysis using microsatellite markers linked to MSH2 was performed . Within the English and US families there was little evidence for a recent common origin of the MSH2 splice site mutation in most families . In contrast , a common haplotype was identified at the two flanking markers ( CA5 and D2S288 ) in eight of the Newfoundland families . These findings suggested a founder effect within Newfoundland similar to that reported by others for two MLH1 mutations in Finnish HNPCC families . We calculated age related risks of all , colorectal , endometrial , and ovarian cancers in nt943 + 3 A-- > T MSH2 mutation carriers ( n = 76 ) for all patients and for men and women separately . For both sexes combined , the penetrances at age 60 years for all cancers and for colorectal cancer were 0 . 86 and 0 . 57 , respectively . 
The risk of colorectal cancer was significantly higher in males than females ( 0 . 63 v 0 . 30 and 0 . 84 v 0 . 44 at ages 50 and 60 years , respectively ) . For females there was a high risk of endometrial cancer ( 0 . 5 at age 60 years ) and premenopausal ovarian cancer ( 0 . 2 at 50 years ) . These intersex differences in colorectal cancer risks have implications for screening programmes and for attempts to identify colorectal cancer susceptibility modifiers .\n " } }
PDF documents
Each document must be one line in the JSONL file. The following sample includes line breaks for readability; you need to remove them in the JSONL file. For more information, see jsonlines.org. To upload a PDF file as a document, wrap the file path inside a JSONL document element as shown in the following sample:
{
  "document": {
    "input_config": {
      "gcs_source": {
        "input_uris": [ "gs://cloud-ml-data/NL-entity/sample.pdf" ]
      }
    }
  }
}
The value of the input_uris element is the path to a PDF file in a Cloud Storage bucket associated with your project. The maximum size of the PDF file is 2 MB.
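One way to produce such entries is to generate them with a JSON serializer, which guarantees that each document occupies exactly one line. The helper name pdf_document_line below is hypothetical, and the gs:// prefix check is only a basic sanity check, not a guarantee that the file exists or is within the size limit.

```python
import json


def pdf_document_line(gcs_uri: str) -> str:
    """Wrap a Cloud Storage PDF path in a one-line JSONL document element."""
    if not gcs_uri.startswith("gs://"):
        raise ValueError(f"expected a Cloud Storage URI starting with gs://: {gcs_uri!r}")
    document = {
        "document": {
            "input_config": {
                "gcs_source": {"input_uris": [gcs_uri]}
            }
        }
    }
    # json.dumps emits no line breaks unless an indent is requested.
    return json.dumps(document)
```

Writing the returned strings to a file, one per line, yields a JSONL file in the form described above.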