Training pipeline will perform following transformation functions.
The text as is--no change to case, punctuation, spelling, tense,
and so on.
Tokenize text to words. Convert each words to a dictionary lookup
index and generate an embedding for each index. Combine the
embedding of all elements into a single embedding using the mean.
Tokenization is based on unicode script boundaries.
Missing values get their own lookup index and resulting
embedding.
Stop-words receive no special treatment and are not removed.