WordSeparatorAnalyzer

com.google.appengine.api.search.dev

Class WordSeparatorAnalyzer

  • java.lang.Object
    • Analyzer
      • com.google.appengine.api.search.dev.WordSeparatorAnalyzer


  • public class WordSeparatorAnalyzer
    extends Analyzer
    A custom analyzer that tokenizes text the way the Search API backend does. It detects whether the provided text is in a CJK language and, if so, tokenizes it with CJKTokenizer. CJKTokenizer tokenizes based on bigrams, so a string like "ABCD" is tokenized to ["A", "AB", "BC", "CD", "D"]. If the text is not CJK, it is assumed to use standard Latin word separators. Latin text is tokenized with a slightly customized LetterTokenizer, and the resulting tokens are passed through StandardFilter and LowerCaseFilter. The LetterTokenizer is customized to use the same word separators as ST-BTI.
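
The bigram output described above can be reproduced with a few lines of plain Java. This is only a sketch of the token shape stated in the description, not the CJKTokenizer implementation; the class name BigramSketch, the method name bigrams, and the handling of one-character input are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class BigramSketch {
    /**
     * Returns the leading single character, every overlapping pair,
     * and the trailing single character, matching the "ABCD" example.
     * Works on code points so supplementary characters stay intact.
     */
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        int[] cps = text.codePoints().toArray();
        if (cps.length == 0) {
            return tokens;
        }
        tokens.add(new String(cps, 0, 1));
        if (cps.length == 1) {
            // Assumption: a lone character yields a single token.
            return tokens;
        }
        for (int i = 0; i + 1 < cps.length; i++) {
            tokens.add(new String(cps, i, 2));
        }
        tokens.add(new String(cps, cps.length - 1, 1));
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("ABCD")); // prints [A, AB, BC, CD, D]
    }
}
```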
    • Constructor Summary

      Constructors 
      Constructor and Description
      WordSeparatorAnalyzer()
      Create a new WordSeparatorAnalyzer that always tries to detect CJK.
      WordSeparatorAnalyzer(boolean detectCjk)
      Create a new WordSeparatorAnalyzer with CJK detection enabled or disabled.
    • Method Summary

      Modifier and Type Method and Description
      static java.lang.String normalize(java.lang.String tokenizeString)
      Transforms the input to lowercase and replaces all word separators with spaces.
      static java.lang.String removeDiacriticals(java.lang.String input)
      Removes all diacritical marks from the input.
      static java.util.List<java.lang.String> tokenList(java.lang.String tokenizeString)
      Returns a list of tokens for a string.
      TokenStream tokenStream(java.lang.String fieldName, java.io.Reader reader)
      Constructs a tokenizer that can tokenize CJK or Latin text.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • WordSeparatorAnalyzer

        public WordSeparatorAnalyzer(boolean detectCjk)
        Create a new WordSeparatorAnalyzer with CJK detection enabled or disabled.
        Parameters:
        detectCjk - If true, will attempt to detect and segment CJK. If false, assumes all text can be segmented using word separators.
      • WordSeparatorAnalyzer

        public WordSeparatorAnalyzer()
        Create a new WordSeparatorAnalyzer that always tries to detect CJK.
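
This page does not say how CJK text is detected. One plausible approach, shown here purely as an illustration, is to check whether any code point belongs to a CJK Unicode script. The class CjkDetectionSketch, the method containsCjk, and the choice of scripts are assumptions and may differ from what the analyzer actually does.

```java
import java.lang.Character.UnicodeScript;

final class CjkDetectionSketch {
    /** Returns true if any code point is Han, Hiragana, Katakana, or Hangul. */
    static boolean containsCjk(String text) {
        return text.codePoints().anyMatch(cp -> {
            UnicodeScript script = UnicodeScript.of(cp);
            return script == UnicodeScript.HAN
                || script == UnicodeScript.HIRAGANA
                || script == UnicodeScript.KATAKANA
                || script == UnicodeScript.HANGUL;
        });
    }

    public static void main(String[] args) {
        // U+6771 U+4EAC are the two Han characters of "Tokyo".
        System.out.println(containsCjk("\u6771\u4eac tower")); // prints true
        System.out.println(containsCjk("hello world"));        // prints false
    }
}
```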
    • Method Detail

      • tokenStream

        public TokenStream tokenStream(java.lang.String fieldName,
                                       java.io.Reader reader)
        Constructs a tokenizer that can tokenize CJK or Latin text.
        Parameters:
        fieldName - Ignored.
        reader - A stream to tokenize. mark() and reset() support is not needed.
        Returns:
        A TokenStream that represents the tokenization of the data in reader.
      • tokenList

        public static java.util.List<java.lang.String> tokenList(java.lang.String tokenizeString)
        Returns a list of tokens for a string.
      • normalize

        public static java.lang.String normalize(java.lang.String tokenizeString)
        Transforms the input to lowercase and replaces all word separators with spaces.
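
The effect of normalize can be approximated with the JDK alone. The exact separator set is not listed on this page, so the sketch below assumes that every character that is neither a letter nor a digit counts as a word separator; that rule, the class name NormalizeSketch, and the one-for-one replacement (runs of separators are not collapsed) are illustrative assumptions.

```java
import java.util.Locale;

final class NormalizeSketch {
    /** Lowercases the text and replaces each assumed separator with a space. */
    static String normalize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.toLowerCase(Locale.ROOT).codePoints().forEach(cp ->
            // Assumption: anything that is not a letter or digit is a separator.
            out.appendCodePoint(Character.isLetterOrDigit(cp) ? cp : ' '));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("Hello,World-Wide WEB")); // prints hello world wide web
    }
}
```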
      • removeDiacriticals

        public static java.lang.String removeDiacriticals(java.lang.String input)
        Removes all diacritical marks from the input. This has the effect of transforming marked glyphs into their "equivalent" non-marked form. For example, "éøç" becomes "eoc".
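
A common way to get this effect with the JDK is to decompose the text with java.text.Normalizer and drop the combining marks. That alone does not cover the documented example, because "ø" has no canonical decomposition, so the sketch below adds a small explicit mapping table. The class DiacriticsSketch and the contents of that table are assumptions for illustration and are not taken from the actual implementation.

```java
import java.text.Normalizer;
import java.util.Map;

final class DiacriticsSketch {
    // Hypothetical table for letters that NFD leaves intact
    // (stroked o, l, and d in both cases).
    private static final Map<Character, String> EXTRA = Map.of(
        '\u00f8', "o", '\u00d8', "O",
        '\u0142', "l", '\u0141', "L",
        '\u0111', "d", '\u0110', "D");

    static String removeDiacriticals(String input) {
        // Split base letters from their marks, then delete the marks.
        String stripped = Normalizer.normalize(input, Normalizer.Form.NFD)
            .replaceAll("\\p{M}+", "");
        StringBuilder out = new StringBuilder(stripped.length());
        for (int i = 0; i < stripped.length(); i++) {
            char c = stripped.charAt(i);
            String mapped = EXTRA.get(c);
            if (mapped != null) {
                out.append(mapped);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The escapes spell e-acute, o-stroke, c-cedilla.
        System.out.println(removeDiacriticals("\u00e9\u00f8\u00e7")); // prints eoc
    }
}
```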