com.google.appengine.api.search.dev
Class WordSeparatorAnalyzer
- java.lang.Object
-
- Analyzer
-
- com.google.appengine.api.search.dev.WordSeparatorAnalyzer
-
public class WordSeparatorAnalyzer extends Analyzer
A custom analyzer that tokenizes text the way the Search API backend does. It detects whether the provided text is in a CJK language and, if so, uses CJKTokenizer to tokenize it.
CJKTokenizer tokenizes based on bigrams, so a string like "ABCD" will be tokenized to ["A", "AB", "BC", "CD", "D"]. If the text is not CJK, the analyzer assumes that it uses standard Latin word separators. For Latin text, it uses a slightly customized LetterTokenizer and passes the tokens through StandardFilter and LowerCaseFilter. The LetterTokenizer is customized to use the same word separators as ST-BTI.
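The bigram behavior described above can be illustrated with a minimal, self-contained sketch. It does not use CJKTokenizer itself; the class name BigramSketch and the method bigrams are hypothetical, and the handling of single-character input is an assumption, since the description only gives the "ABCD" example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the bigram output described above.
// This is not the CJKTokenizer implementation.
final class BigramSketch {

    // Emits the first character, every overlapping pair, then the last character.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        // Work on code points so supplementary characters are not split.
        int[] cps = text.codePoints().toArray();
        if (cps.length == 0) {
            return tokens;
        }
        tokens.add(new String(cps, 0, 1));
        for (int i = 0; i + 1 < cps.length; i++) {
            tokens.add(new String(cps, i, 2));
        }
        // Assumption: a single-character input yields just that character once.
        if (cps.length > 1) {
            tokens.add(new String(cps, cps.length - 1, 1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("ABCD")); // prints [A, AB, BC, CD, D]
    }
}
```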
-
-
Constructor Summary
Constructors
Constructor and Description
WordSeparatorAnalyzer()
Create a new WordSeparatorAnalyzer that always tries to detect CJK.
WordSeparatorAnalyzer(boolean detectCjk)
Create a new WordSeparatorAnalyzer.
-
Method Summary
Modifier and Type, Method and Description
static java.lang.String normalize(java.lang.String tokenizeString)
Transforms to lowercase and replaces all word separators with spaces.
static java.lang.String removeDiacriticals(java.lang.String input)
Removes all diacritical marks from the input.
static java.util.List<java.lang.String> tokenList(java.lang.String tokenizeString)
Returns a list of tokens for a string.
TokenStream tokenStream(java.lang.String fieldName, java.io.Reader reader)
Constructs a tokenizer that can tokenize CJK or Latin text.
-
-
-
Constructor Detail
-
WordSeparatorAnalyzer
public WordSeparatorAnalyzer(boolean detectCjk)
Create a new WordSeparatorAnalyzer.
Parameters:
detectCjk - If true, will attempt to detect and segment CJK. If false, assumes all text can be segmented using word separators.
-
WordSeparatorAnalyzer
public WordSeparatorAnalyzer()
Create a new WordSeparatorAnalyzer that always tries to detect CJK.
-
-
Method Detail
-
tokenStream
public TokenStream tokenStream(java.lang.String fieldName, java.io.Reader reader)
Constructs a tokenizer that can tokenize CJK or Latin text.
Parameters:
fieldName - Ignored.
reader - A stream to tokenize. mark() and reset() support is not needed.
Returns:
A TokenStream that represents the tokenization of the data in reader.
-
tokenList
public static java.util.List<java.lang.String> tokenList(java.lang.String tokenizeString)
Returns a list of tokens for a string.
-
normalize
public static java.lang.String normalize(java.lang.String tokenizeString)
Transforms to lowercase and replaces all word separators with spaces.
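As a rough illustration of what "lowercase and replace word separators with spaces" can look like, here is a minimal sketch. It is not the library's implementation: the class NormalizeSketch is hypothetical, and the separator set (whitespace plus ASCII punctuation) is an assumption, because this page does not list the separators the real method uses.

```java
import java.util.Locale;

// Hypothetical illustration; the real separator set is not documented here.
final class NormalizeSketch {

    // Assumption: whitespace and ASCII punctuation count as word separators.
    private static final String SEPARATOR_CLASS = "[\\s\\p{Punct}]";

    static String normalize(String text) {
        // Lowercase first, then turn each separator character into one space.
        return text.toLowerCase(Locale.ROOT).replaceAll(SEPARATOR_CLASS, " ");
    }

    public static void main(String[] args) {
        System.out.println(normalize("Hello,World\tFOO-bar")); // prints hello world foo bar
    }
}
```

Each separator character is replaced individually, so consecutive separators produce consecutive spaces in this sketch.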
-
removeDiacriticals
public static java.lang.String removeDiacriticals(java.lang.String input)
Removes all diacritical marks from the input. This has the effect of transforming marked glyphs into their "equivalent" non-marked form. For example, "éøç" becomes "eoc".
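One common way to strip diacritical marks in plain Java is Unicode decomposition with java.text.Normalizer, sketched below. This is an assumption about technique, not the method's actual implementation; the class DiacriticalsSketch is hypothetical. Note that "ø" has no canonical decomposition, so the sketch maps it by hand in order to reproduce the documented "eoc" result.

```java
import java.text.Normalizer;

// Hypothetical illustration using NFD decomposition; not the library's code.
final class DiacriticalsSketch {

    static String removeDiacriticals(String input) {
        // NFD splits a marked letter into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Drop every combining mark (Unicode general category M).
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Assumption: o with stroke (U+00F8, U+00D8) does not decompose,
        // so it is replaced explicitly.
        return stripped.replace('\u00f8', 'o').replace('\u00d8', 'O');
    }

    public static void main(String[] args) {
        // Input is e-acute, o-with-stroke, c-cedilla.
        System.out.println(removeDiacriticals("\u00e9\u00f8\u00e7")); // prints eoc
    }
}
```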
-
-