Note: Java 8 reached end of support on January 31, 2024. Your existing Java 8 applications will continue to run and receive traffic. However, App Engine might block re-deployment of applications that use runtimes after their end of support date. We recommend that you migrate to the latest supported version of Java.

WordSeparatorAnalyzer

java.lang.Object
- Analyzer
- - com.google.appengine.api.search.dev.WordSeparatorAnalyzer

```
public class WordSeparatorAnalyzer
extends Analyzer
```
A custom analyzer that tokenizes text the way the Search API backend does. When CJK detection is enabled and the provided text is detected as CJK, it is tokenized with CJKTokenizer. CJKTokenizer tokenizes based on bigrams, so a string like "ABCD" is tokenized to ["A", "AB", "BC", "CD", "D"]. Text that is not CJK is assumed to use standard Latin word separators: it is tokenized with a slightly customized LetterTokenizer, and the tokens are passed through StandardFilter and LowerCaseFilter. The LetterTokenizer is customized to use the same word separators as ST-BTI.
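
The dispatch described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the class's actual implementation: the names `WordSeparatorSketch`, `looksCjk`, `bigrams`, `words`, and `tokenize` are hypothetical, the CJK check (any Han, Hiragana, Katakana, or Hangul code point) is a guess at what "detect" means, and the word split on non-letters stands in for the customized LetterTokenizer, whose real separator set is not documented here. The bigram output follows the "ABCD" example given in the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordSeparatorSketch {
    // Assumption: text counts as CJK if any code point belongs to one of these scripts.
    static boolean looksCjk(String text) {
        return text.codePoints().anyMatch(cp -> {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            return script == Character.UnicodeScript.HAN
                || script == Character.UnicodeScript.HIRAGANA
                || script == Character.UnicodeScript.KATAKANA
                || script == Character.UnicodeScript.HANGUL;
        });
    }

    // Emits the first character, every overlapping pair, then the last character,
    // mirroring the documented example "ABCD" -> ["A", "AB", "BC", "CD", "D"].
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        int[] cps = text.codePoints().toArray();
        if (cps.length == 0) {
            return tokens;
        }
        tokens.add(new String(cps, 0, 1));
        for (int i = 0; i + 1 < cps.length; i++) {
            tokens.add(new String(cps, i, 2));
        }
        if (cps.length > 1) {
            tokens.add(new String(cps, cps.length - 1, 1));
        }
        return tokens;
    }

    // Stand-in for the Latin path: lowercase, then split on runs of non-letters.
    static List<String> words(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!word.isEmpty()) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    // Mirrors the detectCjk constructor flag: when false, everything takes the word path.
    static List<String> tokenize(String text, boolean detectCjk) {
        return detectCjk && looksCjk(text) ? bigrams(text) : words(text);
    }

    public static void main(String[] args) {
        System.out.println(tokenize("ABCD efgh", true));
        System.out.println(bigrams("ABCD"));
    }
}
```

The sketch applies the bigram step to the whole input string; how the real analyzer handles mixed CJK and Latin text or embedded whitespace is not stated in this reference.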

Constructor Summary

Constructors
Constructor and Description
`WordSeparatorAnalyzer()` Create a new WordSeparatorAnalyzer that always tries to detect CJK.
`WordSeparatorAnalyzer(boolean detectCjk)` Create a new WordSeparatorAnalyzer.

Method Summary

| Modifier and Type | Method and Description |
| --- | --- |
| `static java.lang.String` | `normalize(java.lang.String tokenizeString)` Transforms to lowercase and replaces all word separators with spaces. |
| `static java.lang.String` | `removeDiacriticals(java.lang.String input)` Removes all diacritical marks from the input. |
| `static java.util.List<java.lang.String>` | `tokenList(java.lang.String tokenizeString)` Returns a list of tokens for a string. |
| `TokenStream` | `tokenStream(java.lang.String fieldName, java.io.Reader reader)` Constructs a tokenizer that can tokenize CJK or Latin text. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - WordSeparatorAnalyzer
```
public WordSeparatorAnalyzer(boolean detectCjk)
```
    Create a new WordSeparatorAnalyzer.
    
    Parameters:
    
    detectCjk - If true, will attempt to detect and segment CJK. If false, assumes all text can be segmented using word separators.
  - WordSeparatorAnalyzer
```
public WordSeparatorAnalyzer()
```
    Create a new WordSeparatorAnalyzer that always tries to detect CJK.
- Method Detail
  - tokenStream
```
public TokenStream tokenStream(java.lang.String fieldName,
                               java.io.Reader reader)
```
    Constructs a tokenizer that can tokenize CJK or Latin text.
    
    Parameters:
    
    fieldName - Ignored.
    
    reader - A stream to tokenize. mark() and reset() support is not needed.
    
    Returns:
    
    A TokenStream that represents the tokenization of the data in reader.
  - tokenList
```
public static java.util.List<java.lang.String> tokenList(java.lang.String tokenizeString)
```
    Returns a list of tokens for a string.
  - normalize
```
public static java.lang.String normalize(java.lang.String tokenizeString)
```
    Transforms to lowercase and replaces all word separators with spaces.
  - removeDiacriticals
```
public static java.lang.String removeDiacriticals(java.lang.String input)
```
    Removes all diacritical marks from the input. This has the effect of transforming marked glyphs into their "equivalent" non-marked form. For example, "éøç" becomes "eoc".
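
The behavior described for `normalize` and `removeDiacriticals` can be approximated with the standard library. This is a rough sketch, not the library's code: the class name `TextHelpersSketch` is hypothetical, "word separator" is assumed to mean any character that is neither a letter nor a digit, and diacritics are stripped by canonical decomposition (NFD) followed by removal of combining marks. That last approach only affects characters that have a canonical decomposition; "ø" has none, so this sketch leaves it unchanged, whereas the documented example maps it to "o".

```java
import java.text.Normalizer;
import java.util.Locale;

final class TextHelpersSketch {
    // Decomposes each character into base letter plus combining marks, then drops the marks.
    // Letters without a canonical decomposition (for example "ø") pass through untouched.
    static String removeDiacriticals(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Lowercases the text and replaces each assumed separator character with one space.
    // Assumption: a separator is any character that is not a letter or a digit.
    static String normalize(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        return lower.replaceAll("[^\\p{L}\\p{N}]", " ");
    }

    public static void main(String[] args) {
        System.out.println(removeDiacriticals("\u00e9\u00e7"));
        System.out.println(normalize("Foo-Bar"));
    }
}
```

Covering letters such as "ø" would need an explicit mapping table on top of the decomposition step; whether the real method uses one is not stated here.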

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2024-02-15 UTC.
