GoogleSQL for BigQuery supports the following text analysis functions.
Function list
Name | Summary
---|---
BAG_OF_WORDS | Gets the frequency of each term (token) in a tokenized document.
TEXT_ANALYZE | Extracts terms (tokens) from text and converts them into a tokenized document.
TF_IDF | Evaluates how relevant a term (token) is to a tokenized document in a set of tokenized documents.
BAG_OF_WORDS
BAG_OF_WORDS(tokenized_document)
Description
Gets the frequency of each term (token) in a tokenized document.
Definitions
tokenized_document
: ARRAY<STRING> value that represents a document that has been tokenized. A tokenized document is a collection of terms (tokens), which are used for text analysis.
Return type
ARRAY<STRUCT<term STRING, count INT64>>
Definitions:
term
: A unique term in the tokenized document.
count
: The number of times the term was found in the tokenized document.
Examples
The following query produces terms and their frequencies in two tokenized documents:
WITH
ExampleTable AS (
SELECT 1 AS id, ['I', 'like', 'pie', 'pie', 'pie', NULL] AS f UNION ALL
SELECT 2 AS id, ['yum', 'yum', 'pie', NULL] AS f
)
SELECT id, BAG_OF_WORDS(f) AS results
FROM ExampleTable
ORDER BY id;
/*----+------------------------------------------------*
| id | results |
+----+------------------------------------------------+
| 1 | [(null, 1), ('I', 1), ('like', 1), ('pie', 3)] |
| 2 | [(null, 1), ('pie', 1), ('yum', 2)] |
*----+------------------------------------------------*/
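As a rough mental model (an illustrative sketch, not BigQuery's implementation), BAG_OF_WORDS behaves like a per-document term counter in which NULL is counted like any other term:

```python
from collections import Counter

def bag_of_words(tokenized_document):
    """Sketch of BAG_OF_WORDS: map each distinct term to its occurrence count.

    NULL (None) terms are counted like any other term, matching the example
    output above. The ordering of the result is not modeled here.
    """
    return [
        {"term": term, "count": count}
        for term, count in Counter(tokenized_document).items()
    ]

doc = ['I', 'like', 'pie', 'pie', 'pie', None]
counts = {r["term"]: r["count"] for r in bag_of_words(doc)}
# 'pie' appears three times; the NULL term appears once.
```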
TEXT_ANALYZE
TEXT_ANALYZE(
text
[, analyzer=>{ 'LOG_ANALYZER' | 'NO_OP_ANALYZER' | 'PATTERN_ANALYZER' }]
[, analyzer_options=>analyzer_options_values]
)
Description
Extracts terms (tokens) from text and converts them into a tokenized document.
Definitions
text
: STRING value that represents the input text to tokenize.
analyzer
: Optional mandatory-named argument that determines which analyzer to use to convert text into an array of terms (tokens). This can be one of:
- 'LOG_ANALYZER' (default): Breaks the input into terms when delimiters are encountered and then normalizes the terms. For more information, see LOG_ANALYZER text analyzer.
- 'NO_OP_ANALYZER': Extracts the text as a single term (token), but doesn't apply normalization. For more information, see NO_OP_ANALYZER text analyzer.
- 'PATTERN_ANALYZER': Breaks the input into terms that match a regular expression. For more information, see PATTERN_ANALYZER text analyzer.
analyzer_options
: Optional mandatory-named argument that takes a list of text analysis rules as a JSON-formatted STRING value. For more information, see Text analyzer options.
Details
There is no guarantee on the order of the tokens produced by this function.
If no analyzer is specified, the LOG_ANALYZER analyzer is used by default.
Return type
ARRAY<STRING>
Examples
The following query uses the default text analyzer, LOG_ANALYZER, with the input text:
SELECT TEXT_ANALYZE('I like pie, you like-pie, they like 2 PIEs.') AS results
/*--------------------------------------------------------------------------*
| results |
+--------------------------------------------------------------------------+
| ['i', 'like', 'pie', 'you', 'like', 'pie', 'they', 'like', '2', 'pies' ] |
*--------------------------------------------------------------------------*/
The following query uses the NO_OP_ANALYZER text analyzer with the input text:
SELECT TEXT_ANALYZE(
'I like pie, you like-pie, they like 2 PIEs.',
analyzer=>'NO_OP_ANALYZER'
) AS results
/*-----------------------------------------------*
| results |
+-----------------------------------------------+
| 'I like pie, you like-pie, they like 2 PIEs.' |
*-----------------------------------------------*/
The following query uses the PATTERN_ANALYZER text analyzer with the input text:
SELECT TEXT_ANALYZE(
'I like pie, you like-pie, they like 2 PIEs.',
analyzer=>'PATTERN_ANALYZER'
) AS results
/*----------------------------------------------------------------*
| results |
+----------------------------------------------------------------+
| ['like', 'pie', 'you', 'like', 'pie', 'they', 'like', 'pies' ] |
*----------------------------------------------------------------*/
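As a rough approximation of the two tokenizing analyzers above (this is not the actual analyzer implementation, and the PATTERN_ANALYZER default pattern here is an assumption inferred from the example output), the results can be reproduced with simple regular expressions:

```python
import re

text = 'I like pie, you like-pie, they like 2 PIEs.'

# Approximation of LOG_ANALYZER's default behavior: break on delimiters
# (whitespace and punctuation), then normalize terms to lowercase.
log_tokens = re.findall(r'[a-z0-9]+', text.lower())

# Approximation of PATTERN_ANALYZER's default behavior: keep runs of two
# or more word characters (assumption based on the example output, which
# drops the single-character terms 'i' and '2').
pattern_tokens = re.findall(r'\w{2,}', text.lower())
```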
For additional examples that include analyzer options, see Text analysis.
For helpful analyzer recipes that you can use to enhance analyzer-supported queries, see Search with text analyzers.
TF_IDF
TF_IDF(tokenized_document) OVER()
TF_IDF(tokenized_document, max_distinct_tokens) OVER()
TF_IDF(tokenized_document, max_distinct_tokens, frequency_threshold) OVER()
Description
Evaluates how relevant a term is to a tokenized document in a set of tokenized documents, using the TF-IDF (term frequency-inverse document frequency) algorithm.
Definitions
tokenized_document
: ARRAY<STRING> value that represents a document that has been tokenized. A tokenized document is a collection of terms (tokens), which are used for text analysis.
max_distinct_tokens
: Optional argument. Takes a non-negative INT64 value, which represents the size of the dictionary, excluding the unknown term. Terms are added to the dictionary until this threshold is met. So, if this value is 20, the first 20 unique terms are added and then no additional terms are added. If this argument isn't provided, the default value is 32000. If this argument is specified, the maximum value is 1048576.
frequency_threshold
: Optional argument. Takes a non-negative INT64 value that represents the minimum number of times a term must appear in a tokenized document to be included in the dictionary. So, if this value is 3, a term must appear at least three times in the tokenized document to be added to the dictionary. If this argument isn't provided, the default value is 5.
Details
This function uses a TF-IDF (term frequency-inverse document frequency) algorithm to compute the relevance of terms in a set of tokenized documents. TF-IDF multiplies two metrics: how many times a term appears in a document (term frequency), and the inverse document frequency of the term across a collection of documents (inverse document frequency).
TF-IDF:
term frequency * inverse document frequency
term frequency:
(count of term in document) / (document size)
inverse document frequency:
log(1 + document set size / (1 + count of documents containing term))
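The arithmetic above can be checked numerically in Python. This sketch models only the formula itself (dictionary limits and unknown-term handling are not modeled); for terms that make it into the dictionary, it reproduces the scores shown in the examples later in this section:

```python
import math

def tf_idf(term, document, documents):
    """Sketch of the TF-IDF formula given above (dictionary not modeled)."""
    # term frequency: (count of term in document) / (document size)
    tf = document.count(term) / len(document)
    # inverse document frequency:
    # log(1 + document set size / (1 + count of documents containing term))
    containing = sum(1 for d in documents if term in d)
    idf = math.log(1 + len(documents) / (1 + containing))
    return tf * idf

docs = [
    ['I', 'like', 'pie', 'pie', 'pie', None],
    ['yum', 'yum', 'pie', None],
    ['I', 'yum', 'pie', None],
    ['you', 'like', 'pie', 'too', None],
]
# 'pie' in the first document: tf = 3/6, idf = log(1 + 4/5)
score = tf_idf('pie', docs[0], docs)
```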
Terms are added to a dictionary of terms if they satisfy the criteria for
max_distinct_tokens
and frequency_threshold
, otherwise they are considered
the unknown term. The unknown term is always the first term in the dictionary
and represented as NULL
. The rest of the dictionary is ordered by
term frequency rather than alphabetically.
Return type
ARRAY<STRUCT<term STRING, tf_idf DOUBLE>>
Definitions:
term
: The unique term that was added to the dictionary.
tf_idf
: The TF-IDF computation for the term.
Examples
The following query computes the relevance of up to 10 terms that appear at
least twice in a set of tokenized documents. In this example, the optional
arguments are passed positionally: 10 represents max_distinct_tokens and
2 represents frequency_threshold:
WITH ExampleTable AS (
SELECT 1 AS id, ['I', 'like', 'pie', 'pie', 'pie', NULL] AS f UNION ALL
SELECT 2 AS id, ['yum', 'yum', 'pie', NULL] AS f UNION ALL
SELECT 3 AS id, ['I', 'yum', 'pie', NULL] AS f UNION ALL
SELECT 4 AS id, ['you', 'like', 'pie', 'too', NULL] AS f
)
SELECT id, TF_IDF(f, 10, 2) OVER() AS results
FROM ExampleTable
ORDER BY id;
/*----+-------------------------------------------------*
| id | results |
+----+-------------------------------------------------+
| 1 | [{"index":null,"value":"0.1304033435859887"}, |
| | {"index":"I","value":"0.1412163100645339"}, |
| | {"index":"like","value":"0.1412163100645339"}, |
| | {"index":"pie","value":"0.29389333245105953"}] |
+----+-------------------------------------------------+
| 2 | [{"index":null,"value":"0.1956050153789831"}, |
| | {"index":"pie","value":"0.14694666622552977"}, |
| | {"index":"yum","value":"0.4236489301936017"}] |
+----+-------------------------------------------------+
| 3 | [{"index":null,"value":"0.1956050153789831"}, |
| | {"index":"I","value":"0.21182446509680086"}, |
| | {"index":"pie","value":"0.14694666622552977"}, |
| | {"index":"yum","value":"0.21182446509680086"}] |
+----+-------------------------------------------------+
| 4 | [{"index":null,"value":"0.4694520369095594"}, |
| | {"index":"like","value":"0.1694595720774407"}, |
| | {"index":"pie","value":"0.11755733298042381"}] |
*----+-------------------------------------------------*/
The following query computes the relevance of up to three terms that appear at least twice in a set of tokenized documents:
WITH ExampleTable AS (
SELECT 1 AS id, ['I', 'like', 'pie', 'pie', 'pie', NULL] AS f UNION ALL
SELECT 2 AS id, ['yum', 'yum', 'pie', NULL] AS f UNION ALL
SELECT 3 AS id, ['I', 'yum', 'pie', NULL] AS f UNION ALL
SELECT 4 AS id, ['you', 'like', 'pie', 'too', NULL] AS f
)
SELECT id, TF_IDF(f, 3, 2) OVER() AS results
FROM ExampleTable
ORDER BY id;
/*----+-------------------------------------------------*
| id | results |
+----+-------------------------------------------------+
| 1 | [{"index":null,"value":"0.12679902142647365"}, |
| | {"index":"I","value":"0.1412163100645339"}, |
| | {"index":"like","value":"0.1412163100645339"}, |
| | {"index":"pie","value":"0.29389333245105953"}] |
+----+-------------------------------------------------+
| 2 | [{"index":null,"value":"0.5705955964191315"}, |
| | {"index":"pie","value":"0.14694666622552977"}] |
+----+-------------------------------------------------+
| 3 | [{"index":null,"value":"0.380397064279421"}, |
| | {"index":"I","value":"0.21182446509680086"}, |
| | {"index":"pie","value":"0.14694666622552977"}] |
+----+-------------------------------------------------+
| 4 | [{"index":null,"value":"0.45647647713530515"}, |
| | {"index":"like","value":"0.1694595720774407"}, |
| | {"index":"pie","value":"0.11755733298042381"}] |
*----+-------------------------------------------------*/