GoogleSQL for BigQuery supports the following text analysis functions.
Function list
Name | Summary
---|---
BAG_OF_WORDS | Gets the frequency of each term (token) in a tokenized document.
TEXT_ANALYZE | Extracts terms (tokens) from text and converts them into a tokenized document.
TF_IDF | Evaluates how relevant a term (token) is to a tokenized document in a set of tokenized documents.
BAG_OF_WORDS
BAG_OF_WORDS(tokenized_document)
Description
Gets the frequency of each term (token) in a tokenized document.
Definitions
tokenized_document
: ARRAY<STRING> value that represents a document that has been tokenized. A tokenized document is a collection of terms (tokens), which are used for text analysis.
Return type
ARRAY<STRUCT<term STRING, count INT64>>
Definitions:
term
: A unique term in the tokenized document.
count
: The number of times the term was found in the tokenized document.
Examples
The following query produces terms and their frequencies in two tokenized documents:
WITH
ExampleTable AS (
SELECT 1 AS id, ['I', 'like', 'pie', 'pie', 'pie', NULL] AS f UNION ALL
SELECT 2 AS id, ['yum', 'yum', 'pie', NULL] AS f
)
SELECT id, BAG_OF_WORDS(f) AS results
FROM ExampleTable
ORDER BY id;
/*----+------------------------------------------------*
| id | results |
+----+------------------------------------------------+
| 1 | [(null, 1), ('I', 1), ('like', 1), ('pie', 3)] |
| 2 | [(null, 1), ('pie', 1), ('yum', 2)] |
*----+------------------------------------------------*/
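As a rough mental model (an illustrative sketch, not BigQuery's implementation), BAG_OF_WORDS behaves like a per-document term counter in which NULL is counted like any other term:

```python
from collections import Counter

def bag_of_words(tokenized_document):
    """Sketch of BAG_OF_WORDS: map each distinct term to its occurrence count.

    NULL (None) terms are counted like any other term, matching the example
    output above. The ordering of the result is not modeled here.
    """
    return [
        {"term": term, "count": count}
        for term, count in Counter(tokenized_document).items()
    ]

doc = ['I', 'like', 'pie', 'pie', 'pie', None]
counts = {r["term"]: r["count"] for r in bag_of_words(doc)}
# 'pie' appears three times; the NULL term appears once.
```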
TEXT_ANALYZE
TEXT_ANALYZE(
text
[, analyzer=>{ 'LOG_ANALYZER' | 'NO_OP_ANALYZER' | 'PATTERN_ANALYZER' }]
[, analyzer_options=>analyzer_options_values]
)
Description
Extracts terms (tokens) from text and converts them into a tokenized document.
Definitions
text
: STRING value that represents the input text to tokenize.
analyzer
: Optional mandatory-named argument that determines which analyzer to use to convert text into an array of terms (tokens). This can be one of:
- 'LOG_ANALYZER' (default): Breaks the input into terms when delimiters are encountered and then normalizes the terms. For more information, see LOG_ANALYZER text analyzer.
- 'NO_OP_ANALYZER': Extracts the text as a single term (token), but doesn't apply normalization. For more information, see NO_OP_ANALYZER text analyzer.
- 'PATTERN_ANALYZER': Breaks the input into terms that match a regular expression. For more information, see PATTERN_ANALYZER text analyzer.
analyzer_options
: Optional mandatory-named argument that takes a list of text analysis rules as a JSON-formatted STRING value. For more information, see Text analyzer options.
Details
There is no guarantee on the order of the tokens produced by this function.
If no analyzer is specified, the LOG_ANALYZER analyzer is used by default.
Return type
ARRAY<STRING>
Examples
The following query uses the default text analyzer, LOG_ANALYZER, with the input text:
SELECT TEXT_ANALYZE('I like pie, you like-pie, they like 2 PIEs.') AS results
/*--------------------------------------------------------------------------*
| results |
+--------------------------------------------------------------------------+
| ['i', 'like', 'pie', 'you', 'like', 'pie', 'they', 'like', '2', 'pies' ] |
*--------------------------------------------------------------------------*/
The following query uses the NO_OP_ANALYZER text analyzer with the input text:
SELECT TEXT_ANALYZE(
'I like pie, you like-pie, they like 2 PIEs.',
analyzer=>'NO_OP_ANALYZER'
) AS results
/*-----------------------------------------------*
| results |
+-----------------------------------------------+
| 'I like pie, you like-pie, they like 2 PIEs.' |
*-----------------------------------------------*/
The following query uses the PATTERN_ANALYZER text analyzer with the input text:
SELECT TEXT_ANALYZE(
'I like pie, you like-pie, they like 2 PIEs.',
analyzer=>'PATTERN_ANALYZER'
) AS results
/*----------------------------------------------------------------*
| results |
+----------------------------------------------------------------+
| ['like', 'pie', 'you', 'like', 'pie', 'they', 'like', 'pies' ] |
*----------------------------------------------------------------*/
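As a rough approximation of the two tokenizing analyzers above (this is not the actual analyzer implementation, and the PATTERN_ANALYZER default pattern here is an assumption inferred from the example output), the results can be reproduced with simple regular expressions:

```python
import re

text = 'I like pie, you like-pie, they like 2 PIEs.'

# Approximation of LOG_ANALYZER's default behavior: break on delimiters
# (whitespace and punctuation), then normalize terms to lowercase.
log_tokens = re.findall(r'[a-z0-9]+', text.lower())

# Approximation of PATTERN_ANALYZER's default behavior: keep runs of two
# or more word characters (assumption based on the example output, which
# drops the single-character terms 'i' and '2').
pattern_tokens = re.findall(r'\w{2,}', text.lower())
```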
For additional examples that include analyzer options, see Text analysis.
For helpful analyzer recipes that you can use to enhance analyzer-supported queries, see Search with text analyzers.
TF_IDF
TF_IDF(tokenized_document) OVER()
TF_IDF(tokenized_document, max_distinct_tokens) OVER()
TF_IDF(tokenized_document, max_distinct_tokens, frequency_threshold) OVER()
Description
Evaluates how relevant a term is to a tokenized document in a set of tokenized documents, using the TF-IDF (term frequency-inverse document frequency) algorithm.
Definitions
tokenized_document
: ARRAY<STRING> value that represents a document that has been tokenized. A tokenized document is a collection of terms (tokens), which are used for text analysis.
max_distinct_tokens
: Optional argument. Takes a non-negative INT64 value, which represents the size of the dictionary, excluding the unknown term. Terms are added to the dictionary until this threshold is met. So, if this value is 20, the first 20 unique terms are added and then no additional terms are added. If this argument isn't provided, the default value is 32000. If this argument is specified, the maximum value is 1048576.
frequency_threshold
: Optional argument. Takes a non-negative INT64 value that represents the minimum number of times a term must appear in a tokenized document to be included in the dictionary. So, if this value is 3, a term must appear at least three times in the tokenized document to be added to the dictionary. If this argument isn't provided, the default value is 5.
Details
This function uses a TF-IDF (term frequency-inverse document frequency) algorithm to compute the relevance of terms in a set of tokenized documents. TF-IDF multiplies two metrics: how many times a term appears in a document (term frequency), and the inverse document frequency of the term across a collection of documents (inverse document frequency).
TF-IDF:
term frequency * inverse document frequency
term frequency:
(count of term in document) / (document size)
inverse document frequency:
log(1 + document set size / (1 + count of documents containing term))
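The arithmetic above can be checked numerically in Python. This sketch models only the formula itself (dictionary limits and unknown-term handling are not modeled); for terms that make it into the dictionary, it reproduces the scores shown in the examples later in this section:

```python
import math

def tf_idf(term, document, documents):
    """Sketch of the TF-IDF formula given above (dictionary not modeled)."""
    # term frequency: (count of term in document) / (document size)
    tf = document.count(term) / len(document)
    # inverse document frequency:
    # log(1 + document set size / (1 + count of documents containing term))
    containing = sum(1 for d in documents if term in d)
    idf = math.log(1 + len(documents) / (1 + containing))
    return tf * idf

docs = [
    ['I', 'like', 'pie', 'pie', 'pie', None],
    ['yum', 'yum', 'pie', None],
    ['I', 'yum', 'pie', None],
    ['you', 'like', 'pie', 'too', None],
]
# 'pie' in the first document: tf = 3/6, idf = log(1 + 4/5)
score = tf_idf('pie', docs[0], docs)
```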
Terms are added to a dictionary of terms if they satisfy the criteria for
max_distinct_tokens
and frequency_threshold
, otherwise they are considered
the unknown term. The unknown term is always the first term in the dictionary
and represented as NULL
. The rest of the dictionary is ordered by
term frequency rather than alphabetically.
Return type
ARRAY<STRUCT<term STRING, tf_idf DOUBLE>>
Definitions:
term
: The unique term that was added to the dictionary.
tf_idf
: The TF-IDF computation for the term.
Examples
The following query computes the relevance of up to 10 terms that appear at
least twice in a set of tokenized documents. In this example, the optional
arguments are passed positionally: 10 represents max_distinct_tokens and
2 represents frequency_threshold:
WITH ExampleTable AS (
SELECT 1 AS id, ['I', 'like', 'pie', 'pie', 'pie', NULL] AS f UNION ALL
SELECT 2 AS id, ['yum', 'yum', 'pie', NULL] AS f UNION ALL
SELECT 3 AS id, ['I', 'yum', 'pie', NULL] AS f UNION ALL
SELECT 4 AS id, ['you', 'like', 'pie', 'too', NULL] AS f
)
SELECT id, TF_IDF(f, 10, 2) OVER() AS results
FROM ExampleTable
ORDER BY id;
/*----+-------------------------------------------------*
| id | results |
+----+-------------------------------------------------+
| 1 | [{"index":null,"value":"0.1304033435859887"}, |
| | {"index":"I","value":"0.1412163100645339"}, |
| | {"index":"like","value":"0.1412163100645339"}, |
| | {"index":"pie","value":"0.29389333245105953"}] |
+----+-------------------------------------------------+
| 2 | [{"index":null,"value":"0.1956050153789831"}, |
| | {"index":"pie","value":"0.14694666622552977"}, |
| | {"index":"yum","value":"0.4236489301936017"}] |
+----+-------------------------------------------------+
| 3 | [{"index":null,"value":"0.1956050153789831"}, |
| | {"index":"I","value":"0.21182446509680086"}, |
| | {"index":"pie","value":"0.14694666622552977"}, |
| | {"index":"yum","value":"0.21182446509680086"}] |
+----+-------------------------------------------------+
| 4 | [{"index":null,"value":"0.4694520369095594"}, |
| | {"index":"like","value":"0.1694595720774407"}, |
| | {"index":"pie","value":"0.11755733298042381"}] |
*----+-------------------------------------------------*/
The following query computes the relevance of up to three terms that appear at least twice in a set of tokenized documents:
WITH ExampleTable AS (
SELECT 1 AS id, ['I', 'like', 'pie', 'pie', 'pie', NULL] AS f UNION ALL
SELECT 2 AS id, ['yum', 'yum', 'pie', NULL] AS f UNION ALL
SELECT 3 AS id, ['I', 'yum', 'pie', NULL] AS f UNION ALL
SELECT 4 AS id, ['you', 'like', 'pie', 'too', NULL] AS f
)
SELECT id, TF_IDF(f, 3, 2) OVER() AS results
FROM ExampleTable
ORDER BY id;
/*----+-------------------------------------------------*
| id | results |
+----+-------------------------------------------------+
| 1 | [{"index":null,"value":"0.12679902142647365"}, |
| | {"index":"I","value":"0.1412163100645339"}, |
| | {"index":"like","value":"0.1412163100645339"}, |
| | {"index":"pie","value":"0.29389333245105953"}] |
+----+-------------------------------------------------+
| 2 | [{"index":null,"value":"0.5705955964191315"}, |
| | {"index":"pie","value":"0.14694666622552977"}] |
+----+-------------------------------------------------+
| 3 | [{"index":null,"value":"0.380397064279421"}, |
| | {"index":"I","value":"0.21182446509680086"}, |
| | {"index":"pie","value":"0.14694666622552977"}] |
+----+-------------------------------------------------+
| 4 | [{"index":null,"value":"0.45647647713530515"}, |
| | {"index":"like","value":"0.1694595720774407"}, |
| | {"index":"pie","value":"0.11755733298042381"}] |
*----+-------------------------------------------------*/