Text analysis functions

GoogleSQL for BigQuery supports the following text analysis functions.

Function list

Name           Summary
BAG_OF_WORDS   Gets the frequency of each term (token) in a tokenized document.
TEXT_ANALYZE   Extracts terms (tokens) from text and converts them into a tokenized document.
TF_IDF         Evaluates how relevant a term (token) is to a tokenized document in a set of tokenized documents.

BAG_OF_WORDS

BAG_OF_WORDS(tokenized_document)

Description

Gets the frequency of each term (token) in a tokenized document.

Definitions

  • tokenized_document: ARRAY<STRING> value that represents a document that has been tokenized. A tokenized document is a collection of terms (tokens), which are used for text analysis.

Return type

ARRAY<STRUCT<term STRING, count INT64>>

Definitions

  • term: A unique term in the tokenized document.
  • count: The number of times the term was found in the tokenized document.

Examples

The following query produces terms and their frequencies in two tokenized documents:

WITH
  ExampleTable AS (
    SELECT 1 AS id, ['I', 'like', 'pie', 'pie', 'pie', NULL] AS f UNION ALL
    SELECT 2 AS id, ['yum', 'yum', 'pie', NULL] AS f
  )
SELECT id, BAG_OF_WORDS(f) AS results
FROM ExampleTable
ORDER BY id;

/*----+------------------------------------------------*
 | id | results                                        |
 +----+------------------------------------------------+
 | 1  | [(null, 1), ('I', 1), ('like', 1), ('pie', 3)] |
 | 2  | [(null, 1), ('pie', 1), ('yum', 2)]            |
 *----+------------------------------------------------*/
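The counting behavior above can be sketched in Python. This is an illustrative approximation, not the BigQuery implementation: `None` stands in for SQL NULL, and the output is sorted for readability even though BAG_OF_WORDS doesn't guarantee any order.

```python
from collections import Counter

def bag_of_words(tokenized_document):
    """Approximate BAG_OF_WORDS: count each distinct term in a
    tokenized document. NULL terms (here: None) are counted too.
    The real function's output order is unspecified; we sort so the
    result is deterministic, with the NULL entry first."""
    counts = Counter(tokenized_document)
    return sorted(counts.items(),
                  key=lambda kv: (kv[0] is not None, kv[0] or ""))

# Mirrors document 1 from the query above:
bag_of_words(['I', 'like', 'pie', 'pie', 'pie', None])
# → [(None, 1), ('I', 1), ('like', 1), ('pie', 3)]
```

Each `(term, count)` pair corresponds to one STRUCT in the `ARRAY<STRUCT<term STRING, count INT64>>` return value.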

TEXT_ANALYZE

TEXT_ANALYZE(
  text
  [, analyzer=>{ 'LOG_ANALYZER' | 'NO_OP_ANALYZER' | 'PATTERN_ANALYZER' }]
  [, analyzer_options=>analyzer_options_values]
)

Description

Extracts terms (tokens) from text and converts them into a tokenized document.

Definitions

  • text: STRING value that represents the input text to tokenize.
  • analyzer: Optional argument; if provided, it must be passed by name. Determines which analyzer to use to convert text into an array of terms (tokens). This can be:

    • 'LOG_ANALYZER' (default): Breaks the input into terms when delimiters are encountered, and then normalizes the terms. For more information, see LOG_ANALYZER text analyzer.

    • 'NO_OP_ANALYZER': Extracts the text as a single term (token), but doesn't apply normalization. For more information, see NO_OP_ANALYZER text analyzer.

    • 'PATTERN_ANALYZER': Breaks the input into terms that match a regular expression. For more information, see PATTERN_ANALYZER text analyzer.

  • analyzer_options: Optional argument; if provided, it must be passed by name. Takes a list of text analysis rules as a JSON-formatted STRING. For more information, see Text analyzer options.

Details

The order of the terms produced by this function isn't guaranteed.

If no analyzer is specified, the LOG_ANALYZER analyzer is used by default.
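A rough Python sketch of delimiter-based tokenization in the style of LOG_ANALYZER follows. This is an assumption-laden approximation for illustration only: the real analyzer has a specific delimiter list and richer normalization rules, while this sketch simply splits on non-word characters and lowercases each term.

```python
import re

def log_analyzer_sketch(text):
    """Illustrative approximation of LOG_ANALYZER-style tokenization:
    split the input on runs of non-word characters (spaces, commas,
    hyphens, periods, etc.), drop empty fragments, and lowercase each
    term. Not the actual BigQuery delimiter or normalization rules."""
    terms = re.split(r"[\W_]+", text)
    return [t.lower() for t in terms if t]

log_analyzer_sketch('I like pie, you like-pie, they like 2 PIEs.')
# → ['i', 'like', 'pie', 'you', 'like', 'pie', 'they', 'like', '2', 'pies']
```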

Return type

ARRAY<STRING>

Examples

The following query uses the default text analyzer, LOG_ANALYZER, with the input text:

SELECT TEXT_ANALYZE('I like pie, you like-pie, they like 2 PIEs.') AS results;

/*--------------------------------------------------------------------------*
 | results                                                                  |
 +------------------------------------------