The ML.TF_IDF function
The term frequency-inverse document frequency (TF-IDF) reflects how important a word
is to a document in a collection or corpus. Use the ML.TF_IDF
function to
compute TF-IDF of terms in a document, given the precomputed inverse-document
frequency for use in machine learning model creation. You can use ML.TF_IDF
within the TRANSFORM
clause.
This function uses a TF-IDF algorithm to compute the relevance of terms in a set of tokenized documents. TF-IDF multiplies two metrics: how many times a term appears in a document (term frequency), and the inverse document frequency of the term across a collection of documents (inverse document frequency).
TF-IDF:
term frequency * inverse document frequency
Term frequency:
(count of term in document) / (document size)
Inverse document frequency:
log(1 + num_documents / (1 + token_document_count))
Terms are added to a dictionary of terms if they satisfy the criteria for
top_k
and frequency_threshold
, otherwise they are considered
the unknown term. The unknown term is always the first term in the dictionary
and represented as 0
. The rest of the dictionary is ordered alphabetically.
Syntax
ML.TF_IDF( tokenized_document [, top_k] [, frequency_threshold] ) OVER()
Arguments
ML.TF_IDF
takes the following arguments:
tokenized_document
:ARRAY<STRING>
value that represents a document that has been tokenized. A tokenized document is a collection of terms (tokens), which are used for text analysis.top_k
: Optional argument. Takes anINT64
value, which represents the size of the dictionary, excluding the unknown term. Thetop_k
terms that appear in the most documents are added to the dictionary until this threshold is met. For example, if this value is20
, the top 20 unique terms that appear in the most documents are added and then no additional terms are added.frequency_threshold
: Optional argument. Take anINT64
value that represents the minimum number of documents a term must appear in to be included in the dictionary. For example, if this value is3
, a term must appear in at least three documents to be added to the dictionary.
Output
ML.TF_IDF
returns the input table plus the following two columns:
ARRAY<STRUCT<index INT64, value FLOAT64>>
Definitions:
index
: The index of the term that was added to the dictionary. Unknown terms have an index of 0.value
: The TF-IDF computation for the term.
Quotas
See Cloud AI service functions quotas and limits.
Example
The following example creates a table ExampleTable
and applies the ML.TF_IDF
function:
WITH
ExampleTable AS (
SELECT 1 AS id, ['I', 'like', 'pie', 'pie', 'pie', NULL] AS f
UNION ALL
SELECT 2 AS id, ['yum', 'yum', 'pie', NULL] AS f
UNION ALL
SELECT 3 AS id, ['I', 'yum', 'pie', NULL] AS f
UNION ALL
SELECT 4 AS id, ['you', 'like', 'pie', NULL] AS f
)
SELECT id, ML.TF_IDF(f, 3, 1) OVER () AS results
FROM ExampleTable
ORDER BY id;
The output is similar to the following:
+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | id | results | +----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | 1 | [{"index":"0","value":"0.12679902142647365"},{"index":"1","value":"0.1412163100645339"},{"index":"2","value":"0.1412163100645339"},{"index":"3","value":"0.29389333245105953"}] | | 2 | [{"index":"0","value":"0.5705955964191315"},{"index":"3","value":"0.14694666622552977"}] | | 3 | [{"index":"0","value":"0.380397064279421"},{"index":"1","value":"0.21182446509680086"},{"index":"3","value":"0.14694666622552977"}] | | 4 | [{"index":"0","value":"0.380397064279421"},{"index":"2","value":"0.21182446509680086"},{"index":"3","value":"0.14694666622552977"}] | +----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
What's next
- Learn more about TF-IDF outside of machine learning.