ML.DETECT_ANOMALIES
function
The ML.DETECT_ANOMALIES
function provides anomaly detection for
BigQuery ML.
- For time-series data, the function runs against it using
ARIMA_PLUS
models. - For independent and identically distributed random variables (IID)
data, the function runs against it
using
AUTOENCODER
,KMEANS
, orPCA
models.
For information about model inference in BigQuery ML, see Model inference overview.
For information about supported model types of each SQL statement and function, and all supported SQL statements and functions for each model type, read End-to-end user journey for each model.
Syntax
# `ARIMA_PLUS` models: ML.DETECT_ANOMALIES(MODEL model_name [, STRUCT<anomaly_prob_threshold FLOAT64> settings] [, {TABLE table_name | (query_statement)}]); # `AUTOENCODER`, `KMEANS`, or `PCA` models: ML.DETECT_ANOMALIES(MODEL model_name, STRUCT<contamination FLOAT64> settings, {TABLE table_name | (query_statement)});
model_name
model_name
is the name of the model that's used to perform anomaly detection.
If you do not have a default project configured, then prepend the project ID to
the model name in following format:
`[PROJECT_ID].[DATASET].[MODEL]`
(including the backticks). For example,
`myproject.mydataset.mymodel`
.
anomaly_prob_threshold
The ARIMA_PLUS
model supports the anomaly_prob_threshold
custom threshold
for anomaly detection. The value of the anomaly probability at each timestamp is
calculated using the actual time-series data value and the values of the
predicted time-series data and the variance from the model training. The actual
time-series data value at a specific timestamp is identified as anomalous if
the anomaly probability exceeds the anomaly_prob_threshold
value. The
anomaly_prob_threshold
value also determines the lower and upper bounds, where
a larger threshold value results in a larger interval size.
The anomaly_prob_threshold
value's data type is
FLOAT64
,
which is part of the settings
STRUCT
. The
value should be in the range [0, 1)
with a default value of 0.95
.
contamination
contamination
is the proportion of anomalies in the training dataset that are
used to create the AUTOENCODER
, KMEANS
, or PCA
input models. For example, a
contamination
value of 0.1
means that 10% of the training data that was used
to create the input model is anomalous. The contamination
value determines the
cutoff threshold of the target metric to become anomalous, and any input data
with a target metric greater than the cutoff threshold is identified as
anomalous. The target metric is mean_squared_error
for AUTOENCODER
and PCA
models, and the target metric is normalized_distance
for KMEANS
models.
The contamination
value's data type is
FLOAT64
,
which is part of the settings
STRUCT
.
This value is required for AUTOENCODER
, KMEANS
, and PCA
models. The valid
value range for contamination
is [0, 0.5]
.
table_name
The input table_name
table is used to perform anomaly detection. If you do not
have a default project configured, then prepend the project ID to the table name
in the following format:
`[PROJECT_ID].[DATASET].[TABLE]`
(including the backticks). For example,
`myproject.mydataset.mytable`
.
query_statement
The query_statement
clause specifies the standard SQL query that generates the
data that you use to perform anomaly detection. For the supported SQL syntax of
the query_statement
clause, see
Query syntax in Standard SQL
page for the supported SQL syntax of the query_statement
clause.
Input
The input requirements for the ML.DETECT_ANOMALIES
function depend upon the
input model type.
ARIMA_PLUS
model input
ARIMA_PLUS
model detection of anomalies has the following requirements:
- To detect anomalies in historical time-series data, the
DECOMPOSE_TIME_SERIES
training option must be set as its default value ofTRUE
when the input model is created. Neithertable_name
norquery_statement
is accepted. - The
anomaly_prob_threshold
value must be specified in the settingsSTRUCT
to detect anomalies in new time-series data. - The column names of either the
table_name
input table or thequery_statement
clause must match the column names that are used to create the input model. - The data types of the
TIME_SERIES_ID_COL
columns must match the data types of the columns that are used to create the input model.
For a list of supported data types for the TIME_SERIES_TIMESTAMP_COL
and
TIME_SERIES_DATA_COL
columns, see
Supported data types for time series model inputs.
AUTOENCODER
, KMEANS
or PCA
model input
AUTOENCODER
, KMEANS
, or PCA
model detection of anomalies has the following
requirements:
- The column names of either the
table_name
input table or thequery_statement
clause must match the column names of the model. The column data types must be compatible according to BigQuery implicit coercion rules. - If the
TRANSFORM
clause was present in theCREATE MODEL
statement that createdmodel_name
, then only the input columns present in theTRANSFORM
clause can appear inquery_statement
.
For information about how BigQuery ML handles NULL
values in the
feature column of the input data, see
Imputation.
Output
The ML.DETECT_ANOMALIES
function returns any anomaly detection results in the
is_anomaly
column. The output columns can differ, however, based upon the
input model type and input data table.
ARIMA_PLUS
model output
Time-series ARIMA_PLUS
model output includes the following columns, followed
by the input table columns, if present. Output can include the following:
time_series_id_col
ortime_series_id_cols
time series identifiers: This output only occurs when you forecast multiple time series at once. The column names and data types are inherited from theTIME_SERIES_ID_COL
option as specified in the model creation query.time_series_timestamp
: The column name is inherited from theTIME_SERIES_TIMESTAMP_COL
option as specified in the model creation query. The column has a type ofTIMESTAMP
, regardless of theTIME_SERIES_TIMESTAMP_COL
input column data type.time_series_data
: The column name is inherited from theTIME_SERIES_DATA_COL
option as specified in the model creation query. The column has a type ofFLOAT64
, regardless of theTIME_SERIES_DATA_COL
input column data type.is_anomaly
(BOOL
): Indicates whether the value at a specific timestamp is an anomaly. If theanomaly_probability
value is above theanomaly_prob_threshold
value, then thetime_series_data
value is out of the range for the lower and upper bounds. The state is identified withis_anomaly=true
.lower_bound
(FLOAT64
)upper_bound
(FLOAT64
)anomaly_probability
(FLOAT64
): Indicates the probability that this point is an anomaly. For example, ananomaly_probability
value of0.95
means that, among all possible values at the given timestamp, there is a 95% chance that the value is closer to the predicted value than it is to the given time series data value. This indicates a 95% probability that the given time series data value is an anomaly.
The output of the ML.DETECT_ANOMALIES
query for ARIMA_PLUS
models has the
following property:
- The function returns
NULL
values in theis_anomaly
,upper_bound
,lower_bound
andanomaly_probability
columns for rows with invalid input, which include the following cases:
AUTOENCODER
, KMEANS
or PCA
model output
AUTOENCODER
, KMEANS
, or PCA
model output includes the following columns,
followed by the input table columns.
AUTOENCODER
orPCA
model output includes the following:KMEANS
model output includes the following:is_anomaly
(BOOL
)normalized_distance
(FLOAT64
): The shortest distance among the normalized distances from the input data to each cluster centroid. Normalized distances are computed as the absolute distance from the input data to a cluster centroid, divided by the cluster's radius. The cluster radius is defined as the root mean square of all of the distances from each cluster's assigned data points to its centroid. Normalized distance is used in favor of absolute distance to determine anomalies because anomalies might not be detected as effectively using absolute distances, since they do not account for cluster radius. The distance type is determined by theDISTANCE_TYPE
value specified during model training.centroid_id
(INT64
)
Examples
The following examples show different input models and settings.
ARIMA_PLUS
model example without specified settings
The following query detects anomalies using the ARIMA_PLUS
model that has the
DECOMPOSE_TIME_SERIES
training option set to its default value of TRUE
, without specifying the
anomaly_prob_threshold
.
SELECT * FROM ML.DETECT_ANOMALIES(MODEL `mydataset.my_arima_plus_model`)
If the time series input column names are ts_timestamp
and ts_data
, then
this query returns results like the following:
+-------------------------+----------+------------+-------------+-------------+---------------------+ | ts_timestamp | ts_data | is_anomaly | lower_bound | upper_bound | anomaly_probability | +-------------------------+----------+------------+-------------+-------------+---------------------+ | 2021-01-01 00:00:01 UTC | 125.3 | FALSE | 123.5 | 139.1 | 0.93 | | 2021-01-02 00:00:01 UTC | 145.3 | TRUE | 128.5 | 143.1 | 0.96 | +-------------------------+----------+------------+-------------+-------------+---------------------+
ARIMA_PLUS
model example with a custom anomaly_prob_threshold
The following query detects anomalies using the ARIMA_PLUS
model that has the
DECOMPOSE_TIME_SERIES
training option set to its default value of TRUE
and specifying a custom
anomaly_prob_threshold
value of 0.8
.
SELECT * FROM ML.DETECT_ANOMALIES(MODEL `mydataset.my_arima_plus_model`, STRUCT(0.8 AS anomaly_prob_threshold))
If the time series input column names are ts_timestamp
and ts_data
, then
this query returns results like the following:
+-------------------------+----------+------------+-------------+-------------+---------------------+ | ts_timestamp | ts_data | is_anomaly | lower_bound | upper_bound | anomaly_probability | +-------------------------+----------+------------+-------------+-------------+---------------------+ | 2021-01-01 00:00:01 UTC | 125.3 | TRUE | 129.5 | 133.6 | 0.93 | | 2021-01-02 00:00:01 UTC | 145.3 | TRUE | 131.5 | 136.6 | 0.96 | +-------------------------+----------+------------+-------------+-------------+---------------------+
ARIMA_PLUS
model example with input data as a query statement
The following query detects anomalies using the ARIMA_PLUS
model, specifying a
custom anomaly_prob_threshold
value of 0.9
and passing an input data table
into the query.
SELECT * FROM ML.DETECT_ANOMALIES(MODEL `mydataset.my_arima_plus_model`, STRUCT(0.9 AS anomaly_prob_threshold), ( SELECT state, city, date, temperature, weather FROM `mydataset.my_time_series_data_table`))
If the TIME_SERIES_ID_COL
column names are state
, city
, and
TIME_SERIES_TIMESTAMP_COL
, and the TIME_SERIES_DATA_COL
column names are
date
and temperature
, then this query returns results like the following:
+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+ | state | city | date | temperature | is_anomaly | lower_bound | upper_bound | anomaly_probability | weather | +-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+ | "WA" | "Kirkland" | 2021-01-01 00:00:00 UTC | 38.1 | FALSE | 36.4 | 42.0 | 0.8293 | "sunny" | | "WA" | "Kirkland" | 2021-01-02 00:00:00 UTC | 37.1 | TRUE | 37.4 | 43.3 | 0.9124 | "rainy" | +-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+
ARIMA_PLUS
model example with input data as a table
The following query detects anomalies using the ARIMA_PLUS
model, specifying
a custom anomaly_prob_threshold
value of 0.9
and passing an input data table
into the query.
SELECT * FROM ML.DETECT_ANOMALIES(MODEL `mydataset.my_arima_plus_model`, STRUCT(0.9 AS anomaly_prob_threshold), TABLE `mydataset.my_time_series_data_table`)
If the TIME_SERIES_ID_COL
column names are state
, city
, and
TIME_SERIES_TIMESTAMP_COL
, and the TIME_SERIES_DATA_COL
column names are
date
and temperature
, and one additional column weather
is in the input
data table, then this query returns results like the following:
+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+ | state | city | date | temperature | is_anomaly | lower_bound | upper_bound | anomaly_probability | weather | +-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+ | "WA" | "Kirkland" | 2021-01-01 00:00:00 UTC | 38.1 | FALSE | 36.4 | 42.0 | 0.8293 | "sunny" | | "WA" | "Kirkland" | 2021-01-02 00:00:00 UTC | 37.1 | TRUE | 37.4 | 43.3 | 0.9124 | "rainy" | +-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+
AUTOENCODER
model example
The following query detects anomalies using the AUTOENCODER
model and a
contamination
value of 0.1
.
SELECT * FROM ML.DETECT_ANOMALIES(MODEL `mydataset.my_autoencoder_model`, STRUCT(0.1 AS contamination), TABLE `mydataset.mytable`)
If the feature column names are f1
and f2
, then this query returns results
like the following:
+------------+--------------------+---------+--------+ | is_anomaly | mean_squared_error | f1 | f2 | +------------+--------------------+---------+--------+ | FALSE | 0.63456 | 120 | "a" | | TRUE | 11.342 | 15000 | "b" | +------------+--------------------+---------+--------+
KMEANS
model example
The following query detects anomalies using the KMEANS
model and a
contamination
value of 0.2
.
SELECT * FROM ML.DETECT_ANOMALIES(MODEL `mydataset.my_kmeans_model`, STRUCT(0.2 AS contamination), ( SELECT f1, f2 FROM `mydataset.mytable`))
This query returns results like the following:
+------------+---------------------+-------------+--------+--------+ | is_anomaly | normalized_distance | centroid_id | f1 | f2 | +------------+---------------------+-------------+--------+--------+ | FALSE | 0.63456 | 1 | 120 | "a" | | TRUE | 6.3243 | 2 | 15000 | "b" | +------------+---------------------+-------------+--------+--------+
PCA
model example
The following query detects anomalies using the PCA
model and a
contamination
value of 0.1
.
SELECT * FROM ML.DETECT_ANOMALIES(MODEL `mydataset.my_pca_model`, STRUCT(0.1 AS contamination), TABLE `mydataset.mytable`)
If the feature column names are f1
, f2
and f3
, then this query returns
results like the following:
+------------+--------------------+---------+--------+------+ | is_anomaly | mean_squared_error | f1 | f2 | f3 | +------------+--------------------+---------+--------+------+ | FALSE | 0.63456 | 120 | "a" | 0.9 | | TRUE | 11.342 | 15000 | "b" | 25 | +------------+--------------------+---------+--------+------+
Pricing
All queries that use the ML.DETECT_ANOMALIES
function are billable, regardless
of the pricing model.