The ML.DETECT_ANOMALIES function
ML.DETECT_ANOMALIES
function
The ML.DETECT_ANOMALIES
function provides anomaly detection for
BigQuery ML.
- For time-series data, the function runs against it using
ARIMA_PLUS
models. - For independent and identically distributed random variables (IID)
data, the function runs against it
using
AUTOENCODER
,KMEANS
, orPCA
models.
For information about model inference in BigQuery ML, see Model inference overview.
For information about supported model types of each SQL statement and function, and all supported SQL statements and functions for each model type, read End-to-end user journey for each model.
Syntax
# `ARIMA_PLUS` models: ML.DETECT_ANOMALIES(MODEL model_name [, STRUCT<anomaly_prob_threshold FLOAT64> settings] [, {TABLE table_name | (query_statement)}]); # `AUTOENCODER`, `KMEANS`, or `PCA` models: ML.DETECT_ANOMALIES(MODEL model_name, STRUCT<contamination FLOAT64> settings, {TABLE table_name | (query_statement)});
model_name
model_name
is the name of the model that's used to perform anomaly detection.
If you do not have a default project configured, then prepend the project ID to
the model name in following format:
`[PROJECT_ID].[DATASET].[MODEL]`
(including the backticks). For example,
`myproject.mydataset.mymodel`
.
anomaly_prob_threshold
The ARIMA_PLUS
model supports the anomaly_prob_threshold
custom threshold
for anomaly detection. The value of the anomaly probability at each timestamp is
calculated using the actual time-series data value and the values of the
predicted time-series data and the variance from the model training. The actual
time-series data value at a specific timestamp is identified as anomalous if
the anomaly probability exceeds the anomaly_prob_threshold
value. The
anomaly_prob_threshold
value also determines the lower and upper bounds, where
a larger threshold value results in a larger interval size.
The anomaly_prob_threshold
value's data type is
FLOAT64
,
which is part of the settings
STRUCT
. The
value should be in the range [0, 1)
with a default value of 0.95
.
contamination
contamination
is the proportion of anomalies in the training dataset that are
used to create the AUTOENCODER
, KMEANS
, or PCA
input models. For example, a
contamination
value of 0.1
means that 10% of the training data that was used
to create the input model is anomalous. The contamination
value determines the
cutoff threshold of the target metric to become anomalous, and any input data
with a target metric greater than the cutoff threshold is identified as
anomalous. The target metric is mean_squared_error
for AUTOENCODER
and PCA
models, and the target metric is normalized_distance
for KMEANS
models.
The contamination
value's data type is
FLOAT64
,
which is part of the settings
STRUCT
.
This value is required for AUTOENCODER
, KMEANS
, and PCA
models. The valid
value range for contamination
is [0, 0.5]
.
table_name
The input table_name
table is used to perform anomaly detection. If you do not
have a default project configured, then prepend the project ID to the table name
in the following format:
`[PROJECT_ID].[DATASET].[TABLE]`
(including the backticks). For example,
`myproject.mydataset.mytable`
.
query_statement
The query_statement
clause specifies the GoogleSQL query that generates the
data that you use to perform anomaly detection. For the supported SQL syntax of
the query_statement
clause, see
Query syntax in GoogleSQL
page for the supported SQL syntax of the query_statement
clause.
Input
The input requirements for the ML.DETECT_ANOMALIES
function depend upon the
input model type.
ARIMA_PLUS
model input
ARIMA_PLUS
model detection of anomalies has the following requirements:
- To detect anomalies in historical time-series data, the
DECOMPOSE_TIME_SERIES
training option must be set as its default value ofTRUE
when the input model is created. Neithertable_name
norquery_statement
is accepted. - The
anomaly_prob_threshold
value must be specified in the settingsSTRUCT
to detect anomalies in new time-series data. - The column names of either the
table_name
input table or thequery_statement
clause must match the column names that are used to create the input model. - The data types of the
TIME_SERIES_ID_COL
columns must match the data types of the columns that are used to create the input model.
For a list of supported data types for the TIME_SERIES_TIMESTAMP_COL
and
TIME_SERIES_DATA_COL
columns, see
Supported data types for time series model inputs.
AUTOENCODER
, KMEANS
or PCA
model input
AUTOENCODER
, KMEANS
, or PCA
model detection of anomalies has the following
requirements:
- The column names of either the
table_name
input table or thequery_statement
clause must match the column names of the model. The column data types must be compatible according to BigQuery implicit coercion rules. - If the
TRANSFORM
clause was present in theCREATE MODEL
statement that createdmodel_name
, then only the input columns present in theTRANSFORM
clause can appear inquery_statement
.
For information about how BigQuery ML handles NULL
values in the
feature column of the input data, see
Imputation.
Output
The ML.DETECT_ANOMALIES
function returns any anomaly detection results in the
is_anomaly
column. The output columns can differ, however, based upon the
input model type and input data table.
ARIMA_PLUS
model output
Time-series ARIMA_PLUS
model output includes the following columns, followed
by the input table columns, if present. Output can include the following:
time_series_id_col
ortime_series_id_cols
time series identifiers: This output only occurs when you forecast multiple time series at once. The column names and data types are inherited from theTIME_SERIES_ID_COL
option as specified in the model creation query.time_series_timestamp
: The column name is inherited from theTIME_SERIES_TIMESTAMP_COL
option as specified in the model creation query. The column has a type ofTIMESTAMP
, regardless of theTIME_SERIES_TIMESTAMP_COL
input column data type.time_series_data
: The column name is inherited from theTIME_SERIES_DATA_COL
option as specified in the model creation query. The column has a type ofFLOAT64
, regardless of theTIME_SERIES_DATA_COL
input column data type.is_anomaly
(BOOL
): Indicates whether the value at a specific timestamp is an anomaly. If theanomaly_probability
value is above theanomaly_prob_threshold
value, then thetime_series_data
value is out of the range for the lower and upper bounds. The state is identified withis_anomaly=true
.lower_bound
(FLOAT64
)upper_bound
(FLOAT64
)anomaly_probability
(FLOAT64
): Indicates the probability that this point is an anomaly. For example, ananomaly_probability
value of0.95
means that, among all possible values at the given timestamp, there is a 95% chance that the value is closer to the predicted value than it is to the given time series data value. This indicates a 95% probability that the given time series data value is an anomaly.
The output of the ML.DETECT_ANOMALIES
query for ARIMA_PLUS
models has the
following property:
- The function returns
NULL
values in theis_anomaly
,upper_bound
,lower_bound
andanomaly_probability
columns for rows with invalid input, which include the following cases:
AUTOENCODER
, KMEANS
or PCA
model output
AUTOENCODER
, KMEANS
, or PCA
model output includes the following columns,
followed by the input table columns.