The ML.DETECT_ANOMALIES function

ML.DETECT_ANOMALIES function

The ML.DETECT_ANOMALIES function provides anomaly detection for BigQuery ML.

  1. For time-series data, the function runs against it using ARIMA_PLUS models.
  2. For independent and identically distributed random variables (IID) data, the function runs against it using AUTOENCODER, KMEANS, or PCA models.

For information about model inference in BigQuery ML, see Model inference overview.

For information about supported model types of each SQL statement and function, and all supported SQL statements and functions for each model type, read End-to-end user journey for each model.

Syntax

# `ARIMA_PLUS` models:
ML.DETECT_ANOMALIES(MODEL model_name
                    [, STRUCT<anomaly_prob_threshold FLOAT64> settings]
                    [, {TABLE table_name | (query_statement)}]);

# `AUTOENCODER`, `KMEANS`, or `PCA` models:
ML.DETECT_ANOMALIES(MODEL model_name,
                    STRUCT<contamination FLOAT64> settings,
                    {TABLE table_name | (query_statement)});

model_name

model_name is the name of the model that's used to perform anomaly detection. If you do not have a default project configured, then prepend the project ID to the model name in following format: `[PROJECT_ID].[DATASET].[MODEL]` (including the backticks). For example, `myproject.mydataset.mymodel`.

anomaly_prob_threshold

The ARIMA_PLUS model supports the anomaly_prob_threshold custom threshold for anomaly detection. The value of the anomaly probability at each timestamp is calculated using the actual time-series data value and the values of the predicted time-series data and the variance from the model training. The actual time-series data value at a specific timestamp is identified as anomalous if the anomaly probability exceeds the anomaly_prob_threshold value. The anomaly_prob_threshold value also determines the lower and upper bounds, where a larger threshold value results in a larger interval size.

The anomaly_prob_threshold value's data type is FLOAT64, which is part of the settings STRUCT. The value should be in the range [0, 1) with a default value of 0.95.

contamination

contamination is the proportion of anomalies in the training dataset that are used to create the AUTOENCODER, KMEANS, or PCA input models. For example, a contamination value of 0.1 means that 10% of the training data that was used to create the input model is anomalous. The contamination value determines the cutoff threshold of the target metric to become anomalous, and any input data with a target metric greater than the cutoff threshold is identified as anomalous. The target metric is mean_squared_error for AUTOENCODER and PCA models, and the target metric is normalized_distance for KMEANS models.

The contamination value's data type is FLOAT64, which is part of the settings STRUCT. This value is required for AUTOENCODER, KMEANS, and PCAmodels. The valid value range for contamination is [0, 0.5].

table_name

The input table_name table is used to perform anomaly detection. If you do not have a default project configured, then prepend the project ID to the table name in the following format: `[PROJECT_ID].[DATASET].[TABLE]` (including the backticks). For example, `myproject.mydataset.mytable`.

query_statement

The query_statement clause specifies the GoogleSQL query that generates the data that you use to perform anomaly detection. For the supported SQL syntax of the query_statement clause, see Query syntax in GoogleSQL page for the supported SQL syntax of the query_statement clause.

Input

The input requirements for the ML.DETECT_ANOMALIES function depend upon the input model type.

ARIMA_PLUS model input

ARIMA_PLUS model detection of anomalies has the following requirements:

  • To detect anomalies in historical time-series data, the DECOMPOSE_TIME_SERIES training option must be set as its default value of TRUE when the input model is created. Neither table_name nor query_statement is accepted.
  • The anomaly_prob_threshold value must be specified in the settings STRUCT to detect anomalies in new time-series data.
  • The column names of either the table_name input table or the query_statement clause must match the column names that are used to create the input model.
  • The data types of the TIME_SERIES_ID_COL columns must match the data types of the columns that are used to create the input model.

For a list of supported data types for the TIME_SERIES_TIMESTAMP_COL and TIME_SERIES_DATA_COL columns, see Supported data types for time series model inputs.

AUTOENCODER, KMEANS or PCA model input

AUTOENCODER, KMEANS, or PCA model detection of anomalies has the following requirements:

  • The column names of either the table_name input table or the query_statement clause must match the column names of the model. The column data types must be compatible according to BigQuery implicit coercion rules.
  • If the TRANSFORM clause was present in the CREATE MODEL statement that created model_name, then only the input columns present in the TRANSFORM clause can appear in query_statement.

For information about how BigQuery ML handles NULL values in the feature column of the input data, see Imputation.

Output

The ML.DETECT_ANOMALIES function returns any anomaly detection results in the is_anomaly column. The output columns can differ, however, based upon the input model type and input data table.

ARIMA_PLUS model output

Time-series ARIMA_PLUS model output includes the following columns, followed by the input table columns, if present. Output can include the following:

  • time_series_id_col or time_series_id_cols time series identifiers: This output only occurs when you forecast multiple time series at once. The column names and data types are inherited from the TIME_SERIES_ID_COL option as specified in the model creation query.
  • time_series_timestamp: The column name is inherited from the TIME_SERIES_TIMESTAMP_COL option as specified in the model creation query. The column has a type of TIMESTAMP, regardless of the TIME_SERIES_TIMESTAMP_COL input column data type.
  • time_series_data: The column name is inherited from the TIME_SERIES_DATA_COL option as specified in the model creation query. The column has a type of FLOAT64, regardless of the TIME_SERIES_DATA_COL input column data type.
  • is_anomaly (BOOL): Indicates whether the value at a specific timestamp is an anomaly. If the anomaly_probability value is above the anomaly_prob_threshold value, then the time_series_data value is out of the range for the lower and upper bounds. The state is identified with is_anomaly=true.
  • lower_bound (FLOAT64)
  • upper_bound (FLOAT64)
  • anomaly_probability (FLOAT64): Indicates the probability that this point is an anomaly. For example, an anomaly_probability value of 0.95 means that, among all possible values at the given timestamp, there is a 95% chance that the value is closer to the predicted value than it is to the given time series data value. This indicates a 95% probability that the given time series data value is an anomaly.

The output of the ML.DETECT_ANOMALIES query for ARIMA_PLUS models has the following property:

  • The function returns NULL values in the is_anomaly, upper_bound, lower_bound and anomaly_probability columns for rows with invalid input, which include the following cases:
    • The value in the TIME_SERIES_ID_COL column does not exist in the model.
    • The value in the TIME_SERIES_TIMESTAMP_COL column is not in the range of the forecast horizon.
    • The value in the TIME_SERIES_TIMESTAMP_COL column does not follow the same frequency as the one in the model.

AUTOENCODER, KMEANS or PCA model output

AUTOENCODER, KMEANS, or PCA model output includes the following columns, followed by the input table columns.

  • AUTOENCODER or PCA model output includes the following:

  • KMEANS model output includes the following:

    • is_anomaly (BOOL)
    • normalized_distance (FLOAT64): The shortest distance among the normalized distances from the input data to