The ML.DETECT_ANOMALIES function

ML.DETECT_ANOMALIES function

The ML.DETECT_ANOMALIES function provides anomaly detection for BigQuery ML. The function runs against time-series data using ARIMA_PLUS models. The function runs against independent and identically distributed random variables (IID) data using AUTOENCODER and KMEANS models.

For information about model inference in BigQuery ML, see Model inference overview.

For information about supported model types of each SQL statement and function, and all supported SQL statements and functions for each model type, read End-to-end user journey for each model.

Syntax

# `ARIMA_PLUS` models:
ML.DETECT_ANOMALIES(MODEL model_name
                    [, STRUCT<anomaly_prob_threshold FLOAT64> settings]
                    [, {TABLE table_name | (query_statement)}]);

# `AUTOENCODER` or `KMEANS` models:
ML.DETECT_ANOMALIES(MODEL model_name,
                    STRUCT<contamination FLOAT64> settings,
                    {TABLE table_name | (query_statement)});

model_name

model_name is the name of the model that's used to perform anomaly detection. If you do not have a default project configured, then prepend the project ID to the model name in following format: `[PROJECT_ID].[DATASET].[MODEL]` (including the backticks). For example, `myproject.mydataset.mymodel`.

anomaly_prob_threshold

The ARIMA_PLUS model supports the anomaly_prob_threshold custom threshold for anomaly detection. The value of the anomaly probability at each timestamp is calculated using the actual time-series data value and the values of the predicted time-series data and the variance from the model training. The actual time-series data value at a specific timestamp is identified as anomalous if the anomaly probability exceeds the anomaly_prob_threshold value. The anomaly_prob_threshold value also determines the lower and upper bounds, where a larger threshold value results in a larger interval size.

The anomaly_prob_threshold value's data type is FLOAT64, which is part of the settings STRUCT. The value should be in the range [0, 1) with a default value of 0.95.

contamination

contamination is the proportion of anomalies in the training dataset that are used to create the AUTOENCODER or KMEANS input models. For example, a contamination value of 0.1 means that 10% of the training data that was used to create the input model is anomalous. The contamination value determines the cutoff threshold of the target metric to become anomalous, and any input data with a target metric greater than the cutoff threshold is identified as anomalous. The target metric is mean_squared_error for AUTOENCODER models, and the target metric is normalized_distance for KMEANS models.

The contamination value's data type is FLOAT64, which is part of the settings STRUCT. This value is required for AUTOENCODER or KMEANS models. The valid value range for contamination is [0, 0.5].

table_name

The input table_name table is used to perform anomaly detection. If you do not have a default project configured, then prepend the project ID to the table name in the following format: `[PROJECT_ID].[DATASET].[TABLE]` (including the backticks). For example, `myproject.mydataset.mytable`.

query_statement

The query_statement clause specifies the standard SQL query that generates the data that you use to perform anomaly detection. For the supported SQL syntax of the query_statement clause, see Query syntax in Standard SQL page for the supported SQL syntax of the query_statement clause.

Input

The input requirements for the ML.DETECT_ANOMALIES function depend upon the input model type.

ARIMA_PLUS model input

ARIMA_PLUS model detection of anomalies has the following requirements:

  • To detect anomalies in historical time-series data, the DECOMPOSE_TIME_SERIES training option must be set as its default value of TRUE when the input model is created. Neither table_name nor query_statement is accepted.
  • The anomaly_prob_threshold value must be specified in the settings STRUCT to detect anomalies in new time-series data.
  • The column names of either the table_name input table or the query_statement clause must match the column names that are used to create the input model.
  • The data types of the TIME_SERIES_ID_COL columns must match the data types of the columns that are used to create the input model.

For a list of supported data types for the TIME_SERIES_TIMESTAMP_COL and TIME_SERIES_DATA_COL columns, see Supported data types for time series model inputs.

AUTOENCODER and KMEANS model input

AUTOENCODER or KMEANS model detection of anomalies has the following requirements:

  • The column names of either the table_name input table or the query_statement clause must match the column names of the model. The column data types must be compatible according to BigQuery implicit coercion rules.
  • If the TRANSFORM clause was present in the CREATE MODEL statement that created model_name, then only the input columns present in the TRANSFORM clause can appear in query_statement.

For information about how BigQuery ML handles NULL values in the feature column of the input data, see Imputation.

Output

The ML.DETECT_ANOMALIES function returns any anomaly detection results in the is_anomaly column. The output columns can differ, however, based upon the input model type and input data table.

ARIMA_PLUS model output

Time-series ARIMA_PLUS model output includes the following columns, followed by the input table columns, if present. Output can include the following:

  • time_series_id_col or time_series_id_cols time series identifiers: This output only occurs when you forecast multiple time series at once. The column names and data types are inherited from the TIME_SERIES_ID_COL option as specified in the model creation query.
  • time_series_timestamp: The column name is inherited from the TIME_SERIES_TIMESTAMP_COL option as specified in the model creation query. The column has a type of TIMESTAMP, regardless of the TIME_SERIES_TIMESTAMP_COL input column data type.
  • time_series_data: The column name is inherited from the TIME_SERIES_DATA_COL option as specified in the model creation query. The column has a type of FLOAT64, regardless of the TIME_SERIES_DATA_COL input column data type.
  • is_anomaly (BOOL): Indicates whether the value at a specific timestamp is an anomaly. If the anomaly_probability value is above the anomaly_prob_threshold value, then the time_series_data value is out of the range for the lower and upper bounds. The state is identified with is_anomaly=true.
  • lower_bound (FLOAT64)
  • upper_bound (FLOAT64)
  • anomaly_probability (FLOAT64): Indicates the probability that this point is an anomaly. For example, an anomaly_probability value of 0.95 means that, among all possible values at the given timestamp, there is a 95% chance that the value is closer to the predicted value than it is to the given time series data value. This indicates a 95% probability that the given time series data value is an anomaly.

The output of the ML.DETECT_ANOMALIES query for ARIMA_PLUS models has the following property:

  • The function returns NULL values in the is_anomaly, upper_bound, lower_bound and anomaly_probability columns for rows with invalid input, which include the following cases:
    • The value in the TIME_SERIES_ID_COL column does not exist in the model.
    • The value in the TIME_SERIES_TIMESTAMP_COL column is not in the range of the forecast horizon.
    • The value in the TIME_SERIES_TIMESTAMP_COL column does not follow the same frequency as the one in the model.

AUTOENCODER and KMEANS model output

AUTOENCODER or KMEANS model output includes the following columns, followed by the input table columns.

  • AUTOENCODER model output includes the following:

  • KMEANS model output includes the following:

    • is_anomaly (BOOL)
    • normalized_distance (FLOAT64): The shortest distance among the normalized distances from the input data to each cluster centroid. Normalized distances are computed as the absolute distance from the input data to a cluster centroid, divided by the cluster's radius. The cluster radius is defined as the root mean square of all of the distances from each cluster's assigned data points to its centroid. Normalized distance is used in favor of absolute distance to determine anomalies because anomalies might not be detected as effectively using absolute distances, since they do not account for cluster radius. The distance type is determined by the DISTANCE_TYPE value specified during model training.

    • centroid_id (INT64)

Examples

The following examples show different input models and settings.

ARIMA_PLUS model example without specified settings

The following query detects anomalies using the ARIMA_PLUS model that has the DECOMPOSE_TIME_SERIES training option set to its default value of TRUE, without specifying the anomaly_prob_threshold.

SELECT
  *
FROM
  ML.DETECT_ANOMALIES(MODEL `mydataset.my_arima_plus_model`)

If the time series input column names are ts_timestamp and ts_data, then this query returns results like the following:

+-------------------------+----------+------------+-------------+-------------+---------------------+
|      ts_timestamp       | ts_data  | is_anomaly | lower_bound | upper_bound | anomaly_probability |
+-------------------------+----------+------------+-------------+-------------+---------------------+
| 2021-01-01 00:00:01 UTC |  125.3   |   FALSE    |  123.5      |  139.1      |  0.93               |
| 2021-01-02 00:00:01 UTC |  145.3   |   TRUE     |  128.5      |  143.1      |  0.96               |
+-------------------------+----------+------------+-------------+-------------+---------------------+

ARIMA_PLUS model example with a custom anomaly_prob_threshold

The following query detects anomalies using the ARIMA_PLUS model that has the DECOMPOSE_TIME_SERIES training option set to its default value of TRUE and specifying a custom anomaly_prob_threshold value of 0.8.

SELECT
  *
FROM
  ML.DETECT_ANOMALIES(MODEL `mydataset.my_arima_plus_model`,
                      STRUCT(0.8 AS anomaly_prob_threshold))

If the time series input column names are ts_timestamp and ts_data, then this query returns results like the following:

+-------------------------+----------+------------+-------------+-------------+---------------------+
|      ts_timestamp       | ts_data  | is_anomaly | lower_bound | upper_bound | anomaly_probability |
+-------------------------+----------+------------+-------------+-------------+---------------------+
| 2021-01-01 00:00:01 UTC |  125.3   |    TRUE    |  129.5      |  133.6      |  0.93               |
| 2021-01-02 00:00:01 UTC |  145.3   |    TRUE    |  131.5      |  136.6      |  0.96               |
+-------------------------+----------+------------+-------------+-------------+---------------------+

ARIMA_PLUS model example with input data as a query statement

The following query detects anomalies using the ARIMA_PLUS model, specifying a custom anomaly_prob_threshold value of 0.9 and passing an input data table into the query.

SELECT
  *
FROM
  ML.DETECT_ANOMALIES(MODEL `mydataset.my_arima_plus_model`,
                      STRUCT(0.9 AS anomaly_prob_threshold),
                      (
                        SELECT
                          state, city, date, temperature, weather
                        FROM
                          `mydataset.my_time_series_data_table`))

If the TIME_SERIES_ID_COL column names are state, city, and TIME_SERIES_TIMESTAMP_COL, and the TIME_SERIES_DATA_COL column names are date and temperature, then this query returns results like the following:

+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+
| state |   city     |           date          | temperature | is_anomaly | lower_bound | upper_bound | anomaly_probability | weather |
+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+
| "WA"  | "Kirkland" | 2021-01-01 00:00:00 UTC |   38.1      |   FALSE    |     36.4    |    42.0     |        0.8293       | "sunny" |
| "WA"  | "Kirkland" | 2021-01-02 00:00:00 UTC |   37.1      |   TRUE     |     37.4    |    43.3     |        0.9124       | "rainy" |
+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+

ARIMA_PLUS model example with input data as a table

The following query detects anomalies using the ARIMA_PLUS model, specifying a custom anomaly_prob_threshold value of 0.9 and passing an input data table into the query.

SELECT
  *
FROM
  ML.DETECT_ANOMALIES(MODEL `mydataset.my_arima_plus_model`,
                      STRUCT(0.9 AS anomaly_prob_threshold),
                      TABLE `mydataset.my_time_series_data_table`)

If the TIME_SERIES_ID_COL column names are state, city, and TIME_SERIES_TIMESTAMP_COL, and the TIME_SERIES_DATA_COL column names are date and temperature, and one additional column weather is in the input data table, then this query returns results like the following:

+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+
| state |   city     |           date          | temperature | is_anomaly | lower_bound | upper_bound | anomaly_probability | weather |
+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+
| "WA"  | "Kirkland" | 2021-01-01 00:00:00 UTC |   38.1      |   FALSE    |     36.4    |    42.0     |        0.8293       | "sunny" |
| "WA"  | "Kirkland" | 2021-01-02 00:00:00 UTC |   37.1      |   TRUE     |     37.4    |    43.3     |        0.9124       | "rainy" |
+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+

AUTOENCODER model example

The following query detects anomalies using the AUTOENCODER model and a contamination value of 0.1.

SELECT
  *
FROM
  ML.DETECT_ANOMALIES(MODEL `mydataset.my_autoencoder_model`,
                      STRUCT(0.1 AS contamination),
                      TABLE `mydataset.mytable`)

If the feature column names are f1 and f2, then this query returns results like the following:

+------------+--------------------+---------+--------+
| is_anomaly | mean_squared_error |    f1   |   f2   |
+------------+--------------------+---------+--------+
|   FALSE    |     0.63456        |   120   |  "a"   |
|   TRUE     |     11.342         |  15000  |  "b"   |
+------------+--------------------+---------+--------+

KMEANS model example

The following query detects anomalies using the KMEANS model and a contamination value of 0.2.

SELECT
  *
FROM
  ML.DETECT_ANOMALIES(MODEL `mydataset.my_kmeans_model`,
                      STRUCT(0.2 AS contamination),
                      (
                        SELECT
                          f1,
                          f2
                        FROM
                          `mydataset.mytable`))

This query returns results like the following:

+------------+---------------------+-------------+--------+--------+
| is_anomaly | normalized_distance | centroid_id |   f1   |   f2   |
+------------+---------------------+-------------+--------+--------+
|   FALSE    |     0.63456         |     1       |  120   |  "a"   |
|   TRUE     |     6.3243          |     2       | 15000  |  "b"   |
+------------+---------------------+-------------+--------+--------+

Pricing

All queries that use the ML.DETECT_ANOMALIES function are billable, regardless of the pricing model.