The ML.DETECT_ANOMALIES function

This document describes the ML.DETECT_ANOMALIES function, which lets you perform anomaly detection in BigQuery ML.

For time series data, ML.DETECT_ANOMALIES runs using ARIMA_PLUS and ARIMA_PLUS_XREG (preview) models. For independent and identically distributed random variables (IID) data, ML.DETECT_ANOMALIES runs using autoencoder, k-means, or principal component analysis (PCA) models.

Syntax

# ARIMA_PLUS and ARIMA_PLUS_XREG models:
ML.DETECT_ANOMALIES(
  MODEL `project_id.dataset.model`
  [, STRUCT(anomaly_prob_threshold AS anomaly_prob_threshold)]
  [, { TABLE `project_id.dataset.table` | (query_statement) }]
)

#Autoencoder, k-means, or PCA models:
ML.DETECT_ANOMALIES(
  MODEL `project_id.dataset.model`,
  STRUCT(contamination AS contamination),
  { TABLE `project_id.dataset.table` | (query_statement) }
)

Arguments

ML.DETECT_ANOMALIES takes the following arguments:

  • project_id: Your project ID.
  • dataset: The BigQuery dataset that contains the model.
  • model: The name of the model.
  • table: The name of the table to use to perform anomaly detection.
  • query_statement: The GoogleSQL query that generates the data to use to perform anomaly detection. For the supported SQL syntax for the query_statement clause in GoogleSQL, see Query syntax.
  • anomaly_prob_threshold: a FLOAT64 value that identifies the custom threshold to use for anomaly detection. The value must be in the range [0, 1), with a default value of 0.95.

    The value of the anomaly probability at each timestamp is calculated using the actual time series data value and the values of the predicted time series data and the variance from the model training. The actual time series data value at a specific timestamp is identified as anomalous if the anomaly probability exceeds the anomaly_prob_threshold value. The anomaly_prob_threshold value also determines the lower and upper bounds, where a larger threshold value results in a larger interval size.

  • contamination: a FLOAT64 value that identifies the proportion of anomalies in the training dataset that are used to create the autoencoder, k-means, or PCA input models. The value must be in the range [0, 0.5].

    For example, contamination value of 0.1 means that 10% of the training data that was used to create the input model is anomalous. The contamination value determines the cutoff threshold of the target metric to become anomalous, and any input data with a target metric greater than the cutoff threshold is identified as anomalous. The target metric is mean squared error for autoencoder and PCA models, and the target metric is normalized distance for k-means models. For more information on normalized distance, see K-means model output.

Input

The input requirements for the ML.DETECT_ANOMALIES function depend upon the input model type.

Time series model input

Anomaly detection with ARIMA_PLUS and ARIMA_PLUS_XREG models has the following requirements:

  • To detect anomalies in historical time-series data, the DECOMPOSE_TIME_SERIES training option must be set as its default value of TRUE when the input model is created. Neither table_name nor query_statement is accepted.
  • The anomaly_prob_threshold value must be specified to detect anomalies in new time-series data.
  • The column names of either the table_name input table or the query_statement clause must match the column names that are used to create the input model.
  • The data types of the TIME_SERIES_ID_COL columns must match the data types of the columns that are used to create the input model.

For a list of supported data types for the TIME_SERIES_TIMESTAMP_COL and TIME_SERIES_DATA_COL columns, see Supported data types for time series model inputs.

Autoencoder, k-means, or PCA model input

Anomaly detection with autoencoder, k-means, or PCA models has the following requirements:

  • The column names of the input data from either the table or the query_statement argument must match the column names of the model. The column data types must be compatible according to BigQuery implicit coercion rules.
  • If you used the TRANSFORM clause in the CREATE MODEL statement that created the model, then only the input columns present in the TRANSFORM clause can appear in query_statement.

For information about how BigQuery ML handles NULL values in the feature column of the input data, see Imputation.

Output

ML.DETECT_ANOMALIES always returns the is_anomaly column that contains the anomaly detection results. Other output columns differ based upon the input model type and input data table.

Time series model output

ARIMA_PLUS and ARIMA_PLUS_XREG model output includes the following columns, followed by the input table columns, if present. Output can include the following:

  • time_series_id_col or time_series_id_cols: the identifiers of a time series. Only present when forecasting multiple time series at once. The column names and types are inherited from the TIME_SERIES_ID_COL option as specified in the model creation query.
  • time_series_timestamp: a STRING value that contains the timestamp column for a time series. The column name is inherited from the TIME_SERIES_TIMESTAMP_COL option as specified in the CREATE MODEL statement. The column has a type of TIMESTAMP, regardless of the TIME_SERIES_TIMESTAMP_COL input column data type.
  • time_series_data: a STRING value that contains the data column for a time series. The column name is inherited from the TIME_SERIES_DATA_COL option as specified in the CREATE MODEL statement. The column has a type of FLOAT64, regardless of the TIME_SERIES_DATA_COL input column data type.
  • is_anomaly: a BOOL value that indicates whether the value at a specific timestamp is an anomaly. If the anomaly_probability value is above the anomaly_prob_threshold value, then the time_series_data value is out of the range for the lower and upper bounds and the is_anomaly value is TRUE.
  • lower_bound: a FLOAT64 value that contains the lower bound of the prediction result.
  • upper_bound: a FLOAT64 value that that contains the upper bound of the prediction result.
  • anomaly_probability: a FLOAT64 value that contains the probability that this point is an anomaly. For example, an anomaly_probability value of 0.95 means that, among all possible values at the given timestamp, there is a 95% chance that the value is closer to the predicted value than it is to the given time series data value. This indicates a 95% probability that the given time series data value is an anomaly.

ML.DETECT_ANOMALIES output for time series models has the following properties:

  • The function returns NULL values in the is_anomaly, upper_bound, lower_bound and anomaly_probability columns for rows with invalid input, which include the following cases:
    • The value in the TIME_SERIES_ID_COL column does not exist in the model.
    • The value in the TIME_SERIES_TIMESTAMP_COL column is not in the range of the forecast horizon.
    • The value in the TIME_SERIES_TIMESTAMP_COL column does not follow the same frequency as the one in the model.

Autoencoder and PCA model output

Autoencoder and PCA model output includes the following columns, followed by the input table columns:

  • is_anomaly: a BOOL value that indicates whether the value is anomalous.
  • mean_squared_error: a FLOAT64 value that contains the mean squared error.

K-means model output

K-means model output includes the following, followed by the input table columns:

  • is_anomaly: a BOOL value that indicates whether the value is anomalous.
  • normalized_distance: a FLOAT64 value that contains the shortest distance among the normalized distances from the input data to each cluster centroid. Normalized distances are computed as the absolute distance from the input data to a cluster centroid, divided by the cluster's radius. The cluster radius is defined as the root mean square of all of the distances from each cluster's assigned data points to its centroid. Normalized distance is used in favor of absolute distance to determine anomalies because anomalies might not be detected as effectively using absolute distances, since they don't account for cluster radius. The distance type is determined by the DISTANCE_TYPE value specified during model training.
  • centroid_id: an INT64 value that contains the centroid ID.

Examples

The following examples show how to use ML.DETECT_ANOMALIES with different input models and settings.

ARIMA_PLUS model without specified settings

The following example detects anomalies using an ARIMA_PLUS model that has the DECOMPOSE_TIME_SERIES training option set to its default value of TRUE, without specifying the anomaly_prob_threshold argument.

SELECT
  *
FROM
  ML.DETECT_ANOMALIES(MODEL `mydataset.my_arima_plus_model`)

If the time series input column names are ts_timestamp and ts_data, then this query returns results similar to the following:

+-------------------------+----------+------------+-------------+-------------+---------------------+
|      ts_timestamp       | ts_data  | is_anomaly | lower_bound | upper_bound | anomaly_probability |
+-------------------------+----------+------------+-------------+-------------+---------------------+
| 2021-01-01 00:00:01 UTC |  125.3   |   FALSE    |  123.5      |  139.1      |  0.93               |
| 2021-01-02 00:00:01 UTC |  145.3   |   TRUE     |  128.5      |  143.1      |  0.96               |
+-------------------------+----------+------------+-------------+-------------+---------------------+

ARIMA_PLUS model with a custom anomaly_prob_threshold value

The following example detects anomalies using an ARIMA_PLUS model that has the DECOMPOSE_TIME_SERIES training option set to its default value of TRUE, using a custom anomaly_prob_threshold value of 0.8:

SELECT
  *
FROM
  ML.DETECT_ANOMALIES(MODEL `mydataset.my_arima_plus_model`,
    STRUCT(0.8 AS anomaly_prob_threshold))

If the time series input column names are ts_timestamp and ts_data, then this query returns results similar to the following:

+-------------------------+----------+------------+-------------+-------------+---------------------+
|      ts_timestamp       | ts_data  | is_anomaly | lower_bound | upper_bound | anomaly_probability |
+-------------------------+----------+------------+-------------+-------------+---------------------+
| 2021-01-01 00:00:01 UTC |  125.3   |    TRUE    |  129.5      |  133.6      |  0.93               |
| 2021-01-02 00:00:01 UTC |  145.3   |    TRUE    |  131.5      |  136.6      |  0.96               |
+-------------------------+----------+------------+-------------+-------------+---------------------+

ARIMA_PLUS model with input data as a query statement

The following example detects anomalies using an ARIMA_PLUS model, using a custom anomaly_prob_threshold value of 0.9 and passing an input data table into the query:

SELECT
  *
FROM
  ML.DETECT_ANOMALIES(MODEL `mydataset.my_arima_plus_model`,
    STRUCT(0.9 AS anomaly_prob_threshold),
    (
      SELECT
        state, city, date, temperature, weather
      FROM
        `mydataset.my_time_series_data_table`))

This example uses the following column values:

  • TIME_SERIES_ID_COL is state, city.
  • TIME_SERIES_TIMESTAMP_COL isdate.
  • TIME_SERIES_DATA_COL is temperature.

This example returns results similar to the following:

+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+
| state |   city     |           date          | temperature | is_anomaly | lower_bound | upper_bound | anomaly_probability | weather |
+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+
| "WA"  | "Kirkland" | 2021-01-01 00:00:00 UTC |   38.1      |   FALSE    |     36.4    |    42.0     |        0.8293       | "sunny" |
| "WA"  | "Kirkland" | 2021-01-02 00:00:00 UTC |   37.1      |   TRUE     |     37.4    |    43.3     |        0.9124       | "rainy" |
+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+

ARIMA_PLUS model with input data as a table

The following example detects anomalies using an ARIMA_PLUS model, using a custom anomaly_prob_threshold value of 0.9 and passing an input data table into the query:

SELECT
  *
FROM
  ML.DETECT_ANOMALIES(MODEL `mydataset.my_arima_plus_model`,
    STRUCT(0.9 AS anomaly_prob_threshold),
    TABLE `mydataset.my_time_series_data_table`)

If the TIME_SERIES_ID_COL column names are state, city, and TIME_SERIES_TIMESTAMP_COL, and the TIME_SERIES_DATA_COL column names are date and temperature, and one additional column weather is in the input data table, then this query returns results similar to the following:

+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+
| state |   city     |           date          | temperature | is_anomaly | lower_bound | upper_bound | anomaly_probability | weather |
+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+
| "WA"  | "Kirkland" | 2021-01-01 00:00:00 UTC |   38.1      |   FALSE    |     36.4    |    42.0     |        0.8293       | "sunny" |
| "WA"  | "Kirkland" | 2021-01-02 00:00:00 UTC |   37.1      |   TRUE     |     37.4    |    43.3     |        0.9124       | "rainy" |
+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+

Autoencoder model

The following example detects anomalies using an autoencoder model and a contamination value of 0.1.

SELECT
  *
FROM
  ML.DETECT_ANOMALIES(MODEL `mydataset.my_autoencoder_model`,
    STRUCT(0.1 AS contamination),
    TABLE `mydataset.mytable`)

If the feature column names are f1 and f2, then this query returns results similar to the following:

+------------+--------------------+---------+--------+
| is_anomaly | mean_squared_error |    f1   |   f2   |
+------------+--------------------+---------+--------+
|   FALSE    |     0.63456        |   120   |  "a"   |
|   TRUE     |     11.342         |  15000  |  "b"   |
+------------+--------------------+---------+--------+

K-means model

The following example detects anomalies using a k-means model and a contamination value of 0.2.

SELECT
  *
FROM
  ML.DETECT_ANOMALIES(MODEL `mydataset.my_kmeans_model`,
    STRUCT(0.2 AS contamination),
    (
      SELECT
        f1,
        f2
      FROM
        `mydataset.mytable`))

This query returns results similar to the following:

+------------+---------------------+-------------+--------+--------+
| is_anomaly | normalized_distance | centroid_id |   f1   |   f2   |
+------------+---------------------+-------------+--------+--------+
|   FALSE    |     0.63456         |     1       |  120   |  "a"   |
|   TRUE     |     6.3243          |     2       | 15000  |  "b"   |
+------------+---------------------+-------------+--------+--------+

PCA model

The following example detects anomalies using a PCA model and a contamination value of 0.1.

SELECT
  *
FROM
  ML.DETECT_ANOMALIES(MODEL `mydataset.my_pca_model`,
    STRUCT(0.1 AS contamination),
    TABLE `mydataset.mytable`)

If the feature column names are f1, f2 and f3, then this query returns results similar to the following:

+------------+--------------------+---------+--------+------+
| is_anomaly | mean_squared_error |    f1   |   f2   |  f3  |
+------------+--------------------+---------+--------+------+
|   FALSE    |     0.63456        |   120   |  "a"   |  0.9 |
|   TRUE     |     11.342         |  15000  |  "b"   |  25  |
+------------+--------------------+---------+--------+------+

Pricing

All queries that use the ML.DETECT_ANOMALIES function are billable, regardless of the pricing model.

What's next