The ML.VALIDATE_DATA_DRIFT function

This document describes the ML.VALIDATE_DATA_DRIFT function, which you can use to compute the data drift between two sets of serving data. This function computes and compares the statistics for the two data sets, and then identifies where there are anomalous differences between the two data sets. For example, you might want to compare the current serving data to historical serving data from a table snapshot, or to the features served at a particular point in time, which you can get by using the ML.FEATURES_AT_TIME function. You can use the data output by this function for model monitoring.
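
For example, the following sketch compares a point-in-time baseline built with ML.FEATURES_AT_TIME to the current serving table. The table names, the timestamp, and the entity_id and feature_timestamp columns excluded from the comparison are assumptions about a typical feature table, not part of this reference:

SELECT *
FROM ML.VALIDATE_DATA_DRIFT(
  -- Baseline: features as they were served at a past point in time.
  (
    SELECT * EXCEPT (entity_id, feature_timestamp)  -- keep only the feature columns
    FROM ML.FEATURES_AT_TIME(
      TABLE `myproject.mydataset.feature_table`,
      time => '2023-01-01 00:00:00+00')
  ),
  -- Study data: the current serving table.
  TABLE `myproject.mydataset.serving`,
  STRUCT(0.3 AS categorical_default_threshold)
);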

Syntax

ML.VALIDATE_DATA_DRIFT(
  { TABLE `project_id.dataset.base_table` | (base_query_statement) },
  { TABLE `project_id.dataset.study_table` | (study_query_statement) },
  STRUCT(
    [num_histogram_buckets AS num_histogram_buckets]
    [, num_quantiles_histogram_buckets AS num_quantiles_histogram_buckets]
    [, num_values_histogram_buckets AS num_values_histogram_buckets]
    [, num_rank_histogram_buckets AS num_rank_histogram_buckets]
    [, categorical_default_threshold AS categorical_default_threshold]
    [, categorical_metric_type AS categorical_metric_type]
    [, numerical_default_threshold AS numerical_default_threshold]
    [, numerical_metric_type AS numerical_metric_type]
    [, thresholds AS thresholds])
)

Arguments

ML.VALIDATE_DATA_DRIFT takes the following arguments:

  • project_id: your project ID.
  • dataset: the BigQuery dataset that contains the input table.
  • base_table: the name of the input table of serving data that you want to use as the baseline for comparison.
  • base_query_statement: a query that generates the serving data that you want to use as the baseline for comparison. For the supported SQL syntax of the base_query_statement clause, see GoogleSQL query syntax.
  • study_table: the name of the input table that contains the serving data that you want to compare to the baseline.
  • study_query_statement: a query that generates the serving data that you want to compare to the baseline. For the supported SQL syntax of the study_query_statement clause, see GoogleSQL query syntax.
  • num_histogram_buckets: an INT64 value that specifies the number of buckets to use for a histogram with equal-width buckets. Only applies to numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> columns. The num_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
  • num_quantiles_histogram_buckets: an INT64 value that specifies the number of buckets to use for a quantiles histogram. Only applies to numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> columns. The num_quantiles_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
  • num_values_histogram_buckets: an INT64 value that specifies the number of buckets to use for a quantiles histogram. Only applies to ARRAY columns. The num_values_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
  • num_rank_histogram_buckets: an INT64 value that specifies the number of buckets to use for a rank histogram. Only applies to categorical and ARRAY<categorical> columns. The num_rank_histogram_buckets value must be in the range [1, 10,000]. The default value is 50.
  • categorical_default_threshold: a FLOAT64 value that specifies the custom threshold to use for anomaly detection for categorical and ARRAY<categorical> features. The value must be in the range [0, 1). The default value is 0.3.
  • categorical_metric_type: a STRING value that specifies the metric used to compare statistics for categorical and ARRAY<categorical> features. Valid values are L_INFTY and JENSEN_SHANNON_DIVERGENCE. The default value is L_INFTY.
  • numerical_default_threshold: a FLOAT64 value that specifies the custom threshold to use for anomaly detection for numerical features. The value must be in the range [0, 1). The default value is 0.3.
  • numerical_metric_type: a STRING value that specifies the metric used to compare statistics for numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> features. The only valid value is JENSEN_SHANNON_DIVERGENCE.
  • thresholds: an ARRAY<STRUCT<STRING, FLOAT64>> value that specifies the anomaly detection thresholds for one or more columns for which you don't want to use the default threshold. The STRING value in the struct specifies the column name, and the FLOAT64 value specifies the threshold. The FLOAT64 value must be in the range [0, 1). For example, [('col_a', 0.1), ('col_b', 0.8)]. A full call that uses this argument is sketched after this list.
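
The following sketch overrides the default threshold for two hypothetical columns, col_a and col_b, reusing the example values from the thresholds description; the table names are assumptions:

SELECT *
FROM ML.VALIDATE_DATA_DRIFT(
  TABLE `myproject.mydataset.previous_serving_data`,
  TABLE `myproject.mydataset.serving`,
  -- Per-column thresholds; all other columns use the defaults.
  STRUCT([('col_a', 0.1), ('col_b', 0.8)] AS thresholds)
);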

Output

ML.VALIDATE_DATA_DRIFT returns one row for each column in the input data. The output contains the following columns:

  • input: a STRING column that contains the input column name.
  • metric: a STRING column that contains the metric used to compare the input column statistical value between the two data sets. This column value is JENSEN_SHANNON_DIVERGENCE for numerical features, and either L_INFTY or JENSEN_SHANNON_DIVERGENCE for categorical features.
  • threshold: a FLOAT64 column that contains the threshold used to determine whether the statistical difference in the input column value between the two data sets is anomalous.
  • value: a FLOAT64 column that contains the statistical difference in the input column value between the two data sets.
  • is_anomaly: a BOOL column that indicates whether the value column value is higher than the threshold column value.
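
For example, to list only the columns whose drift exceeds the threshold, you can filter on the is_anomaly column. The table names below are placeholders:

SELECT input, metric, threshold, value
FROM ML.VALIDATE_DATA_DRIFT(
  TABLE `myproject.mydataset.previous_serving_data`,
  TABLE `myproject.mydataset.serving`,
  STRUCT(0.2 AS categorical_default_threshold)
)
WHERE is_anomaly;  -- keep only the columns flagged as drifted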

Example

The following example computes data drift between a snapshot of the serving data table and the current serving data table, with a categorical feature threshold of 0.2:

SELECT *
FROM ML.VALIDATE_DATA_DRIFT(
  TABLE `myproject.mydataset.previous_serving_data`,
  TABLE `myproject.mydataset.serving`,
  STRUCT(0.2 AS categorical_default_threshold)
);
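
You can also pass query statements instead of tables. For example, the following sketch compares two time windows of the same serving table; the serving_time column is a hypothetical ingestion timestamp, excluded from both inputs so that only the features are compared:

SELECT *
FROM ML.VALIDATE_DATA_DRIFT(
  -- Baseline window: January serving data.
  (
    SELECT * EXCEPT (serving_time)
    FROM `myproject.mydataset.serving`
    WHERE serving_time BETWEEN '2023-01-01' AND '2023-01-31'
  ),
  -- Study window: February serving data.
  (
    SELECT * EXCEPT (serving_time)
    FROM `myproject.mydataset.serving`
    WHERE serving_time BETWEEN '2023-02-01' AND '2023-02-28'
  ),
  STRUCT(0.2 AS categorical_default_threshold)
);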

Limitations

  • Running the ML.VALIDATE_DATA_DRIFT function on a large amount of input data can cause the query to return the error Dry run query timed out. To resolve the error, disable retrieval of cached results for the query.

  • ML.VALIDATE_DATA_DRIFT doesn't conduct schema validation between the two sets of input data, and so handles data type mismatches as follows:

    • If you specify JENSEN_SHANNON_DIVERGENCE for the categorical_metric_type or numerical_metric_type argument, the feature isn't included in the final anomaly report.
    • If you specify L_INFTY for the categorical_metric_type argument, the function outputs the computed feature distance as expected.

However, when you run inference on the serving data, the ML.PREDICT function handles schema validation.
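
Because the function doesn't validate schemas, you might want to align column types yourself before comparing. The following sketch casts a hypothetical num_col column in the study data to match the type used in the base table:

SELECT *
FROM ML.VALIDATE_DATA_DRIFT(
  TABLE `myproject.mydataset.previous_serving_data`,
  (
    -- Cast the study column so that its type matches the base table.
    SELECT * REPLACE (CAST(num_col AS FLOAT64) AS num_col)
    FROM `myproject.mydataset.serving`
  ),
  STRUCT(0.2 AS categorical_default_threshold)
);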