The ML.VALIDATE_DATA_DRIFT function
This document describes the ML.VALIDATE_DATA_DRIFT
function, which you can use
to compute the data drift between two sets of serving data. This
function computes and compares the statistics for the two data sets, and then
identifies where there are anomalous differences between the two
data sets. For example, you might want to compare the current serving
data to historical serving data from a
table snapshot, or to the features
served at a particular point in time, which you can get by using the
ML.FEATURES_AT_TIME
function.
You can use the data output by this function for
model monitoring.
Syntax
ML.VALIDATE_DATA_DRIFT(
  { TABLE `project_id.dataset.base_table` | (base_query_statement) },
  { TABLE `project_id.dataset.study_table` | (study_query_statement) },
  STRUCT(
    [num_histogram_buckets AS num_histogram_buckets]
    [, num_quantiles_histogram_buckets AS num_quantiles_histogram_buckets]
    [, num_values_histogram_buckets AS num_values_histogram_buckets]
    [, num_rank_histogram_buckets AS num_rank_histogram_buckets]
    [, categorical_default_threshold AS categorical_default_threshold]
    [, categorical_metric_type AS categorical_metric_type]
    [, numerical_default_threshold AS numerical_default_threshold]
    [, numerical_metric_type AS numerical_metric_type]
    [, thresholds AS thresholds])
)
Arguments
ML.VALIDATE_DATA_DRIFT
takes the following arguments:
- project_id: your project ID.
- dataset: the BigQuery dataset that contains the input tables.
- base_table: the name of the input table of serving data that you want to use as the baseline for comparison.
- base_query_statement: a query that generates the serving data that you want to use as the baseline for comparison. For the supported SQL syntax of the base_query_statement clause, see GoogleSQL query syntax.
- study_table: the name of the input table that contains the serving data that you want to compare to the baseline.
- study_query_statement: a query that generates the serving data that you want to compare to the baseline. For the supported SQL syntax of the study_query_statement clause, see GoogleSQL query syntax.
- num_histogram_buckets: an INT64 value that specifies the number of buckets to use for a histogram with equal-width buckets. Only applies to numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> columns. The num_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
- num_quantiles_histogram_buckets: an INT64 value that specifies the number of buckets to use for a quantiles histogram. Only applies to numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> columns. The num_quantiles_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
- num_values_histogram_buckets: an INT64 value that specifies the number of buckets to use for a quantiles histogram. Only applies to ARRAY columns. The num_values_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
- num_rank_histogram_buckets: an INT64 value that specifies the number of buckets to use for a rank histogram. Only applies to categorical and ARRAY<categorical> columns. The num_rank_histogram_buckets value must be in the range [1, 10,000]. The default value is 50.
- categorical_default_threshold: a FLOAT64 value that specifies the custom threshold to use for anomaly detection for categorical and ARRAY<categorical> features. The value must be in the range [0, 1). The default value is 0.3.
- categorical_metric_type: a STRING value that specifies the metric used to compare statistics for categorical and ARRAY<categorical> features. Valid values are as follows:
  - L_INFTY: use L-infinity distance. This value is the default.
  - JENSEN_SHANNON_DIVERGENCE: use Jensen–Shannon divergence.
- numerical_default_threshold: a FLOAT64 value that specifies the custom threshold to use for anomaly detection for numerical features. The value must be in the range [0, 1). The default value is 0.3.
- numerical_metric_type: a STRING value that specifies the metric used to compare statistics for numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> features. The only valid value is JENSEN_SHANNON_DIVERGENCE.
- thresholds: an ARRAY<STRUCT<STRING, FLOAT64>> value that specifies the anomaly detection thresholds for one or more columns for which you don't want to use the default threshold. The STRING value in the struct specifies the column name, and the FLOAT64 value specifies the threshold. The FLOAT64 value must be in the range [0, 1). For example, [('col_a', 0.1), ('col_b', 0.8)]. A usage sketch follows this list.
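For example, the following query compares two serving tables, switches categorical features to Jensen–Shannon divergence, and overrides the default threshold for two columns by using the thresholds argument. This is a minimal sketch: the table names and the col_a and col_b column names are placeholders taken from the examples in this document.

SELECT
  *
FROM
  ML.VALIDATE_DATA_DRIFT(
    TABLE `myproject.mydataset.previous_serving_data`,
    TABLE `myproject.mydataset.serving`,
    STRUCT(
      -- Compare categorical columns by using Jensen–Shannon divergence
      -- instead of the default L-infinity distance.
      'JENSEN_SHANNON_DIVERGENCE' AS categorical_metric_type,
      -- Override the default 0.3 threshold for two specific columns.
      [('col_a', 0.1), ('col_b', 0.8)] AS thresholds)
  );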
Output
ML.VALIDATE_DATA_DRIFT returns one row for each column in the input data.
ML.VALIDATE_DATA_DRIFT output contains the following columns:
- input: a STRING column that contains the input column name.
- metric: a STRING column that contains the metric used to compare the input column statistical value between the two data sets. This column value is JENSEN_SHANNON_DIVERGENCE for numerical features, and either L_INFTY or JENSEN_SHANNON_DIVERGENCE for categorical features.
- threshold: a FLOAT64 column that contains the threshold used to determine whether the statistical difference in the input column value between the two data sets is anomalous.
- value: a FLOAT64 column that contains the statistical difference in the input column value between the two data sets.
- is_anomaly: a BOOL column that indicates whether the value value is higher than the threshold value.
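Because each output row describes one input column, you can filter on the is_anomaly column to list only the columns with anomalous drift. The following minimal sketch reuses the table names from the example in the next section.

SELECT
  input, metric, value, threshold
FROM
  ML.VALIDATE_DATA_DRIFT(
    TABLE `myproject.mydataset.previous_serving_data`,
    TABLE `myproject.mydataset.serving`,
    STRUCT(0.2 AS categorical_default_threshold)
  )
-- Keep only the columns whose drift value exceeds the threshold.
WHERE is_anomaly;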
Example
The following example computes data drift between a snapshot of the
serving data table and the current serving data table,
with a categorical feature threshold of 0.2:

SELECT
  *
FROM
  ML.VALIDATE_DATA_DRIFT(
    TABLE `myproject.mydataset.previous_serving_data`,
    TABLE `myproject.mydataset.serving`,
    STRUCT(0.2 AS categorical_default_threshold)
  );
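You can also pass query statements instead of tables for the base and study inputs. The following sketch compares the serving data logged over the previous week to the serving data logged over the last day; it assumes a hypothetical serving log table with a serving_time timestamp column, so adjust the table and column names to match your data.

SELECT
  *
FROM
  ML.VALIDATE_DATA_DRIFT(
    -- Baseline: serving data from the seven days before the most recent day.
    (
      SELECT * EXCEPT (serving_time)
      FROM `myproject.mydataset.serving_log`
      WHERE serving_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 8 DAY)
        AND TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
    ),
    -- Study: serving data from the most recent day.
    (
      SELECT * EXCEPT (serving_time)
      FROM `myproject.mydataset.serving_log`
      WHERE serving_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
    ),
    STRUCT(0.2 AS categorical_default_threshold)
  );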
Limitations
- Running the ML.VALIDATE_DATA_DRIFT function on a large amount of input data can cause the query to return the error Dry run query timed out. To resolve the error, disable retrieval of cached results for the query.
- ML.VALIDATE_DATA_DRIFT doesn't conduct schema validation between the two sets of input data, and so handles data type mismatches as follows:
  - If you specify JENSEN_SHANNON_DIVERGENCE for the categorical_metric_type or numerical_metric_type argument, the feature isn't included in the final anomaly report.
  - If you specify L_INFTY for the categorical_metric_type argument, the function outputs the computed feature distance as expected.
  However, when you run inference on the serving data, the ML.PREDICT function handles schema validation.