The ML.VALIDATE_DATA_SKEW function
This document describes the ML.VALIDATE_DATA_SKEW
function, which you can use
to compute the data skew between a model's training and serving data. This
function computes the statistics for the serving data, compares them to the
statistics that were computed for the training data at the time the model was
created, and identifies where there are anomalous differences between the two
data sets. You can use the data output by this
function for model monitoring.
Statistics are computed only for feature columns in the serving data that match
feature columns in the training data, in order to achieve better performance and
lower cost. For models that were created using the TRANSFORM clause, the
statistics are based on the raw feature data before feature preprocessing within
the TRANSFORM clause.
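For example, the following sketch (the taxi_model, serving, and training_data names, and the tips label column, are hypothetical placeholders) shows that for a model defined with a TRANSFORM clause, ML.VALIDATE_DATA_SKEW compares statistics for the raw fare and company columns in the serving data rather than for the transformed f1 and f2 outputs:

-- A model whose features are preprocessed inside a TRANSFORM clause.
CREATE OR REPLACE MODEL mydataset.taxi_model
  TRANSFORM(
    ML.MIN_MAX_SCALER(fare) OVER() AS f1,
    ML.LABEL_ENCODER(company) OVER() AS f2,
    label
  )
  OPTIONS (model_type = 'linear_reg') AS
SELECT fare, company, tips AS label
FROM mydataset.training_data;

-- Skew statistics are computed for the raw fare and company columns in the
-- serving data, not for the transformed f1 and f2 outputs.
SELECT *
FROM ML.VALIDATE_DATA_SKEW(
  MODEL mydataset.taxi_model,
  TABLE mydataset.serving,
  STRUCT(0.3 AS categorical_default_threshold));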
Syntax
ML.VALIDATE_DATA_SKEW(
  MODEL `project_id.dataset.model`,
  { TABLE `project_id.dataset.table` | (query_statement) },
  STRUCT(
    [categorical_default_threshold AS categorical_default_threshold]
    [, categorical_metric_type AS categorical_metric_type]
    [, numerical_default_threshold AS numerical_default_threshold]
    [, numerical_metric_type AS numerical_metric_type]
    [, thresholds AS thresholds])
)
Arguments
ML.VALIDATE_DATA_SKEW
takes the following arguments:
- project_id: your project ID.
- dataset: the BigQuery dataset that contains the model.
- model: the name of the model.
- table: the name of the input table that contains the serving data to calculate statistics for.
- query_statement: a query that generates the serving data to calculate statistics for. For the supported SQL syntax of the query_statement clause, see GoogleSQL query syntax.
- categorical_default_threshold: a FLOAT64 value that specifies the custom threshold to use for anomaly detection for categorical and ARRAY<categorical> features. The value must be in the range [0, 1). The default value is 0.3.
- categorical_metric_type: a STRING value that specifies the metric used to compare statistics for categorical and ARRAY<categorical> features. Valid values are as follows:
  - L_INFTY: use L-infinity distance. This value is the default.
  - JENSEN_SHANNON_DIVERGENCE: use Jensen–Shannon divergence.
- numerical_default_threshold: a FLOAT64 value that specifies the custom threshold to use for anomaly detection for numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> features. The value must be in the range [0, 1). The default value is 0.3.
- numerical_metric_type: a STRING value that specifies the metric used to compare statistics for numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> features. The only valid value is JENSEN_SHANNON_DIVERGENCE.
- thresholds: an ARRAY<STRUCT<STRING, FLOAT64>> value that specifies the anomaly detection thresholds for one or more columns for which you don't want to use the default threshold. The STRING value in the struct specifies the column name, and the FLOAT64 value specifies the threshold. The FLOAT64 value must be in the range [0, 1). For example, [('col_a', 0.1), ('col_b', 0.8)].
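To illustrate how these options combine, the following sketch (the model, table, and trip_seconds column names are placeholders) uses Jensen–Shannon divergence for categorical features, lowers the default numerical threshold, and overrides the threshold for one column:

SELECT *
FROM ML.VALIDATE_DATA_SKEW(
  MODEL `myproject.mydataset.mymodel`,
  TABLE `myproject.mydataset.serving`,
  STRUCT(
    'JENSEN_SHANNON_DIVERGENCE' AS categorical_metric_type,
    0.2 AS numerical_default_threshold,
    [STRUCT('trip_seconds', 0.1)] AS thresholds));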
Output
ML.VALIDATE_DATA_SKEW
returns one row for each column in the input data.
ML.VALIDATE_DATA_SKEW
output contains the following columns:
- input: a STRING column that contains the input column name.
- metric: a STRING column that contains the metric used to compare the input column statistical value between the training and serving data sets. This column value is JENSEN_SHANNON_DIVERGENCE for numerical features, and either L_INFTY or JENSEN_SHANNON_DIVERGENCE for categorical features.
- threshold: a FLOAT64 column that contains the threshold used to determine whether the statistical difference in the input column value between the training and serving data is anomalous.
- value: a FLOAT64 column that contains the statistical difference in the input column value between the serving and the training data sets.
- is_anomaly: a BOOL column that indicates whether the value value is higher than the threshold value.
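For example, a minimal monitoring query (reusing the placeholder names from the examples that follow) might return only the rows flagged as anomalous:

SELECT input, metric, threshold, value
FROM ML.VALIDATE_DATA_SKEW(
  MODEL `myproject.mydataset.mymodel`,
  TABLE `myproject.mydataset.serving`,
  STRUCT(0.3 AS categorical_default_threshold))
WHERE is_anomaly;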
Examples
The following examples demonstrate how to use the ML.VALIDATE_DATA_SKEW
function.
Run ML.VALIDATE_DATA_SKEW
The following example computes data skew between the serving data and the
training data used to create the model, with a categorical feature threshold
of 0.2:

SELECT *
FROM ML.VALIDATE_DATA_SKEW(
  MODEL `myproject.mydataset.mymodel`,
  TABLE `myproject.mydataset.serving`,
  STRUCT(0.2 AS categorical_default_threshold));
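You can also pass a query statement instead of a table, for example to limit the skew computation to recent serving data. The following sketch assumes a hypothetical prediction_timestamp column in the serving table:

SELECT *
FROM ML.VALIDATE_DATA_SKEW(
  MODEL `myproject.mydataset.mymodel`,
  (
    SELECT *
    FROM `myproject.mydataset.serving`
    WHERE prediction_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  ),
  STRUCT(0.2 AS categorical_default_threshold));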
Automate skew detection
The following example shows how to automate skew detection for a linear regression model:
DECLARE anomalies ARRAY<STRING>;

SET anomalies = (
  SELECT ARRAY_AGG(input)
  FROM ML.VALIDATE_DATA_SKEW(
    MODEL mydataset.model_linear_reg,
    TABLE mydataset.serving,
    STRUCT(
      0.3 AS categorical_default_threshold,
      0.2 AS numerical_default_threshold,
      'JENSEN_SHANNON_DIVERGENCE' AS numerical_metric_type,
      [STRUCT('fare', 0.15), STRUCT('company', 0.25)] AS thresholds))
  WHERE is_anomaly
);

IF(ARRAY_LENGTH(anomalies) > 0) THEN
  CREATE OR REPLACE MODEL mydataset.model_linear_reg
    TRANSFORM(
      ML.MIN_MAX_SCALER(fare) OVER() AS f1,
      ML.ROBUST_SCALER(pickup_longitude) OVER() AS f2,
      ML.LABEL_ENCODER(company) OVER() AS f3,
      ML.ONE_HOT_ENCODER(payment_type) OVER() AS f4,
      label
    )
    OPTIONS (
      model_type = 'linear_reg',
      max_iterations = 1) AS (
      SELECT
        fare,
        pickup_longitude,
        company,
        payment_type,
        2 AS label
      FROM mydataset.new_training_data
    );
  SELECT ERROR(
    CONCAT(
      "Found data skew in features: ",
      ARRAY_TO_STRING(anomalies, ", "),
      ". Model is retrained with the latest data."));
ELSE
  SELECT *
  FROM ML.PREDICT(
    MODEL mydataset.model_linear_reg,
    TABLE mydataset.serving);
END IF;
Limitations
ML.VALIDATE_DATA_SKEW doesn't support the following types of models:

- AutoML
- Matrix factorization
- ARIMA_PLUS
- Remote models over LLMs, Cloud AI services, or Vertex AI endpoints
- Imported Open Neural Network Exchange (ONNX), TensorFlow, TensorFlow Lite, or XGBoost models

ML.VALIDATE_DATA_SKEW doesn't support models created before March 28, 2024, or models that use the WARM_START option. To enable use of ML.VALIDATE_DATA_SKEW, retrain the model by running the CREATE OR REPLACE MODEL statement.

Running the ML.VALIDATE_DATA_SKEW function on a large amount of input data can cause the query to return the error "Dry run query timed out". To resolve the error, disable retrieval of cached results for the query.

ML.VALIDATE_DATA_SKEW doesn't conduct schema validation between the two sets of input data, and so handles data type mismatches as follows:

- If you specify JENSEN_SHANNON_DIVERGENCE for the categorical_metric_type or numerical_metric_type argument, the feature isn't included in the final anomaly report.
- If you specify L_INFTY for the categorical_metric_type argument, the function outputs the computed feature distance as expected.

However, when you run inference on the serving data, the ML.PREDICT function handles schema validation.