The ML.TFDV_VALIDATE function
This document describes the ML.TFDV_VALIDATE function, which you can use to compare the statistics for training and serving data, or two sets of serving data, in order to identify anomalous differences between the two data sets. Calling this function provides the same behavior as calling the TensorFlow validate_statistics API.
You can use the data output by this function for model monitoring.
Syntax
ML.TFDV_VALIDATE(
  base_statistics,
  study_statistics
  [, detection_type]
  [, categorical_default_threshold]
  [, categorical_metric_type]
  [, numerical_default_threshold]
  [, numerical_metric_type]
  [, thresholds]
)
Arguments
ML.TFDV_VALIDATE takes the following arguments:
- base_statistics: the statistics of the training or serving data that you want to use as the baseline for comparison. This must be a TensorFlow DatasetFeatureStatisticsList protocol buffer in JSON format. You can generate a protocol buffer in the correct format by running the ML.TFDV_DESCRIBE function, or you can load it from outside of BigQuery.
- study_statistics: the statistics of the training or serving data that you want to compare to the baseline. This must be a TensorFlow DatasetFeatureStatisticsList protocol buffer in JSON format. You can generate a protocol buffer in the correct format by running the ML.TFDV_DESCRIBE function, or you can load it from outside of BigQuery.
- detection_type: a STRING value that specifies the type of comparison that you want to make. Valid values are as follows:
  - SKEW: returns the data skew, which represents the statistical variation between training and serving data.
  - DRIFT: returns the data drift, which represents the statistical variation between two different sets of serving data.
- categorical_default_threshold: a FLOAT64 value that specifies the custom threshold to use for anomaly detection for categorical and ARRAY<categorical> features. The value must be in the range [0, 1). The default value is 0.3.
- categorical_metric_type: a STRING value that specifies the metric used to compare statistics for categorical and ARRAY<categorical> features. Valid values are as follows:
  - L_INFTY: use L-infinity distance. This value is the default.
  - JENSEN_SHANNON_DIVERGENCE: use Jensen–Shannon divergence.
- numerical_default_threshold: a FLOAT64 value that specifies the custom threshold to use for anomaly detection for numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> features. The value must be in the range [0, 1). The default value is 0.3.
- numerical_metric_type: a STRING value that specifies the metric used to compare statistics for numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> features. The only valid value is JENSEN_SHANNON_DIVERGENCE.
- thresholds: an ARRAY<STRUCT<STRING, FLOAT64>> value that specifies the anomaly detection thresholds for one or more columns for which you don't want to use the default threshold. The STRING value in the struct specifies the column name, and the FLOAT64 value specifies the threshold. The FLOAT64 value must be in the range [0, 1). For example, [('col_a', 0.1), ('col_b', 0.8)].
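For reference, the two metric names correspond to the following textbook distance definitions between two discrete distributions, written here as p (baseline) and q (study) over the same set of categories or histogram buckets. This is a sketch of the standard formulas only: the symbols p, q, and m are introduced for illustration, the base-2 logarithm is an assumption (it bounds the divergence by 1), and nothing here describes how the function bins numerical features internally.

```latex
\begin{align*}
d_{\infty}(p, q) &= \max_{i} \lvert p_i - q_i \rvert \\
m_i &= \tfrac{1}{2}\,(p_i + q_i) \\
\mathrm{JSD}(p \,\|\, q) &= \tfrac{1}{2} \sum_{i} p_i \log_2 \frac{p_i}{m_i}
                          + \tfrac{1}{2} \sum_{i} q_i \log_2 \frac{q_i}{m_i}
\end{align*}
```

Under these definitions both quantities lie between 0 and 1, which is consistent with the [0, 1) range required for the threshold arguments; larger values indicate a larger difference between the two distributions.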
ML.TFDV_VALIDATE uses positional arguments, so if you specify an optional argument, you must also specify all arguments prior to that argument. For more information on argument types, see Named arguments.
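As an illustration of the positional rule, the following sketch changes only categorical_metric_type, which is the fifth argument. Because the arguments are positional, detection_type and categorical_default_threshold must be supplied as well. The table names are reused from this document's examples, and 0.3 simply restates the documented default threshold.

```sql
-- Compute statistics for the two tables used in this document's examples.
DECLARE stats1 JSON;
DECLARE stats2 JSON;
SET stats1 = (SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.training`));
SET stats2 = (SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.serving`));

-- Goal: override only categorical_metric_type (argument 5).
-- Arguments 3 and 4 precede it, so they must be passed explicitly.
SELECT ML.TFDV_VALIDATE(
  stats1,                       -- base_statistics
  stats2,                       -- study_statistics
  'SKEW',                       -- detection_type
  0.3,                          -- categorical_default_threshold (documented default)
  'JENSEN_SHANNON_DIVERGENCE'   -- categorical_metric_type
);
```

The trailing optional arguments (numerical_default_threshold, numerical_metric_type, and thresholds) are omitted here, so they keep their default behavior.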
Output
ML.TFDV_VALIDATE returns a TensorFlow Anomalies protocol buffer in JSON format.
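Because the result is a JSON value, it can be stored and queried like any other JSON column, which is one way to use it for monitoring. The following is a minimal sketch under stated assumptions: the table `myproject.mydataset.anomaly_log` and its columns (t, anomalies) are hypothetical, and the `anomaly_info` key is assumed from the field name in the Anomalies protocol buffer definition. Inspect the actual output to confirm the key names before relying on them.

```sql
DECLARE stats1 JSON;
DECLARE stats2 JSON;
DECLARE result JSON;
SET stats1 = (SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.training`));
SET stats2 = (SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.serving`));

-- Capture the anomaly report in a variable so that it can be reused.
SET result = (SELECT ML.TFDV_VALIDATE(stats1, stats2, 'SKEW'));

-- Hypothetical log table with the columns (t TIMESTAMP, anomalies JSON).
INSERT `myproject.mydataset.anomaly_log` (t, anomalies)
SELECT CURRENT_TIMESTAMP() AS t, result;

-- Assumption: the JSON keeps the proto field name anomaly_info.
-- JSON_QUERY returns NULL if the key is absent.
SELECT JSON_QUERY(result, '$.anomaly_info') AS anomaly_info;
```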
Examples
The following example returns the skew between training and serving data and also sets custom anomaly detection thresholds for two of the feature columns:
DECLARE stats1 JSON;
DECLARE stats2 JSON;

-- Compute statistics for the training and serving tables.
SET stats1 = (
  SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.training`)
);
SET stats2 = (
  SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.serving`)
);

-- Compare the two sets of statistics, overriding the thresholds
-- for feature1 and feature2.
SELECT ML.TFDV_VALIDATE(
  stats1,
  stats2,
  'SKEW',
  .3,
  'L_INFTY',
  .3,
  'JENSEN_SHANNON_DIVERGENCE',
  [('feature1', 0.2), ('feature2', 0.5)]
);

-- Store the statistics in stats1 along with a timestamp.
INSERT `myproject.mydataset.serve_stats` (t, dataset_feature_statistics_list)
SELECT CURRENT_TIMESTAMP() AS t, stats1;
The following example returns the drift between two sets of serving data:
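SELECT ML.TFDV_VALIDATE(
  -- Baseline: previously stored serving statistics.
  (SELECT dataset_feature_statistics_list
   FROM `myproject.mydataset.servingJan24`),
  -- Study: statistics computed from the current serving table.
  (SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.serving`)),
  'DRIFT'
);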
Limitations
The ML.TFDV_VALIDATE function doesn't conduct schema validation.
ML.TFDV_VALIDATE handles type mismatch as follows:
- If you specify JENSEN_SHANNON_DIVERGENCE for the categorical_metric_type or numerical_metric_type argument, the feature isn't included in the final anomaly report.
- If you specify L_INFTY for the categorical_metric_type argument, the function outputs the computed feature distance as expected.