Identifying training-serving skew with novelty detection

This document is the fifth in a series that shows you how to monitor ML models to AI Platform Prediction for data drift detection. This guide shows you how to compute an overall drift score for the request-response serving in BigQuery by using a novelty detection model. The guide builds on the concepts and tasks that are described in Logging serving requests by using AI Platform Prediction and Analyzing AI Platform Prediction logs in BigQuery.

This guide is for data scientists and ML engineers who want to maintain the performance of their ML models in production by determining whether the serving data drifts over time and by computing the magnitude of this drift (if any).

The series consists of the following documents:

The code for the process described in this document is incorporated into Jupyter notebooks. The notebooks are in a GitHub repository.


Monitoring the predictive performance of a machine learning (ML) model in production is a crucial area of MLOps. The predictive performance of a deployed model might decay over time. This decay can occur for several reasons, including a growing discrepancy between training and serving data, an evolving business context, or a changing technical environment.

The following diagram shows the process that's described in this guide:

Process flow.

Feature-level drift versus dataset-level drift

Previous guides in this series described how to use TensorFlow Data Validation (TFDV) to detect schema and distribution skews and anomalies in a dataset. TFDV uses a reference schema and baseline statistics that are generated from the training data to detect anomalies in new data.

You can use TFDV to identify skews and anomalies such as the following for each feature:

  • The data type of a feature isn't as expected.
  • The values of a numerical feature are outside the expected range.
  • The percentage of the missing values in a specific feature is under a predefined threshold.
  • New or missing categories are detected in a categorical feature.
  • The values distribution of a feature isn't as expected.

In practice, most of the training-serving data skews and anomalies are due to issues in individual features. For example, errors or updates in the data ingestion and processing pipelines might introduce missing values, new values, out-of-range values, or values that have incorrect data types. You might also see inconsistencies in data schema types in any of the following circumstances:

  • Data sources change.
  • You integrate new data sources.
  • You operate your model in a new domain (for example, a new geographic location).

TFDV can detect feature-level skews and anomalies such as these so that you can act on them. For example, you might update the preprocessing logic, add new vocabulary to categorical features, and discard erroneous data and retrain the model if needed.

However, in some cases, individual features in the dataset might not have any significant skew related to the reference schema and the baseline statistics. Even so, the joint probability distribution of these features shows a change. This change represents a type of covariate shift. This type of shift usually happens as new trends and patterns emerge in the data due to the changes in the dynamic of the environment. Examples might be changes in the prices of real estate relative to locations, or the popularity of fashion items among different demographic groups. Covariate shift is also subject to seasonality, because the statistical relationships between input features might change periodically in response to different seasons and events.

Therefore, it can be useful to compute an overall drift score at the dataset level with respect to all the features. This drift score can represent the skew magnitude of your serving data. Computing an overall drift score simplifies the process of monitoring dataset drift over time, and it helps you to define and take actions based on certain drift score thresholds.

Methods for scoring a dataset drift

You can use the following methods to compute a drift magnitude at the dataset level rather than at the feature level:

  • Novelty and outlier detection
  • Two-sample statistical tests

Novelty and outlier detection

In novelty and outlier detection algorithms, the aim is to mimic the task of detecting out-of-distribution samples. You build a model by using the reference (training) data, and then use that model to identify whether a new (serving) data point is an outlier. The drift score is computed as the percentage of the data points that are predicted to be outliers in the serving dataset. The advantage of this technique is that not only does it produce a drift score between 0 and 100, but also it identifies the data points that are considered anomalous.

Novelty and outlier detection algorithms include the following:

Algorithms that are used for detecting outliers in time-series data are spectral residuals and sequence-to-sequence autoencoding.

Two-sample statistical tests

In two-sample statistical tests, you perform statistical hypothesis tests to examine the equivalence of the training distribution and serving distribution. The output of these algorithms is a p-value (between 0 and 1) that indicates the probability that the two datasets come from the same distributions. Because the dataset includes several features, simple univariate statistical tests aren't sufficient. Therefore, you have the following options:

For more information about drift detection techniques, see the paper Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift. In addition, see Alibi Detect, an open source Python library for drift detection that covers tabular data, images, and time series.


  • Download training and serving data splits.
  • Train an elliptic envelope model by using the training data.
  • Test the model on normal and mutated datasets.
  • Implement an Apache Beam pipeline to compute drift score in request-response BigQuery data.
  • Run the pipeline and display drift detection output.


This tutorial uses the following billable components of Google Cloud:

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

Before you begin

Before you begin, you must complete part one and part two of this series.

After you complete these parts, you have the following:

  • An Notebooks instance that uses TensorFlow 2.3.
  • A clone of the GitHub repository that has the Jupyter notebook that you need for this guide.
  • A BigQuery table that contains request-response logs and a view that parses the raw request and response data points.

The Jupyter notebook for this scenario

A complete process for this workflow is coded into a Jupyter notebook on the GitHub repository that's associated with this document. The steps in the notebook are based on the CoverType dataset from UCI Machine Learning Repository. This dataset is the same one that was used for sample data in the previous documents in this series.

Configuring notebook settings

In this section, you prepare the Python environment to run the code for the scenario.

  1. If you don't already have the Notebooks instance from part one open in the Cloud Console, do the following:

    1. Go to the Notebooks page.

      Go to the Notebooks page

    2. In the Notebooks instance list, select the notebook, and then click Open Jupyterlab. The JupyterLab environment opens in your browser.

    3. In the file browser, open the mlops-on-gcp file, then navigate to the skew-detection directory.

  2. Open the 04-covertype-drift-detection_novelty_model.ipynb notebook.

  3. In the notebook, under Setup, run the Install packages and dependencies cell to install the required Python packages and configure the environment variables.

  4. Under Configure Google Cloud environment settings, set the following variables:

    • PROJECT_ID: The ID of the Google Cloud project where the BigQuery dataset for the request-response data is logged.
    • BUCKET: The name of the Cloud Storage bucket where produced artifacts are stored.
    • BQ_DATASET_NAME: The name of the BigQuery dataset where the request-response logs are stored.
    • BQ_TABLE_NAME: The name of the BigQuery table where the request-response logs are stored.
    • MODEL_NAME: The name of the model that's deployed to AI Platform Prediction.
    • MODEL_VERSION: The version name of the model that's deployed to AI Platform Prediction. The version is in the format vN; for example, v1.
  5. Run the remaining cells under Setup to finish configuring the environment:

    1. Authenticate your GCP account
    2. Create a local workspace
  6. Run the cells in the Download data splits section to make data available to the notebook.

Training a novelty detection model

This guide shows how to use an elliptic envelope as a novelty detection model to compute an overall drift score for the serving request-response data that's logged in BigQuery. In this section, you learn how to do the following:

  • Train the elliptic envelope by using the training data.
  • Test the envelope on both normal and mutated datasets.
  • Compare their drift scores.

The elliptic envelope algorithm is a parametric algorithm that assumes that the training data comes from a known distribution (for example, a Gaussian distribution). The algorithm tries to define the "shape" of the data by using a robust covariance estimation. Using this estimation, the model can identify outlying data points that stand a specified distance from the fit shape. Mahalanobis distances from the covariance estimate can be measured; these distances in turn can be used to calculate a measure of outlyingness.

Data preprocessing

The data requires only one kind of preprocessing in order to train the elliptic envelope model: the categorical features must be one-hot encoded to produce a numerical representation of these features. First, the code in the notebook fits a set of OneHotEncoder objects, one for each categorical feature, by using the categories in the training data. Second, the code uses the fitted encoders to transform the categorical features in the input dataset. To perform the transformations, the code uses the prepare_data method, as shown in the following code snippet:

from sklearn.preprocessing import OneHotEncoder

encoders = dict()

for feature_name in CATEGORICAL_FEATURE_NAMES:
  encoder = OneHotEncoder(handle_unknown='ignore')[[feature_name]])
  encoders[feature_name] = encoder

def prepare_data(data_frame):

  if type(data_frame) != pd.DataFrame:
    data_frame = pd.DataFrame(data_frame)
  data_frame = data_frame.reset_index()
  for feature_name, encoder in encoders.items():
    encoded_feature = pd.DataFrame(
    data_frame = data_frame.drop(feature_name, axis=1)
    encoded_feature.columns = [feature_name+"_"+str(column)
                              for column in encoded_feature.columns]
    data_frame = data_frame.join(encoded_feature)
   return data_frame

When you work with high-dimensional datasets (for example, a dataset with many features, with many images, or with text data that has a large vocabulary set), you often reduce dimensionality before you train the model. Typical techniques include principal component analysis (PCA), autoencoders, and random projection.

Model training

To train the elliptic envelope model, the code does the following:

  1. Prepares the training data by using the prepare_data method.

  2. Instantiates an EllipticEnvelope object and sets the contamination value to 0, assuming that there are no outliers in the training data.

  3. Trains the model using the prepared_training_data object by calling the method.

The following code snippet illustrates these steps:

from sklearn.covariance import EllipticEnvelope

prepared_training_data = prepare_data(train_data)
model = EllipticEnvelope(contamination=0.)

Model testing

To train the elliptic envelope model, the code can compute the Mahalanobis distance between a given data point and the model that represents a joint distribution of the training data. To compute the distance, the code calls the model.mahalanobis method and passes the given data point. However, in order to identify distance at which a data point is considered an outlier, the code needs to perform the following steps:

  1. Compute the mean and the standard deviation of Mahalanobis distances for the training data points with respect to the model. These values are stored in the model as the model._mean and model._stdv values.

    Any new data point that's at a Mahalanobis distance greater than mean + N standard deviation units is considered an outlier.

  2. Compute the drift score as the ratio of the number of outlier points to the size of the input (serving) dataset.

    This logic is shown in the following code snippet:

    def compute_drift_score(model, data_frame, stdv_units=2):
      distances = model.mahalanobis(data_frame)
      threshold = model._mean + (stdv_units * model._stdv)
      score = len([v for v in distances if v >= threshold]) / len(data_frame.index)
      return score

In order to test the model behavior for drift detection, the code uses the evaluation data split, which is assumed to have a distribution similar to the training data. The code creates a mutated version of this dataset by shuffling each column's values. In this case, the univariate distribution of each feature doesn't change. However, the multivariate (joint) distribution of the whole dataset is changed, because the feature value combination of each record is now random.

The mutated dataset is generated by using the following code:

def shuffle_values(dataframe):
 shuffeld_dataframe = dataframe.copy()
 for column_name in dataframe.columns:
   shuffeld_dataframe[column_name] = shuffeld_dataframe[column_name].sample(

When the drift score on the normal evaluation dataset is computed, less than 1% of the data points are detected as outliers. In contrast, when the drift scores are computed on the mutated dataset, around 62% of the data points are detected as outliers.

Computing drift magnitude on serving data

You can use the trained elliptic envelope model to identify a drift score in the request-response serving data logged in BigQuery. The code in the Jupyter notebook provides an Apache Beam pipeline implementation, which can run at scale on Dataflow, to perform the following steps:

  1. Read the raw request data from the table represented by the args.bq_table_fullname parameter, which is filtered by the following parameters:

    • args.model_name
    • args.model_version
    • args.start_time
    • args.end_time
  2. Group request records by using the beam.BatchElements method to improve vectorization. The request records are retrieved in JSON format.

  3. Parse the retrieved records batches into structured examples by using the parse_batch_data method.

  4. Preprocess the data (using one-hot encode categorical features) by using the prepare_data method.

  5. Score each record in the data by using the args.drift_model instance. This step returns the number of the records of the current batch that are considered outliers, and the total number of records in this batch.

  6. Aggregate the results by using the beam.CombineGlobally method.

  7. Produce the final result object, which includes the values of the outlier_count, record_count, and drift_score variables.

  8. Write the results as a JSON file to the location that's specified in the args.output_file_path variable.

The following code snippet shows the Apache Beam pipeline implementation for the steps to compute the drift score for the request-response data in BigQuery.

def run_pipeline(args):

 options = beam.options.pipeline_options.PipelineOptions(**args)
 args = namedtuple("options", args.keys())(*args.values())
 query = get_query(
     args.bq_table_fullname, args.model_name, args.model_version,
     args.start_time, args.end_time)

 print("Starting the Beam pipeline...")

 with beam.Pipeline(options=options) as pipeline:
       | 'ReadBigQueryData' >>
 , use_standard_sql=True))
       | 'BatchRecords' >> beam.BatchElements(
           min_batch_size=100, max_batch_size=1000)
       | 'InstancesToBeamExamples' >> beam.Map(parse_batch_data)
       | 'PrepareData' >> beam.Map(prepare_data)
       | 'ScoreData' >> beam.Map(
           lambda data: score_data(data, args.drift_model, stdv_units=1))
       | 'CombineResults' >> beam.CombineGlobally(aggregate_scores)
       | 'ComputeRatio' >> beam.Map(
           lambda result: {
               "outlier_count": result['outlier_count'],
               "records_count": result['records_count'],
               "drift_score": result['outlier_count'] / result['records_count']
        | 'WriteOutput' >>
            file_path_prefix=args.output_file_path, num_shards=1, shard_name_template='')

When you run this pipeline with the skewed data generated by the notebook that's associated with this document, you get results similar to the following:

outlier_count : 103
records_count : 1000
drift_score : 10.3%

For more information about how to populate the request-response logs table using skewed data points, see the Simulating serving data section in the Analyzing AI Platform Prediction logs in BigQuery guide.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project

  1. In the Cloud Console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next