This guide is for data scientists and ML engineers who want to maintain the performance of their ML models in production by determining whether the serving data drifts over time and by computing the magnitude of this drift (if any).
The series consists of the following documents:
- Logging serving requests by using AI Platform Prediction
- Analyzing logs in BigQuery
- Analyzing training-serving skew with TensorFlow Data Validation
- Automating training-serving skew detection
- Identifying training-serving skew with novelty detection (this guide)
The code for the process described in this document is incorporated into Jupyter notebooks. The notebooks are in a GitHub repository.
Overview
Monitoring the predictive performance of a machine learning (ML) model in production is a crucial area of MLOps. The predictive performance of a deployed model might decay over time. This decay can occur for several reasons, including a growing discrepancy between training and serving data, an evolving business context, or a changing technical environment.
The following diagram shows the process that's described in this guide:
Feature-level drift versus dataset-level drift
Previous guides in this series described how to use TensorFlow Data Validation (TFDV) to detect schema and distribution skews and anomalies in a dataset. TFDV uses a reference schema and baseline statistics that are generated from the training data to detect anomalies in new data.
You can use TFDV to identify skews and anomalies such as the following for each feature:
- The data type of a feature isn't as expected.
- The values of a numerical feature are outside the expected range.
- The percentage of missing values in a specific feature exceeds a predefined threshold.
- New or missing categories are detected in a categorical feature.
- The distribution of a feature's values isn't as expected.
In practice, most of the training-serving data skews and anomalies are due to issues in individual features. For example, errors or updates in the data ingestion and processing pipelines might introduce missing values, new values, out-of-range values, or values that have incorrect data types. You might also see inconsistencies in data schema types in any of the following circumstances:
- Data sources change.
- You integrate new data sources.
- You operate your model in a new domain (for example, a new geographic location).
TFDV can detect feature-level skews and anomalies such as these so that you can act on them. For example, you might update the preprocessing logic, add new vocabulary to categorical features, or discard erroneous data and retrain the model if needed.
However, in some cases, individual features in the dataset might not show any significant skew relative to the reference schema and the baseline statistics. Even so, the joint probability distribution of these features can change. This change represents a type of covariate shift. This type of shift usually happens as new trends and patterns emerge in the data due to changes in the dynamics of the environment. Examples include changes in real estate prices relative to location, or in the popularity of fashion items among different demographic groups. Covariate shift is also subject to seasonality, because the statistical relationships between input features might change periodically in response to different seasons and events.
Therefore, it can be useful to compute an overall drift score at the dataset level with respect to all the features. This drift score can represent the skew magnitude of your serving data. Computing an overall drift score simplifies the process of monitoring dataset drift over time, and it helps you to define and take actions based on certain drift score thresholds.
Methods for scoring a dataset drift
You can use the following methods to compute a drift magnitude at the dataset level rather than at the feature level:
- Novelty and outlier detection
- Two-sample statistical tests
Novelty and outlier detection
In novelty and outlier detection algorithms, the aim is to detect out-of-distribution samples. You build a model by using the reference (training) data, and then use that model to identify whether a new (serving) data point is an outlier. The drift score is computed as the percentage of data points in the serving dataset that are predicted to be outliers. The advantage of this technique is that it not only produces a drift score between 0 and 100 percent, but also identifies the data points that are considered anomalous.
Novelty and outlier detection algorithms include the elliptic envelope (used in this guide), the one-class support vector machine (SVM), the isolation forest, and the local outlier factor (LOF). Algorithms that are used for detecting outliers in time-series data include spectral residuals and sequence-to-sequence autoencoders.
Two-sample statistical tests
In two-sample statistical testing, you perform hypothesis tests to examine whether the training distribution and the serving distribution are equivalent. The output of these tests is a p-value (between 0 and 1) that indicates how consistent the observed difference is with the hypothesis that the two datasets come from the same distribution. Because the dataset includes several features, simple univariate statistical tests aren't sufficient. Therefore, you have the following options:
- Multivariate kernel tests. The maximum mean discrepancy (MMD) is a popular kernel-based statistical distance that's used in multivariate two-sample testing. Other statistical distances include KL-divergence, the Wasserstein metric, and energy distance. Multivariate testing does not scale well with the size of the dataset, and its statistical power might decay as the dimensionality of the dataset increases.
- Multiple univariate tests. You can test the Kolmogorov-Smirnov (KS) statistic or other statistical distance on each feature separately and then combine the p-values from each test. You apply a correction procedure to mitigate the issue of multiple hypothesis testing.
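The following is a minimal sketch of the multiple-univariate-tests option, assuming that the training and serving splits are available as pandas DataFrames with matching numerical columns. The names train_data, serving_data, and the helper function itself are illustrative and aren't part of the notebook:

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def dataset_drift_test(train_df: pd.DataFrame,
                       serving_df: pd.DataFrame,
                       alpha: float = 0.05):
    """Runs a two-sample KS test per feature and applies a Bonferroni correction."""
    p_values = np.array([
        ks_2samp(train_df[column], serving_df[column]).pvalue
        for column in train_df.columns
    ])
    # Bonferroni correction to mitigate multiple hypothesis testing.
    corrected_p_values = np.minimum(p_values * len(p_values), 1.0)
    # Flag drift if any corrected p-value falls below the significance level.
    drift_detected = bool((corrected_p_values < alpha).any())
    return corrected_p_values, drift_detected
```

Libraries such as Alibi Detect provide production-ready implementations of both the univariate and the kernel-based approaches.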
For more information about drift detection techniques, see the paper Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift. In addition, see Alibi Detect, an open source Python library for drift detection that covers tabular data, images, and time series.
Objectives
- Download training and serving data splits.
- Train an elliptic envelope model by using the training data.
- Test the model on normal and mutated datasets.
- Implement an Apache Beam pipeline to compute a drift score for the request-response data in BigQuery.
- Run the pipeline and display the drift detection output.
Costs
This tutorial uses the following billable components of Google Cloud:
To generate a cost estimate based on your projected usage, use the pricing calculator.
Before you begin
Before you begin, you must complete part one and part two of this series.
After you complete these parts, you have the following:
- A Vertex AI Workbench user-managed notebooks instance that uses TensorFlow 2.3.
- A clone of the GitHub repository that has the Jupyter notebook that you need for this guide.
- A BigQuery table that contains request-response logs and a view that parses the raw request and response data points.
The Jupyter notebook for this scenario
A complete process for this workflow is coded into a Jupyter notebook in the GitHub repository that's associated with this document. The steps in the notebook are based on the CoverType dataset from the UCI Machine Learning Repository. This is the same dataset that's used as sample data in the previous documents in this series.
Configuring notebook settings
In this section, you prepare the Python environment to run the code for the scenario.
If you don't already have the user-managed notebooks instance from part one open, do the following:
In the Google Cloud console, go to the Notebooks page.
On the User-managed notebooks tab, select the notebook, and then click Open JupyterLab. The JupyterLab environment opens in your browser.
In the file browser, open the mlops-on-gcp folder, and then navigate to the skew-detection directory.
Open the 04-covertype-drift-detection_novelty_model.ipynb notebook.
In the notebook, under Setup, run the Install packages and dependencies cell to install the required Python packages and configure the environment variables.
Under Configure Google Cloud environment settings, set the following variables:
- PROJECT_ID: The ID of the Google Cloud project where the BigQuery dataset for the request-response data is logged.
- BUCKET: The name of the Cloud Storage bucket where produced artifacts are stored.
- BQ_DATASET_NAME: The name of the BigQuery dataset where the request-response logs are stored.
- BQ_TABLE_NAME: The name of the BigQuery table where the request-response logs are stored.
- MODEL_NAME: The name of the model that's deployed to AI Platform Prediction.
- MODEL_VERSION: The version name of the model that's deployed to AI Platform Prediction. The version is in the format vN; for example, v1.
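In the notebook, these settings are set as Python variables similar to the following sketch; the values shown here are placeholders that you replace with your own:

```python
PROJECT_ID = 'your-project-id'         # Google Cloud project that holds the logs dataset
BUCKET = 'your-bucket-name'            # Cloud Storage bucket for produced artifacts
BQ_DATASET_NAME = 'your_logs_dataset'  # BigQuery dataset with the request-response logs
BQ_TABLE_NAME = 'your_logs_table'      # BigQuery table with the request-response logs
MODEL_NAME = 'your_model_name'         # Model deployed to AI Platform Prediction
MODEL_VERSION = 'v1'                   # Deployed model version, in the format vN
```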
Run the remaining cells under Setup to finish configuring the environment:
- Authenticate your GCP account
- Create a local workspace
Run the cells in the Download data splits section to make data available to the notebook.
Training a novelty detection model
This guide shows how to use an elliptic envelope as a novelty detection model to compute an overall drift score for the serving request-response data that's logged in BigQuery. In this section, you learn how to do the following:
- Train the elliptic envelope by using the training data.
- Test the envelope on both normal and mutated datasets.
- Compare their drift scores.
The elliptic envelope algorithm is a parametric algorithm that assumes that the training data comes from a known distribution (for example, a Gaussian distribution). The algorithm tries to define the "shape" of the data by using a robust covariance estimation. Using this estimation, the model can identify outlying data points that lie beyond a specified distance from the fitted shape. Mahalanobis distances can be measured from the covariance estimate; these distances in turn can be used as a measure of outlyingness.
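For reference, the Mahalanobis distance that underlies this measure is defined as follows, where $\mu$ and $\Sigma$ are the location and covariance estimates that the elliptic envelope derives from the training data:

$$D_M(x) = \sqrt{(x - \mu)^{\top} \Sigma^{-1} (x - \mu)}$$

Larger values of $D_M(x)$ indicate that the point $x$ lies farther from the bulk of the training distribution.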
Data preprocessing
The data requires only one kind of preprocessing in order to train the elliptic
envelope model: the categorical features must be
one-hot encoded
to produce a numerical representation of these features. First, the code in the
notebook fits a set of OneHotEncoder
objects, one for each categorical
feature, by using the categories in the training data. Second, the code uses the
fitted encoders to transform the categorical features in the input dataset. To
perform the transformations, the code uses the prepare_data
method, as shown
in the following code snippet:
from sklearn.preprocessing import OneHotEncoder

encoders = dict()
for feature_name in CATEGORICAL_FEATURE_NAMES:
    encoder = OneHotEncoder(handle_unknown='ignore')
    encoder.fit(train_data[[feature_name]])
    encoders[feature_name] = encoder

def prepare_data(data_frame):
    if type(data_frame) != pd.DataFrame:
        data_frame = pd.DataFrame(data_frame)
    data_frame = data_frame.reset_index()
    for feature_name, encoder in encoders.items():
        encoded_feature = pd.DataFrame(
            encoder.transform(data_frame[[feature_name]]).toarray())
        data_frame = data_frame.drop(feature_name, axis=1)
        encoded_feature.columns = [feature_name + "_" + str(column)
                                   for column in encoded_feature.columns]
        data_frame = data_frame.join(encoded_feature)
    return data_frame
When you work with high-dimensional datasets (for example, a dataset with many features, with many images, or with text data that has a large vocabulary set), you often reduce dimensionality before you train the model. Typical techniques include principal component analysis (PCA), autoencoders, and random projection.
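For example, the following sketch shows how PCA could be applied to the one-hot encoded training data before the model is fit. This step isn't part of the notebook, and the variable names are illustrative:

```python
from sklearn.decomposition import PCA

# Reduce the one-hot encoded training data, keeping enough principal
# components to explain about 95% of the variance.
pca = PCA(n_components=0.95)
reduced_training_data = pca.fit_transform(prepare_data(train_data))

# The novelty detection model would then be trained on reduced_training_data,
# and serving data would be passed through the same pca.transform call
# before scoring.
```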
Model training
To train the elliptic envelope model, the code does the following:
- Prepares the training data by using the prepare_data method.
- Instantiates an EllipticEnvelope object and sets the contamination value to 0, assuming that there are no outliers in the training data.
- Trains the model using the prepared_training_data object by calling the model.fit method.
The following code snippet illustrates these steps:
from sklearn.covariance import EllipticEnvelope

prepared_training_data = prepare_data(train_data)

model = EllipticEnvelope(contamination=0.)
model.fit(prepared_training_data)
Model testing
After you train the elliptic envelope model, the code can compute the Mahalanobis distance between a given data point and the model that represents the joint distribution of the training data. To compute the distance, the code calls the model.mahalanobis method and passes the given data point. However, to identify the distance at which a data point is considered an outlier, the code needs to perform the following steps:
- Compute the mean and the standard deviation of the Mahalanobis distances for the training data points with respect to the model. These values are stored in the model as the model._mean and model._stdv values.
- Treat any new data point that's at a Mahalanobis distance greater than mean + N standard deviation units as an outlier.
- Compute the drift score as the ratio of the number of outlier points to the size of the input (serving) dataset.
This logic is shown in the following code snippet:
def compute_drift_score(model, data_frame, stdv_units=2):
    distances = model.mahalanobis(data_frame)
    threshold = model._mean + (stdv_units * model._stdv)
    score = len([v for v in distances if v >= threshold]) / len(data_frame.index)
    return score
In order to test the model behavior for drift detection, the code uses the evaluation data split, which is assumed to have a distribution similar to the training data. The code creates a mutated version of this dataset by shuffling each column's values. In this case, the univariate distribution of each feature doesn't change. However, the multivariate (joint) distribution of the whole dataset is changed, because the feature value combination of each record is now random.
The mutated dataset is generated by using the following code:
def shuffle_values(dataframe):
    shuffled_dataframe = dataframe.copy()
    for column_name in dataframe.columns:
        shuffled_dataframe[column_name] = shuffled_dataframe[column_name].sample(
            frac=1.0).reset_index(drop=True)
    return shuffled_dataframe
When the drift score is computed on the normal evaluation dataset, less than 1% of the data points are detected as outliers. In contrast, when the drift score is computed on the mutated dataset, around 62% of the data points are detected as outliers.
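This comparison can be reproduced with a few lines along the following lines, assuming that eval_data holds the evaluation split and that prepare_data, shuffle_values, and compute_drift_score are defined as shown earlier; the exact variable names and parameter values in the notebook might differ:

```python
# Drift score on the unmodified evaluation split.
normal_score = compute_drift_score(model, prepare_data(eval_data))

# Drift score on the mutated (column-shuffled) version of the same split.
mutated_score = compute_drift_score(model, prepare_data(shuffle_values(eval_data)))

print(f"Normal evaluation data drift score: {normal_score:.2%}")
print(f"Mutated evaluation data drift score: {mutated_score:.2%}")
```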
Computing drift magnitude on serving data
You can use the trained elliptic envelope model to compute a drift score for the request-response serving data that's logged in BigQuery. The code in the Jupyter notebook provides an Apache Beam pipeline implementation, which can run at scale on Dataflow, to perform the following steps:
- Read the raw request data from the table represented by the args.bq_table_fullname parameter, which is filtered by the following parameters:
  - args.model_name
  - args.model_version
  - args.start_time
  - args.end_time
- Group the request records by using the beam.BatchElements method to improve vectorization. The request records are retrieved in JSON format.
- Parse the retrieved record batches into structured examples by using the parse_batch_data method.
- Preprocess the data (one-hot encode the categorical features) by using the prepare_data method.
- Score each record in the data by using the args.drift_model instance. This step returns the number of records in the current batch that are considered outliers, and the total number of records in the batch.
- Aggregate the results by using the beam.CombineGlobally method.
- Produce the final result object, which includes the values of the outlier_count, records_count, and drift_score variables.
- Write the results as a JSON file to the location that's specified in the args.output_file_path variable.
The following code snippet shows the Apache Beam pipeline that implements these steps to compute the drift score for the request-response data in BigQuery.
def run_pipeline(args):
    options = beam.options.pipeline_options.PipelineOptions(**args)
    args = namedtuple("options", args.keys())(*args.values())

    query = get_query(
        args.bq_table_fullname,
        args.model_name,
        args.model_version,
        args.start_time,
        args.end_time)

    print("Starting the Beam pipeline...")
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | 'ReadBigQueryData' >> beam.io.Read(
                beam.io.BigQuerySource(query=query, use_standard_sql=True))
            | 'BatchRecords' >> beam.BatchElements(
                min_batch_size=100, max_batch_size=1000)
            | 'InstancesToBeamExamples' >> beam.Map(parse_batch_data)
            | 'PrepareData' >> beam.Map(prepare_data)
            | 'ScoreData' >> beam.Map(
                lambda data: score_data(data, args.drift_model, stdv_units=1))
            | 'CombineResults' >> beam.CombineGlobally(aggregate_scores)
            | 'ComputeRatio' >> beam.Map(
                lambda result: {
                    "outlier_count": result['outlier_count'],
                    "records_count": result['records_count'],
                    "drift_score": result['outlier_count'] / result['records_count']
                })
            | 'WriteOutput' >> beam.io.WriteToText(
                file_path_prefix=args.output_file_path,
                num_shards=1,
                shard_name_template='')
        )
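The score_data and aggregate_scores helpers that the pipeline references aren't shown in this document. The following is a minimal sketch of what they might look like, based on the drift-score logic described earlier; the notebook's actual implementations can differ:

```python
def score_data(data_frame, drift_model, stdv_units=2):
    # Count the records in this batch whose Mahalanobis distance exceeds
    # the outlier threshold derived from the training data.
    distances = drift_model.mahalanobis(data_frame)
    threshold = drift_model._mean + (stdv_units * drift_model._stdv)
    outlier_count = sum(1 for distance in distances if distance >= threshold)
    return {'outlier_count': outlier_count, 'records_count': len(data_frame.index)}


def aggregate_scores(batch_results):
    # Combine the per-batch counts into dataset-level totals.
    return {
        'outlier_count': sum(result['outlier_count'] for result in batch_results),
        'records_count': sum(result['records_count'] for result in batch_results),
    }
```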
When you run this pipeline with the skewed data generated by the notebook that's associated with this document, you get results similar to the following:
outlier_count : 103
records_count : 1000
drift_score : 10.3%
For more information about how to populate the request-response logs table using skewed data points, see the Simulating serving data section in the Analyzing AI Platform Prediction logs in BigQuery guide.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the project
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- Read the Analyzing training-serving skew in AI Platform Prediction with TensorFlow Data Validation guide.
- Read the Automating training-serving skew detection in AI Platform Prediction guide.
- Explore reference architectures, diagrams, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.