Analyzing AI Platform Prediction logs in BigQuery

This document is the second in a series that shows you how to monitor machine learning (ML) models that are deployed to AI Platform Prediction to help you detect data skew. This guide shows you how to parse raw data in the AI Platform Prediction request-response log into an analytical data model. It then shows how to use Data Studio to analyze the logged requests for data skews and drifts.

The series is for data scientists and MLOps engineers who want to maintain the performance of their ML models in production by monitoring how the serving data changes over time. It assumes that you have some experience with Google Cloud, with BigQuery, and with Jupyter notebooks.

The series consists of the following guides:

The tasks discussed in this document are incorporated into Jupyter notebooks. The notebooks are in a GitHub repository.

Overview

As discussed in part one of this series, the code in the Jupyter notebook trains a Keras classification model for the Covertype dataset to predict forest cover type from cartographic variables. The exported SavedModel is then deployed to AI Platform Prediction for online serving. The notebook also enables request-response logging to log a sample of online prediction requests (instances) and responses (predicted label probabilities) to a BigQuery table.

The overall architecture is shown in the following diagram:

Architecture for the flow that's created in this tutorial series.

In this architecture, AI Platform Prediction request-response logging logs a sample of online requests into a BigQuery table. After the raw instances and prediction data have been stored in BigQuery, you can parse this data, compute descriptive statistics, and visualize data skew and data drift.

The following table summarizes the schema of the BigQuery table.

Field name Type Mode Description
model STRING REQUIRED The name of a model
model_version STRING REQUIRED The name of a model version
time TIMESTAMP REQUIRED The date and time when a request was captured
raw_data STRING REQUIRED The request body in the AI Platform Prediction JSON representation
raw_prediction STRING NULLABLE The response body (predictions) in the AI Platform Prediction JSON representation
groundtruth STRING NULLABLE The ground truth if available

The following table shows a sample of the data stored in the raw_data and raw_prediction columns of the BigQuery table.

Column Sample data
raw_data
{
  "signature_name":"serving_default",
  "instances":[
    {
      "Elevation":[3158],
      "Aspect":[78],
      "Slope":[25],
      "Horizontal_Distance_To_Hydrology":[150],
      "Vertical_Distance_To_Hydrology":[53],
      "Horizontal_Distance_To_Roadways":[1080],
      "Hillshade_9am":[243],
      "Hillshade_Noon":[185],
      "Hillshade_3pm":[57],
      "Horizontal_Distance_To_Fire_Points":[2234],
      "Wilderness_Area":["Rawah"],
      "Soil_Type":["7745"],
    }
  ]
}
raw_prediction
{
 "predictions": [
   {
     "probabilities": [
       0.03593460097908974,
       0.9640452861785889,
       1.2815438710234162e-9,
       1.5712469103590365e-9,
       0.000018715836631599814,
       4.030006106603423e-9,
       0.0000013792159734293818
     ],
     "confidence":  0.9640452861785889,
     "predicted_label": "1"

   }
 ]
}

In this document, you parse the raw_data field into an analytical model so you can analyze the content of each feature separately and identify any data skews.

Objectives

  • Create metadata for the dataset.
  • Generate a SQL CREATE VIEW statement that's specific to the dataset.
  • Use the view to query log data in BigQuery.
  • Create visualizations of the log data.

Costs

This tutorial uses the following billable components of Google Cloud:

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

Before you begin

Before you begin, you must complete part one of this series.

After you complete part one, you have the following:

  • An Notebooks instance that uses TensorFlow 2.3.
  • A clone of the GitHub repository that has the Jupyter notebook that you need for this guide.

The Jupyter notebook for this scenario

The tasks for parsing and analyzing the data are incorporated into a Jupyter notebook that's in a GitHub repository. To perform the tasks, you get the notebook and then execute the code cells in the notebook in order.

In this document, you use the Jupyter notebook to perform the following tasks:

  • Create a BigQuery SQL view to parse the raw request and response data points. You generate the view by running code that assembles information such as metadata for the dataset and names that you provide.
  • Simulate the process of serving data over several days with artificial skews.
  • Use Data Studio to visualize the parsed serving data as logged in BigQuery.

Configuring notebook settings

In this section of the notebook, you prepare the Python environment to run the code for the scenario. The code in the notebook creates the view based on the feature spec of your dataset. In order to generate the CREATE OR REPLACE VIEW SQL script, you need to set a number of variables.

  1. If you don't already have the AI Platforms Notebooks instance from part one open in the Cloud Console, do the following:

    1. Go to the Notebooks page.

      Go to the Notebooks page

    2. In the Notebooks list, select the notebook, and then click Open Jupyterlab. The JupyterLab environment opens in your browser.

    3. In the file browser, open mlops-on-gcp and then navigate to the skew-detection1 directory.

  2. Open the 02-covertype-logs-parsing-analysis.ipynb notebook.

  3. In the notebook, under Setup, run the Install packages and dependencies cell to install the required Python packages and configure the environment variables.

  4. Under Configure Google Cloud environment settings, set the following variables:

    • PROJECT_ID: The ID of the Google Cloud project where the BigQuery dataset for the request-response data is logged.
    • BQ_DATASET_NAME: The name of the BigQuery dataset to store the request-response logs in.
    • BQ_TABLE_NAME: The name of the BigQuery table to store the request-response logs in.
    • MODEL_NAME: The name of the model deployed to AI Platform Prediction.
    • VERSION_NAME: The version name of the model deployed to AI Platform Prediction. The version is in the format vN; for example, v1.
  5. Run the remaining cells under Setup to finish configuring the environment:

    1. Authenticate your GCP account
    2. Import libraries

Defining metadata for the dataset

You run section 1 of the notebook, Define dataset metadata, to set variables that are used later in code that generates a SQL script. For example, the code in this section creates two variables named NUMERIC_FEATURE_NAMES and CATEGORICAL_FEATURES_WITH_VOCABULARY, as shown in the following snippet from the code:

NUMERIC_FEATURE_NAMES = ['Aspect', 'Elevation', 'Hillshade_3pm',
                         'Hillshade_9am', 'Hillshade_Noon',
                         'Horizontal_Distance_To_Fire_Points',
                         'Horizontal_Distance_To_Hydrology',
                         'Horizontal_Distance_To_Roadways','Slope',
                         'Vertical_Distance_To_Hydrology']

CATEGORICAL_FEATURES_WITH_VOCABULARY = {
    'Soil_Type': ['2702', '2703', '2704', '2705', '2706', '2717', '3501', '3502',
                  '4201', '4703', '4704', '4744', '4758', '5101', '6101', '6102',
                  '6731', '7101', '7102', '7103', '7201', '7202', '7700', '7701',
                  '7702', '7709', '7710', '7745', '7746', '7755', '7756', '7757',
                  '7790', '8703', '8707', '8708', '8771', '8772', '8776'],
    'Wilderness_Area': ['Cache', 'Commanche', 'Neota', 'Rawah']
}

The code then creates a variable named FEATURE_NAMES to combine these values, as in the following line:

FEATURE_NAMES = list(CATEGORICAL_FEATURES_WITH_VOCABULARY.keys()) + NUMERIC_FEATURE_NAMES

Generating the CREATE VIEW SQL script

You run the tasks in section 2 of the notebook to generate the CREATE VIEW statement that you run later in order to parse the logs.

The first task runs code to create values out of the data metadata for the json_features_extraction and json_prediction_extraction variables. These variables contain the features and prediction values in a format that can be inserted into a SQL statement.

This code relies on the variables that you set earlier when you configured notebook settings and when you defined the metadata for the dataset. The following snippet shows this code.

LABEL_KEY = 'predicted_label'
SCORE_KEY = 'confidence'
SIGNATURE_NAME = 'serving_default'

def _extract_json(column, feature_name):
  return "JSON_EXTRACT({}, '$.{}')".format(column, feature_name)

def _replace_brackets(field):
  return "REPLACE(REPLACE({}, ']', ''), '[','')".format(field)

def _replace_quotes(field):
  return 'REPLACE({}, "\\"","")'.format(field)

def _cast_to_numeric(field):
  return "CAST({} AS NUMERIC)".format(field)

def _add_alias(field, feature_name):
  return "{} AS {}".format(field, feature_name)

view_name = "vw_"+BQ_TABLE_NAME+"_"+VERSION_NAME

colum_names = FEATURE_NAMES
input_features = ', \r\n  '.join(colum_names)

json_features_extraction = []
for feature_name in colum_names:
  field = _extract_json('instance', feature_name)
  field = _replace_brackets(field)
  if feature_name in NUMERIC_FEATURE_NAMES:
    field = _cast_to_numeric(field)
  else:
    field = _replace_quotes(field)
  field = _add_alias(field, feature_name)
  json_features_extraction.append(field)

json_features_extraction = ', \r\n    '.join(json_features_extraction)

json_prediction_extraction = []
for feature_name in [LABEL_KEY, SCORE_KEY]:
  field = _extract_json('prediction', feature_name)
  field = _replace_brackets(field)
  if feature_name == SCORE_KEY:
    field = _cast_to_numeric(field)
  else:
    field = _replace_quotes(field)
  field = _add_alias(field, feature_name)
  json_prediction_extraction.append(field)

json_prediction_extraction = ', \r\n    '.join(json_prediction_extraction)

The second task sets a variable named sql_script to a long string that contains a CREATE OR REPLACE VIEW statement. The statement contains several placeholders, which are marked in the string by using @ as a prefix. For example, there are placeholders for the names of the dataset and of the view:

CREATE OR REPLACE VIEW @dataset_name.@view_name

There are also placeholders for the names of the project, table, model, and version:

FROM
  `@project.@dataset_name.@table_name`
  WHERE
    model = '@model_name' AND
    model_version = '@version'
)

The end of the statement includes placeholders that use the json_features_extraction and json_prediction_extraction variables that you created by running the code in the previous task:

step3 AS
(SELECT
    model,
    model_version,
    time,
    @json_features_extraction,
    @json_prediction_extraction
FROM step2
)

Finally, you run the next cell to replace the placeholders in the SQL statement with values that you set earlier, as shown in the following snippet:

sql_script = sql_script.replace("@project", PROJECT_ID)
sql_script = sql_script.replace("@dataset_name", BQ_DATASET_NAME)
sql_script = sql_script.replace("@table_name", BQ_TABLE_NAME)
sql_script = sql_script.replace("@view_name", view_name)
sql_script = sql_script.replace("@model_name", MODEL_NAME)
sql_script = sql_script.replace("@version", VERSION_NAME)
sql_script = sql_script.replace("@input_features", input_features)
sql_script = sql_script.replace("@json_features_extraction", json_features_extraction)
sql_script = sql_script.replace("@json_prediction_extraction", json_prediction_extraction)

This step completes the generation of the SQL statement that creates the view to parse the raw request and response data points.

If you want to see the script that you've generated, run the cell that prints the view. The cell contains the following code:

print(sql_script)

Executing the CREATE VIEW SQL script

To execute the CREATE VIEW statement, you run the code in section 3 of the notebook. When you're done, the code displays the message View created or replaced. When you see this message, the view for parsing the data is ready.

The following snippet shows the resulting statement.

CREATE OR REPLACE VIEW prediction_logs.vw_covertype_classifier_logs_v1
AS

WITH step1 AS
(
  SELECT
    model,
    model_version,
    time,
    SPLIT(JSON_EXTRACT(raw_data, '$.instances'), '}],[{') instance_list,
    SPLIT(JSON_EXTRACT(raw_prediction, '$.predictions'), '}],[{') as prediction_list
  FROM
  `sa-data-validation.prediction_logs.covertype_classifier_logs`
  WHERE
    model = 'covertype_classifier' AND
    model_version = 'v1'
),

step2 AS
(
  SELECT
    model,
    model_version,
    time,
    REPLACE(REPLACE(instance, '[{', '{'),'}]', '}') AS instance,
    REPLACE(REPLACE(prediction, '[{', '{'),'}]', '}') AS prediction,
  FROM step1
  JOIN UNNEST(step1.instance_list) AS instance
  WITH OFFSET AS f1
  JOIN UNNEST(step1.prediction_list) AS prediction
  WITH OFFSET AS f2
  ON f1=f2
),

step3 AS
(
  SELECT
    model,
    model_version,
    time,
    REPLACE(REPLACE(REPLACE(JSON_EXTRACT(instance, '$.Soil_Type'), ']', ''), '[',''), "\"","") AS Soil_Type,
    REPLACE(REPLACE(REPLACE(JSON_EXTRACT(instance, '$.Wilderness_Area'), ']', ''), '[',''), "\"","") AS Wilderness_Area,
    CAST(REPLACE(REPLACE(JSON_EXTRACT(instance, '$.Aspect'), ']', ''), '[','') AS NUMERIC) AS Aspect,
    CAST(REPLACE(REPLACE(JSON_EXTRACT(instance, '$.Elevation'), ']', ''), '[','') AS NUMERIC) AS Elevation,
    CAST(REPLACE(REPLACE(JSON_EXTRACT(instance, '$.Hillshade_3pm'), ']', ''), '[','') AS NUMERIC) AS Hillshade_3pm,
    CAST(REPLACE(REPLACE(JSON_EXTRACT(instance, '$.Hillshade_9am'), ']', ''), '[','') AS NUMERIC) AS Hillshade_9am,
    CAST(REPLACE(REPLACE(JSON_EXTRACT(instance, '$.Hillshade_Noon'), ']', ''), '[','') AS NUMERIC) AS Hillshade_Noon,
    CAST(REPLACE(REPLACE(JSON_EXTRACT(instance, '$.Horizontal_Distance_To_Fire_Points'), ']', ''), '[','') AS NUMERIC) AS Horizontal_Distance_To_Fire_Points,
    CAST(REPLACE(REPLACE(JSON_EXTRACT(instance, '$.Horizontal_Distance_To_Hydrology'), ']', ''), '[','') AS NUMERIC) AS Horizontal_Distance_To_Hydrology,
    CAST(REPLACE(REPLACE(JSON_EXTRACT(instance, '$.Horizontal_Distance_To_Roadways'), ']', ''), '[','') AS NUMERIC) AS Horizontal_Distance_To_Roadways,
    CAST(REPLACE(REPLACE(JSON_EXTRACT(instance, '$.Slope'), ']', ''), '[','') AS NUMERIC) AS Slope,
    CAST(REPLACE(REPLACE(JSON_EXTRACT(instance, '$.Vertical_Distance_To_Hydrology'), ']', ''), '[','') AS NUMERIC) AS Vertical_Distance_To_Hydrology,
    REPLACE(REPLACE(REPLACE(JSON_EXTRACT(prediction, '$.predicted_label'), ']', ''), '[',''), "\"","") AS predicted_label,
    CAST(REPLACE(REPLACE(JSON_EXTRACT(prediction, '$.confidence'), ']', ''), '[','') AS NUMERIC) AS confidence
  FROM step2
)

SELECT*
FROM step3

Querying the view

After the view is created, you can query it. To query the view, run the code in section 4, Query the view. The code uses the pandas.io.gbq.read_gbq method in the notebook, as shown in the following code snippet:

query = '''
 SELECT * FROM
 `{}.{}`
 LIMIT {}
'''.format(BQ_DATASET_NAME, view_name, 3)

pd.io.gbq.read_gbq(
   query, project_id=PROJECT_ID).T

The code produces output that's similar to the following:

Output produced by querying the view.

The view query result shows the following:

  • Each feature has its own entry.
  • The quotes are stripped out of the categorical features.
  • The predicted class label is presented in the predicted_label entry.
  • The probability of the predicted class label is presented in the confidence entry.

Using the BigQuery console

As an alternative to querying the view by using the pandas API, you can query the view in the BigQuery console.

  1. Open the BigQuery console.

    Go to the BigQuery console

  2. In the Query editor pane, enter a query like the following:

    Select*
    FROM PROJECT_ID.prediction_logs.vw_covertype_classifier_logs_v1
    Limit 10
    

    Replace PROJECT_ID with the ID of the Cloud project that you set earlier.

    The output is similar to the following:

    Query editor window with SQL statement and output.

(Optional) Simulating serving data

If you're working with your own model and data, skip this section and go to the next section, which describes how to fill the request-response logs table with sample data.

You might use sample data to generate skewed data points and then to simulate prediction requests to the model version that's deployed to AI Platform Prediction. The model produces predictions to the request instances. Both instances and predictions are stored in BigQuery.

You can generate sample (normal and skewed) data points for prediction requests, and then call the covertype classification model that's deployed to AI Platform Prediction by using the generated data points. The repository that you cloned contains a notebook that includes code for this task, or you can load a CSV file that contains log data with skews.

To generate the sample data from the notebook, do the following:

  1. In the notebook, go to the file browser, open mlops-on-gcp, and then navigate to the skew-detection/workload_simulator directory.
  2. Open the covertype-data-generation.ipynb notebook.
  3. Under Setup, set values for your project ID, bucket name, and region.
  4. Run all the cells in the notebook in order.

You can change the size of the data to generate as well as how the data is skewed. The default skews introduced in the data are as follows:

  • Numerical feature skew. For the Elevation feature, the code converts the unit of measure from meters to kilometers in 10% of the data.
  • Numerical feature distribution skew. For the Aspect feature, the code decreases the value by 25%.
  • Categorical feature skew. For the Wilderness_Area feature, the code converts the value of a random 1% of the data to a new category named Others.
  • Categorical feature distribution skew. For the Wilderness_Area feature, the code increases the frequency of the Neota and Cache values. The code does this by converting a randomly selected 25% of the data points from their original values to the values Neota and Cache.

Alternatively, you can load the workload_simulator/bq_prediction_logs.csv data file into your BigQuery request-response logs table. The CSV file includes sample request-response logs, with 2000 normal data points and 1000 skewed data points. For more information, see loading data to BigQuery from a local data.

Visualizing logged serving data

You can use a visualization tool to connect to the BigQuery view and visualize the logged serving data. In the examples that follow, the visualizations were created by using Data Studio.

The following screenshot shows a sample dashboard that was created to visualize the prediction request-response logs from this guide.

Visualization of request-response logs.

The dashboard shows the following information:

  • The number of instances received by the prediction service is the same (500) on each day from June 1 to June 6.
  • For the class distributions, the frequency of the predicted class label 3 has increased in the last two days (June 5 and June 6).
  • For the Wilderness Area values distribution, the Neota and Cache values have increased in the last two days.
  • For descriptive statistics of the Elevation feature, the minimum values of the last two days have significantly lower values than for the preceding four days. The standard deviation values have significantly higher values than for the preceding four days.

In addition, as shown in the following screenshot, the values distribution of the Aspect feature for the last two days has a significant drop in the frequency of the values between 300 and 350.

Aspect distrbution line graph.

Clean up

If you plan to continue with the rest of this series, keep the resources that you've already created. Otherwise, delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project

  1. In the Cloud Console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next