Architecture and design of predicting customer lifetime value with Kubeflow Pipelines

This document is the first of two documents that discuss how to use Kubeflow Pipelines to orchestrate training, deployment, and inference of customer lifetime value (CLV) predictive models. This document outlines the architecture and design patterns used to implement CLV training and inference pipelines. An accompanying tutorial shows how to deploy, use, and customize the pipeline code samples in the accompanying GitHub repository.

For a general overview of how to use Kubeflow Pipelines to design, orchestrate, and automate an integrated machine learning (ML) system, and of how to use Cloud Build to configure a continuous integration/continuous deployment (CI/CD) setup, see Architecture for CI/CD and ML pipelines using Kubeflow and Cloud Build.

Overview

This document has two goals:

  • Demonstrate architectural and design patterns for Kubeflow Pipelines that orchestrate Google Cloud-managed ML and data analytics services.
  • Provide a starter kit for operationalizing training, deployment, and scoring of CLV predictive models.

The architecture and design discussed in this document represent a sales use case in which you need to frequently fine-tune and retrain a predictive model. Because a constant flow of new sales transactions forms the core of the training data, you must keep models up to date with evolving purchase patterns, which makes automating the training and deployment workflows critical.

This document builds on data preprocessing and modeling techniques covered in the Predicting customer lifetime value with AI Platform series. That series can help you to better understand CLV concepts and algorithmic approaches.

This document is intended for data scientists and machine learning engineers. The document and its accompanying tutorial assume that you have a basic understanding of the Google Cloud concepts and services that the solution uses, such as BigQuery, AutoML Tables, Cloud Storage, and Google Kubernetes Engine (GKE).

Kubeflow Pipelines is a key component of the Kubeflow project. The Kubeflow project is an open source, Kubernetes-native platform for developing, orchestrating, deploying, and running scalable and portable ML workloads. For more information on Kubeflow Pipelines, see the Kubeflow documentation.

Kubeflow Pipelines is usually installed as part of a Kubeflow installation. You can also install Kubeflow Pipelines as a standalone set of services with no dependencies on other Kubeflow components.

You can host Kubeflow Pipelines on various Kubernetes distributions. This document focuses on running Kubeflow Pipelines on GKE.

Kubeflow Pipelines as an orchestrator

This document demonstrates an architectural pattern where you use Kubeflow Pipelines as an orchestrator for Google Cloud-managed ML and data analytics services. In this pattern, Kubeflow Pipelines orchestrates the workflow but performs no other computation on the hosting Kubernetes cluster.

The following diagram shows the high-level architecture of the CLV solution that this document describes.

Overview of the key Kubeflow Pipelines and Google Cloud components for the CLV solution.

The Kubeflow Pipelines services are hosted on GKE. The pipelines discussed here and in the tutorial are created in Kubeflow Pipelines. These pipelines interface with Cloud Storage, BigQuery, and AutoML Tables through a set of Kubeflow Pipelines components that wrap the respective Cloud APIs. You use Container Registry to manage the container images for the components that the pipelines use.
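
The following sketch shows how a pipeline obtains such components with the Kubeflow Pipelines (KFP) SDK: generic components are loaded from their published specifications, and custom components are loaded from specification files in the solution's repository. The pinned component URL and the local file path are illustrative assumptions, not the exact artifacts that the repository uses.

```python
# Minimal sketch (KFP SDK v1): obtaining generic and custom components.
from kfp import components

# A generic, prebuilt component that submits a BigQuery query job.
# The pinned URL is a hypothetical example, not the exact version the solution uses.
bigquery_query_op = components.load_component_from_url(
    'https://raw.githubusercontent.com/kubeflow/pipelines/0.2.5/'
    'components/gcp/bigquery/query/component.yaml')

# A custom AutoML Tables component shipped with the solution (hypothetical path).
automl_train_op = components.load_component_from_file(
    'components/automl_tables/train_model/component.yaml')

# Each loader returns a factory function; calling the factory inside a
# @dsl.pipeline function adds a step that invokes the corresponding Cloud API.
```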

The accompanying GitHub solution includes two sample pipelines:

  • The training and deployment pipeline
  • The batch prediction pipeline

Both pipelines follow a data processing, training, and scoring flow that is similar to the one described in Predicting customer lifetime value with AutoML Tables. In that workflow, you do the following:

  • Use BigQuery for data cleaning and feature engineering.
  • Use AutoML Tables for model training, deployment, and scoring.

Training and deployment workflow

The training and deployment pipeline uses historical sales transaction data to train and optionally deploy an ML regression model. The model is trained to predict the value of future purchases for a customer based on that customer's purchase history. For more information about modeling for CLV prediction, see Predicting customer lifetime value with AI Platform: Introduction.

The following diagram shows the workflow that the pipeline runs.

Components in the training and deployment workflow.

In this workflow, you do the following (a condensed code sketch follows the list):

  1. Load historical sales transactions from Cloud Storage to a BigQuery staging table. If the data is already in BigQuery, this step is skipped.
  2. Prepare a BigQuery query. The query is generated from a query template and from runtime arguments that are passed to the pipeline.
  3. Run a BigQuery query to create features from the historical sales transactions. The engineered features are stored in a BigQuery table.
  4. Import features to an AutoML dataset.
  5. Trigger AutoML model training.
  6. After training completes, retrieve the model's evaluation metrics.
  7. Compare the model's performance against the performance threshold.
  8. If the model meets or exceeds the performance threshold, deploy the model for online predictions.
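
The following condensed sketch shows how this workflow might be expressed in the Kubeflow Pipelines Python DSL. The component factory names, their parameters, and the file paths are assumptions for illustration; the repository defines the actual components.

```python
# Condensed sketch of the training and deployment workflow (KFP SDK v1).
# All component paths, factory names, and parameters are assumptions.
import kfp.components as components
from kfp import dsl

_load = components.load_component_from_file  # hypothetical component paths below
load_transactions_op = _load('components/load_transactions/component.yaml')
prepare_query_op = _load('components/prepare_query/component.yaml')
bigquery_query_op = _load('components/bigquery_query/component.yaml')
import_dataset_op = _load('components/automl_import_dataset/component.yaml')
train_model_op = _load('components/automl_train_model/component.yaml')
log_metrics_op = _load('components/automl_log_metrics/component.yaml')
deploy_model_op = _load('components/automl_deploy_model/component.yaml')

@dsl.pipeline(name='clv-train-deploy',
              description='Train a CLV model and deploy it if it meets the threshold')
def clv_train_deploy(project_id: str,
                     source_gcs_path: str,
                     query_template_uri: str,
                     primary_metric: str = 'mean_absolute_error',
                     deployment_threshold: float = 900.0):
    # 1. Stage historical transactions in BigQuery.
    staging = load_transactions_op(project_id=project_id,
                                   source_gcs_path=source_gcs_path)
    # 2-3. Generate the feature-engineering query and run it.
    query = prepare_query_op(query_template_uri=query_template_uri,
                             source_table=staging.outputs['table_id'])
    features = bigquery_query_op(query=query.output, project_id=project_id)
    # 4-5. Import the engineered features and train an AutoML Tables model.
    dataset = import_dataset_op(project_id=project_id,
                                source_table=features.outputs['table_id'])
    model = train_model_op(project_id=project_id,
                           dataset_id=dataset.outputs['dataset_id'])
    # 6-7. Retrieve evaluation metrics and compare the primary metric to the threshold.
    evaluation = log_metrics_op(model_id=model.outputs['model_id'],
                                primary_metric=primary_metric,
                                deployment_threshold=deployment_threshold)
    # 8. Deploy the model for online predictions only if it passed the comparison.
    with dsl.Condition(evaluation.outputs['deploy'] == 'True'):
        deploy_model_op(model_id=model.outputs['model_id'])
```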

For more information about the design and use of the training pipeline, see the documentation in the accompanying GitHub repository for this solution.

Batch prediction workflow

The batch prediction pipeline uses the same preprocessing and feature-engineering steps as the training pipeline. For batch scoring, it uses the AutoML Tables batchPredict method.

The following diagram shows the workflow that the pipeline runs.

Pipeline workflow for the preprocessing and engineering steps.

  1. Load historical sales transactions from Cloud Storage to a BigQuery staging table. If the data is already in BigQuery, this step is skipped.
  2. Prepare a BigQuery query. The query is generated from a query template and from runtime arguments that are passed to the pipeline.
  3. Run a BigQuery query to create features from the historical sales transactions. The engineered features are stored in a BigQuery table.
  4. Invoke the AutoML Tables batchPredict method to score the data. The AutoML Tables batchPredict method stores the resulting predictions in either Cloud Storage or BigQuery.
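
The following condensed sketch shows how the batch prediction workflow might look in the Kubeflow Pipelines Python DSL, reusing the same feature-engineering components. The component factories, parameter names, and file paths are assumptions; the custom batch-predict component wraps the AutoML Tables batchPredict method.

```python
# Condensed sketch of the batch prediction workflow (KFP SDK v1).
# Component paths, factory names, and parameters are assumptions.
import kfp.components as components
from kfp import dsl

_load = components.load_component_from_file  # hypothetical component paths below
prepare_query_op = _load('components/prepare_query/component.yaml')
bigquery_query_op = _load('components/bigquery_query/component.yaml')
batch_predict_op = _load('components/automl_batch_predict/component.yaml')

@dsl.pipeline(name='clv-batch-predict',
              description='Score customers with a trained CLV model')
def clv_batch_predict(project_id: str,
                      query_template_uri: str,
                      model_id: str,
                      destination_prefix: str):
    # 1-3. Generate and run the same feature-engineering query used for training.
    query = prepare_query_op(query_template_uri=query_template_uri,
                             project_id=project_id)
    features = bigquery_query_op(query=query.output, project_id=project_id)
    # 4. Score the features. destination_prefix selects where batchPredict writes
    #    the results: a BigQuery dataset ('bq://...') or Cloud Storage ('gs://...').
    batch_predict_op(project_id=project_id,
                     model_id=model_id,
                     input_table=features.outputs['table_id'],
                     destination_prefix=destination_prefix)
```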

Kubeflow Pipelines design patterns

The pipelines discussed in this document follow a set of design patterns that make them straightforward to manage, deploy, and customize.

Using generic or pipeline-specific Kubeflow Pipelines components

In general, you can categorize Kubeflow Pipelines components into two types:

  • Generic components. A generic component has functionality that is not specific to a workflow implemented by a given pipeline. For example, a component that submits a BigQuery job is a generic component. Because the component doesn't contain any logic specific to a given pipeline, you can reuse it across multiple scenarios.
  • Pipeline-specific components. A pipeline-specific component is usually a helper component that has functionality specific to a given pipeline or a group of related pipelines—for example, a component that calculates and logs a custom performance metric that's specific to a given training workflow. You can't easily reuse a pipeline-specific component.

You can develop, build, deploy, and manage generic components independently from the pipelines that use them.

Typically, developing, building, and deploying a pipeline-specific component is tightly coupled to the consuming pipeline's lifecycle. Using Lightweight Python components is a convenient and common way to implement pipeline-specific components.
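
The following minimal sketch illustrates the lightweight Python component mechanism: a plain Python function is converted into a component that runs as its own pipeline step. The helper function and the base image are hypothetical; they only demonstrate the technique.

```python
# Minimal sketch of a pipeline-specific helper implemented as a lightweight
# Python component (KFP SDK v1). The helper and base image are hypothetical.
from kfp.components import func_to_container_op

def calculate_cutoff_date(threshold_date: str, predict_window_days: int) -> str:
    """Returns the end date of the prediction window used during feature engineering."""
    # Imports must live inside the function because the function body is
    # serialized and executed in its own container at run time.
    from datetime import datetime, timedelta
    cutoff = (datetime.strptime(threshold_date, '%Y-%m-%d')
              + timedelta(days=predict_window_days))
    return cutoff.strftime('%Y-%m-%d')

# Convert the function into a component factory that can be called in a pipeline.
calculate_cutoff_date_op = func_to_container_op(
    calculate_cutoff_date, base_image='python:3.7')
```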

The pipelines use both generic and pipeline-specific components.

To run data preprocessing and feature engineering tasks, the pipelines use the BigQuery component that is packaged with the Kubeflow Pipelines distribution. The component runs an arbitrary BigQuery query and stores the results in a BigQuery table or a Cloud Storage blob.
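
For orientation, the following simplified sketch shows roughly what such a query component does under the hood with the google-cloud-bigquery client library: it runs an arbitrary query and writes the result to a destination table. This is a stand-in for illustration, not the packaged component's actual implementation.

```python
# Simplified stand-in for what a generic BigQuery query component does: run an
# arbitrary query and persist the result to a destination table. This is not the
# packaged component's actual implementation.
from google.cloud import bigquery

def run_query_to_table(project_id: str, query: str, dataset_id: str, table_id: str):
    client = bigquery.Client(project=project_id)
    job_config = bigquery.QueryJobConfig()
    job_config.destination = bigquery.TableReference.from_string(
        f'{project_id}.{dataset_id}.{table_id}')
    # Overwrite the features table on each pipeline run.
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
    query_job = client.query(query, job_config=job_config)
    query_job.result()  # block until the query job finishes
    return query_job.destination
```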

For model training, deployment, and inference, the pipelines use a set of custom AutoML Tables components provided as part of the GitHub code samples. The AutoML Tables components wrap a subset of AutoML Tables APIs.

The pipelines also use the following helper components that are implemented as lightweight Python components:

  • The Load transactions component. The Load transactions component loads the historical sales transaction data from a set of CSV files in Cloud Storage to a staging table in BigQuery.
  • The Prepare query component. The Prepare query component (a good example of a pipeline-specific component) generates the final BigQuery SQL query from a query template. BigQuery doesn't support parameterization of identifiers such as column names or table names, so to avoid hard coding these parts of the preprocessing and feature-engineering query, Prepare query substitutes placeholders in the query template with the values that are passed to the component as runtime parameters (see the sketch after this list).
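
The following sketch illustrates the placeholder-substitution idea behind the Prepare query component. The placeholder tokens, parameter names, and sample query template are assumptions for illustration.

```python
# Illustrative sketch of the placeholder substitution performed by the Prepare
# query component. The placeholder tokens and the sample template are assumptions.
def render_query(query_template: str, source_table_id: str, features_table_id: str) -> str:
    # BigQuery can parameterize values but not identifiers, so table names are
    # substituted into the query text before the job is submitted.
    return (query_template
            .replace('{{ source_table }}', source_table_id)
            .replace('{{ features_table }}', features_table_id))

template = ('CREATE OR REPLACE TABLE `{{ features_table }}` AS '
            'SELECT * FROM `{{ source_table }}`')
print(render_query(template, 'my_project.clv.transactions', 'my_project.clv.features'))
```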

For more detailed information about the design and usage of the components used in the pipelines, see the documentation for this solution on GitHub.

Managing pipeline settings

A critical aspect of pipeline design is how to manage configuration settings. We don't recommend hard coding these settings directly in a pipeline's domain-specific language (DSL). Hard-coded settings can lead to problems when you deploy pipelines into different environments.

For example, the names and locations of the container images for the components that a pipeline uses might be different in development, staging, and production environments. Similarly, the URLs to the assets that the components use (such as the query template that the Prepare query component uses) might differ from environment to environment.

The accompanying GitHub solution demonstrates one approach to settings management. In the solution, a single YAML configuration file manages all pipeline settings. The settings file has two sections: argument_defaults and compiler_settings.

In the argument_defaults section, you define the default values for the pipelines' runtime arguments. In the compiler_settings section, you define the settings that control how the DSL compiler converts the Python DSL into the resulting YAML format. An example of a compiler setting is a flag that controls whether a pipeline should be compiled to use a pipeline-specific user service account or the default Compute Engine service account when the pipeline accesses Google Cloud-managed services.

During compilation, the settings file is merged with the pipelines' DSL code.
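
The following minimal sketch shows one way that such a settings file can be loaded and merged with the DSL at compile time: the argument_defaults values become the default values of the pipeline's runtime arguments, and the compiler is invoked from the same script. The setting names, file path, and placeholder pipeline body are illustrative assumptions.

```python
# Minimal sketch: load a YAML settings file and merge it with the pipeline DSL at
# compile time. The setting names, file path, and placeholder step are assumptions.
import kfp.compiler
import yaml  # requires PyYAML
from kfp import dsl

with open('settings.yaml') as f:  # hypothetical settings file
    _settings = yaml.safe_load(f)

_defaults = _settings.get('argument_defaults', {})
_compiler_settings = _settings.get('compiler_settings', {})

@dsl.pipeline(name='clv-train-deploy', description='CLV training sketch')
def clv_train_deploy(
        dataset_location: str = _defaults.get('dataset_location', 'US'),
        automl_compute_region: str = _defaults.get('automl_compute_region', 'us-central1')):
    # Placeholder step; the real pipeline wires together the components described earlier.
    dsl.ContainerOp(name='echo-settings', image='alpine',
                    command=['echo', dataset_location])

if __name__ == '__main__':
    # Compiler settings (for example, a service account flag) would be applied here
    # before compiling the Python DSL into the deployable YAML package.
    kfp.compiler.Compiler().compile(clv_train_deploy, 'clv_train_deploy.yaml')
```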

Visualizing results of a pipeline run in the Kubeflow Pipelines dashboard

The Kubeflow Pipelines platform has built-in support for logging and visualizing artifacts that can help with tracking, understanding, evaluating, and comparing a series of runs. This support is especially important when you manage a potentially large number of runs during the experimentation and training phases of the ML lifecycle.

The Kubeflow Pipelines dashboard supports two types of visualizations:

  • Pipeline metrics. A pipeline metric is a scalar metric that is rendered as a visualization in the Runs page for a particular experiment in the Kubeflow Pipelines dashboard. The primary goal of a pipeline metric is to provide a quick look into the performance of a run and to compare multiple runs.
  • Output viewers. Output viewers are used to log and render more extensive information about a run. In the Kubeflow Pipelines dashboard, you can find the artifacts output by a given component in the Artifacts section of the task pane and the summary of all outputs in the Run Output pane for a run.

Currently, the Kubeflow Pipelines dashboard supports the following types of output viewers:

  • Confusion matrix
  • ROC curve
  • TensorBoard
  • Web app
  • Table
  • Markdown

The last three viewers are flexible because they let you capture and visualize arbitrary information.

A pipeline component can use Kubeflow Pipelines dashboard output viewers by writing metadata for the output viewers to a JSON file. The metadata must conform to the schema described in the Kubeflow Pipelines documentation.
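
For example, a component can surface a Markdown output viewer by writing metadata like the following to /mlpipeline-ui-metadata.json inside its container. The metric values in the sample Markdown table are made up.

```python
# Sketch: surface a Markdown output viewer by writing UI metadata to
# /mlpipeline-ui-metadata.json inside the component's container.
import json

metadata = {
    'outputs': [{
        'type': 'markdown',
        'storage': 'inline',
        'source': '# Model evaluation\n\n| Metric | Value |\n|---|---|\n| MAE | 123.4 |',
    }]
}

with open('/mlpipeline-ui-metadata.json', 'w') as f:
    json.dump(metadata, f)
```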

Most generic components distributed with Kubeflow Pipelines generate metadata that reflects their actions. We highly recommend that you instrument any custom components that you develop as part of a Kubeflow Pipelines solution to generate this metadata as well.

If you need to capture information that is not automatically generated by any of the prebuilt components that your pipeline uses, you can use the flexible infrastructure of lightweight Python components to output additional artifacts.

The AutoML Tables components demonstrate how to use Kubeflow Pipelines dashboard artifacts to track outcomes of AutoML Tables API calls.

The log_evaluation_metrics component retrieves the latest evaluation metrics of a given AutoML model and outputs those metrics as a Markdown artifact.

The component also logs the primary metric, which is specified as an input argument, as a pipeline metric.
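
For reference, a component logs a scalar pipeline metric in a similar way, by writing a file named /mlpipeline-metrics.json in the schema that the Kubeflow Pipelines dashboard reads for the Runs page. The metric name and value below are illustrative only.

```python
# Sketch: log a scalar pipeline metric by writing /mlpipeline-metrics.json in the
# schema that the Kubeflow Pipelines dashboard reads. The values are illustrative.
import json

metrics = {
    'metrics': [{
        'name': 'mean-absolute-error',  # displayed on the Runs page
        'numberValue': 123.4,
        'format': 'RAW',
    }]
}

with open('/mlpipeline-metrics.json', 'w') as f:
    json.dump(metrics, f)
```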

Build and deployment automation

Production pipelines can become complex. They might do the following:

  • Use various components from different sources. For example, they might use generic prebuilt Kubeflow Pipelines components and custom components developed for a given solution.
  • Use external assets and resources. For example, they might use SQL scripts and templates, PySpark scripts, or output viewer templates.
  • Require values for the runtime parameters or compiler settings (or both) that differ from the defaults, depending on the target runtime environment. For example, for some environments you might need to compile a pipeline to use a custom service account, while for other environments the pipeline might use the default Compute Engine service account, as shown in the sketch after this list.
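
The following sketch shows one way such a compiler setting can take effect: when a flag in the settings file is set, kfp.gcp.use_gcp_secret is applied to a pipeline step so that it authenticates with a user-managed service account key stored in a Kubernetes secret. The flag and helper names are assumptions.

```python
# Sketch: switch a pipeline step between the default Compute Engine identity and a
# user-managed service account, driven by a compiler setting. The use_sa_secret
# flag name is an assumption; kfp.gcp.use_gcp_secret mounts the Kubernetes secret
# that holds the service account key.
from kfp import dsl, gcp

def apply_identity(op: dsl.ContainerOp, use_sa_secret: bool) -> dsl.ContainerOp:
    if use_sa_secret:
        # Authenticate as the user-managed service account stored in the
        # 'user-gcp-sa' secret instead of the node's default identity.
        op.apply(gcp.use_gcp_secret('user-gcp-sa'))
    return op
```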

In all but the simplest cases, it's not feasible to manage the building and deployment of Kubeflow Pipelines solutions by using runbooks or manual step-by-step guides. Automating the build and deployment processes is critical. When combined with flexible configuration settings management, as described in the Managing pipeline settings section, build automation sets a foundation for robust, repeatable, and trackable configuration management of Kubeflow Pipelines solutions.

The code samples in the accompanying GitHub repository demonstrate one approach to automating the build process—using Cloud Build. In this scenario, Cloud Build performs the following steps:

  1. Builds the base image for the lightweight Python components.
  2. Builds the image that hosts AutoML Tables components.
  3. Deploys the images to your project's container registry.
  4. Compiles the pipelines.
  5. Deploys the compiled pipelines to Cloud Storage.
  6. Deploys the pipelines' artifacts to Cloud Storage.
  7. Copies the sample dataset used by the tutorial to Cloud Storage.

For a more detailed description of the build process, see the solution's GitHub repository.

What's next