Setting up an MLOps environment on Google Cloud

This reference guide outlines the architecture of a machine learning operations (MLOps) environment on Google Cloud. The guide accompanies hands-on labs in GitHub that walk you through the process of provisioning and configuring the environment described here.

Virtually all industries are adopting machine learning (ML) at a rapidly accelerating pace. A key challenge for getting value from ML is to create ways to deploy and operate ML systems effectively. This guide is intended for ML engineers and DevOps engineers. It assumes that you have a basic understanding of the following Google Cloud products and features:

  • Google Cloud projects
  • Cloud Storage
  • Virtual Private Cloud
  • Google Kubernetes Engine (GKE)
  • AI Platform
  • Dataflow
  • Cloud SQL
  • BigQuery

The guide also assumes that you know about TensorFlow and TensorFlow Extended (TFX), and that you want to leverage them for building production ML pipelines.

MLOps environment architecture overview

The MLOps environment is designed to provide the following capabilities:

  • An experimentation environment: a powerful, interactive environment that supports experimentation and collaborative development of ML pipelines.
  • A training service: a scalable service that can execute single-node and distributed training jobs in a variety of ML frameworks and that can utilize accelerated hardware.
  • A prediction service: a scalable service that enables high-performance model serving, both batch and online.
  • Distributed data processing: a scalable service that can manage large-scale data preprocessing jobs on both batch and streaming data.
  • ML pipelines: an orchestration service that's optimized for defining, executing, and monitoring ML workflows.
  • A metadata repository: a repository that tracks and analyzes metadata for ML.
  • An artifact store: a secure, highly durable, and scalable object store for managing ML artifacts.
  • A data warehouse: a data warehousing service that's integrated with ML services.
  • A container registry: a secure service to manage Docker images.
  • A CI/CD service: a flexible service for defining workflows for building, testing, and deploying workloads across multiple environments.
  • Source control: a source control system to store, manage, and track ML code.

The following diagram shows an example of the architecture that provides the MLOps capabilities described in this document. The environment includes a set of integrated Google Cloud services.

[Figure: An architecture featuring Google Cloud resources that implements an MLOps environment.]

The following table describes the core Google Cloud services that are used in this environment.

Service | Description
GitHub repository | In this environment, the GitHub repository is used as a source control repository. However, you can use any Git-based version control system.
Cloud Build | Cloud Build is a service that executes CI/CD routines on Google Cloud infrastructure. In this environment, Cloud Build is used to build, test, and deploy machine learning pipelines.
Vertex AI Workbench user-managed notebooks | Vertex AI Workbench enables you to create and manage user-managed notebooks, which are virtual machine (VM) instances that are pre-packaged with JupyterLab. This product has a preinstalled suite of Python and R deep-learning libraries. In this environment, JupyterLab notebooks are used for ML experimentation and development.
Google Kubernetes Engine | GKE is an enterprise-grade platform for containerized applications. In this environment, GKE is used to host Kubeflow Pipelines services.
TensorFlow Extended (TFX) | TFX is an integrated framework for implementing scalable, high-performance ML tasks and workflows. In this environment, TFX is used to implement the end-to-end ML training pipeline for TensorFlow workloads.
Kubeflow Pipelines | Kubeflow Pipelines is a Kubeflow service for composing and automating ML systems, as well as for providing a UI for managing experiments and runs. In this environment, Kubeflow Pipelines is used as an orchestrator for TFX.
Container Registry | Container Registry is a single place for a team to store and manage Docker images. In this environment, Container Registry is used to store the container images that are produced by the CI/CD routine of Cloud Build.
AI Platform Training | AI Platform Training is a serverless, fully managed service for training ML models at scale. The service can take advantage of cloud accelerators (GPUs and TPUs). In this environment, AI Platform Training is used for training ML models in production during continuous training and during experimentation.
AI Platform Prediction | AI Platform Prediction is a serverless, autoscaling service to host ML models as APIs. In this environment, AI Platform Prediction is used to serve trained models for online inference.
Dataflow | Dataflow is a serverless, scalable service for distributed data processing, both for batch and streaming workloads. In this environment, Dataflow is used for data extraction, validation, transformation, and model evaluation tasks in the ML pipeline.
Cloud Storage | Cloud Storage is a simple, reliable, and highly-durable object store. In this environment, Cloud Storage is used as an artifact store, where the outputs of ML pipeline steps are saved.
Cloud SQL | Cloud SQL is a fully managed relational database service. In this environment, it's used as a backend for the ML Metadata service and for the Kubeflow Pipelines metadata.
BigQuery | BigQuery is a serverless, cost-effective, petabyte-scale cloud data warehouse. In this environment, BigQuery is used as the source for recurrent ML training and evaluation data.

Key architectural decisions

MLOps products and technologies are evolving and improving at a fast pace. To streamline the design and to support incremental modifications as new products and services become available, the following architectural decisions have driven the MLOps environment architecture that's described in this guide:

  • Use managed services if available.
  • Use a standalone deployment of Kubeflow Pipelines.
  • Use Docker images to provide consistent runtimes for experimentation, training, and serving.

The following sections provide details about these design decisions.

Using fully managed Google Cloud services

In this MLOps environment, a fully managed service is used to implement a capability whenever one is available. In fact, with the exception of ML pipelines and ML Metadata, all the capabilities for this environment are provided through fully managed services.

Because ML pipelines and ML Metadata are not yet available as fully managed services, they are implemented in this design by using an open source standalone deployment of Kubeflow Pipelines on GKE.
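
The split between managed and self-hosted capabilities can be written down as a small inventory. The following is only an illustrative sketch: the capability names and services are taken from this guide, while the `Capability` class, the `fully_managed` flag, and the `self_hosted` helper are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    """One MLOps capability and the service that provides it (hypothetical type)."""
    name: str
    provided_by: str
    fully_managed: bool


# Capability-to-service mapping as described in this guide.
CAPABILITIES = [
    Capability("experimentation environment", "Vertex AI Workbench user-managed notebooks", True),
    Capability("training service", "AI Platform Training", True),
    Capability("prediction service", "AI Platform Prediction", True),
    Capability("distributed data processing", "Dataflow", True),
    Capability("ML pipelines", "Kubeflow Pipelines on GKE", False),
    Capability("metadata repository", "ML Metadata on GKE", False),
    Capability("artifact store", "Cloud Storage", True),
    Capability("data warehouse", "BigQuery", True),
    Capability("container registry", "Container Registry", True),
    Capability("CI/CD service", "Cloud Build", True),
    Capability("source control", "GitHub repository", True),
]


def self_hosted(capabilities):
    """Return the names of capabilities that are not provided by a fully managed service."""
    return [c.name for c in capabilities if not c.fully_managed]
```

Calling `self_hosted(CAPABILITIES)` returns only the two capabilities that this design runs on its own GKE cluster, which makes it easy to see what would change if managed equivalents were adopted later.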

Using a standalone deployment of Kubeflow Pipelines

ML pipelines and ML Metadata are core capabilities of an MLOps environment. In the environment described in this reference guide, they're provided by Kubeflow Pipelines. Kubeflow Pipelines is a core component of Kubeflow. As an alternative to deploying Kubeflow as a whole, you can deploy only Kubeflow Pipelines services. This type of deployment is referred to as a standalone deployment of Kubeflow Pipelines.

The following diagram shows a simplified view of the standalone deployment of Kubeflow Pipelines in the environment.

[Figure: Standalone deployment of Kubeflow Pipelines in the example environment.]

In this environment, the Kubeflow Pipelines and ML Metadata services are deployed to a dedicated GKE cluster. The cluster is installed in a dedicated Virtual Private Cloud (VPC) network. The Kubeflow Pipelines and ML Metadata services are configured to use Cloud SQL to store ML metadata and Cloud Storage to manage ML artifacts. The services use Cloud SQL Proxy to access Cloud SQL. The web-based UI and REST APIs for Kubeflow Pipelines are exposed through a public URL using an inverting proxy and agent.
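
The storage wiring described above might be captured in a small configuration object like the following. This is a sketch under stated assumptions, not the configuration format used by the hands-on labs: the `StandaloneKfpConfig` class and its field names are hypothetical, and the proxy address defaults (`127.0.0.1`, port `3306`) assume a MySQL-backed Cloud SQL instance reached through a locally listening Cloud SQL Proxy, which this guide does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandaloneKfpConfig:
    """Hypothetical description of a standalone Kubeflow Pipelines deployment."""
    gke_cluster: str
    vpc_network: str
    artifact_bucket: str          # Cloud Storage bucket name, without the gs:// prefix
    sql_connection_name: str      # Cloud SQL instance, usually PROJECT:REGION:INSTANCE
    proxy_host: str = "127.0.0.1"  # assumption: proxy listens locally
    proxy_port: int = 3306         # assumption: MySQL default port

    def __post_init__(self):
        # Require at least three non-empty, colon-separated parts in the connection name.
        parts = self.sql_connection_name.split(":")
        if len(parts) < 3 or not all(parts):
            raise ValueError(
                f"expected PROJECT:REGION:INSTANCE, got {self.sql_connection_name!r}"
            )
        if self.artifact_bucket.startswith("gs://"):
            raise ValueError("pass the bucket name without the gs:// prefix")

    def artifact_root(self, pipeline_name: str) -> str:
        """Cloud Storage location where a pipeline's artifacts would be written."""
        return f"gs://{self.artifact_bucket}/{pipeline_name}"

    def metadata_endpoint(self) -> str:
        """Address that services would use to reach Cloud SQL through the proxy."""
        return f"{self.proxy_host}:{self.proxy_port}"
```

Keeping the artifact location and the metadata endpoint in one object reflects the division of labor in this design: artifacts go to Cloud Storage, while metadata goes to Cloud SQL and is only reached through the proxy.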

The hands-on labs that accompany this reference guide provide detailed instructions for provisioning and deploying the environment's services, including the standalone deployment of Kubeflow Pipelines.

Using Docker images to maintain consistent runtimes

In most ML environments, maintaining consistent runtimes is critical. For example, the versions of Python packages used in an experimentation environment must be compatible with (or the same as) the versions that are used in continuous-training or model-serving environments.
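
To make the idea of runtime drift concrete, the following sketch compares exact version pins from two environments and reports the packages whose versions differ. It is an illustration only: the `parse_pins` and `version_drift` helpers are hypothetical, the input is assumed to be `name==version` lines such as those in a requirements file, and strict equality is a stricter test than the compatibility that the text above requires.

```python
def parse_pins(requirements: str) -> dict:
    """Parse 'name==version' lines into a {name: version} dict.

    Blank lines and '#' comments are ignored. Lines that are not exact
    pins raise ValueError, because they cannot be compared reliably.
    """
    pins = {}
    for line in requirements.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        if "==" not in line:
            raise ValueError(f"not an exact pin: {line!r}")
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def version_drift(env_a: dict, env_b: dict) -> dict:
    """Return {package: (version_a, version_b)} for shared packages whose pins differ."""
    return {
        name: (env_a[name], env_b[name])
        for name in sorted(env_a.keys() & env_b.keys())
        if env_a[name] != env_b[name]
    }
```

For example, if an experimentation environment pins one version of a package and the training environment pins another, `version_drift` returns that package with both versions, and returns an empty dict when the two environments agree.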

The approach taken in this reference guide is to use Docker images to maintain consistent runtimes between Vertex AI Workbench user-managed notebooks instances, AI Platform Training, and TFX on Kubeflow Pipelines. For implementing a given ML workflow, the environment has a base container image that includes all the dependencies that are required by code for data preprocessing, training, and serving. Whenever possible, the services that are used to develop and operationalize the workflow are configured to use derivatives of the base image.

For example, Vertex AI Workbench user-managed notebooks instances used for experimentation are provisioned using custom Docker images derived from the base image. In addition, AI Platform Training jobs used for continuous training are run using custom container images that are also derived from the base image.
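
The base-image pattern can be sketched as generating a Dockerfile for each service from one shared base. This is a minimal illustration, not the build process used in the hands-on labs: the `derived_dockerfile` and `shares_base` helpers are hypothetical, and the check assumes that each Dockerfile starts with a single `FROM` instruction (no leading comments, `ARG` lines, or multi-stage builds).

```python
def derived_dockerfile(base_image: str, extra_pip_packages=()) -> str:
    """Return Dockerfile text for an image derived from base_image.

    Service-specific Python packages, if any, are layered on top of the
    dependencies that the base image already provides.
    """
    lines = [f"FROM {base_image}"]
    if extra_pip_packages:
        lines.append("RUN pip install --no-cache-dir " + " ".join(extra_pip_packages))
    return "\n".join(lines) + "\n"


def shares_base(dockerfiles, base_image: str) -> bool:
    """Return True if every Dockerfile's first line is 'FROM <base_image>'."""
    expected = f"FROM {base_image}"
    for text in dockerfiles:
        lines = text.splitlines()
        if not lines or lines[0].strip() != expected:
            return False
    return True
```

With a hypothetical base image such as `gcr.io/my-project/mlops-base:1.0`, the notebook image and the training image would each be generated by `derived_dockerfile`, and `shares_base` could run in CI to confirm that neither has drifted to a different base.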

The hands-on labs that accompany this reference guide include instructions for how to configure and build an example of the base image and how to use it to provision a user-managed notebooks instance.

What's next